2014-10-09 16:35:10

by Stephane Eranian

Subject: [PATCH v2 0/12] perf/x86: implement HT leak workaround for SNB/IVB/HSW

This patch series addresses a serious known erratum in the PMU
of Intel SandyBridge, IvyBridge and Haswell processors with
hyper-threading enabled.

The erratum is documented in the Intel specification update documents
for each processor under the errata listed below:
- SandyBridge: BJ122
- IvyBridge: BV98
- Haswell: HSD29

The bug causes silent counter corruption across hyperthreads only
when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3).
Counters measuring those events may leak counts to the sibling
counter. For instance, counter 0, thread 0 measuring event 0xd0,
may leak to counter 0, thread 1, regardless of the event measured
there. The size of the leak is not predictable. It all depends on
the workload and the state of each sibling hyper-thread. The
corrupting events undercount as a consequence of the leak. The
leak is compensated automatically only when the sibling counter
measures the exact same corrupting event AND the workload on the
two threads is the same. Given that there is no way to guarantee
this, a workaround is necessary. Furthermore, there is a serious
problem if the leaked count is added to a low-occurrence event.
In that case the corruption of the low-occurrence event can be
very large, e.g., orders of magnitude.

There is no HW or FW workaround for this problem.

The bug is very easy to reproduce on a loaded system.
Here is an example on a Haswell client, where CPU0 and CPU4
are siblings. We load the CPUs with a simple triad app
streaming a large floating-point vector. We use the corrupting
event 0x81d0 (MEM_UOPS_RETIRED:ALL_LOADS) and
0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given that we are not
using the LBR, the 0x20cc event should be zero.

$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -e r20cc sleep 10
Performance counter stats for 'system wide':
139 277 291 r20cc
10,000969126 seconds time elapsed

In this example, 0x81d0 and 0x20cc are using sibling counters (the
same counter index) on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and
corrupts it from 0 to 139 million occurrences.

This patch provides a software workaround to this problem by modifying the
way events are scheduled onto counters by the kernel. The patch forces
cross-thread mutual exclusion between counters in case a corrupting event
is measured by one of the hyper-threads. If thread 0, counter 0 is measuring
event 0xd0, then nothing can be measured on counter 0, thread 1. If no corrupting
event is measured on any hyper-thread, event scheduling proceeds as before.

The same example, run with the workaround enabled, yields the correct answer:
$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -e r20cc sleep 10
Performance counter stats for 'system wide':
0 r20cc
10,000969126 seconds time elapsed

The patch provides correctness for all non-corrupting events. It does not
"repatriate" the leaked counts back to the leaking counter. This is planned
for a later patch series. This patch series, however, makes the repatriation
easier by guaranteeing that the sibling counter is not measuring any useful event.

The patch introduces dynamic constraints for events. That means that events which
did not have constraints, i.e., could be measured on any counter, may now be
constrained to a subset of the counters depending on what is going on on the
sibling thread. The algorithm is similar to a cache coherency protocol. We call
it XSU, in reference to Exclusive, Shared, Unused, the 3 possible states of a
PMU counter.
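
To make the XSU idea concrete, here is a minimal user-space sketch of the
filtering rule (illustrative only; names such as xsu_state and
filter_counters are invented for this example and are not the kernel code):

#include <stdio.h>
#include <stdint.h>

/* the three XSU states a sibling counter can be in */
enum xsu_state { XSU_UNUSED, XSU_SHARED, XSU_EXCLUSIVE };

/*
 * Drop from 'mask' every counter the sibling thread makes unusable:
 * counters running an exclusive (corrupting) event are always off
 * limits, and if our own event is corrupting, counters the sibling
 * uses for ordinary (shared) events are off limits too.
 */
static uint64_t filter_counters(uint64_t mask, const enum xsu_state sib[],
                                int ncounters, int we_are_exclusive)
{
    for (int i = 0; i < ncounters; i++) {
        if (!(mask & (1ULL << i)))
            continue;
        if (sib[i] == XSU_EXCLUSIVE)
            mask &= ~(1ULL << i);   /* sibling owns it exclusively */
        else if (we_are_exclusive && sib[i] == XSU_SHARED)
            mask &= ~(1ULL << i);   /* we would corrupt the sibling */
    }
    return mask;
}

int main(void)
{
    enum xsu_state sib[4] = { XSU_EXCLUSIVE, XSU_SHARED, XSU_UNUSED, XSU_UNUSED };

    /* a corrupting (0xd0-style) event starts with counters 0-3 allowed */
    printf("exclusive event mask: 0x%llx\n",
           (unsigned long long)filter_counters(0xf, sib, 4, 1)); /* -> 0xc */
    /* a normal event only loses the counter held exclusively */
    printf("shared event mask:    0x%llx\n",
           (unsigned long long)filter_counters(0xf, sib, 4, 0)); /* -> 0xe */
    return 0;
}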

As a consequence of the workaround, users may see an increased amount of event
multiplexing, even in situations where fewer events than counters are measured
on a CPU.

The patches have been tested on all three impacted processors. If HT is off,
the workaround is disabled.

The series covers precise sampling with the corrupting events as well.

In V2, we rebased to 3.17 and addressed all the feedback received
on LKML for V1. In particular, we now automatically detect whether HT is
disabled, and if so we disable the workaround altogether. This is done
as a 2-step initcall as suggested by PeterZ. We also fixed the spinlock
irq masking. We pre-allocate the dynamic constraints in the per-cpu cpuc
struct. We have isolated the sysfs control to make it optional.

Special thanks to Maria for working out a solution to this complex problem.

Maria Dimakopoulou (6):
perf/x86: add 3 new scheduling callbacks
perf/x86: add cross-HT counter exclusion infrastructure
perf/x86: implement cross-HT corruption bug workaround
perf/x86: enforce HT bug workaround for SNB/IVB/HSW
perf/x86: enforce HT bug workaround with PEBS for SNB/IVB/HSW
perf/x86: add sysfs entry to disable HT bug workaround

Stephane Eranian (6):
perf,x86: rename er_flags to flags
perf/x86: vectorize cpuc->kfree_on_online
perf/x86: add index param to get_event_constraint() callback
perf/x86: fix intel_get_event_constraints() for dynamic constraints
watchdog: add watchdog enable/disable all functions
perf/x86: make HT bug workaround conditioned on HT enabled

arch/x86/kernel/cpu/perf_event.c | 94 +++++-
arch/x86/kernel/cpu/perf_event.h | 93 +++++-
arch/x86/kernel/cpu/perf_event_amd.c | 9 +-
arch/x86/kernel/cpu/perf_event_intel.c | 530 ++++++++++++++++++++++++++++--
arch/x86/kernel/cpu/perf_event_intel_ds.c | 28 +-
include/linux/watchdog.h | 3 +
kernel/watchdog.c | 28 ++
7 files changed, 717 insertions(+), 68 deletions(-)

--
1.9.1


2014-10-09 16:35:23

by Stephane Eranian

Subject: [PATCH v2 01/12] perf,x86: rename er_flags to flags

Rename er_flags to flags because the field will be used for more than
just tracking the presence of extra registers.

Reviewed-by: Maria Dimakopoulou <[email protected]>
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 9 ++++++---
arch/x86/kernel/cpu/perf_event_intel.c | 24 ++++++++++++------------
2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d98a34d..8af5010 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -502,7 +502,7 @@ struct x86_pmu {
* Extra registers for events
*/
struct extra_reg *extra_regs;
- unsigned int er_flags;
+ unsigned int flags;

/*
* Intel host/guest support (KVM)
@@ -519,8 +519,11 @@ do { \
x86_pmu.quirks = &__quirk; \
} while (0)

-#define ERF_NO_HT_SHARING 1
-#define ERF_HAS_RSP_1 2
+/*
+ * x86_pmu flags
+ */
+#define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
+#define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */

#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 3851def..9f79da7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1593,7 +1593,7 @@ intel_bts_constraints(struct perf_event *event)

static int intel_alt_er(int idx)
{
- if (!(x86_pmu.er_flags & ERF_HAS_RSP_1))
+ if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
return idx;

if (idx == EXTRA_REG_RSP_0)
@@ -2157,7 +2157,7 @@ static void intel_pmu_cpu_starting(int cpu)
if (!cpuc->shared_regs)
return;

- if (!(x86_pmu.er_flags & ERF_NO_HT_SHARING)) {
+ if (!(x86_pmu.flags & PMU_FL_NO_HT_SHARING)) {
for_each_cpu(i, topology_thread_cpumask(cpu)) {
struct intel_shared_regs *pc;

@@ -2589,7 +2589,7 @@ __init int intel_pmu_init(void)
x86_pmu.event_constraints = intel_slm_event_constraints;
x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints;
x86_pmu.extra_regs = intel_slm_extra_regs;
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
pr_cont("Silvermont events, ");
break;

@@ -2607,7 +2607,7 @@ __init int intel_pmu_init(void)
x86_pmu.enable_all = intel_pmu_nhm_enable_all;
x86_pmu.pebs_constraints = intel_westmere_pebs_event_constraints;
x86_pmu.extra_regs = intel_westmere_extra_regs;
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;

x86_pmu.cpu_events = nhm_events_attrs;

@@ -2639,8 +2639,8 @@ __init int intel_pmu_init(void)
else
x86_pmu.extra_regs = intel_snb_extra_regs;
/* all extra regs are per-cpu when HT is on */
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
- x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_NO_HT_SHARING;

x86_pmu.cpu_events = snb_events_attrs;

@@ -2674,8 +2674,8 @@ __init int intel_pmu_init(void)
else
x86_pmu.extra_regs = intel_snb_extra_regs;
/* all extra regs are per-cpu when HT is on */
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
- x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_NO_HT_SHARING;

x86_pmu.cpu_events = snb_events_attrs;

@@ -2702,8 +2702,8 @@ __init int intel_pmu_init(void)
x86_pmu.extra_regs = intel_snbep_extra_regs;
x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
/* all extra regs are per-cpu when HT is on */
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
- x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_NO_HT_SHARING;

x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
@@ -2724,8 +2724,8 @@ __init int intel_pmu_init(void)
x86_pmu.extra_regs = intel_snbep_extra_regs;
x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
/* all extra regs are per-cpu when HT is on */
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
- x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_NO_HT_SHARING;

x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
--
1.9.1

2014-10-09 16:35:33

by Stephane Eranian

Subject: [PATCH v2 02/12] perf/x86: vectorize cpuc->kfree_on_online

Make cpuc->kfree_on_online a vector to accommodate
more than one entry, and add a second entry to be
used by a later patch.

Reviewed-by: Maria Dimakopoulou <[email protected]>
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 10 +++++++---
arch/x86/kernel/cpu/perf_event.h | 8 +++++++-
arch/x86/kernel/cpu/perf_event_amd.c | 3 ++-
arch/x86/kernel/cpu/perf_event_intel.c | 4 +++-
4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 16c7302..542435f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1325,11 +1325,12 @@ x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
{
unsigned int cpu = (long)hcpu;
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
- int ret = NOTIFY_OK;
+ int i, ret = NOTIFY_OK;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_UP_PREPARE:
- cpuc->kfree_on_online = NULL;
+ for (i = 0 ; i < X86_PERF_KFREE_MAX; i++)
+ cpuc->kfree_on_online[i] = NULL;
if (x86_pmu.cpu_prepare)
ret = x86_pmu.cpu_prepare(cpu);
break;
@@ -1342,7 +1343,10 @@ x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
break;

case CPU_ONLINE:
- kfree(cpuc->kfree_on_online);
+ for (i = 0 ; i < X86_PERF_KFREE_MAX; i++) {
+ kfree(cpuc->kfree_on_online[i]);
+ cpuc->kfree_on_online[i] = NULL;
+ }
break;

case CPU_DYING:
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 8af5010..4c5a00c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -123,6 +123,12 @@ struct intel_shared_regs {

#define MAX_LBR_ENTRIES 16

+enum {
+ X86_PERF_KFREE_SHARED = 0,
+ X86_PERF_KFREE_EXCL = 1,
+ X86_PERF_KFREE_MAX
+};
+
struct cpu_hw_events {
/*
* Generic x86 PMC bits
@@ -185,7 +191,7 @@ struct cpu_hw_events {
/* Inverted mask of bits to clear in the perf_ctr ctrl registers */
u64 perf_ctr_virt_mask;

- void *kfree_on_online;
+ void *kfree_on_online[X86_PERF_KFREE_MAX];
};

#define __EVENT_CONSTRAINT(c, n, m, w, o, f) {\
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index beeb7cc..a8d1a43 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -382,6 +382,7 @@ static int amd_pmu_cpu_prepare(int cpu)
static void amd_pmu_cpu_starting(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
+ void **onln = &cpuc->kfree_on_online[X86_PERF_KFREE_SHARED];
struct amd_nb *nb;
int i, nb_id;

@@ -399,7 +400,7 @@ static void amd_pmu_cpu_starting(int cpu)
continue;

if (nb->nb_id == nb_id) {
- cpuc->kfree_on_online = cpuc->amd_nb;
+ *onln = cpuc->amd_nb;
cpuc->amd_nb = nb;
break;
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 9f79da7..4b6a3b0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2158,12 +2158,14 @@ static void intel_pmu_cpu_starting(int cpu)
return;

if (!(x86_pmu.flags & PMU_FL_NO_HT_SHARING)) {
+ void **onln = &cpuc->kfree_on_online[X86_PERF_KFREE_SHARED];
+
for_each_cpu(i, topology_thread_cpumask(cpu)) {
struct intel_shared_regs *pc;

pc = per_cpu(cpu_hw_events, i).shared_regs;
if (pc && pc->core_id == core_id) {
- cpuc->kfree_on_online = cpuc->shared_regs;
+ *onln = cpuc->shared_regs;
cpuc->shared_regs = pc;
break;
}
--
1.9.1

2014-10-09 16:35:44

by Stephane Eranian

Subject: [PATCH v2 03/12] perf/x86: add 3 new scheduling callbacks

From: Maria Dimakopoulou <[email protected]>

This patch adds 3 new PMU model specific callbacks
during the event scheduling done by x86_schedule_events().

- start_scheduling: invoked when entering the schedule routine
- stop_scheduling: invoked at the end of the schedule routine
- commit_scheduling: invoked for each committed event

To be used optionally by model-specific code.
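
As a rough illustration of how such hooks are meant to be wired up and where
the scheduler invokes them, here is a self-contained user-space sketch (the
pmu_sched_ops struct and demo_* functions are invented for this example; the
real callbacks live in struct x86_pmu and take a struct perf_event):

#include <stdio.h>

/* opaque stand-ins for the real kernel structures */
struct cpu_hw_events;
struct perf_event;

/* the three optional hooks a PMU model may provide */
struct pmu_sched_ops {
    void (*start_scheduling)(struct cpu_hw_events *cpuc);
    void (*commit_scheduling)(struct cpu_hw_events *cpuc,
                              struct perf_event *event, int cntr);
    void (*stop_scheduling)(struct cpu_hw_events *cpuc);
};

static void demo_start(struct cpu_hw_events *cpuc)
{
    printf("start of scheduling pass\n");
}

static void demo_commit(struct cpu_hw_events *cpuc,
                        struct perf_event *event, int cntr)
{
    printf("event committed on counter %d\n", cntr);
}

static void demo_stop(struct cpu_hw_events *cpuc)
{
    printf("end of scheduling pass\n");
}

/* skeleton of how x86_schedule_events() brackets a successful assignment */
static void schedule_events(const struct pmu_sched_ops *ops, int nevents)
{
    if (ops->start_scheduling)
        ops->start_scheduling(NULL);

    for (int i = 0; i < nevents; i++)   /* pretend event i landed on counter i */
        if (ops->commit_scheduling)
            ops->commit_scheduling(NULL, NULL, i);

    if (ops->stop_scheduling)
        ops->stop_scheduling(NULL);
}

int main(void)
{
    const struct pmu_sched_ops ops = {
        .start_scheduling  = demo_start,
        .commit_scheduling = demo_commit,
        .stop_scheduling   = demo_stop,
    };

    schedule_events(&ops, 2);
    return 0;
}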

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 9 +++++++++
arch/x86/kernel/cpu/perf_event.h | 9 +++++++++
2 files changed, 18 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 542435f..ca44ba2 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -734,6 +734,9 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)

bitmap_zero(used_mask, X86_PMC_IDX_MAX);

+ if (x86_pmu.start_scheduling)
+ x86_pmu.start_scheduling(cpuc);
+
for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
hwc = &cpuc->event_list[i]->hw;
c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
@@ -780,6 +783,8 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
for (i = 0; i < n; i++) {
e = cpuc->event_list[i];
e->hw.flags |= PERF_X86_EVENT_COMMITTED;
+ if (x86_pmu.commit_scheduling)
+ x86_pmu.commit_scheduling(cpuc, e, assign[i]);
}
}
/*
@@ -800,6 +805,10 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
x86_pmu.put_event_constraints(cpuc, e);
}
}
+
+ if (x86_pmu.stop_scheduling)
+ x86_pmu.stop_scheduling(cpuc);
+
return num ? -EINVAL : 0;
}

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 4c5a00c..b6848c3 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -447,6 +447,15 @@ struct x86_pmu {

void (*put_event_constraints)(struct cpu_hw_events *cpuc,
struct perf_event *event);
+
+ void (*commit_scheduling)(struct cpu_hw_events *cpuc,
+ struct perf_event *event,
+ int cntr);
+
+ void (*start_scheduling)(struct cpu_hw_events *cpuc);
+
+ void (*stop_scheduling)(struct cpu_hw_events *cpuc);
+
struct event_constraint *event_constraints;
struct x86_pmu_quirk *quirks;
int perfctr_second_write;
--
1.9.1

2014-10-09 16:35:58

by Stephane Eranian

Subject: [PATCH v2 05/12] perf/x86: add cross-HT counter exclusion infrastructure

From: Maria Dimakopoulou <[email protected]>

This patch adds a new shared_regs style structure to the
per-cpu x86 state (cpuc). It is used to coordinate access
between counters which must be used with exclusion across
hyper-threads on Intel processors. This new struct is not
needed on every PMU, thus it is allocated on demand.
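
The intended lifecycle (each CPU first allocates its own copy, then the second
hyper-thread of a core discovers and adopts its sibling's copy) can be
sketched in plain user-space C as follows. The structure and function names
are simplified stand-ins, not the kernel code; in the kernel the discarded
copy is freed later via kfree_on_online:

#include <stdio.h>
#include <stdlib.h>

/* simplified model of the per-core exclusion structure */
struct excl_cntrs {
    int core_id;
    int refcnt;               /* number of hyper-threads attached */
};

struct cpu_state {
    struct excl_cntrs *excl;
    int excl_thread_id;       /* 0 or 1 */
};

#define NCPUS 8
static struct cpu_state cpu[NCPUS];

/* cpu_prepare step: every CPU allocates its own copy first */
static void prepare(int c)
{
    cpu[c].excl = calloc(1, sizeof(*cpu[c].excl));
    cpu[c].excl->core_id = -1;
}

/* cpu_starting step: siblings discover each other and share one instance */
static void starting(int c, int core_id, int sibling)
{
    struct excl_cntrs *shared = cpu[sibling].excl;

    if (shared && shared->core_id == core_id) {
        free(cpu[c].excl);    /* the kernel defers this via kfree_on_online */
        cpu[c].excl = shared;
        cpu[c].excl_thread_id = 1;
    }
    cpu[c].excl->core_id = core_id;
    cpu[c].excl->refcnt++;
}

int main(void)
{
    /* pretend CPU0 and CPU4 are the two hyper-threads of core 0 */
    prepare(0);
    prepare(4);
    starting(0, 0, 4);    /* first thread of the core keeps its own copy */
    starting(4, 0, 0);    /* second thread adopts CPU0's structure */

    printf("shared: %s, refcnt=%d, thread ids=%d/%d\n",
           cpu[0].excl == cpu[4].excl ? "yes" : "no",
           cpu[0].excl->refcnt, cpu[0].excl_thread_id, cpu[4].excl_thread_id);
    return 0;
}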

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 32 +++++++++++++++
arch/x86/kernel/cpu/perf_event_intel.c | 71 +++++++++++++++++++++++++++++++---
2 files changed, 98 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 667a185..c985270 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -71,6 +71,7 @@ struct event_constraint {
#define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
#define PERF_X86_EVENT_PEBS_LD_HSW 0x10 /* haswell style datala, load */
#define PERF_X86_EVENT_PEBS_NA_HSW 0x20 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL 0x40 /* HT exclusivity on counter */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -121,6 +122,26 @@ struct intel_shared_regs {
unsigned core_id; /* per-core: core id */
};

+enum intel_excl_state_type {
+ INTEL_EXCL_UNUSED = 0, /* counter is unused */
+ INTEL_EXCL_SHARED = 1, /* counter can be used by both threads */
+ INTEL_EXCL_EXCLUSIVE = 2, /* counter can be used by one thread only */
+};
+
+struct intel_excl_states {
+ enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
+ enum intel_excl_state_type state[X86_PMC_IDX_MAX];
+};
+
+struct intel_excl_cntrs {
+ spinlock_t lock;
+
+ struct intel_excl_states states[2];
+
+ int refcnt; /* per-core: #HT threads */
+ unsigned core_id; /* per-core: core id */
+};
+
#define MAX_LBR_ENTRIES 16

enum {
@@ -183,6 +204,12 @@ struct cpu_hw_events {
* used on Intel NHM/WSM/SNB
*/
struct intel_shared_regs *shared_regs;
+ /*
+ * manage exclusive counter access between hyperthread
+ */
+ struct event_constraint *constraint_list; /* in enable order */
+ struct intel_excl_cntrs *excl_cntrs;
+ int excl_thread_id; /* 0 or 1 */

/*
* AMD specific bits
@@ -206,6 +233,10 @@ struct cpu_hw_events {
#define EVENT_CONSTRAINT(c, n, m) \
__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)

+#define INTEL_EXCLEVT_CONSTRAINT(c, n) \
+ __EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT, HWEIGHT(n),\
+ 0, PERF_X86_EVENT_EXCL)
+
/*
* The overlap flag marks event constraints with overlapping counter
* masks. This is the case if the counter mask of such an event is not
@@ -540,6 +571,7 @@ do { \
*/
#define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
#define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */
+#define PMU_FL_EXCL_CNTRS 0x4 /* has exclusive counter requirements */

#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 64bbb57..0eece24 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2131,16 +2131,52 @@ struct intel_shared_regs *allocate_shared_regs(int cpu)
return regs;
}

+struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
+{
+ struct intel_excl_cntrs *c;
+ int i;
+
+ c = kzalloc_node(sizeof(struct intel_excl_cntrs),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (c) {
+ spin_lock_init(&c->lock);
+ for (i = 0; i < X86_PMC_IDX_MAX; i++) {
+ c->states[0].state[i] = INTEL_EXCL_UNUSED;
+ c->states[0].init_state[i] = INTEL_EXCL_UNUSED;
+
+ c->states[1].state[i] = INTEL_EXCL_UNUSED;
+ c->states[1].init_state[i] = INTEL_EXCL_UNUSED;
+ }
+ c->core_id = -1;
+ }
+ return c;
+}
+
static int intel_pmu_cpu_prepare(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);

- if (!(x86_pmu.extra_regs || x86_pmu.lbr_sel_map))
- return NOTIFY_OK;
+ if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
+ cpuc->shared_regs = allocate_shared_regs(cpu);
+ if (!cpuc->shared_regs)
+ return NOTIFY_BAD;
+ }

- cpuc->shared_regs = allocate_shared_regs(cpu);
- if (!cpuc->shared_regs)
- return NOTIFY_BAD;
+ if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
+ size_t sz = X86_PMC_IDX_MAX * sizeof(struct event_constraint);
+
+ cpuc->constraint_list = kzalloc(sz, GFP_KERNEL);
+ if (!cpuc->constraint_list)
+ return NOTIFY_BAD;
+
+ cpuc->excl_cntrs = allocate_excl_cntrs(cpu);
+ if (!cpuc->excl_cntrs) {
+ kfree(cpuc->constraint_list);
+ kfree(cpuc->shared_regs);
+ return NOTIFY_BAD;
+ }
+ cpuc->excl_thread_id = 0;
+ }

return NOTIFY_OK;
}
@@ -2181,12 +2217,29 @@ static void intel_pmu_cpu_starting(int cpu)

if (x86_pmu.lbr_sel_map)
cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
+
+ if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
+ for_each_cpu(i, topology_thread_cpumask(cpu)) {
+ struct intel_excl_cntrs *c;
+
+ c = per_cpu(cpu_hw_events, i).excl_cntrs;
+ if (c && c->core_id == core_id) {
+ cpuc->kfree_on_online[1] = cpuc->excl_cntrs;
+ cpuc->excl_cntrs = c;
+ cpuc->excl_thread_id = 1;
+ break;
+ }
+ }
+ cpuc->excl_cntrs->core_id = core_id;
+ cpuc->excl_cntrs->refcnt++;
+ }
}

static void intel_pmu_cpu_dying(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
struct intel_shared_regs *pc;
+ struct intel_excl_cntrs *c;

pc = cpuc->shared_regs;
if (pc) {
@@ -2194,6 +2247,14 @@ static void intel_pmu_cpu_dying(int cpu)
kfree(pc);
cpuc->shared_regs = NULL;
}
+ c = cpuc->excl_cntrs;
+ if (c) {
+ if (c->core_id == -1 || --c->refcnt == 0)
+ kfree(c);
+ cpuc->excl_cntrs = NULL;
+ kfree(cpuc->constraint_list);
+ cpuc->constraint_list = NULL;
+ }

fini_debug_store_on_cpu(cpu);
}
--
1.9.1

2014-10-09 16:36:14

by Stephane Eranian

Subject: [PATCH v2 07/12] perf/x86: enforce HT bug workaround for SNB/IVB/HSW

From: Maria Dimakopoulou <[email protected]>

This patch activates the HT bug workaround for the
SNB/IVB/HSW processors. This covers non-PEBS mode.
Activation is done through the constraint tables.

Both client and server processors need this workaround.
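
To illustrate what each new table entry encodes, here is a small user-space
sketch that decodes the (event, counter mask, exclusive flag) triple expressed
by INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf) and friends (the struct and helper
below are invented for the example, not the kernel types):

#include <stdio.h>
#include <stdint.h>

/* simplified view of what one INTEL_EXCLEVT_CONSTRAINT() entry encodes */
struct constraint {
    unsigned int event_code;  /* matched against the event-select field */
    uint64_t counter_mask;    /* which generic counters the event may use */
    int excl;                 /* PERF_X86_EVENT_EXCL: needs cross-HT exclusion */
};

static void describe(const struct constraint *c)
{
    printf("event 0x%02x: counters", c->event_code);
    for (int i = 0; i < 64; i++)
        if (c->counter_mask & (1ULL << i))
            printf(" %d", i);
    printf("%s\n", c->excl ? ", cross-HT exclusive" : "");
}

int main(void)
{
    /* the new workaround entries: corrupting events, counters 0-3, exclusive */
    const struct constraint tbl[] = {
        { 0xd0, 0xf, 1 },   /* MEM_UOPS_RETIRED.*               */
        { 0xd1, 0xf, 1 },   /* MEM_LOAD_UOPS_RETIRED.*          */
        { 0xd2, 0xf, 1 },   /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.*  */
        { 0xd3, 0xf, 1 },   /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
    };

    for (unsigned int i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++)
        describe(&tbl[i]);
    return 0;
}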

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 53 ++++++++++++++++++++++++++++------
1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index d7df7b3..041dec7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -113,6 +113,12 @@ static struct event_constraint intel_snb_event_constraints[] __read_mostly =
INTEL_EVENT_CONSTRAINT(0xcd, 0x8), /* MEM_TRANS_RETIRED.LOAD_LATENCY */
INTEL_UEVENT_CONSTRAINT(0x04a3, 0xf), /* CYCLE_ACTIVITY.CYCLES_NO_DISPATCH */
INTEL_UEVENT_CONSTRAINT(0x02a3, 0x4), /* CYCLE_ACTIVITY.CYCLES_L1D_PENDING */
+
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
EVENT_CONSTRAINT_END
};

@@ -131,15 +137,12 @@ static struct event_constraint intel_ivb_event_constraints[] __read_mostly =
INTEL_UEVENT_CONSTRAINT(0x08a3, 0x4), /* CYCLE_ACTIVITY.CYCLES_L1D_PENDING */
INTEL_UEVENT_CONSTRAINT(0x0ca3, 0x4), /* CYCLE_ACTIVITY.STALLS_L1D_PENDING */
INTEL_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PREC_DIST */
- /*
- * Errata BV98 -- MEM_*_RETIRED events can leak between counters of SMT
- * siblings; disable these events because they can corrupt unrelated
- * counters.
- */
- INTEL_EVENT_CONSTRAINT(0xd0, 0x0), /* MEM_UOPS_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd1, 0x0), /* MEM_LOAD_UOPS_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd2, 0x0), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd3, 0x0), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
EVENT_CONSTRAINT_END
};

@@ -217,6 +220,12 @@ static struct event_constraint intel_hsw_event_constraints[] = {
INTEL_EVENT_CONSTRAINT(0x0ca3, 0x4),
/* CYCLE_ACTIVITY.CYCLES_NO_EXECUTE */
INTEL_EVENT_CONSTRAINT(0x04a3, 0xf),
+
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
EVENT_CONSTRAINT_END
};

@@ -2784,6 +2793,27 @@ static __init void intel_nehalem_quirk(void)
}
}

+/*
+ * enable software workaround for errata:
+ * SNB: BJ122
+ * IVB: BV98
+ * HSW: HSD29
+ *
+ * Only needed when HT is enabled. However detecting
+ * this is too difficult and model specific so we enable
+ * it even with HT off for now.
+ */
+static __init void intel_ht_bug(void)
+{
+ x86_pmu.flags |= PMU_FL_EXCL_CNTRS;
+
+ x86_pmu.commit_scheduling = intel_commit_scheduling;
+ x86_pmu.start_scheduling = intel_start_scheduling;
+ x86_pmu.stop_scheduling = intel_stop_scheduling;
+
+ pr_info("CPU erratum BJ122, BV98, HSD29 worked around\n");
+}
+
EVENT_ATTR_STR(mem-loads, mem_ld_hsw, "event=0xcd,umask=0x1,ldlat=3");
EVENT_ATTR_STR(mem-stores, mem_st_hsw, "event=0xd0,umask=0x82")

@@ -2997,6 +3027,7 @@ __init int intel_pmu_init(void)
case 42: /* 32nm SandyBridge */
case 45: /* 32nm SandyBridge-E/EN/EP */
x86_add_quirk(intel_sandybridge_quirk);
+ x86_add_quirk(intel_ht_bug);
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs,
@@ -3011,6 +3042,8 @@ __init int intel_pmu_init(void)
x86_pmu.extra_regs = intel_snbep_extra_regs;
else
x86_pmu.extra_regs = intel_snb_extra_regs;
+
+
/* all extra regs are per-cpu when HT is on */
x86_pmu.flags |= PMU_FL_HAS_RSP_1;
x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3029,6 +3062,7 @@ __init int intel_pmu_init(void)

case 58: /* 22nm IvyBridge */
case 62: /* 22nm IvyBridge-EP/EX */
+ x86_add_quirk(intel_ht_bug);
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
/* dTLB-load-misses on IVB is different than SNB */
@@ -3064,6 +3098,7 @@ __init int intel_pmu_init(void)
case 63: /* 22nm Haswell Server */
case 69: /* 22nm Haswell ULT */
case 70: /* 22nm Haswell + GT3e (Intel Iris Pro graphics) */
+ x86_add_quirk(intel_ht_bug);
x86_pmu.late_ack = true;
memcpy(hw_cache_event_ids, hsw_hw_cache_event_ids, sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, hsw_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
--
1.9.1

2014-10-09 16:36:25

by Stephane Eranian

Subject: [PATCH v2 06/12] perf/x86: implement cross-HT corruption bug workaround

From: Maria Dimakopoulou <[email protected]>

This patch implements a software workaround for a HW erratum
on Intel SandyBridge, IvyBridge and Haswell processors
with Hyperthreading enabled. The errata are documented for
each processor in their respective specification update
documents:
- SandyBridge: BJ122
- IvyBridge: BV98
- Haswell: HSD29

The bug causes silent counter corruption across hyperthreads only
when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3).
Counters measuring those events may leak counts to the sibling
counter. For instance, counter 0, thread 0 measuring event 0xd0,
may leak to counter 0, thread 1, regardless of the event measured
there. The size of the leak is not predictable. It all depends on
the workload and the state of each sibling hyper-thread. The
corrupting events undercount as a consequence of the leak. The
leak is compensated automatically only when the sibling counter
measures the exact same corrupting event AND the workload on the
two threads is the same. Given that there is no way to guarantee
this, a workaround is necessary. Furthermore, there is a serious
problem if the leaked count is added to a low-occurrence event.
In that case the corruption of the low-occurrence event can be
very large, e.g., orders of magnitude.

There is no HW or FW workaround for this problem.

The bug is very easy to reproduce on a loaded system.
Here is an example on a Haswell client, where CPU0 and CPU4
are siblings. We load the CPUs with a simple triad app
streaming a large floating-point vector. We use the corrupting
event 0x81d0 (MEM_UOPS_RETIRED:ALL_LOADS) and
0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given that we are not
using the LBR, the 0x20cc event should be zero.

$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -e r20cc sleep 10
Performance counter stats for 'system wide':
139 277 291 r20cc
10,000969126 seconds time elapsed

In this example, 0x81d0 and 0x20cc are using sibling counters (the
same counter index) on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and
corrupts it from 0 to 139 million occurrences.

This patch provides a software workaround to this problem by modifying the
way events are scheduled onto counters by the kernel. The patch forces
cross-thread mutual exclusion between counters in case a corrupting event
is measured by one of the hyper-threads. If thread 0, counter 0 is measuring
event 0xd0, then nothing can be measured on counter 0, thread 1. If no corrupting
event is measured on any hyper-thread, event scheduling proceeds as before.

The same example, run with the workaround enabled, yields the correct answer:
$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -e r20cc sleep 10
Performance counter stats for 'system wide':
0 r20cc
10,000969126 seconds time elapsed

The patch provides correctness for all non-corrupting events. It does not
"repatriate" the leaked counts back to the leaking counter. This is planned
for a second patch series. This patch series makes that repatriation easier
by guaranteeing that the sibling counter is not measuring any useful event.

The patch introduces dynamic constraints for events. That means that events which
did not have constraints, i.e., could be measured on any counter, may now be
constrained to a subset of the counters depending on what is going on on the
sibling thread. The algorithm is similar to a cache coherency protocol. We call
it XSU, in reference to Exclusive, Shared, Unused, the 3 possible states of a
PMU counter.
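
The scheduling pass itself behaves like a small transaction on the shared XSU
state: it is opened under a per-core lock, counter assignments are staged, and
the new state is published when the pass ends. Below is a minimal user-space
sketch of that ordering (a pthread mutex stands in for the kernel spinlock,
and the excl_core/staging names are invented for the example; the real
bookkeeping in this patch is more involved):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

enum xsu { XSU_UNUSED, XSU_SHARED, XSU_EXCLUSIVE };
#define NCTR 4

/* one instance shared by the two hyper-threads of a core */
struct excl_core {
    pthread_mutex_t lock;
    enum xsu state[2][NCTR];    /* published per-thread counter state */
    enum xsu staging[2][NCTR];  /* state being built during scheduling */
};

/* begin a scheduling "transaction": take the lock, start from the old state */
static void start_scheduling(struct excl_core *c, int tid)
{
    pthread_mutex_lock(&c->lock);
    memcpy(c->staging[tid], c->state[tid], sizeof(c->staging[tid]));
}

/* record what this thread just placed on counter 'cntr' */
static void commit_scheduling(struct excl_core *c, int tid, int cntr, int excl)
{
    c->staging[tid][cntr] = excl ? XSU_EXCLUSIVE : XSU_SHARED;
}

/* publish the new state for the sibling to see, then end the transaction */
static void stop_scheduling(struct excl_core *c, int tid)
{
    memcpy(c->state[tid], c->staging[tid], sizeof(c->state[tid]));
    pthread_mutex_unlock(&c->lock);
}

int main(void)
{
    struct excl_core core = { .lock = PTHREAD_MUTEX_INITIALIZER };

    start_scheduling(&core, 0);
    commit_scheduling(&core, 0, 0, 1);  /* thread 0: corrupting event, counter 0 */
    stop_scheduling(&core, 0);

    printf("thread 0, counter 0 state: %d (2 == exclusive)\n", core.state[0][0]);
    return 0;
}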

As a consequence of the workaround, users may see an increased amount of event
multiplexing, even in situations where fewer events than counters are measured
on a CPU.

The patch has been tested on all three impacted processors. Note that when
HT is off, there is no corruption. However, the workaround is still enabled,
at a modest cost. Adding dynamic detection of whether HT is on turned out to
be too complex and to require too much code to be justified.

This patch addresses the issue when PEBS is not used. A subsequent patch
fixes the problem when PEBS is used.

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 31 ++--
arch/x86/kernel/cpu/perf_event.h | 6 +
arch/x86/kernel/cpu/perf_event_intel.c | 307 ++++++++++++++++++++++++++++++++-
3 files changed, 331 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 56b97a4..9799e9b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -729,7 +729,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
struct event_constraint *c;
unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
struct perf_event *e;
- int i, wmin, wmax, num = 0;
+ int i, wmin, wmax, unsched = 0;
struct hw_perf_event *hwc;

bitmap_zero(used_mask, X86_PMC_IDX_MAX);
@@ -772,14 +772,20 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)

/* slow path */
if (i != n)
- num = perf_assign_events(cpuc->event_list, n, wmin,
- wmax, assign);
+ unsched = perf_assign_events(cpuc->event_list, n, wmin,
+ wmax, assign);

/*
- * Mark the event as committed, so we do not put_constraint()
- * in case new events are added and fail scheduling.
+ * In case of success (unsched = 0), mark events as committed,
+ * so we do not put_constraint() in case new events are added
+ * and fail to be scheduled
+ *
+ * We invoke the lower level commit callback to lock the resource
+ *
+ * We do not need to do all of this in case we are called to
+ * validate an event group (assign == NULL)
*/
- if (!num && assign) {
+ if (!unsched && assign) {
for (i = 0; i < n; i++) {
e = cpuc->event_list[i];
e->hw.flags |= PERF_X86_EVENT_COMMITTED;
@@ -787,11 +793,9 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
x86_pmu.commit_scheduling(cpuc, e, assign[i]);
}
}
- /*
- * scheduling failed or is just a simulation,
- * free resources if necessary
- */
- if (!assign || num) {
+
+ if (!assign || unsched) {
+
for (i = 0; i < n; i++) {
e = cpuc->event_list[i];
/*
@@ -801,6 +805,9 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if ((e->hw.flags & PERF_X86_EVENT_COMMITTED))
continue;

+ /*
+ * release events that failed scheduling
+ */
if (x86_pmu.put_event_constraints)
x86_pmu.put_event_constraints(cpuc, e);
}
@@ -809,7 +816,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (x86_pmu.stop_scheduling)
x86_pmu.stop_scheduling(cpuc);

- return num ? -EINVAL : 0;
+ return unsched ? -EINVAL : 0;
}

/*
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index c985270..bd5c381 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -72,6 +72,7 @@ struct event_constraint {
#define PERF_X86_EVENT_PEBS_LD_HSW 0x10 /* haswell style datala, load */
#define PERF_X86_EVENT_PEBS_NA_HSW 0x20 /* haswell style datala, unknown */
#define PERF_X86_EVENT_EXCL 0x40 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC 0x80 /* dynamic alloc'd constraint */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -131,6 +132,7 @@ enum intel_excl_state_type {
struct intel_excl_states {
enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
enum intel_excl_state_type state[X86_PMC_IDX_MAX];
+ bool sched_started; /* true if scheduling has started */
};

struct intel_excl_cntrs {
@@ -290,6 +292,10 @@ struct cpu_hw_events {
#define INTEL_UEVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK)

+#define INTEL_EXCLUEVT_CONSTRAINT(c, n) \
+ __EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
+ HWEIGHT(n), 0, PERF_X86_EVENT_EXCL)
+
#define INTEL_PLD_CONSTRAINT(c, n) \
__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LDLAT)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0eece24..d7df7b3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1771,7 +1771,7 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
}

static struct event_constraint *
-intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+__intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
struct perf_event *event)
{
struct event_constraint *c;
@@ -1792,6 +1792,254 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
}

static void
+intel_start_scheduling(struct cpu_hw_events *cpuc)
+{
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct intel_excl_states *xl, *xlo;
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid; /* sibling thread */
+
+ /*
+ * nothing needed if in group validation mode
+ */
+ if (cpuc->is_fake)
+ return;
+ /*
+ * no exclusion needed
+ */
+ if (!excl_cntrs)
+ return;
+
+ xlo = &excl_cntrs->states[o_tid];
+ xl = &excl_cntrs->states[tid];
+
+ xl->sched_started = true;
+
+ /*
+ * lock shared state until we are done scheduling
+ * in stop_event_scheduling()
+ * makes scheduling appear as a transaction
+ */
+ WARN_ON_ONCE(!irqs_disabled());
+ spin_lock(&excl_cntrs->lock);
+
+ /*
+ * save initial state of sibling thread
+ */
+ memcpy(xlo->init_state, xlo->state, sizeof(xlo->init_state));
+}
+
+static void
+intel_stop_scheduling(struct cpu_hw_events *cpuc)
+{
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct intel_excl_states *xl, *xlo;
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid; /* sibling thread */
+
+ /*
+ * nothing needed if in group validation mode
+ */
+ if (cpuc->is_fake)
+ return;
+ /*
+ * no exclusion needed
+ */
+ if (!excl_cntrs)
+ return;
+
+ xlo = &excl_cntrs->states[o_tid];
+ xl = &excl_cntrs->states[tid];
+
+ /*
+ * make new sibling thread state visible
+ */
+ memcpy(xlo->state, xlo->init_state, sizeof(xlo->state));
+
+ xl->sched_started = false;
+ /*
+ * release shared state lock (acquired in intel_start_scheduling())
+ */
+ spin_unlock(&excl_cntrs->lock);
+}
+
+static struct event_constraint *
+intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
+ int idx, struct event_constraint *c)
+{
+ struct event_constraint *cx;
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct intel_excl_states *xl, *xlo;
+ int is_excl, i;
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid; /* alternate */
+
+ /*
+ * validating a group does not require
+ * enforcing cross-thread exclusion
+ */
+ if (cpuc->is_fake)
+ return c;
+
+ /*
+ * event requires exclusive counter access
+ * across HT threads
+ */
+ is_excl = c->flags & PERF_X86_EVENT_EXCL;
+
+ /*
+ * xl = state of current HT
+ * xlo = state of sibling HT
+ */
+ xl = &excl_cntrs->states[tid];
+ xlo = &excl_cntrs->states[o_tid];
+
+ cx = c;
+
+ /*
+ * because we modify the constraint, we need
+ * to make a copy. Static constraints come
+ * from static const tables.
+ *
+ * only needed when constraint has not yet
+ * been cloned (marked dynamic)
+ */
+ if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
+
+ /* sanity check */
+ if (idx < 0)
+ return &emptyconstraint;
+
+ /*
+ * grab pre-allocated constraint entry
+ */
+ cx = &cpuc->constraint_list[idx];
+
+ /*
+ * initialize dynamic constraint
+ * with static constraint
+ */
+ memcpy(cx, c, sizeof(*cx));
+
+ /*
+ * mark constraint as dynamic, so we
+ * can free it later on
+ */
+ cx->flags |= PERF_X86_EVENT_DYNAMIC;
+ }
+
+ /*
+ * From here on, the constraint is dynamic.
+ * Either it was just allocated above, or it
+ * was allocated during a earlier invocation
+ * of this function
+ */
+
+ /*
+ * Modify static constraint with current dynamic
+ * state of thread
+ *
+ * EXCLUSIVE: sibling counter measuring exclusive event
+ * SHARED : sibling counter measuring non-exclusive event
+ * UNUSED : sibling counter unused
+ */
+ for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
+ /*
+ * exclusive event in sibling counter
+ * our corresponding counter cannot be used
+ * regardless of our event
+ */
+ if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
+ __clear_bit(i, cx->idxmsk);
+ /*
+ * if measuring an exclusive event, sibling
+ * measuring non-exclusive, then counter cannot
+ * be used
+ */
+ if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
+ __clear_bit(i, cx->idxmsk);
+ }
+
+ /*
+ * recompute actual bit weight for scheduling algorithm
+ */
+ cx->weight = hweight64(cx->idxmsk64);
+
+ /*
+ * if we return an empty mask, then switch
+ * back to static empty constraint to avoid
+ * the cost of freeing later on
+ */
+ if (cx->weight == 0)
+ cx = &emptyconstraint;
+
+ return cx;
+}
+
+static struct event_constraint *
+intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+ struct event_constraint *c = event->hw.constraint;
+
+ /*
+ * first time only
+ * - static constraint: no change across incremental scheduling calls
+ * - dynamic constraint: handled by intel_get_excl_constraints()
+ */
+ if (!c)
+ c = __intel_get_event_constraints(cpuc, idx, event);
+
+ if (cpuc->excl_cntrs)
+ return intel_get_excl_constraints(cpuc, event, idx, c);
+
+ return c;
+}
+
+static void intel_put_excl_constraints(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct intel_excl_states *xlo, *xl;
+ unsigned long flags = 0; /* keep compiler happy */
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid;
+
+ /*
+ * nothing needed if in group validation mode
+ */
+ if (cpuc->is_fake)
+ return;
+
+ WARN_ON_ONCE(!excl_cntrs);
+
+ if (!excl_cntrs)
+ return;
+
+ xl = &excl_cntrs->states[tid];
+ xlo = &excl_cntrs->states[o_tid];
+
+ /*
+ * put_constraint may be called from x86_schedule_events()
+ * which already has the lock held so here make locking
+ * conditional
+ */
+ if (!xl->sched_started)
+ spin_lock_irqsave(&excl_cntrs->lock, flags);
+
+ /*
+ * if event was actually assigned, then mark the
+ * counter state as unused now
+ */
+ if (hwc->idx >= 0)
+ xlo->state[hwc->idx] = INTEL_EXCL_UNUSED;
+
+ if (!xl->sched_started)
+ spin_unlock_irqrestore(&excl_cntrs->lock, flags);
+}
+
+static void
intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
struct perf_event *event)
{
@@ -1809,7 +2057,57 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
struct perf_event *event)
{
+ struct event_constraint *c = event->hw.constraint;
+
intel_put_shared_regs_event_constraints(cpuc, event);
+
+ /*
+ * is PMU has exclusive counter restrictions, then
+ * all events are subject to and must call the
+ * put_excl_constraints() routine
+ */
+ if (c && cpuc->excl_cntrs)
+ intel_put_excl_constraints(cpuc, event);
+
+ /* cleanup dynamic constraint */
+ if (c && (c->flags & PERF_X86_EVENT_DYNAMIC))
+ event->hw.constraint = NULL;
+}
+
+static void intel_commit_scheduling(struct cpu_hw_events *cpuc,
+ struct perf_event *event, int cntr)
+{
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct event_constraint *c = event->hw.constraint;
+ struct intel_excl_states *xlo, *xl;
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid;
+ int is_excl;
+
+ if (cpuc->is_fake || !c)
+ return;
+
+ is_excl = c->flags & PERF_X86_EVENT_EXCL;
+
+ if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
+ return;
+
+ WARN_ON_ONCE(!excl_cntrs);
+
+ if (!excl_cntrs)
+ return;
+
+ xl = &excl_cntrs->states[tid];
+ xlo = &excl_cntrs->states[o_tid];
+
+ WARN_ON_ONCE(!spin_is_locked(&excl_cntrs->lock));
+
+ if (cntr >= 0) {
+ if (is_excl)
+ xlo->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
+ else
+ xlo->init_state[cntr] = INTEL_EXCL_SHARED;
+ }
}

static void intel_pebs_aliases_core2(struct perf_event *event)
@@ -2256,6 +2554,13 @@ static void intel_pmu_cpu_dying(int cpu)
cpuc->constraint_list = NULL;
}

+ c = cpuc->excl_cntrs;
+ if (c) {
+ if (c->core_id == -1 || --c->refcnt == 0)
+ kfree(c);
+ cpuc->excl_cntrs = NULL;
+ }
+
fini_debug_store_on_cpu(cpu);
}

--
1.9.1

2014-10-09 16:36:38

by Stephane Eranian

Subject: [PATCH v2 10/12] watchdog: add watchdog enable/disable all functions

This patch adds two new functions to enable/disable
the watchdog across all CPUs.
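
A sketch of the intended usage pattern follows; it mirrors how the later
HT-bug initcall in this series uses the helpers. The stub functions below are
placeholders for the example, not the kernel implementation:

#include <stdio.h>

/* stand-ins for the helpers added by this patch */
static void watchdog_nmi_disable_all(void) { printf("NMI watchdog off on all CPUs\n"); }
static void watchdog_nmi_enable_all(void)  { printf("NMI watchdog back on\n"); }

/* placeholder for PMU reconfiguration that must run with no active perf users */
static void reconfigure_pmu(void) { printf("reconfiguring PMU\n"); }

int main(void)
{
    /*
     * Quiesce the NMI watchdog (the only perf user this early in boot),
     * change the PMU setup, then restore the watchdog.
     */
    watchdog_nmi_disable_all();
    reconfigure_pmu();
    watchdog_nmi_enable_all();
    return 0;
}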

Signed-off-by: Stephane Eranian <[email protected]>
---
include/linux/watchdog.h | 3 +++
kernel/watchdog.c | 28 ++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)

diff --git a/include/linux/watchdog.h b/include/linux/watchdog.h
index 2a3038e..b89b414 100644
--- a/include/linux/watchdog.h
+++ b/include/linux/watchdog.h
@@ -142,4 +142,7 @@ extern int watchdog_init_timeout(struct watchdog_device *wdd,
extern int watchdog_register_device(struct watchdog_device *);
extern void watchdog_unregister_device(struct watchdog_device *);

+void watchdog_nmi_disable_all(void);
+void watchdog_nmi_enable_all(void);
+
#endif /* ifndef _LINUX_WATCHDOG_H */
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 8759d0b..6aa5b8f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -514,9 +514,37 @@ static void watchdog_nmi_disable(unsigned int cpu)
cpu0_err = 0;
}
}
+
+void watchdog_nmi_enable_all(void)
+{
+ int cpu;
+
+ if (!watchdog_user_enabled)
+ return;
+
+ get_online_cpus();
+ for_each_online_cpu(cpu)
+ watchdog_nmi_enable(cpu);
+ put_online_cpus();
+}
+
+void watchdog_nmi_disable_all(void)
+{
+ int cpu;
+
+ if (!watchdog_running)
+ return;
+
+ get_online_cpus();
+ for_each_online_cpu(cpu)
+ watchdog_nmi_disable(cpu);
+ put_online_cpus();
+}
#else
static int watchdog_nmi_enable(unsigned int cpu) { return 0; }
static void watchdog_nmi_disable(unsigned int cpu) { return; }
+void watchdog_nmi_enable_all(void) {}
+void watchdog_nmi_disable_all(void) {}
#endif /* CONFIG_HARDLOCKUP_DETECTOR */

static struct smp_hotplug_thread watchdog_threads = {
--
1.9.1

2014-10-09 16:36:47

by Stephane Eranian

Subject: [PATCH v2 09/12] perf/x86: fix intel_get_event_constraints() for dynamic constraints

With dynamic constraints, we need to restart from the static
constraints each time intel_get_event_constraints() is called.

Reviewed-by: Maria Dimakopoulou <[email protected]>
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 041dec7..674f296 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1989,20 +1989,25 @@ static struct event_constraint *
intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
struct perf_event *event)
{
- struct event_constraint *c = event->hw.constraint;
+ struct event_constraint *c1 = event->hw.constraint;
+ struct event_constraint *c2;

/*
* first time only
* - static constraint: no change across incremental scheduling calls
* - dynamic constraint: handled by intel_get_excl_constraints()
*/
- if (!c)
- c = __intel_get_event_constraints(cpuc, idx, event);
+ c2 = __intel_get_event_constraints(cpuc, idx, event);
+ if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
+ bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
+ c1->weight = c2->weight;
+ c2 = c1;
+ }

if (cpuc->excl_cntrs)
- return intel_get_excl_constraints(cpuc, event, idx, c);
+ return intel_get_excl_constraints(cpuc, event, idx, c2);

- return c;
+ return c2;
}

static void intel_put_excl_constraints(struct cpu_hw_events *cpuc,
--
1.9.1

2014-10-09 16:36:57

by Stephane Eranian

Subject: [PATCH v2 11/12] perf/x86: make HT bug workaround conditioned on HT enabled

This patch disables the PMU HT bug workaround when Hyperthreading (HT)
is disabled. We cannot do this test immediately when perf_events
is initialized, because we need to wait until the topology information
is set up properly. As such, we register a later initcall, check
the topology and potentially disable the workaround. To do this,
we need to ensure there is no user of the PMU. At this point of
the boot, the only user is the NMI watchdog, thus we disable
it during the switch and re-enable it right after.

Having the workaround disabled when it is not needed provides
some benefit by limiting the overhead in time and space.
The workaround still ensures correct scheduling of the corrupting
memory events (0xd0, 0xd1, 0xd2) when HT is off. Those events
can only be measured on counters 0-3, something the current
kernel did not handle correctly.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 5 ++
arch/x86/kernel/cpu/perf_event_intel.c | 95 ++++++++++++++++++++++++++--------
2 files changed, 79 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index edf5b84..2b6e375 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -596,6 +596,7 @@ do { \
#define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
#define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */
#define PMU_FL_EXCL_CNTRS 0x4 /* has exclusive counter requirements */
+#define PMU_FL_EXCL_ENABLED 0x8 /* exclusive counter active */

#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
@@ -829,6 +830,10 @@ int knc_pmu_init(void);
ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
char *page);

+static inline int is_ht_workaround_enabled(void)
+{
+ return !!(x86_pmu.flags & PMU_FL_EXCL_ENABLED);
+}
#else /* CONFIG_CPU_SUP_INTEL */

static inline void reserve_ds_buffers(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 674f296..aa4c0b8 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -12,6 +12,7 @@
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/export.h>
+#include <linux/watchdog.h>

#include <asm/cpufeature.h>
#include <asm/hardirq.h>
@@ -1811,7 +1812,7 @@ intel_start_scheduling(struct cpu_hw_events *cpuc)
/*
* nothing needed if in group validation mode
*/
- if (cpuc->is_fake)
+ if (cpuc->is_fake || !is_ht_workaround_enabled())
return;
/*
* no exclusion needed
@@ -1849,8 +1850,9 @@ intel_stop_scheduling(struct cpu_hw_events *cpuc)
/*
* nothing needed if in group validation mode
*/
- if (cpuc->is_fake)
+ if (cpuc->is_fake || !is_ht_workaround_enabled())
return;
+
/*
* no exclusion needed
*/
@@ -1887,7 +1889,13 @@ intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
* validating a group does not require
* enforcing cross-thread exclusion
*/
- if (cpuc->is_fake)
+ if (cpuc->is_fake || !is_ht_workaround_enabled())
+ return c;
+
+ /*
+ * no exclusion needed
+ */
+ if (!excl_cntrs)
return c;

/*
@@ -2547,18 +2555,11 @@ static void intel_pmu_cpu_starting(int cpu)
}
}

-static void intel_pmu_cpu_dying(int cpu)
+static void free_excl_cntrs(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
- struct intel_shared_regs *pc;
struct intel_excl_cntrs *c;

- pc = cpuc->shared_regs;
- if (pc) {
- if (pc->core_id == -1 || --pc->refcnt == 0)
- kfree(pc);
- cpuc->shared_regs = NULL;
- }
c = cpuc->excl_cntrs;
if (c) {
if (c->core_id == -1 || --c->refcnt == 0)
@@ -2567,14 +2568,22 @@ static void intel_pmu_cpu_dying(int cpu)
kfree(cpuc->constraint_list);
cpuc->constraint_list = NULL;
}
+}

- c = cpuc->excl_cntrs;
- if (c) {
- if (c->core_id == -1 || --c->refcnt == 0)
- kfree(c);
- cpuc->excl_cntrs = NULL;
+static void intel_pmu_cpu_dying(int cpu)
+{
+ struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
+ struct intel_shared_regs *pc;
+
+ pc = cpuc->shared_regs;
+ if (pc) {
+ if (pc->core_id == -1 || --pc->refcnt == 0)
+ kfree(pc);
+ cpuc->shared_regs = NULL;
}

+ free_excl_cntrs(cpu);
+
fini_debug_store_on_cpu(cpu);
}

@@ -2805,18 +2814,18 @@ static __init void intel_nehalem_quirk(void)
* HSW: HSD29
*
* Only needed when HT is enabled. However detecting
- * this is too difficult and model specific so we enable
- * it even with HT off for now.
+ * if HT is enabled is difficult (model specific). So instead,
+ * we enable the workaround in the early boot, and verify if
+ * it is needed in a later initcall phase once we have valid
+ * topology information to check if HT is actually enabled
*/
static __init void intel_ht_bug(void)
{
- x86_pmu.flags |= PMU_FL_EXCL_CNTRS;
+ x86_pmu.flags |= PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED;

x86_pmu.commit_scheduling = intel_commit_scheduling;
x86_pmu.start_scheduling = intel_start_scheduling;
x86_pmu.stop_scheduling = intel_stop_scheduling;
-
- pr_info("CPU erratum BJ122, BV98, HSD29 worked around\n");
}

EVENT_ATTR_STR(mem-loads, mem_ld_hsw, "event=0xcd,umask=0x1,ldlat=3");
@@ -3232,3 +3241,47 @@ __init int intel_pmu_init(void)

return 0;
}
+
+/*
+ * HT bug: phase 2 init
+ * Called once we have valid topology information to check
+ * whether or not HT is enabled
+ * If HT is off, then we disable the workaround
+ */
+static __init int fixup_ht_bug(void)
+{
+ int cpu = smp_processor_id();
+ int w, c;
+ /*
+ * problem not present on this CPU model, nothing to do
+ */
+ if (!(x86_pmu.flags & PMU_FL_EXCL_ENABLED))
+ return 0;
+
+ w = cpumask_weight(topology_thread_cpumask(cpu));
+ if (w > 1) {
+ pr_info("CPU erratum BJ122, BV98, HSD29 worked around\n");
+ return 0;
+ }
+
+ watchdog_nmi_disable_all();
+
+ x86_pmu.flags &= ~(PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED);
+
+ x86_pmu.commit_scheduling = NULL;
+ x86_pmu.start_scheduling = NULL;
+ x86_pmu.stop_scheduling = NULL;
+
+ watchdog_nmi_enable_all();
+
+ get_online_cpus();
+
+ for_each_online_cpu(c) {
+ free_excl_cntrs(c);
+ }
+
+ put_online_cpus();
+ pr_info("CPU erratum BJ122, BV98, HSD29 workaround disabled, HT off\n");
+ return 0;
+}
+subsys_initcall(fixup_ht_bug)
--
1.9.1

2014-10-09 16:37:06

by Stephane Eranian

Subject: [PATCH v2 12/12] perf/x86: add sysfs entry to disable HT bug workaround

From: Maria Dimakopoulou <[email protected]>

This patch adds a sysfs entry:

/sys/devices/cpu/ht_bug_workaround

to activate/deactivate the PMU HT bug workaround.

To activate (activated by default):
# echo 1 > /sys/devices/cpu/ht_bug_workaround

To deactivate:
# echo 0 > /sys/devices/cpu/ht_bug_workaround

The change takes effect only once there are no more active
events.

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9799e9b..3d942d9 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1889,10 +1889,50 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
return count;
}

+static ssize_t get_attr_xsu(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ int ff = is_ht_workaround_enabled();
+
+ return snprintf(buf, 40, "%d\n", ff);
+}
+
+static ssize_t set_attr_xsu(struct device *cdev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ unsigned long val;
+ int ff = is_ht_workaround_enabled();
+ ssize_t ret;
+
+ /*
+ * if workaround, disabled, no effect
+ */
+ if (!cpuc->excl_cntrs)
+ return count;
+
+ ret = kstrtoul(buf, 0, &val);
+ if (ret)
+ return ret;
+
+ if (!!val != ff) {
+ if (!val)
+ x86_pmu.flags &= ~PMU_FL_EXCL_ENABLED;
+ else
+ x86_pmu.flags |= PMU_FL_EXCL_ENABLED;
+ }
+ return count;
+}
+
static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
+static DEVICE_ATTR(ht_bug_workaround, S_IRUSR | S_IWUSR, get_attr_xsu, \
+ set_attr_xsu);

static struct attribute *x86_pmu_attrs[] = {
&dev_attr_rdpmc.attr,
+ &dev_attr_ht_bug_workaround.attr,
NULL,
};

--
1.9.1

2014-10-09 16:37:55

by Stephane Eranian

Subject: [PATCH v2 08/12] perf/x86: enforce HT bug workaround with PEBS for SNB/IVB/HSW

From: Maria Dimakopoulou <[email protected]>

This patch modifies the PEBS constraint tables for SNB/IVB/HSW
such that corrupting events supporting PEBS activate the HT
workaround.

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 20 +++++++++++++++++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 28 ++++++++++++++++++----------
2 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index bd5c381..edf5b84 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -320,22 +320,40 @@ struct cpu_hw_events {

/* Check flags and event code, and set the HSW load flag */
#define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(code, n) \
- __EVENT_CONSTRAINT(code, n, \
+ __EVENT_CONSTRAINT(code, n, \
ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)

+#define INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(code, n) \
+ __EVENT_CONSTRAINT(code, n, \
+ ARCH_PERFMON_EVENTSEL_EVENT|X86_ALL_EVENT_FLAGS, \
+ HWEIGHT(n), 0, \
+ PERF_X86_EVENT_PEBS_LD_HSW|PERF_X86_EVENT_EXCL)
+
/* Check flags and event code/umask, and set the HSW store flag */
#define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(code, n) \
__EVENT_CONSTRAINT(code, n, \
INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_ST_HSW)

+#define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XST(code, n) \
+ __EVENT_CONSTRAINT(code, n, \
+ INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
+ HWEIGHT(n), 0, \
+ PERF_X86_EVENT_PEBS_ST_HSW|PERF_X86_EVENT_EXCL)
+
/* Check flags and event code/umask, and set the HSW load flag */
#define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(code, n) \
__EVENT_CONSTRAINT(code, n, \
INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LD_HSW)

+#define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(code, n) \
+ __EVENT_CONSTRAINT(code, n, \
+ INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
+ HWEIGHT(n), 0, \
+ PERF_X86_EVENT_PEBS_LD_HSW|PERF_X86_EVENT_EXCL)
+
/* Check flags and event code/umask, and set the HSW N/A flag */
#define INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(code, n) \
__EVENT_CONSTRAINT(code, n, \
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index b1553d0..291f15b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -611,6 +611,10 @@ struct event_constraint intel_snb_pebs_event_constraints[] = {
INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
/* Allow all events as PEBS with no flags */
INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
EVENT_CONSTRAINT_END
@@ -622,6 +626,10 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
/* Allow all events as PEBS with no flags */
INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
EVENT_CONSTRAINT_END
@@ -633,16 +641,16 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
- INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
- INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
- INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
- INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
- INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_STORES */
- INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x42d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_STORES */
- INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */
- INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
- INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd2, 0xf), /* MEM_LOAD_UOPS_L3_HIT_RETIRED.* */
- INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd3, 0xf), /* MEM_LOAD_UOPS_L3_MISS_RETIRED.* */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XST(0x12d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_STORES */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XST(0x42d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_STORES */
+ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XST(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(0xd2, 0xf), /* MEM_LOAD_UOPS_L3_HIT_RETIRED.* */
+ INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_XLD(0xd3, 0xf), /* MEM_LOAD_UOPS_L3_MISS_RETIRED.* */
/* Allow all events as PEBS with no flags */
INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
EVENT_CONSTRAINT_END
--
1.9.1

2014-10-09 16:38:14

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH v2 04/12] perf/x86: add index param to get_event_constraint() callback

This patch adds an index parameter to the get_event_constraint()
x86_pmu callback. It is expected to represent the index of the
event in the cpuc->event_list[] array. When the callback is used
for fake_cpuc (event validation), then the index must be -1. The
motivation for passing the index is to use it to index into another
cpuc array.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 4 ++--
arch/x86/kernel/cpu/perf_event.h | 4 +++-
arch/x86/kernel/cpu/perf_event_amd.c | 6 ++++--
arch/x86/kernel/cpu/perf_event_intel.c | 15 ++++++++++-----
4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index ca44ba2..56b97a4 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -739,7 +739,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)

for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
hwc = &cpuc->event_list[i]->hw;
- c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
+ c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
hwc->constraint = c;

wmin = min(wmin, c->weight);
@@ -1731,7 +1731,7 @@ static int validate_event(struct perf_event *event)
if (IS_ERR(fake_cpuc))
return PTR_ERR(fake_cpuc);

- c = x86_pmu.get_event_constraints(fake_cpuc, event);
+ c = x86_pmu.get_event_constraints(fake_cpuc, -1, event);

if (!c || !c->weight)
ret = -EINVAL;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index b6848c3..667a185 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -443,6 +443,7 @@ struct x86_pmu {
u64 max_period;
struct event_constraint *
(*get_event_constraints)(struct cpu_hw_events *cpuc,
+ int idx,
struct perf_event *event);

void (*put_event_constraints)(struct cpu_hw_events *cpuc,
@@ -690,7 +691,8 @@ static inline int amd_pmu_init(void)
int intel_pmu_save_and_restart(struct perf_event *event);

struct event_constraint *
-x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event);
+x86_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event);

struct intel_shared_regs *allocate_shared_regs(int cpu);

diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index a8d1a43..65b9a4b 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -430,7 +430,8 @@ static void amd_pmu_cpu_dead(int cpu)
}

static struct event_constraint *
-amd_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
{
/*
* if not NB event or no NB, then no constraints
@@ -538,7 +539,8 @@ static struct event_constraint amd_f15_PMC50 = EVENT_CONSTRAINT(0, 0x3F, 0);
static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0);

static struct event_constraint *
-amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, struct perf_event *event)
+amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
unsigned int event_code = amd_get_event_code(hwc);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 4b6a3b0..64bbb57 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1753,7 +1753,8 @@ intel_shared_regs_constraints(struct cpu_hw_events *cpuc,
}

struct event_constraint *
-x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+x86_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
{
struct event_constraint *c;

@@ -1770,7 +1771,8 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
}

static struct event_constraint *
-intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
{
struct event_constraint *c;

@@ -1786,7 +1788,7 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event
if (c)
return c;

- return x86_get_event_constraints(cpuc, event);
+ return x86_get_event_constraints(cpuc, idx, event);
}

static void
@@ -2020,9 +2022,12 @@ static struct event_constraint counter2_constraint =
EVENT_CONSTRAINT(0, 0x4, 0);

static struct event_constraint *
-hsw_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
{
- struct event_constraint *c = intel_get_event_constraints(cpuc, event);
+ struct event_constraint *c;
+
+ c = intel_get_event_constraints(cpuc, idx, event);

/* Handle special quirk on in_tx_checkpointed only in counter 2 */
if (event->hw.config & HSW_IN_TX_CHECKPOINTED) {
--
1.9.1

2014-10-21 11:17:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] perf/x86: add cross-HT counter exclusion infrastructure

On Thu, Oct 09, 2014 at 06:34:39PM +0200, Stephane Eranian wrote:
> +struct intel_excl_cntrs {
> + spinlock_t lock;
> +
> + struct intel_excl_states states[2];
> +
> + int refcnt; /* per-core: #HT threads */
> + unsigned core_id; /* per-core: core id */
> +};

lkml.kernel.org/r/[email protected]

lkml.kernel.org/r/CABPqkBR=+xf-yT7+ouNzVp_TdBYpjeNYpCYc+H7fOMFxPuzyog@mail.gmail.com

2014-10-21 11:18:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] perf/x86: add cross-HT counter exclusion infrastructure

On Thu, Oct 09, 2014 at 06:34:39PM +0200, Stephane Eranian wrote:
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -71,6 +71,7 @@ struct event_constraint {
> #define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
> #define PERF_X86_EVENT_PEBS_LD_HSW 0x10 /* haswell style datala, load */
> #define PERF_X86_EVENT_PEBS_NA_HSW 0x20 /* haswell style datala, unknown */
> +#define PERF_X86_EVENT_EXCL 0x40 /* HT exclusivity on counter */
>

lkml.kernel.org/r/CAHzUATgam8qrma+B-d=9jOHpmmiwNwk-717roNY=ndbJAZefjw@mail.gmail.com

2014-10-21 12:28:09

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 0/12] perf/x86: implement HT leak workaround for SNB/IVB/HSW

Peter,

On Tue, Oct 21, 2014 at 1:25 PM, Peter Zijlstra <[email protected]> wrote:
>
>
> lkml.kernel.org/r/CABPqkBRbst4sgpgE5O_VXt-CSC0VD=aP2KWA0e3Uy64tw7df3A@mail.gmail.com
>
> I missed that 3 lines if they were in here.
>
I did not put them in there because there is another problem.
If you partition the generic counters 2 and 2, then some CPUs will not
be able to measure some events.
Unfortunately, there is no way to partition the 4 counters such that
all the events can be measured by each CPU. Some events or precise
sampling requires counter 2, for instance prec_dist.
That's why I did not put this fix in.

2014-10-21 13:08:34

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 0/12] perf/x86: implement HT leak workaround for SNB/IVB/HSW

On Tue, Oct 21, 2014 at 3:03 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 21, 2014 at 02:28:06PM +0200, Stephane Eranian wrote:
>> Peter,
>>
>> On Tue, Oct 21, 2014 at 1:25 PM, Peter Zijlstra <[email protected]> wrote:
>> >
>> >
>> > lkml.kernel.org/r/CABPqkBRbst4sgpgE5O_VXt-CSC0VD=aP2KWA0e3Uy64tw7df3A@mail.gmail.com
>> >
>> > I missed that 3 lines if they were in here.
>> >
>> I did not put them in there because there is another problem.
>> If you partition the generic counters 2 and 2, then some CPUs will not
>> be able to measure some events.
>> Unfortunately, there is no way to partition the 4 counters such that
>> all the events can be measured by
>> each CPU. Some events or precise sampling requires counter 2 for
>> instance (like prec_dist).
>> That's why I did not put this fix in.
>
> Ah, I wasn't thinking about a hard partition, just a limit on the number
> of exclusive counters any one CPU can claim such as to not starve. Or is
> that what you were talking about? I feel not being able to starve
> another CPU is more important than a better utilization bound for
> counter scheduling.

So you're saying: just limit the number of used counters to 2,
regardless of which ones they are. Sometimes this will avoid the
aforementioned problem and sometimes not. We can try that.

2014-10-21 13:09:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/12] perf/x86: implement HT leak workaround for SNB/IVB/HSW

On Tue, Oct 21, 2014 at 02:28:06PM +0200, Stephane Eranian wrote:
> Peter,
>
> On Tue, Oct 21, 2014 at 1:25 PM, Peter Zijlstra <[email protected]> wrote:
> >
> >
> > lkml.kernel.org/r/CABPqkBRbst4sgpgE5O_VXt-CSC0VD=aP2KWA0e3Uy64tw7df3A@mail.gmail.com
> >
> > I missed that 3 lines if they were in here.
> >
> I did not put them in there because there is another problem.
> If you partition the generic counters 2 and 2, then some CPUs will not
> be able to measure some events.
> Unfortunately, there is no way to partition the 4 counters such that
> all the events can be measured by
> each CPU. Some events or precise sampling requires counter 2 for
> instance (like prec_dist).
> That's why I did not put this fix in.

Ah, I wasn't thinking about a hard partition, just a limit on the number
of exclusive counters any one CPU can claim such as to not starve. Or is
that what you were talking about? I feel not being able to starve
another CPU is more important than a better utilization bound for
counter scheduling.

2014-10-22 09:12:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/12] perf/x86: implement HT leak workaround for SNB/IVB/HSW

On Tue, Oct 21, 2014 at 03:08:32PM +0200, Stephane Eranian wrote:
> On Tue, Oct 21, 2014 at 3:03 PM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Oct 21, 2014 at 02:28:06PM +0200, Stephane Eranian wrote:
> >> Peter,
> >>
> >> On Tue, Oct 21, 2014 at 1:25 PM, Peter Zijlstra <[email protected]> wrote:
> >> >
> >> >
> >> > lkml.kernel.org/r/CABPqkBRbst4sgpgE5O_VXt-CSC0VD=aP2KWA0e3Uy64tw7df3A@mail.gmail.com
> >> >
> >> > I missed that 3 lines if they were in here.
> >> >
> >> I did not put them in there because there is another problem.
> >> If you partition the generic counters 2 and 2, then some CPUs will not
> >> be able to measure some events.
> >> Unfortunately, there is no way to partition the 4 counters such that
> >> all the events can be measured by
> >> each CPU. Some events or precise sampling requires counter 2 for
> >> instance (like prec_dist).
> >> That's why I did not put this fix in.
> >
> > Ah, I wasn't thinking about a hard partition, just a limit on the number
> > of exclusive counters any one CPU can claim such as to not starve. Or is
> > that what you were talking about? I feel not being able to starve
> > another CPU is more important than a better utilization bound for
> > counter scheduling.
>
> So you're saying, just limit number of used counters to 2 regardless
> of which one they are.

used as in marked exclusive and forced empty on the other side.

> So sometimes, this will avoid the problem aforementioned and sometimes
> not. We can try that.

How will this sometimes not avoid the starvation issue?

2014-10-22 12:32:49

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 06/12] perf/x86: implement cross-HT corruption bug workaround

On Thu, Oct 09, 2014 at 06:34:40PM +0200, Stephane Eranian wrote:
> From: Maria Dimakopoulou <[email protected]>

SNIP

> +static struct event_constraint *
> +intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
> + int idx, struct event_constraint *c)
> +{
> + struct event_constraint *cx;
> + struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
> + struct intel_excl_states *xl, *xlo;
> + int is_excl, i;

SNIP

> + /*
> + * Modify static constraint with current dynamic
> + * state of thread
> + *
> + * EXCLUSIVE: sibling counter measuring exclusive event
> + * SHARED : sibling counter measuring non-exclusive event
> + * UNUSED : sibling counter unused
> + */
> + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
> + /*
> + * exclusive event in sibling counter
> + * our corresponding counter cannot be used
> + * regardless of our event
> + */
> + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
> + __clear_bit(i, cx->idxmsk);

if we want to check sibling counter, shouldn't we check xlo->state[i] instead? like

if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
__clear_bit(i, cx->idxmsk);


and also in condition below?

> + /*
> + * if measuring an exclusive event, sibling
> + * measuring non-exclusive, then counter cannot
> + * be used
> + */
> + if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
> + __clear_bit(i, cx->idxmsk);
> + }
> +
> + /*
> + * recompute actual bit weight for scheduling algorithm
> + */
> + cx->weight = hweight64(cx->idxmsk64);
> +
> + /*
> + * if we return an empty mask, then switch
> + * back to static empty constraint to avoid
> + * the cost of freeing later on
> + */
> + if (cx->weight == 0)
> + cx = &emptyconstraint;
> +

SNIP

jirka

2014-10-22 13:28:38

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 11/12] perf/x86: make HT bug workaround conditioned on HT enabled

On Thu, Oct 09, 2014 at 06:34:45PM +0200, Stephane Eranian wrote:

SNIP

> + */
> +static __init int fixup_ht_bug(void)
> +{
> + int cpu = smp_processor_id();
> + int w, c;
> + /*
> + * problem not present on this CPU model, nothing to do
> + */
> + if (!(x86_pmu.flags & PMU_FL_EXCL_ENABLED))
> + return 0;
> +
> + w = cpumask_weight(topology_thread_cpumask(cpu));
> + if (w > 1) {
> + pr_info("CPU erratum BJ122, BV98, HSD29 worked around\n");
> + return 0;
> + }
> +
> + watchdog_nmi_disable_all();
> +
> + x86_pmu.flags &= ~(PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED);
> +
> + x86_pmu.commit_scheduling = NULL;
> + x86_pmu.start_scheduling = NULL;
> + x86_pmu.stop_scheduling = NULL;
> +
> + watchdog_nmi_enable_all();

if you build with CONFIG_LOCKUP_DETECTOR=n it won't link:

LD init/built-in.o
arch/x86/built-in.o: In function `fixup_ht_bug':
/root/linux/arch/x86/kernel/cpu/perf_event_intel.c:3267: undefined reference to `watchdog_nmi_disable_all'
/root/linux/arch/x86/kernel/cpu/perf_event_intel.c:3275: undefined reference to `watchdog_nmi_enable_all'

I think you need watchdog_nmi_(enable|disable)_all stubs out of kernel/watchdog.c

jirka

2014-10-22 14:36:17

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 11/12] perf/x86: make HT bug workaround conditioned on HT enabled

On Wed, Oct 22, 2014 at 3:27 PM, Jiri Olsa <[email protected]> wrote:
> On Thu, Oct 09, 2014 at 06:34:45PM +0200, Stephane Eranian wrote:
>
> SNIP
>
>> + */
>> +static __init int fixup_ht_bug(void)
>> +{
>> + int cpu = smp_processor_id();
>> + int w, c;
>> + /*
>> + * problem not present on this CPU model, nothing to do
>> + */
>> + if (!(x86_pmu.flags & PMU_FL_EXCL_ENABLED))
>> + return 0;
>> +
>> + w = cpumask_weight(topology_thread_cpumask(cpu));
>> + if (w > 1) {
>> + pr_info("CPU erratum BJ122, BV98, HSD29 worked around\n");
>> + return 0;
>> + }
>> +
>> + watchdog_nmi_disable_all();
>> +
>> + x86_pmu.flags &= ~(PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED);
>> +
>> + x86_pmu.commit_scheduling = NULL;
>> + x86_pmu.start_scheduling = NULL;
>> + x86_pmu.stop_scheduling = NULL;
>> +
>> + watchdog_nmi_enable_all();
>
> if you build with CONFIG_LOCKUP_DETECTOR=n it wont link:
>
> LD init/built-in.o
> arch/x86/built-in.o: In function `fixup_ht_bug':
> /root/linux/arch/x86/kernel/cpu/perf_event_intel.c:3267: undefined reference to `watchdog_nmi_disable_all'
> /root/linux/arch/x86/kernel/cpu/perf_event_intel.c:3275: undefined reference to `watchdog_nmi_enable_all'
>
> I think you need watchdog_nmi_(enable|disable)_all stubs out of kernel/watchdog.c
>
Good catch. In fact the stubs cannot be in watchdog.c because it may
not even get compiled; they need to be in watchdog.h. I fixed that now.
Thanks
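
Roughly something like this in the watchdog header (a sketch of the
idea only, the actual hunk may differ):

#ifdef CONFIG_LOCKUP_DETECTOR
void watchdog_nmi_enable_all(void);
void watchdog_nmi_disable_all(void);
#else
static inline void watchdog_nmi_enable_all(void) { }
static inline void watchdog_nmi_disable_all(void) { }
#endif

With CONFIG_LOCKUP_DETECTOR=n the calls in fixup_ht_bug() then resolve
to empty inlines and the link error goes away.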

2014-10-22 14:59:09

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 11/12] perf/x86: make HT bug workaround conditioned on HT enabled

On Thu, Oct 09, 2014 at 06:34:45PM +0200, Stephane Eranian wrote:

SNIP

> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -12,6 +12,7 @@
> #include <linux/init.h>
> #include <linux/slab.h>
> #include <linux/export.h>
> +#include <linux/watchdog.h>
>
> #include <asm/cpufeature.h>
> #include <asm/hardirq.h>
> @@ -1811,7 +1812,7 @@ intel_start_scheduling(struct cpu_hw_events *cpuc)
> /*
> * nothing needed if in group validation mode
> */
> - if (cpuc->is_fake)
> + if (cpuc->is_fake || !is_ht_workaround_enabled())
> return;
> /*
> * no exclusion needed
> @@ -1849,8 +1850,9 @@ intel_stop_scheduling(struct cpu_hw_events *cpuc)
> /*
> * nothing needed if in group validation mode
> */
> - if (cpuc->is_fake)
> + if (cpuc->is_fake || !is_ht_workaround_enabled())
> return;


Is it possible to change the is_ht_workaround_enabled() result within
the intel_start_scheduling and intel_stop_scheduling time frame? Like
the following could happen:


is_ht_workaround_enabled == 1

intel_start_scheduling -> spin_lock(&excl_cntrs->lock);

# echo 0 > /sys/devices/cpu/ht_bug_workaround

is_ht_workaround_enabled == 0

intel_stop_scheduling -> wont call spin_unlock(&excl_cntrs->lock)


for some reason I can't get HT enabled on my HSW server, so I cannot try ;-)

jirka

2014-10-22 15:08:14

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] perf/x86: add cross-HT counter exclusion infrastructure

On Thu, Oct 09, 2014 at 06:34:39PM +0200, Stephane Eranian wrote:
> From: Maria Dimakopoulou <[email protected]>
>

SNIP

> +struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
> +{
> + struct intel_excl_cntrs *c;
> + int i;
> +
> + c = kzalloc_node(sizeof(struct intel_excl_cntrs),
> + GFP_KERNEL, cpu_to_node(cpu));
> + if (c) {
> + spin_lock_init(&c->lock);
> + for (i = 0; i < X86_PMC_IDX_MAX; i++) {
> + c->states[0].state[i] = INTEL_EXCL_UNUSED;
> + c->states[0].init_state[i] = INTEL_EXCL_UNUSED;
> +
> + c->states[1].state[i] = INTEL_EXCL_UNUSED;
> + c->states[1].init_state[i] = INTEL_EXCL_UNUSED;

'c' is allocated with kzalloc_node, so it seems there's no need
to re-initialize the states to zero again with INTEL_EXCL_UNUSED

jirka

2014-10-22 16:42:19

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 11/12] perf/x86: make HT bug workaround conditioned on HT enabled

On Wed, Oct 22, 2014 at 4:58 PM, Jiri Olsa <[email protected]> wrote:
> On Thu, Oct 09, 2014 at 06:34:45PM +0200, Stephane Eranian wrote:
>
> SNIP
>
>> --- a/arch/x86/kernel/cpu/perf_event_intel.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> @@ -12,6 +12,7 @@
>> #include <linux/init.h>
>> #include <linux/slab.h>
>> #include <linux/export.h>
>> +#include <linux/watchdog.h>
>>
>> #include <asm/cpufeature.h>
>> #include <asm/hardirq.h>
>> @@ -1811,7 +1812,7 @@ intel_start_scheduling(struct cpu_hw_events *cpuc)
>> /*
>> * nothing needed if in group validation mode
>> */
>> - if (cpuc->is_fake)
>> + if (cpuc->is_fake || !is_ht_workaround_enabled())
>> return;
>> /*
>> * no exclusion needed
>> @@ -1849,8 +1850,9 @@ intel_stop_scheduling(struct cpu_hw_events *cpuc)
>> /*
>> * nothing needed if in group validation mode
>> */
>> - if (cpuc->is_fake)
>> + if (cpuc->is_fake || !is_ht_workaround_enabled())
>> return;
>
>
> Is it possible to change is_ht_workaround_enabled() result within
> intel_start_scheduling and intel_stop_scheduling time frame? like
> following would happen:
>
>
> is_ht_workaround_enabled == 1
>
> intel_start_scheduling -> spin_lock(&excl_cntrs->lock);
>
> # echo 0 > /sys/devices/cpu/ht_bug_workaround
>
> is_ht_workaround_enabled == 0
>
> intel_stop_scheduling -> wont call spin_unlock(&excl_cntrs->lock)
>
>
> for some reason I can't get HT enabled on my HSW server, so I cannot try ;-)
>
Yes, that would be possible!
I can fix this by grabbing the lock in the sysfs handler.
Thanks.

2014-10-22 17:25:43

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] perf/x86: add cross-HT counter exclusion infrastructure

On Wed, Oct 22, 2014 at 5:07 PM, Jiri Olsa <[email protected]> wrote:
> On Thu, Oct 09, 2014 at 06:34:39PM +0200, Stephane Eranian wrote:
>> From: Maria Dimakopoulou <[email protected]>
>>
>
> SNIP
>
>> +struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
>> +{
>> + struct intel_excl_cntrs *c;
>> + int i;
>> +
>> + c = kzalloc_node(sizeof(struct intel_excl_cntrs),
>> + GFP_KERNEL, cpu_to_node(cpu));
>> + if (c) {
>> + spin_lock_init(&c->lock);
>> + for (i = 0; i < X86_PMC_IDX_MAX; i++) {
>> + c->states[0].state[i] = INTEL_EXCL_UNUSED;
>> + c->states[0].init_state[i] = INTEL_EXCL_UNUSED;
>> +
>> + c->states[1].state[i] = INTEL_EXCL_UNUSED;
>> + c->states[1].init_state[i] = INTEL_EXCL_UNUSED;
>
> 'c' is allocated with kzalloc_node, seems there's no need
> to re-initialize states it to zero again with INTEL_EXCL_UNUSED
>
You are assuming that INTEL_EXCL_UNUSED=0 always.
That means it needs to be made clear that there is a dependency here.

I am happy to remove that if someone confirms the assumption is common
in the kernel for this kind of field.


> jirka

2014-10-22 21:04:33

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 0/12] perf/x86: implement HT leak workaround for SNB/IVB/HSW

On Wed, Oct 22, 2014 at 11:12 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 21, 2014 at 03:08:32PM +0200, Stephane Eranian wrote:
>> On Tue, Oct 21, 2014 at 3:03 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Oct 21, 2014 at 02:28:06PM +0200, Stephane Eranian wrote:
>> >> Peter,
>> >>
>> >> On Tue, Oct 21, 2014 at 1:25 PM, Peter Zijlstra <[email protected]> wrote:
>> >> >
>> >> >
>> >> > lkml.kernel.org/r/CABPqkBRbst4sgpgE5O_VXt-CSC0VD=aP2KWA0e3Uy64tw7df3A@mail.gmail.com
>> >> >
>> >> > I missed that 3 lines if they were in here.
>> >> >
>> >> I did not put them in there because there is another problem.
>> >> If you partition the generic counters 2 and 2, then some CPUs will not
>> >> be able to measure some events.
>> >> Unfortunately, there is no way to partition the 4 counters such that
>> >> all the events can be measured by
>> >> each CPU. Some events or precise sampling requires counter 2 for
>> >> instance (like prec_dist).
>> >> That's why I did not put this fix in.
>> >
>> > Ah, I wasn't thinking about a hard partition, just a limit on the number
>> > of exclusive counters any one CPU can claim such as to not starve. Or is
>> > that what you were talking about? I feel not being able to starve
>> > another CPU is more important than a better utilization bound for
>> > counter scheduling.
>>
>> So you're saying, just limit number of used counters to 2 regardless
>> of which one they are.
>
> used as in marked exclusive and forced empty on the other side.
>
>> So sometimes, this will avoid the problem aforementioned and sometimes
>> not. We can try that.
>
> How will this sometimes not avoid the starvation issue?

Here is a simple case, limiting each HT to only 2 counters (any 2 out
of the 4 possible):

HT0: you measure a MEM* in ctr2, it is started first, and it keeps running
HT1: you measure PREC_DIST with PEBS (it requires ctr2)

HT0 is measuring a corrupting event on ctr2, which prevents ctr2 on HT1
from being used.
HT1 is starved: it cannot measure PREC_DIST.

Yes you have a quota of 2 out of 4 counters.

The quota, dynamic or static, can help mitigate the starvation. The
only way to eliminate it is to force multiplexing even though you are
using fewer counters than actually available.

2014-10-23 07:15:44

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] perf/x86: add cross-HT counter exclusion infrastructure

On Wed, Oct 22, 2014 at 07:25:40PM +0200, Stephane Eranian wrote:
> On Wed, Oct 22, 2014 at 5:07 PM, Jiri Olsa <[email protected]> wrote:
> > On Thu, Oct 09, 2014 at 06:34:39PM +0200, Stephane Eranian wrote:
> >> From: Maria Dimakopoulou <[email protected]>
> >>
> >
> > SNIP
> >
> >> +struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
> >> +{
> >> + struct intel_excl_cntrs *c;
> >> + int i;
> >> +
> >> + c = kzalloc_node(sizeof(struct intel_excl_cntrs),
> >> + GFP_KERNEL, cpu_to_node(cpu));
> >> + if (c) {
> >> + spin_lock_init(&c->lock);
> >> + for (i = 0; i < X86_PMC_IDX_MAX; i++) {
> >> + c->states[0].state[i] = INTEL_EXCL_UNUSED;
> >> + c->states[0].init_state[i] = INTEL_EXCL_UNUSED;
> >> +
> >> + c->states[1].state[i] = INTEL_EXCL_UNUSED;
> >> + c->states[1].init_state[i] = INTEL_EXCL_UNUSED;
> >
> > 'c' is allocated with kzalloc_node, seems there's no need
> > to re-initialize states it to zero again with INTEL_EXCL_UNUSED
> >
> You are assuming that INTEL_EXCL_UNUSED=0 always.
> That means it needs to be clear there is a dependency here.

it's 0 explicitly:

enum intel_excl_state_type {
INTEL_EXCL_UNUSED = 0, /* counter is unused */

but it's not a big issue for me.. just curious ;-)

jirka

2014-10-23 07:20:46

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 06/12] perf/x86: implement cross-HT corruption bug workaround

On Wed, Oct 22, 2014 at 02:31:51PM +0200, Jiri Olsa wrote:
> On Thu, Oct 09, 2014 at 06:34:40PM +0200, Stephane Eranian wrote:
> > From: Maria Dimakopoulou <[email protected]>
>
> SNIP
>
> > +static struct event_constraint *
> > +intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
> > + int idx, struct event_constraint *c)
> > +{
> > + struct event_constraint *cx;
> > + struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
> > + struct intel_excl_states *xl, *xlo;
> > + int is_excl, i;
>
> SNIP
>
> > + /*
> > + * Modify static constraint with current dynamic
> > + * state of thread
> > + *
> > + * EXCLUSIVE: sibling counter measuring exclusive event
> > + * SHARED : sibling counter measuring non-exclusive event
> > + * UNUSED : sibling counter unused
> > + */
> > + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
> > + /*
> > + * exclusive event in sibling counter
> > + * our corresponding counter cannot be used
> > + * regardless of our event
> > + */
> > + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
> > + __clear_bit(i, cx->idxmsk);
>
> if we want to check sibling counter, shouldn't we check xlo->state[i] instead? like
>
> if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
> __clear_bit(i, cx->idxmsk);
>
>
> and also in condition below?

any comment on this? I'm curious, because it'd enlighten me
on how this is supposed to work ;-)

I don't understand why you update the sibling's counter state instead
of the current cpuc->excl_thread_id HT, like in intel_commit_scheduling,
while you hold the lock for the current HT state.

Could you please comment? I must be missing something.

thanks,
jirka

2014-10-23 08:01:11

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 06/12] perf/x86: implement cross-HT corruption bug workaround

On Thu, Oct 23, 2014 at 9:19 AM, Jiri Olsa <[email protected]> wrote:
> On Wed, Oct 22, 2014 at 02:31:51PM +0200, Jiri Olsa wrote:
>> On Thu, Oct 09, 2014 at 06:34:40PM +0200, Stephane Eranian wrote:
>> > From: Maria Dimakopoulou <[email protected]>
>>
>> SNIP
>>
>> > +static struct event_constraint *
>> > +intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
>> > + int idx, struct event_constraint *c)
>> > +{
>> > + struct event_constraint *cx;
>> > + struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
>> > + struct intel_excl_states *xl, *xlo;
>> > + int is_excl, i;
>>
>> SNIP
>>
>> > + /*
>> > + * Modify static constraint with current dynamic
>> > + * state of thread
>> > + *
>> > + * EXCLUSIVE: sibling counter measuring exclusive event
>> > + * SHARED : sibling counter measuring non-exclusive event
>> > + * UNUSED : sibling counter unused
>> > + */
>> > + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
>> > + /*
>> > + * exclusive event in sibling counter
>> > + * our corresponding counter cannot be used
>> > + * regardless of our event
>> > + */
>> > + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
>> > + __clear_bit(i, cx->idxmsk);
>>
>> if we want to check sibling counter, shouldn't we check xlo->state[i] instead? like
>>
>> if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
>> __clear_bit(i, cx->idxmsk);
>>
>>
>> and also in condition below?
>
> any comment on this? I'm curious, because it'd enlighten me
> on how this is supposed to work ;-)
>
> I dont understand why you update the sibling's counter state instead
> of the current cpuc->excl_thread_id HT, like in intel_commit_scheduling
> while you hold lock for the current HT state
>
> could you please comment, I must be missing something
>
Yes, it is a bit confusing. It comes down to what the state represents.
Let me explain.

In get_constraints(), you compute the bitmask of possible counters by
looking at your own state in states[tid] (xl).

In commit_constraint(), the scheduler has picked counters and you need
to commit the changes. But you don't update your own state, you update
your sibling's state. Why? Because of the bug, what you use influences
what the sibling can measure. So you update the sibling's state to
reflect its new constraint. When the sibling calls get_constraint(), it
will harvest the new constraints automatically.

Example:

HT0 wants to program a MEM (corrupting) event, so it gathers its
constraints from get_constraints(). The mask is, let's say, 0x3. The
scheduler picks counter0. Then, in commit_constraints(), you need to
mark *HT1*'s counter0 in exclusive mode, i.e., it cannot be used
anymore on that thread.
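
In (pseudo) C, stripped down to just the state update, the idea is
something like this (an illustrative sketch with a made-up helper name,
not the patch code; X86_PMC_IDX_MAX comes from the kernel headers):

        /* states[t].state[i]: what thread t's sibling measures on counter i */
        enum intel_excl_state_type {
                INTEL_EXCL_UNUSED = 0,  /* sibling counter unused */
                INTEL_EXCL_SHARED,      /* sibling counter: non-corrupting event */
                INTEL_EXCL_EXCLUSIVE,   /* sibling counter: corrupting event */
        };

        struct intel_excl_states {
                enum intel_excl_state_type state[X86_PMC_IDX_MAX];
        };

        /* thread tid committed counter cntr; record it in the sibling's view */
        static void xsu_commit(struct intel_excl_states states[2],
                               int tid, int cntr, int is_excl)
        {
                states[1 - tid].state[cntr] =
                        is_excl ? INTEL_EXCL_EXCLUSIVE : INTEL_EXCL_SHARED;
        }

So in the example above, HT0 committing the corrupting event on
counter0 sets states[1].state[0] to EXCLUSIVE, and HT1's next
get_constraints() pass clears bit 0 from its own constraint mask.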

Hope this helps.

2014-10-23 08:14:10

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH v2 06/12] perf/x86: implement cross-HT corruption bug workaround

On Thu, Oct 23, 2014 at 10:01:08AM +0200, Stephane Eranian wrote:
> On Thu, Oct 23, 2014 at 9:19 AM, Jiri Olsa <[email protected]> wrote:
> > On Wed, Oct 22, 2014 at 02:31:51PM +0200, Jiri Olsa wrote:
> >> On Thu, Oct 09, 2014 at 06:34:40PM +0200, Stephane Eranian wrote:
> >> > From: Maria Dimakopoulou <[email protected]>

SNIP

> >> > + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
> >> > + /*
> >> > + * exclusive event in sibling counter
> >> > + * our corresponding counter cannot be used
> >> > + * regardless of our event
> >> > + */
> >> > + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
> >> > + __clear_bit(i, cx->idxmsk);
> >>
> >> if we want to check sibling counter, shouldn't we check xlo->state[i] instead? like
> >>
> >> if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
> >> __clear_bit(i, cx->idxmsk);
> >>
> >>
> >> and also in condition below?
> >
> > any comment on this? I'm curious, because it'd enlighten me
> > on how this is supposed to work ;-)
> >
> > I dont understand why you update the sibling's counter state instead
> > of the current cpuc->excl_thread_id HT, like in intel_commit_scheduling
> > while you hold lock for the current HT state
> >
> > could you please comment, I must be missing something
> >
> Yes, it is a bit confusing. It comes down that what the state represents.
> Let me explain.
>
> In get_constraints(), you compute the bitmask of possible counters by looking
> at your own state in states[tid] (xl).
>
> In commit_constraint(), the scheduler has picked counters and you need to commit
> the changes. But you don't update your state, you update your
> sibling's state. Why?
> Because of the bug, what you use influences what the sibling can measure. So you
> update the sibling's state to reflect its new constraint. When the
> sibling calls get_constraint()
> it will harvest the new constraints automatically.
>
> Example:
>
> HT0 wants to program a MEM (corrupting) event, it gathers its
> constraints from get_constraints().
> The mask is, let's say, 0x3. The scheduler picks counter0. Then, in
> commit_constraints(), you need
> to mark *HT1*'s counter0 in exclusive mode, i.e., it cannot be used
> anymore on that thread.

ah, I translated the INTEL_EXCL_EXCLUSIVE state as "here's the exclusive
event, the sibling HT is forbidden to schedule in this counter (slot)"

cool, thanks
jirka

2014-10-23 08:53:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/12] perf/x86: implement HT leak workaround for SNB/IVB/HSW

On Wed, Oct 22, 2014 at 11:04:31PM +0200, Stephane Eranian wrote:
> Here is a simple case:
> Limiting each HT to only 2 counters, can be any, 2 out of 4 possible.
>
> HT0: you measure a MEM* in ctr2, it is started first, and it keeps running
> HT1: you measure PREC_DIST with PEBS (it requires ctr2)
>
> HT0 is measuring a corrupting event on ctr2, this prevents ctr2 on HT1
> from being used.
> HT1 is starved, it cannot measure PREC_DIST
>
> Yes you have a quota of 2 out of 4 counters.
>
> The quota dynamic or static can help mitigate the starvation. The only
> way to eliminate
> it is to force multiplexing even though you are using fewer counters
> than actually avail.

Ah yes, the very narrowly constrained events. Those suck indeed. And I
imagine rotation might not even help here -- rotation doesn't guarantee
SMT1 will try and schedule before SMT0; in fact there are setups
(staggered tick) where it's almost guaranteed not to.

Still, I suppose for 'normal' events it's a much better state: SMT1 can
always schedule some events.

2014-10-23 08:58:03

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH v2 0/12] perf/x86: implement HT leak workaround for SNB/IVB/HSW

On Thu, Oct 23, 2014 at 10:53 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 22, 2014 at 11:04:31PM +0200, Stephane Eranian wrote:
>> Here is a simple case:
>> Limiting each HT to only 2 counters, can be any, 2 out of 4 possible.
>>
>> HT0: you measure a MEM* in ctr2, it is started first, and it keeps running
>> HT1: you measure PREC_DIST with PEBS (it requires ctr2)
>>
>> HT0 is measuring a corrupting event on ctr2, this prevents ctr2 on HT1
>> from being used.
>> HT1 is starved, it cannot measure PREC_DIST
>>
>> Yes you have a quota of 2 out of 4 counters.
>>
>> The quota dynamic or static can help mitigate the starvation. The only
>> way to eliminate
>> it is to force multiplexing even though you are using fewer counters
>> than actually avail.
>
> Ah yes, the very narrowly constrained events. Those suck indeed. And I
> imagine rotation might not even help here -- rotation doesn't guarantee
> SMT1 will try and schedule before SMT0, in fact there are setups
> (staggered tick) where its almost guaranteed not to.
>
> Still I suppose for 'normal' event its a much better state, SMT1 can
> always schedule some events.

Yes, I agree with you. The soft partition helps. I will add that in V3.
Thanks.