2014-06-04 21:46:53

by Stephane Eranian

Subject: [PATCH 0/9] perf/x86: implement HT counter corruption workaround

From: Maria Dimakopoulou <[email protected]>

This patch series addresses a serious known erratum in the PMU
of Intel SandyBridge, IvyBridge and Haswell processors with hyper-threading
enabled.

The erratum is documented in the Intel specification update documents
for each processor under the errata listed below:
- SandyBridge: BJ122
- IvyBridge: BV98
- Haswell: HSD29

The bug causes silent counter corruption across hyperthreads only
when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3).
Counters measuring those events may leak counts to the sibling
counter. For instance, counter 0, thread 0 measuring event 0x81d0,
may leak to counter 0, thread 1, regardless of the event measured
there. The size of the leak is not predictable. It all depends on
the workload and the state of each sibling hyper-thread. The
corrupting events do undercount as a consequence of the leak. The
leak is compensated automatically only when the sibling counter measures
the exact same corrupting event AND the workload on the two threads
is the same. Given there is no way to guarantee this, a workaround
is necessary. Furthermore, there is a serious problem if the leaked counts
are added to a low-occurrence event. In that case the corruption on
the low-occurrence event can be very large, e.g., orders of magnitude.

There is no HW or FW workaround for this problem.

The bug is very easy to reproduce on a loaded system.
Here is an example on a Haswell client, where CPU0 and CPU4
are siblings. We load the CPUs with a simple triad app
streaming a large floating-point vector. We use the 0x81d0
corrupting event (MEM_UOPS_RETIRED:ALL_LOADS) and
0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given we are not
using the LBR, the 0x20cc event should be zero.

$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -e r20cc sleep 10
Performance counter stats for 'system wide':
139 277 291 r20cc
10,000969126 seconds time elapsed

In this example, 0x81d0 and r20cc are using sibling counters
on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and corrupts it
from 0 to 139 million occurrences.

This patch provides a software workaround to this problem by modifying the
way events are scheduled onto counters by the kernel. The patch forces
cross-thread mutual exclusion between sibling counters in case a
corrupting event is measured by one of the hyper-threads. If thread 0,
counter 0 is measuring event 0x81d0, then nothing can be measured on
counter 0, thread 1. If no corrupting event is measured on any hyper-thread,
event scheduling proceeds as before.

The same example run with the workaround enabled yields the correct answer:
$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -e r20cc sleep 10
Performance counter stats for 'system wide':
0 r20cc
10,000969126 seconds time elapsed

The patch does provide correctness for all non-corrupting events. It does not
"repatriate" the leaked counts back to the leaking counter. This is planned
for a second patch series. This patch series however makes this repatriation
easier by guaranteeing that the sibling counter is not measuring any useful event.

The patch introduces dynamic constraints for events. That means that events which
did not have constraints, i.e., could be measured on any counters, may now be
constrained to a subset of the counters depending on what is going on on the sibling
thread. The algorithm is similar to a cache coherency protocol. We call it XSU
in reference to Exclusive, Shared, Unused, the 3 possible states of a PMU
counter.
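
For reference, the core of the per-counter filtering, condensed from
intel_get_excl_constraints() in patch 5/9 (in this sketch 'state' stands for
the XSU state the sibling thread published for each counter, and 'is_excl'
flags a corrupting event on the current thread):

	for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
		/* sibling measures a corrupting event on counter i: unusable */
		if (state[i] == INTEL_EXCL_EXCLUSIVE)
			__clear_bit(i, cx->idxmsk);
		/* our event is corrupting: do not pair with a sibling's normal event */
		if (is_excl && state[i] == INTEL_EXCL_SHARED)
			__clear_bit(i, cx->idxmsk);
	}
	/* recompute the weight used by the scheduling algorithm */
	cx->weight = hweight64(cx->idxmsk64);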

As a consequence of the workaround, users may see an increased amount of event
multiplexing, even in situations where fewer events than counters are being
measured on a CPU.

The patches have been tested on all three impacted processors. Note that when
HT is off, there is no corruption. However, the workaround is still enabled,
though it does not cost much. Adding dynamic detection of whether HT is on
turned out to be complex and to require too much code to be justified. This
patch series also covers events used with PEBS.

Maria Dimakopoulou (6):
perf/x86: add 3 new scheduling callbacks
perf/x86: add cross-HT counter exclusion infrastructure
perf/x86: implement cross-HT corruption bug workaround
perf/x86: enforce HT bug workaround for SNB/IVB/HSW
perf/x86: enforce HT bug workaround with PEBS for SNB/IVB/HSW
perf/x86: add sysfs entry to disable HT bug workaround

Stephane Eranian (3):
perf,x86: rename er_flags to flags
perf/x86: vectorize cpuc->kfree_on_online
perf/x86: fix intel_get_event_constraints() for dynamic constraints

arch/x86/kernel/cpu/perf_event.c | 81 ++++-
arch/x86/kernel/cpu/perf_event.h | 76 ++++-
arch/x86/kernel/cpu/perf_event_amd.c | 3 +-
arch/x86/kernel/cpu/perf_event_intel.c | 456 +++++++++++++++++++++++++++--
arch/x86/kernel/cpu/perf_event_intel_ds.c | 40 +--
5 files changed, 586 insertions(+), 70 deletions(-)

--
1.7.9.5


2014-06-04 21:43:56

by Stephane Eranian

Subject: [PATCH 1/9] perf,x86: rename er_flags to flags

Because it will be used for more than just
tracking the presence of extra registers.

Reviewed-by: Maria Dimakopoulou <[email protected]>
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 9 ++++++---
arch/x86/kernel/cpu/perf_event_intel.c | 20 ++++++++++----------
2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..0b50cd2 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -463,7 +463,7 @@ struct x86_pmu {
* Extra registers for events
*/
struct extra_reg *extra_regs;
- unsigned int er_flags;
+ unsigned int flags;

/*
* Intel host/guest support (KVM)
@@ -480,8 +480,11 @@ do { \
x86_pmu.quirks = &__quirk; \
} while (0)

-#define ERF_NO_HT_SHARING 1
-#define ERF_HAS_RSP_1 2
+/*
+ * x86_pmu flags
+ */
+#define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
+#define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */

#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..f6f8018 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1455,7 +1455,7 @@ intel_bts_constraints(struct perf_event *event)

static int intel_alt_er(int idx)
{
- if (!(x86_pmu.er_flags & ERF_HAS_RSP_1))
+ if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
return idx;

if (idx == EXTRA_REG_RSP_0)
@@ -2001,7 +2001,7 @@ static void intel_pmu_cpu_starting(int cpu)
if (!cpuc->shared_regs)
return;

- if (!(x86_pmu.er_flags & ERF_NO_HT_SHARING)) {
+ if (!(x86_pmu.flags & PMU_FL_NO_HT_SHARING)) {
for_each_cpu(i, topology_thread_cpumask(cpu)) {
struct intel_shared_regs *pc;

@@ -2397,7 +2397,7 @@ __init int intel_pmu_init(void)
x86_pmu.event_constraints = intel_slm_event_constraints;
x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints;
x86_pmu.extra_regs = intel_slm_extra_regs;
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
pr_cont("Silvermont events, ");
break;

@@ -2415,7 +2415,7 @@ __init int intel_pmu_init(void)
x86_pmu.enable_all = intel_pmu_nhm_enable_all;
x86_pmu.pebs_constraints = intel_westmere_pebs_event_constraints;
x86_pmu.extra_regs = intel_westmere_extra_regs;
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;

x86_pmu.cpu_events = nhm_events_attrs;

@@ -2447,8 +2447,8 @@ __init int intel_pmu_init(void)
else
x86_pmu.extra_regs = intel_snb_extra_regs;
/* all extra regs are per-cpu when HT is on */
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
- x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_NO_HT_SHARING;

x86_pmu.cpu_events = snb_events_attrs;

@@ -2478,8 +2478,8 @@ __init int intel_pmu_init(void)
else
x86_pmu.extra_regs = intel_snb_extra_regs;
/* all extra regs are per-cpu when HT is on */
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
- x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_NO_HT_SHARING;

x86_pmu.cpu_events = snb_events_attrs;

@@ -2507,8 +2507,8 @@ __init int intel_pmu_init(void)
x86_pmu.extra_regs = intel_snb_extra_regs;
x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
/* all extra regs are per-cpu when HT is on */
- x86_pmu.er_flags |= ERF_HAS_RSP_1;
- x86_pmu.er_flags |= ERF_NO_HT_SHARING;
+ x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+ x86_pmu.flags |= PMU_FL_NO_HT_SHARING;

x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
--
1.7.9.5

2014-06-04 21:44:00

by Stephane Eranian

Subject: [PATCH 2/9] perf/x86: vectorize cpuc->kfree_on_online

Make cpuc->kfree_on_online a vector to accommodate
more than one entry, and add a second entry to be
used by a later patch.
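
For reference, a sketch of how the two slots end up being used. The first
slot simply takes over today's single pointer; the second is claimed by the
cross-HT exclusion structure introduced later in this series:

	/* slot 0: what the AMD/Intel cpu_starting() code queues for freeing today */
	cpuc->kfree_on_online[X86_PERF_KFREE_SHARED] = cpuc->amd_nb;	/* or cpuc->shared_regs */
	/* slot 1: the cross-HT exclusion state added by a later patch */
	cpuc->kfree_on_online[X86_PERF_KFREE_EXCL] = cpuc->excl_cntrs;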

Reviewed-by: Maria Dimakopoulou <[email protected]>
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 10 +++++++---
arch/x86/kernel/cpu/perf_event.h | 8 +++++++-
arch/x86/kernel/cpu/perf_event_amd.c | 3 ++-
arch/x86/kernel/cpu/perf_event_intel.c | 4 +++-
4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 32029e3..36fb4fc 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1321,11 +1321,12 @@ x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
{
unsigned int cpu = (long)hcpu;
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
- int ret = NOTIFY_OK;
+ int i, ret = NOTIFY_OK;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_UP_PREPARE:
- cpuc->kfree_on_online = NULL;
+ for (i = 0 ; i < X86_PERF_KFREE_MAX; i++)
+ cpuc->kfree_on_online[i] = NULL;
if (x86_pmu.cpu_prepare)
ret = x86_pmu.cpu_prepare(cpu);
break;
@@ -1338,7 +1339,10 @@ x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
break;

case CPU_ONLINE:
- kfree(cpuc->kfree_on_online);
+ for (i = 0 ; i < X86_PERF_KFREE_MAX; i++) {
+ kfree(cpuc->kfree_on_online[i]);
+ cpuc->kfree_on_online[i] = NULL;
+ }
break;

case CPU_DYING:
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 0b50cd2..1cab1c2 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -121,6 +121,12 @@ struct intel_shared_regs {

#define MAX_LBR_ENTRIES 16

+enum {
+ X86_PERF_KFREE_SHARED = 0,
+ X86_PERF_KFREE_EXCL = 1,
+ X86_PERF_KFREE_MAX
+};
+
struct cpu_hw_events {
/*
* Generic x86 PMC bits
@@ -183,7 +189,7 @@ struct cpu_hw_events {
/* Inverted mask of bits to clear in the perf_ctr ctrl registers */
u64 perf_ctr_virt_mask;

- void *kfree_on_online;
+ void *kfree_on_online[X86_PERF_KFREE_MAX];
};

#define __EVENT_CONSTRAINT(c, n, m, w, o, f) {\
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index beeb7cc..a8d1a43 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -382,6 +382,7 @@ static int amd_pmu_cpu_prepare(int cpu)
static void amd_pmu_cpu_starting(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
+ void **onln = &cpuc->kfree_on_online[X86_PERF_KFREE_SHARED];
struct amd_nb *nb;
int i, nb_id;

@@ -399,7 +400,7 @@ static void amd_pmu_cpu_starting(int cpu)
continue;

if (nb->nb_id == nb_id) {
- cpuc->kfree_on_online = cpuc->amd_nb;
+ *onln = cpuc->amd_nb;
cpuc->amd_nb = nb;
break;
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f6f8018..e913e46 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2002,12 +2002,14 @@ static void intel_pmu_cpu_starting(int cpu)
return;

if (!(x86_pmu.flags & PMU_FL_NO_HT_SHARING)) {
+ void **onln = &cpuc->kfree_on_online[X86_PERF_KFREE_SHARED];
+
for_each_cpu(i, topology_thread_cpumask(cpu)) {
struct intel_shared_regs *pc;

pc = per_cpu(cpu_hw_events, i).shared_regs;
if (pc && pc->core_id == core_id) {
- cpuc->kfree_on_online = cpuc->shared_regs;
+ *onln = cpuc->shared_regs;
cpuc->shared_regs = pc;
break;
}
--
1.7.9.5

2014-06-04 21:44:06

by Stephane Eranian

Subject: [PATCH 3/9] perf/x86: add 3 new scheduling callbacks

From: Maria Dimakopoulou <[email protected]>

This patch adds 3 new PMU model-specific callbacks,
invoked during the event scheduling done by x86_schedule_events():

- start_scheduling: invoked when entering the schedule routine.
- stop_scheduling: invoked at the end of the schedule routine.
- commit_scheduling: invoked for each committed event.

To be used optionally by model-specific code.
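
For illustration, a model opts in simply by filling in the new x86_pmu hooks
from its init/quirk code; this is what the SNB/IVB/HSW patch later in this
series does (the callback bodies come with the workaround itself):

	x86_pmu.start_scheduling  = intel_start_scheduling;
	x86_pmu.commit_scheduling = intel_commit_scheduling;
	x86_pmu.stop_scheduling   = intel_stop_scheduling;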

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 9 +++++++++
arch/x86/kernel/cpu/perf_event.h | 8 ++++++++
2 files changed, 17 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 36fb4fc..858a72a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -733,6 +733,9 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)

bitmap_zero(used_mask, X86_PMC_IDX_MAX);

+ if (x86_pmu.start_scheduling)
+ x86_pmu.start_scheduling(cpuc);
+
for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
hwc = &cpuc->event_list[i]->hw;
c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
@@ -779,6 +782,8 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
for (i = 0; i < n; i++) {
e = cpuc->event_list[i];
e->hw.flags |= PERF_X86_EVENT_COMMITTED;
+ if (x86_pmu.commit_scheduling)
+ x86_pmu.commit_scheduling(cpuc, e, assign[i]);
}
}
/*
@@ -799,6 +804,10 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
x86_pmu.put_event_constraints(cpuc, e);
}
}
+
+ if (x86_pmu.stop_scheduling)
+ x86_pmu.stop_scheduling(cpuc);
+
return num ? -EINVAL : 0;
}

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 1cab1c2..413799f 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -409,6 +409,14 @@ struct x86_pmu {

void (*put_event_constraints)(struct cpu_hw_events *cpuc,
struct perf_event *event);
+
+ void (*commit_scheduling)(struct cpu_hw_events *cpuc,
+ struct perf_event *event, int cntr);
+
+ void (*start_scheduling)(struct cpu_hw_events *cpuc);
+
+ void (*stop_scheduling)(struct cpu_hw_events *cpuc);
+
struct event_constraint *event_constraints;
struct x86_pmu_quirk *quirks;
int perfctr_second_write;
--
1.7.9.5

2014-06-04 21:44:19

by Stephane Eranian

Subject: [PATCH 6/9] perf/x86: enforce HT bug workaround for SNB/IVB/HSW

From: Maria Dimakopoulou <[email protected]>

This patch activates the HT bug workaround for the
SNB/IVB/HSW processors. This covers non-PEBS mode.
Activation is done through the constraint tables.

Both client and server processors need this workaround.

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 53 ++++++++++++++++++++++++++------
1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 3af9745..b1a2684 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -113,6 +113,12 @@ static struct event_constraint intel_snb_event_constraints[] __read_mostly =
INTEL_EVENT_CONSTRAINT(0xcd, 0x8), /* MEM_TRANS_RETIRED.LOAD_LATENCY */
INTEL_UEVENT_CONSTRAINT(0x04a3, 0xf), /* CYCLE_ACTIVITY.CYCLES_NO_DISPATCH */
INTEL_UEVENT_CONSTRAINT(0x02a3, 0x4), /* CYCLE_ACTIVITY.CYCLES_L1D_PENDING */
+
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
EVENT_CONSTRAINT_END
};

@@ -131,15 +137,12 @@ static struct event_constraint intel_ivb_event_constraints[] __read_mostly =
INTEL_UEVENT_CONSTRAINT(0x08a3, 0x4), /* CYCLE_ACTIVITY.CYCLES_L1D_PENDING */
INTEL_UEVENT_CONSTRAINT(0x0ca3, 0x4), /* CYCLE_ACTIVITY.STALLS_L1D_PENDING */
INTEL_UEVENT_CONSTRAINT(0x01c0, 0x2), /* INST_RETIRED.PREC_DIST */
- /*
- * Errata BV98 -- MEM_*_RETIRED events can leak between counters of SMT
- * siblings; disable these events because they can corrupt unrelated
- * counters.
- */
- INTEL_EVENT_CONSTRAINT(0xd0, 0x0), /* MEM_UOPS_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd1, 0x0), /* MEM_LOAD_UOPS_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd2, 0x0), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd3, 0x0), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
EVENT_CONSTRAINT_END
};

@@ -217,6 +220,12 @@ static struct event_constraint intel_hsw_event_constraints[] = {
INTEL_EVENT_CONSTRAINT(0x0ca3, 0x4),
/* CYCLE_ACTIVITY.CYCLES_NO_EXECUTE */
INTEL_EVENT_CONSTRAINT(0x04a3, 0xf),
+
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+
EVENT_CONSTRAINT_END
};

@@ -2584,6 +2593,27 @@ static __init void intel_nehalem_quirk(void)
}
}

+/*
+ * enable software workaround for errata:
+ * SNB: BJ122
+ * IVB: BV98
+ * HSW: HSD29
+ *
+ * Only needed when HT is enabled. However detecting
+ * this is too difficult and model specific so we enable
+ * it even with HT off for now.
+ */
+static __init void intel_ht_bug(void)
+{
+ x86_pmu.flags |= PMU_FL_EXCL_CNTRS;
+
+ x86_pmu.commit_scheduling = intel_commit_scheduling;
+ x86_pmu.start_scheduling = intel_start_scheduling;
+ x86_pmu.stop_scheduling = intel_stop_scheduling;
+
+ pr_info("CPU erratum BJ122, BV98, HSD29 worked around\n");
+}
+
EVENT_ATTR_STR(mem-loads, mem_ld_hsw, "event=0xcd,umask=0x1,ldlat=3");
EVENT_ATTR_STR(mem-stores, mem_st_hsw, "event=0xd0,umask=0x82")

@@ -2796,6 +2826,7 @@ __init int intel_pmu_init(void)
case 42: /* SandyBridge */
case 45: /* SandyBridge, "Romely-EP" */
x86_add_quirk(intel_sandybridge_quirk);
+ x86_add_quirk(intel_ht_bug);
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs,
@@ -2810,6 +2841,8 @@ __init int intel_pmu_init(void)
x86_pmu.extra_regs = intel_snbep_extra_regs;
else
x86_pmu.extra_regs = intel_snb_extra_regs;
+
+
/* all extra regs are per-cpu when HT is on */
x86_pmu.flags |= PMU_FL_HAS_RSP_1;
x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -2827,6 +2860,7 @@ __init int intel_pmu_init(void)
break;
case 58: /* IvyBridge */
case 62: /* IvyBridge EP */
+ x86_add_quirk(intel_ht_bug);
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs,
@@ -2860,6 +2894,7 @@ __init int intel_pmu_init(void)
case 71:
case 63:
case 69:
+ x86_add_quirk(intel_ht_bug);
x86_pmu.late_ack = true;
memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
--
1.7.9.5

2014-06-04 21:44:24

by Stephane Eranian

Subject: [PATCH 7/9] perf/x86: enforce HT bug workaround with PEBS for SNB/IVB/HSW

From: Maria Dimakopoulou <[email protected]>

This patch modifies the PEBS constraint tables for SNB/IVB/HSW
such that corrupting events supporting PEBS activate the HT
workaround.

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 40 ++++++++++++++---------------
1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 980970c..b4c6ca5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -630,10 +630,10 @@ struct event_constraint intel_snb_pebs_event_constraints[] = {
INTEL_EVENT_CONSTRAINT(0xc5, 0xf), /* BR_MISP_RETIRED.* */
INTEL_PLD_CONSTRAINT(0x01cd, 0x8), /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */
- INTEL_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
INTEL_UEVENT_CONSTRAINT(0x02d4, 0xf), /* MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS */
EVENT_CONSTRAINT_END
};
@@ -646,10 +646,10 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
INTEL_EVENT_CONSTRAINT(0xc5, 0xf), /* BR_MISP_RETIRED.* */
INTEL_PLD_CONSTRAINT(0x01cd, 0x8), /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
INTEL_PST_CONSTRAINT(0x02cd, 0x8), /* MEM_TRANS_RETIRED.PRECISE_STORES */
- INTEL_EVENT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
- INTEL_EVENT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf), /* MEM_UOP_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf), /* MEM_LOAD_UOPS_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf), /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
+ INTEL_EXCLEVT_CONSTRAINT(0xd3, 0xf), /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
EVENT_CONSTRAINT_END
};

@@ -665,24 +665,24 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
/* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
INTEL_UEVENT_CONSTRAINT(0x11d0, 0xf),
/* MEM_UOPS_RETIRED.STLB_MISS_STORES */
- INTEL_UEVENT_CONSTRAINT(0x12d0, 0xf),
- INTEL_UEVENT_CONSTRAINT(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
- INTEL_UEVENT_CONSTRAINT(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
+ INTEL_EXCLUEVT_CONSTRAINT(0x12d0, 0xf),
+ INTEL_EXCLUEVT_CONSTRAINT(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
+ INTEL_EXCLUEVT_CONSTRAINT(0x41d0, 0xf), /* MEM_UOPS_RETIRED.SPLIT_LOADS */
/* MEM_UOPS_RETIRED.SPLIT_STORES */
- INTEL_UEVENT_CONSTRAINT(0x42d0, 0xf),
- INTEL_UEVENT_CONSTRAINT(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
+ INTEL_EXCLUEVT_CONSTRAINT(0x42d0, 0xf),
+ INTEL_EXCLUEVT_CONSTRAINT(0x81d0, 0xf), /* MEM_UOPS_RETIRED.ALL_LOADS */
INTEL_PST_HSW_CONSTRAINT(0x82d0, 0xf), /* MEM_UOPS_RETIRED.ALL_STORES */
- INTEL_UEVENT_CONSTRAINT(0x01d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L1_HIT */
- INTEL_UEVENT_CONSTRAINT(0x02d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L2_HIT */
- INTEL_UEVENT_CONSTRAINT(0x04d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L3_HIT */
+ INTEL_EXCLUEVT_CONSTRAINT(0x01d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L1_HIT */
+ INTEL_EXCLUEVT_CONSTRAINT(0x02d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L2_HIT */
+ INTEL_EXCLUEVT_CONSTRAINT(0x04d1, 0xf), /* MEM_LOAD_UOPS_RETIRED.L3_HIT */
/* MEM_LOAD_UOPS_RETIRED.HIT_LFB */
- INTEL_UEVENT_CONSTRAINT(0x40d1, 0xf),
+ INTEL_EXCLUEVT_CONSTRAINT(0x40d1, 0xf),
/* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS */
- INTEL_UEVENT_CONSTRAINT(0x01d2, 0xf),
+ INTEL_EXCLUEVT_CONSTRAINT(0x01d2, 0xf),
/* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT */
- INTEL_UEVENT_CONSTRAINT(0x02d2, 0xf),
+ INTEL_EXCLUEVT_CONSTRAINT(0x02d2, 0xf),
/* MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM */
- INTEL_UEVENT_CONSTRAINT(0x01d3, 0xf),
+ INTEL_EXCLUEVT_CONSTRAINT(0x01d3, 0xf),
INTEL_UEVENT_CONSTRAINT(0x04c8, 0xf), /* HLE_RETIRED.Abort */
INTEL_UEVENT_CONSTRAINT(0x04c9, 0xf), /* RTM_RETIRED.Abort */

--
1.7.9.5

2014-06-04 21:44:34

by Stephane Eranian

Subject: [PATCH 9/9] perf/x86: add sysfs entry to disable HT bug workaround

From: Maria Dimakopoulou <[email protected]>

This patch adds a sysfs entry:

/sys/devices/cpu/ht_bug_workaround

to activate/deactivate the PMU HT bug workaround.

To activate (activated by default):
# echo 1 > /sys/devices/cpu/ht_bug_workaround

To deactivate:
# echo 0 > /sys/devices/cpu/ht_bug_workaround

The change takes effect only once there are no more
active events.

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 31 +++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event.h | 5 +++++
arch/x86/kernel/cpu/perf_event_intel.c | 4 ++--
3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 314458a..fdea88e 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1876,10 +1876,41 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
return count;
}

+static ssize_t get_attr_xsu(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ int ff = is_ht_workaround_enabled();
+ return snprintf(buf, 40, "%d\n", ff);
+}
+
+static ssize_t set_attr_xsu(struct device *cdev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long val;
+ int ff = is_ht_workaround_enabled();
+ ssize_t ret;
+
+ ret = kstrtoul(buf, 0, &val);
+ if (ret)
+ return ret;
+
+ if (!!val != ff) {
+ if (!val)
+ x86_pmu.flags &= ~PMU_FL_EXCL_ENABLED;
+ else
+ x86_pmu.flags |= PMU_FL_EXCL_ENABLED;
+ }
+ return count;
+}
+
static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
+static DEVICE_ATTR(ht_bug_workaround, S_IRUSR | S_IWUSR, get_attr_xsu, set_attr_xsu);

static struct attribute *x86_pmu_attrs[] = {
&dev_attr_rdpmc.attr,
+ &dev_attr_ht_bug_workaround.attr,
NULL,
};

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index c61ca4a..2e7c6a7 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -537,6 +537,7 @@ do { \
#define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
#define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */
#define PMU_FL_EXCL_CNTRS 0x4 /* has exclusive counter requirements */
+#define PMU_FL_EXCL_ENABLED 0x8 /* exclusive counter active */

#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
@@ -769,6 +770,10 @@ int knc_pmu_init(void);
ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
char *page);

+static inline int is_ht_workaround_enabled(void)
+{
+ return !!(x86_pmu.flags & PMU_FL_EXCL_ENABLED);
+}
#else /* CONFIG_CPU_SUP_INTEL */

static inline void reserve_ds_buffers(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 4fb4fe6..7040c41 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1747,7 +1747,7 @@ intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
* validating a group does not require
* enforcing cross-thread exclusion
*/
- if (cpuc->is_fake)
+ if (cpuc->is_fake || !is_ht_workaround_enabled())
return c;

/*
@@ -2610,7 +2610,7 @@ static __init void intel_nehalem_quirk(void)
*/
static __init void intel_ht_bug(void)
{
- x86_pmu.flags |= PMU_FL_EXCL_CNTRS;
+ x86_pmu.flags |= PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED;

x86_pmu.commit_scheduling = intel_commit_scheduling;
x86_pmu.start_scheduling = intel_start_scheduling;
--
1.7.9.5

2014-06-04 21:44:28

by Stephane Eranian

Subject: [PATCH 8/9] perf/x86: fix intel_get_event_constraints() for dynamic constraints

With dynamic constraints, we need to restart from the static
constraints each time intel_get_event_constraints() is called.

Reviewed-by: Maria Dimakopoulou <[email protected]>
Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index b1a2684..4fb4fe6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1849,20 +1849,25 @@ intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
static struct event_constraint *
intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
{
- struct event_constraint *c = event->hw.constraint;
+ struct event_constraint *c1 = event->hw.constraint;
+ struct event_constraint *c2;

/*
* first time only
* - static constraint: no change across incremental scheduling calls
* - dynamic constraint: handled by intel_get_excl_constraints()
*/
- if (!c)
- c = __intel_get_event_constraints(cpuc, event);
+ c2 = __intel_get_event_constraints(cpuc, event);
+ if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
+ bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
+ c1->weight = c2->weight;
+ c2 = c1;
+ }

if (cpuc->excl_cntrs)
- return intel_get_excl_constraints(cpuc, event, c);
+ return intel_get_excl_constraints(cpuc, event, c2);

- return c;
+ return c2;
}

static void intel_put_excl_constraints(struct cpu_hw_events *cpuc,
--
1.7.9.5

2014-06-04 21:44:16

by Stephane Eranian

Subject: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

From: Maria Dimakopoulou <[email protected]>

This patch implements a software workaround for a HW erratum
on Intel SandyBridge, IvyBridge and Haswell processors
with Hyperthreading enabled. The errata are documented for
each processor in their respective specification update
documents:
- SandyBridge: BJ122
- IvyBridge: BV98
- Haswell: HSD29

The bug causes silent counter corruption across hyperthreads only
when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3).
Counters measuring those events may leak counts to the sibling
counter. For instance, counter 0, thread 0 measuring event 0xd0,
may leak to counter 0, thread 1, regardless of the event measured
there. The size of the leak is not predictable. It all depends on
the workload and the state of each sibling hyper-thread. The
corrupting events do undercount as a consequence of the leak. The
leak is compensated automatically only when the sibling counter measures
the exact same corrupting event AND the workload on the two threads
is the same. Given there is no way to guarantee this, a workaround
is necessary. Furthermore, there is a serious problem if the leaked count
is added to a low-occurrence event. In that case the corruption on
the low-occurrence event can be very large, e.g., orders of magnitude.

There is no HW or FW workaround for this problem.

The bug is very easy to reproduce on a loaded system.
Here is an example on a Haswell client, where CPU0 and CPU4
are siblings. We load the CPUs with a simple triad app
streaming a large floating-point vector. We use the 0x81d0
corrupting event (MEM_UOPS_RETIRED:ALL_LOADS) and
0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given we are not
using the LBR, the 0x20cc event should be zero.

$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -e r20cc sleep 10
Performance counter stats for 'system wide':
139 277 291 r20cc
10,000969126 seconds time elapsed

In this example, 0x81d0 and r20cc are using sibling counters
on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and corrupts it
from 0 to 139 million occurrences.

This patch provides a software workaround to this problem by modifying the
way events are scheduled onto counters by the kernel. The patch forces
cross-thread mutual exclusion between counters in case a corrupting event
is measured by one of the hyper-threads. If thread 0, counter 0 is measuring
event 0xd0, then nothing can be measured on counter 0, thread 1. If no corrupting
event is measured on any hyper-thread, event scheduling proceeds as before.

The same example run with the workaround enabled yields the correct answer:
$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -e r20cc sleep 10
Performance counter stats for 'system wide':
0 r20cc
10,000969126 seconds time elapsed

The patch does provide correctness for all non-corrupting events. It does not
"repatriate" the leaked counts back to the leaking counter. This is planned
for a second patch series. This patch series makes this repatriation
easier by guaranteeing that the sibling counter is not measuring any useful event.

The patch introduces dynamic constraints for events. That means that events which
did not have constraints, i.e., could be measured on any counters, may now be
constrained to a subset of the counters depending on what is going on on the sibling
thread. The algorithm is similar to a cache coherency protocol. We call it XSU
in reference to Exclusive, Shared, Unused, the 3 possible states of a PMU
counter.

As a consequence of the workaround, users may see an increased amount of event
multiplexing, even in situations where fewer events than counters are being
measured on a CPU.

The patches have been tested on all three impacted processors. Note that when
HT is off, there is no corruption. However, the workaround is still enabled,
though it does not cost much. Adding dynamic detection of whether HT is on
turned out to be complex and to require too much code to be justified.

This patch addresses the issue when PEBS is not used. A subsequent patch
fixes the problem when PEBS is used.

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 31 ++--
arch/x86/kernel/cpu/perf_event.h | 6 +
arch/x86/kernel/cpu/perf_event_intel.c | 311 +++++++++++++++++++++++++++++++-
3 files changed, 335 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 858a72a..314458a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -728,7 +728,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
struct event_constraint *c;
unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
struct perf_event *e;
- int i, wmin, wmax, num = 0;
+ int i, wmin, wmax, unsched = 0;
struct hw_perf_event *hwc;

bitmap_zero(used_mask, X86_PMC_IDX_MAX);
@@ -771,14 +771,20 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)

/* slow path */
if (i != n)
- num = perf_assign_events(cpuc->event_list, n, wmin,
- wmax, assign);
+ unsched = perf_assign_events(cpuc->event_list, n, wmin,
+ wmax, assign);

/*
- * Mark the event as committed, so we do not put_constraint()
- * in case new events are added and fail scheduling.
+ * In case of success (unsched = 0), mark events as committed,
+ * so we do not put_constraint() in case new events are added
+ * and fail to be scheduled
+ *
+ * We invoke the lower level commit callback to lock the resource
+ *
+ * We do not need to do all of this in case we are called to
+ * validate an event group (assign == NULL)
*/
- if (!num && assign) {
+ if (!unsched && assign) {
for (i = 0; i < n; i++) {
e = cpuc->event_list[i];
e->hw.flags |= PERF_X86_EVENT_COMMITTED;
@@ -786,11 +792,9 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
x86_pmu.commit_scheduling(cpuc, e, assign[i]);
}
}
- /*
- * scheduling failed or is just a simulation,
- * free resources if necessary
- */
- if (!assign || num) {
+
+ if (!assign || unsched) {
+
for (i = 0; i < n; i++) {
e = cpuc->event_list[i];
/*
@@ -800,6 +804,9 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if ((e->hw.flags & PERF_X86_EVENT_COMMITTED))
continue;

+ /*
+ * release events that failed scheduling
+ */
if (x86_pmu.put_event_constraints)
x86_pmu.put_event_constraints(cpuc, e);
}
@@ -808,7 +815,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (x86_pmu.stop_scheduling)
x86_pmu.stop_scheduling(cpuc);

- return num ? -EINVAL : 0;
+ return unsched ? -EINVAL : 0;
}

/*
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 5da0a2b..c61ca4a 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -70,6 +70,7 @@ struct event_constraint {
#define PERF_X86_EVENT_PEBS_ST_HSW 0x04 /* haswell style st data sampling */
#define PERF_X86_EVENT_COMMITTED 0x08 /* event passed commit_txn */
#define PERF_X86_EVENT_EXCL 0x10 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC 0x20 /* dynamic alloc'd constraint */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -129,6 +130,7 @@ enum intel_excl_state_type {
struct intel_excl_states {
enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
enum intel_excl_state_type state[X86_PMC_IDX_MAX];
+ bool sched_started; /* true if scheduling has started */
};

struct intel_excl_cntrs {
@@ -288,6 +290,10 @@ struct cpu_hw_events {
#define INTEL_UEVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK)

+#define INTEL_EXCLUEVT_CONSTRAINT(c, n) \
+ __EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
+ HWEIGHT(n), 0, PERF_X86_EVENT_EXCL)
+
#define INTEL_PLD_CONSTRAINT(c, n) \
__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LDLAT)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 380fce2..3af9745 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1632,7 +1632,7 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
}

static struct event_constraint *
-intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+__intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
{
struct event_constraint *c;

@@ -1652,6 +1652,256 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event
}

static void
+intel_start_scheduling(struct cpu_hw_events *cpuc)
+{
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct intel_excl_states *xl, *xlo;
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid; /* sibling thread */
+
+ /*
+ * nothing needed if in group validation mode
+ */
+ if (cpuc->is_fake)
+ return;
+ /*
+ * no exclusion needed
+ */
+ if (!excl_cntrs)
+ return;
+
+ xlo = &excl_cntrs->states[o_tid];
+ xl = &excl_cntrs->states[tid];
+
+ xl->sched_started = true;
+
+ /*
+ * lock shared state until we are done scheduling
+ * in stop_event_scheduling()
+ * makes scheduling appear as a transaction
+ */
+ spin_lock_irqsave(&excl_cntrs->lock, excl_cntrs->lock_flags);
+
+ /*
+ * save initial state of sibling thread
+ */
+ memcpy(xlo->init_state, xlo->state, sizeof(xlo->init_state));
+}
+
+static void
+intel_stop_scheduling(struct cpu_hw_events *cpuc)
+{
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct intel_excl_states *xl, *xlo;
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid; /* sibling thread */
+ int i;
+
+ /*
+ * nothing needed if in group validation mode
+ */
+ if (cpuc->is_fake)
+ return;
+ /*
+ * no exclusion needed
+ */
+ if (!excl_cntrs)
+ return;
+
+ xlo = &excl_cntrs->states[o_tid];
+ xl = &excl_cntrs->states[tid];
+
+ /*
+ * make new sibling thread state visible
+ */
+ memcpy(xlo->state, xlo->init_state, sizeof(xlo->state));
+
+ xl->sched_started = false;
+ /*
+ * release shared state lock (lock acquire in intel_start_scheduling())
+ */
+ spin_unlock_irqrestore(&excl_cntrs->lock, excl_cntrs->lock_flags);
+}
+
+static struct event_constraint *
+intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
+ struct event_constraint *c)
+{
+ struct event_constraint *cx;
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct intel_excl_states *xl, *xlo;
+ int is_excl, i;
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid; /* alternate */
+
+ /*
+ * validating a group does not require
+ * enforcing cross-thread exclusion
+ */
+ if (cpuc->is_fake)
+ return c;
+
+ /*
+ * event requires exclusive counter access
+ * across HT threads
+ */
+ is_excl = c->flags & PERF_X86_EVENT_EXCL;
+
+ /*
+ * xl = state of current HT
+ * xlo = state of sibling HT
+ */
+ xl = &excl_cntrs->states[tid];
+ xlo = &excl_cntrs->states[o_tid];
+
+ cx = c;
+
+ /*
+ * because we modify the constraint, we need
+ * to make a copy. Static constraints come
+ * from static const tables.
+ *
+ * only needed when constraint has not yet
+ * been cloned (marked dynamic)
+ */
+ if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
+
+ /*
+ * in case we fail, we assume no counter
+ * is supported to be on the safe side
+ */
+ cx = kmalloc(sizeof(*cx), GFP_KERNEL);
+ if (!cx)
+ return &emptyconstraint;
+
+ /*
+ * initialize dynamic constraint
+ * with static constraint
+ */
+ memcpy(cx, c, sizeof(*cx));
+
+ /*
+ * mark constraint as dynamic, so we
+ * can free it later on
+ */
+ cx->flags |= PERF_X86_EVENT_DYNAMIC;
+ }
+
+ /*
+ * From here on, the constraint is dynamic.
+ * Either it was just allocated above, or it
+ * was allocated during a earlier invocation
+ * of this function
+ */
+
+ /*
+ * Modify static constraint with current dynamic
+ * state of thread
+ *
+ * EXCLUSIVE: sibling counter measuring exclusive event
+ * SHARED : sibling counter measuring non-exclusive event
+ * UNUSED : sibling counter unused
+ */
+ for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
+ /*
+ * exclusive event in sibling counter
+ * our corresponding counter cannot be used
+ * regardless of our event
+ */
+ if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
+ __clear_bit(i, cx->idxmsk);
+ /*
+ * if measuring an exclusive event, sibling
+ * measuring non-exclusive, then counter cannot
+ * be used
+ */
+ if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
+ __clear_bit(i, cx->idxmsk);
+ }
+
+ /*
+ * recompute actual bit weight for scheduling algorithm
+ */
+ cx->weight = hweight64(cx->idxmsk64);
+
+ /*
+ * if we return an empty mask, then switch
+ * back to static empty constraint to avoid
+ * the cost of freeing later on
+ */
+ if (cx->weight == 0) {
+ kfree(cx);
+ cx = &emptyconstraint;
+ }
+
+ return cx;
+}
+
+static struct event_constraint *
+intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
+{
+ struct event_constraint *c = event->hw.constraint;
+
+ /*
+ * first time only
+ * - static constraint: no change across incremental scheduling calls
+ * - dynamic constraint: handled by intel_get_excl_constraints()
+ */
+ if (!c)
+ c = __intel_get_event_constraints(cpuc, event);
+
+ if (cpuc->excl_cntrs)
+ return intel_get_excl_constraints(cpuc, event, c);
+
+ return c;
+}
+
+static void intel_put_excl_constraints(struct cpu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct intel_excl_states *xlo, *xl;
+ unsigned long flags = 0; /* keep compiler happy */
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid;
+ int i;
+
+ /*
+ * nothing needed if in group validation mode
+ */
+ if (cpuc->is_fake)
+ return;
+
+ WARN_ON_ONCE(!excl_cntrs);
+
+ if (!excl_cntrs)
+ return;
+
+ xl = &excl_cntrs->states[tid];
+ xlo = &excl_cntrs->states[o_tid];
+
+ /*
+ * put_constraint may be called from x86_schedule_events()
+ * which already has the lock held so here make locking
+ * conditional
+ */
+ if (!xl->sched_started)
+ spin_lock_irqsave(&excl_cntrs->lock, flags);
+
+ /*
+ * if event was actually assigned, then mark the
+ * counter state as unused now
+ */
+ if (hwc->idx >= 0)
+ xlo->state[hwc->idx] = INTEL_EXCL_UNUSED;
+
+ if (!xl->sched_started)
+ spin_unlock_irqrestore(&excl_cntrs->lock, flags);
+
+}
+
+static void
intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
struct perf_event *event)
{
@@ -1669,7 +1919,59 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
struct perf_event *event)
{
+ struct event_constraint *c = event->hw.constraint;
+
intel_put_shared_regs_event_constraints(cpuc, event);
+
+ /*
+ * is PMU has exclusive counter restrictions, then
+ * all events are subject to and must call the
+ * put_excl_constraints() routine
+ */
+ if (c && cpuc->excl_cntrs)
+ intel_put_excl_constraints(cpuc, event);
+
+ /* free dynamic constraint */
+ if (c && (c->flags & PERF_X86_EVENT_DYNAMIC)) {
+ kfree(event->hw.constraint);
+ event->hw.constraint = NULL;
+ }
+}
+
+static void intel_commit_scheduling(struct cpu_hw_events *cpuc,
+ struct perf_event *event, int cntr)
+{
+ struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+ struct event_constraint *c = event->hw.constraint;
+ struct intel_excl_states *xlo, *xl;
+ int tid = cpuc->excl_thread_id;
+ int o_tid = 1 - tid;
+ int is_excl;
+
+ if (cpuc->is_fake || !c)
+ return;
+
+ is_excl = c->flags & PERF_X86_EVENT_EXCL;
+
+ if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
+ return;
+
+ WARN_ON_ONCE(!excl_cntrs);
+
+ if (!excl_cntrs)
+ return;
+
+ xl = &excl_cntrs->states[tid];
+ xlo = &excl_cntrs->states[o_tid];
+
+ WARN_ON_ONCE(!spin_is_locked(&excl_cntrs->lock));
+
+ if (cntr >= 0) {
+ if (is_excl)
+ xlo->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
+ else
+ xlo->init_state[cntr] = INTEL_EXCL_SHARED;
+ }
}

static void intel_pebs_aliases_core2(struct perf_event *event)
@@ -2087,6 +2389,13 @@ static void intel_pmu_cpu_dying(int cpu)
cpuc->excl_cntrs = NULL;
}

+ c = cpuc->excl_cntrs;
+ if (c) {
+ if (c->core_id == -1 || --c->refcnt == 0)
+ kfree(c);
+ cpuc->excl_cntrs = NULL;
+ }
+
fini_debug_store_on_cpu(cpu);
}

--
1.7.9.5

2014-06-04 21:44:11

by Stephane Eranian

Subject: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

From: Maria Dimakopoulou <[email protected]>

This patch adds a new shared_regs style structure to the
per-cpu x86 state (cpuc). It is used to coordinate access
between counters which must be used with exclusion across
HyperThreads on Intel processors. This new struct is not
needed on each PMU, thus it is allocated on demand.
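
To give an idea of how the next patch drives this structure, a sketch of one
scheduling pass (the actual code is in patch 5/9):

	/*
	 * intel_start_scheduling()     takes excl_cntrs->lock and snapshots the
	 *                              sibling-visible state[] into init_state[]
	 * intel_get_excl_constraints() filters each event's counter mask against
	 *                              the state[] the sibling published
	 * intel_commit_scheduling()    records EXCLUSIVE or SHARED in
	 *                              init_state[cntr] for each committed event
	 * intel_stop_scheduling()      publishes init_state[] back to state[]
	 *                              and drops the lock
	 *
	 * Keeping the two arrays lets a scheduling pass work on a scratch copy
	 * and publish its result only once the pass is complete.
	 */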

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: Maria Dimakopoulou <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 40 ++++++++++++++++++--
arch/x86/kernel/cpu/perf_event_intel.c | 63 +++++++++++++++++++++++++++++---
2 files changed, 94 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 413799f..5da0a2b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -65,10 +65,11 @@ struct event_constraint {
/*
* struct hw_perf_event.flags flags
*/
-#define PERF_X86_EVENT_PEBS_LDLAT 0x1 /* ld+ldlat data address sampling */
-#define PERF_X86_EVENT_PEBS_ST 0x2 /* st data address sampling */
-#define PERF_X86_EVENT_PEBS_ST_HSW 0x4 /* haswell style st data sampling */
-#define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
+#define PERF_X86_EVENT_PEBS_LDLAT 0x01 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST 0x02 /* st data address sampling */
+#define PERF_X86_EVENT_PEBS_ST_HSW 0x04 /* haswell style st data sampling */
+#define PERF_X86_EVENT_COMMITTED 0x08 /* event passed commit_txn */
+#define PERF_X86_EVENT_EXCL 0x10 /* HT exclusivity on counter */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -119,6 +120,27 @@ struct intel_shared_regs {
unsigned core_id; /* per-core: core id */
};

+enum intel_excl_state_type {
+ INTEL_EXCL_UNUSED = 0, /* counter is unused */
+ INTEL_EXCL_SHARED = 1, /* counter can be used by both threads */
+ INTEL_EXCL_EXCLUSIVE = 2, /* counter can be used by one thread only */
+};
+
+struct intel_excl_states {
+ enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
+ enum intel_excl_state_type state[X86_PMC_IDX_MAX];
+};
+
+struct intel_excl_cntrs {
+ spinlock_t lock;
+ unsigned long lock_flags;
+
+ struct intel_excl_states states[2];
+
+ int refcnt; /* per-core: #HT threads */
+ unsigned core_id; /* per-core: core id */
+};
+
#define MAX_LBR_ENTRIES 16

enum {
@@ -181,6 +203,11 @@ struct cpu_hw_events {
* used on Intel NHM/WSM/SNB
*/
struct intel_shared_regs *shared_regs;
+ /*
+ * manage exclusive counter access between hyperthread
+ */
+ struct intel_excl_cntrs *excl_cntrs;
+ int excl_thread_id; /* 0 or 1 */

/*
* AMD specific bits
@@ -204,6 +231,10 @@ struct cpu_hw_events {
#define EVENT_CONSTRAINT(c, n, m) \
__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)

+#define INTEL_EXCLEVT_CONSTRAINT(c, n) \
+ __EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT, HWEIGHT(n),\
+ 0, PERF_X86_EVENT_EXCL)
+
/*
* The overlap flag marks event constraints with overlapping counter
* masks. This is the case if the counter mask of such an event is not
@@ -499,6 +530,7 @@ do { \
*/
#define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
#define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */
+#define PMU_FL_EXCL_CNTRS 0x4 /* has exclusive counter requirements */

#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index e913e46..380fce2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1970,16 +1970,46 @@ struct intel_shared_regs *allocate_shared_regs(int cpu)
return regs;
}

+struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
+{
+ struct intel_excl_cntrs *c;
+ int i;
+
+ c = kzalloc_node(sizeof(struct intel_excl_cntrs),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (c) {
+ spin_lock_init(&c->lock);
+ for (i = 0; i < X86_PMC_IDX_MAX; i++) {
+ c->states[0].state[i] = INTEL_EXCL_UNUSED;
+ c->states[0].init_state[i] = INTEL_EXCL_UNUSED;
+
+ c->states[1].state[i] = INTEL_EXCL_UNUSED;
+ c->states[1].init_state[i] = INTEL_EXCL_UNUSED;
+ }
+ c->core_id = -1;
+ }
+ return c;
+}
+
static int intel_pmu_cpu_prepare(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);

- if (!(x86_pmu.extra_regs || x86_pmu.lbr_sel_map))
- return NOTIFY_OK;
+ if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
+ cpuc->shared_regs = allocate_shared_regs(cpu);
+ if (!cpuc->shared_regs)
+ return NOTIFY_BAD;
+ }

- cpuc->shared_regs = allocate_shared_regs(cpu);
- if (!cpuc->shared_regs)
- return NOTIFY_BAD;
+ if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
+ cpuc->excl_cntrs = allocate_excl_cntrs(cpu);
+ if (!cpuc->excl_cntrs) {
+ if (cpuc->shared_regs)
+ kfree(cpuc->shared_regs);
+ return NOTIFY_BAD;
+ }
+ cpuc->excl_thread_id = 0;
+ }

return NOTIFY_OK;
}
@@ -2020,12 +2050,29 @@ static void intel_pmu_cpu_starting(int cpu)

if (x86_pmu.lbr_sel_map)
cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
+
+ if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
+ for_each_cpu(i, topology_thread_cpumask(cpu)) {
+ struct intel_excl_cntrs *c;
+
+ c = per_cpu(cpu_hw_events, i).excl_cntrs;
+ if (c && c->core_id == core_id) {
+ cpuc->kfree_on_online[1] = cpuc->excl_cntrs;
+ cpuc->excl_cntrs = c;
+ cpuc->excl_thread_id = 1;
+ break;
+ }
+ }
+ cpuc->excl_cntrs->core_id = core_id;
+ cpuc->excl_cntrs->refcnt++;
+ }
}

static void intel_pmu_cpu_dying(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
struct intel_shared_regs *pc;
+ struct intel_excl_cntrs *c;

pc = cpuc->shared_regs;
if (pc) {
@@ -2033,6 +2080,12 @@ static void intel_pmu_cpu_dying(int cpu)
kfree(pc);
cpuc->shared_regs = NULL;
}
+ c = cpuc->excl_cntrs;
+ if (c) {
+ if (c->core_id == -1 || --c->refcnt == 0)
+ kfree(c);
+ cpuc->excl_cntrs = NULL;
+ }

fini_debug_store_on_cpu(cpu);
}
--
1.7.9.5

2014-06-04 22:29:04

by Andi Kleen

Subject: Re: [PATCH 0/9] perf/x86: implement HT counter corruption workaround

> There is no HW or FW workaround for this problem.

Actually there is for global measurements:

measure with the ANY bit set and divide by two.

Please add a check that the ANY bit is set, and disable
the workaround for that case.

-Andi

2014-06-05 07:47:58

by Peter Zijlstra

Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

On Wed, Jun 04, 2014 at 11:34:13PM +0200, Stephane Eranian wrote:
> From: Maria Dimakopoulou <[email protected]>
>
> This patch adds a new shared_regs style structure to the
> per-cpu x86 state (cpuc). It is used to coordinate access
> between counters which must be used with exclusion across
> HyperThreads on Intel processors. This new struct is not
> needed on each PMU, thus is is allocated on demand.
>
> Reviewed-by: Stephane Eranian <[email protected]>
> Signed-off-by: Maria Dimakopoulou <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event.h | 40 ++++++++++++++++++--
> arch/x86/kernel/cpu/perf_event_intel.c | 63 +++++++++++++++++++++++++++++---
> 2 files changed, 94 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 413799f..5da0a2b 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -65,10 +65,11 @@ struct event_constraint {
> /*
> * struct hw_perf_event.flags flags
> */
> -#define PERF_X86_EVENT_PEBS_LDLAT 0x1 /* ld+ldlat data address sampling */
> -#define PERF_X86_EVENT_PEBS_ST 0x2 /* st data address sampling */
> -#define PERF_X86_EVENT_PEBS_ST_HSW 0x4 /* haswell style st data sampling */
> -#define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
> +#define PERF_X86_EVENT_PEBS_LDLAT 0x01 /* ld+ldlat data address sampling */
> +#define PERF_X86_EVENT_PEBS_ST 0x02 /* st data address sampling */
> +#define PERF_X86_EVENT_PEBS_ST_HSW 0x04 /* haswell style st data sampling */
> +#define PERF_X86_EVENT_COMMITTED 0x08 /* event passed commit_txn */
> +#define PERF_X86_EVENT_EXCL 0x10 /* HT exclusivity on counter */
>
> struct amd_nb {
> int nb_id; /* NorthBridge id */
> @@ -119,6 +120,27 @@ struct intel_shared_regs {
> unsigned core_id; /* per-core: core id */
> };
>
> +enum intel_excl_state_type {
> + INTEL_EXCL_UNUSED = 0, /* counter is unused */
> + INTEL_EXCL_SHARED = 1, /* counter can be used by both threads */
> + INTEL_EXCL_EXCLUSIVE = 2, /* counter can be used by one thread only */
> +};
> +
> +struct intel_excl_states {
> + enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
> + enum intel_excl_state_type state[X86_PMC_IDX_MAX];
> +};
> +
> +struct intel_excl_cntrs {
> + spinlock_t lock;
> + unsigned long lock_flags;
> +
> + struct intel_excl_states states[2];
> +
> + int refcnt; /* per-core: #HT threads */
> + unsigned core_id; /* per-core: core id */
> +};

This must be a raw_spinlock_t; it's taken from pmu::add() which is
called under perf_event_context::lock, which is a raw_spinlock_t, as it's
taken under rq::lock, which too is a raw_spinlock_t.

I should really get around to fixing these errors and include the
lockdep infrastructure for this.



2014-06-05 08:04:28

by Peter Zijlstra

Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

On Wed, Jun 04, 2014 at 11:34:13PM +0200, Stephane Eranian wrote:
> @@ -499,6 +530,7 @@ do { \
> */
> #define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
> #define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */
> +#define PMU_FL_EXCL_CNTRS 0x4 /* has exclusive counter requirements */

the EXCL thing is HT-specific too, right? How about HT_EXCL or so?




2014-06-05 08:29:45

by Peter Zijlstra

Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

On Wed, Jun 04, 2014 at 11:34:13PM +0200, Stephane Eranian wrote:
> @@ -2020,12 +2050,29 @@ static void intel_pmu_cpu_starting(int cpu)
>
> if (x86_pmu.lbr_sel_map)
> cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
> +
> + if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
> + for_each_cpu(i, topology_thread_cpumask(cpu)) {
> + struct intel_excl_cntrs *c;
> +
> + c = per_cpu(cpu_hw_events, i).excl_cntrs;
> + if (c && c->core_id == core_id) {
> + cpuc->kfree_on_online[1] = cpuc->excl_cntrs;
> + cpuc->excl_cntrs = c;
> + cpuc->excl_thread_id = 1;
> + break;
> + }
> + }
> + cpuc->excl_cntrs->core_id = core_id;
> + cpuc->excl_cntrs->refcnt++;
> + }
> }

This hard-assumes there's only ever 2 threads, which is true, and I
suppose more in arch/x86 will come apart the moment Intel makes a chip
with more; still, do we have topology_thread_id() or so to cure this?
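
A hedged sketch of how the hard-coded second-thread index could be derived
from the sibling mask instead, without assuming any new topology helper
(variable names follow the quoted hunk; illustrative only):

	/* position of this CPU within its sibling mask = SMT thread index */
	int idx = 0;

	for_each_cpu(i, topology_thread_cpumask(cpu)) {
		if (i == cpu)
			break;
		idx++;
	}
	cpuc->excl_thread_id = idx;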



2014-06-05 08:32:31

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On 4 June 2014 22:34, Stephane Eranian <[email protected]> wrote:
> From: Maria Dimakopoulou <[email protected]>
>
> This patch adds a sysfs entry:
>
> /sys/devices/cpu/ht_bug_workaround
>
> to activate/deactivate the PMU HT bug workaround.
>
> To activate (activated by default):
> # echo 1 > /sys/devices/cpu/ht_bug_workaround
>
> To deactivate:
> # echo 0 > /sys/devices/cpu/ht_bug_workaround

If your hardware contains this erratum, why would you ever want to
disable the workaround? Providing the user with the option of turning
this off seems like a bad idea.

I suspect that users will rarely know whether they can legitimately
disable this.

2014-06-05 09:29:36

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 10:32 AM, Matt Fleming <[email protected]> wrote:
> On 4 June 2014 22:34, Stephane Eranian <[email protected]> wrote:
>> From: Maria Dimakopoulou <[email protected]>
>>
>> This patch adds a sysfs entry:
>>
>> /sys/devices/cpu/ht_bug_workaround
>>
>> to activate/deactivate the PMU HT bug workaround.
>>
>> To activate (activated by default):
>> # echo 1 > /sys/devices/cpu/ht_bug_workaround
>>
>> To deactivate:
>> # echo 0 > /sys/devices/cpu/ht_bug_workaround
>
> If your hardware contains this erratum, why would you ever want to
> disable the workaround? Providing the user with the option of turning
> this off seems like a bad idea.
>
> I suspect that users will rarely know whether they can legitimately
> disable this.

If you know what you are doing (power user), then there are measurements
which work fine with the HT erratum. This is why we have the option.

For instance, if you only measure events 4x4 in system-wide mode
and you know which counters these events are going to use, you don't
need the workaround. For instance:

# perf stat -a -e r81d0,r01d1,r08d0,r20d1 sleep 5

This works well if you have a uniform workload across all CPUs.
All those events leak, but the leaks balance out and you
get the correct counts in the end. The advantage is that you don't
have to multiplex. With the workaround enabled, this would multiplex
a lot.

But as I said, this is for experts only.

Another reason is for systems with HT disabled. It turned out to be
very difficult to determine at kernel BOOT TIME if HT was enabled
or not. Note what I said: ENABLED and not SUPPORTED. The latter is
easy to detect. The former needs some model-specific code which is
quite complicated. I wish the kernel had this capability abstracted
somehow. Consequently, the workaround is always enabled. When
HT is disabled, there won't be any multiplexing because there will never
be a conflict, but you pay a small price for accessing the extra data
state. An init script could well detect that HT is off and thus disable the
workaround altogether.

Those are the two main reasons for this control in sysfs.

2014-06-05 10:01:31

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On 5 June 2014 10:29, Stephane Eranian <[email protected]> wrote:
>
> If you know what you are doing (poweruser), then there are measurements
> which works fine with the HT erratum. This is why we have the option.
>
> For instance if you only measure events 4x4 in system-wide mode
> and you know which counters these event are going to use, you don't
> need the workaround. For instance:
>
> # perf stat -a -e r81d0,r01d1,r08d0,r20d1 sleep 5
>
> Works well if you have a uniform workload across all CPUs.
> All those events leak, but the leaks balance themselves and you
> get the correct counts in the end. The advantage is that you don't
> have to multiplex. With the workaround enable, this would multiplex
> a lot.
>
> But as I said, this is for experts only.

Is it not possible to detect this in the kernel and only enable the
workaround for the case where the leaks don't balance? It may not be
possible (or practical) but I do think it's worth having the
discussion.

> Another reason is for systems with HT disabled. It turned out to be
> very difficult to determine at kernel BOOT TIME if HT was enabled
> or not. Note what I said: ENABLED and not SUPPORTED. The latter is
> easy to detect. The former needs some model specific code which is
> quite complicated. I wish the kernel had this capability abstracted
> somehow. Consequently, the workaround is always enabled. When
> HT is disabled, there won't be multiplexing because there will never
> be conflict, but you pay a little price for accessing the extra data
> state.

Does cpu_sibling_map not give you some indication of whether HT is
enabled? I think the topology_thread_cpumask() is the topology API for
that. But I could most definitely be wrong. Hopefully someone on the
Cc list will know.

>An init script could well detect HT is off and thus disable the workaround altogether.

This is exactly the kind of thing I think we should try to avoid. The
ideal is that things just work out of the box and don't require these
magic knobs to be tweaked.

> Those are the two main reasons for this control in sysfs.

Thanks for the info!

2014-06-05 10:19:49

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 12:01 PM, Matt Fleming <[email protected]> wrote:
> On 5 June 2014 10:29, Stephane Eranian <[email protected]> wrote:
>>
>> If you know what you are doing (poweruser), then there are measurements
>> which works fine with the HT erratum. This is why we have the option.
>>
>> For instance if you only measure events 4x4 in system-wide mode
>> and you know which counters these event are going to use, you don't
>> need the workaround. For instance:
>>
>> # perf stat -a -e r81d0,r01d1,r08d0,r20d1 sleep 5
>>
>> Works well if you have a uniform workload across all CPUs.
>> All those events leak, but the leaks balance themselves and you
>> get the correct counts in the end. The advantage is that you don't
>> have to multiplex. With the workaround enable, this would multiplex
>> a lot.
>>
>> But as I said, this is for experts only.
>
> Is it not possible to detect this in the kernel and only enable the
> workaround for the case where the leaks don't balance? It may not be
> possible (or practical) but I do think it's worth having the
> discussion.
>
How would you know that you have a uniform workload from inside
the kernel?

>> Another reason is for systems with HT disabled. It turned out to be
>> very difficult to determine at kernel BOOT TIME if HT was enabled
>> or not. Note what I said: ENABLED and not SUPPORTED. The latter is
>> easy to detect. The former needs some model specific code which is
>> quite complicated. I wish the kernel had this capability abstracted
>> somehow. Consequently, the workaround is always enabled. When
>> HT is disabled, there won't be multiplexing because there will never
>> be conflict, but you pay a little price for accessing the extra data
>> state.
>
> Does cpu_sibling_map not give you some indication of whether HT is
> enabled? I think the topology_thread_cpumask() is the topology API for
> that. But I could most definitely be wrong. Hopefully someone on the
> Cc list will know.
>
I remember trying some of that, but when perf_event is initialized, those
masks are not yet set up properly.

>>An init script could well detect HT is off and thus disable the workaround altogether.
>
> This is exactly the kind of thing I think we should try to avoid. The
> ideal is that things just work out of the box and don't require these
> magic knobs to be tweaked.
>
>> Those are the two main reasons for this control in sysfs.
>
> Thanks for the info!

2014-06-05 10:51:38

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

On Thu, Jun 5, 2014 at 9:47 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jun 04, 2014 at 11:34:13PM +0200, Stephane Eranian wrote:
>> From: Maria Dimakopoulou <[email protected]>
>>
>> This patch adds a new shared_regs style structure to the
>> per-cpu x86 state (cpuc). It is used to coordinate access
>> between counters which must be used with exclusion across
>> HyperThreads on Intel processors. This new struct is not
>> needed on each PMU, thus is is allocated on demand.
>>
>> Reviewed-by: Stephane Eranian <[email protected]>
>> Signed-off-by: Maria Dimakopoulou <[email protected]>
>> ---
>> arch/x86/kernel/cpu/perf_event.h | 40 ++++++++++++++++++--
>> arch/x86/kernel/cpu/perf_event_intel.c | 63 +++++++++++++++++++++++++++++---
>> 2 files changed, 94 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
>> index 413799f..5da0a2b 100644
>> --- a/arch/x86/kernel/cpu/perf_event.h
>> +++ b/arch/x86/kernel/cpu/perf_event.h
>> @@ -65,10 +65,11 @@ struct event_constraint {
>> /*
>> * struct hw_perf_event.flags flags
>> */
>> -#define PERF_X86_EVENT_PEBS_LDLAT 0x1 /* ld+ldlat data address sampling */
>> -#define PERF_X86_EVENT_PEBS_ST 0x2 /* st data address sampling */
>> -#define PERF_X86_EVENT_PEBS_ST_HSW 0x4 /* haswell style st data sampling */
>> -#define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
>> +#define PERF_X86_EVENT_PEBS_LDLAT 0x01 /* ld+ldlat data address sampling */
>> +#define PERF_X86_EVENT_PEBS_ST 0x02 /* st data address sampling */
>> +#define PERF_X86_EVENT_PEBS_ST_HSW 0x04 /* haswell style st data sampling */
>> +#define PERF_X86_EVENT_COMMITTED 0x08 /* event passed commit_txn */
>> +#define PERF_X86_EVENT_EXCL 0x10 /* HT exclusivity on counter */
>>
>> struct amd_nb {
>> int nb_id; /* NorthBridge id */
>> @@ -119,6 +120,27 @@ struct intel_shared_regs {
>> unsigned core_id; /* per-core: core id */
>> };
>>
>> +enum intel_excl_state_type {
>> + INTEL_EXCL_UNUSED = 0, /* counter is unused */
>> + INTEL_EXCL_SHARED = 1, /* counter can be used by both threads */
>> + INTEL_EXCL_EXCLUSIVE = 2, /* counter can be used by one thread only */
>> +};
>> +
>> +struct intel_excl_states {
>> + enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
>> + enum intel_excl_state_type state[X86_PMC_IDX_MAX];
>> +};
>> +
>> +struct intel_excl_cntrs {
>> + spinlock_t lock;
>> + unsigned long lock_flags;
>> +
>> + struct intel_excl_states states[2];
>> +
>> + int refcnt; /* per-core: #HT threads */
>> + unsigned core_id; /* per-core: core id */
>> +};
>
> This must be a raw_spin_lock_t, its taken from pmu::add() which is
> called under perf_event_context::lock, which is raw_spinlock_t, as its
> taken under rq::lock, which too is raw_spinlock_t.
>
Will fix this in V2.

> I should really get around to fixing these errors and include the
> lockdep infrastructure for this.

2014-06-05 11:16:05

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On 5 June 2014 11:19, Stephane Eranian <[email protected]> wrote:
> How would you know that you have a uniform workload from inside
> the kernel?

That's what I'm asking you ;-)

>> Does cpu_sibling_map not give you some indication of whether HT is
>> enabled? I think the topology_thread_cpumask() is the topology API for
>> that. But I could most definitely be wrong. Hopefully someone on the
>> Cc list will know.
>>
> Remember trying some of that, but when perf_event is initialized, those
> masks are not yet setup properly.

Oh, bummer.

If there's no way to detect whether we should enable this workaround
at runtime (and it sounds like there isn't a good way), then that's
fair enough.

We should think twice about allowing it to be disabled via sysfs,
however. Because what is guaranteed to happen is that some user will
report getting bogus results from perf for these events and we'll
spend days trying to figure out why, only to discover they disabled
the workaround and didn't tell us or didn't realise that they'd
disabled it.

If the workaround is low overhead, can't we just leave it enabled?

2014-06-05 12:02:54

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 1:16 PM, Matt Fleming <[email protected]> wrote:
> On 5 June 2014 11:19, Stephane Eranian <[email protected]> wrote:
>> How would you know that you have a uniform workload from inside
>> the kernel?
>
> That's what I'm asking you ;-)
>
There is no way to know this; otherwise we could play some tricks.

>>> Does cpu_sibling_map not give you some indication of whether HT is
>>> enabled? I think the topology_thread_cpumask() is the topology API for
>>> that. But I could most definitely be wrong. Hopefully someone on the
>>> Cc list will know.
>>>
>> Remember trying some of that, but when perf_event is initialized, those
>> masks are not yet setup properly.
>
> Oh, bummer.
>
I think those should be initialized earlier on during booting.

> If there's no way to detect whether we should enable this workaround
> at runtime (and it sounds like there isn't a good way), then that's
> fair enough.
>
> We should think twice about allowing it to be disabled via sysfs,
> however. Because what is guaranteed to happen is that some user will
> report getting bogus results from perf for these events and we'll
> spend days trying to figure out why, only to discover they disabled
> the workaround and didn't tell us or didn't realise that they'd
> disabled it.
>
> If the workaround is low overhead, can't we just leave it enabled?

It is enabled by default. Nothing is done to try and disable it later
even once the kernel is fully booted. So this is mostly for testing
and power-users.

2014-06-05 12:45:32

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 0/9] perf/x86: implement HT counter corruption workaround

On Thu, Jun 5, 2014 at 12:28 AM, Andi Kleen <[email protected]> wrote:
>> There is no HW or FW workaround for this problem.
>
> Actually there is for global measurements:
>
> measure with Any bit set and divide by two.
>
> Please add a check that the any bit is set, and disable
> the workaround for that case.
>
I think this assumes the same event is measured on sibling counters at all times.
Otherwise, how does that eliminate the corruption?

2014-06-05 12:50:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 05, 2014 at 12:16:01PM +0100, Matt Fleming wrote:
> On 5 June 2014 11:19, Stephane Eranian <[email protected]> wrote:
> > How would you know that you have a uniform workload from inside
> > the kernel?
>
> That's what I'm asking you ;-)
>
> >> Does cpu_sibling_map not give you some indication of whether HT is
> >> enabled? I think the topology_thread_cpumask() is the topology API for
> >> that. But I could most definitely be wrong. Hopefully someone on the
> >> Cc list will know.
> >>
> > Remember trying some of that, but when perf_event is initialized, those
> > masks are not yet setup properly.
>
> Oh, bummer.

So we init perf very early to get the nmi-watchdog up and running, but
there's no reason you cannot register a second initcall later and flip
the switch from there.



2014-06-05 12:55:13

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 2:50 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 12:16:01PM +0100, Matt Fleming wrote:
>> On 5 June 2014 11:19, Stephane Eranian <[email protected]> wrote:
>> > How would you know that you have a uniform workload from inside
>> > the kernel?
>>
>> That's what I'm asking you ;-)
>>
>> >> Does cpu_sibling_map not give you some indication of whether HT is
>> >> enabled? I think the topology_thread_cpumask() is the topology API for
>> >> that. But I could most definitely be wrong. Hopefully someone on the
>> >> Cc list will know.
>> >>
>> > Remember trying some of that, but when perf_event is initialized, those
>> > masks are not yet setup properly.
>>
>> Oh, bummer.
>
> So we init perf very early to get nmi-watchdog up and running, but
> there's no reason you cannot register a second initcall later and flip
> the switch from it there.

and what initcall would that be?

2014-06-05 12:59:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 05, 2014 at 02:55:05PM +0200, Stephane Eranian wrote:
> On Thu, Jun 5, 2014 at 2:50 PM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, Jun 05, 2014 at 12:16:01PM +0100, Matt Fleming wrote:
> >> On 5 June 2014 11:19, Stephane Eranian <[email protected]> wrote:
> >> > How would you know that you have a uniform workload from inside
> >> > the kernel?
> >>
> >> That's what I'm asking you ;-)
> >>
> >> >> Does cpu_sibling_map not give you some indication of whether HT is
> >> >> enabled? I think the topology_thread_cpumask() is the topology API for
> >> >> that. But I could most definitely be wrong. Hopefully someone on the
> >> >> Cc list will know.
> >> >>
> >> > Remember trying some of that, but when perf_event is initialized, those
> >> > masks are not yet setup properly.
> >>
> >> Oh, bummer.
> >
> > So we init perf very early to get nmi-watchdog up and running, but
> > there's no reason you cannot register a second initcall later and flip
> > the switch from it there.
>
> and what initcall would that be?

Pretty much anything !early_initcall() is run after SMP bringup, IIRC.
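
A hedged sketch of such a later initcall, assuming the sibling masks are
valid once SMP bringup is done (the function name and the exact check are
illustrative; PMU_FL_EXCL_CNTRS is the flag introduced by this series):

	static int __init fixup_ht_bug(void)
	{
		/* sibling masks are valid by now; a single sibling means HT is off */
		if (cpumask_weight(topology_thread_cpumask(0)) == 1)
			x86_pmu.flags &= ~PMU_FL_EXCL_CNTRS;
		return 0;
	}
	late_initcall(fixup_ht_bug);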



2014-06-05 13:16:50

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 2:59 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 02:55:05PM +0200, Stephane Eranian wrote:
>> On Thu, Jun 5, 2014 at 2:50 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Thu, Jun 05, 2014 at 12:16:01PM +0100, Matt Fleming wrote:
>> >> On 5 June 2014 11:19, Stephane Eranian <[email protected]> wrote:
>> >> > How would you know that you have a uniform workload from inside
>> >> > the kernel?
>> >>
>> >> That's what I'm asking you ;-)
>> >>
>> >> >> Does cpu_sibling_map not give you some indication of whether HT is
>> >> >> enabled? I think the topology_thread_cpumask() is the topology API for
>> >> >> that. But I could most definitely be wrong. Hopefully someone on the
>> >> >> Cc list will know.
>> >> >>
>> >> > Remember trying some of that, but when perf_event is initialized, those
>> >> > masks are not yet setup properly.
>> >>
>> >> Oh, bummer.
>> >
>> > So we init perf very early to get nmi-watchdog up and running, but
>> > there's no reason you cannot register a second initcall later and flip
>> > the switch from it there.
>>
>> and what initcall would that be?
>
> Pretty much anything !early_initcall() is ran after SMP bringup iirc.

Ok, we can try this. Need to check the impact on NMI watchdog if
already active.

2014-06-05 13:20:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 05, 2014 at 11:29:33AM +0200, Stephane Eranian wrote:

> If you know what you are doing (poweruser), then there are measurements
> which works fine with the HT erratum. This is why we have the option.
>
> For instance if you only measure events 4x4 in system-wide mode
> and you know which counters these event are going to use, you don't
> need the workaround. For instance:
>
> # perf stat -a -e r81d0,r01d1,r08d0,r20d1 sleep 5
>
> Works well if you have a uniform workload across all CPUs.
> All those events leak, but the leaks balance themselves and you
> get the correct counts in the end. The advantage is that you don't
> have to multiplex. With the workaround enable, this would multiplex
> a lot.
>
> But as I said, this is for experts only.

Still seems tricky; you really want those pinned to make that guarantee,
and even then it's a stretch. I don't think the perf tool exposes the pinned
attribute though, or I'm just not looking right.

I say stretch, because while I think it'll work out and we'll end up
programming the counters the same way on each cpu, we really do not make
that guarantee either, pinned or not.

I think I agree with Matt in that exposing this to userspace is really
asking for trouble.

Now, I've not yet read through the entire patch series, but how
impossible is it to allow programming the exact same event on both HT
siblings?



2014-06-05 13:26:19

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 3:19 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 11:29:33AM +0200, Stephane Eranian wrote:
>
>> If you know what you are doing (poweruser), then there are measurements
>> which works fine with the HT erratum. This is why we have the option.
>>
>> For instance if you only measure events 4x4 in system-wide mode
>> and you know which counters these event are going to use, you don't
>> need the workaround. For instance:
>>
>> # perf stat -a -e r81d0,r01d1,r08d0,r20d1 sleep 5
>>
>> Works well if you have a uniform workload across all CPUs.
>> All those events leak, but the leaks balance themselves and you
>> get the correct counts in the end. The advantage is that you don't
>> have to multiplex. With the workaround enable, this would multiplex
>> a lot.
>>
>> But as I said, this is for experts only.
>
> Still seems tricky, you really want those pinned to make that guarantee,
> and even then its a stretch. I don't think perf tool exposes the pinned
> attribute though, or I'm just not looking right.
>
I think it does. But regardless, if you are on a single-user machine,
with the NMI watchdog disabled, and you know where events can run and the
workload is uniform, then it does work.
Of course, this is a stretch for average users.

> I say stretch, because while I think it'll work out and we'll end up
> programming the counters the same way on each cpu, we really do not make
> that guarantee either, pinned or not.
>
There is no guarantee. However, this is what is currently going on.

> I think I agree with Matt in that exposing this to userspace is really
> asking for trouble.
>
This is a separate patch for a good reason, it is optional. If you think
it is too risky, then drop it.

> Now, I've not yet read through the entire patch series, but how
> impossible is it to allow programming the exact same event on both HT
> siblings?

That would require a global view of scheduling and keeping the multiplexing
in sync between the HT threads to ensure corrupting events always face each
other. But again, this also assumes only one tool instance is running.

I think it is better to use the workaround and repatriate the counts
for corrupting events. That would allow correct counting. Sampling
is out on those.

2014-06-05 13:26:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 05, 2014 at 03:16:47PM +0200, Stephane Eranian wrote:
> > Pretty much anything !early_initcall() is ran after SMP bringup iirc.
>
> Ok, we can try this. Need to check the impact on NMI watchdog if
> already active.

Right, we could provide an interface to stop and restart the watchdog so
that the PMU is quiescent when we flip this switch, if required.



2014-06-05 13:29:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 05, 2014 at 02:02:51PM +0200, Stephane Eranian wrote:
> It is enabled by default. Nothing is done to try and disable it later
> even once the kernel is fully booted. So this is mostly for testing
> and power-users.

You keep saying "power-users". What is the disadvantage for power users
running with the workaround disabled? I.e., why would anyone want to
disable it at all, what is the use case for that?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-06-05 13:36:30

by Maria Dimakopoulou

[permalink] [raw]
Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

This is to indicate that the PMU needs to set up the
shared state struct, which is called excl_cntrs.
That struct is allocated for all CPUs at first, and
then half of them are destroyed and their CPUs
are made to point to their sibling's struct.

So yes, EXCL is really related to HT.
It could be called HT_EXCL.

Will fix in V2.

On Thu, Jun 5, 2014 at 11:04 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jun 04, 2014 at 11:34:13PM +0200, Stephane Eranian wrote:
>> @@ -499,6 +530,7 @@ do { \
>> */
>> #define PMU_FL_NO_HT_SHARING 0x1 /* no hyper-threading resource sharing */
>> #define PMU_FL_HAS_RSP_1 0x2 /* has 2 equivalent offcore_rsp regs */
>> +#define PMU_FL_EXCL_CNTRS 0x4 /* has exclusive counter requirements */
>
> the EXLC thing is HT specific too, right? How about HT_EXCL or so?
>
>

2014-06-05 13:38:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
> static void
> +intel_start_scheduling(struct cpu_hw_events *cpuc)
> +{

> + /*
> + * lock shared state until we are done scheduling
> + * in stop_event_scheduling()
> + * makes scheduling appear as a transaction
> + */
> + spin_lock_irqsave(&excl_cntrs->lock, excl_cntrs->lock_flags);
> +
> + /*
> + * save initial state of sibling thread
> + */
> + memcpy(xlo->init_state, xlo->state, sizeof(xlo->init_state));
> +}
> +
> +static void
> +intel_stop_scheduling(struct cpu_hw_events *cpuc)
> +{

> + /*
> + * make new sibling thread state visible
> + */
> + memcpy(xlo->state, xlo->init_state, sizeof(xlo->state));
> +
> + xl->sched_started = false;
> + /*
> + * release shared state lock (lock acquire in intel_start_scheduling())
> + */
> + spin_unlock_irqrestore(&excl_cntrs->lock, excl_cntrs->lock_flags);
> +}

Do you really need the irqsave/irqrestore? From what I can tell this is
always called under perf_event_context::lock and that is already an
IRQ-safe lock, so interrupts should always be disabled here.

Also, it looks like xl->state is the effective state, and ->init_state
is the work state? Is init the right name for this?



2014-06-05 13:42:16

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 3:27 PM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 02:02:51PM +0200, Stephane Eranian wrote:
>> It is enabled by default. Nothing is done to try and disable it later
>> even once the kernel is fully booted. So this is mostly for testing
>> and power-users.
>
> You keep saying "power-users". What is the disadvantage for power users
> running with the workaround disabled? I.e., why would anyone want to
> disable it at all, what is the use case for that?
>
I gave a test case earlier:

# echo 0 >/proc/sys/kernel/nmi_watchdog
# run_my_uniform_workload_on_all_cpus &
# perf stat -a -e r81d0,r01d1,r08d0,r20d1 sleep 5

That run gives the correct answer.

If I just look at CPU0 CPU4 siblings:

CPU0, counter0 leaks N counts to CPU4, counter 0

but at the same time:

CPU4, counter0 leaks N counts to CPU0, counter 0

This is because we have the same event in the same
counter AND the workload is uniform, meaning the
event (here loads retired) occurs at the same rate
on both siblings.

You can test this by measuring only on one HT.
# perf stat -a -C0 -e r81d0,r01d1,r08d0,r20d1 sleep 5

Note that some events leak more than they count.

Again, this is really for experts. The average user
should not have to deal with this. So we can drop
the sysfs entry.

2014-06-05 13:42:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
> +static void intel_commit_scheduling(struct cpu_hw_events *cpuc,
> + struct perf_event *event, int cntr)
> +{
> + struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
> + struct event_constraint *c = event->hw.constraint;
> + struct intel_excl_states *xlo, *xl;
> + int tid = cpuc->excl_thread_id;
> + int o_tid = 1 - tid;
> + int is_excl;
> +
> + if (cpuc->is_fake || !c)
> + return;
> +
> + is_excl = c->flags & PERF_X86_EVENT_EXCL;
> +
> + if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
> + return;
> +
> + WARN_ON_ONCE(!excl_cntrs);
> +
> + if (!excl_cntrs)
> + return;
> +
> + xl = &excl_cntrs->states[tid];
> + xlo = &excl_cntrs->states[o_tid];
> +
> + WARN_ON_ONCE(!spin_is_locked(&excl_cntrs->lock));

Use:
lockdep_assert_held(&excl_cntrs->lock);

It also checks that the current context is actually the lock holder,
and it doesn't generate any code when !lockdep.

> +
> + if (cntr >= 0) {
> + if (is_excl)
> + xlo->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
> + else
> + xlo->init_state[cntr] = INTEL_EXCL_SHARED;
> + }
> }



2014-06-05 13:48:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:

> +static struct event_constraint *
> +intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
> + struct event_constraint *c)
> +{

> + if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
> +
> + /*
> + * in case we fail, we assume no counter
> + * is supported to be on the safe side
> + */
> + cx = kmalloc(sizeof(*cx), GFP_KERNEL);
> + if (!cx)
> + return &emptyconstraint;
> +

Ok, so forgive me if I'm wrong, but the way we get here is through:

x86_schedule_event()
->start_scheduling()
spin_lock()
->get_event_constraints()
intel_get_excl_constraints()
kmalloc(.gfp=GFP_KERNEL)

How can that ever work?



2014-06-05 14:01:28

by Maria Dimakopoulou

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

Are you saying it is illegal to call kmalloc() from
this context?

kmalloc is needed because we need to allocate
a new constraint struct since the static constraint
cannot be modified.

Worst case, we can statically allocate a second
constraint struct in the event struct.

On Thu, Jun 5, 2014 at 4:48 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
>
>> +static struct event_constraint *
>> +intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
>> + struct event_constraint *c)
>> +{
>
>> + if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
>> +
>> + /*
>> + * in case we fail, we assume no counter
>> + * is supported to be on the safe side
>> + */
>> + cx = kmalloc(sizeof(*cx), GFP_KERNEL);
>> + if (!cx)
>> + return &emptyconstraint;
>> +
>
> Ok, so forgive me if I'm wrong, but the way we get here is through:
>
> x86_schedule_event()
> ->start_scheduling()
> spin_lock()
> ->get_event_constraints()
> intel_get_excl_constraints()
> kmalloc(.gfp=GFP_KERNEL)
>
> How can that ever work?

2014-06-05 14:04:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
> +
> + /*
> + * Modify static constraint with current dynamic
> + * state of thread
> + *
> + * EXCLUSIVE: sibling counter measuring exclusive event
> + * SHARED : sibling counter measuring non-exclusive event
> + * UNUSED : sibling counter unused
> + */
> + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
> + /*
> + * exclusive event in sibling counter
> + * our corresponding counter cannot be used
> + * regardless of our event
> + */
> + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
> + __clear_bit(i, cx->idxmsk);
> + /*
> + * if measuring an exclusive event, sibling
> + * measuring non-exclusive, then counter cannot
> + * be used
> + */
> + if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
> + __clear_bit(i, cx->idxmsk);
> + }
> +
> + /*
> + * recompute actual bit weight for scheduling algorithm
> + */
> + cx->weight = hweight64(cx->idxmsk64);

So I think we talked about this a bit; what happens if CPU0 (taking your
4 core HSW-client) is first to program its counters and takes all 4 in
exclusive mode?

Then there's none left for CPU4.

Did I miss where we avoid that problem, or is that an actual issue?



2014-06-05 14:05:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 05, 2014 at 03:42:14PM +0200, Stephane Eranian wrote:
> I gave a test case earlier:
>
> # echo 0 >/proc/sys/kernel/nmi_watchdog
> # run_my_uniform_workload_on_all_cpus &
> # perf stat -a -e r81d0,r01d1,r08d0,r20d1 sleep 5
>
> That run gives the correct answer.
>
> If I just look at CPU0 CPU4 siblings:
>
> CPU0, counter0 leaks N counts to CPU4, counter 0
>
> but at the same time:
>
> CPU4, counter0 leaks N counts to CPU0, counter 0
>
> This is because we have the same event in the same
> counter AND the workload is uniform, meaning the
> event (here loads retired) occurs at the same rate
> on both siblings.
>
> You can test this by measuring only on one HT.
> # perf stat -a -C0 -e r81d0,r01d1,r08d0,r20d1 sleep 5
>
> Note that some events, leak more than they count.

Ok, so AFAIU, this particular workload counts correctly just because
counters leak the same amount. If so, what happens if you run this exact
same workload with the workaround enabled? I read something about a bit
more counter multiplexing... or is there a more serious issue?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-06-05 14:06:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 05, 2014 at 05:01:25PM +0300, Maria Dimakopoulou wrote:
> Are you saying it is illegal to call kmalloc() from
> this context?
>
> kmalloc is needed because we need to allocate
> a new constraint struct since the static constraint
> cannot be modified.
>
> Worst case we can statically allocate a second
> constraint struct in the event struct.

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

Please do not top-post.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-06-05 14:11:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 05, 2014 at 05:01:25PM +0300, Maria Dimakopoulou wrote:
> On Thu, Jun 5, 2014 at 4:48 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
> >
> >> +static struct event_constraint *
> >> +intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
> >> + struct event_constraint *c)
> >> +{
> >
> >> + if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
> >> +
> >> + /*
> >> + * in case we fail, we assume no counter
> >> + * is supported to be on the safe side
> >> + */
> >> + cx = kmalloc(sizeof(*cx), GFP_KERNEL);
> >> + if (!cx)
> >> + return &emptyconstraint;
> >> +
> >
> > Ok, so forgive me if I'm wrong, but the way we get here is through:
> >
> > x86_schedule_event()
> > ->start_scheduling()
> > spin_lock()
> > ->get_event_constraints()
> > intel_get_excl_constraints()
> > kmalloc(.gfp=GFP_KERNEL)
> >
> > How can that ever work?

> Are you saying it is illegal to call kmalloc() from
> this context?

Nobody will come and arrest you for it, so no. Broken though. GFP_KERNEL
will attempt to sleep to wait for reclaim, and you're holding a
spinlock.

> kmalloc is needed because we need to allocate
> a new constraint struct since the static constraint
> cannot be modified.
>
> Worst case we can statically allocate a second
> constraint struct in the event struct.

Nah, since you will need at most one constraint per counter, you could
preallocate num_counter constraints for each cpu.
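
A hedged sketch of that preallocation, done once per CPU at prepare/hotplug
time so the scheduling path never allocates ("constraint_list" is a
hypothetical field name, not one from the posted series):

	/*
	 * one spare constraint slot per generic counter, per CPU;
	 * on failure, bail out the same way the shared_regs allocation does
	 */
	cpuc->constraint_list = kcalloc(x86_pmu.num_counters,
					sizeof(struct event_constraint),
					GFP_KERNEL);
	if (!cpuc->constraint_list)
		return NOTIFY_BAD;

intel_get_excl_constraints() could then hand out &cpuc->constraint_list[idx]
instead of calling kmalloc() under the lock.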



2014-06-05 14:14:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 05, 2014 at 04:11:50PM +0200, Peter Zijlstra wrote:
> > > x86_schedule_event()
> > > ->start_scheduling()
> > > spin_lock()
> > > ->get_event_constraints()
> > > intel_get_excl_constraints()
> > > kmalloc(.gfp=GFP_KERNEL)
> > >
> > > How can that ever work?
>
> > Are you saying it is illegal to call kmalloc() from
> > this context?
>
> Nobody will come and arrest you for it, so no. Broken though. GFP_KERNEL
> will attempt to sleep to wait for reclaim, and you're holding a
> spinlock.

Furthermore, even GFP_ATOMIC shouldn't be used because these are/should
be raw_spinlocks and the zone->lock is a regular spinlock.

So please look into the preallocation thing.



2014-06-05 14:15:22

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 5, 2014 at 4:04 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
>> +
>> + /*
>> + * Modify static constraint with current dynamic
>> + * state of thread
>> + *
>> + * EXCLUSIVE: sibling counter measuring exclusive event
>> + * SHARED : sibling counter measuring non-exclusive event
>> + * UNUSED : sibling counter unused
>> + */
>> + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
>> + /*
>> + * exclusive event in sibling counter
>> + * our corresponding counter cannot be used
>> + * regardless of our event
>> + */
>> + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
>> + __clear_bit(i, cx->idxmsk);
>> + /*
>> + * if measuring an exclusive event, sibling
>> + * measuring non-exclusive, then counter cannot
>> + * be used
>> + */
>> + if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
>> + __clear_bit(i, cx->idxmsk);
>> + }
>> +
>> + /*
>> + * recompute actual bit weight for scheduling algorithm
>> + */
>> + cx->weight = hweight64(cx->idxmsk64);
>
> So I think we talked about this a bit; what happens if CPU0 (taking your
> 4 core HSW-client) is first to program its counters and takes all 4 in
> exclusive mode?
>
> Then there's none left for CPU4.
>
> Did I miss where we avoid that problem, or is that an actual issue?

Yes, this patch series does not address this problem yet. It will be
addressed in a second series.
We don't have a good solution yet.

2014-06-05 14:21:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 05, 2014 at 04:15:17PM +0200, Stephane Eranian wrote:
> On Thu, Jun 5, 2014 at 4:04 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
> >> +
> >> + /*
> >> + * Modify static constraint with current dynamic
> >> + * state of thread
> >> + *
> >> + * EXCLUSIVE: sibling counter measuring exclusive event
> >> + * SHARED : sibling counter measuring non-exclusive event
> >> + * UNUSED : sibling counter unused
> >> + */
> >> + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
> >> + /*
> >> + * exclusive event in sibling counter
> >> + * our corresponding counter cannot be used
> >> + * regardless of our event
> >> + */
> >> + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
> >> + __clear_bit(i, cx->idxmsk);
> >> + /*
> >> + * if measuring an exclusive event, sibling
> >> + * measuring non-exclusive, then counter cannot
> >> + * be used
> >> + */
> >> + if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
> >> + __clear_bit(i, cx->idxmsk);
> >> + }
> >> +
> >> + /*
> >> + * recompute actual bit weight for scheduling algorithm
> >> + */
> >> + cx->weight = hweight64(cx->idxmsk64);
> >
> > So I think we talked about this a bit; what happens if CPU0 (taking your
> > 4 core HSW-client) is first to program its counters and takes all 4 in
> > exclusive mode?
> >
> > Then there's none left for CPU4.
> >
> > Did I miss where we avoid that problem, or is that an actual issue?
>
> Yes, this patch series does not address this problem yet. It will be
> in a second series.
> Don't have a good solution yet.

We could limit each cpu to num_counters/2 exclusive slots. That'll still
be painful with some constrained events I imagine, but in general that
should 'work' I suppose.



2014-06-05 14:24:32

by Maria Dimakopoulou

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 5, 2014 at 5:14 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 04:11:50PM +0200, Peter Zijlstra wrote:
>> > > x86_schedule_event()
>> > > ->start_scheduling()
>> > > spin_lock()
>> > > ->get_event_constraints()
>> > > intel_get_excl_constraints()
>> > > kmalloc(.gfp=GFP_KERNEL)
>> > >
>> > > How can that ever work?
>>
>> > Are you saying it is illegal to call kmalloc() from
>> > this context?
>>
>> Nobody will come and arrest you for it, so no. Broken though. GFP_KERNEL
>> will attempt to sleep to wait for reclaim, and you're holding a
>> spinlock.

Ok, got it.

>
> Furthermore, even GFP_ATOMIC shouldn't be used because these are/should
> be raw_spinlocks and the zone->lock is a regular spinlock.
>
> So please look into the preallocation thing.

We will preallocate.

2014-06-05 14:26:40

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 5, 2014 at 4:21 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 04:15:17PM +0200, Stephane Eranian wrote:
>> On Thu, Jun 5, 2014 at 4:04 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
>> >> +
>> >> + /*
>> >> + * Modify static constraint with current dynamic
>> >> + * state of thread
>> >> + *
>> >> + * EXCLUSIVE: sibling counter measuring exclusive event
>> >> + * SHARED : sibling counter measuring non-exclusive event
>> >> + * UNUSED : sibling counter unused
>> >> + */
>> >> + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
>> >> + /*
>> >> + * exclusive event in sibling counter
>> >> + * our corresponding counter cannot be used
>> >> + * regardless of our event
>> >> + */
>> >> + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
>> >> + __clear_bit(i, cx->idxmsk);
>> >> + /*
>> >> + * if measuring an exclusive event, sibling
>> >> + * measuring non-exclusive, then counter cannot
>> >> + * be used
>> >> + */
>> >> + if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
>> >> + __clear_bit(i, cx->idxmsk);
>> >> + }
>> >> +
>> >> + /*
>> >> + * recompute actual bit weight for scheduling algorithm
>> >> + */
>> >> + cx->weight = hweight64(cx->idxmsk64);
>> >
>> > So I think we talked about this a bit; what happens if CPU0 (taking your
>> > 4 core HSW-client) is first to program its counters and takes all 4 in
>> > exclusive mode?
>> >
>> > Then there's none left for CPU4.
>> >
>> > Did I miss where we avoid that problem, or is that an actual issue?
>>
>> Yes, this patch series does not address this problem yet. It will be
>> in a second series.
>> Don't have a good solution yet.
>
> We could limit each cpu to num_counters/2 exclusive slots. That'll still
> be painful with some constrained events I imagine, but in general that
> should 'work' I suppose.

That is probably the easiest solution: just modify the dynamic constraint
mask some more. I have not tried it yet.

The repatriation of the leaked counts is not so easy either. We need to IPI
the other HT thread. There may be some restrictions as to when we can
safely do this.

2014-06-05 14:45:28

by Maria Dimakopoulou

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 5:03 PM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 03:42:14PM +0200, Stephane Eranian wrote:
>> I gave a test case earlier:
>>
>> # echo 0 >/proc/sys/kernel/nmi_watchdog
>> # run_my_uniform_workload_on_all_cpus &
>> # perf stat -a -e r81d0,r01d1,r08d0,r20d1 sleep 5
>>
>> That run gives the correct answer.
>>
>> If I just look at CPU0 CPU4 siblings:
>>
>> CPU0, counter0 leaks N counts to CPU4, counter 0
>>
>> but at the same time:
>>
>> CPU4, counter0 leaks N counts to CPU0, counter 0
>>
>> This is because we have the same event in the same
>> counter AND the workload is uniform, meaning the
>> event (here loads retired) occurs at the same rate
>> on both siblings.
>>
>> You can test this by measuring only on one HT.
>> # perf stat -a -C0 -e r81d0,r01d1,r08d0,r20d1 sleep 5
>>
>> Note that some events, leak more than they count.
>
> Ok, so AFAIU, this particular workload counts correctly just because
> counters leak the same amount. If so, what happens if you run this exact
> same workload with the workaround enabled? I read something about a bit
> more counter multiplexing... or is there a more serious issue?

The issue is that the outgoing leaked counts are not compensated
by the incoming leaked counts of the sibling thread. With the workaround,
corrupting events are always scheduled with an empty sibling counter.
This means that their leaked counts are lost. So it is expected to see
lower counts with the workaround. Note that this is not a side effect of
the workaround; leaked counts are expected to be lost with nothing
measured on the sibling counter in general.

In a second series we intend to re-integrate the counts for counting mode
events. The workaround makes this easier because it guarantees
that the sibling counter is unused, thus its counts are purely leaked
counts and they can be safely re-integrated.


>
> Thanks.
>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

2014-06-05 14:31:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 05, 2014 at 04:26:38PM +0200, Stephane Eranian wrote:
> On Thu, Jun 5, 2014 at 4:21 PM, Peter Zijlstra <[email protected]> wrote:

> > We could limit each cpu to num_counters/2 exclusive slots. That'll still
> > be painful with some constrained events I imagine, but in general that
> > should 'work' I suppose.
>
> That is probably the easiest solution, just modify the dynamic constraint
> mask some more. Have not yet tried it.

Right, see what happens :-)

> The repatriation of the leaked count is not so easy either. Need to IPI
> the other HT. There may be some restrictions as to when we can
> safely do this.

Yes, that'll be 'interesting'. You typically cannot do things like
smp_call_function() from IRQ/NMI context, which leaves you with
asynchronous IPIs.

I suppose the easiest way is to simply push the count out to the right
event when you reprogram the sibling. And maybe kick it every so often
to force this, just in case the PMU doesn't get reprogrammed a lot.

Still tricky.



2014-06-05 15:19:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 05, 2014 at 05:45:24PM +0300, Maria Dimakopoulou wrote:
> The issue is that the outcoming leaked counts are not compensated
> by the incoming leaked counts of the sibling thread. With the workaround,
> corrupting events are always scheduled with an empty sibling counter.
> This means that their leaked counts are lost. So it is expected to see
> lower counts with the workaround. Note that this is not a side-effect of
> the workaround; leaked counts are expected to be lost with nothing
> measured on the sibling counter in general.
>
> In a second series we intend to re-integrate the counts for counting mode
> events. The workaround makes this easier because it guarantees
> that the sibling counter is unused, thus its counts are purely leaked
> counts and they can be safely re-integrated.

IIUC, sounds to me like reintegrating the leaked counts from the unused
counter should be part of the workaround too, not a second series...

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-06-05 15:31:13

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 5/9] perf/x86: implement cross-HT corruption bug workaround

On Thu, Jun 5, 2014 at 4:21 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 04:15:17PM +0200, Stephane Eranian wrote:
>> On Thu, Jun 5, 2014 at 4:04 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Jun 04, 2014 at 11:34:14PM +0200, Stephane Eranian wrote:
>> >> +
>> >> + /*
>> >> + * Modify static constraint with current dynamic
>> >> + * state of thread
>> >> + *
>> >> + * EXCLUSIVE: sibling counter measuring exclusive event
>> >> + * SHARED : sibling counter measuring non-exclusive event
>> >> + * UNUSED : sibling counter unused
>> >> + */
>> >> + for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
>> >> + /*
>> >> + * exclusive event in sibling counter
>> >> + * our corresponding counter cannot be used
>> >> + * regardless of our event
>> >> + */
>> >> + if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
>> >> + __clear_bit(i, cx->idxmsk);
>> >> + /*
>> >> + * if measuring an exclusive event, sibling
>> >> + * measuring non-exclusive, then counter cannot
>> >> + * be used
>> >> + */
>> >> + if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
>> >> + __clear_bit(i, cx->idxmsk);
>> >> + }
>> >> +
>> >> + /*
>> >> + * recompute actual bit weight for scheduling algorithm
>> >> + */
>> >> + cx->weight = hweight64(cx->idxmsk64);
>> >
>> > So I think we talked about this a bit; what happens if CPU0 (taking your
>> > 4 core HSW-client) is first to program its counters and takes all 4 in
>> > exclusive mode?
>> >
>> > Then there's none left for CPU4.
>> >
>> > Did I miss where we avoid that problem, or is that an actual issue?
>>
>> Yes, this patch series does not address this problem yet. It will be
>> in a second series.
>> Don't have a good solution yet.
>
> We could limit each cpu to num_counters/2 exclusive slots. That'll still
> be painful with some constrained events I imagine, but in general that
> should 'work' I suppose.

OK, I tried this. It avoids the problem and just requires a 3-line patch.
So we could go with that for now. It would avoid total starvation.
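
For concreteness, one illustrative way the limit could look inside
intel_get_excl_constraints(), assuming a hypothetical per-thread count of
exclusive slots already claimed ("n_excl" is not a field from the posted
series):

	/* never let one HT thread claim more than half the generic counters */
	if (is_excl && xl->n_excl >= x86_pmu.num_counters / 2)
		return &emptyconstraint;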

2014-06-05 16:39:58

by Maria Dimakopoulou

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 6:17 PM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 05:45:24PM +0300, Maria Dimakopoulou wrote:
>> The issue is that the outcoming leaked counts are not compensated
>> by the incoming leaked counts of the sibling thread. With the workaround,
>> corrupting events are always scheduled with an empty sibling counter.
>> This means that their leaked counts are lost. So it is expected to see
>> lower counts with the workaround. Note that this is not a side-effect of
>> the workaround; leaked counts are expected to be lost with nothing
>> measured on the sibling counter in general.
>>
>> In a second series we intend to re-integrate the counts for counting mode
>> events. The workaround makes this easier because it guarantees
>> that the sibling counter is unused, thus its counts are purely leaked
>> counts and they can be safely re-integrated.
>
> IIUC, sounds to me like reintegrating the leaked counts from the unused
> counter should be part of the workaround too, not a second series...
>

This series aims to avoid corruption of non-corrupting events.
Re-integration of the counts is not related to this. This is why
we chose to fix this other problem in a second series, to keep
things clean and the concepts separated.

> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

2014-06-05 16:47:48

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 6:39 PM, Maria Dimakopoulou
<[email protected]> wrote:
> On Thu, Jun 5, 2014 at 6:17 PM, Borislav Petkov <[email protected]> wrote:
>> On Thu, Jun 05, 2014 at 05:45:24PM +0300, Maria Dimakopoulou wrote:
>>> The issue is that the outcoming leaked counts are not compensated
>>> by the incoming leaked counts of the sibling thread. With the workaround,
>>> corrupting events are always scheduled with an empty sibling counter.
>>> This means that their leaked counts are lost. So it is expected to see
>>> lower counts with the workaround. Note that this is not a side-effect of
>>> the workaround; leaked counts are expected to be lost with nothing
>>> measured on the sibling counter in general.
>>>
>>> In a second series we intend to re-integrate the counts for counting mode
>>> events. The workaround makes this easier because it guarantees
>>> that the sibling counter is unused, thus its counts are purely leaked
>>> counts and they can be safely re-integrated.
>>
>> IIUC, sounds to me like reintegrating the leaked counts from the unused
>> counter should be part of the workaround too, not a second series...
>>
>
> This series aims to avoid corruption of non-corrupting events.
> Re-integration of the counts is not related to this. This is why
> we chose to fix this other problem on a second series to keep
> things clean and concepts separated.
>
I agree with Maria here. The reintegration is a different problem.
The series here helps and is a required first step. Reintegration
will be added ASAP.

>> --
>> Regards/Gruss,
>> Boris.
>>
>> Sent from a fat crate under my desk. Formatting is fine.
>> --

2014-06-05 16:54:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 05, 2014 at 07:39:55PM +0300, Maria Dimakopoulou wrote:
> This series aims to avoid corruption of non-corrupting events.
> Re-integration of the counts is not related to this.

What do you mean "not related"? Then you have a partial workaround. This
whole thing is trying to fix a hw bug so of course it is related.

Once you do the "full" workaround, along with the part which
reintegrates the counts and thus completely fixes the issue (I believe),
then you don't need to disable it at all because disabling it doesn't
make any sense then. See what I'm saying?

This is the whole point Matt and I are trying to make: if the *full*
workaround doesn't have any noticeable disadvantages, then you don't
need to add a disable-mechanism due to the can of worms opening if you
do.

How you get this upstream to ease the review is a whole other story,
and I applaud your desire to keep things clean.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-06-05 18:00:25

by Maria Dimakopoulou

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On Thu, Jun 5, 2014 at 7:52 PM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jun 05, 2014 at 07:39:55PM +0300, Maria Dimakopoulou wrote:
>> This series aims to avoid corruption of non-corrupting events.
>> Re-integration of the counts is not related to this.
>
> What do you mean "not related"? Then you have a partial workaround. This
> whole thing is trying to fix a hw bug so of course it is related.

I said it is not related to the problem of corruption of non-corrupting events,
which we solve here.
Re-integration of the counts of corrupting events is a different problem.
Both problems are related to the HT bug.

>
> Once you do the "full" workaround, along with the part which
> reintegrates the counts and thus completely fixes the issue (I believe),
> then you don't need to disable it at all because disabling it doesn't
> make any sense then. See what I'm saying?
>
> This is the whole point Matt and I are trying to make: if the *full*
> workaround doesn't have any noticeable disadvantages, then you don't
> need to add a disable-mechanism due to the can of worms opening if you
> do.
>

As Stephane pointed out, the sysfs entry is optional and the workaround
can be disabled only as root.

It is not absolutely necessary and it's not important.
We will drop it in V2.

> How you get this upstream to ease the review is a whole other story
> and I applaud your desire to keep things clean.
>

Then, I guess we agree to keep the re-integration for another series for
the reason you just pointed out.


> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

2014-06-05 21:34:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

> This hard assumes there's only ever 2 threads, which is true and I
> suppose more in arch/x86 will come apart the moment Intel makes a chip
> with more, still, do we have topology_thread_id() or so to cure this?


Xeon Phi already has 4 threads today.

-Andi

--
[email protected] -- Speaking for myself only

2014-06-05 21:38:27

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

On Thu, Jun 5, 2014 at 11:33 PM, Andi Kleen <[email protected]> wrote:
>> This hard assumes there's only ever 2 threads, which is true and I
>> suppose more in arch/x86 will come apart the moment Intel makes a chip
>> with more, still, do we have topology_thread_id() or so to cure this?
>
>
> Xeon Phi already has 4 threads today.
>
Yes, but it does not have that bug (hopefully). This code is just for
the impacted processors.

> -Andi
>
> --
> [email protected] -- Speaking for myself only
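
To make "just for the impacted processors" concrete: the exclusion logic in
the intel_pmu_cpu_starting() hunk quoted elsewhere in this thread is guarded
by PMU_FL_EXCL_CNTRS, so confining the workaround amounts to setting that flag
only for the affected models. The function name and the model-number list
below are assumptions based on the parts named in the cover letter
(SandyBridge, IvyBridge, Haswell), not a quote of the series.

#include <asm/processor.h>

/*
 * Sketch only: set the flag that gates the cross-thread exclusion
 * machinery for the models believed to have the erratum; anything
 * else, including Xeon Phi, leaves it clear and schedules as before.
 * x86_pmu.flags and PMU_FL_EXCL_CNTRS come from the posted series.
 */
static void intel_ht_bug_init(void)
{
	switch (boot_cpu_data.x86_model) {
	case 42: case 45:			/* SandyBridge, SandyBridge-EP */
	case 58: case 62:			/* IvyBridge, IvyBridge-EP */
	case 60: case 63: case 69: case 70:	/* Haswell variants */
		x86_pmu.flags |= PMU_FL_EXCL_CNTRS;
		break;
	default:
		break;
	}
}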

2014-06-05 23:29:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

> As Stephane pointed out, the sysfs entry is optional and the workaround
> can be disabled only as root.
>
> It is not absolutely necessary and it's not important.
> We will drop it in V2.

I would prefer to keep it. It's fairly complex and it's always good
to have a way to disable complex things in case something goes wrong.

-Andi
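
For context on the kind of knob under discussion: a generic sketch of a
root-writable boolean sysfs attribute that flips a flag the event scheduler
would consult before applying the exclusion. This is not the code from patch
9/9; the attribute name, the flag, and where the file gets registered are
illustrative assumptions.

#include <linux/device.h>
#include <linux/kernel.h>

/*
 * Sketch only: mode 0644 means everyone can read the knob but only
 * root can flip it, which is the "disabled only as root" property
 * mentioned earlier in the thread.
 */
static bool ht_bug_workaround_enabled = true;

static ssize_t ht_bug_workaround_show(struct device *dev,
				      struct device_attribute *attr, char *buf)
{
	return sprintf(buf, "%d\n", ht_bug_workaround_enabled);
}

static ssize_t ht_bug_workaround_store(struct device *dev,
				       struct device_attribute *attr,
				       const char *buf, size_t count)
{
	unsigned long val;

	if (kstrtoul(buf, 0, &val))
		return -EINVAL;

	ht_bug_workaround_enabled = !!val;
	return count;
}

static DEVICE_ATTR(ht_bug_workaround, 0644,
		   ht_bug_workaround_show, ht_bug_workaround_store);

The attribute would then have to be added to the CPU PMU's sysfs attribute
group so it appears under /sys/bus/event_source/devices/cpu/; whether the real
patch wires it up this way is an assumption.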

2014-06-06 08:28:05

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 9/9] perf/x86: add syfs entry to disable HT bug workaround

On 6 June 2014 00:29, Andi Kleen <[email protected]> wrote:
>> As Stephane pointed out, the sysfs entry is optional and the workaround
>> can be disabled only as root.
>>
>> It is not absolutely necessary and it's not important.
>> We will drop it in V2.
>
> I would prefer to keep it. It's fairly complex and it's always good
> to have a way to disable complex things in case something goes wrong.

You want to be able to disable the workaround in case it doesn't work?
I'm having a hard time buying that as a valid reason for this knob.

If someone runs into issues with the workaround we want them to report
those issues so that we can tweak the code. If they've got the ability
to simply disable the code they're less likely to report the problem
and we'll all be worse for it.

Having the knob increases the number of configurations and increases
complexity; it doesn't reduce it.

2014-06-10 11:53:48

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

On Thu, Jun 5, 2014 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jun 04, 2014 at 11:34:13PM +0200, Stephane Eranian wrote:
>> @@ -2020,12 +2050,29 @@ static void intel_pmu_cpu_starting(int cpu)
>>
>> if (x86_pmu.lbr_sel_map)
>> cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
>> +
>> + if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
>> + for_each_cpu(i, topology_thread_cpumask(cpu)) {
>> + struct intel_excl_cntrs *c;
>> +
>> + c = per_cpu(cpu_hw_events, i).excl_cntrs;
>> + if (c && c->core_id == core_id) {
>> + cpuc->kfree_on_online[1] = cpuc->excl_cntrs;
>> + cpuc->excl_cntrs = c;
>> + cpuc->excl_thread_id = 1;
>> + break;
>> + }
>> + }
>> + cpuc->excl_cntrs->core_id = core_id;
>> + cpuc->excl_cntrs->refcnt++;
>> + }
>> }
>
> This hard assumes there's only ever 2 threads, which is true and I
> suppose more in arch/x86 will come apart the moment Intel makes a chip
> with more, still, do we have topology_thread_id() or so to cure this?

I assume your comment refers to kfree_on_online[].
This code is specific to the HT bug, so yes, it assumes 2 threads and that
only one of the two excl_cntrs structs needs to be freed.
Doing otherwise would require a list that would never be used to its full
potential.

2014-06-10 12:26:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/9] perf/x86: add cross-HT counter exclusion infrastructure

On Tue, Jun 10, 2014 at 01:53:45PM +0200, Stephane Eranian wrote:
> On Thu, Jun 5, 2014 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Jun 04, 2014 at 11:34:13PM +0200, Stephane Eranian wrote:
> >> @@ -2020,12 +2050,29 @@ static void intel_pmu_cpu_starting(int cpu)
> >>
> >> if (x86_pmu.lbr_sel_map)
> >> cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
> >> +
> >> + if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
> >> + for_each_cpu(i, topology_thread_cpumask(cpu)) {
> >> + struct intel_excl_cntrs *c;
> >> +
> >> + c = per_cpu(cpu_hw_events, i).excl_cntrs;
> >> + if (c && c->core_id == core_id) {
> >> + cpuc->kfree_on_online[1] = cpuc->excl_cntrs;
> >> + cpuc->excl_cntrs = c;
> >> + cpuc->excl_thread_id = 1;
> >> + break;
> >> + }
> >> + }
> >> + cpuc->excl_cntrs->core_id = core_id;
> >> + cpuc->excl_cntrs->refcnt++;
> >> + }
> >> }
> >
> > This hard assumes there's only ever 2 threads, which is true and I
> > suppose more in arch/x86 will come apart the moment Intel makes a chip
> > with more, still, do we have topology_thread_id() or so to cure this?
>
> I assume your comment is relative to kfree_on_online[].
> This code is specific to the HT bug, so yes, it assumes 2 threads and that
> only one entry of the two excl_cntrs structs needs to be freed.
> Doing otherwise, would require a list and will never be used to its full
> potential.

That, and ->excl_thread_id = 1; I was thinking that if we somehow had
4 threads, some of them would need = 2 or = 3 in there.
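
A minimal sketch of one way those indices could be derived instead of
hard-coded, assuming for_each_cpu() walks the sibling mask in increasing CPU
order; this is not part of the posted series.

#include <linux/cpumask.h>
#include <linux/topology.h>

/*
 * Sketch only: derive a 0-based SMT thread index by counting how many
 * siblings in the core's thread mask precede this CPU.  With two
 * threads this reproduces today's 0/1 split; with four it would give
 * 0..3, covering the = 2 / = 3 cases mentioned above.
 */
static int excl_thread_index(int cpu)
{
	int i, idx = 0;

	for_each_cpu(i, topology_thread_cpumask(cpu)) {
		if (i == cpu)
			break;
		idx++;
	}
	return idx;
}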


