2014-07-22 08:10:11

by Yan, Zheng

Subject: [PATCH v3 0/9] perf, x86: large PEBS interrupt threshold

This patch series implements a large PEBS interrupt threshold. For some
limited cases, it can significantly reduce the sampling overhead. Please
read patch 6's commit message for more information.

changes since v1:
- drop patch 'perf, core: Add all PMUs to pmu_idr'
- add comments for the case that multiple counters overflow simultaneously
changes since v2:
- rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
- use a flag to indicate the auto-reload mechanism
- move the code that sets up PEBS sample data into a separate function
- output the PEBS records in batch
- enable this for all (PEBS capable) hardware
- more description of the multiplexing


2014-07-22 08:10:17

by Yan, Zheng

Subject: [PATCH v3 2/9] perf, x86: use context switch callback to flush LBR stack

The previous commit introduced a context switch callback whose function
overlaps with the flush branch stack callback. So we can use the
context switch callback to flush the LBR stack.

This patch adds code that uses the context switch callback to
flush the LBR stack when a task is scheduled in. The callback
is enabled only when there are events that use the LBR hardware. This
patch also removes all of the old flush branch stack code.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 ---
arch/x86/kernel/cpu/perf_event.h | 3 +-
arch/x86/kernel/cpu/perf_event_intel.c | 14 +-----
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 38 ++++++++++++--
include/linux/perf_event.h | 6 ---
kernel/events/core.c | 81 ------------------------------
6 files changed, 36 insertions(+), 113 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 7d22972..8868e9b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1880,12 +1880,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
x86_pmu.sched_task(ctx, sched_in);
}

-static void x86_pmu_flush_branch_stack(void)
-{
- if (x86_pmu.flush_branch_stack)
- x86_pmu.flush_branch_stack();
-}
-
void perf_check_microcode(void)
{
if (x86_pmu.check_microcode)
@@ -1912,7 +1906,6 @@ static struct pmu pmu = {
.commit_txn = x86_pmu_commit_txn,

.event_idx = x86_pmu_event_idx,
- .flush_branch_stack = x86_pmu_flush_branch_stack,
.sched_task = x86_pmu_sched_task,
};

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index e70b352..d8165f3 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -428,7 +428,6 @@ struct x86_pmu {
void (*cpu_dead)(int cpu);

void (*check_microcode)(void);
- void (*flush_branch_stack)(void);
void (*sched_task)(struct perf_event_context *ctx,
bool sched_in);

@@ -685,6 +684,8 @@ void intel_pmu_pebs_disable_all(void);

void intel_ds_init(void);

+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_pmu_lbr_reset(void);

void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..ef926ee 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2035,18 +2035,6 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

-static void intel_pmu_flush_branch_stack(void)
-{
- /*
- * Intel LBR does not tag entries with the
- * PID of the current task, then we need to
- * flush it on ctxsw
- * For now, we simply reset it
- */
- if (x86_pmu.lbr_nr)
- intel_pmu_lbr_reset();
-}
-
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2098,7 +2086,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .flush_branch_stack = intel_pmu_flush_branch_stack,
+ .sched_task = intel_pmu_lbr_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 9dd2459..a30bfab 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -181,13 +181,36 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}

-void intel_pmu_lbr_enable(struct perf_event *event)
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

if (!x86_pmu.lbr_nr)
return;
+ /*
+ * When sampling the branck stack in system-wide, it may be
+ * necessary to flush the stack on context switch. This happens
+ * when the branch stack does not tag its entries with the pid
+ * of the current task. Otherwise it becomes impossible to
+ * associate a branch entry with a task. This ambiguity is more
+ * likely to appear when the branch stack supports priv level
+ * filtering and the user sets it to monitor only at the user
+ * level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch
+ * stack with branch from multiple tasks.
+ */
+ if (sched_in) {
+ intel_pmu_lbr_reset();
+ cpuc->lbr_context = ctx;
+ }
+}
+
+void intel_pmu_lbr_enable(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

+ if (!x86_pmu.lbr_nr)
+ return;
/*
* Reset the LBR stack if we changed task context to
* avoid data leaks.
@@ -199,6 +222,8 @@ void intel_pmu_lbr_enable(struct perf_event *event)
cpuc->br_sel = event->hw.branch_reg.reg;

cpuc->lbr_users++;
+ if (cpuc->lbr_users == 1)
+ perf_sched_cb_user_inc(event->ctx->pmu);
}

void intel_pmu_lbr_disable(struct perf_event *event)
@@ -211,10 +236,13 @@ void intel_pmu_lbr_disable(struct perf_event *event)
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);

- if (cpuc->enabled && !cpuc->lbr_users) {
- __intel_pmu_lbr_disable();
- /* avoid stale pointer */
- cpuc->lbr_context = NULL;
+ if (!cpuc->lbr_users) {
+ perf_sched_cb_user_dec(event->ctx->pmu);
+ if (cpuc->enabled) {
+ __intel_pmu_lbr_disable();
+ /* avoid stale pointer */
+ cpuc->lbr_context = NULL;
+ }
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fe92e6b..82f3dd5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -259,11 +259,6 @@ struct pmu {
int (*event_idx) (struct perf_event *event); /*optional */

/*
- * flush branch stack on context-switches (needed in cpu-wide mode)
- */
- void (*flush_branch_stack) (void);
-
- /*
* context-switches callback for CPU PMU. Other PMUs shouldn't set
* this callback
*/
@@ -512,7 +507,6 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
- int nr_branch_stack; /* branch_stack evt */
struct rcu_head rcu_head;
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7431bec..65cac84 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -143,7 +143,6 @@ enum event_type_t {
*/
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
static DEFINE_PER_CPU(int, perf_sched_cb_users);

static atomic_t nr_mmap_events __read_mostly;
@@ -1137,9 +1136,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
if (is_cgroup_event(event))
ctx->nr_cgroups++;

- if (has_branch_stack(event))
- ctx->nr_branch_stack++;
-
list_add_rcu(&event->event_entry, &ctx->event_list);
if (!ctx->nr_events)
perf_pmu_rotate_start(ctx->pmu);
@@ -1302,9 +1298,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
cpuctx->cgrp = NULL;
}

- if (has_branch_stack(event))
- ctx->nr_branch_stack--;
-
ctx->nr_events--;
if (event->attr.inherit_stat)
ctx->nr_stat--;
@@ -2602,64 +2595,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
}

/*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
- struct perf_cpu_context *cpuctx;
- struct pmu *pmu;
- unsigned long flags;
-
- /* no need to flush branch stack if not changing task */
- if (prev == task)
- return;
-
- local_irq_save(flags);
-
- rcu_read_lock();
-
- list_for_each_entry_rcu(pmu, &pmus, entry) {
- cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
- /*
- * check if the context has at least one
- * event using PERF_SAMPLE_BRANCH_STACK
- */
- if (cpuctx->ctx.nr_branch_stack > 0
- && pmu->flush_branch_stack) {
-
- perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
- perf_pmu_disable(pmu);
-
- pmu->flush_branch_stack();
-
- perf_pmu_enable(pmu);
-
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- }
- }
-
- rcu_read_unlock();
-
- local_irq_restore(flags);
-}
-
-/*
* Called from scheduler to add the events of the current task
* with interrupts disabled.
*
@@ -2691,14 +2626,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
perf_cgroup_sched_in(prev, task);

- /* check for system-wide branch_stack events */
- if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
- perf_branch_stack_sched_in(prev, task);
-
- /* check for system-wide branch_stack events */
- if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
- perf_branch_stack_sched_in(prev, task);
-
if (__get_cpu_var(perf_sched_cb_users))
perf_pmu_sched_task(prev, task, true);
}
@@ -3284,10 +3211,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
}
@@ -6773,10 +6696,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_inc(&per_cpu(perf_cgroup_events, cpu));
}
--
1.9.3

2014-07-22 08:10:25

by Yan, Zheng

Subject: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

When the PEBS interrupt threshold is larger than one, the PEBS buffer
may include multiple records for each PEBS event. This patch makes
the code first count how many records each PEBS event has, then
output the samples in batch.

One corner case that needs to be mentioned is that the PEBS hardware
doesn't deal well with collisions, when PEBS events happen close to
each other. The records for the events can be collapsed into a single
one. However, in practice collisions are extremely rare, as long as
different events are used. The periods are typically very large,
so any collision is unlikely. When a collision happens, we can either
drop the PEBS record or use the record to serve multiple events.
This patch chooses the latter approach.
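
Condensed, the drain path described above looks roughly like the sketch
below. This is only a simplified restatement for readability (the
function name is made up for the sketch); the actual
intel_pmu_drain_pebs_nhm() in the diff also checks active_mask,
precise_ip and the buffer bounds.

/*
 * Simplified sketch of the two-pass drain (illustration only).
 */
static void drain_pebs_in_batch(struct pt_regs *iregs, void *base, void *top)
{
	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
	int counts[MAX_PEBS_EVENTS] = {};
	void *at;
	int bit;

	/* pass 1: count how many records each PEBS event has */
	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
		struct pebs_record_nhm *p = at;

		for_each_set_bit(bit, (unsigned long *)&p->status,
				 x86_pmu.max_pebs_events)
			counts[bit]++;
	}

	/* pass 2: find each event's first record, then output its samples in batch */
	for (bit = 0; bit < MAX_PEBS_EVENTS; bit++) {
		if (!counts[bit])
			continue;
		for (at = base; at < top; at += x86_pmu.pebs_record_size) {
			struct pebs_record_nhm *p = at;

			if (p->status & (1ULL << bit))
				break;
		}
		__intel_pmu_pebs_event(cpuc->events[bit], iregs,
				       at, top, counts[bit]);
	}
}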

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 88 +++++++++++++++++++++----------
1 file changed, 59 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 86ef5b0..1e3b8cf 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -989,18 +989,51 @@ static void setup_pebs_sample_data(struct perf_event *event,
}

static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+ struct pt_regs *iregs,
+ void *at, void *top, int count)
{
+ struct perf_output_handle handle;
+ struct perf_event_header header;
struct perf_sample_data data;
struct pt_regs regs;

- if (!intel_pmu_save_and_restart(event))
+ if (!intel_pmu_save_and_restart(event) &&
+ !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
return;

- setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);

- if (perf_event_overflow(event, &data, &regs))
+ if (perf_event_overflow(event, &data, &regs)) {
x86_pmu_stop(event, 0);
+ return;
+ }
+
+ if (count <= 1)
+ return;
+
+ at += x86_pmu.pebs_record_size;
+ count--;
+
+ perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_prepare_sample(&header, &data, event, &regs);
+
+ if (perf_output_begin(&handle, event, header.size * count))
+ return;
+
+ for (; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+ if (!(p->status & (1 << event->hw.idx)))
+ continue;
+
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
+ perf_output_sample(&handle, &header, &data, event);
+
+ count--;
+ if (count == 0)
+ break;
+ }
+
+ perf_output_end(&handle);
}

static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
@@ -1041,61 +1074,58 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
at += n - 1;

- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, at, top, 1);
}

static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct debug_store *ds = cpuc->ds;
- struct perf_event *event = NULL;
- void *at, *top;
- u64 status = 0;
+ struct perf_event *event;
+ void *base, *at, *top;
int bit;
+ int counts[MAX_PEBS_EVENTS] = {};

if (!x86_pmu.pebs_active)
return;

- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;

ds->pebs_index = ds->pebs_buffer_base;

- if (unlikely(at > top))
+ if (unlikely(base >= top))
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
- "Unexpected number of pebs records %ld\n",
- (long)(top - at) / x86_pmu.pebs_record_size);
-
- for (; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;
-
+ /*
+ * PEBS creates only one entry if multiple counters
+ * overflow simultaneously.
+ */
for_each_set_bit(bit, (unsigned long *)&p->status,
x86_pmu.max_pebs_events) {
event = cpuc->events[bit];
if (!test_bit(bit, cpuc->active_mask))
continue;
-
WARN_ON_ONCE(!event);
-
if (!event->attr.precise_ip)
continue;
-
- if (__test_and_set_bit(bit, (unsigned long *)&status))
- continue;
-
- break;
+ counts[bit]++;
}
+ }

- if (!event || bit >= x86_pmu.max_pebs_events)
+ for (bit = 0; bit < MAX_PEBS_EVENTS; bit++) {
+ if (counts[bit] == 0)
continue;
+ event = cpuc->events[bit];
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+ if (p->status & (1 << bit))
+ break;
+ }

- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, at, top, counts[bit]);
}
}

--
1.9.3

2014-07-22 08:11:04

by Yan, Zheng

Subject: [PATCH v3 9/9] tools, perf: Allow the user to disable time stamps

From: Andi Kleen <[email protected]>

Time stamps are currently always implicitly enabled for record.
The old --time/-T option is a nop.

Allow the user to disable timestamps by using --no-time.

This can cause some minor misaccounting (by missing mmaps), but it significantly
lowers the size of perf.data.

The defaults are unchanged.

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/builtin-record.c | 1 +
tools/perf/util/evsel.c | 9 ++++++---
2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 378b85b..8728c7c 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -776,6 +776,7 @@ static const char * const record_usage[] = {
*/
static struct record record = {
.opts = {
+ .sample_time = true,
.mmap_pages = UINT_MAX,
.user_freq = UINT_MAX,
.user_interval = ULLONG_MAX,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 8606175..1bc4093 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -632,9 +632,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
if (opts->period)
perf_evsel__set_sample_bit(evsel, PERIOD);

- if (!perf_missing_features.sample_id_all &&
- (opts->sample_time || !opts->no_inherit ||
- target__has_cpu(&opts->target) || per_cpu))
+ /*
+ * When the user explicitely disabled time don't force it here.
+ */
+ if (opts->sample_time &&
+ (!perf_missing_features.sample_id_all &&
+ (!opts->no_inherit || target__has_cpu(&opts->target) || per_cpu)))
perf_evsel__set_sample_bit(evsel, TIME);

if (opts->raw_samples) {
--
1.9.3

2014-07-22 08:11:50

by Yan, Zheng

Subject: [PATCH v3 8/9] perf, x86: enlarge PEBS buffer

Currently the PEBS buffer size is 4k; it can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as the BTS buffer); 64k of memory can hold about 330 PEBS
records. This will significantly reduce the number of PMIs when a
large PEBS interrupt threshold is used.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 99b07de0..33b4c0e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
#define BTS_RECORD_SIZE 24

#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE PAGE_SIZE
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_FIXUP_SIZE PAGE_SIZE

/*
--
1.9.3

2014-07-22 08:12:40

by Yan, Zheng

Subject: [PATCH v3 7/9] perf, x86: drain PEBS buffer during context switch

Flush the PEBS buffer during a context switch if the PEBS interrupt threshold
is larger than one. This allows perf to supply the TID for sample outputs.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 3 +++
arch/x86/kernel/cpu/perf_event_intel.c | 11 +++++++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 32 ++++++++++++++++++++++++++++--
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 --
4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index fa8dfd4..c4a746c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -148,6 +148,7 @@ struct cpu_hw_events {
*/
struct debug_store *ds;
u64 pebs_enabled;
+ bool pebs_sched_cb_enabled;

/*
* Intel LBR bits
@@ -683,6 +684,8 @@ void intel_pmu_pebs_enable_all(void);

void intel_pmu_pebs_disable_all(void);

+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_ds_init(void);

void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index ef926ee..cb5a838 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2035,6 +2035,15 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ if (x86_pmu.pebs_active)
+ intel_pmu_pebs_sched_task(ctx, sched_in);
+ if (x86_pmu.lbr_nr)
+ intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2086,7 +2095,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .sched_task = intel_pmu_lbr_sched_task,
+ .sched_task = intel_pmu_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 1e3b8cf..99b07de0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -705,6 +705,18 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}

+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+ struct pt_regs regs;
+ x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (!sched_in)
+ intel_pmu_drain_pebs_buffer();
+}
+
/*
* Flags PEBS can handle without an PMI.
*
@@ -745,13 +757,20 @@ void intel_pmu_pebs_enable(struct perf_event *event)
* When the event is constrained enough we can use a larger
* threshold and run the event with less frequent PMI.
*/
- if (0 && /* disable this temporarily */
- (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
+ if ((hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
threshold = ds->pebs_absolute_maximum -
x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ if (first_pebs) {
+ perf_sched_cb_user_inc(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = true;
+ }
} else {
threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ if (cpuc->pebs_sched_cb_enabled) {
+ perf_sched_cb_user_dec(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = false;
+ }
}
if (first_pebs || ds->pebs_interrupt_threshold > threshold)
ds->pebs_interrupt_threshold = threshold;
@@ -767,8 +786,17 @@ void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;
+
+ if (ds->pebs_interrupt_threshold >
+ ds->pebs_buffer_base + x86_pmu.pebs_record_size)
+ intel_pmu_drain_pebs_buffer();

cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+ if (cpuc->pebs_sched_cb_enabled && !pebs_is_enabled(cpuc)) {
+ perf_sched_cb_user_dec(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = false;
+ }

if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index a30bfab..e4f3a09 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -185,8 +185,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

- if (!x86_pmu.lbr_nr)
- return;
/*
* When sampling the branck stack in system-wide, it may be
* necessary to flush the stack on context switch. This happens
--
1.9.3

2014-07-22 08:10:21

by Yan, Zheng

Subject: [PATCH v3 5/9] perf, x86: large PEBS interrupt threshold

PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.

For frequently occurring events (like cycles or branches or loads/stores)
this in turn requires using a relatively high sampling period to avoid
overloading the system with PMI processing. This in turn increases the
sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it cannot
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support an adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can, however, supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases: basically, as long as
you can do without a callgraph and can set a fixed period, you can use
this new PEBS mode.

The main benefit is the ability to support much lower sampling periods
(down to -c 1000) without excessive overhead.

One use case, for example, is to increase the resolution of the c2c tool.
Another is double-checking when you suspect that standard sampling has
too much sampling error.

Some numbers on the overhead, using cycle soak, comparing
"perf record --no-time -e cycles:p -c" to "perf record -e cycles:p -c"

period     plain   multi   delta
10003      15      5       10
20003      15.7    4       11.7
40003      8.7     2.5     6.2
80003      4.1     1.4     2.7
100003     3.6     1.2     2.4
800003     4.4     1.4     3
1000003    0.6     0.4     0.2
2000003    0.4     0.3     0.1
4000003    0.3     0.2     0.1
10000003   0.3     0.2     0.1

The interesting part is the delta between multi-PEBS and normal PEBS. Above
-c 1000003 it does not really matter because the basic overhead is so low.
With periods below 80003 it becomes interesting.

Note that in some other workloads (e.g. kernbench) the smaller sampling periods
cause much more overhead without multi-PEBS; up to 80% overhead (and throttling)
has been observed with -c 10003. Multi-PEBS generally does not throttle.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 40 +++++++++++++++++++++++++++----
1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 858c4ee..86ef5b0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -251,7 +251,7 @@ static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
int node = cpu_to_node(cpu);
- int max, thresh = 1; /* always use a single PEBS record */
+ int max;
void *buffer, *ibuffer;

if (!x86_pmu.pebs)
@@ -281,9 +281,6 @@ static int alloc_pebs_buffer(int cpu)
ds->pebs_absolute_maximum = ds->pebs_buffer_base +
max * x86_pmu.pebs_record_size;

- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- thresh * x86_pmu.pebs_record_size;
-
return 0;
}

@@ -708,15 +705,35 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}

+/*
+ * Flags PEBS can handle without an PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+ PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+ PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+ PERF_SAMPLE_TRANSACTION)
+
+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+ return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;
+ bool first_pebs;
+ u64 threshold;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
if (!event->attr.freq)
hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;

+ first_pebs = !pebs_is_enabled(cpuc);
cpuc->pebs_enabled |= 1ULL << hwc->idx;

if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -724,6 +741,21 @@ void intel_pmu_pebs_enable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;

+ /*
+ * When the event is constrained enough we can use a larger
+ * threshold and run the event with less frequent PMI.
+ */
+ if (0 && /* disable this temporarily */
+ (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
+ !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
+ threshold = ds->pebs_absolute_maximum -
+ x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }
+ if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+ ds->pebs_interrupt_threshold = threshold;
+
/* Use auto-reload if possible to save a MSR write in the PMI */
if (hwc->flags &PERF_X86_EVENT_AUTO_RELOAD) {
ds->pebs_event_reset[hwc->idx] =
--
1.9.3

2014-07-22 08:13:33

by Yan, Zheng

Subject: [PATCH v3 4/9] perf, x86: introduce setup_pebs_sample_data()

Move the code that sets up PEBS sample data into a separate function.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 63 ++++++++++++++++++-------------
1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ab91b11..858c4ee 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -868,8 +868,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
return txn;
}

-static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
/*
* We cast to the biggest pebs_record but are careful not to
@@ -877,21 +879,16 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
*/
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct pebs_record_hsw *pebs = __pebs;
- struct perf_sample_data data;
- struct pt_regs regs;
u64 sample_type;
int fll, fst;

- if (!intel_pmu_save_and_restart(event))
- return;
-
fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
fst = event->hw.flags & (PERF_X86_EVENT_PEBS_ST |
PERF_X86_EVENT_PEBS_ST_HSW);

- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(data, 0, event->hw.last_period);

- data.period = event->hw.last_period;
+ data->period = event->hw.last_period;
sample_type = event->attr.sample_type;

/*
@@ -902,19 +899,19 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* Use latency for weight (only avail with PEBS-LL)
*/
if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
- data.weight = pebs->lat;
+ data->weight = pebs->lat;

/*
* data.data_src encodes the data source
*/
if (sample_type & PERF_SAMPLE_DATA_SRC) {
if (fll)
- data.data_src.val = load_latency_data(pebs->dse);
+ data->data_src.val = load_latency_data(pebs->dse);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST_HSW)
- data.data_src.val =
+ data->data_src.val =
precise_store_data_hsw(event, pebs->dse);
else
- data.data_src.val = precise_store_data(pebs->dse);
+ data->data_src.val = precise_store_data(pebs->dse);
}
}

@@ -928,35 +925,47 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
* A possible PERF_SAMPLE_REGS will have to transfer all regs.
*/
- regs = *iregs;
- regs.flags = pebs->flags;
- set_linear_ip(&regs, pebs->ip);
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
+ *regs = *iregs;
+ regs->flags = pebs->flags;
+ set_linear_ip(regs, pebs->ip);
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;

if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs->real_ip;
- regs.flags |= PERF_EFLAGS_EXACT;
- } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
- regs.flags |= PERF_EFLAGS_EXACT;
+ regs->ip = pebs->real_ip;
+ regs->flags |= PERF_EFLAGS_EXACT;
+ } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+ regs->flags |= PERF_EFLAGS_EXACT;
else
- regs.flags &= ~PERF_EFLAGS_EXACT;
+ regs->flags &= ~PERF_EFLAGS_EXACT;

if ((event->attr.sample_type & PERF_SAMPLE_ADDR) &&
x86_pmu.intel_cap.pebs_format >= 1)
- data.addr = pebs->dla;
+ data->addr = pebs->dla;

if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((event->attr.sample_type & PERF_SAMPLE_WEIGHT) && !fll)
- data.weight = intel_hsw_weight(pebs);
+ data->weight = intel_hsw_weight(pebs);

if (event->attr.sample_type & PERF_SAMPLE_TRANSACTION)
- data.txn = intel_hsw_transaction(pebs);
+ data->txn = intel_hsw_transaction(pebs);
}

if (has_branch_stack(event))
- data.br_stack = &cpuc->lbr_stack;
+ data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs)
+{
+ struct perf_sample_data data;
+ struct pt_regs regs;
+
+ if (!intel_pmu_save_and_restart(event))
+ return;
+
+ setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);

if (perf_event_overflow(event, &data, &regs))
x86_pmu_stop(event, 0);
--
1.9.3

2014-07-22 08:10:15

by Yan, Zheng

Subject: [PATCH v3 3/9] perf, x86: use the PEBS auto reload mechanism when possible

When a fixed period is specified, this patch makes perf use the PEBS
auto-reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++------
arch/x86/kernel/cpu/perf_event.h | 1 +
arch/x86/kernel/cpu/perf_event_intel_ds.c | 9 +++++++++
3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8868e9b..60593bc 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -979,13 +979,16 @@ int x86_perf_event_set_period(struct perf_event *event)

per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;

- /*
- * The hw event starts counting from this event offset,
- * mark it to be able to extra future deltas:
- */
- local64_set(&hwc->prev_count, (u64)-left);
+ if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+ local64_read(&hwc->prev_count) != (u64)-left) {
+ /*
+ * The hw event starts counting from this event offset,
+ * mark it to be able to extra future deltas:
+ */
+ local64_set(&hwc->prev_count, (u64)-left);

- wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ }

/*
* Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d8165f3..fa8dfd4 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -69,6 +69,7 @@ struct event_constraint {
#define PERF_X86_EVENT_PEBS_ST 0x2 /* st data address sampling */
#define PERF_X86_EVENT_PEBS_ST_HSW 0x4 /* haswell style st data sampling */
#define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
+#define PERF_X86_EVENT_AUTO_RELOAD 0x10 /* use PEBS auto-reload */

struct amd_nb {
int nb_id; /* NorthBridge id */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 980970c..ab91b11 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -714,6 +714,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
+ if (!event->attr.freq)
+ hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;

cpuc->pebs_enabled |= 1ULL << hwc->idx;

@@ -721,6 +723,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+
+ /* Use auto-reload if possible to save a MSR write in the PMI */
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ ds->pebs_event_reset[hwc->idx] =
+ (u64)-hwc->sample_period & x86_pmu.cntval_mask;
+ }
}

void intel_pmu_pebs_disable(struct perf_event *event)
@@ -739,6 +747,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);

hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+ hwc->flags &= ~PERF_X86_EVENT_AUTO_RELOAD;
}

void intel_pmu_pebs_enable_all(void)
--
1.9.3

2014-07-22 08:14:29

by Yan, Zheng

Subject: [PATCH v3 1/9] perf, core: introduce pmu context switch callback

The callback is invoked when a process is scheduled in or out.
It provides a mechanism for later patches to save/restore the LBR
stack. For the schedule-in case, the callback is invoked at
the same place that the flush branch stack callback is invoked,
so it can also replace the flush branch stack callback. To
avoid unnecessary overhead, the callback is enabled only when
there are events that use the LBR stack.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 +++++
arch/x86/kernel/cpu/perf_event.h | 2 ++
include/linux/perf_event.h | 9 ++++++
kernel/events/core.c | 63 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 81 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2bdfbff..7d22972 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1874,6 +1874,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
NULL,
};

+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (x86_pmu.sched_task)
+ x86_pmu.sched_task(ctx, sched_in);
+}
+
static void x86_pmu_flush_branch_stack(void)
{
if (x86_pmu.flush_branch_stack)
@@ -1907,6 +1913,7 @@ static struct pmu pmu = {

.event_idx = x86_pmu_event_idx,
.flush_branch_stack = x86_pmu_flush_branch_stack,
+ .sched_task = x86_pmu_sched_task,
};

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..e70b352 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -429,6 +429,8 @@ struct x86_pmu {

void (*check_microcode)(void);
void (*flush_branch_stack)(void);
+ void (*sched_task)(struct perf_event_context *ctx,
+ bool sched_in);

/*
* Intel Arch Perfmon v2+
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 707617a..fe92e6b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -262,6 +262,13 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * context-switches callback for CPU PMU. Other PMUs shouldn't set
+ * this callback
+ */
+ void (*sched_task) (struct perf_event_context *ctx,
+ bool sched_in);
};

/**
@@ -557,6 +564,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
extern void perf_event_print_debug(void);
extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_user_inc(struct pmu *pmu);
+extern void perf_sched_cb_user_dec(struct pmu *pmu);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67e3b9c..7431bec 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -144,6 +144,7 @@ enum event_type_t {
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_users);

static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
@@ -2362,6 +2363,58 @@ unlock:
}
}

+void perf_sched_cb_user_inc(struct pmu *pmu)
+{
+ this_cpu_inc(perf_sched_cb_users);
+}
+
+void perf_sched_cb_user_dec(struct pmu *pmu)
+{
+ this_cpu_dec(perf_sched_cb_users);
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+ struct task_struct *next,
+ bool sched_in)
+{
+ struct perf_cpu_context *cpuctx;
+ struct pmu *pmu;
+ unsigned long flags;
+
+ if (prev == next)
+ return;
+
+ local_irq_save(flags);
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(pmu, &pmus, entry) {
+ if (pmu->sched_task) {
+ cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ perf_pmu_disable(pmu);
+
+ pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+ perf_pmu_enable(pmu);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ /* only CPU PMU has context switch callback */
+ break;
+ }
+ }
+
+ rcu_read_unlock();
+
+ local_irq_restore(flags);
+}
+
#define for_each_task_context_nr(ctxn) \
for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)

@@ -2381,6 +2434,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
{
int ctxn;

+ if (__get_cpu_var(perf_sched_cb_users))
+ perf_pmu_sched_task(task, next, false);
+
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);

@@ -2638,6 +2694,13 @@ void __perf_event_task_sched_in(struct task_struct *prev,
/* check for system-wide branch_stack events */
if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
perf_branch_stack_sched_in(prev, task);
+
+ /* check for system-wide branch_stack events */
+ if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
+ perf_branch_stack_sched_in(prev, task);
+
+ if (__get_cpu_var(perf_sched_cb_users))
+ perf_pmu_sched_task(prev, task, true);
}

static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
--
1.9.3

2014-07-22 16:16:23

by Andi Kleen

Subject: Re: [PATCH v3 5/9] perf, x86: large PEBS interrupt threshold

> + /*
> + * When the event is constrained enough we can use a larger
> + * threshold and run the event with less frequent PMI.
> + */
> + if (0 && /* disable this temporarily */

Where in the patchkit does it get reenabled?

-Andi

> + (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
> + !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
> + threshold = ds->pebs_absolute_maximum -
> + x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> + } else {
> + threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
> + }
> + if (first_pebs || ds->pebs_interrupt_threshold > threshold)
> + ds->pebs_interrupt_threshold = threshold;

2014-07-23 00:59:09

by Yan, Zheng

Subject: Re: [PATCH v3 5/9] perf, x86: large PEBS interrupt threshold

On 07/23/2014 12:16 AM, Andi Kleen wrote:
>> + /*
>> + * When the event is constrained enough we can use a larger
>> + * threshold and run the event with less frequent PMI.
>> + */
>> + if (0 && /* disable this temporarily */
>
> Where in the patchkit does it get reenabled?

enabled by patch 7
>
> -Andi
>
>> + (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
>> + !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
>> + threshold = ds->pebs_absolute_maximum -
>> + x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
>> + } else {
>> + threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
>> + }
>> + if (first_pebs || ds->pebs_interrupt_threshold > threshold)
>> + ds->pebs_interrupt_threshold = threshold;

2014-07-25 08:10:42

by Peter Zijlstra

Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On Tue, Jul 22, 2014 at 04:09:59PM +0800, Yan, Zheng wrote:
> One corner case needs to mention is that the PEBS hardware doesn't
> deal well with collisions, when PEBS events happen near to each
> other. The records for the events can be collapsed into a single
> one. However in practice collisions are extremely rare, as long as
> different events are used. The periods are typically very large,
> so any collision is unlikely. When collision happens, we can either
> drop the PEBS record or use the record to serve multiple events.
> This patch chooses the later approach.

You can't: the events might have different security contexts.

Remember, the overflow bit is set from the overflow until the PEBS
event is generated, and this is quite a long time. So if another PEBS event
gets generated while the other is still pending, it will have both bits
set, even though the second bit is for another (unrelated) counter.

The unrelated counter might not have privilege to observe the data of
the generated event.

I think you can unwind and fully correct this trainwreck. But simply
delivering an event with multiple bits set to all relevant events is
wrong and might leak sensitive information.

2014-07-25 08:34:50

by Yan, Zheng

Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On 07/25/2014 04:10 PM, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 04:09:59PM +0800, Yan, Zheng wrote:
>> One corner case needs to mention is that the PEBS hardware doesn't
>> deal well with collisions, when PEBS events happen near to each
>> other. The records for the events can be collapsed into a single
>> one. However in practice collisions are extremely rare, as long as
>> different events are used. The periods are typically very large,
>> so any collision is unlikely. When collision happens, we can either
>> drop the PEBS record or use the record to serve multiple events.
>> This patch chooses the later approach.
>
> You can't.. the events might have different security context.
>
> Remember, the overflow bit is set from the overflow until the PEBS
> event is generated, this is quite a long time. So if another PEBS event
> gets generated while the other is still pending it will have both bits
> set. Even though the second bit is for another (unrelated) counter.
>
> The unrelated counter might not have privilege to observe the data of
> the generated event.
>
> I think you can unwind and fully correct this trainwreck.

Could you give more information on how to do this?

Regards
Yan, Zheng

> But simply
> delivering an even with multiple bits set to all relevant events is
> wrong and might leak sensitive information.
>

2014-07-25 14:07:12

by Peter Zijlstra

Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On Fri, Jul 25, 2014 at 04:34:44PM +0800, Yan, Zheng wrote:
> On 07/25/2014 04:10 PM, Peter Zijlstra wrote:
> > On Tue, Jul 22, 2014 at 04:09:59PM +0800, Yan, Zheng wrote:
> >> One corner case needs to mention is that the PEBS hardware doesn't
> >> deal well with collisions, when PEBS events happen near to each
> >> other. The records for the events can be collapsed into a single
> >> one. However in practice collisions are extremely rare, as long as
> >> different events are used. The periods are typically very large,
> >> so any collision is unlikely. When collision happens, we can either
> >> drop the PEBS record or use the record to serve multiple events.
> >> This patch chooses the later approach.
> >
> > You can't.. the events might have different security context.
> >
> > Remember, the overflow bit is set from the overflow until the PEBS
> > event is generated, this is quite a long time. So if another PEBS event
> > gets generated while the other is still pending it will have both bits
> > set. Even though the second bit is for another (unrelated) counter.
> >
> > The unrelated counter might not have privilege to observe the data of
> > the generated event.
> >
> > I think you can unwind and fully correct this trainwreck.
>
> could you give more information how to do this.

We went over that already:

lkml.kernel.org/r/[email protected]

Now ignore the patch there; it's nonsense.

But the idea is that the bit gets cleared upon writing the PEBS record.
So look to the next record and see which bit got cleared.

Furthermore, we know that all bits set at PMI time are in-progress and
can therefore be cleared from the last record.

This should allow us to iterate the entire thing backwards and provide
a unique event for each record.

So take this series of 2 records and a PMI:

C0 C1 C3 C4
---------------
O
| O
| |
| A < R1
| O
A | < R2
---------+----- < PMI

O - overflow
A - assist

So at PMI time we have C3 set in the overflow mask; our last record R2
will have both C0 and C3 set, and we clear C3 because we know it cannot have
been that. Then for R1 we have C0 and C1 set, but because R2 was C0 we
can clear C0 from R1, finding it was indeed C1.

So typically we'd have one event set and no problem, but in case there's
more we can reconstruct with such a backwards pass from a known good
state.

But when in doubt, we should drop the record; it's the safest choice.
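
In rough C, that backward pass could look something like the sketch
below. This is an illustration of the idea only, not code from the
posted patches; the function name is made up, and it assumes p->status
still carries the overflow bitmask as it was when the record was
written, with pmi_overflow_mask being the set of counters still marked
overflowed at PMI time (known to be in-progress).

/*
 * Illustration only: attribute each PEBS record to a unique counter by
 * walking the buffer backwards.
 */
static void decode_pebs_record_owners(void *base, void *top, u64 pmi_overflow_mask)
{
	u64 pending = pmi_overflow_mask;	/* in-progress at PMI, can't own a record */
	void *at;

	for (at = top - x86_pmu.pebs_record_size; at >= base;
	     at -= x86_pmu.pebs_record_size) {
		struct pebs_record_nhm *p = at;
		u64 candidates = p->status & ~pending;

		if (hweight64(candidates) == 1) {
			int bit = __ffs64(candidates);

			/* the record at 'at' belongs to counter 'bit' */
			/*
			 * Its overflow bit was only cleared when this record
			 * was written, so treat it as pending (spurious) in
			 * the older records we look at next.
			 */
			pending |= 1ULL << bit;
		} else {
			/* ambiguous record: safest to drop it */
		}
	}
}

On the example above, the first iteration clears C3 from R2 (leaving
C0), and the second clears C0 from R1 (leaving C1).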



2014-07-25 15:04:51

by Andi Kleen

Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

> You can't.. the events might have different security context.
>
> Remember, the overflow bit is set from the overflow until the PEBS
> event is generated, this is quite a long time. So if another PEBS event
> gets generated while the other is still pending it will have both bits
> set. Even though the second bit is for another (unrelated) counter.

When an event is not allowed by some policy it should be disabled
in global ctrl right? And disabling makes sure overflow is cleared,
and PEBS will not report it.

When it's not disabled it could happen any time and there
is no isolation.

Or is the concern that the PEBS buffer may not be flushed
on event switch/disable and you see something stale? I believe it is
flushed.

> I think you can unwind and fully correct this trainwreck. But simply
> delivering an even with multiple bits set to all relevant events is
> wrong and might leak sensitive information.

In theory we could double-check that the event is enabled,
but I don't think it's really needed.

-Andi
--
[email protected] -- Speaking for myself only.

2014-07-25 15:53:46

by Peter Zijlstra

Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On Fri, Jul 25, 2014 at 05:04:45PM +0200, Andi Kleen wrote:
> > You can't.. the events might have different security context.
> >
> > Remember, the overflow bit is set from the overflow until the PEBS
> > event is generated, this is quite a long time. So if another PEBS event
> > gets generated while the other is still pending it will have both bits
> > set. Even though the second bit is for another (unrelated) counter.
>
> When an event is not allowed by some policy it should be disabled
> in global ctrl right? And disabling makes sure overflow is cleared,
> and PEBS will not report it.
>
> When it's not disabled it could happen any time and there
> is no isolation.
>
> Or is the concern that the PEBS buffer may not be flushed
> on event switch/disable and you see something stale? I believe it is
> flushed.

Suppose two PEBS events, one of which has exclude_kernel set. It overflows
before entering the kernel; the other event then generates PEBS records from
inside the kernel with both events marked in the overflow field.

And only once we leave the kernel can the exclude_kernel event tick
again and trigger the assist, finally clearing the bit.

If you were to report the records to both events, one would get a lot of
kernel info he was not entitled to.



2014-07-25 16:40:46

by Andi Kleen

Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

> Suppose two pebs events, one has exclude_kernel set. It overflows,
> before entering the kernel, the other event generates PEBS records from
> inside the kernel with both events marked in the overflow field.
>
> And only once we leave the kernel can the exclude_kernel event tick
> again and trigger the assist, finalyl clearing the bit.
>
> If you were to report the records to both events, one would get a lot of
> kernel info he was not entitled to.

OK, that case can be filtered in software. Shouldn't be too difficult.
Perhaps just using the ip:

if (event->attr.exclude_kernel && pebs->ip >= __PAGE_OFFSET)
skip;
if (event->attr.exclude_user && pebs->ip < __PAGE_OFFSET)
skip;

This would also help with the existing skid.

Any other concerns?

-Andi

--
[email protected] -- Speaking for myself only.

2014-07-28 02:24:37

by Yan, Zheng

Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On 07/25/2014 10:06 PM, Peter Zijlstra wrote:
> On Fri, Jul 25, 2014 at 04:34:44PM +0800, Yan, Zheng wrote:
>> On 07/25/2014 04:10 PM, Peter Zijlstra wrote:
>>> On Tue, Jul 22, 2014 at 04:09:59PM +0800, Yan, Zheng wrote:
>>>> One corner case needs to mention is that the PEBS hardware doesn't
>>>> deal well with collisions, when PEBS events happen near to each
>>>> other. The records for the events can be collapsed into a single
>>>> one. However in practice collisions are extremely rare, as long as
>>>> different events are used. The periods are typically very large,
>>>> so any collision is unlikely. When collision happens, we can either
>>>> drop the PEBS record or use the record to serve multiple events.
>>>> This patch chooses the later approach.
>>>
>>> You can't.. the events might have different security context.
>>>
>>> Remember, the overflow bit is set from the overflow until the PEBS
>>> event is generated, this is quite a long time. So if another PEBS event
>>> gets generated while the other is still pending it will have both bits
>>> set. Even though the second bit is for another (unrelated) counter.
>>>
>>> The unrelated counter might not have privilege to observe the data of
>>> the generated event.
>>>
>>> I think you can unwind and fully correct this trainwreck.
>>
>> could you give more information how to do this.
>
> We went over that already:
>
> lkml.kernel.org/r/[email protected]
>
> Now ignore the patch there, its nonsense.
>
> But the idea is that the bit gets cleared upon writing the PEBS record.
> So look to the next record and see which bit got cleared.
>
> Furthermore, we know that all bits set at PMI time are in-progress and
> can therefore be cleared from the last record.
>
> This should allow us to iterate the entire thing backwards and provide
> a unique event for each record.
>
> So take this series of 2 records and a PMI:
>
> C0 C1 C3 C4
> ---------------
> O
> | O
> | |
> | A < R1
> | O
> A | < R2
> ---------+----- < PMI
>
> O - overflow
> A - assist
>
> So at PMI time we have C3 set in the overflow mask, our last even R2
> will have both C0 and C3 set, we clear C3 because we know it cannot have
> been that. Then for R1 we have C0 and C1 set, but because R2 was C0 we
> can clear C0 from R1, finding it was indeed C1.

I don't think this method works for the interrupt threshold > 1 case. When a collision
happens, the hardware only creates one PEBS record. The status in the next record has
nothing to do with the collision record.
>
> So typically we'd have one event set and no problem, but in case there's
> more we can reconstruct with such a backwards pass from a known good
> state.
>
> But when in doubt, we should drop the record, its the safest choice.

The problem is that, in some cases, each PEBS record has more than one event
set, so we would drop all of the records.

Regards
Yan, Zheng

2014-07-28 03:20:25

by Yan, Zheng

Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On 07/26/2014 12:40 AM, Andi Kleen wrote:
>> Suppose two pebs events, one has exclude_kernel set. It overflows,
>> before entering the kernel, the other event generates PEBS records from
>> inside the kernel with both events marked in the overflow field.
>>
>> And only once we leave the kernel can the exclude_kernel event tick
>> again and trigger the assist, finalyl clearing the bit.
>>
>> If you were to report the records to both events, one would get a lot of
>> kernel info he was not entitled to.
>
> Ok that case can be filtered in software. Shouldn't be too difficult.
> Perhaps just using ip
>
> if (event->attr.exclude_kernel && pebs->ip >= __PAGE_OFFSET)
> skip;
> if (event->attr.exclude_user && pebs->ip < __PAGE_OFFSET)
> skip;
>
> This would also help with the existing skid.
>
> Any other concerns?
>
> -Andi
>

How about the following patch?
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 33b4c0e..ea76507 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -1016,6 +1016,16 @@ static void setup_pebs_sample_data(struct perf_event *event,
data->br_stack = &cpuc->lbr_stack;
}

+static inline bool intel_pmu_pebs_filter(struct perf_event *event,
+ struct pebs_record_nhm *record)
+{
+ if (event->attr.exclude_user && !kernel_ip(record->ip))
+ return true;
+ if (event->attr.exclude_kernel && kernel_ip(record->ip))
+ return true;
+ return false;
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs,
void *at, void *top, int count)
@@ -1052,6 +1062,8 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
struct pebs_record_nhm *p = at;
if (!(p->status & (1 << event->hw.idx)))
continue;
+ if (intel_pmu_pebs_filter(event, p))
+ continue;

setup_pebs_sample_data(event, iregs, at, &data, &regs);
perf_output_sample(&handle, &header, &data, event);
@@ -1139,6 +1151,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
WARN_ON_ONCE(!event);
if (!event->attr.precise_ip)
continue;
+ if (intel_pmu_pebs_filter(event, p))
+ continue;
counts[bit]++;
}
}
@@ -1149,7 +1163,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
event = cpuc->events[bit];
for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;
- if (p->status & (1 << bit))
+ if ((p->status & (1 << bit)) &&
+ !intel_pmu_pebs_filter(event, p))
break;
}

---
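
To see the effect of the check in isolation, it can be modelled in a few lines
of stand-alone C. Here kernel_ip() is approximated by a hypothetical kernel/user
address split, and all names are illustrative, not the driver's:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* illustrative kernel/user boundary, standing in for kernel_ip() */
#define FAKE_KERNEL_BOUNDARY 0xffff800000000000ULL

struct fake_attr {
	bool exclude_user;
	bool exclude_kernel;
};

static bool fake_kernel_ip(uint64_t ip)
{
	return ip >= FAKE_KERNEL_BOUNDARY;
}

/* returns true when the record at 'ip' must be dropped for this event */
static bool fake_pebs_filter(const struct fake_attr *attr, uint64_t ip)
{
	if (attr->exclude_user && !fake_kernel_ip(ip))
		return true;
	if (attr->exclude_kernel && fake_kernel_ip(ip))
		return true;
	return false;
}

int main(void)
{
	struct fake_attr user_only = { .exclude_kernel = true };

	/* a kernel-space ip is dropped (1), a user-space ip is kept (0) */
	printf("%d\n", fake_pebs_filter(&user_only, 0xffffffff81000000ULL));
	printf("%d\n", fake_pebs_filter(&user_only, 0x0000000000400000ULL));
	return 0;
}

This is also why the filter addresses the cycles:u case mentioned below: a
record whose ip lies in the kernel is never handed to an event that excludes
the kernel.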

2014-07-28 03:34:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

> How about the following patch?

Looks good to me.

This will also solve the existing problem that perf record -e cycles:u
... gives kernel samples too.

-Andi

> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index 33b4c0e..ea76507 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -1016,6 +1016,16 @@ static void setup_pebs_sample_data(struct perf_event *event,
> data->br_stack = &cpuc->lbr_stack;
> }
>
> +static inline bool intel_pmu_pebs_filter(struct perf_event *event,
> + struct pebs_record_nhm *record)
> +{
> + if (event->attr.exclude_user && !kernel_ip(record->ip))
> + return true;
> + if (event->attr.exclude_kernel && kernel_ip(record->ip))
> + return true;
> + return false;
> +}
> +
> static void __intel_pmu_pebs_event(struct perf_event *event,
> struct pt_regs *iregs,
> void *at, void *top, int count)
> @@ -1052,6 +1062,8 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
> struct pebs_record_nhm *p = at;
> if (!(p->status & (1 << event->hw.idx)))
> continue;
> + if (intel_pmu_pebs_filter(event, p))
> + continue;
>
> setup_pebs_sample_data(event, iregs, at, &data, &regs);
> perf_output_sample(&handle, &header, &data, event);
> @@ -1139,6 +1151,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
> WARN_ON_ONCE(!event);
> if (!event->attr.precise_ip)
> continue;
> + if (intel_pmu_pebs_filter(event, p))
> + continue;
> counts[bit]++;
> }
> }
> @@ -1149,7 +1163,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
> event = cpuc->events[bit];
> for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> struct pebs_record_nhm *p = at;
> - if (p->status & (1 << bit))
> + if ((p->status & (1 << bit)) &&
> + !intel_pmu_pebs_filter(event, p))
> break;
> }
>
> ---
>

--
[email protected] -- Speaking for myself only.

2014-07-28 03:36:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

> I don't think this method works for the interrupt threshold > 1 case. When a collision
> happens, the hardware creates only one PEBS record. The status in the next record has
> nothing to do with the collided record.
>
It does not work even for the threshold == 1 case, because the same could happen
with a different PEBS event. Of course, in any case it's very unlikely ...

>
> > So typically we'd have one event set and no problem, but in case there's
> > more we can reconstruct with such a backwards pass from a known good
> > state.
> >
> > But when in doubt, we should drop the record, it's the safest choice.
>
> The problem is that, in some cases, every PEBS record has more than one event
> set, so we would drop all records.

Just dropping is fine imho, this should be rare.

-Andi
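
In code, the "just drop it" policy amounts to something like the following when
attributing a single record (a sketch with illustrative names: 'pebs_enabled'
stands for the mask of active PEBS counters, and __ffs64() is the generic
bitops helper):

/*
 * Return the index of the only counter that can own this record, or -1
 * when no counter or more than one counter claims it, in which case the
 * record is dropped rather than attributed to the wrong event.
 */
static int pebs_record_owner(u64 status, u64 pebs_enabled)
{
	u64 candidates = status & pebs_enabled;

	if (!candidates)
		return -1;
	if (candidates & (candidates - 1))
		return -1;	/* more than one bit set: ambiguous */
	return __ffs64(candidates);
}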

2014-07-28 06:53:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On Mon, Jul 28, 2014 at 10:24:30AM +0800, Yan, Zheng wrote:

> I don't think this method works for the interrupt threshold > 1 case. When a collision
> happens, the hardware creates only one PEBS record. The status in the next record has
> nothing to do with the collided record.

Andi previously stated that the hardware will always generate a record
for each event:

http://marc.info/?l=linux-kernel&m=140129731708247

Please lock yourselves in a room and do not come out until you can give a
straight and coherent answer.



2014-07-28 06:54:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On Fri, Jul 25, 2014 at 06:40:41PM +0200, Andi Kleen wrote:
> > Suppose two PEBS events, one with exclude_kernel set. It overflows;
> > before entering the kernel, the other event generates PEBS records from
> > inside the kernel with both events marked in the overflow field.
> >
> > And only once we leave the kernel can the exclude_kernel event tick
> > again and trigger the assist, finally clearing the bit.
> >
> > If you were to report the records to both events, one would get a lot of
> > kernel info he was not entitled to.
>
> Ok that case can be filtered in software. Shouldn't be too difficult.
> Perhaps just using ip
>
> if (event->attr.exclude_kernel && pebs->ip >= __PAGE_OFFSET)
> skip;
> if (event->attr.exclude_user && pebs->ip < __PAGE_OFFSET)
> skip;
>
> This would also help with the existing skid.
>
> Any other concerns?

Yeah, why fuck about and do ugly hacks when you can actually do it
right? That way you're sure you've not forgotten anything.



2014-07-31 07:31:39

by Yan, Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On 07/28/2014 02:52 PM, Peter Zijlstra wrote:
> On Mon, Jul 28, 2014 at 10:24:30AM +0800, Yan, Zheng wrote:
>
>> I don't think this method works for the interrupt threshold > 1 case. When a collision
>> happens, the hardware creates only one PEBS record. The status in the next record has
>> nothing to do with the collided record.
>
> Andi previously stated that the hardware will always generate a record
> for each event:
>
> http://marc.info/?l=linux-kernel&m=140129731708247
>
> Please lock yourselves in a room and do not come out until you can give a
> straight and coherent answer.
>

Andi,

please confirm this

Regards
Yan, Zheng

2014-07-31 14:44:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

"Yan, Zheng" <[email protected]> writes:
>>
>> Please lock yourselves in a room and do not come out until you can give a
>> straight and coherent answer.
>>
>
> Andi,
>
> please confirm this

The hardware can merge PEBS events that happen close to each other.
I was wrong earlier in stating the opposite.

-Andi
--
[email protected] -- Speaking for myself only

2014-07-31 15:25:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 6/9] perf, x86: handle multiple records in PEBS buffer

On Thu, Jul 31, 2014 at 07:44:12AM -0700, Andi Kleen wrote:
> "Yan, Zheng" <[email protected]> writes:
> >>
> >> Please lock yourselves in a room and do not come out until you can give a
> >> straight and coherent answer.
> >>
> >
> > Andi,
> >
> > please confirm this
>
> The hardware can merge PEBS events that happen close to each other.
> I was wrong earlier in stating the opposite.

OK, thanks!

