This patch series implements a large PEBS interrupt threshold.
Currently, the PEBS interrupt threshold is forced to one. A larger
PEBS interrupt threshold can significantly reduce the sampling
overhead, especially for frequently occurring events (like cycles,
branches, or loads/stores) with a small sampling period.
For example, when sampling the cycles event with perf record while
running kernbench with a 10003 sampling period, the elapsed time
dropped from 32.7 seconds to 16.5 seconds, i.e. 2x faster.
For more details, please refer to patch 3's description.
Limitations:
- It cannot supply a callgraph.
- It requires setting a fixed period.
- It cannot supply a time stamp.
- To supply a TID it requires flushing on context switch.
If an event cannot satisfy these constraints, the threshold falls back to one (see the sketch below).
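For illustration, a minimal sketch of the eligibility check (the helper
name is made up here; the real logic lives in patch 3's
intel_pmu_pebs_enable(), using the PEBS_FREERUNNING_FLAGS mask defined
there):

    /*
     * Sketch only: a large threshold is usable when the period is
     * fixed (so auto reload works) and the event samples nothing
     * beyond what a PEBS record can supply.
     */
    static bool pebs_can_use_large_threshold(struct perf_event *event)
    {
            if (event->attr.freq)   /* adaptive period needs a PMI */
                    return false;
            if (event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)
                    return false;   /* e.g. callchain or time stamp */
            return true;
    }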
Collisions:
When PEBS events happen close to each other, their records can be
collapsed into a single one, and the original events cannot be
reconstructed. When a collision happens, we drop the PEBS record.
In practice, collisions are extremely rare as long as different
events are used. We tested the worst case with four frequently
occurring events (cycles:p, instructions:p, branches:p, mem-stores:p);
the collision rate was only 0.34%.
The only way to get many collisions is to count the same thing
multiple times, which is not a useful configuration.
For details about collisions, please refer to patch 4's description.
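The drop itself is a single test on the record's status bitmask;
conceptually (the real code is in patch 4):

    /*
     * A record whose status field has more than one counter bit set
     * was collapsed from colliding events and cannot be attributed,
     * so it is skipped.
     */
    if (p->status != (1 << bit))
            continue;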
changes since v1:
- drop patch 'perf, core: Add all PMUs to pmu_idr'
- add comments for case that multiple counters overflow simultaneously
changes since v2:
- rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
- use flag to indicate auto reload mechanism
- move the code that sets up PEBS sample data into a separate function
- output the PEBS records in batch
- enable this for all PEBS-capable hardware
- add more description of multiplexing
changes since v3:
- ignore conflicting PEBS record
changes since v4:
- Do more tests for collision and update comments
Yan, Zheng (6):
perf, x86: use the PEBS auto reload mechanism when possible
perf, x86: introduce setup_pebs_sample_data()
perf, x86: large PEBS interrupt threshold
perf, x86: handle multiple records in PEBS buffer
perf, x86: drain PEBS buffer during context switch
perf, x86: enlarge PEBS buffer
arch/x86/kernel/cpu/perf_event.c | 15 +-
arch/x86/kernel/cpu/perf_event.h | 5 +-
arch/x86/kernel/cpu/perf_event_intel.c | 11 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 283 ++++++++++++++++++++---------
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 -
5 files changed, 224 insertions(+), 93 deletions(-)
--
1.8.3.2
From: Yan, Zheng <[email protected]>
When a fixed period is specified, this patch makes perf use the PEBS
auto-reload mechanism. This makes normal profiling faster, because
it avoids a costly MSR write in the PMI handler.
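Concretely, the DS area carries a per-counter reset value; once it is
programmed, the hardware rearms the counter after every PEBS record,
so the PMI handler can skip the wrmsrl(). In sketch form (mirroring
the hunk below):

    /* Program the reload value once at enable time; the hardware
     * then reloads the counter after each PEBS record. */
    if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)
            ds->pebs_event_reset[hwc->idx] =
                    (u64)-hwc->sample_period & x86_pmu.cntval_mask;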
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++------
arch/x86/kernel/cpu/perf_event.h | 2 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 9 +++++++++
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index e0dab5c..8a77a94 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -990,13 +990,16 @@ int x86_perf_event_set_period(struct perf_event *event)
per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
- /*
- * The hw event starts counting from this event offset,
- * mark it to be able to extra future deltas:
- */
- local64_set(&hwc->prev_count, (u64)-left);
+ if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+ local64_read(&hwc->prev_count) != (u64)-left) {
+ /*
+ * The hw event starts counting from this event offset,
+ * mark it to be able to extract future deltas:
+ */
+ local64_set(&hwc->prev_count, (u64)-left);
- wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ }
/*
* Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index a371d27..bc4ae3b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -72,7 +72,7 @@ struct event_constraint {
#define PERF_X86_EVENT_PEBS_LD_HSW 0x10 /* haswell style datala, load */
#define PERF_X86_EVENT_PEBS_NA_HSW 0x20 /* haswell style datala, unknown */
#define PERF_X86_EVENT_RDPMC_ALLOWED 0x40 /* grant rdpmc permission */
-
+#define PERF_X86_EVENT_AUTO_RELOAD 0x80 /* use PEBS auto-reload */
struct amd_nb {
int nb_id; /* NorthBridge id */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 0739833..728fb88 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -673,6 +673,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
+ if (!event->attr.freq)
+ hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;
cpuc->pebs_enabled |= 1ULL << hwc->idx;
@@ -680,6 +682,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+
+ /* Use auto-reload if possible to save an MSR write in the PMI */
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ ds->pebs_event_reset[hwc->idx] =
+ (u64)-hwc->sample_period & x86_pmu.cntval_mask;
+ }
}
void intel_pmu_pebs_disable(struct perf_event *event)
@@ -698,6 +706,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+ hwc->flags &= ~PERF_X86_EVENT_AUTO_RELOAD;
}
void intel_pmu_pebs_enable_all(void)
--
1.8.3.2
From: Yan, Zheng <[email protected]>
Move the code that sets up PEBS sample data into a separate function.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 95 +++++++++++++++++--------------
1 file changed, 52 insertions(+), 43 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 728fb88..1c16700 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -838,8 +838,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
return txn;
}
-static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
#define PERF_X86_EVENT_PEBS_HSW_PREC \
(PERF_X86_EVENT_PEBS_ST_HSW | \
@@ -851,30 +853,25 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
*/
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct pebs_record_hsw *pebs = __pebs;
- struct perf_sample_data data;
- struct pt_regs regs;
u64 sample_type;
int fll, fst, dsrc;
int fl = event->hw.flags;
- if (!intel_pmu_save_and_restart(event))
- return;
-
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;
fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);
- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(data, 0, event->hw.last_period);
- data.period = event->hw.last_period;
+ data->period = event->hw.last_period;
/*
* Use latency for weight (only avail with PEBS-LL)
*/
if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
- data.weight = pebs->lat;
+ data->weight = pebs->lat;
/*
* data.data_src encodes the data source
@@ -887,7 +884,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
val = precise_datala_hsw(event, pebs->dse);
else if (fst)
val = precise_store_data(pebs->dse);
- data.data_src.val = val;
+ data->data_src.val = val;
}
/*
@@ -900,58 +897,70 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
* A possible PERF_SAMPLE_REGS will have to transfer all regs.
*/
- regs = *iregs;
- regs.flags = pebs->flags;
- set_linear_ip(&regs, pebs->ip);
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
+ *regs = *iregs;
+ regs->flags = pebs->flags;
+ set_linear_ip(regs, pebs->ip);
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;
if (sample_type & PERF_SAMPLE_REGS_INTR) {
- regs.ax = pebs->ax;
- regs.bx = pebs->bx;
- regs.cx = pebs->cx;
- regs.dx = pebs->dx;
- regs.si = pebs->si;
- regs.di = pebs->di;
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
-
- regs.flags = pebs->flags;
+ regs->ax = pebs->ax;
+ regs->bx = pebs->bx;
+ regs->cx = pebs->cx;
+ regs->dx = pebs->dx;
+ regs->si = pebs->si;
+ regs->di = pebs->di;
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;
+
+ regs->flags = pebs->flags;
#ifndef CONFIG_X86_32
- regs.r8 = pebs->r8;
- regs.r9 = pebs->r9;
- regs.r10 = pebs->r10;
- regs.r11 = pebs->r11;
- regs.r12 = pebs->r12;
- regs.r13 = pebs->r13;
- regs.r14 = pebs->r14;
- regs.r15 = pebs->r15;
+ regs->r8 = pebs->r8;
+ regs->r9 = pebs->r9;
+ regs->r10 = pebs->r10;
+ regs->r11 = pebs->r11;
+ regs->r12 = pebs->r12;
+ regs->r13 = pebs->r13;
+ regs->r14 = pebs->r14;
+ regs->r15 = pebs->r15;
#endif
}
if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs->real_ip;
- regs.flags |= PERF_EFLAGS_EXACT;
- } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
- regs.flags |= PERF_EFLAGS_EXACT;
+ regs->ip = pebs->real_ip;
+ regs->flags |= PERF_EFLAGS_EXACT;
+ } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+ regs->flags |= PERF_EFLAGS_EXACT;
else
- regs.flags &= ~PERF_EFLAGS_EXACT;
+ regs->flags &= ~PERF_EFLAGS_EXACT;
if ((sample_type & PERF_SAMPLE_ADDR) &&
x86_pmu.intel_cap.pebs_format >= 1)
- data.addr = pebs->dla;
+ data->addr = pebs->dla;
if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
- data.weight = intel_hsw_weight(pebs);
+ data->weight = intel_hsw_weight(pebs);
if (sample_type & PERF_SAMPLE_TRANSACTION)
- data.txn = intel_hsw_transaction(pebs);
+ data->txn = intel_hsw_transaction(pebs);
}
if (has_branch_stack(event))
- data.br_stack = &cpuc->lbr_stack;
+ data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs)
+{
+ struct perf_sample_data data;
+ struct pt_regs regs;
+
+ if (!intel_pmu_save_and_restart(event))
+ return;
+
+ setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
if (perf_event_overflow(event, &data, &regs))
x86_pmu_stop(event, 0);
--
1.8.3.2
From: Yan, Zheng <[email protected]>
PEBS has always had the capability to log samples to its buffers
without an interrupt. Traditionally perf has not used this, but has
always set the PEBS threshold to one.
For frequently occurring events (like cycles or branches or
loads/stores), this in turn requires using a relatively high sampling
period to avoid overloading the system with PMIs, which in turn
increases sampling error.
For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it cannot
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support an adaptive period. Another issue is that
it cannot supply a time stamp, among some other options. To supply a
TID it requires flushing on context switch. It can, however, supply
the IP, the load/store address, TSX information, registers, and some
other things.
So we can make PEBS work for some specific cases: basically, as long
as you can do without a callgraph and can set a fixed period, you can
use this new PEBS mode.
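Under these conditions the interrupt threshold can be pushed near the
end of the PEBS buffer. A sketch of the computation (matching the
code in this patch); one record of slack is reserved per counter,
because other counters may still log a record after the threshold is
crossed:

    threshold = ds->pebs_absolute_maximum -
            x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;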
The main benefit is the ability to support a much lower sampling
period (down to -c 1000) without excessive overhead.
One use case is, for example, to increase the resolution of the c2c tool.
Another is double checking when you suspect the standard sampling has
too much sampling error.
Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).
The test command for plain:
"perf record -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
"perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
(The only difference between the multi and plain test commands is the
time stamp option. Since time stamps are not supported with a large
PEBS threshold, the option serves as a flag to indicate whether the
large threshold was enabled during the test.)
period    plain(Sec)  multi(Sec)  Delta
10003     32.7        16.5        16.2
20003     30.2        16.2        14.0
40003     18.6        14.1        4.5
80003     16.8        14.6        2.2
100003    16.9        14.1        2.8
800003    15.4        15.7        -0.3
1000003   15.3        15.2        0.2
2000003   15.3        15.1        0.1
With periods below 100003, plain (threshold one) causes much more
overhead. With a 10003 sampling period, the elapsed time for multi is
2x faster than for plain.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 40 +++++++++++++++++++++++++++----
1 file changed, 36 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 1c16700..16fdb18 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
int node = cpu_to_node(cpu);
- int max, thresh = 1; /* always use a single PEBS record */
+ int max;
void *buffer, *ibuffer;
if (!x86_pmu.pebs)
@@ -280,9 +280,6 @@ static int alloc_pebs_buffer(int cpu)
ds->pebs_absolute_maximum = ds->pebs_buffer_base +
max * x86_pmu.pebs_record_size;
- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- thresh * x86_pmu.pebs_record_size;
-
return 0;
}
@@ -667,15 +664,35 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}
+/*
+ * Flags PEBS can handle without a PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+ PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+ PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+ PERF_SAMPLE_TRANSACTION)
+
+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+ return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;
+ bool first_pebs;
+ u64 threshold;
hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
if (!event->attr.freq)
hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;
+ first_pebs = !pebs_is_enabled(cpuc);
cpuc->pebs_enabled |= 1ULL << hwc->idx;
if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -683,6 +700,21 @@ void intel_pmu_pebs_enable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+ /*
+ * When the event is constrained enough we can use a larger
+ * threshold and run the event with less frequent PMI.
+ */
+ if (0 && /* disable this temporarily */
+ (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
+ !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
+ threshold = ds->pebs_absolute_maximum -
+ x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }
+ if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+ ds->pebs_interrupt_threshold = threshold;
+
/* Use auto-reload if possible to save an MSR write in the PMI */
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
ds->pebs_event_reset[hwc->idx] =
--
1.8.3.2
From: Yan, Zheng <[email protected]>
When the PEBS interrupt threshold is larger than one, the PEBS buffer
may include multiple records for each PEBS event. This patch makes
the code first count how many records each PEBS event has, then
output the samples in a batch.
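In outline, the drain becomes a two-pass walk over the buffer (a
simplified sketch of the code below; the active_mask and precise_ip
checks are elided):

    /* Pass 1: count the records owned by each counter, skipping
     * collided records (more than one status bit set). */
    for (at = base; at < top; at += x86_pmu.pebs_record_size) {
            struct pebs_record_nhm *p = at;

            bit = find_first_bit((unsigned long *)&p->status,
                                 x86_pmu.max_pebs_events);
            if (bit < x86_pmu.max_pebs_events && p->status == (1 << bit))
                    counts[bit]++;
    }

    /* Pass 2: find each counter's first record and emit the whole
     * batch in one call. */
    for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
            if (!counts[bit])
                    continue;
            for (at = base; at < top; at += x86_pmu.pebs_record_size)
                    if (((struct pebs_record_nhm *)at)->status == (1 << bit))
                            break;
            __intel_pmu_pebs_event(cpuc->events[bit], iregs, at, top,
                                   counts[bit]);
    }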
One corner case that needs mentioning is that the PEBS hardware
doesn't deal well with collisions, when PEBS events happen close to
each other. The records for the events can be collapsed into a single
one, and it's not possible to reconstruct all the events that caused
the PEBS record. However, in practice collisions are extremely rare,
as long as different events are used. The periods are typically very
large, so any collision is unlikely. When a collision happens, we drop
the PEBS record.
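For example, if cycles:p runs on counter 0 and instructions:p on
counter 1 and both trigger in the same window, the collapsed record
arrives with status == 0x3; that matches neither (1 << 0) nor
(1 << 1) alone, so the record is dropped.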
Here are some numbers about collisions. Four frequently occurring
events (cycles:p, instructions:p, branches:p, mem-stores:p) were
tested:
Test events sampled together                     Collision rate
cycles:p,instructions:p                           0.25%
cycles:p,instructions:p,branches:p                0.30%
cycles:p,instructions:p,branches:p,mem-stores:p   0.35%
cycles:p,cycles:p                                 98.52%
Collisions are extremely rare as long as different events are used.
The only way to get many collisions is to count the same thing
multiple times, which is not a useful configuration.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 114 ++++++++++++++++++++----------
1 file changed, 77 insertions(+), 37 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 16fdb18..cf74a52 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -984,18 +984,52 @@ static void setup_pebs_sample_data(struct perf_event *event,
}
static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+ struct pt_regs *iregs,
+ void *at, void *top, int count)
{
+ struct perf_output_handle handle;
+ struct perf_event_header header;
struct perf_sample_data data;
struct pt_regs regs;
- if (!intel_pmu_save_and_restart(event))
+ if (!intel_pmu_save_and_restart(event) &&
+ !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
return;
- setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
- if (perf_event_overflow(event, &data, &regs))
+ if (perf_event_overflow(event, &data, &regs)) {
x86_pmu_stop(event, 0);
+ return;
+ }
+
+ if (count <= 1)
+ return;
+
+ at += x86_pmu.pebs_record_size;
+ count--;
+
+ perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_prepare_sample(&header, &data, event, &regs);
+
+ if (perf_output_begin(&handle, event, header.size * count))
+ return;
+
+ for (; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+
+ if (p->status != (1 << event->hw.idx))
+ continue;
+
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
+ perf_output_sample(&handle, &header, &data, event);
+
+ count--;
+ if (count == 0)
+ break;
+ }
+
+ perf_output_end(&handle);
}
static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
@@ -1036,61 +1070,67 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
at += n - 1;
- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, at, top, 1);
}
static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct debug_store *ds = cpuc->ds;
- struct perf_event *event = NULL;
- void *at, *top;
- u64 status = 0;
+ struct perf_event *event;
+ void *base, *at, *top;
int bit;
+ int counts[MAX_PEBS_EVENTS] = {};
if (!x86_pmu.pebs_active)
return;
- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
ds->pebs_index = ds->pebs_buffer_base;
- if (unlikely(at > top))
+ if (unlikely(base >= top))
return;
- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
- "Unexpected number of pebs records %ld\n",
- (long)(top - at) / x86_pmu.pebs_record_size);
-
- for (; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;
- for_each_set_bit(bit, (unsigned long *)&p->status,
- x86_pmu.max_pebs_events) {
- event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
- WARN_ON_ONCE(!event);
-
- if (!event->attr.precise_ip)
- continue;
-
- if (__test_and_set_bit(bit, (unsigned long *)&status))
- continue;
-
- break;
- }
+ bit = find_first_bit((unsigned long *)&p->status,
+ x86_pmu.max_pebs_events);
+ if (bit >= x86_pmu.max_pebs_events)
+ continue;
+ /*
+ * The PEBS hardware does not deal well with collisions,
+ * when PEBS events happen close to each other. The
+ * records for the events can be collapsed into a single
+ * one, and it's not possible to reconstruct all events
+ * that caused the PEBS record. However, in practice
+ * collisions are extremely rare. If a collision happens,
+ * we drop the record; it's the safest choice.
+ */
+ if (p->status != (1 << bit))
+ continue;
+ if (!test_bit(bit, cpuc->active_mask))
+ continue;
+ event = cpuc->events[bit];
+ WARN_ON_ONCE(!event);
+ if (!event->attr.precise_ip)
+ continue;
+ counts[bit]++;
+ }
- if (!event || bit >= x86_pmu.max_pebs_events)
+ for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+ if (counts[bit] == 0)
continue;
+ event = cpuc->events[bit];
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
- __intel_pmu_pebs_event(event, iregs, at);
+ if (p->status == (1 << bit))
+ break;
+ }
+ __intel_pmu_pebs_event(event, iregs, at, top, counts[bit]);
}
}
--
1.8.3.2
From: Yan, Zheng <[email protected]>
Flush the PEBS buffer during a context switch if the PEBS interrupt
threshold is larger than one. This allows perf to supply the TID for
sample outputs.
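In sketch form, the flush hooks into the sched_task callback
(mirroring the hunks below):

    /* On sched-out, drain whatever the buffer holds so that each
     * record is attributed to the task that generated it. */
    void intel_pmu_pebs_sched_task(struct perf_event_context *ctx,
                                   bool sched_in)
    {
            if (!sched_in)
                    intel_pmu_drain_pebs_buffer();
    }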
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 3 +++
arch/x86/kernel/cpu/perf_event_intel.c | 11 +++++++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 33 ++++++++++++++++++++++++++++--
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 ---
4 files changed, 44 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index bc4ae3b..b4f6431 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -151,6 +151,7 @@ struct cpu_hw_events {
*/
struct debug_store *ds;
u64 pebs_enabled;
+ bool pebs_sched_cb_enabled;
/*
* Intel LBR bits
@@ -739,6 +740,8 @@ void intel_pmu_pebs_enable_all(void);
void intel_pmu_pebs_disable_all(void);
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_ds_init(void);
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 9f1dd18..bdb2ffd 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2030,6 +2030,15 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}
+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ if (x86_pmu.pebs_active)
+ intel_pmu_pebs_sched_task(ctx, sched_in);
+ if (x86_pmu.lbr_nr)
+ intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2081,7 +2090,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .sched_task = intel_pmu_lbr_sched_task,
+ .sched_task = intel_pmu_sched_task,
};
static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index cf74a52..b8d0451 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -664,6 +664,19 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}
+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+ struct pt_regs regs;
+
+ x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (!sched_in)
+ intel_pmu_drain_pebs_buffer();
+}
+
/*
* Flags PEBS can handle without a PMI.
*
@@ -704,13 +717,20 @@ void intel_pmu_pebs_enable(struct perf_event *event)
* When the event is constrained enough we can use a larger
* threshold and run the event with less frequent PMI.
*/
- if (0 && /* disable this temporarily */
- (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
+ if ((hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
threshold = ds->pebs_absolute_maximum -
x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ if (first_pebs) {
+ perf_sched_cb_inc(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = true;
+ }
} else {
threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ if (cpuc->pebs_sched_cb_enabled) {
+ perf_sched_cb_dec(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = false;
+ }
}
if (first_pebs || ds->pebs_interrupt_threshold > threshold)
ds->pebs_interrupt_threshold = threshold;
@@ -726,8 +746,17 @@ void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;
+
+ if (ds->pebs_interrupt_threshold >
+ ds->pebs_buffer_base + x86_pmu.pebs_record_size)
+ intel_pmu_drain_pebs_buffer();
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+ if (cpuc->pebs_sched_cb_enabled && !pebs_is_enabled(cpuc)) {
+ perf_sched_cb_dec(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = false;
+ }
if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 0473874..1edd3b9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -256,9 +256,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct x86_perf_task_context *task_ctx;
- if (!x86_pmu.lbr_nr)
- return;
-
/*
* If LBR callstack feature is enabled and the stack was saved when
* the task was scheduled out, restore the stack. Otherwise flush
--
1.8.3.2
From: Yan, Zheng <[email protected]>
Currently the PEBS buffer size is 4KB, which can only hold about 21
PEBS records. This patch enlarges the PEBS buffer to 64KB (the same
as the BTS buffer); 64KB of memory can hold about 330 PEBS records.
This significantly reduces the number of PMIs when a large PEBS
interrupt threshold is used.
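For reference, assuming the 192-byte Haswell PEBS record format:
4096 / 192 is about 21 records, while 65536 / 192 is about 341;
minus the one-record-per-counter slack kept below the interrupt
threshold, that leaves roughly 330 usable records.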
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index b8d0451..eba42a1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
#define BTS_RECORD_SIZE 24
#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE PAGE_SIZE
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_FIXUP_SIZE PAGE_SIZE
/*
--
1.8.3.2