2015-04-09 16:38:37

by Liang, Kan

[permalink] [raw]
Subject: [PATCH V6 0/6] large PEBS interrupt threshold

This patch series implements a large PEBS interrupt threshold.
Currently, the PEBS interrupt threshold is forced to one. A larger
threshold can significantly reduce the sampling overhead, especially
for frequently occurring events (like cycles, branches, or
loads/stores) with a small sampling period.
For example, when running kernbench under perf record for the cycles
event with a 10003 sampling period, the elapsed time dropped from
32.7 seconds to 16.5 seconds, which is 2X faster.
For more details, please refer to patch 3's description.

Limitations:
- It cannot supply a callgraph.
- It requires setting a fixed period.
- It cannot supply a time stamp.
- To supply a TID it requires flushing on context switch.
If an event cannot live with these limitations, the threshold is set to one.
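As a rough illustration, the eligibility test boils down to a mask check on the requested sample flags. This is a hedged standalone sketch, not the kernel code; the PERF_SAMPLE_* values follow the perf_event_attr UAPI, and the reduced flag mask here is illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Subset of the perf_event sample_type bits (UAPI values). */
#define PERF_SAMPLE_IP        (1U << 0)
#define PERF_SAMPLE_TID       (1U << 1)
#define PERF_SAMPLE_TIME      (1U << 2)
#define PERF_SAMPLE_ADDR      (1U << 3)
#define PERF_SAMPLE_CALLCHAIN (1U << 5)

/* Illustrative mask of flags PEBS can log without a PMI.
 * The real PEBS_FREERUNNING_FLAGS mask in patch 3 is larger. */
#define PEBS_OK_FLAGS (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR)

/* Returns 1 when the large threshold may be used: the event must use
 * a fixed period (no frequency mode) and request only flags PEBS can
 * supply on its own. */
static int large_pebs_ok(uint64_t sample_type, int freq_mode)
{
    return !freq_mode && !(sample_type & ~(uint64_t)PEBS_OK_FLAGS);
}
```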

Collisions:
When PEBS events happen close to each other, the records for the
events can be collapsed into a single one, and it is not possible to
reconstruct all of them. When a collision happens, we drop the PEBS
record.
In practice, collisions are extremely rare as long as different events
are used. We tested the worst case with four frequently occurring
events (cycles:p,instructions:p,branches:p,mem-stores:p); the
collision rate was only 0.34%.
The only way to get a lot of collisions is to count the same thing
multiple times, which is not a useful configuration.
For details about collisions, please refer to patch 4's description.
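The collision test itself reduces to asking whether a record's status word has more than one PEBS event bit set. A minimal standalone sketch of that test (not the kernel implementation):

```c
#include <assert.h>
#include <stdint.h>

/* A PEBS record whose status word has exactly one event bit set can be
 * attributed unambiguously. More than one set bit means the hardware
 * collapsed nearby events into a single record (a collision), and the
 * record is dropped. */
static int is_collision(uint64_t pebs_status)
{
    /* x & (x - 1) clears the lowest set bit: the result is zero
     * if and only if at most one bit was set. */
    return pebs_status && (pebs_status & (pebs_status - 1)) ? 1 : 0;
}
```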

changes since v1:
- drop patch 'perf, core: Add all PMUs to pmu_idr'
- add comments for case that multiple counters overflow simultaneously
changes since v2:
- rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
- use flag to indicate auto reload mechanism
- move codes that setup PEBS sample data to separate function
- output the PEBS records in batch
- enable this for all (PEBS capable) hardware
- add more description of event multiplexing
changes since v3:
- ignore conflicting PEBS record
changes since v4:
- Do more tests for collision and update comments
changes since v5:
- Move autoreload and large PEBS available check to intel_pmu_hw_config
- make AUTO_RELOAD conditional on large PEBS
- !PEBS bug fix
- make the story about what a collision is and how we handle it coherent
- Remove extra state pebs_sched_cb_enabled

Yan, Zheng (6):
perf, x86: use the PEBS auto reload mechanism when possible
perf, x86: introduce setup_pebs_sample_data()
perf, x86: large PEBS interrupt threshold
perf, x86: handle multiple records in PEBS buffer
perf, x86: drain PEBS buffer during context switch
perf, x86: enlarge PEBS buffer

arch/x86/kernel/cpu/perf_event.c | 15 +-
arch/x86/kernel/cpu/perf_event.h | 15 +-
arch/x86/kernel/cpu/perf_event_intel.c | 21 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 316 ++++++++++++++++++++++-------
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 -
include/linux/perf_event.h | 4 +
kernel/events/core.c | 6 +-
7 files changed, 293 insertions(+), 87 deletions(-)

--
1.7.11.7


2015-04-09 16:39:54

by Liang, Kan

Subject: [PATCH V6 1/6] perf, x86: use the PEBS auto reload mechanism when possible

From: Yan, Zheng <[email protected]>

When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++------
arch/x86/kernel/cpu/perf_event.h | 2 +-
arch/x86/kernel/cpu/perf_event_intel.c | 8 ++++++--
arch/x86/kernel/cpu/perf_event_intel_ds.c | 7 +++++++
4 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 87848eb..8cc1153 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1058,13 +1058,16 @@ int x86_perf_event_set_period(struct perf_event *event)

per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;

- /*
- * The hw event starts counting from this event offset,
- * mark it to be able to extra future deltas:
- */
- local64_set(&hwc->prev_count, (u64)-left);
+ if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+ local64_read(&hwc->prev_count) != (u64)-left) {
+ /*
+ * The hw event starts counting from this event offset,
+ * mark it to be able to extra future deltas:
+ */
+ local64_set(&hwc->prev_count, (u64)-left);

- wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ }

/*
* Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 329f035..1290bb6 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -74,7 +74,7 @@ struct event_constraint {
#define PERF_X86_EVENT_EXCL 0x40 /* HT exclusivity on counter */
#define PERF_X86_EVENT_DYNAMIC 0x80 /* dynamic alloc'd constraint */
#define PERF_X86_EVENT_RDPMC_ALLOWED 0x40 /* grant rdpmc permission */
-
+#define PERF_X86_EVENT_AUTO_RELOAD 0x80 /* use PEBS auto-reload */

struct amd_nb {
int nb_id; /* NorthBridge id */
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 9da2400..0a7b5ca 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2305,8 +2305,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (ret)
return ret;

- if (event->attr.precise_ip && x86_pmu.pebs_aliases)
- x86_pmu.pebs_aliases(event);
+ if (event->attr.precise_ip) {
+ if (!event->attr.freq)
+ event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+ if (x86_pmu.pebs_aliases)
+ x86_pmu.pebs_aliases(event);
+ }

if (needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ca69ea5..7c6dd8e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -680,6 +680,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

@@ -689,6 +690,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+
+ /* Use auto-reload if possible to save a MSR write in the PMI */
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ ds->pebs_event_reset[hwc->idx] =
+ (u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
+ }
}

void intel_pmu_pebs_disable(struct perf_event *event)
--
1.7.11.7

2015-04-09 16:39:50

by Liang, Kan

Subject: [PATCH V6 2/6] perf, x86: introduce setup_pebs_sample_data()

From: Yan, Zheng <[email protected]>

Move the code that sets up the PEBS sample data into a separate function.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 95 +++++++++++++++++--------------
1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 7c6dd8e..e3916d5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -845,8 +845,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
return txn;
}

-static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
#define PERF_X86_EVENT_PEBS_HSW_PREC \
(PERF_X86_EVENT_PEBS_ST_HSW | \
@@ -858,30 +860,25 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
*/
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct pebs_record_hsw *pebs = __pebs;
- struct perf_sample_data data;
- struct pt_regs regs;
u64 sample_type;
int fll, fst, dsrc;
int fl = event->hw.flags;

- if (!intel_pmu_save_and_restart(event))
- return;
-
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;

fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);

- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(data, 0, event->hw.last_period);

- data.period = event->hw.last_period;
+ data->period = event->hw.last_period;

/*
* Use latency for weight (only avail with PEBS-LL)
*/
if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
- data.weight = pebs->lat;
+ data->weight = pebs->lat;

/*
* data.data_src encodes the data source
@@ -894,7 +891,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
val = precise_datala_hsw(event, pebs->dse);
else if (fst)
val = precise_store_data(pebs->dse);
- data.data_src.val = val;
+ data->data_src.val = val;
}

/*
@@ -907,58 +904,70 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
* A possible PERF_SAMPLE_REGS will have to transfer all regs.
*/
- regs = *iregs;
- regs.flags = pebs->flags;
- set_linear_ip(&regs, pebs->ip);
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
+ *regs = *iregs;
+ regs->flags = pebs->flags;
+ set_linear_ip(regs, pebs->ip);
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;

if (sample_type & PERF_SAMPLE_REGS_INTR) {
- regs.ax = pebs->ax;
- regs.bx = pebs->bx;
- regs.cx = pebs->cx;
- regs.dx = pebs->dx;
- regs.si = pebs->si;
- regs.di = pebs->di;
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
-
- regs.flags = pebs->flags;
+ regs->ax = pebs->ax;
+ regs->bx = pebs->bx;
+ regs->cx = pebs->cx;
+ regs->dx = pebs->dx;
+ regs->si = pebs->si;
+ regs->di = pebs->di;
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;
+
+ regs->flags = pebs->flags;
#ifndef CONFIG_X86_32
- regs.r8 = pebs->r8;
- regs.r9 = pebs->r9;
- regs.r10 = pebs->r10;
- regs.r11 = pebs->r11;
- regs.r12 = pebs->r12;
- regs.r13 = pebs->r13;
- regs.r14 = pebs->r14;
- regs.r15 = pebs->r15;
+ regs->r8 = pebs->r8;
+ regs->r9 = pebs->r9;
+ regs->r10 = pebs->r10;
+ regs->r11 = pebs->r11;
+ regs->r12 = pebs->r12;
+ regs->r13 = pebs->r13;
+ regs->r14 = pebs->r14;
+ regs->r15 = pebs->r15;
#endif
}

if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs->real_ip;
- regs.flags |= PERF_EFLAGS_EXACT;
- } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
- regs.flags |= PERF_EFLAGS_EXACT;
+ regs->ip = pebs->real_ip;
+ regs->flags |= PERF_EFLAGS_EXACT;
+ } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+ regs->flags |= PERF_EFLAGS_EXACT;
else
- regs.flags &= ~PERF_EFLAGS_EXACT;
+ regs->flags &= ~PERF_EFLAGS_EXACT;

if ((sample_type & PERF_SAMPLE_ADDR) &&
x86_pmu.intel_cap.pebs_format >= 1)
- data.addr = pebs->dla;
+ data->addr = pebs->dla;

if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
- data.weight = intel_hsw_weight(pebs);
+ data->weight = intel_hsw_weight(pebs);

if (sample_type & PERF_SAMPLE_TRANSACTION)
- data.txn = intel_hsw_transaction(pebs);
+ data->txn = intel_hsw_transaction(pebs);
}

if (has_branch_stack(event))
- data.br_stack = &cpuc->lbr_stack;
+ data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs)
+{
+ struct perf_sample_data data;
+ struct pt_regs regs;
+
+ if (!intel_pmu_save_and_restart(event))
+ return;
+
+ setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);

if (perf_event_overflow(event, &data, &regs))
x86_pmu_stop(event, 0);
--
1.7.11.7

2015-04-09 16:38:44

by Liang, Kan

Subject: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold

From: Yan, Zheng <[email protected]>

PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.

For frequently occurring events (like cycles, branches, or
loads/stores) this in turn requires using a relatively high sampling
period to avoid overloading the system with PMI processing. This in
turn increases the sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it can not
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases, basically as long as
you can do without a callgraph and can set the period you can use this
new PEBS mode.

The main benefit is the ability to support much lower sampling periods
(down to -c 1000) without excessive overhead.

One use case is, for example, increasing the resolution of the c2c
tool. Another is double checking when you suspect the standard
sampling has too much sampling error.

Some numbers on the overhead, using a cycle soak workload, comparing
the elapsed time of "kernbench -M -H" between plain (threshold set to
one) and multi (large threshold).
The test command for plain:
"perf record --time -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
"perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
(The only difference between the multi and plain test commands is the
time stamp option. Since time stamps are not supported with a large
PEBS threshold, the option serves as a flag indicating whether the
large threshold is enabled during the test.)

period plain(Sec) multi(Sec) Delta
10003 32.7 16.5 16.2
20003 30.2 16.2 14.0
40003 18.6 14.1 4.5
80003 16.8 14.6 2.2
100003 16.9 14.1 2.8
800003 15.4 15.7 -0.3
1000003 15.3 15.2 0.2
2000003 15.3 15.1 0.1

With periods below 100003, plain (threshold one) causes much more
overhead. With a 10003 sampling period, the elapsed time for multi is
2X faster than plain.

This patch also makes AUTO_RELOAD conditional on large PEBS: auto
reload is only enabled when a fixed period is used and the large PEBS
threshold applies.
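For illustration, the threshold placement can be sketched as below. The record size here is an assumption (the real value depends on the CPU's PEBS format); the point is that with auto reload the threshold sits max_pebs_events records below the end of the buffer, so every counter can still log one more record after the threshold is crossed but before the PMI is serviced:

```c
#include <assert.h>
#include <stdint.h>

#define PEBS_RECORD_SIZE 192   /* assumed HSW-style record size */
#define MAX_PEBS_EVENTS  8

/* Sketch of the interrupt-threshold choice in intel_pmu_pebs_enable().
 * base is pebs_buffer_base, abs_max is pebs_absolute_maximum. */
static uint64_t pebs_threshold(uint64_t base, uint64_t abs_max,
                               int large_pebs)
{
    if (large_pebs)
        /* leave room for one record per counter past the threshold */
        return abs_max - MAX_PEBS_EVENTS * PEBS_RECORD_SIZE;

    /* classic mode: interrupt after a single record */
    return base + PEBS_RECORD_SIZE;
}
```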

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 11 +++++++++++
arch/x86/kernel/cpu/perf_event_intel.c | 4 +++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 30 +++++++++++++++++++++++++++---
3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 1290bb6..5b677a9 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -87,6 +87,17 @@ struct amd_nb {
#define MAX_PEBS_EVENTS 8

/*
+ * Flags PEBS can handle without an PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+ PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+ PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+ PERF_SAMPLE_TRANSACTION)
+
+/*
* A debug store configuration.
*
* We only support architectures that use 64bit fields.
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0a7b5ca..6c8579a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2306,7 +2306,9 @@ static int intel_pmu_hw_config(struct perf_event *event)
return ret;

if (event->attr.precise_ip) {
- if (!event->attr.freq)
+ /* only enable auto reload when fix period and large PEBS */
+ if (!event->attr.freq &&
+ !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index e3916d5..3ce7f59 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
int node = cpu_to_node(cpu);
- int max, thresh = 1; /* always use a single PEBS record */
+ int max;
void *buffer, *ibuffer;

if (!x86_pmu.pebs)
@@ -280,8 +280,9 @@ static int alloc_pebs_buffer(int cpu)
ds->pebs_absolute_maximum = ds->pebs_buffer_base +
max * x86_pmu.pebs_record_size;

- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- thresh * x86_pmu.pebs_record_size;
+ if (x86_pmu.intel_cap.pebs_format < 1)
+ ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
+ x86_pmu.pebs_record_size;

return 0;
}
@@ -676,14 +677,22 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}

+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+ return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
struct debug_store *ds = cpuc->ds;
+ bool first_pebs;
+ u64 threshold;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

+ first_pebs = !pebs_is_enabled(cpuc);
cpuc->pebs_enabled |= 1ULL << hwc->idx;

if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -691,11 +700,26 @@ void intel_pmu_pebs_enable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;

+ /*
+ * When the event is constrained enough we can use a larger
+ * threshold and run the event with less frequent PMI.
+ */
+ if (0 && /* disable this temporarily */
+ (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)) {
+ threshold = ds->pebs_absolute_maximum -
+ x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }
+
/* Use auto-reload if possible to save a MSR write in the PMI */
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
ds->pebs_event_reset[hwc->idx] =
(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
}
+
+ if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+ ds->pebs_interrupt_threshold = threshold;
}

void intel_pmu_pebs_disable(struct perf_event *event)
--
1.7.11.7

2015-04-09 16:39:32

by Liang, Kan

Subject: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

From: Yan, Zheng <[email protected]>

When the PEBS interrupt threshold is larger than one, the PEBS buffer
may contain multiple records for each PEBS event. This patch makes
the code first count how many records each PEBS event has, then
output the samples in batch.

One corner case worth mentioning is that the PEBS hardware doesn't
deal well with collisions. The records for the events can be collapsed
into a single one, and it's not possible to reconstruct all the events
that caused the PEBS record.
Here are the cases that count as collisions:
- PEBS events happen close to each other, so the hardware merges them.
- PEBS events happen close to each other, but they are not merged. The
GLOBAL_STATUS bit for the first counter is cleared before the record
for the next counter is generated. Only the first record is treated
as a collision.
- Same as the second case, but the first counter's bit is not cleared
before the record for the next counter is generated. All the records
are treated as collisions until a record with only one bit set for a
PEBS event appears.

GLOBAL_STATUS can be set by both PEBS and non-PEBS events. Multiple
non-PEBS bits being set does not count as a collision.

In practice collisions are extremely rare, as long as different PEBS
events are used. The periods are typically very large, so any
collision is unlikely. When a collision happens, we drop the PEBS
record.
The only way to get a lot of collisions is to count the same thing
multiple times, which is not a useful configuration.

Here are some numbers on collisions. Four frequently occurring events
(cycles:p,instructions:p,branches:p,mem-stores:p) were tested:

Test events which are sampled together collision rate
cycles:p,instructions:p 0.25%
cycles:p,instructions:p,branches:p 0.30%
cycles:p,instructions:p,branches:p,mem-stores:p 0.35%

cycles:p,cycles:p 98.52%
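The attribution logic behind these numbers can be sketched as a two-step check: a fast path when the record's status has exactly the bit of the event being drained, and a slow path that first clears the non-PEBS bits. The non_pebs_mask parameter is hypothetical here; the kernel derives the non-PEBS bits by walking cpuc->events and testing attr.precise_ip:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the per-record attribution test used when draining the
 * PEBS buffer for event 'bit'. Returns 1 if the record can be
 * attributed to that event; anything else is a collision and the
 * record is skipped. */
static int record_usable(uint64_t status, uint64_t non_pebs_mask, int bit)
{
    if (status == (1ULL << bit))   /* fast path: exactly our bit */
        return 1;

    status &= ~non_pebs_mask;      /* slow path: drop non-PEBS bits */
    return status == (1ULL << bit) ? 1 : 0;
}
```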

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 157 +++++++++++++++++++++++++-----
include/linux/perf_event.h | 4 +
kernel/events/core.c | 6 +-
3 files changed, 137 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 3ce7f59..fafbf97 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -888,6 +888,9 @@ static void setup_pebs_sample_data(struct perf_event *event,
int fll, fst, dsrc;
int fl = event->hw.flags;

+ if (pebs == NULL)
+ return;
+
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;

@@ -982,7 +985,7 @@ static void setup_pebs_sample_data(struct perf_event *event,
data->br_stack = &cpuc->lbr_stack;
}

-static void __intel_pmu_pebs_event(struct perf_event *event,
+static void __intel_pmu_pebs_event_core(struct perf_event *event,
struct pt_regs *iregs, void *__pebs)
{
struct perf_sample_data data;
@@ -997,6 +1000,89 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
x86_pmu_stop(event, 0);
}

+/* Clear all non-PEBS bits */
+static u64
+nonpebs_bit_clear(u64 pebs_status)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *event;
+ int bit;
+
+ for_each_set_bit(bit, (unsigned long *)&pebs_status, 64) {
+
+ if (bit >= x86_pmu.max_pebs_events)
+ clear_bit(bit, (unsigned long *)&pebs_status);
+ else {
+ event = cpuc->events[bit];
+ WARN_ON_ONCE(!event);
+
+ if (!event->attr.precise_ip)
+ clear_bit(bit, (unsigned long *)&pebs_status);
+ }
+ }
+
+ return pebs_status;
+}
+
+static inline void *
+get_next_pebs_record_by_bit(void *base, void *top, int bit)
+{
+ void *at;
+ u64 pebs_status;
+
+ if (base == NULL)
+ return NULL;
+
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+
+ if (p->status & (1 << bit)) {
+
+ if (p->status == (1 << bit))
+ return at;
+
+ /* clear non-PEBS bit and re-check */
+ pebs_status = nonpebs_bit_clear(p->status);
+ if (pebs_status == (1 << bit))
+ return at;
+ }
+ }
+ return NULL;
+}
+
+static void __intel_pmu_pebs_event_nhm(struct perf_event *event,
+ struct pt_regs *iregs,
+ void *base, void *top,
+ int bit, int count)
+{
+ struct perf_sample_data data;
+ struct pt_regs regs;
+ int i;
+ void *at = get_next_pebs_record_by_bit(base, top, bit);
+
+ if (!intel_pmu_save_and_restart(event) &&
+ !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+ return;
+
+ if (count > 1) {
+ for (i = 0; i < count - 1; i++) {
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
+ perf_event_output(event, &data, &regs);
+ at += x86_pmu.pebs_record_size;
+ at = get_next_pebs_record_by_bit(at, top, bit);
+ }
+ }
+
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
+
+ /* all records are processed, handle event overflow now */
+ if (perf_event_overflow(event, &data, &regs)) {
+ x86_pmu_stop(event, 0);
+ return;
+ }
+
+}
+
static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1035,61 +1121,78 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
at += n - 1;

- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event_core(event, iregs, at);
}

static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct debug_store *ds = cpuc->ds;
- struct perf_event *event = NULL;
- void *at, *top;
- u64 status = 0;
+ struct perf_event *event;
+ void *base, *at, *top;
int bit;
+ int counts[MAX_PEBS_EVENTS] = {};

if (!x86_pmu.pebs_active)
return;

- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;

ds->pebs_index = ds->pebs_buffer_base;

- if (unlikely(at > top))
+ if (unlikely(base >= top))
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
- "Unexpected number of pebs records %ld\n",
- (long)(top - at) / x86_pmu.pebs_record_size);
-
- for (; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

for_each_set_bit(bit, (unsigned long *)&p->status,
x86_pmu.max_pebs_events) {
event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
WARN_ON_ONCE(!event);

- if (!event->attr.precise_ip)
- continue;
+ if (event->attr.precise_ip)
+ break;
+ }

- if (__test_and_set_bit(bit, (unsigned long *)&status))
- continue;
+ if (bit >= x86_pmu.max_pebs_events)
+ continue;
+ if (!test_bit(bit, cpuc->active_mask))
+ continue;
+ /*
+ * The PEBS hardware does not deal well with the situation
+ * when events happen near to each other and multiple bits
+ * are set. But it should happen rarely.
+ *
+ * If these events include one PEBS and multiple non-PEBS
+ * events, it doesn't impact PEBS record. The record will
+ * be handled normally. (slow path)
+ *
+ * If these events include two or more PEBS events, the
+ * records for the events can be collapsed into a single
+ * one, and it's not possible to reconstruct all events
+ * that caused the PEBS record. It's called collision.
+ * If collision happened, the record will be dropped.
+ *
+ */
+ if (p->status != (1 << bit)) {
+ u64 pebs_status;

- break;
+ /* slow path */
+ pebs_status = nonpebs_bit_clear(p->status);
+ if (pebs_status != (1 << bit))
+ continue;
}
+ counts[bit]++;
+ }

- if (!event || bit >= x86_pmu.max_pebs_events)
+ for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+ if (counts[bit] == 0)
continue;
-
- __intel_pmu_pebs_event(event, iregs, at);
+ event = cpuc->events[bit];
+ __intel_pmu_pebs_event_nhm(event, iregs, base,
+ top, bit, counts[bit]);
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61992cf..750007e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -734,6 +734,10 @@ extern int perf_event_overflow(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs);

+extern void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 06917d5..a8d0e92 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5360,9 +5360,9 @@ void perf_prepare_sample(struct perf_event_header *header,
}
}

-static void perf_event_output(struct perf_event *event,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
struct perf_output_handle handle;
struct perf_event_header header;
--
1.7.11.7

2015-04-09 16:38:49

by Liang, Kan

Subject: [PATCH V6 5/6] perf, x86: drain PEBS buffer during context switch

From: Yan, Zheng <[email protected]>

Flush the PEBS buffer during a context switch if the PEBS interrupt
threshold is larger than one. This allows perf to supply a TID for
sample outputs.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 2 ++
arch/x86/kernel/cpu/perf_event_intel.c | 11 +++++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 45 +++++++++++++++++++++++++-----
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 --
4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 5b677a9..446f21b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -871,6 +871,8 @@ void intel_pmu_pebs_enable_all(void);

void intel_pmu_pebs_disable_all(void);

+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_ds_init(void);

void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 6c8579a..d647d7e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2729,6 +2729,15 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ if (x86_pmu.pebs_active)
+ intel_pmu_pebs_sched_task(ctx, sched_in);
+ if (x86_pmu.lbr_nr)
+ intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2780,7 +2789,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .sched_task = intel_pmu_lbr_sched_task,
+ .sched_task = intel_pmu_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index fafbf97..3d8950a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -550,6 +550,19 @@ int intel_pmu_drain_bts_buffer(void)
return 1;
}

+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+ struct pt_regs regs;
+
+ x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (!sched_in)
+ intel_pmu_drain_pebs_buffer();
+}
+
/*
* PEBS
*/
@@ -704,18 +717,28 @@ void intel_pmu_pebs_enable(struct perf_event *event)
* When the event is constrained enough we can use a larger
* threshold and run the event with less frequent PMI.
*/
- if (0 && /* disable this temporarily */
- (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)) {
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
threshold = ds->pebs_absolute_maximum -
x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
- } else {
- threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
- }

- /* Use auto-reload if possible to save a MSR write in the PMI */
- if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ /* Use auto-reload if possible to save a MSR write in the PMI */
ds->pebs_event_reset[hwc->idx] =
(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
+
+ if (first_pebs)
+ perf_sched_cb_inc(event->ctx->pmu);
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+
+ ds->pebs_event_reset[hwc->idx] = 0;
+
+ /*
+ * If not all events can use larger buffer,
+ * roll back to threshold = 1
+ */
+ if (!first_pebs &&
+ (ds->pebs_interrupt_threshold > threshold))
+ perf_sched_cb_dec(event->ctx->pmu);
}

if (first_pebs || ds->pebs_interrupt_threshold > threshold)
@@ -726,6 +749,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;

cpuc->pebs_enabled &= ~(1ULL << hwc->idx);

@@ -734,6 +758,13 @@ void intel_pmu_pebs_disable(struct perf_event *event)
else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled &= ~(1ULL << 63);

+ if (ds->pebs_interrupt_threshold >
+ ds->pebs_buffer_base + x86_pmu.pebs_record_size) {
+ intel_pmu_drain_pebs_buffer();
+ if (!pebs_is_enabled(cpuc))
+ perf_sched_cb_dec(event->ctx->pmu);
+ }
+
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 94e5b50..c8a72cc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -262,9 +262,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct x86_perf_task_context *task_ctx;

- if (!x86_pmu.lbr_nr)
- return;
-
/*
* If LBR callstack feature is enabled and the stack was saved when
* the task was scheduled out, restore the stack. Otherwise flush
--
1.7.11.7

2015-04-09 16:38:47

by Liang, Kan

[permalink] [raw]
Subject: [PATCH V6 6/6] perf, x86: enlarge PEBS buffer

From: Yan, Zheng <[email protected]>

Currently the PEBS buffer size is 4k; it can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as the BTS buffer); 64k of memory can hold about 330 PEBS
records. This will significantly reduce the number of PMIs when a
large PEBS interrupt threshold is used.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 3d8950a..dfbcaad 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
#define BTS_RECORD_SIZE 24

#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE PAGE_SIZE
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_FIXUP_SIZE PAGE_SIZE

/*
--
1.7.11.7
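As a back-of-the-envelope check of the record counts quoted in the changelog (an editorial sketch; the 192-byte record size is an assumption, since the actual size depends on the PEBS format of the CPU):

```python
# Approximate PEBS buffer capacity before and after the patch,
# assuming a Haswell-style record of 192 bytes (assumption; the exact
# record size varies with x86_pmu.pebs_record_size).
PAGE_SIZE = 4096
record_size = 192

old_buf = PAGE_SIZE        # 4k, the old PEBS_BUFFER_SIZE
new_buf = PAGE_SIZE << 4   # 64k, the new PEBS_BUFFER_SIZE

print(old_buf // record_size)  # 21 records
print(new_buf // record_size)  # 341 records
```

These line up with the "about 21" and "about 330" figures in the commit message, allowing for per-format record-size differences.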

2015-04-09 21:01:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

> + for_each_set_bit(bit, (unsigned long *)&pebs_status, 64) {
> +
> + if (bit >= x86_pmu.max_pebs_events)
> + clear_bit(bit, (unsigned long *)&pebs_status);
> + else {
> + event = cpuc->events[bit];
> + WARN_ON_ONCE(!event);
> +
> + if (!event->attr.precise_ip)
> + clear_bit(bit, (unsigned long *)&pebs_status);

Precompute the mask of non-PEBS events first in the caller.
Then this function would be just a & ~mask.

BTW clear_bit() is atomic; if you're local you should always use
__clear_bit().


> +}
> +
> +static inline void *
> +get_next_pebs_record_by_bit(void *base, void *top, int bit)
> +{
> + void *at;
> + u64 pebs_status;
> +
> + if (base == NULL)
> + return NULL;
> +
> + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> + struct pebs_record_nhm *p = at;
> +
> + if (p->status & (1 << bit)) {

Use test_bit.

> +
> + if (p->status == (1 << bit))
> + return at;
> +
> + /* clear non-PEBS bit and re-check */
> + pebs_status = nonpebs_bit_clear(p->status);
> + if (pebs_status == (1 << bit))
> + return at;
> + }
> + }
> + return NULL;

-Andi

--
[email protected] -- Speaking for myself only.

2015-04-15 17:14:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold

On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> This patch also make AUTO_RELOAD conditional on large PEBS. Auto reload
> only be enabled when fix period and large PEBS.

What's a large PEBS?

> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -87,6 +87,17 @@ struct amd_nb {
> #define MAX_PEBS_EVENTS 8
>
> /*
> + * Flags PEBS can handle without an PMI.
> + *
> + * TID can only be handled by flushing at context switch.
> + */
> +#define PEBS_FREERUNNING_FLAGS \
> + (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
> + PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
> + PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
> + PERF_SAMPLE_TRANSACTION)
> +
> +/*
> * A debug store configuration.
> *
> * We only support architectures that use 64bit fields.
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 0a7b5ca..6c8579a 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -2306,7 +2306,9 @@ static int intel_pmu_hw_config(struct perf_event *event)
> return ret;
>
> if (event->attr.precise_ip) {
> - if (!event->attr.freq)
> + /* only enable auto reload when fix period and large PEBS */
> + if (!event->attr.freq &&
> + !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
> event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
> if (x86_pmu.pebs_aliases)
> x86_pmu.pebs_aliases(event);

I suspect you meant the above change right?

But this negates part of the benefit of the auto reload; where
previously it saved an MSR write for pretty much all PEBS usage it now
becomes a burden for pretty much everyone.

Why cannot we retain the win for all PEBS users?

2015-04-15 17:49:07

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold



> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Wednesday, April 15, 2015 1:15 PM
> To: Liang, Kan
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
>
> On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> > This patch also make AUTO_RELOAD conditional on large PEBS. Auto
> > reload only be enabled when fix period and large PEBS.
>
> What's a large PEBS?
>
Should be large PEBS available.
!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)

> > +++ b/arch/x86/kernel/cpu/perf_event.h
> > @@ -87,6 +87,17 @@ struct amd_nb {
> > #define MAX_PEBS_EVENTS 8
> >
> > /*
> > + * Flags PEBS can handle without an PMI.
> > + *
> > + * TID can only be handled by flushing at context switch.
> > + */
> > +#define PEBS_FREERUNNING_FLAGS \
> > + (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
> > + PERF_SAMPLE_ID | PERF_SAMPLE_CPU |
> PERF_SAMPLE_STREAM_ID | \
> > + PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
> > + PERF_SAMPLE_TRANSACTION)
> > +
> > +/*
> > * A debug store configuration.
> > *
> > * We only support architectures that use 64bit fields.
> > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c
> > b/arch/x86/kernel/cpu/perf_event_intel.c
> > index 0a7b5ca..6c8579a 100644
> > --- a/arch/x86/kernel/cpu/perf_event_intel.c
> > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > @@ -2306,7 +2306,9 @@ static int intel_pmu_hw_config(struct
> perf_event *event)
> > return ret;
> >
> > if (event->attr.precise_ip) {
> > - if (!event->attr.freq)
> > + /* only enable auto reload when fix period and large PEBS
> */
> > + if (!event->attr.freq &&
> > + !(event->attr.sample_type &
> ~PEBS_FREERUNNING_FLAGS))
> > event->hw.flags |=
> PERF_X86_EVENT_AUTO_RELOAD;
> > if (x86_pmu.pebs_aliases)
> > x86_pmu.pebs_aliases(event);
>
> I suspect you meant the above change right?

Yes.

>
> But this negates part of the benefit of the auto reload; where previously it
> saved an MSR write for pretty much all PEBS usage it now becomes a
> burden for pretty much everyone.
>
> Why cannot we retain the win for all PEBS users?

The change tries to address your comments.
https://lkml.org/lkml/2015/3/30/294

Yes, we can retain the win. If so, I think we need to introduce another
flag like PERF_X86_EVENT_LARGE_PEBS and check it in pebs_is_enabled.
Or just keep the previous V5 patch unchanged.

Thanks,
Kan

2015-04-15 18:05:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold

On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> @@ -280,8 +280,9 @@ static int alloc_pebs_buffer(int cpu)
> ds->pebs_absolute_maximum = ds->pebs_buffer_base +
> max * x86_pmu.pebs_record_size;
>
> - ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> - thresh * x86_pmu.pebs_record_size;
> + if (x86_pmu.intel_cap.pebs_format < 1)
> + ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> + x86_pmu.pebs_record_size;
>
> return 0;
> }

I can't seem to figure out what this is about.. help?

2015-04-15 18:10:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold

On Wed, Apr 15, 2015 at 05:48:39PM +0000, Liang, Kan wrote:
> > > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > > @@ -2306,7 +2306,9 @@ static int intel_pmu_hw_config(struct perf_event *event)
> > > return ret;
> > >
> > > if (event->attr.precise_ip) {
> > > - if (!event->attr.freq)
> > > + /* only enable auto reload when fix period and large PEBS */
> > > + if (!event->attr.freq &&
> > > + !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
> > > event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
> > > if (x86_pmu.pebs_aliases)
> > > x86_pmu.pebs_aliases(event);
> >
> > I suspect you meant the above change right?
>
> Yes.
>
> >
> > But this negates part of the benefit of the auto reload; where previously it
> > saved an MSR write for pretty much all PEBS usage it now becomes a
> > burden for pretty much everyone.
> >
> > Why cannot we retain the win for all PEBS users?
>
> The change tries to address your comments.
> https://lkml.org/lkml/2015/3/30/294
>
> Yes, we can retain the win. If so, I think we need to introduce another
> flag like PERF_X86_EVENT_LARGE_PEBS and check it in pebs_is_enabled.
> Or just keep the previous V5 patch unchanged.

Right, something like so.

--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -65,16 +65,17 @@ struct event_constraint {
/*
* struct hw_perf_event.flags flags
*/
-#define PERF_X86_EVENT_PEBS_LDLAT 0x1 /* ld+ldlat data address sampling */
-#define PERF_X86_EVENT_PEBS_ST 0x2 /* st data address sampling */
-#define PERF_X86_EVENT_PEBS_ST_HSW 0x4 /* haswell style datala, store */
-#define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
-#define PERF_X86_EVENT_PEBS_LD_HSW 0x10 /* haswell style datala, load */
-#define PERF_X86_EVENT_PEBS_NA_HSW 0x20 /* haswell style datala, unknown */
-#define PERF_X86_EVENT_EXCL 0x40 /* HT exclusivity on counter */
-#define PERF_X86_EVENT_DYNAMIC 0x80 /* dynamic alloc'd constraint */
-#define PERF_X86_EVENT_RDPMC_ALLOWED 0x40 /* grant rdpmc permission */
-#define PERF_X86_EVENT_AUTO_RELOAD 0x80 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_PEBS_LDLAT 0x0001 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST 0x0002 /* st data address sampling */
+#define PERF_X86_EVENT_PEBS_ST_HSW 0x0004 /* haswell style datala, store */
+#define PERF_X86_EVENT_COMMITTED 0x0008 /* event passed commit_txn */
+#define PERF_X86_EVENT_PEBS_LD_HSW 0x0010 /* haswell style datala, load */
+#define PERF_X86_EVENT_PEBS_NA_HSW 0x0020 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL 0x0040 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC 0x0080 /* dynamic alloc'd constraint */
+#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0100 /* grant rdpmc permission */
+#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_FREERUNNING 0x0400 /* use freerunning PEBS */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -87,6 +88,17 @@ struct amd_nb {
#define MAX_PEBS_EVENTS 8

/*
+ * Flags PEBS can handle without an PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+ PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+ PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+ PERF_SAMPLE_TRANSACTION)
+
+/*
* A debug store configuration.
*
* We only support architectures that use 64bit fields.
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2306,8 +2306,12 @@ static int intel_pmu_hw_config(struct pe
return ret;

if (event->attr.precise_ip) {
- if (!event->attr.freq)
+ if (!event->attr.freq) {
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+
+ if (!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
+ event->hw.flags |= PERF_X86_EVENT_FREERUNNING;
+ }
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
}
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
int node = cpu_to_node(cpu);
- int max, thresh = 1; /* always use a single PEBS record */
+ int max;
void *buffer, *ibuffer;

if (!x86_pmu.pebs)
@@ -277,11 +277,13 @@ static int alloc_pebs_buffer(int cpu)

ds->pebs_buffer_base = (u64)(unsigned long)buffer;
ds->pebs_index = ds->pebs_buffer_base;
- ds->pebs_absolute_maximum = ds->pebs_buffer_base +
- max * x86_pmu.pebs_record_size;
+ ds->pebs_absolute_maximum =
+ ds->pebs_buffer_base + max * x86_pmu.pebs_record_size;

- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- thresh * x86_pmu.pebs_record_size;
+ if (x86_pmu.intel_cap.pebs_format < 1) {
+ ds->pebs_interrupt_threshold =
+ ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }

return 0;
}
@@ -684,14 +686,22 @@ struct event_constraint *intel_pebs_cons
return &emptyconstraint;
}

+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+ return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
struct debug_store *ds = cpuc->ds;
+ bool first_pebs;
+ u64 threshold;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

+ first_pebs = !pebs_is_enabled(cpuc);
cpuc->pebs_enabled |= 1ULL << hwc->idx;

if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -699,11 +709,26 @@ void intel_pmu_pebs_enable(struct perf_e
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;

+ /*
+ * When the event is constrained enough we can use a larger
+ * threshold and run the event with less frequent PMI.
+ */
+ if (0 && /* disable this temporarily */
+ (hwc->flags & PERF_X86_EVENT_FREERUNNING)) {
+ threshold = ds->pebs_absolute_maximum -
+ x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }
+
/* Use auto-reload if possible to save a MSR write in the PMI */
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
ds->pebs_event_reset[hwc->idx] =
(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
}
+
+ if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+ ds->pebs_interrupt_threshold = threshold;
}

void intel_pmu_pebs_disable(struct perf_event *event)

2015-04-15 18:28:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:

> The only way you can get a lot of collision is when you count the same
> thing multiple times. But it is not a useful configuration.

Not entirely true; I _think_ you can be unfortunate if you measure with
a userspace only PEBS event along with either a kernel or unrestricted
PEBS event. Imagine the event triggering and setting the overflow flag
right before entering the kernel. Then all kernel side events will end
up with multiple bits set.

2015-04-15 18:36:01

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold


>
> On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> > @@ -280,8 +280,9 @@ static int alloc_pebs_buffer(int cpu)
> > ds->pebs_absolute_maximum = ds->pebs_buffer_base +
> > max * x86_pmu.pebs_record_size;
> >
> > - ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> > - thresh * x86_pmu.pebs_record_size;
> > + if (x86_pmu.intel_cap.pebs_format < 1)
> > + ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> > + x86_pmu.pebs_record_size;
> >
> > return 0;
> > }
>
> I can't seem to figure out what this is about.. help?

We moved the AUTO_RELOAD and large PEBS checks to intel_pmu_hw_config.
But earlier platforms call x86_pmu_hw_config instead.
So we force a single PEBS record on old platforms.

2015-04-15 18:36:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:
> +/* Clear all non-PEBS bits */
> +static u64
> +nonpebs_bit_clear(u64 pebs_status)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + struct perf_event *event;
> + int bit;
> +
> + for_each_set_bit(bit, (unsigned long *)&pebs_status, 64) {
> +
> + if (bit >= x86_pmu.max_pebs_events)
> + clear_bit(bit, (unsigned long *)&pebs_status);
> + else {
> + event = cpuc->events[bit];
> + WARN_ON_ONCE(!event);
> +
> + if (!event->attr.precise_ip)
> + clear_bit(bit, (unsigned long *)&pebs_status);
> + }
> + }
> +
> + return pebs_status;
> +}

What was wrong with:

status = p->status & cpuc->pebs_enabled;

?

We use the same index bits in the PEBS_ENABLE MSR as in the status reg,
right? If you're really paranoid you can mask out the high (>31) bits
too I suppose.

2015-04-15 18:41:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold

On Wed, Apr 15, 2015 at 06:35:48PM +0000, Liang, Kan wrote:
>
> >
> > On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> > > @@ -280,8 +280,9 @@ static int alloc_pebs_buffer(int cpu)
> > > ds->pebs_absolute_maximum = ds->pebs_buffer_base +
> > > max * x86_pmu.pebs_record_size;
> > >
> > > - ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> > > - thresh * x86_pmu.pebs_record_size;
> > > + if (x86_pmu.intel_cap.pebs_format < 1)
> > > + ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> > > + x86_pmu.pebs_record_size;
> > >
> > > return 0;
> > > }
> >
> > I can't seem to figure out what this is about.. help?
>
> We move AUTO_RELOAD and large PEBS check to intel_pmu_hw_config.
> But for earlier platform, it calls x86_pmu_hw_config.
> So we force single PEBS record for old platform.

We're talking about intel_pmu vs core_pmu right? I don't think core_pmu
supports PEBS at all.

2015-04-16 12:54:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:
> From: Yan, Zheng <[email protected]>
>
> When PEBS interrupt threshold is larger than one, the PEBS buffer
> may include multiple records for each PEBS event. This patch makes
> the code first count how many records each PEBS event has, then
> output the samples in batch.
>
> One corner case needs to mention is that the PEBS hardware doesn't
> deal well with collisions. The records for the events can be collapsed
> into a single one, and it's not possible to reconstruct all events that
> caused the PEBS record.
> Here are some cases which can be called collisions.
> - PEBS events happen near to each other, so the hardware merges them.
> - PEBS events happen near to each other, but they are not merged.
> The GLOBAL_STATUS for first counter is clear before generating event
> for next counter. Only the first record can be treated as collisions.
> - Same as case2, but the first counter isn't clear before generating
> event for next counter. All the records are treated as collision
> until a record with only one bit set for PEBS event.
>
> GLOBAL_STATUS could be set by both PEBS and non-PEBS events. Multiple
> non-PEBS bit set doesn't count as collisions.
>
> In practice collisions are extremely rare, as long as different PEBS
> events are used. The periods are typically very large, so any collision
> is unlikely. When collision happens, we drop the PEBS record.
> The only way you can get a lot of collision is when you count the same
> thing multiple times. But it is not a useful configuration.
>
> Here are some numbers about collisions.
> Four frequently occurring events
> (cycles:p,instructions:p,branches:p,mem-stores:p) are tested
>
> Test events which are sampled together collision rate
> cycles:p,instructions:p 0.25%
> cycles:p,instructions:p,branches:p 0.30%
> cycles:p,instructions:p,branches:p,mem-stores:p 0.35%
>
> cycles:p,cycles:p 98.52%

*sigh* you're really going to make me write this :-(

The sad part is that I had already written a large part of it for you in
a previous email (
lkml.kernel.org/r/[email protected]
).

And yes, writing a good Changelog takes a lot of time and effort,
sometimes more than the actual patch, and that is OK.

There's a *PLEASE CLARIFY* in the below, please do that.

Also the below talks about a PERF_RECORD_SAMPLES_LOST, please also
implement that.

---

When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.

Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible scenarios
to demux.

The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:

A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist

< some unspecified amount of time later --
this could cover multiple events of interest >

B) the hardware assist is armed, any next event will trigger it

C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS

Now consider the following chain of events:

A0, B0, A1, C0

The event generated for counter 0 will include a status with counter 1
set, even though its not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close' --
*PLEASE CLARIFY* either the very same instruction or retired in the same
cycle?

For instance, consider this chain of events:

A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists (the
instruction matches both criteria), we will generate but a single
record, but again with both counters listed in the status field.

This time the record pertains to both events.

Note that these two cases are different but undistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.

Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.

The assumption/hope is that such discards will be rare, and to make sure
the user is not left in the dark about this we'll emit a
PERF_RECORD_SAMPLES_LOST record with the number of possible discards.

2015-04-16 18:46:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold

On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> + /*
> + * When the event is constrained enough we can use a larger
> + * threshold and run the event with less frequent PMI.
> + */
> + if (0 && /* disable this temporarily */

This is a good hint your patches are ordered wrong.

> + (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)) {
> + threshold = ds->pebs_absolute_maximum -
> + x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> + } else {
> + threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
> + }
> +

2015-04-17 08:11:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

On Thu, Apr 16, 2015 at 02:53:42PM +0200, Peter Zijlstra wrote:
> When the PEBS interrupt threshold is larger than one record and the
> machine supports multiple PEBS events, the records of these events are
> mixed up and we need to demultiplex them.
>
> Demuxing the records is hard because the hardware is deficient. The
> hardware has two issues that, when combined, create impossible scenarios
> to demux.
>
> The first issue is that the 'status' field of the PEBS record is a copy
> of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
> problem let us first describe the regular PEBS cycle:
>
> A) the CTRn value reaches 0:
> - the corresponding bit in GLOBAL_STATUS gets set
> - we start arming the hardware assist
>
> < some unspecified amount of time later --
> this could cover multiple events of interest >
>
> B) the hardware assist is armed, any next event will trigger it
>
> C) a matching event happens:
> - the hardware assist triggers and generates a PEBS record
> this includes a copy of GLOBAL_STATUS at this moment
> - if we auto-reload we (re)set CTRn

Is this actually true? Do we reload here or on A ?

> - we clear the relevant bit in GLOBAL_STATUS

2015-04-17 12:50:42

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer



> >
> > A) the CTRn value reaches 0:
> > - the corresponding bit in GLOBAL_STATUS gets set
> > - we start arming the hardware assist
> >
> > < some unspecified amount of time later --
> > this could cover multiple events of interest >
> >
> > B) the hardware assist is armed, any next event will trigger it
> >
> > C) a matching event happens:
> > - the hardware assist triggers and generates a PEBS record
> > this includes a copy of GLOBAL_STATUS at this moment
> > - if we auto-reload we (re)set CTRn
>
> Is this actually true? Do we reload here or on A ?
>

Yes, on C.
According to SDM Volume 3, 18.7.1.1, the reset value will be
loaded after each PEBS record is written, which is done
by hw assist.


> > - we clear the relevant bit in GLOBAL_STATUS

2015-04-17 13:12:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

On Fri, Apr 17, 2015 at 12:50:33PM +0000, Liang, Kan wrote:
>
>
> > >
> > > A) the CTRn value reaches 0:
> > > - the corresponding bit in GLOBAL_STATUS gets set
> > > - we start arming the hardware assist
> > >
> > > < some unspecified amount of time later --
> > > this could cover multiple events of interest >
> > >
> > > B) the hardware assist is armed, any next event will trigger it
> > >
> > > C) a matching event happens:
> > > - the hardware assist triggers and generates a PEBS record
> > > this includes a copy of GLOBAL_STATUS at this moment
> > > - if we auto-reload we (re)set CTRn
> >
> > Is this actually true? Do we reload here or on A ?
> >
>
> Yes, on C.
> According to SDM Volume 3, 18.7.1.1, the reset value will be
> loaded after each PEBS record is written, which is done
> by hw assist.

OK, then I did indeed remember that right.

But that brings us to patch 1 of this series, how is that correct in the
face of this? There is an arbitrary delay (A->B) added to the period.
And the Changelog of course never did bother to make that clear.

2015-04-17 14:20:09

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer



> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Friday, April 17, 2015 9:13 AM
> To: Liang, Kan
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS
> buffer
>
> On Fri, Apr 17, 2015 at 12:50:33PM +0000, Liang, Kan wrote:
> >
> >
> > > >
> > > > A) the CTRn value reaches 0:
> > > > - the corresponding bit in GLOBAL_STATUS gets set
> > > > - we start arming the hardware assist
> > > >
> > > > < some unspecified amount of time later --
> > > > this could cover multiple events of interest >
> > > >
> > > > B) the hardware assist is armed, any next event will trigger it
> > > >
> > > > C) a matching event happens:
> > > > - the hardware assist triggers and generates a PEBS record
> > > > this includes a copy of GLOBAL_STATUS at this moment
> > > > - if we auto-reload we (re)set CTRn
> > >
> > > Is this actually true? Do we reload here or on A ?
> > >
> >
> > Yes, on C.
> > According to SDM Volume 3, 18.7.1.1, the reset value will be loaded
> > after each PEBS record is written, which is done by hw assist.
>
> OK, then I did indeed remember that right.
>
> But that brings us to patch 1 of this series, how is that correct in the face of
> this? There is an arbitrary delay (A->B) added to the period.
> And the Changelog of course never did bother to make that clear.

OK. I will update the changelog for patch 1 as below.
---
When a fixed period is specified, this patch makes perf use the PEBS
auto-reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
However, the reset value will be loaded by the hardware assist. There is
a small delay compared to the previous non-auto-reload mechanism.
The delay is arbitrary but very small.

Thanks,
Kan

2015-04-17 14:44:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:

> > But that brings us to patch 1 of this series, how is that correct in the face of
> > this? There is an arbitrary delay (A->B) added to the period.
> > And the Changelog of course never did bother to make that clear.
>
> OK. I will update the changelog for patch 1 as below.
> ---
> When a fixed period is specified, this patch make perf use the PEBS
> auto reload mechanism. This makes normal profiling faster, because
> it avoids one costly MSR write in the PMI handler.

> However, the reset value will be loaded by hardware assist. There is
> a little bit delay compared to previous non-auto-reload mechanism.
> The delay is arbitrary but very small.

What is very small? And doesn't that mean its significant at exactly the
point this patch series is aimed at, namely very short period.

2015-04-17 18:20:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

On Fri, Apr 17, 2015 at 04:44:07PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:
>
> > > But that brings us to patch 1 of this series, how is that correct in the face of
> > > this? There is an arbitrary delay (A->B) added to the period.
> > > And the Changelog of course never did bother to make that clear.

That's how perf and other profilers always behaved. The PMI
is not part of the period. The automatic PEBS reload is not in any way
different. It's much faster than a PMI, but it's also not zero cost.

This is not a gap in measurement though -- there is no other code
running during that time on that CPU. It's simply overhead from the
measurement mechanism.

> >
> > OK. I will update the changelog for patch 1 as below.
> > ---
> > When a fixed period is specified, this patch make perf use the PEBS
> > auto reload mechanism. This makes normal profiling faster, because
> > it avoids one costly MSR write in the PMI handler.
>
> > However, the reset value will be loaded by hardware assist. There is
> > a little bit delay compared to previous non-auto-reload mechanism.
> > The delay is arbitrary but very small.
>
> What is very small? And doesn't that mean its significant at exactly the
> point this patch series is aimed at, namely very short period.

The assist cost is 400-800 cycles, assuming common cases with everything
cached. The minimum period the patch currently uses is 10000. In that
extreme case it can be ~10% if cycles are used.

-Andi
--
[email protected] -- Speaking for myself only.
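A quick sanity check on the figures Andi quotes (editorial sketch, using only the numbers from his email): an assist cost of 400-800 cycles against the minimum period of 10000 gives a per-sample overhead in the 4-8% range when sampling cycles, i.e. roughly the "~10%" worst case.

```python
# Overhead fraction of the PEBS hardware assist per auto-reload,
# relative to the sampling period, using the cycle counts Andi quotes.
def assist_overhead(assist_cycles, period):
    return assist_cycles / period

lo = assist_overhead(400, 10000)
hi = assist_overhead(800, 10000)
print(f"overhead range: {lo:.1%} - {hi:.1%}")  # 4.0% - 8.0%
```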

2015-04-17 18:25:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

On Fri, Apr 17, 2015 at 08:20:37PM +0200, Andi Kleen wrote:
> On Fri, Apr 17, 2015 at 04:44:07PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:
> >
> > > > But that brings us to patch 1 of this series, how is that correct in the face of
> > > > this? There is an arbitrary delay (A->B) added to the period.
> > > > And the Changelog of course never did bother to make that clear.
>
> That's how perf and other profilers always behaved. The PMI
> is not part of the period. The automatic PEBS reload is not in any way
> different. It's much faster than a PMI, but it's also not zero cost.
>
> This is not a gap in measurement though -- there is no other code
> running during that time on that CPU. It's simply overhead from the
> measurement mechanism.
>
> > >
> > > OK. I will update the changelog for patch 1 as below.
> > > ---
> > > When a fixed period is specified, this patch make perf use the PEBS
> > > auto reload mechanism. This makes normal profiling faster, because
> > > it avoids one costly MSR write in the PMI handler.
> >
> > > However, the reset value will be loaded by hardware assist. There is
> > > a little bit delay compared to previous non-auto-reload mechanism.
> > > The delay is arbitrary but very small.
> >
> > What is very small? And doesn't that mean its significant at exactly the
> > point this patch series is aimed at, namely very short period.
>
> The assist cost is 400-800 cycles, assuming common cases with everything
> cached. The minimum period the patch currently uses is 10000. In that
> extreme case it can be ~10% if cycles are used.

Thanks, please include all this information.