LinuxLists.cc - [PATCH V7 0/6] large PEBS interrupt threshold

2015-04-20 15:19:42

[permalink] [raw]

Subject: [PATCH V7 0/6] large PEBS interrupt threshold

This patch series implements large PEBS interrupt threshold.
Currently, the PEBS threshold is forced to set to one. A larger PEBS
interrupt threshold can significantly reduce the sampling overhead
especially for frequently occurring events
(like cycles or branches or load/stores) with small sampling period.
For example, perf record cycles event when running kernbench
with 10003 sampling period. The Elapsed Time reduced from 32.7 seconds
to 16.5 seconds, which is 2X faster.
For more details, please refer to patch 4's description.

Limitations:
- It can not supply a callgraph.
- It requires setting a fixed period.
- It cannot supply a time stamp.
- To supply a TID it requires flushing on context switch.
If the above requirement doesn't apply, the threshold will set to one.

Collisions:
When PEBS events happen near to each other, the records for the events
can be collapsed into a single one, and it's not possible to
reconstruct. When collision happens, we drop the PEBS record.
Actually, collisions are extremely rare as long as different events
are used. We once tested the worst case with four frequently occurring
events (cycles:p,instructions:p,branches:p,mem-stores:p).
The collisions rate is only 0.34%.
For details about collisions, please refer to patch 3's description.

changes since v1:
- drop patch 'perf, core: Add all PMUs to pmu_idr'
- add comments for case that multiple counters overflow simultaneously
changes since v2:
- rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
- use flag to indicate auto reload mechanism
- move codes that setup PEBS sample data to separate function
- output the PEBS records in batch
- enable this for All (PEBS capable) hardware
- more description for the multiplex
changes since v3:
- ignore conflicting PEBS record
changes since v4:
- Do more tests for collision and update comments
changes since v5:
- Move autoreload and large PEBS available check to intel_pmu_hw_config
- make AUTO_RELOAD conditional on large PEBS
- !PEBS bug fix
- coherent story about what is collision and how we handle it
- Remove extra state pebs_sched_cb_enabled
changes since v6:
- new flag PERF_X86_EVENT_FREERUNNING to indicate large PEBS available
- patch reorder and changelog changes for patch 1 and 3
- An easy way to clear !PEBS bit
- Log collision to PERF_RECORD_SAMPLES_LOST

Yan, Zheng (6):
perf, x86: use the PEBS auto reload mechanism when possible
perf, x86: introduce setup_pebs_sample_data()
perf, x86: handle multiple records in PEBS buffer
perf, x86: large PEBS interrupt threshold
perf, x86: drain PEBS buffer during context switch
perf, x86: enlarge PEBS buffer

arch/x86/kernel/cpu/perf_event.c | 15 +-
arch/x86/kernel/cpu/perf_event.h | 35 +++-
arch/x86/kernel/cpu/perf_event_intel.c | 22 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 324 +++++++++++++++++++++--------
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 -
include/linux/perf_event.h | 13 ++
kernel/events/core.c | 6 +-
kernel/events/internal.h | 9 -
8 files changed, 312 insertions(+), 115 deletions(-)

--
1.8.3.1

2015-04-20 15:21:08

[permalink] [raw]

Subject: [PATCH V7 1/6] perf, x86: use the PEBS auto reload mechanism when possible

From: Yan, Zheng <[email protected]>

When a fixed period is specified, this patch make perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
However, the reset value will be loaded by hardware assist. There is a
little bit delay compared to previous non-auto-reload mechanism. The
delay time is arbitrary, but very small. The assist cost is 400-800
cycles, assuming common cases with everything cached. The minimum period
the patch currently uses is 10000. In that extreme case it can be ~10%
if cycles are used.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++------
arch/x86/kernel/cpu/perf_event.h | 20 ++++++++++----------
arch/x86/kernel/cpu/perf_event_intel.c | 8 ++++++--
arch/x86/kernel/cpu/perf_event_intel_ds.c | 7 +++++++
4 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 87848eb..8cc1153 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1058,13 +1058,16 @@ int x86_perf_event_set_period(struct perf_event *event)

per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;

- /*
- * The hw event starts counting from this event offset,
- * mark it to be able to extra future deltas:
- */
- local64_set(&hwc->prev_count, (u64)-left);
+ if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+ local64_read(&hwc->prev_count) != (u64)-left) {
+ /*
+ * The hw event starts counting from this event offset,
+ * mark it to be able to extra future deltas:
+ */
+ local64_set(&hwc->prev_count, (u64)-left);

- wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ }

/*
* Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 329f035..9963981 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -65,16 +65,16 @@ struct event_constraint {
/*
* struct hw_perf_event.flags flags
*/
-#define PERF_X86_EVENT_PEBS_LDLAT 0x1 /* ld+ldlat data address sampling */
-#define PERF_X86_EVENT_PEBS_ST 0x2 /* st data address sampling */
-#define PERF_X86_EVENT_PEBS_ST_HSW 0x4 /* haswell style datala, store */
-#define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
-#define PERF_X86_EVENT_PEBS_LD_HSW 0x10 /* haswell style datala, load */
-#define PERF_X86_EVENT_PEBS_NA_HSW 0x20 /* haswell style datala, unknown */
-#define PERF_X86_EVENT_EXCL 0x40 /* HT exclusivity on counter */
-#define PERF_X86_EVENT_DYNAMIC 0x80 /* dynamic alloc'd constraint */
-#define PERF_X86_EVENT_RDPMC_ALLOWED 0x40 /* grant rdpmc permission */
-
+#define PERF_X86_EVENT_PEBS_LDLAT 0x0001 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST 0x0002 /* st data address sampling */
+#define PERF_X86_EVENT_PEBS_ST_HSW 0x0004 /* haswell style datala, store */
+#define PERF_X86_EVENT_COMMITTED 0x0008 /* event passed commit_txn */
+#define PERF_X86_EVENT_PEBS_LD_HSW 0x0010 /* haswell style datala, load */
+#define PERF_X86_EVENT_PEBS_NA_HSW 0x0020 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL 0x0040 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC 0x0080 /* dynamic alloc'd constraint */
+#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0100 /* grant rdpmc permission */
+#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */

struct amd_nb {
int nb_id; /* NorthBridge id */
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 9da2400..0a7b5ca 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2305,8 +2305,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (ret)
return ret;

- if (event->attr.precise_ip && x86_pmu.pebs_aliases)
- x86_pmu.pebs_aliases(event);
+ if (event->attr.precise_ip) {
+ if (!event->attr.freq)
+ event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+ if (x86_pmu.pebs_aliases)
+ x86_pmu.pebs_aliases(event);
+ }

if (needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ca69ea5..7c6dd8e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -680,6 +680,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

@@ -689,6 +690,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+
+ /* Use auto-reload if possible to save a MSR write in the PMI */
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ ds->pebs_event_reset[hwc->idx] =
+ (u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
+ }
}

void intel_pmu_pebs_disable(struct perf_event *event)
--
1.8.3.1

2015-04-20 15:21:05

[permalink] [raw]

Subject: [PATCH V7 2/6] perf, x86: introduce setup_pebs_sample_data()

From: Yan, Zheng <[email protected]>

move codes that setup PEBS sample data to separate function.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 95 +++++++++++++++++--------------
1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 7c6dd8e..e3916d5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -845,8 +845,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
return txn;
}

-static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
#define PERF_X86_EVENT_PEBS_HSW_PREC \
(PERF_X86_EVENT_PEBS_ST_HSW | \
@@ -858,30 +860,25 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
*/
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct pebs_record_hsw *pebs = __pebs;
- struct perf_sample_data data;
- struct pt_regs regs;
u64 sample_type;
int fll, fst, dsrc;
int fl = event->hw.flags;

- if (!intel_pmu_save_and_restart(event))
- return;
-
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;

fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);

- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(data, 0, event->hw.last_period);

- data.period = event->hw.last_period;
+ data->period = event->hw.last_period;

/*
* Use latency for weight (only avail with PEBS-LL)
*/
if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
- data.weight = pebs->lat;
+ data->weight = pebs->lat;

/*
* data.data_src encodes the data source
@@ -894,7 +891,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
val = precise_datala_hsw(event, pebs->dse);
else if (fst)
val = precise_store_data(pebs->dse);
- data.data_src.val = val;
+ data->data_src.val = val;
}

/*
@@ -907,58 +904,70 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
* A possible PERF_SAMPLE_REGS will have to transfer all regs.
*/
- regs = *iregs;
- regs.flags = pebs->flags;
- set_linear_ip(&regs, pebs->ip);
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
+ *regs = *iregs;
+ regs->flags = pebs->flags;
+ set_linear_ip(regs, pebs->ip);
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;

if (sample_type & PERF_SAMPLE_REGS_INTR) {
- regs.ax = pebs->ax;
- regs.bx = pebs->bx;
- regs.cx = pebs->cx;
- regs.dx = pebs->dx;
- regs.si = pebs->si;
- regs.di = pebs->di;
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
-
- regs.flags = pebs->flags;
+ regs->ax = pebs->ax;
+ regs->bx = pebs->bx;
+ regs->cx = pebs->cx;
+ regs->dx = pebs->dx;
+ regs->si = pebs->si;
+ regs->di = pebs->di;
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;
+
+ regs->flags = pebs->flags;
#ifndef CONFIG_X86_32
- regs.r8 = pebs->r8;
- regs.r9 = pebs->r9;
- regs.r10 = pebs->r10;
- regs.r11 = pebs->r11;
- regs.r12 = pebs->r12;
- regs.r13 = pebs->r13;
- regs.r14 = pebs->r14;
- regs.r15 = pebs->r15;
+ regs->r8 = pebs->r8;
+ regs->r9 = pebs->r9;
+ regs->r10 = pebs->r10;
+ regs->r11 = pebs->r11;
+ regs->r12 = pebs->r12;
+ regs->r13 = pebs->r13;
+ regs->r14 = pebs->r14;
+ regs->r15 = pebs->r15;
#endif
}

if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs->real_ip;
- regs.flags |= PERF_EFLAGS_EXACT;
- } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
- regs.flags |= PERF_EFLAGS_EXACT;
+ regs->ip = pebs->real_ip;
+ regs->flags |= PERF_EFLAGS_EXACT;
+ } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+ regs->flags |= PERF_EFLAGS_EXACT;
else
- regs.flags &= ~PERF_EFLAGS_EXACT;
+ regs->flags &= ~PERF_EFLAGS_EXACT;

if ((sample_type & PERF_SAMPLE_ADDR) &&
x86_pmu.intel_cap.pebs_format >= 1)
- data.addr = pebs->dla;
+ data->addr = pebs->dla;

if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
- data.weight = intel_hsw_weight(pebs);
+ data->weight = intel_hsw_weight(pebs);

if (sample_type & PERF_SAMPLE_TRANSACTION)
- data.txn = intel_hsw_transaction(pebs);
+ data->txn = intel_hsw_transaction(pebs);
}

if (has_branch_stack(event))
- data.br_stack = &cpuc->lbr_stack;
+ data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs)
+{
+ struct perf_sample_data data;
+ struct pt_regs regs;
+
+ if (!intel_pmu_save_and_restart(event))
+ return;
+
+ setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);

if (perf_event_overflow(event, &data, &regs))
x86_pmu_stop(event, 0);
--
1.8.3.1

2015-04-20 15:19:47

[permalink] [raw]

Subject: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

From: Yan, Zheng <[email protected]>

When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.

Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible scenarios
to demux.

The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:

A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist
< some unspecified amount of time later -- this could cover multiple
events of interest >

B) the hardware assist is armed, any next event will trigger it

C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS

Now consider the following chain of events:

A0, B0, A1, C0

The event generated for counter 0 will include a status with counter 1
set, even though its not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close'. The
'close' can be several cycles. In some cases even the complete assist,
if the event is something that doesn't need retirement.

For instance, consider this chain of events:

A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists, we will
generate but a single record, but again with both counters listed in the
status field.

This time the record pertains to both events.

Note that these two cases are different but undistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.

Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.

The assumption/hope is that such discards will be rare, and to make sure
the user is not left in the dark about this we'll emit a
PERF_RECORD_SAMPLES_LOST record with the number of possible discards.

Here lists some possible ways you may get a lot of collision.
- when you count the same thing multiple times. But it is not a useful
configuration.
- you can be unfortunate if you measure with a userspace only PEBS
event along with either a kernel or unrestricted PEBS event. Imagine
the event triggering and setting the overflow flag right before
entering the kernel. Then all kernel side events will end up with
multiple bits set.

Here are some numbers about collisions.
Four frequently occurring events
(cycles:p,instructions:p,branches:p,mem-stores:p) are tested

Test events which are sampled together collision rate
cycles:p,instructions:p 0.25%
cycles:p,instructions:p,branches:p 0.30%
cycles:p,instructions:p,branches:p,mem-stores:p 0.35%

cycles:p,cycles:p 98.52%

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 171 +++++++++++++++++++++++-------
include/linux/perf_event.h | 13 +++
kernel/events/core.c | 6 +-
kernel/events/internal.h | 9 --
4 files changed, 149 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index e3916d5..44be1f6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -864,6 +864,9 @@ static void setup_pebs_sample_data(struct perf_event *event,
int fll, fst, dsrc;
int fl = event->hw.flags;

+ if (pebs == NULL)
+ return;
+
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;

@@ -958,19 +961,97 @@ static void setup_pebs_sample_data(struct perf_event *event,
data->br_stack = &cpuc->lbr_stack;
}

+static void perf_log_lost(struct perf_event *event)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ struct {
+ struct perf_event_header header;
+ u64 id;
+ u64 lost;
+ } lost_event = {
+ .header = {
+ .type = PERF_RECORD_LOST,
+ .misc = 0,
+ .size = sizeof(lost_event),
+ },
+ .id = event->id,
+ .lost = 1,
+ };
+
+ perf_event_header__init_id(&lost_event.header, &sample, event);
+
+ ret = perf_output_begin(&handle, event,
+ lost_event.header.size);
+ if (ret)
+ return;
+
+ perf_output_put(&handle, lost_event);
+ perf_event__output_id_sample(event, &handle, &sample);
+ perf_output_end(&handle);
+}
+
+static inline void *
+get_next_pebs_record_by_bit(void *base, void *top, int bit)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ void *at;
+ u64 pebs_status;
+
+ if (base == NULL)
+ return NULL;
+
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+
+ if (test_bit(bit, (unsigned long *)&p->status)) {
+
+ if (p->status == (1 << bit))
+ return at;
+
+ /* clear non-PEBS bit and re-check */
+ pebs_status = p->status & cpuc->pebs_enabled;
+ pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
+ if (pebs_status == (1 << bit))
+ return at;
+ }
+ }
+ return NULL;
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+ struct pt_regs *iregs,
+ void *base, void *top,
+ int bit, int count)
{
struct perf_sample_data data;
struct pt_regs regs;
+ int i;
+ void *at = get_next_pebs_record_by_bit(base, top, bit);

- if (!intel_pmu_save_and_restart(event))
+ if (!intel_pmu_save_and_restart(event) &&
+ !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
return;

- setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
+ if (count > 1) {
+ for (i = 0; i < count - 1; i++) {
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
+ perf_event_output(event, &data, &regs);
+ at += x86_pmu.pebs_record_size;
+ at = get_next_pebs_record_by_bit(at, top, bit);
+ }
+ }
+
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);

- if (perf_event_overflow(event, &data, &regs))
+ /* all records are processed, handle event overflow now */
+ if (perf_event_overflow(event, &data, &regs)) {
x86_pmu_stop(event, 0);
+ return;
+ }
+
}

static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
@@ -1000,72 +1081,86 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
if (!event->attr.precise_ip)
return;

- n = top - at;
+ n = (top - at) / x86_pmu.pebs_record_size;
if (n <= 0)
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
- at += n - 1;
-
- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, at,
+ top, 0, n);
}

static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct debug_store *ds = cpuc->ds;
- struct perf_event *event = NULL;
- void *at, *top;
- u64 status = 0;
+ struct perf_event *event;
+ void *base, *at, *top;
int bit;
+ int counts[MAX_PEBS_EVENTS] = {};

if (!x86_pmu.pebs_active)
return;

- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;

ds->pebs_index = ds->pebs_buffer_base;

- if (unlikely(at > top))
+ if (unlikely(base >= top))
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
- "Unexpected number of pebs records %ld\n",
- (long)(top - at) / x86_pmu.pebs_record_size);
-
- for (; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

for_each_set_bit(bit, (unsigned long *)&p->status,
x86_pmu.max_pebs_events) {
event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
WARN_ON_ONCE(!event);

- if (!event->attr.precise_ip)
- continue;
+ if (event->attr.precise_ip)
+ break;
+ }

- if (__test_and_set_bit(bit, (unsigned long *)&status))
+ if (bit >= x86_pmu.max_pebs_events)
+ continue;
+ if (!test_bit(bit, cpuc->active_mask))
+ continue;
+ /*
+ * The PEBS hardware does not deal well with the situation
+ * when events happen near to each other and multiple bits
+ * are set. But it should happen rarely.
+ *
+ * If these events include one PEBS and multiple non-PEBS
+ * events, it doesn't impact PEBS record. The record will
+ * be handled normally. (slow path)
+ *
+ * If these events include two or more PEBS events, the
+ * records for the events can be collapsed into a single
+ * one, and it's not possible to reconstruct all events
+ * that caused the PEBS record. It's called collision.
+ * If collision happened, the record will be dropped.
+ *
+ */
+ if (p->status != (1 << bit)) {
+ u64 pebs_status;
+
+ /* slow path */
+ pebs_status = p->status & cpuc->pebs_enabled;
+ pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
+ if (pebs_status != (1 << bit)) {
+ perf_log_lost(event);
continue;
-
- break;
+ }
}
+ counts[bit]++;
+ }

- if (!event || bit >= x86_pmu.max_pebs_events)
+ for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+ if (counts[bit] == 0)
continue;
-
- __intel_pmu_pebs_event(event, iregs, at);
+ event = cpuc->events[bit];
+ __intel_pmu_pebs_event(event, iregs, base,
+ top, bit, counts[bit]);
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61992cf..bed1b6f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -734,6 +734,19 @@ extern int perf_event_overflow(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs);

+extern void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
+
+extern void
+perf_event_header__init_id(struct perf_event_header *header,
+ struct perf_sample_data *data,
+ struct perf_event *event);
+extern void
+perf_event__output_id_sample(struct perf_event *event,
+ struct perf_output_handle *handle,
+ struct perf_sample_data *sample);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 06917d5..a8d0e92 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5360,9 +5360,9 @@ void perf_prepare_sample(struct perf_event_header *header,
}
}

-static void perf_event_output(struct perf_event *event,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
struct perf_output_handle handle;
struct perf_event_header header;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 9f6ce9b..2deb24c 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -72,15 +72,6 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
void perf_event_aux_event(struct perf_event *event, unsigned long head,
unsigned long size, u64 flags);

-extern void
-perf_event_header__init_id(struct perf_event_header *header,
- struct perf_sample_data *data,
- struct perf_event *event);
-extern void
-perf_event__output_id_sample(struct perf_event *event,
- struct perf_output_handle *handle,
- struct perf_sample_data *sample);
-
extern struct page *
perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff);

--
1.8.3.1

2015-04-20 15:19:50

[permalink] [raw]

Subject: [PATCH V7 4/6] perf, x86: large PEBS interrupt threshold

From: Yan, Zheng <[email protected]>

PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.

For frequently occurring events (like cycles or branches or load/store)
this in term requires using a relatively high sampling period to avoid
overloading the system, by only processing PMIs. This in term increases
sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it can not
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases, basically as long as
you can do without a callgraph and can set the period you can use this
new PEBS mode.

The main benefit is the ability to support much lower sampling period
(down to -c 1000) without extensive overhead.

One use cases is for example to increase the resolution of the c2c tool.
Another is double checking when you suspect the standard sampling has
too much sampling error.

Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).
The test command for plain:
"perf record --time -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
"perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
(The only difference of test command between multi and plain is time
stamp options. Since time stamp is not supported by large PEBS
threshold, it can be used as a flag to indicate if large threshold is
enabled during the test.)

period plain(Sec) multi(Sec) Delta
10003 32.7 16.5 16.2
20003 30.2 16.2 14.0
40003 18.6 14.1 4.5
80003 16.8 14.6 2.2
100003 16.9 14.1 2.8
800003 15.4 15.7 -0.3
1000003 15.3 15.2 0.2
2000003 15.3 15.1 0.1

With periods below 100003, plain (threshold one) cause much more
overhead. With 10003 sampling period, the Elapsed Time for multi is
even 2X faster than plain.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 11 +++++++++++
arch/x86/kernel/cpu/perf_event_intel.c | 5 ++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 27 +++++++++++++++++++++++----
3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 9963981..9ef821d 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -75,6 +75,7 @@ struct event_constraint {
#define PERF_X86_EVENT_DYNAMIC 0x0080 /* dynamic alloc'd constraint */
#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0100 /* grant rdpmc permission */
#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_FREERUNNING 0x0400 /* use freerunning PEBS */

struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -87,6 +88,16 @@ struct amd_nb {
#define MAX_PEBS_EVENTS 8

/*
+ * Flags PEBS can handle without an PMI.
+ *
+ */
+#define PEBS_FREERUNNING_FLAGS \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_ADDR | \
+ PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+ PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+ PERF_SAMPLE_TRANSACTION)
+
+/*
* A debug store configuration.
*
* We only support architectures that use 64bit fields.
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0a7b5ca..700c331 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2306,8 +2306,11 @@ static int intel_pmu_hw_config(struct perf_event *event)
return ret;

if (event->attr.precise_ip) {
- if (!event->attr.freq)
+ if (!event->attr.freq) {
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+ if (!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
+ event->hw.flags |= PERF_X86_EVENT_FREERUNNING;
+ }
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 44be1f6..8721c45 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
int node = cpu_to_node(cpu);
- int max, thresh = 1; /* always use a single PEBS record */
+ int max;
void *buffer, *ibuffer;

if (!x86_pmu.pebs)
@@ -280,9 +280,6 @@ static int alloc_pebs_buffer(int cpu)
ds->pebs_absolute_maximum = ds->pebs_buffer_base +
max * x86_pmu.pebs_record_size;

- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- thresh * x86_pmu.pebs_record_size;
-
return 0;
}

@@ -676,14 +673,22 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}

+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+ return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
struct debug_store *ds = cpuc->ds;
+ bool first_pebs;
+ u64 threshold;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

+ first_pebs = !pebs_is_enabled(cpuc);
cpuc->pebs_enabled |= 1ULL << hwc->idx;

if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -691,11 +696,25 @@ void intel_pmu_pebs_enable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;

+ /*
+ * When the event is constrained enough we can use a larger
+ * threshold and run the event with less frequent PMI.
+ */
+ if (hwc->flags & PERF_X86_EVENT_FREERUNNING) {
+ threshold = ds->pebs_absolute_maximum -
+ x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }
+
/* Use auto-reload if possible to save a MSR write in the PMI */
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
ds->pebs_event_reset[hwc->idx] =
(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
}
+
+ if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+ ds->pebs_interrupt_threshold = threshold;
}

void intel_pmu_pebs_disable(struct perf_event *event)
--
1.8.3.1

2015-04-20 15:20:24

[permalink] [raw]

Subject: [PATCH V7 5/6] perf, x86: drain PEBS buffer during context switch

From: Yan, Zheng <[email protected]>

Flush the PEBS buffer during context switch if PEBS interrupt threshold
is larger than one. This allows perf to supply TID for sample outputs.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 6 +++++-
arch/x86/kernel/cpu/perf_event_intel.c | 11 +++++++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 32 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 ---
4 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 9ef821d..9c692a4 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -90,9 +90,11 @@ struct amd_nb {
/*
* Flags PEBS can handle without an PMI.
*
+ * TID can only be handled by flushing at context switch.
+ *
*/
#define PEBS_FREERUNNING_FLAGS \
- (PERF_SAMPLE_IP | PERF_SAMPLE_ADDR | \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
PERF_SAMPLE_TRANSACTION)
@@ -871,6 +873,8 @@ void intel_pmu_pebs_enable_all(void);

void intel_pmu_pebs_disable_all(void);

+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_ds_init(void);

void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 700c331..3d28dc6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2730,6 +2730,15 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ if (x86_pmu.pebs_active)
+ intel_pmu_pebs_sched_task(ctx, sched_in);
+ if (x86_pmu.lbr_nr)
+ intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2781,7 +2790,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .sched_task = intel_pmu_lbr_sched_task,
+ .sched_task = intel_pmu_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 8721c45..01f68b1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -546,6 +546,19 @@ int intel_pmu_drain_bts_buffer(void)
return 1;
}

+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+ struct pt_regs regs;
+
+ x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (!sched_in)
+ intel_pmu_drain_pebs_buffer();
+}
+
/*
* PEBS
*/
@@ -703,8 +716,19 @@ void intel_pmu_pebs_enable(struct perf_event *event)
if (hwc->flags & PERF_X86_EVENT_FREERUNNING) {
threshold = ds->pebs_absolute_maximum -
x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+
+ if (first_pebs)
+ perf_sched_cb_inc(event->ctx->pmu);
} else {
threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+
+ /*
+ * If not all events can use larger buffer,
+ * roll back to threshold = 1
+ */
+ if (!first_pebs &&
+ (ds->pebs_interrupt_threshold > threshold))
+ perf_sched_cb_dec(event->ctx->pmu);
}

/* Use auto-reload if possible to save a MSR write in the PMI */
@@ -721,6 +745,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;

cpuc->pebs_enabled &= ~(1ULL << hwc->idx);

@@ -729,6 +754,13 @@ void intel_pmu_pebs_disable(struct perf_event *event)
else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled &= ~(1ULL << 63);

+ if (ds->pebs_interrupt_threshold >
+ ds->pebs_buffer_base + x86_pmu.pebs_record_size) {
+ intel_pmu_drain_pebs_buffer();
+ if (!pebs_is_enabled(cpuc))
+ perf_sched_cb_dec(event->ctx->pmu);
+ }
+
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 94e5b50..c8a72cc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -262,9 +262,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct x86_perf_task_context *task_ctx;

- if (!x86_pmu.lbr_nr)
- return;
-
/*
* If LBR callstack feature is enabled and the stack was saved when
* the task was scheduled out, restore the stack. Otherwise flush
--
1.8.3.1

2015-04-20 15:20:22

[permalink] [raw]

Subject: [PATCH V7 6/6] perf, x86: enlarge PEBS buffer

From: Yan, Zheng <[email protected]>

Currently the PEBS buffer size is 4k, it only can hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as BTS buffer), 64k memory can hold about 330 PEBS
records. This will significantly the reduce number of PMI when
large PEBS interrupt threshold is used.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 01f68b1..ce754ee 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
#define BTS_RECORD_SIZE 24

#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE PAGE_SIZE
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_FIXUP_SIZE PAGE_SIZE

/*
--
1.8.3.1

2015-05-05 13:07:51

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

On Mon, Apr 20, 2015 at 04:07:47AM -0400, Kan Liang wrote:
> From: Yan, Zheng <[email protected]>

<snip>

> Here lists some possible ways you may get a lot of collision.

This is the first time the world 'collisions' is used; either define
what you mean by it or avoid using it.

> - when you count the same thing multiple times. But it is not a useful
> configuration.
> - you can be unfortunate if you measure with a userspace only PEBS
> event along with either a kernel or unrestricted PEBS event. Imagine
> the event triggering and setting the overflow flag right before
> entering the kernel. Then all kernel side events will end up with
> multiple bits set.
>
> Here are some numbers about collisions.
> Four frequently occurring events
> (cycles:p,instructions:p,branches:p,mem-stores:p) are tested
>
> Test events which are sampled together collision rate
> cycles:p,instructions:p 0.25%
> cycles:p,instructions:p,branches:p 0.30%
> cycles:p,instructions:p,branches:p,mem-stores:p 0.35%
>
> cycles:p,cycles:p 98.52%

It would be good if you can illustrate this with the new PREF_RECORD and
the perf tool itself.

> Signed-off-by: Yan, Zheng <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---

> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c

> @@ -958,19 +961,97 @@ static void setup_pebs_sample_data(struct perf_event *event,
> data->br_stack = &cpuc->lbr_stack;
> }
>
> +static void perf_log_lost(struct perf_event *event)
> +{
> + struct perf_output_handle handle;
> + struct perf_sample_data sample;
> + int ret;
> +
> + struct {
> + struct perf_event_header header;
> + u64 id;
> + u64 lost;
> + } lost_event = {
> + .header = {
> + .type = PERF_RECORD_LOST,
> + .misc = 0,
> + .size = sizeof(lost_event),
> + },
> + .id = event->id,
> + .lost = 1,
> + };
> +
> + perf_event_header__init_id(&lost_event.header, &sample, event);
> +
> + ret = perf_output_begin(&handle, event,
> + lost_event.header.size);
> + if (ret)
> + return;
> +
> + perf_output_put(&handle, lost_event);
> + perf_event__output_id_sample(event, &handle, &sample);
> + perf_output_end(&handle);
> +}

RECORDs are generic, and should live in the core code.

Also, you should introduce this RECORD in a separate patch.

Ideally, you'd also update the tools side to parse this and modify
perf-record to show the number of dropped events as a percentage, going
warn/error when >1%/>5% or so?

2015-05-05 13:17:00

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

On Mon, Apr 20, 2015 at 04:07:47AM -0400, Kan Liang wrote:
> +static inline void *
> +get_next_pebs_record_by_bit(void *base, void *top, int bit)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + void *at;
> + u64 pebs_status;
> +
> + if (base == NULL)
> + return NULL;
> +
> + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> + struct pebs_record_nhm *p = at;
> +
> + if (test_bit(bit, (unsigned long *)&p->status)) {

Just wondering, is that BT better than: p->state & (1 << bit) ?

> +
> + if (p->status == (1 << bit))
> + return at;
> +
> + /* clear non-PEBS bit and re-check */
> + pebs_status = p->status & cpuc->pebs_enabled;
> + pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
> + if (pebs_status == (1 << bit))
> + return at;
> + }
> + }
> + return NULL;
> +}
> +
> static void __intel_pmu_pebs_event(struct perf_event *event,
> + struct pt_regs *iregs,
> + void *base, void *top,
> + int bit, int count)
> {
> struct perf_sample_data data;
> struct pt_regs regs;
> + int i;
> + void *at = get_next_pebs_record_by_bit(base, top, bit);
>
> + if (!intel_pmu_save_and_restart(event) &&
> + !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
> return;
>
> + if (count > 1) {
> + for (i = 0; i < count - 1; i++) {
> + setup_pebs_sample_data(event, iregs, at, &data, &regs);
> + perf_event_output(event, &data, &regs);
> + at += x86_pmu.pebs_record_size;
> + at = get_next_pebs_record_by_bit(at, top, bit);
> + }
> + }
> +
> + setup_pebs_sample_data(event, iregs, at, &data, &regs);
>
> + /* all records are processed, handle event overflow now */

All but the last. There explicitly is one left to be able to call the
overflow handler is there not?

> + if (perf_event_overflow(event, &data, &regs)) {
> x86_pmu_stop(event, 0);
> + return;
> + }
> +
> }
>
> static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
> @@ -1000,72 +1081,86 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
> if (!event->attr.precise_ip)
> return;
>
> + n = (top - at) / x86_pmu.pebs_record_size;
> if (n <= 0)
> return;
>
> + __intel_pmu_pebs_event(event, iregs, at,
> + top, 0, n);
> }
>
> static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> struct debug_store *ds = cpuc->ds;
> + struct perf_event *event;
> + void *base, *at, *top;
> int bit;
> + int counts[MAX_PEBS_EVENTS] = {};
>
> if (!x86_pmu.pebs_active)
> return;
>
> + base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
> top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
>
> ds->pebs_index = ds->pebs_buffer_base;
>
> + if (unlikely(base >= top))
> return;
>
> + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> struct pebs_record_nhm *p = at;
>
> for_each_set_bit(bit, (unsigned long *)&p->status,
> x86_pmu.max_pebs_events) {
> event = cpuc->events[bit];
> WARN_ON_ONCE(!event);
>
> + if (event->attr.precise_ip)
> + break;
> + }

Would it make sense to delay looking for the event until you've found
there is a single bit set -- and already know which bit that is?

>
> + if (bit >= x86_pmu.max_pebs_events)
> + continue;
> + if (!test_bit(bit, cpuc->active_mask))
> + continue;
> + /*
> + * The PEBS hardware does not deal well with the situation
> + * when events happen near to each other and multiple bits
> + * are set. But it should happen rarely.
> + *
> + * If these events include one PEBS and multiple non-PEBS
> + * events, it doesn't impact PEBS record. The record will
> + * be handled normally. (slow path)
> + *
> + * If these events include two or more PEBS events, the
> + * records for the events can be collapsed into a single
> + * one, and it's not possible to reconstruct all events
> + * that caused the PEBS record. It's called collision.
> + * If collision happened, the record will be dropped.
> + *
> + */
> + if (p->status != (1 << bit)) {
> + u64 pebs_status;
> +
> + /* slow path */
> + pebs_status = p->status & cpuc->pebs_enabled;
> + pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
> + if (pebs_status != (1 << bit)) {
> + perf_log_lost(event);

Does it make sense to keep an error[bit] count and only log once with
the actual number in? -- when !0 obviously.

> continue;
> + }
> }
> + counts[bit]++;
> + }
>
> + for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
> + if (counts[bit] == 0)
> continue;
> + event = cpuc->events[bit];
> + __intel_pmu_pebs_event(event, iregs, base,
> + top, bit, counts[bit]);
> }
> }

2015-05-05 13:17:58

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

On Tue, May 05, 2015 at 03:07:23PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 20, 2015 at 04:07:47AM -0400, Kan Liang wrote:
> > From: Yan, Zheng <[email protected]>
> > +static void perf_log_lost(struct perf_event *event)
> > +{
> > + struct perf_output_handle handle;
> > + struct perf_sample_data sample;
> > + int ret;
> > +
> > + struct {
> > + struct perf_event_header header;
> > + u64 id;
> > + u64 lost;
> > + } lost_event = {
> > + .header = {
> > + .type = PERF_RECORD_LOST,
> > + .misc = 0,
> > + .size = sizeof(lost_event),
> > + },
> > + .id = event->id,
> > + .lost = 1,
> > + };
> > +
> > + perf_event_header__init_id(&lost_event.header, &sample, event);
> > +
> > + ret = perf_output_begin(&handle, event,
> > + lost_event.header.size);
> > + if (ret)
> > + return;
> > +
> > + perf_output_put(&handle, lost_event);
> > + perf_event__output_id_sample(event, &handle, &sample);
> > + perf_output_end(&handle);
> > +}
>
> RECORDs are generic, and should live in the core code.
>
> Also, you should introduce this RECORD in a separate patch.

On that, this is lacking a RECORD definition in
include/uapi/linux/perf_event.h:perf_event_type

2015-05-05 16:30:32

[permalink] [raw]

Subject: RE: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

>
> On Mon, Apr 20, 2015 at 04:07:47AM -0400, Kan Liang wrote:
> > +static inline void *
> > +get_next_pebs_record_by_bit(void *base, void *top, int bit) {
> > + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > + void *at;
> > + u64 pebs_status;
> > +
> > + if (base == NULL)
> > + return NULL;
> > +
> > + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> > + struct pebs_record_nhm *p = at;
> > +
> > + if (test_bit(bit, (unsigned long *)&p->status)) {
>
> Just wondering, is that BT better than: p->state & (1 << bit) ?

Technically, I think they are same here.
test_bit looks more common, and widely used in drain_pebs functions.
So I changed it according to Andi's comment.

>
> > +
> > + if (p->status == (1 << bit))
> > + return at;
> > +
> > + /* clear non-PEBS bit and re-check */
> > + pebs_status = p->status & cpuc->pebs_enabled;
> > + pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
> > + if (pebs_status == (1 << bit))
> > + return at;
> > + }
> > + }
> > + return NULL;
> > +}
> > +
> > static void __intel_pmu_pebs_event(struct perf_event *event,
> > + struct pt_regs *iregs,
> > + void *base, void *top,
> > + int bit, int count)
> > {
> > struct perf_sample_data data;
> > struct pt_regs regs;
> > + int i;
> > + void *at = get_next_pebs_record_by_bit(base, top, bit);
> >
> > + if (!intel_pmu_save_and_restart(event) &&
> > + !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
> > return;
> >
> > + if (count > 1) {
> > + for (i = 0; i < count - 1; i++) {
> > + setup_pebs_sample_data(event, iregs, at, &data,
> &regs);
> > + perf_event_output(event, &data, &regs);
> > + at += x86_pmu.pebs_record_size;
> > + at = get_next_pebs_record_by_bit(at, top, bit);
> > + }
> > + }
> > +
> > + setup_pebs_sample_data(event, iregs, at, &data, &regs);
> >
> > + /* all records are processed, handle event overflow now */
>
> All but the last. There explicitly is one left to be able to call the overflow
> handler is there not?

Right, I will change the comments.

>
> > + if (perf_event_overflow(event, &data, &regs)) {
> > x86_pmu_stop(event, 0);
> > + return;
> > + }
> > +
> > }
> >
> > static void intel_pmu_drain_pebs_core(struct pt_regs *iregs) @@
> > -1000,72 +1081,86 @@ static void intel_pmu_drain_pebs_core(struct
> pt_regs *iregs)
> > if (!event->attr.precise_ip)
> > return;
> >
> > + n = (top - at) / x86_pmu.pebs_record_size;
> > if (n <= 0)
> > return;
> >
> > + __intel_pmu_pebs_event(event, iregs, at,
> > + top, 0, n);
> > }
> >
> > static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs) {
> > struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > struct debug_store *ds = cpuc->ds;
> > + struct perf_event *event;
> > + void *base, *at, *top;
> > int bit;
> > + int counts[MAX_PEBS_EVENTS] = {};
> >
> > if (!x86_pmu.pebs_active)
> > return;
> >
> > + base = (struct pebs_record_nhm *)(unsigned
> > +long)ds->pebs_buffer_base;
> > top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
> >
> > ds->pebs_index = ds->pebs_buffer_base;
> >
> > + if (unlikely(base >= top))
> > return;
> >
> > + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> > struct pebs_record_nhm *p = at;
> >
> > for_each_set_bit(bit, (unsigned long *)&p->status,
> > x86_pmu.max_pebs_events) {
> > event = cpuc->events[bit];
> > WARN_ON_ONCE(!event);
> >
> > + if (event->attr.precise_ip)
> > + break;
> > + }
>
> Would it make sense to delay looking for the event until you've found
> there is a single bit set -- and already know which bit that is?
>

Yes, I think we can test cpuc->pebs_enabled here.
It should be better than attr.precise_ip checking.

- for (; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

for_each_set_bit(bit, (unsigned long *)&p->status,
x86_pmu.max_pebs_events) {
- event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
- WARN_ON_ONCE(!event);

- if (!event->attr.precise_ip)
- continue;
+ if (test_bit(bit, cpuc->pebs_enabled))
+ break;
+ }

... ...

+ for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+ if (counts[bit] == 0)
continue;
-
- __intel_pmu_pebs_event(event, iregs, at);
+ event = cpuc->events[bit];
+ WARN_ON_ONCE(!event);
+ WARN_ON_ONCE(!event->attr.precise_ip);
+ __intel_pmu_pebs_event(event, iregs, base,
+ top, bit, counts[bit]);
}

> >
> > + if (bit >= x86_pmu.max_pebs_events)
> > + continue;
> > + if (!test_bit(bit, cpuc->active_mask))
> > + continue;
> > + /*
> > + * The PEBS hardware does not deal well with the situation
> > + * when events happen near to each other and multiple
> bits
> > + * are set. But it should happen rarely.
> > + *
> > + * If these events include one PEBS and multiple non-PEBS
> > + * events, it doesn't impact PEBS record. The record will
> > + * be handled normally. (slow path)
> > + *
> > + * If these events include two or more PEBS events, the
> > + * records for the events can be collapsed into a single
> > + * one, and it's not possible to reconstruct all events
> > + * that caused the PEBS record. It's called collision.
> > + * If collision happened, the record will be dropped.
> > + *
> > + */
> > + if (p->status != (1 << bit)) {
> > + u64 pebs_status;
> > +
> > + /* slow path */
> > + pebs_status = p->status & cpuc->pebs_enabled;
> > + pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
> > + if (pebs_status != (1 << bit)) {
> > + perf_log_lost(event);
>
> Does it make sense to keep an error[bit] count and only log once with the
> actual number in? -- when !0 obviously.

Yes, will do it.

Thanks,
Kan
>
> > continue;
> > + }
> > }
> > + counts[bit]++;
> > + }
> >
> > + for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
> > + if (counts[bit] == 0)
> > continue;
> > + event = cpuc->events[bit];
> > + __intel_pmu_pebs_event(event, iregs, base,
> > + top, bit, counts[bit]);
> > }
> > }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected] More majordomo
> info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-05-05 16:36:58

[permalink] [raw]

Subject: RE: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

> On Tue, May 05, 2015 at 03:07:23PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 20, 2015 at 04:07:47AM -0400, Kan Liang wrote:
> > > From: Yan, Zheng <[email protected]>
> > > +static void perf_log_lost(struct perf_event *event) {
> > > + struct perf_output_handle handle;
> > > + struct perf_sample_data sample;
> > > + int ret;
> > > +
> > > + struct {
> > > + struct perf_event_header header;
> > > + u64 id;
> > > + u64 lost;
> > > + } lost_event = {
> > > + .header = {
> > > + .type = PERF_RECORD_LOST,
> > > + .misc = 0,
> > > + .size = sizeof(lost_event),
> > > + },
> > > + .id = event->id,
> > > + .lost = 1,
> > > + };
> > > +
> > > + perf_event_header__init_id(&lost_event.header, &sample,
> event);
> > > +
> > > + ret = perf_output_begin(&handle, event,
> > > + lost_event.header.size);
> > > + if (ret)
> > > + return;
> > > +
> > > + perf_output_put(&handle, lost_event);
> > > + perf_event__output_id_sample(event, &handle, &sample);
> > > + perf_output_end(&handle);
> > > +}
> >
> > RECORDs are generic, and should live in the core code.
> >
> > Also, you should introduce this RECORD in a separate patch.
>
> On that, this is lacking a RECORD definition in
> include/uapi/linux/perf_event.h:perf_event_type

The PERF_RECORD_LOST already defined in perf_event_type.
Are you suggesting to add a new dedicated RECORD type, like PERF_RECORD_COLLISION?
Or update the definition about PERF_RECORD_LOST?

Thanks,
Kan

2015-05-05 17:01:09

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

On Tue, May 05, 2015 at 04:36:51PM +0000, Liang, Kan wrote:
> > > RECORDs are generic, and should live in the core code.
> > >
> > > Also, you should introduce this RECORD in a separate patch.
> >
> > On that, this is lacking a RECORD definition in
> > include/uapi/linux/perf_event.h:perf_event_type
>
> The PERF_RECORD_LOST already defined in perf_event_type.
> Are you suggesting to add a new dedicated RECORD type, like PERF_RECORD_COLLISION?

Yes, this should be a new RECORD, LOST_SAMPLES maybe.

The thing is, LOST is about the ring-buffer running out of space, this
is very much not the case here. Reusing it like this creates the
situation where userspace cannot tell what happened, and that is a very
bad thing indeed.

What we want to convey is that we dropped/lost a (number of) sample(s).

So the objection against the RECORD_COLLISIONS name is that it names the
reason we did something, but not the something we did.

2015-05-05 17:08:35

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

On Tue, May 05, 2015 at 04:30:25PM +0000, Liang, Kan wrote:
> > > + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> > > struct pebs_record_nhm *p = at;
> > >
> > > for_each_set_bit(bit, (unsigned long *)&p->status,
> > > x86_pmu.max_pebs_events) {
> > > event = cpuc->events[bit];
> > > WARN_ON_ONCE(!event);
> > >
> > > + if (event->attr.precise_ip)
> > > + break;
> > > + }
> >
> > Would it make sense to delay looking for the event until you've found
> > there is a single bit set -- and already know which bit that is?
> >
>
> Yes, I think we can test cpuc->pebs_enabled here.
> It should be better than attr.precise_ip checking.
>
> - for (; at < top; at += x86_pmu.pebs_record_size) {
> + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> struct pebs_record_nhm *p = at;
>
> for_each_set_bit(bit, (unsigned long *)&p->status,
> x86_pmu.max_pebs_events) {
> - event = cpuc->events[bit];
> - if (!test_bit(bit, cpuc->active_mask))
> - continue;
> -
> - WARN_ON_ONCE(!event);
>
> - if (!event->attr.precise_ip)
> - continue;
> + if (test_bit(bit, cpuc->pebs_enabled))
> + break;
> + }
>

Can't we take that entire for_each_set_bit() loop out?

It appears to me we effectively do that single test_bit() test you left
in there already with the & cpuc->pebs_enabled later on.

>
> + for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
> + if (counts[bit] == 0)
> continue;
> -
> - __intel_pmu_pebs_event(event, iregs, at);
> + event = cpuc->events[bit];
> + WARN_ON_ONCE(!event);
> + WARN_ON_ONCE(!event->attr.precise_ip);
> + __intel_pmu_pebs_event(event, iregs, base,
> + top, bit, counts[bit]);
> }

Right bit of paranoia :-)

2015-05-05 17:22:45

[permalink] [raw]

Subject: RE: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

> On Tue, May 05, 2015 at 04:30:25PM +0000, Liang, Kan wrote:
> > > > + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> > > > struct pebs_record_nhm *p = at;
> > > >
> > > > for_each_set_bit(bit, (unsigned long *)&p->status,
> > > > x86_pmu.max_pebs_events) {
> > > > event = cpuc->events[bit];
> > > > WARN_ON_ONCE(!event);
> > > >
> > > > + if (event->attr.precise_ip)
> > > > + break;
> > > > + }
> > >
> > > Would it make sense to delay looking for the event until you've
> > > found there is a single bit set -- and already know which bit that is?
> > >
> >
> > Yes, I think we can test cpuc->pebs_enabled here.
> > It should be better than attr.precise_ip checking.
> >
> > - for (; at < top; at += x86_pmu.pebs_record_size) {
> > + for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> > struct pebs_record_nhm *p = at;
> >
> > for_each_set_bit(bit, (unsigned long *)&p->status,
> > x86_pmu.max_pebs_events) {
> > - event = cpuc->events[bit];
> > - if (!test_bit(bit, cpuc->active_mask))
> > - continue;
> > -
> > - WARN_ON_ONCE(!event);
> >
> > - if (!event->attr.precise_ip)
> > - continue;
> > + if (test_bit(bit, cpuc->pebs_enabled))
> > + break;
> > + }
> >
>
> Can't we take that entire for_each_set_bit() loop out?
>
> It appears to me we effectively do that single test_bit() test you left in
> there already with the & cpuc->pebs_enabled later on.
>
>
Right, will use find_first_bit to replace.

+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

- for_each_set_bit(bit, (unsigned long *)&p->status,
- x86_pmu.max_pebs_events) {
- event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
- WARN_ON_ONCE(!event);
-
- if (!event->attr.precise_ip)
- continue;
-
- if (__test_and_set_bit(bit, (unsigned long *)&status))
+ bit = find_first_bit((unsigned long *)&p->status,
+ x86_pmu.max_pebs_events);

Thanks,
Kan

2015-05-06 13:01:21

[permalink] [raw]

Subject: Re: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

> > > + if (p->status != (1 << bit)) {
> > > + u64 pebs_status;
> > > +
> > > + /* slow path */
> > > + pebs_status = p->status & cpuc->pebs_enabled;
> > > + pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
> > > + if (pebs_status != (1 << bit)) {
> > > + perf_log_lost(event);
> >
> > Does it make sense to keep an error[bit] count and only log once with the
> > actual number in? -- when !0 obviously.
>
> Yes, will do it.

If you use anything but u8 for the array member it would be too large
for the NMI stack, and u8 is lilkely overflow prone. Would not do it.

-Andi

2015-05-06 13:13:36

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH V7 3/6] perf, x86: handle multiple records in PEBS buffer

On Wed, May 06, 2015 at 03:01:16PM +0200, Andi Kleen wrote:
> > > > + if (p->status != (1 << bit)) {
> > > > + u64 pebs_status;
> > > > +
> > > > + /* slow path */
> > > > + pebs_status = p->status & cpuc->pebs_enabled;
> > > > + pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
> > > > + if (pebs_status != (1 << bit)) {
> > > > + perf_log_lost(event);
> > >
> > > Does it make sense to keep an error[bit] count and only log once with the
> > > actual number in? -- when !0 obviously.
> >
> > Yes, will do it.
>
> If you use anything but u8 for the array member it would be too large
> for the NMI stack, and u8 is lilkely overflow prone. Would not do it.

If we're so close that 4*8=32 bytes would overflow the stack we're in
trouble already.

But we could just switch counts[] over to short and have another short
array for errors[], that way no extra bytes of stack used.