2015-05-07 02:46:05

by Liang, Kan

Subject: [PATCH V8 0/8] large PEBS interrupt threshold

This patch series implements a large PEBS interrupt threshold.
Currently, the PEBS interrupt threshold is forced to one. A larger
threshold can significantly reduce the sampling overhead, especially
for frequently occurring events (like cycles, branches or loads/stores)
sampled with a small period. For example, when recording the cycles
event with a 10003 sampling period while running kernbench, the elapsed
time drops from 32.7 seconds to 16.5 seconds, i.e. 2x faster.
For more details, please refer to patch 4's description.

Limitations:
- It cannot supply a callgraph.
- It requires setting a fixed period.
- It cannot supply a time stamp.
- To supply a TID it requires flushing on context switch.
If an event's configuration cannot meet these requirements, the
threshold is set back to one (a sketch of the resulting check follows).
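
Concretely, the check the series ends up doing on the kernel side looks
roughly like the sketch below (PEBS_FREERUNNING_FLAGS and the
PERF_X86_EVENT_FREERUNNING flag are introduced in patches 4 and 5; this
is a simplification, not the exact hunk):

	/*
	 * Sketch: use the large threshold only for fixed-period precise
	 * events whose sample_type needs nothing beyond what a PEBS
	 * record (plus a context-switch flush for TID) can supply.
	 */
	if (event->attr.precise_ip && !event->attr.freq &&
	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
		event->hw.flags |= PERF_X86_EVENT_FREERUNNING;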

Discard samples:
When PEBS events happen close to each other, the records for the events
can get mixed up. Demultiplexing the records is hard because of a
hardware deficiency. As a result, we have to drop some PEBS records.
A new RECORD type, PERF_RECORD_LOST_SAMPLES, is introduced to record
the number of possible discards, and make sure the user is not left
in the dark about such discards.
For details about sample discards, please refer to patch 3's description.

changes since v1:
- drop patch 'perf, core: Add all PMUs to pmu_idr'
- add comments for case that multiple counters overflow simultaneously
changes since v2:
- rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
- use flag to indicate auto reload mechanism
- move codes that setup PEBS sample data to separate function
- output the PEBS records in batch
- enable this for all (PEBS-capable) hardware
- more description for the multiplex
changes since v3:
- ignore conflicting PEBS record
changes since v4:
- Do more tests for collision and update comments
changes since v5:
- Move autoreload and large PEBS available check to intel_pmu_hw_config
- make AUTO_RELOAD conditional on large PEBS
- !PEBS bug fix
- coherent story about what is collision and how we handle it
- Remove extra state pebs_sched_cb_enabled
changes since v6:
- new flag PERF_X86_EVENT_FREERUNNING to indicate large PEBS available
- patch reorder and changelog changes for patch 1 and 3
- An easy way to clear !PEBS bit
- Log collision to PERF_RECORD_SAMPLES_LOST
changes since v7:
- Introduce PERF_RECORD_LOST_SAMPLES to record the number of discards
- Remove entire for_each_set_bit() loop
- Minor changes on comments and changelog

Yan, Zheng (6):
perf, x86: use the PEBS auto reload mechanism when possible
perf, x86: introduce setup_pebs_sample_data()
perf, x86: handle multiple records in PEBS buffer
perf, x86: large PEBS interrupt threshold
perf, x86: drain PEBS buffer during context switch
perf, x86: enlarge PEBS buffer

Kan Liang (2):
perf, x86: introduce PERF_RECORD_LOST_SAMPLES
perf tools: handle PERF_RECORD_LOST_SAMPLES

arch/x86/kernel/cpu/perf_event.c | 15 +-
arch/x86/kernel/cpu/perf_event.h | 16 ++
arch/x86/kernel/cpu/perf_event_intel.c | 22 ++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 307 +++++++++++++++++++++--------
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 -
include/linux/perf_event.h | 16 ++
include/uapi/linux/perf_event.h | 12 ++
kernel/events/core.c | 41 +++-
kernel/events/internal.h | 9 -
tools/perf/util/event.c | 9 +
tools/perf/util/event.h | 18 ++
tools/perf/util/machine.c | 10 +
tools/perf/util/machine.h | 2 +
tools/perf/util/session.c | 19 ++
tools/perf/util/tool.h | 1 +
15 files changed, 390 insertions(+), 110 deletions(-)

--
1.8.3.1


2015-05-07 02:46:08

by Liang, Kan

Subject: [PATCH V8 1/8] perf, x86: use the PEBS auto reload mechanism when possible

From: Yan, Zheng <[email protected]>

When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
However, the reset value is now loaded by the hardware assist, which
adds a small, somewhat variable delay compared to the previous
non-auto-reload mechanism. The assist cost is 400-800 cycles, assuming
common cases with everything cached. The minimum period the patch
currently uses is 10000; in that extreme case the overhead can be ~10%
when sampling cycles.
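
In short, the change trades a per-PMI counter re-arm for a one-time
programming of the DS reset value. A minimal before/after sketch, taken
from the hunks below:

	/* before: every PMI re-arms the counter by hand */
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

	/* after: program the reset value once at PEBS enable time and let
	 * the hardware assist reload the counter after each record */
	ds->pebs_event_reset[hwc->idx] =
		(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;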

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++------
arch/x86/kernel/cpu/perf_event.h | 1 +
arch/x86/kernel/cpu/perf_event_intel.c | 8 ++++++--
arch/x86/kernel/cpu/perf_event_intel_ds.c | 7 +++++++
4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 87848eb..8cc1153 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1058,13 +1058,16 @@ int x86_perf_event_set_period(struct perf_event *event)

per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;

- /*
- * The hw event starts counting from this event offset,
- * mark it to be able to extra future deltas:
- */
- local64_set(&hwc->prev_count, (u64)-left);
+ if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+ local64_read(&hwc->prev_count) != (u64)-left) {
+ /*
+ * The hw event starts counting from this event offset,
+ * mark it to be able to extra future deltas:
+ */
+ local64_set(&hwc->prev_count, (u64)-left);

- wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ }

/*
* Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 6ac5cb7..1cb5859 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -74,6 +74,7 @@ struct event_constraint {
#define PERF_X86_EVENT_EXCL 0x0040 /* HT exclusivity on counter */
#define PERF_X86_EVENT_DYNAMIC 0x0080 /* dynamic alloc'd constraint */
#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0100 /* grant rdpmc permission */
+#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */


struct amd_nb {
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 960e85d..3119071 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2305,8 +2305,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (ret)
return ret;

- if (event->attr.precise_ip && x86_pmu.pebs_aliases)
- x86_pmu.pebs_aliases(event);
+ if (event->attr.precise_ip) {
+ if (!event->attr.freq)
+ event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+ if (x86_pmu.pebs_aliases)
+ x86_pmu.pebs_aliases(event);
+ }

if (needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 813f75d..f856f73 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -688,6 +688,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

@@ -697,6 +698,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+
+ /* Use auto-reload if possible to save a MSR write in the PMI */
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ ds->pebs_event_reset[hwc->idx] =
+ (u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
+ }
}

void intel_pmu_pebs_disable(struct perf_event *event)
--
1.8.3.1

2015-05-07 02:47:52

by Liang, Kan

Subject: [PATCH V8 2/8] perf, x86: introduce setup_pebs_sample_data()

From: Yan, Zheng <[email protected]>

Move the code that sets up the PEBS sample data into a separate
function, setup_pebs_sample_data().

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 95 +++++++++++++++++--------------
1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index f856f73..f26c8b4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -853,8 +853,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
return txn;
}

-static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
#define PERF_X86_EVENT_PEBS_HSW_PREC \
(PERF_X86_EVENT_PEBS_ST_HSW | \
@@ -866,30 +868,25 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
*/
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct pebs_record_hsw *pebs = __pebs;
- struct perf_sample_data data;
- struct pt_regs regs;
u64 sample_type;
int fll, fst, dsrc;
int fl = event->hw.flags;

- if (!intel_pmu_save_and_restart(event))
- return;
-
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;

fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);

- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(data, 0, event->hw.last_period);

- data.period = event->hw.last_period;
+ data->period = event->hw.last_period;

/*
* Use latency for weight (only avail with PEBS-LL)
*/
if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
- data.weight = pebs->lat;
+ data->weight = pebs->lat;

/*
* data.data_src encodes the data source
@@ -902,7 +899,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
val = precise_datala_hsw(event, pebs->dse);
else if (fst)
val = precise_store_data(pebs->dse);
- data.data_src.val = val;
+ data->data_src.val = val;
}

/*
@@ -915,58 +912,70 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
* A possible PERF_SAMPLE_REGS will have to transfer all regs.
*/
- regs = *iregs;
- regs.flags = pebs->flags;
- set_linear_ip(&regs, pebs->ip);
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
+ *regs = *iregs;
+ regs->flags = pebs->flags;
+ set_linear_ip(regs, pebs->ip);
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;

if (sample_type & PERF_SAMPLE_REGS_INTR) {
- regs.ax = pebs->ax;
- regs.bx = pebs->bx;
- regs.cx = pebs->cx;
- regs.dx = pebs->dx;
- regs.si = pebs->si;
- regs.di = pebs->di;
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
-
- regs.flags = pebs->flags;
+ regs->ax = pebs->ax;
+ regs->bx = pebs->bx;
+ regs->cx = pebs->cx;
+ regs->dx = pebs->dx;
+ regs->si = pebs->si;
+ regs->di = pebs->di;
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;
+
+ regs->flags = pebs->flags;
#ifndef CONFIG_X86_32
- regs.r8 = pebs->r8;
- regs.r9 = pebs->r9;
- regs.r10 = pebs->r10;
- regs.r11 = pebs->r11;
- regs.r12 = pebs->r12;
- regs.r13 = pebs->r13;
- regs.r14 = pebs->r14;
- regs.r15 = pebs->r15;
+ regs->r8 = pebs->r8;
+ regs->r9 = pebs->r9;
+ regs->r10 = pebs->r10;
+ regs->r11 = pebs->r11;
+ regs->r12 = pebs->r12;
+ regs->r13 = pebs->r13;
+ regs->r14 = pebs->r14;
+ regs->r15 = pebs->r15;
#endif
}

if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs->real_ip;
- regs.flags |= PERF_EFLAGS_EXACT;
- } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
- regs.flags |= PERF_EFLAGS_EXACT;
+ regs->ip = pebs->real_ip;
+ regs->flags |= PERF_EFLAGS_EXACT;
+ } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+ regs->flags |= PERF_EFLAGS_EXACT;
else
- regs.flags &= ~PERF_EFLAGS_EXACT;
+ regs->flags &= ~PERF_EFLAGS_EXACT;

if ((sample_type & PERF_SAMPLE_ADDR) &&
x86_pmu.intel_cap.pebs_format >= 1)
- data.addr = pebs->dla;
+ data->addr = pebs->dla;

if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
- data.weight = intel_hsw_weight(pebs);
+ data->weight = intel_hsw_weight(pebs);

if (sample_type & PERF_SAMPLE_TRANSACTION)
- data.txn = intel_hsw_transaction(pebs);
+ data->txn = intel_hsw_transaction(pebs);
}

if (has_branch_stack(event))
- data.br_stack = &cpuc->lbr_stack;
+ data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs)
+{
+ struct perf_sample_data data;
+ struct pt_regs regs;
+
+ if (!intel_pmu_save_and_restart(event))
+ return;
+
+ setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);

if (perf_event_overflow(event, &data, &regs))
x86_pmu_stop(event, 0);
--
1.8.3.1

2015-05-07 02:47:49

by Liang, Kan

Subject: [PATCH V8 3/8] perf, x86: handle multiple records in PEBS buffer

From: Yan, Zheng <[email protected]>

When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.

Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create scenarios that are
impossible to demux.

The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:

A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist
< some unspecified amount of time later -- this could cover multiple
events of interest >

B) the hardware assist is armed, any next event will trigger it

C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS

Now consider the following chain of events:

A0, B0, A1, C0

The event generated for counter 0 will include a status with counter 1
set, even though it's not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

The second issue is that the hardware will only emit one record for two
or more counters if the events that trigger the assists are 'close'.
Here 'close' can mean several cycles; in some cases it can even span the
complete assist, if the event is something that doesn't need retirement.

For instance, consider this chain of events:
A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists, we will
generate but a single record, again with both counters listed in the
status field.

This time the record pertains to both events.

Note that these two cases are different but indistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.

Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.

The assumption/hope is that such discards will be rare.

Here are some possible ways you may end up with a high discard rate:
- when you count the same thing multiple times; but that is not a
useful configuration anyway.
- you can be unfortunate if you measure with a userspace-only PEBS
event along with either a kernel or an unrestricted PEBS event. Imagine
the event triggering and setting the overflow flag right before
entering the kernel. Then all kernel-side events will end up with
multiple bits set.
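
The demux rule described above, distilled into a sketch (the helper name
is only for illustration; the real check lives in
get_next_pebs_record_by_bit() below):

	static bool pebs_record_only_for(struct pebs_record_nhm *p,
					 struct cpu_hw_events *cpuc, int bit)
	{
		/* ignore status bits of non-PEBS counters ... */
		u64 pebs_status = p->status & cpuc->pebs_enabled;

		pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;

		/*
		 * ... and only accept the record if exactly one PEBS bit,
		 * the one we are draining for, remains set; anything else
		 * is a collision and the record is dropped.
		 */
		return pebs_status == (1ULL << bit);
	}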

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 144 +++++++++++++++++++++---------
include/linux/perf_event.h | 13 +++
kernel/events/core.c | 6 +-
kernel/events/internal.h | 9 --
4 files changed, 118 insertions(+), 54 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index f26c8b4..28354ed 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -872,6 +872,9 @@ static void setup_pebs_sample_data(struct perf_event *event,
int fll, fst, dsrc;
int fl = event->hw.flags;

+ if (pebs == NULL)
+ return;
+
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;

@@ -966,19 +969,68 @@ static void setup_pebs_sample_data(struct perf_event *event,
data->br_stack = &cpuc->lbr_stack;
}

+static inline void *
+get_next_pebs_record_by_bit(void *base, void *top, int bit)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ void *at;
+ u64 pebs_status;
+
+ if (base == NULL)
+ return NULL;
+
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+
+ if (test_bit(bit, (unsigned long *)&p->status)) {
+
+ if (p->status == (1 << bit))
+ return at;
+
+ /* clear non-PEBS bit and re-check */
+ pebs_status = p->status & cpuc->pebs_enabled;
+ pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
+ if (pebs_status == (1 << bit))
+ return at;
+ }
+ }
+ return NULL;
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+ struct pt_regs *iregs,
+ void *base, void *top,
+ int bit, int count)
{
struct perf_sample_data data;
struct pt_regs regs;
+ int i;
+ void *at = get_next_pebs_record_by_bit(base, top, bit);

- if (!intel_pmu_save_and_restart(event))
+ if (!intel_pmu_save_and_restart(event) &&
+ !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
return;

- setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
+ if (count > 1) {
+ for (i = 0; i < count - 1; i++) {
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
+ perf_event_output(event, &data, &regs);
+ at += x86_pmu.pebs_record_size;
+ at = get_next_pebs_record_by_bit(at, top, bit);
+ }
+ }
+
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);

- if (perf_event_overflow(event, &data, &regs))
+ /*
+ * All but the last record have been processed above.
+ * The last one is left so the overflow handler can be called for it.
+ */
+ if (perf_event_overflow(event, &data, &regs)) {
x86_pmu_stop(event, 0);
+ return;
+ }
+
}

static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
@@ -1008,72 +1060,80 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
if (!event->attr.precise_ip)
return;

- n = top - at;
+ n = (top - at) / x86_pmu.pebs_record_size;
if (n <= 0)
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
- at += n - 1;
-
- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, at,
+ top, 0, n);
}

static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct debug_store *ds = cpuc->ds;
- struct perf_event *event = NULL;
- void *at, *top;
- u64 status = 0;
+ struct perf_event *event;
+ void *base, *at, *top;
int bit;
+ short counts[MAX_PEBS_EVENTS] = {};

if (!x86_pmu.pebs_active)
return;

- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;

ds->pebs_index = ds->pebs_buffer_base;

- if (unlikely(at > top))
+ if (unlikely(base >= top))
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
- "Unexpected number of pebs records %ld\n",
- (long)(top - at) / x86_pmu.pebs_record_size);
-
- for (; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

- for_each_set_bit(bit, (unsigned long *)&p->status,
- x86_pmu.max_pebs_events) {
- event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
- WARN_ON_ONCE(!event);
-
- if (!event->attr.precise_ip)
- continue;
+ bit = find_first_bit((unsigned long *)&p->status,
+ x86_pmu.max_pebs_events);
+ if (bit >= x86_pmu.max_pebs_events)
+ continue;
+ if (!test_bit(bit, cpuc->active_mask))
+ continue;
+ /*
+ * The PEBS hardware does not deal well with the situation
+ * when events happen near to each other and multiple bits
+ * are set. But it should happen rarely.
+ *
+ * If these events include one PEBS and multiple non-PEBS
+ * events, it doesn't impact PEBS record. The record will
+ * be handled normally. (slow path)
+ *
+ * If these events include two or more PEBS events, the
+ * records for the events can be collapsed into a single
+ * one, and it's not possible to reconstruct all events
+ * that caused the PEBS record. It's called collision.
+ * If collision happened, the record will be dropped.
+ *
+ */
+ if (p->status != (1 << bit)) {
+ u64 pebs_status;

- if (__test_and_set_bit(bit, (unsigned long *)&status))
+ /* slow path */
+ pebs_status = p->status & cpuc->pebs_enabled;
+ pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
+ if (pebs_status != (1 << bit))
continue;
-
- break;
}
+ counts[bit]++;
+ }

- if (!event || bit >= x86_pmu.max_pebs_events)
+ for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+ if (counts[bit] == 0)
continue;
+ event = cpuc->events[bit];
+ WARN_ON_ONCE(!event);
+ WARN_ON_ONCE(!event->attr.precise_ip);

- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, base,
+ top, bit, counts[bit]);
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61992cf..bed1b6f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -734,6 +734,19 @@ extern int perf_event_overflow(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs);

+extern void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
+
+extern void
+perf_event_header__init_id(struct perf_event_header *header,
+ struct perf_sample_data *data,
+ struct perf_event *event);
+extern void
+perf_event__output_id_sample(struct perf_event *event,
+ struct perf_output_handle *handle,
+ struct perf_sample_data *sample);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f528829..4d221a4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5333,9 +5333,9 @@ void perf_prepare_sample(struct perf_event_header *header,
}
}

-static void perf_event_output(struct perf_event *event,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
struct perf_output_handle handle;
struct perf_event_header header;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 9f6ce9b..2deb24c 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -72,15 +72,6 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
void perf_event_aux_event(struct perf_event *event, unsigned long head,
unsigned long size, u64 flags);

-extern void
-perf_event_header__init_id(struct perf_event_header *header,
- struct perf_sample_data *data,
- struct perf_event *event);
-extern void
-perf_event__output_id_sample(struct perf_event *event,
- struct perf_output_handle *handle,
- struct perf_sample_data *sample);
-
extern struct page *
perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff);

--
1.8.3.1

2015-05-07 02:46:14

by Liang, Kan

Subject: [PATCH V8 4/8] perf, x86: large PEBS interrupt threshold

From: Yan, Zheng <[email protected]>

PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.

For frequently occurring events (like cycles or branches or loads/stores)
this in turn requires using a relatively high sampling period to avoid
overloading the system with PMI processing, which in turn increases
sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it cannot
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases, basically as long as
you can do without a callgraph and can set the period you can use this
new PEBS mode.

The main benefit is the ability to support much lower sampling period
(down to -c 1000) without extensive overhead.

One use case is, for example, to increase the resolution of the c2c tool.
Another is double checking when you suspect the standard sampling has
too much sampling error.

Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).
The test command for plain:
"perf record --time -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
"perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
(The only difference between the multi and plain test commands is the
time stamp option. Since the time stamp is not supported with a large
PEBS threshold, it can be used as a flag to indicate whether the large
threshold was enabled during the test.)

period plain(Sec) multi(Sec) Delta(Sec)
10003 32.7 16.5 16.2
20003 30.2 16.2 14.0
40003 18.6 14.1 4.5
80003 16.8 14.6 2.2
100003 16.9 14.1 2.8
800003 15.4 15.7 -0.3
1000003 15.3 15.2 0.2
2000003 15.3 15.1 0.1

With periods below 100003, plain (threshold one) causes much more
overhead. With a 10003 sampling period, the elapsed time for multi is
even 2x faster than for plain.
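
The threshold programming itself is small; a sketch of the hunk in
intel_pmu_pebs_enable() below (the headroom interpretation in the
comment is an explanation, not part of the patch):

	if (hwc->flags & PERF_X86_EVENT_FREERUNNING)
		/*
		 * Interrupt only when the buffer is nearly full, keeping
		 * max_pebs_events records of headroom below the absolute
		 * maximum, presumably so that records generated before
		 * the PMI drains the buffer still fit.
		 */
		threshold = ds->pebs_absolute_maximum -
			    x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
	else
		/* classic behaviour: interrupt after every single record */
		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;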

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 11 +++++++++++
arch/x86/kernel/cpu/perf_event_intel.c | 5 ++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 27 +++++++++++++++++++++++----
3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 1cb5859..626ded3 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -75,6 +75,7 @@ struct event_constraint {
#define PERF_X86_EVENT_DYNAMIC 0x0080 /* dynamic alloc'd constraint */
#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0100 /* grant rdpmc permission */
#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_FREERUNNING 0x0400 /* use freerunning PEBS */


struct amd_nb {
@@ -88,6 +89,16 @@ struct amd_nb {
#define MAX_PEBS_EVENTS 8

/*
+ * Flags PEBS can handle without a PMI.
+ *
+ */
+#define PEBS_FREERUNNING_FLAGS \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_ADDR | \
+ PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+ PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+ PERF_SAMPLE_TRANSACTION)
+
+/*
* A debug store configuration.
*
* We only support architectures that use 64bit fields.
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 3119071..fdf818a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2306,8 +2306,11 @@ static int intel_pmu_hw_config(struct perf_event *event)
return ret;

if (event->attr.precise_ip) {
- if (!event->attr.freq)
+ if (!event->attr.freq) {
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+ if (!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
+ event->hw.flags |= PERF_X86_EVENT_FREERUNNING;
+ }
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 28354ed..7389a12 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
int node = cpu_to_node(cpu);
- int max, thresh = 1; /* always use a single PEBS record */
+ int max;
void *buffer, *ibuffer;

if (!x86_pmu.pebs)
@@ -280,9 +280,6 @@ static int alloc_pebs_buffer(int cpu)
ds->pebs_absolute_maximum = ds->pebs_buffer_base +
max * x86_pmu.pebs_record_size;

- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- thresh * x86_pmu.pebs_record_size;
-
return 0;
}

@@ -684,14 +681,22 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}

+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+ return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
struct debug_store *ds = cpuc->ds;
+ bool first_pebs;
+ u64 threshold;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

+ first_pebs = !pebs_is_enabled(cpuc);
cpuc->pebs_enabled |= 1ULL << hwc->idx;

if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -699,11 +704,25 @@ void intel_pmu_pebs_enable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;

+ /*
+ * When the event is constrained enough we can use a larger
+ * threshold and run the event with less frequent PMI.
+ */
+ if (hwc->flags & PERF_X86_EVENT_FREERUNNING) {
+ threshold = ds->pebs_absolute_maximum -
+ x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }
+
/* Use auto-reload if possible to save a MSR write in the PMI */
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
ds->pebs_event_reset[hwc->idx] =
(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
}
+
+ if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+ ds->pebs_interrupt_threshold = threshold;
}

void intel_pmu_pebs_disable(struct perf_event *event)
--
1.8.3.1

2015-05-07 02:46:10

by Liang, Kan

Subject: [PATCH V8 5/8] perf, x86: drain PEBS buffer during context switch

From: Yan, Zheng <[email protected]>

Flush the PEBS buffer during a context switch if the PEBS interrupt
threshold is larger than one. This allows perf to supply a TID for
sample output.
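
The hook itself is tiny; a sketch with the reasoning spelled out in the
comment (the real code is in intel_pmu_pebs_sched_task() below):

	void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
	{
		/*
		 * Drain on sched-out: the buffered records are written out
		 * while the task that generated them is still current, so
		 * the samples get the right TID.
		 */
		if (!sched_in)
			intel_pmu_drain_pebs_buffer();
	}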

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 6 +++++-
arch/x86/kernel/cpu/perf_event_intel.c | 11 +++++++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 32 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 ---
4 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 626ded3..8746c61 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -91,9 +91,11 @@ struct amd_nb {
/*
* Flags PEBS can handle without a PMI.
*
+ * TID can only be handled by flushing at context switch.
+ *
*/
#define PEBS_FREERUNNING_FLAGS \
- (PERF_SAMPLE_IP | PERF_SAMPLE_ADDR | \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
PERF_SAMPLE_TRANSACTION)
@@ -872,6 +874,8 @@ void intel_pmu_pebs_enable_all(void);

void intel_pmu_pebs_disable_all(void);

+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_ds_init(void);

void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index fdf818a..c4c5e1f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2702,6 +2702,15 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ if (x86_pmu.pebs_active)
+ intel_pmu_pebs_sched_task(ctx, sched_in);
+ if (x86_pmu.lbr_nr)
+ intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2791,7 +2800,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .sched_task = intel_pmu_lbr_sched_task,
+ .sched_task = intel_pmu_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 7389a12..9ec5a16 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -546,6 +546,19 @@ int intel_pmu_drain_bts_buffer(void)
return 1;
}

+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+ struct pt_regs regs;
+
+ x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (!sched_in)
+ intel_pmu_drain_pebs_buffer();
+}
+
/*
* PEBS
*/
@@ -711,8 +724,19 @@ void intel_pmu_pebs_enable(struct perf_event *event)
if (hwc->flags & PERF_X86_EVENT_FREERUNNING) {
threshold = ds->pebs_absolute_maximum -
x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+
+ if (first_pebs)
+ perf_sched_cb_inc(event->ctx->pmu);
} else {
threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+
+ /*
+ * If not all events can use larger buffer,
+ * roll back to threshold = 1
+ */
+ if (!first_pebs &&
+ (ds->pebs_interrupt_threshold > threshold))
+ perf_sched_cb_dec(event->ctx->pmu);
}

/* Use auto-reload if possible to save a MSR write in the PMI */
@@ -729,6 +753,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;

cpuc->pebs_enabled &= ~(1ULL << hwc->idx);

@@ -737,6 +762,13 @@ void intel_pmu_pebs_disable(struct perf_event *event)
else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled &= ~(1ULL << 63);

+ if (ds->pebs_interrupt_threshold >
+ ds->pebs_buffer_base + x86_pmu.pebs_record_size) {
+ intel_pmu_drain_pebs_buffer();
+ if (!pebs_is_enabled(cpuc))
+ perf_sched_cb_dec(event->ctx->pmu);
+ }
+
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 94e5b50..c8a72cc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -262,9 +262,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct x86_perf_task_context *task_ctx;

- if (!x86_pmu.lbr_nr)
- return;
-
/*
* If LBR callstack feature is enabled and the stack was saved when
* the task was scheduled out, restore the stack. Otherwise flush
--
1.8.3.1

2015-05-07 02:46:12

by Liang, Kan

Subject: [PATCH V8 6/8] perf, x86: enlarge PEBS buffer

From: Yan, Zheng <[email protected]>

Currently the PEBS buffer size is 4k, which can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k (the
same as the BTS buffer); 64k of memory can hold about 330 PEBS
records. This will significantly reduce the number of PMIs when a
large PEBS interrupt threshold is used.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 9ec5a16..fc03e7a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
#define BTS_RECORD_SIZE 24

#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE PAGE_SIZE
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_FIXUP_SIZE PAGE_SIZE

/*
--
1.8.3.1

2015-05-07 02:47:25

by Liang, Kan

Subject: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

From: Kan Liang <[email protected]>

After enlarging the PEBS interrupt threshold, there may be some mixed-up
PEBS samples which have to be discarded by the kernel. This patch makes
the kernel emit a PERF_RECORD_LOST_SAMPLES record with the number of
possible discards whenever it is impossible to demux the samples. It
makes sure the user is not left in the dark about such discards.
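
For reference, this is what the new record looks like to a consumer (a
sketch matching the layout documented in the uapi hunk below; note that
the @id field duplicates information that is also available through
sample_id, which is discussed later in this thread):

	struct lost_samples_event {
		struct perf_event_header header;
		u64 id;		/* event id, also carried in sample_id */
		u64 lost;	/* number of samples the kernel dropped */
	};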

Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 16 ++++++++++----
include/linux/perf_event.h | 3 +++
include/uapi/linux/perf_event.h | 12 +++++++++++
kernel/events/core.c | 35 +++++++++++++++++++++++++++++++
4 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index fc03e7a..cdd331f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -1127,6 +1127,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
void *base, *at, *top;
int bit;
short counts[MAX_PEBS_EVENTS] = {};
+ short error[MAX_PEBS_EVENTS] = {};

if (!x86_pmu.pebs_active)
return;
@@ -1170,21 +1171,28 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
/* slow path */
pebs_status = p->status & cpuc->pebs_enabled;
pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
- if (pebs_status != (1 << bit))
+ if (pebs_status != (1 << bit)) {
+ error[bit]++;
continue;
+ }
}
counts[bit]++;
}

for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
- if (counts[bit] == 0)
+ if ((counts[bit] == 0) && (error[bit] == 0))
continue;
event = cpuc->events[bit];
WARN_ON_ONCE(!event);
WARN_ON_ONCE(!event->attr.precise_ip);

- __intel_pmu_pebs_event(event, iregs, base,
- top, bit, counts[bit]);
+ /* log dropped samples number */
+ if (error[bit])
+ perf_log_lost_samples(event, error[bit]);
+
+ if (counts[bit])
+ __intel_pmu_pebs_event(event, iregs, base,
+ top, bit, counts[bit]);
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bed1b6f..d47d792 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -747,6 +747,9 @@ perf_event__output_id_sample(struct perf_event *event,
struct perf_output_handle *handle,
struct perf_sample_data *sample);

+extern void
+perf_log_lost_samples(struct perf_event *event, u64 lost);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 309211b..2c1ae5c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -800,6 +800,18 @@ enum perf_event_type {
*/
PERF_RECORD_ITRACE_START = 12,

+ /*
+ * Records the dropped/lost sample number.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 id;
+ * u64 lost;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_LOST_SAMPLES = 13,
+
PERF_RECORD_MAX, /* non-ABI */
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4d221a4..1337fe3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5927,6 +5927,41 @@ void perf_event_aux_event(struct perf_event *event, unsigned long head,
}

/*
+ * Lost/dropped samples logging
+ */
+void perf_log_lost_samples(struct perf_event *event, u64 lost)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ struct {
+ struct perf_event_header header;
+ u64 id;
+ u64 lost;
+ } lost_samples_event = {
+ .header = {
+ .type = PERF_RECORD_LOST_SAMPLES,
+ .misc = 0,
+ .size = sizeof(lost_samples_event),
+ },
+ .id = event->id,
+ .lost = lost,
+ };
+
+ perf_event_header__init_id(&lost_samples_event.header, &sample, event);
+
+ ret = perf_output_begin(&handle, event,
+ lost_samples_event.header.size);
+ if (ret)
+ return;
+
+ perf_output_put(&handle, lost_samples_event);
+ perf_event__output_id_sample(event, &handle, &sample);
+ perf_output_end(&handle);
+}
+
+/*
* IRQ throttle logging
*/

--
1.8.3.1

2015-05-07 02:47:05

by Liang, Kan

Subject: [PATCH V8 8/8] perf tools: handle PERF_RECORD_LOST_SAMPLES

From: Kan Liang <[email protected]>

This patch modifies the perf tool to handle the new RECORD type,
PERF_RECORD_LOST_SAMPLES.
The number of lost-sample events is stored in
.nr_events[PERF_EVENT_LOST_SAMPLES], while the exact number of samples
which the kernel dropped is stored in total_lost_samples.
When the percentage of dropped samples is greater than 5%, a warning
is printed.

Here are some examples:

Example 1: recording different frequently-occurring events is safe with
the patch. Only a very low drop rate is associated with such a
configuration.

$ perf record -e '{cycles:p,instructions:p}' -c 20003 --no-time ~/tchain
~/tchain
[perf record: Woken up 148 times to write data]
[perf record: Captured and wrote 36.922 MB perf.data (1206322 samples)]

$ perf report -D | tail
SAMPLE events: 1206322
MMAP2 events: 5
LOST_SAMPLE events: 2559
FINISHED_ROUND events: 148
cycles:p stats:
TOTAL events: 606278
SAMPLE events: 606278
instructions:p stats:
TOTAL events: 600044
SAMPLE events: 600044

Example 2: recording the same thing multiple times can lead to a high
drop rate, but it is not a useful configuration.

$ perf record -e '{cycles:p,cycles:p}' -c 20003 --no-time ~/tchain
[perf record: Woken up 1 times to write data]
Warning:
Processed 600592 samples and lost 99.73% samples!
[perf record: Captured and wrote 0.121 MB perf.data (1629 samples)]

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/event.c | 9 +++++++++
tools/perf/util/event.h | 18 ++++++++++++++++++
tools/perf/util/machine.c | 10 ++++++++++
tools/perf/util/machine.h | 2 ++
tools/perf/util/session.c | 19 +++++++++++++++++++
tools/perf/util/tool.h | 1 +
6 files changed, 59 insertions(+)

diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index db52609..2daadc8 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -25,6 +25,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_SAMPLE] = "SAMPLE",
[PERF_RECORD_AUX] = "AUX",
[PERF_RECORD_ITRACE_START] = "ITRACE_START",
+ [PERF_RECORD_LOST_SAMPLES] = "LOST_SAMPLES",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
@@ -713,6 +714,14 @@ int perf_event__process_itrace_start(struct perf_tool *tool __maybe_unused,
return machine__process_itrace_start_event(machine, event);
}

+int perf_event__process_lost_samples(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ return machine__process_lost_samples_event(machine, event, sample);
+}
+
size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
{
return fprintf(fp, " %d/%d: [%#" PRIx64 "(%#" PRIx64 ") @ %#" PRIx64 "]: %c %s\n",
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 7eecd5e..abfc579 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -52,6 +52,12 @@ struct lost_event {
u64 lost;
};

+struct lost_samples_event {
+ struct perf_event_header header;
+ u64 id;
+ u64 lost;
+};
+
/*
* PERF_FORMAT_ENABLED | PERF_FORMAT_RUNNING | PERF_FORMAT_ID
*/
@@ -235,6 +241,12 @@ enum auxtrace_error_type {
* total_lost tells exactly how many events the kernel in fact lost, i.e. it is
* the sum of all struct lost_event.lost fields reported.
*
+ * The kernel discards mixed up samples and sends the number in a
+ * PERF_RECORD_LOST_SAMPLES event. The number of lost-samples events is stored
+ * in .nr_events[PERF_EVENT_LOST_SAMPLES] while total_lost_samples tells
+ * exactly how many samples the kernel in fact dropped, i.e. it is the sum of
+ * all struct lost_samples_event.lost fields reported.
+ *
* The total_period is needed because by default auto-freq is used, so
* multipling nr_events[PERF_EVENT_SAMPLE] by a frequency isn't possible to get
* the total number of low level events, it is necessary to to sum all struct
@@ -244,6 +256,7 @@ struct events_stats {
u64 total_period;
u64 total_non_filtered_period;
u64 total_lost;
+ u64 total_lost_samples;
u64 total_invalid_chains;
u32 nr_events[PERF_RECORD_HEADER_MAX];
u32 nr_non_filtered_samples;
@@ -342,6 +355,7 @@ union perf_event {
struct comm_event comm;
struct fork_event fork;
struct lost_event lost;
+ struct lost_samples_event lost_samples;
struct read_event read;
struct throttle_event throttle;
struct sample_event sample;
@@ -390,6 +404,10 @@ int perf_event__process_lost(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine);
+int perf_event__process_lost_samples(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine);
int perf_event__process_aux(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 2f47110..4623dd7 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -458,6 +458,14 @@ int machine__process_lost_event(struct machine *machine __maybe_unused,
return 0;
}

+int machine__process_lost_samples_event(struct machine *machine __maybe_unused,
+ union perf_event *event, struct perf_sample *sample __maybe_unused)
+{
+ dump_printf(": id:%" PRIu64 ": lost samples :%" PRIu64 "\n",
+ event->lost_samples.id, event->lost_samples.lost);
+ return 0;
+}
+
static struct dso*
machine__module_dso(struct machine *machine, struct kmod_path *m,
const char *filename)
@@ -1351,6 +1359,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
ret = machine__process_aux_event(machine, event); break;
case PERF_RECORD_ITRACE_START:
ret = machine__process_itrace_start_event(machine, event);
break;
+ case PERF_RECORD_LOST_SAMPLES:
+ ret = machine__process_lost_samples_event(machine, event, sample); break;
default:
ret = -1;
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 1d99296..7ba5e8f 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -81,6 +81,8 @@ int machine__process_fork_event(struct machine *machine, union perf_event *event
struct perf_sample *sample);
int machine__process_lost_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
+int machine__process_lost_samples_event(struct machine *machine, union perf_event *event,
+ struct perf_sample *sample);
int machine__process_aux_event(struct machine *machine,
union perf_event *event);
int machine__process_itrace_start_event(struct machine *machine,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index e722107..4a5a609 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -325,6 +325,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->exit = process_event_stub;
if (tool->lost == NULL)
tool->lost = perf_event__process_lost;
+ if (tool->lost_samples == NULL)
+ tool->lost_samples = perf_event__process_lost_samples;
if (tool->aux == NULL)
tool->aux = perf_event__process_aux;
if (tool->itrace_start == NULL)
@@ -606,6 +608,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_SAMPLE] = perf_event__all64_swap,
[PERF_RECORD_AUX] = perf_event__aux_swap,
[PERF_RECORD_ITRACE_START] = perf_event__itrace_start_swap,
+ [PERF_RECORD_LOST_SAMPLES] = perf_event__all64_swap,
[PERF_RECORD_HEADER_ATTR] = perf_event__hdr_attr_swap,
[PERF_RECORD_HEADER_EVENT_TYPE] = perf_event__event_type_swap,
[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1049,6 +1052,10 @@ static int machines__deliver_event(struct machines *machines,
if (tool->lost == perf_event__process_lost)
evlist->stats.total_lost += event->lost.lost;
return tool->lost(tool, event, sample, machine);
+ case PERF_RECORD_LOST_SAMPLES:
+ if (tool->lost_samples == perf_event__process_lost_samples)
+ evlist->stats.total_lost_samples += event->lost_samples.lost;
+ return tool->lost_samples(tool, event, sample, machine);
case PERF_RECORD_READ:
return tool->read(tool, event, sample, evsel, machine);
case PERF_RECORD_THROTTLE:
@@ -1286,6 +1293,18 @@ static void perf_session__warn_about_errors(const struct perf_session *session)
stats->nr_events[PERF_RECORD_LOST]);
}

+ if (session->tool->lost_samples == perf_event__process_lost_samples) {
+ double drop_rate;
+
+ drop_rate = (double)stats->total_lost_samples /
+ (double) (stats->nr_events[PERF_RECORD_SAMPLE] + stats->total_lost_samples);
+ if (drop_rate > 0.05) {
+ ui__warning("Processed %lu samples and lost %3.2f%% samples!\n\n",
+ stats->nr_events[PERF_RECORD_SAMPLE] + stats->total_lost_samples,
+ drop_rate * 100.0);
+ }
+ }
+
if (stats->nr_unknown_events != 0) {
ui__warning("Found %u unknown events!\n\n"
"Is this an older tool processing a perf.data "
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 7f282ad..c307dd4 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -43,6 +43,7 @@ struct perf_tool {
fork,
exit,
lost,
+ lost_samples,
aux,
itrace_start,
throttle,
--
1.8.3.1

2015-05-07 10:33:46

by Andi Kleen

Subject: Re: [PATCH V8 8/8] perf tools: handle PERF_RECORD_LOST_SAMPLES

On Wed, May 06, 2015 at 03:33:54PM -0400, Kan Liang wrote:
> From: Kan Liang <[email protected]>
>
> This patch modified the perf tool to handle the new RECORD type,
> PERF_RECORD_LOST_SAMPLES.
> The number of lost-sample events is stored in
> .nr_events[PERF_EVENT_LOST_SAMPLES]. While the exact number of samples
> which the kernel dropped is stored in total_lost_samples.
> When the percentage of dropped samples is greater than 5%, a warning
> will be sent out.

That's nice. Would be good to also print this in perf report --stdio
output (without -D), like the samples are.

Could you please do the same for the normal PERF_RECORD_LOST too?
There we don't have an exact percentage, but at least we could count the
lost events too. This can be a separate patch of course, doesn't need
to be tied to this patchkit.


-Andi

2015-05-07 11:35:40

by Peter Zijlstra

Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES


So I changed it slightly to the below; changes are:

- record 'lost' events to all set bits; after all we really do not
know which event this sample belonged to, only logging to the first
set bit seems 'wrong'.

- dropped the @id field from the record, it is already included in the
@sample_id values.


---
Subject: perf, x86: introduce PERF_RECORD_LOST_SAMPLES
From: Kan Liang <[email protected]>
Date: Wed, 6 May 2015 15:33:53 -0400

After enlarging the PEBS interrupt threshold, there may be some mixed-up
PEBS samples which have to be discarded by the kernel. This patch makes
the kernel emit a PERF_RECORD_LOST_SAMPLES record with the number of
possible discards whenever it is impossible to demux the samples. It
makes sure the user is not left in the dark about such discards.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 22 ++++++++++++++++----
include/linux/perf_event.h | 3 ++
include/uapi/linux/perf_event.h | 12 ++++++++++
kernel/events/core.c | 33 ++++++++++++++++++++++++++++++
4 files changed, 66 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -1124,8 +1124,9 @@ static void intel_pmu_drain_pebs_nhm(str
struct debug_store *ds = cpuc->ds;
struct perf_event *event;
void *base, *at, *top;
- int bit;
short counts[MAX_PEBS_EVENTS] = {};
+ short error[MAX_PEBS_EVENTS] = {};
+ int bit, i;

if (!x86_pmu.pebs_active)
return;
@@ -1169,20 +1170,33 @@ static void intel_pmu_drain_pebs_nhm(str
/* slow path */
pebs_status = p->status & cpuc->pebs_enabled;
pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
- if (pebs_status != (1 << bit))
+ if (pebs_status != (1 << bit)) {
+ for (i = 0; i < MAX_PEBS_EVENTS; i++) {
+ if (pebs_status & (1 << i))
+ error[i]++;
+ }
continue;
+ }
}
counts[bit]++;
}

for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
- if (counts[bit] == 0)
+ if ((counts[bit] == 0) && (error[bit] == 0))
continue;
+
event = cpuc->events[bit];
WARN_ON_ONCE(!event);
WARN_ON_ONCE(!event->attr.precise_ip);

- __intel_pmu_pebs_event(event, iregs, base, top, bit, counts[bit]);
+ /* log dropped samples number */
+ if (error[bit])
+ perf_log_lost_samples(event, error[bit]);
+
+ if (counts[bit]) {
+ __intel_pmu_pebs_event(event, iregs, base,
+ top, bit, counts[bit]);
+ }
}
}

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -747,6 +747,9 @@ perf_event__output_id_sample(struct perf
struct perf_output_handle *handle,
struct perf_sample_data *sample);

+extern void
+perf_log_lost_samples(struct perf_event *event, u64 lost);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -800,6 +800,18 @@ enum perf_event_type {
*/
PERF_RECORD_ITRACE_START = 12,

+ /*
+ * Records the dropped/lost sample number.
+ *
+ * struct {
+ * struct perf_event_header header;
+ *
+ * u64 lost;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_LOST_SAMPLES = 13,
+
PERF_RECORD_MAX, /* non-ABI */
};

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5947,6 +5947,39 @@ void perf_event_aux_event(struct perf_ev
}

/*
+ * Lost/dropped samples logging
+ */
+void perf_log_lost_samples(struct perf_event *event, u64 lost)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ struct {
+ struct perf_event_header header;
+ u64 lost;
+ } lost_samples_event = {
+ .header = {
+ .type = PERF_RECORD_LOST_SAMPLES,
+ .misc = 0,
+ .size = sizeof(lost_samples_event),
+ },
+ .lost = lost,
+ };
+
+ perf_event_header__init_id(&lost_samples_event.header, &sample, event);
+
+ ret = perf_output_begin(&handle, event,
+ lost_samples_event.header.size);
+ if (ret)
+ return;
+
+ perf_output_put(&handle, lost_samples_event);
+ perf_event__output_id_sample(event, &handle, &sample);
+ perf_output_end(&handle);
+}
+
+/*
* IRQ throttle logging
*/

2015-05-07 11:55:04

by Peter Zijlstra

Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

On Thu, May 07, 2015 at 01:35:24PM +0200, Peter Zijlstra wrote:
>
> - dropped the @id field from the record, it is already included in the
> @sample_id values.

Hmm, this would force people to use sample_id; which in general is a
good idea, but should we really force that on people?

Acme?

2015-05-07 13:56:18

by Liang, Kan

Subject: RE: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES


>
>
> So I changed it slightly to the below; changes are:
>
> - record 'lost' events to all set bits; after all we really do not
> know which event this sample belonged to, only logging to the first
> set bit seems 'wrong'.

If so, the same dropped sample will be counted multiple times. It would
then be hard to know the exact number of dropped samples.

>
> - dropped the @id field from the record, it is already included in the
> @sample_id values.
> ---
> Subject: perf, x86: introduce PERF_RECORD_LOST_SAMPLES
> From: Kan Liang <[email protected]>
> Date: Wed, 6 May 2015 15:33:53 -0400
>
> After enlarging the PEBS interrupt threshold, there may be some mixed up
> PEBS samples which are discarded by kernel. This patch drives the kernel to
> emit a PERF_RECORD_LOST_SAMPLES record with the number of possible
> discards when it is impossible to demux the samples. It makes sure the
> user is not left in the dark about such discards.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Kan Liang <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: http://lkml.kernel.org/r/1430940834-8964-8-git-send-email-
> [email protected]
> ---
> arch/x86/kernel/cpu/perf_event_intel_ds.c | 22 ++++++++++++++++----
> include/linux/perf_event.h | 3 ++
> include/uapi/linux/perf_event.h | 12 ++++++++++
> kernel/events/core.c | 33
> ++++++++++++++++++++++++++++++
> 4 files changed, 66 insertions(+), 4 deletions(-)
>
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -1124,8 +1124,9 @@ static void intel_pmu_drain_pebs_nhm(str
> struct debug_store *ds = cpuc->ds;
> struct perf_event *event;
> void *base, *at, *top;
> - int bit;
> short counts[MAX_PEBS_EVENTS] = {};
> + short error[MAX_PEBS_EVENTS] = {};
> + int bit, i;
>
> if (!x86_pmu.pebs_active)
> return;
> @@ -1169,20 +1170,33 @@ static void intel_pmu_drain_pebs_nhm(str
> /* slow path */
> pebs_status = p->status & cpuc->pebs_enabled;
> pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
> - if (pebs_status != (1 << bit))
> + if (pebs_status != (1 << bit)) {
> + for (i = 0; i < MAX_PEBS_EVENTS; i++) {
> + if (pebs_status & (1 << i))
> + error[i]++;
> + }
> continue;
> + }
> }
> counts[bit]++;
> }
>
> for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
> - if (counts[bit] == 0)
> + if ((counts[bit] == 0) && (error[bit] == 0))
> continue;
> +
> event = cpuc->events[bit];
> WARN_ON_ONCE(!event);
> WARN_ON_ONCE(!event->attr.precise_ip);
>
> - __intel_pmu_pebs_event(event, iregs, base, top, bit, counts[bit]);
> + /* log dropped samples number */
> + if (error[bit])
> + perf_log_lost_samples(event, error[bit]);
> +
> + if (counts[bit]) {
> + __intel_pmu_pebs_event(event, iregs, base,
> + top, bit, counts[bit]);
> + }
> }
> }
>
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -747,6 +747,9 @@ perf_event__output_id_sample(struct perf
> struct perf_output_handle *handle,
> struct perf_sample_data *sample);
>
> +extern void
> +perf_log_lost_samples(struct perf_event *event, u64 lost);
> +
> static inline bool is_sampling_event(struct perf_event *event)
> {
> return event->attr.sample_period != 0;
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -800,6 +800,18 @@ enum perf_event_type {
> */
> PERF_RECORD_ITRACE_START = 12,
>
> + /*
> + * Records the dropped/lost sample number.
> + *
> + * struct {
> + * struct perf_event_header header;
> + *
> + * u64 lost;
> + * struct sample_id sample_id;
> + * };
> + */
> + PERF_RECORD_LOST_SAMPLES = 13,
> +
> PERF_RECORD_MAX, /* non-ABI */
> };
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5947,6 +5947,39 @@ void perf_event_aux_event(struct perf_ev
> }
>
> /*
> + * Lost/dropped samples logging
> + */
> +void perf_log_lost_samples(struct perf_event *event, u64 lost)
> +{
> + struct perf_output_handle handle;
> + struct perf_sample_data sample;
> + int ret;
> +
> + struct {
> + struct perf_event_header header;
> + u64 lost;
> + } lost_samples_event = {
> + .header = {
> + .type = PERF_RECORD_LOST_SAMPLES,
> + .misc = 0,
> + .size = sizeof(lost_samples_event),
> + },
> + .lost = lost,
> + };
> +
> + perf_event_header__init_id(&lost_samples_event.header, &sample, event);
> +
> + ret = perf_output_begin(&handle, event,
> + lost_samples_event.header.size);
> + if (ret)
> + return;
> +
> + perf_output_put(&handle, lost_samples_event);
> + perf_event__output_id_sample(event, &handle, &sample);
> + perf_output_end(&handle);
> +}
> +
> +/*
> * IRQ throttle logging
> */
>

2015-05-07 13:58:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

On Thu, May 07, 2015 at 01:56:09PM +0000, Liang, Kan wrote:
>
> >
> >
> > So I changed it slightly to the below; changes are:
> >
> > - record 'lost' events to all set bits; after all we really do not
> > know which event this sample belonged to, only logging to the first
> > set bit seems 'wrong'.
>
> If so, the same dropped sample will be counted multiple times. It's hard to know
> the exact number of dropped samples.

OTOH if the events are from different consumers, they might think there
are no samples lost, while in fact there are.

2015-05-07 14:15:31

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

Em Thu, May 07, 2015 at 01:54:46PM +0200, Peter Zijlstra escreveu:
> On Thu, May 07, 2015 at 01:35:24PM +0200, Peter Zijlstra wrote:
> >
> > - dropped the @id field from the record, it is already included in the
> > @sample_id values.
>
> Hmm, this would force people to use sample_id; which in general is a
> good idea, but should we really force that on people?

Well, if there are more than one sample, we need it, right? If there is
just one, we don't need it, what is different? Am I needing (even more)
coffee?

/me goes read some code...

- Arnaldo

2015-05-07 14:17:20

by Liang, Kan

[permalink] [raw]
Subject: RE: [PATCH V8 8/8] perf tools: handle PERF_RECORD_LOST_SAMPLES



>
> On Wed, May 06, 2015 at 03:33:54PM -0400, Kan Liang wrote:
> > From: Kan Liang <[email protected]>
> >
> > This patch modified the perf tool to handle the new RECORD type,
> > PERF_RECORD_LOST_SAMPLES.
> > The number of lost-sample events is stored in
> > .nr_events[PERF_EVENT_LOST_SAMPLES], while the exact number of samples
> > which the kernel dropped is stored in total_lost_samples.
> > When the percentage of dropped samples is greater than 5%, a warning
> > will be sent out.
>
> That's nice. Would be good to also print this in perf report --stdio output
> (without -D), like the samples are.
>

OK, I will do it.

> Could you please do the same for the normal PERF_RECORD_LOST too?
> There we don't have an exact percentage, but we could at least count the lost
> events too. This can be a separate patch of course; it doesn't need to be tied
> to this patchkit.

I think the current perf tool already does that.
It will print a warning for any lost events.
Is this what you want?

	if (session->tool->lost == perf_event__process_lost &&
	    stats->nr_events[PERF_RECORD_LOST] != 0) {
		ui__warning("Processed %d events and lost %d chunks!\n\n"
			    "Check IO/CPU overload!\n\n",
			    stats->nr_events[0],
			    stats->nr_events[PERF_RECORD_LOST]);
	}

Thanks,
Kan
>
>
> -Andi
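
For comparison with the PERF_RECORD_LOST check quoted above, here is a minimal
sketch of the 5% rule the new tool patch is described as using. The function
name, threshold handling and message wording are assumptions, not the actual
tools/perf code:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch only: warn when more than 5% of the PEBS samples were dropped.
 * The real patch hooks this into perf's session stats and ui__warning().
 */
static void warn_on_lost_samples(uint64_t total_samples, uint64_t total_lost)
{
	double pct;

	if (!total_lost)
		return;

	pct = 100.0 * (double)total_lost / (double)(total_lost + total_samples);
	if (pct > 5.0)
		fprintf(stderr,
			"Processed %" PRIu64 " samples and lost %.2f%% samples!\n",
			total_samples, pct);
}

int main(void)
{
	warn_on_lost_samples(1000, 80);	/* ~7.4% lost -> prints the warning */
	return 0;
}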

2015-05-07 14:21:30

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

Em Thu, May 07, 2015 at 11:15:20AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Thu, May 07, 2015 at 01:54:46PM +0200, Peter Zijlstra escreveu:
> > On Thu, May 07, 2015 at 01:35:24PM +0200, Peter Zijlstra wrote:
> > >
> > > - dropped the @id field from the record, it is already included in the
> > > @sample_id values.
> >
> > Hmm, this would force people to use sample_id; which in general is a
> > good idea, but should we really force that on people?
>
> Well, if there are more than one sample, we need it, right? If there is
> just one, we don't need it, what is different? Am I needing (even more)
> coffee?
>
> /me goes read some code...

I.e.:

[acme@ssdandy linux]$ perf record -e cycles usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB perf.data (8 samples) ]
[acme@ssdandy linux]$
[acme@ssdandy linux]$ perf evlist -v
cycles: size: 112, { sample_period, sample_freq }: 4000,
sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1,
comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1,
exclude_guest: 1, mmap2: 1, comm_exec: 1
[acme@ssdandy linux]$

Just one event, no need for PERF_SAMPLE_ID.

But with > 1 events:

[acme@ssdandy linux]$ perf record -e cycles,instructions usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (8 samples) ]
[acme@ssdandy linux]$ perf evlist -v
cycles: size: 112, { sample_period, sample_freq }: 4000,
sample_type: IP|TID|TIME|ID|PERIOD, read_format: ID, disabled: 1,
inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1,
sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1
instructions: size: 112, config: 1,
{ sample_period, sample_freq }: 4000,
sample_type: IP|TID|TIME|ID|PERIOD, read_format: ID, disabled: 1,
inherit: 1, freq: 1, enable_on_exec: 1, sample_id_all: 1,
exclude_guest: 1
[acme@ssdandy linux]$

- Arnaldo
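
The evlist output above shows what perf record sets up automatically; for
completeness, a minimal sketch (not from the thread) of the attributes a raw
perf_event_open() consumer would set so that non-sample records such as
PERF_RECORD_LOST_SAMPLES carry a trailing sample_id block. The period and
event choice are arbitrary and error handling is omitted:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int open_cycles_with_id(pid_t pid, int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 10003;
	attr.precise_ip = 2;				/* request PEBS */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
			   PERF_SAMPLE_IDENTIFIER;	/* id at a fixed offset */
	attr.sample_id_all = 1;		/* id block on non-sample records too */

	return syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
}

int main(void)
{
	return open_cycles_with_id(0, -1) < 0;
}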

2015-05-07 14:39:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

On Thu, May 07, 2015 at 11:15:20AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, May 07, 2015 at 01:54:46PM +0200, Peter Zijlstra escreveu:
> > On Thu, May 07, 2015 at 01:35:24PM +0200, Peter Zijlstra wrote:
> > >
> > > - dropped the @id field from the record, it is already included in the
> > > @sample_id values.
> >
> > Hmm, this would force people to use sample_id; which in general is a
> > good idea, but should we really force that on people?
>
> Well, if there are more than one sample, we need it, right? If there is
> just one, we don't need it, what is different? Am I needing (even more)
> coffee?
>
> /me goes read some code...

So the question was, do we do:

/*
* struct {
* struct perf_event_header header;
* u64 id;
* u64 lost;
* struct sample_id sample_id;
* };
*/
PERF_RECORD_LOST_SAMPLES

And have the id thing twice if attr.sample_id && PERF_SAMPLE_ID, but
allow decoding if !attr.sample_id.

Or force attr.sample_id && PERF_SAMPLE_ID if there's multiple events and
do away with the extra id field, like:

/*
* struct {
* struct perf_event_header header;
* u64 lost;
* struct sample_id sample_id;
* };
*/
PERF_RECORD_LOST_SAMPLES

Should we force the use of sample_id on people?

2015-05-07 16:22:34

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

Em Thu, May 07, 2015 at 04:39:39PM +0200, Peter Zijlstra escreveu:
> On Thu, May 07, 2015 at 11:15:20AM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Thu, May 07, 2015 at 01:54:46PM +0200, Peter Zijlstra escreveu:
> > > On Thu, May 07, 2015 at 01:35:24PM +0200, Peter Zijlstra wrote:
> > > >
> > > > - dropped the @id field from the record, it is already included in the
> > > > @sample_id values.
> > >
> > > Hmm, this would force people to use sample_id; which in general is a
> > > good idea, but should we really force that on people?
> >
> > Well, if there are more than one sample, we need it, right? If there is
> > just one, we don't need it, what is different? Am I needing (even more)
> > coffee?
> >
> > /me goes read some code...
>
> So the question was, do we do:
>
> /*
> * struct {
> * struct perf_event_header header;
> * u64 id;
> * u64 lost;
> * struct sample_id sample_id;
> * };
> */
> PERF_RECORD_LOST_SAMPLES
>
> And have the id thing twice if attr.sample_id && PERF_SAMPLE_ID, but
> allow decoding if !attr.sample_id.
>
> Or force attr.sample_id && PERF_SAMPLE_ID if there's multiple events and
> do away with the extra id field, like:
>
> /*
> * struct {
> * struct perf_event_header header;
> * u64 lost;
> * struct sample_id sample_id;
> * };
> */
> PERF_RECORD_LOST_SAMPLES
>
> Should we force the use of sample_id on people?

If we have more than one event we _need_ PERF_SAMPLE_ID, to
disambiguate, if we don't, then the lost events are just for that one,
no?

- Arnaldo

2015-05-07 17:38:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

On Thu, May 07, 2015 at 01:22:23PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, May 07, 2015 at 04:39:39PM +0200, Peter Zijlstra escreveu:
> > On Thu, May 07, 2015 at 11:15:20AM -0300, Arnaldo Carvalho de Melo wrote:
> > > Em Thu, May 07, 2015 at 01:54:46PM +0200, Peter Zijlstra escreveu:
> > > > On Thu, May 07, 2015 at 01:35:24PM +0200, Peter Zijlstra wrote:
> > > > >
> > > > > - dropped the @id field from the record, it is already included in the
> > > > > @sample_id values.
> > > >
> > > > Hmm, this would force people to use sample_id; which in general is a
> > > > good idea, but should we really force that on people?
> > >
> > > Well, if there are more than one sample, we need it, right? If there is
> > > just one, we don't need it, what is different? Am I needing (even more)
> > > coffee?
> > >
> > > /me goes read some code...
> >
> > So the question was, do we do:
> >
> > /*
> > * struct {
> > * struct perf_event_header header;
> > * u64 id;
> > * u64 lost;
> > * struct sample_id sample_id;
> > * };
> > */
> > PERF_RECORD_LOST_SAMPLES
> >
> > And have the id thing twice if attr.sample_id && PERF_SAMPLE_ID, but
> > allow decoding if !attr.sample_id.
> >
> > Or force attr.sample_id && PERF_SAMPLE_ID if there's multiple events and
> > do away with the extra id field, like:
> >
> > /*
> > * struct {
> > * struct perf_event_header header;
> > * u64 lost;
> > * struct sample_id sample_id;
> > * };
> > */
> > PERF_RECORD_LOST_SAMPLES
> >
> > Should we force the use of sample_id on people?
>
> If we have more than one event we _need_ PERF_SAMPLE_ID, to
> disambiguate, if we don't, then the lost events are just for that one,
> no?

Sure, PERF_SAMPLE_ID is required, but attr::sample_id_all is not, is it?

We can largely get by without using sample_id_all, as we did for a
while.

That said, sample_id_all has been around for more than 4 years and its use is
recommended; but should we mandate it?

2015-05-07 20:01:29

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH V8 7/8] perf, x86: introduce PERF_RECORD_LOST_SAMPLES

Em Thu, May 07, 2015 at 07:37:50PM +0200, Peter Zijlstra escreveu:
> On Thu, May 07, 2015 at 01:22:23PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Thu, May 07, 2015 at 04:39:39PM +0200, Peter Zijlstra escreveu:
> > > On Thu, May 07, 2015 at 11:15:20AM -0300, Arnaldo Carvalho de Melo wrote:
> > > > Em Thu, May 07, 2015 at 01:54:46PM +0200, Peter Zijlstra escreveu:
> > > > > On Thu, May 07, 2015 at 01:35:24PM +0200, Peter Zijlstra wrote:
> > > > > >
> > > > > > - dropped the @id field from the record, it is already included in the
> > > > > > @sample_id values.
> > > > >
> > > > > Hmm, this would force people to use sample_id; which in general is a
> > > > > good idea, but should we really force that on people?
> > > >
> > > > Well, if there are more than one sample, we need it, right? If there is
> > > > just one, we don't need it, what is different? Am I needing (even more)
> > > > coffee?
> > > >
> > > > /me goes read some code...
> > >
> > > So the question was, do we do:
> > >
> > > /*
> > > * struct {
> > > * struct perf_event_header header;
> > > * u64 id;
> > > * u64 lost;
> > > * struct sample_id sample_id;
> > > * };
> > > */
> > > PERF_RECORD_LOST_SAMPLES
> > >
> > > And have the id thing twice if attr.sample_id && PERF_SAMPLE_ID, but
> > > allow decoding if !attr.sample_id.
> > >
> > > Or force attr.sample_id && PERF_SAMPLE_ID if there's multiple events and
> > > do away with the extra id field, like:
> > >
> > > /*
> > > * struct {
> > > * struct perf_event_header header;
> > > * u64 lost;
> > > * struct sample_id sample_id;
> > > * };
> > > */
> > > PERF_RECORD_LOST_SAMPLES
> > >
> > > Should we force the use of sample_id on people?
> >
> > If we have more than one event we _need_ PERF_SAMPLE_ID, to
> > disambiguate, if we don't, then the lost events are just for that one,
> > no?

> Sure, PERF_SAMPLE_ID is required, but attr::sample_id_all is not, is it?
>
> We can largely get by without using sample_id_all, as we did for a
> while.

> That said, sample_id_all has been around for more than 4 years and its use is
> recommended; but should we mandate it?

Got it now, to have PERF_SAMPLE_ID(ENTIFIER) in records !=
PERF_RECORD_SAMPLE when multiplexing more than one event on a ring
buffer, one would have to set attr.sample_id_all.

So, if we have just one event, we don't need sample_id_all (but we will
end up using it to have PERF_SAMPLE_TIME, to order metadata events
across CPUs), we also don't need PERF_SAMPLE_ID, and we don't need to
have the u64 id in the PERF_RECORD_LOST_SAMPLES record, no?

- Arnaldo


Below is some rambling, thinking out loud, ignore it if you want.

The attr::sample_id_all thing was more to be able to have
PERF_SAMPLE_TIME, and with that be able to order metadata events
together with the PERF_RECORD_SAMPLE events, where we can ask for PERF_SAMPLE_TIME.

PERF_SAMPLE_ID(ENTIFIER) is about mapping back a PERF_RECORD_SAMPLE to
an event; with this it would also be used to figure out which event is
getting samples LOST, so I think the same semantic applies, i.e. if we
mux more than one event in a ring buffer, then PERF_RECORD_SAMPLE _and_
PERF_RECORD_LOST_SAMPLES should use the same mechanism _when we need to
disambiguate_, right?

I.e. those 8 bytes would only be required when we have more than one
event.

What downsides would we have if we used attr.sample_id_all +
PERF_SAMPLE_ID(ENTIFIER) to figure out what event the
PERF_RECORD_LOST_SAMPLES refers to?

Looking at
__perf_event__output_id_sample(kernel)/perf_evsel__parse_id_sample(tools)
we only insert/parse:

if (type & PERF_SAMPLE_IDENTIFIER) {
if (type & PERF_SAMPLE_CPU) {
if (type & PERF_SAMPLE_STREAM_ID) {
if (type & PERF_SAMPLE_ID) {
if (type & PERF_SAMPLE_TIME) {
if (type & PERF_SAMPLE_TID) {

- Arnaldo
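
The checks listed above are the insertion order of the sample_id block; a
minimal sketch (assuming that layout, with PERF_SAMPLE_IDENTIFIER written
last so it sits at a fixed offset from the end) of how a reader could pull
the event id out of the tail of a PERF_RECORD_LOST_SAMPLES record:

#include <linux/perf_event.h>
#include <linux/types.h>

/*
 * Sketch only: recover the event id from the sample_id block that trails a
 * non-sample record.  'end' points just past the record.  Layout assumption:
 * TID, TIME, ID, STREAM_ID, CPU, IDENTIFIER.
 */
__u64 trailing_event_id(__u64 sample_type, const __u64 *end)
{
	if (sample_type & PERF_SAMPLE_IDENTIFIER)
		return end[-1];

	/* walk backwards over the fields that follow PERF_SAMPLE_ID */
	if (sample_type & PERF_SAMPLE_CPU)
		end--;			/* u32 cpu + u32 res in one u64 */
	if (sample_type & PERF_SAMPLE_STREAM_ID)
		end--;
	if (sample_type & PERF_SAMPLE_ID)
		return end[-1];

	return 0;			/* no id available: single-event case */
}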

2015-05-08 11:05:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH V8 3/8] perf, x86: handle multiple records in PEBS buffer

> The first issue is that the 'status' field of the PEBS record is a copy
> of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a

I wanted to mention that Skylake has this issue fixed. The status
field in PEBS is now only the counter that triggered the PEBS record.

> The second issue is that the hardware will only emit one record for two
> or more counters if the event that triggers the assist is 'close'. The

If two events get merged, it is one or the other.

-Andi

2015-05-08 11:29:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V8 3/8] perf, x86: handle multiple records in PEBS buffer

On Fri, May 08, 2015 at 01:05:47PM +0200, Andi Kleen wrote:
> > The first issue is that the 'status' field of the PEBS record is a copy
> > of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
>
> I wanted to mention that Skylake has this issue fixed. The status
> field in PEBS is now only the counter that triggered the PEBS record.

Yes, just read the latest SDM on this, great!

Subject: [tip:perf/core] perf/x86/intel: Use the PEBS auto reload mechanism when possible

Commit-ID: 851559e35fd5ab637783ba395e55edd50f761229
Gitweb: http://git.kernel.org/tip/851559e35fd5ab637783ba395e55edd50f761229
Author: Yan, Zheng <[email protected]>
AuthorDate: Wed, 6 May 2015 15:33:47 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 7 Jun 2015 16:08:35 +0200

perf/x86/intel: Use the PEBS auto reload mechanism when possible

When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.

However, the reset value will be loaded by hardware assist. There is a
small delay compared to the previous non-auto-reload mechanism. The
delay time is arbitrary, but very small. The assist cost is 400-800
cycles, assuming common cases with everything cached. The minimum period
the patch currently uses is 10000. In that extreme case it can be ~10%
if cycles are used.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++------
arch/x86/kernel/cpu/perf_event.h | 1 +
arch/x86/kernel/cpu/perf_event_intel.c | 8 ++++++--
arch/x86/kernel/cpu/perf_event_intel_ds.c | 7 +++++++
4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index dbe3328..9560d0f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1094,13 +1094,16 @@ int x86_perf_event_set_period(struct perf_event *event)

per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;

- /*
- * The hw event starts counting from this event offset,
- * mark it to be able to extra future deltas:
- */
- local64_set(&hwc->prev_count, (u64)-left);
+ if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+ local64_read(&hwc->prev_count) != (u64)-left) {
+ /*
+ * The hw event starts counting from this event offset,
+ * mark it to be able to extra future deltas:
+ */
+ local64_set(&hwc->prev_count, (u64)-left);

- wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ }

/*
* Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 89e6cd6..7a3f0fd 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -75,6 +75,7 @@ struct event_constraint {
#define PERF_X86_EVENT_DYNAMIC 0x0080 /* dynamic alloc'd constraint */
#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0100 /* grant rdpmc permission */
#define PERF_X86_EVENT_EXCL_ACCT 0x0200 /* accounted EXCL event */
+#define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */


struct amd_nb {
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 74f19d9..1762893 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2260,8 +2260,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (ret)
return ret;

- if (event->attr.precise_ip && x86_pmu.pebs_aliases)
- x86_pmu.pebs_aliases(event);
+ if (event->attr.precise_ip) {
+ if (!event->attr.freq)
+ event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+ if (x86_pmu.pebs_aliases)
+ x86_pmu.pebs_aliases(event);
+ }

if (needs_branch_stack(event)) {
ret = intel_pmu_setup_lbr_filter(event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 7f73b35..4802d5d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -688,6 +688,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

@@ -697,6 +698,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+
+ /* Use auto-reload if possible to save a MSR write in the PMI */
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ ds->pebs_event_reset[hwc->idx] =
+ (u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
+ }
}

void intel_pmu_pebs_disable(struct perf_event *event)
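
As a quick illustration of the reset value programmed above, a sketch of the
arithmetic; the 48-bit counter width is an assumption here, the real code
takes x86_pmu.cntval_mask from CPUID:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t cntval_mask = (1ULL << 48) - 1;	/* assumed 48-bit counter */
	const uint64_t period = 10003;
	uint64_t reset = (uint64_t)(-period) & cntval_mask;

	/* the counter is reloaded to 2^48 - period, so it overflows again
	 * after exactly 'period' events without an MSR write in the PMI */
	printf("reset = %#llx, events until overflow = %llu\n",
	       (unsigned long long)reset,
	       (unsigned long long)((1ULL << 48) - reset));
	return 0;
}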

Subject: [tip:perf/core] perf/x86/intel: Introduce setup_pebs_sample_data()

Commit-ID: 43cf76312faefed098c057082abac8a3d521e1dc
Gitweb: http://git.kernel.org/tip/43cf76312faefed098c057082abac8a3d521e1dc
Author: Yan, Zheng <[email protected]>
AuthorDate: Wed, 6 May 2015 15:33:48 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 7 Jun 2015 16:08:40 +0200

perf/x86/intel: Introduce setup_pebs_sample_data()

Move code that sets up the PEBS sample data to a separate function.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 95 +++++++++++++++++--------------
1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 4802d5d..a5fe561 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -853,8 +853,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
return txn;
}

-static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
#define PERF_X86_EVENT_PEBS_HSW_PREC \
(PERF_X86_EVENT_PEBS_ST_HSW | \
@@ -866,30 +868,25 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
*/
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct pebs_record_hsw *pebs = __pebs;
- struct perf_sample_data data;
- struct pt_regs regs;
u64 sample_type;
int fll, fst, dsrc;
int fl = event->hw.flags;

- if (!intel_pmu_save_and_restart(event))
- return;
-
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;

fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);

- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(data, 0, event->hw.last_period);

- data.period = event->hw.last_period;
+ data->period = event->hw.last_period;

/*
* Use latency for weight (only avail with PEBS-LL)
*/
if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
- data.weight = pebs->lat;
+ data->weight = pebs->lat;

/*
* data.data_src encodes the data source
@@ -902,7 +899,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
val = precise_datala_hsw(event, pebs->dse);
else if (fst)
val = precise_store_data(pebs->dse);
- data.data_src.val = val;
+ data->data_src.val = val;
}

/*
@@ -915,58 +912,70 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
* A possible PERF_SAMPLE_REGS will have to transfer all regs.
*/
- regs = *iregs;
- regs.flags = pebs->flags;
- set_linear_ip(&regs, pebs->ip);
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
+ *regs = *iregs;
+ regs->flags = pebs->flags;
+ set_linear_ip(regs, pebs->ip);
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;

if (sample_type & PERF_SAMPLE_REGS_INTR) {
- regs.ax = pebs->ax;
- regs.bx = pebs->bx;
- regs.cx = pebs->cx;
- regs.dx = pebs->dx;
- regs.si = pebs->si;
- regs.di = pebs->di;
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
-
- regs.flags = pebs->flags;
+ regs->ax = pebs->ax;
+ regs->bx = pebs->bx;
+ regs->cx = pebs->cx;
+ regs->dx = pebs->dx;
+ regs->si = pebs->si;
+ regs->di = pebs->di;
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;
+
+ regs->flags = pebs->flags;
#ifndef CONFIG_X86_32
- regs.r8 = pebs->r8;
- regs.r9 = pebs->r9;
- regs.r10 = pebs->r10;
- regs.r11 = pebs->r11;
- regs.r12 = pebs->r12;
- regs.r13 = pebs->r13;
- regs.r14 = pebs->r14;
- regs.r15 = pebs->r15;
+ regs->r8 = pebs->r8;
+ regs->r9 = pebs->r9;
+ regs->r10 = pebs->r10;
+ regs->r11 = pebs->r11;
+ regs->r12 = pebs->r12;
+ regs->r13 = pebs->r13;
+ regs->r14 = pebs->r14;
+ regs->r15 = pebs->r15;
#endif
}

if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs->real_ip;
- regs.flags |= PERF_EFLAGS_EXACT;
- } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
- regs.flags |= PERF_EFLAGS_EXACT;
+ regs->ip = pebs->real_ip;
+ regs->flags |= PERF_EFLAGS_EXACT;
+ } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+ regs->flags |= PERF_EFLAGS_EXACT;
else
- regs.flags &= ~PERF_EFLAGS_EXACT;
+ regs->flags &= ~PERF_EFLAGS_EXACT;

if ((sample_type & PERF_SAMPLE_ADDR) &&
x86_pmu.intel_cap.pebs_format >= 1)
- data.addr = pebs->dla;
+ data->addr = pebs->dla;

if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
- data.weight = intel_hsw_weight(pebs);
+ data->weight = intel_hsw_weight(pebs);

if (sample_type & PERF_SAMPLE_TRANSACTION)
- data.txn = intel_hsw_transaction(pebs);
+ data->txn = intel_hsw_transaction(pebs);
}

if (has_branch_stack(event))
- data.br_stack = &cpuc->lbr_stack;
+ data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs)
+{
+ struct perf_sample_data data;
+ struct pt_regs regs;
+
+ if (!intel_pmu_save_and_restart(event))
+ return;
+
+ setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);

if (perf_event_overflow(event, &data, &regs))
x86_pmu_stop(event, 0);

Subject: [tip:perf/core] perf/x86/intel: Handle multiple records in the PEBS buffer

Commit-ID: 21509084f999d7accd32e45961ef76853112e978
Gitweb: http://git.kernel.org/tip/21509084f999d7accd32e45961ef76853112e978
Author: Yan, Zheng <[email protected]>
AuthorDate: Wed, 6 May 2015 15:33:49 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 7 Jun 2015 16:08:45 +0200

perf/x86/intel: Handle multiple records in the PEBS buffer

When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.

Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible
scenarios to demux.

The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:

A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist
< some unspecified amount of time later -- this could cover multiple
events of interest >

B) the hardware assist is armed, any next event will trigger it

C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS

Now consider the following chain of events:

A0, B0, A1, C0

The event generated for counter 0 will include a status with counter 1
set, even though it's not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close'. The
'close' can be several cycles. In some cases even the complete assist,
if the event is something that doesn't need retirement.

For instance, consider this chain of events:

A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists, we will
generate but a single record, but again with both counters listed in the
status field.

This time the record pertains to both events.

Note that these two cases are different but undistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.

Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.

The assumption/hope is that such discards will be rare.

Here are some possible ways you may get a high discard rate:

- When you count the same thing multiple times. But this is not a useful
configuration.
- You can be unfortunate if you measure with a userspace-only PEBS
event along with either a kernel or unrestricted PEBS event. Imagine
the event triggering and setting the overflow flag right before
entering the kernel. Then all kernel-side events will end up with
multiple bits set.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
[ Changelog improvements. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 142 +++++++++++++++++++++---------
include/linux/perf_event.h | 13 +++
kernel/events/core.c | 6 +-
kernel/events/internal.h | 9 --
4 files changed, 116 insertions(+), 54 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index a5fe561..72529c2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -872,6 +872,9 @@ static void setup_pebs_sample_data(struct perf_event *event,
int fll, fst, dsrc;
int fl = event->hw.flags;

+ if (pebs == NULL)
+ return;
+
sample_type = event->attr.sample_type;
dsrc = sample_type & PERF_SAMPLE_DATA_SRC;

@@ -966,19 +969,68 @@ static void setup_pebs_sample_data(struct perf_event *event,
data->br_stack = &cpuc->lbr_stack;
}

+static inline void *
+get_next_pebs_record_by_bit(void *base, void *top, int bit)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ void *at;
+ u64 pebs_status;
+
+ if (base == NULL)
+ return NULL;
+
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+
+ if (test_bit(bit, (unsigned long *)&p->status)) {
+
+ if (p->status == (1 << bit))
+ return at;
+
+ /* clear non-PEBS bit and re-check */
+ pebs_status = p->status & cpuc->pebs_enabled;
+ pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
+ if (pebs_status == (1 << bit))
+ return at;
+ }
+ }
+ return NULL;
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+ struct pt_regs *iregs,
+ void *base, void *top,
+ int bit, int count)
{
struct perf_sample_data data;
struct pt_regs regs;
+ int i;
+ void *at = get_next_pebs_record_by_bit(base, top, bit);

- if (!intel_pmu_save_and_restart(event))
+ if (!intel_pmu_save_and_restart(event) &&
+ !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
return;

- setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
+ if (count > 1) {
+ for (i = 0; i < count - 1; i++) {
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
+ perf_event_output(event, &data, &regs);
+ at += x86_pmu.pebs_record_size;
+ at = get_next_pebs_record_by_bit(at, top, bit);
+ }
+ }
+
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);

- if (perf_event_overflow(event, &data, &regs))
+ /*
+ * All but the last records are processed.
+ * The last one is left to be able to call the overflow handler.
+ */
+ if (perf_event_overflow(event, &data, &regs)) {
x86_pmu_stop(event, 0);
+ return;
+ }
+
}

static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
@@ -1008,72 +1060,78 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
if (!event->attr.precise_ip)
return;

- n = top - at;
+ n = (top - at) / x86_pmu.pebs_record_size;
if (n <= 0)
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
- at += n - 1;
-
- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, at, top, 0, n);
}

static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct debug_store *ds = cpuc->ds;
- struct perf_event *event = NULL;
- void *at, *top;
- u64 status = 0;
+ struct perf_event *event;
+ void *base, *at, *top;
int bit;
+ short counts[MAX_PEBS_EVENTS] = {};

if (!x86_pmu.pebs_active)
return;

- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;

ds->pebs_index = ds->pebs_buffer_base;

- if (unlikely(at > top))
+ if (unlikely(base >= top))
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
- "Unexpected number of pebs records %ld\n",
- (long)(top - at) / x86_pmu.pebs_record_size);
-
- for (; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

- for_each_set_bit(bit, (unsigned long *)&p->status,
- x86_pmu.max_pebs_events) {
- event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
- WARN_ON_ONCE(!event);
-
- if (!event->attr.precise_ip)
- continue;
+ bit = find_first_bit((unsigned long *)&p->status,
+ x86_pmu.max_pebs_events);
+ if (bit >= x86_pmu.max_pebs_events)
+ continue;
+ if (!test_bit(bit, cpuc->active_mask))
+ continue;
+ /*
+ * The PEBS hardware does not deal well with the situation
+ * when events happen near to each other and multiple bits
+ * are set. But it should happen rarely.
+ *
+ * If these events include one PEBS and multiple non-PEBS
+ * events, it doesn't impact PEBS record. The record will
+ * be handled normally. (slow path)
+ *
+ * If these events include two or more PEBS events, the
+ * records for the events can be collapsed into a single
+ * one, and it's not possible to reconstruct all events
+ * that caused the PEBS record. It's called collision.
+ * If collision happened, the record will be dropped.
+ *
+ */
+ if (p->status != (1 << bit)) {
+ u64 pebs_status;

- if (__test_and_set_bit(bit, (unsigned long *)&status))
+ /* slow path */
+ pebs_status = p->status & cpuc->pebs_enabled;
+ pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
+ if (pebs_status != (1 << bit))
continue;
-
- break;
}
+ counts[bit]++;
+ }

- if (!event || bit >= x86_pmu.max_pebs_events)
+ for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+ if (counts[bit] == 0)
continue;
+ event = cpuc->events[bit];
+ WARN_ON_ONCE(!event);
+ WARN_ON_ONCE(!event->attr.precise_ip);

- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, base, top, bit, counts[bit]);
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0658002..5f192e1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -730,6 +730,19 @@ extern int perf_event_overflow(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs);

+extern void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
+
+extern void
+perf_event_header__init_id(struct perf_event_header *header,
+ struct perf_sample_data *data,
+ struct perf_event *event);
+extern void
+perf_event__output_id_sample(struct perf_event *event,
+ struct perf_output_handle *handle,
+ struct perf_sample_data *sample);
+
static inline bool is_sampling_event(struct perf_event *event)
{
return event->attr.sample_period != 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index eddf1ed..e499b4e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5381,9 +5381,9 @@ void perf_prepare_sample(struct perf_event_header *header,
}
}

-static void perf_event_output(struct perf_event *event,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
struct perf_output_handle handle;
struct perf_event_header header;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 9f6ce9b..2deb24c 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -72,15 +72,6 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
void perf_event_aux_event(struct perf_event *event, unsigned long head,
unsigned long size, u64 flags);

-extern void
-perf_event_header__init_id(struct perf_event_header *header,
- struct perf_sample_data *data,
- struct perf_event *event);
-extern void
-perf_event__output_id_sample(struct perf_event *event,
- struct perf_output_handle *handle,
- struct perf_sample_data *sample);
-
extern struct page *
perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff);
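
To restate the demux rule the drain handler above implements, here is a
compact standalone sketch; the names are local to the example, not kernel
symbols:

#include <stdbool.h>
#include <stdint.h>

#define MAX_PEBS_EVENTS 8

/*
 * Sketch of the slow-path check: a record is attributed to counter 'bit'
 * only if, after masking off non-PEBS and disabled counters, 'bit' is the
 * sole remaining status bit.  Anything else is a collision and the record
 * is dropped.
 */
bool pebs_record_is_unambiguous(uint64_t rec_status, uint64_t pebs_enabled,
				int bit)
{
	uint64_t pebs_status = rec_status & pebs_enabled;

	pebs_status &= (1ULL << MAX_PEBS_EVENTS) - 1;
	return pebs_status == (1ULL << bit);
}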

Subject: [tip:perf/core] perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold)

Commit-ID: 3569c0d7c5440d6fd06b10e1ef9614588a049bc7
Gitweb: http://git.kernel.org/tip/3569c0d7c5440d6fd06b10e1ef9614588a049bc7
Author: Yan, Zheng <[email protected]>
AuthorDate: Wed, 6 May 2015 15:33:50 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 7 Jun 2015 16:08:49 +0200

perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold)

PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.

For frequently occurring events (like cycles or branches or load/store)
this in turn requires using a relatively high sampling period to avoid
overloading the system, since every sample is processed through a PMI.
This in turn increases sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it can not
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases, basically as long as
you can do without a callgraph and can set the period you can use this
new PEBS mode.

The main benefit is the ability to support much lower sampling period
(down to -c 1000) without extensive overhead.

One use case is, for example, to increase the resolution of the c2c tool.
Another is double checking when you suspect the standard sampling has
too much sampling error.

Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).

The test command for plain:
"perf record --time -e cycles:p -c $period -- kernbench -M -H"

The test command for multi:
"perf record --no-time -e cycles:p -c $period -- kernbench -M -H"

( The only difference between the multi and plain test commands is the
time stamp option. Since time stamps are not supported with a large PEBS
threshold, this can be used as a flag to indicate whether the large
threshold is enabled during the test. )

period plain(Sec) multi(Sec) Delta
10003 32.7 16.5 16.2
20003 30.2 16.2 14.0
40003 18.6 14.1 4.5
80003 16.8 14.6 2.2
100003 16.9 14.1 2.8
800003 15.4 15.7 -0.3
1000003 15.3 15.2 0.2
2000003 15.3 15.1 0.1

With periods below 100003, plain (threshold one) causes much more
overhead. With a 10003 sampling period, multi is even 2X faster than
plain in elapsed time.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 11 +++++++++++
arch/x86/kernel/cpu/perf_event_intel.c | 5 ++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 27 +++++++++++++++++++++++----
3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 7a3f0fd..a73dfc9 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -76,6 +76,7 @@ struct event_constraint {
#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0100 /* grant rdpmc permission */
#define PERF_X86_EVENT_EXCL_ACCT 0x0200 /* accounted EXCL event */
#define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_FREERUNNING 0x0800 /* use freerunning PEBS */


struct amd_nb {
@@ -89,6 +90,16 @@ struct amd_nb {
#define MAX_PEBS_EVENTS 8

/*
+ * Flags PEBS can handle without an PMI.
+ *
+ */
+#define PEBS_FREERUNNING_FLAGS \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_ADDR | \
+ PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+ PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+ PERF_SAMPLE_TRANSACTION)
+
+/*
* A debug store configuration.
*
* We only support architectures that use 64bit fields.
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 1762893..6985f43 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2261,8 +2261,11 @@ static int intel_pmu_hw_config(struct perf_event *event)
return ret;

if (event->attr.precise_ip) {
- if (!event->attr.freq)
+ if (!event->attr.freq) {
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+ if (!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
+ event->hw.flags |= PERF_X86_EVENT_FREERUNNING;
+ }
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 72529c2..0ce455d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
int node = cpu_to_node(cpu);
- int max, thresh = 1; /* always use a single PEBS record */
+ int max;
void *buffer, *ibuffer;

if (!x86_pmu.pebs)
@@ -280,9 +280,6 @@ static int alloc_pebs_buffer(int cpu)
ds->pebs_absolute_maximum = ds->pebs_buffer_base +
max * x86_pmu.pebs_record_size;

- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- thresh * x86_pmu.pebs_record_size;
-
return 0;
}

@@ -684,14 +681,22 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}

+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+ return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
struct debug_store *ds = cpuc->ds;
+ bool first_pebs;
+ u64 threshold;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;

+ first_pebs = !pebs_is_enabled(cpuc);
cpuc->pebs_enabled |= 1ULL << hwc->idx;

if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -699,11 +704,25 @@ void intel_pmu_pebs_enable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;

+ /*
+ * When the event is constrained enough we can use a larger
+ * threshold and run the event with less frequent PMI.
+ */
+ if (hwc->flags & PERF_X86_EVENT_FREERUNNING) {
+ threshold = ds->pebs_absolute_maximum -
+ x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }
+
/* Use auto-reload if possible to save a MSR write in the PMI */
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
ds->pebs_event_reset[hwc->idx] =
(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
}
+
+ if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+ ds->pebs_interrupt_threshold = threshold;
}

void intel_pmu_pebs_disable(struct perf_event *event)
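
The threshold choice in intel_pmu_pebs_enable() above boils down to the
following sketch with local names; the headroom of one record per PEBS
counter is presumably there because each enabled counter can still log a
record before the PMI is serviced:

#include <stdint.h>

uint64_t pick_pebs_threshold(uint64_t pebs_buffer_base,
			     uint64_t pebs_absolute_maximum,
			     uint64_t pebs_record_size,
			     unsigned int max_pebs_events,
			     int freerunning)
{
	if (freerunning)
		/* large threshold: interrupt only when the buffer is nearly
		 * full, leaving room for one more record per PEBS counter */
		return pebs_absolute_maximum -
		       max_pebs_events * pebs_record_size;

	/* classic behaviour: interrupt after every single record */
	return pebs_buffer_base + pebs_record_size;
}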

Subject: [tip:perf/core] perf/x86/intel: Drain the PEBS buffer during context switches

Commit-ID: 9c964efa4330a58520783effe9847f15126fef1f
Gitweb: http://git.kernel.org/tip/9c964efa4330a58520783effe9847f15126fef1f
Author: Yan, Zheng <[email protected]>
AuthorDate: Wed, 6 May 2015 15:33:51 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 7 Jun 2015 16:08:54 +0200

perf/x86/intel: Drain the PEBS buffer during context switches

Flush the PEBS buffer during context switches if the PEBS interrupt threshold
is larger than one. This allows perf to supply the TID for sample outputs.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 6 +++++-
arch/x86/kernel/cpu/perf_event_intel.c | 11 +++++++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 32 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 3 ---
4 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index a73dfc9..74089bc 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -92,9 +92,11 @@ struct amd_nb {
/*
* Flags PEBS can handle without an PMI.
*
+ * TID can only be handled by flushing at context switch.
+ *
*/
#define PEBS_FREERUNNING_FLAGS \
- (PERF_SAMPLE_IP | PERF_SAMPLE_ADDR | \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
PERF_SAMPLE_TRANSACTION)
@@ -877,6 +879,8 @@ void intel_pmu_pebs_enable_all(void);

void intel_pmu_pebs_disable_all(void);

+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_ds_init(void);

void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 6985f43..d455e2a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2642,6 +2642,15 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ if (x86_pmu.pebs_active)
+ intel_pmu_pebs_sched_task(ctx, sched_in);
+ if (x86_pmu.lbr_nr)
+ intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2731,7 +2740,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .sched_task = intel_pmu_lbr_sched_task,
+ .sched_task = intel_pmu_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 0ce455d..6285247 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -546,6 +546,19 @@ int intel_pmu_drain_bts_buffer(void)
return 1;
}

+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+ struct pt_regs regs;
+
+ x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (!sched_in)
+ intel_pmu_drain_pebs_buffer();
+}
+
/*
* PEBS
*/
@@ -711,8 +724,19 @@ void intel_pmu_pebs_enable(struct perf_event *event)
if (hwc->flags & PERF_X86_EVENT_FREERUNNING) {
threshold = ds->pebs_absolute_maximum -
x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+
+ if (first_pebs)
+ perf_sched_cb_inc(event->ctx->pmu);
} else {
threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+
+ /*
+ * If not all events can use larger buffer,
+ * roll back to threshold = 1
+ */
+ if (!first_pebs &&
+ (ds->pebs_interrupt_threshold > threshold))
+ perf_sched_cb_dec(event->ctx->pmu);
}

/* Use auto-reload if possible to save a MSR write in the PMI */
@@ -729,6 +753,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;

cpuc->pebs_enabled &= ~(1ULL << hwc->idx);

@@ -737,6 +762,13 @@ void intel_pmu_pebs_disable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled &= ~(1ULL << 63);

+ if (ds->pebs_interrupt_threshold >
+ ds->pebs_buffer_base + x86_pmu.pebs_record_size) {
+ intel_pmu_drain_pebs_buffer();
+ if (!pebs_is_enabled(cpuc))
+ perf_sched_cb_dec(event->ctx->pmu);
+ }
+
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 201e16f..452a7bd 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -264,9 +264,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct x86_perf_task_context *task_ctx;

- if (!x86_pmu.lbr_nr)
- return;
-
/*
* If LBR callstack feature is enabled and the stack was saved when
* the task was scheduled out, restore the stack. Otherwise flush

Subject: [tip:perf/core] perf/intel/x86: Enlarge the PEBS buffer

Commit-ID: 156174999dd1d0fe8732f5a05f4e9cef921ad487
Gitweb: http://git.kernel.org/tip/156174999dd1d0fe8732f5a05f4e9cef921ad487
Author: Yan, Zheng <[email protected]>
AuthorDate: Wed, 6 May 2015 15:33:52 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 7 Jun 2015 16:08:57 +0200

perf/intel/x86: Enlarge the PEBS buffer

Currently the PEBS buffer size is 4k; it can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as the BTS buffer).

64k memory can hold about 330 PEBS records. This will significantly
reduce the number of PMIs when batched PEBS interrupts are enabled.

Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 6285247..266079a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
#define BTS_RECORD_SIZE 24

#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE PAGE_SIZE
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_FIXUP_SIZE PAGE_SIZE

/*
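
As a rough cross-check of the record counts quoted in the changelog: the
~192-byte record size used below is an assumption (roughly the Haswell PEBS
record); the real capacity depends on x86_pmu.pebs_record_size.

#include <stdio.h>

int main(void)
{
	const unsigned long page_size   = 4096;
	const unsigned long old_size    = page_size;		/* PAGE_SIZE */
	const unsigned long new_size    = page_size << 4;	/* 64k */
	const unsigned long record_size = 192;			/* assumed */

	/* prints roughly 21 and 341, in the same ballpark as the
	 * "about 21" and "about 330" figures in the changelog */
	printf("old buffer: ~%lu records, new buffer: ~%lu records\n",
	       old_size / record_size, new_size / record_size);
	return 0;
}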