2014-07-31 06:45:13

by Yan, Zheng

Subject: [PATCH v4 0/9] perf, x86: large PEBS interrupt threshold

This patch series implements large PEBS interrupt threshold. For some
limited cases, it can significantly reduce the sample overhead. Please
read patch 6's commit message for more information.

changes since v1:
- drop patch 'perf, core: Add all PMUs to pmu_idr'
- add comments for case that multiple counters overflow simultaneously
changes since v2:
- rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
- use flag to indicate auto reload mechanism
- move codes that setup PEBS sample data to separate function
- output the PEBS records in batch
- enable this for All (PEBS capable) hardware
- more description for the multiplex
changes since v3:
- ignore conflicting PEBS record


2014-07-31 06:45:15

by Yan, Zheng

Subject: [PATCH v4 3/9] perf, x86: use the PEBS auto reload mechanism when possible

When a fixed period is specified, this patch makes perf use the PEBS
auto-reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
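
As a rough illustration, here is a minimal userspace sketch of the
reload-value computation; the 48-bit counter width and the example
period are assumptions, not values taken from this patch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t cntval_mask = (1ULL << 48) - 1;  /* assumed counter width */
        uint64_t period = 100003;                 /* example fixed period */

        /*
         * Written once into the DS reset field; the hardware reloads it
         * after every PEBS record, so the PMI handler no longer has to
         * rewrite the counter MSR on each overflow.
         */
        uint64_t reset = (uint64_t)(-(int64_t)period) & cntval_mask;

        printf("counter reset value: %#llx\n", (unsigned long long)reset);
        return 0;
}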

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++------
arch/x86/kernel/cpu/perf_event.h | 1 +
arch/x86/kernel/cpu/perf_event_intel_ds.c | 9 +++++++++
3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 46f98f2..3311981 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -982,13 +982,16 @@ int x86_perf_event_set_period(struct perf_event *event)

per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;

- /*
- * The hw event starts counting from this event offset,
- * mark it to be able to extra future deltas:
- */
- local64_set(&hwc->prev_count, (u64)-left);
+ if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+ local64_read(&hwc->prev_count) != (u64)-left) {
+ /*
+ * The hw event starts counting from this event offset,
+ * mark it to be able to extra future deltas:
+ */
+ local64_set(&hwc->prev_count, (u64)-left);

- wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+ }

/*
* Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d93a16b..3035930 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -69,6 +69,7 @@ struct event_constraint {
#define PERF_X86_EVENT_PEBS_ST 0x2 /* st data address sampling */
#define PERF_X86_EVENT_PEBS_ST_HSW 0x4 /* haswell style st data sampling */
#define PERF_X86_EVENT_COMMITTED 0x8 /* event passed commit_txn */
+#define PERF_X86_EVENT_AUTO_RELOAD 0x10 /* use PEBS auto-reload */

struct amd_nb {
int nb_id; /* NorthBridge id */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 696ade3..ecf8b5d6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -716,6 +716,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
+ if (!event->attr.freq)
+ hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;

cpuc->pebs_enabled |= 1ULL << hwc->idx;

@@ -723,6 +725,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;
+
+ /* Use auto-reload if possible to save a MSR write in the PMI */
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ ds->pebs_event_reset[hwc->idx] =
+ (u64)-hwc->sample_period & x86_pmu.cntval_mask;
+ }
}

void intel_pmu_pebs_disable(struct perf_event *event)
@@ -741,6 +749,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);

hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+ hwc->flags &= ~PERF_X86_EVENT_AUTO_RELOAD;
}

void intel_pmu_pebs_enable_all(void)
--
1.9.3

2014-07-31 06:45:22

by Yan, Zheng

Subject: [PATCH v4 8/9] perf, x86: enlarge PEBS buffer

Currently the PEBS buffer size is 4k; it can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as the BTS buffer); 64k of memory can hold about 330 PEBS
records. This will significantly reduce the number of PMIs when a
large PEBS interrupt threshold is used.
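
As a rough illustration of the capacity arithmetic (the 192-byte
record size is an assumption based on Haswell-style records; the
exact figure depends on the CPU's PEBS record size):

#include <stdio.h>

int main(void)
{
        unsigned long page_size = 4096;          /* assumed PAGE_SIZE */
        unsigned long record_size = 192;         /* assumed PEBS record size */

        unsigned long old_buf = page_size;       /* old PEBS_BUFFER_SIZE */
        unsigned long new_buf = page_size << 4;  /* new PEBS_BUFFER_SIZE, 64k */

        printf("4k buffer holds  ~%lu records\n", old_buf / record_size);
        printf("64k buffer holds ~%lu records\n", new_buf / record_size);
        return 0;
}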

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 65457ca..61b6e33 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
#define BTS_RECORD_SIZE 24

#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE PAGE_SIZE
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_FIXUP_SIZE PAGE_SIZE

/*
--
1.9.3

2014-07-31 06:45:43

by Yan, Zheng

Subject: [PATCH v4 9/9] tools, perf: Allow the user to disable time stamps

From: Andi Kleen <[email protected]>

Time stamps are always implicitly enabled for record currently.
The old --time/-T option is a nop.

Allow the user to disable time stamps by using --no-time.

This can cause some minor misaccounting (by missing mmaps), but it
significantly lowers the size of perf.data.

The defaults are unchanged.
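
As a standalone sketch (not the tool's code) of what disabling time
stamps means at the perf_event_attr level; the event and period below
are arbitrary examples:

#include <stdio.h>
#include <string.h>
#include <linux/perf_event.h>

int main(void)
{
        struct perf_event_attr attr;
        int sample_time = 0;    /* what --no-time would request */

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100003;
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;

        if (sample_time)        /* only request TIME when not disabled */
                attr.sample_type |= PERF_SAMPLE_TIME;

        printf("sample_type = %#llx\n", (unsigned long long)attr.sample_type);
        return 0;
}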

Signed-off-by: Andi Kleen <[email protected]>
---
tools/perf/builtin-record.c | 1 +
tools/perf/util/evsel.c | 9 ++++++---
2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4869050..ca0251e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -781,6 +781,7 @@ static const char * const record_usage[] = {
*/
static struct record record = {
.opts = {
+ .sample_time = true,
.mmap_pages = UINT_MAX,
.user_freq = UINT_MAX,
.user_interval = ULLONG_MAX,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 21a373e..92e5235 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -633,9 +633,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
if (opts->period)
perf_evsel__set_sample_bit(evsel, PERIOD);

- if (!perf_missing_features.sample_id_all &&
- (opts->sample_time || !opts->no_inherit ||
- target__has_cpu(&opts->target) || per_cpu))
+ /*
+ * When the user explicitly disabled time, don't force it here.
+ */
+ if (opts->sample_time &&
+ (!perf_missing_features.sample_id_all &&
+ (!opts->no_inherit || target__has_cpu(&opts->target) || per_cpu)))
perf_evsel__set_sample_bit(evsel, TIME);

if (opts->raw_samples && !evsel->no_aux_samples) {
--
1.9.3

2014-07-31 06:45:19

by Yan, Zheng

Subject: [PATCH v4 5/9] perf, x86: large PEBS interrupt threshold

PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.

For frequently occurring events (like cycles, branches or loads/stores),
taking a PMI for every sample requires using a relatively high sampling
period to avoid overloading the system. This in turn increases the
sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it cannot
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support an adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can, however, supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases: basically, as long as
you can do without a callgraph and can set a fixed period, you can use
this new PEBS mode.

The main benefit is the ability to support much lower sampling periods
(down to -c 1000) without excessive overhead.

One use case is, for example, to increase the resolution of the c2c tool.
Another is double-checking when you suspect that standard sampling has
too much sampling error.

Some numbers on the overhead, using cycle soak, comparing
"perf record --no-time -e cycles:p -c" (multi-pebs) to
"perf record -e cycles:p -c" (plain pebs):

period     plain  multi  delta
10003      15     5      10
20003      15.7   4      11.7
40003      8.7    2.5    6.2
80003      4.1    1.4    2.7
100003     3.6    1.2    2.4
800003     4.4    1.4    3
1000003    0.6    0.4    0.2
2000003    0.4    0.3    0.1
4000003    0.3    0.2    0.1
10000003   0.3    0.2    0.1

The interesting part is the delta between multi-pebs and normal pebs. Above
-c 1000003 it does not really matter because the basic overhead is so low.
With periods below 80003 it becomes interesting.

Note that in some other workloads (e.g. kernbench) the smaller sampling periods
cause much more overhead without multi-pebs; up to 80% overhead (and throttling)
has been observed with -c 10003. Multi-pebs generally does not throttle.
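
A simplified userspace sketch of the threshold selection this patch
adds; the flag values, record size, buffer layout and counter count
below are illustrative assumptions, not the real constants:

#include <stdio.h>
#include <stdint.h>

/* simplified stand-ins for the sample_type flags, not the real values */
#define SAMPLE_IP               0x1ULL
#define SAMPLE_TIME             0x2ULL  /* needs a PMI, not free-running capable */
#define FREERUNNING_FLAGS       (SAMPLE_IP)

int main(void)
{
        uint64_t buffer_base = 0x1000;          /* assumed DS buffer placement */
        uint64_t record_size = 192;             /* assumed PEBS record size */
        uint64_t buffer_size = 64 * 1024;
        uint64_t absolute_max = buffer_base +
                (buffer_size / record_size) * record_size;
        int max_pebs_events = 4;                /* assumed number of counters */

        uint64_t sample_type = SAMPLE_IP;       /* what the event requested */
        int auto_reload = 1;                    /* fixed period in use */
        uint64_t threshold;

        if (auto_reload && !(sample_type & ~FREERUNNING_FLAGS))
                /* large threshold: leave one record slot per counter spare */
                threshold = absolute_max - max_pebs_events * record_size;
        else
                /* old behaviour: PMI after every single record */
                threshold = buffer_base + record_size;

        printf("interrupt after %llu records\n",
               (unsigned long long)((threshold - buffer_base) / record_size));
        return 0;
}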

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 40 +++++++++++++++++++++++++++----
1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 06b4884..7df9092 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -251,7 +251,7 @@ static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
int node = cpu_to_node(cpu);
- int max, thresh = 1; /* always use a single PEBS record */
+ int max;
void *buffer, *ibuffer;

if (!x86_pmu.pebs)
@@ -281,9 +281,6 @@ static int alloc_pebs_buffer(int cpu)
ds->pebs_absolute_maximum = ds->pebs_buffer_base +
max * x86_pmu.pebs_record_size;

- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- thresh * x86_pmu.pebs_record_size;
-
return 0;
}

@@ -710,15 +707,35 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}

+/*
+ * Flags PEBS can handle without a PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS \
+ (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+ PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+ PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+ PERF_SAMPLE_TRANSACTION)
+
+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+ return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;
+ bool first_pebs;
+ u64 threshold;

hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
if (!event->attr.freq)
hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;

+ first_pebs = !pebs_is_enabled(cpuc);
cpuc->pebs_enabled |= 1ULL << hwc->idx;

if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -726,6 +743,21 @@ void intel_pmu_pebs_enable(struct perf_event *event)
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
cpuc->pebs_enabled |= 1ULL << 63;

+ /*
+ * When the event is constrained enough we can use a larger
+ * threshold and run the event with less frequent PMI.
+ */
+ if (0 && /* disable this temporarily */
+ (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
+ !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
+ threshold = ds->pebs_absolute_maximum -
+ x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ } else {
+ threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ }
+ if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+ ds->pebs_interrupt_threshold = threshold;
+
/* Use auto-reload if possible to save a MSR write in the PMI */
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
ds->pebs_event_reset[hwc->idx] =
--
1.9.3

2014-07-31 06:46:05

by Yan, Zheng

Subject: [PATCH v4 7/9] perf, x86: drain PEBS buffer during context switch

Flush the PEBS buffer during a context switch if the PEBS interrupt
threshold is larger than one. This allows perf to supply the TID for
sample outputs.
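
A conceptual userspace sketch of why the buffer is drained at
sched-out; the structure and function names are invented for
illustration and are not kernel APIs:

#include <stdio.h>

struct pebs_buffer {
        int nr_records;         /* records accumulated since the last drain */
        int owner_tid;          /* task the pending records belong to */
};

static void drain(struct pebs_buffer *buf)
{
        /* emit everything while owner_tid still names the task that ran */
        printf("flushing %d records for tid %d\n",
               buf->nr_records, buf->owner_tid);
        buf->nr_records = 0;
}

static void sched_out(struct pebs_buffer *buf, int next_tid)
{
        if (buf->nr_records)    /* threshold > 1: records may be pending */
                drain(buf);
        buf->owner_tid = next_tid;
}

int main(void)
{
        struct pebs_buffer buf = { .nr_records = 17, .owner_tid = 1234 };

        sched_out(&buf, 5678);  /* context switch away from tid 1234 */
        return 0;
}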

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 3 +++
arch/x86/kernel/cpu/perf_event_intel.c | 11 +++++++++-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 32 ++++++++++++++++++++++++++++--
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 --
4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3035930..c0ece8d 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -148,6 +148,7 @@ struct cpu_hw_events {
*/
struct debug_store *ds;
u64 pebs_enabled;
+ bool pebs_sched_cb_enabled;

/*
* Intel LBR bits
@@ -685,6 +686,8 @@ void intel_pmu_pebs_enable_all(void);

void intel_pmu_pebs_disable_all(void);

+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_ds_init(void);

void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index d1f36ca..7a7013d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2044,6 +2044,15 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ if (x86_pmu.pebs_active)
+ intel_pmu_pebs_sched_task(ctx, sched_in);
+ if (x86_pmu.lbr_nr)
+ intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2095,7 +2104,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .sched_task = intel_pmu_lbr_sched_task,
+ .sched_task = intel_pmu_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ec7b725..65457ca 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -707,6 +707,18 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
return &emptyconstraint;
}

+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+ struct pt_regs regs;
+ x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (!sched_in)
+ intel_pmu_drain_pebs_buffer();
+}
+
/*
* Flags PEBS can handle without a PMI.
*
@@ -747,13 +759,20 @@ void intel_pmu_pebs_enable(struct perf_event *event)
* When the event is constrained enough we can use a larger
* threshold and run the event with less frequent PMI.
*/
- if (0 && /* disable this temporarily */
- (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
+ if ((hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
threshold = ds->pebs_absolute_maximum -
x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+ if (first_pebs) {
+ perf_sched_cb_user_inc(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = true;
+ }
} else {
threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+ if (cpuc->pebs_sched_cb_enabled) {
+ perf_sched_cb_user_dec(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = false;
+ }
}
if (first_pebs || ds->pebs_interrupt_threshold > threshold)
ds->pebs_interrupt_threshold = threshold;
@@ -769,8 +788,17 @@ void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
+ struct debug_store *ds = cpuc->ds;
+
+ if (ds->pebs_interrupt_threshold >
+ ds->pebs_buffer_base + x86_pmu.pebs_record_size)
+ intel_pmu_drain_pebs_buffer();

cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+ if (cpuc->pebs_sched_cb_enabled && !pebs_is_enabled(cpuc)) {
+ perf_sched_cb_user_dec(event->ctx->pmu);
+ cpuc->pebs_sched_cb_enabled = false;
+ }

if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index a30bfab..e4f3a09 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -185,8 +185,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

- if (!x86_pmu.lbr_nr)
- return;
/*
* When sampling the branch stack in system-wide, it may be
* necessary to flush the stack on context switch. This happens
--
1.9.3

2014-07-31 06:46:30

by Yan, Zheng

Subject: [PATCH v4 6/9] perf, x86: handle multiple records in PEBS buffer

When the PEBS interrupt threshold is larger than one, the PEBS buffer
may include multiple records for each PEBS event. This patch makes
the code first count how many records each PEBS event has, then
output the samples in batch.

One corner case worth mentioning is that the PEBS hardware doesn't
deal well with collisions, when PEBS events happen close to each
other. The records for the events can be collapsed into a single
one, and it's not possible to reconstruct all events that caused
the PEBS record. However, in practice collisions are extremely rare,
as long as different events are used. The periods are typically very
large, so any collision is unlikely. When a collision happens, we drop
the PEBS record.
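
A small userspace sketch of the record scan described above; the
record layout and status values are made up for illustration:

#include <stdio.h>
#include <stdint.h>

#define MAX_PEBS_EVENTS 4

int main(void)
{
        /* status bitmasks of a few made-up PEBS records */
        uint64_t records[] = { 0x1, 0x1, 0x2, 0x3 /* collision */, 0x1 };
        int counts[MAX_PEBS_EVENTS] = { 0 };
        unsigned int i;
        int bit;

        for (i = 0; i < sizeof(records) / sizeof(records[0]); i++) {
                uint64_t status = records[i];

                /* not exactly one bit set: cannot attribute the record, drop it */
                if (status == 0 || (status & (status - 1)))
                        continue;

                for (bit = 0; bit < MAX_PEBS_EVENTS; bit++)
                        if (status == (1ULL << bit))
                                counts[bit]++;
        }

        /* a second pass can then output counts[bit] samples per counter in batch */
        for (bit = 0; bit < MAX_PEBS_EVENTS; bit++)
                printf("counter %d: %d records\n", bit, counts[bit]);
        return 0;
}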

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 114 ++++++++++++++++++++----------
1 file changed, 76 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 7df9092..ec7b725 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -991,18 +991,51 @@ static void setup_pebs_sample_data(struct perf_event *event,
}

static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+ struct pt_regs *iregs,
+ void *at, void *top, int count)
{
+ struct perf_output_handle handle;
+ struct perf_event_header header;
struct perf_sample_data data;
struct pt_regs regs;

- if (!intel_pmu_save_and_restart(event))
+ if (!intel_pmu_save_and_restart(event) &&
+ !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
return;

- setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);

- if (perf_event_overflow(event, &data, &regs))
+ if (perf_event_overflow(event, &data, &regs)) {
x86_pmu_stop(event, 0);
+ return;
+ }
+
+ if (count <= 1)
+ return;
+
+ at += x86_pmu.pebs_record_size;
+ count--;
+
+ perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_prepare_sample(&header, &data, event, &regs);
+
+ if (perf_output_begin(&handle, event, header.size * count))
+ return;
+
+ for (; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+ if (p->status != (1 << event->hw.idx))
+ continue;
+
+ setup_pebs_sample_data(event, iregs, at, &data, &regs);
+ perf_output_sample(&handle, &header, &data, event);
+
+ count--;
+ if (count == 0)
+ break;
+ }
+
+ perf_output_end(&handle);
}

static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
@@ -1043,61 +1076,66 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
at += n - 1;

- __intel_pmu_pebs_event(event, iregs, at);
+ __intel_pmu_pebs_event(event, iregs, at, top, 1);
}

static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct debug_store *ds = cpuc->ds;
- struct perf_event *event = NULL;
- void *at, *top;
- u64 status = 0;
+ struct perf_event *event;
+ void *base, *at, *top;
int bit;
+ int counts[MAX_PEBS_EVENTS] = {};

if (!x86_pmu.pebs_active)
return;

- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;

ds->pebs_index = ds->pebs_buffer_base;

- if (unlikely(at > top))
+ if (unlikely(base >= top))
return;

- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
- "Unexpected number of pebs records %ld\n",
- (long)(top - at) / x86_pmu.pebs_record_size);
-
- for (; at < top; at += x86_pmu.pebs_record_size) {
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

- for_each_set_bit(bit, (unsigned long *)&p->status,
- x86_pmu.max_pebs_events) {
- event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
- WARN_ON_ONCE(!event);
-
- if (!event->attr.precise_ip)
- continue;
-
- if (__test_and_set_bit(bit, (unsigned long *)&status))
- continue;
-
- break;
- }
-
- if (!event || bit >= x86_pmu.max_pebs_events)
+ bit = find_first_bit((unsigned long *)&p->status,
+ x86_pmu.max_pebs_events);
+ if (bit >= x86_pmu.max_pebs_events)
+ continue;
+ /*
+ * The PEBS hardware does not deal well with collisions,
+ * when PEBS events happen close to each other. The
+ * records for the events can be collapsed into a single
+ * one, and it's not possible to reconstruct all events
+ * that caused the PEBS record. However, in practice the
+ * collisions are extremely rare. If a collision happens,
+ * we drop the record; it's the safest choice.
+ */
+ if (p->status != (1 << bit))
continue;
+ if (!test_bit(bit, cpuc->active_mask))
+ continue;
+ event = cpuc->events[bit];
+ WARN_ON_ONCE(!event);
+ if (!event->attr.precise_ip)
+ continue;
+ counts[bit]++;
+ }

- __intel_pmu_pebs_event(event, iregs, at);
+ for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+ if (counts[bit] == 0)
+ continue;
+ event = cpuc->events[bit];
+ for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+ struct pebs_record_nhm *p = at;
+ if (p->status == (1 << bit))
+ break;
+ }
+ __intel_pmu_pebs_event(event, iregs, at, top, counts[bit]);
}
}

--
1.9.3

2014-07-31 06:46:46

by Yan, Zheng

Subject: [PATCH v4 4/9] perf, x86: introduce setup_pebs_sample_data()

Move the code that sets up the PEBS sample data into a separate function.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 63 ++++++++++++++++++-------------
1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ecf8b5d6..06b4884 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -870,8 +870,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
return txn;
}

-static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
/*
* We cast to the biggest pebs_record but are careful not to
@@ -879,21 +881,16 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
*/
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct pebs_record_hsw *pebs = __pebs;
- struct perf_sample_data data;
- struct pt_regs regs;
u64 sample_type;
int fll, fst;

- if (!intel_pmu_save_and_restart(event))
- return;
-
fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
fst = event->hw.flags & (PERF_X86_EVENT_PEBS_ST |
PERF_X86_EVENT_PEBS_ST_HSW);

- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(data, 0, event->hw.last_period);

- data.period = event->hw.last_period;
+ data->period = event->hw.last_period;
sample_type = event->attr.sample_type;

/*
@@ -904,19 +901,19 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* Use latency for weight (only avail with PEBS-LL)
*/
if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
- data.weight = pebs->lat;
+ data->weight = pebs->lat;

/*
* data.data_src encodes the data source
*/
if (sample_type & PERF_SAMPLE_DATA_SRC) {
if (fll)
- data.data_src.val = load_latency_data(pebs->dse);
+ data->data_src.val = load_latency_data(pebs->dse);
else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST_HSW)
- data.data_src.val =
+ data->data_src.val =
precise_store_data_hsw(event, pebs->dse);
else
- data.data_src.val = precise_store_data(pebs->dse);
+ data->data_src.val = precise_store_data(pebs->dse);
}
}

@@ -930,35 +927,47 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
* PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
* A possible PERF_SAMPLE_REGS will have to transfer all regs.
*/
- regs = *iregs;
- regs.flags = pebs->flags;
- set_linear_ip(&regs, pebs->ip);
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
+ *regs = *iregs;
+ regs->flags = pebs->flags;
+ set_linear_ip(regs, pebs->ip);
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;

if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs->real_ip;
- regs.flags |= PERF_EFLAGS_EXACT;
- } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
- regs.flags |= PERF_EFLAGS_EXACT;
+ regs->ip = pebs->real_ip;
+ regs->flags |= PERF_EFLAGS_EXACT;
+ } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+ regs->flags |= PERF_EFLAGS_EXACT;
else
- regs.flags &= ~PERF_EFLAGS_EXACT;
+ regs->flags &= ~PERF_EFLAGS_EXACT;

if ((event->attr.sample_type & PERF_SAMPLE_ADDR) &&
x86_pmu.intel_cap.pebs_format >= 1)
- data.addr = pebs->dla;
+ data->addr = pebs->dla;

if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((event->attr.sample_type & PERF_SAMPLE_WEIGHT) && !fll)
- data.weight = intel_hsw_weight(pebs);
+ data->weight = intel_hsw_weight(pebs);

if (event->attr.sample_type & PERF_SAMPLE_TRANSACTION)
- data.txn = intel_hsw_transaction(pebs);
+ data->txn = intel_hsw_transaction(pebs);
}

if (has_branch_stack(event))
- data.br_stack = &cpuc->lbr_stack;
+ data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs)
+{
+ struct perf_sample_data data;
+ struct pt_regs regs;
+
+ if (!intel_pmu_save_and_restart(event))
+ return;
+
+ setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);

if (perf_event_overflow(event, &data, &regs))
x86_pmu_stop(event, 0);
--
1.9.3

2014-07-31 06:45:11

by Yan, Zheng

Subject: [PATCH v4 1/9] perf, core: introduce pmu context switch callback

The callback is invoked when a process is scheduled in or out.
It provides a mechanism for later patches to save/restore the LBR
stack. For the schedule-in case, the callback is invoked at
the same place that the flush branch stack callback is invoked,
so it can also replace the flush branch stack callback. To
avoid unnecessary overhead, the callback is enabled only when
there are events that use the LBR stack.
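
A minimal userspace sketch of the enable/disable scheme; a plain
global stands in for the per-CPU user count and all names are
illustrative:

#include <stdio.h>

static int sched_cb_users;      /* stand-in for the per-CPU counter */

static void sched_cb_user_inc(void) { sched_cb_users++; }
static void sched_cb_user_dec(void) { sched_cb_users--; }

static void pmu_sched_task(int sched_in)
{
        printf("PMU callback: sched %s\n", sched_in ? "in" : "out");
}

static void context_switch(int sched_in)
{
        /* cheap check on every switch; the callback runs only when needed */
        if (sched_cb_users)
                pmu_sched_task(sched_in);
}

int main(void)
{
        context_switch(1);      /* no users yet: callback skipped */

        sched_cb_user_inc();    /* e.g. the first LBR event is created */
        context_switch(0);      /* callback fires at sched-out */
        context_switch(1);      /* and at sched-in */
        sched_cb_user_dec();

        context_switch(0);      /* no users again: skipped */
        return 0;
}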

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 +++++
arch/x86/kernel/cpu/perf_event.h | 2 ++
include/linux/perf_event.h | 9 ++++++
kernel/events/core.c | 63 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 81 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2879ecd..3070302 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1877,6 +1877,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
NULL,
};

+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+ if (x86_pmu.sched_task)
+ x86_pmu.sched_task(ctx, sched_in);
+}
+
static void x86_pmu_flush_branch_stack(void)
{
if (x86_pmu.flush_branch_stack)
@@ -1910,6 +1916,7 @@ static struct pmu pmu = {

.event_idx = x86_pmu_event_idx,
.flush_branch_stack = x86_pmu_flush_branch_stack,
+ .sched_task = x86_pmu_sched_task,
};

void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 8ade931..d6f1165 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -431,6 +431,8 @@ struct x86_pmu {

void (*check_microcode)(void);
void (*flush_branch_stack)(void);
+ void (*sched_task)(struct perf_event_context *ctx,
+ bool sched_in);

/*
* Intel Arch Perfmon v2+
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 707617a..fe92e6b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -262,6 +262,13 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * context-switches callback for CPU PMU. Other PMUs shouldn't set
+ * this callback
+ */
+ void (*sched_task) (struct perf_event_context *ctx,
+ bool sched_in);
};

/**
@@ -557,6 +564,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
extern void perf_event_print_debug(void);
extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_user_inc(struct pmu *pmu);
+extern void perf_sched_cb_user_dec(struct pmu *pmu);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cf24b3..ac9a226 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -144,6 +144,7 @@ enum event_type_t {
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_users);

static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
@@ -2362,6 +2363,58 @@ unlock:
}
}

+void perf_sched_cb_user_inc(struct pmu *pmu)
+{
+ this_cpu_inc(perf_sched_cb_users);
+}
+
+void perf_sched_cb_user_dec(struct pmu *pmu)
+{
+ this_cpu_dec(perf_sched_cb_users);
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+ struct task_struct *next,
+ bool sched_in)
+{
+ struct perf_cpu_context *cpuctx;
+ struct pmu *pmu;
+ unsigned long flags;
+
+ if (prev == next)
+ return;
+
+ local_irq_save(flags);
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(pmu, &pmus, entry) {
+ if (pmu->sched_task) {
+ cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ perf_pmu_disable(pmu);
+
+ pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+ perf_pmu_enable(pmu);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ /* only CPU PMU has context switch callback */
+ break;
+ }
+ }
+
+ rcu_read_unlock();
+
+ local_irq_restore(flags);
+}
+
#define for_each_task_context_nr(ctxn) \
for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)

@@ -2381,6 +2434,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
{
int ctxn;

+ if (__get_cpu_var(perf_sched_cb_users))
+ perf_pmu_sched_task(task, next, false);
+
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);

@@ -2638,6 +2694,13 @@ void __perf_event_task_sched_in(struct task_struct *prev,
/* check for system-wide branch_stack events */
if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
perf_branch_stack_sched_in(prev, task);
+
+ /* check for system-wide branch_stack events */
+ if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
+ perf_branch_stack_sched_in(prev, task);
+
+ if (__get_cpu_var(perf_sched_cb_users))
+ perf_pmu_sched_task(prev, task, true);
}

static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
--
1.9.3

2014-07-31 06:47:22

by Yan, Zheng

Subject: [PATCH v4 2/9] perf, x86: use context switch callback to flush LBR stack

The previous commit introduced a context switch callback whose
function overlaps with the flush branch stack callback. So we can
use the context switch callback to flush the LBR stack.

This patch adds code that uses the context switch callback to
flush the LBR stack when a task is scheduled in. The callback
is enabled only when there are events that use the LBR hardware.
This patch also removes all the old flush branch stack code.

Signed-off-by: Yan, Zheng <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 7 ---
arch/x86/kernel/cpu/perf_event.h | 3 +-
arch/x86/kernel/cpu/perf_event_intel.c | 14 +-----
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 38 ++++++++++++--
include/linux/perf_event.h | 6 ---
kernel/events/core.c | 81 ------------------------------
6 files changed, 36 insertions(+), 113 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 3070302..46f98f2 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1883,12 +1883,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
x86_pmu.sched_task(ctx, sched_in);
}

-static void x86_pmu_flush_branch_stack(void)
-{
- if (x86_pmu.flush_branch_stack)
- x86_pmu.flush_branch_stack();
-}
-
void perf_check_microcode(void)
{
if (x86_pmu.check_microcode)
@@ -1915,7 +1909,6 @@ static struct pmu pmu = {
.commit_txn = x86_pmu_commit_txn,

.event_idx = x86_pmu_event_idx,
- .flush_branch_stack = x86_pmu_flush_branch_stack,
.sched_task = x86_pmu_sched_task,
};

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d6f1165..d93a16b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -430,7 +430,6 @@ struct x86_pmu {
void (*cpu_dead)(int cpu);

void (*check_microcode)(void);
- void (*flush_branch_stack)(void);
void (*sched_task)(struct perf_event_context *ctx,
bool sched_in);

@@ -687,6 +686,8 @@ void intel_pmu_pebs_disable_all(void);

void intel_ds_init(void);

+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
void intel_pmu_lbr_reset(void);

void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 2502d0d..d1f36ca 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2044,18 +2044,6 @@ static void intel_pmu_cpu_dying(int cpu)
fini_debug_store_on_cpu(cpu);
}

-static void intel_pmu_flush_branch_stack(void)
-{
- /*
- * Intel LBR does not tag entries with the
- * PID of the current task, then we need to
- * flush it on ctxsw
- * For now, we simply reset it
- */
- if (x86_pmu.lbr_nr)
- intel_pmu_lbr_reset();
-}
-
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");

PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2107,7 +2095,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
.guest_get_msrs = intel_guest_get_msrs,
- .flush_branch_stack = intel_pmu_flush_branch_stack,
+ .sched_task = intel_pmu_lbr_sched_task,
};

static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 9dd2459..a30bfab 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -181,13 +181,36 @@ void intel_pmu_lbr_reset(void)
intel_pmu_lbr_reset_64();
}

-void intel_pmu_lbr_enable(struct perf_event *event)
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

if (!x86_pmu.lbr_nr)
return;
+ /*
+ * When sampling the branch stack in system-wide, it may be
+ * necessary to flush the stack on context switch. This happens
+ * when the branch stack does not tag its entries with the pid
+ * of the current task. Otherwise it becomes impossible to
+ * associate a branch entry with a task. This ambiguity is more
+ * likely to appear when the branch stack supports priv level
+ * filtering and the user sets it to monitor only at the user
+ * level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch
+ * stack with branch from multiple tasks.
+ */
+ if (sched_in) {
+ intel_pmu_lbr_reset();
+ cpuc->lbr_context = ctx;
+ }
+}
+
+void intel_pmu_lbr_enable(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

+ if (!x86_pmu.lbr_nr)
+ return;
/*
* Reset the LBR stack if we changed task context to
* avoid data leaks.
@@ -199,6 +222,8 @@ void intel_pmu_lbr_enable(struct perf_event *event)
cpuc->br_sel = event->hw.branch_reg.reg;

cpuc->lbr_users++;
+ if (cpuc->lbr_users == 1)
+ perf_sched_cb_user_inc(event->ctx->pmu);
}

void intel_pmu_lbr_disable(struct perf_event *event)
@@ -211,10 +236,13 @@ void intel_pmu_lbr_disable(struct perf_event *event)
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);

- if (cpuc->enabled && !cpuc->lbr_users) {
- __intel_pmu_lbr_disable();
- /* avoid stale pointer */
- cpuc->lbr_context = NULL;
+ if (!cpuc->lbr_users) {
+ perf_sched_cb_user_dec(event->ctx->pmu);
+ if (cpuc->enabled) {
+ __intel_pmu_lbr_disable();
+ /* avoid stale pointer */
+ cpuc->lbr_context = NULL;
+ }
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fe92e6b..82f3dd5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -259,11 +259,6 @@ struct pmu {
int (*event_idx) (struct perf_event *event); /*optional */

/*
- * flush branch stack on context-switches (needed in cpu-wide mode)
- */
- void (*flush_branch_stack) (void);
-
- /*
* context-switches callback for CPU PMU. Other PMUs shouldn't set
* this callback
*/
@@ -512,7 +507,6 @@ struct perf_event_context {
u64 generation;
int pin_count;
int nr_cgroups; /* cgroup evts */
- int nr_branch_stack; /* branch_stack evt */
struct rcu_head rcu_head;
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ac9a226..86c6336 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -143,7 +143,6 @@ enum event_type_t {
*/
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
static DEFINE_PER_CPU(int, perf_sched_cb_users);

static atomic_t nr_mmap_events __read_mostly;
@@ -1137,9 +1136,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
if (is_cgroup_event(event))
ctx->nr_cgroups++;

- if (has_branch_stack(event))
- ctx->nr_branch_stack++;
-
list_add_rcu(&event->event_entry, &ctx->event_list);
if (!ctx->nr_events)
perf_pmu_rotate_start(ctx->pmu);
@@ -1302,9 +1298,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
cpuctx->cgrp = NULL;
}

- if (has_branch_stack(event))
- ctx->nr_branch_stack--;
-
ctx->nr_events--;
if (event->attr.inherit_stat)
ctx->nr_stat--;
@@ -2602,64 +2595,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
}

/*
- * When sampling the branch stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
- struct perf_cpu_context *cpuctx;
- struct pmu *pmu;
- unsigned long flags;
-
- /* no need to flush branch stack if not changing task */
- if (prev == task)
- return;
-
- local_irq_save(flags);
-
- rcu_read_lock();
-
- list_for_each_entry_rcu(pmu, &pmus, entry) {
- cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
- /*
- * check if the context has at least one
- * event using PERF_SAMPLE_BRANCH_STACK
- */
- if (cpuctx->ctx.nr_branch_stack > 0
- && pmu->flush_branch_stack) {
-
- perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
- perf_pmu_disable(pmu);
-
- pmu->flush_branch_stack();
-
- perf_pmu_enable(pmu);
-
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- }
- }
-
- rcu_read_unlock();
-
- local_irq_restore(flags);
-}
-
-/*
* Called from scheduler to add the events of the current task
* with interrupts disabled.
*
@@ -2691,14 +2626,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
perf_cgroup_sched_in(prev, task);

- /* check for system-wide branch_stack events */
- if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
- perf_branch_stack_sched_in(prev, task);
-
- /* check for system-wide branch_stack events */
- if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
- perf_branch_stack_sched_in(prev, task);
-
if (__get_cpu_var(perf_sched_cb_users))
perf_pmu_sched_task(prev, task, true);
}
@@ -3284,10 +3211,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
}
@@ -6779,10 +6702,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
if (event->parent)
return;

- if (has_branch_stack(event)) {
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
- }
if (is_cgroup_event(event))
atomic_inc(&per_cpu(perf_cgroup_events, cpu));
}
--
1.9.3

2014-07-31 07:16:31

by Ingo Molnar

Subject: Re: [PATCH v4 0/9] perf, x86: large PEBS interrupt threshold


* Yan, Zheng <[email protected]> wrote:

> This patch series implements large PEBS interrupt threshold. For some
> limited cases, it can significantly reduce the sample overhead. Please
> read patch 6's commit message for more information.
>
> changes since v1:
> - drop patch 'perf, core: Add all PMUs to pmu_idr'
> - add comments for case that multiple counters overflow simultaneously
> changes since v2:
> - rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
> - use flag to indicate auto reload mechanism
> - move codes that setup PEBS sample data to separate function
> - output the PEBS records in batch
> - enable this for All (PEBS capable) hardware
> - more description for the multiplex
> changes since v3:
> - ignore conflicting PEBS record

Please include performance measurements, 'significantly reduces sample
overhead' is a totally inadequate description to judge this patch set.

Please also include total diffstat in the 0/N mail.

Thanks,

Ingo

2014-07-31 07:18:34

by Yan, Zheng

Subject: Re: [PATCH v4 0/9] perf, x86: large PEBS interrupt threshold

On 07/31/2014 03:16 PM, Ingo Molnar wrote:
>
> * Yan, Zheng <[email protected]> wrote:
>
>> This patch series implements large PEBS interrupt threshold. For some
>> limited cases, it can significantly reduce the sample overhead. Please
>> read patch 6's commit message for more information.
>>
>> changes since v1:
>> - drop patch 'perf, core: Add all PMUs to pmu_idr'
>> - add comments for case that multiple counters overflow simultaneously
>> changes since v2:
>> - rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
>> - use flag to indicate auto reload mechanism
>> - move codes that setup PEBS sample data to separate function
>> - output the PEBS records in batch
>> - enable this for All (PEBS capable) hardware
>> - more description for the multiplex
>> changes since v3:
>> - ignore conflicting PEBS record
>
> Please include performance measurements, 'significantly reduces sample
> overhead' is a totally inadequate description to judge this patch set.

patch 6 contains performance data.

>
> Please also include total diffstat in the 0/N mail.
>

will do.

Regards
Yan, Zheng

> Thanks,
>
> Ingo
>

2014-07-31 07:25:17

by Ingo Molnar

Subject: Re: [PATCH v4 0/9] perf, x86: large PEBS interrupt threshold


* Yan, Zheng <[email protected]> wrote:

> On 07/31/2014 03:16 PM, Ingo Molnar wrote:
> >
> > * Yan, Zheng <[email protected]> wrote:
> >
> >> This patch series implements large PEBS interrupt threshold. For some
> >> limited cases, it can significantly reduce the sample overhead. Please
> >> read patch 6's commit message for more information.
> >>
> >> changes since v1:
> >> - drop patch 'perf, core: Add all PMUs to pmu_idr'
> >> - add comments for case that multiple counters overflow simultaneously
> >> changes since v2:
> >> - rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
> >> - use flag to indicate auto reload mechanism
> >> - move codes that setup PEBS sample data to separate function
> >> - output the PEBS records in batch
> >> - enable this for All (PEBS capable) hardware
> >> - more description for the multiplex
> >> changes since v3:
> >> - ignore conflicting PEBS record
> >
> > Please include performance measurements, 'significantly reduces sample
> > overhead' is a totally inadequate description to judge this patch set.
>
> patch 6 contains performance data.

I actually checked patch 6, because you referred to it in the
description, but the performance data is actually in patch 5.

Which nicely demonstrates why such figures belong in the 0/N mail as well.

Please also improve the description of the performance figures: explain
what plain/multi/delta mean, and convert them into a nicely digestible
before/after performance comparison with human-readable percentage
figures.

"It got faster by 50% when XYZ" is so much nicer to read for everyone
involved.

Thanks,

Ingo

2014-07-31 07:27:59

by Peter Zijlstra

Subject: Re: [PATCH v4 0/9] perf, x86: large PEBS interrupt threshold



OK, so no feedback on the 'pending' discussions we had wrt PEBS record
generation?

No feedback on the correctness aspects of the overflow crap?

Just a new series, which I then have to dig through to figure out wtf
changed?

A quick look at patch 6 reads like you still don't understand the issue
right. There are no 'collisions' as such in PEBS record generation, or
are there? See the earlier open discussion.

_WHY_ are you sending me new patches without sorting the open points
first?



2014-07-31 07:36:30

by Yan, Zheng

Subject: Re: [PATCH v4 0/9] perf, x86: large PEBS interrupt threshold

On 07/31/2014 03:27 PM, Peter Zijlstra wrote:
>
>
> OK, so no feedback on the 'pending' discussions we had wrt PEBS record
> generation?
>
> No feedback on the correctness aspects of the overflow crap?
>
> Just a new series, which I then have to dig through to figure out wtf
> changed?
>
> A quick look at patch 6 reads like you still don't understand the issue
> right. There are no 'collisions' as such in PEBS record generation, or
> are there? See the earlier open discussion.

I'm sure a collision only creates one PEBS record. Actually, the patch description
was written by Andi.



>
> _WHY_ are you sending me new patches without sorting the open points
> first?
>

2014-07-31 07:49:52

by Peter Zijlstra

Subject: Re: [PATCH v4 0/9] perf, x86: large PEBS interrupt threshold

On Thu, Jul 31, 2014 at 03:36:26PM +0800, Yan, Zheng wrote:
> On 07/31/2014 03:27 PM, Peter Zijlstra wrote:
> >
> >
> > OK, so no feedback on the 'pending' discussions we had wrt PEBS record
> > generation?
> >
> > No feedback on the correctness aspects of the overflow crap?
> >
> > Just a new series, which I then have to dig through to figure out wtf
> > changed?
> >
> > A quick look at patch 6 reads like you still don't understand the issue
> > right. There are no 'collisions' as such in PEBS record generation, or
> > are there? See the earlier open discussion.
>
> I'm sure a collision only creates one PEBS record. Actually, the patch description
> was written by Andi.

So now Andi is contradicting himself:

http://marc.info/?l=linux-kernel&m=140129731708247&w=2

Bloody wonderful..

OK, so someone write down exactly how PEBS record generation works,
including the iffy overflow register, bring it to the relevant hardware
engineer, have him sign off on it, then come back.

And preferably add it to the next SDM rev. How is anybody to use this
stuff without knowing how it works?!

Let me try and figure out if actual collisions mean we cannot
reconstruct the actual events or not, you do the same.

Please file an internal bug report against that hardware (if there isn't
one already -- I've only been asking for this to get fixed for years
now).




2014-07-31 14:46:08

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v4 9/9] tools, perf: Allow the user to disable time stamps

Em Thu, Jul 31, 2014 at 02:45:04PM +0800, Yan, Zheng escreveu:
> From: Andi Kleen <[email protected]>
>
> Time stamps are always implicitly enabled for record currently.
> The old --time/-T option is a nop.

Thanks, applied.

- Arnaldo

2014-07-31 14:50:46

by Andi Kleen

Subject: Re: [PATCH v4 0/9] perf, x86: large PEBS interrupt threshold

Peter Zijlstra <[email protected]> writes:
>
> No feedback on the correctness aspects of the overflow crap?

I assume you mean collisions? See below.

> A quick look at patch 6 reads like you still don't understand the issue
> right. There are no 'collisions' as such in PEBS record generation, or
> are there? See the earlier open discussion.

There can be collisions, but they are in practice extremely rare.

The only way you can get a lot of collisions is if you count
the same thing multiple times, which is not a useful configuration.
A single monitoring tool only needs a single event once.
While multiple processes may be configuring multiple events
at the same time, they will not start them at the same
time, which also prevents collisions.

-Andi

--
[email protected] -- Speaking for myself only