2014-11-14 13:44:11

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT

Hi Peter and all,

Since there weren't too many review comments on the two previous
versions, here's another one with a couple more fixes.

[full description below the changelog]

Changes since v7:
* fixed a bug in aux allocation logic that caused more than
requested pages to be allocated,
* fixed pt_event_add() for snapshot mode, which would return an
error in case of successful completion.

Changes since v6:
* replaced kref with atomic in aux refcounting this time for real,
* added a small comment on capabilities/"caps" attribute group in PT
pmu.

Changes since v5:

* dropped "bypass mask", bits that a privileged user could set in
RTIT_CTL,
* merged configuring and enabling RTIT_CTL writes,
* fixed kernel-doc comments
* if LBR/BTS events are present in the system, PT event creation is
disallowed and vice versa,
* fixed a bug with high order allocation under pressure,
* fixed undoing vm_locked accounting in case of AUX allocation
failure,
* left out kernel counters for a separate series to keep this one
concise.

This patchset adds support for Intel Processor Trace (PT) extension [1] of
Intel Architecture that allows the capture of information about software
execution flow, to the perf kernel infrastructure.

The single most notable thing is that while PT outputs trace data in a
compressed binary format, it will still generate hundreds of megabytes
of trace data per second per core. Decoding this binary stream takes
2-3 orders of magnitude the cpu time that it takes to generate
it. These considerations make it impossible to carry out decoding in
kernel space. Therefore, the trace data is exported to userspace as a
zero-copy mapping that userspace can collect and store for later
decoding. To address this, this patchset extends perf ring buffer with
an "AUX space", which is allocated for hardware blocks such as PT to
export their trace data with minimal overhead. This space can be
configured via buffer's user page and mmapped from the same file
descriptor with a given offset. Data can then be collected from it
by reading the aux_head (write) pointer from the user page and updating
aux_tail (read) pointer similarly to data_{head,tail} of the
traditional perf buffer. There is an api between perf core and pmu
drivers that wish to make use of this AUX space to export their data.

For tracing blocks that don't support hardware scatter-gather tables,
we provide high-order physically contiguous allocations to minimize
the overhead needed for software double buffering and PMI pressure.

This way we get a normal perf data stream that provides sideband
information that is required to decode the trace data, such as MMAPs,
COMMs etc, plus the actual trace in its own logical space.

If the trace buffer is mapped writable, the driver will stop tracing
when it fills up (aux_head approaches aux_tail), till data is read,
aux_tail pointer is moved forward and an ioctl() is issued to
re-enable tracing. If the trace buffer is mapped read only, the
tracing will continue, overwriting older data, so that the buffer
always contains the most recent data. Tracing can be stopped with an
ioctl() and restarted once the data is collected.

This patchset consists of necessary changes to the perf kernel
infrastructure, and PT and BTS pmu drivers. The tooling support is not
included in this series, however, it can be found in my github tree [2].

[1] chapter 36 of
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
[2] http://github.com/virtuoso/linux-perf/tree/intel_pt

Alexander Shishkin (13):
perf: Add data_{offset,size} to user_page
perf: Support high-order allocations for AUX space
perf: Add a capability for AUX_NO_SG pmus to do software double
buffering
perf: Add a pmu capability for "exclusive" events
perf: Add AUX record
perf: Add api for pmus to write to AUX area
perf: Support overwrite mode for AUX area
perf: Add wakeup watermark control to AUX area
x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
x86: perf: Intel PT and LBR/BTS are mutually exclusive
x86: perf: intel_pt: Intel PT PMU driver
x86: perf: intel_bts: Add BTS PMU driver
perf: add ITRACE_START record to indicate that tracing has started

Peter Zijlstra (1):
perf: Add AUX area to ring buffer for raw data streams

arch/x86/include/asm/cpufeature.h | 1 +
arch/x86/include/uapi/asm/msr-index.h | 18 +
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/intel_pt.h | 131 ++++
arch/x86/kernel/cpu/perf_event.c | 43 ++
arch/x86/kernel/cpu/perf_event.h | 49 ++
arch/x86/kernel/cpu/perf_event_intel.c | 40 +-
arch/x86/kernel/cpu/perf_event_intel_bts.c | 525 ++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 3 +-
arch/x86/kernel/cpu/perf_event_intel_pt.c | 1082 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/scattered.c | 1 +
include/linux/perf_event.h | 47 +-
include/uapi/linux/perf_event.h | 59 +-
kernel/events/core.c | 268 ++++++-
kernel/events/internal.h | 33 +
kernel/events/ring_buffer.c | 327 ++++++++-
16 files changed, 2577 insertions(+), 51 deletions(-)
create mode 100644 arch/x86/kernel/cpu/intel_pt.h
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

--
2.1.1


2014-11-14 13:44:28

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 01/14] perf: Add data_{offset,size} to user_page

Currently, the actual perf ring buffer is one page into the mmap area,
following the user page and the userspace follows this convention. This
patch adds data_{offset,size} fields to user_page that can be used by
userspace instead for locating perf data in the mmap area. This is also
helpful when mapping existing or shared buffers if their size is not
known in advance.

Right now, it is made to follow the existing convention that

data_offset == PAGE_SIZE and
data_offset + data_size == mmap_size.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/uapi/linux/perf_event.h | 5 +++++
kernel/events/core.c | 2 ++
2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9d845404d8..dbb50c31c7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -489,9 +489,14 @@ struct perf_event_mmap_page {
* In this case the kernel will not over-write unread data.
*
* See perf_output_put_handle() for the data ordering.
+ *
+ * data_{offset,size} indicate the location and size of the perf record
+ * buffer within the mmapped area.
*/
__u64 data_head; /* head in the data section */
__u64 data_tail; /* user-space written tail */
+ __u64 data_offset; /* where the buffer starts */
+ __u64 data_size; /* data buffer size */
};

#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2b02c9fda7..96105bb6b1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3942,6 +3942,8 @@ static void perf_event_init_userpage(struct perf_event *event)
/* Allow new userspace to detect that bit 0 is deprecated */
userpg->cap_bit0_is_deprecated = 1;
userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+ userpg->data_offset = PAGE_SIZE;
+ userpg->data_size = perf_data_size(rb);

unlock:
rcu_read_unlock();
--
2.1.1

2014-11-14 13:44:41

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 06/14] perf: Add AUX record

When there's new data in the AUX space, output a record indicating its
offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
mean the described data was truncated to fit in the ring buffer.

Signed-off-by: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
---
include/uapi/linux/perf_event.h | 19 +++++++++++++++++++
kernel/events/core.c | 34 ++++++++++++++++++++++++++++++++++
kernel/events/internal.h | 3 +++
3 files changed, 56 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9a13e5fded..272258380e 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -733,6 +733,20 @@ enum perf_event_type {
*/
PERF_RECORD_MMAP2 = 10,

+ /*
+ * Records that new data landed in the AUX buffer part.
+ *
+ * struct {
+ * struct perf_event_header header;
+ *
+ * u64 aux_offset;
+ * u64 aux_size;
+ * u64 flags;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_AUX = 11,
+
PERF_RECORD_MAX, /* non-ABI */
};

@@ -750,6 +764,11 @@ enum perf_callchain_context {
PERF_CONTEXT_MAX = (__u64)-4095,
};

+/**
+ * PERF_RECORD_AUX::flags bits
+ */
+#define PERF_AUX_FLAG_TRUNCATED 0x01 /* record was truncated to fit */
+
#define PERF_FLAG_FD_NO_GROUP (1UL << 0)
#define PERF_FLAG_FD_OUTPUT (1UL << 1)
#define PERF_FLAG_PID_CGROUP (1UL << 2) /* pid=cgroup id, per-cpu mode only */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2399c12e49..779f60d300 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5606,6 +5606,40 @@ void perf_event_mmap(struct vm_area_struct *vma)
perf_event_mmap_event(&mmap_event);
}

+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+ unsigned long size, u64 flags)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ struct perf_aux_event {
+ struct perf_event_header header;
+ u64 offset;
+ u64 size;
+ u64 flags;
+ } rec = {
+ .header = {
+ .type = PERF_RECORD_AUX,
+ .misc = 0,
+ .size = sizeof(rec),
+ },
+ .offset = head,
+ .size = size,
+ .flags = flags,
+ };
+ int ret;
+
+ perf_event_header__init_id(&rec.header, &sample, event);
+ ret = perf_output_begin(&handle, event, rec.header.size);
+
+ if (ret)
+ return;
+
+ perf_output_put(&handle, rec);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
/*
* IRQ throttle logging
*/
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 0f6d080159..4d117a9814 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -62,6 +62,9 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
return !!rb->aux_nr_pages;
}

+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+ unsigned long size, u64 flags);
+
extern void
perf_event_header__init_id(struct perf_event_header *header,
struct perf_sample_data *data,
--
2.1.1

2014-11-14 13:44:27

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 03/14] perf: Support high-order allocations for AUX space

Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
don't support scatter-gather and will prefer larger contiguous areas for
their output regions.

This patch adds a new pmu capability to request higher order allocations.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/ring_buffer.c | 56 ++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 344058c71d..71df948b4d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -171,6 +171,7 @@ struct perf_event;
* pmu::capabilities flags
*/
#define PERF_PMU_CAP_NO_INTERRUPT 0x01
+#define PERF_PMU_CAP_AUX_NO_SG 0x02

/**
* struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index c6fdce4ea0..6e700e8fd6 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,30 +242,74 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+#define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+static struct page *rb_alloc_aux_page(int node, int order)
+{
+ struct page *page;
+
+ if (order > MAX_ORDER)
+ order = MAX_ORDER;
+
+ do {
+ page = alloc_pages_node(node, PERF_AUX_GFP, order);
+ } while (!page && order--);
+
+ if (page && order) {
+ /*
+ * Communicate the allocation size to the driver
+ */
+ split_page(page, order);
+ SetPagePrivate(page);
+ set_page_private(page, order);
+ }
+
+ return page;
+}
+
+static void rb_free_aux_page(struct ring_buffer *rb, int idx)
+{
+ struct page *page = virt_to_page(rb->aux_pages[idx]);
+
+ ClearPagePrivate(page);
+ page->mapping = NULL;
+ __free_page(page);
+}
+
int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
pgoff_t pgoff, int nr_pages, int flags)
{
bool overwrite = !(flags & RING_BUFFER_WRITABLE);
int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
- int ret = -ENOMEM;
+ int ret = -ENOMEM, max_order = 0;

if (!has_aux(event))
return -ENOTSUPP;

+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+ /*
+ * We need to start with the max_order that fits in nr_pages,
+ * not the other way around, hence ilog2() and not get_order.
+ */
+ max_order = ilog2(nr_pages);
+
rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
if (!rb->aux_pages)
return -ENOMEM;

rb->free_aux = event->pmu->free_aux;
- for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
- rb->aux_nr_pages++) {
+ for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
struct page *page;
+ int last, order;

- page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ order = min(max_order, ilog2(nr_pages - rb->aux_nr_pages));
+ page = rb_alloc_aux_page(node, order);
if (!page)
goto out;

- rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+ for (last = rb->aux_nr_pages + (1 << page_private(page));
+ last > rb->aux_nr_pages; rb->aux_nr_pages++)
+ rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
}

rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
@@ -303,7 +347,7 @@ static void __rb_free_aux(struct ring_buffer *rb)
}

for (pg = 0; pg < rb->aux_nr_pages; pg++)
- free_page((unsigned long)rb->aux_pages[pg]);
+ rb_free_aux_page(rb, pg);

kfree(rb->aux_pages);
rb->aux_nr_pages = 0;
--
2.1.1

2014-11-14 13:45:03

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 09/14] perf: Add wakeup watermark control to AUX area

When AUX area gets a certain amount of new data, we want to wake up
userspace to collect it. This adds a new control to specify how much
data will cause a wakeup. This is then passed down to pmu drivers via
output handle's "wakeup" field, so that the driver can find the nearest
point where it can generate an interrupt.

We repurpose __reserved_2 in the event attribute for this, even though
it was never checked to be zero before, aux_watermark will only matter
for new AUX-aware code, so the old code should still be fine.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/uapi/linux/perf_event.h | 7 +++++--
kernel/events/core.c | 3 ++-
kernel/events/internal.h | 4 +++-
kernel/events/ring_buffer.c | 22 +++++++++++++++++++---
4 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index f36aac2ac5..ffd52c0861 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -238,6 +238,7 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER2 80 /* add: branch_sample_type */
#define PERF_ATTR_SIZE_VER3 96 /* add: sample_regs_user */
/* add: sample_stack_user */
+ /* add: aux_watermark */

/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -332,8 +333,10 @@ struct perf_event_attr {
*/
__u32 sample_stack_user;

- /* Align to u64. */
- __u32 __reserved_2;
+ /*
+ * Wakeup watermark for AUX area
+ */
+ __u32 aux_watermark;
};

#define perf_flags(attr) (*(&(attr)->read_format + 1))
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 48b8dc0d36..da2e44aedd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4419,7 +4419,8 @@ accounting:
perf_event_init_userpage(event);
perf_event_update_userpage(event);
} else {
- ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+ ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
+ event->attr.aux_watermark, flags);
if (!ret)
rb->aux_mmap_locked = extra;
}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index ffd51d9f59..9f6ce9ba4a 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -27,6 +27,7 @@ struct ring_buffer {
local_t lost; /* nr records lost */

long watermark; /* wakeup watermark */
+ long aux_watermark;
/* poll crap */
spinlock_t event_lock;
struct list_head event_list;
@@ -38,6 +39,7 @@ struct ring_buffer {
/* AUX area */
local_t aux_head;
local_t aux_nest;
+ local_t aux_wakeup;
unsigned long aux_pgoff;
int aux_nr_pages;
int aux_overwrite;
@@ -57,7 +59,7 @@ extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
extern void perf_event_wakeup(struct perf_event *event);
extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
- pgoff_t pgoff, int nr_pages, int flags);
+ pgoff_t pgoff, int nr_pages, long watermark, int flags);
extern void rb_free_aux(struct ring_buffer *rb);
extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
extern void ring_buffer_put(struct ring_buffer *rb);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 8bff8c85eb..8d959f24ab 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -295,6 +295,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
*/
if (!rb->aux_overwrite) {
aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+ handle->wakeup = local_read(&rb->aux_wakeup) + rb->aux_watermark;
if (aux_head - aux_tail < perf_aux_size(rb))
handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));

@@ -358,9 +359,12 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
perf_event_aux_event(handle->event, aux_head, size, flags);
}

- rb->user_page->aux_head = local_read(&rb->aux_head);
+ aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);

- perf_output_wakeup(handle);
+ if (aux_head - local_read(&rb->aux_wakeup) >= rb->aux_watermark) {
+ perf_output_wakeup(handle);
+ local_add(rb->aux_watermark, &rb->aux_wakeup);
+ }
handle->event = NULL;

local_set(&rb->aux_nest, 0);
@@ -382,6 +386,14 @@ int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)

local_add(size, &rb->aux_head);

+ aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);
+ if (aux_head - local_read(&rb->aux_wakeup) >= rb->aux_watermark) {
+ perf_output_wakeup(handle);
+ local_add(rb->aux_watermark, &rb->aux_wakeup);
+ handle->wakeup = local_read(&rb->aux_wakeup) +
+ rb->aux_watermark;
+ }
+
handle->head = aux_head;
handle->size -= size;

@@ -432,7 +444,7 @@ static void rb_free_aux_page(struct ring_buffer *rb, int idx)
}

int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
- pgoff_t pgoff, int nr_pages, int flags)
+ pgoff_t pgoff, int nr_pages, long watermark, int flags)
{
bool overwrite = !(flags & RING_BUFFER_WRITABLE);
int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
@@ -496,6 +508,10 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
atomic_set(&rb->aux_refcount, 1);

rb->aux_overwrite = overwrite;
+ rb->aux_watermark = watermark;
+
+ if (!rb->aux_watermark && !rb->aux_overwrite)
+ rb->aux_watermark = nr_pages << (PAGE_SHIFT - 1);

out:
if (!ret)
--
2.1.1

2014-11-14 13:45:08

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 11/14] x86: perf: Intel PT and LBR/BTS are mutually exclusive

Intel PT cannot be used at the same time as LBR or BTS and will cause a
general protection fault if they are used together. In order to avoid
fixing up GPs in the fast path, instead we disallow creating LBR/BTS
events when PT events are present and vice versa.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 43 ++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event.h | 40 +++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel.c | 26 ++++++++++----------
3 files changed, 95 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 143e5f5dc8..1cca1ac9b6 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -259,6 +259,14 @@ static void hw_perf_event_destroy(struct perf_event *event)
}
}

+void hw_perf_lbr_event_destroy(struct perf_event *event)
+{
+ hw_perf_event_destroy(event);
+
+ /* undo the lbr/bts event accounting */
+ x86_del_exclusive(x86_lbr_exclusive_lbr);
+}
+
static inline int x86_pmu_initialized(void)
{
return x86_pmu.handle_irq != NULL;
@@ -298,6 +306,35 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
return x86_pmu_extra_regs(val, event);
}

+/*
+ * Check if we can create event of a certain type (that no conflicting events
+ * are present).
+ */
+int x86_add_exclusive(unsigned int what)
+{
+ int ret = -EBUSY, i;
+
+ if (atomic_inc_not_zero(&x86_pmu.lbr_exclusive[what]))
+ return 0;
+
+ mutex_lock(&pmc_reserve_mutex);
+ for (i = 0; i < ARRAY_SIZE(x86_pmu.lbr_exclusive); i++)
+ if (i != what && atomic_read(&x86_pmu.lbr_exclusive[i]))
+ goto out;
+
+ atomic_inc(&x86_pmu.lbr_exclusive[what]);
+ ret = 0;
+
+out:
+ mutex_unlock(&pmc_reserve_mutex);
+ return ret;
+}
+
+void x86_del_exclusive(unsigned int what)
+{
+ atomic_dec(&x86_pmu.lbr_exclusive[what]);
+}
+
int x86_setup_perfctr(struct perf_event *event)
{
struct perf_event_attr *attr = &event->attr;
@@ -342,6 +379,12 @@ int x86_setup_perfctr(struct perf_event *event)
/* BTS is currently only allowed for user-mode. */
if (!attr->exclude_kernel)
return -EOPNOTSUPP;
+
+ /* disallow bts if conflicting events are present */
+ if (x86_add_exclusive(x86_lbr_exclusive_lbr))
+ return -EBUSY;
+
+ event->destroy = hw_perf_lbr_event_destroy;
}

hwc->config |= config;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index fc5eb390b3..620612d311 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -402,6 +402,12 @@ union x86_pmu_config {

#define X86_CONFIG(args...) ((union x86_pmu_config){.bits = {args}}).value

+enum {
+ x86_lbr_exclusive_lbr,
+ x86_lbr_exclusive_pt,
+ x86_lbr_exclusive_max,
+};
+
/*
* struct x86_pmu - generic x86 pmu
*/
@@ -498,6 +504,11 @@ struct x86_pmu {
bool lbr_double_abort; /* duplicated lbr aborts */

/*
+ * Intel PT/LBR/BTS are exclusive
+ */
+ atomic_t lbr_exclusive[x86_lbr_exclusive_max];
+
+ /*
* Extra registers for events
*/
struct extra_reg *extra_regs;
@@ -582,6 +593,12 @@ static inline int x86_pmu_rdpmc_index(int index)
return x86_pmu.rdpmc_index ? x86_pmu.rdpmc_index(index) : index;
}

+int x86_add_exclusive(unsigned int what);
+
+void x86_del_exclusive(unsigned int what);
+
+void hw_perf_lbr_event_destroy(struct perf_event *event);
+
int x86_setup_perfctr(struct perf_event *event);

int x86_pmu_hw_config(struct perf_event *event);
@@ -668,6 +685,29 @@ static inline int amd_pmu_init(void)

#ifdef CONFIG_CPU_SUP_INTEL

+static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
+{
+ /* user explicitly requested branch sampling */
+ if (has_branch_stack(event))
+ return true;
+
+ /* implicit branch sampling to correct PEBS skid */
+ if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
+ x86_pmu.intel_cap.pebs_format < 2)
+ return true;
+
+ return false;
+}
+
+static inline bool intel_pmu_has_bts(struct perf_event *event)
+{
+ if (event->attr.config == PERF_COUNT_HW_BRANCH_INSTRUCTIONS &&
+ !event->attr.freq && event->hw.sample_period == 1)
+ return true;
+
+ return false;
+}
+
int intel_pmu_save_and_restart(struct perf_event *event);

struct event_constraint *
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 944bf019b7..c64e97a1cb 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1029,20 +1029,6 @@ static __initconst const u64 slm_hw_cache_event_ids
},
};

-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
- /* user explicitly requested branch sampling */
- if (has_branch_stack(event))
- return true;
-
- /* implicit branch sampling to correct PEBS skid */
- if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
- x86_pmu.intel_cap.pebs_format < 2)
- return true;
-
- return false;
-}
-
static void intel_pmu_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1751,6 +1737,18 @@ static int intel_pmu_hw_config(struct perf_event *event)
ret = intel_pmu_setup_lbr_filter(event);
if (ret)
return ret;
+
+ /*
+ * bts is set up earlier in this path, so don't account
+ * twice
+ */
+ if (!intel_pmu_has_bts(event)) {
+ /* disallow lbr if conflicting events are present */
+ if (x86_add_exclusive(x86_lbr_exclusive_lbr))
+ return -EBUSY;
+
+ event->destroy = hw_perf_lbr_event_destroy;
+ }
}

if (event->attr.type != PERF_TYPE_RAW)
--
2.1.1

2014-11-14 13:45:18

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 07/14] perf: Add api for pmus to write to AUX area

For pmus that wish to write data to ring buffer's AUX area, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These also use the same output
handle structure. Also, similarly to software counterparts, these
will direct inherited events' output to parents' ring buffers.

After the perf_aux_output_begin() returns successfully, handle->size
is set to the maximum amount of data that can be written wrt aux_tail
pointer, so that no data that the user hasn't seen will be overwritten,
therefore this should always be called before hardware writing is
enabled. On success, this will return the pointer to pmu driver's
private structure allocated for this aux area by pmu::setup_aux. Same
pointer can also be retrieved using perf_get_aux() while hardware
writing is enabled.

PMU driver should pass the actual amount of data written as a parameter
to perf_aux_output_end(). All hardware writes should be completed and
visible before this one is called.

Additionally, perf_aux_output_skip() will adjust output handle and
aux_head in case some part of the buffer has to be skipped over to
maintain hardware's alignment constraints.

Nested writers are forbidden and guards are in place to catch such
attempts.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 23 +++++++-
kernel/events/core.c | 5 +-
kernel/events/internal.h | 4 ++
kernel/events/ring_buffer.c | 139 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 167 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5c3dee021c..f126eb89e6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -555,12 +555,22 @@ struct perf_output_handle {
struct ring_buffer *rb;
unsigned long wakeup;
unsigned long size;
- void *addr;
+ union {
+ void *addr;
+ unsigned long head;
+ };
int page;
};

#ifdef CONFIG_PERF_EVENTS

+extern void *perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event);
+extern void perf_aux_output_end(struct perf_output_handle *handle,
+ unsigned long size, bool truncated);
+extern int perf_aux_output_skip(struct perf_output_handle *handle,
+ unsigned long size);
+extern void *perf_get_aux(struct perf_output_handle *handle);
extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
extern void perf_pmu_unregister(struct pmu *pmu);

@@ -817,6 +827,17 @@ extern void perf_event_disable(struct perf_event *event);
extern int __perf_event_disable(void *info);
extern void perf_event_task_tick(void);
#else /* !CONFIG_PERF_EVENTS: */
+static inline void *
+perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event) { return NULL; }
+static inline void
+perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+ bool truncated) { }
+static inline int
+perf_aux_output_skip(struct perf_output_handle *handle,
+ unsigned long size) { return -EINVAL; }
+static inline void *
+perf_get_aux(struct perf_output_handle *handle) { return NULL; }
static inline void
perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task) { }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 779f60d300..48b8dc0d36 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3293,7 +3293,6 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}

-static void ring_buffer_put(struct ring_buffer *rb);
static void ring_buffer_attach(struct perf_event *event,
struct ring_buffer *rb);

@@ -4110,7 +4109,7 @@ static void rb_free_rcu(struct rcu_head *rcu_head)
rb_free(rb);
}

-static struct ring_buffer *ring_buffer_get(struct perf_event *event)
+struct ring_buffer *ring_buffer_get(struct perf_event *event)
{
struct ring_buffer *rb;

@@ -4125,7 +4124,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}

-static void ring_buffer_put(struct ring_buffer *rb)
+void ring_buffer_put(struct ring_buffer *rb)
{
if (!atomic_dec_and_test(&rb->refcount))
return;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 4d117a9814..b701ebc325 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,8 @@ struct ring_buffer {
struct user_struct *mmap_user;

/* AUX area */
+ local_t aux_head;
+ local_t aux_nest;
unsigned long aux_pgoff;
int aux_nr_pages;
atomic_t aux_mmap_count;
@@ -56,6 +58,8 @@ extern void perf_event_wakeup(struct perf_event *event);
extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
pgoff_t pgoff, int nr_pages, int flags);
extern void rb_free_aux(struct ring_buffer *rb);
+extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
+extern void ring_buffer_put(struct ring_buffer *rb);

static inline bool rb_has_aux(struct ring_buffer *rb)
{
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 608c0d67ad..49d6d8f410 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,6 +242,145 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+/*
+ * This is called before hardware starts writing to the AUX area to
+ * obtain an output handle and make sure there's room in the buffer.
+ * When the capture completes, call perf_aux_output_end() to commit
+ * the recorded data to the buffer.
+ *
+ * The ordering is similar to that of perf_output_{begin,end}, with
+ * the exception of (B), which should be taken care of by the pmu
+ * driver, since ordering rules will differ depending on hardware.
+ */
+void *perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event)
+{
+ struct perf_event *output_event = event;
+ unsigned long aux_head, aux_tail;
+ struct ring_buffer *rb;
+
+ if (output_event->parent)
+ output_event = output_event->parent;
+
+ /*
+ * Since this will typically be open across pmu::add/pmu::del, we
+ * grab ring_buffer's refcount instead of holding rcu read lock
+ * to make sure it doesn't disappear under us.
+ */
+ rb = ring_buffer_get(output_event);
+ if (!rb)
+ return NULL;
+
+ if (!rb_has_aux(rb) || !atomic_inc_not_zero(&rb->aux_refcount))
+ goto err;
+
+ /*
+ * Nesting is not supported for AUX area, make sure nested
+ * writers are caught early
+ */
+ if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
+ goto err_put;
+
+ aux_head = local_read(&rb->aux_head);
+ aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+
+ handle->rb = rb;
+ handle->event = event;
+ handle->head = aux_head;
+ if (aux_head - aux_tail < perf_aux_size(rb))
+ handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+ else
+ handle->size = 0;
+
+ /*
+ * handle->size computation depends on aux_tail load; this forms a
+ * control dependency barrier separating aux_tail load from aux data
+ * store that will be enabled on successful return
+ */
+ if (!handle->size) { /* A, matches D */
+ event->pending_disable = 1;
+ perf_output_wakeup(handle);
+ local_set(&rb->aux_nest, 0);
+ goto err_put;
+ }
+
+ return handle->rb->aux_priv;
+
+err_put:
+ rb_free_aux(rb);
+
+err:
+ ring_buffer_put(rb);
+ handle->event = NULL;
+
+ return NULL;
+}
+
+/*
+ * Commit the data written by hardware into the ring buffer by adjusting
+ * aux_head and posting a PERF_RECORD_AUX into the perf buffer. It is the
+ * pmu driver's responsibility to observe ordering rules of the hardware,
+ * so that all the data is externally visible before this is called.
+ */
+void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+ bool truncated)
+{
+ struct ring_buffer *rb = handle->rb;
+ unsigned long aux_head = local_read(&rb->aux_head);
+ u64 flags = 0;
+
+ if (truncated)
+ flags |= PERF_AUX_FLAG_TRUNCATED;
+
+ local_add(size, &rb->aux_head);
+
+ if (size || flags) {
+ /*
+ * Only send RECORD_AUX if we have something useful to communicate
+ */
+
+ perf_event_aux_event(handle->event, aux_head, size, flags);
+ }
+
+ rb->user_page->aux_head = local_read(&rb->aux_head);
+
+ perf_output_wakeup(handle);
+ handle->event = NULL;
+
+ local_set(&rb->aux_nest, 0);
+ rb_free_aux(rb);
+ ring_buffer_put(rb);
+}
+
+/*
+ * Skip over a given number of bytes in the AUX buffer, due to, for example,
+ * hardware's alignment constraints.
+ */
+int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)
+{
+ struct ring_buffer *rb = handle->rb;
+ unsigned long aux_head;
+
+ if (size > handle->size)
+ return -ENOSPC;
+
+ local_add(size, &rb->aux_head);
+
+ handle->head = aux_head;
+ handle->size -= size;
+
+ return 0;
+}
+
+void *perf_get_aux(struct perf_output_handle *handle)
+{
+ /* this is only valid between perf_aux_output_begin and *_end */
+ if (!handle->event)
+ return NULL;
+
+ return handle->rb->aux_priv;
+}
+
#define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)

static struct page *rb_alloc_aux_page(int node, int order)
--
2.1.1

2014-11-14 13:45:20

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 05/14] perf: Add a pmu capability for "exclusive" events

Usually, pmus that do, for example, instruction tracing, would only ever
be able to have one event per task per cpu (or per perf_context). For such
pmus it makes sense to disallow creating conflicting events early on, so
as to provide consistent behavior for the user.

This patch adds a pmu capability that indicates such constraint on event
creation.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 46 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e7a15f5c3f..5c3dee021c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -173,6 +173,7 @@ struct perf_event;
#define PERF_PMU_CAP_NO_INTERRUPT 0x01
#define PERF_PMU_CAP_AUX_NO_SG 0x02
#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 0x04
+#define PERF_PMU_CAP_EXCLUSIVE 0x08

/**
* struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ac7bfb43fc..2399c12e49 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7298,6 +7298,32 @@ out:
return ret;
}

+static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+ if ((e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) &&
+ (e1->cpu == e2->cpu ||
+ e1->cpu == -1 ||
+ e2->cpu == -1))
+ return true;
+ return false;
+}
+
+static bool exclusive_event_ok(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ struct perf_event *iter_event;
+
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
+ return true;
+
+ list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
+ if (exclusive_event_match(iter_event, event))
+ return false;
+ }
+
+ return true;
+}
+
/**
* sys_perf_event_open - open a performance event, associate it to a task/cpu
*
@@ -7449,6 +7475,11 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_alloc;
}

+ if ((pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) && group_leader) {
+ err = -EBUSY;
+ goto err_context;
+ }
+
if (task) {
put_task_struct(task);
task = NULL;
@@ -7534,6 +7565,12 @@ SYSCALL_DEFINE5(perf_event_open,
}
}

+ if (!exclusive_event_ok(event, ctx)) {
+ mutex_unlock(&ctx->mutex);
+ fput(event_file);
+ goto err_context;
+ }
+
perf_install_in_context(ctx, event, event->cpu);
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);
@@ -7620,6 +7657,14 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,

WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
+ if (!exclusive_event_ok(event, ctx)) {
+ mutex_unlock(&ctx->mutex);
+ perf_unpin_context(ctx);
+ put_ctx(ctx);
+ err = -EBUSY;
+ goto err_free;
+ }
+
perf_install_in_context(ctx, event, cpu);
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);
--
2.1.1

2014-11-14 13:45:36

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 08/14] perf: Support overwrite mode for AUX area

This adds support for overwrite mode in the AUX area, which means "keep
collecting data till you're stopped", turning AUX area into a circular
buffer, where new data overwrites old data. It does not depend on data
buffer's overwrite mode, so that it doesn't lose sideband data that is
instrumental for processing AUX data.

Overwrite mode is enabled at mapping AUX area read only. Even though
aux_tail in the buffer's user page might be user writable, it will be
ignored in this mode.

A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
data stream every time an event writes new data to the AUX area. The pmu
driver might not be able to infer the exact beginning of the new data in
each snapshot, some drivers will only provide the tail, which is
aux_offset + aux_size in the AUX record. Consumer has to be able to tell
the new data from the old one, for example, by means of time stamps if
such are provided in the trace.

Consumer is also responsible for disabling any events that might write
to the AUX area (thus potentially racing with the consumer) before
collecting the data.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/uapi/linux/perf_event.h | 1 +
kernel/events/internal.h | 1 +
kernel/events/ring_buffer.c | 48 ++++++++++++++++++++++++++++-------------
3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 272258380e..f36aac2ac5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -768,6 +768,7 @@ enum perf_callchain_context {
* PERF_RECORD_AUX::flags bits
*/
#define PERF_AUX_FLAG_TRUNCATED 0x01 /* record was truncated to fit */
+#define PERF_AUX_FLAG_OVERWRITE 0x02 /* snapshot from overwrite mode */

#define PERF_FLAG_FD_NO_GROUP (1UL << 0)
#define PERF_FLAG_FD_OUTPUT (1UL << 1)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index b701ebc325..ffd51d9f59 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -40,6 +40,7 @@ struct ring_buffer {
local_t aux_nest;
unsigned long aux_pgoff;
int aux_nr_pages;
+ int aux_overwrite;
atomic_t aux_mmap_count;
unsigned long aux_mmap_locked;
void (*free_aux)(void *);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 49d6d8f410..8bff8c85eb 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -282,26 +282,33 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
goto err_put;

aux_head = local_read(&rb->aux_head);
- aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);

handle->rb = rb;
handle->event = event;
handle->head = aux_head;
- if (aux_head - aux_tail < perf_aux_size(rb))
- handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
- else
- handle->size = 0;
+ handle->size = 0;

/*
- * handle->size computation depends on aux_tail load; this forms a
- * control dependency barrier separating aux_tail load from aux data
- * store that will be enabled on successful return
+ * In overwrite mode, AUX data stores do not depend on aux_tail,
+ * therefore (A) control dependency barrier does not exist. The
+ * (B) <-> (C) ordering is still observed by the pmu driver.
*/
- if (!handle->size) { /* A, matches D */
- event->pending_disable = 1;
- perf_output_wakeup(handle);
- local_set(&rb->aux_nest, 0);
- goto err_put;
+ if (!rb->aux_overwrite) {
+ aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+ if (aux_head - aux_tail < perf_aux_size(rb))
+ handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+
+ /*
+ * handle->size computation depends on aux_tail load; this forms a
+ * control dependency barrier separating aux_tail load from aux data
+ * store that will be enabled on successful return
+ */
+ if (!handle->size) { /* A, matches D */
+ event->pending_disable = 1;
+ perf_output_wakeup(handle);
+ local_set(&rb->aux_nest, 0);
+ goto err_put;
+ }
}

return handle->rb->aux_priv;
@@ -326,13 +333,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
bool truncated)
{
struct ring_buffer *rb = handle->rb;
- unsigned long aux_head = local_read(&rb->aux_head);
+ unsigned long aux_head;
u64 flags = 0;

if (truncated)
flags |= PERF_AUX_FLAG_TRUNCATED;

- local_add(size, &rb->aux_head);
+ /* in overwrite mode, driver provides aux_head via handle */
+ if (rb->aux_overwrite) {
+ flags |= PERF_AUX_FLAG_OVERWRITE;
+
+ aux_head = handle->head;
+ local_set(&rb->aux_head, aux_head);
+ } else {
+ aux_head = local_read(&rb->aux_head);
+ local_add(size, &rb->aux_head);
+ }

if (size || flags) {
/*
@@ -479,6 +495,8 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
*/
atomic_set(&rb->aux_refcount, 1);

+ rb->aux_overwrite = overwrite;
+
out:
if (!ret)
rb->aux_pgoff = pgoff;
--
2.1.1

2014-11-14 13:45:42

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams

From: Peter Zijlstra <[email protected]>

This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.

AUX space is a ring buffer, defined by aux_{offset,size} fields in the
user_page structure, and read/write pointers aux_{head,tail}, which abide
by the same rules as data_* counterparts of the main perf buffer.

In order to allocate/mmap AUX, userspace needs to set up aux_offset to
such an offset that will be greater than data_offset+data_size and
aux_size to be the desired buffer size. Both need to be page aligned.
Then, same aux_offset and aux_size should be passed to mmap() call and
if everything adds up, you should have an AUX buffer as a result.

Pages that are mapped into this buffer also come out of user's mlock
rlimit plus perf_event_mlock_kb allowance.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 17 +++++
include/uapi/linux/perf_event.h | 16 +++++
kernel/events/core.c | 140 +++++++++++++++++++++++++++++++++-------
kernel/events/internal.h | 23 +++++++
kernel/events/ring_buffer.c | 97 ++++++++++++++++++++++++++--
5 files changed, 264 insertions(+), 29 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0798..344058c71d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -263,6 +263,18 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * Set up pmu-private data structures for an AUX area
+ */
+ void *(*setup_aux) (int cpu, void **pages,
+ int nr_pages, bool overwrite);
+ /* optional */
+
+ /*
+ * Free pmu-private AUX data structures
+ */
+ void (*free_aux) (void *aux); /* optional */
};

/**
@@ -782,6 +794,11 @@ static inline bool has_branch_stack(struct perf_event *event)
return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
}

+static inline bool has_aux(struct perf_event *event)
+{
+ return event->pmu->setup_aux;
+}
+
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index dbb50c31c7..9a13e5fded 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -497,6 +497,22 @@ struct perf_event_mmap_page {
__u64 data_tail; /* user-space written tail */
__u64 data_offset; /* where the buffer starts */
__u64 data_size; /* data buffer size */
+
+ /*
+ * AUX area is defined by aux_{offset,size} fields that should be set
+ * by the userspace, so that
+ *
+ * aux_offset >= data_offset + data_size
+ *
+ * prior to mmap()ing it. Size of the mmap()ed area should be aux_size.
+ *
+ * Ring buffer pointers aux_{head,tail} have the same semantics as
+ * data_{head,tail} and same ordering rules apply.
+ */
+ __u64 aux_head;
+ __u64 aux_tail;
+ __u64 aux_offset;
+ __u64 aux_size;
};

#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 96105bb6b1..ac7bfb43fc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4141,6 +4141,8 @@ static void perf_mmap_open(struct vm_area_struct *vma)

atomic_inc(&event->mmap_count);
atomic_inc(&event->rb->mmap_count);
+ if (vma->vm_pgoff)
+ atomic_inc(&event->rb->aux_mmap_count);
}

/*
@@ -4160,6 +4162,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
int mmap_locked = rb->mmap_locked;
unsigned long size = perf_data_size(rb);

+ /*
+ * rb->aux_mmap_count will always drop before rb->mmap_count and
+ * event->mmap_count, so it is ok to use event->mmap_mutex to
+ * serialize with perf_mmap here.
+ */
+ if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
+ atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
+ atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+
+ rb_free_aux(rb);
+ mutex_unlock(&event->mmap_mutex);
+ }
+
atomic_dec(&rb->mmap_count);

if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
@@ -4233,7 +4249,7 @@ out_put:

static const struct vm_operations_struct perf_mmap_vmops = {
.open = perf_mmap_open,
- .close = perf_mmap_close,
+ .close = perf_mmap_close, /* non mergable */
.fault = perf_mmap_fault,
.page_mkwrite = perf_mmap_fault,
};
@@ -4244,10 +4260,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
unsigned long user_locked, user_lock_limit;
struct user_struct *user = current_user();
unsigned long locked, lock_limit;
- struct ring_buffer *rb;
+ struct ring_buffer *rb = NULL;
unsigned long vma_size;
unsigned long nr_pages;
- long user_extra, extra;
+ long user_extra = 0, extra = 0;
int ret = 0, flags = 0;

/*
@@ -4262,7 +4278,66 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
return -EINVAL;

vma_size = vma->vm_end - vma->vm_start;
- nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+ if (vma->vm_pgoff == 0) {
+ nr_pages = (vma_size / PAGE_SIZE) - 1;
+ } else {
+ /*
+ * AUX area mapping: if rb->aux_nr_pages != 0, it's already
+ * mapped, all subsequent mappings should have the same size
+ * and offset. Must be above the normal perf buffer.
+ */
+ u64 aux_offset, aux_size;
+
+ if (!event->rb)
+ return -EINVAL;
+
+ nr_pages = vma_size / PAGE_SIZE;
+
+ mutex_lock(&event->mmap_mutex);
+ ret = -EINVAL;
+
+ rb = event->rb;
+ if (!rb)
+ goto aux_unlock;
+
+ aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
+ aux_size = ACCESS_ONCE(rb->user_page->aux_size);
+
+ if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
+ goto aux_unlock;
+
+ if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
+ goto aux_unlock;
+
+ /* already mapped with a different offset */
+ if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
+ goto aux_unlock;
+
+ if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
+ goto aux_unlock;
+
+ /* already mapped with a different size */
+ if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
+ goto aux_unlock;
+
+ if (!is_power_of_2(nr_pages))
+ goto aux_unlock;
+
+ if (!atomic_inc_not_zero(&rb->mmap_count))
+ goto aux_unlock;
+
+ if (rb_has_aux(rb)) {
+ atomic_inc(&rb->aux_mmap_count);
+ ret = 0;
+ goto unlock;
+ }
+
+ atomic_set(&rb->aux_mmap_count, 1);
+ user_extra = nr_pages;
+
+ goto accounting;
+ }

/*
* If we have rb pages ensure they're a power-of-two number, so we
@@ -4274,9 +4349,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (vma_size != PAGE_SIZE * (1 + nr_pages))
return -EINVAL;

- if (vma->vm_pgoff != 0)
- return -EINVAL;
-
WARN_ON_ONCE(event->ctx->parent_ctx);
again:
mutex_lock(&event->mmap_mutex);
@@ -4300,6 +4372,8 @@ again:
}

user_extra = nr_pages + 1;
+
+accounting:
user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);

/*
@@ -4309,7 +4383,6 @@ again:

user_locked = atomic_long_read(&user->locked_vm) + user_extra;

- extra = 0;
if (user_locked > user_lock_limit)
extra = user_locked - user_lock_limit;

@@ -4323,35 +4396,45 @@ again:
goto unlock;
}

- WARN_ON(event->rb);
+ WARN_ON(!rb && event->rb);

if (vma->vm_flags & VM_WRITE)
flags |= RING_BUFFER_WRITABLE;

- rb = rb_alloc(nr_pages,
- event->attr.watermark ? event->attr.wakeup_watermark : 0,
- event->cpu, flags);
-
if (!rb) {
- ret = -ENOMEM;
- goto unlock;
- }
+ rb = rb_alloc(nr_pages,
+ event->attr.watermark ? event->attr.wakeup_watermark : 0,
+ event->cpu, flags);

- atomic_set(&rb->mmap_count, 1);
- rb->mmap_locked = extra;
- rb->mmap_user = get_current_user();
+ if (!rb) {
+ ret = -ENOMEM;
+ goto unlock;
+ }

- atomic_long_add(user_extra, &user->locked_vm);
- vma->vm_mm->pinned_vm += extra;
+ atomic_set(&rb->mmap_count, 1);
+ rb->mmap_user = get_current_user();
+ rb->mmap_locked = extra;

- ring_buffer_attach(event, rb);
+ ring_buffer_attach(event, rb);

- perf_event_init_userpage(event);
- perf_event_update_userpage(event);
+ perf_event_init_userpage(event);
+ perf_event_update_userpage(event);
+ } else {
+ ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+ if (!ret)
+ rb->aux_mmap_locked = extra;
+ }

unlock:
- if (!ret)
+ if (!ret) {
+ atomic_long_add(user_extra, &user->locked_vm);
+ vma->vm_mm->pinned_vm += extra;
+
atomic_inc(&event->mmap_count);
+ } else if (rb) {
+ atomic_dec(&rb->mmap_count);
+ }
+aux_unlock:
mutex_unlock(&event->mmap_mutex);

/*
@@ -7185,6 +7268,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
if (output_event->cpu == -1 && output_event->ctx != event->ctx)
goto out;

+ /*
+ * If both events generate aux data, they must be on the same PMU
+ */
+ if (has_aux(event) && has_aux(output_event) &&
+ event->pmu != output_event->pmu)
+ goto out;
+
set:
mutex_lock(&event->mmap_mutex);
/* Can't redirect output if we've got an active mmap() */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782..0f6d080159 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -35,6 +35,16 @@ struct ring_buffer {
unsigned long mmap_locked;
struct user_struct *mmap_user;

+ /* AUX area */
+ unsigned long aux_pgoff;
+ int aux_nr_pages;
+ atomic_t aux_mmap_count;
+ unsigned long aux_mmap_locked;
+ void (*free_aux)(void *);
+ atomic_t aux_refcount;
+ void **aux_pages;
+ void *aux_priv;
+
struct perf_event_mmap_page *user_page;
void *data_pages[0];
};
@@ -43,6 +53,14 @@ extern void rb_free(struct ring_buffer *rb);
extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
extern void perf_event_wakeup(struct perf_event *event);
+extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+ pgoff_t pgoff, int nr_pages, int flags);
+extern void rb_free_aux(struct ring_buffer *rb);
+
+static inline bool rb_has_aux(struct ring_buffer *rb)
+{
+ return !!rb->aux_nr_pages;
+}

extern void
perf_event_header__init_id(struct perf_event_header *header,
@@ -81,6 +99,11 @@ static inline unsigned long perf_data_size(struct ring_buffer *rb)
return rb->nr_pages << (PAGE_SHIFT + page_order(rb));
}

+static inline unsigned long perf_aux_size(struct ring_buffer *rb)
+{
+ return rb->aux_nr_pages << PAGE_SHIFT;
+}
+
#define DEFINE_OUTPUT_COPY(func_name, memcpy_func) \
static inline unsigned long \
func_name(struct perf_output_handle *handle, \
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a5792b1..c6fdce4ea0 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,14 +242,87 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+ pgoff_t pgoff, int nr_pages, int flags)
+{
+ bool overwrite = !(flags & RING_BUFFER_WRITABLE);
+ int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
+ int ret = -ENOMEM;
+
+ if (!has_aux(event))
+ return -ENOTSUPP;
+
+ rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
+ if (!rb->aux_pages)
+ return -ENOMEM;
+
+ rb->free_aux = event->pmu->free_aux;
+ for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
+ rb->aux_nr_pages++) {
+ struct page *page;
+
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ if (!page)
+ goto out;
+
+ rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+ }
+
+ rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
+ overwrite);
+ if (!rb->aux_priv)
+ goto out;
+
+ ret = 0;
+
+ /*
+ * aux_pages (and pmu driver's private data, aux_priv) will be
+ * referenced in both producer's and consumer's contexts, thus
+ * we keep a refcount here to make sure either of the two can
+ * reference them safely.
+ */
+ atomic_set(&rb->aux_refcount, 1);
+
+out:
+ if (!ret)
+ rb->aux_pgoff = pgoff;
+ else
+ rb_free_aux(rb);
+
+ return ret;
+}
+
+static void __rb_free_aux(struct ring_buffer *rb)
+{
+ int pg;
+
+ if (rb->aux_priv) {
+ rb->free_aux(rb->aux_priv);
+ rb->free_aux = NULL;
+ rb->aux_priv = NULL;
+ }
+
+ for (pg = 0; pg < rb->aux_nr_pages; pg++)
+ free_page((unsigned long)rb->aux_pages[pg]);
+
+ kfree(rb->aux_pages);
+ rb->aux_nr_pages = 0;
+}
+
+void rb_free_aux(struct ring_buffer *rb)
+{
+ if (atomic_dec_and_test(&rb->aux_refcount))
+ __rb_free_aux(rb);
+}
+
#ifndef CONFIG_PERF_USE_VMALLOC

/*
* Back perf_mmap() with regular GFP_KERNEL-0 pages.
*/

-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
{
if (pgoff > rb->nr_pages)
return NULL;
@@ -339,8 +412,8 @@ static int data_page_nr(struct ring_buffer *rb)
return rb->nr_pages << page_order(rb);
}

-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
{
/* The '>' counts in the user page. */
if (pgoff > data_page_nr(rb))
@@ -415,3 +488,19 @@ fail:
}

#endif
+
+struct page *
+perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+ if (rb->aux_nr_pages) {
+ /* above AUX space */
+ if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
+ return NULL;
+
+ /* AUX space */
+ if (pgoff >= rb->aux_pgoff)
+ return virt_to_page(rb->aux_pages[pgoff - rb->aux_pgoff]);
+ }
+
+ return __perf_mmap_to_page(rb, pgoff);
+}
--
2.1.1

2014-11-14 13:45:54

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 13/14] x86: perf: intel_bts: Add BTS PMU driver

Add support for Branch Trace Store (BTS) via kernel perf event infrastructure.
The difference with the existing implementation of BTS support is that this
one is a separate PMU that exports events' trace buffers to userspace by means
of AUX area of the perf buffer, which is zero-copy mapped into userspace.

The immediate benefit is that the buffer size can be much bigger, resulting in
fewer interrupts and no kernel side copying is involved and little to no trace
data loss. Also, kernel code can be traced with this driver.

The old way of collecting BTS traces still works.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event.h | 7 +
arch/x86/kernel/cpu/perf_event_intel.c | 6 +-
arch/x86/kernel/cpu/perf_event_intel_bts.c | 525 +++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 3 +-
5 files changed, 540 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4bc29fa23e..a595246a47 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -40,7 +40,7 @@ endif
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_p6.o perf_event_knc.o perf_event_p4.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_rapl.o
-obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o perf_event_intel_bts.o

obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_snb.o \
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 9ab6c59851..a30c812f22 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -404,6 +404,7 @@ union x86_pmu_config {

enum {
x86_lbr_exclusive_lbr,
+ x86_lbr_exclusive_bts,
x86_lbr_exclusive_pt,
x86_lbr_exclusive_max,
};
@@ -785,6 +786,12 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event);

void intel_pt_interrupt(void);

+int intel_bts_interrupt(void);
+
+void intel_bts_enable_local(void);
+
+void intel_bts_disable_local(void);
+
int p4_pmu_init(void);

int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 5a29ea59d6..a2fc1bf04c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1037,6 +1037,8 @@ static void intel_pmu_disable_all(void)

if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
intel_pmu_disable_bts();
+ else
+ intel_bts_disable_local();

intel_pmu_pebs_disable_all();
intel_pmu_lbr_disable_all();
@@ -1059,7 +1061,8 @@ static void intel_pmu_enable_all(int added)
return;

intel_pmu_enable_bts(event->hw.config);
- }
+ } else
+ intel_bts_enable_local();
}

/*
@@ -1345,6 +1348,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
apic_write(APIC_LVTPC, APIC_DM_NMI);
intel_pmu_disable_all();
handled = intel_pmu_drain_bts_buffer();
+ handled += intel_bts_interrupt();
status = intel_pmu_get_status();
if (!status)
goto done;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
new file mode 100644
index 0000000000..af26a88052
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -0,0 +1,525 @@
+/*
+ * BTS PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/device.h>
+#include <linux/coredump.h>
+
+#include <asm-generic/sizes.h>
+#include <asm/perf_event.h>
+
+#include "perf_event.h"
+
+struct bts_ctx {
+ struct perf_output_handle handle;
+ struct debug_store ds_back;
+ int started;
+};
+
+static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
+
+#define BTS_RECORD_SIZE 24
+#define BTS_SAFETY_MARGIN 4080
+
+struct bts_phys {
+ struct page *page;
+ unsigned long size;
+ unsigned long offset;
+ unsigned long displacement;
+};
+
+struct bts_buffer {
+ size_t real_size; /* multiple of BTS_RECORD_SIZE */
+ unsigned int nr_pages;
+ unsigned int nr_bufs;
+ unsigned int cur_buf;
+ bool snapshot;
+ local_t data_size;
+ local_t lost;
+ local_t head;
+ unsigned long end;
+ void **data_pages;
+ struct bts_phys buf[0];
+};
+
+struct pmu bts_pmu;
+
+void intel_pmu_enable_bts(u64 config);
+void intel_pmu_disable_bts(void);
+
+static size_t buf_size(struct page *page)
+{
+ return 1 << (PAGE_SHIFT + page_private(page));
+}
+
+static void *
+bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite)
+{
+ struct bts_buffer *buf;
+ struct page *page;
+ int node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+ unsigned long offset;
+ size_t size = nr_pages << PAGE_SHIFT;
+ int pg, nbuf, pad;
+
+ /* count all the high order buffers */
+ for (pg = 0, nbuf = 0; pg < nr_pages;) {
+ page = virt_to_page(pages[pg]);
+ if (WARN_ON_ONCE(!PagePrivate(page) && nr_pages > 1))
+ return NULL;
+ pg += 1 << page_private(page);
+ nbuf++;
+ }
+
+ /*
+ * to avoid interrupts in overwrite mode, only allow one physical
+ */
+ if (overwrite && nbuf > 1)
+ return NULL;
+
+ buf = kzalloc_node(offsetof(struct bts_buffer, buf[nbuf]), GFP_KERNEL, node);
+ if (!buf)
+ return NULL;
+
+ buf->nr_pages = nr_pages;
+ buf->nr_bufs = nbuf;
+ buf->snapshot = overwrite;
+ buf->data_pages = pages;
+ buf->real_size = size - size % BTS_RECORD_SIZE;
+
+ for (pg = 0, nbuf = 0, offset = 0, pad = 0; nbuf < buf->nr_bufs; nbuf++) {
+ unsigned int __nr_pages;
+
+ page = virt_to_page(pages[pg]);
+ __nr_pages = PagePrivate(page) ? 1 << page_private(page) : 1;
+ buf->buf[nbuf].page = page;
+ buf->buf[nbuf].offset = offset;
+ buf->buf[nbuf].displacement = (pad ? BTS_RECORD_SIZE - pad : 0);
+ buf->buf[nbuf].size = buf_size(page) - buf->buf[nbuf].displacement;
+ pad = buf->buf[nbuf].size % BTS_RECORD_SIZE;
+ buf->buf[nbuf].size -= pad;
+
+ pg += __nr_pages;
+ offset += __nr_pages << PAGE_SHIFT;
+ }
+
+ return buf;
+}
+
+static void bts_buffer_free_aux(void *data)
+{
+ kfree(data);
+}
+
+static unsigned long bts_buffer_offset(struct bts_buffer *buf, unsigned int idx)
+{
+ return buf->buf[idx].offset + buf->buf[idx].displacement;
+}
+
+static void
+bts_config_buffer(struct bts_buffer *buf)
+{
+ int cpu = raw_smp_processor_id();
+ struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+ struct bts_phys *phys = &buf->buf[buf->cur_buf];
+ unsigned long index, thresh = 0, end = phys->size;
+ struct page *page = phys->page;
+
+ index = local_read(&buf->head);
+
+ if (!buf->snapshot) {
+ if (buf->end < phys->offset + buf_size(page))
+ end = buf->end - phys->offset - phys->displacement;
+
+ index -= phys->offset + phys->displacement;
+
+ if (end - index > BTS_SAFETY_MARGIN)
+ thresh = end - BTS_SAFETY_MARGIN;
+ else if (end - index > BTS_RECORD_SIZE)
+ thresh = end - BTS_RECORD_SIZE;
+ else
+ thresh = end;
+ }
+
+ ds->bts_buffer_base = (u64)page_address(page) + phys->displacement;
+ ds->bts_index = ds->bts_buffer_base + index;
+ ds->bts_absolute_maximum = ds->bts_buffer_base + end;
+ ds->bts_interrupt_threshold = !buf->snapshot
+ ? ds->bts_buffer_base + thresh
+ : ds->bts_absolute_maximum + BTS_RECORD_SIZE;
+}
+
+static void bts_buffer_pad_out(struct bts_phys *phys, unsigned long head)
+{
+ unsigned long index = head - phys->offset;
+
+ memset(page_address(phys->page) + index, 0, phys->size - index);
+}
+
+static bool bts_buffer_is_full(struct bts_buffer *buf, struct bts_ctx *bts)
+{
+ if (buf->snapshot)
+ return false;
+
+ if (local_read(&buf->data_size) >= bts->handle.size ||
+ bts->handle.size - local_read(&buf->data_size) < BTS_RECORD_SIZE)
+ return true;
+
+ return false;
+}
+
+static void bts_update(struct bts_ctx *bts)
+{
+ int cpu = raw_smp_processor_id();
+ struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+ struct bts_buffer *buf = perf_get_aux(&bts->handle);
+ unsigned long index = ds->bts_index - ds->bts_buffer_base, old, head;
+
+ if (!buf)
+ return;
+
+ head = index + bts_buffer_offset(buf, buf->cur_buf);
+ old = local_xchg(&buf->head, head);
+
+ if (!buf->snapshot) {
+ if (old == head)
+ return;
+
+ if (ds->bts_index >= ds->bts_absolute_maximum)
+ local_inc(&buf->lost);
+
+ /*
+ * old and head are always in the same physical buffer, so we
+ * can subtract them to get the data size.
+ */
+ local_add(head - old, &buf->data_size);
+ } else {
+ local_set(&buf->data_size, head);
+ }
+}
+
+static void __bts_event_start(struct perf_event *event)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+ struct bts_buffer *buf = perf_get_aux(&bts->handle);
+ u64 config = 0;
+
+ if (!buf || bts_buffer_is_full(buf, bts))
+ return;
+
+ event->hw.state = 0;
+
+ if (!buf->snapshot)
+ config |= ARCH_PERFMON_EVENTSEL_INT;
+ if (!event->attr.exclude_kernel)
+ config |= ARCH_PERFMON_EVENTSEL_OS;
+ if (!event->attr.exclude_user)
+ config |= ARCH_PERFMON_EVENTSEL_USR;
+
+ bts_config_buffer(buf);
+
+ /*
+ * local barrier to make sure that ds configuration made it
+ * before we enable BTS
+ */
+ wmb();
+
+ intel_pmu_enable_bts(config);
+}
+
+static void bts_event_start(struct perf_event *event, int flags)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+ __bts_event_start(event);
+
+ /* PMI handler: this counter is running and likely generating PMIs */
+ ACCESS_ONCE(bts->started) = 1;
+}
+
+static void __bts_event_stop(struct perf_event *event)
+{
+ /*
+ * No extra synchronization is mandated by the documentation to have
+ * BTS data stores globally visible.
+ */
+ intel_pmu_disable_bts();
+
+ if (event->hw.state & PERF_HES_STOPPED)
+ return;
+
+ ACCESS_ONCE(event->hw.state) |= PERF_HES_STOPPED;
+}
+
+static void bts_event_stop(struct perf_event *event, int flags)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+ /* PMI handler: don't restart this counter */
+ ACCESS_ONCE(bts->started) = 0;
+
+ __bts_event_stop(event);
+
+ if (flags & PERF_EF_UPDATE)
+ bts_update(bts);
+}
+
+void intel_bts_enable_local(void)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+ if (bts->handle.event && bts->started)
+ __bts_event_start(bts->handle.event);
+}
+
+void intel_bts_disable_local(void)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+ if (bts->handle.event)
+ __bts_event_stop(bts->handle.event);
+}
+
+static int
+bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
+{
+ unsigned long head, space, next_space, pad, gap, skip, wakeup;
+ unsigned int next_buf;
+ struct bts_phys *phys, *next_phys;
+ int ret;
+
+ if (buf->snapshot)
+ return 0;
+
+ head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
+ if (WARN_ON_ONCE(head != local_read(&buf->head)))
+ return -EINVAL;
+
+ phys = &buf->buf[buf->cur_buf];
+ space = phys->offset + phys->displacement + phys->size - head;
+ pad = space;
+ if (space > handle->size) {
+ space = handle->size;
+ space -= space % BTS_RECORD_SIZE;
+ }
+ if (space <= BTS_SAFETY_MARGIN) {
+ /* See if next phys buffer has more space */
+ next_buf = buf->cur_buf + 1;
+ if (next_buf >= buf->nr_bufs)
+ next_buf = 0;
+ next_phys = &buf->buf[next_buf];
+ gap = buf_size(phys->page) - phys->displacement - phys->size +
+ next_phys->displacement;
+ skip = pad + gap;
+ if (handle->size >= skip) {
+ next_space = next_phys->size;
+ if (next_space + skip > handle->size) {
+ next_space = handle->size - skip;
+ next_space -= next_space % BTS_RECORD_SIZE;
+ }
+ if (next_space > space || !space) {
+ if (pad)
+ bts_buffer_pad_out(phys, head);
+ ret = perf_aux_output_skip(handle, skip);
+ if (ret)
+ return ret;
+ /* Advance to next phys buffer */
+ phys = next_phys;
+ space = next_space;
+ head = phys->offset + phys->displacement;
+ /*
+ * After this, cur_buf and head won't match ds
+ * anymore, so we must not be racing with
+ * bts_update().
+ */
+ buf->cur_buf = next_buf;
+ local_set(&buf->head, head);
+ }
+ }
+ }
+
+ /* Don't go far beyond wakeup watermark */
+ wakeup = BTS_SAFETY_MARGIN + BTS_RECORD_SIZE + handle->wakeup -
+ handle->head;
+ if (space > wakeup) {
+ space = wakeup;
+ space -= space % BTS_RECORD_SIZE;
+ }
+
+ buf->end = head + space;
+
+ /*
+ * If we have no space, the lost notification would have been sent when
+ * we hit absolute_maximum - see bts_update()
+ */
+ if (!space)
+ return -ENOSPC;
+
+ return 0;
+}
+
+int intel_bts_interrupt(void)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+ struct perf_event *event = bts->handle.event;
+ struct bts_buffer *buf;
+ s64 old_head;
+ int err;
+
+ if (!event || !bts->started)
+ return 0;
+
+ buf = perf_get_aux(&bts->handle);
+ /*
+ * Skip snapshot counters: they don't use the interrupt, but
+ * there's no other way of telling, because the pointer will
+ * keep moving
+ */
+ if (!buf || buf->snapshot)
+ return 0;
+
+ old_head = local_read(&buf->head);
+ bts_update(bts);
+
+ /* no new data */
+ if (old_head == local_read(&buf->head))
+ return 0;
+
+ perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+ !!local_xchg(&buf->lost, 0));
+
+ buf = perf_aux_output_begin(&bts->handle, event);
+ if (!buf)
+ return 1;
+
+ err = bts_buffer_reset(buf, &bts->handle);
+ if (err)
+ perf_aux_output_end(&bts->handle, 0, false);
+
+ return 1;
+}
+
+static void bts_event_del(struct perf_event *event, int mode)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+ struct bts_buffer *buf = perf_get_aux(&bts->handle);
+
+ bts_event_stop(event, PERF_EF_UPDATE);
+
+ if (buf) {
+ if (buf->snapshot)
+ bts->handle.head =
+ local_xchg(&buf->data_size,
+ buf->nr_pages << PAGE_SHIFT);
+ perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+ !!local_xchg(&buf->lost, 0));
+ }
+
+ cpuc->ds->bts_index = bts->ds_back.bts_buffer_base;
+ cpuc->ds->bts_buffer_base = bts->ds_back.bts_buffer_base;
+ cpuc->ds->bts_absolute_maximum = bts->ds_back.bts_absolute_maximum;
+ cpuc->ds->bts_interrupt_threshold = bts->ds_back.bts_interrupt_threshold;
+}
+
+static int bts_event_add(struct perf_event *event, int mode)
+{
+ struct bts_buffer *buf;
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+ int ret = -EBUSY;
+
+ event->hw.state = PERF_HES_STOPPED;
+
+ if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
+ return -EBUSY;
+
+ if (bts->handle.event)
+ return -EBUSY;
+
+ buf = perf_aux_output_begin(&bts->handle, event);
+ if (!buf)
+ return -EINVAL;
+
+ ret = bts_buffer_reset(buf, &bts->handle);
+ if (ret) {
+ perf_aux_output_end(&bts->handle, 0, false);
+ return ret;
+ }
+
+ bts->ds_back.bts_buffer_base = cpuc->ds->bts_buffer_base;
+ bts->ds_back.bts_absolute_maximum = cpuc->ds->bts_absolute_maximum;
+ bts->ds_back.bts_interrupt_threshold = cpuc->ds->bts_interrupt_threshold;
+
+ if (mode & PERF_EF_START) {
+ bts_event_start(event, 0);
+ if (hwc->state & PERF_HES_STOPPED) {
+ bts_event_del(event, 0);
+ return -EBUSY;
+ }
+ }
+
+ return 0;
+}
+
+static void bts_event_destroy(struct perf_event *event)
+{
+ x86_del_exclusive(x86_lbr_exclusive_bts);
+}
+
+static int bts_event_init(struct perf_event *event)
+{
+ if (event->attr.type != bts_pmu.type)
+ return -ENOENT;
+
+ if (x86_add_exclusive(x86_lbr_exclusive_bts))
+ return -EBUSY;
+
+ event->destroy = bts_event_destroy;
+
+ return 0;
+}
+
+static void bts_event_read(struct perf_event *event)
+{
+}
+
+static __init int bts_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
+ return -ENODEV;
+
+ bts_pmu.capabilities = PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE;
+ bts_pmu.task_ctx_nr = perf_hw_context;
+ bts_pmu.event_init = bts_event_init;
+ bts_pmu.add = bts_event_add;
+ bts_pmu.del = bts_event_del;
+ bts_pmu.start = bts_event_start;
+ bts_pmu.stop = bts_event_stop;
+ bts_pmu.read = bts_event_read;
+ bts_pmu.setup_aux = bts_buffer_setup_aux;
+ bts_pmu.free_aux = bts_buffer_free_aux;
+
+ return perf_pmu_register(&bts_pmu, "intel_bts", -1);
+}
+
+module_init(bts_init);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 46211bcc81..0b5d97e0a2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -461,7 +461,8 @@ void intel_pmu_enable_bts(u64 config)

debugctlmsr |= DEBUGCTLMSR_TR;
debugctlmsr |= DEBUGCTLMSR_BTS;
- debugctlmsr |= DEBUGCTLMSR_BTINT;
+ if (config & ARCH_PERFMON_EVENTSEL_INT)
+ debugctlmsr |= DEBUGCTLMSR_BTINT;

if (!(config & ARCH_PERFMON_EVENTSEL_OS))
debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS;
--
2.1.1

2014-11-14 13:46:00

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started

For counters that generate AUX data that is bound to the context of a
running task, such as instruction tracing, the decoder needs to know
exactly which task is running when the event is first scheduled in,
before the first sched_switch. The decoder's need to know this stems
from the fact that instruction flow trace decoding will almost always
require program's object code in order to reconstruct said flow and
for that we need at least its pid/tid in the perf stream.

To single out such instruction tracing pmus, this patch introduces
ITRACE PMU capability. The reason this is not part of RECORD_AUX
record is that not all pmus capable of generating AUX data need this,
and the opposite is *probably* also true.

While sched_switch covers for most cases, there are two problems with it:
the consumer will need to process events out of order (that is, having
found RECORD_AUX, it will have to skip forward to the nearest sched_switch
to figure out which task it was, then go back to the actual trace to
decode it) and it completely misses the case when the tracing is enabled
and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 4 ++++
include/uapi/linux/perf_event.h | 11 +++++++++++
kernel/events/core.c | 41 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 56 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f126eb89e6..282721b2df 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -127,6 +127,9 @@ struct hw_perf_event {
/* for tp_event->class */
struct list_head tp_list;
};
+ struct { /* itrace */
+ int itrace_started;
+ };
#ifdef CONFIG_HAVE_HW_BREAKPOINT
struct { /* breakpoint */
/*
@@ -174,6 +177,7 @@ struct perf_event;
#define PERF_PMU_CAP_AUX_NO_SG 0x02
#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 0x04
#define PERF_PMU_CAP_EXCLUSIVE 0x08
+#define PERF_PMU_CAP_ITRACE 0x10

/**
* struct pmu - generic performance monitoring unit
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ffd52c0861..78017276fe 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -750,6 +750,17 @@ enum perf_event_type {
*/
PERF_RECORD_AUX = 11,

+ /*
+ * Indicates that instruction trace has started
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u32 pid;
+ * u32 tid;
+ * };
+ */
+ PERF_RECORD_ITRACE_START = 12,
+
PERF_RECORD_MAX, /* non-ABI */
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index da2e44aedd..988ef2380a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1733,6 +1733,7 @@ static void perf_set_shadow_time(struct perf_event *event,
#define MAX_INTERRUPTS (~0ULL)

static void perf_log_throttle(struct perf_event *event, int enable);
+static void perf_log_itrace_start(struct perf_event *event);

static int
event_sched_in(struct perf_event *event,
@@ -1767,6 +1768,8 @@ event_sched_in(struct perf_event *event,

perf_pmu_disable(event->pmu);

+ perf_log_itrace_start(event);
+
if (event->pmu->add(event, PERF_EF_START)) {
event->state = PERF_EVENT_STATE_INACTIVE;
event->oncpu = -1;
@@ -5681,6 +5684,44 @@ static void perf_log_throttle(struct perf_event *event, int enable)
perf_output_end(&handle);
}

+static void perf_log_itrace_start(struct perf_event *event)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ struct perf_aux_event {
+ struct perf_event_header header;
+ u32 pid;
+ u32 tid;
+ } rec;
+ int ret;
+
+ if (event->parent)
+ event = event->parent;
+
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_ITRACE) ||
+ event->hw.itrace_started)
+ return;
+
+ event->hw.itrace_started = 1;
+
+ rec.header.type = PERF_RECORD_ITRACE_START;
+ rec.header.misc = 0;
+ rec.header.size = sizeof(rec);
+ rec.pid = perf_event_pid(event, current);
+ rec.tid = perf_event_tid(event, current);
+
+ perf_event_header__init_id(&rec.header, &sample, event);
+ ret = perf_output_begin(&handle, event, rec.header.size);
+
+ if (ret)
+ return;
+
+ perf_output_put(&handle, rec);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
/*
* Generic event overflow handling, sampling.
*/
--
2.1.1

2014-11-14 13:47:39

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 04/14] perf: Add a capability for AUX_NO_SG pmus to do software double buffering

For pmus that don't support scatter-gather for AUX data in hardware, it
might still make sense to implement software double buffering to avoid
losing data while the user is reading data out. For this purpose, add
a pmu capability that guarantees multiple high-order chunks for AUX buffer,
so that the pmu driver can do switchover tricks.

To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your
pmu's capability mask. This will make the ring buffer AUX allocation code
ensure that the biggest high order allocation for the aux buffer pages is
no bigger than half of the total requested buffer size, thus making sure
that the buffer has at least two high order allocations.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/ring_buffer.c | 15 ++++++++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 71df948b4d..e7a15f5c3f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -172,6 +172,7 @@ struct perf_event;
*/
#define PERF_PMU_CAP_NO_INTERRUPT 0x01
#define PERF_PMU_CAP_AUX_NO_SG 0x02
+#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 0x04

/**
* struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 6e700e8fd6..608c0d67ad 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -286,13 +286,26 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
if (!has_aux(event))
return -ENOTSUPP;

- if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
/*
* We need to start with the max_order that fits in nr_pages,
* not the other way around, hence ilog2() and not get_order.
*/
max_order = ilog2(nr_pages);

+ /*
+ * PMU requests more than one contiguous chunks of memory
+ * for SW double buffering
+ */
+ if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) &&
+ !overwrite) {
+ if (!max_order)
+ return -EINVAL;
+
+ max_order--;
+ }
+ }
+
rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
if (!rb->aux_pages)
return -ENOMEM;
--
2.1.1

2014-11-14 13:45:16

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver

Add support for Intel Processor Trace (PT) to kernel's perf events.
PT is an extension of Intel Architecture that collects information about
software execuction such as control flow, execution modes and timings and
formats it into highly compressed binary packets. Even being compressed,
these packets are generated at hundreds of megabytes per second per core,
which makes it impractical to decode them on the fly in the kernel.

This driver exports trace data by through AUX space in the perf ring
buffer, which is zero-copy mapped into userspace for faster data retrieval.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/include/uapi/asm/msr-index.h | 18 +
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/intel_pt.h | 131 ++++
arch/x86/kernel/cpu/perf_event.h | 2 +
arch/x86/kernel/cpu/perf_event_intel.c | 8 +
arch/x86/kernel/cpu/perf_event_intel_pt.c | 1082 +++++++++++++++++++++++++++++
6 files changed, 1242 insertions(+)
create mode 100644 arch/x86/kernel/cpu/intel_pt.h
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index e21331ce36..caae296ab7 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -74,6 +74,24 @@
#define MSR_IA32_PERF_CAPABILITIES 0x00000345
#define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6

+#define MSR_IA32_RTIT_CTL 0x00000570
+#define RTIT_CTL_TRACEEN BIT(0)
+#define RTIT_CTL_OS BIT(2)
+#define RTIT_CTL_USR BIT(3)
+#define RTIT_CTL_CR3EN BIT(7)
+#define RTIT_CTL_TOPA BIT(8)
+#define RTIT_CTL_TSC_EN BIT(10)
+#define RTIT_CTL_DISRETC BIT(11)
+#define RTIT_CTL_BRANCH_EN BIT(13)
+#define MSR_IA32_RTIT_STATUS 0x00000571
+#define RTIT_STATUS_CONTEXTEN BIT(1)
+#define RTIT_STATUS_TRIGGEREN BIT(2)
+#define RTIT_STATUS_ERROR BIT(4)
+#define RTIT_STATUS_STOPPED BIT(5)
+#define MSR_IA32_RTIT_CR3_MATCH 0x00000572
+#define MSR_IA32_RTIT_OUTPUT_BASE 0x00000560
+#define MSR_IA32_RTIT_OUTPUT_MASK 0x00000561
+
#define MSR_MTRRfix64K_00000 0x00000250
#define MSR_MTRRfix16K_80000 0x00000258
#define MSR_MTRRfix16K_A0000 0x00000259
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index e27b49d7c9..4bc29fa23e 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -40,6 +40,7 @@ endif
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_p6.o perf_event_knc.o perf_event_p4.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o

obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_snb.o \
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
new file mode 100644
index 0000000000..1c338b0eba
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -0,0 +1,131 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Single-entry ToPA: when this close to region boundary, switch
+ * buffers to avoid losing data.
+ */
+#define TOPA_PMI_MARGIN 512
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+ TOPA_4K = 0,
+ TOPA_8K,
+ TOPA_16K,
+ TOPA_32K,
+ TOPA_64K,
+ TOPA_128K,
+ TOPA_256K,
+ TOPA_512K,
+ TOPA_1MB,
+ TOPA_2MB,
+ TOPA_4MB,
+ TOPA_8MB,
+ TOPA_16MB,
+ TOPA_32MB,
+ TOPA_64MB,
+ TOPA_128MB,
+ TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+ return 1 << (tsz + 12);
+};
+
+struct topa_entry {
+ u64 end : 1;
+ u64 rsvd0 : 1;
+ u64 intr : 1;
+ u64 rsvd1 : 1;
+ u64 stop : 1;
+ u64 rsvd2 : 1;
+ u64 size : 4;
+ u64 rsvd3 : 2;
+ u64 base : 36;
+ u64 rsvd4 : 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+enum pt_capabilities {
+ PT_CAP_max_subleaf = 0,
+ PT_CAP_cr3_filtering,
+ PT_CAP_topa_output,
+ PT_CAP_topa_multiple_entries,
+ PT_CAP_payloads_lip,
+};
+
+struct pt_pmu {
+ struct pmu pmu;
+ u32 caps[4 * PT_CPUID_LEAVES];
+};
+
+/**
+ * struct pt_buffer - buffer configuration; one buffer per task_struct or
+ * cpu, depending on perf event configuration
+ * @cpu: cpu for per-cpu allocation
+ * @tables: list of ToPA tables in this buffer
+ * @first: shorthand for first topa table
+ * @last: shorthand for last topa table
+ * @cur: current topa table
+ * @nr_pages: buffer size in pages
+ * @cur_idx: current output region's index within @cur table
+ * @output_off: offset within the current output region
+ * @data_size: running total of the amount of data in this buffer
+ * @lost: if data was lost/truncated
+ * @head: logical write offset inside the buffer
+ * @snapshot: if this is for a snapshot/overwrite counter
+ * @stop_pos: STOP topa entry in the buffer
+ * @intr_pos: INT topa entry in the buffer
+ * @data_pages: array of pages from perf
+ * @topa_index: table of topa entries indexed by page offset
+ */
+struct pt_buffer {
+ int cpu;
+ struct list_head tables;
+ struct topa *first, *last, *cur;
+ unsigned int cur_idx;
+ size_t output_off;
+ unsigned long nr_pages;
+ local_t data_size;
+ local_t lost;
+ local64_t head;
+ bool snapshot;
+ unsigned long stop_pos, intr_pos;
+ void **data_pages;
+ struct topa_entry *topa_index[0];
+};
+
+/**
+ * struct pt - per-cpu pt context
+ * @handle: perf output handle
+ * @handle_nmi: do handle PT PMI on this cpu, there's an active event
+ */
+struct pt {
+ struct perf_output_handle handle;
+ int handle_nmi;
+};
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 620612d311..9ab6c59851 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -783,6 +783,8 @@ void intel_pmu_lbr_init_snb(void);

int intel_pmu_setup_lbr_filter(struct perf_event *event);

+void intel_pt_interrupt(void);
+
int p4_pmu_init(void);

int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index c64e97a1cb..5a29ea59d6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1385,6 +1385,14 @@ again:
}

/*
+ * Intel PT
+ */
+ if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+ handled++;
+ intel_pt_interrupt();
+ }
+
+ /*
* Checkpointed counters can lead to 'spurious' PMIs because the
* rollback caused by the PMI will have cleared the overflow status
* bit. Therefore always force probe these counters.
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
new file mode 100644
index 0000000000..f2fda8f038
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -0,0 +1,1082 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/device.h>
+
+#include <asm/perf_event.h>
+#include <asm/insn.h>
+
+#include "perf_event.h"
+#include "intel_pt.h"
+
+static DEFINE_PER_CPU(struct pt, pt_ctx);
+
+static struct pt_pmu pt_pmu;
+
+enum cpuid_regs {
+ CR_EAX = 0,
+ CR_ECX,
+ CR_EDX,
+ CR_EBX
+};
+
+/*
+ * Capabilities of Intel PT hardware, such as number of address bits or
+ * supported output schemes, are cached and exported to userspace as "caps"
+ * attribute group of pt pmu device
+ * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
+ * relevant bits together with intel_pt traces.
+ *
+ * These are necessary for both trace decoding (payloads_lip, contains address
+ * width encoded in IP-related packets), and event configuration (bitmasks with
+ * permitted values for certain bit fields).
+ */
+#define PT_CAP(_n, _l, _r, _m) \
+ [PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
+ .reg = _r, .mask = _m }
+
+static struct pt_cap_desc {
+ const char *name;
+ u32 leaf;
+ u8 reg;
+ u32 mask;
+} pt_caps[] = {
+ PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
+ PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
+ PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
+ PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
+ PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
+};
+
+static u32 pt_cap_get(enum pt_capabilities cap)
+{
+ struct pt_cap_desc *cd = &pt_caps[cap];
+ u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+ unsigned int shift = __ffs(cd->mask);
+
+ return (c & cd->mask) >> shift;
+}
+
+static ssize_t pt_cap_show(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dev_ext_attribute *ea =
+ container_of(attr, struct dev_ext_attribute, attr);
+ enum pt_capabilities cap = (long)ea->var;
+
+ return snprintf(buf, PAGE_SIZE, "%x\n", pt_cap_get(cap));
+}
+
+static struct attribute_group pt_cap_group = {
+ .name = "caps",
+};
+
+PMU_FORMAT_ATTR(tsc, "config:10" );
+PMU_FORMAT_ATTR(noretcomp, "config:11" );
+
+static struct attribute *pt_formats_attr[] = {
+ &format_attr_tsc.attr,
+ &format_attr_noretcomp.attr,
+ NULL,
+};
+
+static struct attribute_group pt_format_group = {
+ .name = "format",
+ .attrs = pt_formats_attr,
+};
+
+static const struct attribute_group *pt_attr_groups[] = {
+ &pt_cap_group,
+ &pt_format_group,
+ NULL,
+};
+
+static int __init pt_pmu_hw_init(void)
+{
+ struct dev_ext_attribute *de_attrs;
+ struct attribute **attrs;
+ size_t size;
+ long i;
+
+ if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
+ for (i = 0; i < PT_CPUID_LEAVES; i++)
+ cpuid_count(20, i,
+ &pt_pmu.caps[CR_EAX + i * 4],
+ &pt_pmu.caps[CR_EBX + i * 4],
+ &pt_pmu.caps[CR_ECX + i * 4],
+ &pt_pmu.caps[CR_EDX + i * 4]);
+ } else {
+ return -ENODEV;
+ }
+
+ size = sizeof(struct attribute *) * (ARRAY_SIZE(pt_caps) + 1);
+ attrs = kzalloc(size, GFP_KERNEL);
+ if (!attrs)
+ goto err_attrs;
+
+ size = sizeof(struct dev_ext_attribute) * (ARRAY_SIZE(pt_caps) + 1);
+ de_attrs = kzalloc(size, GFP_KERNEL);
+ if (!de_attrs)
+ goto err_de_attrs;
+
+ for (i = 0; i < ARRAY_SIZE(pt_caps); i++) {
+ de_attrs[i].attr.attr.name = pt_caps[i].name;
+
+ sysfs_attr_init(&de_attrs[i].attr.attr);
+ de_attrs[i].attr.attr.mode = S_IRUGO;
+ de_attrs[i].attr.show = pt_cap_show;
+ de_attrs[i].var = (void *)i;
+ attrs[i] = &de_attrs[i].attr.attr;
+ }
+
+ pt_cap_group.attrs = attrs;
+ return 0;
+
+err_de_attrs:
+ kfree(de_attrs);
+err_attrs:
+ kfree(attrs);
+
+ return -ENOMEM;
+}
+
+#define PT_CONFIG_MASK (RTIT_CTL_TSC_EN | RTIT_CTL_DISRETC)
+
+static bool pt_event_valid(struct perf_event *event)
+{
+ u64 config = event->attr.config;
+
+ if ((config & PT_CONFIG_MASK) != config)
+ return false;
+
+ return true;
+}
+
+/*
+ * PT configuration helpers
+ * These all are cpu affine and operate on a local PT
+ */
+
+static bool pt_is_running(void)
+{
+ u64 ctl;
+
+ rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+ return !!(ctl & RTIT_CTL_TRACEEN);
+}
+
+static void pt_config(struct perf_event *event)
+{
+ u64 reg;
+
+ reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
+
+ if (!event->attr.exclude_kernel)
+ reg |= RTIT_CTL_OS;
+ if (!event->attr.exclude_user)
+ reg |= RTIT_CTL_USR;
+
+ reg |= (event->attr.config & PT_CONFIG_MASK);
+
+ wrmsrl(MSR_IA32_RTIT_CTL, reg);
+}
+
+static void pt_config_start(bool start)
+{
+ u64 ctl;
+
+ rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+ if (start)
+ ctl |= RTIT_CTL_TRACEEN;
+ else
+ ctl &= ~RTIT_CTL_TRACEEN;
+ wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+ /*
+ * A wrmsr that disables trace generation serializes other PT
+ * registers and causes all data packets to be written to memory,
+ * but a fence is required for the data to become globally visible.
+ *
+ * The below WMB, separating data store and aux_head store matches
+ * the consumer's RMB that separates aux_head load and data load.
+ */
+ if (!start)
+ wmb();
+}
+
+static void pt_config_buffer(void *buf, unsigned int topa_idx,
+ unsigned int output_off)
+{
+ u64 reg;
+
+ wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(buf));
+
+ reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+ wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+/*
+ * Keep ToPA table-related metadata on the same page as the actual table,
+ * taking up a few words from the top
+ */
+
+#define TENTS_PER_PAGE (((PAGE_SIZE - 40) / sizeof(struct topa_entry)) - 1)
+
+/**
+ * struct topa - page-sized ToPA table with metadata at the top
+ * @table: actual ToPA table entries, as understood by PT hardware
+ * @list: linkage to struct pt_buffer's list of tables
+ * @phys: physical address of this page
+ * @offset: offset of the first entry in this table in the buffer
+ * @size: total size of all entries in this table
+ * @last: index of the last initialized entry in this table
+ */
+struct topa {
+ struct topa_entry table[TENTS_PER_PAGE];
+ struct list_head list;
+ u64 phys;
+ u64 offset;
+ size_t size;
+ int last;
+};
+
+/* make -1 stand for the last table entry */
+#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])
+
+/**
+ * topa_alloc() - allocate page-sized ToPA table
+ * @cpu: CPU on which to allocate.
+ * @gfp: Allocation flags.
+ *
+ * Return: On success, return the pointer to ToPA table page.
+ */
+static struct topa *topa_alloc(int cpu, gfp_t gfp)
+{
+ int node = cpu_to_node(cpu);
+ struct topa *topa;
+ struct page *p;
+
+ p = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
+ if (!p)
+ return NULL;
+
+ topa = page_address(p);
+ topa->last = 0;
+ topa->phys = page_to_phys(p);
+
+ /*
+ * In case of singe-entry ToPA, always put the self-referencing END
+ * link as the 2nd entry in the table
+ */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ TOPA_ENTRY(topa, 1)->base = topa->phys >> TOPA_SHIFT;
+ TOPA_ENTRY(topa, 1)->end = 1;
+ }
+
+ return topa;
+}
+
+/**
+ * topa_free() - free a page-sized ToPA table
+ * @topa: Table to deallocate.
+ */
+static void topa_free(struct topa *topa)
+{
+ free_page((unsigned long)topa);
+}
+
+/**
+ * topa_insert_table() - insert a ToPA table into a buffer
+ * @buf: PT buffer that's being extended.
+ * @topa: New topa table to be inserted.
+ *
+ * If it's the first table in this buffer, set up buffer's pointers
+ * accordingly; otherwise, add a END=1 link entry to @topa to the current
+ * "last" table and adjust the last table pointer to @topa.
+ */
+static void topa_insert_table(struct pt_buffer *buf, struct topa *topa)
+{
+ struct topa *last = buf->last;
+
+ list_add_tail(&topa->list, &buf->tables);
+
+ if (!buf->first) {
+ buf->first = buf->last = buf->cur = topa;
+ return;
+ }
+
+ topa->offset = last->offset + last->size;
+ buf->last = topa;
+
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+ return;
+
+ BUG_ON(last->last != TENTS_PER_PAGE - 1);
+
+ TOPA_ENTRY(last, -1)->base = topa->phys >> TOPA_SHIFT;
+ TOPA_ENTRY(last, -1)->end = 1;
+}
+
+/**
+ * topa_table_full() - check if a ToPA table is filled up
+ * @topa: ToPA table.
+ */
+static bool topa_table_full(struct topa *topa)
+{
+ /* single-entry ToPA is a special case */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+ return !!topa->last;
+
+ return topa->last == TENTS_PER_PAGE - 1;
+}
+
+/**
+ * topa_insert_pages() - create a list of ToPA tables
+ * @buf: PT buffer being initialized.
+ * @gfp: Allocation flags.
+ *
+ * This initializes a list of ToPA tables with entries from
+ * the data_pages provided by rb_alloc_aux().
+ *
+ * Return: 0 on success or error code.
+ */
+static int topa_insert_pages(struct pt_buffer *buf, gfp_t gfp)
+{
+ struct topa *topa = buf->last;
+ int order = 0;
+ struct page *p;
+
+ p = virt_to_page(buf->data_pages[buf->nr_pages]);
+ if (PagePrivate(p))
+ order = page_private(p);
+
+ if (topa_table_full(topa)) {
+ topa = topa_alloc(buf->cpu, gfp);
+ if (!topa)
+ return -ENOMEM;
+
+ topa_insert_table(buf, topa);
+ }
+
+ TOPA_ENTRY(topa, -1)->base = page_to_phys(p) >> TOPA_SHIFT;
+ TOPA_ENTRY(topa, -1)->size = order;
+ if (!buf->snapshot && !pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ TOPA_ENTRY(topa, -1)->intr = 1;
+ TOPA_ENTRY(topa, -1)->stop = 1;
+ }
+
+ topa->last++;
+ topa->size += sizes(order);
+
+ buf->nr_pages += 1ul << order;
+
+ return 0;
+}
+
+/**
+ * pt_topa_dump() - print ToPA tables and their entries
+ * @buf: PT buffer.
+ */
+static void pt_topa_dump(struct pt_buffer *buf)
+{
+ struct topa *topa;
+
+ list_for_each_entry(topa, &buf->tables, list) {
+ int i;
+
+ pr_debug("# table @%p (%p), off %llx size %zx\n", topa->table,
+ (void *)topa->phys, topa->offset, topa->size);
+ for (i = 0; i < TENTS_PER_PAGE; i++) {
+ pr_debug("# entry @%p (%lx sz %u %c%c%c) raw=%16llx\n",
+ &topa->table[i],
+ (unsigned long)topa->table[i].base << TOPA_SHIFT,
+ sizes(topa->table[i].size),
+ topa->table[i].end ? 'E' : ' ',
+ topa->table[i].intr ? 'I' : ' ',
+ topa->table[i].stop ? 'S' : ' ',
+ *(u64 *)&topa->table[i]);
+ if ((pt_cap_get(PT_CAP_topa_multiple_entries) &&
+ topa->table[i].stop) ||
+ topa->table[i].end)
+ break;
+ }
+ }
+}
+
+/**
+ * pt_buffer_advance() - advance to the next output region
+ * @buf: PT buffer.
+ *
+ * Advance the current pointers in the buffer to the next ToPA entry.
+ */
+static void pt_buffer_advance(struct pt_buffer *buf)
+{
+ buf->output_off = 0;
+ buf->cur_idx++;
+
+ if (buf->cur_idx == buf->cur->last) {
+ if (buf->cur == buf->last)
+ buf->cur = buf->first;
+ else
+ buf->cur = list_entry(buf->cur->list.next, struct topa,
+ list);
+ buf->cur_idx = 0;
+ }
+}
+
+/**
+ * pt_update_head() - calculate current offsets and sizes
+ * @pt: Per-cpu pt context.
+ *
+ * Update buffer's current write pointer position and data size.
+ */
+static void pt_update_head(struct pt *pt)
+{
+ struct pt_buffer *buf = perf_get_aux(&pt->handle);
+ u64 topa_idx, base, old;
+
+ /* offset of the first region in this table from the beginning of buf */
+ base = buf->cur->offset + buf->output_off;
+
+ /* offset of the current output region within this table */
+ for (topa_idx = 0; topa_idx < buf->cur_idx; topa_idx++)
+ base += sizes(buf->cur->table[topa_idx].size);
+
+ if (buf->snapshot) {
+ local_set(&buf->data_size, base);
+ } else {
+ old = (local64_xchg(&buf->head, base) &
+ ((buf->nr_pages << PAGE_SHIFT) - 1));
+ if (base < old)
+ base += buf->nr_pages << PAGE_SHIFT;
+
+ local_add(base - old, &buf->data_size);
+ }
+}
+
+/**
+ * pt_buffer_region() - obtain current output region's address
+ * @buf: PT buffer.
+ */
+static void *pt_buffer_region(struct pt_buffer *buf)
+{
+ return phys_to_virt(buf->cur->table[buf->cur_idx].base << TOPA_SHIFT);
+}
+
+/**
+ * pt_buffer_region_size() - obtain current output region's size
+ * @buf: PT buffer.
+ */
+static size_t pt_buffer_region_size(struct pt_buffer *buf)
+{
+ return sizes(buf->cur->table[buf->cur_idx].size);
+}
+
+/**
+ * pt_handle_status() - take care of possible status conditions
+ * @pt: Per-cpu pt context.
+ */
+static void pt_handle_status(struct pt *pt)
+{
+ struct pt_buffer *buf = perf_get_aux(&pt->handle);
+ int advance = 0;
+ u64 status;
+
+ rdmsrl(MSR_IA32_RTIT_STATUS, status);
+
+ if (status & RTIT_STATUS_ERROR) {
+ pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");
+ pt_topa_dump(buf);
+ status &= ~RTIT_STATUS_ERROR;
+ }
+
+ if (status & RTIT_STATUS_STOPPED) {
+ status &= ~RTIT_STATUS_STOPPED;
+
+ /*
+ * On systems that only do single-entry ToPA, hitting STOP
+ * means we are already losing data; need to let the decoder
+ * know.
+ */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
+ buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
+ local_inc(&buf->lost);
+ advance++;
+ }
+ }
+
+ /*
+ * Also on single-entry ToPA implementations, interrupt will come
+ * before the output reaches its output region's boundary.
+ */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
+ pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {
+ void *head = pt_buffer_region(buf);
+
+ /* everything within this margin needs to be zeroed out */
+ memset(head + buf->output_off, 0,
+ pt_buffer_region_size(buf) -
+ buf->output_off);
+ advance++;
+ }
+
+ if (advance)
+ pt_buffer_advance(buf);
+
+ wrmsrl(MSR_IA32_RTIT_STATUS, status);
+}
+
+/**
+ * pt_read_offset() - translate registers into buffer pointers
+ * @buf: PT buffer.
+ *
+ * Set buffer's output pointers from MSR values.
+ */
+static void pt_read_offset(struct pt_buffer *buf)
+{
+ u64 offset, base_topa;
+
+ rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, base_topa);
+ buf->cur = phys_to_virt(base_topa);
+
+ rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, offset);
+ /* offset within current output region */
+ buf->output_off = offset >> 32;
+ /* index of current output region within this table */
+ buf->cur_idx = (offset & 0xffffff80) >> 7;
+}
+
+/**
+ * pt_topa_next_entry() - obtain index of the first page in the next ToPA entry
+ * @buf: PT buffer.
+ * @pg: Page offset in the buffer.
+ *
+ * When advancing to the next output region (ToPA entry), given a page offset
+ * into the buffer, we need to find the offset of the first page in the next
+ * region.
+ */
+static unsigned int pt_topa_next_entry(struct pt_buffer *buf, unsigned int pg)
+{
+ struct topa_entry *te = buf->topa_index[pg];
+
+ /* one region */
+ if (buf->first == buf->last && buf->first->last == 1)
+ return pg;
+
+ do {
+ pg++;
+ pg &= buf->nr_pages - 1;
+ } while (buf->topa_index[pg] == te);
+
+ return pg;
+}
+
+/**
+ * pt_buffer_reset_markers() - place interrupt and stop bits in the buffer
+ * @buf: PT buffer.
+ * @handle: Current output handle.
+ *
+ * Place INT and STOP marks to prevent overwriting old data that the consumer
+ * hasn't yet collected.
+ */
+static int pt_buffer_reset_markers(struct pt_buffer *buf,
+ struct perf_output_handle *handle)
+
+{
+ unsigned long idx, npages, end;
+
+ if (buf->snapshot)
+ return 0;
+
+ /* can't stop in the middle of an output region */
+ if (buf->output_off + handle->size + 1 <
+ sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size))
+ return -EINVAL;
+
+
+ /* single entry ToPA is handled by marking all regions STOP=1 INT=1 */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+ return 0;
+
+ /* clear STOP and INT from current entry */
+ buf->topa_index[buf->stop_pos]->stop = 0;
+ buf->topa_index[buf->intr_pos]->intr = 0;
+
+ if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ npages = (handle->size + 1) >> PAGE_SHIFT;
+ end = (local64_read(&buf->head) >> PAGE_SHIFT) + npages;
+ /*if (end > handle->wakeup >> PAGE_SHIFT)
+ end = handle->wakeup >> PAGE_SHIFT;*/
+ idx = end & (buf->nr_pages - 1);
+ buf->stop_pos = idx;
+ idx = (local64_read(&buf->head) >> PAGE_SHIFT) + npages - 1;
+ idx &= buf->nr_pages - 1;
+ buf->intr_pos = idx;
+ }
+
+ buf->topa_index[buf->stop_pos]->stop = 1;
+ buf->topa_index[buf->intr_pos]->intr = 1;
+
+ return 0;
+}
+
+/**
+ * pt_buffer_setup_topa_index() - build topa_index[] table of regions
+ * @buf: PT buffer.
+ *
+ * topa_index[] references output regions indexed by offset into the
+ * buffer for purposes of quick reverse lookup.
+ */
+static void pt_buffer_setup_topa_index(struct pt_buffer *buf)
+{
+ struct topa *cur = buf->first, *prev = buf->last;
+ struct topa_entry *te_cur = TOPA_ENTRY(cur, 0),
+ *te_prev = TOPA_ENTRY(prev, prev->last - 1);
+ int pg = 0, idx = 0, ntopa = 0;
+
+ while (pg < buf->nr_pages) {
+ int tidx;
+
+ /* pages within one topa entry */
+ for (tidx = 0; tidx < 1 << te_cur->size; tidx++, pg++)
+ buf->topa_index[pg] = te_prev;
+
+ te_prev = te_cur;
+
+ if (idx == cur->last - 1) {
+ /* advance to next topa table */
+ idx = 0;
+ cur = list_entry(cur->list.next, struct topa, list);
+ ntopa++;
+ } else
+ idx++;
+ te_cur = TOPA_ENTRY(cur, idx);
+ }
+
+}
+
+/**
+ * pt_buffer_reset_offsets() - adjust buffer's write pointers from aux_head
+ * @buf: PT buffer.
+ * @head: Write pointer (aux_head) from AUX buffer.
+ *
+ * Find the ToPA table and entry corresponding to given @head and set buffer's
+ * "current" pointers accordingly.
+ */
+static void pt_buffer_reset_offsets(struct pt_buffer *buf, unsigned long head)
+{
+ int pg;
+
+ if (buf->snapshot)
+ head &= (buf->nr_pages << PAGE_SHIFT) - 1;
+
+ pg = (head >> PAGE_SHIFT) & (buf->nr_pages - 1);
+ pg = pt_topa_next_entry(buf, pg);
+
+ buf->cur = (struct topa *)((unsigned long)buf->topa_index[pg] & PAGE_MASK);
+ buf->cur_idx = ((unsigned long)buf->topa_index[pg] -
+ (unsigned long)buf->cur) / sizeof(struct topa_entry);
+ buf->output_off = head & (sizes(buf->cur->table[buf->cur_idx].size) - 1);
+
+ local64_set(&buf->head, head);
+ local_set(&buf->data_size, 0);
+}
+
+/**
+ * pt_buffer_fini_topa() - deallocate ToPA structure of a buffer
+ * @buf: PT buffer.
+ */
+static void pt_buffer_fini_topa(struct pt_buffer *buf)
+{
+ struct topa *topa, *iter;
+
+ list_for_each_entry_safe(topa, iter, &buf->tables, list) {
+ /*
+ * right now, this is in free_aux() path only, so
+ * no need to unlink this table from the list
+ */
+ topa_free(topa);
+ }
+}
+
+/**
+ * pt_buffer_init_topa() - initialize ToPA table for pt buffer
+ * @buf: PT buffer.
+ * @size: Total size of all regions within this ToPA.
+ * @gfp: Allocation flags.
+ */
+static int pt_buffer_init_topa(struct pt_buffer *buf, unsigned long nr_pages,
+ gfp_t gfp)
+{
+ struct topa *topa;
+ int err;
+
+ topa = topa_alloc(buf->cpu, gfp);
+ if (!topa)
+ return -ENOMEM;
+
+ topa_insert_table(buf, topa);
+
+ while (buf->nr_pages < nr_pages) {
+ err = topa_insert_pages(buf, gfp);
+ if (err) {
+ pt_buffer_fini_topa(buf);
+ return -ENOMEM;
+ }
+ }
+
+ pt_buffer_setup_topa_index(buf);
+
+ /* link last table to the first one, unless we're double buffering */
+ if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ TOPA_ENTRY(buf->last, -1)->base = buf->first->phys >> TOPA_SHIFT;
+ TOPA_ENTRY(buf->last, -1)->end = 1;
+ }
+
+ pt_topa_dump(buf);
+ return 0;
+}
+
+/**
+ * pt_buffer_setup_aux() - set up topa tables for a PT buffer
+ * @cpu: Cpu on which to allocate, -1 means current.
+ * @pages: Array of pointers to buffer pages passed from perf core.
+ * @nr_pages: Number of pages in the buffer.
+ * @snapshot: If this is a snapshot/overwrite counter.
+ *
+ * This is a pmu::setup_aux callback that sets up ToPA tables and all the
+ * bookkeeping for an AUX buffer.
+ *
+ * Return: Our private PT buffer structure.
+ */
+static void *
+pt_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool snapshot)
+{
+ struct pt_buffer *buf;
+ int node, ret;
+
+ if (!nr_pages)
+ return NULL;
+
+ if (cpu == -1)
+ cpu = raw_smp_processor_id();
+ node = cpu_to_node(cpu);
+
+ buf = kzalloc_node(offsetof(struct pt_buffer, topa_index[nr_pages]),
+ GFP_KERNEL, node);
+ if (!buf)
+ return NULL;
+
+ buf->cpu = cpu;
+ buf->snapshot = snapshot;
+ buf->data_pages = pages;
+
+ INIT_LIST_HEAD(&buf->tables);
+
+ ret = pt_buffer_init_topa(buf, nr_pages, GFP_KERNEL);
+ if (ret) {
+ kfree(buf);
+ return NULL;
+ }
+
+ return buf;
+}
+
+/**
+ * pt_buffer_free_aux() - perf AUX deallocation path callback
+ * @data: PT buffer.
+ */
+static void pt_buffer_free_aux(void *data)
+{
+ struct pt_buffer *buf = data;
+
+ pt_buffer_fini_topa(buf);
+ kfree(buf);
+}
+
+/**
+ * pt_buffer_is_full() - check if the buffer is full
+ * @buf: PT buffer.
+ * @pt: Per-cpu pt handle.
+ *
+ * If the user hasn't read data from the output region that aux_head
+ * points to, the buffer is considered full: the user needs to read at
+ * least this region and update aux_tail to point past it.
+ */
+static bool pt_buffer_is_full(struct pt_buffer *buf, struct pt *pt)
+{
+ if (buf->snapshot)
+ return false;
+
+ if (local_read(&buf->data_size) >= pt->handle.size)
+ return true;
+
+ return false;
+}
+
+/**
+ * intel_pt_interrupt() - PT PMI handler
+ */
+void intel_pt_interrupt(void)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct pt_buffer *buf;
+ struct perf_event *event = pt->handle.event;
+
+ if (!ACCESS_ONCE(pt->handle_nmi))
+ return;
+
+ pt_config_start(false);
+
+ if (!event)
+ return;
+
+ buf = perf_get_aux(&pt->handle);
+ if (!buf)
+ return;
+
+ pt_read_offset(buf);
+
+ pt_handle_status(pt);
+
+ pt_update_head(pt);
+
+ perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+ local_xchg(&buf->lost, 0));
+
+ if (!event->hw.state) {
+ int ret;
+
+ buf = perf_aux_output_begin(&pt->handle, event);
+ if (!buf) {
+ event->hw.state = PERF_HES_STOPPED;
+ return;
+ }
+
+ pt_buffer_reset_offsets(buf, pt->handle.head);
+ ret = pt_buffer_reset_markers(buf, &pt->handle);
+ if (ret) {
+ perf_aux_output_end(&pt->handle, 0, true);
+ return;
+ }
+
+ pt_config_buffer(buf->cur->table, buf->cur_idx,
+ buf->output_off);
+ wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+ pt_config(event);
+ }
+}
+
+/*
+ * PMU callbacks
+ */
+
+static void pt_event_start(struct perf_event *event, int mode)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+ if (pt_is_running() || !buf || pt_buffer_is_full(buf, pt)) {
+ event->hw.state = PERF_HES_STOPPED;
+ return;
+ }
+
+ ACCESS_ONCE(pt->handle_nmi) = 1;
+ event->hw.state = 0;
+
+ pt_config_buffer(buf->cur->table, buf->cur_idx,
+ buf->output_off);
+ wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+ pt_config(event);
+}
+
+static void pt_event_stop(struct perf_event *event, int mode)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+ ACCESS_ONCE(pt->handle_nmi) = 0;
+ pt_config_start(false);
+
+ if (event->hw.state == PERF_HES_STOPPED)
+ return;
+
+ event->hw.state = PERF_HES_STOPPED;
+
+ if (mode & PERF_EF_UPDATE) {
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+ if (!buf)
+ return;
+
+ if (WARN_ON_ONCE(pt->handle.event != event))
+ return;
+
+ pt_read_offset(buf);
+
+ pt_handle_status(pt);
+
+ pt_update_head(pt);
+ }
+}
+
+static void pt_event_del(struct perf_event *event, int mode)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct pt_buffer *buf;
+
+ pt_event_stop(event, PERF_EF_UPDATE);
+
+ buf = perf_get_aux(&pt->handle);
+
+ if (buf) {
+ if (buf->snapshot)
+ pt->handle.head =
+ local_xchg(&buf->data_size,
+ buf->nr_pages << PAGE_SHIFT);
+ perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+ local_xchg(&buf->lost, 0));
+ }
+}
+
+static int pt_event_add(struct perf_event *event, int mode)
+{
+ struct pt_buffer *buf;
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct hw_perf_event *hwc = &event->hw;
+ int ret = -EBUSY;
+
+ if (pt->handle.event)
+ goto out;
+
+ buf = perf_aux_output_begin(&pt->handle, event);
+ if (!buf) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ pt_buffer_reset_offsets(buf, pt->handle.head);
+ if (!buf->snapshot) {
+ ret = pt_buffer_reset_markers(buf, &pt->handle);
+ if (ret) {
+ perf_aux_output_end(&pt->handle, 0, true);
+ goto out;
+ }
+ }
+
+ if (mode & PERF_EF_START) {
+ pt_event_start(event, 0);
+ if (hwc->state == PERF_HES_STOPPED) {
+ pt_event_del(event, 0);
+ ret = -EBUSY;
+ }
+ } else {
+ hwc->state = PERF_HES_STOPPED;
+ }
+
+ ret = 0;
+out:
+
+ if (ret)
+ hwc->state = PERF_HES_STOPPED;
+
+ return ret;
+}
+
+static void pt_event_read(struct perf_event *event)
+{
+}
+
+static void pt_event_destroy(struct perf_event *event)
+{
+ x86_del_exclusive(x86_lbr_exclusive_pt);
+}
+
+static int pt_event_init(struct perf_event *event)
+{
+ if (event->attr.type != pt_pmu.pmu.type)
+ return -ENOENT;
+
+ if (!pt_event_valid(event))
+ return -EINVAL;
+
+ if (x86_add_exclusive(x86_lbr_exclusive_pt))
+ return -EBUSY;
+
+ event->destroy = pt_event_destroy;
+
+ return 0;
+}
+
+static __init int pt_init(void)
+{
+ int ret, cpu, prior_warn = 0;
+
+ BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
+ get_online_cpus();
+ for_each_online_cpu(cpu) {
+ u64 ctl;
+
+ ret = rdmsrl_safe_on_cpu(cpu, MSR_IA32_RTIT_CTL, &ctl);
+ if (!ret && (ctl & RTIT_CTL_TRACEEN))
+ prior_warn++;
+ }
+ put_online_cpus();
+
+ ret = pt_pmu_hw_init();
+ if (ret)
+ return ret;
+
+ if (!pt_cap_get(PT_CAP_topa_output)) {
+ pr_warn("ToPA output is not supported on this CPU\n");
+ return -ENODEV;
+ }
+
+ if (prior_warn)
+ pr_warn("PT is enabled at boot time, traces may be empty\n");
+
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+ pt_pmu.pmu.capabilities =
+ PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
+
+ pt_pmu.pmu.capabilities |= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
+ pt_pmu.pmu.attr_groups = pt_attr_groups;
+ pt_pmu.pmu.task_ctx_nr = perf_hw_context;
+ pt_pmu.pmu.event_init = pt_event_init;
+ pt_pmu.pmu.add = pt_event_add;
+ pt_pmu.pmu.del = pt_event_del;
+ pt_pmu.pmu.start = pt_event_start;
+ pt_pmu.pmu.stop = pt_event_stop;
+ pt_pmu.pmu.read = pt_event_read;
+ pt_pmu.pmu.setup_aux = pt_buffer_setup_aux;
+ pt_pmu.pmu.free_aux = pt_buffer_free_aux;
+ ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
+
+ return ret;
+}
+
+module_init(pt_init);
--
2.1.1

2014-11-14 13:50:47

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v8 10/14] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection

Intel Processor Trace is an architecture extension that allows for program
flow tracing.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 1 +
arch/x86/kernel/cpu/scattered.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 0bb1335313..86aa081a0d 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -189,6 +189,7 @@
#define X86_FEATURE_DTHERM ( 7*32+ 7) /* Digital Thermal Sensor */
#define X86_FEATURE_HW_PSTATE ( 7*32+ 8) /* AMD HW-PState */
#define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
+#define X86_FEATURE_INTEL_PT ( 7*32+10) /* Intel Processor Trace */

/* Virtualization flags: Linux defined, word 8 */
#define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 4a8013d559..42f5fa953f 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -36,6 +36,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
{ X86_FEATURE_ARAT, CR_EAX, 2, 0x00000006, 0 },
{ X86_FEATURE_PLN, CR_EAX, 4, 0x00000006, 0 },
{ X86_FEATURE_PTS, CR_EAX, 6, 0x00000006, 0 },
+ { X86_FEATURE_INTEL_PT, CR_EBX,25, 0x00000007, 0 },
{ X86_FEATURE_APERFMPERF, CR_ECX, 0, 0x00000006, 0 },
{ X86_FEATURE_EPB, CR_ECX, 3, 0x00000006, 0 },
{ X86_FEATURE_HW_PSTATE, CR_EDX, 7, 0x80000007, 0 },
--
2.1.1

2014-11-17 09:33:07

by Metzger, Markus T

[permalink] [raw]
Subject: RE: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams

> -----Original Message-----
> From: Alexander Shishkin [mailto:[email protected]]
> Sent: Friday, November 14, 2014 2:44 PM
> To: Peter Zijlstra
> Cc: Ingo Molnar; [email protected]; Robert Richter; Frederic
> Weisbecker; Mike Galbraith; Paul Mackerras; Stephane Eranian; Andi Kleen;
> Liang, Kan; Hunter, Adrian; Metzger, Markus T; [email protected];
> [email protected]; Peter Zijlstra; Alexander Shishkin
> Subject: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data
> streams
>
> From: Peter Zijlstra <[email protected]>
>
> This patch introduces "AUX space" in the perf mmap buffer, intended for
> exporting high bandwidth data streams to userspace, such as instruction
> flow traces.
>
> AUX space is a ring buffer, defined by aux_{offset,size} fields in the
> user_page structure, and read/write pointers aux_{head,tail}, which abide
> by the same rules as data_* counterparts of the main perf buffer.
>
> In order to allocate/mmap AUX, userspace needs to set up aux_offset to
> such an offset that will be greater than data_offset+data_size and
> aux_size to be the desired buffer size. Both need to be page aligned.
> Then, same aux_offset and aux_size should be passed to mmap() call and
> if everything adds up, you should have an AUX buffer as a result.

Why would userspace need to fill in aux_offset and aux_size? Can't the kernel
do this when userspace mmaps the aux buffer like it does for the respective
data_* fields?

How would userspace fill in those fields if it mapped the data buffer read-only?

Regards,
Markus.

Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen, Deutschland
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Christian Lamprechter, Hannes Schwaderer, Douglas Lusk
Registergericht: Muenchen HRB 47456
Ust.-IdNr./VAT Registration No.: DE129385895
Citibank Frankfurt a.M. (BLZ 502 109 00) 600119052

2014-11-17 21:24:13

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams

Alexander Shishkin [[email protected]] wrote:
| From: Peter Zijlstra <[email protected]>
|
| This patch introduces "AUX space" in the perf mmap buffer, intended for
| exporting high bandwidth data streams to userspace, such as instruction
| flow traces.
|
| AUX space is a ring buffer, defined by aux_{offset,size} fields in the
| user_page structure, and read/write pointers aux_{head,tail}, which abide
| by the same rules as data_* counterparts of the main perf buffer.

The "format" of this raw stream of data can differ across processors
(and architecutres) correct ? i.e the perf tool processing of this
aux data will also differ across processors ?

Power8 processors support what we call 24x7 counters that collect info
on a large number of events. The current 24x7 support in
(arch/powerpc/perf/hv-24x7.c) currently uses reading one counter at a
time using the pmu->read() interface.

We are looking for ways to export much larger number of counters through
one pmu->read() or another call. I am trying to see if the hv-24x7 pmu
could use this aux interface and then have/implement a helper in perf
tool to extract the counter values.

Sukadev

2014-11-17 21:45:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams

On Mon, Nov 17, 2014 at 01:24:00PM -0800, Sukadev Bhattiprolu wrote:
> Alexander Shishkin [[email protected]] wrote:
> | From: Peter Zijlstra <[email protected]>
> |
> | This patch introduces "AUX space" in the perf mmap buffer, intended for
> | exporting high bandwidth data streams to userspace, such as instruction
> | flow traces.
> |
> | AUX space is a ring buffer, defined by aux_{offset,size} fields in the
> | user_page structure, and read/write pointers aux_{head,tail}, which abide
> | by the same rules as data_* counterparts of the main perf buffer.
>
> The "format" of this raw stream of data can differ across processors
> (and architecutres) correct ? i.e the perf tool processing of this
> aux data will also differ across processors ?

Yes.

The main caveat is that you need special handling code in perf.

>
> Power8 processors support what we call 24x7 counters that collect info
> on a large number of events. The current 24x7 support in
> (arch/powerpc/perf/hv-24x7.c) currently uses reading one counter at a
> time using the pmu->read() interface.
>
> We are looking for ways to export much larger number of counters through
> one pmu->read() or another call. I am trying to see if the hv-24x7 pmu
> could use this aux interface and then have/implement a helper in perf
> tool to extract the counter values.

It's probably only worth it if your amount of data is extremely
high and has to be to compressed (like in PT -- no choice, cannot do it
otherwise)

If your data amounts are more reasonable it will be a lot
simpler to stay with the current perf interfaces, even
though they have a bit higher overhead.

-Andi

--
[email protected] -- Speaking for myself only

2014-12-17 14:07:56

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT

Alexander Shishkin <[email protected]> writes:

> Hi Peter and all,

Hi again,

> Since there weren't too many review comments on the two previous
> versions, here's another one with a couple more fixes.

This patchset is one month old now, I thought somebody might want to
take a look before a send a rebased new version.

Cheers,
--
Alex