2014-10-13 13:47:42

by Alexander Shishkin

Subject: [PATCH v5 00/20] perf: Add infrastructure and support for Intel PT

Hi Peter and all,

[full description below the changelog]

This version of the patchset hopefully addresses the comments on the
previous (v4) version. Changelog messages and code comments should now
be more descriptive. Functional changes:

* events are no longer disabled on munmap(); this got replaced with
refcounting,
* an explicit sfence is added to the intel_pt driver to make sure
that data stores are globally visible before the aux_head is
updated,
* dropped the unnecessary set_output for inherited events in
favor of using the parent's ring buffer,
* intel_pt now handles #GP from the enabling WRMSR, so that a
privileged user can set arbitrary RTIT_CTL bits in the range
that is reserved for packet enables (see PT_BYPASS_MASK and the
comments around pt_config()) without potentially killing the
machine.

Interface changes:

* replaced 'u8 truncated' in the PERF_RECORD_AUX with a 'u64 flags',
dropped the redundant id/stream_id,
* in overwrite mode, always provide offset and size even if the driver
cannot tell where the snapshot begins or whether its beginning was
overwritten by older data.

This patchset adds support for Intel Processor Trace (PT) extension [1] of
Intel Architecture that allows the capture of information about software
execution flow, to the perf kernel infrastructure.

The single most notable thing is that while PT outputs trace data in a
compressed binary format, it will still generate hundreds of megabytes
of trace data per second per core. Decoding this binary stream takes
2-3 orders of magnitude more cpu time than it takes to generate
it. These considerations make it impossible to carry out decoding in
kernel space. Therefore, the trace data is exported to userspace as a
zero-copy mapping that userspace can collect and store for later
decoding. To address this, this patchset extends perf ring buffer with
an "AUX space", which is allocated for hardware blocks such as PT to
export their trace data with minimal overhead. This space can be
configured via the buffer's user page and mmapped from the same file
descriptor at a given offset. Data can then be collected from it
by reading the aux_head (write) pointer from the user page and updating
the aux_tail (read) pointer, similarly to data_{head,tail} of the
traditional perf buffer. There is an api between perf core and pmu
drivers that wish to make use of this AUX space to export their data.
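
As a rough illustration (not part of this series; the function and callback
names are made up), a userspace consumer of the AUX area could look
something like the sketch below, assuming the extended perf_event_mmap_page
described further down:

#include <stdint.h>
#include <linux/perf_event.h>

/*
 * Hypothetical sketch of an AUX consumer: 'up' is the mmapped user page,
 * 'aux' the mmapped AUX area of 'aux_size' bytes, 'consume' a callback
 * supplied by the caller.
 */
static void drain_aux(volatile struct perf_event_mmap_page *up,
                      const unsigned char *aux, uint64_t aux_size,
                      void (*consume)(const unsigned char *, uint64_t))
{
        uint64_t head = up->aux_head;   /* written by the kernel */
        uint64_t tail = up->aux_tail;   /* written by userspace */

        __sync_synchronize();           /* read head before reading data */

        while (tail < head) {
                uint64_t off = tail % aux_size;
                uint64_t len = head - tail;

                if (len > aux_size - off)
                        len = aux_size - off;   /* handle the wrap-around */

                consume(aux + off, len);
                tail += len;
        }

        __sync_synchronize();           /* finish reading before moving tail */
        up->aux_tail = tail;            /* tell the kernel the space is free */
}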

For tracing blocks that don't support hardware scatter-gather tables,
we provide high-order physically contiguous allocations to minimize
the overhead needed for software double buffering and PMI pressure.

This way we get a normal perf data stream that provides sideband
information that is required to decode the trace data, such as MMAPs,
COMMs etc, plus the actual trace in its own logical space.

If the trace buffer is mapped writable, the driver will stop tracing
when it fills up (aux_head approaches aux_tail), until the data is read,
the aux_tail pointer is moved forward and an ioctl() is issued to
re-enable tracing. If the trace buffer is mapped read-only, tracing
will continue, overwriting older data, so that the buffer always
contains the most recent data. Tracing can be stopped with an
ioctl() and restarted once the data is collected.
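
For instance (a sketch only, assuming the existing perf ioctls and a
hypothetical copy-out helper), a consumer of the read-only/overwrite mode
could do:

#include <sys/ioctl.h>
#include <linux/perf_event.h>

/* Stop tracing, let the caller copy the snapshot out, restart tracing. */
static void aux_snapshot(int fd, void (*copy_out)(void))
{
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        copy_out();                     /* hypothetical consumer callback */
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}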

Another use case is annotating samples of other perf events: setting
PERF_SAMPLE_AUX requests attr.aux_sample_size bytes of trace to be
included in each event's sample.

This patchset consists of the necessary changes to the perf kernel
infrastructure and the PT and BTS pmu drivers. The tooling support is not
included in this series; however, it can be found in my github tree [2].

[1] http://software.intel.com/en-us/intel-isa-extensions
[2] http://github.com/virtuoso/linux-perf/tree/intel_pt

Alexander Shishkin (19):
perf: Add data_{offset,size} to user_page
perf: Support high-order allocations for AUX space
perf: Add a capability for AUX_NO_SG pmus to do software double
buffering
perf: Add a pmu capability for "exclusive" events
perf: Add AUX record
perf: Add api for pmus to write to AUX area
perf: Support overwrite mode for AUX area
perf: Add wakeup watermark control to AUX area
x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
x86: perf: Intel PT and LBR/BTS are mutually exclusive
x86: perf: intel_pt: Intel PT PMU driver
x86: perf: intel_bts: Add BTS PMU driver
perf: add ITRACE_START record to indicate that tracing has started
perf: Add api to (de-)allocate AUX buffers for kernel counters
perf: Add a helper for looking up pmus by type
perf: Add infrastructure for using AUX data in perf samples
perf: Allocate ring buffers for inherited per-task kernel events
perf: Allow AUX sampling for multiple events
perf: Allow AUX sampling of inherited events

Peter Zijlstra (1):
perf: Add AUX area to ring buffer for raw data streams

arch/x86/include/asm/cpufeature.h | 1 +
arch/x86/include/uapi/asm/msr-index.h | 18 +
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/intel_pt.h | 129 ++++
arch/x86/kernel/cpu/perf_event.h | 14 +
arch/x86/kernel/cpu/perf_event_intel.c | 14 +-
arch/x86/kernel/cpu/perf_event_intel_bts.c | 518 +++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 11 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 9 +-
arch/x86/kernel/cpu/perf_event_intel_pt.c | 995 +++++++++++++++++++++++++++++
arch/x86/kernel/cpu/scattered.c | 1 +
include/linux/perf_event.h | 56 +-
include/uapi/linux/perf_event.h | 73 ++-
kernel/events/core.c | 534 +++++++++++++++-
kernel/events/internal.h | 52 ++
kernel/events/ring_buffer.c | 386 ++++++++++-
16 files changed, 2768 insertions(+), 44 deletions(-)
create mode 100644 arch/x86/kernel/cpu/intel_pt.h
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

--
2.1.0


2014-10-13 13:46:40

by Alexander Shishkin

Subject: [PATCH v5 01/20] perf: Add data_{offset,size} to user_page

Currently, the actual perf ring buffer starts one page into the mmap area,
right after the user page, and userspace relies on this convention. This
patch adds data_{offset,size} fields to the user_page that userspace can
use instead to locate the perf data within the mmap area. This is also
helpful when mapping existing or shared buffers whose size is not
known in advance.

Right now, it is made to follow the existing convention that

data_offset == PAGE_SIZE and
data_offset + data_size == mmap_size.
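
For illustration only (not part of the patch; assumes the header with this
change applied), a consumer could locate the data area like this, falling
back to the old convention when the new fields are zero:

#include <stdint.h>
#include <stddef.h>
#include <linux/perf_event.h>

/*
 * Sketch: locate the perf data area inside an mmapped buffer.  'up' is
 * the mmapped user page, 'mmap_size' the size of the whole mapping and
 * 'page_size' the system page size.
 */
static void *perf_data_area(struct perf_event_mmap_page *up,
                            size_t mmap_size, size_t page_size,
                            uint64_t *data_size)
{
        uint64_t off = up->data_offset ? up->data_offset : page_size;

        *data_size = up->data_size ? up->data_size : mmap_size - off;

        return (char *)up + off;        /* user page sits at offset 0 */
}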

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/uapi/linux/perf_event.h | 5 +++++
kernel/events/core.c | 2 ++
2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9269de2548..f7d18c2cb7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -489,9 +489,14 @@ struct perf_event_mmap_page {
* In this case the kernel will not over-write unread data.
*
* See perf_output_put_handle() for the data ordering.
+ *
+ * data_{offset,size} indicate the location and size of the perf record
+ * buffer within the mmapped area.
*/
__u64 data_head; /* head in the data section */
__u64 data_tail; /* user-space written tail */
+ __u64 data_offset; /* where the buffer starts */
+ __u64 data_size; /* data buffer size */
};

#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b164cb07b3..23bacb8682 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3917,6 +3917,8 @@ static void perf_event_init_userpage(struct perf_event *event)
/* Allow new userspace to detect that bit 0 is deprecated */
userpg->cap_bit0_is_deprecated = 1;
userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+ userpg->data_offset = PAGE_SIZE;
+ userpg->data_size = perf_data_size(rb);

unlock:
rcu_read_unlock();
--
2.1.0

2014-10-13 13:46:47

by Alexander Shishkin

Subject: [PATCH v5 02/20] perf: Add AUX area to ring buffer for raw data streams

From: Peter Zijlstra <[email protected]>

This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.

AUX space is a ring buffer, defined by aux_{offset,size} fields in the
user_page structure, and read/write pointers aux_{head,tail}, which abide
by the same rules as data_* counterparts of the main perf buffer.

In order to allocate/mmap the AUX area, userspace needs to set aux_offset
to an offset greater than data_offset + data_size and aux_size to the
desired buffer size. Both need to be page aligned. The same aux_offset
and aux_size should then be passed to the mmap() call, and if everything
adds up, you should have an AUX buffer as a result.

Pages that are mapped into this buffer are also charged against the
user's mlock rlimit plus the perf_event_mlock_kb allowance.
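
A rough userspace sketch of the intended usage (illustrative only; 'fd' is
the perf event file descriptor and error handling is omitted):

#include <sys/mman.h>
#include <unistd.h>
#include <linux/perf_event.h>

/* Map the normal perf buffer, then set up and map the AUX area. */
static void *map_aux(int fd, size_t data_pages, size_t aux_pages)
{
        size_t psize = sysconf(_SC_PAGESIZE);
        struct perf_event_mmap_page *up;

        /* user page + 2^n data pages, as before */
        up = mmap(NULL, (data_pages + 1) * psize,
                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* tell the kernel where the AUX area goes and how big it is */
        up->aux_offset = (data_pages + 1) * psize;
        up->aux_size   = aux_pages * psize;     /* power-of-two pages */

        /* map it at that same offset; PROT_WRITE selects non-overwrite mode */
        return mmap(NULL, up->aux_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, up->aux_offset);
}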

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 17 +++++
include/uapi/linux/perf_event.h | 16 +++++
kernel/events/core.c | 140 +++++++++++++++++++++++++++++++++-------
kernel/events/internal.h | 23 +++++++
kernel/events/ring_buffer.c | 97 ++++++++++++++++++++++++++--
5 files changed, 264 insertions(+), 29 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0798..344058c71d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -263,6 +263,18 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * Set up pmu-private data structures for an AUX area
+ */
+ void *(*setup_aux) (int cpu, void **pages,
+ int nr_pages, bool overwrite);
+ /* optional */
+
+ /*
+ * Free pmu-private AUX data structures
+ */
+ void (*free_aux) (void *aux); /* optional */
};

/**
@@ -782,6 +794,11 @@ static inline bool has_branch_stack(struct perf_event *event)
return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
}

+static inline bool has_aux(struct perf_event *event)
+{
+ return event->pmu->setup_aux;
+}
+
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index f7d18c2cb7..7e0967c0f5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -497,6 +497,22 @@ struct perf_event_mmap_page {
__u64 data_tail; /* user-space written tail */
__u64 data_offset; /* where the buffer starts */
__u64 data_size; /* data buffer size */
+
+ /*
+ * AUX area is defined by aux_{offset,size} fields that should be set
+ * by the userspace, so that
+ *
+ * aux_offset >= data_offset + data_size
+ *
+ * prior to mmap()ing it. Size of the mmap()ed area should be aux_size.
+ *
+ * Ring buffer pointers aux_{head,tail} have the same semantics as
+ * data_{head,tail} and same ordering rules apply.
+ */
+ __u64 aux_head;
+ __u64 aux_tail;
+ __u64 aux_offset;
+ __u64 aux_size;
};

#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 23bacb8682..86b0577229 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4116,6 +4116,8 @@ static void perf_mmap_open(struct vm_area_struct *vma)

atomic_inc(&event->mmap_count);
atomic_inc(&event->rb->mmap_count);
+ if (vma->vm_pgoff)
+ atomic_inc(&event->rb->aux_mmap_count);
}

/*
@@ -4135,6 +4137,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
int mmap_locked = rb->mmap_locked;
unsigned long size = perf_data_size(rb);

+ /*
+ * rb->aux_mmap_count will always drop before rb->mmap_count and
+ * event->mmap_count, so it is ok to use event->mmap_mutex to
+ * serialize with perf_mmap here.
+ */
+ if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
+ atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
+ atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+
+ rb_free_aux(rb);
+ mutex_unlock(&event->mmap_mutex);
+ }
+
atomic_dec(&rb->mmap_count);

if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
@@ -4208,7 +4224,7 @@ out_put:

static const struct vm_operations_struct perf_mmap_vmops = {
.open = perf_mmap_open,
- .close = perf_mmap_close,
+ .close = perf_mmap_close, /* non mergable */
.fault = perf_mmap_fault,
.page_mkwrite = perf_mmap_fault,
};
@@ -4219,10 +4235,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
unsigned long user_locked, user_lock_limit;
struct user_struct *user = current_user();
unsigned long locked, lock_limit;
- struct ring_buffer *rb;
+ struct ring_buffer *rb = NULL;
unsigned long vma_size;
unsigned long nr_pages;
- long user_extra, extra;
+ long user_extra = 0, extra = 0;
int ret = 0, flags = 0;

/*
@@ -4237,7 +4253,66 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
return -EINVAL;

vma_size = vma->vm_end - vma->vm_start;
- nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+ if (vma->vm_pgoff == 0) {
+ nr_pages = (vma_size / PAGE_SIZE) - 1;
+ } else {
+ /*
+ * AUX area mapping: if rb->aux_nr_pages != 0, it's already
+ * mapped, all subsequent mappings should have the same size
+ * and offset. Must be above the normal perf buffer.
+ */
+ u64 aux_offset, aux_size;
+
+ if (!event->rb)
+ return -EINVAL;
+
+ nr_pages = vma_size / PAGE_SIZE;
+
+ mutex_lock(&event->mmap_mutex);
+ ret = -EINVAL;
+
+ rb = event->rb;
+ if (!rb)
+ goto aux_unlock;
+
+ aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
+ aux_size = ACCESS_ONCE(rb->user_page->aux_size);
+
+ if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
+ goto aux_unlock;
+
+ if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
+ goto aux_unlock;
+
+ /* already mapped with a different offset */
+ if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
+ goto aux_unlock;
+
+ if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
+ goto aux_unlock;
+
+ /* already mapped with a different size */
+ if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
+ goto aux_unlock;
+
+ if (!is_power_of_2(nr_pages))
+ goto aux_unlock;
+
+ if (!atomic_inc_not_zero(&rb->mmap_count))
+ goto aux_unlock;
+
+ if (rb_has_aux(rb)) {
+ atomic_inc(&rb->aux_mmap_count);
+ ret = 0;
+ goto unlock;
+ }
+
+ atomic_set(&rb->aux_mmap_count, 1);
+ user_extra = nr_pages;
+
+ goto accounting;
+ }

/*
* If we have rb pages ensure they're a power-of-two number, so we
@@ -4249,9 +4324,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (vma_size != PAGE_SIZE * (1 + nr_pages))
return -EINVAL;

- if (vma->vm_pgoff != 0)
- return -EINVAL;
-
WARN_ON_ONCE(event->ctx->parent_ctx);
again:
mutex_lock(&event->mmap_mutex);
@@ -4275,6 +4347,8 @@ again:
}

user_extra = nr_pages + 1;
+
+accounting:
user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);

/*
@@ -4284,7 +4358,6 @@ again:

user_locked = atomic_long_read(&user->locked_vm) + user_extra;

- extra = 0;
if (user_locked > user_lock_limit)
extra = user_locked - user_lock_limit;

@@ -4298,35 +4371,45 @@ again:
goto unlock;
}

- WARN_ON(event->rb);
+ WARN_ON(!rb && event->rb);

if (vma->vm_flags & VM_WRITE)
flags |= RING_BUFFER_WRITABLE;

- rb = rb_alloc(nr_pages,
- event->attr.watermark ? event->attr.wakeup_watermark : 0,
- event->cpu, flags);
-
if (!rb) {
- ret = -ENOMEM;
- goto unlock;
- }
+ rb = rb_alloc(nr_pages,
+ event->attr.watermark ? event->attr.wakeup_watermark : 0,
+ event->cpu, flags);

- atomic_set(&rb->mmap_count, 1);
- rb->mmap_locked = extra;
- rb->mmap_user = get_current_user();
+ if (!rb) {
+ ret = -ENOMEM;
+ goto unlock;
+ }

- atomic_long_add(user_extra, &user->locked_vm);
- vma->vm_mm->pinned_vm += extra;
+ atomic_set(&rb->mmap_count, 1);
+ rb->mmap_user = get_current_user();
+ rb->mmap_locked = extra;

- ring_buffer_attach(event, rb);
+ ring_buffer_attach(event, rb);

- perf_event_init_userpage(event);
- perf_event_update_userpage(event);
+ perf_event_init_userpage(event);
+ perf_event_update_userpage(event);
+ } else {
+ ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+ if (ret)
+ atomic_dec(&rb->mmap_count);
+ else
+ rb->aux_mmap_locked = extra;
+ }

unlock:
- if (!ret)
+ if (!ret) {
+ atomic_long_add(user_extra, &user->locked_vm);
+ vma->vm_mm->pinned_vm += extra;
+
atomic_inc(&event->mmap_count);
+ }
+aux_unlock:
mutex_unlock(&event->mmap_mutex);

/*
@@ -7173,6 +7256,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
if (output_event->cpu == -1 && output_event->ctx != event->ctx)
goto out;

+ /*
+ * If both events generate aux data, they must be on the same PMU
+ */
+ if (has_aux(event) && has_aux(output_event) &&
+ event->pmu != output_event->pmu)
+ goto out;
+
set:
mutex_lock(&event->mmap_mutex);
/* Can't redirect output if we've got an active mmap() */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782..1e476d2f29 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -35,6 +35,16 @@ struct ring_buffer {
unsigned long mmap_locked;
struct user_struct *mmap_user;

+ /* AUX area */
+ unsigned long aux_pgoff;
+ int aux_nr_pages;
+ atomic_t aux_mmap_count;
+ unsigned long aux_mmap_locked;
+ void (*free_aux)(void *);
+ struct kref aux_refcount;
+ void **aux_pages;
+ void *aux_priv;
+
struct perf_event_mmap_page *user_page;
void *data_pages[0];
};
@@ -43,6 +53,14 @@ extern void rb_free(struct ring_buffer *rb);
extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
extern void perf_event_wakeup(struct perf_event *event);
+extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+ pgoff_t pgoff, int nr_pages, int flags);
+extern void rb_free_aux(struct ring_buffer *rb);
+
+static inline bool rb_has_aux(struct ring_buffer *rb)
+{
+ return !!rb->aux_nr_pages;
+}

extern void
perf_event_header__init_id(struct perf_event_header *header,
@@ -81,6 +99,11 @@ static inline unsigned long perf_data_size(struct ring_buffer *rb)
return rb->nr_pages << (PAGE_SHIFT + page_order(rb));
}

+static inline unsigned long perf_aux_size(struct ring_buffer *rb)
+{
+ return rb->aux_nr_pages << PAGE_SHIFT;
+}
+
#define DEFINE_OUTPUT_COPY(func_name, memcpy_func) \
static inline unsigned long \
func_name(struct perf_output_handle *handle, \
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a5792b1..2ebb6c5d3b 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,14 +242,87 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+ pgoff_t pgoff, int nr_pages, int flags)
+{
+ bool overwrite = !(flags & RING_BUFFER_WRITABLE);
+ int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
+ int ret = -ENOMEM;
+
+ if (!has_aux(event))
+ return -ENOTSUPP;
+
+ rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
+ if (!rb->aux_pages)
+ return -ENOMEM;
+
+ rb->free_aux = event->pmu->free_aux;
+ for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
+ rb->aux_nr_pages++) {
+ struct page *page;
+
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ if (!page)
+ goto out;
+
+ rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+ }
+
+ rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
+ overwrite);
+ if (!rb->aux_priv)
+ goto out;
+
+ ret = 0;
+
+ /*
+ * aux_pages (and pmu driver's private data, aux_priv) will be
+ * referenced in both producer's and consumer's contexts, thus
+ * we keep a refcount here to make sure either of the two can
+ * reference them safely.
+ */
+ kref_init(&rb->aux_refcount);
+
+out:
+ if (!ret)
+ rb->aux_pgoff = pgoff;
+ else
+ rb_free_aux(rb);
+
+ return ret;
+}
+
+static void __rb_free_aux(struct kref *kref)
+{
+ struct ring_buffer *rb = container_of(kref, struct ring_buffer, aux_refcount);
+ int pg;
+
+ if (rb->aux_priv) {
+ rb->free_aux(rb->aux_priv);
+ rb->free_aux = NULL;
+ rb->aux_priv = NULL;
+ }
+
+ for (pg = 0; pg < rb->aux_nr_pages; pg++)
+ free_page((unsigned long)rb->aux_pages[pg]);
+
+ kfree(rb->aux_pages);
+ rb->aux_nr_pages = 0;
+}
+
+void rb_free_aux(struct ring_buffer *rb)
+{
+ kref_put(&rb->aux_refcount, __rb_free_aux);
+}
+
#ifndef CONFIG_PERF_USE_VMALLOC

/*
* Back perf_mmap() with regular GFP_KERNEL-0 pages.
*/

-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
{
if (pgoff > rb->nr_pages)
return NULL;
@@ -339,8 +412,8 @@ static int data_page_nr(struct ring_buffer *rb)
return rb->nr_pages << page_order(rb);
}

-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
{
/* The '>' counts in the user page. */
if (pgoff > data_page_nr(rb))
@@ -415,3 +488,19 @@ fail:
}

#endif
+
+struct page *
+perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+ if (rb->aux_nr_pages) {
+ /* above AUX space */
+ if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
+ return NULL;
+
+ /* AUX space */
+ if (pgoff >= rb->aux_pgoff)
+ return virt_to_page(rb->aux_pages[pgoff - rb->aux_pgoff]);
+ }
+
+ return __perf_mmap_to_page(rb, pgoff);
+}
--
2.1.0

2014-10-13 13:47:18

by Alexander Shishkin

Subject: [PATCH v5 07/20] perf: Add api for pmus to write to AUX area

For pmus that wish to write data to the ring buffer's AUX area, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These use the same output handle
structure and, like their software counterparts, direct inherited
events' output to their parents' ring buffers.

After perf_aux_output_begin() returns successfully, handle->size is set
to the maximum amount of data that can be written with respect to the
aux_tail pointer, so that no data that the user hasn't seen yet will be
overwritten; therefore it should always be called before hardware
writing is enabled. On success, it returns the pointer to the pmu
driver's private structure allocated for this AUX area by
pmu::setup_aux. The same pointer can also be retrieved using
perf_get_aux() while hardware writing is enabled.

The PMU driver should pass the actual amount of data written as a
parameter to perf_aux_output_end(). All hardware writes should be
completed and visible before this is called.

Additionally, perf_aux_output_skip() will adjust output handle and
aux_head in case some part of the buffer has to be skipped over to
maintain hardware's alignment constraints.

Nested writers are forbidden and guards are in place to catch such
attempts.
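
A minimal sketch of how a pmu driver would use this api (the my_hw_*()
helpers and 'struct my_buf' are hypothetical stand-ins for real hardware
programming):

#include <linux/perf_event.h>
#include <linux/percpu.h>

struct my_buf;                                  /* hypothetical driver buffer */
void my_hw_program(struct my_buf *buf, unsigned long size);
void my_hw_enable(void);
unsigned long my_hw_disable(void);              /* returns bytes written */

static DEFINE_PER_CPU(struct perf_output_handle, my_handle);

static void my_pmu_start(struct perf_event *event, int flags)
{
        struct perf_output_handle *handle = this_cpu_ptr(&my_handle);
        struct my_buf *buf;

        /* returns the buffer set up by pmu::setup_aux, or NULL if no room */
        buf = perf_aux_output_begin(handle, event);
        if (!buf)
                return;

        /* program the hardware to write at most handle->size bytes into buf */
        my_hw_program(buf, handle->size);
        my_hw_enable();
}

static void my_pmu_stop(struct perf_event *event, int flags)
{
        struct perf_output_handle *handle = this_cpu_ptr(&my_handle);
        unsigned long written;

        /* stop the hardware; this must also make its writes globally visible */
        written = my_hw_disable();

        /* commit the data; mark it truncated if we ran out of buffer space */
        perf_aux_output_end(handle, written, written >= handle->size);
}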

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 23 +++++++-
kernel/events/core.c | 5 +-
kernel/events/internal.h | 4 ++
kernel/events/ring_buffer.c | 139 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 167 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5c3dee021c..f126eb89e6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -555,12 +555,22 @@ struct perf_output_handle {
struct ring_buffer *rb;
unsigned long wakeup;
unsigned long size;
- void *addr;
+ union {
+ void *addr;
+ unsigned long head;
+ };
int page;
};

#ifdef CONFIG_PERF_EVENTS

+extern void *perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event);
+extern void perf_aux_output_end(struct perf_output_handle *handle,
+ unsigned long size, bool truncated);
+extern int perf_aux_output_skip(struct perf_output_handle *handle,
+ unsigned long size);
+extern void *perf_get_aux(struct perf_output_handle *handle);
extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
extern void perf_pmu_unregister(struct pmu *pmu);

@@ -817,6 +827,17 @@ extern void perf_event_disable(struct perf_event *event);
extern int __perf_event_disable(void *info);
extern void perf_event_task_tick(void);
#else /* !CONFIG_PERF_EVENTS: */
+static inline void *
+perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event) { return NULL; }
+static inline void
+perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+ bool truncated) { }
+static inline int
+perf_aux_output_skip(struct perf_output_handle *handle,
+ unsigned long size) { return -EINVAL; }
+static inline void *
+perf_get_aux(struct perf_output_handle *handle) { return NULL; }
static inline void
perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task) { }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 46a6217bbe..fcb61ae7a3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3268,7 +3268,6 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}

-static void ring_buffer_put(struct ring_buffer *rb);
static void ring_buffer_attach(struct perf_event *event,
struct ring_buffer *rb);

@@ -4085,7 +4084,7 @@ static void rb_free_rcu(struct rcu_head *rcu_head)
rb_free(rb);
}

-static struct ring_buffer *ring_buffer_get(struct perf_event *event)
+struct ring_buffer *ring_buffer_get(struct perf_event *event)
{
struct ring_buffer *rb;

@@ -4100,7 +4099,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}

-static void ring_buffer_put(struct ring_buffer *rb)
+void ring_buffer_put(struct ring_buffer *rb)
{
if (!atomic_dec_and_test(&rb->refcount))
return;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index e0013084c5..68bbf86d8f 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,8 @@ struct ring_buffer {
struct user_struct *mmap_user;

/* AUX area */
+ local_t aux_head;
+ local_t aux_nest;
unsigned long aux_pgoff;
int aux_nr_pages;
atomic_t aux_mmap_count;
@@ -56,6 +58,8 @@ extern void perf_event_wakeup(struct perf_event *event);
extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
pgoff_t pgoff, int nr_pages, int flags);
extern void rb_free_aux(struct ring_buffer *rb);
+extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
+extern void ring_buffer_put(struct ring_buffer *rb);

static inline bool rb_has_aux(struct ring_buffer *rb)
{
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index b45feb3f4a..f686df9d1d 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,6 +242,145 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+/*
+ * This is called before hardware starts writing to the AUX area to
+ * obtain an output handle and make sure there's room in the buffer.
+ * When the capture completes, call perf_aux_output_end() to commit
+ * the recorded data to the buffer.
+ *
+ * The ordering is similar to that of perf_output_{begin,end}, with
+ * the exception of (B), which should be taken care of by the pmu
+ * driver, since ordering rules will differ depending on hardware.
+ */
+void *perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event)
+{
+ struct perf_event *output_event = event;
+ unsigned long aux_head, aux_tail;
+ struct ring_buffer *rb;
+
+ if (output_event->parent)
+ output_event = output_event->parent;
+
+ /*
+ * Since this will typically be open across pmu::add/pmu::del, we
+ * grab ring_buffer's refcount instead of holding rcu read lock
+ * to make sure it doesn't disappear under us.
+ */
+ rb = ring_buffer_get(output_event);
+ if (!rb)
+ return NULL;
+
+ if (!rb_has_aux(rb) || !kref_get_unless_zero(&rb->aux_refcount))
+ goto err;
+
+ /*
+ * Nesting is not supported for AUX area, make sure nested
+ * writers are caught early
+ */
+ if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
+ goto err_put;
+
+ aux_head = local_read(&rb->aux_head);
+ aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+
+ handle->rb = rb;
+ handle->event = event;
+ handle->head = aux_head;
+ if (aux_head - aux_tail < perf_aux_size(rb))
+ handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+ else
+ handle->size = 0;
+
+ /*
+ * handle->size computation depends on aux_tail load; this forms a
+ * control dependency barrier separating aux_tail load from aux data
+ * store that will be enabled on successful return
+ */
+ if (!handle->size) { /* A, matches D */
+ event->pending_disable = 1;
+ perf_output_wakeup(handle);
+ local_set(&rb->aux_nest, 0);
+ goto err_put;
+ }
+
+ return handle->rb->aux_priv;
+
+err_put:
+ rb_free_aux(rb);
+
+err:
+ ring_buffer_put(rb);
+ handle->event = NULL;
+
+ return NULL;
+}
+
+/*
+ * Commit the data written by hardware into the ring buffer by adjusting
+ * aux_head and posting a PERF_RECORD_AUX into the perf buffer. It is the
+ * pmu driver's responsibility to observe ordering rules of the hardware,
+ * so that all the data is externally visible before this is called.
+ */
+void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+ bool truncated)
+{
+ struct ring_buffer *rb = handle->rb;
+ unsigned long aux_head = local_read(&rb->aux_head);
+ u64 flags = 0;
+
+ if (truncated)
+ flags |= PERF_AUX_FLAG_TRUNCATED;
+
+ local_add(size, &rb->aux_head);
+
+ if (size || flags) {
+ /*
+ * Only send RECORD_AUX if we have something useful to communicate
+ */
+
+ perf_event_aux_event(handle->event, aux_head, size, flags);
+ }
+
+ rb->user_page->aux_head = local_read(&rb->aux_head);
+
+ perf_output_wakeup(handle);
+ handle->event = NULL;
+
+ local_set(&rb->aux_nest, 0);
+ rb_free_aux(rb);
+ ring_buffer_put(rb);
+}
+
+/*
+ * Skip over a given number of bytes in the AUX buffer, due to, for example,
+ * hardware's alignment constraints.
+ */
+int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)
+{
+ struct ring_buffer *rb = handle->rb;
+ unsigned long aux_head = local_read(&rb->aux_head);
+
+ if (size > handle->size)
+ return -ENOSPC;
+
+ local_add(size, &rb->aux_head);
+
+ handle->head = aux_head;
+ handle->size -= size;
+
+ return 0;
+}
+
+void *perf_get_aux(struct perf_output_handle *handle)
+{
+ /* this is only valid between perf_aux_output_begin and *_end */
+ if (!handle->event)
+ return NULL;
+
+ return handle->rb->aux_priv;
+}
+
#define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)

static struct page *rb_alloc_aux_page(int node, int order)
--
2.1.0

2014-10-13 13:47:26

by Alexander Shishkin

Subject: [PATCH v5 09/20] perf: Add wakeup watermark control to AUX area

When the AUX area gets a certain amount of new data, we want to wake up
userspace to collect it. This adds a new control to specify how much
data will cause a wakeup. This is then passed down to pmu drivers via
the output handle's "wakeup" field, so that the driver can find the
nearest point where it can generate an interrupt.

We repurpose __reserved_2 in the event attribute for this. Even though
it was never checked to be zero before, aux_watermark only matters to
new AUX-aware code, so old code should still be fine.
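
For example (illustrative only), a user who wants to be woken up roughly
every 512kB of new AUX data would set:

struct perf_event_attr attr = {
        /* ... type, config, sample_type, etc. ... */
        .aux_watermark = 512 * 1024,
};

If aux_watermark is left at zero and the buffer is not in overwrite mode,
rb_alloc_aux() defaults the watermark to half of the AUX buffer.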

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/uapi/linux/perf_event.h | 7 +++++--
kernel/events/core.c | 3 ++-
kernel/events/internal.h | 4 +++-
kernel/events/ring_buffer.c | 22 +++++++++++++++++++---
4 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index a6bd31f5e0..c20aaa5af7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -238,6 +238,7 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER2 80 /* add: branch_sample_type */
#define PERF_ATTR_SIZE_VER3 96 /* add: sample_regs_user */
/* add: sample_stack_user */
+ /* add: aux_watermark */

/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -332,8 +333,10 @@ struct perf_event_attr {
*/
__u32 sample_stack_user;

- /* Align to u64. */
- __u32 __reserved_2;
+ /*
+ * Wakeup watermark for AUX area
+ */
+ __u32 aux_watermark;
};

#define perf_flags(attr) (*(&(attr)->read_format + 1))
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fcb61ae7a3..908dc3e63b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4394,7 +4394,8 @@ accounting:
perf_event_init_userpage(event);
perf_event_update_userpage(event);
} else {
- ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+ ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
+ event->attr.aux_watermark, flags);
if (ret)
atomic_dec(&rb->mmap_count);
else
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index b1ed80f87d..4715aae48b 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -27,6 +27,7 @@ struct ring_buffer {
local_t lost; /* nr records lost */

long watermark; /* wakeup watermark */
+ long aux_watermark;
/* poll crap */
spinlock_t event_lock;
struct list_head event_list;
@@ -38,6 +39,7 @@ struct ring_buffer {
/* AUX area */
local_t aux_head;
local_t aux_nest;
+ local_t aux_wakeup;
unsigned long aux_pgoff;
int aux_nr_pages;
int aux_overwrite;
@@ -57,7 +59,7 @@ extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
extern void perf_event_wakeup(struct perf_event *event);
extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
- pgoff_t pgoff, int nr_pages, int flags);
+ pgoff_t pgoff, int nr_pages, long watermark, int flags);
extern void rb_free_aux(struct ring_buffer *rb);
extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
extern void ring_buffer_put(struct ring_buffer *rb);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index d0373c6d30..96ec58fb46 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -295,6 +295,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
*/
if (!rb->aux_overwrite) {
aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+ handle->wakeup = local_read(&rb->aux_wakeup) + rb->aux_watermark;
if (aux_head - aux_tail < perf_aux_size(rb))
handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));

@@ -358,9 +359,12 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
perf_event_aux_event(handle->event, aux_head, size, flags);
}

- rb->user_page->aux_head = local_read(&rb->aux_head);
+ aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);

- perf_output_wakeup(handle);
+ if (aux_head - local_read(&rb->aux_wakeup) >= rb->aux_watermark) {
+ perf_output_wakeup(handle);
+ local_add(rb->aux_watermark, &rb->aux_wakeup);
+ }
handle->event = NULL;

local_set(&rb->aux_nest, 0);
@@ -382,6 +386,14 @@ int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)

local_add(size, &rb->aux_head);

+ aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);
+ if (aux_head - local_read(&rb->aux_wakeup) >= rb->aux_watermark) {
+ perf_output_wakeup(handle);
+ local_add(rb->aux_watermark, &rb->aux_wakeup);
+ handle->wakeup = local_read(&rb->aux_wakeup) +
+ rb->aux_watermark;
+ }
+
handle->head = aux_head;
handle->size -= size;

@@ -432,7 +444,7 @@ static void rb_free_aux_page(struct ring_buffer *rb, int idx)
}

int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
- pgoff_t pgoff, int nr_pages, int flags)
+ pgoff_t pgoff, int nr_pages, long watermark, int flags)
{
bool overwrite = !(flags & RING_BUFFER_WRITABLE);
int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
@@ -491,6 +503,10 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
kref_init(&rb->aux_refcount);

rb->aux_overwrite = overwrite;
+ rb->aux_watermark = watermark;
+
+ if (!rb->aux_watermark && !rb->aux_overwrite)
+ rb->aux_watermark = nr_pages << (PAGE_SHIFT - 1);

out:
if (!ret)
--
2.1.0

2014-10-13 13:47:30

by Alexander Shishkin

Subject: [PATCH v5 10/20] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection

Intel Processor Trace is an architecture extension that allows for program
flow tracing.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 1 +
arch/x86/kernel/cpu/scattered.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index bb9b258d60..db2debe5bb 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -185,6 +185,7 @@
#define X86_FEATURE_DTHERM ( 7*32+ 7) /* Digital Thermal Sensor */
#define X86_FEATURE_HW_PSTATE ( 7*32+ 8) /* AMD HW-PState */
#define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
+#define X86_FEATURE_INTEL_PT ( 7*32+10) /* Intel Processor Trace */

/* Virtualization flags: Linux defined, word 8 */
#define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 4a8013d559..42f5fa953f 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -36,6 +36,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
{ X86_FEATURE_ARAT, CR_EAX, 2, 0x00000006, 0 },
{ X86_FEATURE_PLN, CR_EAX, 4, 0x00000006, 0 },
{ X86_FEATURE_PTS, CR_EAX, 6, 0x00000006, 0 },
+ { X86_FEATURE_INTEL_PT, CR_EBX,25, 0x00000007, 0 },
{ X86_FEATURE_APERFMPERF, CR_ECX, 0, 0x00000006, 0 },
{ X86_FEATURE_EPB, CR_ECX, 3, 0x00000006, 0 },
{ X86_FEATURE_HW_PSTATE, CR_EDX, 7, 0x80000007, 0 },
--
2.1.0

2014-10-13 13:47:36

by Alexander Shishkin

Subject: [PATCH v5 11/20] x86: perf: Intel PT and LBR/BTS are mutually exclusive

Intel PT cannot be used at the same time as LBR or BTS and will cause a
general protection fault if they are used together. Instead of fixing up
GPs in the fast path, we use flags to indicate that one of these is in
use, so that the other avoids the MSR access altogether.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/kernel/cpu/perf_event.h | 6 ++++++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 8 +++++++-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 9 +++++----
3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d98a34d435..8eee92be78 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -148,6 +148,7 @@ struct cpu_hw_events {
* Intel DebugStore bits
*/
struct debug_store *ds;
+ unsigned int bts_enabled;
u64 pebs_enabled;

/*
@@ -161,6 +162,11 @@ struct cpu_hw_events {
u64 br_sel;

/*
+ * Intel Processor Trace
+ */
+ unsigned int pt_enabled;
+
+ /*
* Intel host/guest exclude bits
*/
u64 intel_ctrl_guest_mask;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index b1553d05a5..865ee1268a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -455,8 +455,13 @@ struct event_constraint bts_constraint =

void intel_pmu_enable_bts(u64 config)
{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
unsigned long debugctlmsr;

+ if (cpuc->pt_enabled)
+ return;
+
+ cpuc->bts_enabled = 1;
debugctlmsr = get_debugctlmsr();

debugctlmsr |= DEBUGCTLMSR_TR;
@@ -477,9 +482,10 @@ void intel_pmu_disable_bts(void)
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
unsigned long debugctlmsr;

- if (!cpuc->ds)
+ if (!cpuc->ds || cpuc->pt_enabled)
return;

+ cpuc->bts_enabled = 0;
debugctlmsr = get_debugctlmsr();

debugctlmsr &=
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 4af10617de..1b09c4b91a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -172,7 +172,9 @@ static void intel_pmu_lbr_reset_64(void)

void intel_pmu_lbr_reset(void)
{
- if (!x86_pmu.lbr_nr)
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ if (!x86_pmu.lbr_nr || cpuc->pt_enabled)
return;

if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
@@ -185,7 +187,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

- if (!x86_pmu.lbr_nr)
+ if (!x86_pmu.lbr_nr || cpuc->pt_enabled)
return;

/*
@@ -205,11 +207,10 @@ void intel_pmu_lbr_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

- if (!x86_pmu.lbr_nr)
+ if (!x86_pmu.lbr_nr || !cpuc->lbr_users || cpuc->pt_enabled)
return;

cpuc->lbr_users--;
- WARN_ON_ONCE(cpuc->lbr_users < 0);

if (cpuc->enabled && !cpuc->lbr_users) {
__intel_pmu_lbr_disable();
--
2.1.0

2014-10-13 13:47:59

by Alexander Shishkin

Subject: [PATCH v5 16/20] perf: Add a helper for looking up pmus by type

This patch adds a helper for looking up a registered pmu by its type.

Signed-off-by: Alexander Shishkin <[email protected]>
---
kernel/events/core.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index d4057dff27..92da1aecc7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6963,6 +6963,18 @@ void perf_pmu_unregister(struct pmu *pmu)
}
EXPORT_SYMBOL_GPL(perf_pmu_unregister);

+/* call under pmus_srcu */
+static struct pmu *__perf_find_pmu(u32 type)
+{
+ struct pmu *pmu;
+
+ rcu_read_lock();
+ pmu = idr_find(&pmu_idr, type);
+ rcu_read_unlock();
+
+ return pmu;
+}
+
struct pmu *perf_init_event(struct perf_event *event)
{
struct pmu *pmu = NULL;
@@ -6971,9 +6983,7 @@ struct pmu *perf_init_event(struct perf_event *event)

idx = srcu_read_lock(&pmus_srcu);

- rcu_read_lock();
- pmu = idr_find(&pmu_idr, event->attr.type);
- rcu_read_unlock();
+ pmu = __perf_find_pmu(event->attr.type);
if (pmu) {
if (!try_module_get(pmu->module)) {
pmu = ERR_PTR(-ENODEV);
--
2.1.0

2014-10-13 13:48:13

by Alexander Shishkin

Subject: [PATCH v5 05/20] perf: Add a pmu capability for "exclusive" events

Usually, pmus that do, for example, instruction tracing, would only ever
be able to have one event per task per cpu (or per perf_context). For such
pmus it makes sense to disallow creating conflicting events early on, so
as to provide consistent behavior for the user.

This patch adds a pmu capability that indicates such constraint on event
creation.
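
A pmu with this constraint simply declares the capability (sketch; 'my_pmu'
is a hypothetical driver):

static struct pmu my_pmu = {
        .capabilities   = PERF_PMU_CAP_EXCLUSIVE,
        /* ... the usual pmu callbacks: event_init, add, del, start, stop ... */
};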

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 46 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e7a15f5c3f..5c3dee021c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -173,6 +173,7 @@ struct perf_event;
#define PERF_PMU_CAP_NO_INTERRUPT 0x01
#define PERF_PMU_CAP_AUX_NO_SG 0x02
#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 0x04
+#define PERF_PMU_CAP_EXCLUSIVE 0x08

/**
* struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 86b0577229..ced9b0819f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7286,6 +7286,32 @@ out:
return ret;
}

+static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+ if ((e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) &&
+ (e1->cpu == e2->cpu ||
+ e1->cpu == -1 ||
+ e2->cpu == -1))
+ return true;
+ return false;
+}
+
+static bool exclusive_event_ok(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ struct perf_event *iter_event;
+
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
+ return true;
+
+ list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
+ if (exclusive_event_match(iter_event, event))
+ return false;
+ }
+
+ return true;
+}
+
/**
* sys_perf_event_open - open a performance event, associate it to a task/cpu
*
@@ -7437,6 +7463,11 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_alloc;
}

+ if ((pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) && group_leader) {
+ err = -EBUSY;
+ goto err_context;
+ }
+
if (task) {
put_task_struct(task);
task = NULL;
@@ -7522,6 +7553,12 @@ SYSCALL_DEFINE5(perf_event_open,
}
}

+ if (!exclusive_event_ok(event, ctx)) {
+ mutex_unlock(&ctx->mutex);
+ fput(event_file);
+ goto err_context;
+ }
+
perf_install_in_context(ctx, event, event->cpu);
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);
@@ -7608,6 +7645,14 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,

WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
+ if (!exclusive_event_ok(event, ctx)) {
+ mutex_unlock(&ctx->mutex);
+ perf_unpin_context(ctx);
+ put_ctx(ctx);
+ err = -EBUSY;
+ goto err_free;
+ }
+
perf_install_in_context(ctx, event, cpu);
perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);
--
2.1.0

2014-10-13 13:48:11

by Alexander Shishkin

Subject: [PATCH v5 17/20] perf: Add infrastructure for using AUX data in perf samples

AUX data can be used to annotate perf events such as performance counters
or tracepoints/breakpoints by including it in sample records when the
PERF_SAMPLE_AUX flag is set. Such samples are instrumental in debugging
and profiling by providing, for example, a history of instruction flow
leading up to the event's overflow.

To facilitate this, this patch adds code to create a kernel counter with a
ring buffer using rb_{alloc,free}_kernel() interface for each event that
needs to include AUX samples and to copy AUX data from it into the perf
data stream.

The user interface is extended to allow for this; new attribute fields
are added:

* aux_sample_type: pmu->type of the PMU on which the AUX data
generating event is created,
* aux_sample_config: event config (maps to the attribute's config field),
* aux_sample_size: size of the sample to be written.

This kernel counter is configured similarly to the event that is being
annotated with regards to filtering (exclude_{hv,idle,user,kernel}) and
enabled state (disabled, enable_on_exec) to make sure that the two events
are scheduled at the same time and that no out of context activity is
sampled.
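
For illustration (userspace side; 'pt_type' stands for the value read from
the AUX pmu's sysfs "type" file, and the numbers are arbitrary), annotating
a hardware counter with AUX data would look roughly like:

struct perf_event_attr attr = {
        .size                   = sizeof(attr),
        .type                   = PERF_TYPE_HARDWARE,
        .config                 = PERF_COUNT_HW_CPU_CYCLES,
        .sample_period          = 100000,
        .sample_type            = PERF_SAMPLE_IP | PERF_SAMPLE_AUX,

        /*
         * Which pmu generates the AUX data, its config and how many
         * bytes of it to attach to each sample.
         */
        .aux_sample_type        = pt_type,
        .aux_sample_config      = 0,
        .aux_sample_size        = 4096,
};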

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 9 +++
include/uapi/linux/perf_event.h | 18 ++++-
kernel/events/core.c | 172 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 198 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 282721b2df..de8cc714e9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -84,6 +84,12 @@ struct perf_regs_user {
struct pt_regs *regs;
};

+struct perf_aux_record {
+ u64 size;
+ unsigned long from;
+ unsigned long to;
+};
+
struct task_struct;

/*
@@ -458,6 +464,7 @@ struct perf_event {
perf_overflow_handler_t overflow_handler;
void *overflow_handler_context;

+ struct perf_event *sampler;
#ifdef CONFIG_EVENT_TRACING
struct ftrace_event_call *tp_event;
struct event_filter *filter;
@@ -628,6 +635,7 @@ struct perf_sample_data {
union perf_mem_data_src data_src;
struct perf_callchain_entry *callchain;
struct perf_raw_record *raw;
+ struct perf_aux_record aux;
struct perf_branch_stack *br_stack;
struct perf_regs_user regs_user;
u64 stack_user_size;
@@ -655,6 +663,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
data->period = period;
data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
data->regs_user.regs = NULL;
+ data->aux.from = data->aux.to = data->aux.size = 0;
data->stack_user_size = 0;
data->weight = 0;
data->data_src.val = PERF_MEM_NA;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index a0cafbdc1c..ed2e21fa13 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_DATA_SRC = 1U << 15,
PERF_SAMPLE_IDENTIFIER = 1U << 16,
PERF_SAMPLE_TRANSACTION = 1U << 17,
+ PERF_SAMPLE_AUX = 1U << 18,

- PERF_SAMPLE_MAX = 1U << 18, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 19, /* non-ABI */
};

/*
@@ -239,6 +240,9 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER3 96 /* add: sample_regs_user */
/* add: sample_stack_user */
/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER4 120 /* add: aux_sample_config */
+ /* add: aux_sample_size */
+ /* add: aux_sample_type */

/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -337,6 +341,16 @@ struct perf_event_attr {
* Wakeup watermark for AUX area
*/
__u32 aux_watermark;
+
+ /*
+ * Itrace pmus' event config
+ */
+ __u64 aux_sample_config; /* event config for AUX sampling */
+ __u64 aux_sample_size; /* desired sample size */
+ __u32 aux_sample_type; /* pmu->type of an AUX PMU */
+
+ /* Align to u64. */
+ __u32 __reserved_2;
};

#define perf_flags(attr) (*(&(attr)->read_format + 1))
@@ -710,6 +724,8 @@ enum perf_event_type {
* { u64 weight; } && PERF_SAMPLE_WEIGHT
* { u64 data_src; } && PERF_SAMPLE_DATA_SRC
* { u64 transaction; } && PERF_SAMPLE_TRANSACTION
+ * { u64 size;
+ * char data[size]; } && PERF_SAMPLE_AUX
* };
*/
PERF_RECORD_SAMPLE = 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 92da1aecc7..5da1bc403f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1647,6 +1647,9 @@ void perf_event_disable(struct perf_event *event)
struct perf_event_context *ctx = event->ctx;
struct task_struct *task = ctx->task;

+ if (event->sampler)
+ perf_event_disable(event->sampler);
+
if (!task) {
/*
* Disable the event on the cpu that it's on
@@ -2149,6 +2152,8 @@ void perf_event_enable(struct perf_event *event)
struct perf_event_context *ctx = event->ctx;
struct task_struct *task = ctx->task;

+ if (event->sampler)
+ perf_event_enable(event->sampler);
if (!task) {
/*
* Enable the event on the cpu that it's on
@@ -3287,6 +3292,8 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
}

+static void perf_aux_sampler_fini(struct perf_event *event);
+
static void unaccount_event(struct perf_event *event)
{
if (event->parent)
@@ -3306,6 +3313,8 @@ static void unaccount_event(struct perf_event *event)
static_key_slow_dec_deferred(&perf_sched_events);
if (has_branch_stack(event))
static_key_slow_dec_deferred(&perf_sched_events);
+ if ((event->attr.sample_type & PERF_SAMPLE_AUX))
+ perf_aux_sampler_fini(event);

unaccount_event_cpu(event, event->cpu);
}
@@ -4632,6 +4641,139 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
}
}

+static void perf_aux_sampler_destroy(struct perf_event *event)
+{
+ struct ring_buffer *rb = event->rb;
+
+ if (!rb)
+ return;
+
+ ring_buffer_put(rb); /* can be last */
+}
+
+static int perf_aux_sampler_init(struct perf_event *event,
+ struct task_struct *task,
+ struct pmu *pmu)
+{
+ struct perf_event_attr attr;
+ struct perf_event *sampler;
+ struct ring_buffer *rb;
+ unsigned long nr_pages;
+
+ if (!pmu || !(pmu->setup_aux))
+ return -ENOTSUPP;
+
+ memset(&attr, 0, sizeof(attr));
+ attr.type = pmu->type;
+ attr.config = event->attr.aux_sample_config;
+ attr.sample_type = 0;
+ attr.disabled = event->attr.disabled;
+ attr.enable_on_exec = event->attr.enable_on_exec;
+ attr.exclude_hv = event->attr.exclude_hv;
+ attr.exclude_idle = event->attr.exclude_idle;
+ attr.exclude_user = event->attr.exclude_user;
+ attr.exclude_kernel = event->attr.exclude_kernel;
+ attr.aux_sample_size = event->attr.aux_sample_size;
+
+ sampler = perf_event_create_kernel_counter(&attr, event->cpu, task,
+ NULL, NULL);
+ if (IS_ERR(sampler))
+ return PTR_ERR(sampler);
+
+ nr_pages = 1ul << __get_order(event->attr.aux_sample_size);
+
+ rb = rb_alloc_kernel(sampler, 0, nr_pages);
+ if (!rb) {
+ perf_event_release_kernel(sampler);
+ return -ENOMEM;
+ }
+
+ event->sampler = sampler;
+ sampler->destroy = perf_aux_sampler_destroy;
+
+ return 0;
+}
+
+static void perf_aux_sampler_fini(struct perf_event *event)
+{
+ struct perf_event *sampler = event->sampler;
+
+ /* might get free'd from event->destroy() path */
+ if (!sampler)
+ return;
+
+ perf_event_release_kernel(sampler);
+
+ event->sampler = NULL;
+}
+
+static unsigned long perf_aux_sampler_trace(struct perf_event *event,
+ struct perf_sample_data *data)
+{
+ struct perf_event *sampler = event->sampler;
+ struct ring_buffer *rb;
+
+ if (!sampler || sampler->state != PERF_EVENT_STATE_ACTIVE) {
+ data->aux.size = 0;
+ goto out;
+ }
+
+ rb = ring_buffer_get(sampler);
+ if (!rb) {
+ data->aux.size = 0;
+ goto out;
+ }
+
+ sampler->pmu->del(sampler, 0);
+
+ data->aux.to = local_read(&rb->aux_head);
+
+ if (data->aux.to < sampler->attr.aux_sample_size)
+ data->aux.from = rb->aux_nr_pages * PAGE_SIZE +
+ data->aux.to - sampler->attr.aux_sample_size;
+ else
+ data->aux.from = data->aux.to -
+ sampler->attr.aux_sample_size;
+ data->aux.size = ALIGN(sampler->attr.aux_sample_size, sizeof(u64));
+ ring_buffer_put(rb);
+
+out:
+ return data->aux.size;
+}
+
+static void perf_aux_sampler_output(struct perf_event *event,
+ struct perf_output_handle *handle,
+ struct perf_sample_data *data)
+{
+ struct perf_event *sampler = event->sampler;
+ struct ring_buffer *rb;
+ unsigned long pad;
+ int ret;
+
+ if (WARN_ON_ONCE(!sampler || !data->aux.size))
+ return;
+
+ rb = ring_buffer_get(sampler);
+ if (WARN_ON_ONCE(!rb))
+ return;
+ ret = rb_output_aux(rb, data->aux.from, data->aux.to,
+ (aux_copyfn)perf_output_copy, handle);
+ if (ret < 0) {
+ pr_warn_ratelimited("failed to copy trace data\n");
+ goto out;
+ }
+
+ pad = data->aux.size - ret;
+ if (pad) {
+ u64 p = 0;
+
+ perf_output_copy(handle, &p, pad);
+ }
+out:
+ ring_buffer_put(rb);
+ sampler->pmu->add(sampler, PERF_EF_START);
+}
+
static void __perf_event_header__init_id(struct perf_event_header *header,
struct perf_sample_data *data,
struct perf_event *event)
@@ -4918,6 +5060,13 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_TRANSACTION)
perf_output_put(handle, data->txn);

+ if (sample_type & PERF_SAMPLE_AUX) {
+ perf_output_put(handle, data->aux.size);
+
+ if (data->aux.size)
+ perf_aux_sampler_output(event, handle, data);
+ }
+
if (!event->attr.watermark) {
int wakeup_events = event->attr.wakeup_events;

@@ -5025,6 +5174,14 @@ void perf_prepare_sample(struct perf_event_header *header,
data->stack_user_size = stack_size;
header->size += size;
}
+
+ if (sample_type & PERF_SAMPLE_AUX) {
+ u64 size = sizeof(u64);
+
+ size += perf_aux_sampler_trace(event, data);
+
+ header->size += size;
+ }
}

static void perf_event_output(struct perf_event *event,
@@ -7172,6 +7329,21 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (err)
goto err_pmu;
}
+
+ if (event->attr.sample_type & PERF_SAMPLE_AUX) {
+ struct pmu *aux_pmu;
+ int idx;
+
+ idx = srcu_read_lock(&pmus_srcu);
+ aux_pmu = __perf_find_pmu(event->attr.aux_sample_type);
+ err = perf_aux_sampler_init(event, task, aux_pmu);
+ srcu_read_unlock(&pmus_srcu, idx);
+
+ if (err) {
+ put_callchain_buffers();
+ goto err_pmu;
+ }
+ }
}

return event;
--
2.1.0

2014-10-13 13:48:32

by Alexander Shishkin

Subject: [PATCH v5 15/20] perf: Add api to (de-)allocate AUX buffers for kernel counters

Several use cases for AUX data, namely event sampling (including a piece of
AUX data in some perf event sample, so that the user can get, for example,
instruction traces leading up to a certain event like a breakpoint or a
hardware event), process core dumps (providing user with a history of a
process' instruction flow leading up to a crash), system crash dumps and
storing AUX data in pstore across reboot (to facilitate post-mortem
investigation of a system crash) require different parts of the kernel code
to be able to configure hardware to produce AUX data and collect it when it
is needed.

Luckily, there is already an api for in-kernel perf events, which has several
users. This proposal is to extend that api to allow in-kernel users to
allocate AUX buffers for kernel counters. Such users will call
rb_alloc_kernel() to allocate what they want and later copy the data out to
other backends, e.g. a sample in another event's ring buffer or a core dump
file. These buffers are never mapped to userspace.

There are no additional constraints or requirements on the pmu drivers.

A typical user of this interface will first create a kernel counter with a
call to perf_event_create_kernel_counter() and then allocate a ring buffer
for it with rb_alloc_kernel(). Data can then be copied out from the AUX
buffer using rb_output_aux(), which is passed a callback that will write
chunks of AUX data into the desired destination, such as perf_output_copy()
or dump_emit(). The caller needs to use perf_event_disable() to make sure
that the counter is not active while it copies data out.
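
A sketch of a typical in-kernel user (error paths trimmed; the my_*() names
are hypothetical, and rb_alloc_kernel() et al live in kernel/events/internal.h):

static struct perf_event *sampler;
static struct ring_buffer *sampler_rb;

/* Create a kernel counter with no data pages and 16 AUX pages. */
static int my_aux_user_init(struct perf_event_attr *attr, int cpu)
{
        sampler = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
        if (IS_ERR(sampler))
                return PTR_ERR(sampler);

        sampler_rb = rb_alloc_kernel(sampler, 0, 16);
        if (!sampler_rb) {
                perf_event_release_kernel(sampler);
                return -ENOMEM;
        }

        return 0;
}

/* Copy [from, to) of the AUX buffer into another event's output handle. */
static long my_aux_user_copy(unsigned long from, unsigned long to,
                             struct perf_output_handle *handle)
{
        long ret;

        perf_event_disable(sampler);    /* the counter must not be active */
        ret = rb_output_aux(sampler_rb, from, to,
                            (aux_copyfn)perf_output_copy, handle);
        perf_event_enable(sampler);

        return ret;
}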

Signed-off-by: Alexander Shishkin <[email protected]>
---
kernel/events/internal.h | 8 ++++++
kernel/events/ring_buffer.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 4715aae48b..81cb7afec4 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -54,6 +54,9 @@ struct ring_buffer {
void *data_pages[0];
};

+typedef unsigned long (*aux_copyfn)(void *data, const void *src,
+ unsigned long len);
+
extern void rb_free(struct ring_buffer *rb);
extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
@@ -61,6 +64,11 @@ extern void perf_event_wakeup(struct perf_event *event);
extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
pgoff_t pgoff, int nr_pages, long watermark, int flags);
extern void rb_free_aux(struct ring_buffer *rb);
+extern long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+ unsigned long to, aux_copyfn copyfn, void *data);
+extern struct ring_buffer *
+rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages);
+extern void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event);
extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
extern void ring_buffer_put(struct ring_buffer *rb);

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 96ec58fb46..c062378c37 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -409,6 +409,37 @@ void *perf_get_aux(struct perf_output_handle *handle)
return handle->rb->aux_priv;
}

+long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+ unsigned long to, aux_copyfn copyfn, void *data)
+{
+ unsigned long tocopy, remainder, len = 0;
+ void *addr;
+
+ from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+ to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+
+ do {
+ tocopy = PAGE_SIZE - offset_in_page(from);
+ if (to > from)
+ tocopy = min(tocopy, to - from);
+ if (!tocopy)
+ break;
+
+ addr = rb->aux_pages[from >> PAGE_SHIFT];
+ addr += offset_in_page(from);
+
+ remainder = copyfn(data, addr, tocopy);
+ if (remainder)
+ return -EFAULT;
+
+ len += tocopy;
+ from += tocopy;
+ from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+ } while (to != from);
+
+ return len;
+}
+
#define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)

static struct page *rb_alloc_aux_page(int node, int order)
@@ -540,6 +571,39 @@ void rb_free_aux(struct ring_buffer *rb)
kref_put(&rb->aux_refcount, __rb_free_aux);
}

+struct ring_buffer *
+rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages)
+{
+ struct ring_buffer *rb;
+ int ret, pgoff = nr_pages + 1;
+
+ rb = rb_alloc(nr_pages, 0, event->cpu, 0);
+ if (!rb)
+ return NULL;
+
+ ret = rb_alloc_aux(rb, event, pgoff, aux_nr_pages, 0, 0);
+ if (ret) {
+ rb_free(rb);
+ return NULL;
+ }
+
+ /*
+ * Kernel counters don't need ring buffer wakeups, therefore we don't
+ * use ring_buffer_attach() here and event->rb_entry stays empty
+ */
+ rcu_assign_pointer(event->rb, rb);
+
+ return rb;
+}
+
+void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event)
+{
+ WARN_ON_ONCE(atomic_read(&rb->refcount) != 1);
+ rcu_assign_pointer(event->rb, NULL);
+ rb_free_aux(rb);
+ rb_free(rb);
+}
+
#ifndef CONFIG_PERF_USE_VMALLOC

/*
--
2.1.0

2014-10-13 13:48:35

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 13/20] x86: perf: intel_bts: Add BTS PMU driver

Add support for Branch Trace Store (BTS) via kernel perf event infrastructure.
The difference from the existing implementation of BTS support is that this
one is a separate PMU that exports events' trace buffers to userspace by means
of the AUX area of the perf buffer, which is zero-copy mapped into userspace.

The immediate benefit is that the buffer size can be much bigger, resulting in
fewer interrupts, no kernel-side copying and little to no trace data loss.
Also, kernel code can be traced with this driver.

The old way of collecting BTS traces still works.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event.h | 6 +
arch/x86/kernel/cpu/perf_event_intel.c | 6 +-
arch/x86/kernel/cpu/perf_event_intel_bts.c | 518 +++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 3 +-
5 files changed, 532 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 00d40f889d..387223c716 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -39,7 +39,7 @@ obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o per
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o perf_event_intel_uncore_snb.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore_snbep.o perf_event_intel_uncore_nhmex.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_rapl.o
-obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o perf_event_intel_bts.o
endif


diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 9f9b02fea6..155721e54d 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -752,6 +752,12 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event);

void intel_pt_interrupt(void);

+int intel_bts_interrupt(void);
+
+void intel_bts_enable_local(void);
+
+void intel_bts_disable_local(void);
+
int p4_pmu_init(void);

int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 6e04da1b04..0124c0d868 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1180,6 +1180,8 @@ static void intel_pmu_disable_all(void)

if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
intel_pmu_disable_bts();
+ else
+ intel_bts_disable_local();

intel_pmu_pebs_disable_all();
intel_pmu_lbr_disable_all();
@@ -1202,7 +1204,8 @@ static void intel_pmu_enable_all(int added)
return;

intel_pmu_enable_bts(event->hw.config);
- }
+ } else
+ intel_bts_enable_local();
}

/*
@@ -1488,6 +1491,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
apic_write(APIC_LVTPC, APIC_DM_NMI);
intel_pmu_disable_all();
handled = intel_pmu_drain_bts_buffer();
+ handled += intel_bts_interrupt();
status = intel_pmu_get_status();
if (!status)
goto done;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
new file mode 100644
index 0000000000..d418c7ed9c
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -0,0 +1,518 @@
+/*
+ * BTS PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/device.h>
+#include <linux/coredump.h>
+
+#include <asm-generic/sizes.h>
+#include <asm/perf_event.h>
+
+#include "perf_event.h"
+
+struct bts_ctx {
+ struct perf_output_handle handle;
+ struct debug_store ds_back;
+ int started;
+};
+
+static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
+
+#define BTS_RECORD_SIZE 24
+#define BTS_SAFETY_MARGIN 4080
+
+struct bts_phys {
+ struct page *page;
+ unsigned long size;
+ unsigned long offset;
+ unsigned long displacement;
+};
+
+struct bts_buffer {
+ size_t real_size; /* multiple of BTS_RECORD_SIZE */
+ unsigned int nr_pages;
+ unsigned int nr_bufs;
+ unsigned int cur_buf;
+ bool snapshot;
+ local_t data_size;
+ local_t lost;
+ local_t head;
+ unsigned long end;
+ void **data_pages;
+ struct bts_phys buf[0];
+};
+
+struct pmu bts_pmu;
+
+void intel_pmu_enable_bts(u64 config);
+void intel_pmu_disable_bts(void);
+
+static size_t buf_size(struct page *page)
+{
+ return 1 << (PAGE_SHIFT + page_private(page));
+}
+
+static void *
+bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite)
+{
+ struct bts_buffer *buf;
+ struct page *page;
+ int node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+ unsigned long offset;
+ size_t size = nr_pages << PAGE_SHIFT;
+ int pg, nbuf, pad;
+
+ /* count all the high order buffers */
+ for (pg = 0, nbuf = 0; pg < nr_pages;) {
+ page = virt_to_page(pages[pg]);
+ if (WARN_ON_ONCE(!PagePrivate(page) && nr_pages > 1))
+ return NULL;
+ pg += 1 << page_private(page);
+ nbuf++;
+ }
+
+ /*
+ * to avoid interrupts in overwrite mode, only allow one physical buffer
+ */
+ if (overwrite && nbuf > 1)
+ return NULL;
+
+ buf = kzalloc_node(offsetof(struct bts_buffer, buf[nbuf]), GFP_KERNEL, node);
+ if (!buf)
+ return NULL;
+
+ buf->nr_pages = nr_pages;
+ buf->nr_bufs = nbuf;
+ buf->snapshot = overwrite;
+ buf->data_pages = pages;
+ buf->real_size = size - size % BTS_RECORD_SIZE;
+
+ for (pg = 0, nbuf = 0, offset = 0, pad = 0; nbuf < buf->nr_bufs; nbuf++) {
+ unsigned int __nr_pages;
+
+ page = virt_to_page(pages[pg]);
+ __nr_pages = PagePrivate(page) ? 1 << page_private(page) : 1;
+ buf->buf[nbuf].page = page;
+ buf->buf[nbuf].offset = offset;
+ buf->buf[nbuf].displacement = (pad ? BTS_RECORD_SIZE - pad : 0);
+ buf->buf[nbuf].size = buf_size(page) - buf->buf[nbuf].displacement;
+ pad = buf->buf[nbuf].size % BTS_RECORD_SIZE;
+ buf->buf[nbuf].size -= pad;
+
+ pg += __nr_pages;
+ offset += __nr_pages << PAGE_SHIFT;
+ }
+
+ return buf;
+}
+
+static void bts_buffer_free_aux(void *data)
+{
+ kfree(data);
+}
+
+static unsigned long bts_buffer_offset(struct bts_buffer *buf, unsigned int idx)
+{
+ return buf->buf[idx].offset + buf->buf[idx].displacement;
+}
+
+static void
+bts_config_buffer(struct bts_buffer *buf)
+{
+ int cpu = raw_smp_processor_id();
+ struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+ struct bts_phys *phys = &buf->buf[buf->cur_buf];
+ unsigned long index, thresh = 0, end = phys->size;
+ struct page *page = phys->page;
+
+ index = local_read(&buf->head);
+
+ if (!buf->snapshot) {
+ if (buf->end < phys->offset + buf_size(page))
+ end = buf->end - phys->offset - phys->displacement;
+
+ index -= phys->offset + phys->displacement;
+
+ if (end - index > BTS_SAFETY_MARGIN)
+ thresh = end - BTS_SAFETY_MARGIN;
+ else if (end - index > BTS_RECORD_SIZE)
+ thresh = end - BTS_RECORD_SIZE;
+ else
+ thresh = end;
+ }
+
+ ds->bts_buffer_base = (u64)page_address(page) + phys->displacement;
+ ds->bts_index = ds->bts_buffer_base + index;
+ ds->bts_absolute_maximum = ds->bts_buffer_base + end;
+ ds->bts_interrupt_threshold = !buf->snapshot
+ ? ds->bts_buffer_base + thresh
+ : ds->bts_absolute_maximum + BTS_RECORD_SIZE;
+}
+
+static void bts_buffer_pad_out(struct bts_phys *phys, unsigned long head)
+{
+ unsigned long index = head - phys->offset;
+
+ memset(page_address(phys->page) + index, 0, phys->size - index);
+}
+
+static bool bts_buffer_is_full(struct bts_buffer *buf, struct bts_ctx *bts)
+{
+ if (buf->snapshot)
+ return false;
+
+ if (local_read(&buf->data_size) >= bts->handle.size ||
+ bts->handle.size - local_read(&buf->data_size) < BTS_RECORD_SIZE)
+ return true;
+
+ return false;
+}
+
+static void bts_update(struct bts_ctx *bts)
+{
+ int cpu = raw_smp_processor_id();
+ struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+ struct bts_buffer *buf = perf_get_aux(&bts->handle);
+ unsigned long index = ds->bts_index - ds->bts_buffer_base, old, head;
+
+ if (!buf)
+ return;
+
+ head = index + bts_buffer_offset(buf, buf->cur_buf);
+
+ if (!buf->snapshot) {
+ old = local_xchg(&buf->head, head);
+ if (old == head)
+ return;
+
+ if (ds->bts_index >= ds->bts_absolute_maximum)
+ local_inc(&buf->lost);
+
+ /*
+ * old and head are always in the same physical buffer, so we
+ * can subtract them to get the data size.
+ */
+ local_add(head - old, &buf->data_size);
+ } else {
+ local_set(&buf->data_size, head);
+ }
+}
+
+static void __bts_event_start(struct perf_event *event)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+ struct bts_buffer *buf = perf_get_aux(&bts->handle);
+ u64 config = 0;
+
+ if (!buf || bts_buffer_is_full(buf, bts))
+ return;
+
+ event->hw.state = 0;
+
+ if (!buf->snapshot)
+ config |= ARCH_PERFMON_EVENTSEL_INT;
+ if (!event->attr.exclude_kernel)
+ config |= ARCH_PERFMON_EVENTSEL_OS;
+ if (!event->attr.exclude_user)
+ config |= ARCH_PERFMON_EVENTSEL_USR;
+
+ bts_config_buffer(buf);
+
+ /*
+ * local barrier to make sure that ds configuration made it
+ * before we enable BTS
+ */
+ wmb();
+
+ intel_pmu_enable_bts(config);
+}
+
+static void bts_event_start(struct perf_event *event, int flags)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+ __bts_event_start(event);
+
+ /* PMI handler: this counter is running and likely generating PMIs */
+ ACCESS_ONCE(bts->started) = 1;
+}
+
+static void __bts_event_stop(struct perf_event *event)
+{
+ /*
+ * No extra synchronization is mandated by the documentation to have
+ * BTS data stores globally visible.
+ */
+ intel_pmu_disable_bts();
+
+ if (event->hw.state & PERF_HES_STOPPED)
+ return;
+
+ ACCESS_ONCE(event->hw.state) |= PERF_HES_STOPPED;
+}
+
+static void bts_event_stop(struct perf_event *event, int flags)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+ /* PMI handler: don't restart this counter */
+ ACCESS_ONCE(bts->started) = 0;
+
+ __bts_event_stop(event);
+
+ if (flags & PERF_EF_UPDATE)
+ bts_update(bts);
+}
+
+void intel_bts_enable_local(void)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+ if (bts->handle.event && bts->started)
+ __bts_event_start(bts->handle.event);
+}
+
+void intel_bts_disable_local(void)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+ if (bts->handle.event)
+ __bts_event_stop(bts->handle.event);
+}
+
+static int
+bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
+{
+ unsigned long head, space, next_space, pad, gap, skip, wakeup;
+ unsigned int next_buf;
+ struct bts_phys *phys, *next_phys;
+ int ret;
+
+ if (buf->snapshot)
+ return 0;
+
+ head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
+ if (WARN_ON_ONCE(head != local_read(&buf->head)))
+ return -EINVAL;
+
+ phys = &buf->buf[buf->cur_buf];
+ space = phys->offset + phys->displacement + phys->size - head;
+ pad = space;
+ if (space > handle->size) {
+ space = handle->size;
+ space -= space % BTS_RECORD_SIZE;
+ }
+ if (space <= BTS_SAFETY_MARGIN) {
+ /* See if next phys buffer has more space */
+ next_buf = buf->cur_buf + 1;
+ if (next_buf >= buf->nr_bufs)
+ next_buf = 0;
+ next_phys = &buf->buf[next_buf];
+ gap = buf_size(phys->page) - phys->displacement - phys->size +
+ next_phys->displacement;
+ skip = pad + gap;
+ if (handle->size >= skip) {
+ next_space = next_phys->size;
+ if (next_space + skip > handle->size) {
+ next_space = handle->size - skip;
+ next_space -= next_space % BTS_RECORD_SIZE;
+ }
+ if (next_space > space || !space) {
+ if (pad)
+ bts_buffer_pad_out(phys, head);
+ ret = perf_aux_output_skip(handle, skip);
+ if (ret)
+ return ret;
+ /* Advance to next phys buffer */
+ phys = next_phys;
+ space = next_space;
+ head = phys->offset + phys->displacement;
+ /*
+ * After this, cur_buf and head won't match ds
+ * anymore, so we must not be racing with
+ * bts_update().
+ */
+ buf->cur_buf = next_buf;
+ local_set(&buf->head, head);
+ }
+ }
+ }
+
+ /* Don't go far beyond wakeup watermark */
+ wakeup = BTS_SAFETY_MARGIN + BTS_RECORD_SIZE + handle->wakeup -
+ handle->head;
+ if (space > wakeup) {
+ space = wakeup;
+ space -= space % BTS_RECORD_SIZE;
+ }
+
+ buf->end = head + space;
+
+ /*
+ * If we have no space, the lost notification would have been sent when
+ * we hit absolute_maximum - see bts_update()
+ */
+ if (!space)
+ return -ENOSPC;
+
+ return 0;
+}
+
+int intel_bts_interrupt(void)
+{
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+ struct perf_event *event = bts->handle.event;
+ struct bts_buffer *buf;
+ s64 old_head;
+ int err;
+
+ if (!event || !bts->started)
+ return 0;
+
+ buf = perf_get_aux(&bts->handle);
+ /*
+ * Skip snapshot counters: they don't use the interrupt, but
+ * there's no other way of telling, because the pointer will
+ * keep moving
+ */
+ if (!buf || buf->snapshot)
+ return 0;
+
+ old_head = local_read(&buf->head);
+ bts_update(bts);
+
+ /* no new data */
+ if (old_head == local_read(&buf->head))
+ return 0;
+
+ perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+ !!local_xchg(&buf->lost, 0));
+
+ buf = perf_aux_output_begin(&bts->handle, event);
+ if (!buf)
+ return 1;
+
+ err = bts_buffer_reset(buf, &bts->handle);
+ if (err)
+ perf_aux_output_end(&bts->handle, 0, false);
+
+ return 1;
+}
+
+static void bts_event_del(struct perf_event *event, int mode)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+ struct bts_buffer *buf = perf_get_aux(&bts->handle);
+
+ bts_event_stop(event, PERF_EF_UPDATE);
+
+ if (buf) {
+ if (buf->snapshot)
+ bts->handle.head =
+ local_xchg(&buf->data_size,
+ buf->nr_pages << PAGE_SHIFT);
+ perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+ !!local_xchg(&buf->lost, 0));
+ }
+
+ cpuc->ds->bts_index = bts->ds_back.bts_buffer_base;
+ cpuc->ds->bts_buffer_base = bts->ds_back.bts_buffer_base;
+ cpuc->ds->bts_absolute_maximum = bts->ds_back.bts_absolute_maximum;
+ cpuc->ds->bts_interrupt_threshold = bts->ds_back.bts_interrupt_threshold;
+}
+
+static int bts_event_add(struct perf_event *event, int mode)
+{
+ struct bts_buffer *buf;
+ struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+ int ret = -EBUSY;
+
+ event->hw.state = PERF_HES_STOPPED;
+
+ if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
+ return -EBUSY;
+
+ if (cpuc->pt_enabled)
+ return -EBUSY;
+
+ if (bts->handle.event)
+ return -EBUSY;
+
+ buf = perf_aux_output_begin(&bts->handle, event);
+ if (!buf)
+ return -EINVAL;
+
+ ret = bts_buffer_reset(buf, &bts->handle);
+ if (ret) {
+ perf_aux_output_end(&bts->handle, 0, false);
+ return ret;
+ }
+
+ bts->ds_back.bts_buffer_base = cpuc->ds->bts_buffer_base;
+ bts->ds_back.bts_absolute_maximum = cpuc->ds->bts_absolute_maximum;
+ bts->ds_back.bts_interrupt_threshold = cpuc->ds->bts_interrupt_threshold;
+
+ if (mode & PERF_EF_START) {
+ bts_event_start(event, 0);
+ if (hwc->state & PERF_HES_STOPPED) {
+ bts_event_del(event, 0);
+ return -EBUSY;
+ }
+ }
+
+ return 0;
+}
+
+static int bts_event_init(struct perf_event *event)
+{
+ if (event->attr.type != bts_pmu.type)
+ return -ENOENT;
+
+ return 0;
+}
+
+static void bts_event_read(struct perf_event *event)
+{
+}
+
+static __init int bts_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
+ return -ENODEV;
+
+ bts_pmu.capabilities = PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE;
+ bts_pmu.task_ctx_nr = perf_hw_context;
+ bts_pmu.event_init = bts_event_init;
+ bts_pmu.add = bts_event_add;
+ bts_pmu.del = bts_event_del;
+ bts_pmu.start = bts_event_start;
+ bts_pmu.stop = bts_event_stop;
+ bts_pmu.read = bts_event_read;
+ bts_pmu.setup_aux = bts_buffer_setup_aux;
+ bts_pmu.free_aux = bts_buffer_free_aux;
+
+ return perf_pmu_register(&bts_pmu, "intel_bts", -1);
+}
+
+module_init(bts_init);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 865ee1268a..ed3cb75327 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -466,7 +466,8 @@ void intel_pmu_enable_bts(u64 config)

debugctlmsr |= DEBUGCTLMSR_TR;
debugctlmsr |= DEBUGCTLMSR_BTS;
- debugctlmsr |= DEBUGCTLMSR_BTINT;
+ if (config & ARCH_PERFMON_EVENTSEL_INT)
+ debugctlmsr |= DEBUGCTLMSR_BTINT;

if (!(config & ARCH_PERFMON_EVENTSEL_OS))
debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS;
--
2.1.0

2014-10-13 13:48:52

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 19/20] perf: Allow AUX sampling for multiple events

Right now, only one perf event in a context at a time can be annotated with
AUX data, because AUX pmus will normally have only one event scheduled at a
time. However, it should be possible to annotate several events with similar
configuration (wrt exclude_{hv,idle,user,kernel}, config and other event
attribute fields) using the same AUX event.

This patch changes the behavior so that, before a kernel counter is created
for AUX sampling, we first look for an existing counter that is suitable in
terms of configuration and use it to annotate the new event as well. If no
such event is found, we fall back to the old path and try to allocate a new
kernel counter.

Signed-off-by: Alexander Shishkin <[email protected]>
---
kernel/events/core.c | 93 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 66 insertions(+), 27 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 60e354d668..ad7b1e92dd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4651,45 +4651,84 @@ static void perf_aux_sampler_destroy(struct perf_event *event)
ring_buffer_put(rb); /* can be last */
}

+static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2);
+
+static bool perf_aux_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+ return has_aux(e1) && exclusive_event_match(e1, e2);
+}
+
static int perf_aux_sampler_init(struct perf_event *event,
struct task_struct *task,
struct pmu *pmu)
{
+ struct perf_event_context *ctx;
struct perf_event_attr attr;
- struct perf_event *sampler;
+ struct perf_event *sampler = NULL;
struct ring_buffer *rb;
- unsigned long nr_pages;
+ unsigned long nr_pages, flags;

if (!pmu || !(pmu->setup_aux))
return -ENOTSUPP;

- memset(&attr, 0, sizeof(attr));
- attr.type = pmu->type;
- attr.config = event->attr.aux_sample_config;
- attr.sample_type = 0;
- attr.disabled = event->attr.disabled;
- attr.enable_on_exec = event->attr.enable_on_exec;
- attr.exclude_hv = event->attr.exclude_hv;
- attr.exclude_idle = event->attr.exclude_idle;
- attr.exclude_user = event->attr.exclude_user;
- attr.exclude_kernel = event->attr.exclude_kernel;
- attr.aux_sample_size = event->attr.aux_sample_size;
-
- sampler = perf_event_create_kernel_counter(&attr, event->cpu, task,
- NULL, NULL);
- if (IS_ERR(sampler))
- return PTR_ERR(sampler);
-
- nr_pages = 1ul << __get_order(event->attr.aux_sample_size);
-
- rb = rb_alloc_kernel(sampler, 0, nr_pages);
- if (!rb) {
- perf_event_release_kernel(sampler);
- return -ENOMEM;
+ ctx = find_get_context(pmu, task, event->cpu);
+ if (ctx) {
+ raw_spin_lock_irqsave(&ctx->lock, flags);
+ list_for_each_entry(sampler, &ctx->event_list, event_entry) {
+ /*
+ * event is not an aux event, but all the relevant
+ * bits should match
+ */
+ if (perf_aux_event_match(sampler, event) &&
+ sampler->attr.type == event->attr.aux_sample_type &&
+ sampler->attr.config == event->attr.aux_sample_config &&
+ sampler->attr.exclude_hv == event->attr.exclude_hv &&
+ sampler->attr.exclude_idle == event->attr.exclude_idle &&
+ sampler->attr.exclude_user == event->attr.exclude_user &&
+ sampler->attr.exclude_kernel == event->attr.exclude_kernel &&
+ sampler->attr.aux_sample_size >= event->attr.aux_sample_size &&
+ atomic_long_inc_not_zero(&sampler->refcount))
+ goto got_event;
+ }
+
+ sampler = NULL;
+
+got_event:
+ --ctx->pin_count;
+ put_ctx(ctx);
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+ }
+
+ if (!sampler) {
+ memset(&attr, 0, sizeof(attr));
+ attr.type = pmu->type;
+ attr.config = event->attr.aux_sample_config;
+ attr.sample_type = 0;
+ attr.disabled = event->attr.disabled;
+ attr.enable_on_exec = event->attr.enable_on_exec;
+ attr.exclude_hv = event->attr.exclude_hv;
+ attr.exclude_idle = event->attr.exclude_idle;
+ attr.exclude_user = event->attr.exclude_user;
+ attr.exclude_kernel = event->attr.exclude_kernel;
+ attr.aux_sample_size = event->attr.aux_sample_size;
+
+ sampler = perf_event_create_kernel_counter(&attr, event->cpu,
+ task, NULL, NULL);
+ if (IS_ERR(sampler))
+ return PTR_ERR(sampler);
+
+ nr_pages = 1ul << __get_order(event->attr.aux_sample_size);
+
+ rb = rb_alloc_kernel(sampler, 0, nr_pages);
+ if (!rb) {
+ perf_event_release_kernel(sampler);
+ return -ENOMEM;
+ }
+
+ sampler->destroy = perf_aux_sampler_destroy;
}

event->sampler = sampler;
- sampler->destroy = perf_aux_sampler_destroy;

return 0;
}
@@ -4702,7 +4741,7 @@ static void perf_aux_sampler_fini(struct perf_event *event)
if (!sampler)
return;

- perf_event_release_kernel(sampler);
+ put_event(sampler);

event->sampler = NULL;
}
--
2.1.0

2014-10-13 13:48:43

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 20/20] perf: Allow AUX sampling of inherited events

Try to find an AUX sampler event for the current event if none is linked
via event::sampler.

This is useful when these events (the one that is being sampled and the
one providing sample annotation) are allocated via the inheritance path,
independently of one another, and the latter is not yet referenced by the
former's event::sampler.

Signed-off-by: Alexander Shishkin <[email protected]>
---
kernel/events/core.c | 94 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 62 insertions(+), 32 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ad7b1e92dd..02fcd84b0f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4658,47 +4658,67 @@ static bool perf_aux_event_match(struct perf_event *e1, struct perf_event *e2)
return has_aux(e1) && exclusive_event_match(e1, e2);
}

+struct perf_event *__find_sampling_counter(struct perf_event_context *ctx,
+ struct perf_event *event,
+ struct task_struct *task)
+{
+ struct perf_event *sampler = NULL;
+
+ list_for_each_entry(sampler, &ctx->event_list, event_entry) {
+ /*
+ * event is not an itrace event, but all the relevant
+ * bits should match
+ */
+ if (perf_aux_event_match(sampler, event) &&
+ kernel_rb_event(sampler) &&
+ sampler->attr.type == event->attr.aux_sample_type &&
+ sampler->attr.config == event->attr.aux_sample_config &&
+ sampler->attr.exclude_hv == event->attr.exclude_hv &&
+ sampler->attr.exclude_idle == event->attr.exclude_idle &&
+ sampler->attr.exclude_user == event->attr.exclude_user &&
+ sampler->attr.exclude_kernel == event->attr.exclude_kernel &&
+ sampler->attr.aux_sample_size >= event->attr.aux_sample_size &&
+ atomic_long_inc_not_zero(&sampler->refcount))
+ return sampler;
+ }
+
+ return NULL;
+}
+
+struct perf_event *find_sampling_counter(struct pmu *pmu,
+ struct perf_event *event,
+ struct task_struct *task)
+{
+ struct perf_event_context *ctx;
+ struct perf_event *sampler = NULL;
+ unsigned long flags;
+
+ ctx = find_get_context(pmu, task, event->cpu);
+ if (!ctx)
+ return NULL;
+
+ raw_spin_lock_irqsave(&ctx->lock, flags);
+ sampler = __find_sampling_counter(ctx, event, task);
+ --ctx->pin_count;
+ put_ctx(ctx);
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+ return sampler;
+}
+
static int perf_aux_sampler_init(struct perf_event *event,
struct task_struct *task,
struct pmu *pmu)
{
- struct perf_event_context *ctx;
struct perf_event_attr attr;
- struct perf_event *sampler = NULL;
+ struct perf_event *sampler;
struct ring_buffer *rb;
- unsigned long nr_pages, flags;
+ unsigned long nr_pages;

if (!pmu || !(pmu->setup_aux))
return -ENOTSUPP;

- ctx = find_get_context(pmu, task, event->cpu);
- if (ctx) {
- raw_spin_lock_irqsave(&ctx->lock, flags);
- list_for_each_entry(sampler, &ctx->event_list, event_entry) {
- /*
- * event is not an aux event, but all the relevant
- * bits should match
- */
- if (perf_aux_event_match(sampler, event) &&
- sampler->attr.type == event->attr.aux_sample_type &&
- sampler->attr.config == event->attr.aux_sample_config &&
- sampler->attr.exclude_hv == event->attr.exclude_hv &&
- sampler->attr.exclude_idle == event->attr.exclude_idle &&
- sampler->attr.exclude_user == event->attr.exclude_user &&
- sampler->attr.exclude_kernel == event->attr.exclude_kernel &&
- sampler->attr.aux_sample_size >= event->attr.aux_sample_size &&
- atomic_long_inc_not_zero(&sampler->refcount))
- goto got_event;
- }
-
- sampler = NULL;
-
-got_event:
- --ctx->pin_count;
- put_ctx(ctx);
- raw_spin_unlock_irqrestore(&ctx->lock, flags);
- }
-
+ sampler = find_sampling_counter(pmu, event, task);
if (!sampler) {
memset(&attr, 0, sizeof(attr));
attr.type = pmu->type;
@@ -4749,9 +4769,19 @@ static void perf_aux_sampler_fini(struct perf_event *event)
static unsigned long perf_aux_sampler_trace(struct perf_event *event,
struct perf_sample_data *data)
{
- struct perf_event *sampler = event->sampler;
+ struct perf_event *sampler;
struct ring_buffer *rb;

+ if (!event->sampler) {
+ /*
+ * down this path, event->ctx is already locked IF it's the
+ * same context
+ */
+ event->sampler = __find_sampling_counter(event->ctx, event,
+ event->ctx->task);
+ }
+
+ sampler = event->sampler;
if (!sampler || sampler->state != PERF_EVENT_STATE_ACTIVE) {
data->aux.size = 0;
goto out;
--
2.1.0

2014-10-13 13:48:42

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 18/20] perf: Allocate ring buffers for inherited per-task kernel events

Normally, per-task events can't inherit their parents' ring buffers, to
avoid multiple events contending for the same buffer. And since buffer
allocation is typically done by the userspace consumer, there is no
practical interface to allocate new buffers for inherited counters.

However, for kernel users we can allocate new buffers for inherited
events as soon as they are created (and also reap them on event
destruction). This pattern has a number of use cases, such as event
sample annotation and process core dump annotation.

When a new event is inherited from a per-task kernel event that has a
ring buffer, allocate a new buffer for this event so that data from the
child task is collected and can later be retrieved for sample annotation
or core dump inclusion. This ring buffer is released when the event is
freed, for example, when the child task exits.

Signed-off-by: Alexander Shishkin <[email protected]>
---
kernel/events/core.c | 9 +++++++++
kernel/events/internal.h | 11 +++++++++++
2 files changed, 20 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5da1bc403f..60e354d668 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8267,6 +8267,15 @@ inherit_event(struct perf_event *parent_event,
= parent_event->overflow_handler_context;

/*
+ * For per-task kernel events with ring buffers, set_output doesn't
+ * make sense, but we can allocate a new buffer here.
+ */
+ if (parent_event->cpu == -1 && kernel_rb_event(parent_event)) {
+ (void)rb_alloc_kernel(child_event, parent_event->rb->nr_pages,
+ parent_event->rb->aux_nr_pages);
+ }
+
+ /*
* Precalculate sample_data sizes
*/
perf_event__header_size(child_event);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 81cb7afec4..373ac012f5 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -122,6 +122,17 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
return rb->aux_nr_pages << PAGE_SHIFT;
}

+static inline bool kernel_rb_event(struct perf_event *event)
+{
+ /*
+ * Having a ring buffer and not being on any ring buffers' wakeup
+ * list means it was attached by rb_alloc_kernel() and not
+ * ring_buffer_attach(). It's the only case when these two
+ * conditions take place at the same time.
+ */
+ return event->rb && list_empty(&event->rb_entry);
+}
+
#define DEFINE_OUTPUT_COPY(func_name, memcpy_func) \
static inline unsigned long \
func_name(struct perf_output_handle *handle, \
--
2.1.0

2014-10-13 13:49:55

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 14/20] perf: add ITRACE_START record to indicate that tracing has started

For counters that generate AUX data that is bound to the context of a
running task, such as instruction tracing, the decoder needs to know
exactly which task is running when the event is first scheduled in,
before the first sched_switch. The decoder's need to know this stems
from the fact that instruction flow trace decoding will almost always
require the program's object code in order to reconstruct said flow, and
for that we need at least its pid/tid in the perf stream.

To single out such instruction tracing pmus, this patch introduces the
ITRACE PMU capability. The reason this is not part of the RECORD_AUX
record is that not all pmus capable of generating AUX data need this,
and the opposite is *probably* also true.

While sched_switch covers most cases, there are two problems with it:
the consumer will need to process events out of order (that is, having
found RECORD_AUX, it will have to skip forward to the nearest sched_switch
to figure out which task it was, then go back to the actual trace to
decode it) and it completely misses the case when the tracing is enabled
and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.
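
As an illustration (not part of this patch), a hypothetical userspace consumer
could pick the pid/tid straight out of this record instead of hunting for the
nearest sched_switch. The struct below mirrors the record layout added to the
uapi header further down; everything else is made up:

#include <linux/perf_event.h>	/* with the uapi additions from this patch */
#include <stdint.h>
#include <stdio.h>

/* layout of PERF_RECORD_ITRACE_START, not counting the sample_id tail */
struct itrace_start_event {
	struct perf_event_header header;
	uint32_t pid;
	uint32_t tid;
};

/*
 * Called for every record pulled out of the perf data stream: the traced
 * task is known as soon as the trace starts, with no need to look at the
 * surrounding sched_switch records.
 */
static void handle_record(const struct perf_event_header *hdr)
{
	if (hdr->type == PERF_RECORD_ITRACE_START) {
		const struct itrace_start_event *ev = (const void *)hdr;

		printf("itrace started in task %u/%u\n", ev->pid, ev->tid);
	}
}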

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 4 ++++
include/uapi/linux/perf_event.h | 11 +++++++++++
kernel/events/core.c | 41 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 56 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f126eb89e6..282721b2df 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -127,6 +127,9 @@ struct hw_perf_event {
/* for tp_event->class */
struct list_head tp_list;
};
+ struct { /* itrace */
+ int itrace_started;
+ };
#ifdef CONFIG_HAVE_HW_BREAKPOINT
struct { /* breakpoint */
/*
@@ -174,6 +177,7 @@ struct perf_event;
#define PERF_PMU_CAP_AUX_NO_SG 0x02
#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 0x04
#define PERF_PMU_CAP_EXCLUSIVE 0x08
+#define PERF_PMU_CAP_ITRACE 0x10

/**
* struct pmu - generic performance monitoring unit
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c20aaa5af7..a0cafbdc1c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -750,6 +750,17 @@ enum perf_event_type {
*/
PERF_RECORD_AUX = 11,

+ /*
+ * Indicates that instruction trace has started
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u32 pid;
+ * u32 tid;
+ * };
+ */
+ PERF_RECORD_ITRACE_START = 12,
+
PERF_RECORD_MAX, /* non-ABI */
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 908dc3e63b..d4057dff27 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1723,6 +1723,7 @@ static void perf_set_shadow_time(struct perf_event *event,
#define MAX_INTERRUPTS (~0ULL)

static void perf_log_throttle(struct perf_event *event, int enable);
+static void perf_log_itrace_start(struct perf_event *event);

static int
event_sched_in(struct perf_event *event,
@@ -1757,6 +1758,8 @@ event_sched_in(struct perf_event *event,

perf_pmu_disable(event->pmu);

+ perf_log_itrace_start(event);
+
if (event->pmu->add(event, PERF_EF_START)) {
event->state = PERF_EVENT_STATE_INACTIVE;
event->oncpu = -1;
@@ -5656,6 +5659,44 @@ static void perf_log_throttle(struct perf_event *event, int enable)
perf_output_end(&handle);
}

+static void perf_log_itrace_start(struct perf_event *event)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ struct perf_aux_event {
+ struct perf_event_header header;
+ u32 pid;
+ u32 tid;
+ } rec;
+ int ret;
+
+ if (event->parent)
+ event = event->parent;
+
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_ITRACE) ||
+ event->hw.itrace_started)
+ return;
+
+ event->hw.itrace_started = 1;
+
+ rec.header.type = PERF_RECORD_ITRACE_START;
+ rec.header.misc = 0;
+ rec.header.size = sizeof(rec);
+ rec.pid = perf_event_pid(event, current);
+ rec.tid = perf_event_tid(event, current);
+
+ perf_event_header__init_id(&rec.header, &sample, event);
+ ret = perf_output_begin(&handle, event, rec.header.size);
+
+ if (ret)
+ return;
+
+ perf_output_put(&handle, rec);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
/*
* Generic event overflow handling, sampling.
*/
--
2.1.0

2014-10-13 13:48:09

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 06/20] perf: Add AUX record

When there's new data in the AUX space, output a record indicating its
offset and size, plus a set of flags such as PERF_AUX_FLAG_TRUNCATED, which
means the described data was truncated to fit in the ring buffer.
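
To illustrate how the record is meant to be consumed, a simplified userspace
reader might handle PERF_RECORD_AUX roughly as follows (illustrative sketch
only: the struct mirrors the layout documented below, while the helper name
and the flat destination buffer are made up, and error handling is omitted):

#include <linux/perf_event.h>	/* with the uapi additions from this patch */
#include <stdint.h>
#include <string.h>

/* layout of PERF_RECORD_AUX, not counting the sample_id tail */
struct aux_record {
	struct perf_event_header header;
	uint64_t aux_offset;
	uint64_t aux_size;
	uint64_t flags;
};

/*
 * Copy the chunk described by a PERF_RECORD_AUX out of the mmapped AUX
 * area (aux_area/aux_size describe that mapping; aux_size is a power of
 * two).  Returns nonzero if the chunk was marked as truncated.
 */
static int save_aux_chunk(const struct aux_record *rec,
			  const uint8_t *aux_area, uint64_t aux_size,
			  uint8_t *dest)
{
	uint64_t off = rec->aux_offset & (aux_size - 1);
	uint64_t len = rec->aux_size;

	if (off + len <= aux_size) {
		memcpy(dest, aux_area + off, len);
	} else {
		/* the chunk wraps around the end of the AUX area */
		memcpy(dest, aux_area + off, aux_size - off);
		memcpy(dest + (aux_size - off), aux_area,
		       len - (aux_size - off));
	}

	/* a real consumer would now advance aux_tail in the user page */

	return !!(rec->flags & PERF_AUX_FLAG_TRUNCATED);
}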

Signed-off-by: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
---
include/uapi/linux/perf_event.h | 19 +++++++++++++++++++
kernel/events/core.c | 34 ++++++++++++++++++++++++++++++++++
kernel/events/internal.h | 3 +++
3 files changed, 56 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7e0967c0f5..1a46627699 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -733,6 +733,20 @@ enum perf_event_type {
*/
PERF_RECORD_MMAP2 = 10,

+ /*
+ * Records that new data landed in the AUX buffer part.
+ *
+ * struct {
+ * struct perf_event_header header;
+ *
+ * u64 aux_offset;
+ * u64 aux_size;
+ * u64 flags;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_AUX = 11,
+
PERF_RECORD_MAX, /* non-ABI */
};

@@ -750,6 +764,11 @@ enum perf_callchain_context {
PERF_CONTEXT_MAX = (__u64)-4095,
};

+/**
+ * PERF_RECORD_AUX::flags bits
+ */
+#define PERF_AUX_FLAG_TRUNCATED 0x01 /* record was truncated to fit */
+
#define PERF_FLAG_FD_NO_GROUP (1UL << 0)
#define PERF_FLAG_FD_OUTPUT (1UL << 1)
#define PERF_FLAG_PID_CGROUP (1UL << 2) /* pid=cgroup id, per-cpu mode only */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ced9b0819f..46a6217bbe 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5581,6 +5581,40 @@ void perf_event_mmap(struct vm_area_struct *vma)
perf_event_mmap_event(&mmap_event);
}

+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+ unsigned long size, u64 flags)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ struct perf_aux_event {
+ struct perf_event_header header;
+ u64 offset;
+ u64 size;
+ u64 flags;
+ } rec = {
+ .header = {
+ .type = PERF_RECORD_AUX,
+ .misc = 0,
+ .size = sizeof(rec),
+ },
+ .offset = head,
+ .size = size,
+ .flags = flags,
+ };
+ int ret;
+
+ perf_event_header__init_id(&rec.header, &sample, event);
+ ret = perf_output_begin(&handle, event, rec.header.size);
+
+ if (ret)
+ return;
+
+ perf_output_put(&handle, rec);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
/*
* IRQ throttle logging
*/
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 1e476d2f29..e0013084c5 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -62,6 +62,9 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
return !!rb->aux_nr_pages;
}

+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+ unsigned long size, u64 flags);
+
extern void
perf_event_header__init_id(struct perf_event_header *header,
struct perf_sample_data *data,
--
2.1.0

2014-10-13 13:47:45

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 04/20] perf: Add a capability for AUX_NO_SG pmus to do software double buffering

For pmus that don't support scatter-gather for AUX data in hardware, it
might still make sense to implement software double buffering to avoid
losing data while the user is reading data out. For this purpose, add
a pmu capability that guarantees multiple high-order chunks for the AUX
buffer, so that the pmu driver can do switchover tricks.

To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your
pmu's capability mask. This will make the ring buffer AUX allocation code
ensure that the biggest high-order allocation for the aux buffer pages is
no bigger than half of the total requested buffer size, thus making sure
that the buffer has at least two high-order allocations. For example, a
64-page AUX buffer that would otherwise be a single order-6 allocation is
capped at order 5, yielding two 32-page chunks (or more, smaller ones, if
memory is fragmented).

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/ring_buffer.c | 15 ++++++++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 71df948b4d..e7a15f5c3f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -172,6 +172,7 @@ struct perf_event;
*/
#define PERF_PMU_CAP_NO_INTERRUPT 0x01
#define PERF_PMU_CAP_AUX_NO_SG 0x02
+#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 0x04

/**
* struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index b620e94ea7..b45feb3f4a 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -286,9 +286,22 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
if (!has_aux(event))
return -ENOTSUPP;

- if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
order = get_order(nr_pages * PAGE_SIZE);

+ /*
+ * PMU requests more than one contiguous chunk of memory
+ * for SW double buffering
+ */
+ if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) &&
+ !overwrite) {
+ if (!order)
+ return -EINVAL;
+
+ order--;
+ }
+ }
+
rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
if (!rb->aux_pages)
return -ENOMEM;
--
2.1.0

2014-10-13 13:52:12

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 03/20] perf: Support high-order allocations for AUX space

Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
don't support scatter-gather and will prefer larger contiguous areas for
their output regions.

This patch adds a new pmu capability to request higher order allocations.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/ring_buffer.c | 51 +++++++++++++++++++++++++++++++++++++++------
2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 344058c71d..71df948b4d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -171,6 +171,7 @@ struct perf_event;
* pmu::capabilities flags
*/
#define PERF_PMU_CAP_NO_INTERRUPT 0x01
+#define PERF_PMU_CAP_AUX_NO_SG 0x02

/**
* struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 2ebb6c5d3b..b620e94ea7 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,30 +242,69 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+#define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+static struct page *rb_alloc_aux_page(int node, int order)
+{
+ struct page *page;
+
+ if (order > MAX_ORDER)
+ order = MAX_ORDER;
+
+ do {
+ page = alloc_pages_node(node, PERF_AUX_GFP, order);
+ } while (!page && order--);
+
+ if (page && order) {
+ /*
+ * Communicate the allocation size to the driver
+ */
+ split_page(page, order);
+ SetPagePrivate(page);
+ set_page_private(page, order);
+ }
+
+ return page;
+}
+
+static void rb_free_aux_page(struct ring_buffer *rb, int idx)
+{
+ struct page *page = virt_to_page(rb->aux_pages[idx]);
+
+ ClearPagePrivate(page);
+ page->mapping = NULL;
+ __free_page(page);
+}
+
int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
pgoff_t pgoff, int nr_pages, int flags)
{
bool overwrite = !(flags & RING_BUFFER_WRITABLE);
int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
- int ret = -ENOMEM;
+ int ret = -ENOMEM, order = 0;

if (!has_aux(event))
return -ENOTSUPP;

+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+ order = get_order(nr_pages * PAGE_SIZE);
+
rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
if (!rb->aux_pages)
return -ENOMEM;

rb->free_aux = event->pmu->free_aux;
- for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
- rb->aux_nr_pages++) {
+ for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
struct page *page;
+ int last;

- page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ page = rb_alloc_aux_page(node, order);
if (!page)
goto out;

- rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+ for (last = rb->aux_nr_pages + (1 << page_private(page));
+ last > rb->aux_nr_pages; rb->aux_nr_pages++)
+ rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
}

rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
@@ -304,7 +343,7 @@ static void __rb_free_aux(struct kref *kref)
}

for (pg = 0; pg < rb->aux_nr_pages; pg++)
- free_page((unsigned long)rb->aux_pages[pg]);
+ rb_free_aux_page(rb, pg);

kfree(rb->aux_pages);
rb->aux_nr_pages = 0;
--
2.1.0

2014-10-13 13:52:37

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Add support for Intel Processor Trace (PT) to the kernel's perf events.
PT is an extension of Intel Architecture that collects information about
software execution, such as control flow, execution modes and timings, and
formats it into highly compressed binary packets. Even compressed,
these packets are generated at hundreds of megabytes per second per core,
which makes it impractical to decode them on the fly in the kernel.

This driver exports trace data through the AUX space in the perf ring
buffer, which is zero-copy mapped into userspace for faster data retrieval.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/include/uapi/asm/msr-index.h | 18 +
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/intel_pt.h | 129 ++++
arch/x86/kernel/cpu/perf_event.h | 2 +
arch/x86/kernel/cpu/perf_event_intel.c | 8 +
arch/x86/kernel/cpu/perf_event_intel_pt.c | 995 ++++++++++++++++++++++++++++++
6 files changed, 1153 insertions(+)
create mode 100644 arch/x86/kernel/cpu/intel_pt.h
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index e21331ce36..caae296ab7 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -74,6 +74,24 @@
#define MSR_IA32_PERF_CAPABILITIES 0x00000345
#define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6

+#define MSR_IA32_RTIT_CTL 0x00000570
+#define RTIT_CTL_TRACEEN BIT(0)
+#define RTIT_CTL_OS BIT(2)
+#define RTIT_CTL_USR BIT(3)
+#define RTIT_CTL_CR3EN BIT(7)
+#define RTIT_CTL_TOPA BIT(8)
+#define RTIT_CTL_TSC_EN BIT(10)
+#define RTIT_CTL_DISRETC BIT(11)
+#define RTIT_CTL_BRANCH_EN BIT(13)
+#define MSR_IA32_RTIT_STATUS 0x00000571
+#define RTIT_STATUS_CONTEXTEN BIT(1)
+#define RTIT_STATUS_TRIGGEREN BIT(2)
+#define RTIT_STATUS_ERROR BIT(4)
+#define RTIT_STATUS_STOPPED BIT(5)
+#define MSR_IA32_RTIT_CR3_MATCH 0x00000572
+#define MSR_IA32_RTIT_OUTPUT_BASE 0x00000560
+#define MSR_IA32_RTIT_OUTPUT_MASK 0x00000561
+
#define MSR_MTRRfix64K_00000 0x00000250
#define MSR_MTRRfix16K_80000 0x00000258
#define MSR_MTRRfix16K_A0000 0x00000259
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 7e1fd4e085..00d40f889d 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o per
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o perf_event_intel_uncore_snb.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore_snbep.o perf_event_intel_uncore_nhmex.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o
endif


diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
new file mode 100644
index 0000000000..9fa9c165cc
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -0,0 +1,129 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Single-entry ToPA: when this close to region boundary, switch
+ * buffers to avoid losing data.
+ */
+#define TOPA_PMI_MARGIN 512
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+ TOPA_4K = 0,
+ TOPA_8K,
+ TOPA_16K,
+ TOPA_32K,
+ TOPA_64K,
+ TOPA_128K,
+ TOPA_256K,
+ TOPA_512K,
+ TOPA_1MB,
+ TOPA_2MB,
+ TOPA_4MB,
+ TOPA_8MB,
+ TOPA_16MB,
+ TOPA_32MB,
+ TOPA_64MB,
+ TOPA_128MB,
+ TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+ return 1 << (tsz + 12);
+};
+
+struct topa_entry {
+ u64 end : 1;
+ u64 rsvd0 : 1;
+ u64 intr : 1;
+ u64 rsvd1 : 1;
+ u64 stop : 1;
+ u64 rsvd2 : 1;
+ u64 size : 4;
+ u64 rsvd3 : 2;
+ u64 base : 36;
+ u64 rsvd4 : 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+enum pt_capabilities {
+ PT_CAP_max_subleaf = 0,
+ PT_CAP_cr3_filtering,
+ PT_CAP_topa_output,
+ PT_CAP_topa_multiple_entries,
+ PT_CAP_payloads_lip,
+};
+
+struct pt_pmu {
+ struct pmu pmu;
+ u32 caps[4 * PT_CPUID_LEAVES];
+};
+
+/**
+ * struct pt_buffer - buffer configuration; one buffer per task_struct or
+ * cpu, depending on perf event configuration
+ * @tables: list of ToPA tables in this buffer
+ * @first, @last: shorthands for first and last topa tables
+ * @cur: current topa table
+ * @nr_pages: buffer size in pages
+ * @cur_idx: current output region's index within @cur table
+ * @output_off: offset within the current output region
+ * @data_size: running total of the amount of data in this buffer
+ * @lost: if data was lost/truncated
+ * @head: logical write offset inside the buffer
+ * @snapshot: if this is for a snapshot/overwrite counter
+ * @stop_pos, @intr_pos: STOP and INT topa entries in the buffer
+ * @data_pages: array of pages from perf
+ * @topa_index: table of topa entries indexed by page offset
+ */
+struct pt_buffer {
+ /* hint for allocation */
+ int cpu;
+ /* list of ToPA tables */
+ struct list_head tables;
+ /* top-level table */
+ struct topa *first, *last, *cur;
+ unsigned int cur_idx;
+ size_t output_off;
+ unsigned long nr_pages;
+ local_t data_size;
+ local_t lost;
+ local64_t head;
+ bool snapshot;
+ unsigned long stop_pos, intr_pos;
+ void **data_pages;
+ struct topa_entry *topa_index[0];
+};
+
+/**
+ * struct pt - per-cpu pt
+ */
+struct pt {
+ struct perf_output_handle handle;
+ int handle_nmi;
+};
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 8eee92be78..9f9b02fea6 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -750,6 +750,8 @@ void intel_pmu_lbr_init_snb(void);

int intel_pmu_setup_lbr_filter(struct perf_event *event);

+void intel_pt_interrupt(void);
+
int p4_pmu_init(void);

int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 3851def505..6e04da1b04 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1528,6 +1528,14 @@ again:
}

/*
+ * Intel PT
+ */
+ if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+ handled++;
+ intel_pt_interrupt();
+ }
+
+ /*
* Checkpointed counters can lead to 'spurious' PMIs because the
* rollback caused by the PMI will have cleared the overflow status
* bit. Therefore always force probe these counters.
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
new file mode 100644
index 0000000000..bc8bf1f24d
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -0,0 +1,995 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/device.h>
+
+#include <asm/perf_event.h>
+#include <asm/insn.h>
+
+#include "perf_event.h"
+#include "intel_pt.h"
+
+static DEFINE_PER_CPU(struct pt, pt_ctx);
+
+static struct pt_pmu pt_pmu;
+
+enum cpuid_regs {
+ CR_EAX = 0,
+ CR_ECX,
+ CR_EDX,
+ CR_EBX
+};
+
+/*
+ * Capabilities of Intel PT hardware, such as number of address bits or
+ * supported output schemes, are cached and exported to userspace as "caps"
+ * attribute group of pt pmu device
+ * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
+ * relevant bits together with intel_pt traces.
+ */
+#define PT_CAP(_n, _l, _r, _m) \
+ [PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
+ .reg = _r, .mask = _m }
+
+static struct pt_cap_desc {
+ const char *name;
+ u32 leaf;
+ u8 reg;
+ u32 mask;
+} pt_caps[] = {
+ PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
+ PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
+ PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
+ PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
+ PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
+};
+
+static u32 pt_cap_get(enum pt_capabilities cap)
+{
+ struct pt_cap_desc *cd = &pt_caps[cap];
+ u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+ unsigned int shift = __ffs(cd->mask);
+
+ return (c & cd->mask) >> shift;
+}
+
+static ssize_t pt_cap_show(struct device *cdev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dev_ext_attribute *ea =
+ container_of(attr, struct dev_ext_attribute, attr);
+ enum pt_capabilities cap = (long)ea->var;
+
+ return snprintf(buf, PAGE_SIZE, "%x\n", pt_cap_get(cap));
+}
+
+static struct attribute_group pt_cap_group = {
+ .name = "caps",
+};
+
+PMU_FORMAT_ATTR(tsc, "config:10" );
+PMU_FORMAT_ATTR(noretcomp, "config:11" );
+
+static struct attribute *pt_formats_attr[] = {
+ &format_attr_tsc.attr,
+ &format_attr_noretcomp.attr,
+ NULL,
+};
+
+static struct attribute_group pt_format_group = {
+ .name = "format",
+ .attrs = pt_formats_attr,
+};
+
+static const struct attribute_group *pt_attr_groups[] = {
+ &pt_cap_group,
+ &pt_format_group,
+ NULL,
+};
+
+static int __init pt_pmu_hw_init(void)
+{
+ struct dev_ext_attribute *de_attrs;
+ struct attribute **attrs;
+ size_t size;
+ long i;
+
+ if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
+ for (i = 0; i < PT_CPUID_LEAVES; i++)
+ cpuid_count(20, i,
+ &pt_pmu.caps[CR_EAX + i * 4],
+ &pt_pmu.caps[CR_EBX + i * 4],
+ &pt_pmu.caps[CR_ECX + i * 4],
+ &pt_pmu.caps[CR_EDX + i * 4]);
+ } else
+ return -ENODEV;
+
+ size = sizeof(struct attribute *) * (ARRAY_SIZE(pt_caps) + 1);
+ attrs = kzalloc(size, GFP_KERNEL);
+ if (!attrs)
+ goto err_attrs;
+
+ size = sizeof(struct dev_ext_attribute) * (ARRAY_SIZE(pt_caps) + 1);
+ de_attrs = kzalloc(size, GFP_KERNEL);
+ if (!de_attrs)
+ goto err_de_attrs;
+
+ for (i = 0; i < ARRAY_SIZE(pt_caps); i++) {
+ de_attrs[i].attr.attr.name = pt_caps[i].name;
+
+ sysfs_attr_init(&de_attrs[i].attr.attr);
+ de_attrs[i].attr.attr.mode = S_IRUGO;
+ de_attrs[i].attr.show = pt_cap_show;
+ de_attrs[i].var = (void *)i;
+ attrs[i] = &de_attrs[i].attr.attr;
+ }
+
+ pt_cap_group.attrs = attrs;
+ return 0;
+
+err_de_attrs:
+ kfree(de_attrs);
+err_attrs:
+ kfree(attrs);
+
+ return -ENOMEM;
+}
+
+#define PT_CONFIG_MASK (RTIT_CTL_TSC_EN | RTIT_CTL_DISRETC)
+/* bits 57:14 are reserved for packet enables */
+#define PT_BYPASS_MASK 0x03ffffffffffc000ull
+
+static bool pt_event_valid(struct perf_event *event)
+{
+ u64 config = event->attr.config;
+
+ /* admin can set any packet generation parameters */
+ if (capable(CAP_SYS_ADMIN) && (config & PT_BYPASS_MASK) == config)
+ return true;
+
+ if ((config & PT_CONFIG_MASK) != config)
+ return false;
+
+ return true;
+}
+
+/*
+ * PT configuration helpers
+ * These all are cpu affine and operate on a local PT
+ */
+
+static bool pt_is_running(void)
+{
+ u64 ctl;
+
+ rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+ return !!(ctl & RTIT_CTL_TRACEEN);
+}
+
+static int pt_config(struct perf_event *event)
+{
+ u64 reg;
+
+ reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN;
+
+ if (!event->attr.exclude_kernel)
+ reg |= RTIT_CTL_OS;
+ if (!event->attr.exclude_user)
+ reg |= RTIT_CTL_USR;
+
+ reg |= (event->attr.config & PT_CONFIG_MASK);
+
+ /*
+ * User can try to set bits in RTIT_CTL through PT_BYPASS_MASK,
+ * that aren't supported by the hardware. Weather or not a
+ * particular bitmask is supported by a cpu can't be determined
+ * via cpuid or otherwise, so we have to rely on #GP handling
+ * to catch these cases.
+ */
+ return wrmsrl_safe(MSR_IA32_RTIT_CTL, reg);
+}
+
+static void pt_config_start(bool start)
+{
+ u64 ctl;
+
+ rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+ if (start)
+ ctl |= RTIT_CTL_TRACEEN;
+ else
+ ctl &= ~RTIT_CTL_TRACEEN;
+ wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+ /*
+ * A wrmsr that disables trace generation serializes other PT
+ * registers and causes all data packets to be written to memory,
+ * but a fence is required for the data to become globally visible.
+ *
+ * The below WMB, separating data store and aux_head store matches
+ * the consumer's RMB that separates aux_head load and data load.
+ */
+ if (!start)
+ wmb();
+}
+
+static void pt_config_buffer(void *buf, unsigned int topa_idx,
+ unsigned int output_off)
+{
+ u64 reg;
+
+ wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(buf));
+
+ reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+ wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+/*
+ * Keep ToPA table-related metadata on the same page as the actual table,
+ * taking up a few words from the top
+ */
+
+#define TENTS_PER_PAGE (((PAGE_SIZE - 40) / sizeof(struct topa_entry)) - 1)
+
+struct topa {
+ struct topa_entry table[TENTS_PER_PAGE];
+ struct list_head list;
+ u64 phys;
+ u64 offset;
+ size_t size;
+ int last;
+};
+
+/* make negative table index stand for the last table entry */
+#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])
+
+/*
+ * allocate page-sized ToPA table
+ */
+static struct topa *topa_alloc(int cpu, gfp_t gfp)
+{
+ int node = cpu_to_node(cpu);
+ struct topa *topa;
+ struct page *p;
+
+ p = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
+ if (!p)
+ return NULL;
+
+ topa = page_address(p);
+ topa->last = 0;
+ topa->phys = page_to_phys(p);
+
+ /*
+ * In case of single-entry ToPA, always put the self-referencing END
+ * link as the 2nd entry in the table
+ */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ TOPA_ENTRY(topa, 1)->base = topa->phys >> TOPA_SHIFT;
+ TOPA_ENTRY(topa, 1)->end = 1;
+ }
+
+ return topa;
+}
+
+static void topa_free(struct topa *topa)
+{
+ free_page((unsigned long)topa);
+}
+
+/**
+ * topa_insert_table() - insert a ToPA table into a buffer
+ * @buf: pt buffer that's being extended
+ * @topa: new topa table to be inserted
+ *
+ * If it's the first table in this buffer, set up the buffer's pointers
+ * accordingly; otherwise, add an END=1 link entry pointing to @topa in the
+ * current "last" table and adjust the last table pointer to @topa.
+ */
+static void topa_insert_table(struct pt_buffer *buf, struct topa *topa)
+{
+ struct topa *last = buf->last;
+
+ list_add_tail(&topa->list, &buf->tables);
+
+ if (!buf->first) {
+ buf->first = buf->last = buf->cur = topa;
+ return;
+ }
+
+ topa->offset = last->offset + last->size;
+ buf->last = topa;
+
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+ return;
+
+ BUG_ON(last->last != TENTS_PER_PAGE - 1);
+
+ TOPA_ENTRY(last, -1)->base = topa->phys >> TOPA_SHIFT;
+ TOPA_ENTRY(last, -1)->end = 1;
+}
+
+static bool topa_table_full(struct topa *topa)
+{
+ /* single-entry ToPA is a special case */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+ return !!topa->last;
+
+ return topa->last == TENTS_PER_PAGE - 1;
+}
+
+static int topa_insert_pages(struct pt_buffer *buf, gfp_t gfp)
+{
+ struct topa *topa = buf->last;
+ int order = 0;
+ struct page *p;
+
+ p = virt_to_page(buf->data_pages[buf->nr_pages]);
+ if (PagePrivate(p))
+ order = page_private(p);
+
+ if (topa_table_full(topa)) {
+ topa = topa_alloc(buf->cpu, gfp);
+ if (!topa)
+ return -ENOMEM;
+
+ topa_insert_table(buf, topa);
+ }
+
+ TOPA_ENTRY(topa, -1)->base = page_to_phys(p) >> TOPA_SHIFT;
+ TOPA_ENTRY(topa, -1)->size = order;
+ if (!buf->snapshot && !pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ TOPA_ENTRY(topa, -1)->intr = 1;
+ TOPA_ENTRY(topa, -1)->stop = 1;
+ }
+
+ topa->last++;
+ topa->size += sizes(order);
+
+ buf->nr_pages += 1ul << order;
+
+ return 0;
+}
+
+static void pt_topa_dump(struct pt_buffer *buf)
+{
+ struct topa *topa;
+
+ list_for_each_entry(topa, &buf->tables, list) {
+ int i;
+
+ pr_debug("# table @%p (%p), off %llx size %zx\n", topa->table,
+ (void *)topa->phys, topa->offset, topa->size);
+ for (i = 0; i < TENTS_PER_PAGE; i++) {
+ pr_debug("# entry @%p (%lx sz %u %c%c%c) raw=%16llx\n",
+ &topa->table[i],
+ (unsigned long)topa->table[i].base << TOPA_SHIFT,
+ sizes(topa->table[i].size),
+ topa->table[i].end ? 'E' : ' ',
+ topa->table[i].intr ? 'I' : ' ',
+ topa->table[i].stop ? 'S' : ' ',
+ *(u64 *)&topa->table[i]);
+ if ((pt_cap_get(PT_CAP_topa_multiple_entries)
+ && topa->table[i].stop)
+ || topa->table[i].end)
+ break;
+ }
+ }
+}
+
+/* advance to the next output region */
+static void pt_buffer_advance(struct pt_buffer *buf)
+{
+ buf->output_off = 0;
+ buf->cur_idx++;
+
+ if (buf->cur_idx == buf->cur->last) {
+ if (buf->cur == buf->last)
+ buf->cur = buf->first;
+ else
+ buf->cur = list_entry(buf->cur->list.next, struct topa,
+ list);
+ buf->cur_idx = 0;
+ }
+}
+
+static void pt_update_head(struct pt *pt)
+{
+ struct pt_buffer *buf = perf_get_aux(&pt->handle);
+ u64 topa_idx, base, old;
+
+ /* offset of the first region in this table from the beginning of buf */
+ base = buf->cur->offset + buf->output_off;
+
+ /* offset of the current output region within this table */
+ for (topa_idx = 0; topa_idx < buf->cur_idx; topa_idx++)
+ base += sizes(buf->cur->table[topa_idx].size);
+
+ if (buf->snapshot) {
+ local_set(&buf->data_size, base);
+ } else {
+ old = (local64_xchg(&buf->head, base) &
+ ((buf->nr_pages << PAGE_SHIFT) - 1));
+ if (base < old)
+ base += buf->nr_pages << PAGE_SHIFT;
+
+ local_add(base - old, &buf->data_size);
+ }
+}
+
+static void *pt_buffer_region(struct pt_buffer *buf)
+{
+ return phys_to_virt(buf->cur->table[buf->cur_idx].base << TOPA_SHIFT);
+}
+
+static size_t pt_buffer_region_size(struct pt_buffer *buf)
+{
+ return sizes(buf->cur->table[buf->cur_idx].size);
+}
+
+/**
+ * pt_handle_status() - take care of possible status conditions
+ * @pt: per-cpu pt handle
+ */
+static void pt_handle_status(struct pt *pt)
+{
+ struct pt_buffer *buf = perf_get_aux(&pt->handle);
+ int advance = 0;
+ u64 status;
+
+ rdmsrl(MSR_IA32_RTIT_STATUS, status);
+
+ if (status & RTIT_STATUS_ERROR) {
+ pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");
+ pt_topa_dump(buf);
+ status &= ~RTIT_STATUS_ERROR;
+ wrmsrl(MSR_IA32_RTIT_STATUS, status);
+ }
+
+ if (status & RTIT_STATUS_STOPPED) {
+ status &= ~RTIT_STATUS_STOPPED;
+ wrmsrl(MSR_IA32_RTIT_STATUS, status);
+
+ /*
+ * On systems that only do single-entry ToPA, hitting STOP
+ * means we are already losing data; need to let the decoder
+ * know.
+ */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
+ buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
+ local_inc(&buf->lost);
+ advance++;
+ }
+ }
+
+ /*
+ * Also on single-entry ToPA implementations, interrupt will come
+ * before the output reaches its output region's boundary.
+ */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
+ pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {
+ void *head = pt_buffer_region(buf);
+
+ /* everything within this margin needs to be zeroed out */
+ memset(head + buf->output_off, 0,
+ pt_buffer_region_size(buf) -
+ buf->output_off);
+ advance++;
+ }
+
+ if (advance)
+ pt_buffer_advance(buf);
+}
+
+static void pt_read_offset(struct pt_buffer *buf)
+{
+ u64 offset, base_topa;
+
+ rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, base_topa);
+ buf->cur = phys_to_virt(base_topa);
+
+ rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, offset);
+ /* offset within current output region */
+ buf->output_off = offset >> 32;
+ /* index of current output region within this table */
+ buf->cur_idx = (offset & 0xffffff80) >> 7;
+}
+
+/**
+ * pt_buffer_fini_topa() - deallocate ToPA structure of a buffer
+ * @buf: pt buffer
+ */
+static void pt_buffer_fini_topa(struct pt_buffer *buf)
+{
+ struct topa *topa, *iter;
+
+ list_for_each_entry_safe(topa, iter, &buf->tables, list) {
+ /*
+ * right now, this is in free_aux() path only, so
+ * no need to unlink this table from the list
+ */
+ topa_free(topa);
+ }
+}
+
+static unsigned int pt_topa_next_entry(struct pt_buffer *buf, unsigned int pg)
+{
+ struct topa_entry *te = buf->topa_index[pg];
+
+ if (buf->first == buf->last && buf->first->last == 1)
+ return pg;
+
+ do {
+ pg++;
+ pg &= buf->nr_pages - 1;
+ } while (buf->topa_index[pg] == te);
+
+ return pg;
+}
+
+static int pt_buffer_reset_markers(struct pt_buffer *buf,
+ struct perf_output_handle *handle)
+{
+ unsigned long idx, npages, end;
+
+ if (buf->snapshot)
+ return 0;
+
+ /* can't stop in the middle of an output region */
+ if (buf->output_off + handle->size + 1 <
+ sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size))
+ return -EINVAL;
+
+ /* single entry ToPA is handled by marking all regions STOP=1 INT=1 */
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+ return 0;
+
+ /* clear STOP and INT from current entry */
+ buf->topa_index[buf->stop_pos]->stop = 0;
+ buf->topa_index[buf->intr_pos]->intr = 0;
+
+ if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ npages = (handle->size + 1) >> PAGE_SHIFT;
+ end = (local64_read(&buf->head) >> PAGE_SHIFT) + npages;
+ if (end > handle->wakeup >> PAGE_SHIFT)
+ end = handle->wakeup >> PAGE_SHIFT;
+ idx = end & (buf->nr_pages - 1);
+ buf->stop_pos = idx;
+ idx = (local64_read(&buf->head) >> PAGE_SHIFT) + npages / 2;
+ idx &= buf->nr_pages - 1;
+ buf->intr_pos = idx;
+ }
+
+ buf->topa_index[buf->stop_pos]->stop = 1;
+ buf->topa_index[buf->intr_pos]->intr = 1;
+
+ return 0;
+}
+
+static void pt_buffer_setup_topa_index(struct pt_buffer *buf)
+{
+ struct topa *cur = buf->first, *prev = buf->last;
+ struct topa_entry *te_cur = TOPA_ENTRY(cur, 0),
+ *te_prev = TOPA_ENTRY(prev, prev->last - 1);
+ int pg = 0, idx = 0, ntopa = 0;
+
+ while (pg < buf->nr_pages) {
+ int tidx;
+
+ /* pages within one topa entry */
+ for (tidx = 0; tidx < 1 << te_cur->size; tidx++, pg++)
+ buf->topa_index[pg] = te_prev;
+
+ te_prev = te_cur;
+
+ if (idx == cur->last - 1) {
+ /* advance to next topa table */
+ idx = 0;
+ cur = list_entry(cur->list.next, struct topa, list);
+ ntopa++;
+ } else
+ idx++;
+ te_cur = TOPA_ENTRY(cur, idx);
+ }
+}
+
+static void pt_buffer_reset_offsets(struct pt_buffer *buf, unsigned long head)
+{
+ int pg;
+
+ if (buf->snapshot)
+ head &= (buf->nr_pages << PAGE_SHIFT) - 1;
+
+ pg = (head >> PAGE_SHIFT) & (buf->nr_pages - 1);
+ pg = pt_topa_next_entry(buf, pg);
+
+ buf->cur = (struct topa *)((unsigned long)buf->topa_index[pg] & PAGE_MASK);
+ buf->cur_idx = ((unsigned long)buf->topa_index[pg] -
+ (unsigned long)buf->cur) / sizeof(struct topa_entry);
+ buf->output_off = head & (sizes(buf->cur->table[buf->cur_idx].size) - 1);
+
+ local64_set(&buf->head, head);
+ local_set(&buf->data_size, 0);
+}
+
+/**
+ * pt_buffer_init_topa() - initialize ToPA table for pt buffer
+ * @buf: pt buffer
+ * @size: total size of all regions within this ToPA
+ * @gfp: allocation flags
+ */
+static int pt_buffer_init_topa(struct pt_buffer *buf, unsigned long nr_pages,
+ gfp_t gfp)
+{
+ struct topa *topa;
+ int err;
+
+ topa = topa_alloc(buf->cpu, gfp);
+ if (!topa)
+ return -ENOMEM;
+
+ topa_insert_table(buf, topa);
+
+ while (buf->nr_pages < nr_pages) {
+ err = topa_insert_pages(buf, gfp);
+ if (err) {
+ pt_buffer_fini_topa(buf);
+ return -ENOMEM;
+ }
+ }
+
+ pt_buffer_setup_topa_index(buf);
+
+ /* link last table to the first one, unless we're double buffering */
+ if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ TOPA_ENTRY(buf->last, -1)->base = buf->first->phys >> TOPA_SHIFT;
+ TOPA_ENTRY(buf->last, -1)->end = 1;
+ }
+
+ pt_topa_dump(buf);
+ return 0;
+}
+
+/**
+ * pt_buffer_setup_aux() - set up topa tables for a PT buffer
+ * @cpu: cpu on which to allocate, -1 means current
+ * @pages: array of pointers to buffer pages passed from perf core
+ * @nr_pages: number of pages in the buffer
+ * @snapshot: if this is a snapshot/overwrite counter
+ */
+static void *
+pt_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool snapshot)
+{
+ struct pt_buffer *buf;
+ int node, ret;
+
+ if (!nr_pages)
+ return NULL;
+
+ if (cpu == -1)
+ cpu = raw_smp_processor_id();
+ node = cpu_to_node(cpu);
+
+ buf = kzalloc_node(offsetof(struct pt_buffer, topa_index[nr_pages]),
+ GFP_KERNEL, node);
+ if (!buf)
+ return NULL;
+
+ buf->cpu = cpu;
+ buf->snapshot = snapshot;
+ buf->data_pages = pages;
+
+ INIT_LIST_HEAD(&buf->tables);
+
+ ret = pt_buffer_init_topa(buf, nr_pages, GFP_KERNEL);
+ if (ret) {
+ kfree(buf);
+ return NULL;
+ }
+
+ return buf;
+}
+
+/**
+ * pt_buffer_free_aux() - perf mmap_close path callback
+ * @data: pt buffer
+ */
+static void pt_buffer_free_aux(void *data)
+{
+ struct pt_buffer *buf = data;
+
+ pt_buffer_fini_topa(buf);
+ kfree(buf);
+}
+
+/**
+ * pt_buffer_is_full() - check if the buffer is full
+ * @buf: pt buffer
+ * @pt: per-cpu pt handle
+ *
+ * If the user hasn't read data from the output region that aux_head
+ * points to, the buffer is considered full: the user needs to read at
+ * least this region and update aux_tail to point past it.
+ */
+static bool pt_buffer_is_full(struct pt_buffer *buf, struct pt *pt)
+{
+ if (buf->snapshot)
+ return false;
+
+ if (local_read(&buf->data_size) >= pt->handle.size)
+ return true;
+
+ return false;
+}
+
+void intel_pt_interrupt(void)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct pt_buffer *buf;
+ struct perf_event *event = pt->handle.event;
+
+ if (!ACCESS_ONCE(pt->handle_nmi))
+ return;
+
+ pt_config_start(false);
+
+ if (!event)
+ return;
+
+ buf = perf_get_aux(&pt->handle);
+ if (!buf)
+ return;
+
+ pt_read_offset(buf);
+
+ pt_handle_status(pt);
+
+ pt_update_head(pt);
+
+ perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+ local_xchg(&buf->lost, 0));
+
+ if (!event->hw.state) {
+ int ret = pt_config(event);
+
+ if (ret) {
+ event->hw.state = PERF_HES_STOPPED;
+ return;
+ }
+
+ buf = perf_aux_output_begin(&pt->handle, event);
+ if (!buf) {
+ event->hw.state = PERF_HES_STOPPED;
+ return;
+ }
+
+ pt_buffer_reset_offsets(buf, pt->handle.head);
+ ret = pt_buffer_reset_markers(buf, &pt->handle);
+ if (ret) {
+ perf_aux_output_end(&pt->handle, 0, true);
+ return;
+ }
+
+ pt_config_buffer(buf->cur->table, buf->cur_idx,
+ buf->output_off);
+ wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+ pt_config_start(true);
+ }
+}
+
+static void pt_event_start(struct perf_event *event, int mode)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+ if (pt_is_running() || !buf || pt_buffer_is_full(buf, pt) ||
+ pt_config(event)) {
+ event->hw.state = PERF_HES_STOPPED;
+ return;
+ }
+
+ ACCESS_ONCE(pt->handle_nmi) = 1;
+ event->hw.state = 0;
+
+ pt_config_buffer(buf->cur->table, buf->cur_idx,
+ buf->output_off);
+ wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+ pt_config_start(true);
+}
+
+static void pt_event_stop(struct perf_event *event, int mode)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+ ACCESS_ONCE(pt->handle_nmi) = 0;
+ pt_config_start(false);
+
+ if (event->hw.state == PERF_HES_STOPPED)
+ return;
+
+ event->hw.state = PERF_HES_STOPPED;
+
+ if (mode & PERF_EF_UPDATE) {
+ struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+ if (!buf)
+ return;
+
+ if (WARN_ON_ONCE(pt->handle.event != event))
+ return;
+
+ pt_read_offset(buf);
+
+ pt_handle_status(pt);
+
+ pt_update_head(pt);
+ }
+}
+
+static void pt_event_del(struct perf_event *event, int mode)
+{
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct pt_buffer *buf;
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ pt_event_stop(event, PERF_EF_UPDATE);
+
+ buf = perf_get_aux(&pt->handle);
+ cpuc->pt_enabled = 0;
+
+ if (buf) {
+ if (buf->snapshot)
+ pt->handle.head =
+ local_xchg(&buf->data_size,
+ buf->nr_pages << PAGE_SHIFT);
+ perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+ local_xchg(&buf->lost, 0));
+ }
+}
+
+static int pt_event_add(struct perf_event *event, int mode)
+{
+ struct pt_buffer *buf;
+ struct pt *pt = this_cpu_ptr(&pt_ctx);
+ struct hw_perf_event *hwc = &event->hw;
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ int ret = -EBUSY;
+
+ if (cpuc->lbr_users || cpuc->bts_enabled)
+ goto out;
+
+ ret = pt_config(event);
+ if (ret) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (pt->handle.event) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ buf = perf_aux_output_begin(&pt->handle, event);
+ if (!buf) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ pt_buffer_reset_offsets(buf, pt->handle.head);
+ if (!buf->snapshot) {
+ ret = pt_buffer_reset_markers(buf, &pt->handle);
+ if (ret) {
+ perf_aux_output_end(&pt->handle, 0, true);
+ goto out;
+ }
+ }
+
+ if (mode & PERF_EF_START) {
+ pt_event_start(event, 0);
+ if (hwc->state == PERF_HES_STOPPED) {
+ pt_event_del(event, 0);
+ ret = -EBUSY;
+ }
+ } else {
+ hwc->state = PERF_HES_STOPPED;
+ }
+
+out:
+
+ if (ret)
+ hwc->state = PERF_HES_STOPPED;
+ else
+ cpuc->pt_enabled = 1;
+
+ return ret;
+}
+
+static void pt_event_read(struct perf_event *event)
+{
+}
+
+static int pt_event_init(struct perf_event *event)
+{
+ if (event->attr.type != pt_pmu.pmu.type)
+ return -ENOENT;
+
+ if (!pt_event_valid(event))
+ return -EINVAL;
+
+ return 0;
+}
+
+static __init int pt_init(void)
+{
+ int ret, cpu, prior_warn = 0;
+
+ BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
+ get_online_cpus();
+ for_each_online_cpu(cpu) {
+ u64 ctl;
+
+ ret = rdmsrl_safe_on_cpu(cpu, MSR_IA32_RTIT_CTL, &ctl);
+ if (!ret && (ctl & RTIT_CTL_TRACEEN))
+ prior_warn++;
+ }
+ put_online_cpus();
+
+ ret = pt_pmu_hw_init();
+ if (ret)
+ return ret;
+
+ if (!pt_cap_get(PT_CAP_topa_output)) {
+ pr_warn("ToPA output is not supported on this CPU\n");
+ return -ENODEV;
+ }
+
+ if (prior_warn)
+ pr_warn("PT is enabled at boot time, traces may be empty\n");
+
+ if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+ pt_pmu.pmu.capabilities =
+ PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
+
+ pt_pmu.pmu.capabilities |= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
+ pt_pmu.pmu.attr_groups = pt_attr_groups;
+ pt_pmu.pmu.task_ctx_nr = perf_hw_context;
+ pt_pmu.pmu.event_init = pt_event_init;
+ pt_pmu.pmu.add = pt_event_add;
+ pt_pmu.pmu.del = pt_event_del;
+ pt_pmu.pmu.start = pt_event_start;
+ pt_pmu.pmu.stop = pt_event_stop;
+ pt_pmu.pmu.read = pt_event_read;
+ pt_pmu.pmu.setup_aux = pt_buffer_setup_aux;
+ pt_pmu.pmu.free_aux = pt_buffer_free_aux;
+ ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
+
+ return ret;
+}
+
+module_init(pt_init);
--
2.1.0

2014-10-13 13:53:41

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v5 08/20] perf: Support overwrite mode for AUX area

This adds support for overwrite mode in the AUX area, which means "keep
collecting data till you're stopped", turning the AUX area into a circular
buffer, where new data overwrites old data. It does not depend on the data
buffer's overwrite mode, so that the sideband data that is instrumental for
processing AUX data is not lost.

Overwrite mode is enabled by mapping the AUX area read only. Even though
aux_tail in the buffer's user page might be user writable, it will be
ignored in this mode.

A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
data stream every time an event writes new data to the AUX area. The pmu
driver might not be able to infer the exact beginning of the new data in
each snapshot; some drivers will only provide the tail, which is
aux_offset + aux_size in the AUX record. The consumer has to be able to
tell the new data from the old, for example by means of timestamps, if
such are provided in the trace.

The consumer is also responsible for disabling any events that might write
to the AUX area (thus potentially racing with the consumer) before
collecting the data.
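
A minimal consumer sketch of the above (assuming the aux_offset, aux_size
and aux_head fields in the user page introduced earlier in this series;
the initial mmap() of the AUX area and error handling are omitted):

/*
 * Stop the event, copy the AUX area out oldest-to-newest around aux_head,
 * then restart. The barrier pairs with the driver's wmb() that precedes
 * the aux_head update.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

static void read_aux_snapshot(int fd, struct perf_event_mmap_page *pg,
			      unsigned char *aux, unsigned char *out)
{
	uint64_t size = pg->aux_size;
	uint64_t head;

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);	/* stop the writer */

	head = pg->aux_head % size;
	__sync_synchronize();			/* aux_head load before data loads */

	memcpy(out, aux + head, size - head);	/* oldest data first */
	memcpy(out + (size - head), aux, head);	/* then the most recent */

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}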

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/uapi/linux/perf_event.h | 1 +
kernel/events/internal.h | 1 +
kernel/events/ring_buffer.c | 48 ++++++++++++++++++++++++++++-------------
3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 1a46627699..a6bd31f5e0 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -768,6 +768,7 @@ enum perf_callchain_context {
* PERF_RECORD_AUX::flags bits
*/
#define PERF_AUX_FLAG_TRUNCATED 0x01 /* record was truncated to fit */
+#define PERF_AUX_FLAG_OVERWRITE 0x02 /* snapshot from overwrite mode */

#define PERF_FLAG_FD_NO_GROUP (1UL << 0)
#define PERF_FLAG_FD_OUTPUT (1UL << 1)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 68bbf86d8f..b1ed80f87d 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -40,6 +40,7 @@ struct ring_buffer {
local_t aux_nest;
unsigned long aux_pgoff;
int aux_nr_pages;
+ int aux_overwrite;
atomic_t aux_mmap_count;
unsigned long aux_mmap_locked;
void (*free_aux)(void *);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index f686df9d1d..d0373c6d30 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -282,26 +282,33 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
goto err_put;

aux_head = local_read(&rb->aux_head);
- aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);

handle->rb = rb;
handle->event = event;
handle->head = aux_head;
- if (aux_head - aux_tail < perf_aux_size(rb))
- handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
- else
- handle->size = 0;
+ handle->size = 0;

/*
- * handle->size computation depends on aux_tail load; this forms a
- * control dependency barrier separating aux_tail load from aux data
- * store that will be enabled on successful return
+ * In overwrite mode, AUX data stores do not depend on aux_tail,
+ * therefore (A) control dependency barrier does not exist. The
+ * (B) <-> (C) ordering is still observed by the pmu driver.
*/
- if (!handle->size) { /* A, matches D */
- event->pending_disable = 1;
- perf_output_wakeup(handle);
- local_set(&rb->aux_nest, 0);
- goto err_put;
+ if (!rb->aux_overwrite) {
+ aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+ if (aux_head - aux_tail < perf_aux_size(rb))
+ handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+
+ /*
+ * handle->size computation depends on aux_tail load; this forms a
+ * control dependency barrier separating aux_tail load from aux data
+ * store that will be enabled on successful return
+ */
+ if (!handle->size) { /* A, matches D */
+ event->pending_disable = 1;
+ perf_output_wakeup(handle);
+ local_set(&rb->aux_nest, 0);
+ goto err_put;
+ }
}

return handle->rb->aux_priv;
@@ -326,13 +333,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
bool truncated)
{
struct ring_buffer *rb = handle->rb;
- unsigned long aux_head = local_read(&rb->aux_head);
+ unsigned long aux_head;
u64 flags = 0;

if (truncated)
flags |= PERF_AUX_FLAG_TRUNCATED;

- local_add(size, &rb->aux_head);
+ /* in overwrite mode, driver provides aux_head via handle */
+ if (rb->aux_overwrite) {
+ flags |= PERF_AUX_FLAG_OVERWRITE;
+
+ aux_head = handle->head;
+ local_set(&rb->aux_head, aux_head);
+ } else {
+ aux_head = local_read(&rb->aux_head);
+ local_add(size, &rb->aux_head);
+ }

if (size || flags) {
/*
@@ -474,6 +490,8 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
*/
kref_init(&rb->aux_refcount);

+ rb->aux_overwrite = overwrite;
+
out:
if (!ret)
rb->aux_pgoff = pgoff;
--
2.1.0

2014-10-22 12:36:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 02/20] perf: Add AUX area to ring buffer for raw data streams

On Mon, Oct 13, 2014 at 04:45:30PM +0300, Alexander Shishkin wrote:
> + struct kref aux_refcount;

I'm not a fan of kref, pointless obfuscation that.

2014-10-22 13:26:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 06/20] perf: Add AUX record

On Mon, Oct 13, 2014 at 04:45:34PM +0300, Alexander Shishkin wrote:
> + /*
> + * Records that new data landed in the AUX buffer part.
> + *
> + * struct {
> + * struct perf_event_header header;
> + *
> + * u64 aux_offset;
> + * u64 aux_size;
> + * u64 flags;
> + * struct sample_id sample_id;
> + * };
> + */
> + PERF_RECORD_AUX = 11,

Given the discussion with the ARM people the last time, do we want to
add the possibility of a variable data field in this event? Its easy to
add now, harder to do later (although not impossible).

Also added Mathieu Poirier on CC, he asked to be included in your next
postings.

2014-10-22 14:02:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 07/20] perf: Add api for pmus to write to AUX area

On Mon, Oct 13, 2014 at 04:45:35PM +0300, Alexander Shishkin wrote:
> + /*
> + * Nesting is not supported for AUX area, make sure nested
> + * writers are caught early
> + */
> + if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
> + goto err_put;

Note that printk() from NMI is prone to making your machine die, not
that I have a better option though.

2014-10-22 14:16:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 11/20] x86: perf: Intel PT and LBR/BTS are mutually exclusive

On Mon, Oct 13, 2014 at 04:45:39PM +0300, Alexander Shishkin wrote:
> Intel PT cannot be used at the same time as LBR or BTS and will cause a
> general protection fault if they are used together. In order to avoid
> fixing up GPs in the fast path, instead we use flags to indicate that
> one of these is in use so that the other avoids MSR access altogether.
>

Yeah, don't like this. Like I've said many times before we should simply
disallow creating PT events when there are LBR events and vice versa.

2014-10-22 14:18:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> +/**
> + * struct pt_buffer - buffer configuration; one buffer per task_struct or
- * cpu, depending on perf event configuration
+ * cpu, depending on perf event configuration
> + * @tables: list of ToPA tables in this buffer
> + * @first, @last: shorthands for first and last topa tables
> + * @cur: current topa table
> + * @nr_pages: buffer size in pages
> + * @cur_idx: current output region's index within @cur table
> + * @output_off: offset within the current output region
> + * @data_size: running total of the amount of data in this buffer
> + * @lost: if data was lost/truncated
> + * @head: logical write offset inside the buffer
> + * @snapshot: if this is for a snapshot/overwrite counter
> + * @stop_pos, @intr_pos: STOP and INT topa entries in the buffer
> + * @data_pages: array of pages from perf
> + * @topa_index: table of topa entries indexed by page offset
> + */

2014-10-22 14:18:36

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 06/20] perf: Add AUX record

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:34PM +0300, Alexander Shishkin wrote:
>> + /*
>> + * Records that new data landed in the AUX buffer part.
>> + *
>> + * struct {
>> + * struct perf_event_header header;
>> + *
>> + * u64 aux_offset;
>> + * u64 aux_size;
>> + * u64 flags;
>> + * struct sample_id sample_id;
>> + * };
>> + */
>> + PERF_RECORD_AUX = 11,
>
> Given the discussion with the ARM people the last time, do we want to
> add the possibility of a variable data field in this event? Its easy to
> add now, harder to do later (although not impossible).

Iirc, what they want is to save a once-per-session chunk of data, which
would be better synthesized by perf record than sent from a pmu driver?

We do something like that with PT right now, perf record looks at pmu's
sysfs attributes and stores them in some synthesized record.
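
Roughly along these lines (a sketch only; the caps directory is the one
exposed by the intel_pt driver in this series, the actual record synthesis
is left out):

/* read every file under the pmu's caps/ directory and emit "name=value"
 * pairs, which perf record can then pack into a synthesized record */
#include <stdio.h>
#include <limits.h>
#include <dirent.h>

static void dump_pt_caps(void)
{
	const char *dir = "/sys/bus/event_source/devices/intel_pt/caps";
	char path[PATH_MAX], val[64];
	struct dirent *d;
	DIR *caps = opendir(dir);
	FILE *f;

	if (!caps)
		return;

	while ((d = readdir(caps)) != NULL) {
		if (d->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, d->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(val, sizeof(val), f))
			printf("%s=%s", d->d_name, val);
		fclose(f);
	}
	closedir(caps);
}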

> Also added Mathieu Poirier on CC, he asked to be included in your next
> postings.

Indeed, my apologies.

Regards,
--
Alex

2014-10-22 14:19:03

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 07/20] perf: Add api for pmus to write to AUX area

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:35PM +0300, Alexander Shishkin wrote:
>> + /*
>> + * Nesting is not supported for AUX area, make sure nested
>> + * writers are caught early
>> + */
>> + if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
>> + goto err_put;
>
> Note that printk() from NMI is prone to making your machine die, not
> that I have a better option though.

Indeed, but if your driver is really attempting nested
perf_aux_output_begin()s, then you had it coming to you anyway.

Regards,
--
Alex

2014-10-22 14:20:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1528,6 +1528,14 @@ again:
> }
>
> /*
> + * Intel PT
> + */
> + if (__test_and_clear_bit(55, (unsigned long *)&status)) {
> + handled++;
> + intel_pt_interrupt();
> + }
> +

How does the PT interrupt interact with the regular PMI? In particular
does it respect stuff like FREEZE_ON_PMI etc?

2014-10-22 14:23:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> +/*
> + * Capabilities of Intel PT hardware, such as number of address bits or
> + * supported output schemes, are cached and exported to userspace as "caps"
> + * attribute group of pt pmu device
> + * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
> + * relevant bits together with intel_pt traces.
> + */

Does it make sense to use that AUX record variable data thing for this
instead -- much like what the ARM folks wanted?

2014-10-22 14:27:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> + if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
> + for (i = 0; i < PT_CPUID_LEAVES; i++)
> + cpuid_count(20, i,
> + &pt_pmu.caps[CR_EAX + i * 4],
> + &pt_pmu.caps[CR_EBX + i * 4],
> + &pt_pmu.caps[CR_ECX + i * 4],
> + &pt_pmu.caps[CR_EDX + i * 4]);
> + } else
> + return -ENODEV;

Please use {} for every multi-line stmt, even if its a single stmt, also
use {} in both branches of a condition, even if only one strictly
requires it.

2014-10-22 14:32:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> +
> +enum cpuid_regs {
> + CR_EAX = 0,
> + CR_ECX,
> + CR_EDX,
> + CR_EBX
> +};
> +
> +/*
> + * Capabilities of Intel PT hardware, such as number of address bits or
> + * supported output schemes, are cached and exported to userspace as "caps"
> + * attribute group of pt pmu device
> + * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
> + * relevant bits together with intel_pt traces.
> + */
> +#define PT_CAP(_n, _l, _r, _m) \
> + [PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
> + .reg = _r, .mask = _m }
> +
> +static struct pt_cap_desc {
> + const char *name;
> + u32 leaf;
> + u8 reg;
> + u32 mask;
> +} pt_caps[] = {
> + PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
> + PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
> + PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
> + PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
> + PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
> +};
> +
> +static u32 pt_cap_get(enum pt_capabilities cap)
> +{
> + struct pt_cap_desc *cd = &pt_caps[cap];
> + u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
> + unsigned int shift = __ffs(cd->mask);
> +
> + return (c & cd->mask) >> shift;
> +}

> + if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
> + for (i = 0; i < PT_CPUID_LEAVES; i++)
> + cpuid_count(20, i,
> + &pt_pmu.caps[CR_EAX + i * 4],
> + &pt_pmu.caps[CR_EBX + i * 4],
> + &pt_pmu.caps[CR_ECX + i * 4],
> + &pt_pmu.caps[CR_EDX + i * 4]);
> + } else
> + return -ENODEV;

I would really rather you use bitfield unions for cpuid stuff, have a
look at union cpuid10_e[abd]x as used in
perf_event_intel.c:intel_pmu_init().

2014-10-22 14:34:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> +static bool pt_event_valid(struct perf_event *event)
> +{
> + u64 config = event->attr.config;
> +
> + /* admin can set any packet generation parameters */
> + if (capable(CAP_SYS_ADMIN) && (config & PT_BYPASS_MASK) == config)
> + return true;
> +
> + if ((config & PT_CONFIG_MASK) != config)
> + return false;
> +
> + return true;
> +}

This seems to suggest PT is available to !priv users, is this right?

2014-10-22 14:45:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> +static int pt_config(struct perf_event *event)
> +{
> + u64 reg;
> +
> + reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN;
> +
> + if (!event->attr.exclude_kernel)
> + reg |= RTIT_CTL_OS;
> + if (!event->attr.exclude_user)
> + reg |= RTIT_CTL_USR;
> +
> + reg |= (event->attr.config & PT_CONFIG_MASK);
> +
> + /*
> + * User can try to set bits in RTIT_CTL through PT_BYPASS_MASK,
> + * that aren't supported by the hardware. Weather or not a
> + * particular bitmask is supported by a cpu can't be determined
> + * via cpuid or otherwise, so we have to rely on #GP handling
> + * to catch these cases.
> + */
> + return wrmsrl_safe(MSR_IA32_RTIT_CTL, reg);
> +}

Whether the weather is nice or not :-)

But no, this cannot be, once we've accepted the event it must be
programmable. Failing at the time of programming is vile; pmu::start()
is a void return, failure is not an option there.

The fact that the hardware cannot even tell you the supported mask is
further fail.

IIRC I think Andi once suggested probing each of the 64 bits in that MSR
to determine the supported mask at device init time.
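
Something like this, I suppose (untested sketch; it skips TraceEn and
assumes writing the other bits with tracing disabled has no side effects):

/* probe which RTIT_CTL bits the CPU accepts by seeing whether
 * wrmsrl_safe() faults on each one; run once at init, before any user */
static u64 __init pt_probe_rtit_ctl_mask(void)
{
	u64 mask = 0;
	int bit;

	for (bit = 1; bit < 64; bit++) {
		if (!wrmsrl_safe(MSR_IA32_RTIT_CTL, BIT_ULL(bit)))
			mask |= BIT_ULL(bit);
	}
	wrmsrl(MSR_IA32_RTIT_CTL, 0);

	return mask;
}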

BTW, what's that RTIT thing? Did someone forget to propagate the latest
name change of the thing or so?

2014-10-22 14:49:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> +static void pt_config_start(bool start)
> +{
> + u64 ctl;
> +
> + rdmsrl(MSR_IA32_RTIT_CTL, ctl);
> + if (start)
> + ctl |= RTIT_CTL_TRACEEN;
> + else
> + ctl &= ~RTIT_CTL_TRACEEN;
> + wrmsrl(MSR_IA32_RTIT_CTL, ctl);
> +
> + /*
> + * A wrmsr that disables trace generation serializes other PT
> + * registers and causes all data packets to be written to memory,
> + * but a fence is required for the data to become globally visible.
> + *
> + * The below WMB, separating data store and aux_head store matches
> + * the consumer's RMB that separates aux_head load and data load.
> + */
> + if (!start)
> + wmb();
> +}

wmb is sfence, is that sufficient? One would have expected an mfence
since that would also order later reads.

2014-10-22 14:55:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

> +/* make negative table index stand for the last table entry */
> +#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])

code does not match comment; negative would be: i < 0, not i == -1.

Something like: ({ if (i < 0) i += t->size; t->table[i]; }), might work,
of course that goes bang when: i !e [-size,size)

2014-10-22 15:07:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 06/20] perf: Add AUX record

On Wed, Oct 22, 2014 at 05:18:29PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Mon, Oct 13, 2014 at 04:45:34PM +0300, Alexander Shishkin wrote:
> >> + /*
> >> + * Records that new data landed in the AUX buffer part.
> >> + *
> >> + * struct {
> >> + * struct perf_event_header header;
> >> + *
> >> + * u64 aux_offset;
> >> + * u64 aux_size;
> >> + * u64 flags;
> >> + * struct sample_id sample_id;
> >> + * };
> >> + */
> >> + PERF_RECORD_AUX = 11,
> >
> > Given the discussion with the ARM people the last time, do we want to
> > add the possibility of a variable data field in this event? Its easy to
> > add now, harder to do later (although not impossible).
>
> Iirc, what they want is to save a once-per-session chunk of data, which
> would be better synthesized by perf record than sent from a pmu driver?
>
> We do something like that with PT right now, perf record looks at pmu's
> sysfs attributes and stores them in some synthesized record.

Right, I just got to that. So I wasn't entirely sure if perf_event_open
time conditions where sufficient for them, if they are then yes that
could work.

2014-10-22 15:11:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Wed, Oct 22, 2014 at 04:49:52PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> > +static void pt_config_start(bool start)
> > +{
> > + u64 ctl;
> > +
> > + rdmsrl(MSR_IA32_RTIT_CTL, ctl);
> > + if (start)
> > + ctl |= RTIT_CTL_TRACEEN;
> > + else
> > + ctl &= ~RTIT_CTL_TRACEEN;
> > + wrmsrl(MSR_IA32_RTIT_CTL, ctl);
> > +
> > + /*
> > + * A wrmsr that disables trace generation serializes other PT
> > + * registers and causes all data packets to be written to memory,
> > + * but a fence is required for the data to become globally visible.
> > + *
> > + * The below WMB, separating data store and aux_head store matches
> > + * the consumer's RMB that separates aux_head load and data load.
> > + */
> > + if (!start)
> > + wmb();
> > +}
>
> wmb is sfence, is that sufficient? One would have expected an mfence
> since that would also order later reads.

Silly me, we're separating two stores here. Ignore that.

2014-10-22 15:14:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> + if ((pt_cap_get(PT_CAP_topa_multiple_entries)
> + && topa->table[i].stop)
> + || topa->table[i].end)
> + break;

> + old = (local64_xchg(&buf->head, base) &
> + ((buf->nr_pages << PAGE_SHIFT) - 1));

> + if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
> + buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
> + local_inc(&buf->lost);
> + advance++;
> + }

Inconsistent coding style. Please put operators at the end of a line,
not at the start of one.

2014-10-22 15:26:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> +static void pt_event_start(struct perf_event *event, int mode)
> +{
> + struct pt *pt = this_cpu_ptr(&pt_ctx);
> + struct pt_buffer *buf = perf_get_aux(&pt->handle);
> +
> + if (pt_is_running() || !buf || pt_buffer_is_full(buf, pt) ||
> + pt_config(event)) {
> + event->hw.state = PERF_HES_STOPPED;
> + return;
> + }
> +
> + ACCESS_ONCE(pt->handle_nmi) = 1;
> + event->hw.state = 0;
> +
> + pt_config_buffer(buf->cur->table, buf->cur_idx,
> + buf->output_off);
> + wrmsrl(MSR_IA32_RTIT_STATUS, 0);
> + pt_config_start(true);
> +}

That's two RTIT_CTL writes, should we not try and merge those?

2014-10-23 03:05:46

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v5 02/20] perf: Add AUX area to ring buffer for raw data streams

On Wed, Oct 22, 2014 at 02:35:47PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 13, 2014 at 04:45:30PM +0300, Alexander Shishkin wrote:
> > + struct kref aux_refcount;
>
> I'm not a fan of kref, pointless obfuscation that.

It has a good potential for debugging though. Sure right now
the get/put simple APIs only performs counting sanity checks
but I've seen patches that extend it to object debugging.

Sounds quite valuable on complicated object lifecycles like
perf events.

2014-10-23 12:39:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 18/20] perf: Allocate ring buffers for inherited per-task kernel events

On Mon, Oct 13, 2014 at 04:45:46PM +0300, Alexander Shishkin wrote:
> Normally, per-task events can't inherit their parents' ring buffers to
> avoid multiple events contending for the same buffer. And since buffer
> allocation is typically done by the userspace consumer, there is no
> practical interface to allocate new buffers for inherited counters.
>
> However, for kernel users we can allocate new buffers for inherited
> events as soon as they are created (and also reap them on event
> destruction). This pattern has a number of use cases, such as event
> sample annotation and process core dump annotation.
>
> When a new event is inherited from a per-task kernel event that has a
> ring buffer, allocate a new buffer for this event so that data from the
> child task is collected and can later be retrieved for sample annotation
> or core dump inclusion. This ring buffer is released when the event is
> freed, for example, when the child task exits.
>

This causes a pinned memory explosion, not at all nice that.

I think I see why and all, but it would be ever so good to not have to
allocate so much memory.

2014-10-24 07:44:59

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 18/20] perf: Allocate ring buffers for inherited per-task kernel events

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:46PM +0300, Alexander Shishkin wrote:
>> Normally, per-task events can't inherit their parents' ring buffers to
>> avoid multiple events contending for the same buffer. And since buffer
>> allocation is typically done by the userspace consumer, there is no
>> practical interface to allocate new buffers for inherited counters.
>>
>> However, for kernel users we can allocate new buffers for inherited
>> events as soon as they are created (and also reap them on event
>> destruction). This pattern has a number of use cases, such as event
>> sample annotation and process core dump annotation.
>>
>> When a new event is inherited from a per-task kernel event that has a
>> ring buffer, allocate a new buffer for this event so that data from the
>> child task is collected and can later be retrieved for sample annotation
>> or core dump inclusion. This ring buffer is released when the event is
>> freed, for example, when the child task exits.
>>
>
> This causes a pinned memory explosion, not at all nice that.
>
> I think I see why and all, but it would be ever so good to not have to
> allocate so much memory.

Are there any controls we could use to limit such memory usage?
Theoretically, the buffers that we'd allocate for this are way smaller
than, for example, what we use if we try to capture a complete trace,
since we'd only be interested in the most recent trace data. We already
have RLIMIT_NPROC, which implicitly limits the number of these buffers,
for example. Or maybe we can introduce a new rlimit/sysctl/whatnot that
would limit the maximum amount of such memory per-cpu/system/user. What
do you think?

Per-cpu buffers with inheritance would solve this problem, but raises
other issues: we'd need sched_switch again to tell the traces apart and
since those buffers run in overwrite mode, a cpu hog task can
potentially overwrite any useful trace data.

Regards,
--
Alex

2014-10-24 07:47:28

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 11/20] x86: perf: Intel PT and LBR/BTS are mutually exclusive

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:39PM +0300, Alexander Shishkin wrote:
>> Intel PT cannot be used at the same time as LBR or BTS and will cause a
>> general protection fault if they are used together. In order to avoid
>> fixing up GPs in the fast path, instead we use flags to indicate that
>> one of these is in use so that the other avoids MSR access altogether.
>>
>
> Yeah, don't like this. Like I've said many times before we should simply
> disallow creating PT events when there are LBR events and vice versa.

Ok, I must have misunderstood you before. This makes things a bit
easier.

Regards,
--
Alex

2014-10-24 07:49:40

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
>> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> @@ -1528,6 +1528,14 @@ again:
>> }
>>
>> /*
>> + * Intel PT
>> + */
>> + if (__test_and_clear_bit(55, (unsigned long *)&status)) {
>> + handled++;
>> + intel_pt_interrupt();
>> + }
>> +
>
> How does the PT interrupt interact with the regular PMI? In particular
> does it respect stuff like FREEZE_ON_PMI etc?

It ignores the FREEZE_ON_PMI bit. I stop it by hand inside the PMI
handler, so you can see parts of the handler in the trace if you're
tracing the kernel.

Regards,
--
Alex

2014-10-24 07:50:16

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
>> + if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
>> + for (i = 0; i < PT_CPUID_LEAVES; i++)
>> + cpuid_count(20, i,
>> + &pt_pmu.caps[CR_EAX + i * 4],
>> + &pt_pmu.caps[CR_EBX + i * 4],
>> + &pt_pmu.caps[CR_ECX + i * 4],
>> + &pt_pmu.caps[CR_EDX + i * 4]);
>> + } else
>> + return -ENODEV;
>
> Please use {} for every multi-line stmt, even if its a single stmt, also
> use {} in both branches of a condition, even if only one strictly
> requires it.

Ok.

2014-10-24 07:52:13

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
>> +static bool pt_event_valid(struct perf_event *event)
>> +{
>> + u64 config = event->attr.config;
>> +
>> + /* admin can set any packet generation parameters */
>> + if (capable(CAP_SYS_ADMIN) && (config & PT_BYPASS_MASK) == config)
>> + return true;
>> +
>> + if ((config & PT_CONFIG_MASK) != config)
>> + return false;
>> +
>> + return true;
>> +}
>
> This seems to suggest PT is available to !priv users, is this right?

Yes, that's the intention. PT does CPL-based filtering in hardware and
unlike previous attempts at branch tracing, does not leak out-of-context
addresses (or any other information for that matter).

Regards,
--
Alex

2014-10-24 07:59:36

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

>> +/* make negative table index stand for the last table entry */
>> +#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])
>
> code does not match comment; negative would be: i < 0, not i == -1.

Indeed.

> Something like: ({ if (i < 0) i += t->size; t->table[i]; }), might work,
> of course that goes bang when: i !e [-size,size)

Table indices and sizes are measured in different units. Using -1 to
mean the last entry is just a small convenience; in the code it would be
used explicitly as TOPA_ENTRY(topa, -1)->end = 1, for example.
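
If we wanted the macro to literally match the "negative index" wording,
something along these lines would do it (untested; -1 is the only value
the driver actually uses):

/* count negative indices back from the last used entry, so that -1 is
 * table[t->last], -2 is table[t->last - 1], and so on */
#define TOPA_ENTRY(t, i)					\
	((i) < 0 ? &(t)->table[(t)->last + 1 + (i)]		\
		 : &(t)->table[(i)])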

Regards,
--
Alex

2014-10-24 08:00:24

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
>> + if ((pt_cap_get(PT_CAP_topa_multiple_entries)
>> + && topa->table[i].stop)
>> + || topa->table[i].end)
>> + break;
>
>> + old = (local64_xchg(&buf->head, base) &
>> + ((buf->nr_pages << PAGE_SHIFT) - 1));
>
>> + if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
>> + buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
>> + local_inc(&buf->lost);
>> + advance++;
>> + }
>
> Inconsistent coding style. Please put operators at the end of a line,
> not at the start of one.

Ok.

2014-10-24 08:22:27

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
>> +static int pt_config(struct perf_event *event)
>> +{
>> + u64 reg;
>> +
>> + reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN;
>> +
>> + if (!event->attr.exclude_kernel)
>> + reg |= RTIT_CTL_OS;
>> + if (!event->attr.exclude_user)
>> + reg |= RTIT_CTL_USR;
>> +
>> + reg |= (event->attr.config & PT_CONFIG_MASK);
>> +
>> + /*
>> + * User can try to set bits in RTIT_CTL through PT_BYPASS_MASK,
>> + * that aren't supported by the hardware. Weather or not a
>> + * particular bitmask is supported by a cpu can't be determined
>> + * via cpuid or otherwise, so we have to rely on #GP handling
>> + * to catch these cases.
>> + */
>> + return wrmsrl_safe(MSR_IA32_RTIT_CTL, reg);
>> +}
>
> Whether the weather is nice or not :-)
>
> But no, this cannot be, once we've accepted the event it must be
> programmable. Failing at the time of programming is vile; pmu::start()
> is a void return, failure is not an option there.

This is called from pmu::add(), which can fail. If this wrmsrl throws a
gp, pmu::add() will fail and we won't even get to pmu::start(). Of
course, the problem with such an event faulting every time it is added is
still there. Maybe we can simply disable such events after the first
fault. Good news is, only a CAP_SYS_ADMIN can set arbitrary bits, so the
damage is limited.
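
Something like this, perhaps (a rough sketch; the PERF_X86_EVENT_PT_FAULTED
flag bit is made up for illustration):

/* wrap pt_config() so that a config that #GP'd once is remembered and
 * not retried on every subsequent add() */
static int pt_config_once(struct perf_event *event)
{
	int ret;

	if (event->hw.flags & PERF_X86_EVENT_PT_FAULTED)
		return -EINVAL;

	ret = pt_config(event);
	if (ret)
		event->hw.flags |= PERF_X86_EVENT_PT_FAULTED;

	return ret;
}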

> The fact that the hardware cannot even tell you the supported mask is
> further fail.
>
> IIRC I think Andi once suggested probing each of the 64 bits in that MSR
> to determine the supported mask at device init time.

The problem with this is that some bits go in groups, there'd be 2..3..4
bit fields encoding desired packet frequency, for example.

Regards,
--
Alex

2014-10-24 11:26:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Fri, Oct 24, 2014 at 10:49:33AM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
> >> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> >> @@ -1528,6 +1528,14 @@ again:
> >> }
> >>
> >> /*
> >> + * Intel PT
> >> + */
> >> + if (__test_and_clear_bit(55, (unsigned long *)&status)) {
> >> + handled++;
> >> + intel_pt_interrupt();
> >> + }
> >> +
> >
> > How does the PT interrupt interact with the regular PMI? In particular
> > does it respect stuff like FREEZE_ON_PMI etc?
>
> It ignores the FREEZE_ON_PMI bit. I stop it by hand inside the PMI
> handler, so you can see parts of the handler in the trace if you're
> tracing the kernel.

Urgh, horrid that. Routing something to the same interrupt, sharing
status registers but not observing the same semantics for the interrupt
is a massive fail.

IIRC Andi was planning to start using FREEZE_ON_PMI to avoid the MSR
writes in intel_pmu_{disable,enable}_all(), this interrupt not actually
respecting that makes that non-trivial.

We already use FREEZE_ON_PMI for LBR, but for now PT and LBR are
mutually exclusive so that's not a problem, if we ever get those working
together this needs to get fixed.

2014-10-24 11:51:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Fri, Oct 24, 2014 at 11:22:20AM +0300, Alexander Shishkin wrote:
> > The fact that the hardware cannot even tell you the supported mask is
> > further fail.

> The problem with this is that some bits go in groups, there'd be 2..3..4
> bit fields encoding desired packet frequency, for example.

OK, so put the magic number in the big model array.

2014-10-24 12:01:57

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Fri, Oct 24, 2014 at 10:49:33AM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
>> >> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> >> @@ -1528,6 +1528,14 @@ again:
>> >> }
>> >>
>> >> /*
>> >> + * Intel PT
>> >> + */
>> >> + if (__test_and_clear_bit(55, (unsigned long *)&status)) {
>> >> + handled++;
>> >> + intel_pt_interrupt();
>> >> + }
>> >> +
>> >
>> > How does the PT interrupt interact with the regular PMI? In particular
>> > does it respect stuff like FREEZE_ON_PMI etc?
>>
>> It ignores the FREEZE_ON_PMI bit. I stop it by hand inside the PMI
>> handler, so you can see parts of the handler in the trace if you're
>> tracing the kernel.
>
> Urgh, horrid that. Routing something to the same interrupt, sharing
> status registers but not observing the same semantics for the interrupt
> is a massive fail.

I can't pretend to understand the logic behind this either.

> IIRC Andi was planning to start using FREEZE_ON_PMI to avoid the MSR
> writes in intel_pmu_{disable,enable}_all(), this interrupt not actually
> respecting that makes that non-trivial.
>
> We already use FREEZE_ON_PMI for LBR, but for now PT and LBR are
> mutually exclusive so that's not a problem, if we ever get those working
> together this needs to get fixed.

Agreed.

Regards,
--
Alex

2014-10-24 12:13:25

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Fri, Oct 24, 2014 at 11:22:20AM +0300, Alexander Shishkin wrote:
>> > The fact that the hardware cannot even tell you the supported mask is
>> > further fail.
>
>> The problem with this is that some bits go in groups, there'd be 2..3..4
>> bit fields encoding desired packet frequency, for example.
>
> OK, so put the magic number in the big model array.

I'm not sure I follow. These bits are reserved for the future, they can
potentially be whatever combinations of whatever. If we want to probe
around for valid combinations, we'd have to check everything in the range of
0..2^43 (or something like that, the region reserved for packet enables)
and store all the valid ones, which sounds crazy.

2014-10-24 13:02:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Fri, Oct 24, 2014 at 03:13:19PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Fri, Oct 24, 2014 at 11:22:20AM +0300, Alexander Shishkin wrote:
> >> > The fact that the hardware cannot even tell you the supported mask is
> >> > further fail.
> >
> >> The problem with this is that some bits go in groups, there'd be 2..3..4
> >> bit fields encoding desired packet frequency, for example.
> >
> > OK, so put the magic number in the big model array.
>
> I'm not sure I follow. These bits are reserved for the future, they can
> potentially be whatever combinations of whatever. If we want to probe
> around for valid combinations, we'd have to check everything in the range of
> 0..2^43 (or something like that, the region reserved for packet enables)
> and store all the valid ones, which sounds crazy.

I was assuming that the accepted bits are model specific, and we have
this big model switch statement in perf_event_intel.c:intel_pmu_init(),
so why not have something like x86_pmu.pt_magic_bitmask = 0xf00d in
there?

No need to probe in that case. That is the same thing we do for all
unenumerated model specific things.
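
I.e. something along these lines (illustration only; x86_pmu.pt_ctl_bypass_mask
is a made-up field that the intel_pmu_init() model switch would fill in):

/* pt_event_valid() consulting a per-model mask instead of PT_BYPASS_MASK */
static bool pt_event_valid(struct perf_event *event)
{
	u64 config = event->attr.config;

	/* admin can set any bits this model is known to accept */
	if (capable(CAP_SYS_ADMIN) &&
	    (config & x86_pmu.pt_ctl_bypass_mask) == config)
		return true;

	return (config & PT_CONFIG_MASK) == config;
}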

2014-10-24 13:18:52

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Fri, Oct 24, 2014 at 03:13:19PM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Fri, Oct 24, 2014 at 11:22:20AM +0300, Alexander Shishkin wrote:
>> >> > The fact that the hardware cannot even tell you the supported mask is
>> >> > further fail.
>> >
>> >> The problem with this is that some bits go in groups, there'd be 2..3..4
>> >> bit fields encoding desired packet frequency, for example.
>> >
>> > OK, so put the magic number in the big model array.
>>
>> I'm not sure I follow. These bits are reserved for the future, they can
>> potentially be whatever combinations of whatever. If we want to probe
>> around for valid combinations, we'd have to check everything in the range of
>> 0..2^43 (or something like that, the region reserved for packet enables)
>> and store all the valid ones, which sounds crazy.
>
> I was assuming that the accepted bits are model specific, and we have
> this big model switch statement in perf_event_intel.c:intel_pmu_init(),
> so why not have something like x86_pmu.pt_magic_bitmask = 0xf00d in
> there?

Ah, I see what you mean. The main point of this whole reserved region
proposal is that one shouldn't have to update one's kernel just to enable
new PT packets: if one is CAP_SYS_ADMIN, one can simply do
-e intel_pt/config=0xf00d/.

> No need to probe in that case. That is the same thing we do for all
> unenumerated model specific things.

They are, actually, enumerated; we just want to be able to enable them
before the driver catches up. If the driver is aware of feature X, it can
test for it in CPUID and allow/disallow it based on that.
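
[For illustration, a sketch of the check that proposal implies: bits
enumerated via CPUID caps are available to everyone, bits in the reserved
packet-enable range only to a privileged user. The helper below is
standalone, so the privilege test is a plain flag; in the kernel it would
be something like capable(CAP_SYS_ADMIN), and all names are hypothetical:]

#include <stdint.h>
#include <stdbool.h>
#include <errno.h>

/*
 * Three classes of config bits:
 *  - enumerated: known to this kernel and confirmed present via CPUID caps
 *  - reserved:   the packet-enable range set aside for future features
 *  - anything else: always rejected
 */
int pt_validate_config(uint64_t config, uint64_t enumerated,
		       uint64_t reserved, bool privileged)
{
	if ((config & ~enumerated) == 0)
		return 0;			/* only known-good bits set */

	if ((config & ~(enumerated | reserved)) == 0 && privileged)
		return 0;			/* root poking at reserved packet enables */

	return -EINVAL;
}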

Regards,
--
Alex

2014-10-24 13:48:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Fri, Oct 24, 2014 at 04:18:46PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:

> Ah, I see what you mean. The main point of this whole reserved region
> proposal is that one shouldn't have to update one's kernel just to enable
> new PT packets: if one is CAP_SYS_ADMIN, one can simply do
> -e intel_pt/config=0xf00d/.
>
> > No need to probe in that case. That is the same thing we do for all
> > unenumerated model specific things.
>
> They are, actually, enumerated; we just want to be able to enable them
> before the driver catches up. If the driver is aware of feature X, it can
> test for it in CPUID and allow/disallow it based on that.

If you've got hardware that 'exposes' functionality not enumerated in
CPUID, you have the capability of modifying the kernel.

And while it might seem like a cute feature when you have early
hardware, that is no reason to make the code horrible. Either just patch
your firmware such that CPUID reports the right bits for your hardware
or frob the kernel.

2014-10-30 08:44:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 18/20] perf: Allocate ring buffers for inherited per-task kernel events

On Fri, Oct 24, 2014 at 10:44:54AM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Mon, Oct 13, 2014 at 04:45:46PM +0300, Alexander Shishkin wrote:
> >> Normally, per-task events can't inherit their parents' ring buffers, to
> >> avoid multiple events contending for the same buffer. And since buffer
> >> allocation is typically done by the userspace consumer, there is no
> >> practical interface to allocate new buffers for inherited counters.
> >>
> >> However, for kernel users we can allocate new buffers for inherited
> >> events as soon as they are created (and also reap them on event
> >> destruction). This pattern has a number of use cases, such as event
> >> sample annotation and process core dump annotation.
> >>
> >> When a new event is inherited from a per-task kernel event that has a
> >> ring buffer, allocate a new buffer for this event so that data from the
> >> child task is collected and can later be retrieved for sample annotation
> >> or core dump inclusion. This ring buffer is released when the event is
> >> freed, for example, when the child task exits.
> >>
> >
> > This causes a pinned memory explosion, not at all nice that.
> >
> > I think I see why and all, but it would be ever so good to not have to
> > allocate so much memory.
>
> Are there any controls we could use to limit such memory usage?

I'd say the same limit we're already accounting the mmap()s against. But
the question is: what do we do when we run out?

Will we fail clone()? That might 'surprise' quite a few people, that
their application won't work when profiled.

In any case, let's focus on the other parts of this work and delay this
feature till later.

2014-10-30 10:20:39

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 18/20] perf: Allocate ring buffers for inherited per-task kernel events

Peter Zijlstra <[email protected]> writes:

> On Fri, Oct 24, 2014 at 10:44:54AM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Mon, Oct 13, 2014 at 04:45:46PM +0300, Alexander Shishkin wrote:
>> >> Normally, per-task events can't inherit their parents' ring buffers, to
>> >> avoid multiple events contending for the same buffer. And since buffer
>> >> allocation is typically done by the userspace consumer, there is no
>> >> practical interface to allocate new buffers for inherited counters.
>> >>
>> >> However, for kernel users we can allocate new buffers for inherited
>> >> events as soon as they are created (and also reap them on event
>> >> destruction). This pattern has a number of use cases, such as event
>> >> sample annotation and process core dump annotation.
>> >>
>> >> When a new event is inherited from a per-task kernel event that has a
>> >> ring buffer, allocate a new buffer for this event so that data from the
>> >> child task is collected and can later be retrieved for sample annotation
>> >> or core dump inclusion. This ring buffer is released when the event is
>> >> freed, for example, when the child task exits.
>> >>
>> >
>> > This causes a pinned memory explosion, not at all nice that.
>> >
>> > I think I see why and all, but it would be ever so good to not have to
>> > allocate so much memory.
>>
>> Are there any controls we could use to limit such memory usage?
>
> I'd say the same limit we're already accounting the mmap()s against. But
> the question is: what do we do when we run out?
>
> Will we fail clone()? That might 'surprise' quite a few people, that
> their application won't work when profiled.

Or we just don't allocate any more buffers for this user; if there's a
perf stream involved, we can output a record saying that.

> In any case, let's focus on the other parts of this work and delay this
> feature till later.

Agreed.

Regards,
--
Alex

2014-10-30 13:31:15

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v5 18/20] perf: Allocate ring buffers for inherited per-task kernel events

On Thu, Oct 30, 2014 at 09:43:59AM +0100, Peter Zijlstra wrote:
> On Fri, Oct 24, 2014 at 10:44:54AM +0300, Alexander Shishkin wrote:
> > Peter Zijlstra <[email protected]> writes:
> > > On Mon, Oct 13, 2014 at 04:45:46PM +0300, Alexander Shishkin wrote:
> > >> When a new event is inherited from a per-task kernel event that has a
> > >> ring buffer, allocate a new buffer for this event so that data from the
> > >> child task is collected and can later be retrieved for sample annotation
> > >> or core dump inclusion. This ring buffer is released when the event is
> > >> freed, for example, when the child task exits.

> > > This causes a pinned memory explosion, not at all nice that.

> > > I think I see why and all, but it would be ever so good to not have to
> > > allocate so much memory.

> > Are there any controls we could use to limit such memory usage?

> I'd say the same limit we're already accounting the mmap()s against. But
> the question is: what do we do when we run out?

> Will we fail clone()? That might 'surprise' quite a few people, that

> their application won't work when profiled.

Can't we just emit PERF_RECORD_THROTTLE or similar stuff then? I.e.
somehow mark the cloned process as not being profiled due to ENOMEM?

- Arnaldo
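
[For illustration, the rough shape of the fallback being discussed: charge
the inherited buffer against the same per-user pool the mmap() path
accounts to, and on failure skip the allocation and note it in the
parent's stream instead of failing clone(). Every name below is a made-up
stand-in, not an actual perf internal:]

#include <stdbool.h>
#include <stddef.h>

struct event;				/* stand-in for struct perf_event */

/* Stubs standing in for the real accounting/allocation/reporting paths. */
static bool charge_user_buffer_limit(struct event *e)   { (void)e; return true; }
static void uncharge_user_buffer_limit(struct event *e) { (void)e; }
static int  allocate_child_ring_buffer(struct event *e) { (void)e; return 0; }
static void emit_lost_buffer_record(struct event *e)    { (void)e; }

/* On inherit: charge against the per-user pool; if that (or the allocation)
 * fails, skip the buffer and report it rather than failing clone(). */
static void inherit_aux_buffer(struct event *parent, struct event *child)
{
	if (!charge_user_buffer_limit(child))
		goto lost;

	if (allocate_child_ring_buffer(child) == 0)	/* 0 on success */
		return;

	uncharge_user_buffer_limit(child);
lost:
	emit_lost_buffer_record(parent);
}

int main(void)
{
	struct event *parent = NULL, *child = NULL;

	inherit_aux_buffer(parent, child);
	return 0;
}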

2014-10-31 13:13:56

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Mon, Oct 13, 2014 at 04:45:40PM +0300, Alexander Shishkin wrote:
>> +
>> +enum cpuid_regs {
>> +	CR_EAX = 0,
>> +	CR_ECX,
>> +	CR_EDX,
>> +	CR_EBX
>> +};
>> +
>> +/*
>> + * Capabilities of Intel PT hardware, such as number of address bits or
>> + * supported output schemes, are cached and exported to userspace as "caps"
>> + * attribute group of pt pmu device
>> + * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
>> + * relevant bits together with intel_pt traces.
>> + */
>> +#define PT_CAP(_n, _l, _r, _m) \
>> +	[PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
>> +		.reg = _r, .mask = _m }
>> +
>> +static struct pt_cap_desc {
>> +	const char *name;
>> +	u32 leaf;
>> +	u8 reg;
>> +	u32 mask;
>> +} pt_caps[] = {
>> +	PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
>> +	PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
>> +	PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
>> +	PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
>> +	PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
>> +};
>> +
>> +static u32 pt_cap_get(enum pt_capabilities cap)
>> +{
>> +	struct pt_cap_desc *cd = &pt_caps[cap];
>> +	u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
>> +	unsigned int shift = __ffs(cd->mask);
>> +
>> +	return (c & cd->mask) >> shift;
>> +}
>
>> +	if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
>> +		for (i = 0; i < PT_CPUID_LEAVES; i++)
>> +			cpuid_count(20, i,
>> +				    &pt_pmu.caps[CR_EAX + i * 4],
>> +				    &pt_pmu.caps[CR_EBX + i * 4],
>> +				    &pt_pmu.caps[CR_ECX + i * 4],
>> +				    &pt_pmu.caps[CR_EDX + i * 4]);
>> +	} else
>> +		return -ENODEV;
>
> I would really rather you use bitfield unions for cpuid stuff, have a
> look at union cpuid10_e[abd]x as used in
> perf_event_intel.c:intel_pmu_init().

It looks like it would only work for the first cpuid leaf, but we'll
need more than that. And the array makes it easier to allocate
attributes for sysfs en masse.

I guess it doesn't really matter if we use unions unless these bits need
to be exported to other parts of the kernel? But *that* is hardly a good
idea. What do you think?
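
[For illustration, roughly why the flat array helps: a loop like the
sketch below can stamp out one sysfs caps attribute per pt_caps[] entry,
which per-leaf bitfield unions would not give you for free. The helper
and the group name are illustrative, not the driver's actual code:]

/* Sketch: one sysfs attribute per pt_caps[] entry; pt_cap_attr() and
 * pt_caps_group are hypothetical names. */
static int pt_create_caps_attrs(void)
{
	struct attribute **attrs;
	int i;

	attrs = kcalloc(ARRAY_SIZE(pt_caps) + 1, sizeof(*attrs), GFP_KERNEL);
	if (!attrs)
		return -ENOMEM;

	for (i = 0; i < ARRAY_SIZE(pt_caps); i++)
		attrs[i] = pt_cap_attr(pt_caps[i].name);	/* hypothetical helper */

	pt_caps_group.attrs = attrs;	/* exported under .../intel_pt/caps/ */
	return 0;
}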

Regards,
--
Alex

2014-11-04 15:57:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Fri, Oct 31, 2014 at 03:13:35PM +0200, Alexander Shishkin wrote:
> >> +	if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
> >> +		for (i = 0; i < PT_CPUID_LEAVES; i++)
> >> +			cpuid_count(20, i,
> >> +				    &pt_pmu.caps[CR_EAX + i * 4],
> >> +				    &pt_pmu.caps[CR_EBX + i * 4],
> >> +				    &pt_pmu.caps[CR_ECX + i * 4],
> >> +				    &pt_pmu.caps[CR_EDX + i * 4]);
> >> +	} else
> >> +		return -ENODEV;
> >
> > I would really rather you use bitfield unions for cpuid stuff, have a
> > look at union cpuid10_e[abd]x as used in
> > perf_event_intel.c:intel_pmu_init().
>
> It looks like it would only work for the first cpuid leaf, but we'll
> need more than that. And the array makes it easier to allocate
> attributes for sysfs en masse.
>
> I guess it doesn't really matter if we use unions unless these bits need
> to be exported to other parts of the kernel? But *that* is hardly a good
> idea. What do you think?

Ah yes, the generation. C is lacking there isn't it :/

Now I'm not sure we want to export all the bits you're using though.
Like topa_multiple_entries, which appears to be an implementation detail
userspace should not really care about either way.

2014-11-11 11:25:00

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Fri, Oct 31, 2014 at 03:13:35PM +0200, Alexander Shishkin wrote:
>> >> +	if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
>> >> +		for (i = 0; i < PT_CPUID_LEAVES; i++)
>> >> +			cpuid_count(20, i,
>> >> +				    &pt_pmu.caps[CR_EAX + i * 4],
>> >> +				    &pt_pmu.caps[CR_EBX + i * 4],
>> >> +				    &pt_pmu.caps[CR_ECX + i * 4],
>> >> +				    &pt_pmu.caps[CR_EDX + i * 4]);
>> >> +	} else
>> >> +		return -ENODEV;
>> >
>> > I would really rather you use bitfield unions for cpuid stuff, have a
>> > look at union cpuid10_e[abd]x as used in
>> > perf_event_intel.c:intel_pmu_init().
>>
>> It looks like it would only work for the first cpuid leaf, but we'll
>> need more than that. And the array makes it easier to allocate
>> attributes for sysfs en masse.
>>
>> I guess it doesn't really matter if we use unions unless these bits need
>> to be exported to other parts of the kernel? But *that* is hardly a good
>> idea. What do you think?
>
> Ah yes, the generation. C is lacking there isn't it :/
>
> Now I'm not sure we want to export all the bits you're using though.
> Like topa_multiple_entries, which appears to be an implementation detail
> userspace should not really care about either way.

Actually, userspace can make assumptions about lost data from this
bit. But there are others, such as encoded address length. In future
there will also be several "caps" with allowed values for certain bit
fields in RTIT_CTL, such as timing packet frequencies.

Regards,
--
Alex

2014-11-11 13:20:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

On Tue, Nov 11, 2014 at 01:24:43PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:

> > Now I'm not sure we want to export all the bits you're using though.
> > Like topa_multiple_entries, which appears to be an implementation detail
> > userspace should not really care about either way.
>
> Actually, userspace can make assumptions about lost data from this
> bit.

Do explain; it feels entirely wrong to expose something like this, esp.
since this crappy single TOPA thing is going away.

> But there are others, such as encoded address length. In future
> there will also be several "caps" with allowed values for certain bit
> fields in RTIT_CTL, such as timing packet frequencies.

No, those should not be caps, those should be in the format description.

2014-11-11 14:17:56

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v5 12/20] x86: perf: intel_pt: Intel PT PMU driver

Peter Zijlstra <[email protected]> writes:

> On Tue, Nov 11, 2014 at 01:24:43PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra <[email protected]> writes:
>
>> > Now I'm not sure we want to export all the bits you're using though.
>> > Like topa_multiple_entries, which appears to be an implementation detail
>> > userspace should not really care about either way.
>>
>> Actually, userspace can make assumptions about lost data from this
>> bit.
>
> Do explain; it feels entirely wrong to expose something like this, esp.
> since this crappy single TOPA thing is going away.

At the very least, it tells you the minimum amount of memory you can ask
for as the aux buffer: with multiple-entry topa it is one page, with
single-entry topa it is two pages. You can leave it up to userspace to
find out by trial and error, of course.
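
[For illustration, what a consumer could do with that cap if it is
exported; this assumes, hypothetically, that it shows up as a file named
after the cap under the caps/ directory mentioned in the patch:]

#include <stdio.h>
#include <unistd.h>

/* Minimum aux buffer a consumer can ask for, per the rule above:
 * one page with multi-entry topa, two pages with single-entry topa. */
static long pt_min_aux_size(void)
{
	const char *path =
		"/sys/bus/event_source/devices/intel_pt/caps/topa_multiple_entries";
	long page = sysconf(_SC_PAGESIZE);
	unsigned int multi = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%u", &multi) != 1)
			multi = 0;
		fclose(f);
	}
	return multi ? page : 2 * page;
}

int main(void)
{
	printf("minimum aux buffer: %ld bytes\n", pt_min_aux_size());
	return 0;
}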

>> But there are others, such as encoded address length. In future
>> there will also be several "caps" with allowed values for certain bit
>> fields in RTIT_CTL, such as timing packet frequencies.
>
> No, those should not be caps, those should be in the format description.

Format description gives a bit field, within which not all combinations
may be valid. That is,

format/blah_freq: config:16-19
caps/blah_freq: 81

which means that on this cpu you can only set blah_freq to either 0 or
7. Figuring out these things by trial and error from userspace is not
really nice.
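
[For illustration, how a tool could combine the two files in that example,
assuming the caps value is a hex bitmask of allowed field values, which is
what makes 0x81 come out as "0 or 7":]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Treat the caps value as a bitmask of allowed values for the field:
 * 0x81 has bits 0 and 7 set, so blah_freq may only be 0 or 7. */
static bool field_value_allowed(uint64_t caps_mask, unsigned int value)
{
	return value < 64 && (caps_mask & (1ULL << value));
}

int main(void)
{
	uint64_t blah_freq_caps = 0x81;		/* "caps/blah_freq: 81" above */
	unsigned int v;

	for (v = 0; v < 16; v++)		/* config:16-19 is a 4-bit field */
		if (field_value_allowed(blah_freq_caps, v))
			printf("blah_freq=%u is allowed\n", v);
	return 0;
}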

Regards,
--
Alex