2014-06-11 15:42:13

by Alexander Shishkin

Subject: [RFC v2 0/7] perf: add AUX space to ring_buffer

Hi Peter,

Here's the 2nd go at the AUX area. This covers pretty much what I need
for Intel PT buffer management.

Alexander Shishkin (6):
perf: add data_{offset,size} to user_page
perf: support high-order allocations for AUX space
perf: add a capability for AUX_NO_SG pmus to do software double
buffering
perf: add a pmu capability for "exclusive" events
perf: add api for pmus to write to AUX space
perf: add AUX record

Peter Zijlstra (1):
perf: add AUX area to ring buffer for raw data streams

include/linux/perf_event.h | 39 ++++++-
include/uapi/linux/perf_event.h | 36 +++++++
kernel/events/core.c | 225 ++++++++++++++++++++++++++++++++++------
kernel/events/internal.h | 27 +++++
kernel/events/ring_buffer.c | 182 +++++++++++++++++++++++++++++++-
5 files changed, 475 insertions(+), 34 deletions(-)

--
2.0.0


2014-06-11 15:42:18

by Alexander Shishkin

Subject: [RFC v2 1/7] perf: add data_{offset,size} to user_page

Currently, the actual perf ring buffer is one page into the mmap area,
following the user page, and userspace relies on this convention. This
patch adds data_{offset,size} fields to the user_page that userspace can
use instead to locate the perf data within the mmap area. This is also
helpful when mapping existing or shared buffers whose size is not known
in advance.

Right now, it is made to follow the existing convention that

data_offset == PAGE_SIZE and
data_offset + data_size == mmap_size.
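
For illustration, a minimal userspace sketch (the helper name is made
up) of locating the data buffer via the new fields, falling back to the
old convention on kernels that leave them zero:

#include <stdint.h>
#include <unistd.h>
#include <linux/perf_event.h>

/* Hypothetical helper: find the perf data buffer inside an mmap()ed
 * area using data_{offset,size} when present, and the old fixed
 * layout otherwise. */
static void *locate_perf_data(struct perf_event_mmap_page *pg, void *base,
			      uint64_t mmap_size, uint64_t *data_size)
{
	uint64_t offset = pg->data_offset ? pg->data_offset :
			  (uint64_t)sysconf(_SC_PAGESIZE);

	*data_size = pg->data_size ? pg->data_size : mmap_size - offset;
	return (char *)base + offset;
}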

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/uapi/linux/perf_event.h | 5 +++++
kernel/events/core.c | 2 ++
2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 853bc1c..ef95af4 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -488,9 +488,14 @@ struct perf_event_mmap_page {
* In this case the kernel will not over-write unread data.
*
* See perf_output_put_handle() for the data ordering.
+ *
+ * data_{offset,size} indicate the location and size of the perf record
+ * buffer within the mmapped area.
*/
__u64 data_head; /* head in the data section */
__u64 data_tail; /* user-space written tail */
+ __u64 data_offset; /* where the buffer starts */
+ __u64 data_size; /* data buffer size */
};

#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2f9db66..6c1b030 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3768,6 +3768,8 @@ static void perf_event_init_userpage(struct perf_event *event)
/* Allow new userspace to detect that bit 0 is deprecated */
userpg->cap_bit0_is_deprecated = 1;
userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+ userpg->data_offset = PAGE_SIZE;
+ userpg->data_size = perf_data_size(rb);

unlock:
rcu_read_unlock();
--
2.0.0

2014-06-11 15:44:30

by Alexander Shishkin

Subject: [RFC v2 2/7] perf: add AUX area to ring buffer for raw data streams

From: Peter Zijlstra <[email protected]>

This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.

AUX space is a ring buffer, defined by the aux_{offset,size} fields in
the user_page structure, with read/write pointers aux_{head,tail} that
abide by the same rules as their data_* counterparts in the main perf
buffer.

In order to allocate/mmap AUX, userspace needs to set aux_offset to an
offset at or above data_offset + data_size and aux_size to the desired
buffer size; both need to be page aligned. The same aux_offset and
aux_size should then be passed to the mmap() call, and if everything
adds up, you should have an AUX buffer as a result.

Pages that are mapped into this buffer also come out of the user's mlock
rlimit plus the perf_event_mlock_kb allowance.
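
For illustration, the userspace side could look roughly like this
(sketch only: fd is an already-open perf event descriptor, error
handling is trimmed, and both page counts are assumed to be powers of
two):

#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/perf_event.h>

/* Sketch: map the normal buffer, then configure and map the AUX area. */
static void *map_aux_area(int fd, size_t data_pages, size_t aux_pages)
{
	size_t psz = sysconf(_SC_PAGESIZE);
	struct perf_event_mmap_page *pg;
	void *aux;

	/* user page + 2^n data pages, as before */
	pg = mmap(NULL, (data_pages + 1) * psz, PROT_READ | PROT_WRITE,
		  MAP_SHARED, fd, 0);
	if (pg == MAP_FAILED)
		return NULL;

	/* tell the kernel where the AUX area will live */
	pg->aux_offset = pg->data_offset + pg->data_size;
	pg->aux_size = aux_pages * psz;

	/* a second mmap() at aux_offset creates the AUX buffer */
	aux = mmap(NULL, pg->aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, pg->aux_offset);
	return aux == MAP_FAILED ? NULL : aux;
}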

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 17 +++++
include/uapi/linux/perf_event.h | 16 +++++
kernel/events/core.c | 142 ++++++++++++++++++++++++++++++++--------
kernel/events/internal.h | 21 ++++++
kernel/events/ring_buffer.c | 77 ++++++++++++++++++++--
5 files changed, 243 insertions(+), 30 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a6b6ab1..75b9e6d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -257,6 +257,18 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * Set up pmu-private data structures for an AUX area
+ */
+ void *(*setup_aux) (int cpu, void **pages,
+ int nr_pages, bool overwrite);
+ /* optional */
+
+ /*
+ * Free pmu-private AUX data structures
+ */
+ void (*free_aux) (void *aux); /* optional */
};

/**
@@ -767,6 +779,11 @@ static inline bool has_branch_stack(struct perf_event *event)
return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
}

+static inline bool has_aux(struct perf_event *event)
+{
+ return event->pmu->setup_aux;
+}
+
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ef95af4..dd10f16 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -496,6 +496,22 @@ struct perf_event_mmap_page {
__u64 data_tail; /* user-space written tail */
__u64 data_offset; /* where the buffer starts */
__u64 data_size; /* data buffer size */
+
+ /*
+ * AUX area is defined by aux_{offset,size} fields that should be set
+ * by the userspace, so that
+ *
+ * aux_offset >= data_offset + data_size
+ *
+ * prior to mmap()ing it. Size of the mmap()ed area should be aux_size.
+ *
+ * Ring buffer pointers aux_{head,tail} have the same semantics as
+ * data_{head,tail} and same ordering rules apply.
+ */
+ __u64 aux_head;
+ __u64 aux_tail;
+ __u64 aux_offset;
+ __u64 aux_size;
};

#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6c1b030..48ad31b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3950,6 +3950,8 @@ static void perf_mmap_open(struct vm_area_struct *vma)

atomic_inc(&event->mmap_count);
atomic_inc(&event->rb->mmap_count);
+ if (vma->vm_pgoff)
+ atomic_inc(&event->rb->aux_mmap_count);
}

/*
@@ -3969,6 +3971,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
int mmap_locked = rb->mmap_locked;
unsigned long size = perf_data_size(rb);

+ /*
+ * rb->aux_mmap_count will always drop before rb->mmap_count and
+ * event->mmap_count, so it is ok to use event->mmap_mutex to
+ * serialize with perf_mmap here.
+ */
+ if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
+ atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
+ atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+
+ rb_free_aux(rb, event);
+ mutex_unlock(&event->mmap_mutex);
+ }
+
atomic_dec(&rb->mmap_count);

if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
@@ -4047,7 +4063,7 @@ again:

static const struct vm_operations_struct perf_mmap_vmops = {
.open = perf_mmap_open,
- .close = perf_mmap_close,
+ .close = perf_mmap_close, /* non mergable */
.fault = perf_mmap_fault,
.page_mkwrite = perf_mmap_fault,
};
@@ -4058,10 +4074,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
unsigned long user_locked, user_lock_limit;
struct user_struct *user = current_user();
unsigned long locked, lock_limit;
- struct ring_buffer *rb;
+ struct ring_buffer *rb = NULL;
unsigned long vma_size;
unsigned long nr_pages;
- long user_extra, extra;
+ long user_extra = 0, extra = 0;
int ret = 0, flags = 0;

/*
@@ -4076,7 +4092,66 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
return -EINVAL;

vma_size = vma->vm_end - vma->vm_start;
- nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+ if (vma->vm_pgoff == 0) {
+ nr_pages = (vma_size / PAGE_SIZE) - 1;
+ } else {
+ /*
+ * AUX area mapping: if rb->aux_nr_pages != 0, it's already
+ * mapped, all subsequent mappings should have the same size
+ * and offset. Must be above the normal perf buffer.
+ */
+ u64 aux_offset, aux_size;
+
+ if (!event->rb)
+ return -EINVAL;
+
+ nr_pages = vma_size / PAGE_SIZE;
+
+ mutex_lock(&event->mmap_mutex);
+ ret = -EINVAL;
+
+ rb = event->rb;
+ if (!rb)
+ goto aux_unlock;
+
+ aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
+ aux_size = ACCESS_ONCE(rb->user_page->aux_size);
+
+ if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
+ goto aux_unlock;
+
+ if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
+ goto aux_unlock;
+
+ /* already mapped with a different offset */
+ if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
+ goto aux_unlock;
+
+ if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
+ goto aux_unlock;
+
+ /* already mapped with a different size */
+ if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
+ goto aux_unlock;
+
+ if (!is_power_of_2(nr_pages))
+ goto aux_unlock;
+
+ if (!atomic_inc_not_zero(&rb->mmap_count))
+ goto aux_unlock;
+
+ if (rb_has_aux(rb)) {
+ atomic_inc(&rb->aux_mmap_count);
+ ret = 0;
+ goto unlock;
+ }
+
+ atomic_set(&rb->aux_mmap_count, 1);
+ user_extra = nr_pages;
+
+ goto accounting;
+ }

/*
* If we have rb pages ensure they're a power-of-two number, so we
@@ -4088,9 +4163,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (vma_size != PAGE_SIZE * (1 + nr_pages))
return -EINVAL;

- if (vma->vm_pgoff != 0)
- return -EINVAL;
-
WARN_ON_ONCE(event->ctx->parent_ctx);
again:
mutex_lock(&event->mmap_mutex);
@@ -4114,6 +4186,8 @@ again:
}

user_extra = nr_pages + 1;
+
+accounting:
user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);

/*
@@ -4123,7 +4197,6 @@ again:

user_locked = atomic_long_read(&user->locked_vm) + user_extra;

- extra = 0;
if (user_locked > user_lock_limit)
extra = user_locked - user_lock_limit;

@@ -4137,36 +4210,46 @@ again:
goto unlock;
}

- WARN_ON(event->rb);
+ WARN_ON(!rb && event->rb);

if (vma->vm_flags & VM_WRITE)
flags |= RING_BUFFER_WRITABLE;

- rb = rb_alloc(nr_pages,
- event->attr.watermark ? event->attr.wakeup_watermark : 0,
- event->cpu, flags);
-
if (!rb) {
- ret = -ENOMEM;
- goto unlock;
- }
+ rb = rb_alloc(nr_pages,
+ event->attr.watermark ? event->attr.wakeup_watermark : 0,
+ event->cpu, flags);

- atomic_set(&rb->mmap_count, 1);
- rb->mmap_locked = extra;
- rb->mmap_user = get_current_user();
+ if (!rb) {
+ ret = -ENOMEM;
+ goto unlock;
+ }

- atomic_long_add(user_extra, &user->locked_vm);
- vma->vm_mm->pinned_vm += extra;
+ atomic_set(&rb->mmap_count, 1);
+ rb->mmap_user = get_current_user();
+ rb->mmap_locked = extra;

- ring_buffer_attach(event, rb);
- rcu_assign_pointer(event->rb, rb);
+ ring_buffer_attach(event, rb);
+ rcu_assign_pointer(event->rb, rb);

- perf_event_init_userpage(event);
- perf_event_update_userpage(event);
+ perf_event_init_userpage(event);
+ perf_event_update_userpage(event);
+ } else {
+ ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+ if (ret)
+ atomic_dec(&rb->mmap_count);
+ else
+ rb->aux_mmap_locked = extra;
+ }

unlock:
- if (!ret)
+ if (!ret) {
+ atomic_long_add(user_extra, &user->locked_vm);
+ vma->vm_mm->pinned_vm += extra;
+
atomic_inc(&event->mmap_count);
+ }
+aux_unlock:
mutex_unlock(&event->mmap_mutex);

/*
@@ -6982,6 +7065,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
if (output_event->cpu == -1 && output_event->ctx != event->ctx)
goto out;

+ /*
+ * If both events generate aux data, they must be on the same PMU
+ */
+ if (has_aux(event) && has_aux(output_event) &&
+ event->pmu != output_event->pmu)
+ goto out;
+
set:
mutex_lock(&event->mmap_mutex);
/* Can't redirect output if we've got an active mmap() */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..e537403 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -35,6 +35,14 @@ struct ring_buffer {
unsigned long mmap_locked;
struct user_struct *mmap_user;

+ /* AUX area */
+ unsigned long aux_pgoff;
+ int aux_nr_pages;
+ atomic_t aux_mmap_count;
+ unsigned long aux_mmap_locked;
+ void **aux_pages;
+ void *aux_priv;
+
struct perf_event_mmap_page *user_page;
void *data_pages[0];
};
@@ -43,6 +51,14 @@ extern void rb_free(struct ring_buffer *rb);
extern struct ring_buffer *
rb_alloc(int nr_pages, long watermark, int cpu, int flags);
extern void perf_event_wakeup(struct perf_event *event);
+extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+ pgoff_t pgoff, int nr_pages, int flags);
+extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+
+static inline bool rb_has_aux(struct ring_buffer *rb)
+{
+ return !!rb->aux_nr_pages;
+}

extern void
perf_event_header__init_id(struct perf_event_header *header,
@@ -81,6 +97,11 @@ static inline unsigned long perf_data_size(struct ring_buffer *rb)
return rb->nr_pages << (PAGE_SHIFT + page_order(rb));
}

+static inline unsigned long perf_aux_size(struct ring_buffer *rb)
+{
+ return rb->aux_nr_pages << PAGE_SHIFT;
+}
+
#define DEFINE_OUTPUT_COPY(func_name, memcpy_func) \
static inline unsigned long \
func_name(struct perf_output_handle *handle, \
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a579..ed259a4 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,14 +242,67 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+ pgoff_t pgoff, int nr_pages, int flags)
+{
+ bool overwrite = !!(flags & RING_BUFFER_WRITABLE);
+ int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
+ int ret = -ENOMEM;
+
+ if (!has_aux(event))
+ return -ENOTSUPP;
+
+ rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
+ if (!rb->aux_pages)
+ return -ENOMEM;
+
+ for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
+ rb->aux_nr_pages++) {
+ struct page *page;
+
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ if (!page)
+ goto out;
+
+ rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+ }
+
+ rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
+ overwrite);
+ if (rb->aux_priv)
+ ret = 0;
+
+out:
+ if (!ret)
+ rb->aux_pgoff = pgoff;
+ else
+ rb_free_aux(rb, event);
+
+ return ret;
+}
+
+void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
+{
+ int pg;
+
+ if (rb->aux_priv)
+ event->pmu->free_aux(rb->aux_priv);
+
+ for (pg = 0; pg < rb->aux_nr_pages; pg++)
+ free_page((unsigned long)rb->aux_pages[pg]);
+
+ kfree(rb->aux_pages);
+ rb->aux_nr_pages = 0;
+}
+
#ifndef CONFIG_PERF_USE_VMALLOC

/*
* Back perf_mmap() with regular GFP_KERNEL-0 pages.
*/

-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
{
if (pgoff > rb->nr_pages)
return NULL;
@@ -339,8 +392,8 @@ static int data_page_nr(struct ring_buffer *rb)
return rb->nr_pages << page_order(rb);
}

-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
{
/* The '>' counts in the user page. */
if (pgoff > data_page_nr(rb))
@@ -415,3 +468,19 @@ fail:
}

#endif
+
+struct page *
+perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+ if (rb->aux_nr_pages) {
+ /* above AUX space */
+ if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
+ return NULL;
+
+ /* AUX space */
+ if (pgoff >= rb->aux_pgoff)
+ return virt_to_page(rb->aux_pages[pgoff - rb->aux_pgoff]);
+ }
+
+ return __perf_mmap_to_page(rb, pgoff);
+}
--
2.0.0

2014-06-11 15:46:57

by Alexander Shishkin

Subject: [RFC v2 4/7] perf: add a capability for AUX_NO_SG pmus to do software double buffering

For pmus that don't support scatter-gather for AUX data in hardware, it
might still make sense to implement software double buffering to avoid
losing data while the user is reading it out. For this purpose, add a
pmu capability that guarantees at least two high-order chunks for the AUX
buffer, so that the pmu driver can do switchover tricks: for example, a
64-page AUX area will then be backed by two contiguous 32-page chunks
(or smaller ones if memory is fragmented) rather than a single 64-page
chunk.

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/ring_buffer.c | 14 +++++++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b316971..3d68411 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -276,6 +276,7 @@ struct pmu {
*/
#define PERF_PMU_CAP_NO_INTERRUPT 1
#define PERF_PMU_CAP_AUX_NO_SG 2
+#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 4

/**
* enum perf_event_active_state - the states of a event
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index e66ed76..43571ac 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -286,9 +286,21 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
if (!has_aux(event))
return -ENOTSUPP;

- if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
order = get_order(nr_pages * PAGE_SIZE);

+ /*
+ * PMU requests more than one contiguous chunks of memory
+ * for SW double buffering
+ */
+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) {
+ if (!order)
+ return -EINVAL;
+
+ order--;
+ }
+ }
+
rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
if (!rb->aux_pages)
return -ENOMEM;
--
2.0.0

2014-06-11 15:47:06

by Alexander Shishkin

Subject: [RFC v2 3/7] perf: support high-order allocations for AUX space

Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
don't support scatter-gather and will prefer larger contiguous areas for
their output regions.

This patch adds a new pmu capability to request higher order allocations.
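
On the driver side, chunk boundaries can be recovered from the
page_private() annotation that the allocator leaves on the first page of
every high-order chunk. A sketch (function name hypothetical):

#include <linux/mm.h>

/* Sketch: walk the AUX page array chunk by chunk; the allocator marks
 * the first page of each multi-page chunk with PagePrivate() and
 * stores the allocation order in page_private(). */
static void mypmu_walk_aux_chunks(void **pages, int nr_pages)
{
	int pg = 0;

	while (pg < nr_pages) {
		struct page *page = virt_to_page(pages[pg]);
		int order = PagePrivate(page) ? page_private(page) : 0;

		/* pages[pg] starts a physically contiguous chunk of
		 * (1 << order) pages */
		pg += 1 << order;
	}
}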

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/ring_buffer.c | 51 +++++++++++++++++++++++++++++++++++++++------
2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 75b9e6d..b316971 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -275,6 +275,7 @@ struct pmu {
* struct pmu::capabilities flags
*/
#define PERF_PMU_CAP_NO_INTERRUPT 1
+#define PERF_PMU_CAP_AUX_NO_SG 2

/**
* enum perf_event_active_state - the states of a event
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index ed259a4..e66ed76 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,29 +242,68 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+#define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+static struct page *rb_alloc_aux_page(int node, int order)
+{
+ struct page *page;
+
+ if (order > MAX_ORDER)
+ order = MAX_ORDER;
+
+ do {
+ page = alloc_pages_node(node, PERF_AUX_GFP, order);
+ } while (!page && order--);
+
+ if (page && order) {
+ /*
+ * Communicate the allocation size to the driver
+ */
+ split_page(page, order);
+ SetPagePrivate(page);
+ set_page_private(page, order);
+ }
+
+ return page;
+}
+
+static void rb_free_aux_page(struct ring_buffer *rb, int idx)
+{
+ struct page *page = virt_to_page(rb->aux_pages[idx]);
+
+ ClearPagePrivate(page);
+ page->mapping = NULL;
+ __free_page(page);
+}
+
int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
pgoff_t pgoff, int nr_pages, int flags)
{
bool overwrite = !!(flags & RING_BUFFER_WRITABLE);
int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
- int ret = -ENOMEM;
+ int ret = -ENOMEM, order = 0;

if (!has_aux(event))
return -ENOTSUPP;

+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+ order = get_order(nr_pages * PAGE_SIZE);
+
rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
if (!rb->aux_pages)
return -ENOMEM;

- for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
- rb->aux_nr_pages++) {
+ for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
struct page *page;
+ int last;

- page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ page = rb_alloc_aux_page(node, order);
if (!page)
goto out;

- rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+ for (last = rb->aux_nr_pages + (1 << page_private(page));
+ last > rb->aux_nr_pages; rb->aux_nr_pages++)
+ rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
}

rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
@@ -289,7 +328,7 @@ void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
event->pmu->free_aux(rb->aux_priv);

for (pg = 0; pg < rb->aux_nr_pages; pg++)
- free_page((unsigned long)rb->aux_pages[pg]);
+ rb_free_aux_page(rb, pg);

kfree(rb->aux_pages);
rb->aux_nr_pages = 0;
--
2.0.0

2014-06-11 15:49:19

by Alexander Shishkin

Subject: [RFC v2 5/7] perf: add a pmu capability for "exclusive" events

Usually, pmus that do, for example, instruction tracing, would only ever
be able to have one event per task per cpu (or per perf_event_context).
For such pmus it makes sense to disallow creating conflicting events
early on, so as to provide consistent behavior for the user.

This patch adds a pmu capability that indicates such a constraint on
event creation.
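
A pmu would advertise the constraint at registration time; a sketch
(pmu and callback names hypothetical):

static struct pmu mypmu = {
	.capabilities	= PERF_PMU_CAP_EXCLUSIVE,
	/* ...the usual event_init/add/del/start/stop callbacks... */
};

/* registered with perf_pmu_register(&mypmu, "mypmu", -1); */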

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3d68411..b6f7408 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -277,6 +277,7 @@ struct pmu {
#define PERF_PMU_CAP_NO_INTERRUPT 1
#define PERF_PMU_CAP_AUX_NO_SG 2
#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF 4
+#define PERF_PMU_CAP_EXCLUSIVE 8

/**
* enum perf_event_active_state - the states of a event
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 48ad31b..9783c60 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7113,6 +7113,32 @@ out:
return ret;
}

+static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+ if ((e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) &&
+ (e1->cpu == e2->cpu ||
+ e1->cpu == -1 ||
+ e2->cpu == -1))
+ return true;
+ return false;
+}
+
+static bool exclusive_event_ok(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ struct perf_event *iter_event;
+
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
+ return true;
+
+ list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
+ if (exclusive_event_match(iter_event, event))
+ return false;
+ }
+
+ return true;
+}
+
/**
* sys_perf_event_open - open a performance event, associate it to a task/cpu
*
@@ -7261,6 +7287,11 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_alloc;
}

+ if (!exclusive_event_ok(event, ctx)) {
+ err = -EBUSY;
+ goto err_context;
+ }
+
if (task) {
put_task_struct(task);
task = NULL;
@@ -7427,6 +7458,13 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
goto err_free;
}

+ if (!exclusive_event_ok(event, ctx)) {
+ perf_unpin_context(ctx);
+ put_ctx(ctx);
+ err = -EBUSY;
+ goto err_free;
+ }
+
WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
perf_install_in_context(ctx, event, cpu);
--
2.0.0

2014-06-11 15:49:21

by Alexander Shishkin

Subject: [RFC v2 6/7] perf: add api for pmus to write to AUX space

For pmus that wish to write data to AUX space, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These also use the same output
handle structure.

After perf_aux_output_begin() returns successfully, handle->size is set
to the maximum amount of data that can be written with respect to the
aux_tail pointer, so that no data that the user hasn't seen yet will be
overwritten.

The PMU driver should pass the actual amount of data written as a
parameter to perf_aux_output_end().
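
To illustrate the intended calling convention, here is a sketch of a
driver's start/stop path (the mypmu_* names and hardware accessors are
hypothetical, and mypmu_stop() assumes mypmu_start() got the handle):

#include <linux/percpu.h>
#include <linux/perf_event.h>

static DEFINE_PER_CPU(struct perf_output_handle, mypmu_handle);

static void mypmu_start(struct perf_event *event, int flags)
{
	struct perf_output_handle *handle = this_cpu_ptr(&mypmu_handle);
	void *buf;

	buf = perf_aux_output_begin(handle, event);
	if (!buf)
		return; /* no space: the event got a pending disable */

	/* buf is the pmu-private descriptor from ->setup_aux(); program
	 * the hardware to write at handle->head, at most handle->size
	 * bytes, so as not to overwrite unread data */
	mypmu_hw_start(buf, handle->head, handle->size); /* hypothetical */
}

static void mypmu_stop(struct perf_event *event, int flags)
{
	struct perf_output_handle *handle = this_cpu_ptr(&mypmu_handle);
	unsigned long written = mypmu_hw_stop(); /* hypothetical */

	/* commit what the hardware actually wrote, wake up readers */
	perf_aux_output_end(handle, written, false);
}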

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 19 +++++++++++++++-
kernel/events/core.c | 5 ++---
kernel/events/internal.h | 3 +++
kernel/events/ring_buffer.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b6f7408..e85d134 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -548,12 +548,20 @@ struct perf_output_handle {
struct ring_buffer *rb;
unsigned long wakeup;
unsigned long size;
- void *addr;
+ union {
+ void *addr;
+ unsigned long head;
+ };
int page;
};

#ifdef CONFIG_PERF_EVENTS

+extern void *perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event);
+extern void perf_aux_output_end(struct perf_output_handle *handle,
+ unsigned long size, bool truncated);
+extern void *perf_get_aux(struct perf_output_handle *handle);
extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
extern void perf_pmu_unregister(struct pmu *pmu);

@@ -802,6 +810,15 @@ extern void perf_event_disable(struct perf_event *event);
extern int __perf_event_disable(void *info);
extern void perf_event_task_tick(void);
#else
+static inline void *
+perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event) { return NULL; }
+static inline void
+perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+ bool truncated) { }
+static inline void *
+perf_get_aux(struct perf_output_handle *handle) { return NULL; }
+
static inline void
perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task) { }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9783c60..44d6d01 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3194,7 +3194,6 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}

-static void ring_buffer_put(struct ring_buffer *rb);
static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

static void unaccount_event_cpu(struct perf_event *event, int cpu)
@@ -3919,7 +3918,7 @@ static void rb_free_rcu(struct rcu_head *rcu_head)
rb_free(rb);
}

-static struct ring_buffer *ring_buffer_get(struct perf_event *event)
+struct ring_buffer *ring_buffer_get(struct perf_event *event)
{
struct ring_buffer *rb;

@@ -3934,7 +3933,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}

-static void ring_buffer_put(struct ring_buffer *rb)
+void ring_buffer_put(struct ring_buffer *rb)
{
if (!atomic_dec_and_test(&rb->refcount))
return;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index e537403..e20d073 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,7 @@ struct ring_buffer {
struct user_struct *mmap_user;

/* AUX area */
+ local_t aux_head;
unsigned long aux_pgoff;
int aux_nr_pages;
atomic_t aux_mmap_count;
@@ -54,6 +55,8 @@ extern void perf_event_wakeup(struct perf_event *event);
extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
pgoff_t pgoff, int nr_pages, int flags);
extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
+extern void ring_buffer_put(struct ring_buffer *rb);

static inline bool rb_has_aux(struct ring_buffer *rb)
{
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 43571ac..8f76e5c 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,6 +242,59 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
spin_lock_init(&rb->event_lock);
}

+void *perf_aux_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event)
+{
+ unsigned long aux_head, aux_tail;
+ struct ring_buffer *rb;
+
+ rb = ring_buffer_get(event);
+ if (!rb)
+ return NULL;
+
+ aux_head = local_read(&rb->aux_head);
+ aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+
+ handle->rb = rb;
+ handle->event = event;
+ handle->head = aux_head;
+ handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+ if (!handle->size) {
+ event->pending_disable = 1;
+ event->hw.state = PERF_HES_STOPPED;
+ perf_output_wakeup(handle);
+ ring_buffer_put(rb);
+ handle->event = NULL;
+
+ return NULL;
+ }
+
+ return handle->rb->aux_priv;
+}
+
+void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+ bool truncated)
+{
+ struct ring_buffer *rb = handle->rb;
+ unsigned long aux_head;
+
+ aux_head = local_read(&rb->aux_head);
+ local_add(size, &rb->aux_head);
+
+ rb->user_page->aux_head = local_read(&rb->aux_head);
+ smp_wmb();
+
+ perf_output_wakeup(handle);
+ handle->event = NULL;
+
+ ring_buffer_put(rb);
+}
+
+void *perf_get_aux(struct perf_output_handle *handle)
+{
+ return handle->rb->aux_priv;
+}
+
#define PERF_AUX_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)

static struct page *rb_alloc_aux_page(int node, int order)
--
2.0.0

2014-06-11 15:51:13

by Alexander Shishkin

Subject: [RFC v2 7/7] perf: add AUX record

When there's new data in the AUX space, output a record indicating its
offset and size and whether it was truncated to fit in the ring buffer.
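
On the userspace side, a consumer might mirror the record like this
(sketch; note the compiler pads after the __u8 and the kernel copies
that padding out as part of sizeof(rec)):

#include <linux/types.h>
#include <linux/perf_event.h>

/* Sketch of the userspace view of PERF_RECORD_AUX, following the
 * layout documented in the uapi header comment. */
struct perf_aux_record {
	struct perf_event_header header;
	__u64 aux_offset;	/* offset of the new data in AUX space */
	__u64 aux_size;		/* amount of new data */
	__u8  truncated;	/* data was truncated to fit */
	/* implicit padding to the next __u64 boundary */
	__u64 id;
	__u64 stream_id;
	/* followed by the usual sample_id fields */
};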

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/uapi/linux/perf_event.h | 15 +++++++++++++++
kernel/events/core.c | 38 ++++++++++++++++++++++++++++++++++++++
kernel/events/internal.h | 3 +++
kernel/events/ring_buffer.c | 1 +
4 files changed, 57 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index dd10f16..7f8621f 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -726,6 +726,21 @@ enum perf_event_type {
*/
PERF_RECORD_MMAP2 = 10,

+ /*
+ * Records that new data landed in the AUX buffer part.
+ *
+ * struct {
+ * struct perf_event_header header;
+ *
+ * u64 aux_offset;
+ * u64 aux_size;
+ * u8 truncated;
+ * u64 id;
+ * u64 stream_id;
+ * };
+ */
+ PERF_RECORD_AUX = 11,
+
PERF_RECORD_MAX, /* non-ABI */
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 44d6d01..cd64b75 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5394,6 +5394,44 @@ void perf_event_mmap(struct vm_area_struct *vma)
perf_event_mmap_event(&mmap_event);
}

+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+ unsigned long size, bool truncated)
+{
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ struct perf_aux_event {
+ struct perf_event_header header;
+ u64 offset;
+ u64 size;
+ u8 truncated;
+ u64 id;
+ u64 stream_id;
+ } rec = {
+ .header = {
+ .type = PERF_RECORD_AUX,
+ .misc = 0,
+ .size = sizeof(rec),
+ },
+ .offset = head,
+ .size = size,
+ .truncated = truncated,
+ .id = primary_event_id(event),
+ .stream_id = event->id,
+ };
+ int ret;
+
+ perf_event_header__init_id(&rec.header, &sample, event);
+ ret = perf_output_begin(&handle, event, rec.header.size);
+
+ if (ret)
+ return;
+
+ perf_output_put(&handle, rec);
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
/*
* IRQ throttle logging
*/
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index e20d073..8bc4547 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -63,6 +63,9 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
return !!rb->aux_nr_pages;
}

+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+ unsigned long size, bool truncated);
+
extern void
perf_event_header__init_id(struct perf_event_header *header,
struct perf_sample_data *data,
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 8f76e5c..4e6f815 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -280,6 +280,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,

aux_head = local_read(&rb->aux_head);
local_add(size, &rb->aux_head);
+ perf_event_aux_event(handle->event, aux_head, size, truncated);

rb->user_page->aux_head = local_read(&rb->aux_head);
smp_wmb();
--
2.0.0

2014-06-13 13:52:01

by Robert Richter

Subject: Re: [RFC v2 1/7] perf: add data_{offset,size} to user_page

On 11.06.14 18:41:44, Alexander Shishkin wrote:

> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 853bc1c..ef95af4 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -488,9 +488,14 @@ struct perf_event_mmap_page {
> * In this case the kernel will not over-write unread data.
> *
> * See perf_output_put_handle() for the data ordering.
> + *
> + * data_{offset,size} indicate the location and size of the perf record
> + * buffer within the mmapped area.
> */
> __u64 data_head; /* head in the data section */
> __u64 data_tail; /* user-space written tail */
> + __u64 data_offset; /* where the buffer starts */
> + __u64 data_size; /* data buffer size */
> };
>

Yes, right. This change is useful to determine the buffer size of
existing buffers in order to share them (e.g. persistent events).

If the buffer size is not known by userspace, the mechanism would be to
mmap the header page only and later remap the buffer with the correct
buffer size determined from the header page.
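
Roughly like this (sketch; fd is an already-open event descriptor,
error handling omitted):

#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/perf_event.h>

/* Sketch of the header-page remap approach for pre-existing buffers. */
static void *remap_existing_buffer(int fd, uint64_t *size)
{
	size_t psz = sysconf(_SC_PAGESIZE);
	struct perf_event_mmap_page *pg;

	/* map only the header page to learn the buffer geometry */
	pg = mmap(NULL, psz, PROT_READ, MAP_SHARED, fd, 0);
	*size = pg->data_offset + pg->data_size;
	munmap(pg, psz);

	/* remap with the correct size read from the header page */
	return mmap(NULL, *size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}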

If there are no objections on this patch it would be good if this
patch could be applied independently of the status of the follow on
patches.

Thanks,

-Robert

2014-06-24 17:18:44

by Peter Zijlstra

Subject: Re: [RFC v2 5/7] perf: add a pmu capability for "exclusive" events

On Wed, Jun 11, 2014 at 06:41:48PM +0300, Alexander Shishkin wrote:
> +static bool exclusive_event_ok(struct perf_event *event,
> + struct perf_event_context *ctx)
> +{
> + struct perf_event *iter_event;
> +
> + if (!(event->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
> + return true;
> +
> + list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
> + if (exclusive_event_match(iter_event, event))
> + return false;
> + }

This list iteration needs either rcu or ctx->lock or ctx->mutex, and the
two callsites below don't have either afaict.

> +
> + return true;
> +}
> +
> /**
> * sys_perf_event_open - open a performance event, associate it to a task/cpu
> *
> @@ -7261,6 +7287,11 @@ SYSCALL_DEFINE5(perf_event_open,
> goto err_alloc;
> }
>
> + if (!exclusive_event_ok(event, ctx)) {
> + err = -EBUSY;
> + goto err_context;
> + }
> +
> if (task) {
> put_task_struct(task);
> task = NULL;
> @@ -7427,6 +7458,13 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
> goto err_free;
> }
>
> + if (!exclusive_event_ok(event, ctx)) {
> + perf_unpin_context(ctx);
> + put_ctx(ctx);
> + err = -EBUSY;
> + goto err_free;
> + }
> +
> WARN_ON_ONCE(ctx->parent_ctx);
> mutex_lock(&ctx->mutex);
> perf_install_in_context(ctx, event, cpu);
> --
> 2.0.0
>



2014-06-24 17:20:03

by Peter Zijlstra

Subject: Re: [RFC v2 2/7] perf: add AUX area to ring buffer for raw data streams

On Wed, Jun 11, 2014 at 06:41:45PM +0300, Alexander Shishkin wrote:
> + /*
> + * Set up pmu-private data structures for an AUX area
> + */
> + void *(*setup_aux) (int cpu, void **pages,
> + int nr_pages, bool overwrite);
> + /* optional */
> +
> + /*
> + * Free pmu-private AUX data structures
> + */
> + void (*free_aux) (void *aux); /* optional */

I was hoping you could replace those with a PERF_CAP_AUX or something
and then have one generic allocation routine like you provide in the
subsequent patches.



2014-06-24 17:27:34

by Peter Zijlstra

Subject: Re: [RFC v2 6/7] perf: add api for pmus to write to AUX space

On Wed, Jun 11, 2014 at 06:41:49PM +0300, Alexander Shishkin wrote:
> +void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
> + bool truncated)
> +{
> + struct ring_buffer *rb = handle->rb;
> + unsigned long aux_head;
> +
> + aux_head = local_read(&rb->aux_head);
> + local_add(size, &rb->aux_head);
> +
> + rb->user_page->aux_head = local_read(&rb->aux_head);
> + smp_wmb();
> +
> + perf_output_wakeup(handle);
> + handle->event = NULL;
> +
> + ring_buffer_put(rb);
> +}

This thing is distinctly less complex than perf_output_put_handle().. ?



2014-06-25 11:12:11

by Alexander Shishkin

Subject: Re: [RFC v2 2/7] perf: add AUX area to ring buffer for raw data streams

Peter Zijlstra <[email protected]> writes:

> On Wed, Jun 11, 2014 at 06:41:45PM +0300, Alexander Shishkin wrote:
>> + /*
>> + * Set up pmu-private data structures for an AUX area
>> + */
>> + void *(*setup_aux) (int cpu, void **pages,
>> + int nr_pages, bool overwrite);
>> + /* optional */
>> +
>> + /*
>> + * Free pmu-private AUX data structures
>> + */
>> + void (*free_aux) (void *aux); /* optional */
>
> I was hoping you could replace those with a PERF_CAP_AUX or something
> and then have one generic allocation routine like you provide in the
> subsequent patches.

I need these to allocate pmu-specific SG tables now, which I don't see
how to generalize nicely. User-visible aux_pages are allocated in the
generic rb_alloc_aux().

As for sg tables, the alternative would be to allocate them in the
event::pmu::add() path, which is probably not very good for
performance, unless I'm missing something? Like, I can allocate stuff in
the first add() and then free it in event::destroy().

Or, we can assume a generic sg table format and allocate them in the
generic code, but then we'll need more capabilities to indicate at least
the size of a table entry. PT needs 64 bits per entry, ARM's TMC needs
32 and the non-SG PT will still need some tables. But I'd rather leave
this part to the pmu drivers. What do you think?

Regards,
--
Alex

2014-06-25 11:12:29

by Alexander Shishkin

Subject: Re: [RFC v2 5/7] perf: add a pmu capability for "exclusive" events

Peter Zijlstra <[email protected]> writes:

> On Wed, Jun 11, 2014 at 06:41:48PM +0300, Alexander Shishkin wrote:
>> +static bool exclusive_event_ok(struct perf_event *event,
>> + struct perf_event_context *ctx)
>> +{
>> + struct perf_event *iter_event;
>> +
>> + if (!(event->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
>> + return true;
>> +
>> + list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
>> + if (exclusive_event_match(iter_event, event))
>> + return false;
>> + }
>
> This list iteration needs either rcu or ctx->lock or ctx->mutex, and the
> two callsites below don't have either afaict.

True, this needs to be done before perf_install_in_context() under the
same ctx->mutex. Thanks!

Regards,
--
Alex

2014-06-25 11:24:58

by Alexander Shishkin

Subject: Re: [RFC v2 6/7] perf: add api for pmus to write to AUX space

Peter Zijlstra <[email protected]> writes:

> On Wed, Jun 11, 2014 at 06:41:49PM +0300, Alexander Shishkin wrote:
>> +void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
>> + bool truncated)
>> +{
>> + struct ring_buffer *rb = handle->rb;
>> + unsigned long aux_head;
>> +
>> + aux_head = local_read(&rb->aux_head);
>> + local_add(size, &rb->aux_head);
>> +
>> + rb->user_page->aux_head = local_read(&rb->aux_head);
>> + smp_wmb();
>> +
>> + perf_output_wakeup(handle);
>> + handle->event = NULL;
>> +
>> + ring_buffer_put(rb);
>> +}
>
> This thing is distinctly less complex than perf_output_put_handle().. ?

This one doesn't support nested writers, but there shouldn't be any
either. But I notice now that the barrier is misplaced. Wakeup watermark
is not yet taken care of, but will be. Anything else I'm missing?

Regards,
--
Alex

2014-06-25 12:12:10

by Peter Zijlstra

Subject: Re: [RFC v2 2/7] perf: add AUX area to ring buffer for raw data streams

On Wed, Jun 25, 2014 at 02:09:31PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Wed, Jun 11, 2014 at 06:41:45PM +0300, Alexander Shishkin wrote:
> >> + /*
> >> + * Set up pmu-private data structures for an AUX area
> >> + */
> >> + void *(*setup_aux) (int cpu, void **pages,
> >> + int nr_pages, bool overwrite);
> >> + /* optional */
> >> +
> >> + /*
> >> + * Free pmu-private AUX data structures
> >> + */
> >> + void (*free_aux) (void *aux); /* optional */
> >
> > I was hoping you could replace those with a PERF_CAP_AUX or something
> > and then have one generic allocation routine like you provide in the
> > subsequent patches.
>
> I need these to allocate pmu-specific SG tables now, which I don't see
> how to generalize nicely. User-visible aux_pages are allocated in the
> generic rb_alloc_aux().
>
> As for sg tables, the alternative would be to allocate them in the
> event::pmu::add() path, which is probably not very good for
> performance, unless I'm missing something? Like, I can allocate stuff in
> the first add() and then free it in event::destroy().
>
> Or, we can assume a generic sg table format and allocate them in the
> generic code, but then we'll need more capabilities to indicate at least
> the size of a table entry. PT needs 64 bits per entry, ARM's TMC needs
> 32 and the non-SG PT will still need some tables. But I'd rather leave
> this part to the pmu drivers. What do you think?

Ah, I wasn't aware.. too bad. Maybe add this to the changelog.

2014-06-25 12:12:59

by Peter Zijlstra

Subject: Re: [RFC v2 6/7] perf: add api for pmus to write to AUX space

On Wed, Jun 25, 2014 at 02:24:51PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Wed, Jun 11, 2014 at 06:41:49PM +0300, Alexander Shishkin wrote:
> >> +void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
> >> + bool truncated)
> >> +{
> >> + struct ring_buffer *rb = handle->rb;
> >> + unsigned long aux_head;
> >> +
> >> + aux_head = local_read(&rb->aux_head);
> >> + local_add(size, &rb->aux_head);
> >> +
> >> + rb->user_page->aux_head = local_read(&rb->aux_head);
> >> + smp_wmb();
> >> +
> >> + perf_output_wakeup(handle);
> >> + handle->event = NULL;
> >> +
> >> + ring_buffer_put(rb);
> >> +}
> >
> > This thing is distinctly less complex than perf_output_put_handle().. ?
>
> This one doesn't support nested writers, but there shouldn't be any
> either. But I notice now that the barrier is misplaced. Wakeup watermark
> is not yet taken care of, but will be. Anything else I'm missing?

Maybe put a comment in about explicitly not supporting nesting etc.
Maybe also a debug check to make sure we catch anybody doing it anyway.