2010-12-01 01:00:24

by Corey Ashford

Subject: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events


Hi,

I'm not sure whether what I'm seeing is a bug or something intentional.

If I place multiple tracepoint events into a group and measure counts of
these events on a process, I get no counts for the tracepoint events
other than the group leader.

Is this expected behavior?

It's not clear to me why this should be the case; grouping shouldn't
have any ill effects on tracepoint events, from my understanding.
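
For reference, by "grouping" I mean the usual perf_event_open() grouping:
the siblings are opened with group_fd pointing at the leader's fd. Roughly
(hypothetical tracepoint ids, error handling omitted):

    struct perf_event_attr attr = {
            .type     = PERF_TYPE_TRACEPOINT,
            .size     = sizeof(attr),
            .config   = tp_id1,   /* e.g. from debugfs events/<ev>/id */
            .disabled = 1,        /* only the leader starts disabled */
    };
    int leader = syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);

    attr.config   = tp_id2;
    attr.disabled = 0;
    int sibling = syscall(__NR_perf_event_open, &attr, pid, -1, leader, 0);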

I noticed this because my private version of the perf tool has the event
group patch https://lkml.org/lkml/2010/11/24/584 as well as the patch
which fixes the parsing of multiple tracepoint events in the same -e
switch: https://lkml.org/lkml/2010/11/30/460

When I dig into the code a bit, I find that each event opens
successfully, so that's not the problem. If I disable the grouping,
then I get counts for all of the tracepoint events.

- Corey


2010-12-01 11:46:39

by Peter Zijlstra

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Tue, 2010-11-30 at 17:00 -0800, Corey Ashford wrote:
> Hi,
>
> I'm not sure whether what I'm seeing is a bug or something intentional.
>
> If I place multiple tracepoint events into a group and measure counts of
> these events on a process, I get no counts for the tracepoint events
> other than the group leader.
>
> Is this expected behavior?
>
> It's not clear to me why this should be the case; grouping shouldn't
> have any ill effects on tracepoint events, from my understanding.
>
> I noticed this because my private version of the perf tool has the event
> group patch https://lkml.org/lkml/2010/11/24/584 as well as the patch
> which fixes the parsing of multiple tracepoint events in the same -e
> switch: https://lkml.org/lkml/2010/11/30/460
>
> When I dig into the code a bit, I find that each event opens
> successfully, so that's not the problem. If I disable the grouping,
> then I get counts for all of the tracepoint events.

Hrm,.. definitely not expected. I'll try and look into it, but I'm a bit
over-committed atm.

Also, I've started a rewrite of the whole tracepoint <-> perf
interaction:

http://lkml.org/lkml/2010/11/23/147

Could you see if that cures your problem?

Another thing to test: does the same hold true for regular software
events? Tracepoints and software events share a lot of infrastructure.

2010-12-01 12:04:33

by Peter Zijlstra

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Wed, 2010-12-01 at 12:46 +0100, Peter Zijlstra wrote:
> On Tue, 2010-11-30 at 17:00 -0800, Corey Ashford wrote:
> > Hi,
> >
> > I'm not sure whether what I'm seeing is a bug or something intentional.
> >
> > If I place multiple tracepoint events into a group and measure counts of
> > these events on a process, I get no counts for the tracepoint events
> > other than the group leader.
> >
> > Is this expected behavior?
> >
> > It's not clear to me why this should be the case; grouping shouldn't
> > have any ill effects on tracepoint events, from my understanding.
> >
> > I noticed this because my private version of the perf tool has the event
> > group patch https://lkml.org/lkml/2010/11/24/584 as well as the patch
> > which fixes the parsing of multiple tracepoint events in the same -e
> > switch: https://lkml.org/lkml/2010/11/30/460
> >
> > When I dig into the code a bit, I find that each event opens
> > successfully, so that's not the problem. If I disable the grouping,
> > then I get counts for all of the tracepoint events.
>
> Hrm,.. definitely not expected. I'll try and look into it, but I'm a bit
> over-committed atm.
>
> Also, I've started a rewrite of the whole tracepoint <-> perf
> interaction:
>
> http://lkml.org/lkml/2010/11/23/147
>
> Could you see if that cures your problem?
>
> Another thing to test: does the same hold true for regular software
> events? Tracepoints and software events share a lot of infrastructure.


In fact, here's a later version of that same patch.

---
Subject: perf: Tracepoint collection support
From: Peter Zijlstra <[email protected]>
Date: Sat Nov 20 18:09:34 CET 2010

Due to popular demand, this implements a tracepoint collection event:
{ .type = PERF_TYPE_TRACEPOINT, .config = ~0ULL }. By default it
contains no tracepoints, but tracepoints can be added using:
ioctl(fd, PERF_EVENT_IOC_ADD_TP, tp_id);
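
A minimal sketch of the intended usage from userspace (the tp_id values
here are hypothetical; they would come from the tracepoints' debugfs 'id'
files):

    struct perf_event_attr attr = {
            .type   = PERF_TYPE_TRACEPOINT,
            .config = ~0ULL,              /* start with an empty collection */
            .size   = sizeof(attr),
    };
    int fd = syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);

    ioctl(fd, PERF_EVENT_IOC_ADD_TP, tp_id1);
    ioctl(fd, PERF_EVENT_IOC_ADD_TP, tp_id2);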

In order to provide a dense ID space for tracepoints replace the
tracepoint ID generation with an IDR tree.

Furthermore, replace the whole trace-event <-> perf infrastructure with
multiple IDR trees.
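
For reference, the 2010-era IDR API used throughout the patch is a
two-step idiom: preallocate, then insert at the lowest free id at or
above a starting id. A sketch, with hypothetical names:

    if (!idr_pre_get(&some_idr, GFP_KERNEL))  /* 0 means allocation failed */
            return -ENOMEM;
    err = idr_get_new_above(&some_idr, node, starting_id, &new_id);
    if (err)  /* e.g. -EAGAIN when the preallocation got consumed */
            return err;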

We keep an IDR tree per perf_event; this tree collects all the
tracepoints the perf_event is interested in and stores the corresponding
'node'. This tree manages the node life-time.

Then we keep an IDR tree per task and per CPU. Both of these trees are
accumulation trees: they're the union of all events of that particular
task/CPU, and each entry stores a list of 'node's.

Then, when a trace-event happens, we look it up in both the CPU tree and
the current task tree. For each tree in which it is present, we iterate
the node list and deliver the event to the corresponding perf_events.

We manage the IDR trees at perf_event creation/destruction and
ioctl(ADD_TP) time. This means that pmu::{add,remove} are empty ops; the
per-task state is taken care of by the per-task tree, after all.

The patch got hammered by tglx, but it wants some more review, as well as
resolution of the open points, before merging.

The open points are:
- like the breakpoint muck, it suffers from the pmu::event_init() vs
  context-attach inversion, and
- the whole fancy inherited-context switch-avoidance logic needs a fix.

[ merged a cleanup contributed by Lin Ming ]

Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
---
 include/linux/ftrace_event.h    |   10
 include/linux/perf_event.h      |    9
 include/linux/sched.h           |    9
 include/trace/ftrace.h          |    4
 kernel/perf_event.c             |  420 +++++++++++++++++++++++++++++++++++++---
 kernel/trace/trace_event_perf.c |   95 +++------
 kernel/trace/trace_kprobe.c     |   10
 kernel/trace/trace_output.c     |  116 +++--------
 kernel/trace/trace_syscalls.c   |    8
 9 files changed, 504 insertions(+), 177 deletions(-)

Index: linux-2.6/include/linux/ftrace_event.h
===================================================================
--- linux-2.6.orig/include/linux/ftrace_event.h
+++ linux-2.6/include/linux/ftrace_event.h
@@ -87,8 +87,6 @@ struct trace_event_functions {
};

struct trace_event {
- struct hlist_node node;
- struct list_head list;
int type;
struct trace_event_functions *funcs;
};
@@ -194,7 +192,6 @@ struct ftrace_event_call {

#ifdef CONFIG_PERF_EVENTS
int perf_refcount;
- struct hlist_head __percpu *perf_events;
#endif
};

@@ -260,8 +257,9 @@ struct perf_event;

DECLARE_PER_CPU(struct pt_regs, perf_trace_regs);

-extern int perf_trace_init(struct perf_event *event);
+extern int perf_trace_init(struct perf_event *event, int event_id);
extern void perf_trace_destroy(struct perf_event *event);
+extern void perf_trace_destroy_id(int id);
extern int perf_trace_add(struct perf_event *event, int flags);
extern void perf_trace_del(struct perf_event *event, int flags);
extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
@@ -272,9 +270,9 @@ extern void *perf_trace_buf_prepare(int

static inline void
perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
- u64 count, struct pt_regs *regs, void *head)
+ u64 count, struct pt_regs *regs, int id)
{
- perf_tp_event(addr, count, raw_data, size, regs, head, rctx);
+ perf_tp_event(addr, count, raw_data, size, regs, rctx, id);
}
#endif

Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -238,6 +238,7 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
+#define PERF_EVENT_IOC_ADD_TP _IO ('$', 7)

enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
@@ -545,6 +546,11 @@ struct hw_perf_event {
struct task_struct *bp_target;
};
#endif
+ /*
+ * Same fudge as for breakpoints, trace-events needs
+ * it too,.. convert the bp crap over..
+ */
+ struct task_struct *event_target;
};
int state;
local64_t prev_count;
@@ -813,6 +819,7 @@ struct perf_event {
#ifdef CONFIG_EVENT_TRACING
struct ftrace_event_call *tp_event;
struct event_filter *filter;
+ struct perf_tp_idr tp_idr;
#endif

#endif /* CONFIG_PERF_EVENTS */
@@ -1076,7 +1083,7 @@ static inline bool perf_paranoid_kernel(
extern void perf_event_init(void);
extern void perf_tp_event(u64 addr, u64 count, void *record,
int entry_size, struct pt_regs *regs,
- struct hlist_head *head, int rctx);
+ int rctx, int id);
extern void perf_bp_event(struct perf_event *event, void *data);

#ifndef perf_misc_flags
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -82,6 +82,7 @@ struct sched_param {
#include <linux/rculist.h>
#include <linux/rtmutex.h>

+#include <linux/idr.h>
#include <linux/time.h>
#include <linux/param.h>
#include <linux/resource.h>
@@ -1179,6 +1180,11 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

+struct perf_tp_idr {
+ struct mutex lock;
+ struct idr idr;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1452,6 +1458,9 @@ struct task_struct {
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
struct mutex perf_event_mutex;
struct list_head perf_event_list;
+#ifdef CONFIG_EVENT_TRACING
+ struct perf_tp_idr *perf_tp_idr;
+#endif
#endif
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy; /* Protected by alloc_lock */
Index: linux-2.6/include/trace/ftrace.h
===================================================================
--- linux-2.6.orig/include/trace/ftrace.h
+++ linux-2.6/include/trace/ftrace.h
@@ -700,7 +700,6 @@ perf_trace_##call(void *__data, proto)
struct ftrace_raw_##call *entry; \
struct pt_regs __regs; \
u64 __addr = 0, __count = 1; \
- struct hlist_head *head; \
int __entry_size; \
int __data_size; \
int rctx; \
@@ -725,9 +724,8 @@ perf_trace_##call(void *__data, proto)
\
{ assign; } \
\
- head = this_cpu_ptr(event_call->perf_events); \
perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \
- __count, &__regs, head); \
+ __count, &__regs, event_call->event.type); \
}

/*
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -13,6 +13,7 @@
#include <linux/mm.h>
#include <linux/cpu.h>
#include <linux/smp.h>
+#include <linux/idr.h>
#include <linux/file.h>
#include <linux/poll.h>
#include <linux/slab.h>
@@ -310,6 +311,7 @@ list_add_event(struct perf_event *event,
ctx->nr_events++;
if (event->attr.inherit_stat)
ctx->nr_stat++;
+ ++ctx->generation;
}

static void perf_group_attach(struct perf_event *event)
@@ -370,6 +372,7 @@ list_del_event(struct perf_event *event,
*/
if (event->state > PERF_EVENT_STATE_OFF)
event->state = PERF_EVENT_STATE_OFF;
+ ++ctx->generation;
}

static void perf_group_detach(struct perf_event *event)
@@ -1228,6 +1231,12 @@ void perf_event_context_sched_out(struct
if (!cpuctx->task_ctx)
return;

+#if 0
+ /*
+ * Need to sort out how to make task_struct::perf_tp_idr
+ * work with this fancy switching stuff.. tracepoints could be
+ * in multiple contexts due to the software event muck.
+ */
rcu_read_lock();
parent = rcu_dereference(ctx->parent_ctx);
next_ctx = next->perf_event_ctxp[ctxn];
@@ -1261,6 +1270,7 @@ void perf_event_context_sched_out(struct
raw_spin_unlock(&ctx->lock);
}
rcu_read_unlock();
+#endif

if (do_switch) {
ctx_sched_out(ctx, cpuctx, EVENT_ALL);
@@ -2561,6 +2571,7 @@ static struct perf_event *perf_fget_ligh
static int perf_event_set_output(struct perf_event *event,
struct perf_event *output_event);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
+static int perf_event_add_tp(struct perf_event *event, int tp_id);

static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
@@ -2607,6 +2618,9 @@ static long perf_ioctl(struct file *file
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);

+ case PERF_EVENT_IOC_ADD_TP:
+ return perf_event_add_tp(event, arg);
+
default:
return -ENOTTY;
}
@@ -4741,6 +4755,9 @@ static struct pmu perf_swevent = {

#ifdef CONFIG_EVENT_TRACING

+#include <linux/ftrace_event.h>
+#include "trace/trace_output.h"
+
static int perf_tp_filter_match(struct perf_event *event,
struct perf_sample_data *data)
{
@@ -4755,6 +4772,9 @@ static int perf_tp_event_match(struct pe
struct perf_sample_data *data,
struct pt_regs *regs)
{
+ if (event->state != PERF_EVENT_STATE_ACTIVE)
+ return 0;
+
/*
* All tracepoints are from kernel-space.
*/
@@ -4767,12 +4787,62 @@ static int perf_tp_event_match(struct pe
return 1;
}

+static void perf_tp_idr_init(struct perf_tp_idr *idr)
+{
+ idr_init(&idr->idr);
+ mutex_init(&idr->lock);
+}
+
+static DEFINE_PER_CPU(struct perf_tp_idr, perf_tp_idr);
+
+struct perf_tp_node {
+ struct list_head list;
+ struct perf_event *event;
+ struct rcu_head rcu;
+};
+
+static void do_perf_tp_event(struct perf_event *event, u64 count,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ if (perf_tp_event_match(event, data, regs))
+ perf_swevent_event(event, count, 1, data, regs);
+}
+
+static void perf_tp_idr_event(struct perf_tp_idr *tp_idr,
+ int id, u64 count,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct perf_tp_node *tp_node, *node;
+ struct perf_event *event;
+
+ if (!tp_idr)
+ return;
+
+ /*
+ * Most of this is done under rcu_read_lock_sched(), which doesn't
+ * exclude regular RCU grace periods, but the IDR code uses call_rcu()
+ * so we have to use rcu_read_lock() here as well.
+ */
+ rcu_read_lock();
+ tp_node = idr_find(&tp_idr->idr, id);
+ rcu_read_unlock();
+
+ if (!tp_node)
+ return;
+
+ event = tp_node->event;
+
+ do_perf_tp_event(event, count, data, regs);
+ list_for_each_entry_rcu(node, &tp_node->list, list)
+ do_perf_tp_event(node->event, count, data, regs);
+}
+
void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
- struct pt_regs *regs, struct hlist_head *head, int rctx)
+ struct pt_regs *regs, int rctx, int id)
{
struct perf_sample_data data;
- struct perf_event *event;
- struct hlist_node *node;

struct perf_raw_record raw = {
.size = entry_size,
@@ -4782,18 +4852,197 @@ void perf_tp_event(u64 addr, u64 count,
perf_sample_data_init(&data, addr);
data.raw = &raw;

- hlist_for_each_entry_rcu(event, node, head, hlist_entry) {
- if (perf_tp_event_match(event, &data, regs))
- perf_swevent_event(event, count, 1, &data, regs);
- }
+ perf_tp_idr_event(&__get_cpu_var(perf_tp_idr), id, count, &data, regs);
+ perf_tp_idr_event(current->perf_tp_idr, id, count, &data, regs);

perf_swevent_put_recursion_context(rctx);
}
EXPORT_SYMBOL_GPL(perf_tp_event);

+static struct perf_tp_idr *
+perf_tp_init_task(struct perf_event *event, struct task_struct *task)
+{
+ struct perf_tp_idr *idr;
+
+ mutex_lock(&task->perf_event_mutex);
+ idr = task->perf_tp_idr;
+ if (idr)
+ goto unlock;
+
+ idr = kzalloc(sizeof(struct perf_tp_idr), GFP_KERNEL);
+ if (!idr)
+ goto unlock;
+
+ perf_tp_idr_init(idr);
+
+ task->perf_tp_idr = idr;
+unlock:
+ mutex_unlock(&task->perf_event_mutex);
+
+ return idr;
+}
+
+static struct perf_tp_idr *perf_event_idr(struct perf_event *event, bool create)
+{
+ struct perf_tp_idr *tp_idr;
+ struct task_struct *task;
+
+ if (event->attach_state & PERF_ATTACH_TASK) {
+ task = event->hw.event_target;
+ tp_idr = task->perf_tp_idr;
+ if (!tp_idr && create)
+ tp_idr = perf_tp_init_task(event, task);
+ } else
+ tp_idr = &per_cpu(perf_tp_idr, event->cpu);
+
+ return tp_idr;
+}
+
+static void perf_tp_free_node(struct rcu_head *rcu)
+{
+ struct perf_tp_node *node = container_of(rcu, struct perf_tp_node, rcu);
+
+ kfree(node);
+}
+
+static int perf_tp_remove_idr(int id, void *p, void *data)
+{
+ struct perf_tp_node *node = p;
+ struct perf_tp_node *first, *next;
+ struct perf_tp_idr *tp_idr = data;
+
+ if (!tp_idr)
+ goto no_idr;
+
+ mutex_lock(&tp_idr->lock);
+ first = idr_find(&tp_idr->idr, id);
+ if (first == node) {
+ next = list_first_entry(&first->list, struct perf_tp_node, list);
+ if (next != first)
+ idr_replace(&tp_idr->idr, next, id);
+ else
+ idr_remove(&tp_idr->idr, id);
+ }
+ list_del_rcu(&node->list);
+ mutex_unlock(&tp_idr->lock);
+
+no_idr:
+ perf_trace_destroy_id(id);
+ call_rcu_sched(&node->rcu, perf_tp_free_node);
+ return 0;
+}
+
static void tp_perf_event_destroy(struct perf_event *event)
{
- perf_trace_destroy(event);
+ /*
+ * Since this is the free path, the fd is gone and there
+ * can be no concurrency on event->tp_idr.
+ */
+
+ idr_for_each(&event->tp_idr.idr, perf_tp_remove_idr,
+ perf_event_idr(event, false));
+
+ idr_remove_all(&event->tp_idr.idr);
+ idr_destroy(&event->tp_idr.idr);
+}
+
+static int __perf_event_add_tp(struct perf_event *event, int tp_id)
+{
+ struct perf_tp_node *node, *first;
+ struct perf_tp_idr *idr;
+ int tmp_id, err, ret = -ENOMEM;
+
+ node = kmalloc(sizeof(*node), GFP_KERNEL);
+ if (!node)
+ goto out;
+
+ node->event = event;
+ INIT_LIST_HEAD(&node->list);
+
+ /*
+ * Insert the node into the event->idr, this idr tracks the
+ * tracepoints we're interested in, it has a 1:1 relation
+ * with the node.
+ */
+ idr = &event->tp_idr;
+ mutex_lock(&idr->lock);
+ err = idr_pre_get(&idr->idr, GFP_KERNEL);
+ if (!err) {
+ ret = -ENOMEM;
+ goto free_node;
+ }
+
+ ret = idr_get_new_above(&idr->idr, node, tp_id, &tmp_id);
+ if (ret)
+ goto free_node;
+
+ if (WARN_ON(tp_id != tmp_id)) {
+ printk(KERN_ERR "fail: %d %d\n", tp_id, tmp_id);
+ ret = -EBUSY;
+ goto free_idr1;
+ }
+ mutex_unlock(&idr->lock);
+
+ /*
+ * Insert the node into the task/cpu idr, this idr tracks
+ * all active tracepoints for the task/cpu, it has a 1:n relation
+ * with the node.
+ */
+ idr = perf_event_idr(event, true);
+ if (!idr) {
+ if (event->attach_state & PERF_ATTACH_CONTEXT)
+ ret = -ENOMEM;
+ else
+ ret = -ESRCH;
+ goto free_idr1_set;
+ }
+ mutex_lock(&idr->lock);
+ first = idr_find(&idr->idr, tp_id);
+ if (first) {
+ list_add_rcu(&node->list, &first->list);
+ goto unlock;
+ }
+
+ err = idr_pre_get(&idr->idr, GFP_KERNEL);
+ if (!err) {
+ ret = -ENOMEM;
+ goto free_idr1_set_unlock;
+ }
+
+ ret = idr_get_new_above(&idr->idr, node, tp_id, &tmp_id);
+ if (ret)
+ goto free_idr1_set;
+
+ if (WARN_ON(tp_id != tmp_id)) {
+ ret = -EBUSY;
+ goto free_idr2;
+ }
+unlock:
+ mutex_unlock(&idr->lock);
+
+ ret = perf_trace_init(event, tp_id);
+ if (ret)
+ goto free_all;
+
+out:
+ return ret;
+
+free_all:
+ mutex_lock(&idr->lock);
+free_idr2:
+ idr_remove(&idr->idr, tmp_id);
+free_idr1_set_unlock:
+ mutex_unlock(&idr->lock);
+free_idr1_set:
+ idr = &event->tp_idr;
+ tmp_id = tp_id;
+ mutex_lock(&idr->lock);
+free_idr1:
+ idr_remove(&idr->idr, tmp_id);
+free_node:
+ mutex_unlock(&idr->lock);
+ kfree(node);
+ goto out;
}

static int perf_tp_event_init(struct perf_event *event)
@@ -4803,21 +5052,35 @@ static int perf_tp_event_init(struct per
if (event->attr.type != PERF_TYPE_TRACEPOINT)
return -ENOENT;

- err = perf_trace_init(event);
- if (err)
- return err;
+ perf_tp_idr_init(&event->tp_idr);

event->destroy = tp_perf_event_destroy;

+ if (event->attr.config != ~0ULL) {
+ err = __perf_event_add_tp(event, event->attr.config);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static int perf_tp_event_add(struct perf_event *event, int flags)
+{
+ event->hw.state = flags & PERF_EF_START ? 0 : PERF_HES_STOPPED;
return 0;
}

+static void perf_tp_event_del(struct perf_event *event, int flags)
+{
+}
+
static struct pmu perf_tracepoint = {
.task_ctx_nr = perf_sw_context,

.event_init = perf_tp_event_init,
- .add = perf_trace_add,
- .del = perf_trace_del,
+ .add = perf_tp_event_add,
+ .del = perf_tp_event_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
@@ -4825,6 +5088,11 @@ static struct pmu perf_tracepoint = {

static inline void perf_tp_register(void)
{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ perf_tp_idr_init(&per_cpu(perf_tp_idr, cpu));
+
perf_pmu_register(&perf_tracepoint);
}

@@ -4833,7 +5101,8 @@ static int perf_event_set_filter(struct
char *filter_str;
int ret;

- if (event->attr.type != PERF_TYPE_TRACEPOINT)
+ if (event->attr.type != PERF_TYPE_TRACEPOINT ||
+ event->attr.config == ~0ULL)
return -EINVAL;

filter_str = strndup_user(arg, PAGE_SIZE);
@@ -4851,6 +5120,74 @@ static void perf_event_free_filter(struc
ftrace_profile_free_filter(event);
}

+static int perf_event_add_tp(struct perf_event *event, int tp_id)
+{
+ if (event->attr.type != PERF_TYPE_TRACEPOINT &&
+ event->attr.config != ~0ULL)
+ return -EINVAL;
+
+ return __perf_event_add_tp(event, tp_id);
+}
+
+/*
+ * Called from the exit path, _after_ all events have been detached from it.
+ */
+static void perf_tp_event_exit(struct task_struct *tsk)
+{
+ struct perf_tp_idr *idr = tsk->perf_tp_idr;
+
+ if (!idr)
+ return;
+
+ idr_remove_all(&idr->idr);
+ idr_destroy(&idr->idr);
+}
+
+static void perf_tp_event_delayed_put(struct task_struct *tsk)
+{
+ struct perf_tp_idr *idr = tsk->perf_tp_idr;
+
+ tsk->perf_tp_idr = NULL;
+ kfree(idr);
+}
+
+static int perf_tp_inherit_idr(int id, void *p, void *data)
+{
+ struct perf_event *child = data;
+
+ return __perf_event_add_tp(child, id);
+}
+
+static int perf_tp_event_inherit(struct perf_event *parent_event,
+ struct perf_event *child_event)
+{
+ int ret;
+
+ if (parent_event->attr.type != PERF_TYPE_TRACEPOINT ||
+ parent_event->attr.config != ~0ULL)
+ return 0;
+
+ /*
+ * The child is not yet exposed, hence no need to serialize things
+ * on that side.
+ */
+ mutex_lock(&parent_event->tp_idr.lock);
+ ret = idr_for_each(&parent_event->tp_idr.idr,
+ perf_tp_inherit_idr,
+ child_event);
+ mutex_unlock(&parent_event->tp_idr.lock);
+
+ return ret;
+}
+
+static void perf_tp_event_init_task(struct task_struct *child)
+{
+ /*
+ * Clear the idr pointer copied from the parent.
+ */
+ child->perf_tp_idr = NULL;
+}
+
#else

static inline void perf_tp_register(void)
@@ -4866,6 +5203,29 @@ static void perf_event_free_filter(struc
{
}

+static int perf_event_add_tp(struct perf_event *event, int tp_id)
+{
+ return -ENOENT;
+}
+
+static void perf_tp_event_exit(struct task_struct *tsk)
+{
+}
+
+static void perf_tp_event_delayed_put(struct task_struct *tsk)
+{
+}
+
+static int perf_tp_event_inherit(struct perf_event *parent_event,
+ struct perf_event *child_event)
+{
+ return 0;
+}
+
+static void perf_tp_event_init_task(struct task_struct *child)
+{
+}
+
#endif /* CONFIG_EVENT_TRACING */

#ifdef CONFIG_HAVE_HW_BREAKPOINT
@@ -5290,6 +5650,9 @@ perf_event_alloc(struct perf_event_attr
INIT_LIST_HEAD(&event->sibling_list);
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending, perf_pending_event);
+#ifdef CONFIG_EVENT_TRACING
+ perf_tp_idr_init(&event->tp_idr);
+#endif

mutex_init(&event->mmap_mutex);

@@ -5308,6 +5671,7 @@ perf_event_alloc(struct perf_event_attr

if (task) {
event->attach_state = PERF_ATTACH_TASK;
+ event->hw.event_target = task;
#ifdef CONFIG_HAVE_HW_BREAKPOINT
/*
* hw_breakpoint is a bit difficult here..
@@ -5353,7 +5717,7 @@ perf_event_alloc(struct perf_event_attr
if (err) {
if (event->ns)
put_pid_ns(event->ns);
- kfree(event);
+ free_event(event);
return ERR_PTR(err);
}

@@ -5694,7 +6058,6 @@ SYSCALL_DEFINE5(perf_event_open,
}

perf_install_in_context(ctx, event, cpu);
- ++ctx->generation;
mutex_unlock(&ctx->mutex);

event->owner = current;
@@ -5763,7 +6126,6 @@ perf_event_create_kernel_counter(struct
WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
perf_install_in_context(ctx, event, cpu);
- ++ctx->generation;
mutex_unlock(&ctx->mutex);

return event;
@@ -5936,6 +6298,8 @@ void perf_event_exit_task(struct task_st

for_each_task_context_nr(ctxn)
perf_event_exit_task_context(child, ctxn);
+
+ perf_tp_event_exit(child);
}

static void perf_free_event(struct perf_event *event,
@@ -5998,6 +6362,8 @@ void perf_event_delayed_put(struct task_

for_each_task_context_nr(ctxn)
WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
+
+ perf_tp_event_delayed_put(task);
}

/*
@@ -6013,6 +6379,7 @@ inherit_event(struct perf_event *parent_
{
struct perf_event *child_event;
unsigned long flags;
+ int ret;

/*
* Instead of creating recursive hierarchies of events,
@@ -6030,6 +6397,13 @@ inherit_event(struct perf_event *parent_
NULL);
if (IS_ERR(child_event))
return child_event;
+
+ ret = perf_tp_event_inherit(parent_event, child_event);
+ if (ret) {
+ free_event(child_event);
+ return ERR_PTR(ret);
+ }
+
get_ctx(child_ctx);

/*
@@ -6134,9 +6508,7 @@ inherit_task_group(struct perf_event *ev
child->perf_event_ctxp[ctxn] = child_ctx;
}

- ret = inherit_group(event, parent, parent_ctx,
- child, child_ctx);
-
+ ret = inherit_group(event, parent, parent_ctx, child, child_ctx);
if (ret)
*inherited_all = 0;

@@ -6157,9 +6529,6 @@ int perf_event_init_context(struct task_

child->perf_event_ctxp[ctxn] = NULL;

- mutex_init(&child->perf_event_mutex);
- INIT_LIST_HEAD(&child->perf_event_list);
-
if (likely(!parent->perf_event_ctxp[ctxn]))
return 0;

@@ -6236,6 +6605,11 @@ int perf_event_init_task(struct task_str
{
int ctxn, ret;

+ mutex_init(&child->perf_event_mutex);
+ INIT_LIST_HEAD(&child->perf_event_list);
+
+ perf_tp_event_init_task(child);
+
for_each_task_context_nr(ctxn) {
ret = perf_event_init_context(child, ctxn);
if (ret)
Index: linux-2.6/kernel/trace/trace_event_perf.c
===================================================================
--- linux-2.6.orig/kernel/trace/trace_event_perf.c
+++ linux-2.6/kernel/trace/trace_event_perf.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/kprobes.h>
#include "trace.h"
+#include "trace_output.h"

static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS];

@@ -47,9 +48,7 @@ static int perf_trace_event_perm(struct
static int perf_trace_event_init(struct ftrace_event_call *tp_event,
struct perf_event *p_event)
{
- struct hlist_head __percpu *list;
int ret;
- int cpu;

ret = perf_trace_event_perm(tp_event, p_event);
if (ret)
@@ -61,15 +60,6 @@ static int perf_trace_event_init(struct

ret = -ENOMEM;

- list = alloc_percpu(struct hlist_head);
- if (!list)
- goto fail;
-
- for_each_possible_cpu(cpu)
- INIT_HLIST_HEAD(per_cpu_ptr(list, cpu));
-
- tp_event->perf_events = list;
-
if (!total_ref_count) {
char __percpu *buf;
int i;
@@ -100,63 +90,40 @@ static int perf_trace_event_init(struct
}
}

- if (!--tp_event->perf_refcount) {
- free_percpu(tp_event->perf_events);
- tp_event->perf_events = NULL;
- }
+ --tp_event->perf_refcount;

return ret;
}

-int perf_trace_init(struct perf_event *p_event)
+int perf_trace_init(struct perf_event *p_event, int event_id)
{
struct ftrace_event_call *tp_event;
- int event_id = p_event->attr.config;
+ struct trace_event *t_event;
int ret = -EINVAL;

+ trace_event_read_lock();
+ t_event = ftrace_find_event(event_id);
+ if (!t_event)
+ goto out;
+
+ tp_event = container_of(t_event, struct ftrace_event_call, event);
+
mutex_lock(&event_mutex);
- list_for_each_entry(tp_event, &ftrace_events, list) {
- if (tp_event->event.type == event_id &&
- tp_event->class && tp_event->class->reg &&
- try_module_get(tp_event->mod)) {
- ret = perf_trace_event_init(tp_event, p_event);
- if (ret)
- module_put(tp_event->mod);
- break;
- }
+ if (tp_event->class && tp_event->class->reg &&
+ try_module_get(tp_event->mod)) {
+ ret = perf_trace_event_init(tp_event, p_event);
+ if (ret)
+ module_put(tp_event->mod);
}
mutex_unlock(&event_mutex);
+out:
+ trace_event_read_unlock();

return ret;
}

-int perf_trace_add(struct perf_event *p_event, int flags)
-{
- struct ftrace_event_call *tp_event = p_event->tp_event;
- struct hlist_head __percpu *pcpu_list;
- struct hlist_head *list;
-
- pcpu_list = tp_event->perf_events;
- if (WARN_ON_ONCE(!pcpu_list))
- return -EINVAL;
-
- if (!(flags & PERF_EF_START))
- p_event->hw.state = PERF_HES_STOPPED;
-
- list = this_cpu_ptr(pcpu_list);
- hlist_add_head_rcu(&p_event->hlist_entry, list);
-
- return 0;
-}
-
-void perf_trace_del(struct perf_event *p_event, int flags)
-{
- hlist_del_rcu(&p_event->hlist_entry);
-}
-
-void perf_trace_destroy(struct perf_event *p_event)
+static void __perf_trace_destroy(struct ftrace_event_call *tp_event)
{
- struct ftrace_event_call *tp_event = p_event->tp_event;
int i;

mutex_lock(&event_mutex);
@@ -171,9 +138,6 @@ void perf_trace_destroy(struct perf_even
*/
tracepoint_synchronize_unregister();

- free_percpu(tp_event->perf_events);
- tp_event->perf_events = NULL;
-
if (!--total_ref_count) {
for (i = 0; i < PERF_NR_CONTEXTS; i++) {
free_percpu(perf_trace_buf[i]);
@@ -185,6 +149,27 @@ void perf_trace_destroy(struct perf_even
mutex_unlock(&event_mutex);
}

+void perf_trace_destroy(struct perf_event *p_event)
+{
+ __perf_trace_destroy(p_event->tp_event);
+}
+
+void perf_trace_destroy_id(int event_id)
+{
+ struct ftrace_event_call *tp_event;
+ struct trace_event *t_event;
+
+ trace_event_read_lock();
+ t_event = ftrace_find_event(event_id);
+ if (!t_event)
+ goto unlock;
+
+ tp_event = container_of(t_event, struct ftrace_event_call, event);
+ __perf_trace_destroy(tp_event);
+unlock:
+ trace_event_read_unlock();
+}
+
__kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs *regs, int *rctxp)
{
Index: linux-2.6/kernel/trace/trace_kprobe.c
===================================================================
--- linux-2.6.orig/kernel/trace/trace_kprobe.c
+++ linux-2.6/kernel/trace/trace_kprobe.c
@@ -1559,7 +1559,6 @@ static __kprobes void kprobe_perf_func(s
struct trace_probe *tp = container_of(kp, struct trace_probe, rp.kp);
struct ftrace_event_call *call = &tp->call;
struct kprobe_trace_entry_head *entry;
- struct hlist_head *head;
int size, __size, dsize;
int rctx;

@@ -1579,8 +1578,8 @@ static __kprobes void kprobe_perf_func(s
memset(&entry[1], 0, dsize);
store_trace_args(sizeof(*entry), tp, regs, (u8 *)&entry[1], dsize);

- head = this_cpu_ptr(call->perf_events);
- perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+ perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs,
+ call->event.type);
}

/* Kretprobe profile handler */
@@ -1590,7 +1589,6 @@ static __kprobes void kretprobe_perf_fun
struct trace_probe *tp = container_of(ri->rp, struct trace_probe, rp);
struct ftrace_event_call *call = &tp->call;
struct kretprobe_trace_entry_head *entry;
- struct hlist_head *head;
int size, __size, dsize;
int rctx;

@@ -1610,8 +1608,8 @@ static __kprobes void kretprobe_perf_fun
entry->ret_ip = (unsigned long)ri->ret_addr;
store_trace_args(sizeof(*entry), tp, regs, (u8 *)&entry[1], dsize);

- head = this_cpu_ptr(call->perf_events);
- perf_trace_buf_submit(entry, size, rctx, entry->ret_ip, 1, regs, head);
+ perf_trace_buf_submit(entry, size, rctx, entry->ret_ip, 1,
+ regs, call->event.type);
}

static int probe_perf_enable(struct ftrace_event_call *call)
Index: linux-2.6/kernel/trace/trace_output.c
===================================================================
--- linux-2.6.orig/kernel/trace/trace_output.c
+++ linux-2.6/kernel/trace/trace_output.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/ftrace.h>
+#include <linux/idr.h>

#include "trace_output.h"

@@ -16,9 +17,9 @@

DECLARE_RWSEM(trace_event_mutex);

-static struct hlist_head event_hash[EVENT_HASHSIZE] __read_mostly;
+static const int first_event_type = __TRACE_LAST_TYPE + 1;

-static int next_event_type = __TRACE_LAST_TYPE + 1;
+static DEFINE_IDR(trace_type_idr);

int trace_print_seq(struct seq_file *m, struct trace_seq *s)
{
@@ -660,58 +661,43 @@ static int task_state_char(unsigned long
*/
struct trace_event *ftrace_find_event(int type)
{
- struct trace_event *event;
- struct hlist_node *n;
- unsigned key;
-
- key = type & (EVENT_HASHSIZE - 1);
-
- hlist_for_each_entry(event, n, &event_hash[key], node) {
- if (event->type == type)
- return event;
- }
-
- return NULL;
+ return idr_find(&trace_type_idr, type);
}

-static LIST_HEAD(ftrace_event_list);
+void trace_event_read_lock(void)
+{
+ down_read(&trace_event_mutex);
+}

-static int trace_search_list(struct list_head **list)
+void trace_event_read_unlock(void)
{
- struct trace_event *e;
- int last = __TRACE_LAST_TYPE;
+ up_read(&trace_event_mutex);
+}

- if (list_empty(&ftrace_event_list)) {
- *list = &ftrace_event_list;
- return last + 1;
- }
+static int register_event(struct trace_event *event, int id, bool strict)
+{
+ int ret, type;

- /*
- * We used up all possible max events,
- * lets see if somebody freed one.
- */
- list_for_each_entry(e, &ftrace_event_list, list) {
- if (e->type != last + 1)
- break;
- last++;
- }
+ ret = idr_pre_get(&trace_type_idr, GFP_KERNEL);
+ if (!ret)
+ return 0;

- /* Did we used up all 65 thousand events??? */
- if ((last + 1) > FTRACE_MAX_EVENT)
+ ret = idr_get_new_above(&trace_type_idr, event, id, &type);
+ if (ret)
return 0;

- *list = &e->list;
- return last + 1;
-}
+ if (strict && id != type) {
+ idr_remove(&trace_type_idr, type);
+ return 0;
+ }

-void trace_event_read_lock(void)
-{
- down_read(&trace_event_mutex);
-}
+ if (type > FTRACE_MAX_EVENT) {
+ idr_remove(&trace_type_idr, type);
+ return 0;
+ }

-void trace_event_read_unlock(void)
-{
- up_read(&trace_event_mutex);
+ event->type = type;
+ return type;
}

/**
@@ -731,7 +717,6 @@ void trace_event_read_unlock(void)
*/
int register_ftrace_event(struct trace_event *event)
{
- unsigned key;
int ret = 0;

down_write(&trace_event_mutex);
@@ -742,35 +727,18 @@ int register_ftrace_event(struct trace_e
if (WARN_ON(!event->funcs))
goto out;

- INIT_LIST_HEAD(&event->list);
-
if (!event->type) {
- struct list_head *list = NULL;
-
- if (next_event_type > FTRACE_MAX_EVENT) {
-
- event->type = trace_search_list(&list);
- if (!event->type)
- goto out;
-
- } else {
-
- event->type = next_event_type++;
- list = &ftrace_event_list;
- }
-
- if (WARN_ON(ftrace_find_event(event->type)))
+ ret = register_event(event, first_event_type, false);
+ if (!ret)
goto out;
-
- list_add_tail(&event->list, list);
-
- } else if (event->type > __TRACE_LAST_TYPE) {
- printk(KERN_WARNING "Need to add type to trace.h\n");
- WARN_ON(1);
- goto out;
} else {
- /* Is this event already used */
- if (ftrace_find_event(event->type))
+ if (event->type > __TRACE_LAST_TYPE) {
+ printk(KERN_WARNING "Need to add type to trace.h\n");
+ WARN_ON(1);
+ goto out;
+ }
+ ret = register_event(event, event->type, true);
+ if (!ret)
goto out;
}

@@ -783,11 +751,6 @@ int register_ftrace_event(struct trace_e
if (event->funcs->binary == NULL)
event->funcs->binary = trace_nop_print;

- key = event->type & (EVENT_HASHSIZE - 1);
-
- hlist_add_head(&event->node, &event_hash[key]);
-
- ret = event->type;
out:
up_write(&trace_event_mutex);

@@ -800,8 +763,7 @@ EXPORT_SYMBOL_GPL(register_ftrace_event)
*/
int __unregister_ftrace_event(struct trace_event *event)
{
- hlist_del(&event->node);
- list_del(&event->list);
+ idr_remove(&trace_type_idr, event->type);
return 0;
}

Index: linux-2.6/kernel/trace/trace_syscalls.c
===================================================================
--- linux-2.6.orig/kernel/trace/trace_syscalls.c
+++ linux-2.6/kernel/trace/trace_syscalls.c
@@ -489,7 +489,6 @@ static void perf_syscall_enter(void *ign
{
struct syscall_metadata *sys_data;
struct syscall_trace_enter *rec;
- struct hlist_head *head;
int syscall_nr;
int rctx;
int size;
@@ -520,8 +519,7 @@ static void perf_syscall_enter(void *ign
syscall_get_arguments(current, regs, 0, sys_data->nb_args,
(unsigned long *)&rec->args);

- head = this_cpu_ptr(sys_data->enter_event->perf_events);
- perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head);
+ perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, rec->ent.type);
}

int perf_sysenter_enable(struct ftrace_event_call *call)
@@ -563,7 +561,6 @@ static void perf_syscall_exit(void *igno
{
struct syscall_metadata *sys_data;
struct syscall_trace_exit *rec;
- struct hlist_head *head;
int syscall_nr;
int rctx;
int size;
@@ -596,8 +593,7 @@ static void perf_syscall_exit(void *igno
rec->nr = syscall_nr;
rec->ret = syscall_get_return_value(current, regs);

- head = this_cpu_ptr(sys_data->exit_event->perf_events);
- perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head);
+ perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, rec->ent.type);
}

int perf_sysexit_enable(struct ftrace_event_call *call)

2010-12-01 18:02:48

by Frederic Weisbecker

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Wed, Dec 01, 2010 at 01:04:37PM +0100, Peter Zijlstra wrote:
> @@ -545,6 +546,11 @@ struct hw_perf_event {
> struct task_struct *bp_target;
> };
> #endif
> + /*
> + * Same fudge as for breakpoints, trace-events needs
> + * it too,.. convert the bp crap over..
> + */
> + struct task_struct *event_target;

Yeah, looks like we can merge the bp_target and event_target.
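
Presumably into a single generic field, something like:

    struct task_struct *target;  /* task for breakpoints and trace-events alike */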


> struct task_struct {
> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> void *stack;
> @@ -1452,6 +1458,9 @@ struct task_struct {
> struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> struct mutex perf_event_mutex;
> struct list_head perf_event_list;
> +#ifdef CONFIG_EVENT_TRACING
> + struct perf_tp_idr *perf_tp_idr;

Why not attach this to the ctx eventually? That would be one pointer less
in task_struct.

> @@ -370,6 +372,7 @@ list_del_event(struct perf_event *event,
> */
> if (event->state > PERF_EVENT_STATE_OFF)
> event->state = PERF_EVENT_STATE_OFF;
> + ++ctx->generation;

What's the role of ctx->generation? It seems to be incremented in two places
but doesn't appear to have any purpose.


> }
>
> static void perf_group_detach(struct perf_event *event)
> @@ -1228,6 +1231,12 @@ void perf_event_context_sched_out(struct
> if (!cpuctx->task_ctx)
> return;
>
> +#if 0
> + /*
> + * Need to sort out how to make task_struct::perf_tp_idr
> + * work with this fancy switching stuff.. tracepoints could be
> + * in multiple contexts due to the software event muck.
> + */

Not sure what the issue is here. Each ctx has the perf_tp_idr matching its
active tracepoints, doesn't it?

> +static struct perf_tp_idr *perf_event_idr(struct perf_event *event, bool create)
> +{
> + struct perf_tp_idr *tp_idr;
> + struct task_struct *task;
> +
> + if (event->attach_state & PERF_ATTACH_TASK) {
> + task = event->hw.event_target;
> + tp_idr = task->perf_tp_idr;
> + if (!tp_idr && create)

Is it possible that task->perf_tp_idr can eventually disappear
under us there? Like when an event is released from that task?

> + tp_idr = perf_tp_init_task(event, task);
> + } else
> + tp_idr = &per_cpu(perf_tp_idr, event->cpu);
> +
> + return tp_idr;
> +}

2010-12-01 18:29:24

by Frederic Weisbecker

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Wed, Dec 01, 2010 at 01:04:37PM +0100, Peter Zijlstra wrote:
> @@ -4833,7 +5101,8 @@ static int perf_event_set_filter(struct
> char *filter_str;
> int ret;
>
> - if (event->attr.type != PERF_TYPE_TRACEPOINT)
> + if (event->attr.type != PERF_TYPE_TRACEPOINT ||
> + event->attr.config == ~0ULL)
> return -EINVAL;
>
> filter_str = strndup_user(arg, PAGE_SIZE);
> @@ -4851,6 +5120,74 @@ static void perf_event_free_filter(struc
> ftrace_profile_free_filter(event);
> }
>
> +static int perf_event_add_tp(struct perf_event *event, int tp_id)
> +{
> + if (event->attr.type != PERF_TYPE_TRACEPOINT &&

Should it be || instead?

> + event->attr.config != ~0ULL)
> + return -EINVAL;
> +
> + return __perf_event_add_tp(event, tp_id);
> +}
> +
> +/*
> + * Called from the exit path, _after_ all events have been detached from it.
> + */
> +static void perf_tp_event_exit(struct task_struct *tsk)
> +{
> + struct perf_tp_idr *idr = tsk->perf_tp_idr;
> +
> + if (!idr)
> + return;
> +
> + idr_remove_all(&idr->idr);
> + idr_destroy(&idr->idr);
> +}
> +
> +static void perf_tp_event_delayed_put(struct task_struct *tsk)
> +{
> + struct perf_tp_idr *idr = tsk->perf_tp_idr;
> +
> + tsk->perf_tp_idr = NULL;
> + kfree(idr);
> +}
> +
> +static int perf_tp_inherit_idr(int id, void *p, void *data)
> +{
> + struct perf_event *child = data;
> +
> + return __perf_event_add_tp(child, id);
> +}
> +
> +static int perf_tp_event_inherit(struct perf_event *parent_event,
> + struct perf_event *child_event)
> +{
> + int ret;
> +
> + if (parent_event->attr.type != PERF_TYPE_TRACEPOINT ||
> + parent_event->attr.config != ~0ULL)
> + return 0;
> +
> + /*
> + * The child is not yet exposed, hence no need to serialize things
> + * on that side.
> + */
> + mutex_lock(&parent_event->tp_idr.lock);
> + ret = idr_for_each(&parent_event->tp_idr.idr,
> + perf_tp_inherit_idr,
> + child_event);
> + mutex_unlock(&parent_event->tp_idr.lock);
> +
> + return ret;
> +}
> +
> +static void perf_tp_event_init_task(struct task_struct *child)
> +{
> + /*
> + * Clear the idr pointer copied from the parent.
> + */
> + child->perf_tp_idr = NULL;
> +}
> +
> #else
>
> static inline void perf_tp_register(void)
> @@ -4866,6 +5203,29 @@ static void perf_event_free_filter(struc
> {
> }
>
> +static int perf_event_add_tp(struct perf_event *event, int tp_id)
> +{
> + return -ENOENT;
> +}
> +
> +static void perf_tp_event_exit(struct task_struct *tsk)
> +{
> +}
> +
> +static void perf_tp_event_delayed_put(struct task_struct *tsk)
> +{
> +}
> +
> +static int perf_tp_event_inherit(struct perf_event *parent_event,
> + struct perf_event *child_event)
> +{
> + return 0;
> +}
> +
> > +static void perf_tp_event_init_task(struct task_struct *child)
> +{
> +}
> +
> #endif /* CONFIG_EVENT_TRACING */
>
> #ifdef CONFIG_HAVE_HW_BREAKPOINT
> @@ -5290,6 +5650,9 @@ perf_event_alloc(struct perf_event_attr
> INIT_LIST_HEAD(&event->sibling_list);
> init_waitqueue_head(&event->waitq);
> init_irq_work(&event->pending, perf_pending_event);
> +#ifdef CONFIG_EVENT_TRACING
> + perf_tp_idr_init(&event->tp_idr);
> +#endif
>
> mutex_init(&event->mmap_mutex);
>
> @@ -5308,6 +5671,7 @@ perf_event_alloc(struct perf_event_attr
>
> if (task) {
> event->attach_state = PERF_ATTACH_TASK;
> + event->hw.event_target = task;
> #ifdef CONFIG_HAVE_HW_BREAKPOINT
> /*
> * hw_breakpoint is a bit difficult here..
> @@ -5353,7 +5717,7 @@ perf_event_alloc(struct perf_event_attr
> if (err) {
> if (event->ns)
> put_pid_ns(event->ns);
> - kfree(event);
> + free_event(event);
> return ERR_PTR(err);
> }
>
> @@ -5694,7 +6058,6 @@ SYSCALL_DEFINE5(perf_event_open,
> }
>
> perf_install_in_context(ctx, event, cpu);
> - ++ctx->generation;
> mutex_unlock(&ctx->mutex);
>
> event->owner = current;
> @@ -5763,7 +6126,6 @@ perf_event_create_kernel_counter(struct
> WARN_ON_ONCE(ctx->parent_ctx);
> mutex_lock(&ctx->mutex);
> perf_install_in_context(ctx, event, cpu);
> - ++ctx->generation;
> mutex_unlock(&ctx->mutex);
>
> return event;
> @@ -5936,6 +6298,8 @@ void perf_event_exit_task(struct task_st
>
> for_each_task_context_nr(ctxn)
> perf_event_exit_task_context(child, ctxn);
> +
> + perf_tp_event_exit(child);
> }
>
> static void perf_free_event(struct perf_event *event,
> @@ -5998,6 +6362,8 @@ void perf_event_delayed_put(struct task_
>
> for_each_task_context_nr(ctxn)
> WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
> +
> + perf_tp_event_delayed_put(task);
> }
>
> /*
> @@ -6013,6 +6379,7 @@ inherit_event(struct perf_event *parent_
> {
> struct perf_event *child_event;
> unsigned long flags;
> + int ret;
>
> /*
> * Instead of creating recursive hierarchies of events,
> @@ -6030,6 +6397,13 @@ inherit_event(struct perf_event *parent_
> NULL);
> if (IS_ERR(child_event))
> return child_event;
> +
> + ret = perf_tp_event_inherit(parent_event, child_event);
> + if (ret) {
> + free_event(child_event);
> + return ERR_PTR(ret);
> + }
> +
> get_ctx(child_ctx);
>
> /*
> @@ -6134,9 +6508,7 @@ inherit_task_group(struct perf_event *ev
> child->perf_event_ctxp[ctxn] = child_ctx;
> }
>
> - ret = inherit_group(event, parent, parent_ctx,
> - child, child_ctx);
> -
> + ret = inherit_group(event, parent, parent_ctx, child, child_ctx);
> if (ret)
> *inherited_all = 0;
>
> @@ -6157,9 +6529,6 @@ int perf_event_init_context(struct task_
>
> child->perf_event_ctxp[ctxn] = NULL;
>
> - mutex_init(&child->perf_event_mutex);
> - INIT_LIST_HEAD(&child->perf_event_list);
> -
> if (likely(!parent->perf_event_ctxp[ctxn]))
> return 0;
>
> @@ -6236,6 +6605,11 @@ int perf_event_init_task(struct task_str
> {
> int ctxn, ret;
>
> + mutex_init(&child->perf_event_mutex);
> + INIT_LIST_HEAD(&child->perf_event_list);
> +
> + perf_tp_event_init_task(child);
> +
> for_each_task_context_nr(ctxn) {
> ret = perf_event_init_context(child, ctxn);
> if (ret)
> Index: linux-2.6/kernel/trace/trace_event_perf.c
> ===================================================================
> --- linux-2.6.orig/kernel/trace/trace_event_perf.c
> +++ linux-2.6/kernel/trace/trace_event_perf.c
> @@ -8,6 +8,7 @@
> #include <linux/module.h>
> #include <linux/kprobes.h>
> #include "trace.h"
> +#include "trace_output.h"
>
> static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS];
>
> @@ -47,9 +48,7 @@ static int perf_trace_event_perm(struct
> static int perf_trace_event_init(struct ftrace_event_call *tp_event,
> struct perf_event *p_event)
> {
> - struct hlist_head __percpu *list;
> int ret;
> - int cpu;
>
> ret = perf_trace_event_perm(tp_event, p_event);
> if (ret)
> @@ -61,15 +60,6 @@ static int perf_trace_event_init(struct
>
> ret = -ENOMEM;
>
> - list = alloc_percpu(struct hlist_head);
> - if (!list)
> - goto fail;
> -
> - for_each_possible_cpu(cpu)
> - INIT_HLIST_HEAD(per_cpu_ptr(list, cpu));
> -
> - tp_event->perf_events = list;
> -
> if (!total_ref_count) {
> char __percpu *buf;
> int i;
> @@ -100,63 +90,40 @@ static int perf_trace_event_init(struct
> }
> }
>
> - if (!--tp_event->perf_refcount) {
> - free_percpu(tp_event->perf_events);
> - tp_event->perf_events = NULL;
> - }
> + --tp_event->perf_refcount;
>
> return ret;
> }
>
> -int perf_trace_init(struct perf_event *p_event)
> +int perf_trace_init(struct perf_event *p_event, int event_id)
> {
> struct ftrace_event_call *tp_event;
> - int event_id = p_event->attr.config;
> + struct trace_event *t_event;
> int ret = -EINVAL;
>
> + trace_event_read_lock();
> + t_event = ftrace_find_event(event_id);
> + if (!t_event)
> + goto out;
> +
> + tp_event = container_of(t_event, struct ftrace_event_call, event);
> +
> mutex_lock(&event_mutex);
> - list_for_each_entry(tp_event, &ftrace_events, list) {
> - if (tp_event->event.type == event_id &&
> - tp_event->class && tp_event->class->reg &&
> - try_module_get(tp_event->mod)) {
> - ret = perf_trace_event_init(tp_event, p_event);
> - if (ret)
> - module_put(tp_event->mod);
> - break;
> - }
> + if (tp_event->class && tp_event->class->reg &&
> + try_module_get(tp_event->mod)) {
> + ret = perf_trace_event_init(tp_event, p_event);
> + if (ret)
> + module_put(tp_event->mod);
> }
> mutex_unlock(&event_mutex);
> +out:
> + trace_event_read_unlock();
>
> return ret;
> }
>
> -int perf_trace_add(struct perf_event *p_event, int flags)
> -{
> - struct ftrace_event_call *tp_event = p_event->tp_event;
> - struct hlist_head __percpu *pcpu_list;
> - struct hlist_head *list;
> -
> - pcpu_list = tp_event->perf_events;
> - if (WARN_ON_ONCE(!pcpu_list))
> - return -EINVAL;
> -
> - if (!(flags & PERF_EF_START))
> - p_event->hw.state = PERF_HES_STOPPED;
> -
> - list = this_cpu_ptr(pcpu_list);
> - hlist_add_head_rcu(&p_event->hlist_entry, list);
> -
> - return 0;
> -}
> -
> -void perf_trace_del(struct perf_event *p_event, int flags)
> -{
> - hlist_del_rcu(&p_event->hlist_entry);
> -}
> -
> -void perf_trace_destroy(struct perf_event *p_event)
> +static void __perf_trace_destroy(struct ftrace_event_call *tp_event)
> {
> - struct ftrace_event_call *tp_event = p_event->tp_event;
> int i;
>
> mutex_lock(&event_mutex);
> @@ -171,9 +138,6 @@ void perf_trace_destroy(struct perf_even
> */
> tracepoint_synchronize_unregister();
>
> - free_percpu(tp_event->perf_events);
> - tp_event->perf_events = NULL;
> -
> if (!--total_ref_count) {
> for (i = 0; i < PERF_NR_CONTEXTS; i++) {
> free_percpu(perf_trace_buf[i]);
> @@ -185,6 +149,27 @@ void perf_trace_destroy(struct perf_even
> mutex_unlock(&event_mutex);
> }
>
> +void perf_trace_destroy(struct perf_event *p_event)
> +{
> + __perf_trace_destroy(p_event->tp_event);
> +}

Is this a leftover? It doesn't seem to be used anymore.

Other than that, the whole thing looks good. Maybe I need to have
one more look at the trace_output.c changes, but the conversion
from hlist to idr seemed fine at a glance.

Adding Steve in Cc.

2010-12-01 18:37:40

by Peter Zijlstra

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Wed, 2010-12-01 at 19:29 +0100, Frederic Weisbecker wrote:
> > +static int perf_event_add_tp(struct perf_event *event, int tp_id)
> > +{
> > + if (event->attr.type != PERF_TYPE_TRACEPOINT &&
>
> Should it be || instead?
>
> > + event->attr.config != ~0ULL)
> > + return -EINVAL;
> > +
> > + return __perf_event_add_tp(event, tp_id);
> > +}

Probably,.. yeah.
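
I.e., presumably the guard should read:

    if (event->attr.type != PERF_TYPE_TRACEPOINT ||
        event->attr.config != ~0ULL)
            return -EINVAL;

so that ADD_TP is only accepted on a tracepoint collection event.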

2010-12-01 18:51:53

by Peter Zijlstra

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Wed, 2010-12-01 at 19:02 +0100, Frederic Weisbecker wrote:

> > struct task_struct {
> > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> > void *stack;
> > @@ -1452,6 +1458,9 @@ struct task_struct {
> > struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > struct mutex perf_event_mutex;
> > struct list_head perf_event_list;
> > +#ifdef CONFIG_EVENT_TRACING
> > + struct perf_tp_idr *perf_tp_idr;
>
> Why not attach this to the ctx eventually? That would be one pointer less
> in task_struct.

What context? :-) There are now two contexts (with the possibility of
even more); which one will hold the tracepoint stuff?

Also, since we only need one such structure, adding it to the context
doesn't make sense.

> > @@ -370,6 +372,7 @@ list_del_event(struct perf_event *event,
> > */
> > if (event->state > PERF_EVENT_STATE_OFF)
> > event->state = PERF_EVENT_STATE_OFF;
> > + ++ctx->generation;
>
> What's the role of ctx->generation? It seems to be incremented in two places
> but doesn't appear to have any purpose.

You didn't look hard enough; it's a sequence stamp on the context for
inheritance. Later, when we want to compare inherited contexts, we can
simply compare generation numbers: if they're the same, the contexts
are the same.
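
Roughly, as a sketch of the idea (not the exact code; parent_gen would
snapshot the parent's generation count at inherit time):

    static int context_equiv(struct perf_event_context *ctx1,
                             struct perf_event_context *ctx2)
    {
            /* cloned from the same parent, and no events added or
             * removed on either side since the clone */
            return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx &&
                   ctx1->parent_gen == ctx2->parent_gen;
    }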

> > }
> >
> > static void perf_group_detach(struct perf_event *event)
> > @@ -1228,6 +1231,12 @@ void perf_event_context_sched_out(struct
> > if (!cpuctx->task_ctx)
> > return;
> >
> > +#if 0
> > + /*
> > + * Need to sort out how to make task_struct::perf_tp_idr
> > + * work with this fancy switching stuff.. tracepoints could be
> > + * in multiple contexts due to the software event muck.
> > + */
>
> Not sure what the issue is here. Each ctx has the perf_tp_idr matching its
> active tracepoints, doesn't it?

No, there's only 1 idr per task. Having one per context means we have to
iterate all contexts when a tracepoint triggers and it adds yet another
pointer chase. It also means we have to manage more stuff when
tracepoints change context etc..

But yes, it would make this part easier, I just don't like the added
fast path overhead.

> > +static struct perf_tp_idr *perf_event_idr(struct perf_event *event, bool create)
> > +{
> > + struct perf_tp_idr *tp_idr;
> > + struct task_struct *task;
> > +
> > + if (event->attach_state & PERF_ATTACH_TASK) {
> > + task = event->hw.event_target;
> > + tp_idr = task->perf_tp_idr;
> > + if (!tp_idr && create)
>
> Is it possible that task->perf_tp_idr can eventually disappear
> under us there? Like when an event is released from that task?

We hold a ref on the task on the create path; I'd have to actually think
about the destroy path, but I think there's a problem there.

I used to have a PERF_ATTACH_CONTEXT/GROUP test in there as well, dunno
why that didn't work out.

> > + tp_idr = perf_tp_init_task(event, task);
> > + } else
> > + tp_idr = &per_cpu(perf_tp_idr, event->cpu);
> > +
> > + return tp_idr;
> > +}

2010-12-01 19:23:17

by Corey Ashford

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

Thanks for your reply, Peter.

On 12/01/2010 03:46 AM, Peter Zijlstra wrote:
> On Tue, 2010-11-30 at 17:00 -0800, Corey Ashford wrote:
>> Hi,
>>
>> I'm not sure whether what I'm seeing is a bug or something intentional.
>>
>> If I place multiple tracepoint events into a group and measure counts of
>> these events on a process, I get no counts for the tracepoint events
>> other than the group leader.
>>
>> Is this expected behavior?
>>
>> It's not clear to me why this should be the case; grouping shouldn't
>> have any ill effects on tracepoint events, from my understanding.
>>
>> I noticed this because my private version of the perf tool has the event
>> group patch https://lkml.org/lkml/2010/11/24/584 as well as the patch
>> which fixes the parsing of multiple tracepoint events in the same -e
>> switch: https://lkml.org/lkml/2010/11/30/460
>>
>> When I dig into the code a bit, I find that each event opens
>> successfully, so that's not the problem. If I disable the grouping,
>> then I get counts for all of the tracepoint events.
>
> Hrm,.. definitely not expected. I'll try and look into it, but I'm a bit
> over-committed atm.
>
> Also, I've started a rewrite of the whole tracepoint<-> perf
> interaction:
>
> http://lkml.org/lkml/2010/11/23/147
>
> Could you see if that cures your problem?

I've had some trouble getting recent kernels to boot on my Power5
machine, but I might be able to try it on my laptop.

>
> Another thing to test: does the same hold true for regular software
> events? Tracepoints and software events share a lot of infrastructure.

I just tried "perf stat -e context-switches,faults ..." on both my
laptop (running 2.6.35) and my Power5 machine (running 2.6.33), and I
get the same behavior when software events are grouped, e.g.:
% ./perf stat -e context-switches,faults ~/load 1000

 Performance counter stats for '/home/corey/load 1000':

            240  context-switches
  <not counted>  page-faults

    2.382022393  seconds time elapsed

So I suppose that points to a common flaw in the tracepoint and software
event logic.

With that in mind, would it still make sense to try out your tracepoint
patch?

- Corey

2010-12-01 19:30:49

by Peter Zijlstra

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Wed, 2010-12-01 at 11:22 -0800, Corey Ashford wrote:
>
> With that in mind, would it still make sense to try out your tracepoint
> patch?

Probably not..

2010-12-02 08:58:47

by Corey Ashford

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On 11/30/2010 05:00 PM, Corey Ashford wrote:
>
> Hi,
>
> I'm not sure whether what I'm seeing is a bug or something intentional.
>
> If I place multiple tracepoint events into a group and measure counts of
> these events on a process, I get no counts for the tracepoint events
> other than the group leader.
>
> Is this expected behavior?
>
> It's not clear to me why this should be the case; grouping shouldn't
> have any ill effects on tracepoint events, from my understanding.
>
> I noticed this because my private version of the perf tool has the event
> group patch https://lkml.org/lkml/2010/11/24/584 as well as the patch
> which fixes the parsing of multiple tracepoint events in the same -e
> switch: https://lkml.org/lkml/2010/11/30/460
>
> When I dig into the code a bit, I find that each event opens
> successfully, so that's not the problem. If I disable the grouping, then
> I get counts for all of the tracepoint events.

False alarm. I found a bug in my forward-ported version of the event
grouping patch for perf stat. I was setting the attr->disabled bit for
all of the events instead of just the group leader, when forking off the
command.
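
In other words, the setup should have been roughly (names illustrative):

    attr.disabled       = is_leader;  /* siblings must stay enabled */
    attr.enable_on_exec = is_leader;  /* group kicks in when the child execs */
    fd = sys_perf_event_open(&attr, pid, -1, is_leader ? -1 : leader_fd, 0);

since an individually disabled sibling never counts, even once the
leader's group is scheduled in.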

Software events, tracepoints, and hardware events all work when grouped
now. Sorry, I should have been more thorough in testing this on other
types of events. I had thought I had it working with hardware events,
and it was, but only if I attached to an existing process, not when
forking off a new command. Anyway, if I had tested with hardware events
in the same way I had tested tracepoints I would have seen the problem.

So the patch mentioned above, https://lkml.org/lkml/2010/11/24/584, has
a bug, but since it didn't get committed, I suppose it's not a problem.
If anyone is still interested, I can post a v4 of that patch.

Sorry for the bother!

- Corey

2010-12-02 17:57:00

by Frederic Weisbecker

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Wed, Dec 01, 2010 at 07:52:06PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 19:02 +0100, Frederic Weisbecker wrote:
>
> > > struct task_struct {
> > > volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> > > void *stack;
> > > @@ -1452,6 +1458,9 @@ struct task_struct {
> > > struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > > struct mutex perf_event_mutex;
> > > struct list_head perf_event_list;
> > > +#ifdef CONFIG_EVENT_TRACING
> > > + struct perf_tp_idr *perf_tp_idr;
> >
> > Why not attach this to the ctx eventually? That would be one pointer less
> > in task_struct.
>
> What context? :-) There are now two contexts (with the possibility of
> even more); which one will hold the tracepoint stuff?
>
> Also, since we only need one such structure, adding it to the context
> doesn't make sense.

Oh, you're right, I forgot the per-pmu context thing :)


>
> > > @@ -370,6 +372,7 @@ list_del_event(struct perf_event *event,
> > > */
> > > if (event->state > PERF_EVENT_STATE_OFF)
> > > event->state = PERF_EVENT_STATE_OFF;
> > > + ++ctx->generation;
> >
> > What's the role of ctx->generation? It seems to be incremented in two places
> > but doesn't appear to have any purpose.
>
> You didn't look hard enough; it's a sequence stamp on the context for
> inheritance. Later, when we want to compare inherited contexts, we can
> simply compare generation numbers: if they're the same, the contexts
> are the same.

Ah right.

> > > }
> > >
> > > static void perf_group_detach(struct perf_event *event)
> > > @@ -1228,6 +1231,12 @@ void perf_event_context_sched_out(struct
> > > if (!cpuctx->task_ctx)
> > > return;
> > >
> > > +#if 0
> > > + /*
> > > + * Need to sort out how to make task_struct::perf_tp_idr
> > > + * work with this fancy switching stuff.. tracepoints could be
> > > + * in multiple contexts due to the software event muck.
> > > + */
> >
> > Not sure what the issue is here. Each ctx has the perf_tp_idr matching its
> > active tracepoints, doesn't it?
>
> No, there's only 1 idr per task. Having one per context means we have to
> iterate all contexts when a tracepoint triggers and it adds yet another
> pointer chase. It also means we have to manage more stuff when
> tracepoints change context etc..
>
> But yes, it would make this part easier, I just don't like the added
> fast path overhead.

Ok.

2010-12-09 19:56:48

by Peter Zijlstra

Subject: Re: [BUG(?)] perf_events: combining multiple tracepoint events into a group produces no counts on member events

On Thu, 2010-12-02 at 00:58 -0800, Corey Ashford wrote:
> So the patch mentioned above, https://lkml.org/lkml/2010/11/24/584, has
> a bug, but since it didn't get committed, I suppose it's not a problem.
> If anyone is still interested, I can post a v4 of that patch.

Yeah, I think something like that would be useful.