2010-02-03 09:14:40

by Frederic Weisbecker

Subject: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

Hi,

There are many things going on in this patchset, addressing
different problems:

- remove most of the string copy overhead in the fast path
- open the way for lock class oriented profiling (as
opposed to lock instance profiling; both can be useful
in different ways)
- remove the buffer multiplexing (less contention)
- add event injection support
- remove violent lock event recursions (only 2 out of 3; the remaining
one is detailed below)

Some differences, by running:
perf lock record perf sched pipe -l 100000

Before the patchset:

Total time: 91.015 [sec]

910.157300 usecs/op
1098 ops/sec

After this patchset applied:

Total time: 43.706 [sec]

437.062080 usecs/op
2288 ops/sec

Although it's actually ~50 secs after the very latest patch in this
series. That patch is supposed to bring more scalability (and I believe
it does on a box with more than two cpus, although I can't test that).
Multiplexing the counters had a side effect: perf record has only
one buffer to drain instead of 5 * NR_CPUS, which makes its job
a bit easier (at the cost of cpu contention of course; on my Atom,
the scalability gain from dropping the multiplexing is not very visible).

And also, after this odd patch:

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 98fd360..254b3d4 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
if (event->parent)
event = event->parent;

- return task_pid_nr_ns(p, event->ns);
+ return p->pid;
}

We get:

Total time: 26.170 [sec]

261.707960 usecs/op
3821 ops/sec

I.e.: 2x faster than with this patchset alone, and more than 3x faster
than tip:/perf/core.

This is because task_pid_nr_ns() takes a lock and creates
lock event recursion. We really need to fix that.
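
(For reference, the lock in question is presumably the RCU read lock taken
for the pid namespace lookup. Below is a paraphrased, illustrative sketch of
__task_pid_nr_ns() from kernel/pid.c of that era, not part of this patchset,
showing where the lock events come from:)

pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
		       struct pid_namespace *ns)
{
	pid_t nr = 0;

	rcu_read_lock();	/* -> lock_acquire() -> lock trace event */
	if (pid_alive(task))
		nr = pid_nr_ns(task->pids[type].pid, ns);
	rcu_read_unlock();	/* -> lock_release() -> lock trace event */

	return nr;
}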

You can pull this patchset from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
perf/core

Thanks.


---

Frederic Weisbecker (11):
tracing: Add lock_class_init event
tracing: Introduce TRACE_EVENT_INJECT
tracing: Inject lock_class_init events on registration
tracing: Add lock class id in lock_acquire event
perf: New PERF_EVENT_IOC_INJECT ioctl
perf: Handle injection ioctl with trace events
perf: Handle injection ioctl for tracepoints from perf record
perf/lock: Add support for lock_class_init events
tracing: Remove the lock name from most lock events
tracing/perf: Fix lock events recursions in the fast path
perf lock: Drop the buffers multiplexing dependency


include/linux/ftrace_event.h | 6 +-
include/linux/lockdep.h | 4 +
include/linux/perf_event.h | 6 +
include/linux/tracepoint.h | 3 +
include/trace/define_trace.h | 6 +
include/trace/events/lock.h | 57 ++++--
include/trace/ftrace.h | 31 +++-
kernel/lockdep.c | 16 ++
kernel/perf_event.c | 47 ++++-
kernel/trace/trace_event_profile.c | 46 +++--
kernel/trace/trace_events.c | 3 +
tools/perf/builtin-lock.c | 345 ++++++++++++++++++++++++++++++++----
tools/perf/builtin-record.c | 9 +
13 files changed, 497 insertions(+), 82 deletions(-)


2010-02-03 09:14:43

by Frederic Weisbecker

Subject: [PATCH 01/11] tracing: Add lock_class_init event

Add a lock_class_init event that carries information about each
new lock class that gets created.

This event is required for the following plans:

- We want to get rid of the lock name copy from each high-frequency
lock event.
- But we still want to be able to retrieve the name during
post-processing.
- We also want to be able to map lock instances to their
corresponding lock classes. Instrumenting the locks can
make sense at both the class and the instance level, depending on
the desired level of detail.

This new event lets us map a lock class to its name.
The plan is also to drop the name from the other lock events,
keeping only the lock instance id, and to add the lock class id
to the lock_acquire event (which marks the beginning of a lock
scenario).

With such a layout, we can retrieve the lock instance,
class and name for every locking sequence.
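
(Purely as an illustration of what this enables, and not part of the patch:
a post-processing tool only needs a small map keyed by class id, filled from
lock_class_init events, to resolve a name for any later lock event. All
names below are made up.)

struct class_name_entry {
	void			*class_id;	/* pointer value carried in the events */
	char			*name;		/* copied from the lock_class_init payload */
	struct class_name_entry *next;
};

static struct class_name_entry *class_names;

static const char *resolve_class_name(void *class_id)
{
	struct class_name_entry *e;

	for (e = class_names; e; e = e->next)
		if (e->class_id == class_id)
			return e->name;

	return "<unknown>";
}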

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/trace/events/lock.h | 24 ++++++++++++++++++++++++
kernel/lockdep.c | 4 ++++
2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 5c1dcfc..90af03c 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -9,6 +9,30 @@

#ifdef CONFIG_LOCKDEP

+TRACE_EVENT(lock_class_init,
+
+ TP_PROTO(struct lock_class *class),
+
+ TP_ARGS(class),
+
+ TP_STRUCT__entry(
+ __string(class_name, class->name)
+ __field(void *, class_id)
+ ),
+
+ /*
+ * The pointer to the class name is shared between a lock instance
+ * (lockdep_map) and its lock class. So we can use the name pointer
+ * as a class id and as a link between an instance and its class.
+ */
+ TP_fast_assign(
+ __assign_str(class_name, class->name);
+ __entry->class_id = (void *)class->name;
+ ),
+
+ TP_printk("%p %s", __entry->class_id, __get_str(class_name))
+);
+
TRACE_EVENT(lock_acquire,

TP_PROTO(struct lockdep_map *lock, unsigned int subclass,
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 5feaddc..43758dd 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -757,6 +757,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
dump_stack();
return NULL;
}
+
class = lock_classes + nr_lock_classes++;
debug_atomic_inc(&nr_unused_locks);
class->key = key;
@@ -766,6 +767,9 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
INIT_LIST_HEAD(&class->locks_before);
INIT_LIST_HEAD(&class->locks_after);
class->name_version = count_matching_names(class);
+
+ trace_lock_class_init(class);
+
/*
* We use RCU's safe list-add method to make
* parallel walking of the hash-list safe:
--
1.6.2.3

2010-02-03 09:14:50

by Frederic Weisbecker

Subject: [PATCH 03/11] tracing: Inject lock_class_init events on registration

When a user enables the lock_class_init event, automatically
inject the existing lock_class_id:name mappings that
lockdep has stored.

This will help the user retrieve the name of the lock
classes from the other lock events, once we remove
the name from them.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/lockdep.h | 4 ++++
include/trace/events/lock.h | 11 +++++++++--
kernel/lockdep.c | 12 ++++++++++++
3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 9ccf0e2..0b662fc 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -323,6 +323,10 @@ extern void lockdep_trace_alloc(gfp_t mask);

#define lockdep_assert_held(l) WARN_ON(debug_locks && !lockdep_is_held(l))

+#ifdef CONFIG_EVENT_TRACING
+extern void lock_class_init_inject_events(void);
+#endif
+
#else /* !LOCKDEP */

static inline void lockdep_off(void)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 90af03c..6999f16 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -9,7 +9,7 @@

#ifdef CONFIG_LOCKDEP

-TRACE_EVENT(lock_class_init,
+TRACE_EVENT_INJECT(lock_class_init,

TP_PROTO(struct lock_class *class),

@@ -30,7 +30,14 @@ TRACE_EVENT(lock_class_init,
__entry->class_id = (void *)class->name;
),

- TP_printk("%p %s", __entry->class_id, __get_str(class_name))
+ TP_printk("%p %s", __entry->class_id, __get_str(class_name)),
+
+ /*
+ * On activation, we want to send an event for each lock
+ * classes that have been recorded by lockdep, so that we
+ * catch up with the existing class:name mappings.
+ */
+ lock_class_init_inject_events
);

TRACE_EVENT(lock_acquire,
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 43758dd..4d9eef8 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -3813,3 +3813,15 @@ void lockdep_sys_exit(void)
lockdep_print_held_locks(curr);
}
}
+
+
+#ifdef CONFIG_EVENT_TRACING
+void lock_class_init_inject_events(void)
+{
+ struct lock_class *class;
+
+ /* Safe to be taken locklessly as it only moves forward */
+ list_for_each_entry(class, &all_lock_classes, lock_entry)
+ trace_lock_class_init(class);
+}
+#endif
--
1.6.2.3

2010-02-03 09:15:00

by Frederic Weisbecker

Subject: [PATCH 09/11] tracing: Remove the lock name from most lock events

Remove the lock name copy from every lock event but
lock_class_init. We don't need it anymore now that we can
resolve the name from the lock class id.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/trace/events/lock.h | 23 +++++------------------
1 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 3c4a5c1..ced4ceb 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -50,23 +50,20 @@ TRACE_EVENT(lock_acquire,

TP_STRUCT__entry(
__field(unsigned int, flags)
- __string(name, lock->name)
__field(void *, lockdep_addr)
__field(void *, class_id)
),

TP_fast_assign(
__entry->flags = (trylock ? 1 : 0) | (read ? 2 : 0);
- __assign_str(name, lock->name);
__entry->lockdep_addr = lock;
__entry->class_id = (void *)lock->name;
),

- TP_printk("%p %p %s%s%s", __entry->lockdep_addr,
+ TP_printk("%p %p %s%s", __entry->lockdep_addr,
__entry->class_id,
(__entry->flags & 1) ? "try " : "",
- (__entry->flags & 2) ? "read " : "",
- __get_str(name))
+ (__entry->flags & 2) ? "read " : "")
);

TRACE_EVENT(lock_release,
@@ -76,17 +73,14 @@ TRACE_EVENT(lock_release,
TP_ARGS(lock, nested, ip),

TP_STRUCT__entry(
- __string(name, lock->name)
__field(void *, lockdep_addr)
),

TP_fast_assign(
- __assign_str(name, lock->name);
__entry->lockdep_addr = lock;
),

- TP_printk("%p %s",
- __entry->lockdep_addr, __get_str(name))
+ TP_printk("%p", __entry->lockdep_addr)
);

#ifdef CONFIG_LOCK_STAT
@@ -98,17 +92,14 @@ TRACE_EVENT(lock_contended,
TP_ARGS(lock, ip),

TP_STRUCT__entry(
- __string(name, lock->name)
__field(void *, lockdep_addr)
),

TP_fast_assign(
- __assign_str(name, lock->name);
__entry->lockdep_addr = lock;
),

- TP_printk("%p %s",
- __entry->lockdep_addr, __get_str(name))
+ TP_printk("%p", __entry->lockdep_addr)
);

TRACE_EVENT(lock_acquired,
@@ -117,19 +108,15 @@ TRACE_EVENT(lock_acquired,
TP_ARGS(lock, ip, waittime),

TP_STRUCT__entry(
- __string(name, lock->name)
__field(s64, wait_nsec)
__field(void *, lockdep_addr)
),

TP_fast_assign(
- __assign_str(name, lock->name);
__entry->wait_nsec = waittime;
__entry->lockdep_addr = lock;
),
- TP_printk("%p %s (%llu ns)", __entry->lockdep_addr,
- __get_str(name),
- __entry->wait_nsec)
+ TP_printk("%p (%llu ns)", __entry->lockdep_addr, __entry->wait_nsec)
);

#endif
--
1.6.2.3

2010-02-03 09:14:47

by Frederic Weisbecker

Subject: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

The TRACE_EVENT_INJECT macro is the same as TRACE_EVENT but takes one
more parameter that defines an "inject" callback to be called when
the event is enabled.

This is useful when we need to catch up with events that have
already occurred but that the user still needs.
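
(Hypothetical usage sketch, only to illustrate the extra parameter;
foo_init and foo_inject_existing are made-up names. The definition looks
like a regular TRACE_EVENT plus the inject callback at the end:)

TRACE_EVENT_INJECT(foo_init,

	TP_PROTO(struct foo *f),

	TP_ARGS(f),

	TP_STRUCT__entry(
		__field(void *, id)
	),

	TP_fast_assign(
		__entry->id = f;
	),

	TP_printk("%p", __entry->id),

	/* called when the event gets enabled, to catch up with existing foos */
	foo_inject_existing
);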

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/ftrace_event.h | 1 +
include/linux/tracepoint.h | 3 +++
include/trace/define_trace.h | 6 ++++++
include/trace/ftrace.h | 31 ++++++++++++++++++++++++++++++-
kernel/trace/trace_events.c | 3 +++
5 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cd95919..026d39b 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -126,6 +126,7 @@ struct ftrace_event_call {
int (*show_format)(struct ftrace_event_call *,
struct trace_seq *);
int (*define_fields)(struct ftrace_event_call *);
+ void (*inject)(void);
struct list_head fields;
int filter_active;
struct event_filter *filter;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index f59604e..f114aec 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -291,5 +291,8 @@ static inline void tracepoint_synchronize_unregister(void)
#define TRACE_EVENT_FN(name, proto, args, struct, \
assign, print, reg, unreg) \
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
+#define TRACE_EVENT_INJECT(name, proto, args, struct, \
+ assign, print, inject) \
+ DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))

#endif /* ifdef TRACE_EVENT (see note above) */
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 5acfb1e..41f7ce3 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -31,6 +31,11 @@
assign, print, reg, unreg) \
DEFINE_TRACE_FN(name, reg, unreg)

+#undef TRACE_EVENT_INJECT
+#define TRACE_EVENT_INJECT(name, proto, args, tstruct, \
+ assign, print, inject) \
+ DEFINE_TRACE(name)
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args) \
DEFINE_TRACE(name)
@@ -71,6 +76,7 @@

#undef TRACE_EVENT
#undef TRACE_EVENT_FN
+#undef TRACE_EVENT_INJECT
#undef DECLARE_EVENT_CLASS
#undef DEFINE_EVENT
#undef DEFINE_EVENT_PRINT
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index f2c09e4..869da37 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -37,6 +37,26 @@
PARAMS(print)); \
DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));

+/*
+ * TRACE_EVENT_INJECT creates an event that has an injector callback
+ * to call when the trace event is enabled, usually to trigger
+ * automatically some necessary initial traces.
+ */
+#undef TRACE_EVENT_INJECT
+#define TRACE_EVENT_INJECT(name, proto, args, tstruct, \
+ assign, print, inject) \
+ DECLARE_EVENT_CLASS(name, \
+ PARAMS(proto), \
+ PARAMS(args), \
+ PARAMS(tstruct), \
+ PARAMS(assign), \
+ PARAMS(print)); \
+ DEFINE_EVENT_INJECT(name, name, PARAMS(proto), PARAMS(args), inject);
+
+#undef DEFINE_EVENT_INJECT
+#define DEFINE_EVENT_INJECT(template, name, proto, args, inject) \
+ DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args));
+

#undef __field
#define __field(type, item) type item;
@@ -726,7 +746,11 @@ static struct trace_event ftrace_event_type_##call = { \
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)

#undef DEFINE_EVENT
-#define DEFINE_EVENT(template, call, proto, args) \
+#define DEFINE_EVENT(template, call, proto, args) \
+ DEFINE_EVENT_INJECT(template, call, PARAMS(proto), PARAMS(args), NULL)
+
+#undef DEFINE_EVENT_INJECT
+#define DEFINE_EVENT_INJECT(template, call, proto, args, injector) \
\
static struct ftrace_event_call __used \
__attribute__((__aligned__(4))) \
@@ -739,6 +763,7 @@ __attribute__((section("_ftrace_events"))) event_##call = { \
.unregfunc = ftrace_raw_unreg_event_##call, \
.show_format = ftrace_format_##template, \
.define_fields = ftrace_define_fields_##template, \
+ .inject = injector, \
_TRACE_PROFILE_INIT(call) \
}

@@ -877,6 +902,10 @@ ftrace_profile_templ_##call(struct ftrace_event_call *event_call, \
__count, irq_flags); \
}

+#undef DEFINE_EVENT_INJECT
+#define DEFINE_EVENT_INJECT(template, call, proto, args, inject) \
+ DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args))
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, call, proto, args) \
static void ftrace_profile_##call(proto) \
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 189b09b..5c75cc7 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -142,6 +142,9 @@ static int ftrace_event_enable_disable(struct ftrace_event_call *call,
break;
}
call->enabled = 1;
+
+ if (call->inject)
+ call->inject();
}
break;
}
--
1.6.2.3

2010-02-03 09:16:35

by Frederic Weisbecker

Subject: [PATCH 06/11] perf: Handle injection ioctl with trace events

Handle the injection ioctl request for tracepoint events that support it.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/ftrace_event.h | 5 ++-
include/linux/perf_event.h | 4 +++
kernel/perf_event.c | 14 ++++++++--
kernel/trace/trace_event_profile.c | 46 ++++++++++++++++++++++-------------
4 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 026d39b..9b45db6 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -189,8 +189,9 @@ do { \

#ifdef CONFIG_PERF_EVENTS
struct perf_event;
-extern int ftrace_profile_enable(int event_id);
-extern void ftrace_profile_disable(int event_id);
+extern int ftrace_profile_enable(struct perf_event *event);
+extern void ftrace_profile_disable(struct perf_event *event);
+extern void ftrace_perf_inject(struct perf_event *event);
extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
char *filter_str);
extern void ftrace_profile_free_filter(struct perf_event *event);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2e83f0..9e58ab9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -492,6 +492,10 @@ struct hw_perf_event {
struct arch_hw_breakpoint info;
};
#endif
+#ifdef CONFIG_EVENT_TRACING
+ /* tracepoint */
+ struct ftrace_event_call *tp_event;
+#endif
};
atomic64_t prev_count;
u64 sample_period;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index e4dfd12..280ae44 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -4350,6 +4350,14 @@ static const struct pmu perf_ops_task_clock = {

#ifdef CONFIG_EVENT_TRACING

+static const struct pmu perf_ops_tracepoints = {
+ .enable = perf_swevent_enable,
+ .disable = perf_swevent_disable,
+ .read = perf_swevent_read,
+ .unthrottle = perf_swevent_unthrottle,
+ .inject = ftrace_perf_inject,
+};
+
void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
int entry_size)
{
@@ -4386,7 +4394,7 @@ static int perf_tp_event_match(struct perf_event *event,

static void tp_perf_event_destroy(struct perf_event *event)
{
- ftrace_profile_disable(event->attr.config);
+ ftrace_profile_disable(event);
}

static const struct pmu *tp_perf_event_init(struct perf_event *event)
@@ -4400,12 +4408,12 @@ static const struct pmu *tp_perf_event_init(struct perf_event *event)
!capable(CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);

- if (ftrace_profile_enable(event->attr.config))
+ if (ftrace_profile_enable(event))
return NULL;

event->destroy = tp_perf_event_destroy;

- return &perf_ops_generic;
+ return &perf_ops_tracepoints;
}

static int perf_event_set_filter(struct perf_event *event, void __user *arg)
diff --git a/kernel/trace/trace_event_profile.c b/kernel/trace/trace_event_profile.c
index f0d6930..9ed684f 100644
--- a/kernel/trace/trace_event_profile.c
+++ b/kernel/trace/trace_event_profile.c
@@ -13,6 +13,8 @@
static char *perf_trace_buf;
static char *perf_trace_buf_nmi;

+static DEFINE_MUTEX(perf_trace_mutex);
+
typedef typeof(char [FTRACE_MAX_PROFILE_SIZE]) perf_trace_t ;

/* Count the events in use (per event id, not per instance) */
@@ -59,21 +61,27 @@ fail_buf:
return ret;
}

-int ftrace_profile_enable(int event_id)
+int ftrace_profile_enable(struct perf_event *event)
{
- struct ftrace_event_call *event;
+ int event_id = event->attr.config;
+ struct ftrace_event_call *call;
int ret = -EINVAL;

mutex_lock(&event_mutex);
- list_for_each_entry(event, &ftrace_events, list) {
- if (event->id == event_id && event->profile_enable &&
- try_module_get(event->mod)) {
- ret = ftrace_profile_enable_event(event);
+ list_for_each_entry(call, &ftrace_events, list) {
+ if (call->id == event_id && call->profile_enable &&
+ try_module_get(call->mod)) {
+ mutex_lock(&perf_trace_mutex);
+ ret = ftrace_profile_enable_event(call);
+ mutex_unlock(&perf_trace_mutex);
break;
}
}
mutex_unlock(&event_mutex);

+ if (!ret)
+ event->hw.tp_event = call;
+
return ret;
}

@@ -104,19 +112,23 @@ static void ftrace_profile_disable_event(struct ftrace_event_call *event)
}
}

-void ftrace_profile_disable(int event_id)
+void ftrace_profile_disable(struct perf_event *event)
{
- struct ftrace_event_call *event;
+ struct ftrace_event_call *call = event->hw.tp_event;

- mutex_lock(&event_mutex);
- list_for_each_entry(event, &ftrace_events, list) {
- if (event->id == event_id) {
- ftrace_profile_disable_event(event);
- module_put(event->mod);
- break;
- }
- }
- mutex_unlock(&event_mutex);
+ mutex_lock(&perf_trace_mutex);
+ ftrace_profile_disable_event(call);
+ mutex_unlock(&perf_trace_mutex);
+
+ module_put(call->mod);
+}
+
+void ftrace_perf_inject(struct perf_event *event)
+{
+ struct ftrace_event_call *call = event->hw.tp_event;
+
+ if (call->inject)
+ call->inject();
}

__kprobes void *ftrace_perf_buf_prepare(int size, unsigned short type,
--
1.6.2.3

2010-02-03 09:15:15

by Frederic Weisbecker

Subject: [PATCH 11/11] perf lock: Drop the buffers multiplexing dependency

We need time-ordered events to build a correct state machine of
lock events. This is why we multiplex the lock event buffers. But
that ordering is done in the kernel, on the tracing fast path,
leading to high contention between cpus.

Without multiplexing, the events only appear weakly ordered.
If we have four events, each split per cpu, perf record will
read the event buffers in the following order:

[ CPU0 ev0, CPU0 ev1, CPU0 ev2, CPU0 ev3, CPU1 ev0, CPU1 ev1, ... ]

To reorder during post-processing, we could just read and sort
everything in memory, but that doesn't scale with a high number of
events: lock events can fill huge amounts of memory in no time.

Basically we need to sort in memory and find a "grace period"
point at which we know that a given slice of previously sorted events
can be committed for post-processing, so that we can release the
memory step by step and keep the sorting list at a scalable size.

There is no strong rule about how to define such a "grace period".
What this patch does is:

We define a FLUSH_PERIOD value that defines a grace period in
seconds.
We want to keep a slice of events covering 2 * FLUSH_PERIOD in our
sorted list.
If FLUSH_PERIOD is big enough, it ensures that every event that
occurred in the first half of the timeslice has been buffered and
that no further event will land inside that first half. Then once
we reach the 2 * FLUSH_PERIOD timeslice, we flush the first half
to be gentle with memory (the second half can still receive new
events in the middle, so we wait another period before flushing it).

FLUSH_PERIOD is defined to 5 seconds. Say the first event occurred
at time t0. We can safely assume that by the time we are processing
events at t0 + 10 seconds, there won't be any more events to read
from perf.data that occurred between t0 and t0 + 5 seconds. Hence
we can safely flush the first half.

To point out funky bugs, a guard checks that a new event's timestamp
is not below the timestamp of the last flushed event, and prints a
warning in that case.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
tools/perf/builtin-lock.c | 148 ++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index e1133b7..759cfeb 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -572,6 +572,150 @@ process_raw_event(void *data, int cpu,
process_lock_class_init_event(data, event, cpu, timestamp, thread);
}

+struct raw_event_queue {
+ u64 timestamp;
+ int cpu;
+ void *data;
+ struct thread *thread;
+ struct list_head list;
+};
+
+static LIST_HEAD(raw_event_head);
+
+#define FLUSH_PERIOD (5 * NSEC_PER_SEC)
+
+static u64 flush_limit = ULLONG_MAX;
+static u64 last_flush = 0;
+struct raw_event_queue *last_inserted;
+
+static void flush_raw_event_queue(u64 limit)
+{
+ struct raw_event_queue *tmp, *iter;
+
+ list_for_each_entry_safe(iter, tmp, &raw_event_head, list) {
+ if (iter->timestamp > limit)
+ return;
+
+ if (iter == last_inserted)
+ last_inserted = NULL;
+
+ process_raw_event(iter->data, iter->cpu, iter->timestamp,
+ iter->thread);
+
+ last_flush = iter->timestamp;
+ list_del(&iter->list);
+ free(iter->data);
+ free(iter);
+ }
+}
+
+static void __queue_raw_event_end(struct raw_event_queue *new)
+{
+ struct raw_event_queue *iter;
+
+ list_for_each_entry_reverse(iter, &raw_event_head, list) {
+ if (iter->timestamp < new->timestamp) {
+ list_add(&new->list, &iter->list);
+ return;
+ }
+ }
+
+ list_add(&new->list, &raw_event_head);
+}
+
+static void __queue_raw_event_before(struct raw_event_queue *new,
+ struct raw_event_queue *iter)
+{
+ list_for_each_entry_continue_reverse(iter, &raw_event_head, list) {
+ if (iter->timestamp < new->timestamp) {
+ list_add(&new->list, &iter->list);
+ return;
+ }
+ }
+
+ list_add(&new->list, &raw_event_head);
+}
+
+static void __queue_raw_event_after(struct raw_event_queue *new,
+ struct raw_event_queue *iter)
+{
+ list_for_each_entry_continue(iter, &raw_event_head, list) {
+ if (iter->timestamp > new->timestamp) {
+ list_add_tail(&new->list, &iter->list);
+ return;
+ }
+ }
+ list_add_tail(&new->list, &raw_event_head);
+}
+
+/* The queue is ordered by time */
+static void __queue_raw_event(struct raw_event_queue *new)
+{
+ if (!last_inserted) {
+ __queue_raw_event_end(new);
+ return;
+ }
+
+ /*
+ * Most of the time the current event has a timestamp
+ * very close to the last event inserted, unless we just switched
+ * to another event buffer. Having a sorting based on a list and
+ * on the last inserted event that is close to the current one is
+ * probably more efficient than an rbtree based sorting.
+ */
+ if (last_inserted->timestamp >= new->timestamp)
+ __queue_raw_event_before(new, last_inserted);
+ else
+ __queue_raw_event_after(new, last_inserted);
+}
+
+static void queue_raw_event(void *data, int raw_size, int cpu,
+ u64 timestamp, struct thread *thread)
+{
+ struct raw_event_queue *new;
+
+ if (flush_limit == ULLONG_MAX)
+ flush_limit = timestamp + FLUSH_PERIOD;
+
+ if (timestamp < last_flush) {
+ printf("Warning: Timestamp below last timeslice flush\n");
+ return;
+ }
+
+ new = malloc(sizeof(*new));
+ if (!new)
+ die("Not enough memory\n");
+
+ new->timestamp = timestamp;
+ new->cpu = cpu;
+ new->thread = thread;
+
+ new->data = malloc(raw_size);
+ if (!new->data)
+ die("Not enough memory\n");
+
+ memcpy(new->data, data, raw_size);
+
+ __queue_raw_event(new);
+ last_inserted = new;
+
+ /*
+ * We want to have a slice of events covering 2 * FLUSH_PERIOD
+ * If FLUSH_PERIOD is big enough, it ensures every events that occured
+ * in the first half of the timeslice have all been buffered and there
+ * are none remaining (we need that because of the weakly ordered
+ * event recording we have). Then once we reach the 2 * FLUSH_PERIOD
+ * timeslice, we flush the first half to be gentle with the memory
+ * (the second half can still get new events in the middle, so wait
+ * another period to flush it)
+ */
+ if (new->timestamp > flush_limit &&
+ new->timestamp - flush_limit > FLUSH_PERIOD) {
+ flush_limit += FLUSH_PERIOD;
+ flush_raw_event_queue(flush_limit);
+ }
+}
+
static int process_sample_event(event_t *event, struct perf_session *session)
{
struct thread *thread;
@@ -592,7 +736,7 @@ static int process_sample_event(event_t *event, struct perf_session *session)
if (profile_cpu != -1 && profile_cpu != (int) data.cpu)
return 0;

- process_raw_event(data.raw_data, data.cpu, data.time, thread);
+ queue_raw_event(data.raw_data, data.raw_size, data.cpu, data.time, thread);

return 0;
}
@@ -702,6 +846,7 @@ static void __cmd_report(void)
setup_pager();
select_key();
read_events();
+ flush_raw_event_queue(ULLONG_MAX);
sort_result();
print_result();
}
@@ -734,7 +879,6 @@ static const char *record_args[] = {
"record",
"-a",
"-R",
- "-M",
"-f",
"-m", "1024",
"-c", "1",
--
1.6.2.3

2010-02-03 09:14:56

by Frederic Weisbecker

Subject: [PATCH 08/11] perf/lock: Add support for lock_class_init events

Add support for the new lock_class_init event to perf lock.
Retrieve the lock names from these events and drop support
for the name field in the other lock events.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
tools/perf/builtin-lock.c | 197 +++++++++++++++++++++++++++++++++++++--------
1 files changed, 163 insertions(+), 34 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index fb9ab2a..e1133b7 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -28,24 +28,38 @@
#define LOCKHASH_SIZE (1UL << LOCKHASH_BITS)

static struct list_head lockhash_table[LOCKHASH_SIZE];
+static struct list_head classhash_table[LOCKHASH_SIZE];

#define __lockhashfn(key) hash_long((unsigned long)key, LOCKHASH_BITS)
#define lockhashentry(key) (lockhash_table + __lockhashfn((key)))
+#define classhashentry(key) (classhash_table + __lockhashfn((key)))

#define LOCK_STATE_UNLOCKED 0 /* initial state */
#define LOCK_STATE_LOCKED 1

+/*
+ * Can be extended for class scope profiling
+ */
+struct class_stat {
+ struct list_head hash_entry;
+ void *addr;
+ char *name;
+ struct list_head lock_list;
+};
+
struct lock_stat {
struct list_head hash_entry;
struct rb_node rb; /* used for sorting */

+ struct list_head lock_list; /* list in the class */
+
/*
* FIXME: raw_field_value() returns unsigned long long,
* so address of lockdep_map should be dealed as 64bit.
* Is there more better solution?
*/
void *addr; /* address of lockdep_map, used as ID */
- char *name; /* for strcpy(), we cannot use const */
+ struct class_stat *class;

int state;
u64 prev_event_time; /* timestamp of previous event */
@@ -155,34 +169,89 @@ static struct lock_stat *pop_from_result(void)
return container_of(node, struct lock_stat, rb);
}

-static struct lock_stat *lock_stat_findnew(void *addr, const char *name)
+static void class_add_lock(struct class_stat *class, struct lock_stat *lock)
+{
+ if (!class)
+ return;
+
+ /* NOTE: we may want to handle class changes in the future */
+ list_del(&lock->lock_list);
+ list_add_tail(&lock->lock_list, &class->lock_list);
+ lock->class = class;
+}
+
+static struct lock_stat *lock_stat_findnew(void *addr, struct class_stat *class)
{
struct list_head *entry = lockhashentry(addr);
- struct lock_stat *ret, *new;
+ struct lock_stat *ret;
+
+ if (class) {
+ list_for_each_entry(ret, &class->lock_list, lock_list) {
+ if (ret->addr == addr)
+ return ret;
+ }
+ }

list_for_each_entry(ret, entry, hash_entry) {
- if (ret->addr == addr)
+ if (ret->addr == addr) {
+ class_add_lock(class, ret);
return ret;
+ }
}

- new = zalloc(sizeof(struct lock_stat));
- if (!new)
+ ret = zalloc(sizeof(struct lock_stat));
+ if (!ret)
goto alloc_failed;

- new->addr = addr;
- new->name = zalloc(sizeof(char) * strlen(name) + 1);
- if (!new->name)
- goto alloc_failed;
- strcpy(new->name, name);
+ ret->addr = addr;
+ INIT_LIST_HEAD(&ret->lock_list);
+ class_add_lock(class, ret);

/* LOCK_STATE_UNLOCKED == 0 isn't guaranteed forever */
- new->state = LOCK_STATE_UNLOCKED;
- new->wait_time_min = ULLONG_MAX;
+ ret->state = LOCK_STATE_UNLOCKED;
+ ret->wait_time_min = ULLONG_MAX;
+
+ list_add_tail(&ret->hash_entry, entry);
+
+ return ret;

- list_add(&new->hash_entry, entry);
- return new;
+ alloc_failed:
+ die("memory allocation failed\n");
+}
+
+static struct class_stat *class_stat_findnew(void *addr, const char *name)
+{
+ struct list_head *entry = classhashentry(addr);
+ struct class_stat *ret;
+
+ list_for_each_entry(ret, entry, hash_entry) {
+ if (ret->addr == addr) {
+ if (!ret->name && name) {
+ ret->name = strdup(name);
+ if (!ret->name)
+ goto alloc_failed;
+ }
+ return ret;
+ }
+ }
+
+ ret = zalloc(sizeof(struct class_stat));
+ if (!ret)
+ goto alloc_failed;
+
+ ret->addr = addr;
+ INIT_LIST_HEAD(&ret->lock_list);
+ if (name) {
+ ret->name = strdup(name);
+ if (!ret->name)
+ goto alloc_failed;
+ }
+
+ list_add_tail(&ret->hash_entry, entry);

-alloc_failed:
+ return ret;
+
+ alloc_failed:
die("memory allocation failed\n");
}

@@ -195,23 +264,29 @@ struct raw_event_sample {
char data[0];
};

+/*
+ * For now we keep the below as is with only one field.
+ * These structures may be filled further
+ */
struct trace_acquire_event {
void *addr;
- const char *name;
+ void *class_id;
};

struct trace_acquired_event {
void *addr;
- const char *name;
};

struct trace_contended_event {
void *addr;
- const char *name;
};

struct trace_release_event {
void *addr;
+};
+
+struct trace_init_event {
+ void *class_id;
const char *name;
};

@@ -239,6 +314,12 @@ struct trace_lock_handler {
int cpu,
u64 timestamp,
struct thread *thread);
+
+ void (*init_event)(struct trace_init_event *,
+ struct event *,
+ int cpu,
+ u64 timestamp,
+ struct thread *thread);
};

static void
@@ -248,11 +329,13 @@ report_lock_acquire_event(struct trace_acquire_event *acquire_event,
u64 timestamp,
struct thread *thread __used)
{
- struct lock_stat *st;
+ struct class_stat *class;
+ struct lock_stat *lock;

- st = lock_stat_findnew(acquire_event->addr, acquire_event->name);
+ class = class_stat_findnew(acquire_event->class_id, NULL);
+ lock = lock_stat_findnew(acquire_event->addr, class);

- switch (st->state) {
+ switch (lock->state) {
case LOCK_STATE_UNLOCKED:
break;
case LOCK_STATE_LOCKED:
@@ -262,7 +345,7 @@ report_lock_acquire_event(struct trace_acquire_event *acquire_event,
break;
}

- st->prev_event_time = timestamp;
+ lock->prev_event_time = timestamp;
}

static void
@@ -274,7 +357,7 @@ report_lock_acquired_event(struct trace_acquired_event *acquired_event,
{
struct lock_stat *st;

- st = lock_stat_findnew(acquired_event->addr, acquired_event->name);
+ st = lock_stat_findnew(acquired_event->addr, NULL);

switch (st->state) {
case LOCK_STATE_UNLOCKED:
@@ -300,7 +383,7 @@ report_lock_contended_event(struct trace_contended_event *contended_event,
{
struct lock_stat *st;

- st = lock_stat_findnew(contended_event->addr, contended_event->name);
+ st = lock_stat_findnew(contended_event->addr, NULL);

switch (st->state) {
case LOCK_STATE_UNLOCKED:
@@ -326,7 +409,7 @@ report_lock_release_event(struct trace_release_event *release_event,
struct lock_stat *st;
u64 hold_time;

- st = lock_stat_findnew(release_event->addr, release_event->name);
+ st = lock_stat_findnew(release_event->addr, NULL);

switch (st->state) {
case LOCK_STATE_UNLOCKED:
@@ -357,6 +440,16 @@ end:
st->prev_event_time = timestamp;
}

+static void
+report_lock_class_init_event(struct trace_init_event *init_event,
+ struct event *__event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+ class_stat_findnew(init_event->class_id, init_event->name);
+}
+
/* lock oriented handlers */
/* TODO: handlers for CPU oriented, thread oriented */
static struct trace_lock_handler report_lock_ops = {
@@ -364,6 +457,7 @@ static struct trace_lock_handler report_lock_ops = {
.acquired_event = report_lock_acquired_event,
.contended_event = report_lock_contended_event,
.release_event = report_lock_release_event,
+ .init_event = report_lock_class_init_event,
};

static struct trace_lock_handler *trace_handler;
@@ -380,7 +474,8 @@ process_lock_acquire_event(void *data,

tmp = raw_field_value(event, "lockdep_addr", data);
memcpy(&acquire_event.addr, &tmp, sizeof(void *));
- acquire_event.name = (char *)raw_field_ptr(event, "name", data);
+ tmp = raw_field_value(event, "class_id", data);
+ memcpy(&acquire_event.class_id, &tmp, sizeof(void *));

if (trace_handler->acquire_event)
trace_handler->acquire_event(&acquire_event, event, cpu, timestamp, thread);
@@ -398,7 +493,6 @@ process_lock_acquired_event(void *data,

tmp = raw_field_value(event, "lockdep_addr", data);
memcpy(&acquired_event.addr, &tmp, sizeof(void *));
- acquired_event.name = (char *)raw_field_ptr(event, "name", data);

if (trace_handler->acquire_event)
trace_handler->acquired_event(&acquired_event, event, cpu, timestamp, thread);
@@ -416,7 +510,6 @@ process_lock_contended_event(void *data,

tmp = raw_field_value(event, "lockdep_addr", data);
memcpy(&contended_event.addr, &tmp, sizeof(void *));
- contended_event.name = (char *)raw_field_ptr(event, "name", data);

if (trace_handler->acquire_event)
trace_handler->contended_event(&contended_event, event, cpu, timestamp, thread);
@@ -434,13 +527,30 @@ process_lock_release_event(void *data,

tmp = raw_field_value(event, "lockdep_addr", data);
memcpy(&release_event.addr, &tmp, sizeof(void *));
- release_event.name = (char *)raw_field_ptr(event, "name", data);

if (trace_handler->acquire_event)
trace_handler->release_event(&release_event, event, cpu, timestamp, thread);
}

static void
+process_lock_class_init_event(void *data,
+ struct event *event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+ struct trace_init_event init_event;
+ u64 tmp;
+
+ tmp = raw_field_value(event, "class_id", data);
+ memcpy(&init_event.class_id, &tmp, sizeof(void *));
+ init_event.name = (char *)raw_field_ptr(event, "class_name", data);
+
+ if (trace_handler->init_event)
+ trace_handler->init_event(&init_event, event, cpu, timestamp, thread);
+}
+
+static void
process_raw_event(void *data, int cpu,
u64 timestamp, struct thread *thread)
{
@@ -458,6 +568,8 @@ process_raw_event(void *data, int cpu,
process_lock_contended_event(data, event, cpu, timestamp, thread);
if (!strcmp(event->name, "lock_release"))
process_lock_release_event(data, event, cpu, timestamp, thread);
+ if (!strcmp(event->name, "lock_class_init"))
+ process_lock_class_init_event(data, event, cpu, timestamp, thread);
}

static int process_sample_event(event_t *event, struct perf_session *session)
@@ -503,15 +615,23 @@ static void print_result(void)
printf("\n\n");

while ((st = pop_from_result())) {
+ char *name;
bzero(cut_name, 20);

printf("%p ", st->addr);

- if (strlen(st->name) < 16) {
+ if (!st->class)
+ name = (char *)"<unknown>";
+ else if (!st->class->name)
+ name = (char *)"<unknown2>";
+ else
+ name = st->class->name;
+
+ if (strlen(name) < 16) {
/* output raw name */
- printf("%20s ", st->name);
+ printf("%20s ", name);
} else {
- strncpy(cut_name, st->name, 16);
+ strncpy(cut_name, name, 16);
cut_name[16] = '.';
cut_name[17] = '.';
cut_name[18] = '.';
@@ -533,12 +653,18 @@ static void print_result(void)

static void dump_map(void)
{
+ char *name;
unsigned int i;
struct lock_stat *st;

for (i = 0; i < LOCKHASH_SIZE; i++) {
list_for_each_entry(st, &lockhash_table[i], hash_entry) {
- printf("%p: %s\n", st->addr, st->name);
+ if (!st->class || !st->class->name)
+ name = (char *)"<unknown>";
+ else
+ name = st->class->name;
+
+ printf("%p: %s\n", st->addr, name);
}
}
}
@@ -612,6 +738,7 @@ static const char *record_args[] = {
"-f",
"-m", "1024",
"-c", "1",
+ "-e", "lock:lock_class_init:r",
"-e", "lock:lock_acquire:r",
"-e", "lock:lock_acquired:r",
"-e", "lock:lock_contended:r",
@@ -644,6 +771,8 @@ int cmd_lock(int argc, const char **argv, const char *prefix __used)
symbol__init();
for (i = 0; i < LOCKHASH_SIZE; i++)
INIT_LIST_HEAD(lockhash_table + i);
+ for (i = 0; i < LOCKHASH_SIZE; i++)
+ INIT_LIST_HEAD(classhash_table + i);

argc = parse_options(argc, argv, lock_options, lock_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
--
1.6.2.3

2010-02-03 09:16:56

by Frederic Weisbecker

Subject: [PATCH 05/11] perf: New PERF_EVENT_IOC_INJECT ioctl

The PERF_EVENT_IOC_INJECT perf event ioctl can be used to inject
events, if the corresponding pmu and event support it.

On trace events, it calls the inject callback, usually used by
events that need to bring the user up to date with pre-existing state.
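
(Rough userspace sketch, assuming headers that carry the new
PERF_EVENT_IOC_INJECT definition from this patch; the tracepoint id is a
placeholder that would normally be read from debugfs:)

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_tracepoint_counter(__u64 tracepoint_id, int cpu)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_TRACEPOINT,
		.size		= sizeof(attr),
		.config		= tracepoint_id,
		.sample_period	= 1,
	};

	/* cpu-bound (pid == -1): task-bound events get -EINVAL on injection */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

/*
 * Later, once the mmap ring buffer is set up and the event is enabled,
 * ask for the catch-up events:
 *
 *	ioctl(fd, PERF_EVENT_IOC_INJECT);
 */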

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/perf_event.h | 2 ++
kernel/perf_event.c | 23 +++++++++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 556b0f4..d2e83f0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -228,6 +228,7 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
+#define PERF_EVENT_IOC_INJECT _IO ('$', 7)

enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
@@ -513,6 +514,7 @@ struct pmu {
void (*disable) (struct perf_event *event);
void (*read) (struct perf_event *event);
void (*unthrottle) (struct perf_event *event);
+ void (*inject) (struct perf_event *event);
};

/**
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 40f8b07..e4dfd12 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2151,6 +2151,26 @@ unlock:
return ret;
}

+static void __perf_event_inject(void *info)
+{
+ struct perf_event *event = info;
+
+ event->pmu->inject(event);
+}
+
+static int perf_event_inject(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+ struct task_struct *task = ctx->task;
+
+ if (!event->pmu->inject || task)
+ return -EINVAL;
+
+ smp_call_function_single(event->cpu, __perf_event_inject, event, 1);
+
+ return 0;
+}
+
static int perf_event_set_output(struct perf_event *event, int output_fd);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);

@@ -2183,6 +2203,9 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);

+ case PERF_EVENT_IOC_INJECT:
+ return perf_event_inject(event);
+
default:
return -ENOTTY;
}
--
1.6.2.3

2010-02-03 09:15:59

by Frederic Weisbecker

Subject: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

There are RCU read side critical sections in the path where we submit
a trace event. And these rcu_read_(un)lock() calls trigger lock events,
which create recursive events.

One pair in do_perf_sw_event:

__lock_acquire
|
|--96.11%-- lock_acquire
| |
| |--27.21%-- do_perf_sw_event
| | perf_tp_event
| | |
| | |--49.62%-- ftrace_profile_lock_release
| | | lock_release
| | | |
| | | |--33.85%-- _raw_spin_unlock

Another pair in perf_output_begin/end:

__lock_acquire
|--23.40%-- perf_output_begin
| | __perf_event_overflow
| | perf_swevent_overflow
| | perf_swevent_add
| | perf_swevent_ctx_event
| | do_perf_sw_event
| | perf_tp_event
| | |
| | |--55.37%-- ftrace_profile_lock_acquire
| | | lock_acquire
| | | |
| | | |--37.31%-- _raw_spin_lock

The problem is not so much the trace recursion itself, as we already
have recursion protection (though it's always wasteful to recurse).
But the trace events are outside the lockdep recursion protection, so
each lockdep event triggers a lock trace, which in turn triggers two
other lockdep events. Here the recursive lock trace events won't
be recorded because of the trace recursion, so the recursion stops
there, but lockdep will still analyse these new events:

To sum up, for each lockdep event we have:

lock_*()
|
trace lock_acquire
|
----- rcu_read_lock()
| |
| lock_acquire()
| |
| trace_lock_acquire() (stopped)
| |
| lockdep analyze
|
----- rcu_read_unlock()
|
lock_release
|
trace_lock_release() (stopped)
|
lockdep analyze

And you can repeat the above twice, as we have two rcu read side
sections when we submit an event.
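
(For illustration, a paraphrased sketch of the relevant rcupdate.h
definitions of that era, showing why the plain variants recurse while the
double-underscore ones don't:)

static inline void rcu_read_lock(void)
{
	__rcu_read_lock();	/* the real read-side entry, no lockdep involved */
	rcu_read_acquire();	/* lockdep annotation -> lock_acquire() -> lock trace event */
}

static inline void rcu_read_unlock(void)
{
	rcu_read_release();	/* lockdep annotation -> lock_release() -> lock trace event */
	__rcu_read_unlock();
}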

This is fixed in this patch by using the non-lockdep versions of
rcu_read_(un)lock().

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
kernel/perf_event.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 280ae44..98fd360 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2986,7 +2986,7 @@ int perf_output_begin(struct perf_output_handle *handle,
u64 lost;
} lost_event;

- rcu_read_lock();
+ __rcu_read_lock();
/*
* For inherited events we send all the output towards the parent.
*/
@@ -3051,7 +3051,7 @@ fail:
atomic_inc(&data->lost);
perf_output_unlock(handle);
out:
- rcu_read_unlock();
+ __rcu_read_unlock();

return -ENOSPC;
}
@@ -3072,7 +3072,7 @@ void perf_output_end(struct perf_output_handle *handle)
}

perf_output_unlock(handle);
- rcu_read_unlock();
+ __rcu_read_unlock();
}

static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
@@ -4116,7 +4116,7 @@ static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
struct perf_event_context *ctx;

cpuctx = &__get_cpu_var(perf_cpu_context);
- rcu_read_lock();
+ __rcu_read_lock();
perf_swevent_ctx_event(&cpuctx->ctx, type, event_id,
nr, nmi, data, regs);
/*
@@ -4126,7 +4126,7 @@ static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
ctx = rcu_dereference(current->perf_event_ctxp);
if (ctx)
perf_swevent_ctx_event(ctx, type, event_id, nr, nmi, data, regs);
- rcu_read_unlock();
+ __rcu_read_unlock();
}

void __perf_sw_event(u32 event_id, u64 nr, int nmi,
--
1.6.2.3

2010-02-03 09:16:07

by Frederic Weisbecker

Subject: [PATCH 07/11] perf: Handle injection ioctl for tracepoints from perf record

Issue the injection ioctl request for tracepoint events, in case
some of them support it.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
tools/perf/builtin-record.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index eea5691..da42519 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -386,6 +386,15 @@ try_again:
}

ioctl(fd[nr_cpu][counter], PERF_EVENT_IOC_ENABLE);
+
+ /*
+ * If a tracepoint has some events he wants us to catchup
+ * let's inject them. We may have such injectable event
+ * in several cpus, we want only one injection, so pick the
+ * first cpu
+ */
+ if (attr->type == PERF_TYPE_TRACEPOINT && !nr_cpu)
+ ioctl(fd[nr_cpu][counter], PERF_EVENT_IOC_INJECT);
}

static void open_counters(int cpu, pid_t pid)
--
1.6.2.3

2010-02-03 09:16:59

by Frederic Weisbecker

Subject: [PATCH 04/11] tracing: Add lock class id in lock_acquire event

Add the lock class id to the lock_acquire event. This will let us
remove the name copy from every lock event (except lock_class_init),
as we'll be able to retrieve the name of the lock class from
the class id.

The class id is the address of the lock name, which is the same
pointer in the corresponding lock class.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/trace/events/lock.h | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 6999f16..3c4a5c1 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -52,15 +52,18 @@ TRACE_EVENT(lock_acquire,
__field(unsigned int, flags)
__string(name, lock->name)
__field(void *, lockdep_addr)
+ __field(void *, class_id)
),

TP_fast_assign(
__entry->flags = (trylock ? 1 : 0) | (read ? 2 : 0);
__assign_str(name, lock->name);
__entry->lockdep_addr = lock;
+ __entry->class_id = (void *)lock->name;
),

- TP_printk("%p %s%s%s", __entry->lockdep_addr,
+ TP_printk("%p %p %s%s%s", __entry->lockdep_addr,
+ __entry->class_id,
(__entry->flags & 1) ? "try " : "",
(__entry->flags & 2) ? "read " : "",
__get_str(name))
--
1.6.2.3

2010-02-03 09:20:06

by Frederic Weisbecker

Subject: Re: [PATCH 05/11] perf: New PERF_EVENT_IOC_INJECT ioctl

On Wed, Feb 03, 2010 at 10:14:29AM +0100, Frederic Weisbecker wrote:
> The PERF_EVENT_IOC_INJECT perf event ioctl can be used to inject
> events, if the corresponding pmu and event support it.
>
> On trace events, it calls the inject callback, usually used by
> events that need to bring the user up to date with pre-existing state.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Hitoshi Mitake <[email protected]>
> Cc: Li Zefan <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> include/linux/perf_event.h | 2 ++
> kernel/perf_event.c | 23 +++++++++++++++++++++++
> 2 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 556b0f4..d2e83f0 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -228,6 +228,7 @@ struct perf_event_attr {
> #define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
> #define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
> #define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
> +#define PERF_EVENT_IOC_INJECT _IO ('$', 7)
>
> enum perf_event_ioc_flags {
> PERF_IOC_FLAG_GROUP = 1U << 0,
> @@ -513,6 +514,7 @@ struct pmu {
> void (*disable) (struct perf_event *event);
> void (*read) (struct perf_event *event);
> void (*unthrottle) (struct perf_event *event);
> + void (*inject) (struct perf_event *event);
> };
>
> /**
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 40f8b07..e4dfd12 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -2151,6 +2151,26 @@ unlock:
> return ret;
> }
>
> +static void __perf_event_inject(void *info)
> +{
> + struct perf_event *event = info;
> +
> + event->pmu->inject(event);
> +}
> +
> +static int perf_event_inject(struct perf_event *event)
> +{
> + struct perf_event_context *ctx = event->ctx;
> + struct task_struct *task = ctx->task;
> +
> + if (!event->pmu->inject || task)
> + return -EINVAL;
> +
> + smp_call_function_single(event->cpu, __perf_event_inject, event, 1);
> +



Ah, I forgot to say: injection is only supported on cpu-bound
events (not task-bound ones), because if the event is task-bound,
we can't ensure it is scheduled in at the moment we inject.
Not sure how to fix this.

2010-02-03 10:25:46

by Jens Axboe

Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, Feb 03 2010, Frederic Weisbecker wrote:
> Hi,
>
> There are many things going on in this patchset, addressing
> different problems:
>
> - remove most of the string copy overhead in the fast path
> - open the way for lock class oriented profiling (as
> opposed to lock instance profiling; both can be useful
> in different ways)
> - remove the buffer multiplexing (less contention)
> - add event injection support
> - remove violent lock event recursions (only 2 out of 3; the remaining
> one is detailed below)
>
> Some differences, by running:
> perf lock record perf sched pipe -l 100000
>
> Before the patchset:
>
> Total time: 91.015 [sec]
>
> 910.157300 usecs/op
> 1098 ops/sec
>
> After this patchset applied:
>
> Total time: 43.706 [sec]
>
> 437.062080 usecs/op
> 2288 ops/sec

This does a lot better here, even if it isn't exactly stellar
performance. It generates a LOT of data:

root@nehalem:/dev/shm # time perf lock rec -fg ls
perf.data perf.data.old
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 137.224 MB perf.data (~5995421
samples) ]

real 0m3.320s
user 0m0.000s
sys 0m3.220s

Without -g, it has 1.688s real and 1.590s sys time.

So while this is orders of magnitude better than the previous patchset,
it's still not anywhere near lean. But I expect you know that, just
consider this an 'I tested it and this is what happened' report :-)

--
Jens Axboe

2010-02-03 10:26:40

by Ingo Molnar

Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements


* Frederic Weisbecker <[email protected]> wrote:

> Hi,
>
> There are many things going on in this patchset, addressing
> different problems:
>
> - remove most of the string copy overhead in the fast path
> - open the way for lock class oriented profiling (as
> opposed to lock instance profiling; both can be useful
> in different ways)
> - remove the buffer multiplexing (less contention)
> - add event injection support
> - remove violent lock event recursions (only 2 out of 3; the remaining
> one is detailed below)
>
> Some differences, by running:
> perf lock record perf sched pipe -l 100000
>
> Before the patchset:
>
> Total time: 91.015 [sec]
>
> 910.157300 usecs/op
> 1098 ops/sec
>
> After this patchset applied:
>
> Total time: 43.706 [sec]
>
> 437.062080 usecs/op
> 2288 ops/sec

Fantastic!

There's one area that needs more thought I think: the dump-all-classes
init-event-injector approach. It is async, hence we could lose events if
there are a lot of lock classes to dump. Plus we eventually want to use your
injector approach for other things as well (such as to dump the state of a
collection of tasks) - so I think we want it to be more synchronous.

One approach would be to allow a gradual read() to deplete the dump. Also, I
think the 'state dump' events should be separate from regular init events.
Filters attached to these events would automatically cause the dumping to be
restricted to the filter set. For example, in the case of tasks one could dump
only tasks from a particular UID - by adding a 'uid == 1234' filter before
the dump (on a per-task basis - so the filtering is nicely task local).

What do you think?

Ingo

2010-02-03 10:33:43

by Peter Zijlstra

Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
> - event injection support

I like the idea, I'm just not sure about the name and API details.

I would like to call it something like collection support, and the API
should have an iterator like interface.

That is, it should not blindly dump all events from a collection at
once, praying the output buffer is large enough, but either dump a
specified number and/or stop dumping when the buffer is full, allowing a
second invocation to continue where it left off after the buffer content
has been consumed.

Which brings us to the ioctl() interface, we can do the above using
ioctl()s, but it seems to me we're starting to get ioctl() heavy and
should be looking at alternative ways of extending this.

Anybody any bright ideas?

2010-02-03 20:50:23

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, Feb 03, 2010 at 11:25:41AM +0100, Jens Axboe wrote:
> On Wed, Feb 03 2010, Frederic Weisbecker wrote:
> > Hi,
> >
> > There are many things that happen in this patchset, treating
> > different problems:
> >
> > - remove most of the string copy overhead in fast path
> > - open the way for lock class oriented profiling (as
> > opposite to lock instance profiling. Both can be useful
> > in different ways).
> > - remove the buffers muliplexing (less contention)
> > - event injection support
> > - remove violent lock events recursion (only 2 among 3, the remaining
> > one is detailed below).
> >
> > Some differences, by running:
> > perf lock record perf sched pipe -l 100000
> >
> > Before the patchset:
> >
> > Total time: 91.015 [sec]
> >
> > 910.157300 usecs/op
> > 1098 ops/sec
> >
> > After this patchset applied:
> >
> > Total time: 43.706 [sec]
> >
> > 437.062080 usecs/op
> > 2288 ops/sec
>
> This does a lot better here, even if it isn't exactly stellar
> performance. It generates a LOT of data:
>
> root@nehalem:/dev/shm # time perf lock rec -fg ls
> perf.data perf.data.old
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 137.224 MB perf.data (~5995421
> samples) ]



Doh, 137 MB for a single ls :)

That said, we don't yet have support for callchains in perf lock,
and callchains can fill the buffer quickly, especially on lock
events. You can drop the -g option for now.


>
> real 0m3.320s
> user 0m0.000s
> sys 0m3.220s
>
> Without -g, it has 1.688s real and 1.590s sys time.


Ok.


> So while this is orders of magnitude better than the previous patchset,
> it's still not anywhere near lean. But I expect you know that, just
> consider this a 'I tested it and this is what happened' report :-)


Ok, thanks a lot. The fact that you can test on a 64-thread box is
critically helpful.

I also wonder what happens after this patch applied:

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 98fd360..254b3d4 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
if (event->parent)
event = event->parent;

- return task_pid_nr_ns(p, event->ns);
+ return p->pid;
}

On my box it has increased the speed by 2x over this patchset.

I wonder if the tool becomes usable for you with that.
Otherwise, it means we have other things to fix, and
the result of:

perf record -g -f perf lock record sleep 6
perf report

would be very nice to have.

Thanks!

2010-02-03 21:21:48

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, Feb 03 2010, Frederic Weisbecker wrote:
> On Wed, Feb 03, 2010 at 11:25:41AM +0100, Jens Axboe wrote:
> > On Wed, Feb 03 2010, Frederic Weisbecker wrote:
> > > Hi,
> > >
> > > There are many things that happen in this patchset, treating
> > > different problems:
> > >
> > > - remove most of the string copy overhead in fast path
> > > - open the way for lock class oriented profiling (as
> > > opposite to lock instance profiling. Both can be useful
> > > in different ways).
> > > - remove the buffers muliplexing (less contention)
> > > - event injection support
> > > - remove violent lock events recursion (only 2 among 3, the remaining
> > > one is detailed below).
> > >
> > > Some differences, by running:
> > > perf lock record perf sched pipe -l 100000
> > >
> > > Before the patchset:
> > >
> > > Total time: 91.015 [sec]
> > >
> > > 910.157300 usecs/op
> > > 1098 ops/sec
> > >
> > > After this patchset applied:
> > >
> > > Total time: 43.706 [sec]
> > >
> > > 437.062080 usecs/op
> > > 2288 ops/sec
> >
> > This does a lot better here, even if it isn't exactly stellar
> > performance. It generates a LOT of data:
> >
> > root@nehalem:/dev/shm # time perf lock rec -fg ls
> > perf.data perf.data.old
> > [ perf record: Woken up 0 times to write data ]
> > [ perf record: Captured and wrote 137.224 MB perf.data (~5995421
> > samples) ]
>
>
>
> Doh, 137 MB for a single ls :)

Yeah, it's pretty crazy. It varies a lot too, I didn't post the other
run which was ~170MB.

> That said we don't have yet support for callchains in perf lock,
> and callchains can fill the buffer quickly, especially on lock
> events. You can drop the -g option for now.

OK

> > So while this is orders of magnitude better than the previous patchset,
> > it's still not anywhere near lean. But I expect you know that, just
> > consider this a 'I tested it and this is what happened' report :-)
>
>
> Ok, thanks a lot, the fact you can test on a 64 threads box is critically
> helpful.

My pleasure, I'd love to have a fast and functional perf lockstat. If my
testing helps getting there, consider me signed up :-)

> I also wonder what happens after this patch applied:
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 98fd360..254b3d4 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> if (event->parent)
> event = event->parent;
>
> - return task_pid_nr_ns(p, event->ns);
> + return p->pid;
> }
>
> In my box it has increased the speed from 2x this patchset.

Cool, I'll give that a spin in the morning, the box is off atm.

> I wonder if the tool becomes usable for you with that.
> Otherwise, it means we have other things to fix, and
> the result of:
>
> perf record -g -f perf lock record sleep 6
> perf report
>
> would be very nice to have.

I'll package that up for you and put it somewhere.

--
Jens Axboe

2010-02-03 21:27:00

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, Feb 03, 2010 at 11:26:11AM +0100, Ingo Molnar wrote:
> There's one area that needs more thought i think: the dump-all-classes
> init-event-injector approach. It is async, hence we could lose events if
> there's a lot of lock classes to dump.


Yeah, the dump itself is purely async.

But the lock_class_init event is used from two sites:

- the injector, purely asynchronous, which catches up with the
classes created in the past

- register_lock_class(), the synchronous point, hit each time
a new class is created.

When we register a lock_class_init ftrace/perf event, we first activate
the synchronous point; it behaves there like a usual event and hooks
into every class created from then on.

Only afterwards do we call the injector, which is asynchronous and
replays the past events.

Being split in two parts like that, it covers every event.
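
For reference, a minimal sketch of what the injector side can look like.
This is only an illustration of the idea, not the exact patch: it assumes
the trace_lock_class_init() tracepoint from this series is also called
synchronously from register_lock_class(), and it walks lockdep's
all_lock_classes list (locking of that list is elided here):

/* Asynchronous catch-up, run when the event gets enabled. */
static void lock_class_init_inject_events(void)
{
	struct lock_class *class;

	/*
	 * Replay one lock_class_init event for each class lockdep has
	 * already registered, so post-processing gets the complete
	 * class:name mapping even for classes older than the session.
	 */
	list_for_each_entry(class, &all_lock_classes, lock_entry)
		trace_lock_class_init(class);
}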


> Plus we eventually want to use your
> injector approach for other things as well (such as to dump the state of a
> collection of tasks) - so i think we want it to be more synchronous.


Yeah, that would also work for tasks. And we can follow the same
pattern for that.

We can set up a synchronous tracepoint in fork and have a secondary
asynchronous point that can dump the task list. That too would
cover every event we want.


> One approach would be to allow a gradual read() deplete the dump. Also, i
> think the 'state dump' events should be separate from regular init events.
> Filters attached to these events will automatically cause the dumping to be
> restricted to the filter set. For example in the case of tasks one could dump
> only tasks from a particular UID - by adding a 'uid == 1234' filter before
> the dump (on a per tasks basis - so the filtering is nicely task local).


But this is what we want, right? If the init event and the dump event are
the same, which is the case currently, the filter will apply to both.

And if we are only interested in tasks for uid == 1234, I guess we
want async and sync events that have the same filter.

Maybe we want to split up init events from dump events, say,
have an event class that you can open in either async or
sync mode. But I can't figure out a workflow in which that would
be useful.

2010-02-03 22:07:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, Feb 03, 2010 at 11:33:16AM +0100, Peter Zijlstra wrote:
> On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
> > - event injection support
>
> I like the idea, I'm just not sure about the name and API details.
>
> I would like to call it something like collection support, and the API
> should have an iterator like interface.
>
> That is, it should not blindly dump all events from a collection at
> once, praying the output buffer is large enough, but either dump a
> specified number and/or stop dumping when the buffer is full. Allowing a
> second invocation to continue where it left off after the buffer content
> has been consumed.


Yeah, I agree. My worry is that this scheme induces races, though they
are probably tight enough that we don't care much.

Consider dumping the task list content:

A -> B -> C -> D

You open a "task" event and ask to inject it one entry at a time:
you first dump A, then B disappears; you'll miss it,
but you can still get C and D if they don't disappear.

As I said, it is tight enough that we don't care. If B disappears
so early, it won't have a determinant role in the profiling
anyway (at worst a few isolated events at the beginning).

The fact is the async injection will always have such misses by its
nature: even if you dump the whole list at once, you will still miss
tasks that disappeared just before the injection call, so whatever...


>
> Which brings us to the ioctl() interface, we can do the above using
> ioctl()s, but it seems to me we're starting to get ioctl() heavy and
> should be looking at alternative ways of extending this.
>
> Anybody any bright ideas?
>


Using ioctl, we can basically have a structure like this:

struct perf_inject_req {
	int	start;	/* in */
	int	len;	/* in */
	int	*read;	/* out */
};

Using an index will often imply rewalking a whole object list from
the beginning, which is perfectly fine.

If you prefer an alternate syscall, I can make it. I've been
told about ioctl drawbacks in the past, but I personally think
ioctl doesn't only have drawbacks: it avoids a proliferation
of syscalls. It's however you prefer.
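
From the tool side, the depletion loop could then look like the sketch
below. The PERF_EVENT_IOC_INJECT name and the chunked resume semantics
are assumptions for illustration, matching the structure above, not a
settled ABI:

#include <sys/ioctl.h>

struct perf_inject_req {
	int	start;	/* in: index to resume the walk from */
	int	len;	/* in: maximum number of entries to inject */
	int	*read;	/* out: number of entries actually injected */
};

static int inject_all(int perf_fd)
{
	int done = 0;
	struct perf_inject_req req = { .start = 0, .len = 64, .read = &done };

	do {
		if (ioctl(perf_fd, PERF_EVENT_IOC_INJECT, &req) < 0)
			return -1;
		/* drain the mmap'ed buffer here before asking for more */
		req.start += done;
	} while (done == req.len);

	return 0;
}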

2010-02-03 22:13:14

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, Feb 03, 2010 at 10:21:43PM +0100, Jens Axboe wrote:
> On Wed, Feb 03 2010, Frederic Weisbecker wrote:
> > > So while this is orders of magnitude better than the previous patchset,
> > > it's still not anywhere near lean. But I expect you know that, just
> > > consider this a 'I tested it and this is what happened' report :-)
> >
> >
> > Ok, thanks a lot, the fact you can test on a 64 threads box is critically
> > helpful.
>
> My pleasure, I'd love to have a fast and functional perf lockstat. If my
> testing helps getting there, consider me signed up :-)



Great :)


> > I also wonder what happens after this patch applied:
> >
> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> > index 98fd360..254b3d4 100644
> > --- a/kernel/perf_event.c
> > +++ b/kernel/perf_event.c
> > @@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_pid_nr_ns(p, event->ns);
> > + return p->pid;
> > }
> >
> > In my box it has increased the speed from 2x this patchset.
>
> Cool, I'll give that a spin in the morning, the box is off atm.


Thanks a lot!

2010-02-04 06:33:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements


* Frederic Weisbecker <[email protected]> wrote:

> On Wed, Feb 03, 2010 at 11:33:16AM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
> > > - event injection support
> >
> > I like the idea, I'm just not sure about the name and API details.
> >
> > I would like to call it something like collection support, and the API
> > should have an iterator like interface.
> >
> > That is, it should not blindly dump all events from a collection at once,
> > praying the output buffer is large enough, but either dump a specified
> > number and/or stop dumping when the buffer is full. Allowing a second
> > invocation to continue where it left off after the buffer content has
> > been consumed.
>
>
> Yeah I agree. But my worry is there are induced races in this scheme. But
> probably tight enough that we don't care much.
>
> Consider dumping the task list content:
>
> A -> B -> C -> D
>
> You open a "task" event. And ask to inject it one by one,
> you first dump A, and B disappear, then you'll miss it
> but you can still get C and D if they don't disappear.
>
> As I said it is tight enough that we don't care. If B disappears
> so early, it means it won't have a determinant role in the profiling
> anyway (at worst few isolated events in the beginning).

We probably don't care - /proc is racy in the same fashion (you can miss
tasks), yet 'top' has been able to cope for a decade.

If we cared, it would be possible to construct a dump-collection-state
sequence along the lines of:

activate init/deinit events
initiate dump

anything that got missed by the dump will be covered by the init/deinit event
flow. In that sense it's important that init/deinit and dumping use a similar
state structure - possibly the same event (as your solution did).

A 'dump' of a collection is basically the artificial replay of all its init
events.

Ingo

2010-02-04 15:47:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Wed, Feb 03, 2010 at 10:14:34AM +0100, Frederic Weisbecker wrote:
> There are rcu locked read side areas in the path where we submit
> a trace events. And these rcu_read_(un)lock() trigger lock events,
> which create recursive events.
>
> One pair in do_perf_sw_event:
>
> __lock_acquire
> |
> |--96.11%-- lock_acquire
> | |
> | |--27.21%-- do_perf_sw_event
> | | perf_tp_event
> | | |
> | | |--49.62%-- ftrace_profile_lock_release
> | | | lock_release
> | | | |
> | | | |--33.85%-- _raw_spin_unlock
>
> Another pair in perf_output_begin/end:
>
> __lock_acquire
> |--23.40%-- perf_output_begin
> | | __perf_event_overflow
> | | perf_swevent_overflow
> | | perf_swevent_add
> | | perf_swevent_ctx_event
> | | do_perf_sw_event
> | | perf_tp_event
> | | |
> | | |--55.37%-- ftrace_profile_lock_acquire
> | | | lock_acquire
> | | | |
> | | | |--37.31%-- _raw_spin_lock
>
> The problem is not that much the trace recursion itself, as we have a
> recursion protection already (though it's always wasteful to recurse).
> But the trace events are outside the lockdep recursion protection, then
> each lockdep event triggers a lock trace, which will trigger two
> other lockdep events. Here the recursive lock trace event won't
> be taken because of the trace recursion, so the recursion stops there
> but lockdep will still analyse these new events:
>
> To sum up, for each lockdep events we have:
>
> lock_*()
> |
> trace lock_acquire
> |
> ----- rcu_read_lock()
> | |
> | lock_acquire()
> | |
> | trace_lock_acquire() (stopped)
> | |
> | lockdep analyze
> |
> ----- rcu_read_unlock()
> |
> lock_release
> |
> trace_lock_release() (stopped)
> |
> lockdep analyze
>
> And you can repeat the above two times as we have two rcu read side
> sections when we submit an event.
>
> This is fixed in this pacth by using the non-lockdep versions of
> rcu_read_(un)lock.

Hmmm... Perhaps I should rename __rcu_read_lock() to something more
meaningful if it is to be used outside of the RCU files. In the
meantime:

Reviewed-by: Paul E. McKenney <[email protected]>

> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Hitoshi Mitake <[email protected]>
> Cc: Li Zefan <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> kernel/perf_event.c | 10 +++++-----
> 1 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 280ae44..98fd360 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -2986,7 +2986,7 @@ int perf_output_begin(struct perf_output_handle *handle,
> u64 lost;
> } lost_event;
>
> - rcu_read_lock();
> + __rcu_read_lock();
> /*
> * For inherited events we send all the output towards the parent.
> */
> @@ -3051,7 +3051,7 @@ fail:
> atomic_inc(&data->lost);
> perf_output_unlock(handle);
> out:
> - rcu_read_unlock();
> + __rcu_read_unlock();
>
> return -ENOSPC;
> }
> @@ -3072,7 +3072,7 @@ void perf_output_end(struct perf_output_handle *handle)
> }
>
> perf_output_unlock(handle);
> - rcu_read_unlock();
> + __rcu_read_unlock();
> }
>
> static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> @@ -4116,7 +4116,7 @@ static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
> struct perf_event_context *ctx;
>
> cpuctx = &__get_cpu_var(perf_cpu_context);
> - rcu_read_lock();
> + __rcu_read_lock();
> perf_swevent_ctx_event(&cpuctx->ctx, type, event_id,
> nr, nmi, data, regs);
> /*
> @@ -4126,7 +4126,7 @@ static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
> ctx = rcu_dereference(current->perf_event_ctxp);
> if (ctx)
> perf_swevent_ctx_event(ctx, type, event_id, nr, nmi, data, regs);
> - rcu_read_unlock();
> + __rcu_read_unlock();
> }
>
> void __perf_sw_event(u32 event_id, u64 nr, int nmi,
> --
> 1.6.2.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2010-02-04 19:40:54

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, Feb 03 2010, Frederic Weisbecker wrote:
> Ok, thanks a lot, the fact you can test on a 64 threads box is critically
> helpful.
>
> I also wonder what happens after this patch applied:
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 98fd360..254b3d4 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> if (event->parent)
> event = event->parent;
>
> - return task_pid_nr_ns(p, event->ns);
> + return p->pid;
> }
>
> In my box it has increased the speed from 2x this patchset.

Doesn't seem to change anything, same runtime for a ls.

> I wonder if the tool becomes usable for you with that.
> Otherwise, it means we have other things to fix, and
> the result of:
>
> perf record -g -f perf lock record sleep 6
> perf report
>
> would be very nice to have.

root@nehalem:/dev/shm # perf record -g -f perf lock record sleep 6
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 446.208 MB perf.data (~19495127 samples) ]
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 1.135 MB perf.data (~49609 samples) ]

It's huuuge. Thankfully the output isn't so big, I've attached it.

--
Jens Axboe


Attachments:
perf-lock-report.txt (33.84 kB)

2010-02-05 02:39:50

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

Paul E. McKenney wrote:
> On Wed, Feb 03, 2010 at 10:14:34AM +0100, Frederic Weisbecker wrote:
>> There are rcu locked read side areas in the path where we submit
>> a trace events. And these rcu_read_(un)lock() trigger lock events,
>> which create recursive events.
>>
>> One pair in do_perf_sw_event:
>>
>> __lock_acquire
>> |
>> |--96.11%-- lock_acquire
>> | |
>> | |--27.21%-- do_perf_sw_event
>> | | perf_tp_event
>> | | |
>> | | |--49.62%-- ftrace_profile_lock_release
>> | | | lock_release
>> | | | |
>> | | | |--33.85%-- _raw_spin_unlock
>>
>> Another pair in perf_output_begin/end:
>>
>> __lock_acquire
>> |--23.40%-- perf_output_begin
>> | | __perf_event_overflow
>> | | perf_swevent_overflow
>> | | perf_swevent_add
>> | | perf_swevent_ctx_event
>> | | do_perf_sw_event
>> | | perf_tp_event
>> | | |
>> | | |--55.37%-- ftrace_profile_lock_acquire
>> | | | lock_acquire
>> | | | |
>> | | | |--37.31%-- _raw_spin_lock
>>
>> The problem is not that much the trace recursion itself, as we have a
>> recursion protection already (though it's always wasteful to recurse).
>> But the trace events are outside the lockdep recursion protection, then
>> each lockdep event triggers a lock trace, which will trigger two
>> other lockdep events. Here the recursive lock trace event won't
>> be taken because of the trace recursion, so the recursion stops there
>> but lockdep will still analyse these new events:
>>
>> To sum up, for each lockdep events we have:
>>
>> lock_*()
>> |
>> trace lock_acquire
>> |
>> ----- rcu_read_lock()
>> | |
>> | lock_acquire()
>> | |
>> | trace_lock_acquire() (stopped)
>> | |
>> | lockdep analyze
>> |
>> ----- rcu_read_unlock()
>> |
>> lock_release
>> |
>> trace_lock_release() (stopped)
>> |
>> lockdep analyze
>>
>> And you can repeat the above two times as we have two rcu read side
>> sections when we submit an event.
>>
>> This is fixed in this pacth by using the non-lockdep versions of
>> rcu_read_(un)lock.
>
> Hmmm... Perhaps I should rename __rcu_read_lock() to something more
> meaningful if it is to be used outside of the RCU files. In the
> meantime:
>
> Reviewed-by: Paul E. McKenney <[email protected]>
>

Perhaps we can use the existing rcu_read_lock_sched_notrace().

Not related to this patchset, but about RCU & lockdep:

We need to remove lockdep from rcu_read_lock_*().

1) rcu_read_lock() is deadlock-immune,
so we get very little benefit from lockdep.

rcu_read_lock()
  lock_acquire(read=2, check=1)

/*
 * Values for check:
 *
 * 0: disabled
 * 1: simple checks (freeing, held-at-exit-time, etc.)
 * 2: full validation
 */

We can check it by other methods.

2) Popular distributions and some companies enable lockdep in their kernels.
rcu_read_lock_*() are the most frequently taken locks in the kernel, and
lock_acquire() is not fast enough; it is a big function for RCU.
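
For context, this is roughly what the lockdep-annotated RCU read side
looks like in this era; I'm paraphrasing include/linux/rcupdate.h from
memory, so treat the exact arguments as approximate:

#ifdef CONFIG_DEBUG_LOCK_ALLOC
extern struct lockdep_map rcu_lock_map;
/* every rcu_read_lock() also runs the (heavy) lockdep hook */
# define rcu_read_acquire() \
		lock_acquire(&rcu_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
# define rcu_read_release() \
		lock_release(&rcu_lock_map, 1, _THIS_IP_)
#else
# define rcu_read_acquire()	do { } while (0)
# define rcu_read_release()	do { } while (0)
#endif

static inline void rcu_read_lock(void)
{
	__rcu_read_lock();	/* the real RCU work, cheap */
	rcu_read_acquire();	/* the lockdep annotation being measured here */
}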

2010-02-05 09:45:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Fri, 2010-02-05 at 10:38 +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > On Wed, Feb 03, 2010 at 10:14:34AM +0100, Frederic Weisbecker wrote:
> >> There are rcu locked read side areas in the path where we submit
> >> a trace events. And these rcu_read_(un)lock() trigger lock events,
> >> which create recursive events.
> >>
> >> One pair in do_perf_sw_event:
> >>
> >> __lock_acquire
> >> |
> >> |--96.11%-- lock_acquire
> >> | |
> >> | |--27.21%-- do_perf_sw_event
> >> | | perf_tp_event
> >> | | |
> >> | | |--49.62%-- ftrace_profile_lock_release
> >> | | | lock_release
> >> | | | |
> >> | | | |--33.85%-- _raw_spin_unlock
> >>
> >> Another pair in perf_output_begin/end:
> >>
> >> __lock_acquire
> >> |--23.40%-- perf_output_begin
> >> | | __perf_event_overflow
> >> | | perf_swevent_overflow
> >> | | perf_swevent_add
> >> | | perf_swevent_ctx_event
> >> | | do_perf_sw_event
> >> | | perf_tp_event
> >> | | |
> >> | | |--55.37%-- ftrace_profile_lock_acquire
> >> | | | lock_acquire
> >> | | | |
> >> | | | |--37.31%-- _raw_spin_lock
> >>
> >> The problem is not that much the trace recursion itself, as we have a
> >> recursion protection already (though it's always wasteful to recurse).
> >> But the trace events are outside the lockdep recursion protection, then
> >> each lockdep event triggers a lock trace, which will trigger two
> >> other lockdep events. Here the recursive lock trace event won't
> >> be taken because of the trace recursion, so the recursion stops there
> >> but lockdep will still analyse these new events:
> >>
> >> To sum up, for each lockdep events we have:
> >>
> >> lock_*()
> >> |
> >> trace lock_acquire
> >> |
> >> ----- rcu_read_lock()
> >> | |
> >> | lock_acquire()
> >> | |
> >> | trace_lock_acquire() (stopped)
> >> | |
> >> | lockdep analyze
> >> |
> >> ----- rcu_read_unlock()
> >> |
> >> lock_release
> >> |
> >> trace_lock_release() (stopped)
> >> |
> >> lockdep analyze
> >>
> >> And you can repeat the above two times as we have two rcu read side
> >> sections when we submit an event.
> >>
> >> This is fixed in this pacth by using the non-lockdep versions of
> >> rcu_read_(un)lock.
> >
> > Hmmm... Perhaps I should rename __rcu_read_lock() to something more
> > meaningful if it is to be used outside of the RCU files. In the
> > meantime:
> >
> > Reviewed-by: Paul E. McKenney <[email protected]>
> >
>
> Perhaps we can use the existed rcu_read_lock_sched_notrace().
>
> not relate to this patchset, but RCU & lockdep:
>
> We need to remove lockdep from rcu_read_lock_*().

I'm not at all convinced we need to do any such thing; remember it's
debugging stuff: performance, while nice, doesn't really count.


> 1) rcu_read_lock() is deadlock-immunity,
> we get very little benefit from lockdep.

Except it detects things like failing to unlock, or going into userspace
while holding an rcu_read_lock()

Also, Paul has been spending lots of effort on getting rcu_dereference()
annotated.

> rcu_read_lock()
> lock_acquire(read=2,check=1)
>
> * Values for check:
> *
> * 0: disabled
> * 1: simple checks (freeing, held-at-exit-time, etc.)
> * 2: full validation
> */
>
> We can check it by other methods.
>
> 2) popular distributions and some companies enable lockdep for their kernel.
> rcu_read_lock_*() are the most frequent lock in kernel.
> lock_acquire() is not fast enough, it is a big function for RCU.

It's debug stuff, get over it; we're not going to limit its coverage
because people do silly things.

2010-02-05 09:51:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Fri, 2010-02-05 at 10:45 +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-05 at 10:38 +0800, Lai Jiangshan wrote:
> > Paul E. McKenney wrote:
> > > On Wed, Feb 03, 2010 at 10:14:34AM +0100, Frederic Weisbecker wrote:
> > >> There are rcu locked read side areas in the path where we submit
> > >> a trace events. And these rcu_read_(un)lock() trigger lock events,
> > >> which create recursive events.
> > >>
> > >> One pair in do_perf_sw_event:
> > >>
> > >> __lock_acquire
> > >> |
> > >> |--96.11%-- lock_acquire
> > >> | |
> > >> | |--27.21%-- do_perf_sw_event
> > >> | | perf_tp_event
> > >> | | |
> > >> | | |--49.62%-- ftrace_profile_lock_release
> > >> | | | lock_release
> > >> | | | |
> > >> | | | |--33.85%-- _raw_spin_unlock
> > >>
> > >> Another pair in perf_output_begin/end:
> > >>
> > >> __lock_acquire
> > >> |--23.40%-- perf_output_begin
> > >> | | __perf_event_overflow
> > >> | | perf_swevent_overflow
> > >> | | perf_swevent_add
> > >> | | perf_swevent_ctx_event
> > >> | | do_perf_sw_event
> > >> | | perf_tp_event
> > >> | | |
> > >> | | |--55.37%-- ftrace_profile_lock_acquire
> > >> | | | lock_acquire
> > >> | | | |
> > >> | | | |--37.31%-- _raw_spin_lock
> > >>
> > >> The problem is not that much the trace recursion itself, as we have a
> > >> recursion protection already (though it's always wasteful to recurse).
> > >> But the trace events are outside the lockdep recursion protection, then
> > >> each lockdep event triggers a lock trace, which will trigger two
> > >> other lockdep events. Here the recursive lock trace event won't
> > >> be taken because of the trace recursion, so the recursion stops there
> > >> but lockdep will still analyse these new events:
> > >>
> > >> To sum up, for each lockdep events we have:
> > >>
> > >> lock_*()
> > >> |
> > >> trace lock_acquire
> > >> |
> > >> ----- rcu_read_lock()
> > >> | |
> > >> | lock_acquire()
> > >> | |
> > >> | trace_lock_acquire() (stopped)
> > >> | |
> > >> | lockdep analyze
> > >> |
> > >> ----- rcu_read_unlock()
> > >> |
> > >> lock_release
> > >> |
> > >> trace_lock_release() (stopped)
> > >> |
> > >> lockdep analyze
> > >>
> > >> And you can repeat the above two times as we have two rcu read side
> > >> sections when we submit an event.
> > >>
> > >> This is fixed in this pacth by using the non-lockdep versions of
> > >> rcu_read_(un)lock.
> > >
> > > Hmmm... Perhaps I should rename __rcu_read_lock() to something more
> > > meaningful if it is to be used outside of the RCU files. In the
> > > meantime:
> > >
> > > Reviewed-by: Paul E. McKenney <[email protected]>
> > >
> >
> > Perhaps we can use the existed rcu_read_lock_sched_notrace().
> >
> > not relate to this patchset, but RCU & lockdep:
> >
> > We need to remove lockdep from rcu_read_lock_*().
>
> I'm not at all convinced we need to do any such thing, remember its
> debugging stuff, performance, while nice, doesn't really count.

That said, I'm not at all happy about removing lockdep annotations to
make the tracer faster, that's really counterproductive.


2010-02-05 10:50:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2010-02-05 at 10:45 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-02-05 at 10:38 +0800, Lai Jiangshan wrote:
> > > Paul E. McKenney wrote:
> > > > On Wed, Feb 03, 2010 at 10:14:34AM +0100, Frederic Weisbecker wrote:
> > > >> There are rcu locked read side areas in the path where we submit
> > > >> a trace events. And these rcu_read_(un)lock() trigger lock events,
> > > >> which create recursive events.
> > > >>
> > > >> One pair in do_perf_sw_event:
> > > >>
> > > >> __lock_acquire
> > > >> |
> > > >> |--96.11%-- lock_acquire
> > > >> | |
> > > >> | |--27.21%-- do_perf_sw_event
> > > >> | | perf_tp_event
> > > >> | | |
> > > >> | | |--49.62%-- ftrace_profile_lock_release
> > > >> | | | lock_release
> > > >> | | | |
> > > >> | | | |--33.85%-- _raw_spin_unlock
> > > >>
> > > >> Another pair in perf_output_begin/end:
> > > >>
> > > >> __lock_acquire
> > > >> |--23.40%-- perf_output_begin
> > > >> | | __perf_event_overflow
> > > >> | | perf_swevent_overflow
> > > >> | | perf_swevent_add
> > > >> | | perf_swevent_ctx_event
> > > >> | | do_perf_sw_event
> > > >> | | perf_tp_event
> > > >> | | |
> > > >> | | |--55.37%-- ftrace_profile_lock_acquire
> > > >> | | | lock_acquire
> > > >> | | | |
> > > >> | | | |--37.31%-- _raw_spin_lock
> > > >>
> > > >> The problem is not that much the trace recursion itself, as we have a
> > > >> recursion protection already (though it's always wasteful to recurse).
> > > >> But the trace events are outside the lockdep recursion protection, then
> > > >> each lockdep event triggers a lock trace, which will trigger two
> > > >> other lockdep events. Here the recursive lock trace event won't
> > > >> be taken because of the trace recursion, so the recursion stops there
> > > >> but lockdep will still analyse these new events:
> > > >>
> > > >> To sum up, for each lockdep events we have:
> > > >>
> > > >> lock_*()
> > > >> |
> > > >> trace lock_acquire
> > > >> |
> > > >> ----- rcu_read_lock()
> > > >> | |
> > > >> | lock_acquire()
> > > >> | |
> > > >> | trace_lock_acquire() (stopped)
> > > >> | |
> > > >> | lockdep analyze
> > > >> |
> > > >> ----- rcu_read_unlock()
> > > >> |
> > > >> lock_release
> > > >> |
> > > >> trace_lock_release() (stopped)
> > > >> |
> > > >> lockdep analyze
> > > >>
> > > >> And you can repeat the above two times as we have two rcu read side
> > > >> sections when we submit an event.
> > > >>
> > > >> This is fixed in this pacth by using the non-lockdep versions of
> > > >> rcu_read_(un)lock.
> > > >
> > > > Hmmm... Perhaps I should rename __rcu_read_lock() to something more
> > > > meaningful if it is to be used outside of the RCU files. In the
> > > > meantime:
> > > >
> > > > Reviewed-by: Paul E. McKenney <[email protected]>
> > > >
> > >
> > > Perhaps we can use the existed rcu_read_lock_sched_notrace().
> > >
> > > not relate to this patchset, but RCU & lockdep:
> > >
> > > We need to remove lockdep from rcu_read_lock_*().
> >
> > I'm not at all convinced we need to do any such thing, remember its
> > debugging stuff, performance, while nice, doesn't really count.
>
> That said, I'm not at all happy about removing lockdep annotations to make
> the tracer faster, that's really counter productive.

Are there no dynamic techniques that could be used here?

Lockdep obviously wants maximum instrumentation coverage - performance be
damned.

Lock profiling/tracing/visualization wants the minimum subset of events it is
interested in - everything else is unnecessary overhead.

Ingo

2010-02-05 12:10:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Fri, 2010-02-05 at 11:49 +0100, Ingo Molnar wrote:
>
> > That said, I'm not at all happy about removing lockdep annotations to make
> > the tracer faster, that's really counter productive.
>
> Are there no dynamic techniques that could be used here?
>
> Lockdep obviously wants maximum instrumentation coverage - performance be
> damned.
>
> Lock profiling/tracing/visualization wants the minimum subset of events it is
> interested in - everything else is unnecessary overhead.

Well, they could start by moving the tracepoint inside the lockdep
recursion check.
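
To make the suggestion concrete, here is a sketch of the lock_acquire()
hook with the tracepoint moved below the recursion guard. This
paraphrases the 2.6.33-era kernel/lockdep.c from memory rather than
quoting it, so the surrounding details are approximate:

void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
		  int trylock, int read, int check,
		  struct lockdep_map *nest_lock, unsigned long ip)
{
	unsigned long flags;

	/* today the tracepoint fires before this check, hence the recursion */
	if (unlikely(current->lockdep_recursion))
		return;

	raw_local_irq_save(flags);
	check_flags(flags);

	current->lockdep_recursion = 1;
	/*
	 * Moved inside the guard: lock events emitted from the tracing
	 * path no longer re-enter lockdep.
	 */
	trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip);
	__lock_acquire(lock, subclass, trylock, read, check,
		       irqs_disabled_flags(flags), nest_lock, ip, 0);
	current->lockdep_recursion = 0;
	raw_local_irq_restore(flags);
}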

2010-02-05 12:13:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Fri, 2010-02-05 at 13:10 +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-05 at 11:49 +0100, Ingo Molnar wrote:
> >
> > > That said, I'm not at all happy about removing lockdep annotations to make
> > > the tracer faster, that's really counter productive.
> >
> > Are there no dynamic techniques that could be used here?
> >
> > Lockdep obviously wants maximum instrumentation coverage - performance be
> > damned.
> >
> > Lock profiling/tracing/visualization wants the minimum subset of events it is
> > interested in - everything else is unnecessary overhead.
>
> Well, they could start by moving the tracepoint inside the lockdep
> recursion check.

IIRC the reason it's now outside is that you'd lose tracepoints on
lockdep_off() usage, but having the tracer folks help with removing any
such usage is of course a good thing.

The usage thereof in nmi_enter() doesn't seem like a problem, since
you're not supposed to be using locks from nmi context anyway; more so,
I'd not be averse to putting BUG_ON(in_nmi()) in every lockdep hook.

2010-02-05 13:02:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Fri, 2010-02-05 at 13:12 +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-05 at 13:10 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-02-05 at 11:49 +0100, Ingo Molnar wrote:
> > >
> > > > That said, I'm not at all happy about removing lockdep annotations to make
> > > > the tracer faster, that's really counter productive.
> > >
> > > Are there no dynamic techniques that could be used here?
> > >
> > > Lockdep obviously wants maximum instrumentation coverage - performance be
> > > damned.
> > >
> > > Lock profiling/tracing/visualization wants the minimum subset of events it is
> > > interested in - everything else is unnecessary overhead.
> >
> > Well, they could start by moving the tracepoint inside the lockdep
> > recursion check.
>
> IIRC the reason its now outside is that you'd loose tracepoint on
> lockdep_off() usage, but having the tracer folks help on removing any
> such usage is of course a good thing.
>
> The usage thereof in nmi_enter() doesn't seem like a problem, since
> you're not supposed to be using locks from nmi context anyway, more so,
> I'd not be adverse to putting BUG_ON(in_nmi()) in every lockdep hook.

Another nasty side effect is that it (lockdep recursion) isn't IRQ aware,
in that we don't do any tracking for IRQs that hit while we're doing
lockdep. We can fix that using a recursion context like we did for perf;
that would actually improve lockdep itself too.
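
Something along these lines, as a purely hypothetical sketch modelled on
perf's swevent recursion protection (none of these names exist in
lockdep today); it assumes the lockdep hooks run with preemption
disabled:

struct lockdep_recursion_counts {
	int count[4];			/* task, softirq, hardirq, nmi */
};
static DEFINE_PER_CPU(struct lockdep_recursion_counts, lockdep_rctx);

static int lockdep_get_recursion_context(void)
{
	int rctx;

	if (in_nmi())
		rctx = 3;
	else if (in_irq())
		rctx = 2;
	else if (in_softirq())
		rctx = 1;
	else
		rctx = 0;

	/* already inside lockdep in this context: bail out */
	if (__get_cpu_var(lockdep_rctx).count[rctx])
		return -1;

	__get_cpu_var(lockdep_rctx).count[rctx]++;
	barrier();
	return rctx;
}

static void lockdep_put_recursion_context(int rctx)
{
	barrier();
	__get_cpu_var(lockdep_rctx).count[rctx]--;
}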


2010-02-05 14:08:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
> TRACE_EVENT_INJECT macro is the same as TRACE_EVENT but takes one
> more parameter that defines an "inject" callback to be called when
> the event is enabled.

I was just talking with Johannes about this.

> This is useful when we need to catch up with events that have
> already occured but that are required for the user.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Hitoshi Mitake <[email protected]>
> Cc: Li Zefan <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> include/linux/ftrace_event.h | 1 +
> include/linux/tracepoint.h | 3 +++
> include/trace/define_trace.h | 6 ++++++
> include/trace/ftrace.h | 31 ++++++++++++++++++++++++++++++-
> kernel/trace/trace_events.c | 3 +++
> 5 files changed, 43 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index cd95919..026d39b 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -126,6 +126,7 @@ struct ftrace_event_call {
> int (*show_format)(struct ftrace_event_call *,
> struct trace_seq *);
> int (*define_fields)(struct ftrace_event_call *);
> + void (*inject)(void);
> struct list_head fields;
> int filter_active;
> struct event_filter *filter;
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index f59604e..f114aec 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -291,5 +291,8 @@ static inline void tracepoint_synchronize_unregister(void)
> #define TRACE_EVENT_FN(name, proto, args, struct, \
> assign, print, reg, unreg) \
> DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
> +#define TRACE_EVENT_INJECT(name, proto, args, struct, \
> + assign, print, inject) \

I would much rather not add this macro and instead make a
register_event_callback() that a process could register.

Actually, I was thinking of something even a bit more functional:

Creating a command file in the event directory along with format,filter
and id.

It would default to "trace", or "ftrace" if you want it to be specific to
ftrace. Then we could add:

echo perf > command

which would make the callback call the perf command instead of the
ftrace one. Or use >> to call both commands.

So instead of making a unique MACRO that is very stale, I would rather
change the structure to be much more generic.

This way you could do something like:

register_event_command(event, "perf_inject", perf_inject, data);

And when the event is enabled, it would call the perf_inject function.
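
A rough sketch of the prototype being described, purely to make the
proposal concrete - neither the function nor its exact signature exists
yet, and the thread floats a couple of argument layouts:

/* Hypothetical API, not in the tree: attach callbacks to an event. */
int register_event_command(const char *event, const char *command,
			   void (*hit)(void *data),	/* each time the event fires */
			   void (*enable)(void *data),	/* when the event is enabled */
			   void (*disable)(void *data),	/* when the event is disabled */
			   void *data);

/* lock_class_init would then stay a plain TRACE_EVENT, plus: */
static void inject_lock_classes(void *data)
{
	/* replay lock_class_init for the classes lockdep already knows */
}

static int __init lock_class_inject_init(void)
{
	return register_event_command("lock_class_init", NULL,
				      NULL, inject_lock_classes, NULL, NULL);
}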

How does this sound to you?

Note, this may also clean up some of ftrace.h too.

-- Steve

> + DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
>
> #endif /* ifdef TRACE_EVENT (see note above) */
> diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
> index 5acfb1e..41f7ce3 100644
> --- a/include/trace/define_trace.h
> +++ b/include/trace/define_trace.h
> @@ -31,6 +31,11 @@
> assign, print, reg, unreg) \
> DEFINE_TRACE_FN(name, reg, unreg)
>
> +#undef TRACE_EVENT_INJECT
> +#define TRACE_EVENT_INJECT(name, proto, args, tstruct, \
> + assign, print, inject) \
> + DEFINE_TRACE(name)
> +
> #undef DEFINE_EVENT
> #define DEFINE_EVENT(template, name, proto, args) \
> DEFINE_TRACE(name)
> @@ -71,6 +76,7 @@
>
> #undef TRACE_EVENT
> #undef TRACE_EVENT_FN
> +#undef TRACE_EVENT_INJECT
> #undef DECLARE_EVENT_CLASS
> #undef DEFINE_EVENT
> #undef DEFINE_EVENT_PRINT
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index f2c09e4..869da37 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -37,6 +37,26 @@
> PARAMS(print)); \
> DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
>
> +/*
> + * TRACE_EVENT_INJECT creates an event that has an injector callback
> + * to call when the trace event is enabled, usually to trigger
> + * automatically some necessary initial traces.
> + */
> +#undef TRACE_EVENT_INJECT
> +#define TRACE_EVENT_INJECT(name, proto, args, tstruct, \
> + assign, print, inject) \
> + DECLARE_EVENT_CLASS(name, \
> + PARAMS(proto), \
> + PARAMS(args), \
> + PARAMS(tstruct), \
> + PARAMS(assign), \
> + PARAMS(print)); \
> + DEFINE_EVENT_INJECT(name, name, PARAMS(proto), PARAMS(args), inject);
> +
> +#undef DEFINE_EVENT_INJECT
> +#define DEFINE_EVENT_INJECT(template, name, proto, args, inject) \
> + DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args));
> +
>
> #undef __field
> #define __field(type, item) type item;
> @@ -726,7 +746,11 @@ static struct trace_event ftrace_event_type_##call = { \
> #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)
>
> #undef DEFINE_EVENT
> -#define DEFINE_EVENT(template, call, proto, args) \
> +#define DEFINE_EVENT(template, call, proto, args) \
> + DEFINE_EVENT_INJECT(template, call, PARAMS(proto), PARAMS(proto), NULL)
> +
> +#undef DEFINE_EVENT_INJECT
> +#define DEFINE_EVENT_INJECT(template, call, proto, args, injector) \
> \
> static struct ftrace_event_call __used \
> __attribute__((__aligned__(4))) \
> @@ -739,6 +763,7 @@ __attribute__((section("_ftrace_events"))) event_##call = { \
> .unregfunc = ftrace_raw_unreg_event_##call, \
> .show_format = ftrace_format_##template, \
> .define_fields = ftrace_define_fields_##template, \
> + .inject = injector, \
> _TRACE_PROFILE_INIT(call) \
> }
>
> @@ -877,6 +902,10 @@ ftrace_profile_templ_##call(struct ftrace_event_call *event_call, \
> __count, irq_flags); \
> }
>
> +#undef DEFINE_EVENT_INJECT
> +#define DEFINE_EVENT_INJECT(template, call, proto, args, inject) \
> + DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args))
> +
> #undef DEFINE_EVENT
> #define DEFINE_EVENT(template, call, proto, args) \
> static void ftrace_profile_##call(proto) \
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 189b09b..5c75cc7 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -142,6 +142,9 @@ static int ftrace_event_enable_disable(struct ftrace_event_call *call,
> break;
> }
> call->enabled = 1;
> +
> + if (call->inject)
> + call->inject();
> }
> break;
> }

2010-02-05 14:13:23

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 03/11] tracing: Inject lock_class_init events on registration

On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:

> static inline void lockdep_off(void)
> diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
> index 90af03c..6999f16 100644
> --- a/include/trace/events/lock.h
> +++ b/include/trace/events/lock.h
> @@ -9,7 +9,7 @@
>
> #ifdef CONFIG_LOCKDEP
>
> -TRACE_EVENT(lock_class_init,
> +TRACE_EVENT_INJECT(lock_class_init,
>
> TP_PROTO(struct lock_class *class),
>
> @@ -30,7 +30,14 @@ TRACE_EVENT(lock_class_init,
> __entry->class_id = (void *)class->name;
> ),
>
> - TP_printk("%p %s", __entry->class_id, __get_str(class_name))
> + TP_printk("%p %s", __entry->class_id, __get_str(class_name)),
> +
> + /*
> + * On activation, we want to send an event for each lock
> + * classes that have been recorded by lockdep, so that we
> + * catch up with the existing class:name mappings.
> + */
> + lock_class_init_inject_events
> );
>

I also suggested this type of registering:

register_event_callback("event", "command", command_func,
command_enable_func,
command_disable_func);

Where command_func, command_enable_func and command_disable_func can all
be a function pointer or NULL.

command_func gets called every time the event is hit while enabled.

command_enable_func gets called when the event is enabled

command_disable_func gets called when the event is disabled.

Hmm,

I could make it so that if the string "command" is NULL, these
functions are registered permanently and cannot be disabled by the user.
Otherwise, the user needs to set them with "echo command > command".

-- Steve

2010-02-05 14:30:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 03/11] tracing: Inject lock_class_init events on registration

On Fri, 2010-02-05 at 09:13 -0500, Steven Rostedt wrote:
> On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
>
> > static inline void lockdep_off(void)
> > diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
> > index 90af03c..6999f16 100644
> > --- a/include/trace/events/lock.h
> > +++ b/include/trace/events/lock.h
> > @@ -9,7 +9,7 @@
> >
> > #ifdef CONFIG_LOCKDEP
> >
> > -TRACE_EVENT(lock_class_init,
> > +TRACE_EVENT_INJECT(lock_class_init,
> >
> > TP_PROTO(struct lock_class *class),
> >
> > @@ -30,7 +30,14 @@ TRACE_EVENT(lock_class_init,
> > __entry->class_id = (void *)class->name;
> > ),
> >
> > - TP_printk("%p %s", __entry->class_id, __get_str(class_name))
> > + TP_printk("%p %s", __entry->class_id, __get_str(class_name)),
> > +
> > + /*
> > + * On activation, we want to send an event for each lock
> > + * classes that have been recorded by lockdep, so that we
> > + * catch up with the existing class:name mappings.
> > + */
> > + lock_class_init_inject_events
> > );
> >
>
> I also suggested this type of registering:
>
> register_event_callback("event", "command", command_func,
> command_enable_func,
> command_disable_func);
>
> Where command_func, command_enable_func and command_disable_func can all
> be a function pointer or NULL.
>
> command_func gets called every time the event is hit while enabled.
>
> command_enable_func gets called when the event is enabled
>
> command_disable_func gets called when the event is disabled.

Well, that will solve johill's polling thing, but not work nicely with
the proposed collection iteration stuff.

2010-02-05 14:44:48

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 03/11] tracing: Inject lock_class_init events on registration

On Fri, 2010-02-05 at 15:30 +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-05 at 09:13 -0500, Steven Rostedt wrote:

> > register_event_callback("event", "command", command_func,
> > command_enable_func,
> > command_disable_func);
> >
> > Where command_func, command_enable_func and command_disable_func can all
> > be a function pointer or NULL.
> >
> > command_func gets called every time the event is hit while enabled.
> >
> > command_enable_func gets called when the event is enabled
> >
> > command_disable_func gets called when the event is disabled.
>
> Well, that will solve johill's polling thing, but not work nicely with
> the proposed collection iteration stuff.
>

Actually, johill and I decided that this was not the proper solution for
him ;-)

I wanted to do this so we could add a lot more functionality and
flexibility to events. Start the tracer when this event is hit, stop it
on that event, but only when the filters match. Things like that. And
I'm sure people could come up with much better and more imaginative
commands that would be useful for tracing.

What does not fit with this proposal? I really hate adding more macros
for every new feature or functionality we need. It is ugly and will
become a maintenance nightmare.

-- Steve



2010-02-05 14:47:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
> TRACE_EVENT_INJECT macro is the same as TRACE_EVENT but takes one

> #undef DEFINE_EVENT
> #define DEFINE_EVENT(template, call, proto, args) \
> static void ftrace_profile_##call(proto) \
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 189b09b..5c75cc7 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -142,6 +142,9 @@ static int ftrace_event_enable_disable(struct ftrace_event_call *call,
> break;
> }
> call->enabled = 1;
> +
> + if (call->inject)
> + call->inject();
> }
> break;
> }

With the proposal I'm suggesting:

register_event_command(event, NULL, NULL, inject_me, inject_me, data);

The above would do basically what this patch does. Whenever the event
is enabled or disabled, the inject_me function will be called.

Would this work?

-- Steve

2010-02-05 14:53:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Fri, 2010-02-05 at 09:47 -0500, Steven Rostedt wrote:
> On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
> > TRACE_EVENT_INJECT macro is the same as TRACE_EVENT but takes one
>
> > #undef DEFINE_EVENT
> > #define DEFINE_EVENT(template, call, proto, args) \
> > static void ftrace_profile_##call(proto) \
> > diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> > index 189b09b..5c75cc7 100644
> > --- a/kernel/trace/trace_events.c
> > +++ b/kernel/trace/trace_events.c
> > @@ -142,6 +142,9 @@ static int ftrace_event_enable_disable(struct ftrace_event_call *call,
> > break;
> > }
> > call->enabled = 1;
> > +
> > + if (call->inject)
> > + call->inject();
> > }
> > break;
> > }
>
> With the proposal I'm suggesting:
>
> register_event_command(event, NULL, NULL, inject_me, inject_me, data);
>
> The above would do basically what this patch does. When ever the event
> is enabled or disabled the inject_me function will be called.
>
> Would this work?

No, because a dump-all inject sucks chunks.

2010-02-05 15:07:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Fri, 2010-02-05 at 15:53 +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-05 at 09:47 -0500, Steven Rostedt wrote:
> > On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
> > > TRACE_EVENT_INJECT macro is the same as TRACE_EVENT but takes one
> >
> > > #undef DEFINE_EVENT
> > > #define DEFINE_EVENT(template, call, proto, args) \
> > > static void ftrace_profile_##call(proto) \
> > > diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> > > index 189b09b..5c75cc7 100644
> > > --- a/kernel/trace/trace_events.c
> > > +++ b/kernel/trace/trace_events.c
> > > @@ -142,6 +142,9 @@ static int ftrace_event_enable_disable(struct ftrace_event_call *call,
> > > break;
> > > }
> > > call->enabled = 1;
> > > +
> > > + if (call->inject)
> > > + call->inject();
> > > }
> > > break;
> > > }
> >
> > With the proposal I'm suggesting:
> >
> > register_event_command(event, NULL, NULL, inject_me, inject_me, data);
> >
> > The above would do basically what this patch does. When ever the event
> > is enabled or disabled the inject_me function will be called.
> >
> > Would this work?
>
> No, because a dump all inject sucks chunks.

How would it be dumping all?

You need to pick an event to register, not all events will do this.

How is this different from defining a TRACE_EVENT_INJECT()? Now
lock_class_init will always call this inject callback.

What I am proposing is to add the lock_class_init just as TRACE_EVENT()
and then in a initcall, do:

register_event_command("lock_class_init", NULL, NULL, inject_me,
inject_me, data);

The above would have lock_class_init call inject_me whenever it is
enabled or disabled. Isn't that exactly what TRACE_EVENT_INJECT() is doing?

I have no qualms about adding lock_class_init; I just don't think we
need another TRACE_EVENT macro, which is very inflexible. I'd rather
consolidate the macros in ftrace.h than keep adding new ones.

-- Steve

2010-02-06 10:38:13

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Thu, Feb 04, 2010 at 08:40:47PM +0100, Jens Axboe wrote:
> On Wed, Feb 03 2010, Frederic Weisbecker wrote:
> > Ok, thanks a lot, the fact you can test on a 64 threads box is critically
> > helpful.
> >
> > I also wonder what happens after this patch applied:
> >
> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> > index 98fd360..254b3d4 100644
> > --- a/kernel/perf_event.c
> > +++ b/kernel/perf_event.c
> > @@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_pid_nr_ns(p, event->ns);
> > + return p->pid;
> > }
> >
> > In my box it has increased the speed from 2x this patchset.
>
> Doesn't seem to change anything, same runtime for a ls.


Ah.



> > I wonder if the tool becomes usable for you with that.
> > Otherwise, it means we have other things to fix, and
> > the result of:
> >
> > perf record -g -f perf lock record sleep 6
> > perf report
> >
> > would be very nice to have.
>
> root@nehalem:/dev/shm # perf record -g -f perf lock record sleep 6
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 446.208 MB perf.data (~19495127 samples) ]
> [ perf record: Woken up 9 times to write data ]
> [ perf record: Captured and wrote 1.135 MB perf.data (~49609 samples) ]
>
> It's huuuge. Thankfully the output isn't so big, I've attached it.



Yeah. I made a mistake: both perf record and perf lock are writing to the
same perf.data with this cmdline.

Actually we'd have needed:

perf record -o perflock.data -g -f perf lock record sleep 6
perf report -i perflock.data

That's why you don't have callchains in your profile, probably because the
file has been messed up.

Anyway, we should retry after my next iteration, I have various
reviews to address :)

Thanks.

2010-02-06 11:12:22

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Fri, Feb 05, 2010 at 02:01:55PM +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-05 at 13:12 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-02-05 at 13:10 +0100, Peter Zijlstra wrote:
> > > On Fri, 2010-02-05 at 11:49 +0100, Ingo Molnar wrote:
> > > >
> > > > > That said, I'm not at all happy about removing lockdep annotations to make
> > > > > the tracer faster, that's really counter productive.
> > > >
> > > > Are there no dynamic techniques that could be used here?
> > > >
> > > > Lockdep obviously wants maximum instrumentation coverage - performance be
> > > > damned.
> > > >
> > > > Lock profiling/tracing/visualization wants the minimum subset of events it is
> > > > interested in - everything else is unnecessary overhead.
> > >
> > > Well, they could start by moving the tracepoint inside the lockdep
> > > recursion check.
> >
> > IIRC the reason its now outside is that you'd loose tracepoint on
> > lockdep_off() usage, but having the tracer folks help on removing any
> > such usage is of course a good thing.
> >
> > The usage thereof in nmi_enter() doesn't seem like a problem, since
> > you're not supposed to be using locks from nmi context anyway, more so,
> > I'd not be adverse to putting BUG_ON(in_nmi()) in every lockdep hook.
>
> Another nasty side effect is that it (lockdep recursion) isn't IRQ aware
> in that we don't do any tracking for IRQ's that hit while we're doing
> lockdep. We can fix that using a recursion context like we did for perf,
> that would actually improve lockdep itself too.


Yep, I agree with you. With the lockdep recursion check fixed to be
subtle enough for that + the lock events under lockdep recursion checks,
it fixes the situation while keeping the lockdep coverage in perf
tracing path for other cases.

I will start by adressing this.

That said, I think this is good as a first step, but we can't keep forcing
the lock events -> lockdep dependency in the long term. We can't have serious
lock profiling if we are doomed to suffer the slowness of the lockdep checks
at the same time.

Sure, we can continue to support having both, but I think we should also
think about a solution to handle lock events without it in the future.
That will require keeping some minimal lockdep functionality (the lockdep
map and the class hashes).
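
Just to fix ideas, here is the kind of minimal class tracking I have in mind,
a rough sketch with made-up names (this is not the lockdep code, only an
illustration of keeping a class hash so the fast-path events can carry a
compact class id instead of a name):

#include <stdint.h>

#define CLASS_HASH_BITS		10
#define CLASS_HASH_SIZE		(1 << CLASS_HASH_BITS)

struct lock_class_entry {
	const void *key;		/* e.g. the address of the class key */
	uint32_t id;			/* compact id carried by the lock events */
	struct lock_class_entry *next;	/* hash chain */
};

static struct lock_class_entry *class_hash[CLASS_HASH_SIZE];
static uint32_t next_class_id;

static unsigned int class_hash_fn(const void *key)
{
	/* cheap pointer hash, good enough for a sketch */
	return ((uintptr_t)key >> 4) & (CLASS_HASH_SIZE - 1);
}

/* Look up an existing class, or register @new and return its fresh id. */
static uint32_t lock_class_id(const void *key, struct lock_class_entry *new)
{
	struct lock_class_entry **head = &class_hash[class_hash_fn(key)];
	struct lock_class_entry *e;

	for (e = *head; e; e = e->next)
		if (e->key == key)
			return e->id;

	new->key = key;
	new->id = next_class_id++;
	new->next = *head;
	*head = new;

	return new->id;	/* this is where a lock_class_init event would fire */
}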

Thanks.

2010-02-06 11:24:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Sat, 2010-02-06 at 12:12 +0100, Frederic Weisbecker wrote:

> That said, I think this is good for a first step, but we can't continue
> to force the lock events -> lockdep dependency in the long term. We
> can't have a serious lock profiling if we are doomed to suffer the
> slowness due to lockdep checks at the same time.
>
> Sure we can continue to support having both, but I think we should also
> think about a solution to handle lock events without it in the future.
> That will require some minimal lockdep functionalities (keeping the
> lockdep map, and class hashes).

You mean like building without CONFIG_PROVE_LOCKING, or boot with
lockdep.prove_locking=0, or use
echo 0 > /sys/modules/lockdep/prove_locking ?

That keeps the lock tracking but does away with all the dependency
analysis; it was created for just such a use case as the one you are
looking at, namely lockstat.
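
Roughly, and just as an illustration (the helper names below are made up,
this is not the actual lockdep code), the knob simply gates the expensive
dependency analysis while the basic held-lock tracking keeps running:

struct held_lock { const void *instance; };

static int prove_locking = 1;	/* lockdep.prove_locking= / sysfs knob */

/* always done: record the lock in the per-task held list (stubbed here) */
static void track_held_lock(struct held_lock *hlock) { (void)hlock; }

/* the costly part: chain hashing, IRQ-state and ordering checks (stubbed) */
static void validate_dependencies(struct held_lock *hlock) { (void)hlock; }

static void lock_acquire_path(struct held_lock *hlock)
{
	track_held_lock(hlock);

	if (!prove_locking)
		return;		/* tracking only, no dependency analysis */

	validate_dependencies(hlock);
}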

2010-02-06 11:40:58

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Sat, Feb 06, 2010 at 12:24:02PM +0100, Peter Zijlstra wrote:
> On Sat, 2010-02-06 at 12:12 +0100, Frederic Weisbecker wrote:
>
> > That said, I think this is good for a first step, but we can't continue
> > to force the lock events -> lockdep dependency in the long term. We
> > can't have a serious lock profiling if we are doomed to suffer the
> > slowness due to lockdep checks at the same time.
> >
> > Sure we can continue to support having both, but I think we should also
> > think about a solution to handle lock events without it in the future.
> > That will require some minimal lockdep functionalities (keeping the
> > lockdep map, and class hashes).
>
> You mean like building without CONFIG_PROVE_LOCKING, or boot with
> lockdep.prove_locking=0, or use echo 0
> > /sys/modules/lockdep/prove_locking ?
>
> That keeps the lock tracking but does away with all the dependency
> analysis and was created for just such an use case as you are looking
> at, namely lockstat.


Looks pretty much like what I'm looking for. Except that it still fills
and keeps track of the locks held by the current thread, namely the
copies in curr->held_locks.

2010-02-06 12:20:15

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Fri, Feb 05, 2010 at 10:07:32AM -0500, Steven Rostedt wrote:
> How would it be dumping all?
>
> You need to pick an event to register, not all events will do this.
>
> How is this different than defining a TRACE_EVENT_INJECT()? Now
> lock_class_init will always call this inject.
>
> What I am proposing is to add the lock_class_init just as TRACE_EVENT()
> and then in a initcall, do:
>
> register_event_command("lock_class_init", NULL, NULL, inject_me,
> inject_me, data);
>
> The above would have lock_class_init, when enabled or disabled call
> inject_me. Isn't that exactly what the TRACE_EVENT_INJECT() is doing?
>
> I have no qualms about adding lock_class_init, I just don't think we
> need another TRACE_EVENT macro, that is very inflexible. I rather
> consolidate the macros in ftrace.h than be adding new ones.


This won't work for perf's needs because:

1) we need this call to be asynchronous, i.e. called whenever we
want.

Basically, the injection of this event is required for the other lock
events to be understandable. It is split in two parts:

* a normal, synchronous trace event with a regular trace_* call site;
  this tracks new lock classes as they get created.
* an asynchronous injection point that catches up with the old events
  we need.

These two parts cover all the lock init events we need.

But there is a reason I haven't made this injection callback something
we call once the event gets enabled (I had a first version using
TRACE_EVENT_CALLBACK, but it didn't work): when perf registers a trace
event, the event is not scheduled in yet, that happens a bit later (and
it's even more complicated from the perf tools side, as we create every
event disabled first and enable them later).

So if you call the injector right after registering a user for the
trace event, the injected events will be refused by perf, as the event
is not yet set up in the context.

This is why perf needs to choose the time when this injector is called.

2) we don't want to call that for every registered lock init event

We create one lock_init perf event per cpu, but only one of them needs
to call this injector. We only need to catch up with the old events
once.


3) perf and ftrace need the same injector callback here.

4) but ftrace needs this callback to be called when the event gets
enabled (otherwise, other lock init events just don't make sense).

I agree with you that creating a new trace event macro is a bit
insane; ftrace.h is already a nightmare. I just thought that having the
injector defined inside the same macro as the synchronous event helped
make its nature clear: it needs a secondary async catch-up mechanism.

But a register_event_injector thing will work too, and we probably want
that instead, given the maintenance pain otherwise.
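
Something like the following, just to fix ideas; every name here is
hypothetical, nothing of this exists yet:

/*
 * Hypothetical interface sketch: an injector is a callback attached to
 * an existing trace event that a consumer (perf or ftrace) can fire
 * whenever it is ready to receive the catch-up events.
 */
struct ftrace_event_call;

typedef void (*event_injector_t)(struct ftrace_event_call *call, void *data);

/* Attach an injector to the named event (e.g. "lock_class_init"). */
int register_event_injector(const char *event_name,
			    event_injector_t inject, void *data);

/*
 * Called by the consumer once its buffers/context are set up: perf
 * would call it from the injection ioctl, ftrace from the event enable
 * path.
 */
void trace_event_run_injector(struct ftrace_event_call *call);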

I really would like to make something useful for everyone; could you
tell me more about johill's needs?

Thanks.

2010-02-06 13:21:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Sat, 2010-02-06 at 13:20 +0100, Frederic Weisbecker wrote:

> I agree with you that creating a new trace event macro is a bit
> insane. ftrace.h is already a nightmare. I just thought that
> having the injector set inside the same macro that the synchronous
> event is defined helped to make it clear about its nature: that
> it needs a secondary async catch up thing.
>
> But a register_event_injector thing will work too, we probably
> better want that, given the maintainance pain otherwise.

We can add a register_event_injector later. For now, why not just add
the TRACE_EVENT() and then hook into it from perf using the normal
tracepoint mechanism?

You could add some macro around the trace_init_lock_class() call that
would facilitate finding all the locks you need. This would probably be
a bit more straightforward than overloading TRACE_EVENT() again.

>
> I really would like to make something useful for everyone, could
> you tell me more about johill needs?

Well, basically he needed a way to cause polling to happen using an
event. From what I understood, the polling called the trace point, and
just enabling the trace point did nothing because the polling did not
take place. I thought about ways to run such a command when the trace
point was enabled, but in the end it was specific to a driver, and
another debugfs file seemed a better fit to initiate it.

-- Steve

2010-02-06 14:18:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Sat, 2010-02-06 at 12:40 +0100, Frederic Weisbecker wrote:
> On Sat, Feb 06, 2010 at 12:24:02PM +0100, Peter Zijlstra wrote:
> > On Sat, 2010-02-06 at 12:12 +0100, Frederic Weisbecker wrote:
> >
> > > That said, I think this is good for a first step, but we can't continue
> > > to force the lock events -> lockdep dependency in the long term. We
> > > can't have a serious lock profiling if we are doomed to suffer the
> > > slowness due to lockdep checks at the same time.
> > >
> > > Sure we can continue to support having both, but I think we should also
> > > think about a solution to handle lock events without it in the future.
> > > That will require some minimal lockdep functionalities (keeping the
> > > lockdep map, and class hashes).
> >
> > You mean like building without CONFIG_PROVE_LOCKING, or boot with
> > lockdep.prove_locking=0, or use echo 0
> > > /sys/modules/lockdep/prove_locking ?
> >
> > That keeps the lock tracking but does away with all the dependency
> > analysis and was created for just such an use case as you are looking
> > at, namely lockstat.
>
>
> Looks pretty what I'm looking for. Except that it still continues
> to fill and keep track of the locks held by the current thread,
> namely the copies in curr->held_locks.

Which is exactly what you need for that lock hierarchy recording you
wanted :-)

2010-02-06 16:11:06

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Sat, Feb 06, 2010 at 03:17:47PM +0100, Peter Zijlstra wrote:
> On Sat, 2010-02-06 at 12:40 +0100, Frederic Weisbecker wrote:
> > On Sat, Feb 06, 2010 at 12:24:02PM +0100, Peter Zijlstra wrote:
> > > On Sat, 2010-02-06 at 12:12 +0100, Frederic Weisbecker wrote:
> > >
> > > > That said, I think this is good for a first step, but we can't continue
> > > > to force the lock events -> lockdep dependency in the long term. We
> > > > can't have a serious lock profiling if we are doomed to suffer the
> > > > slowness due to lockdep checks at the same time.
> > > >
> > > > Sure we can continue to support having both, but I think we should also
> > > > think about a solution to handle lock events without it in the future.
> > > > That will require some minimal lockdep functionalities (keeping the
> > > > lockdep map, and class hashes).
> > >
> > > You mean like building without CONFIG_PROVE_LOCKING, or boot with
> > > lockdep.prove_locking=0, or use echo 0
> > > > /sys/modules/lockdep/prove_locking ?
> > >
> > > That keeps the lock tracking but does away with all the dependency
> > > analysis and was created for just such an use case as you are looking
> > > at, namely lockstat.
> >
> >
> > Looks pretty what I'm looking for. Except that it still continues
> > to fill and keep track of the locks held by the current thread,
> > namely the copies in curr->held_locks.
>
> Which is exactly what you need for that lock hierarchy recording you
> wanted :-)


Well, the lock hierarchy should probably be retrieved from the traces,
using state machines. Otherwise we would need yet more lock events for
that, which is going to add even more overhead.
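
Roughly, the post-processing side would only need to replay the
acquire/release events per thread, something like this sketch
(simplified, not the actual perf lock code):

#include <stdint.h>

#define MAX_HELD 64

struct thread_state {
	uint64_t held[MAX_HELD];	/* class ids currently held, in order */
	int depth;
};

/* Record "parent -> child" each time a lock is taken while others are held. */
static void note_dependency(uint64_t parent, uint64_t child)
{
	/* feed a per-class or per-instance tree/graph here */
	(void)parent;
	(void)child;
}

static void on_lock_acquire(struct thread_state *t, uint64_t class_id)
{
	if (t->depth > 0)
		note_dependency(t->held[t->depth - 1], class_id);
	if (t->depth < MAX_HELD)
		t->held[t->depth++] = class_id;
}

static void on_lock_release(struct thread_state *t, uint64_t class_id)
{
	int i;

	/* releases are not always LIFO, so search from the top */
	for (i = t->depth - 1; i >= 0; i--) {
		if (t->held[i] == class_id) {
			for (; i < t->depth - 1; i++)
				t->held[i] = t->held[i + 1];
			t->depth--;
			return;
		}
	}
}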

2010-02-07 09:46:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Sat, 2010-02-06 at 17:10 +0100, Frederic Weisbecker wrote:
>
> > > Looks pretty what I'm looking for. Except that it still continues
> > > to fill and keep track of the locks held by the current thread,
> > > namely the copies in curr->held_locks.
> >
> > Which is exactly what you need for that lock hierarchy recording you
> > wanted :-)
>
>
> Well, the lock hierarchy should probably be retrieved from the traces,
> using state machines.
> Otherwise we would need yet other lock events for that, which is going
> to add even more overhead.


Right, well you could look at adding a mode that also strips out the
held_lock tracing, but since you really need the register class stuff
to re-generate the class mapping, avoiding the held_lock tracing
doesn't look like it's going to save you much; it's all thread local
storage.

2010-02-07 17:10:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Wed, 2010-02-03 at 23:07 +0100, Frederic Weisbecker wrote:
>
> >
> > Which brings us to the ioctl() interface, we can do the above using
> > ioctl()s, but it seems to me we're starting to get ioctl() heavy and
> > should be looking at alternative ways of extending this.
> >
> > Anybody any bright ideas?
> >
>
>
> Using ioctl, we can basically have such structure:
>
> struct perf_inject_req {
> int start; (in)
> int len; (in)
> int *read; (out)
> };
>
> Using an idx will often imply rewalk a whole object list from
> the beginning, which is perfectly fine.
>
> If you prefer an alternate syscall, I can make it. I've been
> told about ioctl drawbacks by the past, I personally think
> ioctl has not only drawbacks: it avoids a proliferation
> of syscalls. But that's how you prefer.

One thing we could do is add a PERF_TYPE_COLLECTION and have its read()
method do something other than what regular counters do; that way we
could use the read()/lseek()/stat() fd interface. The only trouble is
that it doesn't use the 'regular' output buffer.

/me goes try and come up with yet another solution..

2010-02-10 10:04:52

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Sat, Feb 06, 2010 at 08:19:30AM -0500, Steven Rostedt wrote:
> On Sat, 2010-02-06 at 13:20 +0100, Frederic Weisbecker wrote:
>
> > I agree with you that creating a new trace event macro is a bit
> > insane. ftrace.h is already a nightmare. I just thought that
> > having the injector set inside the same macro that the synchronous
> > event is defined helped to make it clear about its nature: that
> > it needs a secondary async catch up thing.
> >
> > But a register_event_injector thing will work too, we probably
> > better want that, given the maintainance pain otherwise.
>
> We can add a register_event_injector later. For now, why not just add
> the TRACE_EVENT() and then hook to it in perf using the normal
> tracepoint mechanism.
>
> You could add some macro around the trace_init_lock_class() call that
> would facilitate finding all the locks you need. This would probably be
> a bit more straight forward than to overload TRACE_EVENT() again.



I'm not sure what you mean by that.



> >
> > I really would like to make something useful for everyone, could
> > you tell me more about johill needs?
>
> Well, basically he needed a way to cause polling to happen using an
> event. From what I understood, the polling called the trace point. Just
> enabling the trace point did nothing because the polling did not take
> place. I thought about ways to enable this command when the trace point
> was enabled. But in the end, it was specific to a driver and another
> debugfs file seemed a better fit to initiate it.


Ok, thanks.

2010-02-10 10:17:27

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Sun, Feb 07, 2010 at 10:45:55AM +0100, Peter Zijlstra wrote:
> On Sat, 2010-02-06 at 17:10 +0100, Frederic Weisbecker wrote:
> >
> > > > Looks pretty what I'm looking for. Except that it still continues
> > > > to fill and keep track of the locks held by the current thread,
> > > > namely the copies in curr->held_locks.
> > >
> > > Which is exactly what you need for that lock hierarchy recording you
> > > wanted :-)
> >
> >
> > Well, the lock hierarchy should probably be retrieved from the traces,
> > using state machines.
> > Otherwise we would need yet other lock events for that, which is going
> > to add even more overhead.
>
>
> Right, well you could look at adding a mode that also strips out the
> held_lock tracing, but since you really need the register class stuff to
> re-generate the class mapping, avoiding the held_lock tracing doesn't
> look like it's going to save you much, its all thread local storage.


But if we draw a tree-based representation (per-instance or per-class)
from perf lock, we'll already need to reconstruct the locking scenarios
in post-processing, which gives us the dependencies anyway.

I don't think it will help much.

Thanks.

2010-02-10 10:49:25

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] perf/trace/lock optimization/scalability improvements

On Sun, Feb 07, 2010 at 06:10:27PM +0100, Peter Zijlstra wrote:
> One thing we could do is add a PERF_TYPE_COLLECTION and have its read()
> method do something else than regular counters would, that way we can
> use read()/lseek()/stat() fd interface. The only trouble with that is
> that doesn't use the 'regular' output buffer.


But that would be nice too. The drawback is indeed that it requires a
separate implementation treatment compared to the other events.


> /me goes try and come up with yet another solution..


Not sure there is another solution besides the above or an inject/mmap
thing.

2010-02-10 14:05:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Wed, 2010-02-10 at 11:04 +0100, Frederic Weisbecker wrote:

> > We can add a register_event_injector later. For now, why not just add
> > the TRACE_EVENT() and then hook to it in perf using the normal
> > tracepoint mechanism.
> >
> > You could add some macro around the trace_init_lock_class() call that
> > would facilitate finding all the locks you need. This would probably be
> > a bit more straight forward than to overload TRACE_EVENT() again.
>
>
>
> I'm not sure what you mean by that.

Looking back at where trace_lock_class_init() was placed, you don't need
a wrapper. Instead, why not just have perf hook directly into the
trace_lock_class_init trace point and do its injection there, instead
of adding some hack to TRACE_EVENT()?

-- Steve

2010-02-11 18:57:56

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Wed, Feb 10, 2010 at 09:05:06AM -0500, Steven Rostedt wrote:
> On Wed, 2010-02-10 at 11:04 +0100, Frederic Weisbecker wrote:
>
> > > We can add a register_event_injector later. For now, why not just add
> > > the TRACE_EVENT() and then hook to it in perf using the normal
> > > tracepoint mechanism.
> > >
> > > You could add some macro around the trace_init_lock_class() call that
> > > would facilitate finding all the locks you need. This would probably be
> > > a bit more straight forward than to overload TRACE_EVENT() again.
> >
> >
> >
> > I'm not sure what you mean by that.
>
> Looking back at where trace_lock_class_init() was placed, you don't need
> a wrapper. But instead, why not just have perf hook directly into the
> trace_lock_class_init trace point, and do its injection directly.
> Instead of needing to add some hack to TRACE_EVENT()?


We could. But we expect that further events may need such a
catch-up-by-injection mechanism, and ftrace will also need to perform
this injection.

That's why I think something more generic would be more reasonable.

2010-02-11 19:23:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 02/11] tracing: Introduce TRACE_EVENT_INJECT

On Thu, 2010-02-11 at 19:57 +0100, Frederic Weisbecker wrote:
> On Wed, Feb 10, 2010 at 09:05:06AM -0500, Steven Rostedt wrote:

> > Looking back at where trace_lock_class_init() was placed, you don't need
> > a wrapper. But instead, why not just have perf hook directly into the
> > trace_lock_class_init trace point, and do its injection directly.
> > Instead of needing to add some hack to TRACE_EVENT()?
>
>
> We could. But we expect further events may need such catch-up by
> injection system. And ftrace will also need to perform this injection.
>
> That's why I think something more generic would be more reasonable.
>

Hmm, I'm a bit dense lately and still a bit confused.

Why is an injector needed exactly, and why would ftrace want one too?

-- Steve

2010-02-28 22:24:23

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 10/11] tracing/perf: Fix lock events recursions in the fast path

On Fri, Feb 05, 2010 at 02:01:55PM +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-05 at 13:12 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-02-05 at 13:10 +0100, Peter Zijlstra wrote:
> > > On Fri, 2010-02-05 at 11:49 +0100, Ingo Molnar wrote:
> > > >
> > > > > That said, I'm not at all happy about removing lockdep annotations to make
> > > > > the tracer faster, that's really counter productive.
> > > >
> > > > Are there no dynamic techniques that could be used here?
> > > >
> > > > Lockdep obviously wants maximum instrumentation coverage - performance be
> > > > damned.
> > > >
> > > > Lock profiling/tracing/visualization wants the minimum subset of events it is
> > > > interested in - everything else is unnecessary overhead.
> > >
> > > Well, they could start by moving the tracepoint inside the lockdep
> > > recursion check.
> >
> > IIRC the reason its now outside is that you'd loose tracepoint on
> > lockdep_off() usage, but having the tracer folks help on removing any
> > such usage is of course a good thing.
> >
> > The usage thereof in nmi_enter() doesn't seem like a problem, since
> > you're not supposed to be using locks from nmi context anyway, more so,
> > I'd not be adverse to putting BUG_ON(in_nmi()) in every lockdep hook.
>
> Another nasty side effect is that it (lockdep recursion) isn't IRQ aware
> in that we don't do any tracking for IRQ's that hit while we're doing
> lockdep. We can fix that using a recursion context like we did for perf,
> that would actually improve lockdep itself too.


Actually, looking at lock_acquire/release/acquired/contended, they all
perform their job under a raw_local_irq_save() window, so it doesn't
seem we are losing anything.

Something that could be nice though: drop this irq saving from the main
window, add the perf-style recursion protection, and eventually take
raw_local_irq_save() only when we grab internal lockdep locks. This
would let irqs happen during the lockdep checks.
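
Something in the spirit of the perf swevent recursion protection, as a
simplified sketch (not the actual perf or lockdep code; the context
detection and per-cpu handling are stubbed out):

/*
 * One recursion counter per context (task, softirq, hardirq, NMI), so
 * an IRQ hitting in the middle of a lockdep check is still tracked in
 * its own slot instead of being dropped by a single global flag.
 */
enum { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, NR_CTX };

struct recursion { int count[NR_CTX]; };	/* would be per-cpu */

static int recursion_context(void)
{
	/* in the kernel this would look at preempt_count()/in_nmi() */
	return CTX_TASK;
}

/* Returns the context slot on success, -1 if we are already inside. */
static int get_recursion_context(struct recursion *r)
{
	int ctx = recursion_context();

	if (r->count[ctx])
		return -1;
	r->count[ctx]++;
	return ctx;
}

static void put_recursion_context(struct recursion *r, int ctx)
{
	r->count[ctx]--;
}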

My guess is that the irq-disabled section is also protecting something
else, something I'll probably discover after testing that :)