LinuxLists.cc - [patch] trace: Add user-space event tracing/injection

2010-11-17 12:08:06

Subject: [patch] trace: Add user-space event tracing/injection

* Pekka Enberg <[email protected]> wrote:

> (Trimming CC for sanity.)

[ Added back Linus (in case he wants to object to the new prctl()) and lkml. ]

> On 11/16/10 11:04 PM, Thomas Gleixner wrote:
> >'trace' is our shot at improving the situation: it aims at providing a
> >simple to use and straightforward tracing tool based on the perf
> >infrastructure and on the well-known perf profiling workflow:
> >
> > trace record firefox # trace firefox and all child tasks,
> > put it into trace.data
> >
> > trace summary # show quick overview of the trace,
> > based on trace.data
> >
> > trace report # detailed traces, based on trace.data
>
> Nice work guys!
>
> Does this concept lend itself to tracing latencies in userspace applications that
> run in virtual machines (e.g. the Java kind)? I'm of course interested in this
> because of Jato [1] where bunch of interesting things can cause jitter: JIT
> compilation, GC, kernel, and the actual application doing something (in either
> native code or JIT'd code). It's important to be able to measure where "slowness"
> to desktop applications and certain class of server applications comes from to be
> able to improve things.

Makes quite a bit of sense.

How about the attached patch? It works fine with the simple testcase included in the
changelog. There's a common-sense limit on the message size - but otherwise it adds
support for apps to generate a free-form string trace event.

Thanks,

Ingo

---------------------------------->
Subject: trace: Add user-space event tracing/injection
From: Ingo Molnar <[email protected]>
Date: Wed Nov 17 10:11:53 CET 2010

This feature (suggested by Darren Hart and Pekka Engberg) allows user-space
programs to print trace events in a very simple and self-contained way:

#include <sys/prctl.h>
#include <string.h>

#define PR_TASK_PERF_USER_TRACE 35

int main(void)
{
char *msg = "Hello World!\n";

prctl(PR_TASK_PERF_USER_TRACE, msg, strlen(msg));

return 0;
}

These show up in 'trace' output as:

$ trace report
#
# trace events of 'sleep 1':
#
testit/ 6006 ( 0.002 ms): <"Hello World!">
testit/ 6006 ( 0.002 ms): <"Hello World!">

Suggested-by: Darren Hart <[email protected]>
Suggested-by: Pekka Enberg <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/prctl.h | 2 ++
include/trace/events/user.h | 32 ++++++++++++++++++++++++++++++++
kernel/sys.c | 23 +++++++++++++++++++++++
kernel/trace/trace_events.c | 34 +++++++++++-----------------------
tools/perf/builtin-trace.c | 41 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 109 insertions(+), 23 deletions(-)

Index: linux/include/linux/prctl.h
===================================================================
--- linux.orig/include/linux/prctl.h
+++ linux/include/linux/prctl.h
@@ -102,4 +102,6 @@

#define PR_MCE_KILL_GET 34

+#define PR_TASK_PERF_USER_TRACE 35
+
#endif /* _LINUX_PRCTL_H */
Index: linux/include/trace/events/user.h
===================================================================
--- /dev/null
+++ linux/include/trace/events/user.h
@@ -0,0 +1,32 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM user
+
+#if !defined(_TRACE_USER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_USER_H_
+
+#include <linux/tracepoint.h>
+#include <linux/ftrace.h>
+
+TRACE_EVENT(user,
+
+ TP_PROTO(const char *message),
+
+ TP_ARGS(message),
+
+ TP_STRUCT__entry(
+ __string( message, message);
+ ),
+
+ TP_fast_assign(
+ __assign_str(message, message);
+ ),
+
+ TP_printk("user %s", __get_str(message))
+);
+
+#undef NO_DEV
+
+#endif /* _TRACE_USER_H_ */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c
+++ linux/kernel/sys.c
@@ -47,6 +47,11 @@
#include <asm/io.h>
#include <asm/unistd.h>

+#define MAX_USER_TRACE_SIZE 128
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/user.h>
+
#ifndef SET_UNALIGN_CTL
# define SET_UNALIGN_CTL(a,b) (-EINVAL)
#endif
@@ -1681,6 +1686,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
case PR_TASK_PERF_EVENTS_ENABLE:
error = perf_event_task_enable();
break;
+ /*
+ * Inject a trace event into the current tracing context:
+ */
+ case PR_TASK_PERF_USER_TRACE:
+ {
+ void __user *uevent_ptr = (void *)arg2;
+ char kstring[MAX_USER_TRACE_SIZE+1];
+ unsigned long uevent_len = arg3;
+
+ if (uevent_len > MAX_USER_TRACE_SIZE)
+ return -EINVAL;
+ if (copy_from_user(kstring, uevent_ptr, uevent_len))
+ return -EFAULT;
+ kstring[uevent_len] = 0;
+
+ trace_user(kstring);
+ return 0;
+ }
case PR_GET_TIMERSLACK:
error = current->timer_slack_ns;
break;
Index: linux/kernel/trace/trace_events.c
===================================================================
--- linux.orig/kernel/trace/trace_events.c
+++ linux/kernel/trace/trace_events.c
@@ -989,8 +989,7 @@ static struct dentry *event_trace_events

d_events = debugfs_create_dir("events", d_tracer);
if (!d_events)
- pr_warning("Could not create debugfs "
- "'events' directory\n");
+ pr_warning("Could not create debugfs 'events' directory\n");

return d_events;
}
@@ -1014,15 +1013,13 @@ event_subsystem_dir(const char *name, st
/* need to create new entry */
system = kmalloc(sizeof(*system), GFP_KERNEL);
if (!system) {
- pr_warning("No memory to create event subsystem %s\n",
- name);
+ pr_warning("No memory to create event subsystem %s\n", name);
return d_events;
}

system->entry = debugfs_create_dir(name, d_events);
if (!system->entry) {
- pr_warning("Could not create event subsystem %s\n",
- name);
+ pr_warning("Could not create event subsystem %s\n", name);
kfree(system);
return d_events;
}
@@ -1041,8 +1038,7 @@ event_subsystem_dir(const char *name, st

system->filter = kzalloc(sizeof(struct event_filter), GFP_KERNEL);
if (!system->filter) {
- pr_warning("Could not allocate filter for subsystem "
- "'%s'\n", name);
+ pr_warning("Could not allocate filter for subsystem '%s'\n", name);
return system->entry;
}

@@ -1051,8 +1047,7 @@ event_subsystem_dir(const char *name, st
if (!entry) {
kfree(system->filter);
system->filter = NULL;
- pr_warning("Could not create debugfs "
- "'%s/filter' entry\n", name);
+ pr_warning("Could not create debugfs '%s/filter' entry\n", name);
}

trace_create_file("enable", 0644, system->entry,
@@ -1081,19 +1076,16 @@ event_create_dir(struct ftrace_event_cal

call->dir = debugfs_create_dir(call->name, d_events);
if (!call->dir) {
- pr_warning("Could not create debugfs "
- "'%s' directory\n", call->name);
+ pr_warning("Could not create debugfs '%s' directory\n", call->name);
return -1;
}

if (call->class->reg)
- trace_create_file("enable", 0644, call->dir, call,
- enable);
+ trace_create_file("enable", 0644, call->dir, call, enable);

#ifdef CONFIG_PERF_EVENTS
if (call->event.type && call->class->reg)
- trace_create_file("id", 0444, call->dir, call,
- id);
+ trace_create_file("id", 0444, call->dir, call, id);
#endif

/*
@@ -1104,16 +1096,12 @@ event_create_dir(struct ftrace_event_cal
if (list_empty(head)) {
ret = call->class->define_fields(call);
if (ret < 0) {
- pr_warning("Could not initialize trace point"
- " events/%s\n", call->name);
+ pr_warning("Could not initialize trace point events/%s\n", call->name);
return ret;
}
}
- trace_create_file("filter", 0644, call->dir, call,
- filter);
-
- trace_create_file("format", 0444, call->dir, call,
- format);
+ trace_create_file("filter", 0644, call->dir, call, filter);
+ trace_create_file("format", 0444, call->dir, call, format);

return 0;
}
Index: linux/tools/perf/builtin-trace.c
===================================================================
--- linux.orig/tools/perf/builtin-trace.c
+++ linux/tools/perf/builtin-trace.c
@@ -719,6 +719,44 @@ static void vfs_getname(void *data,
}
}

+static void user_event(void *data,
+ struct event *event __used,
+ int cpu __used,
+ u64 timestamp __used,
+ struct thread *thread __used)
+{
+ struct thread_data *tdata = get_thread_data(thread);
+ unsigned int i;
+ char *message;
+ u64 t = 0;
+
+ if (!tdata->enabled)
+ return;
+ if (time_filtered(timestamp))
+ return;
+ if (cpu_filtered(cpu))
+ return;
+ if (tdata->entry_time) {
+ t = timestamp - tdata->entry_time;
+ if (filter_duration(t))
+ return;
+ } else if (duration_filter)
+ return;
+
+ print_entry_head(thread, timestamp, t, cpu);
+ message = strdup(raw_field_ptr(event, "message", data));
+
+ /* Sanitize the string a bit before printing it: */
+ for (i = 0; i < strlen(message); i++) {
+ if (message[i] == '\n')
+ message[i] = ' ';
+ }
+
+ color_fprintf(stdout, PERF_COLOR_GREEN, "<\"%s\">\n", message);
+
+ free(message);
+}
+
static int pagefault_preprocess_sample(const event_t *self,
struct addr_location *al,
struct sample_data *data,
@@ -1036,6 +1074,8 @@ process_raw_event(event_t *self, void *d
pagefault_exit(data, event, cpu, timestamp, thread);
if (!strcmp(event->name, "vfs_getname"))
vfs_getname(data, event, cpu, timestamp, thread);
+ if (!strcmp(event->name, "user"))
+ user_event(data, event, cpu, timestamp, thread);

if (!scheduler_events && !scheduler_all_events)
return;
@@ -1226,6 +1266,7 @@ static const char *record_args[] = {
"-e", "sched:sched_stat_sleep:r",
"-e", "sched:sched_stat_iowait:r",
"-e", "sched:sched_stat_runtime:r",
+ "-e", "user:user:r",
};

static int __cmd_record(int argc, const char **argv)

2010-11-17 12:17:11

by Pekka Enberg

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

Hi Ingo,

On 11/17/10 2:07 PM, Ingo Molnar wrote:
> * Pekka Enberg<[email protected]> wrote:
>
>> (Trimming CC for sanity.)
> [ Added back Linus (in case he wants to object to the new prctl()) and lkml. ]
>
>> On 11/16/10 11:04 PM, Thomas Gleixner wrote:
>>> 'trace' is our shot at improving the situation: it aims at providing a
>>> simple to use and straightforward tracing tool based on the perf
>>> infrastructure and on the well-known perf profiling workflow:
>>>
>>> trace record firefox # trace firefox and all child tasks,
>>> put it into trace.data
>>>
>>> trace summary # show quick overview of the trace,
>>> based on trace.data
>>>
>>> trace report # detailed traces, based on trace.data
>> Nice work guys!
>>
>> Does this concept lend itself to tracing latencies in userspace applications that
>> run in virtual machines (e.g. the Java kind)? I'm of course interested in this
>> because of Jato [1] where bunch of interesting things can cause jitter: JIT
>> compilation, GC, kernel, and the actual application doing something (in either
>> native code or JIT'd code). It's important to be able to measure where "slowness"
>> to desktop applications and certain class of server applications comes from to be
>> able to improve things.
> Makes quite a bit of sense.
>
> How about the attached patch? It works fine with the simple testcase included in the
> changelog. There's a common-sense limit on the message size - but otherwise it adds
> support for apps to generate a free-form string trace event.
>
> Thanks,
>
> Ingo
>
> ---------------------------------->
> Subject: trace: Add user-space event tracing/injection
> From: Ingo Molnar<[email protected]>
> Date: Wed Nov 17 10:11:53 CET 2010
>
> This feature (suggested by Darren Hart and Pekka Engberg) allows user-space
> programs to print trace events in a very simple and self-contained way:
>
> #include<sys/prctl.h>
> #include<string.h>
>
> #define PR_TASK_PERF_USER_TRACE 35
>
> int main(void)
> {
> char *msg = "Hello World!\n";
>
> prctl(PR_TASK_PERF_USER_TRACE, msg, strlen(msg));
>
> return 0;
> }
>
> These show up in 'trace' output as:
>
> $ trace report
> #
> # trace events of 'sleep 1':
> #
> testit/ 6006 ( 0.002 ms):<"Hello World!">
> testit/ 6006 ( 0.002 ms):<"Hello World!">

Wow! This looks really nice!

What does the duration in milliseconds mean there? For things like GC
and JIT, I want something like:

void gc(void)
{
prctl(PR_TASK_PERF_USER_TRACE_START, ...)

collect();

prctl(PR_TASK_PERF_USER_TRACE_END, ...)
}

So that it's clear from the tracing output that the VM was busy doing GC
for n milliseconds. Barring background JIT'ing and pauseless GC, I'd
also be interested in showing how much time the VM was actually
_blocking_ the running application (which can happen with signals too,
btw, for things like accessing data that's lazily initialized).

Pekka

2010-11-17 12:24:56

by Ingo Molnar

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

* Ingo Molnar <[email protected]> wrote:

> > Does this concept lend itself to tracing latencies in userspace applications
> > that run in virtual machines (e.g. the Java kind)? I'm of course interested in
> > this because of Jato [1] where bunch of interesting things can cause jitter: JIT
> > compilation, GC, kernel, and the actual application doing something (in either
> > native code or JIT'd code). It's important to be able to measure where
> > "slowness" to desktop applications and certain class of server applications
> > comes from to be able to improve things.
>
> Makes quite a bit of sense.
>
> How about the attached patch? It works fine with the simple testcase included in
> the changelog. There's a common-sense limit on the message size - but otherwise it
> adds support for apps to generate a free-form string trace event.

The entirely untested Jato patch below adds support for this to Jato's user-space
tracer. Btw., you have _hundreds_ of tracepoints in Jato, wow!

The prctl() approach is very attractive because it's very simple to integrate. It's
also reasonably fast, there's no fd baggage in prctl(). It is also arguably a
'process/task event' so fits prctl()'s original design (if it ever had one ...).

Note, i kept the original Jato buffering as well, and the prctl() does finegrained
events, one event per trace_printf() line printed.

I think it makes sense to generate a separate event for all trace_printf() calls,
because that way the events propagate immediately. OTOH i dont know how large the
trace messages are - if there's really big tables printed (or lines are constructed
out of many trace_printf() calls) then it may make sense to buffer them a bit.

Anyway, this demonstrates the concept.

Thanks,

Ingo

diff --git a/vm/trace.c b/vm/trace.c
index 0192de6..64030ff 100644
--- a/vm/trace.c
+++ b/vm/trace.c
@@ -24,10 +24,13 @@
* Please refer to the file LICENSE for details.
*/

+#include <sys/prctl.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

+#define PR_TASK_PERF_USER_TRACE 35
+
#include "jit/compiler.h"
#include "lib/string.h"
#include "vm/thread.h"
@@ -50,15 +53,24 @@ static void setup_trace_buffer(void)

int trace_printf(const char *fmt, ...)
{
+ unsigned long curr_pos;
va_list args;
int err;

setup_trace_buffer();

+ curr_pos = trace_buffer->length;
+
va_start(args, fmt);
err = str_vappend(trace_buffer, fmt, args);
va_end(args);

+ /*
+ * Send the trace buffer to perf, it will show up as user:user events:
+ * (on non-perf kernels this will produce a harmless -EINVAL)
+ */
+ prctl(PR_TASK_PERF_USER_TRACE, trace_buffer->value + curr_pos, trace_buffer->length - curr_pos);
+
return err;
}

2010-11-17 12:30:02

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On Wed, 2010-11-17 at 13:07 +0100, Ingo Molnar wrote:

> Subject: trace: Add user-space event tracing/injection
> From: Ingo Molnar <[email protected]>
> Date: Wed Nov 17 10:11:53 CET 2010
>
> This feature (suggested by Darren Hart and Pekka Engberg) allows user-space
> programs to print trace events in a very simple and self-contained way:
>
> #include <sys/prctl.h>
> #include <string.h>
>
> #define PR_TASK_PERF_USER_TRACE 35
>
> int main(void)
> {
> char *msg = "Hello World!\n";
>
> prctl(PR_TASK_PERF_USER_TRACE, msg, strlen(msg));
>
> return 0;
> }
>
> These show up in 'trace' output as:
>
> $ trace report
> #
> # trace events of 'sleep 1':
> #
> testit/ 6006 ( 0.002 ms): <"Hello World!">
> testit/ 6006 ( 0.002 ms): <"Hello World!">
>
> Suggested-by: Darren Hart <[email protected]>
> Suggested-by: Pekka Enberg <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

I really dislike abusing prctl(), I understand your reasons, but it
still sucks.

Also, the naming doesn't work, you've implemented a trace event, that's
got nothing to do with perf, so the PR_TASK_PERF_ prefix is incorrect.

> Index: linux/include/linux/prctl.h
> ===================================================================
> --- linux.orig/include/linux/prctl.h
> +++ linux/include/linux/prctl.h
> @@ -102,4 +102,6 @@
>
> #define PR_MCE_KILL_GET 34
>
> +#define PR_TASK_PERF_USER_TRACE 35
> +
> #endif /* _LINUX_PRCTL_H */
> Index: linux/include/trace/events/user.h
> ===================================================================
> --- /dev/null
> +++ linux/include/trace/events/user.h
> @@ -0,0 +1,32 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM user
> +
> +#if !defined(_TRACE_USER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_USER_H_
> +
> +#include <linux/tracepoint.h>
> +#include <linux/ftrace.h>
> +
> +TRACE_EVENT(user,
> +
> + TP_PROTO(const char *message),
> +
> + TP_ARGS(message),
> +
> + TP_STRUCT__entry(
> + __string( message, message);
> + ),
> +
> + TP_fast_assign(
> + __assign_str(message, message);
> + ),
> +
> + TP_printk("user %s", __get_str(message))
> +);
> +
> +#undef NO_DEV
> +
> +#endif /* _TRACE_USER_H_ */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> Index: linux/kernel/sys.c
> ===================================================================
> --- linux.orig/kernel/sys.c
> +++ linux/kernel/sys.c
> @@ -47,6 +47,11 @@
> #include <asm/io.h>
> #include <asm/unistd.h>
>
> +#define MAX_USER_TRACE_SIZE 128
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/user.h>
> +
> #ifndef SET_UNALIGN_CTL
> # define SET_UNALIGN_CTL(a,b) (-EINVAL)
> #endif
> @@ -1681,6 +1686,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
> case PR_TASK_PERF_EVENTS_ENABLE:
> error = perf_event_task_enable();
> break;
> + /*
> + * Inject a trace event into the current tracing context:
> + */
> + case PR_TASK_PERF_USER_TRACE:
> + {
> + void __user *uevent_ptr = (void *)arg2;
> + char kstring[MAX_USER_TRACE_SIZE+1];
> + unsigned long uevent_len = arg3;
> +
> + if (uevent_len > MAX_USER_TRACE_SIZE)
> + return -EINVAL;
> + if (copy_from_user(kstring, uevent_ptr, uevent_len))
> + return -EFAULT;
> + kstring[uevent_len] = 0;
> +
> + trace_user(kstring);
> + return 0;
> + }
> case PR_GET_TIMERSLACK:
> error = current->timer_slack_ns;
> break;

2010-11-17 12:30:17

by Pekka Enberg

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On Wed, Nov 17, 2010 at 2:24 PM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> > Does this concept lend itself to tracing latencies in userspace applications
>> > that run in virtual machines (e.g. the Java kind)? I'm of course interested in
>> > this because of Jato [1] where bunch of interesting things can cause jitter: JIT
>> > compilation, GC, kernel, and the actual application doing something (in either
>> > native code or JIT'd code). It's important to be able to measure where
>> > "slowness" to desktop applications and certain class of server applications
>> > comes from to be able to improve things.
>>
>> Makes quite a bit of sense.
>>
>> How about the attached patch? It works fine with the simple testcase included in
>> the changelog. There's a common-sense limit on the message size - but otherwise it
>> adds support for apps to generate a free-form string trace event.
>
> The entirely untested Jato patch below adds support for this to Jato's user-space
> tracer. Btw., you have _hundreds_ of tracepoints in Jato, wow!
>
> The prctl() approach is very attractive because it's very simple to integrate. It's
> also reasonably fast, there's no fd baggage in prctl(). It is also arguably a
> 'process/task event' so fits prctl()'s original design (if it ever had one ...).
>
> Note, i kept the original Jato buffering as well, and the prctl() does finegrained
> events, one event per trace_printf() line printed.
>
> I think it makes sense to generate a separate event for all trace_printf() calls,
> because that way the events propagate immediately. OTOH i dont know how large the
> trace messages are - if there's really big tables printed (or lines are constructed
> out of many trace_printf() calls) then it may make sense to buffer them a bit.

There's two kinds of tracing: simple (almost) one-liners (e.g.
trace_invoke() function) that are similar to trace points and data
dumps (e.g. trace_lir() function). I don't think we want too hook the
latter to PR_TASK_PERF_USER_TRACE but for the former, it definitely
makes tons of sense for the former!

Pekka

2010-11-17 12:31:44

by Ingo Molnar

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

* Pekka Enberg <[email protected]> wrote:

> > $ trace report
> > #
> > # trace events of 'sleep 1':
> > #
> > testit/ 6006 ( 0.002 ms):<"Hello World!">
> > testit/ 6006 ( 0.002 ms):<"Hello World!">
>
> Wow! This looks really nice!
>
> What does the duration in milliseconds mean there? For things like
> GC and JIT, I want something like:
>
> void gc(void)
> {
> prctl(PR_TASK_PERF_USER_TRACE_START, ...)
>
> collect();
>
> prctl(PR_TASK_PERF_USER_TRACE_END, ...)
> }
>
> So that it's clear from the tracing output that the VM was busy
> doing GC for n milliseconds. Barring background JIT'ing and
> pauseless GC, I'd also be interested in showing how much time the VM
> was actually _blocking_ the running application (which can happen
> with signals too, btw, for things like accessing data that's lazily
> initialized).

We can add two events: user_event_entry/user_event_exit - or we could use the string
to differentiate, and start it with:

"entry: ..."
"exit: ..."

And then the event timestamps (which are absolute and are available) could be used
to calculate the duration of this period.

'trace' could even be taught to treat such entry:/exit: strings in a special way, so
that you dont have to write Jato specific trace decoding bits?

Thanks,

Ingo

2010-11-17 12:33:17

by Pekka Enberg

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On 11/17/10 2:30 PM, Ingo Molnar wrote:
> What does the duration in milliseconds mean there? For things like
>> GC and JIT, I want something like:
>>
>> void gc(void)
>> {
>> prctl(PR_TASK_PERF_USER_TRACE_START, ...)
>>
>> collect();
>>
>> prctl(PR_TASK_PERF_USER_TRACE_END, ...)
>> }
>>
>> So that it's clear from the tracing output that the VM was busy
>> doing GC for n milliseconds. Barring background JIT'ing and
>> pauseless GC, I'd also be interested in showing how much time the VM
>> was actually _blocking_ the running application (which can happen
>> with signals too, btw, for things like accessing data that's lazily
>> initialized).
> We can add two events: user_event_entry/user_event_exit - or we could use the string
> to differentiate, and start it with:
>
> "entry: ..."
> "exit: ..."
>
> And then the event timestamps (which are absolute and are available) could be used
> to calculate the duration of this period.
>
> 'trace' could even be taught to treat such entry:/exit: strings in a special way, so
> that you dont have to write Jato specific trace decoding bits?

Yes, makes sense. I like the API so lets convince others that it's
important enough to be merged. :-)

Pekka

2010-11-17 12:37:31

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On Wed, 2010-11-17 at 13:29 +0100, Peter Zijlstra wrote:
> > + case PR_TASK_PERF_USER_TRACE:
> > + {
> > + void __user *uevent_ptr = (void *)arg2;
> > + char kstring[MAX_USER_TRACE_SIZE+1];
> > + unsigned long uevent_len = arg3;
> > +
> > + if (uevent_len > MAX_USER_TRACE_SIZE)
> > + return -EINVAL;
> > + if (copy_from_user(kstring, uevent_ptr, uevent_len))
> > + return -EFAULT;
> > + kstring[uevent_len] = 0;
> > +
> > + trace_user(kstring);
> > + return 0;
> > + }

That implementation can be improved upon, no need to actually do the
copy_from_user() when the tracepoint isn't enabled.

Also, do we really need a max size here? I know ftrace (and the perf
tracepoint interface) are currently limited to 1 page of trace data but
that can be fixed (although not trivially).

Also, -EINVAL isn't really a helpful error code in that case, -EMSGSIZE
seems like a good candidate.

And the uevent name is already taken for something else, re-using it
here might cause confusion.

2010-11-17 12:42:47

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On Wed, 2010-11-17 at 14:33 +0200, Pekka Enberg wrote:
> On 11/17/10 2:30 PM, Ingo Molnar wrote:
> > What does the duration in milliseconds mean there? For things like
> >> GC and JIT, I want something like:
> >>
> >> void gc(void)
> >> {
> >> prctl(PR_TASK_PERF_USER_TRACE_START, ...)
> >>
> >> collect();
> >>
> >> prctl(PR_TASK_PERF_USER_TRACE_END, ...)
> >> }
> >>
> >> So that it's clear from the tracing output that the VM was busy
> >> doing GC for n milliseconds. Barring background JIT'ing and
> >> pauseless GC, I'd also be interested in showing how much time the VM
> >> was actually _blocking_ the running application (which can happen
> >> with signals too, btw, for things like accessing data that's lazily
> >> initialized).
> > We can add two events: user_event_entry/user_event_exit - or we could use the string
> > to differentiate, and start it with:
> >
> > "entry: ..."
> > "exit: ..."
> >
> > And then the event timestamps (which are absolute and are available) could be used
> > to calculate the duration of this period.
> >
> > 'trace' could even be taught to treat such entry:/exit: strings in a special way, so
> > that you dont have to write Jato specific trace decoding bits?
>
> Yes, makes sense. I like the API so lets convince others that it's
> important enough to be merged. :-)

I don't much like it, Jato already does its own tracing for the anon_vma
symbols, it might as well write its own event log too (would need a
proper VDSO clock thingy though).

Both these, the anon_vma symbols and the events, should be implemented
in a libperf.so or the like so as not to rely on the crummy /tmp file
interface we've been using so far.

Apps can link against libperf_dummy.so to get weak stubs and LD_PRELOAD
libperf.so to get the real ones (perf record can do that LD_PRELOAD
automagically).

perf report will need to be taught to not only pick up the anon_vma
symbols but also look for the events to be merge sorted.

We already have this communication channel for Jato (due to the anon_vma
muck), might as well extend it.

2010-11-17 12:58:52

by Ingo Molnar

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-11-17 at 14:33 +0200, Pekka Enberg wrote:
> > On 11/17/10 2:30 PM, Ingo Molnar wrote:
> > > What does the duration in milliseconds mean there? For things like
> > >> GC and JIT, I want something like:
> > >>
> > >> void gc(void)
> > >> {
> > >> prctl(PR_TASK_PERF_USER_TRACE_START, ...)
> > >>
> > >> collect();
> > >>
> > >> prctl(PR_TASK_PERF_USER_TRACE_END, ...)
> > >> }
> > >>
> > >> So that it's clear from the tracing output that the VM was busy
> > >> doing GC for n milliseconds. Barring background JIT'ing and
> > >> pauseless GC, I'd also be interested in showing how much time the VM
> > >> was actually _blocking_ the running application (which can happen
> > >> with signals too, btw, for things like accessing data that's lazily
> > >> initialized).
> > > We can add two events: user_event_entry/user_event_exit - or we could use the string
> > > to differentiate, and start it with:
> > >
> > > "entry: ..."
> > > "exit: ..."
> > >
> > > And then the event timestamps (which are absolute and are available) could be used
> > > to calculate the duration of this period.
> > >
> > > 'trace' could even be taught to treat such entry:/exit: strings in a special way, so
> > > that you dont have to write Jato specific trace decoding bits?
> >
> > Yes, makes sense. I like the API so lets convince others that it's
> > important enough to be merged. :-)
>
> I don't much like it, Jato already does its own tracing for the anon_vma
> symbols, it might as well write its own event log too (would need a
> proper VDSO clock thingy though).

The problem is that it then does not properly mix with other events outside of the
control of the application.

For example if there are two apps both generating user events, but there's no
connection with them, a system-wide tracer wont get a properly ordered set of events
- both apps will trace into their own buffers. So if we have:

CPU1

app1: "user event X"
app2: "user event Y"

Then a 'trace --all' system-wide tracing session will not get proper ordering
between app1 and app2's events. It only gets timestamps - which may or may not be
correct.

User-space tracing schemes tend to be clumsy and limiting. There's other
disadvantages as well: approaches that expose a named pipe in /tmp or an shmem
region are not transparent and robust either: if user-space owns a pending buffer
then bugs in the apps can corrupt the trace buffer, can prevent its flushing when
the app goes down due to an app bug (and when the trace would be the most useful),
etc. etc.

Also, in general their deployment isnt particularly fast nor lightweight - while
prctl() is available everywhere.

And when it comes to tracing/instrumentation, if we make deployment too complex,
people will simply not use it - and we all use. A prctl() isnt particularly sexy
design, but it's a task/process event that we are generating (so related to prctls),
plus it's available everywhere and is very easy to deploy.

Thanks,

Ingo

2010-11-17 13:09:41

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On Wed, 2010-11-17 at 13:58 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:

> > I don't much like it, Jato already does its own tracing for the anon_vma
> > symbols, it might as well write its own event log too (would need a
> > proper VDSO clock thingy though).
>
> The problem is that it then does not properly mix with other events outside of the
> control of the application.
>
> For example if there are two apps both generating user events, but there's no
> connection with them, a system-wide tracer wont get a properly ordered set of events
> - both apps will trace into their own buffers. So if we have:
>
> CPU1
>
> app1: "user event X"
> app2: "user event Y"
>
> Then a 'trace --all' system-wide tracing session will not get proper ordering
> between app1 and app2's events. It only gets timestamps - which may or may not be
> correct.

I claim we can do a VDSO to the quality of the kernel/sched_clock.c
code, which basically means we can do it as good as the kernel can.

> User-space tracing schemes tend to be clumsy and limiting. There's other
> disadvantages as well: approaches that expose a named pipe in /tmp or an shmem
> region are not transparent and robust either: if user-space owns a pending buffer
> then bugs in the apps can corrupt the trace buffer, can prevent its flushing when
> the app goes down due to an app bug (and when the trace would be the most useful),
> etc. etc.

Sure, but you're not considering the fact that Jato already needs an
interface to communicate its generated symbols, also writing its own
events really isn't a big deal after that.

> Also, in general their deployment isnt particularly fast nor lightweight - while
> prctl() is available everywhere.

I know your reasoning, but deployment isn't everything. Technical sanity
does, I hope, still count for something as well.

> And when it comes to tracing/instrumentation, if we make deployment too complex,
> people will simply not use it - and we all use. A prctl() isnt particularly sexy
> design, but it's a task/process event that we are generating (so related to prctls),
> plus it's available everywhere and is very easy to deploy.

Different tools for different people, complex applications like JITs can
use a more complex interface to communicate all their various data.

A simple printk() style interface through a syscall (preferably not
prctl) if fine too, it just doesn't suffice for everything, nor should
we want it to.

2010-11-17 13:10:45

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On Wed, Nov 17, 2010 at 01:29:54PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 13:07 +0100, Ingo Molnar wrote:
>
> > Subject: trace: Add user-space event tracing/injection
> > From: Ingo Molnar <[email protected]>
> > Date: Wed Nov 17 10:11:53 CET 2010
> >
> > This feature (suggested by Darren Hart and Pekka Engberg) allows user-space
> > programs to print trace events in a very simple and self-contained way:
> >
> > #include <sys/prctl.h>
> > #include <string.h>
> >
> > #define PR_TASK_PERF_USER_TRACE 35
> >
> > int main(void)
> > {
> > char *msg = "Hello World!\n";
> >
> > prctl(PR_TASK_PERF_USER_TRACE, msg, strlen(msg));
> >
> > return 0;
> > }
> >
> > These show up in 'trace' output as:
> >
> > $ trace report
> > #
> > # trace events of 'sleep 1':
> > #
> > testit/ 6006 ( 0.002 ms): <"Hello World!">
> > testit/ 6006 ( 0.002 ms): <"Hello World!">
> >
> > Suggested-by: Darren Hart <[email protected]>
> > Suggested-by: Pekka Enberg <[email protected]>
> > Signed-off-by: Ingo Molnar <[email protected]>
>
> I really dislike abusing prctl(), I understand your reasons, but it
> still sucks.
>
> Also, the naming doesn't work, you've implemented a trace event, that's
> got nothing to do with perf, so the PR_TASK_PERF_ prefix is incorrect.

Why not using write() on the perf event fd?
So that you don't only limit to a custom new simple event (which should
match most needs) but you can also inject whatever existing event you like.

2010-11-17 13:31:39

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On Wed, 2010-11-17 at 14:10 +0100, Frederic Weisbecker wrote:
>
>
>
> Why not using write() on the perf event fd?
> So that you don't only limit to a custom new simple event (which should
> match most needs) but you can also inject whatever existing event you like.

Because you don't have the perf event fd, the monitoring application has
it.

2010-11-17 13:33:31

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On Wed, Nov 17, 2010 at 02:31:27PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 14:10 +0100, Frederic Weisbecker wrote:
> >
> >
> >
> > Why not using write() on the perf event fd?
> > So that you don't only limit to a custom new simple event (which should
> > match most needs) but you can also inject whatever existing event you like.
>
> Because you don't have the perf event fd, the monitoring application has
> it.

Ah, right.

2010-11-17 13:36:27

by Ingo Molnar

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

* Peter Zijlstra <[email protected]> wrote:

> > User-space tracing schemes tend to be clumsy and limiting. There's other
> > disadvantages as well: approaches that expose a named pipe in /tmp or an shmem
> > region are not transparent and robust either: if user-space owns a pending
> > buffer then bugs in the apps can corrupt the trace buffer, can prevent its
> > flushing when the app goes down due to an app bug (and when the trace would be
> > the most useful), etc. etc.
>
> Sure, but you're not considering the fact that Jato already needs an interface to
> communicate its generated symbols, also writing its own events really isn't a big
> deal after that.

But Jato is special there (it's a special execution machine with its own symbol
space) - and most apps that generate trace events are not such.

Also, while it's not a big deal to not get symbols, it's a big deal to not get trace
events _exactly when they are needed most_: when the app crashes or corrupts itself.

I.e. the kernel does us a real and useful service of extracting and then protecting
data.

> > Also, in general their deployment isnt particularly fast nor lightweight - while
> > prctl() is available everywhere.
>
> I know your reasoning, but deployment isn't everything. Technical sanity does, I
> hope, still count for something as well.

I agree that a prctl() isnt particularly nice - a new syscall would be nicer, if it
wasnt such a PITA to get new syscalls supported by widely available libraries like
glibc.

But i disagree that there should be pending buffers in the tracee context. Having
app-side data buffering introduces the sorts of problems i outlined, that the data
can be lost or corrupted when we need _reliable_ (and non-corrupted) trace data the
most.

We could use the vDSO approach for super-fast and super-voluminous tracing needs,
although i really doubt that it's the common case.

Availability is the biggest issue by far - and availability is inverse proportional
to deployment complexity.

> > And when it comes to tracing/instrumentation, if we make deployment too complex,
> > people will simply not use it - and we all use. A prctl() isnt particularly sexy
> > design, but it's a task/process event that we are generating (so related to
> > prctls), plus it's available everywhere and is very easy to deploy.
>
> Different tools for different people, complex applications like JITs can use a
> more complex interface to communicate all their various data.

Yes but i dont want complex interfaces at all - i want rich trace data from many
apps, so that tracing tools start to make sense.

> A simple printk() style interface through a syscall (preferably not prctl) is fine
> too, it just doesn't suffice for everything, nor should we want it to.

Well, it covers about 80-90% of the needs, so it was the first thing i considered.

Thanks,

Ingo

2010-11-18 01:18:11

by Darren Hart

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On 11/17/2010 04:07 AM, Ingo Molnar wrote:
>
> * Pekka Enberg<[email protected]> wrote:
>
>> (Trimming CC for sanity.)
>
> [ Added back Linus (in case he wants to object to the new prctl()) and lkml. ]
>
>> On 11/16/10 11:04 PM, Thomas Gleixner wrote:
>>> 'trace' is our shot at improving the situation: it aims at providing a
>>> simple to use and straightforward tracing tool based on the perf
>>> infrastructure and on the well-known perf profiling workflow:
>>>
>>> trace record firefox # trace firefox and all child tasks,
>>> put it into trace.data
>>>
>>> trace summary # show quick overview of the trace,
>>> based on trace.data
>>>
>>> trace report # detailed traces, based on trace.data
>>
>> Nice work guys!
>>
>> Does this concept lend itself to tracing latencies in userspace applications that
>> run in virtual machines (e.g. the Java kind)? I'm of course interested in this
>> because of Jato [1] where bunch of interesting things can cause jitter: JIT
>> compilation, GC, kernel, and the actual application doing something (in either
>> native code or JIT'd code). It's important to be able to measure where "slowness"
>> to desktop applications and certain class of server applications comes from to be
>> able to improve things.
>
> Makes quite a bit of sense.
>
> How about the attached patch? It works fine with the simple testcase included in the
> changelog. There's a common-sense limit on the message size - but otherwise it adds
> support for apps to generate a free-form string trace event.

Ideally I would like to see something just like trace_printf() without
having to define it myself in each of my testcases. I only see a few
ways of doing that, none of which are particularly nice:

# Perf trace wrappers in linux header files
#include <linux/perf-trace.h>
- requires working with glibc for headers installation
- could just be something like trace_printf() (that Ingo posted in
another mail in this thread) in the header file itself, no need for
a library.

# Perf vDSO and /sys exported headers
-lperf -I/sys/kernel/include/perf/
#include <perf-trace.h>
- introduces including from and linking to the kernel to applications
(not just glibc)
- introduces kernel /sys exported headers (ewe)

# Limited variable count
Use the max syscall argument count (7?) to allow for an opcode, format
string, and at most 5 arguments. This isn't very pretty in the user code
and I'd probably rather wrap the prctl version below than use this idea.

# char[] pointer argument
This requires the conversion of all arguments to strings and rather
defeats the purpose of having the kernel do the formatting.

Are there any other tricks I'm not thinking of which we could use to
implement:

int trace_printf(const char *format, ...);

The only other approach I can think of is a per process file descriptor
which could be used with fprintf. Something like
/proc/$PID/trace_marker. That wouldn't require any new headers or vDSO,
but it would add to the vfs, which Ingo objected to. However, placing it
in proc makes it feasible to access even in chroot environments. Leaving
the existing prctl implementation would provide a bare bones fallback
when needed.

Thoughts?

Thanks,

Darren

>
> Thanks,
>
> Ingo
>
> ---------------------------------->
> Subject: trace: Add user-space event tracing/injection
> From: Ingo Molnar<[email protected]>
> Date: Wed Nov 17 10:11:53 CET 2010
>
> This feature (suggested by Darren Hart and Pekka Engberg) allows user-space
> programs to print trace events in a very simple and self-contained way:
>
> #include<sys/prctl.h>
> #include<string.h>
>
> #define PR_TASK_PERF_USER_TRACE 35
>
> int main(void)
> {
> char *msg = "Hello World!\n";
>
> prctl(PR_TASK_PERF_USER_TRACE, msg, strlen(msg));
>
> return 0;
> }
>
> These show up in 'trace' output as:
>
> $ trace report
> #
> # trace events of 'sleep 1':
> #
> testit/ 6006 ( 0.002 ms):<"Hello World!">
> testit/ 6006 ( 0.002 ms):<"Hello World!">
>
> Suggested-by: Darren Hart<[email protected]>
> Suggested-by: Pekka Enberg<[email protected]>
> Signed-off-by: Ingo Molnar<[email protected]>
> ---
> include/linux/prctl.h | 2 ++
> include/trace/events/user.h | 32 ++++++++++++++++++++++++++++++++
> kernel/sys.c | 23 +++++++++++++++++++++++
> kernel/trace/trace_events.c | 34 +++++++++++-----------------------
> tools/perf/builtin-trace.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 109 insertions(+), 23 deletions(-)
>
> Index: linux/include/linux/prctl.h
> ===================================================================
> --- linux.orig/include/linux/prctl.h
> +++ linux/include/linux/prctl.h
> @@ -102,4 +102,6 @@
>
> #define PR_MCE_KILL_GET 34
>
> +#define PR_TASK_PERF_USER_TRACE 35
> +
> #endif /* _LINUX_PRCTL_H */
> Index: linux/include/trace/events/user.h
> ===================================================================
> --- /dev/null
> +++ linux/include/trace/events/user.h
> @@ -0,0 +1,32 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM user
> +
> +#if !defined(_TRACE_USER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_USER_H_
> +
> +#include<linux/tracepoint.h>
> +#include<linux/ftrace.h>
> +
> +TRACE_EVENT(user,
> +
> + TP_PROTO(const char *message),
> +
> + TP_ARGS(message),
> +
> + TP_STRUCT__entry(
> + __string( message, message);
> + ),
> +
> + TP_fast_assign(
> + __assign_str(message, message);
> + ),
> +
> + TP_printk("user %s", __get_str(message))
> +);
> +
> +#undef NO_DEV
> +
> +#endif /* _TRACE_USER_H_ */
> +
> +/* This part must be outside protection */
> +#include<trace/define_trace.h>
> Index: linux/kernel/sys.c
> ===================================================================
> --- linux.orig/kernel/sys.c
> +++ linux/kernel/sys.c
> @@ -47,6 +47,11 @@
> #include<asm/io.h>
> #include<asm/unistd.h>
>
> +#define MAX_USER_TRACE_SIZE 128
> +
> +#define CREATE_TRACE_POINTS
> +#include<trace/events/user.h>
> +
> #ifndef SET_UNALIGN_CTL
> # define SET_UNALIGN_CTL(a,b) (-EINVAL)
> #endif
> @@ -1681,6 +1686,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
> case PR_TASK_PERF_EVENTS_ENABLE:
> error = perf_event_task_enable();
> break;
> + /*
> + * Inject a trace event into the current tracing context:
> + */
> + case PR_TASK_PERF_USER_TRACE:
> + {
> + void __user *uevent_ptr = (void *)arg2;
> + char kstring[MAX_USER_TRACE_SIZE+1];
> + unsigned long uevent_len = arg3;
> +
> + if (uevent_len> MAX_USER_TRACE_SIZE)
> + return -EINVAL;
> + if (copy_from_user(kstring, uevent_ptr, uevent_len))
> + return -EFAULT;
> + kstring[uevent_len] = 0;
> +
> + trace_user(kstring);
> + return 0;
> + }
> case PR_GET_TIMERSLACK:
> error = current->timer_slack_ns;
> break;
> Index: linux/kernel/trace/trace_events.c
> ===================================================================
> --- linux.orig/kernel/trace/trace_events.c
> +++ linux/kernel/trace/trace_events.c
> @@ -989,8 +989,7 @@ static struct dentry *event_trace_events
>
> d_events = debugfs_create_dir("events", d_tracer);
> if (!d_events)
> - pr_warning("Could not create debugfs "
> - "'events' directory\n");
> + pr_warning("Could not create debugfs 'events' directory\n");
>
> return d_events;
> }
> @@ -1014,15 +1013,13 @@ event_subsystem_dir(const char *name, st
> /* need to create new entry */
> system = kmalloc(sizeof(*system), GFP_KERNEL);
> if (!system) {
> - pr_warning("No memory to create event subsystem %s\n",
> - name);
> + pr_warning("No memory to create event subsystem %s\n", name);
> return d_events;
> }
>
> system->entry = debugfs_create_dir(name, d_events);
> if (!system->entry) {
> - pr_warning("Could not create event subsystem %s\n",
> - name);
> + pr_warning("Could not create event subsystem %s\n", name);
> kfree(system);
> return d_events;
> }
> @@ -1041,8 +1038,7 @@ event_subsystem_dir(const char *name, st
>
> system->filter = kzalloc(sizeof(struct event_filter), GFP_KERNEL);
> if (!system->filter) {
> - pr_warning("Could not allocate filter for subsystem "
> - "'%s'\n", name);
> + pr_warning("Could not allocate filter for subsystem '%s'\n", name);
> return system->entry;
> }
>
> @@ -1051,8 +1047,7 @@ event_subsystem_dir(const char *name, st
> if (!entry) {
> kfree(system->filter);
> system->filter = NULL;
> - pr_warning("Could not create debugfs "
> - "'%s/filter' entry\n", name);
> + pr_warning("Could not create debugfs '%s/filter' entry\n", name);
> }
>
> trace_create_file("enable", 0644, system->entry,
> @@ -1081,19 +1076,16 @@ event_create_dir(struct ftrace_event_cal
>
> call->dir = debugfs_create_dir(call->name, d_events);
> if (!call->dir) {
> - pr_warning("Could not create debugfs "
> - "'%s' directory\n", call->name);
> + pr_warning("Could not create debugfs '%s' directory\n", call->name);
> return -1;
> }
>
> if (call->class->reg)
> - trace_create_file("enable", 0644, call->dir, call,
> - enable);
> + trace_create_file("enable", 0644, call->dir, call, enable);
>
> #ifdef CONFIG_PERF_EVENTS
> if (call->event.type&& call->class->reg)
> - trace_create_file("id", 0444, call->dir, call,
> - id);
> + trace_create_file("id", 0444, call->dir, call, id);
> #endif
>
> /*
> @@ -1104,16 +1096,12 @@ event_create_dir(struct ftrace_event_cal
> if (list_empty(head)) {
> ret = call->class->define_fields(call);
> if (ret< 0) {
> - pr_warning("Could not initialize trace point"
> - " events/%s\n", call->name);
> + pr_warning("Could not initialize trace point events/%s\n", call->name);
> return ret;
> }
> }
> - trace_create_file("filter", 0644, call->dir, call,
> - filter);
> -
> - trace_create_file("format", 0444, call->dir, call,
> - format);
> + trace_create_file("filter", 0644, call->dir, call, filter);
> + trace_create_file("format", 0444, call->dir, call, format);
>
> return 0;
> }
> Index: linux/tools/perf/builtin-trace.c
> ===================================================================
> --- linux.orig/tools/perf/builtin-trace.c
> +++ linux/tools/perf/builtin-trace.c
> @@ -719,6 +719,44 @@ static void vfs_getname(void *data,
> }
> }
>
> +static void user_event(void *data,
> + struct event *event __used,
> + int cpu __used,
> + u64 timestamp __used,
> + struct thread *thread __used)
> +{
> + struct thread_data *tdata = get_thread_data(thread);
> + unsigned int i;
> + char *message;
> + u64 t = 0;
> +
> + if (!tdata->enabled)
> + return;
> + if (time_filtered(timestamp))
> + return;
> + if (cpu_filtered(cpu))
> + return;
> + if (tdata->entry_time) {
> + t = timestamp - tdata->entry_time;
> + if (filter_duration(t))
> + return;
> + } else if (duration_filter)
> + return;
> +
> + print_entry_head(thread, timestamp, t, cpu);
> + message = strdup(raw_field_ptr(event, "message", data));
> +
> + /* Sanitize the string a bit before printing it: */
> + for (i = 0; i< strlen(message); i++) {
> + if (message[i] == '\n')
> + message[i] = ' ';
> + }
> +
> + color_fprintf(stdout, PERF_COLOR_GREEN, "<\"%s\">\n", message);
> +
> + free(message);
> +}
> +
> static int pagefault_preprocess_sample(const event_t *self,
> struct addr_location *al,
> struct sample_data *data,
> @@ -1036,6 +1074,8 @@ process_raw_event(event_t *self, void *d
> pagefault_exit(data, event, cpu, timestamp, thread);
> if (!strcmp(event->name, "vfs_getname"))
> vfs_getname(data, event, cpu, timestamp, thread);
> + if (!strcmp(event->name, "user"))
> + user_event(data, event, cpu, timestamp, thread);
>
> if (!scheduler_events&& !scheduler_all_events)
> return;
> @@ -1226,6 +1266,7 @@ static const char *record_args[] = {
> "-e", "sched:sched_stat_sleep:r",
> "-e", "sched:sched_stat_iowait:r",
> "-e", "sched:sched_stat_runtime:r",
> + "-e", "user:user:r",
> };
>
> static int __cmd_record(int argc, const char **argv)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Darren Hart
Yocto Linux Kernel

2010-11-18 08:55:40

by Ingo Molnar

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

* Darren Hart <[email protected]> wrote:

> Ideally I would like to see something just like trace_printf()
> without having to define it myself in each of my testcases. [...]

We can make the prctl a single-argument thing, at the cost of not allowing \0 in the
content. (which is probably sane anyway)

That way deployment is super-simple:

prctl(35, "My Trace Message");
...

if (asprintf(&msg, "My Trace Message: %d\n", 1234) != -1) {
prctl(35, *msg);
free(*msg);
}

Thanks,

Ingo

2010-11-18 17:07:17

by Darren Hart

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

On 11/18/2010 12:55 AM, Ingo Molnar wrote:
>
> * Darren Hart<[email protected]> wrote:
>
>> Ideally I would like to see something just like trace_printf()
>> without having to define it myself in each of my testcases. [...]
>
> We can make the prctl a single-argument thing, at the cost of not allowing \0 in the
> content. (which is probably sane anyway)
>
> That way deployment is super-simple:
>
> prctl(35, "My Trace Message");
> ...
>
>
> if (asprintf(&msg, "My Trace Message: %d\n", 1234) != -1) {
> prctl(35, *msg);
> free(*msg);
> }

I think that would be ideal.

--
Darren Hart
Yocto Linux Kernel

2010-11-18 18:40:08

by hp

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

Ingo Molnar <mingo <at> elte.hu> writes:

>
>
> * Darren Hart <dvhart <at> linux.intel.com> wrote:
>
> > Ideally I would like to see something just like trace_printf()
> > without having to define it myself in each of my testcases. [...]
>
> We can make the prctl a single-argument thing, at the cost of not allowing \0
in the
> content. (which is probably sane anyway)
>
> That way deployment is super-simple:
>
> prctl(35, "My Trace Message");
> ...
>
> if (asprintf(&msg, "My Trace Message: %d\n", 1234) != -1) {
> prctl(35, *msg);
> free(*msg);
> }
>
> Thanks,
>
> Ingo
>
I like this approach - it is doing it nearly the same way I did it with an extra
k-mod (no patch needed) and a debugfs entry handled in that mod.
I only see one thing with the string only data - I am doing stuff where there
are long recording times with also a lot of user events,
in such an environment I need more semantics on the event contents.
In my k-mod solution there's an event ID and the opportunity to log binary data.
As prctl() has 4 additional args after the option, it would be possible to use
it in the following way:
prtctl( 35, int eventID, int data_type, int msglen, void *buf);
or without the data_type
prtctl( 35, int eventID, int msglen, void *buf);
decoding would be of more effort but it would be worth

The event definition would be like this (with data_type):

TRACE_EVENT(user,
TP_PROTO(int id, int dtype, int dlen, unsigned char *bytes),
TP_ARGS(id, dtype, dlen, bytes),
TP_STRUCT__entry(
__field(int, ev_id)
__field(int, ev_type)
__dynamic_array(unsigned char, ev_data, dlen)
),
TP_fast_assign(
__entry->ev_id = id;
__entry->ev_type = dtype;
memcpy(__get_dynamic_array(ev_data), bytes, dlen);
),

TP_printk("ID: %d type: %s data: %s", __entry->ev_id,
__print_symbolic(__entry->ev_type, {0,"V"}, {1,"I"}, {2,"S"}, {4,"B"}),
__entry->ev_type == 0 ? "n/a" : __get_str(ev_data))
);

What do you think about this?
/hp

2010-11-18 18:57:03

by Ingo Molnar

[permalink] [raw]

Subject: Re: [patch] trace: Add user-space event tracing/injection

* hp <[email protected]> wrote:

> Ingo Molnar <mingo <at> elte.hu> writes:
>
> >
> >
> > * Darren Hart <dvhart <at> linux.intel.com> wrote:
> >
> > > Ideally I would like to see something just like trace_printf()
> > > without having to define it myself in each of my testcases. [...]
> >
> > We can make the prctl a single-argument thing, at the cost of not allowing \0
> in the
> > content. (which is probably sane anyway)
> >
> > That way deployment is super-simple:
> >
> > prctl(35, "My Trace Message");
> > ...
> >
> > if (asprintf(&msg, "My Trace Message: %d\n", 1234) != -1) {
> > prctl(35, *msg);
> > free(*msg);
> > }
> >
> > Thanks,
> >
> > Ingo
> >
> I like this approach - it is doing it nearly the same way I did it with an extra
> k-mod (no patch needed) and a debugfs entry handled in that mod.
> I only see one thing with the string only data - I am doing stuff where there
> are long recording times with also a lot of user events,
> in such an environment I need more semantics on the event contents.
> In my k-mod solution there's an event ID and the opportunity to log binary data.
> As prctl() has 4 additional args after the option, it would be possible to use
> it in the following way:
> prtctl( 35, int eventID, int data_type, int msglen, void *buf);
> or without the data_type
> prtctl( 35, int eventID, int msglen, void *buf);
> decoding would be of more effort but it would be worth
>
> The event definition would be like this (with data_type):
>
> TRACE_EVENT(user,
> TP_PROTO(int id, int dtype, int dlen, unsigned char *bytes),
> TP_ARGS(id, dtype, dlen, bytes),
> TP_STRUCT__entry(
> __field(int, ev_id)
> __field(int, ev_type)
> __dynamic_array(unsigned char, ev_data, dlen)
> ),
> TP_fast_assign(
> __entry->ev_id = id;
> __entry->ev_type = dtype;
> memcpy(__get_dynamic_array(ev_data), bytes, dlen);
> ),
>
> TP_printk("ID: %d type: %s data: %s", __entry->ev_id,
> __print_symbolic(__entry->ev_type, {0,"V"}, {1,"I"}, {2,"S"}, {4,"B"}),
> __entry->ev_type == 0 ? "n/a" : __get_str(ev_data))
> );
>
>
> What do you think about this?

The transport was not limited to strings - it's a memory buffer of 'len' bytes.

In that sense 'ev_id' and 'ev_type' above is really just hardcoding something that
the app might not care about.

For example with the patch i sent one could send 1 byte messages - no other
overhead. (beyond the standard record header)

While if an app does want to use an (ev_id, ev_type), it can still do so. Or if an
app wants to do some other message type, that can be done too - it's free-form.

Thanks,

Ingo

2010-11-18 19:17:45

by Reichert, Hans-Peter

[permalink] [raw]

Subject: AW: [patch] trace: Add user-space event tracing/injection

XS Embedded GmbH
Roggenbachstr. 6
D-78050 Villingen - Schwenningen

Telefon +49 (7721) 4060 - 450
Fax +49 (7721) 4060 - 499

http://www.xse.de
[email protected]

:::::::::: based.on.visions ::::::::::

XS Embedded GmbH
Gesch?ftsf?hrung: Michael Maier, Rainer Oder

Handelsregister: Amtsgericht Freiburg i.Br., HRB 705237
Sitz der Gesellschaft: Roggenbachstrasse 6, 78050 Villingen-Schwenningen
USt-Id Nummer: DE268786819

Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.

-----Urspr?ngliche Nachricht-----
Von: Ingo Molnar [mailto:[email protected]]
Gesendet: Donnerstag, 18. November 2010 19:57
An: Reichert, Hans-Peter
Cc: [email protected]; Peter Zijlstra; Thomas Gleixner; Arnaldo Carvalho de Melo; Fr?d?ric Weisbecker
Betreff: Re: [patch] trace: Add user-space event tracing/injection

* hp <[email protected]> wrote:

> Ingo Molnar <mingo <at> elte.hu> writes:
>
> >
> >
> > * Darren Hart <dvhart <at> linux.intel.com> wrote:
> >
> > > Ideally I would like to see something just like trace_printf()
> > > without having to define it myself in each of my testcases. [...]
> >
> > We can make the prctl a single-argument thing, at the cost of not allowing \0
> in the
> > content. (which is probably sane anyway)
> >
> > That way deployment is super-simple:
> >
> > prctl(35, "My Trace Message");
> > ...
> >
> > if (asprintf(&msg, "My Trace Message: %d\n", 1234) != -1) {
> > prctl(35, *msg);
> > free(*msg);
> > }
> >
> > Thanks,
> >
> > Ingo
> >
> I like this approach - it is doing it nearly the same way I did it with an extra
> k-mod (no patch needed) and a debugfs entry handled in that mod.
> I only see one thing with the string only data - I am doing stuff where there
> are long recording times with also a lot of user events,
> in such an environment I need more semantics on the event contents.
> In my k-mod solution there's an event ID and the opportunity to log binary data.
> As prctl() has 4 additional args after the option, it would be possible to use
> it in the following way:
> prtctl( 35, int eventID, int data_type, int msglen, void *buf);
> or without the data_type
> prtctl( 35, int eventID, int msglen, void *buf);
> decoding would be of more effort but it would be worth
>
> The event definition would be like this (with data_type):
>
> TRACE_EVENT(user,
> TP_PROTO(int id, int dtype, int dlen, unsigned char *bytes),
> TP_ARGS(id, dtype, dlen, bytes),
> TP_STRUCT__entry(
> __field(int, ev_id)
> __field(int, ev_type)
> __dynamic_array(unsigned char, ev_data, dlen)
> ),
> TP_fast_assign(
> __entry->ev_id = id;
> __entry->ev_type = dtype;
> memcpy(__get_dynamic_array(ev_data), bytes, dlen);
> ),
>
> TP_printk("ID: %d type: %s data: %s", __entry->ev_id,
> __print_symbolic(__entry->ev_type, {0,"V"}, {1,"I"}, {2,"S"}, {4,"B"}),
> __entry->ev_type == 0 ? "n/a" : __get_str(ev_data))
> );
>
>
> What do you think about this?

The transport was not limited to strings - it's a memory buffer of 'len' bytes.

In that sense 'ev_id' and 'ev_type' above is really just hardcoding something that
the app might not care about.

For example with the patch i sent one could send 1 byte messages - no other
overhead. (beyond the standard record header)

While if an app does want to use an (ev_id, ev_type), it can still do so. Or if an
app wants to do some other message type, that can be done too - it's free-form.

Thanks,

Ingo

I agree.
The reason for this was to have a format description that could be used within "trace-cmd report"
without having the need to implement a separate decoder for (each) user land application.
I did a lot of system tracing for the last 10years on a different 'NX system.
>From an analysts sight it is more than just convenience that the tracing format is the same for every application,
just think of a system that is build up of binaries of several suppliers - like it will be in automotive projects.
Having a uniform format description would highly increase usage and acceptance.

Thanks
/hp