2009-12-09 22:41:14

by Tim Bird

Subject: [PATCH 2/4] ftrace - add function_duration tracer

Add function duration tracer.

Signed-off-by: Tim Bird <[email protected]>
---
kernel/trace/Kconfig | 8
kernel/trace/Makefile | 1
kernel/trace/trace.c | 32 ++
kernel/trace/trace_duration.c | 527 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 568 insertions(+)

--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -153,6 +153,14 @@ config FUNCTION_GRAPH_TRACER
the return value. This is done by setting the current return
address on the current task structure into a stack of calls.

+config FUNCTION_DURATION_TRACER
+ bool "Kernel Function Duration Tracer"
+ depends on FUNCTION_GRAPH_TRACER
+ default y
+ help
+ Enable the kernel to record function durations, with
+ filtering. Can also be used in early startup.
+ See Documentation/trace/func-duration.txt

config IRQSOFF_TRACER
bool "Interrupts-off Latency Tracer"
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_STACK_TRACER) += trace_stac
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_BOOT_TRACER) += trace_boot.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
+obj-$(CONFIG_FUNCTION_DURATION_TRACER) += trace_duration.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
obj-$(CONFIG_HW_BRANCH_TRACER) += trace_hw_branches.o
obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -979,6 +979,38 @@ void tracing_record_cmdline(struct task_
trace_save_cmdline(tsk);
}

+#ifdef CONFIG_FUNCTION_DURATION_TRACER
+void __duration_return(struct trace_array *tr,
+ struct ftrace_graph_ret *trace,
+ unsigned long flags,
+ int pc)
+{
+ struct ftrace_event_call *call = &event_funcgraph_exit;
+ struct ring_buffer_event *event;
+ struct ring_buffer *buffer = tr->buffer;
+ struct ftrace_graph_ret_entry *entry;
+
+ if (unlikely(local_read(&__get_cpu_var(ftrace_cpu_disabled))))
+ return;
+
+ /* check duration filter */
+ if (tracing_thresh &&
+ (trace->rettime - trace->calltime) < tracing_thresh) {
+ return;
+ }
+
+ event = trace_buffer_lock_reserve(buffer, TRACE_GRAPH_RET,
+ sizeof(*entry), flags, pc);
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ entry->ret = *trace;
+
+ if (!filter_current_check_discard(buffer, call, entry, event))
+ ring_buffer_unlock_commit(buffer, event);
+}
+#endif
+
void
tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
int pc)
--- /dev/null
+++ b/kernel/trace/trace_duration.c
@@ -0,0 +1,527 @@
+/*
+ * Function duration tracer.
+ * Copyright 2009 Sony Corporation
+ * Author: Tim Bird <[email protected]>
+ *
+ * Much was copied from function graph tracer, which is
+ * Copyright (c) 2008-2009 Frederic Weisbecker <[email protected]>
+ */
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/ftrace.h>
+#include <linux/fs.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+struct fgraph_data {
+ int depth;
+};
+
+#define DURATION_INDENT 2
+
+/* Flag options */
+#define DURATION_PRINT_OVERRUN 0x1
+#define DURATION_PRINT_CPU 0x2
+#define DURATION_PRINT_PROC 0x4
+#define DURATION_PRINT_ABS_TIME 0x8
+
+static struct tracer_opt trace_opts[] = {
+ /* Display overruns? (for self-debug purpose) */
+ { TRACER_OPT(duration-overrun, DURATION_PRINT_OVERRUN) },
+ /* Display CPU */
+ { TRACER_OPT(duration-cpu, DURATION_PRINT_CPU) },
+ /* Display proc name/pid */
+ { TRACER_OPT(duration-proc, DURATION_PRINT_PROC) },
+ /* Display absolute time of an entry */
+ { TRACER_OPT(duration-abstime, DURATION_PRINT_ABS_TIME) },
+ { } /* Empty entry */
+};
+
+static struct tracer_flags tracer_flags = {
+ /* Don't display overruns and proc by default */
+ .val = DURATION_PRINT_CPU,
+ .opts = trace_opts
+};
+
+extern void __duration_return(struct trace_array *tr,
+ struct ftrace_graph_ret *trace,
+ unsigned long flags,
+ int pc);
+
+static int duration_entry(struct ftrace_graph_ent *trace)
+{
+ /* check some filters, and abort if hit */
+ if (!ftrace_trace_task(current))
+ return 0;
+
+ if (!ftrace_graph_addr(trace->func))
+ return 0;
+
+ return 1;
+}
+
+void duration_return(struct ftrace_graph_ret *trace)
+{
+ struct trace_array *tr = &global_trace;
+ struct trace_array_cpu *data;
+ unsigned long flags;
+ long disabled;
+ int cpu;
+ int pc;
+
+ local_irq_save(flags);
+ cpu = raw_smp_processor_id();
+ data = tr->data[cpu];
+ disabled = atomic_inc_return(&data->disabled);
+ if (likely(disabled == 1)) {
+ pc = preempt_count();
+ __duration_return(tr, trace, flags, pc);
+ }
+ atomic_dec(&data->disabled);
+ local_irq_restore(flags);
+}
+
+static int duration_trace_init(struct trace_array *tr)
+{
+ int ret = register_ftrace_graph(&duration_return,
+ &duration_entry);
+ if (ret)
+ return ret;
+ tracing_start_cmdline_record();
+
+ return 0;
+}
+
+static void duration_trace_reset(struct trace_array *tr)
+{
+ tracing_stop_cmdline_record();
+ unregister_ftrace_graph();
+}
+
+static inline int log10_cpu(int nb)
+{
+ if (nb / 100)
+ return 3;
+ if (nb / 10)
+ return 2;
+ return 1;
+}
+
+static enum print_line_t
+print_duration_cpu(struct trace_seq *s, int cpu)
+{
+ int i;
+ int ret;
+ int log10_this = log10_cpu(cpu);
+ int log10_all = log10_cpu(cpumask_weight(cpu_online_mask));
+
+ /*
+ * Start with a space character - to make it stand out
+ * to the right a bit when trace output is pasted into
+ * email:
+ */
+ ret = trace_seq_printf(s, " ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /*
+ * Tricky - we space the CPU field according to the max
+ * number of online CPUs. On a 2-cpu system it would take
+ * a maximum of 1 digit - on a 128 cpu system it would
+ * take up to 3 digits:
+ */
+ for (i = 0; i < log10_all - log10_this; i++) {
+ ret = trace_seq_printf(s, " ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+ ret = trace_seq_printf(s, "%d) ", cpu);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ return TRACE_TYPE_HANDLED;
+}
+
+#define DURATION_PROCINFO_LENGTH 14
+
+static enum print_line_t
+print_duration_proc(struct trace_seq *s, pid_t pid)
+{
+ char comm[TASK_COMM_LEN];
+ /* sign + log10(MAX_INT) + '\0' */
+ char pid_str[11];
+ int spaces = 0;
+ int ret;
+ int len;
+ int i;
+
+ trace_find_cmdline(pid, comm);
+ comm[7] = '\0';
+ sprintf(pid_str, "%d", pid);
+
+ /* 1 stands for the "-" character */
+ len = strlen(comm) + strlen(pid_str) + 1;
+
+ if (len < DURATION_PROCINFO_LENGTH)
+ spaces = DURATION_PROCINFO_LENGTH - len;
+
+ /* First spaces to align center */
+ for (i = 0; i < spaces / 2; i++) {
+ ret = trace_seq_printf(s, " ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ ret = trace_seq_printf(s, "%s-%s", comm, pid_str);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Last spaces to align center */
+ for (i = 0; i < spaces - (spaces / 2); i++) {
+ ret = trace_seq_printf(s, " ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+ return TRACE_TYPE_HANDLED;
+}
+
+static int print_duration_abs_time(u64 t, struct trace_seq *s)
+{
+ unsigned long usecs_rem;
+
+ usecs_rem = do_div(t, NSEC_PER_SEC);
+ usecs_rem /= 1000;
+
+ return trace_seq_printf(s, "%5lu.%06lu | ",
+ (unsigned long)t, usecs_rem);
+}
+
+static int print_duration_calltime(u64 t, struct trace_seq *s)
+{
+ unsigned long usecs_rem;
+
+ usecs_rem = do_div(t, NSEC_PER_SEC);
+
+ return trace_seq_printf(s, "%5lu.%09lu | ",
+ (unsigned long)t, usecs_rem);
+}
+
+enum print_line_t
+trace_print_duration_duration(unsigned long long duration, struct trace_seq *s)
+{
+ unsigned long nsecs_rem = do_div(duration, 1000);
+ /* log10(ULONG_MAX) + '\0' */
+ char msecs_str[21];
+ char nsecs_str[5];
+ int ret, len;
+ int i;
+
+ sprintf(msecs_str, "%lu", (unsigned long) duration);
+
+ /* Print msecs */
+ ret = trace_seq_printf(s, "%s", msecs_str);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ len = strlen(msecs_str);
+
+ /* Print nsecs (we don't want to exceed 7 numbers) */
+ if (len < 7) {
+ snprintf(nsecs_str, 8 - len, "%03lu", nsecs_rem);
+ ret = trace_seq_printf(s, ".%s", nsecs_str);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ len += strlen(nsecs_str);
+ }
+
+ ret = trace_seq_printf(s, " us ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Print remaining spaces to fit the row's width */
+ for (i = len; i < 7; i++) {
+ ret = trace_seq_printf(s, " ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+ return TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+print_duration_duration(unsigned long long duration, struct trace_seq *s)
+{
+ int ret;
+
+ ret = trace_print_duration_duration(duration, s);
+ if (ret != TRACE_TYPE_HANDLED)
+ return ret;
+
+ ret = trace_seq_printf(s, "| ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ return TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+print_duration_prologue(struct trace_iterator *iter, struct trace_seq *s)
+{
+ struct trace_entry *ent = iter->ent;
+ int cpu = iter->cpu;
+ int ret;
+
+ /* Absolute time */
+ if (tracer_flags.val & DURATION_PRINT_ABS_TIME) {
+ ret = print_duration_abs_time(iter->ts, s);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ /* Cpu */
+ if (tracer_flags.val & DURATION_PRINT_CPU) {
+ ret = print_duration_cpu(s, cpu);
+ if (ret == TRACE_TYPE_PARTIAL_LINE)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ /* Proc */
+ if (tracer_flags.val & DURATION_PRINT_PROC) {
+ ret = print_duration_proc(s, ent->pid);
+ if (ret == TRACE_TYPE_PARTIAL_LINE)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ ret = trace_seq_printf(s, " | ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ return 0;
+}
+
+static enum print_line_t
+print_duration_return(struct ftrace_graph_ret *trace, struct trace_seq *s,
+ struct trace_entry *ent, struct trace_iterator *iter)
+{
+ unsigned long long duration = trace->rettime - trace->calltime;
+ struct fgraph_data *data = iter->private;
+ int ret;
+ int i;
+
+ if (data) {
+ int cpu = iter->cpu;
+ int *depth = &(per_cpu_ptr(data, cpu)->depth);
+
+ /*
+ * Comments display at + 1 to depth. This is the
+ * return from a function, we now want the comments
+ * to display at the same level of the bracket.
+ */
+ *depth = trace->depth - 1;
+ }
+
+ if (print_duration_prologue(iter, s))
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Calltime */
+ ret = print_duration_calltime(trace->calltime, s);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Duration */
+ ret = print_duration_duration(duration, s);
+ if (ret == TRACE_TYPE_PARTIAL_LINE)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Function */
+ for (i = 0; i < trace->depth * DURATION_INDENT; i++) {
+ ret = trace_seq_printf(s, " ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ /* for debug, change 0 to TRACE_ITER_SYM_OFFSET in line below */
+ ret = seq_print_ip_sym(s, trace->func, 0);
+
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ ret = trace_seq_printf(s, "\n");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Overrun */
+ if (tracer_flags.val & DURATION_PRINT_OVERRUN) {
+ ret = trace_seq_printf(s, " (Overruns: %lu)\n",
+ trace->overrun);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ return TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+print_duration_comment(struct trace_seq *s, struct trace_entry *ent,
+ struct trace_iterator *iter)
+{
+ unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK);
+ struct fgraph_data *data = iter->private;
+ struct trace_event *event;
+ int depth = 0;
+ int ret;
+ int i;
+
+ if (data)
+ depth = per_cpu_ptr(data, iter->cpu)->depth;
+
+ if (print_duration_prologue(iter, s))
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* No time */
+ ret = trace_seq_printf(s, " | ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Indentation */
+ if (depth > 0)
+ for (i = 0; i < (depth + 1) * DURATION_INDENT; i++) {
+ ret = trace_seq_printf(s, " ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ /* The comment */
+ ret = trace_seq_printf(s, "/* ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ switch (iter->ent->type) {
+ case TRACE_BPRINT:
+ ret = trace_print_bprintk_msg_only(iter);
+ if (ret != TRACE_TYPE_HANDLED)
+ return ret;
+ break;
+ case TRACE_PRINT:
+ ret = trace_print_printk_msg_only(iter);
+ if (ret != TRACE_TYPE_HANDLED)
+ return ret;
+ break;
+ default:
+ event = ftrace_find_event(ent->type);
+ if (!event)
+ return TRACE_TYPE_UNHANDLED;
+
+ ret = event->trace(iter, sym_flags);
+ if (ret != TRACE_TYPE_HANDLED)
+ return ret;
+ }
+
+ /* Strip ending newline */
+ if (s->buffer[s->len - 1] == '\n') {
+ s->buffer[s->len - 1] = '\0';
+ s->len--;
+ }
+
+ ret = trace_seq_printf(s, " */\n");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ return TRACE_TYPE_HANDLED;
+}
+
+
+enum print_line_t
+print_duration_function(struct trace_iterator *iter)
+{
+ struct trace_entry *entry = iter->ent;
+ struct trace_seq *s = &iter->seq;
+
+ switch (entry->type) {
+ case TRACE_GRAPH_RET: {
+ /*
+ * Note: No need for trace_assign_type() here.
+ * Case statement already checks the type.
+ */
+ struct ftrace_graph_ret_entry *ret_entry;
+ ret_entry = (struct ftrace_graph_ret_entry *)entry;
+ return print_duration_return(&ret_entry->ret, s, entry, iter);
+ }
+ default:
+ return print_duration_comment(s, entry, iter);
+ }
+
+ return TRACE_TYPE_HANDLED;
+}
+
+static void print_duration_headers(struct seq_file *s)
+{
+ /* 1st line - settings */
+ seq_printf(s, "# tracing_thresh=%lu\n", tracing_thresh/1000);
+
+ /* fields */
+ seq_printf(s, "# ");
+ if (tracer_flags.val & DURATION_PRINT_ABS_TIME)
+ seq_printf(s, " TIME ");
+ if (tracer_flags.val & DURATION_PRINT_CPU)
+ seq_printf(s, "CPU");
+ if (tracer_flags.val & DURATION_PRINT_PROC)
+ seq_printf(s, " TASK/PID ");
+ seq_printf(s, " CALLTIME ");
+ seq_printf(s, " DURATION ");
+ seq_printf(s, " FUNCTION CALLS\n");
+
+ /* field column lines */
+ seq_printf(s, "# ");
+ if (tracer_flags.val & DURATION_PRINT_ABS_TIME)
+ seq_printf(s, " | ");
+ if (tracer_flags.val & DURATION_PRINT_CPU)
+ seq_printf(s, "| ");
+ if (tracer_flags.val & DURATION_PRINT_PROC)
+ seq_printf(s, " | | ");
+ seq_printf(s, " | ");
+ seq_printf(s, " | | ");
+ seq_printf(s, " | | | |\n");
+}
+
+static void duration_trace_open(struct trace_iterator *iter)
+{
+ /* pid and depth on the last trace processed */
+ struct fgraph_data *data = alloc_percpu(struct fgraph_data);
+ int cpu;
+
+ if (!data)
+ pr_warning("duration tracer: not enough memory\n");
+ else
+ for_each_possible_cpu(cpu) {
+ int *depth = &(per_cpu_ptr(data, cpu)->depth);
+ *depth = 0;
+ }
+
+ iter->private = data;
+}
+
+static void duration_trace_close(struct trace_iterator *iter)
+{
+ free_percpu(iter->private);
+}
+
+struct tracer duration_trace __read_mostly = {
+ .name = "function_duration",
+ .open = duration_trace_open,
+ .close = duration_trace_close,
+ .wait_pipe = poll_wait_pipe,
+ .init = duration_trace_init,
+ .reset = duration_trace_reset,
+ .print_line = print_duration_function,
+ .print_header = print_duration_headers,
+ .flags = &tracer_flags,
+};
+
+static __init int init_duration_trace(void)
+{
+ return register_tracer(&duration_trace);
+}
+
+device_initcall(init_duration_trace);
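
For readers who want to try the tracer, here is a minimal userspace sketch
of driving it through the ftrace debugfs interface. The mount point and
file names follow this era's ftrace conventions; that tracing_thresh is
written in microseconds is an assumption inferred from the header output
above (which prints tracing_thresh/1000), not something this patch states:

#include <stdio.h>
#include <stdlib.h>

#define TRACING "/sys/kernel/debug/tracing/"

/* Write one value to a tracing control file, aborting on error. */
static void write_tracing_file(const char *name, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), TRACING "%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	/* Select the tracer registered by this patch. */
	write_tracing_file("current_tracer", "function_duration");
	/* Record only functions that ran longer than the threshold. */
	write_tracing_file("tracing_thresh", "500");
	/* Results are then read back from the "trace" file. */
	return 0;
}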


2009-12-10 07:08:13

by Ingo Molnar

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer


* Tim Bird <[email protected]> wrote:

> Add function duration tracer.
>
> Signed-off-by: Tim Bird <[email protected]>
> ---
> kernel/trace/Kconfig | 8
> kernel/trace/Makefile | 1
> kernel/trace/trace.c | 32 ++
> kernel/trace/trace_duration.c | 527 ++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 568 insertions(+)

Please do it in a cleaner and more generic fashion: add a "function
event" that perf can see and process, so all the output embellishment
can be done outside of the kernel, in tools/perf/.

We want to wind down the current maze of ftrace plugins, not extend
them. We already obsoleted the following ftrace plugins: scheduler,
sysprof, blktrace, kmem, etc. There's more work ongoing and
broad agreement between folks developing it that this is the way
forward.

The function tracer / function graph tracer is a holdout due to its
complexity - but that by no means weakens the argument and the necessity
to migrate it.

ftrace plugins were a nice idea originally and a clear improvement over
existing alternatives, but now that we've got a technically superior,
unified event framework that can do what the old plugins did and much
more, we want to improve that and not look back ...

Thanks,

ingo

2009-12-10 12:03:41

by Frederic Weisbecker

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, Dec 10, 2009 at 08:08:00AM +0100, Ingo Molnar wrote:
>
> * Tim Bird <[email protected]> wrote:
>
> > Add function duration tracer.
> >
> > Signed-off-by: Tim Bird <[email protected]>
> > ---
> > kernel/trace/Kconfig | 8
> > kernel/trace/Makefile | 1
> > kernel/trace/trace.c | 32 ++
> > kernel/trace/trace_duration.c | 527 ++++++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 568 insertions(+)
>
> Please do it in a cleaner and more generic fashion: add a "function
> event" that perf can see and process, so all the output embellishment
> can be done outside of the kernel, in tools/perf/.
>
> We want to wind down the current maze of ftrace plugins, not extend
> them. We already obsoleted the following ftrace plugins: scheduler,
> sysprof, blktrace, kmem, etc. There's more work ongoing and
> broad agreement between folks developing it that this is the way
> forward.
>
> The function tracer / function graph tracer is a holdout due to its
> complexity - but that by no means weakens the argument and the necessity
> to migrate it.
>
> ftrace plugins were a nice idea originally and a clear improvement over
> existing alternatives, but now that we've got a technically superior,
> unified event framework that can do what the old plugins did and much
> more, we want to improve that and not look back ...


I agree. If we can abstract it out in a struct trace_event rather than a
struct tracer, then please try. I doubt we can't.

The trace events are more unified.

This makes me feel I'm going to try converting the function graph tracer
into an event during the next cycle. That said, it does not mean I could
make it usable as a perf event right away in the same shot; as you can
guess, this is not a trivial plug. The current perf fast path is not yet
adapted for that.

But at least this will be a good step forward.
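
Concretely, "abstracting it out in a struct trace_event" would mean
registering output callbacks for the entry/exit record types instead of
carrying a whole struct tracer. A hedged sketch, using the struct
trace_event / register_ftrace_event API as it looked in this era;
print_duration_ret_event is an invented name standing in for an output
callback like the one in the patch above:

static struct trace_event graph_ret_event = {
	.type	= TRACE_GRAPH_RET,
	.trace	= print_duration_ret_event,	/* invented name */
};

/* register_ftrace_event(&graph_ret_event) hooks the callback into the
 * generic output path; no struct tracer is involved. */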

2009-12-10 14:11:40

by Ingo Molnar

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer


* Frederic Weisbecker <[email protected]> wrote:

> On Thu, Dec 10, 2009 at 08:08:00AM +0100, Ingo Molnar wrote:
> >
> > * Tim Bird <[email protected]> wrote:
> >
> > > Add function duration tracer.
> > >
> > > Signed-off-by: Tim Bird <[email protected]>
> > > ---
> > > kernel/trace/Kconfig | 8
> > > kernel/trace/Makefile | 1
> > > kernel/trace/trace.c | 32 ++
> > > kernel/trace/trace_duration.c | 527 ++++++++++++++++++++++++++++++++++++++++++
> > > 4 files changed, 568 insertions(+)
> >
> > Please do it in a cleaner and more generic fashion: add a "function
> > event" that perf can see and process, so all the output embellishment
> > can be done outside of the kernel, in tools/perf/.
> >
> > We want to wind down the current maze of ftrace plugins, not extend
> > them. We already obsoleted the following ftrace plugins: scheduler,
> > sysprof, blktrace, kmem, etc. There's more work ongoing and
> > broad agreement between folks developing it that this is the way
> > forward.
> >
> > The function tracer / function graph tracer is a holdout due to its
> > complexity - but that by no means weakens the argument and the necessity
> > to migrate it.
> >
> > ftrace plugins were a nice idea originally and a clear improvement over
> > existing alternatives, but now that we've got a technically superior,
> > unified event framework that can do what the old plugins did and much
> > more, we want to improve that and not look back ...
>
>
> I agree. If we can abstract it out in a struct trace_event rather than
> a struct tracer, then please try. I doubt we can't.
>
> The trace events are more unified.
>
> This makes me feel I'm going to try converting the function graph
> tracer into an event during the next cycle. [...]

Great!

> [...] That said, it does not mean I could make it usable as a perf
> event right away in the same shot; as you can guess, this is not a
> trivial plug. The current perf fast path is not yet adapted for that.

Yeah, definitely so. I'd guess it would be slower out of the box - it
hasn't gone through nearly as many refinements yet.

> But at least this will be a good step forward.

Yeah.

Also, I'd suggest we call unified events 'ftrace events', as that is
what they really are: the whole TRACE_EVENT() infrastructure is the
crown jewel of ftrace and IMO it worked out pretty well.

I hope there won't be any significant culture clash between ftrace and
perf - we want a single, unified piece of instrumentation
infrastructure, we want to keep the best of both worlds, and want to
eliminate any weaknesses and duplications. As long as we keep all that
in mind it will be all fine.

Ingo

2009-12-10 14:29:29

by Steven Rostedt

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, 2009-12-10 at 13:03 +0100, Frederic Weisbecker wrote:

> This makes me feel I'm going to try converting the function graph tracer
> into an event during the next cycle. That said, it does not mean I could
> make it usable as a perf event right away in the same shot; as you can
> guess, this is not a trivial plug. The current perf fast path is not yet
> adapted for that.

I'm curious how you plan on doing this. The current event system shows one
event per tracepoint. A straightforward approach would make every
entry and exit of a function a tracepoint, and that would lead to a very
large kernel to handle that.

Perhaps we could abstract out all entries and exits. We need to be able
to link to a single point (entry or exit), not all. This also has the
added issue of using the ftrace infrastructure to nop out the mcount
call. We also need a way to enable a set of functions.

We may be able to abstract this out, but I'm hesitant about making this
the only interface.
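
For reference, a single-callback abstraction of "all entries" already
existed in this era's ftrace, which is one way to avoid a tracepoint per
function. A hedged sketch using the two-argument ftrace_ops callback
signature of the time (the names here are invented):

#include <linux/ftrace.h>

/* One shared hook, called for every traced (and filtered) function. */
static void every_function_hook(unsigned long ip, unsigned long parent_ip)
{
	/* 'ip' is the traced function; a single "function event" could
	 * be emitted from here. */
}

static struct ftrace_ops every_function_ops = {
	.func = every_function_hook,
};

/* register_ftrace_function(&every_function_ops) arms the hook for all
 * mcount sites at once; unregister_ftrace_function() disarms it. */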

-- Steve

2009-12-10 14:53:19

by Steven Rostedt

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, 2009-12-10 at 15:11 +0100, Ingo Molnar wrote:

> > > ftrace plugins were a nice idea originally and a clear improvement over
> > > existing alternatives, but now that we've got a technicaly superior,
> > > unified event framework that can do what the old plugins did and much
> > > more, we want to improve that and not look back ...

Well, to me the ftrace plugins still serve a purpose. The event
structures are very powerful for showing events. The plugins' purpose is
to show functionality.

The latency tracers are a perfect example, because they do not
concentrate on just events: we must hit a maximum to save off the
trace. Just watching the events is not good enough. A separate buffer to
keep track of the biggest latency is still needed.

> >
> >
> > I agree. If we can abstract it out in a struct trace_event rather than
> > a struct tracer, then please try. I doubt we can't.
> >
> > The trace events are more unified.

Yes, because the trace events all pretty much do the same thing.

> >
> > This makes me feel I'm going to try converting the function graph
> > tracer into an event during the next cycle. [...]
>
> Great!
>
> > [...] That said, it does not mean I could make it usable as a perf
> > event right away in the same shot; as you can guess, this is not a
> > trivial plug. The current perf fast path is not yet adapted for that.
>
> Yeah, definitely so. I'd guess it would be slower out of the box - it
> hasn't gone through nearly as many refinements yet.
>
> > But at least this will be a good step forward.
>
> Yeah.
>
> Also, I'd suggest we call unified events 'ftrace events', as that is
> what they really are: the whole TRACE_EVENT() infrastructure is the
> crown jewel of ftrace and IMO it worked out pretty well.

For recording events, yes I totally agree. But for logic that needs to
pass data from one event to another, it is still a bit lacking.

>
> I hope there won't be any significant culture clash between ftrace and
> perf - we want a single, unified piece of instrumentation
> infrastructure, we want to keep the best of both worlds, and want to
> eliminate any weaknesses and duplications. As long as we keep all that
> in mind it will be all fine.

I'm just not from the mindset that one product fits all needs. I never
was and that was the reason that I joined the Linux community in the
first place. I liked the old Unix philosophy of "Do one thing, and do it
well, and let all others interact, and interact with all others". Ftrace
itself never was one product. It just seemed that anything to do with
tracing was called ftrace. It started as just the function tracer. Then
it had plugins, then it got events, but these are separate entities
altogether.

I designed the ftrace plugins as a way to plug in new features that I
could never dream of.

I wrote the ring buffer not for ftrace, but as a separate entity that
is also used by the hardware latency detector.

I designed the ftrace function tracer to not just work with ftrace but
to allow all others to hook to functions. This created the function
graph tracer, the stack tracer, and even LTTng hooks into it (not to
mention my own logdev).

I see that perf at the user level has ways to interact with it nicely,
although I don't know how well it interacts with other utilities. But the
perf kernel code seems to be a one way street. You can add features to
perf, but it is hard to use the perf infrastructure for something other
than perf (with the exception of the hardware perf events, that part has
a nice interface).

-- Steve

2009-12-10 15:39:19

by Ingo Molnar

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer


* Steven Rostedt <[email protected]> wrote:

> On Thu, 2009-12-10 at 15:11 +0100, Ingo Molnar wrote:
>
> > > > ftrace plugins were a nice idea originally and a clear
> > > > improvement over existing alternatives, but now that we've got a
> > > > technically superior, unified event framework that can do what
> > > > the old plugins did and much more, we want to improve that and
> > > > not look back ...
>
> Well, to me the ftrace plugins still serve a purpose. The event
> structures are very powerful for showing events. The plugins' purpose
> is to show functionality.
>
> The latency tracers are a perfect example, because they do not
> concentrate on just events: we must hit a maximum to save off the
> trace. Just watching the events is not good enough. A separate buffer
> to keep track of the biggest latency is still needed.

The correctly designed way to express latency tracing is via a new
generic event primitive: connecting two events to a maximum value.

That can be done without forcibly tying it and limiting it to a specific
'latency tracing' variant as the /debug/tracing/ bits of ftrace do it
right now.

Just off the top of my head we want to be able to trace:

- max irq service latencies for a given IRQ
- max block IO completion latencies for an app
- max TLB flush latencies in the system
- max sys_open() latencies in a task
- max fork()/exit() latencies in a workload
- max scheduling latencies on a given CPU
- max page fault latencies
- max wakeup latencies for a given task
- max memory allocation latencies

- ... and dozens and dozens of other things where there's a "start"
and a "stop" event and where we want to measure the time between
them.
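
For concreteness, a minimal sketch of the "connect two events to a
maximum value" primitive being proposed. Nothing like this existed in
the kernel at the time; every name below is invented for illustration:

#include <linux/ktime.h>

/* One instance per measured context (an IRQ, a task, a CPU, ...). */
struct latency_pair {
	u64	start_ns;	/* timestamp of the last "start" event */
	u64	max_ns;		/* largest start->stop delta seen so far */
};

static inline void latency_event_start(struct latency_pair *lp)
{
	lp->start_ns = ktime_to_ns(ktime_get());
}

static inline void latency_event_stop(struct latency_pair *lp)
{
	u64 delta = ktime_to_ns(ktime_get()) - lp->start_ns;

	if (delta > lp->max_ns)
		lp->max_ns = delta;	/* new max: the moment to snapshot */
}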

Your design of tying latency tracing to some hardcoded 'ftrace plugin'
abstraction is shortsighted and just does not scale to many of the items
above.

> > > I agree. If we can abstract it out in a struct trace_event rather
> > > than a struct tracer, then please try. I doubt we can't.
> > >
> > > The trace events are more unified.
>
> Yes, because the trace events all pretty much do the same thing.
>
> > >
> > > This makes me feel I'm going to try converting the function graph
> > > tracer into an event during the next cycle. [...]
> >
> > Great!
> >
> > > [...] That said, it does not mean I could make it usable as a perf
> > > event right away in the same shot; as you can guess, this is not a
> > > trivial plug. The current perf fast path is not yet adapted for that.
> >
> Yeah, definitely so. I'd guess it would be slower out of the box - it
> hasn't gone through nearly as many refinements yet.
> >
> > > But at least this will be a good step forward.
> >
> > Yeah.
> >
> Also, I'd suggest we call unified events 'ftrace events', as that is
> > what they really are: the whole TRACE_EVENT() infrastructure is the
> > crown jewel of ftrace and IMO it worked out pretty well.
>
> For recording events, yes I totally agree. But for logic that needs to
> pass data from one event to another, it is still a bit lacking.

Expressing latency tracing in the form of an 'ftrace plugin' is a pretty
inefficient way of doing it: it's very limiting and its utility is much
lower than what it could be.

> > I hope there won't be any significant culture clash between ftrace
> > and perf - we want a single, unified piece of instrumentation
> > infrastructure, we want to keep the best of both worlds, and want to
> > eliminate any weaknesses and duplications. As long as we keep all
> > that in mind it will be all fine.
>
> I'm just not from the mindset that one product fits all needs. I
> never was and that was the reason that I joined the Linux community in
> the first place. I liked the old Unix philosophy of "Do one thing, and
> do it well, and let all others interact, and interact with all
> others". Ftrace itself never was one product. It just seemed that
> anything to do with tracing was called ftrace. It started as just the
> function tracer. Then it had plugins, then it got events, but these
> are separate entities altogether.
>
> I designed the ftrace plugins as a way to plug in new features that I
> could never dream of.
>
> I wrote the ring buffer not for ftrace, but as a separate entity that
> is also used by the hardware latency detector.
>
> I designed the ftrace function tracer to not just work with ftrace but
> to allow all others to hook to functions. This created the function
> graph tracer, the stack tracer, and even LTTng hooks into it (not to
> mention my own logdev).
>
> I see that perf at the user level has ways to interact with it nicely,
> although I don't know how well it interacts with other utilities. But
> the perf kernel code seems to be a one way street. You can add
> features to perf, but it is hard to use the perf infrastructure for
> something other than perf (with the exception of the hardware perf
> events, that part has a nice interface).

I see ftrace plugins as a step of evolution. If you see it as some
ground to 'protect' then that's going to cause significant disagreement
between us. I prefer to reimplement functionality in a better way and
throw away the old version, and the whole premise of /debug is that we
can throw away old versions of code.

If you want to keep inferior concepts under the guise of 'choice' then
I'm very much against that. In the kernel we make up our minds about
what the best technical solution is for a given range of problems, and
then we go for it. Having a zillion mediocre xterms (and not a single
good one) is not a development model I find too convincing.

Ingo

2009-12-10 16:17:45

by Ingo Molnar

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer


* Steven Rostedt <[email protected]> wrote:

> On Thu, 2009-12-10 at 13:03 +0100, Frederic Weisbecker wrote:
>
> > This makes me feel I'm going to try converting the function graph
> > tracer into an event during the next cycle. That said, it does not
> > mean I could make it usable as a perf event right away in the same
> > shot; as you can guess, this is not a trivial plug. The current perf
> > fast path is not yet adapted for that.
>
> I'm curious how you plan on doing this. The current event system shows
> one event per tracepoint. A straightforward approach would make
> every entry and exit of a function a tracepoint, and that would lead
> to a very large kernel to handle that.
>
> Perhaps we could abstract out all entries and exits. We need to be
> able to link to a single point (entry or exit), not all. This also has
> the added issue of using the ftrace infrastructure to nop out the
> mcount call. We also need a way to enable a set of functions.

I think a good solution would be what I suggested (and what the patch I
sent does in essence), i.e. to add a special software event for
functions:

PERF_COUNT_SW_FUNCTION_CALLS

Next step would be to extend the ioctl interface to allow per function
tracepoints to be turned on/off (and set other per function attributes).
The attributes would be system-wide (at least initially) - we don't want
to live patch functions at context switch time.

We'd basically keep the function tracer, just re-wire it somewhat and
expose it via the unified interface and make it available to a much
larger set of usecases.

Ingo
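
A hedged sketch of what opening such a per-task "function calls" event
might look like from userspace. PERF_COUNT_SW_FUNCTION_CALLS is the
hypothetical constant being proposed (it was never part of the perf ABI,
so a placeholder value is used here); the rest is the real
perf_event_open interface of the time:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical constant, for illustration only. */
#define PERF_COUNT_SW_FUNCTION_CALLS 16

static int open_function_event(pid_t pid, int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_FUNCTION_CALLS;
	attr.sample_period = 1;		/* sample every function call */

	/* group_fd = -1, flags = 0; per-function on/off would then come
	 * via the extended ioctl interface described above. */
	return syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
}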

2009-12-10 16:22:14

by Steven Rostedt

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, 2009-12-10 at 16:38 +0100, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:

> The correctly designed way to express latency tracing is via a new
> generic event primitive: connecting two events to a maximum value.
>
> That can be done without forcibly tying it and limiting it to a specific
> 'latency tracing' variant as the /debug/tracing/ bits of ftrace do it
> right now.

Sure, when we have that, then we can remove the plugin (and then all
plugins when there are superior alternatives).

>
> Just off the top of my head we want to be able to trace:
>
> - max irq service latencies for a given IRQ
> - max block IO completion latencies for an app
> - max TLB flush latencies in the system
> - max sys_open() latencies in a task
> - max fork()/exit() latencies in a workload
> - max scheduling latencies on a given CPU
> - max page fault latencies
> - max wakeup latencies for a given task
> - max memory allocation latencies
>
> - ... and dozens and dozens of other things where there's a "start"
> and a "stop" event and where we want to measure the time between
> them.

Note, we also need a way to "store" the max. The flight recorder method
is not good enough.


>
> Your design of tying latency tracing to some hardcoded 'ftrace plugin'
> abstraction is shortsighted and just does not scale to many of the items
> above.

When we can do everything a better way, then I'm all for removing them.
I just don't want to remove them before a superior alternative exists.


> > For recording events, yes I totally agree. But for logic that needs to
> > pass data from one event to another, it is still a bit lacking.
>
> Expressing latency tracing in the form of an 'ftrace plugin' is a pretty
> inefficient way of doing it: it's very limiting and its utility is much
> lower than what it could be.

As I said, "it is still a bit lacking". When we solve these issues, then
we can look at removing plugins. I'm just saying that the plugins still
serve their purpose.

>
> > > I hope there won't be any significant culture clash between ftrace
> > > and perf - we want a single, unified piece of instrumentation
> > > infrastructure, we want to keep the best of both worlds, and want to
> > > eliminate any weaknesses and duplications. As long as we keep all
> > > that in mind it will be all fine.
> >
> > I'm just not from the mindset that one product fits all needs. I
> > never was and that was the reason that I joined the Linux community in
> > the first place. I liked the old Unix philosophy of "Do one thing, and
> > do it well, and let all others interact, and interact with all
> > others". Ftrace itself never was one product. It just seemed that
> > anything to do with tracing was called ftrace. It started as just the
> > function tracer. Then it had plugins, then it got events, but these
> > are separate entities altogether.
> >
> > I designed the ftrace plugins as a way to plug in new features that I
> > could never dream of.
> >
> > I wrote the ring buffer not for ftrace, but as a separate entity that
> > is also used by the hardware latency detector.
> >
> > I designed the ftrace function tracer to not just work with ftrace but
> > to allow all others to hook to functions. This created the function
> > graph tracer, the stack tracer, and even LTTng hooks into it (not to
> > mention my own logdev).
> >
> > I see that perf at the user level has ways to interact with it nicely,
> > although I don't know how well it interacts with other utilities. But
> > the perf kernel code seems to be a one way street. You can add
> > features to perf, but it is hard to use the perf infrastructure for
> > something other than perf (with the exception of the hardware perf
> > events, that part has a nice interface).
>
> I see ftrace plugins as a step of evolution. If you see it as some
> ground to 'protect' then that's going to cause significant disagreement
> between us. I prefer to reimplement functionality in a better way and
> throw away the old version, and the whole premise of /debug is that we
> can throw away old versions of code.

I have two points. One is that the current event infrastructure is still
lacking the features to replace the plugins, in which case the plugins
still serve a purpose. I actually have some patches that have the events
pass data around. But they are just hacks; I need to restructure them.

The second point (and my real concern) is that I want to keep the design
open that other parts of the kernel can always access these features. I
like the fact that LTTng can tap into ftrace. I do not want ftrace to be
the gate keeper of its utilities. If other tools can benefit from the
ftrace infrastructure, I think that is great!

>
> If you want to keep inferior concepts under the guise of 'choice' then
> I'm very much against that. In the kernel we make up our minds about
> what the best technical solution is for a given range of problems, and
> then we go for it. Having a zillion mediocre xterms (and not a single
> good one) is not a development model I find too convincing.

My point in this email is not to protect the concept of the plugin, but
to protect the concept of interaction. I don't see xterm as a good
example. I use gnome-terminal, and I have no issues with it. Let's look
at web browsers. Yes, these still suck too (but then all userspace
apps suck ;-) One thing that I've seen (and like) about Firefox is the
Gecko interface. They don't make everyone use Firefox or Firefox
plugins. But they have an infrastructure that lets other applications
tap into its engine. This is what ftrace did with its function tracer.
It created an API to allow other tools to use it (as did LTTng).

I'm not saying we need to split the infrastructure between perf and
ftrace; I'm saying that whatever infrastructure there is needs to stay
flexible enough that other utilities can hook into it. LTTng is well
known as ftrace's biggest competitor. The funny thing is, I've spent
several days talking with Mathieu on how we can share infrastructure.

I don't want LTTng to go away. I like the tool. There are things that I
picked from it, and there are things in LTTng that Mathieu took from
ftrace. I'm all for merging LTTng when it works well with ftrace.

perf is also a great tool. But I don't want to be forced to do all
kernel tracing through a single utility. That was my original complaint
about LTTng.

I have my own utility that I use because perf currently can't handle the
event tracing that well. I like having the debugfs interface, because I
feel more comfortable with accessing the trace information directly than
going through a tool. I see that we are getting rid of the sysctl
syscall in favor of the /proc/sys interface. I wonder why that is?

-- Steve

2009-12-10 16:52:51

by Ingo Molnar

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer


* Steven Rostedt <[email protected]> wrote:

> On Thu, 2009-12-10 at 16:38 +0100, Ingo Molnar wrote:
> > * Steven Rostedt <[email protected]> wrote:
>
> > The correctly designed way to express latency tracing is via a new
> > generic event primitive: connecting two events to a maximum value.
> >
> > That can be done without forcibly tying it and limiting it to a specific
> > 'latency tracing' variant as the /debug/tracing/ bits of ftrace do it
> > right now.
>
> Sure, when we have that, then we can remove the plugin (and then all
> plugins when there are superior alternatives).

Ok, that's good to hear.

> > Just off the top of my head we want to be able to trace:
> >
> > - max irq service latencies for a given IRQ
> > - max block IO completion latencies for an app
> > - max TLB flush latencies in the system
> > - max sys_open() latencies in a task
> > - max fork()/exit() latencies in a workload
> > - max scheduling latencies on a given CPU
> > - max page fault latencies
> > - max wakeup latencies for a given task
> > - max memory allocation latencies
> >
> > - ... and dozens and dozens of other things where there's a "start"
> > and a "stop" event and where we want to measure the time between
> > them.
>
> Note, we also need a way to "store" the max. The flight recorder
> method is not good enough.

Yeah. What we want in the larger scheme of things is to have operations
connected to events. One such operation would be "start measuring max",
another would be "stop measuring the max".

[ Whether the max is intrinsic to the context structure, or is perhaps
some _third_ event (so the max can be recovered by observing that
event) is a detail. ]

Note that other operations make sense as well, such as:

- if event X happens then enable event Y
- if event X happens then disable event Y

A popular use of that would be to enable the function events on the
'start' event of a latency measurement, and disable function events on
the 'stop' event.
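
As a mock-up of the idea (none of these names exist; only struct
ftrace_event_call is real, as seen in the patch above), such an
operation could be little more than a target event plus an action,
fired whenever the triggering event hits:

struct event_op {
	struct ftrace_event_call *target;	/* event Y */
	int enable;				/* 1: enable Y, 0: disable Y */
};

/* Imagined hook, run whenever the triggering event X fires. In this
 * era, ftrace_event_enable_disable() was the internal helper that
 * switched an event on or off. */
static void event_op_fire(struct event_op *op)
{
	ftrace_event_enable_disable(op->target, op->enable);
}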

Yet another use would be to enable cache miss events when 'entering' a
specific function event, and disable cache miss counting on 'exiting'
that function. (This would be an event of the function graph tracer.)

This would allow the precise cache miss profiling of a given function
and all its sub-functions - and only of that function.

Note that the existing filter engine functionality connects in a natural
way here as well: it can already be used to limit events and thus can be
used to further shape the behavior of tracing at runtime.

Other interesting event operations are possible as well. The key is to
expose this via the unified interface and to stop adding new
functionality-limited crap via the ftrace plugin infrastructure - it's
clearly not suitable for this purpose. We have now found how to do these
things properly and cleanly.

And the thing is, _you_ implemented unified ftrace events, all I'm
asking you is to realize the power of them and to stop adding new
ftrace plugins [which are crap in comparison] and concentrate on
exposing new tracing functionality via unified ftrace events, ok? ;-)

Ingo

2009-12-10 17:16:28

by Steven Rostedt

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, 2009-12-10 at 17:52 +0100, Ingo Molnar wrote:

> > Note, we also need a way to "store" the max. The flight recorder
> > method is not good enough.
>
> Yeah. What we want in the larger scheme of things is to have operations
> connected to events. One such operation would be "start measuring max",
> another would be "stop measuring the max".
>
> [ Whether the max is intrinsic to the context structure, or is perhaps
> some _third_ event (so the max can be recovered by observing that
> event) is a detail. ]
>
> Note that other operations make sense as well, such as:
>
> - if event X happens then enable event Y
> - if event X happens then disable event Y

This is exactly what I was implementing, but I got pulled off to do
other things before I completed it. The idea is to use the filtering
infrastructure to be able to start or stop other events.

>
> A popular use of that would be to enable the function events on the
> 'start' event of a latency measurement, and disable function events on
> the 'stop' event.

Function events will need to be started and stopped via a variable. It's
too much overhead and risk to do the full patching.

Code already exists that is like this. It is done in the function graph
tracer with set_graph_function, and the function tracer's set_ftrace_pid.
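
A sketch of that "via a variable" approach: the mcount-time hook stays
patched in, and a flag decides whether anything is emitted, so enabling
and disabling needs no live patching. The names here are invented for
illustration:

static int function_events_enabled;	/* flipped by the control interface */

static void function_event_hook(unsigned long ip, unsigned long parent_ip)
{
	if (!function_events_enabled)
		return;		/* cheap bail-out, no live patching */

	/* emit the function event for 'ip' here */
}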

>
> Yet another use would be to enable cache miss events when 'entering' a
> specific function event, and disable cache miss counting on 'exiting'
> that function. (this would be an event of the function graph tracer)

This could tap into the "set_ftrace_filter" infrastructure. I'm not
saying that it needs to use that file. But that file supports attaching
specific actions to a particular function. We could extend the interface
(if it isn't already there) to allow a perf ioctl or whatever to do the
same thing.


>
> This would allow the precise cache miss profiling of a given function
> and all its sub-functions - and only of that function.
>
> Note that the existing filter engine functionality connects in a natural
> way here as well: it can already be used to limit events and thus can be
> used to further shape the behavior of tracing, runtime.

Yep.

>
> Other interesting event operations are possible as well. Key is to
> expose this via the unified interface and to stop adding new
> functionality-limited crap via the ftrace plugin infrastructure - it's
> clearly not suitable for this purpose. We now have found how to do these
> things properly and cleanly.

Yes, I agree that we should not add any new plugins. But I will still
maintain the ones that are there until we have a replacement. I can
continue my work on getting the events to pass data, and also add a
kernel API to access hooks to functions.

>
> And the thing is, _you_ implemented unified ftrace events, all I'm
> asking you is to realize the power of them and to stop adding new
> ftrace plugins [which are crap in comparison] and concentrate on
> exposing new tracing functionality via unified ftrace events, ok? ;-)

Ah, I think we had a misunderstanding here. I was defending the current
plugins, arguing that they still need to be supported (I'm working on some
enhancements now). But I agree that no new plugins should be introduced,
and that work should happen on finding alternatives. But until
alternatives are ready, I'll still maintain the current plugins that are
there. If anything, they help us understand what needs to be done with
whatever replaces them.

I think you thought I was pushing to extend plugins. That wasn't my
argument. I was just arguing that the plugins still serve a purpose and
I'll support and maintain the current plugins until they become
obsolete.

-- Steve

2009-12-10 17:29:12

by Frank Ch. Eigler

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

Ingo Molnar <[email protected]> writes:

> [...]
> Just off the top of my head we want to be able to trace:
>
> - max irq service latencies for a given IRQ
> - max block IO completion latencies for an app
> - max TLB flush latencies in the system
> - max sys_open() latencies in a task
> - max fork()/exit() latencies in a workload
> - max scheduling latencies on a given CPU
> - max page fault latencies
> - max wakeup latencies for a given task
> - max memory allocation latencies
>
> - ... and dozens and dozens of other things where there's a "start"
> and a "stop" event and where we want to measure the time between
> them.
> [...]

FWIW, those who want to collect such measurements today can do so with
a few lines of systemtap script for each of the above.


- FChE

2009-12-10 17:57:51

by Ingo Molnar

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer


* Frank Ch. Eigler <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > [...]
> > Just off the top of my head we want to be able to trace:
> >
> > - max irq service latencies for a given IRQ
> > - max block IO completion latencies for an app
> > - max TLB flush latencies in the system
> > - max sys_open() latencies in a task
> > - max fork()/exit() latencies in a workload
> > - max scheduling latencies on a given CPU
> > - max page fault latencies
> > - max wakeup latencies for a given task
> > - max memory allocation latencies
> >
> > - ... and dozens and dozens of other things where there's a "start"
> > and a "stop" event and where we want to measure the time between
> > them.
> > [...]
>
> FWIW, those who want to collect such measurements today can do so with
> a few lines of systemtap script for each of the above.

Well, I don't think stap can do workload instrumentation. It can do
system-wide (and user local / task local) - but can it do per task
hierarchies?

Also, I don't think stap supports proper separation of per workload
measurements either. I.e. can you write a script that will work properly
even if multiple monitoring tools are running, each trying to measure
latencies?

Also, I personally find built-in kernel functionality more trustable
than dynamically built stap kernel modules that get inserted.

Ingo

2009-12-10 18:04:39

by Frank Ch. Eigler

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

Hi -


> > FWIW, those who want to collect such measurements today can do so with
> > a few lines of systemtap script for each of the above.
>
> Well, I don't think stap can do workload instrumentation. It can do
> system-wide (and user local / task local) - but can it do per task
> hierarchies?

It can track the evolution of task hierarchies by listening to process
forking events, and filter other kernel/user events according to
then-current hierarchy data. One primitive implementation of this is
in the target_set.stp tapset, but it's easy to script up other
policies.


> Also, I don't think stap supports proper separation of per workload
> measurements either. I.e. can you write a script that will work properly
> even if multiple monitoring tools are running, each trying to measure
> latencies?

Sure, always has. You can run many scripts concurrently, each with its
own internal state. (Overheads accumulate, sadly & naturally.)


> Also, I personally find built-in kernel functionality more trustable
> than dynamically built stap kernel modules that get inserted.

I understand. In the absence of a suitable bytecode engine in the
kernel, this was the only practical way to do everything we needed.


- FChE

2009-12-10 18:35:31

by Ingo Molnar

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer


* Frank Ch. Eigler <[email protected]> wrote:

> Hi -
>
> > > FWIW, those who want to collect such measurements today can do so
> > > with a few lines of systemtap script for each of the above.
> >
> > Well, I don't think stap can do workload instrumentation. It can do
> > system-wide (and user local / task local) - but can it do per task
> > hierarchies?
>
> It can track the evolution of task hierarchies by listening to process
> forking events, and filter other kernel/user events according to
> then-current hierarchy data. One primitive implementation of this is
> in the target_set.stp tapset, but it's easy to script up other
> policies.

target_set.stp is not really adequate. Have you actually _tried_ to use
it on something real like hackbench, which runs thousands (or tens of
thousands) of tasks? You'll soon find that associative arrays are not
really adequate for that ...

Another problem I can see is that target_set.stp starts with:

global _target_set # map: target-set-pid -> ancestor-pid

see that 'global' thing? It's a system global variable - i.e. you cannot
measure two task hierarchies at once.

> > Also, I don't think stap supports proper separation of per workload
> > measurements either. I.e. can you write a script that will work
> > properly even if multiple monitoring tools are running, each trying
> > to measure latencies?
>
> Sure, always has. You can run many scripts concurrently, each with
> its own internal state. (Overheads accumulate, sadly & naturally.)

To measure latencies you need two probes, a start and a stop one. How do
you define a local variable that is visible to those two probes? You
have to create a global variable - but that will/can clash with other
instances.

( Also, you don't offer per application channels/state from the same
script. Each app has to define their own probes, duplicating the
script and increasing probe chaining overhead. )

The whole state sharing and eventing model of SystemTap is poorly
thought out.

> > Also, I personally find built-in kernel functionality more trustable
> > than dynamically built stap kernel modules that get inserted.
>
> I understand. In the absence of a suitable bytecode engine in the
> kernel, this was the only practical way to do everything we needed.

You seem to be under the mistaken assumption that your course of action
with SystemTap is somehow limited by what is available (or not) in the
upstream kernel.

In reality you can implement anything you want (in fact you did
precisely that - _against_ repeated advice of upstream kernel
developers), and if it's good, it will be merged - simple as that. It
might take years, but once you deliver the proof (which comes in the form of
lots of happy users/developers), it happens.

So saying 'but the kernel does not have a bytecode interpreter' (or any
other excuse) is pretty lame.

Ingo

2009-12-10 18:51:08

by Frank Ch. Eigler

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

Hi -


On Thu, Dec 10, 2009 at 07:35:08PM +0100, Ingo Molnar wrote:
> [...]
> target_set.stp is not really adequate. Have you actually _tried_ to use
> it on something real like hackbench, which runs thousands (or tens of
> thousands) of tasks? You'll soon find that associative arrays are not
> really adequate for that ... [...]

A few thousand entries in a hash table is really not that big a deal.


> > > Also, I don't think stap supports proper separation of per workload
> > > measurements either. I.e. can you write a script that will work
> > > properly even if multiple monitoring tools are running, each trying
> > > to measure latencies?
> >
> > Sure, always has. You can run many scripts concurrently, each with
> > its own internal state. (Overheads accumulate, sadly & naturally.)
>
> To measure latencies you need two probes, a start and a stop one. How do
> you define a local variable that is visible to those two probes? You
> have to create a global variable - but that will/can clash with other
> instances.

You misunderstand systemtap "global" values. They are global to that
particular execution of that particular script. They are not shared
between scripts that may be concurrently running.


> ( Also, you don't offer per application channels/state from the same
> script. Each app has to define their own probes, duplicating the
> script and increasing probe chaining overhead. )

Please elaborate what you mean.


> > > Also, I personally find built-in kernel functionality more trustable
> > > than dynamically built stap kernel modules that get inserted.
> >
> > I understand. In the absence of a suitable bytecode engine in the
> > kernel, this was the only practical way to do everything we needed.
>
> You seem to be under the mistaken assumption that your course of action
> with SystemTap is somehow limited by what is available (or not) in the
> upstream kernel. In reality you can implement anything you want [...]

The message we have received time after time after time was
stronger: that a suitable interpreter was not going to be welcome in
tree. If this is relaxed (and perhaps even if not), we may prototype
such a thing in the new year.


- FChE

2009-12-10 20:15:27

by Ingo Molnar

Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer


* Frank Ch. Eigler <[email protected]> wrote:

> Hi -
>
>
> On Thu, Dec 10, 2009 at 07:35:08PM +0100, Ingo Molnar wrote:
> > [...]
> > target_set.stp is not really adequate. Have you actually _tried_ to use
> > it on something real like hackbench, which runs thousands (or tens of
> > thousands) of tasks? You'll soon find that associative arrays are not
> > really adequate for that ... [...]
>
> A few thousand entries in a hash table is really not that big a deal.

Except if it's a high-freq event and the huge hash table is kept in the
CPU cache all the time.

> > > > Also, I don't think stap supports proper separation of per
> > > > workload measurements either. I.e. can you write a script that
> > > > will work properly even if multiple monitoring tools are
> > > > running, each trying to measure latencies?
> > >
> > > Sure, always has. You can run many scripts concurrently, each
> > > with its own internal state. (Overheads accumulate, sadly &
> > > naturally.)
> >
> > To measure latencies you need two probes, a start and a stop one.
> > How do you define a local variable that is visible to those two
> > probes? You have to create a global variable - but that will/can
> > clash with other instances.
>
> You misunderstand systemtap "global" values. They are global to that
> particular execution of that particular script. They are not shared
> between scripts that may be concurrently running.

Ok.

> > ( Also, you don't offer per-application channels/state from the same
> > script. Each app has to define their own probes, duplicating the
> > script and increasing probe chaining overhead. )
>
> Please elaborate what you mean.

Firstly, AFAICS each subsequent systemtap probe for the same event adds
chaining overhead - and then you have to disambiguate back to the
originating script. Secondly, is there a way for a single probe to
multiplex its output to multiple apps? AFAICS that's only possible by
running multiple scripts.

> > > > Also, i personally find built-in kernel functionality more
> > > > trustable than dynamically built stap kernel modules that get
> > > > inserted.
> > >
> > > I understand. In the absence of a suitable bytecode engine in the
> > > kernel, this was the only practical way to do everything we
> > > needed.
> >
> > You seem to be under the mistaken assumption that your course of
> > action with SystemTap is somehow limited by what is available (or
> > not) in the upstream kernel. In reality you can implement anything
> > you want [...]
>
> The message we have received time after time after time was
> stronger: that a suitable interpreter was not going to be welcome
> in-tree. If this is relaxed (and perhaps even if not), we may
> prototype such a thing in the new year.

FYI, i suggested this to you 2-3 years ago.

Ingo

2009-12-10 20:23:20

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, Dec 10, 2009 at 09:29:27AM -0500, Steven Rostedt wrote:
> On Thu, 2009-12-10 at 13:03 +0100, Frederic Weisbecker wrote:
>
> > This makes me feel like I'm going to try converting the function graph
> > tracer into an event during the next cycle. It does not mean I could
> > make it usable as a perf event right away in the same shot; that said,
> > as you can guess, this is not a trivial plug. The current perf fast
> > path is not yet adapted for that.
>
> I'm curious how you plan on doing this. The current event system shows
> one event per tracepoint. A straightforward approach would make every
> entry and exit of a function a tracepoint, and that would lead to a
> very large kernel to handle that.


Oh no, I'm not planning to use tracepoints for that.


> Perhaps we could abstract out all entries and exits. We need to be able
> to link to a single point (entry or exit), not all. This also has the
> added issue of using the ftrace infrastructure to nop the mcount call.
> We also need a way to enable a set of functions.
>
> We may be able to abstract this out, but I'm hesitant on making this the
> only interface.


Hmm, yeah. The idea was just to move from struct tracer to struct
trace_event. This would be fairly straightforward. A bit like kprobes:
not using the TRACE_EVENT macros (that would be impossible anyway) but
specific callbacks.

It would be one event.
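
To make that concrete, a rough and purely illustrative sketch against
the existing register_ftrace_event() interface; the callback body and
its output format here are invented:

/* in kernel/trace/, alongside the existing output callbacks */
#include "trace.h"
#include "trace_output.h"

static enum print_line_t
funcgraph_event_print(struct trace_iterator *iter, int flags)
{
        struct ftrace_graph_ret_entry *field;
        struct trace_seq *s = &iter->seq;

        trace_assign_type(field, iter->ent);
        if (!trace_seq_printf(s, "exit: duration %llu ns\n",
                              field->ret.rettime - field->ret.calltime))
                return TRACE_TYPE_PARTIAL_LINE;
        return TRACE_TYPE_HANDLED;
}

static struct trace_event funcgraph_trace_event = {
        .type   = TRACE_GRAPH_RET,
        .trace  = funcgraph_event_print,
};

/* at init time: register_ftrace_event(&funcgraph_trace_event); */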

set_ftrace_filter and set_graph_function can still be used to further
control dynamic patching. That's what I intended for a first conversion.

Another idea would be to abstract it through one trace event subsystem
that has one event per function. But that sounds a bit too much in terms
of memory footprint. Also it's perhaps sufficient to abstract the
dynamic patching, but not enough to abstract set_graph_function.

But later on, a full trace event integration would probably imply
dissociating dynamic tracing from the two function tracers.
For example, if the function graph tracer asks to nop a function,
this shouldn't be propagated to a parallel function tracer user.
That gets even worse once we have perf integration: we can have
multiple parallel users of the function tracer. And the patching
should probably adapt to parallel uses, maintaining a kind of
refcounting; extending the current function hashlist we have
for function profiling could probably help with that.

2009-12-10 21:13:38

by Tim Bird

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

Ingo Molnar wrote:
> * Tim Bird <[email protected]> wrote:
>
>> Add function duration tracer.
>>
>> Signed-off-by: Tim Bird <[email protected]>
>> ---
>> kernel/trace/Kconfig | 8
>> kernel/trace/Makefile | 1
>> kernel/trace/trace.c | 32 ++
>> kernel/trace/trace_duration.c | 527 ++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 568 insertions(+)
>
> Please do it in a cleaner and more generic fashion: add a "function
> event" that perf can see and process, so all the output embellishment
> can be done outside of the kernel, in tools/perf/.

Sigh. OK - where's the perf code?

It took me a while to wrap my head around ftrace, so if I've
got to switch to a completely different event-ing infrastructure,
it will probably be some time before I'm back again.

I have some patches to add function graph tracing for
ARM, and to make it possible to use arbitrary ftrace plugins
at boot time. (There are issues with the current code for
anything other than the bootgraph plugin).

Are either of these still interesting?

I was about to start work on dynamic ftrace support for ARM.
What is the status of this?

> We want to wind down the current maze of ftrace plugins, not extend
> them. We already obsoleted the following ftrace plugins: scheduler,
> sysprof, blktrace, kmem, etc. There's more work ongoing and
> broad agreement between folks developing it that this is the way
> forward.

It looks like all the tracers mentioned above are still in 2.6.32,
which was just released. In what sense are they "already obsoleted"?

In what time frame will the ftrace plugin feature be obsoleted?
I'm using this code on 2.6.29 for a variety of Sony products.
If the ftrace plugin stuff isn't going away for another year
or two that gives me several years of utility with the current
code (which I recognize I'd have to maintain myself out-of-tree).

Once things have settled down and you guys have figured out
for sure what's going on, I could come back and try to integrate
these features into perf. As it stands now, I'm a little wary
of putting much effort into that.

Where are these things discussed? Only on LKML? It would be
handy if there were a separate list that could get CC-ed for
ftrace stuff, so I could monitor it more easily for big movements
like this. I apologize if there is and I've just missed it.

> The function tracer / function graph tracer is a holdout due to its
> complexity - but that by no means weakens the argument and the necessity
> to migrate it.
>
> ftrace plugins were a nice idea originally and a clear improvement over
> existing alternatives, but now that we've got a technically superior,
> unified event framework that can do what the old plugins did and much
> more, we want to improve that and not look back ...

I'm a little worried about this. ftrace already has an order of
magnitude more overhead than the previous tracer I was using.
I don't have ANY experience with perf, but if you're using it
for tracing, that already seems one step removed from its
original use for reporting performance counters. The complexity
described in the discussion related to this thread does not
raise my hopes that it will meet my needs.

I guess I'll have to start looking at perf to see if my
concerns are justified. But first, I've got to get the mainline
kernel booting again on my embedded hardware.
-- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================

2009-12-10 21:31:21

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

Hi -

On Thu, Dec 10, 2009 at 09:14:54PM +0100, Ingo Molnar wrote:
> [...]
> > A few thousand entries in a hash table is really not that big a deal.
> Except if it's a high-freq event and the huge hash table is kept in the
> CPU cache all the time.

OK. (For reference, an int->int hash table slot costs about 40 bytes,
so an L2 cache could carry quite a few of them.)


> Firstly, AFAICS each subsequent systemtap probe for the same event
> adds chaining overhead - and then you have to disambiguate back to
> the originating script.

Right, but this kind of demultiplexing has to occur somewhere along
the line. In practice, chaining a la kprobes or tracepoints is
negligible compared to the other costs.


> Secondly, is there a way for a single probe to multiplex its output
> to multiple apps? AFAICS that's only possible by running multiple
> scripts.

As in having multiple files to write to? There's no easy & direct way
to do that right now (beyond unmerged per-cpu files in "bulk" mode).
One can have systemtap print data on multiple synthetic /proc/.../
files, but that has other tradeoffs. Or one could demultiplex in user
space (for example by prefixing text lines, or using binary records).


> > The message we have received time after time after time was
> > stronger: that a suitable interpreter was not going to be welcome
> > in-tree. If this is relaxed (and perhaps even if not), we may
> > prototype such a thing in the new year.
>
> FYI, i suggested this to you 2-3 years ago.

OK, well, I hope that when the time comes, the messages will be less
mixed than usual. :)


- FChE

2009-12-10 21:55:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, 2009-12-10 at 21:23 +0100, Frederic Weisbecker wrote:
> On Thu, Dec 10, 2009 at 09:29:27AM -0500, Steven Rostedt wrote:
> > On Thu, 2009-12-10 at 13:03 +0100, Frederic Weisbecker wrote:
> >
> > > This makes me feel like I'm going to try converting the function graph
> > > tracer into an event during the next cycle. It does not mean I could
> > > make it usable as a perf event right away in the same shot; that said,
> > > as you can guess, this is not a trivial plug. The current perf fast
> > > path is not yet adapted for that.
> >
> > I'm curious how you plan on doing this. The current event system shows
> > one event per tracepoint. A straightforward approach would make every
> > entry and exit of a function a tracepoint, and that would lead to a
> > very large kernel to handle that.
>
>
> Oh no, I'm not planning to use tracepoints for that.

Thank goodness ;-)

>
>
> > Perhaps we could abstract out all entries and exits. We need to be able
> > to link to a single point (entry or exit), not all. This also has the
> > added issue of using the ftrace infrastructure to nop the mcount call.
> > We also need a way to enable a set of functions.
> >
> > We may be able to abstract this out, but I'm hesitant on making this the
> > only interface.
>
>
> > Hmm, yeah. The idea was just to move from struct tracer to struct
> > trace_event. This would be fairly straightforward. A bit like kprobes:
> > not using the TRACE_EVENT macros (that would be impossible anyway) but
> > specific callbacks.

Hmm, please keep the use of struct tracer around. That is still a very
powerful utility.

>
> It would be one event.
>
> set_ftrace_filter and set_graph_function can still be used to further
> control dynamic patching. That's what I intended for a first conversion.
>
> Another idea would be to abstract it through one trace event subsystem
> that has one event per function. But that sounds a bit too much in terms
> of memory footprint. Also it's perhaps sufficient to abstract the
> dynamic patching, but not enough to abstract set_graph_function.

Yeah, making something store an event per function would consume too
much memory. We could perhaps create the events dynamically, but this
will take a lot of thought.

>
> But later on, a full trace event integration would probably imply
> dissociating dynamic tracing from the two function tracers.
> For example, if the function graph tracer asks to nop a function,
> this shouldn't be propagated to a parallel function tracer user.
> That gets even worse once we have perf integration: we can have
> multiple parallel users of the function tracer. And the patching
> should probably adapt to parallel uses, maintaining a kind of
> refcounting; extending the current function hashlist we have
> for function profiling could probably help with that.

Yeah, this gets a bit complicated. The biggest issue is that the mcount
call is not C ABI, so it must go through some sort of trampoline. I've
thought about making a dynamic trampoline, but again, that will start
hogging up memory, and I'm not sure it is worth it.

I gave this some thought, and the best we could do is have a ref counter
in the ftrace record itself. We need a global flag to know whether all
functions need their nop replaced with a call to the trace function, or
just some. That way we can have one tracer tracing all functions and
another tracing just some. We sorta have that today, though.

Making the API to this infrastructure will also take some thought. We
can currently register a function to be called by the function tracer
(we can register more than one today). But the filtering is at a global
level. We need a way for a tracer to tell ftrace that it wants to trace
all or some functions, as well as make up its mind later. Then the
callback can be invoked for all functions, or only if a function in some
hash table says it's OK to be called.
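
Something like this, as a purely hypothetical sketch (the record
layout and the enable/disable helpers are made up, not today's
dyn_ftrace code):

/* hypothetical call-site patchers */
static void enable_call_site(unsigned long ip);
static void disable_call_site(unsigned long ip);

/* one record per mcount call site, as today, plus a refcount */
struct dyn_ftrace_rec {
        unsigned long   ip;     /* address of the mcount call site */
        atomic_t        users;  /* tracers that want this site */
};

static atomic_t trace_all_users;        /* tracers tracing everything */

static void rec_ref(struct dyn_ftrace_rec *rec)
{
        /* first user: patch the nop back into a call */
        if (atomic_inc_return(&rec->users) == 1)
                enable_call_site(rec->ip);
}

static void rec_unref(struct dyn_ftrace_rec *rec)
{
        /* last per-function user gone, and nobody traces "all":
         * turn the call site back into a nop */
        if (atomic_dec_and_test(&rec->users) &&
            atomic_read(&trace_all_users) == 0)
                disable_call_site(rec->ip);
}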

-- Steve

2009-12-10 22:04:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, 2009-12-10 at 13:13 -0800, Tim Bird wrote:

> > Please do it in a cleaner and more generic fashion: add a "function
> > event" that perf can see and process, so all the output embellishment
> > can be done outside of the kernel, in tools/perf/.
>
> Sigh. OK - where's the perf code?
>
> It took me a while to wrap my head around ftrace, so if I've
> got to switch to a completely different event-ing infrastructure,
> it will probably be some time before I'm back again.
>
> I have some patches to add function graph tracing for
> ARM, and to make it possible to use arbitrary ftrace plugins
> at boot time. (There are issues with the current code for
> anything other than the bootgraph plugin).

The function graph for ARM would be fine. Go ahead; there's no expected
change in this code for the near future, and if it does change it will
be at a higher level (generic code, not arch).

>
> Are either of these still interesting?

We already have a way to enable arbitrary ftrace plugins at boot time:

Try ftrace=function on the command line.

>
> I was about to start work on dynamic ftrace support for ARM.
> What is the status of this?

Please do. Again, if we change this code, I would expect it to happen at
a higher level. Trying to do a registry at the arch level will explode
in complexity.

>
> > We want to wind down the current maze of ftrace plugins, not extend
> > them. We already obsoleted the following ftrace plugins: scheduler,
> > sysprof, blktrace, kmem, etc. There's more work ongoing and
> > broad agreement between folks developing it that this is the way
> > forward.
>
> It looks like all the tracers mentioned above are still in 2.6.32,
> which was just released. In what sense are they "already obsoleted"?

Well, the TRACE_EVENT infrastructure has made them obsolete. In ftrace
terms, you can just do:

# echo 1 > /debugfs/tracing/events/sched/enable
# cat /debugfs/tracing/trace

and see basically the same information that sched_switch showed.

>
> In what time frame will the ftrace plugin feature be obsoleted?

Well, currently we don't have any substitute for function/function_graph
and the latency tracers. We just don't want to be adding more.


> I'm using this code on 2.6.29 for a variety of Sony products.
> If the ftrace plugin stuff isn't going away for another year
> or two that gives me several years of utility with the current
> code (which I recognize I'd have to maintain myself out-of-tree).

I don't plan on removing the plugin infrastructure anytime soon.

>
> Once things have settled down and you guys have figured out
> for sure what's going on, I could come back and try to integrate
> these features into perf. As it stands now, I'm a little wary
> of putting much effort into that.
>
> Where are these things discussed? Only on LKML? It would be
> handy if there were a separate list that could get CC-ed for
> ftrace stuff, so I could monitor it more easily for big movements
> like this. I apologize if there is and I've just missed it.

It's not really discussed much. Usually someone pushes out a patch set
that changes the direction of tracing, and we go from there. So yes, it
is LKML.

>
> > The function tracer / function graph tracer is a holdout due to its
> > complexity - but that by no means weakens the argument and the necessity
> > to migrate it.
> >
> > ftrace plugins were a nice idea originally and a clear improvement over
> > existing alternatives, but now that we've got a technically superior,
> > unified event framework that can do what the old plugins did and much
> > more, we want to improve that and not look back ...
>
> I'm a little worried about this. ftrace already has an order of
> magnitude more overhead than the previous tracer I was using.

Which tracer was that? I know you used -finstrument-functions for
tracing before, and that has quite a bit of overhead as well. The ring
buffer has gone through a lot to speed it up.

-- Steve

2009-12-10 22:26:52

by Tim Bird

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

Steven Rostedt wrote:
> On Thu, 2009-12-10 at 13:13 -0800, Tim Bird wrote:
>> I have some patches ...
>> to make it possible to use arbitrary ftrace plugins
>> at boot time. (There are issues with the current code for
>> anything other than the bootgraph plugin).
>
> We already have a way to enable arbitrary ftrace plugins at boot time:
>
> Try ftrace=function on the command line.

That's what I'm using, but my function_duration tracer didn't
work, because of an issue initializing the trace-option files.
The problem was that debugfs wasn't initialized yet.
I worked up a simple fix to defer initializing the
trace-option files in debugfs until later in the boot
sequence.
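
Roughly along these lines; this is a sketch of the idea rather than
the actual patch, leaning on the existing create_trace_option_files()
helper in trace.c, with invented names:

/* in kernel/trace/trace.c: if the boot-time tracer was selected
 * before debugfs was ready, remember it and create its option
 * files from a late initcall instead */
static struct tracer *deferred_tracer;

static int __init init_deferred_trace_options(void)
{
        if (deferred_tracer)
                create_trace_option_files(deferred_tracer);
        return 0;
}
late_initcall(init_deferred_trace_options);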

Maybe you've already fixed this. I'll check out the
latest code and see if I still have problems with it.
If so, I'll send the patches (which are quite small)
for consideration.
-- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================

2009-12-10 22:37:43

by Tim Bird

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

Steven Rostedt wrote:
>> I'm a little worried about this. ftrace already has an order of
>> magnitude more overhead than the previous tracer I was using.
>
> Which tracer was that? I know you used -finstrument-functions for
> tracing before, and that has quite a bit of overhead as well. The ring
> buffer has gone through a lot to speed it up.

KFT. It used -finstrument-functions, which did result in
a few more instructions per call site than mcount does. However,
the code paths in KFT are much shorter than in ftrace.
And the fact that both function exit and entry are instrumented
meant that I didn't have to do funky trampoline junk, or
manage a separate structure, in order to capture function
exit events.

It is somewhat to be expected that ftrace would have higher
overhead, because ftrace does more. But in the version
of the code I was using (which is probably quite old), there
was a lot of extraneous code in the hot path.
For example, there are at least 4 checks in the code
to detect whether tracing is active or not.

The ring buffer code was particularly slow. I was using
a simple fixed-size array in KFT (on a uniprocessor), and
the time to save an event was under 1 microsecond. With
ftrace on the same hardware, it costs about 6 microseconds
to save an event.
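
Schematically, the KFT fast path was about this simple (a rough
sketch of the idea, not KFT's actual code):

#include <asm/atomic.h>
#include <asm/timex.h>

struct kft_entry {
        unsigned long   ip;     /* instrumented address */
        cycles_t        time;   /* raw cycle counter */
};

#define KFT_MAX (1 << 16)
static struct kft_entry kft_log[KFT_MAX];
static atomic_t kft_next = ATOMIC_INIT(0);

/* one bounds check, one counter bump, two stores; no ring
 * buffer, no per-cpu pages, no timestamp normalization */
static void kft_log_event(unsigned long ip)
{
        int i = atomic_inc_return(&kft_next) - 1;

        if (i < KFT_MAX) {
                kft_log[i].ip = ip;
                kft_log[i].time = get_cycles();
        }
}
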
-- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================

2009-12-10 22:40:22

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, Dec 10, 2009 at 04:55:43PM -0500, Steven Rostedt wrote:
> On Thu, 2009-12-10 at 21:23 +0100, Frederic Weisbecker wrote:
> > Oh no, I'm not planning to use tracepoints for that.
>
> Thank goodness ;-)


:)



> > Hmm, yeah. The idea was just to move from struct tracer to struct
> > trace_event. This would be fairly straightforward. A bit like kprobes:
> > not using the TRACE_EVENT macros (that would be impossible anyway) but
> > specific callbacks.
>
> Hmm, please keep the use of struct tracer around. That is still a very
> powerful utility.


Sure. The goal is to move the function graph tracer into a
more unified interface that can benefit a wider range of
users (perf being the new user I have in mind), and also to
make it easier to use concurrently.

But I'm not dropping struct tracer, as I too think that until
we have something superior (and not only potentially superior),
we need to keep it. And I'm pretty sure we all think the same.

But if I convert the graph tracer into the trace event interface,
I guess we shouldn't maintain its struct tracer version in
parallel.


> > But later on, a full trace event integration would probably imply
> > dissociating dynamic tracing from the two function tracers.
> > For example, if the function graph tracer asks to nop a function,
> > this shouldn't be propagated to a parallel function tracer user.
> > That gets even worse once we have perf integration: we can have
> > multiple parallel users of the function tracer. And the patching
> > should probably adapt to parallel uses, maintaining a kind of
> > refcounting; extending the current function hashlist we have
> > for function profiling could probably help with that.
>
> Yeah, this gets a bit complicated. The biggest issue is that the mcount
> call is not C ABI, so it must go through some sort of trampoline. I've
> thought about making a dynamic trampoline, but again, that will start
> hogging up memory, and I'm not sure it is worth it.


Yeah, that would be too complicated and hard to maintain.


> I gave this some thought, and the best we could do is have a ref counter
> in the ftrace record itself. We need a global flag to know whether all
> functions need their nop replaced with a call to the trace function, or
> just some. That way we can have one tracer tracing all functions and
> another tracing just some. We sorta have that today, though.
>
> Making the API to this infrastructure will also take some thought. We
> can currently register a function to be called by the function tracer
> (we can register more than one today). But the filtering is at a global
> level. We need a way for a tracer to tell ftrace that it wants to trace
> all or some functions, as well as make up its mind later. Then the
> callback can be invoked for all functions, or only if a function in some
> hash table says it's OK to be called.


Yeah, I thought about that too.
Maybe we can keep the current list of ftrace records.

Then when a random function tracer registers, we give it
a dedicated hashlist of functions (which kind of mirrors
the ftrace records, but in a hashlist fashion).

When it wants to trace a set of functions, we set the appropriate
flags in its hashlist and the ftrace core patches the sites if
needed.

Once we want to disable tracing on functions, we disable it
in the tracer's hashlist; then the ftrace core goes through
every tracer's hashlist and counts the current users of the
function, and if there are no more of them, we nop the site
back. Well, that could be a refcount in the ftrace records
too.

But the point is this hashlist. I don't think it will be that huge a
memory consumer, as I don't expect there will ever be a lot of
concurrent function tracing users at the same time.

I just hope we can make it smart enough to be a near true O(1)
hashlist.
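
Something like the following, as a very rough sketch (all names
invented):

#include <linux/hash.h>
#include <linux/list.h>
#include <linux/types.h>

#define FUNC_HASH_BITS  10

struct traced_func {
        struct hlist_node       node;
        unsigned long           ip;     /* function entry address */
};

/* one of these per registered function tracing user */
struct tracer_func_hash {
        struct hlist_head       buckets[1 << FUNC_HASH_BITS];
};

/* near O(1): does this tracer want the function at ip? */
static bool tracer_wants(struct tracer_func_hash *h, unsigned long ip)
{
        struct hlist_head *head =
                &h->buckets[hash_long(ip, FUNC_HASH_BITS)];
        struct traced_func *f;
        struct hlist_node *n;

        hlist_for_each_entry(f, n, head, node)
                if (f->ip == ip)
                        return true;
        return false;
}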

Just an idea...

2009-12-10 23:47:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/4] ftrace - add function_duration tracer

On Thu, 2009-12-10 at 14:36 -0800, Tim Bird wrote:

> The ring buffer code was particularly slow. I was using
> a simple fixed-size array in KFT (on a uniprocessor), and
> the time to save an event was under 1 microsecond. With
> ftrace on the same hardware, it costs about 6 microseconds
> to save an event.
>

Well, I know you work in embedded so it may be much slower.

I just ran the ring_buffer_benchmark on a Xeon Quad 2660MHz with
2.6.32-rc7 and I got 190ns per trace entry. (Note most of this time is
spent in the timestamp calculation!)

I booted the latest 2.6.32 kernel, where I've made some changes to the
timekeeping code, and got 117ns per entry. I sent out an RFC patch set,
disabled all the RB_WARN_ON safety checks, and got it down to 90ns
per trace entry! This still includes the 3 levels of checking whether
the buffer is enabled. Those checks are just a nit and probably (I
haven't tested) cost a single ns. The variables being checked are
labeled read_mostly and are most likely in the cache.


Maybe you want to test out the overhead again?

Note, the ftrace infrastructure's first priority is stability, so I
won't be removing those RB_WARN_ONs. The next priority is for
non-tracing to be unaffected. Third is making tracing as fast as
possible. A tracer that does not need to worry about going away when
disabled will probably have a performance advantage when tracing. But,
if it causes a noticeable overhead when disabled (like
-finstrument-functions does, and it adds a lot more instructions per
call site than mcount), it will never be merged.

-- Steve