2009-01-16 00:41:39

by Steven Rostedt

Subject: [PATCH 0/2] ftrace: updates to tip


Ingo,

The first patch is critical, and needs to stay with trace_output.c.
Not that critical since trace_output.c is not in mainline yet.

The second patch gives the ability to stack trace functions.
I've been leery about adding this, and I still keep it a separate
option from the "stacktrace" that already exists. This is because,
when enabled with no filtering, the lag between typing and seeing
what is typed can be up to 10 seconds or more.

I made the function stack trace an option attached to the function
tracer, so it must be enabled after the function tracer has been
set. This still needs to be updated in ftrace.txt.

The following patches are in:

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git

branch: tip/devel


Lai Jiangshan (1):
ftrace: fix trace_output

Steven Rostedt (1):
ftrace: add stack trace to function tracer

----
kernel/trace/trace.c | 26 ++++++++----
kernel/trace/trace.h | 7 +++
kernel/trace/trace_functions.c | 84 ++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 65 +++++++++++++++----------------
kernel/trace/trace_output.h | 4 +-
5 files changed, 141 insertions(+), 45 deletions(-)

--


2009-01-16 11:24:18

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Steven Rostedt <[email protected]> wrote:

> Ingo,
>
> The first patch is critical, and needs to stay with trace_output.c Not
> that critical since trace_output.c is not in mainline yet.
>
> The second patch gives the ability to stack trace functions. I've been
> leery about adding this and still keep it a separate option from the
> "stacktrace" that already exists. This is because when enabled with no
> filtering, the lag between typing and seeing what is typed can be up to
> 10 seconds or more.

Btw., is this true even if frame pointers are enabled? When frame pointers
are disabled we scan the full kernel stack => that can be quite slow if
every kernel function is traced ;-)

> I made the function stack trace an option attached to the function
> tracer, so it must be enabled after the function tracer has been set.
> This still needs to be updated in ftrace.txt.

maybe we could drive this via the filter API? Something like:

echo "*btrfs*:stacktrace" >> set_filter_functions

Would automatically mean that those functions will all generate
stacktraces too. Note how safe this API is by default: the filter is used
for a narrow scope of functions anyway. To get it for all kernel functions
one would have to do:

echo "*:stacktrace" >> set_filter_functions

Which one cannot do accidentally.
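As a rough userspace sketch of how such "pattern:command" entries might be interpreted (the helper names and function list here are hypothetical; the real matching would live in ftrace's filter code):

```python
import fnmatch

def parse_filter_entry(entry):
    """Split a 'pattern[:command]' entry, e.g. '*btrfs*:stacktrace'."""
    pattern, _, command = entry.partition(":")
    return pattern, command or None

def functions_matching(entry, all_functions):
    """Return (matched functions, command) for one filter entry."""
    pattern, command = parse_filter_entry(entry)
    matched = [fn for fn in all_functions if fnmatch.fnmatch(fn, pattern)]
    return matched, command

# Illustrative only: a tiny stand-in for the list of traceable functions.
funcs = ["btrfs_join_transaction", "btrfs_commit_transaction", "schedule"]
matched, cmd = functions_matching("*btrfs*:stacktrace", funcs)
# matched covers only the btrfs functions; cmd == "stacktrace"
```

The safety property above falls out naturally: a narrow glob attaches the command to a narrow set of functions, and only the explicit `*` glob reaches everything.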

What do you think?

> The following patches are in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
>
> branch: tip/devel
>
>
> Lai Jiangshan (1):
> ftrace: fix trace_output
>
> Steven Rostedt (1):
> ftrace: add stack trace to function tracer
>
> ----
> kernel/trace/trace.c | 26 ++++++++----
> kernel/trace/trace.h | 7 +++
> kernel/trace/trace_functions.c | 84 ++++++++++++++++++++++++++++++++++++++++
> kernel/trace/trace_output.c | 65 +++++++++++++++----------------
> kernel/trace/trace_output.h | 4 +-
> 5 files changed, 141 insertions(+), 45 deletions(-)

Pulled into tip/tracing/ftrace, thanks Steve!

Ingo

2009-01-16 13:08:53

by Frederic Weisbecker

Subject: Re: [PATCH 0/2] ftrace: updates to tip

2009/1/16 Ingo Molnar <[email protected]>:
>
> * Steven Rostedt <[email protected]> wrote:
>
>> Ingo,
>>
>> The first patch is critical, and needs to stay with trace_output.c Not
>> that critical since trace_output.c is not in mainline yet.
>>
>> The second patch gives the ability to stack trace functions. I've been
>> leery about adding this and still keep it a separate option from the
>> "stacktrace" that already exists. This is because when enabled with no
>> filtering, the lag between typing and seeing what is typed can be up to
>> 10 seconds or more.
>
> Btw., is this true even if frame pointers are enabled? When frame pointers
> are disabled we scan the full kernel stack => that can be quite slow if
> every kernel function is traced ;-)


Hi Ingo,

The function tracer selects Frame pointers :-)

Frederic.


>> I made the function stack trace an option attached to the function
>> tracer, so it must be enabled after the function tracer has been set.
>> This still needs to be updated in ftrace.txt.
>
> maybe we could drive this via the filter API? Something like:
>
> echo "*btrfs*:stacktrace" >> set_filter_functions
>
> Would automatically mean that those functions will all generate
> stacktraces too. Note how safe this API is by default: the filter is used
> for a narrow scope of functions anyway. To get it for all kernel functions
> one would have to do:
>
> echo "*:stacktrace" >> set_filter_functions
>
> Which one cannot do accidentally.
>
> What do you think?
>
>> The following patches are in:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
>>
>> branch: tip/devel
>>
>>
>> Lai Jiangshan (1):
>> ftrace: fix trace_output
>>
>> Steven Rostedt (1):
>> ftrace: add stack trace to function tracer
>>
>> ----
>> kernel/trace/trace.c | 26 ++++++++----
>> kernel/trace/trace.h | 7 +++
>> kernel/trace/trace_functions.c | 84 ++++++++++++++++++++++++++++++++++++++++
>> kernel/trace/trace_output.c | 65 +++++++++++++++----------------
>> kernel/trace/trace_output.h | 4 +-
>> 5 files changed, 141 insertions(+), 45 deletions(-)
>
> Pulled into tip/tracing/ftrace, thanks Steve!
>
> Ingo
>

2009-01-16 13:11:25

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Frédéric Weisbecker <[email protected]> wrote:

> 2009/1/16 Ingo Molnar <[email protected]>:
> >
> > * Steven Rostedt <[email protected]> wrote:
> >
> >> Ingo,
> >>
> >> The first patch is critical, and needs to stay with trace_output.c Not
> >> that critical since trace_output.c is not in mainline yet.
> >>
> >> The second patch gives the ability to stack trace functions. I've been
> >> leery about adding this and still keep it a separate option from the
> >> "stacktrace" that already exists. This is because when enabled with no
> >> filtering, the lag between typing and seeing what is typed can be up to
> >> 10 seconds or more.
> >
> > Btw., is this true even if frame pointers are enabled? When frame pointers
> > are disabled we scan the full kernel stack => that can be quite slow if
> > every kernel function is traced ;-)
>
>
> Hi Ingo,
>
> The function tracer selects Frame pointers :-)

doh, i even added that to begin with ;-)

Life is short but a coder's memory is even shorter.

Ingo

2009-01-16 14:32:43

by Steven Rostedt

Subject: Re: [PATCH 0/2] ftrace: updates to tip


On Fri, 16 Jan 2009, Ingo Molnar wrote:

>
> * Steven Rostedt <[email protected]> wrote:
>
> > Ingo,
> >
> > The first patch is critical, and needs to stay with trace_output.c Not
> > that critical since trace_output.c is not in mainline yet.
> >
> > The second patch gives the ability to stack trace functions. I've been
> > leery about adding this and still keep it a separate option from the
> > "stacktrace" that already exists. This is because when enabled with no
> > filtering, the lag between typing and seeing what is typed can be up to
> > 10 seconds or more.
>
> Btw., is this true even if frame pointers are enabled? When frame pointers
> are disabled we scan the full kernel stack => that can be quite slow if
> every kernel function is traced ;-)

After making my latest changes, I see a half-second lag. I don't know
why it was so slow before. But I do have all the debug features enabled,
which would also slow things down quite a bit (lockdep, et al.).

>
> > I made the function stack trace an option attached to the function
> > tracer, so it must be enabled after the function tracer has been set.
> > This still needs to be updated in ftrace.txt.
>
> maybe we could drive this via the filter API? Something like:
>
> echo "*btrfs*:stacktrace" >> set_filter_functions
>
> Would automatically mean that those functions will all generate
> stacktraces too. Note how safe this API is by default: the filter is used
> > for a narrow scope of functions anyway. To get it for all kernel functions
> one would have to do:
>
> echo "*:stacktrace" >> set_filter_functions
>
> Which one cannot do accidentally.
>
> What do you think?

Now that I see only a 1/2 sec lag, do you still think this is necessary?

Maybe I should go back and see why it was so bad before?

But I do notice that not all functions produce a valid stack trace. Maybe
it would be better to add that api :-?

-- Steve

2009-01-16 15:21:48

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Steven Rostedt <[email protected]> wrote:

> > maybe we could drive this via the filter API? Something like:
> >
> > echo "*btrfs*:stacktrace" >> set_filter_functions
> >
> > Would automatically mean that those functions will all generate
> > stacktraces too. Note how safe this API is by default: the filter is used
> > for a narrow scope of functions anyway. To get it for all kernel functions
> > one would have to do:
> >
> > echo "*:stacktrace" >> set_filter_functions
> >
> > Which one cannot do accidentally.
> >
> > What do you think?
>
> Now that I see only a 1/2 sec lag, do you still think this is necessary?
>
> Maybe I should go back and see why it was so bad before?
>
> But I do notice that not all functions produce a valid stack trace.
> Maybe it would be better to add that api :-?

yes - i think that API would be more intuitive, and that way people could
mix more interesting functions (with stack traces) with less important
functions (no stack traces).

Ingo

2009-01-16 15:54:23

by Steven Rostedt

Subject: Re: [PATCH 0/2] ftrace: updates to tip


On Fri, 16 Jan 2009, Ingo Molnar wrote:
> >
> > But I do notice that not all functions produce a valid stack trace.
> > Maybe it would be better to add that api :-?
>
> yes - i think that API would be more intuitive, and that way people could
> mix more interesting functions (with stack traces) with less important
> functions (no stack traces).

Ah, that's the point. We can't mix and match on these. Either all
functions that are traced do the stack trace, or none do. This is where
the new api may be confusing. We can register a function to be traced via
the function pointer, and we can pick which functions to trace, but we
cannot separate out different functions for different traces.

-- Steve

2009-01-16 16:02:29

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Steven Rostedt <[email protected]> wrote:

>
> On Fri, 16 Jan 2009, Ingo Molnar wrote:
> > >
> > > But I do notice that not all functions produce a valid stack trace.
> > > Maybe it would be better to add that api :-?
> >
> > yes - i think that API would be more intuitive, and that way people could
> > mix more interesting functions (with stack traces) with less important
> > functions (no stack traces).
>
> Ah, that's the point. We can't mix and match on these. Either all
> functions that are traced do the stack trace, or none do. This is where
> the new api may be confusing. [...]

that would be confusing indeed.

> [...] We can register a function to be traced via the function pointer,
> and we can pick which functions to trace, but we can not separate out
> different functions for different traces.

Why not? We could have a hash of all traced functions with metadata
attached. Since patching functions in/out would be a relatively rare
operation, this would be a highly efficient read-mostly hash.
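In userspace pseudo-terms (purely illustrative; the addresses and attribute names are made up), the idea is a mapping from function entry address to a small per-function metadata record:

```python
# Read-mostly hash: function entry address -> per-function trace metadata.
# Written only when filters change; read on trace-time lookups.
traced_meta = {}

def set_attrs(func_ip, **attrs):
    """Attach metadata to one traced function (rare, filter-change path)."""
    traced_meta.setdefault(func_ip, {}).update(attrs)

def wants_stacktrace(func_ip):
    """Fast read-side query: should this call site emit a stack trace?"""
    return traced_meta.get(func_ip, {}).get("stacktrace", False)

set_attrs(0xc0100000, stacktrace=True)   # e.g. a '*btrfs*:stacktrace' match
set_attrs(0xc0100040, graph=True)        # e.g. a set_graph_function entry
```

Because writes happen only when someone pokes the filter files, the read side never needs to take a heavy lock in this model.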

Ingo

2009-01-16 16:03:34

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Ingo Molnar <[email protected]> wrote:

> > [...] We can register a function to be traced via the function
> > pointer, and we can pick which functions to trace, but we can not
> > separate out different functions for different traces.
>
> Why not? We could have a hash of all traced functions with metadata
> attached. Since patching functions in/out would be a relatively rare
> operation, this would be a highly efficient read-mostly hash.

We could build that hash when registering the tracer. Via recordmcount we
have advance knowledge of all the functions that may be traced.

Ingo

2009-01-16 16:22:41

by Steven Rostedt

Subject: Re: [PATCH 0/2] ftrace: updates to tip


On Fri, 16 Jan 2009, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:

> > [...] We can register a function to be traced via the function pointer,
> > and we can pick which functions to trace, but we can not separate out
> > different functions for different traces.
>
> Why not? We could have a hash of all traced functions with metadata
> attached. Since patching functions in/out would be a relatively rare
> operation, this would be a highly efficient read-mostly hash.

I was afraid you would say that :-)

Yes, we could add something to do this. It would take some thought on a
good API and implementation. This is not an easy task by any means. We
need a way to map between a function call and a tracer, where a function
call can call more than one tracer.

It would probably be best for the tracer to have its own hash. That way
the tracer could do a lookup to determine whether or not it should record.
When filtering on a function, every registered function that gets called
is traced. For instance, the stack_tracer hooks into all functions to test
the size of the stack. When the function tracer is hooked in, it does not
notice that a stack tracer is there.

But we could design it so that, with a "function[:tracer]" API to
set_ftrace_filter, specifying a ":tracer" would only affect those
functions. In fact, we could have those functions be the only ones that
call multiple tracers. Right now, if only one function is registered, the
dynamic tracer will have mcount call that function directly. If you
register a second function (in the case of the stack tracer and function
tracer running simultaneously), it changes mcount to call an internal
function that calls each registered function. With a new design, we could
have some functions call a single callback directly, and others call the
list. We could have the list check the hash to determine which function to
call, but that may be difficult.
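The one-trampoline constraint described above can be modeled in a few lines of userspace pseudo-code (an analogy only; the real mcount stub is per-arch assembly, and these names are illustrative):

```python
# Every traced function funnels through one entry point ("mcount").
# With one registered callback it is called directly; with several,
# an internal list-walker is used instead -- as described above.
registered = []

def list_walker(ip):
    # The internal function that calls each registered tracer callback.
    for cb in registered:
        cb(ip)

def mcount(ip):
    # One dispatch target shared by *all* traced call sites.
    if len(registered) == 1:
        registered[0](ip)   # fast path: direct call
    else:
        list_walker(ip)     # slow path: iterate the list

calls = []
registered.append(lambda ip: calls.append(("function_tracer", ip)))
registered.append(lambda ip: calls.append(("stack_tracer", ip)))
mcount(0x1234)
# 'calls' now holds one entry per registered tracer for this one call site
```

The point of the model: any per-function conditional has to live inside `mcount`'s shared path, which is why it costs every traced function, not just the filtered ones.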

-- Steve

2009-01-16 16:30:47

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Steven Rostedt <[email protected]> wrote:

> On Fri, 16 Jan 2009, Ingo Molnar wrote:
> > * Steven Rostedt <[email protected]> wrote:
>
> > > [...] We can register a function to be traced via the function pointer,
> > > and we can pick which functions to trace, but we can not separate out
> > > different functions for different traces.
> >
> > Why not? We could have a hash of all traced functions with metadata
> > attached. Since patching functions in/out would be a relatively rare
> > operation, this would be a highly efficient read-mostly hash.
>
> I was afraid you would say that :-)
>
> Yes, we could add something to do this. It would take some thought on a
> good api, and implementation. This is not an easy task by any means. We
> need a way to map between a function call and a tracer, where a function
> call can call more than one tracer.

Note that some other things could be consolidated under per-function
metadata: for example set_graph_function - which really is a per-function
attribute too. Plus a lot of new things would be enabled, I think.

Ingo

2009-01-16 21:48:58

by Chris Mason

Subject: Re: [PATCH 0/2] ftrace: updates to tip

On Thu, 2009-01-15 at 19:40 -0500, Steven Rostedt wrote:
> Ingo,
>
> The first patch is critical, and needs to stay with trace_output.c
> Not that critical since trace_output.c is not in mainline yet.
>
> The second patch gives the ability to stack trace functions.
> I've been leery about adding this and still keep it a separate
> option from the "stacktrace" that already exists. This is because
> when enabled with no filtering, the lag between typing and seeing
> what is typed can be up to 10 seconds or more.
>

I mostly asked for this because I often try to find the most common
reason a given function is called, and oprofile isn't always a great way
to catch it. systemtap can do it too if you can get systemtap to work
against your current devel kernel, but there are limits on how much
memory it'll use.

I've attached some simple python code that parses the output of the
function stack tracer, it makes some dumb assumptions about the format
but isn't a bad proof of concept. The first such assumption is that
you're only filtering on a single function.
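The attached script isn't reproduced here, but its aggregation step can be sketched like this (the "<=" line format is assumed from the output below; the real script may differ):

```python
from collections import Counter

def histogram(lines):
    """Count identical call-stack blocks in function-stack-tracer output.

    Each stack is assumed to be a run of consecutive '<= caller' lines,
    terminated by any other line.
    """
    counts = Counter()
    stack = []
    for line in lines:
        line = line.strip()
        if line.startswith("<="):
            stack.append(line[2:].strip())
        elif stack:
            counts[tuple(stack)] += 1
            stack = []
    if stack:                      # flush a trailing stack
        counts[tuple(stack)] += 1
    return counts

sample = [
    "<= try_to_wake_up",
    "<= wake_up_process",
    "",
    "<= try_to_wake_up",
    "<= wake_up_process",
]
# histogram(sample)[("try_to_wake_up", "wake_up_process")] == 2
```

Sorting `histogram(...).most_common()` then yields exactly the "N hits:" ranking shown below.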

Here is some example output, trying to find the most common causes of
native_smp_send_reschedule() during a btrfs dbench run.

It relates to the Oracle OLTP thread because oracle heavily uses IPC
semaphores to trigger wakeups of processes as various events finish.
I'd bet that try_to_wake_up is the most common cause of the reschedule
calls there as well.

For btrfs, the btree lock mutexes come back into the profile yet again.
It would be interesting to change the spinning mutex code to look for
spinners and skip the wakeup on unlock, but that's a different thread
entirely.

The short version is: thanks Steve, this is really cool!

12058 hits:
<= check_preempt_wakeup
<= try_to_wake_up
<= wake_up_process
<= __mutex_unlock_slowpath
<= mutex_unlock
<= btrfs_tree_unlock
<= unlock_up
===========

11678 hits:
<= check_preempt_curr_idle
<= try_to_wake_up
<= wake_up_process
<= __mutex_unlock_slowpath
<= mutex_unlock
<= btrfs_tree_unlock
<= unlock_up
===========

1569 hits:
<= check_preempt_wakeup
<= try_to_wake_up
<= default_wake_function
<= autoremove_wake_function
<= wake_bit_function
<= __wake_up_common
<= __wake_up
===========

1505 hits:
<= check_preempt_curr_idle
<= try_to_wake_up
<= default_wake_function
<= autoremove_wake_function
<= wake_bit_function
<= __wake_up_common
<= __wake_up
===========

1377 hits: # the output isn't perfectly parsable, or I'm dumb
<= 0
<= 0
<= 0
<= 0
<= 0
<= 0
<= 0
===========

1194 hits:
<= check_preempt_curr_idle
<= try_to_wake_up
<= wake_up_process
<= btrfs_queue_worker
<= schedule_bio
<= btrfs_map_bio
<= __btrfs_submit_bio_done
===========

1108 hits:
<= check_preempt_curr_idle
<= try_to_wake_up
<= default_wake_function
<= autoremove_wake_function
<= __wake_up_common
<= __wake_up
<= cache_drop_leaf_ref
===========

935 hits:
<= check_preempt_wakeup
<= try_to_wake_up
<= wake_up_process
<= btrfs_queue_worker
<= schedule_bio
<= btrfs_map_bio
<= __btrfs_submit_bio_done
===========

856 hits:
<= check_preempt_curr_idle
<= try_to_wake_up
<= wake_up_process
<= btrfs_queue_worker
<= btrfs_wq_submit_bio
<= btrfs_submit_bio_hook
<= submit_one_bio
===========

502 hits:
<= check_preempt_wakeup
<= try_to_wake_up
<= wake_up_process
<= btrfs_queue_worker
<= btrfs_wq_submit_bio
<= btrfs_submit_bio_hook
<= submit_one_bio
===========

442 hits:
<= check_preempt_curr_idle
<= try_to_wake_up
<= wake_up_process
<= __mutex_unlock_slowpath
<= mutex_unlock
<= btrfs_tree_unlock
<= btrfs_release_path
===========

327 hits:
<= check_preempt_wakeup
<= try_to_wake_up
<= default_wake_function
<= autoremove_wake_function
<= __wake_up_common
<= __wake_up
<= cache_drop_leaf_ref
===========


... output continues for a bit



Attachments:
trace-histogram (1.60 kB)

2009-01-16 22:26:24

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Chris Mason <[email protected]> wrote:

> On Thu, 2009-01-15 at 19:40 -0500, Steven Rostedt wrote:
> > Ingo,
> >
> > The first patch is critical, and needs to stay with trace_output.c
> > Not that critical since trace_output.c is not in mainline yet.
> >
> > The second patch gives the ability to stack trace functions.
> > I've been leery about adding this and still keep it a separate
> > option from the "stacktrace" that already exists. This is because
> > when enabled with no filtering, the lag between typing and seeing
> > what is typed can be up to 10 seconds or more.
> >
>
> I mostly asked for this because I often try to find the most common
> reason a given function is called, and oprofile isn't always a great way
> to catch it. systemtap can do it too if you can get systemtap to work
> against your current devel kernel, but there are limits on how much
> memory it'll use.
>
> I've attached some simple python code that parses the output of the
> function stack tracer, it makes some dumb assumptions about the format
> but isn't a bad proof of concept. The first such assumption is that
> you're only filtering on a single function.
>
> Here is some example output, trying to find the most common causes of
> native_smp_send_reschedule() during a btrfs dbench run.
>
> It relates to the Oracle OLTP thread because oracle heavily uses IPC
> semaphores to trigger wakeups of processes as various events finish.
> I'd bet that try_to_wake_up is the most common cause of the reschedule
> calls there as well.
>
> For btrfs, the btree lock mutexes come back into the profile yet again.
> It would be interesting to change the spinning mutex code to look for
> spinners and skip the wakeup on unlock, but that's a different thread
> entirely.
>
> The short version is: thanks Steve, this is really cool!
>
> 12058 hits:
> <= check_preempt_wakeup
> <= try_to_wake_up
> <= wake_up_process
> <= __mutex_unlock_slowpath
> <= mutex_unlock
> <= btrfs_tree_unlock
> <= unlock_up
> ===========

Cool! We've got scripts/tracing/ [with one Python script in it already] -
so if this is tidied up to be generally useful we could put it there.

The other thing is that there's the statistics framework of ftrace, being
worked on by Frederic and Steve. That tries to handle and provide
higher-order summaries/"views" of plain traces, like histograms and counts
- provided by the kernel.

Maybe the above type of multi-dimensional-stack-trace based histogram
could fit into the statistics framework too?

Ingo

2009-01-16 23:00:24

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Ingo Molnar <[email protected]> wrote:

> > I was afraid you would say that :-)
> >
> > Yes, we could add something to do this. It would take some thought on
> > a good api, and implementation. This is not an easy task by any means.
> > We need a way to map between a function call and a tracer, where a
> > function call can call more than one tracer.
>
> Note that some other things could be consolidated under per function
> metadata: for example set_graph_function - which really is a per
> function attribute too. Plus a lot of new things would be enabled i
> think.

a few other things that could be per-function attributes:

- Using function trace 'limits': trace a function 50 times and don't trace
it afterwards. Each invocation of that function decreases the
remaining-count by one. For example:

echo 'btrfs_join_transaction:#2' >> set_ftrace_filter

Would specify that we generate two trace entries of
btrfs_join_transaction(), then stop tracing this function.

- Using function-triggered tracing: a function could be specified (via a
filter format extension) to act as a 'start tracing' trigger. Another
extension would be 'stop tracing' trigger.

For example:

echo 'btrfs_join_transaction:+' >> set_ftrace_filter
echo 'btrfs_commit_transaction:-' >> set_ftrace_filter

The '+' is a start-tracing trigger condition, the '-' is a stop-tracing
trigger condition. All function calls between btrfs_join_transaction()
and btrfs_commit_transaction() would be traced.

The two could be combined - to generate the trace of a single btrfs
transaction, one could do:

echo 0 > tracing_enabled
echo 'btrfs_join_transaction:+#1' >> set_ftrace_filter
echo 'btrfs_commit_transaction:-#1' >> set_ftrace_filter
echo 1 > tracing_enabled

Other extensions are possible too:

- Trace length triggers. For example one could do:

echo 'btrfs_join_transaction:+*#10' >> set_ftrace_filter

To trace 10 function calls [allowed by current filter settings] after
the first btrfs_join_transaction() call - and stop tracing after those
10 trace entries.

This would allow the creation of "surgical" one-time traces - of events
and functions one is specifically interested in.
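Since these are only proposed extensions, a toy parser makes the grammar concrete (the grammar below is a guess assembled from the examples above, not an implemented kernel interface):

```python
import re

# Toy grammar for the proposed entries: NAME[:[+|-][*][#COUNT]]
#   '+' / '-'  -> start/stop-tracing trigger
#   '*'        -> trace-length trigger
#   '#N'       -> remaining-count limit
ENTRY_RE = re.compile(
    r"^(?P<name>\w+)"
    r"(?::(?P<trig>[+-])?(?P<span>\*)?(?:#(?P<count>\d+))?)?$"
)

def parse_entry(entry):
    """Decompose one proposed set_ftrace_filter entry into its parts."""
    m = ENTRY_RE.match(entry)
    if not m:
        raise ValueError(f"unparsable filter entry: {entry}")
    return {
        "name": m.group("name"),
        "trigger": {"+": "start", "-": "stop", None: None}[m.group("trig")],
        "length_trigger": m.group("span") is not None,
        "count": int(m.group("count")) if m.group("count") else None,
    }

parse_entry("btrfs_join_transaction:+#1")
# -> {'name': 'btrfs_join_transaction', 'trigger': 'start',
#     'length_trigger': False, 'count': 1}
```

Keeping the grammar this regular would let all of the attributes above land in the same per-function metadata record rather than in separate interfaces.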

Ingo

2009-01-17 01:14:39

by Steven Rostedt

Subject: Re: [PATCH 0/2] ftrace: updates to tip


On Fri, 16 Jan 2009, Ingo Molnar wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > > I was afraid you would say that :-)
> > >
> > > Yes, we could add something to do this. It would take some thought on
> > > a good api, and implementation. This is not an easy task by any means.
> > > We need a way to map between a function call and a tracer, where a
> > > function call can call more than one tracer.
> >
> > Note that some other things could be consolidated under per function
> > metadata: for example set_graph_function - which really is a per
> > function attribute too. Plus a lot of new things would be enabled i
> > think.
>
> a few other things that could be per-function attributes:
>
> - Using function trace 'limits': trace a function 50 times and don't trace
> it afterwards. Each invocation of that function decreases the
> remaining-count by one. For example:
>
> echo 'btrfs_join_transaction:#2' >> set_ftrace_filter
>
> Would specify that we generate two trace entries of
> btrfs_join_transaction(), then stop tracing this function.
>
> - Using function-triggered tracing: a function could be specified (via a
> filter format extension) to act as a 'start tracing' trigger. Another
> extension would be 'stop tracing' trigger.
>
> For example:
>
> echo 'btrfs_join_transaction:+' >> set_ftrace_filter
> echo 'btrfs_commit_transaction:-' >> set_ftrace_filter
>
> The '+' is a start-tracing trigger condition, the '-' is a stop-tracing
> trigger condition. All function calls between btrfs_join_transaction()
> and btrfs_commit_transaction() would be traced.
>
> The two could be combined - to generate the trace of a single btrfs
> transaction, one could do:
>
> echo 0 > tracing_enabled
> echo 'btrfs_join_transaction:+#1' >> set_ftrace_filter
> echo 'btrfs_commit_transaction:-#1' >> set_ftrace_filter
> echo 1 > tracing_enabled
>
> Other extensions are possible too:
>
> - Trace length triggers. For example one could do:
>
> echo 'btrfs_join_transaction:+*#10' >> set_ftrace_filter
>
> To trace 10 function calls [allowed by current filter settings] after
> the first btrfs_join_transaction() call - and stop tracing after those
> 10 trace entries.
>
> This would allow the creation of "surgical" one-time traces - of events
> and functions one is specifically interested in.

This all sounds great, but it would add a lot of conditionals into a
extremely hot function tracer path. Remember, we are not modifying the
calls to mcount to call a registered function directly. All functions
being traced must call the same function. The reason is that mcount is not
a normal function in C. It does not behave the same as other functions,
and must be implemented in assembly (as you already know ;-) The dynamic
tracer calls into a trampoline that calls the registered function. There
is only one trampoline, so only one function gets called on a trace. We
can at most enable or disable a given function. We can not change what
that function calls (without implementing it for every arch).

This means any conditional that you make, must be made for all traced
functions. And this will put an overhead onto the system.

Now we can register multiple functions to be called by a traced function,
or pick and choose which function will be called by a traced function
depending on what option was asked for. But all traced functions will
still call the same code. We can start experimenting, but I would be more
keen on seeing how this will be used by developers. Chris gave a great
example of how he would use his feature. Doing what you ask would require
a rewrite of most of the code. I hate to do that and have only 2 or 3
people using it :-/

-- Steve

2009-01-17 01:31:18

by Chris Mason

Subject: Re: [PATCH 0/2] ftrace: updates to tip

On Fri, 2009-01-16 at 23:26 +0100, Ingo Molnar wrote:
> * Chris Mason <[email protected]> wrote:

> >
> > The short version is: thanks Steve, this is really cool!
> >
> > 12058 hits:
> > <= check_preempt_wakeup
> > <= try_to_wake_up
> > <= wake_up_process
> > <= __mutex_unlock_slowpath
> > <= mutex_unlock
> > <= btrfs_tree_unlock
> > <= unlock_up
> > ===========
>
> Cool! We've got scripts/tracing/ [with one Python script in it already] -
> so if this is tidied up to be generally useful we could put it there.
>

For it to be generally useful I think it would need a way to pull out
the marker for which function is the start of the stack trace. Some
parse-friendly tags would be great.

Other than that I'd add some magic to actually start the tracing for a
given function. Are there other things people would like it to do?

> The other thing is that there's the statistics framework of ftrace, being
> worked on by Frederic and Steve. That tries to handle and provide
> higher-order summaries/"views" of plain traces, like histograms and counts
> - provided by the kernel.

Ok, I tried to keep it simple because I was hoping others had plans as
well ;)

>
> Maybe the above type of multi-dimensional-stack-trace based histogram
> could fit into the statistics framework too?

I'd hope so, just a question of how to keep it clear and usable.

-chris

2009-01-17 13:48:46

by Frederic Weisbecker

Subject: Re: [PATCH 0/2] ftrace: updates to tip

On Fri, Jan 16, 2009 at 08:14:25PM -0500, Steven Rostedt wrote:
>
> On Fri, 16 Jan 2009, Ingo Molnar wrote:
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > > I was afraid you would say that :-)
> > > >
> > > > Yes, we could add something to do this. It would take some thought on
> > > > a good api, and implementation. This is not an easy task by any means.
> > > > We need a way to map between a function call and a tracer, where a
> > > > function call can call more than one tracer.
> > >
> > > Note that some other things could be consolidated under per function
> > > metadata: for example set_graph_function - which really is a per
> > > function attribute too. Plus a lot of new things would be enabled i
> > > think.
> >
> > a few other things that could be per-function attributes:
> >
> > - Using function trace 'limits': trace a function 50 times and don't trace
> > it afterwards. Each invocation of that function decreases the
> > remaining-count by one. For example:
> >
> > echo 'btrfs_join_transaction:#2' >> set_ftrace_filter
> >
> > Would specify that we generate two trace entries of
> > btrfs_join_transaction(), then stop tracing this function.
> >
> > - Using function-triggered tracing: a function could be specified (via a
> > filter format extension) to act as a 'start tracing' trigger. Another
> > extension would be 'stop tracing' trigger.
> >
> > For example:
> >
> > echo 'btrfs_join_transaction:+' >> set_ftrace_filter
> > echo 'btrfs_commit_transaction:-' >> set_ftrace_filter
> >
> > The '+' is a start-tracing trigger condition, the '-' is a stop-tracing
> > trigger condition. All function calls between btrfs_join_transaction()
> > and btrfs_commit_transaction() would be traced.
> >
> > The two could be combined - to generate the trace of a single btrfs
> > transaction, one could do:
> >
> > echo 0 > tracing_enabled
> > echo 'btrfs_join_transaction:+#1' >> set_ftrace_filter
> > echo 'btrfs_commit_transaction:-#1' >> set_ftrace_filter
> > echo 1 > tracing_enabled
> >
> > Other extensions are possible too:
> >
> > - Trace length triggers. For example one could do:
> >
> > echo 'btrfs_join_transaction:+*#10' >> set_ftrace_filter
> >
> > To trace 10 function calls [allowed by current filter settings] after
> > the first btrfs_join_transaction() call - and stop tracing after those
> > 10 trace entries.
> >
> > This would allow the creation of "surgical" one-time traces - of events
> > and functions one is specifically interested in.
>
> This all sounds great, but it would add a lot of conditionals into an
> extremely hot function tracer path. Remember, we are not modifying the
> calls to mcount to call a registered function directly. All functions
> being traced must call the same function. The reason is that mcount is not
> a normal function in C. It does not behave the same as other functions,
> and must be implemented in assembly (as you already know ;-) The dynamic
> tracer calls into a trampoline that calls the registered function. There
> is only one trampoline, so only one function gets called on a trace. We
> can at most enable or disable a given function. We can not change what
> that function calls (without implementing it for every arch).
>
> This means any conditional that you make must be made for all traced
> functions. And this will put an overhead onto the system.
>
> Now we can register multiple functions to be called by a traced function,
> or pick and choose what function will be called by a traced function
> depending on what option was asked for. But all traced functions will
> still call the same code.


We could use a kind of ftrace_filter thing which would be a list of callbacks
to call, depending on the options set.

This would add only one condition to check per traced function in the best case.
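The callback-list idea can be sketched in plain userspace C. All names here (`ftrace_callback`, `register_callback`, `dispatch`) are invented for illustration; the kernel's real ftrace plumbing differs, but the shape is the same: every traced function funnels into one dispatcher, and each option lives in its own callback rather than as another conditional on the shared hot path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature, loosely modeled on what a function tracer
 * callback receives: the traced function and its call site. */
typedef void (*ftrace_callback)(unsigned long ip, unsigned long parent_ip);

#define MAX_CALLBACKS 8

static ftrace_callback callbacks[MAX_CALLBACKS];
static int nr_callbacks;

/* Register a callback; returns 0 on success, -1 if the table is full. */
int register_callback(ftrace_callback cb)
{
	if (nr_callbacks >= MAX_CALLBACKS)
		return -1;
	callbacks[nr_callbacks++] = cb;
	return 0;
}

/* The one function every traced function funnels into: walk the list,
 * so each option's handler lives in its own callback instead of adding
 * conditionals to the shared hot path. */
void dispatch(unsigned long ip, unsigned long parent_ip)
{
	int i;

	for (i = 0; i < nr_callbacks; i++)
		callbacks[i](ip, parent_ip);
}

/* Demo callback: just counts how many events it saw. */
static int demo_hits;
static void count_cb(unsigned long ip, unsigned long parent_ip)
{
	(void)ip; (void)parent_ip;
	demo_hits++;
}
```

With a single registered callback the loop costs one compare per traced call, which is the "one condition in the best case" mentioned above.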



> We can start experimenting, but I would
> be more keen on how this will be used by developers. Chris gave a great
> example of how he would use his feature. Doing what you ask would require
> a rewrite of most of the code. I hate to do that and have only 2 or 3 people
> using it :-/
>
> -- Steve
>
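For illustration only, here is how the filter extensions Ingo sketched above (':#N', ':+', ':-', '*') might be parsed in userspace C. The grammar and the struct fields are invented; none of this is real set_ftrace_filter syntax.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parsed form of one hypothetical entry such as
 * "btrfs_join_transaction:+#1". Field names are made up for this
 * sketch and are not kernel ABI. */
struct filter_entry {
	char func[64];	/* function name */
	int  start;	/* ':+' start-tracing trigger */
	int  stop;	/* ':-' stop-tracing trigger */
	int  span;	/* '*'  count subsequent calls, not just this one */
	long limit;	/* '#N' remaining-count, -1 = unlimited */
};

/* Parse "name[:[+|-|*]...[#N]]"; 0 on success, -1 on malformed input. */
int parse_filter(const char *spec, struct filter_entry *e)
{
	const char *colon = strchr(spec, ':');
	size_t len = colon ? (size_t)(colon - spec) : strlen(spec);
	const char *p;

	if (len == 0 || len >= sizeof(e->func))
		return -1;
	memcpy(e->func, spec, len);
	e->func[len] = '\0';
	e->start = e->stop = e->span = 0;
	e->limit = -1;

	if (!colon)
		return 0;
	for (p = colon + 1; *p; p++) {
		if (*p == '+')
			e->start = 1;
		else if (*p == '-')
			e->stop = 1;
		else if (*p == '*')
			e->span = 1;
		else if (*p == '#') {
			e->limit = strtol(p + 1, NULL, 10);
			return 0;
		} else
			return -1;
	}
	return 0;
}
```

Parsing happens once at `echo`-time, so the cost question Steve raises is entirely about what the per-call hot path then has to check, not about the parser.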

2009-01-17 22:35:31

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Steven Rostedt <[email protected]> wrote:

> > This would allow the creation of "surgical" one-time traces - of
> > events and functions one is specifically interested in.
>
> This all sounds great, but it would add a lot of conditionals into an
> extremely hot function tracer path. [...]

Why would we need more than exactly one unlikely() condition to handle all
of this?

Ingo

2009-01-18 07:28:20

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Ingo Molnar <[email protected]> wrote:

>
> * Steven Rostedt <[email protected]> wrote:
>
> > > This would allow the creation of "surgical" one-time traces - of
> > > events and functions one is specifically interested in.
> >
> > This all sounds great, but it would add a lot of conditionals into an
> > extremely hot function tracer path. [...]
>
> Why would we need more than exactly one unlikely() condition to handle
> all of this?

We could even do it with zero branches, burdening none of the hotpaths of
existing functionality. Hm?

Ingo
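One way to read the "zero branches" remark: rather than testing options inside the hot path, repoint the function the trampoline calls when an option is toggled. A userspace sketch of that pointer-swap idea, with hypothetical names standing in for the real ftrace internals:

```c
#include <assert.h>

typedef void (*trace_fn)(unsigned long ip, unsigned long parent_ip);

/* Two hypothetical handlers: one plain, one that would also capture a
 * stack trace. The counters stand in for real tracing work. */
static int plain_hits, stack_hits;

static void trace_plain(unsigned long ip, unsigned long parent_ip)
{
	(void)ip; (void)parent_ip;
	plain_hits++;
}

static void trace_with_stack(unsigned long ip, unsigned long parent_ip)
{
	(void)ip; (void)parent_ip;
	stack_hits++;	/* a real tracer would save the stack here */
}

/* The trampoline calls through one pointer; toggling an option swaps
 * the pointer once, so the per-call path carries no extra branch. */
static trace_fn active_tracer = trace_plain;

void set_stack_option(int on)
{
	active_tracer = on ? trace_with_stack : trace_plain;
}

void trampoline(unsigned long ip, unsigned long parent_ip)
{
	active_tracer(ip, parent_ip);
}
```

The indirect call replaces the conditional entirely; the trade-off is that every option combination needs its own handler (or handler chain) prepared at toggle time.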

2009-01-19 13:26:11

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH 0/2] ftrace: updates to tip

On Fri, Jan 16, 2009 at 11:26:00PM +0100, Ingo Molnar wrote:
>
> * Chris Mason <[email protected]> wrote:
>
> > On Thu, 2009-01-15 at 19:40 -0500, Steven Rostedt wrote:
> > > Ingo,
> > >
> > > The first patch is critical, and needs to stay with trace_output.c
> > > Not that critical since trace_output.c is not in mainline yet.
> > >
> > > The second patch gives the ability to stack trace functions.
> > > I've been leery about adding this and still keep it a separate
> > > option from the "stacktrace" that already exists. This is because
> > > when enabled with no filtering, the lag between typing and seeing
> > > what is typed can be up to 10 seconds or more.
> > >
> >
> > I mostly asked for this because I often try to find the most common
> > reason a given function is called, and oprofile isn't always a great way
> > to catch it. systemtap can do it too if you can get systemtap to work
> > against your current devel kernel, but there are limits on how much
> > memory it'll use.
> >
> > I've attached some simple python code that parses the output of the
> > function stack tracer; it makes some dumb assumptions about the format
> > but isn't a bad proof of concept. The first such assumption is that
> > you're only filtering on a single function.
> >
> > Here is some example output, trying to find the most common causes of
> > native_smp_send_reschedule() during a btrfs dbench run.
> >
> > It relates to the Oracle OLTP thread because oracle heavily uses IPC
> > semaphores to trigger wakeups of processes as various events finish.
> > I'd bet that try_to_wake_up is the most common cause of the reschedule
> > calls there as well.
> >
> > For btrfs, the btree lock mutexes come back into the profile yet again.
> > It would be interesting to change the spinning mutex code to look for
> > spinners and skip the wakeup on unlock, but that's a different thread
> > entirely.
> >
> > The short version is: thanks Steve, this is really cool!
> >
> > 12058 hits:
> > <= check_preempt_wakeup
> > <= try_to_wake_up
> > <= wake_up_process
> > <= __mutex_unlock_slowpath
> > <= mutex_unlock
> > <= btrfs_tree_unlock
> > <= unlock_up
> > ===========
>
> Cool! We've got scripts/tracing/ [with one Python script in it already] -
> so if this is tidied up to be generally useful we could put it there.
>
> The other thing is that there's the statistics framework of ftrace, being
> worked on by Frederic and Steve. That tries to handle and provide
> higher-order summaries/"views" of plain traces, like histograms and counts
> - provided by the kernel.

That is what I plan to use to provide in blkftrace something similar (equal?)
to what blktrace produces when it stops a tracing session:

8,0 1 3767 2.180288230 31765 C W 126227253 + 8 [0]
^CCPU0 (8,0):
Reads Queued: 0, 0KiB Writes Queued: 141, 564KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 45, 180KiB
Read depth: 1 Write depth: 2
IO unplugs: 2 Timer unplugs: 0
CPU1 (8,0):
Reads Queued: 2, 8KiB Writes Queued: 679, 2,716KiB
Read Dispatches: 2, 8KiB Write Dispatches: 598, 3,280KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 2, 8KiB Writes Completed: 598, 3,280KiB
Read Merges: 0, 0KiB Write Merges: 177, 708KiB
Read depth: 1 Write depth: 2
IO unplugs: 3 Timer unplugs: 0

Total (8,0):
Reads Queued: 2, 8KiB Writes Queued: 820, 3,280KiB
Read Dispatches: 2, 8KiB Write Dispatches: 598, 3,280KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 2, 8KiB Writes Completed: 598, 3,280KiB
Read Merges: 0, 0KiB Write Merges: 222, 888KiB
IO unplugs: 5 Timer unplugs: 0

Throughput (R/W): 3KiB/s / 1,504KiB/s
Events (8,0): 4,289 entries
Skips: 0 forward (0 - 0.0%)
[root@doppio ~]#

:-)

- Arnaldo
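Chris's attached script is not preserved in this archive. As a rough reconstruction of the idea — aggregating identical backtraces from the function stack tracer's output — the following C sketch assumes each sample is a consecutive run of "<=" caller lines, which matches the summarized output quoted above but is only a guess at the raw format.

```c
#include <assert.h>
#include <string.h>

#define MAX_TRACES 256
#define KEY_LEN 1024

/* One aggregated backtrace: the concatenated "<=" lines and how many
 * times that exact stack was seen. */
struct trace_count {
	char key[KEY_LEN];
	int hits;
};

static struct trace_count table[MAX_TRACES];
static int nr_traces;

static void record(const char *key)
{
	int i;

	for (i = 0; i < nr_traces; i++) {
		if (strcmp(table[i].key, key) == 0) {
			table[i].hits++;
			return;
		}
	}
	if (nr_traces < MAX_TRACES) {
		strncpy(table[nr_traces].key, key, KEY_LEN - 1);
		table[nr_traces].hits = 1;
		nr_traces++;
	}
}

/* Current backtrace being accumulated, one "<=" line at a time. */
static char cur[KEY_LEN];

/* Feed trace output line by line: consecutive lines containing "<="
 * form one backtrace; any other line ends the current one. */
void feed_line(const char *line)
{
	size_t used = strlen(cur);

	if (strstr(line, "<=")) {
		if (used + 1 < KEY_LEN) {
			strncat(cur, line, KEY_LEN - used - 2);
			strcat(cur, "\n");
		}
	} else if (cur[0]) {
		record(cur);
		cur[0] = '\0';
	}
}

/* Flush the final backtrace; return the hit count of the most common. */
int finish_and_max(void)
{
	int i, max = 0;

	if (cur[0]) {
		record(cur);
		cur[0] = '\0';
	}
	for (i = 0; i < nr_traces; i++)
		if (table[i].hits > max)
			max = table[i].hits;
	return max;
}
```

Printing each `table[]` entry sorted by `hits` would yield output in the "12058 hits:" shape shown earlier in the thread.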

2009-01-19 13:33:16

by Ingo Molnar

Subject: Re: [PATCH 0/2] ftrace: updates to tip


* Arnaldo Carvalho de Melo <[email protected]> wrote:

> On Fri, Jan 16, 2009 at 11:26:00PM +0100, Ingo Molnar wrote:
> >
> > * Chris Mason <[email protected]> wrote:
> >
> > > On Thu, 2009-01-15 at 19:40 -0500, Steven Rostedt wrote:
> > > > Ingo,
> > > >
> > > > The first patch is critical, and needs to stay with trace_output.c
> > > > Not that critical since trace_output.c is not in mainline yet.
> > > >
> > > > The second patch gives the ability to stack trace functions.
> > > > I've been leery about adding this and still keep it a separate
> > > > option from the "stacktrace" that already exists. This is because
> > > > when enabled with no filtering, the lag between typing and seeing
> > > > what is typed can be up to 10 seconds or more.
> > > >
> > >
> > > I mostly asked for this because I often try to find the most common
> > > reason a given function is called, and oprofile isn't always a great way
> > > to catch it. systemtap can do it too if you can get systemtap to work
> > > against your current devel kernel, but there are limits on how much
> > > memory it'll use.
> > >
> > > I've attached some simple python code that parses the output of the
> > > function stack tracer; it makes some dumb assumptions about the format
> > > but isn't a bad proof of concept. The first such assumption is that
> > > you're only filtering on a single function.
> > >
> > > Here is some example output, trying to find the most common causes of
> > > native_smp_send_reschedule() during a btrfs dbench run.
> > >
> > > It relates to the Oracle OLTP thread because oracle heavily uses IPC
> > > semaphores to trigger wakeups of processes as various events finish.
> > > I'd bet that try_to_wake_up is the most common cause of the reschedule
> > > calls there as well.
> > >
> > > For btrfs, the btree lock mutexes come back into the profile yet again.
> > > It would be interesting to change the spinning mutex code to look for
> > > spinners and skip the wakeup on unlock, but that's a different thread
> > > entirely.
> > >
> > > The short version is: thanks Steve, this is really cool!
> > >
> > > 12058 hits:
> > > <= check_preempt_wakeup
> > > <= try_to_wake_up
> > > <= wake_up_process
> > > <= __mutex_unlock_slowpath
> > > <= mutex_unlock
> > > <= btrfs_tree_unlock
> > > <= unlock_up
> > > ===========
> >
> > Cool! We've got scripts/tracing/ [with one Python script in it already] -
> > so if this is tidied up to be generally useful we could put it there.
> >
> > The other thing is that there's the statistics framework of ftrace, being
> > worked on by Frederic and Steve. That tries to handle and provide
> > higher-order summaries/"views" of plain traces, like histograms and counts
> > - provided by the kernel.
>
> That is what I plan to use to provide in blkftrace something similar (equal?)
> to what blktrace produces when it stops a tracing session:
>
> 8,0 1 3767 2.180288230 31765 C W 126227253 + 8 [0]
> ^CCPU0 (8,0):
> Reads Queued: 0, 0KiB Writes Queued: 141, 564KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
> Read Merges: 0, 0KiB Write Merges: 45, 180KiB
> Read depth: 1 Write depth: 2
> IO unplugs: 2 Timer unplugs: 0
> CPU1 (8,0):
> Reads Queued: 2, 8KiB Writes Queued: 679, 2,716KiB
> Read Dispatches: 2, 8KiB Write Dispatches: 598, 3,280KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 2, 8KiB Writes Completed: 598, 3,280KiB
> Read Merges: 0, 0KiB Write Merges: 177, 708KiB
> Read depth: 1 Write depth: 2
> IO unplugs: 3 Timer unplugs: 0
>
> Total (8,0):
> Reads Queued: 2, 8KiB Writes Queued: 820, 3,280KiB
> Read Dispatches: 2, 8KiB Write Dispatches: 598, 3,280KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 2, 8KiB Writes Completed: 598, 3,280KiB
> Read Merges: 0, 0KiB Write Merges: 222, 888KiB
> IO unplugs: 5 Timer unplugs: 0
>
> Throughput (R/W): 3KiB/s / 1,504KiB/s
> Events (8,0): 4,289 entries
> Skips: 0 forward (0 - 0.0%)
> [root@doppio ~]#
>
> :-)

Cool :)

Ingo