2011-03-07 13:43:29

by Sam Liao

Subject: [PATCH] Add inverted call graph report support to perf tool

Add a "-r" option to support an inverted butterfly report. In the
inverted report, the call graph starts from the callee's ancestor,
in main->func1->func2 style. Users can use such a view to catch the
system's performance bottleneck and find design problems in the
software, not just some function's poor performance.

The current perf implementation does not make such an inversion easy,
so this fix just inverts the ip and callchain in an ugly style. But I
do think this inverted view helps developers find the performance root
cause in complex software.
---
tools/perf/builtin-report.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index c27e31f..ac2ec0e 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -33,6 +33,7 @@
static char const *input_name = "perf.data";

static bool force, use_tui, use_stdio;
+static bool reverse_call;
static bool hide_unresolved;
static bool dont_use_callchains;

@@ -155,6 +156,41 @@ static int process_sample_event(event_t *event,
struct sample_data *sample,
{
struct addr_location al;
struct perf_event_attr *attr;
+
+ /* reverse call chain data */
+ if (reverse_call && symbol_conf.use_callchain && sample->callchain) {
+ struct ip_callchain *chain;
+ int i, j;
+ u64 tmp_ip;
+ event_t *reverse_event;
+
+ chain = malloc(sizeof(u64) * (sample->callchain->nr + 1));
+ if (!chain) {
+ pr_debug("malloc failed\n");
+ return -1;
+ }
+ reverse_event = malloc(sizeof(event_t));
+ if (!reverse_event) {
+ pr_debug("malloc failed\n");
+ return -1;
+ }
+ memcpy(reverse_event, event, sizeof(event_t));
+
+ chain->nr = sample->callchain->nr;
+ j = sample->callchain->nr;
+ tmp_ip = event->ip.ip;
+ reverse_event->ip.ip = sample->callchain->ips[j-1];
+ chain->ips[j-1] = tmp_ip;
+ for (i = 0, j = sample->callchain->nr - 2; i < j; i++, j--) {
+ chain->ips[i] = sample->callchain->ips[j];
+ chain->ips[j] = sample->callchain->ips[i];
+ }
+
+ sample->callchain = chain;
+ call_chain_reversed = true;
+ event = reverse_event;
+ }

if (event__preprocess_sample(event, session, &al, sample, NULL) < 0) {
fprintf(stderr, "problem processing %d event, skipping it.\n",
@@ -177,6 +213,11 @@ static int process_sample_event(event_t *event,
struct sample_data *sample,
return -1;
}

+ if (reverse_call && call_chain_reversed) {
+ free(sample->callchain);
+ free(event);
+ }
+
return 0;
}

@@ -469,6 +510,8 @@ static const struct option options[] = {
OPT_CALLBACK_DEFAULT('g', "call-graph", NULL, "output_type,min_percent",
"Display callchains using output_type (graph, flat, fractal,
or none) and min percent threshold. "
"Default: fractal,0.5", &parse_callchain_opt, callchain_default_opt),
+ OPT_BOOLEAN('r', "reverse-call", &reverse_call,
+ "reverse call chain report (butterfly view)"),
OPT_STRING('d', "dsos", &symbol_conf.dso_list_str, "dso[,dso...]",
"only consider symbols in these dsos"),
OPT_STRING('C', "comms", &symbol_conf.comm_list_str, "comm[,comm...]",
--
1.7.1


2011-03-07 18:06:28

by Frederic Weisbecker

Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Mon, Mar 07, 2011 at 09:43:27PM +0800, Sam Liao wrote:
> Add "-r" option to support inverted butterfly report, in the
> inverted report, the call graph start from the callee's ancestor,
> like main->func1->func2 style. users can use such view to catch
> system's performance bottleneck, find the software's design
> problem not just some function's poor performance.

Yeah, that can be interesting.

>
> Current pref implementation is not easy to add such inversion, so this
> fix just invert the ip and callchain in an ugly style. But I do think
> this invert
> view help developer to find performance root cause for complex
> software.
> ---
> tools/perf/builtin-report.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 43 insertions(+), 0 deletions(-)
>
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index c27e31f..ac2ec0e 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -33,6 +33,7 @@
> static char const *input_name = "perf.data";
>
> static bool force, use_tui, use_stdio;
> +static bool reverse_call;
> static bool hide_unresolved;
> static bool dont_use_callchains;
>
> @@ -155,6 +156,41 @@ static int process_sample_event(event_t *event,
> struct sample_data *sample,
> {
> struct addr_location al;
> struct perf_event_attr *attr;
> +
> + /* reverse call chain data */
> + if (reverse_call && symbol_conf.use_callchain && sample->callchain) {
> + struct ip_callchain *chain;
> + int i, j;
> + u64 tmp_ip;
> + event_t *reverse_event;
> +
> + chain = malloc(sizeof(u64) * (sample->callchain->nr + 1));
> + if (!chain) {
> + pr_debug("malloc failed\n");
> + return -1;
> + }
> + reverse_event = malloc(sizeof(event_t));
> + if (!reverse_event) {
> + pr_debug("malloc failed\n");
> + return -1;
> + }
> + memcpy(reverse_event, event, sizeof(event_t));
> +
> + chain->nr = sample->callchain->nr;
> + j = sample->callchain->nr;
> + tmp_ip = event->ip.ip;
> + reverse_event->ip.ip = sample->callchain->ips[j-1];
> + chain->ips[j-1] = tmp_ip;
> + for (i = 0, j = sample->callchain->nr - 2; i < j; i++, j--) {
> + chain->ips[i] = sample->callchain->ips[j];
> + chain->ips[j] = sample->callchain->ips[i];
> + }
> +
> + sample->callchain = chain;
> + call_chain_reversed = true;
> + event = reverse_event;
> + }

So, instead of having such temporary copy, could you rather feed the callchain
into the cursor in reverse from perf_session__resolve_callchain() ?

You can keep the common part inside the loop into a seperate helper
but have two different kinds of loops.
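A minimal sketch of that shape, using simplified stand-in types rather than perf's real callchain-cursor API (the `struct cursor` and `cursor_append()` here are hypothetical, purely for illustration):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

struct ip_callchain {
	u64 nr;
	u64 ips[16];
};

/* toy stand-in for perf's callchain cursor */
struct cursor {
	u64 entries[16];
	size_t nr;
};

/* the common per-entry work (resolving the ip, etc.) lives in one helper */
static void cursor_append(struct cursor *c, u64 ip)
{
	c->entries[c->nr++] = ip;
}

/* two different kinds of loops over the same chain, sharing the helper */
static void feed_cursor(struct cursor *c, const struct ip_callchain *chain,
			int reversed)
{
	u64 i;

	c->nr = 0;
	if (reversed) {
		for (i = chain->nr; i > 0; i--)
			cursor_append(c, chain->ips[i - 1]);
	} else {
		for (i = 0; i < chain->nr; i++)
			cursor_append(c, chain->ips[i]);
	}
}
```

No temporary copy of the sample or event is needed; only the iteration order changes.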

>
> if (event__preprocess_sample(event, session, &al, sample, NULL) < 0) {
> fprintf(stderr, "problem processing %d event, skipping it.\n",
> @@ -177,6 +213,11 @@ static int process_sample_event(event_t *event,
> struct sample_data *sample,
> return -1;
> }
>
> + if (reverse_call && call_chain_reversed) {
> + free(sample->callchain);
> + free(event);
> + }
> +
> return 0;
> }
>
> @@ -469,6 +510,8 @@ static const struct option options[] = {
> OPT_CALLBACK_DEFAULT('g', "call-graph", NULL, "output_type,min_percent",
> "Display callchains using output_type (graph, flat, fractal,
> or none) and min percent threshold. "
> "Default: fractal,0.5", &parse_callchain_opt, callchain_default_opt),
> + OPT_BOOLEAN('r', "reverse-call", &reverse_call,
> + "reverse call chain report (butterfly view)"),

What about making it an argument to the exisiting -g option, something
that defines the base of the callchain like "caller" and "callee"

Like "-g graph,0.5,caller".

caller would be what we call here reverse and callee the current and future default.

Does that look sensible?

2011-03-08 08:59:33

by Sam Liao

Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Tue, Mar 8, 2011 at 2:06 AM, Frederic Weisbecker <[email protected]> wrote:
> On Mon, Mar 07, 2011 at 09:43:27PM +0800, Sam Liao wrote:
>> Add "-r" option to support inverted butterfly report, in the
>> inverted report, the call graph start from the callee's ancestor,
>> like main->func1->func2 style. users can use such view to catch
>> system's performance bottleneck, find the software's design
>> problem not just some function's poor performance.
>
> Yeah, that can be interesting.
>
>>
>> Current pref implementation is not easy to add such inversion, so this
>> fix just invert the ip and callchain in an ugly style. But I do think
>> this invert
>> view help developer to find performance root cause for complex
>> software.
>> ---
>>  tools/perf/builtin-report.c |   43 +++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 43 insertions(+), 0 deletions(-)
>>
>> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
>> index c27e31f..ac2ec0e 100644
>> --- a/tools/perf/builtin-report.c
>> +++ b/tools/perf/builtin-report.c
>> @@ -33,6 +33,7 @@
>>  static char          const *input_name = "perf.data";
>>
>>  static bool          force, use_tui, use_stdio;
>> +static bool          reverse_call;
>>  static bool          hide_unresolved;
>>  static bool          dont_use_callchains;
>>
>> @@ -155,6 +156,41 @@ static int process_sample_event(event_t *event,
>> struct sample_data *sample,
>>  {
>>       struct addr_location al;
>>       struct perf_event_attr *attr;
>> +
>> +     /* reverse call chain data */
>> +     if (reverse_call && symbol_conf.use_callchain && sample->callchain) {
>> +             struct ip_callchain *chain;
>> +             int i, j;
>> +             u64 tmp_ip;
>> +             event_t *reverse_event;
>> +
>> +             chain = malloc(sizeof(u64) * (sample->callchain->nr + 1));
>> +             if (!chain) {
>> +                     pr_debug("malloc failed\n");
>> +                     return -1;
>> +             }
>> +             reverse_event = malloc(sizeof(event_t));
>> +             if (!reverse_event) {
>> +                     pr_debug("malloc failed\n");
>> +                     return -1;
>> +             }
>> +             memcpy(reverse_event, event, sizeof(event_t));
>> +
>> +             chain->nr = sample->callchain->nr;
>> +             j = sample->callchain->nr;
>> +             tmp_ip = event->ip.ip;
>> +             reverse_event->ip.ip = sample->callchain->ips[j-1];
>> +             chain->ips[j-1] = tmp_ip;
>> +             for (i = 0, j = sample->callchain->nr - 2; i < j; i++, j--) {
>> +                     chain->ips[i] = sample->callchain->ips[j];
>> +                     chain->ips[j] = sample->callchain->ips[i];
>> +             }
>> +
>> +             sample->callchain = chain;
>> +             call_chain_reversed = true;
>> +             event = reverse_event;
>> +     }
>
> So, instead of having such temporary copy, could you rather feed the callchain
> into the cursor in reverse from perf_session__resolve_callchain() ?
>
> You can keep the common part inside the loop into a seperate helper
> but have two different kinds of loops.

In perf_session__resolve_callchain, only the callchain itself can be
reversed, which means the root of the report will still be the ip of the
event, with a reversed call chain subtree below it. But what is more
useful to the user is to make a "main"-like function the root of the
report, and that means both the ip and the call chain are involved in
the reversal.

Since the ip of the event is resolved in event__preprocess_sample, it is
kind of hard to do such a reversal in a better way.
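The full reversal described here, treating the sampled ip as the deepest frame and the outermost caller as the new report root, can be sketched like this (simplified stand-in types, not perf's actual structures):

```c
#include <assert.h>

typedef unsigned long long u64;

struct sample {
	u64 ip;        /* deepest callee: the sampled instruction */
	u64 nr;        /* number of caller frames */
	u64 ips[16];   /* callers, ordered from callee towards main */
};

/* after reversal, the outermost caller (e.g. main) becomes the new
 * "ip" (the report root) and the old ip becomes the deepest entry */
static void reverse_sample(struct sample *s)
{
	u64 tmp[17];
	u64 i;

	if (!s->nr)
		return;

	tmp[0] = s->ip;                  /* full stack: ip, then callers */
	for (i = 0; i < s->nr; i++)
		tmp[i + 1] = s->ips[i];

	s->ip = tmp[s->nr];              /* outermost caller is the new root */
	for (i = 0; i < s->nr; i++)      /* remaining frames, reversed */
		s->ips[i] = tmp[s->nr - 1 - i];
}
```

For a sample with ip=func2 and callers {func1, main}, the reversed sample has ip=main and chain {func1, func2}, i.e. the main->func1->func2 view the patch description asks for.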

>
>>
>>       if (event__preprocess_sample(event, session, &al, sample, NULL) < 0) {
>>               fprintf(stderr, "problem processing %d event, skipping it.\n",
>> @@ -177,6 +213,11 @@ static int process_sample_event(event_t *event,
>> struct sample_data *sample,
>>               return -1;
>>       }
>>
>> +     if (reverse_call && call_chain_reversed) {
>> +             free(sample->callchain);
>> +             free(event);
>> +     }
>> +
>>       return 0;
>>  }
>>
>> @@ -469,6 +510,8 @@ static const struct option options[] = {
>>       OPT_CALLBACK_DEFAULT('g', "call-graph", NULL, "output_type,min_percent",
>>                    "Display callchains using output_type (graph, flat, fractal,
>> or none) and min percent threshold. "
>>                    "Default: fractal,0.5", &parse_callchain_opt, callchain_default_opt),
>> +     OPT_BOOLEAN('r', "reverse-call", &reverse_call,
>> +                    "reverse call chain report (butterfly view)"),
>
> What about making it an argument to the exisiting -g option, something
> that defines the base of the callchain like "caller" and "callee"
>
> Like "-g graph,0.5,caller".
>
> caller would be what we call here reverse and callee the current and future default.
>
> Does that look sensible?
>
This option looks better.

Thanks,
-Sam

2011-03-10 02:44:06

by Frederic Weisbecker

Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Tue, Mar 08, 2011 at 04:59:30PM +0800, Sam Liao wrote:
> On Tue, Mar 8, 2011 at 2:06 AM, Frederic Weisbecker <[email protected]> wrote:
> > So, instead of having such temporary copy, could you rather feed the callchain
> > into the cursor in reverse from perf_session__resolve_callchain() ?
> >
> > You can keep the common part inside the loop into a seperate helper
> > but have two different kinds of loops.
>
> In perf_session__resolve_callchain, only the callchain itself can be reversed,
> which means the root of report will still be the ip of the event with a reversed
> call chain sub tree. But what is more impressive to user is to make "main" like
> function to be the root of the report, and this means that both the ip
> and call chain is
> involved to the reversion process.
>
> Since the ip of event is resolved in event__preprocess_sample, so it is kind
> hard to do such reversion in a better way.

You are making an interesting point.

My view of this feature was limited to the current per hist area: having
the callchains on top of hists that can be sorted per ip, dso, pid, etc...
like we have today basically. So my view was for this reverse callchain
to show us one callers profiling for each hist entry.

But your idea of turning the callee into the caller would show us a very global
profiling. With reverse callchains it can be a very nice overview of the big picture.

IMO both workflows can be interesting:

1) Have a big reversed callchain overview, with one root per entrypoint. This
is what you wanted.
2) Have a per hist 1), which means a per hist, per entrypoint callchain

1) involves reversing both the callchains and the ip <-> caller, whereas 2)
only involves reversing the callchain.

In order to get both features with maximum flexibility and keep them
extendable, I would suggest decoupling this into two independent parts:

- an option to get reversed callchains. Using the -g option and caller/callee
as a third argument.

- a new "caller" sort entry. What defines a hist entry is a set of sort
entries: dso, symbol, pid, comm, ... That we use with the -s option in perf report.
If you want one hist per entrypoint, we could add a new "caller" sort entry.
Then perf report -s caller will (roughly) produce one hist for main(), one hist
for kernel_thread(), etc...

Hence, someone running "perf report -g fractal,0.5,reversed -s dso" is going to have a
per dso caller-callchain profiling.

Someone running "perf report -g fractal,0.5,reversed -s caller" is going to have that
global caller profiling you wanted. You'll have one hist per entrypoint.

The caller sorting mode may sound a bit limited currently, but think about what happens
when you push forward the entrypoint: if one day we bring a feature to filter the callchains
on some dso address space, we could do a caller callchain profile starting at a shared library
and pinpoint which functions are mostly called in it, so that can be coupled with the dso
sorting mode, etc... So that looks like the right way to go.

Ah and that shouldn't require to overwrite the real ip of an event with the caller.
Better create a new "caller" field on struct hist_entry for that.

Hm?

2011-03-10 06:48:31

by Ingo Molnar

Subject: Re: [PATCH] Add inverted call graph report support to perf tool


* Frederic Weisbecker <[email protected]> wrote:

> But your idea of turning the callee into the caller would show us a very global
> profiling. With reverse callchains it can be a very nice overview of the big
> picture.

Very much agreed - i think that matches the sysprof display structure as well,
right?

This could then be propagated not just into perf report but perf top --tui as well.

Thanks,

Ingo

2011-03-10 14:32:46

by Sam Liao

Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Thu, Mar 10, 2011 at 10:43 AM, Frederic Weisbecker
<[email protected]> wrote:
> On Tue, Mar 08, 2011 at 04:59:30PM +0800, Sam Liao wrote:
>> On Tue, Mar 8, 2011 at 2:06 AM, Frederic Weisbecker <[email protected]> wrote:
>> > So, instead of having such temporary copy, could you rather feed the callchain
>> > into the cursor in reverse from perf_session__resolve_callchain() ?
>> >
>> > You can keep the common part inside the loop into a seperate helper
>> > but have two different kinds of loops.
>>
>> In perf_session__resolve_callchain, only the callchain itself can be reversed,
>> which means the root of report will still be the ip of the event with a reversed
>> call chain sub tree. But what is more impressive to user is to make "main" like
>> function to be the root of the report, and this means that both the ip
>> and call chain is
>> involved to the reversion process.
>>
>> Since the ip of event is resolved in event__preprocess_sample, so it is kind
>> hard to do such reversion in a better way.
>
> You are making an interesting point.
>
> My view of this feature was limited to the current per hist area: having
> the callchains on top of hists that can be sorted per ip, dso, pid, etc...
> like we have today basically. So my view was for this reverse callchain
> to show us one callers profiling for each hist entry.
>
> But your idea of turning the callee into the caller would show us a very global
> profiling. With reverse callchains it can be a very nice overview of the big picture.
>
> IMO both workflow can be interesting:
>
> 1) Have a big reversed callchain overview, with one root per entrypoint. This
> what you wanted.
> 2) Have a per hist 1) which means a per hist per entrypoint callchain
>
> 1) involves reverting both callchains and ip <->caller whereas 2) only involves
> reverting the callchain.

Having both workflows included would be more helpful.
>
> In order to get both features with a maximum flexibility and keep that extendable, I
> would suggest to decouple that in two independant parts:
>
>        - an option to get reversed callchains. Using the -g option and caller/callee
>        as a third argument.
>

This could be easily extended by reversing the callchain symbols as
you mentioned.

>        - a new "caller" sort entry. What defines a hist entry is a set of sort
>        entries: dso, symbol, pid, comm, ... That we use with the -s option in perf report.
>        If you want one hist per entrypoint, we could add a new "caller" sort entry.
>        Then perf report -s caller will (roughly) produce one hist for main(), one hist
>        for kernel_thread(), etc...
>

I'm not sure adding a "caller" sort entry can get things done. To my
limited understanding, "sort" is a way to group events. After we group
all the events under "main" or "kernel_thread", the sub-trees will still
be rooted at ip entry points with reversed call-chain sub-trees, which
seems just the same as the previous workflow. Am I right? If so, we
still have to reverse the ip and callchain.

> Hence, someone running "perf report -g fractal,0.5,reversed -s dso" is going to have a
> per dso caller-callchain profiling.
>
> Someone running "perf report -g fractal,0.5,reversed -s caller" is going to have that
> global caller profiling you wanted. You'll have one hist per entrypoint.
>
> The caller sorting mode may sound a bit limited currently, but think about what happens
> when you push forward the entrypoint, if one day we bring a feature to filter the callchains
> on some dso address space , we could do a caller callchain profile starting on a shared library
> and pinpoint which functions are mostly called on it, so that can be coupled with dso sorting mode,
> etc... So that looks like a right way to go.

This is interesting: we might choose a different caller based on the dso
instead of the process/thread entry point. This also reminds me of one
thing: at present, even with different sort orders, the event trees are
still rooted at the ip of the event. For an event with a call chain like
main->func1->func2->lib_func3->ip, what about expanding this event into
the following events:

- a direct event with ip hit, with the full call chain.
- an indirect event with lib_func3 hit, with the lib_func3->ip call chain
- an indirect event with func2 hit ...
- an indirect event with main hit,

With these indirect events added (we may also need to add a direct count
and an indirect count), we are able to check, for any routine in any dso,
which functions are mostly called from it. Actually, it seems much like
the sysprof that Ingo mentioned.
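As a sketch of that expansion (the `struct expanded_event` and `expand_event()` names are hypothetical, not a proposed perf API), a stack of n frames would fan out into n events, each charged one frame further up the chain:

```c
#include <assert.h>

typedef unsigned long long u64;

struct expanded_event {
	u64 root;      /* function this (in)direct event is charged to */
	int depth;     /* frames below the root; 0 for the direct event */
	int direct;    /* 1 for the original sample, 0 for ancestors */
};

/* chain[0] is the sampled ip, chain[n-1] is the outermost frame
 * (e.g. main); emit one direct event for chain[0] plus one
 * indirect event per ancestor frame */
static int expand_event(const u64 *chain, int n,
			struct expanded_event *out)
{
	int i;

	for (i = 0; i < n; i++) {
		out[i].root = chain[i];
		out[i].depth = i;
		out[i].direct = (i == 0);
	}
	return n;
}
```

Note the cost Frederic raises below: every sample of depth n now produces n entries to process, so the processing work grows with the average chain depth.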

>
> Ah and that shouldn't require to overwrite the real ip of an event with the caller.
> Better create a new "caller" field on struct hist_entry for that.
>
> Hm?
>

Thanks,
-Sam

2011-03-11 10:58:09

by Frederic Weisbecker

Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Thu, Mar 10, 2011 at 07:48:16AM +0100, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > But your idea of turning the callee into the caller would show us a very global
> > profiling. With reverse callchains it can be a very nice overview of the big
> > picture.
>
> Very much agreed - i think that matches the sysprof display structure as well,
> right?

Yeah indeed, IIRC.

> This could then be propagated not just into perf report but perf top --tui as well.

Probably yeah. Are callchains already supported by perf top?

2011-03-11 11:57:10

by Frederic Weisbecker

Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Thu, Mar 10, 2011 at 10:32:43PM +0800, Sam Liao wrote:
> On Thu, Mar 10, 2011 at 10:43 AM, Frederic Weisbecker
> <[email protected]> wrote:
> > On Tue, Mar 08, 2011 at 04:59:30PM +0800, Sam Liao wrote:
> >> On Tue, Mar 8, 2011 at 2:06 AM, Frederic Weisbecker <[email protected]> wrote:
> >> > So, instead of having such temporary copy, could you rather feed the callchain
> >> > into the cursor in reverse from perf_session__resolve_callchain() ?
> >> >
> >> > You can keep the common part inside the loop into a seperate helper
> >> > but have two different kinds of loops.
> >>
> >> In perf_session__resolve_callchain, only the callchain itself can be reversed,
> >> which means the root of report will still be the ip of the event with a reversed
> >> call chain sub tree. But what is more impressive to user is to make "main" like
> >> function to be the root of the report, and this means that both the ip
> >> and call chain is
> >> involved to the reversion process.
> >>
> >> Since the ip of event is resolved in event__preprocess_sample, so it is kind
> >> hard to do such reversion in a better way.
> >
> > You are making an interesting point.
> >
> > My view of this feature was limited to the current per hist area: having
> > the callchains on top of hists that can be sorted per ip, dso, pid, etc...
> > like we have today basically. So my view was for this reverse callchain
> > to show us one callers profiling for each hist entry.
> >
> > But your idea of turning the callee into the caller would show us a very global
> > profiling. With reverse callchains it can be a very nice overview of the big picture.
> >
> > IMO both workflow can be interesting:
> >
> > 1) Have a big reversed callchain overview, with one root per entrypoint. This
> > what you wanted.
> > 2) Have a per hist 1) which means a per hist per entrypoint callchain
> >
> > 1) involves reverting both callchains and ip <->caller whereas 2) only involves
> > reverting the callchain.
>
> Having both workflow included would be more helpful.

That's the point, we should be able to do both. But only 1) is possible with
your initial proposition.

> >
> > In order to get both features with a maximum flexibility and keep that extendable, I
> > would suggest to decouple that in two independant parts:
> >
> >        - an option to get reversed callchains. Using the -g option and caller/callee
> >        as a third argument.
> >
>
> This could be easily extended by reversing the callchain symbols as
> you mentioned.

Yeah. -g caller only requires to iterate the callchain in reverse.

> >        - a new "caller" sort entry. What defines a hist entry is a set of sort
> >        entries: dso, symbol, pid, comm, ... That we use with the -s option in perf report.
> >        If you want one hist per entrypoint, we could add a new "caller" sort entry.
> >        Then perf report -s caller will (roughly) produce one hist for main(), one hist
> >        for kernel_thread(), etc...
> >
>
> I'm not sure adding a "caller" sort entry can get things done. As for
> my limited understanding,
> "sort" is kind way to group events

This is actually _what_ groups events. This defines how hist entries are
built.

If you do "perf report -s sym", events will be grouped by symbol.
Thus if you had thousands of events but all of them only hit sym1 and sym2,
then you'll see two groups in your histogram.

Look:

# ./perf report -s sym --stdio
# Events: 4 cycles
#
# Overhead Symbol
# ........ .................
#
36.72% [.] hex2u64
31.21% [k] __lock_acquire
18.03% [k] lock_acquire
14.04% [k] sub_preempt_count

We may have got a thousand events for the above profile, but only 4 symbols
were hit amongst these thousand events. As we asked, events have been
grouped per target symbol.

Callchains follow this grouping scheme. Below the __lock_acquire hist,
you would only get callchains for which the root (deepest callee) was __lock_acquire.

If you have several groupings, like -s sym,dso,pid,
then it computes an intersection. Events will be grouped when their
sym, dso and pid are equal. Moreover they will be sorted, first dimension
per sym, second dimension per dso, third dimension per pid.

You should play a bit with different combinations to get the whole picture
and how it works.

Callchains still follow the grouping, as elaborated as it can be. For the hist
that has sym1, dso2 and pid 3, you'll find only callchains that start from sym1
for events that happened on dso2 and pid3.
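The grouping can be pictured as a tiny histogram builder (illustrative only; perf's hists code is far more involved and uses rbtrees):

```c
#include <assert.h>
#include <string.h>

#define MAX_HISTS 8

struct hist_entry {
	const char *sym;    /* the sort key; here just the symbol name */
	unsigned long nr;   /* events folded into this entry */
};

/* add one event: find the hist entry with the same key, or create one */
static int hist_add(struct hist_entry *h, int *nr_hists, const char *sym)
{
	int i;

	for (i = 0; i < *nr_hists; i++) {
		if (!strcmp(h[i].sym, sym)) {
			h[i].nr++;
			return i;
		}
	}
	h[*nr_hists].sym = sym;
	h[*nr_hists].nr = 1;
	return (*nr_hists)++;
}
```

With -s sym,dso,pid the key would simply be the tuple (sym, dso, pid) instead of sym alone, and callchains hang off whichever entry their sample folded into.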


> , after we group all the events
> under "main" or "kernel_thread",
> the sub-trees will still rooted as ip entry points with a reversed
> call-chain sub-trees which seems
> just the same as the previous workflow. Am I right? If so, here we
> still have to revert the ip and
> callchain.

No. The callchain will follow that grouping. If you group only per caller
(-s caller) you may have one hist entry for main and another for kernel_thread.
Then below the main entry, you'll have only callchains starting
from main. And below the kernel_thread, only callchains starting from kernel_thread.

It depends if you select reverse callchain or not:

$ perf report -s caller

That will report main and kernel_thread as hists, and regular callee -> caller callchains.
Hence under the main hist, you'll see a lot of callchains starting from random points
and all ending in main!

$ perf report -s caller -g caller

That will report main and kernel_thread as hists, with callchains starting from
main under main.

It becomes interesting when you want more granularity with -s caller,dso if we bring a way
to push forward the entrypoint one day. I suspect even more sorting combinations are
going to be interesting.


> > Hence, someone running "perf report -g fractal,0.5,reversed -s dso" is going to have a
> > per dso caller-callchain profiling.
> >
> > Someone running "perf report -g fractal,0.5,reversed -s caller" is going to have that
> > global caller profiling you wanted. You'll have one hist per entrypoint.
> >
> > The caller sorting mode may sound a bit limited currently, but think about what happens
> > when you push forward the entrypoint, if one day we bring a feature to filter the callchains
> > on some dso address space , we could do a caller callchain profile starting on a shared library
> > and pinpoint which functions are mostly called on it, so that can be coupled with dso sorting mode,
> > etc... So that looks like a right way to go.
>
> This is interesting, we might choose different caller based on dso
> instead of the process/thread entry
> point.

Exactly! That's for later, but cutting things up like I'm suggesting
would pave the way for that.

> This also remind me one thing, at present, though with
> different sort orders, the event trees still
> root in the ip of the event.

Currently yeah.

> For an event with call chain like:
> main->func1->func2->lib_func3->ip, what
> about we expand this event to following events:
>
> - a direct event with ip called, with full call chain.
> - a indirect event with libfunc3 called, with lib_func3->ip call chain
> - a indirect event with func2 called ...
> - a indirect event with main called,
>
> With these indirect event added(we may also need to add direct count
> and indirect count), we are
> able to check any routine in any dso which functions are mostly
> called.

I see your point, and it may be a good idea in theory but I fear
the O(hell) (sorry I suck in math but I figure out the pain)
property is going to prevent this feature from ever being usable.
Regular callchains can already be slow to process on perf.data of
hundreds MB. With such a feature their processing will simply never
finish.

I suspect we should rather opt for filtering callchain dso
domains, like only starting from a given dso or so.


> the sysprof that Ingo mentioned.

sysprof does something like that?

Note, another reason to avoid abusing the event.ip to group
per caller, is that we could be able to:

perf report -s caller,sym -g caller

If we limit callchains to start from a given foo.so, this may
sort hists per caller and then per endpoint.

If your library offers function func1, func2, etc... It will sort
them per usage (func1 has been first used, then func2, etc...)
then per endpoint overhead (func1 most often sticks in strcpy(),
then on read(), etc....).

Right?

That may or may not be useful. I don't know. In fact I don't
want to take the responsibility of judging whether it's useful
or not. Thus I prefer caller and ip to be two different
properties of hist entries, not having one abusing the
other, so that we don't prevent this feature from existing (or many
other sort combinations I haven't imagined).

2011-03-11 12:07:32

by Frederic Weisbecker

Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Fri, Mar 11, 2011 at 12:57:01PM +0100, Frederic Weisbecker wrote:
> On Thu, Mar 10, 2011 at 10:32:43PM +0800, Sam Liao wrote:
> > the sysprof that Ingo mentioned.
>
> sysprof does something like that?
>
> Note, another reason to avoid abusing the event.ip to group
> per caller, is that we could be able to:
>
> perf report -s caller,sym -g caller
>
> If we limit callchains to start from a given foo.so, this may
> sort hists per caller and then per endpoint.
>
> If your library offers function func1, func2, etc... It will sort
> them per usage (func1 has been first used, then func2, etc...)

I meant func1 has been the most used, then came func2, etc...

> then per endpoint overhead (func1 most often sticks in strcpy(),
> then on read(), etc....).
>
> Right?
>
> That may or may not be useful. I don't know. In fact I don't
> want to take the responsibility to judge whether it's useful
> or not. Thus I prefer caller and ip to be two different
> properties of hist entries and not having one absusing the
> other, so that we don't prevent this feature to exist (or many
> other sort combinations I haven't imagined).

2011-03-11 14:45:31

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH] Add inverted call graph report support to perf tool

Em Fri, Mar 11, 2011 at 11:51:22AM +0100, Frederic Weisbecker escreveu:
> On Thu, Mar 10, 2011 at 07:48:16AM +0100, Ingo Molnar wrote:
> > * Frederic Weisbecker <[email protected]> wrote:

> > > But your idea of turning the callee into the caller would show us a very global
> > > profiling. With reverse callchains it can be a very nice overview of the big
> > > picture.

> > Very much agreed - i think that matches the sysprof display structure as well,
> > right?

> Yeah indeed, IIRC.

> > This could then be propagated not just into perf report but perf top --tui as well.

> Probably yeah. Are callchains already supported by perf top?

No, they aren't; the work to merge 'struct hist_entry' (used by report) and 'struct
sym_entry' (used by top) is underway. I want to merge the hists browser
with the top browser; at that point it all becomes just one tool:

start with 'perf top'

press F -> freeze:

it becomes 'perf report'

in both modes press 'A' and you get annotation: live if in top mode, static
if in 'report' mode.

Having this all integrated and easily switched to/from various modes is
what, in my view, makes the TUI (and GUI at some point) compelling.

- Arnaldo

2011-03-12 14:59:12

by Sam Liao

[permalink] [raw]
Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Fri, Mar 11, 2011 at 7:57 PM, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Mar 10, 2011 at 10:32:43PM +0800, Sam Liao wrote:
>> On Thu, Mar 10, 2011 at 10:43 AM, Frederic Weisbecker
>> <[email protected]> wrote:
>> > On Tue, Mar 08, 2011 at 04:59:30PM +0800, Sam Liao wrote:
>> >> On Tue, Mar 8, 2011 at 2:06 AM, Frederic Weisbecker <[email protected]> wrote:
>> >> > So, instead of having such temporary copy, could you rather feed the callchain
>> >> > into the cursor in reverse from perf_session__resolve_callchain() ?
>> >> >
>> >> > You can keep the common part inside the loop into a seperate helper
>> >> > but have two different kinds of loops.
>> >>
>> >> In perf_session__resolve_callchain, only the callchain itself can be reversed,
>> >> which means the root of the report will still be the ip of the event, with a reversed
>> >> call chain subtree. But what is more useful to the user is to make a "main"-like
>> >> function the root of the report, and that means both the ip
>> >> and the call chain are
>> >> involved in the reversal.
>> >>
>> >> Since the ip of the event is resolved in event__preprocess_sample, it is kind of
>> >> hard to do such a reversal in a better way.
>> >
>> > You are making an interesting point.
>> >
>> > My view of this feature was limited to the current per hist area: having
>> > the callchains on top of hists that can be sorted per ip, dso, pid, etc...
>> > like we have today basically. So my view was for this reverse callchain
>> > to show us one callers profiling for each hist entry.
>> >
>> > But your idea of turning the callee into the caller would show us a very global
>> > profiling. With reverse callchains it can be a very nice overview of the big picture.
>> >
>> > IMO both workflows can be interesting:
>> >
>> > 1) Have a big reversed callchain overview, with one root per entrypoint. This
>> > what you wanted.
>> > 2) Have a per hist version of 1), which means a per hist, per entrypoint callchain
>> >
>> > 1) involves reversing both the callchain and the ip <-> caller, whereas 2) only involves
>> > reversing the callchain.
>>
>> Having both workflows included would be more helpful.
>
> That's the point, we should be able to do both. But only 1) is possible with
> your initial proposition.
>
>> >
>> > In order to get both features with maximum flexibility and keep that extendable, I
>> > would suggest decoupling that into two independent parts:
>> >
>> >        - an option to get reversed callchains, using the -g option and caller/callee
>> >        as a third argument.
>> >
>>
>> This could be easily extended by reversing the callchain symbols as
>> you mentioned.
>
> Yeah. -g caller only requires to iterate the callchain in reverse.
>
>> >        - a new "caller" sort entry. What defines a hist entry is a set of sort
>> >        entries: dso, symbol, pid, comm, ... that we use with the -s option in perf report.
>> >        If you want one hist per entrypoint, we could add a new "caller" sort entry.
>> >        Then perf report -s caller will (roughly) produce one hist for main(), one hist
>> >        for kernel_thread(), etc...
>> >
>>
>> I'm not sure adding a "caller" sort entry can get things done. As far as
>> my limited understanding goes,
>> "sort" is a kind of way to group events
>
> This is actually _what_ group events. This defines how hist entries are
> built.
>
> If you do "perf report -s sym", events will be grouped by symbols.
> Thus if you had thousands events but all of them only hit sym1 and sym2
> then you'll see two groups in your histogram.
>
> Look:
>
> # ./perf report -s sym --stdio
> # Events: 4  cycles
> #
> # Overhead ? ? ? ? ? ? Symbol
> # ........ ?.................
> #
>     36.72%  [.] hex2u64
>     31.21%  [k] __lock_acquire
>     18.03%  [k] lock_acquire
>     14.04%  [k] sub_preempt_count
>
> We may have got a thousand events for the above profile. But only 4 symbols
> were hit amongst these thousand events. As we asked for, events have been
> grouped per symbol target.
>
> Callchains follow this grouping scheme. Below the __lock_acquire hist,
> you would only get callchains for which the root (deepest callee) was __lock_acquire.
>
> If you have several groupings, like -s sym,dso,pid
> then it computes an intersection. Events will be grouped when their
> sym, dso and pid are all equal. Moreover they will be sorted, first dimension
> per sym, second dimension per dso, third dimension per pid.
>
> You should play a bit with different combinations to get the whole picture
> and how it works.
>
> Callchains still follow the grouping, as elaborated as it can be. For the hist
> that has sym1, dso2 and pid 3, you'll find only callchains that start from sym1
> for events that happened on dso2 and pid3.
>
>
> , after we group all the events
>> under "main" or "kernel_thread",
>> the sub-trees will still be rooted at ip entry points with reversed
>> call-chain sub-trees, which seems
>> just the same as the previous workflow. Am I right? If so, here we
>> still have to revert the ip and
>> callchain.
>
> No. The callchain will follow that grouping. If you group only per caller
> (-s caller) you may have one hist entry for main and another for kernel_thread.
> Then below the main entry, you'll have only callchains starting
> from main. And below the kernel_thread, only callchains starting from kernel_thread.
>
> It depends if you select reverse callchain or not:
>
> $ perf report -s caller
>
> That will report main and kernel_thread as hists, and regular callee -> caller callchains.
> Hence under the main hist, you'll see a lot of callchains starting from random points, all
> ending in main!
>
> $ perf report -s caller -g caller
>
> That will report main and kernel_thread as hists, with callchains starting from
> main under main.
>
> It becomes interesting when you want more granularity with -s caller,dso if we bring a way
> to push forward the entrypoint one day. I suspect even more sorting combinations are
> going to be interesting.
>

Thanks for the clarification. I'll try to come up with patches along the lines you described.

-Sam

2011-03-17 17:53:47

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH] Add inverted call graph report support to perf tool

On Mon, Mar 07, 2011 at 09:43:27PM +0800, Sam Liao wrote:
> +
> + /* reverse call chain data */
> + if (reverse_call && symbol_conf.use_callchain && sample->callchain) {

You probably want to check for nr > 1 here, as in:

/* reverse call chain data */
- if (reverse_call && symbol_conf.use_callchain && sample->callchain) {
+ if (reverse_call && symbol_conf.use_callchain && sample->callchain
+ && (sample->callchain->nr > 1)) {
struct ip_callchain *chain;
int i, j;
u64 tmp_ip;

Otherwise, if you get a callchain with ->nr == 0, the loop with (nr - 2) will
go on for a long time.

-Arun

2011-05-06 08:54:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Add inverted call graph report support to perf tool


* Sam Liao <[email protected]> wrote:

> > It depends if you select reverse callchain or not:
> >
> > $ perf report -s caller
> >
> > That will report main and kernel_thread as hists, and regular callee -> caller callchains.
> > Hence under the main hist, you'll see a lot of callchains starting from random points, all
> > ending in main!
> >
> > $ perf report -s caller -g caller
> >
> > That will report main and kernel_thread as hists, with callchains starting from
> > main under main.
> >
> > It becomes interesting when you want more granularity with -s caller,dso if we bring a way
> > to push forward the entrypoint one day. I suspect even more sorting combinations are
> > going to be interesting.
> >
>
> Thanks for the clarification. I'll try to come up with patches along the lines you described.

I'm wondering, is there any progress on this feature? Having sysprof-like
reverse ordering for call chains is a really powerful form of visualization,
IMO.

Thanks,

Ingo