LinuxLists.cc - callchain sampling bug in perf?

2010-08-15 22:54:01

Subject: callchain sampling bug in perf?

I'm trying to play with perf record -g, that is sampling the callchains.
Doing this with simple workloads works fine, but when I try to do this
with compilebench a lot of samples seem to get lost.

I'm doing the following:

perf probe --add xlog_sync
perf record -g -e probe:xlog_sync -- ./compilebench

Trying to report it I do not get any callchains at all:

[root@virtlab106 compilebench-0.6]# perf report -g flat -n
# Events: 9K cycles
#
# Overhead Samples Command Shared Object Symbol
# ........ .......... ............ ................. .........
#
70.41% 6757 compilebench [kernel.kallsyms] [k] xlog_sync
22.61% 2170 sync [kernel.kallsyms] [k] xlog_sync
3.89% 373 sh [kernel.kallsyms] [k] xlog_sync
2.50% 240 python [kernel.kallsyms] [k] xlog_sync
0.33% 32 :3881 [kernel.kallsyms] [k] xlog_sync
0.13% 12 :3971 [kernel.kallsyms] [k] xlog_sync
0.11% 11 :3956 [kernel.kallsyms] [k] xlog_sync
0.01% 1 :3972 [kernel.kallsyms] [k] xlog_sync

Doing a perf report -g flat,0.0 -n shows lots of callgraph, but the
percentag for them doesn't add up at all.

2010-08-19 00:57:29

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Sun, Aug 15, 2010 at 06:53:59PM -0400, Christoph Hellwig wrote:
> I'm trying to play with perf record -g, that is sampling the callchains.
> Doing this with simple workloads works fine, but when I try to do this
> with compilebench a lot of samples seem to get lost.
>
> I'm doing the following:
>
> perf probe --add xlog_sync
> perf record -g -e probe:xlog_sync -- ./compilebench
>
> Trying to report it I do not get any callchains at all:
>
> [root@virtlab106 compilebench-0.6]# perf report -g flat -n
> # Events: 9K cycles
> #
> # Overhead Samples Command Shared Object Symbol
> # ........ .......... ............ ................. .........
> #
> 70.41% 6757 compilebench [kernel.kallsyms] [k] xlog_sync
> 22.61% 2170 sync [kernel.kallsyms] [k] xlog_sync
> 3.89% 373 sh [kernel.kallsyms] [k] xlog_sync
> 2.50% 240 python [kernel.kallsyms] [k] xlog_sync
> 0.33% 32 :3881 [kernel.kallsyms] [k] xlog_sync
> 0.13% 12 :3971 [kernel.kallsyms] [k] xlog_sync
> 0.11% 11 :3956 [kernel.kallsyms] [k] xlog_sync
> 0.01% 1 :3972 [kernel.kallsyms] [k] xlog_sync
>
> Doing a perf report -g flat,0.0 -n shows lots of callgraph, but the
> percentag for them doesn't add up at all.

Hmm, against which tree are you running? I've just tested callchains
record on random kprobe based tracepoint and it seemed to work well
on report.

Could you send me your perf.data and also the resulting archive
after the following command: "perf archive" ?

Thanks.

2010-08-19 15:04:37

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

Em Thu, Aug 19, 2010 at 04:57:01AM -0400, Christoph Hellwig escreveu:
> On Thu, Aug 19, 2010 at 02:57:22AM +0200, Frederic Weisbecker wrote:
> > Hmm, against which tree are you running? I've just tested callchains
> > record on random kprobe based tracepoint and it seemed to work well
> > on report.
>
> It's 2.6.36-rc1 + a few filesystem related changes.
>
> > Could you send me your perf.data and also the resulting archive
> > after the following command: "perf archive" ?
>
> Both attached.

[acme@doppio tmp]$ tar xvf perf.data.tar.bz2 -C ~/.debug
.build-id/36/4300f12afee3dd5b1a87269d60de586f6b61ce
[kernel.kallsyms]/364300f12afee3dd5b1a87269d60de586f6b61ce
[acme@doppio tmp]$
[acme@doppio tmp]$ cat /home/acme/.perfconfig
[tui]

report=off
[acme@doppio tmp]$ perf report
# Events: 11K cycles
#
# Overhead Command Shared Object Symbol
# ........ ............ ................. .........
#
61.70% python [kernel.kallsyms] [k] xlog_sync
24.05% flush-9:1 [kernel.kallsyms] [k] xlog_sync
7.99% sync [kernel.kallsyms] [k] xlog_sync
2.94% xfssyncd/md1 [kernel.kallsyms] [k] xlog_sync
1.54% compilebench [kernel.kallsyms] [k] xlog_sync
1.31% sh [kernel.kallsyms] [k] xlog_sync
0.23% :3500 [kernel.kallsyms] [k] xlog_sync
0.17% :3496 [kernel.kallsyms] [k] xlog_sync
0.06% :3570 [kernel.kallsyms] [k] xlog_sync
0.01% :3590 [kernel.kallsyms] [k] xlog_sync

#
# (For a higher level overview, try: perf report --sort comm,dso)
#
[acme@doppio tmp]$

# yeah, I need to do what we discussed and make tui activation on the command
# line, this use case proves it, if it needed proving usefulness at all :-)

[acme@doppio tmp]$ cat /home/acme/.perfconfig
[tui]

report=on

[acme@doppio tmp]$ perf report

http://vger.kernel.org/~acme/perf/perf-report-tui-callchain-xlog_sync.png

So it seems to work (you tell me if the callchains make sense), and the problem
is just on the stdio based callchain rendering.

Frederic, so it doesn't look like anything in the kernel, just a tools/perf/
problem.

I found some issues on the TUI as well, please press the left key some times
after expanding some callchains, so that the window can be resized, your workload
has just a few top level entries, so the window initially is small, I need to
take into account the (to be expanded) callchains and size it properly.

- Arnaldo

2010-08-20 09:16:48

by Christoph Hellwig

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Thu, Aug 19, 2010 at 12:04:22PM -0300, Arnaldo Carvalho de Melo wrote:
>
> [acme@doppio tmp]$ perf report
>
> http://vger.kernel.org/~acme/perf/perf-report-tui-callchain-xlog_sync.png
>
> So it seems to work (you tell me if the callchains make sense), and the problem

I've rebuilt perf with libnewt to reproduce it, but to get any of the
callchain data I need to call perf report with a -g argument (which is
rather expected from the documentation anyway)

I still see the same problems as with the TUI perf report with that.
With the -g {mode},0.0 there is nothing to expand inside the GUI for
e.g. the pythong process, and with the 0.0 threshold I can only expand
a few 0.<something> callchains, but I never see the 80% your screenshot
shows. What perf version are you running?

Also the flat mode is rendered incorrectly, it just adds different call
graphs inside a single process directly after each other instead of
separating them in the rendering.

2010-08-20 19:12:43

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

Em Fri, Aug 20, 2010 at 05:16:45AM -0400, Christoph Hellwig escreveu:
> On Thu, Aug 19, 2010 at 12:04:22PM -0300, Arnaldo Carvalho de Melo wrote:
> >
> > [acme@doppio tmp]$ perf report
> >
> > http://vger.kernel.org/~acme/perf/perf-report-tui-callchain-xlog_sync.png
> >
> > So it seems to work (you tell me if the callchains make sense), and the problem
>
> I've rebuilt perf with libnewt to reproduce it, but to get any of the
> callchain data I need to call perf report with a -g argument (which is
> rather expected from the documentation anyway)

In fact I never run 'perf report' specifying '-g', have you tried it
that way?

> I still see the same problems as with the TUI perf report with that.
> With the -g {mode},0.0 there is nothing to expand inside the GUI for
> e.g. the pythong process, and with the 0.0 threshold I can only expand
> a few 0.<something> callchains, but I never see the 80% your screenshot
> shows. What perf version are you running?

IIRC I tried this on perf/urgent and on perf/core, the former should be what
is Linus tree.

Lemme try again, right, just checked out from torvalds/master and running it
as:

$ perf report

that screenshot is reproduced.

> Also the flat mode is rendered incorrectly, it just adds different call
> graphs inside a single process directly after each other instead of
> separating them in the rendering.

Probably because of what I said above, i.e. I still have to try 'perf
record -g flat' on the TUI, lets see...

Running it as:

$ perf report -g flat,0,0

This is produced:

http://vger.kernel.org/~acme/perf/perf-report-tui-g-flat.png

which has the flat percentages missing, have to fix that but hey, at least I
finally tried it! 8-)

The TUI code started with the default:

[acme@doppio linux-2.6-tip]$ perf report --help
<SNIP>
-g [type,min], --call-graph
Display callchains using type and min percent threshold. type can
be either:

? flat: single column, linear exposure of callchains.

? graph: use a graph tree, displaying absolute overhead rates.

? fractal: like graph, but displays relative rates. Each branch
of the tree is considered as a new profiled object. Default:
^^^^^^^^
fractal,0.5.
^^^^^^^^^^^
<SNIP>

The TUI took longer than I antecipated to get right, requiring writing a
generic tree widget from scratch, not using any libnewt code, as the one newt
has is too cumbersome and inflexible to use :-\

I will work to have the "flat" and "graph" modes working properly in the coming
weeks.

It should work on the stdio mode, IIRC Fr?d?ric told me he found and fixed the
problem there, right?

- Arnaldo

2010-08-21 02:29:51

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Fri, Aug 20, 2010 at 04:12:30PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Aug 20, 2010 at 05:16:45AM -0400, Christoph Hellwig escreveu:
> > On Thu, Aug 19, 2010 at 12:04:22PM -0300, Arnaldo Carvalho de Melo wrote:
> > >
> > > [acme@doppio tmp]$ perf report
> > >
> > > http://vger.kernel.org/~acme/perf/perf-report-tui-callchain-xlog_sync.png
> > >
> > > So it seems to work (you tell me if the callchains make sense), and the problem
> >
> > I've rebuilt perf with libnewt to reproduce it, but to get any of the
> > callchain data I need to call perf report with a -g argument (which is
> > rather expected from the documentation anyway)
>
> In fact I never run 'perf report' specifying '-g', have you tried it
> that way?
>
> > I still see the same problems as with the TUI perf report with that.
> > With the -g {mode},0.0 there is nothing to expand inside the GUI for
> > e.g. the pythong process, and with the 0.0 threshold I can only expand
> > a few 0.<something> callchains, but I never see the 80% your screenshot
> > shows. What perf version are you running?
>
> IIRC I tried this on perf/urgent and on perf/core, the former should be what
> is Linus tree.
>
> Lemme try again, right, just checked out from torvalds/master and running it
> as:
>
> $ perf report
>
> that screenshot is reproduced.
>
> > Also the flat mode is rendered incorrectly, it just adds different call
> > graphs inside a single process directly after each other instead of
> > separating them in the rendering.
>
> Probably because of what I said above, i.e. I still have to try 'perf
> record -g flat' on the TUI, lets see...
>
> Running it as:
>
> $ perf report -g flat,0,0
>
> This is produced:
>
> http://vger.kernel.org/~acme/perf/perf-report-tui-g-flat.png
>
> which has the flat percentages missing, have to fix that but hey, at least I
> finally tried it! 8-)
>
> The TUI code started with the default:
>
> [acme@doppio linux-2.6-tip]$ perf report --help
> <SNIP>
> -g [type,min], --call-graph
> Display callchains using type and min percent threshold. type can
> be either:
>
> ? flat: single column, linear exposure of callchains.
>
> ? graph: use a graph tree, displaying absolute overhead rates.
>
> ? fractal: like graph, but displays relative rates. Each branch
> of the tree is considered as a new profiled object. Default:
> ^^^^^^^^
> fractal,0.5.
> ^^^^^^^^^^^
> <SNIP>
>
> The TUI took longer than I antecipated to get right, requiring writing a
> generic tree widget from scratch, not using any libnewt code, as the one newt
> has is too cumbersome and inflexible to use :-\
>
> I will work to have the "flat" and "graph" modes working properly in the coming
> weeks.

I think graph mode is nice to have in the TUI.
Actually this should be called "absolute fractal", as opposite to default
mode that "relative fractal".

In "graph" mode we show the overhead percentage against the total overhead, while
in default mode "fractal" we show overhead against the parent branch.

But the flat mode is just a dumb dump of every callchain records. It's mostly
useful for callchain debugging I think. May be it's not worth adapting the TUI
for that mode. As you want.

One day I really need to fill the perf wiki about callchains.

> It should work on the stdio mode, IIRC Fr?d?ric told me he found and fixed the
> problem there, right?

Part of it. I've a fix for the "nothing displayed unless 0 min percentage" case.
Now I need to investigate on why callchains percentages sometimes don't add up correctly.

Those are eventually two different problems (more details in my next reply to Christoph).

Thanks.

2010-08-21 02:47:19

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Fri, Aug 20, 2010 at 05:16:45AM -0400, Christoph Hellwig wrote:
> On Thu, Aug 19, 2010 at 12:04:22PM -0300, Arnaldo Carvalho de Melo wrote:
> >
> > [acme@doppio tmp]$ perf report
> >
> > http://vger.kernel.org/~acme/perf/perf-report-tui-callchain-xlog_sync.png
> >
> > So it seems to work (you tell me if the callchains make sense), and the problem
>
> I've rebuilt perf with libnewt to reproduce it, but to get any of the
> callchain data I need to call perf report with a -g argument (which is
> rather expected from the documentation anyway)
>
> I still see the same problems as with the TUI perf report with that.
> With the -g {mode},0.0 there is nothing to expand inside the GUI for
^^
you meant "without", right?

> e.g. the pythong process, and with the 0.0 threshold I can only expand
> a few 0.<something> callchains, but I never see the 80% your screenshot
> shows. What perf version are you running?

Right. So this is due to the roots of the callchains that have uninitialized
values for their children hits.

To summarize, you enter a percentage (or you keep the default 0.5%) and it gets
multiplied by a random number. The result is supposed to be the minimum hits
a branch should have to be displayed. But it often produced a huge number due
to the unintialization randomness. Hence some callchains were not displayed.

The following patch solves that particular problem, I'll push it to Ingo for .36
(and -stable):

diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 624a96c..d1349d0 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -51,6 +51,7 @@ static inline void callchain_init(struct callchain_node *node)
INIT_LIST_HEAD(&node->val);

node->parent = NULL;
+ node->children_hit = 0;
node->hit = 0;
}

> Also the flat mode is rendered incorrectly, it just adds different call
> graphs inside a single process directly after each other instead of
> separating them in the rendering.

I'm not sure what you mean. The flat format is a pure dump of every callchains
that belong to a single process (or whatever kind of histogram source...).

What do you mean by separating them in the rendering?

OTOH I see another bug:

61.70% python [kernel.kallsyms] [k] xlog_sync
0.03%
xlog_sync
xlog_write
xlog_cil_push
_xfs_log_force
xfs_log_force
xfs_commit_dummy_trans
xfs_quiesce_data
xfs_fs_sync_fs
__sync_filesystem
sync_one_sb
iterate_supers
sync_filesystems
sys_sync
system_call_fastpath
0x334f0d72e7

0.01%
xlog_sync
_xfs_log_force
xfs_log_force
xfs_commit_dummy_trans
xfs_quiesce_data
xfs_fs_sync_fs
__sync_filesystem
sync_one_sb
iterate_supers
sync_filesystems
sys_sync
system_call_fastpath
0x334f0d72e7

I think that's what you said previously: they sometimes don't add up correctly.
The total of hit percentages from callchains should equal the total
percentage of the process they belong to.

I'm investigating to find the reason.

Thanks a lot of your reports!

2010-08-21 14:42:44

by Christoph Hellwig

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Sat, Aug 21, 2010 at 04:47:14AM +0200, Frederic Weisbecker wrote:
> > I still see the same problems as with the TUI perf report with that.
> > With the -g {mode},0.0 there is nothing to expand inside the GUI for
> ^^
> you meant "without", right?

Yes, sorry.

> The following patch solves that particular problem, I'll push it to Ingo for .36
> (and -stable):

It does seem to fix the bug for some cases but not all. Default perf
report in TUI and the normal command line seem to get it right. perf
report -g flat still shows the old problem. perf report -g flat,0.0
shows callgraphs, but just as before they just show the 0.<n>
percentages.

Btw, even in normal perf report mode the numbers while they look correct
confused me a bit. The percentages before the callgraphs split are
always relative to the node above, not absolute which is rather
confusing. And even despite adding -n to the perf report command line
I only get absolute numbers for the proccesses but not the actual
callgraphs.

> > Also the flat mode is rendered incorrectly, it just adds different call
> > graphs inside a single process directly after each other instead of
> > separating them in the rendering.
>
>
> I'm not sure what you mean. The flat format is a pure dump of every callchains
> that belong to a single process (or whatever kind of histogram source...).
>
> What do you mean by separating them in the rendering?

If there are different callchains leading to the same tracepoint they
are just appened to each other with no visual indication that they are
separate callchains

2010-08-21 14:44:55

by Christoph Hellwig

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Fri, Aug 20, 2010 at 04:12:30PM -0300, Arnaldo Carvalho de Melo wrote:
> In fact I never run 'perf report' specifying '-g', have you tried it
> that way?

Yes, same issue.

> It should work on the stdio mode, IIRC Fr?d?ric told me he found and fixed the
> problem there, right?

With the patch some modes work, but not all.

And I'm getting really eager to get separate commands for TUI with line
mode perf report. Having to recompile the damn thing all the time or
tweak config files (which is even worse) is getting highly annoying.

2010-08-21 14:46:29

by Christoph Hellwig

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

2010-08-22 00:49:34

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Sat, Aug 21, 2010 at 10:42:39AM -0400, Christoph Hellwig wrote:
> It does seem to fix the bug for some cases but not all. Default perf
> report in TUI and the normal command line seem to get it right. perf
> report -g flat still shows the old problem. perf report -g flat,0.0
> shows callgraphs, but just as before they just show the 0.<n>
> percentages.

Yep. So I just found the other problem.
We collect every events and store them into per tid histograms.
But depending on the final sorting (by default we sort by comm),
we may merge (collapse) the histograms against the sorting
criteria. If this is by comm, per tid histograms will become
per comm histograms, hence threads profiles will into process
profiles. We have callbacks that handle this merge, but we
forgot to handle callchains.

So imagine we have three threads (tids: 1000, 1001, 1002) that
belong to python.

tid 1000 got 100 events
tid 1001 got 10 events
tid 1002 got 3 events

Once we merge these histograms to get a per comm result, we'll
finally get:

python got 113 events

The problem is we merge 1000 and 1001 histograms into 1002. So the end
merge result, wrt callchains, will be only 1002 callchains. Because
we haven't handled callchains in the merge. Only those from one of
the threads survived.

So, I'm going to fix that.

> Btw, even in normal perf report mode the numbers while they look correct
> confused me a bit. The percentages before the callgraphs split are
> always relative to the node above, not absolute which is rather
> confusing. And even despite adding -n to the perf report command line
> I only get absolute numbers for the proccesses but not the actual
> callgraphs.

That's the point of the fractal mode. It's a relative profiling against
the parent node.

But you can select the "graph" mode that does an absolute profiling (against
the total hits).

> > > Also the flat mode is rendered incorrectly, it just adds different call
> > > graphs inside a single process directly after each other instead of
> > > separating them in the rendering.
> >
> >
> > I'm not sure what you mean. The flat format is a pure dump of every callchains
> > that belong to a single process (or whatever kind of histogram source...).
> >
> > What do you mean by separating them in the rendering?
>
> If there are different callchains leading to the same tracepoint they
> are just appened to each other with no visual indication that they are
> separate callchains

Ah right. There is a blank line between callchains. If that's confusing I
can add a kind of "----" boundary.

Thanks.

2010-08-22 05:49:11

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Sat, Aug 21, 2010 at 10:46:25AM -0400, Christoph Hellwig wrote:
> On Sat, Aug 21, 2010 at 10:42:39AM -0400, Christoph Hellwig wrote:
> > It does seem to fix the bug for some cases but not all. Default perf
> > report in TUI and the normal command line seem to get it right. perf
> > report -g flat still shows the old problem. perf report -g flat,0.0
> > shows callgraphs, but just as before they just show the 0.<n>
> > percentages.
>
> -g graph also shows the same issue of not beeing able to expand the
> python line. It seems like only the fractal case was fixed by that
> patch.
>

Now does it work with the patch in attachment?
It includes the previous one.

It still needs a few cleanups but the idea is there.
Ah and it applies in tip:perf/core but latest -linus
should be fine.

Thanks.

Attachments:

(No filename) (808.00 B)
callchain.diff (2.89 kB)
Download all attachments

2010-08-22 08:12:01

by Christoph Hellwig

[permalink] [raw]

Subject: Re: callchain sampling bug in perf?

On Sun, Aug 22, 2010 at 07:20:14AM +0200, Frederic Weisbecker wrote:
> Now does it work with the patch in attachment?
> It includes the previous one.

It works perfectly thanks!