2010-08-02 18:35:16

by Frederic Weisbecker

Subject: [RFC] BTS based perf user callchains

Hi,

As you may know, there is an issue with user stacktraces: they require
userspace apps to be built with frame pointers.

So there is something we can try: dump a piece of the top user stack page
each time we have an event hit and let the tools deal with that later using
the DWARF information.

But before trying that, which might require heavy copies, I would like to
try something based on BTS. The idea is to look at the branch buffer and
only pick addresses of branches that originated from "call" instructions.

So we want BTS activated, only in user ring, without the need for interrupts
once we reach the limit of the buffer: we can just run in a kind of live
mode and read on demand. This could be a secondary perf event that has no mmap
buffer, something only used internally by the kernel on behalf of other, true
perf events in a given context. Primary perf events can then read this BTS
buffer when they want.

Now there are two ways:

- record the whole branch buffer each time we overflow on another perf event
and let post processing userspace deal with "call" instruction filtering to
build the stacktrace on top of the branch trace.

- do the "call" filtering at record time. That requires inspecting each
recorded branch and looking at the instruction content from the fast path.

I don't know which solution could be the faster one.
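
To make the "call" filtering concrete, here is a minimal sketch of the check either approach would need: given the bytes found at a branch's source address, decide whether they look like an x86-64 call. The function name `insn_is_call` is hypothetical, the decoder is deliberately simplified (it skips prefixes and recognizes only the E8 and FF /2, FF /3 encodings), and it assumes the instruction bytes have already been copied safely from user memory.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if the bytes at a branch source address
 * look like an x86-64 call instruction. This is a simplification, not a
 * full decoder: it skips legacy prefixes and one REX prefix, then accepts
 * E8 (call rel32) and FF /2 or FF /3 (indirect near or far call).
 */
static bool insn_is_call(const uint8_t *insn, size_t len)
{
	size_t i = 0;

	/* Skip legacy prefixes: operand/address size, segment, lock, rep. */
	while (i < len) {
		uint8_t b = insn[i];

		if (b == 0x66 || b == 0x67 || b == 0x2e || b == 0x36 ||
		    b == 0x3e || b == 0x26 || b == 0x64 || b == 0x65 ||
		    b == 0xf0 || b == 0xf2 || b == 0xf3)
			i++;
		else
			break;
	}

	/* Skip a single REX prefix (0x40..0x4f), e.g. for call *%r8. */
	if (i < len && (insn[i] & 0xf0) == 0x40)
		i++;

	if (i >= len)
		return false;

	if (insn[i] == 0xe8)		/* call rel32 */
		return true;

	if (insn[i] == 0xff && i + 1 < len) {
		/* The ModRM reg field selects the operation in the FF group. */
		uint8_t reg = (insn[i + 1] >> 3) & 7;

		return reg == 2 || reg == 3;
	}

	return false;
}
```

Filtering at record time would run a check like this for every branch in the fast path, while post-processing would run the same check in userspace against the binary's text.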


I'm not even sure that will work. Also, while looking at the BTS implementation
in perf, I see we have one BTS buffer per cpu. But that doesn't look right as
the code flow is not linear per cpu but per task. Hence I suspect we need
one BTS buffer per task. But maybe someone tried that and encountered a
problem?


Let me know what you think.

Thanks.


2010-08-02 18:39:12

by Peter Zijlstra

Subject: Re: [RFC] BTS based perf user callchains

On Mon, 2010-08-02 at 20:35 +0200, Frederic Weisbecker wrote:
> I'm not even sure that will work. Also, while looking at the BTS implementation
> in perf, I see we have one BTS buffer per cpu. But that doesn't look right as
> the code flow is not linear per cpu but per task. Hence I suspect we need
> one BTS buffer per task. But maybe someone tried that and encountered a
> problem?

IIRC we flush the buffer when we deschedule the counter.

2010-08-02 18:41:57

by Frederic Weisbecker

Subject: Re: [RFC] BTS based perf user callchains

On Mon, Aug 02, 2010 at 08:38:52PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-08-02 at 20:35 +0200, Frederic Weisbecker wrote:
> > I'm not even sure that will work. Also, while looking at the BTS implementation
> > in perf, I see we have one BTS buffer per cpu. But that doesn't look right as
> > the code flow is not linear per cpu but per task. Hence I suspect we need
> > one BTS buffer per task. But maybe someone tried that and encountered a
> > problem?
>
> IIRC we flush the buffer when we deschedule the counter.


Ok. So the buffer is cut at schedule time. It might be nice
to maintain the buffer progress across scheduling.

That requires one buffer per task though. That could be worth it.

2010-08-02 19:47:25

by Peter Zijlstra

Subject: Re: [RFC] BTS based perf user callchains

On Mon, 2010-08-02 at 20:41 +0200, Frederic Weisbecker wrote:
> On Mon, Aug 02, 2010 at 08:38:52PM +0200, Peter Zijlstra wrote:
> > On Mon, 2010-08-02 at 20:35 +0200, Frederic Weisbecker wrote:
> > > I'm not even sure that will work. Also, while looking at the BTS implementation
> > > in perf, I see we have one BTS buffer per cpu. But that doesn't look right as
> > > the code flow is not linear per cpu but per task. Hence I suspect we need
> > > one BTS buffer per task. But maybe someone tried that and encountered a
> > > problem?
> >
> > IIRC we flush the buffer when we deschedule the counter.
>
>
> Ok. So the buffer is cut at schedule time. It might be nice
> to maintain the buffer progress across scheduling.

We flush it into the perf data buffer.

2010-08-03 06:54:29

by Metzger, Markus T

Subject: RE: [RFC] BTS based perf user callchains

>-----Original Message-----
>From: Frederic Weisbecker
>Sent: Monday, August 02, 2010 8:35 PM
>To: Ingo Molnar; Peter Zijlstra; Arnaldo Carvalho de Melo; Paul Mackerras;
>    Stephane Eranian; Metzger, Markus T; Robert Richter
>Cc: LKML
>Subject: [RFC] BTS based perf user callchains
>
>Hi,
>
>As you may know, there is an issue with user stacktraces: they require
>userspace apps to be built with frame pointers.

It requires DWARF to correctly describe how to unwind a frame. You can also
generate ESP-based frames and still get a correct backtrace, provided you
have debug information.


>So there is something we can try: dump a piece of the top user stack page
>each time we have an event hit and let the tools deal with that later using
>the DWARF information.
>
>But before trying that, which might require heavy copies, I would like to
>try something based on BTS. The idea is to look at the branch buffer and
>only pick addresses of branches that originated from "call" instructions.

You would also need to track returns.
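
To illustrate why returns matter, here is a minimal sketch of rebuilding a call chain from a branch trace that has already been classified. The record layout, the `branch_kind` enum and `build_callchain` are hypothetical names for illustration, not perf code. Calls push their source address, returns pop the most recent one, so a function that was called and has already returned does not show up as a live frame.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a recorded branch. */
enum branch_kind { BR_CALL, BR_RET, BR_OTHER };

/* Hypothetical record layout: branch source, destination and kind. */
struct branch_rec {
	uint64_t from;
	uint64_t to;
	enum branch_kind kind;
};

/*
 * Walk the trace from oldest to newest record. Each call pushes its call
 * site, each return pops the innermost one. Returns the number of frames
 * left in chain[], outermost first.
 */
static size_t build_callchain(const struct branch_rec *trace, size_t n,
			      uint64_t *chain, size_t max_depth)
{
	size_t depth = 0;
	size_t dropped = 0;	/* calls that did not fit in chain[] */

	for (size_t i = 0; i < n; i++) {
		switch (trace[i].kind) {
		case BR_CALL:
			if (depth < max_depth)
				chain[depth++] = trace[i].from;
			else
				dropped++;
			break;
		case BR_RET:
			if (dropped > 0)
				dropped--;	/* matches a dropped call */
			else if (depth > 0)
				depth--;	/* matches a recorded call */
			/* else: the call happened before the trace window */
			break;
		default:
			break;
		}
	}

	return depth;
}
```

A return seen with an empty chain belongs to a call made before the trace window started. That frame cannot be recovered from the trace, which is one way the resulting backtrace ends up shallow.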


>So we want BTS activated, only in user ring, without the need for interrupts
>once we reach the limit of the buffer: we can just run in a kind of live
>mode and read on demand. This could be a secondary perf event that has no mmap
>buffer, something only used internally by the kernel on behalf of other, true
>perf events in a given context. Primary perf events can then read this BTS
>buffer when they want.
>
>Now there are two ways:
>
>- record the whole branch buffer each time we overflow on another perf event
>and let post processing userspace deal with "call" instruction filtering to
>build the stacktrace on top of the branch trace.

If you only care about the backtrace, there will be too much noise in the data.
I doubt that you will get a very deep backtrace.

On the other hand, the trace data might be useful for other purposes. But
then, what you would want is BTS and perf events collected in the same buffer.


>- do the "call" filtering at record time. That requires inspecting each
>recorded branch and looking at the instruction content from the fast path.

You can try to use LBR for that. Core i7 adds LBR filters that allow you to
only record calls and returns. You will be limited to a handful of records, but
I doubt that you will get much more out of a page of BTS.
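
As a rough illustration of the LBR filter, the sketch below computes a filter value that keeps only user-level calls and returns. The bit names and positions are an assumption based on my recollection of the MSR_LBR_SELECT layout for Nehalem-class CPUs in the Intel SDM and should be checked against it before use. On that layout a set bit suppresses the corresponding branch type rather than selecting it. The macro and function names are hypothetical, and actually programming the MSR needs ring 0, which this sketch does not attempt.

```c
#include <stdint.h>

/*
 * Assumed MSR_LBR_SELECT bit layout (Nehalem-class, to be verified against
 * the Intel SDM). A set bit means "do NOT capture this kind of branch".
 */
#define LBR_SEL_CPL_EQ_0	(1u << 0)	/* branches in ring 0 */
#define LBR_SEL_CPL_NEQ_0	(1u << 1)	/* branches in ring > 0 */
#define LBR_SEL_JCC		(1u << 2)	/* conditional branches */
#define LBR_SEL_NEAR_REL_CALL	(1u << 3)	/* near relative calls */
#define LBR_SEL_NEAR_IND_CALL	(1u << 4)	/* near indirect calls */
#define LBR_SEL_NEAR_RET	(1u << 5)	/* near returns */
#define LBR_SEL_NEAR_IND_JMP	(1u << 6)	/* near indirect jumps */
#define LBR_SEL_NEAR_REL_JMP	(1u << 7)	/* near relative jumps */
#define LBR_SEL_FAR_BRANCH	(1u << 8)	/* far branches */

/*
 * Hypothetical helper: filter value that leaves only user-level near calls
 * and near returns enabled, by suppressing everything else.
 */
static uint64_t lbr_select_user_calls_rets(void)
{
	return LBR_SEL_CPL_EQ_0 |
	       LBR_SEL_JCC |
	       LBR_SEL_NEAR_IND_JMP |
	       LBR_SEL_NEAR_REL_JMP |
	       LBR_SEL_FAR_BRANCH;
}
```

Because the call and return bits are left clear, those are the only branch types that would still land in the LBR stack.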


With both approaches, the backtrace will not be very deep. There is so much
traffic at the top of the stack that you won't find entries further down.


>I'm not even sure that will work. Also, while looking at the BTS implementation
>in perf, I see we have one BTS buffer per cpu. But that doesn't look right as
>the code flow is not linear per cpu but per task. Hence I suspect we need
>one BTS buffer per task. But maybe someone tried that and encountered a
>problem?


When BTS was stand-alone, there was one buffer per task. It now uses the perf
ring buffer. The per-cpu buffers are only used to collect the data. On context
switch or buffer overflow, the data is copied into the perf ring buffer.
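
The flow described above can be modeled with a toy sketch: a small per-cpu collection buffer that is drained into a per-event ring buffer on overflow and on context switch. All names and sizes here are hypothetical and chosen for illustration only. This is not the perf implementation, just a model of the copy points.

```c
#include <stddef.h>
#include <stdint.h>

#define CPU_BUF_CAP	4	/* hypothetical per-cpu collection size */
#define RING_CAP	16	/* hypothetical ring buffer size */

struct branch_rec {
	uint64_t from;
	uint64_t to;
};

/* Per-cpu scratch buffer that the hardware would fill. */
struct cpu_buf {
	struct branch_rec rec[CPU_BUF_CAP];
	size_t n;
};

/* Ring buffer belonging to the event of the currently running task. */
struct ring_buf {
	struct branch_rec rec[RING_CAP];
	size_t head;	/* total records written so far */
};

/* Copy everything collected so far into the ring and reset the cpu buffer. */
static void drain(struct cpu_buf *cb, struct ring_buf *rb)
{
	for (size_t i = 0; i < cb->n; i++)
		rb->rec[rb->head++ % RING_CAP] = cb->rec[i];
	cb->n = 0;
}

/* Record one branch, draining first if the cpu buffer is full. */
static void record_branch(struct cpu_buf *cb, struct ring_buf *rb,
			  uint64_t from, uint64_t to)
{
	if (cb->n == CPU_BUF_CAP)
		drain(cb, rb);		/* buffer overflow */
	cb->rec[cb->n].from = from;
	cb->rec[cb->n].to = to;
	cb->n++;
}

/* On context switch the outgoing task's records are flushed to its ring. */
static void context_switch(struct cpu_buf *cb, struct ring_buf *rb_out)
{
	drain(cb, rb_out);
}
```

In this model the per-cpu buffer never carries records across a context switch, so each task's branches end up in its own ring even though collection itself is per cpu.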


regards,
markus.
