2009-06-24 17:29:25

by Ben Gamari

Subject: Infrastructure for tracking driver performance events

[I apologize profusely to those of you who receive this twice. Sorry about that]


Hey all,

Now that GEM has been implemented and is beginning to stabilize in the
Intel graphics driver, work has begun on optimizing the driver and its
use of the hardware. While CPU-bound operations can easily be found with
a profiler, identifying GPU stalls has been substantially more
difficult.

One class of GPU stalls that can be easily identified occurs when the
driver needs to wait for the GPU to complete some work before
proceeding, either for the chip to free a hardware resource (e.g. a
fence register for configuring tiling) or to complete some other type of
transaction (e.g. a cache flush). In order to debug these stalls, it is
useful to know both what is causing the stall (i.e. the call path) and
why the driver had to wait (e.g. waiting for a GEM domain change, a
fence register, or a cache flush).

I recently wrote a very simple patch to add accounting for these types
of stalls to the i915 driver[1], exposing a list of wait-event counts to
userspace through debugfs. While this is useful for giving a general
overview of the driver's performance, it does very little to expose
individual bottlenecks in the driver or userland components. It has been
suggested[2] that this wait-event tracking functionality would be far more
useful if we could provide stack backtraces all the way into user space
for each wait event.
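
To make this concrete, here is a rough sketch of what such per-cause
accounting might look like (the names and the set of causes are
illustrative only, not the code from the actual patch):

/*
 * Illustrative sketch: per-cause wait-event counters exposed through
 * debugfs. All identifiers here are hypothetical.
 */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <asm/atomic.h>

enum i915_wait_cause {
	I915_WAIT_FENCE,	/* waiting for a free fence register */
	I915_WAIT_FLUSH,	/* waiting for a cache flush */
	I915_WAIT_DOMAIN,	/* waiting for a GEM domain change */
	I915_WAIT_NCAUSES
};

static const char *i915_wait_names[I915_WAIT_NCAUSES] = {
	"fence_register", "cache_flush", "domain_change"
};
static atomic_t i915_wait_counts[I915_WAIT_NCAUSES];

/* Called from each wait point in the driver. */
static inline void i915_account_wait(enum i915_wait_cause cause)
{
	atomic_inc(&i915_wait_counts[cause]);
}

static int i915_wait_counts_show(struct seq_file *m, void *unused)
{
	int i;

	for (i = 0; i < I915_WAIT_NCAUSES; i++)
		seq_printf(m, "%s: %d\n", i915_wait_names[i],
			   atomic_read(&i915_wait_counts[i]));
	return 0;
}

static int i915_wait_counts_open(struct inode *inode, struct file *file)
{
	return single_open(file, i915_wait_counts_show, NULL);
}

static const struct file_operations i915_wait_counts_fops = {
	.owner   = THIS_MODULE,
	.open    = i915_wait_counts_open,
	.read    = seq_read,
	.llseek  = seq_lseek,
	.release = single_release,
};

/*
 * Registered at driver init with something like:
 *   debugfs_create_file("i915_wait_counts", 0444, minor->debugfs_root,
 *                       NULL, &i915_wait_counts_fops);
 */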

I am investigating how this might be accomplished with existing kernel
infrastructure. At first, ftrace looked like a promising option, as the
sysprof profiler is driven by ftrace and provides exactly the type of
full system backtraces we need. We could probably even approximate our
desired result by calling one function when we begin waiting and another
when we end, then using a script to look for these events. I haven't
looked into how we could get a usermode trace with this approach, but it
seems possible, as sysprof already does it.
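
Concretely, the begin/end hooks could be as simple as two otherwise
empty noinline functions bracketing the wait, so that they show up as
distinct entries in the trace (again just a sketch with made-up names):

#include <linux/compiler.h>

/*
 * Empty on purpose: only the traced calls matter. A userspace script
 * can pair up begin/end events and attribute the time in between to
 * the given cause.
 */
noinline void i915_wait_begin(int cause)
{
	barrier();	/* compiler barrier; keeps the empty body intact */
}

noinline void i915_wait_end(int cause)
{
	barrier();
}

/*
 * A wait point would then look something like:
 *
 *	i915_wait_begin(I915_WAIT_FENCE);
 *	ret = wait_event_interruptible(dev_priv->irq_queue, condition);
 *	i915_wait_end(I915_WAIT_FENCE);
 */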

While this approach would work, it has a few shortcomings:
1) Function graph tracing must be enabled on the entire machine to debug
stalls
2) It is difficult to extract the kernel-mode callgraph, and there is no
natural way to capture the usermode callgraph
3) A large amount of usermode support is necessary (which will likely be
the case for any option; listed here for completeness)

Another option seems to be systemtap. It has already been documented[3]
that this option could provide both user-mode and kernel-mode
backtraces. The driver could provide a kernel marker at every potential
wait point (or a single marker in a function called at each wait point,
for that matter) which would be picked up by systemtap and processed in
usermode, calling ptrace to acquire a usermode backtrace. This approach
seems slightly cleaner, as it doesn't require tracing on the entire
machine to catch what should (hopefully) be reasonably rare events.
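
As a sketch, the single-marker variant might look like this with the
kernel marker API of that era (the helper name and "cause" argument are
hypothetical):

#include <linux/marker.h>

/*
 * One shared helper called from every potential wait point, so a
 * single marker covers them all. A systemtap script would attach to
 * it (e.g. probe kernel.mark("i915_wait")) and could gather the
 * usermode backtrace from its handler.
 */
static void i915_mark_wait(int cause)
{
	trace_mark(i915_wait, "cause %d", cause);
}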

Unfortunately, the systemtap approach described in [3] requires that
each process have an associated "driver" process to get a usermode
backtrace. It would be nice to avoid this requirement as there are
generally far more GPU clients than just the X server (i.e. direct
rendering clients) and tracking them all could get tricky.

These are the two options I have seen thus far. It seems like getting
this sort of information will be increasingly important as more and more
drivers move into kernel space, and it is likely that the Intel
implementation will be a model for future drivers, so it would be nice
to implement it correctly the first time. Does anyone see an option
which I have missed? Are there any thoughts on any new generic services
that the kernel might provide that might make this task easier? Any
comments, questions, or complaints would be greatly appreciated.

Thanks,

- Ben


[1] http://lists.freedesktop.org/archives/intel-gfx/2009-June/002938.html
[2] http://lists.freedesktop.org/archives/intel-gfx/2009-June/002979.html
[3] http://sourceware.org/ml/systemtap/2006-q4/msg00198.html


2009-06-24 18:53:59

by Josh Stone

Subject: Re: Infrastructure for tracking driver performance events

On 06/24/2009 10:29 AM, Ben Gamari wrote:
[...]
> I recently wrote a very simple patch to add accounting for these types
> of stalls to the i915 driver[1], exposing a list of wait-event counts to
> userspace through debugfs. While this is useful for giving a general
> overview of the driver's performance, it does very little to expose
> individual bottlenecks in the driver or userland components. It has been
> suggested[2] that this wait-event tracking functionality would be far more
> useful if we could provide stack backtraces all the way into user space
> for each wait event.
[...]
> Another option seems to be systemtap. It has already been documented[3]
> that this option could provide both user-mode and kernel-mode
> backtraces. The driver could provide a kernel marker at every potential
> wait point (or a single marker in a function called at each wait point,
> for that matter) which would be picked up by systemtap and processed in
> usermode, calling ptrace to acquire a usermode backtrace. This approach
> seems slightly cleaner, as it doesn't require tracing on the entire
> machine to catch what should (hopefully) be reasonably rare events.
>
> Unfortunately, the systemtap approach described in [3] requires that
> each process have an associated "driver" process to get a usermode
> backtrace. It would be nice to avoid this requirement as there are
> generally far more GPU clients than just the X server (i.e. direct
> rendering clients) and tracking them all could get tricky.
[...]
> [3] http://sourceware.org/ml/systemtap/2006-q4/msg00198.html

I have to say, I'm a bit surprised to see my hacky suggestion
resurrected from the archives. :) I would guess that approach would
add far too much overhead to be of use in diagnosing stalls, though.

However, I think we can do a lot better with systemtap these days.
We're growing the ability to do userspace backtraces[1] directly within
your systemtap script, which should be much less intrusive.

Please take a look at ubacktrace() and family in recent systemtap and
let us know how you think it could be improved.

Thanks,

Josh

[1] http://sourceware.org/ml/systemtap/2009-q2/msg00364.html

2009-06-25 12:55:58

by Steven Rostedt

Subject: Re: Infrastructure for tracking driver performance events



On Wed, 24 Jun 2009, Ben Gamari wrote:
>
> I am investigating how this might be accomplished with existing kernel
> infrastructure. At first, ftrace looked like a promising option, as the
> sysprof profiler is driven by ftrace and provides exactly the type of
> full system backtraces we need. We could probably even approximate our
> desired result by calling one function when we begin waiting and another
> when we end, then using a script to look for these events. I haven't
> looked into how we could get a usermode trace with this approach, but it
> seems possible, as sysprof already does it.
>
> While this approach would work, it has a few shortcomings:
> 1) Function graph tracing must be enabled on the entire machine to debug
> stalls

You can filter on which functions to trace, or add a list of functions
to set_graph_function to graph just those functions.

> 2) It is difficult to extract the kernel-mode callgraph, and there is no
> natural way to capture the usermode callgraph

Do you just need a backtrace at some point, or a full user-mode graph?

> 3) A large amount of usermode support is necessary (which will likely be
> the case for any option; listed here for completeness)
>
> Another option seems to be systemtap. It has already been documented[3]
> that this option could provide both user-mode and kernel-mode
> backtraces. The driver could provide a kernel marker at every potential
> wait point (or a single marker in a function called at each wait point,
> for that matter) which would be picked up by systemtap and processed in
> usermode, calling ptrace to acquire a usermode backtrace. This approach
> seems slightly cleaner, as it doesn't require tracing on the entire
> machine to catch what should (hopefully) be reasonably rare events.

Enabling the userstacktrace option will give you userspace stack traces
at event trace points. The catch is that the userspace utility must be
built with frame pointers.

-- Steve

>
> Unfortunately, the systemtap approach described in [3] requires that
> each process have an associated "driver" process to get a usermode
> backtrace. It would be nice to avoid this requirement as there are
> generally far more GPU clients than just the X server (i.e. direct
> rendering clients) and tracking them all could get tricky.
>
> These are the two options I have seen thus far. It seems like getting
> this sort of information will be increasingly important as more and more
> drivers move into kernel space, and it is likely that the Intel
> implementation will be a model for future drivers, so it would be nice
> to implement it correctly the first time. Does anyone see an option
> which I have missed? Are there any thoughts on any new generic services
> that the kernel might provide that might make this task easier? Any
> comments, questions, or complaints would be greatly appreciated.
>

2009-06-25 13:25:31

by Mark Wielaard

Subject: Re: Infrastructure for tracking driver performance events

Hi,

On Thu, 2009-06-25 at 08:55 -0400, Steven Rostedt wrote:
> On Wed, 24 Jun 2009, Ben Gamari wrote:
> > 3) A large amount of usermode support is necessary (which will likely be
> > the case for any option; listed here for completeness)
> >
> > Another option seems to be systemtap. It has already been documented[3]
> > that this option could provide both user-mode and kernel-mode
> > backtraces. The driver could provide a kernel marker at every potential
> > wait point (or a single marker in a function called at each wait point,
> > for that matter) which would be picked up by systemtap and processed in
> > usermode, calling ptrace to acquire a usermode backtrace. This approach
> > seems slightly cleaner, as it doesn't require tracing on the entire
> > machine to catch what should (hopefully) be reasonably rare events.
>
> Enabling the userstacktrace option will give you userspace stack traces
> at event trace points. The catch is that the userspace utility must be
> built with frame pointers.

This isn't true for Systemtap. It can unwind through anything, since it
contains a DWARF unwinder that can produce backtraces as long as unwind
tables are available for the modules (executables, vdso, shared
libraries, etc.) one wants to unwind through. Systemtap currently gets
these in its "translation" phase, and at the moment you need to list
them explicitly. There is work underway to make this more flexible and
automatic. Cross kernel-to-user-space backtraces also need some work
(systemtap can use the DWARF unwinder in-kernel as well, but some kernel
parts are missing unwind tables).

Some systemtap bugs to track if you are interested in extending this
functionality:

= Prerequisites for more ubiquitous backtracing
sw#6961 backtrace from non-pt_regs probe context
http://sourceware.org/bugzilla/show_bug.cgi?id=6961
sw#10080 track vdso for process symbols/backtrace
http://sourceware.org/bugzilla/show_bug.cgi?id=10080
sw#10208 Support probing glibc synthesized syscall wrappers
http://sourceware.org/bugzilla/show_bug.cgi?id=10208

NOTE: the above still won't make cross kernel-to-userspace backtracing
fully work, since we cannot easily unwind through the kernel entry/exit
assembly code, which doesn't have DWARF unwind tables.

= Make user backtraces more convenient
sw#10228 Add more vma-tracking for user space symbol/backtraces
http://sourceware.org/bugzilla/show_bug.cgi?id=10228
sw#6580 revamp backtrace-related tapset functions
http://sourceware.org/bugzilla/show_bug.cgi?id=6580

Cheers,

Mark

2009-07-03 16:30:25

by Ben Gamari

Subject: Re: Infrastructure for tracking driver performance events

On Thu, Jun 25, 2009 at 08:55:51AM -0400, Steven Rostedt wrote:
>
>
> On Wed, 24 Jun 2009, Ben Gamari wrote:
> You can filter on which functions to trace, or add a list of functions
> to set_graph_function to graph just those functions.

Perfect.

>
> > 2) It is difficult to extract the kernel-mode callgraph, and there is no
> > natural way to capture the usermode callgraph
>
> Do you just need a backtrace at some point, or a full user-mode graph?

Just a backtrace to the syscall invocation. This should allow us to
identify which path Mesa or the DDX took to hit the slow path.

>
> > 3) A large amount of usermode support is necessary (which will likely be
> > the case for any option; listed here for completeness)
> >
> > Another option seems to be systemtap. It has already been documented[3]
> > that this option could provide both user-mode and kernel-mode
> > backtraces. The driver could provide a kernel marker at every potential
> > wait point (or a single marker in a function called at each wait point,
> > for that matter) which would be picked up by systemtap and processed in
> > usermode, calling ptrace to acquire a usermode backtrace. This approach
> > seems slightly cleaner, as it doesn't require tracing on the entire
> > machine to catch what should (hopefully) be reasonably rare events.
>
> Enabling the userstacktrace option will give you userspace stack traces
> at event trace points. The catch is that the userspace utility must be
> built with frame pointers.

Yep, I apparently hadn't read through the documentation all that well.
It looks like the stacktrace and userstacktrace options will serve quite nicely.
Are there preexisting tools for resolving the addresses in the resulting
stacktrace into symbols? Is there an example of this being done
somewhere? Thanks,

- Ben