2010-11-16 21:09:29

by Thomas Gleixner

[permalink] [raw]
Subject: [ANNOUNCE] New utility: 'trace'

We are pleased to announce a new tracing tool called 'trace'.

This tool follows up on Linus's (not so) recent observations that
Linux tracing is still not really offering good tracing tools to
ordinary users. There's no easy to use workflow and the tools that
exist are not expressive enough to solve everyday problem.

After years of efforts we have not succeeded in meeting (let alone
exceeding) the utility of decades-old user-space tracing tools such as
strace - except for a few new good tools such as PowerTop and
LatencyTop.

'trace' is our shot at improving the situation: it aims at providing a
simple to use and straightforward tracing tool based on the perf
infrastructure and on the well-known perf profiling workflow:

trace record firefox # trace firefox and all child tasks,
put it into trace.data

trace summary # show quick overview of the trace,
based on trace.data

trace report # detailed traces, based on trace.data

The main workflow approach is to trace once, and then to analyze the
trace in an iterative fashion, by drilling down from a high-level
overview to more and more detailed trace events - with numerous
filtering capabilities available to make this progression effective
and intuitive. (further below there are a number of real examples of
what you can do - but try out the tool to see all the capabilities.)

To try it out, pick up the -tip tree:

http://people.redhat.com/mingo/tip.git/README

( NOTE: you will need to install the audit-libs-dev /
audit-libs-devel package if you dont have it installed. We'll
remove this library dependency in a future iteration of the
tool. )

And do:

cd tools/perf/
make -j install

Once you've done that you should have the 'trace' tool available.

Build the -tip kernel and reboot into it to get the most out of the
tool - but even on default kernels it will try to cope. Syscall events
(CONFIG_FTRACE_SYSCALLS=y) is a must for most methods of analysis
though. (with the above -tip tree you'll get it all configured by
default)

If you've booted the new kernel you can run 'trace check' to double
check that all events used by the tool are available:

$ trace check

Checking whether the kernel has all required events ...
... Checking event raw_syscalls:sys_enter:r : ok
...
... Checking event sched:sched_stat_runtime:r : ok
Good: all required event types are supported by this kernel.
The 'trace' utility will be fully functional.

The combo diffstat of the tool is appended at the end of the mail. The
overwhelming majority of changes is on the tooling side - it uses
existing perf events facilities and features to implement the tool.

Here's a bit more general description of the 'trace' tool's capabilities:

- Trace sessions (single tasks or groups of processes/threads
(auto-follow), system-wide and per cpu
tracing).

- Unintrusive tracing

- It will record filterable information about the session:

* syscalls
* task lifetime events (fork/clone/exit)
* pagefaults
* various mm events (mmap)
* scheduling events

* (support for more events is being worked on)

- Iterative analysis of an existing session/trace without having
to re-run the trace, with different filters and options (trace
data is saved on disk)

- Concurrent and multi-user safe trace recording: multiple users
can trace, or the same user can trace in multiple different
tracing sessions.

- Available to plain users - the permission model is not
completely done yet, it needs a temporary workaround currently:

echo -1 > /proc/sys/kernel/perf_event_paranoid

Todo list:

- Improve output formatting
- Utilize more events: networking, block IO
- Add more higher level trace points
- Implement 'trace diff' utility to simplify the comparison of traces
- Improve the permission model (work in progress)
- Implement -p NNN (trace existing pid) option

Main differences to strace:

- Low impact recording

- Arguments are printed in plain hex representation, except for
important arguments like filenames, which are resolved after
the syscall is recorded. 'trace' wont do full decoding like
strace does.

- Ability to record pagefaults and resolve them fully back to
within the corresponding mapping:

* file : offset
* [heap] : offset
* [stack] : offset
* [anon] : offset
* [vdso] : offset

- Ability to repeat analysis with different filters and options

Here are a few real-life examples of how to use the 'trace' tool.

Firstly, to record a trace, simply prefix whatever command you'd like
to be traced with 'trace record':

$ trace record firefox
# trace recorded [2.570 MB] - try 'trace summary' to get an overview

The new trace.data file in the current directory contains the tracing
session, and it can be analyzed after it has been recorded - without
having to repeat tracing again.

For example, you can get a summary of all events:

$ trace summary

.--------------------------------.
__( Summary of tasks traced )_______________________________

[ task - pid ] [ events ] [ ratio ] [ runtime ]
_____________________________________________________________________

run-mozilla.sh - 1740 : 14108 [ 57.2% ] 37.341 ms
firefox - 1741 : 851 [ 3.5% ] 0.708 ms
uname - 1742 : 842 [ 3.4% ] 0.527 ms
firefox - 1747 : 1793 [ 7.3% ] 3.568 ms
mkdir - 1748 : 713 [ 2.9% ] 1.289 ms
firefox - 1750 : 497 [ 2.0% ] 2.050 ms
:1751 - 1751 : 368 [ 1.5% ] 0.000 ms
sed - 1752 : 1034 [ 4.2% ] 0.479 ms
:1753 - 1753 : 494 [ 2.0% ] 2.059 ms
:1754 - 1754 : 377 [ 1.5% ] 0.331 ms
sed - 1755 : 1035 [ 4.2% ] 0.378 ms
run-mozilla.sh - 1756 : 848 [ 3.4% ] 0.467 ms
dirname - 1757 : 845 [ 3.4% ] 0.531 ms
run-mozilla.sh - 1758 : 840 [ 3.4% ] 0.501 ms
____________________________________________________________________

14 tasks : 24645 [100.0% ] 50.228 ms
$

( Note that the table is augmented with colors to highligh
high-overhead entries and other information. Those colors are not
visible in the cut & paste screens here :-) )

To get to the detailed trace (see further below about subsystem level
filtering and higher level overview):

$ trace report

#
# trace events of 'firefox':
#
firefox/28132 ( 0.000 ms): ... [continued]: execve() => 0x0
firefox/28132 ( 0.001 ms): brk(brk: 0x0) => 0x6e2000
firefox/28132 ( 0.010 ms): mmap(addr: 0x0, len: 0x1000, prot: 0x3, flags: 0x22, fd: 0xffffffff, off: 0x0) => 0x7fb3f442c000
firefox/28132 ( 0.006 ms): access(filename: 0x7fb3f422aa60, mode: 0x4) (fpath: /etc/ld.so.preload) => 0xfffffffffffffffe
firefox/28132 ( 0.004 ms): open(filename: 0x7fb3f4227f4b, flags: 0x0, mode: 0x1) (fpath: /etc/ld.so.cache) => 0x3
firefox/28132 ( 0.001 ms): fstat(3:</etc/ld.so.cache>, statbuf: 0x7fffafe4ef80) => 0x0

To see the events of the 'mozilla-xremote' task alone, the -t option
can be used:

$ trace report -t mozilla-xremote

( 0.000 ms): ... [continued]: clone() => 0x0
( 0.003 ms): close(255:<...>) => 0x0
( 0.002 ms): rt_sigprocmask(how: 0x2, set: 0x6e1a60, oset: 0x0, sigsetsize: 0x8) => 0x0
( 0.001 ms): rt_sigaction(sig: 0x14, act: 0x7fffcc4caa80, oact: 0x7fffcc4ca9e0, sigsetsize: 0x8) => 0x0
( 0.001 ms): rt_sigaction(sig: 0x15, act: 0x7fffcc4caa80, oact: 0x7fffcc4ca9e0, sigsetsize: 0x8) => 0x0
( 0.001 ms): rt_sigaction(sig: 0x16, act: 0x7fffcc4caa90, oact: 0x7fffcc4ca9f0, sigsetsize: 0x8) => 0x0
( 0.001 ms): rt_sigaction(sig: 0x2, act: 0x7fffcc4cabb0, oact: 0x7fffcc4cab10, sigsetsize: 0x8) => 0x0
( 0.001 ms): rt_sigaction(sig: 0x3, act: 0x7fffcc4cabb0, oact: 0x7fffcc4cab10, sigsetsize: 0x8) => 0x0
( 0.001 ms): rt_sigaction(sig: 0x11, act: 0x7fffcc4cabb0, oact: 0x7fffcc4cab10, sigsetsize: 0x8) => 0x0
( 0.010 ms): open(filename: 0x6e95b0, flags: 0x241, mode: 0x1b6) (fpath: /dev/null) => 0x3
( 0.001 ms): dup2(oldfd: 0x3, newfd: 0x1) => 0x1
( 0.001 ms): close(3:</dev/null>) => 0x0

Note how the comm/pid column has been collapsed - because there's only
a single task to display - this keeps the output simpler.

But such traces are generally still too large and detailed to be
really suitable for at-a-glance analysis, so 'trace' offers further
filters which allow you to drill down into any traced workload.

For example subsystem filters will output activities of a given kernel
subsystem.

Current subsystems tags are:

aio, arch-x86, events, fd, fs, fs-attr, IO, IO-locking,
IPC-locking, IPC-mm, IPC-net, locking, misc, mm, net, process,
sched, security, signal, stat, system, task, timer, tty.

This is a first rough categorization of events - more is possible and
more will be done. (Suggestions are welcome!)

For example to utilize the 'process' event category, simply add it
after 'trace report':

$ trace report process

# (restricting syscalls to: process)
firefox/ 6722 ( 0.000 ms): ... [continued]: execve() => 0x0
firefox/ 6722 ( 0.257 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x0, 0x1a42) => 0x1a43
basename/ 6723 ( 0.000 ms): ... [continued]: clone() => 0x0
basename/ 6723 ( 0.215 ms): execve(0x6e9c40, 0x6e8d20, 0x6e8b00, 0x7fffcc4ca980, 0x0, 0x0) (fpath: /bin/basename) => 0x0
basename/ 6723 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6722 ( 0.021 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4cad6c, options: 0x0, ru: 0x0) => 0x1a43
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4ca8dc, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.267 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x0, 0x1a42) => 0x1a44
run-mozilla.sh/ 6724 ( 0.000 ms): ... [continued]: clone() => 0x0
run-mozilla.sh/ 6724 ( 0.166 ms): execve(0x6ea150, 0x6ea400, 0x6ec660, 0x7fffcc4ca830, 0x0, 0x7f1aa2a611e8) (fpath: /bin/uname) => 0x0
run-mozilla.sh/ 6724 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6722 ( 0.020 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4cac1c, options: 0x0, ru: 0x0) => 0x1a44
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4ca79c, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.246 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x0, 0x1a42) => 0x1a45
mozilla-xremote/ 6725 ( 0.000 ms): ... [continued]: clone() => 0x0
mozilla-xremote/ 6725 ( 0.204 ms): execve(0x6ef210, 0x6e95d0, 0x6ecda0, 0x7fffcc4caa30, 0x1, 0x2a) (fpath: /usr/lib64//xulrunner-2/mozilla-xremote-client) => 0x0
mozilla-xremote/ 6725 ( 0.000 ms): exit_group(error_code: 0x1)
firefox/ 6722 ( 3.262 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4cac7c, options: 0x0, ru: 0x0) => 0x1a45
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4ca7dc, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.247 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x0, 0x1a42) => 0x1a46
run-mozilla.sh/ 6726 ( 0.000 ms): ... [continued]: clone() => 0x0
run-mozilla.sh/ 6726 ( 0.150 ms): execve(0x6edca0, 0x6e9ab0, 0x6ecda0, 0x7fffcc4c9ca0, 0x1, 0x7f1aa2a611e8) (fpath: /bin/mkdir) => 0x0
run-mozilla.sh/ 6726 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6722 ( 1.148 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9eec, options: 0x0, ru: 0x0) => 0x1a46
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9a5c, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.243 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x0, 0x1a42) => 0x1a47
firefox/ 6727 ( 0.000 ms): ... [continued]: clone() => 0x0
firefox/ 6727 ( 0.289 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x1a47, 0x1a47) => 0x1a48
firefox/ 6728 ( 0.000 ms): ... [continued]: clone() => 0x0
firefox/ 6727 ( 0.259 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x1a47, 0x1a47) => 0x1a49
sed/ 6729 ( 0.000 ms): ... [continued]: clone() => 0x0
firefox/ 6728 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6727 ( 0.151 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9dfc, options: 0x0, ru: 0x0) => 0x1a48
sed/ 6729 ( 0.161 ms): execve(0x6e9ed0, 0x6f25e0, 0x6ecda0, 0x7fffcc4c9bb0, 0x0, 0x0) (fpath: /bin/sed) => 0x0
sed/ 6729 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6727 ( 1.413 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9dfc, options: 0x0, ru: 0x0) => 0x1a49
firefox/ 6727 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9c1c, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6727 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6722 ( 0.041 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4ca2bc, options: 0x0, ru: 0x0) => 0x1a47
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9e1c, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.264 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x0, 0x1a42) => 0x1a4a
run-mozilla.sh/ 6730 ( 0.000 ms): ... [continued]: clone() => 0x0
run-mozilla.sh/ 6730 ( 0.277 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x1a4a, 0x1a4a) => 0x1a4b
:6731/ 6731 ( 0.000 ms): ... [continued]: clone() => 0x0
run-mozilla.sh/ 6730 ( 0.248 ms): clone(0x1200011, 0x0, 0x0, 0x7f1aa32a69f0, 0x1a4a, 0x1a4a) => 0x1a4c
sed/ 6732 ( 0.000 ms): ... [continued]: clone() => 0x0
:6731/ 6731 ( 0.000 ms): exit_group(error_code: 0x0)
run-mozilla.sh/ 6730 ( 0.214 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9fdc, options: 0x0, ru: 0x0) => 0x1a4b
sed/ 6732 ( 0.229 ms): execve(0x6e9e50, 0x6f25e0, 0x6ecda0, 0x7fffcc4c9d90, 0x0, 0x0) (fpath: /bin/sed) => 0x0
sed/ 6732 ( 0.000 ms): exit_group(error_code: 0x0)
run-mozilla.sh/ 6730 ( 1.413 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9fdc, options: 0x0, ru: 0x0) => 0x1a4c
run-mozilla.sh/ 6730 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9ddc, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
run-mozilla.sh/ 6730 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6722 ( 0.043 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4ca49c, options: 0x0, ru: 0x0) => 0x1a4a
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4ca01c, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.174 ms): execve(0x6f0ee0, 0x6f0550, 0x6ef580, 0x7fffcc4cae50, 0x0, 0x0) (fpath: /usr/lib64/firefox-4/run-mozilla.sh) => 0x0
firefox/ 6722 ( 0.489 ms): clone(0x1200011, 0x0, 0x0, 0x7f2b57e7d9f0, 0x0, 0x1a42) => 0x1a4d
basename/ 6733 ( 0.000 ms): ... [continued]: clone() => 0x0
basename/ 6733 ( 0.154 ms): execve(0x6ea830, 0x6eacd0, 0x6e9070, 0x7fffc740ac50, 0x0, 0x7f2b576381e8) (fpath: /bin/basename) => 0x0
basename/ 6733 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6722 ( 0.020 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffc740b03c, options: 0x0, ru: 0x0) => 0x1a4d
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffc740ab9c, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.243 ms): clone(0x1200011, 0x0, 0x0, 0x7f2b57e7d9f0, 0x0, 0x1a42) => 0x1a4e
run-mozilla.sh/ 6734 ( 0.000 ms): ... [continued]: clone() => 0x0
run-mozilla.sh/ 6734 ( 0.154 ms): execve(0x6ea790, 0x6eabd0, 0x6e9070, 0x7fffc740ac50, 0x0, 0x7f2b576381e8) (fpath: /usr/bin/dirname) => 0x0
run-mozilla.sh/ 6734 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6722 ( 0.074 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffc740b03c, options: 0x0, ru: 0x0) => 0x1a4e
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffc740ab9c, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.237 ms): clone(0x1200011, 0x0, 0x0, 0x7f2b57e7d9f0, 0x0, 0x1a42) => 0x1a4f
uname/ 6735 ( 0.000 ms): ... [continued]: clone() => 0x0
uname/ 6735 ( 0.158 ms): execve(0x6e9780, 0x6e9b00, 0x6e97a0, 0x7fffc740a330, 0x0, 0x7f2b576381e8) (fpath: /bin/uname) => 0x0
uname/ 6735 ( 0.000 ms): exit_group(error_code: 0x0)
firefox/ 6722 ( 0.020 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffc740a71c, options: 0x0, ru: 0x0) => 0x1a4f
firefox/ 6722 ( 0.001 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffc740a29c, options: 0x1, ru: 0x0) => 0xfffffffffffffff6
firefox/ 6722 ( 0.194 ms): execve(0x6f8ae0, 0x6f9c00, 0x6f8ed0, 0x7fffc740a8f0, 0x0, 0x40) (fpath: /usr/lib64/firefox-4/firefox) => 0x0
firefox/ 6722 ( 0.000 ms): exit_group(error_code: 0x1)

(Multiple categories can be specified too - for example 'process,io'.)

Note that this is the complete output already, no abbreviations - this
dump alone is already short enough and suitable to get an overview of
what happened in this Firefox session.

Other useful filters are 'trace report io' - this filters out IO
events (syscalls for the time being), or 'trace report fs' to see FS
related activities.

Another common approach to analyzing traces is to figure out which
portion of a trace took the most time. 'trace' allows such analysis
via a duration filter. For example, to see all events that took longer
than 0.5 milliseconds:

$ trace report -d 0.5

firefox/ 6722 ( 1.273 ms): read(3:</usr/bin/firefox>, buf: 0x7fffcc4caf40, count: 0x80) => 0x8
firefox/ 6722 ( 1.213 ms): read(3:</usr/bin/firefox>, buf: 0x7fffcc4cadf0, count: 0x80) => 0x7
firefox/ 6722 ( 3.262 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4cac7c, options: 0x0, ru: 0x0) => 0x1a45
firefox/ 6722 ( 1.148 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9eec, options: 0x0, ru: 0x0) => 0x1a46
firefox/ 6722 ( 2.360 ms): read(3:</usr/bin/firefox>, buf: 0x7fffcc4ca490, count: 0x80) => 0x3
firefox/ 6727 ( 1.413 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9dfc, options: 0x0, ru: 0x0) => 0x1a49
firefox/ 6722 ( 2.431 ms): read(3:</usr/bin/firefox>, buf: 0x7fffcc4ca670, count: 0x80) => 0x6
run-mozilla.sh/ 6730 ( 1.413 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffcc4c9fdc, options: 0x0, ru: 0x0) => 0x1a4c
firefox/ 6722 ( 1.054 ms): read(3:</usr/lib64/firefox-4/run-mozilla.sh>, buf: 0x7fffc740b210, count: 0x80) => 0xf
firefox/ 6722 ( 1.053 ms): read(3:</usr/lib64/firefox-4/run-mozilla.sh>, buf: 0x7fffc740b210, count: 0x80) => 0x15
firefox/ 6722 ( 1.157 ms): read(3:</usr/lib64/firefox-4/run-mozilla.sh>, buf: 0x7fffc740a8f0, count: 0x80) => 0x6

Another thing that is recorded are page fault events. These can be analyzed via the
-p option:

$ trace report -p

[...]
firefox/ 6722 ( 0.001 ms): #PF 4793: [ g_str_hash]: => /usr/lib64/libgdk-x11-2.0.so.0.2302.0 offset: 0xa5000 page: 165 (R.U)
firefox/ 6722 ( 0.001 ms): #PF 4794: [ gdk_event_handler_set]: => /usr/lib64/libgdk-x11-2.0.so.0.2302.0 offset: 0x25bc0 page: 37 (R.U)
firefox/ 6722 ( 0.001 ms): rt_sigaction(sig: 0xd, act: 0x7fff163c34c0, oact: 0x7fff163c3420, sigsetsize: 0x8) => 0x0
firefox/ 6722 ( 0.001 ms): #PF 4795: [ 7f4dddedd860]: => /usr/lib64/libgtk-x11-2.0.so.0.2302.0 offset: 0x72860 page: 114 (R.U)
firefox/ 6722 ( 0.001 ms): #PF 4796: [ 7f4dde0163a0]: => /usr/lib64/libgtk-x11-2.0.so.0.2302.0 offset: 0x1ab3a0 page: 427 (R.U)
firefox/ 6722 ( 0.001 ms): #PF 4797: [ 7f4dde0163a4]: => [anon 0x7f4dde4ee000] offset: 0x220 page: 0 (R.U)
firefox/ 6722 ( 0.002 ms): #PF 4798: [ 7f4dde0163b0]: => [anon 0x7f4dde4ee000] offset: 0x220 page: 0 (W.U)
firefox/ 6722 ( 0.001 ms): #PF 4799: [ 7f4dde011c50]: => /usr/lib64/libgtk-x11-2.0.so.0.2302.0 offset: 0x1a6c50 page: 422 (R.U)
[...]

Note that both data faults and code faults are traced and interpreted
- and if the fault lies in an executable area (binary or shared
library) then the tool will figure out the function that caused the
fault.

The file name of the faulting address is displayed as well, and the
offset into that file.

Here is an example of the 'duration filter', where only events that
took longer than 20 msecs are displayed, including absolute timestamps
(with syscalls, pagefaults and scheduler events included):

$ trace report -t -P -s -d 20
#
# trace events of 'firefox':
# (duration filter: 20.000 msecs)
#
29.205 run-mozilla.sh/11425 (22.723 ms): => blocked [state: 0x02 cputime: 0.044 ms #PF: 0] stat(filename: 0x1734860, statbuf: 0x7fff313370b0) (fpath: /usr/lib64/firefox-3.6/firefox)
32.347 run-mozilla.sh/11425 (25.864 ms): stat(filename: 0x1734860, statbuf: 0x7fff313370b0) (fpath: /usr/lib64/firefox-3.6/firefox) => 0x0
69.612 mozilla-plugin-/11428 (23.321 ms): => blocked [state: 0x02 cputime: 0.041 ms #PF: 0] stat(filename: 0x8f1ba0, statbuf: 0x7fff06a44a40) (fpath: /usr/lib64/nspluginwrapper/plugi
74.800 mozilla-plugin-/11428 (28.509 ms): stat(filename: 0x8f1ba0, statbuf: 0x7fff06a44a40) (fpath: /usr/lib64/nspluginwrapper/plugin-config) => 0x0
307.225 mozilla-plugin-/11428 (216.985 ms): => unblock [PF: 0]
307.238 mozilla-plugin-/11428 (217.002 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fff06a44c7c, options: 0x0, ru: 0x0) => 0x2ca6
307.459 run-mozilla.sh/11425 (274.198 ms): => unblock [PF: 0]
307.466 run-mozilla.sh/11425 (274.209 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fff313372ec, options: 0x0, ru: 0x0) => 0x2ca4
534.117 run-mozilla.sh/11425 (221.398 ms): => unblock [PF: 0]
534.128 run-mozilla.sh/11425 (221.413 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fff313370ec, options: 0x0, ru: 0x0) => 0x2ca7
556.522 run-mozilla.sh/11425 (21.759 ms): execve(0x173c110, 0x173b800, 0x173acc0, 0x7fff31337040, 0x0, 0x0) (fpath: /usr/lib64/firefox-3.6/run-mozilla.sh) => 0x0
597.105 run-mozilla.sh/11435 (23.718 ms): #PF 64: [ __execve]: => /usr/lib64/firefox-3.6/firefox offset: 0x6d8 page: 0 (W.K)
597.230 run-mozilla.sh/11435 (30.228 ms): execve(0x1a7c860, 0x1a81560, 0x1a88140, 0x7fffd937b050, 0x0, 0x7f30498921e8) (fpath: /usr/lib64/firefox-3.6/firefox) => 0x0
628.920 run-mozilla.sh/11435 (20.541 ms): open(filename: 0x7fff35246ec4, flags: 0x0, mode: 0x1b6) (fpath: /etc/gre.d/gre64.conf) => 0x3
767.480 run-mozilla.sh/11435 (29.628 ms): #PF 579: [ memset]: => /usr/lib64/xulrunner-1.9.2/libxul.so offset: 0x1ce7a0 page: 462 (W.U)
1179.170 run-mozilla.sh/11435 (26.734 ms): => unblock [PF: 1]
1179.270 run-mozilla.sh/11435 (26.864 ms): #PF 2944: [ 7fb7a42db4e0]: => /usr/lib64/xulrunner-1.9.2/libxul.so offset: 0xe814e0 page: 3713 (R.U) ... [unfini
1179.270 run-mozilla.sh/11435 (26.864 ms): => blocked [state: 0x02 cputime: 0.010 ms #PF: 1]
1193.124 run-mozilla.sh/11435 (40.718 ms): #PF 2944: [ 7fb7a42db4e0]: => /usr/lib64/xulrunner-1.9.2/libxul.so offset: 0xe814e0 page: 3713 (R.U)
1730.468 run-mozilla.sh/11435 (20.649 ms): access(filename: 0x7fb7a5981040, mode: 0x4) (fpath: /usr/share/X11/locale/en_US.UTF-8/XLC_LOCALE) => 0x0
1984.050 run-mozilla.sh/11425 (1417.131 ms): => unblock [PF: 0]
1984.070 run-mozilla.sh/11425 (1417.155 ms): wait4(upid: 0xffffffffffffffff, stat_addr: 0x7fffd937b32c, options: 0x0, ru: 0x0) => 0x2cab

Here is an example of looking at a specific task (pid 11435),
scheduler events in a timeframe around an interesting event which we
picked from the above duration filtered output.

$ trace report -t -P -s -f 11435 -T 1152.4,1196
#
# trace events of 'firefox':
#
1152.401 ( 0.002 ms): #PF 2942: [ 7fb7a3005e20]: => /usr/lib64/libssl3.so offset: 0x29e20 page: 41 (R.U)
1152.404 ( 0.001 ms): #PF 2943: [ 7fb7a39375d0]: => /usr/lib64/xulrunner-1.9.2/libxul.so offset: 0x4dd5d0 page: 1245 (R.U)
1152.436 ( 0.031 ms): #PF 2944: [ 7fb7a42db4e0]: => /usr/lib64/xulrunner-1.9.2/libxul.so offset: 0xe814e0 page: 3713 (R.U) ... [unfinished]
1152.436 ( 0.031 ms): => blocked [state: 0x02 cputime: 0.216 ms #PF: 1]
1179.170 (26.734 ms): => unblock [PF: 1]
1179.270 (26.864 ms): => blocked [state: 0x02 cputime: 0.010 ms #PF: 1]
1193.120 (13.850 ms): => unblock [PF: 1]
1193.124 (40.718 ms): #PF 2944: [ 7fb7a42db4e0]: => /usr/lib64/xulrunner-1.9.2/libxul.so offset: 0xe814e0 page: 3713 (R.U)
1193.187 ( 0.061 ms): #PF 2945: [ 7fb7a42c270c]: => /usr/lib64/xulrunner-1.9.2/libxul.so offset: 0xe6870c page: 3688 (R.U) ... [unfinished]
1193.187 ( 0.061 ms): => blocked [state: 0x02 cputime: 0.067 ms #PF: 1]

At this point we'd like to ask for feedback from other users of tracing tools,
which workflow components would you like to see in addition to what the 'trace'
tool is offering?

Comments, bug reports, suggestions welcome,

Thanks,

Thomas, Ingo
---
arch/x86/mm/fault.c | 32 +-
fs/namei.c | 7 +-
include/trace/events/kmem.h | 39 +
include/trace/events/sched.h | 80 +
include/trace/events/vfs.h | 32 +
init/Kconfig | 3 +
kernel/perf_event.c | 2 +-
kernel/sched.c | 2 +
kernel/trace/trace.c | 17 +
.../{perf-trace-perl.txt => perf-script-perl.txt} | 28 +-
...erf-trace-python.txt => perf-script-python.txt} | 88 +-
tools/perf/Documentation/perf-script.txt | 111 ++
tools/perf/Documentation/perf-trace.txt | 96 +-
tools/perf/Makefile | 14 +-
tools/perf/builtin-lock.c | 6 +-
tools/perf/builtin-record.c | 91 +-
tools/perf/builtin-sched.c | 6 +-
tools/perf/builtin-script.c | 826 +++++++++
tools/perf/builtin-trace.c | 1839 +++++++++++++-------
tools/perf/builtin.h | 3 +-
tools/perf/command-list.txt | 1 +
tools/perf/perf.c | 13 +-
.../perf/scripts/python/Perf-Trace-Util/Context.c | 2 +-
tools/perf/syscall-attr.h | 303 ++++
tools/perf/util/event.c | 6 +-
tools/perf/util/header.c | 76 +-
tools/perf/util/header.h | 1 +
tools/perf/util/map.c | 26 +-
tools/perf/util/parse-events.c | 18 +-
tools/perf/util/parse-events.h | 1 +
.../perf/util/scripting-engines/trace-event-perl.c | 6 +-
.../util/scripting-engines/trace-event-python.c | 4 +-
tools/perf/util/session.c | 28 +-
tools/perf/util/session.h | 5 +
34 files changed, 2969 insertions(+), 843 deletions(-)


2010-11-16 21:27:37

by Darren Hart

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
> We are pleased to announce a new tracing tool called 'trace'.
>

Hi Thomas, Ingo - Congrats and Thanks!

> At this point we'd like to ask for feedback from other users of tracing tools,
> which workflow components would you like to see in addition to what the 'trace'
> tool is offering?
>
> Comments, bug reports, suggestions welcome,

The first thing that comes to mind is trace_marker:

echo "Test point A" > /sys/kernel/debug/tracing/trace_marker

I've found this sort of markup to be useful using ftrace (with C
equivalents embedded in the test case). Is this supported?

Thanks,

--
Darren Hart
Yocto Linux Kernel

2010-11-16 21:36:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'


* Thomas Gleixner <[email protected]> wrote:

> Todo list:
>
> [...]
>
> - Implement -p NNN (trace existing pid) option

Update, 'trace -p <pid> ...' is now implemented via:

da46d9f09699: trace: Add -p <pid> recording

I have just pushed it out into -tip.

Thanks,

Ingo

2010-11-16 22:00:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'


* Darren Hart <[email protected]> wrote:

> On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
> >We are pleased to announce a new tracing tool called 'trace'.
> >
>
> Hi Thomas, Ingo - Congrats and Thanks!

You are welcome :)

> > At this point we'd like to ask for feedback from other users of tracing tools,
> > which workflow components would you like to see in addition to what the 'trace'
> > tool is offering?
> >
> > Comments, bug reports, suggestions welcome,
>
> The first thing that comes to mind is trace_marker:
>
> echo "Test point A" > /sys/kernel/debug/tracing/trace_marker
>
> I've found this sort of markup to be useful using ftrace (with C equivalents
> embedded in the test case). Is this supported?

Yes, LatencyTop uses something similar IIRC, via a prctl() hack: it uses
prctl(PR_SET_NAME) to inject up to 16 characters into the comm - and then every perf
event gets that comm. You can utilize that straight away if you need a marker
solution right now.

A cleaner, more generic approach would be a more free-form ASCII event facility for
this: a generic 'user-space triggered event' with injection capabilities.

To inject it, we could put that into a prctl() as well, for easy, single-syscall
access from C. We already have two perf prctls: PR_TASK_PERF_EVENTS_DISABLE,
PR_TASK_PERF_EVENTS_ENABLE - and this would be a third one: PR_TASK_PERF_USER_EVENT.

This would, like the existing perf_event_task_enable()/disable() methods, loop
through current events and inject a string into matching user events.

Ideally i'd like to see basically user-space access to trace_printk(), and proper
integration into /debug/tracing/events/ enumeration and availability - not a
ftrace-specific and admin-only hack like /sys/kernel/debug/tracing/trace_marker is
today.

Would you be interested in helping out with (and testing) such a more generic
approach?

Thanks,

Ingo

2010-11-16 22:03:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, 2010-11-16 at 22:59 +0100, Ingo Molnar wrote:
>
>
> Would you be interested in helping out with (and testing) such a more generic
> approach?

how about having perf-record open a named pipe and rewriting everything
that comes in through that as an event and storing it in the buffer?

2010-11-16 22:07:14

by Darren Hart

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/16/2010 01:59 PM, Ingo Molnar wrote:
>
> * Darren Hart<[email protected]> wrote:
>
>> On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
>>> We are pleased to announce a new tracing tool called 'trace'.
>>>
>>
>> Hi Thomas, Ingo - Congrats and Thanks!
>
> You are welcome :)
>
>>> At this point we'd like to ask for feedback from other users of tracing tools,
>>> which workflow components would you like to see in addition to what the 'trace'
>>> tool is offering?
>>>
>>> Comments, bug reports, suggestions welcome,
>>
>> The first thing that comes to mind is trace_marker:
>>
>> echo "Test point A"> /sys/kernel/debug/tracing/trace_marker
>>
>> I've found this sort of markup to be useful using ftrace (with C equivalents
>> embedded in the test case). Is this supported?
>
> Yes, LatencyTop uses something similar IIRC, via a prctl() hack: it uses
> prctl(PR_SET_NAME) to inject up to 16 characters into the comm - and then every perf
> event gets that comm. You can utilize that straight away if you need a marker
> solution right now.

OK, definitely much more limited than trace_marker.

>
> A cleaner, more generic approach would be a more free-form ASCII event facility for
> this: a generic 'user-space triggered event' with injection capabilities.
>
> To inject it, we could put that into a prctl() as well, for easy, single-syscall
> access from C. We already have two perf prctls: PR_TASK_PERF_EVENTS_DISABLE,
> PR_TASK_PERF_EVENTS_ENABLE - and this would be a third one: PR_TASK_PERF_USER_EVENT.
>
> This would, like the existing perf_event_task_enable()/disable() methods, loop
> through current events and inject a string into matching user events.
>
> Ideally i'd like to see basically user-space access to trace_printk(), and proper

Right, this would be ideal.

> integration into /debug/tracing/events/ enumeration and availability - not a
> ftrace-specific and admin-only hack like /sys/kernel/debug/tracing/trace_marker is
> today.
>
> Would you be interested in helping out with (and testing) such a more generic
> approach?

I would. I need some time to familiarize myself with perf first (I've
been a relatively happy ftrace/trace-cmd user for the last couple years).

--
Darren Hart
Yocto Linux Kernel

2010-11-16 22:09:15

by Darren Hart

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/16/2010 02:03 PM, Peter Zijlstra wrote:
> On Tue, 2010-11-16 at 22:59 +0100, Ingo Molnar wrote:
>>
>>
>> Would you be interested in helping out with (and testing) such a more generic
>> approach?
>
> how about having perf-record open a named pipe and rewriting everything
> that comes in through that as an event and storing it in the buffer?

When I'm tracing an app or test-case, I want to be able to quickly add
and change strings. I like the trace_printk sort of access as it avoids
having to do a bunch of string formatting (sprintf()) calls in the test app.

--
Darren Hart
Yocto Linux Kernel

2010-11-16 22:09:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'


* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2010-11-16 at 22:59 +0100, Ingo Molnar wrote:
> >
> >
> > Would you be interested in helping out with (and testing) such a more generic
> > approach?
>
> how about having perf-record open a named pipe and rewriting everything that comes
> in through that as an event and storing it in the buffer?

Yeah, that would work too in most cases - except if there's no VFS connection
between tracer and tracee. (as it might be in the case of say chroot environments,
etc.)

I think it's better to have kernel solutions that dont modify the VFS beyond the
trace.data session info. A task could thus generate a user-space event without
having to negotiate with the tracer.

Thanks,

Ingo

2010-11-16 22:17:35

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, Nov 16, 2010 at 01:27:35PM -0800, Darren Hart wrote:
> On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
>> We are pleased to announce a new tracing tool called 'trace'.
>>
>
> Hi Thomas, Ingo - Congrats and Thanks!
>
>> At this point we'd like to ask for feedback from other users of tracing tools,
>> which workflow components would you like to see in addition to what the 'trace'
>> tool is offering?
>>
>> Comments, bug reports, suggestions welcome,
>
> The first thing that comes to mind is trace_marker:
>
> echo "Test point A" > /sys/kernel/debug/tracing/trace_marker
>
> I've found this sort of markup to be useful using ftrace (with C
> equivalents embedded in the test case). Is this supported?


No, the trace_marker file sends trace_printk events. And trace_printk
events needs to be converted to use the unified trace event interface
(ie: implement a struct ftrace_event_call).

This shouldn't be so hard, and there are several ways we could do this.

An idea is to reproduce the kernel file hierarchy in a "printk" event
subsystem, but this implies to allow subsystems nesting.

Imagine you have two trace_printk(), one in kernel/sched.c:117
and one in kernel/time/tick-sched.c:228
The result would be:

$ ls /sys/kernel/debug/tracing/events/printk

/sys/kernel/debug/tracing/events/printk:
enable filter kernel/

/sys/kernel/debug/tracing/events/printk/kernel:
enable filter sched.c/ time/

/sys/kernel/debug/tracing/events/printk/kernel/sched.c:
enable filter 117/

/sys/kernel/debug/tracing/events/printk/kernel/sched.c/117:
enable filter format id

/sys/kernel/debug/tracing/events/printk/kernel/time:
enable filter tick-sched.c/

/sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c:
enable filter 228/

/sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c/228:
enable filter format id


The format would be quite easy, and only one field for the whole string (strloc).

Not sure what to do with trace_marker. I just think we can add it as:
/sys/kernel/debug/tracing/events/printk/sys/kernel/debug/tracing/trace_marker

<:o)


But may be the whole idea is just fancy and nobody will care, and would just
a simple single event in a printk subsystem, on which you can use a kind of
virtual filter:

echo "path != kernel/whatever.c:226" > /sys/kernel/debug/tracing/events/printk/filter

would turn on every trace_printk() but the one in the given path.

Dunno, I like both ideas. I prefer the first one which looks to me more flexible: could be
useful to list the user his trace_printks for example and implement an interface for him
to quickly select the trace_printk he wants.

For example I'm currently working with dozens of trace_printk() and I would be very happy
to turn some of them off half of the time.

2010-11-16 22:48:13

by Darren Hart

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/16/2010 02:09 PM, Darren Hart wrote:
> On 11/16/2010 02:03 PM, Peter Zijlstra wrote:
>> On Tue, 2010-11-16 at 22:59 +0100, Ingo Molnar wrote:
>>>
>>>
>>> Would you be interested in helping out with (and testing) such a more
>>> generic
>>> approach?
>>
>> how about having perf-record open a named pipe and rewriting everything
>> that comes in through that as an event and storing it in the buffer?
>
> When I'm tracing an app or test-case, I want to be able to quickly add
> and change strings. I like the trace_printk sort of access as it avoids
> having to do a bunch of string formatting (sprintf()) calls in the test
> app.

Sorry - fprintf would work for Peter's suggestion. I had the PR_SET_NAME
with a short string or another call with jsut a plain C string - those
would require the user doing the string formatting first.

--
Darren Hart
Yocto Linux Kernel

2010-11-17 01:37:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, Nov 16, 2010 at 10:04:40PM +0100, Thomas Gleixner wrote:
> If you've booted the new kernel you can run 'trace check' to double
> check that all events used by the tool are available:
>
> $ trace check
>
> Checking whether the kernel has all required events ...
> ... Checking event raw_syscalls:sys_enter:r : ok
> ...
> ... Checking event sched:sched_stat_runtime:r : ok
> Good: all required event types are supported by this kernel.
> The 'trace' utility will be fully functional.

For the benefit of people who create tracepoints, what restrictions
does trace have with respect to event types, and is this anticipated
to change in the future?

> The combo diffstat of the tool is appended at the end of the mail. The
> overwhelming majority of changes is on the tooling side - it uses
> existing perf events facilities and features to implement the tool.

What about the filtering and other general features/functionality of
ftrace? Is the anticipation that this will be ported over to perf?
What about things like blktrace?

Not that I expect all of this will be working with this initial
release, but I'm curious about the long-term roadmap of this
interface. (Obviously subject to change as we learn more, etc. But
I'd love to hear what your current thoughts and plans are.)

Thanks,

- Ted

P.S. What about sysrq-z? Is that going to stay with ftrace side of
the tracing architecture?

2010-11-17 01:47:44

by Steven Rostedt

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, 2010-11-16 at 20:37 -0500, Ted Ts'o wrote:

> For the benefit of people who create tracepoints, what restrictions
> does trace have with respect to event types, and is this anticipated
> to change in the future?

Note, I just posted an RFC stable event patch set. I would like any tool
that does general analysis, to use stable events and not random raw
events created by the perspective maintainers.

The tool could tap into the raw events, but I don't like the "trace
check" I think anything that the trace needs should be guaranteed there
(with the stable tracepoints).

The tracepoints used for general analysis should be stable, anything
else is just shear bonus.

-- Steve

2010-11-17 03:16:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, Nov 16, 2010 at 08:47:40PM -0500, Steven Rostedt wrote:
>
> Note, I just posted an RFC stable event patch set. I would like any tool
> that does general analysis, to use stable events and not random raw
> events created by the perspective maintainers.
>
> The tool could tap into the raw events, but I don't like the "trace
> check" I think anything that the trace needs should be guaranteed there
> (with the stable tracepoints).
>
> The tracepoints used for general analysis should be stable, anything
> else is just shear bonus.

Which is fine with me, so long as it is accepted that the 'trace' tool
is not targetted at kernel developers (who would probably like to use
a combination of stable and unstable tracepoints).

Do we all agree on what the intended target audience is for this
'trace' tool?

My one concern of having a tool that doesn't support the unstable
tracepoints is that if the kernel developers aren't using it for their
day-to-day use, it won't get attention/love, and it risks the fate of
'systemtap' --- ignored by kernel developers and treated as if it
doesn't exist.

Maybe the answer is there's a #ifdef, and there's one version of the
tool that is intended for use by kernel developers, and one which is
restricted to the stable interface that we give to the hoi polloi?

- Ted

2010-11-17 03:34:06

by Steven Rostedt

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, 2010-11-16 at 22:16 -0500, Ted Ts'o wrote:
> On Tue, Nov 16, 2010 at 08:47:40PM -0500, Steven Rostedt wrote:
> >
> > Note, I just posted an RFC stable event patch set. I would like any tool
> > that does general analysis, to use stable events and not random raw
> > events created by the perspective maintainers.
> >
> > The tool could tap into the raw events, but I don't like the "trace
> > check" I think anything that the trace needs should be guaranteed there
> > (with the stable tracepoints).
> >
> > The tracepoints used for general analysis should be stable, anything
> > else is just shear bonus.
>
> Which is fine with me, so long as it is accepted that the 'trace' tool
> is not targetted at kernel developers (who would probably like to use
> a combination of stable and unstable tracepoints).
>
> Do we all agree on what the intended target audience is for this
> 'trace' tool?
>
> My one concern of having a tool that doesn't support the unstable
> tracepoints is that if the kernel developers aren't using it for their
> day-to-day use, it won't get attention/love, and it risks the fate of
> 'systemtap' --- ignored by kernel developers and treated as if it
> doesn't exist.
>
> Maybe the answer is there's a #ifdef, and there's one version of the
> tool that is intended for use by kernel developers, and one which is
> restricted to the stable interface that we give to the hoi polloi?

I was not involved in making of this tool, but I would envision that it
would probably have special options to enable the raw events. That is,
basic users would never see the events that kernel developers use, but
with a simple switch (-r?) it would enable all events (both stable and
raw). It just wont have the ability to do any analysis of these events.
That is, it can just report the events (much like what trace-cmd does
today), but they will just exist as an event, not anything the tool can
give true meaning to. But the kernel developer should be able to
understand it.(*)

If we get a consensus on stable events, I would like to add
a /debug/events directory that will be similar to
the /debug/tracing/events, except that it will have the new format for
all raw events.

-- Steve

(*) Currently in trace-cmd I have plugins. These plugins are simple
functions to read events and either output better displays of the event
or do something special (Avi wrote some for his kvm events). I would
like to create a tools/trace/plugins directory that this trace tool
could use. A kernel developer could write the code to pretty print their
event and show much better information that reading a bunch of hex
fields. But more on this later, I would like to work on getting stable
events done first.

2010-11-17 08:31:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'


* Frederic Weisbecker <[email protected]> wrote:

> On Tue, Nov 16, 2010 at 01:27:35PM -0800, Darren Hart wrote:
> > On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
> >> We are pleased to announce a new tracing tool called 'trace'.
> >>
> >
> > Hi Thomas, Ingo - Congrats and Thanks!
> >
> >> At this point we'd like to ask for feedback from other users of tracing tools,
> >> which workflow components would you like to see in addition to what the 'trace'
> >> tool is offering?
> >>
> >> Comments, bug reports, suggestions welcome,
> >
> > The first thing that comes to mind is trace_marker:
> >
> > echo "Test point A" > /sys/kernel/debug/tracing/trace_marker
> >
> > I've found this sort of markup to be useful using ftrace (with C
> > equivalents embedded in the test case). Is this supported?
>
>
> No, the trace_marker file sends trace_printk events. And trace_printk
> events needs to be converted to use the unified trace event interface
> (ie: implement a struct ftrace_event_call).
>
> This shouldn't be so hard, and there are several ways we could do this.
>
> An idea is to reproduce the kernel file hierarchy in a "printk" event
> subsystem, but this implies to allow subsystems nesting.
>
> Imagine you have two trace_printk(), one in kernel/sched.c:117
> and one in kernel/time/tick-sched.c:228
> The result would be:
>
> $ ls /sys/kernel/debug/tracing/events/printk
>
> /sys/kernel/debug/tracing/events/printk:
> enable filter kernel/
>
> /sys/kernel/debug/tracing/events/printk/kernel:
> enable filter sched.c/ time/
>
> /sys/kernel/debug/tracing/events/printk/kernel/sched.c:
> enable filter 117/
>
> /sys/kernel/debug/tracing/events/printk/kernel/sched.c/117:
> enable filter format id
>
> /sys/kernel/debug/tracing/events/printk/kernel/time:
> enable filter tick-sched.c/
>
> /sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c:
> enable filter 228/
>
> /sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c/228:
> enable filter format id
>
>
> The format would be quite easy, and only one field for the whole string (strloc).
>
> Not sure what to do with trace_marker. I just think we can add it as:
> /sys/kernel/debug/tracing/events/printk/sys/kernel/debug/tracing/trace_marker
>
> <:o)
>
>
> But may be the whole idea is just fancy and nobody will care, [...]

I think it's an excellent idea, and it could also give us the framework to integrate
the dynamic_printk points.

> [...] and would just a simple single event in a printk subsystem, on which you can
> use a kind of virtual filter:
>
> echo "path != kernel/whatever.c:226" > /sys/kernel/debug/tracing/events/printk/filter
>
> would turn on every trace_printk() but the one in the given path.
>
> Dunno, I like both ideas. I prefer the first one which looks to me more flexible:
> could be useful to list the user his trace_printks for example and implement an
> interface for him to quickly select the trace_printk he wants.

I prefer the first one too.

> For example I'm currently working with dozens of trace_printk() and I would be
> very happy to turn some of them off half of the time.

I guess we could try such a patch. If you send a prototype i'd be interested in
testing it out.

Thanks,

Ingo

2010-11-17 09:50:07

by Philipp Marek

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

Hello Thomas!

Thomas Gleixner <tglx <at> linutronix.de> writes:
> We are pleased to announce a new tracing tool called 'trace'.
...
> read(3:</usr/bin/firefox>, buf: 0x7fffcc4caf40, count: 0x80)

Is there a way to get (parts of) the data being read/written/sent/received?
Something like the -s flag for strace would be nice, but that already seems
taken for scheduler events.


Regards,

Phil

2010-11-17 11:36:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> * Frederic Weisbecker <[email protected]> wrote:
>
> > On Tue, Nov 16, 2010 at 01:27:35PM -0800, Darren Hart wrote:
> > > On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
> > >> We are pleased to announce a new tracing tool called 'trace'.
> > >>
> > >
> > > Hi Thomas, Ingo - Congrats and Thanks!
> > >
> > >> At this point we'd like to ask for feedback from other users of tracing tools,
> > >> which workflow components would you like to see in addition to what the 'trace'
> > >> tool is offering?
> > >>
> > >> Comments, bug reports, suggestions welcome,
> > >
> > > The first thing that comes to mind is trace_marker:
> > >
> > > echo "Test point A" > /sys/kernel/debug/tracing/trace_marker
> > >
> > > I've found this sort of markup to be useful using ftrace (with C
> > > equivalents embedded in the test case). Is this supported?
> >
> >
> > No, the trace_marker file sends trace_printk events. And trace_printk
> > events needs to be converted to use the unified trace event interface
> > (ie: implement a struct ftrace_event_call).
> >
> > This shouldn't be so hard, and there are several ways we could do this.
> >
> > An idea is to reproduce the kernel file hierarchy in a "printk" event
> > subsystem, but this implies to allow subsystems nesting.
> >
> > Imagine you have two trace_printk(), one in kernel/sched.c:117
> > and one in kernel/time/tick-sched.c:228
> > The result would be:
> >
> > $ ls /sys/kernel/debug/tracing/events/printk
> >
> > /sys/kernel/debug/tracing/events/printk:
> > enable filter kernel/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel:
> > enable filter sched.c/ time/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/sched.c:
> > enable filter 117/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/sched.c/117:
> > enable filter format id
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/time:
> > enable filter tick-sched.c/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c:
> > enable filter 228/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c/228:
> > enable filter format id
> >
> >
> > The format would be quite easy, and only one field for the whole string (strloc).
> >
> > Not sure what to do with trace_marker. I just think we can add it as:
> > /sys/kernel/debug/tracing/events/printk/sys/kernel/debug/tracing/trace_marker
> >
> > <:o)
> >
> >
> > But may be the whole idea is just fancy and nobody will care, [...]
>
> I think it's an excellent idea, and it could also give us the framework to integrate
> the dynamic_printk points.
>
> > [...] and would just a simple single event in a printk subsystem, on which you can
> > use a kind of virtual filter:
> >
> > echo "path != kernel/whatever.c:226" > /sys/kernel/debug/tracing/events/printk/filter
> >
> > would turn on every trace_printk() but the one in the given path.
> >
> > Dunno, I like both ideas. I prefer the first one which looks to me more flexible:
> > could be useful to list the user his trace_printks for example and implement an
> > interface for him to quickly select the trace_printk he wants.
>
> I prefer the first one too.
>
> > For example I'm currently working with dozens of trace_printk() and I would be
> > very happy to turn some of them off half of the time.
>
> I guess we could try such a patch. If you send a prototype i'd be interested in
> testing it out.

I don't see the point, the kernel shouldn't contain any trace_printk()s
to begin with..

2010-11-17 12:48:09

by Török Edwin

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, 16 Nov 2010 22:04:40 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:

> After years of efforts we have not succeeded in meeting (let alone
> exceeding) the utility of decades-old user-space tracing tools such as
> strace - except for a few new good tools such as PowerTop and
> LatencyTop.
>
> 'trace' is our shot at improving the situation: it aims at providing a
> simple to use and straightforward tracing tool based on the perf
> infrastructure and on the well-known perf profiling workflow:

Interesting, I just tried it using the -tip kernel.
Looks like a good start, but there are some features I'm missing, see below.

> The combo diffstat of the tool is appended at the end of the mail.

How did you generate that? In other words is there a git command I can
use to apply the mm_pagefault/vfs_getname/trace patches on top of
2.6.36? Doing a git merge tip/tmp.perf/trace merged 2.6.37 for me.
For now I'm running a -tip kernel to test.

> - Unintrusive tracing

Nice! At a quick glance it didn't seem to impact performance much, I'll
have to do some measurements.

>
> - It will record filterable information about the session:
>
> * syscalls

There is no strace-like "not finished":
clamd/11086 ( 0.000 ms): pread(0x9, 0x7f142f7f1000, 0x2000, 0xa8000, 0x2000, 0xa8) => 0x1
clamd/11087 ( 0.014 ms): futex(uaddr: 0x7f143fad3260, op: 0x80, val: 0x2, utime: 0x0, uaddr2: 0x7f143fad3260, val3: 0x2b4f) => 0x0
clamd/11087 ( 0.001 ms): futex(uaddr: 0x7f143fad3260, op: 0x81, val: 0x1, utime: 0xfe0, uaddr2: 0x7f143fad3260, val3: 0x2b4f) => 0x0
clamd/11086 ( 0.011 ms): ... [continued]: pread() => 0x2000

I assume that the first pread with 0ms time got interrupted, and it resumed/continued later. Is that the case?
Is the 0x1 return value bogus then? If so it would be less confusing if you output what strace usually does (".. not finished" IIRC).

> * task lifetime events (fork/clone/exit)
> * pagefaults
> * various mm events (mmap)
> * scheduling events
>
> * (support for more events is being worked on)

Is it possible to use 'trace record' and 'perf record' at the same time?
I.e to do both strace-like (trace) and oprofile-like (perf record)
recording?

>
> - Iterative analysis of an existing session/trace without having
> to re-run the trace, with different filters and options (trace
> data is saved on disk)
>
> - Concurrent and multi-user safe trace recording: multiple users
> can trace, or the same user can trace in multiple different
> tracing sessions.
>
> - Available to plain users - the permission model is not
> completely done yet, it needs a temporary workaround currently:
>
> echo -1 > /proc/sys/kernel/perf_event_paranoid

I see --call-graph in the trace record -h output, but it doesn't seem to work on x86_64 for me.
Do I need to pass some additional parameters to trace report, do I need additional CONFIG_ turned on?
I have:
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
# CONFIG_STACK_TRACER is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

Does it use the same code for callgraphs as ftrace, so I could choose userspace stacktrace?

>
> Todo list:
>
> - Improve output formatting
> - Utilize more events: networking, block IO
> - Add more higher level trace points
> - Implement 'trace diff' utility to simplify the comparison of
> traces
> - Improve the permission model (work in progress)
> - Implement -p NNN (trace existing pid) option

1. I miss sorting.
Sure I can do something like this: trace report -d <N> |
sort -g -k2 -t\(, but I have to choose <N> first, and it has to output
all the data before sorting can begin (including the text part).
It would be better if I 'trace' itself could sort based on the syscall
latency time, and show me the slowest syscalls first.

2. Another feature I miss is ability to sum up syscall times per syscall, and output a summary.

3. The -s output is not very intuitive, the sched tracer output in /sys/debug/tracing was more intuitive IIRC,
showing that you swithced from process A to process B. I would like to see that in the -s output
(and if switching to a process not traced, it should say so).

4. Also it would be useful to know if during a long syscall (futex, pread, etc.) the CPU time was wasted, or
another thread of same process got scheduled in. Calculating the overhead of doing that would be interesting too.
I think this could be done with a tool that postprocesses the results, but maybe its easier to do in the tool itself.
For example:
thread1: futex(...) [... not finished]
thread2: (delta1 ms) ... scheduled in place of thread1 (switch thread1 -> thread2)
thread2: ... do stuff ...
thread1: (delta2 ms) ... scheduled in place of thread2 (switch thread2 -> thread1)
thread1: (delta3 ms) [... continued] futex() => ...

I would be interested to see delta1, and delta2, perhaps totalled up per syscall, per thread or per process.

5. Also it would be good if it could tell what a futex is used for:
1122.567 clamd/11082 (139.120 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x883, utime: 0x0, uaddr2: 0x0, val3: 0x441) => 0x0
1712.922 clamd/11082 (309.925 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x891, utime: 0x0, uaddr2: 0x0, val3: 0x448) => 0x0
2014.289 clamd/11082 (244.312 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x895, utime: 0x0, uaddr2: 0x0, val3: 0x44a) => 0x0
3639.956 clamd/11082 (104.370 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x919, utime: 0x0, uaddr2: 0x0, val3: 0x48c) => 0x0
4371.086 clamd/11082 (158.304 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x951, utime: 0x0, uaddr2: 0x0, val3: 0x4a8) => 0x0
5204.773 clamd/11082 (133.566 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x999, utime: 0x0, uaddr2: 0x0, val3: 0x4cc) => 0x0

Using /proc/`pidof clamd`/maps I found out that 0x1b59d44 is on the heap. Such large delays are probably
due to a pthread_cond_wait/cond_timed_wait, and after spending a bit of time in gdb I found out exactly which condition var it is:
(gdb) p &thr_pool->queueable_bulk_cond.__data.__futex
$10 = (unsigned int *) 0x1b59d44

But I don't want to repeat that for each futex call with a different address.
It would be good if 'trace' could figure out what kind of futex wait this is (i.e. mutex lock or condition variable wait), I hope
the kernel has a wait to tell them apart.
Large delays for waiting on condition variables is expected, and I want to exclude them from my trace.
However I do want to see if there are large delays on mutex locks, that is definetely something I am interested in.
Now of course 'strace' can't make the difference between these two, but 'trace' being a kernel feature might have more information.

Of course it'd be even better if it could show the name of the condition variable, but if the condition variable is on the heap that is hard to do.
I'll have to see if there is a way to script gdb to lookup the variable associated with an address (by looking through the local and global vars on the current thread).

>
> Main differences to strace:
>
> - Low impact recording
>
> - Arguments are printed in plain hex representation, except for
> important arguments like filenames, which are resolved after
> the syscall is recorded. 'trace' wont do full decoding like
> strace does.

Did you consider providing a tool to post-process the output and
perform that decoding?

> Another thing that is recorded are page fault events. These can be
> analyzed via the -p option:

Error: switch `p' requires a value
-P works though.

Best regards,
--Edwin

2010-11-17 12:51:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:47 +0200, Török Edwin wrote:
> I see --call-graph in the trace record -h output, but it doesn't seem
> to work on x86_64 for me.

If you want to unwind userspace you need to build everything with
framepointers -- sadly some people have been lobbying to remove
framepointers from all distro builds, even though on x86_64 its nearly
free (i386 does have a significant performance benefit, sadly).

There's some work on-going to simply copy out the top of stack and let
dwarves go wild at it in userspace, but that isn't done yet.

2010-11-17 12:53:55

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > For example I'm currently working with dozens of trace_printk() and I would be
> > > very happy to turn some of them off half of the time.
> >
> > I guess we could try such a patch. If you send a prototype i'd be interested in
> > testing it out.
>
> I don't see the point, the kernel shouldn't contain any trace_printk()s
> to begin with..


It's oriented toward developers. Those who use dozens of tracepoints in
their tree because they are debugging something or developing a new feature,
they might to deactivate/reactivate some of these independant points.

This can also apply to dynamic_printk of course.

Well, the very first and main point is to standardize trace_printk into
a trace event so that it gets usable by perf tools. I have been asked many
times "how to use trace_printk() with perf?".

2010-11-17 13:02:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
> On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > > For example I'm currently working with dozens of trace_printk() and I would be
> > > > very happy to turn some of them off half of the time.
> > >
> > > I guess we could try such a patch. If you send a prototype i'd be interested in
> > > testing it out.
> >
> > I don't see the point, the kernel shouldn't contain any trace_printk()s
> > to begin with..
>
>
> It's oriented toward developers. Those who use dozens of tracepoints in
> their tree because they are debugging something or developing a new feature,
> they might to deactivate/reactivate some of these independant points.
>
> This can also apply to dynamic_printk of course.
>
> Well, the very first and main point is to standardize trace_printk into
> a trace event so that it gets usable by perf tools. I have been asked many
> times "how to use trace_printk() with perf?".

Thing is, since its these dev who add the trace_printk()s to begin with,
I don't see the point in splitting them out, if you didn't want them why
did you add them to begin with?!

As to the trace_printk() to perf interface, you could do like mingo did
and create a fake event and use the regular tracepoint interface, or
hook it up directly and create a PERF_RECORD_TEXT field.

Personally I like the trace_printk() as a TRACE_EVENT(printk), it also
allows removing lots of the special casing concerning trace_printk from
ftrace.

2010-11-17 13:05:23

by Török Edwin

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 17 Nov 2010 13:51:52 +0100
Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-11-17 at 14:47 +0200, Török Edwin wrote:
> > I see --call-graph in the trace record -h output, but it doesn't
> > seem to work on x86_64 for me.
>
> If you want to unwind userspace you need to build everything with
> framepointers -- sadly some people have been lobbying to remove
> framepointers from all distro builds, even though on x86_64 its nearly
> free (i386 does have a significant performance benefit, sadly).
>
> There's some work on-going to simply copy out the top of stack and let
> dwarves go wild at it in userspace, but that isn't done yet.

Even without frame-pointers I should see at least the first caller in
userspace, since that can be obtained by reading the return address of
the syscall, right?
I don't see any difference in the output of 'trace report' thats why
I'm asking it is implemented at all right now, or I need additional
flags to see it.

I do see some difference in output of 'perf record -g/perf report'
though.

Best regards,
--Edwin

2010-11-17 13:10:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:05 +0200, Török Edwin wrote:
> On Wed, 17 Nov 2010 13:51:52 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, 2010-11-17 at 14:47 +0200, Török Edwin wrote:
> > > I see --call-graph in the trace record -h output, but it doesn't
> > > seem to work on x86_64 for me.
> >
> > If you want to unwind userspace you need to build everything with
> > framepointers -- sadly some people have been lobbying to remove
> > framepointers from all distro builds, even though on x86_64 its nearly
> > free (i386 does have a significant performance benefit, sadly).
> >
> > There's some work on-going to simply copy out the top of stack and let
> > dwarves go wild at it in userspace, but that isn't done yet.
>
> Even without frame-pointers I should see at least the first caller in
> userspace, since that can be obtained by reading the return address of
> the syscall, right?

Right.

> I don't see any difference in the output of 'trace report' thats why
> I'm asking it is implemented at all right now, or I need additional
> flags to see it.

Ah, I haven't actually looked at any of the perf-trace code yet, so I'm
afraid I can't answer that particular question.

> I do see some difference in output of 'perf record -g/perf report'
> though.

ok.

2010-11-17 13:11:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
> > On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > > > For example I'm currently working with dozens of trace_printk() and I would be
> > > > > very happy to turn some of them off half of the time.
> > > >
> > > > I guess we could try such a patch. If you send a prototype i'd be interested in
> > > > testing it out.
> > >
> > > I don't see the point, the kernel shouldn't contain any trace_printk()s
> > > to begin with..
> >
> >
> > It's oriented toward developers. Those who use dozens of tracepoints in
> > their tree because they are debugging something or developing a new feature,
> > they might to deactivate/reactivate some of these independant points.
> >
> > This can also apply to dynamic_printk of course.
> >
> > Well, the very first and main point is to standardize trace_printk into
> > a trace event so that it gets usable by perf tools. I have been asked many
> > times "how to use trace_printk() with perf?".
>
> Thing is, since its these dev who add the trace_printk()s to begin with, I don't
> see the point in splitting them out, if you didn't want them why did you add them
> to begin with?!

That's a common workflow: lots of printks (trace_printk's) put all around the code -
and sometimes one set of tracepoints is needed, one time another set.

_If_ we succeed in presenting them like Frederic suggested it, and if we make the
turning on/off _simpler_ (no kernel modification) and faster (no kernel reboot) via
the tooling, people like Frederic might start using it.

I dont think we should fight the workflow itself - it makes sense.

The only question is whether we can represent it all in a nicer fashion than 'modify
the source code and reboot'. If we cannot then there's no point - but i'm not sure
about it and Frederic seems to be convinced too that he can make such a switch
on/off facility intuitive. We'll only see if we try it.

Also, i dont see any harm - do you?

Thanks,

Ingo

2010-11-17 13:24:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'


* Ted Ts'o <[email protected]> wrote:

> On Tue, Nov 16, 2010 at 10:04:40PM +0100, Thomas Gleixner wrote:
> > If you've booted the new kernel you can run 'trace check' to double
> > check that all events used by the tool are available:
> >
> > $ trace check
> >
> > Checking whether the kernel has all required events ...
> > ... Checking event raw_syscalls:sys_enter:r : ok
> > ...
> > ... Checking event sched:sched_stat_runtime:r : ok
> > Good: all required event types are supported by this kernel.
> > The 'trace' utility will be fully functional.
>
> For the benefit of people who create tracepoints, what restrictions
> does trace have with respect to event types, and is this anticipated
> to change in the future?

There are no conceptual restrictions - this v1 version of the tool uses a (small)
subset of tracepoints right now.

ext4 tracepoints will be used once someone writes a trace code (or plugin) for it -
i think beyond simply displaying that trace data it would also be useful to
interpret it to a certain degree? They could also interact with the block events: so
we could track which inode (or other ext4 data structure) generates what IO, what
the latencies are and which task originated things. We could track what the wait
reasons are.

At that point (when ext4 event support is added) we'd add 'trace check' test as
well, and would generally try to make sure that if distros enable events that all
commonly used tracepoints are available as well.

I.e. the long term goal would be to create a widely available tool, which, amongst
other things, can be used to record and report ext4 events as well - with no kernel
reboots or tool rebuilds required.

> > The combo diffstat of the tool is appended at the end of the mail. The
> > overwhelming majority of changes is on the tooling side - it uses
> > existing perf events facilities and features to implement the tool.
>
> What about the filtering and other general features/functionality of
> ftrace? Is the anticipation that this will be ported over to perf?
> What about things like blktrace?

Yeah, i'd love to see all that available. Filtering is available already on the
kernel perf event side and could be added as an option.

> Not that I expect all of this will be working with this initial
> release, but I'm curious about the long-term roadmap of this
> interface. (Obviously subject to change as we learn more, etc. But
> I'd love to hear what your current thoughts and plans are.)
>
> P.S. What about sysrq-z? Is that going to stay with ftrace side of
> the tracing architecture?

What we are working towards is to unify the whole tracing infrastructure. IMHO it's
not really acceptable that there is an 'ftrace side' and a 'perf side', with feature
overlap and feature mismatch and general confusion.

Thanks,

Ingo

2010-11-17 13:32:15

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

(Please keep the initial Cc list)


On Wed, Nov 17, 2010 at 02:47:58PM +0200, T?r?k Edwin wrote:
> I see --call-graph in the trace record -h output, but it doesn't seem to work on x86_64 for me.
> Do I need to pass some additional parameters to trace report, do I need additional CONFIG_ turned on?
> I have:
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
> # CONFIG_CC_STACKPROTECTOR is not set
> CONFIG_STACKTRACE=y
> CONFIG_USER_STACKTRACE_SUPPORT=y
> # CONFIG_STACK_TRACER is not set
> # CONFIG_DEBUG_STACKOVERFLOW is not set
> # CONFIG_DEBUG_STACK_USAGE is not set
>
> Does it use the same code for callgraphs as ftrace, so I could choose userspace stacktrace?


So, this option comes from the fact this new tools wraps on perf record
command line. So yeah, your callchains are recorded, they are just not
reported by this new tool (yet).

Only perf report know how to deal with them for now.

You can play with that though, recording with "trace record -g" and look
at the result with "perf report", you'll get nice stats about your
callchains.

Now the tracing has yet to make use of it.

Ah and it doesn't use the same code than ftrace, but perf does kernel and
user callchain snapshot too. "-g" just record both.

2010-11-17 13:37:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:10 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
> > > On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> > > > On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > > > > For example I'm currently working with dozens of trace_printk() and I would be
> > > > > > very happy to turn some of them off half of the time.
> > > > >
> > > > > I guess we could try such a patch. If you send a prototype i'd be interested in
> > > > > testing it out.
> > > >
> > > > I don't see the point, the kernel shouldn't contain any trace_printk()s
> > > > to begin with..
> > >
> > >
> > > It's oriented toward developers. Those who use dozens of tracepoints in
> > > their tree because they are debugging something or developing a new feature,
> > > they might to deactivate/reactivate some of these independant points.
> > >
> > > This can also apply to dynamic_printk of course.
> > >
> > > Well, the very first and main point is to standardize trace_printk into
> > > a trace event so that it gets usable by perf tools. I have been asked many
> > > times "how to use trace_printk() with perf?".
> >
> > Thing is, since its these dev who add the trace_printk()s to begin with, I don't
> > see the point in splitting them out, if you didn't want them why did you add them
> > to begin with?!
>
> That's a common workflow: lots of printks (trace_printk's) put all around the code -
> and sometimes one set of tracepoints is needed, one time another set.
>
> _If_ we succeed in presenting them like Frederic suggested it, and if we make the
> turning on/off _simpler_ (no kernel modification) and faster (no kernel reboot) via
> the tooling, people like Frederic might start using it.
>
> I dont think we should fight the workflow itself - it makes sense.
>
> The only question is whether we can represent it all in a nicer fashion than 'modify
> the source code and reboot'. If we cannot then there's no point - but i'm not sure
> about it and Frederic seems to be convinced too that he can make such a switch
> on/off facility intuitive. We'll only see if we try it.
>
> Also, i dont see any harm - do you?

Yes, trace_printk() is a pure debug interface, solely meant for the edit
+ reboot cycle.

If you want anything more than that we've got tracepoints. The rule up
until now has been to never merge a trace_printk() user.

2010-11-17 13:38:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:24 +0100, Ingo Molnar wrote:
> > What about the filtering and other general features/functionality of
> > ftrace? Is the anticipation that this will be ported over to perf?
> > What about things like blktrace?
>
> Yeah, i'd love to see all that available. Filtering is available already on the
> kernel perf event side and could be added as an option.

Except I utterly detest the existing filter code.. it runs _after_ we do
all the hard work and then makes us roll back when it decides we
shouldn't output after all.

But yes, it is functional.

2010-11-17 13:43:10

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 02:36:16PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 14:10 +0100, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
> > > > On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> > > > > On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > > > > > For example I'm currently working with dozens of trace_printk() and I would be
> > > > > > > very happy to turn some of them off half of the time.
> > > > > >
> > > > > > I guess we could try such a patch. If you send a prototype i'd be interested in
> > > > > > testing it out.
> > > > >
> > > > > I don't see the point, the kernel shouldn't contain any trace_printk()s
> > > > > to begin with..
> > > >
> > > >
> > > > It's oriented toward developers. Those who use dozens of tracepoints in
> > > > their tree because they are debugging something or developing a new feature,
> > > > they might to deactivate/reactivate some of these independant points.
> > > >
> > > > This can also apply to dynamic_printk of course.
> > > >
> > > > Well, the very first and main point is to standardize trace_printk into
> > > > a trace event so that it gets usable by perf tools. I have been asked many
> > > > times "how to use trace_printk() with perf?".
> > >
> > > Thing is, since its these dev who add the trace_printk()s to begin with, I don't
> > > see the point in splitting them out, if you didn't want them why did you add them
> > > to begin with?!
> >
> > That's a common workflow: lots of printks (trace_printk's) put all around the code -
> > and sometimes one set of tracepoints is needed, one time another set.
> >
> > _If_ we succeed in presenting them like Frederic suggested it, and if we make the
> > turning on/off _simpler_ (no kernel modification) and faster (no kernel reboot) via
> > the tooling, people like Frederic might start using it.
> >
> > I dont think we should fight the workflow itself - it makes sense.
> >
> > The only question is whether we can represent it all in a nicer fashion than 'modify
> > the source code and reboot'. If we cannot then there's no point - but i'm not sure
> > about it and Frederic seems to be convinced too that he can make such a switch
> > on/off facility intuitive. We'll only see if we try it.
> >
> > Also, i dont see any harm - do you?
>
> Yes, trace_printk() is a pure debug interface, solely meant for the edit
> + reboot cycle.

So why prevent from making it even more handy?


> If you want anything more than that we've got tracepoints. The rule up
> until now has been to never merge a trace_printk() user.

Sure, that doesn't change the core idea of trace_prink(): none of them must
be merged. That new event interface would just make private uses of trace_printk()
more convenient.

2010-11-17 13:53:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:43 +0100, Frederic Weisbecker wrote:
>
> > Yes, trace_printk() is a pure debug interface, solely meant for the edit
> > + reboot cycle.
>
> So why prevent from making it even more handy?
>
>
> > If you want anything more than that we've got tracepoints. The rule up
> > until now has been to never merge a trace_printk() user.
>
> Sure, that doesn't change the core idea of trace_prink(): none of them must
> be merged. That new event interface would just make private uses of trace_printk()
> more convenient.

I don't get it, if you don't want it, why put it there in the first
place?

I've never found myself thinking, oh damn, I didn't want to see that
one!, only damn, I should have added more :-)

2010-11-17 14:00:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-11-17 at 14:24 +0100, Ingo Molnar wrote:
> > > What about the filtering and other general features/functionality of
> > > ftrace? Is the anticipation that this will be ported over to perf?
> > > What about things like blktrace?
> >
> > Yeah, i'd love to see all that available. Filtering is available already on the
> > kernel perf event side and could be added as an option.
>
> Except I utterly detest the existing filter code.. it runs _after_ we do all the
> hard work and then makes us roll back when it decides we shouldn't output after
> all.
>
> But yes, it is functional.

I suspect that is what matters mostly - unless you think that it's impossible to
have a sane implementation of it, if the users come.

Thanks,

Ingo

2010-11-17 14:10:40

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 02:53:49PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 14:43 +0100, Frederic Weisbecker wrote:
> >
> > > Yes, trace_printk() is a pure debug interface, solely meant for the edit
> > > + reboot cycle.
> >
> > So why prevent from making it even more handy?
> >
> >
> > > If you want anything more than that we've got tracepoints. The rule up
> > > until now has been to never merge a trace_printk() user.
> >
> > Sure, that doesn't change the core idea of trace_prink(): none of them must
> > be merged. That new event interface would just make private uses of trace_printk()
> > more convenient.
>
> I don't get it, if you don't want it, why put it there in the first
> place?
>
> I've never found myself thinking, oh damn, I didn't want to see that
> one!, only damn, I should have added more :-)


Hehe :)

Yeah I have a strange workflow. I'm working on that CPU isolation thing
and I have dozens of trace_printk all over the place for tons of
things. And everytime I remove one to unwind some output or to focus
on another one, I often have to restore it later because I need it
again. Usually I even just comment it out instead of removing it.

If I could make this dynamically on a per line filtering, or sometimes on
a per file granularity (as both are equally often the case for me), I would
probably win some time.

I just don't know how many developers have a similar workflow than mine.

2010-11-17 14:11:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:00 +0100, Ingo Molnar wrote:
> > But yes, it is functional.
>
> I suspect that is what matters mostly - unless you think that it's impossible to
> have a sane implementation of it, if the users come.

I'm not sure, I raised the point several times, and I think Steve and
Tom though it was possible to implement sanely, but I'm not sure if it
was expression invariant (the fix that is).

It would be very good to have a definite answer on that.

The idea was to not let the filter engine work on the trace data (once
its gathered) but on the trace argument right at the beginning of the
tracepoint callchain, since some of the trace data is an expression of
the trace argument (say next->prio instead of next), the trace
expression wouldn't stay invariant, you'd have to write a different
filter for the same effect.

So I think it would be wise to make this change sooner rather than
later.

2010-11-17 14:18:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> The idea was to not let the filter engine work on the trace data (once
> its gathered) but on the trace argument right at the beginning of the
> tracepoint callchain, since some of the trace data is an expression of
> the trace argument (say next->prio instead of next), the trace
> expression wouldn't stay invariant, you'd have to write a different
> filter for the same effect.
>
> So I think it would be wise to make this change sooner rather than
> later.

Also, I see a lot of overlap with the dynamic probes stuff which needs
the help of magic dwarves to come up with the right data to gather.

Merging the dynamic tracepoint and filter stuff would be nice, there's
no way you can express next->prio without the help of these short
buggers with large axes.

The trouble is that the dynamic tracepoint stuff is privileged for good
reasons, try next+0x1000000 and you're out in the woods, priv. only
filters otoh just doesn't sound too hot.

Another nasty thing is that you actually need to have these dwarves
present, which means having the -debug package installed.

2010-11-17 14:39:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 17 Nov 2010, Peter Zijlstra wrote:

> On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> > The idea was to not let the filter engine work on the trace data (once
> > its gathered) but on the trace argument right at the beginning of the
> > tracepoint callchain, since some of the trace data is an expression of
> > the trace argument (say next->prio instead of next), the trace
> > expression wouldn't stay invariant, you'd have to write a different
> > filter for the same effect.
> >
> > So I think it would be wise to make this change sooner rather than
> > later.
>
> Also, I see a lot of overlap with the dynamic probes stuff which needs
> the help of magic dwarves to come up with the right data to gather.
>
> Merging the dynamic tracepoint and filter stuff would be nice, there's
> no way you can express next->prio without the help of these short
> buggers with large axes.
>
> The trouble is that the dynamic tracepoint stuff is privileged for good
> reasons, try next+0x1000000 and you're out in the woods, priv. only
> filters otoh just doesn't sound too hot.
>
> Another nasty thing is that you actually need to have these dwarves
> present, which means having the -debug package installed.

That sounds utterly insane for the basic use case where you trace
application context. There is no point to filter out the tracepoints,
really. Postprocessing can do that just fine.

I consider myself a power user of tracing, but hell I never used any
of those filters in a real use case. awk, grep, scripting languages do
just a better job and you don't miss any data just because you got
your filter expression slightly wrong. Postprocessing always wins in
that regard.

The only reason why these filters might make sense is to reduce the
trace volume on system wide traces in production environments, but
that's a completely different story. These scenarios can do with the
dynamic tracepoint stuff and the custom filtering w/o putting the
burden on the majority of users.

Thanks,

tglx

2010-11-17 14:42:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:37 +0100, Thomas Gleixner wrote:
>
> The only reason why these filters might make sense is to reduce the
> trace volume on system wide traces in production environments, but
> that's a completely different story. These scenarios can do with the
> dynamic tracepoint stuff and the custom filtering w/o putting the
> burden on the majority of users.

I've been hearing people claim 'enterprise' and custom are like mutually
exclusive ;-)

But yeah, I totally agree, I've never used any of that filter stuff
either, I always script the post-processing.

2010-11-17 15:02:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:10 +0100, Frederic Weisbecker wrote:

> Yeah I have a strange workflow. I'm working on that CPU isolation thing
> and I have dozens of trace_printk all over the place for tons of
> things. And everytime I remove one to unwind some output or to focus
> on another one, I often have to restore it later because I need it
> again. Usually I even just comment it out instead of removing it.
>
> If I could make this dynamically on a per line filtering, or sometimes on
> a per file granularity (as both are equally often the case for me), I would
> probably win some time.
>
> I just don't know how many developers have a similar workflow than mine.

I usually wrap trace_printk() (or printk sometimes) in my own wrapper:

#define sprintk(on, x...) do { if (on) { trace_prinkt(x); } } while (0)

Then have things like:

#define SDR_DEBUG_IRQS 0
#define SDR_DEBUG_SCHED 1
[...]

and in the irq code I'll have:

sprintk(SDR_DEBUG_IRQS, "printme"...);

and the sched code:

sprintk(SDR_DEBUG_SCHED, "printthis"...);

And I can easily enable or disable the prints I want. Sometimes I'll
define it as a variable, and enable them either in certain code paths,
or export them in the debugfs system and enable or disable them from
userspace.

extern int sdr_debug_irqs;
#define SDR_DEBUG_IRQS sdr_debug_irqs

[...]
int sdr_debug_irqs;
[...]

debugfs_create_bool("sdr_debug_sched", 0644, NULL, &sdr_debug_irqs);

Of course this is all for temporary debugging. I use SDR and sdr and
sprintk to search and delete the debugging output when done.

-- Steve

2010-11-17 15:10:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 15:00 +0100, Ingo Molnar wrote:
> > > But yes, it is functional.
> >
> > I suspect that is what matters mostly - unless you think that it's impossible to
> > have a sane implementation of it, if the users come.
>
> I'm not sure, I raised the point several times, and I think Steve and
> Tom though it was possible to implement sanely, but I'm not sure if it
> was expression invariant (the fix that is).
>
> It would be very good to have a definite answer on that.
>
> The idea was to not let the filter engine work on the trace data (once
> its gathered) but on the trace argument right at the beginning of the
> tracepoint callchain, since some of the trace data is an expression of
> the trace argument (say next->prio instead of next), the trace
> expression wouldn't stay invariant, you'd have to write a different
> filter for the same effect.
>
> So I think it would be wise to make this change sooner rather than
> later.

Right, the problem with filtering is what do we want to filter, and what
about copying?

Currently, we copy the data into the buffer and then filter on that
data. We could also easily filter on the parameters of the tracepoint,
but sometimes those parameters do not match the final output (as the
case with sched_switch). Do we copy the data into a separate "per cpu"
temp buffer, and figure out the filter then? And if the filter is fine,
then copy into the buffer. This obviously is slow, due to the multiple
copies. We could do this only if the filtering is enabled.

Note, the stable event proposal I suggested passes the same info via the
trace callback parameters that is expected to go into the buffer. Arrays
are passed as pointers so the copying is of the minimum. This would give
us the ability to filter _before_ writing into the buffer, and no
unnecessary copying.

-- Steve

2010-11-17 15:33:27

by Tom Zanussi

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 15:00 +0100, Ingo Molnar wrote:
> > > But yes, it is functional.
> >
> > I suspect that is what matters mostly - unless you think that it's impossible to
> > have a sane implementation of it, if the users come.
>
> I'm not sure, I raised the point several times, and I think Steve and
> Tom though it was possible to implement sanely, but I'm not sure if it
> was expression invariant (the fix that is).
>
> It would be very good to have a definite answer on that.
>
> The idea was to not let the filter engine work on the trace data (once
> its gathered) but on the trace argument right at the beginning of the
> tracepoint callchain, since some of the trace data is an expression of
> the trace argument (say next->prio instead of next), the trace
> expression wouldn't stay invariant, you'd have to write a different
> filter for the same effect.
>

IIRC, I think the conclusion we came to was that it could be done
mechanically if for example the right-hand-side of an assignment in
TP_fast_assign() only involved a simple variable assignment, but as
Steve pointed out, some assignments are more complicated than that.

For example, in the sched_switch tracepoint assignments:

__entry->prev_prio = prev->prio;
__entry->prev_state = __trace_sched_switch_state(prev);

so the prev_prio should be able to be tested 'in-line' but the
prev_state would require a temporary buffer to write the value into
before doing the test as mentioned by Steve. At which point you're no
further ahead (in that case) than the current situation...

Tom

> So I think it would be wise to make this change sooner rather than
> later.

2010-11-17 15:41:12

by Tom Zanussi

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:37 +0100, Thomas Gleixner wrote:
> On Wed, 17 Nov 2010, Peter Zijlstra wrote:
>
> > On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> > > The idea was to not let the filter engine work on the trace data (once
> > > its gathered) but on the trace argument right at the beginning of the
> > > tracepoint callchain, since some of the trace data is an expression of
> > > the trace argument (say next->prio instead of next), the trace
> > > expression wouldn't stay invariant, you'd have to write a different
> > > filter for the same effect.
> > >
> > > So I think it would be wise to make this change sooner rather than
> > > later.
> >
> > Also, I see a lot of overlap with the dynamic probes stuff which needs
> > the help of magic dwarves to come up with the right data to gather.
> >
> > Merging the dynamic tracepoint and filter stuff would be nice, there's
> > no way you can express next->prio without the help of these short
> > buggers with large axes.
> >
> > The trouble is that the dynamic tracepoint stuff is privileged for good
> > reasons, try next+0x1000000 and you're out in the woods, priv. only
> > filters otoh just doesn't sound too hot.
> >
> > Another nasty thing is that you actually need to have these dwarves
> > present, which means having the -debug package installed.
>
> That sounds utterly insane for the basic use case where you trace
> application context. There is no point to filter out the tracepoints,
> really. Postprocessing can do that just fine.
>
> I consider myself a power user of tracing, but hell I never used any
> of those filters in a real use case. awk, grep, scripting languages do
> just a better job and you don't miss any data just because you got
> your filter expression slightly wrong. Postprocessing always wins in
> that regard.
>
> The only reason why these filters might make sense is to reduce the
> trace volume on system wide traces in production environments, but
> that's a completely different story. These scenarios can do with the
> dynamic tracepoint stuff and the custom filtering w/o putting the
> burden on the majority of users.
>

Yeah, in my mind, the main point of the filters was to be a 'blunt
force' instrument preventing userspace from being overwhelmed by events.
The real filtering would happen in userspace with e.g. real scripting
languages.

Tom

> Thanks,
>
> tglx

2010-11-17 15:44:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 10:10 -0500, Steven Rostedt wrote:
>
> Right, the problem with filtering is what do we want to filter, and what
> about copying?
>
> Currently, we copy the data into the buffer and then filter on that
> data. We could also easily filter on the parameters of the tracepoint,
> but sometimes those parameters do not match the final output (as the
> case with sched_switch). Do we copy the data into a separate "per cpu"
> temp buffer, and figure out the filter then? And if the filter is fine,
> then copy into the buffer. This obviously is slow, due to the multiple
> copies. We could do this only if the filtering is enabled.

Right, so what is the primary purpose of this filtering stuff? As it
stands it makes stuff terribly slow, so you add overhead but the win
(presumably) is less data output, is that a sane trade-off?

Most people I've heard -- both at LinuxCon.JP and LPC -- are asking for
lower overhead tracing (while at the same time demanding more features).

Who are actually using this and can they provide some input on this?

2010-11-17 15:57:09

by Avi Kivity

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 05:41 PM, Tom Zanussi wrote:
>
>> The only reason why these filters might make sense is to reduce the
>> trace volume on system wide traces in production environments, but
>> that's a completely different story. These scenarios can do with the
>> dynamic tracepoint stuff and the custom filtering w/o putting the
>> burden on the majority of users.
>>
> Yeah, in my mind, the main point of the filters was to be a 'blunt
> force' instrument preventing userspace from being overwhelmed by events.
> The real filtering would happen in userspace with e.g. real scripting
> languages.
>

Filtering is also useful for fine-grained perf events; this cannot be
done in userspace.

2010-11-17 15:59:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 17:55 +0200, Avi Kivity wrote:
>
> Filtering is also useful for fine-grained perf events; this cannot be
> done in userspace.
>
You're hinting at the kvm_exit histogram thing you've constructed,
right?

2010-11-17 15:59:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 17 Nov 2010, Avi Kivity wrote:

> On 11/17/2010 05:41 PM, Tom Zanussi wrote:
> >
> > > The only reason why these filters might make sense is to reduce the
> > > trace volume on system wide traces in production environments, but
> > > that's a completely different story. These scenarios can do with the
> > > dynamic tracepoint stuff and the custom filtering w/o putting the
> > > burden on the majority of users.
> > >
> > Yeah, in my mind, the main point of the filters was to be a 'blunt
> > force' instrument preventing userspace from being overwhelmed by events.
> > The real filtering would happen in userspace with e.g. real scripting
> > languages.
> >
>
> Filtering is also useful for fine-grained perf events; this cannot be done in
> userspace.

Could you explain that a bit more detailed please ?

Thanks,

tglx

2010-11-17 16:04:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 16:43 +0100, Peter Zijlstra wrote:

> Most people I've heard -- both at LinuxCon.JP and LPC -- are asking for
> lower overhead tracing (while at the same time demanding more features).

Tell them we have:

1) features
2) fast
3) simple to use

and have them pick any two.

;-)

-- Steve

2010-11-17 16:09:45

by Avi Kivity

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 05:59 PM, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 17:55 +0200, Avi Kivity wrote:
> >
> > Filtering is also useful for fine-grained perf events; this cannot be
> > done in userspace.
> >
> You're hinting at the kvm_exit histogram thing you've constructed,
> right?

Yes.


--
error compiling committee.c: too many arguments to function

2010-11-17 16:15:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 16:43 +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 10:10 -0500, Steven Rostedt wrote:
> >
> > Right, the problem with filtering is what do we want to filter, and what
> > about copying?
> >
> > Currently, we copy the data into the buffer and then filter on that
> > data. We could also easily filter on the parameters of the tracepoint,
> > but sometimes those parameters do not match the final output (as the
> > case with sched_switch). Do we copy the data into a separate "per cpu"
> > temp buffer, and figure out the filter then? And if the filter is fine,
> > then copy into the buffer. This obviously is slow, due to the multiple
> > copies. We could do this only if the filtering is enabled.
>
> Right, so what is the primary purpose of this filtering stuff? As it
> stands it makes stuff terribly slow, so you add overhead but the win
> (presumably) is less data output, is that a sane trade-off?

I've actually used filtering too. Not for speed up, but because I was
recording a lot of data and the reader could not keep up. By filtering,
I was able to get all the relevant information without needing to make
the kernel buffer a Gig.

-- Steve

2010-11-17 16:22:01

by Avi Kivity

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 05:58 PM, Thomas Gleixner wrote:
> >
> > Filtering is also useful for fine-grained perf events; this cannot be done in
> > userspace.
>
> Could you explain that a bit more detailed please ?

I have a tracepoint kvm:kvm_exit which is called every time a guest
exits to the host (for whatever reason). The first step in profiling
kvm is checking the rate at which this tracepoint is triggered.

(I have a small tool kvm_stat which shows a 'perf top' like display,
except that instead of showing functions, it shows kvm tracepoints
sorted by triggering rate)

If you see a high exit rate, you may also be interested in what exit
reasons triggered those exits. So kvm_stat creates additional events,
one per possible exit_reason (which is a field in the kvm_exit
tracepoint), with a filter matching exit_reason to the enum describing
the possible exit reasons. We add that to the display and get a sort of
histogram showing which exit_reason is responsible for how many exits.

A screenshot:

kvm statistics

kvm_exit 40075 10293
kvm_entry 40075 10293
kvm_exit(EXCEPTION_NMI) 29824 7712
kvm_page_fault 29642 7672
kvm_emulate_insn 9117 2205
kvm_inj_exception 8052 2134
kvm_mmio 6160 1514
kvm_apic 6143 1509
kvm_exit(CR_ACCESS) 2307 635
kvm_cr 2307 635
kvm_exit(INVLPG) 1998 599
kvm_userspace_exit 2956 582
kvm_exit(IO_INSTRUCTION) 2954 582
kvm_pio 2954 582
kvm_inj_virq 2194 551
kvm_apic_accept_irq 2181 547
kvm_exit(HLT) 1523 372
kvm_exit(EXTERNAL_INTERRUPT) 1131 304
kvm_fpu 706 176
kvm_set_irq 802 156
kvm_pic_set_irq 802 156
kvm_ioapic_set_irq 802 156
kvm_apic_ipi 454 132
kvm_exit(PENDING_INTERRUPT) 175 44
kvm_cpuid 163 42
kvm_exit(CPUID) 163 42
kvm_ack_irq 152 29


So there are 10K exits/sec, of which 7.7K are EXCEPTION_NMI and 635 are
CR_ACCESS.

Tool source http://tinyurl.com/2ue5tev; one day I'll make it a
tools/perf script.

--
error compiling committee.c: too many arguments to function

2010-11-17 18:13:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 03:10:33PM +0100, Frederic Weisbecker wrote:
> Yeah I have a strange workflow. I'm working on that CPU isolation thing
> and I have dozens of trace_printk all over the place for tons of
> things. And everytime I remove one to unwind some output or to focus
> on another one, I often have to restore it later because I need it
> again. Usually I even just comment it out instead of removing it.

What I do for my file system development work is to drop in
trace_printk's everywhere, since they are lightweight and don't slow
down the system much. Then when the system wedges, I use sysrq-z to
get a "flight data recorder" printout of what happened up til the
system hung.

I love the fact that the ring buffer is in "overwrite" mode (aka
flight data recorder mode), and hope this doesn't go away.

Per line filtering is also great, but very often when I'm interacting
with the block I/O layer, if something screws up, what happens is "and
then whole machine locks up", and sysrq-z is literally all I have.

- Ted

2010-11-17 18:23:49

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

* Peter Zijlstra ([email protected]) wrote:
> On Wed, 2010-11-17 at 10:10 -0500, Steven Rostedt wrote:
> >
> > Right, the problem with filtering is what do we want to filter, and what
> > about copying?
> >
> > Currently, we copy the data into the buffer and then filter on that
> > data. We could also easily filter on the parameters of the tracepoint,
> > but sometimes those parameters do not match the final output (as the
> > case with sched_switch). Do we copy the data into a separate "per cpu"
> > temp buffer, and figure out the filter then? And if the filter is fine,
> > then copy into the buffer. This obviously is slow, due to the multiple
> > copies. We could do this only if the filtering is enabled.
>
> Right, so what is the primary purpose of this filtering stuff? As it
> stands it makes stuff terribly slow, so you add overhead but the win
> (presumably) is less data output, is that a sane trade-off?
>
> Most people I've heard -- both at LinuxCon.JP and LPC -- are asking for
> lower overhead tracing (while at the same time demanding more features).
>
> Who are actually using this and can they provide some input on this?

Amongst others, Ericsson rely on this kind of feature. The case where we're
filtering "out" should be really, really fast (even if it makes the recording
slightly slower). If there is a race modifying the data underneath between the
filter execution and the copy to the buffers, I really don't think it matters.
If the tracepoint caller context don't provide data consistency guarantees for
the parameters it passes, we should not expect 100% perfect consistency between
filter and what is actually saved in the trace. The trace analysis tools should
just be able to cope with that, but I really don't think anyone should care.

So I would recommend no copy, filter directly on the source data, stop the
filter chain as soon as a non-match is observed. Copy the data into the buffers
if the filter passes.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-17 18:26:37

by Darren Hart

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 05:02 AM, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
>> On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
>>> On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
>>>>> For example I'm currently working with dozens of trace_printk() and I would be
>>>>> very happy to turn some of them off half of the time.
>>>>
>>>> I guess we could try such a patch. If you send a prototype i'd be interested in
>>>> testing it out.
>>>
>>> I don't see the point, the kernel shouldn't contain any trace_printk()s
>>> to begin with..
>>
>>
>> It's oriented toward developers. Those who use dozens of tracepoints in
>> their tree because they are debugging something or developing a new feature,
>> they might to deactivate/reactivate some of these independant points.
>>
>> This can also apply to dynamic_printk of course.
>>
>> Well, the very first and main point is to standardize trace_printk into
>> a trace event so that it gets usable by perf tools. I have been asked many
>> times "how to use trace_printk() with perf?".
>
> Thing is, since its these dev who add the trace_printk()s to begin with,
> I don't see the point in splitting them out, if you didn't want them why
> did you add them to begin with?!

What I understood from Frederic's email was that during a debug session
it is sometimes helpful to be able to enable and disable the
trace_printk's. This makes sense as it reduces the number of kernel
build/reboot cycles. However, I would think most of that could be
accomplished with some judicious message tagging and post-processing to
filter out the unwanted trace_printk's. The only exception might be when
the trace_printk's add enough overhead to mask a timing related bug. In
this case, I'd probably be tempted to remove the stubs anyway.

--
Darren Hart
Yocto Linux Kernel

2010-11-17 18:29:12

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 01:13:40PM -0500, Ted Ts'o wrote:
> On Wed, Nov 17, 2010 at 03:10:33PM +0100, Frederic Weisbecker wrote:
> > Yeah I have a strange workflow. I'm working on that CPU isolation thing
> > and I have dozens of trace_printk all over the place for tons of
> > things. And everytime I remove one to unwind some output or to focus
> > on another one, I often have to restore it later because I need it
> > again. Usually I even just comment it out instead of removing it.
>
> What I do for my file system development work is to drop in
> trace_printk's everywhere, since they are lightweight and don't slow
> down the system much. Then when the system wedges, I use sysrq-z to
> get a "flight data recorder" printout of what happened up til the
> system hung.
>
> I love the fact that the ring buffer is in "overwrite" mode (aka
> flight data recorder mode), and hope this doesn't go away.
>
> Per line filtering is also great, but very often when I'm interacting
> with the block I/O layer, if something screws up, what happens is "and
> then whole machine locks up", and sysrq-z is literally all I have.

Yeah all agreed.

Steve proposed to keep the current trace_printk() implementation that relies
on ftrace but rename in into ftrace_printk(). So that we can develop a new
trace_printk() based on trace_event interface and in the meantime keep the
old version in case something messes up with the new thing.

2010-11-17 18:30:27

by Darren Hart

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 10:29 AM, Frederic Weisbecker wrote:
> On Wed, Nov 17, 2010 at 01:13:40PM -0500, Ted Ts'o wrote:
>> On Wed, Nov 17, 2010 at 03:10:33PM +0100, Frederic Weisbecker wrote:
>>> Yeah I have a strange workflow. I'm working on that CPU isolation thing
>>> and I have dozens of trace_printk all over the place for tons of
>>> things. And everytime I remove one to unwind some output or to focus
>>> on another one, I often have to restore it later because I need it
>>> again. Usually I even just comment it out instead of removing it.
>>
>> What I do for my file system development work is to drop in
>> trace_printk's everywhere, since they are lightweight and don't slow
>> down the system much. Then when the system wedges, I use sysrq-z to
>> get a "flight data recorder" printout of what happened up til the
>> system hung.
>>
>> I love the fact that the ring buffer is in "overwrite" mode (aka
>> flight data recorder mode), and hope this doesn't go away.
>>
>> Per line filtering is also great, but very often when I'm interacting
>> with the block I/O layer, if something screws up, what happens is "and
>> then whole machine locks up", and sysrq-z is literally all I have.
>
> Yeah all agreed.
>
> Steve proposed to keep the current trace_printk() implementation that relies
> on ftrace but rename in into ftrace_printk(). So that we can develop a new
> trace_printk() based on trace_event interface and in the meantime keep the
> old version in case something messes up with the new thing.

Seems reasonable.


--
Darren Hart
Yocto Linux Kernel

2010-11-17 18:36:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

* Tom Zanussi ([email protected]) wrote:
[...]
> IIRC, I think the conclusion we came to was that it could be done
> mechanically if for example the right-hand-side of an assignment in
> TP_fast_assign() only involved a simple variable assignment, but as
> Steve pointed out, some assignments are more complicated than that.

Yep, we came up to the same conclusions in UST.

> For example, in the sched_switch tracepoint assignments:
>
> __entry->prev_prio = prev->prio;
> __entry->prev_state = __trace_sched_switch_state(prev);
>
> so the prev_prio should be able to be tested 'in-line' but the
> prev_state would require a temporary buffer to write the value into
> before doing the test as mentioned by Steve. At which point you're no
> further ahead (in that case) than the current situation...

if we change all assignments to, e.g.:

_tp_assign(__entry->prev_prio, prev->prio)
_tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))

then we can redefine the macros for filtering much more easily than with the
" = " assignment operator.

About your comment above, what is the problem with evaluating
"__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
the first evaluation, so I wonder if, in practice, we really save a significant
amount of cycles by saving its result between filtering and writing into trace
buffers. As I pointed out earlier, for my customers, having a very, very fast
filter "out" case is more important that trying to squeeze a few cycles out of
the filter passed case.

Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
actual function calls ? If it's mostly static inlines to dereference a few
pointers, doing it the second time when the filter passed won't hurt much.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-17 18:53:33

by Tom Zanussi

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 13:36 -0500, Mathieu Desnoyers wrote:
> * Tom Zanussi ([email protected]) wrote:
> [...]
> > IIRC, I think the conclusion we came to was that it could be done
> > mechanically if for example the right-hand-side of an assignment in
> > TP_fast_assign() only involved a simple variable assignment, but as
> > Steve pointed out, some assignments are more complicated than that.
>
> Yep, we came up to the same conclusions in UST.
>
> > For example, in the sched_switch tracepoint assignments:
> >
> > __entry->prev_prio = prev->prio;
> > __entry->prev_state = __trace_sched_switch_state(prev);
> >
> > so the prev_prio should be able to be tested 'in-line' but the
> > prev_state would require a temporary buffer to write the value into
> > before doing the test as mentioned by Steve. At which point you're no
> > further ahead (in that case) than the current situation...
>
> if we change all assignments to, e.g.:
>
> _tp_assign(__entry->prev_prio, prev->prio)
> _tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))
>
> then we can redefine the macros for filtering much more easily than with the
> " = " assignment operator.
>
> About your comment above, what is the problem with evaluating
> "__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
> the first evaluation, so I wonder if, in practice, we really save a significant
> amount of cycles by saving its result between filtering and writing into trace
> buffers. As I pointed out earlier, for my customers, having a very, very fast
> filter "out" case is more important that trying to squeeze a few cycles out of
> the filter passed case.
>

But the idea is to avoid allocating the trace buffer in the first place,
until we've decided we want the event. So how do you check the result
of __trace_sched_switch_state(prev) with the filter value if you don't
have it temporarily stored somewhere (not in the trace buffer, which
doesn't exist yet as far as this event is concerned)?

Tom

> Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
> actual function calls ? If it's mostly static inlines to dereference a few
> pointers, doing it the second time when the filter passed won't hurt much.
>
> Thanks,
>
> Mathieu
>

2010-11-17 19:00:41

by Bob Copeland

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, Nov 16, 2010 at 10:34 PM, Steven Rostedt <[email protected]> wrote:
> (*) Currently in trace-cmd I have plugins. These plugins are simple
> functions to read events and either output better displays of the event
> or do something special (Avi wrote some for his kvm events).

We also use this for debugging some wireless drivers. The (unstable)
raw events can capture entire packets and produce psuedo-pcap files with
a plugin so we can compare what we thought went over the air versus what
actually did, and correlate them with api-level tracepoints in upper
layers. It's perhaps a bit of a bending of the infrastructure, but one
that has been helpful.

--
Bob Copeland %% http://www.bobcopeland.com

2010-11-17 19:02:26

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

* Tom Zanussi ([email protected]) wrote:
> On Wed, 2010-11-17 at 13:36 -0500, Mathieu Desnoyers wrote:
> > * Tom Zanussi ([email protected]) wrote:
> > [...]
> > > IIRC, I think the conclusion we came to was that it could be done
> > > mechanically if for example the right-hand-side of an assignment in
> > > TP_fast_assign() only involved a simple variable assignment, but as
> > > Steve pointed out, some assignments are more complicated than that.
> >
> > Yep, we came up to the same conclusions in UST.
> >
> > > For example, in the sched_switch tracepoint assignments:
> > >
> > > __entry->prev_prio = prev->prio;
> > > __entry->prev_state = __trace_sched_switch_state(prev);
> > >
> > > so the prev_prio should be able to be tested 'in-line' but the
> > > prev_state would require a temporary buffer to write the value into
> > > before doing the test as mentioned by Steve. At which point you're no
> > > further ahead (in that case) than the current situation...
> >
> > if we change all assignments to, e.g.:
> >
> > _tp_assign(__entry->prev_prio, prev->prio)
> > _tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))
> >
> > then we can redefine the macros for filtering much more easily than with the
> > " = " assignment operator.
> >
> > About your comment above, what is the problem with evaluating
> > "__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
> > the first evaluation, so I wonder if, in practice, we really save a significant
> > amount of cycles by saving its result between filtering and writing into trace
> > buffers. As I pointed out earlier, for my customers, having a very, very fast
> > filter "out" case is more important that trying to squeeze a few cycles out of
> > the filter passed case.
> >
>
> But the idea is to avoid allocating the trace buffer in the first place,
> until we've decided we want the event. So how do you check the result
> of __trace_sched_switch_state(prev) with the filter value if you don't
> have it temporarily stored somewhere (not in the trace buffer, which
> doesn't exist yet as far as this event is concerned)?

It seems I might be missing something important, but what's wrong with using
registers or the stack to hold the value for comparison ? In this case, it's a
"long", so a register seems perfectly reasonable. But again, I feel I'm missing
a key point -- what is it ?

Thanks,

Mathieu

>
> Tom
>
> > Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
> > actual function calls ? If it's mostly static inlines to dereference a few
> > pointers, doing it the second time when the filter passed won't hurt much.
> >
> > Thanks,
> >
> > Mathieu
> >
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-17 19:25:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 13:36 -0500, Mathieu Desnoyers wrote:
> * Tom Zanussi ([email protected]) wrote:
> [...]
> > IIRC, I think the conclusion we came to was that it could be done
> > mechanically if for example the right-hand-side of an assignment in
> > TP_fast_assign() only involved a simple variable assignment, but as
> > Steve pointed out, some assignments are more complicated than that.
>
> Yep, we came up to the same conclusions in UST.
>
> > For example, in the sched_switch tracepoint assignments:
> >
> > __entry->prev_prio = prev->prio;
> > __entry->prev_state = __trace_sched_switch_state(prev);
> >
> > so the prev_prio should be able to be tested 'in-line' but the
> > prev_state would require a temporary buffer to write the value into
> > before doing the test as mentioned by Steve. At which point you're no
> > further ahead (in that case) than the current situation...
>
> if we change all assignments to, e.g.:
>
> _tp_assign(__entry->prev_prio, prev->prio)
> _tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))

I would just call it __assign().


>
> then we can redefine the macros for filtering much more easily than with the
> " = " assignment operator.
>
> About your comment above, what is the problem with evaluating
> "__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
> the first evaluation, so I wonder if, in practice, we really save a significant
> amount of cycles by saving its result between filtering and writing into trace
> buffers. As I pointed out earlier, for my customers, having a very, very fast
> filter "out" case is more important that trying to squeeze a few cycles out of
> the filter passed case.
>
> Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
> actual function calls ? If it's mostly static inlines to dereference a few
> pointers, doing it the second time when the filter passed won't hurt much.

Yes, something like this could work. It would require a bit more CPP
magic to handle it though.

I could write something up to do this. I just have to add it to the 100
other things I'm doing at the same time :-p

-- Steve

2010-11-17 19:25:38

by Tom Zanussi

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:02 -0500, Mathieu Desnoyers wrote:
> * Tom Zanussi ([email protected]) wrote:
> > On Wed, 2010-11-17 at 13:36 -0500, Mathieu Desnoyers wrote:
> > > * Tom Zanussi ([email protected]) wrote:
> > > [...]
> > > > IIRC, I think the conclusion we came to was that it could be done
> > > > mechanically if for example the right-hand-side of an assignment in
> > > > TP_fast_assign() only involved a simple variable assignment, but as
> > > > Steve pointed out, some assignments are more complicated than that.
> > >
> > > Yep, we came up to the same conclusions in UST.
> > >
> > > > For example, in the sched_switch tracepoint assignments:
> > > >
> > > > __entry->prev_prio = prev->prio;
> > > > __entry->prev_state = __trace_sched_switch_state(prev);
> > > >
> > > > so the prev_prio should be able to be tested 'in-line' but the
> > > > prev_state would require a temporary buffer to write the value into
> > > > before doing the test as mentioned by Steve. At which point you're no
> > > > further ahead (in that case) than the current situation...
> > >
> > > if we change all assignments to, e.g.:
> > >
> > > _tp_assign(__entry->prev_prio, prev->prio)
> > > _tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))
> > >
> > > then we can redefine the macros for filtering much more easily than with the
> > > " = " assignment operator.
> > >
> > > About your comment above, what is the problem with evaluating
> > > "__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
> > > the first evaluation, so I wonder if, in practice, we really save a significant
> > > amount of cycles by saving its result between filtering and writing into trace
> > > buffers. As I pointed out earlier, for my customers, having a very, very fast
> > > filter "out" case is more important that trying to squeeze a few cycles out of
> > > the filter passed case.
> > >
> >
> > But the idea is to avoid allocating the trace buffer in the first place,
> > until we've decided we want the event. So how do you check the result
> > of __trace_sched_switch_state(prev) with the filter value if you don't
> > have it temporarily stored somewhere (not in the trace buffer, which
> > doesn't exist yet as far as this event is concerned)?
>
> It seems I might be missing something important, but what's wrong with using
> registers or the stack to hold the value for comparison ? In this case, it's a

Nothing, it shouldn't matter where the temporary storage is, as long as
it's not in the trace buffer.

Tom

> "long", so a register seems perfectly reasonable. But again, I feel I'm missing
> a key point -- what is it ?
>
> Thanks,
>
> Mathieu
>
> >
> > Tom
> >
> > > Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
> > > actual function calls ? If it's mostly static inlines to dereference a few
> > > pointers, doing it the second time when the filter passed won't hurt much.
> > >
> > > Thanks,
> > >
> > > Mathieu
> > >
> >
> >
>


2010-11-17 19:40:41

by Darren Hart

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 08:15 AM, Steven Rostedt wrote:
> On Wed, 2010-11-17 at 16:43 +0100, Peter Zijlstra wrote:
>> On Wed, 2010-11-17 at 10:10 -0500, Steven Rostedt wrote:
>>>
>>> Right, the problem with filtering is what do we want to filter, and what
>>> about copying?
>>>
>>> Currently, we copy the data into the buffer and then filter on that
>>> data. We could also easily filter on the parameters of the tracepoint,
>>> but sometimes those parameters do not match the final output (as the
>>> case with sched_switch). Do we copy the data into a separate "per cpu"
>>> temp buffer, and figure out the filter then? And if the filter is fine,
>>> then copy into the buffer. This obviously is slow, due to the multiple
>>> copies. We could do this only if the filtering is enabled.
>>
>> Right, so what is the primary purpose of this filtering stuff? As it
>> stands it makes stuff terribly slow, so you add overhead but the win
>> (presumably) is less data output, is that a sane trade-off?
>
> I've actually used filtering too. Not for speed up, but because I was
> recording a lot of data and the reader could not keep up. By filtering,
> I was able to get all the relevant information without needing to make
> the kernel buffer a Gig.

I have run into situations where the volume of output becomes a problem
and not every system will have the memory to dedicate to massive trace
buffers. This is, for me, probably the one motivating argument for doing
filtering in the kernel as opposed to post-processing scripts.

--
Darren Hart
Yocto Linux Kernel

2010-11-18 00:47:47

by Ian Munsie

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

Excerpts from Thomas Gleixner's message of Wed Nov 17 08:04:40 +1100 2010:
> We are pleased to announce a new tracing tool called 'trace'.

Cool!

> Build the -tip kernel and reboot into it to get the most out of the
> tool - but even on default kernels it will try to cope. Syscall events
> (CONFIG_FTRACE_SYSCALLS=y) is a must for most methods of analysis
> though. (with the above -tip tree you'll get it all configured by
> default)

OK, now I'll really have to find some time in the near future to go back
to the work I was doing on the ftrace syscalls + building on Jason's
ftrace compat syscall patches to make the most of this.

Cheers,
-Ian

Subject: Re: [ANNOUNCE] New utility: 'trace'

(2010/11/18 3:23), Mathieu Desnoyers wrote:
> Amongst others, Ericsson rely on this kind of feature. The case where we're
> filtering "out" should be really, really fast (even if it makes the recording
> slightly slower). If there is a race modifying the data underneath between the
> filter execution and the copy to the buffers, I really don't think it matters.
> If the tracepoint caller context don't provide data consistency guarantees for
> the parameters it passes, we should not expect 100% perfect consistency between
> filter and what is actually saved in the trace. The trace analysis tools should
> just be able to cope with that, but I really don't think anyone should care.
>
> So I would recommend no copy, filter directly on the source data, stop the
> filter chain as soon as a non-match is observed. Copy the data into the buffers
> if the filter passes.

Indeed, so that we can eliminate memory write accesses. :)

Thanks,

--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

Subject: Re: [ANNOUNCE] New utility: 'trace'

(2010/11/18 0:43), Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 10:10 -0500, Steven Rostedt wrote:
>>
>> Right, the problem with filtering is what do we want to filter, and what
>> about copying?
>>
>> Currently, we copy the data into the buffer and then filter on that
>> data. We could also easily filter on the parameters of the tracepoint,
>> but sometimes those parameters do not match the final output (as the
>> case with sched_switch). Do we copy the data into a separate "per cpu"
>> temp buffer, and figure out the filter then? And if the filter is fine,
>> then copy into the buffer. This obviously is slow, due to the multiple
>> copies. We could do this only if the filtering is enabled.
>
> Right, so what is the primary purpose of this filtering stuff? As it
> stands it makes stuff terribly slow, so you add overhead but the win
> (presumably) is less data output, is that a sane trade-off?
>
> Most people I've heard -- both at LinuxCon.JP and LPC -- are asking for
> lower overhead tracing (while at the same time demanding more features).

I've also heard a dynamic filtering (or kicking buffer snapshot) request
in LinuxCon.JP. I think filtering is not only for filtering-out purpose,
but also useful for hooking event to take some action. :)

Thank you,


--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2010-11-18 23:23:31

by Joe Perches

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> * Frederic Weisbecker <[email protected]> wrote:
> > An idea is to reproduce the kernel file hierarchy in a "printk" event
> > subsystem, but this implies to allow subsystems nesting.
> > But may be the whole idea is just fancy and nobody will care, [...]
> I think it's an excellent idea, and it could also give us the framework to integrate
> the dynamic_printk points.
[]
> > For example I'm currently working with dozens of trace_printk() and I would be
> > very happy to turn some of them off half of the time.
> I guess we could try such a patch. If you send a prototype i'd be interested in
> testing it out.

Another concept you might consider would be to
selectively compile trace points per compilation
unit. This might help embedded users that might
want to use trace only in a few areas without the
additional text overhead.

Perhaps something like:

(default on)

#ifdef ENABLE_TRACE_COMPILATION
# normal tracepoint macro defines
#else
# static inline tracepoint functions {} or null macro defines
#endif

2010-11-19 02:32:12

by Ian Munsie

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

Hi Jason,

I'm adding Avantika to the CC list as she recently contacted me
regarding syscall tracepoints on PowerPC.

Excerpts from Jason Baron's message of Fri Nov 19 02:11:42 +1100 2010:
> I was wondering what the status of those patches were...last I remember
> they looked pretty good to me...I'd be more than happy to help review
> those.

Yeah there isn't much left to do on them - the main delay has just been
finding some time to work on them.

I did want to revisit the naming of the events and the subtle
differences for syscalls with a single implementation for both the
native and compat versions versus those that have separate
implementations - I don't want userspace to have to care about that
distinction.

I recall there were a few comments on some of your patches in the series
- if I rebase the patches on the current tip tree and publish the tree
somewhere would you be willing to take a look at those?


Cheers,
-Ian

2010-11-19 15:23:48

by Jason Baron

[permalink] [raw]
Subject: Re: [ANNOUNCE] New utility: 'trace'

On Fri, Nov 19, 2010 at 01:32:08PM +1100, Ian Munsie wrote:
> Hi Jason,
>
> I'm adding Avantika to the CC list as she recently contacted me
> regarding syscall tracepoints on PowerPC.
>
> Excerpts from Jason Baron's message of Fri Nov 19 02:11:42 +1100 2010:
> > I was wondering what the status of those patches were...last I remember
> > they looked pretty good to me...I'd be more than happy to help review
> > those.
>
> Yeah there isn't much left to do on them - the main delay has just been
> finding some time to work on them.
>
> I did want to revisit the naming of the events and the subtle
> differences for syscalls with a single implementation for both the
> native and compat versions versus those that have separate
> implementations - I don't want userspace to have to care about that
> distinction.
>
> I recall there were a few comments on some of your patches in the series
> - if I rebase the patches on the current tip tree and publish the tree
> somewhere would you be willing to take a look at those?
>

sure. the remaining issue was to combine the enabled/disabled flags into
the 'int nb_args' field since we are only using 3 bits of it. that
shouldn't be too bad...

thanks,

-Jason