LinuxLists.cc - [ANNOUNCE] New utility: 'trace'

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
> We are pleased to announce a new tracing tool called 'trace'.
>

Hi Thomas, Ingo - Congrats and Thanks!

> At this point we'd like to ask for feedback from other users of tracing tools,
> which workflow components would you like to see in addition to what the 'trace'
> tool is offering?
>
> Comments, bug reports, suggestions welcome,

The first thing that comes to mind is trace_marker:

echo "Test point A" > /sys/kernel/debug/tracing/trace_marker

I've found this sort of markup to be useful using ftrace (with C
equivalents embedded in the test case). Is this supported?

Thanks,

--
Darren Hart
Yocto Linux Kernel

2010-11-16 21:36:32

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Thomas Gleixner <[email protected]> wrote:

> Todo list:
>
> [...]
>
> - Implement -p NNN (trace existing pid) option

Update, 'trace -p <pid> ...' is now implemented via:

da46d9f09699: trace: Add -p <pid> recording

I have just pushed it out into -tip.

Thanks,

Ingo

2010-11-16 22:00:19

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Darren Hart <[email protected]> wrote:

> On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
> >We are pleased to announce a new tracing tool called 'trace'.
> >
>
> Hi Thomas, Ingo - Congrats and Thanks!

You are welcome :)

> > At this point we'd like to ask for feedback from other users of tracing tools,
> > which workflow components would you like to see in addition to what the 'trace'
> > tool is offering?
> >
> > Comments, bug reports, suggestions welcome,
>
> The first thing that comes to mind is trace_marker:
>
> echo "Test point A" > /sys/kernel/debug/tracing/trace_marker
>
> I've found this sort of markup to be useful using ftrace (with C equivalents
> embedded in the test case). Is this supported?

Yes, LatencyTop uses something similar IIRC, via a prctl() hack: it uses
prctl(PR_SET_NAME) to inject up to 16 characters into the comm - and then every perf
event gets that comm. You can utilize that straight away if you need a marker
solution right now.

A cleaner, more generic approach would be a more free-form ASCII event facility for
this: a generic 'user-space triggered event' with injection capabilities.

To inject it, we could put that into a prctl() as well, for easy, single-syscall
access from C. We already have two perf prctls: PR_TASK_PERF_EVENTS_DISABLE,
PR_TASK_PERF_EVENTS_ENABLE - and this would be a third one: PR_TASK_PERF_USER_EVENT.

This would, like the existing perf_event_task_enable()/disable() methods, loop
through current events and inject a string into matching user events.

Ideally i'd like to see basically user-space access to trace_printk(), and proper
integration into /debug/tracing/events/ enumeration and availability - not a
ftrace-specific and admin-only hack like /sys/kernel/debug/tracing/trace_marker is
today.

Would you be interested in helping out with (and testing) such a more generic
approach?

Thanks,

Ingo

2010-11-16 22:03:48

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, 2010-11-16 at 22:59 +0100, Ingo Molnar wrote:
>
>
> Would you be interested in helping out with (and testing) such a more generic
> approach?

how about having perf-record open a named pipe and rewriting everything
that comes in through that as an event and storing it in the buffer?

2010-11-16 22:07:14

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/16/2010 01:59 PM, Ingo Molnar wrote:
>
> * Darren Hart<[email protected]> wrote:
>
>> On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
>>> We are pleased to announce a new tracing tool called 'trace'.
>>>
>>
>> Hi Thomas, Ingo - Congrats and Thanks!
>
> You are welcome :)
>
>>> At this point we'd like to ask for feedback from other users of tracing tools,
>>> which workflow components would you like to see in addition to what the 'trace'
>>> tool is offering?
>>>
>>> Comments, bug reports, suggestions welcome,
>>
>> The first thing that comes to mind is trace_marker:
>>
>> echo "Test point A"> /sys/kernel/debug/tracing/trace_marker
>>
>> I've found this sort of markup to be useful using ftrace (with C equivalents
>> embedded in the test case). Is this supported?
>
> Yes, LatencyTop uses something similar IIRC, via a prctl() hack: it uses
> prctl(PR_SET_NAME) to inject up to 16 characters into the comm - and then every perf
> event gets that comm. You can utilize that straight away if you need a marker
> solution right now.

OK, definitely much more limited than trace_marker.

>
> A cleaner, more generic approach would be a more free-form ASCII event facility for
> this: a generic 'user-space triggered event' with injection capabilities.
>
> To inject it, we could put that into a prctl() as well, for easy, single-syscall
> access from C. We already have two perf prctls: PR_TASK_PERF_EVENTS_DISABLE,
> PR_TASK_PERF_EVENTS_ENABLE - and this would be a third one: PR_TASK_PERF_USER_EVENT.
>
> This would, like the existing perf_event_task_enable()/disable() methods, loop
> through current events and inject a string into matching user events.
>
> Ideally i'd like to see basically user-space access to trace_printk(), and proper

Right, this would be ideal.

> integration into /debug/tracing/events/ enumeration and availability - not a
> ftrace-specific and admin-only hack like /sys/kernel/debug/tracing/trace_marker is
> today.
>
> Would you be interested in helping out with (and testing) such a more generic
> approach?

I would. I need some time to familiarize myself with perf first (I've
been a relatively happy ftrace/trace-cmd user for the last couple years).

--
Darren Hart
Yocto Linux Kernel

2010-11-16 22:09:15

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/16/2010 02:03 PM, Peter Zijlstra wrote:
> On Tue, 2010-11-16 at 22:59 +0100, Ingo Molnar wrote:
>>
>>
>> Would you be interested in helping out with (and testing) such a more generic
>> approach?
>
> how about having perf-record open a named pipe and rewriting everything
> that comes in through that as an event and storing it in the buffer?

When I'm tracing an app or test-case, I want to be able to quickly add
and change strings. I like the trace_printk sort of access as it avoids
having to do a bunch of string formatting (sprintf()) calls in the test app.

--
Darren Hart
Yocto Linux Kernel

2010-11-16 22:09:21

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2010-11-16 at 22:59 +0100, Ingo Molnar wrote:
> >
> >
> > Would you be interested in helping out with (and testing) such a more generic
> > approach?
>
> how about having perf-record open a named pipe and rewriting everything that comes
> in through that as an event and storing it in the buffer?

Yeah, that would work too in most cases - except if there's no VFS connection
between tracer and tracee. (as it might be in the case of say chroot environments,
etc.)

I think it's better to have kernel solutions that dont modify the VFS beyond the
trace.data session info. A task could thus generate a user-space event without
having to negotiate with the tracer.

Thanks,

Ingo

2010-11-16 22:17:35

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, Nov 16, 2010 at 01:27:35PM -0800, Darren Hart wrote:
> On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
>> We are pleased to announce a new tracing tool called 'trace'.
>>
>
> Hi Thomas, Ingo - Congrats and Thanks!
>
>> At this point we'd like to ask for feedback from other users of tracing tools,
>> which workflow components would you like to see in addition to what the 'trace'
>> tool is offering?
>>
>> Comments, bug reports, suggestions welcome,
>
> The first thing that comes to mind is trace_marker:
>
> echo "Test point A" > /sys/kernel/debug/tracing/trace_marker
>
> I've found this sort of markup to be useful using ftrace (with C
> equivalents embedded in the test case). Is this supported?

No, the trace_marker file sends trace_printk events. And trace_printk
events needs to be converted to use the unified trace event interface
(ie: implement a struct ftrace_event_call).

This shouldn't be so hard, and there are several ways we could do this.

An idea is to reproduce the kernel file hierarchy in a "printk" event
subsystem, but this implies to allow subsystems nesting.

Imagine you have two trace_printk(), one in kernel/sched.c:117
and one in kernel/time/tick-sched.c:228
The result would be:

$ ls /sys/kernel/debug/tracing/events/printk

/sys/kernel/debug/tracing/events/printk:
enable filter kernel/

/sys/kernel/debug/tracing/events/printk/kernel:
enable filter sched.c/ time/

/sys/kernel/debug/tracing/events/printk/kernel/sched.c:
enable filter 117/

/sys/kernel/debug/tracing/events/printk/kernel/sched.c/117:
enable filter format id

/sys/kernel/debug/tracing/events/printk/kernel/time:
enable filter tick-sched.c/

/sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c:
enable filter 228/

/sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c/228:
enable filter format id

The format would be quite easy, and only one field for the whole string (strloc).

Not sure what to do with trace_marker. I just think we can add it as:
/sys/kernel/debug/tracing/events/printk/sys/kernel/debug/tracing/trace_marker

<:o)

But may be the whole idea is just fancy and nobody will care, and would just
a simple single event in a printk subsystem, on which you can use a kind of
virtual filter:

echo "path != kernel/whatever.c:226" > /sys/kernel/debug/tracing/events/printk/filter

would turn on every trace_printk() but the one in the given path.

Dunno, I like both ideas. I prefer the first one which looks to me more flexible: could be
useful to list the user his trace_printks for example and implement an interface for him
to quickly select the trace_printk he wants.

For example I'm currently working with dozens of trace_printk() and I would be very happy
to turn some of them off half of the time.

2010-11-16 22:48:13

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/16/2010 02:09 PM, Darren Hart wrote:
> On 11/16/2010 02:03 PM, Peter Zijlstra wrote:
>> On Tue, 2010-11-16 at 22:59 +0100, Ingo Molnar wrote:
>>>
>>>
>>> Would you be interested in helping out with (and testing) such a more
>>> generic
>>> approach?
>>
>> how about having perf-record open a named pipe and rewriting everything
>> that comes in through that as an event and storing it in the buffer?
>
> When I'm tracing an app or test-case, I want to be able to quickly add
> and change strings. I like the trace_printk sort of access as it avoids
> having to do a bunch of string formatting (sprintf()) calls in the test
> app.

Sorry - fprintf would work for Peter's suggestion. I had the PR_SET_NAME
with a short string or another call with jsut a plain C string - those
would require the user doing the string formatting first.

--
Darren Hart
Yocto Linux Kernel

2010-11-17 01:37:11

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, Nov 16, 2010 at 10:04:40PM +0100, Thomas Gleixner wrote:
> If you've booted the new kernel you can run 'trace check' to double
> check that all events used by the tool are available:
>
> $ trace check
>
> Checking whether the kernel has all required events ...
> ... Checking event raw_syscalls:sys_enter:r : ok
> ...
> ... Checking event sched:sched_stat_runtime:r : ok
> Good: all required event types are supported by this kernel.
> The 'trace' utility will be fully functional.

For the benefit of people who create tracepoints, what restrictions
does trace have with respect to event types, and is this anticipated
to change in the future?

> The combo diffstat of the tool is appended at the end of the mail. The
> overwhelming majority of changes is on the tooling side - it uses
> existing perf events facilities and features to implement the tool.

What about the filtering and other general features/functionality of
ftrace? Is the anticipation that this will be ported over to perf?
What about things like blktrace?

Not that I expect all of this will be working with this initial
release, but I'm curious about the long-term roadmap of this
interface. (Obviously subject to change as we learn more, etc. But
I'd love to hear what your current thoughts and plans are.)

Thanks,

- Ted

P.S. What about sysrq-z? Is that going to stay with ftrace side of
the tracing architecture?

2010-11-17 01:47:44

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, 2010-11-16 at 20:37 -0500, Ted Ts'o wrote:

> For the benefit of people who create tracepoints, what restrictions
> does trace have with respect to event types, and is this anticipated
> to change in the future?

Note, I just posted an RFC stable event patch set. I would like any tool
that does general analysis, to use stable events and not random raw
events created by the perspective maintainers.

The tool could tap into the raw events, but I don't like the "trace
check" I think anything that the trace needs should be guaranteed there
(with the stable tracepoints).

The tracepoints used for general analysis should be stable, anything
else is just shear bonus.

-- Steve

2010-11-17 03:16:15

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, Nov 16, 2010 at 08:47:40PM -0500, Steven Rostedt wrote:
>
> Note, I just posted an RFC stable event patch set. I would like any tool
> that does general analysis, to use stable events and not random raw
> events created by the perspective maintainers.
>
> The tool could tap into the raw events, but I don't like the "trace
> check" I think anything that the trace needs should be guaranteed there
> (with the stable tracepoints).
>
> The tracepoints used for general analysis should be stable, anything
> else is just shear bonus.

Which is fine with me, so long as it is accepted that the 'trace' tool
is not targetted at kernel developers (who would probably like to use
a combination of stable and unstable tracepoints).

Do we all agree on what the intended target audience is for this
'trace' tool?

My one concern of having a tool that doesn't support the unstable
tracepoints is that if the kernel developers aren't using it for their
day-to-day use, it won't get attention/love, and it risks the fate of
'systemtap' --- ignored by kernel developers and treated as if it
doesn't exist.

Maybe the answer is there's a #ifdef, and there's one version of the
tool that is intended for use by kernel developers, and one which is
restricted to the stable interface that we give to the hoi polloi?

- Ted

2010-11-17 03:34:06

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, 2010-11-16 at 22:16 -0500, Ted Ts'o wrote:
> On Tue, Nov 16, 2010 at 08:47:40PM -0500, Steven Rostedt wrote:
> >
> > Note, I just posted an RFC stable event patch set. I would like any tool
> > that does general analysis, to use stable events and not random raw
> > events created by the perspective maintainers.
> >
> > The tool could tap into the raw events, but I don't like the "trace
> > check" I think anything that the trace needs should be guaranteed there
> > (with the stable tracepoints).
> >
> > The tracepoints used for general analysis should be stable, anything
> > else is just shear bonus.
>
> Which is fine with me, so long as it is accepted that the 'trace' tool
> is not targetted at kernel developers (who would probably like to use
> a combination of stable and unstable tracepoints).
>
> Do we all agree on what the intended target audience is for this
> 'trace' tool?
>
> My one concern of having a tool that doesn't support the unstable
> tracepoints is that if the kernel developers aren't using it for their
> day-to-day use, it won't get attention/love, and it risks the fate of
> 'systemtap' --- ignored by kernel developers and treated as if it
> doesn't exist.
>
> Maybe the answer is there's a #ifdef, and there's one version of the
> tool that is intended for use by kernel developers, and one which is
> restricted to the stable interface that we give to the hoi polloi?

I was not involved in making of this tool, but I would envision that it
would probably have special options to enable the raw events. That is,
basic users would never see the events that kernel developers use, but
with a simple switch (-r?) it would enable all events (both stable and
raw). It just wont have the ability to do any analysis of these events.
That is, it can just report the events (much like what trace-cmd does
today), but they will just exist as an event, not anything the tool can
give true meaning to. But the kernel developer should be able to
understand it.(*)

If we get a consensus on stable events, I would like to add
a /debug/events directory that will be similar to
the /debug/tracing/events, except that it will have the new format for
all raw events.

-- Steve

(*) Currently in trace-cmd I have plugins. These plugins are simple
functions to read events and either output better displays of the event
or do something special (Avi wrote some for his kvm events). I would
like to create a tools/trace/plugins directory that this trace tool
could use. A kernel developer could write the code to pretty print their
event and show much better information that reading a bunch of hex
fields. But more on this later, I would like to work on getting stable
events done first.

2010-11-17 08:31:04

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Frederic Weisbecker <[email protected]> wrote:

> On Tue, Nov 16, 2010 at 01:27:35PM -0800, Darren Hart wrote:
> > On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
> >> We are pleased to announce a new tracing tool called 'trace'.
> >>
> >
> > Hi Thomas, Ingo - Congrats and Thanks!
> >
> >> At this point we'd like to ask for feedback from other users of tracing tools,
> >> which workflow components would you like to see in addition to what the 'trace'
> >> tool is offering?
> >>
> >> Comments, bug reports, suggestions welcome,
> >
> > The first thing that comes to mind is trace_marker:
> >
> > echo "Test point A" > /sys/kernel/debug/tracing/trace_marker
> >
> > I've found this sort of markup to be useful using ftrace (with C
> > equivalents embedded in the test case). Is this supported?
>
>
> No, the trace_marker file sends trace_printk events. And trace_printk
> events needs to be converted to use the unified trace event interface
> (ie: implement a struct ftrace_event_call).
>
> This shouldn't be so hard, and there are several ways we could do this.
>
> An idea is to reproduce the kernel file hierarchy in a "printk" event
> subsystem, but this implies to allow subsystems nesting.
>
> Imagine you have two trace_printk(), one in kernel/sched.c:117
> and one in kernel/time/tick-sched.c:228
> The result would be:
>
> $ ls /sys/kernel/debug/tracing/events/printk
>
> /sys/kernel/debug/tracing/events/printk:
> enable filter kernel/
>
> /sys/kernel/debug/tracing/events/printk/kernel:
> enable filter sched.c/ time/
>
> /sys/kernel/debug/tracing/events/printk/kernel/sched.c:
> enable filter 117/
>
> /sys/kernel/debug/tracing/events/printk/kernel/sched.c/117:
> enable filter format id
>
> /sys/kernel/debug/tracing/events/printk/kernel/time:
> enable filter tick-sched.c/
>
> /sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c:
> enable filter 228/
>
> /sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c/228:
> enable filter format id
>
>
> The format would be quite easy, and only one field for the whole string (strloc).
>
> Not sure what to do with trace_marker. I just think we can add it as:
> /sys/kernel/debug/tracing/events/printk/sys/kernel/debug/tracing/trace_marker
>
> <:o)
>
>
> But may be the whole idea is just fancy and nobody will care, [...]

I think it's an excellent idea, and it could also give us the framework to integrate
the dynamic_printk points.

> [...] and would just a simple single event in a printk subsystem, on which you can
> use a kind of virtual filter:
>
> echo "path != kernel/whatever.c:226" > /sys/kernel/debug/tracing/events/printk/filter
>
> would turn on every trace_printk() but the one in the given path.
>
> Dunno, I like both ideas. I prefer the first one which looks to me more flexible:
> could be useful to list the user his trace_printks for example and implement an
> interface for him to quickly select the trace_printk he wants.

I prefer the first one too.

> For example I'm currently working with dozens of trace_printk() and I would be
> very happy to turn some of them off half of the time.

I guess we could try such a patch. If you send a prototype i'd be interested in
testing it out.

Thanks,

Ingo

2010-11-17 09:50:07

by Philipp Marek

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

Hello Thomas!

Thomas Gleixner <tglx <at> linutronix.de> writes:
> We are pleased to announce a new tracing tool called 'trace'.
...
> read(3:</usr/bin/firefox>, buf: 0x7fffcc4caf40, count: 0x80)

Is there a way to get (parts of) the data being read/written/sent/received?
Something like the -s flag for strace would be nice, but that already seems
taken for scheduler events.

Regards,

Phil

2010-11-17 11:36:04

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> * Frederic Weisbecker <[email protected]> wrote:
>
> > On Tue, Nov 16, 2010 at 01:27:35PM -0800, Darren Hart wrote:
> > > On 11/16/2010 01:04 PM, Thomas Gleixner wrote:
> > >> We are pleased to announce a new tracing tool called 'trace'.
> > >>
> > >
> > > Hi Thomas, Ingo - Congrats and Thanks!
> > >
> > >> At this point we'd like to ask for feedback from other users of tracing tools,
> > >> which workflow components would you like to see in addition to what the 'trace'
> > >> tool is offering?
> > >>
> > >> Comments, bug reports, suggestions welcome,
> > >
> > > The first thing that comes to mind is trace_marker:
> > >
> > > echo "Test point A" > /sys/kernel/debug/tracing/trace_marker
> > >
> > > I've found this sort of markup to be useful using ftrace (with C
> > > equivalents embedded in the test case). Is this supported?
> >
> >
> > No, the trace_marker file sends trace_printk events. And trace_printk
> > events needs to be converted to use the unified trace event interface
> > (ie: implement a struct ftrace_event_call).
> >
> > This shouldn't be so hard, and there are several ways we could do this.
> >
> > An idea is to reproduce the kernel file hierarchy in a "printk" event
> > subsystem, but this implies to allow subsystems nesting.
> >
> > Imagine you have two trace_printk(), one in kernel/sched.c:117
> > and one in kernel/time/tick-sched.c:228
> > The result would be:
> >
> > $ ls /sys/kernel/debug/tracing/events/printk
> >
> > /sys/kernel/debug/tracing/events/printk:
> > enable filter kernel/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel:
> > enable filter sched.c/ time/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/sched.c:
> > enable filter 117/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/sched.c/117:
> > enable filter format id
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/time:
> > enable filter tick-sched.c/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c:
> > enable filter 228/
> >
> > /sys/kernel/debug/tracing/events/printk/kernel/time/tick-sched.c/228:
> > enable filter format id
> >
> >
> > The format would be quite easy, and only one field for the whole string (strloc).
> >
> > Not sure what to do with trace_marker. I just think we can add it as:
> > /sys/kernel/debug/tracing/events/printk/sys/kernel/debug/tracing/trace_marker
> >
> > <:o)
> >
> >
> > But may be the whole idea is just fancy and nobody will care, [...]
>
> I think it's an excellent idea, and it could also give us the framework to integrate
> the dynamic_printk points.
>
> > [...] and would just a simple single event in a printk subsystem, on which you can
> > use a kind of virtual filter:
> >
> > echo "path != kernel/whatever.c:226" > /sys/kernel/debug/tracing/events/printk/filter
> >
> > would turn on every trace_printk() but the one in the given path.
> >
> > Dunno, I like both ideas. I prefer the first one which looks to me more flexible:
> > could be useful to list the user his trace_printks for example and implement an
> > interface for him to quickly select the trace_printk he wants.
>
> I prefer the first one too.
>
> > For example I'm currently working with dozens of trace_printk() and I would be
> > very happy to turn some of them off half of the time.
>
> I guess we could try such a patch. If you send a prototype i'd be interested in
> testing it out.

I don't see the point, the kernel shouldn't contain any trace_printk()s
to begin with..

2010-11-17 12:48:09

by Török Edwin

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, 16 Nov 2010 22:04:40 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:

> After years of efforts we have not succeeded in meeting (let alone
> exceeding) the utility of decades-old user-space tracing tools such as
> strace - except for a few new good tools such as PowerTop and
> LatencyTop.
>
> 'trace' is our shot at improving the situation: it aims at providing a
> simple to use and straightforward tracing tool based on the perf
> infrastructure and on the well-known perf profiling workflow:

Interesting, I just tried it using the -tip kernel.
Looks like a good start, but there are some features I'm missing, see below.

> The combo diffstat of the tool is appended at the end of the mail.

How did you generate that? In other words is there a git command I can
use to apply the mm_pagefault/vfs_getname/trace patches on top of
2.6.36? Doing a git merge tip/tmp.perf/trace merged 2.6.37 for me.
For now I'm running a -tip kernel to test.

> - Unintrusive tracing

Nice! At a quick glance it didn't seem to impact performance much, I'll
have to do some measurements.

>
> - It will record filterable information about the session:
>
> * syscalls

There is no strace-like "not finished":
clamd/11086 ( 0.000 ms): pread(0x9, 0x7f142f7f1000, 0x2000, 0xa8000, 0x2000, 0xa8) => 0x1
clamd/11087 ( 0.014 ms): futex(uaddr: 0x7f143fad3260, op: 0x80, val: 0x2, utime: 0x0, uaddr2: 0x7f143fad3260, val3: 0x2b4f) => 0x0
clamd/11087 ( 0.001 ms): futex(uaddr: 0x7f143fad3260, op: 0x81, val: 0x1, utime: 0xfe0, uaddr2: 0x7f143fad3260, val3: 0x2b4f) => 0x0
clamd/11086 ( 0.011 ms): ... [continued]: pread() => 0x2000

I assume that the first pread with 0ms time got interrupted, and it resumed/continued later. Is that the case?
Is the 0x1 return value bogus then? If so it would be less confusing if you output what strace usually does (".. not finished" IIRC).

> * task lifetime events (fork/clone/exit)
> * pagefaults
> * various mm events (mmap)
> * scheduling events
>
> * (support for more events is being worked on)

Is it possible to use 'trace record' and 'perf record' at the same time?
I.e to do both strace-like (trace) and oprofile-like (perf record)
recording?

>
> - Iterative analysis of an existing session/trace without having
> to re-run the trace, with different filters and options (trace
> data is saved on disk)
>
> - Concurrent and multi-user safe trace recording: multiple users
> can trace, or the same user can trace in multiple different
> tracing sessions.
>
> - Available to plain users - the permission model is not
> completely done yet, it needs a temporary workaround currently:
>
> echo -1 > /proc/sys/kernel/perf_event_paranoid

I see --call-graph in the trace record -h output, but it doesn't seem to work on x86_64 for me.
Do I need to pass some additional parameters to trace report, do I need additional CONFIG_ turned on?
I have:
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
# CONFIG_STACK_TRACER is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

Does it use the same code for callgraphs as ftrace, so I could choose userspace stacktrace?

>
> Todo list:
>
> - Improve output formatting
> - Utilize more events: networking, block IO
> - Add more higher level trace points
> - Implement 'trace diff' utility to simplify the comparison of
> traces
> - Improve the permission model (work in progress)
> - Implement -p NNN (trace existing pid) option

1. I miss sorting.
Sure I can do something like this: trace report -d <N> |
sort -g -k2 -t\(, but I have to choose <N> first, and it has to output
all the data before sorting can begin (including the text part).
It would be better if I 'trace' itself could sort based on the syscall
latency time, and show me the slowest syscalls first.

2. Another feature I miss is ability to sum up syscall times per syscall, and output a summary.

3. The -s output is not very intuitive, the sched tracer output in /sys/debug/tracing was more intuitive IIRC,
showing that you swithced from process A to process B. I would like to see that in the -s output
(and if switching to a process not traced, it should say so).

4. Also it would be useful to know if during a long syscall (futex, pread, etc.) the CPU time was wasted, or
another thread of same process got scheduled in. Calculating the overhead of doing that would be interesting too.
I think this could be done with a tool that postprocesses the results, but maybe its easier to do in the tool itself.
For example:
thread1: futex(...) [... not finished]
thread2: (delta1 ms) ... scheduled in place of thread1 (switch thread1 -> thread2)
thread2: ... do stuff ...
thread1: (delta2 ms) ... scheduled in place of thread2 (switch thread2 -> thread1)
thread1: (delta3 ms) [... continued] futex() => ...

I would be interested to see delta1, and delta2, perhaps totalled up per syscall, per thread or per process.

5. Also it would be good if it could tell what a futex is used for:
1122.567 clamd/11082 (139.120 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x883, utime: 0x0, uaddr2: 0x0, val3: 0x441) => 0x0
1712.922 clamd/11082 (309.925 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x891, utime: 0x0, uaddr2: 0x0, val3: 0x448) => 0x0
2014.289 clamd/11082 (244.312 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x895, utime: 0x0, uaddr2: 0x0, val3: 0x44a) => 0x0
3639.956 clamd/11082 (104.370 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x919, utime: 0x0, uaddr2: 0x0, val3: 0x48c) => 0x0
4371.086 clamd/11082 (158.304 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x951, utime: 0x0, uaddr2: 0x0, val3: 0x4a8) => 0x0
5204.773 clamd/11082 (133.566 ms): futex(uaddr: 0x1b59d44, op: 0x80, val: 0x999, utime: 0x0, uaddr2: 0x0, val3: 0x4cc) => 0x0

Using /proc/`pidof clamd`/maps I found out that 0x1b59d44 is on the heap. Such large delays are probably
due to a pthread_cond_wait/cond_timed_wait, and after spending a bit of time in gdb I found out exactly which condition var it is:
(gdb) p &thr_pool->queueable_bulk_cond.__data.__futex
$10 = (unsigned int *) 0x1b59d44

But I don't want to repeat that for each futex call with a different address.
It would be good if 'trace' could figure out what kind of futex wait this is (i.e. mutex lock or condition variable wait), I hope
the kernel has a wait to tell them apart.
Large delays for waiting on condition variables is expected, and I want to exclude them from my trace.
However I do want to see if there are large delays on mutex locks, that is definetely something I am interested in.
Now of course 'strace' can't make the difference between these two, but 'trace' being a kernel feature might have more information.

Of course it'd be even better if it could show the name of the condition variable, but if the condition variable is on the heap that is hard to do.
I'll have to see if there is a way to script gdb to lookup the variable associated with an address (by looking through the local and global vars on the current thread).

>
> Main differences to strace:
>
> - Low impact recording
>
> - Arguments are printed in plain hex representation, except for
> important arguments like filenames, which are resolved after
> the syscall is recorded. 'trace' wont do full decoding like
> strace does.

Did you consider providing a tool to post-process the output and
perform that decoding?

> Another thing that is recorded are page fault events. These can be
> analyzed via the -p option:

Error: switch `p' requires a value
-P works though.

Best regards,
--Edwin

2010-11-17 12:51:44

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:47 +0200, Török Edwin wrote:
> I see --call-graph in the trace record -h output, but it doesn't seem
> to work on x86_64 for me.

If you want to unwind userspace you need to build everything with
framepointers -- sadly some people have been lobbying to remove
framepointers from all distro builds, even though on x86_64 its nearly
free (i386 does have a significant performance benefit, sadly).

There's some work on-going to simply copy out the top of stack and let
dwarves go wild at it in userspace, but that isn't done yet.

2010-11-17 12:53:55

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > For example I'm currently working with dozens of trace_printk() and I would be
> > > very happy to turn some of them off half of the time.
> >
> > I guess we could try such a patch. If you send a prototype i'd be interested in
> > testing it out.
>
> I don't see the point, the kernel shouldn't contain any trace_printk()s
> to begin with..

It's oriented toward developers. Those who use dozens of tracepoints in
their tree because they are debugging something or developing a new feature,
they might to deactivate/reactivate some of these independant points.

This can also apply to dynamic_printk of course.

Well, the very first and main point is to standardize trace_printk into
a trace event so that it gets usable by perf tools. I have been asked many
times "how to use trace_printk() with perf?".

2010-11-17 13:02:42

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
> On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > > For example I'm currently working with dozens of trace_printk() and I would be
> > > > very happy to turn some of them off half of the time.
> > >
> > > I guess we could try such a patch. If you send a prototype i'd be interested in
> > > testing it out.
> >
> > I don't see the point, the kernel shouldn't contain any trace_printk()s
> > to begin with..
>
>
> It's oriented toward developers. Those who use dozens of tracepoints in
> their tree because they are debugging something or developing a new feature,
> they might to deactivate/reactivate some of these independant points.
>
> This can also apply to dynamic_printk of course.
>
> Well, the very first and main point is to standardize trace_printk into
> a trace event so that it gets usable by perf tools. I have been asked many
> times "how to use trace_printk() with perf?".

Thing is, since its these dev who add the trace_printk()s to begin with,
I don't see the point in splitting them out, if you didn't want them why
did you add them to begin with?!

As to the trace_printk() to perf interface, you could do like mingo did
and create a fake event and use the regular tracepoint interface, or
hook it up directly and create a PERF_RECORD_TEXT field.

Personally I like the trace_printk() as a TRACE_EVENT(printk), it also
allows removing lots of the special casing concerning trace_printk from
ftrace.

2010-11-17 13:05:23

by Török Edwin

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 17 Nov 2010 13:51:52 +0100
Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-11-17 at 14:47 +0200, Török Edwin wrote:
> > I see --call-graph in the trace record -h output, but it doesn't
> > seem to work on x86_64 for me.
>
> If you want to unwind userspace you need to build everything with
> framepointers -- sadly some people have been lobbying to remove
> framepointers from all distro builds, even though on x86_64 its nearly
> free (i386 does have a significant performance benefit, sadly).
>
> There's some work on-going to simply copy out the top of stack and let
> dwarves go wild at it in userspace, but that isn't done yet.

Even without frame-pointers I should see at least the first caller in
userspace, since that can be obtained by reading the return address of
the syscall, right?
I don't see any difference in the output of 'trace report' thats why
I'm asking it is implemented at all right now, or I need additional
flags to see it.

I do see some difference in output of 'perf record -g/perf report'
though.

Best regards,
--Edwin

2010-11-17 13:10:01

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:05 +0200, Török Edwin wrote:
> On Wed, 17 Nov 2010 13:51:52 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, 2010-11-17 at 14:47 +0200, Török Edwin wrote:
> > > I see --call-graph in the trace record -h output, but it doesn't
> > > seem to work on x86_64 for me.
> >
> > If you want to unwind userspace you need to build everything with
> > framepointers -- sadly some people have been lobbying to remove
> > framepointers from all distro builds, even though on x86_64 its nearly
> > free (i386 does have a significant performance benefit, sadly).
> >
> > There's some work on-going to simply copy out the top of stack and let
> > dwarves go wild at it in userspace, but that isn't done yet.
>
> Even without frame-pointers I should see at least the first caller in
> userspace, since that can be obtained by reading the return address of
> the syscall, right?

Right.

> I don't see any difference in the output of 'trace report' thats why
> I'm asking it is implemented at all right now, or I need additional
> flags to see it.

Ah, I haven't actually looked at any of the perf-trace code yet, so I'm
afraid I can't answer that particular question.

> I do see some difference in output of 'perf record -g/perf report'
> though.

ok.

2010-11-17 13:11:03

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
> > On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > > > For example I'm currently working with dozens of trace_printk() and I would be
> > > > > very happy to turn some of them off half of the time.
> > > >
> > > > I guess we could try such a patch. If you send a prototype i'd be interested in
> > > > testing it out.
> > >
> > > I don't see the point, the kernel shouldn't contain any trace_printk()s
> > > to begin with..
> >
> >
> > It's oriented toward developers. Those who use dozens of tracepoints in
> > their tree because they are debugging something or developing a new feature,
> > they might to deactivate/reactivate some of these independant points.
> >
> > This can also apply to dynamic_printk of course.
> >
> > Well, the very first and main point is to standardize trace_printk into
> > a trace event so that it gets usable by perf tools. I have been asked many
> > times "how to use trace_printk() with perf?".
>
> Thing is, since its these dev who add the trace_printk()s to begin with, I don't
> see the point in splitting them out, if you didn't want them why did you add them
> to begin with?!

That's a common workflow: lots of printks (trace_printk's) put all around the code -
and sometimes one set of tracepoints is needed, one time another set.

_If_ we succeed in presenting them like Frederic suggested it, and if we make the
turning on/off _simpler_ (no kernel modification) and faster (no kernel reboot) via
the tooling, people like Frederic might start using it.

I dont think we should fight the workflow itself - it makes sense.

The only question is whether we can represent it all in a nicer fashion than 'modify
the source code and reboot'. If we cannot then there's no point - but i'm not sure
about it and Frederic seems to be convinced too that he can make such a switch
on/off facility intuitive. We'll only see if we try it.

Also, i dont see any harm - do you?

Thanks,

Ingo

2010-11-17 13:24:33

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Ted Ts'o <[email protected]> wrote:

> On Tue, Nov 16, 2010 at 10:04:40PM +0100, Thomas Gleixner wrote:
> > If you've booted the new kernel you can run 'trace check' to double
> > check that all events used by the tool are available:
> >
> > $ trace check
> >
> > Checking whether the kernel has all required events ...
> > ... Checking event raw_syscalls:sys_enter:r : ok
> > ...
> > ... Checking event sched:sched_stat_runtime:r : ok
> > Good: all required event types are supported by this kernel.
> > The 'trace' utility will be fully functional.
>
> For the benefit of people who create tracepoints, what restrictions
> does trace have with respect to event types, and is this anticipated
> to change in the future?

There are no conceptual restrictions - this v1 version of the tool uses a (small)
subset of tracepoints right now.

ext4 tracepoints will be used once someone writes a trace code (or plugin) for it -
i think beyond simply displaying that trace data it would also be useful to
interpret it to a certain degree? They could also interact with the block events: so
we could track which inode (or other ext4 data structure) generates what IO, what
the latencies are and which task originated things. We could track what the wait
reasons are.

At that point (when ext4 event support is added) we'd add 'trace check' test as
well, and would generally try to make sure that if distros enable events that all
commonly used tracepoints are available as well.

I.e. the long term goal would be to create a widely available tool, which, amongst
other things, can be used to record and report ext4 events as well - with no kernel
reboots or tool rebuilds required.

> > The combo diffstat of the tool is appended at the end of the mail. The
> > overwhelming majority of changes is on the tooling side - it uses
> > existing perf events facilities and features to implement the tool.
>
> What about the filtering and other general features/functionality of
> ftrace? Is the anticipation that this will be ported over to perf?
> What about things like blktrace?

Yeah, i'd love to see all that available. Filtering is available already on the
kernel perf event side and could be added as an option.

> Not that I expect all of this will be working with this initial
> release, but I'm curious about the long-term roadmap of this
> interface. (Obviously subject to change as we learn more, etc. But
> I'd love to hear what your current thoughts and plans are.)
>
> P.S. What about sysrq-z? Is that going to stay with ftrace side of
> the tracing architecture?

What we are working towards is to unify the whole tracing infrastructure. IMHO it's
not really acceptable that there is an 'ftrace side' and a 'perf side', with feature
overlap and feature mismatch and general confusion.

Thanks,

Ingo

2010-11-17 13:32:15

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

(Please keep the initial Cc list)

On Wed, Nov 17, 2010 at 02:47:58PM +0200, T?r?k Edwin wrote:
> I see --call-graph in the trace record -h output, but it doesn't seem to work on x86_64 for me.
> Do I need to pass some additional parameters to trace report, do I need additional CONFIG_ turned on?
> I have:
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
> # CONFIG_CC_STACKPROTECTOR is not set
> CONFIG_STACKTRACE=y
> CONFIG_USER_STACKTRACE_SUPPORT=y
> # CONFIG_STACK_TRACER is not set
> # CONFIG_DEBUG_STACKOVERFLOW is not set
> # CONFIG_DEBUG_STACK_USAGE is not set
>
> Does it use the same code for callgraphs as ftrace, so I could choose userspace stacktrace?

So, this option comes from the fact this new tools wraps on perf record
command line. So yeah, your callchains are recorded, they are just not
reported by this new tool (yet).

Only perf report know how to deal with them for now.

You can play with that though, recording with "trace record -g" and look
at the result with "perf report", you'll get nice stats about your
callchains.

Now the tracing has yet to make use of it.

Ah and it doesn't use the same code than ftrace, but perf does kernel and
user callchain snapshot too. "-g" just record both.

2010-11-17 13:37:17

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:10 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
> > > On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> > > > On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > > > > For example I'm currently working with dozens of trace_printk() and I would be
> > > > > > very happy to turn some of them off half of the time.
> > > > >
> > > > > I guess we could try such a patch. If you send a prototype i'd be interested in
> > > > > testing it out.
> > > >
> > > > I don't see the point, the kernel shouldn't contain any trace_printk()s
> > > > to begin with..
> > >
> > >
> > > It's oriented toward developers. Those who use dozens of tracepoints in
> > > their tree because they are debugging something or developing a new feature,
> > > they might to deactivate/reactivate some of these independant points.
> > >
> > > This can also apply to dynamic_printk of course.
> > >
> > > Well, the very first and main point is to standardize trace_printk into
> > > a trace event so that it gets usable by perf tools. I have been asked many
> > > times "how to use trace_printk() with perf?".
> >
> > Thing is, since its these dev who add the trace_printk()s to begin with, I don't
> > see the point in splitting them out, if you didn't want them why did you add them
> > to begin with?!
>
> That's a common workflow: lots of printks (trace_printk's) put all around the code -
> and sometimes one set of tracepoints is needed, one time another set.
>
> _If_ we succeed in presenting them like Frederic suggested it, and if we make the
> turning on/off _simpler_ (no kernel modification) and faster (no kernel reboot) via
> the tooling, people like Frederic might start using it.
>
> I dont think we should fight the workflow itself - it makes sense.
>
> The only question is whether we can represent it all in a nicer fashion than 'modify
> the source code and reboot'. If we cannot then there's no point - but i'm not sure
> about it and Frederic seems to be convinced too that he can make such a switch
> on/off facility intuitive. We'll only see if we try it.
>
> Also, i dont see any harm - do you?

Yes, trace_printk() is a pure debug interface, solely meant for the edit
+ reboot cycle.

If you want anything more than that we've got tracepoints. The rule up
until now has been to never merge a trace_printk() user.

2010-11-17 13:38:52

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:24 +0100, Ingo Molnar wrote:
> > What about the filtering and other general features/functionality of
> > ftrace? Is the anticipation that this will be ported over to perf?
> > What about things like blktrace?
>
> Yeah, i'd love to see all that available. Filtering is available already on the
> kernel perf event side and could be added as an option.

Except I utterly detest the existing filter code.. it runs _after_ we do
all the hard work and then makes us roll back when it decides we
shouldn't output after all.

But yes, it is functional.

2010-11-17 13:43:10

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 02:36:16PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 14:10 +0100, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
> > > > On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
> > > > > On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
> > > > > > > For example I'm currently working with dozens of trace_printk() and I would be
> > > > > > > very happy to turn some of them off half of the time.
> > > > > >
> > > > > > I guess we could try such a patch. If you send a prototype i'd be interested in
> > > > > > testing it out.
> > > > >
> > > > > I don't see the point, the kernel shouldn't contain any trace_printk()s
> > > > > to begin with..
> > > >
> > > >
> > > > It's oriented toward developers. Those who use dozens of tracepoints in
> > > > their tree because they are debugging something or developing a new feature,
> > > > they might to deactivate/reactivate some of these independant points.
> > > >
> > > > This can also apply to dynamic_printk of course.
> > > >
> > > > Well, the very first and main point is to standardize trace_printk into
> > > > a trace event so that it gets usable by perf tools. I have been asked many
> > > > times "how to use trace_printk() with perf?".
> > >
> > > Thing is, since its these dev who add the trace_printk()s to begin with, I don't
> > > see the point in splitting them out, if you didn't want them why did you add them
> > > to begin with?!
> >
> > That's a common workflow: lots of printks (trace_printk's) put all around the code -
> > and sometimes one set of tracepoints is needed, one time another set.
> >
> > _If_ we succeed in presenting them like Frederic suggested it, and if we make the
> > turning on/off _simpler_ (no kernel modification) and faster (no kernel reboot) via
> > the tooling, people like Frederic might start using it.
> >
> > I dont think we should fight the workflow itself - it makes sense.
> >
> > The only question is whether we can represent it all in a nicer fashion than 'modify
> > the source code and reboot'. If we cannot then there's no point - but i'm not sure
> > about it and Frederic seems to be convinced too that he can make such a switch
> > on/off facility intuitive. We'll only see if we try it.
> >
> > Also, i dont see any harm - do you?
>
> Yes, trace_printk() is a pure debug interface, solely meant for the edit
> + reboot cycle.

So why prevent from making it even more handy?

> If you want anything more than that we've got tracepoints. The rule up
> until now has been to never merge a trace_printk() user.

Sure, that doesn't change the core idea of trace_prink(): none of them must
be merged. That new event interface would just make private uses of trace_printk()
more convenient.

2010-11-17 13:53:57

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:43 +0100, Frederic Weisbecker wrote:
>
> > Yes, trace_printk() is a pure debug interface, solely meant for the edit
> > + reboot cycle.
>
> So why prevent from making it even more handy?
>
>
> > If you want anything more than that we've got tracepoints. The rule up
> > until now has been to never merge a trace_printk() user.
>
> Sure, that doesn't change the core idea of trace_prink(): none of them must
> be merged. That new event interface would just make private uses of trace_printk()
> more convenient.

I don't get it, if you don't want it, why put it there in the first
place?

I've never found myself thinking, oh damn, I didn't want to see that
one!, only damn, I should have added more :-)

2010-11-17 14:00:29

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-11-17 at 14:24 +0100, Ingo Molnar wrote:
> > > What about the filtering and other general features/functionality of
> > > ftrace? Is the anticipation that this will be ported over to perf?
> > > What about things like blktrace?
> >
> > Yeah, i'd love to see all that available. Filtering is available already on the
> > kernel perf event side and could be added as an option.
>
> Except I utterly detest the existing filter code.. it runs _after_ we do all the
> hard work and then makes us roll back when it decides we shouldn't output after
> all.
>
> But yes, it is functional.

I suspect that is what matters mostly - unless you think that it's impossible to
have a sane implementation of it, if the users come.

Thanks,

Ingo

2010-11-17 14:10:40

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 02:53:49PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 14:43 +0100, Frederic Weisbecker wrote:
> >
> > > Yes, trace_printk() is a pure debug interface, solely meant for the edit
> > > + reboot cycle.
> >
> > So why prevent from making it even more handy?
> >
> >
> > > If you want anything more than that we've got tracepoints. The rule up
> > > until now has been to never merge a trace_printk() user.
> >
> > Sure, that doesn't change the core idea of trace_prink(): none of them must
> > be merged. That new event interface would just make private uses of trace_printk()
> > more convenient.
>
> I don't get it, if you don't want it, why put it there in the first
> place?
>
> I've never found myself thinking, oh damn, I didn't want to see that
> one!, only damn, I should have added more :-)

Hehe :)

Yeah I have a strange workflow. I'm working on that CPU isolation thing
and I have dozens of trace_printk all over the place for tons of
things. And everytime I remove one to unwind some output or to focus
on another one, I often have to restore it later because I need it
again. Usually I even just comment it out instead of removing it.

If I could make this dynamically on a per line filtering, or sometimes on
a per file granularity (as both are equally often the case for me), I would
probably win some time.

I just don't know how many developers have a similar workflow than mine.

2010-11-17 14:11:51

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:00 +0100, Ingo Molnar wrote:
> > But yes, it is functional.
>
> I suspect that is what matters mostly - unless you think that it's impossible to
> have a sane implementation of it, if the users come.

I'm not sure, I raised the point several times, and I think Steve and
Tom though it was possible to implement sanely, but I'm not sure if it
was expression invariant (the fix that is).

It would be very good to have a definite answer on that.

The idea was to not let the filter engine work on the trace data (once
its gathered) but on the trace argument right at the beginning of the
tracepoint callchain, since some of the trace data is an expression of
the trace argument (say next->prio instead of next), the trace
expression wouldn't stay invariant, you'd have to write a different
filter for the same effect.

So I think it would be wise to make this change sooner rather than
later.

2010-11-17 14:18:48

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> The idea was to not let the filter engine work on the trace data (once
> its gathered) but on the trace argument right at the beginning of the
> tracepoint callchain, since some of the trace data is an expression of
> the trace argument (say next->prio instead of next), the trace
> expression wouldn't stay invariant, you'd have to write a different
> filter for the same effect.
>
> So I think it would be wise to make this change sooner rather than
> later.

Also, I see a lot of overlap with the dynamic probes stuff which needs
the help of magic dwarves to come up with the right data to gather.

Merging the dynamic tracepoint and filter stuff would be nice, there's
no way you can express next->prio without the help of these short
buggers with large axes.

The trouble is that the dynamic tracepoint stuff is privileged for good
reasons, try next+0x1000000 and you're out in the woods, priv. only
filters otoh just doesn't sound too hot.

Another nasty thing is that you actually need to have these dwarves
present, which means having the -debug package installed.

2010-11-17 14:39:21

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 17 Nov 2010, Peter Zijlstra wrote:

> On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> > The idea was to not let the filter engine work on the trace data (once
> > its gathered) but on the trace argument right at the beginning of the
> > tracepoint callchain, since some of the trace data is an expression of
> > the trace argument (say next->prio instead of next), the trace
> > expression wouldn't stay invariant, you'd have to write a different
> > filter for the same effect.
> >
> > So I think it would be wise to make this change sooner rather than
> > later.
>
> Also, I see a lot of overlap with the dynamic probes stuff which needs
> the help of magic dwarves to come up with the right data to gather.
>
> Merging the dynamic tracepoint and filter stuff would be nice, there's
> no way you can express next->prio without the help of these short
> buggers with large axes.
>
> The trouble is that the dynamic tracepoint stuff is privileged for good
> reasons, try next+0x1000000 and you're out in the woods, priv. only
> filters otoh just doesn't sound too hot.
>
> Another nasty thing is that you actually need to have these dwarves
> present, which means having the -debug package installed.

That sounds utterly insane for the basic use case where you trace
application context. There is no point to filter out the tracepoints,
really. Postprocessing can do that just fine.

I consider myself a power user of tracing, but hell I never used any
of those filters in a real use case. awk, grep, scripting languages do
just a better job and you don't miss any data just because you got
your filter expression slightly wrong. Postprocessing always wins in
that regard.

The only reason why these filters might make sense is to reduce the
trace volume on system wide traces in production environments, but
that's a completely different story. These scenarios can do with the
dynamic tracepoint stuff and the custom filtering w/o putting the
burden on the majority of users.

Thanks,

tglx

2010-11-17 14:42:11

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:37 +0100, Thomas Gleixner wrote:
>
> The only reason why these filters might make sense is to reduce the
> trace volume on system wide traces in production environments, but
> that's a completely different story. These scenarios can do with the
> dynamic tracepoint stuff and the custom filtering w/o putting the
> burden on the majority of users.

I've been hearing people claim 'enterprise' and custom are like mutually
exclusive ;-)

But yeah, I totally agree, I've never used any of that filter stuff
either, I always script the post-processing.

2010-11-17 15:02:46

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:10 +0100, Frederic Weisbecker wrote:

> Yeah I have a strange workflow. I'm working on that CPU isolation thing
> and I have dozens of trace_printk all over the place for tons of
> things. And everytime I remove one to unwind some output or to focus
> on another one, I often have to restore it later because I need it
> again. Usually I even just comment it out instead of removing it.
>
> If I could make this dynamically on a per line filtering, or sometimes on
> a per file granularity (as both are equally often the case for me), I would
> probably win some time.
>
> I just don't know how many developers have a similar workflow than mine.

I usually wrap trace_printk() (or printk sometimes) in my own wrapper:

#define sprintk(on, x...) do { if (on) { trace_prinkt(x); } } while (0)

Then have things like:

#define SDR_DEBUG_IRQS 0
#define SDR_DEBUG_SCHED 1
[...]

and in the irq code I'll have:

sprintk(SDR_DEBUG_IRQS, "printme"...);

and the sched code:

sprintk(SDR_DEBUG_SCHED, "printthis"...);

And I can easily enable or disable the prints I want. Sometimes I'll
define it as a variable, and enable them either in certain code paths,
or export them in the debugfs system and enable or disable them from
userspace.

extern int sdr_debug_irqs;
#define SDR_DEBUG_IRQS sdr_debug_irqs

[...]
int sdr_debug_irqs;
[...]

debugfs_create_bool("sdr_debug_sched", 0644, NULL, &sdr_debug_irqs);

Of course this is all for temporary debugging. I use SDR and sdr and
sprintk to search and delete the debugging output when done.

-- Steve

2010-11-17 15:10:38

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 15:00 +0100, Ingo Molnar wrote:
> > > But yes, it is functional.
> >
> > I suspect that is what matters mostly - unless you think that it's impossible to
> > have a sane implementation of it, if the users come.
>
> I'm not sure, I raised the point several times, and I think Steve and
> Tom though it was possible to implement sanely, but I'm not sure if it
> was expression invariant (the fix that is).
>
> It would be very good to have a definite answer on that.
>
> The idea was to not let the filter engine work on the trace data (once
> its gathered) but on the trace argument right at the beginning of the
> tracepoint callchain, since some of the trace data is an expression of
> the trace argument (say next->prio instead of next), the trace
> expression wouldn't stay invariant, you'd have to write a different
> filter for the same effect.
>
> So I think it would be wise to make this change sooner rather than
> later.

Right, the problem with filtering is what do we want to filter, and what
about copying?

Currently, we copy the data into the buffer and then filter on that
data. We could also easily filter on the parameters of the tracepoint,
but sometimes those parameters do not match the final output (as the
case with sched_switch). Do we copy the data into a separate "per cpu"
temp buffer, and figure out the filter then? And if the filter is fine,
then copy into the buffer. This obviously is slow, due to the multiple
copies. We could do this only if the filtering is enabled.

Note, the stable event proposal I suggested passes the same info via the
trace callback parameters that is expected to go into the buffer. Arrays
are passed as pointers so the copying is of the minimum. This would give
us the ability to filter _before_ writing into the buffer, and no
unnecessary copying.

-- Steve

2010-11-17 15:33:27

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 15:00 +0100, Ingo Molnar wrote:
> > > But yes, it is functional.
> >
> > I suspect that is what matters mostly - unless you think that it's impossible to
> > have a sane implementation of it, if the users come.
>
> I'm not sure, I raised the point several times, and I think Steve and
> Tom though it was possible to implement sanely, but I'm not sure if it
> was expression invariant (the fix that is).
>
> It would be very good to have a definite answer on that.
>
> The idea was to not let the filter engine work on the trace data (once
> its gathered) but on the trace argument right at the beginning of the
> tracepoint callchain, since some of the trace data is an expression of
> the trace argument (say next->prio instead of next), the trace
> expression wouldn't stay invariant, you'd have to write a different
> filter for the same effect.
>

IIRC, I think the conclusion we came to was that it could be done
mechanically if for example the right-hand-side of an assignment in
TP_fast_assign() only involved a simple variable assignment, but as
Steve pointed out, some assignments are more complicated than that.

For example, in the sched_switch tracepoint assignments:

__entry->prev_prio = prev->prio;
__entry->prev_state = __trace_sched_switch_state(prev);

so the prev_prio should be able to be tested 'in-line' but the
prev_state would require a temporary buffer to write the value into
before doing the test as mentioned by Steve. At which point you're no
further ahead (in that case) than the current situation...

Tom

> So I think it would be wise to make this change sooner rather than
> later.

2010-11-17 15:41:12

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 15:37 +0100, Thomas Gleixner wrote:
> On Wed, 17 Nov 2010, Peter Zijlstra wrote:
>
> > On Wed, 2010-11-17 at 15:11 +0100, Peter Zijlstra wrote:
> > > The idea was to not let the filter engine work on the trace data (once
> > > its gathered) but on the trace argument right at the beginning of the
> > > tracepoint callchain, since some of the trace data is an expression of
> > > the trace argument (say next->prio instead of next), the trace
> > > expression wouldn't stay invariant, you'd have to write a different
> > > filter for the same effect.
> > >
> > > So I think it would be wise to make this change sooner rather than
> > > later.
> >
> > Also, I see a lot of overlap with the dynamic probes stuff which needs
> > the help of magic dwarves to come up with the right data to gather.
> >
> > Merging the dynamic tracepoint and filter stuff would be nice, there's
> > no way you can express next->prio without the help of these short
> > buggers with large axes.
> >
> > The trouble is that the dynamic tracepoint stuff is privileged for good
> > reasons, try next+0x1000000 and you're out in the woods, priv. only
> > filters otoh just doesn't sound too hot.
> >
> > Another nasty thing is that you actually need to have these dwarves
> > present, which means having the -debug package installed.
>
> That sounds utterly insane for the basic use case where you trace
> application context. There is no point to filter out the tracepoints,
> really. Postprocessing can do that just fine.
>
> I consider myself a power user of tracing, but hell I never used any
> of those filters in a real use case. awk, grep, scripting languages do
> just a better job and you don't miss any data just because you got
> your filter expression slightly wrong. Postprocessing always wins in
> that regard.
>
> The only reason why these filters might make sense is to reduce the
> trace volume on system wide traces in production environments, but
> that's a completely different story. These scenarios can do with the
> dynamic tracepoint stuff and the custom filtering w/o putting the
> burden on the majority of users.
>

Yeah, in my mind, the main point of the filters was to be a 'blunt
force' instrument preventing userspace from being overwhelmed by events.
The real filtering would happen in userspace with e.g. real scripting
languages.

Tom

> Thanks,
>
> tglx

2010-11-17 15:44:11

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 10:10 -0500, Steven Rostedt wrote:
>
> Right, the problem with filtering is what do we want to filter, and what
> about copying?
>
> Currently, we copy the data into the buffer and then filter on that
> data. We could also easily filter on the parameters of the tracepoint,
> but sometimes those parameters do not match the final output (as the
> case with sched_switch). Do we copy the data into a separate "per cpu"
> temp buffer, and figure out the filter then? And if the filter is fine,
> then copy into the buffer. This obviously is slow, due to the multiple
> copies. We could do this only if the filtering is enabled.

Right, so what is the primary purpose of this filtering stuff? As it
stands it makes stuff terribly slow, so you add overhead but the win
(presumably) is less data output, is that a sane trade-off?

Most people I've heard -- both at LinuxCon.JP and LPC -- are asking for
lower overhead tracing (while at the same time demanding more features).

Who are actually using this and can they provide some input on this?

2010-11-17 15:57:09

by Avi Kivity

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 05:41 PM, Tom Zanussi wrote:
>
>> The only reason why these filters might make sense is to reduce the
>> trace volume on system wide traces in production environments, but
>> that's a completely different story. These scenarios can do with the
>> dynamic tracepoint stuff and the custom filtering w/o putting the
>> burden on the majority of users.
>>
> Yeah, in my mind, the main point of the filters was to be a 'blunt
> force' instrument preventing userspace from being overwhelmed by events.
> The real filtering would happen in userspace with e.g. real scripting
> languages.
>

Filtering is also useful for fine-grained perf events; this cannot be
done in userspace.

2010-11-17 15:59:14

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 17:55 +0200, Avi Kivity wrote:
>
> Filtering is also useful for fine-grained perf events; this cannot be
> done in userspace.
>
You're hinting at the kvm_exit histogram thing you've constructed,
right?

2010-11-17 15:59:23

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 17 Nov 2010, Avi Kivity wrote:

> On 11/17/2010 05:41 PM, Tom Zanussi wrote:
> >
> > > The only reason why these filters might make sense is to reduce the
> > > trace volume on system wide traces in production environments, but
> > > that's a completely different story. These scenarios can do with the
> > > dynamic tracepoint stuff and the custom filtering w/o putting the
> > > burden on the majority of users.
> > >
> > Yeah, in my mind, the main point of the filters was to be a 'blunt
> > force' instrument preventing userspace from being overwhelmed by events.
> > The real filtering would happen in userspace with e.g. real scripting
> > languages.
> >
>
> Filtering is also useful for fine-grained perf events; this cannot be done in
> userspace.

Could you explain that a bit more detailed please ?

Thanks,

tglx

2010-11-17 16:04:31

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 16:43 +0100, Peter Zijlstra wrote:

> Most people I've heard -- both at LinuxCon.JP and LPC -- are asking for
> lower overhead tracing (while at the same time demanding more features).

Tell them we have:

1) features
2) fast
3) simple to use

and have them pick any two.

;-)

-- Steve

2010-11-17 16:09:45

by Avi Kivity

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 05:59 PM, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 17:55 +0200, Avi Kivity wrote:
> >
> > Filtering is also useful for fine-grained perf events; this cannot be
> > done in userspace.
> >
> You're hinting at the kvm_exit histogram thing you've constructed,
> right?

Yes.

--
error compiling committee.c: too many arguments to function

2010-11-17 16:15:24

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 16:43 +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 10:10 -0500, Steven Rostedt wrote:
> >
> > Right, the problem with filtering is what do we want to filter, and what
> > about copying?
> >
> > Currently, we copy the data into the buffer and then filter on that
> > data. We could also easily filter on the parameters of the tracepoint,
> > but sometimes those parameters do not match the final output (as the
> > case with sched_switch). Do we copy the data into a separate "per cpu"
> > temp buffer, and figure out the filter then? And if the filter is fine,
> > then copy into the buffer. This obviously is slow, due to the multiple
> > copies. We could do this only if the filtering is enabled.
>
> Right, so what is the primary purpose of this filtering stuff? As it
> stands it makes stuff terribly slow, so you add overhead but the win
> (presumably) is less data output, is that a sane trade-off?

I've actually used filtering too. Not for speed up, but because I was
recording a lot of data and the reader could not keep up. By filtering,
I was able to get all the relevant information without needing to make
the kernel buffer a Gig.

-- Steve

2010-11-17 16:22:01

by Avi Kivity

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 05:58 PM, Thomas Gleixner wrote:
> >
> > Filtering is also useful for fine-grained perf events; this cannot be done in
> > userspace.
>
> Could you explain that a bit more detailed please ?

I have a tracepoint kvm:kvm_exit which is called every time a guest
exits to the host (for whatever reason). The first step in profiling
kvm is checking the rate at which this tracepoint is triggered.

(I have a small tool kvm_stat which shows a 'perf top' like display,
except that instead of showing functions, it shows kvm tracepoints
sorted by triggering rate)

If you see a high exit rate, you may also be interested in what exit
reasons triggered those exits. So kvm_stat creates additional events,
one per possible exit_reason (which is a field in the kvm_exit
tracepoint), with a filter matching exit_reason to the enum describing
the possible exit reasons. We add that to the display and get a sort of
histogram showing which exit_reason is responsible for how many exits.

A screenshot:

kvm statistics

kvm_exit 40075 10293
kvm_entry 40075 10293
kvm_exit(EXCEPTION_NMI) 29824 7712
kvm_page_fault 29642 7672
kvm_emulate_insn 9117 2205
kvm_inj_exception 8052 2134
kvm_mmio 6160 1514
kvm_apic 6143 1509
kvm_exit(CR_ACCESS) 2307 635
kvm_cr 2307 635
kvm_exit(INVLPG) 1998 599
kvm_userspace_exit 2956 582
kvm_exit(IO_INSTRUCTION) 2954 582
kvm_pio 2954 582
kvm_inj_virq 2194 551
kvm_apic_accept_irq 2181 547
kvm_exit(HLT) 1523 372
kvm_exit(EXTERNAL_INTERRUPT) 1131 304
kvm_fpu 706 176
kvm_set_irq 802 156
kvm_pic_set_irq 802 156
kvm_ioapic_set_irq 802 156
kvm_apic_ipi 454 132
kvm_exit(PENDING_INTERRUPT) 175 44
kvm_cpuid 163 42
kvm_exit(CPUID) 163 42
kvm_ack_irq 152 29

So there are 10K exits/sec, of which 7.7K are EXCEPTION_NMI and 635 are
CR_ACCESS.

Tool source http://tinyurl.com/2ue5tev; one day I'll make it a
tools/perf script.

--
error compiling committee.c: too many arguments to function

2010-11-17 18:13:53

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 03:10:33PM +0100, Frederic Weisbecker wrote:
> Yeah I have a strange workflow. I'm working on that CPU isolation thing
> and I have dozens of trace_printk all over the place for tons of
> things. And everytime I remove one to unwind some output or to focus
> on another one, I often have to restore it later because I need it
> again. Usually I even just comment it out instead of removing it.

What I do for my file system development work is to drop in
trace_printk's everywhere, since they are lightweight and don't slow
down the system much. Then when the system wedges, I use sysrq-z to
get a "flight data recorder" printout of what happened up til the
system hung.

I love the fact that the ring buffer is in "overwrite" mode (aka
flight data recorder mode), and hope this doesn't go away.

Per line filtering is also great, but very often when I'm interacting
with the block I/O layer, if something screws up, what happens is "and
then whole machine locks up", and sysrq-z is literally all I have.

- Ted

2010-11-17 18:23:49

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Peter Zijlstra ([email protected]) wrote:
> On Wed, 2010-11-17 at 10:10 -0500, Steven Rostedt wrote:
> >
> > Right, the problem with filtering is what do we want to filter, and what
> > about copying?
> >
> > Currently, we copy the data into the buffer and then filter on that
> > data. We could also easily filter on the parameters of the tracepoint,
> > but sometimes those parameters do not match the final output (as the
> > case with sched_switch). Do we copy the data into a separate "per cpu"
> > temp buffer, and figure out the filter then? And if the filter is fine,
> > then copy into the buffer. This obviously is slow, due to the multiple
> > copies. We could do this only if the filtering is enabled.
>
> Right, so what is the primary purpose of this filtering stuff? As it
> stands it makes stuff terribly slow, so you add overhead but the win
> (presumably) is less data output, is that a sane trade-off?
>
> Most people I've heard -- both at LinuxCon.JP and LPC -- are asking for
> lower overhead tracing (while at the same time demanding more features).
>
> Who are actually using this and can they provide some input on this?

Amongst others, Ericsson rely on this kind of feature. The case where we're
filtering "out" should be really, really fast (even if it makes the recording
slightly slower). If there is a race modifying the data underneath between the
filter execution and the copy to the buffers, I really don't think it matters.
If the tracepoint caller context don't provide data consistency guarantees for
the parameters it passes, we should not expect 100% perfect consistency between
filter and what is actually saved in the trace. The trace analysis tools should
just be able to cope with that, but I really don't think anyone should care.

So I would recommend no copy, filter directly on the source data, stop the
filter chain as soon as a non-match is observed. Copy the data into the buffers
if the filter passes.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-17 18:26:37

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 05:02 AM, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 13:53 +0100, Frederic Weisbecker wrote:
>> On Wed, Nov 17, 2010 at 12:35:50PM +0100, Peter Zijlstra wrote:
>>> On Wed, 2010-11-17 at 09:30 +0100, Ingo Molnar wrote:
>>>>> For example I'm currently working with dozens of trace_printk() and I would be
>>>>> very happy to turn some of them off half of the time.
>>>>
>>>> I guess we could try such a patch. If you send a prototype i'd be interested in
>>>> testing it out.
>>>
>>> I don't see the point, the kernel shouldn't contain any trace_printk()s
>>> to begin with..
>>
>>
>> It's oriented toward developers. Those who use dozens of tracepoints in
>> their tree because they are debugging something or developing a new feature,
>> they might to deactivate/reactivate some of these independant points.
>>
>> This can also apply to dynamic_printk of course.
>>
>> Well, the very first and main point is to standardize trace_printk into
>> a trace event so that it gets usable by perf tools. I have been asked many
>> times "how to use trace_printk() with perf?".
>
> Thing is, since its these dev who add the trace_printk()s to begin with,
> I don't see the point in splitting them out, if you didn't want them why
> did you add them to begin with?!

What I understood from Frederic's email was that during a debug session
it is sometimes helpful to be able to enable and disable the
trace_printk's. This makes sense as it reduces the number of kernel
build/reboot cycles. However, I would think most of that could be
accomplished with some judicious message tagging and post-processing to
filter out the unwanted trace_printk's. The only exception might be when
the trace_printk's add enough overhead to mask a timing related bug. In
this case, I'd probably be tempted to remove the stubs anyway.

--
Darren Hart
Yocto Linux Kernel

2010-11-17 18:29:12

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, Nov 17, 2010 at 01:13:40PM -0500, Ted Ts'o wrote:
> On Wed, Nov 17, 2010 at 03:10:33PM +0100, Frederic Weisbecker wrote:
> > Yeah I have a strange workflow. I'm working on that CPU isolation thing
> > and I have dozens of trace_printk all over the place for tons of
> > things. And everytime I remove one to unwind some output or to focus
> > on another one, I often have to restore it later because I need it
> > again. Usually I even just comment it out instead of removing it.
>
> What I do for my file system development work is to drop in
> trace_printk's everywhere, since they are lightweight and don't slow
> down the system much. Then when the system wedges, I use sysrq-z to
> get a "flight data recorder" printout of what happened up til the
> system hung.
>
> I love the fact that the ring buffer is in "overwrite" mode (aka
> flight data recorder mode), and hope this doesn't go away.
>
> Per line filtering is also great, but very often when I'm interacting
> with the block I/O layer, if something screws up, what happens is "and
> then whole machine locks up", and sysrq-z is literally all I have.

Yeah all agreed.

Steve proposed to keep the current trace_printk() implementation that relies
on ftrace but rename in into ftrace_printk(). So that we can develop a new
trace_printk() based on trace_event interface and in the meantime keep the
old version in case something messes up with the new thing.

2010-11-17 18:30:27

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On 11/17/2010 10:29 AM, Frederic Weisbecker wrote:
> On Wed, Nov 17, 2010 at 01:13:40PM -0500, Ted Ts'o wrote:
>> On Wed, Nov 17, 2010 at 03:10:33PM +0100, Frederic Weisbecker wrote:
>>> Yeah I have a strange workflow. I'm working on that CPU isolation thing
>>> and I have dozens of trace_printk all over the place for tons of
>>> things. And everytime I remove one to unwind some output or to focus
>>> on another one, I often have to restore it later because I need it
>>> again. Usually I even just comment it out instead of removing it.
>>
>> What I do for my file system development work is to drop in
>> trace_printk's everywhere, since they are lightweight and don't slow
>> down the system much. Then when the system wedges, I use sysrq-z to
>> get a "flight data recorder" printout of what happened up til the
>> system hung.
>>
>> I love the fact that the ring buffer is in "overwrite" mode (aka
>> flight data recorder mode), and hope this doesn't go away.
>>
>> Per line filtering is also great, but very often when I'm interacting
>> with the block I/O layer, if something screws up, what happens is "and
>> then whole machine locks up", and sysrq-z is literally all I have.
>
> Yeah all agreed.
>
> Steve proposed to keep the current trace_printk() implementation that relies
> on ftrace but rename in into ftrace_printk(). So that we can develop a new
> trace_printk() based on trace_event interface and in the meantime keep the
> old version in case something messes up with the new thing.

Seems reasonable.

--
Darren Hart
Yocto Linux Kernel

2010-11-17 18:36:46

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Tom Zanussi ([email protected]) wrote:
[...]
> IIRC, I think the conclusion we came to was that it could be done
> mechanically if for example the right-hand-side of an assignment in
> TP_fast_assign() only involved a simple variable assignment, but as
> Steve pointed out, some assignments are more complicated than that.

Yep, we came up to the same conclusions in UST.

> For example, in the sched_switch tracepoint assignments:
>
> __entry->prev_prio = prev->prio;
> __entry->prev_state = __trace_sched_switch_state(prev);
>
> so the prev_prio should be able to be tested 'in-line' but the
> prev_state would require a temporary buffer to write the value into
> before doing the test as mentioned by Steve. At which point you're no
> further ahead (in that case) than the current situation...

if we change all assignments to, e.g.:

_tp_assign(__entry->prev_prio, prev->prio)
_tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))

then we can redefine the macros for filtering much more easily than with the
" = " assignment operator.

About your comment above, what is the problem with evaluating
"__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
the first evaluation, so I wonder if, in practice, we really save a significant
amount of cycles by saving its result between filtering and writing into trace
buffers. As I pointed out earlier, for my customers, having a very, very fast
filter "out" case is more important that trying to squeeze a few cycles out of
the filter passed case.

Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
actual function calls ? If it's mostly static inlines to dereference a few
pointers, doing it the second time when the filter passed won't hurt much.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-17 18:53:33

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 13:36 -0500, Mathieu Desnoyers wrote:
> * Tom Zanussi ([email protected]) wrote:
> [...]
> > IIRC, I think the conclusion we came to was that it could be done
> > mechanically if for example the right-hand-side of an assignment in
> > TP_fast_assign() only involved a simple variable assignment, but as
> > Steve pointed out, some assignments are more complicated than that.
>
> Yep, we came up to the same conclusions in UST.
>
> > For example, in the sched_switch tracepoint assignments:
> >
> > __entry->prev_prio = prev->prio;
> > __entry->prev_state = __trace_sched_switch_state(prev);
> >
> > so the prev_prio should be able to be tested 'in-line' but the
> > prev_state would require a temporary buffer to write the value into
> > before doing the test as mentioned by Steve. At which point you're no
> > further ahead (in that case) than the current situation...
>
> if we change all assignments to, e.g.:
>
> _tp_assign(__entry->prev_prio, prev->prio)
> _tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))
>
> then we can redefine the macros for filtering much more easily than with the
> " = " assignment operator.
>
> About your comment above, what is the problem with evaluating
> "__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
> the first evaluation, so I wonder if, in practice, we really save a significant
> amount of cycles by saving its result between filtering and writing into trace
> buffers. As I pointed out earlier, for my customers, having a very, very fast
> filter "out" case is more important that trying to squeeze a few cycles out of
> the filter passed case.
>

But the idea is to avoid allocating the trace buffer in the first place,
until we've decided we want the event. So how do you check the result
of __trace_sched_switch_state(prev) with the filter value if you don't
have it temporarily stored somewhere (not in the trace buffer, which
doesn't exist yet as far as this event is concerned)?

Tom

> Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
> actual function calls ? If it's mostly static inlines to dereference a few
> pointers, doing it the second time when the filter passed won't hurt much.
>
> Thanks,
>
> Mathieu
>

2010-11-17 19:00:41

by Bob Copeland

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Tue, Nov 16, 2010 at 10:34 PM, Steven Rostedt <[email protected]> wrote:
> (*) Currently in trace-cmd I have plugins. These plugins are simple
> functions to read events and either output better displays of the event
> or do something special (Avi wrote some for his kvm events).

We also use this for debugging some wireless drivers. The (unstable)
raw events can capture entire packets and produce psuedo-pcap files with
a plugin so we can compare what we thought went over the air versus what
actually did, and correlate them with api-level tracepoints in upper
layers. It's perhaps a bit of a bending of the infrastructure, but one
that has been helpful.

--
Bob Copeland %% http://www.bobcopeland.com

2010-11-17 19:02:26

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

* Tom Zanussi ([email protected]) wrote:
> On Wed, 2010-11-17 at 13:36 -0500, Mathieu Desnoyers wrote:
> > * Tom Zanussi ([email protected]) wrote:
> > [...]
> > > IIRC, I think the conclusion we came to was that it could be done
> > > mechanically if for example the right-hand-side of an assignment in
> > > TP_fast_assign() only involved a simple variable assignment, but as
> > > Steve pointed out, some assignments are more complicated than that.
> >
> > Yep, we came up to the same conclusions in UST.
> >
> > > For example, in the sched_switch tracepoint assignments:
> > >
> > > __entry->prev_prio = prev->prio;
> > > __entry->prev_state = __trace_sched_switch_state(prev);
> > >
> > > so the prev_prio should be able to be tested 'in-line' but the
> > > prev_state would require a temporary buffer to write the value into
> > > before doing the test as mentioned by Steve. At which point you're no
> > > further ahead (in that case) than the current situation...
> >
> > if we change all assignments to, e.g.:
> >
> > _tp_assign(__entry->prev_prio, prev->prio)
> > _tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))
> >
> > then we can redefine the macros for filtering much more easily than with the
> > " = " assignment operator.
> >
> > About your comment above, what is the problem with evaluating
> > "__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
> > the first evaluation, so I wonder if, in practice, we really save a significant
> > amount of cycles by saving its result between filtering and writing into trace
> > buffers. As I pointed out earlier, for my customers, having a very, very fast
> > filter "out" case is more important that trying to squeeze a few cycles out of
> > the filter passed case.
> >
>
> But the idea is to avoid allocating the trace buffer in the first place,
> until we've decided we want the event. So how do you check the result
> of __trace_sched_switch_state(prev) with the filter value if you don't
> have it temporarily stored somewhere (not in the trace buffer, which
> doesn't exist yet as far as this event is concerned)?

It seems I might be missing something important, but what's wrong with using
registers or the stack to hold the value for comparison ? In this case, it's a
"long", so a register seems perfectly reasonable. But again, I feel I'm missing
a key point -- what is it ?

Thanks,

Mathieu

>
> Tom
>
> > Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
> > actual function calls ? If it's mostly static inlines to dereference a few
> > pointers, doing it the second time when the filter passed won't hurt much.
> >
> > Thanks,
> >
> > Mathieu
> >
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-17 19:25:11

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 13:36 -0500, Mathieu Desnoyers wrote:
> * Tom Zanussi ([email protected]) wrote:
> [...]
> > IIRC, I think the conclusion we came to was that it could be done
> > mechanically if for example the right-hand-side of an assignment in
> > TP_fast_assign() only involved a simple variable assignment, but as
> > Steve pointed out, some assignments are more complicated than that.
>
> Yep, we came up to the same conclusions in UST.
>
> > For example, in the sched_switch tracepoint assignments:
> >
> > __entry->prev_prio = prev->prio;
> > __entry->prev_state = __trace_sched_switch_state(prev);
> >
> > so the prev_prio should be able to be tested 'in-line' but the
> > prev_state would require a temporary buffer to write the value into
> > before doing the test as mentioned by Steve. At which point you're no
> > further ahead (in that case) than the current situation...
>
> if we change all assignments to, e.g.:
>
> _tp_assign(__entry->prev_prio, prev->prio)
> _tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))

I would just call it __assign().

>
> then we can redefine the macros for filtering much more easily than with the
> " = " assignment operator.
>
> About your comment above, what is the problem with evaluating
> "__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
> the first evaluation, so I wonder if, in practice, we really save a significant
> amount of cycles by saving its result between filtering and writing into trace
> buffers. As I pointed out earlier, for my customers, having a very, very fast
> filter "out" case is more important that trying to squeeze a few cycles out of
> the filter passed case.
>
> Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
> actual function calls ? If it's mostly static inlines to dereference a few
> pointers, doing it the second time when the filter passed won't hurt much.

Yes, something like this could work. It would require a bit more CPP
magic to handle it though.

I could write something up to do this. I just have to add it to the 100
other things I'm doing at the same time :-p

-- Steve

2010-11-17 19:25:38

[permalink] [raw]

Subject: Re: [ANNOUNCE] New utility: 'trace'

On Wed, 2010-11-17 at 14:02 -0500, Mathieu Desnoyers wrote:
> * Tom Zanussi ([email protected]) wrote:
> > On Wed, 2010-11-17 at 13:36 -0500, Mathieu Desnoyers wrote:
> > > * Tom Zanussi ([email protected]) wrote:
> > > [...]
> > > > IIRC, I think the conclusion we came to was that it could be done
> > > > mechanically if for example the right-hand-side of an assignment in
> > > > TP_fast_assign() only involved a simple variable assignment, but as
> > > > Steve pointed out, some assignments are more complicated than that.
> > >
> > > Yep, we came up to the same conclusions in UST.
> > >
> > > > For example, in the sched_switch tracepoint assignments:
> > > >
> > > > __entry->prev_prio = prev->prio;
> > > > __entry->prev_state = __trace_sched_switch_state(prev);
> > > >
> > > > so the prev_prio should be able to be tested 'in-line' but the
> > > > prev_state would require a temporary buffer to write the value into
> > > > before doing the test as mentioned by Steve. At which point you're no
> > > > further ahead (in that case) than the current situation...
> > >
> > > if we change all assignments to, e.g.:
> > >
> > > _tp_assign(__entry->prev_prio, prev->prio)
> > > _tp_assign(__entry->prev_state, __trace_sched_switch_state(prev))
> > >
> > > then we can redefine the macros for filtering much more easily than with the
> > > " = " assignment operator.
> > >
> > > About your comment above, what is the problem with evaluating
> > > "__trace_sched_switch_state(prev)" twice ? It will typically be cache-hot after
> > > the first evaluation, so I wonder if, in practice, we really save a significant
> > > amount of cycles by saving its result between filtering and writing into trace
> > > buffers. As I pointed out earlier, for my customers, having a very, very fast
> > > filter "out" case is more important that trying to squeeze a few cycles out of
> > > the filter passed case.
> > >
> >
> > But the idea is to avoid allocating the trace buffer in the first place,
> > until we've decided we want the event. So how do you check the result
> > of __trace_sched_switch_state(prev) with the filter value if you don't
> > have it temporarily stored somewhere (not in the trace buffer, which
> > doesn't exist yet as far as this event is concerned)?
>
> It seems I might be missing something important, but what's wrong with using
> registers or the stack to hold the value for comparison ? In this case, it's a

Nothing, it shouldn't matter where the temporary storage is, as long as
it's not in the trace buffer.

Tom

> "long", so a register seems perfectly reasonable. But again, I feel I'm missing
> a key point -- what is it ?
>
> Thanks,
>
> Mathieu
>
> >
> > Tom
> >
> > > Also, how many of these "__trace_sched_switch_state(prev)" are static inlines vs
> > > actual function calls ? If it's mostly static inlines to dereference a few
> > > pointers, doing it the second time when the filter passed won't hurt much.
> > >
> > > Thanks,
> > >
> > > Mathieu
> > >
> >
> >
>

2010-11-17 19:40:41