2002-07-08 11:37:07

by John Levon

Subject: Enhanced profiling support (was Re: vm lock contention reduction)

[Excuse the quoting, I was out of the loop on this ...]

On Sat, Jul 06, 2002 at 08:47:54PM -0700, Andrew Morton wrote:

> Linus Torvalds wrote:
> >
> > I haven't had much time to look at the oprofile thing, but what I _have_
> > seen has made me rather unhappy (especially the horrid system call
> > tracking kludges).

It makes me very unhappy too. There are a number of horrible things
there, mostly for the sake of convenience and performance.
sys_call_table is just the most obviously foul thing. I'm glad to hear
there is interest in getting some kernel support for such things to be
done tastefully.

> > I'd rather have some generic hooks (a notion of a "profile buffer" and
> > events that cause us to have to synchronize with it, like process
> > switches, mmap/munmap - oprofile wants these too), and some generic helper
> > routines for profiling (turn any eip into a "dentry + offset" pair
> > together with ways to tag specific dentries as being "worthy" of
> > profiling).

How do you see such dentry names being exported to user-space for the
profiling daemon to access ? The current oprofile scheme is, um, less
than ideal ...

> So. John. Get coding :-)

I'm interested in doing so but I'd like to hear some more on how people
perceive this working. It essentially means a fork for a lot of the
kernel-side code, so it'd mean a lot more work for us (at least until I
can drop the 2.2/2.4 versions).

regards
john

--
"If a thing is not diminished by being shared, it is not rightly owned if
it is only owned & not shared."
- St. Augustine


2002-07-08 17:50:52

by Linus Torvalds

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)



On Mon, 8 Jul 2002, John Levon wrote:
>
> > I'd rather have some generic hooks (a notion of a "profile buffer" and
> > events that cause us to have to synchronize with it, like process
> > switches, mmap/munmap - oprofile wants these too), and some generic helper
> > routines for profiling (turn any eip into a "dentry + offset" pair
> > together with ways to tag specific dentries as being "worthy" of
> > profiling).
>
> How do you see such dentry names being exported to user-space for the
> profiling daemon to access ? The current oprofile scheme is, um, less
> than ideal ...

Ok, I'll outline my personal favourite interface, but I'd also better
point out that while I've thought a bit about what I'd like to have and
how it could be implemented in the kernel, I have _not_ actually tried any
of it out, much less thought about what the user level stuff really needs.

Anyway, here goes a straw-man:

- I'd associate each profiling event with a dentry/offset pair, simply
because that's the highest-level thing that the kernel knows about and
that is "static".

- I'd suggest that the profiler explicitly mark the dentries it wants
profiled, so that the kernel can throw away events that we're not
interested in. The marking function would return a cookie to user
space, and increment the dentry count (along with setting the
"profile" flag in the dentry)

- the "cookie" (which would most easily just be the kernel address of the
dentry) would be the thing that we give to user-space (along with
offset) on profile read. The user app can turn it back into a filename.

Whether it is the original "mark this file for profiling" phase that saves
away the cookie<->filename association, or whether we also have a system
call for "return the path of this cookie", I don't much care about.
Details, details.

Anyway, what would be the preferred interface from user level?

Linus

2002-07-08 18:51:02

by Karim Yaghmour

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)


Linus Torvalds wrote:
> On Mon, 8 Jul 2002, John Levon wrote:
> > How do you see such dentry names being exported to user-space for the
> > profiling daemon to access ? The current oprofile scheme is, um, less
> > than ideal ...
>
> Ok, I'll outline my personal favourite interface, but I'd also better
> point out that while I've thought a bit about what I'd like to have and
> how it could be implemented in the kernel, I have _not_ actually tried any
> of it out, much less thought about what the user level stuff really needs.

Sure. I've done some work on profiling using trace hooks. Hopefully the
following is useful.

> Anyway, here goes a straw-man:
>
> - I'd associate each profiling event with a dentry/offset pair, simply
> because that's the highest-level thing that the kernel knows about and
> that is "static".

dentry + offset: on a 32-bit machine, this is 8 bytes total per event being
profiled. This is a lot of information if you have high-volume
throughput. You can almost always skip the dentry, since you know about
scheduling changes and since you can catch a system-state snapshot at the
beginning of the profiling. After that, the eip is sufficient and can easily
be correlated to a meaningful entry in a file in user-space.

> - I'd suggest that the profiler explicitly mark the dentries it wants
> profiled, so that the kernel can throw away events that we're not
> interested in. The marking function would return a cookie to user
> space, and increment the dentry count (along with setting the
> "profile" flag in the dentry)

Or the kernel can completely ignore this sort of selection and leave it
all to the agent responsible for collecting the events. This is what is
done in LTT. Currently, you can select one PID, GID, or UID, but this
is easily extendable to include many. Of course, if you agree to the
task struct having a "trace" or "profile" field, then this would be
much easier.

> - the "cookie" (which would most easily just be the kernel address of the
> dentry) would be the thing that we give to user-space (along with
> offset) on profile read. The user app can turn it back into a filename.

That's the typical scheme and the one possible with the data retrieved
by LTT.

> Whether it is the original "mark this file for profiling" phase that saves
> away the cookie<->filename association, or whether we also have a system
> call for "return the path of this cookie", I don't much care about.
> Details, details.
>
> Anyway, what would be the preferred interface from user level?

The approach LTT takes is that no part in the kernel should actually care
about the user level needs. Anything in user level that has to modify
the tracing/profiling makes its requests to the trace driver, /dev/tracer.
No additional system calls, no special cases in the main kernel code. Only
3 main files:
kernel/trace.c: The main entry point for all events (trace_event())
drivers/trace/tracer.c: The trace driver
include/linux/trace.h: The trace hook definitions

Cheers,

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-07-09 16:55:23

by John Levon

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)

On Mon, Jul 08, 2002 at 10:52:36AM -0700, Linus Torvalds wrote:

> - I'd associate each profiling event with a dentry/offset pair, simply
> because that's the highest-level thing that the kernel knows about and
> that is "static".

This makes sense, I think.

> - I'd suggest that the profiler explicitly mark the dentries it wants
> profiled, so that the kernel can throw away events that we're not
> interested in. The marking function would return a cookie to user
> space, and increment the dentry count (along with setting the
> "profile" flag in the dentry)

For a system-wide profiler, this needs to be /all/ dentries that get
mapped in with executable permissions, or we lose any mappings of shared
libraries we don't know about etc. Essentially, oprofile wants samples
against any dentry that gets mmap()ed with PROT_EXEC, so this marking
would really need to happen at mmap() time. Missing out on any dentry
profiles amounts to data loss in the system profile and has the
potential to mislead.

> - the "cookie" (which would most easily just be the kernel address of the
> dentry) would be the thing that we give to user-space (along with
> offset) on profile read. The user app can turn it back into a filename.
>
> Whether it is the original "mark this file for profiling" phase that saves
> away the cookie<->filename association, or whether we also have a system
> call for "return the path of this cookie", I don't much care about.
> Details, details.
>
> Anyway, what would be the preferred interface from user level?

oprofile currently receives eip-pid pairs, along with the necessary
syscall tracing info needed to reconstruct file offsets. The above
scheme removes the dependency on the pid, but this also unfortunately
throws away some useful information.

It is often useful to be able to separate out shared-library samples on
a per-process (and/or per-application) basis. Any really useful profile
buffer facility needs to preserve this info, but just including
the raw pid isn't going to work when user-space can't reconstruct the
"name" of the pid (where "name" would be something like "/bin/bash") because
the process exited in the meantime.

The same goes for kernel samples that happen in process context.

So this might work well in tandem with some global process-tree tracing
scheme, but I don't know what form that might take ...

(Then there are kernel modules, but that's probably best served by
patching modutils)

regards
john

2002-07-09 20:20:38

by Karim Yaghmour

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)


I've been following oprofile's development for a while and was quite
happy to see a DCPI equivalent come to Linux. The LTT data engine
collection is actually inspired by the one described in the DCPI paper.

As I said earlier, much of the kernel behavior information oprofile
requires is made available by LTT. Instead of redirecting the syscall
table, for example, LTT hooks itself within the syscall path to obtain
all syscall entries and exits (all contained within #ifdef's of course).
The same goes for important events such as forks/reads/writes/scheduling
etc. The mapping of PID to the name of an executable, for instance,
is easily extracted from this information.

As with other applications/uses which require insight into the dynamics
of the kernel, it would seem to me that oprofile would greatly benefit
from the integration of the LTT patch in the mainline kernel. If
nothing else, oprofile could use the LTT collection engine to forward
its data to user-space, much like DProbes currently does.

By the same token, the LTT collection engine could replace the slew
of per-driver tracing systems already in the kernel, providing therefore
a uniform tracing system:
drivers/char/ftape/lowlevel/ftape-tracing.c
drivers/char/ip2/ip2trace.c
drivers/char/dtlk.c
drivers/char/mwavedd.h
drivers/char/n_r3964
drivers/acpi/utilities/utdebug.c
drivers/cdrom/sbpcd.c
drivers/isdn/eicon/eicon_mod.c
drivers/scsi/gdth.c
drivers/scsi/megaraid.c
drivers/scsi/qlogicfc.c
drivers/scsi/qlogicisp.c
drivers/net/wavelan.c
drivers/net/skfp/hwmtm.c
drivers/net/pcmcia/wavelan_cs.c
drivers/net/wireless/orinoco.c
drivers/net/wireless/orinoco_cs.c
drivers/video/radeonfb.c
drivers/usb/pwc.h
drivers/usb/hpusbscsi.c
include/linux/jbd.h for fs/ext3/*.c and fs/jbd/*.c
net/irda/irnet/irnet.h
etc.

The above list is but a sample of the files containing actual code
implementing tracing and/or data-collection engines. There are a great
many files that already have trace points in them. A
simple "grep -r TRACE *" provides interesting insight into the
number of subsystems requiring tracing, each implementing its own
scheme.

It is time to provide a uniform tracing and high-throughput data
collection engine for all to use. LTT has already been field-tested
for these purposes and is very easily extended to include any
additional functionality required.

Any comments/thoughts are greatly appreciated.

Cheers,

Karim

John Levon wrote:
>
> On Mon, Jul 08, 2002 at 10:52:36AM -0700, Linus Torvalds wrote:
>
> > - I'd associate each profiling event with a dentry/offset pair, simply
> > because that's the highest-level thing that the kernel knows about and
> > that is "static".
>
> This makes sense, I think.
>
> > - I'd suggest that the profiler explicitly mark the dentries it wants
> > profiled, so that the kernel can throw away events that we're not
> > interested in. The marking function would return a cookie to user
> > space, and increment the dentry count (along with setting the
> > "profile" flag in the dentry)
>
> For a system-wide profiler, this needs to be /all/ dentries that get
> mapped in with executable permissions, or we lose any mappings of shared
> libraries we don't know about etc. Essentially, oprofile wants samples
> against any dentry that gets mmap()ed with PROT_EXEC, so this marking
> would really need to happen at mmap() time. Missing out on any dentry
> profiles amounts to data loss in the system profile and has the
> potential to mislead.
>
> > - the "cookie" (which would most easily just be the kernel address of the
> > dentry) would be the thing that we give to user-space (along with
> > offset) on profile read. The user app can turn it back into a filename.
> >
> > Whether it is the original "mark this file for profiling" phase that saves
> > away the cookie<->filename association, or whether we also have a system
> > call for "return the path of this cookie", I don't much care about.
> > Details, details.
> >
> > Anyway, what would be the preferred interface from user level?
>
> oprofile currently receives eip-pid pairs, along with the necessary
> syscall tracing info needed to reconstruct file offsets. The above
> scheme removes the dependency on the pid, but this also unfortunately
> throws away some useful information.
>
> It is often useful to be able to separate out shared-library samples on
> a per-process (and/or per-application) basis. Any really useful profile
> buffer facility really needs to preserve this info, but just including
> the raw pid isn't going to work when user-space can't reconstruct the
> "name" of the pid (where "name" would be something "/bin/bash") because
> the process exited in the meantime.
>
> The same goes for kernel samples that happen in process context.
>
> So this might work well in tandem with some global process-tree tracing
> scheme, but I don't know what form that might take ...
>
> (Then there are kernel modules, but that's probably best served by
> patching modutils)
>
> regards
> john

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-07-10 02:19:35

by John Levon

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)

On Mon, Jul 08, 2002 at 02:41:00PM -0400, Karim Yaghmour wrote:

> dentry + offset: on a 32bit machine, this is 8 bytes total per event being
> profiled. This is a lot of information if you have high-volume
> throughput.

I haven't found that to be significant in profiling overhead, mainly
because the hash table removes some of the "sting" of high sampling
rates (and the interrupt handler dwarfs all other aspects). The
situation is probably different for more general tracing purposes, but
I'm dubious as to the utility of a general tracing mechanism.

(besides, a profile buffer needs a sample context value too, for things
like CPU number and perfctr event number).

> You can almost always skip the dentry since you know scheduling
> changes and since you can catch a system-state snapshot at the begining of
> the profiling. After that, the eip is sufficient and can easily be correlated
> to a meaningfull entry in a file in user-space.

But as I point out in my other post, dentry-offset alone is not as
useful as it could be...

I just don't see a really good reason to introduce insidious tracing
throughout. Both tracing and profiling are ugly ugly things to be doing
by their very nature, and I'd much prefer to keep such intrusions to a
bare minimum.

The entry.S examine-the-registers approach is simple enough, but it's
not much more tasteful than sys_call_table hackery IMHO

regards
john

--
"I know I believe in nothing but it is my nothing"
- Manic Street Preachers

2002-07-10 04:12:15

by Karim Yaghmour

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)


John Levon wrote:
> I'm dubious as to the utility of a general tracing mechanism.
...
> I just don't see a really good reason to introduce insidious tracing
> throughout. Both tracing and profiling are ugly ugly things to be doing
> by their very nature, and I'd much prefer to keep such intrusions to a
> bare minimum.

Tracing is essential for an entire category of problems which can only
be solved by obtaining the raw data which describes the dynamic behavior
of the kernel.

Have you ever tried to solve an inter-process synchronization problem
using strace or gdb? In reality, only tracing built into the kernel can
enable a developer to solve such problems.

Have you ever tried to follow the exact reaction applications have
to kernel input? It took lots of ad-hoc experimentation to isolate the
thundering herd problem. Tracing would have shown this immediately.

Can you list the exact sequence of processes that are scheduled
in reaction to input when you press the keyboard while running a
terminal in X? This is out of reach of most user-space programmers but
a trace shows this quite nicely.

Ever had a box saturated with IRQs and still showing 0% CPU usage? This
problem has been reported time and again. Lately someone was asking
about a utility which soaks up CPU cycles to show this sort of
situation. Once more, tracing shows this right away ... without soaking
up CPU cycles.

Ever tried to get the exact time spent by an application in user-space
vs. kernel space? Even better, can you tell the actual syscall which
cost the most in kernel time? You can indeed get closer using random sampling,
but it's just one more thing tracing gives you without any difficulty.

And the list goes on.

The fact that so many kernel subsystems already have their own tracing
built-in (see other posting) clearly shows that there is a fundamental
need for such a utility even for driver developers. If many driver
developers can't develop drivers adequately without tracing, can we
expect user-space developers to efficiently use the kernel if they have
absolutely no idea about the dynamic interaction their processes have
with the kernel and how this interaction is influenced by and influences
the interaction with other processes?

> The entry.S examine-the-registers approach is simple enough, but it's
> not much more tasteful than sys_call_table hackery IMHO

I guess we won't agree on this. From my point of view it is much better
to have the code directly within entry.S for all to see instead of
having some external software play around with the syscall table in a
way kernel users can't trace back to the kernel's own code.

Cheers,

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-07-10 04:36:06

by John Levon

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)

On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote:

[snip]

> And the list goes on.

Sure, there are all sorts of things where some tracing can come in
useful. The question is whether it's really something the mainline
kernel should be doing, and if the gung-ho approach is nice or not.

> The fact that so many kernel subsystems already have their own tracing
> built-in (see other posting)

Your list was almost entirely composed of per-driver debug routines.
This is not the same thing as logging trap entry/exits, syscalls, etc.,
on any level, and I'm a bit perplexed that you're making such an
association.

> expect user-space developers to efficiently use the kernel if they have
> absolutely no idea about the dynamic interaction their processes have
> with the kernel and how this interaction is influenced by and influences
> the interaction with other processes?

This is clearly an exaggeration. And seeing as something like LTT
doesn't (and cannot) tell the "whole story" either, I could throw the
same argument directly back at you. The point is, there comes a point of
no return where usefulness gets outweighed by ugliness. For the very few
cases that such detailed information is really useful, the user can
usually install the needed special-case tools.

In contrast a profiling mechanism that improves on the poor lot that
currently exists (gprof, readprofile) has a truly general utility, and
can hopefully be done without too much ugliness.

The primary reason I want to see something like this is to kill the ugly
code I have to maintain.

> > The entry.S examine-the-registers approach is simple enough, but it's
> > not much more tasteful than sys_call_table hackery IMHO
>
> I guess we won't agree on this. From my point of view it is much better
> to have the code directly within entry.S for all to see instead of
> having some external software play around with the syscall table in a
> way kernel users can't trace back to the kernel's own code.

Eh ? I didn't say sys_call_table hackery was better. I said the entry.S
thing wasn't much better ...

regards
john

2002-07-10 05:42:27

by Karim Yaghmour

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)


John Levon wrote:
> > And the list goes on.
>
> Sure, there are all sorts of things where some tracing can come in
> useful. The question is whether it's really something the mainline
> kernel should be doing, and if the gung-ho approach is nice or not.

I'm not sure how to read "gung-ho approach" but I would point out that
LTT wasn't built overnight. It's been out there for 3 years now.

As for whether the mainline kernel should have it or not, I
think the standard has always been set by the actual purpose:
is this useful to users, or is it only useful to kernel developers?
Whenever it was the latter, the answer has almost always been NO.
In the case of tracing I think that not only is it "useful" to users,
but I would say "essential". Would you ask a user to recompile his
kernel with a ptrace patch because he needs to use gdb? I don't
think so, and I don't think application developers or system
administrators should have to recompile their kernels either in
order to understand the system's dynamics.

If facts are of interest, then I would point out that LTT is already
part of a number of distributions out there. Most other distros that
don't have it included find it very useful but don't want to get
involved in maintaining it until it's part of the kernel.

> > The fact that so many kernel subsystems already have their own tracing
> > built-in (see other posting)
>
> Your list was almost entirely composed of per-driver debug routines.

I don't see any contradiction here. This is part of what I'm pointing out:
namely, that understanding the dynamic behavior of the system is essential
to software development, especially when complex interactions, such as
those in the Linux kernel, take place.

> This is not the same thing as logging trap entry/exits, syscalls etc
> etc, on any level, and I'm a bit perplexed that you're making such an
> assocation.

In regards to trap entry/exits, there was a talk a couple of years ago
by Nat Friedman, I think, which discussed GNU rope. Basically, it
identified the location of page fault misses at the start of programs
and used this information to reorder the binary in order to accelerate
its load time. That's just one example where traps are of interest.

Not to mention that traps can result in scheduling changes and that
knowing why a process has been scheduled out is all part of understanding
the system's dynamics.

As for syscalls, oprofile, for one, really needs this sort of info. So
I don't see your point.

There are 2 things LTT provides in the kernel:
1- A unified tracing and high-throughput data-collection system
2- A select list of trace points in the kernel

Item #1 can easily replace the existing ad-hoc implementation while
serving oprofile, DProbes and others. Item #2 is of prime interest
to application developers and system administrators.

> > expect user-space developers to efficiently use the kernel if they have
> > absolutely no idea about the dynamic interaction their processes have
> > with the kernel and how this interaction is influenced by and influences
> > the interaction with other processes?
>
> This is clearly an exaggeration. And seeing as something like LTT
> doesn't (and cannot) tell the "whole story" either, I could throw the
> same argument directly back at you. The point is, there comes a point of
> no return where usefulness gets outweighed by ugliness. For the very few
> cases that such detailed information is really useful, the user can
> usually install the needed special-case tools.

I'd really be interested in seeing what you mean by "ugliness". If there's
a list of grievances you have with LTT then I'm all ears.

Anything inserted by LTT is clean and clear. It doesn't change the
kernel's normal behavior, and once a trace point is inserted it
requires almost zero maintenance. Take for example the scheduling-change
trace point (kernel/sched.c:schedule()):
TRACE_SCHEDCHANGE(prev, next);

I don't see why this should be ugly. It's an inline that results in
zero lines of code if you configure tracing as off.

The cases I presented earlier clearly show the usefulness of this
information. The developer who needs to solve his synchronization
problem or the system administrator who wants to understand why his
box is so slow doesn't really want to patch/recompile/reboot to fix
his problem.

You would like to paint these as "very few cases". Unfortunately
these cases are much more common than you say they are.

> In contrast a profiling mechanism that improves on the poor lot that
> currently exists (gprof, readprofile) has a truly general utility, and
> can hopefully be done without too much ugliness.

Somehow it is not justifiable to add a feature the like of which didn't
exist before, but it is justifiable to add a feature which only "improves"
on existing tools?

As I said before, LTT and the information it provides has a truly
general utility: Oprofile can use it, DProbes uses it, a slew of existing
ad-hoc tracing systems in the kernel can be replaced by it, application
developers can isolate synchronization problems with it, system
administrators can identify performance bottlenecks with it, etc.

An argument over which is more useful between LTT and oprofile is bound
to be useless. If nothing else, I see LTT as having a more general use
than oprofile. But let's not get into that. What I'm really interested in
here is the possibility of having one unified tracing and data collection
system which serves many purposes instead of having each subsystem or
profiler have its own tracing and data collection mechanism.

> The primary reason I want to see something like this is to kill the ugly
> code I have to maintain.

I can't say that my goals are any different.

> > > The entry.S examine-the-registers approach is simple enough, but it's
> > > not much more tasteful than sys_call_table hackery IMHO
> >
> > I guess we won't agree on this. From my point of view it is much better
> > to have the code directly within entry.S for all to see instead of
> > having some external software play around with the syscall table in a
> > way kernel users can't trace back to the kernel's own code.
>
> Eh ? I didn't say sys_call_table hackery was better. I said the entry.S
> thing wasn't much better ...

I know you weren't saying that. I said that in _my point of view_ the entry.S
"thing" is much better.

Cheers,

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-07-10 13:08:40

by Robert Wisniewski

Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)

John,
I have been cc'ed on this email as I was an active participant at the
RAS performance monitoring/tracing session at OLS. Let me preface by
saying that my view may be a bit biased, as I have worked on the core
tracing infrastructure that went into IRIX in the mid '90s as well as the
tracing infrastructure for K42, our research OS (see
http://www.research.ibm.com/K42/index.html), and in both cases helped solve
performance issues that would not have been otherwise solved. From the
below it doesn't appear that anyone is arguing that tracing is not useful.
The debate (except for some of the details) appears over whether it should
be included in the kernel in a first-class manner or individual mechanisms
put in on an ad-hoc basis. While this is indeed philosophical, let me
share some experiences and benefits from other systems I've worked on:
1) The mechanism proposed is very non-invasive: a single line of
code (some TRACE_XXX macro or the like) is added to the area of interest.
Further, at OLS, the proposal was to add only a few trace points.
Programming-wise this does not clutter the code - in fact, having a single
well-known unified mechanism is cleaner coding than a set of one-off
ways, as when anyone sees a trace macro it will be clear what it is.
2) In the end, there will be less intrusion with a single unified
approach. With a series of different mechanisms, over time multiple events
will get added in the same place, creating performance issues and, more
importantly, causing confusion.
3) A unified approach will uncover performance issues not explicitly being
searched for, and allow ones of known interest to be tracked down without
adding a patch (that may be cumbersome to maintain) and re-compiling.
4) In both my experiences, I have had resistance to adding this
tracing infrastructure, and in both experiences other kernel developers
have come back after the fact and thanked me for my persistence :-), as it
helped them solve timing sensitive or other such issues they were having
great difficulty with.

If there is interest, I would be happy to set up a conference number so
people who are interested could all speak.

-bob

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

----

John Levon writes:
> On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote:
>
> [snip]
>
> > And the list goes on.
>
> Sure, there are all sorts of things where some tracing can come in
> useful. The question is whether it's really something the mainline
> kernel should be doing, and if the gung-ho approach is nice or not.
>
> > The fact that so many kernel subsystems already have their own tracing
> > built-in (see other posting)
>
> Your list was almost entirely composed of per-driver debug routines.
> This is not the same thing as logging trap entry/exits, syscalls, etc.,
> on any level, and I'm a bit perplexed that you're making such an
> association.
>
> > expect user-space developers to efficiently use the kernel if they have
> > absolutely no idea about the dynamic interaction their processes have
> > with the kernel and how this interaction is influenced by and influences
> > the interaction with other processes?
>
> This is clearly an exaggeration. And seeing as something like LTT
> doesn't (and cannot) tell the "whole story" either, I could throw the
> same argument directly back at you. The point is, there comes a point of
> no return where usefulness gets outweighed by ugliness. For the very few
> cases that such detailed information is really useful, the user can
> usually install the needed special-case tools.
>
> In contrast a profiling mechanism that improves on the poor lot that
> currently exists (gprof, readprofile) has a truly general utility, and
> can hopefully be done without too much ugliness.
>
> The primary reason I want to see something like this is to kill the ugly
> code I have to maintain.
>
> > The entry.S examine-the-registers approach is simple enough, but it's
> > not much more tasteful than sys_call_table hackery IMHO
>
> > I guess we won't agree on this. From my point of view it is much better
> > to have the code directly within entry.S for all to see instead of
> > having some external software play around with the syscall table in a
> > way kernel users can't trace back to the kernel's own code.
>
> Eh ? I didn't say sys_call_table hackery was better. I said the entry.S
> thing wasn't much better ...
>
> regards
> john
>

2002-07-10 14:32:40

by Richard J Moore

[permalink] [raw]
Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)


>Sure, there are all sorts of things where some tracing can come in
>useful. The question is whether it's really something the mainline
>kernel should be doing, and if the gung-ho approach is nice or not.
>
>> The fact that so many kernel subsystems already have their own tracing
>> built-in (see other posting)
>
>Your list was almost entirely composed of per-driver debug routines.
>This is not the same thing as logging trap entry/exits, syscalls etc
>etc, on any level, and I'm a bit perplexed that you're making such an
>association.

There's a balance to be struck with tracing. First we should point out that
the recording mechanism doesn't have to intrude within the kernel unless you
want init-time tracing. The bigger point of contention seems to be that of
instrumentation. Yes, it is very ugly to have thousands of trace points
littering the source. On the other hand, for basic serviceability a minimal
set should be present in a production system - these would typically allow
the external interface of any component to be traced. For low-level
tracing - i.e. internal routines etc. - dynamic tracing can be used. This
requires no modification to the source: the tracepoint is implemented
dynamically in executing code. DProbes+LTT provides this capability.

Some level of tracing (along with other complementary PD tools e.g. crash
dump) needs to be readily available to deal with those types of problem we
see with mature systems employed in the production environment. Typically
such problems are not readily recreatable, nor even predictable. I've often
had to solve problems which impact a business environment severely, where
one server out of 2000 gets hit each day, but it's a different one each day.
It's under those circumstances that trace, along with other automated
data-capturing problem-determination tools, becomes invaluable. And it's a
fact of life that only those types of difficult problem remain once we've
beaten a system to death in development and test. Being able to use a
common set of tools, whatever the components under investigation, greatly
eases problem determination. This is especially so where you have the
ability to use DProbes with LTT to provide ad hoc tracepoints that were not
originally included by the developers.



Richard J Moore CEng, MIEE, Consulting IT Specialist, TSM
RAS Project Lead - Linux Technology Centre (ATS-PIC).
http://oss.software.ibm.com/developerworks/opensource/linux
Office: (+44) (0)1962-817072, Mobile: (+44) (0)7768-298183
IBM UK Ltd, MP135 Galileo Centre, Hursley Park, Winchester, SO21 2JN, UK
The IBM Academy will hold a Conference on Performance Engineering in
Toronto July 8-10. A High Availability Conference follows July 10-12.
Details on http://w3.ibm.com/academy/



From: John Levon <movement@marcelothewonderpenguin.com>
Sent by: John Levon <[email protected]>
Date: 10/07/2002 00:38
To: Karim Yaghmour <[email protected]>
cc: Linus Torvalds <[email protected]>, Andrew Morton <[email protected]>,
    Andrea Arcangeli <[email protected]>, Rik van Riel <[email protected]>,
    "[email protected]" <[email protected]>, [email protected],
    [email protected], Richard J Moore/UK/IBM@IBMGB, bob <[email protected]>
Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)
Please respond to John Levon





On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote:

[snip]

> And the list goes on.

Sure, there are all sorts of things where some tracing can come in
useful. The question is whether it's really something the mainline
kernel should be doing, and if the gung-ho approach is nice or not.

> The fact that so many kernel subsystems already have their own tracing
> built-in (see other posting)

Your list was almost entirely composed of per-driver debug routines.
This is not the same thing as logging trap entry/exits, syscalls etc
etc, on any level, and I'm a bit perplexed that you're making such an
association.

> expect user-space developers to efficiently use the kernel if they have
> absolutely no idea about the dynamic interaction their processes have
> with the kernel and how this interaction is influenced by and influences
> the interaction with other processes?

This is clearly an exaggeration. And seeing as something like LTT
doesn't (and cannot) tell the "whole story" either, I could throw the
same argument directly back at you. The point is, there comes a point of
no return where usefulness gets outweighed by ugliness. For the very few
cases that such detailed information is really useful, the user can
usually install the needed special-case tools.

In contrast a profiling mechanism that improves on the poor lot that
currently exists (gprof, readprofile) has a truly general utility, and
can hopefully be done without too much ugliness.

The primary reason I want to see something like this is to kill the ugly
code I have to maintain.

> > The entry.S examine-the-registers approach is simple enough, but it's
> > not much more tasteful than sys_call_table hackery IMHO
>
> I guess we won't agree on this. From my point of view it is much better
> to have the code directly within entry.S for all to see instead of
> having some external software play around with the syscall table in a
> way kernel users can't trace back to the kernel's own code.

Eh ? I didn't say sys_call_table hackery was better. I said the entry.S
thing wasn't much better ...

regards
john





2002-07-10 20:27:04

by Karim Yaghmour

[permalink] [raw]
Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)


Richard J Moore wrote:
> Some level of tracing (along with other complementary PD tools e.g. crash
> dump) needs to be readiliy available to deal with those types of problem we
> see with mature systems employed in the production environment. Typically
> such problems are not readily recreatable nor even prictable. I've often
> had to solve problems which impact a business environment severely, where
> one server out of 2000 gets hit each day, but its a different one each day.
> Its under those circumstances that trace along without other automated data
> capturing problem determination tools become invaluable. And its a fact of
> life that only those types of difficult problem remain once we've beaten a
> system to death in developments and test. Being able to use a common set of
> tools whatever the componets under investigation greatly eases problem
> determination. This is especially so where you have the ability to use
> dprobes with LTT to provide ad hoc tracepoints that were not originally
> included by the developers.

I definitely agree.

One case which perfectly illustrates how extreme these situations can be is
the Mars Pathfinder. The folks at the Jet Propulsion Lab used a tracing tool
very similar to LTT to locate the priority inversion problem the Pathfinder
had while it was on Mars.

The full account makes for an interesting read (sorry that the link is on
MS's website, but its author works for MS Research ...):
http://research.microsoft.com/research/os/mbj/Mars_Pathfinder/Authoritative_Account.html

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-07-10 21:40:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)

On Wed, Jul 10, 2002 at 04:30:42PM -0400, Karim Yaghmour wrote:
>
> Richard J Moore wrote:
> > Some level of tracing (along with other complementary PD tools e.g. crash
> > dump) needs to be readily available to deal with those types of problem we
> > see with mature systems employed in the production environment. Typically
> > such problems are not readily recreatable, nor even predictable. I've often
> > had to solve problems which impact a business environment severely, where
> > one server out of 2000 gets hit each day, but it's a different one each day.
> > It's under those circumstances that trace, along with other automated
> > data-capturing problem-determination tools, becomes invaluable. And it's a
> > fact of life that only those types of difficult problem remain once we've
> > beaten a system to death in development and test. Being able to use a
> > common set of tools, whatever the components under investigation, greatly
> > eases problem determination. This is especially so where you have the
> > ability to use DProbes with LTT to provide ad hoc tracepoints that were not
> > originally included by the developers.
>
> I definitely agree.
>
> One case which perfectly illustrates how extreme these situations can be is
> the Mars Pathfinder. The folks at the Jet Propulsion Lab used a tracing tool
> very similar to LTT to locate the priority inversion problem the Pathfinder
> had while it was on Mars.

btw, on the topic: with our semaphores there's no way to handle priority
inversion with SCHED_RR tasks; if there's more than one task running at
RT priority we may run into starvation of RT tasks too, in the same way.

No starvation can happen, of course, if all tasks in the system belong
to the same scheduler policy (nice levels can have effects, but they're
not indefinite delays).

The fix Ingo used for SCHED_IDLE is to have a special call to the
scheduler while returning to userspace, since that is the only place
where we know the kernel isn't holding any lock. But while that will
only generate some minor unexpected cpu load with SCHED_IDLE,
generalizing the hack so that all tasks scheduling inside the kernel run
with RT priority isn't going to provide nice/fair behaviour (some tasks
could in fact run way too much if they're very system-hungry, in
particular with -preempt, which could again generate starvation of
userspace, even if no longer because of kernel locks). Maybe I'm
overlooking something simple, but I guess it's not going to be easy to
fix; for the semaphores it isn't too bad - they could learn how to raise
the priority of a special holder when needed - but for any
semaphore-by-hand (for example, spinlock based) it would require some
major auditing.

Andrea

2002-07-11 04:44:01

by Karim Yaghmour

[permalink] [raw]
Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)


Andrea Arcangeli wrote:
> btw, on the topic: with our semaphores there's no way to handle priority
> inversion with SCHED_RR tasks; if there's more than one task running at
> RT priority we may run into starvation of RT tasks too, in the same way.
>
> No starvation can happen, of course, if all tasks in the system belong
> to the same scheduler policy (nice levels can have effects, but they're
> not indefinite delays).
>
> The fix Ingo used for SCHED_IDLE is to have a special call to the
> scheduler while returning to userspace, since that is the only place
> where we know the kernel isn't holding any lock. But while that will
> only generate some minor unexpected cpu load with SCHED_IDLE,
> generalizing the hack so that all tasks scheduling inside the kernel run
> with RT priority isn't going to provide nice/fair behaviour (some tasks
> could in fact run way too much if they're very system-hungry, in
> particular with -preempt, which could again generate starvation of
> userspace, even if no longer because of kernel locks). Maybe I'm
> overlooking something simple, but I guess it's not going to be easy to
> fix; for the semaphores it isn't too bad - they could learn how to raise
> the priority of a special holder when needed - but for any
> semaphore-by-hand (for example, spinlock based) it would require some
> major auditing.

I wasn't aware of this particular problem, but I certainly can see LTT
as being helpful in trying to understand the actual interactions. The
custom event creation API is rather simple to use if you would like
to instrument some of the semaphore code (the Sys V IPC code is already
instrumented at a basic level):
trace_create_event()
trace_create_owned_event()
trace_destroy_event()
trace_std_formatted_event()
trace_raw_event()

Here's the complete description:
int trace_create_event(char* pm_event_type,
                       char* pm_event_desc,
                       int pm_format_type,
                       char* pm_format_data)

        pm_event_type is a short string describing the type of event.
        pm_event_desc is required if you are using
        trace_std_formatted_event(); more on this below.
        pm_format_type and pm_format_data are required if you are
        using trace_raw_event(). The function returns a unique
        custom event ID.

int trace_create_owned_event(char* pm_event_type,
                             char* pm_event_desc,
                             int pm_format_type,
                             char* pm_format_data,
                             pid_t pm_owner_pid)

        Same as trace_create_event(), except that all events created
        using this call will be deleted once the process with pm_owner_pid
        exits. Not really useful in kernel space, but essential for
        providing user-space events.

void trace_destroy_event(int pm_event_id)

        Destroys the event with the given pm_event_id.

int trace_std_formatted_event(int pm_event_id, ...)

        Instead of having a slew of printk's all around the code,
        pm_event_desc is filled with a printk-like format string at
        event creation, and the actual params used to fill this
        string are passed to trace_std_formatted_event(). Be aware
        that this function uses va_start/vsprintf/va_end. The
        resulting string is logged in the trace as such and is visible
        in the trace much like a printk in a log - except that you can
        place this in paths where printk can't go, and you can be sure
        that events logged with this are delivered even in high
        throughput situations (granted the trace buffer size is adequate).

int trace_raw_event(int pm_event_id,
                    int pm_event_size,
                    void* pm_event_data)

        This is the easiest way to log tons of binary data. Just give it
        the size of the data (pm_event_size) and a pointer to it
        (pm_event_data) and it gets written to the trace. This is what
        DProbes uses. The binary data is then formatted in userspace.

All events logged using this API can easily be extracted using LibLTT's
API. There is no requirement to use LTT's own visualization tool,
although you can still use it to see your own custom events.

As with the other traced events, each custom event gets timestamped
(do_gettimeofday) and placed in order of occurrence in the trace buffer.

I've attached an example module that uses the custom event API. See
the Examples directory of the LTT package for an example custom trace
reader which uses LibLTT.

One clear advantage of this API is that you can avoid using any
"#ifdef TRACE" or "#ifdef DEBUG". If the tracing module isn't loaded
or if the trace daemon isn't running, then nothing gets logged.

Cheers,

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================


Attachments:
custom1.c (2.23 kB)

2002-07-11 04:56:18

by Karim Yaghmour

[permalink] [raw]
Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction)

/* Example usage of custom event API */

#define MODULE

#if 0
#define CONFIG_TRACE
#endif

#include <linux/config.h>
#include <linux/module.h>
#include <linux/trace.h>
#include <asm/string.h>

struct delta_event
{
    int an_int;
    char a_char;
};

static int alpha_id, omega_id, theta_id, delta_id, rho_id;

int init_module(void)
{
    uint8_t a_byte;
    char a_char;
    int an_int;
    int a_hex;
    char* a_string = "We are initializing the module";
    struct delta_event a_delta_event;

    /* Create events */
    alpha_id = trace_create_event("Alpha",
                                  "Number %d, String %s, Hex %08X",
                                  CUSTOM_EVENT_FORMAT_TYPE_STR,
                                  NULL);
    omega_id = trace_create_event("Omega",
                                  "Number %d, Char %c",
                                  CUSTOM_EVENT_FORMAT_TYPE_STR,
                                  NULL);
    theta_id = trace_create_event("Theta",
                                  "Plain string",
                                  CUSTOM_EVENT_FORMAT_TYPE_STR,
                                  NULL);
    delta_id = trace_create_event("Delta",
                                  NULL,
                                  CUSTOM_EVENT_FORMAT_TYPE_HEX,
                                  NULL);
    rho_id = trace_create_event("Rho",
                                NULL,
                                CUSTOM_EVENT_FORMAT_TYPE_HEX,
                                NULL);

    /* Trace events */
    an_int = 1;
    a_hex = 0xFFFFAAAA;
    trace_std_formatted_event(alpha_id, an_int, a_string, a_hex);
    an_int = 25;
    a_char = 'c';
    trace_std_formatted_event(omega_id, an_int, a_char);
    trace_std_formatted_event(theta_id);
    memset(&a_delta_event, 0, sizeof(a_delta_event));
    trace_raw_event(delta_id, sizeof(a_delta_event), &a_delta_event);
    a_byte = 0x12;
    trace_raw_event(rho_id, sizeof(a_byte), &a_byte);

    return 0;
}

void cleanup_module(void)
{
    uint8_t a_byte;
    char a_char;
    int an_int;
    int a_hex;
    char* a_string = "We are initializing the module";
    struct delta_event a_delta_event;

    /* Trace events */
    an_int = 324;
    a_hex = 0xABCDEF10;
    trace_std_formatted_event(alpha_id, an_int, a_string, a_hex);
    an_int = 789;
    a_char = 's';
    trace_std_formatted_event(omega_id, an_int, a_char);
    trace_std_formatted_event(theta_id);
    memset(&a_delta_event, 0xFF, sizeof(a_delta_event));
    trace_raw_event(delta_id, sizeof(a_delta_event), &a_delta_event);
    a_byte = 0xA4;
    trace_raw_event(rho_id, sizeof(a_byte), &a_byte);

    /* Destroy the events created */
    trace_destroy_event(alpha_id);
    trace_destroy_event(omega_id);
    trace_destroy_event(theta_id);
    trace_destroy_event(delta_id);
    trace_destroy_event(rho_id);
}


Attachments:
example.c (2.27 kB)