LinuxLists.cc - [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> D: The core tracing infrastructure serves as the main rallying point for
> D: all the tracing activity in the kernel. (Tracing here isn't meant in
> D: the ptrace sense, but in the sense of recording key kernel events along
> D: with a time-stamp in order to reconstruct the system's behavior post-
> D: mortem.) Whether the trace driver (which buffers the data collected
> D: and provides it to the user-space trace daemon via a char dev) is loaded
> D: or not, the kernel sees a unique tracing function: trace_event().
> D: Basically, this provides a trace driver register/unregister service.
> D: When a trace driver registers, it is forwarded all the events generated
> D: by the kernel. If no trace driver is registered, then the events go
> D: nowhere.

my problem with this stuff is conceptual: it introduces a constant drag on
the kernel sourcecode, while 99% of development will not want to trace,
ever. When i do need tracing occasionally, then i take those 30 minutes to
write up a tracer from pre-existing tracing patches, tailored to specific
problems. Eg. for the scheduler i wrote a simple tracer, but the rate of
trace points that started to make sense for me from a development and
debugging POV also made kernel/sched.c butt-ugly and unmaintainable, so i
always kept the tracer separate and did the hacking in the untainted code.

also, the direction things are taking these days seems to be towards
hardware-assisted tracing. Ie. on the P4 we can recover a trace of EIPs
traversed by the CPU recently. Stuff like this is powerful and can
debug bugs that cannot be debugged via software. I've seen and debugged
dozens of subtle bugs that went away if a software-tracer was enabled, in
fact i debugged at least 3 scheduler bugs which triggered on the removal
of a specific trace point. Sw-tracing, and especially the kind of
intrusive stuff you are doing has its limitations and side-effects. It's
also something that comes from the closed-source world, there kernels must
have tracing APIs because otherwise debugging drivers and subsystems would
be much easier. It does have its uses, no doubt, but usually we apply
things to the kernel that have either a positive, or at worst, a neutral
impact on the kernel proper - kernel tracing clearly is not such a
feature.

so use the power of the GPL-ed kernel and keep your patches separate,
releasing them for specific stable kernel branches (or even development
kernels). If anything then i'm biased towards tracer code, eg. i wrote the
first versions of ktrace (source-unintrusive tracer) and iotrace
(source-intrusive tracer), and i for one do not want to have *any* trace
points in any of the code i hack on a daily basis. This stuff must stay
separate.

Ingo

2002-09-22 10:37:41

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

> [...] It's also something that comes from the closed-source world, there
> kernels must have tracing APIs because otherwise debugging drivers and
> subsystems would be much easier. [...]
^------harder

2002-09-22 17:21:36

by Roman Zippel

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Hi,

On Sun, 22 Sep 2002, Ingo Molnar wrote:

> Eg. for the scheduler i wrote a simple tracer, but the rate of
> trace points that started to make sense for me from a development and
> debugging POV also made kernel/sched.c butt-ugly and unmaintainable, so i
> always kept the tracer separate and did the hacking in the untainted code.
>
> also, the direction things are taking these days seems to be towards
> hardware-assisted tracing. Ie. on the P4 we can recover a trace of EIPs
> traversed by the CPU recently. Stuff like this is powerful and can
> debug bugs that cannot be debugged via software.

To summarize: You find tracing useful, but software tracing is only of
limited value in areas you're working at.
What about other developers, who only want to develop a simple driver,
without having to understand the whole kernel? Traces still work where
printk() or kgdb don't work. I think it's reasonable to ask a user to
enable tracing and reproduce the problem, which you can't reproduce
yourself.

> It does have its uses, no doubt, but usually we apply
> things to the kernel that have either a positive, or at worst, a neutral
> impact on the kernel proper - kernel tracing clearly is not such a
> feature.

Last time I checked it has no impact on the kernel as long as it's not
enabled. Anyway, it would already be very useful to have at least the core
integrated. How many drivers currently define a "dprint"? Some even
implement their own tracing. While debug prints are mostly useful during
early development, they are usually completely useless, when you have to
reproduce a problem.

> so use the power of the GPL-ed kernel and keep your patches separate,
> releasing them for specific stable kernel branches (or even development
> kernels).

While I agree that this is an acceptable approach for things like kgdb, I think
it would be very useful to have at least the tracing core in the kernel.

bye, Roman

2002-09-22 18:29:48

by Linus Torvalds

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, Roman Zippel wrote:
>
> To summarize: You find tracing useful, but software tracing is only of
> limited value in areas you're working at.
>
> What about other developers, who only want to develop a simple driver,
> without having to understand the whole kernel? Traces still work where
> printk() or kgdb don't work. I think it's reasonable to ask a user to
> enable tracing and reproduce the problem, which you can't reproduce
> yourself.

That makes adding source bloat ok? I've debugged some drivers with
dprintk() style tracing, and it often makes the code harder to follow,
even if it ends up being compiled away.

From what I've seen from the LTT thing, it's too heavy-weight to be good
for many things (taking SMP-global locks for trace events is _not_ a good
idea if the trace is for doing things like doing performance tracing,
where a tracer that adds synchronization fundamentally _changes_ what is
going on in ways that have nothing to do with timing).

I suspect we'll want to have some form of event tracing eventually, but
I'm personally pretty convinced that it needs to be a per-CPU thing, and
the core mechanism would need to be very lightweight. It's easier to build
up complexity on top of a lightweight interface than it is to make a
lightweight interface out of a heavy one.

Linus

2002-09-22 18:58:21

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Hello Ingo,

Thanks for taking the time to look at this.

Ingo Molnar wrote:
> my problem with this stuff is conceptual: it introduces a constant drag on
> the kernel sourcecode, while 99% of development will not want to trace,
> ever.

It seems my description was misleading. So here's the skinny: LTT's main
purpose is to enable users and developers to observe the system's dynamics
in order to retrieve exact information regarding the behavior of the
entire system WITHOUT modifying the system's behavior or degrading the
system's performance. In turn, this can be used for identifying
synchronization and performance problems. In doing so, however, the services
implemented by LTT in the kernel happen to be quite useful to many other
kernel subsystems and device drivers since they too occasionally need
tracing.

Here are some actual practical cases:
- How do you debug process synchronization problems in user-space? You
can't use anything that calls on ptrace() since it modifies the
processes' behavior and you can't use printf's for anything the least
bit complicated. The only way you can do this is if you use a tracing
tool such as LTT that enables you to see which services were called,
what happened as a consequence of the processes' requests, and where
the synchronization failed.
- How do you measure the exact time processes spend in kernel space,
identify why they spend it there, which processes they had to wait
for, etc.?
- How do you measure the exact time it takes for an interrupt's
effects to propagate through the entire system? As a simple example, say
you want to follow the exact sequence of processes that run from the
moment you press a key on the keyboard until a character shows up in
the command terminal in X. LTT will show this quite easily.
- Take tools like oprofile and syscalltrack which need the same
information available through the trace points added by LTT. Instead
of diverting the system call table, as they currently do, they could
retrieve the information they need easily from LTT without using
clean interfaces and no table redirection.
- Say you have thousands of servers in an installation and one of them
has some sporadic problem. How are you going to debug this system?
Should the sysadmin be expected to download the kernel's source, patch
it for tracing and restart the system to find the problem? Rather,
wouldn't it be simpler if he could run the tracing in the background
for the time until the problem occurs and then look at the trace to
see what's the real problem before digging deeper?
- etc.

Do I think that the kernel should be instrumented in a way that it is
"a constant drag on the kernel sourcecode"? No. This is why the trace points
inserted really have more to do with the way a classic Unix kernel is
structured (system calls, process switching, forks, execs, ...) than
anything peculiar to Linux's source code. Hence, you could reimplement
the entire Linux source an entirely different way, you would still find
those very same events taking place. Also, all these trace points result
in zero code if the kernel is compiled without tracing support.
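
A minimal sketch of how a trace point can compile to nothing when tracing is
configured out. The switch SKETCH_TRACING and the names SKETCH_TRACE,
sketch_record and sketch_wake_up are hypothetical and are not taken from the
LTT patches, which use their own configuration symbol and macros:

```c
/* Hypothetical build switch; define it to compile the trace points in. */
#ifdef SKETCH_TRACING

static unsigned long sketch_events_seen;   /* number of events recorded */

/* Stand-in for the real recording path. */
static inline void sketch_record(int id, long a, long b)
{
    (void)id; (void)a; (void)b;
    sketch_events_seen++;
}

#define SKETCH_TRACE(id, a, b) sketch_record((id), (a), (b))

#else

/* Expands to an empty statement: no call is emitted and the
 * arguments are never evaluated. */
#define SKETCH_TRACE(id, a, b) do { } while (0)

#endif

/* An instrumented function: its result is the same with or without tracing. */
static int sketch_wake_up(int pid, int state)
{
    SKETCH_TRACE(1, pid, state);
    return pid + state;
}
```

With SKETCH_TRACING undefined the call site still reads the same in the
source, which is the source-level cost Ingo and Linus object to, but it
produces no object code.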

For adding additional trace points wherever you want, you can use
kernel probes to add them dynamically (kprobes already interfaces with
LTT and is slated to go in 2.5) or you can use the custom event API
available from LTT to create your own events and log them as
part of the trace.

In brief, no LTT isn't a kernel debugging tool, but yes its integration
into the kernel would certainly help subsystems that do need this sort
of service.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 19:10:47

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Linus Torvalds wrote:
> On Sun, 22 Sep 2002, Roman Zippel wrote:
> > What about other developers, who only want to develop a simple driver,
> > without having to understand the whole kernel? Traces still work where
> > printk() or kgdb don't work. I think it's reasonable to ask a user to
> > enable tracing and reproduce the problem, which you can't reproduce
> > yourself.
>
> That makes adding source bloat ok? I've debugged some drivers with
> dprintk() style tracing, and it often makes the code harder to follow,
> even if it ends up being compiled away.

Source bloat is certainly not desirable, as I said in my reply to Ingo.
What is desirable, however, is to have a uniform tracing mechanism
replace the ad-hoc tracing mechanisms already implemented in many drivers
and subsystems.

> From what I've seen from the LTT thing, it's too heavy-weight to be good
> for many things (taking SMP-global locks for trace events is _not_ a good
> idea if the trace is for doing things like doing performance tracing,
> where a tracer that adds synchronization fundamentally _changes_ what is
> going on in ways that have nothing to do with timing).

Sure, but there are no locks anymore in the tracer with the addition of
the lockless code which is part of the set of patches I just sent. So yes,
this was a problem with LTT, but it isn't anymore.

The lockless scheme is pretty simple, instead of using a spinlock to
ensure atomic allocation of buffer space, the code does an allocate-and-test
routine where it tries to allocate space in the buffer and tests if it
succeeded in doing so. If so, then it goes on to write the data in the
event buffer, otherwise it tries again. In most cases, it does this loop
only once and in most worst cases twice.
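
A minimal user-space sketch of such an allocate-and-test loop, assuming C11
atomics in place of the kernel's own primitives. The buffer layout and the
names sketch_buf, sketch_reserve and sketch_log are hypothetical and are not
taken from the LTT patches:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 4096u

struct sketch_buf {
    _Atomic size_t next;                 /* offset of the first free byte */
    unsigned char data[SKETCH_BUF_SIZE];
};

/* Try to reserve len bytes. Returns the start offset of the reserved
 * region, or (size_t)-1 if the buffer cannot hold the event. */
static size_t sketch_reserve(struct sketch_buf *b, size_t len)
{
    size_t old = atomic_load(&b->next);

    for (;;) {
        if (len > SKETCH_BUF_SIZE - old)
            return (size_t)-1;
        /* Allocate and test: succeeds only if nobody moved next since
         * we read it. On failure old is refreshed and we try again. */
        if (atomic_compare_exchange_weak(&b->next, &old, old + len))
            return old;
    }
}

/* Reserve space, then copy the event into the region we now own. */
static int sketch_log(struct sketch_buf *b, const void *ev, size_t len)
{
    size_t off = sketch_reserve(b, len);

    if (off == (size_t)-1)
        return -1;
    memcpy(b->data + off, ev, len);
    return 0;
}
```

No lock is held and interrupts are not disabled, but every writer still
updates the same shared offset.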

> I suspect we'll want to have some form of event tracing eventually, but
> I'm personally pretty convinced that it needs to be a per-CPU thing, and
> the core mechanism would need to be very lightweight. It's easier to build
> up complexity on top of a lightweight interface than it is to make a
> lightweight interface out of a heavy one.

I fully agree with the requirements you list. LTT is already lightweight
in terms of its performance impact on the system and it doesn't use any
form of locking anymore. The only remaining issue is the use of per-CPU
buffers and this is currently being worked on by the team at IBM that
had already developed the lockless scheme and will be ready shortly.
However, there clearly is no more lock contention.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 19:22:20

by Andi Kleen

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Linus Torvalds <[email protected]> writes:
>
> I suspect we'll want to have some form of event tracing eventually, but
> I'm personally pretty convinced that it needs to be a per-CPU thing, and
> the core mechanism would need to be very lightweight. It's easier to build
> up complexity on top of a lightweight interface than it is to make a
> lightweight interface out of a heavy one.

There is an old patch around from SGI that does exactly this. It is a
very lightweight binary value tracer that has per CPU buffers. It
traces using macros that you can easily add. It's called ktrace (not
to be confused with Ingo's ktrace). I've been porting it for some time
for my own tracing needs (adding tracing macros as needed but never submitting
them). If you're interested I can submit it for 2.5 (without any hooks, people
should just add them as needed and then remove them again)

-Andi

2002-09-22 19:25:01

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

err...

Karim Yaghmour wrote:
> retrieve the information they need easily from LTT without using
^^^^^^^ => by
> clean interfaces and no table redirection.

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 19:27:41

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> Source bloat is certainly not desirable, as I said in my reply to Ingo.

(then how should i interpret 90% of the patches you sent to lkml today?)

> What is desirable, however, is to have a uniform tracing mechanism
> replace the ad-hoc tracing mechanisms already implemented in many
> drivers and subsystems.

exactly what is the problem with keeping intrusive debugging patches
separate, just like all the other ones are kept separate? It's not like
this came out of the blue, per-CPU trace buffers (and other tracers) were
done years ago for Linux.

> The lockless scheme is pretty simple, instead of using a spinlock to
> ensure atomic allocation of buffer space, the code does an
> allocate-and-test routine where it tries to allocate space in the buffer
> and tests if it succeeded in doing so. If so, then it goes on to write
> the data in the event buffer, otherwise it tries again. In most cases,
> it does this loop only once and in most worst cases twice.

(this is in essence a moving spinlock at the tail of the trace buffer -
same problem.)

Ingo

2002-09-22 21:15:50

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

There is no drag on the kernel. The concept that we are working on is
consistent with your below recommendations. Only place in the kernel an
efficient tracing infrastructure, keep trace points as patches. This adds
no overhead to kernel, allows your suggested patches to use a standard
efficient infrastructure, reduces replicated work from specific problem to
specific problem.

> my problem with this stuff is conceptual: it introduces a constant drag on
> the kernel sourcecode, while 99% of development will not want to trace,

If you care about performance you will want to trace. On two previous
kernels I have worked on I've heard this comment. Once the infrastructure
was in it was used and appreciated.

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

----

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, Karim Yaghmour wrote:
>
> > D: The core tracing infrastructure serves as the main rallying point for
> > D: all the tracing activity in the kernel. (Tracing here isn't meant in
> > D: the ptrace sense, but in the sense of recording key kernel events along
> > D: with a time-stamp in order to reconstruct the system's behavior post-
> > D: mortem.) Whether the trace driver (which buffers the data collected
> > D: and provides it to the user-space trace daemon via a char dev) is loaded
> > D: or not, the kernel sees a unique tracing function: trace_event().
> > D: Basically, this provides a trace driver register/unregister service.
> > D: When a trace driver registers, it is forwarded all the events generated
> > D: by the kernel. If no trace driver is registered, then the events go
> > D: nowhere.
>
> my problem with this stuff is conceptual: it introduces a constant drag on
> the kernel sourcecode, while 99% of development will not want to trace,
> ever. When i do need tracing occasionally, then i take those 30 minutes to
> write up a tracer from pre-existing tracing patches, tailored to specific
> problems. Eg. for the scheduler i wrote a simple tracer, but the rate of
> trace points that started to make sense for me from a development and
> debugging POV also made kernel/sched.c butt-ugly and unmaintainable, so i
> always kept the tracer separate and did the hacking in the untainted code.
>
> also, the direction things are taking these days seems to be towards
> hardware-assisted tracing. Ie. on the P4 we can recover a trace of EIPs
> traversed by the CPU recently. Stuff like this is powerful and can
> debug bugs that cannot be debugged via software. I've seen and debugged
> dozens of subtle bugs that went away if a software-tracer was enabled, in
> fact i debugged at least 3 scheduler bugs which triggered on the removal
> of a specific trace point. Sw-tracing, and especially the kind of
> intrusive stuff you are doing has its limitations and side-effects. It's
> also something that comes from the closed-source world, there kernels must
> have tracing APIs because otherwise debugging drivers and subsystems would
> be much easier. It does have its uses, no doubt, but usually we apply
> things to the kernel that have either a positive, or at worst, a neutral
> impact on the kernel proper - kernel tracing clearly is not such a
> feature.
>
> so use the power of the GPL-ed kernel and keep your patches separate,
> releasing them for specific stable kernel branches (or even development
> kernels). If anything then i'm biased towards tracer code, eg. i wrote the
> first versions of ktrace (source-unintrusive tracer) and iotrace
> (source-intrusive tracer), and i for one do not want to have *any* trace
> points in any of the code i hack on a daily basis. This stuff must stay
> separate.
>
> Ingo

2002-09-22 21:25:50

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, bob wrote:

> There is no drag on the kernel. The concept that we are working on is
> consistent with your below recommendations. Only place in the kernel an
> efficient tracing infrastructure, keep trace points as patches. [...]

well, this is not the impression i got from the patches posted to lkml ...

> [...] This adds no overhead to kernel, allows your suggested patches to
> use a standard efficient infrastructure, reduces replicated work from
> specific problem to specific problem.

so why not keep the core parts as separate patches as well? If it does
nothing then i dont see why it should get into the kernel proper.

> > my problem with this stuff is conceptual: it introduces a constant drag on
> > the kernel sourcecode, while 99% of development will not want to trace,
>
> If you care about performance you will want to trace. On two previous
> kernels I have worked on I've heard this comment. Once the
> infrastructure was in it was used and appreciated.

(i think you have not read what i have written. I use tracing pretty
frequently, and no, i dont need tracing in the kernel, during development
i can apply patches to kernel trees just fine.)

Ingo

2002-09-22 21:24:23

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Linus Torvalds writes:
>
> From what I've seen from the LTT thing, it's too heavy-weight to be good

Not true anymore.

> I suspect we'll want to have some form of event tracing eventually, but
> I'm personally pretty convinced that it needs to be a per-CPU thing, and
> the core mechanism would need to be very lightweight. It's easier to build
> up complexity on top of a lightweight interface than it is to make a
> lightweight interface out of a heavy one.

We have removed locks (code now atomically reserves space in the trace
buffer), significantly reduced the cost of taking timestamps by using the
real-time clock, and are in the process of implementing per-CPU buffers.
As per previous email, the intent is to get only the core infrastructure
into the kernel and keep trace points as patches. Some of the work going
into LTT is modeled after the tracing infrastructure in K42, which is
extremely lightweight, lock-free, and designed for multiprocessors.

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-22 22:01:38

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar wrote:
> On Sun, 22 Sep 2002, Karim Yaghmour wrote:
>
> > Source bloat is certainly not desirable, as I said in my reply to Ingo.
>
> (then how should i interpret 90% of the patches you sent to lkml today?)

Please refer to my other email where I explain why tracing is essential
to the day-to-day usage of any kernel. I don't think this is bloat and
the distributions which already include LTT certainly think it's quite
useful to their clients. In fact, most embedded distros actually make
the inclusion of LTT one of the main features with which they sell
Linux to their clients.

> > What is desirable, however, is to have a uniform tracing mechanism
> > replace the ad-hoc tracing mechanisms already implemented in many
> > drivers and subsystems.
>
> exactly what is the problem with keeping intrusive debugging patches
> separate, just like all the other ones are kept separate?

Again, this is not a kernel debugging patch. As you yourself have stated
elsewhere, instrumenting a kernel for it to yield useful information to
a kernel developer makes the code "butt-ugly" (your words). The trace
statements currently inserted by LTT are clearly useless for any kernel
debugging whatsoever. The trace statements inserted are only useful for
the day-to-day tracing needs of any Linux user.

> It's not like
> this came out of the blue, per-CPU trace buffers (and other tracers) were
> done years ago for Linux.

I don't remember claiming to have implemented the first tracer. However,
I have been working very hard in putting together a rock solid tracer
which includes the best ideas of all existing tracers and offers a wide
range of tools for _any_ user to use. The decision of the attendees of
the RAS BoF at the OLS to standardize on LTT clearly goes in this direction.

Again, please understand that LTT is not a kernel debugger. Any look at
the set of trace statements inserted by LTT will reveal their low value
for kernel developers. These trace statements are meant for providing
users with in-depth and complete understanding of the system's dynamics.

> > The lockless scheme is pretty simple, instead of using a spinlock to
> > ensure atomic allocation of buffer space, the code does an
> > allocate-and-test routine where it tries to allocate space in the buffer
> > and tests if it succeeded in doing so. If so, then it goes on to write
> > the data in the event buffer, otherwise it tries again. In most cases,
> > it does this loop only once and in most worst cases twice.
>
> (this is in essence a moving spinlock at the tail of the trace buffer -
> same problem.)

Hmm. No offense, but I think you ought to take a better look at the code.

Because events can occur at the interrupt level and on normal non-interrupt
path, any tracer that has to record a broad range of event types needs
to use spin_lock_irqsave(), which is what LTT's tracer used. Now, last
I checked, spin_lock_irqsave() calls on local_irq_save() which, on an
i386 for example, is defined as follows:
#define local_irq_save(x) __asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")

There's a cli() in there. No cli's in the lockless code. Among other
things, this makes the lockless code quite different from any usual
Linux spinlock.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 22:11:17

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> > (this is in essence a moving spinlock at the tail of the trace buffer -
> > same problem.)
>
> Hmm. No offense, but I think you ought to take a better look at the
> code.

i have, and i see stuff like this:

+ TRACE_PROCESS(TRACE_EV_PROCESS_WAKEUP, p->pid, p->state);

+static inline void TRACE_PROCESS(u8 ev_id, u32 data1, u32 data2)
+{
+ trace_process proc_event;
+
+ proc_event.event_sub_id = ev_id;
+ proc_event.event_data1 = data1;
+ proc_event.event_data2 = data2;
+
+ trace_event(TRACE_EV_PROCESS, &proc_event);
+}

where trace_event() is defined as:

+int trace_event(u8 pm_event_id,
+ void *pm_event_struct)
[...]
+ read_lock(&tracer_register_lock);

ie. it's using a global spinlock. (sure, it can be made lockless, as other
tracers have done it.)

Ingo

2002-09-22 22:32:53

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, bob wrote:
>
> > There is no drag on the kernel. The concept that we are working on is
> > consistent with your below recommendations. Only place in the kernel an
> > efficient tracing infrastructure, keep trace points as patches. [...]
>
> well, this is not the impression i got from the patches posted to lkml ...

The intent is to split LTT, get the infrastructure into the kernel, have
the trace points as patches.

>
> > [...] This adds no overhead to kernel, allows your suggested patches to
> > use a standard efficient infrastructure, reduces replicated work from
> > specific problem to specific problem.
>
> so why not keep the core parts as separate patches as well? If it does
> nothing then i dont see why it should get into the kernel proper.

:-) It does do something. It provides a common infrastructure for anyone
wanting to use trace points. What I meant is that when not enabled it
doesn't cause any overhead.

As a performance tool it will be used not only by kernel developers but by
people writing device drivers, sub-systems, and apps. Having an accepted
infrastructure in the kernel allows a common vocabulary to be used across
kernel, devices, sub-systems, and applications. It allows sub-system
developers who know their system best to put in the events and developers
of other sub-systems or apps to use those events to understand what is
going on. If the infrastructure is in the kernel, users could dynamically
enable it and feed performance results back to the kernel developers.

In short this will provide a common way to discuss performance issues
across kernel, sub-system, and application space.

> > > my problem with this stuff is conceptual: it introduces a constant drag on
> > > the kernel sourcecode, while 99% of development will not want to trace,
> >
> > If you care about performance you will want to trace. On two previous
> > kernels I have worked on I've heard this comment. Once the
> > infrastructure was in it was used and appreciated.
>
> (i think you have not read what i have written. I use tracing pretty
> frequently, and no, i dont need tracing in the kernel, during development
> i can apply patches to kernel trees just fine.)

Good - I'm glad you find tracing useful - sorry if I reacted to the
statement that most of the time it's not needed. As above, it should be in
the kernel proper not as just a patch.

> > The lockless scheme is pretty simple, instead of using a spinlock to
> > ensure atomic allocation of buffer space, the code does an
> > allocate-and-test routine where it tries to allocate space in the buffer
>
> (this is in essence a moving spinlock at the tail of the trace buffer -
> same problem.)

No, we use lock-free atomic operations to reserve a place in the buffer to
write the data. What happens is you attempt to atomically move the current
index pointer forward. If you succeed then you have bought yourself that
many data words in the queue. In the unlikely event you happened to
collide with someone you perform the atomic operation again.

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-22 22:33:11

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar wrote:
> +int trace_event(u8 pm_event_id,
> + void *pm_event_struct)
> [...]
> + read_lock(&tracer_register_lock);
>
> ie. it's using a global spinlock. (sure, it can be made lockless, as other
> tracers have done it.)

It is, but this is separate from the trace driver. This global
spinlock is only used to avoid a race condition in the registration/
unregistration of the tracing function with the trace infrastructure.
The only case where the lock is taken in write mode is when a
tracer is being registered or unregistered (register_tracer() and
unregister_tracer()). Since tracing itself is NOT registration/
unregistration intensive, there is no contention over this lock.

Any trace infrastructure that allows dynamic registration of tracers
needs this sort of lock in order to make sure that the function pointer
it has for the tracer is actually valid when it calls it. Of course if
the tracer itself was directly called from the inline trace statements,
then this would be a different story, but then the tracer has to be
in there all the time (which is exactly what happens with most, if
not all, the tracers already included in the kernel).

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 22:36:39

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

a number of suggestions to make the tracer truly lightweight:

- remove the 'event registration' and callback stuff. It just introduces
unnecessary runtime overhead. Use an include file as a registry of
events instead. This will simplify things greatly. Why do you need a
table of callbacks registered to an event? Nothing in your patches
actually uses it ... Just use one tracing function that copies the
arguments into a per-CPU ringbuffer. It's really just a few lines.

- do not disable interrupts when writing events. I used this method in
a tracer and it works well. Just get an irq-safe index to the trace
ring-buffer and fill it in. [eg. on x86 incl can be used for this
purpose.]

- get rid of p->trace_info and the pending_write_count - it's completely
unnecessary.

- drivers/trace/tracer.c is a complex mess of strange coding style and
#ifdefs, it's not proper Linux kernel code.

it's possible to have lightweight tracing - this patch clearly is not
achieving that goal yet.
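
a rough user-space sketch of the "one tracing function, per-CPU ringbuffer,
irq-safe index" idea from the list above. This is an illustration under
assumptions, not code from any existing tracer: the struct layout, NR_CPUS
value and function names are made up, and a relaxed atomic fetch-and-add
stands in for a single increment instruction such as incl on x86.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS          4              /* assumed */
#define NR_TRACE_ENTRIES 1024u          /* assumed; must be a power of two */

struct trace_event {
    uint32_t event;
    uint64_t data1, data2, data3;
};

struct cpu_ring {
    atomic_uint next;                   /* index of the next slot to hand out */
    struct trace_event ring[NR_TRACE_ENTRIES];
};

static struct cpu_ring rings[NR_CPUS];

/* Claim one slot on this CPU's ring. The increment is a single atomic
 * step, so an interrupt arriving in between gets a different slot. */
static struct trace_event *trace_slot(int cpu)
{
    unsigned idx = atomic_fetch_add_explicit(&rings[cpu].next, 1,
                                             memory_order_relaxed);

    return &rings[cpu].ring[idx & (NR_TRACE_ENTRIES - 1)];
}

static void trace(int cpu, uint32_t event,
                  uint64_t data1, uint64_t data2, uint64_t data3)
{
    struct trace_event *t = trace_slot(cpu);

    t->event = event;
    t->data1 = data1;
    t->data2 = data2;
    t->data3 = data3;
}
```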

Ingo

2002-09-22 22:41:43

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, bob wrote:

> > (this is in essence a moving spinlock at the tail of the trace buffer -
> > same problem.)
>
> No, we use lock-free atomic operations to reserve a place in the buffer
> to write the data. What happens is you attempt to atomic move the
> current index pointer forward. If you succeed then you have bought
> yourself that many data words in the queue. In the unlikely event you
> happened to collide with someone you perform the atomic operation again.

you have not understood what i have written.

what you do has the same (bad) effect as a global spinlock, it in essence
has the same cache effect as a constantly moving spinlock at the 'end' of
the trace buffer. Cachelines bounce between CPUs. Only completely per-CPU
trace buffers solve this problem.
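
to illustrate what "completely per-CPU" means for the index itself, a small
sketch: each CPU's tail lives on its own cache line, so advancing it never
invalidates a line another CPU is writing. The 64-byte line size, NR_CPUS
and all names here are assumptions for the example.

```c
#include <stdint.h>

#define CACHE_LINE 64                   /* assumed cache-line size in bytes */
#define NR_CPUS    4                    /* assumed */

/* One tail index per CPU, aligned and padded to a full cache line. */
struct percpu_tail {
    _Alignas(CACHE_LINE) unsigned long next;
};

static struct percpu_tail tails[NR_CPUS];

/* Hand out the next slot number on this CPU only; no other CPU ever
 * writes to tails[cpu], so the line stays local. */
static unsigned long next_slot(int cpu)
{
    return tails[cpu].next++;
}
```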

Ingo

2002-09-22 22:49:42

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, bob wrote:

> [...] On a technical note: a cache-line ping-ponging is bad - a global
> spinlock is horrendous. They're different - the lock-less MP scheme gets
> rid of them both.

(on the contrary - a global spinlock is bad for exactly that reason,
because it causes a cacheline ping-pong. So if two CPUs are trying to
write trace events at once, you'll get the same effect as if they were
using a global spinlock.)

Ingo

2002-09-22 22:47:27

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, bob wrote:
>
> > > (this is in essence a moving spinlock at the tail of the trace buffer -
> > > same problem.)
> >
> > No, we use lock-free atomic operations to reserve a place in the buffer
> > to write the data. What happens is you attempt to atomic move the
> > current index pointer forward. If you succeed then you have bought
> > yourself that many data words in the queue. In the unlikely event you
> > happened to collide with someone you perform the atomic operation again.
>
> you have not understood what i have written.
>
> what you do has the same (bad) effect as a global spinlock, it in essence
> has the same cache effect as a constantly moving spinlock at the 'end' of
> the trace buffer. Cachelines bounce between CPUs. Only completely per-CPU
> trace buffers solve this problem.

As per previous email, we are moving to a per-CPU scheme. On a technical
note: a cache-line ping-ponging is bad - a global spinlock is horrendous.
They're different - the lock-less MP scheme gets rid of them both.

> - do not disable interrupts when writing events. I used this method in
> a tracer and it works well. Just get an irq-safe index to the trace
> ring-buffer and fill it in. [eg. on x86 incl can be used for this
> purpose.]

The lock-less scheme does not disable interrupts - we've eliminated that.

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-22 22:45:47

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> Ingo Molnar wrote:
> > +int trace_event(u8 pm_event_id,
> > + void *pm_event_struct)
> > [...]
> > + read_lock(&tracer_register_lock);
> >
> > ie. it's using a global spinlock. (sure, it can be made lockless, as other
> > tracers have done it.)
>
> It is, but this is separate from the trace driver. [...]

it does not matter, it's called for every event.

> [...] This global spinlock is only used to avoid a race condition in the
> registration/ unregistration of the tracing function with the trace
> infrastructure.

(here you make the incorrect assumption that read-locking a rwlock is a
lightweight operation.)

Ingo

2002-09-22 22:58:17

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

> > [...] On a technical note: a cache-line ping-ponging is bad - a global
> > spinlock is horrendous. They're different - the lock-less MP scheme gets
> > rid of them both.
>
> (on the contrary - a global spinlock is bad for exactly that reason,
> because it causes a cacheline ping-pong. So if two CPUs are trying to
> write trace events at once, you'll get the same effect as if they were
> using a global spinlock.)
>
> Ingo

Just want to be clear that we are going to a per-CPU buffer scheme.

However, for sake of argument, the above is still not true. A global lock
has a different (worse) performance problem than the lock-free atomic
operation even given a global queue. The difference is 1) the Linux global
lock is very expensive and interacts with potential other processes, and 2)
you have to hold the lock for the entire duration of logging the event;
with the atomic operation you are finished once you've reserved your space.
If you didn't use the expensive Linux global lock and just a global lock,
you could be interrupted in the middle of holding the lock and performance
would fall off the map.

-bob

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-22 23:06:17

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, bob wrote:

> However, for sake of argument, the above is still not true. A global
> lock has a different (worse) performance problem than the lock-free
> atomic operation even given a global queue. The difference is 1) the
> Linux global lock is very expensive [... and interacts with potential
> other processes, [...]

huh? what is 'the Linux global lock'?

> [...] and 2) you have to hold the lock for the entire duration of
> logging the event; with the atomic operation you are finished once
> you've reserved your space. [...]

you dont have to hold the lock for the duration of saving the event, the
lock could as well protect a 'current entry' index. (Not that those 2-3
cycles saving off the event into a single cacheline counts that much ...)

the tail-atomic method is precisely equivalent to a global spinlock. The
tail of a global event buffer acts precisely as a global spinlock: if one
CPU writes to it in a stream then it performs okay, if two CPUs trace in
parallel then it causes cachelines to bounce like crazy.

> [...] If you didn't use the expensive Linux global lock and just a
> global lock, you could be interrupted in the middle of holding the lock
> and performance would fall off the map.

again, what 'expensive Linux global lock' are you talking about?

Ingo

2002-09-22 23:19:17

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

this is what a trace point should do, at most:

--------------------->
task_t *tracer_task;

int curr_idx[NR_CPUS];
int curr_pending[NR_CPUS];

struct trace_event **trace_ring;

void trace(event, data1, data2, data3)
{
	int cpu = smp_processor_id();
	int idx, pending;
	struct trace_event *t;
	unsigned long flags;

	if (!event_wanted(current, event, data1, data2, data3))
		return;

	local_irq_save(flags);

	/* the mask wraps the index; NR_TRACE_ENTRIES is a power of two */
	idx = ++curr_idx[cpu] & (NR_TRACE_ENTRIES - 1);
	pending = ++curr_pending[cpu];

	t = trace_ring[cpu] + idx;

	t->event = event;
	rdtscll(t->timestamp);
	t->data1 = data1;
	t->data2 = data2;
	t->data3 = data3;

	/* compare this CPU's count, not the curr_pending array itself */
	if (pending == TRACE_LOW_WATERMARK && tracer_task)
		wake_up_process(tracer_task);

	local_irq_restore(flags);
}

this should cover most of what's needed. The event_wanted() filter
function should be made as fast as possible. Note that the irq-disabled
section is not strictly needed but nice and also makes it work on the
preemptible kernel. (It's not a big issue at all to run these few
instructions with irqs disabled.)

[there are also other details like putting curr_idx and curr_pending
into the per-cpu area and similar stuff.]

Ingo

2002-09-22 23:27:17

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Thanks for the recommendations, we will certainly direct the development
to address these issues.

Ingo Molnar wrote:
> - remove the 'event registration' and callback stuff. It just introduces
> unnecessary runtime overhead. Use an include file as a registry of
> events instead. This will simplify things greatly.

OK, basically then all the trace points call the trace driver directly.

> Why do you need a
> table of callbacks registered to an event? Nothing in your patches
> actually uses it ...

True, nothing in the patches actually uses it at this point. This was
added with the mindset of letting other tools than LTT use the trace
points already provided by LTT.

> Just use one tracing function that copies the
> arguments into a per-CPU ringbuffer. It's really just a few lines.

Sure, the writing of data itself is trivial. The reason you find the
driver to be rather full is because of its need to do a couple of
extra operations:
- Get timestamp and use delta since beginning of buffer to reduce
trace size. (i.e. because of the rate at which traces are filled, it's
essential to be able to cut down on the data written as much as possible).
- Filter events according to event mask.
- Copy extra data in case of some events (e.g. filenames). (We're working on
ways to simplify this).
- Synchronize with trace daemon to save trace data. (A single per-CPU
circular buffer may be useful when doing kernel development, but user
tracing often requires N buffers).
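
As an illustration of the first item, here is a minimal sketch of delta
timestamps: the buffer header keeps one full 64-bit timestamp and each
record stores only a 32-bit offset from it. The struct and function names
are hypothetical and this is not the layout used by the LTT driver; it also
assumes a buffer is filled and switched before the 32-bit delta can wrap.

```c
#include <stdint.h>

/* Written once, when a buffer is started. */
struct trace_buf_hdr {
    uint64_t start_ts;                  /* full timestamp at buffer start */
};

/* Per-event record: 4 bytes of time instead of 8. */
struct trace_rec {
    uint32_t delta;                     /* timestamp minus hdr->start_ts */
    uint8_t  event;
};

/* Compress a full timestamp into a per-record delta. */
static uint32_t ts_delta(const struct trace_buf_hdr *h, uint64_t now)
{
    return (uint32_t)(now - h->start_ts);
}

/* Recover the full timestamp when the trace is read back. */
static uint64_t ts_full(const struct trace_buf_hdr *h, uint32_t delta)
{
    return h->start_ts + delta;
}
```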

In addition, because this data is available from user-space, you need
to be able to deal with many buffers. For example, you don't want some
random user to know everything that's happening on the entire system
for obvious security reasons. So the tracer will need to be able to
have per-user and per-process buffers.

The writing of the data itself is not a problem, the real problem is
having a flexible lightweight tracer that can be used in a variety
of different situations.

> - do not disable interrupts when writing events. I used this method in
> a tracer and it works well. Just get an irq-safe index to the trace
> ring-buffer and fill it in. [eg. on x86 incl can be used for this
> purpose.]

Done.

> - get rid of p->trace_info and the pending_write_count - it's completely
> unnecessary.

But then how do we keep track of whether processes have pointers to the
trace buffer or not? We need to be able to allocate/free trace buffers
in runtime. That's what the pending_write_count is for. A buffer can't
be freed if someone still has pending writes. Alternatives are welcomed.
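
To show the idea in isolation, a minimal sketch of a pending-write count
guarding a buffer's lifetime. Everything here is an assumption made for the
example (the struct, the helper names, and keeping the counter on the buffer
itself); it is not how the patch stores or updates pending_write_count. A
real version would also have to stop new writers before checking the count.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct trace_buffer {
    atomic_int pending_write_count;     /* writers currently inside the buffer */
    char *data;
};

/* A writer brackets each write with begin/end. */
static void write_begin(struct trace_buffer *b)
{
    atomic_fetch_add(&b->pending_write_count, 1);
}

static void write_end(struct trace_buffer *b)
{
    atomic_fetch_sub(&b->pending_write_count, 1);
}

/* Free the storage only if nobody still has a write in flight. */
static bool try_free(struct trace_buffer *b)
{
    if (atomic_load(&b->pending_write_count) != 0)
        return false;
    free(b->data);
    b->data = NULL;
    return true;
}
```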

Also, though this hasn't been implemented yet, users may desire to trace a
certain set of processes and trace_info could include a flag to this end.

> - drivers/trace/tracer.c is a complex mess of strange coding style and
> #ifdefs, it's not proper Linux kernel code.

We'll fix that.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 23:45:12

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, bob wrote:
>
> > However, for sake of argument, the above is still not true. A global
> > lock has a different (worse) performance problem than the lock-free
> > atomic operation even given a global queue. The difference is 1) the
> > Linux global lock is very expensive [... and interacts with potential
> > other processes, [...]
>
> huh? what is 'the Linux global lock'?

sorry - LTT just uses a global lock - but to do so it must disable
interrupts. This is not a cheap operation. With lockless code you do not
need to disable interrupts (or grab a lock) -> many fewer cycles.

>
> > [...] and 2) you have to hold the lock for the entire duration of
> > logging the event; with the atomic operation you are finished once
> > you've reserved your space. [...]
>
> you dont have to hold the lock for the duration of saving the event, the
> lock could as well protect a 'current entry' index. (Not that those 2-3
> cycles saving off the event into a single cacheline counts that much ...)
>
> the tail-atomic method is precisely equivalent to a global spinlock. The
> tail of a global event buffer acts precisely as a global spinlock: if one
> CPU writes to it in a stream then it performs okay, if two CPUs trace in
> parallel then it causes cachelines to bounce like crazy.

If 2 cpus ping-pong back and forth there will be significant cache cost -
true, but the cost of having to acquire the lock (which also ping-pongs)
and disabling the interrupts, adds even more. The additional cache line
ping pong for the lock (latency probably won't be hidden in fetching the
trace buffer data) plus the disabling interrupts still more than doubles
the cost.

-bob

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-23 00:02:32

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Yes this is simple code - similar to the model we use in K42. Still,
couple of things about the below.

1) the !event_wanted can be done outside the function, in a macro so that
the only cost if tracing is disabled is a hot cache hit on a mask (not
function call) - that helps with your comment:
> The event_wanted() filter function should be made as fast as possible.

2) If you use the lockless scheme you do not need to disable interrupts.
In K42 we manage to do the entire log operation in 21 instructions and
about as many cycles (couple more for getting time). We do this from user
space as well, disabling interrupts precludes this model (may or may not be
a problem). I was really leaning hard away from even the cost of making a
system call and disabling interrupts. Do people on the kernel dev team
feel this is an acceptable cost? Is migration prevented when interrupts
are disabled? This is something for us to consider.

3) Trace events should not all have to log the same number of data words,
though I think that's just a packaging/interface issue: the code below
would just be placed behind macros which correctly package up the right
number of arguments.
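
Points 1 and 3 can be sketched together as follows. This is an illustration
only, not K42 or LTT code: trace_mask, trace_log() and the TRACEn macros are
hypothetical names, and a 32-bit mask with one bit per event is an
assumption.

```c
#include <stdint.h>

static uint32_t trace_mask;             /* bit n set means event n is enabled */

/* State recorded by the out-of-line logger, kept for illustration. */
static unsigned trace_calls;
static unsigned last_event, last_nwords;
static uint64_t last_data[3];

/* The out-of-line part: only reached when the event is enabled. */
static void trace_log(unsigned event, unsigned nwords,
                      uint64_t d1, uint64_t d2, uint64_t d3)
{
    trace_calls++;
    last_event = event;
    last_nwords = nwords;
    last_data[0] = d1;
    last_data[1] = d2;
    last_data[2] = d3;
}

/* The inline part: one load and test of the mask, no function call when
 * the event is disabled. */
#define TRACE_WANTED(ev) (trace_mask & (1u << (ev)))

/* One macro per argument count, so call sites pass only what they have. */
#define TRACE1(ev, a) \
    do { if (TRACE_WANTED(ev)) trace_log((ev), 1, (a), 0, 0); } while (0)
#define TRACE2(ev, a, b) \
    do { if (TRACE_WANTED(ev)) trace_log((ev), 2, (a), (b), 0); } while (0)
#define TRACE3(ev, a, b, c) \
    do { if (TRACE_WANTED(ev)) trace_log((ev), 3, (a), (b), (c)); } while (0)
```

The trade-off is that the mask test is duplicated at every trace point.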

Ingo Molnar writes:
>
> this is that a trace point should do, at most:
>
> --------------------->
> task_t *tracer_task;
>
> int curr_idx[NR_CPUS];
> int curr_pending[NR_CPUS];
>
> struct trace_event **trace_ring;
>
> void trace(event, data1, data2, data3)
> {
> int cpu = smp_processor_id();
> int idx, pending, *curr = curr_idx + cpu;
> struct trace_event *t;
> unsigned long flags;
>
> if (!event_wanted(current, event, data1, data2, data3))
> return;
>
> local_irq_save(flags);
>
> idx = ++curr_idx[cpu] & (NR_TRACE_ENTRIES - 1);
> pending = ++curr_pending[cpu];
>
> t = trace_ring[cpu] + idx;
>
> t->event = event;
> rdtscll(t->timestamp);
> t->data1 = data1;
> t->data2 = data2;
> t->data3 = data3;
>
> if (curr_pending == TRACE_LOW_WATERMARK && tracer_task)
> wake_up_process(tracer_task);
>
> local_irq_restore(flags);
> }
>
> this should cover most of what's needed. The event_wanted() filter
> function should be made as fast as possible. Note that the irq-disabled
> section is not strictly needed but nice and also makes it work on the
> preemptible kernel. (It's not a big issue at all to run these few
> instructions with irqs disabled.)
>
> [there are also other details like putting curr_index and curr_pending
> into the per-cpu area and similar stuff.]
>
> Ingo

2002-09-23 07:14:19

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, bob wrote:

> Yes this is simple code - similar to the model we use in K42. Still,
> couple of things about the below.
>
> 1) the !event_wanted can be done outside the function, in a macro so
> that the only cost if tracing is disabled is a hot cache hit on a mask
> (not function call) - that helps with your comment:
> > The event_wanted() filter function should be made as fast as possible.

yes. It's a cost to be considered, but the main issue these days is the
icache cost of inlining. So generally we are leaning towards the
least-impact inlining cost.

> 2) If you use the lockless scheme you do not need to disable interrupts.
> In K42 we manage to do the entire log operation in 21 instructions and
> about as many cycles (couple more for getting time). We do this from
> user space as well, disabling interrupts precludes this model (may or
> may not be a problem). I was really leaning hard away from even the
> cost of making a system call and disabling interrupts. Do people on the
> kernel dev team feel this is an acceptable cost? Is migration prevented
> when interrupts are disabled? This is something for us to consider.

the trace() function runs purely in kernel-space, so doing a cli/sti is
not a performance problem - if it can be avoided it saves a few cycles,
but it does not have any global costs. But i dont think reliable tracing
can be done without disabling interrupts - how do you guarantee that there
will be no trace 'holes' due to interruption at the wrong instruction?

> 3) All trace events should not have to have the same number of data
> words logged - though I think that's just a packaging/interface issue
> the code below would just be placed behind macros which correctly
> package up the right number of arguments.

yes, agreed, this can be solved by having some sort of RLA, tightly packed
trace buffer. Trace buffer usage is definitely one of the more important
points.
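
one possible shape for such a packed buffer, as a sketch: each record is a
header word carrying the event id and the number of data words, followed by
exactly that many words. The encoding (24-bit id, 8-bit count), the buffer
size and the function names are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_WORDS 256                   /* assumed buffer size in 32-bit words */

static uint32_t buf[BUF_WORDS];
static size_t used;                     /* words written so far */

/* Append one record: header word, then nwords data words.
 * Returns 0 on success, -1 if the record does not fit. */
static int put_event(uint32_t event, const uint32_t *data, uint8_t nwords)
{
    if (used + 1 + nwords > BUF_WORDS)
        return -1;
    buf[used++] = (event << 8) | nwords;    /* id in high 24 bits */
    for (uint8_t i = 0; i < nwords; i++)
        buf[used++] = data[i];
    return 0;
}

/* Walk the buffer by hopping from header to header. */
static size_t count_events(void)
{
    size_t n = 0;

    for (size_t pos = 0; pos < used; pos += 1 + (buf[pos] & 0xff))
        n++;
    return n;
}
```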

Ingo

2002-09-23 07:27:54

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> > - remove the 'event registration' and callback stuff. It just introduces
> > unnecessary runtime overhead. Use an include file as a registry of
> > events instead. This will simplify things greatly.
>
> OK, basically then all the trace points call the trace driver directly.

yes. And in fact i'd suggest to not make it a driver but create a new
kernel/trace.c file - if it's a central mechanism then it should live in a
central place.

> > Why do you need a
> > table of callbacks registered to an event? Nothing in your patches
> > actually uses it ...
>
> True, nothing in the patches actually uses it at this point. This was
> added with the mindset of letting other tools than LTT use the trace
> points already provided by LTT.

okay. The thing is that generic callbacks and data hooks in the task
structure are an invitation for various types of abuses - security and GPL
type abuses. People do get very nervous when seeing such stuff - eg. read
back Christoph Hellwig's comment from a few weeks ago. It's a red flag for
many people. Provide a clean and concentrated set of APIs, no callbacks,
no unnecessary hooks. I can see the technical reasons why you have added
it - it's in theory an extensible interface, but generally we tend to add
such stuff when it's needed - if it's needed at all.

> > Just use one tracing function that copies the
> > arguments into a per-CPU ringbuffer. It's really just a few lines.
>
> Sure, the writing of data itself is trivial. The reason you find the
> driver to be rather full is because of its need to do a couple of
> extra operations:
> - Get timestamp and use delta since beginning of buffer to reduce
> trace size. (i.e. because of the rate at which traces are filled, it's
> essential to be able to cut down on the data written as much as possible).

yes - but even this one can also be solved by providing 2-3 macros that
each are hardcoded for one specific event length each - this should cover
about 90% of the events. Plus perhaps a more generic entry to handle the
longer/rarer event lengths, and the variable event length stuff.

> - Filter events according to event mask.

yes - this is handled by the event_wanted() function.

> - Copy extra data in case of some events (e.g. filenames). (We're
> working on ways to simplify this).

are you sure you want to copy filenames? File descriptor and inode numbers
ought to be enough.

> - Synchronize with trace daemon to save trace data. (A single per-CPU
> circular buffer may be useful when doing kernel development, but user
> tracing often requires N buffers).
>
> In addition, because this data is available from user-space, you need to
> be able to deal with many buffers. For example, you don't want some
> random user to know everything that's happening on the entire system for
> obvious security reasons. So the tracer will need to be able to have
> per-user and per-process buffers.

in fact i have the feeling that you should not expose any of this to
ordinary users. Performance measurements are to be done by administrator
types - all this stuff has heavy memory allocation impact anyway.

in exactly which cases do you want to have multiple trace buffers? A
single (large enough if needed) buffer should be enough. This i think is
one of the core issues of your design.

Ingo

2002-09-23 13:54:19

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, bob wrote:
>
> > Yes this is simple code - similar to the model we use in K42. Still,
> > couple of things about the below.
> >
> > 1) the !event_wanted can be done outside the function, in a macro so
> > that the only cost if tracing is disabled is a hot cache hit on a mask
> > (not function call) - that helps with your comment:
> > > The event_wanted() filter function should be made as fast as possible.
>
> yes. It's a cost to be considered, but the main issue these days is the
> icache cost of inlining. So generally we are leaning towards the
> least-impact inlining cost.

mmm - that seems a reasonable trade-off.

> > 2) If you use the lockless scheme you do not need to disable interrupts.
> > In K42 we manage to do the entire log operation in 21 instructions and
> > about as many cycles (couple more for getting time). We do this from
> > user space as well, disabling interrupts precludes this model (may or
> > may not be a problem). I was really leaning hard away from even the
> > cost of making a system call and disabling interrupts. Do people on the
> > kernel dev team feel this is an acceptable cost? Is migration prevented
> > when interrupts are disabled? This is something for us to consider.
>
> the trace() function runs purely in kernel-space, so doing a cli/sti is
> not a performance problem - if it can be avoided it saves a few cycles,
> but it does not have any global costs. But i dont think reliable tracing
> can be done without disabling interrupts - how do you guarantee that there
> will be no trace 'holes' due to interruption at the wrong instruction?

We do have a way of guaranteeing no 'holes' get created unless the process
is interrupted for a *very* long time or killed (which could happen) during
the logging of an event. The code is a little more complicated and does
require an atomic operation that may be more or less equivalent to the cli
cost. In K42, and other OSes I worked on, we wanted very efficient logging
from user space as well. I think there might be a place for understanding
libc, database, and jvm performance, for example, but if we really only
log events in kernel space then the cli/sti approach is simpler and gives
roughly equivalent performance.

>
> > 3) All trace events should not have to have the same number of data
> > words logged - though I think that's just a packaging/interface issue
> > the code below would just be placed behind macros which correctly
> > package up the right number of arguments.
>
> yes, agreed, this can be solved by having some sort of RLA, tightly packed
> trace buffer. Trace buffer usage is definitely one of the more important
> points.

Yes! and we also have a scheme to allow such a packed buffer stream to be
randomly accessed on disk (useful if you have 100Ms or Gs of data).

-bob

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-23 18:43:00

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

OK, I think we've agreed on most of the issues already. Just a couple of
details:

Ingo Molnar wrote:
> yes. And in fact i'd suggest to not make it a driver but create a new
> kernel/trace.c file - if it's a central mechanism then it should live in a
> central place.

OK, will do. Need to add a syscall for controlling tracing though (currently
done through device ioctl()).

[...]
Regarding callbacks:

Will be removed.

> are you sure you want to copy filenames? File descriptor and inode numbers
> ought to be enough.

We record the filename only once (i.e. upon exec or open). After that,
the fd is used.

> > - Synchronize with trace daemon to save trace data. (A single per-CPU
> > circular buffer may be useful when doing kernel development, but user
> > tracing often requires N buffers).
> >
> > In addition, because this data is available from user-space, you need to
> > be able to deal with many buffers. For example, you don't want some
> > random user to know everything that's happening on the entire system for
> > obvious security reasons. So the tracer will need to be able to have
> > per-user and per-process buffers.
>
> in fact i have the feeling that you should not expose any of this to
> ordinary users. Performance measurements are to be done by administrator
> types - all this stuff has heavy memory allocation impact anyway.

Sure, for performance measurements it's the admin, but per my earlier
descriptions:
- users who want to debug synchronization problems of their own tasks
shouldn't see the kernel's behavior.
- users who want to log custom events separate from the kernel events
don't want to see the kernel's behavior.

In any case, what the admin sees and what the users see of the tracing
facility will certainly be different (i.e. not the same level of
flexibility).

> in exactly which cases do you want to have multiple trace buffers? A
> single (large enough if needed) buffer should be enough. This i think is
> one of the core issues of your design.

OK, we'll revisit this issue.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-23 20:06:11

by Andreas Ferber

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Mon, Sep 23, 2002 at 11:12:12AM -0400, Karim Yaghmour wrote:
>
> Sure, for performance measurements it's the admin, but per my earlier
> descriptions:
> - users who want to debug synchronization problems of their own tasks
> shouldn't see the kernel's behavior.
> - users who want to log custom events separate from the kernel events
> don't want to see the kernel's behavior.

Fairly simple to achieve: provide some sort of userspace trace daemon
from which the users request the trace events they want to see
(communicating through standard IPC channels). The daemon provides a
unified event mask to the kernel (to prevent unnecessary overhead in
the kernel proper) and dispatches the events read from the kernel.
AFAICS LTT doesn't try to achieve realtime event monitoring, so
somewhat delaying the event propagation to the final receiver should
not be a problem (at least as long as it generally stays within a
reasonable timewindow, which should be no problem as long as the
system is not heavily overloaded, in which case in-kernel dispatching
would be nothing better).

Apart from taking complexity out of the kernel it also reduces the
tracing impact in case of event bursts because (provided the
ringbuffer is large enough) the (potentially timeconsuming in case of
many active tracers) dispatching of events is decoupled (in time) from
the event recording.

You will have to record uid/gid/pid/whatever criteria you might think
of with the event, somewhat enlarging (by a few bytes) a single event
record (don't know how much of this data you are currently gathering),
but that is a minor tradeoff IMHO.
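
A rough sketch of the daemon-side logic I have in mind, just to make it
concrete. All names are hypothetical, the IPC and the kernel interface are
left out, and the policy shown (uid 0 sees everything, other clients see
only events carrying their own uid) is an assumption for the example.

```c
#include <stddef.h>
#include <stdint.h>

struct client {
    uint32_t uid;                       /* who asked */
    uint32_t mask;                      /* bit n set: wants event id n */
    unsigned delivered;                 /* events handed to this client */
};

struct event {
    uint8_t  id;                        /* event id, assumed to be below 32 */
    uint32_t uid;                       /* recorded with the event */
};

/* The single mask the daemon hands to the kernel: the union of what all
 * clients asked for, so the kernel records nothing that nobody wants. */
static uint32_t unified_mask(const struct client *c, size_t n)
{
    uint32_t m = 0;

    for (size_t i = 0; i < n; i++)
        m |= c[i].mask;
    return m;
}

/* Hand one event read from the kernel to every client that both
 * requested it and is allowed to see it. */
static void dispatch(const struct event *ev, struct client *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int wanted = (c[i].mask >> ev->id) & 1u;
        int allowed = c[i].uid == 0 || c[i].uid == ev->uid;

        if (wanted && allowed)
            c[i].delivered++;
    }
}
```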

Andreas
--
Andreas Ferber - dev/consulting GmbH - Bielefeld, FRG
---------------------------------------------------------
+49 521 1365800 - [email protected] - http://www.devcon.net

2002-09-23 23:22:12