2008-06-20 17:05:53

by Masami Hiramatsu

[permalink] [raw]
Subject: [RFC][Patch 2/2] markers: example of irq regular kernel markers

Add trace points of irq handle events ported from LTTng's markers.

Signed-off-by: Masami Hiramatsu <[email protected]>
---
I just rewrote LTTng's irq event by using DEFINE_TRACE for example.

include/linux/irq_trace.h | 6 ++++++
kernel/irq/handle.c | 6 ++++++
2 files changed, 12 insertions(+)

Index: 2.6.26-rc5-mm3/include/linux/irq_trace.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ 2.6.26-rc5-mm3/include/linux/irq_trace.h 2008-06-16 12:27:51.000000000 -0400
@@ -0,0 +1,6 @@
+#include <linux/marker.h>
+
+DEFINE_TRACE(irq_entry, (int irq_id, int kernel_mode), irq_id, kernel_mode);
+
+DEFINE_TRACE(irq_exit, (void));
+
Index: 2.6.26-rc5-mm3/kernel/irq/handle.c
===================================================================
--- 2.6.26-rc5-mm3.orig/kernel/irq/handle.c 2008-06-16 12:27:50.000000000 -0400
+++ 2.6.26-rc5-mm3/kernel/irq/handle.c 2008-06-16 12:27:51.000000000 -0400
@@ -15,6 +15,7 @@
#include <linux/random.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
+#include <linux/irq_trace.h>

#include "internals.h"

@@ -130,6 +131,9 @@ irqreturn_t handle_IRQ_event(unsigned in
{
irqreturn_t ret, retval = IRQ_NONE;
unsigned int status = 0;
+ struct pt_regs *regs = get_irq_regs();
+
+ trace_irq_entry(irq, (regs)?(!user_mode(regs)):(1));

handle_dynamic_tick(action);

@@ -148,6 +152,8 @@ irqreturn_t handle_IRQ_event(unsigned in
add_interrupt_randomness(irq);
local_irq_disable();

+ trace_irq_exit();
+
return retval;
}

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]


2008-06-20 17:45:41

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

* Masami Hiramatsu ([email protected]) wrote:
> Add trace points of irq handle events ported from LTTng's markers.
>
> Signed-off-by: Masami Hiramatsu <[email protected]>
> ---
> I just rewrote LTTng's irq event by using DEFINE_TRACE for example.
>
> include/linux/irq_trace.h | 6 ++++++
> kernel/irq/handle.c | 6 ++++++
> 2 files changed, 12 insertions(+)
>
> Index: 2.6.26-rc5-mm3/include/linux/irq_trace.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ 2.6.26-rc5-mm3/include/linux/irq_trace.h 2008-06-16 12:27:51.000000000 -0400
> @@ -0,0 +1,6 @@
> +#include <linux/marker.h>
> +
> +DEFINE_TRACE(irq_entry, (int irq_id, int kernel_mode), irq_id, kernel_mode);
> +
> +DEFINE_TRACE(irq_exit, (void));
> +

All this work look good, thanks Masami! Sorry I did not find time to do
it lately, I've been busy on other things. A small question though :
since LTTng is configurable both as an external module or as an
in-kernel tracer, I wonder if it would really hurt to add the format
strings to DEFINE_TRACE, e.g. :

DEFINE_TRACE(name, prototype, format_string, args...)

which would give :

DEFINE_TRACE(irq_entry, (int irq_id, int kernel_mode), "%d %d",
irq_id, kernel_mode);

DEFINE_TRACE(irq_exit, (void), MARK_NOARGS);

and calling this in the kernel code :

trace_irq_entry(irq, (regs)?(!user_mode(regs)):(1));
...
trace_irq_exit();

and for quick-and-dirty debug usage, one would add this to kernel code :

trace_mark(subsystem_event, "(int arg, struct task_struct *task)",
"%d %p", arg, current);

By doing so, we could leave a gcc format string check by passing the
format string to __mark_check_format(). We could extract the field names
from the prototype, so there is no need to duplicate field information
in the format string.

Since the format string information is hidden in a header but kept at
the same location as the trace point definition, I think it reaches
goals of being "neat", efficient for general purpose tracers and to keep
the tracepoint information all in one place so we don't end up adding
stuff to various information consumers whenever a new trace point is
added.

What do you think of these changes ?

Mathieu


> Index: 2.6.26-rc5-mm3/kernel/irq/handle.c
> ===================================================================
> --- 2.6.26-rc5-mm3.orig/kernel/irq/handle.c 2008-06-16 12:27:50.000000000 -0400
> +++ 2.6.26-rc5-mm3/kernel/irq/handle.c 2008-06-16 12:27:51.000000000 -0400
> @@ -15,6 +15,7 @@
> #include <linux/random.h>
> #include <linux/interrupt.h>
> #include <linux/kernel_stat.h>
> +#include <linux/irq_trace.h>
>
> #include "internals.h"
>
> @@ -130,6 +131,9 @@ irqreturn_t handle_IRQ_event(unsigned in
> {
> irqreturn_t ret, retval = IRQ_NONE;
> unsigned int status = 0;
> + struct pt_regs *regs = get_irq_regs();
> +
> + trace_irq_entry(irq, (regs)?(!user_mode(regs)):(1));
>
> handle_dynamic_tick(action);
>
> @@ -148,6 +152,8 @@ irqreturn_t handle_IRQ_event(unsigned in
> add_interrupt_randomness(irq);
> local_irq_disable();
>
> + trace_irq_exit();
> +
> return retval;
> }
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-20 19:36:34

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

Hi Mathieu,

Mathieu Desnoyers wrote:
> * Masami Hiramatsu ([email protected]) wrote:
>> Add trace points of irq handle events ported from LTTng's markers.
>>
>> Signed-off-by: Masami Hiramatsu <[email protected]>
>> ---
>> I just rewrote LTTng's irq event by using DEFINE_TRACE for example.
>>
>> include/linux/irq_trace.h | 6 ++++++
>> kernel/irq/handle.c | 6 ++++++
>> 2 files changed, 12 insertions(+)
>>
>> Index: 2.6.26-rc5-mm3/include/linux/irq_trace.h
>> ===================================================================
>> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> +++ 2.6.26-rc5-mm3/include/linux/irq_trace.h 2008-06-16 12:27:51.000000000 -0400
>> @@ -0,0 +1,6 @@
>> +#include <linux/marker.h>
>> +
>> +DEFINE_TRACE(irq_entry, (int irq_id, int kernel_mode), irq_id, kernel_mode);
>> +
>> +DEFINE_TRACE(irq_exit, (void));
>> +
>
> All this work look good, thanks Masami! Sorry I did not find time to do
> it lately, I've been busy on other things. A small question though :
> since LTTng is configurable both as an external module or as an
> in-kernel tracer, I wonder if it would really hurt to add the format
> strings to DEFINE_TRACE, e.g. :
>
> DEFINE_TRACE(name, prototype, format_string, args...)
>
> which would give :
>
> DEFINE_TRACE(irq_entry, (int irq_id, int kernel_mode), "%d %d",
> irq_id, kernel_mode);
>
> DEFINE_TRACE(irq_exit, (void), MARK_NOARGS);
>
> and calling this in the kernel code :
>
> trace_irq_entry(irq, (regs)?(!user_mode(regs)):(1));
> ...
> trace_irq_exit();
>
> and for quick-and-dirty debug usage, one would add this to kernel code :
>
> trace_mark(subsystem_event, "(int arg, struct task_struct *task)",
> "%d %p", arg, current);

why would you complicate it? I think.

trace_mark(subsystem_event, "arg %d task %p", arg, current);

is enough for user-defined markers.

>
> By doing so, we could leave a gcc format string check by passing the
> format string to __mark_check_format(). We could extract the field names
> from the prototype, so there is no need to duplicate field information
> in the format string.

I thought that someone complained against those format strings in
kernel code. Thus I removed it from DEFINE_TRACE.

even though, I think you can do that by adding below string table
to LTTng module.

const char *lookup_table[MAX_MARKERS][2] = {
{"irq_entry", "%d %d"}, // or "(int irq_id, int kernel_mode)", "%d %d"
...
};


>
> Since the format string information is hidden in a header but kept at
> the same location as the trace point definition, I think it reaches
> goals of being "neat", efficient for general purpose tracers and to keep
> the tracepoint information all in one place so we don't end up adding
> stuff to various information consumers whenever a new trace point is
> added.

Hmm, IMHO, there seems no difference between

DEFINE_TRACE(irq_entry, (int irq_id, int kernel_mode), "%d %d",
irq_id, kernel_mode);

and

trace_irq_entry(int irq_id, int kernel_mode)
{
trace_mark(irq_entry, "%d %d", irq_id, kernel_mode);
}

for me. If so, we'd better use latter because of simplicity:-)

Thank you,


>
> What do you think of these changes ?
>
> Mathieu
>
>
>> Index: 2.6.26-rc5-mm3/kernel/irq/handle.c
>> ===================================================================
>> --- 2.6.26-rc5-mm3.orig/kernel/irq/handle.c 2008-06-16 12:27:50.000000000 -0400
>> +++ 2.6.26-rc5-mm3/kernel/irq/handle.c 2008-06-16 12:27:51.000000000 -0400
>> @@ -15,6 +15,7 @@
>> #include <linux/random.h>
>> #include <linux/interrupt.h>
>> #include <linux/kernel_stat.h>
>> +#include <linux/irq_trace.h>
>>
>> #include "internals.h"
>>
>> @@ -130,6 +131,9 @@ irqreturn_t handle_IRQ_event(unsigned in
>> {
>> irqreturn_t ret, retval = IRQ_NONE;
>> unsigned int status = 0;
>> + struct pt_regs *regs = get_irq_regs();
>> +
>> + trace_irq_entry(irq, (regs)?(!user_mode(regs)):(1));
>>
>> handle_dynamic_tick(action);
>>
>> @@ -148,6 +152,8 @@ irqreturn_t handle_IRQ_event(unsigned in
>> add_interrupt_randomness(irq);
>> local_irq_disable();
>>
>> + trace_irq_exit();
>> +
>> return retval;
>> }
>>
>> --
>> Masami Hiramatsu
>>
>> Software Engineer
>> Hitachi Computer Products (America) Inc.
>> Software Solutions Division
>>
>> e-mail: [email protected]
>>
>

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-20 20:08:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

On Fri, 2008-06-20 at 13:45 -0400, Mathieu Desnoyers wrote:

> All this work look good, thanks Masami! Sorry I did not find time to do
> it lately, I've been busy on other things. A small question though :
> since LTTng is configurable both as an external module or as an
> in-kernel tracer, I wonder if it would really hurt to add the format
> strings to DEFINE_TRACE, e.g. :
>
> DEFINE_TRACE(name, prototype, format_string, args...)
>
> which would give :
>
> DEFINE_TRACE(irq_entry, (int irq_id, int kernel_mode), "%d %d",
> irq_id, kernel_mode);
>
> DEFINE_TRACE(irq_exit, (void), MARK_NOARGS);
>
> and calling this in the kernel code :
>
> trace_irq_entry(irq, (regs)?(!user_mode(regs)):(1));
> ...
> trace_irq_exit();
>
> and for quick-and-dirty debug usage, one would add this to kernel code :
>
> trace_mark(subsystem_event, "(int arg, struct task_struct *task)",
> "%d %p", arg, current);

How would this work for:

DEFINE_TRACE(sched_switch, (struct task_struct *prev, struct task_struct *next), prev, next);

You'd want a string like: "%d %d", prev->pid, next->pid
not: "%p %p", prev, next

perhaps we can do something like:

DEFINE_TRACER(sched_switch, (struct task_struct *prev, struct task_struct *next), prev, next,
"%d %d", prev->pid, next->pid);

that defines a default tracer function for the previously defined trace
point. That way its optional, and allows for generic trace points.

Of course, all this could be ruined by reality - C really sucks wrt
forwarding functions.. :-/

2008-06-21 10:14:45

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

Hi!

> > By doing so, we could leave a gcc format string check by passing the
> > format string to __mark_check_format(). We could extract the field names
> > from the prototype, so there is no need to duplicate field information
> > in the format string.
>
> I thought that someone complained against those format strings in
> kernel code. Thus I removed it from DEFINE_TRACE.
>
> even though, I think you can do that by adding below string table
> to LTTng module.
>
> const char *lookup_table[MAX_MARKERS][2] = {
> {"irq_entry", "%d %d"}, // or "(int irq_id, int kernel_mode)", "%d %d"
> ...
> };

if move string to out of kernel core, compiler may kill some variable.
thus, we will get incomplete tracing result.

I think your proposal is very interesting.
but I dont understand why someone dislike format strings.
Could you explain this reason?

2008-06-21 14:36:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers


On Sat, 21 Jun 2008, KOSAKI Motohiro wrote:
> >
> > even though, I think you can do that by adding below string table
> > to LTTng module.
> >
> > const char *lookup_table[MAX_MARKERS][2] = {
> > {"irq_entry", "%d %d"}, // or "(int irq_id, int kernel_mode)", "%d %d"
> > ...
> > };
>
> if move string to out of kernel core, compiler may kill some variable.
> thus, we will get incomplete tracing result.

I don't understand why the compiler would do such a thing? The compiler
shouldn't be removing parameters.

>
> I think your proposal is very interesting.
> but I dont understand why someone dislike format strings.
> Could you explain this reason?

Format strings only help for printf like operations. Not all tracers want
such a thing. The best example of this is the sched_switch code. LTTng and
friends just want a pid and comm to show. But there's tracers that want
more info from the task_struct. We also like to see the priority of the
task.

I also foresee these trace hooks be use for dynamic itegrity checks. On
switching out a process we might want to examine both the next and prev to
make sure the context switch should really occur.

There's other places in the core kernel that a tracer wants more than some
data in a structure. Recording all the data that might be needed in a
structure slows down those tracers that only want a little bit of data.
Passing in a pointer to the structure being traced should be enough for
all tracers.

If those that don't care about priority of a context switch, why should it
be parsing it out just for a single tracer that might need it. But all
tracers would need the prev and next pointers.

Now back to your question, why don't we like the printf format. Simply
because it does nothing for pointers. It might help you with a %d and
number of parameters, but a %p can not tell the difference between a
struct tasks_struct *, and a int *, which can have even more devastating
results. It also just looks like a debug session instead of a trace
marker.

Not to mention what Peter Zijlstra already stated, that a printf format
gives only run time checking, where we would like to see compile time
checking.

-- Steve

2008-06-21 14:56:37

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers


Steven Rostedt <[email protected]> writes:

> [...] Format strings only help for printf like operations. [...]

That is not so. They are far from panaceanic, but printf formats are
useful for type checked simple scalars, which we can extract and use
for purposes other than printf like operations.

> The best example of this is the sched_switch code. LTTng and friends
> just want a pid and comm to show. But there's tracers that want more
> info from the task_struct. We also like to see the priority of the
> task. [...]

This is the sort of information that can help generate a compromise.
For this case, pass a few raw pointers that a compiled-in tracing
engine can dereference at will, and *also pass* a few user-level
scalars that a separately-compiled tracing engine can use.

> Passing in a pointer to the structure being traced should be enough
> for all tracers.

On the contrary, we have explained why *this is not so*. Using raw
general structure pointers in impractical for some tracers.

> Now back to your question, why don't we like the printf
> format. Simply because it does nothing for pointers. It might help
> you with a %d and number of parameters, but a %p can not tell the
> difference between a struct tasks_struct *, and a int *, which can
> have even more devastating results.

Indeed. Unfortunately, C is not kind to us in the way that perhaps
C++ templates could be. We have not yet seen a single mechanism that
does all of:

- type-safe passing of arbitrary pointer/etc. types (so compiled-in
trace data consumers can go wild with data passing)
- declarative / separately-compiled consumption of types
(so lttng/systemtap/userspace can hook in without heroics)
- parsimonious implementation

Maybe a solution could involve some restrictions on the generalities.
For example, can we narrow down the number of different scalar +
pointer types to a fixed handful? Can we tolerate type-safety being
provided by families of function declarations rather than one generic
one?


> It also just looks like a debug session instead of a trace marker.

Why do you think the difference between those is profound?


- FChE

2008-06-21 15:07:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers


On Sat, 21 Jun 2008, Frank Ch. Eigler wrote:
> Steven Rostedt <[email protected]> writes:
>
> > [...] Format strings only help for printf like operations. [...]
>
> That is not so. They are far from panaceanic, but printf formats are
> useful for type checked simple scalars, which we can extract and use
> for purposes other than printf like operations.

It doesn't help much if you mix a pid and prio of a task_struct, and then
use the prio to find the actual task, or try setting another task priority
to the pid.

>
> > The best example of this is the sched_switch code. LTTng and friends
> > just want a pid and comm to show. But there's tracers that want more
> > info from the task_struct. We also like to see the priority of the
> > task. [...]
>
> This is the sort of information that can help generate a compromise.
> For this case, pass a few raw pointers that a compiled-in tracing
> engine can dereference at will, and *also pass* a few user-level
> scalars that a separately-compiled tracing engine can use.

I definitely want the raw pointers for my tracers, but I understand
why other tracers don't want them. And yes I would like a compromise
here. I'm hoping for something where I don't need to skip over the
scalers to get to the needed pointers. Things that Masami is suggesting
is looking promising.


>
> > Passing in a pointer to the structure being traced should be enough
> > for all tracers.
>
> On the contrary, we have explained why *this is not so*. Using raw
> general structure pointers in impractical for some tracers.

The thing that those tracers need is something that can be stored in the
kernel that can easily extract the needed information.

>
> > Now back to your question, why don't we like the printf
> > format. Simply because it does nothing for pointers. It might help
> > you with a %d and number of parameters, but a %p can not tell the
> > difference between a struct tasks_struct *, and a int *, which can
> > have even more devastating results.
>
> Indeed. Unfortunately, C is not kind to us in the way that perhaps
> C++ templates could be. We have not yet seen a single mechanism that
> does all of:
>
> - type-safe passing of arbitrary pointer/etc. types (so compiled-in
> trace data consumers can go wild with data passing)
> - declarative / separately-compiled consumption of types
> (so lttng/systemtap/userspace can hook in without heroics)
> - parsimonious implementation
>
> Maybe a solution could involve some restrictions on the generalities.
> For example, can we narrow down the number of different scalar +
> pointer types to a fixed handful? Can we tolerate type-safety being
> provided by families of function declarations rather than one generic
> one?

I'm all for restricting this, I even suggested something similar a while
ago (http://lists.openwall.net/linux-kernel/2006/10/07/21). No, I'm not
pushing that solution, that solution was only to bring out more ideas.

>
>
> > It also just looks like a debug session instead of a trace marker.
>
> Why do you think the difference between those is profound?

Not that profound but I do find:

trace_sched_switch(prev, next);

much nicer to look at than

trace_mark("%p %p", prev, next);


The trace_sched_switch seems a bit more informative with a simple glance
than the trace_mark.

-- Steve

2008-06-21 16:15:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

On Sat, 2008-06-21 at 11:07 -0400, Steven Rostedt wrote:
> On Sat, 21 Jun 2008, Frank Ch. Eigler wrote:
> > Steven Rostedt <[email protected]> writes:

> > > It also just looks like a debug session instead of a trace marker.
> >
> > Why do you think the difference between those is profound?
>
> Not that profound but I do find:
>
> trace_sched_switch(prev, next);
>
> much nicer to look at than
>
> trace_mark("%p %p", prev, next);
>
>
> The trace_sched_switch seems a bit more informative with a simple glance
> than the trace_mark.

I think what Frank refers to here is why not scatter the kernel code
with trace_mark()s on every conceivable location like you do with
printk-style debugging. Those trace marks might help out when
$customer's kernel goes splat and you don't want to provide him with a
custom kernel.

I do think we must make a clear distinction between these cases because:

1) tracers provide a kernel<->user interface - and whilst we don't have
a stable in-kernel API/ABI we are anal about the kernel/user boundary.
Andrew also greatly worries about this aspect.

2) it highly uglyfies the code, Frank doesn't need to maintain it, so
its easy for him to say. But IMHO its much harder to read code that is
littered with debug statements that it is to read regular code.

3) it bloats the kernel,.. while it may not be fast path bloat, all
that marker stuff does go somewhere.


So, while I see the value of 'stable' mark sites for 'stable' events,
I'm dead-set against littering the kernel with markers just because we
can, and hoping they might some day be useful for someone.

2008-06-21 18:04:50

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

Hi -

On Sat, Jun 21, 2008 at 06:13:54PM +0200, Peter Zijlstra wrote:

> [...] I think what Frank refers to here is why not scatter the
> kernel code with trace_mark()s on every conceivable location like
> you do with printk-style debugging.

It's not fair to caricaturize my suggestions this way ("every
conceivable location").

> Those trace marks might help out when $customer's kernel goes splat
> and you don't want to provide him with a custom kernel.

Right.

> I do think we must make a clear distinction between these cases because:
>
> 1) tracers provide a kernel<->user interface - and whilst we don't
> have a stable in-kernel API/ABI we are anal about the kernel/user
> boundary. Andrew also greatly worries about this aspect.

Well, how to set Andrew's mind at ease then, beyond what we've already
said? Back a few months ago, both systemtap and lttng guys - the
primary user-space clients - have said that we are fine with this
interface changing. We each have mechanisms to adapt.

> 2) it highly uglyfies the code, Frank doesn't need to maintain it,
> so its easy for him to say. But IMHO its much harder to read code
> that is littered with debug statements that it is to read regular
> code.

Then don't put too many in, or hide them with inline functions.

> 3) it bloats the kernel,.. while it may not be fast path bloat, all
> that marker stuff does go somewhere.

That bloat has been quantified and appears negligible in space and time.

> So, while I see the value of 'stable' mark sites for 'stable'
> events, I'm dead-set against littering the kernel with markers just
> because we can, and hoping they might some day be useful for
> someone.

We're in violent agreement. No one suggested "littering just because
we can". The initial lttng suite of markers consisted of about one
hundred *in total*. If some other subsystem maintainer runs amok and
adds thousands, please take it up with them, not with us.


- FChE

2008-06-21 19:42:25

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers


Steven Rostedt <[email protected]> writes:

> [...]
>> That is not so. They are far from panaceanic, but printf formats are
>> useful for type checked simple scalars, which we can extract and use
>> for purposes other than printf like operations.
>
> It doesn't help much if you mix a pid and prio of a task_struct, and then
> use the prio to find the actual task, or try setting another task priority
> to the pid.

Right, though the same is true for a tracer client flipping around
"next" and "prev" pointer values.

>[...]
>> > Passing in a pointer to the structure being traced should be enough
>> > for all tracers.
>>
>> On the contrary, we have explained why *this is not so*. Using raw
>> general structure pointers in impractical for some tracers.
>
> The thing that those tracers need is something that can be stored in the
> kernel that can easily extract the needed information.

Well sure, but who is to do that storage & extraction? Some code the
marker site maintainer needs to write for each marker?


> [...]
>> Maybe a solution could involve some restrictions on the generalities.
>> For example, can we narrow down the number of different scalar +
>> pointer types to a fixed handful? Can we tolerate type-safety being
>> provided by families of function declarations rather than one generic
>> one?
>
> I'm all for restricting this, I even suggested something similar a while
> ago (http://lists.openwall.net/linux-kernel/2006/10/07/21). No, I'm not
> pushing that solution, that solution was only to bring out more ideas.

Parts of that approach have a lot of merit. Perhaps we should spend
some time cataloguing the types of all the lttng/logdev/blktrace
"markers" and their nearby pointers. Maybe it's only a few dozen
types altogether.

- FChE

2008-06-22 04:03:29

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

Hi Motohiro,

KOSAKI Motohiro wrote:
>>> By doing so, we could leave a gcc format string check by passing the
>>> format string to __mark_check_format(). We could extract the field names
>>> from the prototype, so there is no need to duplicate field information
>>> in the format string.
>> I thought that someone complained against those format strings in
>> kernel code. Thus I removed it from DEFINE_TRACE.
>>
>> even though, I think you can do that by adding below string table
>> to LTTng module.
>>
>> const char *lookup_table[MAX_MARKERS][2] = {
>> {"irq_entry", "%d %d"}, // or "(int irq_id, int kernel_mode)", "%d %d"
>> ...
>> };
>
> if move string to out of kernel core, compiler may kill some variable.
> thus, we will get incomplete tracing result.

I think you might worry about the lookup_table variable will be removed
by compiler if the variable is not referred from nowhere, right?

Here what I said is, if LTTng needs printf-style format, LTTng can "have"
a lookup table in its module. In that case, since LTTng must have (at
least one) functions which refers the lookup table, it is not optimized out.

Of course, LTTng maintainers have to put their eyes on changes of markers
and related variables/structures, and change the lookup table if they need.
(If they have a good c-style parameter parser, the maintenance cost
can be reduced...)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-22 04:33:03

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

Hi,

Frank Ch. Eigler wrote:
>> I do think we must make a clear distinction between these cases because:
>>
>> 1) tracers provide a kernel<->user interface - and whilst we don't
>> have a stable in-kernel API/ABI we are anal about the kernel/user
>> boundary. Andrew also greatly worries about this aspect.
>
> Well, how to set Andrew's mind at ease then, beyond what we've already
> said? Back a few months ago, both systemtap and lttng guys - the
> primary user-space clients - have said that we are fine with this
> interface changing. We each have mechanisms to adapt.
>
>> 2) it highly uglyfies the code, Frank doesn't need to maintain it,
>> so its easy for him to say. But IMHO its much harder to read code
>> that is littered with debug statements that it is to read regular
>> code.
>
> Then don't put too many in, or hide them with inline functions.
>
>> 3) it bloats the kernel,.. while it may not be fast path bloat, all
>> that marker stuff does go somewhere.
>
> That bloat has been quantified and appears negligible in space and time.
>
>> So, while I see the value of 'stable' mark sites for 'stable'
>> events, I'm dead-set against littering the kernel with markers just
>> because we can, and hoping they might some day be useful for
>> someone.
>
> We're in violent agreement. No one suggested "littering just because
> we can". The initial lttng suite of markers consisted of about one
> hundred *in total*. If some other subsystem maintainer runs amok and
> adds thousands, please take it up with them, not with us.

Indeed.

Peter, I thought, we were discussing what interface we could accept,
not how many or where tracepoints we could accept. Or, am I misreading?

I know that if someone pushes markers into kernel in his own sweet way,
of course, the kernel code will be bloated endlessly. But I also know
why we review patches and send Ack/Nack before merging them to the tree.
(If you still worry about it, we might be able to make linux-markers
git tree, and review all regular markers on it)

I think and agree that we should make clear policies and criteria of
acceptance of each marker, but it should be discussed on actual
marker uses patches after making agreement for the interface.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-22 17:11:48

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC] Tracepoint proposal

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2008-06-20 at 13:45 -0400, Mathieu Desnoyers wrote:
>
> > All this work look good, thanks Masami! Sorry I did not find time to do
> > it lately, I've been busy on other things. A small question though :
> > since LTTng is configurable both as an external module or as an
> > in-kernel tracer, I wonder if it would really hurt to add the format
> > strings to DEFINE_TRACE, e.g. :
> >
> > DEFINE_TRACE(name, prototype, format_string, args...)
> >
> > which would give :
> >
> > DEFINE_TRACE(irq_entry, (int irq_id, int kernel_mode), "%d %d",
> > irq_id, kernel_mode);
> >
> > DEFINE_TRACE(irq_exit, (void), MARK_NOARGS);
> >
> > and calling this in the kernel code :
> >
> > trace_irq_entry(irq, (regs)?(!user_mode(regs)):(1));
> > ...
> > trace_irq_exit();
> >
> > and for quick-and-dirty debug usage, one would add this to kernel code :
> >
> > trace_mark(subsystem_event, "(int arg, struct task_struct *task)",
> > "%d %p", arg, current);
>
> How would this work for:
>
> DEFINE_TRACE(sched_switch, (struct task_struct *prev, struct task_struct *next), prev, next);
>
> You'd want a string like: "%d %d", prev->pid, next->pid
> not: "%p %p", prev, next
>
> perhaps we can do something like:
>
> DEFINE_TRACER(sched_switch, (struct task_struct *prev, struct task_struct *next), prev, next,
> "%d %d", prev->pid, next->pid);
>
> that defines a default tracer function for the previously defined trace
> point. That way its optional, and allows for generic trace points.
>
> Of course, all this could be ruined by reality - C really sucks wrt
> forwarding functions.. :-/
>

Hi Peter,

I've tried to read through the comments recently posted to this thread
(sorry I don't have time to answer them all specifically right now, a
lot of this makes a lot of sense). I've tried to come up with a
proposal, let's name it "tracepoint", which should hopefully address the
full scope of the problem. Please tell me if it makes sense. It should
allow compile-time verification of dynamically linked-in and activated
tracepoints. I'll work on an implementation ASAP.

Mathieu

Tracepoint proposal

- Tracepoint infrastructure
- In-kernel users
- Complete typing, verified by the compiler
- Dynamically linked and activated

- Marker infrastructure
- Exported API to userland
- Basic types only

- Dynamic vs static
- In-kernel probes are dynamically linked, dynamically activated, connected to
tracepoints. Type verification is done at compile-time. Those in-kernel
probes can be a probe extracting the information to put in a marker or a
specific in-kernel tracer such as ftrace.
- Information sinks (LTTng, SystemTAP) are dynamically connected to the
markers inserted in the probes and are dynamically activated.

- Near instrumentation site vs in a separate tracer module

A probe module, only if provided with the kernel tree, could connect to internal
tracing sites. This argues for keeping the tracepoing probes near the
instrumentation site code. However, if a tracer is general purpose and exports
typing information to userspace through some mechanism, it should only export
the "basic type" information and could be therefore shipped outside of the
kernel tree.

In-kernel probes should be integrated to the kernel tree. They would be close to
the instrumented kernel code and would translate between the in-kernel
instrumentation and the "basic type" exports. Other in-kernel probes could
provide a different output (statistics available through debugfs for instance).
ftrace falls into this category.

Generic or specialized information "sinks" (LTTng, systemtap) could be connected
to the markers put in tracepoint probes to extract the information to userspace.
They would extract both typing information and the per-tracepoint execution
information to userspace.

Therefore, the code would look like :

kernel/sched.c:

#include "sched-trace.h"

schedule()
{
...
trace_sched_switch(prev, next);
...
}


kernel/sched-trace.h:

DEFINE_TRACE(sched_switch, struct task_struct *prev, struct task_struct *next);


kernel/sched-trace.c:

#include "sched-trace.h"

static probe_sched_switch(struct task_struct *prev, struct task_struct
*next)
{
trace_mark(kernel_sched_switch, "prev_pid %d next_pid %d prev_state %ld",
prev->pid, next->pid, prev->state);
}

int __init init(void)
{
return register_sched_switch(probe_sched_switch);
}

void __exit exit(void)
{
unregister_sched_switch(probe_sched_switch);
}


Where DEFINE_TRACE internals declare a structure, a trace_* inline function,
a register_trace_* and unregister_trace_* inline functions :

static instrumentation site structure, containing function pointers to
deactivated functions and activation boolean. It also contains the
"sched_switch" string. This structure is placed in a special section to create
an array of these structures.

static inline void trace_sched_switch(struct task_struct *prev,
struct task_struct *next)
{
if (sched_switch tracing is activated)
marshall_probes(&instrumentation_site_structure, prev, next);
}

static inline int register_trace_sched_switch(
void (*probe)(struct task_struct *prev, struct task_struct *next)
{
return do_register_probe("sched_switch", (void *)probe);
}

static inline void unregister_trace_sched_switch(
void (*probe)(struct task_struct *prev, struct task_struct *next)
{
do_unregister_probe("sched_switch", (void *)probe);
}


We need a a new kernel probe API :

do_register_probe / do_unregister_probe
- Connects the in-kernel probe to the site
- Activates the site tracing (probe reference counting)


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-22 18:03:44

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

On Sun, Jun 22, 2008 at 01:11:35PM -0400, Mathieu Desnoyers wrote:
> Tracepoint proposal
>
> - Tracepoint infrastructure
> - In-kernel users
> - Complete typing, verified by the compiler
> - Dynamically linked and activated
>
> - Marker infrastructure
> - Exported API to userland
> - Basic types only
>
> - Dynamic vs static
> - In-kernel probes are dynamically linked, dynamically activated, connected to
> tracepoints. Type verification is done at compile-time. Those in-kernel
> probes can be a probe extracting the information to put in a marker or a
> specific in-kernel tracer such as ftrace.
> - Information sinks (LTTng, SystemTAP) are dynamically connected to the
> markers inserted in the probes and are dynamically activated.
>
> - Near instrumentation site vs in a separate tracer module
>
> A probe module, only if provided with the kernel tree, could connect to internal
> tracing sites. This argues for keeping the tracepoing probes near the
> instrumentation site code. However, if a tracer is general purpose and exports
> typing information to userspace through some mechanism, it should only export
> the "basic type" information and could be therefore shipped outside of the
> kernel tree.
>
> In-kernel probes should be integrated to the kernel tree. They would be close to
> the instrumented kernel code and would translate between the in-kernel
> instrumentation and the "basic type" exports. Other in-kernel probes could
> provide a different output (statistics available through debugfs for instance).
> ftrace falls into this category.
>
> Generic or specialized information "sinks" (LTTng, systemtap) could be connected
> to the markers put in tracepoint probes to extract the information to userspace.
> They would extract both typing information and the per-tracepoint execution
> information to userspace.
>
> Therefore, the code would look like :
>
> kernel/sched.c:
>
> #include "sched-trace.h"
>
> schedule()
> {
> ...
> trace_sched_switch(prev, next);
> ...
> }

Once this is accepted you're going to add hundreds of such calls to every
core subsystem, right?

2008-06-22 18:27:18

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

* Alexey Dobriyan ([email protected]) wrote:
> On Sun, Jun 22, 2008 at 01:11:35PM -0400, Mathieu Desnoyers wrote:
> > Tracepoint proposal
> >
> > - Tracepoint infrastructure
> > - In-kernel users
> > - Complete typing, verified by the compiler
> > - Dynamically linked and activated
> >
> > - Marker infrastructure
> > - Exported API to userland
> > - Basic types only
> >
> > - Dynamic vs static
> > - In-kernel probes are dynamically linked, dynamically activated, connected to
> > tracepoints. Type verification is done at compile-time. Those in-kernel
> > probes can be a probe extracting the information to put in a marker or a
> > specific in-kernel tracer such as ftrace.
> > - Information sinks (LTTng, SystemTAP) are dynamically connected to the
> > markers inserted in the probes and are dynamically activated.
> >
> > - Near instrumentation site vs in a separate tracer module
> >
> > A probe module, only if provided with the kernel tree, could connect to internal
> > tracing sites. This argues for keeping the tracepoing probes near the
> > instrumentation site code. However, if a tracer is general purpose and exports
> > typing information to userspace through some mechanism, it should only export
> > the "basic type" information and could be therefore shipped outside of the
> > kernel tree.
> >
> > In-kernel probes should be integrated to the kernel tree. They would be close to
> > the instrumented kernel code and would translate between the in-kernel
> > instrumentation and the "basic type" exports. Other in-kernel probes could
> > provide a different output (statistics available through debugfs for instance).
> > ftrace falls into this category.
> >
> > Generic or specialized information "sinks" (LTTng, systemtap) could be connected
> > to the markers put in tracepoint probes to extract the information to userspace.
> > They would extract both typing information and the per-tracepoint execution
> > information to userspace.
> >
> > Therefore, the code would look like :
> >
> > kernel/sched.c:
> >
> > #include "sched-trace.h"
> >
> > schedule()
> > {
> > ...
> > trace_sched_switch(prev, next);
> > ...
> > }
>
> Once this is accepted you're going to add hundreds of such calls to every
> core subsystem, right?
>

The LTTng instrumentation has about 125 of such calls. Tests have
revealed that adding such dormant tracepoints to the kernel often
increase kernel performances rather than decreasing it (see the ia64
benchmarks posted on lkml a few weeks ago).

The core subsystem maintainers are being involved in the process.
Actually, marking up the source code has the interesting effect of
letting knowledgeable people influence the trace point decisions.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-23 02:21:15

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][Patch 2/2] markers: example of irq regular kernel markers

> Peter, I thought, we were discussing what interface we could accept,
> not how many or where tracepoints we could accept. Or, am I misreading?
>
> I know that if someone pushes markers into kernel in his own sweet way,
> of course, the kernel code will be bloated endlessly. But I also know
> why we review patches and send Ack/Nack before merging them to the tree.
> (If you still worry about it, we might be able to make linux-markers
> git tree, and review all regular markers on it)

Indeed.

if necessary, I can maintain this tree.
(because, I hope marker is independent by LTTng and SystemTap.
So, I am no related any tracer in this discussion member.)

but, at first,
We should review and discuss to Mathieu's new tracepoint proposal.
I think that is good idea.

2008-06-24 00:24:31

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

On Sun, Jun 22, 2008 at 02:27:05PM -0400, Mathieu Desnoyers wrote:
> * Alexey Dobriyan ([email protected]) wrote:
> > On Sun, Jun 22, 2008 at 01:11:35PM -0400, Mathieu Desnoyers wrote:
> > > Tracepoint proposal
> > >
> > > - Tracepoint infrastructure
> > > - In-kernel users
> > > - Complete typing, verified by the compiler
> > > - Dynamically linked and activated
> > >
> > > - Marker infrastructure
> > > - Exported API to userland
> > > - Basic types only
> > >
> > > - Dynamic vs static
> > > - In-kernel probes are dynamically linked, dynamically activated, connected to
> > > tracepoints. Type verification is done at compile-time. Those in-kernel
> > > probes can be a probe extracting the information to put in a marker or a
> > > specific in-kernel tracer such as ftrace.
> > > - Information sinks (LTTng, SystemTAP) are dynamically connected to the
> > > markers inserted in the probes and are dynamically activated.
> > >
> > > - Near instrumentation site vs in a separate tracer module
> > >
> > > A probe module, only if provided with the kernel tree, could connect to internal
> > > tracing sites. This argues for keeping the tracepoing probes near the
> > > instrumentation site code. However, if a tracer is general purpose and exports
> > > typing information to userspace through some mechanism, it should only export
> > > the "basic type" information and could be therefore shipped outside of the
> > > kernel tree.
> > >
> > > In-kernel probes should be integrated to the kernel tree. They would be close to
> > > the instrumented kernel code and would translate between the in-kernel
> > > instrumentation and the "basic type" exports. Other in-kernel probes could
> > > provide a different output (statistics available through debugfs for instance).
> > > ftrace falls into this category.
> > >
> > > Generic or specialized information "sinks" (LTTng, systemtap) could be connected
> > > to the markers put in tracepoint probes to extract the information to userspace.
> > > They would extract both typing information and the per-tracepoint execution
> > > information to userspace.
> > >
> > > Therefore, the code would look like :
> > >
> > > kernel/sched.c:
> > >
> > > #include "sched-trace.h"
> > >
> > > schedule()
> > > {
> > > ...
> > > trace_sched_switch(prev, next);
> > > ...
> > > }
> >
> > Once this is accepted you're going to add hundreds of such calls to every
> > core subsystem, right?
> >
>
> The LTTng instrumentation has about 125 of such calls. Tests have
> revealed that adding such dormant tracepoints to the kernel often
> increase kernel performances rather than decreasing it (see the ia64
> benchmarks posted on lkml a few weeks ago).

We're not adding this for performance increase, you do realize this?

> The core subsystem maintainers are being involved in the process.

NAK this from proc if you about this.

> Actually, marking up the source code has the interesting effect of
> letting knowledgeable people influence the trace point decisions.

I'd say that maximum source code overhead any tracing facility should be
allowed is "__xxx" annotation at very start of function definition.
Anything beyond should be rejected and there are good reasons for that.

2008-06-24 03:11:18

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

Hi Mathieu,

Mathieu Desnoyers wrote:
>
> Hi Peter,
>
> I've tried to read through the comments recently posted to this thread
> (sorry I don't have time to answer them all specifically right now, a
> lot of this makes a lot of sense). I've tried to come up with a
> proposal, let's name it "tracepoint", which should hopefully address the
> full scope of the problem. Please tell me if it makes sense. It should
> allow compile-time verification of dynamically linked-in and activated
> tracepoints. I'll work on an implementation ASAP.
>
> Mathieu
>
> Tracepoint proposal
>
> - Tracepoint infrastructure
> - In-kernel users
> - Complete typing, verified by the compiler
> - Dynamically linked and activated
>
> - Marker infrastructure
> - Exported API to userland
> - Basic types only
>
> - Dynamic vs static
> - In-kernel probes are dynamically linked, dynamically activated, connected to
> tracepoints. Type verification is done at compile-time. Those in-kernel
> probes can be a probe extracting the information to put in a marker or a
> specific in-kernel tracer such as ftrace.
> - Information sinks (LTTng, SystemTAP) are dynamically connected to the
> markers inserted in the probes and are dynamically activated.
>
> - Near instrumentation site vs in a separate tracer module
>
> A probe module, only if provided with the kernel tree, could connect to internal
> tracing sites. This argues for keeping the tracepoing probes near the
> instrumentation site code. However, if a tracer is general purpose and exports
> typing information to userspace through some mechanism, it should only export
> the "basic type" information and could be therefore shipped outside of the
> kernel tree.
>
> In-kernel probes should be integrated to the kernel tree. They would be close to
> the instrumented kernel code and would translate between the in-kernel
> instrumentation and the "basic type" exports. Other in-kernel probes could
> provide a different output (statistics available through debugfs for instance).
> ftrace falls into this category.
>
> Generic or specialized information "sinks" (LTTng, systemtap) could be connected
> to the markers put in tracepoint probes to extract the information to userspace.
> They would extract both typing information and the per-tracepoint execution
> information to userspace.

Your idea is good to me. I just worry about complexity.
if both tracepoint and marker add information to other sections,
both two functions cover each partially. Anyway, it depends
on implementation.:-)

> Therefore, the code would look like :
>
> kernel/sched.c:
>
> #include "sched-trace.h"
>
> schedule()
> {
> ...
> trace_sched_switch(prev, next);
> ...
> }
>
>
> kernel/sched-trace.h:
>
> DEFINE_TRACE(sched_switch, struct task_struct *prev, struct task_struct *next);
>
>
> kernel/sched-trace.c:
>
> #include "sched-trace.h"
>
> static probe_sched_switch(struct task_struct *prev, struct task_struct
> *next)
> {
> trace_mark(kernel_sched_switch, "prev_pid %d next_pid %d prev_state %ld",
> prev->pid, next->pid, prev->state);
> }
>
> int __init init(void)
> {
> return register_sched_switch(probe_sched_switch);
> }
>
> void __exit exit(void)
> {
> unregister_sched_switch(probe_sched_switch);
> }
>
>
> Where DEFINE_TRACE internals declare a structure, a trace_* inline function,
> a register_trace_* and unregister_trace_* inline functions :

Hmm, if so, DEFINE_TRACE() still needs next and prev.:-)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-24 04:04:06

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

Hi Alexey,

Alexey Dobriyan wrote:
> On Sun, Jun 22, 2008 at 02:27:05PM -0400, Mathieu Desnoyers wrote:
>> * Alexey Dobriyan ([email protected]) wrote:
>>> On Sun, Jun 22, 2008 at 01:11:35PM -0400, Mathieu Desnoyers wrote:
>>>> Tracepoint proposal
>>>>
>>>> - Tracepoint infrastructure
>>>> - In-kernel users
>>>> - Complete typing, verified by the compiler
>>>> - Dynamically linked and activated
>>>>
>>>> - Marker infrastructure
>>>> - Exported API to userland
>>>> - Basic types only
>>>>
>>>> - Dynamic vs static
>>>> - In-kernel probes are dynamically linked, dynamically activated, connected to
>>>> tracepoints. Type verification is done at compile-time. Those in-kernel
>>>> probes can be a probe extracting the information to put in a marker or a
>>>> specific in-kernel tracer such as ftrace.
>>>> - Information sinks (LTTng, SystemTAP) are dynamically connected to the
>>>> markers inserted in the probes and are dynamically activated.
>>>>
>>>> - Near instrumentation site vs in a separate tracer module
>>>>
>>>> A probe module, only if provided with the kernel tree, could connect to internal
>>>> tracing sites. This argues for keeping the tracepoing probes near the
>>>> instrumentation site code. However, if a tracer is general purpose and exports
>>>> typing information to userspace through some mechanism, it should only export
>>>> the "basic type" information and could be therefore shipped outside of the
>>>> kernel tree.
>>>>
>>>> In-kernel probes should be integrated to the kernel tree. They would be close to
>>>> the instrumented kernel code and would translate between the in-kernel
>>>> instrumentation and the "basic type" exports. Other in-kernel probes could
>>>> provide a different output (statistics available through debugfs for instance).
>>>> ftrace falls into this category.
>>>>
>>>> Generic or specialized information "sinks" (LTTng, systemtap) could be connected
>>>> to the markers put in tracepoint probes to extract the information to userspace.
>>>> They would extract both typing information and the per-tracepoint execution
>>>> information to userspace.
>>>>
>>>> Therefore, the code would look like :
>>>>
>>>> kernel/sched.c:
>>>>
>>>> #include "sched-trace.h"
>>>>
>>>> schedule()
>>>> {
>>>> ...
>>>> trace_sched_switch(prev, next);
>>>> ...
>>>> }
>>> Once this is accepted you're going to add hundreds of such calls to every
>>> core subsystem, right?
>>>
>> The LTTng instrumentation has about 125 of such calls. Tests have
>> revealed that adding such dormant tracepoints to the kernel often
>> increase kernel performances rather than decreasing it (see the ia64
>> benchmarks posted on lkml a few weeks ago).
>
> We're not adding this for performance increase, you do realize this?
>
>> The core subsystem maintainers are being involved in the process.
>
> NAK this from proc if you about this.

Hmm, Mathieu, for this issue, I think that no one agrees
if there is no clear policy for trace points (for example,
subsystem maintainer or patch committer can modify or
erase trace points by their patch, if they need; tracepoint
maintainers should follow the changes, etc.).

>> Actually, marking up the source code has the interesting effect of
>> letting knowledgeable people influence the trace point decisions.
>
> I'd say that maximum source code overhead any tracing facility should be
> allowed is "__xxx" annotation at very start of function definition.
> Anything beyond should be rejected and there are good reasons for that.

(Out of curiously, would you know any __xxx magic for that?)

One reason why we need markers or other in-the-middle-of-function
trace point is that some events happen inside functions, not it's
interface.

Actually, we might be able to break those functions into several
functions to export it's interface to tracers. However, I think
it is easier to call trace_XXX than that.


Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-24 07:15:39

by Takashi NISHIIE

[permalink] [raw]
Subject: RE: [RFC] Tracepoint proposal

Hi

Hiramatsu wrote:
>One reason why we need markers or other in-the-middle-of-function
>trace point is that some events happen inside functions, not it's
>interface.

Each kernel sub-system seems to have its own way of dealing with
debugging statements. Some of these methods include 'dprintk',
'pr_debug', 'dev_debug', 'DEBUGP'. I think that these functions are
the tracepoints that has been availably mounted without setting up
the tool set of the outside. I think whether mounting that unites
these functions can be done if kernel marker and tracepoint are used.


By the way, isn't there problem on security?
What kprobe, jprobe, and kernel marker, etc. offer looks like what
the framework of Linux Security Module had offered before. Gotten
kprobe, jprobe, and kernel marker, etc. should not be exported to the
userland for security because it becomes the hotbed of rootkits. Users
such as kprobe, jprobe, and kernel marker should not be Loadable Kernel
Module. I think that there are some solutions in LTTng about this
security problem. However, will the environment to be able to operate
SystemTap be really secure?
$B!!(BAt least, kernel commandline option to invalidate all of kprobe,
jprobe, and kernel marker, etc. because of the batch might be
necessary.

Thank you,

--
Takashi Nishiie


2008-06-24 11:58:19

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

"Takashi Nishiie" <[email protected]> writes:

> [...]
> Each kernel sub-system seems to have its own way of dealing with
> debugging statements. Some of these methods include 'dprintk',
> 'pr_debug', 'dev_debug', 'DEBUGP'. I think that these functions are
> the tracepoints that has been availably mounted without setting up
> the tool set of the outside. I think whether mounting that unites
> these functions can be done if kernel marker and tracepoint are used.

There are efforts underway to collect these various debug methods into
a single run-time-dynamic stream, which may even turn out to connect
to markers.

> By the way, isn't there problem on security? What kprobe, jprobe,
> and kernel marker, etc. offer looks like what the framework of Linux
> Security Module had offered before. Gotten kprobe, jprobe, and
> kernel marker, etc. should not be exported to the userland for
> security because it becomes the hotbed of rootkits.

These are all kernel-side facilities with no direct connection to
user-land.

> Users such as kprobe, jprobe, and kernel marker should not be
> Loadable Kernel Module. [...]

That would defeat their usefulness. Remember, kernel modules run with
no hardware-level restrictions at all, so if an adversary managed to
load up some kernel malware module, the game is over, whether or not
they use kprobes.

- FChE

2008-06-24 16:06:20

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

Hi,

Takashi Nishiie wrote:
> Hi
>
> Hiramatsu wrote:
>> One reason why we need markers or other in-the-middle-of-function
>> trace point is that some events happen inside functions, not it's
>> interface.
>
> Each kernel sub-system seems to have its own way of dealing with
> debugging statements. Some of these methods include 'dprintk',
> 'pr_debug', 'dev_debug', 'DEBUGP'. I think that these functions are
> the tracepoints that has been availably mounted without setting up
> the tool set of the outside. I think whether mounting that unites
> these functions can be done if kernel marker and tracepoint are used.

Sure, I think those functions covers each partially, but some requirements
are different.

dynamic printk
- stored in a section
- dynamic activation
- formatted message (multiple messages for each activation group)
- export basic types
- variadic function
- low frequently called
- module support

Marker
- stored in a section
- dynamic activation
- formatted string (single format for each marker)
- export basic types
- variadic function
- low-high frequently called
- module support

Tracepoint
- stored in a section
- dynamic activation
- no message
- export kernel structure
- arguments depending on points
- high frequently called
- no module support (kernel use only)


> By the way, isn't there problem on security?
> What kprobe, jprobe, and kernel marker, etc. offer looks like what
> the framework of Linux Security Module had offered before. Gotten
> kprobe, jprobe, and kernel marker, etc. should not be exported to the
> userland for security because it becomes the hotbed of rootkits. Users
> such as kprobe, jprobe, and kernel marker should not be Loadable Kernel
> Module. I think that there are some solutions in LTTng about this
> security problem. However, will the environment to be able to operate
> SystemTap be really secure?
> $B!!(BAt least, kernel commandline option to invalidate all of kprobe,
> jprobe, and kernel marker, etc. because of the batch might be
> necessary.

Please, set CONFIG_MODULES=no.
If your system really really needs to be hardened, please
don't make kernel module loadable. Otherwise, any kernel module
can modify any kernel code. So, I think it's not a problem of
any specific functionality.

Anyway, I think selinux will give you more flexible way to
restrict who can load what modules.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-24 16:23:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

> Tracepoint
> - stored in a section
> - dynamic activation
> - no message
> - export kernel structure
> - arguments depending on points
> - high frequently called
> - no module support (kernel use only)

Why Shouldn't kernel module use tracepoint?
technical problem? or any plicy exist?

2008-06-24 17:03:37

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

KOSAKI Motohiro wrote:
>> Tracepoint
>> - stored in a section
>> - dynamic activation
>> - no message
>> - export kernel structure
>> - arguments depending on points
>> - high frequently called
>> - no module support (kernel use only)
>
> Why Shouldn't kernel module use tracepoint?
> technical problem? or any plicy exist?

Good question, I think we don't want to export so much
kernel internal structures. Since tracepoint tend to export
raw kernel structure to user module, I thought it might be
better not to export its interface to modules.

However, I thought that again, and knew what Peter worried was
about marker which exports marker list to user space.
So, the issue is exporting internal structures to user space,
not to modules. Thus, I think it can support modules also.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-24 17:46:24

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC] Tracepoint proposal

* Masami Hiramatsu ([email protected]) wrote:
> KOSAKI Motohiro wrote:
> >> Tracepoint
> >> - stored in a section
> >> - dynamic activation
> >> - no message
> >> - export kernel structure
> >> - arguments depending on points
> >> - high frequently called
> >> - no module support (kernel use only)
> >
> > Why Shouldn't kernel module use tracepoint?
> > technical problem? or any plicy exist?
>
> Good question, I think we don't want to export so much
> kernel internal structures. Since tracepoint tend to export
> raw kernel structure to user module, I thought it might be
> better not to export its interface to modules.
>
> However, I thought that again, and knew what Peter worried was
> about marker which exports marker list to user space.
> So, the issue is exporting internal structures to user space,
> not to modules. Thus, I think it can support modules also.
>

Agreed. It could be good to someday merge these tracepoint probes with
the mainline kernel so they can easily be adapted to tracepoint changes.
It seems correct to me for tracepoint probes to be either in the core
kernel or in modules, but I expect that they will be easier to maintain
within the tree.

Mathieu

> Thank you,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-25 23:57:34

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH] Kernel Tracepoints

** note : this patch is submitted for early review. It applies after my
current unreleased 2.6.26-rc8 LTTng patchset. Comments are welcome.

Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.

Allows complete typing verification. No format string required.

TODO : Documentation/tracepoint.txt

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: 'Peter Zijlstra' <[email protected]>
CC: "Frank Ch. Eigler" <[email protected]>
CC: 'Ingo Molnar' <[email protected]>
CC: 'Hideo AOKI' <[email protected]>
CC: Takashi Nishiie <[email protected]>
CC: 'Steven Rostedt' <[email protected]>
---
include/asm-generic/vmlinux.lds.h | 4
include/linux/module.h | 17 +
include/linux/tracepoint.h | 154 +++++++++++
init/Kconfig | 7
kernel/Makefile | 1
kernel/module.c | 71 +++++
kernel/tracepoint.c | 495 ++++++++++++++++++++++++++++++++++++++
7 files changed, 748 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc7-lttng/init/Kconfig
===================================================================
--- linux-2.6.26-rc7-lttng.orig/init/Kconfig 2008-06-25 18:11:05.000000000 -0400
+++ linux-2.6.26-rc7-lttng/init/Kconfig 2008-06-25 19:01:50.000000000 -0400
@@ -782,6 +782,13 @@
Say Y here to enable the extended profiling support mechanisms used
by profilers such as OProfile.

+config TRACEPOINTS
+ bool "Activate tracepoints"
+ default y
+ help
+ Place an empty function call at each tracepoint site. Can be
+ dynamically changed for a probe function.
+
config MARKERS
bool "Activate markers"
help
Index: linux-2.6.26-rc7-lttng/kernel/Makefile
===================================================================
--- linux-2.6.26-rc7-lttng.orig/kernel/Makefile 2008-06-25 18:11:04.000000000 -0400
+++ linux-2.6.26-rc7-lttng/kernel/Makefile 2008-06-25 19:01:50.000000000 -0400
@@ -69,6 +69,7 @@
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
obj-$(USE_IMMEDIATE) += immediate.o
obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
obj-$(CONFIG_LATENCYTOP) += latencytop.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
Index: linux-2.6.26-rc7-lttng/include/linux/tracepoint.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc7-lttng/include/linux/tracepoint.h 2008-06-25 19:04:17.000000000 -0400
@@ -0,0 +1,154 @@
+#ifndef _LINUX_TRACEPOINT_H
+#define _LINUX_TRACEPOINT_H
+
+/*
+ * Kernel Tracepoint API.
+ *
+ * See Documentation/tracepoint.txt.
+ *
+ * (C) Copyright 2008 Mathieu Desnoyers <[email protected]>
+ *
+ * Heavily inspired from the Linux Kernel Markers.
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <linux/immediate.h>
+#include <linux/types.h>
+
+struct module;
+struct tracepoint;
+
+struct tracepoint_probe_closure {
+ void *func; /* Callback */
+ void *probe_private; /* Private probe data */
+};
+
+struct tracepoint {
+ const char *name; /* Tracepoint name */
+ DEFINE_IMV(char, state); /* Immediate value state. */
+ struct tracepoint_probe_closure *multi; /* Closures */
+} __attribute__((aligned(8)));
+
+
+#define TPPROTO(args...) args
+#define TPARGS(args...) args
+
+#ifdef CONFIG_TRACEPOINTS
+
+/*
+ * Note : the empty asm volatile with read constraint is used here instead of a
+ * "used" attribute to fix a gcc 4.1.x bug.
+ * Make sure the alignment of the structure in the __markers section will
+ * not add unwanted padding between the beginning of the section and the
+ * structure. Force alignment to the same alignment as the section start.
+ *
+ * The "generic" argument controls which marker enabling mechanism must be used.
+ * If generic is true, a variable read is used.
+ * If generic is false, immediate values are used.
+ */
+#define DEFINE_TRACE(name, proto, args) \
+ static inline void _do_trace_##name(struct tracepoint *tp, proto) \
+ { \
+ int i; \
+ struct tracepoint_probe_closure *multi; \
+ preempt_disable(); \
+ multi = tp->multi; \
+ smp_read_barrier_depends(); \
+ if (multi) { \
+ for (i = 0; multi[i].func; i++) { \
+ ((void(*)(void *private_data, proto)) \
+ (multi[i].func))(multi[i].probe_private, args);\
+ } \
+ } \
+ preempt_enable(); \
+ } \
+ static inline void __trace_##name(int generic, proto) \
+ { \
+ static const char __tpstrtab_##name[] \
+ __attribute__((section("__tracepoints_strings"))) \
+ = #name; \
+ static struct tracepoint __tracepoint_##name \
+ __attribute__((section("__tracepoints"), aligned(8))) = \
+ { __tpstrtab_##name, 0, NULL }; \
+ if (!generic) { \
+ if (unlikely(imv_cond(__tracepoint_##name.state))) { \
+ imv_cond_end(); \
+ _do_trace_##name(&__tracepoint_##name, args); \
+ } else \
+ imv_cond_end(); \
+ } else { \
+ if (unlikely(_imv_read(__tracepoint_##name.state))) \
+ _do_trace_##name(&__tracepoint_##name, args); \
+ } \
+ } \
+ static inline void trace_##name(proto) \
+ { \
+ __trace_##name(0, args); \
+ } \
+ static inline void _trace_##name(proto) \
+ { \
+ __trace_##name(1, args); \
+ } \
+ static inline int register_trace_##name( \
+ void (*probe)(void *private_data, proto), \
+ void *private_data) \
+ { \
+ return tracepoint_probe_register(#name, (void *)probe, \
+ private_data); \
+ } \
+ static inline void unregister_trace_##name( \
+ void (*probe)(void *private_data, proto), \
+ void *private_data) \
+ { \
+ tracepoint_probe_unregister(#name, (void *)probe, \
+ private_data); \
+ }
+
+extern void tracepoint_update_probe_range(struct tracepoint *begin,
+ struct tracepoint *end);
+
+#else /* !CONFIG_TRACEPOINTS */
+#define DEFINE_TRACE(name, proto, args) \
+ static inline void _do_trace_##name(struct tracepoint *tp, proto) \
+ { } \
+ static inline void __trace_##name(int generic, proto) \
+ { } \
+ static inline void trace_##name(proto) \
+ { } \
+ static inline void _trace_##name(proto) \
+ { }
+
+static inline void tracepoint_update_probe_range(struct tracepoint *begin,
+ struct tracepoint *end)
+{ }
+#endif /* CONFIG_TRACEPOINTS */
+
+/*
+ * Connect a probe to a tracepoint.
+ * Internal API, should not be used directly.
+ */
+extern int tracepoint_probe_register(const char *name,
+ void *probe, void *probe_private);
+
+/*
+ * Disconnect a probe from a tracepoint.
+ * Internal API, should not be used directly.
+ */
+extern int tracepoint_probe_unregister(const char *name,
+ void *probe, void *probe_private);
+
+struct tracepoint_iter {
+ struct module *module;
+ struct tracepoint *tracepoint;
+};
+
+extern void tracepoint_iter_start(struct tracepoint_iter *iter);
+extern void tracepoint_iter_next(struct tracepoint_iter *iter);
+extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
+extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
+extern int tracepoint_get_iter_range(struct tracepoint **tracepoint,
+ struct tracepoint *begin, struct tracepoint *end);
+
+#endif
Index: linux-2.6.26-rc7-lttng/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6.26-rc7-lttng.orig/include/asm-generic/vmlinux.lds.h 2008-06-25 18:11:04.000000000 -0400
+++ linux-2.6.26-rc7-lttng/include/asm-generic/vmlinux.lds.h 2008-06-25 18:11:05.000000000 -0400
@@ -53,6 +53,9 @@
VMLINUX_SYMBOL(__start___markers) = .; \
*(__markers) \
VMLINUX_SYMBOL(__stop___markers) = .; \
+ VMLINUX_SYMBOL(__start___tracepoints) = .; \
+ *(__tracepoints) \
+ VMLINUX_SYMBOL(__stop___tracepoints) = .; \
VMLINUX_SYMBOL(__start___imv) = .; \
*(__imv) /* Immediate values: pointers */ \
VMLINUX_SYMBOL(__stop___imv) = .;
@@ -68,6 +71,7 @@
*(__imv_cond_end) /* Immediate condition end pointers */\
VMLINUX_SYMBOL(__stop___imv_cond_end) = .; \
*(__markers_strings) /* Markers: strings */ \
+ *(__tracepoints_strings)/* Tracepoints: strings */ \
} \
\
.rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) { \
Index: linux-2.6.26-rc7-lttng/kernel/tracepoint.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc7-lttng/kernel/tracepoint.c 2008-06-25 18:11:05.000000000 -0400
@@ -0,0 +1,495 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+#include <linux/jhash.h>
+#include <linux/list.h>
+#include <linux/rcupdate.h>
+#include <linux/tracepoint.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/immediate.h>
+
+extern struct tracepoint __start___tracepoints[];
+extern struct tracepoint __stop___tracepoints[];
+
+/* Set to 1 to enable tracepoint debug output */
+static const int tracepoint_debug;
+
+/*
+ * tracepoints_mutex nests inside module_mutex. Tracepoints mutex protects the
+ * builtin and module tracepoints and the hash table.
+ */
+static DEFINE_MUTEX(tracepoints_mutex);
+
+/*
+ * Tracepoint hash table, containing the active tracepoints.
+ * Protected by tracepoints_mutex.
+ */
+#define TRACEPOINT_HASH_BITS 6
+#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
+
+/*
+ * Note about RCU :
+ * It is used to make sure every handler has finished using its private data
+ * between two consecutive operation (add or remove) on a given tracepoint. It
+ * is also used to delay the free of multiple probes array until a quiescent
+ * state is reached.
+ * tracepoint entries modifications are protected by the tracepoints_mutex.
+ */
+struct tracepoint_entry {
+ struct hlist_node hlist;
+ struct tracepoint_probe_closure *multi;
+ int refcount; /* Number of times armed. 0 if disarmed. */
+ struct rcu_head rcu;
+ void *oldptr;
+ unsigned char rcu_pending:1;
+ char name[0];
+};
+
+static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
+
+static void free_old_closure(struct rcu_head *head)
+{
+ struct tracepoint_entry *entry = container_of(head,
+ struct tracepoint_entry, rcu);
+ kfree(entry->oldptr);
+ /* Make sure we free the data before setting the pending flag to 0 */
+ smp_wmb();
+ entry->rcu_pending = 0;
+}
+
+static void debug_print_probes(struct tracepoint_entry *entry)
+{
+ int i;
+
+ if (!tracepoint_debug)
+ return;
+
+ for (i = 0; entry->multi[i].func; i++)
+ printk(KERN_DEBUG "Multi probe %d : %p %p\n", i,
+ entry->multi[i].func,
+ entry->multi[i].probe_private);
+}
+
+static struct tracepoint_probe_closure *
+tracepoint_entry_add_probe(struct tracepoint_entry *entry,
+ void *probe, void *probe_private)
+{
+ int nr_probes = 0;
+ struct tracepoint_probe_closure *old, *new;
+
+ WARN_ON(!probe);
+
+ debug_print_probes(entry);
+ old = entry->multi;
+ if (old) {
+ /* (N -> N+1), (N != 0, 1) probes */
+ for (nr_probes = 0; old[nr_probes].func; nr_probes++)
+ if (old[nr_probes].func == probe
+ && old[nr_probes].probe_private
+ == probe_private)
+ return ERR_PTR(-EBUSY);
+ }
+ /* + 2 : one for new probe, one for NULL func */
+ new = kzalloc((nr_probes + 2) * sizeof(struct tracepoint_probe_closure),
+ GFP_KERNEL);
+ if (new == NULL)
+ return ERR_PTR(-ENOMEM);
+ if (old)
+ memcpy(new, old,
+ nr_probes * sizeof(struct tracepoint_probe_closure));
+ new[nr_probes].func = probe;
+ new[nr_probes].probe_private = probe_private;
+ entry->refcount = nr_probes + 1;
+ entry->multi = new;
+ debug_print_probes(entry);
+ return old;
+}
+
+static struct tracepoint_probe_closure *
+tracepoint_entry_remove_probe(struct tracepoint_entry *entry,
+ void *probe, void *probe_private)
+{
+ int nr_probes = 0, nr_del = 0, i;
+ struct tracepoint_probe_closure *old, *new;
+
+ old = entry->multi;
+
+ debug_print_probes(entry);
+ /* (N -> M), (N > 1, M >= 0) probes */
+ for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
+ if ((!probe || old[nr_probes].func == probe)
+ && old[nr_probes].probe_private
+ == probe_private)
+ nr_del++;
+ }
+
+ if (nr_probes - nr_del == 0) {
+ /* N -> 0, (N > 1) */
+ entry->multi = NULL;
+ entry->refcount = 0;
+ debug_print_probes(entry);
+ return old;
+ } else {
+ int j = 0;
+ /* N -> M, (N > 1, M > 0) */
+ /* + 1 for NULL */
+ new = kzalloc((nr_probes - nr_del + 1)
+ * sizeof(struct tracepoint_probe_closure), GFP_KERNEL);
+ if (new == NULL)
+ return ERR_PTR(-ENOMEM);
+ for (i = 0; old[i].func; i++)
+ if ((probe && old[i].func != probe) ||
+ old[i].probe_private != probe_private)
+ new[j++] = old[i];
+ entry->refcount = nr_probes - nr_del;
+ entry->multi = new;
+ }
+ debug_print_probes(entry);
+ return old;
+}
+
+/*
+ * Get tracepoint if the tracepoint is present in the tracepoint hash table.
+ * Must be called with tracepoints_mutex held.
+ * Returns NULL if not present.
+ */
+static struct tracepoint_entry *get_tracepoint(const char *name)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct tracepoint_entry *e;
+ u32 hash = jhash(name, strlen(name), 0);
+
+ head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+ hlist_for_each_entry(e, node, head, hlist) {
+ if (!strcmp(name, e->name))
+ return e;
+ }
+ return NULL;
+}
+
+/*
+ * Add the tracepoint to the tracepoint hash table. Must be called with
+ * tracepoints_mutex held.
+ */
+static struct tracepoint_entry *add_tracepoint(const char *name)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct tracepoint_entry *e;
+ size_t name_len = strlen(name) + 1;
+ u32 hash = jhash(name, name_len-1, 0);
+
+ head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+ hlist_for_each_entry(e, node, head, hlist) {
+ if (!strcmp(name, e->name)) {
+ printk(KERN_NOTICE
+ "tracepoint %s busy\n", name);
+ return ERR_PTR(-EBUSY); /* Already there */
+ }
+ }
+ /*
+ * Using kmalloc here to allocate a variable length element. Could
+ * cause some memory fragmentation if overused.
+ */
+ e = kmalloc(sizeof(struct tracepoint_entry) + name_len, GFP_KERNEL);
+ if (!e)
+ return ERR_PTR(-ENOMEM);
+ memcpy(&e->name[0], name, name_len);
+ e->multi = NULL;
+ e->refcount = 0;
+ e->rcu_pending = 0;
+ hlist_add_head(&e->hlist, head);
+ return e;
+}
+
+/*
+ * Remove the tracepoint from the tracepoint hash table. Must be called with
+ * mutex_lock held.
+ */
+static int remove_tracepoint(const char *name)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct tracepoint_entry *e;
+ int found = 0;
+ size_t len = strlen(name) + 1;
+ u32 hash = jhash(name, len-1, 0);
+
+ head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+ hlist_for_each_entry(e, node, head, hlist) {
+ if (!strcmp(name, e->name)) {
+ found = 1;
+ break;
+ }
+ }
+ if (!found)
+ return -ENOENT;
+ hlist_del(&e->hlist);
+ /* Make sure the call_rcu has been executed */
+ if (e->rcu_pending)
+ rcu_barrier();
+ kfree(e);
+ return 0;
+}
+
+/*
+ * Sets the probe callback corresponding to one tracepoint.
+ */
+static void set_tracepoint(struct tracepoint_entry **entry,
+ struct tracepoint *elem, int active)
+{
+ WARN_ON(strcmp((*entry)->name, elem->name) != 0);
+
+ smp_wmb();
+ /*
+ * We also make sure that the new probe callbacks array is consistent
+ * before setting a pointer to it.
+ */
+ rcu_assign_pointer(elem->multi, (*entry)->multi);
+ elem->state__imv = active;
+}
+
+/*
+ * Disable a tracepoint and its probe callback.
+ * Note: only waiting an RCU period after setting elem->call to the empty
+ * function insures that the original callback is not used anymore. This insured
+ * by preempt_disable around the call site.
+ */
+static void disable_tracepoint(struct tracepoint *elem)
+{
+ elem->state__imv = 0;
+}
+
+/**
+ * tracepoint_update_probe_range - Update a probe range
+ * @begin: beginning of the range
+ * @end: end of the range
+ *
+ * Updates the probe callback corresponding to a range of tracepoints.
+ */
+void tracepoint_update_probe_range(struct tracepoint *begin,
+ struct tracepoint *end)
+{
+ struct tracepoint *iter;
+ struct tracepoint_entry *mark_entry;
+
+ mutex_lock(&tracepoints_mutex);
+ for (iter = begin; iter < end; iter++) {
+ mark_entry = get_tracepoint(iter->name);
+ if (mark_entry) {
+ set_tracepoint(&mark_entry, iter,
+ !!mark_entry->refcount);
+ } else {
+ disable_tracepoint(iter);
+ }
+ }
+ mutex_unlock(&tracepoints_mutex);
+}
+
+/*
+ * Update probes, removing the faulty probes.
+ */
+static void tracepoint_update_probes(void)
+{
+ /* Core kernel tracepoints */
+ tracepoint_update_probe_range(__start___tracepoints,
+ __stop___tracepoints);
+ /* tracepoints in modules. */
+ module_update_tracepoints();
+ /* Update immediate values */
+ core_imv_update();
+ module_imv_update();
+}
+
+/**
+ * tracepoint_probe_register - Connect a probe to a tracepoint
+ * @name: tracepoint name
+ * @probe: probe handler
+ * @probe_private: probe private data
+ *
+ * private data must be a valid allocated memory address, or NULL.
+ * Returns 0 if ok, error value on error.
+ * The probe address must at least be aligned on the architecture pointer size.
+ */
+int tracepoint_probe_register(const char *name,
+ void *probe, void *probe_private)
+{
+ struct tracepoint_entry *entry;
+ int ret = 0;
+ struct tracepoint_probe_closure *old;
+
+ mutex_lock(&tracepoints_mutex);
+ entry = get_tracepoint(name);
+ if (!entry) {
+ entry = add_tracepoint(name);
+ if (IS_ERR(entry)) {
+ ret = PTR_ERR(entry);
+ goto end;
+ }
+ }
+ /*
+ * If we detect that a call_rcu is pending for this tracepoint,
+ * make sure it's executed now.
+ */
+ if (entry->rcu_pending)
+ rcu_barrier();
+ old = tracepoint_entry_add_probe(entry, probe, probe_private);
+ if (IS_ERR(old)) {
+ ret = PTR_ERR(old);
+ goto end;
+ }
+ mutex_unlock(&tracepoints_mutex);
+ tracepoint_update_probes(); /* may update entry */
+ mutex_lock(&tracepoints_mutex);
+ entry = get_tracepoint(name);
+ WARN_ON(!entry);
+ entry->oldptr = old;
+ entry->rcu_pending = 1;
+ /* write rcu_pending before calling the RCU callback */
+ smp_wmb();
+#ifdef CONFIG_PREEMPT_RCU
+ synchronize_sched(); /* Until we have the call_rcu_sched() */
+#endif
+ call_rcu(&entry->rcu, free_old_closure);
+end:
+ mutex_unlock(&tracepoints_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register);
+
+/**
+ * tracepoint_probe_unregister - Disconnect a probe from a tracepoint
+ * @name: tracepoint name
+ * @probe: probe function pointer
+ * @probe_private: probe private data
+ *
+ * Returns the private data given to tracepoint_probe_register, or an ERR_PTR().
+ * We do not need to call a synchronize_sched to make sure the probes have
+ * finished running before doing a module unload, because the module unload
+ * itself uses stop_machine(), which insures that every preempt disabled section
+ * have finished.
+ */
+int tracepoint_probe_unregister(const char *name,
+ void *probe, void *probe_private)
+{
+ struct tracepoint_entry *entry;
+ struct tracepoint_probe_closure *old;
+ int ret = -ENOENT;
+
+ mutex_lock(&tracepoints_mutex);
+ entry = get_tracepoint(name);
+ if (!entry)
+ goto end;
+ if (entry->rcu_pending)
+ rcu_barrier();
+ old = tracepoint_entry_remove_probe(entry, probe, probe_private);
+ mutex_unlock(&tracepoints_mutex);
+ tracepoint_update_probes(); /* may update entry */
+ mutex_lock(&tracepoints_mutex);
+ entry = get_tracepoint(name);
+ if (!entry)
+ goto end;
+ entry->oldptr = old;
+ entry->rcu_pending = 1;
+ /* write rcu_pending before calling the RCU callback */
+ smp_wmb();
+#ifdef CONFIG_PREEMPT_RCU
+ synchronize_sched(); /* Until we have the call_rcu_sched() */
+#endif
+ call_rcu(&entry->rcu, free_old_closure);
+ remove_tracepoint(name); /* Ignore busy error message */
+ ret = 0;
+end:
+ mutex_unlock(&tracepoints_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_unregister);
+
+/**
+ * tracepoint_get_iter_range - Get a next tracepoint iterator given a range.
+ * @tracepoint: current tracepoints (in), next tracepoint (out)
+ * @begin: beginning of the range
+ * @end: end of the range
+ *
+ * Returns whether a next tracepoint has been found (1) or not (0).
+ * Will return the first tracepoint in the range if the input tracepoint is NULL.
+ */
+int tracepoint_get_iter_range(struct tracepoint **tracepoint,
+ struct tracepoint *begin, struct tracepoint *end)
+{
+ if (!*tracepoint && begin != end) {
+ *tracepoint = begin;
+ return 1;
+ }
+ if (*tracepoint >= begin && *tracepoint < end)
+ return 1;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tracepoint_get_iter_range);
+
+static void tracepoint_get_iter(struct tracepoint_iter *iter)
+{
+ int found = 0;
+
+ /* Core kernel tracepoints */
+ if (!iter->module) {
+ found = tracepoint_get_iter_range(&iter->tracepoint,
+ __start___tracepoints, __stop___tracepoints);
+ if (found)
+ goto end;
+ }
+ /* tracepoints in modules. */
+ found = module_get_iter_tracepoints(iter);
+end:
+ if (!found)
+ tracepoint_iter_reset(iter);
+}
+
+void tracepoint_iter_start(struct tracepoint_iter *iter)
+{
+ tracepoint_get_iter(iter);
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_start);
+
+void tracepoint_iter_next(struct tracepoint_iter *iter)
+{
+ iter->tracepoint++;
+ /*
+ * iter->tracepoint may be invalid because we blindly incremented it.
+ * Make sure it is valid by marshalling on the tracepoints, getting the
+ * tracepoints from following modules if necessary.
+ */
+ tracepoint_get_iter(iter);
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_next);
+
+void tracepoint_iter_stop(struct tracepoint_iter *iter)
+{
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_stop);
+
+void tracepoint_iter_reset(struct tracepoint_iter *iter)
+{
+ iter->module = NULL;
+ iter->tracepoint = NULL;
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_reset);
Index: linux-2.6.26-rc7-lttng/kernel/module.c
===================================================================
--- linux-2.6.26-rc7-lttng.orig/kernel/module.c 2008-06-25 18:11:05.000000000 -0400
+++ linux-2.6.26-rc7-lttng/kernel/module.c 2008-06-25 18:11:05.000000000 -0400
@@ -48,6 +48,7 @@
#include <linux/license.h>
#include <asm/sections.h>
#include <linux/marker.h>
+#include <linux/tracepoint.h>

#if 0
#define DEBUGP printk
@@ -1791,6 +1792,8 @@
unsigned int immediatecondendindex;
unsigned int markersindex;
unsigned int markersstringsindex;
+ unsigned int tracepointsindex;
+ unsigned int tracepointsstringsindex;
struct module *mod;
long err = 0;
void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
@@ -2083,6 +2086,9 @@
markersindex = find_sec(hdr, sechdrs, secstrings, "__markers");
markersstringsindex = find_sec(hdr, sechdrs, secstrings,
"__markers_strings");
+ tracepointsindex = find_sec(hdr, sechdrs, secstrings, "__tracepoints");
+ tracepointsstringsindex = find_sec(hdr, sechdrs, secstrings,
+ "__tracepoints_strings");

/* Now do relocations. */
for (i = 1; i < hdr->e_shnum; i++) {
@@ -2110,6 +2116,12 @@
mod->num_markers =
sechdrs[markersindex].sh_size / sizeof(*mod->markers);
#endif
+#ifdef CONFIG_TRACEPOINTS
+ mod->tracepoints = (void *)sechdrs[tracepointsindex].sh_addr;
+ mod->num_tracepoints =
+ sechdrs[tracepointsindex].sh_size / sizeof(*mod->tracepoints);
+#endif
+

/* Find duplicate symbols */
err = verify_export_symbols(mod);
@@ -2133,8 +2145,16 @@
marker_update_probe_range(mod->markers,
mod->markers + mod->num_markers);
#endif
+#ifdef CONFIG_TRACEPOINTS
+ tracepoint_update_probe_range(mod->tracepoints,
+ mod->tracepoints + mod->num_tracepoints);
+#endif
+
#ifdef USE_IMMEDIATE
- /* Immediate values must be updated after markers */
+ /*
+ * Immediate values must be updated after markers and
+ * tracepoints.
+ */
imv_update_range(mod->immediate,
mod->immediate + mod->num_immediate);
#endif
@@ -2746,6 +2766,55 @@
}
#endif

+#ifdef CONFIG_TRACEPOINTS
+void module_update_tracepoints(void)
+{
+ struct module *mod;
+
+ mutex_lock(&module_mutex);
+ list_for_each_entry(mod, &modules, list)
+ if (!mod->taints)
+ tracepoint_update_probe_range(mod->tracepoints,
+ mod->tracepoints + mod->num_tracepoints);
+ mutex_unlock(&module_mutex);
+}
+
+/*
+ * Returns 0 if current not found.
+ * Returns 1 if current found.
+ */
+int module_get_iter_tracepoints(struct tracepoint_iter *iter)
+{
+ struct module *iter_mod;
+ int found = 0;
+
+ mutex_lock(&module_mutex);
+ list_for_each_entry(iter_mod, &modules, list) {
+ if (!iter_mod->taints) {
+ /*
+ * Sorted module list
+ */
+ if (iter_mod < iter->module)
+ continue;
+ else if (iter_mod > iter->module)
+ iter->tracepoint = NULL;
+ found = tracepoint_get_iter_range(&iter->tracepoint,
+ iter_mod->tracepoints,
+ iter_mod->tracepoints
+ + iter_mod->num_tracepoints);
+ if (found) {
+ iter->module = iter_mod;
+ break;
+ }
+ }
+ }
+ mutex_unlock(&module_mutex);
+ return found;
+}
+#endif
+
+
+
#ifdef USE_IMMEDIATE
/**
* _module_imv_update - update all immediate values in the kernel
Index: linux-2.6.26-rc7-lttng/include/linux/module.h
===================================================================
--- linux-2.6.26-rc7-lttng.orig/include/linux/module.h 2008-06-25 18:11:05.000000000 -0400
+++ linux-2.6.26-rc7-lttng/include/linux/module.h 2008-06-25 18:11:05.000000000 -0400
@@ -17,6 +17,7 @@
#include <linux/moduleparam.h>
#include <linux/immediate.h>
#include <linux/marker.h>
+#include <linux/tracepoint.h>
#include <asm/local.h>

#include <asm/module.h>
@@ -349,6 +350,10 @@
struct marker *markers;
unsigned int num_markers;
#endif
+#ifdef CONFIG_TRACEPOINTS
+ struct tracepoint *tracepoints;
+ unsigned int num_tracepoints;
+#endif
};
#ifndef MODULE_ARCH_INIT
#define MODULE_ARCH_INIT {}
@@ -459,6 +464,9 @@
extern void module_update_markers(void);
extern int module_get_iter_markers(struct marker_iter *iter);

+extern void module_update_tracepoints(void);
+extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
+
#else /* !CONFIG_MODULES... */
#define EXPORT_SYMBOL(sym)
#define EXPORT_SYMBOL_GPL(sym)
@@ -568,6 +576,15 @@
return 0;
}

+static inline void module_update_tracepoints(void)
+{
+}
+
+static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
+{
+ return 0;
+}
+
#endif /* CONFIG_MODULES */

#if defined(CONFIG_MODULES) && defined(USE_IMMEDIATE)

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-26 00:05:29

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH] Tracepoint sched probes

** note : early RFC. This patch applies on a unreleased LTTng for
2.6.26-rc8 patchset.

Scheduler tracepoint probes

Only a single probe is implement as a POC.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: 'Peter Zijlstra' <[email protected]>
CC: "Frank Ch. Eigler" <[email protected]>
CC: 'Ingo Molnar' <[email protected]>
CC: 'Hideo AOKI' <[email protected]>
CC: Takashi Nishiie <[email protected]>
CC: 'Steven Rostedt' <[email protected]>
---
init/Kconfig | 7 +++++++
kernel/Makefile | 1 +
kernel/sched-trace.c | 28 ++++++++++++++++++++++++++++
kernel/sched-trace.h | 11 +++++++++++
kernel/sched.c | 5 +++--
5 files changed, 50 insertions(+), 2 deletions(-)

Index: linux-2.6.26-rc7-lttng/kernel/sched-trace.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc7-lttng/kernel/sched-trace.c 2008-06-25 18:36:11.000000000 -0400
@@ -0,0 +1,28 @@
+#include <linux/module.h>
+#include "sched-trace.h"
+
+static void probe_sched_switch(void *private_data, struct task_struct *prev,
+ struct task_struct *next)
+{
+ trace_mark(kernel_sched_schedule,
+ "prev_pid %d next_pid %d prev_state %ld",
+ prev->pid, next->pid, prev->state);
+}
+
+int __init sched_trace_init(void)
+{
+ return register_trace_sched_switch(probe_sched_switch, NULL);
+}
+
+module_init(sched_trace_init);
+
+void __exit sched_trace_exit(void)
+{
+ unregister_trace_sched_switch(probe_sched_switch, NULL);
+}
+
+module_exit(sched_trace_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mathieu Desnoyers");
+MODULE_DESCRIPTION("Scheduler Tracpoint Probes");
Index: linux-2.6.26-rc7-lttng/kernel/sched-trace.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc7-lttng/kernel/sched-trace.h 2008-06-25 18:36:11.000000000 -0400
@@ -0,0 +1,11 @@
+#ifndef _KERNEL_SCHED_TRACE_H
+#define _KERNEL_SCHED_TRACE_H
+
+#include <linux/sched.h>
+#include <linux/tracepoint.h>
+
+DEFINE_TRACE(sched_switch,
+ TPPROTO(struct task_struct *prev, struct task_struct *next),
+ TPARGS(prev, next));
+
+#endif
Index: linux-2.6.26-rc7-lttng/kernel/sched.c
===================================================================
--- linux-2.6.26-rc7-lttng.orig/kernel/sched.c 2008-06-25 18:35:42.000000000 -0400
+++ linux-2.6.26-rc7-lttng/kernel/sched.c 2008-06-25 18:36:11.000000000 -0400
@@ -75,6 +75,8 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>

+#include "sched-trace.h"
+
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -2458,8 +2460,7 @@
struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);
- trace_mark(internal_kernel_sched_schedule,
- "prev %p next %p", prev, next);
+ trace_sched_switch(prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*
Index: linux-2.6.26-rc7-lttng/kernel/Makefile
===================================================================
--- linux-2.6.26-rc7-lttng.orig/kernel/Makefile 2008-06-25 18:35:42.000000000 -0400
+++ linux-2.6.26-rc7-lttng/kernel/Makefile 2008-06-25 18:36:11.000000000 -0400
@@ -70,6 +70,7 @@
obj-$(USE_IMMEDIATE) += immediate.o
obj-$(CONFIG_MARKERS) += marker.o
obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
+obj-$(CONFIG_SCHED_TRACE) += sched-trace.o
obj-$(CONFIG_LATENCYTOP) += latencytop.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
Index: linux-2.6.26-rc7-lttng/init/Kconfig
===================================================================
--- linux-2.6.26-rc7-lttng.orig/init/Kconfig 2008-06-25 18:36:07.000000000 -0400
+++ linux-2.6.26-rc7-lttng/init/Kconfig 2008-06-25 18:36:26.000000000 -0400
@@ -789,6 +789,13 @@
Place an empty function call at each tracepoint site. Can be
dynamically changed for a probe function.

+config SCHED_TRACE
+ depends on TRACEPOINTS
+ tristate "Scheduler tracing"
+ default y
+ help
+ Build the scheduler tracing probes.
+
config MARKERS
bool "Activate markers"
help

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-26 21:04:42

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

Hi Mathieu,

Thank you for making this so soon!

Mathieu Desnoyers wrote:
> ** note : this patch is submitted for early review. It applies after my
> current unreleased 2.6.26-rc8 LTTng patchset. Comments are welcome.

Would you mean there is no tree on which we can test this patch?

>
> Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.

What would you think redesigning markers on tracepoints? because most of the
logic (scaning sections, multiple probe and activation) seems very similar
to markers.

For example, (not complete, I just thought :-))

struct tracepoint {
const char *name; /* Tracepoint name */
DEFINE_IMV(char, state); /* Immediate value state. */
struct tracepoint_probe_closure *multi; /* Closures */
void * callsite_data; /* private date from call site */
} __attribute__((aligned(8)));

#define __tracepoint_block(generic, name, data, func, args)
static const char __tpstrtab_##name[] \
__attribute__((section("__tracepoints_strings"))) \
= #name; \
static struct tracepoint __tracepoint_##name \
__attribute__((section("__tracepoints"), aligned(8))) = \
{ __tpstrtab_##name, 0, NULL, data}; \
if (!generic) { \
if (unlikely(imv_cond(__tracepoint_##name.state))) { \
imv_cond_end(); \
func(&__tracepoint_##name, args); \
} else \
imv_cond_end(); \
} else { \
if (unlikely(_imv_read(__tracepoint_##name.state))) \
func(&__tracepoint_##name, args); \
}

struct marker {
const char *name; /* Marker name */
const char *format; /* Marker format string, describing the
* variable argument list.
*/
} __attribute__((aligned(8)));

#define trace_mark(name, fmt, args...) \
do { \
static const char __mstrtab_##name[] \
__attribute__((section("__markers_strings"))) \
= #name "\0" fmt; \
static struct marker __marker_##name \
__attribute__((section("__markers"), aligned(8))) = \
{ __mstrtab_##name, &__mstrtab_##name[sizeof(#name)]}; \
__tracepoint_block(1, name, __marker_##name, marker_probe_cb, args) \
} while (0)

>
> Allows complete typing verification. No format string required.
>
> TODO : Documentation/tracepoint.txt
[...]
> +/*
> + * Note : the empty asm volatile with read constraint is used here instead of a
> + * "used" attribute to fix a gcc 4.1.x bug.as
> + * Make sure the alignment of the structure in the __markers section will
> + * not add unwanted padding between the beginning of the section and the
> + * structure. Force alignment to the same alignment as the section start.

this comment should be updated...

> + *
> + * The "generic" argument controls which marker enabling mechanism must be used.
> + * If generic is true, a variable read is used.
> + * If generic is false, immediate values are used.
> + */
> +#define DEFINE_TRACE(name, proto, args) \
> + static inline void _do_trace_##name(struct tracepoint *tp, proto) \
> + { \
> + int i; \
> + struct tracepoint_probe_closure *multi; \
> + preempt_disable(); \
> + multi = tp->multi; \
> + smp_read_barrier_depends(); \
> + if (multi) { \
> + for (i = 0; multi[i].func; i++) { \
> + ((void(*)(void *private_data, proto)) \
> + (multi[i].func))(multi[i].probe_private, args);\
> + } \
> + } \
> + preempt_enable(); \
> + } \
> + static inline void __trace_##name(int generic, proto) \
> + { \
> + static const char __tpstrtab_##name[] \
> + __attribute__((section("__tracepoints_strings"))) \
> + = #name; \
> + static struct tracepoint __tracepoint_##name \
> + __attribute__((section("__tracepoints"), aligned(8))) = \
> + { __tpstrtab_##name, 0, NULL }; \
> + if (!generic) { \
> + if (unlikely(imv_cond(__tracepoint_##name.state))) { \
> + imv_cond_end(); \
> + _do_trace_##name(&__tracepoint_##name, args); \
> + } else \
> + imv_cond_end(); \
> + } else { \
> + if (unlikely(_imv_read(__tracepoint_##name.state))) \
> + _do_trace_##name(&__tracepoint_##name, args); \
> + } \
> + } \
> + static inline void trace_##name(proto) \
> + { \
> + __trace_##name(0, args); \
> + } \
> + static inline void _trace_##name(proto) \
> + { \
> + __trace_##name(1, args); \
> + } \
> + static inline int register_trace_##name( \
> + void (*probe)(void *private_data, proto), \
> + void *private_data) \
> + { \
> + return tracepoint_probe_register(#name, (void *)probe, \
> + private_data); \
> + } \
> + static inline void unregister_trace_##name( \
> + void (*probe)(void *private_data, proto), \
> + void *private_data) \
> + { \
> + tracepoint_probe_unregister(#name, (void *)probe, \
> + private_data); \
> + }

Out of curiousity, what the private_data is for?

> +
> +extern void tracepoint_update_probe_range(struct tracepoint *begin,
> + struct tracepoint *end);
> +
> +#else /* !CONFIG_TRACEPOINTS */
> +#define DEFINE_TRACE(name, proto, args) \
> + static inline void _do_trace_##name(struct tracepoint *tp, proto) \
> + { } \
> + static inline void __trace_##name(int generic, proto) \
> + { } \
> + static inline void trace_##name(proto) \
> + { } \
> + static inline void _trace_##name(proto) \
> + { }

By the way, I think you'd better add below two inlines.

static inline int register_trace_##name( \
void (*probe)(void *private_data, proto), \
void *private_data) \
{ return -ENOSYS; }
static inline void unregister_trace_##name( \
void (*probe)(void *private_data, proto), \
void *private_data) \
{ }


> +
> +static inline void tracepoint_update_probe_range(struct tracepoint *begin,
> + struct tracepoint *end)
> +{ }
> +#endif /* CONFIG_TRACEPOINTS */
> +
> +/*
> + * Connect a probe to a tracepoint.
> + * Internal API, should not be used directly.
> + */
> +extern int tracepoint_probe_register(const char *name,
> + void *probe, void *probe_private);
> +
> +/*
> + * Disconnect a probe from a tracepoint.
> + * Internal API, should not be used directly.
> + */
> +extern int tracepoint_probe_unregister(const char *name,
> + void *probe, void *probe_private);
> +
> +struct tracepoint_iter {
> + struct module *module;
> + struct tracepoint *tracepoint;
> +};
> +
> +extern void tracepoint_iter_start(struct tracepoint_iter *iter);
> +extern void tracepoint_iter_next(struct tracepoint_iter *iter);
> +extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
> +extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
> +extern int tracepoint_get_iter_range(struct tracepoint **tracepoint,
> + struct tracepoint *begin, struct tracepoint *end);
> +
> +#endif
[...]


Best regards,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-27 13:15:23

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

* Masami Hiramatsu ([email protected]) wrote:
> Hi Mathieu,
>
> Thank you for making this so soon!
>

Hi Masami,

Thanks for the comments, I will rework the patch accordingly.

Also, one thing I thought about yesterday which I dislike is that if we
have two modules declaring the same tracepoint in different headers with
different prototypes, each declaration will be valid but the
registration will try to connect a probe expecting wrong parameters to
the other tracepoint.

It would be the case if someone does :

drivers/somedrivera/mydriver1-trace.h

DECLARE_TRACE(really_generic_name, TPPTOTO(void), TPARGS()));


drivers/somedriverb/mydriver2-trace.h

DECLARE_TRACE(really_generic_name, TPPTOTO(struct somestruct *s), TPARGS(s)));

Do you think it's worth it to append the prototype string to the
tracepoint name ? I think it should fix the problem.

Mathieu

> Mathieu Desnoyers wrote:
> > ** note : this patch is submitted for early review. It applies after my
> > current unreleased 2.6.26-rc8 LTTng patchset. Comments are welcome.
>
> Would you mean there is no tree on which we can test this patch?
>
> >
> > Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
>
> What would you think redesigning markers on tracepoints? because most of the
> logic (scaning sections, multiple probe and activation) seems very similar
> to markers.
>
> For example, (not complete, I just thought :-))
>
> struct tracepoint {
> const char *name; /* Tracepoint name */
> DEFINE_IMV(char, state); /* Immediate value state. */
> struct tracepoint_probe_closure *multi; /* Closures */
> void * callsite_data; /* private date from call site */
> } __attribute__((aligned(8)));
>
> #define __tracepoint_block(generic, name, data, func, args)
> static const char __tpstrtab_##name[] \
> __attribute__((section("__tracepoints_strings"))) \
> = #name; \
> static struct tracepoint __tracepoint_##name \
> __attribute__((section("__tracepoints"), aligned(8))) = \
> { __tpstrtab_##name, 0, NULL, data}; \
> if (!generic) { \
> if (unlikely(imv_cond(__tracepoint_##name.state))) { \
> imv_cond_end(); \
> func(&__tracepoint_##name, args); \
> } else \
> imv_cond_end(); \
> } else { \
> if (unlikely(_imv_read(__tracepoint_##name.state))) \
> func(&__tracepoint_##name, args); \
> }
>
> struct marker {
> const char *name; /* Marker name */
> const char *format; /* Marker format string, describing the
> * variable argument list.
> */
> } __attribute__((aligned(8)));
>
> #define trace_mark(name, fmt, args...) \
> do { \
> static const char __mstrtab_##name[] \
> __attribute__((section("__markers_strings"))) \
> = #name "\0" fmt; \
> static struct marker __marker_##name \
> __attribute__((section("__markers"), aligned(8))) = \
> { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)]}; \
> __tracepoint_block(1, name, __marker_##name, marker_probe_cb, args) \
> } while (0)
>
> >
> > Allows complete typing verification. No format string required.
> >
> > TODO : Documentation/tracepoint.txt
> [...]
> > +/*
> > + * Note : the empty asm volatile with read constraint is used here instead of a
> > + * "used" attribute to fix a gcc 4.1.x bug.as
> > + * Make sure the alignment of the structure in the __markers section will
> > + * not add unwanted padding between the beginning of the section and the
> > + * structure. Force alignment to the same alignment as the section start.
>
> this comment should be updated...
>
> > + *
> > + * The "generic" argument controls which marker enabling mechanism must be used.
> > + * If generic is true, a variable read is used.
> > + * If generic is false, immediate values are used.
> > + */
> > +#define DEFINE_TRACE(name, proto, args) \
> > + static inline void _do_trace_##name(struct tracepoint *tp, proto) \
> > + { \
> > + int i; \
> > + struct tracepoint_probe_closure *multi; \
> > + preempt_disable(); \
> > + multi = tp->multi; \
> > + smp_read_barrier_depends(); \
> > + if (multi) { \
> > + for (i = 0; multi[i].func; i++) { \
> > + ((void(*)(void *private_data, proto)) \
> > + (multi[i].func))(multi[i].probe_private, args);\
> > + } \
> > + } \
> > + preempt_enable(); \
> > + } \
> > + static inline void __trace_##name(int generic, proto) \
> > + { \
> > + static const char __tpstrtab_##name[] \
> > + __attribute__((section("__tracepoints_strings"))) \
> > + = #name; \
> > + static struct tracepoint __tracepoint_##name \
> > + __attribute__((section("__tracepoints"), aligned(8))) = \
> > + { __tpstrtab_##name, 0, NULL }; \
> > + if (!generic) { \
> > + if (unlikely(imv_cond(__tracepoint_##name.state))) { \
> > + imv_cond_end(); \
> > + _do_trace_##name(&__tracepoint_##name, args); \
> > + } else \
> > + imv_cond_end(); \
> > + } else { \
> > + if (unlikely(_imv_read(__tracepoint_##name.state))) \
> > + _do_trace_##name(&__tracepoint_##name, args); \
> > + } \
> > + } \
> > + static inline void trace_##name(proto) \
> > + { \
> > + __trace_##name(0, args); \
> > + } \
> > + static inline void _trace_##name(proto) \
> > + { \
> > + __trace_##name(1, args); \
> > + } \
> > + static inline int register_trace_##name( \
> > + void (*probe)(void *private_data, proto), \
> > + void *private_data) \
> > + { \
> > + return tracepoint_probe_register(#name, (void *)probe, \
> > + private_data); \
> > + } \
> > + static inline void unregister_trace_##name( \
> > + void (*probe)(void *private_data, proto), \
> > + void *private_data) \
> > + { \
> > + tracepoint_probe_unregister(#name, (void *)probe, \
> > + private_data); \
> > + }
>
> Out of curiousity, what the private_data is for?
>
> > +
> > +extern void tracepoint_update_probe_range(struct tracepoint *begin,
> > + struct tracepoint *end);
> > +
> > +#else /* !CONFIG_TRACEPOINTS */
> > +#define DEFINE_TRACE(name, proto, args) \
> > + static inline void _do_trace_##name(struct tracepoint *tp, proto) \
> > + { } \
> > + static inline void __trace_##name(int generic, proto) \
> > + { } \
> > + static inline void trace_##name(proto) \
> > + { } \
> > + static inline void _trace_##name(proto) \
> > + { }
>
> By the way, I think you'd better add below two inlines.
>
> static inline int register_trace_##name( \
> void (*probe)(void *private_data, proto), \
> void *private_data) \
> { return -ENOSYS; }
> static inline void unregister_trace_##name( \
> void (*probe)(void *private_data, proto), \
> void *private_data) \
> { }
>
>
> > +
> > +static inline void tracepoint_update_probe_range(struct tracepoint *begin,
> > + struct tracepoint *end)
> > +{ }
> > +#endif /* CONFIG_TRACEPOINTS */
> > +
> > +/*
> > + * Connect a probe to a tracepoint.
> > + * Internal API, should not be used directly.
> > + */
> > +extern int tracepoint_probe_register(const char *name,
> > + void *probe, void *probe_private);
> > +
> > +/*
> > + * Disconnect a probe from a tracepoint.
> > + * Internal API, should not be used directly.
> > + */
> > +extern int tracepoint_probe_unregister(const char *name,
> > + void *probe, void *probe_private);
> > +
> > +struct tracepoint_iter {
> > + struct module *module;
> > + struct tracepoint *tracepoint;
> > +};
> > +
> > +extern void tracepoint_iter_start(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_next(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
> > +extern int tracepoint_get_iter_range(struct tracepoint **tracepoint,
> > + struct tracepoint *begin, struct tracepoint *end);
> > +
> > +#endif
> [...]
>
>
> Best regards,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-27 13:16:07

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

* Masami Hiramatsu ([email protected]) wrote:
> Hi Mathieu,
>
> Thank you for making this so soon!
>
> Mathieu Desnoyers wrote:
> > ** note : this patch is submitted for early review. It applies after my
> > current unreleased 2.6.26-rc8 LTTng patchset. Comments are welcome.
>
> Would you mean there is no tree on which we can test this patch?
>

It actually applies on top of the following patchset :

http://ltt.polymtl.ca/lttng/patch-2.6.26-rc8-0.10-pre56.tar.bz2

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-27 13:35:28

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

* Masami Hiramatsu ([email protected]) wrote:
>
> > Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
>
> What would you think redesigning markers on tracepoints? because most of the
> logic (scaning sections, multiple probe and activation) seems very similar
> to markers.
>

We could, although markers, because they use var args, allow to put the
iteration on the multi probe array out-of-line. Tracepoints cannot
afford this and the iteration must be done at the initial call-site.

>From what I see in your proposal, it's mostly to extract the if() call()
code from the inner __trace_mark() macro and to put it in a separate
macro, am I correct ? This would make the macro more readable.

> For example, (not complete, I just thought :-))
>
> struct tracepoint {
> const char *name; /* Tracepoint name */
> DEFINE_IMV(char, state); /* Immediate value state. */
> struct tracepoint_probe_closure *multi; /* Closures */
> void * callsite_data; /* private date from call site */
> } __attribute__((aligned(8)));
>
> #define __tracepoint_block(generic, name, data, func, args)
> static const char __tpstrtab_##name[] \
> __attribute__((section("__tracepoints_strings"))) \
> = #name; \
> static struct tracepoint __tracepoint_##name \
> __attribute__((section("__tracepoints"), aligned(8))) = \
> { __tpstrtab_##name, 0, NULL, data}; \
> if (!generic) { \
> if (unlikely(imv_cond(__tracepoint_##name.state))) { \
> imv_cond_end(); \
> func(&__tracepoint_##name, args); \
> } else \
> imv_cond_end(); \
> } else { \
> if (unlikely(_imv_read(__tracepoint_##name.state))) \
> func(&__tracepoint_##name, args); \
> }
>
> struct marker {
> const char *name; /* Marker name */
> const char *format; /* Marker format string, describing the
> * variable argument list.
> */
> } __attribute__((aligned(8)));
>
> #define trace_mark(name, fmt, args...) \
> do { \
> static const char __mstrtab_##name[] \
> __attribute__((section("__markers_strings"))) \
> = #name "\0" fmt; \
> static struct marker __marker_##name \
> __attribute__((section("__markers"), aligned(8))) = \
> { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)]}; \
> __tracepoint_block(1, name, __marker_##name, marker_probe_cb, args) \
> } while (0)
>
> >
[...]
> > + static inline int register_trace_##name( \
> > + void (*probe)(void *private_data, proto), \
> > + void *private_data) \
> > + { \
> > + return tracepoint_probe_register(#name, (void *)probe, \
> > + private_data); \
> > + } \
> > + static inline void unregister_trace_##name( \
> > + void (*probe)(void *private_data, proto), \
> > + void *private_data) \
> > + { \
> > + tracepoint_probe_unregister(#name, (void *)probe, \
> > + private_data); \
> > + }
>
> Out of curiousity, what the private_data is for?
>

When a probe is registered, it gives more flexibility to be able to pass
a pointer to private data associated with that probe. For instance, if a
tracer needs to register the same probe to many different tracepoints,
but having a different context associated with each, it will pass the
same function pointer with different private_data to the registration
function.

> > +
> > +extern void tracepoint_update_probe_range(struct tracepoint *begin,
> > + struct tracepoint *end);
> > +
> > +#else /* !CONFIG_TRACEPOINTS */
> > +#define DEFINE_TRACE(name, proto, args) \
> > + static inline void _do_trace_##name(struct tracepoint *tp, proto) \
> > + { } \
> > + static inline void __trace_##name(int generic, proto) \
> > + { } \
> > + static inline void trace_##name(proto) \
> > + { } \
> > + static inline void _trace_##name(proto) \
> > + { }
>
> By the way, I think you'd better add below two inlines.
>
> static inline int register_trace_##name( \
> void (*probe)(void *private_data, proto), \
> void *private_data) \
> { return -ENOSYS; }
> static inline void unregister_trace_##name( \
> void (*probe)(void *private_data, proto), \
> void *private_data) \
> { }
>

My original thought was that if tracepoints are disabled, the probe
objects should not even be built. But I can foresee that they will not
always be in separate objects, so it makes sense. Will add.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-27 13:42:20

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints (update)

Following Masami's review....


Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.

Allows complete typing verification. No format string required.

TODO : Documentation/tracepoint.txt

Changelog :
- Use #name "_" #proto as string to identify the tracepoint in the
tracepoint table. This will make sure not type mismatch happens due to
connexion of a probe with the wrong type to a tracepoint declared with
the same name in a different header.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: 'Peter Zijlstra' <[email protected]>
CC: "Frank Ch. Eigler" <[email protected]>
CC: 'Ingo Molnar' <[email protected]>
CC: 'Hideo AOKI' <[email protected]>
CC: Takashi Nishiie <[email protected]>
CC: 'Steven Rostedt' <[email protected]>
---
include/asm-generic/vmlinux.lds.h | 4
include/linux/module.h | 17 +
include/linux/tracepoint.h | 165 ++++++++++++
init/Kconfig | 7
kernel/Makefile | 1
kernel/module.c | 71 +++++
kernel/tracepoint.c | 495 ++++++++++++++++++++++++++++++++++++++
7 files changed, 759 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/init/Kconfig
===================================================================
--- linux-2.6-lttng.orig/init/Kconfig 2008-06-26 09:05:14.000000000 -0400
+++ linux-2.6-lttng/init/Kconfig 2008-06-27 09:16:16.000000000 -0400
@@ -782,6 +782,13 @@ config PROFILING
Say Y here to enable the extended profiling support mechanisms used
by profilers such as OProfile.

+config TRACEPOINTS
+ bool "Activate tracepoints"
+ default y
+ help
+ Place an empty function call at each tracepoint site. Can be
+ dynamically changed for a probe function.
+
config MARKERS
bool "Activate markers"
help
Index: linux-2.6-lttng/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/kernel/Makefile 2008-06-26 09:05:12.000000000 -0400
+++ linux-2.6-lttng/kernel/Makefile 2008-06-27 09:16:16.000000000 -0400
@@ -69,6 +69,7 @@ obj-$(CONFIG_TASK_DELAY_ACCT) += delayac
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
obj-$(USE_IMMEDIATE) += immediate.o
obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
obj-$(CONFIG_LATENCYTOP) += latencytop.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
Index: linux-2.6-lttng/include/linux/tracepoint.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/tracepoint.h 2008-06-27 09:30:28.000000000 -0400
@@ -0,0 +1,165 @@
+#ifndef _LINUX_TRACEPOINT_H
+#define _LINUX_TRACEPOINT_H
+
+/*
+ * Kernel Tracepoint API.
+ *
+ * See Documentation/tracepoint.txt.
+ *
+ * (C) Copyright 2008 Mathieu Desnoyers <[email protected]>
+ *
+ * Heavily inspired from the Linux Kernel Markers.
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <linux/immediate.h>
+#include <linux/types.h>
+
+struct module;
+struct tracepoint;
+
+struct tracepoint_probe_closure {
+ void *func; /* Callback */
+ void *probe_private; /* Private probe data */
+};
+
+struct tracepoint {
+ const char *name; /* Tracepoint name */
+ DEFINE_IMV(char, state); /* Immediate value state. */
+ struct tracepoint_probe_closure *multi; /* Closures */
+} __attribute__((aligned(8)));
+
+
+#define TPPROTO(args...) args
+#define TPARGS(args...) args
+
+#ifdef CONFIG_TRACEPOINTS
+
+/*
+ * Note : the empty asm volatile with read constraint is used here instead of a
+ * "used" attribute to fix a gcc 4.1.x bug.
+ * Make sure the alignment of the structure in the __tracepoints section will
+ * not add unwanted padding between the beginning of the section and the
+ * structure. Force alignment to the same alignment as the section start.
+ *
+ * The "generic" argument, passed to the declared __trace_##name inline function
+ * controls which tracepoint enabling mechanism must be used.
+ * If generic is true, a variable read is used.
+ * If generic is false, immediate values are used.
+ */
+#define DEFINE_TRACE(name, proto, args) \
+ static inline void _do_trace_##name(struct tracepoint *tp, proto) \
+ { \
+ int i; \
+ struct tracepoint_probe_closure *multi; \
+ preempt_disable(); \
+ multi = tp->multi; \
+ smp_read_barrier_depends(); \
+ if (multi) { \
+ for (i = 0; multi[i].func; i++) { \
+ ((void(*)(void *private_data, proto)) \
+ (multi[i].func))(multi[i].probe_private, args);\
+ } \
+ } \
+ preempt_enable(); \
+ } \
+ static inline void __trace_##name(int generic, proto) \
+ { \
+ static const char __tpstrtab_##name[] \
+ __attribute__((section("__tracepoints_strings"))) \
+ = #name "_" #proto; \
+ static struct tracepoint __tracepoint_##name \
+ __attribute__((section("__tracepoints"), aligned(8))) = \
+ { __tpstrtab_##name, 0, NULL }; \
+ if (!generic) { \
+ if (unlikely(imv_cond(__tracepoint_##name.state))) { \
+ imv_cond_end(); \
+ _do_trace_##name(&__tracepoint_##name, args); \
+ } else \
+ imv_cond_end(); \
+ } else { \
+ if (unlikely(_imv_read(__tracepoint_##name.state))) \
+ _do_trace_##name(&__tracepoint_##name, args); \
+ } \
+ } \
+ static inline void trace_##name(proto) \
+ { \
+ __trace_##name(0, args); \
+ } \
+ static inline void _trace_##name(proto) \
+ { \
+ __trace_##name(1, args); \
+ } \
+ static inline int register_trace_##name( \
+ void (*probe)(void *private_data, proto), \
+ void *private_data) \
+ { \
+ return tracepoint_probe_register(#name "_" #proto, \
+ (void *)probe, private_data); \
+ } \
+ static inline void unregister_trace_##name( \
+ void (*probe)(void *private_data, proto), \
+ void *private_data) \
+ { \
+ tracepoint_probe_unregister(#name "_" #proto, \
+ (void *)probe, private_data); \
+ }
+
+extern void tracepoint_update_probe_range(struct tracepoint *begin,
+ struct tracepoint *end);
+
+#else /* !CONFIG_TRACEPOINTS */
+#define DEFINE_TRACE(name, proto, args) \
+ static inline void _do_trace_##name(struct tracepoint *tp, proto) \
+ { } \
+ static inline void __trace_##name(int generic, proto) \
+ { } \
+ static inline void trace_##name(proto) \
+ { } \
+ static inline void _trace_##name(proto) \
+ { } \
+ static inline int register_trace_##name( \
+ void (*probe)(void *private_data, proto), \
+ void *private_data) \
+ { \
+ return -ENOSYS; \
+ } \
+ static inline void unregister_trace_##name( \
+ void (*probe)(void *private_data, proto), \
+ void *private_data) \
+ { }
+
+static inline void tracepoint_update_probe_range(struct tracepoint *begin,
+ struct tracepoint *end)
+{ }
+#endif /* CONFIG_TRACEPOINTS */
+
+/*
+ * Connect a probe to a tracepoint.
+ * Internal API, should not be used directly.
+ */
+extern int tracepoint_probe_register(const char *name,
+ void *probe, void *probe_private);
+
+/*
+ * Disconnect a probe from a tracepoint.
+ * Internal API, should not be used directly.
+ */
+extern int tracepoint_probe_unregister(const char *name,
+ void *probe, void *probe_private);
+
+struct tracepoint_iter {
+ struct module *module;
+ struct tracepoint *tracepoint;
+};
+
+extern void tracepoint_iter_start(struct tracepoint_iter *iter);
+extern void tracepoint_iter_next(struct tracepoint_iter *iter);
+extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
+extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
+extern int tracepoint_get_iter_range(struct tracepoint **tracepoint,
+ struct tracepoint *begin, struct tracepoint *end);
+
+#endif
Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h 2008-06-26 09:05:12.000000000 -0400
+++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h 2008-06-26 09:05:15.000000000 -0400
@@ -53,6 +53,9 @@
VMLINUX_SYMBOL(__start___markers) = .; \
*(__markers) \
VMLINUX_SYMBOL(__stop___markers) = .; \
+ VMLINUX_SYMBOL(__start___tracepoints) = .; \
+ *(__tracepoints) \
+ VMLINUX_SYMBOL(__stop___tracepoints) = .; \
VMLINUX_SYMBOL(__start___imv) = .; \
*(__imv) /* Immediate values: pointers */ \
VMLINUX_SYMBOL(__stop___imv) = .;
@@ -68,6 +71,7 @@
*(__imv_cond_end) /* Immediate condition end pointers */\
VMLINUX_SYMBOL(__stop___imv_cond_end) = .; \
*(__markers_strings) /* Markers: strings */ \
+ *(__tracepoints_strings)/* Tracepoints: strings */ \
} \
\
.rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) { \
Index: linux-2.6-lttng/kernel/tracepoint.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/tracepoint.c 2008-06-26 09:05:15.000000000 -0400
@@ -0,0 +1,495 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+#include <linux/jhash.h>
+#include <linux/list.h>
+#include <linux/rcupdate.h>
+#include <linux/tracepoint.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/immediate.h>
+
+extern struct tracepoint __start___tracepoints[];
+extern struct tracepoint __stop___tracepoints[];
+
+/* Set to 1 to enable tracepoint debug output */
+static const int tracepoint_debug;
+
+/*
+ * tracepoints_mutex nests inside module_mutex. Tracepoints mutex protects the
+ * builtin and module tracepoints and the hash table.
+ */
+static DEFINE_MUTEX(tracepoints_mutex);
+
+/*
+ * Tracepoint hash table, containing the active tracepoints.
+ * Protected by tracepoints_mutex.
+ */
+#define TRACEPOINT_HASH_BITS 6
+#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
+
+/*
+ * Note about RCU :
+ * It is used to make sure every handler has finished using its private data
+ * between two consecutive operation (add or remove) on a given tracepoint. It
+ * is also used to delay the free of multiple probes array until a quiescent
+ * state is reached.
+ * tracepoint entries modifications are protected by the tracepoints_mutex.
+ */
+struct tracepoint_entry {
+ struct hlist_node hlist;
+ struct tracepoint_probe_closure *multi;
+ int refcount; /* Number of times armed. 0 if disarmed. */
+ struct rcu_head rcu;
+ void *oldptr;
+ unsigned char rcu_pending:1;
+ char name[0];
+};
+
+static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
+
+static void free_old_closure(struct rcu_head *head)
+{
+ struct tracepoint_entry *entry = container_of(head,
+ struct tracepoint_entry, rcu);
+ kfree(entry->oldptr);
+ /* Make sure we free the data before setting the pending flag to 0 */
+ smp_wmb();
+ entry->rcu_pending = 0;
+}
+
+static void debug_print_probes(struct tracepoint_entry *entry)
+{
+ int i;
+
+ if (!tracepoint_debug)
+ return;
+
+ for (i = 0; entry->multi[i].func; i++)
+ printk(KERN_DEBUG "Multi probe %d : %p %p\n", i,
+ entry->multi[i].func,
+ entry->multi[i].probe_private);
+}
+
+static struct tracepoint_probe_closure *
+tracepoint_entry_add_probe(struct tracepoint_entry *entry,
+ void *probe, void *probe_private)
+{
+ int nr_probes = 0;
+ struct tracepoint_probe_closure *old, *new;
+
+ WARN_ON(!probe);
+
+ debug_print_probes(entry);
+ old = entry->multi;
+ if (old) {
+ /* (N -> N+1), (N != 0, 1) probes */
+ for (nr_probes = 0; old[nr_probes].func; nr_probes++)
+ if (old[nr_probes].func == probe
+ && old[nr_probes].probe_private
+ == probe_private)
+ return ERR_PTR(-EBUSY);
+ }
+ /* + 2 : one for new probe, one for NULL func */
+ new = kzalloc((nr_probes + 2) * sizeof(struct tracepoint_probe_closure),
+ GFP_KERNEL);
+ if (new == NULL)
+ return ERR_PTR(-ENOMEM);
+ if (old)
+ memcpy(new, old,
+ nr_probes * sizeof(struct tracepoint_probe_closure));
+ new[nr_probes].func = probe;
+ new[nr_probes].probe_private = probe_private;
+ entry->refcount = nr_probes + 1;
+ entry->multi = new;
+ debug_print_probes(entry);
+ return old;
+}
+
+static struct tracepoint_probe_closure *
+tracepoint_entry_remove_probe(struct tracepoint_entry *entry,
+ void *probe, void *probe_private)
+{
+ int nr_probes = 0, nr_del = 0, i;
+ struct tracepoint_probe_closure *old, *new;
+
+ old = entry->multi;
+
+ debug_print_probes(entry);
+ /* (N -> M), (N > 1, M >= 0) probes */
+ for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
+ if ((!probe || old[nr_probes].func == probe)
+ && old[nr_probes].probe_private
+ == probe_private)
+ nr_del++;
+ }
+
+ if (nr_probes - nr_del == 0) {
+ /* N -> 0, (N > 1) */
+ entry->multi = NULL;
+ entry->refcount = 0;
+ debug_print_probes(entry);
+ return old;
+ } else {
+ int j = 0;
+ /* N -> M, (N > 1, M > 0) */
+ /* + 1 for NULL */
+ new = kzalloc((nr_probes - nr_del + 1)
+ * sizeof(struct tracepoint_probe_closure), GFP_KERNEL);
+ if (new == NULL)
+ return ERR_PTR(-ENOMEM);
+ for (i = 0; old[i].func; i++)
+ if ((probe && old[i].func != probe) ||
+ old[i].probe_private != probe_private)
+ new[j++] = old[i];
+ entry->refcount = nr_probes - nr_del;
+ entry->multi = new;
+ }
+ debug_print_probes(entry);
+ return old;
+}
+
+/*
+ * Get tracepoint if the tracepoint is present in the tracepoint hash table.
+ * Must be called with tracepoints_mutex held.
+ * Returns NULL if not present.
+ */
+static struct tracepoint_entry *get_tracepoint(const char *name)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct tracepoint_entry *e;
+ u32 hash = jhash(name, strlen(name), 0);
+
+ head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+ hlist_for_each_entry(e, node, head, hlist) {
+ if (!strcmp(name, e->name))
+ return e;
+ }
+ return NULL;
+}
+
+/*
+ * Add the tracepoint to the tracepoint hash table. Must be called with
+ * tracepoints_mutex held.
+ */
+static struct tracepoint_entry *add_tracepoint(const char *name)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct tracepoint_entry *e;
+ size_t name_len = strlen(name) + 1;
+ u32 hash = jhash(name, name_len-1, 0);
+
+ head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+ hlist_for_each_entry(e, node, head, hlist) {
+ if (!strcmp(name, e->name)) {
+ printk(KERN_NOTICE
+ "tracepoint %s busy\n", name);
+ return ERR_PTR(-EBUSY); /* Already there */
+ }
+ }
+ /*
+ * Using kmalloc here to allocate a variable length element. Could
+ * cause some memory fragmentation if overused.
+ */
+ e = kmalloc(sizeof(struct tracepoint_entry) + name_len, GFP_KERNEL);
+ if (!e)
+ return ERR_PTR(-ENOMEM);
+ memcpy(&e->name[0], name, name_len);
+ e->multi = NULL;
+ e->refcount = 0;
+ e->rcu_pending = 0;
+ hlist_add_head(&e->hlist, head);
+ return e;
+}
+
+/*
+ * Remove the tracepoint from the tracepoint hash table. Must be called with
+ * mutex_lock held.
+ */
+static int remove_tracepoint(const char *name)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct tracepoint_entry *e;
+ int found = 0;
+ size_t len = strlen(name) + 1;
+ u32 hash = jhash(name, len-1, 0);
+
+ head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
+ hlist_for_each_entry(e, node, head, hlist) {
+ if (!strcmp(name, e->name)) {
+ found = 1;
+ break;
+ }
+ }
+ if (!found)
+ return -ENOENT;
+ hlist_del(&e->hlist);
+ /* Make sure the call_rcu has been executed */
+ if (e->rcu_pending)
+ rcu_barrier();
+ kfree(e);
+ return 0;
+}
+
+/*
+ * Sets the probe callback corresponding to one tracepoint.
+ */
+static void set_tracepoint(struct tracepoint_entry **entry,
+ struct tracepoint *elem, int active)
+{
+ WARN_ON(strcmp((*entry)->name, elem->name) != 0);
+
+ smp_wmb();
+ /*
+ * We also make sure that the new probe callbacks array is consistent
+ * before setting a pointer to it.
+ */
+ rcu_assign_pointer(elem->multi, (*entry)->multi);
+ elem->state__imv = active;
+}
+
+/*
+ * Disable a tracepoint and its probe callback.
+ * Note: only waiting an RCU period after setting elem->call to the empty
+ * function insures that the original callback is not used anymore. This insured
+ * by preempt_disable around the call site.
+ */
+static void disable_tracepoint(struct tracepoint *elem)
+{
+ elem->state__imv = 0;
+}
+
+/**
+ * tracepoint_update_probe_range - Update a probe range
+ * @begin: beginning of the range
+ * @end: end of the range
+ *
+ * Updates the probe callback corresponding to a range of tracepoints.
+ */
+void tracepoint_update_probe_range(struct tracepoint *begin,
+ struct tracepoint *end)
+{
+ struct tracepoint *iter;
+ struct tracepoint_entry *mark_entry;
+
+ mutex_lock(&tracepoints_mutex);
+ for (iter = begin; iter < end; iter++) {
+ mark_entry = get_tracepoint(iter->name);
+ if (mark_entry) {
+ set_tracepoint(&mark_entry, iter,
+ !!mark_entry->refcount);
+ } else {
+ disable_tracepoint(iter);
+ }
+ }
+ mutex_unlock(&tracepoints_mutex);
+}
+
+/*
+ * Update probes, removing the faulty probes.
+ */
+static void tracepoint_update_probes(void)
+{
+ /* Core kernel tracepoints */
+ tracepoint_update_probe_range(__start___tracepoints,
+ __stop___tracepoints);
+ /* tracepoints in modules. */
+ module_update_tracepoints();
+ /* Update immediate values */
+ core_imv_update();
+ module_imv_update();
+}
+
+/**
+ * tracepoint_probe_register - Connect a probe to a tracepoint
+ * @name: tracepoint name
+ * @probe: probe handler
+ * @probe_private: probe private data
+ *
+ * private data must be a valid allocated memory address, or NULL.
+ * Returns 0 if ok, error value on error.
+ * The probe address must at least be aligned on the architecture pointer size.
+ */
+int tracepoint_probe_register(const char *name,
+ void *probe, void *probe_private)
+{
+ struct tracepoint_entry *entry;
+ int ret = 0;
+ struct tracepoint_probe_closure *old;
+
+ mutex_lock(&tracepoints_mutex);
+ entry = get_tracepoint(name);
+ if (!entry) {
+ entry = add_tracepoint(name);
+ if (IS_ERR(entry)) {
+ ret = PTR_ERR(entry);
+ goto end;
+ }
+ }
+ /*
+ * If we detect that a call_rcu is pending for this tracepoint,
+ * make sure it's executed now.
+ */
+ if (entry->rcu_pending)
+ rcu_barrier();
+ old = tracepoint_entry_add_probe(entry, probe, probe_private);
+ if (IS_ERR(old)) {
+ ret = PTR_ERR(old);
+ goto end;
+ }
+ mutex_unlock(&tracepoints_mutex);
+ tracepoint_update_probes(); /* may update entry */
+ mutex_lock(&tracepoints_mutex);
+ entry = get_tracepoint(name);
+ WARN_ON(!entry);
+ entry->oldptr = old;
+ entry->rcu_pending = 1;
+ /* write rcu_pending before calling the RCU callback */
+ smp_wmb();
+#ifdef CONFIG_PREEMPT_RCU
+ synchronize_sched(); /* Until we have the call_rcu_sched() */
+#endif
+ call_rcu(&entry->rcu, free_old_closure);
+end:
+ mutex_unlock(&tracepoints_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register);
+
+/**
+ * tracepoint_probe_unregister - Disconnect a probe from a tracepoint
+ * @name: tracepoint name
+ * @probe: probe function pointer
+ * @probe_private: probe private data
+ *
+ * Returns the private data given to tracepoint_probe_register, or an ERR_PTR().
+ * We do not need to call a synchronize_sched to make sure the probes have
+ * finished running before doing a module unload, because the module unload
+ * itself uses stop_machine(), which insures that every preempt disabled section
+ * have finished.
+ */
+int tracepoint_probe_unregister(const char *name,
+ void *probe, void *probe_private)
+{
+ struct tracepoint_entry *entry;
+ struct tracepoint_probe_closure *old;
+ int ret = -ENOENT;
+
+ mutex_lock(&tracepoints_mutex);
+ entry = get_tracepoint(name);
+ if (!entry)
+ goto end;
+ if (entry->rcu_pending)
+ rcu_barrier();
+ old = tracepoint_entry_remove_probe(entry, probe, probe_private);
+ mutex_unlock(&tracepoints_mutex);
+ tracepoint_update_probes(); /* may update entry */
+ mutex_lock(&tracepoints_mutex);
+ entry = get_tracepoint(name);
+ if (!entry)
+ goto end;
+ entry->oldptr = old;
+ entry->rcu_pending = 1;
+ /* write rcu_pending before calling the RCU callback */
+ smp_wmb();
+#ifdef CONFIG_PREEMPT_RCU
+ synchronize_sched(); /* Until we have the call_rcu_sched() */
+#endif
+ call_rcu(&entry->rcu, free_old_closure);
+ remove_tracepoint(name); /* Ignore busy error message */
+ ret = 0;
+end:
+ mutex_unlock(&tracepoints_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_unregister);
+
+/**
+ * tracepoint_get_iter_range - Get a next tracepoint iterator given a range.
+ * @tracepoint: current tracepoints (in), next tracepoint (out)
+ * @begin: beginning of the range
+ * @end: end of the range
+ *
+ * Returns whether a next tracepoint has been found (1) or not (0).
+ * Will return the first tracepoint in the range if the input tracepoint is NULL.
+ */
+int tracepoint_get_iter_range(struct tracepoint **tracepoint,
+ struct tracepoint *begin, struct tracepoint *end)
+{
+ if (!*tracepoint && begin != end) {
+ *tracepoint = begin;
+ return 1;
+ }
+ if (*tracepoint >= begin && *tracepoint < end)
+ return 1;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tracepoint_get_iter_range);
+
+static void tracepoint_get_iter(struct tracepoint_iter *iter)
+{
+ int found = 0;
+
+ /* Core kernel tracepoints */
+ if (!iter->module) {
+ found = tracepoint_get_iter_range(&iter->tracepoint,
+ __start___tracepoints, __stop___tracepoints);
+ if (found)
+ goto end;
+ }
+ /* tracepoints in modules. */
+ found = module_get_iter_tracepoints(iter);
+end:
+ if (!found)
+ tracepoint_iter_reset(iter);
+}
+
+void tracepoint_iter_start(struct tracepoint_iter *iter)
+{
+ tracepoint_get_iter(iter);
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_start);
+
+void tracepoint_iter_next(struct tracepoint_iter *iter)
+{
+ iter->tracepoint++;
+ /*
+ * iter->tracepoint may be invalid because we blindly incremented it.
+ * Make sure it is valid by marshalling on the tracepoints, getting the
+ * tracepoints from following modules if necessary.
+ */
+ tracepoint_get_iter(iter);
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_next);
+
+void tracepoint_iter_stop(struct tracepoint_iter *iter)
+{
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_stop);
+
+void tracepoint_iter_reset(struct tracepoint_iter *iter)
+{
+ iter->module = NULL;
+ iter->tracepoint = NULL;
+}
+EXPORT_SYMBOL_GPL(tracepoint_iter_reset);
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c 2008-06-26 09:05:15.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c 2008-06-26 09:05:15.000000000 -0400
@@ -48,6 +48,7 @@
#include <linux/license.h>
#include <asm/sections.h>
#include <linux/marker.h>
+#include <linux/tracepoint.h>

#if 0
#define DEBUGP printk
@@ -1791,6 +1792,8 @@ static struct module *load_module(void _
unsigned int immediatecondendindex;
unsigned int markersindex;
unsigned int markersstringsindex;
+ unsigned int tracepointsindex;
+ unsigned int tracepointsstringsindex;
struct module *mod;
long err = 0;
void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
@@ -2083,6 +2086,9 @@ static struct module *load_module(void _
markersindex = find_sec(hdr, sechdrs, secstrings, "__markers");
markersstringsindex = find_sec(hdr, sechdrs, secstrings,
"__markers_strings");
+ tracepointsindex = find_sec(hdr, sechdrs, secstrings, "__tracepoints");
+ tracepointsstringsindex = find_sec(hdr, sechdrs, secstrings,
+ "__tracepoints_strings");

/* Now do relocations. */
for (i = 1; i < hdr->e_shnum; i++) {
@@ -2110,6 +2116,12 @@ static struct module *load_module(void _
mod->num_markers =
sechdrs[markersindex].sh_size / sizeof(*mod->markers);
#endif
+#ifdef CONFIG_TRACEPOINTS
+ mod->tracepoints = (void *)sechdrs[tracepointsindex].sh_addr;
+ mod->num_tracepoints =
+ sechdrs[tracepointsindex].sh_size / sizeof(*mod->tracepoints);
+#endif
+

/* Find duplicate symbols */
err = verify_export_symbols(mod);
@@ -2133,8 +2145,16 @@ static struct module *load_module(void _
marker_update_probe_range(mod->markers,
mod->markers + mod->num_markers);
#endif
+#ifdef CONFIG_TRACEPOINTS
+ tracepoint_update_probe_range(mod->tracepoints,
+ mod->tracepoints + mod->num_tracepoints);
+#endif
+
#ifdef USE_IMMEDIATE
- /* Immediate values must be updated after markers */
+ /*
+ * Immediate values must be updated after markers and
+ * tracepoints.
+ */
imv_update_range(mod->immediate,
mod->immediate + mod->num_immediate);
#endif
@@ -2746,6 +2766,55 @@ int module_get_iter_markers(struct marke
}
#endif

+#ifdef CONFIG_TRACEPOINTS
+void module_update_tracepoints(void)
+{
+ struct module *mod;
+
+ mutex_lock(&module_mutex);
+ list_for_each_entry(mod, &modules, list)
+ if (!mod->taints)
+ tracepoint_update_probe_range(mod->tracepoints,
+ mod->tracepoints + mod->num_tracepoints);
+ mutex_unlock(&module_mutex);
+}
+
+/*
+ * Returns 0 if current not found.
+ * Returns 1 if current found.
+ */
+int module_get_iter_tracepoints(struct tracepoint_iter *iter)
+{
+ struct module *iter_mod;
+ int found = 0;
+
+ mutex_lock(&module_mutex);
+ list_for_each_entry(iter_mod, &modules, list) {
+ if (!iter_mod->taints) {
+ /*
+ * Sorted module list
+ */
+ if (iter_mod < iter->module)
+ continue;
+ else if (iter_mod > iter->module)
+ iter->tracepoint = NULL;
+ found = tracepoint_get_iter_range(&iter->tracepoint,
+ iter_mod->tracepoints,
+ iter_mod->tracepoints
+ + iter_mod->num_tracepoints);
+ if (found) {
+ iter->module = iter_mod;
+ break;
+ }
+ }
+ }
+ mutex_unlock(&module_mutex);
+ return found;
+}
+#endif
+
+
+
#ifdef USE_IMMEDIATE
/**
* _module_imv_update - update all immediate values in the kernel
Index: linux-2.6-lttng/include/linux/module.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/module.h 2008-06-26 09:05:15.000000000 -0400
+++ linux-2.6-lttng/include/linux/module.h 2008-06-26 09:05:15.000000000 -0400
@@ -17,6 +17,7 @@
#include <linux/moduleparam.h>
#include <linux/immediate.h>
#include <linux/marker.h>
+#include <linux/tracepoint.h>
#include <asm/local.h>

#include <asm/module.h>
@@ -349,6 +350,10 @@ struct module
struct marker *markers;
unsigned int num_markers;
#endif
+#ifdef CONFIG_TRACEPOINTS
+ struct tracepoint *tracepoints;
+ unsigned int num_tracepoints;
+#endif
};
#ifndef MODULE_ARCH_INIT
#define MODULE_ARCH_INIT {}
@@ -459,6 +464,9 @@ extern void list_modules(void *call_data
extern void module_update_markers(void);
extern int module_get_iter_markers(struct marker_iter *iter);

+extern void module_update_tracepoints(void);
+extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
+
#else /* !CONFIG_MODULES... */
#define EXPORT_SYMBOL(sym)
#define EXPORT_SYMBOL_GPL(sym)
@@ -568,6 +576,15 @@ static inline int module_get_iter_marker
return 0;
}

+static inline void module_update_tracepoints(void)
+{
+}
+
+static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
+{
+ return 0;
+}
+
#endif /* CONFIG_MODULES */

#if defined(CONFIG_MODULES) && defined(USE_IMMEDIATE)
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-27 21:00:20

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

Mathieu Desnoyers wrote:
> * Masami Hiramatsu ([email protected]) wrote:
> >
>>> Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
>> What would you think redesigning markers on tracepoints? because most of the
>> logic (scaning sections, multiple probe and activation) seems very similar
>> to markers.
>>
>
> We could, although markers, because they use var args, allow to put the
> iteration on the multi probe array out-of-line. Tracepoints cannot
> afford this and the iteration must be done at the initial call-site.
>
> From what I see in your proposal, it's mostly to extract the if() call()
> code from the inner __trace_mark() macro and to put it in a separate
> macro, am I correct ? This would make the macro more readable.

Sure, I think marker and tracepoint can share below functions;
- definition of static local variables in specific sections
- probe activation code (if() call())
- multi probe handling
Then, marker just exports marker_strings sections.

>> For example, (not complete, I just thought :-))
>>
>> struct tracepoint {
>> const char *name; /* Tracepoint name */
>> DEFINE_IMV(char, state); /* Immediate value state. */
>> struct tracepoint_probe_closure *multi; /* Closures */
>> void * callsite_data; /* private date from call site */
>> } __attribute__((aligned(8)));
>>
>> #define __tracepoint_block(generic, name, data, func, args)
>> static const char __tpstrtab_##name[] \
>> __attribute__((section("__tracepoints_strings"))) \
>> = #name; \
>> static struct tracepoint __tracepoint_##name \
>> __attribute__((section("__tracepoints"), aligned(8))) = \
>> { __tpstrtab_##name, 0, NULL, data}; \
>> if (!generic) { \
>> if (unlikely(imv_cond(__tracepoint_##name.state))) { \
>> imv_cond_end(); \
>> func(&__tracepoint_##name, args); \
>> } else \
>> imv_cond_end(); \
>> } else { \
>> if (unlikely(_imv_read(__tracepoint_##name.state))) \
>> func(&__tracepoint_##name, args); \
>> }


So, in my idea, __trace_##name() also uses __tracepoint_block() for
avoiding code duplication.


> [...]
>>> + static inline int register_trace_##name( \
>>> + void (*probe)(void *private_data, proto), \
>>> + void *private_data) \
>>> + { \
>>> + return tracepoint_probe_register(#name, (void *)probe, \
>>> + private_data); \
>>> + } \
>>> + static inline void unregister_trace_##name( \
>>> + void (*probe)(void *private_data, proto), \
>>> + void *private_data) \
>>> + { \
>>> + tracepoint_probe_unregister(#name, (void *)probe, \
>>> + private_data); \
>>> + }
>> Out of curiousity, what the private_data is for?
>>
>
> When a probe is registered, it gives more flexibility to be able to pass
> a pointer to private data associated with that probe. For instance, if a
> tracer needs to register the same probe to many different tracepoints,
> but having a different context associated with each, it will pass the
> same function pointer with different private_data to the registration
> function.

Hmm, only for tracepoint, it might be not so useful, because
most of tracepoint's prototypes are different and so we can't
use same probe to those tracepoints.
Anyway, it is useful for more general probe(ex. markers) if that
is implemented on tracepoint ;-)


Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-27 22:47:47

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints



Mathieu Desnoyers wrote:
> * Masami Hiramatsu ([email protected]) wrote:
>> Hi Mathieu,
>>
>> Thank you for making this so soon!
>>
>
> Hi Masami,
>
> Thanks for the comments, I will rework the patch accordingly.
>
> Also, one thing I thought about yesterday which I dislike is that if we
> have two modules declaring the same tracepoint in different headers with
> different prototypes, each declaration will be valid but the
> registration will try to connect a probe expecting wrong parameters to
> the other tracepoint.
>
> It would be the case if someone does :
>
> drivers/somedrivera/mydriver1-trace.h
>
> DECLARE_TRACE(really_generic_name, TPPTOTO(void), TPARGS()));
>
>
> drivers/somedriverb/mydriver2-trace.h
>
> DECLARE_TRACE(really_generic_name, TPPTOTO(struct somestruct *s), TPARGS(s)));
>
> Do you think it's worth it to append the prototype string to the
> tracepoint name ? I think it should fix the problem.

Hmm, I think we'd better send a fix patch to them in that case.
(I hope we can find that kind of conflicts soon)
I think we can make an external tool which detect those conflicts.
Anyway, signature based checking idea is good to me. I think ":" is
better delimiter.


Thank you,


--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-30 15:40:22

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

* Masami Hiramatsu ([email protected]) wrote:
> Mathieu Desnoyers wrote:
> > * Masami Hiramatsu ([email protected]) wrote:
> > >
> >>> Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
> >> What would you think redesigning markers on tracepoints? because most of the
> >> logic (scaning sections, multiple probe and activation) seems very similar
> >> to markers.
> >>
> >
> > We could, although markers, because they use var args, allow to put the
> > iteration on the multi probe array out-of-line. Tracepoints cannot
> > afford this and the iteration must be done at the initial call-site.
> >
> > From what I see in your proposal, it's mostly to extract the if() call()
> > code from the inner __trace_mark() macro and to put it in a separate
> > macro, am I correct ? This would make the macro more readable.
>
> Sure, I think marker and tracepoint can share below functions;
> - definition of static local variables in specific sections

Given that we could want to keep activation of tracepoints and markers
separate (so they don't share the same namespace), declaring the static
variables in separated sections seems to make sense to me.

> - probe activation code (if() call())
> - multi probe handling

Hrm, the thing here is that because markers allow to do the iteration on
the multiple probe callbacks within an internal wrapper (instead of
doing it on-site as in the tracepoints), it allows to do some further
optimizations (less memory allocation and less pointer dereference in
the single probe case, not having to prepare the va_args in the
MARK_NOARGS case) which are only done because it does not have to add
code to the instrumentation site. However, tracepoints cannot have such
"generic" wrapper and we have to do the iteration on callbacks in the
code added to the instrumented object. Therefore, I keep it as small as
possible in terms of bytes of instructions.

> Then, marker just exports marker_strings sections.
>
> >> For example, (not complete, I just thought :-))
> >>
> >> struct tracepoint {
> >> const char *name; /* Tracepoint name */
> >> DEFINE_IMV(char, state); /* Immediate value state. */
> >> struct tracepoint_probe_closure *multi; /* Closures */
> >> void * callsite_data; /* private date from call site */
> >> } __attribute__((aligned(8)));
> >>
> >> #define __tracepoint_block(generic, name, data, func, args)
> >> static const char __tpstrtab_##name[] \
> >> __attribute__((section("__tracepoints_strings"))) \
> >> = #name; \
> >> static struct tracepoint __tracepoint_##name \
> >> __attribute__((section("__tracepoints"), aligned(8))) = \
> >> { __tpstrtab_##name, 0, NULL, data}; \
> >> if (!generic) { \
> >> if (unlikely(imv_cond(__tracepoint_##name.state))) { \
> >> imv_cond_end(); \
> >> func(&__tracepoint_##name, args); \
> >> } else \
> >> imv_cond_end(); \
> >> } else { \
> >> if (unlikely(_imv_read(__tracepoint_##name.state))) \
> >> func(&__tracepoint_##name, args); \
> >> }
>
>
> So, in my idea, __trace_##name() also uses __tracepoint_block() for
> avoiding code duplication.
>
>
> > [...]
> >>> + static inline int register_trace_##name( \
> >>> + void (*probe)(void *private_data, proto), \
> >>> + void *private_data) \
> >>> + { \
> >>> + return tracepoint_probe_register(#name, (void *)probe, \
> >>> + private_data); \
> >>> + } \
> >>> + static inline void unregister_trace_##name( \
> >>> + void (*probe)(void *private_data, proto), \
> >>> + void *private_data) \
> >>> + { \
> >>> + tracepoint_probe_unregister(#name, (void *)probe, \
> >>> + private_data); \
> >>> + }
> >> Out of curiousity, what the private_data is for?
> >>
> >
> > When a probe is registered, it gives more flexibility to be able to pass
> > a pointer to private data associated with that probe. For instance, if a
> > tracer needs to register the same probe to many different tracepoints,
> > but having a different context associated with each, it will pass the
> > same function pointer with different private_data to the registration
> > function.
>
> Hmm, only for tracepoint, it might be not so useful, because
> most of tracepoint's prototypes are different and so we can't
> use same probe to those tracepoints.
> Anyway, it is useful for more general probe(ex. markers) if that
> is implemented on tracepoint ;-)
>

The usefulness of private_data in the tracepoints is indeed
debatable, but given that we may have scenarios where code allocates its
own data structure and has to pass it efficiently to the tracepoint
callback, I think private_data can become quite useful at that point.
It's useful whenever you have a tracer which can generate more than one
trace, or collect more than one type of statistics depending on the
user's needs.

Mathieu

>
> Thank you,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-30 15:49:17

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

* Masami Hiramatsu ([email protected]) wrote:
>
>
> Mathieu Desnoyers wrote:
> > * Masami Hiramatsu ([email protected]) wrote:
> >> Hi Mathieu,
> >>
> >> Thank you for making this so soon!
> >>
> >
> > Hi Masami,
> >
> > Thanks for the comments, I will rework the patch accordingly.
> >
> > Also, one thing I thought about yesterday which I dislike is that if we
> > have two modules declaring the same tracepoint in different headers with
> > different prototypes, each declaration will be valid but the
> > registration will try to connect a probe expecting wrong parameters to
> > the other tracepoint.
> >
> > It would be the case if someone does :
> >
> > drivers/somedrivera/mydriver1-trace.h
> >
> > DECLARE_TRACE(really_generic_name, TPPTOTO(void), TPARGS()));
> >
> >
> > drivers/somedriverb/mydriver2-trace.h
> >
> > DECLARE_TRACE(really_generic_name, TPPTOTO(struct somestruct *s), TPARGS(s)));
> >
> > Do you think it's worth it to append the prototype string to the
> > tracepoint name ? I think it should fix the problem.
>
> Hmm, I think we'd better send a fix patch to them in that case.
> (I hope we can find that kind of conflicts soon)
> I think we can make an external tool which detect those conflicts.

Hrm, ideally, we could output this information in a file which looks
like the Marker file generated by depmod, and automatically check for
duplicate tracepoints with different prototypes. That should also deal
with modules built outside of the kernel tree.

> Anyway, signature based checking idea is good to me. I think ":" is
> better delimiter.

Yes, changing it for :.

If we append the prototype to the tracepoint name, then at least we can
later add the consistency check in depmod.

Mathieu

>
>
> Thank you,
>
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-30 19:40:27

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

Hi Mathieu,

Mathieu Desnoyers wrote:
> * Masami Hiramatsu ([email protected]) wrote:
>> Hi Mathieu,
>>
>> Thank you for making this so soon!
>>
>> Mathieu Desnoyers wrote:
>>> ** note : this patch is submitted for early review. It applies after my
>>> current unreleased 2.6.26-rc8 LTTng patchset. Comments are welcome.
>> Would you mean there is no tree on which we can test this patch?
>>
>
> It actually applies on top of the following patchset :
>
> http://ltt.polymtl.ca/lttng/patch-2.6.26-rc8-0.10-pre56.tar.bz2

Thank you,

I think, this tracepoint patch might be better moved before
immediate and LTTng's marking point patches, because current -mm/-next
tree doesn't have immediate, and all marking points are replaced
by tracepoints.

Regards,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-06-30 20:00:44

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

Hi Mathieu,

Mathieu Desnoyers wrote:
> * Masami Hiramatsu ([email protected]) wrote:
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu ([email protected]) wrote:
>>> >
>>>>> Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
>>>> What would you think redesigning markers on tracepoints? because most of the
>>>> logic (scaning sections, multiple probe and activation) seems very similar
>>>> to markers.
>>>>
>>> We could, although markers, because they use var args, allow to put the
>>> iteration on the multi probe array out-of-line. Tracepoints cannot
>>> afford this and the iteration must be done at the initial call-site.
>>>
>>> From what I see in your proposal, it's mostly to extract the if() call()
>>> code from the inner __trace_mark() macro and to put it in a separate
>>> macro, am I correct ? This would make the macro more readable.
>> Sure, I think marker and tracepoint can share below functions;
>> - definition of static local variables in specific sections
>
> Given that we could want to keep activation of tracepoints and markers
> separate (so they don't share the same namespace), declaring the static
> variables in separated sections seems to make sense to me.

Sorry, I'm not sure what is "separate activation".
As far as I can see, both tracepoints and markers are activated
when its probe handlers are registered on each tracepoint/marker.
Aren't it separated?

I did not mean integrating registering interfaces, but
I think that they can share base(internal) functions.
for example, both of them has XXX_update_range/_module_XXX_update etc.

IMHO, current code is not so good for maintenance. there are
many code duplications (ex. kernel/module.c, I think
that both of them (and imv too?) can share the code for
handling its section and iterating entries). I'm not sure those
duplications are acceptable.

>> - probe activation code (if() call())
>> - multi probe handling
>
> Hrm, the thing here is that because markers allow to do the iteration on
> the multiple probe callbacks within an internal wrapper (instead of
> doing it on-site as in the tracepoints), it allows to do some further
> optimizations (less memory allocation and less pointer dereference in
> the single probe case, not having to prepare the va_args in the
> MARK_NOARGS case) which are only done because it does not have to add
> code to the instrumentation site. However, tracepoints cannot have such
> "generic" wrapper and we have to do the iteration on callbacks in the
> code added to the instrumented object. Therefore, I keep it as small as
> possible in terms of bytes of instructions.

OK, I see. So, __tracepoint_block() macro can specify handler function.
what would you think about it?

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-07-03 15:15:35

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

* Masami Hiramatsu ([email protected]) wrote:
> Hi Mathieu,
>
> Mathieu Desnoyers wrote:
> > * Masami Hiramatsu ([email protected]) wrote:
> >> Mathieu Desnoyers wrote:
> >>> * Masami Hiramatsu ([email protected]) wrote:
> >>> >
> >>>>> Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
> >>>> What would you think redesigning markers on tracepoints? because most of the
> >>>> logic (scaning sections, multiple probe and activation) seems very similar
> >>>> to markers.
> >>>>
> >>> We could, although markers, because they use var args, allow to put the
> >>> iteration on the multi probe array out-of-line. Tracepoints cannot
> >>> afford this and the iteration must be done at the initial call-site.
> >>>
> >>> From what I see in your proposal, it's mostly to extract the if() call()
> >>> code from the inner __trace_mark() macro and to put it in a separate
> >>> macro, am I correct ? This would make the macro more readable.
> >> Sure, I think marker and tracepoint can share below functions;
> >> - definition of static local variables in specific sections
> >
> > Given that we could want to keep activation of tracepoints and markers
> > separate (so they don't share the same namespace), declaring the static
> > variables in separated sections seems to make sense to me.
>
> Sorry, I'm not sure what is "separate activation".
> As far as I can see, both tracepoints and markers are activated
> when its probe handlers are registered on each tracepoint/marker.
> Aren't it separated?
>

Yes, it is separate. This is insured by the fact that markers and
tracepoints are declared in different sections. Therefore, even if they
have the same "name", they won't be used by each other.

> I did not mean integrating registering interfaces, but
> I think that they can share base(internal) functions.
> for example, both of them has XXX_update_range/_module_XXX_update etc.
>

Hrm, but the sections and symbols on which these function iterate are
different. I am unsure it's worth trying to merge such tiny functions.

> IMHO, current code is not so good for maintenance. there are
> many code duplications (ex. kernel/module.c, I think
> that both of them (and imv too?) can share the code for
> handling its section and iterating entries). I'm not sure those
> duplications are acceptable.

Given it's only slmost one-liners, and that there is some ordering to
keep (markers and tracepoints must be updated before immediate values
because they might influence the immediate value state), I don't think
having a special section for these callbacks (a little bit like
initcalls, but for module load) is a good option.

>
> >> - probe activation code (if() call())
> >> - multi probe handling
> >
> > Hrm, the thing here is that because markers allow to do the iteration on
> > the multiple probe callbacks within an internal wrapper (instead of
> > doing it on-site as in the tracepoints), it allows to do some further
> > optimizations (less memory allocation and less pointer dereference in
> > the single probe case, not having to prepare the va_args in the
> > MARK_NOARGS case) which are only done because it does not have to add
> > code to the instrumentation site. However, tracepoints cannot have such
> > "generic" wrapper and we have to do the iteration on callbacks in the
> > code added to the instrumented object. Therefore, I keep it as small as
> > possible in terms of bytes of instructions.
>
> OK, I see. So, __tracepoint_block() macro can specify handler function.
> what would you think about it?
>

When I originally designed the markers, I tried to make sure there was
absolutely no code duplication until I discovered that trying to read a
huge amount of nested macros is just a pain starting from a certain
level. If we only save a few duplicated lines but end up tying up
markers and tracepoints, I am far from certain that it will make the
code more readable.

I'll post a tracepoint version with the modifications you proposed (it's
now placed earlier in the patchset), except the merge with markers.

Mathieu

> Thank you,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-07-03 15:30:23

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints (update)

Hi Mathieu,

I found a bug.
Mathieu Desnoyers wrote:
[...]

> +/*
> + * Remove the tracepoint from the tracepoint hash table. Must be called with
> + * mutex_lock held.
> + */
> +static int remove_tracepoint(const char *name)
> +{
> + struct hlist_head *head;
> + struct hlist_node *node;
> + struct tracepoint_entry *e;
> + int found = 0;
> + size_t len = strlen(name) + 1;
> + u32 hash = jhash(name, len-1, 0);
> +
> + head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> + hlist_for_each_entry(e, node, head, hlist) {
> + if (!strcmp(name, e->name)) {
> + found = 1;
> + break;
> + }
> + }
> + if (!found)
> + return -ENOENT;

before removing, you have to ensure refcount == 0. So,
if (e->refcount != 0)
return -EBUSY;

> + hlist_del(&e->hlist);
> + /* Make sure the call_rcu has been executed */
> + if (e->rcu_pending)
> + rcu_barrier();
> + kfree(e);
> + return 0;
> +}

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-07-03 15:47:27

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints (update)

* Masami Hiramatsu ([email protected]) wrote:
> Hi Mathieu,
>
> I found a bug.
> Mathieu Desnoyers wrote:
> [...]
>
> > +/*
> > + * Remove the tracepoint from the tracepoint hash table. Must be called with
> > + * mutex_lock held.
> > + */
> > +static int remove_tracepoint(const char *name)
> > +{
> > + struct hlist_head *head;
> > + struct hlist_node *node;
> > + struct tracepoint_entry *e;
> > + int found = 0;
> > + size_t len = strlen(name) + 1;
> > + u32 hash = jhash(name, len-1, 0);
> > +
> > + head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> > + hlist_for_each_entry(e, node, head, hlist) {
> > + if (!strcmp(name, e->name)) {
> > + found = 1;
> > + break;
> > + }
> > + }
> > + if (!found)
> > + return -ENOENT;
>
> before removing, you have to ensure refcount == 0. So,
> if (e->refcount != 0)
> return -EBUSY;
>

Ah, right, I had


if (e->single.func != __mark_empty_function)
return -EBUSY;

in the markers to test this and did not replace this test.

Thanks,

Mathieu

> > + hlist_del(&e->hlist);
> > + /* Make sure the call_rcu has been executed */
> > + if (e->rcu_pending)
> > + rcu_barrier();
> > + kfree(e);
> > + return 0;
> > +}
>
> Thank you,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-07-03 18:18:54

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints (update)

Hi Masami,

I am thinking about removing the private_data, because such declaration
won't work :

DEFINE_TRACE(kernel_irq_exit,
TPPROTO(void),
TPARGS());

because the arguments become :

(void *private_data, void)

Any better idea ?

Mathieu

* Masami Hiramatsu ([email protected]) wrote:
> Hi Mathieu,
>
> I found a bug.
> Mathieu Desnoyers wrote:
> [...]
>
> > +/*
> > + * Remove the tracepoint from the tracepoint hash table. Must be called with
> > + * mutex_lock held.
> > + */
> > +static int remove_tracepoint(const char *name)
> > +{
> > + struct hlist_head *head;
> > + struct hlist_node *node;
> > + struct tracepoint_entry *e;
> > + int found = 0;
> > + size_t len = strlen(name) + 1;
> > + u32 hash = jhash(name, len-1, 0);
> > +
> > + head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> > + hlist_for_each_entry(e, node, head, hlist) {
> > + if (!strcmp(name, e->name)) {
> > + found = 1;
> > + break;
> > + }
> > + }
> > + if (!found)
> > + return -ENOENT;
>
> before removing, you have to ensure refcount == 0. So,
> if (e->refcount != 0)
> return -EBUSY;
>
> > + hlist_del(&e->hlist);
> > + /* Make sure the call_rcu has been executed */
> > + if (e->rcu_pending)
> > + rcu_barrier();
> > + kfree(e);
> > + return 0;
> > +}
>
> Thank you,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-07-03 18:49:17

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints (update)

Hi Mathieu,

Mathieu Desnoyers wrote:
> Hi Masami,
>
> I am thinking about removing the private_data, because such declaration
> won't work :
>
> DEFINE_TRACE(kernel_irq_exit,
> TPPROTO(void),
> TPARGS());
>
> because the arguments become :
>
> (void *private_data, void)

Oh, you right.
since there is no tracepoint user who uses private_data,
we can remove it.

> Any better idea ?

If you need it, I think you can introduce some macros like below.

#define TPPROTO(args...) (void *private_data, args)

#define __define_trace_register(name, proto) \ /*shared by DEFINE_TRACE/DEFINE_TRACE_NOARG*/
static inline int register_trace_##name(\
void (*probe)proto, \
void *private_data) \
{ \
return tracepoint_probe_register(#name ":" #proto, \
(void *)probe, private_data); \
} \
static inline void unregister_trace_##name( \
void (*probe)proto, \
void *private_data) \
{ \
tracepoint_probe_unregister(#name ":" #proto, \
(void *)probe, private_data); \
}

#define DEFINE_TRACE_NOARG(name) \
static inline void _do_trace_##name(struct tracepoint *tp) \
...
static inline void trace_##name(void) \
...
__define_trace_register(name, (void *private_data))


But I think removing private_data is better...

Thank you,

>
> Mathieu
>
> * Masami Hiramatsu ([email protected]) wrote:
>> Hi Mathieu,
>>
>> I found a bug.
>> Mathieu Desnoyers wrote:
>> [...]
>>
>>> +/*
>>> + * Remove the tracepoint from the tracepoint hash table. Must be called with
>>> + * mutex_lock held.
>>> + */
>>> +static int remove_tracepoint(const char *name)
>>> +{
>>> + struct hlist_head *head;
>>> + struct hlist_node *node;
>>> + struct tracepoint_entry *e;
>>> + int found = 0;
>>> + size_t len = strlen(name) + 1;
>>> + u32 hash = jhash(name, len-1, 0);
>>> +
>>> + head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
>>> + hlist_for_each_entry(e, node, head, hlist) {
>>> + if (!strcmp(name, e->name)) {
>>> + found = 1;
>>> + break;
>>> + }
>>> + }
>>> + if (!found)
>>> + return -ENOENT;
>> before removing, you have to ensure refcount == 0. So,
>> if (e->refcount != 0)
>> return -EBUSY;
>>
>>> + hlist_del(&e->hlist);
>>> + /* Make sure the call_rcu has been executed */
>>> + if (e->rcu_pending)
>>> + rcu_barrier();
>>> + kfree(e);
>>> + return 0;
>>> +}
>> Thank you,
>>
>> --
>> Masami Hiramatsu
>>
>> Software Engineer
>> Hitachi Computer Products (America) Inc.
>> Software Solutions Division
>>
>> e-mail: [email protected]
>>
>

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-07-03 18:54:25

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH] Kernel Tracepoints

Hi Mathieu,

Mathieu Desnoyers wrote:
> * Masami Hiramatsu ([email protected]) wrote:
>> Hi Mathieu,
>>
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu ([email protected]) wrote:
>>>> Mathieu Desnoyers wrote:
>>>>> * Masami Hiramatsu ([email protected]) wrote:
>>>>> >
>>>>>>> Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
>>>>>> What would you think redesigning markers on tracepoints? because most of the
>>>>>> logic (scaning sections, multiple probe and activation) seems very similar
>>>>>> to markers.
>>>>>>
>>>>> We could, although markers, because they use var args, allow to put the
>>>>> iteration on the multi probe array out-of-line. Tracepoints cannot
>>>>> afford this and the iteration must be done at the initial call-site.
>>>>>
>>>>> From what I see in your proposal, it's mostly to extract the if() call()
>>>>> code from the inner __trace_mark() macro and to put it in a separate
>>>>> macro, am I correct ? This would make the macro more readable.
>>>> Sure, I think marker and tracepoint can share below functions;
>>>> - definition of static local variables in specific sections
>>> Given that we could want to keep activation of tracepoints and markers
>>> separate (so they don't share the same namespace), declaring the static
>>> variables in separated sections seems to make sense to me.
>> Sorry, I'm not sure what is "separate activation".
>> As far as I can see, both tracepoints and markers are activated
>> when its probe handlers are registered on each tracepoint/marker.
>> Aren't it separated?
>>
>
> Yes, it is separate. This is insured by the fact that markers and
> tracepoints are declared in different sections. Therefore, even if they
> have the same "name", they won't be used by each other.

I reviewed both marker and tracepoint deeply in these days, and
recognized what you said. Actually, markers and tracepoint would
better have separate sections.

[...]
>>>> - probe activation code (if() call())
>>>> - multi probe handling
>>> Hrm, the thing here is that because markers allow to do the iteration on
>>> the multiple probe callbacks within an internal wrapper (instead of
>>> doing it on-site as in the tracepoints), it allows to do some further
>>> optimizations (less memory allocation and less pointer dereference in
>>> the single probe case, not having to prepare the va_args in the
>>> MARK_NOARGS case) which are only done because it does not have to add
>>> code to the instrumentation site. However, tracepoints cannot have such
>>> "generic" wrapper and we have to do the iteration on callbacks in the
>>> code added to the instrumented object. Therefore, I keep it as small as
>>> possible in terms of bytes of instructions.
>> OK, I see. So, __tracepoint_block() macro can specify handler function.
>> what would you think about it?
>>
>
> When I originally designed the markers, I tried to make sure there was
> absolutely no code duplication until I discovered that trying to read a
> huge amount of nested macros is just a pain starting from a certain
> level. If we only save a few duplicated lines but end up tying up
> markers and tracepoints, I am far from certain that it will make the
> code more readable.

After reviewing, I knew it is hard to implement markers on tracepoint.
maybe, I need to find another way or maintenance both codes.


> I'll post a tracepoint version with the modifications you proposed (it's
> now placed earlier in the patchset), except the merge with markers.

I see, if it is acceptable for upstream developers, I have no problem.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]