LinuxLists.cc - [RFC] simple dprobe like markers for the kernel

2008-07-09 21:22:47

by James Bottomley

[permalink] [raw]

Subject: [RFC] simple dprobe like markers for the kernel

I've been looking at using the existing in kernel markers for dtrace
named probing in systemtap. What I find is that they're a bit
heavyweight when compared to what dtrace does (because of the way they
drop stubbable calling points).

This patch adds incredibly simple markers which are designed to be used
via kprobes. All it does is add an extra section to the kernel (and
modules) which annotates the location in source file/line of the marker
and a description of the variables of interest. Tools like systemtap
can then use the kernel dwarf2 debugging information to transform this
to a precise probe point that gives access to the named variables.

The beauty of this scheme is that it has zero cost in the unactivated
case (the extra section is discardable if you're not interested in the
information, and nothing is actually added into the routine being
marked). The disadvantage is that it's really unusable for rolling your
own marker probes because it relies on the dwarf2 information to locate
the probe point for kprobes and unravel the local variables of interest,
so you need an external tool like systemtap to help you.

The scheme uses a printk format like string to describe the variables of
interest, so if those variables disappear, the compile breaks (even in
the unmarked case) which should help us keep the marked probe points
current.

For instance, this is what SCSI would look like with a probe point added
just before the command goes to the low level device

trace_simple(queuecommand, "Command being queued %p Done function %p", cmd, scsi_done);
rtn = host->hostt->queuecommand(cmd, scsi_done);
trace_simple(queuecommand_return, "Command returning %p Return value %d", cmd, rtn);

Here you can see that each trace point describes two variables whose
values can be viewed at that point by the relevant tools. The format
strings and variables can be used by a tool to perform dtrace -l like
functionality:

MODULE FUNCTION NAME DESCRIPTION
scsi_mod scsi_dispatch_io queuecommand Command being queued $sdev; Done function $scsi_done
scsi_mod scsi_dispatch_io queuecommand_return Command being queued $sdev; Return value $ret

So the trace points recommend to the user what variables to use and
briefly what they mean.

James

---

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index f054778..c0c38b8 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -299,6 +299,8 @@
.debug_funcnames 0 : { *(.debug_funcnames) } \
.debug_typenames 0 : { *(.debug_typenames) } \
.debug_varnames 0 : { *(.debug_varnames) } \
+ /* simple markers (depends on dwarf2 debugging info) */ \
+ __simple_marker (INFO) : { *(__simple_marker) } \

/* Stabs debugging sections. */
#define STABS_DEBUG \
diff --git a/include/linux/simple_marker.h b/include/linux/simple_marker.h
new file mode 100644
index 0000000..675f5b1
--- /dev/null
+++ b/include/linux/simple_marker.h
@@ -0,0 +1,41 @@
+#include <linux/stringify.h>
+
+/* To be used for string format validity checking with gcc */
+static inline void __printf(1, 2)
+__trace_simple_check_format(const char *fmt, ...)
+{
+}
+
+#ifdef CONFIG_DEBUG_INFO
+#define trace_simple(name, format, args...) \
+ do { \
+ static const char __simple_name_##name[] \
+ __attribute__((section("__simple_marker"))) \
+ __attribute__((__used__)) \
+ = #name; \
+ static const char __simple_file_##name[] \
+ __attribute__((section("__simple_marker"))) \
+ __attribute__((__used__)) \
+ = __FILE__; \
+ static const char __simple_line_##name[] \
+ __attribute__((section("__simple_marker"))) \
+ __attribute__((__used__)) \
+ = __stringify(__LINE__); \
+ static const char __simple_format_##name[] \
+ __attribute__((section("__simple_marker"))) \
+ __attribute__((__used__)) \
+ = #format; \
+ static const char __simple_args_##name[] \
+ __attribute__((section("__simple_marker"))) \
+ __attribute__((__used__)) \
+ = #args; \
+ if (0) \
+ __trace_simple_check_format(format, ## args); \
+ } while(0)
+#else
+#define trace_simple(name, format, args...) \
+ do { \
+ if (0) \
+ __trace_simple_check_format(format, ## args); \
+ } while(0)
+#endif

2008-07-10 02:30:54

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

James Bottomley <[email protected]> writes:

> I've been looking at using the existing in kernel markers for dtrace
> named probing in systemtap. What I find is that they're a bit
> heavyweight when compared to what dtrace does (because of the way
> they drop stubbable calling points).

> This patch adds incredibly simple markers which are designed to be used
> via kprobes [+ dwarf]. [...]

> [...] The disadvantage is that it's really unusable for rolling
> your own marker probes because it relies on the dwarf2 information
> to locate the probe point for kprobes and unravel the local
> variables of interest, so you need an external tool like systemtap
> to help you. [...]

Another disadvantage is one that came up earlier when markers were
initially thought up: that something so invisible to the compiler (no
code being generated in the instruction stream, after optimization,
may be impossible to locate: not just the statement but also the
putative parameters.

Long ago, someone proposed inserting an asm("nop") mini-markers into
the instruction stream, which could then be used as an anchor to tie a
kprobe to, so that would solve the statement-location problem.

But it doesn't help assure that the parameters will be available in
dwarf, so someone else proposed adding another asm that just asks the
parameters to be evaluated and placed *somewhere*. Each asm input
constraint was to be the loosest possible, so as to not force the
compiler to put the values into registers (and evict their normal
tracing-ignorant tenants).

I believe this combination was never actually built/tested, partly
because people realized that then the compiler would always have to
evaluate parameters unconditionally, whether or not a kprobe is
present. (To do it otherwise would IIRC require the asm code to
include control-flow-modification instructions, which would surprise
gcc.)

So that's roughly how we arrived at recent markers. They expose to
the compiler the parameters, but arrange not to evaluate them unless
necessary. The most recent markers code patches nops over most or all
the hot path instructions, so there is no tangible performance impact.

- FChE

2008-07-10 03:39:55

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

* James Bottomley ([email protected]) wrote:
> I've been looking at using the existing in kernel markers for dtrace
> named probing in systemtap. What I find is that they're a bit
> heavyweight when compared to what dtrace does (because of the way they
> drop stubbable calling points).
>
> This patch adds incredibly simple markers which are designed to be used
> via kprobes. All it does is add an extra section to the kernel (and
> modules) which annotates the location in source file/line of the marker
> and a description of the variables of interest. Tools like systemtap
> can then use the kernel dwarf2 debugging information to transform this
> to a precise probe point that gives access to the named variables.
>
> The beauty of this scheme is that it has zero cost in the unactivated
> case (the extra section is discardable if you're not interested in the
> information, and nothing is actually added into the routine being
> marked). The disadvantage is that it's really unusable for rolling your
> own marker probes because it relies on the dwarf2 information to locate
> the probe point for kprobes and unravel the local variables of interest,
> so you need an external tool like systemtap to help you.
>
> The scheme uses a printk format like string to describe the variables of
> interest, so if those variables disappear, the compile breaks (even in
> the unmarked case) which should help us keep the marked probe points
> current.
>
> For instance, this is what SCSI would look like with a probe point added
> just before the command goes to the low level device
>
> trace_simple(queuecommand, "Command being queued %p Done function %p", cmd, scsi_done);
> rtn = host->hostt->queuecommand(cmd, scsi_done);
> trace_simple(queuecommand_return, "Command returning %p Return value %d", cmd, rtn);
>
> Here you can see that each trace point describes two variables whose
> values can be viewed at that point by the relevant tools. The format
> strings and variables can be used by a tool to perform dtrace -l like
> functionality:
>
> MODULE FUNCTION NAME DESCRIPTION
> scsi_mod scsi_dispatch_io queuecommand Command being queued $sdev; Done function $scsi_done
> scsi_mod scsi_dispatch_io queuecommand_return Command being queued $sdev; Return value $ret
>
> So the trace points recommend to the user what variables to use and
> briefly what they mean.
>
> James
>

Hi James,

It's interesting to see this try at a stubless marker scheme. A few
things as you and Frank pointed out :
- It depends on an external tool to parse the dwarf info, so it cannot
be used by in-kernel tracers such as ftrace.
- It does not require variable liveliness at the marker site : the
compiler can freely optimize out the variable whenever it needs to.

Besides those core concerns, I went through your patch, a small detail
seems incorrect. Please see the comment below.

> ---
>
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index f054778..c0c38b8 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -299,6 +299,8 @@
> .debug_funcnames 0 : { *(.debug_funcnames) } \
> .debug_typenames 0 : { *(.debug_typenames) } \
> .debug_varnames 0 : { *(.debug_varnames) } \
> + /* simple markers (depends on dwarf2 debugging info) */ \
> + __simple_marker (INFO) : { *(__simple_marker) } \
>
> /* Stabs debugging sections. */
> #define STABS_DEBUG \
> diff --git a/include/linux/simple_marker.h b/include/linux/simple_marker.h
> new file mode 100644
> index 0000000..675f5b1
> --- /dev/null
> +++ b/include/linux/simple_marker.h
> @@ -0,0 +1,41 @@
> +#include <linux/stringify.h>
> +
> +/* To be used for string format validity checking with gcc */
> +static inline void __printf(1, 2)
> +__trace_simple_check_format(const char *fmt, ...)
> +{
> +}
> +
> +#ifdef CONFIG_DEBUG_INFO
> +#define trace_simple(name, format, args...) \
> + do { \
> + static const char __simple_name_##name[] \
> + __attribute__((section("__simple_marker"))) \
> + __attribute__((__used__)) \
> + = #name; \
> + static const char __simple_file_##name[] \
> + __attribute__((section("__simple_marker"))) \
> + __attribute__((__used__)) \
> + = __FILE__; \
> + static const char __simple_line_##name[] \
> + __attribute__((section("__simple_marker"))) \
> + __attribute__((__used__)) \
> + = __stringify(__LINE__); \
> + static const char __simple_format_##name[] \
> + __attribute__((section("__simple_marker"))) \
> + __attribute__((__used__)) \
> + = #format; \
> + static const char __simple_args_##name[] \
> + __attribute__((section("__simple_marker"))) \
> + __attribute__((__used__)) \
> + = #args; \

All those variables placed in the __simple_marker section are not
guaranteed to be placed nicely together. There should be a structure
containing pointers to name, file, line, format and args strings (all
together) in a special section, and then those strings could be emitted
in another section. Otherwise, the compiler can freely choose to
interleave different strings from various tracing statements.

Mathieu

> + if (0) \
> + __trace_simple_check_format(format, ## args); \
> + } while(0)
> +#else
> +#define trace_simple(name, format, args...) \
> + do { \
> + if (0) \
> + __trace_simple_check_format(format, ## args); \
> + } while(0)
> +#endif
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-07-10 13:50:12

by James Bottomley

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

On Wed, 2008-07-09 at 22:29 -0400, Frank Ch. Eigler wrote:
> James Bottomley <[email protected]> writes:
>
> > I've been looking at using the existing in kernel markers for dtrace
> > named probing in systemtap. What I find is that they're a bit
> > heavyweight when compared to what dtrace does (because of the way
> > they drop stubbable calling points).
>
> > This patch adds incredibly simple markers which are designed to be used
> > via kprobes [+ dwarf]. [...]
>
> > [...] The disadvantage is that it's really unusable for rolling
> > your own marker probes because it relies on the dwarf2 information
> > to locate the probe point for kprobes and unravel the local
> > variables of interest, so you need an external tool like systemtap
> > to help you. [...]
>
> Another disadvantage is one that came up earlier when markers were
> initially thought up: that something so invisible to the compiler (no
> code being generated in the instruction stream, after optimization,
> may be impossible to locate: not just the statement but also the
> putative parameters.

Actually, I listed that one as an advantage. But, in order to be
completely zero impact, the probe cannot interfere with optimisation,
and so you run the risk of having the probe point do strange things
(like it's in the middle of a loop that gets unrolled) or that the
variables you want to advertise get optimised away.

All of this is mitigated by correct selection of the probe points and
the variables.

> Long ago, someone proposed inserting an asm("nop") mini-markers into
> the instruction stream, which could then be used as an anchor to tie a
> kprobe to, so that would solve the statement-location problem.

But you don't need a nop ... you just need a line number.

> But it doesn't help assure that the parameters will be available in
> dwarf, so someone else proposed adding another asm that just asks the
> parameters to be evaluated and placed *somewhere*. Each asm input
> constraint was to be the loosest possible, so as to not force the
> compiler to put the values into registers (and evict their normal
> tracing-ignorant tenants).

Actually, it does. Assuming the probe is placed in the code by someone
who knows what they're doing and is using it, you can ensure that what
you're advertising actually exists. If you look at the SCSI example I
gave, both the probe points and the variables actually exist, and will
continue to exist because of what they are and where they're placed.

> I believe this combination was never actually built/tested, partly
> because people realized that then the compiler would always have to
> evaluate parameters unconditionally, whether or not a kprobe is
> present. (To do it otherwise would IIRC require the asm code to
> include control-flow-modification instructions, which would surprise
> gcc.)
>
> So that's roughly how we arrived at recent markers. They expose to
> the compiler the parameters, but arrange not to evaluate them unless
> necessary. The most recent markers code patches nops over most or all
> the hot path instructions, so there is no tangible performance impact.

Yes there are. There are actually two performance impacts:

1. The nops themselves take cycles to execute ... small, granted,
but it adds up with lots of probe points
2. The probes interfere with optimisation since to replace them
with a function call, they must be barriers.

I didn't say use simple probes to replace markers ... I just said it's
an alternative for things like I/O subsystems that don't want the
perturbation.

James

2008-07-10 13:55:19

by James Bottomley

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

On Wed, 2008-07-09 at 23:39 -0400, Mathieu Desnoyers wrote:
> * James Bottomley ([email protected]) wrote:
> > I've been looking at using the existing in kernel markers for dtrace
> > named probing in systemtap. What I find is that they're a bit
> > heavyweight when compared to what dtrace does (because of the way they
> > drop stubbable calling points).
> >
> > This patch adds incredibly simple markers which are designed to be used
> > via kprobes. All it does is add an extra section to the kernel (and
> > modules) which annotates the location in source file/line of the marker
> > and a description of the variables of interest. Tools like systemtap
> > can then use the kernel dwarf2 debugging information to transform this
> > to a precise probe point that gives access to the named variables.
> >
> > The beauty of this scheme is that it has zero cost in the unactivated
> > case (the extra section is discardable if you're not interested in the
> > information, and nothing is actually added into the routine being
> > marked). The disadvantage is that it's really unusable for rolling your
> > own marker probes because it relies on the dwarf2 information to locate
> > the probe point for kprobes and unravel the local variables of interest,
> > so you need an external tool like systemtap to help you.
> >
> > The scheme uses a printk format like string to describe the variables of
> > interest, so if those variables disappear, the compile breaks (even in
> > the unmarked case) which should help us keep the marked probe points
> > current.
> >
> > For instance, this is what SCSI would look like with a probe point added
> > just before the command goes to the low level device
> >
> > trace_simple(queuecommand, "Command being queued %p Done function %p", cmd, scsi_done);
> > rtn = host->hostt->queuecommand(cmd, scsi_done);
> > trace_simple(queuecommand_return, "Command returning %p Return value %d", cmd, rtn);
> >
> > Here you can see that each trace point describes two variables whose
> > values can be viewed at that point by the relevant tools. The format
> > strings and variables can be used by a tool to perform dtrace -l like
> > functionality:
> >
> > MODULE FUNCTION NAME DESCRIPTION
> > scsi_mod scsi_dispatch_io queuecommand Command being queued $sdev; Done function $scsi_done
> > scsi_mod scsi_dispatch_io queuecommand_return Command being queued $sdev; Return value $ret
> >
> > So the trace points recommend to the user what variables to use and
> > briefly what they mean.
> >
> > James
> >
>
> Hi James,
>
> It's interesting to see this try at a stubless marker scheme. A few
> things as you and Frank pointed out :
> - It depends on an external tool to parse the dwarf info, so it cannot
> be used by in-kernel tracers such as ftrace.

Actually, I think I listed that as one of the issues.

> - It does not require variable liveliness at the marker site : the
> compiler can freely optimize out the variable whenever it needs to.

Correct ... by design a zero impact probe must not affect the
optimisation.

> Besides those core concerns, I went through your patch, a small detail
> seems incorrect. Please see the comment below.
>
> > ---
> >
> > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > index f054778..c0c38b8 100644
> > --- a/include/asm-generic/vmlinux.lds.h
> > +++ b/include/asm-generic/vmlinux.lds.h
> > @@ -299,6 +299,8 @@
> > .debug_funcnames 0 : { *(.debug_funcnames) } \
> > .debug_typenames 0 : { *(.debug_typenames) } \
> > .debug_varnames 0 : { *(.debug_varnames) } \
> > + /* simple markers (depends on dwarf2 debugging info) */ \
> > + __simple_marker (INFO) : { *(__simple_marker) } \
> >
> > /* Stabs debugging sections. */
> > #define STABS_DEBUG \
> > diff --git a/include/linux/simple_marker.h b/include/linux/simple_marker.h
> > new file mode 100644
> > index 0000000..675f5b1
> > --- /dev/null
> > +++ b/include/linux/simple_marker.h
> > @@ -0,0 +1,41 @@
> > +#include <linux/stringify.h>
> > +
> > +/* To be used for string format validity checking with gcc */
> > +static inline void __printf(1, 2)
> > +__trace_simple_check_format(const char *fmt, ...)
> > +{
> > +}
> > +
> > +#ifdef CONFIG_DEBUG_INFO
> > +#define trace_simple(name, format, args...) \
> > + do { \
> > + static const char __simple_name_##name[] \
> > + __attribute__((section("__simple_marker"))) \
> > + __attribute__((__used__)) \
> > + = #name; \
> > + static const char __simple_file_##name[] \
> > + __attribute__((section("__simple_marker"))) \
> > + __attribute__((__used__)) \
> > + = __FILE__; \
> > + static const char __simple_line_##name[] \
> > + __attribute__((section("__simple_marker"))) \
> > + __attribute__((__used__)) \
> > + = __stringify(__LINE__); \
> > + static const char __simple_format_##name[] \
> > + __attribute__((section("__simple_marker"))) \
> > + __attribute__((__used__)) \
> > + = #format; \
> > + static const char __simple_args_##name[] \
> > + __attribute__((section("__simple_marker"))) \
> > + __attribute__((__used__)) \
> > + = #args; \
>
> All those variables placed in the __simple_marker section are not
> guaranteed to be placed nicely together. There should be a structure
> containing pointers to name, file, line, format and args strings (all
> together) in a special section, and then those strings could be emitted
> in another section. Otherwise, the compiler can freely choose to
> interleave different strings from various tracing statements.

This is just an RFC and a proof of concept. In practice the whole
section will have to be constructed so that it's versioned (in case more
information has to be added). Realistically, string optimisation also
needs to be done (since file will be repeated over and over again),
which is also less easy to do. However, the idea of using strings is
deliberate since they're easier to parse than specific structures which
have to be shared between tools (particularly when the information is
variable length).

James

2008-07-10 14:23:59

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

Hi -

On Thu, Jul 10, 2008 at 08:49:54AM -0500, James Bottomley wrote:
> [...]
> > Another disadvantage is one that came up earlier when markers were
> > initially thought up: that something so invisible to the compiler (no
> > code being generated in the instruction stream, after optimization,
> > may be impossible to locate: not just the statement but also the
> > putative parameters.
>
> Actually, I listed that one as an advantage. But, in order to be
> completely zero impact, the probe cannot interfere with optimisation,
> and so you run the risk of having the probe point do strange things
> (like it's in the middle of a loop that gets unrolled) or that the
> variables you want to advertise get optimised away.
>
> All of this is mitigated by correct selection of the probe points and
> the variables.

Well, you can test your theory: replace some "tracepoints" or markers
or printk's with this, and see if systemtap (or gdb) can get at the
same data.

When "correct selection" is a function of any particular compiler's
optimization algorithms, it will be difficult for a human programmer
to get it right.

> > Long ago, someone proposed inserting an asm("nop") mini-markers into
> > the instruction stream, which could then be used as an anchor to tie a
> > kprobe to, so that would solve the statement-location problem.
>
> But you don't need a nop ... you just need a line number.

That's *if* the line number ends up being resolvable back to a PC. In
fact, since there is no code emitted for it, that particular line
number will not actually appear in DWARF line records.

> > But it doesn't help assure that the parameters will be available in
> > dwarf, so someone else proposed adding another asm that just asks the
> > parameters to be evaluated and placed *somewhere*. Each asm input
> > constraint was to be the loosest possible, so as to not force the
> > compiler to put the values into registers (and evict their normal
> > tracing-ignorant tenants).
>
> Actually, it does. Assuming the probe is placed in the code by someone
> who knows what they're doing and is using it, you can ensure that what
> you're advertising actually exists. [...]

You misunderstood - I am not talking about whether the variables exist
in the context of the source code. The question is which of those
variables still exist, live & addressable, in the machine code and
execution state. You may be surprised to what extent compiler
optimizations disrupt a simple source-level reading of the situation.

> > So that's roughly how we arrived at recent markers. They expose to
> > the compiler the parameters, but arrange not to evaluate them unless
> > necessary. The most recent markers code patches nops over most or all
> > the hot path instructions, so there is no tangible performance impact.
>
> Yes there are. There are actually two performance impacts:
>
> 1. The nops themselves take cycles to execute ... small, granted,
> but it adds up with lots of probe points
> 2. The probes interfere with optimisation since to replace them
> with a function call, they must be barriers. [...]

That's why I qualified it with "tangible". Please confirm your
intuition about these costs.

- FChE

2008-07-10 14:43:31

by James Bottomley

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

On Thu, 2008-07-10 at 10:22 -0400, Frank Ch. Eigler wrote:
> Hi -
>
> On Thu, Jul 10, 2008 at 08:49:54AM -0500, James Bottomley wrote:
> > [...]
> > > Another disadvantage is one that came up earlier when markers were
> > > initially thought up: that something so invisible to the compiler (no
> > > code being generated in the instruction stream, after optimization,
> > > may be impossible to locate: not just the statement but also the
> > > putative parameters.
> >
> > Actually, I listed that one as an advantage. But, in order to be
> > completely zero impact, the probe cannot interfere with optimisation,
> > and so you run the risk of having the probe point do strange things
> > (like it's in the middle of a loop that gets unrolled) or that the
> > variables you want to advertise get optimised away.
> >
> > All of this is mitigated by correct selection of the probe points and
> > the variables.
>
> Well, you can test your theory: replace some "tracepoints" or markers
> or printk's with this, and see if systemtap (or gdb) can get at the
> same data.

That's what I'm actually already doing ... so far it works nicely.

> When "correct selection" is a function of any particular compiler's
> optimization algorithms, it will be difficult for a human programmer
> to get it right.

Not necessarily. A tracepoint by a barrier will always be pretty much
OK, as will variables that are either passed in or passed to functions
(since they have to be instantiated to pass as arguments).

Plus screw ups are easily detectable by a tool that parses the dwarf.

The essential point is that we need zero impact trace points and that
makes them difficult to place in this fashion. However, the burden of
placing and verifying them rests with the people in the actual subsystem
(who are also the ones who hopefully get the most use out of them).

> > > Long ago, someone proposed inserting an asm("nop") mini-markers into
> > > the instruction stream, which could then be used as an anchor to tie a
> > > kprobe to, so that would solve the statement-location problem.
> >
> > But you don't need a nop ... you just need a line number.
>
> That's *if* the line number ends up being resolvable back to a PC. In
> fact, since there is no code emitted for it, that particular line
> number will not actually appear in DWARF line records.

Erm, no ... dwarf is designed to emit an entry for every line in the
file (whether it contains a statment or not). The empty lines get
elided in the line number program (because you can attach them to the
first statement following) but a correct parser will recover them (by
design in the dwarf).

> > > But it doesn't help assure that the parameters will be available in
> > > dwarf, so someone else proposed adding another asm that just asks the
> > > parameters to be evaluated and placed *somewhere*. Each asm input
> > > constraint was to be the loosest possible, so as to not force the
> > > compiler to put the values into registers (and evict their normal
> > > tracing-ignorant tenants).
> >
> > Actually, it does. Assuming the probe is placed in the code by someone
> > who knows what they're doing and is using it, you can ensure that what
> > you're advertising actually exists. [...]
>
> You misunderstood - I am not talking about whether the variables exist
> in the context of the source code. The question is which of those
> variables still exist, live & addressable, in the machine code and
> execution state. You may be surprised to what extent compiler
> optimizations disrupt a simple source-level reading of the situation.

No ... I'm used to optimisation strangeness. Again, I'm not trying to
eliminate it because that would defeat the zero impact purpose. I'm
trying to build a system that can be useful without any impact. The
consequence is going to be that certain trace points can't be used
because of the optimiser, but that's the tradeoff. As long as the
people placing the trace points are subject matter experts in the
subsystem (and actually using them) everything should be OK.

> > > So that's roughly how we arrived at recent markers. They expose to
> > > the compiler the parameters, but arrange not to evaluate them unless
> > > necessary. The most recent markers code patches nops over most or all
> > > the hot path instructions, so there is no tangible performance impact.
> >
> > Yes there are. There are actually two performance impacts:
> >
> > 1. The nops themselves take cycles to execute ... small, granted,
> > but it adds up with lots of probe points
> > 2. The probes interfere with optimisation since to replace them
> > with a function call, they must be barriers. [...]
>
> That's why I qualified it with "tangible". Please confirm your
> intuition about these costs.

1 is pretty obvious ... the nops have a defined cycle time in every
instruction architecture. The optimisation costs are very difficult to
quantify since they vary so much from compiler to compiler and function
to function.

James

2008-07-10 15:30:49

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

On Thu, Jul 10, 2008 at 09:43:16AM -0500, James Bottomley wrote:
> No ... I'm used to optimisation strangeness. Again, I'm not trying to
> eliminate it because that would defeat the zero impact purpose. I'm
> trying to build a system that can be useful without any impact. The
> consequence is going to be that certain trace points can't be used
> because of the optimiser, but that's the tradeoff. As long as the
> people placing the trace points are subject matter experts in the
> subsystem (and actually using them) everything should be OK.

So as I understand things, your light-weight tracepoints are designed
for very performance-sensitive code paths where we don't want to
disturbe the optimization in the deactivated state. In
non-performance sensitive parts of the kernel, where cycle counting is
not so important, tracepoints can and probably should still be used.
So I don't think you were proposing eliminating the current kernel
markers in favor of this approach, yes?

When you said a tool could determine if the tracepoint had gotten
optimized away, or the variables were no longer present, I assume you
meant at compile time, right? So with the right tool built into the
kbuild infrastructure, if we could simply print warnings when
tracepoints had gotten optimized away, that would make the your simple
tracepoints quite safe for general use, I would think.

- Ted

P.S. When you said that the current kernel markers are "a bit
heavyweight", how bad are they in practice? Hundreds of cycles? More?

2008-07-10 15:57:32

by James Bottomley

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

On Thu, 2008-07-10 at 11:30 -0400, Theodore Tso wrote:
> On Thu, Jul 10, 2008 at 09:43:16AM -0500, James Bottomley wrote:
> > No ... I'm used to optimisation strangeness. Again, I'm not trying to
> > eliminate it because that would defeat the zero impact purpose. I'm
> > trying to build a system that can be useful without any impact. The
> > consequence is going to be that certain trace points can't be used
> > because of the optimiser, but that's the tradeoff. As long as the
> > people placing the trace points are subject matter experts in the
> > subsystem (and actually using them) everything should be OK.
>
> So as I understand things, your light-weight tracepoints are designed
> for very performance-sensitive code paths where we don't want to
> disturbe the optimization in the deactivated state. In
> non-performance sensitive parts of the kernel, where cycle counting is
> not so important, tracepoints can and probably should still be used.
> So I don't think you were proposing eliminating the current kernel
> markers in favor of this approach, yes?

That's right ... I started from the position that the current markers
were too heavy for an I/O subsystem, but I'm sure they have many other
uses.

> When you said a tool could determine if the tracepoint had gotten
> optimized away, or the variables were no longer present, I assume you
> meant at compile time, right?

Yes and no. Yes because a tool will be able to detect the problems, but
no if you're thinking an actual kernel compile would do it (unless some
tool is designed for this and integrated into the build ... the obvious
tool is systemtap, but that might cause some heartburn).

> So with the right tool built into the
> kbuild infrastructure, if we could simply print warnings when
> tracepoints had gotten optimized away, that would make the your simple
> tracepoints quite safe for general use, I would think.

Yes ... but someone has to come up with the tool. I suppose rebuilding
the line number matrix and finding the variables at the location is easy
mechanical dwarf stuff ... but it will give the kernel build a lot of
external dependencies it didn't have before.

Plus, this level of checking can only be done if dwarf is generated
(i.e. CONFIG_KERNEL_DEBUG_INFO is y).

James

2008-07-10 18:20:35

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] simple dprobe like markers for the kernel

Hi -

On Thu, Jul 10, 2008 at 11:30:17AM -0400, Theodore Tso wrote:

> [...] When you said a tool could determine if the tracepoint had
> gotten optimized away, or the variables were no longer present, I
> assume you meant at compile time, right? So with the right tool
> built into the kbuild infrastructure, if we could simply print
> warnings when tracepoints had gotten optimized away [...]

It will be interesting to see how frequently such a warning appears
for a good suite of such mini markers, on a diversity of architectures
and compilers. Such data can help pronounce judgement on this approach.

> P.S. When you said that the current kernel markers are "a bit
> heavyweight", how bad are they in practice? Hundreds of cycles? More?

Good question. The only performance measurements I have seen posted
indicate negligible effects.

- FChE

2008-07-12 18:23:00

by James Bottomley

[permalink] [raw]

Subject: [PATCH] simple dprobe like markers for the kernel

This is just an incremental update based on feedback. The most
significant was that making the marker a compiler barrier will free the
inserter from worrying about the mark sliding around changes to named
variables (and thus having to worry about this in placement) at
practically zero optimisation cost. I also updated the code to drop and
asm section instead of using the static variable scheme. I also added
documentation and made the module loader ignore them (since modules
don't go through the vmlinux.lds transformations).

I also added a simple versioning scheme (basically tack the version on
to the end of the section name). It can be used simply and even
provides backwards compatibility (just emit the old and the new
sections).

If everyone's happy with this, I'll follow it up with the systemtap
changes to make use of them ... they've been incredibly helpful
debugging some of the CDROM problems for me so far.

James

---
>From 4916bf71aa808622503f9fa87e03ce577a65d6ac Mon Sep 17 00:00:00 2001
From: James Bottomley <[email protected]>
Date: Wed, 9 Jul 2008 16:18:16 -0500
Subject: [PATCH] add simple marker trace point infrastructure

his patch adds incredibly simple markers which are designed to be used
via kprobes. All it does is add an extra section to the kernel (and
modules) which annotates the location in source file/line of the marker
and a description of the variables of interest. Tools like systemtap
can then use the kernel dwarf2 debugging information to transform this
to a precise probe point that gives access to the named variables.

The beauty of this scheme is that it has zero cost in the unactivated
case (the extra section is discardable if you're not interested in the
information, and nothing is actually added into the routine being
marked). The disadvantage is that it's really unusable for rolling your
own marker probes because it relies on the dwarf2 information to locate
the probe point for kprobes and unravel the local variables of interest,
so you need an external tool like systemtap to help you.

The scheme uses a printk format like string to describe the variables of
interest, so if those variables disappear, the compile breaks (even in
the unmarked case) which should help us keep the marked probe points
current.

For instance, this is what SCSI would look like with a probe point added
just before the command goes to the low level device

trace_simple(queuecommand, "Command being queued %p Done function %p", cmd, scsi_done);
rtn = host->hostt->queuecommand(cmd, scsi_done);
trace_simple(queuecommand_return, "Command returning %p Return value %d", cmd, rtn);

Here you can see that each trace point describes two variables whose
values can be viewed at that point by the relevant tools. The format
strings and variables can be used by a tool to perform dtrace -l like
functionality:

MODULE FUNCTION NAME DESCRIPTION
scsi_mod scsi_dispatch_io queuecommand Command being queued $sdev; Done function $scsi_done
scsi_mod scsi_dispatch_io queuecommand_return Command being queued $sdev; Return value $ret

So the trace points recommend to the user what variables to use and
briefly what they mean.

Signed-off-by: James Bottomley <[email protected]>
---
Documentation/simple_markers.txt | 61 +++++++++++++++++++++++++++++++++++++
include/asm-generic/vmlinux.lds.h | 2 +
include/linux/simple_marker.h | 46 ++++++++++++++++++++++++++++
kernel/module.c | 6 ++++
4 files changed, 115 insertions(+), 0 deletions(-)
create mode 100644 Documentation/simple_markers.txt
create mode 100644 include/linux/simple_marker.h

diff --git a/Documentation/simple_markers.txt b/Documentation/simple_markers.txt
new file mode 100644
index 0000000..e4c159a
--- /dev/null
+++ b/Documentation/simple_markers.txt
@@ -0,0 +1,61 @@
+ Using Simple Markers
+ ====================
+
+ James E.J. Bottomley <[email protected]>
+
+This document describes the purpose and use of simple markers in the
+kernel. These are designed to be used as lightweight zero passive
+impact markers in critical path subsystems (such as I/O). They differ
+from conventional markers in that there is no actual instruction
+deposited for them into the stream of the object files (hence zero
+impact when not activated).
+
+Using Simple Markers
+--------------------
+
+All simple markers do is add an extra (unloaded) section to the kernel
+and modules which identifies the trace points by name file, line and
+interesting variables if CONFIG_KERNEL_INFO (enable debugging
+information) is set.
+
+The data in the section can only be used by debugging tools (like
+systemtap) in concert with the dwarf debugging information. The way
+it works is that you use the marker in the section to translate the
+marker position to an exact file and line number which the dwarf
+information can then be used to locate in the program (and add probe
+points via kprobes). The listed variables of interest can also be
+accessed via the dwarf debugging information within the kprobe
+(although again you need a tool to do this).
+
+Inserting Simple Markers
+------------------------
+
+Simple markers are very easy to use. You simply
+
+#include <linux/simple_marker.h>
+
+And then insert a trace point with
+
+trace_simple(<name>, <variables description>, <variables of interest>);
+
+The <name> should be globally unique. It is recommended that you
+break it up into <subsystem>:<component> (and even subdivide
+<component> with extra ':') it will be the name used to attach to the
+trace point.
+
+The <variables description> is a printf string format for each of the
+variables of interest, so say in SCSI we have two variables of
+interest at the trace point: the SCSI command (struct scsi_command
+*cmd) and the return value (int rtn) then the <variables description>
+is "SCSI Command %p Return value %d" and <variables of interest>
+becomes cmd, rtn.
+
+A tool parsing the sections can pick out the trace point name and
+variables and description, so it will list the variables as
+
+variables:
+ SCSI Command $cmd
+ Return value $rtn
+
+(The actual variables are displayed in the format the debugger makes
+use of them).
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index f054778..e686f55 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -299,6 +299,8 @@
.debug_funcnames 0 : { *(.debug_funcnames) } \
.debug_typenames 0 : { *(.debug_typenames) } \
.debug_varnames 0 : { *(.debug_varnames) } \
+ /* simple markers (depends on dwarf2 debugging info) */ \
+ __simple_marker.1 (INFO) : { *(__simple_marker.1) } \

/* Stabs debugging sections. */
#define STABS_DEBUG \
diff --git a/include/linux/simple_marker.h b/include/linux/simple_marker.h
new file mode 100644
index 0000000..af8bb1e
--- /dev/null
+++ b/include/linux/simple_marker.h
@@ -0,0 +1,46 @@
+#ifndef __LINUX_SIMPLE_MARKER_H
+#define __LINUX_SIMPLE_MARKER_H
+
+#include <linux/compiler.h>
+#include <linux/stringify.h>
+
+/* Note: If you change the format, increase the version
+ * and change the section name by appending the version. That
+ * way backwards compatibility is simple to maintain. You must
+ * also update asm-generic/vmlinux.lds.h to modify the build
+ * rule to include the updated section(s) */
+
+#define SIMPLE_MARKER_VERSION 1
+#define SIMPLE_MARKER_SECTION "__simple_marker"
+#define SIMPLE_MARKER_SECTION_NAME \
+ SIMPLE_MARKER_SECTION "." __stringify(SIMPLE_MARKER_VERSION)
+
+/* To be used for string format validity checking with gcc */
+static inline void __printf(1, 2)
+__trace_simple_check_format(const char *fmt, ...)
+{
+}
+
+#ifdef CONFIG_DEBUG_INFO
+#define trace_simple(name, format, args...) \
+ do { \
+ barrier(); \
+ asm (".pushsection " SIMPLE_MARKER_SECTION_NAME "\n" \
+ ".string \"" #name "\"\n" \
+ ".string \"" __FILE__ "\"\n" \
+ ".string \"" __stringify(__LINE__) "\"\n" \
+ ".string \"" format "\"\n" \
+ ".string \"" #args "\"\n" \
+ ".popsection\n"); \
+ if (0) \
+ __trace_simple_check_format(format, ## args); \
+ } while(0)
+#else
+#define trace_simple(name, format, args...) \
+ do { \
+ if (0) \
+ __trace_simple_check_format(format, ## args); \
+ } while(0)
+#endif
+
+#endif
diff --git a/kernel/module.c b/kernel/module.c
index 5f80478..a1d1d85 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -40,6 +40,7 @@
#include <linux/stop_machine.h>
#include <linux/device.h>
#include <linux/string.h>
+#include <linux/simple_marker.h>
#include <linux/mutex.h>
#include <linux/unwind.h>
#include <asm/uaccess.h>
@@ -1828,6 +1829,11 @@ static struct module *load_module(void __user *umod,
if (strncmp(secstrings+sechdrs[i].sh_name, ".exit", 5) == 0)
sechdrs[i].sh_flags &= ~(unsigned long)SHF_ALLOC;
#endif
+ /* Don't load any marker sections */
+ if (strncmp(secstrings+sechdrs[i].sh_name,
+ SIMPLE_MARKER_SECTION "." ,
+ sizeof(SIMPLE_MARKER_SECTION) + 1) == 0)
+ sechdrs[i].sh_flags &= ~(unsigned long)SHF_ALLOC;
}

modindex = find_sec(hdr, sechdrs, secstrings,
--
1.5.6

2008-07-12 20:05:03

by James Bottomley

[permalink] [raw]

Subject: [PATCH] systemtap: add parser for simple markers

This is the systemtap piece that allows you to use simple markers as
probe points for people who want to play around with the functionality.

James

---

>From a6b70f5c6faa0e9df4f9c84a5db5088b77ceaed9 Mon Sep 17 00:00:00 2001
From: James Bottomley <[email protected]>
Date: Fri, 11 Jul 2008 09:32:34 -0500
Subject: Add simple_marker statement

Now that the kernel drops simple markers in a __simple_marker section, update systemtap to parse for them by introducing an extra

<module>.simple_mark(<marker str>)

statement. It would be nice to reuse the existing mark() directive,
but unfortunately, the parser can't cope with semantic dependent
parsing (it won't allow the registration of two identical patterns),
so the easiest way to get this to work is to introduce an additional
statement type.

Signed-off-by: James Bottomley <[email protected]>
---
tapsets.cxx | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/tapsets.cxx b/tapsets.cxx
index adfe10e..ce59102 100644
--- a/tapsets.cxx
+++ b/tapsets.cxx
@@ -458,6 +458,7 @@ static string TOK_MAXACTIVE("maxactive");
static string TOK_STATEMENT("statement");
static string TOK_ABSOLUTE("absolute");
static string TOK_PROCESS("process");
+static string TOK_SIMPLE_MARK("simple_mark");

// Can we handle this query with just symbol-table info?
enum dbinfo_reqt
@@ -571,7 +572,15 @@ module_cache
};
typedef struct module_cache module_cache_t;

+struct marker_map_data {
+ string file;
+ int line;
+
+ marker_map_data(void) : line(-1) { };
+};
+
#ifdef HAVE_TR1_UNORDERED_MAP
+typedef tr1::unordered_map<string,struct marker_map_data> marker_map_t;
typedef tr1::unordered_map<string,Dwarf_Die> cu_function_cache_t;
typedef tr1::unordered_map<string,cu_function_cache_t*> mod_cu_function_cache_t; // module:cu -> function -> die
#else
@@ -579,6 +588,7 @@ struct stringhash {
size_t operator() (const string& s) const { hash<const char*> h; return h(s.c_str()); }
};

+typedef hash_map<string,struct marker_map_data,stringhash> marker_map_t;
typedef hash_map<string,Dwarf_Die,stringhash> cu_function_cache_t;
typedef hash_map<string,cu_function_cache_t*,stringhash> mod_cu_function_cache_t; // module:cu -> function -> die
#endif
@@ -727,6 +737,8 @@ struct dwflpp

function_name.clear();
function = NULL;
+ delete marker_map;
+ marker_map = NULL;
}

@@ -1583,6 +1595,82 @@ struct dwflpp
dwarf_decl_line (function, linep);
}

+ marker_map_t *marker_map;
+
+ void marker_map_populate(void)
+ {
+ assert(module);
+ marker_map = new marker_map_t;
+
+ Dwarf_Addr bias;
+ Elf_Scn* scn = 0;
+ size_t shstrndx;
+ // We prefer dwfl_module_getdwarf to dwfl_module_getelf here,
+ // because dwfl_module_getelf can force costly section relocations
+ // we don't really need, while either will do for this purpose.
+ Elf* elf = (dwarf_getelf (dwfl_module_getdwarf (module, &bias))
+ ?: dwfl_module_getelf (module, &bias));
+
+ assert(elf);
+
+ // find the __simple_marker section
+ elf_getshstrndx (elf, &shstrndx);
+
+ while ((scn = elf_nextscn (elf, scn)) != NULL)
+ {
+ GElf_Shdr shdr_mem;
+ GElf_Shdr *shdr = gelf_getshdr (scn, &shdr_mem);
+ if (! shdr) continue; // XXX error?
+
+ const char *name = elf_strptr (elf, shstrndx, shdr->sh_name);
+ // looking for version 1 of the format
+ if (!strcmp(name, "__simple_marker.1"))
+ break;
+ }
+ // no simple marker section
+ if (!scn)
+ return;
+
+ Elf_Data *rawdata = elf_rawdata(scn, NULL);
+ // section is empty
+ if (!rawdata)
+ return;
+ const char *data = (const char *)rawdata->d_buf;
+ for (unsigned int i = 0; i < rawdata->d_size; i++)
+ {
+
+ if (data[i] == '\0')
+ continue;
+
+ const char *name = data + i;
+ i += strlen(name) + 1;
+ const char *file = data + i;
+ i += strlen(file) + 1;
+ const char *line = data + i;
+ i += strlen(line) + 1;
+ const char *format = data + i;
+ i += strlen(format) + 1;
+ const char *variables = data + i;
+ // no + 1 for last; i++ in for loop does that
+ i += strlen(variables);
+
+ (*marker_map)[name].file = file;
+ (*marker_map)[name].line = lex_cast<int>(line);
+ }
+ }
+
+ /**
+ * find marker - returns the line corresponding to the marker
+ * @marker - string corresponding to the marker
+ */
+ struct marker_map_data& find_marker (string marker)
+ {
+ if (!marker_map)
+ marker_map_populate();
+
+ return (*marker_map)[marker];
+ }
+
bool die_has_pc (Dwarf_Die & die, Dwarf_Addr pc)
{
int res = dwarf_haspc (&die, pc);
@@ -2481,8 +2569,10 @@ struct dwarf_query : public base_query
bool has_statement_str;
bool has_function_num;
bool has_statement_num;
+ bool has_simple_mark;
string statement_str_val;
string function_str_val;
+ string simple_mark_str_val;
Dwarf_Addr statement_num_val;
Dwarf_Addr function_num_val;

@@ -2731,6 +2821,7 @@ dwarf_query::dwarf_query(systemtap_session & sess,
has_return = has_null_param(params, TOK_RETURN);
has_maxactive = get_number_param(params, TOK_MAXACTIVE, maxactive_val);
has_absolute = has_null_param(params, TOK_ABSOLUTE);
+ has_simple_mark = get_string_param(params, TOK_SIMPLE_MARK, simple_mark_str_val);

if (has_function_str)
spec_type = parse_function_spec(function_str_val);
@@ -2742,10 +2833,31 @@ dwarf_query::dwarf_query(systemtap_session & sess,
query_done = false;
}

-
void
dwarf_query::query_module_dwarf()
{
+ if (has_simple_mark) {
+ stringstream ss;
+ struct marker_map_data md = dw.find_marker(simple_mark_str_val);
+ if (md.line == -1)
+ {
+ ss << "Failed to find simple_mark(" << simple_mark_str_val << ") in module "
+ << dw.module_name;
+ throw semantic_error(ss.str());
+ }
+ function = "*";
+ file = md.file;
+ line[0] = md.line;
+ line[1] = 0;
+ line_type = ABSOLUTE;
+ spec_type = function_file_and_line;
+ has_statement_str = true;
+ ss << "*@" << md.file << ":" << md.line;
+ statement_str_val = ss.str();
+ if (sess.verbose > 1)
+ clog << "transform simple_mark(" << simple_mark_str_val << ") into "
+ << "statement(" << statement_str_val << endl;
+ }
if (has_function_num || has_statement_num)
{
// If we have module("foo").function(0xbeef) or
@@ -2768,7 +2880,7 @@ dwarf_query::query_module_dwarf()
// Otherwise if we have a function("foo") or statement("foo")
// specifier, we have to scan over all the CUs looking for
// the function(s) in question
- assert(has_function_str || has_statement_str);
+ assert(has_function_str || has_statement_str || has_simple_mark);
dw.iterate_over_cus(&query_cu, this);
}
}
@@ -4433,7 +4545,12 @@ dwarf_derived_probe::dwarf_derived_probe(const string& funcname,
else
fn_or_stmt = "statement";

- if (q.has_function_str || q.has_statement_str)
+ if (q.has_simple_mark)
+ {
+ comps.push_back(new probe_point::component
+ (TOK_SIMPLE_MARK, new literal_string(q.simple_mark_str_val)));
+ }
+ else if (q.has_function_str || q.has_statement_str)
{
string retro_name = funcname;
if (filename != "")
@@ -4507,6 +4624,7 @@ dwarf_derived_probe::register_function_and_statement_variants(match_node * root,
register_function_variants(root->bind_num(TOK_FUNCTION), dw);
register_statement_variants(root->bind_str(TOK_STATEMENT), dw);
register_statement_variants(root->bind_num(TOK_STATEMENT), dw);
+ register_statement_variants(root->bind_str(TOK_SIMPLE_MARK), dw);
}

void
--
1.5.6

2008-07-12 23:08:36

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [PATCH] systemtap: add parser for simple markers

Hi -

> This is the systemtap piece that allows you to use simple markers as
> probe points for people who want to play around with the
> functionality. [...]

Clever. We can include support for this as soon as kernel-side
simple_mark widget go upstream.

(For completeness, the code would need test cases, docs, and desirably
support for wildcarding as in probe kernel.simple_mark("*").)

- FChE

2008-07-14 16:28:34

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: [PATCH] simple dprobe like markers for the kernel

Hi James,

James Bottomley wrote:
> This is just an incremental update based on feedback. The most
> significant was that making the marker a compiler barrier will free the
> inserter from worrying about the mark sliding around changes to named
> variables (and thus having to worry about this in placement) at
> practically zero optimisation cost. I also updated the code to drop and
> asm section instead of using the static variable scheme. I also added
> documentation and made the module loader ignore them (since modules
> don't go through the vmlinux.lds transformations).

I'm very interested in your approach.

IMHO, as Aoki investigated, the overhead of markers is not so big
unless we put a lot of them into kernel. And from "active"
overhead point of view, it takes less than tens of nano-seconds,
while kprobes takes hundreds of nano-seconds. Kprobe also has a
limitation of probable points, it can't probe "__kprobes" marked
functions. So, original markers still has advantages.

However, your approach is also useful, especially for embedding
thousands of markers in kernel or drivers.

So I think it's better to use both of them as the situation demands.

I just have one comment on its name. Since it doesn't trace
anything, so I'd rather like notation() or note_mark() than
trace_simple(). :-)

Thank you,

>
> I also added a simple versioning scheme (basically tack the version on
> to the end of the section name). It can be used simply and even
> provides backwards compatibility (just emit the old and the new
> sections).
>
> If everyone's happy with this, I'll follow it up with the systemtap
> changes to make use of them ... they've been incredibly helpful
> debugging some of the CDROM problems for me so far.
>
> James

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2008-07-14 22:09:22

by James Bottomley

[permalink] [raw]

Subject: Re: [PATCH] simple dprobe like markers for the kernel

On Mon, 2008-07-14 at 12:26 -0400, Masami Hiramatsu wrote:
> Hi James,
>
> James Bottomley wrote:
> > This is just an incremental update based on feedback. The most
> > significant was that making the marker a compiler barrier will free the
> > inserter from worrying about the mark sliding around changes to named
> > variables (and thus having to worry about this in placement) at
> > practically zero optimisation cost. I also updated the code to drop and
> > asm section instead of using the static variable scheme. I also added
> > documentation and made the module loader ignore them (since modules
> > don't go through the vmlinux.lds transformations).
>
> I'm very interested in your approach.
>
> IMHO, as Aoki investigated, the overhead of markers is not so big
> unless we put a lot of them into kernel.

That's the case which I started from. The point is that if passive
markers have a cost, we have to be very careful about placing them to
avoid the cost adding up.

> And from "active"
> overhead point of view, it takes less than tens of nano-seconds,
> while kprobes takes hundreds of nano-seconds. Kprobe also has a
> limitation of probable points, it can't probe "__kprobes" marked
> functions. So, original markers still has advantages.

Yes ... the zero impact markers are completely dependent on the kprobes
overhead for activation ... on the other hand, one of the vendor
complaints is cost of activation of kprobes, so it's nicely tied into
solving that particular problem.

> However, your approach is also useful, especially for embedding
> thousands of markers in kernel or drivers.
>
> So I think it's better to use both of them as the situation demands.

Certainly ... as I said to Ted, I'm not planning to replace the current
markers, just give a lightweight alternative.

> I just have one comment on its name. Since it doesn't trace
> anything, so I'd rather like notation() or note_mark() than
> trace_simple(). :-)

well ... the current markers code uses trace_mark as its base .. I was
just trying to fit into that scheme.

Also, don't rely on anything in this code yet ... that's why it's an
RFC; I'm still playing around with the section formats and the
information. After more discussions with people, I'm actually coming to
the conclusion that dropping the address of the simple marker might be
very useful (in place of file and line). It makes the marker section
need relocation, but it would also mean they could be used simply from
within the kernel as well.

James