2008-11-14 23:04:10

by Mathieu Desnoyers

Subject: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())

Add a new API trace_mark_tp(), which declares a marker within a tracepoint
probe. When the marker is activated, the tracepoint is automatically enabled.

No branch test is used at the marker site, because it would be a duplicate of
the branch already present in the tracepoint.

Impact: new API.
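
For illustration, a minimal usage sketch (hypothetical names; the
"subsys_eventname" tracepoint is assumed to be declared elsewhere with a
struct task_struct * argument):

/*
 * Hypothetical example only.  The probe is non-static so its symbol
 * stays visible when the marker registers the tracepoint callback.
 */
void probe_subsys_eventname(struct task_struct *p)
{
	trace_mark_tp(kernel_subsys_eventname, subsys_eventname,
		      probe_subsys_eventname, "pid %d", p->pid);
}

Enabling the "kernel_subsys_eventname" marker then transparently
registers probe_subsys_eventname on the subsys_eventname tracepoint.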

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: 'Ingo Molnar' <[email protected]>
CC: Lai Jiangshan <[email protected]>
---
include/linux/marker.h |   45 ++++++++++++++++++++++++++++++++++++++++-
init/Kconfig           |    1
kernel/marker.c        |   53 +++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 96 insertions(+), 3 deletions(-)

Index: linux.trees.git/include/linux/marker.h
===================================================================
--- linux.trees.git.orig/include/linux/marker.h 2008-11-14 17:38:47.000000000 -0500
+++ linux.trees.git/include/linux/marker.h 2008-11-14 17:39:45.000000000 -0500
@@ -49,6 +49,8 @@ struct marker {
void (*call)(const struct marker *mdata, void *call_private, ...);
struct marker_probe_closure single;
struct marker_probe_closure *multi;
+ const char *tp_name; /* Optional tracepoint name */
+ void *tp_cb; /* Optional tracepoint callback */
} __attribute__((aligned(8)));

#ifdef CONFIG_MARKERS
@@ -73,7 +75,7 @@ struct marker {
__attribute__((section("__markers"), aligned(8))) = \
{ __mstrtab_##name, &__mstrtab_##name[sizeof(#name)], \
0, 0, marker_probe_cb, \
- { __mark_empty_function, NULL}, NULL }; \
+ { __mark_empty_function, NULL}, NULL, NULL, NULL }; \
__mark_check_format(format, ## args); \
if (unlikely(__mark_##name.state)) { \
(*__mark_##name.call) \
@@ -81,11 +83,38 @@ struct marker {
} \
} while (0)

+#define __trace_mark_tp(name, call_private, tp_name, tp_cb, format, args...) \
+ do { \
+ void __check_tp_type(void) \
+ { \
+ register_trace_##tp_name(tp_cb); \
+ } \
+ static const char __mstrtab_##name[] \
+ __attribute__((section("__markers_strings"))) \
+ = #name "\0" format; \
+ static struct marker __mark_##name \
+ __attribute__((section("__markers"), aligned(8))) = \
+ { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)], \
+ 0, 0, marker_probe_cb, \
+ { __mark_empty_function, NULL}, NULL, #tp_name, tp_cb };\
+ __mark_check_format(format, ## args); \
+ (*__mark_##name.call)(&__mark_##name, call_private, \
+ ## args); \
+ } while (0)
+
extern void marker_update_probe_range(struct marker *begin,
struct marker *end);
#else /* !CONFIG_MARKERS */
#define __trace_mark(generic, name, call_private, format, args...) \
__mark_check_format(format, ## args)
+#define __trace_mark_tp(name, call_private, tp_name, tp_cb, format, args...) \
+ do { \
+ void __check_tp_type(void) \
+ { \
+ register_trace_##tp_name(tp_cb); \
+ } \
+ __mark_check_format(format, ## args); \
+ } while (0)
static inline void marker_update_probe_range(struct marker *begin,
struct marker *end)
{ }
@@ -118,6 +147,20 @@ static inline void marker_update_probe_r
__trace_mark(1, name, NULL, format, ## args)

/**
+ * trace_mark_tp - Marker in a tracepoint callback
+ * @name: marker name, not quoted.
+ * @tp_name: tracepoint name, not quoted.
+ * @tp_cb: tracepoint callback. Should have an associated global symbol so it
+ * is not optimized away by the compiler (should not be static).
+ * @format: format string
+ * @args...: variable argument list
+ *
+ * Places a marker in a tracepoint callback.
+ */
+#define trace_mark_tp(name, tp_name, tp_cb, format, args...) \
+ __trace_mark_tp(name, NULL, tp_name, tp_cb, format, ## args)
+
+/**
* MARK_NOARGS - Format string for a marker with no argument.
*/
#define MARK_NOARGS " "
Index: linux.trees.git/kernel/marker.c
===================================================================
--- linux.trees.git.orig/kernel/marker.c 2008-11-14 17:39:28.000000000 -0500
+++ linux.trees.git/kernel/marker.c 2008-11-14 17:39:45.000000000 -0500
@@ -479,7 +479,7 @@ static int marker_set_format(struct mark
static int set_marker(struct marker_entry *entry, struct marker *elem,
int active)
{
- int ret;
+ int ret = 0;
WARN_ON(strcmp(entry->name, elem->name) != 0);

if (entry->format) {
@@ -531,9 +531,40 @@ static int set_marker(struct marker_entr
*/
smp_wmb();
elem->ptype = entry->ptype;
+
+ if (elem->tp_name && (active ^ elem->state)) {
+ WARN_ON(!elem->tp_cb);
+ /*
+ * It is ok to directly call the probe registration because type
+ * checking has been done in the __trace_mark_tp() macro.
+ */
+
+ if (active) {
+ /*
+ * try_module_get should always succeed because we hold
+ * lock_module() to get the tp_cb address.
+ */
+ ret = try_module_get(__module_text_address(
+ (unsigned long)elem->tp_cb));
+ BUG_ON(!ret);
+ ret = tracepoint_probe_register_noupdate(
+ elem->tp_name,
+ elem->tp_cb);
+ } else {
+ ret = tracepoint_probe_unregister_noupdate(
+ elem->tp_name,
+ elem->tp_cb);
+ /*
+ * tracepoint_probe_update_all() must be called
+ * before the module containing tp_cb is unloaded.
+ */
+ module_put(__module_text_address(
+ (unsigned long)elem->tp_cb));
+ }
+ }
elem->state = active;

- return 0;
+ return ret;
}

/*
@@ -544,7 +575,24 @@ static int set_marker(struct marker_entr
*/
static void disable_marker(struct marker *elem)
{
+ int ret;
+
/* leave "call" as is. It is known statically. */
+ if (elem->tp_name && elem->state) {
+ WARN_ON(!elem->tp_cb);
+ /*
+ * It is ok to directly call the probe registration because type
+ * checking has been done in the __trace_mark_tp() macro.
+ */
+ ret = tracepoint_probe_unregister_noupdate(elem->tp_name,
+ elem->tp_cb);
+ WARN_ON(ret);
+ /*
+ * tracepoint_probe_update_all() must be called
+ * before the module containing tp_cb is unloaded.
+ */
+ module_put(__module_text_address((unsigned long)elem->tp_cb));
+ }
elem->state = 0;
elem->single.func = __mark_empty_function;
/* Update the function before setting the ptype */
@@ -608,6 +656,7 @@ static void marker_update_probes(void)
marker_update_probe_range(__start___markers, __stop___markers);
/* Markers in modules. */
module_update_markers();
+ tracepoint_probe_update_all();
}

/**
Index: linux.trees.git/init/Kconfig
===================================================================
--- linux.trees.git.orig/init/Kconfig 2008-11-14 17:38:29.000000000 -0500
+++ linux.trees.git/init/Kconfig 2008-11-14 17:39:45.000000000 -0500
@@ -808,6 +808,7 @@ config TRACEPOINTS

config MARKERS
bool "Activate markers"
+ depends on TRACEPOINTS
help
Place an empty function call at each marker site. Can be
dynamically changed for a probe function.

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2008-11-16 07:59:59

by Ingo Molnar

Subject: Re: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())


* Mathieu Desnoyers <[email protected]> wrote:

> Add a new API trace_mark_tp(), which declares a marker within a
> tracepoint probe. When the marker is activated, the tracepoint is
> automatically enabled.
>
> No branch test is used at the marker site, because it would be a
> duplicate of the branch already present in the tracepoint.
>
> Impact: new API.

i dont know.

I was actually hoping for markers (the in-kernel API) to go away
completely - and be replaced with tracepoints.

Markers are the wrong design on several levels. They couple the kernel
dynamically with unknown (to the kernel) entities - and that is
causing complexity all around the place, clearly expressed in these
patches of yours.

Tracepoints are much more specific - typed and enumerated function
call callback points in essence - with some politeness that allows
external instrumentation entities to attach to those callbacks.

Is there any usecase of markers that should not be converted to either
tracepoints or to ftrace_printk() ?

Ingo

2008-11-18 04:44:18

by Mathieu Desnoyers

Subject: Re: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())

* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > Add a new API trace_mark_tp(), which declares a marker within a
> > tracepoint probe. When the marker is activated, the tracepoint is
> > automatically enabled.
> >
> > No branch test is used at the marker site, because it would be a
> > duplicate of the branch already present in the tracepoint.
> >
> > Impact: new API.
>
> i dont know.
>
> I was actually hoping for markers (the in-kernel API) to go away
> completely - and be replaced with tracepoints.
>
> Markers are the wrong design on several levels. They couple the kernel
> dynamically with unknown (to the kernel) entities - and that is
> causing complexity all around the place, clearly expressed in these
> patches of yours.
>
> Tracepoints are much more specific - typed and enumerated function
> call callback points in essence - with some politeness that allows
> external instrumentation entities to attach to those callbacks.
>
> Is there any usecase of markers that should not be converted to either
> tracepoints or to ftrace_printk() ?
>
> Ingo

I'll try to explain the common use-case I have in mind. I think starting
from that we'll be able to better understand what pieces of tracepoints
and markers are useful, and which purpose they fulfill. I have other
use-cases in mind too, but for the sake of clarity, let's begin with this
subset.

Tracepoints instrument the kernel. They identify code locations and
extract the state of pre-identified interesting variables at these
locations. They are built into the kernel. They are closely tied to
kernel internals, and hence only provide an in-kernel API.

Markers identify the name (and therefore numeric ID) to attach to an
"event" and the data types to export into trace buffers for this
specific event type. These data types are fully expressed in a marker
format-string table recorded in a "metadata" channel. The size of the
various basic types and the endianness is recorded in the buffer header.
Therefore, the binary trace buffers are self-described.

Data is exported through binary trace buffers out of kernel-space,
either by writing directly to disk, sending data over the network, crash
dump extraction, etc.

User-space applications, which run on an architecture with potentially
different endianness and different type sizes, read the binary buffers.
Therefore, those buffers must be fully self-described so the applications
can read them portably (or even just in a 32-bit userspace running under
a 64-bit kernel).

In debug-mode tracing (where the developer wants to add
ftrace_printk()-like statements to their kernel and recompile), the
extracted information should be made available through the same binary
buffers where the information extracted from the tracepoints is saved.
Adding printk-like statements to the kernel should not suffer from the
time and buffer-space penalty of expanding the raw data to text.
However, ftrace_printk() requires pretty-printing the data as text.
It would also be too cumbersome to ask developers to deploy their own
tracepoints whenever they want to create ad-hoc debug-style tracing
statements. This is where trace_mark() statements do better than
ftrace_printk(): they allow dumping the information to the trace buffers
in binary format through a simple-to-use, format-string-based macro.
Note that each marker format string is dumped in the metadata table to
identify the new event types.
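
As an illustration (the event and field names are made up), such an
ad-hoc debug statement is simply:

	trace_mark(mm_debug_page_alloc, "order %d pfn %lu",
		   order, page_to_pfn(page));

Only the binary arguments hit the trace buffer at runtime; the format
string "order %d pfn %lu" is written once to the metadata channel, so
the post-processing tool knows how to name and decode the event fields.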

So that was my use-case, which I think cannot be converted to
tracepoints or ftrace_printk(). If things are not as clear as they
should be, don't hesitate to ask for clarifications.

I also think we might consider removing the "multiple callback support"
from markers if we tie them more closely to the data extraction code,
given this multiple-callback role is already fulfilled by tracepoints.
OTOH, given that my use-case can use markers as an information source,
thus bypassing tracepoints for debug-style tracing, we may still need to
connect more than a single probe to the markers. It becomes very useful
for use-cases such as: dumping the kernel or userspace stack, reading
the performance counters and activating/de-activating the function
tracer at specific instrumentation sites.
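
To make this concrete, here is a hypothetical sketch (marker name,
format and probe bodies are made up; only the marker_probe_register()
prototype is the existing one) of two probes attached to the same
marker:

#include <linux/module.h>
#include <linux/marker.h>

static void ltt_probe(void *probe_private, void *call_private,
		      const char *fmt, va_list *args)
{
	/* serialize the marker arguments into the trace buffers */
}

static void stack_dump_probe(void *probe_private, void *call_private,
			     const char *fmt, va_list *args)
{
	/* dump the stack of the current task */
}

static int __init probes_init(void)
{
	int ret;

	/* both probes end up attached to the same marker */
	ret = marker_probe_register("subsys_eventname", "pid %d",
				    ltt_probe, NULL);
	if (ret)
		return ret;
	return marker_probe_register("subsys_eventname", "pid %d",
				     stack_dump_probe, NULL);
}
module_init(probes_init);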

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-18 16:31:16

by Ingo Molnar

Subject: Re: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())


* Mathieu Desnoyers <[email protected]> wrote:

> Markers identify the name (and therefore numeric ID) to attach to an
> "event" and the data types to export into trace buffers for this
> specific event type. These data types are fully expressed in a
> marker format-string table recorded in a "metadata" channel. The
> size of the various basic types and the endianness is recorded in
> the buffer header. Therefore, the binary trace buffers are
> self-described.
>
> Data is exported through binary trace buffers out of kernel-space,
> either by writing directly to disk, sending data over the network,
> crash dump extraction, etc.

Streaming gigabytes of data is really mostly only done when we know
_nothing_ useful about a failure mode and are _forced_ into logging
gobs and gobs of data at great expense.

And thus in reality this is a rather uninteresting usecase.

We do recognize and support it as it's a valid "last line of defense"
for system and application failure analysis, but we should also put it
all into proper perspective: it's the rare and abnormal exception, not
the design target.

Note that we support this mode of tracing today already: we can
already stream binary data via the ftrace channel - the ring buffer
gives the infrastructure for that. Just do:

# echo bin > /debug/tracing/trace_options

... and you'll get the trace data streamed to user-space in an
efficient, raw, binary data format!
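
For example, to capture that stream to a file (assuming the usual
trace_pipe interface):

 # cat /debug/tracing/trace_pipe > /tmp/trace.bin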

This works here and today - and if you'd like it to become more
efficient within the ftrace framework, we are all for it. (It's
obviously not the default mode of output, because humans prefer ASCII
and scriptable output formats by a _wide_ margin.)

Almost by definition anything opaque and binary-only that goes from
the kernel to user-space has fundamental limitations: it just doesnt
actively interact with the kernel for us to be able to form a useful
and flexible filter of information around it.

The _real_ solution to tracing in 99% of the cases is to intelligently
limit information - it's not like the user will read and parse
gigabytes of data ...

Look at the myriads of rather useful ftrace plugins we have already
and that sprung out of nothing. Compare it to the _10 years_ of
inaction that more static tracing concepts created. Those plugins work
and spread because it all lives and breathes within the kernel, and
almost none of that could be achieved via the 'stream binary data to
user-space' model you are concentrating on.

So in the conceptual space i can see little use for markers in the
kernel that are not tracepoints (i.e. not actively used by a real
tracer). We had markers in the scheduler initially, then we moved to
tracepoints - and tracepoints are much nicer.

[ And you wrote both markers and tracepoints, so it's not like i risk
degenerating this discussion into a flamewar by advocating one of
your solutions over the other one ;-) ]

... and in that sense i'd love to see lttng become a "super ftrace
plugin", and be merged upstream ASAP.

We could even split it up into multiple bits as its merged: for
example syscall tracing would be a nice touch that a couple of other
plugins would adapt as well. But every tracepoint should have some
active role and active connection to a tracer.

And we'd keep all those tracepoints open for external kprobes use as
well - for the dynamic tracers, as a low-cost courtesy. (no long-term
API guarantees though.)

Hm?

Ingo

2008-11-23 16:40:54

by Mathieu Desnoyers

Subject: Re: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())

* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > Markers identify the name (and therefore numeric ID) to attach to an
> > "event" and the data types to export into trace buffers for this
> > specific event type. These data types are fully expressed in a
> > marker format-string table recorded in a "metadata" channel. The
> > size of the various basic types and the endianness is recorded in
> > the buffer header. Therefore, the binary trace buffers are
> > self-described.
> >
> > Data is exported through binary trace buffers out of kernel-space,
> > either by writing directly to disk, sending data over the network,
> > crash dump extraction, etc.
>
> Streaming gigabytes of data is really mostly only done when we know
> _nothing_ useful about a failure mode and are _forced_ into logging
> gobs and gobs of data at great expense.
>
> And thus in reality this is a rather uninteresting usecase.
>
> We do recognize and support it as it's a valid "last line of defense"
> for system and application failure analysis, but we should also put it
> all into proper perspective: it's the rare and abnormal exception, not
> the design target.
>

Hrm, the fact that you assume that large data-throughput recording is
seldom used shows you have not been in contact with the same user-base I
have been interacting with. A few examples of successful LTTng users :

- Google are deploying LTTng on their servers. They want to use it to
monitor their production servers (with flight-recorder-mode tracing)
and to help them solve hard-to-reproduce problems. They have had
success with such a tracing approach, fixing "rare disk delay" issues and
VM-related issues presented in this article:

* "Linux Kernel Debugging on Google-sized clusters at Ottawa Linux
Symposium 2007"
http://ltt.polymtl.ca/papers/bligh-Reprint.pdf

- IBM Research have had problems with Commercial Scale-out applications,
which are an increasing trend for splitting large server workloads.
They used LTTng successfully to solve a distributed-filesystem-related
issue. It's presented in the same paper above.

- Autodesk, in the development of their next generation of Linux
audio/video editing applications, used LTTng extensively to solve
the soft real-time issues they had. Also presented in the same paper.

- Wind River included LTTng in their Linux distribution so their
clients, already familiar with Wind River's own tracing solution in
VxWorks, can have the same kind of feature they have relied on for a
long time.

- Montavista have integrated LTTng in their distribution for the same
reasons. It's used by Sony amongst others.

- SuSE are currently integrating LTTng in their next SLES distribution,
because their clients, who ask for solutions supporting a kernel closer
to real-time, need such tools to debug their problems.

- A project between Ericsson, the Canadian Defense, NSERC and various
universities is just starting. It aims at monitoring and debugging
multi-core systems, at providing automated analysis, and at helping
users analyze system behavior.

- Siemens have been using LTTng internally for quite some time now.

The wide user-base I have been interacting with, which ranges from
expert developers to lead OS researchers, all agree on the strong need
for a tool streaming gigabytes of data, as LTTng does, to help analyze
the problem offline. I think that Linux kernel developers might be a bit
biased in this respect, because they happen to know what they are looking
for. But users, even experts, often have very little clue where the
problem might be in the large applications they are developing. And as
the number of cores grows and applications get larger and more complex,
this problem is not likely to lessen.

> Note that we support this mode of tracing today already: we can
> already stream binary data via the ftrace channel - the ring buffer
> gives the infrastructure for that. Just do:
>
> # echo bin > /debug/tracing/trace_options
>
> ... and you'll get the trace data streamed to user-space in an
> efficient, raw, binary data format!
>
> This works here and today - and if you'd like it to become more
> efficient within the ftrace framework, we are all for it. (It's
> obviously not the default mode of output, because humans prefer ASCII
> and scriptable output formats by a _wide_ margin.)
>
> Almost by definition anything opaque and binary-only that goes from
> the kernel to user-space has fundamental limitations: it just doesnt
> actively interact with the kernel for us to be able to form a useful
> and flexible filter of information around it.
>

This is the exact reason why I have an elaborate scheme to export
binary data to userspace in LTTng. The LTTng buffer format is binary-only,
but it is *not* opaque. It is self-described and portable, and a simple
user-space tool can format it to text without much effort.

> The _real_ solution to tracing in 99% of the cases is to intelligently
> limit information - it's not like the user will read and parse
> gigabytes of data ...
>

Yes, limiting the information flow is sometimes required; e.g. we don't
want to export lockdep-rate information all the time. However, having
enough information within the trace to understand the surroundings of a
problematic behavior can greatly help identify the root cause of the
problem. Ideally, having tools which automatically find the interesting
spots in those gigabytes of data (which we have in lttv), help
represent the information in graphical form (which helps users find
execution patterns much more easily.. it's impressive to see how good
the human brain can be at pattern recognition), and let the user dig into
the detailed information located near the problematic execution scenario
is of inestimable value.

In many of the cases explained above, this has led to fixing the
problematic situation after a few hours with a tracer, rather than after
a few weeks of painful trial-and-error debugging (involving many
developers).

> Look at the myriads of rather useful ftrace plugins we have already
> and that sprung out of nothing. Compare it to the _10 years_ of
> inaction that more static tracing concepts created. Those plugins work
> and spread because it all lives and breathes within the kernel, and
> almost none of that could be achieved via the 'stream binary data to
> user-space' model you are concentrating on.

It's great that you have such momentum for ftrace, and yes, there is a
need for in-kernel analysis, because some workloads might be better
suited for in-kernel analysis (potentially because they generate a
too-high tracing throughput for the available system resources).

However, the comparison you are doing here is simply unfair. Ftrace has
this momentum simply because it happens to be shipped with the Linux
kernel. LTTng has been an out-of-tree patch for about 4 years now and
has generated a great deal of interest among users who can afford to
deploy their own kernel. Therefore, the real reason why ftrace has such
popularity is just because, as you say, it "all lives and breathes
within the kernel". It has nothing to do with in-kernel vs
post-processing analysis or with ascii vs binary data streaming.

Also, doing the analysis part within the kernel has a downside: it adds
complexity to the kernel itself. It adds analyses which are sometimes
complex and require additional data structures within the kernel. The
advantage of streaming the data out of the kernel is that it makes the
kernel side of tracing trivially simple: we get the data out to a
memory buffer. Period. This helps keep tracing less intrusive, minimizes
the risk of disrupting normal system behavior, etc.

>
> So in the conceptual space i can see little use for markers in the
> kernel that are not tracepoints (i.e. not actively used by a real
> tracer). We had markers in the scheduler initially, then we moved to
> tracepoints - and tracepoints are much nicer.
>

I am open to changes to the markers API, and we may need to do some, but
in the LTTng scheme they fulfill a very important requirement: they turn
what would otherwise be "opaque binary data", as you call it, into fully
described, parseable binary data.

We can then think of LTTng binary data as a very efficient data
representation which can be automatically, and generically, transformed
into text with a simple binary-to-ascii parser (think of it as a
printk-like parser).
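
To give an idea, here is a toy sketch (nothing like the actual lttv
code; it assumes native endianness and fixed type sizes, which the real
metadata describes explicitly) of how such a parser walks the recorded
format string and consumes the matching binary fields:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Decode one event payload using the format string recorded in the
 * metadata channel.  Only %d and %s are handled here, for brevity. */
static void print_event(const char *fmt, const unsigned char *payload)
{
	const unsigned char *p = payload;

	for (; *fmt; fmt++) {
		if (*fmt != '%') {
			putchar(*fmt);
			continue;
		}
		switch (*++fmt) {
		case 'd': {
			int32_t v;

			memcpy(&v, p, sizeof(v));
			p += sizeof(v);
			printf("%d", v);
			break;
		}
		case 's':
			printf("%s", (const char *)p);
			p += strlen((const char *)p) + 1;
			break;
		case '\0':
			putchar('\n');
			return;		/* malformed: trailing '%' */
		default:
			break;		/* other conversions left out */
		}
	}
	putchar('\n');
}

int main(void)
{
	/* Fake event: marker format "pid %d comm %s", payload { 42, "bash" } */
	unsigned char buf[64];
	int32_t pid = 42;

	memcpy(buf, &pid, sizeof(pid));
	strcpy((char *)buf + sizeof(pid), "bash");
	print_event("pid %d comm %s", buf);	/* prints: pid 42 comm bash */
	return 0;
}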

> [ And you wrote both markers and tracepoints, so it's not like i risk
> degenerating this discussion into a flamewar by advocating one of
> your solutions over the other one ;-) ]
>

That's ok.. I'm just trying to show which design space markers currently
fulfill.

> ... and in that sense i'd love to see lttng become a "super ftrace
> plugin", and be merged upstream ASAP.
>

Hrm, a big part of LTTng is its optimized buffering mechanism, which has
been more thoroughly tested and is more flexible than the one currently
found in ftrace (it has supported lockless NMI-safe tracing for over 2
years). It also separates the various tracing layers into separate
modules, which keeps the buffer management layer from being mixed with
the timestamping layer and the memory backend layer.. I am open to trying
to merge ftrace and lttng together, but I think it will be more than a
simple "make lttng an ftrace plugin".

> We could even split it up into multiple bits as its merged: for
> example syscall tracing would be a nice touch that a couple of other
> plugins would adapt as well. But every tracepoint should have some
> active role and active connection to a tracer.
>
> And we'd keep all those tracepoints open for external kprobes use as
> well - for the dynamic tracers, as a low-cost courtesy. (no long-term
> API guarantees though.)
>

Ah, I see.. when you speak of LTTng as a "super ftrace plugin", you
refer to the LTTng tracepoints only (the instrumentation). In that
sense, yes, we could add the LTTng tracepoints into the Linux kernel and
make ftrace a user of those without any problem. And as you say, no
guaranteed API (this is in-kernel only).

> Hm?

Well, given that I currently have:

- trace_clock() infrastructure for timestamping (which I could submit
for -tip, I think it's ready)

- LTTng instrumentation, which could be used by many tracers.

- LTTng buffer management, trace control, etc., which we might want to
put through a review phase, trying to mix and match the best features
of LTTng and ftrace.

I think we could end up with a tracer which would be faster, more solid
and would support both what ftrace is currently doing and what LTTng is
doing. But if we want to do that, we have to both recognise that the
use-cases filled by ftrace and by LTTng are complementary and are all
needed by the community overall.

Mathieu

>
> Ingo

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-23 16:49:57

by Ingo Molnar

Subject: Re: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())


* Mathieu Desnoyers <[email protected]> wrote:

> > ... and in that sense i'd love to see lttng become a "super ftrace
> > plugin", and be merged upstream ASAP.
>
> Hrm, a big part of LTTng is its optimized buffering mechanism, which
> has been more thoroughly tested and is more flexible than the one
> currently found in ftrace (it has supported lockless NMI-safe tracing
> for over 2 years). It also separates the various tracing layers into
> separate modules, which keeps the buffer management layer from being
> mixed with the timestamping layer and the memory backend layer.. I am
> open to trying to merge ftrace and lttng together, but I think it will
> be more than a simple "make lttng an ftrace plugin".

Sure, it will certainly be a fair bit of (interesting and useful!)
work.

But it's something we should do iteratively, instead of merging a
parallel framework. Steve's ring-buffer should be a fair step in the
right direction. Could we improve on that?

i see nothing in the feature list of LTTng that we wouldnt want to
have for ftrace, agreed? And we could do an ftrace tracer named
'LTTng' which would interact with lttng userspace in the customary
way.

Ingo

2008-11-24 08:05:27

by Mathieu Desnoyers

Subject: Re: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())

* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > > ... and in that sense i'd love to see lttng become a "super ftrace
> > > plugin", and be merged upstream ASAP.
> >
> > Hrm, a big part of LTTng is its optimized buffering mechanism, which
> > has been more thoroughly tested and is more flexible than the one
> > currently found in ftrace (it has supported lockless NMI-safe tracing
> > for over 2 years). It also separates the various tracing layers into
> > separate modules, which keeps the buffer management layer from being
> > mixed with the timestamping layer and the memory backend layer.. I
> > am open to trying to merge ftrace and lttng together, but I think it
> > will be more than a simple "make lttng an ftrace plugin".
>
> Sure, it will certainly be a fair bit of (interesting and useful!)
> work.
>
> But it's something we should do iteratively, instead of merging a
> parallel framework. Steve's ring-buffer should be a fair step in the
> right direction. Could we improve on that?
>

I think it's a very good thing that both Steve and I went through the
effort of creating such a buffering mechanism, because it created healthy
competition and we have been able to cross-check each other's code (well,
I actually pointed out some bugs he had in his tracer code just because
I ran into the same issues).

The only thing is that ftrace's needs are a subset of what LTTng needs
and provides. As Steve has stated very clearly in the past weeks, he
doesn't care about exporting binary data in an organized, self-described
manner. This is actually a key component of LTTng. Therefore, from
Steve's point of view there seems to be no will nor need to self-describe
the information in the trace buffers.

However, even if such self-description is not used by ftrace-type
tracers within the kernel, it could just be *there* so we can support
high-throughput streaming of organized (non-opaque) binary data when
needed. An ftrace tracer could then participate in this information flow
and make the data available for high-speed extraction. I think this
would be very beneficial to the LTTng use-cases and would not hurt
ftrace tracers at all.

> i see nothing in the feature list of LTTng that we wouldnt want to
> have for ftrace, agreed?

Yes, the only thing is that LTTng might be doing a little bit more than
what ftrace needs, because it needs to export the information to
userspace.

> And we could do an ftrace tracer named
> 'LTTng' which would interact with lttng userspace in the customary
> way.

Or we could think of making ftrace an LTTng information source, so it
would be able to export all its information to userspace through the
high-speed self-described binary trace buffers (channels) and also
manage to keep all of its current features. The way I see it, because
ftrace has simpler requirements due to the fact it does not export
binary data across the kernel boundary, ftrace buffering can be seen as
a subset of the LTTng requirements.

Besides, looking at LWN's review of Steve's ring buffer implementation:
"A casual perusal of the patch might well leave a reader confused; 2000
lines of relatively complex code to implement what is, in the end, just
a circular buffer. This circular buffer is not even suitable for use by
tracing frameworks yet; a separate "tracing" layer is to be added for
that."

I achieve all the core work of atomic-ops-based buffering,
self-description of traces and all I said above with a design that is
easier to review, in a few more lines, but distributed across smaller
files:

  850  ltt-marker-control.c  (marker userspace control, event ID
                              management)
  179  ltt-relay.h
  701  ltt-relay-alloc.c     (buffer memory backend, vmap-less,
                              contiguous memory range emulated by software)
 1648  ltt-relay.c           (atomic buffer writer/reader synchronization)
  780  ltt-tracer.h          (header definitions, prototypes, inlines)

Compared to:
 2241  ring_buffer.c
  130  ring_buffer.h

So in the end, my argument is not based on a simple "I prefer my
code to someone else's code"; I simply think that it could be easier
to integrate the tracer with fewer constraints into the tracer with
more constraints (principally exporting data to userspace) than to
do the opposite.

And everything ftrace could need that LTTng currently lacks (like an
in-kernel binary-to-text converter) is already on the LTTng roadmap.

Mathieu

>
> Ingo

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-25 12:24:18

by KOSAKI Motohiro

Subject: Re: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())

Hi

Sorry for the very late response.
However, if you are discussing removing markers, I hope you will CC
Frank Eigler, because SystemTap also uses markers and tracepoints,
and this patch also improves SystemTap marker support, I think.

IIRC, SystemTap currently also has a marker/tracepoint on/off
synchronization problem.


> > Add a new API trace_mark_tp(), which declares a marker within a
> > tracepoint probe. When the marker is activated, the tracepoint is
> > automatically enabled.
> >
> > No branch test is used at the marker site, because it would be a
> > duplicate of the branch already present in the tracepoint.
> >
> > Impact: new API.
>
> i dont know.
>
> I was actually hoping for markers (the in-kernel API) to go away
> completely - and be replaced with tracepoints.
>
> Markers are the wrong design on several levels. They couple the kernel
> dynamically with unknown (to the kernel) entities - and that is
> causing complexity all around the place, clearly expressed in these
> patches of yours.
>
> Tracepoints are much more specific - typed and enumerated function
> call callback points in essence - with some politeness that allows
> external instrumentation entities to attach to those callbacks.
>
> Is there any usecase of markers that should not be converted to either
> tracepoints or to ftrace_printk() ?

Frank, Could you please read this thread and give us your comment?


2008-11-25 17:26:31

by Frank Ch. Eigler

Subject: Re: [patch 06/16] Markers auto enable tracepoints (new API : trace_mark_tp())

Hi -

On Tue, Nov 25, 2008 at 09:23:57PM +0900, KOSAKI Motohiro wrote:
> Sorry for the very late response.
> However, if you are discussing removing markers, I hope you will CC
> Frank Eigler.

Thanks!

> because SystemTap also uses markers and tracepoints, and this patch
> also improves SystemTap marker support, I think.

I believe that this trace_mark_tp variant would operate at a level
below systemtap's. It would mainly improve kernel performance and
simplify kernel code.

> IIRC, SystemTap currently also has a marker/tracepoint on/off
> synchronization problem.

Right, to the extent that systemtap can currently connect to markers
and not yet tracepoints, so tracepoints need to be converted to
markers or another systemtap-visible hooking mechanism at some point.
This is a temporary state of affairs though, as we hope to talk to
tracepoints directly before too long.


mingo wrote:

> > I was actually hoping for markers (the in-kernel API) to go away
> > completely - and be replaced with tracepoints.

(If you're seriously contemplating outright removal of this API, then
the earlier worries about how markers would somehow tie their users'
hands and keep them around forever were surely exaggerated.)


> > Markers are the wrong design on several levels. They couple the kernel
> > dynamically with unknown (to the kernel) entities - [...]
> > Tracepoints are much more specific - typed and enumerated function
> > call callback points in essence [...]

You are confusing type safety and "unknown (to the kernel)"-ness.

Markers provide some type safety via the format string; tracepoints
via individual C declarations.
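
For instance (a sketch only, with made-up names, in the DECLARE_TRACE()
style):

/* tracepoint: the callback prototype is checked by the compiler */
DECLARE_TRACE(subsys_eventname,
	TPPROTO(struct task_struct *p),
	TPARGS(p));

/* marker: the format string documents the types, and gets the same
 * printf-style argument checking as printk */
trace_mark(subsys_eventname, "pid %d", p->pid);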

Markers can connect to "unknown (to the kernel)" clients since their
API is EXPORT_SYMBOL_GPL'd. Tracepoints do exactly the same thing.

There are certainly differences between them, but these two particular
ones are immaterial.


- FChE