2009-01-10 16:19:23

by Mathieu Desnoyers

Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

* Theodore Tso ([email protected]) wrote:
> On Fri, Jan 09, 2009 at 09:49:01AM -0500, Mathieu Desnoyers wrote:
> >
> > Given this is meant to be in the mainline kernel, using tracepoints
> > rather than markers would be more appropriate. Please see
> > Documentation/tracepoints.txt and samples/tracepoints/ for details.
>
> I saw that, but I'm still not sure how I would actually *use*
> tracepoints. I can use markers today with Systemtap, and I need an
> easy way to be able to selectively do things like "trap and print all
> of the logged data when the ext4_request_inode tracepoint/marker is
> tripped where dev=='dm-8'".
>
> Unfortunately, Systemtap seems doomed given its current course and
> direction, so I'm not opposed to moving to another solution --- I just
> need a HOWTO that demonstrates how I can do this painlessly and
> easily; preferably without my having to manually hack up a kernel
> module each time I want to implement the tracing.
>
> Is there an easy way to do that using ltt today? Can you point me at
> a sample of how this could be done?
>

Hi Ted,

I just ported LTTng to 2.6.28 yesterday and started porting ext4/jbd2
to tracepoints. As you can see in my 0.74 announcement, the tracepoint
work for both jbd2 and ext4 is done. I also did the LTTng probe module
for jbd2. Now I just have to create the probe module for ext4. I also
want to create debugfs files to control per-probe-module filtering,
e.g.:

/mnt/debugfs/ltt/filter/jbd2/dev
/mnt/debugfs/ltt/filter/ext4/dev

Writing to them would add device names to the filter list. I would
like a scheme where we can easily add/remove devices, list all
devices, and so on. I think ftrace already has something similar for
instrumentation activation.
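
To make the idea concrete, here is a minimal sketch of what one of those
control files could look like (hypothetical names, a single-device filter
only, no locking, and error handling mostly omitted), not the actual
LTTng code:

#include <linux/module.h>
#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/string.h>
#include <linux/uaccess.h>

static char filter_dev[32];	/* device currently filtered, "" means all */
static struct dentry *filter_file;

static ssize_t filter_read(struct file *file, char __user *ubuf,
			   size_t count, loff_t *ppos)
{
	/* "cat" on the file shows the current filter. */
	return simple_read_from_buffer(ubuf, count, ppos,
				       filter_dev, strlen(filter_dev));
}

static ssize_t filter_write(struct file *file, const char __user *ubuf,
			    size_t count, loff_t *ppos)
{
	size_t len = min(count, sizeof(filter_dev) - 1);

	/* "echo sda1 > dev" replaces the current filter. */
	if (copy_from_user(filter_dev, ubuf, len))
		return -EFAULT;
	filter_dev[len] = '\0';
	if (len && filter_dev[len - 1] == '\n')	/* strip echo's newline */
		filter_dev[len - 1] = '\0';
	return count;
}

static const struct file_operations filter_fops = {
	.owner	= THIS_MODULE,
	.read	= filter_read,
	.write	= filter_write,
};

static int __init filter_init(void)
{
	filter_file = debugfs_create_file("dev", 0644, NULL, NULL,
					  &filter_fops);
	return filter_file ? 0 : -ENOMEM;
}

static void __exit filter_exit(void)
{
	debugfs_remove(filter_file);
}

module_init(filter_init);
module_exit(filter_exit);
MODULE_LICENSE("GPL");

A real implementation would keep a list of devices and support
add/remove/list semantics, which is exactly the interface question below.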

The main question I am facing is: what interface semantics do we want
for such a filter control file?

Comments/ideas are welcome.

Mathieu

> - Ted
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2009-01-10 18:42:56

by Theodore Ts'o

Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

On Sat, Jan 10, 2009 at 11:19:06AM -0500, Mathieu Desnoyers wrote:
>
> I just ported LTTng to 2.6.28 yesterday and started porting ext4/jbd2
> to tracepoints. As you can see in my 0.74 announcement, the tracepoint
> work for both jbd2 and ext4 is done. I also did the LTTng probe module
> for jbd2. Now I just have to create the probe module for ext4. I also
> want to create debugfs files to control per-probe-module filtering,
> e.g.:
>
> /mnt/debugfs/ltt/filter/jbd2/dev
> /mnt/debugfs/ltt/filter/ext4/dev
>
> Writing to them would add device names to the filter list. I would
> like a scheme where we can easily add/remove devices, list all
> devices, and so on. I think ftrace already has something similar for
> instrumentation activation.
>
> The main question I am facing is: what interface semantics do we want
> for such a filter control file?

Hmm, well, let's see. The most common filtering restriction will be
by device, but I'll occasionally want to filter based on a single
inode; the next most common thing I could foresee wanting to do is to
filter based on a set of inode numbers or on one or more block groups.
(This would be when trying to figure out what is going on with a
particular filesystem benchmark.)

Past a certain point, I recognize that I'll probably have to write a
custom probe module --- although I have to admit that's one of the
things that has spoiled me about SystemTap; it automates the job of
creating the custom probe modules, and allows me to create
Turing-equivalent filtering and data collection.

So the question is where do we draw the line between the most common
filters that are worth putting into a probe module that goes into
mainline, versus what should be done via custom probe modules,
probably by modifying the probe module, using it as an example.

If it's not too much trouble, being able to filter on a single device
(or report the data from all tracepoints) and to filter on a single
inode (or report all inodes) seems to make the most sense.
Does that seem reasonable to you?

The other question is how much data gets reported back; normally with
SystemTap I would report back to userspace only the bits that I needed
to debug whatever issue I was looking at. However, with the LTTng
approach, can we send back all of the bits of data that were in the
markers? It won't be needed for all problems, and I know that sending
too much data back could potentially overflow the kernel
buffering, since there is a limited bandwidth of stuff we can send
back through the kernel<->user interface. I'll have to admit to not
being that well informed about LTTng; what are the practical bandwidth
limitations that I should expect to see when it's in use?

Thanks, regards,

- Ted

P.S. At the kernel summit, there was some talk about a very
simplified interface that would allow us to extract text dumps from
tracepoints without having to download a huge userspace utility like
SystemTap --- or LTTng. Has that been written yet, or is the only way
for me to use Tracepoints right now to figure out how to use the
whole LTTng infrastructure?

2009-01-10 20:40:31

by Steven Rostedt

Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability


On Sat, 10 Jan 2009, Theodore Tso wrote:
>
> P.S. At the kernel summit, there was some talk about a very
> simplified interface that would allow us to extract text dumps from
> tracepoints without having to download a huge userspace utility like
> SystemTap --- or LTTng. Has that been written yet, or is the only way
> for me to use Tracepoints right now to figure out how to use the
> whole LTTng infrastructure?

Actually, that's one of the things ftrace is supposed to do, and it is on
my todo list: to track all tracepoints, and be able to hook into any one
and print it out to the trace.

-- Steve

2009-01-10 21:50:25

by Theodore Ts'o

Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

On Sat, Jan 10, 2009 at 03:40:16PM -0500, Steven Rostedt wrote:
>
> Actually, that's one of the things ftrace is supposed to do, and it is on
> my todo list: to track all tracepoints, and be able to hook into any one
> and print it out to the trace.

As a feature request, can this include actually printing the data
which gets captured at a tracepoint, and not just the fact that we hit
a particular tracepoint?

One nice thing about the markers infrastructure was that it included a
printk format string, so this would have been very easy. The current
tracepoint insertion hook, while convenient in that you can just pass
it a data structure like this:

+ trace_ext4_sync_file(file, dentry, datasync);

....instead of this...

- trace_mark(ext4_sync_file, "dev %s datasync %d ino %ld parent %ld",
- inode->i_sb->s_id, datasync, inode->i_ino,
- dentry->d_parent->d_inode->i_ino);


...has as its downside the fact that it's going to be difficult for
ftrace to pretty-print the contents of the data structures.
Whereas with the trace_mark() format string, it would be trivially
easy for ftrace to print the information captured at the tracepoint.
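
In other words, to get formatted output from the tracepoint, someone has
to write per-tracepoint glue along these lines (a rough, hypothetical
sketch; the callback would still need to be registered with
register_trace_ext4_sync_file()):

#include <linux/fs.h>
#include <linux/dcache.h>
#include <linux/kernel.h>

static void probe_ext4_sync_file(struct file *file, struct dentry *dentry,
				 int datasync)
{
	/* The probe has to know the structure layouts to format anything. */
	printk(KERN_DEBUG "ext4_sync_file: dev %s datasync %d ino %lu parent %lu\n",
	       dentry->d_inode->i_sb->s_id, datasync,
	       dentry->d_inode->i_ino,
	       dentry->d_parent->d_inode->i_ino);
}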

For the bulk of the ext4 tracepoints, merely logging the fact that we
hit the "ext4_sync_file" tracepoint isn't going to be particularly
interesting; it's the data associated with hitting that particular
tracepoint which I really want to be able to get access to.

(And that's why I decided to use markers instead of tracepoints. With
SystemTap, I can access the markers *today*; it looks like being able to
do something useful with tracepoints is still a work in progress....
But as I said, I've been warning the SystemTap folks for a while now
that if they don't get their act together, there will be replacements
written by kernel developers that will be designed to be useful for
kernel developers --- but they chose not to believe James Bottomley or
myself, and blew us off --- so when there is a functional equivalent
ready, I'm happy to switch; it just doesn't seem quite there yet.)


- Ted

2009-01-12 01:34:44

by Mathieu Desnoyers

Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

* Theodore Tso ([email protected]) wrote:
> On Sat, Jan 10, 2009 at 11:19:06AM -0500, Mathieu Desnoyers wrote:
> >
> > I just ported LTTng to 2.6.28 yesterday and started porting ext4/jbd2
> > to tracepoints. As you can see in my 0.74 announcement, the tracepoint
> > work for both jbd2 and ext4 is done. I also did the LTTng probe module
> > for jbd2. Now I just have to create the probe module for ext4. I also
> > want to create debugfs files to control per-probe-module filtering,
> > e.g.:
> >
> > /mnt/debugfs/ltt/filter/jbd2/dev
> > /mnt/debugfs/ltt/filter/ext4/dev
> >
> > Writing to them would add device names to the filter list. I would
> > like a scheme where we can easily add/remove devices, list all
> > devices, and so on. I think ftrace already has something similar for
> > instrumentation activation.
> >
> > The main question I am facing is: what interface semantics do we want
> > for such a filter control file?
>
> Hmm, well, let's see. The most common filtering restriction will be
> by device, but I'll occasionally want to filter based on a single
> inode; the next most common thing I could foresee wanting to do is to
> filter based on a set of inode numbers or on one or more block groups.
> (This would be when trying to figure out what is going on with a
> particular filesystem benchmark.)
>
> Past a certain point, I recognize that I'll probably have to write a
> custom probe module --- although I have to admit that's one of the
> things that has spoiled me about SystemTap; it automates the job of
> creating the custom probe modules, and allows me to create
> Turing-equivalent filtering and data collection.
>
> So the question is where do we draw the line between the most common
> filters that are worth putting into a probe module that goes into
> mainline, versus what should be done via custom probe modules,
> probably by modifying the probe module, using it as an example.
>
> If it's not too much trouble, being able to filter on a single device
> (or report the data from all tracepoints) and to filter on a single
> inode (or report all inodes) seems to make the most sense.
> Does that seem reasonable to you?
>

Yes, very reasonable. It's now implemented in LTTng 0.77. Usage looks
like:

echo "*" > /mnt/debugfs/ltt/filter/ext4/dev
(or just echo "" > /mnt/debugfs/ltt/filter/ext4/dev)
(enables all devices, this is the default)
cat /mnt/debugfs/ltt/filter/ext4/dev
(shows current filter)

echo sda1 > /mnt/debugfs/ltt/filter/ext4/dev
(ext4 probes will only record information for device sda1)
echo 15 > /mnt/debugfs/ltt/filter/ext4/inode
(in addition to device filtering, only inode 15 will be traced)

I think it makes sense to let people edit ltt/probes/ext4-trace.c to
implement their own more complex filtering suiting their needs. It can
be compiled as a kernel module.
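
For instance, a custom check at the top of a probe callback could look
roughly like this (sketch only; it assumes a tracepoint of the form
trace_ext4_request_inode(dir, mode), uses an illustrative module-level
filter_dev string, and leaves out the actual event serialization):

static void probe_ext4_request_inode(struct inode *dir, int mode)
{
	/* Skip recording unless the event belongs to the filtered device;
	 * an empty filter_dev string means "trace all devices". */
	if (filter_dev[0] && strcmp(dir->i_sb->s_id, filter_dev) != 0)
		return;

	/* ... serialize dir->i_ino and mode into the trace buffers here ... */
}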

Here is the ext4 filtering feature diff:
http://ltt.polymtl.ca/cgi-bin/gitweb.cgi?p=linux-2.6-lttng.git;a=blob;f=ltt/probes/ext4-trace.c;h=4824a82fe9c2056d8dea83c57473ecf6c7f1c25d;hb=ad03cc9ead62d85d7641dcd19ea883f07bb7900a

> The other question is how much data gets reported back; normally with
> SystemTap I would report back to userspace only the bits that I needed
> to debug whatever issue I was looking at. However, with the LTTng
> approach, can we send back all of the bits of data that were in the
> markers?

Yes, LTTng consists of a "probe layer" which converts the data taken
from tracepoints into markers. For ext4:
http://ltt.polymtl.ca/cgi-bin/gitweb.cgi?p=linux-2.6-lttng.git;a=blob;f=ltt/probes/ext4-trace.c;h=167357d87e730e33c3ca1fb335358ddc469473ad;hb=1861c6b862df1e6aad2d658984fbff137b15496e

By default, LTTng sends all the data identified by the markers, including
the payload. Note that the marker format string is only recorded once per
trace, in a table in the metadata channel, which makes everything very
compact and efficient.
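
Conceptually, each probe is just a callback hooked on a tracepoint that
re-emits the payload through a marker. A simplified sketch (not the exact
code from ext4-trace.c, and assuming the tracepoint declarations live in a
header like <trace/ext4.h>):

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/dcache.h>
#include <linux/marker.h>
#include <trace/ext4.h>		/* assumed location of the tracepoint decls */

static void probe_ext4_sync_file(struct file *file, struct dentry *dentry,
				 int datasync)
{
	/* Re-emit the tracepoint payload through a printk-style marker. */
	trace_mark(ext4_sync_file, "dev %s datasync %d ino %lu parent %lu",
		   dentry->d_inode->i_sb->s_id, datasync,
		   dentry->d_inode->i_ino,
		   dentry->d_parent->d_inode->i_ino);
}

static int __init ext4_trace_init(void)
{
	return register_trace_ext4_sync_file(probe_ext4_sync_file);
}

static void __exit ext4_trace_exit(void)
{
	unregister_trace_ext4_sync_file(probe_ext4_sync_file);
}

module_init(ext4_trace_init);
module_exit(ext4_trace_exit);
MODULE_LICENSE("GPL");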

> It won't be needed for all problems, and I know that sending
> too much data back could potentially overflow the kernel
> buffering, since there is a limited bandwidth of stuff we can send
> back through the kernel<->user interface. I'll have to admit to not
> being that well informed about LTTng; what are the practical bandwidth
> limitations that I should expect to see when it's in use?
>

LTTng has been heavily optimized to provide good performance. It uses
lightweight locking (RCU and per-cpu local atomic operations) and
splices the data directly from the memory buffers to disk without an
intermediate copy.

A small test to give you an idea of the amount of data generated:

Copying 125 MB worth of data (the drivers/ subdirectory) from an ext3 to
an ext4 partition, both on the same disk (RAID-1), then issuing sync,
while tracing with the trace output on the same disk (on the ext3
partition).

Copy duration: 12.99s
Sync duration: 3.59s

Trace generated: 39M

ext4-related data:
516K /tmp/trace/ext4_0
516K /tmp/trace/ext4_1
516K /tmp/trace/ext4_2
3.1M /tmp/trace/ext4_3
516K /tmp/trace/ext4_4
516K /tmp/trace/ext4_5
516K /tmp/trace/ext4_6
516K /tmp/trace/ext4_7

Therefore, most of the trace size is padding due to unused buffer space.
I should say that this amount of ext4 data output (187.0 kB/s) is really
not a problem for either LTTng or the disk.

A small speed test to figure out the tracer impact:
echo 3 > /proc/sys/vm/drop_caches; time (cp -ax drivers /mnt/testext4/; sync)

Without tracing (1):
real 0m18.182s to 0m19.026s

While tracing the overall kernel (including ext4) to disk:
real 0m19.309s to 0m19.986s (~4.8% execution time overhead over (1))

If you cannot afford to disturb disk activity, you can use the flight
recorder mode, which only keeps the data in memory buffers and overwrites
the oldest data until you stop tracing. At that point, you can grab the
buffers and dump them to disk to analyse the last events. It's very
useful if you know what particular condition you are looking for and if
you are only interested in the latest events prior to the condition.

Flight recorder trace example:

lttctl -C -w /tmp/trace3 -o channel.all.overwrite=1 trace3
...
lttctl -D -w /tmp/trace3 trace3

While tracing the overall kernel (including ext4) to memory buffers:
real 0m18.748s to 0m19.388s (~2.2% execution time overhead over (1))

Note that the current ext4 probe implementation uses the format-string
serializer (printk-like code which iterates on the format string). For
very high-throughput events, we can trivially change to
ltt-type-serializer.[ch] and diminish the execution time impact a lot.
It's just a bit more probe code to write. See kernel-trace.c for
examples using this mechanism. The only reason I use the dynamic
serializer is that I am lazy. ;)

> Thanks, regards,
>
> - Ted
>
> P.S. At the kernel summit, there was some talk about a very
> simplified interface that would allow us to extract text dumps from
> tracepoints without having to download a huge userspace utility like
> SystemTap --- or LTTng. Has that been written yet, or is the only way
> for me to use Tracepoints right now to figure out how to use the
> whole LTTng infrastructure?
>

Yes, we're working on having a binary-to-ASCII converter in the kernel.
People at Fujitsu are helping on this, and I'm waiting for news in this
regard. That's the main element I'm waiting for before I start
submitting LTTng for mainline inclusion.

The current LTTng workflow does not require "huge" userspace utilities.
lttctl (from the ltt-control package) is just a helper over the
/mnt/debugfs/ltt/ interface; this can be done using simple "echo"
commands too. It's therefore optional.

lttd is the daemon which splices the data to disk. It is required.

lttv is built as a single executable with multiple libraries. The
package itself depends on GTK/GDK, but the textDump module really only
needs the glib library.
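
For text output from an existing trace, an invocation along the lines of
the following should work (quoting from memory; module and option names
may differ between lttv versions):

lttv -m textDump -t /tmp/trace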

That said, I agree that also providing text output directly from a
debugfs file would be useful for low-overhead tracing. Note, however,
that this would increase the tracer footprint.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68