2009-09-29 21:40:23

by Josh Stone

[permalink] [raw]
Subject: [PATCH] ext4: Add a stub for mpage_da_data in the trace header

The tracepoint ext4_da_write_pages has a struct mpage_da_data*
parameter, but that struct is only defined in fs/ext4/ext4.h. This
patch adds a forward declaration for that struct, so this tracepoint
header can still be used by tools like SystemTap.

This is a continuation of the fix in commit 3661d286.

http://sourceware.org/bugzilla/show_bug.cgi?id=10703

Signed-off-by: Josh Stone <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
---
include/trace/events/ext4.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index c1bd8f1..d502974 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -11,6 +11,7 @@ struct ext4_allocation_context;
struct ext4_allocation_request;
struct ext4_prealloc_space;
struct ext4_inode_info;
+struct mpage_da_data;

#define EXT4_I(inode) (container_of(inode, struct ext4_inode_info, vfs_inode))

--
1.6.2.5


2009-09-30 13:33:33

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] ext4: Add a stub for mpage_da_data in the trace header

On Tue, Sep 29, 2009 at 02:40:07PM -0700, Josh Stone wrote:
> The tracepoint ext4_da_write_pages has a struct mpage_da_data*
> parameter, but that struct is only defined in fs/ext4/ext4.h. This
> patch adds a forward declaration for that struct, so this tracepoint
> header can still be used by tools like SystemTap.

That's not what the tracepoints are for anyway. No hacks for out of
tree crap like this please - just use the trace buffer directly like
ftrace and then you don't actually need any data types.

2009-09-30 14:20:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Add a stub for mpage_da_data in the trace header

On Wed, Sep 30, 2009 at 09:33:35AM -0400, Christoph Hellwig wrote:
> On Tue, Sep 29, 2009 at 02:40:07PM -0700, Josh Stone wrote:
> > The tracepoint ext4_da_write_pages has a struct mpage_da_data*
> > parameter, but that struct is only defined in fs/ext4/ext4.h. This
> > patch adds a forward declaration for that struct, so this tracepoint
> > header can still be used by tools like SystemTap.
>
> That's not what the tracepoints are for anyway. No hacks for out of
> tree crap like this please - just use the trace buffer directly like
> ftrace and then you don't actually need any data types.
>

Josh, you do realize that SystemTap can pull stuff out of the trace
buffer by examining information found in

/sys/kernel/debug/tracing/events/*/*/format

right? For example:

name: ext4_sync_fs
ID: 600
format:
field:unsigned short common_type; offset:0; size:2;
field:unsigned char common_flags; offset:2; size:1;
field:unsigned char common_preempt_count; offset:3; size:1;
field:int common_pid; offset:4; size:4;
field:int common_tgid; offset:8; size:4;

field:dev_t dev; offset:12; size:4;
field:int wait; offset:16; size:4;

print fmt: "dev %s wait %d", jbd2_dev_to_name(REC->dev), REC->wait

No need to for users to download gigabytes and gigabytes of DWARF
crapola (and for developers to grow old waiting for a kernel compile
with -g enabled to complete); SystemTap can just get what it needs
straight out of /sys/kernel/debug/tracing/events, and then reading
what it needs out of the ring buffer.

In any case I've added the blind structure pointer to
include/trace/events/ext4.h, but in the long term you're *much* better
off pulling the information you need from the ftrace ring buffer and
getting the information you need to interpret it out of
/sys/kernel/debug/tracing.

- Ted

2009-09-30 19:45:29

by Josh Stone

[permalink] [raw]
Subject: Re: [PATCH] ext4: Add a stub for mpage_da_data in the trace header

On 09/30/2009 07:20 AM, Theodore Tso wrote:
> On Wed, Sep 30, 2009 at 09:33:35AM -0400, Christoph Hellwig wrote:
>> On Tue, Sep 29, 2009 at 02:40:07PM -0700, Josh Stone wrote:
>>> The tracepoint ext4_da_write_pages has a struct mpage_da_data*
>>> parameter, but that struct is only defined in fs/ext4/ext4.h. This
>>> patch adds a forward declaration for that struct, so this tracepoint
>>> header can still be used by tools like SystemTap.
>>
>> That's not what the tracepoints are for anyway. No hacks for out of
>> tree crap like this please - just use the trace buffer directly like
>> ftrace and then you don't actually need any data types.

If you just want the data in the trace buffer, then SystemTap is not the
tool for you. By all means, just write yourself a perl script or
something that parses the trace buffer however you like.

On the other hand, stap is useful to do some processing/inspection
*live*, at the moment the event happens. For that, we register our own
tracepoint handler that can do something different than ftrace.

Anyway, I don't think it's a hack to add a simple struct declaration.
It's wrong to have an include/... header that requires definitions from
some subsystem internals, and that's the core thing I'm fixing here.

> Josh, you do realize that SystemTap can pull stuff out of the trace
> buffer by examining information found in
>
> /sys/kernel/debug/tracing/events/*/*/format
>
> right?

Sure, but we want to be able to do more than just post-process the trace
buffer, so we hook up directly to the tracepoints.

> No need to for users to download gigabytes and gigabytes of DWARF
> crapola (and for developers to grow old waiting for a kernel compile
> with -g enabled to complete); SystemTap can just get what it needs
> straight out of /sys/kernel/debug/tracing/events, and then reading
> what it needs out of the ring buffer.

FWIW, Fedora rawhide's x86_64 kernel-debuginfo is a ~140MB package,
which installs to ~1.1GB, so that situation is not quite so bad. The
compile time is of course still a beast.

However, SystemTap does *not* require the kernel debuginfo for using
tracepoints, even when reading parameters. It should work in the
complete absence of CONFIG_DEBUGINFO, so if you find otherwise, please
let me know and I will fix it.

> In any case I've added the blind structure pointer to
> include/trace/events/ext4.h, but in the long term you're *much* better
> off pulling the information you need from the ftrace ring buffer and
> getting the information you need to interpret it out of
> /sys/kernel/debug/tracing.

Thank you for including it.

Josh

2009-09-30 21:23:39

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Add a stub for mpage_da_data in the trace header

On Wed, Sep 30, 2009 at 12:45:23PM -0700, Josh Stone wrote:
> If you just want the data in the trace buffer, then SystemTap is not the
> tool for you. By all means, just write yourself a perl script or
> something that parses the trace buffer however you like.
>
> On the other hand, stap is useful to do some processing/inspection
> *live*, at the moment the event happens. For that, we register our own
> tracepoint handler that can do something different than ftrace.

So there are two things I would point out here. First of all, now
that ftrace has the ability to do basic filtering, just about the only
thing SystemTap can do which is unique is either complex filtering,
summary statistics, or some kind of correlation between multiple
events (within the limits of restricted memory allocation limits of
SystemTap). So I'm not sure it's such a great idea to cede a large
bit of functionality to as being something that SystemTap will never
accomplish --- especially when it's far more convenient and stable to
depend on fixed trace points than setting arbitrary dynamic trace
points in the middle of source files which will break all the time
when distro's release new kernels, etc.

Secondly, while I'm not so sure it's that big of a restriction to have
Systemtap pull events out of the trace buffer, if you must capture the
event right as it happens, it should be possible set a kprobe in the
ftrace subsystem, and then pull out the data of the event from the
trace buffer. Keep in mind that one of the advantage DTrace has over
SystemTap is that it can use pre-defined events in the kernel, and not
have to keep userspace macro files in sync with a changing kernel
source base. It seems counterproductive to throw away the opportunity
of being able to read the tracepoint event data, since it would give
SystemTap a lot more power. As people start creating more and more
tracepoints, being able to tap into them without needing to download
(or build) symbol tables would be a huge advantage.

> FWIW, Fedora rawhide's x86_64 kernel-debuginfo is a ~140MB package,
> which installs to ~1.1GB, so that situation is not quite so bad. The
> compile time is of course still a beast.
>
> However, SystemTap does *not* require the kernel debuginfo for using
> tracepoints, even when reading parameters. It should work in the
> complete absence of CONFIG_DEBUGINFO, so if you find otherwise, please
> let me know and I will fix it.

Well, how is it going to do that if you don't have access to the
structure definition? This is why fetching the information from the
ring buffer is much more powerful.

- Ted

2009-10-01 00:13:05

by Josh Stone

[permalink] [raw]
Subject: Re: [PATCH] ext4: Add a stub for mpage_da_data in the trace header

On 09/30/2009 02:23 PM, Theodore Tso wrote:
> On Wed, Sep 30, 2009 at 12:45:23PM -0700, Josh Stone wrote:
>> If you just want the data in the trace buffer, then SystemTap is not the
>> tool for you. By all means, just write yourself a perl script or
>> something that parses the trace buffer however you like.
>>
>> On the other hand, stap is useful to do some processing/inspection
>> *live*, at the moment the event happens. For that, we register our own
>> tracepoint handler that can do something different than ftrace.
>
> So there are two things I would point out here. First of all, now
> that ftrace has the ability to do basic filtering, just about the only
> thing SystemTap can do which is unique is either complex filtering,
> summary statistics, or some kind of correlation between multiple
> events (within the limits of restricted memory allocation limits of
> SystemTap).

This "only thing" seems like quite a lot to me, but I suppose the
significance could be a matter of opinion. I would also add that
SystemTap can better support concurrent users who want to monitor
different things.

> So I'm not sure it's such a great idea to cede a large bit of
> functionality to as being something that SystemTap will never
> accomplish --- especially when it's far more convenient and stable
> to depend on fixed trace points than setting arbitrary dynamic trace
> points in the middle of source files which will break all the time
> when distro's release new kernels, etc.

I don't understand your point about ceding here. But yes, I agree that
fixed trace points are more convenient and stable, which is why we've
long supported static instrumentation in the kernel.

> Secondly, while I'm not so sure it's that big of a restriction to have
> Systemtap pull events out of the trace buffer, if you must capture the
> event right as it happens, it should be possible set a kprobe in the
> ftrace subsystem, and then pull out the data of the event from the
> trace buffer.

This is possible, but it's a step backward for a few reasons:

- A kprobe will be inherently slower than a tracepoint handler.
- It requires debuginfo (maybe not to place the probe, but surely to dig
into ftrace's internal data structures).
- It requires knowledge about the ftrace internals, which is fragile and
unmaintainable.
- It assumes that every bit of data that the user wants is captured in
the trace buffer.

I think that last point is particularly significant. Kernel devs are
not prescient, so the trace event might not be capturing all of the data
that's relevant to a particular troubleshooting effort. With stap you
can gather whatever data you want.

(By the way, I seem to recall that we once discussed adding a proper
hook for stap to grab ftrace data as it comes, but I don't think that
went anywhere.)

> Keep in mind that one of the advantage DTrace has over SystemTap is
> that it can use pre-defined events in the kernel, and not have to
> keep userspace macro files in sync with a changing kernel source
> base. It seems counterproductive to throw away the opportunity of
> being able to read the tracepoint event data, since it would give
> SystemTap a lot more power.

Aren't "pre-defined events" == tracepoints? That's exactly what we're
trying to use in SystemTap! But then, DTrace doesn't dictate what data
is captured at those events, so I don't understand why you think we
should be more restrictive.

>> However, SystemTap does *not* require the kernel debuginfo for using
>> tracepoints, even when reading parameters. It should work in the
>> complete absence of CONFIG_DEBUGINFO, so if you find otherwise, please
>> let me know and I will fix it.
>
> Well, how is it going to do that if you don't have access to the
> structure definition? This is why fetching the information from the
> ring buffer is much more powerful.

True, when neither a header nor debuginfo for a private type is
available, then it will be opaque to us, so the ring buffer can offer
pre-defined insight into those structures. But in sched_switch, for
example, the ring buffer only knows prev/next->comm/pid/prio/state,
whereas stap has the entire rq and task_structs at your disposal. Each
has power in their own place...

Josh