At the kernel panel at the Linux Foundation Collaboration Summit, I
mentioned that the documentation for the tracing subsystem could be
improved; in particular, that very often the needed documentation for
how the various tracers worked could only be found in the git commit
logs, Kconfig files, and of course, reading the source.
Since criticisms should be followed up with some attempts to improve
the situation, here are some patches to improve the tracing
documentation.
I'm not sure the documentation is 100% correct, but by submitting it,
hopefully that the tracer authors will get flushed out of the woodwork
and be inspired to improve what has been submitted. :-)
- Ted
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
Documentation/power.txt | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
create mode 100644 Documentation/power.txt
diff --git a/Documentation/power.txt b/Documentation/power.txt
new file mode 100644
index 0000000..16f95c9
--- /dev/null
+++ b/Documentation/power.txt
@@ -0,0 +1,19 @@
+The power trace collects detailed information about C-state and P-state
+transitions, instead of just looking at the high-level "average"
+information.
+
+There is a helper script found in scrips/tracing/power.pl in the kernel
+sources which can be used to parse this information and create a
+Scalable Vector Graphics (SVG) picture from the trace data.
+
+To use this tracer:
+
+ echo 0 > /sys/kernel/debug/tracing/tracing_enabled
+ echo power > /sys/kernel/debug/tracing/current_tracer
+ echo 1 > /sys/kernel/debug/tracing/tracing_enabled
+ sleep 1
+ echo 0 > /sys/kernel/debug/tracing/tracing_enabled
+ cat /sys/kernel/debug/tracing/trace | \
+ perl scripts/tracing/power.pl > out.sv
+
+
--
1.5.6.3
Commit 66bb7488 consolidated documentation files into
Documentation/trace, but it didn't update references found in the
kernel/trace/Kconfig file. Fix this.
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
kernel/trace/Kconfig | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 2246141..417d198 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -312,7 +312,7 @@ config KMEMTRACE
and profile kernel code.
This requires an userspace application to use. See
- Documentation/vm/kmemtrace.txt for more information.
+ Documentation/trace/kmemtrace.txt for more information.
Saying Y will make the kernel somewhat larger and slower. However,
if you disable kmemtrace at run-time or boot-time, the performance
@@ -403,7 +403,7 @@ config MMIOTRACE
implementation and works via page faults. Tracing is disabled by
default and can be enabled at run-time.
- See Documentation/tracers/mmiotrace.txt.
+ See Documentation/trace/mmiotrace.txt.
If you are not helping to develop drivers, say N.
config MMIOTRACE_TEST
--
1.5.6.3
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Steven Rostedt <[email protected]>
---
Documentation/trace/events.txt | 132 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 132 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/events.txt
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
new file mode 100644
index 0000000..a0490c8
--- /dev/null
+++ b/Documentation/trace/events.txt
@@ -0,0 +1,132 @@
+ Event Tracing
+
+ Documentation written by Theodore Ts'o
+
+Introduction
+============
+
+Tracepoints (see Documentation/tracing/tracepoints.txt) can be used
+without creating custom kernel modules to register probe functions using
+the event tracing infrastructure.
+
+Not all tracepoints can be traced using the event tracing system; the
+kernel developer must provide code snippets which define how the tracing
+information is saved into the tracing buffer, and how the the tracing
+information should be printed.
+
+Using Event Tracing
+===================
+
+The events which are available for tracing can be found in the file
+/sys/kernel/debug/tracing/available_events.
+
+To enable a particular event, such as 'sched_wakeup', simply echo it to
+/sys/debug/tracing/set_event. For example:
+
+ # echo sched_wakeup > /sys/kernel/debug/tracing/set_event
+
+To disable an event, echo the event name to the set_event file prefixed
+with an exclamation point:
+
+ # echo '!sched_wakeup' >> /sys/kernel//debug/tracing/set_event
+
+To disable events, echo an empty line to the set_event file:
+
+ # echo > /sys/kernel/debug/tracing/set_event
+
+The events are organized into subsystems, such as ext4, irq, sched,
+etc., and a full event name looks like this: <subsystem>:<event>. The
+subsystem name is optional, but it is displayed in the available_events
+file. All of the events in a subsystem can be specified via the syntax
+"<subsystem>:*"; for example, to enable all irq events, you can use the
+command:
+
+ # echo 'irq:*' > /sys/kernel/debug/tracing/set_event
+
+Defining an event-enabled tracepoint
+------------------------------------
+
+A kernel developer which wishes to define an event-enabled tracepoint
+must declare the tracepoint using TRACE_EVENT instead of DECLARE_TRACE.
+This is done via two header files in include/trace. For example, to
+event-enable the jbd2 subsystem, we must create two files,
+include/trace/jbd2.h and include/trace/jbd2_event_types.h. The
+include/trace/jbd2.h file should be included by kernel source files that
+will have a tracepoint inserted, and might look like this:
+
+#ifndef _TRACE_JBD2_H
+#define _TRACE_JBD2_H
+
+#include <linux/jbd2.h>
+#include <linux/tracepoint.h>
+
+#include <trace/jbd2_event_types.h>
+
+#endif
+
+In a file that utilizes a jbd2 tracepoint, this header file would be
+included. Note that you still have to use DEFINE_TRACE(). So for
+example, if fs/jbd2/commit.c planned to use the jbd2_start_commit
+tracepoint, it would have the following near the beginning of the file:
+
+#include <trace/jbd2.h>
+
+DEFINE_TRACE(jbd2_start_commit);
+
+Then in the function that would call the tracepoint, it would call the
+tracepoint function. (For more information, please see the tracepoint
+documentation in Documentation/trace/tracepoints.txt):
+
+ trace_jbd2_start_commit(journal, commit_transaction);
+
+The code snippets which allow jbd2_start_commit to be an event-enabled
+tracepoint are placed in the file include/trace/jbd2_event_types.h:
+
+/* use <trace/jbd2.h> instead */
+#ifndef TRACE_EVENT
+# error Do not include this file directly.
+# error Unless you know what you are doing.
+#endif
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM jbd2
+
+#include <linux/jbd2.h>
+
+TRACE_EVENT(jbd2_start_commit,
+ TP_PROTO(journal_t *journal, transaction_t *commit_transaction),
+ TP_ARGS(journal, commit_transaction),
+ TP_STRUCT__entry(
+ __array( char, devname, BDEVNAME_SIZE+24 )
+ __field( int, transaction )
+ ),
+ TP_fast_assign(
+ memcpy(__entry->devname, journal->j_devname, BDEVNAME_SIZE+24);
+ __entry->transaction = commit_transaction->t_tid;
+ ),
+ TP_printk("dev %s transaction %d",
+ __entry->devname, __entry->transaction)
+);
+
+The TP_PROTO and TP_ARGS are unchanged from DECLARE_TRACE. The new
+arguments to TRACE_EVENT are TP_STRUCT__entry, TP_fast_assign, and
+TP_printk.
+
+TP_STRUCT__entry defines the data structure which will be stored in the
+trace buffer. Normally, fields in __entry will be arrays or simple
+types. It is possible to place data structures in __entry --- however,
+pointers in the data structure can not be trusted, since they will be
+accessed sometime later by TP_printk, and if the data structure contains
+fields that will not or cannot be used by TP_printk, this will waste
+space in the trace buffer. In general, data structures should be
+avoided, unless they do only contain non-pointer types and all of the
+fields will be used by TP_printk.
+
+TP_fast_assign defines the code snippet which saves information into the
+__entry data structure, using the passed-in arguments defined in
+TP_PROTO and TP_ARGS.
+
+Finally, TP_printk will print the __entry data structure. At the time
+when code snippet defined by TP_printk is executed, it will not have
+access to the TP_ARGS arguments; it can only use the information saved
+in the __entry data structure.
--
1.5.6.3
On Sat, 2009-04-11 at 15:51 -0400, Theodore Ts'o wrote:
> diff --git a/Documentation/power.txt b/Documentation/power.txt
> new file mode 100644
> index 0000000..16f95c9
> --- /dev/null
> +++ b/Documentation/power.txt
Documentation/trace/power.txt?
Documentation/power/trace.txt?
> @@ -0,0 +1,19 @@
> +There is a helper script found in scrips/tracing/power.pl in the kernel
scripts/tracing/power.pl
On Sat, 11 Apr 2009 15:51:19 -0400
Theodore Ts'o <[email protected]> wrote:
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Steven Rostedt <[email protected]>
Acked-by: Arjan van de Ven <[email protected]>
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Theodore Ts'o <[email protected]> wrote:
> At the kernel panel at the Linux Foundation Collaboration
> Summit, I mentioned that the documentation for the tracing
> subsystem could be improved; in particular, that very
> often the needed documentation for how the various tracers
> worked could only be found in the git commit logs, Kconfig
> files, and of course, reading the source.
>
> Since criticisms should be followed up with some attempts
> to improve the situation, here are some patches to improve
> the tracing documentation.
>
> I'm not sure the documentation is 100% correct, but by
> submitting it, hopefully that the tracer authors will get
> flushed out of the woodwork and be inspired to improve
> what has been submitted. :-)
Applied, thanks Ted!
Patch #1 was already present in the tracing tree - i've
added both the event and the power tracer docs to
Documentation/trace/.
Ingo
Commit-ID: 1542263998a2d143596972b2776c59c9247578a7
Gitweb: http://git.kernel.org/tip/1542263998a2d143596972b2776c59c9247578a7
Author: Theodore Ts'o <[email protected]>
AuthorDate: Sat, 11 Apr 2009 15:51:18 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 12 Apr 2009 11:23:48 +0200
tracing: Document the event tracing system
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Steven Rostedt <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/trace/events.txt | 132 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 132 insertions(+), 0 deletions(-)
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
new file mode 100644
index 0000000..122522f
--- /dev/null
+++ b/Documentation/trace/events.txt
@@ -0,0 +1,132 @@
+ Event Tracing
+
+ Documentation written by Theodore Ts'o
+
+Introduction
+============
+
+Tracepoints (see Documentation/tracing/tracepoints.txt) can be used
+without creating custom kernel modules to register probe functions using
+the event tracing infrastructure.
+
+Not all tracepoints can be traced using the event tracing system; the
+kernel developer must provide code snippets which define how the tracing
+information is saved into the tracing buffer, and how the the tracing
+information should be printed.
+
+Using Event Tracing
+===================
+
+The events which are available for tracing can be found in the file
+/sys/kernel/debug/tracing/available_events.
+
+To enable a particular event, such as 'sched_wakeup', simply echo it to
+/sys/debug/tracing/set_event. For example:
+
+ # echo sched_wakeup > /sys/kernel/debug/tracing/set_event
+
+To disable an event, echo the event name to the set_event file prefixed
+with an exclamation point:
+
+ # echo '!sched_wakeup' >> /sys/kernel//debug/tracing/set_event
+
+To disable events, echo an empty line to the set_event file:
+
+ # echo > /sys/kernel/debug/tracing/set_event
+
+The events are organized into subsystems, such as ext4, irq, sched,
+etc., and a full event name looks like this: <subsystem>:<event>. The
+subsystem name is optional, but it is displayed in the available_events
+file. All of the events in a subsystem can be specified via the syntax
+"<subsystem>:*"; for example, to enable all irq events, you can use the
+command:
+
+ # echo 'irq:*' > /sys/kernel/debug/tracing/set_event
+
+Defining an event-enabled tracepoint
+------------------------------------
+
+A kernel developer which wishes to define an event-enabled tracepoint
+must declare the tracepoint using TRACE_EVENT instead of DECLARE_TRACE.
+This is done via two header files in include/trace. For example, to
+event-enable the jbd2 subsystem, we must create two files,
+include/trace/jbd2.h and include/trace/jbd2_event_types.h. The
+include/trace/jbd2.h file should be included by kernel source files that
+will have a tracepoint inserted, and might look like this:
+
+#ifndef _TRACE_JBD2_H
+#define _TRACE_JBD2_H
+
+#include <linux/jbd2.h>
+#include <linux/tracepoint.h>
+
+#include <trace/jbd2_event_types.h>
+
+#endif
+
+In a file that utilizes a jbd2 tracepoint, this header file would be
+included. Note that you still have to use DEFINE_TRACE(). So for
+example, if fs/jbd2/commit.c planned to use the jbd2_start_commit
+tracepoint, it would have the following near the beginning of the file:
+
+#include <trace/jbd2.h>
+
+DEFINE_TRACE(jbd2_start_commit);
+
+Then in the function that would call the tracepoint, it would call the
+tracepoint function. (For more information, please see the tracepoint
+documentation in Documentation/trace/tracepoints.txt):
+
+ trace_jbd2_start_commit(journal, commit_transaction);
+
+The code snippets which allow jbd2_start_commit to be an event-enabled
+tracepoint are placed in the file include/trace/jbd2_event_types.h:
+
+/* use <trace/jbd2.h> instead */
+#ifndef TRACE_EVENT
+# error Do not include this file directly.
+# error Unless you know what you are doing.
+#endif
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM jbd2
+
+#include <linux/jbd2.h>
+
+TRACE_EVENT(jbd2_start_commit,
+ TP_PROTO(journal_t *journal, transaction_t *commit_transaction),
+ TP_ARGS(journal, commit_transaction),
+ TP_STRUCT__entry(
+ __array( char, devname, BDEVNAME_SIZE+24 )
+ __field( int, transaction )
+ ),
+ TP_fast_assign(
+ memcpy(__entry->devname, journal->j_devname, BDEVNAME_SIZE+24);
+ __entry->transaction = commit_transaction->t_tid;
+ ),
+ TP_printk("dev %s transaction %d",
+ __entry->devname, __entry->transaction)
+);
+
+The TP_PROTO and TP_ARGS are unchanged from DECLARE_TRACE. The new
+arguments to TRACE_EVENT are TP_STRUCT__entry, TP_fast_assign, and
+TP_printk.
+
+TP_STRUCT__entry defines the data structure which will be stored in the
+trace buffer. Normally, fields in __entry will be arrays or simple
+types. It is possible to place data structures in __entry --- however,
+pointers in the data structure can not be trusted, since they will be
+accessed sometime later by TP_printk, and if the data structure contains
+fields that will not or cannot be used by TP_printk, this will waste
+space in the trace buffer. In general, data structures should be
+avoided, unless they do only contain non-pointer types and all of the
+fields will be used by TP_printk.
+
+TP_fast_assign defines the code snippet which saves information into the
+__entry data structure, using the passed-in arguments defined in
+TP_PROTO and TP_ARGS.
+
+Finally, TP_printk will print the __entry data structure. At the time
+when code snippet defined by TP_printk is executed, it will not have
+access to the TP_ARGS arguments; it can only use the information saved
+in the __entry data structure.
Commit-ID: f7a3b3845cdfa4d6383b74139392ff5c259531fc
Gitweb: http://git.kernel.org/tip/f7a3b3845cdfa4d6383b74139392ff5c259531fc
Author: Theodore Ts'o <[email protected]>
AuthorDate: Sat, 11 Apr 2009 15:51:19 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 12 Apr 2009 11:23:49 +0200
tracing: Add documentation for the power tracer
Signed-off-by: "Theodore Ts'o" <[email protected]>
Acked-by: Arjan van de Ven <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/trace/power.txt | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt
new file mode 100644
index 0000000..9c584ff
--- /dev/null
+++ b/Documentation/trace/power.txt
@@ -0,0 +1,17 @@
+The power trace collects detailed information about C-state and P-state
+transitions, instead of just looking at the high-level "average"
+information.
+
+There is a helper script found in scrips/tracing/power.pl in the kernel
+sources which can be used to parse this information and create a
+Scalable Vector Graphics (SVG) picture from the trace data.
+
+To use this tracer:
+
+ echo 0 > /sys/kernel/debug/tracing/tracing_enabled
+ echo power > /sys/kernel/debug/tracing/current_tracer
+ echo 1 > /sys/kernel/debug/tracing/tracing_enabled
+ sleep 1
+ echo 0 > /sys/kernel/debug/tracing/tracing_enabled
+ cat /sys/kernel/debug/tracing/trace | \
+ perl scripts/tracing/power.pl > out.sv
[Theodore Ts'o - Sun, Apr 12, 2009 at 09:27:58AM +0000]
| Commit-ID: 1542263998a2d143596972b2776c59c9247578a7
| Gitweb: http://git.kernel.org/tip/1542263998a2d143596972b2776c59c9247578a7
| Author: Theodore Ts'o <[email protected]>
| AuthorDate: Sat, 11 Apr 2009 15:51:18 -0400
| Committer: Ingo Molnar <[email protected]>
| CommitDate: Sun, 12 Apr 2009 11:23:48 +0200
|
| tracing: Document the event tracing system
|
| Signed-off-by: "Theodore Ts'o" <[email protected]>
| Cc: Theodore Ts'o <[email protected]>
| Cc: Steven Rostedt <[email protected]>
| LKML-Reference: <[email protected]>
| Signed-off-by: Ingo Molnar <[email protected]>
|
|
| ---
| Documentation/trace/events.txt | 132 ++++++++++++++++++++++++++++++++++++++++
| 1 files changed, 132 insertions(+), 0 deletions(-)
|
...
| +The events which are available for tracing can be found in the file
| +/sys/kernel/debug/tracing/available_events.
| +
| +To enable a particular event, such as 'sched_wakeup', simply echo it to
| +/sys/debug/tracing/set_event. For example:
| +
| + # echo sched_wakeup > /sys/kernel/debug/tracing/set_event
| +
| +To disable an event, echo the event name to the set_event file prefixed
| +with an exclamation point:
| +
| + # echo '!sched_wakeup' >> /sys/kernel//debug/tracing/set_event
^^^
Since it's just being added and easy to fix (later maybe via trivial tree),
I suppose here is quite a small misprint :)
| +
| +To disable events, echo an empty line to the set_event file:
| +
| + # echo > /sys/kernel/debug/tracing/set_event
| +
...
Cyrill
On Sun, Apr 12, 2009 at 11:25:24AM +0200, Ingo Molnar wrote:
> > I'm not sure the documentation is 100% correct, but by
> > submitting it, hopefully that the tracer authors will get
> > flushed out of the woodwork and be inspired to improve
> > what has been submitted. :-)
>
> Applied, thanks Ted!
>
> Patch #1 was already present in the tracing tree - i've
> added both the event and the power tracer docs to
> Documentation/trace/.
Thanks for applying it so fast, but did you fix up the spelling and
pathname (Documentation/power.txt was supposed to be
Documentation/trace/power.txt) fixups that folks pointed out? I was
going to send an v2 update of the patches. So if you haven't pushed
them out yet, you might want to pull them until I get them fixed up.
Or I can send patches to fix up those issues, whichever is more
convenient for you.
Thanks, regards,
- Ted
Commit-ID: 56c49951747f250d8398582509e02ae5ce1d36d1
Gitweb: http://git.kernel.org/tip/56c49951747f250d8398582509e02ae5ce1d36d1
Author: Theodore Ts'o <[email protected]>
AuthorDate: Sat, 11 Apr 2009 15:51:19 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 12 Apr 2009 14:51:42 +0200
tracing: Add documentation for the power tracer
Signed-off-by: "Theodore Ts'o" <[email protected]>
Acked-by: Arjan van de Ven <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/trace/power.txt | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt
new file mode 100644
index 0000000..cd805e1
--- /dev/null
+++ b/Documentation/trace/power.txt
@@ -0,0 +1,17 @@
+The power tracer collects detailed information about C-state and P-state
+transitions, instead of just looking at the high-level "average"
+information.
+
+There is a helper script found in scrips/tracing/power.pl in the kernel
+sources which can be used to parse this information and create a
+Scalable Vector Graphics (SVG) picture from the trace data.
+
+To use this tracer:
+
+ echo 0 > /sys/kernel/debug/tracing/tracing_enabled
+ echo power > /sys/kernel/debug/tracing/current_tracer
+ echo 1 > /sys/kernel/debug/tracing/tracing_enabled
+ sleep 1
+ echo 0 > /sys/kernel/debug/tracing/tracing_enabled
+ cat /sys/kernel/debug/tracing/trace | \
+ perl scripts/tracing/power.pl > out.sv
* Theodore Tso <[email protected]> wrote:
> On Sun, Apr 12, 2009 at 11:25:24AM +0200, Ingo Molnar wrote:
> > > I'm not sure the documentation is 100% correct, but by
> > > submitting it, hopefully that the tracer authors will get
> > > flushed out of the woodwork and be inspired to improve
> > > what has been submitted. :-)
> >
> > Applied, thanks Ted!
> >
> > Patch #1 was already present in the tracing tree - i've
> > added both the event and the power tracer docs to
> > Documentation/trace/.
>
> Thanks for applying it so fast, [...]
Heh, i'm used to getting perfect patches from you :) Also,
documentation is always lagging behind so every little piece helps,
and it's not like it can actually break anything in the kernel.
> [...] but did you fix up the spelling and pathname
> (Documentation/power.txt was supposed to be
> Documentation/trace/power.txt) fixups that folks pointed out?
Yeah, i did that. I also fixed up a few other small details no-one
pointed out.
> [...] I was going to send an v2 update of the patches. So if you
> haven't pushed them out yet, you might want to pull them until I
> get them fixed up. Or I can send patches to fix up those issues,
> whichever is more convenient for you.
No need to resend.
Btw., you mention ext4 and jbd2 new-style tracepoints in the text.
Does this mean you already have them coded up (i havent seen any
patch posting from you), just that you cannot push it upstream yet
because ext4 can be a module? We'll have modular new-style
tracepoints soon.
Ingo
Commit-ID: abd41443ac76d3e9c29a8c1d9e9a3312306cc55e
Gitweb: http://git.kernel.org/tip/abd41443ac76d3e9c29a8c1d9e9a3312306cc55e
Author: Theodore Ts'o <[email protected]>
AuthorDate: Sat, 11 Apr 2009 15:51:18 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 12 Apr 2009 14:57:12 +0200
tracing: Document the event tracing system
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Steven Rostedt <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/trace/events.txt | 135 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 135 insertions(+), 0 deletions(-)
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
new file mode 100644
index 0000000..abdee66
--- /dev/null
+++ b/Documentation/trace/events.txt
@@ -0,0 +1,135 @@
+ Event Tracing
+
+ Documentation written by Theodore Ts'o
+
+Introduction
+============
+
+Tracepoints (see Documentation/trace/tracepoints.txt) can be used
+without creating custom kernel modules to register probe functions
+using the event tracing infrastructure.
+
+Not all tracepoints can be traced using the event tracing system;
+the kernel developer must provide code snippets which define how the
+tracing information is saved into the tracing buffer, and how the
+the tracing information should be printed.
+
+Using Event Tracing
+===================
+
+The events which are available for tracing can be found in the file
+/sys/kernel/debug/tracing/available_events.
+
+To enable a particular event, such as 'sched_wakeup', simply echo it
+to /sys/debug/tracing/set_event. For example:
+
+ # echo sched_wakeup > /sys/kernel/debug/tracing/set_event
+
+[ Note: events can also be enabled/disabled via the 'enabled' toggle
+ found in the /sys/kernel/tracing/events/ hierarchy of directories. ]
+
+To disable an event, echo the event name to the set_event file prefixed
+with an exclamation point:
+
+ # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event
+
+To disable events, echo an empty line to the set_event file:
+
+ # echo > /sys/kernel/debug/tracing/set_event
+
+The events are organized into subsystems, such as ext4, irq, sched,
+etc., and a full event name looks like this: <subsystem>:<event>. The
+subsystem name is optional, but it is displayed in the available_events
+file. All of the events in a subsystem can be specified via the syntax
+"<subsystem>:*"; for example, to enable all irq events, you can use the
+command:
+
+ # echo 'irq:*' > /sys/kernel/debug/tracing/set_event
+
+Defining an event-enabled tracepoint
+------------------------------------
+
+A kernel developer which wishes to define an event-enabled tracepoint
+must declare the tracepoint using TRACE_EVENT instead of DECLARE_TRACE.
+This is done via two header files in include/trace. For example, to
+event-enable the jbd2 subsystem, we must create two files,
+include/trace/jbd2.h and include/trace/jbd2_event_types.h. The
+include/trace/jbd2.h file should be included by kernel source files that
+will have a tracepoint inserted, and might look like this:
+
+#ifndef _TRACE_JBD2_H
+#define _TRACE_JBD2_H
+
+#include <linux/jbd2.h>
+#include <linux/tracepoint.h>
+
+#include <trace/jbd2_event_types.h>
+
+#endif
+
+In a file that utilizes a jbd2 tracepoint, this header file would be
+included. Note that you still have to use DEFINE_TRACE(). So for
+example, if fs/jbd2/commit.c planned to use the jbd2_start_commit
+tracepoint, it would have the following near the beginning of the file:
+
+#include <trace/jbd2.h>
+
+DEFINE_TRACE(jbd2_start_commit);
+
+Then in the function that would call the tracepoint, it would call the
+tracepoint function. (For more information, please see the tracepoint
+documentation in Documentation/trace/tracepoints.txt):
+
+ trace_jbd2_start_commit(journal, commit_transaction);
+
+The code snippets which allow jbd2_start_commit to be an event-enabled
+tracepoint are placed in the file include/trace/jbd2_event_types.h:
+
+/* use <trace/jbd2.h> instead */
+#ifndef TRACE_EVENT
+# error Do not include this file directly.
+# error Unless you know what you are doing.
+#endif
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM jbd2
+
+#include <linux/jbd2.h>
+
+TRACE_EVENT(jbd2_start_commit,
+ TP_PROTO(journal_t *journal, transaction_t *commit_transaction),
+ TP_ARGS(journal, commit_transaction),
+ TP_STRUCT__entry(
+ __array( char, devname, BDEVNAME_SIZE+24 )
+ __field( int, transaction )
+ ),
+ TP_fast_assign(
+ memcpy(__entry->devname, journal->j_devname, BDEVNAME_SIZE+24);
+ __entry->transaction = commit_transaction->t_tid;
+ ),
+ TP_printk("dev %s transaction %d",
+ __entry->devname, __entry->transaction)
+);
+
+The TP_PROTO and TP_ARGS are unchanged from DECLARE_TRACE. The new
+arguments to TRACE_EVENT are TP_STRUCT__entry, TP_fast_assign, and
+TP_printk.
+
+TP_STRUCT__entry defines the data structure which will be stored in the
+trace buffer. Normally, fields in __entry will be arrays or simple
+types. It is possible to place data structures in __entry --- however,
+pointers in the data structure can not be trusted, since they will be
+accessed sometime later by TP_printk, and if the data structure contains
+fields that will not or cannot be used by TP_printk, this will waste
+space in the trace buffer. In general, data structures should be
+avoided, unless they do only contain non-pointer types and all of the
+fields will be used by TP_printk.
+
+TP_fast_assign defines the code snippet which saves information into the
+__entry data structure, using the passed-in arguments defined in
+TP_PROTO and TP_ARGS.
+
+Finally, TP_printk will print the __entry data structure. At the time
+when the code snippet defined by TP_printk is executed, it will not have
+access to the TP_ARGS arguments; it can only use the information saved
+in the __entry data structure.
On Sun, Apr 12, 2009 at 03:01:05PM +0200, Ingo Molnar wrote:
> Btw., you mention ext4 and jbd2 new-style tracepoints in the text.
> Does this mean you already have them coded up (i havent seen any
> patch posting from you), just that you cannot push it upstream yet
> because ext4 can be a module? We'll have modular new-style
> tracepoints soon.
Yes, I coded them up recently. I wanted to do some performance
measurements, and being able to interleave the ext4 tracepoint logs
with the block I/o tracer was definitely handy. Also, not having to
fight with balky userspace tools (whether it is out-of-tree kernel
support code, or random graphical userspace libraries when I could
really care less about a GUI interface) was definitely a big win.
I only had two real problems. One is that the block I/O tracer only
traces "real" devices, and not device mapper devices --- I could user
the blktrace userspace tool, but then the results wouldn't be properly
interleaved with the ext4 tracepoint logs. The second is that ext4
has its localized header files in fs/ext4/*.h, and not in
include/linux/*.h, and that was problematic given that the trace code
snippets in include/trace/ext4_event_types.h needed access to some
internal ext4 data structures. I ultimately solved the latter by
creating a include/linux/ext4_tracing_types.h, but I suspect this
problem will go away when you have modular new-style tracepoints ---
if not, I'd appreciately greatly if folks could consider whether or
not this support could be added.
I'll attach the patches as replies to this mail thread so you can see
what I've done. Any comments of anything I might have done "wrong"
would be greatly appreciated.
- Ted
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/checkpoint.c | 7 ++-
fs/jbd2/commit.c | 14 +++--
include/trace/jbd2.h | 9 ++++
include/trace/jbd2_event_types.h | 93 +++++++++++++++++++++++++++++++++++++
include/trace/trace_event_types.h | 1 +
include/trace/trace_events.h | 1 +
6 files changed, 116 insertions(+), 9 deletions(-)
create mode 100644 include/trace/jbd2.h
create mode 100644 include/trace/jbd2_event_types.h
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 17159ca..2b794d9 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -20,9 +20,11 @@
#include <linux/time.h>
#include <linux/fs.h>
#include <linux/jbd2.h>
-#include <linux/marker.h>
#include <linux/errno.h>
#include <linux/slab.h>
+#include <trace/jbd2.h>
+
+DEFINE_TRACE(jbd2_checkpoint);
/*
* Unlink a buffer from a transaction checkpoint list.
@@ -358,8 +360,7 @@ int jbd2_log_do_checkpoint(journal_t *journal)
* journal straight away.
*/
result = jbd2_cleanup_journal_tail(journal);
- trace_mark(jbd2_checkpoint, "dev %s need_checkpoint %d",
- journal->j_devname, result);
+ trace_jbd2_checkpoint(journal, result);
jbd_debug(1, "cleanup_journal_tail returned %d\n", result);
if (result <= 0)
return result;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 073c8c3..21462a8 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -16,7 +16,6 @@
#include <linux/time.h>
#include <linux/fs.h>
#include <linux/jbd2.h>
-#include <linux/marker.h>
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/mm.h>
@@ -26,6 +25,11 @@
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/bio.h>
+#include <trace/jbd2.h>
+
+DEFINE_TRACE(jbd2_start_commit);
+DEFINE_TRACE(jbd2_end_commit);
+DEFINE_TRACE(jbd2_submit_inode_data);
/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -253,6 +257,7 @@ static int journal_submit_data_buffers(journal_t *journal,
* block allocation with delalloc. We need to write
* only allocated blocks here.
*/
+ trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
err = journal_submit_inode_data_buffers(mapping);
if (!ret)
ret = err;
@@ -394,8 +399,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
commit_transaction = journal->j_running_transaction;
J_ASSERT(commit_transaction->t_state == T_RUNNING);
- trace_mark(jbd2_start_commit, "dev %s transaction %d",
- journal->j_devname, commit_transaction->t_tid);
+ trace_jbd2_start_commit(journal, commit_transaction);
jbd_debug(1, "JBD: starting commit of transaction %d\n",
commit_transaction->t_tid);
@@ -1053,9 +1057,7 @@ restart_loop:
if (journal->j_commit_callback)
journal->j_commit_callback(journal, commit_transaction);
- trace_mark(jbd2_end_commit, "dev %s transaction %d head %d",
- journal->j_devname, commit_transaction->t_tid,
- journal->j_tail_sequence);
+ trace_jbd2_end_commit(journal, commit_transaction);
jbd_debug(1, "JBD: commit %d complete, head %d\n",
journal->j_commit_sequence, journal->j_tail_sequence);
if (to_free)
diff --git a/include/trace/jbd2.h b/include/trace/jbd2.h
new file mode 100644
index 0000000..68a93a5
--- /dev/null
+++ b/include/trace/jbd2.h
@@ -0,0 +1,9 @@
+#ifndef _TRACE_JBD2_H
+#define _TRACE_JBD2_H
+
+#include <linux/jbd2.h>
+#include <linux/tracepoint.h>
+
+#include <trace/jbd2_event_types.h>
+
+#endif
diff --git a/include/trace/jbd2_event_types.h b/include/trace/jbd2_event_types.h
new file mode 100644
index 0000000..c550ef4
--- /dev/null
+++ b/include/trace/jbd2_event_types.h
@@ -0,0 +1,93 @@
+
+/* use <trace/jbd2.h> instead */
+#ifndef TRACE_EVENT
+# error Do not include this file directly.
+# error Unless you know what you are doing.
+#endif
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM jbd2
+
+#include <linux/jbd2.h>
+
+TRACE_EVENT(jbd2_checkpoint,
+
+ TP_PROTO(journal_t *journal, int result),
+
+ TP_ARGS(journal, result),
+
+ TP_STRUCT__entry(
+ __array( char, devname, BDEVNAME_SIZE+24 )
+ __field( int, result )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, journal->j_devname, BDEVNAME_SIZE+24);
+ __entry->result = result;
+ ),
+
+ TP_printk("dev %s result %d",
+ __entry->devname, __entry->result)
+);
+
+TRACE_EVENT(jbd2_start_commit,
+
+ TP_PROTO(journal_t *journal, transaction_t *commit_transaction),
+
+ TP_ARGS(journal, commit_transaction),
+
+ TP_STRUCT__entry(
+ __array( char, devname, BDEVNAME_SIZE+24 )
+ __field( int, transaction )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, journal->j_devname, BDEVNAME_SIZE+24);
+ __entry->transaction = commit_transaction->t_tid;
+ ),
+
+ TP_printk("dev %s transaction %d",
+ __entry->devname, __entry->transaction)
+);
+
+TRACE_EVENT(jbd2_end_commit,
+ TP_PROTO(journal_t *journal, transaction_t *commit_transaction),
+
+ TP_ARGS(journal, commit_transaction),
+
+ TP_STRUCT__entry(
+ __array( char, devname, BDEVNAME_SIZE+24 )
+ __field( int, transaction )
+ __field( int, head )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, journal->j_devname, BDEVNAME_SIZE+24);
+ __entry->transaction = commit_transaction->t_tid;
+ __entry->head = journal->j_tail_sequence;
+ ),
+
+ TP_printk("dev %s transaction %d head %d",
+ __entry->devname, __entry->transaction, __entry->head)
+);
+
+TRACE_EVENT(jbd2_submit_inode_data,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ ),
+
+ TP_printk("dev %s transaction ino %lu",
+ __entry->devname, __entry->ino)
+);
+
+#undef TRACE_SYSTEM
diff --git a/include/trace/trace_event_types.h b/include/trace/trace_event_types.h
index df56f56..bd26f4d 100644
--- a/include/trace/trace_event_types.h
+++ b/include/trace/trace_event_types.h
@@ -3,3 +3,4 @@
#include <trace/sched_event_types.h>
#include <trace/irq_event_types.h>
#include <trace/lockdep_event_types.h>
+#include <trace/jbd2_event_types.h>
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index fd13750..2a9b970 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -3,3 +3,4 @@
#include <trace/sched.h>
#include <trace/irq.h>
#include <trace/lockdep.h>
+#include <trace/jbd2.h>
--
1.5.6.3
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 22 +--
fs/ext4/ext4_i.h | 12 -
fs/ext4/fsync.c | 8 +-
fs/ext4/ialloc.c | 17 +-
fs/ext4/inode.c | 79 ++---
fs/ext4/mballoc.c | 84 ++---
fs/ext4/mballoc.h | 68 +----
fs/ext4/super.c | 6 +-
include/linux/ext4_trace_types.h | 108 ++++++
include/trace/ext4.h | 8 +
include/trace/ext4_event_types.h | 690 +++++++++++++++++++++++++++++++++++++
include/trace/trace_event_types.h | 1 +
include/trace/trace_events.h | 1 +
13 files changed, 886 insertions(+), 218 deletions(-)
create mode 100644 include/linux/ext4_trace_types.h
create mode 100644 include/trace/ext4.h
create mode 100644 include/trace/ext4_event_types.h
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d0f15ef..791da89 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -21,6 +21,7 @@
#include <linux/magic.h>
#include <linux/jbd2.h>
#include <linux/quota.h>
+#include <linux/ext4_trace_types.h>
#include "ext4_i.h"
/*
@@ -70,27 +71,6 @@
#define EXT4_MB_DELALLOC_RESERVED 1024
-struct ext4_allocation_request {
- /* target inode for block we're allocating */
- struct inode *inode;
- /* logical block in target inode */
- ext4_lblk_t logical;
- /* phys. target (a hint) */
- ext4_fsblk_t goal;
- /* the closest logical allocated block to the left */
- ext4_lblk_t lleft;
- /* phys. block for ^^^ */
- ext4_fsblk_t pleft;
- /* the closest logical allocated block to the right */
- ext4_lblk_t lright;
- /* phys. block for ^^^ */
- ext4_fsblk_t pright;
- /* how many blocks we want to allocate */
- unsigned int len;
- /* flags. see above EXT4_MB_HINT_* */
- unsigned int flags;
-};
-
/*
* Special inodes numbers
*/
diff --git a/fs/ext4/ext4_i.h b/fs/ext4/ext4_i.h
index 4ce2187..48bf4bd 100644
--- a/fs/ext4/ext4_i.h
+++ b/fs/ext4/ext4_i.h
@@ -21,18 +21,6 @@
#include <linux/seqlock.h>
#include <linux/mutex.h>
-/* data type for block offset of block group */
-typedef int ext4_grpblk_t;
-
-/* data type for filesystem-wide blocks number */
-typedef unsigned long long ext4_fsblk_t;
-
-/* data type for file logical block number */
-typedef __u32 ext4_lblk_t;
-
-/* data type for block group number */
-typedef unsigned int ext4_group_t;
-
/*
* storage for cached extent
*/
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5afe437..783efcb 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -28,10 +28,12 @@
#include <linux/writeback.h>
#include <linux/jbd2.h>
#include <linux/blkdev.h>
-#include <linux/marker.h>
+#include <trace/ext4.h>
#include "ext4.h"
#include "ext4_jbd2.h"
+DEFINE_TRACE(ext4_sync_file);
+
/*
* akpm: A new design for ext4_sync_file().
*
@@ -52,9 +54,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
J_ASSERT(ext4_journal_current_handle() == NULL);
- trace_mark(ext4_sync_file, "dev %s datasync %d ino %ld parent %ld",
- inode->i_sb->s_id, datasync, inode->i_ino,
- dentry->d_parent->d_inode->i_ino);
+ trace_ext4_sync_file(file, dentry, datasync);
/*
* data=writeback:
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 47b84e8..80288de 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -22,6 +22,7 @@
#include <linux/random.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
+#include <trace/ext4.h>
#include <asm/byteorder.h>
#include "ext4.h"
#include "ext4_jbd2.h"
@@ -29,6 +30,10 @@
#include "acl.h"
#include "group.h"
+DEFINE_TRACE(ext4_free_inode);
+DEFINE_TRACE(ext4_request_inode);
+DEFINE_TRACE(ext4_allocate_inode);
+
/*
* ialloc.c contains the inodes allocation and deallocation routines
*/
@@ -209,11 +214,7 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
ino = inode->i_ino;
ext4_debug("freeing inode %lu\n", ino);
- trace_mark(ext4_free_inode,
- "dev %s ino %lu mode %d uid %lu gid %lu bocks %llu",
- sb->s_id, inode->i_ino, inode->i_mode,
- (unsigned long) inode->i_uid, (unsigned long) inode->i_gid,
- (unsigned long long) inode->i_blocks);
+ trace_ext4_free_inode(inode);
/*
* Note: we must free any quota before locking the superblock,
@@ -817,8 +818,7 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
return ERR_PTR(-EPERM);
sb = dir->i_sb;
- trace_mark(ext4_request_inode, "dev %s dir %lu mode %d", sb->s_id,
- dir->i_ino, mode);
+ trace_ext4_request_inode(dir, mode);
inode = new_inode(sb);
if (!inode)
return ERR_PTR(-ENOMEM);
@@ -1050,8 +1050,7 @@ got:
}
ext4_debug("allocating inode %lu\n", inode->i_ino);
- trace_mark(ext4_allocate_inode, "dev %s ino %lu dir %lu mode %d",
- sb->s_id, inode->i_ino, dir->i_ino, mode);
+ trace_ext4_allocate_inode(inode, dir, mode);
goto really_out;
fail:
ext4_std_error(sb, err);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c6bd6ce..3866f44 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -37,11 +37,24 @@
#include <linux/namei.h>
#include <linux/uio.h>
#include <linux/bio.h>
+#include <trace/ext4.h>
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
#include "ext4_extents.h"
+DEFINE_TRACE(ext4_write_begin);
+DEFINE_TRACE(ext4_ordered_write_end);
+DEFINE_TRACE(ext4_writeback_write_end);
+DEFINE_TRACE(ext4_journalled_write_end);
+DEFINE_TRACE(ext4_da_writepage);
+DEFINE_TRACE(ext4_da_writepages);
+DEFINE_TRACE(ext4_da_writepages_result);
+DEFINE_TRACE(ext4_da_write_begin);
+DEFINE_TRACE(ext4_da_write_end);
+DEFINE_TRACE(ext4_normal_writepage);
+DEFINE_TRACE(ext4_journalled_writepage);
+
#define MPAGE_DA_EXTENT_TAIL 0x01
static inline int ext4_begin_ordered_truncate(struct inode *inode,
@@ -1433,10 +1446,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index;
unsigned from, to;
- trace_mark(ext4_write_begin,
- "dev %s ino %lu pos %llu len %u flags %u",
- inode->i_sb->s_id, inode->i_ino,
- (unsigned long long) pos, len, flags);
+ trace_ext4_write_begin(inode, pos, len, flags);
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
@@ -1512,10 +1522,7 @@ static int ext4_ordered_write_end(struct file *file,
struct inode *inode = mapping->host;
int ret = 0, ret2;
- trace_mark(ext4_ordered_write_end,
- "dev %s ino %lu pos %llu len %u copied %u",
- inode->i_sb->s_id, inode->i_ino,
- (unsigned long long) pos, len, copied);
+ trace_ext4_ordered_write_end(inode, pos, len, copied);
ret = ext4_jbd2_file_inode(handle, inode);
if (ret == 0) {
@@ -1554,10 +1561,7 @@ static int ext4_writeback_write_end(struct file *file,
int ret = 0, ret2;
loff_t new_i_size;
- trace_mark(ext4_writeback_write_end,
- "dev %s ino %lu pos %llu len %u copied %u",
- inode->i_sb->s_id, inode->i_ino,
- (unsigned long long) pos, len, copied);
+ trace_ext4_writeback_write_end(inode, pos, len, copied);
new_i_size = pos + copied;
if (new_i_size > EXT4_I(inode)->i_disksize) {
ext4_update_i_disksize(inode, new_i_size);
@@ -1593,10 +1597,7 @@ static int ext4_journalled_write_end(struct file *file,
unsigned from, to;
loff_t new_i_size;
- trace_mark(ext4_journalled_write_end,
- "dev %s ino %lu pos %llu len %u copied %u",
- inode->i_sb->s_id, inode->i_ino,
- (unsigned long long) pos, len, copied);
+ trace_ext4_journalled_write_end(inode, pos, len, copied);
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
@@ -2372,9 +2373,7 @@ static int ext4_da_writepage(struct page *page,
struct buffer_head *page_bufs;
struct inode *inode = page->mapping->host;
- trace_mark(ext4_da_writepage,
- "dev %s ino %lu page_index %lu",
- inode->i_sb->s_id, inode->i_ino, page->index);
+ trace_ext4_da_writepage(inode, page);
size = i_size_read(inode);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
@@ -2486,19 +2485,7 @@ static int ext4_da_writepages(struct address_space *mapping,
int needed_blocks, ret = 0, nr_to_writebump = 0;
struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
- trace_mark(ext4_da_writepages,
- "dev %s ino %lu nr_t_write %ld "
- "pages_skipped %ld range_start %llu "
- "range_end %llu nonblocking %d "
- "for_kupdate %d for_reclaim %d "
- "for_writepages %d range_cyclic %d",
- inode->i_sb->s_id, inode->i_ino,
- wbc->nr_to_write, wbc->pages_skipped,
- (unsigned long long) wbc->range_start,
- (unsigned long long) wbc->range_end,
- wbc->nonblocking, wbc->for_kupdate,
- wbc->for_reclaim, wbc->for_writepages,
- wbc->range_cyclic);
+ trace_ext4_da_writepages(inode, wbc);
/*
* No pages to write? This is mainly a kludge to avoid starting
@@ -2664,14 +2651,7 @@ out_writepages:
if (!no_nrwrite_index_update)
wbc->no_nrwrite_index_update = 0;
wbc->nr_to_write -= nr_to_writebump;
- trace_mark(ext4_da_writepage_result,
- "dev %s ino %lu ret %d pages_written %d "
- "pages_skipped %ld congestion %d "
- "more_io %d no_nrwrite_index_update %d",
- inode->i_sb->s_id, inode->i_ino, ret,
- pages_written, wbc->pages_skipped,
- wbc->encountered_congestion, wbc->more_io,
- wbc->no_nrwrite_index_update);
+ trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
return ret;
}
@@ -2723,11 +2703,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
len, flags, pagep, fsdata);
}
*fsdata = (void *)0;
-
- trace_mark(ext4_da_write_begin,
- "dev %s ino %lu pos %llu len %u flags %u",
- inode->i_sb->s_id, inode->i_ino,
- (unsigned long long) pos, len, flags);
+ trace_ext4_da_write_begin(inode, pos, len, flags);
retry:
/*
* With delayed allocation, we don't log the i_disksize update
@@ -2820,10 +2796,7 @@ static int ext4_da_write_end(struct file *file,
}
}
- trace_mark(ext4_da_write_end,
- "dev %s ino %lu pos %llu len %u copied %u",
- inode->i_sb->s_id, inode->i_ino,
- (unsigned long long) pos, len, copied);
+ trace_ext4_da_write_end(inode, pos, len, copied);
start = pos & (PAGE_CACHE_SIZE - 1);
end = start + copied - 1;
@@ -3076,9 +3049,7 @@ static int ext4_normal_writepage(struct page *page,
loff_t size = i_size_read(inode);
loff_t len;
- trace_mark(ext4_normal_writepage,
- "dev %s ino %lu page_index %lu",
- inode->i_sb->s_id, inode->i_ino, page->index);
+ trace_ext4_normal_writepage(inode, page);
J_ASSERT(PageLocked(page));
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
@@ -3164,9 +3135,7 @@ static int ext4_journalled_writepage(struct page *page,
loff_t size = i_size_read(inode);
loff_t len;
- trace_mark(ext4_journalled_writepage,
- "dev %s ino %lu page_index %lu",
- inode->i_sb->s_id, inode->i_ino, page->index);
+ trace_ext4_journalled_writepage(inode, page);
J_ASSERT(PageLocked(page));
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f871677..73a4866 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -342,6 +342,17 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn);
+DEFINE_TRACE(ext4_discard_blocks);
+DEFINE_TRACE(ext4_mb_new_inode_pa);
+DEFINE_TRACE(ext4_mb_new_group_pa);
+DEFINE_TRACE(ext4_mb_release_inode_pa);
+DEFINE_TRACE(ext4_mb_release_group_pa);
+DEFINE_TRACE(ext4_discard_preallocations);
+DEFINE_TRACE(ext4_mb_discard_preallocations);
+DEFINE_TRACE(ext4_request_blocks);
+DEFINE_TRACE(ext4_allocate_blocks);
+DEFINE_TRACE(ext4_free_blocks);
+
static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
{
#if BITS_PER_LONG == 64
@@ -2882,9 +2893,8 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
discard_block = (ext4_fsblk_t) entry->group * EXT4_BLOCKS_PER_GROUP(sb)
+ entry->start_blk
+ le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
- trace_mark(ext4_discard_blocks, "dev %s blk %llu count %u",
- sb->s_id, (unsigned long long) discard_block,
- entry->count);
+ trace_ext4_discard_blocks(sb, (unsigned long long)discard_block,
+ entry->count);
sb_issue_discard(sb, discard_block, entry->count);
kmem_cache_free(ext4_free_ext_cachep, entry);
@@ -3658,10 +3668,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
mb_debug("new inode pa %p: %llu/%u for %u\n", pa,
pa->pa_pstart, pa->pa_len, pa->pa_lstart);
- trace_mark(ext4_mb_new_inode_pa,
- "dev %s ino %lu pstart %llu len %u lstart %u",
- sb->s_id, ac->ac_inode->i_ino,
- pa->pa_pstart, pa->pa_len, pa->pa_lstart);
+ trace_ext4_mb_new_inode_pa(ac, pa);
ext4_mb_use_inode_pa(ac, pa);
atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
@@ -3720,9 +3727,8 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
pa->pa_type = MB_GROUP_PA;
mb_debug("new group pa %p: %llu/%u for %u\n", pa,
- pa->pa_pstart, pa->pa_len, pa->pa_lstart);
- trace_mark(ext4_mb_new_group_pa, "dev %s pstart %llu len %u lstart %u",
- sb->s_id, pa->pa_pstart, pa->pa_len, pa->pa_lstart);
+ pa->pa_pstart, pa->pa_len, pa->pa_lstart);
+ trace_ext4_mb_new_group_pa(ac, pa);
ext4_mb_use_group_pa(ac, pa);
atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
@@ -3812,10 +3818,8 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
ext4_mb_store_history(ac);
}
- trace_mark(ext4_mb_release_inode_pa,
- "dev %s ino %lu block %llu count %u",
- sb->s_id, pa->pa_inode->i_ino, grp_blk_start + bit,
- next - bit);
+ trace_ext4_mb_release_inode_pa(ac, pa, grp_blk_start + bit,
+ next - bit);
mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
bit = next + 1;
}
@@ -3849,8 +3853,7 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
if (ac)
ac->ac_op = EXT4_MB_HISTORY_DISCARD;
- trace_mark(ext4_mb_release_group_pa, "dev %s pstart %llu len %d",
- sb->s_id, pa->pa_pstart, pa->pa_len);
+ trace_ext4_mb_release_group_pa(ac, pa);
BUG_ON(pa->pa_deleted == 0);
ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
@@ -3918,6 +3921,8 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
INIT_LIST_HEAD(&list);
ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
+ if (ac)
+ ac->ac_sb = sb;
repeat:
ext4_lock_group(sb, group);
list_for_each_entry_safe(pa, tmp,
@@ -4016,12 +4021,15 @@ void ext4_discard_preallocations(struct inode *inode)
}
mb_debug("discard preallocation for inode %lu\n", inode->i_ino);
- trace_mark(ext4_discard_preallocations, "dev %s ino %lu", sb->s_id,
- inode->i_ino);
+ trace_ext4_discard_preallocations(inode);
INIT_LIST_HEAD(&list);
ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
+ if (ac) {
+ ac->ac_sb = sb;
+ ac->ac_inode = inode;
+ }
repeat:
/* first, collect all pa's in the inode */
spin_lock(&ei->i_prealloc_lock);
@@ -4304,6 +4312,8 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
INIT_LIST_HEAD(&discard_list);
ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
+ if (ac)
+ ac->ac_sb = sb;
spin_lock(&lg->lg_prealloc_lock);
list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[order],
@@ -4473,8 +4483,7 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
int ret;
int freed = 0;
- trace_mark(ext4_mb_discard_preallocations, "dev %s needed %d",
- sb->s_id, needed);
+ trace_ext4_mb_discard_preallocations(sb, needed);
for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
ret = ext4_mb_discard_group_preallocations(sb, i, needed);
freed += ret;
@@ -4503,17 +4512,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
sb = ar->inode->i_sb;
sbi = EXT4_SB(sb);
- trace_mark(ext4_request_blocks, "dev %s flags %u len %u ino %lu "
- "lblk %llu goal %llu lleft %llu lright %llu "
- "pleft %llu pright %llu ",
- sb->s_id, ar->flags, ar->len,
- ar->inode ? ar->inode->i_ino : 0,
- (unsigned long long) ar->logical,
- (unsigned long long) ar->goal,
- (unsigned long long) ar->lleft,
- (unsigned long long) ar->lright,
- (unsigned long long) ar->pleft,
- (unsigned long long) ar->pright);
+ trace_ext4_request_blocks(ar);
/*
* For delayed allocation, we could skip the ENOSPC and
@@ -4549,7 +4548,10 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
}
ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
- if (!ac) {
+ if (ac) {
+ ac->ac_sb = sb;
+ ac->ac_inode = ar->inode;
+ } else {
ar->len = 0;
*errp = -ENOMEM;
goto out1;
@@ -4622,18 +4624,7 @@ out3:
reserv_blks);
}
- trace_mark(ext4_allocate_blocks,
- "dev %s block %llu flags %u len %u ino %lu "
- "logical %llu goal %llu lleft %llu lright %llu "
- "pleft %llu pright %llu ",
- sb->s_id, (unsigned long long) block,
- ar->flags, ar->len, ar->inode ? ar->inode->i_ino : 0,
- (unsigned long long) ar->logical,
- (unsigned long long) ar->goal,
- (unsigned long long) ar->lleft,
- (unsigned long long) ar->lright,
- (unsigned long long) ar->pleft,
- (unsigned long long) ar->pright);
+ trace_ext4_allocate_blocks(ar, (unsigned long long)block);
return block;
}
@@ -4768,10 +4759,7 @@ void ext4_mb_free_blocks(handle_t *handle, struct inode *inode,
}
ext4_debug("freeing block %lu\n", block);
- trace_mark(ext4_free_blocks,
- "dev %s block %llu count %lu metadata %d ino %lu",
- sb->s_id, (unsigned long long) block, count, metadata,
- inode ? inode->i_ino : 0);
+ trace_ext4_free_blocks(inode, block, count, metadata);
ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
if (ac) {
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index dd9e6cd..61829a2 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -19,8 +19,8 @@
#include <linux/seq_file.h>
#include <linux/version.h>
#include <linux/blkdev.h>
-#include <linux/marker.h>
#include <linux/mutex.h>
+#include <trace/ext4.h>
#include "ext4_jbd2.h"
#include "ext4.h"
#include "group.h"
@@ -118,37 +118,11 @@ struct ext4_free_data {
tid_t t_tid;
};
-struct ext4_prealloc_space {
- struct list_head pa_inode_list;
- struct list_head pa_group_list;
- union {
- struct list_head pa_tmp_list;
- struct rcu_head pa_rcu;
- } u;
- spinlock_t pa_lock;
- atomic_t pa_count;
- unsigned pa_deleted;
- ext4_fsblk_t pa_pstart; /* phys. block */
- ext4_lblk_t pa_lstart; /* log. block */
- unsigned short pa_len; /* len of preallocated chunk */
- unsigned short pa_free; /* how many blocks are free */
- unsigned short pa_type; /* pa type. inode or group */
- spinlock_t *pa_obj_lock;
- struct inode *pa_inode; /* hack, for history only */
-};
-
enum {
MB_INODE_PA = 0,
MB_GROUP_PA = 1
};
-struct ext4_free_extent {
- ext4_lblk_t fe_logical;
- ext4_grpblk_t fe_start;
- ext4_group_t fe_group;
- int fe_len;
-};
-
/*
* Locality group:
* we try to group all related changes together
@@ -167,46 +141,6 @@ struct ext4_locality_group {
spinlock_t lg_prealloc_lock;
};
-struct ext4_allocation_context {
- struct inode *ac_inode;
- struct super_block *ac_sb;
-
- /* original request */
- struct ext4_free_extent ac_o_ex;
-
- /* goal request (after normalization) */
- struct ext4_free_extent ac_g_ex;
-
- /* the best found extent */
- struct ext4_free_extent ac_b_ex;
-
- /* copy of the bext found extent taken before preallocation efforts */
- struct ext4_free_extent ac_f_ex;
-
- /* number of iterations done. we have to track to limit searching */
- unsigned long ac_ex_scanned;
- __u16 ac_groups_scanned;
- __u16 ac_found;
- __u16 ac_tail;
- __u16 ac_buddy;
- __u16 ac_flags; /* allocation hints */
- __u8 ac_status;
- __u8 ac_criteria;
- __u8 ac_repeats;
- __u8 ac_2order; /* if request is to allocate 2^N blocks and
- * N > 0, the field stores N, otherwise 0 */
- __u8 ac_op; /* operation, for history only */
- struct page *ac_bitmap_page;
- struct page *ac_buddy_page;
- /*
- * pointer to the held semaphore upon successful
- * block allocation
- */
- struct rw_semaphore *alloc_semp;
- struct ext4_prealloc_space *ac_pa;
- struct ext4_locality_group *ac_lg;
-};
-
#define AC_STATUS_CONTINUE 1
#define AC_STATUS_FOUND 2
#define AC_STATUS_BREAK 3
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2958f4e..9b77dd1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -36,9 +36,9 @@
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
#include <linux/ctype.h>
-#include <linux/marker.h>
#include <linux/log2.h>
#include <linux/crc16.h>
+#include <trace/ext4.h>
#include <asm/uaccess.h>
#include "ext4.h"
@@ -48,6 +48,8 @@
#include "namei.h"
#include "group.h"
+DEFINE_TRACE(ext4_sync_fs);
+
struct proc_dir_entry *ext4_proc_root;
static struct kset *ext4_kset;
@@ -3287,7 +3289,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
int ret = 0;
tid_t target;
- trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
+ trace_ext4_sync_fs(sb, wait);
sb->s_dirt = 0;
if (EXT4_SB(sb)->s_journal) {
if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal,
diff --git a/include/linux/ext4_trace_types.h b/include/linux/ext4_trace_types.h
new file mode 100644
index 0000000..229a2dd
--- /dev/null
+++ b/include/linux/ext4_trace_types.h
@@ -0,0 +1,108 @@
+/*
+ * ext4_trace_types.h -- these data structures are needed by the
+ * tracing infrastructure
+ */
+
+#ifndef _EXT4_TRACE_TYPES_H
+#define _EXT4_TRACE_TYPES_H
+
+/* data type for block offset of block group */
+typedef int ext4_grpblk_t;
+
+/* data type for filesystem-wide blocks number */
+typedef unsigned long long ext4_fsblk_t;
+
+/* data type for file logical block number */
+typedef __u32 ext4_lblk_t;
+
+/* data type for block group number */
+typedef unsigned int ext4_group_t;
+
+struct ext4_allocation_request {
+ /* target inode for block we're allocating */
+ struct inode *inode;
+ /* logical block in target inode */
+ ext4_lblk_t logical;
+ /* phys. target (a hint) */
+ ext4_fsblk_t goal;
+ /* the closest logical allocated block to the left */
+ ext4_lblk_t lleft;
+ /* phys. block for ^^^ */
+ ext4_fsblk_t pleft;
+ /* the closest logical allocated block to the right */
+ ext4_lblk_t lright;
+ /* phys. block for ^^^ */
+ ext4_fsblk_t pright;
+ /* how many blocks we want to allocate */
+ unsigned int len;
+ /* flags. see above EXT4_MB_HINT_* */
+ unsigned int flags;
+};
+
+struct ext4_free_extent {
+ ext4_lblk_t fe_logical;
+ ext4_grpblk_t fe_start;
+ ext4_group_t fe_group;
+ int fe_len;
+};
+
+struct ext4_allocation_context {
+ struct inode *ac_inode;
+ struct super_block *ac_sb;
+
+ /* original request */
+ struct ext4_free_extent ac_o_ex;
+
+ /* goal request (after normalization) */
+ struct ext4_free_extent ac_g_ex;
+
+ /* the best found extent */
+ struct ext4_free_extent ac_b_ex;
+
+ /* copy of the bext found extent taken before preallocation efforts */
+ struct ext4_free_extent ac_f_ex;
+
+ /* number of iterations done. we have to track to limit searching */
+ unsigned long ac_ex_scanned;
+ __u16 ac_groups_scanned;
+ __u16 ac_found;
+ __u16 ac_tail;
+ __u16 ac_buddy;
+ __u16 ac_flags; /* allocation hints */
+ __u8 ac_status;
+ __u8 ac_criteria;
+ __u8 ac_repeats;
+ __u8 ac_2order; /* if request is to allocate 2^N blocks and
+ * N > 0, the field stores N, otherwise 0 */
+ __u8 ac_op; /* operation, for history only */
+ struct page *ac_bitmap_page;
+ struct page *ac_buddy_page;
+ /*
+ * pointer to the held semaphore upon successful
+ * block allocation
+ */
+ struct rw_semaphore *alloc_semp;
+ struct ext4_prealloc_space *ac_pa;
+ struct ext4_locality_group *ac_lg;
+};
+
+struct ext4_prealloc_space {
+ struct list_head pa_inode_list;
+ struct list_head pa_group_list;
+ union {
+ struct list_head pa_tmp_list;
+ struct rcu_head pa_rcu;
+ } u;
+ spinlock_t pa_lock;
+ atomic_t pa_count;
+ unsigned pa_deleted;
+ ext4_fsblk_t pa_pstart; /* phys. block */
+ ext4_lblk_t pa_lstart; /* log. block */
+ unsigned short pa_len; /* len of preallocated chunk */
+ unsigned short pa_free; /* how many blocks are free */
+ unsigned short pa_type; /* pa type. inode or group */
+ spinlock_t *pa_obj_lock;
+ struct inode *pa_inode; /* hack, for history only */
+};
+
+#endif /* _EXT4_TRACE_TYPES_H */
diff --git a/include/trace/ext4.h b/include/trace/ext4.h
new file mode 100644
index 0000000..e8d647a
--- /dev/null
+++ b/include/trace/ext4.h
@@ -0,0 +1,8 @@
+#ifndef _TRACE_EXT4_H
+#define _TRACE_EXT4_H
+
+#include <linux/tracepoint.h>
+
+#include <trace/ext4_event_types.h>
+
+#endif
diff --git a/include/trace/ext4_event_types.h b/include/trace/ext4_event_types.h
new file mode 100644
index 0000000..da9e5f4
--- /dev/null
+++ b/include/trace/ext4_event_types.h
@@ -0,0 +1,690 @@
+
+/* use <trace/ext4.h> instead */
+#ifndef TRACE_EVENT
+# error Do not include this file directly.
+# error Unless you know what you are doing.
+#endif
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ext4
+
+#include <linux/writeback.h>
+#include <linux/ext4_trace_types.h>
+
+TRACE_EVENT(ext4_free_inode,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( umode_t, mode )
+ __field( uid_t, uid )
+ __field( gid_t, gid )
+ __field( blkcnt_t, blocks )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->mode = inode->i_mode;
+ __entry->uid = inode->i_uid;
+ __entry->gid = inode->i_gid;
+ __entry->blocks = inode->i_blocks;
+ ),
+
+ TP_printk("dev %s ino %lu mode %d uid %u gid %u blocks %llu",
+ __entry->devname, __entry->ino, __entry->mode,
+ __entry->uid, __entry->gid, __entry->blocks)
+);
+
+TRACE_EVENT(ext4_request_inode,
+ TP_PROTO(struct inode *dir, int mode),
+
+ TP_ARGS(dir, mode),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, dir )
+ __field( umode_t, mode )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, dir->i_sb->s_id, 32);
+ __entry->dir = dir->i_ino;
+ __entry->mode = mode;
+ ),
+
+ TP_printk("dev %s dir %lu mode %d",
+ __entry->devname, __entry->dir, __entry->mode)
+);
+
+TRACE_EVENT(ext4_allocate_inode,
+ TP_PROTO(struct inode *inode, struct inode *dir, int mode),
+
+ TP_ARGS(inode, dir, mode),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( ino_t, dir )
+ __field( umode_t, mode )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, dir->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->dir = dir->i_ino;
+ __entry->mode = mode;
+ ),
+
+ TP_printk("dev %s ino %lu dir %lu mode %d",
+ __entry->devname, __entry->ino, __entry->dir, __entry->mode)
+);
+
+TRACE_EVENT(ext4_write_begin,
+
+ TP_PROTO(struct inode *inode, loff_t pos, unsigned int len,
+ unsigned int flags),
+
+ TP_ARGS(inode, pos, len, flags),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( loff_t, pos )
+ __field( unsigned int, len )
+ __field( unsigned int, flags )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->pos = pos;
+ __entry->len = len;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("dev %s ino %lu pos %llu len %u flags %u",
+ __entry->devname, __entry->ino, __entry->pos, __entry->len,
+ __entry->flags)
+);
+
+TRACE_EVENT(ext4_ordered_write_end,
+ TP_PROTO(struct inode *inode, loff_t pos, unsigned int len,
+ unsigned int copied),
+
+ TP_ARGS(inode, pos, len, copied),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( loff_t, pos )
+ __field( unsigned int, len )
+ __field( unsigned int, copied )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->pos = pos;
+ __entry->len = len;
+ __entry->copied = copied;
+ ),
+
+ TP_printk("dev %s ino %lu pos %llu len %u copied %u",
+ __entry->devname, __entry->ino, __entry->pos, __entry->len,
+ __entry->copied)
+);
+
+TRACE_EVENT(ext4_writeback_write_end,
+ TP_PROTO(struct inode *inode, loff_t pos, unsigned int len,
+ unsigned int copied),
+
+ TP_ARGS(inode, pos, len, copied),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( loff_t, pos )
+ __field( unsigned int, len )
+ __field( unsigned int, copied )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->pos = pos;
+ __entry->len = len;
+ __entry->copied = copied;
+ ),
+
+ TP_printk("dev %s ino %lu pos %llu len %u copied %u",
+ __entry->devname, __entry->ino, __entry->pos, __entry->len,
+ __entry->copied)
+);
+
+TRACE_EVENT(ext4_journalled_write_end,
+ TP_PROTO(struct inode *inode, loff_t pos, unsigned int len,
+ unsigned int copied),
+ TP_ARGS(inode, pos, len, copied),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( loff_t, pos )
+ __field( unsigned int, len )
+ __field( unsigned int, copied )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->pos = pos;
+ __entry->len = len;
+ __entry->copied = copied;
+ ),
+
+ TP_printk("dev %s ino %lu pos %llu len %u copied %u",
+ __entry->devname, __entry->ino, __entry->pos, __entry->len,
+ __entry->copied)
+);
+
+TRACE_EVENT(ext4_da_writepage,
+ TP_PROTO(struct inode *inode, struct page *page),
+
+ TP_ARGS(inode, page),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( pgoff_t, index )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->index = page->index;
+ ),
+
+ TP_printk("dev %s ino %lu page_index %lu",
+ __entry->devname, __entry->ino, __entry->index)
+);
+
+TRACE_EVENT(ext4_da_writepages,
+ TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+ TP_ARGS(inode, wbc),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( long, nr_to_write )
+ __field( long, pages_skipped )
+ __field( loff_t, range_start )
+ __field( loff_t, range_end )
+ __field( char, nonblocking )
+ __field( char, for_kupdate )
+ __field( char, for_reclaim )
+ __field( char, for_writepages )
+ __field( char, range_cyclic )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->nr_to_write = wbc->nr_to_write;
+ __entry->pages_skipped = wbc->pages_skipped;
+ __entry->range_start = wbc->range_start;
+ __entry->range_end = wbc->range_end;
+ __entry->nonblocking = wbc->nonblocking;
+ __entry->for_kupdate = wbc->for_kupdate;
+ __entry->for_reclaim = wbc->for_reclaim;
+ __entry->for_writepages = wbc->for_writepages;
+ __entry->range_cyclic = wbc->range_cyclic;
+ ),
+
+ TP_printk("dev %s ino %lu nr_t_write %ld pages_skipped %ld range_start %llu range_end %llu nonblocking %d for_kupdate %d for_reclaim %d for_writepages %d range_cyclic %d",
+ __entry->devname, __entry->ino, __entry->nr_to_write,
+ __entry->pages_skipped, __entry->range_start,
+ __entry->range_end, __entry->nonblocking,
+ __entry->for_kupdate, __entry->for_reclaim,
+ __entry->for_writepages, __entry->range_cyclic)
+);
+
+TRACE_EVENT(ext4_da_writepages_result,
+ TP_PROTO(struct inode *inode, struct writeback_control *wbc,
+ int ret, int pages_written),
+
+ TP_ARGS(inode, wbc, ret, pages_written),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( int, ret )
+ __field( int, pages_written )
+ __field( long, pages_skipped )
+ __field( char, encountered_congestion )
+ __field( char, more_io )
+ __field( char, no_nrwrite_index_update )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->ret = ret;
+ __entry->pages_written = pages_written;
+ __entry->pages_skipped = wbc->pages_skipped;
+ __entry->encountered_congestion = wbc->encountered_congestion;
+ __entry->more_io = wbc->more_io;
+ __entry->no_nrwrite_index_update = wbc->no_nrwrite_index_update;
+ ),
+
+ TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld congestion %d more_io %d no_nrwrite_index_update %d",
+ __entry->devname, __entry->ino, __entry->ret,
+ __entry->pages_written, __entry->pages_skipped,
+ __entry->encountered_congestion, __entry->more_io,
+ __entry->no_nrwrite_index_update)
+);
+
+TRACE_EVENT(ext4_da_write_begin,
+ TP_PROTO(struct inode *inode, loff_t pos, unsigned int len,
+ unsigned int flags),
+
+ TP_ARGS(inode, pos, len, flags),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( loff_t, pos )
+ __field( unsigned int, len )
+ __field( unsigned int, flags )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->pos = pos;
+ __entry->len = len;
+ __entry->flags = flags;
+ ),
+
+ TP_printk("dev %s ino %lu pos %llu len %u flags %u",
+ __entry->devname, __entry->ino, __entry->pos, __entry->len,
+ __entry->flags)
+);
+
+TRACE_EVENT(ext4_da_write_end,
+ TP_PROTO(struct inode *inode, loff_t pos, unsigned int len,
+ unsigned int copied),
+
+ TP_ARGS(inode, pos, len, copied),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( loff_t, pos )
+ __field( unsigned int, len )
+ __field( unsigned int, copied )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->pos = pos;
+ __entry->len = len;
+ __entry->copied = copied;
+ ),
+
+ TP_printk("dev %s ino %lu pos %llu len %u copied %u",
+ __entry->devname, __entry->ino, __entry->pos, __entry->len,
+ __entry->copied)
+);
+
+TRACE_EVENT(ext4_normal_writepage,
+ TP_PROTO(struct inode *inode, struct page *page),
+
+ TP_ARGS(inode, page),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( pgoff_t, index )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->index = page->index;
+ ),
+
+ TP_printk("dev %s ino %lu page_index %lu",
+ __entry->devname, __entry->ino, __entry->index)
+);
+
+TRACE_EVENT(ext4_journalled_writepage,
+ TP_PROTO(struct inode *inode, struct page *page),
+
+ TP_ARGS(inode, page),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( pgoff_t, index )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->index = page->index;
+ ),
+
+ TP_printk("dev %s ino %lu page_index %lu",
+ __entry->devname, __entry->ino, __entry->index)
+);
+
+TRACE_EVENT(ext4_discard_blocks,
+ TP_PROTO(struct super_block *sb, unsigned long long blk,
+ unsigned long long count),
+
+ TP_ARGS(sb, blk, count),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( __u64, blk )
+ __field( __u64, count )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, sb->s_id, 32);
+ __entry->blk = blk;
+ __entry->count = count;
+ ),
+
+ TP_printk("dev %s blk %llu count %llu",
+ __entry->devname, __entry->blk, __entry->count)
+);
+
+TRACE_EVENT(ext4_mb_new_inode_pa,
+ TP_PROTO(struct ext4_allocation_context *ac,
+ struct ext4_prealloc_space *pa),
+
+ TP_ARGS(ac, pa),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( __u64, pa_pstart )
+ __field( __u32, pa_len )
+ __field( __u64, pa_lstart )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, ac->ac_sb->s_id, 32);
+ __entry->ino = ac->ac_inode->i_ino;
+ __entry->pa_pstart = pa->pa_pstart;
+ __entry->pa_len = pa->pa_len;
+ __entry->pa_lstart = pa->pa_lstart;
+ ),
+
+ TP_printk("dev %s ino %lu pstart %llu len %u lstart %llu",
+ __entry->devname, __entry->ino, __entry->pa_pstart,
+ __entry->pa_len, __entry->pa_lstart)
+);
+
+TRACE_EVENT(ext4_mb_new_group_pa,
+ TP_PROTO(struct ext4_allocation_context *ac,
+ struct ext4_prealloc_space *pa),
+
+ TP_ARGS(ac, pa),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( __u64, pa_pstart )
+ __field( __u32, pa_len )
+ __field( __u64, pa_lstart )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, ac->ac_sb->s_id, 32);
+ __entry->ino = ac->ac_inode->i_ino;
+ __entry->pa_pstart = pa->pa_pstart;
+ __entry->pa_len = pa->pa_len;
+ __entry->pa_lstart = pa->pa_lstart;
+ ),
+
+ TP_printk("dev %s ino %lu pstart %llu len %u lstart %llu",
+ __entry->devname, __entry->ino, __entry->pa_pstart,
+ __entry->pa_len, __entry->pa_lstart)
+);
+
+TRACE_EVENT(ext4_mb_release_inode_pa,
+ TP_PROTO(struct ext4_allocation_context *ac,
+ struct ext4_prealloc_space *pa,
+ unsigned long long block, unsigned int count),
+
+ TP_ARGS(ac, pa, block, count),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( __u64, block )
+ __field( __u32, count )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, ac->ac_sb->s_id, 32);
+ __entry->ino = ac->ac_inode->i_ino;
+ __entry->block = block;
+ __entry->count = count;
+ ),
+
+ TP_printk("dev %s ino %lu block %llu count %u",
+ __entry->devname, __entry->ino, __entry->block,
+ __entry->count)
+);
+
+TRACE_EVENT(ext4_mb_release_group_pa,
+ TP_PROTO(struct ext4_allocation_context *ac,
+ struct ext4_prealloc_space *pa),
+
+ TP_ARGS(ac, pa),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( __u64, pa_pstart )
+ __field( __u32, pa_len )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, ac->ac_sb->s_id, 32);
+ __entry->ino = ac->ac_inode->i_ino;
+ __entry->pa_pstart = pa->pa_pstart;
+ __entry->pa_len = pa->pa_len;
+ ),
+
+ TP_printk("dev %s pstart %llu len %u",
+ __entry->devname, __entry->pa_pstart, __entry->pa_len)
+);
+
+TRACE_EVENT(ext4_discard_preallocations,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ ),
+
+ TP_printk("dev %s ino %lu",
+ __entry->devname, __entry->ino)
+);
+
+TRACE_EVENT(ext4_mb_discard_preallocations,
+ TP_PROTO(struct super_block *sb, int needed),
+
+ TP_ARGS(sb, needed),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( int, needed )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, sb->s_id, 32);
+ __entry->needed = needed;
+ ),
+
+ TP_printk("dev %s needed %d",
+ __entry->devname, __entry->needed)
+);
+
+TRACE_EVENT(ext4_request_blocks,
+ TP_PROTO(struct ext4_allocation_request *ar),
+
+ TP_ARGS(ar),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( struct ext4_allocation_request, ar )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, ar->inode->i_sb->s_id, 32);
+ __entry->ino = ar->inode->i_ino;
+ __entry->ar = *ar;
+ ),
+
+ TP_printk("dev %s ino %lu flags %u len %u lblk %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu ",
+ __entry->devname, __entry->ino, __entry->ar.flags,
+ __entry->ar.len,
+ (unsigned long long) __entry->ar.logical,
+ (unsigned long long) __entry->ar.goal,
+ (unsigned long long) __entry->ar.lleft,
+ (unsigned long long) __entry->ar.lright,
+ (unsigned long long) __entry->ar.pleft,
+ (unsigned long long) __entry->ar.pright)
+);
+
+TRACE_EVENT(ext4_allocate_blocks,
+ TP_PROTO(struct ext4_allocation_request *ar, unsigned long long block),
+
+ TP_ARGS(ar, block),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( __u64, block )
+ __field( struct ext4_allocation_request, ar )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, ar->inode->i_sb->s_id, 32);
+ __entry->ino = ar->inode->i_ino;
+ __entry->block = block;
+ __entry->ar = *ar;
+ ),
+
+ TP_printk("dev %s ino %lu flags %u len %u block %llu lblk %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu ",
+ __entry->devname, __entry->ino, __entry->ar.flags,
+ __entry->ar.len, __entry->block,
+ (unsigned long long) __entry->ar.logical,
+ (unsigned long long) __entry->ar.goal,
+ (unsigned long long) __entry->ar.lleft,
+ (unsigned long long) __entry->ar.lright,
+ (unsigned long long) __entry->ar.pleft,
+ (unsigned long long) __entry->ar.pright)
+);
+
+TRACE_EVENT(ext4_free_blocks,
+ TP_PROTO(struct inode *inode, __u64 block, unsigned long count,
+ int metadata),
+
+ TP_ARGS(inode, block, count, metadata),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( __u64, block )
+ __field( unsigned long, count )
+ __field( int, metadata )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, inode->i_sb->s_id, 32);
+ __entry->ino = inode->i_ino;
+ __entry->block = block;
+ __entry->count = count;
+ __entry->metadata = metadata;
+ ),
+
+ TP_printk("dev %s ino %lu block %llu count %lu metadata %d",
+ __entry->devname, __entry->ino, __entry->block,
+ __entry->count, __entry->metadata)
+);
+
+TRACE_EVENT(ext4_sync_file,
+ TP_PROTO(struct file *file, struct dentry *dentry, int datasync),
+
+ TP_ARGS(file, dentry, datasync),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( ino_t, ino )
+ __field( ino_t, parent )
+ __field( int, datasync )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, dentry->d_inode->i_sb->s_id, 32);
+ __entry->ino = dentry->d_inode->i_ino;
+ __entry->datasync = datasync;
+ __entry->parent = dentry->d_parent->d_inode->i_ino;
+ ),
+
+ TP_printk("dev %s ino %ld parent %ld datasync %d ",
+ __entry->devname, __entry->ino, __entry->parent,
+ __entry->datasync)
+);
+
+TRACE_EVENT(ext4_sync_fs,
+ TP_PROTO(struct super_block *sb, int wait),
+
+ TP_ARGS(sb, wait),
+
+ TP_STRUCT__entry(
+ __array( char, devname, 32 )
+ __field( int, wait )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->devname, sb->s_id, 32);
+ __entry->wait = wait;
+ ),
+
+ TP_printk("dev %s wait %d", __entry->devname, __entry->wait)
+);
+
+
+#undef TRACE_SYSTEM
diff --git a/include/trace/trace_event_types.h b/include/trace/trace_event_types.h
index bd26f4d..26d7aee 100644
--- a/include/trace/trace_event_types.h
+++ b/include/trace/trace_event_types.h
@@ -4,3 +4,4 @@
#include <trace/irq_event_types.h>
#include <trace/lockdep_event_types.h>
#include <trace/jbd2_event_types.h>
+#include <trace/ext4_event_types.h>
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 2a9b970..ad8f6eb 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -4,3 +4,4 @@
#include <trace/irq.h>
#include <trace/lockdep.h>
#include <trace/jbd2.h>
+#include <trace/ext4.h>
--
1.5.6.3
* Theodore Tso <[email protected]> wrote:
> On Sun, Apr 12, 2009 at 03:01:05PM +0200, Ingo Molnar wrote:
> > Btw., you mention ext4 and jbd2 new-style tracepoints in the text.
> > Does this mean you already have them coded up (i havent seen any
> > patch posting from you), just that you cannot push it upstream yet
> > because ext4 can be a module? We'll have modular new-style
> > tracepoints soon.
>
> Yes, I coded them up recently. I wanted to do some performance
> measurements, and being able to interleave the ext4 tracepoint
> logs with the block I/o tracer was definitely handy. Also, not
> having to fight with balky userspace tools (whether it is
> out-of-tree kernel support code, or random graphical userspace
> libraries when I could really care less about a GUI interface) was
> definitely a big win.
Cool. [ And i guess you'll like the per tracepoint filter
expressions too :-) ]
> I only had two real problems. One is that the block I/O tracer
> only traces "real" devices, and not device mapper devices --- I
> could user the blktrace userspace tool, but then the results
> wouldn't be properly interleaved with the ext4 tracepoint logs.
Ok - i forwarded your mail to Li Zefan and he already posted a draft
patch to attempt to solve this. "Cannot trace the primary block
device my filesystem is using" is IMO a showstopper in your tracing
work-flow critical path, so it needs some sort of quick solution.
> The second is that ext4 has its localized header files in
> fs/ext4/*.h, and not in include/linux/*.h, and that was
> problematic given that the trace code snippets in
> include/trace/ext4_event_types.h needed access to some internal
> ext4 data structures. I ultimately solved the latter by creating
> a include/linux/ext4_tracing_types.h, but I suspect this problem
> will go away when you have modular new-style tracepoints --- if
> not, I'd appreciately greatly if folks could consider whether or
> not this support could be added.
I'd expect these problems to go away with Steve's module support and
TRACE_EVENT()-simplification/cleanup series, yes.
> I'll attach the patches as replies to this mail thread so you can
> see what I've done. Any comments of anything I might have done
> "wrong" would be greatly appreciated.
I dont think you can do anything "wrong" at this stage - the event
tracer is still fresh out of the oven. It was shaped based on the
needs of a few select core kernel non-modular subsystems (scheduler,
irqs, etc.). Feedback from people like you with non-trivial
independent tracing bits (you have a handful of useful-looking
markers in ext4) is very much needed to shape and complete it.
So we'll sure try to address all your feedback and it would be
extremely useful if you labeled all the steps that you found
unintuitive, unnecessary, over-complicated or redundant or painful
in any way. Please dont hold any of your punches ;-)
Ingo
* Theodore Ts'o <[email protected]> wrote:
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/ext4.h | 22 +--
> fs/ext4/ext4_i.h | 12 -
> fs/ext4/fsync.c | 8 +-
> fs/ext4/ialloc.c | 17 +-
> fs/ext4/inode.c | 79 ++---
> fs/ext4/mballoc.c | 84 ++---
> fs/ext4/mballoc.h | 68 +----
> fs/ext4/super.c | 6 +-
> include/linux/ext4_trace_types.h | 108 ++++++
> include/trace/ext4.h | 8 +
> include/trace/ext4_event_types.h | 690 +++++++++++++++++++++++++++++++++++++
> include/trace/trace_event_types.h | 1 +
> include/trace/trace_events.h | 1 +
> 13 files changed, 886 insertions(+), 218 deletions(-)
> create mode 100644 include/linux/ext4_trace_types.h
> create mode 100644 include/trace/ext4.h
> create mode 100644 include/trace/ext4_event_types.h
Really nice stuff!
A general, high-level observation: it is visible that the conversion
to TRACE_EVENTS() does increase the total linecount, but it also
removes code from the most often used source code areas
(fs/ext4/*.c) - and using an intuitive function-call-alike
single-line tracepoint there.
I'd call that a positive effect.
It would still be nice to compact the 690 lines of (out of line)
ext4_event_types.h definitions some more - and i think we'll be able
to do it without affecting the usage sites.
OTOH, even those bits look fairly readable, and there's a fair
amount of bang-for-bucks ratio - these tracepoints give us:
- integrated tracepoints
- zero-copy and per-cpu splice() based tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
... so even if we were to keep the 690 lines of overhead with no
further compacting, it still looks like a good deal to me, all
things considered.
Ingo
On Mon, Apr 13, 2009 at 11:31:24PM +0200, Ingo Molnar wrote:
> Cool. [ And i guess you'll like the per tracepoint filter
> expressions too :-) ]
I haven't played with them yet, but I was looking over the source code
at them (since they aren't documented yet :-). It looks like at the
moment only integer matches are allowed, right? That's a bit of an
issue for me, since one of the things I'd really like to be able to do
is filter based on devname (i.e., sda2). (Most of the time we only
want to collect information for a particular block device or
filesystem.)
Actually, the fact that I'm having to drop some 32 bytes for each jbd2
and ext4 trace log for the bdevname in the ring buffer is really for
the birds. What I really want to do is just to drop in the dev_t, and
then for the tracing infrastructure to have an efficient (cached) way
of taking the dev_t and turning that back into struct block_device at
TP_printk time so we can print the bdevname when it's needed. We
deifnitely don't want to be calling bdget() in fs/block_dev.c each
time we print a line in the tracing buffer! I'm guessing that's
something the blktrace tracer would find handy as well.
Of course having more kernel code play with dev_t's directly isn't
considered politically correct in some circles, but tough. :-) We
can't exactly drop a pointer to a struct block_device in the trace
buffer, since there's no guarantee it will still be valid when we read
it out. Dropping in a dev_t is exactly what we want. It would be
nice though if there was a way to specify a major/minor number as the
filter predicate for the dev_t, and not to have the user generate the
MAJOR/MINOR encoding. So some way of parsing "MKDEV(8, 4)" as the
input to the filter predicate would probably be a really good thing to
do.
- Ted
* Theodore Tso <[email protected]> wrote:
> On Mon, Apr 13, 2009 at 11:31:24PM +0200, Ingo Molnar wrote:
> > Cool. [ And i guess you'll like the per tracepoint filter
> > expressions too :-) ]
>
> I haven't played with them yet, but I was looking over the source
> code at them (since they aren't documented yet :-). It looks like
> at the moment only integer matches are allowed, right? That's a
> bit of an issue for me, since one of the things I'd really like to
> be able to do is filter based on devname (i.e., sda2). (Most of
> the time we only want to collect information for a particular
> block device or filesystem.)
You can already do:
aldebaran:/debug/tracing/events/sched/sched_process_wait> echo "comm == Xorg" > filter
aldebaran:/debug/tracing/events/sched/sched_process_wait> cat filter
comm == Xorg
But string values depends on the type of the format field - so you
cannot do string matches on integer fields.
For kdev_t matches i think we'll need native support for that type -
in addition to the integer/string types. It will come up in other
places as well and user-space knows about devices as well.
> Actually, the fact that I'm having to drop some 32 bytes for each
> jbd2 and ext4 trace log for the bdevname in the ring buffer is
> really for the birds. What I really want to do is just to drop in
> the dev_t, and then for the tracing infrastructure to have an
> efficient (cached) way of taking the dev_t and turning that back
> into struct block_device at TP_printk time so we can print the
> bdevname when it's needed. We deifnitely don't want to be calling
> bdget() in fs/block_dev.c each time we print a line in the tracing
> buffer! I'm guessing that's something the blktrace tracer would
> find handy as well.
Yeah.
It could be worked around right now by converting it to an integer
but i think what we want is native support for kdev_t, together with
all the usual convenience forms of specifying it: sda1 should work
the same way as 8:1 or 0801. Even /dev/sda1 should be recognized in
a filter expression.
> Of course having more kernel code play with dev_t's directly isn't
> considered politically correct in some circles, but tough. :-) We
> can't exactly drop a pointer to a struct block_device in the trace
> buffer, since there's no guarantee it will still be valid when we
> read it out. Dropping in a dev_t is exactly what we want. It
> would be nice though if there was a way to specify a major/minor
> number as the filter predicate for the dev_t, and not to have the
> user generate the MAJOR/MINOR encoding. So some way of parsing
> "MKDEV(8, 4)" as the input to the filter predicate would probably
> be a really good thing to do.
Yeah, exactly. We already have smarts in init/* to recognize certain
device string patterns (for rootdev specification) - that could be
factored out (it already is to a large degree) and reused. We dont
need full udev enumeration really - we just need the most common
variants.
Regardless of whether it's considered politically correct or not ;-)
It's clearly useful.
Ingo
On Tue, Apr 14, 2009 at 12:55:42AM +0200, Ingo Molnar wrote:
>
> It could be worked around right now by converting it to an integer
> but i think what we want is native support for kdev_t, together with
> all the usual convenience forms of specifying it: sda1 should work
> the same way as 8:1 or 0801. Even /dev/sda1 should be recognized in
> a filter expression.
Yeah, I could just drop in the integer now, and and just have
TP_printk() display "(8, 2)" instead of "sda2". It's really a
question of how far we want to take pretty-printing and parsing for
ftrace, I suppose.
But if we are going to have end-users use it, having real
pretty-printed names would be a good thing, IMHO. Especially if
major/minor numbers start becoming completely random beasts, as some
have proposed. (I think it's a terrible idea, but I'm clearly not
politically correct. :-)
- Ted
* Theodore Tso <[email protected]> wrote:
> On Tue, Apr 14, 2009 at 12:55:42AM +0200, Ingo Molnar wrote:
> >
> > It could be worked around right now by converting it to an
> > integer but i think what we want is native support for kdev_t,
> > together with all the usual convenience forms of specifying it:
> > sda1 should work the same way as 8:1 or 0801. Even /dev/sda1
> > should be recognized in a filter expression.
>
> Yeah, I could just drop in the integer now, and and just have
> TP_printk() display "(8, 2)" instead of "sda2". It's really a
> question of how far we want to take pretty-printing and parsing
> for ftrace, I suppose.
We try to do it as far as daily use in /debug/tracing/ dictates.
Interacting with the kernel on such a direct channel is really
intuitive and useful in debugging and development sessions IMHO.
Raw binary records would encode it in an efficient and
well-specified manner, so information density is not hurt by
pretty-printing.
> But if we are going to have end-users use it, having real
> pretty-printed names would be a good thing, IMHO. Especially if
> major/minor numbers start becoming completely random beasts, as
> some have proposed. (I think it's a terrible idea, but I'm
> clearly not politically correct. :-)
Sounds like a terrible idea to me too. If more space is needed then
perhaps dynamically allocate the _new_ bits needed - but leave the
well-established spaces alone. Making everything random looking is
just asking for all sorts of trouble IMO. Making the system harder
to understand at a glance, on such a fundamental level, seems silly.
Ingo
On Tue, 2009-04-14 at 00:55 +0200, Ingo Molnar wrote:
> * Theodore Tso <[email protected]> wrote:
>
> > On Mon, Apr 13, 2009 at 11:31:24PM +0200, Ingo Molnar wrote:
> > > Cool. [ And i guess you'll like the per tracepoint filter
> > > expressions too :-) ]
> >
> > I haven't played with them yet, but I was looking over the source
> > code at them (since they aren't documented yet :-). It looks like
> > at the moment only integer matches are allowed, right? That's a
> > bit of an issue for me, since one of the things I'd really like to
> > be able to do is filter based on devname (i.e., sda2). (Most of
> > the time we only want to collect information for a particular
> > block device or filesystem.)
>
> You can already do:
>
> aldebaran:/debug/tracing/events/sched/sched_process_wait> echo "comm == Xorg" > filter
> aldebaran:/debug/tracing/events/sched/sched_process_wait> cat filter
> comm == Xorg
>
> But string values depends on the type of the format field - so you
> cannot do string matches on integer fields.
>
> For kdev_t matches i think we'll need native support for that type -
> in addition to the integer/string types. It will come up in other
> places as well and user-space knows about devices as well.
>
> > Actually, the fact that I'm having to drop some 32 bytes for each
> > jbd2 and ext4 trace log for the bdevname in the ring buffer is
> > really for the birds. What I really want to do is just to drop in
> > the dev_t, and then for the tracing infrastructure to have an
> > efficient (cached) way of taking the dev_t and turning that back
> > into struct block_device at TP_printk time so we can print the
> > bdevname when it's needed. We deifnitely don't want to be calling
> > bdget() in fs/block_dev.c each time we print a line in the tracing
> > buffer! I'm guessing that's something the blktrace tracer would
> > find handy as well.
>
> Yeah.
>
> It could be worked around right now by converting it to an integer
> but i think what we want is native support for kdev_t, together with
> all the usual convenience forms of specifying it: sda1 should work
> the same way as 8:1 or 0801. Even /dev/sda1 should be recognized in
> a filter expression.
>
Yeah, I'll make sure the new filter parser can handle all these forms,
but hopefully in a general way that would allow different forms to be
used as predicate values for any special type such as kdev_t. Maybe a
simple expression matcher that when setting up the filter would map the
matched value expression to some code that would generate the final
filter value to be used in the run-time filter.
Tom
> > Of course having more kernel code play with dev_t's directly isn't
> > considered politically correct in some circles, but tough. :-) We
> > can't exactly drop a pointer to a struct block_device in the trace
> > buffer, since there's no guarantee it will still be valid when we
> > read it out. Dropping in a dev_t is exactly what we want. It
> > would be nice though if there was a way to specify a major/minor
> > number as the filter predicate for the dev_t, and not to have the
> > user generate the MAJOR/MINOR encoding. So some way of parsing
> > "MKDEV(8, 4)" as the input to the filter predicate would probably
> > be a really good thing to do.
>
> Yeah, exactly. We already have smarts in init/* to recognize certain
> device string patterns (for rootdev specification) - that could be
> factored out (it already is to a large degree) and reused. We dont
> need full udev enumeration really - we just need the most common
> variants.
>
> Regardless of whether it's considered politically correct or not ;-)
> It's clearly useful.
>
> Ingo