2009-03-04 02:54:04

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers


RFC only, not for pulling, unless everyone is fine with these :-)

After telling the folks from Google that I had splice code that I needed
to get working, I finally got around to doing it.

Not only did I get the splice code working, but I also made a perl
script (might want to cover you eyes if you look at that code) that
can automagically create the output from the binary files reading
the format arguments in the /debugfs/tracing/events/<subsys>/<event>/format
file.

>From previous patches, we have in include/trace/sched_event_types.h:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM sched

TRACE_EVENT_FORMAT(sched_switch,
TPPROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),
TPARGS(rq, prev, next),
TPFMT("task %s:%d ==> %s:%d",
prev->comm, prev->pid, next->comm, next->pid),
TRACE_STRUCT(
TRACE_FIELD(pid_t, prev_pid, prev->pid)
TRACE_FIELD(int, prev_prio, prev->prio)
TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
next_comm,
TPCMD(memcpy(TRACE_ENTRY->next_comm,
next->comm,
TASK_COMM_LEN)))
TRACE_FIELD(pid_t, next_pid, next->pid)
TRACE_FIELD(int, next_prio, next->prio)
),
TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
);

By making that, we automagically get this file:

# cat /debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 29
format:
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;

field:pid_t prev_pid; offset:12; size:4;
field:int prev_prio; offset:16; size:4;
field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
field:pid_t next_pid; offset:36; size:4;
field:int next_prio; offset:40; size:4;

print fmt: "prev %d:%d ==> next %s:%d:%d"


Now with this patch set, we create a way to read the ftrace ring buffers
directly, as a binary page. Splice has been used such that the user
could splice the ring buffers without need to copy the pages.
The pages are taken from the ring buffers and can be placed directly
into files, without extra copies.

# ls /debug/tracing/binary_buffers/
0 1 2 3 4 5 6 7

One can either just use the read/write to grab live data from these
buffers, or they could use splice. I have a simple file that reads
this buffers using splice. Note, it only runs on one file, you
can make multiple copies to run more. The ring buffers in ftrace
are per cpu and they are not dependent on each other.

Also, if there is no data in the buffer, it returns -EAGAIN.

# find /debug/tracing/events -name 'type' | while read f; do
> echo raw > $f; done

# find /debug/tracing/events -name 'enable' | while read f; do
> echo 1 > $f; done

# ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0

Yes you can run multiple instances of this on different buffers.

# ./rb-read.pl /tmp/buf-0

produces:

0 [000] 7071.936459: (irq_handler_entry) irq 48
0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
0 [000] 7071.988813: (sched_wakeup) task 0 success=1
0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
0 [000] 7072.020370: (sched_wakeup) task 0 success=1
0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
16 [000] 7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0

Notice the "(sched_switch)" lines.

The splice C code:
http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c

The perl script to parse:
http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl

And yes I know, they are both ugly :-p

The following patches are in:

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git

branch: rfc/splice/tip/tracing/ftrace


Steven Rostedt (5):
ring-buffer: reset write field for ring_buffer_read_page
ring-buffer: fix ring_buffer_read_page
ring-buffer: replace sizeof of event header with offsetof
ring-buffer: make ring_buffer_read_page read from start on partial page
tracing: add binary buffer files for use with splice

----
include/linux/ring_buffer.h | 7 +-
kernel/trace/ring_buffer.c | 118 +++++++++++++------
kernel/trace/trace.c | 274 +++++++++++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 1 +
4 files changed, 357 insertions(+), 43 deletions(-)
--


2009-03-04 03:01:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers


On Tue, 3 Mar 2009, Steven Rostedt wrote:

> Yes you can run multiple instances of this on different buffers.
>
> # ./rb-read.pl /tmp/buf-0
>
> produces:
>
> 0 [000] 7071.936459: (irq_handler_entry) irq 48
> 0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
> 0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
> 0 [000] 7071.988813: (sched_wakeup) task 0 success=1
> 0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> 3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> 0 [000] 7072.020370: (sched_wakeup) task 0 success=1
> 0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0

My perl script is still a bit buggy, but you get the idea ;-)

-- Steve

> 16 [000] 7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
>

2009-03-04 03:24:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers


On Tue, 3 Mar 2009, Steven Rostedt wrote:

>
> On Tue, 3 Mar 2009, Steven Rostedt wrote:
>
> > Yes you can run multiple instances of this on different buffers.
> >
> > # ./rb-read.pl /tmp/buf-0
> >
> > produces:
> >
> > 0 [000] 7071.936459: (irq_handler_entry) irq 48
> > 0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
> > 0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
> > 0 [000] 7071.988813: (sched_wakeup) task 0 success=1
> > 0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> > 3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> > 0 [000] 7072.020370: (sched_wakeup) task 0 success=1
> > 0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
>
> My perl script is still a bit buggy, but you get the idea ;-)

Uploaded a fix:

0 [000] 7071.936459: (irq_handler_entry) irq 48
0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
0 [000] 7071.988801: (sched_signal_send) sig: 14 task 3303
0 [000] 7071.988813: (sched_wakeup) task 3303 success=1
0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpdpd:3303:120
3303 [000] 7071.988916: (sched_switch) prev 3303:120 ==> next swapper:0:140
0 [000] 7072.020370: (sched_wakeup) task 16 success=1
0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next events/0:16:115
16 [000] 7071.936157: (sched_switch) prev 16:115 ==> next swapper:0:140

Much better.

I just used the same binary image. The bug was in the perl script.

-- Steve

2009-03-04 10:27:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers


* Steven Rostedt <[email protected]> wrote:

> RFC only, not for pulling, unless everyone is fine with these
> :-)
>
> After telling the folks from Google that I had splice code
> that I needed to get working, I finally got around to doing
> it.
>
> Not only did I get the splice code working, but I also made a
> perl script (might want to cover you eyes if you look at that
> code) that can automagically create the output from the binary
> files reading the format arguments in the
> /debugfs/tracing/events/<subsys>/<event>/format file.

Very cool stuff!

The ftrace splice path allows the following tracing path: after
we create a trace ringbuffer page in the kernel, we dont touch
it _ever again_. We can shuffle it to disk via DMA or over the
network via DMA without bringing it again into the CPU's caches.

This feature allows low-overhead high-throughput tracing on
unprecedented levels. There's no format string overhead, no
instruction patching/trapping overhead - straight raw binary
tracing with C syntax tracepoints and a zero-copy path to
storage.

Have you had any chance to measure tracing overhead and the
maximum throghput we can reach with the ftrace splice code? I'd
expect to see some really nice numbers.

> >From previous patches, we have in include/trace/sched_event_types.h:
>
> #undef TRACE_SYSTEM
> #define TRACE_SYSTEM sched
>
> TRACE_EVENT_FORMAT(sched_switch,
> TPPROTO(struct rq *rq, struct task_struct *prev,
> struct task_struct *next),
> TPARGS(rq, prev, next),
> TPFMT("task %s:%d ==> %s:%d",
> prev->comm, prev->pid, next->comm, next->pid),
> TRACE_STRUCT(
> TRACE_FIELD(pid_t, prev_pid, prev->pid)
> TRACE_FIELD(int, prev_prio, prev->prio)
> TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> next_comm,
> TPCMD(memcpy(TRACE_ENTRY->next_comm,
> next->comm,
> TASK_COMM_LEN)))
> TRACE_FIELD(pid_t, next_pid, next->pid)
> TRACE_FIELD(int, next_prio, next->prio)
> ),
> TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> );
>
> By making that, we automagically get this file:
>
> # cat /debug/tracing/events/sched/sched_switch/format
> name: sched_switch
> ID: 29
> format:
> field:unsigned char type; offset:0; size:1;
> field:unsigned char flags; offset:1; size:1;
> field:unsigned char preempt_count; offset:2; size:1;
> field:int pid; offset:4; size:4;
> field:int tgid; offset:8; size:4;
>
> field:pid_t prev_pid; offset:12; size:4;
> field:int prev_prio; offset:16; size:4;
> field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
> field:pid_t next_pid; offset:36; size:4;
> field:int next_prio; offset:40; size:4;
>
> print fmt: "prev %d:%d ==> next %s:%d:%d"
>
>
> Now with this patch set, we create a way to read the ftrace
> ring buffers directly, as a binary page. Splice has been used
> such that the user could splice the ring buffers without need
> to copy the pages. The pages are taken from the ring buffers
> and can be placed directly into files, without extra copies.
>
> # ls /debug/tracing/binary_buffers/
> 0 1 2 3 4 5 6 7
>
> One can either just use the read/write to grab live data from
> these buffers, or they could use splice. I have a simple file
> that reads this buffers using splice. Note, it only runs on
> one file, you can make multiple copies to run more. The ring
> buffers in ftrace are per cpu and they are not dependent on
> each other.
>
> Also, if there is no data in the buffer, it returns -EAGAIN.
>
> # find /debug/tracing/events -name 'type' | while read f; do
> > echo raw > $f; done
>
> # find /debug/tracing/events -name 'enable' | while read f; do
> > echo 1 > $f; done
>
> # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0
>
> Yes you can run multiple instances of this on different buffers.
>
> # ./rb-read.pl /tmp/buf-0
>
> produces:
>
> 0 [000] 7071.936459: (irq_handler_entry) irq 48
> 0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
> 0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
> 0 [000] 7071.988813: (sched_wakeup) task 0 success=1
> 0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> 3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> 0 [000] 7072.020370: (sched_wakeup) task 0 success=1
> 0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
> 16 [000] 7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
>
> Notice the "(sched_switch)" lines.
>
> The splice C code:
> http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
>
> The perl script to parse:
> http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
>
> And yes I know, they are both ugly :-p

Any chance of merging them into a single perl script? I.e. the
perl script would embedd the .c code and build it on every
invocation (into /tmp), so that one can do single-tool tracing
with no immediate binary components. Such self-sufficiency is
very handy when probing systems in a minimally invasive way.

And we could also have twice the ugliness for the same price.

> The following patches are in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
>
> branch: rfc/splice/tip/tracing/ftrace

Looks like the only problem with the code is the -EFAULT
handling bug it exposed in simple_read_from_buffer() - but that
fix should go on a separate track IMO.

So i've pulled it into tip:tracing - thanks Steve! We definitely
want this in v2.6.30.

Ingo

2009-03-04 14:51:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers

On Wed, 4 Mar 2009, Ingo Molnar wrote:
> >
> > The splice C code:
> > http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
> >
> > The perl script to parse:
> > http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
> >
> > And yes I know, they are both ugly :-p
>
> Any chance of merging them into a single perl script? I.e. the
> perl script would embedd the .c code and build it on every
> invocation (into /tmp), so that one can do single-tool tracing
> with no immediate binary components. Such self-sufficiency is
> very handy when probing systems in a minimally invasive way.
>
> And we could also have twice the ugliness for the same price.
>

Gag, that would really be ugly code ;-)

If anything I might actually want to break up the perl script. This was
just made to be an example. Any serious user would most likely write their
own user apps. This was more of a proof of concept.

I could clean up that code, and then have one big file that does all the
ugliness for those that just want to tinker with it. But that is low on
the totem pole ;-)

-- Steve

2009-03-04 15:39:52

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers

* Ingo Molnar ([email protected]) wrote:
>
> * Steven Rostedt <[email protected]> wrote:
>
> > RFC only, not for pulling, unless everyone is fine with these
> > :-)
> >
> > After telling the folks from Google that I had splice code
> > that I needed to get working, I finally got around to doing
> > it.
> >
> > Not only did I get the splice code working, but I also made a
> > perl script (might want to cover you eyes if you look at that
> > code) that can automagically create the output from the binary
> > files reading the format arguments in the
> > /debugfs/tracing/events/<subsys>/<event>/format file.
>
> Very cool stuff!
>
> The ftrace splice path allows the following tracing path: after
> we create a trace ringbuffer page in the kernel, we dont touch
> it _ever again_. We can shuffle it to disk via DMA or over the
> network via DMA without bringing it again into the CPU's caches.
>
> This feature allows low-overhead high-throughput tracing on
> unprecedented levels. There's no format string overhead, no
> instruction patching/trapping overhead - straight raw binary
> tracing with C syntax tracepoints and a zero-copy path to
> storage.
>
> Have you had any chance to measure tracing overhead and the
> maximum throghput we can reach with the ftrace splice code? I'd
> expect to see some really nice numbers.
>

"unprecented levels" -> LTTng has been using splice for about 5 months.
The only reason why I did not post the patchset which performs this is
because I am currently adapting a LTTng module to optionally format the
binary buffers to a text output. And yes, it provides very, very good
performance results.

Mathieu


> > >From previous patches, we have in include/trace/sched_event_types.h:
> >
> > #undef TRACE_SYSTEM
> > #define TRACE_SYSTEM sched
> >
> > TRACE_EVENT_FORMAT(sched_switch,
> > TPPROTO(struct rq *rq, struct task_struct *prev,
> > struct task_struct *next),
> > TPARGS(rq, prev, next),
> > TPFMT("task %s:%d ==> %s:%d",
> > prev->comm, prev->pid, next->comm, next->pid),
> > TRACE_STRUCT(
> > TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > TRACE_FIELD(int, prev_prio, prev->prio)
> > TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > next_comm,
> > TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > next->comm,
> > TASK_COMM_LEN)))
> > TRACE_FIELD(pid_t, next_pid, next->pid)
> > TRACE_FIELD(int, next_prio, next->prio)
> > ),
> > TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > );
> >
> > By making that, we automagically get this file:
> >
> > # cat /debug/tracing/events/sched/sched_switch/format
> > name: sched_switch
> > ID: 29
> > format:
> > field:unsigned char type; offset:0; size:1;
> > field:unsigned char flags; offset:1; size:1;
> > field:unsigned char preempt_count; offset:2; size:1;
> > field:int pid; offset:4; size:4;
> > field:int tgid; offset:8; size:4;
> >
> > field:pid_t prev_pid; offset:12; size:4;
> > field:int prev_prio; offset:16; size:4;
> > field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
> > field:pid_t next_pid; offset:36; size:4;
> > field:int next_prio; offset:40; size:4;
> >
> > print fmt: "prev %d:%d ==> next %s:%d:%d"
> >
> >
> > Now with this patch set, we create a way to read the ftrace
> > ring buffers directly, as a binary page. Splice has been used
> > such that the user could splice the ring buffers without need
> > to copy the pages. The pages are taken from the ring buffers
> > and can be placed directly into files, without extra copies.
> >
> > # ls /debug/tracing/binary_buffers/
> > 0 1 2 3 4 5 6 7
> >
> > One can either just use the read/write to grab live data from
> > these buffers, or they could use splice. I have a simple file
> > that reads this buffers using splice. Note, it only runs on
> > one file, you can make multiple copies to run more. The ring
> > buffers in ftrace are per cpu and they are not dependent on
> > each other.
> >
> > Also, if there is no data in the buffer, it returns -EAGAIN.
> >
> > # find /debug/tracing/events -name 'type' | while read f; do
> > > echo raw > $f; done
> >
> > # find /debug/tracing/events -name 'enable' | while read f; do
> > > echo 1 > $f; done
> >
> > # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0
> >
> > Yes you can run multiple instances of this on different buffers.
> >
> > # ./rb-read.pl /tmp/buf-0
> >
> > produces:
> >
> > 0 [000] 7071.936459: (irq_handler_entry) irq 48
> > 0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
> > 0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
> > 0 [000] 7071.988813: (sched_wakeup) task 0 success=1
> > 0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> > 3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> > 0 [000] 7072.020370: (sched_wakeup) task 0 success=1
> > 0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
> > 16 [000] 7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
> >
> > Notice the "(sched_switch)" lines.
> >
> > The splice C code:
> > http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
> >
> > The perl script to parse:
> > http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
> >
> > And yes I know, they are both ugly :-p
>
> Any chance of merging them into a single perl script? I.e. the
> perl script would embedd the .c code and build it on every
> invocation (into /tmp), so that one can do single-tool tracing
> with no immediate binary components. Such self-sufficiency is
> very handy when probing systems in a minimally invasive way.
>
> And we could also have twice the ugliness for the same price.
>
> > The following patches are in:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> >
> > branch: rfc/splice/tip/tracing/ftrace
>
> Looks like the only problem with the code is the -EFAULT
> handling bug it exposed in simple_read_from_buffer() - but that
> fix should go on a separate track IMO.
>
> So i've pulled it into tip:tracing - thanks Steve! We definitely
> want this in v2.6.30.
>
> Ingo
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-04 17:05:37

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers

* Steven Rostedt ([email protected]) wrote:
>
> RFC only, not for pulling, unless everyone is fine with these :-)
>
> After telling the folks from Google that I had splice code that I needed
> to get working, I finally got around to doing it.
>
> Not only did I get the splice code working, but I also made a perl
> script (might want to cover you eyes if you look at that code) that
> can automagically create the output from the binary files reading
> the format arguments in the /debugfs/tracing/events/<subsys>/<event>/format
> file.
>
> From previous patches, we have in include/trace/sched_event_types.h:
>
> #undef TRACE_SYSTEM
> #define TRACE_SYSTEM sched
>
> TRACE_EVENT_FORMAT(sched_switch,
> TPPROTO(struct rq *rq, struct task_struct *prev,
> struct task_struct *next),
> TPARGS(rq, prev, next),
> TPFMT("task %s:%d ==> %s:%d",
> prev->comm, prev->pid, next->comm, next->pid),
> TRACE_STRUCT(
> TRACE_FIELD(pid_t, prev_pid, prev->pid)
> TRACE_FIELD(int, prev_prio, prev->prio)
> TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> next_comm,
> TPCMD(memcpy(TRACE_ENTRY->next_comm,
> next->comm,
> TASK_COMM_LEN)))
> TRACE_FIELD(pid_t, next_pid, next->pid)
> TRACE_FIELD(int, next_prio, next->prio)
> ),
> TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> );
>

I fear that putting these user-visible format strings in tracepoint
header files will create a big maintainability issue.

I'll post the LTTng patchset in a jiffy, where the format string
awareness is done in a tracer-specific module. I don't understand why
Peter Z. is not yelling against your tracepoints modifications : they
are actually presenting to userspace an interface that is meant to
eventually change.

I used a separate layer for format string presentation for this very
purpose : I don't want to tie the kernel code instrumentation
(tracepoints) to any kind of user-visible API.

Mathieu

> By making that, we automagically get this file:
>
> # cat /debug/tracing/events/sched/sched_switch/format
> name: sched_switch
> ID: 29
> format:
> field:unsigned char type; offset:0; size:1;
> field:unsigned char flags; offset:1; size:1;
> field:unsigned char preempt_count; offset:2; size:1;
> field:int pid; offset:4; size:4;
> field:int tgid; offset:8; size:4;
>
> field:pid_t prev_pid; offset:12; size:4;
> field:int prev_prio; offset:16; size:4;
> field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
> field:pid_t next_pid; offset:36; size:4;
> field:int next_prio; offset:40; size:4;
>
> print fmt: "prev %d:%d ==> next %s:%d:%d"
>
>
> Now with this patch set, we create a way to read the ftrace ring buffers
> directly, as a binary page. Splice has been used such that the user
> could splice the ring buffers without need to copy the pages.
> The pages are taken from the ring buffers and can be placed directly
> into files, without extra copies.
>
> # ls /debug/tracing/binary_buffers/
> 0 1 2 3 4 5 6 7
>
> One can either just use the read/write to grab live data from these
> buffers, or they could use splice. I have a simple file that reads
> this buffers using splice. Note, it only runs on one file, you
> can make multiple copies to run more. The ring buffers in ftrace
> are per cpu and they are not dependent on each other.
>
> Also, if there is no data in the buffer, it returns -EAGAIN.
>
> # find /debug/tracing/events -name 'type' | while read f; do
> > echo raw > $f; done
>
> # find /debug/tracing/events -name 'enable' | while read f; do
> > echo 1 > $f; done
>
> # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0
>
> Yes you can run multiple instances of this on different buffers.
>
> # ./rb-read.pl /tmp/buf-0
>
> produces:
>
> 0 [000] 7071.936459: (irq_handler_entry) irq 48
> 0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
> 0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
> 0 [000] 7071.988813: (sched_wakeup) task 0 success=1
> 0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> 3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> 0 [000] 7072.020370: (sched_wakeup) task 0 success=1
> 0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
> 16 [000] 7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
>
> Notice the "(sched_switch)" lines.
>
> The splice C code:
> http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
>
> The perl script to parse:
> http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
>
> And yes I know, they are both ugly :-p
>
> The following patches are in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
>
> branch: rfc/splice/tip/tracing/ftrace
>
>
> Steven Rostedt (5):
> ring-buffer: reset write field for ring_buffer_read_page
> ring-buffer: fix ring_buffer_read_page
> ring-buffer: replace sizeof of event header with offsetof
> ring-buffer: make ring_buffer_read_page read from start on partial page
> tracing: add binary buffer files for use with splice
>
> ----
> include/linux/ring_buffer.h | 7 +-
> kernel/trace/ring_buffer.c | 118 +++++++++++++------
> kernel/trace/trace.c | 274 +++++++++++++++++++++++++++++++++++++++++-
> kernel/trace/trace.h | 1 +
> 4 files changed, 357 insertions(+), 43 deletions(-)
> --
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-04 17:20:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers

On Wed, 2009-03-04 at 12:00 -0500, Mathieu Desnoyers wrote:

> > TRACE_EVENT_FORMAT(sched_switch,
> > TPPROTO(struct rq *rq, struct task_struct *prev,
> > struct task_struct *next),
> > TPARGS(rq, prev, next),
> > TPFMT("task %s:%d ==> %s:%d",
> > prev->comm, prev->pid, next->comm, next->pid),
> > TRACE_STRUCT(
> > TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > TRACE_FIELD(int, prev_prio, prev->prio)
> > TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > next_comm,
> > TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > next->comm,
> > TASK_COMM_LEN)))
> > TRACE_FIELD(pid_t, next_pid, next->pid)
> > TRACE_FIELD(int, next_prio, next->prio)
> > ),
> > TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > );
> >
>
> I fear that putting these user-visible format strings in tracepoint
> header files will create a big maintainability issue.
>
> I'll post the LTTng patchset in a jiffy, where the format string
> awareness is done in a tracer-specific module. I don't understand why
> Peter Z. is not yelling against your tracepoints modifications : they
> are actually presenting to userspace an interface that is meant to
> eventually change.
>
> I used a separate layer for format string presentation for this very
> purpose : I don't want to tie the kernel code instrumentation
> (tracepoints) to any kind of user-visible API.

Not only that, its butt-ugly to boot.

I'm just not sure what to say anymore.. people seem insistent on doing
this, look at the recent google trace and this zedtrace thing.

What I can live with is if we tie it 1:1 to our code structure by means
of code generation, in that case we can simply argue its a direct
representation of our code, and we don't have any API rules on that.

Arnaldo proposed using sparse and tree rewriting to accomplish this.

The above however creates these artificial things that can arguably be
kept static, therefore one could defend that expectation.

What's worse, it doesn't even come with version information :-)

2009-03-04 22:47:21

by Ingo Oeser

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers

Hi Steven,

On Wednesday 04 March 2009, Steven Rostedt wrote:
> On Wed, 4 Mar 2009, Ingo Molnar wrote:
> > Any chance of merging them into a single perl script? I.e. the
> > perl script would embedd the .c code and build it on every
> > invocation (into /tmp), so that one can do single-tool tracing
> > with no immediate binary components. Such self-sufficiency is
> > very handy when probing systems in a minimally invasive way.
> >
> > And we could also have twice the ugliness for the same price.
> >
>
> Gag, that would really be ugly code ;-)

If you want really ugly code, you could do the SYS_splice in perl
via syscall. Just look it up via "perldoc -f syscall" and play around :-)

Best Regards

Ingo Oeser

2009-03-06 16:59:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers


On Wed, 4 Mar 2009, Mathieu Desnoyers wrote:
> >
> > From previous patches, we have in include/trace/sched_event_types.h:
> >
> > #undef TRACE_SYSTEM
> > #define TRACE_SYSTEM sched
> >
> > TRACE_EVENT_FORMAT(sched_switch,
> > TPPROTO(struct rq *rq, struct task_struct *prev,
> > struct task_struct *next),
> > TPARGS(rq, prev, next),
> > TPFMT("task %s:%d ==> %s:%d",
> > prev->comm, prev->pid, next->comm, next->pid),
> > TRACE_STRUCT(
> > TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > TRACE_FIELD(int, prev_prio, prev->prio)
> > TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > next_comm,
> > TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > next->comm,
> > TASK_COMM_LEN)))
> > TRACE_FIELD(pid_t, next_pid, next->pid)
> > TRACE_FIELD(int, next_prio, next->prio)
> > ),
> > TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > );
> >
>
> I fear that putting these user-visible format strings in tracepoint
> header files will create a big maintainability issue.
>
> I'll post the LTTng patchset in a jiffy, where the format string
> awareness is done in a tracer-specific module. I don't understand why
> Peter Z. is not yelling against your tracepoints modifications : they
> are actually presenting to userspace an interface that is meant to
> eventually change.
>
> I used a separate layer for format string presentation for this very
> purpose : I don't want to tie the kernel code instrumentation
> (tracepoints) to any kind of user-visible API.
>

Hi Mathieu,

Sorry for the late reply, I started to reply but was distracted, and sadly
forgot about it :-( (and I rebooted the box that had the reply window
open)


Just a couple of points to make.

1) The raw format string is a "hint" for userspace to read the buffers.
The offset and sizeof is the real data that userspace should use. The
print format can be just a way to help them out. But it is by no means
something that is expected to be constant.

It is also used internal by ftrace because it needs a way to read the
"fast C style recording" records. It is better to print some text than to
have a bunch of hex print out in the screen. Say you have the C style
recording enabled and you take an oops. With ftrace_dumps_on_oops set, the
buffers will be printed to the console. This lets ftrace have a means to
print out the events.

2) The difference between this and markers is that the format string is
bound to the declaration in the trace header. Not in the code itself. The
maintainer should only see the trace point function call. It should only
look like a function call and not some funky printf string that the
maintainer might not care about.

If the code changes and it breaks the format string, it is up to the
tracing maintainers to fix it. If the format string was embedded within
the code, then the maintainer would be burdoned with that duty. But here
the format string is in the include/trace directory and any breakage there
should definitely be fixed by a some tracing maintainer. Not the
maintainer of the subsystem. Unless of course that maintainer wants to fix
it ;-)


3) TRACE_EVENT_FORMAT is an extension to TRACE_FORMAT. Developers may
choose to use the more simple, but a bit more expensive TRACE_FORMAT if
they choose to. Perhaps it may be more prudent to anyway. Recently
Peter Zijlstra was tracing locks and just wanted to trace the name.
There's really no difference in copying the name via a C style (strcpy) or
having a printf formatting do the work. Here is a case where just using
TRACE_FORMAT is a reasonable solution.

4) I actually avoided moving the format string into the tracing utility.
The tracing utility can ignore it if it wants. I orginially had the
headers included in the kernel/trace/events.c file and found that was too
cumbersome. I did not like the fact that I needed to go into some tracing
infrastructure to add an event. It was a burdon to me to add a new event
and I wrote the code! That's why I moved it all into the include/trace
directory. One stop shopping.

A developer to add new events only needs to add the trace points in their
code and then update the files (perhaps add new ones) in include/trace. No
need to dig through the kernel source to find out where else they need to
go.


I see that you published your patch queue. I've looked at most of the
patches already. I'll make my comments shortly.

Thanks,

-- Steve

2009-03-06 19:11:19

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers

* Steven Rostedt ([email protected]) wrote:
>
> On Wed, 4 Mar 2009, Mathieu Desnoyers wrote:
> > >
> > > From previous patches, we have in include/trace/sched_event_types.h:
> > >
> > > #undef TRACE_SYSTEM
> > > #define TRACE_SYSTEM sched
> > >
> > > TRACE_EVENT_FORMAT(sched_switch,
> > > TPPROTO(struct rq *rq, struct task_struct *prev,
> > > struct task_struct *next),
> > > TPARGS(rq, prev, next),
> > > TPFMT("task %s:%d ==> %s:%d",
> > > prev->comm, prev->pid, next->comm, next->pid),
> > > TRACE_STRUCT(
> > > TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > > TRACE_FIELD(int, prev_prio, prev->prio)
> > > TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > > next_comm,
> > > TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > > next->comm,
> > > TASK_COMM_LEN)))
> > > TRACE_FIELD(pid_t, next_pid, next->pid)
> > > TRACE_FIELD(int, next_prio, next->prio)
> > > ),
> > > TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > > );
> > >
> >
> > I fear that putting these user-visible format strings in tracepoint
> > header files will create a big maintainability issue.
> >
> > I'll post the LTTng patchset in a jiffy, where the format string
> > awareness is done in a tracer-specific module. I don't understand why
> > Peter Z. is not yelling against your tracepoints modifications : they
> > are actually presenting to userspace an interface that is meant to
> > eventually change.
> >
> > I used a separate layer for format string presentation for this very
> > purpose : I don't want to tie the kernel code instrumentation
> > (tracepoints) to any kind of user-visible API.
> >
>
> Hi Mathieu,
>
> Sorry for the late reply, I started to reply but was distracted, and sadly
> forgot about it :-( (and I rebooted the box that had the reply window
> open)
>
>
> Just a couple of points to make.
>
> 1) The raw format string is a "hint" for userspace to read the buffers.
> The offset and sizeof is the real data that userspace should use. The
> print format can be just a way to help them out. But it is by no means
> something that is expected to be constant.
>
> It is also used internal by ftrace because it needs a way to read the
> "fast C style recording" records. It is better to print some text than to
> have a bunch of hex print out in the screen. Say you have the C style
> recording enabled and you take an oops. With ftrace_dumps_on_oops set, the
> buffers will be printed to the console. This lets ftrace have a means to
> print out the events.
>
> 2) The difference between this and markers is that the format string is
> bound to the declaration in the trace header. Not in the code itself. The
> maintainer should only see the trace point function call. It should only
> look like a function call and not some funky printf string that the
> maintainer might not care about.
>
> If the code changes and it breaks the format string, it is up to the
> tracing maintainers to fix it. If the format string was embedded within
> the code, then the maintainer would be burdoned with that duty. But here
> the format string is in the include/trace directory and any breakage there
> should definitely be fixed by a some tracing maintainer. Not the
> maintainer of the subsystem. Unless of course that maintainer wants to fix
> it ;-)
>
>
> 3) TRACE_EVENT_FORMAT is an extension to TRACE_FORMAT. Developers may
> choose to use the more simple, but a bit more expensive TRACE_FORMAT if
> they choose to. Perhaps it may be more prudent to anyway. Recently
> Peter Zijlstra was tracing locks and just wanted to trace the name.
> There's really no difference in copying the name via a C style (strcpy) or
> having a printf formatting do the work. Here is a case where just using
> TRACE_FORMAT is a reasonable solution.
>
> 4) I actually avoided moving the format string into the tracing utility.
> The tracing utility can ignore it if it wants. I orginially had the
> headers included in the kernel/trace/events.c file and found that was too
> cumbersome. I did not like the fact that I needed to go into some tracing
> infrastructure to add an event. It was a burdon to me to add a new event
> and I wrote the code! That's why I moved it all into the include/trace
> directory. One stop shopping.
>
> A developer to add new events only needs to add the trace points in their
> code and then update the files (perhaps add new ones) in include/trace. No
> need to dig through the kernel source to find out where else they need to
> go.
>
>
> I see that you published your patch queue. I've looked at most of the
> patches already. I'll make my comments shortly.
>

Hi Steve,

I totally agree that we want something like a format string to identify
the data we export. My disagreement is on the location where such
identification belongs. I have the funny feeling that you are currently
creating no less than full-blown callbacks expressed as a weird, custom,
macros-based, header definition, and I think it's seriously wrong.

To paraphrase Andrew Morton : we are writing in C. Let's keep it that
way.

This is why I strongly prefer to use the tracepoint callback -> marker
format string approach rather that putting any sort of format string in
the tracepoint header.

If what you really worry about is to have the ability to "quickly" add
debug-style instrumentation when you write code, then you do not want,
nor should ever, use a tracepoint, because tracepoint location is not
meant to be decided without careful consideration. Markers are there
fore this : you can add a quick-and-dirty one-liner :

trace_mark(group, event, "field_identifier1 %u f_ident2 %u", ...);

In your source code and you are all set. I don't see the problem with
such approach ?

Mathieu

> Thanks,
>
> -- Steve
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-06 23:29:00

by Jiaying Zhang

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers

Sorry for getting into this discussion late.

I would like to point out that we think it is really important to have
some very efficient probing mechanism in the kernel for tracing in
production environments. The printf and va_arg based probes are
flexible but less efficient when we want to trace high-throughput events.
Even function calls can add noticeable overhead according to our
measurements. So I think we need to provide a way (mostly via
macro definitions) with which a subsystem can enter an event into
a trace buffer through a short code path. I.e., we should limit the
number of callbacks and avoid format string parsing.

As I understand, Steven's latest TRACE_FIELD patch avoids such
overhead, although it does seem to add complexity for adding new
trace points. It would be nice if we can replace the above sched_switch
declaration with just a couple of macros.

Jiaying


On Fri, Mar 6, 2009 at 11:10 AM, Mathieu Desnoyers
<[email protected]> wrote:
> * Steven Rostedt ([email protected]) wrote:
>>
>> On Wed, 4 Mar 2009, Mathieu Desnoyers wrote:
>> > >
>> > > From previous patches, we have in include/trace/sched_event_types.h:
>> > >
>> > > #undef TRACE_SYSTEM
>> > > #define TRACE_SYSTEM sched
>> > >
>> > > TRACE_EVENT_FORMAT(sched_switch,
>> > > TPPROTO(struct rq *rq, struct task_struct *prev,
>> > > struct task_struct *next),
>> > > TPARGS(rq, prev, next),
>> > > TPFMT("task %s:%d ==> %s:%d",
>> > > prev->comm, prev->pid, next->comm, next->pid),
>> > > TRACE_STRUCT(
>> > > TRACE_FIELD(pid_t, prev_pid, prev->pid)
>> > > TRACE_FIELD(int, prev_prio, prev->prio)
>> > > TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
>> > > next_comm,
>> > > TPCMD(memcpy(TRACE_ENTRY->next_comm,
>> > > next->comm,
>> > > TASK_COMM_LEN)))
>> > > TRACE_FIELD(pid_t, next_pid, next->pid)
>> > > TRACE_FIELD(int, next_prio, next->prio)
>> > > ),
>> > > TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
>> > > );
>> > >
>> >
>> > I fear that putting these user-visible format strings in tracepoint
>> > header files will create a big maintainability issue.
>> >
>> > I'll post the LTTng patchset in a jiffy, where the format string
>> > awareness is done in a tracer-specific module. I don't understand why
>> > Peter Z. is not yelling against your tracepoints modifications : they
>> > are actually presenting to userspace an interface that is meant to
>> > eventually change.
>> >
>> > I used a separate layer for format string presentation for this very
>> > purpose : I don't want to tie the kernel code instrumentation
>> > (tracepoints) to any kind of user-visible API.
>> >
>>
>> Hi Mathieu,
>>
>> Sorry for the late reply, I started to reply but was distracted, and sadly
>> forgot about it :-( (and I rebooted the box that had the reply window
>> open)
>>
>>
>> Just a couple of points to make.
>>
>> 1) The raw format string is a "hint" for userspace to read the buffers.
>> The offset and sizeof is the real data that userspace should use. The
>> print format can be just a way to help them out. But it is by no means
>> something that is expected to be constant.
>>
>> It is also used internal by ftrace because it needs a way to read the
>> "fast C style recording" records. It is better to print some text than to
>> have a bunch of hex print out in the screen. Say you have the C style
>> recording enabled and you take an oops. With ftrace_dumps_on_oops set, the
>> buffers will be printed to the console. This lets ftrace have a means to
>> print out the events.
>>
>> 2) The difference between this and markers is that the format string is
>> bound to the declaration in the trace header. Not in the code itself. The
>> maintainer should only see the trace point function call. It should only
>> look like a function call and not some funky printf string that the
>> maintainer might not care about.
>>
>> If the code changes and it breaks the format string, it is up to the
>> tracing maintainers to fix it. If the format string was embedded within
>> the code, then the maintainer would be burdoned with that duty. But here
>> the format string is in the include/trace directory and any breakage there
>> should definitely be fixed by a some tracing maintainer. Not the
>> maintainer of the subsystem. Unless of course that maintainer wants to fix
>> it ;-)
>>
>>
>> 3) TRACE_EVENT_FORMAT is an extension to TRACE_FORMAT. Developers may
>> choose to use the more simple, but a bit more expensive TRACE_FORMAT if
>> they choose to. Perhaps it may be more prudent to anyway. Recently
>> Peter Zijlstra was tracing locks and just wanted to trace the name.
>> There's really no difference in copying the name via a C style (strcpy) or
>> having a printf formatting do the work. Here is a case where just using
>> TRACE_FORMAT is a reasonable solution.
>>
>> 4) I actually avoided moving the format string into the tracing utility.
>> The tracing utility can ignore it if it wants. I orginially had the
>> headers included in the kernel/trace/events.c file and found that was too
>> cumbersome. I did not like the fact that I needed to go into some tracing
>> infrastructure to add an event. It was a burdon to me to add a new event
>> and I wrote the code! That's why I moved it all into the include/trace
>> directory. One stop shopping.
>>
>> A developer to add new events only needs to add the trace points in their
>> code and then update the files (perhaps add new ones) in include/trace. No
>> need to dig through the kernel source to find out where else they need to
>> go.
>>
>>
>> I see that you published your patch queue. I've looked at most of the
>> patches already. I'll make my comments shortly.
>>
>
> Hi Steve,
>
> I totally agree that we want something like a format string to identify
> the data we export. My disagreement is on the location where such
> identification belongs. I have the funny feeling that you are currently
> creating no less than full-blown callbacks expressed as a weird, custom,
> macros-based, header definition, and I think it's seriously wrong.
>
> To paraphrase Andrew Morton : we are writing in C. Let's keep it that
> way.
>
> This is why I strongly prefer to use the tracepoint callback -> marker
> format string approach rather that putting any sort of format string in
> the tracepoint header.
>
> If what you really worry about is to have the ability to "quickly" add
> debug-style instrumentation when you write code, then you do not want,
> nor should ever, use a tracepoint, because tracepoint location is not
> meant to be decided without careful consideration. Markers are there
> fore this : you can add a quick-and-dirty one-liner :
>
> trace_mark(group, event, "field_identifier1 %u f_ident2 %u", ...);
>
> In your source code and you are all set. I don't see the problem with
> such approach ?
>
> Mathieu
>
>> Thanks,
>>
>> -- Steve
>>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>

2009-03-08 19:22:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers


* Jiaying Zhang <[email protected]> wrote:

> I would like to point out that we think it is really important
> to have some very efficient probing mechanism in the kernel
> for tracing in production environments. The printf and va_arg
> based probes are flexible but less efficient when we want to
> trace high-throughput events. Even function calls can add
> noticeable overhead according to our measurements. So I think
> we need to provide a way (mostly via macro definitions) with
> which a subsystem can enter an event into a trace buffer
> through a short code path. I.e., we should limit the number of
> callbacks and avoid format string parsing.
>
> As I understand, Steven's latest TRACE_FIELD patch avoids such
> overhead, although it does seem to add complexity for adding
> new trace points. [...]

Yeah - it was motivated by the patches you sent to lkml which
showed that it's possible to do it quite sanely and that it can
be done faster.

> [...] It would be nice if we can replace the above
> sched_switch declaration with just a couple of macros.

Good point - there's ongoing work to simplify the TRACE_FIELD
approach. The current (not yet pushed out) optimized tracepoint
format Steve is working on is:

/*
* Tracepoint for task switches, performed by the scheduler:
*
* (NOTE: the 'rq' argument is not used by generic trace events,
* but used by the latency tracer plugin. )
*/
TRACE_EVENT(sched_switch,

TP_PROTO(struct rq *rq, struct task_struct *prev,
struct task_struct *next),

TP_ARGS(rq, prev, next),

TP_STRUCT__entry(
__array( char, prev_comm, TASK_COMM_LEN )
__field( pid_t, prev_pid )
__field( int, prev_prio )
__array( char, next_comm, TASK_COMM_LEN )
__field( pid_t, next_pid )
__field( int, next_prio )
),

TP_printk("task %s:%d [%d] ==> %s:%d [%d]",
__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
__entry->next_comm, __entry->next_pid, __entry->next_prio),

TP_fast_assign(
memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
__entry->prev_pid = prev->pid;
__entry->prev_prio = prev->prio;
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid = next->pid;
__entry->next_prio = next->prio;
)
);

As you can see it enumerates fields, provides format-based
tracing and a tracepoint as well. It also looks quite similar to
C syntax while still being an information-dense macro.

Ingo