2009-06-10 05:43:07

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH 00/11] [GIT PULL] more updates for the tag format


Ingo,

Please pull the latest tip/tracing/event-print-format tree, which can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
tip/tracing/event-print-format


Li Zefan (1):
tracing/events: convert block trace points to TRACE_EVENT(), fix

Steven Rostedt (10):
tracing: add nsec2sec print formats
tracing: convert lockdep lock_acquired trace point to use nsec2usec tag
tracing: add major and minor tags for print format
tracing: use << to print < instead of \<
tracing: convert the block trace points to use the new tag format
tracing: add test for strings in event tag format
tracing: add func and symfunc to tag format
tracing: check full name for field
tracing: update sample code with new tag format
tracing: move '>' to out of macros and into print statement

----
include/linux/blktrace_api.h | 4 +-
include/linux/ftrace_event.h | 3 +-
include/trace/events/block.h | 101 +++------
include/trace/events/irq.h | 8 +-
include/trace/events/kmem.h | 12 +-
include/trace/events/lockdep.h | 8 +-
include/trace/ftrace.h | 2 +-
kernel/trace/trace_output.c | 2 +-
kernel/trace/trace_output.h | 4 +
kernel/trace/trace_read_binary.c | 304 +++++++++++++++++++++++-----
samples/trace_events/trace-events-sample.c | 21 ++-
samples/trace_events/trace-events-sample.h | 66 ++++++
12 files changed, 399 insertions(+), 136 deletions(-)
--


2009-06-10 09:27:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format


* Steven Rostedt <[email protected]> wrote:

>
> Ingo,
>
> Please pull the latest tip/tracing/event-print-format tree, which can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> tip/tracing/event-print-format
>
>
> Li Zefan (1):
> tracing/events: convert block trace points to TRACE_EVENT(), fix
>
> Steven Rostedt (10):
> tracing: add nsec2sec print formats
> tracing: convert lockdep lock_acquired trace point to use nsec2usec tag
> tracing: add major and minor tags for print format
> tracing: use << to print < instead of \<
> tracing: convert the block trace points to use the new tag format
> tracing: add test for strings in event tag format
> tracing: add func and symfunc to tag format
> tracing: check full name for field
> tracing: update sample code with new tag format
> tracing: move '>' to out of macros and into print statement
>
> ----
> include/linux/blktrace_api.h | 4 +-
> include/linux/ftrace_event.h | 3 +-
> include/trace/events/block.h | 101 +++------
> include/trace/events/irq.h | 8 +-
> include/trace/events/kmem.h | 12 +-
> include/trace/events/lockdep.h | 8 +-
> include/trace/ftrace.h | 2 +-
> kernel/trace/trace_output.c | 2 +-
> kernel/trace/trace_output.h | 4 +
> kernel/trace/trace_read_binary.c | 304 +++++++++++++++++++++++-----
> samples/trace_events/trace-events-sample.c | 21 ++-
> samples/trace_events/trace-events-sample.h | 66 ++++++
> 12 files changed, 399 insertions(+), 136 deletions(-)

Hm, that's way too much back and forth really - trivial typo fixes,
build failure, etc. This is really a 'oh, the merge window is
coming' last minute scrambling and we dont want to mess up the
squeaky-clean tracing tree history be messed up with this.

Frederic also expressed worries about the tag format. Could we have
a wider buy-in for this format?

I've separated these bits into tip:tracing/ftrace, and kept
tip:tracing/core on a pre-print-formats state (going back 8
commits), so that upstream merging of the other bits does not get
held up.

Could we try a cleaner, bisectable, consiously built up version of
these final tracing/core..tracing/ftrace please? I think we can -
the rest of the tree is clean. Please do a exact-same-content rebase
so that the merge back gets obvious and that the testing we've
injected does not get invalidated?

Ingo

2009-06-10 11:11:50

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format

2009/6/10 Ingo Molnar <[email protected]>:
>
> * Steven Rostedt <[email protected]> wrote:
>
>>
>> Ingo,
>>
>> Please pull the latest tip/tracing/event-print-format tree, which can be found at:
>>
>> ? git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
>> tip/tracing/event-print-format
>>
>>
>> Li Zefan (1):
>> ? ? ? tracing/events: convert block trace points to TRACE_EVENT(), fix
>>
>> Steven Rostedt (10):
>> ? ? ? tracing: add nsec2sec print formats
>> ? ? ? tracing: convert lockdep lock_acquired trace point to use nsec2usec tag
>> ? ? ? tracing: add major and minor tags for print format
>> ? ? ? tracing: use << to print < instead of \<
>> ? ? ? tracing: convert the block trace points to use the new tag format
>> ? ? ? tracing: add test for strings in event tag format
>> ? ? ? tracing: add func and symfunc to tag format
>> ? ? ? tracing: check full name for field
>> ? ? ? tracing: update sample code with new tag format
>> ? ? ? tracing: move '>' to out of macros and into print statement
>>
>> ----
>> ?include/linux/blktrace_api.h ? ? ? ? ? ? ? | ? ?4 +-
>> ?include/linux/ftrace_event.h ? ? ? ? ? ? ? | ? ?3 +-
>> ?include/trace/events/block.h ? ? ? ? ? ? ? | ?101 +++------
>> ?include/trace/events/irq.h ? ? ? ? ? ? ? ? | ? ?8 +-
>> ?include/trace/events/kmem.h ? ? ? ? ? ? ? ?| ? 12 +-
>> ?include/trace/events/lockdep.h ? ? ? ? ? ? | ? ?8 +-
>> ?include/trace/ftrace.h ? ? ? ? ? ? ? ? ? ? | ? ?2 +-
>> ?kernel/trace/trace_output.c ? ? ? ? ? ? ? ?| ? ?2 +-
>> ?kernel/trace/trace_output.h ? ? ? ? ? ? ? ?| ? ?4 +
>> ?kernel/trace/trace_read_binary.c ? ? ? ? ? | ?304 +++++++++++++++++++++++-----
>> ?samples/trace_events/trace-events-sample.c | ? 21 ++-
>> ?samples/trace_events/trace-events-sample.h | ? 66 ++++++
>> ?12 files changed, 399 insertions(+), 136 deletions(-)
>
> Hm, that's way too much back and forth really - trivial typo fixes,
> build failure, etc. This is really a 'oh, the merge window is
> coming' last minute scrambling and we dont want to mess up the
> squeaky-clean tracing tree history be messed up with this.
>
> Frederic also expressed worries about the tag format. Could we have
> a wider buy-in for this format?



Well, indeed I had worries, but I discussed about it with Steven and
now I actually
think this new tag format is much more powerful than printf style.
It brings a cleaner, and much higher level way to control the data exports.

But it would be nice to read some opinions from end users (end
developers) of TRACE_EVENT().


>
> I've separated these bits into tip:tracing/ftrace, and kept
> tip:tracing/core on a pre-print-formats state (going back 8
> commits), so that upstream merging of the other bits does not get
> held up.
>
> Could we try a cleaner, bisectable, consiously built up version of
> these final tracing/core..tracing/ftrace please? I think we can -
> the rest of the tree is clean. Please do a exact-same-content rebase
> so that the merge back gets obvious and that the testing we've
> injected does not get invalidated?
>
> ? ? ? ?Ingo
>

2009-06-10 13:02:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format

On Wed, Jun 10, 2009 at 01:11:40PM +0200, Fr?d?ric Weisbecker wrote:
> Well, indeed I had worries, but I discussed about it with Steven and
> now I actually
> think this new tag format is much more powerful than printf style.
> It brings a cleaner, and much higher level way to control the data exports.
>
> But it would be nice to read some opinions from end users (end
> developers) of TRACE_EVENT().

Maybe I'm missing something, but looks like the this new format, while
simpler and easier to read, doesn't have support for using a more
complicated C expression as a printk argument. For example:

TP_printk("dev %s ino %lu mode %d uid %u gid %u blocks %llu",
jbd2_dev_to_name(__entry->dev), __entry->ino, __entry->mode,
__entry->uid, __entry->gid, __entry->blocks)

How should I handle the "jbd2_dev_to_name(__entry->dev)" argument to
TP_printk? The whole point of calling jbd2_dev_to_name() at TP_printk
time is to not bloat the ring buffer with a 32 byte devname.

- Ted

2009-06-10 13:49:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format


On Wed, 10 Jun 2009, Theodore Tso wrote:

> On Wed, Jun 10, 2009 at 01:11:40PM +0200, Fr?d?ric Weisbecker wrote:
> > Well, indeed I had worries, but I discussed about it with Steven and
> > now I actually
> > think this new tag format is much more powerful than printf style.
> > It brings a cleaner, and much higher level way to control the data exports.
> >
> > But it would be nice to read some opinions from end users (end
> > developers) of TRACE_EVENT().
>
> Maybe I'm missing something, but looks like the this new format, while
> simpler and easier to read, doesn't have support for using a more
> complicated C expression as a printk argument. For example:
>
> TP_printk("dev %s ino %lu mode %d uid %u gid %u blocks %llu",
> jbd2_dev_to_name(__entry->dev), __entry->ino, __entry->mode,
> __entry->uid, __entry->gid, __entry->blocks)
>
> How should I handle the "jbd2_dev_to_name(__entry->dev)" argument to
> TP_printk? The whole point of calling jbd2_dev_to_name() at TP_printk
> time is to not bloat the ring buffer with a 32 byte devname.

Understood, and the example you just gave also has the flaw that a
userspace tool could not parse it, because it would not know what to do
with "jbd2_dev_to_name()".

This is why I suggested keeping the TP_printk, for cases like this. Since
it is also currently useless in userspace.

But we really should convert all cases, and I was toying with an idea to
dynamically make your own data type, and be able to make a way to print
it. That is you could register:

register_trace_event_data_type(const char *name,
(int)(*print_func)(struct trace_seq *s, void *data, int size),
const char *fmt, ...);

Where the name would be the data type you are making, the print_func is
how ftrace would print it in debugfs/tracing/trace, and the fmt, ... would
be who to show the user how to print it.

For example, for the GFP flags we could do something like:

/* helper routine */
unsigned long long trace_get_word(void *p, int len)
{
unsigned long long val;

switch (size) {
case 1:
val = *(char *)p;
break;
case 2:
val = *(short *)p;
break;
case 4:
val = *(int *)p;
break;
case 8:
val = *(long long *)p;
break;
default:
WARN(1,"length %d not valid word size\n");
return 0;
}

return val;
}

static int test_gfp(unsigned long *gfp, unsigned long mask)
{
if ((*gfp & mask) == mask) {
*gfp &= ~mask;
return 1;
}
return 0;
}

#define test_gfp_name(under, name) \
if (test_gfp(&gfp, under##GFP_##name)) { \
if (first) \
first = 0; \
else \
trace_seq_putc(s, '|'); \
trace_seq_puts(s, "GFP_" #name); \
}


static int print_gfp(struct trace_seq *s, void *data, int len)
{
unsigned long gfp;

gfp = trace_get_word(data, len);

if (!gfp) {
trace_seq_puts(s, GPF_NOWAIT);
return 0;
}

while (gfp) {
test_gfp_name(,HIGHUSER_MOVABLE);
test_gfp_name(,HIGHUSER);
test_gfp_name(,USER);
test_gfp_name(,TEMPORARY);
test_gfp_name(,KERNEL);
test_gfp_name(,NOFS);
test_gfp_name(,ATOMIC);
test_gfp_name(,NOIO);
test_gfp_name(__,HIGH);
test_gfp_name(__,WAIT);
test_gfp_name(__,IO);
test_gfp_name(__,COLD);
test_gfp_name(__,NOWARN);
test_gfp_name(__,REPEAT);
test_gfp_name(__,NOFAIL);
test_gfp_name(__,NORETRY);
test_gfp_name(__,COMP);
test_gfp_name(__,ZERO);
test_gfp_name(__,NOMEMALLOC);
test_gfp_name(__,HARDWALL);
test_gfp_name(__,THISNODE);
test_gfp_name(__,RECLAIMABLE);
test_gfp_name(__,MOVABLE);

}

#define gfp_insert(under, name) \
(unsigned long)under##GFP_##name, "GFP_" #name

register_trace_event_data_type("gfp", print_gfp,
"mask:\n"
" 0=GFP_NOWAIT,"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n"
" 0x%lx=%s,\n",
gfp_insert(,HIGHUSER_MOVABLE),
gfp_insert(,HIGHUSER),
gfp_insert(,USER),
gfp_insert(,TEMPORARY),
gfp_insert(,NOFS),
gfp_insert(,ATOMIC),
gfp_insert(,NOIO),
gfp_insert(__,HIGH),
gfp_insert(__,WAIT),
gfp_insert(__,IO),
gfp_insert(__,COLD),
gfp_insert(__,NOWARN),
gfp_insert(__,REPEAT),
gfp_insert(__,NOFAIL),
gfp_insert(__,NORETRY),
gfp_insert(__,COMP),
gfp_insert(__,ZERO),
gfp_insert(__,NOMEMALLOC),
gfp_insert(__,HARDWALL),
gfp_insert(__,THISNODE),
gfp_insert(__,RECLAIMABLE),
gfp_insert(__,MOVEABLE));


And then in the trace format, we could do:

<data:gfp:field>

And the 'data' will flag us to how to print the data.

For userland, there could be a file in:

/debug/tracing/events/data_types/gfp/format

That will show that format. Yes we duplicate some of the code, but it
it would solve these types of issues.

-- Steve





2009-06-10 14:39:24

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format

* Steven Rostedt ([email protected]) wrote:
>
> On Wed, 10 Jun 2009, Theodore Tso wrote:
>
> > On Wed, Jun 10, 2009 at 01:11:40PM +0200, Fr?d?ric Weisbecker wrote:
> > > Well, indeed I had worries, but I discussed about it with Steven and
> > > now I actually
> > > think this new tag format is much more powerful than printf style.
> > > It brings a cleaner, and much higher level way to control the data exports.
> > >
> > > But it would be nice to read some opinions from end users (end
> > > developers) of TRACE_EVENT().
> >
> > Maybe I'm missing something, but looks like the this new format, while
> > simpler and easier to read, doesn't have support for using a more
> > complicated C expression as a printk argument. For example:
> >
> > TP_printk("dev %s ino %lu mode %d uid %u gid %u blocks %llu",
> > jbd2_dev_to_name(__entry->dev), __entry->ino, __entry->mode,
> > __entry->uid, __entry->gid, __entry->blocks)
> >
> > How should I handle the "jbd2_dev_to_name(__entry->dev)" argument to
> > TP_printk? The whole point of calling jbd2_dev_to_name() at TP_printk
> > time is to not bloat the ring buffer with a 32 byte devname.
>
> Understood, and the example you just gave also has the flaw that a
> userspace tool could not parse it, because it would not know what to do
> with "jbd2_dev_to_name()".
>
> This is why I suggested keeping the TP_printk, for cases like this. Since
> it is also currently useless in userspace.
>
> But we really should convert all cases, and I was toying with an idea to
> dynamically make your own data type, and be able to make a way to print
> it. That is you could register:
>
> register_trace_event_data_type(const char *name,
> (int)(*print_func)(struct trace_seq *s, void *data, int size),
> const char *fmt, ...);
>
> Where the name would be the data type you are making, the print_func is
> how ftrace would print it in debugfs/tracing/trace, and the fmt, ... would
> be who to show the user how to print it.
>
> For example, for the GFP flags we could do something like:
>
> /* helper routine */
> unsigned long long trace_get_word(void *p, int len)
> {
> unsigned long long val;
>
> switch (size) {
> case 1:
> val = *(char *)p;
> break;
> case 2:
> val = *(short *)p;
> break;
> case 4:
> val = *(int *)p;
> break;
> case 8:
> val = *(long long *)p;
> break;
> default:
> WARN(1,"length %d not valid word size\n");
> return 0;
> }
>
> return val;
> }
>
> static int test_gfp(unsigned long *gfp, unsigned long mask)
> {
> if ((*gfp & mask) == mask) {
> *gfp &= ~mask;
> return 1;
> }
> return 0;
> }
>
> #define test_gfp_name(under, name) \
> if (test_gfp(&gfp, under##GFP_##name)) { \
> if (first) \
> first = 0; \
> else \
> trace_seq_putc(s, '|'); \
> trace_seq_puts(s, "GFP_" #name); \
> }
>
>
> static int print_gfp(struct trace_seq *s, void *data, int len)
> {
> unsigned long gfp;
>
> gfp = trace_get_word(data, len);
>
> if (!gfp) {
> trace_seq_puts(s, GPF_NOWAIT);
> return 0;
> }
>
> while (gfp) {
> test_gfp_name(,HIGHUSER_MOVABLE);
> test_gfp_name(,HIGHUSER);
> test_gfp_name(,USER);
> test_gfp_name(,TEMPORARY);
> test_gfp_name(,KERNEL);
> test_gfp_name(,NOFS);
> test_gfp_name(,ATOMIC);
> test_gfp_name(,NOIO);
> test_gfp_name(__,HIGH);
> test_gfp_name(__,WAIT);
> test_gfp_name(__,IO);
> test_gfp_name(__,COLD);
> test_gfp_name(__,NOWARN);
> test_gfp_name(__,REPEAT);
> test_gfp_name(__,NOFAIL);
> test_gfp_name(__,NORETRY);
> test_gfp_name(__,COMP);
> test_gfp_name(__,ZERO);
> test_gfp_name(__,NOMEMALLOC);
> test_gfp_name(__,HARDWALL);
> test_gfp_name(__,THISNODE);
> test_gfp_name(__,RECLAIMABLE);
> test_gfp_name(__,MOVABLE);
>
> }
>
> #define gfp_insert(under, name) \
> (unsigned long)under##GFP_##name, "GFP_" #name
>
> register_trace_event_data_type("gfp", print_gfp,
> "mask:\n"
> " 0=GFP_NOWAIT,"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n"
> " 0x%lx=%s,\n",
> gfp_insert(,HIGHUSER_MOVABLE),
> gfp_insert(,HIGHUSER),
> gfp_insert(,USER),
> gfp_insert(,TEMPORARY),
> gfp_insert(,NOFS),
> gfp_insert(,ATOMIC),
> gfp_insert(,NOIO),
> gfp_insert(__,HIGH),
> gfp_insert(__,WAIT),
> gfp_insert(__,IO),
> gfp_insert(__,COLD),
> gfp_insert(__,NOWARN),
> gfp_insert(__,REPEAT),
> gfp_insert(__,NOFAIL),
> gfp_insert(__,NORETRY),
> gfp_insert(__,COMP),
> gfp_insert(__,ZERO),
> gfp_insert(__,NOMEMALLOC),
> gfp_insert(__,HARDWALL),
> gfp_insert(__,THISNODE),
> gfp_insert(__,RECLAIMABLE),
> gfp_insert(__,MOVEABLE));
>
>
> And then in the trace format, we could do:
>
> <data:gfp:field>
>
> And the 'data' will flag us to how to print the data.
>
> For userland, there could be a file in:
>
> /debug/tracing/events/data_types/gfp/format
>
> That will show that format. Yes we duplicate some of the code, but it
> it would solve these types of issues.
>

It sounds a lot like the type tables LTTng is currently exporting
through specific channels. One for the list of IRQ handlers, one listing
softirqs, one for syscalls.... etc etc. The nice side of this approach
is that it permits to deal with dynamic events that modify the table
state while tracing is active, e.g. : loadling a module which adds an
IRQ handlers.

This is planned to be used for enum description eventually.

Mathieu

> -- Steve
>
>
>
>
>


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-06-10 15:21:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format


On Wed, 10 Jun 2009, Mathieu Desnoyers wrote:
> >
> >
> > And then in the trace format, we could do:
> >
> > <data:gfp:field>
> >
> > And the 'data' will flag us to how to print the data.
> >
> > For userland, there could be a file in:
> >
> > /debug/tracing/events/data_types/gfp/format
> >
> > That will show that format. Yes we duplicate some of the code, but it
> > it would solve these types of issues.
> >
>
> It sounds a lot like the type tables LTTng is currently exporting
> through specific channels. One for the list of IRQ handlers, one listing
> softirqs, one for syscalls.... etc etc. The nice side of this approach
> is that it permits to deal with dynamic events that modify the table
> state while tracing is active, e.g. : loadling a module which adds an
> IRQ handlers.
>
> This is planned to be used for enum description eventually.

Well, I wouldn't make it dependent on any class, just a generic type to
add that any user could do. The types just need to be added before the
events are registered.

-- Steve

2009-06-10 16:03:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format

On Wed, Jun 10, 2009 at 09:49:29AM -0400, Steven Rostedt wrote:
> >
> > Maybe I'm missing something, but looks like the this new format, while
> > simpler and easier to read, doesn't have support for using a more
> > complicated C expression as a printk argument. For example:
> >
> > TP_printk("dev %s ino %lu mode %d uid %u gid %u blocks %llu",
> > jbd2_dev_to_name(__entry->dev), __entry->ino, __entry->mode,
> > __entry->uid, __entry->gid, __entry->blocks)
> >
> > How should I handle the "jbd2_dev_to_name(__entry->dev)" argument to
> > TP_printk? The whole point of calling jbd2_dev_to_name() at TP_printk
> > time is to not bloat the ring buffer with a 32 byte devname.
>
> Understood, and the example you just gave also has the flaw that a
> userspace tool could not parse it, because it would not know what to do
> with "jbd2_dev_to_name()".
>
> This is why I suggested keeping the TP_printk, for cases like this. Since
> it is also currently useless in userspace.
>
> But we really should convert all cases, and I was toying with an idea to
> dynamically make your own data type, and be able to make a way to print
> it.

Yes, another approach for handling this case would be to take my
"jbd2_dev_to_name" function and support it as a first-class tagged
type; after all, I'm sure ext4 won't be the only place that would like
to take a dev_t and print the device name. So this could certainly be
fixed by adding some kind of "<dev:xxx>" sort of tagged name.

But I think it would be good to keep TP_printk because otherwise I'll
have to scramble and change my marker->tracepoint patches during the
merge window, which would invalidate all of the testing to date.

I agree that the new tagged format is superior, but I'm wondering
whether it really makes sense to try to scramble and try to switch my
ext4/jbd2 users in the 36 hours or so before Linus opens the merge
window....

- Ted

2009-06-10 16:17:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format


On Wed, 10 Jun 2009, Theodore Tso wrote:

> On Wed, Jun 10, 2009 at 09:49:29AM -0400, Steven Rostedt wrote:
> > >
> > > Maybe I'm missing something, but looks like the this new format, while
> > > simpler and easier to read, doesn't have support for using a more
> > > complicated C expression as a printk argument. For example:
> > >
> > > TP_printk("dev %s ino %lu mode %d uid %u gid %u blocks %llu",
> > > jbd2_dev_to_name(__entry->dev), __entry->ino, __entry->mode,
> > > __entry->uid, __entry->gid, __entry->blocks)
> > >
> > > How should I handle the "jbd2_dev_to_name(__entry->dev)" argument to
> > > TP_printk? The whole point of calling jbd2_dev_to_name() at TP_printk
> > > time is to not bloat the ring buffer with a 32 byte devname.
> >
> > Understood, and the example you just gave also has the flaw that a
> > userspace tool could not parse it, because it would not know what to do
> > with "jbd2_dev_to_name()".
> >
> > This is why I suggested keeping the TP_printk, for cases like this. Since
> > it is also currently useless in userspace.
> >
> > But we really should convert all cases, and I was toying with an idea to
> > dynamically make your own data type, and be able to make a way to print
> > it.
>
> Yes, another approach for handling this case would be to take my
> "jbd2_dev_to_name" function and support it as a first-class tagged
> type; after all, I'm sure ext4 won't be the only place that would like
> to take a dev_t and print the device name. So this could certainly be
> fixed by adding some kind of "<dev:xxx>" sort of tagged name.

Yep that could be done as long as we know the mapping will never change.
Userspace needs know what those numbers mean.

>
> But I think it would be good to keep TP_printk because otherwise I'll
> have to scramble and change my marker->tracepoint patches during the
> merge window, which would invalidate all of the testing to date.

Understood, I made it that both TP_printk and TP_FORMAT can exist
together, but Christoph Hellwig doesn't like that idea. I'm thinking
for quick debug sessions, TP_printk() be used. In fact, if we go to
TP_FORMAT, I'll just make TP_printk no longer show up in the user format.

Then TP_printk() can be used for quick hacks, but if you want something
merged, it would need to be added to the tag format.

>
> I agree that the new tagged format is superior, but I'm wondering
> whether it really makes sense to try to scramble and try to switch my
> ext4/jbd2 users in the 36 hours or so before Linus opens the merge
> window....

Relax, we already decided this is .32 material ;-)

-- Steve

2009-06-11 13:03:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format

On Wed, Jun 10, 2009 at 12:03:03PM -0400, Theodore Tso wrote:
> Yes, another approach for handling this case would be to take my
> "jbd2_dev_to_name" function and support it as a first-class tagged
> type; after all, I'm sure ext4 won't be the only place that would like
> to take a dev_t and print the device name. So this could certainly be
> fixed by adding some kind of "<dev:xxx>" sort of tagged name.

We should absolutely agree on one way to represent block device for
tracers. At this point xfs, gfs2 and the block tracer all use
major/minor. I agree that having a name would be nicer for text output,
but I really want to keep the dev_t for the binary output. We should
also make sure we agree on a field name so that filtering will work
the same for all block-related tracers (especially important if we want
to trace something from the filesystem down to the block layer).

2009-06-11 15:48:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format

On Thu, Jun 11, 2009 at 09:03:18AM -0400, Christoph Hellwig wrote:
> On Wed, Jun 10, 2009 at 12:03:03PM -0400, Theodore Tso wrote:
> > Yes, another approach for handling this case would be to take my
> > "jbd2_dev_to_name" function and support it as a first-class tagged
> > type; after all, I'm sure ext4 won't be the only place that would like
> > to take a dev_t and print the device name. So this could certainly be
> > fixed by adding some kind of "<dev:xxx>" sort of tagged name.
>
> We should absolutely agree on one way to represent block device for
> tracers. At this point xfs, gfs2 and the block tracer all use
> major/minor. I agree that having a name would be nicer for text output,
> but I really want to keep the dev_t for the binary output. We should
> also make sure we agree on a field name so that filtering will work
> the same for all block-related tracers (especially important if we want
> to trace something from the filesystem down to the block layer).

And I want to keep the dev_t as what stored in the ring buffer for
space reasons; but the text name is far more convient in terms of
being human-readable, *especially* given there are some crazy people
out there wanting to make major/minor numbers be completely random.
The problem with that is that a log trace file might not be useful
after a reboot, unless someone had enough foresight to save the device
mappings which were in effect at the that the trace was taken.

So if we store the dev_t in the ring buffer, and have a way of
translating the dev_t into a human-friendly name at printk happen, I
think that meets both of our goals. That would imply that filtering
would be using a device number, which would certainly be more
efficient than doing a string compare.

- Ted

2009-06-11 17:14:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format

On Thu, Jun 11, 2009 at 11:47:51AM -0400, Theodore Tso wrote:
> On Thu, Jun 11, 2009 at 09:03:18AM -0400, Christoph Hellwig wrote:
> > On Wed, Jun 10, 2009 at 12:03:03PM -0400, Theodore Tso wrote:
> > > Yes, another approach for handling this case would be to take my
> > > "jbd2_dev_to_name" function and support it as a first-class tagged
> > > type; after all, I'm sure ext4 won't be the only place that would like
> > > to take a dev_t and print the device name. So this could certainly be
> > > fixed by adding some kind of "<dev:xxx>" sort of tagged name.
> >
> > We should absolutely agree on one way to represent block device for
> > tracers. At this point xfs, gfs2 and the block tracer all use
> > major/minor. I agree that having a name would be nicer for text output,
> > but I really want to keep the dev_t for the binary output. We should
> > also make sure we agree on a field name so that filtering will work
> > the same for all block-related tracers (especially important if we want
> > to trace something from the filesystem down to the block layer).
>
> And I want to keep the dev_t as what stored in the ring buffer for
> space reasons; but the text name is far more convient in terms of
> being human-readable, *especially* given there are some crazy people
> out there wanting to make major/minor numbers be completely random.
> The problem with that is that a log trace file might not be useful
> after a reboot, unless someone had enough foresight to save the device
> mappings which were in effect at the that the trace was taken.
>
> So if we store the dev_t in the ring buffer, and have a way of
> translating the dev_t into a human-friendly name at printk happen, I
> think that meets both of our goals. That would imply that filtering
> would be using a device number, which would certainly be more
> efficient than doing a string compare.
>
> - Ted


For the filters, we could enter the text name which would be
internally converted into a dev_t, there should be no problem.

Also the raw dev_t can be stored and then human-friendly printed on
print time.

Both seem about trivial to add.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-06-11 19:20:55

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 00/11] [GIT PULL] more updates for the tag format

On Thu, Jun 11, 2009 at 07:14:36PM +0200, Frederic Weisbecker wrote:
>
> For the filters, we could enter the text name which would be
> internally converted into a dev_t, there should be no problem.
>
> Also the raw dev_t can be stored and then human-friendly printed on
> print time.
>
> Both seem about trivial to add.

If someone wants to take this code and drop it into the core tracing
code, please feel free.

- Ted

/*
* jbd2_dev_to_name is a utility function used by the jbd2 and ext4
* tracing infrastructure to map a dev_t to a device name.
*
* The caller should use rcu_read_lock() in order to make sure the
* device name stays valid until its done with it. We use
* rcu_read_lock() as well to make sure we're safe in case the caller
* gets sloppy, and because rcu_read_lock() is cheap and can be safely
* nested.
*/
struct devname_cache {
struct rcu_head rcu;
dev_t device;
char devname[BDEVNAME_SIZE];
};
#define CACHE_SIZE_BITS 6
static struct devname_cache *devcache[1 << CACHE_SIZE_BITS];
static DEFINE_SPINLOCK(devname_cache_lock);

static void free_devcache(struct rcu_head *rcu)
{
kfree(rcu);
}

const char *jbd2_dev_to_name(dev_t device)
{
int i = hash_32(device, CACHE_SIZE_BITS);
char *ret;
struct block_device *bd;

rcu_read_lock();
if (devcache[i] && devcache[i]->device == device) {
ret = devcache[i]->devname;
rcu_read_unlock();
return ret;
}
rcu_read_unlock();

spin_lock(&devname_cache_lock);
if (devcache[i]) {
if (devcache[i]->device == device) {
ret = devcache[i]->devname;
spin_unlock(&devname_cache_lock);
return ret;
}
call_rcu(&devcache[i]->rcu, free_devcache);
}
devcache[i] = kmalloc(sizeof(struct devname_cache), GFP_KERNEL);
if (!devcache[i]) {
spin_unlock(&devname_cache_lock);
return "NODEV-ALLOCFAILURE"; /* Something non-NULL */
}
devcache[i]->device = device;
bd = bdget(device);
if (bd) {
bdevname(bd, devcache[i]->devname);
bdput(bd);
} else
__bdevname(device, devcache[i]->devname);
ret = devcache[i]->devname;
spin_unlock(&devname_cache_lock);
return ret;
}
EXPORT_SYMBOL(jbd2_dev_to_name);

2009-06-19 08:13:00

by Li Zefan

[permalink] [raw]
Subject: [BUG] bugs in jbd2_dev_to_name() (was Re: [PATCH 00/11] [GIT PULL] more updates for the tag format)

Theodore Tso wrote:
> On Thu, Jun 11, 2009 at 07:14:36PM +0200, Frederic Weisbecker wrote:
>> For the filters, we could enter the text name which would be
>> internally converted into a dev_t, there should be no problem.
>>
>> Also the raw dev_t can be stored and then human-friendly printed on
>> print time.
>>
>> Both seem about trivial to add.
>
> If someone wants to take this code and drop it into the core tracing
> code, please feel free.
>

Just notice this code has been merge, but there are 2 bugs in it.

> - Ted
>
> /*
> * jbd2_dev_to_name is a utility function used by the jbd2 and ext4
> * tracing infrastructure to map a dev_t to a device name.
> *
> * The caller should use rcu_read_lock() in order to make sure the
> * device name stays valid until its done with it. We use
> * rcu_read_lock() as well to make sure we're safe in case the caller
> * gets sloppy, and because rcu_read_lock() is cheap and can be safely
> * nested.
> */
> struct devname_cache {
> struct rcu_head rcu;
> dev_t device;
> char devname[BDEVNAME_SIZE];
> };
> #define CACHE_SIZE_BITS 6
> static struct devname_cache *devcache[1 << CACHE_SIZE_BITS];
> static DEFINE_SPINLOCK(devname_cache_lock);
>
> static void free_devcache(struct rcu_head *rcu)
> {
> kfree(rcu);
> }
>
> const char *jbd2_dev_to_name(dev_t device)
> {
> int i = hash_32(device, CACHE_SIZE_BITS);
> char *ret;
> struct block_device *bd;
>
> rcu_read_lock();
> if (devcache[i] && devcache[i]->device == device) {
> ret = devcache[i]->devname;
> rcu_read_unlock();
> return ret;

It doesn't seem safe to dereference @ret outside rcu read section.

> }
> rcu_read_unlock();
>
> spin_lock(&devname_cache_lock);
> if (devcache[i]) {
> if (devcache[i]->device == device) {
> ret = devcache[i]->devname;
> spin_unlock(&devname_cache_lock);
> return ret;
> }
> call_rcu(&devcache[i]->rcu, free_devcache);
> }
> devcache[i] = kmalloc(sizeof(struct devname_cache), GFP_KERNEL);

kmalloc(GFP_KERNEL) called with spin_lock held..

> if (!devcache[i]) {
> spin_unlock(&devname_cache_lock);
> return "NODEV-ALLOCFAILURE"; /* Something non-NULL */
> }
> devcache[i]->device = device;
> bd = bdget(device);
> if (bd) {
> bdevname(bd, devcache[i]->devname);
> bdput(bd);
> } else
> __bdevname(device, devcache[i]->devname);
> ret = devcache[i]->devname;
> spin_unlock(&devname_cache_lock);
> return ret;
> }
> EXPORT_SYMBOL(jbd2_dev_to_name);

2009-06-19 12:32:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [BUG] bugs in jbd2_dev_to_name() (was Re: [PATCH 00/11] [GIT PULL] more updates for the tag format)

On Fri, Jun 19, 2009 at 04:14:23PM +0800, Li Zefan wrote:
> > rcu_read_lock();
> > if (devcache[i] && devcache[i]->device == device) {
> > ret = devcache[i]->devname;
> > rcu_read_unlock();
> > return ret;
>
> It doesn't seem safe to dereference @ret outside rcu read section.

Note the comments at the beginning of the function:

The caller should use rcu_read_lock() in order to make sure the
device name stays valid until its done with it. We use
rcu_read_lock() as well to make sure we're safe in case the caller
gets sloppy, and because rcu_read_lock() is cheap and can be safely
nested.

I suppose I should change the wording to indicate that it adds a bit
more safety (as in, the crash won't happen inside this function, but
as far as the caller is concerned, all bets are off!)

> > spin_lock(&devname_cache_lock);
> > if (devcache[i]) {
> > if (devcache[i]->device == device) {
> > ret = devcache[i]->devname;
> > spin_unlock(&devname_cache_lock);
> > return ret;
> > }
> > call_rcu(&devcache[i]->rcu, free_devcache);
> > }
> > devcache[i] = kmalloc(sizeof(struct devname_cache), GFP_KERNEL);
>
> kmalloc(GFP_KERNEL) called with spin_lock held..

Good catch, thanks. I'll get a patch in to fix this.

- Ted

2009-06-22 01:35:03

by Li Zefan

[permalink] [raw]
Subject: Re: [BUG] bugs in jbd2_dev_to_name() (was Re: [PATCH 00/11] [GIT PULL] more updates for the tag format)

Theodore Tso wrote:
> On Fri, Jun 19, 2009 at 04:14:23PM +0800, Li Zefan wrote:
>>> rcu_read_lock();
>>> if (devcache[i] && devcache[i]->device == device) {
>>> ret = devcache[i]->devname;
>>> rcu_read_unlock();
>>> return ret;
>> It doesn't seem safe to dereference @ret outside rcu read section.
>
> Note the comments at the beginning of the function:
>

Ah, I overlooked the comments.

But the patch that adds rcu locking around trace event prints
never gets merged:

http://lkml.org/lkml/2009/4/15/549

Steven?

> The caller should use rcu_read_lock() in order to make sure the
> device name stays valid until its done with it. We use
> rcu_read_lock() as well to make sure we're safe in case the caller
> gets sloppy, and because rcu_read_lock() is cheap and can be safely
> nested.
>
> I suppose I should change the wording to indicate that it adds a bit
> more safety (as in, the crash won't happen inside this function, but
> as far as the caller is concerned, all bets are off!)
>