by zhangwei(Jovi)

[permalink] [raw]

Subject: Re: [PATCH v2 00/11] tracing: trace event triggers

On 2013/7/1 20:32, Masami Hiramatsu wrote:
> (2013/06/29 18:30), zhangwei(Jovi) wrote:
>>> This patchset implements 'trace event triggers', which are similar to
>>> the function triggers implemented for 'ftrace filter commands' (see
>>> 'Filter commands' in Documentation/trace/ftrace.txt), but instead of
>>> being invoked from function calls are invoked by trace events.
>>> Basically the patchset allows 'commands' to be triggered whenever a
>>> given trace event is hit. The set of commands implemented by this
>>> patchset are:
>>>
>>> - enable/disable_event - enable or disable another event whenever
>>> the trigger event is hit
>>>
>>> - stacktrace - dump a stacktrace to the trace buffer whenever the
>>> trigger event is hit
>>>
>>> - snapshot - create a snapshot of the current trace buffer whenever
>>> the trigger event is hit
>>>
>>> - traceon/traceoff - turn tracing on or off whenever the trigger
>>> event is hit
>>>
>>> Triggers can also be conditionally invoked by associating a standard
>>> trace event filter with them - if the given event passes the filter,
>>> the trigger is invoked, otherwise it's not. (see 'Event filtering' in
>>> Documentation/trace/events.txt for info on event filters).
>>>
>>
>> I just aware that we are implementing more and more scripting functionality into
>> tracing subsystem, like filter and trigger mode, of cause we don't call it
>> as scripting, but basically the pattern is same, all is "do something when event hit".
>
> Agreed, that's a good direction to handle event by script in kernel :)
> That may be simply done with an extension of "event trigger". Of course
> your ktap work will be the next step for ftrace. But I think, the basic
> implementation can be done by just passing recorded event entry to
> each action. (other works are for debugfs management)
> And that could be a generic trace-event interface for other users too.
>
Fully agree "passing recorded event entry to each action".
Actually there already have this interface, it's perf.

struct perf_event *
perf_event_create_kernel_counter(struct perf_event_attr *attr,
int cpu,
struct task_struct *task,
perf_overflow_handler_t callback,
void *context);

As we known, each event has a id, register this id in perf_event_attr, and
give a callback function, then it will call the callback function when event hit.

void overflow_callback(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs)

the recorded event entry is passed as data->raw->data;

This perf interface is a generic action trigger interface now, it support
tracepoint, k(ret)probe, u(ret)probe, PMU, hw_breakpoint(perhaps we could implement
PMU and hw_breakpoint trigger in future)

So why we need to reinvent another trigger interface as this patchset did?
(this patchset changed lots of places, include macro in ftrace.h, and will change more
if support kprobe/uprobe trigger in future.)

jovi

2013-07-02 14:46:51

by Tom Zanussi

[permalink] [raw]

Subject: Re: [PATCH v2 00/11] tracing: trace event triggers

On Tue, 2013-07-02 at 11:54 +0800, zhangwei(Jovi) wrote:
> On 2013/7/1 23:49, Tom Zanussi wrote:
> > Hi jovi,
> >
> > On Sat, 2013-06-29 at 17:30 +0800, zhangwei(Jovi) wrote:
> >> On 2013/6/29 13:08, Tom Zanussi wrote:
> >>> Hi,
> >>>
> >>> This is v2 of the trace event triggers patchset, addressing comments
> >>> from Masami Hiramatsu, zhangwei(Jovi), and Steve Rostedt (thanks for
> >>> reviewing v1).
> >>>
> >>> v2:
> >>> - removed all changes to __ftrace_event_enable_disable() (except
> >>> for patch 04/11 which clears the soft_disabled bit as discussed)
> >>> and created a separate trace_event_trigger_enable_disable() that
> >>> calls it after setting/clearing the TRIGGER_MODE_BIT.
> >>> - added a trigger_mode enum for future patches that break up the
> >>> trigger calls for filtering, but that's also now used as a command
> >>> id for registering/unregistering commands.
> >>> - removed the enter_file/exit_file members that were added to
> >>> syscall_metadata after realizing that they were unnecessary if
> >>> ftrace_syscall_enter/exit() were modified to receive a pointer
> >>> to the ftrace_file instead of the pointer to the trace_array in
> >>> the ftrace_file.
> >>> - broke up the trigger invocation into two parts so that triggers
> >>> like 'stacktrace' that themselves log into the trace buffer can
> >>> defer the actual trigger invocation until after the current
> >>> record is closed, which is needed for the filter check that
> >>> in turn determines whether the trigger gets invoked.
> >>> - other minor cleanup
> >>>
> >>>
> >>> This patchset implements 'trace event triggers', which are similar to
> >>> the function triggers implemented for 'ftrace filter commands' (see
> >>> 'Filter commands' in Documentation/trace/ftrace.txt), but instead of
> >>> being invoked from function calls are invoked by trace events.
> >>> Basically the patchset allows 'commands' to be triggered whenever a
> >>> given trace event is hit. The set of commands implemented by this
> >>> patchset are:
> >>>
> >>> - enable/disable_event - enable or disable another event whenever
> >>> the trigger event is hit
> >>>
> >>> - stacktrace - dump a stacktrace to the trace buffer whenever the
> >>> trigger event is hit
> >>>
> >>> - snapshot - create a snapshot of the current trace buffer whenever
> >>> the trigger event is hit
> >>>
> >>> - traceon/traceoff - turn tracing on or off whenever the trigger
> >>> event is hit
> >>>
> >>> Triggers can also be conditionally invoked by associating a standard
> >>> trace event filter with them - if the given event passes the filter,
> >>> the trigger is invoked, otherwise it's not. (see 'Event filtering' in
> >>> Documentation/trace/events.txt for info on event filters).
> >>>
> >>
> >> I just aware that we are implementing more and more scripting functionality into
> >> tracing subsystem, like filter and trigger mode, of cause we don't call it
> >> as scripting, but basically the pattern is same, all is "do something when event hit".
> >>
> >
> > Not really - this patchset is just reusing the existing filter code
> > that's been there for years, and yes, it does follow the pattern of
> > 'doing something when an event is hit', but the things it does are
> > really dirt simple - toggling of other events or the global tracing
> > on/off switch, snapshotting trace buffers, etc. All things that don't
> > require any kind of scripting - this is more on the level of wiring
> > things together on a breadboard. And it's all available by simply using
> > 'cat' and 'echo' - no separate command and scripts to keep track of.
> >
> >> FYI, a pretty simple scripting module of tracing is there:
> >> https://github.com/ktap/ktap.git
> >>
> >
> > It looks pretty nice, but I wonder if Linux is ready for a full-fledged
> > language interpreter in the kernel. It's been tried before - see
> > DProbes (in fact that effort is where kprobes came from - after it was
> > obvious DProbes wouldn't make it into the kernel, it was broken up into
> > multiple pieces - kprobes and uprobes eventually got in, but the RPN
> > interpreter, which also had an ANSI C compiler targeting the bytecode
> > (dpcc) never did...). Well, following that, DTrace came along and
> > showed how useful it could be, so maybe there wouldn't be as much
> > resistance these days...
>
> Actually ktap is very lightweight compare with Dtrace and other tools,
> it doesn't reinvent tracing interface, it make use on kprobe/uprobe/tracepoint/perf,
> it doesn't engage with debugging info, just a simple script interface for tracing,
> as demoed in example, the tracing interface is same as perf.
> >
> > Also, assuming an in-kernel language interpreter would fly, did you
> > consider starting with something already-baked rather than starting from
> > scratch? How about taking something like the parrot VM and carving out
> > a minimal core subset of that suitable for embedding in the kernel? It
> > probably wouldn't be easy, but you'd be building on a relatively mature
> > and tested VM that's been designed for targeting many languages
> > (including lua).
> >
> ktap is building on mature interpreter(lua) which proven fast and widely used,
> the core interpreter is very simple and small, and really easy to hack it.
>
> I will never try to put a general purpose VM(like parrot) into kernel, we need to
> care on safety, size, performance, memory footprint, etc. Compared with
> many other languages, you will find ktap interpreter(based on lua) is
> very suit for kernel than others, the interpreter just have 5000 LOC, with
> 40+ bytecodes, and there have a lot of space to improve.
>

Right, and I didn't suggest putting an unmodified general-purpose VM in
the kernel either - I suggested that *if* you were going to put a VM in
the kernel, why not start with one designed to target many languages,
especially the ones it was initially designed for and that most people
actually know (perl and python). Of course, modifying it for 'safety,
size, ..., etc', as I mentioned probably wouldn't be easy, but maybe it
would be worth it if you could avoid forcing your users into learning
lua to use your tool...

> > Another thing you could consider is taking a page from the DProbes and
> > breaking up and submitting ktap piecemeal. For instance, you could
> > break out and submit the associative array piece and with a little
> > thought tie it to a 'histogram' trigger e.g. something like:
> >
> > echo 'hist[key,val] if filter' > trigger
> >
> > For example, something like this could be used to accomplish a similar
> > thing to your scripts/syscalls_histogram.kp script (keeping and dumping
> > a per-syscall count):
> >
> > echo 'hist[e.name,count()]' > /syscalls/trigger
> >
> > Similarly, this would keep a per-executable syscall count like
> > syscalls_histogram2.kp:
> >
> > echo 'hist[comm,count()]' > /syscalls/trigger
> >
> > Any field defined in an event's 'format' file could be used as a key or
> > value (e.g. using 'comm' as a hash key as above). For example, keeping
> > a per-pid hash of total bytes written by 'sys_exit_write' (both
> > common_pid and ret are sys_exit_write fields, and I assume a small set
> > of utility functions that say what to do with numeric fields, perhaps
> > 'add' could be the default) would show which process is doing most of
> > the writing:
> >
> > echo 'hist[common_pid,add(ret)]' > syscalls/sys_exit_write/trigger
> >
> > Making the trace_dump_stack() available as a utility function would
> > allow things like this:
> >
> > echo 'hist[stacktrace(),count()] if bytes_req > 16384' >
> > kmem/kmalloc/trigger
> >
> > Basically the above would use the stacktrace itself as a hash key and
> > the value stored for each key would be a per-execution-path count of >
> > 16k kernel memory requests. Dumping out the hash would show the top
> > execution paths leading to large kmalloc requests.
> >
> > This would give users the ability to accomplish what a lot of them seem
> > to use these tools for, and without any scripting (I'm not saying
> > scripting wouldn't be useful, but I think you can still accomplish quite
> > a bit without it).
> >
> > Tom
>
> Thanks for your comments.
>
> There have some technical limitation if using oneline echo/cat style trigger.
>

Yes, there are many, I claimed otherwise.

> - subsystem level trigger
> How to handle subsystem level trigger like "syscalls:*" and all tracepoint "*:*"?
> you need to add /sys/kernel/debug/tracing/events/$system/trigger file, otherwise
> user need to use echo command to all events belong the subsystem.
>

Yes, there are already $system/filter and $system/enable, so adding
$system/trigger shouldn't be too much of a stretch.

> - userspace uprobe (debugging info)
> I couldn't image how to handle userspace probe if only use echo/cat(you will need
> help from perf-probe), the other issue is userspace symbol resolving(like call stack),
> there must need some userspace utility in there, like perf, systemtap, dtrace, etc.
>
> - timer
> I also cannot image how to handle timer into there, normally we need to record
> timestamps when event is hit and calculate duration.
>
> - associative array
> Is really use echo/cat to present associative array would be more descriptive than
> scripting?
>

No, you're putting words in my mouth, who said it was more descriptive?
I was just trying to show what could be done with a simple associative
array trigger, and that it could probably account for 80% of the basic
'quick-and-dirty' use cases people seem to use these tools for (and it
looks like about 80% of the scripts in your ktap/scripts directory as
well).

> - arith calculation
> Usually we need to do arith calculation in script, but it will be hard to do in oneline.
>
> - variable declaration
> Of course there don't have variable declaration in oneline style trigger.
>
> There have many technical advantage to use script in there which I can't list all.
>
> The question is if you put more and more trigger command into event tracing, eventually it
> will became a script engine, from this point of view, script tracing is a flexible and extendable
> framework of trace event trigger.
>
> Since this thread is purpose on trace event trigger patchset, it's not suit for ktap detail

Well, you're the one who started the ktap discussion here...

Tom

> discussion, I will open a new ktap thread in LKML soon, feel free to give your comment in there. :)
>
> jovi
> >
> >
> >> For the trigger mode, you can perform any command when event hit if using scripting,
> >> in contrast with this patchset, ktap use perf callback handler to invoke command,
> >> so it don't need extra code to support trigger mode in tracepoint/k(ret)probe/u(ret)probe.
> >>
> >> ---------------------
> >> trace "syscalls:*" function (e) {
> >> printf("%d %d\t%s\t%s", cpu(), pid(), execname(), e.tostring())
> >> }
> >> ---------------------
> >> trace "probe:do_sys_open dfd=%di filename=%dx flags=%cx mode=+4($stack)" function (e) {
> >> printf("%d %d\t%s\t%s", cpu(), pid(), execname(), e.tostring())
> >> }
> >> ---------------------
> >> trace "probe:do_sys_open%return fd=$retval" function (e) {
> >> printf("%d %d\t%s\t%s", cpu(), pid(), execname(), e.tostring())
> >> }
> >> ---------------------
> >> trace "probe:/lib/libc.so.6:0x000773c0" function (e) {
> >> printf("%d %d\t%s\t%s", cpu(), pid(), execname(), e.tostring())
> >> }
> >>
> >>
> >> what I'm thinking now is perhaps we can use a more generic mechanism in future
> >> to let user do more magic things when event hit.
> >>
> >> To be clear, I'm not against on this patchset, I'm on the same side with Tom,
> >> the trigger mode of this patchset is useful(awesome work). I'm just sharing some extra info,
> >> hopeful you don't mind it :)
> >>
> >> Thanks.
> >> jovi
> >>
> >>
>
>