On Thu, 13 Sep 2007 16:43:16 -0700 David Wilder wrote:
> [it would be easier to review and make sense of the comments
> if the patch were inline instead of attached]
Tom Zanussi <[email protected]>
Martin Hunt <[email protected]>
David Wilder <[email protected]>
Above needs to use Signed-off-by: if you want this merged.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
trace.txt:
+When using the READ(2) interface,
s/READ/read/
+ preemption is disabled and that trace state is set to "running". a
s/. a/. A/
+on the function do_fork(). The value of current->pid is writen to
s/writen/written/
+You can build the kernel module fork_trace.ko using the following
+Makefile:
There was reportedly a discussion about sample source code in the
Documentation/ directory at the kernel summit. Some people want to
move it to the util-linux package. If that's not done, I strongly
prefer that Makefiles and source files be put into their own
sub-directory, not "hidden" inside txt files, so you would end up
with something line Documentation/trace/, with trace.txt, and also
Documentation/trace/src, with Makefile and fork_trace.c & any other
source files.
+Trace is adapted from blktrace authored by Jens Axboe ([email protected]).
MAINTAINERS file says <[email protected]>. He's also
<[email protected]>.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/trace.c:
+ * Based on blktrace code, Copyright (C) 2006 Jens Axboe <[email protected]>
Ditto.
+
+
Use just one blank line between functions and/or structs.
Please check the patch with scripts/checkpatch.pl and then evaluate
its warnings. Sometimes it makes sense to ignore some of them.
+/**
+ * trace_setup: create a new trace trace handle
+ *
+ * @root: The root directory name in the root of the debugfs
+ * to place trace directories. Created as needed.
Thanks for using kernel-doc; however, don't put a blank line between
the function name line and the parameters. Also, the function
name line should have a "-" separating the function name and the
short description. (multiple places in trace.c)
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Hi David.
A random comment to the code.
Several of the struct file_operations are not declared static as
they should be.
Btw. it looks good from a coding style point-of-view.
About the name what about ktrace??
Sam
> Trace - Provides tracing primitives
>
> ...
>
> +config TRACE
> + bool "Trace setup and control"
> + select RELAY
> + select DEBUG_FS
> + help
> + This option provides support for the setup, teardown and control
> + of tracing channels from kernel code. It also provides trace
> + information and control to userspace via a set of debugfs control
> + files. If unsure, say N.
> +
select is evil - you really want to avoid using it.
The problem is where you select a symbol whose dependencies aren't met.
Kconfig resolves this incompatibility by just not selecting the thing you
wanted, iirc. So your CONFIG_SYSFS=n, CONFIG_TRACE=y kernel won't build.
> +/*
> + * Based on blktrace code, Copyright (C) 2006 Jens Axboe <[email protected]>
So can we migrate blktrace to using this?
> +static ssize_t state_write(struct file *filp, const char __user *buffer,
> + size_t count, loff_t *ppos)
> +{
> + struct trace_info *trace = filp->private_data;
> + char buf[16] = { '\0' };
this initialisation isn't needed and will waste cycles.
> + int ret;
> +
> + if (trace->flags & TRACE_DISABLE_STATE)
> + return -EINVAL;
> +
> + if (count > sizeof(buf) - 1)
> + return -EINVAL;
> +
> + if (copy_from_user(buf, buffer, count))
> + return -EFAULT;
> +
> + buf[count] = '\0';
> +
> + if (strncmp(buf, "start", strlen("start")) == 0 ) {
> + ret = trace_start(trace);
> + if (ret)
> + return ret;
> + } else if (strncmp(buffer, "stop", strlen("stop")) == 0)
> + trace_stop(trace);
> + else
> + return -EINVAL;
What's the above code doing? Trying to cope with trailing chars after
"start" or "stop"? Is that actually needed? It's the \n, I assume?
> + return count;
> +}
> +
> +
> +static struct file_operations state_fops = {
> + .owner = THIS_MODULE,
> + .open = state_open,
> + .read = state_read,
> + .write = state_write,
> +};
> +
> +
> +static void remove_root(struct trace_info *trace)
> +{
> + if (trace->root->root && simple_empty(trace->root->root)) {
> + debugfs_remove(trace->root->root);
> + list_del(&trace->root->list);
> + kfree(trace->root);
> + trace->root = NULL;
> + }
> +}
> +
> +
> +static void remove_tree(struct trace_info *trace)
> +{
> + mutex_lock(&trace_mutex);
> +
> + debugfs_remove(trace->dir);
> +
> + if (--trace->root->users == 0)
> + remove_root(trace);
> +
> + mutex_unlock(&trace_mutex);
> +}
We usually only put a single blank line between functions. Two is just a
waste of screen space.
> +
> +
> +/*
> + * Creates the trace_root if it's not found.
> + */
>
> ...
>
> +static ssize_t sub_size_read(struct file *filp, char __user *buffer,
> + size_t count, loff_t *ppos)
> +{
> + struct trace_info *trace = filp->private_data;
> + char buf[32];
> +
> + snprintf(buf, sizeof(buf), "%u\n",
> + (unsigned int)trace->rchan->subbuf_size);
Use %tu to print a size_t, rather than the typecast.
> + return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf));
> +}
> +
>
> ...
>
> +static ssize_t nr_sub_read(struct file *filp, char __user *buffer,
> + size_t count, loff_t *ppos)
> +{
> + struct trace_info *trace = filp->private_data;
> + char buf[32];
> +
> + snprintf(buf, sizeof(buf), "%u\n",
> + (unsigned int)trace->rchan->n_subbufs);
Ditto. (It's unobvious why n_subbufs is a size_t)
> + return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf));
> +}
> +
>
> ...
>
> +static void remove_controls(struct trace_info *trace)
> +{
> + if (trace->state_file)
> + debugfs_remove(trace->state_file);
> + if (trace->dropped_file)
> + debugfs_remove(trace->dropped_file);
> + if (trace->reset_consumed_file)
> + debugfs_remove(trace->reset_consumed_file);
> + if (trace->nr_sub_file)
> + debugfs_remove(trace->nr_sub_file);
> + if (trace->sub_size_file)
> + debugfs_remove(trace->sub_size_file);
debugfs_remove(NULL) is legal: all the above tests can be removed.
> + if (trace->dir)
> + remove_tree(trace);
> +}
> +
>
> ...
>
> + * trace_setup: create a new trace trace handle
> + *
> + * @root: The root directory name in the root of the debugfs
> + * to place trace directories. Created as needed.
> + * @name: Trace directory name, created in @root
> + * @buf_size: size of the relay sub-buffers
> + * @buf_nr: number of relay sub-buffers
> + * @flags: Option selection (see GTSC channel flags definitions)
> + * default values when flags=0 are: use per-CPU buffering,
> + * use non-overwrite mode. See Documentation/trace.txt for details.
> + *
> + * returns a trace_info handle or NULL, if setup failed.
> + */
> +struct trace_info *trace_setup(const char *root, const char *name,
> + u32 buf_size, u32 buf_nr, u32 flags)
> +{
> + struct trace_info *trace = NULL;
> +
> + trace = setup_controls(root, name, flags);
> + if (!trace)
> + return NULL;
> +
> + trace->buf_size = buf_size;
> + trace->buf_nr = buf_nr;
> + trace->flags = flags;
> + mutex_init(&trace->state_mutex);
> + trace->state = TRACE_SETUP;
> +
> + return trace;
> +}
> +EXPORT_SYMBOL_GPL(trace_setup);
It's better for a pointer-returning function to return an ERR_PTR on error,
rather than NULL. That way the caller doesn't have to make guesses about
why the callee failed when propagating back an error. (See how your
init_module gives up and returns -1?)
> ...
>
> +
> +/**
> + * trace_cleanup_channel: destroys the trace channel only
> + *
> + * @trace: trace handle to cleanup
> + */
> +static void trace_cleanup_channel(struct trace_info *trace)
> +{
> + trace_stop(trace);
> + if (trace->rchan)
> + relay_close(trace->rchan);
relay_close(NULL) is legal. Please check the whole patch for this.
> + trace->rchan = NULL;
> +}
> +
> +/**
> + * trace_cleanup: destroys the trace channel, control files and dir
> + *
> + * @trace: trace handle to cleanup
> + */
> +void trace_cleanup(struct trace_info *trace)
> +{
> + trace_cleanup_channel(trace);
> + remove_controls(trace);
> + kfree(trace);
> +}
> +EXPORT_SYMBOL_GPL(trace_cleanup);
>
>
>
On Fri, 14 Sep 2007 18:08:40 -0700 Andrew Morton wrote:
> > Trace - Provides tracing primitives
> >
> > ...
> >
> > +config TRACE
> > + bool "Trace setup and control"
> > + select RELAY
> > + select DEBUG_FS
> > + help
> > + This option provides support for the setup, teardown and control
> > + of tracing channels from kernel code. It also provides trace
> > + information and control to userspace via a set of debugfs control
> > + files. If unsure, say N.
> > +
>
> select is evil - you really want to avoid using it.
I checked when I reviewed(?) this patch. There are a few other
places that select also (IIRC, the blktrace code does), but most of
them use depends on.
> The problem is where you select a symbol whose dependencies aren't met.
> Kconfig resolves this incompatibility by just not selecting the thing you
> wanted, iirc. So your CONFIG_SYSFS=n, CONFIG_TRACE=y kernel won't build.
>
> > +/*
> > + * Based on blktrace code, Copyright (C) 2006 Jens Axboe <[email protected]>
>
> So can we migrate blktrace to using this?
> > +static ssize_t sub_size_read(struct file *filp, char __user *buffer,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct trace_info *trace = filp->private_data;
> > + char buf[32];
> > +
> > + snprintf(buf, sizeof(buf), "%u\n",
> > + (unsigned int)trace->rchan->subbuf_size);
>
> Use %tu to print a size_t, rather than the typecast.
Eh?
Use %zu to print a size_t. Use %tu to print a ptrdiff_t.
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Andrew Morton wrote:
>> +/*
>> + * Based on blktrace code, Copyright (C) 2006 Jens Axboe <[email protected]>
>>
>
> So can we migrate blktrace to using this?
>
Yes, a blktrace patch is comming.
>> + int ret;
>> +
>> + if (trace->flags & TRACE_DISABLE_STATE)
>> + return -EINVAL;
>> +
>> + if (count > sizeof(buf) - 1)
>> + return -EINVAL;
>> +
>> + if (copy_from_user(buf, buffer, count))
>> + return -EFAULT;
>> +
>> + buf[count] = '\0';
>> +
>> + if (strncmp(buf, "start", strlen("start")) == 0 ) {
>> + ret = trace_start(trace);
>> + if (ret)
>> + return ret;
>> + } else if (strncmp(buffer, "stop", strlen("stop")) == 0)
>> + trace_stop(trace);
>> + else
>> + return -EINVAL;
>>
>
> What's the above code doing? Trying to cope with trailing chars after
> "start" or "stop"? Is that actually needed? It's the \n, I assume?
>
Yes, the typical usage is "echo start > state" and echo adds a \n.
Thanks for the comments, I will make the changes and resubmit.
Sam Ravnborg wrote:
> Hi David.
>
> A random comment to the code.
> Several of the struct file_operations are not declared static as
> they should be.
>
> Btw. it looks good from a coding style point-of-view.
>
> About the name what about ktrace??
>
> Sam
>
>
Thanks for the comment. I sure don't want to change the name a forth
time, can we live with "trace"?
On Fri, Sep 14, 2007 at 09:49:31PM -0700, David Wilder wrote:
> Sam Ravnborg wrote:
> >Hi David.
> >
> >A random comment to the code.
> >Several of the struct file_operations are not declared static as
> >they should be.
> >
> >Btw. it looks good from a coding style point-of-view.
> >
> >About the name what about ktrace??
> >
> > Sam
> >
> >
> Thanks for the comment. I sure don't want to change the name a forth
> time, can we live with "trace"?
I do not care much about the name so no big deal for me.
I was just soo generic..
Sam
Hi David,
Interesting work, but I think we could still enhance it. The interesting
things you bring is the trace control though debugfs files, which is
clear and simple. (I did it on top of netlink in LTTng, but I don't
really care about the mechanism, as long as we have the same
flexibility).
* David Wilder ([email protected]) wrote:
[...]
>
> +Overwrite mode can be called "flight recorder mode". Flight recorder
> +mode is selected by setting the TRACE_FLIGHT_CHANNEL flag when
> +creating trace channels. In flight mode when a tracing buffer is
> +full, the oldest records in the buffer will be discarded to make room
> +as new records arrive. In the default non-overwrite mode, new records
> +may be written only if the buffer has room. In either case, to
> +prevent data loss, a user space reader must keep the buffers
> +drained. Trace provides a means to detect the number of records that
> +have been dropped due to a buffer-full condition (non-overwrite mode
> +only).
> +
Since, in the end, we can represent the "flight recorder" as a simple flag,
can we imagine setting/unsetting it while the trace is active ?
Also, why is trace creation an in-kernel API ? What about a mkdir in
debugfs/trace ? I guess I see that this is because you need to keep a
pointer to the created trace so you can record your events it in. Have
you thought about keeping a global RCU list of active traces instead ?
One could then iterate on every active trace and record information in
them without having to bother about which specific trace it has created.
However, this should come with the ability to filter in/out events in
the handler on a per-trace basis.
The problem with the approach you propose is that it seems to tie a
specific event source to one trace channel.
It would be good to have some way to separate:
- event sources (markers/kprobes/...)
- active traces.
Each of them would have an event filter.
- within each trace, the ability to create _multiple_ channels, so we
can send the information in high/medium/low event rate channels on a
per-event basis. This is really useful to gather hybrid traces made
from flight recorder channels (high event rate) and non-over channels
(important low rate information required to understand the trace).
- Each trace channel would be either global or per cpu, and would be a
flight recorder channel or "normal", non overwrite, channel.
> +When per-CPU buffers are used, relay creates one debugfs file for each
> +running CPU. The user-space consumer of the data is responsible for
> +reading the per-CPU buffers and collating the records presumably using
> +a time stamp or sequence number included in the trace records. The
> +use of global buffers eliminates this extra work of sequencing
> +records; however the provider's data layer must hold a lock when
> +writing records. The lock prevents writers running on different CPUs
> +from overwriting each other's data. However, buffering may be slower
> +because writes to the buffer are serialized. Global buffering is
> +selected by setting the TRACE_GLOBAL_CHANNEL flag when creating trace
> +channels.
> +
We could allocate the trace buffers upon actions that would be
independant from trace creation. By doing so, we could then do a
echo 1 > path_to_trace/channel/global
Before we activate the trace or allocate the buffers. I would vote for a
echo 1 > path_to_trace/channel/allocate
So we can separate the trace buffer allocation from trace start (because
start operation might have to be done near from the studied events and
we want it to be as lightweight as possible).
So, typical usage could be:
cd /mnt/debugfs/trace
mkdir mytrace
cd mytrace
echo 1 > start
The default could be that we create a trace with a "main" set of per-cpu
channels. i.e.:
mytrace/main
But then, a mkdir within the mytrace directory could add new custom
channels:
in mytrace:
mkdir processes
cd processes
(then set buffer size, nr subbuf, flight vs non flight, global..)
echo 1 > allocate
By default, a trace event filter would accept all events. Events could
be identified by a name (see markers proposed subsystem_event name).
Issuing a :
echo 0 > path_to_trace/filter
would disable all events
echo "event_name" > path_to_trace/filter
would add the event_name to the trace filter
By default, events would be sent into the "main" channel, but
echo "event_name" > path_to_trace/channel/filter
would send the event in the "channel" channel instead.
We could think of integrating the markers macro into this scheme to
describe the events. Instead of doing an explicit trace write in the
breakpoint handler, we could simply put a marker (it could even be a
branch-free marker if you prefer). In LTTng, upon trace start, I iterate
on all the kernel's markers to record, in a "control" channel, all the
marker names, their ids (I assign them a 16 bit id), and their format
strings. It allows me to parse the trace given only the timestamps,
event IDs and event specific data.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68