Hi Jiaying,
I'm very interested to hear about the details of your test-case, because
what you are talking about here, and the numbers that you get, are, at
the very least, not representative of the current LTTng. Perhaps you
used a version right after the kernel summit and stopped following the
improvements we have done since then ? I also wonder about the instrumentation
set you enabled in your tests, and whether it's exactly the same in the ktrace
and LTTng runs (especially system call tracing). Please note that following the
Kernel Summit 2008, a more active development cycle began, and the time was not
right for optimisation. But since then, a lot of improvement has been made
on this front.
First I'd like to express my concern about the way you posted your
numbers :
1 - You did not CC anyone from the LTTng project.
2 - You did not tell which LTTng version you used as a comparison basis.
3 - You did not look at the current state of code before posting your
comparison (as the comment in your patch header shows).
4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
the default instrumentation ? Did you also enable system call tracing
with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
makes a huge difference on the amount of information gathered.
So let's have a close look at them. I'm re-doing the benchmarks based on
the current LTTng version below.
> Hello All,
>
> We have been working on building a kernel tracer for a production environment.
> We first considered to use the existing markers code, but found it a bit
> heavyweight.
Have you looked at :
1 - The tracepoints currently in the kernel
2 - The _current_ version of markers in the LTTng tree ?
> So we developed a new kernel tracing prototype that shares
> many similarities with markers but also includes several simplifications that
> give us considerable performance benefits. For example, we pass the size of
> the trace event directly to the probe function instead of using va_arg as is
> done in markers, which saves us the overhead of variable passing.
The markers in the LTTng tree only use va_args for "debug-style"
tracing. All the tracing fast paths are now done with custom probe
functions which take the arguments from the tracepoint and write them
into the trace buffers directly in C. When the event payload size is already
known statically, there is no dynamic computation cost at all : it's
simply passed to the ltt-relay write function.
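
To make this concrete, a specialized probe looks conceptually like the
following sketch (simplified for illustration : the ltt_relay_write_event
name, its signature, the event id and the event fields are assumptions, not
the exact LTTng API) :

/*
 * Sketch of a specialized probe : the payload layout is known at compile
 * time, so sizeof(payload) is a constant and no va_arg parsing happens on
 * the fast path.
 */
#define EVENT_SCHED_SWITCH 42	/* hypothetical event id */

extern void ltt_relay_write_event(unsigned int event_id,
				  const void *data, size_t len); /* assumed */

struct sched_switch_payload {
	pid_t prev_pid;
	pid_t next_pid;
	long prev_state;
} __attribute__((packed));

static void probe_sched_switch(struct task_struct *prev,
			       struct task_struct *next)
{
	struct sched_switch_payload payload = {
		.prev_pid = prev->pid,
		.next_pid = next->pid,
		.prev_state = prev->state,
	};

	/* Compile-time constant size, passed straight to the writer. */
	ltt_relay_write_event(EVENT_SCHED_SWITCH, &payload, sizeof(payload));
}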
> Also, we
> associate a single probe function to each event, so we can directly call the
> probe function when tracing is enabled.
You seem to imply that being stuck with a single probe function is
somehow superior to having tiny per-event callbacks which then call into a
buffer write function ? Why ?
> Our measurements show that our
> current tracing prototype introduces much lower overhead compared with
> lttng (we use lttng as a comparison because markers does not provide the
> implementation to record trace data).
Which LTTng version, kernel version, LTTng config and runtime options did
you use ?
> E.g., with dbench, lttng introduces
> 20~80% overhead on our testing platforms while the overhead with our
> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
> overhead while the overhead added by our kernel tracer is usually within
> 15%; with kernbench (time to compile a Linux kernel), the overhead
> added by lttng is around 4%~18% while the overhead added by our
> kernel tracer is within 2.5%.
>
I'm re-running the dbench and tbench tests :
With LTTng 0.106, kernel 2.6.29-rc7
8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
Arming all the default LTTng instrumentation (168 tracepoints),
Tracing in flight recorder mode, including system call entry/exit
tracing and block I/O tracing (blktrace instrumentation), I get :
Test / config | Vanilla kernel | LTTng 0.106 | slowdown
dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
As an example of the effect of custom probes vs va_args-based probes,
let's take the tbench example :
| Vanilla | LTTng 0.106 | slowdown
tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
The LTTng trace statistics show the highest event-rate events :
Some are the usual ones (lttng therefore already implements custom probes for them) :
42004 sched_try_wakeup
79754 timer_set
83973 sched_schedule
84317 softirq_exit
84318 softirq_entry
95370 page_free
95412 page_alloc
267247 syscall_entry
267247 syscall_exit
Some are more tbench workload-specific :
16915 socket_sendmsg
18911 napi_complete
18911 napi_schedule
18912 dev_receive
18914 dev_xmit
18920 napi_poll
33831 socket_recvmsg
59860 read
So let's see what happens if I implement those as custom probes and do a bit of
tuning in the tracer hot path :
| Vanilla | LTTng 0.107 | slowdown
Custom probes :
tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
With tracer hot path tuning (inline hot path, function calls for slow path, will
be in 0.108) :
tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
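
The tuning follows the classic pattern : keep the common case inline and push
the rare cases out of line. Roughly (a sketch, not the actual 0.108 code; the
ltt_buf layout is made up for illustration) :

struct ltt_buf {
	char *write_ptr;	/* current write position */
	size_t space_left;	/* bytes left in the current subbuffer */
};

/* rare case (subbuffer switch, lost event accounting) : out of line */
extern void *ltt_reserve_slow(struct ltt_buf *buf, size_t len);

static inline void *ltt_reserve(struct ltt_buf *buf, size_t len)
{
	if (likely(len <= buf->space_left)) {
		void *slot = buf->write_ptr;	/* common case : stays inline */

		buf->write_ptr += len;
		buf->space_left -= len;
		return slot;
	}
	return ltt_reserve_slow(buf, len);
}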
kernbench :
Average Optimal load -j 32 Run
Vanilla 2.6.29-rc7 :
Elapsed Time 48.238 (0.117346)
With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
Elapsed Time 50.226 (0.0750333)
Slowdown : 4.1 %
So except for the tbench workload on LTTng 0.106, which still had a 52 %
performance impact (brought down to 35 % for the next LTTng release) due to
its specific workload and the lack of specialized probes, the impact is well
below your results (e.g. 12.3 % vs 20-80 % for dbench).
Given the huge discrepancies between your results and mine, could you give us
more detail about your benchmarks ? In particular, it would be good to know
the enabled instrumentation set, the amount of MB/s generated by both the
ktrace and LTTng runs (to make sure we are comparing the same traffic), and
the LTTng version you used as a comparison basis.
Thanks,
Mathieu
> The patch below is the current prototype we have implemented. It uses
> Steve's unified trace buffer code with some small extensions to support
> buffer mmap. The patch is probably still buggy, but we would like to get
> some early feedback from the community.
>
> Signed-off-by: Jiaying Zhang <[email protected]>
>
> Index: git-linux/init/Kconfig
> ===================================================================
> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
> +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
> @@ -950,6 +950,15 @@
> Place an empty function call at each marker site. Can be
> dynamically changed for a probe function.
>
> +config KTRACE
> + bool "Enable ktrace support"
> + select DEBUG_FS
> + select TRACING
> + select RING_BUFFER
> + help
> + Ktrace is a kernel tracing tool that allows you to trace
> + kernel events by inserting trace points at proper places.
> +
> source "arch/Kconfig"
>
> endmenu # General setup
> Index: git-linux/kernel/Makefile
> ===================================================================
> --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
> +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
> @@ -86,6 +86,7 @@
> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
> obj-$(CONFIG_MARKERS) += marker.o
> +obj-$(CONFIG_KTRACE) += ktrace.o
> obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
> obj-$(CONFIG_LATENCYTOP) += latencytop.o
> obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
> Index: git-linux/include/asm-generic/vmlinux.lds.h
> ===================================================================
> --- git-linux.orig/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:37.000000000 -0800
> +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:39.000000000 -0800
> @@ -73,6 +73,9 @@
> MEM_KEEP(init.data) \
> MEM_KEEP(exit.data) \
> . = ALIGN(8); \
> + VMLINUX_SYMBOL(__start___ktraces) = .; \
> + *(__ktrace) \
> + VMLINUX_SYMBOL(__stop___ktraces) = .; \
> VMLINUX_SYMBOL(__start___markers) = .; \
> *(__markers) \
> VMLINUX_SYMBOL(__stop___markers) = .; \
> @@ -89,6 +92,7 @@
> VMLINUX_SYMBOL(__start_rodata) = .; \
> *(.rodata) *(.rodata.*) \
> *(__vermagic) /* Kernel version magic */ \
> + *(__ktrace_strings) /* Ktrace: strings */ \
> *(__markers_strings) /* Markers: strings */ \
> *(__tracepoints_strings)/* Tracepoints: strings */ \
> } \
> Index: git-linux/kernel/module.c
> ===================================================================
> --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
> +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
> @@ -50,6 +50,7 @@
> #include <asm/sections.h>
> #include <linux/tracepoint.h>
> #include <linux/ftrace.h>
> +#include <linux/ktrace.h>
> #include <linux/async.h>
>
> #if 0
> @@ -2145,6 +2146,10 @@
> sizeof(*mod->tracepoints),
> &mod->num_tracepoints);
> #endif
> +#ifdef CONFIG_KTRACE
> + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
> + sizeof(*mod->ktrace), &mod->num_ktrace);
> +#endif
>
> #ifdef CONFIG_MODVERSIONS
> if ((mod->num_syms && !mod->crcs)
> Index: git-linux/include/linux/module.h
> ===================================================================
> --- git-linux.orig/include/linux/module.h 2009-02-19 14:58:37.000000000 -0800
> +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
> @@ -44,6 +44,7 @@
> };
>
> struct module;
> +struct kernel_trace;
>
> struct module_attribute {
> struct attribute attr;
> @@ -347,6 +348,11 @@
> /* Reference counts */
> struct module_ref ref[NR_CPUS];
> #endif
> +#ifdef CONFIG_KTRACE
> + struct kernel_trace *ktrace;
> + unsigned int num_ktrace;
> +#endif
> +
> };
> #ifndef MODULE_ARCH_INIT
> #define MODULE_ARCH_INIT {}
> Index: git-linux/include/linux/ktrace.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
> @@ -0,0 +1,83 @@
> +#ifndef _LINUX_KTRACE_H
> +#define _LINUX_KTRACE_H
> +
> +#include <linux/types.h>
> +
> +struct kernel_trace;
> +
> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
> +
> +struct kernel_trace {
> + const char *name;
> + const char *format; /* format string describing variable list */
> + size_t *stroff; /* offsets of string variables */
> + /* 31 bit event_id is converted to 16 bit when entering the buffer */
> + u32 enabled:1, event_id:31;
> + ktrace_probe_func *func; /* probe function */
> + struct list_head list; /* list head linked to the hash table entry */
> +} __attribute__((aligned(8)));
> +
> +extern int ktrace_enabled;
> +
> +/*
> + * Make sure the alignment of the structure in the __ktrace section will
> + * not add unwanted padding between the beginning of the section and the
> + * structure. Force alignment to the same alignment as the section start.
> + */
> +
> +#define DEFINE_KTRACE_STRUCT(name) \
> + struct __attribute__((packed)) ktrace_struct_##name
> +
> +#ifdef CONFIG_KTRACE
> +/*
> + * DO_TRACE is the macro that is called from each trace point. Each trace
> + * point is associated with a static ktrace object that is located in a
> + * special __ktrace section and is uniquely identified by the event name and
> + * event format. Take a look at the files under include/trace for example
> + * usage. Note that we call the trace probe function only if the global
> + * tracing is enabled and the tracing for the associated event is enabled,
> + * so we only introduce the overhead of a predicted branch when
> + * tracing is disabled.
> + */
> +#define DO_TRACE(name, format, args...) \
> + do { \
> + static const char __kstrtab_##name[] \
> + __attribute__((section("__ktrace_strings"))) \
> + = #name "\0" format; \
> + static struct kernel_trace __ktrace_##name \
> + __attribute__((section("__ktrace"), aligned(8))) = \
> + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
> + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
> + __ktrace_check_format(format, ## args); \
> + if (unlikely(ktrace_enabled) && \
> + unlikely(__ktrace_##name.enabled)) { \
> + struct ktrace_struct_##name karg = { args }; \
> + (*__ktrace_##name.func) \
> + (&__ktrace_##name, &karg, sizeof(karg)); \
> + } \
> + } while (0)
> +
> +#else /* !CONFIG_KTRACE */
> +#define DO_TRACE(name, format, args...)
> +#endif
> +
> +/* To be used for string format validity checking with gcc */
> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
> +{
> +}
> +
> +#define __ktrace_check_format(format, args...) \
> + do { \
> + if (0) \
> + ___ktrace_check_format(format, ## args); \
> + } while (0)
> +
> +
> +/* get the trace buffer information */
> +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
> +
> +#endif
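
(Side note for readers : from the DO_TRACE and DEFINE_KTRACE_STRUCT
definitions above, a call site would presumably look like the sketch below.
The event name and fields are made up; this is not part of the posted patch.)

DEFINE_KTRACE_STRUCT(open_exit) {
	const char *filename;
	int fd;
};

static inline void trace_open_exit(const char *filename, int fd)
{
	/* karg is built as { filename, fd }, matching the struct above */
	DO_TRACE(open_exit, "%s %d", filename, fd);
}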
> Index: git-linux/kernel/ktrace.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
> @@ -0,0 +1,872 @@
> +/*
> + * kernel/ktrace.c
> + *
> + * Implementation of kernel tracing for the Linux 2.6 kernel.
> + *
> + * Copyright 2008- Google Inc.
> + * Original Author: Jiaying Zhang
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/bitmap.h>
> +#include <linux/poll.h>
> +#include <linux/errno.h>
> +#include <linux/sched.h>
> +#include <linux/linkage.h>
> +#include <linux/debugfs.h>
> +#include <linux/timer.h>
> +#include <linux/delay.h>
> +#include <linux/jhash.h>
> +#include <linux/ctype.h>
> +#include <linux/namei.h>
> +#include <linux/ktrace.h>
> +#include <linux/ring_buffer.h>
> +
> +static char *ktrace_version = "1.0.0";
> +int ktrace_enabled;
> +EXPORT_SYMBOL(ktrace_enabled);
> +static uint16_t ktrace_next_id;
> +
> +extern struct kernel_trace __start___ktraces[];
> +extern struct kernel_trace __stop___ktraces[];
> +
> +static DEFINE_MUTEX(ktrace_mutex);
> +
> +/* ktrace hash table entry structure and variables */
> +struct ktrace_entry {
> + struct hlist_node hlist;
> + char *name;
> + char *format;
> + size_t *stroff;
> + ktrace_probe_func *func;
> + u32 enabled:1, event_id:31;
> + struct list_head klist; /* list of loaded ktraces */
> +};
> +
> +#define KTRACE_HASH_BITS 6
> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
> +
> +/* debugfs directory variables */
> +static struct dentry *tracedir;
> +static struct dentry *eventdir;
> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
> +
> +/* ring buffer code */
> +static unsigned long trace_buf_size = 65536UL;
> +struct ring_buffer *trace_buffer;
> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
> +static struct timer_list trace_timer; /* reader wake-up timer */
> +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
> +static struct kref ktrace_kref;
> +
> +static void
> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
> +{
> + struct ring_buffer_event *event;
> + void *trace_event;
> + unsigned long flag;
> +
> + event = ring_buffer_lock_reserve(trace_buffer,
> + sizeof(uint16_t) + event_size, &flag);
> + if (!event)
> + return;
> + trace_event = ring_buffer_event_data(event);
> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
> + memcpy(trace_event + sizeof(uint16_t), data, event_size);
> + ring_buffer_unlock_commit(trace_buffer, event, flag);
> +}
> +
> +/* special probe function for events that contain string arguments */
> +static void
> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
> +{
> + struct ring_buffer_event *event;
> + void *trace_event, *p;
> + unsigned long flag;
> + size_t *offset, scanned;
> + char *string;
> +
> + /*
> + * Get the real length of the event, i.e., we use the stroff array to
> + * locate string variables in the passed-in trace data and use their
> + * length to replace the size of the string pointers.
> + */
> + for (offset = kt->stroff; *offset != -1; offset++) {
> + string = *(char **) (data + *offset);
> + event_size += strlen(string) + 1 - sizeof(char *);
> + }
> + event = ring_buffer_lock_reserve(trace_buffer,
> + sizeof(uint16_t) + event_size, &flag);
> + if (!event)
> + return;
> + trace_event = ring_buffer_event_data(event);
> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
> + p = trace_event + sizeof(uint16_t);
> + /*
> + * Copy the trace data into the buffer. For string variables, we copy the
> + * string itself into the buffer; otherwise, the passed-in data is copied.
> + */
> + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
> + memcpy(p, data + scanned, *offset - scanned);
> + p += *offset - scanned;
> + string = *(char **) (data + *offset);
> + memcpy(p, string, strlen(string) + 1);
> + p += strlen(string) + 1;
> + scanned = *offset + sizeof(char *);
> + }
> + memcpy(p, data + scanned,
> + trace_event + sizeof(uint16_t) + event_size - p);
> + ring_buffer_unlock_commit(trace_buffer, event, flag);
> +}
> +
> +/* timer function used to defer ring buffer reader waking */
> +static void wakeup_readers(unsigned long data)
> +{
> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> + wake_up_interruptible(&trace_wait);
> + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
> +}
> +
> +/* function for reading ktrace metadata info */
> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
> +{
> + seq_printf(seq, "version %s\n", ktrace_version);
> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
> +}
> +
> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> + ++*pos;
> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
> +}
> +
> +static int tracing_info_show(struct seq_file *seq, void *v)
> +{
> + struct hlist_head *head = v;
> + struct hlist_node *node;
> + struct ktrace_entry *entry;
> + hlist_for_each_entry(entry, node, head, hlist)
> + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
> + entry->name, entry->format, entry->event_id,
> + entry->enabled ? "enabled" : "disabled");
> + return 0;
> +}
> +
> +static void tracing_info_stop(struct seq_file *seq, void *v)
> +{
> +}
> +
> +static const struct seq_operations tracing_info_seq_ops = {
> + .start = tracing_info_start,
> + .next = tracing_info_next,
> + .show = tracing_info_show,
> + .stop = tracing_info_stop,
> +};
> +
> +static int tracing_info_open(struct inode *inode, struct file *file)
> +{
> + return seq_open(file, &tracing_info_seq_ops);
> +}
> +
> +static const struct file_operations traceinfo_file_operations = {
> + .open = tracing_info_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = seq_release,
> +};
> +
> +/*
> + * wrapper function used by debugfs write operation.
> + * func: handling function that does real work
> + */
> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
> +{
> + int ret;
> + char buf[64];
> + unsigned long val;
> +
> + if (cnt >= sizeof(buf))
> + return -EINVAL;
> + if (copy_from_user(&buf, ubuf, cnt))
> + return -EFAULT;
> + buf[cnt] = 0;
> +
> + ret = strict_strtoul(buf, 10, &val);
> + if (ret < 0)
> + return ret;
> + val = !!val;
> +
> + func(filp, val, &cnt);
> +
> + filp->f_pos += cnt;
> + return cnt;
> +}
> +
> +/* functions for reading/writing the global 'enabled' control file */
> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + int ret;
> + char buf[64];
> + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> +}
> +
> +static void __tracing_control_write(struct file *filp,
> + unsigned long val, size_t *cnt)
> +{
> + if (val ^ ktrace_enabled) {
> + if (val) {
> + trace_timer.expires =
> + jiffies + TRACEREAD_WAKEUP_INTERVAL;
> + add_timer(&trace_timer);
> + ktrace_enabled = 1;
> + } else {
> + ktrace_enabled = 0;
> + del_timer_sync(&trace_timer);
> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> + wake_up_interruptible(&trace_wait);
> + }
> + }
> +}
> +
> +static ssize_t tracing_control_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> + __tracing_control_write);
> +}
> +
> +static const struct file_operations tracecontrol_file_operations = {
> + .read = tracing_control_read,
> + .write = tracing_control_write,
> +};
> +
> +/* functions for reading/writing a trace event file */
> +static int update_ktrace(struct ktrace_entry *entry,
> + ktrace_probe_func *func, int enabled)
> +{
> + struct kernel_trace *iter;
> + /* no need to update the list if the tracing is not initialized */
> + if (!tracedir)
> + return 0;
> + entry->enabled = enabled;
> + entry->func = func;
> + list_for_each_entry(iter, &entry->klist, list) {
> + iter->stroff = entry->stroff;
> + iter->func = func;
> + iter->enabled = enabled;
> + }
> + return 0;
> +}
> +
> +static int ktrace_probe_register(struct ktrace_entry *entry,
> + ktrace_probe_func *probefunc)
> +{
> + return update_ktrace(entry, probefunc, 1);
> +}
> +
> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
> +{
> + return update_ktrace(entry, NULL, 0);
> +}
> +
> +static int ktrace_control_open(struct inode *inode, struct file *filp)
> +{
> + filp->private_data = inode->i_private;
> + return 0;
> +}
> +
> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + int ret;
> + char buf[64];
> + struct ktrace_entry *entry = filp->private_data;
> +
> + ret = snprintf(buf, 64, "%u\n", entry->enabled);
> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> +}
> +
> +/*
> + * Check whether the event contains string arguments. Return 0 if not;
> + * otherwise record the offsets of the string arguments in the ktrace
> + * structure according to the event format and return 1.
> + * The offset info is later used to store the string arguments into
> + * the trace buffer during trace probe.
> + */
> +static int contain_string_arguments(struct ktrace_entry *entry)
> +{
> + int count = 0;
> + char *fmt;
> + int qualifier; /* 'h', 'l', or 'L' for integer fields */
> + size_t offset = 0;
> +
> + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
> + count++;
> + if (!count)
> + return 0;
> + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
> + if (!entry->stroff)
> + return -ENOMEM;
> +
> + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
> + if (*fmt != '%')
> + continue;
> +repeat:
> + ++fmt;
> + switch (*fmt) {
> + case '-':
> + case '+':
> + case ' ':
> + case '#':
> + case '0':
> + ++fmt;
> + goto repeat;
> + }
> +
> + while (isdigit(*fmt))
> + fmt++;
> +
> + /* get the conversion qualifier */
> + qualifier = -1;
> + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
> + qualifier = *fmt;
> + ++fmt;
> + if (qualifier == 'l' && *fmt == 'l') {
> + qualifier = 'L';
> + ++fmt;
> + }
> + }
> +
> + switch (*fmt) {
> + case 'c':
> + offset += sizeof(char);
> + continue;
> + case 's':
> + entry->stroff[count] = offset;
> + count++;
> + offset += sizeof(char *);
> + continue;
> + case 'p':
> + offset += sizeof(void *);
> + continue;
> + case 'd':
> + case 'i':
> + case 'o':
> + case 'u':
> + case 'x':
> + case 'X':
> + switch (qualifier) {
> + case 'L':
> + offset += sizeof(long long);
> + break;
> + case 'l':
> + offset += sizeof(long);
> + break;
> + case 'h':
> + offset += sizeof(short);
> + break;
> + default:
> + offset += sizeof(int);
> + break;
> + }
> + break;
> + default:
> + printk(KERN_WARNING "unknown format '%c'", *fmt);
> + continue;
> + }
> + }
> + entry->stroff[count] = -1;
> + return 1;
> +}
> +
> +static void __ktrace_control_write(struct file *filp,
> + unsigned long val, size_t *cnt)
> +{
> + struct ktrace_entry *entry = filp->private_data;
> + int ret;
> +
> + mutex_lock(&ktrace_mutex);
> + if (val ^ entry->enabled) {
> + if (val) {
> + ret = contain_string_arguments(entry);
> + if (ret == 0)
> + ktrace_probe_register(entry, ring_buffer_probe);
> + else if (ret > 0)
> + ktrace_probe_register(entry, string_probe);
> + else
> + *cnt = ret;
> + } else
> + ktrace_probe_unregister(entry);
> + }
> + mutex_unlock(&ktrace_mutex);
> +}
> +
> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> + __ktrace_control_write);
> +}
> +
> +static const struct file_operations ktracecontrol_file_operations = {
> + .open = ktrace_control_open,
> + .read = ktrace_control_read,
> + .write = ktrace_control_write,
> +};
> +
> +/* functions for adding/removing a trace event. Protected by mutex lock.
> + * Called during initialization or after loading a module */
> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
> +{
> + struct ktrace_entry *entry;
> + size_t name_len = strlen(ktrace->name) + 1;
> + u32 hash = jhash(ktrace->name, name_len-1, 0);
> + struct hlist_head *head;
> + struct hlist_node *node;
> + struct dentry *dentry;
> +
> + if (!tracedir)
> + return 0;
> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> + hlist_for_each_entry(entry, node, head, hlist) {
> + if (!strcmp(ktrace->name, entry->name)) {
> + if (strcmp(ktrace->format, entry->format)) {
> + printk(KERN_WARNING "the format of tracepoint "
> + "\'%s\' changes from \'%s\' to \'%s\'."
> + "Dynamic changing of trace format is "
> + "not supported yet!\n",
> + ktrace->name, entry->format,
> + ktrace->format);
> + return ERR_PTR(-EINVAL);
> + }
> + ktrace->enabled = entry->enabled;
> + ktrace->func = entry->func;
> + ktrace->event_id = entry->event_id;
> + if (list_empty(&entry->klist))
> + goto add_head;
> + list_add_tail(&ktrace->list, &entry->klist);
> + return entry;
> + }
> + }
> + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
> + if (!entry)
> + return ERR_PTR(-ENOMEM);
> + entry->name = kmalloc(name_len, GFP_KERNEL);
> + if (!entry->name) {
> + kfree(entry);
> + return ERR_PTR(-ENOMEM);
> + }
> + memcpy(entry->name, ktrace->name, name_len);
> + if (ktrace->format) {
> + size_t format_len = strlen(ktrace->format) + 1;
> + entry->format = kmalloc(format_len, GFP_KERNEL);
> + if (!entry->format) {
> + kfree(entry->name);
> + kfree(entry);
> + return ERR_PTR(-ENOMEM);
> + }
> + memcpy(entry->format, ktrace->format, format_len);
> + } else
> + entry->format = NULL;
> + entry->func = ktrace->func;
> + entry->enabled = 0;
> + ktrace->event_id = entry->event_id = ktrace_next_id++;
> + entry->stroff = NULL;
> + INIT_LIST_HEAD(&entry->klist);
> + hlist_add_head(&entry->hlist, head);
> +add_head:
> + list_add(&ktrace->list, &entry->klist);
> + dentry = debugfs_create_file(entry->name, 0660, eventdir,
> + entry, &ktracecontrol_file_operations);
> + if (!dentry)
> + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
> + entry->name);
> + return entry;
> +}
> +
> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
> +{
> + struct ktrace_entry *entry;
> + size_t name_len = strlen(ktrace->name) + 1;
> + u32 hash = jhash(ktrace->name, name_len-1, 0);
> + struct hlist_head *head;
> + struct hlist_node *node, *temp;
> + struct dentry *dentry;
> +
> + if (!tracedir)
> + return 0;
> + list_del(&ktrace->list);
> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
> + if (!strcmp(ktrace->name, entry->name)) {
> + if (list_empty(&entry->klist)) {
> + dentry = lookup_one_len(entry->name,
> + eventdir, strlen(entry->name));
> + if (dentry && !IS_ERR(dentry)) {
> + debugfs_remove(dentry);
> + dput(dentry);
> + }
> + entry->enabled = 0;
> + if (free) {
> + hlist_del(&entry->hlist);
> + kfree(entry->name);
> + kfree(entry->format);
> + kfree(entry->stroff);
> + kfree(entry);
> + }
> + }
> + return 0;
> + }
> + return -1;
> +}
> +
> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
> + * Called at the end of module load/unload. */
> +static int ktrace_module_notify(struct notifier_block *self,
> + unsigned long val, void *data)
> +{
> +#ifdef CONFIG_MODULES
> + struct kernel_trace *iter;
> + struct module *mod = data;
> +
> + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
> + return 0;
> + mutex_lock(&ktrace_mutex);
> + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
> + if (val == MODULE_STATE_COMING)
> + add_ktrace(iter);
> + else
> + remove_ktrace(iter, 0);
> + mutex_unlock(&ktrace_mutex);
> +#endif
> + return 0;
> +}
> +
> +static struct notifier_block ktrace_module_nb = {
> + .notifier_call = ktrace_module_notify,
> +};
> +
> +/* functions for user-space programs to read data from tracing buffer */
> +static int trace_data_open(struct inode *inode, struct file *filp)
> +{
> + filp->private_data = inode->i_private;
> + kref_get(&ktrace_kref);
> + return 0;
> +}
> +
> +static void release_trace_buffer(struct kref *kref)
> +{
> + ring_buffer_free(trace_buffer);
> + trace_buffer = NULL;
> +}
> +
> +static int trace_data_release(struct inode *inode, struct file *filp)
> +{
> + kref_put(&ktrace_kref, release_trace_buffer);
> + return 0;
> +}
> +
> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
> +{
> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> + if (filp->f_mode & FMODE_READ) {
> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> + return POLLIN | POLLRDNORM;
> + poll_wait(filp, &trace_wait, poll_table);
> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> + return POLLIN | POLLRDNORM;
> + }
> + return 0;
> +}
> +
> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
> + pgoff_t pgoff = vmf->pgoff;
> + struct page *page;
> +
> + if (!trace_buffer)
> + return VM_FAULT_OOM;
> + page = ring_buffer_get_page(cpu_buffer, pgoff);
> + if (page == NULL)
> + return VM_FAULT_SIGBUS;
> + get_page(page);
> + vmf->page = page;
> + return 0;
> +}
> +
> +static struct vm_operations_struct trace_data_mmap_ops = {
> + .fault = trace_buf_fault,
> +};
> +
> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> +
> + vma->vm_ops = &trace_data_mmap_ops;
> + vma->vm_flags |= VM_DONTEXPAND;
> + vma->vm_private_data = cpu_buffer;
> + return 0;
> +}
> +
> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
> + unsigned int cmd, unsigned long arg)
> +{
> + u32 __user *argp = (u32 __user *)arg;
> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> +
> + if (!trace_buffer || !tracedir)
> + return -EINVAL;
> + switch (cmd) {
> + case KTRACE_BUF_GET_SIZE:
> + {
> + unsigned long bufsize;
> + bufsize = ring_buffer_size(trace_buffer);
> + return put_user((u32)bufsize, argp);
> + }
> + case KTRACE_BUF_GET_PRODUCED:
> + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
> + case KTRACE_BUF_GET_CONSUMED:
> + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
> + case KTRACE_BUF_PUT_CONSUMED:
> + {
> + u32 consumed, consumed_old;
> + int ret;
> +
> + ret = get_user(consumed, argp);
> + if (ret) {
> + printk(KERN_WARNING
> + "error getting consumed value: %d\n", ret);
> + return ret;
> + }
> + consumed_old = ring_buffer_get_consumed(cpu_buffer);
> + if (consumed == consumed_old)
> + return 0;
> + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
> + return 0;
> + }
> + default:
> + return -ENOIOCTLCMD;
> + }
> + return 0;
> +}
> +
> +static const struct file_operations tracedata_file_operations = {
> + .open = trace_data_open,
> + .poll = trace_data_poll,
> + .mmap = trace_data_mmap,
> + .ioctl = trace_data_ioctl,
> + .release = trace_data_release,
> +};
> +
> +/* function to report the number of lost events due to buffer overflow */
> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + int ret;
> + char buf[64];
> + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> +}
> +
> +static const struct file_operations traceoverflow_file_operations = {
> + .read = trace_overflow_read,
> +};
> +
> +/* function to re-initialize trace buffer */
> +static void __trace_buffer_reset(struct file *filp,
> + unsigned long val, size_t *cnt)
> +{
> + if (val && trace_buffer)
> + ring_buffer_reset(trace_buffer);
> +}
> +
> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> + __trace_buffer_reset);
> +}
> +
> +static const struct file_operations tracereset_file_operations = {
> + .write = trace_buffer_reset,
> +};
> +
> +/*
> + * We use debugfs for kernel-user communication. All of the control/info
> + * files are under debugfs/ktrace directory.
> + */
> +static int create_debugfs(void)
> +{
> + struct dentry *entry, *bufdir;
> + struct ring_buffer_per_cpu *cpu_buffer;
> + char *tmpname;
> + int cpu;
> +
> + tracedir = debugfs_create_dir("ktrace", NULL);
> + if (!tracedir) {
> + printk(KERN_WARNING "Couldn't create debugfs directory\n");
> + return -ENOMEM;
> + }
> +
> + /* 'buffers' is the directory that holds trace buffer debugfs files */
> + bufdir = debugfs_create_dir("buffers", tracedir);
> + if (!bufdir) {
> + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
> + return -ENOMEM;
> + }
> +
> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> + if (!tmpname)
> + return -ENOMEM;
> + /* create a debugfs file for each cpu buffer */
> + for_each_possible_cpu(cpu) {
> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
> + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
> + &tracedata_file_operations);
> + if (!entry) {
> + printk(KERN_WARNING
> + "Couldn't create debugfs \'%s\' entry\n",
> + tmpname);
> + return -ENOMEM;
> + }
> + }
> + kfree(tmpname);
> +
> + /* the control file for users to enable/disable global tracing. */
> + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
> + &tracecontrol_file_operations);
> + if (!entry) {
> + printk(KERN_WARNING
> + "Couldn't create debugfs 'enabled' entry\n");
> + return -ENOMEM;
> + }
> +
> + /*
> + * the debugfs file that displays the name, format etc. of every
> + * supported trace event. The file is to be used by the user-space
> + * trace parser to analyze the collected trace data.
> + */
> + entry = debugfs_create_file("info", 0444, tracedir, NULL,
> + &traceinfo_file_operations);
> + if (!entry) {
> + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
> + return -ENOMEM;
> + }
> +
> + /*
> + * the debugfs file that reports the number of events
> + * lost due to buffer overflow
> + */
> + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
> + &traceoverflow_file_operations);
> + if (!entry) {
> + printk(KERN_WARNING
> + "Couldn't create debugfs 'overflow' entry\n");
> + return -ENOMEM;
> + }
> +
> + /* the debugfs file that resets trace buffer upon write */
> + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
> + &tracereset_file_operations);
> + if (!entry) {
> + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
> + return -ENOMEM;
> + }
> +
> + /* create the directory that holds the control files for every event */
> + eventdir = debugfs_create_dir("events", tracedir);
> + if (!eventdir) {
> + printk(KERN_WARNING "Couldn't create 'events' directory\n");
> + return -ENOMEM;
> + }
> + return 0;
> +}
> +
> +static void remove_debugfs_file(struct dentry *dir, char *name)
> +{
> + struct dentry *dentry;
> + dentry = lookup_one_len(name, dir, strlen(name));
> + if (dentry && !IS_ERR(dentry)) {
> + debugfs_remove(dentry);
> + dput(dentry);
> + }
> +}
> +
> +static int remove_debugfs(void)
> +{
> + struct dentry *bufdir;
> + char *tmpname;
> + int cpu;
> +
> + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
> + if (!bufdir || IS_ERR(bufdir))
> + return -EIO;
> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> + if (!tmpname)
> + return -ENOMEM;
> + for_each_possible_cpu(cpu) {
> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> + remove_debugfs_file(bufdir, tmpname);
> + }
> + kfree(tmpname);
> + debugfs_remove(bufdir);
> +
> + remove_debugfs_file(tracedir, "enabled");
> + remove_debugfs_file(tracedir, "info");
> + debugfs_remove(eventdir);
> + debugfs_remove(tracedir);
> + return 0;
> +}
> +
> +static int __init init_ktrace(void)
> +{
> + struct kernel_trace *iter;
> + int i, err;
> +
> + ktrace_next_id = 0;
> + kref_set(&ktrace_kref, 1);
> + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
> + if (trace_buffer == NULL)
> + return -ENOMEM;
> + setup_timer(&trace_timer, wakeup_readers, 0);
> + for (i = 0; i < KTRACE_HASH_SIZE; i++)
> + INIT_HLIST_HEAD(&ktrace_table[i]);
> +
> + mutex_lock(&ktrace_mutex);
> + err = create_debugfs();
> + if (err != 0)
> + goto out;
> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> + add_ktrace(iter);
> + err = register_module_notifier(&ktrace_module_nb);
> +out:
> + mutex_unlock(&ktrace_mutex);
> + return err;
> +}
> +module_init(init_ktrace);
> +
> +static void __exit exit_ktrace(void)
> +{
> + struct kernel_trace *iter;
> +
> + if (ktrace_enabled) {
> + ktrace_enabled = 0;
> + del_timer_sync(&trace_timer);
> + }
> + mutex_lock(&ktrace_mutex);
> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> + remove_ktrace(iter, 1);
> + unregister_module_notifier(&ktrace_module_nb);
> + remove_debugfs();
> + mutex_unlock(&ktrace_mutex);
> +
> + kref_put(&ktrace_kref, release_trace_buffer);
> +}
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Mathieu Desnoyers ([email protected]) wrote:
> Hi Jiaying,
>
> I'm very interested to hear about the details of your test-case, because
> what you are talking about here, and the numbers that you get, are, at
> the very least, not representative of the current LTTng. Perhaps you
> used a version right after the kernel summit and stopped following the
> improvements we have done since then ? I also wonder about the instrumentation
> set you enabled in your tests, and whether it's exactly the same in the ktrace
> and LTTng runs (especially system call tracing). Please note that following the
> Kernel Summit 2008, a more active development cycle began, and the time was not
> right for optimisation. But since then, a lot of improvement has been made
> on this front.
>
> First I'd like to express my concern about the way you posted your
> numbers :
>
> 1 - You did not CC anyone from the LTTng project.
> 2 - You did not tell which LTTng version you used as a comparison basis.
> 3 - You did not look at the current state of code before posting your
> comparison (as the comment in your patch header shows).
> 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
> the default instrumentation ? Did you also enable system call tracing
> with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
> makes a huge difference on the amount of information gathered.
>
> So let's have a close look at them. I'm re-doing the benchmarks based on
> the current LTTng version below.
>
> > Hello All,
> >
> > We have been working on building a kernel tracer for a production environment.
> > We first considered to use the existing markers code, but found it a bit
> > heavyweight.
>
> Have you looked at :
>
> 1 - The tracepoints currently in the kernel
> 2 - The _current_ version of markers in the LTTng tree ?
>
> > So we developed a new kernel tracing prototype that shares
> > many similarities with markers but also includes several simplifications that
> > give us considerable performance benefits. For example, we pass the size of
> > the trace event directly to the probe function instead of using va_arg as is
> > done in markers, which saves us the overhead of variable passing.
>
> The markers in the LTTng tree only use va_args for "debug-style"
> tracing. All the tracing fast paths are now done with custom probe
> functions which take the arguments from the tracepoint and write them
> into the trace buffers directly in C. When the event payload size is already
> known statically, there is no dynamic computation cost at all : it's
> simply passed to the ltt-relay write function.
>
> > Also, we
> > associate a single probe function to each event, so we can directly call the
> > probe function when tracing is enabled.
>
> You seem to imply that being stuck with a single probe function is
> somehow superior to having tiny per-event callbacks which then call into a
> buffer write function ? Why ?
>
> > Our measurements show that our
> > current tracing prototype introduces much lower overhead compared with
> > lttng (we use lttng as a comparison because markers does not provide the
> > implementation to record trace data).
>
> Which LTTng version, kernel version, LTTng config and runtime options did
> you use ?
>
> > E.g., with dbench, lttng introduces
> > 20~80% overhead on our testing platforms while the overhead with our
> > kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
> > overhead while the overhead added by our kernel tracer is usually within
> > 15%; with kernbench (time to compile a Linux kernel), the overhead
> > added by lttng is around 4%~18% while the overhead added by our
> > kernel tracer is within 2.5%.
> >
>
> I'm re-running the dbench and tbench tests :
>
> With LTTng 0.106, kernel 2.6.29-rc7
> 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
>
> Arming all the default LTTng instrumentation (168 tracepoints),
> Tracing in flight recorder mode, including system call entry/exit
> tracing and block I/O tracing (blktrace instrumentation), I get :
>
> Test / config | Vanilla kernel | LTTng 0.106 | slowdown
> dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
>
> As an example of the effect of custom probes vs va_args-based probes,
> let's take the tbench example :
>
> | Vanilla | LTTng 0.106 | slowdown
> tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
>
> The LTTng trace statistics show the highest event-rate events :
>
> Some are the usual ones (lttng therefore already implements custom probes for them) :
>
> 42004 sched_try_wakeup
> 79754 timer_set
> 83973 sched_schedule
> 84317 softirq_exit
> 84318 softirq_entry
> 95370 page_free
> 95412 page_alloc
> 267247 syscall_entry
> 267247 syscall_exit
>
> Some are more tbench workload-specific :
>
> 16915 socket_sendmsg
> 18911 napi_complete
> 18911 napi_schedule
> 18912 dev_receive
> 18914 dev_xmit
> 18920 napi_poll
> 33831 socket_recvmsg
> 59860 read
>
>
> So let's see what happens if I implement those as custom probes and do a bit of
> tuning in the tracer hot path :
>
> | Vanilla | LTTng 0.107 | slowdown
> Custom probes :
> tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
>
> With tracer hot path tuning (inline hot path, function calls for slow path, will
> be in 0.108) :
> tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
>
After fixing false sharing problems found using oprofile (I got burned by
kzalloc_node, thinking it was somehow cacheline-aligned, but it is not), here
are the updated tbench performance results (these will be in lttng 0.108) :
Vanilla : 2070 MB/sec
LTTng : 1405 MB/sec (32 % slowdown)
Removing the rdtsc barriers (ouch, the kernel uses mfence and lfence in the
recent x86 implementations). The question is : do we keep them or not, or is a
sync_core() (cpuid, which synchronizes the core) enough ?
LTTng, no rdtsc barriers : 1450 MB/sec (29 % slowdown)
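
For reference, the two timestamp-read variants being compared look roughly
like this (x86-64 sketch; the kernel's actual rdtsc_barrier() implementation
differs per CPU vendor) :

static inline unsigned long long rdtsc_fenced(void)
{
	unsigned int lo, hi;

	/* mfence + lfence order rdtsc against surrounding loads/stores */
	asm volatile("mfence; lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

static inline unsigned long long rdtsc_serialized(void)
{
	unsigned int lo, hi;

	/* cpuid fully serializes the core before rdtsc executes */
	asm volatile("xorl %%eax, %%eax\n\t"
		     "cpuid\n\t"
		     "rdtsc"
		     : "=a" (lo), "=d" (hi) : : "ebx", "ecx");
	return ((unsigned long long)hi << 32) | lo;
}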
Some of the most active instructions in the tracer fast path
(samples, address, % of samples, annotation) :
....
 2982  ffffffff803fefc2   3.0242  rcu list of active traces prefetch0
 5000  ffffffff803ff379   5.0708  xadd (commit)
13525  ffffffff803ff04f  13.7166  rdtsc
Basically, simply using oprofile opreport -i on _ltt_specialized_trace
and sorting by number of samples highlights the fastpath hot spots.
Mathieu
> > + } else {
> > + ktrace_enabled = 0;
> > + del_timer_sync(&trace_timer);
> > + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> > + wake_up_interruptible(&trace_wait);
> > + }
> > + }
> > +}
> > +
> > +static ssize_t tracing_control_write(struct file *filp, const char
> > __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> > + __tracing_control_write);
> > +}
> > +
> > +static const struct file_operations tracecontrol_file_operations = {
> > + .read = tracing_control_read,
> > + .write = tracing_control_write,
> > +};
> > +
> > +/* functions for reading/writing a trace event file */
> > +static int update_ktrace(struct ktrace_entry *entry,
> > + ktrace_probe_func *func, int enabled)
> > +{
> > + struct kernel_trace *iter;
> > + /* no need to update the list if the tracing is not initialized */
> > + if (!tracedir)
> > + return 0;
> > + entry->enabled = enabled;
> > + entry->func = func;
> > + list_for_each_entry(iter, &entry->klist, list) {
> > + iter->stroff = entry->stroff;
> > + iter->func = func;
> > + iter->enabled = enabled;
> > + }
> > + return 0;
> > +}
> > +
> > +static int ktrace_probe_register(struct ktrace_entry *entry,
> > + ktrace_probe_func *probefunc)
> > +{
> > + return update_ktrace(entry, probefunc, 1);
> > +}
> > +
> > +static int ktrace_probe_unregister(struct ktrace_entry *entry)
> > +{
> > + return update_ktrace(entry, NULL, 0);
> > +}
> > +
> > +static int ktrace_control_open(struct inode *inode, struct file *filp)
> > +{
> > + filp->private_data = inode->i_private;
> > + return 0;
> > +}
> > +
> > +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + int ret;
> > + char buf[64];
> > + struct ktrace_entry *entry = filp->private_data;
> > +
> > + ret = snprintf(buf, 64, "%u\n", entry->enabled);
> > + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> > +}
> > +
> > +/*
> > + * Check whether the event contains string arguments. Return 0 if not;
> > + * otherwise record the offsets of the string arguments in the ktrace
> > + * structure according to the event format and return 1.
> > + * The offset info is later used to store the string arguments into
> > + * the trace buffer during trace probe.
> > + */
> > +static int contain_string_arguments(struct ktrace_entry *entry)
> > +{
> > + int count = 0;
> > + char *fmt;
> > + int qualifier; /* 'h', 'l', or 'L' for integer fields */
> > + size_t offset = 0;
> > +
> > + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
> > + count++;
> > + if (!count)
> > + return 0;
> > + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
> > + if (!entry->stroff)
> > + return -ENOMEM;
> > +
> > + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
> > + if (*fmt != '%')
> > + continue;
> > +repeat:
> > + ++fmt;
> > + switch (*fmt) {
> > + case '-':
> > + case '+':
> > + case ' ':
> > + case '#':
> > + case '0':
> > + goto repeat;
> > + }
> > +
> > + while (isdigit(*fmt))
> > + fmt++;
> > +
> > + /* get the conversion qualifier */
> > + qualifier = -1;
> > + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
> > + qualifier = *fmt;
> > + ++fmt;
> > + if (qualifier == 'l' && *fmt == 'l') {
> > + qualifier = 'L';
> > + ++fmt;
> > + }
> > + }
> > +
> > + switch (*fmt) {
> > + case 'c':
> > + offset += sizeof(char);
> > + continue;
> > + case 's':
> > + entry->stroff[count] = offset;
> > + count++;
> > + offset += sizeof(char *);
> > + continue;
> > + case 'p':
> > + offset += sizeof(void *);
> > + continue;
> > + case 'd':
> > + case 'i':
> > + case 'o':
> > + case 'u':
> > + case 'x':
> > + case 'X':
> > + switch (qualifier) {
> > + case 'L':
> > + offset += sizeof(long long);
> > + break;
> > + case 'l':
> > + offset += sizeof(long);
> > + break;
> > + case 'h':
> > + offset += sizeof(short);
> > + break;
> > + default:
> > + offset += sizeof(int);
> > + break;
> > + }
> > + break;
> > + default:
> > + printk(KERN_WARNING "unknown format '%c'\n", *fmt);
> > + continue;
> > + }
> > + }
> > + entry->stroff[count] = -1;
> > + return 1;
> > +}
> > +
> > +static void __ktrace_control_write(struct file *filp,
> > + unsigned long val, size_t *cnt)
> > +{
> > + struct ktrace_entry *entry = filp->private_data;
> > + int ret;
> > +
> > + mutex_lock(&ktrace_mutex);
> > + if (val ^ entry->enabled) {
> > + if (val) {
> > + ret = contain_string_arguments(entry);
> > + if (ret == 0)
> > + ktrace_probe_register(entry, ring_buffer_probe);
> > + else if (ret > 0)
> > + ktrace_probe_register(entry, string_probe);
> > + else
> > + *cnt = ret;
> > + } else
> > + ktrace_probe_unregister(entry);
> > + }
> > + mutex_unlock(&ktrace_mutex);
> > +}
> > +
> > +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> > + __ktrace_control_write);
> > +}
> > +
> > +static const struct file_operations ktracecontrol_file_operations = {
> > + .open = ktrace_control_open,
> > + .read = ktrace_control_read,
> > + .write = ktrace_control_write,
> > +};
> > +
> > +/* functions for adding/removing a trace event. Protected by mutex lock.
> > + * Called during initialization or after loading a module */
> > +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
> > +{
> > + struct ktrace_entry *entry;
> > + size_t name_len = strlen(ktrace->name) + 1;
> > + u32 hash = jhash(ktrace->name, name_len-1, 0);
> > + struct hlist_head *head;
> > + struct hlist_node *node;
> > + struct dentry *dentry;
> > +
> > + if (!tracedir)
> > + return NULL;
> > + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> > + hlist_for_each_entry(entry, node, head, hlist) {
> > + if (!strcmp(ktrace->name, entry->name)) {
> > + if (strcmp(ktrace->format, entry->format)) {
> > + printk(KERN_WARNING "the format of tracepoint "
> > + "\'%s\' changes from \'%s\' to \'%s\'. "
> > + "Dynamic changing of trace format is "
> > + "not supported yet!\n",
> > + ktrace->name, entry->format,
> > + ktrace->format);
> > + return ERR_PTR(-EINVAL);
> > + }
> > + ktrace->enabled = entry->enabled;
> > + ktrace->func = entry->func;
> > + ktrace->event_id = entry->event_id;
> > + if (list_empty(&entry->klist))
> > + goto add_head;
> > + list_add_tail(&ktrace->list, &entry->klist);
> > + return entry;
> > + }
> > + }
> > + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
> > + if (!entry)
> > + return ERR_PTR(-ENOMEM);
> > + entry->name = kmalloc(name_len, GFP_KERNEL);
> > + if (!entry->name) {
> > + kfree(entry);
> > + return ERR_PTR(-ENOMEM);
> > + }
> > + memcpy(entry->name, ktrace->name, name_len);
> > + if (ktrace->format) {
> > + size_t format_len = strlen(ktrace->format) + 1;
> > + entry->format = kmalloc(format_len, GFP_KERNEL);
> > + if (!entry->format) {
> > + kfree(entry->name);
> > + kfree(entry);
> > + return ERR_PTR(-ENOMEM);
> > + }
> > + memcpy(entry->format, ktrace->format, format_len);
> > + } else
> > + entry->format = NULL;
> > + entry->func = ktrace->func;
> > + entry->enabled = 0;
> > + ktrace->event_id = entry->event_id = ktrace_next_id++;
> > + entry->stroff = NULL;
> > + INIT_LIST_HEAD(&entry->klist);
> > + hlist_add_head(&entry->hlist, head);
> > +add_head:
> > + list_add(&ktrace->list, &entry->klist);
> > + dentry = debugfs_create_file(entry->name, 0660, eventdir,
> > + entry, &ktracecontrol_file_operations);
> > + if (!dentry)
> > + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
> > + entry->name);
> > + return entry;
> > +}
> > +
> > +static int remove_ktrace(struct kernel_trace *ktrace, int free)
> > +{
> > + struct ktrace_entry *entry;
> > + size_t name_len = strlen(ktrace->name) + 1;
> > + u32 hash = jhash(ktrace->name, name_len-1, 0);
> > + struct hlist_head *head;
> > + struct hlist_node *node, *temp;
> > + struct dentry *dentry;
> > +
> > + if (!tracedir)
> > + return 0;
> > + list_del(&ktrace->list);
> > + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> > + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
> > + if (!strcmp(ktrace->name, entry->name)) {
> > + if (list_empty(&entry->klist)) {
> > + dentry = lookup_one_len(entry->name,
> > + eventdir, strlen(entry->name));
> > + if (dentry && !IS_ERR(dentry)) {
> > + debugfs_remove(dentry);
> > + dput(dentry);
> > + }
> > + entry->enabled = 0;
> > + if (free) {
> > + hlist_del(&entry->hlist);
> > + kfree(entry->name);
> > + kfree(entry->format);
> > + kfree(entry->stroff);
> > + kfree(entry);
> > + }
> > + }
> > + return 0;
> > + }
> > + return -1;
> > +}
> > +
> > +/* Add/remove tracepoints contained in a module to/from the ktrace hash
> > + * table. Called at the end of module load/unload. */
> > +static int ktrace_module_notify(struct notifier_block *self,
> > + unsigned long val, void *data)
> > +{
> > +#ifdef CONFIG_MODULES
> > + struct kernel_trace *iter;
> > + struct module *mod = data;
> > +
> > + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
> > + return 0;
> > + mutex_lock(&ktrace_mutex);
> > + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
> > + if (val == MODULE_STATE_COMING)
> > + add_ktrace(iter);
> > + else
> > + remove_ktrace(iter, 0);
> > + mutex_unlock(&ktrace_mutex);
> > +#endif
> > + return 0;
> > +}
> > +
> > +static struct notifier_block ktrace_module_nb = {
> > + .notifier_call = ktrace_module_notify,
> > +};
> > +
> > +/* functions for user-space programs to read data from tracing buffer */
> > +static int trace_data_open(struct inode *inode, struct file *filp)
> > +{
> > + filp->private_data = inode->i_private;
> > + kref_get(&ktrace_kref);
> > + return 0;
> > +}
> > +
> > +static void release_trace_buffer(struct kref *kref)
> > +{
> > + ring_buffer_free(trace_buffer);
> > + trace_buffer = NULL;
> > +}
> > +
> > +static int trace_data_release(struct inode *inode, struct file *filp)
> > +{
> > + kref_put(&ktrace_kref, release_trace_buffer);
> > + return 0;
> > +}
> > +
> > +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
> > +{
> > + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> > + if (filp->f_mode & FMODE_READ) {
> > + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> > + return POLLIN | POLLRDNORM;
> > + poll_wait(filp, &trace_wait, poll_table);
> > + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> > + return POLLIN | POLLRDNORM;
> > + }
> > + return 0;
> > +}
> > +
> > +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
> > + pgoff_t pgoff = vmf->pgoff;
> > + struct page *page;
> > +
> > + if (!trace_buffer)
> > + return VM_FAULT_OOM;
> > + page = ring_buffer_get_page(cpu_buffer, pgoff);
> > + if (page == NULL)
> > + return VM_FAULT_SIGBUS;
> > + get_page(page);
> > + vmf->page = page;
> > + return 0;
> > +}
> > +
> > +static struct vm_operations_struct trace_data_mmap_ops = {
> > + .fault = trace_buf_fault,
> > +};
> > +
> > +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
> > +{
> > + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> > +
> > + vma->vm_ops = &trace_data_mmap_ops;
> > + vma->vm_flags |= VM_DONTEXPAND;
> > + vma->vm_private_data = cpu_buffer;
> > + return 0;
> > +}
> > +
> > +static int trace_data_ioctl(struct inode *inode, struct file *filp,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + u32 __user *argp = (u32 __user *)arg;
> > + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> > +
> > + if (!trace_buffer || !tracedir)
> > + return -EINVAL;
> > + switch (cmd) {
> > + case KTRACE_BUF_GET_SIZE:
> > + {
> > + unsigned long bufsize;
> > + bufsize = ring_buffer_size(trace_buffer);
> > + return put_user((u32)bufsize, argp);
> > + }
> > + case KTRACE_BUF_GET_PRODUCED:
> > + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
> > + case KTRACE_BUF_GET_CONSUMED:
> > + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
> > + case KTRACE_BUF_PUT_CONSUMED:
> > + {
> > + u32 consumed, consumed_old;
> > + int ret;
> > +
> > + ret = get_user(consumed, argp);
> > + if (ret) {
> > + printk(KERN_WARNING
> > + "error getting consumed value: %d\n", ret);
> > + return ret;
> > + }
> > + consumed_old = ring_buffer_get_consumed(cpu_buffer);
> > + if (consumed == consumed_old)
> > + return 0;
> > + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
> > + return 0;
> > + }
> > + default:
> > + return -ENOIOCTLCMD;
> > + }
> > + return 0;
> > +}
> > +
> > +static const struct file_operations tracedata_file_operations = {
> > + .open = trace_data_open,
> > + .poll = trace_data_poll,
> > + .mmap = trace_data_mmap,
> > + .ioctl = trace_data_ioctl,
> > + .release = trace_data_release,
> > +};
> > +
> > +/* function to report the number of lost events due to buffer overflow */
> > +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + int ret;
> > + char buf[64];
> > + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
> > + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> > +}
> > +
> > +static const struct file_operations traceoverflow_file_operations = {
> > + .read = trace_overflow_read,
> > +};
> > +
> > +/* function to re-initialize trace buffer */
> > +static void __trace_buffer_reset(struct file *filp,
> > + unsigned long val, size_t *cnt)
> > +{
> > + if (val && trace_buffer)
> > + ring_buffer_reset(trace_buffer);
> > +}
> > +
> > +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> > + __trace_buffer_reset);
> > +}
> > +
> > +static const struct file_operations tracereset_file_operations = {
> > + .write = trace_buffer_reset,
> > +};
> > +
> > +/*
> > + * We use debugfs for kernel-user communication. All of the control/info
> > + * files are under debugfs/ktrace directory.
> > + */
> > +static int create_debugfs(void)
> > +{
> > + struct dentry *entry, *bufdir;
> > + struct ring_buffer_per_cpu *cpu_buffer;
> > + char *tmpname;
> > + int cpu;
> > +
> > + tracedir = debugfs_create_dir("ktrace", NULL);
> > + if (!tracedir) {
> > + printk(KERN_WARNING "Couldn't create debugfs directory\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /* 'buffers' is the directory that holds trace buffer debugfs files */
> > + bufdir = debugfs_create_dir("buffers", tracedir);
> > + if (!bufdir) {
> > + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
> > + return -ENOMEM;
> > + }
> > +
> > + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> > + if (!tmpname)
> > + return -ENOMEM;
> > + /* create a debugfs file for each cpu buffer */
> > + for_each_possible_cpu(cpu) {
> > + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> > + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
> > + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
> > + &tracedata_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING
> > + "Couldn't create debugfs \'%s\' entry\n",
> > + tmpname);
> > + kfree(tmpname);
> > + return -ENOMEM;
> > + }
> > + }
> > + kfree(tmpname);
> > +
> > + /* the control file for users to enable/disable global tracing. */
> > + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
> > + &tracecontrol_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING
> > + "Couldn't create debugfs 'enabled' entry\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /*
> > + * the debugfs file that displays the name, format etc. of every
> > + * supported trace event. The file is to be used by the user-space
> > + * trace parser to analyze the collected trace data.
> > + */
> > + entry = debugfs_create_file("info", 0444, tracedir, NULL,
> > + &traceinfo_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /*
> > + * the debugfs file that reports the number of events
> > + * lost due to buffer overflow
> > + */
> > + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
> > + &traceoverflow_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING
> > + "Couldn't create debugfs 'overflow' entry\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /* the debugfs file that resets trace buffer upon write */
> > + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
> > + &tracereset_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /* create the directory that holds the control files for every event */
> > + eventdir = debugfs_create_dir("events", tracedir);
> > + if (!eventdir) {
> > + printk(KERN_WARNING "Couldn't create 'events' directory\n");
> > + return -ENOMEM;
> > + }
> > + return 0;
> > +}
> > +
> > +static void remove_debugfs_file(struct dentry *dir, char *name)
> > +{
> > + struct dentry *dentry;
> > + dentry = lookup_one_len(name, dir, strlen(name));
> > + if (dentry && !IS_ERR(dentry)) {
> > + debugfs_remove(dentry);
> > + dput(dentry);
> > + }
> > +}
> > +
> > +static int remove_debugfs(void)
> > +{
> > + struct dentry *bufdir;
> > + char *tmpname;
> > + int cpu;
> > +
> > + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
> > + if (!bufdir || IS_ERR(bufdir))
> > + return -EIO;
> > + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> > + if (!tmpname)
> > + return -ENOMEM;
> > + for_each_possible_cpu(cpu) {
> > + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> > + remove_debugfs_file(bufdir, tmpname);
> > + }
> > + kfree(tmpname);
> > + debugfs_remove(bufdir);
> > +
> > + remove_debugfs_file(tracedir, "enabled");
> > + remove_debugfs_file(tracedir, "info");
> > + remove_debugfs_file(tracedir, "overflow");
> > + remove_debugfs_file(tracedir, "reset");
> > + debugfs_remove(eventdir);
> > + debugfs_remove(tracedir);
> > + return 0;
> > +}
> > +
> > +static int __init init_ktrace(void)
> > +{
> > + struct kernel_trace *iter;
> > + int i, err;
> > +
> > + ktrace_next_id = 0;
> > + kref_set(&ktrace_kref, 1);
> > + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
> > + if (trace_buffer == NULL)
> > + return -ENOMEM;
> > + setup_timer(&trace_timer, wakeup_readers, 0);
> > + for (i = 0; i < KTRACE_HASH_SIZE; i++)
> > + INIT_HLIST_HEAD(&ktrace_table[i]);
> > +
> > + mutex_lock(&ktrace_mutex);
> > + err = create_debugfs();
> > + if (err != 0)
> > + goto out;
> > + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> > + add_ktrace(iter);
> > + err = register_module_notifier(&ktrace_module_nb);
> > +out:
> > + mutex_unlock(&ktrace_mutex);
> > + return err;
> > +}
> > +module_init(init_ktrace);
> > +
> > +static void __exit exit_ktrace(void)
> > +{
> > + struct kernel_trace *iter;
> > +
> > + if (ktrace_enabled) {
> > + ktrace_enabled = 0;
> > + del_timer_sync(&trace_timer);
> > + }
> > + mutex_lock(&ktrace_mutex);
> > + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> > + remove_ktrace(iter, 1);
> > + unregister_module_notifier(&ktrace_module_nb);
> > + remove_debugfs();
> > + mutex_unlock(&ktrace_mutex);
> > +
> > + kref_put(&ktrace_kref, release_trace_buffer);
> > +}
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Hi Mathieu,
First, I apologize that I did not forward you my original email. I assumed
the email would be noticed by the interested people once it was sent to
lkml; I now see that was a mistake.
On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
<[email protected]> wrote:
> Hi Jiaying,
>
> I'm very interested to hear about the details of your test-case, because
> what you are talking about here, and the numbers that you get, are, at
> the very least, not representative of the current LTTng. Perharps you
> used a version right after the kernel summit and stopped following the
> improvements we have done since then ? I also wonder about the instrumentation
> set you enabled in your tests, and whether it's exactly the same in the ktrace
> and LTTng runs (especially system call tracing). Please note that following the
> Kernel Summit 2008, a more active development cycle began, and the time was not
> right for optimisation. But since then a lot of improvement has been done on
> this side.
Sorry that I omitted a lot of details in my test description. I did use an
old version of LTTng: the 0.21 LTTng patch set for the 2.6.27 kernel. My
tests ran on a 2.6.26 kernel, with the lttng-2.6.27-rc6-0.21 patches ported
to it.
I enabled the same set of trace events in the ktrace and LTTng runs. Here
is the list of enabled events: syscall entry/exit, irq entry/exit, trap
entry/exit, kernel_sched_switch, kernel_process_fork, kernel_process_wait,
kernel_process_free, kernel_process_exit.
We are glad to see the improvements coming from the LTTng community. I
haven't had time to sync with the latest LTTng code and re-run my tests;
I will let you know when I have such data available.
>
> First I'd like to express my concern about the way you posted your
> numbers :
>
> 1 - You did not CC anyone from the LTTng project.
> 2 - You did not tell which LTTng version you used as a comparison basis.
> 3 - You did not look at the current state of code before posting your
> comparison. (as shows the comment in your patch header)
That is indeed my mistake.
> 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
> the default instrumentation ? Did you also enable system call tracing
> with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
> makes a huge difference on the amount of information gathered.
Yes. We also enabled syscall tracing via set_kernel_trace_flag_all_tasks().
>
> So let's have a close look at them. I'm re-doing the benchmarks based on
> the current LTTng version below.
>
>> Hello All,
>>
>> We have been working on building a kernel tracer for a production environment.
>> We first considered to use the existing markers code, but found it a bit
>> heavyweight.
>
> Have you looked at :
>
> 1 - The tracepoints currently in the kernel
> 2 - The _current_ version of markers in the LTTng tree ?
I looked at the tracepoints currently in the kernel. It is great to see
that some trace instrumentation code is finally available in the mainline
kernel. But I still don't like having an additional tracepoint layer on top
of the actual tracing layer (ktrace, markers, or ftrace); that seems to add
unnecessary overhead.
The markers code I referred to in my email is the current code in the
mainline kernel. I did have a look at the LTTng markers code, but I still
saw that a va_list argument is used in marker_probe_func. It sounds like a
nice improvement if the LTTng markers avoid the use of va_arg whenever
possible.
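To illustrate what I mean by the overhead of variable passing, here is a
minimal user-space sketch of the two probe styles. The buf_write() helper
and the payload layout are made up purely for illustration; this is not the
actual LTTng or ktrace code:

#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the real ring-buffer writer. */
static void buf_write(const void *data, size_t len)
{
	static char buf[4096];
	static size_t pos;

	if (pos + len <= sizeof(buf)) {
		memcpy(buf + pos, data, len);
		pos += len;
	}
}

/* va_arg style: the probe re-parses the format string on every event
 * to recover the type and size of each argument. */
static void vararg_probe(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	for (; *fmt; fmt++) {
		if (*fmt != '%' || fmt[1] == '\0')
			continue;
		fmt++;
		switch (*fmt) {
		case 'd': {
			int v = va_arg(ap, int);
			buf_write(&v, sizeof(v));
			break;
		}
		case 's': {
			const char *s = va_arg(ap, const char *);
			buf_write(s, strlen(s) + 1);
			break;
		}
		}
	}
	va_end(ap);
}

/* Direct style: the caller packs a struct whose layout and size are
 * known at compile time, so the probe is a single bounded copy. */
struct switch_payload { int prev_pid; int next_pid; };

static void direct_probe(const void *data, size_t size)
{
	buf_write(data, size);
}

void example(void)
{
	struct switch_payload p = { 1, 2 };

	vararg_probe("%d %d", p.prev_pid, p.next_pid);
	direct_probe(&p, sizeof(p));
}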
>
>> So we developed a new kernel tracing prototype that shares
>> many similarities as markers but also includes several simplifications that
>> give us considerable performance benefits. For example, we pass the size of
>> trace event directly to the probe function instead of using va_arg as that
>> used in markers, which saves us the overhead of variable passing.
>
> The markers in the LTTng tree only use va_args for "debug-style"
> tracing. All the tracing fast paths are now done with custom probe
> functions which takes the arguments from the tracepoint and write them
> in the trace buffers directly in C. When the event payload size is already
> known statically, there is no dynamic computation cost at all : it's
> simply passed to the ltt-relay write function.
But this raises another concern: subsystem developers may need to write a
bunch of custom probe functions to avoid the argument-parsing overhead. Can
you instead provide a general probe function or macro that does this work?
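For example, something along these lines could generate the per-event
packing code so that subsystem authors never write it by hand. This is
purely illustrative; DEFINE_SIMPLE_PROBE and generic_buffer_write are names
I am making up here, not anything that exists in either tree:

#include <stddef.h>

struct kernel_trace;
void generic_buffer_write(struct kernel_trace *kt, void *data, size_t size);

/* Hypothetical generator: given an event name and a field list, emit
 * the payload struct plus a probe that hands the statically-sized
 * payload to a single generic buffer writer. */
#define DEFINE_SIMPLE_PROBE(name, fields)				\
	struct name##_payload { fields };				\
	static void name##_probe(struct kernel_trace *kt,		\
				 void *data, size_t size)		\
	{								\
		generic_buffer_write(kt, data, size);			\
	}

/* Usage sketch: */
DEFINE_SIMPLE_PROBE(sched_switch, int prev_pid; int next_pid;)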
>
>> Also, we
>> associate a single probe function to each event, so we can directly call the
>> probe function when tracing is enabled.
>
> You seem to imply that being stuck with a single probe function is
> somehow superior to have tiny per-event callbacks which then call into a
> buffer write function ? Why ?
That is indeed my thinking, because we have seen that some platforms are
very sensitive to function call overhead, and we haven't seen use cases
where multiple probe functions prove necessary.
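To make the concern concrete, the difference in dispatch we worry about
looks roughly like the sketch below. This is schematic code, not taken
from either tracer:

#include <stddef.h>

struct kernel_trace;
typedef void probe_fn(struct kernel_trace *, void *, size_t);

struct kernel_trace {
	int enabled;
	probe_fn *func;		/* single probe, ktrace style */
	probe_fn *probes[4];	/* NULL-terminated list, multi-probe style */
};

/* Single-probe dispatch: one predicted branch, one indirect call. */
static void fire_single(struct kernel_trace *kt, void *data, size_t size)
{
	if (kt->enabled)
		kt->func(kt, data, size);
}

/* Multi-probe dispatch: an extra loop and an extra indirect call per
 * registered probe, even when only one probe is attached. */
static void fire_multi(struct kernel_trace *kt, void *data, size_t size)
{
	probe_fn **p;

	if (kt->enabled)
		for (p = kt->probes; *p; p++)
			(*p)(kt, data, size);
}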
>
>> Our measurements show that our
>> current tracing prototype introduces much lower overhead compared with
>> lttng (we use lttng as a comparison because markers does not provide the
>> implementation to record trace data).
>
> Why LTTng version, kernel version, LTTng config and runtime options do
> you use ?
See above for LTTng and kernel versions. I was using the default LTTng
config options and 'normal' recorder mode.
>
>> E.g., with dbench, lttng introduces
>> 20~80% overhead on our testing platforms while the overhead with our
>> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
>> overhead while the overhead added by our kernel tracer is usually within
>> 15%; with kernbench (time to compile a Linux kernel), the overhead
>> added by lttng is around 4%~18% while the overhead added by our
>> kernel tracer is within 2.5%.
>>
>
> I'm re-running the dbench and tbench tests :
>
> With LTTng 0.106, kernel 2.6.29-rc7
> 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
>
> Arming all the default LTTng instrumentation (168 tracepoints),
> Tracing in flight recorder mode, including system call entry/exit
> tracing and block I/O tracing (blktrace instrumentation), I get :
>
> Test / config | Vanilla kernel | LTTng 0.106 | slowdown
> dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
>
> As an example of the effect of custom probe vs va_args based probes,
> let's take the tbench example
>
> | Vanilla | LTTng 0.106 | slowdown
> tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
>
> The LTTng trace statistics shows the top high event rate events :
>
> Some are usual (therefore lttng already implements custom probes for them) :
>
> 42004 sched_try_wakeup
> 79754 timer_set
> 83973 sched_schedule
> 84317 softirq_exit
> 84318 softirq_entry
> 95370 page_free
> 95412 page_alloc
> 267247 syscall_entry
> 267247 syscall_exit
>
> Some are more tbench workload-specific :
>
> 16915 socket_sendmsg
> 18911 napi_complete
> 18911 napi_schedule
> 18912 dev_receive
> 18914 dev_xmit
> 18920 napi_poll
> 33831 socket_recvmsg
> 59860 read
>
>
> So let's see what happens if I implement those as custom probes and do a bit of
> tuning in the tracer hot path :
Could you give more details about the custom probes you used?
Do you need to write a custom probe for each tracepoint?
Jiaying
>
> | Vanilla | LTTng 0.107 | slowdown
> Custom probes :
> tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
>
> With tracer hot path tuning (inline hot path, function calls for slow path, will
> be in 0.108) :
> tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
>
> kernbench :
> Average Optimal load -j 32 Run
>
> Vanilla 2.6.29-rc7 :
> Elapsed Time 48.238 (0.117346)
>
> With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
> Elapsed Time 50.226 (0.0750333)
>
> Slowdown : 4.1 %
>
> So except the tbench workload on LTTng 0.106, which still had a 52% performance
> impact (brought it down to 35 % for the next LTTng release) due to
> specific workload and not having specialized probes implemented, the
> dbench impact is well below your results (12.3 % vs 20-80%)
>
> So given the huge result discrepancies between your tests and mine, could you
> give us more detail about your benchmarks ? Especially knowing the enabled
> instrumentation set, and the amount of MB/s generated by both ktrace and LTTng
> runs (to make sure we are comparing the same traffic) and the LTTng versions you
> are using as a comparison basis would be good.
>
> Thanks,
>
> Mathieu
>
>
>> The patch below is the current prototype we have implemented. It uses
>> Steve's unified trace buffer code with some small extensions to support
>> buffer mmap. The patch is probably still buggy, but we would like to get
>> some early feedback from the community.
>>
>> Signed-off-by: Jiaying Zhang <[email protected]>
>>
>> Index: git-linux/init/Kconfig
>> ===================================================================
>> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
>> +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
>> @@ -950,6 +950,15 @@
>> Place an empty function call at each marker site. Can be
>> dynamically changed for a probe function.
>>
>> +config KTRACE
>> + bool "Enable ktrace support"
>> + select DEBUG_FS
>> + select TRACING
>> + select RING_BUFFER
>> + help
>> + Ktrace is a kernel tracing tool that allows you to trace
>> + kernel events by inserting trace points at proper places.
>> +
>> source "arch/Kconfig"
>>
>> endmenu # General setup
>> Index: git-linux/kernel/Makefile
>> ===================================================================
>> --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
>> +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
>> @@ -86,6 +86,7 @@
>> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
>> obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
>> obj-$(CONFIG_MARKERS) += marker.o
>> +obj-$(CONFIG_KTRACE) += ktrace.o
>> obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
>> obj-$(CONFIG_LATENCYTOP) += latencytop.o
>> obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
>> Index: git-linux/include/asm-generic/vmlinux.lds.h
>> ===================================================================
>> --- git-linux.orig/include/asm-generic/vmlinux.lds.h 2009-02-19
>> 14:58:37.000000000 -0800
>> +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19
>> 14:58:39.000000000 -0800
>> @@ -73,6 +73,9 @@
>> MEM_KEEP(init.data) \
>> MEM_KEEP(exit.data) \
>> . = ALIGN(8); \
>> + VMLINUX_SYMBOL(__start___ktraces) = .; \
>> + *(__ktrace) \
>> + VMLINUX_SYMBOL(__stop___ktraces) = .; \
>> VMLINUX_SYMBOL(__start___markers) = .; \
>> *(__markers) \
>> VMLINUX_SYMBOL(__stop___markers) = .; \
>> @@ -89,6 +92,7 @@
>> VMLINUX_SYMBOL(__start_rodata) = .; \
>> *(.rodata) *(.rodata.*) \
>> *(__vermagic) /* Kernel version magic */ \
>> + *(__ktrace_strings) /* Ktrace: strings */ \
>> *(__markers_strings) /* Markers: strings */ \
>> *(__tracepoints_strings)/* Tracepoints: strings */ \
>> } \
>> Index: git-linux/kernel/module.c
>> ===================================================================
>> --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
>> +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
>> @@ -50,6 +50,7 @@
>> #include <asm/sections.h>
>> #include <linux/tracepoint.h>
>> #include <linux/ftrace.h>
>> +#include <linux/ktrace.h>
>> #include <linux/async.h>
>>
>> #if 0
>> @@ -2145,6 +2146,10 @@
>> sizeof(*mod->tracepoints),
>> &mod->num_tracepoints);
>> #endif
>> +#ifdef CONFIG_KTRACE
>> + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
>> + sizeof(*mod->ktrace), &mod->num_ktrace);
>> +#endif
>>
>> #ifdef CONFIG_MODVERSIONS
>> if ((mod->num_syms && !mod->crcs)
>> Index: git-linux/include/linux/module.h
>> ===================================================================
>> --- git-linux.orig/include/linux/module.h 2009-02-19
>> 14:58:37.000000000 -0800
>> +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
>> @@ -44,6 +44,7 @@
>> };
>>
>> struct module;
>> +struct kernel_trace;
>>
>> struct module_attribute {
>> struct attribute attr;
>> @@ -347,6 +348,11 @@
>> /* Reference counts */
>> struct module_ref ref[NR_CPUS];
>> #endif
>> +#ifdef CONFIG_KTRACE
>> + struct kernel_trace *ktrace;
>> + unsigned int num_ktrace;
>> +#endif
>> +
>> };
>> #ifndef MODULE_ARCH_INIT
>> #define MODULE_ARCH_INIT {}
>> Index: git-linux/include/linux/ktrace.h
>> ===================================================================
>> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
>> @@ -0,0 +1,83 @@
>> +#ifndef _LINUX_KTRACE_H
>> +#define _LINUX_KTRACE_H
>> +
>> +#include <linux/types.h>
>> +
>> +struct kernel_trace;
>> +
>> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
>> +
>> +struct kernel_trace {
>> + const char *name;
>> + const char *format; /* format string describing variable list */
>> + size_t *stroff; /* offsets of string variables */
>> + /* 31 bit event_id is converted to 16 bit when entering to the buffer */
>> + u32 enabled:1, event_id:31;
>> + ktrace_probe_func *func; /* probe function */
>> + struct list_head list; /* list head linked to the hash table entry */
>> +} __attribute__((aligned(8)));
>> +
>> +extern int ktrace_enabled;
>> +
>> +/*
>> + * Make sure the alignment of the structure in the __ktrace section will
>> + * not add unwanted padding between the beginning of the section and the
>> + * structure. Force alignment to the same alignment as the section start.
>> + */
>> +
>> +#define DEFINE_KTRACE_STRUCT(name) \
>> + struct __attribute__((packed)) ktrace_struct_##name
>> +
>> +#ifdef CONFIG_KTRACE
>> +/*
>> + * DO_TRACE is the macro that is called from each trace point. Each trace
>> + * point is associated with a static ktrace object that is located in a
>> + * special __ktrace section and is uniquely identified by the event name and
>> + * event format. Take a look at the files under include/trace for example
>> + * usage. Note that we call the trace probe function only if the global
>> + * tracing is enabled and the tracing for the associated event is enabled,
>> + * so we only introduce the overhead of a predicted condition judge when
>> + * tracing is disabled.
>> + */
>> +#define DO_TRACE(name, format, args...)
>> \
>> + do { \
>> + static const char __kstrtab_##name[] \
>> + __attribute__((section("__ktrace_strings"))) \
>> + = #name "\0" format; \
>> + static struct kernel_trace __ktrace_##name \
>> + __attribute__((section("__ktrace"), aligned(8))) = \
>> + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
>> + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
>> + __ktrace_check_format(format, ## args); \
>> + if (unlikely(ktrace_enabled) && \
>> + unlikely(__ktrace_##name.enabled)) { \
>> + struct ktrace_struct_##name karg = { args }; \
>> + (*__ktrace_##name.func) \
>> + (&__ktrace_##name, &karg, sizeof(karg)); \
>> + } \
>> + } while (0)
>> +
>> +#else /* !CONFIG_KTRACE */
>> +#define DO_TRACE(name, format, args...)
>> +#endif
>> +
>> +/* To be used for string format validity checking with gcc */
>> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
>> +{
>> +}
>> +
>> +#define __ktrace_check_format(format, args...) \
>> + do { \
>> + if (0) \
>> + ___ktrace_check_format(format, ## args); \
>> + } while (0)
>> +
>> +
>> +/* get the trace buffer information */
>> +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
>> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
>> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
>> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
>> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
>> +
>> +#endif
>> Index: git-linux/kernel/ktrace.c
>> ===================================================================
>> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
>> @@ -0,0 +1,872 @@
>> +/*
>> + * kernel/ktrace.c
>> + *
>> + * Implementation of the kernel tracing for linux kernel 2.6.
>> + *
>> + * Copyright 2008- Google Inc.
>> + * Original Author: Jiaying Zhang
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/kernel.h>
>> +#include <linux/bitmap.h>
>> +#include <linux/poll.h>
>> +#include <linux/errno.h>
>> +#include <linux/sched.h>
>> +#include <linux/linkage.h>
>> +#include <linux/debugfs.h>
>> +#include <linux/timer.h>
>> +#include <linux/delay.h>
>> +#include <linux/jhash.h>
>> +#include <linux/ctype.h>
>> +#include <linux/namei.h>
>> +#include <linux/ktrace.h>
>> +#include <linux/ring_buffer.h>
>> +
>> +static char *ktrace_version = "1.0.0";
>> +int ktrace_enabled;
>> +EXPORT_SYMBOL(ktrace_enabled);
>> +static uint16_t ktrace_next_id;
>> +
>> +extern struct kernel_trace __start___ktraces[];
>> +extern struct kernel_trace __stop___ktraces[];
>> +
>> +static DEFINE_MUTEX(ktrace_mutex);
>> +
>> +/* ktrace hash table entry structure and variables */
>> +struct ktrace_entry {
>> + struct hlist_node hlist;
>> + char *name;
>> + char *format;
>> + size_t *stroff;
>> + ktrace_probe_func *func;
>> + u32 enabled:1, event_id:31;
>> + struct list_head klist; /* list of loaded ktraces */
>> +};
>> +
>> +#define KTRACE_HASH_BITS 6
>> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
>> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
>> +
>> +/* debugfs directory variables */
>> +static struct dentry *tracedir;
>> +static struct dentry *eventdir;
>> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
>> +
>> +/* ring buffer code */
>> +static unsigned long trace_buf_size = 65536UL;
>> +struct ring_buffer *trace_buffer;
>> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
>> +static struct timer_list trace_timer; /* reader wake-up timer */
>> +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
>> +static struct kref ktrace_kref;
>> +
>> +static void
>> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> +{
>> + struct ring_buffer_event *event;
>> + void *trace_event;
>> + unsigned long flag;
>> +
>> + event = ring_buffer_lock_reserve(trace_buffer,
>> + sizeof(uint16_t) + event_size, &flag);
>> + if (!event)
>> + return;
>> + trace_event = ring_buffer_event_data(event);
>> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> + memcpy(trace_event + sizeof(uint16_t), data, event_size);
>> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>> +}
>> +
>> +/* special probe function for events that contain string arguments */
>> +static void
>> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> +{
>> + struct ring_buffer_event *event;
>> + void *trace_event, *p;
>> + unsigned long flag;
>> + size_t *offset, scanned;
>> + char *string;
>> +
>> + /*
>> + * Get the real length of the event, i.e., we use stroff array to
>> + * locate string variables in the passed-in trace data and use their
>> + * length to replace the size of the string pointers.
>> + */
>> + for (offset = kt->stroff; *offset != -1; offset++) {
>> + string = *(char **) (data + *offset);
>> + event_size += strlen(string) + 1 - sizeof(char *);
>> + }
>> + event = ring_buffer_lock_reserve(trace_buffer,
>> + sizeof(uint16_t) + event_size, &flag);
>> + if (!event)
>> + return;
>> + trace_event = ring_buffer_event_data(event);
>> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> + p = trace_event + sizeof(uint16_t);
>> + /*
>> + * Copy the trace data into buffer. For string variables, we enter the
>> + * string into the buffer. Otherwise, the passed in data is copied.
>> + */
>> + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
>> + memcpy(p, data + scanned, *offset - scanned);
>> + p += *offset - scanned;
>> + string = *(char **) (data + *offset);
>> + memcpy(p, string, strlen(string) + 1);
>> + p += strlen(string) + 1;
>> + scanned = *offset + sizeof(char *);
>> + }
>> + memcpy(p, data + scanned,
>> + trace_event + sizeof(uint16_t) + event_size - p);
>> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>> +}
>> +
>> +/* timer function used to defer ring buffer reader waking */
>> +static void wakeup_readers(unsigned long data)
>> +{
>> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> + wake_up_interruptible(&trace_wait);
>> + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
>> +}
>> +
>> +/* function for reading ktrace metadata info */
>> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
>> +{
>> + seq_printf(seq, "version %s\n", ktrace_version);
>> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> +}
>> +
>> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
>> +{
>> + ++*pos;
>> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> +}
>> +
>> +static int tracing_info_show(struct seq_file *seq, void *v)
>> +{
>> + struct hlist_head *head = v;
>> + struct hlist_node *node;
>> + struct ktrace_entry *entry;
>> + hlist_for_each_entry(entry, node, head, hlist)
>> + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
>> + entry->name, entry->format, entry->event_id,
>> + entry->enabled ? "enabled" : "disabled");
>> + return 0;
>> +}
>> +
>> +static void tracing_info_stop(struct seq_file *seq, void *v)
>> +{
>> +}
>> +
>> +static const struct seq_operations tracing_info_seq_ops = {
>> + .start = tracing_info_start,
>> + .next = tracing_info_next,
>> + .show = tracing_info_show,
>> + .stop = tracing_info_stop,
>> +};
>> +
>> +static int tracing_info_open(struct inode *inode, struct file *file)
>> +{
>> + return seq_open(file, &tracing_info_seq_ops);
>> +}
>> +
>> +static const struct file_operations traceinfo_file_operations = {
>> + .open = tracing_info_open,
>> + .read = seq_read,
>> + .llseek = seq_lseek,
>> + .release = seq_release,
>> +};
>> +
>> +/*
>> + * wrapper function used by debugfs write operation.
>> + * func: handling function that does real work
>> + */
>> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
>> + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
>> +{
>> + int ret;
>> + char buf[64];
>> + unsigned long val;
>> +
>> + if (cnt >= sizeof(buf))
>> + return -EINVAL;
>> + if (copy_from_user(&buf, ubuf, cnt))
>> + return -EFAULT;
>> + buf[cnt] = 0;
>> +
>> + ret = strict_strtoul(buf, 10, &val);
>> + if (ret < 0)
>> + return ret;
>> + val = !!val;
>> +
>> + func(filp, val, &cnt);
>> +
>> + filp->f_pos += cnt;
>> + return cnt;
>> +}
>> +
>> +/* functions for reading/writing the global 'enabled' control file */
>> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
>> + size_t cnt, loff_t *ppos)
>> +{
>> + int ret;
>> + char buf[64];
>> + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
>> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> +}
>> +
>> +static void __tracing_control_write(struct file *filp,
>> + unsigned long val, size_t *cnt)
>> +{
>> + if (val ^ ktrace_enabled) {
>> + if (val) {
>> + trace_timer.expires =
>> + jiffies + TRACEREAD_WAKEUP_INTERVAL;
>> + add_timer(&trace_timer);
>> + ktrace_enabled = 1;
>> + } else {
>> + ktrace_enabled = 0;
>> + del_timer_sync(&trace_timer);
>> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> + wake_up_interruptible(&trace_wait);
>> + }
>> + }
>> +}
>> +
>> +static ssize_t tracing_control_write(struct file *filp, const char
>> __user *ubuf,
>> + size_t cnt, loff_t *ppos)
>> +{
>> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> + __tracing_control_write);
>> +}
>> +
>> +static const struct file_operations tracecontrol_file_operations = {
>> + .read = tracing_control_read,
>> + .write = tracing_control_write,
>> +};
>> +
>> +/* functions for reading/writing a trace event file */
>> +static int update_ktrace(struct ktrace_entry *entry,
>> + ktrace_probe_func *func, int enabled)
>> +{
>> + struct kernel_trace *iter;
>> + /* no need to update the list if the tracing is not initialized */
>> + if (!tracedir)
>> + return 0;
>> + entry->enabled = enabled;
>> + entry->func = func;
>> + list_for_each_entry(iter, &entry->klist, list) {
>> + iter->stroff = entry->stroff;
>> + iter->func = func;
>> + iter->enabled = enabled;
>> + }
>> + return 0;
>> +}
>> +
>> +static int ktrace_probe_register(struct ktrace_entry *entry,
>> + ktrace_probe_func *probefunc)
>> +{
>> + return update_ktrace(entry, probefunc, 1);
>> +}
>> +
>> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
>> +{
>> + return update_ktrace(entry, NULL, 0);
>> +}
>> +
>> +static int ktrace_control_open(struct inode *inode, struct file *filp)
>> +{
>> + filp->private_data = inode->i_private;
>> + return 0;
>> +}
>> +
>> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
>> + size_t cnt, loff_t *ppos)
>> +{
>> + int ret;
>> + char buf[64];
>> + struct ktrace_entry *entry = filp->private_data;
>> +
>> + ret = snprintf(buf, 64, "%u\n", entry->enabled);
>> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> +}
>> +
>> +/*
>> + * Check whether the event contains string arguments. Return 0 if not;
>> + * otherwise record the offsets of the string arguments in the ktrace
>> + * structure according to the event format and return 1.
>> + * The offset info is later used to store the string arguments into
>> + * the trace buffer during trace probe.
>> + */
>> +static int contain_string_arguments(struct ktrace_entry *entry)
>> +{
>> + int count = 0;
>> + char *fmt;
>> + int qualifier; /* 'h', 'l', or 'L' for integer fields */
>> + size_t offset = 0;
>> +
>> + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
>> + count++;
>> + if (!count)
>> + return 0;
>> + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
>> + if (!entry->stroff)
>> + return -ENOMEM;
>> +
>> + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
>> + if (*fmt != '%')
>> + continue;
>> +repeat:
>> + ++fmt;
>> + switch (*fmt) {
>> + case '-':
>> + case '+':
>> + case ' ':
>> + case '#':
>> + case '0':
>> + ++fmt;
>> + goto repeat;
>> + }
>> +
>> + while (isdigit(*fmt))
>> + fmt++;
>> +
>> + /* get the conversion qualifier */
>> + qualifier = -1;
>> + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
>> + qualifier = *fmt;
>> + ++fmt;
>> + if (qualifier == 'l' && *fmt == 'l') {
>> + qualifier = 'L';
>> + ++fmt;
>> + }
>> + }
>> +
>> + switch (*fmt) {
>> + case 'c':
>> + offset += sizeof(char);
>> + continue;
>> + case 's':
>> + entry->stroff[count] = offset;
>> + count++;
>> + offset += sizeof(char *);
>> + continue;
>> + case 'p':
>> + offset += sizeof(void *);
>> + continue;
>> + case 'd':
>> + case 'i':
>> + case 'o':
>> + case 'u':
>> + case 'x':
>> + case 'X':
>> + switch (qualifier) {
>> + case 'L':
>> + offset += sizeof(long long);
>> + break;
>> + case 'l':
>> + offset += sizeof(long);
>> + break;
>> + case 'h':
>> + offset += sizeof(short);
>> + break;
>> + default:
>> + offset += sizeof(int);
>> + break;
>> + }
>> + break;
>> + default:
>> + printk(KERN_WARNING "unknown format '%c'", *fmt);
>> + continue;
>> + }
>> + }
>> + entry->stroff[count] = -1;
>> + return 1;
>> +}
>> +
>> +static void __ktrace_control_write(struct file *filp,
>> + unsigned long val, size_t *cnt)
>> +{
>> + struct ktrace_entry *entry = filp->private_data;
>> + int ret;
>> +
>> + mutex_lock(&ktrace_mutex);
>> + if (val ^ entry->enabled) {
>> + if (val) {
>> + ret = contain_string_arguments(entry);
>> + if (ret == 0)
>> + ktrace_probe_register(entry, ring_buffer_probe);
>> + else if (ret > 0)
>> + ktrace_probe_register(entry, string_probe);
>> + else
>> + *cnt = ret;
>> + } else
>> + ktrace_probe_unregister(entry);
>> + }
>> + mutex_unlock(&ktrace_mutex);
>> +}
>> +
>> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
>> + size_t cnt, loff_t *ppos)
>> +{
>> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> + __ktrace_control_write);
>> +}
>> +
>> +static const struct file_operations ktracecontrol_file_operations = {
>> + .open = ktrace_control_open,
>> + .read = ktrace_control_read,
>> + .write = ktrace_control_write,
>> +};
>> +
>> +/* functions for adding/removing a trace event. Protected by mutex lock.
>> + * Called during initialization or after loading a module */
>> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
>> +{
>> + struct ktrace_entry *entry;
>> + size_t name_len = strlen(ktrace->name) + 1;
>> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>> + struct hlist_head *head;
>> + struct hlist_node *node;
>> + struct dentry *dentry;
>> +
>> + if (!tracedir)
>> + return 0;
>> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> + hlist_for_each_entry(entry, node, head, hlist) {
>> + if (!strcmp(ktrace->name, entry->name)) {
>> + if (strcmp(ktrace->format, entry->format)) {
>> + printk(KERN_WARNING "the format of tracepoint "
>> + "\'%s\' changes from \'%s\' to \'%s\'."
>> + "Dynamic changing of trace format is "
>> + "not supported yet!\n",
>> + ktrace->name, entry->format,
>> + ktrace->format);
>> + return ERR_PTR(-EINVAL);
>> + }
>> + ktrace->enabled = entry->enabled;
>> + ktrace->func = entry->func;
>> + ktrace->event_id = entry->event_id;
>> + if (list_empty(&entry->klist))
>> + goto add_head;
>> + list_add_tail(&ktrace->list, &entry->klist);
>> + return entry;
>> + }
>> + }
>> + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
>> + if (!entry)
>> + return ERR_PTR(-ENOMEM);
>> + entry->name = kmalloc(name_len, GFP_KERNEL);
>> + if (!entry->name) {
>> + kfree(entry);
>> + return ERR_PTR(-ENOMEM);
>> + }
>> + memcpy(entry->name, ktrace->name, name_len);
>> + if (ktrace->format) {
>> + size_t format_len = strlen(ktrace->format) + 1;
>> + entry->format = kmalloc(format_len, GFP_KERNEL);
>> + if (!entry->format) {
>> + kfree(entry->name);
>> + kfree(entry);
>> + return ERR_PTR(-ENOMEM);
>> + }
>> + memcpy(entry->format, ktrace->format, format_len);
>> + } else
>> + entry->format = NULL;
>> + entry->func = ktrace->func;
>> + entry->enabled = 0;
>> + ktrace->event_id = entry->event_id = ktrace_next_id++;
>> + entry->stroff = NULL;
>> + INIT_LIST_HEAD(&entry->klist);
>> + hlist_add_head(&entry->hlist, head);
>> +add_head:
>> + list_add(&ktrace->list, &entry->klist);
>> + dentry = debugfs_create_file(entry->name, 0660, eventdir,
>> + entry, &ktracecontrol_file_operations);
>> + if (!dentry)
>> + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
>> + entry->name);
>> + return entry;
>> +}
>> +
>> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
>> +{
>> + struct ktrace_entry *entry;
>> + size_t name_len = strlen(ktrace->name) + 1;
>> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>> + struct hlist_head *head;
>> + struct hlist_node *node, *temp;
>> + struct dentry *dentry;
>> +
>> + if (!tracedir)
>> + return 0;
>> + list_del(&ktrace->list);
>> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
>> + if (!strcmp(ktrace->name, entry->name)) {
>> + if (list_empty(&entry->klist)) {
>> + dentry = lookup_one_len(entry->name,
>> + eventdir, strlen(entry->name));
>> + if (dentry && !IS_ERR(dentry)) {
>> + debugfs_remove(dentry);
>> + dput(dentry);
>> + }
>> + entry->enabled = 0;
>> + if (free) {
>> + hlist_del(&entry->hlist);
>> + kfree(entry->name);
>> + kfree(entry->format);
>> + kfree(entry->stroff);
>> + kfree(entry);
>> + }
>> + }
>> + return 0;
>> + }
>> + return -1;
>> +}
>> +
>> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
>> + * Called at the end of module load/unload. */
>> +static int ktrace_module_notify(struct notifier_block *self,
>> + unsigned long val, void *data)
>> +{
>> +#ifdef CONFIG_MODULES
>> + struct kernel_trace *iter;
>> + struct module *mod = data;
>> +
>> + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
>> + return 0;
>> + mutex_lock(&ktrace_mutex);
>> + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
>> + if (val == MODULE_STATE_COMING)
>> + add_ktrace(iter);
>> + else
>> + remove_ktrace(iter, 0);
>> + mutex_unlock(&ktrace_mutex);
>> +#endif
>> + return 0;
>> +}
>> +
>> +static struct notifier_block ktrace_module_nb = {
>> + .notifier_call = ktrace_module_notify,
>> +};
>> +
>> +/* functions for user-space programs to read data from tracing buffer */
>> +static int trace_data_open(struct inode *inode, struct file *filp)
>> +{
>> + filp->private_data = inode->i_private;
>> + kref_get(&ktrace_kref);
>> + return 0;
>> +}
>> +
>> +static void release_trace_buffer(struct kref *kref)
>> +{
>> + ring_buffer_free(trace_buffer);
>> + trace_buffer = NULL;
>> +}
>> +
>> +static int trace_data_release(struct inode *inode, struct file *filp)
>> +{
>> + kref_put(&ktrace_kref, release_trace_buffer);
>> + return 0;
>> +}
>> +
>> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
>> +{
>> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> + if (filp->f_mode & FMODE_READ) {
>> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> + return POLLIN | POLLRDNORM;
>> + poll_wait(filp, &trace_wait, poll_table);
>> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> + return POLLIN | POLLRDNORM;
>> + }
>> + return 0;
>> +}
>> +
>> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> +{
>> + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
>> + pgoff_t pgoff = vmf->pgoff;
>> + struct page *page;
>> +
>> + if (!trace_buffer)
>> + return VM_FAULT_OOM;
>> + page = ring_buffer_get_page(cpu_buffer, pgoff);
>> + if (page == NULL)
>> + return VM_FAULT_SIGBUS;
>> + get_page(page);
>> + vmf->page = page;
>> + return 0;
>> +}
>> +
>> +static struct vm_operations_struct trace_data_mmap_ops = {
>> + .fault = trace_buf_fault,
>> +};
>> +
>> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
>> +{
>> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> +
>> + vma->vm_ops = &trace_data_mmap_ops;
>> + vma->vm_flags |= VM_DONTEXPAND;
>> + vma->vm_private_data = cpu_buffer;
>> + return 0;
>> +}
>> +
>> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
>> + unsigned int cmd, unsigned long arg)
>> +{
>> + u32 __user *argp = (u32 __user *)arg;
>> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> +
>> + if (!trace_buffer || !tracedir)
>> + return -EINVAL;
>> + switch (cmd) {
>> + case KTRACE_BUF_GET_SIZE:
>> + {
>> + unsigned long bufsize;
>> + bufsize = ring_buffer_size(trace_buffer);
>> + return put_user((u32)bufsize, argp);
>> + }
>> + case KTRACE_BUF_GET_PRODUCED:
>> + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
>> + case KTRACE_BUF_GET_CONSUMED:
>> + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
>> + case KTRACE_BUF_PUT_CONSUMED:
>> + {
>> + u32 consumed, consumed_old;
>> + int ret;
>> +
>> + ret = get_user(consumed, argp);
>> + if (ret) {
>> + printk(KERN_WARNING
>> + "error getting consumed value: %d\n", ret);
>> + return ret;
>> + }
>> + consumed_old = ring_buffer_get_consumed(cpu_buffer);
>> + if (consumed == consumed_old)
>> + return 0;
>> + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
>> + return 0;
>> + }
>> + default:
>> + return -ENOIOCTLCMD;
>> + }
>> + return 0;
>> +}
>> +
>> +static const struct file_operations tracedata_file_operations = {
>> + .open = trace_data_open,
>> + .poll = trace_data_poll,
>> + .mmap = trace_data_mmap,
>> + .ioctl = trace_data_ioctl,
>> + .release = trace_data_release,
>> +};
>> +
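For reference, a minimal userspace consumer of this interface could look
like the sketch below -- untested; the ioctl numbers and the debugfs
layout come from this patch, while the mount point, the event-parsing
step and all error handling are assumptions:

#include <stddef.h>
#include <stdint.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define KTRACE_BUF_GET_SIZE     _IOR(0xF5, 0x00, uint32_t)
#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, uint32_t)
#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, uint32_t)
#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, uint32_t)

int main(void)
{
        uint32_t size, produced, consumed;
        struct pollfd pfd = { .events = POLLIN };
        void *buf;

        /* per-cpu buffer file created by create_debugfs() below */
        pfd.fd = open("/sys/kernel/debug/ktrace/buffers/cpu0", O_RDONLY);
        if (pfd.fd < 0)
                return 1;
        if (ioctl(pfd.fd, KTRACE_BUF_GET_SIZE, &size) < 0)
                return 1;
        buf = mmap(NULL, size, PROT_READ, MAP_SHARED, pfd.fd, 0);
        if (buf == MAP_FAILED)
                return 1;
        for (;;) {
                /* the kernel-side wakeup timer ends this sleep */
                poll(&pfd, 1, -1);
                ioctl(pfd.fd, KTRACE_BUF_GET_PRODUCED, &produced);
                ioctl(pfd.fd, KTRACE_BUF_GET_CONSUMED, &consumed);
                /* ... parse events in [consumed, produced) of buf ... */
                ioctl(pfd.fd, KTRACE_BUF_PUT_CONSUMED, &produced);
        }
        return 0;
}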
>> +/* function to report the number of lost events due to buffer overflow */
>> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
>> + size_t cnt, loff_t *ppos)
>> +{
>> + int ret;
>> + char buf[64];
>> + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
>> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> +}
>> +
>> +static const struct file_operations traceoverflow_file_operations = {
>> + .read = trace_overflow_read,
>> +};
>> +
>> +/* function to re-initialize trace buffer */
>> +static void __trace_buffer_reset(struct file *filp,
>> + unsigned long val, size_t *cnt)
>> +{
>> + if (val && trace_buffer)
>> + ring_buffer_reset(trace_buffer);
>> +}
>> +
>> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
>> + size_t cnt, loff_t *ppos)
>> +{
>> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> + __trace_buffer_reset);
>> +}
>> +
>> +static const struct file_operations tracereset_file_operations = {
>> + .write = trace_buffer_reset,
>> +};
>> +
>> +/*
>> + * We use debugfs for kernel-user communication. All of the control/info
>> + * files are under debugfs/ktrace directory.
>> + */
>> +static int create_debugfs(void)
>> +{
>> + struct dentry *entry, *bufdir;
>> + struct ring_buffer_per_cpu *cpu_buffer;
>> + char *tmpname;
>> + int cpu;
>> +
>> + tracedir = debugfs_create_dir("ktrace", NULL);
>> + if (!tracedir) {
>> + printk(KERN_WARNING "Couldn't create debugfs directory\n");
>> + return -ENOMEM;
>> + }
>> +
>> + /* 'buffers' is the directory that holds trace buffer debugfs files */
>> + bufdir = debugfs_create_dir("buffers", tracedir);
>> + if (!bufdir) {
>> + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
>> + return -ENOMEM;
>> + }
>> +
>> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> + if (!tmpname)
>> + return -ENOMEM;
>> + /* create a debugfs file for each cpu buffer */
>> + for_each_possible_cpu(cpu) {
>> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
>> + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
>> + &tracedata_file_operations);
>> + if (!entry) {
>> + printk(KERN_WARNING
>> + "Couldn't create debugfs \'%s\' entry\n",
>> + tmpname);
>> + kfree(tmpname);
>> + return -ENOMEM;
>> + }
>> + }
>> + kfree(tmpname);
>> +
>> + /* the control file for users to enable/disable global tracing. */
>> + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
>> + &tracecontrol_file_operations);
>> + if (!entry) {
>> + printk(KERN_WARNING
>> + "Couldn't create debugfs 'enabled' entry\n");
>> + return -ENOMEM;
>> + }
>> +
>> + /*
>> + * the debugfs file that displays the name, format etc. of every
>> + * supported trace event. The file is to be used by the user-space
>> + * trace parser to analyze the collected trace data.
>> + */
>> + entry = debugfs_create_file("info", 0444, tracedir, NULL,
>> + &traceinfo_file_operations);
>> + if (!entry) {
>> + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
>> + return -ENOMEM;
>> + }
>> +
>> + /*
>> + * the debugfs file that reports the number of events
>> + * lost due to buffer overflow
>> + */
>> + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
>> + &traceoverflow_file_operations);
>> + if (!entry) {
>> + printk(KERN_WARNING
>> + "Couldn't create debugfs 'overflow' entry\n");
>> + return -ENOMEM;
>> + }
>> +
>> + /* the debugfs file that resets trace buffer upon write */
>> + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
>> + &tracereset_file_operations);
>> + if (!entry) {
>> + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
>> + return -ENOMEM;
>> + }
>> +
>> + /* create the directory that holds the control files for every event */
>> + eventdir = debugfs_create_dir("events", tracedir);
>> + if (!eventdir) {
>> + printk(KERN_WARNING "Couldn't create 'events' directory\n");
>> + return -ENOMEM;
>> + }
>> + return 0;
>> +}
>> +
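Assuming debugfs is mounted at /sys/kernel/debug (the mount point is an
assumption, everything else below comes from the code), the function
above yields the following layout:

        /sys/kernel/debug/ktrace/
                enabled         global tracing on/off switch (0664)
                info            name/format/id of every known event (0444)
                overflow        events lost to buffer overruns (0444)
                reset           write 1 to reset the trace buffer (0220)
                buffers/cpuN    one mmap-able data file per possible cpu (0440)
                events/<name>   per-event enable file (0660), created
                                by add_ktrace()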
>> +static void remove_debugfs_file(struct dentry *dir, char *name)
>> +{
>> + struct dentry *dentry;
>> + dentry = lookup_one_len(name, dir, strlen(name));
>> + if (dentry && !IS_ERR(dentry)) {
>> + debugfs_remove(dentry);
>> + dput(dentry);
>> + }
>> +}
>> +
>> +static int remove_debugfs(void)
>> +{
>> + struct dentry *bufdir;
>> + char *tmpname;
>> + int cpu;
>> +
>> + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
>> + if (!bufdir || IS_ERR(bufdir))
>> + return -EIO;
>> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> + if (!tmpname)
>> + return -ENOMEM;
>> + for_each_possible_cpu(cpu) {
>> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> + remove_debugfs_file(bufdir, tmpname);
>> + }
>> + kfree(tmpname);
>> + debugfs_remove(bufdir);
>> +
>> + remove_debugfs_file(tracedir, "enabled");
>> + remove_debugfs_file(tracedir, "info");
>> + remove_debugfs_file(tracedir, "overflow");
>> + remove_debugfs_file(tracedir, "reset");
>> + debugfs_remove(eventdir);
>> + debugfs_remove(tracedir);
>> + return 0;
>> +}
>> +
>> +static int __init init_ktrace(void)
>> +{
>> + struct kernel_trace *iter;
>> + int i, err;
>> +
>> + ktrace_next_id = 0;
>> + kref_set(&ktrace_kref, 1);
>> + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
>> + if (trace_buffer == NULL)
>> + return -ENOMEM;
>> + setup_timer(&trace_timer, wakeup_readers, 0);
>> + for (i = 0; i < KTRACE_HASH_SIZE; i++)
>> + INIT_HLIST_HEAD(&ktrace_table[i]);
>> +
>> + mutex_lock(&ktrace_mutex);
>> + err = create_debugfs();
>> + if (err != 0)
>> + goto out;
>> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> + add_ktrace(iter);
>> + err = register_module_notifier(&ktrace_module_nb);
>> +out:
>> + mutex_unlock(&ktrace_mutex);
>> + return err;
>> +}
>> +module_init(init_ktrace);
>> +
>> +static void __exit exit_ktrace(void)
>> +{
>> + struct kernel_trace *iter;
>> +
>> + if (ktrace_enabled) {
>> + ktrace_enabled = 0;
>> + del_timer_sync(&trace_timer);
>> + }
>> + mutex_lock(&ktrace_mutex);
>> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> + remove_ktrace(iter, 1);
>> + unregister_module_notifier(&ktrace_module_nb);
>> + remove_debugfs();
>> + mutex_unlock(&ktrace_mutex);
>> +
>> + kref_put(&ktrace_kref, release_trace_buffer);
>> +}
>> +module_exit(exit_ktrace);
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
* Jiaying Zhang ([email protected]) wrote:
> Hi Mathieu,
>
> First, I apologize that I did not forward you my original email. I assumed
> that the email would be noticed by interested people when it was sent
> to lkml. Now I see that is a mistake.
>
Hi Jiaying,
Apologies accepted, no hard feelings :) I was just surprised when I
found out about the numbers you posted.
> On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
> <[email protected]> wrote:
> > Hi Jiaying,
> >
> > I'm very interested to hear about the details of your test-case, because
> > what you are talking about here, and the numbers that you get, are, at
> > the very least, not representative of the current LTTng. Perharps you
> > used a version right after the kernel summit and stopped following the
> > improvements we have done since then ? I also wonder about the instrumentation
> > set you enabled in your tests, and whether it's exactly the same in the ktrace
> > and LTTng runs (especially system call tracing). Please note that following the
> > Kernel Summit 2008, a more active development cycle began, and the time was not
> > right for optimisation. But since then a lot of improvement has been done on
> > this side.
>
> Sorry that I omitted a lot of details in my test description. I did use an old
> version of LTTng in my tests. The LTTng version I used was the 0.21 lttng
> patches for 2.6.27 kernel. I used 2.6.26 kernel in my tests, with the ported
> lttng-2.6.27-rc6-0.21 patches.
>
> I enabled the same set of trace events in ktrace and LTTng runs. Here is
> the list of enabled events: syscall entry/exit, irq entry/exit, trap
> entry/exit,
> kernel_sched_switch, kernel_process_fork, kernel_process_wait,
> kernel_process_free, kernel_process_exit.
>
> We are glad to see the improvements coming from the LTTng community.
> I haven't got time to sync with the latest LTTng code and re-run my tests.
> I will let you know when I have such data available.
>
On my side, with the flight recorder activated, the same instrumentation
you used, and LTTng 0.111 :
arm-google script :
#!/bin/sh

DIR=/mnt/debugfs/ltt/markers

for a in \
        syscall_entry syscall_exit \
        irq_entry irq_exit \
        trap_entry trap_exit \
        page_fault_entry page_fault_exit \
        page_fault_nosem_entry page_fault_nosem_exit \
        page_fault_get_user_entry page_fault_get_user_exit \
        sched_schedule process_fork process_wait \
        process_free process_exit;
do
        echo 1 > ${DIR}/kernel/$a/enable
done
running tbench -t 200 8 :

Vanilla kernel  : 2070 MB/sec
Flight recorder : 1693 MB/sec   (performance impact : 18.2 %)
Normal          : 1620 MB/sec   (performance impact : 21.7 %, 8 lttd threads)
So now we are comparing apples with apples :-)
> >
> > First I'd like to express my concern about the way you posted your
> > numbers :
> >
> > 1 - You did not CC anyone from the LTTng project.
> > 2 - You did not tell which LTTng version you used as a comparison basis.
> > 3 - You did not look at the current state of code before posting your
> > comparison. (as shows the comment in your patch header)
> That is indeed my mistake.
>
> > 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
> > the default instrumentation ? Did you also enable system call tracing
> > with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
> > makes a huge difference on the amount of information gathered.
> Yes. We also enable syscall tracing via set_kernel_trace_flag_all_tasks().
>
> >
> > So let's have a close look at them. I'm re-doing the benchmarks based on
> > the current LTTng version below.
> >
> >> Hello All,
> >>
> >> We have been working on building a kernel tracer for a production environment.
> >> We first considered to use the existing markers code, but found it a bit
> >> heavyweight.
> >
> > Have you looked at :
> >
> > 1 - The tracepoints currently in the kernel
> > 2 - The _current_ version of markers in the LTTng tree ?
>
> I looked at the tracepoints currently in the kernel. It is great to see that
> some trace instrumentation code is finally available in the mainline kernel.
> But I still don't like an additional tracepoint layer on top of the
> actual tracing
> (ktrace or markers or ftrace) layer. That seems to add unnecessary overhead.
> The markers code I referred to in my email is the current code in the mainline
> kernel. I had a look at the LTTng markers code but still saw that va_list
> argument is used in marker_probe_func. It sounds a nice improvement
> if the LTTng markers avoids the use of va_arg whenever possible.
>
Yes, va_args is only there as a convenience for slow paths.
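To make the distinction concrete, here is a minimal, self-contained
sketch of the two probe styles (illustration only, not the actual marker
or LTTng code; trace_write() and the flat buffer are stand-ins for the
real ring buffer) :

#include <stdarg.h>
#include <stddef.h>
#include <string.h>

static char trace_buf[4096];            /* stand-in for the real buffer */
static size_t trace_pos;

static void trace_write(const void *data, size_t len)
{
        if (trace_pos + len > sizeof(trace_buf))
                return;                 /* toy overflow policy: drop */
        memcpy(trace_buf + trace_pos, data, len);
        trace_pos += len;
}

/* debug-style probe: decodes every argument at trace time */
static void generic_probe(const char *fmt, ...)
{
        va_list ap;

        va_start(ap, fmt);
        for (; *fmt; fmt++) {
                if (*fmt != '%' || !fmt[1])
                        continue;
                switch (*++fmt) {       /* toy decoder: %d and %ld only */
                case 'd': {
                        int v = va_arg(ap, int);
                        trace_write(&v, sizeof(v));
                        break;
                }
                case 'l': {
                        long v = va_arg(ap, long);
                        trace_write(&v, sizeof(v));
                        break;
                }
                }
        }
        va_end(ap);
}

/* fast-path probe: field layout fixed at compile time, one copy */
static void sched_switch_probe(int prev_pid, int next_pid, long prev_state)
{
        struct { int f1, f2; long f3; } data = {
                prev_pid, next_pid, prev_state };

        trace_write(&data, sizeof(data));
}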
> >
> >> So we developed a new kernel tracing prototype that shares
> >> many similarities as markers but also includes several simplifications that
> >> give us considerable performance benefits. For example, we pass the size of
> >> trace event directly to the probe function instead of using va_arg as that
> >> used in markers, which saves us the overhead of variable passing.
> >
> > The markers in the LTTng tree only use va_args for "debug-style"
> > tracing. All the tracing fast paths are now done with custom probe
> > functions which takes the arguments from the tracepoint and write them
> > in the trace buffers directly in C. When the event payload size is already
> > known statically, there is no dynamic computation cost at all : it's
> > simply passed to the ltt-relay write function.
> But this raises another concern that sub-system developers may need
> to write a bunch of custom probe functions to avoid the argument parsing
> overhead. Can you instead have a general probe function or macro that
> does this work?
>
Well, considering the amount of care we have to take when we consider
adding a tracepoint in the mainline kernel, I think it's even a *good*
thing that someone has to stop and write a small callback to deal with
the information he wants to export. I think debug-style tracing should
be very, very quick and easy to add, but tracepoints meant to make it
into the mainline do not share this requirement.
> >
> >> Also, we
> >> associate a single probe function to each event, so we can directly call the
> >> probe function when tracing is enabled.
> >
> > You seem to imply that being stuck with a single probe function is
> > somehow superior to have tiny per-event callbacks which then call into a
> > buffer write function ? Why ?
> That is indeed my thinking, because we saw that some platforms are
> very sensitive to function call overhead and we haven't seen use cases
> where multi probe functions prove necessary.
>
Hrm, so your plan is to embed the tracing code at the caller site, or to
keep one single function call ?
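Schematically, the two options look roughly like this (a sketch :
trace_reserve()/trace_commit() are hypothetical stand-ins for a ring
buffer API, and only the function-pointer dispatch of the first variant
mirrors what DO_TRACE in your patch expands to) :

#include <stdint.h>
#include <string.h>

/* hypothetical ring buffer entry points, stubbed for illustration */
static unsigned char slot_mem[64];

static void *trace_reserve(size_t len)
{
        return len <= sizeof(slot_mem) ? slot_mem : NULL;
}

static void trace_commit(void *slot)
{
        (void)slot;
}

struct payload { int prev_pid, next_pid; };

/* option 1: a single out-of-line call through a function pointer;
 * the buffer logic lives once, inside the probe function */
static void probe(uint16_t id, const void *data, size_t len)
{
        void *slot = trace_reserve(sizeof(id) + len);

        if (!slot)
                return;
        memcpy(slot, &id, sizeof(id));
        memcpy((unsigned char *)slot + sizeof(id), data, len);
        trace_commit(slot);
}

static void site_with_call(void (*func)(uint16_t, const void *, size_t),
                           int prev, int next)
{
        struct payload karg = { prev, next };

        func(1, &karg, sizeof(karg));           /* one indirect call */
}

/* option 2: the same work fully inlined at the instrumentation site --
 * no call overhead, but every site carries a copy of the buffer logic */
static void site_inlined(int prev, int next)
{
        struct payload karg = { prev, next };
        uint16_t id = 1;
        void *slot = trace_reserve(sizeof(id) + sizeof(karg));

        if (slot) {
                memcpy(slot, &id, sizeof(id));
                memcpy((unsigned char *)slot + sizeof(id),
                       &karg, sizeof(karg));
                trace_commit(slot);
        }
}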
> >
> >> Our measurements show that our
> >> current tracing prototype introduces much lower overhead compared with
> >> lttng (we use lttng as a comparison because markers does not provide the
> >> implementation to record trace data).
> >
> > Why LTTng version, kernel version, LTTng config and runtime options do
> > you use ?
> See above for LTTng and kernel versions. I was using the default LTTng
> config options and 'normal' recorder mode.
>
> >
> >> E.g., with dbench, lttng introduces
> >> 20~80% overhead on our testing platforms while the overhead with our
> >> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
> >> overhead while the overhead added by our kernel tracer is usually within
> >> 15%; with kernbench (time to compile a Linux kernel), the overhead
> >> added by lttng is around 4%~18% while the overhead added by our
> >> kernel tracer is within 2.5%.
> >>
> >
> > I'm re-running the dbench and tbench tests :
> >
> > With LTTng 0.106, kernel 2.6.29-rc7
> > 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
> >
> > Arming all the default LTTng instrumentation (168 tracepoints),
> > Tracing in flight recorder mode, including system call entry/exit
> > tracing and block I/O tracing (blktrace instrumentation), I get :
> >
> > Test / config | Vanilla kernel | LTTng 0.106 | slowdown
> > dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
> >
> > As an example of the effect of custom probe vs va_args based probes,
> > let's take the tbench example
> >
> > | Vanilla | LTTng 0.106 | slowdown
> > tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
> >
> > The LTTng trace statistics shows the top high event rate events :
> >
> > Some are usual (therefore lttng already implements custom probes for them) :
> >
> > 42004 sched_try_wakeup
> > 79754 timer_set
> > 83973 sched_schedule
> > 84317 softirq_exit
> > 84318 softirq_entry
> > 95370 page_free
> > 95412 page_alloc
> > 267247 syscall_entry
> > 267247 syscall_exit
> >
> > Some are more tbench workload-specific :
> >
> > 16915 socket_sendmsg
> > 18911 napi_complete
> > 18911 napi_schedule
> > 18912 dev_receive
> > 18914 dev_xmit
> > 18920 napi_poll
> > 33831 socket_recvmsg
> > 59860 read
> >
> >
> > So let's see what happens if I implement those as custom probes and do a bit of
> > tuning in the tracer hot path :
> Could you give more details about the custom probe you used?
They are available in the lttng tree on git.kernel.org in
ltt/probes/*-trace.c. Their API is in include/linux/ltt-type-serialize.h
> Do you need to write a custom probe for each tracepoint?
>
Yes. But it's not such a big deal. It looks a bit like Steve's
TRACE_EVENT macro-fu, but written in a simple C callback :
void probe_sched_switch(struct rq *rq, struct task_struct *prev,
                struct task_struct *next);

DEFINE_MARKER_TP(kernel, sched_schedule, sched_switch, probe_sched_switch,
        "prev_pid %d next_pid %d prev_state #2d%ld");

notrace void probe_sched_switch(struct rq *rq, struct task_struct *prev,
                struct task_struct *next)
{
        struct marker *marker;
        struct serialize_int_int_short data;

        data.f1 = prev->pid;
        data.f2 = next->pid;
        data.f3 = prev->state;
        marker = &GET_MARKER(kernel, sched_schedule);
        ltt_specialized_trace(marker, marker->single.probe_private,
                &data, serialize_sizeof(data), sizeof(int));
}
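Note that serialize_sizeof(data) here is a compile-time constant, since
the struct layout is fixed; the trailing sizeof(int) argument appears to
be the alignment of the largest recorded field. Either way, the write
path does no run-time format parsing at all.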
Best regards,
Mathieu
> Jiaying
>
> >
> > | Vanilla | LTTng 0.107 | slowdown
> > Custom probes :
> > tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
> >
> > With tracer hot path tuning (inline hot path, function calls for slow path, will
> > be in 0.108) :
> > tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
> >
> > kernbench :
> > Average Optimal load -j 32 Run
> >
> > Vanilla 2.6.29-rc7 :
> > Elapsed Time 48.238 (0.117346)
> >
> > With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
> > Elapsed Time 50.226 (0.0750333)
> >
> > Slowdown : 4.1 %
> >
> > So except the tbench workload on LTTng 0.106, which still had a 52% performance
> > impact (brought it down to 35 % for the next LTTng release) due to
> > specific workload and not having specialized probes implemented, the
> > dbench impact is well below your results (12.3 % vs 20-80%)
> >
> > So given the huge result discrepancies between your tests and mine, could you
> > give us more detail about your benchmarks ? Especially knowing the enabled
> > instrumentation set, and the amount of MB/s generated by both ktrace and LTTng
> > runs (to make sure we are comparing the same traffic) and the LTTng versions you
> > are using as a comparison basis would be good.
> >
> > Thanks,
> >
> > Mathieu
> >
> >
> >> The patch below is the current prototype we have implemented. It uses
> >> Steve's unified trace buffer code with some small extensions to support
> >> buffer mmap. The patch is probably still buggy, but we would like to get
> >> some early feedback from the community.
> >>
> >> Signed-off-by: Jiaying Zhang <[email protected]>
> >>
> >> [ patch snipped : identical to the copy quoted in full earlier in
> >> the thread ]
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, Mar 16, 2009 at 5:01 PM, Mathieu Desnoyers
<[email protected]> wrote:
> * Jiaying Zhang ([email protected]) wrote:
>> Hi Mathieu,
>>
>> First, I apologize that I did not forward you my original email. I assumed
>> that the email would be noticed by interested people when it was sent
>> to lkml. Now I see that is a mistake.
>>
>
> Hi Jiaying,
>
> Apologies accepted, no hard feeling :) I have just been surprised when I
> found out about the numbers you posted.
>
>> On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
>> <[email protected]> wrote:
>> > Hi Jiaying,
>> >
>> > I'm very interested to hear about the details of your test-case, because
>> > what you are talking about here, and the numbers that you get, are, at
>> > the very least, not representative of the current LTTng. Perharps you
>> > used a version right after the kernel summit and stopped following the
>> > improvements we have done since then ? I also wonder about the instrumentation
>> > set you enabled in your tests, and whether it's exactly the same in the ktrace
>> > and LTTng runs (especially system call tracing). Please note that following the
>> > Kernel Summit 2008, a more active development cycle began, and the time was not
>> > right for optimisation. But since then a lot of improvement has been done on
>> > this side.
>>
>> Sorry that I omitted a lot of details in my test description. I did use an old
>> version of LTTng in my tests. The LTTng version I used was the 0.21 lttng
>> patches for 2.6.27 kernel. I used 2.6.26 kernel in my tests, with the ported
>> lttng-2.6.27-rc6-0.21 patches.
>>
>> I enabled the same set of trace events in ktrace and LTTng runs. Here is
>> the list of enabled events: syscall entry/exit, irq entry/exit, trap
>> entry/exit,
>> kernel_sched_switch, kernel_process_fork, kernel_process_wait,
>> kernel_process_free, kernel_process_exit.
>>
>> We are glad to see the improvements coming from the LTTng community.
>> I haven't had time to sync with the latest LTTng code and re-run my tests.
>> I will let you know when I have such data available.
>>
>
> On my side, with flight recorder activated, same instrumentation you
> use, with LTTng 0.111 :
>
> arm-google script :
>
> #!/bin/sh
>
> DIR=/mnt/debugfs/ltt/markers
>
> for a in \
> syscall_entry syscall_exit \
> irq_entry irq_exit \
> trap_entry trap_exit \
> page_fault_entry page_fault_exit \
> page_fault_nosem_entry page_fault_nosem_exit \
> page_fault_get_user_entry page_fault_get_user_exit \
> sched_schedule process_fork process_wait \
> process_free process_exit;
> do
> echo 1 > ${DIR}/kernel/$a/enable
> done
>
> running tbench -t 200 8
>
> Vanilla kernel : 2070 MB/sec
> Flight recorder : 1693 MB/sec Performance impact : 18.2 %
> Normal : 1620 MB/sec Performance impact : 21.7 % (8 lttd threads)
>
> So now we are comparing apples with apples :-)
Looks like the performance of LTTng has improved a lot over the past
few months. I will run some tests with the latest LTTng patches and
let you know if I see any difference. Did you use the 2.6.29-rc7 kernel in
your tests?
>
>> >
>> > First I'd like to express my concern about the way you posted your
>> > numbers :
>> >
>> > 1 - You did not CC anyone from the LTTng project.
>> > 2 - You did not tell which LTTng version you used as a comparison basis.
>> > 3 - You did not look at the current state of code before posting your
>> > comparison. (as the comment in your patch header shows)
>> That is indeed my mistake.
>>
>> > 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
>> > the default instrumentation ? Did you also enable system call tracing
>> > with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
>> > makes a huge difference on the amount of information gathered.
>> Yes. We also enable syscall tracing via set_kernel_trace_flag_all_tasks().
>>
>> >
>> > So let's have a close look at them. I'm re-doing the benchmarks based on
>> > the current LTTng version below.
>> >
>> >> Hello All,
>> >>
>> >> We have been working on building a kernel tracer for a production environment.
>> >> We first considered to use the existing markers code, but found it a bit
>> >> heavyweight.
>> >
>> > Have you looked at :
>> >
>> > 1 - The tracepoints currently in the kernel
>> > 2 - The _current_ version of markers in the LTTng tree ?
>>
>> I looked at the tracepoints currently in the kernel. It is great to see that
>> some trace instrumentation code is finally available in the mainline kernel.
>> But I still don't like an additional tracepoint layer on top of the actual
>> tracing (ktrace or markers or ftrace) layer. That seems to add unnecessary
>> overhead. The markers code I referred to in my email is the current code in
>> the mainline kernel. I had a look at the LTTng markers code but still saw
>> that a va_list argument is used in marker_probe_func. It sounds like a nice
>> improvement if the LTTng markers avoid the use of va_arg whenever possible.
>>
>
> Yes, va_args is only there as a convenience for slow paths.
>
>> >
>> >> So we developed a new kernel tracing prototype that shares
>> >> many similarities with markers but also includes several simplifications that
>> >> give us considerable performance benefits. For example, we pass the size of
>> >> the trace event directly to the probe function instead of using va_arg as
>> >> markers do, which saves us the overhead of variable-argument passing.
>> >
>> > The markers in the LTTng tree only use va_args for "debug-style"
>> > tracing. All the tracing fast paths are now done with custom probe
>> > functions which take the arguments from the tracepoint and write them
>> > in the trace buffers directly in C. When the event payload size is already
>> > known statically, there is no dynamic computation cost at all : it's
>> > simply passed to the ltt-relay write function.
>> But this raises another concern: sub-system developers may need
>> to write a bunch of custom probe functions to avoid the argument-parsing
>> overhead. Can you instead have a general probe function or macro that
>> does this work?
>>
>
> Well, considering the amount of care we have to take when adding a
> tracepoint in the mainline kernel, I think it's even a *good* thing
> that someone has to stop and write a small callback to deal with the
> information he wants to export. I think debug-style tracing should be
> very, very quick and easy to add, but tracepoints meant to make it
> into the mainline do not share this requirement.
I agree that we need to be careful when adding tracepoints into the kernel,
but I still think it is good to simplify the process of adding/maintaining
tracepoints so sub-system maintainers don't feel it is a big burden.
>
>> >
>> >> Also, we
>> >> associate a single probe function to each event, so we can directly call the
>> >> probe function when tracing is enabled.
>> >
>> > You seem to imply that being stuck with a single probe function is
>> > somehow superior to having tiny per-event callbacks which then call into a
>> > buffer write function ? Why ?
>> That is indeed my thinking, because we saw that some platforms are
>> very sensitive to function call overhead and we haven't seen use cases
>> where multiple probe functions prove necessary.
>>
>
> Hrm, so your plan is to embed the tracing code at the caller site, or to
> keep one single function call ?
We do have a general probe function that is called when tracing is enabled.
Other than that, the main tracing code is defined as an inline function that is
embedded at the caller site.
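
To illustrate, here is a rough sketch of the pattern (the event name and
fields below are hypothetical; the real machinery is the DO_TRACE macro in
the patch quoted below):

DEFINE_KTRACE_STRUCT(sample_event) {
	int pid;
	long state;
};

/* inline wrapper, so the fast path is embedded at each caller site */
static inline void trace_sample_event(int pid, long state)
{
	/*
	 * DO_TRACE expands to the static kernel_trace object, the two
	 * predicted 'enabled' checks, and the indirect probe call.
	 */
	DO_TRACE(sample_event, "pid %d state %ld", pid, state);
}

So when tracing is disabled, the only cost at the call site is the two
predicted branches on ktrace_enabled and the per-event enabled bit.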
Jiaying
>
>> >
>> >> Our measurements show that our
>> >> current tracing prototype introduces much lower overhead compared with
>> >> lttng (we use lttng as a comparison because markers does not provide the
>> >> implementation to record trace data).
>> >
>> > Which LTTng version, kernel version, LTTng config and runtime options do
>> > you use ?
>> See above for LTTng and kernel versions. I was using the default LTTng
>> config options and 'normal' recorder mode.
>>
>> >
>> >> E.g., with dbench, lttng introduces
>> >> 20~80% overhead on our testing platforms while the overhead with our
>> >> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
>> >> overhead while the overhead added by our kernel tracer is usually within
>> >> 15%; with kernbench (time to compile a Linux kernel), the overhead
>> >> added by lttng is around 4%~18% while the overhead added by our
>> >> kernel tracer is within 2.5%.
>> >>
>> >
>> > I'm re-running the dbench and tbench tests :
>> >
>> > With LTTng 0.106, kernel 2.6.29-rc7
>> > 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
>> >
>> > Arming all the default LTTng instrumentation (168 tracepoints),
>> > Tracing in flight recorder mode, including system call entry/exit
>> > tracing and block I/O tracing (blktrace instrumentation), I get :
>> >
>> > Test / config | Vanilla kernel | LTTng 0.106 | slowdown
>> > dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
>> >
>> > As an example of the effect of custom probe vs va_args based probes,
>> > let's take the tbench example
>> >
>> > | Vanilla | LTTng 0.106 | slowdown
>> > tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
>> >
>> > The LTTng trace statistics shows the top high event rate events :
>> >
>> > Some are usual (therefore lttng already implements custom probes for them) :
>> >
>> > 42004 sched_try_wakeup
>> > 79754 timer_set
>> > 83973 sched_schedule
>> > 84317 softirq_exit
>> > 84318 softirq_entry
>> > 95370 page_free
>> > 95412 page_alloc
>> > 267247 syscall_entry
>> > 267247 syscall_exit
>> >
>> > Some are more tbench workload-specific :
>> >
>> > 16915 socket_sendmsg
>> > 18911 napi_complete
>> > 18911 napi_schedule
>> > 18912 dev_receive
>> > 18914 dev_xmit
>> > 18920 napi_poll
>> > 33831 socket_recvmsg
>> > 59860 read
>> >
>> >
>> > So let's see what happens if I implement those as custom probes and do a bit of
>> > tuning in the tracer hot path :
>> Could you give more details about the custom probes you used?
>
> They are available in the lttng tree on git.kernel.org in
> ltt/probes/*-trace.c. Their API is in include/linux/ltt-type-serialize.h
>
>> Do you need to write a custom probe for each tracepoint?
>>
>
> Yes. But it's not such a big deal. It looks a bit like Steve's
> TRACE_EVENT macro-fu, but written in a simple C callback :
>
> void probe_sched_switch(struct rq *rq, struct task_struct *prev,
> struct task_struct *next);
>
> DEFINE_MARKER_TP(kernel, sched_schedule, sched_switch, probe_sched_switch,
> "prev_pid %d next_pid %d prev_state #2d%ld");
>
> notrace void probe_sched_switch(struct rq *rq, struct task_struct *prev,
> struct task_struct *next)
> {
> struct marker *marker;
> struct serialize_int_int_short data;
>
> data.f1 = prev->pid;
> data.f2 = next->pid;
> data.f3 = prev->state;
>
> marker = &GET_MARKER(kernel, sched_schedule);
> ltt_specialized_trace(marker, marker->single.probe_private,
> &data, serialize_sizeof(data), sizeof(int));
> }
>
> Best regards,
>
> Mathieu
>
>> Jiaying
>>
>> >
>> > | Vanilla | LTTng 0.107 | slowdown
>> > Custom probes :
>> > tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
>> >
>> > With tracer hot path tuning (inline hot path, function calls for slow path, will
>> > be in 0.108) :
>> > tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
>> >
>> > kernbench :
>> > Average Optimal load -j 32 Run
>> >
>> > Vanilla 2.6.29-rc7 :
>> > Elapsed Time 48.238 (0.117346)
>> >
>> > With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
>> > Elapsed Time 50.226 (0.0750333)
>> >
>> > Slowdown : 4.1 %
>> >
>> > So except for the tbench workload on LTTng 0.106, which still had a 52%
>> > performance impact (brought it down to 35 % for the next LTTng release) due
>> > to the specific workload and to not having specialized probes implemented, the
>> > dbench impact is well below your results (12.3 % vs 20-80%)
>> >
>> > So given the huge result discrepancies between your tests and mine, could you
>> > give us more detail about your benchmarks ? Especially knowing the enabled
>> > instrumentation set, and the amount of MB/s generated by both ktrace and LTTng
>> > runs (to make sure we are comparing the same traffic) and the LTTng versions you
>> > are using as a comparison basis would be good.
>> >
>> > Thanks,
>> >
>> > Mathieu
>> >
>> >
>> >> The patch below is the current prototype we have implemented. It uses
>> >> Steve's unified trace buffer code with some small extensions to support
>> >> buffer mmap. The patch is probably still buggy, but we would like to get
>> >> some early feedback from the community.
>> >>
>> >> Signed-off-by: Jiaying Zhang <[email protected]>
>> >>
>> >> Index: git-linux/init/Kconfig
>> >> ===================================================================
>> >> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
>> >> +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
>> >> @@ -950,6 +950,15 @@
>> >> Place an empty function call at each marker site. Can be
>> >> dynamically changed for a probe function.
>> >>
>> >> +config KTRACE
>> >> + bool "Enable ktrace support"
>> >> + select DEBUG_FS
>> >> + select TRACING
>> >> + select RING_BUFFER
>> >> + help
>> >> + Ktrace is a kernel tracing tool that allows you to trace
>> >> + kernel events by inserting trace points at proper places.
>> >> +
>> >> source "arch/Kconfig"
>> >>
>> >> endmenu # General setup
>> >> Index: git-linux/kernel/Makefile
>> >> ===================================================================
>> >> --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
>> >> +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
>> >> @@ -86,6 +86,7 @@
>> >> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
>> >> obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
>> >> obj-$(CONFIG_MARKERS) += marker.o
>> >> +obj-$(CONFIG_KTRACE) += ktrace.o
>> >> obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
>> >> obj-$(CONFIG_LATENCYTOP) += latencytop.o
>> >> obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
>> >> Index: git-linux/include/asm-generic/vmlinux.lds.h
>> >> ===================================================================
>> >> --- git-linux.orig/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:37.000000000 -0800
>> >> +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:39.000000000 -0800
>> >> @@ -73,6 +73,9 @@
>> >> MEM_KEEP(init.data) \
>> >> MEM_KEEP(exit.data) \
>> >> . = ALIGN(8); \
>> >> + VMLINUX_SYMBOL(__start___ktraces) = .; \
>> >> + *(__ktrace) \
>> >> + VMLINUX_SYMBOL(__stop___ktraces) = .; \
>> >> VMLINUX_SYMBOL(__start___markers) = .; \
>> >> *(__markers) \
>> >> VMLINUX_SYMBOL(__stop___markers) = .; \
>> >> @@ -89,6 +92,7 @@
>> >> VMLINUX_SYMBOL(__start_rodata) = .; \
>> >> *(.rodata) *(.rodata.*) \
>> >> *(__vermagic) /* Kernel version magic */ \
>> >> + *(__ktrace_strings) /* Ktrace: strings */ \
>> >> *(__markers_strings) /* Markers: strings */ \
>> >> *(__tracepoints_strings)/* Tracepoints: strings */ \
>> >> } \
>> >> Index: git-linux/kernel/module.c
>> >> ===================================================================
>> >> --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
>> >> +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
>> >> @@ -50,6 +50,7 @@
>> >> #include <asm/sections.h>
>> >> #include <linux/tracepoint.h>
>> >> #include <linux/ftrace.h>
>> >> +#include <linux/ktrace.h>
>> >> #include <linux/async.h>
>> >>
>> >> #if 0
>> >> @@ -2145,6 +2146,10 @@
>> >> sizeof(*mod->tracepoints),
>> >> &mod->num_tracepoints);
>> >> #endif
>> >> +#ifdef CONFIG_KTRACE
>> >> + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
>> >> + sizeof(*mod->ktrace), &mod->num_ktrace);
>> >> +#endif
>> >>
>> >> #ifdef CONFIG_MODVERSIONS
>> >> if ((mod->num_syms && !mod->crcs)
>> >> Index: git-linux/include/linux/module.h
>> >> ===================================================================
>> >> --- git-linux.orig/include/linux/module.h 2009-02-19 14:58:37.000000000 -0800
>> >> +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
>> >> @@ -44,6 +44,7 @@
>> >> };
>> >>
>> >> struct module;
>> >> +struct kernel_trace;
>> >>
>> >> struct module_attribute {
>> >> struct attribute attr;
>> >> @@ -347,6 +348,11 @@
>> >> /* Reference counts */
>> >> struct module_ref ref[NR_CPUS];
>> >> #endif
>> >> +#ifdef CONFIG_KTRACE
>> >> + struct kernel_trace *ktrace;
>> >> + unsigned int num_ktrace;
>> >> +#endif
>> >> +
>> >> };
>> >> #ifndef MODULE_ARCH_INIT
>> >> #define MODULE_ARCH_INIT {}
>> >> Index: git-linux/include/linux/ktrace.h
>> >> ===================================================================
>> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> >> +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
>> >> @@ -0,0 +1,83 @@
>> >> +#ifndef _LINUX_KTRACE_H
>> >> +#define _LINUX_KTRACE_H
>> >> +
>> >> +#include <linux/types.h>
>> >> +
>> >> +struct kernel_trace;
>> >> +
>> >> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
>> >> +
>> >> +struct kernel_trace {
>> >> + const char *name;
>> >> + const char *format; /* format string describing variable list */
>> >> + size_t *stroff; /* offsets of string variables */
>> >> + /* 31-bit event_id is converted to 16 bits when entering the buffer */
>> >> + u32 enabled:1, event_id:31;
>> >> + ktrace_probe_func *func; /* probe function */
>> >> + struct list_head list; /* list head linked to the hash table entry */
>> >> +} __attribute__((aligned(8)));
>> >> +
>> >> +extern int ktrace_enabled;
>> >> +
>> >> +/*
>> >> + * Make sure the alignment of the structure in the __ktrace section will
>> >> + * not add unwanted padding between the beginning of the section and the
>> >> + * structure. Force alignment to the same alignment as the section start.
>> >> + */
>> >> +
>> >> +#define DEFINE_KTRACE_STRUCT(name) \
>> >> + struct __attribute__((packed)) ktrace_struct_##name
>> >> +
>> >> +#ifdef CONFIG_KTRACE
>> >> +/*
>> >> + * DO_TRACE is the macro that is called from each trace point. Each trace
>> >> + * point is associated with a static ktrace object that is located in a
>> >> + * special __ktrace section and is uniquely identified by the event name and
>> >> + * event format. Take a look at the files under include/trace for example
>> >> + * usage. Note that we call the trace probe function only if the global
>> >> + * tracing is enabled and the tracing for the associated event is enabled,
>> >> + * so we only introduce the overhead of a predicted branch check when
>> >> + * tracing is disabled.
>> >> + */
>> >> +#define DO_TRACE(name, format, args...) \
>> >> + do { \
>> >> + static const char __kstrtab_##name[] \
>> >> + __attribute__((section("__ktrace_strings"))) \
>> >> + = #name "\0" format; \
>> >> + static struct kernel_trace __ktrace_##name \
>> >> + __attribute__((section("__ktrace"), aligned(8))) = \
>> >> + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
>> >> + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
>> >> + __ktrace_check_format(format, ## args); \
>> >> + if (unlikely(ktrace_enabled) && \
>> >> + unlikely(__ktrace_##name.enabled)) { \
>> >> + struct ktrace_struct_##name karg = { args }; \
>> >> + (*__ktrace_##name.func) \
>> >> + (&__ktrace_##name, &karg, sizeof(karg)); \
>> >> + } \
>> >> + } while (0)
>> >> +
>> >> +#else /* !CONFIG_KTRACE */
>> >> +#define DO_TRACE(name, format, args...)
>> >> +#endif
>> >> +
>> >> +/* To be used for string format validity checking with gcc */
>> >> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
>> >> +{
>> >> +}
>> >> +
>> >> +#define __ktrace_check_format(format, args...) \
>> >> + do { \
>> >> + if (0) \
>> >> + ___ktrace_check_format(format, ## args); \
>> >> + } while (0)
>> >> +
>> >> +
>> >> +/* get the trace buffer information */
>> >> +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
>> >> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
>> >> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
>> >> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
>> >> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
>> >> +
>> >> +#endif
>> >> Index: git-linux/kernel/ktrace.c
>> >> ===================================================================
>> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> >> +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
>> >> @@ -0,0 +1,872 @@
>> >> +/*
>> >> + * kernel/ktrace.c
>> >> + *
>> >> + * Implementation of the kernel tracing for linux kernel 2.6.
>> >> + *
>> >> + * Copyright 2008- Google Inc.
>> >> + * Original Author: Jiaying Zhang
>> >> + */
>> >> +
>> >> +#include <linux/module.h>
>> >> +#include <linux/kernel.h>
>> >> +#include <linux/bitmap.h>
>> >> +#include <linux/poll.h>
>> >> +#include <linux/errno.h>
>> >> +#include <linux/sched.h>
>> >> +#include <linux/linkage.h>
>> >> +#include <linux/debugfs.h>
>> >> +#include <linux/timer.h>
>> >> +#include <linux/delay.h>
>> >> +#include <linux/jhash.h>
>> >> +#include <linux/ctype.h>
>> >> +#include <linux/namei.h>
>> >> +#include <linux/ktrace.h>
>> >> +#include <linux/ring_buffer.h>
>> >> +
>> >> +static char *ktrace_version = "1.0.0";
>> >> +int ktrace_enabled;
>> >> +EXPORT_SYMBOL(ktrace_enabled);
>> >> +static uint16_t ktrace_next_id;
>> >> +
>> >> +extern struct kernel_trace __start___ktraces[];
>> >> +extern struct kernel_trace __stop___ktraces[];
>> >> +
>> >> +static DEFINE_MUTEX(ktrace_mutex);
>> >> +
>> >> +/* ktrace hash table entry structure and variables */
>> >> +struct ktrace_entry {
>> >> + struct hlist_node hlist;
>> >> + char *name;
>> >> + char *format;
>> >> + size_t *stroff;
>> >> + ktrace_probe_func *func;
>> >> + u32 enabled:1, event_id:31;
>> >> + struct list_head klist; /* list of loaded ktraces */
>> >> +};
>> >> +
>> >> +#define KTRACE_HASH_BITS 6
>> >> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
>> >> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
>> >> +
>> >> +/* debugfs directory variables */
>> >> +static struct dentry *tracedir;
>> >> +static struct dentry *eventdir;
>> >> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
>> >> +
>> >> +/* ring buffer code */
>> >> +static unsigned long trace_buf_size = 65536UL;
>> >> +struct ring_buffer *trace_buffer;
>> >> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
>> >> +static struct timer_list trace_timer; /* reader wake-up timer */
>> >> +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
>> >> +static struct kref ktrace_kref;
>> >> +
>> >> +static void
>> >> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> >> +{
>> >> + struct ring_buffer_event *event;
>> >> + void *trace_event;
>> >> + unsigned long flag;
>> >> +
>> >> + event = ring_buffer_lock_reserve(trace_buffer,
>> >> + sizeof(uint16_t) + event_size, &flag);
>> >> + if (!event)
>> >> + return;
>> >> + trace_event = ring_buffer_event_data(event);
>> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> >> + memcpy(trace_event + sizeof(uint16_t), data, event_size);
>> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>> >> +}
>> >> +
>> >> +/* special probe function for events that contain string arguments */
>> >> +static void
>> >> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> >> +{
>> >> + struct ring_buffer_event *event;
>> >> + void *trace_event, *p;
>> >> + unsigned long flag;
>> >> + size_t *offset, scanned;
>> >> + char *string;
>> >> +
>> >> + /*
>> >> + * Get the real length of the event, i.e., we use stroff array to
>> >> + * locate string variables in the passed-in trace data and use their
>> >> + * length to replace the size of the string pointers.
>> >> + */
>> >> + for (offset = kt->stroff; *offset != -1; offset++) {
>> >> + string = *(char **) (data + *offset);
>> >> + event_size += strlen(string) + 1 - sizeof(char *);
>> >> + }
>> >> + event = ring_buffer_lock_reserve(trace_buffer,
>> >> + sizeof(uint16_t) + event_size, &flag);
>> >> + if (!event)
>> >> + return;
>> >> + trace_event = ring_buffer_event_data(event);
>> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> >> + p = trace_event + sizeof(uint16_t);
>> >> + /*
>> >> + * Copy the trace data into buffer. For string variables, we enter the
>> >> + * string into the buffer. Otherwise, the passed in data is copied.
>> >> + */
>> >> + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
>> >> + memcpy(p, data + scanned, *offset - scanned);
>> >> + p += *offset - scanned;
>> >> + string = *(char **) (data + *offset);
>> >> + memcpy(p, string, strlen(string) + 1);
>> >> + p += strlen(string) + 1;
>> >> + scanned = *offset + sizeof(char *);
>> >> + }
>> >> + memcpy(p, data + scanned,
>> >> + trace_event + sizeof(uint16_t) + event_size - p);
>> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>> >> +}
>> >> +
>> >> +/* timer function used to defer ring buffer reader waking */
>> >> +static void wakeup_readers(unsigned long data)
>> >> +{
>> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> >> + wake_up_interruptible(&trace_wait);
>> >> + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
>> >> +}
>> >> +
>> >> +/* function for reading ktrace metadata info */
>> >> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
>> >> +{
>> >> + seq_printf(seq, "version %s\n", ktrace_version);
>> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> >> +}
>> >> +
>> >> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
>> >> +{
>> >> + ++*pos;
>> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> >> +}
>> >> +
>> >> +static int tracing_info_show(struct seq_file *seq, void *v)
>> >> +{
>> >> + struct hlist_head *head = v;
>> >> + struct hlist_node *node;
>> >> + struct ktrace_entry *entry;
>> >> + hlist_for_each_entry(entry, node, head, hlist)
>> >> + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
>> >> + entry->name, entry->format, entry->event_id,
>> >> + entry->enabled ? "enabled" : "disabled");
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static void tracing_info_stop(struct seq_file *seq, void *v)
>> >> +{
>> >> +}
>> >> +
>> >> +static const struct seq_operations tracing_info_seq_ops = {
>> >> + .start = tracing_info_start,
>> >> + .next = tracing_info_next,
>> >> + .show = tracing_info_show,
>> >> + .stop = tracing_info_stop,
>> >> +};
>> >> +
>> >> +static int tracing_info_open(struct inode *inode, struct file *file)
>> >> +{
>> >> + return seq_open(file, &tracing_info_seq_ops);
>> >> +}
>> >> +
>> >> +static const struct file_operations traceinfo_file_operations = {
>> >> + .open = tracing_info_open,
>> >> + .read = seq_read,
>> >> + .llseek = seq_lseek,
>> >> + .release = seq_release,
>> >> +};
>> >> +
>> >> +/*
>> >> + * wrapper function used by debugfs write operation.
>> >> + * func: handling function that does real work
>> >> + */
>> >> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
>> >> + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
>> >> +{
>> >> + int ret;
>> >> + char buf[64];
>> >> + unsigned long val;
>> >> +
>> >> + if (cnt >= sizeof(buf))
>> >> + return -EINVAL;
>> >> + if (copy_from_user(&buf, ubuf, cnt))
>> >> + return -EFAULT;
>> >> + buf[cnt] = 0;
>> >> +
>> >> + ret = strict_strtoul(buf, 10, &val);
>> >> + if (ret < 0)
>> >> + return ret;
>> >> + val = !!val;
>> >> +
>> >> + func(filp, val, &cnt);
>> >> +
>> >> + filp->f_pos += cnt;
>> >> + return cnt;
>> >> +}
>> >> +
>> >> +/* functions for reading/writing the global 'enabled' control file */
>> >> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
>> >> + size_t cnt, loff_t *ppos)
>> >> +{
>> >> + int ret;
>> >> + char buf[64];
>> >> + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
>> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >> +}
>> >> +
>> >> +static void __tracing_control_write(struct file *filp,
>> >> + unsigned long val, size_t *cnt)
>> >> +{
>> >> + if (val ^ ktrace_enabled) {
>> >> + if (val) {
>> >> + trace_timer.expires =
>> >> + jiffies + TRACEREAD_WAKEUP_INTERVAL;
>> >> + add_timer(&trace_timer);
>> >> + ktrace_enabled = 1;
>> >> + } else {
>> >> + ktrace_enabled = 0;
>> >> + del_timer_sync(&trace_timer);
>> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> >> + wake_up_interruptible(&trace_wait);
>> >> + }
>> >> + }
>> >> +}
>> >> +
>> >> +static ssize_t tracing_control_write(struct file *filp, const char __user *ubuf,
>> >> + size_t cnt, loff_t *ppos)
>> >> +{
>> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >> + __tracing_control_write);
>> >> +}
>> >> +
>> >> +static const struct file_operations tracecontrol_file_operations = {
>> >> + .read = tracing_control_read,
>> >> + .write = tracing_control_write,
>> >> +};
>> >> +
>> >> +/* functions for reading/writing a trace event file */
>> >> +static int update_ktrace(struct ktrace_entry *entry,
>> >> + ktrace_probe_func *func, int enabled)
>> >> +{
>> >> + struct kernel_trace *iter;
>> >> + /* no need to update the list if the tracing is not initialized */
>> >> + if (!tracedir)
>> >> + return 0;
>> >> + entry->enabled = enabled;
>> >> + entry->func = func;
>> >> + list_for_each_entry(iter, &entry->klist, list) {
>> >> + iter->stroff = entry->stroff;
>> >> + iter->func = func;
>> >> + iter->enabled = enabled;
>> >> + }
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static int ktrace_probe_register(struct ktrace_entry *entry,
>> >> + ktrace_probe_func *probefunc)
>> >> +{
>> >> + return update_ktrace(entry, probefunc, 1);
>> >> +}
>> >> +
>> >> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
>> >> +{
>> >> + return update_ktrace(entry, NULL, 0);
>> >> +}
>> >> +
>> >> +static int ktrace_control_open(struct inode *inode, struct file *filp)
>> >> +{
>> >> + filp->private_data = inode->i_private;
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
>> >> + size_t cnt, loff_t *ppos)
>> >> +{
>> >> + int ret;
>> >> + char buf[64];
>> >> + struct ktrace_entry *entry = filp->private_data;
>> >> +
>> >> + ret = snprintf(buf, 64, "%u\n", entry->enabled);
>> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >> +}
>> >> +
>> >> +/*
>> >> + * Check whether the event contains string arguments. Return 0 if not;
>> >> + * otherwise record the offsets of the string arguments in the ktrace
>> >> + * structure according to the event format and return 1.
>> >> + * The offset info is later used to store the string arguments into
>> >> + * the trace buffer during trace probe.
>> >> + */
>> >> +static int contain_string_arguments(struct ktrace_entry *entry)
>> >> +{
>> >> + int count = 0;
>> >> + char *fmt;
>> >> + int qualifier; /* 'h', 'l', or 'L' for integer fields */
>> >> + size_t offset = 0;
>> >> +
>> >> + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
>> >> + count++;
>> >> + if (!count)
>> >> + return 0;
>> >> + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
>> >> + if (!entry->stroff)
>> >> + return -ENOMEM;
>> >> +
>> >> + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
>> >> + if (*fmt != '%')
>> >> + continue;
>> >> +repeat:
>> >> + ++fmt;
>> >> + switch (*fmt) {
>> >> + case '-':
>> >> + case '+':
>> >> + case ' ':
>> >> + case '#':
>> >> + case '0':
>> >> + ++fmt;
>> >> + goto repeat;
>> >> + }
>> >> +
>> >> + while (isdigit(*fmt))
>> >> + fmt++;
>> >> +
>> >> + /* get the conversion qualifier */
>> >> + qualifier = -1;
>> >> + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
>> >> + qualifier = *fmt;
>> >> + ++fmt;
>> >> + if (qualifier == 'l' && *fmt == 'l') {
>> >> + qualifier = 'L';
>> >> + ++fmt;
>> >> + }
>> >> + }
>> >> +
>> >> + switch (*fmt) {
>> >> + case 'c':
>> >> + offset += sizeof(char);
>> >> + continue;
>> >> + case 's':
>> >> + entry->stroff[count] = offset;
>> >> + count++;
>> >> + offset += sizeof(char *);
>> >> + continue;
>> >> + case 'p':
>> >> + offset += sizeof(void *);
>> >> + continue;
>> >> + case 'd':
>> >> + case 'i':
>> >> + case 'o':
>> >> + case 'u':
>> >> + case 'x':
>> >> + case 'X':
>> >> + switch (qualifier) {
>> >> + case 'L':
>> >> + offset += sizeof(long long);
>> >> + break;
>> >> + case 'l':
>> >> + offset += sizeof(long);
>> >> + break;
>> >> + case 'h':
>> >> + offset += sizeof(short);
>> >> + break;
>> >> + default:
>> >> + offset += sizeof(int);
>> >> + break;
>> >> + }
>> >> + break;
>> >> + default:
>> >> + printk(KERN_WARNING "unknown format '%c'", *fmt);
>> >> + continue;
>> >> + }
>> >> + }
>> >> + entry->stroff[count] = -1;
>> >> + return 1;
>> >> +}
>> >> +
>> >> +static void __ktrace_control_write(struct file *filp,
>> >> + unsigned long val, size_t *cnt)
>> >> +{
>> >> + struct ktrace_entry *entry = filp->private_data;
>> >> + int ret;
>> >> +
>> >> + mutex_lock(&ktrace_mutex);
>> >> + if (val ^ entry->enabled) {
>> >> + if (val) {
>> >> + ret = contain_string_arguments(entry);
>> >> + if (ret == 0)
>> >> + ktrace_probe_register(entry, ring_buffer_probe);
>> >> + else if (ret > 0)
>> >> + ktrace_probe_register(entry, string_probe);
>> >> + else
>> >> + *cnt = ret;
>> >> + } else
>> >> + ktrace_probe_unregister(entry);
>> >> + }
>> >> + mutex_unlock(&ktrace_mutex);
>> >> +}
>> >> +
>> >> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
>> >> + size_t cnt, loff_t *ppos)
>> >> +{
>> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >> + __ktrace_control_write);
>> >> +}
>> >> +
>> >> +static const struct file_operations ktracecontrol_file_operations = {
>> >> + .open = ktrace_control_open,
>> >> + .read = ktrace_control_read,
>> >> + .write = ktrace_control_write,
>> >> +};
>> >> +
>> >> +/* functions for adding/removing a trace event. Protected by mutex lock.
>> >> + * Called during initialization or after loading a module */
>> >> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
>> >> +{
>> >> + struct ktrace_entry *entry;
>> >> + size_t name_len = strlen(ktrace->name) + 1;
>> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>> >> + struct hlist_head *head;
>> >> + struct hlist_node *node;
>> >> + struct dentry *dentry;
>> >> +
>> >> + if (!tracedir)
>> >> + return NULL;
>> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> >> + hlist_for_each_entry(entry, node, head, hlist) {
>> >> + if (!strcmp(ktrace->name, entry->name)) {
>> >> + if (strcmp(ktrace->format, entry->format)) {
>> >> + printk(KERN_WARNING "the format of tracepoint "
>> >> + "\'%s\' changes from \'%s\' to \'%s\'."
>> >> + "Dynamic changing of trace format is "
>> >> + "not supported yet!\n",
>> >> + ktrace->name, entry->format,
>> >> + ktrace->format);
>> >> + return ERR_PTR(-EINVAL);
>> >> + }
>> >> + ktrace->enabled = entry->enabled;
>> >> + ktrace->func = entry->func;
>> >> + ktrace->event_id = entry->event_id;
>> >> + if (list_empty(&entry->klist))
>> >> + goto add_head;
>> >> + list_add_tail(&ktrace->list, &entry->klist);
>> >> + return entry;
>> >> + }
>> >> + }
>> >> + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
>> >> + if (!entry)
>> >> + return ERR_PTR(-ENOMEM);
>> >> + entry->name = kmalloc(name_len, GFP_KERNEL);
>> >> + if (!entry->name) {
>> >> + kfree(entry);
>> >> + return ERR_PTR(-ENOMEM);
>> >> + }
>> >> + memcpy(entry->name, ktrace->name, name_len);
>> >> + if (ktrace->format) {
>> >> + size_t format_len = strlen(ktrace->format) + 1;
>> >> + entry->format = kmalloc(format_len, GFP_KERNEL);
>> >> + if (!entry->format) {
>> >> + kfree(entry->name);
>> >> + kfree(entry);
>> >> + return ERR_PTR(-ENOMEM);
>> >> + }
>> >> + memcpy(entry->format, ktrace->format, format_len);
>> >> + } else
>> >> + entry->format = NULL;
>> >> + entry->func = ktrace->func;
>> >> + entry->enabled = 0;
>> >> + ktrace->event_id = entry->event_id = ktrace_next_id++;
>> >> + entry->stroff = NULL;
>> >> + INIT_LIST_HEAD(&entry->klist);
>> >> + hlist_add_head(&entry->hlist, head);
>> >> +add_head:
>> >> + list_add(&ktrace->list, &entry->klist);
>> >> + dentry = debugfs_create_file(entry->name, 0660, eventdir,
>> >> + entry, &ktracecontrol_file_operations);
>> >> + if (!dentry)
>> >> + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
>> >> + entry->name);
>> >> + return entry;
>> >> +}
>> >> +
>> >> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
>> >> +{
>> >> + struct ktrace_entry *entry;
>> >> + size_t name_len = strlen(ktrace->name) + 1;
>> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>> >> + struct hlist_head *head;
>> >> + struct hlist_node *node, *temp;
>> >> + struct dentry *dentry;
>> >> +
>> >> + if (!tracedir)
>> >> + return 0;
>> >> + list_del(&ktrace->list);
>> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> >> + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
>> >> + if (!strcmp(ktrace->name, entry->name)) {
>> >> + if (list_empty(&entry->klist)) {
>> >> + dentry = lookup_one_len(entry->name,
>> >> + eventdir, strlen(entry->name));
>> >> + if (dentry && !IS_ERR(dentry)) {
>> >> + debugfs_remove(dentry);
>> >> + dput(dentry);
>> >> + }
>> >> + entry->enabled = 0;
>> >> + if (free) {
>> >> + hlist_del(&entry->hlist);
>> >> + kfree(entry->name);
>> >> + kfree(entry->format);
>> >> + kfree(entry->stroff);
>> >> + kfree(entry);
>> >> + }
>> >> + }
>> >> + return 0;
>> >> + }
>> >> + return -1;
>> >> +}
>> >> +
>> >> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
>> >> + * Called at the end of module load/unload. */
>> >> +static int ktrace_module_notify(struct notifier_block *self,
>> >> + unsigned long val, void *data)
>> >> +{
>> >> +#ifdef CONFIG_MODULES
>> >> + struct kernel_trace *iter;
>> >> + struct module *mod = data;
>> >> +
>> >> + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
>> >> + return 0;
>> >> + mutex_lock(&ktrace_mutex);
>> >> + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
>> >> + if (val == MODULE_STATE_COMING)
>> >> + add_ktrace(iter);
>> >> + else
>> >> + remove_ktrace(iter, 0);
>> >> + mutex_unlock(&ktrace_mutex);
>> >> +#endif
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static struct notifier_block ktrace_module_nb = {
>> >> + .notifier_call = ktrace_module_notify,
>> >> +};
>> >> +
>> >> +/* functions for user-space programs to read data from tracing buffer */
>> >> +static int trace_data_open(struct inode *inode, struct file *filp)
>> >> +{
>> >> + filp->private_data = inode->i_private;
>> >> + kref_get(&ktrace_kref);
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static void release_trace_buffer(struct kref *kref)
>> >> +{
>> >> + ring_buffer_free(trace_buffer);
>> >> + trace_buffer = NULL;
>> >> +}
>> >> +
>> >> +static int trace_data_release(struct inode *inode, struct file *filp)
>> >> +{
>> >> + kref_put(&ktrace_kref, release_trace_buffer);
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
>> >> +{
>> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >> + if (filp->f_mode & FMODE_READ) {
>> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> >> + return POLLIN | POLLRDNORM;
>> >> + poll_wait(filp, &trace_wait, poll_table);
>> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> >> + return POLLIN | POLLRDNORM;
>> >> + }
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> >> +{
>> >> + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
>> >> + pgoff_t pgoff = vmf->pgoff;
>> >> + struct page *page;
>> >> +
>> >> + if (!trace_buffer)
>> >> + return VM_FAULT_OOM;
>> >> + page = ring_buffer_get_page(cpu_buffer, pgoff);
>> >> + if (page == NULL)
>> >> + return VM_FAULT_SIGBUS;
>> >> + get_page(page);
>> >> + vmf->page = page;
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static struct vm_operations_struct trace_data_mmap_ops = {
>> >> + .fault = trace_buf_fault,
>> >> +};
>> >> +
>> >> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
>> >> +{
>> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >> +
>> >> + vma->vm_ops = &trace_data_mmap_ops;
>> >> + vma->vm_flags |= VM_DONTEXPAND;
>> >> + vma->vm_private_data = cpu_buffer;
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
>> >> + unsigned int cmd, unsigned long arg)
>> >> +{
>> >> + u32 __user *argp = (u32 __user *)arg;
>> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >> +
>> >> + if (!trace_buffer || !tracedir)
>> >> + return -EINVAL;
>> >> + switch (cmd) {
>> >> + case KTRACE_BUF_GET_SIZE:
>> >> + {
>> >> + unsigned long bufsize;
>> >> + bufsize = ring_buffer_size(trace_buffer);
>> >> + return put_user((u32)bufsize, argp);
>> >> + }
>> >> + case KTRACE_BUF_GET_PRODUCED:
>> >> + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
>> >> + case KTRACE_BUF_GET_CONSUMED:
>> >> + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
>> >> + case KTRACE_BUF_PUT_CONSUMED:
>> >> + {
>> >> + u32 consumed, consumed_old;
>> >> + int ret;
>> >> +
>> >> + ret = get_user(consumed, argp);
>> >> + if (ret) {
>> >> + printk(KERN_WARNING
>> >> + "error getting consumed value: %d\n", ret);
>> >> + return ret;
>> >> + }
>> >> + consumed_old = ring_buffer_get_consumed(cpu_buffer);
>> >> + if (consumed == consumed_old)
>> >> + return 0;
>> >> + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
>> >> + return 0;
>> >> + }
>> >> + default:
>> >> + return -ENOIOCTLCMD;
>> >> + }
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static const struct file_operations tracedata_file_operations = {
>> >> + .open = trace_data_open,
>> >> + .poll = trace_data_poll,
>> >> + .mmap = trace_data_mmap,
>> >> + .ioctl = trace_data_ioctl,
>> >> + .release = trace_data_release,
>> >> +};
>> >> +
>> >> +/* function to report the number of lost events due to buffer overflow */
>> >> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
>> >> + size_t cnt, loff_t *ppos)
>> >> +{
>> >> + int ret;
>> >> + char buf[64];
>> >> + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
>> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >> +}
>> >> +
>> >> +static const struct file_operations traceoverflow_file_operations = {
>> >> + .read = trace_overflow_read,
>> >> +};
>> >> +
>> >> +/* function to re-initialize trace buffer */
>> >> +static void __trace_buffer_reset(struct file *filp,
>> >> + unsigned long val, size_t *cnt)
>> >> +{
>> >> + if (val && trace_buffer)
>> >> + ring_buffer_reset(trace_buffer);
>> >> +}
>> >> +
>> >> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
>> >> + size_t cnt, loff_t *ppos)
>> >> +{
>> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >> + __trace_buffer_reset);
>> >> +}
>> >> +
>> >> +static const struct file_operations tracereset_file_operations = {
>> >> + .write = trace_buffer_reset,
>> >> +};
>> >> +
>> >> +/*
>> >> + * We use debugfs for kernel-user communication. All of the control/info
>> >> + * files are under debugfs/ktrace directory.
>> >> + */
>> >> +static int create_debugfs(void)
>> >> +{
>> >> + struct dentry *entry, *bufdir;
>> >> + struct ring_buffer_per_cpu *cpu_buffer;
>> >> + char *tmpname;
>> >> + int cpu;
>> >> +
>> >> + tracedir = debugfs_create_dir("ktrace", NULL);
>> >> + if (!tracedir) {
>> >> + printk(KERN_WARNING "Couldn't create debugfs directory\n");
>> >> + return -ENOMEM;
>> >> + }
>> >> +
>> >> + /* 'buffers' is the directory that holds trace buffer debugfs files */
>> >> + bufdir = debugfs_create_dir("buffers", tracedir);
>> >> + if (!bufdir) {
>> >> + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
>> >> + return -ENOMEM;
>> >> + }
>> >> +
>> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> >> + if (!tmpname)
>> >> + return -ENOMEM;
>> >> + /* create a debugfs file for each cpu buffer */
>> >> + for_each_possible_cpu(cpu) {
>> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> >> + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
>> >> + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
>> >> + &tracedata_file_operations);
>> >> + if (!entry) {
>> >> + printk(KERN_WARNING
>> >> + "Couldn't create debugfs \'%s\' entry\n",
>> >> + tmpname);
>> >> + return -ENOMEM;
>> >> + }
>> >> + }
>> >> + kfree(tmpname);
>> >> +
>> >> + /* the control file for users to enable/disable global tracing. */
>> >> + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
>> >> + &tracecontrol_file_operations);
>> >> + if (!entry) {
>> >> + printk(KERN_WARNING
>> >> + "Couldn't create debugfs 'enabled' entry\n");
>> >> + return -ENOMEM;
>> >> + }
>> >> +
>> >> + /*
>> >> + * the debugfs file that displays the name, format etc. of every
>> >> + * supported trace event. The file is to be used by the user-space
>> >> + * trace parser to analyze the collected trace data.
>> >> + */
>> >> + entry = debugfs_create_file("info", 0444, tracedir, NULL,
>> >> + &traceinfo_file_operations);
>> >> + if (!entry) {
>> >> + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
>> >> + return -ENOMEM;
>> >> + }
>> >> +
>> >> + /*
>> >> + * the debugfs file that reports the number of events
>> >> + * lost due to buffer overflow
>> >> + */
>> >> + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
>> >> + &traceoverflow_file_operations);
>> >> + if (!entry) {
>> >> + printk(KERN_WARNING
>> >> + "Couldn't create debugfs 'overflow' entry\n");
>> >> + return -ENOMEM;
>> >> + }
>> >> +
>> >> + /* the debugfs file that resets trace buffer upon write */
>> >> + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
>> >> + &tracereset_file_operations);
>> >> + if (!entry) {
>> >> + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
>> >> + return -ENOMEM;
>> >> + }
>> >> +
>> >> + /* create the directory that holds the control files for every event */
>> >> + eventdir = debugfs_create_dir("events", tracedir);
>> >> + if (!eventdir) {
>> >> + printk(KERN_WARNING "Couldn't create 'events' directory\n");
>> >> + return -ENOMEM;
>> >> + }
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static void remove_debugfs_file(struct dentry *dir, char *name)
>> >> +{
>> >> + struct dentry *dentry;
>> >> + dentry = lookup_one_len(name, dir, strlen(name));
>> >> + if (dentry && !IS_ERR(dentry)) {
>> >> + debugfs_remove(dentry);
>> >> + dput(dentry);
>> >> + }
>> >> +}
>> >> +
>> >> +static int remove_debugfs(void)
>> >> +{
>> >> + struct dentry *bufdir;
>> >> + char *tmpname;
>> >> + int cpu;
>> >> +
>> >> + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
>> >> + if (!bufdir || IS_ERR(bufdir))
>> >> + return -EIO;
>> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> >> + if (!tmpname)
>> >> + return -ENOMEM;
>> >> + for_each_possible_cpu(cpu) {
>> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> >> + remove_debugfs_file(bufdir, tmpname);
>> >> + }
>> >> + kfree(tmpname);
>> >> + debugfs_remove(bufdir);
>> >> +
>> >> + remove_debugfs_file(tracedir, "enabled");
>> >> + remove_debugfs_file(tracedir, "info");
>> >> + debugfs_remove(eventdir);
>> >> + debugfs_remove(tracedir);
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static int __init init_ktrace(void)
>> >> +{
>> >> + struct kernel_trace *iter;
>> >> + int i, err;
>> >> +
>> >> + ktrace_next_id = 0;
>> >> + kref_set(&ktrace_kref, 1);
>> >> + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
>> >> + if (trace_buffer == NULL)
>> >> + return -ENOMEM;
>> >> + setup_timer(&trace_timer, wakeup_readers, 0);
>> >> + for (i = 0; i < KTRACE_HASH_SIZE; i++)
>> >> + INIT_HLIST_HEAD(&ktrace_table[i]);
>> >> +
>> >> + mutex_lock(&ktrace_mutex);
>> >> + err = create_debugfs();
>> >> + if (err != 0)
>> >> + goto out;
>> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> >> + add_ktrace(iter);
>> >> + err = register_module_notifier(&ktrace_module_nb);
>> >> +out:
>> >> + mutex_unlock(&ktrace_mutex);
>> >> + return err;
>> >> +}
>> >> +module_init(init_ktrace);
>> >> +
>> >> +static void __exit exit_ktrace(void)
>> >> +{
>> >> + struct kernel_trace *iter;
>> >> +
>> >> + if (ktrace_enabled) {
>> >> + ktrace_enabled = 0;
>> >> + del_timer_sync(&trace_timer);
>> >> + }
>> >> + mutex_lock(&ktrace_mutex);
>> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> >> + remove_ktrace(iter, 1);
>> >> + unregister_module_notifier(&ktrace_module_nb);
>> >> + remove_debugfs();
>> >> + mutex_unlock(&ktrace_mutex);
>> >> +
>> >> + kref_put(&ktrace_kref, release_trace_buffer);
>> >> +}
>> >
>> > --
>> > Mathieu Desnoyers
>> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>> >
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
* Jiaying Zhang ([email protected]) wrote:
> On Mon, Mar 16, 2009 at 5:01 PM, Mathieu Desnoyers
> <[email protected]> wrote:
> > * Jiaying Zhang ([email protected]) wrote:
> >> Hi Mathieu,
> >>
> >> First, I apologize that I did not forward you my original email. I assumed
> >> that the email would be noticed by interested people when it was sent
> >> to lkml. Now I see that was a mistake.
> >>
> >
> > Hi Jiaying,
> >
> > Apologies accepted, no hard feelings :) I was just surprised when I
> > found out about the numbers you posted.
> >
> >> On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
> >> <[email protected]> wrote:
> >> > Hi Jiaying,
> >> >
> >> > I'm very interested to hear about the details of your test-case, because
> >> > what you are talking about here, and the numbers that you get, are, at
> >> > the very least, not representative of the current LTTng. Perhaps you
> >> > used a version right after the kernel summit and stopped following the
> >> > improvements we have done since then ? I also wonder about the instrumentation
> >> > set you enabled in your tests, and whether it's exactly the same in the ktrace
> >> > and LTTng runs (especially system call tracing). Please note that following the
> >> > Kernel Summit 2008, a more active development cycle began, and the time was not
> >> > right for optimisation. But since then a lot of improvement has been done on
> >> > this side.
> >>
> >> Sorry that I omitted a lot of details in my test description. I did use an old
> >> version of LTTng in my tests. The LTTng version I used was the 0.21 lttng
> >> patches for the 2.6.27 kernel. I used a 2.6.26 kernel in my tests, with the
> >> ported lttng-2.6.27-rc6-0.21 patches.
> >>
> >> I enabled the same set of trace events in ktrace and LTTng runs. Here is
> >> the list of enabled events: syscall entry/exit, irq entry/exit,
> >> trap entry/exit, kernel_sched_switch, kernel_process_fork,
> >> kernel_process_wait, kernel_process_free, kernel_process_exit.
> >>
> >> We are glad to see the improvements coming from the LTTng community.
> >> I haven't had time to sync with the latest LTTng code and re-run my tests.
> >> I will let you know when I have such data available.
> >>
> >
> > On my side, with flight recorder activated, same instrumentation you
> > use, with LTTng 0.111 :
> >
> > arm-google script :
> >
> > #!/bin/sh
> >
> > DIR=/mnt/debugfs/ltt/markers
> >
> > for a in \
> > syscall_entry syscall_exit \
> > irq_entry irq_exit \
> > trap_entry trap_exit \
> > page_fault_entry page_fault_exit \
> > page_fault_nosem_entry page_fault_nosem_exit \
> > page_fault_get_user_entry page_fault_get_user_exit \
> > sched_schedule process_fork process_wait \
> > process_free process_exit;
> > do
> > echo 1 > ${DIR}/kernel/$a/enable
> > done
> >
> > running tbench -t 200 8
> >
> > Vanilla kernel : 2070 MB/sec
> > Flight recorder : 1693 MB/sec Performance impact : 18.2 %
> > Normal : 1620 MB/sec Performance impact : 21.7 % (8 lttd threads)
> >
> > So now we are comparing apples with apples :-)
> Looks like the performance of LTTng has improved a lot over the past
> few months. I will run some tests with the latest LTTng patches and
> let you know if I see any difference. Did you use the 2.6.29-rc7 kernel in
> your tests?
>
Yes, kernel 2.6.29-rc7, lttng 0.111. But lttng 0.112 for kernel
2.6.29-rc8 should also be close enough.
> >
> >> >
> >> > First I'd like to express my concern about the way you posted your
> >> > numbers :
> >> >
> >> > 1 - You did not CC anyone from the LTTng project.
> >> > 2 - You did not tell which LTTng version you used as a comparison basis.
> >> > 3 - You did not look at the current state of code before posting your
> >> > comparison. (as the comment in your patch header shows)
> >> That is indeed my mistake.
> >>
> >> > 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
> >> > the default instrumentation ? Did you also enable system call tracing
> >> > with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
> >> > makes a huge difference on the amount of information gathered.
> >> Yes. We also enable syscall tracing via set_kernel_trace_flag_all_tasks().
> >>
> >> >
> >> > So let's have a close look at them. I'm re-doing the benchmarks based on
> >> > the current LTTng version below.
> >> >
> >> >> Hello All,
> >> >>
> >> >> We have been working on building a kernel tracer for a production environment.
> >> >> We first considered to use the existing markers code, but found it a bit
> >> >> heavyweight.
> >> >
> >> > Have you looked at :
> >> >
> >> > 1 - The tracepoints currently in the kernel
> >> > 2 - The _current_ version of markers in the LTTng tree ?
> >>
> >> I looked at the tracepoints currently in the kernel. It is great to see that
> >> some trace instrumentation code is finally available in the mainline kernel.
> >> But I still don't like an additional tracepoint layer on top of the actual
> >> tracing (ktrace or markers or ftrace) layer. That seems to add unnecessary
> >> overhead. The markers code I referred to in my email is the current code in
> >> the mainline kernel. I had a look at the LTTng markers code but still saw
> >> that a va_list argument is used in marker_probe_func. It sounds like a nice
> >> improvement if the LTTng markers avoid the use of va_arg whenever possible.
> >>
> >
> > Yes, va_args is only there as a convenience for slow paths.
> >
> >> >
> >> >> So we developed a new kernel tracing prototype that shares
> >> >> many similarities with markers but also includes several simplifications that
> >> >> give us considerable performance benefits. For example, we pass the size of
> >> >> the trace event directly to the probe function instead of using va_arg as
> >> >> markers do, which saves us the overhead of variable-argument passing.
> >> >
> >> > The markers in the LTTng tree only use va_args for "debug-style"
> >> > tracing. All the tracing fast paths are now done with custom probe
> >> > functions which take the arguments from the tracepoint and write them
> >> > in the trace buffers directly in C. When the event payload size is already
> >> > known statically, there is no dynamic computation cost at all : it's
> >> > simply passed to the ltt-relay write function.
> >> But this raises another concern: sub-system developers may need
> >> to write a bunch of custom probe functions to avoid the argument-parsing
> >> overhead. Can you instead have a general probe function or macro that
> >> does this work?
> >>
> >
> > Well, considering the amount of care we have to take when adding a
> > tracepoint in the mainline kernel, I think it's even a *good* thing
> > that someone has to stop and write a small callback to deal with the
> > information he wants to export. I think debug-style tracing should be
> > very, very quick and easy to add, but tracepoints meant to make it
> > into the mainline do not share this requirement.
> I agree that we need to be careful when adding tracepoints into the kernel,
> but I still think it is good to simplify the process of adding/maintaining
> tracepoints so sub-system maintainers don't feel it is a big burden.
>
Yes, we have to make sure it's not an impossible task.
> >
> >> >
> >> >> Also, we
> >> >> associate a single probe function to each event, so we can directly call the
> >> >> probe function when tracing is enabled.
> >> >
> >> > You seem to imply that being stuck with a single probe function is
> >> > somehow superior to having tiny per-event callbacks which then call into a
> >> > buffer write function ? Why ?
> >> That is indeed my thinking, because we saw that some platforms are
> >> very sensitive to function call overhead and we haven't seen use cases
> >> where multiple probe functions prove necessary.
> >>
> >
> > Hrm, so your plan is to embed the tracing code at the caller site, or to
> > keep one single function call ?
> We do have a general probe function that is called when tracing is enabled.
> Other than that, the main tracing code is defined as an inline function that is
> embedded at the caller site.
>
You mean that you embed the main tracing code directly in the instrumented
kernel functions ? Hrm, have you considered the call-site pollution impact
that might have ? The compiler might start having
difficulty optimising hot paths that would then become large functions.
Or maybe I am not correctly understanding your approach.
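
To make my concern concrete, here is a hypothetical sketch of what I
understand your inline fast path to be (names are made up) :

static inline void trace_foo(int a, long b)
{
	if (unlikely(ktrace_enabled) && unlikely(__ktrace_foo.enabled)) {
		struct ktrace_struct_foo karg = { a, b };
		(*__ktrace_foo.func)(&__ktrace_foo, &karg, sizeof(karg));
	}
}

Every caller gets its own copy of the karg setup and of the indirect call,
which grows each instrumented function, whereas a single out-of-line
trace_foo() would keep each caller down to one direct call, at the cost of
one function call when the event fires.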
Mathieu
>
> Jiaying
>
> >
> >> >
> >> >> Our measurements show that our
> >> >> current tracing prototype introduces much lower overhead compared with
> >> >> lttng (we use lttng as a comparison because markers does not provide the
> >> >> implementation to record trace data).
> >> >
> >> > Which LTTng version, kernel version, LTTng config and runtime options do
> >> > you use ?
> >> See above for LTTng and kernel versions. I was using the default LTTng
> >> config options and 'normal' recorder mode.
> >>
> >> >
> >> >> E.g., with dbench, lttng introduces
> >> >> 20~80% overhead on our testing platforms while the overhead with our
> >> >> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
> >> >> overhead while the overhead added by our kernel tracer is usually within
> >> >> 15%; with kernbench (time to compile a Linux kernel), the overhead
> >> >> added by lttng is around 4%~18% while the overhead added by our
> >> >> kernel tracer is within 2.5%.
> >> >>
> >> >
> >> > I'm re-running the dbench and tbench tests :
> >> >
> >> > With LTTng 0.106, kernel 2.6.29-rc7
> >> > 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
> >> >
> >> > Arming all the default LTTng instrumentation (168 tracepoints),
> >> > Tracing in flight recorder mode, including system call entry/exit
> >> > tracing and block I/O tracing (blktrace instrumentation), I get :
> >> >
> >> > Test / config | Vanilla kernel | LTTng 0.106 | slowdown
> >> > dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
> >> >
> >> > As an example of the effect of custom probe vs va_args based probes,
> >> > let's take the tbench example
> >> >
> >> > | Vanilla | LTTng 0.106 | slowdown
> >> > tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
> >> >
> >> > The LTTng trace statistics shows the top high event rate events :
> >> >
> >> > Some are usual (therefore lttng already implements custom probes for them) :
> >> >
> >> > 42004 sched_try_wakeup
> >> > 79754 timer_set
> >> > 83973 sched_schedule
> >> > 84317 softirq_exit
> >> > 84318 softirq_entry
> >> > 95370 page_free
> >> > 95412 page_alloc
> >> > 267247 syscall_entry
> >> > 267247 syscall_exit
> >> >
> >> > Some are more tbench workload-specific :
> >> >
> >> > 16915 socket_sendmsg
> >> > 18911 napi_complete
> >> > 18911 napi_schedule
> >> > 18912 dev_receive
> >> > 18914 dev_xmit
> >> > 18920 napi_poll
> >> > 33831 socket_recvmsg
> >> > 59860 read
> >> >
> >> >
> >> > So let's see what happens if I implement those as custom probes and do a bit of
> >> > tuning in the tracer hot path :
> >> Could you give more details about the custom probe you used?
> >
> > They are available in the lttng tree on git.kernel.org in
> > ltt/probes/*-trace.c. Their API is in include/linux/ltt-type-serialize.h
> >
> >> Do you need to write a custom probe for each tracepoint?
> >>
> >
> > Yes. But it's not such a big deal. It looks a bit like Steve's
> > TRACE_EVENT macro-fu, but written in a simple C callback :
> >
> > void probe_sched_switch(struct rq *rq, struct task_struct *prev,
> > struct task_struct *next);
> >
> > DEFINE_MARKER_TP(kernel, sched_schedule, sched_switch, probe_sched_switch,
> > "prev_pid %d next_pid %d prev_state #2d%ld");
> >
> > notrace void probe_sched_switch(struct rq *rq, struct task_struct *prev,
> > struct task_struct *next)
> > {
> > struct marker *marker;
> > struct serialize_int_int_short data;
> >
> > data.f1 = prev->pid;
> > data.f2 = next->pid;
> > data.f3 = prev->state;
> >
> > marker = &GET_MARKER(kernel, sched_schedule);
> > ltt_specialized_trace(marker, marker->single.probe_private,
> > &data, serialize_sizeof(data), sizeof(int));
> > }
> >
> > Best regards,
> >
> > Mathieu
> >
> >> Jiaying
> >>
> >> >
> >> > | Vanilla | LTTng 0.107 | slowdown
> >> > Custom probes :
> >> > tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
> >> >
> >> > With tracer hot path tuning (inline hot path, function calls for slow path, will
> >> > be in 0.108) :
> >> > tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
> >> >
> >> > kernbench :
> >> > Average Optimal load -j 32 Run
> >> >
> >> > Vanilla 2.6.29-rc7 :
> >> > Elapsed Time 48.238 (0.117346)
> >> >
> >> > With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
> >> > Elapsed Time 50.226 (0.0750333)
> >> >
> >> > Slowdown : 4.1 %
> >> >
> >> > So except the tbench workload on LTTng 0.106, which still had a 52% performance
> >> > impact (brought it down to 35 % for the next LTTng release) due to
> >> > specific workload and not having specialized probes implemented, the
> >> > dbench impact is well below your results (12.3 % vs 20-80%)
> >> >
> >> > So given the huge result discrepancies between your tests and mine, could you
> >> > give us more detail about your benchmarks ? Especially knowing the enabled
> >> > instrumentation set, and the amount of MB/s generated by both ktrace and LTTng
> >> > runs (to make sure we are comparing the same traffic) and the LTTng versions you
> >> > are using as a comparison basis would be good.
> >> >
> >> > Thanks,
> >> >
> >> > Mathieu
> >> >
> >> >
> >> >> The patch below is the current prototype we have implemented. It uses
> >> >> Steve's unified trace buffer code with some small extensions to support
> >> >> buffer mmap. The patch is probably still buggy, but we would like to get
> >> >> some early feedback from the community.
> >> >>
> >> >> Signed-off-by: Jiaying Zhang <[email protected]>
> >> >>
> >> >> Index: git-linux/init/Kconfig
> >> >> ===================================================================
> >> >> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
> >> >> +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
> >> >> @@ -950,6 +950,15 @@
> >> >> Place an empty function call at each marker site. Can be
> >> >> dynamically changed for a probe function.
> >> >>
> >> >> +config KTRACE
> >> >> + bool "Enable ktrace support"
> >> >> + select DEBUG_FS
> >> >> + select TRACING
> >> >> + select RING_BUFFER
> >> >> + help
> >> >> + Ktrace is a kernel tracing tool that allows you to trace
> >> >> + kernel events by inserting trace points at proper places.
> >> >> +
> >> >> source "arch/Kconfig"
> >> >>
> >> >> endmenu # General setup
> >> >> Index: git-linux/kernel/Makefile
> >> >> ===================================================================
> >> >> --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
> >> >> +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
> >> >> @@ -86,6 +86,7 @@
> >> >> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> >> >> obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
> >> >> obj-$(CONFIG_MARKERS) += marker.o
> >> >> +obj-$(CONFIG_KTRACE) += ktrace.o
> >> >> obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
> >> >> obj-$(CONFIG_LATENCYTOP) += latencytop.o
> >> >> obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
> >> >> Index: git-linux/include/asm-generic/vmlinux.lds.h
> >> >> ===================================================================
> >> >> --- git-linux.orig/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:37.000000000 -0800
> >> >> +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:39.000000000 -0800
> >> >> @@ -73,6 +73,9 @@
> >> >> MEM_KEEP(init.data) \
> >> >> MEM_KEEP(exit.data) \
> >> >> . = ALIGN(8); \
> >> >> + VMLINUX_SYMBOL(__start___ktraces) = .; \
> >> >> + *(__ktrace) \
> >> >> + VMLINUX_SYMBOL(__stop___ktraces) = .; \
> >> >> VMLINUX_SYMBOL(__start___markers) = .; \
> >> >> *(__markers) \
> >> >> VMLINUX_SYMBOL(__stop___markers) = .; \
> >> >> @@ -89,6 +92,7 @@
> >> >> VMLINUX_SYMBOL(__start_rodata) = .; \
> >> >> *(.rodata) *(.rodata.*) \
> >> >> *(__vermagic) /* Kernel version magic */ \
> >> >> + *(__ktrace_strings) /* Ktrace: strings */ \
> >> >> *(__markers_strings) /* Markers: strings */ \
> >> >> *(__tracepoints_strings)/* Tracepoints: strings */ \
> >> >> } \
> >> >> Index: git-linux/kernel/module.c
> >> >> ===================================================================
> >> >> --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
> >> >> +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
> >> >> @@ -50,6 +50,7 @@
> >> >> #include <asm/sections.h>
> >> >> #include <linux/tracepoint.h>
> >> >> #include <linux/ftrace.h>
> >> >> +#include <linux/ktrace.h>
> >> >> #include <linux/async.h>
> >> >>
> >> >> #if 0
> >> >> @@ -2145,6 +2146,10 @@
> >> >> sizeof(*mod->tracepoints),
> >> >> &mod->num_tracepoints);
> >> >> #endif
> >> >> +#ifdef CONFIG_KTRACE
> >> >> + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
> >> >> + sizeof(*mod->ktrace), &mod->num_ktrace);
> >> >> +#endif
> >> >>
> >> >> #ifdef CONFIG_MODVERSIONS
> >> >> if ((mod->num_syms && !mod->crcs)
> >> >> Index: git-linux/include/linux/module.h
> >> >> ===================================================================
> >> >> --- git-linux.orig/include/linux/module.h 2009-02-19 14:58:37.000000000 -0800
> >> >> +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
> >> >> @@ -44,6 +44,7 @@
> >> >> };
> >> >>
> >> >> struct module;
> >> >> +struct kernel_trace;
> >> >>
> >> >> struct module_attribute {
> >> >> struct attribute attr;
> >> >> @@ -347,6 +348,11 @@
> >> >> /* Reference counts */
> >> >> struct module_ref ref[NR_CPUS];
> >> >> #endif
> >> >> +#ifdef CONFIG_KTRACE
> >> >> + struct kernel_trace *ktrace;
> >> >> + unsigned int num_ktrace;
> >> >> +#endif
> >> >> +
> >> >> };
> >> >> #ifndef MODULE_ARCH_INIT
> >> >> #define MODULE_ARCH_INIT {}
> >> >> Index: git-linux/include/linux/ktrace.h
> >> >> ===================================================================
> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> >> >> +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
> >> >> @@ -0,0 +1,83 @@
> >> >> +#ifndef _LINUX_KTRACE_H
> >> >> +#define _LINUX_KTRACE_H
> >> >> +
> >> >> +#include <linux/types.h>
> >> >> +
> >> >> +struct kernel_trace;
> >> >> +
> >> >> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
> >> >> +
> >> >> +struct kernel_trace {
> >> >> + const char *name;
> >> >> + const char *format; /* format string describing variable list */
> >> >> + size_t *stroff; /* offsets of string variables */
> >> >> + /* 31 bit event_id is converted to 16 bit when entering to the buffer */
> >> >> + u32 enabled:1, event_id:31;
> >> >> + ktrace_probe_func *func; /* probe function */
> >> >> + struct list_head list; /* list head linked to the hash table entry */
> >> >> +} __attribute__((aligned(8)));
> >> >> +
> >> >> +extern int ktrace_enabled;
> >> >> +
> >> >> +/*
> >> >> + * Make sure the alignment of the structure in the __ktrace section will
> >> >> + * not add unwanted padding between the beginning of the section and the
> >> >> + * structure. Force alignment to the same alignment as the section start.
> >> >> + */
> >> >> +
> >> >> +#define DEFINE_KTRACE_STRUCT(name) \
> >> >> + struct __attribute__((packed)) ktrace_struct_##name
> >> >> +
> >> >> +#ifdef CONFIG_KTRACE
> >> >> +/*
> >> >> + * DO_TRACE is the macro that is called from each trace point. Each trace
> >> >> + * point is associated with a static ktrace object that is located in a
> >> >> + * special __ktrace section and is uniquely identified by the event name and
> >> >> + * event format. Take a look at the files under include/trace for example
> >> >> + * usage. Note that we call the trace probe function only if the global
> >> >> + * tracing is enabled and the tracing for the associated event is enabled,
> >> >> + * so we only introduce the overhead of a predicted condition judge when
> >> >> + * tracing is disabled.
> >> >> + */
> >> >> +#define DO_TRACE(name, format, args...)			\
> >> >> + do { \
> >> >> + static const char __kstrtab_##name[] \
> >> >> + __attribute__((section("__ktrace_strings"))) \
> >> >> + = #name "\0" format; \
> >> >> + static struct kernel_trace __ktrace_##name \
> >> >> + __attribute__((section("__ktrace"), aligned(8))) = \
> >> >> + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
> >> >> + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
> >> >> + __ktrace_check_format(format, ## args); \
> >> >> + if (unlikely(ktrace_enabled) && \
> >> >> + unlikely(__ktrace_##name.enabled)) { \
> >> >> + struct ktrace_struct_##name karg = { args }; \
> >> >> + (*__ktrace_##name.func) \
> >> >> + (&__ktrace_##name, &karg, sizeof(karg)); \
> >> >> + } \
> >> >> + } while (0)
> >> >> +
> >> >> +#else /* !CONFIG_KTRACE */
> >> >> +#define DO_TRACE(name, format, args...)
> >> >> +#endif
> >> >> +
> >> >> +/* To be used for string format validity checking with gcc */
> >> >> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
> >> >> +{
> >> >> +}
> >> >> +
> >> >> +#define __ktrace_check_format(format, args...) \
> >> >> + do { \
> >> >> + if (0) \
> >> >> + ___ktrace_check_format(format, ## args); \
> >> >> + } while (0)
> >> >> +
> >> >> +
> >> >> +/* get the trace buffer information */
> >> >> +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
> >> >> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
> >> >> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
> >> >> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
> >> >> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
> >> >> +
> >> >> +#endif
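For reference, here is my understanding of how an instrumentation site
would use this (the event is hypothetical, I have not checked your
include/trace files) :

DEFINE_KTRACE_STRUCT(ctx_switch) {
	int prev_pid;
	int next_pid;
};

static inline void trace_ctx_switch(int prev_pid, int next_pid)
{
	DO_TRACE(ctx_switch, "prev_pid %d next_pid %d",
		 prev_pid, next_pid);
}

So each call site pays the two enabled tests, and when tracing is on it
builds the packed karg struct on the stack and calls the registered
probe with the struct and its size. Correct me if I misread the macro.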
> >> >> Index: git-linux/kernel/ktrace.c
> >> >> ===================================================================
> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> >> >> +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
> >> >> @@ -0,0 +1,872 @@
> >> >> +/*
> >> >> + * kernel/ktrace.c
> >> >> + *
> >> >> + * Implementation of the kernel tracing for linux kernel 2.6.
> >> >> + *
> >> >> + * Copyright 2008- Google Inc.
> >> >> + * Original Author: Jiaying Zhang
> >> >> + */
> >> >> +
> >> >> +#include <linux/module.h>
> >> >> +#include <linux/kernel.h>
> >> >> +#include <linux/bitmap.h>
> >> >> +#include <linux/poll.h>
> >> >> +#include <linux/errno.h>
> >> >> +#include <linux/sched.h>
> >> >> +#include <linux/linkage.h>
> >> >> +#include <linux/debugfs.h>
> >> >> +#include <linux/timer.h>
> >> >> +#include <linux/delay.h>
> >> >> +#include <linux/jhash.h>
> >> >> +#include <linux/ctype.h>
> >> >> +#include <linux/namei.h>
> >> >> +#include <linux/ktrace.h>
> >> >> +#include <linux/ring_buffer.h>
> >> >> +
> >> >> +static char *ktrace_version = "1.0.0";
> >> >> +int ktrace_enabled;
> >> >> +EXPORT_SYMBOL(ktrace_enabled);
> >> >> +static uint16_t ktrace_next_id;
> >> >> +
> >> >> +extern struct kernel_trace __start___ktraces[];
> >> >> +extern struct kernel_trace __stop___ktraces[];
> >> >> +
> >> >> +static DEFINE_MUTEX(ktrace_mutex);
> >> >> +
> >> >> +/* ktrace hash table entry structure and variables */
> >> >> +struct ktrace_entry {
> >> >> + struct hlist_node hlist;
> >> >> + char *name;
> >> >> + char *format;
> >> >> + size_t *stroff;
> >> >> + ktrace_probe_func *func;
> >> >> + u32 enabled:1, event_id:31;
> >> >> + struct list_head klist; /* list of loaded ktraces */
> >> >> +};
> >> >> +
> >> >> +#define KTRACE_HASH_BITS 6
> >> >> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
> >> >> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
> >> >> +
> >> >> +/* debugfs directory variables */
> >> >> +static struct dentry *tracedir;
> >> >> +static struct dentry *eventdir;
> >> >> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
> >> >> +
> >> >> +/* ring buffer code */
> >> >> +static unsigned long trace_buf_size = 65536UL;
> >> >> +struct ring_buffer *trace_buffer;
> >> >> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
> >> >> +static struct timer_list trace_timer; /* reader wake-up timer */
> >> >> +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
> >> >> +static struct kref ktrace_kref;
> >> >> +
> >> >> +static void
> >> >> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
> >> >> +{
> >> >> + struct ring_buffer_event *event;
> >> >> + void *trace_event;
> >> >> + unsigned long flag;
> >> >> +
> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
> >> >> + sizeof(uint16_t) + event_size, &flag);
> >> >> + if (!event)
> >> >> + return;
> >> >> + trace_event = ring_buffer_event_data(event);
> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
> >> >> + memcpy(trace_event + sizeof(uint16_t), data, event_size);
> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
> >> >> +}
> >> >> +
> >> >> +/* special probe function for events that contain string arguments */
> >> >> +static void
> >> >> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
> >> >> +{
> >> >> + struct ring_buffer_event *event;
> >> >> + void *trace_event, *p;
> >> >> + unsigned long flag;
> >> >> + size_t *offset, scanned;
> >> >> + char *string;
> >> >> +
> >> >> + /*
> >> >> + * Get the real length of the event, i.e., we use stroff array to
> >> >> + * locate string variables in the passed-in trace data and use their
> >> >> + * length to replace the size of the string pointers.
> >> >> + */
> >> >> + for (offset = kt->stroff; *offset != -1; offset++) {
> >> >> + string = *(char **) (data + *offset);
> >> >> + event_size += strlen(string) + 1 - sizeof(char *);
> >> >> + }
> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
> >> >> + sizeof(uint16_t) + event_size, &flag);
> >> >> + if (!event)
> >> >> + return;
> >> >> + trace_event = ring_buffer_event_data(event);
> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
> >> >> + p = trace_event + sizeof(uint16_t);
> >> >> + /*
> >> >> + * Copy the trace data into buffer. For string variables, we enter the
> >> >> + * string into the buffer. Otherwise, the passed in data is copied.
> >> >> + */
> >> >> + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
> >> >> + memcpy(p, data + scanned, *offset - scanned);
> >> >> + p += *offset - scanned;
> >> >> + string = *(char **) (data + *offset);
> >> >> + memcpy(p, string, strlen(string) + 1);
> >> >> + p += strlen(string) + 1;
> >> >> + scanned = *offset + sizeof(char *);
> >> >> + }
> >> >> + memcpy(p, data + scanned,
> >> >> + trace_event + sizeof(uint16_t) + event_size - p);
> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
> >> >> +}
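Checking my understanding of string_probe() : for an event with format
"pid %d name %s", contain_string_arguments() would compute stroff as
{ 4, -1 }, and the 4-byte pid followed by a string pointer in karg would
be written out as the 4-byte pid followed by the NUL-terminated string
contents (so on a 64-bit config the record shrinks or grows by
strlen + 1 - 8 bytes relative to sizeof(karg)).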
> >> >> +
> >> >> +/* timer function used to defer ring buffer reader waking */
> >> >> +static void wakeup_readers(unsigned long data)
> >> >> +{
> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> >> >> + wake_up_interruptible(&trace_wait);
> >> >> + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
> >> >> +}
> >> >> +
> >> >> +/* function for reading ktrace metadata info */
> >> >> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
> >> >> +{
> >> >> + seq_printf(seq, "version %s\n", ktrace_version);
> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
> >> >> +}
> >> >> +
> >> >> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
> >> >> +{
> >> >> + ++*pos;
> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
> >> >> +}
> >> >> +
> >> >> +static int tracing_info_show(struct seq_file *seq, void *v)
> >> >> +{
> >> >> + struct hlist_head *head = v;
> >> >> + struct hlist_node *node;
> >> >> + struct ktrace_entry *entry;
> >> >> + hlist_for_each_entry(entry, node, head, hlist)
> >> >> + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
> >> >> + entry->name, entry->format, entry->event_id,
> >> >> + entry->enabled ? "enabled" : "disabled");
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static void tracing_info_stop(struct seq_file *seq, void *v)
> >> >> +{
> >> >> +}
> >> >> +
> >> >> +static const struct seq_operations tracing_info_seq_ops = {
> >> >> + .start = tracing_info_start,
> >> >> + .next = tracing_info_next,
> >> >> + .show = tracing_info_show,
> >> >> + .stop = tracing_info_stop,
> >> >> +};
> >> >> +
> >> >> +static int tracing_info_open(struct inode *inode, struct file *file)
> >> >> +{
> >> >> + return seq_open(file, &tracing_info_seq_ops);
> >> >> +}
> >> >> +
> >> >> +static const struct file_operations traceinfo_file_operations = {
> >> >> + .open = tracing_info_open,
> >> >> + .read = seq_read,
> >> >> + .llseek = seq_lseek,
> >> >> + .release = seq_release,
> >> >> +};
> >> >> +
> >> >> +/*
> >> >> + * wrapper function used by debugfs write operation.
> >> >> + * func: handling function that does real work
> >> >> + */
> >> >> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
> >> >> + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
> >> >> +{
> >> >> + int ret;
> >> >> + char buf[64];
> >> >> + unsigned long val;
> >> >> +
> >> >> + if (cnt >= sizeof(buf))
> >> >> + return -EINVAL;
> >> >> + if (copy_from_user(&buf, ubuf, cnt))
> >> >> + return -EFAULT;
> >> >> + buf[cnt] = 0;
> >> >> +
> >> >> + ret = strict_strtoul(buf, 10, &val);
> >> >> + if (ret < 0)
> >> >> + return ret;
> >> >> + val = !!val;
> >> >> +
> >> >> + func(filp, val, &cnt);
> >> >> +
> >> >> + filp->f_pos += cnt;
> >> >> + return cnt;
> >> >> +}
> >> >> +
> >> >> +/* functions for reading/writing the global 'enabled' control file */
> >> >> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
> >> >> + size_t cnt, loff_t *ppos)
> >> >> +{
> >> >> + int ret;
> >> >> + char buf[64];
> >> >> + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> >> >> +}
> >> >> +
> >> >> +static void __tracing_control_write(struct file *filp,
> >> >> + unsigned long val, size_t *cnt)
> >> >> +{
> >> >> + if (val ^ ktrace_enabled) {
> >> >> + if (val) {
> >> >> + trace_timer.expires =
> >> >> + jiffies + TRACEREAD_WAKEUP_INTERVAL;
> >> >> + add_timer(&trace_timer);
> >> >> + ktrace_enabled = 1;
> >> >> + } else {
> >> >> + ktrace_enabled = 0;
> >> >> + del_timer_sync(&trace_timer);
> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> >> >> + wake_up_interruptible(&trace_wait);
> >> >> + }
> >> >> + }
> >> >> +}
> >> >> +
> >> >> +static ssize_t tracing_control_write(struct file *filp, const char __user *ubuf,
> >> >> +		size_t cnt, loff_t *ppos)
> >> >> +{
> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> >> >> + __tracing_control_write);
> >> >> +}
> >> >> +
> >> >> +static const struct file_operations tracecontrol_file_operations = {
> >> >> + .read = tracing_control_read,
> >> >> + .write = tracing_control_write,
> >> >> +};
> >> >> +
> >> >> +/* functions for reading/writing a trace event file */
> >> >> +static int update_ktrace(struct ktrace_entry *entry,
> >> >> + ktrace_probe_func *func, int enabled)
> >> >> +{
> >> >> + struct kernel_trace *iter;
> >> >> + /* no need to update the list if the tracing is not initialized */
> >> >> + if (!tracedir)
> >> >> + return 0;
> >> >> + entry->enabled = enabled;
> >> >> + entry->func = func;
> >> >> + list_for_each_entry(iter, &entry->klist, list) {
> >> >> + iter->stroff = entry->stroff;
> >> >> + iter->func = func;
> >> >> + iter->enabled = enabled;
> >> >> + }
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static int ktrace_probe_register(struct ktrace_entry *entry,
> >> >> + ktrace_probe_func *probefunc)
> >> >> +{
> >> >> + return update_ktrace(entry, probefunc, 1);
> >> >> +}
> >> >> +
> >> >> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
> >> >> +{
> >> >> + return update_ktrace(entry, NULL, 0);
> >> >> +}
> >> >> +
> >> >> +static int ktrace_control_open(struct inode *inode, struct file *filp)
> >> >> +{
> >> >> + filp->private_data = inode->i_private;
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
> >> >> + size_t cnt, loff_t *ppos)
> >> >> +{
> >> >> + int ret;
> >> >> + char buf[64];
> >> >> + struct ktrace_entry *entry = filp->private_data;
> >> >> +
> >> >> + ret = snprintf(buf, 64, "%u\n", entry->enabled);
> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> >> >> +}
> >> >> +
> >> >> +/*
> >> >> + * Check whether the event contains string arguments. Return 0 if not;
> >> >> + * otherwise record the offsets of the string arguments in the ktrace
> >> >> + * structure according to the event format and return 1.
> >> >> + * The offset info is later used to store the string arguments into
> >> >> + * the trace buffer during trace probe.
> >> >> + */
> >> >> +static int contain_string_arguments(struct ktrace_entry *entry)
> >> >> +{
> >> >> + int count = 0;
> >> >> + char *fmt;
> >> >> + int qualifier; /* 'h', 'l', or 'L' for integer fields */
> >> >> + size_t offset = 0;
> >> >> +
> >> >> + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
> >> >> + count++;
> >> >> + if (!count)
> >> >> + return 0;
> >> >> + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
> >> >> + if (!entry->stroff)
> >> >> + return -ENOMEM;
> >> >> +
> >> >> + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
> >> >> + if (*fmt != '%')
> >> >> + continue;
> >> >> +repeat:
> >> >> + ++fmt;
> >> >> + switch (*fmt) {
> >> >> + case '-':
> >> >> + case '+':
> >> >> + case ' ':
> >> >> + case '#':
> >> >> + case '0':
> >> >> + ++fmt;
> >> >> + goto repeat;
> >> >> + }
> >> >> +
> >> >> + while (isdigit(*fmt))
> >> >> + fmt++;
> >> >> +
> >> >> + /* get the conversion qualifier */
> >> >> + qualifier = -1;
> >> >> + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
> >> >> + qualifier = *fmt;
> >> >> + ++fmt;
> >> >> + if (qualifier == 'l' && *fmt == 'l') {
> >> >> + qualifier = 'L';
> >> >> + ++fmt;
> >> >> + }
> >> >> + }
> >> >> +
> >> >> + switch (*fmt) {
> >> >> + case 'c':
> >> >> + offset += sizeof(char);
> >> >> + continue;
> >> >> + case 's':
> >> >> + entry->stroff[count] = offset;
> >> >> + count++;
> >> >> + offset += sizeof(char *);
> >> >> + continue;
> >> >> + case 'p':
> >> >> + offset += sizeof(void *);
> >> >> + continue;
> >> >> + case 'd':
> >> >> + case 'i':
> >> >> + case 'o':
> >> >> + case 'u':
> >> >> + case 'x':
> >> >> + case 'X':
> >> >> + switch (qualifier) {
> >> >> + case 'L':
> >> >> + offset += sizeof(long long);
> >> >> + break;
> >> >> + case 'l':
> >> >> + offset += sizeof(long);
> >> >> + break;
> >> >> + case 'h':
> >> >> + offset += sizeof(short);
> >> >> + break;
> >> >> + default:
> >> >> + offset += sizeof(int);
> >> >> + break;
> >> >> + }
> >> >> + break;
> >> >> + default:
> >> >> + printk(KERN_WARNING "unknown format '%c'", *fmt);
> >> >> + continue;
> >> >> + }
> >> >> + }
> >> >> + entry->stroff[count] = -1;
> >> >> + return 1;
> >> >> +}
> >> >> +
> >> >> +static void __ktrace_control_write(struct file *filp,
> >> >> + unsigned long val, size_t *cnt)
> >> >> +{
> >> >> + struct ktrace_entry *entry = filp->private_data;
> >> >> + int ret;
> >> >> +
> >> >> + mutex_lock(&ktrace_mutex);
> >> >> + if (val ^ entry->enabled) {
> >> >> + if (val) {
> >> >> + ret = contain_string_arguments(entry);
> >> >> + if (ret == 0)
> >> >> + ktrace_probe_register(entry, ring_buffer_probe);
> >> >> + else if (ret > 0)
> >> >> + ktrace_probe_register(entry, string_probe);
> >> >> + else
> >> >> + *cnt = ret;
> >> >> + } else
> >> >> + ktrace_probe_unregister(entry);
> >> >> + }
> >> >> + mutex_unlock(&ktrace_mutex);
> >> >> +}
> >> >> +
> >> >> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
> >> >> + size_t cnt, loff_t *ppos)
> >> >> +{
> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> >> >> + __ktrace_control_write);
> >> >> +}
> >> >> +
> >> >> +static const struct file_operations ktracecontrol_file_operations = {
> >> >> + .open = ktrace_control_open,
> >> >> + .read = ktrace_control_read,
> >> >> + .write = ktrace_control_write,
> >> >> +};
> >> >> +
> >> >> +/* functions for adding/removing a trace event. Protected by mutex lock.
> >> >> + * Called during initialization or after loading a module */
> >> >> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
> >> >> +{
> >> >> + struct ktrace_entry *entry;
> >> >> + size_t name_len = strlen(ktrace->name) + 1;
> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
> >> >> + struct hlist_head *head;
> >> >> + struct hlist_node *node;
> >> >> + struct dentry *dentry;
> >> >> +
> >> >> + if (!tracedir)
> >> >> + return 0;
> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> >> >> + hlist_for_each_entry(entry, node, head, hlist) {
> >> >> + if (!strcmp(ktrace->name, entry->name)) {
> >> >> + if (strcmp(ktrace->format, entry->format)) {
> >> >> + printk(KERN_WARNING "the format of tracepoint "
> >> >> + "\'%s\' changes from \'%s\' to \'%s\'."
> >> >> + "Dynamic changing of trace format is "
> >> >> + "not supported yet!\n",
> >> >> + ktrace->name, entry->format,
> >> >> + ktrace->format);
> >> >> + return ERR_PTR(-EINVAL);
> >> >> + }
> >> >> + ktrace->enabled = entry->enabled;
> >> >> + ktrace->func = entry->func;
> >> >> + ktrace->event_id = entry->event_id;
> >> >> + if (list_empty(&entry->klist))
> >> >> + goto add_head;
> >> >> + list_add_tail(&ktrace->list, &entry->klist);
> >> >> + return entry;
> >> >> + }
> >> >> + }
> >> >> + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
> >> >> + if (!entry)
> >> >> + return ERR_PTR(-ENOMEM);
> >> >> + entry->name = kmalloc(name_len, GFP_KERNEL);
> >> >> + if (!entry->name) {
> >> >> + kfree(entry);
> >> >> + return ERR_PTR(-ENOMEM);
> >> >> + }
> >> >> + memcpy(entry->name, ktrace->name, name_len);
> >> >> + if (ktrace->format) {
> >> >> + size_t format_len = strlen(ktrace->format) + 1;
> >> >> + entry->format = kmalloc(format_len, GFP_KERNEL);
> >> >> + if (!entry->format) {
> >> >> + kfree(entry->name);
> >> >> + kfree(entry);
> >> >> + return ERR_PTR(-ENOMEM);
> >> >> + }
> >> >> + memcpy(entry->format, ktrace->format, format_len);
> >> >> + } else
> >> >> + entry->format = NULL;
> >> >> + entry->func = ktrace->func;
> >> >> + entry->enabled = 0;
> >> >> + ktrace->event_id = entry->event_id = ktrace_next_id++;
> >> >> + entry->stroff = NULL;
> >> >> + INIT_LIST_HEAD(&entry->klist);
> >> >> + hlist_add_head(&entry->hlist, head);
> >> >> +add_head:
> >> >> + list_add(&ktrace->list, &entry->klist);
> >> >> + dentry = debugfs_create_file(entry->name, 0660, eventdir,
> >> >> + entry, &ktracecontrol_file_operations);
> >> >> + if (!dentry)
> >> >> + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
> >> >> + entry->name);
> >> >> + return entry;
> >> >> +}
> >> >> +
> >> >> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
> >> >> +{
> >> >> + struct ktrace_entry *entry;
> >> >> + size_t name_len = strlen(ktrace->name) + 1;
> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
> >> >> + struct hlist_head *head;
> >> >> + struct hlist_node *node, *temp;
> >> >> + struct dentry *dentry;
> >> >> +
> >> >> + if (!tracedir)
> >> >> + return 0;
> >> >> + list_del(&ktrace->list);
> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> >> >> + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
> >> >> + if (!strcmp(ktrace->name, entry->name)) {
> >> >> + if (list_empty(&entry->klist)) {
> >> >> + dentry = lookup_one_len(entry->name,
> >> >> + eventdir, strlen(entry->name));
> >> >> + if (dentry && !IS_ERR(dentry)) {
> >> >> + debugfs_remove(dentry);
> >> >> + dput(dentry);
> >> >> + }
> >> >> + entry->enabled = 0;
> >> >> + if (free) {
> >> >> + hlist_del(&entry->hlist);
> >> >> + kfree(entry->name);
> >> >> + kfree(entry->format);
> >> >> + kfree(entry->stroff);
> >> >> + kfree(entry);
> >> >> + }
> >> >> + }
> >> >> + return 0;
> >> >> + }
> >> >> + return -1;
> >> >> +}
> >> >> +
> >> >> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
> >> >> + * Called at the end of module load/unload. */
> >> >> +static int ktrace_module_notify(struct notifier_block *self,
> >> >> + unsigned long val, void *data)
> >> >> +{
> >> >> +#ifdef CONFIG_MODULES
> >> >> + struct kernel_trace *iter;
> >> >> + struct module *mod = data;
> >> >> +
> >> >> + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
> >> >> + return 0;
> >> >> + mutex_lock(&ktrace_mutex);
> >> >> + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
> >> >> + if (val == MODULE_STATE_COMING)
> >> >> + add_ktrace(iter);
> >> >> + else
> >> >> + remove_ktrace(iter, 0);
> >> >> + mutex_unlock(&ktrace_mutex);
> >> >> +#endif
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static struct notifier_block ktrace_module_nb = {
> >> >> + .notifier_call = ktrace_module_notify,
> >> >> +};
> >> >> +
> >> >> +/* functions for user-space programs to read data from tracing buffer */
> >> >> +static int trace_data_open(struct inode *inode, struct file *filp)
> >> >> +{
> >> >> + filp->private_data = inode->i_private;
> >> >> + kref_get(&ktrace_kref);
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static void release_trace_buffer(struct kref *kref)
> >> >> +{
> >> >> + ring_buffer_free(trace_buffer);
> >> >> + trace_buffer = NULL;
> >> >> +}
> >> >> +
> >> >> +static int trace_data_release(struct inode *inode, struct file *filp)
> >> >> +{
> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
> >> >> +{
> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> >> >> + if (filp->f_mode & FMODE_READ) {
> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> >> >> + return POLLIN | POLLRDNORM;
> >> >> + poll_wait(filp, &trace_wait, poll_table);
> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> >> >> + return POLLIN | POLLRDNORM;
> >> >> + }
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> >> >> +{
> >> >> + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
> >> >> + pgoff_t pgoff = vmf->pgoff;
> >> >> + struct page *page;
> >> >> +
> >> >> + if (!trace_buffer)
> >> >> + return VM_FAULT_OOM;
> >> >> + page = ring_buffer_get_page(cpu_buffer, pgoff);
> >> >> + if (page == NULL)
> >> >> + return VM_FAULT_SIGBUS;
> >> >> + get_page(page);
> >> >> + vmf->page = page;
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static struct vm_operations_struct trace_data_mmap_ops = {
> >> >> + .fault = trace_buf_fault,
> >> >> +};
> >> >> +
> >> >> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
> >> >> +{
> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> >> >> +
> >> >> + vma->vm_ops = &trace_data_mmap_ops;
> >> >> + vma->vm_flags |= VM_DONTEXPAND;
> >> >> + vma->vm_private_data = cpu_buffer;
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
> >> >> + unsigned int cmd, unsigned long arg)
> >> >> +{
> >> >> + u32 __user *argp = (u32 __user *)arg;
> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> >> >> +
> >> >> + if (!trace_buffer || !tracedir)
> >> >> + return -EINVAL;
> >> >> + switch (cmd) {
> >> >> + case KTRACE_BUF_GET_SIZE:
> >> >> + {
> >> >> + unsigned long bufsize;
> >> >> + bufsize = ring_buffer_size(trace_buffer);
> >> >> + return put_user((u32)bufsize, argp);
> >> >> + }
> >> >> + case KTRACE_BUF_GET_PRODUCED:
> >> >> + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
> >> >> + case KTRACE_BUF_GET_CONSUMED:
> >> >> + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
> >> >> + case KTRACE_BUF_PUT_CONSUMED:
> >> >> + {
> >> >> + u32 consumed, consumed_old;
> >> >> + int ret;
> >> >> +
> >> >> + ret = get_user(consumed, argp);
> >> >> + if (ret) {
> >> >> + printk(KERN_WARNING
> >> >> + "error getting consumed value: %d\n", ret);
> >> >> + return ret;
> >> >> + }
> >> >> + consumed_old = ring_buffer_get_consumed(cpu_buffer);
> >> >> + if (consumed == consumed_old)
> >> >> + return 0;
> >> >> + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
> >> >> + return 0;
> >> >> + }
> >> >> + default:
> >> >> + return -ENOIOCTLCMD;
> >> >> + }
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static const struct file_operations tracedata_file_operations = {
> >> >> + .open = trace_data_open,
> >> >> + .poll = trace_data_poll,
> >> >> + .mmap = trace_data_mmap,
> >> >> + .ioctl = trace_data_ioctl,
> >> >> + .release = trace_data_release,
> >> >> +};
> >> >> +
> >> >> +/* function to report the number of lost events due to buffer overflow */
> >> >> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
> >> >> + size_t cnt, loff_t *ppos)
> >> >> +{
> >> >> + int ret;
> >> >> + char buf[64];
> >> >> + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> >> >> +}
> >> >> +
> >> >> +static const struct file_operations traceoverflow_file_operations = {
> >> >> + .read = trace_overflow_read,
> >> >> +};
> >> >> +
> >> >> +/* function to re-initialize trace buffer */
> >> >> +static void __trace_buffer_reset(struct file *filp,
> >> >> + unsigned long val, size_t *cnt)
> >> >> +{
> >> >> + if (val && trace_buffer)
> >> >> + ring_buffer_reset(trace_buffer);
> >> >> +}
> >> >> +
> >> >> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
> >> >> + size_t cnt, loff_t *ppos)
> >> >> +{
> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> >> >> + __trace_buffer_reset);
> >> >> +}
> >> >> +
> >> >> +static const struct file_operations tracereset_file_operations = {
> >> >> + .write = trace_buffer_reset,
> >> >> +};
> >> >> +
> >> >> +/*
> >> >> + * We use debugfs for kernel-user communication. All of the control/info
> >> >> + * files are under debugfs/ktrace directory.
> >> >> + */
> >> >> +static int create_debugfs(void)
> >> >> +{
> >> >> + struct dentry *entry, *bufdir;
> >> >> + struct ring_buffer_per_cpu *cpu_buffer;
> >> >> + char *tmpname;
> >> >> + int cpu;
> >> >> +
> >> >> + tracedir = debugfs_create_dir("ktrace", NULL);
> >> >> + if (!tracedir) {
> >> >> + printk(KERN_WARNING "Couldn't create debugfs directory\n");
> >> >> + return -ENOMEM;
> >> >> + }
> >> >> +
> >> >> + /* 'buffers' is the directory that holds trace buffer debugfs files */
> >> >> + bufdir = debugfs_create_dir("buffers", tracedir);
> >> >> + if (!bufdir) {
> >> >> + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
> >> >> + return -ENOMEM;
> >> >> + }
> >> >> +
> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> >> >> + if (!tmpname)
> >> >> + return -ENOMEM;
> >> >> + /* create a debugfs file for each cpu buffer */
> >> >> + for_each_possible_cpu(cpu) {
> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> >> >> + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
> >> >> + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
> >> >> + &tracedata_file_operations);
> >> >> + if (!entry) {
> >> >> + printk(KERN_WARNING
> >> >> + "Couldn't create debugfs \'%s\' entry\n",
> >> >> + tmpname);
> >> >> + return -ENOMEM;
> >> >> + }
> >> >> + }
> >> >> + kfree(tmpname);
> >> >> +
> >> >> + /* the control file for users to enable/disable global tracing. */
> >> >> + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
> >> >> + &tracecontrol_file_operations);
> >> >> + if (!entry) {
> >> >> + printk(KERN_WARNING
> >> >> + "Couldn't create debugfs 'enabled' entry\n");
> >> >> + return -ENOMEM;
> >> >> + }
> >> >> +
> >> >> + /*
> >> >> + * the debugfs file that displays the name, format etc. of every
> >> >> + * supported trace event. The file is to be used by the user-space
> >> >> + * trace parser to analyze the collected trace data.
> >> >> + */
> >> >> + entry = debugfs_create_file("info", 0444, tracedir, NULL,
> >> >> + &traceinfo_file_operations);
> >> >> + if (!entry) {
> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
> >> >> + return -ENOMEM;
> >> >> + }
> >> >> +
> >> >> + /*
> >> >> + * the debugfs file that reports the number of events
> >> >> + * lost due to buffer overflow
> >> >> + */
> >> >> + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
> >> >> + &traceoverflow_file_operations);
> >> >> + if (!entry) {
> >> >> + printk(KERN_WARNING
> >> >> + "Couldn't create debugfs 'overflow' entry\n");
> >> >> + return -ENOMEM;
> >> >> + }
> >> >> +
> >> >> + /* the debugfs file that resets trace buffer upon write */
> >> >> + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
> >> >> + &tracereset_file_operations);
> >> >> + if (!entry) {
> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
> >> >> + return -ENOMEM;
> >> >> + }
> >> >> +
> >> >> + /* create the directory that holds the control files for every event */
> >> >> + eventdir = debugfs_create_dir("events", tracedir);
> >> >> + if (!eventdir) {
> >> >> + printk(KERN_WARNING "Couldn't create 'events' directory\n");
> >> >> + return -ENOMEM;
> >> >> + }
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static void remove_debugfs_file(struct dentry *dir, char *name)
> >> >> +{
> >> >> + struct dentry *dentry;
> >> >> + dentry = lookup_one_len(name, dir, strlen(name));
> >> >> + if (dentry && !IS_ERR(dentry)) {
> >> >> + debugfs_remove(dentry);
> >> >> + dput(dentry);
> >> >> + }
> >> >> +}
> >> >> +
> >> >> +static int remove_debugfs(void)
> >> >> +{
> >> >> + struct dentry *bufdir;
> >> >> + char *tmpname;
> >> >> + int cpu;
> >> >> +
> >> >> + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
> >> >> + if (!bufdir || IS_ERR(bufdir))
> >> >> + return -EIO;
> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> >> >> + if (!tmpname)
> >> >> + return -ENOMEM;
> >> >> + for_each_possible_cpu(cpu) {
> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> >> >> + remove_debugfs_file(bufdir, tmpname);
> >> >> + }
> >> >> + kfree(tmpname);
> >> >> + debugfs_remove(bufdir);
> >> >> +
> >> >> + remove_debugfs_file(tracedir, "enabled");
> >> >> + remove_debugfs_file(tracedir, "info");
> >> >> + debugfs_remove(eventdir);
> >> >> + debugfs_remove(tracedir);
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +static int __init init_ktrace(void)
> >> >> +{
> >> >> + struct kernel_trace *iter;
> >> >> + int i, err;
> >> >> +
> >> >> + ktrace_next_id = 0;
> >> >> + kref_set(&ktrace_kref, 1);
> >> >> + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
> >> >> + if (trace_buffer == NULL)
> >> >> + return -ENOMEM;
> >> >> + setup_timer(&trace_timer, wakeup_readers, 0);
> >> >> + for (i = 0; i < KTRACE_HASH_SIZE; i++)
> >> >> + INIT_HLIST_HEAD(&ktrace_table[i]);
> >> >> +
> >> >> + mutex_lock(&ktrace_mutex);
> >> >> + err = create_debugfs();
> >> >> + if (err != 0)
> >> >> + goto out;
> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> >> >> + add_ktrace(iter);
> >> >> + err = register_module_notifier(&ktrace_module_nb);
> >> >> +out:
> >> >> + mutex_unlock(&ktrace_mutex);
> >> >> + return err;
> >> >> +}
> >> >> +module_init(init_ktrace);
> >> >> +
> >> >> +static void __exit exit_ktrace(void)
> >> >> +{
> >> >> + struct kernel_trace *iter;
> >> >> +
> >> >> + if (ktrace_enabled) {
> >> >> + ktrace_enabled = 0;
> >> >> + del_timer_sync(&trace_timer);
> >> >> + }
> >> >> + mutex_lock(&ktrace_mutex);
> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> >> >> + remove_ktrace(iter, 1);
> >> >> + unregister_module_notifier(&ktrace_module_nb);
> >> >> + remove_debugfs();
> >> >> + mutex_unlock(&ktrace_mutex);
> >> >> +
> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
> >> >> +}
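By the way, to check that I understand the user-space protocol, a
reader for one cpu buffer would look roughly like the sketch below
(untested ; it assumes debugfs is mounted on /sys/kernel/debug, and the
poll()/wakeup handling is left out) :

#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* mirrors the ioctl definitions in linux/ktrace.h above */
#define KTRACE_BUF_GET_SIZE	_IOR(0xF5, 0x00, uint32_t)
#define KTRACE_BUF_GET_PRODUCED	_IOR(0xF5, 0x01, uint32_t)
#define KTRACE_BUF_GET_CONSUMED	_IOR(0xF5, 0x02, uint32_t)
#define KTRACE_BUF_PUT_CONSUMED	_IOW(0xF5, 0x04, uint32_t)

int main(void)
{
	uint32_t size, produced, consumed;
	void *buf;
	int fd = open("/sys/kernel/debug/ktrace/buffers/cpu0", O_RDONLY);

	if (fd < 0)
		return 1;
	if (ioctl(fd, KTRACE_BUF_GET_SIZE, &size) < 0)
		return 1;
	buf = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;
	ioctl(fd, KTRACE_BUF_GET_PRODUCED, &produced);
	ioctl(fd, KTRACE_BUF_GET_CONSUMED, &consumed);
	/* consume the bytes in [consumed, produced) modulo size from
	 * buf here, then advance the kernel's reader position : */
	consumed = produced;
	ioctl(fd, KTRACE_BUF_PUT_CONSUMED, &consumed);
	munmap(buf, size);
	close(fd);
	return 0;
}

Is that the intended usage ?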
> >> >
> >> > --
> >> > Mathieu Desnoyers
> >> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> >> >
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> >
>
> _______________________________________________
> ltt-dev mailing list
> [email protected]
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, Mar 16, 2009 at 6:53 PM, Mathieu Desnoyers
<[email protected]> wrote:
> * Jiaying Zhang ([email protected]) wrote:
>> On Mon, Mar 16, 2009 at 5:01 PM, Mathieu Desnoyers
>> <[email protected]> wrote:
>> > * Jiaying Zhang ([email protected]) wrote:
>> >> Hi Mathieu,
>> >>
>> >> First, I apologize that I did not forward you my original email. I assumed
>> >> that the email would be noticed by interested people when it was sent
>> >> to lkml. Now I see that was a mistake.
>> >>
>> >
>> > Hi Jiaying,
>> >
>> > Apologies accepted, no hard feelings :) I was just surprised when I
>> > found out about the numbers you posted.
>> >
>> >> On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
>> >> <[email protected]> wrote:
>> >> > Hi Jiaying,
>> >> >
>> >> > I'm very interested to hear about the details of your test-case, because
>> >> > what you are talking about here, and the numbers that you get, are, at
>> >> > the very least, not representative of the current LTTng. Perharps you
>> >> > used a version right after the kernel summit and stopped following the
>> >> > improvements we have done since then ? I also wonder about the instrumentation
>> >> > set you enabled in your tests, and whether it's exactly the same in the ktrace
>> >> > and LTTng runs (especially system call tracing). Please note that following the
>> >> > Kernel Summit 2008, a more active development cycle began, and the time was not
>> >> > right for optimisation. But since then a lot of improvement has been done on
>> >> > this side.
>> >>
>> >> Sorry that I omitted a lot of details in my test description. I did use an old
>> >> version of LTTng in my tests. The LTTng version I used was the 0.21 lttng
>> >> patches for 2.6.27 kernel. I used 2.6.26 kernel in my tests, with the ported
>> >> lttng-2.6.27-rc6-0.21 patches.
>> >>
>> >> I enabled the same set of trace events in ktrace and LTTng runs. Here is
>> >> the list of enabled events: syscall entry/exit, irq entry/exit,
>> >> trap entry/exit, kernel_sched_switch, kernel_process_fork,
>> >> kernel_process_wait, kernel_process_free, kernel_process_exit.
>> >>
>> >> We are glad to see the improvements coming from the LTTng community.
>> >> I haven't got time to sync with the latest LTTng code and re-run my tests.
>> >> I will let you know when I have such data available.
>> >>
>> >
>> > On my side, with flight recorder activated, same instrumentation you
>> > use, with LTTng 0.111 :
>> >
>> > arm-google script :
>> >
>> > #!/bin/sh
>> >
>> > DIR=/mnt/debugfs/ltt/markers
>> >
>> > for a in \
>> > syscall_entry syscall_exit \
>> > irq_entry irq_exit \
>> > trap_entry trap_exit \
>> > page_fault_entry page_fault_exit \
>> > page_fault_nosem_entry page_fault_nosem_exit \
>> > page_fault_get_user_entry page_fault_get_user_exit \
>> > sched_schedule process_fork process_wait \
>> > process_free process_exit;
>> > do
>> > echo 1 > ${DIR}/kernel/$a/enable
>> > done
>> >
>> > running tbench -t 200 8
>> >
>> > Vanilla kernel : 2070 MB/sec
>> > Flight recorder : 1693 MB/sec Performance impact : 18.2 %
>> > Normal : 1620 MB/sec Performance impact : 21.7 % (8 lttd threads)
>> >
>> > So now we are comparing apples with apples :-)
>> Looks like the performance of LTTng has improved a lot over the past
>> few months. I will run some tests with the latest LTTng patches and
>> let you know if I see any difference. Did you use 2.6.29-rc7 kernel in
>> your tests?
>>
>
> Yes, kernel 2.6.29-rc7, lttng 0.111. But lttng 0.112 for kernel
> 2.6.29-rc8 should also be close enough.
>
>> >
>> >> >
>> >> > First I'd like to express my concern about the way you posted your
>> >> > numbers :
>> >> >
>> >> > 1 - You did not CC anyone from the LTTng project.
>> >> > 2 - You did not tell which LTTng version you used as a comparison basis.
>> >> > 3 - You did not look at the current state of code before posting your
>> >> > comparison. (as shows the comment in your patch header)
>> >> That is indeed my mistake.
>> >>
>> >> > 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
>> >> > the default instrumentation ? Did you also enable system call tracing
>> >> > with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
>> >> > makes a huge difference on the amount of information gathered.
>> >> Yes. We also enable syscall tracing via set_kernel_trace_flag_all_tasks().
>> >>
>> >> >
>> >> > So let's have a close look at them. I'm re-doing the benchmarks based on
>> >> > the current LTTng version below.
>> >> >
>> >> >> Hello All,
>> >> >>
>> >> >> We have been working on building a kernel tracer for a production environment.
>> >> >> We first considered to use the existing markers code, but found it a bit
>> >> >> heavyweight.
>> >> >
>> >> > Have you looked at :
>> >> >
>> >> > 1 - The tracepoints currently in the kernel
>> >> > 2 - The _current_ version of markers in the LTTng tree ?
>> >>
>> >> I looked at the tracepoints currently in the kernel. It is great to see that
>> >> some trace instrumentation code is finally available in the mainline kernel.
>> >> But I still don't like an additional tracepoint layer on top of the
>> >> actual tracing
>> >> (ktrace or markers or ftrace) layer. That seems to add unnecessary overhead.
>> >> The markers code I referred to in my email is the current code in the mainline
>> >> kernel. I had a look at the LTTng markers code but still saw that va_list
>> >> argument is used in marker_probe_func. It sounds like a nice improvement
>> >> if the LTTng markers avoid the use of va_arg whenever possible.
>> >>
>> >
>> > Yes, va_args is only there as a convenience for slow paths.
>> >
>> >> >
>> >> >> So we developed a new kernel tracing prototype that shares
>> >> >> many similarities as markers but also includes several simplifications that
>> >> >> give us considerable performance benefits. For example, we pass the size of
>> >> >> trace event directly to the probe function instead of using va_arg as that
>> >> >> used in markers, which saves us the overhead of variable passing.
>> >> >
>> >> > The markers in the LTTng tree only use va_args for "debug-style"
>> >> > tracing. All the tracing fast paths are now done with custom probe
>> >> > functions which takes the arguments from the tracepoint and write them
>> >> > in the trace buffers directly in C. When the event payload size is already
>> >> > known statically, there is no dynamic computation cost at all : it's
>> >> > simply passed to the ltt-relay write function.
>> >> But this raises another concern that sub-system developers may need
>> >> to write a bunch of custom probe functions to avoid the argument parsing
>> >> overhead. Can you instead have a general probe function or macro that
>> >> does this work?
>> >>
>> >
>> > Well, considering the amount of care we have to take when we consider
>> > adding a tracepoint in the mainline kernel, I think it's even a *good*
>> > thing that someone has to stop and write a small callback to deal with
>> > the information he wants to export. I think debug-style tracing should
>> > be very, very quick and easy to add, but tracepoints meant to make it
>> > into the mainline do not share this requirement.
>> I agree that we need to be careful when adding tracepoints into the kernel,
>> but I still think it is good to simplify the process of adding/maintaining
>> tracepoints so sub-system maintainers don't feel it is a big burden.
>>
>
> Yes, we have to make sure it's not an impossible task.
>
>> >
>> >> >
>> >> >> Also, we
>> >> >> associate a single probe function to each event, so we can directly call the
>> >> >> probe function when tracing is enabled.
>> >> >
>> >> > You seem to imply that being stuck with a single probe function is
>> >> > somehow superior to having tiny per-event callbacks which then call into a
>> >> > buffer write function ? Why ?
>> >> That is indeed my thinking, because we saw that some platforms are
>> >> very sensitive to function call overhead and we haven't seen use cases
>> >> where multiple probe functions prove necessary.
>> >>
>> >
>> > Hrm, so your plan is to embed the tracing code at the caller site, or to
>> > keep one single function call ?
>> We do have a general probe function that is called when tracing is enabled.
>> Other than that, the main tracing code is defined as an inline function that is
>> embedded at the caller site.
>>
>
> You mean that you embed the main tracing code directly in the instrumented
> kernel functions ? Hrm, have you considered the call-site pollution
> impact that might have ? The compiler might start having
> difficulty optimising hot paths that would then become large functions.
> Or maybe I am not correctly understanding your approach.
Sorry. I was actually trying to say that the main instrumentation code is
embedded at the caller site. The main tracing code is in the general probe
function.
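To put it in terms of the DO_TRACE() macro in the patch, the part that
gets inlined at the call site is roughly :

	if (unlikely(ktrace_enabled) && unlikely(__ktrace_foo.enabled)) {
		struct ktrace_struct_foo karg = { args };
		(*__ktrace_foo.func)(&__ktrace_foo, &karg, sizeof(karg));
	}

(foo being a made-up event name), while the buffer reservation and copy
live behind the single func pointer, i.e. in ring_buffer_probe() or
string_probe(), out of line.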
Jiaying
>
> Mathieu
>
>>
>> Jiaying
>>
>> >
>> >> >
>> >> >> Our measurements show that our
>> >> >> current tracing prototype introduces much lower overhead compared with
>> >> >> lttng (we use lttng as a comparison because markers does not provide the
>> >> >> implementation to record trace data).
>> >> >
>> >> > Which LTTng version, kernel version, LTTng config and runtime options do
>> >> > you use ?
>> >> See above for LTTng and kernel versions. I was using the default LTTng
>> >> config options and 'normal' recorder mode.
>> >>
>> >> >
>> >> >> E.g., with dbench, lttng introduces
>> >> >> 20~80% overhead on our testing platforms while the overhead with our
>> >> >> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
>> >> >> overhead while the overhead added by our kernel tracer is usually within
>> >> >> 15%; with kernbench (time to compile a Linux kernel), the overhead
>> >> >> added by lttng is around 4%~18% while the overhead added by our
>> >> >> kernel tracer is within 2.5%.
>> >> >>
>> >> >
>> >> > I'm re-running the dbench and tbench tests :
>> >> >
>> >> > With LTTng 0.106, kernel 2.6.29-rc7
>> >> > 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
>> >> >
>> >> > Arming all the default LTTng instrumentation (168 tracepoints),
>> >> > Tracing in flight recorder mode, including system call entry/exit
>> >> > tracing and block I/O tracing (blktrace instrumentation), I get :
>> >> >
>> >> > Test / config | Vanilla kernel | LTTng 0.106 | slowdown
>> >> > dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
>> >> >
>> >> > As an example of the effect of custom probe vs va_args based probes,
>> >> > let's take the tbench example
>> >> >
>> >> > | Vanilla | LTTng 0.106 | slowdown
>> >> > tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
>> >> >
>> >> > The LTTng trace statistics shows the top high event rate events :
>> >> >
>> >> > Some are usual (therefore lttng already implements custom probes for them) :
>> >> >
>> >> > 42004 sched_try_wakeup
>> >> > 79754 timer_set
>> >> > 83973 sched_schedule
>> >> > 84317 softirq_exit
>> >> > 84318 softirq_entry
>> >> > 95370 page_free
>> >> > 95412 page_alloc
>> >> > 267247 syscall_entry
>> >> > 267247 syscall_exit
>> >> >
>> >> > Some are more tbench workload-specific :
>> >> >
>> >> > 16915 socket_sendmsg
>> >> > 18911 napi_complete
>> >> > 18911 napi_schedule
>> >> > 18912 dev_receive
>> >> > 18914 dev_xmit
>> >> > 18920 napi_poll
>> >> > 33831 socket_recvmsg
>> >> > 59860 read
>> >> >
>> >> >
>> >> > So let's see what happens if I implement those as custom probes and do a bit of
>> >> > tuning in the tracer hot path :
>> >> Could you give more details about the custom probe you used?
>> >
>> > They are available in the lttng tree on git.kernel.org in
>> > ltt/probes/*-trace.c. Their API is in include/linux/ltt-type-serialize.h
>> >
>> >> Do you need to write a custom probe for each tracepoint?
>> >>
>> >
>> > Yes. But it's not such a big deal. It looks a bit like Steve's
>> > TRACE_EVENT macro-fu, but written in a simple C callback :
>> >
>> > void probe_sched_switch(struct rq *rq, struct task_struct *prev,
>> > struct task_struct *next);
>> >
>> > DEFINE_MARKER_TP(kernel, sched_schedule, sched_switch, probe_sched_switch,
>> > "prev_pid %d next_pid %d prev_state #2d%ld");
>> >
>> > notrace void probe_sched_switch(struct rq *rq, struct task_struct *prev,
>> > struct task_struct *next)
>> > {
>> > struct marker *marker;
>> > struct serialize_int_int_short data;
>> >
>> > data.f1 = prev->pid;
>> > data.f2 = next->pid;
>> > data.f3 = prev->state;
>> >
>> > marker = &GET_MARKER(kernel, sched_schedule);
>> > ltt_specialized_trace(marker, marker->single.probe_private,
>> > &data, serialize_sizeof(data), sizeof(int));
>> > }
>> >
>> > Best regards,
>> >
>> > Mathieu
>> >
>> >> Jiaying
>> >>
>> >> >
>> >> > | Vanilla | LTTng 0.107 | slowdown
>> >> > Custom probes :
>> >> > tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
>> >> >
>> >> > With tracer hot path tuning (inline hot path, function calls for slow path, will
>> >> > be in 0.108) :
>> >> > tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
>> >> >
>> >> > kernbench :
>> >> > Average Optimal load -j 32 Run
>> >> >
>> >> > Vanilla 2.6.29-rc7 :
>> >> > Elapsed Time 48.238 (0.117346)
>> >> >
>> >> > With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
>> >> > Elapsed Time 50.226 (0.0750333)
>> >> >
>> >> > Slowdown : 4.1 %
>> >> >
>> >> > So except the tbench workload on LTTng 0.106, which still had a 52% performance
>> >> > impact (brought it down to 35 % for the next LTTng release) due to
>> >> > specific workload and not having specialized probes implemented, the
>> >> > dbench impact is well below your results (12.3 % vs 20-80%)
>> >> >
>> >> > So given the huge result discrepancies between your tests and mine, could you
>> >> > give us more detail about your benchmarks ? Especially knowing the enabled
>> >> > instrumentation set, and the amount of MB/s generated by both ktrace and LTTng
>> >> > runs (to make sure we are comparing the same traffic) and the LTTng versions you
>> >> > are using as a comparison basis would be good.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Mathieu
>> >> >
>> >> >
>> >> >> The patch below is the current prototype we have implemented. It uses
>> >> >> Steve's unified trace buffer code with some small extensions to support
>> >> >> buffer mmap. The patch is probably still buggy, but we would like to get
>> >> >> some early feedback from the community.
>> >> >>
>> >> >> Signed-off-by: Jiaying Zhang <[email protected]>
>> >> >>
>> >> >> Index: git-linux/init/Kconfig
>> >> >> ===================================================================
>> >> >> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
>> >> >> +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
>> >> >> @@ -950,6 +950,15 @@
>> >> >> Place an empty function call at each marker site. Can be
>> >> >> dynamically changed for a probe function.
>> >> >>
>> >> >> +config KTRACE
>> >> >> + bool "Enable ktrace support"
>> >> >> + select DEBUG_FS
>> >> >> + select TRACING
>> >> >> + select RING_BUFFER
>> >> >> + help
>> >> >> + Ktrace is a kernel tracing tool that allows you to trace
>> >> >> + kernel events by inserting trace points at proper places.
>> >> >> +
>> >> >> source "arch/Kconfig"
>> >> >>
>> >> >> endmenu # General setup
>> >> >> Index: git-linux/kernel/Makefile
>> >> >> ===================================================================
>> >> >> --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
>> >> >> +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
>> >> >> @@ -86,6 +86,7 @@
>> >> >> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
>> >> >> obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
>> >> >> obj-$(CONFIG_MARKERS) += marker.o
>> >> >> +obj-$(CONFIG_KTRACE) += ktrace.o
>> >> >> obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
>> >> >> obj-$(CONFIG_LATENCYTOP) += latencytop.o
>> >> >> obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
>> >> >> Index: git-linux/include/asm-generic/vmlinux.lds.h
>> >> >> ===================================================================
>> >> >> --- git-linux.orig/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:37.000000000 -0800
>> >> >> +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:39.000000000 -0800
>> >> >> @@ -73,6 +73,9 @@
>> >> >> MEM_KEEP(init.data) \
>> >> >> MEM_KEEP(exit.data) \
>> >> >> . = ALIGN(8); \
>> >> >> + VMLINUX_SYMBOL(__start___ktraces) = .; \
>> >> >> + *(__ktrace) \
>> >> >> + VMLINUX_SYMBOL(__stop___ktraces) = .; \
>> >> >> VMLINUX_SYMBOL(__start___markers) = .; \
>> >> >> *(__markers) \
>> >> >> VMLINUX_SYMBOL(__stop___markers) = .; \
>> >> >> @@ -89,6 +92,7 @@
>> >> >> VMLINUX_SYMBOL(__start_rodata) = .; \
>> >> >> *(.rodata) *(.rodata.*) \
>> >> >> *(__vermagic) /* Kernel version magic */ \
>> >> >> + *(__ktrace_strings) /* Ktrace: strings */ \
>> >> >> *(__markers_strings) /* Markers: strings */ \
>> >> >> *(__tracepoints_strings)/* Tracepoints: strings */ \
>> >> >> } \
>> >> >> Index: git-linux/kernel/module.c
>> >> >> ===================================================================
>> >> >> --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
>> >> >> +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
>> >> >> @@ -50,6 +50,7 @@
>> >> >> #include <asm/sections.h>
>> >> >> #include <linux/tracepoint.h>
>> >> >> #include <linux/ftrace.h>
>> >> >> +#include <linux/ktrace.h>
>> >> >> #include <linux/async.h>
>> >> >>
>> >> >> #if 0
>> >> >> @@ -2145,6 +2146,10 @@
>> >> >> sizeof(*mod->tracepoints),
>> >> >> &mod->num_tracepoints);
>> >> >> #endif
>> >> >> +#ifdef CONFIG_KTRACE
>> >> >> + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
>> >> >> + sizeof(*mod->ktrace), &mod->num_ktrace);
>> >> >> +#endif
>> >> >>
>> >> >> #ifdef CONFIG_MODVERSIONS
>> >> >> if ((mod->num_syms && !mod->crcs)
>> >> >> Index: git-linux/include/linux/module.h
>> >> >> ===================================================================
>> >> >> --- git-linux.orig/include/linux/module.h 2009-02-19 14:58:37.000000000 -0800
>> >> >> +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
>> >> >> @@ -44,6 +44,7 @@
>> >> >> };
>> >> >>
>> >> >> struct module;
>> >> >> +struct kernel_trace;
>> >> >>
>> >> >> struct module_attribute {
>> >> >> struct attribute attr;
>> >> >> @@ -347,6 +348,11 @@
>> >> >> /* Reference counts */
>> >> >> struct module_ref ref[NR_CPUS];
>> >> >> #endif
>> >> >> +#ifdef CONFIG_KTRACE
>> >> >> + struct kernel_trace *ktrace;
>> >> >> + unsigned int num_ktrace;
>> >> >> +#endif
>> >> >> +
>> >> >> };
>> >> >> #ifndef MODULE_ARCH_INIT
>> >> >> #define MODULE_ARCH_INIT {}
>> >> >> Index: git-linux/include/linux/ktrace.h
>> >> >> ===================================================================
>> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> >> >> +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
>> >> >> @@ -0,0 +1,83 @@
>> >> >> +#ifndef _LINUX_KTRACE_H
>> >> >> +#define _LINUX_KTRACE_H
>> >> >> +
>> >> >> +#include <linux/types.h>
>> >> >> +
>> >> >> +struct kernel_trace;
>> >> >> +
>> >> >> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
>> >> >> +
>> >> >> +struct kernel_trace {
>> >> >> + const char *name;
>> >> >> + const char *format; /* format string describing variable list */
>> >> >> + size_t *stroff; /* offsets of string variables */
>> >> >> + /* 31 bit event_id is converted to 16 bit when entering to the buffer */
>> >> >> + u32 enabled:1, event_id:31;
>> >> >> + ktrace_probe_func *func; /* probe function */
>> >> >> + struct list_head list; /* list head linked to the hash table entry */
>> >> >> +} __attribute__((aligned(8)));
>> >> >> +
>> >> >> +extern int ktrace_enabled;
>> >> >> +
>> >> >> +/*
>> >> >> + * Make sure the alignment of the structure in the __ktrace section will
>> >> >> + * not add unwanted padding between the beginning of the section and the
>> >> >> + * structure. Force alignment to the same alignment as the section start.
>> >> >> + */
>> >> >> +
>> >> >> +#define DEFINE_KTRACE_STRUCT(name) \
>> >> >> + struct __attribute__((packed)) ktrace_struct_##name
>> >> >> +
>> >> >> +#ifdef CONFIG_KTRACE
>> >> >> +/*
>> >> >> + * DO_TRACE is the macro that is called from each trace point. Each trace
>> >> >> + * point is associated with a static ktrace object that is located in a
>> >> >> + * special __ktrace section and is uniquely identified by the event name and
>> >> >> + * event format. Take a look at the files under include/trace for example
>> >> >> + * usage. Note that we call the trace probe function only if the global
>> >> >> + * tracing is enabled and the tracing for the associated event is enabled,
>> >> >> + * so we only introduce the overhead of a predicted conditional branch when
>> >> >> + * tracing is disabled.
>> >> >> + */
>> >> >> +#define DO_TRACE(name, format, args...) \
>> >> >> + do { \
>> >> >> + static const char __kstrtab_##name[] \
>> >> >> + __attribute__((section("__ktrace_strings"))) \
>> >> >> + = #name "\0" format; \
>> >> >> + static struct kernel_trace __ktrace_##name \
>> >> >> + __attribute__((section("__ktrace"), aligned(8))) = \
>> >> >> + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
>> >> >> + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
>> >> >> + __ktrace_check_format(format, ## args); \
>> >> >> + if (unlikely(ktrace_enabled) && \
>> >> >> + unlikely(__ktrace_##name.enabled)) { \
>> >> >> + struct ktrace_struct_##name karg = { args }; \
>> >> >> + (*__ktrace_##name.func) \
>> >> >> + (&__ktrace_##name, &karg, sizeof(karg)); \
>> >> >> + } \
>> >> >> + } while (0)
>> >> >> +
>> >> >> +#else /* !CONFIG_KTRACE */
>> >> >> +#define DO_TRACE(name, format, args...)
>> >> >> +#endif
>> >> >> +
>> >> >> +/* To be used for string format validity checking with gcc */
>> >> >> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
>> >> >> +{
>> >> >> +}
>> >> >> +
>> >> >> +#define __ktrace_check_format(format, args...) \
>> >> >> + do { \
>> >> >> + if (0) \
>> >> >> + ___ktrace_check_format(format, ## args); \
>> >> >> + } while (0)
>> >> >> +
>> >> >> +
>> >> >> +/* get the trace buffer information */
>> >> >> +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
>> >> >> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
>> >> >> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
>> >> >> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
>> >> >> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
>> >> >> +
>> >> >> +#endif
>> >> >> Index: git-linux/kernel/ktrace.c
>> >> >> ===================================================================
>> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> >> >> +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
>> >> >> @@ -0,0 +1,872 @@
>> >> >> +/*
>> >> >> + * kernel/ktrace.c
>> >> >> + *
>> >> >> + * Implementation of the kernel tracing for linux kernel 2.6.
>> >> >> + *
>> >> >> + * Copyright 2008- Google Inc.
>> >> >> + * Original Author: Jiaying Zhang
>> >> >> + */
>> >> >> +
>> >> >> +#include <linux/module.h>
>> >> >> +#include <linux/kernel.h>
>> >> >> +#include <linux/bitmap.h>
>> >> >> +#include <linux/poll.h>
>> >> >> +#include <linux/errno.h>
>> >> >> +#include <linux/sched.h>
>> >> >> +#include <linux/linkage.h>
>> >> >> +#include <linux/debugfs.h>
>> >> >> +#include <linux/timer.h>
>> >> >> +#include <linux/delay.h>
>> >> >> +#include <linux/jhash.h>
>> >> >> +#include <linux/ctype.h>
>> >> >> +#include <linux/namei.h>
>> >> >> +#include <linux/ktrace.h>
>> >> >> +#include <linux/ring_buffer.h>
>> >> >> +
>> >> >> +static char *ktrace_version = "1.0.0";
>> >> >> +int ktrace_enabled;
>> >> >> +EXPORT_SYMBOL(ktrace_enabled);
>> >> >> +static uint16_t ktrace_next_id;
>> >> >> +
>> >> >> +extern struct kernel_trace __start___ktraces[];
>> >> >> +extern struct kernel_trace __stop___ktraces[];
>> >> >> +
>> >> >> +static DEFINE_MUTEX(ktrace_mutex);
>> >> >> +
>> >> >> +/* ktrace hash table entry structure and variables */
>> >> >> +struct ktrace_entry {
>> >> >> + struct hlist_node hlist;
>> >> >> + char *name;
>> >> >> + char *format;
>> >> >> + size_t *stroff;
>> >> >> + ktrace_probe_func *func;
>> >> >> + u32 enabled:1, event_id:31;
>> >> >> + struct list_head klist; /* list of loaded ktraces */
>> >> >> +};
>> >> >> +
>> >> >> +#define KTRACE_HASH_BITS 6
>> >> >> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
>> >> >> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
>> >> >> +
>> >> >> +/* debugfs directory variables */
>> >> >> +static struct dentry *tracedir;
>> >> >> +static struct dentry *eventdir;
>> >> >> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
>> >> >> +
>> >> >> +/* ring buffer code */
>> >> >> +static unsigned long trace_buf_size = 65536UL;
>> >> >> +struct ring_buffer *trace_buffer;
>> >> >> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
>> >> >> +static struct timer_list trace_timer; /* reader wake-up timer */
>> >> >> +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
>> >> >> +static struct kref ktrace_kref;
>> >> >> +
>> >> >> +static void
>> >> >> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> >> >> +{
>> >> >> + struct ring_buffer_event *event;
>> >> >> + void *trace_event;
>> >> >> + unsigned long flag;
>> >> >> +
>> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
>> >> >> + sizeof(uint16_t) + event_size, &flag);
>> >> >> + if (!event)
>> >> >> + return;
>> >> >> + trace_event = ring_buffer_event_data(event);
>> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> >> >> + memcpy(trace_event + sizeof(uint16_t), data, event_size);
>> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>> >> >> +}
>> >> >> +
>> >> >> +/* special probe function for events that contain string arguments */
>> >> >> +static void
>> >> >> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> >> >> +{
>> >> >> + struct ring_buffer_event *event;
>> >> >> + void *trace_event, *p;
>> >> >> + unsigned long flag;
>> >> >> + size_t *offset, scanned;
>> >> >> + char *string;
>> >> >> +
>> >> >> + /*
>> >> >> + * Get the real length of the event, i.e., we use stroff array to
>> >> >> + * locate string variables in the passed-in trace data and use their
>> >> >> + * length to replace the size of the string pointers.
>> >> >> + */
>> >> >> + for (offset = kt->stroff; *offset != -1; offset++) {
>> >> >> + string = *(char **) (data + *offset);
>> >> >> + event_size += strlen(string) + 1 - sizeof(char *);
>> >> >> + }
>> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
>> >> >> + sizeof(uint16_t) + event_size, &flag);
>> >> >> + if (!event)
>> >> >> + return;
>> >> >> + trace_event = ring_buffer_event_data(event);
>> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> >> >> + p = trace_event + sizeof(uint16_t);
>> >> >> + /*
>> >> >> + * Copy the trace data into buffer. For string variables, we enter the
>> >> >> + * string into the buffer. Otherwise, the passed in data is copied.
>> >> >> + */
>> >> >> + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
>> >> >> + memcpy(p, data + scanned, *offset - scanned);
>> >> >> + p += *offset - scanned;
>> >> >> + string = *(char **) (data + *offset);
>> >> >> + memcpy(p, string, strlen(string) + 1);
>> >> >> + p += strlen(string) + 1;
>> >> >> + scanned = *offset + sizeof(char *);
>> >> >> + }
>> >> >> + memcpy(p, data + scanned,
>> >> >> + trace_event + sizeof(uint16_t) + event_size - p);
>> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>> >> >> +}
>> >> >> +
>> >> >> +/* timer function used to defer ring buffer reader waking */
>> >> >> +static void wakeup_readers(unsigned long data)
>> >> >> +{
>> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> >> >> + wake_up_interruptible(&trace_wait);
>> >> >> + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
>> >> >> +}
>> >> >> +
>> >> >> +/* function for reading ktrace metadata info */
>> >> >> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
>> >> >> +{
>> >> >> + seq_printf(seq, "version %s\n", ktrace_version);
>> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> >> >> +}
>> >> >> +
>> >> >> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
>> >> >> +{
>> >> >> + ++*pos;
>> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> >> >> +}
>> >> >> +
>> >> >> +static int tracing_info_show(struct seq_file *seq, void *v)
>> >> >> +{
>> >> >> + struct hlist_head *head = v;
>> >> >> + struct hlist_node *node;
>> >> >> + struct ktrace_entry *entry;
>> >> >> + hlist_for_each_entry(entry, node, head, hlist)
>> >> >> + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
>> >> >> + entry->name, entry->format, entry->event_id,
>> >> >> + entry->enabled ? "enabled" : "disabled");
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static void tracing_info_stop(struct seq_file *seq, void *v)
>> >> >> +{
>> >> >> +}
>> >> >> +
>> >> >> +static const struct seq_operations tracing_info_seq_ops = {
>> >> >> + .start = tracing_info_start,
>> >> >> + .next = tracing_info_next,
>> >> >> + .show = tracing_info_show,
>> >> >> + .stop = tracing_info_stop,
>> >> >> +};
>> >> >> +
>> >> >> +static int tracing_info_open(struct inode *inode, struct file *file)
>> >> >> +{
>> >> >> + return seq_open(file, &tracing_info_seq_ops);
>> >> >> +}
>> >> >> +
>> >> >> +static const struct file_operations traceinfo_file_operations = {
>> >> >> + .open = tracing_info_open,
>> >> >> + .read = seq_read,
>> >> >> + .llseek = seq_lseek,
>> >> >> + .release = seq_release,
>> >> >> +};
>> >> >> +
>> >> >> +/*
>> >> >> + * wrapper function used by debugfs write operation.
>> >> >> + * func: handling function that does real work
>> >> >> + */
>> >> >> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
>> >> >> + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
>> >> >> +{
>> >> >> + int ret;
>> >> >> + char buf[64];
>> >> >> + unsigned long val;
>> >> >> +
>> >> >> + if (cnt >= sizeof(buf))
>> >> >> + return -EINVAL;
>> >> >> + if (copy_from_user(&buf, ubuf, cnt))
>> >> >> + return -EFAULT;
>> >> >> + buf[cnt] = 0;
>> >> >> +
>> >> >> + ret = strict_strtoul(buf, 10, &val);
>> >> >> + if (ret < 0)
>> >> >> + return ret;
>> >> >> + val = !!val;
>> >> >> +
>> >> >> + func(filp, val, &cnt);
>> >> >> +
>> >> >> + filp->f_pos += cnt;
>> >> >> + return cnt;
>> >> >> +}
>> >> >> +
>> >> >> +/* functions for reading/writing the global 'enabled' control file */
>> >> >> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
>> >> >> + size_t cnt, loff_t *ppos)
>> >> >> +{
>> >> >> + int ret;
>> >> >> + char buf[64];
>> >> >> + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >> >> +}
>> >> >> +
>> >> >> +static void __tracing_control_write(struct file *filp,
>> >> >> + unsigned long val, size_t *cnt)
>> >> >> +{
>> >> >> + if (val ^ ktrace_enabled) {
>> >> >> + if (val) {
>> >> >> + trace_timer.expires =
>> >> >> + jiffies + TRACEREAD_WAKEUP_INTERVAL;
>> >> >> + add_timer(&trace_timer);
>> >> >> + ktrace_enabled = 1;
>> >> >> + } else {
>> >> >> + ktrace_enabled = 0;
>> >> >> + del_timer_sync(&trace_timer);
>> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> >> >> + wake_up_interruptible(&trace_wait);
>> >> >> + }
>> >> >> + }
>> >> >> +}
>> >> >> +
>> >> >> +static ssize_t tracing_control_write(struct file *filp, const char __user *ubuf,
>> >> >> + size_t cnt, loff_t *ppos)
>> >> >> +{
>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >> >> + __tracing_control_write);
>> >> >> +}
>> >> >> +
>> >> >> +static const struct file_operations tracecontrol_file_operations = {
>> >> >> + .read = tracing_control_read,
>> >> >> + .write = tracing_control_write,
>> >> >> +};
>> >> >> +
>> >> >> +/* functions for reading/writing a trace event file */
>> >> >> +static int update_ktrace(struct ktrace_entry *entry,
>> >> >> + ktrace_probe_func *func, int enabled)
>> >> >> +{
>> >> >> + struct kernel_trace *iter;
>> >> >> + /* no need to update the list if the tracing is not initialized */
>> >> >> + if (!tracedir)
>> >> >> + return 0;
>> >> >> + entry->enabled = enabled;
>> >> >> + entry->func = func;
>> >> >> + list_for_each_entry(iter, &entry->klist, list) {
>> >> >> + iter->stroff = entry->stroff;
>> >> >> + iter->func = func;
>> >> >> + iter->enabled = enabled;
>> >> >> + }
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static int ktrace_probe_register(struct ktrace_entry *entry,
>> >> >> + ktrace_probe_func *probefunc)
>> >> >> +{
>> >> >> + return update_ktrace(entry, probefunc, 1);
>> >> >> +}
>> >> >> +
>> >> >> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
>> >> >> +{
>> >> >> + return update_ktrace(entry, NULL, 0);
>> >> >> +}
>> >> >> +
>> >> >> +static int ktrace_control_open(struct inode *inode, struct file *filp)
>> >> >> +{
>> >> >> + filp->private_data = inode->i_private;
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
>> >> >> + size_t cnt, loff_t *ppos)
>> >> >> +{
>> >> >> + int ret;
>> >> >> + char buf[64];
>> >> >> + struct ktrace_entry *entry = filp->private_data;
>> >> >> +
>> >> >> + ret = snprintf(buf, 64, "%u\n", entry->enabled);
>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >> >> +}
>> >> >> +
>> >> >> +/*
>> >> >> + * Check whether the event contains string arguments. Return 0 if not;
>> >> >> + * otherwise record the offsets of the string arguments in the ktrace
>> >> >> + * structure according to the event format and return 1.
>> >> >> + * The offset info is later used to store the string arguments into
>> >> >> + * the trace buffer during trace probe.
>> >> >> + */
>> >> >> +static int contain_string_arguments(struct ktrace_entry *entry)
>> >> >> +{
>> >> >> + int count = 0;
>> >> >> + char *fmt;
>> >> >> + int qualifier; /* 'h', 'l', or 'L' for integer fields */
>> >> >> + size_t offset = 0;
>> >> >> +
>> >> >> + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
>> >> >> + count++;
>> >> >> + if (!count)
>> >> >> + return 0;
>> >> >> + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
>> >> >> + if (!entry->stroff)
>> >> >> + return -ENOMEM;
>> >> >> +
>> >> >> + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
>> >> >> + if (*fmt != '%')
>> >> >> + continue;
>> >> >> +repeat:
>> >> >> + ++fmt;
>> >> >> + switch (*fmt) {
>> >> >> + case '-':
>> >> >> + case '+':
>> >> >> + case ' ':
>> >> >> + case '#':
>> >> >> + case '0':
>> >> >> + ++fmt;
>> >> >> + goto repeat;
>> >> >> + }
>> >> >> +
>> >> >> + while (isdigit(*fmt))
>> >> >> + fmt++;
>> >> >> +
>> >> >> + /* get the conversion qualifier */
>> >> >> + qualifier = -1;
>> >> >> + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
>> >> >> + qualifier = *fmt;
>> >> >> + ++fmt;
>> >> >> + if (qualifier == 'l' && *fmt == 'l') {
>> >> >> + qualifier = 'L';
>> >> >> + ++fmt;
>> >> >> + }
>> >> >> + }
>> >> >> +
>> >> >> + switch (*fmt) {
>> >> >> + case 'c':
>> >> >> + offset += sizeof(char);
>> >> >> + continue;
>> >> >> + case 's':
>> >> >> + entry->stroff[count] = offset;
>> >> >> + count++;
>> >> >> + offset += sizeof(char *);
>> >> >> + continue;
>> >> >> + case 'p':
>> >> >> + offset += sizeof(void *);
>> >> >> + continue;
>> >> >> + case 'd':
>> >> >> + case 'i':
>> >> >> + case 'o':
>> >> >> + case 'u':
>> >> >> + case 'x':
>> >> >> + case 'X':
>> >> >> + switch (qualifier) {
>> >> >> + case 'L':
>> >> >> + offset += sizeof(long long);
>> >> >> + break;
>> >> >> + case 'l':
>> >> >> + offset += sizeof(long);
>> >> >> + break;
>> >> >> + case 'h':
>> >> >> + offset += sizeof(short);
>> >> >> + break;
>> >> >> + default:
>> >> >> + offset += sizeof(int);
>> >> >> + break;
>> >> >> + }
>> >> >> + break;
>> >> >> + default:
>> >> >> + printk(KERN_WARNING "unknown format '%c'", *fmt);
>> >> >> + continue;
>> >> >> + }
>> >> >> + }
>> >> >> + entry->stroff[count] = -1;
>> >> >> + return 1;
>> >> >> +}
>> >> >> +
>> >> >> +static void __ktrace_control_write(struct file *filp,
>> >> >> + unsigned long val, size_t *cnt)
>> >> >> +{
>> >> >> + struct ktrace_entry *entry = filp->private_data;
>> >> >> + int ret;
>> >> >> +
>> >> >> + mutex_lock(&ktrace_mutex);
>> >> >> + if (val ^ entry->enabled) {
>> >> >> + if (val) {
>> >> >> + ret = contain_string_arguments(entry);
>> >> >> + if (ret == 0)
>> >> >> + ktrace_probe_register(entry, ring_buffer_probe);
>> >> >> + else if (ret > 0)
>> >> >> + ktrace_probe_register(entry, string_probe);
>> >> >> + else
>> >> >> + *cnt = ret;
>> >> >> + } else
>> >> >> + ktrace_probe_unregister(entry);
>> >> >> + }
>> >> >> + mutex_unlock(&ktrace_mutex);
>> >> >> +}
>> >> >> +
>> >> >> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
>> >> >> + size_t cnt, loff_t *ppos)
>> >> >> +{
>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >> >> + __ktrace_control_write);
>> >> >> +}
>> >> >> +
>> >> >> +static const struct file_operations ktracecontrol_file_operations = {
>> >> >> + .open = ktrace_control_open,
>> >> >> + .read = ktrace_control_read,
>> >> >> + .write = ktrace_control_write,
>> >> >> +};
>> >> >> +
>> >> >> +/* functions for adding/removing a trace event. Protected by mutex lock.
>> >> >> + * Called during initialization or after loading a module */
>> >> >> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
>> >> >> +{
>> >> >> + struct ktrace_entry *entry;
>> >> >> + size_t name_len = strlen(ktrace->name) + 1;
>> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>> >> >> + struct hlist_head *head;
>> >> >> + struct hlist_node *node;
>> >> >> + struct dentry *dentry;
>> >> >> +
>> >> >> + if (!tracedir)
>> >> >> + return 0;
>> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> >> >> + hlist_for_each_entry(entry, node, head, hlist) {
>> >> >> + if (!strcmp(ktrace->name, entry->name)) {
>> >> >> + if (strcmp(ktrace->format, entry->format)) {
>> >> >> + printk(KERN_WARNING "the format of tracepoint "
>> >> >> + "\'%s\' changes from \'%s\' to \'%s\'."
>> >> >> + "Dynamic changing of trace format is "
>> >> >> + "not supported yet!\n",
>> >> >> + ktrace->name, entry->format,
>> >> >> + ktrace->format);
>> >> >> + return ERR_PTR(-EINVAL);
>> >> >> + }
>> >> >> + ktrace->enabled = entry->enabled;
>> >> >> + ktrace->func = entry->func;
>> >> >> + ktrace->event_id = entry->event_id;
>> >> >> + if (list_empty(&entry->klist))
>> >> >> + goto add_head;
>> >> >> + list_add_tail(&ktrace->list, &entry->klist);
>> >> >> + return entry;
>> >> >> + }
>> >> >> + }
>> >> >> + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
>> >> >> + if (!entry)
>> >> >> + return ERR_PTR(-ENOMEM);
>> >> >> + entry->name = kmalloc(name_len, GFP_KERNEL);
>> >> >> + if (!entry->name) {
>> >> >> + kfree(entry);
>> >> >> + return ERR_PTR(-ENOMEM);
>> >> >> + }
>> >> >> + memcpy(entry->name, ktrace->name, name_len);
>> >> >> + if (ktrace->format) {
>> >> >> + size_t format_len = strlen(ktrace->format) + 1;
>> >> >> + entry->format = kmalloc(format_len, GFP_KERNEL);
>> >> >> + if (!entry->format) {
>> >> >> + kfree(entry->name);
>> >> >> + kfree(entry);
>> >> >> + return ERR_PTR(-ENOMEM);
>> >> >> + }
>> >> >> + memcpy(entry->format, ktrace->format, format_len);
>> >> >> + } else
>> >> >> + entry->format = NULL;
>> >> >> + entry->func = ktrace->func;
>> >> >> + entry->enabled = 0;
>> >> >> + ktrace->event_id = entry->event_id = ktrace_next_id++;
>> >> >> + entry->stroff = NULL;
>> >> >> + INIT_LIST_HEAD(&entry->klist);
>> >> >> + hlist_add_head(&entry->hlist, head);
>> >> >> +add_head:
>> >> >> + list_add(&ktrace->list, &entry->klist);
>> >> >> + dentry = debugfs_create_file(entry->name, 0660, eventdir,
>> >> >> + entry, &ktracecontrol_file_operations);
>> >> >> + if (!dentry)
>> >> >> + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
>> >> >> + entry->name);
>> >> >> + return entry;
>> >> >> +}
>> >> >> +
>> >> >> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
>> >> >> +{
>> >> >> + struct ktrace_entry *entry;
>> >> >> + size_t name_len = strlen(ktrace->name) + 1;
>> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>> >> >> + struct hlist_head *head;
>> >> >> + struct hlist_node *node, *temp;
>> >> >> + struct dentry *dentry;
>> >> >> +
>> >> >> + if (!tracedir)
>> >> >> + return 0;
>> >> >> + list_del(&ktrace->list);
>> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> >> >> + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
>> >> >> + if (!strcmp(ktrace->name, entry->name)) {
>> >> >> + if (list_empty(&entry->klist)) {
>> >> >> + dentry = lookup_one_len(entry->name,
>> >> >> + eventdir, strlen(entry->name));
>> >> >> + if (dentry && !IS_ERR(dentry)) {
>> >> >> + debugfs_remove(dentry);
>> >> >> + dput(dentry);
>> >> >> + }
>> >> >> + entry->enabled = 0;
>> >> >> + if (free) {
>> >> >> + hlist_del(&entry->hlist);
>> >> >> + kfree(entry->name);
>> >> >> + kfree(entry->format);
>> >> >> + kfree(entry->stroff);
>> >> >> + kfree(entry);
>> >> >> + }
>> >> >> + }
>> >> >> + return 0;
>> >> >> + }
>> >> >> + return -1;
>> >> >> +}
>> >> >> +
>> >> >> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
>> >> >> + * Called at the end of module load/unload. */
>> >> >> +static int ktrace_module_notify(struct notifier_block *self,
>> >> >> + unsigned long val, void *data)
>> >> >> +{
>> >> >> +#ifdef CONFIG_MODULES
>> >> >> + struct kernel_trace *iter;
>> >> >> + struct module *mod = data;
>> >> >> +
>> >> >> + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
>> >> >> + return 0;
>> >> >> + mutex_lock(&ktrace_mutex);
>> >> >> + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
>> >> >> + if (val == MODULE_STATE_COMING)
>> >> >> + add_ktrace(iter);
>> >> >> + else
>> >> >> + remove_ktrace(iter, 0);
>> >> >> + mutex_unlock(&ktrace_mutex);
>> >> >> +#endif
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static struct notifier_block ktrace_module_nb = {
>> >> >> + .notifier_call = ktrace_module_notify,
>> >> >> +};
>> >> >> +
>> >> >> +/* functions for user-space programs to read data from tracing buffer */
>> >> >> +static int trace_data_open(struct inode *inode, struct file *filp)
>> >> >> +{
>> >> >> + filp->private_data = inode->i_private;
>> >> >> + kref_get(&ktrace_kref);
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static void release_trace_buffer(struct kref *kref)
>> >> >> +{
>> >> >> + ring_buffer_free(trace_buffer);
>> >> >> + trace_buffer = NULL;
>> >> >> +}
>> >> >> +
>> >> >> +static int trace_data_release(struct inode *inode, struct file *filp)
>> >> >> +{
>> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
>> >> >> +{
>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >> >> + if (filp->f_mode & FMODE_READ) {
>> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> >> >> + return POLLIN | POLLRDNORM;
>> >> >> + poll_wait(filp, &trace_wait, poll_table);
>> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> >> >> + return POLLIN | POLLRDNORM;
>> >> >> + }
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> >> >> +{
>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
>> >> >> + pgoff_t pgoff = vmf->pgoff;
>> >> >> + struct page *page;
>> >> >> +
>> >> >> + if (!trace_buffer)
>> >> >> + return VM_FAULT_OOM;
>> >> >> + page = ring_buffer_get_page(cpu_buffer, pgoff);
>> >> >> + if (page == NULL)
>> >> >> + return VM_FAULT_SIGBUS;
>> >> >> + get_page(page);
>> >> >> + vmf->page = page;
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static struct vm_operations_struct trace_data_mmap_ops = {
>> >> >> + .fault = trace_buf_fault,
>> >> >> +};
>> >> >> +
>> >> >> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
>> >> >> +{
>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >> >> +
>> >> >> + vma->vm_ops = &trace_data_mmap_ops;
>> >> >> + vma->vm_flags |= VM_DONTEXPAND;
>> >> >> + vma->vm_private_data = cpu_buffer;
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
>> >> >> + unsigned int cmd, unsigned long arg)
>> >> >> +{
>> >> >> + u32 __user *argp = (u32 __user *)arg;
>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >> >> +
>> >> >> + if (!trace_buffer || !tracedir)
>> >> >> + return -EINVAL;
>> >> >> + switch (cmd) {
>> >> >> + case KTRACE_BUF_GET_SIZE:
>> >> >> + {
>> >> >> + unsigned long bufsize;
>> >> >> + bufsize = ring_buffer_size(trace_buffer);
>> >> >> + return put_user((u32)bufsize, argp);
>> >> >> + }
>> >> >> + case KTRACE_BUF_GET_PRODUCED:
>> >> >> + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
>> >> >> + case KTRACE_BUF_GET_CONSUMED:
>> >> >> + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
>> >> >> + case KTRACE_BUF_PUT_CONSUMED:
>> >> >> + {
>> >> >> + u32 consumed, consumed_old;
>> >> >> + int ret;
>> >> >> +
>> >> >> + ret = get_user(consumed, argp);
>> >> >> + if (ret) {
>> >> >> + printk(KERN_WARNING
>> >> >> + "error getting consumed value: %d\n", ret);
>> >> >> + return ret;
>> >> >> + }
>> >> >> + consumed_old = ring_buffer_get_consumed(cpu_buffer);
>> >> >> + if (consumed == consumed_old)
>> >> >> + return 0;
>> >> >> + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
>> >> >> + return 0;
>> >> >> + }
>> >> >> + default:
>> >> >> + return -ENOIOCTLCMD;
>> >> >> + }
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static const struct file_operations tracedata_file_operations = {
>> >> >> + .open = trace_data_open,
>> >> >> + .poll = trace_data_poll,
>> >> >> + .mmap = trace_data_mmap,
>> >> >> + .ioctl = trace_data_ioctl,
>> >> >> + .release = trace_data_release,
>> >> >> +};
>> >> >> +
>> >> >> +/* function to report the number of lost events due to buffer overflow */
>> >> >> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
>> >> >> + size_t cnt, loff_t *ppos)
>> >> >> +{
>> >> >> + int ret;
>> >> >> + char buf[64];
>> >> >> + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >> >> +}
>> >> >> +
>> >> >> +static const struct file_operations traceoverflow_file_operations = {
>> >> >> + .read = trace_overflow_read,
>> >> >> +};
>> >> >> +
>> >> >> +/* function to re-initialize trace buffer */
>> >> >> +static void __trace_buffer_reset(struct file *filp,
>> >> >> + unsigned long val, size_t *cnt)
>> >> >> +{
>> >> >> + if (val && trace_buffer)
>> >> >> + ring_buffer_reset(trace_buffer);
>> >> >> +}
>> >> >> +
>> >> >> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
>> >> >> + size_t cnt, loff_t *ppos)
>> >> >> +{
>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >> >> + __trace_buffer_reset);
>> >> >> +}
>> >> >> +
>> >> >> +static const struct file_operations tracereset_file_operations = {
>> >> >> + .write = trace_buffer_reset,
>> >> >> +};
>> >> >> +
>> >> >> +/*
>> >> >> + * We use debugfs for kernel-user communication. All of the control/info
>> >> >> + * files are under debugfs/ktrace directory.
>> >> >> + */
>> >> >> +static int create_debugfs(void)
>> >> >> +{
>> >> >> + struct dentry *entry, *bufdir;
>> >> >> + struct ring_buffer_per_cpu *cpu_buffer;
>> >> >> + char *tmpname;
>> >> >> + int cpu;
>> >> >> +
>> >> >> + tracedir = debugfs_create_dir("ktrace", NULL);
>> >> >> + if (!tracedir) {
>> >> >> + printk(KERN_WARNING "Couldn't create debugfs directory\n");
>> >> >> + return -ENOMEM;
>> >> >> + }
>> >> >> +
>> >> >> + /* 'buffers' is the directory that holds trace buffer debugfs files */
>> >> >> + bufdir = debugfs_create_dir("buffers", tracedir);
>> >> >> + if (!bufdir) {
>> >> >> + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
>> >> >> + return -ENOMEM;
>> >> >> + }
>> >> >> +
>> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> >> >> + if (!tmpname)
>> >> >> + return -ENOMEM;
>> >> >> + /* create a debugfs file for each cpu buffer */
>> >> >> + for_each_possible_cpu(cpu) {
>> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> >> >> + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
>> >> >> + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
>> >> >> + &tracedata_file_operations);
>> >> >> + if (!entry) {
>> >> >> + printk(KERN_WARNING
>> >> >> + "Couldn't create debugfs \'%s\' entry\n",
>> >> >> + tmpname);
>> >> >> + return -ENOMEM;
>> >> >> + }
>> >> >> + }
>> >> >> + kfree(tmpname);
>> >> >> +
>> >> >> + /* the control file for users to enable/disable global tracing. */
>> >> >> + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
>> >> >> + &tracecontrol_file_operations);
>> >> >> + if (!entry) {
>> >> >> + printk(KERN_WARNING
>> >> >> + "Couldn't create debugfs 'enabled' entry\n");
>> >> >> + return -ENOMEM;
>> >> >> + }
>> >> >> +
>> >> >> + /*
>> >> >> + * the debugfs file that displays the name, format etc. of every
>> >> >> + * supported trace event. The file is to be used by the user-space
>> >> >> + * trace parser to analyze the collected trace data.
>> >> >> + */
>> >> >> + entry = debugfs_create_file("info", 0444, tracedir, NULL,
>> >> >> + &traceinfo_file_operations);
>> >> >> + if (!entry) {
>> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
>> >> >> + return -ENOMEM;
>> >> >> + }
>> >> >> +
>> >> >> + /*
>> >> >> + * the debugfs file that reports the number of events
>> >> >> + * lost due to buffer overflow
>> >> >> + */
>> >> >> + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
>> >> >> + &traceoverflow_file_operations);
>> >> >> + if (!entry) {
>> >> >> + printk(KERN_WARNING
>> >> >> + "Couldn't create debugfs 'overflow' entry\n");
>> >> >> + return -ENOMEM;
>> >> >> + }
>> >> >> +
>> >> >> + /* the debugfs file that resets trace buffer upon write */
>> >> >> + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
>> >> >> + &tracereset_file_operations);
>> >> >> + if (!entry) {
>> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
>> >> >> + return -ENOMEM;
>> >> >> + }
>> >> >> +
>> >> >> + /* create the directory that holds the control files for every event */
>> >> >> + eventdir = debugfs_create_dir("events", tracedir);
>> >> >> + if (!eventdir) {
>> >> >> + printk(KERN_WARNING "Couldn't create 'events' directory\n");
>> >> >> + return -ENOMEM;
>> >> >> + }
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static void remove_debugfs_file(struct dentry *dir, char *name)
>> >> >> +{
>> >> >> + struct dentry *dentry;
>> >> >> + dentry = lookup_one_len(name, dir, strlen(name));
>> >> >> + if (dentry && !IS_ERR(dentry)) {
>> >> >> + debugfs_remove(dentry);
>> >> >> + dput(dentry);
>> >> >> + }
>> >> >> +}
>> >> >> +
>> >> >> +static int remove_debugfs(void)
>> >> >> +{
>> >> >> + struct dentry *bufdir;
>> >> >> + char *tmpname;
>> >> >> + int cpu;
>> >> >> +
>> >> >> + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
>> >> >> + if (!bufdir || IS_ERR(bufdir))
>> >> >> + return -EIO;
>> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> >> >> + if (!tmpname)
>> >> >> + return -ENOMEM;
>> >> >> + for_each_possible_cpu(cpu) {
>> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> >> >> + remove_debugfs_file(bufdir, tmpname);
>> >> >> + }
>> >> >> + kfree(tmpname);
>> >> >> + debugfs_remove(bufdir);
>> >> >> +
>> >> >> + remove_debugfs_file(tracedir, "enabled");
>> >> >> + remove_debugfs_file(tracedir, "info");
>> >> >> + debugfs_remove(eventdir);
>> >> >> + debugfs_remove(tracedir);
>> >> >> + return 0;
>> >> >> +}
>> >> >> +
>> >> >> +static int __init init_ktrace(void)
>> >> >> +{
>> >> >> + struct kernel_trace *iter;
>> >> >> + int i, err;
>> >> >> +
>> >> >> + ktrace_next_id = 0;
>> >> >> + kref_set(&ktrace_kref, 1);
>> >> >> + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
>> >> >> + if (trace_buffer == NULL)
>> >> >> + return -ENOMEM;
>> >> >> + setup_timer(&trace_timer, wakeup_readers, 0);
>> >> >> + for (i = 0; i < KTRACE_HASH_SIZE; i++)
>> >> >> + INIT_HLIST_HEAD(&ktrace_table[i]);
>> >> >> +
>> >> >> + mutex_lock(&ktrace_mutex);
>> >> >> + err = create_debugfs();
>> >> >> + if (err != 0)
>> >> >> + goto out;
>> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> >> >> + add_ktrace(iter);
>> >> >> + err = register_module_notifier(&ktrace_module_nb);
>> >> >> +out:
>> >> >> + mutex_unlock(&ktrace_mutex);
>> >> >> + return err;
>> >> >> +}
>> >> >> +module_init(init_ktrace);
>> >> >> +
>> >> >> +static void __exit exit_ktrace(void)
>> >> >> +{
>> >> >> + struct kernel_trace *iter;
>> >> >> +
>> >> >> + if (ktrace_enabled) {
>> >> >> + ktrace_enabled = 0;
>> >> >> + del_timer_sync(&trace_timer);
>> >> >> + }
>> >> >> + mutex_lock(&ktrace_mutex);
>> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> >> >> + remove_ktrace(iter, 1);
>> >> >> + unregister_module_notifier(&ktrace_module_nb);
>> >> >> + remove_debugfs();
>> >> >> + mutex_unlock(&ktrace_mutex);
>> >> >> +
>> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
>> >> >> +}
>> >> >
>> >> > --
>> >> > Mathieu Desnoyers
>> >> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>> >> >
>> >
>> > --
>> > Mathieu Desnoyers
>> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>> >
>>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
Hi Mathieu,
I ran tbench tests with the latest lttng patches (version 0.111) on the
2.6.29-rc8 kernel. I used the same set of trace events in my tests:
syscall entry/exit, irq entry/exit, trap entry/exit, sched_schedule,
process_fork, process_wait, process_free, process_exit.
Here are the results I got:
vanilla kernel: Throughput 673.269 MB/sec 4 procs
lttng normal mode: Throughput 462.786 MB/sec 4 procs (31.252% slowdown)
lttng flight-recorder: Throughput 484.547 MB/sec 4 procs (28.03% slowdown)
ktrace: Throughput 569.508 MB/sec 4 procs (15.411% slowdown)
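(Slowdown here is computed as (vanilla - traced) / vanilla; e.g. for ktrace,
(673.269 - 569.508) / 673.269 gives about 15.4%.)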
The kernel LTTng configurations I used are:
CONFIG_LTT=y
CONFIG_LTT_FILTER=y
CONFIG_HAVE_LTT_DUMP_TABLES=y
CONFIG_LTT_RELAY_ALLOC=y
CONFIG_LTT_RELAY_LOCKLESS=y
# CONFIG_LTT_RELAY_LOCKED is not set
CONFIG_LTT_SERIALIZE=y
CONFIG_LTT_FAST_SERIALIZE=y
CONFIG_LTT_TRACEPROBES=y
CONFIG_LTT_TRACE_CONTROL=y
CONFIG_LTT_TRACER=y
# CONFIG_LTT_ALIGNMENT is not set
CONFIG_LTT_CHECK_ARCH_EFFICIENT_UNALIGNED_ACCESS=y
# CONFIG_LTT_DEBUG_EVENT_SIZE is not set
CONFIG_LTT_USERSPACE_EVENT=m
CONFIG_LTT_VMCORE=y
CONFIG_LTT_KPROBES=y
CONFIG_LTT_STATEDUMP=m
# CONFIG_LTT_ASCII is not set
CONFIG_HAVE_IMMEDIATE=y
CONFIG_IMMEDIATE=y
The LTTng performance seems to have improved considerably compared with the
results I measured before. But when we use the ktrace results as a comparison
point, I think there is still significant room for further optimization.
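To make the comparison concrete, here is roughly what an instrumentation site
costs in ktrace, based on the DO_TRACE macro from the patch I posted earlier
in this thread. This is only an illustrative sketch; the event name and its
fields are made up:

#include <linux/ktrace.h>

/* argument layout for the event; declares struct ktrace_struct_read_enter */
DEFINE_KTRACE_STRUCT(read_enter) {
        int fd;
        unsigned long count;
};

static inline void trace_read_enter(int fd, unsigned long count)
{
        /*
         * At run time, DO_TRACE() is two unlikely() flag tests, one on
         * the global ktrace_enabled and one on the per-event enable bit.
         * Only when both are set are the arguments packed into the
         * struct above and handed to the registered probe as a pointer
         * plus a size, so there is no va_arg parsing on the fast path.
         */
        DO_TRACE(read_enter, "%d %lu", fd, count);
}

When tracing is disabled, the per-site cost is just the two predicted
branches. When it is enabled, the registered probe (ring_buffer_probe for
this event, since its format has no %s) reserves sizeof(uint16_t) +
sizeof(struct ktrace_struct_read_enter) bytes in the ring buffer, writes the
16-bit event id, and copies the packed struct with one memcpy.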
One worry I have is that the LTTng patch set is getting bigger and bigger.
It is good to add more features and instrumentation, but that may make
performance optimization harder. Maybe we should take a different approach,
i.e., start with something simple and make sure it is implemented correctly
and efficiently, then extend it while making sure we don't sacrifice
performance as we add more features.
I think a good time to do this is when you port the LTTng patches to the
mainline kernel. Since the ftrace people are working on adding event tracing
to ftrace, that seems like a good opportunity as well.
Jiaying
On Mon, Mar 16, 2009 at 7:12 PM, Jiaying Zhang <[email protected]> wrote:
> On Mon, Mar 16, 2009 at 6:53 PM, Mathieu Desnoyers
> <[email protected]> wrote:
>> * Jiaying Zhang ([email protected]) wrote:
>>> On Mon, Mar 16, 2009 at 5:01 PM, Mathieu Desnoyers
>>> <[email protected]> wrote:
>>> > * Jiaying Zhang ([email protected]) wrote:
>>> >> Hi Mathieu,
>>> >>
>>> >> First, I apologize that I did not forward you my original email. I assumed
>>> >> that the email would be noticed by interested people when it was sent
>>> >> to lkml. Now I see that is a mistake.
>>> >>
>>> >
>>> > Hi Jiaying,
>>> >
>>> > Apologies accepted, no hard feeling :) I have just been surprised when I
>>> > found out about the numbers you posted.
>>> >
>>> >> On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
>>> >> <[email protected]> wrote:
>>> >> > Hi Jiaying,
>>> >> >
>>> >> > I'm very interested to hear about the details of your test-case, because
>>> >> > what you are talking about here, and the numbers that you get, are, at
>>> >> > the very least, not representative of the current LTTng. Perhaps you
>>> >> > used a version right after the kernel summit and stopped following the
>>> >> > improvements we have done since then ? I also wonder about the instrumentation
>>> >> > set you enabled in your tests, and whether it's exactly the same in the ktrace
>>> >> > and LTTng runs (especially system call tracing). Please note that following the
>>> >> > Kernel Summit 2008, a more active development cycle began, and the time was not
>>> >> > right for optimisation. But since then a lot of improvement has been done on
>>> >> > this side.
>>> >>
>>> >> Sorry that I omitted a lot of details in my test description. I did use an old
>>> >> version of LTTng: the 0.21 lttng patches for the 2.6.27 kernel. I ran my tests
>>> >> on a 2.6.26 kernel, with the lttng-2.6.27-rc6-0.21 patches ported to it.
>>> >>
>>> >> I enabled the same set of trace events in ktrace and LTTng runs. Here is
>>> >> the list of enabled events: syscall entry/exit, irq entry/exit, trap
>>> >> entry/exit,
>>> >> kernel_sched_switch, kernel_process_fork, kernel_process_wait,
>>> >> kernel_process_free, kernel_process_exit.
>>> >>
>>> >> We are glad to see the improvements coming from the LTTng community.
>>> >> I haven't got time to sync with the latest LTTng code and re-run my tests.
>>> >> I will let you know when I have such data available.
>>> >>
>>> >
>>> > On my side, with flight recorder activated, same instrumentation you
>>> > use, with LTTng 0.111 :
>>> >
>>> > arm-google script :
>>> >
>>> > #!/bin/sh
>>> >
>>> > DIR=/mnt/debugfs/ltt/markers
>>> >
>>> > for a in \
>>> > syscall_entry syscall_exit \
>>> > irq_entry irq_exit \
>>> > trap_entry trap_exit \
>>> > page_fault_entry page_fault_exit \
>>> > page_fault_nosem_entry page_fault_nosem_exit \
>>> > page_fault_get_user_entry page_fault_get_user_exit \
>>> > sched_schedule process_fork process_wait \
>>> > process_free process_exit;
>>> > do
>>> > echo 1 > ${DIR}/kernel/$a/enable
>>> > done
>>> >
>>> > running tbench -t 200 8
>>> >
>>> > Vanilla kernel : 2070 MB/sec
>>> > Flight recorder : 1693 MB/sec Performance impact : 18.2 %
>>> > Normal : 1620 MB/sec Performance impact : 21.7 % (8 lttd threads)
>>> >
>>> > So now we are comparing apples with apples :-)
>>> Looks like the performance of LTTng has improved a lot over the past
>>> few months. I will run some tests with the latest LTTng patches and
>>> let you know if I see any difference. Did you use 2.6.29-rc7 kernel in
>>> your tests?
>>>
>>
>> Yes, kernel 2.6.29-rc7, lttng 0.111. But lttng 0.112 for kernel
>> 2.6.29-rc8 should also be close enough.
>>
>>> >
>>> >> >
>>> >> > First I'd like to express my concern about the way you posted your
>>> >> > numbers :
>>> >> >
>>> >> > 1 - You did not CC anyone from the LTTng project.
>>> >> > 2 - You did not tell which LTTng version you used as a comparison basis.
>>> >> > 3 - You did not look at the current state of code before posting your
>>> >> > comparison. (as shows the comment in your patch header)
>>> >> That is indeed my mistake.
>>> >>
>>> >> > 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
>>> >> > the default instrumentation ? Did you also enable system call tracing
>>> >> > with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
>>> >> > makes a huge difference on the amount of information gathered.
>>> >> Yes. We also enable syscall tracing via set_kernel_trace_flag_all_tasks().
>>> >>
>>> >> >
>>> >> > So let's have a close look at them. I'm re-doing the benchmarks based on
>>> >> > the current LTTng version below.
>>> >> >
>>> >> >> Hello All,
>>> >> >>
>>> >> >> We have been working on building a kernel tracer for a production environment.
>>> >> >> We first considered to use the existing markers code, but found it a bit
>>> >> >> heavyweight.
>>> >> >
>>> >> > Have you looked at :
>>> >> >
>>> >> > 1 - The tracepoints currently in the kernel
>>> >> > 2 - The _current_ version of markers in the LTTng tree ?
>>> >>
>>> >> I looked at the tracepoints currently in the kernel. It is great to see that
>>> >> some trace instrumentation code is finally available in the mainline kernel.
>>> >> But I still don't like an additional tracepoint layer on top of the
>>> >> actual tracing
>>> >> (ktrace or markers or ftrace) layer. That seems to add unnecessary overhead.
>>> >> The markers code I referred to in my email is the current code in the mainline
>>> >> kernel. I had a look at the LTTng markers code but still saw that va_list
>>> >> argument is used in marker_probe_func. It sounds like a nice improvement
>>> >> if the LTTng markers avoid the use of va_arg whenever possible.
>>> >>
>>> >
>>> > Yes, va_args is only there as a convenience for slow paths.
>>> >
>>> >> >
>>> >> >> So we developed a new kernel tracing prototype that shares
>>> >> >> many similarities as markers but also includes several simplifications that
>>> >> >> give us considerable performance benefits. For example, we pass the size of
>>> >> >> trace event directly to the probe function instead of using va_arg as that
>>> >> >> used in markers, which saves us the overhead of variable passing.
>>> >> >
>>> >> > The markers in the LTTng tree only use va_args for "debug-style"
>>> >> > tracing. All the tracing fast paths are now done with custom probe
>>> >> > functions which takes the arguments from the tracepoint and write them
>>> >> > in the trace buffers directly in C. When the event payload size is already
>>> >> > known statically, there is no dynamic computation cost at all : it's
>>> >> > simply passed to the ltt-relay write function.
>>> >> But this raises another concern that sub-system developers may need
>>> >> to write a bunch of custom probe functions to avoid the argument parsing
>>> >> overhead. Can you instead have a general probe function or macro that
>>> >> does this work?
>>> >>
>>> >
>>> > Well, considering the amount of care we have to take when we consider
>>> > adding a tracepoint in the mainline kernel, I think it's even a *good*
>>> > thing that someone has to stop and write a small callback to deal with
>>> > the information he wants to export. I think debug-style tracing should
>>> > be very, very quick and easy to add, but tracepoints meant to make it
>>> > into the mainline do not share this requirement.
>>> I agree that we need to be careful when adding tracepoints into the kernel,
>>> but I still think it is good to simplify the process of adding/maintaining
>>> tracepoints so sub-system maintainers don't feel it is a big burden.
>>>
>>
>> Yes, we have to make sure it's not an impossible task.
>>
>>> >
>>> >> >
>>> >> >> Also, we
>>> >> >> associate a single probe function to each event, so we can directly call the
>>> >> >> probe function when tracing is enabled.
>>> >> >
>>> >> > You seem to imply that being stuck with a single probe function is
>>> >> > somehow superior to have tiny per-event callbacks which then call into a
>>> >> > buffer write function ? Why ?
>>> >> That is indeed my thinking, because we saw that some platforms are
>>> >> very sensitive to function call overhead and we haven't seen use cases
>>> >> where multiple probe functions prove necessary.
>>> >>
>>> >
>>> > Hrm, so your plan is to embed the tracing code at the caller site, or to
>>> > keep one single function call ?
>>> We do have a general probe function that is called when tracing is enabled.
>>> Other than that, the main tracing code is defined as an inline function that is
>>> embedded at the caller site.
>>>
>>
>> You mean that you embed the main tracing code directly in the instrumented
>> kernel functions ? Hrm, have you considered the call-site pollution
>> impact that might have ? The compiler might start having
>> difficulty optimising hot paths that would then become large functions.
>> Or maybe I am not correctly understanding your approach.
> Sorry. I was actually trying to say the main instrumentation code is
> embedded at the caller site. The main tracing code is in the general probe
> function.
>
> Jiaying
>
>>
>> Mathieu
>>
>>>
>>> Jiaying
>>>
>>> >
>>> >> >
>>> >> >> Our measurements show that our
>>> >> >> current tracing prototype introduces much lower overhead compared with
>>> >> >> lttng (we use lttng as a comparison because markers does not provide the
>>> >> >> implementation to record trace data).
>>> >> >
>>> >> > Which LTTng version, kernel version, LTTng config and runtime options do
>>> >> > you use ?
>>> >> See above for LTTng and kernel versions. I was using the default LTTng
>>> >> config options and 'normal' recorder mode.
>>> >>
>>> >> >
>>> >> >> E.g., with dbench, lttng introduces
>>> >> >> 20~80% overhead on our testing platforms while the overhead with our
>>> >> >> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
>>> >> >> overhead while the overhead added by our kernel tracer is usually within
>>> >> >> 15%; with kernbench (time to compile a Linux kernel), the overhead
>>> >> >> added by lttng is around 4%~18% while the overhead added by our
>>> >> >> kernel tracer is within 2.5%.
>>> >> >>
>>> >> >
>>> >> > I'm re-running the dbench and tbench tests :
>>> >> >
>>> >> > With LTTng 0.106, kernel 2.6.29-rc7
>>> >> > 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
>>> >> >
>>> >> > Arming all the default LTTng instrumentation (168 tracepoints),
>>> >> > Tracing in flight recorder mode, including system call entry/exit
>>> >> > tracing and block I/O tracing (blktrace instrumentation), I get :
>>> >> >
>>> >> > Test / config | Vanilla kernel | LTTng 0.106 | slowdown
>>> >> > dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
>>> >> >
>>> >> > As an example of the effect of custom probe vs va_args based probes,
>>> >> > let's take the tbench example
>>> >> >
>>> >> > | Vanilla | LTTng 0.106 | slowdown
>>> >> > tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
>>> >> >
>>> >> > The LTTng trace statistics shows the top high event rate events :
>>> >> >
>>> >> > Some are usual (therefore lttng already implements custom probes for them) :
>>> >> >
>>> >> > 42004 sched_try_wakeup
>>> >> > 79754 timer_set
>>> >> > 83973 sched_schedule
>>> >> > 84317 softirq_exit
>>> >> > 84318 softirq_entry
>>> >> > 95370 page_free
>>> >> > 95412 page_alloc
>>> >> > 267247 syscall_entry
>>> >> > 267247 syscall_exit
>>> >> >
>>> >> > Some are more tbench workload-specific :
>>> >> >
>>> >> > 16915 socket_sendmsg
>>> >> > 18911 napi_complete
>>> >> > 18911 napi_schedule
>>> >> > 18912 dev_receive
>>> >> > 18914 dev_xmit
>>> >> > 18920 napi_poll
>>> >> > 33831 socket_recvmsg
>>> >> > 59860 read
>>> >> >
>>> >> >
>>> >> > So let's see what happens if I implement those as custom probes and do a bit of
>>> >> > tuning in the tracer hot path :
>>> >> Could you give more details about the custom probe you used?
>>> >
>>> > They are available in the lttng tree on git.kernel.org in
>>> > ltt/probes/*-trace.c. Their API is in include/linux/ltt-type-serialize.h
>>> >
>>> >> Do you need to write a custom probe for each tracepoint?
>>> >>
>>> >
>>> > Yes. But it's not such a big deal. It looks a bit like Steve's
>>> > TRACE_EVENT macro-fu, but written in a simple C callback :
>>> >
>>> > void probe_sched_switch(struct rq *rq, struct task_struct *prev,
>>> > struct task_struct *next);
>>> >
>>> > DEFINE_MARKER_TP(kernel, sched_schedule, sched_switch, probe_sched_switch,
>>> > "prev_pid %d next_pid %d prev_state #2d%ld");
>>> >
>>> > notrace void probe_sched_switch(struct rq *rq, struct task_struct *prev,
>>> > struct task_struct *next)
>>> > {
>>> > struct marker *marker;
>>> > struct serialize_int_int_short data;
>>> >
>>> > data.f1 = prev->pid;
>>> > data.f2 = next->pid;
>>> > data.f3 = prev->state;
>>> >
>>> > marker = &GET_MARKER(kernel, sched_schedule);
>>> > ltt_specialized_trace(marker, marker->single.probe_private,
>>> > &data, serialize_sizeof(data), sizeof(int));
>>> > }
>>> >
>>> > Best regards,
>>> >
>>> > Mathieu
>>> >
>>> >> Jiaying
>>> >>
>>> >> >
>>> >> > | Vanilla | LTTng 0.107 | slowdown
>>> >> > Custom probes :
>>> >> > tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
>>> >> >
>>> >> > With tracer hot path tuning (inline hot path, function calls for slow path, will
>>> >> > be in 0.108) :
>>> >> > tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
>>> >> >
>>> >> > kernbench :
>>> >> > Average Optimal load -j 32 Run
>>> >> >
>>> >> > Vanilla 2.6.29-rc7 :
>>> >> > Elapsed Time 48.238 (0.117346)
>>> >> >
>>> >> > With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
>>> >> > Elapsed Time 50.226 (0.0750333)
>>> >> >
>>> >> > Slowdown : 4.1 %
>>> >> >
>>> >> > So except for the tbench workload on LTTng 0.106, which still had a 52% performance
>>> >> > impact (brought down to 35% for the next LTTng release) due to the
>>> >> > specific workload and the lack of specialized probes, the
>>> >> > dbench impact is well below your results (12.3% vs 20-80%).
>>> >> >
>>> >> > So given the huge result discrepancies between your tests and mine, could you
>>> >> > give us more detail about your benchmarks ? Especially knowing the enabled
>>> >> > instrumentation set, and the amount of MB/s generated by both ktrace and LTTng
>>> >> > runs (to make sure we are comparing the same traffic) and the LTTng versions you
>>> >> > are using as a comparison basis would be good.
>>> >> >
>>> >> > Thanks,
>>> >> >
>>> >> > Mathieu
>>> >> >
>>> >> >
>>> >> >> The patch below is the current prototype we have implemented. It uses
>>> >> >> Steve's unified trace buffer code with some small extensions to support
>>> >> >> buffer mmap. The patch is probably still buggy, but we would like to get
>>> >> >> some early feedback from the community.
>>> >> >>
>>> >> >> Signed-off-by: Jiaying Zhang <[email protected]>
>>> >> >>
>>> >> >> Index: git-linux/init/Kconfig
>>> >> >> ===================================================================
>>> >> >> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
>>> >> >> +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
>>> >> >> @@ -950,6 +950,15 @@
>>> >> >> Place an empty function call at each marker site. Can be
>>> >> >> dynamically changed for a probe function.
>>> >> >>
>>> >> >> +config KTRACE
>>> >> >> + bool "Enable ktrace support"
>>> >> >> + select DEBUG_FS
>>> >> >> + select TRACING
>>> >> >> + select RING_BUFFER
>>> >> >> + help
>>> >> >> + Ktrace is a kernel tracing tool that allows you to trace
>>> >> >> + kernel events by inserting trace points at proper places.
>>> >> >> +
>>> >> >> source "arch/Kconfig"
>>> >> >>
>>> >> >> endmenu # General setup
>>> >> >> Index: git-linux/kernel/Makefile
>>> >> >> ===================================================================
>>> >> >> --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
>>> >> >> +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
>>> >> >> @@ -86,6 +86,7 @@
>>> >> >> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
>>> >> >> obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
>>> >> >> obj-$(CONFIG_MARKERS) += marker.o
>>> >> >> +obj-$(CONFIG_KTRACE) += ktrace.o
>>> >> >> obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
>>> >> >> obj-$(CONFIG_LATENCYTOP) += latencytop.o
>>> >> >> obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
>>> >> >> Index: git-linux/include/asm-generic/vmlinux.lds.h
>>> >> >> ===================================================================
>>> >> >> --- git-linux.orig/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:37.000000000 -0800
>>> >> >> +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:39.000000000 -0800
>>> >> >> @@ -73,6 +73,9 @@
>>> >> >> MEM_KEEP(init.data) \
>>> >> >> MEM_KEEP(exit.data) \
>>> >> >> . = ALIGN(8); \
>>> >> >> + VMLINUX_SYMBOL(__start___ktraces) = .; \
>>> >> >> + *(__ktrace) \
>>> >> >> + VMLINUX_SYMBOL(__stop___ktraces) = .; \
>>> >> >> VMLINUX_SYMBOL(__start___markers) = .; \
>>> >> >> *(__markers) \
>>> >> >> VMLINUX_SYMBOL(__stop___markers) = .; \
>>> >> >> @@ -89,6 +92,7 @@
>>> >> >> VMLINUX_SYMBOL(__start_rodata) = .; \
>>> >> >> *(.rodata) *(.rodata.*) \
>>> >> >> *(__vermagic) /* Kernel version magic */ \
>>> >> >> + *(__ktrace_strings) /* Ktrace: strings */ \
>>> >> >> *(__markers_strings) /* Markers: strings */ \
>>> >> >> *(__tracepoints_strings)/* Tracepoints: strings */ \
>>> >> >> } \
>>> >> >> Index: git-linux/kernel/module.c
>>> >> >> ===================================================================
>>> >> >> --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
>>> >> >> +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
>>> >> >> @@ -50,6 +50,7 @@
>>> >> >> #include <asm/sections.h>
>>> >> >> #include <linux/tracepoint.h>
>>> >> >> #include <linux/ftrace.h>
>>> >> >> +#include <linux/ktrace.h>
>>> >> >> #include <linux/async.h>
>>> >> >>
>>> >> >> #if 0
>>> >> >> @@ -2145,6 +2146,10 @@
>>> >> >> sizeof(*mod->tracepoints),
>>> >> >> &mod->num_tracepoints);
>>> >> >> #endif
>>> >> >> +#ifdef CONFIG_KTRACE
>>> >> >> + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
>>> >> >> + sizeof(*mod->ktrace), &mod->num_ktrace);
>>> >> >> +#endif
>>> >> >>
>>> >> >> #ifdef CONFIG_MODVERSIONS
>>> >> >> if ((mod->num_syms && !mod->crcs)
>>> >> >> Index: git-linux/include/linux/module.h
>>> >> >> ===================================================================
>>> >> >> --- git-linux.orig/include/linux/module.h 2009-02-19 14:58:37.000000000 -0800
>>> >> >> +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
>>> >> >> @@ -44,6 +44,7 @@
>>> >> >> };
>>> >> >>
>>> >> >> struct module;
>>> >> >> +struct kernel_trace;
>>> >> >>
>>> >> >> struct module_attribute {
>>> >> >> struct attribute attr;
>>> >> >> @@ -347,6 +348,11 @@
>>> >> >> /* Reference counts */
>>> >> >> struct module_ref ref[NR_CPUS];
>>> >> >> #endif
>>> >> >> +#ifdef CONFIG_KTRACE
>>> >> >> + struct kernel_trace *ktrace;
>>> >> >> + unsigned int num_ktrace;
>>> >> >> +#endif
>>> >> >> +
>>> >> >> };
>>> >> >> #ifndef MODULE_ARCH_INIT
>>> >> >> #define MODULE_ARCH_INIT {}
>>> >> >> Index: git-linux/include/linux/ktrace.h
>>> >> >> ===================================================================
>>> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>>> >> >> +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
>>> >> >> @@ -0,0 +1,83 @@
>>> >> >> +#ifndef _LINUX_KTRACE_H
>>> >> >> +#define _LINUX_KTRACE_H
>>> >> >> +
>>> >> >> +#include <linux/types.h>
>>> >> >> +
>>> >> >> +struct kernel_trace;
>>> >> >> +
>>> >> >> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
>>> >> >> +
>>> >> >> +struct kernel_trace {
>>> >> >> + const char *name;
>>> >> >> + const char *format; /* format string describing variable list */
>>> >> >> + size_t *stroff; /* offsets of string variables */
>>> >> >> + /* the 31-bit event_id is truncated to 16 bits when written to the buffer */
>>> >> >> + u32 enabled:1, event_id:31;
>>> >> >> + ktrace_probe_func *func; /* probe function */
>>> >> >> + struct list_head list; /* list head linked to the hash table entry */
>>> >> >> +} __attribute__((aligned(8)));
>>> >> >> +
>>> >> >> +extern int ktrace_enabled;
>>> >> >> +
>>> >> >> +/*
>>> >> >> + * Make sure the alignment of the structure in the __ktrace section will
>>> >> >> + * not add unwanted padding between the beginning of the section and the
>>> >> >> + * structure. Force alignment to the same alignment as the section start.
>>> >> >> + */
>>> >> >> +
>>> >> >> +#define DEFINE_KTRACE_STRUCT(name) \
>>> >> >> + struct __attribute__((packed)) ktrace_struct_##name
>>> >> >> +
>>> >> >> +#ifdef CONFIG_KTRACE
>>> >> >> +/*
>>> >> >> + * DO_TRACE is the macro that is called from each trace point. Each trace
>>> >> >> + * point is associated with a static ktrace object that is located in a
>>> >> >> + * special __ktrace section and is uniquely identified by the event name and
>>> >> >> + * event format. Take a look at the files under include/trace for example
>>> >> >> + * usage. Note that we call the trace probe function only if the global
>>> >> >> + * tracing is enabled and the tracing for the associated event is enabled,
>>> >> >> + * so we only introduce the overhead of a predicted condition judge when
>>> >> >> + * tracing is disabled.
>>> >> >> + */
>>> >> >> +#define DO_TRACE(name, format, args...) \
>>> >> >> + do { \
>>> >> >> + static const char __kstrtab_##name[] \
>>> >> >> + __attribute__((section("__ktrace_strings"))) \
>>> >> >> + = #name "\0" format; \
>>> >> >> + static struct kernel_trace __ktrace_##name \
>>> >> >> + __attribute__((section("__ktrace"), aligned(8))) = \
>>> >> >> + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
>>> >> >> + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
>>> >> >> + __ktrace_check_format(format, ## args); \
>>> >> >> + if (unlikely(ktrace_enabled) && \
>>> >> >> + unlikely(__ktrace_##name.enabled)) { \
>>> >> >> + struct ktrace_struct_##name karg = { args }; \
>>> >> >> + (*__ktrace_##name.func) \
>>> >> >> + (&__ktrace_##name, &karg, sizeof(karg)); \
>>> >> >> + } \
>>> >> >> + } while (0)
>>> >> >> +
>>> >> >> +#else /* !CONFIG_KTRACE */
>>> >> >> +#define DO_TRACE(name, format, args...)
>>> >> >> +#endif
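To make the expansion concrete: on a hypothetical event declared as
DEFINE_KTRACE_STRUCT(fork) { pid_t pid; };, a call site
DO_TRACE(fork, "pid %d", p->pid) would expand to approximately this
(hand-expanded sketch, section attributes elided):

	do {
		static const char __kstrtab_fork[]	/* __ktrace_strings */
			= "fork\0pid %d";		/* name, then format */
		static struct kernel_trace __ktrace_fork	/* __ktrace */
			= { __kstrtab_fork, &__kstrtab_fork[sizeof("fork")],
			    NULL, 0, 0, NULL,
			    LIST_HEAD_INIT(__ktrace_fork.list) };
		if (unlikely(ktrace_enabled) &&
		    unlikely(__ktrace_fork.enabled)) {
			struct ktrace_struct_fork karg = { p->pid };
			(*__ktrace_fork.func)(&__ktrace_fork, &karg,
					      sizeof(karg));
		}
	} while (0);

So the disabled-tracing cost is two predicted branches and no function
call, at the price of one static struct per call site.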
>>> >> >> +
>>> >> >> +/* To be used for string format validity checking with gcc */
>>> >> >> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
>>> >> >> +{
>>> >> >> +}
>>> >> >> +
>>> >> >> +#define __ktrace_check_format(format, args...) \
>>> >> >> + do { \
>>> >> >> + if (0) \
>>> >> >> + ___ktrace_check_format(format, ## args); \
>>> >> >> + } while (0)
>>> >> >> +
>>> >> >> +
>>> >> >> +/* get the trace buffer information */
>>> >> >> +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
>>> >> >> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
>>> >> >> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
>>> >> >> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
>>> >> >> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
>>> >> >> +
>>> >> >> +#endif
>>> >> >> Index: git-linux/kernel/ktrace.c
>>> >> >> ===================================================================
>>> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>>> >> >> +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
>>> >> >> @@ -0,0 +1,872 @@
>>> >> >> +/*
>>> >> >> + * kernel/ktrace.c
>>> >> >> + *
>>> >> >> + * Implementation of the kernel tracing for linux kernel 2.6.
>>> >> >> + *
>>> >> >> + * Copyright 2008- Google Inc.
>>> >> >> + * Original Author: Jiaying Zhang
>>> >> >> + */
>>> >> >> +
>>> >> >> +#include <linux/module.h>
>>> >> >> +#include <linux/kernel.h>
>>> >> >> +#include <linux/bitmap.h>
>>> >> >> +#include <linux/poll.h>
>>> >> >> +#include <linux/errno.h>
>>> >> >> +#include <linux/sched.h>
>>> >> >> +#include <linux/linkage.h>
>>> >> >> +#include <linux/debugfs.h>
>>> >> >> +#include <linux/timer.h>
>>> >> >> +#include <linux/delay.h>
>>> >> >> +#include <linux/jhash.h>
>>> >> >> +#include <linux/ctype.h>
>>> >> >> +#include <linux/namei.h>
>>> >> >> +#include <linux/ktrace.h>
>>> >> >> +#include <linux/ring_buffer.h>
>>> >> >> +
>>> >> >> +static char *ktrace_version = "1.0.0";
>>> >> >> +int ktrace_enabled;
>>> >> >> +EXPORT_SYMBOL(ktrace_enabled);
>>> >> >> +static uint16_t ktrace_next_id;
>>> >> >> +
>>> >> >> +extern struct kernel_trace __start___ktraces[];
>>> >> >> +extern struct kernel_trace __stop___ktraces[];
>>> >> >> +
>>> >> >> +static DEFINE_MUTEX(ktrace_mutex);
>>> >> >> +
>>> >> >> +/* ktrace hash table entry structure and variables */
>>> >> >> +struct ktrace_entry {
>>> >> >> + struct hlist_node hlist;
>>> >> >> + char *name;
>>> >> >> + char *format;
>>> >> >> + size_t *stroff;
>>> >> >> + ktrace_probe_func *func;
>>> >> >> + u32 enabled:1, event_id:31;
>>> >> >> + struct list_head klist; /* list of loaded ktraces */
>>> >> >> +};
>>> >> >> +
>>> >> >> +#define KTRACE_HASH_BITS 6
>>> >> >> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
>>> >> >> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
>>> >> >> +
>>> >> >> +/* debugfs directory variables */
>>> >> >> +static struct dentry *tracedir;
>>> >> >> +static struct dentry *eventdir;
>>> >> >> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
>>> >> >> +
>>> >> >> +/* ring buffer code */
>>> >> >> +static unsigned long trace_buf_size = 65536UL;
>>> >> >> +struct ring_buffer *trace_buffer;
>>> >> >> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
>>> >> >> +static struct timer_list trace_timer; /* reader wake-up timer */
>>> >> >> +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
>>> >> >> +static struct kref ktrace_kref;
>>> >> >> +
>>> >> >> +static void
>>> >> >> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
>>> >> >> +{
>>> >> >> + struct ring_buffer_event *event;
>>> >> >> + void *trace_event;
>>> >> >> + unsigned long flag;
>>> >> >> +
>>> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
>>> >> >> + sizeof(uint16_t) + event_size, &flag);
>>> >> >> + if (!event)
>>> >> >> + return;
>>> >> >> + trace_event = ring_buffer_event_data(event);
>>> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>>> >> >> + memcpy(trace_event + sizeof(uint16_t), data, event_size);
>>> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>>> >> >> +}
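So each committed record is just a 16-bit event id followed by the raw
argument block. Conceptually (not a real type in the patch):

	struct ktrace_record {
		uint16_t event_id;	/* 31-bit id truncated to 16 bits */
		char payload[];		/* verbatim ktrace_struct_<name> copy */
	};

A consumer matches event_id against the 'info' file to recover the
format string, then decodes the payload with it.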
>>> >> >> +
>>> >> >> +/* special probe function for events that contain string arguments */
>>> >> >> +static void
>>> >> >> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
>>> >> >> +{
>>> >> >> + struct ring_buffer_event *event;
>>> >> >> + void *trace_event, *p;
>>> >> >> + unsigned long flag;
>>> >> >> + size_t *offset, scanned;
>>> >> >> + char *string;
>>> >> >> +
>>> >> >> + /*
>>> >> >> + * Get the real length of the event, i.e., we use stroff array to
>>> >> >> + * locate string variables in the passed-in trace data and use their
>>> >> >> + * length to replace the size of the string pointers.
>>> >> >> + */
>>> >> >> + for (offset = kt->stroff; *offset != -1; offset++) {
>>> >> >> + string = *(char **) (data + *offset);
>>> >> >> + event_size += strlen(string) + 1 - sizeof(char *);
>>> >> >> + }
>>> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
>>> >> >> + sizeof(uint16_t) + event_size, &flag);
>>> >> >> + if (!event)
>>> >> >> + return;
>>> >> >> + trace_event = ring_buffer_event_data(event);
>>> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>>> >> >> + p = trace_event + sizeof(uint16_t);
>>> >> >> + /*
>>> >> >> + * Copy the trace data into buffer. For string variables, we enter the
>>> >> >> + * string into the buffer. Otherwise, the passed in data is copied.
>>> >> >> + */
>>> >> >> + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
>>> >> >> + memcpy(p, data + scanned, *offset - scanned);
>>> >> >> + p += *offset - scanned;
>>> >> >> + string = *(char **) (data + *offset);
>>> >> >> + memcpy(p, string, strlen(string) + 1);
>>> >> >> + p += strlen(string) + 1;
>>> >> >> + scanned = *offset + sizeof(char *);
>>> >> >> + }
>>> >> >> + memcpy(p, data + scanned,
>>> >> >> + trace_event + sizeof(uint16_t) + event_size - p);
>>> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>>> >> >> +}
>>> >> >> +
>>> >> >> +/* timer function used to defer ring buffer reader waking */
>>> >> >> +static void wakeup_readers(unsigned long data)
>>> >> >> +{
>>> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>>> >> >> + wake_up_interruptible(&trace_wait);
>>> >> >> + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
>>> >> >> +}
>>> >> >> +
>>> >> >> +/* function for reading ktrace metadata info */
>>> >> >> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
>>> >> >> +{
>>> >> >> + seq_printf(seq, "version %s\n", ktrace_version);
>>> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>>> >> >> +}
>>> >> >> +
>>> >> >> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
>>> >> >> +{
>>> >> >> + ++*pos;
>>> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int tracing_info_show(struct seq_file *seq, void *v)
>>> >> >> +{
>>> >> >> + struct hlist_head *head = v;
>>> >> >> + struct hlist_node *node;
>>> >> >> + struct ktrace_entry *entry;
>>> >> >> + hlist_for_each_entry(entry, node, head, hlist)
>>> >> >> + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
>>> >> >> + entry->name, entry->format, entry->event_id,
>>> >> >> + entry->enabled ? "enabled" : "disabled");
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static void tracing_info_stop(struct seq_file *seq, void *v)
>>> >> >> +{
>>> >> >> +}
>>> >> >> +
>>> >> >> +static const struct seq_operations tracing_info_seq_ops = {
>>> >> >> + .start = tracing_info_start,
>>> >> >> + .next = tracing_info_next,
>>> >> >> + .show = tracing_info_show,
>>> >> >> + .stop = tracing_info_stop,
>>> >> >> +};
>>> >> >> +
>>> >> >> +static int tracing_info_open(struct inode *inode, struct file *file)
>>> >> >> +{
>>> >> >> + return seq_open(file, &tracing_info_seq_ops);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static const struct file_operations traceinfo_file_operations = {
>>> >> >> + .open = tracing_info_open,
>>> >> >> + .read = seq_read,
>>> >> >> + .llseek = seq_lseek,
>>> >> >> + .release = seq_release,
>>> >> >> +};
>>> >> >> +
>>> >> >> +/*
>>> >> >> + * wrapper function used by debugfs write operation.
>>> >> >> + * func: handling function that does real work
>>> >> >> + */
>>> >> >> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
>>> >> >> + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
>>> >> >> +{
>>> >> >> + int ret;
>>> >> >> + char buf[64];
>>> >> >> + unsigned long val;
>>> >> >> +
>>> >> >> + if (cnt >= sizeof(buf))
>>> >> >> + return -EINVAL;
>>> >> >> + if (copy_from_user(&buf, ubuf, cnt))
>>> >> >> + return -EFAULT;
>>> >> >> + buf[cnt] = 0;
>>> >> >> +
>>> >> >> + ret = strict_strtoul(buf, 10, &val);
>>> >> >> + if (ret < 0)
>>> >> >> + return ret;
>>> >> >> + val = !!val;
>>> >> >> +
>>> >> >> + func(filp, val, &cnt);
>>> >> >> +
>>> >> >> + *ppos += cnt;
>>> >> >> + return cnt;
>>> >> >> +}
>>> >> >> +
>>> >> >> +/* functions for reading/writing the global 'enabled' control file */
>>> >> >> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
>>> >> >> + size_t cnt, loff_t *ppos)
>>> >> >> +{
>>> >> >> + int ret;
>>> >> >> + char buf[64];
>>> >> >> + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
>>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static void __tracing_control_write(struct file *filp,
>>> >> >> + unsigned long val, size_t *cnt)
>>> >> >> +{
>>> >> >> + if (val ^ ktrace_enabled) {
>>> >> >> + if (val) {
>>> >> >> + trace_timer.expires =
>>> >> >> + jiffies + TRACEREAD_WAKEUP_INTERVAL;
>>> >> >> + add_timer(&trace_timer);
>>> >> >> + ktrace_enabled = 1;
>>> >> >> + } else {
>>> >> >> + ktrace_enabled = 0;
>>> >> >> + del_timer_sync(&trace_timer);
>>> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>>> >> >> + wake_up_interruptible(&trace_wait);
>>> >> >> + }
>>> >> >> + }
>>> >> >> +}
>>> >> >> +
>>> >> >> +static ssize_t tracing_control_write(struct file *filp, const char __user *ubuf,
>>> >> >> + size_t cnt, loff_t *ppos)
>>> >> >> +{
>>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>>> >> >> + __tracing_control_write);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static const struct file_operations tracecontrol_file_operations = {
>>> >> >> + .read = tracing_control_read,
>>> >> >> + .write = tracing_control_write,
>>> >> >> +};
>>> >> >> +
>>> >> >> +/* functions for reading/writing a trace event file */
>>> >> >> +static int update_ktrace(struct ktrace_entry *entry,
>>> >> >> + ktrace_probe_func *func, int enabled)
>>> >> >> +{
>>> >> >> + struct kernel_trace *iter;
>>> >> >> + /* no need to update the list if the tracing is not initialized */
>>> >> >> + if (!tracedir)
>>> >> >> + return 0;
>>> >> >> + entry->enabled = enabled;
>>> >> >> + entry->func = func;
>>> >> >> + list_for_each_entry(iter, &entry->klist, list) {
>>> >> >> + iter->stroff = entry->stroff;
>>> >> >> + iter->func = func;
>>> >> >> + iter->enabled = enabled;
>>> >> >> + }
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int ktrace_probe_register(struct ktrace_entry *entry,
>>> >> >> + ktrace_probe_func *probefunc)
>>> >> >> +{
>>> >> >> + return update_ktrace(entry, probefunc, 1);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
>>> >> >> +{
>>> >> >> + return update_ktrace(entry, NULL, 0);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int ktrace_control_open(struct inode *inode, struct file *filp)
>>> >> >> +{
>>> >> >> + filp->private_data = inode->i_private;
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
>>> >> >> + size_t cnt, loff_t *ppos)
>>> >> >> +{
>>> >> >> + int ret;
>>> >> >> + char buf[64];
>>> >> >> + struct ktrace_entry *entry = filp->private_data;
>>> >> >> +
>>> >> >> + ret = snprintf(buf, 64, "%u\n", entry->enabled);
>>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>>> >> >> +}
>>> >> >> +
>>> >> >> +/*
>>> >> >> + * Check whether the event contains string arguments. Return 0 if not;
>>> >> >> + * otherwise record the offsets of the string arguments in the ktrace
>>> >> >> + * structure according to the event format and return 1.
>>> >> >> + * The offset info is later used to store the string arguments into
>>> >> >> + * the trace buffer during trace probe.
>>> >> >> + */
>>> >> >> +static int contain_string_arguments(struct ktrace_entry *entry)
>>> >> >> +{
>>> >> >> + int count = 0;
>>> >> >> + char *fmt;
>>> >> >> + int qualifier; /* 'h', 'l', or 'L' for integer fields */
>>> >> >> + size_t offset = 0;
>>> >> >> +
>>> >> >> + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
>>> >> >> + count++;
>>> >> >> + if (!count)
>>> >> >> + return 0;
>>> >> >> + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
>>> >> >> + if (!entry->stroff)
>>> >> >> + return -ENOMEM;
>>> >> >> +
>>> >> >> + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
>>> >> >> + if (*fmt != '%')
>>> >> >> + continue;
>>> >> >> +repeat:
>>> >> >> + ++fmt;
>>> >> >> + switch (*fmt) {
>>> >> >> + case '-':
>>> >> >> + case '+':
>>> >> >> + case ' ':
>>> >> >> + case '#':
>>> >> >> + case '0':
>>> >> >> + ++fmt;
>>> >> >> + goto repeat;
>>> >> >> + }
>>> >> >> +
>>> >> >> + while (isdigit(*fmt))
>>> >> >> + fmt++;
>>> >> >> +
>>> >> >> + /* get the conversion qualifier */
>>> >> >> + qualifier = -1;
>>> >> >> + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
>>> >> >> + qualifier = *fmt;
>>> >> >> + ++fmt;
>>> >> >> + if (qualifier == 'l' && *fmt == 'l') {
>>> >> >> + qualifier = 'L';
>>> >> >> + ++fmt;
>>> >> >> + }
>>> >> >> + }
>>> >> >> +
>>> >> >> + switch (*fmt) {
>>> >> >> + case 'c':
>>> >> >> + offset += sizeof(char);
>>> >> >> + continue;
>>> >> >> + case 's':
>>> >> >> + entry->stroff[count] = offset;
>>> >> >> + count++;
>>> >> >> + offset += sizeof(char *);
>>> >> >> + continue;
>>> >> >> + case 'p':
>>> >> >> + offset += sizeof(void *);
>>> >> >> + continue;
>>> >> >> + case 'd':
>>> >> >> + case 'i':
>>> >> >> + case 'o':
>>> >> >> + case 'u':
>>> >> >> + case 'x':
>>> >> >> + case 'X':
>>> >> >> + switch (qualifier) {
>>> >> >> + case 'L':
>>> >> >> + offset += sizeof(long long);
>>> >> >> + break;
>>> >> >> + case 'l':
>>> >> >> + offset += sizeof(long);
>>> >> >> + break;
>>> >> >> + case 'h':
>>> >> >> + offset += sizeof(short);
>>> >> >> + break;
>>> >> >> + default:
>>> >> >> + offset += sizeof(int);
>>> >> >> + break;
>>> >> >> + }
>>> >> >> + break;
>>> >> >> + default:
>>> >> >> + printk(KERN_WARNING "unknown format '%c'", *fmt);
>>> >> >> + continue;
>>> >> >> + }
>>> >> >> + }
>>> >> >> + entry->stroff[count] = -1;
>>> >> >> + return 1;
>>> >> >> +}
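A worked example may help here (hypothetical event of mine, not from the
patch):

	DEFINE_KTRACE_STRUCT(exec) {	/* packed: an int, then a char * */
		int pid;
		char *filename;
	};

	/* For format "pid %d file %s", the scan above records:
	 *	entry->stroff[0] = sizeof(int);	offset of the char * member
	 *	entry->stroff[1] = (size_t)-1;	terminator
	 * so string_probe() knows to replace the pointer found at that
	 * offset with the NUL-terminated string contents when copying
	 * the event into the buffer. */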
>>> >> >> +
>>> >> >> +static void __ktrace_control_write(struct file *filp,
>>> >> >> + unsigned long val, size_t *cnt)
>>> >> >> +{
>>> >> >> + struct ktrace_entry *entry = filp->private_data;
>>> >> >> + int ret;
>>> >> >> +
>>> >> >> + mutex_lock(&ktrace_mutex);
>>> >> >> + if (val ^ entry->enabled) {
>>> >> >> + if (val) {
>>> >> >> + ret = contain_string_arguments(entry);
>>> >> >> + if (ret == 0)
>>> >> >> + ktrace_probe_register(entry, ring_buffer_probe);
>>> >> >> + else if (ret > 0)
>>> >> >> + ktrace_probe_register(entry, string_probe);
>>> >> >> + else
>>> >> >> + *cnt = ret;
>>> >> >> + } else
>>> >> >> + ktrace_probe_unregister(entry);
>>> >> >> + }
>>> >> >> + mutex_unlock(&ktrace_mutex);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
>>> >> >> + size_t cnt, loff_t *ppos)
>>> >> >> +{
>>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>>> >> >> + __ktrace_control_write);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static const struct file_operations ktracecontrol_file_operations = {
>>> >> >> + .open = ktrace_control_open,
>>> >> >> + .read = ktrace_control_read,
>>> >> >> + .write = ktrace_control_write,
>>> >> >> +};
>>> >> >> +
>>> >> >> +/* functions for adding/removing a trace event. Protected by mutex lock.
>>> >> >> + * Called during initialization or after loading a module */
>>> >> >> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
>>> >> >> +{
>>> >> >> + struct ktrace_entry *entry;
>>> >> >> + size_t name_len = strlen(ktrace->name) + 1;
>>> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>>> >> >> + struct hlist_head *head;
>>> >> >> + struct hlist_node *node;
>>> >> >> + struct dentry *dentry;
>>> >> >> +
>>> >> >> + if (!tracedir)
>>> >> >> + return 0;
>>> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>>> >> >> + hlist_for_each_entry(entry, node, head, hlist) {
>>> >> >> + if (!strcmp(ktrace->name, entry->name)) {
>>> >> >> + if (strcmp(ktrace->format, entry->format)) {
>>> >> >> + printk(KERN_WARNING "the format of tracepoint "
>>> >> >> + "\'%s\' changes from \'%s\' to \'%s\'."
>>> >> >> + "Dynamic changing of trace format is "
>>> >> >> + "not supported yet!\n",
>>> >> >> + ktrace->name, entry->format,
>>> >> >> + ktrace->format);
>>> >> >> + return ERR_PTR(-EINVAL);
>>> >> >> + }
>>> >> >> + ktrace->enabled = entry->enabled;
>>> >> >> + ktrace->func = entry->func;
>>> >> >> + ktrace->event_id = entry->event_id;
>>> >> >> + if (list_empty(&entry->klist))
>>> >> >> + goto add_head;
>>> >> >> + list_add_tail(&ktrace->list, &entry->klist);
>>> >> >> + return entry;
>>> >> >> + }
>>> >> >> + }
>>> >> >> + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
>>> >> >> + if (!entry)
>>> >> >> + return ERR_PTR(-ENOMEM);
>>> >> >> + entry->name = kmalloc(name_len, GFP_KERNEL);
>>> >> >> + if (!entry->name) {
>>> >> >> + kfree(entry);
>>> >> >> + return ERR_PTR(-ENOMEM);
>>> >> >> + }
>>> >> >> + memcpy(entry->name, ktrace->name, name_len);
>>> >> >> + if (ktrace->format) {
>>> >> >> + size_t format_len = strlen(ktrace->format) + 1;
>>> >> >> + entry->format = kmalloc(format_len, GFP_KERNEL);
>>> >> >> + if (!entry->format) {
>>> >> >> + kfree(entry->name);
>>> >> >> + kfree(entry);
>>> >> >> + return ERR_PTR(-ENOMEM);
>>> >> >> + }
>>> >> >> + memcpy(entry->format, ktrace->format, format_len);
>>> >> >> + } else
>>> >> >> + entry->format = NULL;
>>> >> >> + entry->func = ktrace->func;
>>> >> >> + entry->enabled = 0;
>>> >> >> + ktrace->event_id = entry->event_id = ktrace_next_id++;
>>> >> >> + entry->stroff = NULL;
>>> >> >> + INIT_LIST_HEAD(&entry->klist);
>>> >> >> + hlist_add_head(&entry->hlist, head);
>>> >> >> +add_head:
>>> >> >> + list_add(&ktrace->list, &entry->klist);
>>> >> >> + dentry = debugfs_create_file(entry->name, 0660, eventdir,
>>> >> >> + entry, &ktracecontrol_file_operations);
>>> >> >> + if (!dentry)
>>> >> >> + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
>>> >> >> + entry->name);
>>> >> >> + return entry;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
>>> >> >> +{
>>> >> >> + struct ktrace_entry *entry;
>>> >> >> + size_t name_len = strlen(ktrace->name) + 1;
>>> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>>> >> >> + struct hlist_head *head;
>>> >> >> + struct hlist_node *node, *temp;
>>> >> >> + struct dentry *dentry;
>>> >> >> +
>>> >> >> + if (!tracedir)
>>> >> >> + return 0;
>>> >> >> + list_del(&ktrace->list);
>>> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>>> >> >> + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
>>> >> >> + if (!strcmp(ktrace->name, entry->name)) {
>>> >> >> + if (list_empty(&entry->klist)) {
>>> >> >> + dentry = lookup_one_len(entry->name,
>>> >> >> + eventdir, strlen(entry->name));
>>> >> >> + if (dentry && !IS_ERR(dentry)) {
>>> >> >> + debugfs_remove(dentry);
>>> >> >> + dput(dentry);
>>> >> >> + }
>>> >> >> + entry->enabled = 0;
>>> >> >> + if (free) {
>>> >> >> + hlist_del(&entry->hlist);
>>> >> >> + kfree(entry->name);
>>> >> >> + kfree(entry->format);
>>> >> >> + kfree(entry->stroff);
>>> >> >> + kfree(entry);
>>> >> >> + }
>>> >> >> + }
>>> >> >> + return 0;
>>> >> >> + }
>>> >> >> + return -1;
>>> >> >> +}
>>> >> >> +
>>> >> >> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
>>> >> >> + * Called at the end of module load/unload. */
>>> >> >> +static int ktrace_module_notify(struct notifier_block *self,
>>> >> >> + unsigned long val, void *data)
>>> >> >> +{
>>> >> >> +#ifdef CONFIG_MODULES
>>> >> >> + struct kernel_trace *iter;
>>> >> >> + struct module *mod = data;
>>> >> >> +
>>> >> >> + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
>>> >> >> + return 0;
>>> >> >> + mutex_lock(&ktrace_mutex);
>>> >> >> + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
>>> >> >> + if (val == MODULE_STATE_COMING)
>>> >> >> + add_ktrace(iter);
>>> >> >> + else
>>> >> >> + remove_ktrace(iter, 0);
>>> >> >> + mutex_unlock(&ktrace_mutex);
>>> >> >> +#endif
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static struct notifier_block ktrace_module_nb = {
>>> >> >> + .notifier_call = ktrace_module_notify,
>>> >> >> +};
>>> >> >> +
>>> >> >> +/* functions for user-space programs to read data from tracing buffer */
>>> >> >> +static int trace_data_open(struct inode *inode, struct file *filp)
>>> >> >> +{
>>> >> >> + filp->private_data = inode->i_private;
>>> >> >> + kref_get(&ktrace_kref);
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static void release_trace_buffer(struct kref *kref)
>>> >> >> +{
>>> >> >> + ring_buffer_free(trace_buffer);
>>> >> >> + trace_buffer = NULL;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int trace_data_release(struct inode *inode, struct file *filp)
>>> >> >> +{
>>> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
>>> >> >> +{
>>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>>> >> >> + if (filp->f_mode & FMODE_READ) {
>>> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>>> >> >> + return POLLIN | POLLRDNORM;
>>> >> >> + poll_wait(filp, &trace_wait, poll_table);
>>> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>>> >> >> + return POLLIN | POLLRDNORM;
>>> >> >> + }
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>> >> >> +{
>>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
>>> >> >> + pgoff_t pgoff = vmf->pgoff;
>>> >> >> + struct page *page;
>>> >> >> +
>>> >> >> + if (!trace_buffer)
>>> >> >> + return VM_FAULT_OOM;
>>> >> >> + page = ring_buffer_get_page(cpu_buffer, pgoff);
>>> >> >> + if (page == NULL)
>>> >> >> + return VM_FAULT_SIGBUS;
>>> >> >> + get_page(page);
>>> >> >> + vmf->page = page;
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static struct vm_operations_struct trace_data_mmap_ops = {
>>> >> >> + .fault = trace_buf_fault,
>>> >> >> +};
>>> >> >> +
>>> >> >> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
>>> >> >> +{
>>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>>> >> >> +
>>> >> >> + vma->vm_ops = &trace_data_mmap_ops;
>>> >> >> + vma->vm_flags |= VM_DONTEXPAND;
>>> >> >> + vma->vm_private_data = cpu_buffer;
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
>>> >> >> + unsigned int cmd, unsigned long arg)
>>> >> >> +{
>>> >> >> + u32 __user *argp = (u32 __user *)arg;
>>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>>> >> >> +
>>> >> >> + if (!trace_buffer || !tracedir)
>>> >> >> + return -EINVAL;
>>> >> >> + switch (cmd) {
>>> >> >> + case KTRACE_BUF_GET_SIZE:
>>> >> >> + {
>>> >> >> + unsigned long bufsize;
>>> >> >> + bufsize = ring_buffer_size(trace_buffer);
>>> >> >> + return put_user((u32)bufsize, argp);
>>> >> >> + }
>>> >> >> + case KTRACE_BUF_GET_PRODUCED:
>>> >> >> + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
>>> >> >> + case KTRACE_BUF_GET_CONSUMED:
>>> >> >> + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
>>> >> >> + case KTRACE_BUF_PUT_CONSUMED:
>>> >> >> + {
>>> >> >> + u32 consumed, consumed_old;
>>> >> >> + int ret;
>>> >> >> +
>>> >> >> + ret = get_user(consumed, argp);
>>> >> >> + if (ret) {
>>> >> >> + printk(KERN_WARNING
>>> >> >> + "error getting consumed value: %d\n", ret);
>>> >> >> + return ret;
>>> >> >> + }
>>> >> >> + consumed_old = ring_buffer_get_consumed(cpu_buffer);
>>> >> >> + if (consumed == consumed_old)
>>> >> >> + return 0;
>>> >> >> + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
>>> >> >> + return 0;
>>> >> >> + }
>>> >> >> + default:
>>> >> >> + return -ENOIOCTLCMD;
>>> >> >> + }
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static const struct file_operations tracedata_file_operations = {
>>> >> >> + .open = trace_data_open,
>>> >> >> + .poll = trace_data_poll,
>>> >> >> + .mmap = trace_data_mmap,
>>> >> >> + .ioctl = trace_data_ioctl,
>>> >> >> + .release = trace_data_release,
>>> >> >> +};
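For reference, a minimal user-space consumer of this interface would look
roughly like the sketch below (my own illustration; the debugfs mount
point is an assumption, and error handling and the byte-consumption loop
are abbreviated):

	#include <stdint.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/ioctl.h>
	#include <linux/ioctl.h>

	#define KTRACE_BUF_GET_SIZE	_IOR(0xF5, 0x00, uint32_t)
	#define KTRACE_BUF_GET_PRODUCED	_IOR(0xF5, 0x01, uint32_t)
	#define KTRACE_BUF_GET_CONSUMED	_IOR(0xF5, 0x02, uint32_t)
	#define KTRACE_BUF_PUT_CONSUMED	_IOW(0xF5, 0x04, uint32_t)

	int main(void)
	{
		uint32_t size, produced, consumed;
		int fd = open("/sys/kernel/debug/ktrace/buffers/cpu0",
			      O_RDONLY);
		void *buf;

		if (fd < 0)
			return 1;
		if (ioctl(fd, KTRACE_BUF_GET_SIZE, &size) < 0)
			return 1;
		buf = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
		if (buf == MAP_FAILED)
			return 1;
		for (;;) {
			ioctl(fd, KTRACE_BUF_GET_PRODUCED, &produced);
			ioctl(fd, KTRACE_BUF_GET_CONSUMED, &consumed);
			/* ... parse bytes [consumed, produced) in buf,
			 * modulo size ... */
			ioctl(fd, KTRACE_BUF_PUT_CONSUMED, &produced);
			sleep(1);
		}
	}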
>>> >> >> +
>>> >> >> +/* function to report the number of lost events due to buffer overflow */
>>> >> >> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
>>> >> >> + size_t cnt, loff_t *ppos)
>>> >> >> +{
>>> >> >> + int ret;
>>> >> >> + char buf[64];
>>> >> >> + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
>>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static const struct file_operations traceoverflow_file_operations = {
>>> >> >> + .read = trace_overflow_read,
>>> >> >> +};
>>> >> >> +
>>> >> >> +/* function to re-initialize trace buffer */
>>> >> >> +static void __trace_buffer_reset(struct file *filp,
>>> >> >> + unsigned long val, size_t *cnt)
>>> >> >> +{
>>> >> >> + if (val && trace_buffer)
>>> >> >> + ring_buffer_reset(trace_buffer);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
>>> >> >> + size_t cnt, loff_t *ppos)
>>> >> >> +{
>>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>>> >> >> + __trace_buffer_reset);
>>> >> >> +}
>>> >> >> +
>>> >> >> +static const struct file_operations tracereset_file_operations = {
>>> >> >> + .write = trace_buffer_reset,
>>> >> >> +};
>>> >> >> +
>>> >> >> +/*
>>> >> >> + * We use debugfs for kernel-user communication. All of the control/info
>>> >> >> + * files are under debugfs/ktrace directory.
>>> >> >> + */
>>> >> >> +static int create_debugfs(void)
>>> >> >> +{
>>> >> >> + struct dentry *entry, *bufdir;
>>> >> >> + struct ring_buffer_per_cpu *cpu_buffer;
>>> >> >> + char *tmpname;
>>> >> >> + int cpu;
>>> >> >> +
>>> >> >> + tracedir = debugfs_create_dir("ktrace", NULL);
>>> >> >> + if (!tracedir) {
>>> >> >> + printk(KERN_WARNING "Couldn't create debugfs directory\n");
>>> >> >> + return -ENOMEM;
>>> >> >> + }
>>> >> >> +
>>> >> >> + /* 'buffers' is the directory that holds trace buffer debugfs files */
>>> >> >> + bufdir = debugfs_create_dir("buffers", tracedir);
>>> >> >> + if (!bufdir) {
>>> >> >> + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
>>> >> >> + return -ENOMEM;
>>> >> >> + }
>>> >> >> +
>>> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>>> >> >> + if (!tmpname)
>>> >> >> + return -ENOMEM;
>>> >> >> + /* create a debugfs file for each cpu buffer */
>>> >> >> + for_each_possible_cpu(cpu) {
>>> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>>> >> >> + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
>>> >> >> + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
>>> >> >> + &tracedata_file_operations);
>>> >> >> + if (!entry) {
>>> >> >> + printk(KERN_WARNING
>>> >> >> + "Couldn't create debugfs \'%s\' entry\n",
>>> >> >> + tmpname);
>>> >> >> + return -ENOMEM;
>>> >> >> + }
>>> >> >> + }
>>> >> >> + kfree(tmpname);
>>> >> >> +
>>> >> >> + /* the control file for users to enable/disable global tracing. */
>>> >> >> + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
>>> >> >> + &tracecontrol_file_operations);
>>> >> >> + if (!entry) {
>>> >> >> + printk(KERN_WARNING
>>> >> >> + "Couldn't create debugfs 'enabled' entry\n");
>>> >> >> + return -ENOMEM;
>>> >> >> + }
>>> >> >> +
>>> >> >> + /*
>>> >> >> + * the debugfs file that displays the name, format etc. of every
>>> >> >> + * supported trace event. The file is to be used by the user-space
>>> >> >> + * trace parser to analyze the collected trace data.
>>> >> >> + */
>>> >> >> + entry = debugfs_create_file("info", 0444, tracedir, NULL,
>>> >> >> + &traceinfo_file_operations);
>>> >> >> + if (!entry) {
>>> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
>>> >> >> + return -ENOMEM;
>>> >> >> + }
>>> >> >> +
>>> >> >> + /*
>>> >> >> + * the debugfs file that reports the number of events
>>> >> >> + * lost due to buffer overflow
>>> >> >> + */
>>> >> >> + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
>>> >> >> + &traceoverflow_file_operations);
>>> >> >> + if (!entry) {
>>> >> >> + printk(KERN_WARNING
>>> >> >> + "Couldn't create debugfs 'overflow' entry\n");
>>> >> >> + return -ENOMEM;
>>> >> >> + }
>>> >> >> +
>>> >> >> + /* the debugfs file that resets trace buffer upon write */
>>> >> >> + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
>>> >> >> + &tracereset_file_operations);
>>> >> >> + if (!entry) {
>>> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
>>> >> >> + return -ENOMEM;
>>> >> >> + }
>>> >> >> +
>>> >> >> + /* create the directory that holds the control files for every event */
>>> >> >> + eventdir = debugfs_create_dir("events", tracedir);
>>> >> >> + if (!eventdir) {
>>> >> >> + printk(KERN_WARNING "Couldn't create 'events' directory\n");
>>> >> >> + return -ENOMEM;
>>> >> >> + }
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static void remove_debugfs_file(struct dentry *dir, char *name)
>>> >> >> +{
>>> >> >> + struct dentry *dentry;
>>> >> >> + dentry = lookup_one_len(name, dir, strlen(name));
>>> >> >> + if (dentry && !IS_ERR(dentry)) {
>>> >> >> + debugfs_remove(dentry);
>>> >> >> + dput(dentry);
>>> >> >> + }
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int remove_debugfs(void)
>>> >> >> +{
>>> >> >> + struct dentry *bufdir;
>>> >> >> + char *tmpname;
>>> >> >> + int cpu;
>>> >> >> +
>>> >> >> + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
>>> >> >> + if (!bufdir || IS_ERR(bufdir))
>>> >> >> + return -EIO;
>>> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>>> >> >> + if (!tmpname)
>>> >> >> + return -ENOMEM;
>>> >> >> + for_each_possible_cpu(cpu) {
>>> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>>> >> >> + remove_debugfs_file(bufdir, tmpname);
>>> >> >> + }
>>> >> >> + kfree(tmpname);
>>> >> >> + debugfs_remove(bufdir);
>>> >> >> +
>>> >> >> + remove_debugfs_file(tracedir, "enabled");
>>> >> >> + remove_debugfs_file(tracedir, "info");
>>> >> >> + debugfs_remove(eventdir);
>>> >> >> + debugfs_remove(tracedir);
>>> >> >> + return 0;
>>> >> >> +}
>>> >> >> +
>>> >> >> +static int __init init_ktrace(void)
>>> >> >> +{
>>> >> >> + struct kernel_trace *iter;
>>> >> >> + int i, err;
>>> >> >> +
>>> >> >> + ktrace_next_id = 0;
>>> >> >> + kref_set(&ktrace_kref, 1);
>>> >> >> + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
>>> >> >> + if (trace_buffer == NULL)
>>> >> >> + return -ENOMEM;
>>> >> >> + setup_timer(&trace_timer, wakeup_readers, 0);
>>> >> >> + for (i = 0; i < KTRACE_HASH_SIZE; i++)
>>> >> >> + INIT_HLIST_HEAD(&ktrace_table[i]);
>>> >> >> +
>>> >> >> + mutex_lock(&ktrace_mutex);
>>> >> >> + err = create_debugfs();
>>> >> >> + if (err != 0)
>>> >> >> + goto out;
>>> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>>> >> >> + add_ktrace(iter);
>>> >> >> + err = register_module_notifier(&ktrace_module_nb);
>>> >> >> +out:
>>> >> >> + mutex_unlock(&ktrace_mutex);
>>> >> >> + return err;
>>> >> >> +}
>>> >> >> +module_init(init_ktrace);
>>> >> >> +
>>> >> >> +static void __exit exit_ktrace(void)
>>> >> >> +{
>>> >> >> + struct kernel_trace *iter;
>>> >> >> +
>>> >> >> + if (ktrace_enabled) {
>>> >> >> + ktrace_enabled = 0;
>>> >> >> + del_timer_sync(&trace_timer);
>>> >> >> + }
>>> >> >> + mutex_lock(&ktrace_mutex);
>>> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>>> >> >> + remove_ktrace(iter, 1);
>>> >> >> + unregister_module_notifier(&ktrace_module_nb);
>>> >> >> + remove_debugfs();
>>> >> >> + mutex_unlock(&ktrace_mutex);
>>> >> >> +
>>> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
>>> >> >> +}
>>> >> >
>>> >> > --
>>> >> > Mathieu Desnoyers
>>> >> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>>> >> >
>>> >
>>> > --
>>> > Mathieu Desnoyers
>>> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>>> >
>>>
>>> _______________________________________________
>>> ltt-dev mailing list
>>> [email protected]
>>> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
>>>
>>
>> --
>> Mathieu Desnoyers
>> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>>
>
* Jiaying Zhang ([email protected]) wrote:
> Hi Mathieu,
>
Hi Jiaying,
> I ran tbench tests with the latest lttng patches (version 0.111) and
> 2.6.29-rc8 kernel. I used the same set of trace events in my tests:
> syscall entry/exit, irq entry/exit, trap entry/exit, sched_schedule,
> process_fork, process_wait, process_free, process_exit.
>
> Here are the results I got:
>
> vanilla kernel: Throughput 673.269 MB/sec 4 procs
> lttng normal mode: Throughput 462.786 MB/sec 4 procs (31.252% slowdown)
> lttng flight-recorder: Throughput 484.547 MB/sec 4 procs (28.03% slowdown)
> ktrace: Throughput 569.508 MB/sec 4 procs (15.411% slowdown)
>
Does your machine have synchronized timestamp counters ?
Also note that I added some further optimisations in 0.112, including
some data prefetching.
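(By "data prefetching" I mean something along these lines -- an
illustrative sketch of the idea, not the actual 0.112 code:)

	#include <linux/prefetch.h>
	#include <linux/string.h>

	static inline void write_event(void *slot, const void *data,
				       size_t len)
	{
		/* Start the write-intent fetch of the destination cache
		 * line early, so the copy below is less likely to stall. */
		prefetchw(slot);
		/* ... compute the event header while the line arrives ... */
		memcpy(slot, data, len);
	}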
> The kernel LTTng configurations I used are:
>
> CONFIG_LTT=y
> CONFIG_LTT_FILTER=y
> CONFIG_HAVE_LTT_DUMP_TABLES=y
> CONFIG_LTT_RELAY_ALLOC=y
> CONFIG_LTT_RELAY_LOCKLESS=y
> # CONFIG_LTT_RELAY_LOCKED is not set
> CONFIG_LTT_SERIALIZE=y
> CONFIG_LTT_FAST_SERIALIZE=y
> CONFIG_LTT_TRACEPROBES=y
> CONFIG_LTT_TRACE_CONTROL=y
> CONFIG_LTT_TRACER=y
> # CONFIG_LTT_ALIGNMENT is not set
> CONFIG_LTT_CHECK_ARCH_EFFICIENT_UNALIGNED_ACCESS=y
> # CONFIG_LTT_DEBUG_EVENT_SIZE is not set
> CONFIG_LTT_USERSPACE_EVENT=m
> CONFIG_LTT_VMCORE=y
This can slow down tracing. You might want to try without the VMCORE
option.
> CONFIG_LTT_KPROBES=y
> CONFIG_LTT_STATEDUMP=m
> # CONFIG_LTT_ASCII is not set
> CONFIG_HAVE_IMMEDIATE=y
> CONFIG_IMMEDIATE=y
>
> The LTTng performance seems to have improved compared with the
> results I measured before. But with the ktrace results as a comparison
> point, I think there is still considerable room to optimize it further.
> One worry I have is that the LTTng patch set keeps getting bigger.
> It is good to add more features and instrumentation, but that may
> make performance optimization harder. Maybe we should take
> a different approach, i.e., start with something simple, make
> sure it is implemented correctly and efficiently, then extend it and make
> sure we don't sacrifice performance as we add more features.
> I think a good time to do this is when you port the LTTng patches to
> the mainline kernel. Since the ftrace people are working on adding event
> tracing to ftrace, that seems like a good opportunity too.
>
Well, as with any project, there are "feature addition" phases and
"optimization" phases. I think the work I did in the last week will help
a lot in simplifying fast-path optimization: I basically trimmed the
fast path down to a few small inlined functions. As a result, the
instruction profile given by oprofile is now understandable and it's
easier to review the resulting assembly. So your argument about the
patchset size carries less weight, because it is now possible to focus
on a few small functions.
Even if we start from zero as you propose, we will end up adding
regressions during the feature addition phases, and we will end up in
the same situation we were a few months ago.
If we want to see what can improve performance in LTTng, we can simply
disable the hot path piece by piece and figure out where the cycles go,
as I did recently (and use oprofile, of course). Given that the LTTng
project is very modular, it is easy to enable features individually.
Another thing that might be worth looking at is skipping the tracepoint
probe -> marker callback indirection, because it likely adds noticeable
overhead. The tradeoff here is accepting some overhead in exchange for a
clear separation between the core kernel instrumentation and the markers
exposed to userspace.
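Concretely, the difference is one indirect call versus two, plus the
argument re-marshalling in between (sketch with made-up names):

	/* Made-up helpers, for illustration only. */
	extern void marker_callback(const char *name, int a, int b);
	extern void buffer_write(const void *data, size_t len);

	/* (a) Chained: tracepoint probe repacks the arguments and calls
	 *     the generic marker callback, which calls the writer.    */
	static void probe_chained(int prev_pid, int next_pid)
	{
		marker_callback("sched_schedule", prev_pid, next_pid);
	}

	/* (b) Direct: the tracepoint probe serializes and writes itself. */
	static void probe_direct(int prev_pid, int next_pid)
	{
		struct { int prev, next; } payload = { prev_pid, next_pid };

		buffer_write(&payload, sizeof(payload));
	}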
As you say, there will be a piece-by-piece review/integration phase that
will likely take place when I discuss LTTng and ftrace integration.
Mathieu
> Jiaying
>
> On Mon, Mar 16, 2009 at 7:12 PM, Jiaying Zhang <[email protected]> wrote:
> > On Mon, Mar 16, 2009 at 6:53 PM, Mathieu Desnoyers
> > <[email protected]> wrote:
> >> * Jiaying Zhang ([email protected]) wrote:
> >>> On Mon, Mar 16, 2009 at 5:01 PM, Mathieu Desnoyers
> >>> <[email protected]> wrote:
> >>> > * Jiaying Zhang ([email protected]) wrote:
> >>> >> Hi Mathieu,
> >>> >>
> >>> >> First, I apologize that I did not forward you my original email. I assumed
> >>> >> that the email would be noticed by interested people when it was sent
> >>> >> to lkml. Now I see that was a mistake.
> >>> >>
> >>> >
> >>> > Hi Jiaying,
> >>> >
> >>> > Apologies accepted, no hard feeling :) I have just been surprised when I
> >>> > found out about the numbers you posted.
> >>> >
> >>> >> On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
> >>> >> <[email protected]> wrote:
> >>> >> > Hi Jiaying,
> >>> >> >
> >>> >> > I'm very interested to hear about the details of your test-case, because
> >>> >> > what you are talking about here, and the numbers that you get, are, at
> >>> >> > the very least, not representative of the current LTTng. Perhaps you
> >>> >> > used a version right after the kernel summit and stopped following the
> >>> >> > improvements we have done since then ? I also wonder about the instrumentation
> >>> >> > set you enabled in your tests, and whether it's exactly the same in the ktrace
> >>> >> > and LTTng runs (especially system call tracing). Please note that following the
> >>> >> > Kernel Summit 2008, a more active development cycle began, and the time was not
> >>> >> > right for optimisation. But since then a lot of improvement has been done on
> >>> >> > this side.
> >>> >>
> >>> >> Sorry that I omitted a lot of details in my test description. I did use an old
> >>> >> version of LTTng: the 0.21 LTTng patches for the 2.6.27 kernel. My tests
> >>> >> ran on a 2.6.26 kernel with the ported lttng-2.6.27-rc6-0.21 patches.
> >>> >>
> >>> >> I enabled the same set of trace events in ktrace and LTTng runs. Here is
> >>> >> the list of enabled events: syscall entry/exit, irq entry/exit, trap
> >>> >> entry/exit,
> >>> >> kernel_sched_switch, kernel_process_fork, kernel_process_wait,
> >>> >> kernel_process_free, kernel_process_exit.
> >>> >>
> >>> >> We are glad to see the improvements coming from the LTTng community.
> >>> >> I haven't got time to sync with the latest LTTng code and re-run my tests.
> >>> >> I will let you know when I have such data available.
> >>> >>
> >>> >
> >>> > On my side, with flight recorder activated, same instrumentation you
> >>> > use, with LTTng 0.111 :
> >>> >
> >>> > arm-google script :
> >>> >
> >>> > #!/bin/sh
> >>> >
> >>> > DIR=/mnt/debugfs/ltt/markers
> >>> >
> >>> > for a in \
> >>> > syscall_entry syscall_exit \
> >>> > irq_entry irq_exit \
> >>> > trap_entry trap_exit \
> >>> > page_fault_entry page_fault_exit \
> >>> > page_fault_nosem_entry page_fault_nosem_exit \
> >>> > page_fault_get_user_entry page_fault_get_user_exit \
> >>> > sched_schedule process_fork process_wait \
> >>> > process_free process_exit;
> >>> > do
> >>> > echo 1 > ${DIR}/kernel/$a/enable
> >>> > done
> >>> >
> >>> > running tbench -t 200 8
> >>> >
> >>> > Vanilla kernel : 2070 MB/sec
> >>> > Flight recorder : 1693 MB/sec Performance impact : 18.2 %
> >>> > Normal : 1620 MB/sec Performance impact : 21.7 % (8 lttd threads)
> >>> >
> >>> > So now we are comparing apples with apples :-)
> >>> Looks like the performance of LTTng has improved a lot over the past
> >>> few months. I will run some tests with the latest LTTng patches and
> >>> let you know if I see any difference. Did you use 2.6.29-rc7 kernel in
> >>> your tests?
> >>>
> >>
> >> Yes, kernel 2.6.29-rc7, lttng 0.111. But lttng 0.112 for kernel
> >> 2.6.29-rc8 should also be close enough.
> >>
> >>> >
> >>> >> >
> >>> >> > First I'd like to express my concern about the way you posted your
> >>> >> > numbers :
> >>> >> >
> >>> >> > 1 - You did not CC anyone from the LTTng project.
> >>> >> > 2 - You did not tell which LTTng version you used as a comparison basis.
> >>> >> > 3 - You did not look at the current state of code before posting your
> >>> >> > comparison. (as shows the comment in your patch header)
> >>> >> That is indeed my mistake.
> >>> >>
> >>> >> > 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
> >>> >> > the default instrumentation ? Did you also enable system call tracing
> >>> >> > with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
> >>> >> > makes a huge difference on the amount of information gathered.
> >>> >> Yes. We also enable syscall tracing via set_kernel_trace_flag_all_tasks().
> >>> >>
> >>> >> >
> >>> >> > So let's have a close look at them. I'm re-doing the benchmarks based on
> >>> >> > the current LTTng version below.
> >>> >> >
> >>> >> >> Hello All,
> >>> >> >>
> >>> >> >> We have been working on building a kernel tracer for a production environment.
> >>> >> >> We first considered to use the existing markers code, but found it a bit
> >>> >> >> heavyweight.
> >>> >> >
> >>> >> > Have you looked at :
> >>> >> >
> >>> >> > 1 - The tracepoints currently in the kernel
> >>> >> > 2 - The _current_ version of markers in the LTTng tree ?
> >>> >>
> >>> >> I looked at the tracepoints currently in the kernel. It is great to see that
> >>> >> some trace instrumentation code is finally available in the mainline kernel.
> >>> >> But I still don't like an additional tracepoint layer on top of the
> >>> >> actual tracing
> >>> >> (ktrace or markers or ftrace) layer. That seems to add unnecessary overhead.
> >>> >> The markers code I referred to in my email is the current code in the mainline
> >>> >> kernel. I had a look at the LTTng markers code but still saw that a va_list
> >>> >> argument is used in marker_probe_func. It sounds like a nice improvement
> >>> >> if the LTTng markers avoid the use of va_arg whenever possible.
> >>> >>
> >>> >
> >>> > Yes, va_args is only there as a convenience for slow paths.
> >>> >
> >>> >> >
> >>> >> >> So we developed a new kernel tracing prototype that shares
> >>> >> >> many similarities with markers but also includes several simplifications that
> >>> >> >> give us considerable performance benefits. For example, we pass the size of
> >>> >> >> trace event directly to the probe function instead of using va_arg as that
> >>> >> >> used in markers, which saves us the overhead of variable passing.
> >>> >> >
> >>> >> > The markers in the LTTng tree only use va_args for "debug-style"
> >>> >> > tracing. All the tracing fast paths are now done with custom probe
> >>> >> > functions which takes the arguments from the tracepoint and write them
> >>> >> > in the trace buffers directly in C. When the event payload size is already
> >>> >> > known statically, there is no dynamic computation cost at all : it's
> >>> >> > simply passed to the ltt-relay write function.
> >>> >> But this raises another concern that sub-system developers may need
> >>> >> to write a bunch of custom probe functions to avoid the argument parsing
> >>> >> overhead. Can you instead have a general probe function or macro that
> >>> >> does this work?
> >>> >>
> >>> >
> >>> > Well, considering the amount of care we have to take when adding a
> >>> > tracepoint to the mainline kernel, I think it's even a *good* thing
> >>> > that someone has to stop and write a small callback to deal with
> >>> > the information he wants to export. I think debug-style tracing should
> >>> > be very, very quick and easy to add, but tracepoints meant to make it
> >>> > into the mainline do not share this requirement.
> >>> I agree that we need to be careful when adding tracepoints into the kernel,
> >>> but I still think it is good to simplify the process of adding/maintaining
> >>> tracepoints so sub-system maintainers don't feel it is a big burden.
> >>>
> >>
> >> Yes, we have to make sure it's not an impossible task.
> >>
> >>> >
> >>> >> >
> >>> >> >> Also, we
> >>> >> >> associate a single probe function to each event, so we can directly call the
> >>> >> >> probe function when tracing is enabled.
> >>> >> >
> >>> >> > You seem to imply that being stuck with a single probe function is
> >>> >> > somehow superior to have tiny per-event callbacks which then call into a
> >>> >> > buffer write function ? Why ?
> >>> >> That is indeed my thinking, because we saw that some platforms are
> >>> >> very sensitive to function call overhead and we haven't seen use cases
> >>> >> where multiple probe functions prove necessary.
> >>> >>
> >>> >
> >>> > Hrm, so your plan is to embed the tracing code at the caller site, or to
> >>> > keep one single function call ?
> >>> We do have a general probe function that is called when tracing is enabled.
> >>> Other than that, the main tracing code is defined as an inline function that is
> >>> embedded at the caller site.
> >>>
> >>
> >> You mean that you embed the main tracing code directly in the
> >> instrumented kernel functions ? Hrm, have you considered the calling site
> >> pollution impact that might have ? The compiler might start having
> >> difficulty optimising hot paths that would then become large functions.
> >> Or maybe I am not correctly understanding your approach.
> > Sorry. I was actually trying to say the main instrumentation code is
> > embedded at the caller site. The main tracing code is in the general probe
> > function.
> >
> > Jiaying
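Distilled, the pattern being described is an inlined enable check plus an
out-of-line probe call (illustrative macro, not the actual ktrace code):

	#define TRACE_SITE(ev, args...)					\
		do {							\
			if (unlikely((ev).enabled))			\
				(ev).func(&(ev), args);	/* one call */	\
		} while (0)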
> >
> >>
> >> Mathieu
> >>
> >>>
> >>> Jiaying
> >>>
> >>> >
> >>> >> >
> >>> >> >> Our measurements show that our
> >>> >> >> current tracing prototype introduces much lower overhead compared with
> >>> >> >> lttng (we use lttng as a comparison because markers does not provide the
> >>> >> >> implementation to record trace data).
> >>> >> >
> >>> >> > Which LTTng version, kernel version, LTTng config and runtime options do
> >>> >> > you use ?
> >>> >> See above for LTTng and kernel versions. I was using the default LTTng
> >>> >> config options and 'normal' recorder mode.
> >>> >>
> >>> >> >
> >>> >> >> E.g., with dbench, lttng introduces
> >>> >> >> 20~80% overhead on our testing platforms while the overhead with our
> >>> >> >> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
> >>> >> >> overhead while the overhead added by our kernel tracer is usually within
> >>> >> >> 15%; with kernbench (time to compile a Linux kernel), the overhead
> >>> >> >> added by lttng is around 4%~18% while the overhead added by our
> >>> >> >> kernel tracer is within 2.5%.
> >>> >> >>
> >>> >> >
> >>> >> > I'm re-running the dbench and tbench tests :
> >>> >> >
> >>> >> > With LTTng 0.106, kernel 2.6.29-rc7
> >>> >> > 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
> >>> >> >
> >>> >> > Arming all the default LTTng instrumentation (168 tracepoints),
> >>> >> > Tracing in flight recorder mode, including system call entry/exit
> >>> >> > tracing and block I/O tracing (blktrace instrumentation), I get :
> >>> >> >
> >>> >> > Test / config | Vanilla kernel | LTTng 0.106 | slowdown
> >>> >> > dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
> >>> >> >
> >>> >> > As an example of the effect of custom probe vs va_args based probes,
> >>> >> > let's take the tbench example
> >>> >> >
> >>> >> > | Vanilla | LTTng 0.106 | slowdown
> >>> >> > tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
> >>> >> >
> >>> >> > The LTTng trace statistics shows the top high event rate events :
> >>> >> >
> >>> >> > Some are usual (therefore lttng already implements custom probes for them) :
> >>> >> >
> >>> >> > 42004 sched_try_wakeup
> >>> >> > 79754 timer_set
> >>> >> > 83973 sched_schedule
> >>> >> > 84317 softirq_exit
> >>> >> > 84318 softirq_entry
> >>> >> > 95370 page_free
> >>> >> > 95412 page_alloc
> >>> >> > 267247 syscall_entry
> >>> >> > 267247 syscall_exit
> >>> >> >
> >>> >> > Some are more tbench workload-specific :
> >>> >> >
> >>> >> > 16915 socket_sendmsg
> >>> >> > 18911 napi_complete
> >>> >> > 18911 napi_schedule
> >>> >> > 18912 dev_receive
> >>> >> > 18914 dev_xmit
> >>> >> > 18920 napi_poll
> >>> >> > 33831 socket_recvmsg
> >>> >> > 59860 read
> >>> >> >
> >>> >> >
> >>> >> > So let's see what happens if I implement those as custom probes and do a bit of
> >>> >> > tuning in the tracer hot path :
> >>> >> Could you give more details about the custom probe you used?
> >>> >
> >>> > They are available in the lttng tree on git.kernel.org in
> >>> > ltt/probes/*-trace.c. Their API is in include/linux/ltt-type-serialize.h
> >>> >
> >>> >> Do you need to write a custom probe for each tracepoint?
> >>> >>
> >>> >
> >>> > Yes. But it's not such a big deal. It looks a bit like Steve's
> >>> > TRACE_EVENT macro-fu, but written in a simple C callback :
> >>> >
> >>> > void probe_sched_switch(struct rq *rq, struct task_struct *prev,
> >>> > 		struct task_struct *next);
> >>> >
> >>> > DEFINE_MARKER_TP(kernel, sched_schedule, sched_switch, probe_sched_switch,
> >>> > 	"prev_pid %d next_pid %d prev_state #2d%ld");
> >>> >
> >>> > notrace void probe_sched_switch(struct rq *rq, struct task_struct *prev,
> >>> > 		struct task_struct *next)
> >>> > {
> >>> > 	struct marker *marker;
> >>> > 	struct serialize_int_int_short data;
> >>> >
> >>> > 	data.f1 = prev->pid;
> >>> > 	data.f2 = next->pid;
> >>> > 	data.f3 = prev->state;
> >>> >
> >>> > 	marker = &GET_MARKER(kernel, sched_schedule);
> >>> > 	ltt_specialized_trace(marker, marker->single.probe_private,
> >>> > 			&data, serialize_sizeof(data), sizeof(int));
> >>> > }
> >>> >
> >>> > Best regards,
> >>> >
> >>> > Mathieu
> >>> >
> >>> >> Jiaying
> >>> >>
> >>> >> >
> >>> >> > | Vanilla | LTTng 0.107 | slowdown
> >>> >> > Custom probes :
> >>> >> > tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
> >>> >> >
> >>> >> > With tracer hot path tuning (inline hot path, function calls for slow path, will
> >>> >> > be in 0.108) :
> >>> >> > tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
> >>> >> >
> >>> >> > kernbench :
> >>> >> > Average Optimal load -j 32 Run
> >>> >> >
> >>> >> > Vanilla 2.6.29-rc7 :
> >>> >> > Elapsed Time 48.238 (0.117346)
> >>> >> >
> >>> >> > With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
> >>> >> > Elapsed Time 50.226 (0.0750333)
> >>> >> >
> >>> >> > Slowdown : 4.1 %
> >>> >> >
> >>> >> > So except for the tbench workload on LTTng 0.106, which still had a 52%
> >>> >> > performance impact (brought down to 35 % for the next LTTng release) due to
> >>> >> > the specific workload and to not having specialized probes implemented, the
> >>> >> > dbench impact is well below your results (12.3 % vs 20-80 %).
> >>> >> >
> >>> >> > So given the huge result discrepancies between your tests and mine, could you
> >>> >> > give us more details about your benchmarks ? In particular, it would be good
> >>> >> > to know the enabled instrumentation set, the amount of MB/s generated by both
> >>> >> > the ktrace and LTTng runs (to make sure we are comparing the same traffic),
> >>> >> > and the LTTng version you are using as a comparison basis.
> >>> >> >
> >>> >> > Thanks,
> >>> >> >
> >>> >> > Mathieu
> >>> >> >
> >>> >> >
> >>> >> >> The patch below is the current prototype we have implemented. It uses
> >>> >> >> Steve's unified trace buffer code with some small extensions to support
> >>> >> >> buffer mmap. The patch is probably still buggy, but we would like to get
> >>> >> >> some early feedback from the community.
> >>> >> >>
> >>> >> >> Signed-off-by: Jiaying Zhang <[email protected]>
> >>> >> >>
> >>> >> >> Index: git-linux/init/Kconfig
> >>> >> >> ===================================================================
> >>> >> >> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
> >>> >> >> +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
> >>> >> >> @@ -950,6 +950,15 @@
> >>> >> >> Place an empty function call at each marker site. Can be
> >>> >> >> dynamically changed for a probe function.
> >>> >> >>
> >>> >> >> +config KTRACE
> >>> >> >> + bool "Enable ktrace support"
> >>> >> >> + select DEBUG_FS
> >>> >> >> + select TRACING
> >>> >> >> + select RING_BUFFER
> >>> >> >> + help
> >>> >> >> + Ktrace is a kernel tracing tool that allows you to trace
> >>> >> >> + kernel events by inserting trace points at proper places.
> >>> >> >> +
> >>> >> >> source "arch/Kconfig"
> >>> >> >>
> >>> >> >> endmenu # General setup
> >>> >> >> Index: git-linux/kernel/Makefile
> >>> >> >> ===================================================================
> >>> >> >> --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
> >>> >> >> +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
> >>> >> >> @@ -86,6 +86,7 @@
> >>> >> >> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> >>> >> >> obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
> >>> >> >> obj-$(CONFIG_MARKERS) += marker.o
> >>> >> >> +obj-$(CONFIG_KTRACE) += ktrace.o
> >>> >> >> obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
> >>> >> >> obj-$(CONFIG_LATENCYTOP) += latencytop.o
> >>> >> >> obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
> >>> >> >> Index: git-linux/include/asm-generic/vmlinux.lds.h
> >>> >> >> ===================================================================
> >>> >> >> --- git-linux.orig/include/asm-generic/vmlinux.lds.h	2009-02-19 14:58:37.000000000 -0800
> >>> >> >> +++ git-linux/include/asm-generic/vmlinux.lds.h	2009-02-19 14:58:39.000000000 -0800
> >>> >> >> @@ -73,6 +73,9 @@
> >>> >> >> MEM_KEEP(init.data) \
> >>> >> >> MEM_KEEP(exit.data) \
> >>> >> >> . = ALIGN(8); \
> >>> >> >> + VMLINUX_SYMBOL(__start___ktraces) = .; \
> >>> >> >> + *(__ktrace) \
> >>> >> >> + VMLINUX_SYMBOL(__stop___ktraces) = .; \
> >>> >> >> VMLINUX_SYMBOL(__start___markers) = .; \
> >>> >> >> *(__markers) \
> >>> >> >> VMLINUX_SYMBOL(__stop___markers) = .; \
> >>> >> >> @@ -89,6 +92,7 @@
> >>> >> >> VMLINUX_SYMBOL(__start_rodata) = .; \
> >>> >> >> *(.rodata) *(.rodata.*) \
> >>> >> >> *(__vermagic) /* Kernel version magic */ \
> >>> >> >> + *(__ktrace_strings) /* Ktrace: strings */ \
> >>> >> >> *(__markers_strings) /* Markers: strings */ \
> >>> >> >> *(__tracepoints_strings)/* Tracepoints: strings */ \
> >>> >> >> } \
> >>> >> >> Index: git-linux/kernel/module.c
> >>> >> >> ===================================================================
> >>> >> >> --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
> >>> >> >> +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
> >>> >> >> @@ -50,6 +50,7 @@
> >>> >> >> #include <asm/sections.h>
> >>> >> >> #include <linux/tracepoint.h>
> >>> >> >> #include <linux/ftrace.h>
> >>> >> >> +#include <linux/ktrace.h>
> >>> >> >> #include <linux/async.h>
> >>> >> >>
> >>> >> >> #if 0
> >>> >> >> @@ -2145,6 +2146,10 @@
> >>> >> >> sizeof(*mod->tracepoints),
> >>> >> >> &mod->num_tracepoints);
> >>> >> >> #endif
> >>> >> >> +#ifdef CONFIG_KTRACE
> >>> >> >> + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
> >>> >> >> + sizeof(*mod->ktrace), &mod->num_ktrace);
> >>> >> >> +#endif
> >>> >> >>
> >>> >> >> #ifdef CONFIG_MODVERSIONS
> >>> >> >> if ((mod->num_syms && !mod->crcs)
> >>> >> >> Index: git-linux/include/linux/module.h
> >>> >> >> ===================================================================
> >>> >> >> --- git-linux.orig/include/linux/module.h	2009-02-19 14:58:37.000000000 -0800
> >>> >> >> +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
> >>> >> >> @@ -44,6 +44,7 @@
> >>> >> >> };
> >>> >> >>
> >>> >> >> struct module;
> >>> >> >> +struct kernel_trace;
> >>> >> >>
> >>> >> >> struct module_attribute {
> >>> >> >> struct attribute attr;
> >>> >> >> @@ -347,6 +348,11 @@
> >>> >> >> /* Reference counts */
> >>> >> >> struct module_ref ref[NR_CPUS];
> >>> >> >> #endif
> >>> >> >> +#ifdef CONFIG_KTRACE
> >>> >> >> + struct kernel_trace *ktrace;
> >>> >> >> + unsigned int num_ktrace;
> >>> >> >> +#endif
> >>> >> >> +
> >>> >> >> };
> >>> >> >> #ifndef MODULE_ARCH_INIT
> >>> >> >> #define MODULE_ARCH_INIT {}
> >>> >> >> Index: git-linux/include/linux/ktrace.h
> >>> >> >> ===================================================================
> >>> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> >>> >> >> +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
> >>> >> >> @@ -0,0 +1,83 @@
> >>> >> >> +#ifndef _LINUX_KTRACE_H
> >>> >> >> +#define _LINUX_KTRACE_H
> >>> >> >> +
> >>> >> >> +#include <linux/types.h>
> >>> >> >> +
> >>> >> >> +struct kernel_trace;
> >>> >> >> +
> >>> >> >> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
> >>> >> >> +
> >>> >> >> +struct kernel_trace {
> >>> >> >> + const char *name;
> >>> >> >> + const char *format; /* format string describing variable list */
> >>> >> >> + size_t *stroff; /* offsets of string variables */
> >>> >> >> + /* the 31-bit event_id is converted to 16 bits when written to the buffer */
> >>> >> >> + u32 enabled:1, event_id:31;
> >>> >> >> + ktrace_probe_func *func; /* probe function */
> >>> >> >> + struct list_head list; /* list head linked to the hash table entry */
> >>> >> >> +} __attribute__((aligned(8)));
> >>> >> >> +
> >>> >> >> +extern int ktrace_enabled;
> >>> >> >> +
> >>> >> >> +/*
> >>> >> >> + * Make sure the alignment of the structure in the __ktrace section will
> >>> >> >> + * not add unwanted padding between the beginning of the section and the
> >>> >> >> + * structure. Force alignment to the same alignment as the section start.
> >>> >> >> + */
> >>> >> >> +
> >>> >> >> +#define DEFINE_KTRACE_STRUCT(name) \
> >>> >> >> + struct __attribute__((packed)) ktrace_struct_##name
> >>> >> >> +
> >>> >> >> +#ifdef CONFIG_KTRACE
> >>> >> >> +/*
> >>> >> >> + * DO_TRACE is the macro that is called from each trace point. Each trace
> >>> >> >> + * point is associated with a static ktrace object that is located in a
> >>> >> >> + * special __ktrace section and is uniquely identified by the event name and
> >>> >> >> + * event format. Take a look at the files under include/trace for example
> >>> >> >> + * usage. Note that we call the trace probe function only if the global
> >>> >> >> + * tracing is enabled and the tracing for the associated event is enabled,
> >>> >> >> + * so we only introduce the overhead of a predicted conditional branch when
> >>> >> >> + * tracing is disabled.
> >>> >> >> + */
> >>> >> >> +#define DO_TRACE(name, format, args...) \
> >>> >> >> + do { \
> >>> >> >> + static const char __kstrtab_##name[] \
> >>> >> >> + __attribute__((section("__ktrace_strings"))) \
> >>> >> >> + = #name "\0" format; \
> >>> >> >> + static struct kernel_trace __ktrace_##name \
> >>> >> >> + __attribute__((section("__ktrace"), aligned(8))) = \
> >>> >> >> + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
> >>> >> >> + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
> >>> >> >> + __ktrace_check_format(format, ## args); \
> >>> >> >> + if (unlikely(ktrace_enabled) && \
> >>> >> >> + unlikely(__ktrace_##name.enabled)) { \
> >>> >> >> + struct ktrace_struct_##name karg = { args }; \
> >>> >> >> + (*__ktrace_##name.func) \
> >>> >> >> + (&__ktrace_##name, &karg, sizeof(karg)); \
> >>> >> >> + } \
> >>> >> >> + } while (0)
> >>> >> >> +
> >>> >> >> +#else /* !CONFIG_KTRACE */
> >>> >> >> +#define DO_TRACE(name, format, args...)
> >>> >> >> +#endif
> >>> >> >> +
> >>> >> >> +/* To be used for string format validity checking with gcc */
> >>> >> >> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
> >>> >> >> +{
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +#define __ktrace_check_format(format, args...) \
> >>> >> >> + do { \
> >>> >> >> + if (0) \
> >>> >> >> + ___ktrace_check_format(format, ## args); \
> >>> >> >> + } while (0)
> >>> >> >> +
> >>> >> >> +
> >>> >> >> +/* get the trace buffer information */
> >>> >> >> +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
> >>> >> >> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
> >>> >> >> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
> >>> >> >> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
> >>> >> >> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
> >>> >> >> +
> >>> >> >> +#endif
> >>> >> >> Index: git-linux/kernel/ktrace.c
> >>> >> >> ===================================================================
> >>> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> >>> >> >> +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
> >>> >> >> @@ -0,0 +1,872 @@
> >>> >> >> +/*
> >>> >> >> + * kernel/ktrace.c
> >>> >> >> + *
> >>> >> >> + * Implementation of the kernel tracing for linux kernel 2.6.
> >>> >> >> + *
> >>> >> >> + * Copyright 2008- Google Inc.
> >>> >> >> + * Original Author: Jiaying Zhang
> >>> >> >> + */
> >>> >> >> +
> >>> >> >> +#include <linux/module.h>
> >>> >> >> +#include <linux/kernel.h>
> >>> >> >> +#include <linux/bitmap.h>
> >>> >> >> +#include <linux/poll.h>
> >>> >> >> +#include <linux/errno.h>
> >>> >> >> +#include <linux/sched.h>
> >>> >> >> +#include <linux/linkage.h>
> >>> >> >> +#include <linux/debugfs.h>
> >>> >> >> +#include <linux/timer.h>
> >>> >> >> +#include <linux/delay.h>
> >>> >> >> +#include <linux/jhash.h>
> >>> >> >> +#include <linux/ctype.h>
> >>> >> >> +#include <linux/namei.h>
> >>> >> >> +#include <linux/ktrace.h>
> >>> >> >> +#include <linux/ring_buffer.h>
> >>> >> >> +
> >>> >> >> +static char *ktrace_version = "1.0.0";
> >>> >> >> +int ktrace_enabled;
> >>> >> >> +EXPORT_SYMBOL(ktrace_enabled);
> >>> >> >> +static uint16_t ktrace_next_id;
> >>> >> >> +
> >>> >> >> +extern struct kernel_trace __start___ktraces[];
> >>> >> >> +extern struct kernel_trace __stop___ktraces[];
> >>> >> >> +
> >>> >> >> +static DEFINE_MUTEX(ktrace_mutex);
> >>> >> >> +
> >>> >> >> +/* ktrace hash table entry structure and variables */
> >>> >> >> +struct ktrace_entry {
> >>> >> >> + struct hlist_node hlist;
> >>> >> >> + char *name;
> >>> >> >> + char *format;
> >>> >> >> + size_t *stroff;
> >>> >> >> + ktrace_probe_func *func;
> >>> >> >> + u32 enabled:1, event_id:31;
> >>> >> >> + struct list_head klist; /* list of loaded ktraces */
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +#define KTRACE_HASH_BITS 6
> >>> >> >> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
> >>> >> >> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
> >>> >> >> +
> >>> >> >> +/* debugfs directory variables */
> >>> >> >> +static struct dentry *tracedir;
> >>> >> >> +static struct dentry *eventdir;
> >>> >> >> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
> >>> >> >> +
> >>> >> >> +/* ring buffer code */
> >>> >> >> +static unsigned long trace_buf_size = 65536UL;
> >>> >> >> +struct ring_buffer *trace_buffer;
> >>> >> >> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
> >>> >> >> +static struct timer_list trace_timer; /* reader wake-up timer */
> >>> >> >> +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
> >>> >> >> +static struct kref ktrace_kref;
> >>> >> >> +
> >>> >> >> +static void
> >>> >> >> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
> >>> >> >> +{
> >>> >> >> + struct ring_buffer_event *event;
> >>> >> >> + void *trace_event;
> >>> >> >> + unsigned long flag;
> >>> >> >> +
> >>> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
> >>> >> >> + sizeof(uint16_t) + event_size, &flag);
> >>> >> >> + if (!event)
> >>> >> >> + return;
> >>> >> >> + trace_event = ring_buffer_event_data(event);
> >>> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
> >>> >> >> + memcpy(trace_event + sizeof(uint16_t), data, event_size);
> >>> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +/* special probe function for events that contain string arguments */
> >>> >> >> +static void
> >>> >> >> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
> >>> >> >> +{
> >>> >> >> + struct ring_buffer_event *event;
> >>> >> >> + void *trace_event, *p;
> >>> >> >> + unsigned long flag;
> >>> >> >> + size_t *offset, scanned;
> >>> >> >> + char *string;
> >>> >> >> +
> >>> >> >> + /*
> >>> >> >> + * Get the real length of the event, i.e., we use stroff array to
> >>> >> >> + * locate string variables in the passed-in trace data and use their
> >>> >> >> + * length to replace the size of the string pointers.
> >>> >> >> + */
> >>> >> >> + for (offset = kt->stroff; *offset != -1; offset++) {
> >>> >> >> + string = *(char **) (data + *offset);
> >>> >> >> + event_size += strlen(string) + 1 - sizeof(char *);
> >>> >> >> + }
> >>> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
> >>> >> >> + sizeof(uint16_t) + event_size, &flag);
> >>> >> >> + if (!event)
> >>> >> >> + return;
> >>> >> >> + trace_event = ring_buffer_event_data(event);
> >>> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
> >>> >> >> + p = trace_event + sizeof(uint16_t);
> >>> >> >> + /*
> >>> >> >> + * Copy the trace data into buffer. For string variables, we enter the
> >>> >> >> + * string into the buffer. Otherwise, the passed in data is copied.
> >>> >> >> + */
> >>> >> >> + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
> >>> >> >> + memcpy(p, data + scanned, *offset - scanned);
> >>> >> >> + p += *offset - scanned;
> >>> >> >> + string = *(char **) (data + *offset);
> >>> >> >> + memcpy(p, string, strlen(string) + 1);
> >>> >> >> + p += strlen(string) + 1;
> >>> >> >> + scanned = *offset + sizeof(char *);
> >>> >> >> + }
> >>> >> >> + memcpy(p, data + scanned,
> >>> >> >> + trace_event + sizeof(uint16_t) + event_size - p);
> >>> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +/* timer function used to defer ring buffer reader waking */
> >>> >> >> +static void wakeup_readers(unsigned long data)
> >>> >> >> +{
> >>> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> >>> >> >> + wake_up_interruptible(&trace_wait);
> >>> >> >> + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +/* function for reading ktrace metadata info */
> >>> >> >> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
> >>> >> >> +{
> >>> >> >> + seq_printf(seq, "version %s\n", ktrace_version);
> >>> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
> >>> >> >> +{
> >>> >> >> + ++*pos;
> >>> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int tracing_info_show(struct seq_file *seq, void *v)
> >>> >> >> +{
> >>> >> >> + struct hlist_head *head = v;
> >>> >> >> + struct hlist_node *node;
> >>> >> >> + struct ktrace_entry *entry;
> >>> >> >> + hlist_for_each_entry(entry, node, head, hlist)
> >>> >> >> + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
> >>> >> >> + entry->name, entry->format, entry->event_id,
> >>> >> >> + entry->enabled ? "enabled" : "disabled");
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static void tracing_info_stop(struct seq_file *seq, void *v)
> >>> >> >> +{
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static const struct seq_operations tracing_info_seq_ops = {
> >>> >> >> + .start = tracing_info_start,
> >>> >> >> + .next = tracing_info_next,
> >>> >> >> + .show = tracing_info_show,
> >>> >> >> + .stop = tracing_info_stop,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +static int tracing_info_open(struct inode *inode, struct file *file)
> >>> >> >> +{
> >>> >> >> + return seq_open(file, &tracing_info_seq_ops);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static const struct file_operations traceinfo_file_operations = {
> >>> >> >> + .open = tracing_info_open,
> >>> >> >> + .read = seq_read,
> >>> >> >> + .llseek = seq_lseek,
> >>> >> >> + .release = seq_release,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +/*
> >>> >> >> + * wrapper function used by debugfs write operation.
> >>> >> >> + * func: handling function that does real work
> >>> >> >> + */
> >>> >> >> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
> >>> >> >> + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
> >>> >> >> +{
> >>> >> >> + int ret;
> >>> >> >> + char buf[64];
> >>> >> >> + unsigned long val;
> >>> >> >> +
> >>> >> >> + if (cnt >= sizeof(buf))
> >>> >> >> + return -EINVAL;
> >>> >> >> + if (copy_from_user(&buf, ubuf, cnt))
> >>> >> >> + return -EFAULT;
> >>> >> >> + buf[cnt] = 0;
> >>> >> >> +
> >>> >> >> + ret = strict_strtoul(buf, 10, &val);
> >>> >> >> + if (ret < 0)
> >>> >> >> + return ret;
> >>> >> >> + val = !!val;
> >>> >> >> +
> >>> >> >> + func(filp, val, &cnt);
> >>> >> >> +
> >>> >> >> + filp->f_pos += cnt;
> >>> >> >> + return cnt;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +/* functions for reading/writing the global 'enabled' control file */
> >>> >> >> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
> >>> >> >> + size_t cnt, loff_t *ppos)
> >>> >> >> +{
> >>> >> >> + int ret;
> >>> >> >> + char buf[64];
> >>> >> >> + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
> >>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static void __tracing_control_write(struct file *filp,
> >>> >> >> + unsigned long val, size_t *cnt)
> >>> >> >> +{
> >>> >> >> + if (val ^ ktrace_enabled) {
> >>> >> >> + if (val) {
> >>> >> >> + trace_timer.expires =
> >>> >> >> + jiffies + TRACEREAD_WAKEUP_INTERVAL;
> >>> >> >> + add_timer(&trace_timer);
> >>> >> >> + ktrace_enabled = 1;
> >>> >> >> + } else {
> >>> >> >> + ktrace_enabled = 0;
> >>> >> >> + del_timer_sync(&trace_timer);
> >>> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> >>> >> >> + wake_up_interruptible(&trace_wait);
> >>> >> >> + }
> >>> >> >> + }
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static ssize_t tracing_control_write(struct file *filp, const char __user *ubuf,
> >>> >> >> + size_t cnt, loff_t *ppos)
> >>> >> >> +{
> >>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> >>> >> >> + __tracing_control_write);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static const struct file_operations tracecontrol_file_operations = {
> >>> >> >> + .read = tracing_control_read,
> >>> >> >> + .write = tracing_control_write,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +/* functions for reading/writing a trace event file */
> >>> >> >> +static int update_ktrace(struct ktrace_entry *entry,
> >>> >> >> + ktrace_probe_func *func, int enabled)
> >>> >> >> +{
> >>> >> >> + struct kernel_trace *iter;
> >>> >> >> + /* no need to update the list if the tracing is not initialized */
> >>> >> >> + if (!tracedir)
> >>> >> >> + return 0;
> >>> >> >> + entry->enabled = enabled;
> >>> >> >> + entry->func = func;
> >>> >> >> + list_for_each_entry(iter, &entry->klist, list) {
> >>> >> >> + iter->stroff = entry->stroff;
> >>> >> >> + iter->func = func;
> >>> >> >> + iter->enabled = enabled;
> >>> >> >> + }
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int ktrace_probe_register(struct ktrace_entry *entry,
> >>> >> >> + ktrace_probe_func *probefunc)
> >>> >> >> +{
> >>> >> >> + return update_ktrace(entry, probefunc, 1);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
> >>> >> >> +{
> >>> >> >> + return update_ktrace(entry, NULL, 0);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int ktrace_control_open(struct inode *inode, struct file *filp)
> >>> >> >> +{
> >>> >> >> + filp->private_data = inode->i_private;
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
> >>> >> >> + size_t cnt, loff_t *ppos)
> >>> >> >> +{
> >>> >> >> + int ret;
> >>> >> >> + char buf[64];
> >>> >> >> + struct ktrace_entry *entry = filp->private_data;
> >>> >> >> +
> >>> >> >> + ret = snprintf(buf, 64, "%u\n", entry->enabled);
> >>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +/*
> >>> >> >> + * Check whether the event contains string arguments. Return 0 if not;
> >>> >> >> + * otherwise record the offsets of the string arguments in the ktrace
> >>> >> >> + * structure according to the event format and return 1.
> >>> >> >> + * The offset info is later used to store the string arguments into
> >>> >> >> + * the trace buffer during trace probe.
> >>> >> >> + */
> >>> >> >> +static int contain_string_arguments(struct ktrace_entry *entry)
> >>> >> >> +{
> >>> >> >> + int count = 0;
> >>> >> >> + char *fmt;
> >>> >> >> + int qualifier; /* 'h', 'l', or 'L' for integer fields */
> >>> >> >> + size_t offset = 0;
> >>> >> >> +
> >>> >> >> + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
> >>> >> >> + count++;
> >>> >> >> + if (!count)
> >>> >> >> + return 0;
> >>> >> >> + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
> >>> >> >> + if (!entry->stroff)
> >>> >> >> + return -ENOMEM;
> >>> >> >> +
> >>> >> >> + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
> >>> >> >> + if (*fmt != '%')
> >>> >> >> + continue;
> >>> >> >> +repeat:
> >>> >> >> + ++fmt;
> >>> >> >> + switch (*fmt) {
> >>> >> >> + case '-':
> >>> >> >> + case '+':
> >>> >> >> + case ' ':
> >>> >> >> + case '#':
> >>> >> >> + case '0':
> >>> >> >> + ++fmt;
> >>> >> >> + goto repeat;
> >>> >> >> + }
> >>> >> >> +
> >>> >> >> + while (isdigit(*fmt))
> >>> >> >> + fmt++;
> >>> >> >> +
> >>> >> >> + /* get the conversion qualifier */
> >>> >> >> + qualifier = -1;
> >>> >> >> + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
> >>> >> >> + qualifier = *fmt;
> >>> >> >> + ++fmt;
> >>> >> >> + if (qualifier == 'l' && *fmt == 'l') {
> >>> >> >> + qualifier = 'L';
> >>> >> >> + ++fmt;
> >>> >> >> + }
> >>> >> >> + }
> >>> >> >> +
> >>> >> >> + switch (*fmt) {
> >>> >> >> + case 'c':
> >>> >> >> + offset += sizeof(char);
> >>> >> >> + continue;
> >>> >> >> + case 's':
> >>> >> >> + entry->stroff[count] = offset;
> >>> >> >> + count++;
> >>> >> >> + offset += sizeof(char *);
> >>> >> >> + continue;
> >>> >> >> + case 'p':
> >>> >> >> + offset += sizeof(void *);
> >>> >> >> + continue;
> >>> >> >> + case 'd':
> >>> >> >> + case 'i':
> >>> >> >> + case 'o':
> >>> >> >> + case 'u':
> >>> >> >> + case 'x':
> >>> >> >> + case 'X':
> >>> >> >> + switch (qualifier) {
> >>> >> >> + case 'L':
> >>> >> >> + offset += sizeof(long long);
> >>> >> >> + break;
> >>> >> >> + case 'l':
> >>> >> >> + offset += sizeof(long);
> >>> >> >> + break;
> >>> >> >> + case 'h':
> >>> >> >> + offset += sizeof(short);
> >>> >> >> + break;
> >>> >> >> + default:
> >>> >> >> + offset += sizeof(int);
> >>> >> >> + break;
> >>> >> >> + }
> >>> >> >> + break;
> >>> >> >> + default:
> >>> >> >> + printk(KERN_WARNING "unknown format '%c'", *fmt);
> >>> >> >> + continue;
> >>> >> >> + }
> >>> >> >> + }
> >>> >> >> + entry->stroff[count] = -1;
> >>> >> >> + return 1;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static void __ktrace_control_write(struct file *filp,
> >>> >> >> + unsigned long val, size_t *cnt)
> >>> >> >> +{
> >>> >> >> + struct ktrace_entry *entry = filp->private_data;
> >>> >> >> + int ret;
> >>> >> >> +
> >>> >> >> + mutex_lock(&ktrace_mutex);
> >>> >> >> + if (val ^ entry->enabled) {
> >>> >> >> + if (val) {
> >>> >> >> + ret = contain_string_arguments(entry);
> >>> >> >> + if (ret == 0)
> >>> >> >> + ktrace_probe_register(entry, ring_buffer_probe);
> >>> >> >> + else if (ret > 0)
> >>> >> >> + ktrace_probe_register(entry, string_probe);
> >>> >> >> + else
> >>> >> >> + *cnt = ret;
> >>> >> >> + } else
> >>> >> >> + ktrace_probe_unregister(entry);
> >>> >> >> + }
> >>> >> >> + mutex_unlock(&ktrace_mutex);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
> >>> >> >> + size_t cnt, loff_t *ppos)
> >>> >> >> +{
> >>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> >>> >> >> + __ktrace_control_write);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static const struct file_operations ktracecontrol_file_operations = {
> >>> >> >> + .open = ktrace_control_open,
> >>> >> >> + .read = ktrace_control_read,
> >>> >> >> + .write = ktrace_control_write,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +/* functions for adding/removing a trace event. Protected by mutex lock.
> >>> >> >> + * Called during initialization or after loading a module */
> >>> >> >> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
> >>> >> >> +{
> >>> >> >> + struct ktrace_entry *entry;
> >>> >> >> + size_t name_len = strlen(ktrace->name) + 1;
> >>> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
> >>> >> >> + struct hlist_head *head;
> >>> >> >> + struct hlist_node *node;
> >>> >> >> + struct dentry *dentry;
> >>> >> >> +
> >>> >> >> + if (!tracedir)
> >>> >> >> + return 0;
> >>> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> >>> >> >> + hlist_for_each_entry(entry, node, head, hlist) {
> >>> >> >> + if (!strcmp(ktrace->name, entry->name)) {
> >>> >> >> + if (strcmp(ktrace->format, entry->format)) {
> >>> >> >> + printk(KERN_WARNING "the format of tracepoint "
> >>> >> >> + "\'%s\' changes from \'%s\' to \'%s\'."
> >>> >> >> + "Dynamic changing of trace format is "
> >>> >> >> + "not supported yet!\n",
> >>> >> >> + ktrace->name, entry->format,
> >>> >> >> + ktrace->format);
> >>> >> >> + return ERR_PTR(-EINVAL);
> >>> >> >> + }
> >>> >> >> + ktrace->enabled = entry->enabled;
> >>> >> >> + ktrace->func = entry->func;
> >>> >> >> + ktrace->event_id = entry->event_id;
> >>> >> >> + if (list_empty(&entry->klist))
> >>> >> >> + goto add_head;
> >>> >> >> + list_add_tail(&ktrace->list, &entry->klist);
> >>> >> >> + return entry;
> >>> >> >> + }
> >>> >> >> + }
> >>> >> >> + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
> >>> >> >> + if (!entry)
> >>> >> >> + return ERR_PTR(-ENOMEM);
> >>> >> >> + entry->name = kmalloc(name_len, GFP_KERNEL);
> >>> >> >> + if (!entry->name) {
> >>> >> >> + kfree(entry);
> >>> >> >> + return ERR_PTR(-ENOMEM);
> >>> >> >> + }
> >>> >> >> + memcpy(entry->name, ktrace->name, name_len);
> >>> >> >> + if (ktrace->format) {
> >>> >> >> + size_t format_len = strlen(ktrace->format) + 1;
> >>> >> >> + entry->format = kmalloc(format_len, GFP_KERNEL);
> >>> >> >> + if (!entry->format) {
> >>> >> >> + kfree(entry->name);
> >>> >> >> + kfree(entry);
> >>> >> >> + return ERR_PTR(-ENOMEM);
> >>> >> >> + }
> >>> >> >> + memcpy(entry->format, ktrace->format, format_len);
> >>> >> >> + } else
> >>> >> >> + entry->format = NULL;
> >>> >> >> + entry->func = ktrace->func;
> >>> >> >> + entry->enabled = 0;
> >>> >> >> + ktrace->event_id = entry->event_id = ktrace_next_id++;
> >>> >> >> + entry->stroff = NULL;
> >>> >> >> + INIT_LIST_HEAD(&entry->klist);
> >>> >> >> + hlist_add_head(&entry->hlist, head);
> >>> >> >> +add_head:
> >>> >> >> + list_add(&ktrace->list, &entry->klist);
> >>> >> >> + dentry = debugfs_create_file(entry->name, 0660, eventdir,
> >>> >> >> + entry, &ktracecontrol_file_operations);
> >>> >> >> + if (!dentry)
> >>> >> >> + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
> >>> >> >> + entry->name);
> >>> >> >> + return entry;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
> >>> >> >> +{
> >>> >> >> + struct ktrace_entry *entry;
> >>> >> >> + size_t name_len = strlen(ktrace->name) + 1;
> >>> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
> >>> >> >> + struct hlist_head *head;
> >>> >> >> + struct hlist_node *node, *temp;
> >>> >> >> + struct dentry *dentry;
> >>> >> >> +
> >>> >> >> + if (!tracedir)
> >>> >> >> + return 0;
> >>> >> >> + list_del(&ktrace->list);
> >>> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> >>> >> >> + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
> >>> >> >> + if (!strcmp(ktrace->name, entry->name)) {
> >>> >> >> + if (list_empty(&entry->klist)) {
> >>> >> >> + dentry = lookup_one_len(entry->name,
> >>> >> >> + eventdir, strlen(entry->name));
> >>> >> >> + if (dentry && !IS_ERR(dentry)) {
> >>> >> >> + debugfs_remove(dentry);
> >>> >> >> + dput(dentry);
> >>> >> >> + }
> >>> >> >> + entry->enabled = 0;
> >>> >> >> + if (free) {
> >>> >> >> + hlist_del(&entry->hlist);
> >>> >> >> + kfree(entry->name);
> >>> >> >> + kfree(entry->format);
> >>> >> >> + kfree(entry->stroff);
> >>> >> >> + kfree(entry);
> >>> >> >> + }
> >>> >> >> + }
> >>> >> >> + return 0;
> >>> >> >> + }
> >>> >> >> + return -1;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
> >>> >> >> + * Called at the end of module load/unload. */
> >>> >> >> +static int ktrace_module_notify(struct notifier_block *self,
> >>> >> >> + unsigned long val, void *data)
> >>> >> >> +{
> >>> >> >> +#ifdef CONFIG_MODULES
> >>> >> >> + struct kernel_trace *iter;
> >>> >> >> + struct module *mod = data;
> >>> >> >> +
> >>> >> >> + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
> >>> >> >> + return 0;
> >>> >> >> + mutex_lock(&ktrace_mutex);
> >>> >> >> + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
> >>> >> >> + if (val == MODULE_STATE_COMING)
> >>> >> >> + add_ktrace(iter);
> >>> >> >> + else
> >>> >> >> + remove_ktrace(iter, 0);
> >>> >> >> + mutex_unlock(&ktrace_mutex);
> >>> >> >> +#endif
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static struct notifier_block ktrace_module_nb = {
> >>> >> >> + .notifier_call = ktrace_module_notify,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +/* functions for user-space programs to read data from tracing buffer */
> >>> >> >> +static int trace_data_open(struct inode *inode, struct file *filp)
> >>> >> >> +{
> >>> >> >> + filp->private_data = inode->i_private;
> >>> >> >> + kref_get(&ktrace_kref);
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static void release_trace_buffer(struct kref *kref)
> >>> >> >> +{
> >>> >> >> + ring_buffer_free(trace_buffer);
> >>> >> >> + trace_buffer = NULL;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int trace_data_release(struct inode *inode, struct file *filp)
> >>> >> >> +{
> >>> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
> >>> >> >> +{
> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> >>> >> >> + if (filp->f_mode & FMODE_READ) {
> >>> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> >>> >> >> + return POLLIN | POLLRDNORM;
> >>> >> >> + poll_wait(filp, &trace_wait, poll_table);
> >>> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> >>> >> >> + return POLLIN | POLLRDNORM;
> >>> >> >> + }
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> >>> >> >> +{
> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
> >>> >> >> + pgoff_t pgoff = vmf->pgoff;
> >>> >> >> + struct page *page;
> >>> >> >> +
> >>> >> >> + if (!trace_buffer)
> >>> >> >> + return VM_FAULT_OOM;
> >>> >> >> + page = ring_buffer_get_page(cpu_buffer, pgoff);
> >>> >> >> + if (page == NULL)
> >>> >> >> + return VM_FAULT_SIGBUS;
> >>> >> >> + get_page(page);
> >>> >> >> + vmf->page = page;
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static struct vm_operations_struct trace_data_mmap_ops = {
> >>> >> >> + .fault = trace_buf_fault,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
> >>> >> >> +{
> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> >>> >> >> +
> >>> >> >> + vma->vm_ops = &trace_data_mmap_ops;
> >>> >> >> + vma->vm_flags |= VM_DONTEXPAND;
> >>> >> >> + vma->vm_private_data = cpu_buffer;
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
> >>> >> >> + unsigned int cmd, unsigned long arg)
> >>> >> >> +{
> >>> >> >> + u32 __user *argp = (u32 __user *)arg;
> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> >>> >> >> +
> >>> >> >> + if (!trace_buffer || !tracedir)
> >>> >> >> + return -EINVAL;
> >>> >> >> + switch (cmd) {
> >>> >> >> + case KTRACE_BUF_GET_SIZE:
> >>> >> >> + {
> >>> >> >> + unsigned long bufsize;
> >>> >> >> + bufsize = ring_buffer_size(trace_buffer);
> >>> >> >> + return put_user((u32)bufsize, argp);
> >>> >> >> + }
> >>> >> >> + case KTRACE_BUF_GET_PRODUCED:
> >>> >> >> + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
> >>> >> >> + case KTRACE_BUF_GET_CONSUMED:
> >>> >> >> + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
> >>> >> >> + case KTRACE_BUF_PUT_CONSUMED:
> >>> >> >> + {
> >>> >> >> + u32 consumed, consumed_old;
> >>> >> >> + int ret;
> >>> >> >> +
> >>> >> >> + ret = get_user(consumed, argp);
> >>> >> >> + if (ret) {
> >>> >> >> + printk(KERN_WARNING
> >>> >> >> + "error getting consumed value: %d\n", ret);
> >>> >> >> + return ret;
> >>> >> >> + }
> >>> >> >> + consumed_old = ring_buffer_get_consumed(cpu_buffer);
> >>> >> >> + if (consumed == consumed_old)
> >>> >> >> + return 0;
> >>> >> >> + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
> >>> >> >> + return 0;
> >>> >> >> + }
> >>> >> >> + default:
> >>> >> >> + return -ENOIOCTLCMD;
> >>> >> >> + }
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static const struct file_operations tracedata_file_operations = {
> >>> >> >> + .open = trace_data_open,
> >>> >> >> + .poll = trace_data_poll,
> >>> >> >> + .mmap = trace_data_mmap,
> >>> >> >> + .ioctl = trace_data_ioctl,
> >>> >> >> + .release = trace_data_release,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +/* function to report the number of lost events due to buffer overflow */
> >>> >> >> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
> >>> >> >> + size_t cnt, loff_t *ppos)
> >>> >> >> +{
> >>> >> >> + int ret;
> >>> >> >> + char buf[64];
> >>> >> >> + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
> >>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static const struct file_operations traceoverflow_file_operations = {
> >>> >> >> + .read = trace_overflow_read,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +/* function to re-initialize trace buffer */
> >>> >> >> +static void __trace_buffer_reset(struct file *filp,
> >>> >> >> + unsigned long val, size_t *cnt)
> >>> >> >> +{
> >>> >> >> + if (val && trace_buffer)
> >>> >> >> + ring_buffer_reset(trace_buffer);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
> >>> >> >> + size_t cnt, loff_t *ppos)
> >>> >> >> +{
> >>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> >>> >> >> + __trace_buffer_reset);
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static const struct file_operations tracereset_file_operations = {
> >>> >> >> + .write = trace_buffer_reset,
> >>> >> >> +};
> >>> >> >> +
> >>> >> >> +/*
> >>> >> >> + * We use debugfs for kernel-user communication. All of the control/info
> >>> >> >> + * files are under debugfs/ktrace directory.
> >>> >> >> + */
> >>> >> >> +static int create_debugfs(void)
> >>> >> >> +{
> >>> >> >> + struct dentry *entry, *bufdir;
> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer;
> >>> >> >> + char *tmpname;
> >>> >> >> + int cpu;
> >>> >> >> +
> >>> >> >> + tracedir = debugfs_create_dir("ktrace", NULL);
> >>> >> >> + if (!tracedir) {
> >>> >> >> + printk(KERN_WARNING "Couldn't create debugfs directory\n");
> >>> >> >> + return -ENOMEM;
> >>> >> >> + }
> >>> >> >> +
> >>> >> >> + /* 'buffers' is the directory that holds trace buffer debugfs files */
> >>> >> >> + bufdir = debugfs_create_dir("buffers", tracedir);
> >>> >> >> + if (!bufdir) {
> >>> >> >> + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
> >>> >> >> + return -ENOMEM;
> >>> >> >> + }
> >>> >> >> +
> >>> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> >>> >> >> + if (!tmpname)
> >>> >> >> + return -ENOMEM;
> >>> >> >> + /* create a debugfs file for each cpu buffer */
> >>> >> >> + for_each_possible_cpu(cpu) {
> >>> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> >>> >> >> + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
> >>> >> >> + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
> >>> >> >> + &tracedata_file_operations);
> >>> >> >> + if (!entry) {
> >>> >> >> + printk(KERN_WARNING
> >>> >> >> + "Couldn't create debugfs \'%s\' entry\n",
> >>> >> >> + tmpname);
> >>> >> >> + return -ENOMEM;
> >>> >> >> + }
> >>> >> >> + }
> >>> >> >> + kfree(tmpname);
> >>> >> >> +
> >>> >> >> + /* the control file for users to enable/disable global tracing. */
> >>> >> >> + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
> >>> >> >> + &tracecontrol_file_operations);
> >>> >> >> + if (!entry) {
> >>> >> >> + printk(KERN_WARNING
> >>> >> >> + "Couldn't create debugfs 'enabled' entry\n");
> >>> >> >> + return -ENOMEM;
> >>> >> >> + }
> >>> >> >> +
> >>> >> >> + /*
> >>> >> >> + * the debugfs file that displays the name, format etc. of every
> >>> >> >> + * supported trace event. The file is to be used by the user-space
> >>> >> >> + * trace parser to analyze the collected trace data.
> >>> >> >> + */
> >>> >> >> + entry = debugfs_create_file("info", 0444, tracedir, NULL,
> >>> >> >> + &traceinfo_file_operations);
> >>> >> >> + if (!entry) {
> >>> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
> >>> >> >> + return -ENOMEM;
> >>> >> >> + }
> >>> >> >> +
> >>> >> >> + /*
> >>> >> >> + * the debugfs file that reports the number of events
> >>> >> >> + * lost due to buffer overflow
> >>> >> >> + */
> >>> >> >> + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
> >>> >> >> + &traceoverflow_file_operations);
> >>> >> >> + if (!entry) {
> >>> >> >> + printk(KERN_WARNING
> >>> >> >> + "Couldn't create debugfs 'overflow' entry\n");
> >>> >> >> + return -ENOMEM;
> >>> >> >> + }
> >>> >> >> +
> >>> >> >> + /* the debugfs file that resets trace buffer upon write */
> >>> >> >> + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
> >>> >> >> + &tracereset_file_operations);
> >>> >> >> + if (!entry) {
> >>> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
> >>> >> >> + return -ENOMEM;
> >>> >> >> + }
> >>> >> >> +
> >>> >> >> + /* create the directory that holds the control files for every event */
> >>> >> >> + eventdir = debugfs_create_dir("events", tracedir);
> >>> >> >> + if (!eventdir) {
> >>> >> >> + printk(KERN_WARNING "Couldn't create 'events' directory\n");
> >>> >> >> + return -ENOMEM;
> >>> >> >> + }
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static void remove_debugfs_file(struct dentry *dir, char *name)
> >>> >> >> +{
> >>> >> >> + struct dentry *dentry;
> >>> >> >> + dentry = lookup_one_len(name, dir, strlen(name));
> >>> >> >> + if (dentry && !IS_ERR(dentry)) {
> >>> >> >> + debugfs_remove(dentry);
> >>> >> >> + dput(dentry);
> >>> >> >> + }
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int remove_debugfs(void)
> >>> >> >> +{
> >>> >> >> + struct dentry *bufdir;
> >>> >> >> + char *tmpname;
> >>> >> >> + int cpu;
> >>> >> >> +
> >>> >> >> + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
> >>> >> >> + if (!bufdir || IS_ERR(bufdir))
> >>> >> >> + return -EIO;
> >>> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> >>> >> >> + if (!tmpname)
> >>> >> >> + return -ENOMEM;
> >>> >> >> + for_each_possible_cpu(cpu) {
> >>> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> >>> >> >> + remove_debugfs_file(bufdir, tmpname);
> >>> >> >> + }
> >>> >> >> + kfree(tmpname);
> >>> >> >> + debugfs_remove(bufdir);
> >>> >> >> +
> >>> >> >> + remove_debugfs_file(tracedir, "enabled");
> >>> >> >> + remove_debugfs_file(tracedir, "info");
> >>> >> >> + debugfs_remove(eventdir);
> >>> >> >> + debugfs_remove(tracedir);
> >>> >> >> + return 0;
> >>> >> >> +}
> >>> >> >> +
> >>> >> >> +static int __init init_ktrace(void)
> >>> >> >> +{
> >>> >> >> + struct kernel_trace *iter;
> >>> >> >> + int i, err;
> >>> >> >> +
> >>> >> >> + ktrace_next_id = 0;
> >>> >> >> + kref_set(&ktrace_kref, 1);
> >>> >> >> + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
> >>> >> >> + if (trace_buffer == NULL)
> >>> >> >> + return -ENOMEM;
> >>> >> >> + setup_timer(&trace_timer, wakeup_readers, 0);
> >>> >> >> + for (i = 0; i < KTRACE_HASH_SIZE; i++)
> >>> >> >> + INIT_HLIST_HEAD(&ktrace_table[i]);
> >>> >> >> +
> >>> >> >> + mutex_lock(&ktrace_mutex);
> >>> >> >> + err = create_debugfs();
> >>> >> >> + if (err != 0)
> >>> >> >> + goto out;
> >>> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> >>> >> >> + add_ktrace(iter);
> >>> >> >> + err = register_module_notifier(&ktrace_module_nb);
> >>> >> >> +out:
> >>> >> >> + mutex_unlock(&ktrace_mutex);
> >>> >> >> + return err;
> >>> >> >> +}
> >>> >> >> +module_init(init_ktrace);
> >>> >> >> +
> >>> >> >> +static void __exit exit_ktrace(void)
> >>> >> >> +{
> >>> >> >> + struct kernel_trace *iter;
> >>> >> >> +
> >>> >> >> + if (ktrace_enabled) {
> >>> >> >> + ktrace_enabled = 0;
> >>> >> >> + del_timer_sync(&trace_timer);
> >>> >> >> + }
> >>> >> >> + mutex_lock(&ktrace_mutex);
> >>> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> >>> >> >> + remove_ktrace(iter, 1);
> >>> >> >> + unregister_module_notifier(&ktrace_module_nb);
> >>> >> >> + remove_debugfs();
> >>> >> >> + mutex_unlock(&ktrace_mutex);
> >>> >> >> +
> >>> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
> >>> >> >> +}
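> >>> >> >>
> >>> >> >> For reference, a call site would look roughly like this (a hypothetical
> >>> >> >> sched_wakeup event, shown for illustration only and not part of the patch):
> >>> >> >>
> >>> >> >> DEFINE_KTRACE_STRUCT(sched_wakeup) {
> >>> >> >> 	pid_t pid;
> >>> >> >> 	int prio;
> >>> >> >> };
> >>> >> >>
> >>> >> >> static inline void trace_sched_wakeup(struct task_struct *p)
> >>> >> >> {
> >>> >> >> 	DO_TRACE(sched_wakeup, "pid %d prio %d", p->pid, p->prio);
> >>> >> >> }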
> >>> >> >
> >>> >> > --
> >>> >> > Mathieu Desnoyers
> >>> >> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> >>> >> >
> >>> >
> >>> > --
> >>> > Mathieu Desnoyers
> >>> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> >>> >
> >>>
> >>
> >> --
> >> Mathieu Desnoyers
> >> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> >>
> >
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

On Thu, Mar 19, 2009 at 2:31 PM, Mathieu Desnoyers
<[email protected]> wrote:
> * Jiaying Zhang ([email protected]) wrote:
>> Hi Mathieu,
>>
>
> Hi Jiaying,
>
>> I ran tbench tests with the latest lttng patches (version 0.111) and
>> 2.6.29-rc8 kernel. I used the same set of trace events in my tests:
>> syscall entry/exit, irq entry/exit, trap entry/exit, sched_schedule,
>> process_fork, process_wait, process_free, process_exit.
>>
>> Here are the results I got:
>>
>> vanilla kernel: Throughput 673.269 MB/sec 4 procs
>> lttng normal mode: Throughput 462.786 MB/sec 4 procs (31.252% slowdown)
>> lttng flight-recorder: Throughput 484.547 MB/sec 4 procs (28.03% slowdown)
>> ktrace: Throughput 569.508 MB/sec 4 procs (15.411% slowdown)
>>
>
> Does your machine have synchronized timestamp counters ?
Yes.
>
> Also note that I added some further optimisations in 0.112, including
> some data prefetching.
>
>> The kernel LTTng configurations I used are:
>>
>> CONFIG_LTT=y
>> CONFIG_LTT_FILTER=y
>> CONFIG_HAVE_LTT_DUMP_TABLES=y
>> CONFIG_LTT_RELAY_ALLOC=y
>> CONFIG_LTT_RELAY_LOCKLESS=y
>> # CONFIG_LTT_RELAY_LOCKED is not set
>> CONFIG_LTT_SERIALIZE=y
>> CONFIG_LTT_FAST_SERIALIZE=y
>> CONFIG_LTT_TRACEPROBES=y
>> CONFIG_LTT_TRACE_CONTROL=y
>> CONFIG_LTT_TRACER=y
>> # CONFIG_LTT_ALIGNMENT is not set
>> CONFIG_LTT_CHECK_ARCH_EFFICIENT_UNALIGNED_ACCESS=y
>> # CONFIG_LTT_DEBUG_EVENT_SIZE is not set
>> CONFIG_LTT_USERSPACE_EVENT=m
>
>> CONFIG_LTT_VMCORE=y
>
> This can slow down tracing. You might want to try without the VMCORE
> option.
>
>> CONFIG_LTT_KPROBES=y
>> CONFIG_LTT_STATEDUMP=m
>> # CONFIG_LTT_ASCII is not set
>> CONFIG_HAVE_IMMEDIATE=y
>> CONFIG_IMMEDIATE=y
>>
>> The LTTng performance seems to have improved compared with the
>> results I measured before. But I think there is still significant room for
>> further optimization when we look at the ktrace results as a comparison.
>> One worry I have is that the LTTng patch set is getting bigger and bigger.
>> It is good to add more features and instrumentation, but it might
>> make performance optimization harder. Maybe we should take
>> a different approach, i.e., start with something simple and make
>> sure it is implemented correctly and efficiently, then extend it and make
>> sure we don't sacrifice performance as we add more features.
>> I think a good time to do this is when you port the LTTng patches to
>> the mainline kernel. As the ftrace people are working on adding event
>> tracing to ftrace, that seems a good opportunity too.
>>
>
> Well, as with any project, there are "feature addition" phases and
> "optimization" phases. I think the work I did in the last week will help
> a lot in simplifying fast-path optimization work. I basically trimmed it
> down to a few small inlined functions. Therefore, the instruction
> profile given by oprofile is now understandable and it's easier to
> review the resulting assembly. So your argument about the patchset size
> does not hold as much, because it is now possible to focus on a few
> small functions.
>
> Even if we start from zero as you propose, we will end up adding
> regressions during the feature addition phases, and we will end up in
> the same situation we were in a few months ago.
But we should have a continuous performance test running that
prevents any regressing patch from slipping in.
>
> If we want to see what can improve performance in LTTng, we can simply
> disable the hot path piece by piece and figure out where the cycles go,
> as I did recently (and use oprofile, of course). Given that the LTTng
> project is very modular, it is easy to enable individual
> features.
>
> Another thing that might be worth looking at is to skip the tracepoint
> probe -> marker callback indirection, because it will likely add a noticeable
> overhead. However, the tradeoff here is to accept some overhead for
> clear separation of the core kernel instrumentation from the markers
> exposed to userspace.
>
> As you say, there will be a piece-by-piece review/integration phase that
> will likely take place when I discuss LTTng and ftrace integration.
Interesting to hear this. Do you have a timeline for this?
Jiaying
>
> Mathieu
>
>> Jiaying
>>
>> On Mon, Mar 16, 2009 at 7:12 PM, Jiaying Zhang <[email protected]> wrote:
>> > On Mon, Mar 16, 2009 at 6:53 PM, Mathieu Desnoyers
>> > <[email protected]> wrote:
>> >> * Jiaying Zhang ([email protected]) wrote:
>> >>> On Mon, Mar 16, 2009 at 5:01 PM, Mathieu Desnoyers
>> >>> <[email protected]> wrote:
>> >>> > * Jiaying Zhang ([email protected]) wrote:
>> >>> >> Hi Mathieu,
>> >>> >>
>> >>> >> First, I apologize that I did not forward you my original email. I assumed
>> >>> >> that the email would be noticed by interested people when it was sent
>> >>> >> to lkml. Now I see that was a mistake.
>> >>> >>
>> >>> >
>> >>> > Hi Jiaying,
>> >>> >
>> >>> > Apologies accepted, no hard feelings :) I was just surprised when I
>> >>> > found out about the numbers you posted.
>> >>> >
>> >>> >> On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
>> >>> >> <[email protected]> wrote:
>> >>> >> > Hi Jiaying,
>> >>> >> >
>> >>> >> > I'm very interested to hear about the details of your test-case, because
>> >>> >> > what you are talking about here, and the numbers that you get, are, at
>> >>> >> > the very least, not representative of the current LTTng. Perharps you
>> >>> >> > used a version right after the kernel summit and stopped following the
>> >>> >> > improvements we have done since then ? I also wonder about the instrumentation
>> >>> >> > set you enabled in your tests, and whether it's exactly the same in the ktrace
>> >>> >> > and LTTng runs (especially system call tracing). Please note that following the
>> >>> >> > Kernel Summit 2008, a more active development cycle began, and the time was not
>> >>> >> > right for optimisation. But since then a lot of improvement has been done on
>> >>> >> > this side.
>> >>> >>
>> >>> >> Sorry that I omitted a lot of details in my test description. I did use an
>> >>> >> old version of LTTng in my tests: the 0.21 LTTng patches for the 2.6.27
>> >>> >> kernel. I used a 2.6.26 kernel in my tests, with the lttng-2.6.27-rc6-0.21
>> >>> >> patches ported to it.
>> >>> >>
>> >>> >> I enabled the same set of trace events in ktrace and LTTng runs. Here is
>> >>> >> the list of enabled events: syscall entry/exit, irq entry/exit, trap
>> >>> >> entry/exit,
>> >>> >> kernel_sched_switch, kernel_process_fork, kernel_process_wait,
>> >>> >> kernel_process_free, kernel_process_exit.
>> >>> >>
>> >>> >> We are glad to see the improvements coming from the LTTng community.
>> >>> >> I haven't had time to sync with the latest LTTng code and re-run my tests.
>> >>> >> I will let you know when I have such data available.
>> >>> >>
>> >>> >
>> >>> > On my side, with flight recorder activated, same instrumentation you
>> >>> > use, with LTTng 0.111 :
>> >>> >
>> >>> > arm-google script :
>> >>> >
>> >>> > #!/bin/sh
>> >>> >
>> >>> > DIR=/mnt/debugfs/ltt/markers
>> >>> >
>> >>> > for a in \
>> >>> > syscall_entry syscall_exit \
>> >>> > irq_entry irq_exit \
>> >>> > trap_entry trap_exit \
>> >>> > page_fault_entry page_fault_exit \
>> >>> > page_fault_nosem_entry page_fault_nosem_exit \
>> >>> > page_fault_get_user_entry page_fault_get_user_exit \
>> >>> > sched_schedule process_fork process_wait \
>> >>> > process_free process_exit;
>> >>> > do
>> >>> > echo 1 > ${DIR}/kernel/$a/enable
>> >>> > done
>> >>> >
>> >>> > running tbench -t 200 8
>> >>> >
>> >>> > Vanilla kernel : 2070 MB/sec
>> >>> > Flight recorder : 1693 MB/sec Performance impact : 18.2 %
>> >>> > Normal : 1620 MB/sec Performance impact : 21.7 % (8 lttd threads)
>> >>> >
>> >>> > So now we are comparing apples with apples :-)
>> >>> Looks like the performance of LTTng has improved a lot over the past
>> >>> few months. I will run some tests with the latest LTTng patches and
>> >>> let you know if I see any difference. Did you use the 2.6.29-rc7 kernel
>> >>> in your tests?
>> >>>
>> >>
>> >> Yes, kernel 2.6.29-rc7, lttng 0.111. But lttng 0.112 for kernel
>> >> 2.6.29-rc8 should also be close enough.
>> >>
>> >>> >
>> >>> >> >
>> >>> >> > First I'd like to express my concern about the way you posted your
>> >>> >> > numbers :
>> >>> >> >
>> >>> >> > 1 - You did not CC anyone from the LTTng project.
>> >>> >> > 2 - You did not tell which LTTng version you used as a comparison basis.
>> >>> >> > 3 - You did not look at the current state of code before posting your
>> >>> >> > comparison. (as shows the comment in your patch header)
>> >>> >> That is indeed my mistake.
>> >>> >>
>> >>> >> > 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
>> >>> >> > the default instrumentation ? Did you also enable system call tracing
>> >>> >> > with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
>> >>> >> > makes a huge difference on the amount of information gathered.
>> >>> >> Yes. We also enable syscall tracing via set_kernel_trace_flag_all_tasks().
>> >>> >>
>> >>> >> >
>> >>> >> > So let's have a close look at them. I'm re-doing the benchmarks based on
>> >>> >> > the current LTTng version below.
>> >>> >> >
>> >>> >> >> Hello All,
>> >>> >> >>
>> >>> >> >> We have been working on building a kernel tracer for a production environment.
>> >>> >> >> We first considered to use the existing markers code, but found it a bit
>> >>> >> >> heavyweight.
>> >>> >> >
>> >>> >> > Have you looked at :
>> >>> >> >
>> >>> >> > 1 - The tracepoints currently in the kernel
>> >>> >> > 2 - The _current_ version of markers in the LTTng tree ?
>> >>> >>
>> >>> >> I looked at the tracepoints currently in the kernel. It is great to see that
>> >>> >> some trace instrumentation code is finally available in the mainline kernel.
>> >>> >> But I still don't like an additional tracepoint layer on top of the
>> >>> >> actual tracing (ktrace or markers or ftrace) layer. That seems to add
>> >>> >> unnecessary overhead. The markers code I referred to in my email is the
>> >>> >> current code in the mainline kernel. I had a look at the LTTng markers
>> >>> >> code but still saw that a va_list argument is used in
>> >>> >> marker_probe_func. It sounds like a nice improvement that the LTTng
>> >>> >> markers avoid the use of va_arg whenever possible.
>> >>> >>
>> >>> >
>> >>> > Yes, va_args is only there as a convenience for slow paths.
>> >>> >
>> >>> >> >
>> >>> >> >> So we developed a new kernel tracing prototype that shares
>> >>> >> >> many similarities as markers but also includes several simplifications that
>> >>> >> >> give us considerable performance benefits. For example, we pass the size of
>> >>> >> >> trace event directly to the probe function instead of using va_arg as that
>> >>> >> >> used in markers, which saves us the overhead of variable passing.
>> >>> >> >
>> >>> >> > The markers in the LTTng tree only use va_args for "debug-style"
>> >>> >> > tracing. All the tracing fast paths are now done with custom probe
>> >>> >> > functions which takes the arguments from the tracepoint and write them
>> >>> >> > in the trace buffers directly in C. When the event payload size is already
>> >>> >> > known statically, there is no dynamic computation cost at all : it's
>> >>> >> > simply passed to the ltt-relay write function.
>> >>> >> But this raises another concern: sub-system developers may need
>> >>> >> to write a bunch of custom probe functions to avoid the argument-parsing
>> >>> >> overhead. Can you instead have a general probe function or macro that
>> >>> >> does this work?
>> >>> >>
>> >>> >
>> >>> > Well, considering the amount of care we have to take when adding a
>> >>> > tracepoint to the mainline kernel, I think it's even a *good*
>> >>> > thing that someone has to stop and write a small callback to deal with
>> >>> > the information he wants to export. I think debug-style tracing should
>> >>> > be very, very quick and easy to add, but tracepoints meant to make it
>> >>> > into the mainline do not share this requirement.
>> >>> I agree that we need to be careful when adding tracepoints into the kernel,
>> >>> but I still think it is good to simplify the process of adding/maintaining
>> >>> tracepoints so sub-system maintainers don't feel it is a big burden.
>> >>>
>> >>
>> >> Yes, we have to make sure it's not an impossible task.
>> >>
>> >>> >
>> >>> >> >
>> >>> >> >> Also, we
>> >>> >> >> associate a single probe function to each event, so we can directly call the
>> >>> >> >> probe function when tracing is enabled.
>> >>> >> >
>> >>> >> > You seem to imply that being stuck with a single probe function is
>> >>> >> > somehow superior to have tiny per-event callbacks which then call into a
>> >>> >> > buffer write function ? Why ?
>> >>> >> That is indeed my thinking, because we saw that some platforms are
>> >>> >> very sensitive to function call overhead and we haven't seen use cases
>> >>> >> where multiple probe functions prove necessary.
>> >>> >>
>> >>> >
>> >>> > Hrm, so your plan is to embed the tracing code at the caller site, or to
>> >>> > keep one single function call ?
>> >>> We do have a general probe function that is called when tracing is enabled.
>> >>> Other than that, the main tracing code is defined as an inline function
>> >>> that is embedded at the caller site.
>> >>>
>> >>
>> >> You mean that you embed the main tracing code directly in the
>> >> instrumented kernel functions ? Hrm, have you considered the calling site
>> >> pollution impact that might have ? The compiler might start having
>> >> difficulty optimising hot paths that would then become large functions.
>> >> Or maybe I am not correctly understanding your approach.
>> > Sorry. I was actually trying to say that the main instrumentation code is
>> > embedded at the caller site. The main tracing code is in the general probe
>> > function.
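>> >
>> > To make this concrete, here is roughly what our DO_TRACE() macro (in the
>> > patch quoted below) expands to at a trace point site, with "myevent" as a
>> > hypothetical event name:
>> >
>> > /* embedded at the caller site: two predicted branches when disabled */
>> > if (unlikely(ktrace_enabled) && unlikely(__ktrace_myevent.enabled)) {
>> > 	struct ktrace_struct_myevent karg = { arg1, arg2 };
>> > 	/* single indirect call into the general probe function */
>> > 	(*__ktrace_myevent.func)(&__ktrace_myevent, &karg, sizeof(karg));
>> > }
>> >
>> > So when tracing is disabled, the only cost at the caller site is the
>> > predicted branch; the buffer-write logic stays behind the function call.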
>> >
>> > Jiaying
>> >
>> >>
>> >> Mathieu
>> >>
>> >>>
>> >>> Jiaying
>> >>>
>> >>> >
>> >>> >> >
>> >>> >> >> Our measurements show that our
>> >>> >> >> current tracing prototype introduces much lower overhead compared with
>> >>> >> >> lttng (we use lttng as a comparison because markers does not provide the
>> >>> >> >> implementation to record trace data).
>> >>> >> >
>> >>> >> > Why LTTng version, kernel version, LTTng config and runtime options do
>> >>> >> > you use ?
>> >>> >> See above for LTTng and kernel versions. I was using the default LTTng
>> >>> >> config options and 'normal' recorder mode.
>> >>> >>
>> >>> >> >
>> >>> >> >> E.g., with dbench, lttng introduces
>> >>> >> >> 20~80% overhead on our testing platforms while the overhead with our
>> >>> >> >> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
>> >>> >> >> overhead while the overhead added by our kernel tracer is usually within
>> >>> >> >> 15%; with kernbench (time to compile a Linux kernel), the overhead
>> >>> >> >> added by lttng is around 4%~18% while the overhead added by our
>> >>> >> >> kernel tracer is within 2.5%.
>> >>> >> >>
>> >>> >> >
>> >>> >> > I'm re-running the dbench and tbench tests :
>> >>> >> >
>> >>> >> > With LTTng 0.106, kernel 2.6.29-rc7
>> >>> >> > 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
>> >>> >> >
>> >>> >> > Arming all the default LTTng instrumentation (168 tracepoints),
>> >>> >> > Tracing in flight recorder mode, including system call entry/exit
>> >>> >> > tracing and block I/O tracing (blktrace instrumentation), I get :
>> >>> >> >
>> >>> >> > Test / config | Vanilla kernel | LTTng 0.106 | slowdown
>> >>> >> > dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
>> >>> >> >
>> >>> >> > As an example of the effect of custom probe vs va_args based probes,
>> >>> >> > let's take the tbench example
>> >>> >> >
>> >>> >> > | Vanilla | LTTng 0.106 | slowdown
>> >>> >> > tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
>> >>> >> >
>> >>> >> > The LTTng trace statistics show the events with the highest rates :
>> >>> >> >
>> >>> >> > Some are usual (therefore lttng already implements custom probes for them) :
>> >>> >> >
>> >>> >> > 42004 sched_try_wakeup
>> >>> >> > 79754 timer_set
>> >>> >> > 83973 sched_schedule
>> >>> >> > 84317 softirq_exit
>> >>> >> > 84318 softirq_entry
>> >>> >> > 95370 page_free
>> >>> >> > 95412 page_alloc
>> >>> >> > 267247 syscall_entry
>> >>> >> > 267247 syscall_exit
>> >>> >> >
>> >>> >> > Some are more tbench workload-specific :
>> >>> >> >
>> >>> >> > 16915 socket_sendmsg
>> >>> >> > 18911 napi_complete
>> >>> >> > 18911 napi_schedule
>> >>> >> > 18912 dev_receive
>> >>> >> > 18914 dev_xmit
>> >>> >> > 18920 napi_poll
>> >>> >> > 33831 socket_recvmsg
>> >>> >> > 59860 read
>> >>> >> >
>> >>> >> >
>> >>> >> > So let's see what happens if I implement those as custom probes and do a bit of
>> >>> >> > tuning in the tracer hot path :
>> >>> >> Could you give more details about the custom probes you used?
>> >>> >
>> >>> > They are available in the lttng tree on git.kernel.org in
>> >>> > ltt/probes/*-trace.c. Their API is in include/linux/ltt-type-serialize.h
>> >>> >
>> >>> >> Do you need to write a custom probe for each tracepoint?
>> >>> >>
>> >>> >
>> >>> > Yes. But it's not such a big deal. It looks a bit like Steve's
>> >>> > TRACE_EVENT macro-fu, but written as a simple C callback :
>> >>> >
>> >>> > void probe_sched_switch(struct rq *rq, struct task_struct *prev,
>> >>> > 		struct task_struct *next);
>> >>> >
>> >>> > DEFINE_MARKER_TP(kernel, sched_schedule, sched_switch, probe_sched_switch,
>> >>> > 		"prev_pid %d next_pid %d prev_state #2d%ld");
>> >>> >
>> >>> > notrace void probe_sched_switch(struct rq *rq, struct task_struct *prev,
>> >>> > 		struct task_struct *next)
>> >>> > {
>> >>> > 	struct marker *marker;
>> >>> > 	struct serialize_int_int_short data;
>> >>> >
>> >>> > 	data.f1 = prev->pid;
>> >>> > 	data.f2 = next->pid;
>> >>> > 	data.f3 = prev->state;
>> >>> >
>> >>> > 	marker = &GET_MARKER(kernel, sched_schedule);
>> >>> > 	ltt_specialized_trace(marker, marker->single.probe_private,
>> >>> > 			&data, serialize_sizeof(data), sizeof(int));
>> >>> > }
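>> >>> >
>> >>> > Note that the event payload is laid out in a plain C structure and
>> >>> > written with its statically known size, so there is no format string
>> >>> > parsing in the tracing fast path; the format string passed to
>> >>> > DEFINE_MARKER_TP essentially serves as metadata describing the fields
>> >>> > for trace analysis.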
>> >>> >
>> >>> > Best regards,
>> >>> >
>> >>> > Mathieu
>> >>> >
>> >>> >> Jiaying
>> >>> >>
>> >>> >> >
>> >>> >> > | Vanilla | LTTng 0.107 | slowdown
>> >>> >> > Custom probes :
>> >>> >> > tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
>> >>> >> >
>> >>> >> > With tracer hot path tuning (inline hot path, function calls for slow path, will
>> >>> >> > be in 0.108) :
>> >>> >> > tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
>> >>> >> >
>> >>> >> > kernbench :
>> >>> >> > Average Optimal load -j 32 Run
>> >>> >> >
>> >>> >> > Vanilla 2.6.29-rc7 :
>> >>> >> > Elapsed Time 48.238 (0.117346)
>> >>> >> >
>> >>> >> > With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
>> >>> >> > Elapsed Time 50.226 (0.0750333)
>> >>> >> >
>> >>> >> > Slowdown : 4.1 %
>> >>> >> >
>> >>> >> > So except for the tbench workload on LTTng 0.106, which still had a 52%
>> >>> >> > performance impact (brought down to 35.6% for the next LTTng release) due
>> >>> >> > to its specific workload and the lack of specialized probes, the
>> >>> >> > dbench impact is well below your results (12.3% vs 20-80%).
>> >>> >> >
>> >>> >> > So given the huge result discrepancies between your tests and mine, could you
>> >>> >> > give us more detail about your benchmarks ? In particular, it would be good to
>> >>> >> > know the enabled instrumentation set, the amount of MB/s generated by both the
>> >>> >> > ktrace and LTTng runs (to make sure we are comparing the same traffic), and
>> >>> >> > the LTTng version you are using as a comparison basis.
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> >
>> >>> >> > Mathieu
>> >>> >> >
>> >>> >> >
>> >>> >> >> The patch below is the current prototype we have implemented. It uses
>> >>> >> >> Steve's unified trace buffer code with some small extensions to support
>> >>> >> >> buffer mmap. The patch is probably still buggy, but we would like to get
>> >>> >> >> some early feedback from the community.
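>> >>> >> >>
>> >>> >> >> To give an idea of the intended usage, defining a trace point with this
>> >>> >> >> patch looks roughly like the following (an illustrative example with a
>> >>> >> >> made-up event name; the real definitions go in headers under
>> >>> >> >> include/trace):
>> >>> >> >>
>> >>> >> >> /* per-event argument structure, packed to match the format string */
>> >>> >> >> DEFINE_KTRACE_STRUCT(sample_event) {
>> >>> >> >> 	int pid;
>> >>> >> >> 	long count;
>> >>> >> >> };
>> >>> >> >>
>> >>> >> >> static inline void trace_sample_event(int pid, long count)
>> >>> >> >> {
>> >>> >> >> 	/* the format string must describe the struct fields in order */
>> >>> >> >> 	DO_TRACE(sample_event, "pid %d count %ld", pid, count);
>> >>> >> >> }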
>> >>> >> >>
>> >>> >> >> Signed-off-by: Jiaying Zhang <[email protected]>
>> >>> >> >>
>> >>> >> >> Index: git-linux/init/Kconfig
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
>> >>> >> >> @@ -950,6 +950,15 @@
>> >>> >> >> Place an empty function call at each marker site. Can be
>> >>> >> >> dynamically changed for a probe function.
>> >>> >> >>
>> >>> >> >> +config KTRACE
>> >>> >> >> + bool "Enable ktrace support"
>> >>> >> >> + select DEBUG_FS
>> >>> >> >> + select TRACING
>> >>> >> >> + select RING_BUFFER
>> >>> >> >> + help
>> >>> >> >> + Ktrace is a kernel tracing tool that allows you to trace
>> >>> >> >> + kernel events by inserting trace points at proper places.
>> >>> >> >> +
>> >>> >> >> source "arch/Kconfig"
>> >>> >> >>
>> >>> >> >> endmenu # General setup
>> >>> >> >> Index: git-linux/kernel/Makefile
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
>> >>> >> >> @@ -86,6 +86,7 @@
>> >>> >> >> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
>> >>> >> >> obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
>> >>> >> >> obj-$(CONFIG_MARKERS) += marker.o
>> >>> >> >> +obj-$(CONFIG_KTRACE) += ktrace.o
>> >>> >> >> obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
>> >>> >> >> obj-$(CONFIG_LATENCYTOP) += latencytop.o
>> >>> >> >> obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
>> >>> >> >> Index: git-linux/include/asm-generic/vmlinux.lds.h
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19 14:58:39.000000000 -0800
>> >>> >> >> @@ -73,6 +73,9 @@
>> >>> >> >> MEM_KEEP(init.data) \
>> >>> >> >> MEM_KEEP(exit.data) \
>> >>> >> >> . = ALIGN(8); \
>> >>> >> >> + VMLINUX_SYMBOL(__start___ktraces) = .; \
>> >>> >> >> + *(__ktrace) \
>> >>> >> >> + VMLINUX_SYMBOL(__stop___ktraces) = .; \
>> >>> >> >> VMLINUX_SYMBOL(__start___markers) = .; \
>> >>> >> >> *(__markers) \
>> >>> >> >> VMLINUX_SYMBOL(__stop___markers) = .; \
>> >>> >> >> @@ -89,6 +92,7 @@
>> >>> >> >> VMLINUX_SYMBOL(__start_rodata) = .; \
>> >>> >> >> *(.rodata) *(.rodata.*) \
>> >>> >> >> *(__vermagic) /* Kernel version magic */ \
>> >>> >> >> + *(__ktrace_strings) /* Ktrace: strings */ \
>> >>> >> >> *(__markers_strings) /* Markers: strings */ \
>> >>> >> >> *(__tracepoints_strings)/* Tracepoints: strings */ \
>> >>> >> >> } \
>> >>> >> >> Index: git-linux/kernel/module.c
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
>> >>> >> >> @@ -50,6 +50,7 @@
>> >>> >> >> #include <asm/sections.h>
>> >>> >> >> #include <linux/tracepoint.h>
>> >>> >> >> #include <linux/ftrace.h>
>> >>> >> >> +#include <linux/ktrace.h>
>> >>> >> >> #include <linux/async.h>
>> >>> >> >>
>> >>> >> >> #if 0
>> >>> >> >> @@ -2145,6 +2146,10 @@
>> >>> >> >> sizeof(*mod->tracepoints),
>> >>> >> >> &mod->num_tracepoints);
>> >>> >> >> #endif
>> >>> >> >> +#ifdef CONFIG_KTRACE
>> >>> >> >> + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
>> >>> >> >> + sizeof(*mod->ktrace), &mod->num_ktrace);
>> >>> >> >> +#endif
>> >>> >> >>
>> >>> >> >> #ifdef CONFIG_MODVERSIONS
>> >>> >> >> if ((mod->num_syms && !mod->crcs)
>> >>> >> >> Index: git-linux/include/linux/module.h
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/include/linux/module.h 2009-02-19 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
>> >>> >> >> @@ -44,6 +44,7 @@
>> >>> >> >> };
>> >>> >> >>
>> >>> >> >> struct module;
>> >>> >> >> +struct kernel_trace;
>> >>> >> >>
>> >>> >> >> struct module_attribute {
>> >>> >> >> struct attribute attr;
>> >>> >> >> @@ -347,6 +348,11 @@
>> >>> >> >> /* Reference counts */
>> >>> >> >> struct module_ref ref[NR_CPUS];
>> >>> >> >> #endif
>> >>> >> >> +#ifdef CONFIG_KTRACE
>> >>> >> >> + struct kernel_trace *ktrace;
>> >>> >> >> + unsigned int num_ktrace;
>> >>> >> >> +#endif
>> >>> >> >> +
>> >>> >> >> };
>> >>> >> >> #ifndef MODULE_ARCH_INIT
>> >>> >> >> #define MODULE_ARCH_INIT {}
>> >>> >> >> Index: git-linux/include/linux/ktrace.h
>> >>> >> >> ===================================================================
>> >>> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> >>> >> >> +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
>> >>> >> >> @@ -0,0 +1,83 @@
>> >>> >> >> +#ifndef _LINUX_KTRACE_H
>> >>> >> >> +#define _LINUX_KTRACE_H
>> >>> >> >> +
>> >>> >> >> +#include <linux/types.h>
>> >>> >> >> +
>> >>> >> >> +struct kernel_trace;
>> >>> >> >> +
>> >>> >> >> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
>> >>> >> >> +
>> >>> >> >> +struct kernel_trace {
>> >>> >> >> + const char *name;
>> >>> >> >> + const char *format; /* format string describing variable list */
>> >>> >> >> + size_t *stroff; /* offsets of string variables */
>> >>> >> >> + /* the 31 bit event_id is converted to 16 bit when written to the buffer */
>> >>> >> >> + u32 enabled:1, event_id:31;
>> >>> >> >> + ktrace_probe_func *func; /* probe function */
>> >>> >> >> + struct list_head list; /* list head linked to the hash table entry */
>> >>> >> >> +} __attribute__((aligned(8)));
>> >>> >> >> +
>> >>> >> >> +extern int ktrace_enabled;
>> >>> >> >> +
>> >>> >> >> +/*
>> >>> >> >> + * Make sure the alignment of the structure in the __ktrace section will
>> >>> >> >> + * not add unwanted padding between the beginning of the section and the
>> >>> >> >> + * structure. Force alignment to the same alignment as the section start.
>> >>> >> >> + */
>> >>> >> >> +
>> >>> >> >> +#define DEFINE_KTRACE_STRUCT(name) \
>> >>> >> >> + struct __attribute__((packed)) ktrace_struct_##name
>> >>> >> >> +
>> >>> >> >> +#ifdef CONFIG_KTRACE
>> >>> >> >> +/*
>> >>> >> >> + * DO_TRACE is the macro that is called from each trace point. Each trace
>> >>> >> >> + * point is associated with a static ktrace object that is located in a
>> >>> >> >> + * special __ktrace section and is uniquely identified by the event name and
>> >>> >> >> + * event format. Take a look at the files under include/trace for example
>> >>> >> >> + * usage. Note that we call the trace probe function only if the global
>> >>> >> >> + * tracing is enabled and the tracing for the associated event is enabled,
>> >>> >> >> + * so we only introduce the overhead of a predicted conditional branch when
>> >>> >> >> + * tracing is disabled.
>> >>> >> >> + */
>> >>> >> >> +#define DO_TRACE(name, format, args...) \
>> >>> >> >> + do { \
>> >>> >> >> + static const char __kstrtab_##name[] \
>> >>> >> >> + __attribute__((section("__ktrace_strings"))) \
>> >>> >> >> + = #name "\0" format; \
>> >>> >> >> + static struct kernel_trace __ktrace_##name \
>> >>> >> >> + __attribute__((section("__ktrace"), aligned(8))) = \
>> >>> >> >> + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
>> >>> >> >> + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
>> >>> >> >> + __ktrace_check_format(format, ## args); \
>> >>> >> >> + if (unlikely(ktrace_enabled) && \
>> >>> >> >> + unlikely(__ktrace_##name.enabled)) { \
>> >>> >> >> + struct ktrace_struct_##name karg = { args }; \
>> >>> >> >> + (*__ktrace_##name.func) \
>> >>> >> >> + (&__ktrace_##name, &karg, sizeof(karg)); \
>> >>> >> >> + } \
>> >>> >> >> + } while (0)
>> >>> >> >> +
>> >>> >> >> +#else /* !CONFIG_KTRACE */
>> >>> >> >> +#define DO_TRACE(name, format, args...)
>> >>> >> >> +#endif
>> >>> >> >> +
>> >>> >> >> +/* To be used for string format validity checking with gcc */
>> >>> >> >> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
>> >>> >> >> +{
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +#define __ktrace_check_format(format, args...) \
>> >>> >> >> + do { \
>> >>> >> >> + if (0) \
>> >>> >> >> + ___ktrace_check_format(format, ## args); \
>> >>> >> >> + } while (0)
>> >>> >> >> +
>> >>> >> >> +
>> >>> >> >> +/* get the trace buffer information */
>> >>> >> >> +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
>> >>> >> >> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
>> >>> >> >> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
>> >>> >> >> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
>> >>> >> >> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
>> >>> >> >> +
>> >>> >> >> +#endif
>> >>> >> >> Index: git-linux/kernel/ktrace.c
>> >>> >> >> ===================================================================
>> >>> >> >> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> >>> >> >> +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
>> >>> >> >> @@ -0,0 +1,872 @@
>> >>> >> >> +/*
>> >>> >> >> + * kernel/ktrace.c
>> >>> >> >> + *
>> >>> >> >> + * Implementation of the kernel tracing for linux kernel 2.6.
>> >>> >> >> + *
>> >>> >> >> + * Copyright 2008- Google Inc.
>> >>> >> >> + * Original Author: Jiaying Zhang
>> >>> >> >> + */
>> >>> >> >> +
>> >>> >> >> +#include <linux/module.h>
>> >>> >> >> +#include <linux/kernel.h>
>> >>> >> >> +#include <linux/bitmap.h>
>> >>> >> >> +#include <linux/poll.h>
>> >>> >> >> +#include <linux/errno.h>
>> >>> >> >> +#include <linux/sched.h>
>> >>> >> >> +#include <linux/linkage.h>
>> >>> >> >> +#include <linux/debugfs.h>
>> >>> >> >> +#include <linux/timer.h>
>> >>> >> >> +#include <linux/delay.h>
>> >>> >> >> +#include <linux/jhash.h>
>> >>> >> >> +#include <linux/ctype.h>
>> >>> >> >> +#include <linux/namei.h>
>> >>> >> >> +#include <linux/ktrace.h>
>> >>> >> >> +#include <linux/ring_buffer.h>
>> >>> >> >> +
>> >>> >> >> +static char *ktrace_version = "1.0.0";
>> >>> >> >> +int ktrace_enabled;
>> >>> >> >> +EXPORT_SYMBOL(ktrace_enabled);
>> >>> >> >> +static uint16_t ktrace_next_id;
>> >>> >> >> +
>> >>> >> >> +extern struct kernel_trace __start___ktraces[];
>> >>> >> >> +extern struct kernel_trace __stop___ktraces[];
>> >>> >> >> +
>> >>> >> >> +static DEFINE_MUTEX(ktrace_mutex);
>> >>> >> >> +
>> >>> >> >> +/* ktrace hash table entry structure and variables */
>> >>> >> >> +struct ktrace_entry {
>> >>> >> >> + struct hlist_node hlist;
>> >>> >> >> + char *name;
>> >>> >> >> + char *format;
>> >>> >> >> + size_t *stroff;
>> >>> >> >> + ktrace_probe_func *func;
>> >>> >> >> + u32 enabled:1, event_id:31;
>> >>> >> >> + struct list_head klist; /* list of loaded ktraces */
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +#define KTRACE_HASH_BITS 6
>> >>> >> >> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
>> >>> >> >> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
>> >>> >> >> +
>> >>> >> >> +/* debugfs directory variables */
>> >>> >> >> +static struct dentry *tracedir;
>> >>> >> >> +static struct dentry *eventdir;
>> >>> >> >> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
>> >>> >> >> +
>> >>> >> >> +/* ring buffer code */
>> >>> >> >> +static unsigned long trace_buf_size = 65536UL;
>> >>> >> >> +struct ring_buffer *trace_buffer;
>> >>> >> >> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
>> >>> >> >> +static struct timer_list trace_timer; /* reader wake-up timer */
>> >>> >> >> +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
>> >>> >> >> +static struct kref ktrace_kref;
>> >>> >> >> +
>> >>> >> >> +static void
>> >>> >> >> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> >>> >> >> +{
>> >>> >> >> + struct ring_buffer_event *event;
>> >>> >> >> + void *trace_event;
>> >>> >> >> + unsigned long flag;
>> >>> >> >> +
>> >>> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
>> >>> >> >> + sizeof(uint16_t) + event_size, &flag);
>> >>> >> >> + if (!event)
>> >>> >> >> + return;
>> >>> >> >> + trace_event = ring_buffer_event_data(event);
>> >>> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> >>> >> >> + memcpy(trace_event + sizeof(uint16_t), data, event_size);
>> >>> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* special probe function for events that contain string arguments */
>> >>> >> >> +static void
>> >>> >> >> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> >>> >> >> +{
>> >>> >> >> + struct ring_buffer_event *event;
>> >>> >> >> + void *trace_event, *p;
>> >>> >> >> + unsigned long flag;
>> >>> >> >> + size_t *offset, scanned;
>> >>> >> >> + char *string;
>> >>> >> >> +
>> >>> >> >> + /*
>> >>> >> >> + * Get the real length of the event, i.e., we use stroff array to
>> >>> >> >> + * locate string variables in the passed-in trace data and use their
>> >>> >> >> + * length to replace the size of the string pointers.
>> >>> >> >> + */
>> >>> >> >> + for (offset = kt->stroff; *offset != -1; offset++) {
>> >>> >> >> + string = *(char **) (data + *offset);
>> >>> >> >> + event_size += strlen(string) + 1 - sizeof(char *);
>> >>> >> >> + }
>> >>> >> >> + event = ring_buffer_lock_reserve(trace_buffer,
>> >>> >> >> + sizeof(uint16_t) + event_size, &flag);
>> >>> >> >> + if (!event)
>> >>> >> >> + return;
>> >>> >> >> + trace_event = ring_buffer_event_data(event);
>> >>> >> >> + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> >>> >> >> + p = trace_event + sizeof(uint16_t);
>> >>> >> >> + /*
>> >>> >> >> + * Copy the trace data into buffer. For string variables, we enter the
>> >>> >> >> + * string into the buffer. Otherwise, the passed in data is copied.
>> >>> >> >> + */
>> >>> >> >> + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
>> >>> >> >> + memcpy(p, data + scanned, *offset - scanned);
>> >>> >> >> + p += *offset - scanned;
>> >>> >> >> + string = *(char **) (data + *offset);
>> >>> >> >> + memcpy(p, string, strlen(string) + 1);
>> >>> >> >> + p += strlen(string) + 1;
>> >>> >> >> + scanned = *offset + sizeof(char *);
>> >>> >> >> + }
>> >>> >> >> + memcpy(p, data + scanned,
>> >>> >> >> + trace_event + sizeof(uint16_t) + event_size - p);
>> >>> >> >> + ring_buffer_unlock_commit(trace_buffer, event, flag);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* timer function used to defer ring buffer reader waking */
>> >>> >> >> +static void wakeup_readers(unsigned long data)
>> >>> >> >> +{
>> >>> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> >>> >> >> + wake_up_interruptible(&trace_wait);
>> >>> >> >> + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* function for reading ktrace metadata info */
>> >>> >> >> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
>> >>> >> >> +{
>> >>> >> >> + seq_printf(seq, "version %s\n", ktrace_version);
>> >>> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
>> >>> >> >> +{
>> >>> >> >> + ++*pos;
>> >>> >> >> + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int tracing_info_show(struct seq_file *seq, void *v)
>> >>> >> >> +{
>> >>> >> >> + struct hlist_head *head = v;
>> >>> >> >> + struct hlist_node *node;
>> >>> >> >> + struct ktrace_entry *entry;
>> >>> >> >> + hlist_for_each_entry(entry, node, head, hlist)
>> >>> >> >> + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
>> >>> >> >> + entry->name, entry->format, entry->event_id,
>> >>> >> >> + entry->enabled ? "enabled" : "disabled");
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void tracing_info_stop(struct seq_file *seq, void *v)
>> >>> >> >> +{
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct seq_operations tracing_info_seq_ops = {
>> >>> >> >> + .start = tracing_info_start,
>> >>> >> >> + .next = tracing_info_next,
>> >>> >> >> + .show = tracing_info_show,
>> >>> >> >> + .stop = tracing_info_stop,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +static int tracing_info_open(struct inode *inode, struct file *file)
>> >>> >> >> +{
>> >>> >> >> + return seq_open(file, &tracing_info_seq_ops);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations traceinfo_file_operations = {
>> >>> >> >> + .open = tracing_info_open,
>> >>> >> >> + .read = seq_read,
>> >>> >> >> + .llseek = seq_lseek,
>> >>> >> >> + .release = seq_release,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/*
>> >>> >> >> + * wrapper function used by debugfs write operation.
>> >>> >> >> + * func: handling function that does real work
>> >>> >> >> + */
>> >>> >> >> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
>> >>> >> >> + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
>> >>> >> >> +{
>> >>> >> >> + int ret;
>> >>> >> >> + char buf[64];
>> >>> >> >> + unsigned long val;
>> >>> >> >> +
>> >>> >> >> + if (cnt >= sizeof(buf))
>> >>> >> >> + return -EINVAL;
>> >>> >> >> + if (copy_from_user(&buf, ubuf, cnt))
>> >>> >> >> + return -EFAULT;
>> >>> >> >> + buf[cnt] = 0;
>> >>> >> >> +
>> >>> >> >> + ret = strict_strtoul(buf, 10, &val);
>> >>> >> >> + if (ret < 0)
>> >>> >> >> + return ret;
>> >>> >> >> + val = !!val;
>> >>> >> >> +
>> >>> >> >> + func(filp, val, &cnt);
>> >>> >> >> +
>> >>> >> >> + filp->f_pos += cnt;
>> >>> >> >> + return cnt;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* functions for reading/writing the global 'enabled' control file */
>> >>> >> >> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
>> >>> >> >> + size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> + int ret;
>> >>> >> >> + char buf[64];
>> >>> >> >> + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
>> >>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void __tracing_control_write(struct file *filp,
>> >>> >> >> + unsigned long val, size_t *cnt)
>> >>> >> >> +{
>> >>> >> >> + if (val ^ ktrace_enabled) {
>> >>> >> >> + if (val) {
>> >>> >> >> + trace_timer.expires =
>> >>> >> >> + jiffies + TRACEREAD_WAKEUP_INTERVAL;
>> >>> >> >> + add_timer(&trace_timer);
>> >>> >> >> + ktrace_enabled = 1;
>> >>> >> >> + } else {
>> >>> >> >> + ktrace_enabled = 0;
>> >>> >> >> + del_timer_sync(&trace_timer);
>> >>> >> >> + if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> >>> >> >> + wake_up_interruptible(&trace_wait);
>> >>> >> >> + }
>> >>> >> >> + }
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static ssize_t tracing_control_write(struct file *filp, const char __user *ubuf,
>> >>> >> >> + size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >>> >> >> + __tracing_control_write);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations tracecontrol_file_operations = {
>> >>> >> >> + .read = tracing_control_read,
>> >>> >> >> + .write = tracing_control_write,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* functions for reading/writing a trace event file */
>> >>> >> >> +static int update_ktrace(struct ktrace_entry *entry,
>> >>> >> >> + ktrace_probe_func *func, int enabled)
>> >>> >> >> +{
>> >>> >> >> + struct kernel_trace *iter;
>> >>> >> >> + /* no need to update the list if the tracing is not initialized */
>> >>> >> >> + if (!tracedir)
>> >>> >> >> + return 0;
>> >>> >> >> + entry->enabled = enabled;
>> >>> >> >> + entry->func = func;
>> >>> >> >> + list_for_each_entry(iter, &entry->klist, list) {
>> >>> >> >> + iter->stroff = entry->stroff;
>> >>> >> >> + iter->func = func;
>> >>> >> >> + iter->enabled = enabled;
>> >>> >> >> + }
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int ktrace_probe_register(struct ktrace_entry *entry,
>> >>> >> >> + ktrace_probe_func *probefunc)
>> >>> >> >> +{
>> >>> >> >> + return update_ktrace(entry, probefunc, 1);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
>> >>> >> >> +{
>> >>> >> >> + return update_ktrace(entry, NULL, 0);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int ktrace_control_open(struct inode *inode, struct file *filp)
>> >>> >> >> +{
>> >>> >> >> + filp->private_data = inode->i_private;
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
>> >>> >> >> + size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> + int ret;
>> >>> >> >> + char buf[64];
>> >>> >> >> + struct ktrace_entry *entry = filp->private_data;
>> >>> >> >> +
>> >>> >> >> + ret = snprintf(buf, 64, "%u\n", entry->enabled);
>> >>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/*
>> >>> >> >> + * Check whether the event contains string arguments. Return 0 if not;
>> >>> >> >> + * otherwise record the offsets of the string arguments in the ktrace
>> >>> >> >> + * structure according to the event format and return 1.
>> >>> >> >> + * The offset info is later used to store the string arguments into
>> >>> >> >> + * the trace buffer during trace probe.
>> >>> >> >> + */
>> >>> >> >> +static int contain_string_arguments(struct ktrace_entry *entry)
>> >>> >> >> +{
>> >>> >> >> + int count = 0;
>> >>> >> >> + char *fmt;
>> >>> >> >> + int qualifier; /* 'h', 'l', or 'L' for integer fields */
>> >>> >> >> + size_t offset = 0;
>> >>> >> >> +
>> >>> >> >> + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
>> >>> >> >> + count++;
>> >>> >> >> + if (!count)
>> >>> >> >> + return 0;
>> >>> >> >> + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
>> >>> >> >> + if (!entry->stroff)
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> +
>> >>> >> >> + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
>> >>> >> >> + if (*fmt != '%')
>> >>> >> >> + continue;
>> >>> >> >> +repeat:
>> >>> >> >> + ++fmt;
>> >>> >> >> + switch (*fmt) {
>> >>> >> >> + case '-':
>> >>> >> >> + case '+':
>> >>> >> >> + case ' ':
>> >>> >> >> + case '#':
>> >>> >> >> + case '0':
>> >>> >> >> + ++fmt;
>> >>> >> >> + goto repeat;
>> >>> >> >> + }
>> >>> >> >> +
>> >>> >> >> + while (isdigit(*fmt))
>> >>> >> >> + fmt++;
>> >>> >> >> +
>> >>> >> >> + /* get the conversion qualifier */
>> >>> >> >> + qualifier = -1;
>> >>> >> >> + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
>> >>> >> >> + qualifier = *fmt;
>> >>> >> >> + ++fmt;
>> >>> >> >> + if (qualifier == 'l' && *fmt == 'l') {
>> >>> >> >> + qualifier = 'L';
>> >>> >> >> + ++fmt;
>> >>> >> >> + }
>> >>> >> >> + }
>> >>> >> >> +
>> >>> >> >> + switch (*fmt) {
>> >>> >> >> + case 'c':
>> >>> >> >> + offset += sizeof(char);
>> >>> >> >> + continue;
>> >>> >> >> + case 's':
>> >>> >> >> + entry->stroff[count] = offset;
>> >>> >> >> + count++;
>> >>> >> >> + offset += sizeof(char *);
>> >>> >> >> + continue;
>> >>> >> >> + case 'p':
>> >>> >> >> + offset += sizeof(void *);
>> >>> >> >> + continue;
>> >>> >> >> + case 'd':
>> >>> >> >> + case 'i':
>> >>> >> >> + case 'o':
>> >>> >> >> + case 'u':
>> >>> >> >> + case 'x':
>> >>> >> >> + case 'X':
>> >>> >> >> + switch (qualifier) {
>> >>> >> >> + case 'L':
>> >>> >> >> + offset += sizeof(long long);
>> >>> >> >> + break;
>> >>> >> >> + case 'l':
>> >>> >> >> + offset += sizeof(long);
>> >>> >> >> + break;
>> >>> >> >> + case 'h':
>> >>> >> >> + offset += sizeof(short);
>> >>> >> >> + break;
>> >>> >> >> + default:
>> >>> >> >> + offset += sizeof(int);
>> >>> >> >> + break;
>> >>> >> >> + }
>> >>> >> >> + break;
>> >>> >> >> + default:
>> >>> >> >> + printk(KERN_WARNING "unknown format '%c'", *fmt);
>> >>> >> >> + continue;
>> >>> >> >> + }
>> >>> >> >> + }
>> >>> >> >> + entry->stroff[count] = -1;
>> >>> >> >> + return 1;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void __ktrace_control_write(struct file *filp,
>> >>> >> >> + unsigned long val, size_t *cnt)
>> >>> >> >> +{
>> >>> >> >> + struct ktrace_entry *entry = filp->private_data;
>> >>> >> >> + int ret;
>> >>> >> >> +
>> >>> >> >> + mutex_lock(&ktrace_mutex);
>> >>> >> >> + if (val ^ entry->enabled) {
>> >>> >> >> + if (val) {
>> >>> >> >> + ret = contain_string_arguments(entry);
>> >>> >> >> + if (ret == 0)
>> >>> >> >> + ktrace_probe_register(entry, ring_buffer_probe);
>> >>> >> >> + else if (ret > 0)
>> >>> >> >> + ktrace_probe_register(entry, string_probe);
>> >>> >> >> + else
>> >>> >> >> + *cnt = ret;
>> >>> >> >> + } else
>> >>> >> >> + ktrace_probe_unregister(entry);
>> >>> >> >> + }
>> >>> >> >> + mutex_unlock(&ktrace_mutex);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
>> >>> >> >> + size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >>> >> >> + __ktrace_control_write);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations ktracecontrol_file_operations = {
>> >>> >> >> + .open = ktrace_control_open,
>> >>> >> >> + .read = ktrace_control_read,
>> >>> >> >> + .write = ktrace_control_write,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* functions for adding/removing a trace event. Protected by mutex lock.
>> >>> >> >> + * Called during initialization or after loading a module */
>> >>> >> >> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
>> >>> >> >> +{
>> >>> >> >> + struct ktrace_entry *entry;
>> >>> >> >> + size_t name_len = strlen(ktrace->name) + 1;
>> >>> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>> >>> >> >> + struct hlist_head *head;
>> >>> >> >> + struct hlist_node *node;
>> >>> >> >> + struct dentry *dentry;
>> >>> >> >> +
>> >>> >> >> + if (!tracedir)
>> >>> >> >> + return 0;
>> >>> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> >>> >> >> + hlist_for_each_entry(entry, node, head, hlist) {
>> >>> >> >> + if (!strcmp(ktrace->name, entry->name)) {
>> >>> >> >> + if (strcmp(ktrace->format, entry->format)) {
>> >>> >> >> + printk(KERN_WARNING "the format of tracepoint "
>> >>> >> >> + "\'%s\' changes from \'%s\' to \'%s\'."
>> >>> >> >> + "Dynamic changing of trace format is "
>> >>> >> >> + "not supported yet!\n",
>> >>> >> >> + ktrace->name, entry->format,
>> >>> >> >> + ktrace->format);
>> >>> >> >> + return ERR_PTR(-EINVAL);
>> >>> >> >> + }
>> >>> >> >> + ktrace->enabled = entry->enabled;
>> >>> >> >> + ktrace->func = entry->func;
>> >>> >> >> + ktrace->event_id = entry->event_id;
>> >>> >> >> + if (list_empty(&entry->klist))
>> >>> >> >> + goto add_head;
>> >>> >> >> + list_add_tail(&ktrace->list, &entry->klist);
>> >>> >> >> + return entry;
>> >>> >> >> + }
>> >>> >> >> + }
>> >>> >> >> + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
>> >>> >> >> + if (!entry)
>> >>> >> >> + return ERR_PTR(-ENOMEM);
>> >>> >> >> + entry->name = kmalloc(name_len, GFP_KERNEL);
>> >>> >> >> + if (!entry->name) {
>> >>> >> >> + kfree(entry);
>> >>> >> >> + return ERR_PTR(-ENOMEM);
>> >>> >> >> + }
>> >>> >> >> + memcpy(entry->name, ktrace->name, name_len);
>> >>> >> >> + if (ktrace->format) {
>> >>> >> >> + size_t format_len = strlen(ktrace->format) + 1;
>> >>> >> >> + entry->format = kmalloc(format_len, GFP_KERNEL);
>> >>> >> >> + if (!entry->format) {
>> >>> >> >> + kfree(entry->name);
>> >>> >> >> + kfree(entry);
>> >>> >> >> + return ERR_PTR(-ENOMEM);
>> >>> >> >> + }
>> >>> >> >> + memcpy(entry->format, ktrace->format, format_len);
>> >>> >> >> + } else
>> >>> >> >> + entry->format = NULL;
>> >>> >> >> + entry->func = ktrace->func;
>> >>> >> >> + entry->enabled = 0;
>> >>> >> >> + ktrace->event_id = entry->event_id = ktrace_next_id++;
>> >>> >> >> + entry->stroff = NULL;
>> >>> >> >> + INIT_LIST_HEAD(&entry->klist);
>> >>> >> >> + hlist_add_head(&entry->hlist, head);
>> >>> >> >> +add_head:
>> >>> >> >> + list_add(&ktrace->list, &entry->klist);
>> >>> >> >> + dentry = debugfs_create_file(entry->name, 0660, eventdir,
>> >>> >> >> + entry, &ktracecontrol_file_operations);
>> >>> >> >> + if (!dentry)
>> >>> >> >> + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
>> >>> >> >> + entry->name);
>> >>> >> >> + return entry;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
>> >>> >> >> +{
>> >>> >> >> + struct ktrace_entry *entry;
>> >>> >> >> + size_t name_len = strlen(ktrace->name) + 1;
>> >>> >> >> + u32 hash = jhash(ktrace->name, name_len-1, 0);
>> >>> >> >> + struct hlist_head *head;
>> >>> >> >> + struct hlist_node *node, *temp;
>> >>> >> >> + struct dentry *dentry;
>> >>> >> >> +
>> >>> >> >> + if (!tracedir)
>> >>> >> >> + return 0;
>> >>> >> >> + list_del(&ktrace->list);
>> >>> >> >> + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> >>> >> >> + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
>> >>> >> >> + if (!strcmp(ktrace->name, entry->name)) {
>> >>> >> >> + if (list_empty(&entry->klist)) {
>> >>> >> >> + dentry = lookup_one_len(entry->name,
>> >>> >> >> + eventdir, strlen(entry->name));
>> >>> >> >> + if (dentry && !IS_ERR(dentry)) {
>> >>> >> >> + debugfs_remove(dentry);
>> >>> >> >> + dput(dentry);
>> >>> >> >> + }
>> >>> >> >> + entry->enabled = 0;
>> >>> >> >> + if (free) {
>> >>> >> >> + hlist_del(&entry->hlist);
>> >>> >> >> + kfree(entry->name);
>> >>> >> >> + kfree(entry->format);
>> >>> >> >> + kfree(entry->stroff);
>> >>> >> >> + kfree(entry);
>> >>> >> >> + }
>> >>> >> >> + }
>> >>> >> >> + return 0;
>> >>> >> >> + }
>> >>> >> >> + return -1;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
>> >>> >> >> + * Called at the end of module load/unload. */
>> >>> >> >> +static int ktrace_module_notify(struct notifier_block *self,
>> >>> >> >> + unsigned long val, void *data)
>> >>> >> >> +{
>> >>> >> >> +#ifdef CONFIG_MODULES
>> >>> >> >> + struct kernel_trace *iter;
>> >>> >> >> + struct module *mod = data;
>> >>> >> >> +
>> >>> >> >> + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
>> >>> >> >> + return 0;
>> >>> >> >> + mutex_lock(&ktrace_mutex);
>> >>> >> >> + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
>> >>> >> >> + if (val == MODULE_STATE_COMING)
>> >>> >> >> + add_ktrace(iter);
>> >>> >> >> + else
>> >>> >> >> + remove_ktrace(iter, 0);
>> >>> >> >> + mutex_unlock(&ktrace_mutex);
>> >>> >> >> +#endif
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static struct notifier_block ktrace_module_nb = {
>> >>> >> >> + .notifier_call = ktrace_module_notify,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* functions for user-space programs to read data from tracing buffer */
>> >>> >> >> +static int trace_data_open(struct inode *inode, struct file *filp)
>> >>> >> >> +{
>> >>> >> >> + filp->private_data = inode->i_private;
>> >>> >> >> + kref_get(&ktrace_kref);
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void release_trace_buffer(struct kref *kref)
>> >>> >> >> +{
>> >>> >> >> + ring_buffer_free(trace_buffer);
>> >>> >> >> + trace_buffer = NULL;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int trace_data_release(struct inode *inode, struct file *filp)
>> >>> >> >> +{
>> >>> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
>> >>> >> >> +{
>> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >>> >> >> + if (filp->f_mode & FMODE_READ) {
>> >>> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> >>> >> >> + return POLLIN | POLLRDNORM;
>> >>> >> >> + poll_wait(filp, &trace_wait, poll_table);
>> >>> >> >> + if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> >>> >> >> + return POLLIN | POLLRDNORM;
>> >>> >> >> + }
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> >>> >> >> +{
>> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
>> >>> >> >> + pgoff_t pgoff = vmf->pgoff;
>> >>> >> >> + struct page *page;
>> >>> >> >> +
>> >>> >> >> + if (!trace_buffer)
>> >>> >> >> + return VM_FAULT_OOM;
>> >>> >> >> + page = ring_buffer_get_page(cpu_buffer, pgoff);
>> >>> >> >> + if (page == NULL)
>> >>> >> >> + return VM_FAULT_SIGBUS;
>> >>> >> >> + get_page(page);
>> >>> >> >> + vmf->page = page;
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static struct vm_operations_struct trace_data_mmap_ops = {
>> >>> >> >> + .fault = trace_buf_fault,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
>> >>> >> >> +{
>> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >>> >> >> +
>> >>> >> >> + vma->vm_ops = &trace_data_mmap_ops;
>> >>> >> >> + vma->vm_flags |= VM_DONTEXPAND;
>> >>> >> >> + vma->vm_private_data = cpu_buffer;
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
>> >>> >> >> + unsigned int cmd, unsigned long arg)
>> >>> >> >> +{
>> >>> >> >> + u32 __user *argp = (u32 __user *)arg;
>> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >>> >> >> +
>> >>> >> >> + if (!trace_buffer || !tracedir)
>> >>> >> >> + return -EINVAL;
>> >>> >> >> + switch (cmd) {
>> >>> >> >> + case KTRACE_BUF_GET_SIZE:
>> >>> >> >> + {
>> >>> >> >> + unsigned long bufsize;
>> >>> >> >> + bufsize = ring_buffer_size(trace_buffer);
>> >>> >> >> + return put_user((u32)bufsize, argp);
>> >>> >> >> + }
>> >>> >> >> + case KTRACE_BUF_GET_PRODUCED:
>> >>> >> >> + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
>> >>> >> >> + case KTRACE_BUF_GET_CONSUMED:
>> >>> >> >> + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
>> >>> >> >> + case KTRACE_BUF_PUT_CONSUMED:
>> >>> >> >> + {
>> >>> >> >> + u32 consumed, consumed_old;
>> >>> >> >> + int ret;
>> >>> >> >> +
>> >>> >> >> + ret = get_user(consumed, argp);
>> >>> >> >> + if (ret) {
>> >>> >> >> + printk(KERN_WARNING
>> >>> >> >> + "error getting consumed value: %d\n", ret);
>> >>> >> >> + return ret;
>> >>> >> >> + }
>> >>> >> >> + consumed_old = ring_buffer_get_consumed(cpu_buffer);
>> >>> >> >> + if (consumed == consumed_old)
>> >>> >> >> + return 0;
>> >>> >> >> + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
>> >>> >> >> + return 0;
>> >>> >> >> + }
>> >>> >> >> + default:
>> >>> >> >> + return -ENOIOCTLCMD;
>> >>> >> >> + }
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations tracedata_file_operations = {
>> >>> >> >> + .open = trace_data_open,
>> >>> >> >> + .poll = trace_data_poll,
>> >>> >> >> + .mmap = trace_data_mmap,
>> >>> >> >> + .ioctl = trace_data_ioctl,
>> >>> >> >> + .release = trace_data_release,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* function to report the number of lost events due to buffer overflow */
>> >>> >> >> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
>> >>> >> >> + size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> + int ret;
>> >>> >> >> + char buf[64];
>> >>> >> >> + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
>> >>> >> >> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations traceoverflow_file_operations = {
>> >>> >> >> + .read = trace_overflow_read,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* function to re-initialize trace buffer */
>> >>> >> >> +static void __trace_buffer_reset(struct file *filp,
>> >>> >> >> + unsigned long val, size_t *cnt)
>> >>> >> >> +{
>> >>> >> >> + if (val && trace_buffer)
>> >>> >> >> + ring_buffer_reset(trace_buffer);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
>> >>> >> >> + size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> + return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >>> >> >> + __trace_buffer_reset);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations tracereset_file_operations = {
>> >>> >> >> + .write = trace_buffer_reset,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/*
>> >>> >> >> + * We use debugfs for kernel-user communication. All of the control/info
>> >>> >> >> + * files are under debugfs/ktrace directory.
>> >>> >> >> + */
>> >>> >> >> +static int create_debugfs(void)
>> >>> >> >> +{
>> >>> >> >> + struct dentry *entry, *bufdir;
>> >>> >> >> + struct ring_buffer_per_cpu *cpu_buffer;
>> >>> >> >> + char *tmpname;
>> >>> >> >> + int cpu;
>> >>> >> >> +
>> >>> >> >> + tracedir = debugfs_create_dir("ktrace", NULL);
>> >>> >> >> + if (!tracedir) {
>> >>> >> >> + printk(KERN_WARNING "Couldn't create debugfs directory\n");
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + }
>> >>> >> >> +
>> >>> >> >> + /* 'buffers' is the directory that holds trace buffer debugfs files */
>> >>> >> >> + bufdir = debugfs_create_dir("buffers", tracedir);
>> >>> >> >> + if (!bufdir) {
>> >>> >> >> + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + }
>> >>> >> >> +
>> >>> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> >>> >> >> + if (!tmpname)
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + /* create a debugfs file for each cpu buffer */
>> >>> >> >> + for_each_possible_cpu(cpu) {
>> >>> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> >>> >> >> + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
>> >>> >> >> + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
>> >>> >> >> + &tracedata_file_operations);
>> >>> >> >> + if (!entry) {
>> >>> >> >> + printk(KERN_WARNING
>> >>> >> >> + "Couldn't create debugfs \'%s\' entry\n",
>> >>> >> >> + tmpname);
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + }
>> >>> >> >> + }
>> >>> >> >> + kfree(tmpname);
>> >>> >> >> +
>> >>> >> >> + /* the control file for users to enable/disable global tracing. */
>> >>> >> >> + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
>> >>> >> >> + &tracecontrol_file_operations);
>> >>> >> >> + if (!entry) {
>> >>> >> >> + printk(KERN_WARNING
>> >>> >> >> + "Couldn't create debugfs 'enabled' entry\n");
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + }
>> >>> >> >> +
>> >>> >> >> + /*
>> >>> >> >> + * the debugfs file that displays the name, format etc. of every
>> >>> >> >> + * supported trace event. The file is to be used by the user-space
>> >>> >> >> + * trace parser to analyze the collected trace data.
>> >>> >> >> + */
>> >>> >> >> + entry = debugfs_create_file("info", 0444, tracedir, NULL,
>> >>> >> >> + &traceinfo_file_operations);
>> >>> >> >> + if (!entry) {
>> >>> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + }
>> >>> >> >> +
>> >>> >> >> + /*
>> >>> >> >> + * the debugfs file that reports the number of events
>> >>> >> >> + * lost due to buffer overflow
>> >>> >> >> + */
>> >>> >> >> + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
>> >>> >> >> + &traceoverflow_file_operations);
>> >>> >> >> + if (!entry) {
>> >>> >> >> + printk(KERN_WARNING
>> >>> >> >> + "Couldn't create debugfs 'overflow' entry\n");
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + }
>> >>> >> >> +
>> >>> >> >> + /* the debugfs file that resets trace buffer upon write */
>> >>> >> >> + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
>> >>> >> >> + &tracereset_file_operations);
>> >>> >> >> + if (!entry) {
>> >>> >> >> + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + }
>> >>> >> >> +
>> >>> >> >> + /* create the directory that holds the control files for every event */
>> >>> >> >> + eventdir = debugfs_create_dir("events", tracedir);
>> >>> >> >> + if (!eventdir) {
>> >>> >> >> + printk(KERN_WARNING "Couldn't create 'events' directory\n");
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + }
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void remove_debugfs_file(struct dentry *dir, char *name)
>> >>> >> >> +{
>> >>> >> >> + struct dentry *dentry;
>> >>> >> >> + dentry = lookup_one_len(name, dir, strlen(name));
>> >>> >> >> + if (dentry && !IS_ERR(dentry)) {
>> >>> >> >> + debugfs_remove(dentry);
>> >>> >> >> + dput(dentry);
>> >>> >> >> + }
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int remove_debugfs(void)
>> >>> >> >> +{
>> >>> >> >> + struct dentry *bufdir;
>> >>> >> >> + char *tmpname;
>> >>> >> >> + int cpu;
>> >>> >> >> +
>> >>> >> >> + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
>> >>> >> >> + if (!bufdir || IS_ERR(bufdir))
>> >>> >> >> + return -EIO;
>> >>> >> >> + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> >>> >> >> + if (!tmpname)
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + for_each_possible_cpu(cpu) {
>> >>> >> >> + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> >>> >> >> + remove_debugfs_file(bufdir, tmpname);
>> >>> >> >> + }
>> >>> >> >> + kfree(tmpname);
>> >>> >> >> + debugfs_remove(bufdir);
>> >>> >> >> +
>> >>> >> >> + remove_debugfs_file(tracedir, "enabled");
>> >>> >> >> + remove_debugfs_file(tracedir, "info");
>> >>> >> >> + debugfs_remove(eventdir);
>> >>> >> >> + debugfs_remove(tracedir);
>> >>> >> >> + return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int __init init_ktrace(void)
>> >>> >> >> +{
>> >>> >> >> + struct kernel_trace *iter;
>> >>> >> >> + int i, err;
>> >>> >> >> +
>> >>> >> >> + ktrace_next_id = 0;
>> >>> >> >> + kref_set(&ktrace_kref, 1);
>> >>> >> >> + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
>> >>> >> >> + if (trace_buffer == NULL)
>> >>> >> >> + return -ENOMEM;
>> >>> >> >> + setup_timer(&trace_timer, wakeup_readers, 0);
>> >>> >> >> + for (i = 0; i < KTRACE_HASH_SIZE; i++)
>> >>> >> >> + INIT_HLIST_HEAD(&ktrace_table[i]);
>> >>> >> >> +
>> >>> >> >> + mutex_lock(&ktrace_mutex);
>> >>> >> >> + err = create_debugfs();
>> >>> >> >> + if (err != 0)
>> >>> >> >> + goto out;
>> >>> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> >>> >> >> + add_ktrace(iter);
>> >>> >> >> + err = register_module_notifier(&ktrace_module_nb);
>> >>> >> >> +out:
>> >>> >> >> + mutex_unlock(&ktrace_mutex);
>> >>> >> >> + return err;
>> >>> >> >> +}
>> >>> >> >> +module_init(init_ktrace);
>> >>> >> >> +
>> >>> >> >> +static void __exit exit_ktrace(void)
>> >>> >> >> +{
>> >>> >> >> + struct kernel_trace *iter;
>> >>> >> >> +
>> >>> >> >> + if (ktrace_enabled) {
>> >>> >> >> + ktrace_enabled = 0;
>> >>> >> >> + del_timer_sync(&trace_timer);
>> >>> >> >> + }
>> >>> >> >> + mutex_lock(&ktrace_mutex);
>> >>> >> >> + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> >>> >> >> + remove_ktrace(iter, 1);
>> >>> >> >> + unregister_module_notifier(&ktrace_module_nb);
>> >>> >> >> + remove_debugfs();
>> >>> >> >> + mutex_unlock(&ktrace_mutex);
>> >>> >> >> +
>> >>> >> >> + kref_put(&ktrace_kref, release_trace_buffer);
>> >>> >> >> +}
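>> >>> >> >>
>> >>> >> >> For completeness, a user-space reader is expected to drive each per-cpu
>> >>> >> >> buffer roughly as follows (an illustrative sketch with error handling
>> >>> >> >> omitted; it assumes debugfs is mounted at /sys/kernel/debug and needs
>> >>> >> >> <fcntl.h>, <poll.h>, <sys/ioctl.h>, <sys/mman.h> plus the KTRACE_BUF_*
>> >>> >> >> definitions above):
>> >>> >> >>
>> >>> >> >> int fd = open("/sys/kernel/debug/ktrace/buffers/cpu0", O_RDONLY);
>> >>> >> >> __u32 size, produced, consumed;
>> >>> >> >> void *buf;
>> >>> >> >>
>> >>> >> >> ioctl(fd, KTRACE_BUF_GET_SIZE, &size);
>> >>> >> >> buf = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
>> >>> >> >>
>> >>> >> >> for (;;) {
>> >>> >> >> 	struct pollfd pfd = { .fd = fd, .events = POLLIN };
>> >>> >> >>
>> >>> >> >> 	poll(&pfd, 1, -1);	/* woken up by the reader wake-up timer */
>> >>> >> >> 	ioctl(fd, KTRACE_BUF_GET_PRODUCED, &produced);
>> >>> >> >> 	ioctl(fd, KTRACE_BUF_GET_CONSUMED, &consumed);
>> >>> >> >> 	/* parse the events between consumed and produced in buf ... */
>> >>> >> >> 	ioctl(fd, KTRACE_BUF_PUT_CONSUMED, &produced);
>> >>> >> >> }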
>> >>> >> >
>> >>> >> > --
>> >>> >> > Mathieu Desnoyers
>> >>> >> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>> >>> >> >
>> >>> >
>> >>> > --
>> >>> > Mathieu Desnoyers
>> >>> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>> >>> >
>> >>>
>> >>> _______________________________________________
>> >>> ltt-dev mailing list
>> >>> [email protected]
>> >>> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
>> >>>
>> >>
>> >> --
>> >> Mathieu Desnoyers
>> >> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>> >>
>> >
>>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>