DomainKey-Signature: a=rsa-sha1; s=beta; d=google.com; c=nofws; q=dns;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to:
	cc:content-type:content-transfer-encoding:x-system-of-record;
	b=jkvaHhpAgVRImFhyzi5pSLN0jUiWFHb8nDuf7gLgCV4HW6GdVQgRkP1koam3DP/gQ
	Kxxki9dlUQLFLi86bDZIQ==
MIME-Version: 1.0
In-Reply-To: <20090319213106.GA3443@Krystal>
References: <5df78e1d0902231610y615bcfb5hdbbc44736ecfce00@mail.gmail.com>
	 <20090313170415.GE3354@Krystal>
	 <5df78e1d0903161623s1d5d2a04h946668ac0c6a69f7@mail.gmail.com>
	 <20090317000138.GA19848@Krystal>
	 <5df78e1d0903161840o50becebcxe2378805049fee77@mail.gmail.com>
	 <20090317015359.GA23976@Krystal>
	 <5df78e1d0903161912p405a96bbnd7e9a247bbfd59e4@mail.gmail.com>
	 <5df78e1d0903191406w124b946eo655649ef6d2b342f@mail.gmail.com>
	 <20090319213106.GA3443@Krystal>
Date: Thu, 19 Mar 2009 15:06:15 -0700
Message-ID: <5df78e1d0903191506k15757e1elaf56a5f66c02ac56@mail.gmail.com>
Subject: Re: [ltt-dev] [RFC PATCH] Propose a new kernel tracing framework
From: Jiaying Zhang <jiayingz@google.com>
To: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Michael Rubin <mrubin@google.com>, linux-kernel@vger.kernel.org,
       Michael Davidson <md@google.com>, Steven Rostedt <srostedt@redhat.com>,
       ltt-dev@lists.casi.polymtl.ca, Martin Bligh <mbligh@google.com>,
       Ingo Molnar <mingo@elte.hu>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 74074
Lines: 1565

On Thu, Mar 19, 2009 at 2:31 PM, Mathieu Desnoyers
<compudj@krystal.dyndns.org> wrote:
> * Jiaying Zhang (jiayingz@google.com) wrote:
>> Hi Mathieu,
>>
>
> Hi Jiaying,
>
>> I ran tbench tests with the latest lttng patches (version 0.111) and
>> 2.6.29-rc8 kernel. I used the same set of trace events in my tests:
>> syscall entry/exit, irq entry/exit, trap entry/exit, sched_schedule,
>> process_fork, process_wait, process_free, process_exit.
>>
>> Here are the results I got:
>>
>> vanilla kernel: Throughput 673.269 MB/sec 4 procs
>> lttng normal mode: Throughput 462.786 MB/sec 4 procs (31.252% slowdown)
>> lttng flight-recorder: Throughput 484.547 MB/sec 4 procs (28.03% slowdown)
>> ktrace: Throughput 569.508 MB/sec 4 procs (15.411% slowdown)
>>
>
> Does your machine have synchronized timestamp counters ?
Yes.

>
> Also note that I added some further optimisations in 0.112, including
> some data prefetching.
>
>> The kernel LTTng configurations I used are:
>>
>> CONFIG_LTT=y
>> CONFIG_LTT_FILTER=y
>> CONFIG_HAVE_LTT_DUMP_TABLES=y
>> CONFIG_LTT_RELAY_ALLOC=y
>> CONFIG_LTT_RELAY_LOCKLESS=y
>> # CONFIG_LTT_RELAY_LOCKED is not set
>> CONFIG_LTT_SERIALIZE=y
>> CONFIG_LTT_FAST_SERIALIZE=y
>> CONFIG_LTT_TRACEPROBES=y
>> CONFIG_LTT_TRACE_CONTROL=y
>> CONFIG_LTT_TRACER=y
>> # CONFIG_LTT_ALIGNMENT is not set
>> CONFIG_LTT_CHECK_ARCH_EFFICIENT_UNALIGNED_ACCESS=y
>> # CONFIG_LTT_DEBUG_EVENT_SIZE is not set
>> CONFIG_LTT_USERSPACE_EVENT=m
>
>> CONFIG_LTT_VMCORE=y
>
> This can slow down tracing. You might want to try without the VMCORE
> option.
>
>> CONFIG_LTT_KPROBES=y
>> CONFIG_LTT_STATEDUMP=m
>> # CONFIG_LTT_ASCII is not set
>> CONFIG_HAVE_IMMEDIATE=y
>> CONFIG_IMMEDIATE=y
>>
>> The LTTng performance seems to have improved compared with the
>> results I measured before. But I think there is still big room to further
>> optimize it when we look at Ktrace results as a comparison.
>> One worry I have is that LTTng patch set is getting bigger and bigger.
>> It is good to add more features and instrumentations, but it might
>> make it harder for performance optimization. Maybe we should take
>> a different approach. I.e., we start with something simple and make
>> sure it is implemented right and efficient. Then we extend it and make
>> sure we don't sacrifice performance as we add more features.
>> I think a good time to do this is when you port LTTng patches to
>> mainline kernel. As ftrace people are working on adding event tracing
>> in ftrace, that seems a good opportunity too.
>>
>
> Well, as any project goes, there are "feature addition" phases and
> "optimization" phases. I think the job I did in the last week will help
> a lot simplifying fast path optimization work. I basically trimmed it
> down to a few small inlined functions. Therefore, the instruction
> profile given by oprofile is now understandable and it's easier to
> review the resulting assembly. So your argument about the patchset size
> does not hold as much, because it is now possible to focus on a few
> small functions.
>
> Even if we start from zero as you propose, we will end up adding
> regressions during the feature addition phases, and we will end up in
> the same situation we were a few months ago.
But we should have a continuous performance test running there that
prevents any regression patch slipping in.

>
> If we want to see what can improve performance in LTTng, we can simply
> disable the hot path piece by piece and figure out where the cycles go,
> as I did recently. (and use oprofile, of course) Given that the LTTng
> projet is very modular, it makes it easier to enable individual
> features.
>
> Another thing that might be worth looking at is to skip the tracepoint
> probes -> marker callbacks, because they will likely add a noticeable
> overhead. However, the tradeoff here is to accept some overhead for
> clear separation of the core kernel instrumentation from the markers
> exposed to userspace.
>
> As you say, there will be a piece-by-piece review/integration phase that
> will likely take place when I will discuss LTTng and ftrace integration.
Interesting to hear this. Do you have any time plan for this?

Jiaying

>
> Mathieu
>
>> Jiaying
>>
>> On Mon, Mar 16, 2009 at 7:12 PM, Jiaying Zhang <jiayingz@google.com> wrote:
>> > On Mon, Mar 16, 2009 at 6:53 PM, Mathieu Desnoyers
>> > <compudj@krystal.dyndns.org> wrote:
>> >> * Jiaying Zhang (jiayingz@google.com) wrote:
>> >>> On Mon, Mar 16, 2009 at 5:01 PM, Mathieu Desnoyers
>> >>> <mathieu.desnoyers@polymtl.ca> wrote:
>> >>> > * Jiaying Zhang (jiayingz@google.com) wrote:
>> >>> >> Hi Mathieu,
>> >>> >>
>> >>> >> First, I apologize that I did not forward you my original email. I assumed
>> >>> >> that the email would be noticed by interested people when it was sent
>> >>> >> to lkml. Now I see that is a mistake.
>> >>> >>
>> >>> >
>> >>> > Hi Jiaying,
>> >>> >
>> >>> > Apologies accepted, no hard feeling :) I have just been surprised when I
>> >>> > found out about the numbers you posted.
>> >>> >
>> >>> >> On Fri, Mar 13, 2009 at 10:04 AM, Mathieu Desnoyers
>> >>> >> <mathieu.desnoyers@polymtl.ca> wrote:
>> >>> >> > Hi Jiaying,
>> >>> >> >
>> >>> >> > I'm very interested to hear about the details of your test-case, because
>> >>> >> > what you are talking about here, and the numbers that you get, are, at
>> >>> >> > the very least, not representative of the current LTTng. Perharps you
>> >>> >> > used a version right after the kernel summit and stopped following the
>> >>> >> > improvements we have done since then ? I also wonder about the instrumentation
>> >>> >> > set you enabled in your tests, and whether it's exactly the same in the ktrace
>> >>> >> > and LTTng runs (especially system call tracing). Please note that following the
>> >>> >> > Kernel Summit 2008, a more active development cycle began, and the time was not
>> >>> >> > right for optimisation. But since then a lot of improvement has been done on
>> >>> >> > this side.
>> >>> >>
>> >>> >> Sorry that I omitted a lot of details in my test description. I did use an old
>> >>> >> version of LTTng in my tests. The LTTng version I used was the 0.21 lttng
>> >>> >> patches for 2.6.27 kernel. I used 2.6.26 kernel in my tests, with the ported
>> >>> >> lttng-2.6.27-rc6-0.21 patches.
>> >>> >>
>> >>> >> I enabled the same set of trace events in ktrace and LTTng runs. Here is
>> >>> >> the list of enabled events: syscall entry/exit, irq entry/exit, trap
>> >>> >> entry/exit,
>> >>> >> kernel_sched_switch, kernel_process_fork, kernel_process_wait,
>> >>> >> kernel_process_free, kernel_process_exit.
>> >>> >>
>> >>> >> We are glad to see the improvements coming from the LTTng community.
>> >>> >> I haven't got time to sync with the latest LTTng code and re-run my tests.
>> >>> >> I will let you know when I have such data available.
>> >>> >>
>> >>> >
>> >>> > On my side, with flight recorder activated, same instrumentation you
>> >>> > use, with LTTng 0.111 :
>> >>> >
>> >>> > arm-google script :
>> >>> >
>> >>> > #!/bin/sh
>> >>> >
>> >>> > DIR=/mnt/debugfs/ltt/markers
>> >>> >
>> >>> > for a in \
>> >>> >        syscall_entry syscall_exit \
>> >>> >        irq_entry irq_exit \
>> >>> >        trap_entry trap_exit \
>> >>> >        page_fault_entry page_fault_exit \
>> >>> >        page_fault_nosem_entry page_fault_nosem_exit \
>> >>> >        page_fault_get_user_entry page_fault_get_user_exit \
>> >>> >        sched_schedule process_fork process_wait \
>> >>> >        process_free process_exit;
>> >>> > do
>> >>> >        echo 1 > ${DIR}/kernel/$a/enable
>> >>> > done
>> >>> >
>> >>> > running tbench -t 200 8
>> >>> >
>> >>> > Vanilla kernel :  2070 MB/sec
>> >>> > Flight recorder : 1693 MB/sec   Performance impact : 18.2 %
>> >>> > Normal :          1620 MB/sec   Performance impact : 21.7 % (8 lttd threads)
>> >>> >
>> >>> > So now we are comparing apples with apples :-)
>> >>> Looks like the performance of LTTng has improved a lot over the past
>> >>> few months. I will run some tests with the latest LTTng patches and
>> >>> let you know if I see any difference. Did you use 2.6.29-rc7 kernel in
>> >>> your tests?
>> >>>
>> >>
>> >> Yes, kernel 2.6.29-rc7, lttng 0.111. But lttng 0.112 for kernel
>> >> 2.6.29-rc8 should also be close enough.
>> >>
>> >>> >
>> >>> >> >
>> >>> >> > First I'd like to express my concern about the way you posted your
>> >>> >> > numbers :
>> >>> >> >
>> >>> >> > 1 - You did not CC anyone from the LTTng project.
>> >>> >> > 2 - You did not tell which LTTng version you used as a comparison basis.
>> >>> >> > 3 - You did not look at the current state of code before posting your
>> >>> >> >    comparison. (as shows the comment in your patch header)
>> >>> >> That is indeed my mistake.
>> >>> >>
>> >>> >> > 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
>> >>> >> >    the default instrumentation ? Did you also enable system call tracing
>> >>> >> >    with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
>> >>> >> >    makes a huge difference on the amount of information gathered.
>> >>> >> Yes. We also enable syscall tracing via set_kernel_trace_flag_all_tasks().
>> >>> >>
>> >>> >> >
>> >>> >> > So let's have a close look at them. I'm re-doing the benchmarks based on
>> >>> >> > the current LTTng version below.
>> >>> >> >
>> >>> >> >> Hello All,
>> >>> >> >>
>> >>> >> >> We have been working on building a kernel tracer for a production environment.
>> >>> >> >> We first considered to use the existing markers code, but found it a bit
>> >>> >> >> heavyweight.
>> >>> >> >
>> >>> >> > Have you looked at :
>> >>> >> >
>> >>> >> > 1 - The tracepoints currently in the kernel
>> >>> >> > 2 - The _current_ version of markers in the LTTng tree ?
>> >>> >>
>> >>> >> I looked at the tracepoints currently in the kernel. It is great to see that
>> >>> >> some trace instrumentation code is finally available in the mainline kernel.
>> >>> >> But I still don't like an additional tracepoint layer on top of the
>> >>> >> actual tracing
>> >>> >> (ktrace or markers or ftrace) layer. That seems to add unnecessary overhead.
>> >>> >> The markers code I referred to in my email is the current code in the mainline
>> >>> >> kernel. I had a look at the LTTng markers code but still saw that va_list
>> >>> >> argument is used in marker_probe_func. It sounds a nice improvement
>> >>> >> if the LTTng markers avoids the use of va_arg whenever possible.
>> >>> >>
>> >>> >
>> >>> > Yes, vs_args is only there as a commodity for slow paths.
>> >>> >
>> >>> >> >
>> >>> >> >>  So we developed a new kernel tracing prototype that shares
>> >>> >> >> many similarities as markers but also includes several simplifications that
>> >>> >> >> give us considerable performance benefits. For example, we pass the size of
>> >>> >> >> trace event directly to the probe function instead of using va_arg as that
>> >>> >> >> used in markers, which saves us the overhead of variable passing.
>> >>> >> >
>> >>> >> > The markers in the LTTng tree only use va_args for "debug-style"
>> >>> >> > tracing. All the tracing fast paths are now done with custom probe
>> >>> >> > functions which takes the arguments from the tracepoint and write them
>> >>> >> > in the trace buffers directly in C. When the event payload size is already
>> >>> >> > known statically, there is no dynamic computation cost at all : it's
>> >>> >> > simply passed to the ltt-relay write function.
>> >>> >> But this raises another concern that sub-system developers may need
>> >>> >> to write a bunch of custom probe functions to avoid the argument parsing
>> >>> >> overhead. Can you instead have a general probe function or macro that
>> >>> >> does this work?
>> >>> >>
>> >>> >
>> >>> > Well, considering the amount of care we have to take when we consider
>> >>> > adding a tracepoint in the mainline kernel, I think it's even a *good*
>> >>> > thing that someone has to stop and write a small callback to deal with
>> >>> > the information he wants to export. I think debug-style tracing should
>> >>> > be very, very quick and easy to add, but tracepoints meant to make it
>> >>> > into the mainline does not share this requirement.
>> >>> I agree that we need to be careful when adding tracepoints into the kernel,
>> >>> but I still think it is good to simplify the process of adding/maintaining
>> >>> tracepoints so sub-system maintainers don't feel it is a big burden.
>> >>>
>> >>
>> >> Yes, we have to make sure it's not an impossible task.
>> >>
>> >>> >
>> >>> >> >
>> >>> >> >>  Also, we
>> >>> >> >> associate a single probe function to each event, so we can directly call the
>> >>> >> >> probe function when tracing is enabled.
>> >>> >> >
>> >>> >> > You seem to imply that being stuck with a single probe function is
>> >>> >> > somehow superior to have tiny per-event callbacks which then call into a
>> >>> >> > buffer write function ? Why ?
>> >>> >> That is indeed my thinking, because we saw that some platforms are
>> >>> >> very sensitive to function call overhead and we haven't seen use cases
>> >>> >> where multi probe functions prove necessary.
>> >>> >>
>> >>> >
>> >>> > Hrm, so your plan is to embed the tracing code at the caller site, or to
>> >>> > keep one single function call ?
>> >>> We do have a general probe function that is called when tracing is enabled.
>> >>> Other than that, the main tracing code is defined as an inline function that is
>> >>> embed at the caller site.
>> >>>
>> >>
>> >> You mean that you embed the main tracing code directly in the kernel
>> >> instrumented functions ? Hrm, have you considered the calling site
>> >> pollution impact that might have ? The compiler might start having
>> >> difficulty optimising hot paths that would then become large functions.
>> >> Or maybe I am not correctly understanding your approach.
>> > Sorry. I was actually trying to say the main instrumentation code is
>> > embed in the caller site. The main tracing code is in the general probe
>> > function.
>> >
>> > Jiaying
>> >
>> >>
>> >> Mathieu
>> >>
>> >>>
>> >>> Jiaying
>> >>>
>> >>> >
>> >>> >> >
>> >>> >> >>  Our measurements show that our
>> >>> >> >> current tracing prototype introduces much lower overhead compared with
>> >>> >> >> lttng (we use lttng as a comparison because markers does not provide the
>> >>> >> >> implementation to record trace data).
>> >>> >> >
>> >>> >> > Why LTTng version, kernel version, LTTng config and runtime options do
>> >>> >> > you use ?
>> >>> >> See above for LTTng and kernel versions. I was using the default LTTng
>> >>> >> config options and 'normal' recorder mode.
>> >>> >>
>> >>> >> >
>> >>> >> >> E.g., with dbench, lttng introduces
>> >>> >> >> 20~80% overhead on our testing platforms while the overhead with our
>> >>> >> >> kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
>> >>> >> >> overhead while the overhead added by our kernel tracer is usually within
>> >>> >> >> 15%; with kernbench (time to compile a Linux kernel), the overhead
>> >>> >> >> added by lttng is around 4%~18% while the overhead added by our
>> >>> >> >> kernel tracer is within 2.5%.
>> >>> >> >>
>> >>> >> >
>> >>> >> > I'm re-running the dbench and tbench tests :
>> >>> >> >
>> >>> >> > With LTTng 0.106, kernel 2.6.29-rc7
>> >>> >> > 8-cores Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz, 16GB ram
>> >>> >> >
>> >>> >> > Arming all the default LTTng instrumentation (168 tracepoints),
>> >>> >> > Tracing in flight recorder mode, including system call entry/exit
>> >>> >> > tracing and block I/O tracing (blktrace instrumentation), I get :
>> >>> >> >
>> >>> >> > Test / config    |  Vanilla kernel  |  LTTng 0.106  |  slowdown
>> >>> >> > dbench 8 -t 200  |       1361 MB/s  |   1193 MB/s   |    12.3 %
>> >>> >> >
>> >>> >> > As an example of the effect of custom probe vs va_args based probes,
>> >>> >> > let's take the tbench example
>> >>> >> >
>> >>> >> >                 |     Vanilla      |  LTTng 0.106  |  slowdown
>> >>> >> > tbench 8 -t 200  |       2074 MB/s  |     997 MB/s  |      52 %
>> >>> >> >
>> >>> >> > The LTTng trace statistics shows the top high event rate events :
>> >>> >> >
>> >>> >> > Some are usual (therefore lttng already implements custom probes for them) :
>> >>> >> >
>> >>> >> > 42004 sched_try_wakeup
>> >>> >> > 79754 timer_set
>> >>> >> > 83973 sched_schedule
>> >>> >> > 84317 softirq_exit
>> >>> >> > 84318 softirq_entry
>> >>> >> > 95370 page_free
>> >>> >> > 95412 page_alloc
>> >>> >> > 267247 syscall_entry
>> >>> >> > 267247 syscall_exit
>> >>> >> >
>> >>> >> > Some are more tbench workload-specific :
>> >>> >> >
>> >>> >> > 16915 socket_sendmsg
>> >>> >> > 18911 napi_complete
>> >>> >> > 18911 napi_schedule
>> >>> >> > 18912 dev_receive
>> >>> >> > 18914 dev_xmit
>> >>> >> > 18920 napi_poll
>> >>> >> > 33831 socket_recvmsg
>> >>> >> > 59860 read
>> >>> >> >
>> >>> >> >
>> >>> >> > So let's see what happens if I implement those as custom probes and do a bit of
>> >>> >> > tuning in the tracer hot path :
>> >>> >> Could you give more details about the custom probe you used?
>> >>> >
>> >>> > They are available in the lttng tree on git.kernel.org in
>> >>> > ltt/probes/*-trace.c. Their API is in include/linux/ltt-type-serialize.h
>> >>> >
>> >>> >> Do you need to write a custom probe for each tracepoint?
>> >>> >>
>> >>> >
>> >>> > Yes. But it's not such a big deal. It looks a bit like Steve's
>> >>> > TRACE_EVENT macro-fu, but written in a simple C callback :
>> >>> >
>> >>> > void probe_sched_switch(struct rq *rq, struct task_struct *prev,
>> >>> >                struct task_struct *next);
>> >>> >
>> >>> > DEFINE_MARKER_TP(kernel, sched_schedule, sched_switch, probe_sched_switch,
>> >>> >        "prev_pid %d next_pid %d prev_state #2d%ld");
>> >>> >
>> >>> > notrace void probe_sched_switch(struct rq *rq, struct task_struct *prev,
>> >>> >                struct task_struct *next)
>> >>> > {
>> >>> >        struct marker *marker;
>> >>> >        struct serialize_int_int_short data;
>> >>> >
>> >>> >        data.f1 = prev->pid;
>> >>> >        data.f2 = next->pid;
>> >>> >        data.f3 = prev->state;
>> >>> >
>> >>> >        marker = &GET_MARKER(kernel, sched_schedule);
>> >>> >        ltt_specialized_trace(marker, marker->single.probe_private,
>> >>> >                &data, serialize_sizeof(data), sizeof(int));
>> >>> > }
>> >>> >
>> >>> > Best regards,
>> >>> >
>> >>> > Mathieu
>> >>> >
>> >>> >> Jiaying
>> >>> >>
>> >>> >> >
>> >>> >> >                 |     Vanilla      |  LTTng 0.107  |  slowdown
>> >>> >> > Custom probes :
>> >>> >> > tbench 8 -t 200  |       2074 MB/s  |    1275 MB/s  |    38.5 %
>> >>> >> >
>> >>> >> > With tracer hot path tuning (inline hot path, function calls for slow path, will
>> >>> >> > be in 0.108) :
>> >>> >> > tbench 8 -t 200  |       2074 MB/s  |    1335 MB/s  |    35.6 %
>> >>> >> >
>> >>> >> > kernbench :
>> >>> >> > Average Optimal load -j 32 Run
>> >>> >> >
>> >>> >> > Vanilla 2.6.29-rc7 :
>> >>> >> > Elapsed Time 48.238 (0.117346)
>> >>> >> >
>> >>> >> > With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
>> >>> >> > Elapsed Time 50.226 (0.0750333)
>> >>> >> >
>> >>> >> > Slowdown : 4.1 %
>> >>> >> >
>> >>> >> > So except the tbench workload on LTTng 0.106, which still had a 52% performance
>> >>> >> > impact (brought it down to 35 % for the next LTTng release) due to
>> >>> >> > specific workload and not having specialized probes implemented, the
>> >>> >> > dbench impact is well below your results (12.3 % vs 20-80%)
>> >>> >> >
>> >>> >> > So given the huge result discrepancies between your tests and mine, could you
>> >>> >> > give us more detail about your benchmarks ? Especially knowing the enabled
>> >>> >> > instrumentation set, and the amount of MB/s generated by both ktrace and LTTng
>> >>> >> > runs (to make sure we are comparing the same traffic) and the LTTng versions you
>> >>> >> > are using as a comparison basis would be good.
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> >
>> >>> >> > Mathieu
>> >>> >> >
>> >>> >> >
>> >>> >> >> The patch below is the current prototype we have implemented. It uses
>> >>> >> >> Steve's unified trace buffer code with some small extensions to support
>> >>> >> >> buffer mmap. The patch is probably still buggy, but we would like to get
>> >>> >> >> some early feedback from the community.
>> >>> >> >>
>> >>> >> >> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>> >>> >> >>
>> >>> >> >> Index: git-linux/init/Kconfig
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/init/Kconfig      2009-02-19 15:50:15.000000000 -0800
>> >>> >> >> @@ -950,6 +950,15 @@
>> >>> >> >>          Place an empty function call at each marker site. Can be
>> >>> >> >>          dynamically changed for a probe function.
>> >>> >> >>
>> >>> >> >> +config KTRACE
>> >>> >> >> +       bool "Enable ktrace support"
>> >>> >> >> +       select DEBUG_FS
>> >>> >> >> +       select TRACING
>> >>> >> >> +       select RING_BUFFER
>> >>> >> >> +       help
>> >>> >> >> +         Ktrace is a kernel tracing tool that allows you to trace
>> >>> >> >> +         kernel events by inserting trace points at proper places.
>> >>> >> >> +
>> >>> >> >>  source "arch/Kconfig"
>> >>> >> >>
>> >>> >> >>  endmenu                # General setup
>> >>> >> >> Index: git-linux/kernel/Makefile
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/kernel/Makefile      2009-02-19 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/kernel/Makefile   2009-02-19 14:58:39.000000000 -0800
>> >>> >> >> @@ -86,6 +86,7 @@
>> >>> >> >>  obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
>> >>> >> >>  obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
>> >>> >> >>  obj-$(CONFIG_MARKERS) += marker.o
>> >>> >> >> +obj-$(CONFIG_KTRACE) += ktrace.o
>> >>> >> >>  obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
>> >>> >> >>  obj-$(CONFIG_LATENCYTOP) += latencytop.o
>> >>> >> >>  obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
>> >>> >> >> Index: git-linux/include/asm-generic/vmlinux.lds.h
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/include/asm-generic/vmlinux.lds.h    2009-02-19
>> >>> >> >> 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19
>> >>> >> >> 14:58:39.000000000 -0800
>> >>> >> >> @@ -73,6 +73,9 @@
>> >>> >> >>        MEM_KEEP(init.data)                                             \
>> >>> >> >>        MEM_KEEP(exit.data)                                             \
>> >>> >> >>        . = ALIGN(8);                                                   \
>> >>> >> >> +       VMLINUX_SYMBOL(__start___ktraces) = .;                          \
>> >>> >> >> +       *(__ktrace)                                                     \
>> >>> >> >> +       VMLINUX_SYMBOL(__stop___ktraces) = .;                           \
>> >>> >> >>        VMLINUX_SYMBOL(__start___markers) = .;                          \
>> >>> >> >>        *(__markers)                                                    \
>> >>> >> >>        VMLINUX_SYMBOL(__stop___markers) = .;                           \
>> >>> >> >> @@ -89,6 +92,7 @@
>> >>> >> >>                VMLINUX_SYMBOL(__start_rodata) = .;                     \
>> >>> >> >>                *(.rodata) *(.rodata.*)                                 \
>> >>> >> >>                *(__vermagic)           /* Kernel version magic */      \
>> >>> >> >> +               *(__ktrace_strings)     /* Ktrace: strings */           \
>> >>> >> >>                *(__markers_strings)    /* Markers: strings */          \
>> >>> >> >>                *(__tracepoints_strings)/* Tracepoints: strings */      \
>> >>> >> >>        }                                                               \
>> >>> >> >> Index: git-linux/kernel/module.c
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/kernel/module.c      2009-02-19 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/kernel/module.c   2009-02-19 14:58:39.000000000 -0800
>> >>> >> >> @@ -50,6 +50,7 @@
>> >>> >> >>  #include <asm/sections.h>
>> >>> >> >>  #include <linux/tracepoint.h>
>> >>> >> >>  #include <linux/ftrace.h>
>> >>> >> >> +#include <linux/ktrace.h>
>> >>> >> >>  #include <linux/async.h>
>> >>> >> >>
>> >>> >> >>  #if 0
>> >>> >> >> @@ -2145,6 +2146,10 @@
>> >>> >> >>                                        sizeof(*mod->tracepoints),
>> >>> >> >>                                        &mod->num_tracepoints);
>> >>> >> >>  #endif
>> >>> >> >> +#ifdef CONFIG_KTRACE
>> >>> >> >> +       mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
>> >>> >> >> +                               sizeof(*mod->ktrace), &mod->num_ktrace);
>> >>> >> >> +#endif
>> >>> >> >>
>> >>> >> >>  #ifdef CONFIG_MODVERSIONS
>> >>> >> >>        if ((mod->num_syms && !mod->crcs)
>> >>> >> >> Index: git-linux/include/linux/module.h
>> >>> >> >> ===================================================================
>> >>> >> >> --- git-linux.orig/include/linux/module.h       2009-02-19
>> >>> >> >> 14:58:37.000000000 -0800
>> >>> >> >> +++ git-linux/include/linux/module.h    2009-02-19 14:58:39.000000000 -0800
>> >>> >> >> @@ -44,6 +44,7 @@
>> >>> >> >>  };
>> >>> >> >>
>> >>> >> >>  struct module;
>> >>> >> >> +struct kernel_trace;
>> >>> >> >>
>> >>> >> >>  struct module_attribute {
>> >>> >> >>         struct attribute attr;
>> >>> >> >> @@ -347,6 +348,11 @@
>> >>> >> >>        /* Reference counts */
>> >>> >> >>        struct module_ref ref[NR_CPUS];
>> >>> >> >>  #endif
>> >>> >> >> +#ifdef CONFIG_KTRACE
>> >>> >> >> +       struct kernel_trace *ktrace;
>> >>> >> >> +       unsigned int num_ktrace;
>> >>> >> >> +#endif
>> >>> >> >> +
>> >>> >> >>  };
>> >>> >> >>  #ifndef MODULE_ARCH_INIT
>> >>> >> >>  #define MODULE_ARCH_INIT {}
>> >>> >> >> Index: git-linux/include/linux/ktrace.h
>> >>> >> >> ===================================================================
>> >>> >> >> --- /dev/null   1970-01-01 00:00:00.000000000 +0000
>> >>> >> >> +++ git-linux/include/linux/ktrace.h    2009-02-19 15:36:39.000000000 -0800
>> >>> >> >> @@ -0,0 +1,83 @@
>> >>> >> >> +#ifndef _LINUX_KTRACE_H
>> >>> >> >> +#define _LINUX_KTRACE_H
>> >>> >> >> +
>> >>> >> >> +#include <linux/types.h>
>> >>> >> >> +
>> >>> >> >> +struct kernel_trace;
>> >>> >> >> +
>> >>> >> >> +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
>> >>> >> >> +
>> >>> >> >> +struct kernel_trace {
>> >>> >> >> +       const char      *name;
>> >>> >> >> +       const char      *format;  /* format string describing variable list */
>> >>> >> >> +       size_t          *stroff;  /* offsets of string variables */
>> >>> >> >> +       /* 31 bit event_id is converted to 16 bit when entering to the buffer */
>> >>> >> >> +       u32             enabled:1, event_id:31;
>> >>> >> >> +       ktrace_probe_func *func;  /* probe function */
>> >>> >> >> +       struct list_head list;  /* list head linked to the hash table entry */
>> >>> >> >> +} __attribute__((aligned(8)));
>> >>> >> >> +
>> >>> >> >> +extern int ktrace_enabled;
>> >>> >> >> +
>> >>> >> >> +/*
>> >>> >> >> + * Make sure the alignment of the structure in the __ktrace section will
>> >>> >> >> + * not add unwanted padding between the beginning of the section and the
>> >>> >> >> + * structure. Force alignment to the same alignment as the section start.
>> >>> >> >> + */
>> >>> >> >> +
>> >>> >> >> +#define DEFINE_KTRACE_STRUCT(name)                     \
>> >>> >> >> +       struct __attribute__((packed)) ktrace_struct_##name
>> >>> >> >> +
>> >>> >> >> +#ifdef CONFIG_KTRACE
>> >>> >> >> +/*
>> >>> >> >> + * DO_TRACE is the macro that is called from each trace point. Each trace
>> >>> >> >> + * point is associated with a static ktrace object that is located in a
>> >>> >> >> + * special __ktrace section and is uniquely identified by the event name and
>> >>> >> >> + * event format. Take a look at the files under include/trace for example
>> >>> >> >> + * usage. Note that we call the trace probe function only if the global
>> >>> >> >> + * tracing is enabled and the tracing for the associated event is enabled,
>> >>> >> >> + * so we only introduce the overhead of a predicted condition judge when
>> >>> >> >> + * tracing is disabled.
>> >>> >> >> + */
>> >>> >> >> +#define DO_TRACE(name, format, args...)
>> >>> >> >>          \
>> >>> >> >> +       do {                                                            \
>> >>> >> >> +               static const char __kstrtab_##name[]                    \
>> >>> >> >> +               __attribute__((section("__ktrace_strings")))            \
>> >>> >> >> +               = #name "\0" format;                                    \
>> >>> >> >> +               static struct kernel_trace __ktrace_##name              \
>> >>> >> >> +               __attribute__((section("__ktrace"), aligned(8))) =      \
>> >>> >> >> +               { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)],   \
>> >>> >> >> +               NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
>> >>> >> >> +               __ktrace_check_format(format, ## args);                 \
>> >>> >> >> +               if (unlikely(ktrace_enabled) &&                         \
>> >>> >> >> +                               unlikely(__ktrace_##name.enabled)) {    \
>> >>> >> >> +                       struct ktrace_struct_##name karg = { args };    \
>> >>> >> >> +                       (*__ktrace_##name.func)                         \
>> >>> >> >> +                       (&__ktrace_##name, &karg, sizeof(karg));        \
>> >>> >> >> +               }                                                       \
>> >>> >> >> +       } while (0)
>> >>> >> >> +
>> >>> >> >> +#else /* !CONFIG_KTRACE */
>> >>> >> >> +#define DO_TRACE(name, format, args...)
>> >>> >> >> +#endif
>> >>> >> >> +
>> >>> >> >> +/* To be used for string format validity checking with gcc */
>> >>> >> >> +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
>> >>> >> >> +{
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +#define __ktrace_check_format(format, args...)                         \
>> >>> >> >> +       do {                                                            \
>> >>> >> >> +               if (0)                                                  \
>> >>> >> >> +                       ___ktrace_check_format(format, ## args);        \
>> >>> >> >> +       } while (0)
>> >>> >> >> +
>> >>> >> >> +
>> >>> >> >> +/* get the trace buffer information */
>> >>> >> >> +#define KTRACE_BUF_GET_SIZE     _IOR(0xF5, 0x00, __u32)
>> >>> >> >> +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
>> >>> >> >> +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
>> >>> >> >> +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
>> >>> >> >> +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
>> >>> >> >> +
>> >>> >> >> +#endif
>> >>> >> >> Index: git-linux/kernel/ktrace.c
>> >>> >> >> ===================================================================
>> >>> >> >> --- /dev/null   1970-01-01 00:00:00.000000000 +0000
>> >>> >> >> +++ git-linux/kernel/ktrace.c   2009-02-19 15:45:20.000000000 -0800
>> >>> >> >> @@ -0,0 +1,872 @@
>> >>> >> >> +/*
>> >>> >> >> + * kernel/ktrace.c
>> >>> >> >> + *
>> >>> >> >> + * Implementation of the kernel tracing for linux kernel 2.6.
>> >>> >> >> + *
>> >>> >> >> + * Copyright 2008- Google Inc.
>> >>> >> >> + * Original Author: Jiaying Zhang
>> >>> >> >> + */
>> >>> >> >> +
>> >>> >> >> +#include <linux/module.h>
>> >>> >> >> +#include <linux/kernel.h>
>> >>> >> >> +#include <linux/bitmap.h>
>> >>> >> >> +#include <linux/poll.h>
>> >>> >> >> +#include <linux/errno.h>
>> >>> >> >> +#include <linux/sched.h>
>> >>> >> >> +#include <linux/linkage.h>
>> >>> >> >> +#include <linux/debugfs.h>
>> >>> >> >> +#include <linux/timer.h>
>> >>> >> >> +#include <linux/delay.h>
>> >>> >> >> +#include <linux/jhash.h>
>> >>> >> >> +#include <linux/ctype.h>
>> >>> >> >> +#include <linux/namei.h>
>> >>> >> >> +#include <linux/ktrace.h>
>> >>> >> >> +#include <linux/ring_buffer.h>
>> >>> >> >> +
>> >>> >> >> +static char *ktrace_version = "1.0.0";
>> >>> >> >> +int ktrace_enabled;
>> >>> >> >> +EXPORT_SYMBOL(ktrace_enabled);
>> >>> >> >> +static uint16_t ktrace_next_id;
>> >>> >> >> +
>> >>> >> >> +extern struct kernel_trace __start___ktraces[];
>> >>> >> >> +extern struct kernel_trace __stop___ktraces[];
>> >>> >> >> +
>> >>> >> >> +static DEFINE_MUTEX(ktrace_mutex);
>> >>> >> >> +
>> >>> >> >> +/* ktrace hash table entry structure and variables */
>> >>> >> >> +struct ktrace_entry {
>> >>> >> >> +       struct hlist_node       hlist;
>> >>> >> >> +       char                    *name;
>> >>> >> >> +       char                    *format;
>> >>> >> >> +       size_t                  *stroff;
>> >>> >> >> +       ktrace_probe_func       *func;
>> >>> >> >> +       u32                     enabled:1, event_id:31;
>> >>> >> >> +       struct list_head        klist;  /* list of loaded ktraces */
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +#define KTRACE_HASH_BITS 6
>> >>> >> >> +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
>> >>> >> >> +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
>> >>> >> >> +
>> >>> >> >> +/* debugfs directory variables */
>> >>> >> >> +static struct dentry *tracedir;
>> >>> >> >> +static struct dentry *eventdir;
>> >>> >> >> +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
>> >>> >> >> +
>> >>> >> >> +/* ring buffer code */
>> >>> >> >> +static unsigned long trace_buf_size = 65536UL;
>> >>> >> >> +struct ring_buffer *trace_buffer;
>> >>> >> >> +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
>> >>> >> >> +static struct timer_list trace_timer; /* reader wake-up timer */
>> >>> >> >> +#define TRACEREAD_WAKEUP_INTERVAL      1000  /* time interval in jiffies */
>> >>> >> >> +static struct kref ktrace_kref;
>> >>> >> >> +
>> >>> >> >> +static void
>> >>> >> >> +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> >>> >> >> +{
>> >>> >> >> +       struct ring_buffer_event *event;
>> >>> >> >> +       void *trace_event;
>> >>> >> >> +       unsigned long flag;
>> >>> >> >> +
>> >>> >> >> +       event = ring_buffer_lock_reserve(trace_buffer,
>> >>> >> >> +                       sizeof(uint16_t) + event_size, &flag);
>> >>> >> >> +       if (!event)
>> >>> >> >> +               return;
>> >>> >> >> +       trace_event = ring_buffer_event_data(event);
>> >>> >> >> +       *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> >>> >> >> +       memcpy(trace_event + sizeof(uint16_t), data, event_size);
>> >>> >> >> +       ring_buffer_unlock_commit(trace_buffer, event, flag);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* special probe function for events that contain string arguments */
>> >>> >> >> +static void
>> >>> >> >> +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
>> >>> >> >> +{
>> >>> >> >> +       struct ring_buffer_event *event;
>> >>> >> >> +       void *trace_event, *p;
>> >>> >> >> +       unsigned long flag;
>> >>> >> >> +       size_t *offset, scanned;
>> >>> >> >> +       char *string;
>> >>> >> >> +
>> >>> >> >> +       /*
>> >>> >> >> +        * Get the real length of the event, i.e., we use stroff array to
>> >>> >> >> +        * locate string variables in the passed-in trace data and use their
>> >>> >> >> +        * length to replace the size of the string pointers.
>> >>> >> >> +        */
>> >>> >> >> +       for (offset = kt->stroff; *offset != -1; offset++) {
>> >>> >> >> +               string = *(char **) (data + *offset);
>> >>> >> >> +               event_size += strlen(string) + 1 - sizeof(char *);
>> >>> >> >> +       }
>> >>> >> >> +       event = ring_buffer_lock_reserve(trace_buffer,
>> >>> >> >> +                       sizeof(uint16_t) + event_size, &flag);
>> >>> >> >> +       if (!event)
>> >>> >> >> +               return;
>> >>> >> >> +       trace_event = ring_buffer_event_data(event);
>> >>> >> >> +       *(uint16_t *) trace_event = (uint16_t) kt->event_id;
>> >>> >> >> +       p = trace_event + sizeof(uint16_t);
>> >>> >> >> +       /*
>> >>> >> >> +        * Copy the trace data into buffer. For string variables, we enter the
>> >>> >> >> +        * string into the buffer. Otherwise, the passed in data is copied.
>> >>> >> >> +        */
>> >>> >> >> +       for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
>> >>> >> >> +               memcpy(p, data + scanned, *offset - scanned);
>> >>> >> >> +               p += *offset - scanned;
>> >>> >> >> +               string = *(char **) (data + *offset);
>> >>> >> >> +               memcpy(p, string, strlen(string) + 1);
>> >>> >> >> +               p += strlen(string) + 1;
>> >>> >> >> +               scanned = *offset + sizeof(char *);
>> >>> >> >> +       }
>> >>> >> >> +       memcpy(p, data + scanned,
>> >>> >> >> +                       trace_event + sizeof(uint16_t) + event_size - p);
>> >>> >> >> +       ring_buffer_unlock_commit(trace_buffer, event, flag);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* timer function used to defer ring buffer reader waking */
>> >>> >> >> +static void wakeup_readers(unsigned long data)
>> >>> >> >> +{
>> >>> >> >> +       if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> >>> >> >> +               wake_up_interruptible(&trace_wait);
>> >>> >> >> +       mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* function for reading ktrace metadata info */
>> >>> >> >> +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
>> >>> >> >> +{
>> >>> >> >> +       seq_printf(seq, "version %s\n", ktrace_version);
>> >>> >> >> +       return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
>> >>> >> >> +{
>> >>> >> >> +       ++*pos;
>> >>> >> >> +       return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int tracing_info_show(struct seq_file *seq, void *v)
>> >>> >> >> +{
>> >>> >> >> +       struct hlist_head *head = v;
>> >>> >> >> +       struct hlist_node *node;
>> >>> >> >> +       struct ktrace_entry *entry;
>> >>> >> >> +       hlist_for_each_entry(entry, node, head, hlist)
>> >>> >> >> +               seq_printf(seq, "name '%s' format '%s' id %u %s\n",
>> >>> >> >> +                       entry->name, entry->format, entry->event_id,
>> >>> >> >> +                       entry->enabled ? "enabled" : "disabled");
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void tracing_info_stop(struct seq_file *seq, void *v)
>> >>> >> >> +{
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct seq_operations tracing_info_seq_ops = {
>> >>> >> >> +       .start  = tracing_info_start,
>> >>> >> >> +       .next   = tracing_info_next,
>> >>> >> >> +       .show   = tracing_info_show,
>> >>> >> >> +       .stop   = tracing_info_stop,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +static int tracing_info_open(struct inode *inode, struct file *file)
>> >>> >> >> +{
>> >>> >> >> +       return seq_open(file, &tracing_info_seq_ops);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations traceinfo_file_operations = {
>> >>> >> >> +       .open   = tracing_info_open,
>> >>> >> >> +       .read   = seq_read,
>> >>> >> >> +       .llseek = seq_lseek,
>> >>> >> >> +       .release = seq_release,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/*
>> >>> >> >> + * wrapper function used by debugfs write operation.
>> >>> >> >> + * func: handling function that does real work
>> >>> >> >> + */
>> >>> >> >> +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
>> >>> >> >> +               size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
>> >>> >> >> +{
>> >>> >> >> +       int ret;
>> >>> >> >> +       char buf[64];
>> >>> >> >> +       unsigned long val;
>> >>> >> >> +
>> >>> >> >> +       if (cnt >= sizeof(buf))
>> >>> >> >> +               return -EINVAL;
>> >>> >> >> +       if (copy_from_user(&buf, ubuf, cnt))
>> >>> >> >> +               return -EFAULT;
>> >>> >> >> +       buf[cnt] = 0;
>> >>> >> >> +
>> >>> >> >> +       ret = strict_strtoul(buf, 10, &val);
>> >>> >> >> +       if (ret < 0)
>> >>> >> >> +               return ret;
>> >>> >> >> +       val = !!val;
>> >>> >> >> +
>> >>> >> >> +       func(filp, val, &cnt);
>> >>> >> >> +
>> >>> >> >> +       filp->f_pos += cnt;
>> >>> >> >> +       return cnt;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* functions for reading/writing the global 'enabled' control file */
>> >>> >> >> +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
>> >>> >> >> +               size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> +       int ret;
>> >>> >> >> +       char buf[64];
>> >>> >> >> +       ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
>> >>> >> >> +       return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void __tracing_control_write(struct file *filp,
>> >>> >> >> +               unsigned long val, size_t *cnt)
>> >>> >> >> +{
>> >>> >> >> +       if (val ^ ktrace_enabled) {
>> >>> >> >> +               if (val) {
>> >>> >> >> +                       trace_timer.expires =
>> >>> >> >> +                               jiffies + TRACEREAD_WAKEUP_INTERVAL;
>> >>> >> >> +                       add_timer(&trace_timer);
>> >>> >> >> +                       ktrace_enabled = 1;
>> >>> >> >> +               } else {
>> >>> >> >> +                       ktrace_enabled = 0;
>> >>> >> >> +                       del_timer_sync(&trace_timer);
>> >>> >> >> +                       if (trace_buffer && !ring_buffer_empty(trace_buffer))
>> >>> >> >> +                               wake_up_interruptible(&trace_wait);
>> >>> >> >> +               }
>> >>> >> >> +       }
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static ssize_t tracing_control_write(struct file *filp, const char
>> >>> >> >> __user *ubuf,
>> >>> >> >> +               size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> +       return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >>> >> >> +                       __tracing_control_write);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations tracecontrol_file_operations = {
>> >>> >> >> +       .read   = tracing_control_read,
>> >>> >> >> +       .write  = tracing_control_write,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* functions for reading/writing a trace event file */
>> >>> >> >> +static int update_ktrace(struct ktrace_entry *entry,
>> >>> >> >> +               ktrace_probe_func *func, int enabled)
>> >>> >> >> +{
>> >>> >> >> +       struct kernel_trace *iter;
>> >>> >> >> +       /* no need to update the list if the tracing is not initialized */
>> >>> >> >> +       if (!tracedir)
>> >>> >> >> +               return 0;
>> >>> >> >> +       entry->enabled = enabled;
>> >>> >> >> +       entry->func = func;
>> >>> >> >> +       list_for_each_entry(iter, &entry->klist, list) {
>> >>> >> >> +               iter->stroff = entry->stroff;
>> >>> >> >> +               iter->func = func;
>> >>> >> >> +               iter->enabled = enabled;
>> >>> >> >> +       }
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int ktrace_probe_register(struct ktrace_entry *entry,
>> >>> >> >> +               ktrace_probe_func *probefunc)
>> >>> >> >> +{
>> >>> >> >> +       return update_ktrace(entry, probefunc, 1);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int ktrace_probe_unregister(struct ktrace_entry *entry)
>> >>> >> >> +{
>> >>> >> >> +       return update_ktrace(entry, NULL, 0);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int ktrace_control_open(struct inode *inode, struct file *filp)
>> >>> >> >> +{
>> >>> >> >> +       filp->private_data = inode->i_private;
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
>> >>> >> >> +               size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> +       int ret;
>> >>> >> >> +       char buf[64];
>> >>> >> >> +       struct ktrace_entry *entry = filp->private_data;
>> >>> >> >> +
>> >>> >> >> +       ret = snprintf(buf, 64, "%u\n", entry->enabled);
>> >>> >> >> +       return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/*
>> >>> >> >> + * Check whether the event contains string arguments. Return 0 if not;
>> >>> >> >> + * otherwise record the offsets of the string arguments in the ktrace
>> >>> >> >> + * structure according to the event format and return 1.
>> >>> >> >> + * The offset info is later used to store the string arguments into
>> >>> >> >> + * the trace buffer during trace probe.
>> >>> >> >> + */
>> >>> >> >> +static int contain_string_arguments(struct ktrace_entry *entry)
>> >>> >> >> +{
>> >>> >> >> +       int count = 0;
>> >>> >> >> +       char *fmt;
>> >>> >> >> +       int qualifier;          /* 'h', 'l', or 'L' for integer fields */
>> >>> >> >> +       size_t offset = 0;
>> >>> >> >> +
>> >>> >> >> +       for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
>> >>> >> >> +               count++;
>> >>> >> >> +       if (!count)
>> >>> >> >> +               return 0;
>> >>> >> >> +       entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
>> >>> >> >> +       if (!entry->stroff)
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +
>> >>> >> >> +       for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
>> >>> >> >> +               if (*fmt != '%')
>> >>> >> >> +                       continue;
>> >>> >> >> +repeat:
>> >>> >> >> +               ++fmt;
>> >>> >> >> +               switch (*fmt) {
>> >>> >> >> +               case '-':
>> >>> >> >> +               case '+':
>> >>> >> >> +               case ' ':
>> >>> >> >> +               case '#':
>> >>> >> >> +               case '0':
>> >>> >> >> +                       ++fmt;
>> >>> >> >> +                       goto repeat;
>> >>> >> >> +               }
>> >>> >> >> +
>> >>> >> >> +               while (isdigit(*fmt))
>> >>> >> >> +                       fmt++;
>> >>> >> >> +
>> >>> >> >> +               /* get the conversion qualifier */
>> >>> >> >> +               qualifier = -1;
>> >>> >> >> +               if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
>> >>> >> >> +                       qualifier = *fmt;
>> >>> >> >> +                       ++fmt;
>> >>> >> >> +                       if (qualifier == 'l' && *fmt == 'l') {
>> >>> >> >> +                               qualifier = 'L';
>> >>> >> >> +                               ++fmt;
>> >>> >> >> +                       }
>> >>> >> >> +               }
>> >>> >> >> +
>> >>> >> >> +               switch (*fmt) {
>> >>> >> >> +               case 'c':
>> >>> >> >> +                       offset += sizeof(char);
>> >>> >> >> +                       continue;
>> >>> >> >> +               case 's':
>> >>> >> >> +                       entry->stroff[count] = offset;
>> >>> >> >> +                       count++;
>> >>> >> >> +                       offset += sizeof(char *);
>> >>> >> >> +                       continue;
>> >>> >> >> +               case 'p':
>> >>> >> >> +                       offset += sizeof(void *);
>> >>> >> >> +                       continue;
>> >>> >> >> +               case 'd':
>> >>> >> >> +               case 'i':
>> >>> >> >> +               case 'o':
>> >>> >> >> +               case 'u':
>> >>> >> >> +               case 'x':
>> >>> >> >> +               case 'X':
>> >>> >> >> +                       switch (qualifier) {
>> >>> >> >> +                       case 'L':
>> >>> >> >> +                               offset += sizeof(long long);
>> >>> >> >> +                               break;
>> >>> >> >> +                       case 'l':
>> >>> >> >> +                               offset += sizeof(long);
>> >>> >> >> +                               break;
>> >>> >> >> +                       case 'h':
>> >>> >> >> +                               offset += sizeof(short);
>> >>> >> >> +                               break;
>> >>> >> >> +                       default:
>> >>> >> >> +                               offset += sizeof(int);
>> >>> >> >> +                               break;
>> >>> >> >> +                       }
>> >>> >> >> +                       break;
>> >>> >> >> +               default:
>> >>> >> >> +                       printk(KERN_WARNING "unknown format '%c'", *fmt);
>> >>> >> >> +                       continue;
>> >>> >> >> +               }
>> >>> >> >> +       }
>> >>> >> >> +       entry->stroff[count] = -1;
>> >>> >> >> +       return 1;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void __ktrace_control_write(struct file *filp,
>> >>> >> >> +               unsigned long val, size_t *cnt)
>> >>> >> >> +{
>> >>> >> >> +       struct ktrace_entry *entry = filp->private_data;
>> >>> >> >> +       int ret;
>> >>> >> >> +
>> >>> >> >> +       mutex_lock(&ktrace_mutex);
>> >>> >> >> +       if (val ^ entry->enabled) {
>> >>> >> >> +               if (val) {
>> >>> >> >> +                       ret = contain_string_arguments(entry);
>> >>> >> >> +                       if (ret == 0)
>> >>> >> >> +                               ktrace_probe_register(entry, ring_buffer_probe);
>> >>> >> >> +                       else if (ret > 0)
>> >>> >> >> +                               ktrace_probe_register(entry, string_probe);
>> >>> >> >> +                       else
>> >>> >> >> +                               *cnt = ret;
>> >>> >> >> +               } else
>> >>> >> >> +                       ktrace_probe_unregister(entry);
>> >>> >> >> +       }
>> >>> >> >> +       mutex_unlock(&ktrace_mutex);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
>> >>> >> >> +               size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> +       return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >>> >> >> +                       __ktrace_control_write);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations ktracecontrol_file_operations = {
>> >>> >> >> +       .open   = ktrace_control_open,
>> >>> >> >> +       .read   = ktrace_control_read,
>> >>> >> >> +       .write  = ktrace_control_write,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* functions for adding/removing a trace event. Protected by mutex lock.
>> >>> >> >> + * Called during initialization or after loading a module */
>> >>> >> >> +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
>> >>> >> >> +{
>> >>> >> >> +       struct ktrace_entry *entry;
>> >>> >> >> +       size_t name_len = strlen(ktrace->name) + 1;
>> >>> >> >> +       u32 hash = jhash(ktrace->name, name_len-1, 0);
>> >>> >> >> +       struct hlist_head *head;
>> >>> >> >> +       struct hlist_node *node;
>> >>> >> >> +       struct dentry *dentry;
>> >>> >> >> +
>> >>> >> >> +       if (!tracedir)
>> >>> >> >> +               return 0;
>> >>> >> >> +       head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> >>> >> >> +       hlist_for_each_entry(entry, node, head, hlist) {
>> >>> >> >> +               if (!strcmp(ktrace->name, entry->name)) {
>> >>> >> >> +                       if (strcmp(ktrace->format, entry->format)) {
>> >>> >> >> +                               printk(KERN_WARNING "the format of tracepoint "
>> >>> >> >> +                                       "\'%s\' changes from \'%s\' to \'%s\'."
>> >>> >> >> +                                       "Dynamic changing of trace format is "
>> >>> >> >> +                                       "not supported yet!\n",
>> >>> >> >> +                                       ktrace->name, entry->format,
>> >>> >> >> +                                       ktrace->format);
>> >>> >> >> +                               return ERR_PTR(-EINVAL);
>> >>> >> >> +                       }
>> >>> >> >> +                       ktrace->enabled = entry->enabled;
>> >>> >> >> +                       ktrace->func = entry->func;
>> >>> >> >> +                       ktrace->event_id = entry->event_id;
>> >>> >> >> +                       if (list_empty(&entry->klist))
>> >>> >> >> +                               goto add_head;
>> >>> >> >> +                       list_add_tail(&ktrace->list, &entry->klist);
>> >>> >> >> +                       return entry;
>> >>> >> >> +               }
>> >>> >> >> +       }
>> >>> >> >> +       entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
>> >>> >> >> +       if (!entry)
>> >>> >> >> +               return ERR_PTR(-ENOMEM);
>> >>> >> >> +       entry->name = kmalloc(name_len, GFP_KERNEL);
>> >>> >> >> +       if (!entry->name) {
>> >>> >> >> +               kfree(entry);
>> >>> >> >> +               return ERR_PTR(-ENOMEM);
>> >>> >> >> +       }
>> >>> >> >> +       memcpy(entry->name, ktrace->name, name_len);
>> >>> >> >> +       if (ktrace->format) {
>> >>> >> >> +               size_t format_len = strlen(ktrace->format) + 1;
>> >>> >> >> +               entry->format = kmalloc(format_len, GFP_KERNEL);
>> >>> >> >> +               if (!entry->format) {
>> >>> >> >> +                       kfree(entry->name);
>> >>> >> >> +                       kfree(entry);
>> >>> >> >> +                       return ERR_PTR(-ENOMEM);
>> >>> >> >> +               }
>> >>> >> >> +               memcpy(entry->format, ktrace->format, format_len);
>> >>> >> >> +       } else
>> >>> >> >> +               entry->format = NULL;
>> >>> >> >> +       entry->func = ktrace->func;
>> >>> >> >> +       entry->enabled = 0;
>> >>> >> >> +       ktrace->event_id = entry->event_id = ktrace_next_id++;
>> >>> >> >> +       entry->stroff = NULL;
>> >>> >> >> +       INIT_LIST_HEAD(&entry->klist);
>> >>> >> >> +       hlist_add_head(&entry->hlist, head);
>> >>> >> >> +add_head:
>> >>> >> >> +       list_add(&ktrace->list, &entry->klist);
>> >>> >> >> +       dentry = debugfs_create_file(entry->name, 0660, eventdir,
>> >>> >> >> +                               entry, &ktracecontrol_file_operations);
>> >>> >> >> +       if (!dentry)
>> >>> >> >> +               printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
>> >>> >> >> +                                                       entry->name);
>> >>> >> >> +       return entry;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int remove_ktrace(struct kernel_trace *ktrace, int free)
>> >>> >> >> +{
>> >>> >> >> +       struct ktrace_entry *entry;
>> >>> >> >> +       size_t name_len = strlen(ktrace->name) + 1;
>> >>> >> >> +       u32 hash = jhash(ktrace->name, name_len-1, 0);
>> >>> >> >> +       struct hlist_head *head;
>> >>> >> >> +       struct hlist_node *node, *temp;
>> >>> >> >> +       struct dentry *dentry;
>> >>> >> >> +
>> >>> >> >> +       if (!tracedir)
>> >>> >> >> +               return 0;
>> >>> >> >> +       list_del(&ktrace->list);
>> >>> >> >> +       head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
>> >>> >> >> +       hlist_for_each_entry_safe(entry, node, temp, head, hlist)
>> >>> >> >> +               if (!strcmp(ktrace->name, entry->name)) {
>> >>> >> >> +                       if (list_empty(&entry->klist)) {
>> >>> >> >> +                               dentry = lookup_one_len(entry->name,
>> >>> >> >> +                                       eventdir, strlen(entry->name));
>> >>> >> >> +                               if (dentry && !IS_ERR(dentry)) {
>> >>> >> >> +                                       debugfs_remove(dentry);
>> >>> >> >> +                                       dput(dentry);
>> >>> >> >> +                               }
>> >>> >> >> +                               entry->enabled = 0;
>> >>> >> >> +                               if (free) {
>> >>> >> >> +                                       hlist_del(&entry->hlist);
>> >>> >> >> +                                       kfree(entry->name);
>> >>> >> >> +                                       kfree(entry->format);
>> >>> >> >> +                                       kfree(entry->stroff);
>> >>> >> >> +                                       kfree(entry);
>> >>> >> >> +                               }
>> >>> >> >> +                       }
>> >>> >> >> +                       return 0;
>> >>> >> >> +               }
>> >>> >> >> +       return -1;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +/* Add/remove tracepoints contained in a module to the ktrace hash table.
>> >>> >> >> + * Called at the end of module load/unload. */
>> >>> >> >> +static int ktrace_module_notify(struct notifier_block *self,
>> >>> >> >> +               unsigned long val, void *data)
>> >>> >> >> +{
>> >>> >> >> +#ifdef CONFIG_MODULES
>> >>> >> >> +       struct kernel_trace *iter;
>> >>> >> >> +       struct module *mod = data;
>> >>> >> >> +
>> >>> >> >> +       if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
>> >>> >> >> +               return 0;
>> >>> >> >> +       mutex_lock(&ktrace_mutex);
>> >>> >> >> +       for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
>> >>> >> >> +               if (val == MODULE_STATE_COMING)
>> >>> >> >> +                       add_ktrace(iter);
>> >>> >> >> +               else
>> >>> >> >> +                       remove_ktrace(iter, 0);
>> >>> >> >> +       mutex_unlock(&ktrace_mutex);
>> >>> >> >> +#endif
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static struct notifier_block ktrace_module_nb = {
>> >>> >> >> +       .notifier_call  = ktrace_module_notify,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* functions for user-space programs to read data from tracing buffer */
>> >>> >> >> +static int trace_data_open(struct inode *inode, struct file *filp)
>> >>> >> >> +{
>> >>> >> >> +       filp->private_data = inode->i_private;
>> >>> >> >> +       kref_get(&ktrace_kref);
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void release_trace_buffer(struct kref *kref)
>> >>> >> >> +{
>> >>> >> >> +       ring_buffer_free(trace_buffer);
>> >>> >> >> +       trace_buffer = NULL;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int trace_data_release(struct inode *inode, struct file *filp)
>> >>> >> >> +{
>> >>> >> >> +       kref_put(&ktrace_kref, release_trace_buffer);
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
>> >>> >> >> +{
>> >>> >> >> +       struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >>> >> >> +       if (filp->f_mode & FMODE_READ) {
>> >>> >> >> +               if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> >>> >> >> +                       return POLLIN | POLLRDNORM;
>> >>> >> >> +               poll_wait(filp, &trace_wait, poll_table);
>> >>> >> >> +               if (!ring_buffer_per_cpu_empty(cpu_buffer))
>> >>> >> >> +                       return POLLIN | POLLRDNORM;
>> >>> >> >> +       }
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> >>> >> >> +{
>> >>> >> >> +       struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
>> >>> >> >> +       pgoff_t pgoff = vmf->pgoff;
>> >>> >> >> +       struct page *page;
>> >>> >> >> +
>> >>> >> >> +       if (!trace_buffer)
>> >>> >> >> +               return VM_FAULT_OOM;
>> >>> >> >> +       page = ring_buffer_get_page(cpu_buffer, pgoff);
>> >>> >> >> +       if (page == NULL)
>> >>> >> >> +               return VM_FAULT_SIGBUS;
>> >>> >> >> +       get_page(page);
>> >>> >> >> +       vmf->page = page;
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static struct vm_operations_struct trace_data_mmap_ops = {
>> >>> >> >> +       .fault = trace_buf_fault,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
>> >>> >> >> +{
>> >>> >> >> +       struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >>> >> >> +
>> >>> >> >> +       vma->vm_ops = &trace_data_mmap_ops;
>> >>> >> >> +       vma->vm_flags |= VM_DONTEXPAND;
>> >>> >> >> +       vma->vm_private_data = cpu_buffer;
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int trace_data_ioctl(struct inode *inode, struct file *filp,
>> >>> >> >> +               unsigned int cmd, unsigned long arg)
>> >>> >> >> +{
>> >>> >> >> +       u32 __user *argp = (u32 __user *)arg;
>> >>> >> >> +       struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
>> >>> >> >> +
>> >>> >> >> +       if (!trace_buffer || !tracedir)
>> >>> >> >> +               return -EINVAL;
>> >>> >> >> +       switch (cmd) {
>> >>> >> >> +       case KTRACE_BUF_GET_SIZE:
>> >>> >> >> +       {
>> >>> >> >> +               unsigned long bufsize;
>> >>> >> >> +               bufsize = ring_buffer_size(trace_buffer);
>> >>> >> >> +               return put_user((u32)bufsize, argp);
>> >>> >> >> +       }
>> >>> >> >> +       case KTRACE_BUF_GET_PRODUCED:
>> >>> >> >> +               return put_user(ring_buffer_get_produced(cpu_buffer), argp);
>> >>> >> >> +       case KTRACE_BUF_GET_CONSUMED:
>> >>> >> >> +               return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
>> >>> >> >> +       case KTRACE_BUF_PUT_CONSUMED:
>> >>> >> >> +       {
>> >>> >> >> +               u32 consumed, consumed_old;
>> >>> >> >> +               int ret;
>> >>> >> >> +
>> >>> >> >> +               ret = get_user(consumed, argp);
>> >>> >> >> +               if (ret) {
>> >>> >> >> +                       printk(KERN_WARNING
>> >>> >> >> +                               "error getting consumed value: %d\n", ret);
>> >>> >> >> +                       return ret;
>> >>> >> >> +               }
>> >>> >> >> +               consumed_old = ring_buffer_get_consumed(cpu_buffer);
>> >>> >> >> +               if (consumed == consumed_old)
>> >>> >> >> +                       return 0;
>> >>> >> >> +               ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
>> >>> >> >> +               return 0;
>> >>> >> >> +       }
>> >>> >> >> +       default:
>> >>> >> >> +               return -ENOIOCTLCMD;
>> >>> >> >> +       }
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations tracedata_file_operations = {
>> >>> >> >> +       .open           = trace_data_open,
>> >>> >> >> +       .poll           = trace_data_poll,
>> >>> >> >> +       .mmap           = trace_data_mmap,
>> >>> >> >> +       .ioctl          = trace_data_ioctl,
>> >>> >> >> +       .release        = trace_data_release,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* function to report the number of lost events due to buffer overflow */
>> >>> >> >> +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
>> >>> >> >> +               size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> +       int ret;
>> >>> >> >> +       char buf[64];
>> >>> >> >> +       ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
>> >>> >> >> +       return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations traceoverflow_file_operations = {
>> >>> >> >> +       .read           = trace_overflow_read,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/* function to re-initialize trace buffer */
>> >>> >> >> +static void __trace_buffer_reset(struct file *filp,
>> >>> >> >> +               unsigned long val, size_t *cnt)
>> >>> >> >> +{
>> >>> >> >> +       if (val && trace_buffer)
>> >>> >> >> +               ring_buffer_reset(trace_buffer);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
>> >>> >> >> +               size_t cnt, loff_t *ppos)
>> >>> >> >> +{
>> >>> >> >> +       return trace_debugfs_write(filp, ubuf, cnt, ppos,
>> >>> >> >> +                       __trace_buffer_reset);
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static const struct file_operations tracereset_file_operations = {
>> >>> >> >> +       .write          = trace_buffer_reset,
>> >>> >> >> +};
>> >>> >> >> +
>> >>> >> >> +/*
>> >>> >> >> + * We use debugfs for kernel-user communication. All of the control/info
>> >>> >> >> + * files are under debugfs/ktrace directory.
>> >>> >> >> + */
>> >>> >> >> +static int create_debugfs(void)
>> >>> >> >> +{
>> >>> >> >> +       struct dentry *entry, *bufdir;
>> >>> >> >> +       struct ring_buffer_per_cpu *cpu_buffer;
>> >>> >> >> +       char *tmpname;
>> >>> >> >> +       int cpu;
>> >>> >> >> +
>> >>> >> >> +       tracedir = debugfs_create_dir("ktrace", NULL);
>> >>> >> >> +       if (!tracedir) {
>> >>> >> >> +               printk(KERN_WARNING "Couldn't create debugfs directory\n");
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       }
>> >>> >> >> +
>> >>> >> >> +       /* 'buffers' is the directory that holds trace buffer debugfs files */
>> >>> >> >> +       bufdir = debugfs_create_dir("buffers", tracedir);
>> >>> >> >> +       if (!bufdir) {
>> >>> >> >> +               printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       }
>> >>> >> >> +
>> >>> >> >> +       tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> >>> >> >> +       if (!tmpname)
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       /* create a debugfs file for each cpu buffer */
>> >>> >> >> +       for_each_possible_cpu(cpu) {
>> >>> >> >> +               snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> >>> >> >> +               cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
>> >>> >> >> +               entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
>> >>> >> >> +                       &tracedata_file_operations);
>> >>> >> >> +               if (!entry) {
>> >>> >> >> +                       printk(KERN_WARNING
>> >>> >> >> +                               "Couldn't create debugfs \'%s\' entry\n",
>> >>> >> >> +                                       tmpname);
>> >>> >> >> +                       return -ENOMEM;
>> >>> >> >> +               }
>> >>> >> >> +       }
>> >>> >> >> +       kfree(tmpname);
>> >>> >> >> +
>> >>> >> >> +       /* the control file for users to enable/disable global tracing. */
>> >>> >> >> +       entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
>> >>> >> >> +                       &tracecontrol_file_operations);
>> >>> >> >> +       if (!entry) {
>> >>> >> >> +               printk(KERN_WARNING
>> >>> >> >> +                       "Couldn't create debugfs 'enabled' entry\n");
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       }
>> >>> >> >> +
>> >>> >> >> +       /*
>> >>> >> >> +        * the debugfs file that displays the name, format etc. of every
>> >>> >> >> +        * supported trace event. The file is to be used by the user-space
>> >>> >> >> +        * trace parser to analyze the collected trace data.
>> >>> >> >> +        */
>> >>> >> >> +       entry = debugfs_create_file("info", 0444, tracedir, NULL,
>> >>> >> >> +                       &traceinfo_file_operations);
>> >>> >> >> +       if (!entry) {
>> >>> >> >> +               printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       }
>> >>> >> >> +
>> >>> >> >> +       /*
>> >>> >> >> +        * the debugfs file that reports the number of events
>> >>> >> >> +        * lost due to buffer overflow
>> >>> >> >> +        */
>> >>> >> >> +       entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
>> >>> >> >> +                       &traceoverflow_file_operations);
>> >>> >> >> +       if (!entry) {
>> >>> >> >> +               printk(KERN_WARNING
>> >>> >> >> +                       "Couldn't create debugfs 'overflow' entry\n");
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       }
>> >>> >> >> +
>> >>> >> >> +       /* the debugfs file that resets trace buffer upon write */
>> >>> >> >> +       entry = debugfs_create_file("reset", 0220, tracedir, NULL,
>> >>> >> >> +                       &tracereset_file_operations);
>> >>> >> >> +       if (!entry) {
>> >>> >> >> +               printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       }
>> >>> >> >> +
>> >>> >> >> +       /* create the directory that holds the control files for every event */
>> >>> >> >> +       eventdir = debugfs_create_dir("events", tracedir);
>> >>> >> >> +       if (!eventdir) {
>> >>> >> >> +               printk(KERN_WARNING "Couldn't create 'events' directory\n");
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       }
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static void remove_debugfs_file(struct dentry *dir, char *name)
>> >>> >> >> +{
>> >>> >> >> +       struct dentry *dentry;
>> >>> >> >> +       dentry = lookup_one_len(name, dir, strlen(name));
>> >>> >> >> +       if (dentry && !IS_ERR(dentry)) {
>> >>> >> >> +               debugfs_remove(dentry);
>> >>> >> >> +               dput(dentry);
>> >>> >> >> +       }
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int remove_debugfs(void)
>> >>> >> >> +{
>> >>> >> >> +       struct dentry *bufdir;
>> >>> >> >> +       char *tmpname;
>> >>> >> >> +       int cpu;
>> >>> >> >> +
>> >>> >> >> +       bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
>> >>> >> >> +       if (!bufdir || IS_ERR(bufdir))
>> >>> >> >> +               return -EIO;
>> >>> >> >> +       tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
>> >>> >> >> +       if (!tmpname)
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       for_each_possible_cpu(cpu) {
>> >>> >> >> +               snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
>> >>> >> >> +               remove_debugfs_file(bufdir, tmpname);
>> >>> >> >> +       }
>> >>> >> >> +       kfree(tmpname);
>> >>> >> >> +       debugfs_remove(bufdir);
>> >>> >> >> +
>> >>> >> >> +       remove_debugfs_file(tracedir, "enabled");
>> >>> >> >> +       remove_debugfs_file(tracedir, "info");
>> >>> >> >> +       debugfs_remove(eventdir);
>> >>> >> >> +       debugfs_remove(tracedir);
>> >>> >> >> +       return 0;
>> >>> >> >> +}
>> >>> >> >> +
>> >>> >> >> +static int __init init_ktrace(void)
>> >>> >> >> +{
>> >>> >> >> +       struct kernel_trace *iter;
>> >>> >> >> +       int i, err;
>> >>> >> >> +
>> >>> >> >> +       ktrace_next_id = 0;
>> >>> >> >> +       kref_set(&ktrace_kref, 1);
>> >>> >> >> +       trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
>> >>> >> >> +       if (trace_buffer == NULL)
>> >>> >> >> +               return -ENOMEM;
>> >>> >> >> +       setup_timer(&trace_timer, wakeup_readers, 0);
>> >>> >> >> +       for (i = 0; i < KTRACE_HASH_SIZE; i++)
>> >>> >> >> +               INIT_HLIST_HEAD(&ktrace_table[i]);
>> >>> >> >> +
>> >>> >> >> +       mutex_lock(&ktrace_mutex);
>> >>> >> >> +       err = create_debugfs();
>> >>> >> >> +       if (err != 0)
>> >>> >> >> +               goto out;
>> >>> >> >> +       for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> >>> >> >> +               add_ktrace(iter);
>> >>> >> >> +       err = register_module_notifier(&ktrace_module_nb);
>> >>> >> >> +out:
>> >>> >> >> +       mutex_unlock(&ktrace_mutex);
>> >>> >> >> +       return err;
>> >>> >> >> +}
>> >>> >> >> +module_init(init_ktrace);
>> >>> >> >> +
>> >>> >> >> +static void __exit exit_ktrace(void)
>> >>> >> >> +{
>> >>> >> >> +       struct kernel_trace *iter;
>> >>> >> >> +
>> >>> >> >> +       if (ktrace_enabled) {
>> >>> >> >> +               ktrace_enabled = 0;
>> >>> >> >> +               del_timer_sync(&trace_timer);
>> >>> >> >> +       }
>> >>> >> >> +       mutex_lock(&ktrace_mutex);
>> >>> >> >> +       for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
>> >>> >> >> +               remove_ktrace(iter, 1);
>> >>> >> >> +       unregister_module_notifier(&ktrace_module_nb);
>> >>> >> >> +       remove_debugfs();
>> >>> >> >> +       mutex_unlock(&ktrace_mutex);
>> >>> >> >> +
>> >>> >> >> +       kref_put(&ktrace_kref, release_trace_buffer);
>> >>> >> >> +}
>> >>> >> >
>> >>> >> > --
>> >>> >> > Mathieu Desnoyers
>> >>> >> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
>> >>> >> >
>> >>> >
>> >>> > --
>> >>> > Mathieu Desnoyers
>> >>> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
>> >>> >
>> >>>
>> >>> _______________________________________________
>> >>> ltt-dev mailing list
>> >>> ltt-dev@lists.casi.polymtl.ca
>> >>> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
>> >>>
>> >>
>> >> --
>> >> Mathieu Desnoyers
>> >> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
>> >>
>> >
>>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/