LinuxLists.cc - [patch 0/4] Port KVM-trace to tracepoints

2008-07-17 16:01:46

Subject: [patch 0/4] Port KVM-trace to tracepoints

Hi,

Here is a port of KVM-trace, currently using macros on top of the Linux Markers,
to tracepoints. Note that I cleaned up the instrumentation too, so stuff like

KVMTRACE_3D(CR_WRITE, vcpu, (u32)cr,
(u32)kvm_register_read(vcpu, reg),
(u32)((u64)kvm_register_read(vcpu, reg) >> 32),
handler);

Sprinkled all across the KVM C code becomes :

trace_kvm_cr_write(vcpu, cr, reg);

Which looks much nicer, IMHO.

It applies on top of linux-next patch-v2.6.26-next-20080715.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-07-22 18:43:17

by Avi Kivity

[permalink] [raw]

Subject: Re: [patch 0/4] Port KVM-trace to tracepoints

Mathieu Desnoyers wrote:
> Hi,
>
> Here is a port of KVM-trace, currently using macros on top of the Linux Markers,
> to tracepoints. Note that I cleaned up the instrumentation too, so stuff like
>
> KVMTRACE_3D(CR_WRITE, vcpu, (u32)cr,
> (u32)kvm_register_read(vcpu, reg),
> (u32)((u64)kvm_register_read(vcpu, reg) >> 32),
> handler);
>
> Sprinkled all across the KVM C code becomes :
>
> trace_kvm_cr_write(vcpu, cr, reg);
>
> Which looks much nicer, IMHO.
>
> It applies on top of linux-next patch-v2.6.26-next-20080715.
>
>

It does look nicer; but it means maintaining tracepoints becomes much
harder, because the information is scattered across many files.

kvm tracepoints are heavily tied into the implementation; and making
them harder to write means we will have less information. In fact, I am
contemplating moving in another direction (when looking at the
pgprintk()s scattered around arch/x86/kvm/mmu.c:

kvm_trace("pfentry", "page_fault entry addr %lx error code %x\n",
cr2, error_code);

Unlike printk()s, no actual formatting would occur during runtime.
Instead, at initialization time all the strings would be parsed into a
data structure that describes the data types, and the runtime would
simply consult this structure and copy the arguments into trace
records. User space would also be able to pull this structure and so
recreate the formatted string.

The advantages I see to this are:

- easy to add traces; the most important advantage
- when the code changes, obsolete traces are completely removed
- good performance
- no need to have a formats file in userspace (which is tied to the
kernel version)
- can also send printk()s along, for synchronization with other kvm
and kernel events

In fact, why not convert printk() to do this, making it a
high-performance logging system? Of course it would retain the dmesg
interface, but we could say that messages at the KERN_TRACE level bypass
dmesg and only go to the trace system.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-07-22 19:19:40

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [patch 0/4] Port KVM-trace to tracepoints

Avi Kivity <[email protected]> writes:

> [...]
> kvm tracepoints are heavily tied into the implementation; and making
> them harder to write means we will have less information. In fact, I
> am contemplating moving in another direction (when looking at the
> pgprintk()s scattered around arch/x86/kvm/mmu.c:
>
> kvm_trace("pfentry", "page_fault entry addr %lx error code %x\n",
> cr2, error_code);
>
> Unlike printk()s, no actual formatting would occur during runtime.

Have you considered using trace_mark() directly - eliminating the
KVM_TRACEN() middlemen?

> Instead, at initialization time all the strings would be parsed into
> a data structure that describes the data types, and the runtime
> would simply consult this structure and copy the arguments into
> trace records. User space would also be able to pull this structure
> and so recreate the formatted string.

If one really wanted to, one could build such a mechanism on top of
marker-based callbacks.

> The advantages I see to this are:
>
> - easy to add traces; the most important advantage
> - when the code changes, obsolete traces are completely removed
> - good performance

Ditto.

> - no need to have a formats file in userspace (which is tied to the
> kernel version)

OTOH, you'd have the kernel collecting compact binary records
containing just the parameters, which are at least as tied to kernel
version.

> - can also send printk()s along, for synchronization with other kvm
> and kernel events

Ditto. It is elementary to attach a printk-generating marker callback.

- FChE

2008-07-22 19:32:53

by Avi Kivity

[permalink] [raw]

Subject: Re: [patch 0/4] Port KVM-trace to tracepoints

Frank Ch. Eigler wrote:
> Avi Kivity <[email protected]> writes:
>
>
>> [...]
>> kvm tracepoints are heavily tied into the implementation; and making
>> them harder to write means we will have less information. In fact, I
>> am contemplating moving in another direction (when looking at the
>> pgprintk()s scattered around arch/x86/kvm/mmu.c:
>>
>> kvm_trace("pfentry", "page_fault entry addr %lx error code %x\n",
>> cr2, error_code);
>>
>> Unlike printk()s, no actual formatting would occur during runtime.
>>
>
> Have you considered using trace_mark() directly - eliminating the
> KVM_TRACEN() middlemen?
>
>

Eliminating KVM_TRACEN -- yes. There are too many of them, they aren't
type-aware, and they're in uppercase.

Using trace_mark() directly -- looking at it, seems to fit the
requirements exactly. Should have looked at it earlier. Is there a way
to get a list of all markers?

Perhaps the kvmtrace marker->relay integration should be made a marker
feature, since there is nothing specific to kvm in it.

>> Instead, at initialization time all the strings would be parsed into
>> a data structure that describes the data types, and the runtime
>> would simply consult this structure and copy the arguments into
>> trace records. User space would also be able to pull this structure
>> and so recreate the formatted string.
>>
>
> If one really wanted to, one could build such a mechanism on top of
> marker-based callbacks.
>
>

One does want to.

>> - no need to have a formats file in userspace (which is tied to the
>> kernel version)
>>
>
> OTOH, you'd have the kernel collecting compact binary records
> containing just the parameters, which are at least as tied to kernel
> version.
>
>

Yes, but the userspace side would collect the format strings as well
(just once) and could put them in the same file. The aggregation is
portable across kernel versions.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-07-22 19:56:08

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [patch 0/4] Port KVM-trace to tracepoints

Avi Kivity <[email protected]> writes:

> [...]
>> Have you considered using trace_mark() directly - eliminating the
>> KVM_TRACEN() middlemen?
> [...]
> Using trace_mark() directly -- looking at it, seems to fit the
> requirements exactly. Should have looked at it earlier. Is there a
> way to get a list of all markers?

>From kernel-space, I can't find just now an API for listing them, but
there probably is / could be one.

>From user-space, lttng tools can probably do it. systemtap can too:
% stap -l 'kernel.mark("*")'

You could prototype binary tracing thusly:
% stap -e 'probe kernel.mark("kvm_foobar") {
printf("%4b%4b%4b", # three 4-byte ints
cpu(), $arg1, $arg2)
}'

> Perhaps the kvmtrace marker->relay integration should be made a marker
> feature, since there is nothing specific to kvm in it.

Right, I believe something like that is in the lttng patch suite.

- FChE

2008-07-22 22:12:30

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [patch 0/4] Port KVM-trace to tracepoints -> LTTng ?

* Avi Kivity ([email protected]) wrote:
> Frank Ch. Eigler wrote:
>> Avi Kivity <[email protected]> writes:
>>
>>
>>> [...]
>>> kvm tracepoints are heavily tied into the implementation; and making
>>> them harder to write means we will have less information. In fact, I
>>> am contemplating moving in another direction (when looking at the
>>> pgprintk()s scattered around arch/x86/kvm/mmu.c:
>>>
>>> kvm_trace("pfentry", "page_fault entry addr %lx error code %x\n",
>>> cr2, error_code);
>>>
>>> Unlike printk()s, no actual formatting would occur during runtime.
>>>
>>
>> Have you considered using trace_mark() directly - eliminating the
>> KVM_TRACEN() middlemen?
>>
>>
>
> Eliminating KVM_TRACEN -- yes. There are too many of them, they aren't
> type-aware, and they're in uppercase.
>
> Using trace_mark() directly -- looking at it, seems to fit the requirements
> exactly. Should have looked at it earlier. Is there a way to get a list
> of all markers?
>
> Perhaps the kvmtrace marker->relay integration should be made a marker
> feature, since there is nothing specific to kvm in it.
>
>>> Instead, at initialization time all the strings would be parsed into
>>> a data structure that describes the data types, and the runtime
>>> would simply consult this structure and copy the arguments into
>>> trace records. User space would also be able to pull this structure
>>> and so recreate the formatted string.
>>>
>>
>> If one really wanted to, one could build such a mechanism on top of
>> marker-based callbacks.
>>
>>
>
> One does want to.
>
>>> - no need to have a formats file in userspace (which is tied to the
>>> kernel version)
>>>
>>
>> OTOH, you'd have the kernel collecting compact binary records
>> containing just the parameters, which are at least as tied to kernel
>> version.
>>
>>
>
> Yes, but the userspace side would collect the format strings as well (just
> once) and could put them in the same file. The aggregation is portable
> across kernel versions.
>

Yes,

LTTng does exactly all that.

Please have a look at the current LTTng patchset :

http://ltt.polymtl.ca/lttng/patch-2.6.26-0.11.tar.bz2

The interface to list markers is currently found in /proc/ltt

Commands like :

cat /proc/ltt (list markers)
And echo -n "connect marker_name default dynamic channel_name"

See the script ltt-armall.sh in the package :

http://ltt.polymtl.ca/lttng/ltt-control-0.50-17072008.tar.gz

To see how to arm all markers listed.

General information (compatibility list and quickstart guide are
available at http://ltt.polymtl.ca). Packages also useful : lttv (trace
analyser, including text dump, filtering, gui...) and a userspace marker
package (only supports x86 32/64 currently).

All these packages support any kind of custom markers, because the
marker names/types are listed in the "facilities_*" control tracefile at
trace start, so the traces are self-described. I also list other stuff
(memory maps, irq handler names, system call handler names) at trace
start so we can dynamically have these mapping, independently of the
architecture.

See :
http://ltt.polymtl.ca/packages/lttv-0.10.0-pre14-17072008.tar.gz
http://ltt.polymtl.ca/packages/markers-userspace-0.5.tar.bz2

I'll be more than happy to answer your questions.

Mathieu

> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-07-27 10:11:35

by Avi Kivity

[permalink] [raw]

Subject: Re: [patch 0/4] Port KVM-trace to tracepoints -> LTTng ?

Mathieu Desnoyers wrote:

>> Yes, but the userspace side would collect the format strings as well (just
>> once) and could put them in the same file. The aggregation is portable
>> across kernel versions.
>>
>>
>
> Yes,
>
> LTTng does exactly all that.
>
>
>

[snip goodies]

> I'll be more than happy to answer your questions.
>

What's the merge plan for this?

--
error compiling committee.c: too many arguments to function

2008-07-28 00:59:20

by Mathieu Desnoyers

[permalink] [raw]

Subject: [RFC] LTTng merge plan

* Avi Kivity ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>
>
>>> Yes, but the userspace side would collect the format strings as well
>>> (just once) and could put them in the same file. The aggregation is
>>> portable across kernel versions.
>>>
>>>
>>
>> Yes,
>>
>> LTTng does exactly all that.
>>
>>
>>
>
> [snip goodies]
>
>> I'll be more than happy to answer your questions.
>>
>
> What's the merge plan for this?
>

Hi Avi,

Thanks for asking. Given the amount of expectation from kernel
developers, distributions and users I have seen for kernel tracing at
this year's OLS, I think giving a detailed merge plan for my LTTng work
is appropriate.

Currently, it looks like :

In Ingo's trees :
- Tracepoints, scheduler tracepoints instrumentation, ftrace port to
tracepoints
- Should make it into 2.6.27 since ftrace needs those.
- Immediate Values (faster branch based on load immediate instruction)
Useful for markers and tracepoints, but can also be used for any
compiled-in code that has to be dynamically enabled.
- Aims at 2.6.28
- Text Edit Lock : protection of kernel text modification with a mutex.
Synchronises kprobes and immediate values.
- Aims at 2.6.28

Short-term submission plan

In LTTng patchset
(http://ltt.polymtl.ca/lttng/patch-2.6.26-0.12.tar.bz2)

- Instrumentation
- LTTng tracepoints
- Used by LTTng, SystemTAP and usable specialized probes.
- Port specific sets of tracepoints along with their current users
- ftrace (port currently in Ingo's tree), KVM trace, blktrace.

- Data extraction
- LTTng timestamping
- Based on the CPU cycle counter when synchronized across CPUs.
- Fallback on a simple cache-bouncing atomic counter if no
synchronized fast time source is available. Basically, the idea
is that having the correct event _order_ is more important than
having an approximate time, because this "timestamp" is used to
reorder events which are written in per-CPU buffers. Time updates
can always be recorded as an event in the trace to get an idea of
the kernel time flow.
- LTTng trace management
- netlink interface to start/stop tracing and set the buffer sizes.
- Supports multiple channels (high/medium/low event rate).
Metadata (marker types, list of interrupt handlers...) can be
exported in low event rate channels.
- Supports flight recorder mode (overwriting oldest buffer data),
normal mode (writes to disk, drops events if buffer is full) or
hybrid, or mixed, mode, where the high event rate buffers only
are in flight recorder mode.
- Data relay
- Atomic buffering mechanism which does not call into kernel
primitives except preempt disable. Only touches variables
atomically, does not use any lock. Aims at having minimal
intrusiveness and allowing the largest code coverage (thus not
calling kernel code).
- LTTng marker control
- Currently a /proc/ltt interface with read and write operations to
list markers and connect LTTng probe to individual markers,
specifying in which channel to send the data (I know, should
probably belong to /sys instead, comments welcome) It's not part
of the core marker infrastructure because it depends both on
markers and on the LTTng trace management. It's also responsible
for allocating a numeric ID to a marker which is guaranteed to be
unique as long as there is at least one active trace.

Medium-term submission plan

In LTTng patchset

- Instrumentation
- Userspace tracing interface
- Allow userspace to declare tracepoints and/or markers
- Provide a data extraction interface to collect the tracing data.
- More work needed in this area.
- LTTng statedump
- Exports the kernel data structures to the trace buffers at trace
start. List interrupts, system calls, threads, memory maps, ... It
does not use /proc because :
1 - /proc has nasty races which makes the information "generally
correct" but not more.
2 - /proc exports the information in text format, which is not
as compact as LTTng binary format.

Longer term wishlist
- GCC support for static branch patching
- Improvement on the immediate values for dynamic code activation

A bit more information is available in the slides I just presented at
OLS at :

http://ltt.polymtl.ca/slides/desnoyers-talk-ols2008.pdf

I'll gladly answer to questions/comments.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-07-29 16:22:52

by Frank Ch. Eigler

[permalink] [raw]

Subject: Re: [RFC] LTTng merge plan

Hi, Mathieu -

> [...]
> Currently, [my merge plan] looks like :
>
> In Ingo's trees :
> - Tracepoints, scheduler tracepoints instrumentation, ftrace port to
> tracepoints
> - Should make it into 2.6.27 since ftrace needs those.
> [...]

This is the tracepoints infrastructure, and ...

> Short-term submission plan
>
> In LTTng patchset
> (http://ltt.polymtl.ca/lttng/patch-2.6.26-0.12.tar.bz2)
>
> - Instrumentation
> - LTTng tracepoints
> - Used by LTTng, SystemTAP and usable specialized probes.
> [...]

... this is the "meat", which includes both the tracepoints
and the the tracepoint-to-marker conversion modules, such as those in
"lttng-instrumentation-*-tracepoint-probes.patch", right?

- FChE

2008-07-29 17:02:20

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [RFC] LTTng merge plan

* Frank Ch. Eigler ([email protected]) wrote:
> Hi, Mathieu -
>
> > [...]
> > Currently, [my merge plan] looks like :
> >
> > In Ingo's trees :
> > - Tracepoints, scheduler tracepoints instrumentation, ftrace port to
> > tracepoints
> > - Should make it into 2.6.27 since ftrace needs those.
> > [...]
>
> This is the tracepoints infrastructure, and ...
>
> > Short-term submission plan
> >
> > In LTTng patchset
> > (http://ltt.polymtl.ca/lttng/patch-2.6.26-0.12.tar.bz2)
> >
> > - Instrumentation
> > - LTTng tracepoints
> > - Used by LTTng, SystemTAP and usable specialized probes.
> > [...]
>
> ... this is the "meat", which includes both the tracepoints
> and the the tracepoint-to-marker conversion modules, such as those in
> "lttng-instrumentation-*-tracepoint-probes.patch", right?
>
> - FChE

Exactly. The tracepoint-to-markers conversion modules bridge between
the in-kernel API (tracepoints) which declares standard location for
kernel instrumentation and user-space visible "markers", so that there
is an indirection level between the instrumented kernel code and what is
visible from userspace.

For instance, we have

in kernel/sched.c:context_switch()
trace_sched_switch(rq, prev, next);

Which exports a struct rq * and two struct task_struct *. They are used
by both ftrace, which needs access to these data structures, and by the
LTTng probes, which extracts the previous PID and state (running,
waiting..) and next PID to be scheduled in.

Basically, these modules turn the data exported by tracepoints,
meaningful only to kernel modules, into data useful for userspace trace
analysis.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68