2013-06-04 08:36:30

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [PATCH V2 0/1] kvm/vmx: Output TSC offset

Hi All,

I'd like to propose a patch which adds a tracepoint at write_tsc_offset for
tracing guests TSC offset changes. It is required for sorting the trace data of
a guest and the host in chronological order.

In a virtualization environment, it is difficult to analyze performance
problems, such as a delay of I/O request on a guest. This is because multiple
guests operate on the host. One of approaches for solving such kind of problems
is to sort trace data of guests and the host in chronological order.

Raw TSC can be chosen as a timestamp of ftrace. I think TSC is useful for
merging trace data in chronological order by two reasons. One of the reasons is
that guests can directly read raw TSC from the CPU using rdtsc operation. This
means that raw TSC value is not software clock like sched_clock, so we don't
need to consider about how the timestamp is calculated. The other is that TSC
of recent x86 CPUs is constantly incremented. This means that we don't need to
worry about pace of the timestamp. Therefore, choosing TSC as a timestamp for
tracing is reasonable to integrate trace data of guests and a host.

Here, we need to consider about just one matter for using TSC on guests. TSC
value on a guest is always the host TSC plus the guest's "TSC offset". In other
words, to merge trace data using TSC as timestamp in chronological order, we
need to consider TSC offset of the guest.

However, only the host kernel can read the TSC offset from VMCS and TSC offset
is not output in anywhere now. In other words, tools in userland cannot get
the TSC offset value, so we cannot merge trace data of guest and the host in
chronological order. Therefore, I think the TSC offset should be exported for
userland tools.

In this patch, TSC offset is exported by the tracepoint kvm_write_tsc_offset
on the host. TSC offset events will be very rare event because guests may
seldom execute write_tsc. So if we enable this event for a normal buffer of
ftrace, the events will be overwritten by other events. For a recent linux
kernel, a multiple buffer function is available. So, I recommend to enable
this event for a sub buffer of ftrace.

<Example>
We assume that wakeup-latency for a command is big on a guest. Normally
we will use ftrace's wakeup-latency tracer or event tracer on the guest, but we
may not be able to solve this problem. This is because guests often exit to
the host for several reasons. In the next, we will use TSC as ftrace's timestamp
and record the trace data on the guest and the host. Then, we get following
data:

/* guest data */
comm-3826 [000] d...49836825726903: sched_wakeup: [detail]
comm-3826 [000] d...49836832225344: sched_switch: [detail]
/* host data */
qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]

Since TSC offset is not considered, these data cannot be merged. If this trace
data is shown like as follows, we will be able to understand the reason:

qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
comm-3826 [000] d.h.49836825726903: sched_wakeup: [detail] <=
qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
comm-3826 [000] d...49836832225344: sched_switch: [detail] <=
qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]

In this case, we can understand wakeup-latency was big due to exit to host
twice. Getting this data sorted in chronological order is our goal.

To merge the data like previous pattern, we apply this patch. Then, we can
get TSC offset of the guest as follows:

# cat /sys/kernel/debug/tracing/instances/tsc_offset/trace
.. d...4300151845072: kvm_write_tsc_offset: previous 0 next 18446739773557710924
^ ^^^^^^^^^^^^^^^^^^^^

We use this TSC offset value to a merge script and obtain the following data:

$ ./trace-merge.pl -g guest_data -h host_data -t 18446739773557710924

h qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
g comm-3826 [000] d.h.50550079226331: sched_wakeup: [detail] <=
h qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
h qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
g comm-3826 [000] d...50550079279266: sched_switch: [detail] <=
h qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
|
\----guest/host

In this summary, I suggest the patch which TSC offset for each guest can be
output on the host.

As one example, I will send a merge tool. This tool is not considered
executing write_tsc on a guest yet.

Changes in V2:
- Use tracepoint for outputting TSC offset value instead of printk() [1/1]

Thanks!

---

Yoshihiro YUNOMAE (1):
kvm/vmx: Add a tracepoint write_tsc_offset


arch/x86/kvm/trace.h | 18 ++++++++++++++++++
arch/x86/kvm/vmx.c | 3 +++
arch/x86/kvm/x86.c | 1 +
3 files changed, 22 insertions(+)

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]


2013-06-04 08:36:53

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

Add a tracepoint write_tsc_offset for tracing TSC offset change.
We want to merge ftrace's trace data of guest OSs and the host OS using
TSC for timestamp in chronological order. We need "TSC offset" values for
each guest when merge those because the TSC value on a guest is always the
host TSC plus guest's TSC offset. If we get the TSC offset values, we can
calculate the host TSC value for each guest events from the TSC offset and
the event TSC value. The host TSC values of the guest events are used when we
want to merge trace data of guests and the host in chronological order.
(Note: the trace_clock of both the host and the guest must be set x86-tsc in
this case)

TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
The latter function is executed when kvm clock is updated. Only host can read
TSC offset value from VMCS, so a host needs to output TSC offset value
when TSC offset is changed.

Since the TSC offset is not often changed, it could be overwritten by other
frequent events while tracing. To avoid that, I recommend to use a special
instance for getting this event:

1. set a instance before booting a guest
# cd /sys/kernel/debug/tracing/instances
# mkdir tsc_offset
# cd tsc_offset
# echo x86-tsc > trace_clock
# echo 1 > events/kvm/kvm_write_tsc_offset/enable

2. boot a guest

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Gleb Natapov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---
arch/x86/kvm/trace.h | 18 ++++++++++++++++++
arch/x86/kvm/vmx.c | 3 +++
arch/x86/kvm/x86.c | 1 +
3 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index fe5e00e..9c22e39 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
__print_symbolic(__entry->host_clock, host_clocks))
);

+TRACE_EVENT(kvm_write_tsc_offset,
+ TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
+ TP_ARGS(previous_tsc_offset, next_tsc_offset),
+
+ TP_STRUCT__entry(
+ __field( __u64, previous_tsc_offset )
+ __field( __u64, next_tsc_offset )
+ ),
+
+ TP_fast_assign(
+ __entry->previous_tsc_offset = previous_tsc_offset;
+ __entry->next_tsc_offset = next_tsc_offset;
+ ),
+
+ TP_printk("previous=%llu next=%llu",
+ __entry->previous_tsc_offset, __entry->next_tsc_offset)
+);
+
#endif /* CONFIG_X86_64 */

#endif /* _TRACE_KVM_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 25a791e..00f7dde 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2096,6 +2096,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
vmcs12->tsc_offset : 0));
} else {
+ trace_kvm_write_tsc_offset(vmcs_read64(TSC_OFFSET), offset);
vmcs_write64(TSC_OFFSET, offset);
}
}
@@ -2103,6 +2104,8 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host)
{
u64 offset = vmcs_read64(TSC_OFFSET);
+
+ trace_kvm_write_tsc_offset(offset, offset + adjustment);
vmcs_write64(TSC_OFFSET, offset + adjustment);
if (is_guest_mode(vcpu)) {
/* Even when running L2, the adjustment needs to apply to L1 */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 05a8b1a..c942a0c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7264,3 +7264,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intr_vmexit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);

2013-06-04 08:39:06

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [EXAMPLE] tools: a tool for merging trace data of a guest and a host

This tool merges trace data of a guest and a host in chronological
order. Restrictions of this tool is as follows:
- one guest (not for multiple guests)
- stable TSC (not backward TSC)
- synchronized TSC
- unchanged TSC offset (the guest does not execute write_TSC)

- How to use
1. [host] Enable kvm_write_tsc_offset before booting a guest
# cd /sys/kernel/debug/tracing/instances
# mkdir tsc_offset
# cd tsc_offset
# echo x86-tsc > trace_clock
# echo 1 > events/kvm/kvm_write_tsc_offset/enable

2. [host] Enable events you want
Note: I recommend to enable kvm_exit/entry events.
# cd /sys/kernel/debug/tracing
# echo kvm_entry >> set_event
# echo kvm_exit >> set_event
# [snip]
# echo x86-tsc > trace_clock

3. [host] Boot the guest

4. [guest] Enable events you want
# cd /sys/kernel/debug/tracing
# echo sched_wakeup >> set_event
# echo sched_switch >> set_event
# [snip]
# echo x86-tsc > trace_clock

5. [guest] Run programs

6. [guest/host] Get trace data
# echo 0 > tracing_on
# cat trace > /home/yourdir/log/guest_trace.txt (for the guest)
(cat trace > /home/yourdir/log/host_trace.txt (for the host))
# scp [host_IP]:/home/yourdir/log/guest_trace.txt (only for the guest)

7. [host] Get next TSC offset
# cat /sys/kernel/debug/tracing/instances/tsc_offset/trace
qemu-kvm-22089 [000] d...4300151845072: kvm_write_tsc_offset:
previous 0 next 18446739773557710924

8. [host] Run this tool with the next TSC offset for -t option.
$ ./trace-merge.pl -g ~/log/guest_trace.txt -h ~/log/host_trace.txt \
-t 18446739773557710924

h qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
g comm-3826 [000] d.h.50550079226331: sched_wakeup: [detail]
h qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
h qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
g comm-3826 [000] d...50550079279266: sched_switch: [detail]
h qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
|
\----guest/host

Thanks,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]



Attachments:
trace-merge.pl (2.35 kB)

2013-06-06 00:25:04

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
> Add a tracepoint write_tsc_offset for tracing TSC offset change.
> We want to merge ftrace's trace data of guest OSs and the host OS using
> TSC for timestamp in chronological order. We need "TSC offset" values for
> each guest when merge those because the TSC value on a guest is always the
> host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> calculate the host TSC value for each guest events from the TSC offset and
> the event TSC value. The host TSC values of the guest events are used when we
> want to merge trace data of guests and the host in chronological order.
> (Note: the trace_clock of both the host and the guest must be set x86-tsc in
> this case)
>
> TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> The latter function is executed when kvm clock is updated. Only host can read
> TSC offset value from VMCS, so a host needs to output TSC offset value
> when TSC offset is changed.
>
> Since the TSC offset is not often changed, it could be overwritten by other
> frequent events while tracing. To avoid that, I recommend to use a special
> instance for getting this event:
>
> 1. set a instance before booting a guest
> # cd /sys/kernel/debug/tracing/instances
> # mkdir tsc_offset
> # cd tsc_offset
> # echo x86-tsc > trace_clock
> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
>
> 2. boot a guest
>
> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> Cc: Marcelo Tosatti <[email protected]>
> Cc: Gleb Natapov <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> ---
> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
> arch/x86/kvm/vmx.c | 3 +++
> arch/x86/kvm/x86.c | 1 +
> 3 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index fe5e00e..9c22e39 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
> __print_symbolic(__entry->host_clock, host_clocks))
> );
>
> +TRACE_EVENT(kvm_write_tsc_offset,
> + TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
> + TP_ARGS(previous_tsc_offset, next_tsc_offset),
> +
> + TP_STRUCT__entry(
> + __field( __u64, previous_tsc_offset )
> + __field( __u64, next_tsc_offset )
> + ),
> +
> + TP_fast_assign(
> + __entry->previous_tsc_offset = previous_tsc_offset;
> + __entry->next_tsc_offset = next_tsc_offset;
> + ),
> +
> + TP_printk("previous=%llu next=%llu",
> + __entry->previous_tsc_offset, __entry->next_tsc_offset)
> +);
> +

Yoshihiro YUNOMAE,

1) Why is previous_tsc_offset necessary?

2) The TSC offset traces should include vcpu number, so that its
possible to correlate traces of SMP guests (the tool should use
the individual vcpu tsc offsets when converting guests trace).

3) Please add traces for svm.c.

2013-06-06 11:33:19

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
> On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
> > Add a tracepoint write_tsc_offset for tracing TSC offset change.
> > We want to merge ftrace's trace data of guest OSs and the host OS using
> > TSC for timestamp in chronological order. We need "TSC offset" values for
> > each guest when merge those because the TSC value on a guest is always the
> > host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> > calculate the host TSC value for each guest events from the TSC offset and
> > the event TSC value. The host TSC values of the guest events are used when we
> > want to merge trace data of guests and the host in chronological order.
> > (Note: the trace_clock of both the host and the guest must be set x86-tsc in
> > this case)
> >
> > TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> > vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> > The latter function is executed when kvm clock is updated. Only host can read
> > TSC offset value from VMCS, so a host needs to output TSC offset value
> > when TSC offset is changed.
> >
> > Since the TSC offset is not often changed, it could be overwritten by other
> > frequent events while tracing. To avoid that, I recommend to use a special
> > instance for getting this event:
> >
> > 1. set a instance before booting a guest
> > # cd /sys/kernel/debug/tracing/instances
> > # mkdir tsc_offset
> > # cd tsc_offset
> > # echo x86-tsc > trace_clock
> > # echo 1 > events/kvm/kvm_write_tsc_offset/enable
> >
> > 2. boot a guest
> >
> > Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> > Cc: Marcelo Tosatti <[email protected]>
> > Cc: Gleb Natapov <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > ---
> > arch/x86/kvm/trace.h | 18 ++++++++++++++++++
> > arch/x86/kvm/vmx.c | 3 +++
> > arch/x86/kvm/x86.c | 1 +
> > 3 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index fe5e00e..9c22e39 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
> > __print_symbolic(__entry->host_clock, host_clocks))
> > );
> >
> > +TRACE_EVENT(kvm_write_tsc_offset,
> > + TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
> > + TP_ARGS(previous_tsc_offset, next_tsc_offset),
> > +
> > + TP_STRUCT__entry(
> > + __field( __u64, previous_tsc_offset )
> > + __field( __u64, next_tsc_offset )
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->previous_tsc_offset = previous_tsc_offset;
> > + __entry->next_tsc_offset = next_tsc_offset;
> > + ),
> > +
> > + TP_printk("previous=%llu next=%llu",
> > + __entry->previous_tsc_offset, __entry->next_tsc_offset)
> > +);
> > +
>
> Yoshihiro YUNOMAE,
>
> 1) Why is previous_tsc_offset necessary?
>
> 2) The TSC offset traces should include vcpu number, so that its
> possible to correlate traces of SMP guests (the tool should use
> the individual vcpu tsc offsets when converting guests trace).
>
Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.

> 3) Please add traces for svm.c.

--
Gleb.

2013-06-07 05:22:31

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

(2013/06/06 20:33), Gleb Natapov wrote:
> On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
>> On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
>>> Add a tracepoint write_tsc_offset for tracing TSC offset change.
>>> We want to merge ftrace's trace data of guest OSs and the host OS using
>>> TSC for timestamp in chronological order. We need "TSC offset" values for
>>> each guest when merge those because the TSC value on a guest is always the
>>> host TSC plus guest's TSC offset. If we get the TSC offset values, we can
>>> calculate the host TSC value for each guest events from the TSC offset and
>>> the event TSC value. The host TSC values of the guest events are used when we
>>> want to merge trace data of guests and the host in chronological order.
>>> (Note: the trace_clock of both the host and the guest must be set x86-tsc in
>>> this case)
>>>
>>> TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
>>> vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
>>> The latter function is executed when kvm clock is updated. Only host can read
>>> TSC offset value from VMCS, so a host needs to output TSC offset value
>>> when TSC offset is changed.
>>>
>>> Since the TSC offset is not often changed, it could be overwritten by other
>>> frequent events while tracing. To avoid that, I recommend to use a special
>>> instance for getting this event:
>>>
>>> 1. set a instance before booting a guest
>>> # cd /sys/kernel/debug/tracing/instances
>>> # mkdir tsc_offset
>>> # cd tsc_offset
>>> # echo x86-tsc > trace_clock
>>> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
>>>
>>> 2. boot a guest
>>>
>>> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
>>> Cc: Marcelo Tosatti <[email protected]>
>>> Cc: Gleb Natapov <[email protected]>
>>> Cc: Thomas Gleixner <[email protected]>
>>> Cc: Ingo Molnar <[email protected]>
>>> Cc: "H. Peter Anvin" <[email protected]>
>>> ---
>>> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
>>> arch/x86/kvm/vmx.c | 3 +++
>>> arch/x86/kvm/x86.c | 1 +
>>> 3 files changed, 22 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
>>> index fe5e00e..9c22e39 100644
>>> --- a/arch/x86/kvm/trace.h
>>> +++ b/arch/x86/kvm/trace.h
>>> @@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
>>> __print_symbolic(__entry->host_clock, host_clocks))
>>> );
>>>
>>> +TRACE_EVENT(kvm_write_tsc_offset,
>>> + TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
>>> + TP_ARGS(previous_tsc_offset, next_tsc_offset),
>>> +
>>> + TP_STRUCT__entry(
>>> + __field( __u64, previous_tsc_offset )
>>> + __field( __u64, next_tsc_offset )
>>> + ),
>>> +
>>> + TP_fast_assign(
>>> + __entry->previous_tsc_offset = previous_tsc_offset;
>>> + __entry->next_tsc_offset = next_tsc_offset;
>>> + ),
>>> +
>>> + TP_printk("previous=%llu next=%llu",
>>> + __entry->previous_tsc_offset, __entry->next_tsc_offset)
>>> +);
>>> +
>>
>> Yoshihiro YUNOMAE,
>>
>> 1) Why is previous_tsc_offset necessary?

I was considering the situations where we did not enable
kvm_write_tsc_offset event before booting a guest or where we did not
use multiple buffers. Here, we will need another new I/F to get current
TSC offset of a given VCPU. For example, if kvm_write_tsc_offset is not
included in the host's trace data, we get the current TSC offset from
the new I/F and apply it to all guest events. On the other hand, if
kvm_write_tsc_offset event appears more than once, we apply the
previous offset to guest events before the first TSC offset change.

Since we support only for using multiple buffers now, we don't need to
record previous TSC offset at this time. But I'm conscious that we have
to change the format of kvm_write_tsc_offset event when we support
those situations.

>> 2) The TSC offset traces should include vcpu number, so that its
>> possible to correlate traces of SMP guests (the tool should use
>> the individual vcpu tsc offsets when converting guests trace).
>>
> Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.

As Gleb mentioned, a tool can understand TSC offset for each vcpu from
PID and vcpu number of kvm_entry. IMO, that is indirect way, so I would
be better off including vcpu number.

>> 3) Please add traces for svm.c.

Sure, I'll add the tracepoint for SVM.

Thanks,

Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-06-07 20:42:19

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Thu, Jun 06, 2013 at 02:33:06PM +0300, Gleb Natapov wrote:
> On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
> > On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
> > > Add a tracepoint write_tsc_offset for tracing TSC offset change.
> > > We want to merge ftrace's trace data of guest OSs and the host OS using
> > > TSC for timestamp in chronological order. We need "TSC offset" values for
> > > each guest when merge those because the TSC value on a guest is always the
> > > host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> > > calculate the host TSC value for each guest events from the TSC offset and
> > > the event TSC value. The host TSC values of the guest events are used when we
> > > want to merge trace data of guests and the host in chronological order.
> > > (Note: the trace_clock of both the host and the guest must be set x86-tsc in
> > > this case)
> > >
> > > TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> > > vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> > > The latter function is executed when kvm clock is updated. Only host can read
> > > TSC offset value from VMCS, so a host needs to output TSC offset value
> > > when TSC offset is changed.
> > >
> > > Since the TSC offset is not often changed, it could be overwritten by other
> > > frequent events while tracing. To avoid that, I recommend to use a special
> > > instance for getting this event:
> > >
> > > 1. set a instance before booting a guest
> > > # cd /sys/kernel/debug/tracing/instances
> > > # mkdir tsc_offset
> > > # cd tsc_offset
> > > # echo x86-tsc > trace_clock
> > > # echo 1 > events/kvm/kvm_write_tsc_offset/enable
> > >
> > > 2. boot a guest
> > >
> > > Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> > > Cc: Marcelo Tosatti <[email protected]>
> > > Cc: Gleb Natapov <[email protected]>
> > > Cc: Thomas Gleixner <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: "H. Peter Anvin" <[email protected]>
> > > ---
> > > arch/x86/kvm/trace.h | 18 ++++++++++++++++++
> > > arch/x86/kvm/vmx.c | 3 +++
> > > arch/x86/kvm/x86.c | 1 +
> > > 3 files changed, 22 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > > index fe5e00e..9c22e39 100644
> > > --- a/arch/x86/kvm/trace.h
> > > +++ b/arch/x86/kvm/trace.h
> > > @@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
> > > __print_symbolic(__entry->host_clock, host_clocks))
> > > );
> > >
> > > +TRACE_EVENT(kvm_write_tsc_offset,
> > > + TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
> > > + TP_ARGS(previous_tsc_offset, next_tsc_offset),
> > > +
> > > + TP_STRUCT__entry(
> > > + __field( __u64, previous_tsc_offset )
> > > + __field( __u64, next_tsc_offset )
> > > + ),
> > > +
> > > + TP_fast_assign(
> > > + __entry->previous_tsc_offset = previous_tsc_offset;
> > > + __entry->next_tsc_offset = next_tsc_offset;
> > > + ),
> > > +
> > > + TP_printk("previous=%llu next=%llu",
> > > + __entry->previous_tsc_offset, __entry->next_tsc_offset)
> > > +);
> > > +
> >
> > Yoshihiro YUNOMAE,
> >
> > 1) Why is previous_tsc_offset necessary?
> >
> > 2) The TSC offset traces should include vcpu number, so that its
> > possible to correlate traces of SMP guests (the tool should use
> > the individual vcpu tsc offsets when converting guests trace).
> >
> Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.

Guest trace contains CPU ID.

If PID is exported it is necessary to perform an additional PID->VCPU
translation, which might not be available by the time the trace is
correlated by the tool (because guest is down).

So, PID is equivalent to VCPU as long as translation can be performed.

What is the preference?


2013-06-07 21:56:16

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Fri, Jun 07, 2013 at 02:22:22PM +0900, Yoshihiro YUNOMAE wrote:
> (2013/06/06 20:33), Gleb Natapov wrote:
> >On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
> >>On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
> >>>Add a tracepoint write_tsc_offset for tracing TSC offset change.
> >>>We want to merge ftrace's trace data of guest OSs and the host OS using
> >>>TSC for timestamp in chronological order. We need "TSC offset" values for
> >>>each guest when merge those because the TSC value on a guest is always the
> >>>host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> >>>calculate the host TSC value for each guest events from the TSC offset and
> >>>the event TSC value. The host TSC values of the guest events are used when we
> >>>want to merge trace data of guests and the host in chronological order.
> >>>(Note: the trace_clock of both the host and the guest must be set x86-tsc in
> >>>this case)
> >>>
> >>>TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> >>>vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> >>>The latter function is executed when kvm clock is updated. Only host can read
> >>>TSC offset value from VMCS, so a host needs to output TSC offset value
> >>>when TSC offset is changed.
> >>>
> >>>Since the TSC offset is not often changed, it could be overwritten by other
> >>>frequent events while tracing. To avoid that, I recommend to use a special
> >>>instance for getting this event:
> >>>
> >>>1. set a instance before booting a guest
> >>> # cd /sys/kernel/debug/tracing/instances
> >>> # mkdir tsc_offset
> >>> # cd tsc_offset
> >>> # echo x86-tsc > trace_clock
> >>> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
> >>>
> >>>2. boot a guest
> >>>
> >>>Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> >>>Cc: Marcelo Tosatti <[email protected]>
> >>>Cc: Gleb Natapov <[email protected]>
> >>>Cc: Thomas Gleixner <[email protected]>
> >>>Cc: Ingo Molnar <[email protected]>
> >>>Cc: "H. Peter Anvin" <[email protected]>
> >>>---
> >>> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
> >>> arch/x86/kvm/vmx.c | 3 +++
> >>> arch/x86/kvm/x86.c | 1 +
> >>> 3 files changed, 22 insertions(+)
> >>>
> >>>diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> >>>index fe5e00e..9c22e39 100644
> >>>--- a/arch/x86/kvm/trace.h
> >>>+++ b/arch/x86/kvm/trace.h
> >>>@@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
> >>> __print_symbolic(__entry->host_clock, host_clocks))
> >>> );
> >>>
> >>>+TRACE_EVENT(kvm_write_tsc_offset,
> >>>+ TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
> >>>+ TP_ARGS(previous_tsc_offset, next_tsc_offset),
> >>>+
> >>>+ TP_STRUCT__entry(
> >>>+ __field( __u64, previous_tsc_offset )
> >>>+ __field( __u64, next_tsc_offset )
> >>>+ ),
> >>>+
> >>>+ TP_fast_assign(
> >>>+ __entry->previous_tsc_offset = previous_tsc_offset;
> >>>+ __entry->next_tsc_offset = next_tsc_offset;
> >>>+ ),
> >>>+
> >>>+ TP_printk("previous=%llu next=%llu",
> >>>+ __entry->previous_tsc_offset, __entry->next_tsc_offset)
> >>>+);
> >>>+
> >>
> >>Yoshihiro YUNOMAE,
> >>
> >>1) Why is previous_tsc_offset necessary?
>
> I was considering the situations where we did not enable
> kvm_write_tsc_offset event before booting a guest or where we did not
> use multiple buffers. Here, we will need another new I/F to get current
> TSC offset of a given VCPU. For example, if kvm_write_tsc_offset is not
> included in the host's trace data, we get the current TSC offset from
> the new I/F and apply it to all guest events. On the other hand, if
> kvm_write_tsc_offset event appears more than once, we apply the
> previous offset to guest events before the first TSC offset change.

OK.

> Since we support only for using multiple buffers now, we don't need to
> record previous TSC offset at this time. But I'm conscious that we have
> to change the format of kvm_write_tsc_offset event when we support
> those situations.

OK, feel free to keep prev_tsc_offset.

> >>2) The TSC offset traces should include vcpu number, so that its
> >>possible to correlate traces of SMP guests (the tool should use
> >>the individual vcpu tsc offsets when converting guests trace).
> >>
> >Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.
>
> As Gleb mentioned, a tool can understand TSC offset for each vcpu from
> PID and vcpu number of kvm_entry. IMO, that is indirect way, so I would
> be better off including vcpu number.

Yes.

> >>3) Please add traces for svm.c.
>
> Sure, I'll add the tracepoint for SVM.

OK, no further comments for now.

2013-06-09 11:15:04

by Gleb Natapov

[permalink] [raw]
Subject: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Fri, Jun 07, 2013 at 02:22:22PM +0900, Yoshihiro YUNOMAE wrote:
> (2013/06/06 20:33), Gleb Natapov wrote:
> >On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
> >>On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
> >>>Add a tracepoint write_tsc_offset for tracing TSC offset change.
> >>>We want to merge ftrace's trace data of guest OSs and the host OS using
> >>>TSC for timestamp in chronological order. We need "TSC offset" values for
> >>>each guest when merge those because the TSC value on a guest is always the
> >>>host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> >>>calculate the host TSC value for each guest events from the TSC offset and
> >>>the event TSC value. The host TSC values of the guest events are used when we
> >>>want to merge trace data of guests and the host in chronological order.
> >>>(Note: the trace_clock of both the host and the guest must be set x86-tsc in
> >>>this case)
> >>>
> >>>TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> >>>vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> >>>The latter function is executed when kvm clock is updated. Only host can read
> >>>TSC offset value from VMCS, so a host needs to output TSC offset value
> >>>when TSC offset is changed.
> >>>
> >>>Since the TSC offset is not often changed, it could be overwritten by other
> >>>frequent events while tracing. To avoid that, I recommend to use a special
> >>>instance for getting this event:
> >>>
> >>>1. set a instance before booting a guest
> >>> # cd /sys/kernel/debug/tracing/instances
> >>> # mkdir tsc_offset
> >>> # cd tsc_offset
> >>> # echo x86-tsc > trace_clock
> >>> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
> >>>
> >>>2. boot a guest
> >>>
> >>>Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> >>>Cc: Marcelo Tosatti <[email protected]>
> >>>Cc: Gleb Natapov <[email protected]>
> >>>Cc: Thomas Gleixner <[email protected]>
> >>>Cc: Ingo Molnar <[email protected]>
> >>>Cc: "H. Peter Anvin" <[email protected]>
> >>>---
> >>> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
> >>> arch/x86/kvm/vmx.c | 3 +++
> >>> arch/x86/kvm/x86.c | 1 +
> >>> 3 files changed, 22 insertions(+)
> >>>
> >>>diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> >>>index fe5e00e..9c22e39 100644
> >>>--- a/arch/x86/kvm/trace.h
> >>>+++ b/arch/x86/kvm/trace.h
> >>>@@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
> >>> __print_symbolic(__entry->host_clock, host_clocks))
> >>> );
> >>>
> >>>+TRACE_EVENT(kvm_write_tsc_offset,
> >>>+ TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
> >>>+ TP_ARGS(previous_tsc_offset, next_tsc_offset),
> >>>+
> >>>+ TP_STRUCT__entry(
> >>>+ __field( __u64, previous_tsc_offset )
> >>>+ __field( __u64, next_tsc_offset )
> >>>+ ),
> >>>+
> >>>+ TP_fast_assign(
> >>>+ __entry->previous_tsc_offset = previous_tsc_offset;
> >>>+ __entry->next_tsc_offset = next_tsc_offset;
> >>>+ ),
> >>>+
> >>>+ TP_printk("previous=%llu next=%llu",
> >>>+ __entry->previous_tsc_offset, __entry->next_tsc_offset)
> >>>+);
> >>>+
> >>
> >>Yoshihiro YUNOMAE,
> >>
> >>1) Why is previous_tsc_offset necessary?
>
> I was considering the situations where we did not enable
> kvm_write_tsc_offset event before booting a guest or where we did not
> use multiple buffers. Here, we will need another new I/F to get current
> TSC offset of a given VCPU. For example, if kvm_write_tsc_offset is not
> included in the host's trace data, we get the current TSC offset from
> the new I/F and apply it to all guest events. On the other hand, if
> kvm_write_tsc_offset event appears more than once, we apply the
> previous offset to guest events before the first TSC offset change.
>
> Since we support only for using multiple buffers now, we don't need to
> record previous TSC offset at this time. But I'm conscious that we have
> to change the format of kvm_write_tsc_offset event when we support
> those situations.
>
> >>2) The TSC offset traces should include vcpu number, so that its
> >>possible to correlate traces of SMP guests (the tool should use
> >>the individual vcpu tsc offsets when converting guests trace).
> >>
> >Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.
>
> As Gleb mentioned, a tool can understand TSC offset for each vcpu from
> PID and vcpu number of kvm_entry. IMO, that is indirect way, so I would
> be better off including vcpu number.
>
But doesn't the tool operates on vcpu's PID for all other events. I mean to
figure out what vcpu an event belongs too during merge. Why tsc offset
event is different?

> >>3) Please add traces for svm.c.
>
> Sure, I'll add the tracepoint for SVM.
>
> Thanks,
>
> Yoshihiro YUNOMAE
>
> --
> Yoshihiro YUNOMAE
> Software Platform Research Dept. Linux Technology Center
> Hitachi, Ltd., Yokohama Research Laboratory
> E-mail: [email protected]
>

--
Gleb.

2013-06-10 09:30:48

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

Hi Gleb,

(2013/06/09 20:14), Gleb Natapov wrote:
> On Fri, Jun 07, 2013 at 02:22:22PM +0900, Yoshihiro YUNOMAE wrote:
>> (2013/06/06 20:33), Gleb Natapov wrote:
>>> On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
>>>> On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
>>>>> Add a tracepoint write_tsc_offset for tracing TSC offset change.
>>>>> We want to merge ftrace's trace data of guest OSs and the host OS using
>>>>> TSC for timestamp in chronological order. We need "TSC offset" values for
>>>>> each guest when merge those because the TSC value on a guest is always the
>>>>> host TSC plus guest's TSC offset. If we get the TSC offset values, we can
>>>>> calculate the host TSC value for each guest events from the TSC offset and
>>>>> the event TSC value. The host TSC values of the guest events are used when we
>>>>> want to merge trace data of guests and the host in chronological order.
>>>>> (Note: the trace_clock of both the host and the guest must be set x86-tsc in
>>>>> this case)
>>>>>
>>>>> TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
>>>>> vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
>>>>> The latter function is executed when kvm clock is updated. Only host can read
>>>>> TSC offset value from VMCS, so a host needs to output TSC offset value
>>>>> when TSC offset is changed.
>>>>>
>>>>> Since the TSC offset is not often changed, it could be overwritten by other
>>>>> frequent events while tracing. To avoid that, I recommend to use a special
>>>>> instance for getting this event:
>>>>>
>>>>> 1. set a instance before booting a guest
>>>>> # cd /sys/kernel/debug/tracing/instances
>>>>> # mkdir tsc_offset
>>>>> # cd tsc_offset
>>>>> # echo x86-tsc > trace_clock
>>>>> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
>>>>>
>>>>> 2. boot a guest
>>>>>
>>>>> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
>>>>> Cc: Marcelo Tosatti <[email protected]>
>>>>> Cc: Gleb Natapov <[email protected]>
>>>>> Cc: Thomas Gleixner <[email protected]>
>>>>> Cc: Ingo Molnar <[email protected]>
>>>>> Cc: "H. Peter Anvin" <[email protected]>
>>>>> ---
>>>>> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
>>>>> arch/x86/kvm/vmx.c | 3 +++
>>>>> arch/x86/kvm/x86.c | 1 +
>>>>> 3 files changed, 22 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
>>>>> index fe5e00e..9c22e39 100644
>>>>> --- a/arch/x86/kvm/trace.h
>>>>> +++ b/arch/x86/kvm/trace.h
>>>>> @@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
>>>>> __print_symbolic(__entry->host_clock, host_clocks))
>>>>> );
>>>>>
>>>>> +TRACE_EVENT(kvm_write_tsc_offset,
>>>>> + TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
>>>>> + TP_ARGS(previous_tsc_offset, next_tsc_offset),
>>>>> +
>>>>> + TP_STRUCT__entry(
>>>>> + __field( __u64, previous_tsc_offset )
>>>>> + __field( __u64, next_tsc_offset )
>>>>> + ),
>>>>> +
>>>>> + TP_fast_assign(
>>>>> + __entry->previous_tsc_offset = previous_tsc_offset;
>>>>> + __entry->next_tsc_offset = next_tsc_offset;
>>>>> + ),
>>>>> +
>>>>> + TP_printk("previous=%llu next=%llu",
>>>>> + __entry->previous_tsc_offset, __entry->next_tsc_offset)
>>>>> +);
>>>>> +
>>>>
>>>> Yoshihiro YUNOMAE,
>>>>
>>>> 1) Why is previous_tsc_offset necessary?
>>
>> I was considering the situations where we did not enable
>> kvm_write_tsc_offset event before booting a guest or where we did not
>> use multiple buffers. Here, we will need another new I/F to get current
>> TSC offset of a given VCPU. For example, if kvm_write_tsc_offset is not
>> included in the host's trace data, we get the current TSC offset from
>> the new I/F and apply it to all guest events. On the other hand, if
>> kvm_write_tsc_offset event appears more than once, we apply the
>> previous offset to guest events before the first TSC offset change.
>>
>> Since we support only for using multiple buffers now, we don't need to
>> record previous TSC offset at this time. But I'm conscious that we have
>> to change the format of kvm_write_tsc_offset event when we support
>> those situations.
>>
>>>> 2) The TSC offset traces should include vcpu number, so that its
>>>> possible to correlate traces of SMP guests (the tool should use
>>>> the individual vcpu tsc offsets when converting guests trace).
>>>>
>>> Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.
>>
>> As Gleb mentioned, a tool can understand TSC offset for each vcpu from
>> PID and vcpu number of kvm_entry. IMO, that is indirect way, so I would
>> be better off including vcpu number.
>>
> But doesn't the tool operates on vcpu's PID for all other events. I mean to
> figure out what vcpu an event belongs too during merge. Why tsc offset
> event is different?

In vcpu_load()@virt/kvm/kvm_main.c, it seems that PID of the vcpu thread
can be changed. Are you familiar with this situation?
If the situation can be occurred, outputting vcpu number is better, I
think. If not occurred, as you say, we will be able to merge those data
without vcpu number in write_tsc_offset event. However, when we
focus on output data of the write_tsc_offset event, it is difficult to
directly understand contents of the data if vcpu number information is
not included. So, including the information is useful, I think.

Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-06-10 10:05:20

by Gleb Natapov

[permalink] [raw]
Subject: Re: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Mon, Jun 10, 2013 at 06:30:42PM +0900, Yoshihiro YUNOMAE wrote:
> Hi Gleb,
>
> (2013/06/09 20:14), Gleb Natapov wrote:
> >On Fri, Jun 07, 2013 at 02:22:22PM +0900, Yoshihiro YUNOMAE wrote:
> >>(2013/06/06 20:33), Gleb Natapov wrote:
> >>>On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
> >>>>On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
> >>>>>Add a tracepoint write_tsc_offset for tracing TSC offset change.
> >>>>>We want to merge ftrace's trace data of guest OSs and the host OS using
> >>>>>TSC for timestamp in chronological order. We need "TSC offset" values for
> >>>>>each guest when merge those because the TSC value on a guest is always the
> >>>>>host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> >>>>>calculate the host TSC value for each guest events from the TSC offset and
> >>>>>the event TSC value. The host TSC values of the guest events are used when we
> >>>>>want to merge trace data of guests and the host in chronological order.
> >>>>>(Note: the trace_clock of both the host and the guest must be set x86-tsc in
> >>>>>this case)
> >>>>>
> >>>>>TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> >>>>>vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> >>>>>The latter function is executed when kvm clock is updated. Only host can read
> >>>>>TSC offset value from VMCS, so a host needs to output TSC offset value
> >>>>>when TSC offset is changed.
> >>>>>
> >>>>>Since the TSC offset is not often changed, it could be overwritten by other
> >>>>>frequent events while tracing. To avoid that, I recommend to use a special
> >>>>>instance for getting this event:
> >>>>>
> >>>>>1. set a instance before booting a guest
> >>>>> # cd /sys/kernel/debug/tracing/instances
> >>>>> # mkdir tsc_offset
> >>>>> # cd tsc_offset
> >>>>> # echo x86-tsc > trace_clock
> >>>>> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
> >>>>>
> >>>>>2. boot a guest
> >>>>>
> >>>>>Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> >>>>>Cc: Marcelo Tosatti <[email protected]>
> >>>>>Cc: Gleb Natapov <[email protected]>
> >>>>>Cc: Thomas Gleixner <[email protected]>
> >>>>>Cc: Ingo Molnar <[email protected]>
> >>>>>Cc: "H. Peter Anvin" <[email protected]>
> >>>>>---
> >>>>> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
> >>>>> arch/x86/kvm/vmx.c | 3 +++
> >>>>> arch/x86/kvm/x86.c | 1 +
> >>>>> 3 files changed, 22 insertions(+)
> >>>>>
> >>>>>diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> >>>>>index fe5e00e..9c22e39 100644
> >>>>>--- a/arch/x86/kvm/trace.h
> >>>>>+++ b/arch/x86/kvm/trace.h
> >>>>>@@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
> >>>>> __print_symbolic(__entry->host_clock, host_clocks))
> >>>>> );
> >>>>>
> >>>>>+TRACE_EVENT(kvm_write_tsc_offset,
> >>>>>+ TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
> >>>>>+ TP_ARGS(previous_tsc_offset, next_tsc_offset),
> >>>>>+
> >>>>>+ TP_STRUCT__entry(
> >>>>>+ __field( __u64, previous_tsc_offset )
> >>>>>+ __field( __u64, next_tsc_offset )
> >>>>>+ ),
> >>>>>+
> >>>>>+ TP_fast_assign(
> >>>>>+ __entry->previous_tsc_offset = previous_tsc_offset;
> >>>>>+ __entry->next_tsc_offset = next_tsc_offset;
> >>>>>+ ),
> >>>>>+
> >>>>>+ TP_printk("previous=%llu next=%llu",
> >>>>>+ __entry->previous_tsc_offset, __entry->next_tsc_offset)
> >>>>>+);
> >>>>>+
> >>>>
> >>>>Yoshihiro YUNOMAE,
> >>>>
> >>>>1) Why is previous_tsc_offset necessary?
> >>
> >>I was considering the situations where we did not enable
> >>kvm_write_tsc_offset event before booting a guest or where we did not
> >>use multiple buffers. Here, we will need another new I/F to get current
> >>TSC offset of a given VCPU. For example, if kvm_write_tsc_offset is not
> >>included in the host's trace data, we get the current TSC offset from
> >>the new I/F and apply it to all guest events. On the other hand, if
> >>kvm_write_tsc_offset event appears more than once, we apply the
> >>previous offset to guest events before the first TSC offset change.
> >>
> >>Since we support only for using multiple buffers now, we don't need to
> >>record previous TSC offset at this time. But I'm conscious that we have
> >>to change the format of kvm_write_tsc_offset event when we support
> >>those situations.
> >>
> >>>>2) The TSC offset traces should include vcpu number, so that its
> >>>>possible to correlate traces of SMP guests (the tool should use
> >>>>the individual vcpu tsc offsets when converting guests trace).
> >>>>
> >>>Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.
> >>
> >>As Gleb mentioned, a tool can understand TSC offset for each vcpu from
> >>PID and vcpu number of kvm_entry. IMO, that is indirect way, so I would
> >>be better off including vcpu number.
> >>
> >But doesn't the tool operates on vcpu's PID for all other events. I mean to
> >figure out what vcpu an event belongs too during merge. Why tsc offset
> >event is different?
>
> In vcpu_load()@virt/kvm/kvm_main.c, it seems that PID of the vcpu thread
> can be changed. Are you familiar with this situation?
Recommended way of using KVM API is to have dedicated thread per vcpu
and this is how all known userspace implementations use it, but having
one thread drive several vcpus (not simultaneously obviously) also
works, but not recommended.

> If the situation can be occurred, outputting vcpu number is better, I
> think. If not occurred, as you say, we will be able to merge those data
> without vcpu number in write_tsc_offset event.
The thing is that all other traces that you want to merge do not contain
vcpu number, only pid, so if the situation occurs how do you merge the
data?

> However, when we
> focus on output data of the write_tsc_offset event, it is difficult to
> directly understand contents of the data if vcpu number information is
> not included. So, including the information is useful, I think.
>
How your tool does it now?

--
Gleb.

2013-06-10 11:37:13

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

Hi Gleb,

(2013/06/10 19:05), Gleb Natapov wrote:
> On Mon, Jun 10, 2013 at 06:30:42PM +0900, Yoshihiro YUNOMAE wrote:
>> Hi Gleb,
>>
>> (2013/06/09 20:14), Gleb Natapov wrote:
>>> On Fri, Jun 07, 2013 at 02:22:22PM +0900, Yoshihiro YUNOMAE wrote:
>>>> (2013/06/06 20:33), Gleb Natapov wrote:
>>>>> On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
>>>>>> On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
>>>>>>> Add a tracepoint write_tsc_offset for tracing TSC offset change.
>>>>>>> We want to merge ftrace's trace data of guest OSs and the host OS using
>>>>>>> TSC for timestamp in chronological order. We need "TSC offset" values for
>>>>>>> each guest when merge those because the TSC value on a guest is always the
>>>>>>> host TSC plus guest's TSC offset. If we get the TSC offset values, we can
>>>>>>> calculate the host TSC value for each guest events from the TSC offset and
>>>>>>> the event TSC value. The host TSC values of the guest events are used when we
>>>>>>> want to merge trace data of guests and the host in chronological order.
>>>>>>> (Note: the trace_clock of both the host and the guest must be set x86-tsc in
>>>>>>> this case)
>>>>>>>
>>>>>>> TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
>>>>>>> vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
>>>>>>> The latter function is executed when kvm clock is updated. Only host can read
>>>>>>> TSC offset value from VMCS, so a host needs to output TSC offset value
>>>>>>> when TSC offset is changed.
>>>>>>>
>>>>>>> Since the TSC offset is not often changed, it could be overwritten by other
>>>>>>> frequent events while tracing. To avoid that, I recommend to use a special
>>>>>>> instance for getting this event:
>>>>>>>
>>>>>>> 1. set a instance before booting a guest
>>>>>>> # cd /sys/kernel/debug/tracing/instances
>>>>>>> # mkdir tsc_offset
>>>>>>> # cd tsc_offset
>>>>>>> # echo x86-tsc > trace_clock
>>>>>>> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
>>>>>>>
>>>>>>> 2. boot a guest
>>>>>>>
>>>>>>> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
>>>>>>> Cc: Marcelo Tosatti <[email protected]>
>>>>>>> Cc: Gleb Natapov <[email protected]>
>>>>>>> Cc: Thomas Gleixner <[email protected]>
>>>>>>> Cc: Ingo Molnar <[email protected]>
>>>>>>> Cc: "H. Peter Anvin" <[email protected]>
>>>>>>> ---
>>>>>>> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
>>>>>>> arch/x86/kvm/vmx.c | 3 +++
>>>>>>> arch/x86/kvm/x86.c | 1 +
>>>>>>> 3 files changed, 22 insertions(+)
>>>>>>>
>>>>>>> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
>>>>>>> index fe5e00e..9c22e39 100644
>>>>>>> --- a/arch/x86/kvm/trace.h
>>>>>>> +++ b/arch/x86/kvm/trace.h
>>>>>>> @@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
>>>>>>> __print_symbolic(__entry->host_clock, host_clocks))
>>>>>>> );
>>>>>>>
>>>>>>> +TRACE_EVENT(kvm_write_tsc_offset,
>>>>>>> + TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
>>>>>>> + TP_ARGS(previous_tsc_offset, next_tsc_offset),
>>>>>>> +
>>>>>>> + TP_STRUCT__entry(
>>>>>>> + __field( __u64, previous_tsc_offset )
>>>>>>> + __field( __u64, next_tsc_offset )
>>>>>>> + ),
>>>>>>> +
>>>>>>> + TP_fast_assign(
>>>>>>> + __entry->previous_tsc_offset = previous_tsc_offset;
>>>>>>> + __entry->next_tsc_offset = next_tsc_offset;
>>>>>>> + ),
>>>>>>> +
>>>>>>> + TP_printk("previous=%llu next=%llu",
>>>>>>> + __entry->previous_tsc_offset, __entry->next_tsc_offset)
>>>>>>> +);
>>>>>>> +
>>>>>>
>>>>>> Yoshihiro YUNOMAE,
>>>>>>
>>>>>> 1) Why is previous_tsc_offset necessary?
>>>>
>>>> I was considering the situations where we did not enable
>>>> kvm_write_tsc_offset event before booting a guest or where we did not
>>>> use multiple buffers. Here, we will need another new I/F to get current
>>>> TSC offset of a given VCPU. For example, if kvm_write_tsc_offset is not
>>>> included in the host's trace data, we get the current TSC offset from
>>>> the new I/F and apply it to all guest events. On the other hand, if
>>>> kvm_write_tsc_offset event appears more than once, we apply the
>>>> previous offset to guest events before the first TSC offset change.
>>>>
>>>> Since we support only for using multiple buffers now, we don't need to
>>>> record previous TSC offset at this time. But I'm conscious that we have
>>>> to change the format of kvm_write_tsc_offset event when we support
>>>> those situations.
>>>>
>>>>>> 2) The TSC offset traces should include vcpu number, so that its
>>>>>> possible to correlate traces of SMP guests (the tool should use
>>>>>> the individual vcpu tsc offsets when converting guests trace).
>>>>>>
>>>>> Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.
>>>>
>>>> As Gleb mentioned, a tool can understand TSC offset for each vcpu from
>>>> PID and vcpu number of kvm_entry. IMO, that is indirect way, so I would
>>>> be better off including vcpu number.
>>>>
>>> But doesn't the tool operates on vcpu's PID for all other events. I mean to
>>> figure out what vcpu an event belongs too during merge. Why tsc offset
>>> event is different?
>>
>> In vcpu_load()@virt/kvm/kvm_main.c, it seems that PID of the vcpu thread
>> can be changed. Are you familiar with this situation?
> Recommended way of using KVM API is to have dedicated thread per vcpu
> and this is how all known userspace implementations use it, but having
> one thread drive several vcpus (not simultaneously obviously) also
> works, but not recommended.

I understood. Thanks.

>> If the situation can be occurred, outputting vcpu number is better, I
>> think. If not occurred, as you say, we will be able to merge those data
>> without vcpu number in write_tsc_offset event.
> The thing is that all other traces that you want to merge do not contain
> vcpu number, only pid, so if the situation occurs how do you merge the
> data?

As you say, all kvm events cannot contain vcpu number.
However, all trace data of guests include VCPU number, so we can
understand how guests and the host operate by using both those values
and kvm_entry events.

>> However, when we
>> focus on output data of the write_tsc_offset event, it is difficult to
>> directly understand contents of the data if vcpu number information is
>> not included. So, including the information is useful, I think.
>>
> How your tool does it now?

My sample tool supports only for one vcpu, and the tool does not
automatically get TSC offset values yet. If the event includes
vcpu number, it is easy to manage TSC offset for each vcpu.
For example, if the data are "kvm_write_tsc_offset: vcpu=0 prev=100
next=12345", a tool will use next value for all vcpu0 data of the guest.

Thanks,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-06-10 14:05:01

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Mon, Jun 10, 2013 at 01:05:05PM +0300, Gleb Natapov wrote:
> On Mon, Jun 10, 2013 at 06:30:42PM +0900, Yoshihiro YUNOMAE wrote:
> > Hi Gleb,
> >
> > (2013/06/09 20:14), Gleb Natapov wrote:
> > >On Fri, Jun 07, 2013 at 02:22:22PM +0900, Yoshihiro YUNOMAE wrote:
> > >>(2013/06/06 20:33), Gleb Natapov wrote:
> > >>>On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
> > >>>>On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
> > >>>>>Add a tracepoint write_tsc_offset for tracing TSC offset change.
> > >>>>>We want to merge ftrace's trace data of guest OSs and the host OS using
> > >>>>>TSC for timestamp in chronological order. We need "TSC offset" values for
> > >>>>>each guest when merge those because the TSC value on a guest is always the
> > >>>>>host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> > >>>>>calculate the host TSC value for each guest events from the TSC offset and
> > >>>>>the event TSC value. The host TSC values of the guest events are used when we
> > >>>>>want to merge trace data of guests and the host in chronological order.
> > >>>>>(Note: the trace_clock of both the host and the guest must be set x86-tsc in
> > >>>>>this case)
> > >>>>>
> > >>>>>TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> > >>>>>vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> > >>>>>The latter function is executed when kvm clock is updated. Only host can read
> > >>>>>TSC offset value from VMCS, so a host needs to output TSC offset value
> > >>>>>when TSC offset is changed.
> > >>>>>
> > >>>>>Since the TSC offset is not often changed, it could be overwritten by other
> > >>>>>frequent events while tracing. To avoid that, I recommend to use a special
> > >>>>>instance for getting this event:
> > >>>>>
> > >>>>>1. set a instance before booting a guest
> > >>>>> # cd /sys/kernel/debug/tracing/instances
> > >>>>> # mkdir tsc_offset
> > >>>>> # cd tsc_offset
> > >>>>> # echo x86-tsc > trace_clock
> > >>>>> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
> > >>>>>
> > >>>>>2. boot a guest
> > >>>>>
> > >>>>>Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> > >>>>>Cc: Marcelo Tosatti <[email protected]>
> > >>>>>Cc: Gleb Natapov <[email protected]>
> > >>>>>Cc: Thomas Gleixner <[email protected]>
> > >>>>>Cc: Ingo Molnar <[email protected]>
> > >>>>>Cc: "H. Peter Anvin" <[email protected]>
> > >>>>>---
> > >>>>> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
> > >>>>> arch/x86/kvm/vmx.c | 3 +++
> > >>>>> arch/x86/kvm/x86.c | 1 +
> > >>>>> 3 files changed, 22 insertions(+)
> > >>>>>
> > >>>>>diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > >>>>>index fe5e00e..9c22e39 100644
> > >>>>>--- a/arch/x86/kvm/trace.h
> > >>>>>+++ b/arch/x86/kvm/trace.h
> > >>>>>@@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
> > >>>>> __print_symbolic(__entry->host_clock, host_clocks))
> > >>>>> );
> > >>>>>
> > >>>>>+TRACE_EVENT(kvm_write_tsc_offset,
> > >>>>>+ TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
> > >>>>>+ TP_ARGS(previous_tsc_offset, next_tsc_offset),
> > >>>>>+
> > >>>>>+ TP_STRUCT__entry(
> > >>>>>+ __field( __u64, previous_tsc_offset )
> > >>>>>+ __field( __u64, next_tsc_offset )
> > >>>>>+ ),
> > >>>>>+
> > >>>>>+ TP_fast_assign(
> > >>>>>+ __entry->previous_tsc_offset = previous_tsc_offset;
> > >>>>>+ __entry->next_tsc_offset = next_tsc_offset;
> > >>>>>+ ),
> > >>>>>+
> > >>>>>+ TP_printk("previous=%llu next=%llu",
> > >>>>>+ __entry->previous_tsc_offset, __entry->next_tsc_offset)
> > >>>>>+);
> > >>>>>+
> > >>>>
> > >>>>Yoshihiro YUNOMAE,
> > >>>>
> > >>>>1) Why is previous_tsc_offset necessary?
> > >>
> > >>I was considering the situations where we did not enable
> > >>kvm_write_tsc_offset event before booting a guest or where we did not
> > >>use multiple buffers. Here, we will need another new I/F to get current
> > >>TSC offset of a given VCPU. For example, if kvm_write_tsc_offset is not
> > >>included in the host's trace data, we get the current TSC offset from
> > >>the new I/F and apply it to all guest events. On the other hand, if
> > >>kvm_write_tsc_offset event appears more than once, we apply the
> > >>previous offset to guest events before the first TSC offset change.
> > >>
> > >>Since we support only for using multiple buffers now, we don't need to
> > >>record previous TSC offset at this time. But I'm conscious that we have
> > >>to change the format of kvm_write_tsc_offset event when we support
> > >>those situations.
> > >>
> > >>>>2) The TSC offset traces should include vcpu number, so that its
> > >>>>possible to correlate traces of SMP guests (the tool should use
> > >>>>the individual vcpu tsc offsets when converting guests trace).
> > >>>>
> > >>>Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.
> > >>
> > >>As Gleb mentioned, a tool can understand TSC offset for each vcpu from
> > >>PID and vcpu number of kvm_entry. IMO, that is indirect way, so I would
> > >>be better off including vcpu number.
> > >>
> > >But doesn't the tool operates on vcpu's PID for all other events. I mean to
> > >figure out what vcpu an event belongs too during merge. Why tsc offset
> > >event is different?
> >
> > In vcpu_load()@virt/kvm/kvm_main.c, it seems that PID of the vcpu thread
> > can be changed. Are you familiar with this situation?
> Recommended way of using KVM API is to have dedicated thread per vcpu
> and this is how all known userspace implementations use it, but having
> one thread drive several vcpus (not simultaneously obviously) also
> works, but not recommended.
>
> > If the situation can be occurred, outputting vcpu number is better, I
> > think. If not occurred, as you say, we will be able to merge those data
> > without vcpu number in write_tsc_offset event.
> The thing is that all other traces that you want to merge do not contain
> vcpu number, only pid, so if the situation occurs how do you merge the
> data?

Guest traces contain vcpu number and not pid (because guest is unaware
of host PID).

> > However, when we
> > focus on output data of the write_tsc_offset event, it is difficult to
> > directly understand contents of the data if vcpu number information is
> > not included. So, including the information is useful, I think.
> >
> How your tool does it now?

It merges guest trace with host trace (by converting the TSC timestamp
in the guest trace to host TSC using tsc_offset information).

By not recording vcpu ID in the tsc_offset trace, it is necessary to
supply the tool with PID<->VCPU_id tuples for translation (so its an
additional step required, and it makes trace merge impossible
if the information is not available).

2013-06-10 16:38:51

by Gleb Natapov

[permalink] [raw]
Subject: Re: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Mon, Jun 10, 2013 at 11:04:24AM -0300, Marcelo Tosatti wrote:
> On Mon, Jun 10, 2013 at 01:05:05PM +0300, Gleb Natapov wrote:
> > On Mon, Jun 10, 2013 at 06:30:42PM +0900, Yoshihiro YUNOMAE wrote:
> > > Hi Gleb,
> > >
> > > (2013/06/09 20:14), Gleb Natapov wrote:
> > > >On Fri, Jun 07, 2013 at 02:22:22PM +0900, Yoshihiro YUNOMAE wrote:
> > > >>(2013/06/06 20:33), Gleb Natapov wrote:
> > > >>>On Wed, Jun 05, 2013 at 09:23:22PM -0300, Marcelo Tosatti wrote:
> > > >>>>On Tue, Jun 04, 2013 at 05:36:19PM +0900, Yoshihiro YUNOMAE wrote:
> > > >>>>>Add a tracepoint write_tsc_offset for tracing TSC offset change.
> > > >>>>>We want to merge ftrace's trace data of guest OSs and the host OS using
> > > >>>>>TSC for timestamp in chronological order. We need "TSC offset" values for
> > > >>>>>each guest when merge those because the TSC value on a guest is always the
> > > >>>>>host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> > > >>>>>calculate the host TSC value for each guest events from the TSC offset and
> > > >>>>>the event TSC value. The host TSC values of the guest events are used when we
> > > >>>>>want to merge trace data of guests and the host in chronological order.
> > > >>>>>(Note: the trace_clock of both the host and the guest must be set x86-tsc in
> > > >>>>>this case)
> > > >>>>>
> > > >>>>>TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> > > >>>>>vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> > > >>>>>The latter function is executed when kvm clock is updated. Only host can read
> > > >>>>>TSC offset value from VMCS, so a host needs to output TSC offset value
> > > >>>>>when TSC offset is changed.
> > > >>>>>
> > > >>>>>Since the TSC offset is not often changed, it could be overwritten by other
> > > >>>>>frequent events while tracing. To avoid that, I recommend to use a special
> > > >>>>>instance for getting this event:
> > > >>>>>
> > > >>>>>1. set a instance before booting a guest
> > > >>>>> # cd /sys/kernel/debug/tracing/instances
> > > >>>>> # mkdir tsc_offset
> > > >>>>> # cd tsc_offset
> > > >>>>> # echo x86-tsc > trace_clock
> > > >>>>> # echo 1 > events/kvm/kvm_write_tsc_offset/enable
> > > >>>>>
> > > >>>>>2. boot a guest
> > > >>>>>
> > > >>>>>Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> > > >>>>>Cc: Marcelo Tosatti <[email protected]>
> > > >>>>>Cc: Gleb Natapov <[email protected]>
> > > >>>>>Cc: Thomas Gleixner <[email protected]>
> > > >>>>>Cc: Ingo Molnar <[email protected]>
> > > >>>>>Cc: "H. Peter Anvin" <[email protected]>
> > > >>>>>---
> > > >>>>> arch/x86/kvm/trace.h | 18 ++++++++++++++++++
> > > >>>>> arch/x86/kvm/vmx.c | 3 +++
> > > >>>>> arch/x86/kvm/x86.c | 1 +
> > > >>>>> 3 files changed, 22 insertions(+)
> > > >>>>>
> > > >>>>>diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > > >>>>>index fe5e00e..9c22e39 100644
> > > >>>>>--- a/arch/x86/kvm/trace.h
> > > >>>>>+++ b/arch/x86/kvm/trace.h
> > > >>>>>@@ -815,6 +815,24 @@ TRACE_EVENT(kvm_track_tsc,
> > > >>>>> __print_symbolic(__entry->host_clock, host_clocks))
> > > >>>>> );
> > > >>>>>
> > > >>>>>+TRACE_EVENT(kvm_write_tsc_offset,
> > > >>>>>+ TP_PROTO(__u64 previous_tsc_offset, __u64 next_tsc_offset),
> > > >>>>>+ TP_ARGS(previous_tsc_offset, next_tsc_offset),
> > > >>>>>+
> > > >>>>>+ TP_STRUCT__entry(
> > > >>>>>+ __field( __u64, previous_tsc_offset )
> > > >>>>>+ __field( __u64, next_tsc_offset )
> > > >>>>>+ ),
> > > >>>>>+
> > > >>>>>+ TP_fast_assign(
> > > >>>>>+ __entry->previous_tsc_offset = previous_tsc_offset;
> > > >>>>>+ __entry->next_tsc_offset = next_tsc_offset;
> > > >>>>>+ ),
> > > >>>>>+
> > > >>>>>+ TP_printk("previous=%llu next=%llu",
> > > >>>>>+ __entry->previous_tsc_offset, __entry->next_tsc_offset)
> > > >>>>>+);
> > > >>>>>+
> > > >>>>
> > > >>>>Yoshihiro YUNOMAE,
> > > >>>>
> > > >>>>1) Why is previous_tsc_offset necessary?
> > > >>
> > > >>I was considering the situations where we did not enable
> > > >>kvm_write_tsc_offset event before booting a guest or where we did not
> > > >>use multiple buffers. Here, we will need another new I/F to get current
> > > >>TSC offset of a given VCPU. For example, if kvm_write_tsc_offset is not
> > > >>included in the host's trace data, we get the current TSC offset from
> > > >>the new I/F and apply it to all guest events. On the other hand, if
> > > >>kvm_write_tsc_offset event appears more than once, we apply the
> > > >>previous offset to guest events before the first TSC offset change.
> > > >>
> > > >>Since we support only for using multiple buffers now, we don't need to
> > > >>record previous TSC offset at this time. But I'm conscious that we have
> > > >>to change the format of kvm_write_tsc_offset event when we support
> > > >>those situations.
> > > >>
> > > >>>>2) The TSC offset traces should include vcpu number, so that its
> > > >>>>possible to correlate traces of SMP guests (the tool should use
> > > >>>>the individual vcpu tsc offsets when converting guests trace).
> > > >>>>
> > > >>>Why PID is not enough? No other trace, except kvm_entry, outputs vcpu id.
> > > >>
> > > >>As Gleb mentioned, a tool can understand TSC offset for each vcpu from
> > > >>PID and vcpu number of kvm_entry. IMO, that is indirect way, so I would
> > > >>be better off including vcpu number.
> > > >>
> > > >But doesn't the tool operates on vcpu's PID for all other events. I mean to
> > > >figure out what vcpu an event belongs too during merge. Why tsc offset
> > > >event is different?
> > >
> > > In vcpu_load()@virt/kvm/kvm_main.c, it seems that PID of the vcpu thread
> > > can be changed. Are you familiar with this situation?
> > Recommended way of using KVM API is to have dedicated thread per vcpu
> > and this is how all known userspace implementations use it, but having
> > one thread drive several vcpus (not simultaneously obviously) also
> > works, but not recommended.
> >
> > > If the situation can be occurred, outputting vcpu number is better, I
> > > think. If not occurred, as you say, we will be able to merge those data
> > > without vcpu number in write_tsc_offset event.
> > The thing is that all other traces that you want to merge do not contain
> > vcpu number, only pid, so if the situation occurs how do you merge the
> > data?
>
> Guest traces contain vcpu number and not pid (because guest is unaware
> of host PID).
>
No, guest trace is just a regular ftrace done inside a guest. It contains
guest's PIDs which is useless for host. I do not know how exactly guest
traces are transfered to a host, if each vcpu buffer is transfered
separately host can figure out what trace entry belong to which vcpu
based on what buffer the trace is in. But the information about what
buffer belongs to which vcpu id should be transfered to a host somehow
too.

> > > However, when we
> > > focus on output data of the write_tsc_offset event, it is difficult to
> > > directly understand contents of the data if vcpu number information is
> > > not included. So, including the information is useful, I think.
> > >
> > How your tool does it now?
>
> It merges guest trace with host trace (by converting the TSC timestamp
> in the guest trace to host TSC using tsc_offset information).
>
I mean how it does it now without vcpu id. The answer is that it works
for only one vcpu now.

> By not recording vcpu ID in the tsc_offset trace, it is necessary to
> supply the tool with PID<->VCPU_id tuples for translation (so its an
> additional step required, and it makes trace merge impossible
> if the information is not available).
The tool needs PID<->VCPU_id tuples to do the merging of any trace
entry. Without that it does not know how to interpret entry timestamps
(which offset to use). Apparently it will get this information from
vmentry trace point. What is so special about tsc_offset tracing that
it needs to contain vcpuid by itself.

--
Gleb.

2013-06-10 20:28:59

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Mon, Jun 10, 2013 at 07:38:34PM +0300, Gleb Natapov wrote:
> > Guest traces contain vcpu number and not pid (because guest is unaware
> > of host PID).
> >
> No, guest trace is just a regular ftrace done inside a guest. It contains
> guest's PIDs which is useless for host.

# tracer: nop
#
# entries-in-buffer/entries-written: 5333/5333 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |

Traces contain CPU ID.

> I do not know how exactly guest traces are transfered to a host, if
> each vcpu buffer is transfered separately host can figure out what
> trace entry belong to which vcpu based on what buffer the trace is in.
> But the information about what buffer belongs to which vcpu id should
> be transfered to a host somehow too.
>
> > > > However, when we
> > > > focus on output data of the write_tsc_offset event, it is difficult to
> > > > directly understand contents of the data if vcpu number information is
> > > > not included. So, including the information is useful, I think.
> > > >
> > > How your tool does it now?
> >
> > It merges guest trace with host trace (by converting the TSC timestamp
> > in the guest trace to host TSC using tsc_offset information).
> >
> I mean how it does it now without vcpu id. The answer is that it works
> for only one vcpu now.

Yes.

> > By not recording vcpu ID in the tsc_offset trace, it is necessary to
> > supply the tool with PID<->VCPU_id tuples for translation (so its an
> > additional step required, and it makes trace merge impossible
> > if the information is not available).
> The tool needs PID<->VCPU_id tuples to do the merging of any trace
> entry. Without that it does not know how to interpret entry timestamps
> (which offset to use). Apparently it will get this information from
> vmentry trace point. What is so special about tsc_offset tracing that
> it needs to contain vcpuid by itself.

If the tsc_offset tracepoint contains vcpu ID, its possible to lookup
guest trace entry (which contains CPU ID), and match on that.

Without that, PID<->VCPU_id tuples are necessary. Yes?

2013-06-11 06:50:50

by Gleb Natapov

[permalink] [raw]
Subject: Re: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

On Mon, Jun 10, 2013 at 05:28:23PM -0300, Marcelo Tosatti wrote:
> On Mon, Jun 10, 2013 at 07:38:34PM +0300, Gleb Natapov wrote:
> > > Guest traces contain vcpu number and not pid (because guest is unaware
> > > of host PID).
> > >
> > No, guest trace is just a regular ftrace done inside a guest. It contains
> > guest's PIDs which is useless for host.
>
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 5333/5333 #P:4
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
>
> Traces contain CPU ID.
>
Doh, yes it does, but this is not vcpu_id. vcpu_id is seen as apic id in
a guest, so additional step is needed to map between numbers that you see
in the trace and vcpu_id. This is easy to do by looking at /proc/cpuinfo
of a guest.

> > I do not know how exactly guest traces are transfered to a host, if
> > each vcpu buffer is transfered separately host can figure out what
> > trace entry belong to which vcpu based on what buffer the trace is in.
> > But the information about what buffer belongs to which vcpu id should
> > be transfered to a host somehow too.
> >
> > > > > However, when we
> > > > > focus on output data of the write_tsc_offset event, it is difficult to
> > > > > directly understand contents of the data if vcpu number information is
> > > > > not included. So, including the information is useful, I think.
> > > > >
> > > > How your tool does it now?
> > >
> > > It merges guest trace with host trace (by converting the TSC timestamp
> > > in the guest trace to host TSC using tsc_offset information).
> > >
> > I mean how it does it now without vcpu id. The answer is that it works
> > for only one vcpu now.
>
> Yes.
>
> > > By not recording vcpu ID in the tsc_offset trace, it is necessary to
> > > supply the tool with PID<->VCPU_id tuples for translation (so its an
> > > additional step required, and it makes trace merge impossible
> > > if the information is not available).
> > The tool needs PID<->VCPU_id tuples to do the merging of any trace
> > entry. Without that it does not know how to interpret entry timestamps
> > (which offset to use). Apparently it will get this information from
> > vmentry trace point. What is so special about tsc_offset tracing that
> > it needs to contain vcpuid by itself.
>
> If the tsc_offset tracepoint contains vcpu ID, its possible to lookup
> guest trace entry (which contains CPU ID), and match on that.
>
> Without that, PID<->VCPU_id tuples are necessary. Yes?
Ah, I think I see it now. For some reason I assumed that merge is done
for each vcpu separately, so you need to separate host events per vcpu
too, but this is not the case, host and guest event are merged based on
tsc timestamp only. In this case I see the merits of having vcpu id in
tsc_offset change trace point. It will make things easier a bit.

--
Gleb.

2013-06-11 09:26:43

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: Re: Re: [PATCH V2 1/1] kvm/vmx: Add a tracepoint write_tsc_offset

(2013/06/11 15:50), Gleb Natapov wrote:
> On Mon, Jun 10, 2013 at 05:28:23PM -0300, Marcelo Tosatti wrote:
>> On Mon, Jun 10, 2013 at 07:38:34PM +0300, Gleb Natapov wrote:
>>>> Guest traces contain vcpu number and not pid (because guest is unaware
>>>> of host PID).
>>>>
>>> No, guest trace is just a regular ftrace done inside a guest. It contains
>>> guest's PIDs which is useless for host.
>>
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 5333/5333 #P:4
>> #
>> # _-----=> irqs-off
>> # / _----=> need-resched
>> # | / _---=> hardirq/softirq
>> # || / _--=> preempt-depth
>> # ||| / delay
>> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
>> # | | | |||| | |
>>
>> Traces contain CPU ID.
>>
> Doh, yes it does, but this is not vcpu_id. vcpu_id is seen as apic id in
> a guest, so additional step is needed to map between numbers that you see
> in the trace and vcpu_id. This is easy to do by looking at /proc/cpuinfo
> of a guest.
>
>>> I do not know how exactly guest traces are transfered to a host, if
>>> each vcpu buffer is transfered separately host can figure out what
>>> trace entry belong to which vcpu based on what buffer the trace is in.
>>> But the information about what buffer belongs to which vcpu id should
>>> be transfered to a host somehow too.
>>>
>>>>>> However, when we
>>>>>> focus on output data of the write_tsc_offset event, it is difficult to
>>>>>> directly understand contents of the data if vcpu number information is
>>>>>> not included. So, including the information is useful, I think.
>>>>>>
>>>>> How your tool does it now?
>>>>
>>>> It merges guest trace with host trace (by converting the TSC timestamp
>>>> in the guest trace to host TSC using tsc_offset information).
>>>>
>>> I mean how it does it now without vcpu id. The answer is that it works
>>> for only one vcpu now.
>>
>> Yes.
>>
>>>> By not recording vcpu ID in the tsc_offset trace, it is necessary to
>>>> supply the tool with PID<->VCPU_id tuples for translation (so its an
>>>> additional step required, and it makes trace merge impossible
>>>> if the information is not available).
>>> The tool needs PID<->VCPU_id tuples to do the merging of any trace
>>> entry. Without that it does not know how to interpret entry timestamps
>>> (which offset to use). Apparently it will get this information from
>>> vmentry trace point. What is so special about tsc_offset tracing that
>>> it needs to contain vcpuid by itself.
>>
>> If the tsc_offset tracepoint contains vcpu ID, its possible to lookup
>> guest trace entry (which contains CPU ID), and match on that.
>>
>> Without that, PID<->VCPU_id tuples are necessary. Yes?
> Ah, I think I see it now. For some reason I assumed that merge is done
> for each vcpu separately, so you need to separate host events per vcpu
> too, but this is not the case, host and guest event are merged based on
> tsc timestamp only. In this case I see the merits of having vcpu id in
> tsc_offset change trace point. It will make things easier a bit.

OK, I'll resend a patch including vcpu_id information.

Thanks,
Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]