Hi All,
The following patch set makes it possible to sort the interleaved trace
data of a guest and its host in chronological order.
In a virtualized environment it is difficult to analyze performance
problems, such as delayed I/O requests on a guest, because multiple guests
run on the same host. One approach to such problems is to sort the trace
data of the guests and the host in chronological order.
With the patch set at https://lkml.org/lkml/2012/11/13/588 applied, the raw
TSC can be chosen as the ftrace timestamp. The TSC is useful for merging
trace data in chronological order for two reasons. First, guests can read
the raw TSC directly from the CPU with the rdtsc instruction; since it is
not a software clock like sched_clock, we do not need to care how the
timestamp is computed. Second, the TSC of recent x86 CPUs increases at a
constant rate, so we do not need to worry about the pace of the timestamp.
The TSC is therefore a reasonable tracing timestamp for integrating the
trace data of guests and a host.
Only one issue needs to be considered when using the TSC on guests: the TSC
value read on a guest is always the host TSC plus the guest's "TSC offset".
In other words, to merge trace data in chronological order using the TSC as
the timestamp, we must account for each guest's TSC offset.
However, only the host kernel can read the TSC offset from the VMCS, and
the offset is currently not exported anywhere. In other words, userland
tools cannot obtain the TSC offset value, so we cannot merge the trace data
of a guest and the host in chronological order. The TSC offset therefore
needs to be exported to userland tools.
In this patch set, the TSC offset is exported via printk() on the host. I
have also attached a tool that merges the trace data of a guest and a host
in chronological order.
<Example>
Assume that the wakeup latency of a command is large on a guest. Normally
we would use ftrace's wakeup-latency tracer or the event tracer on the
guest, but that may not be enough to solve the problem, because guests
frequently exit to the host for various reasons. So instead we use the TSC
as ftrace's timestamp and record trace data on both the guest and the host.
Then we get the following data:
/* guest data */
comm-3826 [000] d...49836825726903: sched_wakeup: [detail]
comm-3826 [000] d...49836832225344: sched_switch: [detail]
/* host data */
qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
Since the TSC offset is not taken into account, these data cannot be merged
as-is. If the trace data were shown as follows, we could understand the cause:
qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
comm-3826 [000] d.h.49836825726903: sched_wakeup: [detail] <=
qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
comm-3826 [000] d...49836832225344: sched_switch: [detail] <=
qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
In this case we can see that the wakeup latency was large because the guest
exited to the host twice. Getting this data sorted in chronological order
is our goal.
To merge the data as in the previous listing, we apply this patch set. Then
we can obtain the TSC offset of the guest as follows:
$ dmesg | grep kvm
[ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
^^^^ ^^^^^^^^^^^^^^^^^^^^ |
PID TSC offset |
HOST TSC value --+
We pass this TSC offset value to the merge script and obtain the following data:
$ ./trace-merge.pl 18446743360465545001 host.data guest.data
h qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
g comm-3826 [000] d.h.50550079226331: sched_wakeup: [detail] <=
h qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
h qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
g comm-3826 [000] d...50550079279266: sched_switch: [detail] <=
h qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
|
\----guest/host
In summary, I propose a patch that outputs the TSC offset of each guest on
the host.
I chose printk() to output the TSC offset value, but I do not think this is
the best method. Defining a tracepoint, for example, is another option; in
that case, multiple buffers would be needed to keep the data.
I would appreciate your comments, thanks!
---
Yoshihiro YUNOMAE (2):
kvm/vmx: Print TSC_OFFSET information when TSC offset value is written to VMCS
tools: Add a tool for merging trace data of a guest and a host
arch/x86/kvm/vmx.c | 5 +
tools/scripts/trace-merge/trace-merge.pl | 109 ++++++++++++++++++++++++++++++
2 files changed, 114 insertions(+)
create mode 100755 tools/scripts/trace-merge/trace-merge.pl
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
Print TSC_OFFSET information when the TSC offset value is written to the
VMCS, so that the actual TSC of a guest can be measured.
The TSC value read on a guest is always the host TSC plus the guest's "TSC
offset". The TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
vmx_adjust_tsc_offset(). KVM executes the former when a guest boots and the
latter when the kvm clock is updated. The host can read the TSC offset
values from the VMCS, so if the host outputs them, we can calculate the
actual TSC value for each TSC timestamp recorded in the guest's trace data.
Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Hidehiro Kawai <[email protected]>
---
arch/x86/kvm/vmx.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ad6b1dd..8edfe3c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1887,6 +1887,9 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
vmcs12->tsc_offset : 0));
} else {
vmcs_write64(TSC_OFFSET, offset);
+ pr_info("kvm: (%d) write TSC offset %llu, now clock %llu\n",
+ current->pid, vmcs_read64(TSC_OFFSET),
+ native_read_tsc());
}
}
@@ -1894,6 +1897,8 @@ static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool ho
{
u64 offset = vmcs_read64(TSC_OFFSET);
vmcs_write64(TSC_OFFSET, offset + adjustment);
+ pr_info("kvm: (%d) adjust TSC offset %llu, now clock %llu\n",
+ current->pid, vmcs_read64(TSC_OFFSET), native_read_tsc());
if (is_guest_mode(vcpu)) {
/* Even when running L2, the adjustment needs to apply to L1 */
to_vmx(vcpu)->nested.vmcs01_tsc_offset += adjustment;
This tool merges the trace data of a guest and a host in chronological
order. Note that it handles only a single guest and a host, not multiple
guests.
- How to use
1. Get trace data of the host and guest via ssh, virtio-serial, or virtio-trace
2. Get the TSC offset, after applying the patch "kvm/vmx: Print TSC_OFFSET
   information when TSC offset value is written to VMCS"
$ dmesg | grep kvm
[ 57.717180] kvm: ([PID]) write TSC offset [TSC offset], now clock [HOST TSC]
3. Use this tool
$ ./trace-merge.pl <TSC offset> <host data> <guest data>
h qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
g comm-3826 [000] d.h.50550079226331: sched_wakeup: [detail]
h qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
h qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
g comm-3826 [000] d...50550079279266: sched_switch: [detail]
h qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
h qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
|
\----guest/host
Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
tools/scripts/trace-merge/trace-merge.pl | 109 ++++++++++++++++++++++++++++++
1 file changed, 109 insertions(+)
create mode 100755 tools/scripts/trace-merge/trace-merge.pl
diff --git a/tools/scripts/trace-merge/trace-merge.pl b/tools/scripts/trace-merge/trace-merge.pl
new file mode 100755
index 0000000..e0b080c
--- /dev/null
+++ b/tools/scripts/trace-merge/trace-merge.pl
@@ -0,0 +1,109 @@
+#!/usr/bin/perl
+#
+# Tool for merging and sorting trace data of a guest and host
+#
+# Created by Yoshihiro YUNOMAE <[email protected]>
+#
+# - How to use
+# ./trace-merge.pl <TSC offset> <host data> <guest data>
+#
+use strict;
+use bigint;
+
+my %all_data_info = ();
+my @merged_data = ();
+my @sorted_data = ();
+
+&read_all_data();
+&merge_guest_host_data();
+&sort_data_by_tsc();
+&output_data();
+
+sub read_all_data {
+ # The TSC offset is printed as a very large unsigned 64-bit value.
+ # It is actually a negative number, so convert it to signed here.
+ $all_data_info{"tsc_offset"} = &convert_tscoffset($ARGV[0]);
+ if ($all_data_info{"tsc_offset"} == 0) {
+ die "TSC offset should not be 0";
+ }
+
+ if (!open(HOST_DATA, $ARGV[1])) {
+ die "Cannot open host file: $!"
+ }
+ my @host_data = <HOST_DATA>;
+ close(HOST_DATA);
+
+ if (!open(GUEST_DATA, $ARGV[2])) {
+ die "Cannot open guest file: $!"
+ }
+ my @guest_data = <GUEST_DATA>;
+ close(GUEST_DATA);
+
+ $all_data_info{"host_data"} = \@host_data;
+ $all_data_info{"guest_data"} = \@guest_data;
+}
+
+sub merge_guest_host_data {
+ &guest_push_data();
+ &host_push_data();
+}
+
+sub sort_data_by_tsc {
+ no strict 'refs';
+ @sorted_data = sort {$a->{tsc} <=> $b->{tsc}} @merged_data;
+}
+
+sub output_data {
+ foreach my $line (@sorted_data) {
+ print "$line->{name}$line->{comm}$line->{tsc}$line->{event}\n";
+ }
+}
+
+sub guest_push_data {
+ &make_data_list(1);
+}
+
+sub host_push_data {
+ &make_data_list(0);
+}
+
+#
+# If this function is used for guest's data,
+# subtract TSC offset from guest's TSC value.
+#
+# NOTE: the guest's TSC equals the actual (host) TSC plus the TSC offset,
+# which is set when the guest boots.
+#
+sub make_data_list {
+ my $is_guest = $_[0];
+ my @data = ();
+ my $name = "";
+ my $list = "";
+ my $tsc_offset = 0;
+
+ if ($is_guest == 1) {
+ $name = "g";
+ @data = @{$all_data_info{"guest_data"}};
+ $tsc_offset = $all_data_info{"tsc_offset"};
+ } else {
+ $name = "h";
+ @data = @{$all_data_info{"host_data"}};
+ }
+
+ foreach my $line (@data) {
+ chomp($line);
+
+ if ($line =~ /^(.+\[[0-9]+\].{5})([0-9]+)(:.+)/) {
+ $list = {
+ name => $name,
+ comm => $1,
+ tsc => $2 - $tsc_offset,
+ event => $3
+ };
+ push(@merged_data, $list);
+ }
+ }
+}
+
+sub convert_tscoffset {
+ return $_[0] - (1 << 64);
+}
On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
> To merge the data like previous pattern, we apply this patch set. Then, we can
> get TSC offset of the guest as follows:
>
> $ dmesg | grep kvm
> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
> PID TSC offset |
> HOST TSC value --+
>
Using printk to export something like this is IMO a nasty hack.
Can't we create a /sys or /proc file to export the same thing?
-- Steve
On 11/13/2012 06:00 PM, Steven Rostedt wrote:
>
> Using printk to export something like this is IMO a nasty hack.
>
> Can't we create a /sys or /proc file to export the same thing?
>
Maybe we need a /proc/pid/kvm/* directory?
-hpa
On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
>
>> To merge the data like previous pattern, we apply this patch set. Then, we can
>> get TSC offset of the guest as follows:
>>
>> $ dmesg | grep kvm
>> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
>> PID TSC offset |
>> HOST TSC value --+
>>
>
> Using printk to export something like this is IMO a nasty hack.
>
> Can't we create a /sys or /proc file to export the same thing?
Since the value changes over the course of the trace, and seems to be
part of the context of the trace, I think I'd include it as a
tracepoint.
>
> -- Steve
>
>
On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
> > On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
> >
> >> To merge the data like previous pattern, we apply this patch set. Then, we can
> >> get TSC offset of the guest as follows:
> >>
> >> $ dmesg | grep kvm
> >> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> >> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
> >> PID TSC offset |
> >> HOST TSC value --+
> >>
> >
> > Using printk to export something like this is IMO a nasty hack.
> >
> > Can't we create a /sys or /proc file to export the same thing?
>
> Since the value changes over the course of the trace, and seems to be
> part of the context of the trace, I think I'd include it as a
> tracepoint.
>
I'm fine with that too.
-- Steve
Thank you for commenting on my patch set.
(2012/11/14 11:31), Steven Rostedt wrote:
> On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
>> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
>>> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
>>>
>>>> To merge the data like previous pattern, we apply this patch set. Then, we can
>>>> get TSC offset of the guest as follows:
>>>>
>>>> $ dmesg | grep kvm
>>>> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
>>>> PID TSC offset |
>>>> HOST TSC value --+
>>>>
>>>
>>> Using printk to export something like this is IMO a nasty hack.
>>>
>>> Can't we create a /sys or /proc file to export the same thing?
>>
>> Since the value changes over the course of the trace, and seems to be
>> part of the context of the trace, I think I'd include it as a
>> tracepoint.
>>
>
> I'm fine with that too.
Using a tracepoint is a nice idea, but there is one problem. The point of
discussion here is that the TSC-offset-change event occurs infrequently,
yet the buffer must retain its data.
There are two ways to use a tracepoint. First, we could define a new
tracepoint for TSC offset changes. This is simple and the overhead would be
low, but the event stored in the buffer would eventually be overwritten by
other trace events, because TSC offset changes occur so rarely. Second, we
could add the TSC offset to a tracepoint that fires frequently, for example
to the arguments of trace_kvm_exit(). That would keep the TSC offset from
being overwritten by other events, but since the offset rarely changes, the
same information would be output many times and almost all of that data
would be wasted. Therefore, using a tracepoint alone is not a good idea.
So, I suggest a hybrid method: record TSC-offset-change events and also
read the last TSC offset from procfs when collecting the trace data.
Concretely, the method is as follows:
1. Enable the TSC-offset-change tracepoint and record the values
   before and after each change
2. Start tracing
3. Stop tracing
4. Collect trace data and read /proc/pid/kvm/*
5. Check if any trace event recording the two TSC offsets exists
in the trace data
   if (existing) => use the trace event (step 6)
   else          => use /proc/pid/kvm/* (step 7)
6. Apply the two TSC offsets from the trace event to the trace data and
   sort the trace data
(Ex.)
* => tracepoint of changing TSC offset
. => another trace event
[START]............*............[END]
<----------> <---------->
previous current
TSC offset TSC offset
7. Apply TSC offset of /proc/pid/kvm/* to the trace data and
sort the trace data
(Ex.)
. => another trace event(not tracepoint of changing TSC offset)
[START].........................[END]
<----------------------->
current
TSC offset
Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Wed, Nov 14, 2012 at 10:36:21AM +0900, Yoshihiro YUNOMAE wrote:
> Hi All,
>
> The following patch set can make disordered trace data of a guest and a host
> sorted in chronological order.
>
> In a virtualization environment, it is difficult to analyze performance
> problems, such as a delay of I/O request on a guest. This is because multiple
> guests operate on the host. One of approaches for solving such kind of problems
> is to sort trace data of guests and the host in chronological order.
>
> After we applied the patch set(https://lkml.org/lkml/2012/11/13/588), raw TSC
> can be chosen as a timestamp of ftrace. TSC is useful for merging trace data
> in chronological order by two reasons. One of the reasons is that guests can
> directly read raw TSC from the CPU using rdtsc operation. This means that raw
> TSC value is not software clock like sched_clock, so we don't need to consider
> about how the timestamp is calculated. The other is that TSC of recent x86 CPUs
> is constantly incremented. This means that we don't need to worry about pace of
> the timestamp. Therefore, choosing TSC as a timestamp for tracing is reasonable
> to integrate trace data of guests and a host.
>
> Here, we need to consider about just one matter for using TSC on guests. TSC
> value on a guest is always the host TSC plus the guest's "TSC offset". In other
> words, to merge trace data using TSC as timestamp in chronological order, we
> need to consider TSC offset of the guest.
>
> However, only the host kernel can read the TSC offset from VMCS and TSC offset
> is not output in anywhere now. In other words, tools in userland cannot get
> the TSC offset value, so we cannot merge trace data of guest and the host in
> chronological order. Therefore, the TSC offset should be exported for userland
> tools.
>
> In this patch set, TSC offset is exported by printk() on the host. I also
> attached a tool for merging trace data of a guest and a host in chronological
> order.
>
> <Example>
> We assume that wakeup-latency for a command is big on a guest. Normally
> we will use ftrace's wakeup-latency tracer or event tracer on the guest, but we
> may not be able to solve this problem. This is because guests often exit to
> the host for several reasons. In the next, we will use TSC as ftrace's timestamp
> and record the trace data on the guest and the host. Then, we get following
> data:
>
> /* guest data */
> comm-3826 [000] d...49836825726903: sched_wakeup: [detail]
> comm-3826 [000] d...49836832225344: sched_switch: [detail]
> /* host data */
> qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
> qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
> qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
> qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
>
> Since TSC offset is not considered, these data cannot be merged. If this trace
> data is shown like as follows, we will be able to understand the reason:
>
> qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
> comm-3826 [000] d.h.49836825726903: sched_wakeup: [detail] <=
> qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
> qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
> comm-3826 [000] d...49836832225344: sched_switch: [detail] <=
> qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
>
> In this case, we can understand wakeup-latency was big due to exit to host
> twice. Getting this data sorted in chronological order is our goal.
>
> To merge the data like previous pattern, we apply this patch set. Then, we can
> get TSC offset of the guest as follows:
>
> $ dmesg | grep kvm
> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
> PID TSC offset |
> HOST TSC value --+
>
> We use this TSC offset value to a merge script and obtain the following data:
>
> $ ./trace-merge.pl 18446743360465545001 host.data guest.data
> h qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
> h qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
> g comm-3826 [000] d.h.50550079226331: sched_wakeup: [detail] <=
> h qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
> h qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
> h qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
> h qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
> g comm-3826 [000] d...50550079279266: sched_switch: [detail] <=
> h qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
> h qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
> |
> \----guest/host
>
> In this summary, I suggest the patch which TSC offset for each guest can be
> output on the host.
The guest TSC can change (for example if TSC scaling is used). Moreover
TSC offset can change, and you'd have to monitor that. What
about a module option so that tsc_offset is written as zero (to be
used as a debugging tool)? Then the following restrictions apply:
- TSC must be synchronized across CPUs/VCPUS.
- TSC must be reliable.
Would that suffice? (a module option to kvm.ko, say zero_tsc_offset).
> I chose printk() to output TSC offset value, but I think this is not the best
> method. For example, defining as a tracepoint is one of the methods. In the
> case, multiple buffers are needed to keep these data.
>
> I need your comments, thanks!
> ---
>
> Yoshihiro YUNOMAE (2):
> kvm/vmx: Print TSC_OFFSET information when TSC offset value is written to VMCS
> tools: Add a tool for merging trace data of a guest and a host
>
>
> arch/x86/kvm/vmx.c | 5 +
> tools/scripts/trace-merge/trace-merge.pl | 109 ++++++++++++++++++++++++++++++
> 2 files changed, 114 insertions(+)
> create mode 100755 tools/scripts/trace-merge/trace-merge.pl
>
> --
> Yoshihiro YUNOMAE
> Software Platform Research Dept. Linux Technology Center
> Hitachi, Ltd., Yokohama Research Laboratory
> E-mail: [email protected]
Hi Marcelo,
Thank you for commenting on my patch set.
(2012/11/16 12:19), Marcelo Tosatti wrote:
> On Wed, Nov 14, 2012 at 10:36:21AM +0900, Yoshihiro YUNOMAE wrote:
[...]
>> In this summary, I suggest the patch which TSC offset for each guest can be
>> output on the host.
>
> The guest TSC can change (for example if TSC scaling is used). Moreover
> TSC offset can change, and you'd have to monitor that. What
Yes, that's true. Handling TSC offset changes is the key to using the TSC
for merging the trace data of guests and the host.
> about a module option so that tsc_offset is written as zero (to be
> used as debugging tool). Then the following restrictions apply:
>
> - TSC must be synchronized across CPUs/VCPUS.
> - TSC must be reliable.
>
> Would that suffice? (a module option to kvm.ko, say zero_tsc_offset).
As you say, the guest TSC can change, so the guest TSC needs to meet these
two restrictions for the trace data to be merged in chronological order.
However, I think the zero-TSC-offset method is not enough. I intend to use
TSC values as the tracing timestamp not only for debugging but also for
failure analysis in production. With a zero TSC offset there is normally no
problem, but if the guest executes write_tsc or is live-migrated, the TSC
offset will change. In the end, we still need to monitor the TSC offset
value.
Thank you,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Fri, Nov 16, 2012 at 05:09:40PM +0900, Yoshihiro YUNOMAE wrote:
> Hi Marcelo,
>
> Thank you for commenting on my patch set.
>
> (2012/11/16 12:19), Marcelo Tosatti wrote:
> >On Wed, Nov 14, 2012 at 10:36:21AM +0900, Yoshihiro YUNOMAE wrote:
> [...]
> >>In this summary, I suggest the patch which TSC offset for each guest can be
> >>output on the host.
> >
> >The guest TSC can change (for example if TSC scaling is used). Moreover
> >TSC offset can change, and you'd have to monitor that. What
>
> Yes, that's true. Changing TSC offset is the key point to use TSC for
> merging trace data of guests and the host.
>
> >about a module option so that tsc_offset is written as zero (to be
> >used as debugging tool). Then the following restrictions apply:
> >
> >- TSC must be synchronized across CPUs/VCPUS.
> >- TSC must be reliable.
> >
> >Would that suffice? (a module option to kvm.ko, say zero_tsc_offset).
>
> As you say, the guest TSC can change, so guest TSC needs to meet these
> two restrictions to merge the trace data in chronological order.
>
> However, the zero-TSC offset method is not enough, I think.
> I will use TSC values as the tracing timestamp not only for debugging
> but for failure analysis on actual operations. When we introduce
> the zero-TSC offset, normally it will be no problem. However, if
> the guest executes write_tsc or the guest works live migration, TSC
> offset will be changed. After all, we need to monitor the TSC offset
> value.
>
> Thank you,
What I wrote was not precise.
With TSC scaling (of AMD's svm.c) or TSC trapping, the guest RDTSC is
not monotonic with reference to the host TSC. That is, TSC scaling
and/or trapping are fundamentally incompatible with the feature you
propose.
What you are saying is that you'd like this feature for production, not
debugging. Unfortunately this requires a TSC synchronized across CPUs,
which is not widespread; hence the suggestion to introduce it as a
debugging facility (you cannot rely on it for failure analysis on all
systems).
Even then, a more reliable method for reporting TSC offset must be used.
Apart from these issues, it is very useful to correlate guest/host
events in the way you propose.
On Wed, 2012-11-14 at 17:26 +0900, Yoshihiro YUNOMAE wrote:
> Thank you for commenting on my patch set.
>
> (2012/11/14 11:31), Steven Rostedt wrote:
> > On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
> >> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
> >>> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
> >>>
> >>>> To merge the data like previous pattern, we apply this patch set. Then, we can
> >>>> get TSC offset of the guest as follows:
> >>>>
> >>>> $ dmesg | grep kvm
> >>>> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> >>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
> >>>> PID TSC offset |
> >>>> HOST TSC value --+
> >>>>
> >>>
> >>> Using printk to export something like this is IMO a nasty hack.
> >>>
> >>> Can't we create a /sys or /proc file to export the same thing?
> >>
> >> Since the value changes over the course of the trace, and seems to be
> >> part of the context of the trace, I think I'd include it as a
> >> tracepoint.
> >>
> >
> > I'm fine with that too.
>
> Using some tracepoint is a nice idea, but there is one problem. Here,
> our discussion point is "the event which TSC offset is changed does not
> frequently occur, but the buffer must keep the event data."
If you can hold off a bit, for the 3.9 window, I plan on pushing
multiple buffers for ftrace. That is, you can create a separate buffer
just for the TSC offset events:
cd /sys/kernel/debug
echo tsc > instances/new
echo 1 > instances/tsc/events/tsc/offset/enable
Then the buffer will be used only for that event.
The code is pretty much ready, but I want to test it more before the 3.8
window opens, which is why I'm waiting for 3.9.
You can see the latest version here:
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
branch: rfc/multi-buffers-v7
-- Steve
On Wed, Nov 14, 2012 at 05:26:10PM +0900, Yoshihiro YUNOMAE wrote:
> Thank you for commenting on my patch set.
>
> (2012/11/14 11:31), Steven Rostedt wrote:
> >On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
> >>On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
> >>>On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
> >>>
> >>>>To merge the data like previous pattern, we apply this patch set. Then, we can
> >>>>get TSC offset of the guest as follows:
> >>>>
> >>>>$ dmesg | grep kvm
> >>>>[ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> >>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
> >>>> PID TSC offset |
> >>>> HOST TSC value --+
> >>>>
> >>>
> >>>Using printk to export something like this is IMO a nasty hack.
> >>>
> >>>Can't we create a /sys or /proc file to export the same thing?
> >>
> >>Since the value changes over the course of the trace, and seems to be
> >>part of the context of the trace, I think I'd include it as a
> >>tracepoint.
> >>
> >
> >I'm fine with that too.
>
> Using some tracepoint is a nice idea, but there is one problem. Here,
> our discussion point is "the event which TSC offset is changed does not
> frequently occur, but the buffer must keep the event data."
>
> There are two ideas for using tracepoint. First, we define new
> tracepoint for changed TSC offset. This is simple and the overhead will
> be low. However, this trace event stored in the buffer will be
> overwritten by other trace events because this TSC offset event does
> not frequently occur. Second, we add TSC offset information to the
> tracepoint frequently occured. For example, we assume that TSC offset
> information is added to arguments of trace_kvm_exit().
The TSC offset is in the host trace. So given a host trace with two TSC
offset updates, how do you know which events in the guest trace
(containing a number of events) refer to which tsc offset update?
Unless I am missing something, you can't solve this easily (well, except
exporting information to the guest that allows it to transform RDTSC ->
host TSC value, which can be done via pvclock).
Another issue as mentioned is lack of TSC synchronization in the host.
Should you provide such a feature without the possibility of proper
chronological order on systems with unsynchronized TSC?
> By adding the
> information to the arguments, we can avoid the situation where the TSC
> offset information is overwritten by other events. However, TSC offset
> is not frequently changed and same information is output many times
> because almost all data are waste. Therefore, only using tracepoint
> is not good idea.
>
> So, I suggest a hybrid method; record TSC offset change events and read
> the last TSC offset from procfs when collecting the trace data.
> In particular, the method is as follows:
> 1. Enable the tracepoint of TSC offset change and record the value
> before and after changing
> 2. Start tracing
> 3. Stop tracing
> 4. Collect trace data and read /proc/pid/kvm/*
> 5. Check if any trace event recording the two TSC offsets exists
> in the trace data
> if(existing) => use trace event (flow 6)
> else => use /proc/pid/kvm/* (flow 7)
> 6. Apply two TSC offsets of the trace event to the trace data and
> sort the trace data
> (Ex.)
> * => tracepoint of changing TSC offset
> . => another trace event
>
> [START]............*............[END]
> <----------> <---------->
> previous current
> TSC offset TSC offset
>
> 7. Apply TSC offset of /proc/pid/kvm/* to the trace data and
> sort the trace data
> (Ex.)
> . => another trace event(not tracepoint of changing TSC offset)
>
> [START].........................[END]
> <----------------------->
> current
> TSC offset
>
> Thanks,
>
> --
> Yoshihiro YUNOMAE
> Software Platform Research Dept. Linux Technology Center
> Hitachi, Ltd., Yokohama Research Laboratory
> E-mail: [email protected]
>
On Fri, Nov 16, 2012 at 10:05:54AM -0500, Steven Rostedt wrote:
> On Wed, 2012-11-14 at 17:26 +0900, Yoshihiro YUNOMAE wrote:
> > Thank you for commenting on my patch set.
> >
> > (2012/11/14 11:31), Steven Rostedt wrote:
> > > On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
> > >> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
> > >>> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
> > >>>
> > >>>> To merge the data like previous pattern, we apply this patch set. Then, we can
> > >>>> get TSC offset of the guest as follows:
> > >>>>
> > >>>> $ dmesg | grep kvm
> > >>>> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> > >>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
> > >>>> PID TSC offset |
> > >>>> HOST TSC value --+
> > >>>>
> > >>>
> > >>> Using printk to export something like this is IMO a nasty hack.
> > >>>
> > >>> Can't we create a /sys or /proc file to export the same thing?
> > >>
> > >> Since the value changes over the course of the trace, and seems to be
> > >> part of the context of the trace, I think I'd include it as a
> > >> tracepoint.
> > >>
> > >
> > > I'm fine with that too.
> >
> > Using some tracepoint is a nice idea, but there is one problem. Here,
> > our discussion point is "the event which TSC offset is changed does not
> > frequently occur, but the buffer must keep the event data."
>
> If you can hold off a bit, for the 3.9 window, I plan on pushing
> multiple buffers for ftrace. That is, you can create a separate buffer
> just for the TSC offset events:
>
> cd /sys/kernel/debug
> echo tsc > instances/new
> echo 1 > instances/tsc/events/tsc/offset/enable
>
> Then the buffer will be used only for that event.
>
> The code is pretty much ready, but I want to test it more before the 3.8
> window opens, which is why I'm waiting for 3.9.
>
> You can see the latest version here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> branch: rfc/multi-buffers-v7
>
> -- Steve
You need the TSC offset change event and the main events to be ordered,
so I can't see how this helps.
Hi Marcelo,
Sorry for the late reply.
(2012/11/17 4:15), Marcelo Tosatti wrote:
> On Wed, Nov 14, 2012 at 05:26:10PM +0900, Yoshihiro YUNOMAE wrote:
>> Thank you for commenting on my patch set.
>>
>> (2012/11/14 11:31), Steven Rostedt wrote:
>>> On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
>>>> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
>>>>> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
>>>>>
>>>>>> To merge the data like previous pattern, we apply this patch set. Then, we can
>>>>>> get TSC offset of the guest as follows:
>>>>>>
>>>>>> $ dmesg | grep kvm
>>>>>> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>>>>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
>>>>>> PID TSC offset |
>>>>>> HOST TSC value --+
>>>>>>
>>>>>
>>>>> Using printk to export something like this is IMO a nasty hack.
>>>>>
>>>>> Can't we create a /sys or /proc file to export the same thing?
>>>>
>>>> Since the value changes over the course of the trace, and seems to be
>>>> part of the context of the trace, I think I'd include it as a
>>>> tracepoint.
>>>>
>>>
>>> I'm fine with that too.
>>
>> Using some tracepoint is a nice idea, but there is one problem. Here,
>> our discussion point is "the event which TSC offset is changed does not
>> frequently occur, but the buffer must keep the event data."
>>
>> There are two ideas for using tracepoint. First, we define new
>> tracepoint for changed TSC offset. This is simple and the overhead will
>> be low. However, this trace event stored in the buffer will be
>> overwritten by other trace events because this TSC offset event does
>> not frequently occur. Second, we add TSC offset information to the
>> tracepoint frequently occured. For example, we assume that TSC offset
>> information is added to arguments of trace_kvm_exit().
>
> The TSC offset is in the host trace. So given a host trace with two TSC
> offset updates, how do you know which events in the guest trace
> (containing a number of events) refer to which tsc offset update?
>
> Unless i am missing something, you can't solve this easily (well, except
> exporting information to the guest that allows it to transform RDTSC ->
> host TSC value, which can be done via pvclock).
As you say, TSC offset events are in the host trace, but we don't need
to notify guests when the TSC offset is updated. The offset event
outputs the next TSC offset value and the current host TSC value, so we
can calculate the guest TSC (T1) at the time of the event. Guest TSC
values after T1 can then be converted to host TSC values using that
offset, so we can integrate the trace data.
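A minimal sketch of that conversion (the function name and the example
numbers are illustrative, not part of the patch set):

```python
def guest_to_host_tsc(guest_tsc, tsc_offset):
    # The guest's RDTSC returns host TSC + offset, so invert the relation
    # to place a guest event on the host timeline.
    return guest_tsc - tsc_offset

# With the offset -2950 used in the write_tsc example in this thread,
# a guest event at cycle 53 lands at host cycle 3003.
```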
> Another issue as mentioned is lack of TSC synchronization in the host.
> Should you provide such a feature without the possibility of proper
> chronological order on systems with unsynchronized TSC?
I think we cannot support this sorting feature using TSC on systems
with unsynchronized TSCs. On such systems, it is difficult to sort in
chronological order not only the combined trace data of guests and the
host, but even the trace data of a single guest or of the host, when
TSC is used as the timestamp. Actually, if we want to output ftrace
data in chronological order with unsynchronized TSCs, we would use the
"global" clock as the timestamp. The global mode uses wallclock time
plus a TSC correction, so it guarantees chronological order within the
trace data of the guest or of the host. However, if we used this mode
to sort the trace data of guests and the host together, we would need
to account for the clock difference between the guest and the host and
for the timekeeping of each, and these issues are difficult to solve.
At least, I haven't come up with a good solution.
We cannot sort the trace data of guests and the host in chronological
order with unsynchronized TSCs, but if we can generate the following
synchronization events on both guests and the host, we will know where
to sort.
First, a guest and the host use the global mode as the timestamp of
ftrace. Next, a user on the guest writes "1" to the synchronization I/F
as the ID; the synchronization event "1" is then recorded in a
ring buffer of the guest. The synchronization operation induces a
hypercall, so the host can handle the event. After control moves to
the host, the host records the event "1" in a ring buffer of the
host. In the end, control returns to the guest, and the
synchronization is finished.
When we integrate the tracing data of the guest and the host, we
calculate the difference between the timestamps of the synchronization
events with the same ID. This value is a temporary "offset". We then
convert the timestamps of the guest to host timestamps up to the
next synchronization event. If the synchronization event cycle is very
short, we will not need to consider timekeeping drift. Then we can
sort the trace data in chronological order.
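As a rough sketch of this proposal (all names here are hypothetical; the
patch set defines no such interface), the temporary offset from each
matched pair of synchronization events could be applied like this:

```python
import bisect

def convert_guest_timestamps(guest_events, sync_pairs):
    """Map guest timestamps onto the host timeline.

    guest_events: [(guest_ts, event_name), ...] sorted by guest_ts.
    sync_pairs:   [(guest_ts, host_ts), ...] for synchronization events
                  with matching IDs, sorted by guest_ts.
    Each guest event uses the temporary offset (host_ts - guest_ts) of
    the most recent preceding synchronization event.
    """
    guest_keys = [g for g, _ in sync_pairs]
    converted = []
    for ts, name in guest_events:
        i = max(bisect.bisect_right(guest_keys, ts) - 1, 0)
        g, h = sync_pairs[i]
        converted.append((ts + (h - g), name))
    return converted
```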
Would you comment on this, or do you have another idea?
Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
Hi Steven,
Sorry for the late reply.
(2012/11/17 0:05), Steven Rostedt wrote:
> On Wed, 2012-11-14 at 17:26 +0900, Yoshihiro YUNOMAE wrote:
>> Thank you for commenting on my patch set.
>>
>> (2012/11/14 11:31), Steven Rostedt wrote:
>>> On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
>>>> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
>>>>> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
>>>>>
>>>>>> To merge the data like previous pattern, we apply this patch set. Then, we can
>>>>>> get TSC offset of the guest as follows:
>>>>>>
>>>>>> $ dmesg | grep kvm
>>>>>> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>>>>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
>>>>>> PID TSC offset |
>>>>>> HOST TSC value --+
>>>>>>
>>>>>
>>>>> Using printk to export something like this is IMO a nasty hack.
>>>>>
>>>>> Can't we create a /sys or /proc file to export the same thing?
>>>>
>>>> Since the value changes over the course of the trace, and seems to be
>>>> part of the context of the trace, I think I'd include it as a
>>>> tracepoint.
>>>>
>>>
>>> I'm fine with that too.
>>
>> Using some tracepoint is a nice idea, but there is one problem. Here,
>> our discussion point is "the event which TSC offset is changed does not
>> frequently occur, but the buffer must keep the event data."
>
> If you can hold off a bit, for the 3.9 window, I plan on pushing
> multiple buffers for ftrace. That is, you can create a separate buffer
> just for the TSC offset events:
>
> cd /sys/kernel/debug
> echo tsc > instances/new
> echo 1 > instances/tsc/events/tsc/offset/enable
>
> Then the buffer will be used only for that event.
That's good. The tracepoint will output a line like the following:
qemu-kvm-12345 [000] .... 123456789: kvm_write_tsc_offset:
now_tsc=123456789 previous_offset=0 next_offset=123456780
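A merging tool could pick these fields out of the text output; here is a
hypothetical parser for the sample line above (the field names follow the
example output, not a committed kernel ABI):

```python
import re

# Field names follow the sample kvm_write_tsc_offset line above.
LINE_RE = re.compile(r"kvm_write_tsc_offset:\s*now_tsc=(\d+)\s+"
                     r"previous_offset=(-?\d+)\s+next_offset=(-?\d+)")

def parse_offset_event(line):
    """Return the offset-change fields of a trace line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    now_tsc, prev, nxt = (int(x) for x in m.groups())
    return {"now_tsc": now_tsc, "prev_offset": prev, "next_offset": nxt}
```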
Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Tue, Nov 20, 2012 at 07:36:33PM +0900, Yoshihiro YUNOMAE wrote:
> Hi Marcelo,
>
> Sorry for the late reply.
>
> (2012/11/17 4:15), Marcelo Tosatti wrote:
> >On Wed, Nov 14, 2012 at 05:26:10PM +0900, Yoshihiro YUNOMAE wrote:
> >>Thank you for commenting on my patch set.
> >>
> >>(2012/11/14 11:31), Steven Rostedt wrote:
> >>>On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
> >>>>On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
> >>>>>On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
> >>>>>
> >>>>>>To merge the data like previous pattern, we apply this patch set. Then, we can
> >>>>>>get TSC offset of the guest as follows:
> >>>>>>
> >>>>>>$ dmesg | grep kvm
> >>>>>>[ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> >>>>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
> >>>>>> PID TSC offset |
> >>>>>> HOST TSC value --+
> >>>>>>
> >>>>>
> >>>>>Using printk to export something like this is IMO a nasty hack.
> >>>>>
> >>>>>Can't we create a /sys or /proc file to export the same thing?
> >>>>
> >>>>Since the value changes over the course of the trace, and seems to be
> >>>>part of the context of the trace, I think I'd include it as a
> >>>>tracepoint.
> >>>>
> >>>
> >>>I'm fine with that too.
> >>
> >>Using some tracepoint is a nice idea, but there is one problem. Here,
> >>our discussion point is "the event which TSC offset is changed does not
> >>frequently occur, but the buffer must keep the event data."
> >>
> >>There are two ideas for using tracepoint. First, we define new
> >>tracepoint for changed TSC offset. This is simple and the overhead will
> >>be low. However, this trace event stored in the buffer will be
> >>overwritten by other trace events because this TSC offset event does
> >>not frequently occur. Second, we add TSC offset information to the
> >>tracepoint frequently occured. For example, we assume that TSC offset
> >>information is added to arguments of trace_kvm_exit().
> >
> >The TSC offset is in the host trace. So given a host trace with two TSC
> >offset updates, how do you know which events in the guest trace
> >(containing a number of events) refer to which tsc offset update?
> >
> >Unless i am missing something, you can't solve this easily (well, except
> >exporting information to the guest that allows it to transform RDTSC ->
> >host TSC value, which can be done via pvclock).
>
> As you say, TSC offset events are in the host trace, but we don't need
> to notify guests of updating TSC offset. The offset event will output
> the next TSC offset value and the current TSC value, so we can
> calculate the guest TSC (T1) for the event. Guest TSCs since T1 can be
> converted to host TSC using the TSC offset, so we can integrate those
> trace data.
Think of this scenario:
host trace
1h. event tsc write tsc_offset=-1000
3h. vmenter
4h. vmexit
... (event sequence)
99h. vmexit
100h. event tsc_write tsc_offset=-2000
101h. vmenter
... (event sequence).
500h. event tsc_write tsc_offset=-3000
Then a guest trace containing events with a TSC timestamp.
Which tsc_offset to use?
(That is the problem, which unless I am mistaken can only be solved
easily if the guest can convert RDTSC -> TSC of host.)
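A tiny illustration of the ambiguity, using the offsets from the scenario
above: the same guest timestamp maps to a different host time under each
recorded offset, and nothing in the guest trace alone says which applies.

```python
OFFSETS = [-1000, -2000, -3000]  # tsc_offset values from the host trace

def candidate_host_times(guest_tsc):
    # host TSC = guest TSC - offset; one candidate per recorded offset
    return [guest_tsc - off for off in OFFSETS]
```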
> >Another issue as mentioned is lack of TSC synchronization in the host.
> >Should you provide such a feature without the possibility of proper
> >chronological order on systems with unsynchronized TSC?
>
> I think, we cannot support this sorting feature using TSC on systems
> with unsynchronized TSC. On systems with unsynchronized TSC, it is
> difficult to sort not only trace data of guests and the host but trace
> data of a guest or a host using TSC in chronological order. Actually,
> if we want to output tracing data of ftrace in chronological order with
> unsynchronized TSC, we will use the "global" mode as the timestamp. The
> global mode uses wallclock added TSC correction, so the mode guarantees
> to sort in chronological order for trace data of the guest or of
> the host. If we use this mode to sort the trace data of guests and the
> host in chronological order, we need to consider about the difference
> between the guest and the host and timekeeping of guests and the host,
> so it is difficult to solve these issues. At least, I haven't came up
> with the good solution.
I suppose the tradeoff is performance (RDTSC) versus reliability when
using ftrace. But then, even ftrace on the host suffers from the same
problem with unsynchronized TSCs.
> We cannot sort the trace data of guests and the host in chronological
> order with unsynchronized TSC, but if we can set following
> synchronization events for both guests and the host, we will know where
> we should sort.
>
> First, a guest and the host uses the global mode as the timestamp of
> ftrace. Next, a user on the guest writes "1" to the synchronization I/F
> as the ID, then the synchronization event "1" is recorded in a
> ring-buffer of the guest. The synchronization operation induces
> hypercall, so the host can handle the event. After the operation moves
> to the host, the host records the event "1" in a ring-buffer of the
> host. In the end, the operation returns to the host, and the
> synchronization is finished.
>
> When we integrate tracing data of the guest and the host, we
> calculate difference of the timestamp between the synchronizing events
> with the same ID. This value is a temporary "offset". We will convert
> the timestamp of the guests to the timestamp of the host before the
> next synchronizing event. If the synchronizing event cycle is very
> short, we will not need to consider the timekeeping. Then, we can sort
> the trace data in chronological order.
>
> Would you comment for this or do you have another idea?
Performance of any solution without synchronized TSCs will be bad.
Let's try to reduce the coverage of the feature by providing ordering of
guest/host events on a per-vcpu basis (that is, you can only
chronologically order events on a per-VM basis if the host TSC is
synchronized).
Which depends on the discussion above about multiple tsc offsets
in the host trace.
BTW, this issue came up during the KVM-RT BOF at KVMForum earlier this
month. Currently there is no way to correlate (and be able to measure)
events across host/guest, to profile RT behaviour.
Hi Marcelo,
(2012/11/21 7:51), Marcelo Tosatti wrote:
> On Tue, Nov 20, 2012 at 07:36:33PM +0900, Yoshihiro YUNOMAE wrote:
>> Hi Marcelo,
>>
>> Sorry for the late reply.
>>
>> (2012/11/17 4:15), Marcelo Tosatti wrote:
>>> On Wed, Nov 14, 2012 at 05:26:10PM +0900, Yoshihiro YUNOMAE wrote:
>>>> Thank you for commenting on my patch set.
>>>>
>>>> (2012/11/14 11:31), Steven Rostedt wrote:
>>>>> On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
>>>>>> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
>>>>>>> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
>>>>>>>
>>>>>>>> To merge the data like previous pattern, we apply this patch set. Then, we can
>>>>>>>> get TSC offset of the guest as follows:
>>>>>>>>
>>>>>>>> $ dmesg | grep kvm
>>>>>>>> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>>>>>>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
>>>>>>>> PID TSC offset |
>>>>>>>> HOST TSC value --+
>>>>>>>>
>>>>>>>
>>>>>>> Using printk to export something like this is IMO a nasty hack.
>>>>>>>
>>>>>>> Can't we create a /sys or /proc file to export the same thing?
>>>>>>
>>>>>> Since the value changes over the course of the trace, and seems to be
>>>>>> part of the context of the trace, I think I'd include it as a
>>>>>> tracepoint.
>>>>>>
>>>>>
>>>>> I'm fine with that too.
>>>>
>>>> Using some tracepoint is a nice idea, but there is one problem. Here,
>>>> our discussion point is "the event which TSC offset is changed does not
>>>> frequently occur, but the buffer must keep the event data."
>>>>
>>>> There are two ideas for using tracepoint. First, we define new
>>>> tracepoint for changed TSC offset. This is simple and the overhead will
>>>> be low. However, this trace event stored in the buffer will be
>>>> overwritten by other trace events because this TSC offset event does
>>>> not frequently occur. Second, we add TSC offset information to the
>>>> tracepoint frequently occured. For example, we assume that TSC offset
>>>> information is added to arguments of trace_kvm_exit().
>>>
>>> The TSC offset is in the host trace. So given a host trace with two TSC
>>> offset updates, how do you know which events in the guest trace
>>> (containing a number of events) refer to which tsc offset update?
>>>
>>> Unless i am missing something, you can't solve this easily (well, except
>>> exporting information to the guest that allows it to transform RDTSC ->
>>> host TSC value, which can be done via pvclock).
>>
>> As you say, TSC offset events are in the host trace, but we don't need
>> to notify guests of updating TSC offset. The offset event will output
>> the next TSC offset value and the current TSC value, so we can
>> calculate the guest TSC (T1) for the event. Guest TSCs since T1 can be
>> converted to host TSC using the TSC offset, so we can integrate those
>> trace data.
>
> Think of this scenario:
>
> host trace
> 1h. event tsc write tsc_offset=-1000
> 3h. vmenter
> 4h. vmexit
> ... (event sequence)
> 99h. vmexit
> 100h. event tsc_write tsc_offset=-2000
> 101h. vmenter
> ... (event sequence).
> 500h. event tsc_write tsc_offset=-3000
>
> Then a guest trace containing events with a TSC timestamp.
> Which tsc_offset to use?
>
> (that is the problem, which unless i am mistaken can only be solved
> easily if the guest can convert RDTSC -> TSC of host).
There are three cases in which the TSC offset changes:
1. Reset TSC at guest boot time
2. Adjust TSC offset due to some host-side problem
3. Write TSC on guests
The scenario you mentioned is case 3, so we'll discuss this case.
Here, we assume that the guest is allocated a single CPU for the sake
of simplicity.
If a guest executes write_tsc, the TSC value jumps forward or backward.
For the forward case, the trace data are as follows:
< host > < guest >
cycles events cycles events
3000 tsc_offset=-2950
3001 kvm_enter
53 eventX
....
100 (write_tsc=+900)
3060 kvm_exit
3075 tsc_offset=-2050
3080 kvm_enter
1050 event1
1055 event2
...
This case is simple. The guest TSC of the first kvm_enter is calculated
as follows:
(host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51
Similarly, the guest TSC of the second kvm_enter is 130 in old-offset
terms. So the guest events between 51 and 130, that is, eventX at cycle
53, are inserted between the first pair of kvm_enter and kvm_exit. To
insert the guest events between 51 and 130, we convert the guest TSC to
the host TSC using the TSC offset 2950.
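The arithmetic of the forward case can be checked with a small sketch
(the function names are illustrative only):

```python
OLD_OFFSET = -2950  # from the tsc_offset=-2950 event in the listing above

def guest_tsc(host_cycle, offset=OLD_OFFSET):
    # Guest TSC = host TSC + TSC offset.
    return host_cycle + offset

def host_tsc(guest_cycle, offset=OLD_OFFSET):
    # Inverse mapping, used to place guest events on the host timeline.
    return guest_cycle - offset

# First kvm_enter (host 3001) -> guest 51; second kvm_enter (host 3080)
# -> guest 130 in old-offset terms; eventX at guest 53 -> host 3003,
# which falls between kvm_enter (3001) and kvm_exit (3060).
```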
For the backward case, the trace data are as follows:
< host > < guest >
cycles events cycles events
3000 tsc_offset=-2950
3001 kvm_enter
53 eventX
....
100 (write_tsc=-50)
3060 kvm_exit
3075 tsc_offset=-3000
3080 kvm_enter
90 event1
95 event2
...
As you say, in this case the previous method is invalid. When we
calculate the guest TSC value at the tsc_offset=-3000 event, the value
is 75 on the guest, which appears to precede the write_tsc=-50 event.
So, we need to consider further.
In this case, it is important that we can tell where the guest
executed write_tsc or the host rewrote the TSC offset. write_tsc on
the guest is a wrmsr to MSR 0x00000010, so the instruction induces a
vm_exit. This implies that the guest does not run while the host
changes the TSC offset on that CPU. In other words, the guest cannot
observe the new TSC before the host has written the new TSC offset.
So, if the timestamps in the guest trace are not monotonically
increasing, we know that the guest executed write_tsc. Moreover, from
the point where the timestamps decrease, we can tell where in the
guest trace the host rewrote the TSC offset. Therefore, we can sort
the trace data in chronological order.
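The rule above (a decreasing timestamp on a single-vCPU guest marks a
write_tsc, and therefore a TSC offset rewrite) could be sketched as
follows; the function and data layout are hypothetical:

```python
def split_on_tsc_write(guest_events):
    """Split a single-vCPU guest event stream into segments wherever the
    timestamp decreases. Each decrease marks a write_tsc, so each segment
    must be converted with its own TSC offset when merging."""
    segments, current, last_ts = [], [], None
    for ts, name in guest_events:
        if last_ts is not None and ts < last_ts:
            segments.append(current)
            current = []
        current.append((ts, name))
        last_ts = ts
    if current:
        segments.append(current)
    return segments
```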
>>> Another issue as mentioned is lack of TSC synchronization in the host.
>>> Should you provide such a feature without the possibility of proper
>>> chronological order on systems with unsynchronized TSC?
>>
>> I think, we cannot support this sorting feature using TSC on systems
>> with unsynchronized TSC. On systems with unsynchronized TSC, it is
>> difficult to sort not only trace data of guests and the host but trace
>> data of a guest or a host using TSC in chronological order. Actually,
>> if we want to output tracing data of ftrace in chronological order with
>> unsynchronized TSC, we will use the "global" mode as the timestamp. The
>> global mode uses wallclock added TSC correction, so the mode guarantees
>> to sort in chronological order for trace data of the guest or of
>> the host. If we use this mode to sort the trace data of guests and the
>> host in chronological order, we need to consider about the difference
>> between the guest and the host and timekeeping of guests and the host,
>> so it is difficult to solve these issues. At least, I haven't came up
>> with the good solution.
>
> I suppose the tradeoff is performance (RDTSC) versus reliability, when
> using ftrace. But then, even ftrace on the host suffers from the
> same problem, with unsynchronized TSCs.
Yes, that's true.
>> We cannot sort the trace data of guests and the host in chronological
>> order with unsynchronized TSC, but if we can set following
>> synchronization events for both guests and the host, we will know where
>> we should sort.
>>
>> First, a guest and the host uses the global mode as the timestamp of
>> ftrace. Next, a user on the guest writes "1" to the synchronization I/F
>> as the ID, then the synchronization event "1" is recorded in a
>> ring-buffer of the guest. The synchronization operation induces
>> hypercall, so the host can handle the event. After the operation moves
>> to the host, the host records the event "1" in a ring-buffer of the
>> host. In the end, the operation returns to the host, and the
>> synchronization is finished.
>>
>> When we integrate tracing data of the guest and the host, we
>> calculate difference of the timestamp between the synchronizing events
>> with the same ID. This value is a temporary "offset". We will convert
>> the timestamp of the guests to the timestamp of the host before the
>> next synchronizing event. If the synchronizing event cycle is very
>> short, we will not need to consider the timekeeping. Then, we can sort
>> the trace data in chronological order.
>>
>> Would you comment for this or do you have another idea?
>
> Performance of any solution across without synchronized TSC will be bad.
> Lets try to reduce coverage of the feature by providing ordering of
> guest/host events on per-vcpu basis (that is, you can only
> chronologically order events on a per-vm basis if the host TSC is
> synchronized).
OK. For the next patch, I'll document the restriction that the host
TSC must be synchronized when this feature is used to sort trace data
in chronological order.
> Which depends on the discussion above about multiple tsc offsets
> in the host trace.
>
> BTW, this issue came up during the KVM-RT BOF at KVMForum earlier this
> month. Currently there is no way to correlate (and be able to measure)
> events across host/guest, to profile RT behaviour.
Yes, this feature will be helpful in RT virtualization systems :)
Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Thu, Nov 22, 2012 at 02:21:20PM +0900, Yoshihiro YUNOMAE wrote:
> Hi Marcelo,
>
> (2012/11/21 7:51), Marcelo Tosatti wrote:
> >On Tue, Nov 20, 2012 at 07:36:33PM +0900, Yoshihiro YUNOMAE wrote:
> >>Hi Marcelo,
> >>
> >>Sorry for the late reply.
> >>
> >>(2012/11/17 4:15), Marcelo Tosatti wrote:
> >>>On Wed, Nov 14, 2012 at 05:26:10PM +0900, Yoshihiro YUNOMAE wrote:
> >>>>Thank you for commenting on my patch set.
> >>>>
> >>>>(2012/11/14 11:31), Steven Rostedt wrote:
> >>>>>On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
> >>>>>>On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
> >>>>>>>On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
> >>>>>>>
> >>>>>>>>To merge the data like previous pattern, we apply this patch set. Then, we can
> >>>>>>>>get TSC offset of the guest as follows:
> >>>>>>>>
> >>>>>>>>$ dmesg | grep kvm
> >>>>>>>>[ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> >>>>>>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
> >>>>>>>> PID TSC offset |
> >>>>>>>> HOST TSC value --+
> >>>>>>>>
> >>>>>>>
> >>>>>>>Using printk to export something like this is IMO a nasty hack.
> >>>>>>>
> >>>>>>>Can't we create a /sys or /proc file to export the same thing?
> >>>>>>
> >>>>>>Since the value changes over the course of the trace, and seems to be
> >>>>>>part of the context of the trace, I think I'd include it as a
> >>>>>>tracepoint.
> >>>>>>
> >>>>>
> >>>>>I'm fine with that too.
> >>>>
> >>>>Using some tracepoint is a nice idea, but there is one problem. Here,
> >>>>our discussion point is "the event which TSC offset is changed does not
> >>>>frequently occur, but the buffer must keep the event data."
> >>>>
> >>>>There are two ideas for using tracepoint. First, we define new
> >>>>tracepoint for changed TSC offset. This is simple and the overhead will
> >>>>be low. However, this trace event stored in the buffer will be
> >>>>overwritten by other trace events because this TSC offset event does
> >>>>not frequently occur. Second, we add TSC offset information to the
> >>>>tracepoint frequently occured. For example, we assume that TSC offset
> >>>>information is added to arguments of trace_kvm_exit().
> >>>
> >>>The TSC offset is in the host trace. So given a host trace with two TSC
> >>>offset updates, how do you know which events in the guest trace
> >>>(containing a number of events) refer to which tsc offset update?
> >>>
> >>>Unless i am missing something, you can't solve this easily (well, except
> >>>exporting information to the guest that allows it to transform RDTSC ->
> >>>host TSC value, which can be done via pvclock).
> >>
> >>As you say, TSC offset events are in the host trace, but we don't need
> >>to notify guests of updating TSC offset. The offset event will output
> >>the next TSC offset value and the current TSC value, so we can
> >>calculate the guest TSC (T1) for the event. Guest TSCs since T1 can be
> >>converted to host TSC using the TSC offset, so we can integrate those
> >>trace data.
> >
> >Think of this scenario:
> >
> >host trace
> >1h. event tsc write tsc_offset=-1000
> >3h. vmenter
> >4h. vmexit
> >... (event sequence)
> >99h. vmexit
> >100h. event tsc_write tsc_offset=-2000
> >101h. vmenter
> >... (event sequence).
> >500h. event tsc_write tsc_offset=-3000
> >
> >Then a guest trace containing events with a TSC timestamp.
> >Which tsc_offset to use?
> >
> >(that is the problem, which unless i am mistaken can only be solved
> >easily if the guest can convert RDTSC -> TSC of host).
>
> There are three following cases of changing TSC offset:
> 1. Reset TSC at guest boot time
> 2. Adjust TSC offset due to some host's problems
> 3. Write TSC on guests
> The scenario which you mentioned is case 3, so we'll discuss this case.
> Here, we assume that a guest is allocated single CPU for the sake of
> ease.
>
> If a guest executes write_tsc, TSC values jumps to forward or backward.
> For the forward case, trace data are as follows:
>
> < host > < guest >
> cycles events cycles events
> 3000 tsc_offset=-2950
> 3001 kvm_enter
> 53 eventX
> ....
> 100 (write_tsc=+900)
> 3060 kvm_exit
> 3075 tsc_offset=-2050
> 3080 kvm_enter
> 1050 event1
> 1055 event2
> ...
>
>
> This case is simple. The guest TSC of the first kvm_enter is calculated
> as follows:
>
> (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51
>
> Similarly, the guest TSC of the second kvm_enter is 130. So, the guest
> events between 51 and 130, that is, 53 eventX is inserted between the
> first pair of kvm_enter and kvm_exit. To insert events of the guests
> between 51 and 130, we convert the guest TSC to the host TSC using TSC
> offset 2950.
>
> For the backward case, trace data are as follows:
>
> < host > < guest >
> cycles events cycles events
> 3000 tsc_offset=-2950
> 3001 kvm_enter
> 53 eventX
> ....
> 100 (write_tsc=-50)
> 3060 kvm_exit
> 3075 tsc_offset=-3000
> 3080 kvm_enter
> 90 event1
> 95 event2
> ...
3400 100 (write_tsc=-50)
90 event3
95 event4
> As you say, in this case, the previous method is invalid. When we
> calculate the guest TSC value for the tsc_offset=-3000 event, the value
> is 75 on the guest. This seems like prior event of write_tsc=-50 event.
> So, we need to consider more.
>
> In this case, it is important that we can understand where the guest
> executes write_tsc or the host rewrites the TSC offset. write_tsc on
> the guest equals wrmsr 0x00000010, so this instruction induces vm_exit.
> This implies that the guest does not operate when the host changes TSC
> offset on the cpu. In other words, the guest cannot use new TSC before
> the host rewrites the new TSC offset. So, if timestamp on the guest is
> not monotonically increased, we can understand the guest executes
> write_tsc. Moreover, in the region where timestamp is decreasing, we
> can understand when the host rewrote the TSC offset in the guest trace
> data. Therefore, we can sort trace data in chronological order.
This requires an entire trace of events. That is, to reconstruct the
timeline you need the entire trace from the moment the guest starts,
so that you can correlate each wrmsr-to-TSC in the guest with the
corresponding vmexit-due-to-tsc-write on the host.
Which means that running out of space in the trace buffer equals losing
the ability to order events.
Is that desirable? It seems cumbersome to me.
Also the need to correlate each write_tsc event in the guest trace
with a corresponding tsc_offset write in the host trace means that it
is _necessary_ for the guest and host to enable tracing simultaneously.
Correct?
Also, there are WRMSR executions in the guest for which there is
no event in the trace buffer: from SeaBIOS, during boot.
In that case, there is no explicit event in the guest trace which you
can correlate with tsc_offset changes on the host side.
If the guest had access to the host TSC value, these complications
would disappear.
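For reference, the arithmetic under discussion is small enough to sketch. A guest's RDTSC returns host_tsc + tsc_offset, so a merge tool converts in the other direction; the helpers below are an illustration with hypothetical names, not code from the patch set:

```python
def host_to_guest_tsc(host_tsc, tsc_offset):
    """What the guest reads via RDTSC while this offset is active."""
    return host_tsc + tsc_offset

def guest_to_host_tsc(guest_tsc, tsc_offset):
    """Map a guest trace timestamp back onto the host timebase."""
    return guest_tsc - tsc_offset

def backward_jumps(timestamps):
    """Indices where a guest timestamp decreases: the heuristic signal
    that the guest executed write_tsc (the host changed the offset)."""
    return [i for i in range(1, len(timestamps))
            if timestamps[i] < timestamps[i - 1]]
```

With the thread's numbers, host_to_guest_tsc(3001, -2950) gives 51, the guest TSC of the first kvm_enter.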
Hi Marcelo,
(2012/11/24 7:46), Marcelo Tosatti wrote:
> On Thu, Nov 22, 2012 at 02:21:20PM +0900, Yoshihiro YUNOMAE wrote:
>> Hi Marcelo,
>>
>> (2012/11/21 7:51), Marcelo Tosatti wrote:
>>> On Tue, Nov 20, 2012 at 07:36:33PM +0900, Yoshihiro YUNOMAE wrote:
>>>> Hi Marcelo,
>>>>
>>>> Sorry for the late reply.
>>>>
>>>> (2012/11/17 4:15), Marcelo Tosatti wrote:
>>>>> On Wed, Nov 14, 2012 at 05:26:10PM +0900, Yoshihiro YUNOMAE wrote:
>>>>>> Thank you for commenting on my patch set.
>>>>>>
>>>>>> (2012/11/14 11:31), Steven Rostedt wrote:
>>>>>>> On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
>>>>>>>> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <[email protected]> wrote:
>>>>>>>>> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
>>>>>>>>>
>>>>>>>>>> To merge the data like previous pattern, we apply this patch set. Then, we can
>>>>>>>>>> get TSC offset of the guest as follows:
>>>>>>>>>>
>>>>>>>>>> $ dmesg | grep kvm
>>>>>>>>>> [ 57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>>>>>>>>>> ^^^^ ^^^^^^^^^^^^^^^^^^^^ |
>>>>>>>>>> PID TSC offset |
>>>>>>>>>> HOST TSC value --+
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Using printk to export something like this is IMO a nasty hack.
>>>>>>>>>
>>>>>>>>> Can't we create a /sys or /proc file to export the same thing?
>>>>>>>>
>>>>>>>> Since the value changes over the course of the trace, and seems to be
>>>>>>>> part of the context of the trace, I think I'd include it as a
>>>>>>>> tracepoint.
>>>>>>>>
>>>>>>>
>>>>>>> I'm fine with that too.
>>>>>>
>>>>>> Using some tracepoint is a nice idea, but there is one problem. Here,
>>>>>> our discussion point is "the event which TSC offset is changed does not
>>>>>> frequently occur, but the buffer must keep the event data."
>>>>>>
>>>>>> There are two ideas for using tracepoint. First, we define new
>>>>>> tracepoint for changed TSC offset. This is simple and the overhead will
>>>>>> be low. However, this trace event stored in the buffer will be
>>>>>> overwritten by other trace events because this TSC offset event does
>>>>>> not frequently occur. Second, we add TSC offset information to the
>>>>>> tracepoint frequently occured. For example, we assume that TSC offset
>>>>>> information is added to arguments of trace_kvm_exit().
>>>>>
>>>>> The TSC offset is in the host trace. So given a host trace with two TSC
>>>>> offset updates, how do you know which events in the guest trace
>>>>> (containing a number of events) refer to which tsc offset update?
>>>>>
>>>>> Unless i am missing something, you can't solve this easily (well, except
>>>>> exporting information to the guest that allows it to transform RDTSC ->
>>>>> host TSC value, which can be done via pvclock).
>>>>
>>>> As you say, TSC offset events are in the host trace, but we don't need
>>>> to notify guests of updating TSC offset. The offset event will output
>>>> the next TSC offset value and the current TSC value, so we can
>>>> calculate the guest TSC (T1) for the event. Guest TSCs since T1 can be
>>>> converted to host TSC using the TSC offset, so we can integrate those
>>>> trace data.
>>>
>>> Think of this scenario:
>>>
>>> host trace
>>> 1h. event tsc write tsc_offset=-1000
>>> 3h. vmenter
>>> 4h. vmexit
>>> ... (event sequence)
>>> 99h. vmexit
>>> 100h. event tsc_write tsc_offset=-2000
>>> 101h. vmenter
>>> ... (event sequence).
>>> 500h. event tsc_write tsc_offset=-3000
>>>
>>> Then a guest trace containing events with a TSC timestamp.
>>> Which tsc_offset to use?
>>>
>>> (that is the problem, which unless i am mistaken can only be solved
>>> easily if the guest can convert RDTSC -> TSC of host).
>>
>> There are three following cases of changing TSC offset:
>> 1. Reset TSC at guest boot time
>> 2. Adjust TSC offset due to some host's problems
>> 3. Write TSC on guests
>> The scenario which you mentioned is case 3, so we'll discuss this case.
>> Here, we assume that a guest is allocated single CPU for the sake of
>> ease.
>>
>> If a guest executes write_tsc, TSC values jumps to forward or backward.
>> For the forward case, trace data are as follows:
>>
>> < host > < guest >
>> cycles events cycles events
>> 3000 tsc_offset=-2950
>> 3001 kvm_enter
>> 53 eventX
>> ....
>> 100 (write_tsc=+900)
>> 3060 kvm_exit
>> 3075 tsc_offset=-2050
>> 3080 kvm_enter
>> 1050 event1
>> 1055 event2
>> ...
>>
>>
>> This case is simple. The guest TSC of the first kvm_enter is calculated
>> as follows:
>>
>> (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51
>>
>> Similarly, the guest TSC of the second kvm_enter is 130. So, the guest
>> events between 51 and 130, that is, 53 eventX is inserted between the
>> first pair of kvm_enter and kvm_exit. To insert events of the guests
>> between 51 and 130, we convert the guest TSC to the host TSC using TSC
>> offset 2950.
>>
>> For the backward case, trace data are as follows:
>>
>> < host > < guest >
>> cycles events cycles events
>> 3000 tsc_offset=-2950
>> 3001 kvm_enter
>> 53 eventX
>> ....
>> 100 (write_tsc=-50)
>> 3060 kvm_exit
>>    3075 tsc_offset=-3000
>> 3080 kvm_enter
>> 90 event1
>> 95 event2
>> ...
>
> 3400 100 (write_tsc=-50)
>
> 90 event3
> 95 event4
>
>> As you say, in this case, the previous method is invalid. When we
>> calculate the guest TSC value for the tsc_offset=-3000 event, the value
>> is 75 on the guest. This seems like prior event of write_tsc=-50 event.
>> So, we need to consider more.
>>
>> In this case, it is important that we can understand where the guest
>> executes write_tsc or the host rewrites the TSC offset. write_tsc on
>> the guest equals wrmsr 0x00000010, so this instruction induces vm_exit.
>> This implies that the guest does not operate when the host changes TSC
>> offset on the cpu. In other words, the guest cannot use new TSC before
>> the host rewrites the new TSC offset. So, if timestamp on the guest is
>> not monotonically increased, we can understand the guest executes
>> write_tsc. Moreover, in the region where timestamp is decreasing, we
>> can understand when the host rewrote the TSC offset in the guest trace
>> data. Therefore, we can sort trace data in chronological order.
>
> This requires an entire trace of events. That is, to be able
> to reconstruct timeline you require the entire trace from the moment
> guest starts. So that you can correlate wrmsr-to-tsc on the guest with
> vmexit-due-to-tsc-write on the host.
>
> Which means that running out of space for trace buffer equals losing
> ability to order events.
>
> Is that desirable? It seems cumbersome to me.
As you say, frequently occurring trace events can overwrite important
events like kvm_exit/entry or write_tsc_offset. So, this feature needs
Steven's multiple-buffer work: normal events, which hit often, are
recorded in buffer A, and important events, which hit rarely, are
recorded in buffer B. In our case, the important event is
write_tsc_offset.
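The retention argument can be illustrated with two fixed-size ring buffers; this is only a model of the idea (rare events kept in their own buffer so frequent events cannot overwrite them), not ftrace's actual implementation:

```python
from collections import deque

# Buffer A: frequent events, a small ring that wraps and drops old entries.
# Buffer B: rare but important events (write_tsc_offset in our case).
buffer_a = deque(maxlen=4)
buffer_b = deque(maxlen=64)

def record(event, important=False):
    (buffer_b if important else buffer_a).append(event)

record("write_tsc_offset prev=-2950 next=-3000", important=True)
for i in range(1000):           # frequent events wrap buffer A many times
    record("kvm_exit/kvm_entry %d" % i)
```

After the loop, buffer A holds only the last four frequent events, while the single write_tsc_offset event still sits in buffer B.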
> Also the need to correlate each write_tsc event in the guest trace
> with a corresponding tsc_offset write in the host trace means that it
> is _necessary_ for the guest and host to enable tracing simultaneously.
> Correct?
>
> Also, there are WRMSR executions in the guest for which there is
> no event in the trace buffer. From SeaBIOS, during boot.
> In that case, there is no explicit event in the guest trace which you
> can correlate with tsc_offset changes in the host side.
I understand what you want to say, but we don't correlate the
write_tsc event with the write_tsc_offset event directly. This is
because no write_tsc tracepoint (for the WRMSR instruction) exists in
the current kernel. So, in the previous mail
(https://lkml.org/lkml/2012/11/22/53), I suggested a method which does
not need a write_tsc tracepoint.
In that method, we enable ftrace before the guest boots, and we need to
keep all write_tsc_offset events in the buffer. If we forget to enable
ftrace, or we don't use multiple buffers, we cannot use this feature.
So, I think as Peter says, the host should also export TSC offset
information to /proc/pid/kvm/*.
> If the guest had access to the host TSC value, these complications
> would disappear.
I think a TSC-offset-zero mode would be useful as a debugging mode
(though not for real operation).
Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Mon, Nov 26, 2012 at 08:05:10PM +0900, Yoshihiro YUNOMAE wrote:
> >>>500h. event tsc_write tsc_offset=-3000
> >>>
> >>>Then a guest trace containing events with a TSC timestamp.
> >>>Which tsc_offset to use?
> >>>
> >>>(that is the problem, which unless i am mistaken can only be solved
> >>>easily if the guest can convert RDTSC -> TSC of host).
> >>
> >>There are three following cases of changing TSC offset:
> >> 1. Reset TSC at guest boot time
> >> 2. Adjust TSC offset due to some host's problems
> >> 3. Write TSC on guests
> >>The scenario which you mentioned is case 3, so we'll discuss this case.
> >>Here, we assume that a guest is allocated single CPU for the sake of
> >>ease.
> >>
> >>If a guest executes write_tsc, TSC values jumps to forward or backward.
> >>For the forward case, trace data are as follows:
> >>
> >>< host > < guest >
> >>cycles events cycles events
> >> 3000 tsc_offset=-2950
> >> 3001 kvm_enter
> >> 53 eventX
> >> ....
> >> 100 (write_tsc=+900)
> >> 3060 kvm_exit
> >> 3075 tsc_offset=-2050
> >> 3080 kvm_enter
> >> 1050 event1
> >> 1055 event2
> >> ...
> >>
> >>
> >>This case is simple. The guest TSC of the first kvm_enter is calculated
> >>as follows:
> >>
> >> (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51
> >>
> >>Similarly, the guest TSC of the second kvm_enter is 130. So, the guest
> >>events between 51 and 130, that is, 53 eventX is inserted between the
> >>first pair of kvm_enter and kvm_exit. To insert events of the guests
> >>between 51 and 130, we convert the guest TSC to the host TSC using TSC
> >>offset 2950.
> >>
> >>For the backward case, trace data are as follows:
> >>
> >>< host > < guest >
> >>cycles events cycles events
> >> 3000 tsc_offset=-2950
> >> 3001 kvm_enter
> >> 53 eventX
> >> ....
> >> 100 (write_tsc=-50)
> >> 3060 kvm_exit
> >>    3075 tsc_offset=-3000
> >> 3080 kvm_enter
> >> 90 event1
> >> 95 event2
> >> ...
> >
> > 3400 100 (write_tsc=-50)
> >
> > 90 event3
> > 95 event4
> >
> >>As you say, in this case, the previous method is invalid. When we
> >>calculate the guest TSC value for the tsc_offset=-3000 event, the value
> >>is 75 on the guest. This seems like prior event of write_tsc=-50 event.
> >>So, we need to consider more.
> >>
> >>In this case, it is important that we can understand where the guest
> >>executes write_tsc or the host rewrites the TSC offset. write_tsc on
> >>the guest equals wrmsr 0x00000010, so this instruction induces vm_exit.
> >>This implies that the guest does not operate when the host changes TSC
> >>offset on the cpu. In other words, the guest cannot use new TSC before
> >>the host rewrites the new TSC offset. So, if timestamp on the guest is
> >>not monotonically increased, we can understand the guest executes
> >>write_tsc. Moreover, in the region where timestamp is decreasing, we
> >>can understand when the host rewrote the TSC offset in the guest trace
> >>data. Therefore, we can sort trace data in chronological order.
> >
> >This requires an entire trace of events. That is, to be able
> >to reconstruct timeline you require the entire trace from the moment
> >guest starts. So that you can correlate wrmsr-to-tsc on the guest with
> >vmexit-due-to-tsc-write on the host.
> >
> >Which means that running out of space for trace buffer equals losing
> >ability to order events.
> >
> >Is that desirable? It seems cumbersome to me.
>
> As you say, tracing events can overwrite important events like
> kvm_exit/entry or write_tsc_offset. So, Steven's multiple buffer is
> needed by this feature. Normal events which often hit record the buffer
> A, and important events which rarely hit record the buffer B. In our
> case, the important event is write_tsc_offset.
> >Also the need to correlate each write_tsc event in the guest trace
> >with a corresponding tsc_offset write in the host trace means that it
> >is _necessary_ for the guest and host to enable tracing simultaneously.
> >Correct?
> >
> >Also, there are WRMSR executions in the guest for which there is
> >no event in the trace buffer. From SeaBIOS, during boot.
> >In that case, there is no explicit event in the guest trace which you
> >can correlate with tsc_offset changes in the host side.
>
> I understand that you want to say, but we don't correlate between
> write_tsc event and write_tsc_offset event directly. This is because
> the write_tsc tracepoint (also WRMSR instruction) is not prepared in
> the current kernel. So, in the previous mail
> (https://lkml.org/lkml/2012/11/22/53), I suggested the method which we
> don't need to prepare the write_tsc tracepoint.
>
> In the method, we enable ftrace before the guest boots, and we need to
> keep all write_tsc_offset events in the buffer. If we forgot enabling
> ftrace or we don't use multiple buffers, we don't use this feature.
Yoshihiro,
Better have a single method to convert guest TSC to host TSC.
Ok, if you keep both TSC offset write events and guest TSC writes (*)
in separate buffers which are persistent, then you can convert
guest-tsc-events to host-tsc.
Can you please write a succinct but complete description of the method
so it can be verified?
(*) note guest TSC writes have no events because Linux does not write
to TSC offset, but a "system booted" event can be used to correlate
with the TSC write by BIOS.
Thanks
> So, I think as Peter says, the host should also export TSC offset
> information to /proc/pid/kvm/*.
>
> >If the guest had access to the host TSC value, these complications
> >would disappear.
>
> As a debugging mode, the TSC offset zero mode will be useful, I think.
> (not for the real operation mode)
>
> Thanks,
Hi Marcelo,
(2012/11/27 8:16), Marcelo Tosatti wrote:
> On Mon, Nov 26, 2012 at 08:05:10PM +0900, Yoshihiro YUNOMAE wrote:
>>>>> 500h. event tsc_write tsc_offset=-3000
>>>>>
>>>>> Then a guest trace containing events with a TSC timestamp.
>>>>> Which tsc_offset to use?
>>>>>
>>>>> (that is the problem, which unless i am mistaken can only be solved
>>>>> easily if the guest can convert RDTSC -> TSC of host).
>>>>
>>>> There are three following cases of changing TSC offset:
>>>> 1. Reset TSC at guest boot time
>>>> 2. Adjust TSC offset due to some host's problems
>>>> 3. Write TSC on guests
>>>> The scenario which you mentioned is case 3, so we'll discuss this case.
>>>> Here, we assume that a guest is allocated single CPU for the sake of
>>>> ease.
>>>>
>>>> If a guest executes write_tsc, TSC values jumps to forward or backward.
>>>> For the forward case, trace data are as follows:
>>>>
>>>> < host > < guest >
>>>> cycles events cycles events
>>>> 3000 tsc_offset=-2950
>>>> 3001 kvm_enter
>>>> 53 eventX
>>>> ....
>>>> 100 (write_tsc=+900)
>>>> 3060 kvm_exit
>>>> 3075 tsc_offset=-2050
>>>> 3080 kvm_enter
>>>> 1050 event1
>>>> 1055 event2
>>>> ...
>>>>
>>>>
>>>> This case is simple. The guest TSC of the first kvm_enter is calculated
>>>> as follows:
>>>>
>>>> (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51
>>>>
>>>> Similarly, the guest TSC of the second kvm_enter is 130. So, the guest
>>>> events between 51 and 130, that is, 53 eventX is inserted between the
>>>> first pair of kvm_enter and kvm_exit. To insert events of the guests
>>>> between 51 and 130, we convert the guest TSC to the host TSC using TSC
>>>> offset 2950.
>>>>
>>>> For the backward case, trace data are as follows:
>>>>
>>>> < host > < guest >
>>>> cycles events cycles events
>>>> 3000 tsc_offset=-2950
>>>> 3001 kvm_enter
>>>> 53 eventX
>>>> ....
>>>> 100 (write_tsc=-50)
>>>> 3060 kvm_exit
>>>>    3075 tsc_offset=-3000
>>>> 3080 kvm_enter
>>>> 90 event1
>>>> 95 event2
>>>> ...
>>>
>>> 3400 100 (write_tsc=-50)
>>>
>>> 90 event3
>>> 95 event4
>>>
>>>> As you say, in this case, the previous method is invalid. When we
>>>> calculate the guest TSC value for the tsc_offset=-3000 event, the value
>>>> is 75 on the guest. This seems like prior event of write_tsc=-50 event.
>>>> So, we need to consider more.
>>>>
>>>> In this case, it is important that we can understand where the guest
>>>> executes write_tsc or the host rewrites the TSC offset. write_tsc on
>>>> the guest equals wrmsr 0x00000010, so this instruction induces vm_exit.
>>>> This implies that the guest does not operate when the host changes TSC
>>>> offset on the cpu. In other words, the guest cannot use new TSC before
>>>> the host rewrites the new TSC offset. So, if timestamp on the guest is
>>>> not monotonically increased, we can understand the guest executes
>>>> write_tsc. Moreover, in the region where timestamp is decreasing, we
>>>> can understand when the host rewrote the TSC offset in the guest trace
>>>> data. Therefore, we can sort trace data in chronological order.
>>>
>>> This requires an entire trace of events. That is, to be able
>>> to reconstruct timeline you require the entire trace from the moment
>>> guest starts. So that you can correlate wrmsr-to-tsc on the guest with
>>> vmexit-due-to-tsc-write on the host.
>>>
>>> Which means that running out of space for trace buffer equals losing
>>> ability to order events.
>>>
>>> Is that desirable? It seems cumbersome to me.
>>
>> As you say, tracing events can overwrite important events like
>> kvm_exit/entry or write_tsc_offset. So, Steven's multiple buffer is
>> needed by this feature. Normal events which often hit record the buffer
>> A, and important events which rarely hit record the buffer B. In our
>> case, the important event is write_tsc_offset.
>>> Also the need to correlate each write_tsc event in the guest trace
>>> with a corresponding tsc_offset write in the host trace means that it
>>> is _necessary_ for the guest and host to enable tracing simultaneously.
>>> Correct?
>>>
>>> Also, there are WRMSR executions in the guest for which there is
>>> no event in the trace buffer. From SeaBIOS, during boot.
>>> In that case, there is no explicit event in the guest trace which you
>>> can correlate with tsc_offset changes in the host side.
>>
>> I understand that you want to say, but we don't correlate between
>> write_tsc event and write_tsc_offset event directly. This is because
>> the write_tsc tracepoint (also WRMSR instruction) is not prepared in
>> the current kernel. So, in the previous mail
>> (https://lkml.org/lkml/2012/11/22/53), I suggested the method which we
>> don't need to prepare the write_tsc tracepoint.
>>
>> In the method, we enable ftrace before the guest boots, and we need to
>> keep all write_tsc_offset events in the buffer. If we forgot enabling
>> ftrace or we don't use multiple buffers, we don't use this feature.
>
> Yoshihiro,
>
> Better have a single method to convert guest TSC to host TSC.
>
> Ok, if you keep both TSC offset write events and guest TSC writes (*)
> in separate buffers which are persistent, then you can convert
> guest-tsc-events to host-tsc.
>
> Can you please write a succint but complete description of the method
> so it can be verified?
Sure.
- Prerequisite
1. The host TSC is synchronized and stable.
2. kvm_write_tsc_offset events include previous and next TSC offset
   values.
3. Every event's trace_clock is TSC.
- Assumption for the sake of ease
1. One VCPU
2. The guest executes write_tsc at most once.
- Configuration
1. On the host, kvm_exit/entry events are recorded in the buffer A and
kvm_write_tsc_offset events are recorded in the buffer B.
2. Boot a guest
- Sort method
1.
Check whether any kvm_write_tsc_offset events are recorded except for
the one at boot. Note that a vcpu thread writes the TSC offset at boot
as an initial operation.
1-1.
If none are recorded, the guest did not execute write_tsc, so we
convert the guest TSC to the host TSC using the TSC offset from boot.
=> END
1-2.
If one is recorded, the guest executed write_tsc, so we use the new
kvm_write_tsc_offset event's information.
2.
We convert the host TSC (Th) of the kvm_write_tsc_offset event to
the guest TSC (Tg) using the previous_tsc_offset value:
  Tg = Th + previous_tsc_offset
3.
To search for the point where the guest executed write_tsc, we find "n"
which satisfies the condition Tn < Tg < Tn+1 or Tn+1 < Tn < Tg,
scanning from the older events of the guest. The former condition means
trace data of the guest were recorded monotonically. The latter
condition means the guest's timestamps moved backward.
4.
We convert the guest TSC of trace data to the host TSC using
previous_tsc_offset value before "n" and using next_tsc_offset value
after "n+1".
=> END
- Note
For the sake of ease, we assumed one vcpu and at most one write_tsc.
For other conditions, we will use a similar method.
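Under the stated assumptions (one VCPU, at most one write_tsc), the steps above can be sketched as follows; the event formats and function names are illustrative, not the posted tool:

```python
def find_write_tsc_point(ts, tg):
    """Step 3: find "n", the last guest event before the write_tsc.
    ts: guest timestamps in recorded order; tg: guest-side time of the
    kvm_write_tsc_offset event (Tg = Th + previous_tsc_offset)."""
    for n in range(len(ts) - 1):
        if ts[n] < tg < ts[n + 1]:    # timestamps recorded monotonically
            return n
        if ts[n + 1] < ts[n] < tg:    # timestamps moved backward
            return n
    return None

def convert_guest_events(events, boot_offset, write_event=None):
    """events: list of (guest_tsc, name) in recorded order.
    write_event: optional (host_tsc, prev_offset, next_offset) taken
    from the kvm_write_tsc_offset event kept in buffer B."""
    if write_event is None:
        # Step 1-1: no write_tsc; the boot offset applies throughout.
        return [(t - boot_offset, name) for t, name in events]
    th, prev_off, next_off = write_event
    tg = th + prev_off               # step 2: Tg = Th + previous_tsc_offset
    n = find_write_tsc_point([t for t, _ in events], tg)
    out = []
    for i, (t, name) in enumerate(events):
        # Step 4: previous offset up to "n", next offset from "n+1" on.
        off = prev_off if (n is None or i <= n) else next_off
        out.append((t - off, name))
    return out
```

With the backward example's numbers (write_tsc_offset at host 3075, offsets -2950 then -3000, and a hypothetical eventY just before the write), the converted host timestamps come out monotonic.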
Thanks,
> (*) note guest TSC writes have no events because Linux does not write
> to TSC offset, but a "system booted" event can be used to correlate
> with the TSC write by BIOS.
>
> Thanks
>
>> So, I think as Peter says, the host should also export TSC offset
>> information to /proc/pid/kvm/*.
>>
>>> If the guest had access to the host TSC value, these complications
>>> would disappear.
>>
>> As a debugging mode, the TSC offset zero mode will be useful, I think.
>> (not for the real operation mode)
>>
>> Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Tue, Nov 27, 2012 at 07:53:47PM +0900, Yoshihiro YUNOMAE wrote:
> Hi Marcelo,
>
> (2012/11/27 8:16), Marcelo Tosatti wrote:
> >On Mon, Nov 26, 2012 at 08:05:10PM +0900, Yoshihiro YUNOMAE wrote:
> >>>>>500h. event tsc_write tsc_offset=-3000
> >>>>>
> >>>>>Then a guest trace containing events with a TSC timestamp.
> >>>>>Which tsc_offset to use?
> >>>>>
> >>>>>(that is the problem, which unless i am mistaken can only be solved
> >>>>>easily if the guest can convert RDTSC -> TSC of host).
> >>>>
> >>>>There are three following cases of changing TSC offset:
> >>>> 1. Reset TSC at guest boot time
> >>>> 2. Adjust TSC offset due to some host's problems
> >>>> 3. Write TSC on guests
> >>>>The scenario which you mentioned is case 3, so we'll discuss this case.
> >>>>Here, we assume that a guest is allocated single CPU for the sake of
> >>>>ease.
> >>>>
> >>>>If a guest executes write_tsc, TSC values jumps to forward or backward.
> >>>>For the forward case, trace data are as follows:
> >>>>
> >>>>< host > < guest >
> >>>>cycles events cycles events
> >>>> 3000 tsc_offset=-2950
> >>>> 3001 kvm_enter
> >>>> 53 eventX
> >>>> ....
> >>>> 100 (write_tsc=+900)
> >>>> 3060 kvm_exit
> >>>> 3075 tsc_offset=-2050
> >>>> 3080 kvm_enter
> >>>> 1050 event1
> >>>> 1055 event2
> >>>> ...
> >>>>
> >>>>
> >>>>This case is simple. The guest TSC of the first kvm_enter is calculated
> >>>>as follows:
> >>>>
> >>>> (host TSC of kvm_enter) + (current tsc_offset) = 3001 - 2950 = 51
> >>>>
> >>>>Similarly, the guest TSC of the second kvm_enter is 130. So, the guest
> >>>>events between 51 and 130, that is, 53 eventX is inserted between the
> >>>>first pair of kvm_enter and kvm_exit. To insert events of the guests
> >>>>between 51 and 130, we convert the guest TSC to the host TSC using TSC
> >>>>offset 2950.
> >>>>
> >>>>For the backward case, trace data are as follows:
> >>>>
> >>>>< host > < guest >
> >>>>cycles events cycles events
> >>>> 3000 tsc_offset=-2950
> >>>> 3001 kvm_enter
> >>>> 53 eventX
> >>>> ....
> >>>> 100 (write_tsc=-50)
> >>>> 3060 kvm_exit
> >>>>    3075 tsc_offset=-3000
> >>>> 3080 kvm_enter
> >>>> 90 event1
> >>>> 95 event2
> >>>> ...
> >>>
> >>> 3400 100 (write_tsc=-50)
> >>>
> >>> 90 event3
> >>> 95 event4
> >>>
> >>>>As you say, in this case, the previous method is invalid. When we
> >>>>calculate the guest TSC value for the tsc_offset=-3000 event, the value
> >>>>is 75 on the guest. This seems like prior event of write_tsc=-50 event.
> >>>>So, we need to consider more.
> >>>>
> >>>>In this case, it is important that we can understand where the guest
> >>>>executes write_tsc or the host rewrites the TSC offset. write_tsc on
> >>>>the guest equals wrmsr 0x00000010, so this instruction induces vm_exit.
> >>>>This implies that the guest does not operate when the host changes TSC
> >>>>offset on the cpu. In other words, the guest cannot use new TSC before
> >>>>the host rewrites the new TSC offset. So, if timestamp on the guest is
> >>>>not monotonically increased, we can understand the guest executes
> >>>>write_tsc. Moreover, in the region where timestamp is decreasing, we
> >>>>can understand when the host rewrote the TSC offset in the guest trace
> >>>>data. Therefore, we can sort trace data in chronological order.
> >>>
> >>>This requires an entire trace of events. That is, to be able
> >>>to reconstruct timeline you require the entire trace from the moment
> >>>guest starts. So that you can correlate wrmsr-to-tsc on the guest with
> >>>vmexit-due-to-tsc-write on the host.
> >>>
> >>>Which means that running out of space for trace buffer equals losing
> >>>ability to order events.
> >>>
> >>>Is that desirable? It seems cumbersome to me.
> >>
> >>As you say, tracing events can overwrite important events like
> >>kvm_exit/entry or write_tsc_offset. So, Steven's multiple buffer is
> >>needed by this feature. Normal events which often hit record the buffer
> >>A, and important events which rarely hit record the buffer B. In our
> >>case, the important event is write_tsc_offset.
> >>>Also the need to correlate each write_tsc event in the guest trace
> >>>with a corresponding tsc_offset write in the host trace means that it
> >>>is _necessary_ for the guest and host to enable tracing simultaneously.
> >>>Correct?
> >>>
> >>>Also, there are WRMSR executions in the guest for which there is
> >>>no event in the trace buffer. From SeaBIOS, during boot.
> >>>In that case, there is no explicit event in the guest trace which you
> >>>can correlate with tsc_offset changes in the host side.
> >>
> >>I understand that you want to say, but we don't correlate between
> >>write_tsc event and write_tsc_offset event directly. This is because
> >>the write_tsc tracepoint (also WRMSR instruction) is not prepared in
> >>the current kernel. So, in the previous mail
> >>(https://lkml.org/lkml/2012/11/22/53), I suggested the method which we
> >>don't need to prepare the write_tsc tracepoint.
> >>
> >>In the method, we enable ftrace before the guest boots, and we need to
> >>keep all write_tsc_offset events in the buffer. If we forgot enabling
> >>ftrace or we don't use multiple buffers, we don't use this feature.
> >
> >Yoshihiro,
> >
> >Better have a single method to convert guest TSC to host TSC.
> >
> >Ok, if you keep both TSC offset write events and guest TSC writes (*)
> >in separate buffers which are persistent, then you can convert
> >guest-tsc-events to host-tsc.
> >
> >Can you please write a succint but complete description of the method
> >so it can be verified?
>
> Sure.
>
> - Prerequisite
> 1. the host TSC is synchronized and stable.
> 2. kvm_write_tsc_offset events include previous and next TSC offset
> values.
> 3. Every event's trace_clock is TSC.
>
> - Assumption for the sake of ease
> 1. One VCPU
> 2. The guest executes no write_tsc or write_tsc only once.
>
> - Configuration
> 1. On the host, kvm_exit/entry events are recorded in the buffer A and
> kvm_write_tsc_offset events are recorded in the buffer B.
> 2. Boot a guest
>
> - Sort method
> 1.
> Confirm which the kvm_write_tsc_offset events are recorded except for
> boot. Note that a vcpu thread writes TSC offset when boot as an initial
> operation.
>
> 1-1.
> If not recorded, it means that the guest did not execute write_tsc.
> So, we convert the guest TSC to the host TSC using TSC offset of boot.
> => END
>
> 1-2.
> If recorded, it means that the guest executed write_tsc.
> So, we use new kvm_write_tsc_offset event information.
>
> 2.
> We convert the host TSC(Th) of the kvm_write_tsc_offset event to
> the guest TSC(Tg) using previous_tsc_offset value:
> Tg = Th + previous_tsc_offset
>
> 3.
> To search the point where the guest executed write_tsc, we find "n"
> which satisfies the condition Tn < Tg < Tn+1 or Tn+1 < Tn < Tg from
> older events of the guest.
> The former condition means trace data of
> the guest were recorded monotonically. On the other hand, the latter
> condition means trace data of the guest moved backward.
> 4.
> We convert the guest TSC of trace data to the host TSC using
> previous_tsc_offset value before "n" and using next_tsc_offset value
> after "n+1".
> => END
>
> - Note
> We assumed one vcpu and no write_tsc or write_tsc only once for the
> sake of ease. For other conditions, we will use similar method.
>
> Thanks,
There is no certainty. Consider the following available information:
guest trace        host trace
                   100: guest_tsc_write (tsc_offset=-100 => guest_tsc = 0)
                   104: guest_tsc_write (tsc_offset=-104 => guest_tsc = 0)
                   108: guest_tsc_write (tsc_offset=-108 => guest_tsc = 0)
1: eventA
2: eventB
3: eventC
1: eventD
2: eventE
3: eventF
How can you tell which tsc_offset to use for eventD? It could be either
-104 or -108. The notion of "next_tsc_offset" is subject to this
ambiguity.
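The ambiguity can be made concrete with the numbers above: each candidate offset places the second run of guest events at a different, equally self-consistent host time, so nothing in the traces picks one (illustration only):

```python
# Candidate offsets from the host trace's three guest_tsc_write events.
candidate_offsets = [-100, -104, -108]
run2 = [1, 2, 3]          # guest timestamps of eventD, eventE, eventF

# host_tsc = guest_tsc - tsc_offset; every candidate yields a monotonic
# placement, so the guest trace alone cannot disambiguate.
placements = {off: [t - off for t in run2] for off in candidate_offsets}
```

eventD lands at host time 101, 105 or 109 depending on the chosen offset.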
I suppose it's fine to restrict the interface as follows: the tool will
accept one trace of events with monotonic timestamps, and the user is
responsible for correlating that to a host trace.
That is, you can't feed distinct instances of guest kernel trace.
Hi Marcelo,
>>> Can you please write a succint but complete description of the method
>>> so it can be verified?
>>
>> Sure.
>>
>> - Prerequisite
>> 1. the host TSC is synchronized and stable.
>> 2. kvm_write_tsc_offset events include previous and next TSC offset
>> values.
>> 3. Every event's trace_clock is TSC.
>>
>> - Assumption for the sake of ease
>> 1. One VCPU
>> 2. The guest executes no write_tsc or write_tsc only once.
>>
>> - Configuration
>> 1. On the host, kvm_exit/entry events are recorded in the buffer A and
>> kvm_write_tsc_offset events are recorded in the buffer B.
>> 2. Boot a guest
>>
>> - Sort method
>> 1.
>> Confirm which the kvm_write_tsc_offset events are recorded except for
>> boot. Note that a vcpu thread writes TSC offset when boot as an initial
>> operation.
>>
>> 1-1.
>> If not recorded, it means that the guest did not execute write_tsc.
>> So, we convert the guest TSC to the host TSC using TSC offset of boot.
>> => END
>>
>> 1-2.
>> If recorded, it means that the guest executed write_tsc.
>> So, we use new kvm_write_tsc_offset event information.
>>
>> 2.
>> We convert the host TSC(Th) of the kvm_write_tsc_offset event to
>> the guest TSC(Tg) using previous_tsc_offset value:
>> Tg = Th + previous_tsc_offset
>>
>> 3.
>> To search the point where the guest executed write_tsc, we find "n"
>> which satisfies the condition Tn < Tg < Tn+1 or Tn+1 < Tn < Tg from
>> older events of the guest.
>> The former condition means trace data of
>> the guest were recorded monotonically. On the other hand, the latter
>> condition means trace data of the guest moved backward.
>> 4.
>> We convert the guest TSC of trace data to the host TSC using
>> previous_tsc_offset value before "n" and using next_tsc_offset value
>> after "n+1".
>> => END
>>
>> - Note
>> We assumed one vcpu and no write_tsc (or write_tsc only once) for the
>> sake of simplicity. For other conditions, we will use a similar method.
>>
>> Thanks,
>
> There is no certainty. Consider the following information available
>
> guest trace host trace
> 100: guest_tsc_write (tsc_offset=-100 => guest_tsc = 0)
> 104: guest_tsc_write (tsc_offset=-104 => guest_tsc = 0)
> 108: guest_tsc_write (tsc_offset=-108 => guest_tsc = 0)
> 1: eventA
> 2: eventB
> 3: eventC
> 1: eventD
> 2: eventE
> 3: eventF
>
> How can you tell which tsc_offset to use for eventD ? It could be either
> -104 or -108. The notion of "next_tsc_offset" is subject to such
> issue.
>
> I suppose it's fine to restrict the interface as follows: the tool will
> accept one trace of events with monotonic timestamps and the user is
> responsible for correlating that to a host trace.
OK, I'll add the restriction that trace data must have monotonic
timestamps to use this feature. I think guests will seldom execute
write_tsc, so in most cases timestamps will be recorded monotonically
in trace data.
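For that common case (monotonic guest timestamps, a single boot-time TSC
offset), the rebase-and-merge step can be sketched as below. The record
format, event names, and values here are hypothetical placeholders, not
the actual tool; real records would come from ftrace buffers recorded
with trace_clock=x86-tsc.

```python
import heapq

# Hypothetical (timestamp, event) records read from host and guest buffers.
host_events  = [(205, "kvm_entry"), (220, "kvm_exit")]
guest_events = [(105, "sched_wakeup"), (115, "sched_switch")]
tsc_offset   = -100   # from the kvm_write_tsc_offset event at guest boot

def merge_with_host(guest_events, host_events, tsc_offset):
    """Rebase guest timestamps onto the host TSC and merge chronologically.

    Since Tg = Th + tsc_offset, the host view of a guest event is
    Th = Tg - tsc_offset.
    """
    rebased = [(tg - tsc_offset, ev) for tg, ev in guest_events]
    return list(heapq.merge(sorted(rebased), sorted(host_events)))

for ts, ev in merge_with_host(guest_events, host_events, tsc_offset):
    print(ts, ev)
```

With multiple write_tsc events this single offset would have to be
replaced by the interval search described in steps 2-4 above.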
> That is, you can't feed distinct instances of guest kernel trace.
I'm not clear on "distinct instances". Is this about SMP or multiple
guests? Could you explain?
Thanks,
--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Fri, Nov 30, 2012 at 10:36:43AM +0900, Yoshihiro YUNOMAE wrote:
> >That is, you can't feed distinct instances of guest kernel trace.
>
> I'm not clear on "distinct instances". Is this about SMP or multiple
> guests? Could you explain?
Distinct boot instances. If the guest reboots, the TSC can be written to.
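The ambiguity raised earlier can be checked mechanically. A small sketch
using the numbers from that example (the helper function is made up for
illustration): each recorded offset is taken as valid from its
kvm_write_tsc_offset event until the next one, and a guest timestamp may
be consistent with more than one of those windows.

```python
# (host_tsc_of_write, tsc_offset_written) from kvm_write_tsc_offset events.
writes = [(100, -100), (104, -104), (108, -108)]

def candidate_host_times(guest_tsc):
    """Return every host TSC consistent with some recorded offset.

    Tg = Th + offset, so the host view of a guest event is Th = Tg - offset.
    An offset is taken to be active from its write until the next write.
    """
    candidates = []
    for i, (t_write, offset) in enumerate(writes):
        host = guest_tsc - offset
        end = writes[i + 1][0] if i + 1 < len(writes) else float("inf")
        if t_write <= host < end:
            candidates.append(host)
    return candidates

# eventD was recorded at guest TSC 1; several host times are consistent,
# so no unique placement exists without the monotonicity restriction.
print(candidate_host_times(1))
```

This is why the restriction to a single monotonic guest trace (one boot
instance) is needed: it removes the competing offset windows.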