2013-10-28 01:27:32

by David Ahern

Subject: RFC: paravirtualizing perf_clock

Often when debugging performance problems in a virtualized environment
you need to correlate what is happening in the guest with what is
happening in the host. To correlate events you need a common time basis
(or the ability to directly correlate the two).

The attached patch paravirtualizes perf_clock, pulling the timestamps in
VMs from the host using an MSR read if the option is available (exposed
via a KVM feature flag). I realize this is not the correct end code, but
it illustrates what I would like to see -- host and guests using the same
perf_clock so timestamps correlate directly.

Any suggestions on how to do this without impacting performance? I
noticed the MSR path seems to take about twice as long as the current
implementation (which I believe comes down to rdtsc in the VM on x86
with a stable TSC).

David


Attachments:
0001-perf-kvm-x86-Paravirtualize-perf_clock-for-x86_64.patch (4.14 kB)

2013-10-28 13:01:33

by Gleb Natapov

Subject: Re: RFC: paravirtualizing perf_clock

On Sun, Oct 27, 2013 at 07:27:27PM -0600, David Ahern wrote:
> Often when debugging performance problems in a virtualized
> environment you need to correlate what is happening in the guest
> with what is happening in the host. To correlate events you need a
> common time basis (or the ability to directly correlate the two).
>
> The attached patch paravirtualizes perf_clock, pulling the
> timestamps in VMs from the host using an MSR read if the option is
> available (exposed via KVM feature flag). I realize this is not the
> correct end code but it illustrates what I would like to see -- host
> and guests using the same perf_clock so timestamps directly
> correlate.
>
> Any suggestions on how to do this and without impacting performance.
> I noticed the MSR path seems to take about twice as long as the
> current implementation (which I believe results in rdtsc in the VM
> for x86 with stable TSC).
>
Yoshihiro YUNOMAE (copied) has a tool that merges guest and host
traces using the tsc timestamp. His commit 489223edf29b adds a tracepoint
that reports the current guest's tsc offset to support that.

> David
>

>
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 94dc8ca434e0..5a023ddf085e 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -24,6 +24,7 @@
> #define KVM_FEATURE_STEAL_TIME 5
> #define KVM_FEATURE_PV_EOI 6
> #define KVM_FEATURE_PV_UNHALT 7
> +#define KVM_FEATURE_PV_PERF_CLOCK 8
>
> /* The last 8 bits are used to indicate how to interpret the flags field
> * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -40,6 +41,7 @@
> #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> #define MSR_KVM_STEAL_TIME 0x4b564d03
> #define MSR_KVM_PV_EOI_EN 0x4b564d04
> +#define MSR_KVM_PV_PERF_CLOCK 0x4b564d05
>
> struct kvm_steal_time {
> __u64 steal;
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 9d8449158cf9..fb7824a64823 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -25,6 +25,7 @@
> #include <linux/cpu.h>
> #include <linux/bitops.h>
> #include <linux/device.h>
> +#include <linux/kvm_para.h>
>
> #include <asm/apic.h>
> #include <asm/stacktrace.h>
> @@ -34,6 +35,7 @@
> #include <asm/timer.h>
> #include <asm/desc.h>
> #include <asm/ldt.h>
> +#include <asm/kvm_para.h>
>
> #include "perf_event.h"
>
> @@ -52,6 +54,38 @@ u64 __read_mostly hw_cache_extra_regs
> [PERF_COUNT_HW_CACHE_OP_MAX]
> [PERF_COUNT_HW_CACHE_RESULT_MAX];
>
> +
> +#ifdef CONFIG_PARAVIRT
> +
> +static int have_pv_perf_clock;
> +
> +static void __init perf_clock_init(void)
> +{
> + if (kvm_para_available() &&
> + kvm_para_has_feature(KVM_FEATURE_PV_PERF_CLOCK)) {
> + have_pv_perf_clock = 1;
> + }
> +}
> +
> +u64 perf_clock(void)
> +{
> + if (have_pv_perf_clock)
> + return native_read_msr(MSR_KVM_PV_PERF_CLOCK);
> +
> + /* otherwise return local_clock */
> + return local_clock();
> +}
> +
> +#else
> +u64 perf_clock(void)
> +{
> + return local_clock();
> +}
> +
> +static inline void __init perf_clock_init(void)
> +{
> +}
> +#endif
> /*
> * Propagate event elapsed time into the generic event.
> * Can only be executed on the CPU where the event is active.
> @@ -1496,6 +1530,8 @@ static int __init init_hw_perf_events(void)
> struct x86_pmu_quirk *quirk;
> int err;
>
> + perf_clock_init();
> +
> pr_info("Performance Events: ");
>
> switch (boot_cpu_data.x86_vendor) {
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index b110fe6c03d4..5b258a18f9c0 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -414,7 +414,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> (1 << KVM_FEATURE_ASYNC_PF) |
> (1 << KVM_FEATURE_PV_EOI) |
> (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> - (1 << KVM_FEATURE_PV_UNHALT);
> + (1 << KVM_FEATURE_PV_UNHALT) |
> + (1 << KVM_FEATURE_PV_PERF_CLOCK);
>
> if (sched_info_on())
> entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e5ca72a5cdb6..61ec1f1c7d38 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2418,6 +2418,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> case MSR_KVM_PV_EOI_EN:
> data = vcpu->arch.pv_eoi.msr_val;
> break;
> + case MSR_KVM_PV_PERF_CLOCK:
> + data = perf_clock();
> + break;
> case MSR_IA32_P5_MC_ADDR:
> case MSR_IA32_P5_MC_TYPE:
> case MSR_IA32_MCG_CAP:
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index c8ba627c1d60..c8a51954ea9e 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -865,4 +865,7 @@ _name##_show(struct device *dev, \
> \
> static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
>
> +#if defined(CONFIG_X86_64)
> +u64 perf_clock(void);
> +#endif
> #endif /* _LINUX_PERF_EVENT_H */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index d49a9d29334c..b073975af05a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -290,10 +290,18 @@ extern __weak const char *perf_pmu_name(void)
> return "pmu";
> }
>
> +#if defined(CONFIG_X86_64)
> +__weak u64 perf_clock(void)
> +{
> + return local_clock();
> +}
> +EXPORT_SYMBOL(perf_clock);
> +#else
> static inline u64 perf_clock(void)
> {
> return local_clock();
> }
> +#endif
>
> static inline struct perf_cpu_context *
> __get_cpu_context(struct perf_event_context *ctx)
> --
> 1.7.12.4 (Apple Git-37)
>


--
Gleb.

2013-10-28 13:16:04

by Peter Zijlstra

Subject: Re: RFC: paravirtualizing perf_clock

On Sun, Oct 27, 2013 at 07:27:27PM -0600, David Ahern wrote:
> Often when debugging performance problems in a virtualized environment you
> need to correlate what is happening in the guest with what is happening in
> the host. To correlate events you need a common time basis (or the ability
> to directly correlate the two).
>
> The attached patch paravirtualizes perf_clock, pulling the timestamps in VMs
> from the host using an MSR read if the option is available (exposed via KVM
> feature flag). I realize this is not the correct end code but it illustrates
> what I would like to see -- host and guests using the same perf_clock so
> timestamps directly correlate.
>
> Any suggestions on how to do this and without impacting performance. I
> noticed the MSR path seems to take about twice as long as the current
> implementation (which I believe results in rdtsc in the VM for x86 with
> stable TSC).

So assuming all the TSCs are in fact stable, you could implement this by
syncing up the guest TSC to the host TSC on guest boot. I don't think
anything _should_ rely on the absolute TSC value.

Of course you then also need to make sure the host and guest tsc
multipliers (cyc2ns) are identical; you can play games with
cyc2ns_offset if you're brave.
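For reference, the cyc2ns conversion being discussed has a scale-and-shift
form along these lines (a plain-C sketch, not the kernel's code; the
mult/shift values used below are illustrative, not calibrated ones):

```c
#include <stdint.h>

/* Sketch of cycles-to-nanoseconds scaling: mult and shift are chosen so
 * that mult / 2^shift approximates nanoseconds per cycle, and offset
 * plays the role of cyc2ns_offset. A 128-bit intermediate avoids
 * overflowing cyc * mult for large cycle counts. */
static uint64_t cyc2ns(uint64_t cyc, uint32_t mult, uint32_t shift,
		       uint64_t offset)
{
	return (uint64_t)(((unsigned __int128)cyc * mult) >> shift) + offset;
}
```

With mult = 2^shift (a hypothetical 1 GHz TSC), cycles map 1:1 to
nanoseconds; the offset is the additive term that would have to agree
between host and guest for their clocks to line up.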

2013-10-29 02:58:13

by David Ahern

Subject: Re: RFC: paravirtualizing perf_clock

On 10/28/13 7:15 AM, Peter Zijlstra wrote:
>> Any suggestions on how to do this and without impacting performance. I
>> noticed the MSR path seems to take about twice as long as the current
>> implementation (which I believe results in rdtsc in the VM for x86 with
>> stable TSC).
>
> So assuming all the TSCs are in fact stable; you could implement this by
> syncing up the guest TSC to the host TSC on guest boot. I don't think
> anything _should_ rely on the absolute TSC value.
>
> Of course you then also need to make sure the host and guest tsc
> multipliers (cyc2ns) are identical, you can play games with
> cyc2ns_offset if you're brave.
>

This and the method Gleb mentioned are both going to be complex and
fragile -- based on assumptions about how the perf_clock timestamps are
generated. For example, 489223e assumes you have the tracepoint enabled
at VM start with some means of capturing the data (e.g., an active perf
session). In both cases the end result requires piecing together and
re-generating the VM's timestamps on the events. For perf this means
either modifying the tool to take parameters and an algorithm for
modifying the timestamps, or a homegrown tool to regenerate the file
with updated timestamps.

To back out a bit, my end goal is to be able to create and merge
perf-events from any context on a KVM-based host -- guest userspace,
guest kernel space, host userspace and host kernel space (userspace
events with a perf-clock timestamp is another topic ;-)). Having the
events generated with the proper timestamp is simpler than trying to
collect various tidbits of data, massage timestamps (and hope the clock
source hasn't changed), and then merge events.

And then, as the cherry on top, a design that works across architectures
(e.g., x86 now, arm later).

David

2013-10-29 13:23:54

by Peter Zijlstra

Subject: Re: RFC: paravirtualizing perf_clock

On Mon, Oct 28, 2013 at 08:58:08PM -0600, David Ahern wrote:
> To back out a bit, my end goal is to be able to create and merge perf-events
> from any context on a KVM-based host -- guest userspace, guest kernel space,
> host userspace and host kernel space (userspace events with a perf-clock
> timestamp is another topic ;-)). Having the events generated with the proper
> timestamp is the simpler approach than trying to collect various tidbits of
> data, massage timestamps (and hoping the clock source hasn't changed) and
> then merge events.
>
> And then for the cherry on top a design that works across architectures
> (e.g., x86 now, but arm later).

Fair enough; but then I don't know how to get things faster than what
your initial patch proposes. Typically the only way to get things faster
is to avoid VM exits by replicating state inside the guest, but as you
say, that ends up being complex/fragile.

Subject: Re: Re: RFC: paravirtualizing perf_clock

(2013/10/29 11:58), David Ahern wrote:
> On 10/28/13 7:15 AM, Peter Zijlstra wrote:
>>> Any suggestions on how to do this and without impacting performance. I
>>> noticed the MSR path seems to take about twice as long as the current
>>> implementation (which I believe results in rdtsc in the VM for x86 with
>>> stable TSC).
>>
>> So assuming all the TSCs are in fact stable; you could implement this by
>> syncing up the guest TSC to the host TSC on guest boot. I don't think
>> anything _should_ rely on the absolute TSC value.
>>
>> Of course you then also need to make sure the host and guest tsc
>> multipliers (cyc2ns) are identical, you can play games with
>> cyc2ns_offset if you're brave.
>>
>
> This and the method Gleb mentioned both are going to be complex and
> fragile -- based assumptions on how the perf_clock timestamps are
> generated. For example, 489223e assumes you have the tracepoint enabled
> at VM start with some means of capturing the data (e.g., a perf-session
> active). In both cases the end result requires piecing together and
> re-generating the VM's timestamp on the events. For perf this means
> either modifying the tool to take parameters and an algorithm on how to
> modify the timestamp or a homegrown tool to regenerate the file with
> updated timestamps.
>
> To back out a bit, my end goal is to be able to create and merge
> perf-events from any context on a KVM-based host -- guest userspace,
> guest kernel space, host userspace and host kernel space (userspace
> events with a perf-clock timestamp is another topic ;-)).

That is almost the same as what we (Yoshihiro and I) are trying with
integrated tracing; we are doing it with ftrace and trace-cmd (though
perhaps it will eventually work with perf-ftrace).

> Having the
> events generated with the proper timestamp is the simpler approach than
> trying to collect various tidbits of data, massage timestamps (and
> hoping the clock source hasn't changed) and then merge events.

Yeah, if possible, we'd like to use it too.

>
> And then for the cherry on top a design that works across architectures
> (e.g., x86 now, but arm later).

I think your proposal is good as the default implementation, since it
doesn't depend on arch-specific features. However, since physical timer
(clock) interfaces and virtualization interfaces depend strongly on the
arch, I expect the optimized implementations will differ on each arch.
For example, maybe we can export the tsc offset to the guest to adjust
the clock on x86, but not on ARM or other devices. In that case, until
an optimized one is implemented, we can use the paravirt perf_clock.

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-10-30 14:03:51

by David Ahern

Subject: Re: RFC: paravirtualizing perf_clock

On 10/29/13 11:59 PM, Masami Hiramatsu wrote:
> (2013/10/29 11:58), David Ahern wrote:
>> To back out a bit, my end goal is to be able to create and merge
>> perf-events from any context on a KVM-based host -- guest userspace,
>> guest kernel space, host userspace and host kernel space (userspace
>> events with a perf-clock timestamp is another topic ;-)).
>
> That is almost same as what we(Yoshihiro and I) are trying on integrated
> tracing, we are doing it on ftrace and trace-cmd (but perhaps, it eventually
> works on perf-ftrace).

I thought at this point (well, once perf-ftrace gets committed) that you
can do everything with perf. What feature is missing in perf that you
get with trace-cmd or using debugfs directly?


>> And then for the cherry on top a design that works across architectures
>> (e.g., x86 now, but arm later).
>
> I think your proposal is good for the default implementation, it doesn't
> depends on the arch specific feature. However, since physical timer(clock)
> interfaces and virtualization interfaces strongly depends on the arch,
> I guess the optimized implementations will become different on each arch.
> For example, maybe we can export tsc-offset to the guest to adjust clock
> on x86, but not on ARM, or other devices. In that case, until implementing
> optimized one, we can use paravirt perf_clock.

So this MSR read takes about 1.6 usecs (from 'perf stat kvm live'), and
that is the total time between VMEXIT and VMENTRY. The time it takes to
run perf_clock on the host should be a very small part of that 1.6 usec.
I'll take a look at the TSC path to see how it is optimized (suggestions
appreciated).

Another thought is to make the use of pv_perf_clock an option -- a user
can knowingly decide that the additional latency/overhead is worth the
feature.

David

2013-10-30 14:21:20

by Gleb Natapov

Subject: Re: RFC: paravirtualizing perf_clock

On Mon, Oct 28, 2013 at 08:58:08PM -0600, David Ahern wrote:
> On 10/28/13 7:15 AM, Peter Zijlstra wrote:
> >>Any suggestions on how to do this and without impacting performance. I
> >>noticed the MSR path seems to take about twice as long as the current
> >>implementation (which I believe results in rdtsc in the VM for x86 with
> >>stable TSC).
> >
> >So assuming all the TSCs are in fact stable; you could implement this by
> >syncing up the guest TSC to the host TSC on guest boot. I don't think
> >anything _should_ rely on the absolute TSC value.
> >
> >Of course you then also need to make sure the host and guest tsc
> >multipliers (cyc2ns) are identical, you can play games with
> >cyc2ns_offset if you're brave.
> >
>
> This and the method Gleb mentioned both are going to be complex and
> fragile -- based assumptions on how the perf_clock timestamps are
> generated. For example, 489223e assumes you have the tracepoint
> enabled at VM start with some means of capturing the data (e.g., a
> perf-session active).
We can think of other ways to provide the tsc offset to perf.

> In both cases the end result requires piecing
> together and re-generating the VM's timestamp on the events. For
> perf this means either modifying the tool to take parameters and an
> algorithm on how to modify the timestamp or a homegrown tool to
> regenerate the file with updated timestamps.
>
> To back out a bit, my end goal is to be able to create and merge
> perf-events from any context on a KVM-based host -- guest userspace,
> guest kernel space, host userspace and host kernel space (userspace
> events with a perf-clock timestamp is another topic ;-)). Having the
> events generated with the proper timestamp is the simpler approach
> than trying to collect various tidbits of data, massage timestamps
> (and hoping the clock source hasn't changed) and then merge events.
>
So can you explain a little bit more about how this will work? You run
perf on the host and get both host and guest events? How do you pass
events from the guest to the host in this case?

> And then for the cherry on top a design that works across
> architectures (e.g., x86 now, but arm later).
>
MSR is an x86 thing.

--
Gleb.

2013-10-30 14:31:36

by David Ahern

Subject: Re: RFC: paravirtualizing perf_clock

On 10/30/13 8:20 AM, Gleb Natapov wrote:
> So can you explain a little bit more about how this will work? You run
> perf on a host and get both host and guest events? How do you pass
> events from guest to host in this case?

The intent is to allow data capture to occur in both contexts (host and
guest) completely independently (e.g., record events in the guest to a
file and in the host to a separate file). The files are then made
available to a single post processing command (e.g., copy off box to an
analysis server or copy guest file to host or vice versa).

From there perf needs some tweaks to read two different data files and
time-sort the events. From an address-to-symbol perspective, perf already
has the notion of independent machines -- work that was done for perf-kvm.
There has already been a lot of discussion about writing perf events to
mmap-specific files which are then merged at analysis time (versus today,
where all mmaps are scanned and dumped to the same file). This use case
is not much of an extension beyond those two concepts.

Right now, as a proof of concept, I am dumping events in the guest to a
file (perf-script) and in the host to a file, merging the two files
together, and then time-sorting. For example, running 'ls' in the guest
causes disk I/O, which causes a VMEXIT, .... you can see this action end
to end.
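The post-processing step is essentially a merge of two event streams
that are each already sorted by timestamp. A sketch (not the actual perf
code; the names here are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

struct event {
	uint64_t ts;		/* perf_clock timestamp */
	int from_guest;		/* 0 = host stream, 1 = guest stream */
};

/* Merge two timestamp-sorted streams into one time-ordered stream --
 * the core of merging independently recorded host and guest files once
 * both use the same perf_clock. Returns the number of events written. */
static size_t merge_events(const uint64_t *host, size_t nh,
			   const uint64_t *guest, size_t ng,
			   struct event *out)
{
	size_t i = 0, j = 0, k = 0;

	while (i < nh && j < ng) {
		if (host[i] <= guest[j])
			out[k++] = (struct event){ host[i++], 0 };
		else
			out[k++] = (struct event){ guest[j++], 1 };
	}
	while (i < nh)
		out[k++] = (struct event){ host[i++], 0 };
	while (j < ng)
		out[k++] = (struct event){ guest[j++], 1 };
	return k;
}
```

With a shared clock the merge needs no timestamp massaging at all, which
is the point of the proposal.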

>
>> And then for the cherry on top a design that works across
>> architectures (e.g., x86 now, but arm later).
>>
> MSR is x86 thing.

Sure, the implementation of pv_perf_clock for x86 is an MSR read (open to
suggestions on other options). ARM would have some other means of fast
access to the host, no?

David

Subject: Re: RFC: paravirtualizing perf_clock

(2013/10/30 23:03), David Ahern wrote:
> On 10/29/13 11:59 PM, Masami Hiramatsu wrote:
>> (2013/10/29 11:58), David Ahern wrote:
>>> To back out a bit, my end goal is to be able to create and merge
>>> perf-events from any context on a KVM-based host -- guest userspace,
>>> guest kernel space, host userspace and host kernel space (userspace
>>> events with a perf-clock timestamp is another topic ;-)).
>>
>> That is almost same as what we(Yoshihiro and I) are trying on integrated
>> tracing, we are doing it on ftrace and trace-cmd (but perhaps, it eventually
>> works on perf-ftrace).
>
> I thought at this point (well, once perf-ftrace gets committed) that you
> can do everything with perf. What feature is missing in perf that you
> get with trace-cmd or using debugfs directly?

The perf tools interface is best for profiling a process or over a short
period. However, what we'd like to do is monitor or trace in the
background, in memory, over a long period -- across the system's life
cycle -- as a flight recorder. This kind of tracing interface is required
on mission-critical systems for troubleshooting.

Also, the on-the-fly configurability of ftrace, such as snapshots,
multiple buffers, and adding/removing events, is very useful, since in
the flight-recorder use case we can't stop tracing for even a moment.

Moreover, our guest/host integrated tracer can pass event buffers from
the guest to the host with very low overhead, because it uses the ftrace
ring buffer and virtio-serial with splice (so there is zero page copying
in the guest). Note that we need the tracing overhead to be as small as
possible because it is always running in the background.

That's why we're using ftrace for our purpose. But anyway, time
synchronization is a common issue. Let's share the solution :)


>>> And then for the cherry on top a design that works across architectures
>>> (e.g., x86 now, but arm later).
>>
>> I think your proposal is good for the default implementation, it doesn't
>> depends on the arch specific feature. However, since physical timer(clock)
>> interfaces and virtualization interfaces strongly depends on the arch,
>> I guess the optimized implementations will become different on each arch.
>> For example, maybe we can export tsc-offset to the guest to adjust clock
>> on x86, but not on ARM, or other devices. In that case, until implementing
>> optimized one, we can use paravirt perf_clock.
>
> So this MSR read takes about 1.6usecs (from 'perf stat kvm live') and
> that is total time between VMEXIT and VMENTRY. The time it takes to run
> perf_clock in the host should be a very small part of that 1.6 usec.

Yeah, a hypercall is always a heavy operation, so that is not the best
solution; we need an optimized one for each arch.

> I'll take a look at the TSC path to see how it is optimized (suggestions
> appreciated).

At least on a machine which has a stable tsc, we can rely on that.
We just need the tsc offset to adjust it in the guest. Note that this
offset can change if the guest sleeps/resumes or does a live migration;
each time, we need to refresh the tsc offset.
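For the stable-tsc case, the adjustment itself is trivial once the
offset is known: KVM sets up guest_tsc = host_tsc + tsc_offset, so
post-processing only has to invert that (a sketch; the function name is
illustrative):

```c
#include <stdint.h>

/* KVM programs guest_tsc = host_tsc + tsc_offset, where tsc_offset is
 * what the kvm_write_tsc_offset tracepoint (commit 489223e) reports.
 * Mapping a guest TSC reading back into the host TSC domain is then a
 * subtraction -- valid only while the offset is unchanged (it must be
 * re-read after guest sleep/resume or live migration). */
static uint64_t guest_tsc_to_host(uint64_t guest_tsc, int64_t tsc_offset)
{
	return guest_tsc - (uint64_t)tsc_offset;
}
```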

> Another thought is to make the use of pv_perf_clock an option -- user
> can knowingly decide the additional latency/overhead is worth the feature.

Yeah. BTW, have you looked at paravirt_sched_clock (pv_time_ops)?
It seems that such a synchronized clock is already there.

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-10-31 16:46:02

by David Ahern

Subject: Re: RFC: paravirtualizing perf_clock

On 10/31/13, 2:09 AM, Masami Hiramatsu wrote:
> (2013/10/30 23:03), David Ahern wrote:
>> On 10/29/13 11:59 PM, Masami Hiramatsu wrote:
>>> (2013/10/29 11:58), David Ahern wrote:
>>>> To back out a bit, my end goal is to be able to create and merge
>>>> perf-events from any context on a KVM-based host -- guest userspace,
>>>> guest kernel space, host userspace and host kernel space (userspace
>>>> events with a perf-clock timestamp is another topic ;-)).
>>>
>>> That is almost same as what we(Yoshihiro and I) are trying on integrated
>>> tracing, we are doing it on ftrace and trace-cmd (but perhaps, it eventually
>>> works on perf-ftrace).
>>
>> I thought at this point (well, once perf-ftrace gets committed) that you
>> can do everything with perf. What feature is missing in perf that you
>> get with trace-cmd or using debugfs directly?
>
> The perftools interface is the best for profiling a process or in a short period.
> However, what we'd like to do is monitoring or tracing in background a long
> period on the memory, while the system life cycle, as a flight recorder.
> This kind of tracing interface is required for mission-critical system for
> trouble shooting.

Right. I have a perf-based scheduling daemon that runs in a flight-recorder
mode -- retaining the last N seconds of scheduling data. The
challenge is mostly handling memory growth with the task-based records
(MMAP, FORK, EXIT, COMM). Other events are handled fairly well.


> Also, on-the-fly configurability of ftrace such as snapshot, multi-buffer,
> event-adding/removing are very useful, since in the flight-recorder
> use-case, we can't stop tracing for even a moment.

interesting.

> Moreover, our guest/host integrated tracer can pass event buffers from
> guest to host with very small overhead, because it uses ftrace ringbuffer
> and virtio-serial with splice (so, zero page copying in the guest).
> Note that we need low overhead tracing as small as possible because it
> is running always in background.

Right. Been meaning to look at what you guys have done, just have not
had the time.

> That's why we're using ftrace for our purpose. But anyway, the time
> synchronization is common issue. Let's share the solution :)

Yes, one of the key takeaways from the Tracing Summit was the need for a
common time source; this just extends it to VMs as well.

>>>> And then for the cherry on top a design that works across architectures
>>>> (e.g., x86 now, but arm later).
>>>
>>> I think your proposal is good for the default implementation, it doesn't
>>> depends on the arch specific feature. However, since physical timer(clock)
>>> interfaces and virtualization interfaces strongly depends on the arch,
>>> I guess the optimized implementations will become different on each arch.
>>> For example, maybe we can export tsc-offset to the guest to adjust clock
>>> on x86, but not on ARM, or other devices. In that case, until implementing
>>> optimized one, we can use paravirt perf_clock.
>>
>> So this MSR read takes about 1.6usecs (from 'perf stat kvm live') and
>> that is total time between VMEXIT and VMENTRY. The time it takes to run
>> perf_clock in the host should be a very small part of that 1.6 usec.
>
> Yeah, a hypercall is always heavy operation. So that is not the best
> solution, we need a optimized one for each arch.
>
>> I'll take a look at the TSC path to see how it is optimized (suggestions
>> appreciated).
>
> At least on the machine which has stable tsc, we can relay on that.
> We just need the tsc-offset to adjust it in the guest. Note that this
> offset can change if the guest sleeps/resumes or does a live-migration.
> Each time we need to refresh the tsc-offset.
>
>> Another thought is to make the use of pv_perf_clock an option -- user
>> can knowingly decide the additional latency/overhead is worth the feature.
>
> Yeah. BTW, would you see the paravirt_sched_clock(pv_time_ops)?
> It seems that such synchronized clock is there.

I have poked around with it a bit.

David