2020-11-30 13:39:32

by Maxim Levitsky

Subject: [PATCH 0/2] RFC: Precise TSC migration

Hi!

This is the first version of the work to make TSC migration more accurate,
as was defined by Paolo at:
https://www.spinics.net/lists/kvm/msg225525.html

I have a few thoughts about the kvm masterclock synchronization,
which relate to Paolo's proposal that I implemented.

The idea of the masterclock is that when the host TSC is synchronized
(or, as the kernel calls it, stable), and the guest TSC is synchronized as well,
then we can base the kvmclock on the same pair of
(host time in nsec, host tsc value) for all vCPUs.

This makes the random error in the calculation of this value invariant
across vCPUs, and allows the guest to do the kvmclock calculation in userspace
(vDSO), since the kvmclock parameters are vCPU invariant.



To ensure that the guest TSC is synchronized, we currently track host/guest TSC
writes, and enable the masterclock only when roughly the same guest TSC value
was written across all vCPUs.

Recently this was disabled by Paolo, and I agree with this, because I think
that we indeed should only make the guest TSC synchronized by default
(including newly hotplugged vCPUs) and not do any TSC synchronization beyond that.
(Trying to guess when the guest syncs the TSC can cause more harm than good.)

Besides, Linux guests don't sync the TSC via an IA32_TSC write,
but rather use IA32_TSC_ADJUST, which currently doesn't participate
in the TSC sync heuristics.
And as far as I know, the Linux guest is the primary (only?) user of the kvmclock.

I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
in the documentation to state that it only guarantees invariance if the guest
doesn't mess with its own TSC.

Also I think we should consider enabling X86_FEATURE_TSC_RELIABLE
in the guest kernel when KVM is detected, to keep the guest from even
trying to sync the TSC on newly hotplugged vCPUs.

(The guest doesn't usually end up touching TSC_ADJUST, but it still might
in some cases due to scheduling of guest vCPUs.)

(X86_FEATURE_TSC_RELIABLE short-circuits TSC synchronization on CPU hotplug
and the TSC clocksource watchdog; the latter we might want to keep.)

For host TSC writes, just as Paolo proposed, we can still do the TSC sync,
unless the new code that I implemented is in use.

A few more random notes:

I have a weird feeling about using 'nsec since 1 January 1970'.
Common sense tells me that a 64-bit value can hold about 580 years,
but still I see that it is more common to use a timespec, which is a (sec, nsec) pair.



I feel that the 'kvm_get_walltime' helper that I added is a bit of a hack.
Some refactoring might improve things here.

For example, making kvm_get_walltime_and_clockread work in the non-TSC case as well
might make the code cleaner.

Patches to enable this feature in qemu are in the process of being sent to the
qemu-devel mailing list.



Best regards,

Maxim Levitsky



Maxim Levitsky (2):

KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE

KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS



Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++

arch/x86/include/uapi/asm/kvm.h | 1 +

arch/x86/kvm/x86.c | 88 +++++++++++++++++++++++++++++++--

include/uapi/linux/kvm.h | 14 ++++++

4 files changed, 154 insertions(+), 5 deletions(-)



--

2.26.2





2020-11-30 13:39:48

by Maxim Levitsky

Subject: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS

This quirk reflects the fact that we currently treat MSR_IA32_TSC
and MSR_TSC_ADJUST access by the host (e.g. qemu) in a way that is different
compared to an access from the guest.

For the host's MSR_IA32_TSC read we currently always return the L1 TSC value, and
for the host's write we do the TSC synchronization.

For the host's MSR_TSC_ADJUST write, we don't make the TSC 'jump' as we should
for this MSR.

When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
longer needed; thus leave this behavior enabled only under a quirk
which the hypervisor can disable.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/x86.c | 19 ++++++++++++++-----
2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 8e76d3701db3f..2a60fc6674164 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -404,6 +404,7 @@ struct kvm_sync_regs {
#define KVM_X86_QUIRK_LAPIC_MMIO_HOLE (1 << 2)
#define KVM_X86_QUIRK_OUT_7E_INC_RIP (1 << 3)
#define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4)
+#define KVM_X86_QUIRK_TSC_HOST_ACCESS (1 << 5)

#define KVM_STATE_NESTED_FORMAT_VMX 0
#define KVM_STATE_NESTED_FORMAT_SVM 1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f0ae9cb14b8a..46a2111d54840 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3091,7 +3091,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_TSC_ADJUST:
if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
- if (!msr_info->host_initiated) {
+ if (!msr_info->host_initiated ||
+ !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
adjust_tsc_offset_guest(vcpu, adj);
}
@@ -3118,7 +3119,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.msr_ia32_power_ctl = data;
break;
case MSR_IA32_TSC:
- if (msr_info->host_initiated) {
+ if (msr_info->host_initiated &&
+ kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
kvm_synchronize_tsc(vcpu, data);
} else {
u64 adj = kvm_compute_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
@@ -3409,17 +3411,24 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vcpu->arch.msr_ia32_power_ctl;
break;
case MSR_IA32_TSC: {
+ u64 tsc_offset;
+
/*
* Intel SDM states that MSR_IA32_TSC read adds the TSC offset
* even when not intercepted. AMD manual doesn't explicitly
* state this but appears to behave the same.
*
- * On userspace reads and writes, however, we unconditionally
+ * On userspace reads and writes, when KVM_X86_QUIRK_TSC_HOST_ACCESS
+ * is present, however, we unconditionally
* return L1's TSC value to ensure backwards-compatible
* behavior for migration.
*/
- u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
- vcpu->arch.tsc_offset;
+
+ if (msr_info->host_initiated &&
+ kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS))
+ tsc_offset = vcpu->arch.l1_tsc_offset;
+ else
+ tsc_offset = vcpu->arch.tsc_offset;

msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
break;
--
2.26.2

2020-11-30 13:41:40

by Maxim Levitsky

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration (summary)

This is the summary of a few things that I think are relevant.

Best regards,
Maxim Levitsky


Attachments:
tsc.md (4.54 kB)

2020-11-30 13:41:43

by Maxim Levitsky

Subject: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE

These two new ioctls allow userspace to more precisely capture and
restore the guest's TSC state.

Both ioctls are meant to be used to accurately migrate the guest TSC
even when there is significant downtime during the migration.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
---
Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 69 ++++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 14 +++++++
3 files changed, 139 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229f..2f04aa8ecf119 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4826,6 +4826,62 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
experience inconsistent filtering behavior on MSR accesses.


+4.127 KVM_GET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_state
+:Returns: 0 on success, < 0 on error
+
+::
+
+ #define KVM_TSC_STATE_TSC_ADJUST_VALID 1
+ struct kvm_tsc_state {
+ __u32 flags;
+ __u64 nsec;
+ __u64 tsc;
+ __u64 tsc_adjust;
+ };
+
+flags values for ``struct kvm_tsc_state``:
+
+``KVM_TSC_STATE_TSC_ADJUST_VALID``
+
+ ``tsc_adjust`` contains a valid IA32_TSC_ADJUST value
+
+This ioctl allows user space to read the guest's IA32_TSC and IA32_TSC_ADJUST values,
+and the current value of the host's CLOCK_REALTIME clock in nanoseconds since the
+Unix epoch.
+
+
+4.128 KVM_SET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_state
+:Returns: 0 on success, < 0 on error
+
+
+
+This ioctl allows user space to reconstruct the guest's IA32_TSC and TSC_ADJUST values
+from state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+
+KVM will adjust the guest TSC value by the time that has passed between the
+CLOCK_REALTIME timestamp saved in the struct and the current value of
+CLOCK_REALTIME, and set the guest's TSC to the new value.
+
+TSC_ADJUST is restored as-is if KVM_TSC_STATE_TSC_ADJUST_VALID is set.
+
+It is assumed that either both ioctls will be run on the same machine,
+or that the source and destination machines have synchronized clocks.
+
+As a special case, the timestamp in the struct may be left zero,
+in which case it will be ignored and the TSC will be restored exactly.
+
5. The kvm_run structure
========================

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f3..4f0ae9cb14b8a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,

return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
}
+
+
+static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
+{
+ struct timespec64 ts;
+
+ if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
+ *walltime_ns = timespec64_to_ns(&ts);
+ return;
+ }
+
+ *host_tsc = rdtsc();
+ *walltime_ns = ktime_get_real_ns();
+}
+
#endif

/*
@@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_X86_USER_SPACE_MSR:
case KVM_CAP_X86_MSR_FILTER:
case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+#ifdef CONFIG_X86_64
+ case KVM_CAP_PRECISE_TSC:
+#endif
r = 1;
break;
case KVM_CAP_SYNC_REGS:
@@ -4999,6 +5017,57 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
case KVM_GET_SUPPORTED_HV_CPUID:
r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
break;
+#ifdef CONFIG_X86_64
+ case KVM_GET_TSC_STATE: {
+ struct kvm_tsc_state __user *user_tsc_state = argp;
+ struct kvm_tsc_state tsc_state;
+ u64 host_tsc;
+
+ memset(&tsc_state, 0, sizeof(tsc_state));
+
+ kvm_get_walltime(&tsc_state.nsec, &host_tsc);
+ tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+ if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
+ tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
+ tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
+ }
+
+ r = -EFAULT;
+ if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_SET_TSC_STATE: {
+ struct kvm_tsc_state __user *user_tsc_state = argp;
+ struct kvm_tsc_state tsc_state;
+
+ u64 host_tsc, wall_nsec;
+ s64 diff;
+ u64 new_guest_tsc, new_guest_tsc_offset;
+
+ r = -EFAULT;
+ if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
+ goto out;
+
+ kvm_get_walltime(&wall_nsec, &host_tsc);
+ diff = wall_nsec - tsc_state.nsec;
+
+ if (diff < 0 || tsc_state.nsec == 0)
+ diff = 0;
+
+ new_guest_tsc = tsc_state.tsc + nsec_to_cycles(vcpu, diff);
+ new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
+ kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
+
+ if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+ if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+ vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
+ r = 0;
+ break;
+ }
+#endif
default:
r = -EINVAL;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..ee1bd5e7da964 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
#define KVM_CAP_SYS_HYPERV_CPUID 191
#define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193

#ifdef KVM_CAP_IRQ_ROUTING

@@ -1169,6 +1170,15 @@ struct kvm_clock_data {
__u32 pad[9];
};

+
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 1
+struct kvm_tsc_state {
+ __u32 flags;
+ __u64 nsec;
+ __u64 tsc;
+ __u64 tsc_adjust;
+};
+
/* For KVM_CAP_SW_TLB */

#define KVM_MMU_FSL_BOOKE_NOHV 0
@@ -1563,6 +1573,10 @@ struct kvm_pv_cmd {
/* Available with KVM_CAP_DIRTY_LOG_RING */
#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc7)

/* Available with KVM_CAP_PRECISE_TSC */
+#define KVM_SET_TSC_STATE _IOW(KVMIO, 0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE _IOR(KVMIO, 0xc9, struct kvm_tsc_state)
+
/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
/* Guest initialization commands */
--
2.26.2

2020-11-30 13:58:32

by Paolo Bonzini

Subject: Re: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS

On 30/11/20 14:35, Maxim Levitsky wrote:
> This quirk reflects the fact that we currently treat MSR_IA32_TSC
> and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
> compared to an access from the guest.
>
> For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
> host's write we do the tsc synchronization.
>
> For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
> for this msr.
>
> When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
> longer needed, thus leave this enabled only with a quirk
> which the hypervisor can disable.
>
> Suggested-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Maxim Levitsky <[email protected]>

This needs to be covered by a variant of the existing selftests testcase
(running the same guest code, but different host code of course).

Paolo

> ---
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/x86.c | 19 ++++++++++++++-----
> 2 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 8e76d3701db3f..2a60fc6674164 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -404,6 +404,7 @@ struct kvm_sync_regs {
> #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE (1 << 2)
> #define KVM_X86_QUIRK_OUT_7E_INC_RIP (1 << 3)
> #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4)
> +#define KVM_X86_QUIRK_TSC_HOST_ACCESS (1 << 5)
>
> #define KVM_STATE_NESTED_FORMAT_VMX 0
> #define KVM_STATE_NESTED_FORMAT_SVM 1
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4f0ae9cb14b8a..46a2111d54840 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3091,7 +3091,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> break;
> case MSR_IA32_TSC_ADJUST:
> if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> - if (!msr_info->host_initiated) {
> + if (!msr_info->host_initiated ||
> + !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
> s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
> adjust_tsc_offset_guest(vcpu, adj);
> }
> @@ -3118,7 +3119,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vcpu->arch.msr_ia32_power_ctl = data;
> break;
> case MSR_IA32_TSC:
> - if (msr_info->host_initiated) {
> + if (msr_info->host_initiated &&
> + kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
> kvm_synchronize_tsc(vcpu, data);
> } else {
> u64 adj = kvm_compute_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
> @@ -3409,17 +3411,24 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> msr_info->data = vcpu->arch.msr_ia32_power_ctl;
> break;
> case MSR_IA32_TSC: {
> + u64 tsc_offset;
> +
> /*
> * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
> * even when not intercepted. AMD manual doesn't explicitly
> * state this but appears to behave the same.
> *
> - * On userspace reads and writes, however, we unconditionally
> + * On userspace reads and writes, when KVM_X86_QUIRK_TSC_HOST_ACCESS
> + * is present, however, we unconditionally
> * return L1's TSC value to ensure backwards-compatible
> * behavior for migration.
> */
> - u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
> - vcpu->arch.tsc_offset;
> +
> + if (msr_info->host_initiated &&
> + kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS))
> + tsc_offset = vcpu->arch.l1_tsc_offset;
> + else
> + tsc_offset = vcpu->arch.tsc_offset;
>
> msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
> break;
>

2020-11-30 14:16:14

by Maxim Levitsky

Subject: Re: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS

On Mon, 2020-11-30 at 14:54 +0100, Paolo Bonzini wrote:
> On 30/11/20 14:35, Maxim Levitsky wrote:
> > This quirk reflects the fact that we currently treat MSR_IA32_TSC
> > and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
> > compared to an access from the guest.
> >
> > For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
> > host's write we do the tsc synchronization.
> >
> > For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
> > for this msr.
> >
> > When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
> > longer needed, thus leave this enabled only with a quirk
> > which the hypervisor can disable.
> >
> > Suggested-by: Paolo Bonzini <[email protected]>
> > Signed-off-by: Maxim Levitsky <[email protected]>
>
> This needs to be covered by a variant of the existing selftests testcase
> (running the same guest code, but different host code of course).
Do you think that the test should go into the kernel's KVM selftests,
or into the kvm-unit-tests project?

Best regards,
Maxim Levitsky

>
> Paolo
>
> > ---
> > arch/x86/include/uapi/asm/kvm.h | 1 +
> > arch/x86/kvm/x86.c | 19 ++++++++++++++-----
> > 2 files changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 8e76d3701db3f..2a60fc6674164 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -404,6 +404,7 @@ struct kvm_sync_regs {
> > #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE (1 << 2)
> > #define KVM_X86_QUIRK_OUT_7E_INC_RIP (1 << 3)
> > #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4)
> > +#define KVM_X86_QUIRK_TSC_HOST_ACCESS (1 << 5)
> >
> > #define KVM_STATE_NESTED_FORMAT_VMX 0
> > #define KVM_STATE_NESTED_FORMAT_SVM 1
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4f0ae9cb14b8a..46a2111d54840 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3091,7 +3091,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > break;
> > case MSR_IA32_TSC_ADJUST:
> > if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > - if (!msr_info->host_initiated) {
> > + if (!msr_info->host_initiated ||
> > + !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
> > s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
> > adjust_tsc_offset_guest(vcpu, adj);
> > }
> > @@ -3118,7 +3119,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > vcpu->arch.msr_ia32_power_ctl = data;
> > break;
> > case MSR_IA32_TSC:
> > - if (msr_info->host_initiated) {
> > + if (msr_info->host_initiated &&
> > + kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
> > kvm_synchronize_tsc(vcpu, data);
> > } else {
> > u64 adj = kvm_compute_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
> > @@ -3409,17 +3411,24 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > msr_info->data = vcpu->arch.msr_ia32_power_ctl;
> > break;
> > case MSR_IA32_TSC: {
> > + u64 tsc_offset;
> > +
> > /*
> > * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
> > * even when not intercepted. AMD manual doesn't explicitly
> > * state this but appears to behave the same.
> > *
> > - * On userspace reads and writes, however, we unconditionally
> > + * On userspace reads and writes, when KVM_X86_QUIRK_TSC_HOST_ACCESS
> > + * is present, however, we unconditionally
> > * return L1's TSC value to ensure backwards-compatible
> > * behavior for migration.
> > */
> > - u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
> > - vcpu->arch.tsc_offset;
> > +
> > + if (msr_info->host_initiated &&
> > + kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS))
> > + tsc_offset = vcpu->arch.l1_tsc_offset;
> > + else
> > + tsc_offset = vcpu->arch.tsc_offset;
> >
> > msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
> > break;
> >


2020-11-30 14:20:50

by Paolo Bonzini

Subject: Re: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS

On 30/11/20 15:11, Maxim Levitsky wrote:
> On Mon, 2020-11-30 at 14:54 +0100, Paolo Bonzini wrote:
>> On 30/11/20 14:35, Maxim Levitsky wrote:
>>> This quirk reflects the fact that we currently treat MSR_IA32_TSC
>>> and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
>>> compared to an access from the guest.
>>>
>>> For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
>>> host's write we do the tsc synchronization.
>>>
>>> For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
>>> for this msr.
>>>
>>> When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
>>> longer needed, thus leave this enabled only with a quirk
>>> which the hypervisor can disable.
>>>
>>> Suggested-by: Paolo Bonzini <[email protected]>
>>> Signed-off-by: Maxim Levitsky <[email protected]>
>>
>> This needs to be covered by a variant of the existing selftests testcase
>> (running the same guest code, but different host code of course).
> Do you think that the test should go to the kernel's kvm unit tests,
> or to kvm-unit-tests project?

The latter already has x86_64/tsc_msrs_test.c (which I created in
preparation for this exact change :)).

Paolo

2020-11-30 14:37:35

by Paolo Bonzini

Subject: Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE

On 30/11/20 14:35, Maxim Levitsky wrote:
> + if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> + tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> + tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> + }

This is mostly useful for userspace that doesn't disable the quirk, right?

> + kvm_get_walltime(&wall_nsec, &host_tsc);
> + diff = wall_nsec - tsc_state.nsec;
> +
> + if (diff < 0 || tsc_state.nsec == 0)
> + diff = 0;
> +

diff < 0 should be okay. Also why the nsec==0 special case? What about
using a flag instead?

Paolo

2020-11-30 15:37:37

by Maxim Levitsky

Subject: Re: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS

On Mon, 2020-11-30 at 15:15 +0100, Paolo Bonzini wrote:
> On 30/11/20 15:11, Maxim Levitsky wrote:
> > On Mon, 2020-11-30 at 14:54 +0100, Paolo Bonzini wrote:
> > > On 30/11/20 14:35, Maxim Levitsky wrote:
> > > > This quirk reflects the fact that we currently treat MSR_IA32_TSC
> > > > and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
> > > > compared to an access from the guest.
> > > >
> > > > For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
> > > > host's write we do the tsc synchronization.
> > > >
> > > > For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
> > > > for this msr.
> > > >
> > > > When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
> > > > longer needed, thus leave this enabled only with a quirk
> > > > which the hypervisor can disable.
> > > >
> > > > Suggested-by: Paolo Bonzini <[email protected]>
> > > > Signed-off-by: Maxim Levitsky <[email protected]>
> > >
> > > This needs to be covered by a variant of the existing selftests testcase
> > > (running the same guest code, but different host code of course).
> > Do you think that the test should go to the kernel's kvm unit tests,
> > or to kvm-unit-tests project?
>
> The latter already has x86_64/tsc_msrs_test.c (which I created in
> preparation for this exact change :)).

I'll prepare a test then for it!

Best regards,
Maxim Levitsky
>
> Paolo
>


2020-11-30 16:02:49

by Maxim Levitsky

Subject: Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE

On Mon, 2020-11-30 at 15:33 +0100, Paolo Bonzini wrote:
> On 30/11/20 14:35, Maxim Levitsky wrote:
> > + if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > + tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > + tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > + }
>
> This is mostly useful for userspace that doesn't disable the quirk, right?

Isn't this the opposite? If I understand the original proposal correctly,
the reason that we include TSC_ADJUST in the new ioctl is that
we would like to disable the special kvm behavior (that is, disable the quirk),
which would mean that the TSC will jump on a regular host-initiated TSC_ADJUST write.

To avoid this, userspace would set TSC_ADJUST through this new interface.

Note that I haven't yet disabled the quirk in the patches I posted to qemu,
because we need some infrastructure to manage which quirks we want to disable
in qemu.
(That is, KVM_ENABLE_CAP is, as I understand it, write-only, so I can't just disable
KVM_X86_QUIRK_TSC_HOST_ACCESS in the code that enables x-precise-tsc in qemu.)

>
> > + kvm_get_walltime(&wall_nsec, &host_tsc);
> > + diff = wall_nsec - tsc_state.nsec;
> > +
> > + if (diff < 0 || tsc_state.nsec == 0)
> > + diff = 0;
> > +
>
> diff < 0 should be okay. Also why the nsec==0 special case? What about
> using a flag instead?

In theory diff < 0 should indeed be okay (though this would mean that the target
has an unsynchronized clock, or that time travel happened).

However, nsec_to_cycles, for example, takes an unsigned number, and then
pvclock_scale_delta also takes an unsigned number, and so on,
so I was thinking: why bother with this case?

There is still a (mostly?) theoretical issue if on some vCPUs 'diff' is positive
and on some it is negative
(this can happen if the migration was really fast, and the target has a clock
that is only slightly ahead of the source).
Do you think that this is an issue? If so I can make the code work with
signed numbers.

About nsec == 0: this is to allow using this API for VM initialization
(that is, to call KVM_SET_TSC_PRECISE prior to doing KVM_GET_TSC_PRECISE).

This simplifies the qemu code, and I don't think
that this makes the API much worse.

Best regards,
Maxim Levitsky

>
> Paolo
>


2020-11-30 16:57:32

by Andy Lutomirski

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Mon, Nov 30, 2020 at 5:36 AM Maxim Levitsky <[email protected]> wrote:
>
> Hi!
>
> This is the first version of the work to make TSC migration more accurate,
> as was defined by Paolo at:
> https://www.spinics.net/lists/kvm/msg225525.html
>
> I have a few thoughts about the kvm masterclock synchronization,
> which relate to Paolo's proposal that I implemented.
>
> The idea of the masterclock is that when the host TSC is synchronized
> (or, as the kernel calls it, stable), and the guest TSC is synchronized as well,
> then we can base the kvmclock on the same pair of
> (host time in nsec, host tsc value) for all vCPUs.
>
> This makes the random error in the calculation of this value invariant
> across vCPUs, and allows the guest to do the kvmclock calculation in userspace
> (vDSO), since the kvmclock parameters are vCPU invariant.
>
> To ensure that the guest TSC is synchronized, we currently track host/guest TSC
> writes, and enable the masterclock only when roughly the same guest TSC value
> was written across all vCPUs.
>
> Recently this was disabled by Paolo, and I agree with this, because I think
> that we indeed should only make the guest TSC synchronized by default
> (including newly hotplugged vCPUs) and not do any TSC synchronization beyond that.
> (Trying to guess when the guest syncs the TSC can cause more harm than good.)
>
> Besides, Linux guests don't sync the TSC via an IA32_TSC write,
> but rather use IA32_TSC_ADJUST, which currently doesn't participate
> in the TSC sync heuristics.
> And as far as I know, the Linux guest is the primary (only?) user of the kvmclock.
>
> I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> in the documentation to state that it only guarantees invariance if the guest
> doesn't mess with its own TSC.
>
> Also I think we should consider enabling X86_FEATURE_TSC_RELIABLE
> in the guest kernel when KVM is detected, to keep the guest from even
> trying to sync the TSC on newly hotplugged vCPUs.
>
> (The guest doesn't usually end up touching TSC_ADJUST, but it still might
> in some cases due to scheduling of guest vCPUs.)
>
> (X86_FEATURE_TSC_RELIABLE short-circuits TSC synchronization on CPU hotplug
> and the TSC clocksource watchdog; the latter we might want to keep.)

If you're going to change the guest behavior to be more trusting of
the host, I think the host should probably signal this to the guest
using a new bit.

2020-11-30 17:05:00

by Paolo Bonzini

Subject: Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE

On 30/11/20 16:58, Maxim Levitsky wrote:
>> This is mostly useful for userspace that doesn't disable the quirk, right?
> Isn't this the opposite? If I understand the original proposal correctly,
> the reason that we include the TSC_ADJUST in the new ioctl, is that
> we would like to disable the special kvm behavior (that is disable the quirk),
> which would mean that tsc will jump on regular host initiated TSC_ADJUST write.
>
> To avoid this, userspace would set TSC_ADJUST through this new interface.

Yeah, that makes sense. It removes the need to think "I have to set TSC
adjust before TSC".

> Do you think that this is an issue? If so I can make the code work with
> signed numbers.

Not sure if it's an issue, but I prefer to make the API "less
surprising" for userspace. Who knows how it will be used.

> About nsec == 0, this is to allow to use this API for VM initialization.
> (That is to call KVM_SET_TSC_PRECISE prior to doing KVM_GET_TSC_PRECISE)

I prefer using flags for that purpose.

Paolo

2020-11-30 17:07:02

by Paolo Bonzini

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On 30/11/20 17:54, Andy Lutomirski wrote:
>> I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
>> in the documentation to state that it only guarantees invariance if the guest
>> doesn't mess with its own TSC.
>>
>> Also I think we should consider enabling X86_FEATURE_TSC_RELIABLE
>> in the guest kernel when KVM is detected, to keep the guest from even
>> trying to sync the TSC on newly hotplugged vCPUs.
>>
>> (The guest doesn't usually end up touching TSC_ADJUST, but it still might
>> in some cases due to scheduling of guest vCPUs.)
>>
>> (X86_FEATURE_TSC_RELIABLE short-circuits TSC synchronization on CPU hotplug
>> and the TSC clocksource watchdog; the latter we might want to keep.)
> If you're going to change the guest behavior to be more trusting of
> the host, I think
> the host should probably signal this to the guest using a new bit.
>

Yes, a new CPUID bit takes longer to propagate to existing setups, but
it is more future proof.

Paolo

2020-11-30 19:20:37

by Marcelo Tosatti

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

Hi Maxim,

On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> Hi!
>
> This is the first version of the work to make TSC migration more accurate,
> as was defined by Paulo at:
> https://www.spinics.net/lists/kvm/msg225525.html

Description from Oliver's patch:

"To date, VMMs have typically restored the guest's TSCs by value using
the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
value introduces some challenges with synchronization as the TSCs
continue to tick throughout the restoration process. As such, KVM has
some heuristics around TSC writes to infer whether or not the guest or
host is attempting to synchronize the TSCs."

Not really. The synchronization logic tries to sync TSCs during
BIOS boot (and CPU hotplug), because the TSC values are loaded
sequentially, say:

CPU      realtime    TSC val
vcpu0    0 usec      0
vcpu1    100 usec    0
vcpu2    200 usec    0
...

And we'd like all vcpus to read the same value at all times.

Other than that, the comment makes sense. The problem with live migration
is as follows:

We'd like the TSC value to be written, ideally, just before the first
VM-entry of a vCPU (because from the moment TSC_OFFSET has been written,
the vCPU's TSC is ticking, which will cause a visible forward jump
in the vCPU's TSC time).

Before the first VM-entry is the farthest point in time before guest
entry that one could do that.

The window (or forward jump) between KVM_SET_TSC and VM-entry was about
100ms last time I checked (which results in a 100ms time jump forward);
see QEMU commit 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.

Have we measured any improvement with this patchset?

Then Paolo mentions (with >); I am replying as usual.

> Ok, after looking more at the code with Maxim I can confidently say that
> it's a total mess. And a lot of the synchronization code is dead
> because 1) as far as we could see no guest synchronizes the TSC using
> MSR_IA32_TSC;

Well, recent BIOSes take care of synchronizing the TSC. So when Linux
boots, it does not have to synchronize the TSC in software.

However, upon migration (and initialization), the KVM_SET_TSC's do
not happen at exactly the same time (the MSRs for each vCPU are loaded
in sequence). The synchronization code in kvm_set_tsc() is for those cases.
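The heuristic being referred to can be sketched roughly as follows (hypothetical helper name and simplified conditions; the real logic lives in kvm_write_tsc() in arch/x86/kvm/x86.c and handles more cases): a TSC write is treated as part of a restore/sync attempt when the new value lies within the number of cycles that could have ticked since the previous vCPU's write, and the writes happen less than a second apart.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: decide whether a TSC write on one vCPU looks like the VMM
 * restoring all vCPUs to the same point in time. If so, KVM can reuse
 * the previously computed offset instead of creating a per-vCPU skew.
 */
bool tsc_write_looks_like_sync(uint64_t ns_since_last_write,
                               uint64_t last_written_tsc,
                               uint64_t new_tsc,
                               uint64_t tsc_khz)
{
    /* guest cycles that could have ticked since the previous write */
    uint64_t elapsed = ns_since_last_write * tsc_khz / 1000000ULL;

    return ns_since_last_write < 1000000000ULL &&  /* < 1 second apart */
           new_tsc >= last_written_tsc &&
           new_tsc - last_written_tsc <= elapsed;  /* within ticked range */
}
```

A VMM restoring all vCPUs to TSC value 0 a few hundred microseconds apart, as in the table above, matches this rule.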

> and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> synchronization code in kvm_write_tsc.

Not familiar with how guests (or Linux) use MSR_IA32_TSC_ADJUST...
Let's see:


/*
 * Freshly booted CPUs call into this:
 */
void check_tsc_sync_target(void)
{
	struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
	unsigned int cpu = smp_processor_id();
	cycles_t cur_max_warp, gbl_max_warp;
	int cpus = 2;

	/* Also aborts if there is no TSC. */
	if (unsynchronized_tsc())
		return;

	/*
	 * Store, verify and sanitize the TSC adjust register. If
	 * successful skip the test.
	 *
	 * The test is also skipped when the TSC is marked reliable. This
	 * is true for SoCs which have no fallback clocksource. On these
	 * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
	 * register might have been wreckaged by the BIOS..
	 */
	if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
		atomic_inc(&skip_test);
		return;
	}

retry:

I'd force that synchronization path to be taken as a test-case.


> I have a few thoughts about the kvm masterclock synchronization,
> which relate to Paolo's proposal that I implemented.
>
> The idea of masterclock is that when the host TSC is synchronized
> (or as the kernel calls it, stable), and the guest TSC is synchronized as well,
> then we can base the kvmclock, on the same pair of
> (host time in nsec, host tsc value), for all vCPUs.

We _have_ to base. See the comment which starts with

"Assuming a stable TSC across physical CPUS, and a stable TSC"

at x86.c.

>
> This makes the random error in calculation of this value invariant
> across vCPUS, and allows the guest to do kvmclock calculation in userspace
> (vDSO) since kvmclock parameters are vCPU invariant.

Actually, without synchronized host TSCs (and the masterclock scheme,
with a single base read from a vCPU), kvmclock in kernel is buggy as
well:

u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
	unsigned version;
	u64 ret;
	u64 last;
	u8 flags;

	do {
		version = pvclock_read_begin(src);
		ret = __pvclock_read_cycles(src, rdtsc_ordered());
		flags = src->flags;
	} while (pvclock_read_retry(src, version));

	if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
		src->flags &= ~PVCLOCK_GUEST_STOPPED;
		pvclock_touch_watchdogs();
	}

	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
	    (flags & PVCLOCK_TSC_STABLE_BIT))
		return ret;

The code that follows this (including cmpxchg) is a workaround for that
bug.

Workaround would require each vCPU to write to a "global clock", on
every clock read.
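That workaround can be sketched as follows (simplified from the tail of pvclock_clocksource_read(); names are approximate): every reader publishes its result into a single shared value, so time can never appear to go backwards across vCPUs, at the cost of a global atomic on every clock read.

```c
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t last_value;    /* shared by all vCPUs */

/* Force global monotonicity when PVCLOCK_TSC_STABLE_BIT is clear. */
uint64_t monotonic_fixup(uint64_t ret)
{
    uint64_t last = atomic_load(&last_value);

    do {
        if (ret <= last)
            return last;    /* another vCPU already saw a later time */
    } while (!atomic_compare_exchange_weak(&last_value, &last, ret));

    return ret;
}
```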

> To ensure that the guest tsc is synchronized we currently track host/guest tsc
> writes, and enable the master clock only when roughly the same guest's TSC value
> was written across all vCPUs.

Yes, because then you can do:

vcpu0                        vcpu1

A = read TSC
        ... elapsed time ...
                             B = read TSC

delta = B - A

> Recently this was disabled by Paolo

What was disabled exactly?

> and I agree with this, because I think
> that we indeed should only make the guest TSC synchronized by default
> (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> (Trying to guess when the guest syncs the TSC can cause more harm than good).
>
> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> but rather use IA32_TSC_ADJUST which currently doesn't participate
> in the tsc sync heuristics.

Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
the BIOS to boot with synced TSCs.

So i wonder what is making it attempt TSC sync in the first place?

(one might also want to have Linux's synchronization via IA32_TSC_ADJUST
working, but it should not need to happen in the first place, as long as
QEMU and KVM are behaving properly).

> And as far as I know, Linux guest is the primary (only?) user of the kvmclock.

Only AFAIK.

> I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> in the documentation to state that it only guarantees invariance if the guest
> doesn't mess with its own TSC.
>
> Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> in the guest kernel, when kvm is detected to avoid the guest even from trying
> to sync TSC on newly hotplugged vCPUs.

See 7539b174aef405d9d57db48c58390ba360c91312.

Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
becomes stable. Should be tested enough by now?

> (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> in some cases due to scheduling of guest vCPUs)
>
> (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> and TSC clocksource watchdog, and the later we might want to keep).

The latter we want to keep.

> For host TSC writes, just as Paulo proposed we can still do the tsc sync,
> unless the new code that I implemented is in use.

So Paolo's proposal is to

"- for live migration, userspace is expected to use the new
KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
(nanosecond, TSC, TSC_ADJUST) tuple."

Makes sense, so that no time elapses between KVM_SET_TSC and
MSR_WRITE(TSC_ADJUST), which would cause the TSC to drift away
from what is desired by the user.

Since you are proposing this new ioctl, perhaps it's useful to also
reduce the 100ms jump?

"- for live migration, userspace is expected to use the new
KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
(nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
to the guest before the first VM-entry"

Sounds like a good idea (to integrate the values in a tuple).

> Few more random notes:
>
> I have a weird feeling about using 'nsec since 1 January 1970'.
> Common sense is telling me that a 64 bit value can hold about 580 years,
> but still I see that it is more common to use timespec which is a (sec,nsec) pair.

struct timespec {
	time_t	tv_sec;		/* seconds */
	long	tv_nsec;	/* nanoseconds */
};

> I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> Some refactoring might improve things here.

Haven't read the patchset yet...

> For example making kvm_get_walltime_and_clockread work in non tsc case as well
> might make the code cleaner.
>
> Patches to enable this feature in qemu are in process of being sent to
> qemu-devel mailing list.
>
> Best regards,
> Maxim Levitsky
>
> Maxim Levitsky (2):
> KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
>
> Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/x86.c | 88 +++++++++++++++++++++++++++++++--
> include/uapi/linux/kvm.h | 14 ++++++
> 4 files changed, 154 insertions(+), 5 deletions(-)
>
> --
> 2.26.2
>

2020-12-01 12:35:22

by Maxim Levitsky

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> Hi Maxim,
>
> On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > Hi!
> >
> > This is the first version of the work to make TSC migration more accurate,
> > as was defined by Paulo at:
> > https://www.spinics.net/lists/kvm/msg225525.html
>
> Description from Oliver's patch:
>
> "To date, VMMs have typically restored the guest's TSCs by value using
> the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> value introduces some challenges with synchronization as the TSCs
> continue to tick throughout the restoration process. As such, KVM has
> some heuristics around TSC writes to infer whether or not the guest or
> host is attempting to synchronize the TSCs."
>
> Not really. The synchronization logic tries to sync TSCs during
> BIOS boot (and CPU hotplug), because the TSC values are loaded
> sequentially, say:
>
> CPU realtime TSC val
> vcpu0 0 usec 0
> vcpu1 100 usec 0
> vcpu2 200 usec 0
> ...
>
> And we'd like to see all vcpus to read the same value at all times.
>
> Other than that, comment makes sense. The problem with live migration
> is as follows:
>
> We'd like the TSC value to be written, ideally, just before the first
> VM-entry a vCPU (because at the moment the TSC_OFFSET has been written,
> the vcpus tsc is ticking, which will cause a visible forward jump
> in vcpus tsc time).
>
> Before the first VM-entry is the farthest point in time before guest
> entry that one could do that.
>
> The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> 100ms last time i checked (which results in a 100ms time jump forward),
> See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
>
> Have we measured any improvement with this patchset?

It's not about this window.
It is about the time that passes between the point at which we read the
TSC on the source system (and we do it in QEMU each time the VM is paused)
and the moment we set the same TSC value on the target.
That time is unbounded.

Also, this patchset should decrease the TSC skew that happens when
restoring the TSC on multiple vCPUs, since KVM_SET_TSC_STATE doesn't
have to happen at the same time on each vCPU: it accounts for the time
that has passed.
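A minimal sketch of that accounting (hypothetical helper; the proposed KVM_SET_TSC_STATE ioctl would do the equivalent in the kernel, and real code must guard against multiplication overflow): given the (nanoseconds, TSC) pair captured on the source, advance the TSC by the wall-clock time that has passed before writing it on the destination.

```c
#include <stdint.h>

/*
 * Advance a saved guest TSC by the host time elapsed since it was read,
 * so that restoring it late (or at different times on different vCPUs)
 * does not freeze or skew guest time.
 */
uint64_t compensated_guest_tsc(uint64_t saved_tsc, uint64_t saved_ns,
                               uint64_t now_ns, uint64_t tsc_khz)
{
    uint64_t elapsed_ns = now_ns - saved_ns;

    /* ns * kHz / 1e6 == cycles (watch for overflow in real code) */
    return saved_tsc + elapsed_ns * tsc_khz / 1000000ULL;
}
```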


Speaking of kvmclock, somewhat off-topic since this is a different issue:
I found out that QEMU reads the kvmclock value on each pause,
and then 'restores' it on unpause using
KVM_SET_CLOCK (which modifies the global kvmclock offset).

This means (and I tested it) that if the guest uses kvmclock
for time reference, it will not account for time passed in
the paused state.

>
> Then Paolo mentions (with >), i am replying as usual.
>
> > Ok, after looking more at the code with Maxim I can confidently say that
> > it's a total mess. And a lot of the synchronization code is dead
> > because 1) as far as we could see no guest synchronizes the TSC using
> > MSR_IA32_TSC;
>
> Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> boots, it does not have to synchronize TSC in software.

Do you have an example of such a BIOS? I tested OVMF which I compiled
from git master a few weeks ago, and I also tested this with SeaBIOS
from the QEMU repo, and I have never seen writes to either TSC or
TSC_ADJUST from the BIOS.

Or do you refer to the native BIOS on the host doing TSC synchronization?

>
> However, upon migration (and initialization), the KVM_SET_TSC's do
> not happen at exactly the same time (the MSRs for each vCPU are loaded
> in sequence). The synchronization code in kvm_set_tsc() is for those cases.

I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
is going to fix, since it accounts for it.


>
> > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > synchronization code in kvm_write_tsc.
>
> Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> Let's see:
>
>
> /*
>  * Freshly booted CPUs call into this:
>  */
> void check_tsc_sync_target(void)
> {
> 	struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> 	unsigned int cpu = smp_processor_id();
> 	cycles_t cur_max_warp, gbl_max_warp;
> 	int cpus = 2;
>
> 	/* Also aborts if there is no TSC. */
> 	if (unsynchronized_tsc())
> 		return;
>
> 	/*
> 	 * Store, verify and sanitize the TSC adjust register. If
> 	 * successful skip the test.
> 	 *
> 	 * The test is also skipped when the TSC is marked reliable. This
> 	 * is true for SoCs which have no fallback clocksource. On these
> 	 * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> 	 * register might have been wreckaged by the BIOS..
> 	 */
> 	if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> 		atomic_inc(&skip_test);
> 		return;
> 	}
>
> retry:
>
> I'd force that synchronization path to be taken as a test-case.

Or even better, as I suggested, we might tell the guest kernel
to avoid this synchronization path when KVM is detected
(regardless of the invtsc flag).

>
>
> > I have a few thoughts about the kvm masterclock synchronization,
> > which relate to the Paulo's proposal that I implemented.
> >
> > The idea of masterclock is that when the host TSC is synchronized
> > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > then we can base the kvmclock, on the same pair of
> > (host time in nsec, host tsc value), for all vCPUs.
>
> We _have_ to base. See the comment which starts with
>
> "Assuming a stable TSC across physical CPUS, and a stable TSC"
>
> at x86.c.
>
> > This makes the random error in calculation of this value invariant
> > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > (vDSO) since kvmclock parameters are vCPU invariant.
>
> Actually, without synchronized host TSCs (and the masterclock scheme,
> with a single base read from a vCPU), kvmclock in kernel is buggy as
> well:
>
> u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> {
> 	unsigned version;
> 	u64 ret;
> 	u64 last;
> 	u8 flags;
>
> 	do {
> 		version = pvclock_read_begin(src);
> 		ret = __pvclock_read_cycles(src, rdtsc_ordered());
> 		flags = src->flags;
> 	} while (pvclock_read_retry(src, version));
>
> 	if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> 		src->flags &= ~PVCLOCK_GUEST_STOPPED;
> 		pvclock_touch_watchdogs();
> 	}
>
> 	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> 	    (flags & PVCLOCK_TSC_STABLE_BIT))
> 		return ret;
>
> The code that follows this (including cmpxchg) is a workaround for that
> bug.

I understand that. I am not arguing that we shouldn't use the masterclock!
I am just stating the conditions under which it works.

>
> Workaround would require each vCPU to write to a "global clock", on
> every clock read.
>
> > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > writes, and enable the master clock only when roughly the same guest's TSC value
> > was written across all vCPUs.
>
> Yes, because then you can do:
>
> vcpu0                        vcpu1
>
> A = read TSC
>         ... elapsed time ...
>                              B = read TSC
>
> delta = B - A
>
> > Recently this was disabled by Paulo
>
> What was disabled exactly?

The running of the TSC synchronization code when the _guest_ writes the TSC.

Which changes two things:
1. If the guest de-synchronizes its TSC, we won't disable the master clock.
2. If the guest writes similar TSC values on each vCPU, we won't detect
   this as a synchronization attempt, replace the values with exactly the
   same one, and finally re-enable the master clock.

I argue that this change is OK, because Linux guests don't write to the TSC
at all, the virtual BIOSes seem not to write there either, and the only case
in which a Linux guest tries to change its TSC is on CPU hotplug, as you
mention, and there it uses TSC_ADJUST, which currently doesn't trigger the
TSC synchronization code in KVM anyway, so it is broken already.

However, I also argue that we should mention this in the documentation just
in case, and we might also want (also just in case) to make Linux guests
avoid even trying to touch the TSC_ADJUST register when running under KVM.

To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.

Having said all that, now that I know the TSC sync code and the
reasons why it is there, I wouldn't argue against putting it back either.

>
> > and I agree with this, because I think
> > that we indeed should only make the guest TSC synchronized by default
> > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > (Trying to guess when the guest syncs the TSC can cause more harm than good).
> >
> > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > in the tsc sync heuristics.
>
> Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> the BIOS to boot with synced TSCs.
>
> So i wonder what is making it attempt TSC sync in the first place?

CPU hotplug. And the guest doesn't really write to TSC_ADJUST,
since its measurement code doesn't detect any TSC warps.

I was just thinking that, in theory, since this is a VM and it can be
interrupted at any point, the measurement code should sometimes fail
and cause trouble.
I didn't do much homework on this, so I might be overreacting.

As far as I can see, X86_FEATURE_TSC_RELIABLE was added mostly to support
running under Hyper-V and VMware, and these should be prone to similar
issues, which supports my theory.

>
> (one might also want to have Linux's synchronization via IA32_TSC_ADJUST
> working, but it should not need to happen in the first place, as long as
> QEMU and KVM are behaving properly).
>
> > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
>
> Only AFAIK.
>
> > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > in the documentation to state that it only guarantees invariance if the guest
> > doesn't mess with its own TSC.
> >
> > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > to sync TSC on newly hotplugged vCPUs.
>
> See 7539b174aef405d9d57db48c58390ba360c91312.

I know about this, and I personally always use invtsc
with my VMs.
>
>
> Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> becomes stable. Should be tested enough by now?

The issue is that QEMU blocks migration when invtsc is set, based on the
fact that the target machine might have a different TSC frequency and no
support for TSC scaling.
There was a long debate on this a while ago.

It is possible, though, to override this by specifying the exact frequency
you want the guest TSC to run at, using something like
(tsc-frequency=3500000000).
I haven't checked whether libvirt does this or not.

I do think that as long as the user uses modern CPUs (which have a stable
TSC and support TSC scaling), there is no reason to disable invtsc, and
therefore no reason to use kvmclock.

>
> > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > in some cases due to scheduling of guest vCPUs)
> >
> > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > and TSC clocksource watchdog, and the later we might want to keep).
>
> The latter we want to keep.
>
> > For host TSC writes, just as Paulo proposed we can still do the tsc sync,
> > unless the new code that I implemented is in use.
>
> So Paolo's proposal is to
>
> "- for live migration, userspace is expected to use the new
> KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> (nanosecond, TSC, TSC_ADJUST) tuple."
>
> Makes sense, so that no time between KVM_SET_TSC and
> MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> of what is desired by the user.
>
> Since you are proposing this new ioctl, perhaps its useful to also
> reduce the 100ms jump?

Yep. As long as the source and destination clocks are synchronized,
it should make things better.

>
> "- for live migration, userspace is expected to use the new
> KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> to the guest before the first VM-entry"
>
> Sounds like a good idea (to integrate the values in a tuple).
>
> > Few more random notes:
> >
> > I have a weird feeling about using 'nsec since 1 January 1970'.
> > Common sense is telling me that a 64 bit value can hold about 580 years,
> > but still I see that it is more common to use timespec which is a (sec,nsec) pair.
>
> struct timespec {
> 	time_t	tv_sec;		/* seconds */
> 	long	tv_nsec;	/* nanoseconds */
> };
>
> > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > Some refactoring might improve things here.
>
> Haven't read the patchset yet...
>
> > For example making kvm_get_walltime_and_clockread work in non tsc case as well
> > might make the code cleaner.
> >
> > Patches to enable this feature in qemu are in process of being sent to
> > qemu-devel mailing list.
> >
> > Best regards,
> > Maxim Levitsky
> >
> > Maxim Levitsky (2):
> > KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> >
> > Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++
> > arch/x86/include/uapi/asm/kvm.h | 1 +
> > arch/x86/kvm/x86.c | 88 +++++++++++++++++++++++++++++++--
> > include/uapi/linux/kvm.h | 14 ++++++
> > 4 files changed, 154 insertions(+), 5 deletions(-)
> >
> > --
> > 2.26.2
> >

Best regards,
Maxim Levitsky


2020-12-01 13:50:59

by Thomas Gleixner

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
>> Besides, Linux guests don't sync the TSC via IA32_TSC write,
>> but rather use IA32_TSC_ADJUST which currently doesn't participate
>> in the tsc sync heuristics.
>
> Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> the BIOS to boot with synced TSCs.

That's wishful thinking.

Reality is that BIOS tinkerers fail to get it right. TSC_ADJUST allows
us to undo the wreckage they create.

Thanks,

tglx

2020-12-01 14:07:43

by Thomas Gleixner

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> Not really. The synchronization logic tries to sync TSCs during
> BIOS boot (and CPU hotplug), because the TSC values are loaded
> sequentially, say:
>
> CPU realtime TSC val
> vcpu0 0 usec 0
> vcpu1 100 usec 0
> vcpu2 200 usec 0

That's nonsense, really.

> And we'd like to see all vcpus to read the same value at all times.

Providing guests with a synchronized and stable TSC on a host with a
synchronized and stable TSC is trivial.

Write the _same_ TSC offset to _all_ vcpu control structs and be done
with it. It's not rocket science.

The guest TSC read is:

hostTSC + vcpu_offset

So if the host TSC is synchronized then the guest TSCs are synchronized
as well.

If the host TSC is not synchronized, then don't even try.
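As a sketch (illustrative types only, not the actual KVM structures), the scheme above amounts to:

```c
#include <stdint.h>

struct vcpu {
    uint64_t tsc_offset;   /* hardware applies: guest_tsc = host_tsc + offset */
};

/* What the CPU effectively computes on every guest TSC read. */
uint64_t guest_tsc(uint64_t host_tsc, const struct vcpu *v)
{
    return host_tsc + v->tsc_offset;
}

/* Write the _same_ offset to every vCPU control struct. */
void set_guest_tsc_offset_all(struct vcpu *vcpus, int n, uint64_t offset)
{
    for (int i = 0; i < n; i++)
        vcpus[i].tsc_offset = offset;
}
```

With identical offsets, a synchronized host TSC trivially yields synchronized guest TSCs.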

Thanks,

tglx

2020-12-01 16:23:01

by Andy Lutomirski

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration



> On Dec 1, 2020, at 6:01 AM, Thomas Gleixner <[email protected]> wrote:
>
> On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
>> Not really. The synchronization logic tries to sync TSCs during
>> BIOS boot (and CPU hotplug), because the TSC values are loaded
>> sequentially, say:
>>
>> CPU realtime TSC val
>> vcpu0 0 usec 0
>> vcpu1 100 usec 0
>> vcpu2 200 usec 0
>
> That's nonsense, really.
>
>> And we'd like to see all vcpus to read the same value at all times.
>
> Providing guests with a synchronized and stable TSC on a host with a
> synchronized and stable TSC is trivial.
>
> Write the _same_ TSC offset to _all_ vcpu control structs and be done
> with it. It's not rocket science.
>
> The guest TSC read is:
>
> hostTSC + vcpu_offset
>
> So if the host TSC is synchronized then the guest TSCs are synchronized
> as well.
>
> If the host TSC is not synchronized, then don't even try.

This reminds me: if you’re adding a new kvm feature that tells the guest that the TSC works well, could you perhaps only have one structure for all vCPUs in the same guest?

>
> Thanks,
>
> tglx

2020-12-01 19:30:58

by Marcelo Tosatti

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Tue, Dec 01, 2020 at 02:48:11PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> >> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> >> but rather use IA32_TSC_ADJUST which currently doesn't participate
> >> in the tsc sync heuristics.
> >
> > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > the BIOS to boot with synced TSCs.
>
> That's wishful thinking.
>
> Reality is that BIOS tinkerers fail to get it right. TSC_ADJUST allows
> us to undo the wreckage they create.
>
> Thanks,
>
> tglx

I have not seen any multicore Dell/HP systems require that.

Anyway, for QEMU/KVM it should be synced (unless there is a bug
in the sync logic in the first place).

2020-12-01 19:40:26

by Thomas Gleixner

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> The idea of masterclock is that when the host TSC is synchronized
> (or as kernel call it, stable), and the guest TSC is synchronized as well,
> then we can base the kvmclock, on the same pair of
> (host time in nsec, host tsc value), for all vCPUs.

That's correct. All guest CPUs should see exactly the same TSC value,
i.e.

hostTSC + vcpu_offset

> This makes the random error in calculation of this value invariant
> across vCPUS, and allows the guest to do kvmclock calculation in userspace
> (vDSO) since kvmclock parameters are vCPU invariant.

That's not the case today? OMG!

> To ensure that the guest tsc is synchronized we currently track host/guest tsc
> writes, and enable the master clock only when roughly the same guest's TSC value
> was written across all vCPUs.

The Linux kernel never writes the TSC. We've tried that ~15 years ago
and it was a total disaster.

> Recently this was disabled by Paolo and I agree with this, because I think
> that we indeed should only make the guest TSC synchronized by default
> (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> (Trying to guess when the guest syncs the TSC can cause more harm than good).
>
> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> but rather use IA32_TSC_ADJUST which currently doesn't participate
> in the tsc sync heuristics.

The kernel only writes TSC_ADJUST when it is advertised in CPUID and:

 1) when the boot CPU detects a non-zero TSC_ADJUST value it writes
    it to 0, except when running on SGI-UV

 2) when a starting CPU has a different TSC_ADJUST value than the
    first CPU which came up on the same socket.

 3) when the first CPU of a different socket is starting and the TSC
    synchronization check fails against a CPU on an already checked
    socket, then the kernel tries to adjust TSC_ADJUST to the point
    that the synchronization check does not fail anymore.
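A rough sketch of rules 1 and 2 above (illustrative helper only; the real logic is in tsc_store_and_check_tsc_adjust() in arch/x86/kernel/tsc_sync.c and also handles the SGI-UV exception and rule 3):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns the TSC_ADJUST value a CPU should end up with:
 * the boot CPU forces a non-zero value back to 0 (rule 1), and
 * secondary CPUs adopt the value of the first CPU that came up
 * on their socket (rule 2).
 */
int64_t desired_tsc_adjust(bool boot_cpu, int64_t cur_adjust,
                           int64_t socket_first_adjust)
{
    if (boot_cpu)
        return 0;
    if (cur_adjust != socket_first_adjust)
        return socket_first_adjust;
    return cur_adjust;   /* already consistent, no MSR write needed */
}
```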

> And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
>
> I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> in the documentation to state that it only guarantees invariance if the guest
> doesn't mess with its own TSC.
>
> Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> in the guest kernel, when kvm is detected to avoid the guest even from trying
> to sync TSC on newly hotplugged vCPUs.
>
> (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> in some cases due to scheduling of guest vCPUs)

The only cases it would try to write are #3 above or because the
hypervisor or BIOS messed it up (#1, #2).

> (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> and TSC clocksource watchdog, and the later we might want to keep).

Depends. If the host TSC is stable and synchronized, then you don't need
the TSC watchdog. We are slowly starting to trust the TSC to some extent
and phase out the watchdog for newer parts (hopefully).

If the host TSC really falls apart then it still can invalidate
KVM_CLOCK and force the guest to reevaluate the situation.

> Few more random notes:
>
> I have a weird feeling about using 'nsec since 1 January 1970'.
> Common sense is telling me that a 64 bit value can hold about 580
> years,

which is plenty.

> but still I see that it is more common to use timespec which is a
> (sec,nsec) pair.

timespecs are horrible.

Thanks,

tglx

2020-12-01 19:46:16

by Thomas Gleixner

Subject: Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE

On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> +	struct kvm_tsc_info {
> +		__u32 flags;
> +		__u64 nsec;
> +		__u64 tsc;
> +		__u64 tsc_adjust;
> +	};
> +
> +flags values for ``struct kvm_tsc_info``:
> +
> +``KVM_TSC_INFO_TSC_ADJUST_VALID``
> +
> + ``tsc_adjust`` contains valid IA32_TSC_ADJUST value

Why expose TSC_ADJUST at all? Just because?

Thanks,

tglx

2020-12-01 20:07:32

by Marcelo Tosatti

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > Hi Maxim,
> >
> > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > Hi!
> > >
> > > This is the first version of the work to make TSC migration more accurate,
> > > as was defined by Paulo at:
> > > https://www.spinics.net/lists/kvm/msg225525.html
> >
> > Description from Oliver's patch:
> >
> > "To date, VMMs have typically restored the guest's TSCs by value using
> > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > value introduces some challenges with synchronization as the TSCs
> > continue to tick throughout the restoration process. As such, KVM has
> > some heuristics around TSC writes to infer whether or not the guest or
> > host is attempting to synchronize the TSCs."
> >
> > Not really. The synchronization logic tries to sync TSCs during
> > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > sequentially, say:
> >
> > CPU realtime TSC val
> > vcpu0 0 usec 0
> > vcpu1 100 usec 0
> > vcpu2 200 usec 0
> > ...
> >
> > And we'd like to see all vcpus to read the same value at all times.
> >
> > Other than that, comment makes sense. The problem with live migration
> > is as follows:
> >
> > We'd like the TSC value to be written, ideally, just before the first
> > VM-entry a vCPU (because at the moment the TSC_OFFSET has been written,
> > the vcpus tsc is ticking, which will cause a visible forward jump
> > in vcpus tsc time).
> >
> > Before the first VM-entry is the farthest point in time before guest
> > entry that one could do that.
> >
> > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > 100ms last time i checked (which results in a 100ms time jump forward),
> > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> >
> > Have we measured any improvement with this patchset?
>
> It's not about this window.
> It is about the time that passes between the point where we read the
> TSC on the source system (and we do it in QEMU each time the VM is paused)
> and the moment that we set the same TSC value on the target.
> That time is unbounded.

OK. Well, it's the same problem: ideally you'd want to do that just
before VCPU-entry.

> Also this patchset should decrease TSC skew that happens
> between restoring it on multiple vCPUs as well, since
> KVM_SET_TSC_STATE doesn't have to happen at the same time,
> as it accounts for time passed on each vCPU.
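The per-vCPU accounting described above can be sketched roughly as follows (a simplified model, not the actual KVM_SET_TSC_STATE implementation; the function name and interface are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the accounting: the (nanosecond, TSC) pair
 * saved on the source is advanced by the wall-clock time that elapsed
 * until this particular vCPU is restored, so vCPUs restored at
 * different moments still land on the same TSC timeline. */
static uint64_t restored_guest_tsc(uint64_t saved_tsc, uint64_t saved_ns,
                                   uint64_t now_ns, uint64_t tsc_khz)
{
    uint64_t elapsed_ns = now_ns - saved_ns;

    /* tsc_khz ticks elapse per millisecond; split the multiplication
     * to avoid 64-bit overflow for large elapsed times. */
    return saved_tsc + elapsed_ns / 1000000 * tsc_khz
                     + (elapsed_ns % 1000000) * tsc_khz / 1000000;
}
```

For example, a vCPU restored one second after the state was captured on a 3.5 GHz guest gets its saved TSC advanced by 3.5e9 ticks, regardless of when during the restoration sequence its ioctl actually runs.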
>
>
> Speaking of kvmclock, somewhat off-topic since this is a different issue,
> I found out that qemu reads the kvmclock value on each pause,
> and then 'restores' on unpause, using
> KVM_SET_CLOCK (this modifies the global kvmclock offset)
>
> This means (and I tested it) that if the guest uses kvmclock
> for a time reference, it will not account for time passed in
> the paused state.

Yes, this is necessary because otherwise there might be an overflow
in the kernel time accounting code (if the clock delta is too large).

>
> >
> > Then Paolo mentions (with >), i am replying as usual.
> >
> > > Ok, after looking more at the code with Maxim I can confidently say that
> > > it's a total mess. And a lot of the synchronization code is dead
> > > because 1) as far as we could see no guest synchronizes the TSC using
> > > MSR_IA32_TSC;
> >
> > Well, recent BIOSes take care of synchronizing the TSC. So when Linux
> > boots, it does not have to synchronize TSC in software.
>
> Do you have an example of such BIOS? I tested OVMF which I compiled
> from git master a few weeks ago, and I also tested this with seabios
> from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> from BIOS.

Oh, well, QEMU then.

> Or do you refer to the native BIOS on the host doing TSC synchronization?

No, virt.

> > However, upon migration (and initialization), the KVM_SET_TSC's do
> > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
>
> I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> is going to fix, since it accounts for it.
>
>
> >
> > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > synchronization code in kvm_write_tsc.
> >
> > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> > Lets see:
> >
> >
> > /*
> >  * Freshly booted CPUs call into this:
> >  */
> > void check_tsc_sync_target(void)
> > {
> >         struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> >         unsigned int cpu = smp_processor_id();
> >         cycles_t cur_max_warp, gbl_max_warp;
> >         int cpus = 2;
> >
> >         /* Also aborts if there is no TSC. */
> >         if (unsynchronized_tsc())
> >                 return;
> >
> >         /*
> >          * Store, verify and sanitize the TSC adjust register. If
> >          * successful skip the test.
> >          *
> >          * The test is also skipped when the TSC is marked reliable. This
> >          * is true for SoCs which have no fallback clocksource. On these
> >          * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> >          * register might have been wreckaged by the BIOS..
> >          */
> >         if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> >                 atomic_inc(&skip_test);
> >                 return;
> >         }
> >
> > retry:
> >
> > I'd force that synchronization path to be taken as a test-case.
>
> Or even better, as I suggested, we might tell the guest kernel
> to avoid this synchronization path when KVM is detected
> (regardless of invtsc flag)
>
> >
> >
> > > I have a few thoughts about the kvm masterclock synchronization,
> > > which relate to Paolo's proposal that I implemented.
> > >
> > > The idea of masterclock is that when the host TSC is synchronized
> > > (or as the kernel calls it, stable), and the guest TSC is synchronized as well,
> > > then we can base the kvmclock on the same pair of
> > > (host time in nsec, host tsc value), for all vCPUs.
> >
> > We _have_ to base. See the comment which starts with
> >
> > "Assuming a stable TSC across physical CPUS, and a stable TSC"
> >
> > at x86.c.
> >
> > > This makes the random error in calculation of this value invariant
> > > across vCPUs, and allows the guest to do the kvmclock calculation in userspace
> > > (vDSO) since kvmclock parameters are vCPU invariant.
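The vCPU-invariant calculation being discussed is, roughly, the pvclock scaling formula: guest time is a base nanosecond value plus the TSC delta scaled by a per-boot multiplier and shift. A simplified sketch (not the kernel's exact code; `pvclock_ns` is a hypothetical name, and `unsigned __int128` is a GCC/Clang extension):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the pvclock arithmetic: scale the TSC delta by a
 * (multiplier, shift) pair and add it to the base nanosecond value.
 * If the (base_ns, base_tsc, mul, shift) parameters are identical on
 * every vCPU, the result is vCPU invariant and safe for vDSO use. */
static uint64_t pvclock_ns(uint64_t tsc, uint64_t base_tsc,
                           uint64_t base_ns, uint32_t mul, int8_t shift)
{
    uint64_t delta = tsc - base_tsc;

    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    /* 64x32 -> 96 bit multiply, keep the high 64 bits */
    return base_ns + (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}
```

With `mul = 0x80000000` (a factor of 0.5 after the `>> 32`) and `shift = 1`, the two scalings cancel and the TSC delta passes through unchanged.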
> >
> > Actually, without synchronized host TSCs (and the masterclock scheme,
> > with a single base read from a vCPU), kvmclock in kernel is buggy as
> > well:
> >
> > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> > {
> >         unsigned version;
> >         u64 ret;
> >         u64 last;
> >         u8 flags;
> >
> >         do {
> >                 version = pvclock_read_begin(src);
> >                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
> >                 flags = src->flags;
> >         } while (pvclock_read_retry(src, version));
> >
> >         if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> >                 src->flags &= ~PVCLOCK_GUEST_STOPPED;
> >                 pvclock_touch_watchdogs();
> >         }
> >
> >         if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> >             (flags & PVCLOCK_TSC_STABLE_BIT))
> >                 return ret;
> >
> > The code that follows this (including cmpxchg) is a workaround for that
> > bug.
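The workaround mentioned here can be sketched as follows (a simplified user-space model of the kernel's pvclock.c fallback path, not the exact kernel code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Without PVCLOCK_TSC_STABLE_BIT, every reader must serialize against
 * a global "last seen" value so kvmclock never appears to go backwards
 * across vCPUs. This is the cross-CPU write on every read that makes
 * the fallback path slow. */
static _Atomic uint64_t last_value;

static uint64_t pvclock_monotonic(uint64_t ret)
{
    uint64_t last;

    do {
        last = atomic_load(&last_value);
        if (ret < last)  /* another vCPU already observed a later time */
            return last;
    } while (!atomic_compare_exchange_weak(&last_value, &last, ret));

    return ret;
}
```

A reader whose raw value lags behind the global maximum simply returns the maximum, which clamps the clock to monotonicity at the cost of contended atomics.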
>
> I understand that. I am not arguing that we shouldn't use the masterclock!
> I am just stating the facts about the conditions under which it works.

Sure.

> > Workaround would require each vCPU to write to a "global clock", on
> > every clock read.
> >
> > > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > > writes, and enable the master clock only when roughly the same guest's TSC value
> > > was written across all vCPUs.
> >
> > Yes, because then you can do:
> >
> > vcpu0                          vcpu1
> >
> > A = read TSC
> >        ... elapsed time ...
> >                                B = read TSC
> >
> > delta = B - A
> >
> > > Recently this was disabled by Paolo
> >
> > What was disabled exactly?
>
> The running of tsc synchronization code when the _guest_ writes the TSC.
>
> Which changes two things:
> 1. If the guest de-synchronizes its TSC, we won't disable master clock.
> 2. If the guest writes similar TSC values on each vCPU we won't detect
> this as a synchronization attempt, replace it with exactly the same
> value, and finally re-enable the master clock.
>
> I argue that this change is OK, because Linux guests don't write to TSC at all,
> the virtual BIOSes seem not to write there either, and the only case in which
> the Linux guest tries to change its TSC is on CPU hotplug as you mention and
> it uses TSC_ADJUST, which currently doesn't trigger the TSC synchronization code in
> KVM anyway, so it is broken already.
>
> However I also argue that we should mention this in documentation just in case,
> and we might also want (also just in case) to make Linux guests avoid even trying to
> touch TSC_ADJUST register when running under KVM.
>
> To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
> 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.
>
> Having said all that, now that I know tsc sync code, and the
> reasons why it is there, I wouldn't be arguing about putting it back either.

Agree.

> > > and I agree with this, because I think
> > > that we indeed should only make the guest TSC synchronized by default
> > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > > (Trying to guess when the guest syncs the TSC can cause more harm than good).
> > >
> > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > in the TSC sync heuristics.
> >
> > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > the BIOS to boot with synced TSCs.
> >
> > So i wonder what is making it attempt TSC sync in the first place?
>
> CPU hotplug. And the guest doesn't really write to TSC_ADJUST
> since its measurement code doesn't detect any TSC warps.
>
> I was just thinking that, in theory, since this is a VM and it can be
> interrupted at any point, the measurement code should sometimes fail
> and cause trouble.
> I didn't do much homework on this so I might be overreacting.

That is true (and you can see it with a CPU starved guest).

> As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
> running under Hyper-V and VMware, and these should be prone to similar
> issues, supporting my theory.
>
> >
> > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST
> > working, but it should not need to happen in the first place, as long as
> > QEMU and KVM are behaving properly).
> >
> > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> >
> > Only AFAIK.
> >
> > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > > in the documentation to state that it only guarantees invariance if the guest
> > > doesn't mess with its own TSC.
> > >
> > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > > to sync TSC on newly hotplugged vCPUs.
> >
> > See 7539b174aef405d9d57db48c58390ba360c91312.
>
> I know about this, and I personally always use invtsc
> with my VMs.

Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable
with TSC!

> > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> > becomes stable. Should be tested enough by now?
>
> The issue is that Qemu blocks migration when invtsc is set, based on the
> fact that the target machine might have different TSC frequency and no
> support for TSC scaling.
> There was a long debate on this long ago.

Oh right.

> It is possible though to override this by specifying the exact frequency
> you want the guest TSC to run at, by using something like
> (tsc-frequency=3500000000)
> I haven't checked if libvirt does this or not.

It does.

> I do think that as long as the user uses modern CPUs (which have stable TSC
> and support TSC scaling), there is no reason to disable invtsc, and
> therefore no reason to use kvmclock.

Yep. TSC is faster.

> > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > > in some cases due to scheduling of guest vCPUs)
> > >
> > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > > and the TSC clocksource watchdog, and the latter we might want to keep).
> >
> > The latter we want to keep.
> >
> > > For host TSC writes, just as Paolo proposed, we can still do the TSC sync,
> > > unless the new code that I implemented is in use.
> >
> > So Paolo's proposal is to
> >
> > "- for live migration, userspace is expected to use the new
> > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > (nanosecond, TSC, TSC_ADJUST) tuple."
> >
> > Makes sense, so that no time between KVM_SET_TSC and
> > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> > of what is desired by the user.
> >
> > Since you are proposing this new ioctl, perhaps its useful to also
> > reduce the 100ms jump?
>
> Yep. As long as the source and destination clocks are synchronized,
> it should make it better.
>
> >
> > "- for live migration, userspace is expected to use the new
> > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> > to the guest before the first VM-entry"
> >
> > Sounds like a good idea (to integrate the values in a tuple).
> >
> > > Few more random notes:
> > >
> > > I have a weird feeling about using 'nsec since 1 January 1970'.
> > > Common sense is telling me that a 64 bit value can hold about 580 years,
> > > but still I see that it is more common to use a timespec, which is a (sec, nsec) pair.
> >
> > struct timespec {
> >         time_t tv_sec;    /* seconds */
> >         long   tv_nsec;   /* nanoseconds */
> > };
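For reference, converting between the two representations is mechanical either way; a minimal sketch (helper names are illustrative), which also bears out the "about 580 years" estimate for a 64-bit nanosecond counter:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* A 64-bit unsigned nanosecond counter wraps after
 * 2^64 / 1e9 / 86400 / 365.25 ~= 584 years, so the single-value
 * representation loses nothing practical versus a (sec, nsec) pair. */
static struct timespec ns_to_timespec(uint64_t ns)
{
    struct timespec ts = {
        .tv_sec  = (time_t)(ns / 1000000000ULL),
        .tv_nsec = (long)(ns % 1000000000ULL),
    };
    return ts;
}

static uint64_t timespec_to_ns(const struct timespec *ts)
{
    return (uint64_t)ts->tv_sec * 1000000000ULL + (uint64_t)ts->tv_nsec;
}
```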
> >
> > > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > > Some refactoring might improve things here.
> >
> > Haven't read the patchset yet...
> >
> > > For example, making kvm_get_walltime_and_clockread work in the non-TSC case as well
> > > might make the code cleaner.
> > >
> > > Patches to enable this feature in qemu are in the process of being sent to
> > > qemu-devel mailing list.
> > >
> > > Best regards,
> > > Maxim Levitsky
> > >
> > > Maxim Levitsky (2):
> > > KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > > KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > >
> > > Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++
> > > arch/x86/include/uapi/asm/kvm.h | 1 +
> > > arch/x86/kvm/x86.c | 88 +++++++++++++++++++++++++++++++--
> > > include/uapi/linux/kvm.h | 14 ++++++
> > > 4 files changed, 154 insertions(+), 5 deletions(-)
> > >
> > > --
> > > 2.26.2
> > >
>
> Best regards,
> Maxim Levitsky
>

2020-12-03 11:15:54

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE

On Tue, 2020-12-01 at 20:43 +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> > + struct kvm_tsc_info {
> > +        __u32 flags;
> > +        __u64 nsec;
> > +        __u64 tsc;
> > +        __u64 tsc_adjust;
> > + };
> > +
> > +flags values for ``struct kvm_tsc_info``:
> > +
> > +``KVM_TSC_INFO_TSC_ADJUST_VALID``
> > +
> > + ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
>
> Why exposing TSC_ADJUST at all? Just because?

It's because we want to reduce the number of cases where
KVM's MSR read/write behavior differs between guest and host
(e.g. QEMU) writes.

TSC and TSC_ADJUST are tied at the architectural level, such that
changing one changes the other.

However, for migration to work, we must be able
to set each one separately.

Currently, KVM does this by turning the host write to
TSC_ADJUST into a special case that bypasses
the actual TSC adjustment, and just sets this MSR.

The next patch in this series will allow disabling
this special behavior, making host TSC_ADJUST writes
work the same way as guest writes.

Therefore, to still allow setting TSC_ADJUST and TSC independently
after migration, this ioctl will be used instead.

Best regards,
Maxim Levitsky

>
> Thanks,
>
> tglx
>


2020-12-03 11:46:07

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Tue, 2020-12-01 at 16:48 -0300, Marcelo Tosatti wrote:
> On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> > On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > > Hi Maxim,
> > >
> > > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > > Hi!
> > > >
> > > > This is the first version of the work to make TSC migration more accurate,
> > > > as was defined by Paulo at:
> > > > https://www.spinics.net/lists/kvm/msg225525.html
> > >
> > > Description from Oliver's patch:
> > >
> > > "To date, VMMs have typically restored the guest's TSCs by value using
> > > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > > value introduces some challenges with synchronization as the TSCs
> > > continue to tick throughout the restoration process. As such, KVM has
> > > some heuristics around TSC writes to infer whether or not the guest or
> > > host is attempting to synchronize the TSCs."
> > >
> > > Not really. The synchronization logic tries to sync TSCs during
> > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > sequentially, say:
> > >
> > > CPU      realtime    TSC val
> > > vcpu0    0 usec      0
> > > vcpu1    100 usec    0
> > > vcpu2    200 usec    0
> > > ...
> > >
> > > And we'd like all vcpus to read the same value at all times.
> > >
> > > Other than that, comment makes sense. The problem with live migration
> > > is as follows:
> > >
> > > We'd like the TSC value to be written, ideally, just before the first
> > > VM-entry of a vCPU (because from the moment the TSC_OFFSET has been
> > > written, the vCPU's TSC is ticking, which will cause a visible forward
> > > jump in the vCPU's TSC time).
> > >
> > > Before the first VM-entry is the farthest point in time before guest
> > > entry that one could do that.
> > >
> > > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > > 100ms last time I checked (which results in a 100ms time jump forward),
> > > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > >
> > > Have we measured any improvement with this patchset?
> >
> > It's not about this window.
> > It is about the time that passes between the point where we read the
> > TSC on the source system (and we do it in QEMU each time the VM is paused)
> > and the moment that we set the same TSC value on the target.
> > That time is unbounded.
>
> OK. Well, it's the same problem: ideally you'd want to do that just
> before VCPU-entry.
>
> > Also this patchset should decrease TSC skew that happens
> > between restoring it on multiple vCPUs as well, since
> > KVM_SET_TSC_STATE doesn't have to happen at the same time,
> > as it accounts for time passed on each vCPU.
> >
> >
> > Speaking of kvmclock, somewhat off-topic since this is a different issue,
> > I found out that qemu reads the kvmclock value on each pause,
> > and then 'restores' on unpause, using
> > KVM_SET_CLOCK (this modifies the global kvmclock offset)
> >
> > This means (and I tested it) that if the guest uses kvmclock
> > for a time reference, it will not account for time passed in
> > the paused state.
>
> Yes, this is necessary because otherwise there might be an overflow
> in the kernel time accounting code (if the clock delta is too large).

Could you elaborate on this? Do you mean that the guest kernel can crash
when the time 'jumps' too far forward in one go?

If so, this will happen with kernels using the TSC as well,
since we do let the virtual TSC 'keep running' while the VM is suspended,
and the goal of this patchset is to let it 'run' even while
the VM is migrating.

And if there is an issue, we really should try to fix it in
the guest kernel IMHO.

>
> > > Then Paolo mentions (with >), i am replying as usual.
> > >
> > > > Ok, after looking more at the code with Maxim I can confidently say that
> > > > it's a total mess. And a lot of the synchronization code is dead
> > > > because 1) as far as we could see no guest synchronizes the TSC using
> > > > MSR_IA32_TSC;
> > >
> > > Well, recent BIOSes take care of synchronizing the TSC. So when Linux
> > > boots, it does not have to synchronize TSC in software.
> >
> > Do you have an example of such BIOS? I tested OVMF which I compiled
> > from git master a few weeks ago, and I also tested this with seabios
> > from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> > from BIOS.
>
> Oh, well, QEMU then.
>
> > Or do you refer to the native BIOS on the host doing TSC synchronization?
>
> No, virt.

I also (lightly) tested a win10 guest, and a win10 guest with Hyper-V enabled,
and in both cases I haven't observed TSC/TSC_ADJUST writes.

>
> > > However, upon migration (and initialization), the KVM_SET_TSC's do
> > > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
> >
> > I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> > is going to fix, since it accounts for it.
> >
> >
> > > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > > synchronization code in kvm_write_tsc.
> > >
> > > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> > > Lets see:
> > >
> > >
> > > /*
> > >  * Freshly booted CPUs call into this:
> > >  */
> > > void check_tsc_sync_target(void)
> > > {
> > >         struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> > >         unsigned int cpu = smp_processor_id();
> > >         cycles_t cur_max_warp, gbl_max_warp;
> > >         int cpus = 2;
> > >
> > >         /* Also aborts if there is no TSC. */
> > >         if (unsynchronized_tsc())
> > >                 return;
> > >
> > >         /*
> > >          * Store, verify and sanitize the TSC adjust register. If
> > >          * successful skip the test.
> > >          *
> > >          * The test is also skipped when the TSC is marked reliable. This
> > >          * is true for SoCs which have no fallback clocksource. On these
> > >          * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> > >          * register might have been wreckaged by the BIOS..
> > >          */
> > >         if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> > >                 atomic_inc(&skip_test);
> > >                 return;
> > >         }
> > >
> > > retry:
> > >
> > > I'd force that synchronization path to be taken as a test-case.
> >
> > Or even better, as I suggested, we might tell the guest kernel
> > to avoid this synchronization path when KVM is detected
> > (regardless of invtsc flag)
> >
> > >
> > > > I have a few thoughts about the kvm masterclock synchronization,
> > > > which relate to Paolo's proposal that I implemented.
> > > >
> > > > The idea of masterclock is that when the host TSC is synchronized
> > > > (or as the kernel calls it, stable), and the guest TSC is synchronized as well,
> > > > then we can base the kvmclock on the same pair of
> > > > (host time in nsec, host tsc value), for all vCPUs.
> > >
> > > We _have_ to base. See the comment which starts with
> > >
> > > "Assuming a stable TSC across physical CPUS, and a stable TSC"
> > >
> > > at x86.c.
> > >
> > > > This makes the random error in calculation of this value invariant
> > > > across vCPUs, and allows the guest to do the kvmclock calculation in userspace
> > > > (vDSO) since kvmclock parameters are vCPU invariant.
> > >
> > > Actually, without synchronized host TSCs (and the masterclock scheme,
> > > with a single base read from a vCPU), kvmclock in kernel is buggy as
> > > well:
> > >
> > > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> > > {
> > >         unsigned version;
> > >         u64 ret;
> > >         u64 last;
> > >         u8 flags;
> > >
> > >         do {
> > >                 version = pvclock_read_begin(src);
> > >                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
> > >                 flags = src->flags;
> > >         } while (pvclock_read_retry(src, version));
> > >
> > >         if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> > >                 src->flags &= ~PVCLOCK_GUEST_STOPPED;
> > >                 pvclock_touch_watchdogs();
> > >         }
> > >
> > >         if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > >             (flags & PVCLOCK_TSC_STABLE_BIT))
> > >                 return ret;
> > >
> > > The code that follows this (including cmpxchg) is a workaround for that
> > > bug.
> >
> > I understand that. I am not arguing that we shouldn't use the masterclock!
> > I am just stating the facts about the conditions under which it works.
>
> Sure.
>
> > > Workaround would require each vCPU to write to a "global clock", on
> > > every clock read.
> > >
> > > > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > > > writes, and enable the master clock only when roughly the same guest's TSC value
> > > > was written across all vCPUs.
> > >
> > > Yes, because then you can do:
> > >
> > > vcpu0                          vcpu1
> > >
> > > A = read TSC
> > >        ... elapsed time ...
> > >                                B = read TSC
> > >
> > > delta = B - A
> > >
> > > > Recently this was disabled by Paolo
> > >
> > > What was disabled exactly?
> >
> > The running of tsc synchronization code when the _guest_ writes the TSC.
> >
> > Which changes two things:
> > 1. If the guest de-synchronizes its TSC, we won't disable master clock.
> > 2. If the guest writes similar TSC values on each vCPU we won't detect
> > this as a synchronization attempt, replace it with exactly the same
> > value, and finally re-enable the master clock.
> >
> > I argue that this change is OK, because Linux guests don't write to TSC at all,
> > the virtual BIOSes seem not to write there either, and the only case in which
> > the Linux guest tries to change its TSC is on CPU hotplug as you mention and
> > it uses TSC_ADJUST, which currently doesn't trigger the TSC synchronization code in
> > KVM anyway, so it is broken already.
> >
> > However I also argue that we should mention this in documentation just in case,
> > and we might also want (also just in case) to make Linux guests avoid even trying to
> > touch TSC_ADJUST register when running under KVM.
> >
> > To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
> > 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.
> >
> > Having said all that, now that I know tsc sync code, and the
> > reasons why it is there, I wouldn't be arguing about putting it back either.
>
> Agree.
>
> > > > and I agree with this, because I think
> > > > that we indeed should only make the guest TSC synchronized by default
> > > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > > > (Trying to guess when the guest syncs the TSC can cause more harm than good).
> > > >
> > > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > > in the TSC sync heuristics.
> > >
> > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > > the BIOS to boot with synced TSCs.
> > >
> > > So i wonder what is making it attempt TSC sync in the first place?
> >
> > CPU hotplug. And the guest doesn't really write to TSC_ADJUST
> > since its measurement code doesn't detect any TSC warps.
> >
> > I was just thinking that, in theory, since this is a VM and it can be
> > interrupted at any point, the measurement code should sometimes fail
> > and cause trouble.
> > I didn't do much homework on this so I might be overreacting.
>
> That is true (and you can see it with a CPU starved guest).
>
> > As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
> > running under Hyper-V and VMware, and these should be prone to similar
> > issues, supporting my theory.
> >
> > > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST
> > > working, but it should not need to happen in the first place, as long as
> > > QEMU and KVM are behaving properly).
> > >
> > > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> > >
> > > Only AFAIK.
> > >
> > > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > > > in the documentation to state that it only guarantees invariance if the guest
> > > > doesn't mess with its own TSC.
> > > >
> > > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > > > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > > > to sync TSC on newly hotplugged vCPUs.
> > >
> > > See 7539b174aef405d9d57db48c58390ba360c91312.
> >
> > I know about this, and I personally always use invtsc
> > with my VMs.
>
> Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable
> with TSC!

Could you elaborate on this too? Are you referring to the same issue you
had mentioned about the overflow in the kernel time accounting?

>
> > > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> > > becomes stable. Should be tested enough by now?
> >
> > The issue is that Qemu blocks migration when invtsc is set, based on the
> > fact that the target machine might have different TSC frequency and no
> > support for TSC scaling.
> > There was a long debate on this long ago.
>
> Oh right.
>
> > It is possible though to override this by specifying the exact frequency
> > you want the guest TSC to run at, by using something like
> > (tsc-frequency=3500000000)
> > I haven't checked if libvirt does this or not.
>
> It does.
Cool.
>
> > I do think that as long as the user uses modern CPUs (which have stable TSC
> > and support TSC scaling), there is no reason to disable invtsc, and
> > therefore no reason to use kvmclock.
>
> Yep. TSC is faster.

Also this bit is sometimes used by userspace tools.

Some time ago I found out that fio uses it to decide whether
to use TSC for measurements.

I didn't know this and was running fio in a guest without 'invtsc'.
Fio switched to plain gettimeofday behind my back
and totally screwed up the results.

>
> > > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > > > in some cases due to scheduling of guest vCPUs)
> > > >
> > > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > > > and the TSC clocksource watchdog, and the latter we might want to keep).
> > >
> > > The latter we want to keep.
> > >
> > > > For host TSC writes, just as Paolo proposed, we can still do the TSC sync,
> > > > unless the new code that I implemented is in use.
> > >
> > > So Paolo's proposal is to
> > >
> > > "- for live migration, userspace is expected to use the new
> > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > (nanosecond, TSC, TSC_ADJUST) tuple."
> > >
> > > Makes sense, so that no time between KVM_SET_TSC and
> > > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> > > of what is desired by the user.
> > >
> > > Since you are proposing this new ioctl, perhaps its useful to also
> > > reduce the 100ms jump?
> >
> > Yep. As long as the source and destination clocks are synchronized,
> > it should make it better.
> >
> > > "- for live migration, userspace is expected to use the new
> > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> > > to the guest before the first VM-entry"
> > >
> > > Sounds like a good idea (to integrate the values in a tuple).
> > >
> > > > Few more random notes:
> > > >
> > > > I have a weird feeling about using 'nsec since 1 January 1970'.
> > > > Common sense is telling me that a 64 bit value can hold about 580 years,
> > > > but still I see that it is more common to use a timespec, which is a (sec, nsec) pair.
> > >
> > > struct timespec {
> > >         time_t tv_sec;    /* seconds */
> > >         long   tv_nsec;   /* nanoseconds */
> > > };
> > >
> > > > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > > > Some refactoring might improve things here.
> > >
> > > Haven't read the patchset yet...
> > >
> > > > For example, making kvm_get_walltime_and_clockread work in the non-TSC case as well
> > > > might make the code cleaner.
> > > >
> > > > Patches to enable this feature in qemu are in the process of being sent to
> > > > qemu-devel mailing list.
> > > >
> > > > Best regards,
> > > > Maxim Levitsky
> > > >
> > > > Maxim Levitsky (2):
> > > > KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > > > KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > > >
> > > > Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++
> > > > arch/x86/include/uapi/asm/kvm.h | 1 +
> > > > arch/x86/kvm/x86.c | 88 +++++++++++++++++++++++++++++++--
> > > > include/uapi/linux/kvm.h | 14 ++++++
> > > > 4 files changed, 154 insertions(+), 5 deletions(-)
> > > >
> > > > --
> > > > 2.26.2
> > > >
> >
> > Best regards,
> > Maxim Levitsky
> >


Best regards,
Maxim Levitsky

2020-12-03 11:47:15

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On 01/12/20 20:35, Thomas Gleixner wrote:
>> This makes the random error in calculation of this value invariant
>> across vCPUs, and allows the guest to do the kvmclock calculation in userspace
>> (vDSO) since kvmclock parameters are vCPU invariant.
>
> That's not the case today? OMG!

It's optional. If the host tells the guest that the host TSC is messed
up, kvmclock disables its vDSO implementation and falls back to the syscall.

Paolo

2020-12-03 11:56:06

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Tue, 2020-12-01 at 12:02 -0300, Marcelo Tosatti wrote:
> On Tue, Dec 01, 2020 at 02:48:11PM +0100, Thomas Gleixner wrote:
> > On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> > > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > > in the TSC sync heuristics.
> > >
> > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > > the BIOS to boot with synced TSCs.
> >
> > That's wishful thinking.
> >
> > Reality is that BIOS tinkerers fail to get it right. TSC_ADJUST allows
> > us to undo the wreckage they create.
> >
> > Thanks,
> >
> > tglx
>
> Have not seen any multicore Dell/HP systems require that.
>
> Anyway, for QEMU/KVM it should be synced (unless there is a bug
> in the sync logic in the first place).
>

I agree with that, and that is why I suggested making the guest
avoid TSC syncing when KVM is detected.

I don't mind how this is implemented.

It can be done either with a new CPUID bit, or always when KVM
is detected (or even when *any* hypervisor is detected).

I also don't mind whether we only disable the TSC sync logic, or
set X86_FEATURE_TSC_RELIABLE, which disables both it
and the clocksource watchdog.


Best regards,
Maxim Levitsky



2020-12-03 12:03:23

by Maxim Levitsky

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Tue, 2020-12-01 at 08:19 -0800, Andy Lutomirski wrote:
> > On Dec 1, 2020, at 6:01 AM, Thomas Gleixner <[email protected]> wrote:
> >
> > On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> > > Not really. The synchronization logic tries to sync TSCs during
> > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > sequentially, say:
> > >
> > > CPU realtime TSC val
> > > vcpu0 0 usec 0
> > > vcpu1 100 usec 0
> > > vcpu2 200 usec 0
> >
> > That's nonsense, really.
> >
> > > And we'd like to see all vcpus to read the same value at all times.
> >
> > Providing guests with a synchronized and stable TSC on a host with a
> > synchronized and stable TSC is trivial.
> >
> > Write the _same_ TSC offset to _all_ vcpu control structs and be done
> > with it. It's not rocket science.
> >
> > The guest TSC read is:
> >
> > hostTSC + vcpu_offset
> >
> > So if the host TSC is synchronized then the guest TSCs are synchronized
> > as well.
> >
> > If the host TSC is not synchronized, then don't even try.
>
> This reminds me: if you’re adding a new kvm feature that tells the guest that the TSC works well, could you perhaps only have one structure for all vCPUs in the same guest?

I wouldn't mind doing this, but it might be too much work for
too little gain.

IMHO, modern hosts don't need the kvmclock in the first place,
and should just expose the TSC to the guest
together with the invtsc bit.

Best regards,
Maxim Levitsky

>
> > Thanks,
> >
> > tglx


2020-12-03 12:53:40

by Maxim Levitsky

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Tue, 2020-12-01 at 20:35 +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> > The idea of masterclock is that when the host TSC is synchronized
> > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > then we can base the kvmclock, on the same pair of
> > (host time in nsec, host tsc value), for all vCPUs.
>
> That's correct. All guest CPUs should see exactly the same TSC value,
> i.e.
>
> hostTSC + vcpu_offset

This is roughly the case today, unless the guest messes up
its own TSC, which it is allowed to do.

And that brings me to an idea: why do we allow the guest
to mess with its TSC in the first place?

Can't we just stop exposing TSC_ADJUST to the guest,
since it is an optional x86 feature?

In addition, I read somewhere that TSC writes aren't even
guaranteed to work on some x86 systems (e.g. that MSR can be
read-only, which is why TSC_ADJUST was created),
so I'd say we can go even further and just ignore
writes to the TSC from the guest as well.

Or if this is too much, we can just ignore this
since Linux doesn't touch the TSC anyway.

Best feature is no feature, you know.


>
> > This makes the random error in calculation of this value invariant
> > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > (vDSO) since kvmclock parameters are vCPU invariant.
>
> That's not the case today? OMG!

We have the masterclock to avoid this, whose existence we trickle
down to the guest via the KVM_CLOCK_TSC_STABLE bit.

When the masterclock is not enabled (due to one of the reasons
I mentioned in the RFC mail), this bit is not set,
and the guest falls back to system calls.

>
> > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > writes, and enable the master clock only when roughly the same guest's TSC value
> > was written across all vCPUs.
>
> The Linux kernel never writes the TSC. We've tried that ~15 years ago
> and it was a total disaster.
I can imagine this.

>
> > Recently this was disabled by Paolo and I agree with this, because I think
> > that we indeed should only make the guest TSC synchronized by default
> > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > (Trying to guess when the guest syncs the TSC can cause more harm than good).
> >
> > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > in the tsc sync heuristics.
>
> The kernel only writes TSC_ADJUST when it is advertised in CPUID and:
>
> 1) when the boot CPU detects a non-zero TSC_ADJUST value it writes
> it to 0, except when running on SGI-UV
>
> 2) when a starting CPU has a different TSC_ADJUST value than the
> first CPU which came up on the same socket.
>
> 3) When the first CPU of a different socket is starting and the TSC
> synchronization check fails against a CPU on an already checked
> socket then the kernel tries to adjust TSC_ADJUST to the point
> that the synchronization check does not fail anymore.
Since we do allow multi-socket guests, I guess (3) can still fail
due to scheduling noise and make the guest write TSC_ADJUST.

>
> > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> >
> > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > in the documentation to state that it only guarantees invariance if the guest
> > doesn't mess with its own TSC.
> >
> > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > to sync TSC on newly hotplugged vCPUs.
> >
> > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > in some cases due to scheduling of guest vCPUs)
>
> The only cases it would try to write are #3 above or because the
> hypervisor or BIOS messed it up (#1, #2).
>
> > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > and TSC clocksource watchdog, and the latter we might want to keep).
>
> Depends. If the host TSC is stable and synchronized, then you don't need
> the TSC watchdog. We are slowly starting to trust the TSC to some extent
> and phase out the watchdog for newer parts (hopefully).
>
> If the host TSC really falls apart then it still can invalidate
> KVM_CLOCK and force the guest to reevaluate the situation.
>
> > Few more random notes:
> >
> > I have a weird feeling about using 'nsec since 1 January 1970'.
> > Common sense is telling me that a 64 bit value can hold about 580
> > years,
>
> which is plenty.
I also think so, just wanted to hear your opinion on that.
>
> > but still I see that it is more common to use timespec which is a
> > (sec,nsec) pair.
>
> timespecs are horrible.

Best regards,
Maxim Levitsky

>
> Thanks,
>
> tglx
>


2020-12-04 12:39:39

by Marcelo Tosatti

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration


On Thu, Dec 03, 2020 at 01:39:42PM +0200, Maxim Levitsky wrote:
> On Tue, 2020-12-01 at 16:48 -0300, Marcelo Tosatti wrote:
> > On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> > > On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > > > Hi Maxim,
> > > >
> > > > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > > > Hi!
> > > > >
> > > > > This is the first version of the work to make TSC migration more accurate,
> > > > > as was defined by Paolo at:
> > > > > https://www.spinics.net/lists/kvm/msg225525.html
> > > >
> > > > Description from Oliver's patch:
> > > >
> > > > "To date, VMMs have typically restored the guest's TSCs by value using
> > > > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > > > value introduces some challenges with synchronization as the TSCs
> > > > continue to tick throughout the restoration process. As such, KVM has
> > > > some heuristics around TSC writes to infer whether or not the guest or
> > > > host is attempting to synchronize the TSCs."
> > > >
> > > > Not really. The synchronization logic tries to sync TSCs during
> > > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > > sequentially, say:
> > > >
> > > > CPU realtime TSC val
> > > > vcpu0 0 usec 0
> > > > vcpu1 100 usec 0
> > > > vcpu2 200 usec 0
> > > > ...
> > > >
> > > > And we'd like to see all vcpus to read the same value at all times.
> > > >
> > > > Other than that, comment makes sense. The problem with live migration
> > > > is as follows:
> > > >
> > > > We'd like the TSC value to be written, ideally, just before the first
> > > > VM-entry of a vCPU (because at the moment the TSC_OFFSET has been written,
> > > > the vcpus tsc is ticking, which will cause a visible forward jump
> > > > in vcpus tsc time).
> > > >
> > > > Before the first VM-entry is the farthest point in time before guest
> > > > entry that one could do that.
> > > >
> > > > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > > > 100ms last time i checked (which results in a 100ms time jump forward),
> > > > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > > >
> > > > Have we measured any improvement with this patchset?
> > >
> > > Its not about this window.
> > > It is about time that passes between the point that we read the
> > > TSC on source system (and we do it in qemu each time the VM is paused)
> > > and the moment that we set the same TSC value on the target.
> > > That time is unbounded.
> >
> > OK. Well, its the same problem: ideally you'd want to do that just
> > before VCPU-entry.
> >
> > > Also this patchset should decrease TSC skew that happens
> > > between restoring it on multiple vCPUs as well, since
> > > KVM_SET_TSC_STATE doesn't have to happen at the same time,
> > > as it accounts for time passed on each vCPU.
> > >
> > >
> > > Speaking of kvmclock, somewhat offtopic since this is a different issue,
> > > I found out that qemu reads the kvmclock value on each pause,
> > > and then 'restores' on unpause, using
> > > KVM_SET_CLOCK (this modifies the global kvmclock offset)
> > >
> > > This means (and I tested it) that if guest uses kvmclock
> > > for time reference, it will not account for time passed in
> > > the paused state.
> >
> > Yes, this is necessary because otherwise there might be an overflow
> > in the kernel time accounting code (if the clock delta is too large).
>
> Could you elaborate on this? Do you mean that guest kernel can crash,
> when the time 'jumps' too far forward in one go?

It can crash (there will be an overflow and time will jump backwards).

> If so this will happen with kernel using TSC as well,
> since we do let the virtual TSC 'keep running' while VM is suspended,
> and the goal of this patchset is to let it 'run' even while
> the VM is migrating.

True. For the overflow one, perhaps TSC can be stopped (and restored) similarly to
KVMCLOCK.

See QEMU's commit 00f4d64ee76e873be881a82d893a591487aa7950.

> And if there is an issue, we really should try to fix it in
> the guest kernel IMHO.
>
> >
> > > > Then Paolo mentions (with >), i am replying as usual.
> > > >
> > > > > Ok, after looking more at the code with Maxim I can confidently say that
> > > > > it's a total mess. And a lot of the synchronization code is dead
> > > > > because 1) as far as we could see no guest synchronizes the TSC using
> > > > > MSR_IA32_TSC;
> > > >
> > > > Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> > > > boots, it does not have to synchronize TSC in software.
> > >
> > > Do you have an example of such BIOS? I tested OVMF which I compiled
> > > from git master a few weeks ago, and I also tested this with seabios
> > > from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> > > from BIOS.
> >
> > Oh, well, QEMU then.
> >
> > > Or do you refer to the native BIOS on the host doing TSC synchronization?
> >
> > No, virt.
>
> I also (lightly) tested win10 guest, and win10 guest with Hyper-V enabled,
> and in both cases I haven't observed TSC/TSC_ADJUST writes.
>
> >
> > > > However, upon migration (and initialization), the KVM_SET_TSC's do
> > > > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > > > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
> > >
> > > I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> > > is going to fix, since it accounts for it.
> > >
> > >
> > > > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > > > synchronization code in kvm_write_tsc.
> > > >
> > > > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> > > > Lets see:
> > > >
> > > >
> > > > /*
> > > > * Freshly booted CPUs call into this:
> > > > */
> > > > void check_tsc_sync_target(void)
> > > > {
> > > > struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> > > > unsigned int cpu = smp_processor_id();
> > > > cycles_t cur_max_warp, gbl_max_warp;
> > > > int cpus = 2;
> > > >
> > > > /* Also aborts if there is no TSC. */
> > > > if (unsynchronized_tsc())
> > > > return;
> > > >
> > > > /*
> > > > * Store, verify and sanitize the TSC adjust register. If
> > > > * successful skip the test.
> > > > *
> > > > * The test is also skipped when the TSC is marked reliable. This
> > > > * is true for SoCs which have no fallback clocksource. On these
> > > > * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> > > > * register might have been wreckaged by the BIOS..
> > > > */
> > > > if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> > > > atomic_inc(&skip_test);
> > > > return;
> > > > }
> > > >
> > > > retry:
> > > >
> > > > I'd force that synchronization path to be taken as a test-case.
> > >
> > > Or even better as I suggested, we might tell the guest kernel
> > > to avoid this synchronization path when KVM is detected
> > > (regardless of invtsc flag)
> > >
> > > >
> > > > > I have a few thoughts about the kvm masterclock synchronization,
> > > > > which relate to Paolo's proposal that I implemented.
> > > > >
> > > > > The idea of masterclock is that when the host TSC is synchronized
> > > > > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > > > > then we can base the kvmclock, on the same pair of
> > > > > (host time in nsec, host tsc value), for all vCPUs.
> > > >
> > > > We _have_ to base. See the comment which starts with
> > > >
> > > > "Assuming a stable TSC across physical CPUS, and a stable TSC"
> > > >
> > > > at x86.c.
> > > >
> > > > > This makes the random error in calculation of this value invariant
> > > > > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > > > > (vDSO) since kvmclock parameters are vCPU invariant.
> > > >
> > > > Actually, without synchronized host TSCs (and the masterclock scheme,
> > > > with a single base read from a vCPU), kvmclock in kernel is buggy as
> > > > well:
> > > >
> > > > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> > > > {
> > > > unsigned version;
> > > > u64 ret;
> > > > u64 last;
> > > > u8 flags;
> > > >
> > > > do {
> > > > version = pvclock_read_begin(src);
> > > > ret = __pvclock_read_cycles(src, rdtsc_ordered());
> > > > flags = src->flags;
> > > > } while (pvclock_read_retry(src, version));
> > > >
> > > > if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> > > > src->flags &= ~PVCLOCK_GUEST_STOPPED;
> > > > pvclock_touch_watchdogs();
> > > > }
> > > >
> > > > if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > > > (flags & PVCLOCK_TSC_STABLE_BIT))
> > > > return ret;
> > > >
> > > > The code that follows this (including cmpxchg) is a workaround for that
> > > > bug.
> > >
> > > I understand that. I am not arguing that we shouldn't use the masterclock!
> > > I am just saying the facts about the condition when it works.
> >
> > Sure.
> >
> > > > Workaround would require each vCPU to write to a "global clock", on
> > > > every clock read.
> > > >
> > > > > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > > > > writes, and enable the master clock only when roughly the same guest's TSC value
> > > > > was written across all vCPUs.
> > > >
> > > > Yes, because then you can do:
> > > >
> > > > vcpu0 vcpu1
> > > >
> > > > A = read TSC
> > > > ... elapsed time ...
> > > >
> > > > B = read TSC
> > > >
> > > > delta = B - A
> > > >
> > > > > Recently this was disabled by Paolo
> > > >
> > > > What was disabled exactly?
> > >
> > > The running of tsc synchronization code when the _guest_ writes the TSC.
> > >
> > > Which changes two things:
> > > 1. If the guest de-synchronizes its TSC, we won't disable master clock.
> > > 2. If the guest writes similar TSC values on each vCPU we won't detect
> > > this as synchronization attempt, replace this with exactly the same
> > > value and finally re-enable the master clock.
> > >
> > > I argue that this change is OK, because Linux guests don't write to TSC at all,
> > > the virtual BIOSes seems not to write there either, and the only case in which
> > > the Linux guest tries to change its TSC is on CPU hotplug as you mention and
> > > it uses TSC_ADJUST, that currently doesn't trigger TSC synchronization code in
> > > KVM anyway, so it is broken already.
> > >
> > > However I also argue that we should mention this in documentation just in case,
> > > and we might also want (also just in case) to make Linux guests avoid even trying to
> > > touch TSC_ADJUST register when running under KVM.
> > >
> > > To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
> > > 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.
> > >
> > > Having said all that, now that I know tsc sync code, and the
> > > reasons why it is there, I wouldn't be arguing about putting it back either.
> >
> > Agree.
> >
> > > > > and I agree with this, because I think
> > > > > that we indeed should only make the guest TSC synchronized by default
> > > > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > > > > (Trying to guess when the guest syncs the TSC can cause more harm than good).
> > > > >
> > > > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > > > in the tsc sync heuristics.
> > > >
> > > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > > > the BIOS to boot with synced TSCs.
> > > >
> > > > So i wonder what is making it attempt TSC sync in the first place?
> > >
> > > CPU hotplug. And the guest doesn't really write to TSC_ADJUST
> > > since its measurement code doesn't detect any tsc warps.
> > >
> > > I was just thinking that in theory, since this is a VM and it can be
> > > interrupted at any point, the measurement code should sometimes fail
> > > and cause trouble.
> > > I didn't do much homework on this so I might be overreacting.
> >
> > That is true (and you can see it with a CPU starved guest).
> >
> > > As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
> > > running under Hyper-V and VMWARE, and these should be prone to similar
> > > issues, supporting my theory.
> > >
> > > > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST
> > > > working, but it should not need to happen in the first place, as long as
> > > > QEMU and KVM are behaving properly).
> > > >
> > > > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> > > >
> > > > Only AFAIK.
> > > >
> > > > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > > > > in the documentation to state that it only guarantees invariance if the guest
> > > > > doesn't mess with its own TSC.
> > > > >
> > > > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > > > > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > > > > to sync TSC on newly hotplugged vCPUs.
> > > >
> > > > See 7539b174aef405d9d57db48c58390ba360c91312.
> > >
> > > I know about this, and I personally always use invtsc
> > > with my VMs.
> >
> > Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable
> > with TSC!
>
> Could you elaborate on this too? Are you referring to the same issue you
> had mentioned about the overflow in the kernel time accounting?

Well, any issue that could show up.

> > > > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> > > > becomes stable. Should be tested enough by now?
> > >
> > > The issue is that Qemu blocks migration when invtsc is set, based on the
> > > fact that the target machine might have different TSC frequency and no
> > > support for TSC scaling.
> > > There was a long debate on this long ago.
> >
> > Oh right.
> >
> > > It is possible though to override this by specifying the exact frequency
> > > you want the guest TSC to run at, by using something like
> > > (tsc-frequency=3500000000)
> > > I haven't checked if libvirt does this or not.
> >
> > It does.
> Cool.
> >
> > > I do think that as long as the user uses modern CPUs (which have stable TSC
> > > and support TSC scaling), there is no reason to disable invtsc, and
> > > therefore no reason to use kvmclock.
> >
> > Yep. TSC is faster.
>
> Also this bit is sometimes used by userspace tools.

Yep! SAP HANA as well.

> Some time ago I found out that fio uses it to decide whether
> to use TSC for measurements.
>
> I didn't know this and was running fio in a guest without 'invtsc'.
> Fio switched to plain gettimeofday behind my back
> and totally screwed up the results.
>
> >
> > > > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > > > > in some cases due to scheduling of guest vCPUs)
> > > > >
> > > > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > > > > and TSC clocksource watchdog, and the latter we might want to keep).
> > > >
> > > > The latter we want to keep.
> > > >
> > > > > For host TSC writes, just as Paolo proposed we can still do the tsc sync,
> > > > > unless the new code that I implemented is in use.
> > > >
> > > > So Paolo's proposal is to
> > > >
> > > > "- for live migration, userspace is expected to use the new
> > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > > (nanosecond, TSC, TSC_ADJUST) tuple."
> > > >
> > > > Makes sense, so that no time between KVM_SET_TSC and
> > > > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> > > > of what is desired by the user.
> > > >
> > > > Since you are proposing this new ioctl, perhaps its useful to also
> > > > reduce the 100ms jump?
> > >
> > > Yep. As long as source and destination clocks are synchronized,
> > > it should make it better.
> > >
> > > > "- for live migration, userspace is expected to use the new
> > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> > > > to the guest before the first VM-entry"
> > > >
> > > > Sounds like a good idea (to integrate the values in a tuple).
> > > >
> > > > > Few more random notes:
> > > > >
> > > > > I have a weird feeling about using 'nsec since 1 January 1970'.
> > > > > Common sense is telling me that a 64 bit value can hold about 580 years,
> > > > > but still I see that it is more common to use timespec which is a (sec,nsec) pair.
> > > >
> > > > struct timespec {
> > > > time_t tv_sec; /* seconds */
> > > > long tv_nsec; /* nanoseconds */
> > > > };
> > > >
> > > > > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > > > > Some refactoring might improve things here.
> > > >
> > > > Haven't read the patchset yet...
> > > >
> > > > > For example making kvm_get_walltime_and_clockread work in non tsc case as well
> > > > > might make the code cleaner.
> > > > >
> > > > > Patches to enable this feature in qemu are in process of being sent to
> > > > > qemu-devel mailing list.
> > > > >
> > > > > Best regards,
> > > > > Maxim Levitsky
> > > > >
> > > > > Maxim Levitsky (2):
> > > > > KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > > > > KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > > > >
> > > > > Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++
> > > > > arch/x86/include/uapi/asm/kvm.h | 1 +
> > > > > arch/x86/kvm/x86.c | 88 +++++++++++++++++++++++++++++++--
> > > > > include/uapi/linux/kvm.h | 14 ++++++
> > > > > 4 files changed, 154 insertions(+), 5 deletions(-)
> > > > >
> > > > > --
> > > > > 2.26.2
> > > > >
> > >
> > > Best regards,
> > > Maxim Levitsky
> > >
>
>
> Best regards,
> Maxim Levitsky

2020-12-07 13:05:41

by Maxim Levitsky

Subject: Re: [PATCH 0/2] RFC: Precise TSC migration

On Thu, 2020-12-03 at 17:18 -0300, Marcelo Tosatti wrote:
> On Thu, Dec 03, 2020 at 01:39:42PM +0200, Maxim Levitsky wrote:
> > On Tue, 2020-12-01 at 16:48 -0300, Marcelo Tosatti wrote:
> > > On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> > > > On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > > > > Hi Maxim,
> > > > >
> > > > > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > > > > Hi!
> > > > > >
> > > > > > This is the first version of the work to make TSC migration more accurate,
> > > > > > as was defined by Paolo at:
> > > > > > https://www.spinics.net/lists/kvm/msg225525.html
> > > > >
> > > > > Description from Oliver's patch:
> > > > >
> > > > > "To date, VMMs have typically restored the guest's TSCs by value using
> > > > > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > > > > value introduces some challenges with synchronization as the TSCs
> > > > > continue to tick throughout the restoration process. As such, KVM has
> > > > > some heuristics around TSC writes to infer whether or not the guest or
> > > > > host is attempting to synchronize the TSCs."
> > > > >
> > > > > Not really. The synchronization logic tries to sync TSCs during
> > > > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > > > sequentially, say:
> > > > >
> > > > > CPU realtime TSC val
> > > > > vcpu0 0 usec 0
> > > > > vcpu1 100 usec 0
> > > > > vcpu2 200 usec 0
> > > > > ...
> > > > >
> > > > > And we'd like to see all vcpus to read the same value at all times.
> > > > >
> > > > > Other than that, comment makes sense. The problem with live migration
> > > > > is as follows:
> > > > >
> > > > > We'd like the TSC value to be written, ideally, just before the first
> > > > > VM-entry of a vCPU (because at the moment the TSC_OFFSET has been written,
> > > > > the vcpus tsc is ticking, which will cause a visible forward jump
> > > > > in vcpus tsc time).
> > > > >
> > > > > Before the first VM-entry is the farthest point in time before guest
> > > > > entry that one could do that.
> > > > >
> > > > > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > > > > 100ms last time i checked (which results in a 100ms time jump forward),
> > > > > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > > > >
> > > > > Have we measured any improvement with this patchset?
> > > >
> > > > Its not about this window.
> > > > It is about time that passes between the point that we read the
> > > > TSC on source system (and we do it in qemu each time the VM is paused)
> > > > and the moment that we set the same TSC value on the target.
> > > > That time is unbounded.
> > >
> > > OK. Well, its the same problem: ideally you'd want to do that just
> > > before VCPU-entry.
> > >
> > > > Also this patchset should decrease TSC skew that happens
> > > > between restoring it on multiple vCPUs as well, since
> > > > KVM_SET_TSC_STATE doesn't have to happen at the same time,
> > > > as it accounts for time passed on each vCPU.
> > > >
> > > >
> > > > Speaking of kvmclock, somewhat offtopic since this is a different issue,
> > > > I found out that qemu reads the kvmclock value on each pause,
> > > > and then 'restores' on unpause, using
> > > > KVM_SET_CLOCK (this modifies the global kvmclock offset)
> > > >
> > > > This means (and I tested it) that if guest uses kvmclock
> > > > for time reference, it will not account for time passed in
> > > > the paused state.
> > >
> > > Yes, this is necessary because otherwise there might be an overflow
> > > in the kernel time accounting code (if the clock delta is too large).
> >
> > Could you elaborate on this? Do you mean that guest kernel can crash,
> > when the time 'jumps' too far forward in one go?
>
> It can crash (there will be an overflow and time will jump backwards).
>
> > If so this will happen with kernel using TSC as well,
> > since we do let the virtual TSC 'keep running' while VM is suspended,
> > and the goal of this patchset is to let it 'run' even while
> > the VM is migrating.
>
> True. For the overflow one, perhaps TSC can be stopped (and restored) similarly to
> KVMCLOCK.
>
> See QEMU's commit 00f4d64ee76e873be881a82d893a591487aa7950.

This is exactly what we want to avoid, and it is the main purpose of this patch set.
Can this issue be fixed in the guest kernel instead?

I also don't completely agree with the 'CLOCK_MONOTONIC' remark in the above qemu commit.

From the clock_gettime man page:
"CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since—as described by POSIX—"some unspecified point in the past".
On Linux, that point corresponds to the number of seconds that the system has been running since it was booted.
"

'system has been running' IMHO isn't fully defined by the above statement.

In the same man page we also have CLOCK_BOOTTIME which was introduced to fix the
above 'bug' and it states that:

"CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
Identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended.
This allows applications to get a suspend-aware monotonic
clock without having to deal with the complications of CLOCK_REALTIME,
which may have discontinuities if the time is changed using settimeofday(2) or similar.
"

This relates to OS-initiated suspend (e.g. S3), which suggests that 'system has been running'
is defined as 'the system is in the S0 state'.

Also, by stopping the kvmclock, even if we define hypervisor-initiated suspend
as the 'OS suspended state', we at least break CLOCK_BOOTTIME, since the
latter is also based on kvmclock.

AFAIK (I might be wrong here), the guest kernel only updates its
CLOCK_BOOTTIME offset when it enters/exits the S3/S4 state.

(Look at 'tk_update_sleep_time', which updates 'offs_boot'
which is what defines CLOCK_BOOTTIME AFAIK)


Another thing I noticed is that the kvmclock documentation (msr.rst)
specifies that the time reference in the kvmclock PV structure
should include sleep time, so the qemu patch you referenced
might break its promise:

"
system_time:
a host notion of monotonic time, including sleep
time at the time this structure was last updated. Unit is
nanoseconds.
"

Best regards,
Maxim Levitsky

>
> > And if there is an issue, we really should try to fix it in
> > the guest kernel IMHO.
> >
> > > > > Then Paolo mentions (with >), i am replying as usual.
> > > > >
> > > > > > Ok, after looking more at the code with Maxim I can confidently say that
> > > > > > it's a total mess. And a lot of the synchronization code is dead
> > > > > > because 1) as far as we could see no guest synchronizes the TSC using
> > > > > > MSR_IA32_TSC;
> > > > >
> > > > > Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> > > > > boots, it does not have to synchronize TSC in software.
> > > >
> > > > Do you have an example of such BIOS? I tested OVMF which I compiled
> > > > from git master a few weeks ago, and I also tested this with seabios
> > > > from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> > > > from BIOS.
> > >
> > > Oh, well, QEMU then.
> > >
> > > > Or do you refer to the native BIOS on the host doing TSC synchronization?
> > >
> > > No, virt.
> >
> > I also (lightly) tested win10 guest, and win10 guest with Hyper-V enabled,
> > and in both cases I haven't observed TSC/TSC_ADJUST writes.
> >
> > > > > However, upon migration (and initialization), the KVM_SET_TSC's do
> > > > > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > > > > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
> > > >
> > > > I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> > > > is going to fix, since it accounts for it.
> > > >
> > > >
> > > > > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > > > > synchronization code in kvm_write_tsc.
> > > > >
> > > > > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> > > > > Lets see:
> > > > >
> > > > >
> > > > > /*
> > > > > * Freshly booted CPUs call into this:
> > > > > */
> > > > > void check_tsc_sync_target(void)
> > > > > {
> > > > > struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> > > > > unsigned int cpu = smp_processor_id();
> > > > > cycles_t cur_max_warp, gbl_max_warp;
> > > > > int cpus = 2;
> > > > >
> > > > > /* Also aborts if there is no TSC. */
> > > > > if (unsynchronized_tsc())
> > > > > return;
> > > > >
> > > > > /*
> > > > > * Store, verify and sanitize the TSC adjust register. If
> > > > > * successful skip the test.
> > > > > *
> > > > > * The test is also skipped when the TSC is marked reliable. This
> > > > > * is true for SoCs which have no fallback clocksource. On these
> > > > > * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> > > > > * register might have been wreckaged by the BIOS..
> > > > > */
> > > > > if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> > > > > atomic_inc(&skip_test);
> > > > > return;
> > > > > }
> > > > >
> > > > > retry:
> > > > >
> > > > > I'd force that synchronization path to be taken as a test-case.
> > > >
> > > > Or even better as I suggested, we might tell the guest kernel
> > > > to avoid this synchronization path when KVM is detected
> > > > (regardless of invtsc flag)
> > > >
> > > > > > I have a few thoughts about the kvm masterclock synchronization,
> > > > > > which relate to Paolo's proposal that I implemented.
> > > > > >
> > > > > > The idea of masterclock is that when the host TSC is synchronized
> > > > > > (or as the kernel calls it, stable), and the guest TSC is synchronized as well,
> > > > > > then we can base the kvmclock, on the same pair of
> > > > > > (host time in nsec, host tsc value), for all vCPUs.
> > > > >
> > > > > We _have_ to base. See the comment which starts with
> > > > >
> > > > > "Assuming a stable TSC across physical CPUS, and a stable TSC"
> > > > >
> > > > > at x86.c.
> > > > >
> > > > > > This makes the random error in calculation of this value invariant
> > > > > > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > > > > > (vDSO) since kvmclock parameters are vCPU invariant.
> > > > >
> > > > > Actually, without synchronized host TSCs (and the masterclock scheme,
> > > > > with a single base read from a vCPU), kvmclock in kernel is buggy as
> > > > > well:
> > > > >
> > > > > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> > > > > {
> > > > > unsigned version;
> > > > > u64 ret;
> > > > > u64 last;
> > > > > u8 flags;
> > > > >
> > > > > do {
> > > > > version = pvclock_read_begin(src);
> > > > > ret = __pvclock_read_cycles(src, rdtsc_ordered());
> > > > > flags = src->flags;
> > > > > } while (pvclock_read_retry(src, version));
> > > > >
> > > > > if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> > > > > src->flags &= ~PVCLOCK_GUEST_STOPPED;
> > > > > pvclock_touch_watchdogs();
> > > > > }
> > > > >
> > > > > if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > > > > (flags & PVCLOCK_TSC_STABLE_BIT))
> > > > > return ret;
> > > > >
> > > > > The code that follows this (including cmpxchg) is a workaround for that
> > > > > bug.
> > > >
> > > > I understand that. I am not arguing that we shouldn't use the masterclock!
> > > > I am just saying the facts about the condition when it works.
> > >
> > > Sure.
> > >
> > > > > Workaround would require each vCPU to write to a "global clock", on
> > > > > every clock read.
> > > > >
> > > > > > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > > > > > writes, and enable the master clock only when roughly the same guest's TSC value
> > > > > > was written across all vCPUs.
> > > > >
> > > > > Yes, because then you can do:
> > > > >
> > > > > vcpu0 vcpu1
> > > > >
> > > > > A = read TSC
> > > > > ... elapsed time ...
> > > > >
> > > > > B = read TSC
> > > > >
> > > > > delta = B - A
> > > > >
> > > > > > Recently this was disabled by Paolo
> > > > >
> > > > > What was disabled exactly?
> > > >
> > > > The running of tsc synchronization code when the _guest_ writes the TSC.
> > > >
> > > > Which changes two things:
> > > > 1. If the guest de-synchronizes its TSC, we won't disable master clock.
> > > > 2. If the guest writes similar TSC values on each vCPU we won't detect
> > > > this as a synchronization attempt, replace it with exactly the same
> > > > value and finally re-enable the master clock.
> > > >
> > > > I argue that this change is OK, because Linux guests don't write to TSC at all,
> > > > the virtual BIOSes seem not to write there either, and the only case in which
> > > > the Linux guest tries to change its TSC is on CPU hotplug as you mention and
> > > > it uses TSC_ADJUST, that currently doesn't trigger TSC synchronization code in
> > > > KVM anyway, so it is broken already.
> > > >
> > > > However I also argue that we should mention this in documentation just in case,
> > > > and we might also want (also just in case) to make Linux guests avoid even trying to
> > > > touch TSC_ADJUST register when running under KVM.
> > > >
> > > > To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
> > > > 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.
> > > >
> > > > Having said all that, now that I know the tsc sync code, and the
> > > > reasons why it is there, I wouldn't be arguing about putting it back either.
> > >
> > > Agree.
> > >
> > > > > > and I agree with this, because I think
> > > > > > that we indeed should only make the guest TSC synchronized by default
> > > > > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > > > > > (Trying to guess when the guest syncs the TSC can cause more harm than good).
> > > > > >
> > > > > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > > > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > > > > in the tsc sync heuristics.
> > > > >
> > > > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > > > > the BIOS to boot with synced TSCs.
> > > > >
> > > > > So i wonder what is making it attempt TSC sync in the first place?
> > > >
> > > > CPU hotplug. And the guest doesn't really write to TSC_ADJUST
> > > > since its measurement code doesn't detect any tsc warps.
> > > >
> > > > I was just thinking that, in theory, since this is a VM and it can be
> > > > interrupted at any point, the measurement code should sometimes fail
> > > > and cause trouble.
> > > > I didn't do much homework on this so I might be overreacting.
> > >
> > > That is true (and you can see it with a CPU starved guest).
> > >
> > > > As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
> > > > running under Hyper-V and VMWARE, and these should be prone to similar
> > > > issues, supporting my theory.
> > > >
> > > > > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST
> > > > > working, but it should not need to happen in the first place, as long as
> > > > > QEMU and KVM are behaving properly).
> > > > >
> > > > > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> > > > >
> > > > > Only AFAIK.
> > > > >
> > > > > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > > > > > in the documentation to state that it only guarantees invariance if the guest
> > > > > > doesn't mess with its own TSC.
> > > > > >
> > > > > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > > > > > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > > > > > to sync TSC on newly hotplugged vCPUs.
> > > > >
> > > > > See 7539b174aef405d9d57db48c58390ba360c91312.
> > > >
> > > > I know about this, and I personally always use invtsc
> > > > with my VMs.
> > >
> > > Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable
> > > with TSC!
> >
> > Could you elaborate on this too? Are you referring to the same issue you
> > had mentioned about the overflow in the kernel time accounting?
>
> Well, any issue that could show up.
>
> > > > > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> > > > > becomes stable. Should be tested enough by now?
> > > >
> > > > The issue is that Qemu blocks migration when invtsc is set, based on the
> > > > fact that the target machine might have different TSC frequency and no
> > > > support for TSC scaling.
> > > > There was a long debate on this long ago.
> > >
> > > Oh right.
> > >
> > > > It is possible though to override this by specifying the exact frequency
> > > > you want the guest TSC to run at, by using something like
> > > > (tsc-frequency=3500000000)
> > > > I haven't checked if libvirt does this or not.
> > >
> > > It does.
> > Cool.
> > > > I do think that as long as the user uses modern CPUs (which have stable TSC
> > > > and support TSC scaling), there is no reason to disable invtsc, and
> > > > therefore no reason to use kvmclock.
> > >
> > > Yep. TSC is faster.
> >
> > Also this bit is sometimes used by userspace tools.
>
> Yep! SAP HANA as well.
>
> > Some time ago I found out that fio uses it to decide whether
> > to use TSC for measurements.
> >
> > I didn't know this and was running fio in a guest without 'invtsc'.
> > Fio switched to plain gettimeofday behind my back
> > and totally screwed up the results.
> >
> > > > > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > > > > > in some cases due to scheduling of guest vCPUs)
> > > > > >
> > > > > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > > > > > and TSC clocksource watchdog, and the later we might want to keep).
> > > > >
> > > > > The latter we want to keep.
> > > > >
> > > > > > For host TSC writes, just as Paolo proposed we can still do the tsc sync,
> > > > > > unless the new code that I implemented is in use.
> > > > >
> > > > > So Paolo's proposal is to
> > > > >
> > > > > "- for live migration, userspace is expected to use the new
> > > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > > > (nanosecond, TSC, TSC_ADJUST) tuple."
> > > > >
> > > > > Makes sense, so that no time between KVM_SET_TSC and
> > > > > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> > > > > of what is desired by the user.
> > > > >
> > > > > Since you are proposing this new ioctl, perhaps it's useful to also
> > > > > reduce the 100ms jump?
> > > >
> > > > Yep. As long as the source and destination clocks are synchronized,
> > > > it should make it better.
> > > >
> > > > > "- for live migration, userspace is expected to use the new
> > > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > > > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> > > > > to the guest before the first VM-entry"
> > > > >
> > > > > Sounds like a good idea (to integrate the values in a tuple).
> > > > >
> > > > > > Few more random notes:
> > > > > >
> > > > > > I have a weird feeling about using 'nsec since 1 January 1970'.
> > > > > > Common sense is telling me that a 64 bit value can hold about 580 years,
> > > > > > but still I see that it is more common to use timespec which is a (sec,nsec) pair.
> > > > >
> > > > > struct timespec {
> > > > > time_t tv_sec; /* seconds */
> > > > > long tv_nsec; /* nanoseconds */
> > > > > };
> > > > >
> > > > > > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > > > > > Some refactoring might improve things here.
> > > > >
> > > > > Haven't read the patchset yet...
> > > > >
> > > > > > For example making kvm_get_walltime_and_clockread work in non tsc case as well
> > > > > > might make the code cleaner.
> > > > > >
> > > > > > Patches to enable this feature in qemu are in process of being sent to
> > > > > > qemu-devel mailing list.
> > > > > >
> > > > > > Best regards,
> > > > > > Maxim Levitsky
> > > > > >
> > > > > > Maxim Levitsky (2):
> > > > > > KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > > > > > KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > > > > >
> > > > > > Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++
> > > > > > arch/x86/include/uapi/asm/kvm.h | 1 +
> > > > > > arch/x86/kvm/x86.c | 88 +++++++++++++++++++++++++++++++--
> > > > > > include/uapi/linux/kvm.h | 14 ++++++
> > > > > > 4 files changed, 154 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > --
> > > > > > 2.26.2
> > > > > >
> > > >
> > > > Best regards,
> > > > Maxim Levitsky
> > > >
> >
> > Best regards,
> > Maxim Levitsky