2019-02-22 11:00:26

by Olaf Hering

Subject: recalibrating x86 TSC during suspend/resume

Is there a way to recalibrate the x86 TSC during a suspend/resume cycle?

While the frequency will remain the same on a laptop, it may (or rather:
it definitely will) differ if a VM is migrated from one host to another.
The hypervisor may choose to emulate the expected TSC frequency on the
destination host, but this emulation comes with a significant
performance cost. Therefore it would be good if the kernel evaluates the
environment during resume.

The specific usecase I have is a workload within VMs that makes heavy
use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
because only this clocksource gives enough granularity. The default
paravirtualized clock will return the same values via
clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
short. This does not happen with 'clocksource=tsc'.
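
A minimal sketch of how the repeated-value behaviour could be checked from
user space, assuming a POSIX system. The names probe_clock and
struct clock_probe are illustrative only and do not come from any existing
tool.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Outcome of a run of back-to-back clock reads (illustrative type). */
struct clock_probe {
    long repeated;   /* consecutive reads that returned the identical value */
    long backwards;  /* consecutive reads where the clock went backwards */
};

/* Convert a timespec to nanoseconds as a signed 64-bit value. */
int64_t timespec_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Read clock `clk` iterations + 1 times and classify each consecutive pair. */
struct clock_probe probe_clock(clockid_t clk, long iterations)
{
    struct clock_probe result = { 0, 0 };
    struct timespec prev, now;

    assert(iterations >= 0);
    clock_gettime(clk, &prev);
    for (long i = 0; i < iterations; i++) {
        clock_gettime(clk, &now);
        int64_t delta = timespec_to_ns(&now) - timespec_to_ns(&prev);
        if (delta == 0)
            result.repeated++;      /* two reads within one clock tick */
        else if (delta < 0)
            result.backwards++;     /* would indicate a real bug */
        prev = now;
    }
    return result;
}
```

Calling probe_clock(CLOCK_MONOTONIC, 1000000) under each clocksource and
comparing the repeated counts would show whether one clocksource is
coarser than the other for this kind of workload.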

Right now it is not possible to migrate VMs to hosts with different CPU
speeds. This leads to "islands" of identical hardware, and makes
maintenance of hosts harder than it needs to be. If the VM kernel were
able to cope with CPU/TSC frequency changes, the pool of potential
destination hosts would become significantly larger.

The current result of a migration with non-emulated TSC between hosts of
different speed is:

[ 42.452258] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 42.452270] clocksource: 'xen' wd_now: 6d34a86adb wd_last: 6d1dc51793 mask: ffffffffffffffff
[ 42.452272] clocksource: 'tsc' cs_now: 1fd2ce46bb cs_last: 1f95c4ca75 mask: ffffffffffffffff
[ 42.452273] tsc: Marking TSC unstable due to clocksource watchdog

Thanks,
Olaf



2019-02-22 11:47:06

by Thomas Gleixner

Subject: Re: recalibrating x86 TSC during suspend/resume

On Fri, 22 Feb 2019, Olaf Hering wrote:
> Is there a way to recalibrate the x86 TSC during a suspend/resume cycle?

No.

> While the frequency will remain the same on a laptop, it may (or rather:
> it definitely will) differ if a VM is migrated from one host to another.
> The hypervisor may choose to emulate the expected TSC frequency on the
> destination host, but this emulation comes with a significant
> performance cost. Therefore it would be good if the kernel evaluates the
> environment during resume.
>
> The specific usecase I have is a workload within VMs that makes heavy
> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
> because only this clocksource gives enough granularity. The default
> paravirtualized clock will return the same values via
> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
> short. This does not happen with 'clocksource=tsc'.
>
> Right now it is not possible to migrate VMs to hosts with different CPU
> speeds. This leads to "islands" of identical hardware, and makes
> maintenance of hosts harder than it needs to be. If the VM kernel were
> able to cope with CPU/TSC frequency changes, the pool of potential
> destination hosts would become significantly larger.

The problem with recalibrating TSC on resume is that it would have to be

1) quick

2) accurate, so NTP does not get utterly unhappy.

Newer Intels support TSC scaling for VMX, which could solve the problem. It
affects TSC readout by:

TSC = (read(HWTSC) * multiplier) >> 48

So you can standardize on a TSC frequency across a fleet. Not sure when
that was introduced and no idea whether it's available on AMD.
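
The readout formula above can be sketched in plain C as follows. This is
only an illustration of the fixed-point arithmetic, assuming a 64-bit
multiplier with 48 fractional bits as the formula implies. The function
names are made up for the example, and unsigned __int128 is a GCC/Clang
extension.

```c
#include <assert.h>
#include <stdint.h>

#define TSC_SCALE_FRAC_BITS 48  /* fractional bits, taken from the formula */

/* Multiplier that makes a host TSC running at host_hz appear as guest_hz. */
uint64_t tsc_scale_multiplier(uint64_t guest_hz, uint64_t host_hz)
{
    assert(host_hz != 0);
    unsigned __int128 ratio =
        ((unsigned __int128)guest_hz << TSC_SCALE_FRAC_BITS) / host_hz;
    assert(ratio <= UINT64_MAX);  /* must fit the assumed 64-bit field */
    return (uint64_t)ratio;
}

/* Apply the formula: guest TSC = (host TSC * multiplier) >> 48. */
uint64_t tsc_scale(uint64_t host_tsc, uint64_t multiplier)
{
    /* 128-bit intermediate so the product cannot overflow */
    return (uint64_t)(((unsigned __int128)host_tsc * multiplier)
                      >> TSC_SCALE_FRAC_BITS);
}
```

With a fleet-wide guest frequency of 2 GHz, a 3 GHz host would use
tsc_scale_multiplier(2000000000, 3000000000), so that one second of host
ticks reads as roughly 2000000000 guest ticks, minus a rounding error of
about one tick.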

For a software solution we could try the following:

1) Provide the raw TSC frequency of the host to the guest in some magic
software defined MSR or CPUID. If there is an existing mechanism, use
that.

2) On resume check whether the MSR/CPUID is available and if so read out
that information and check whether the frequency is the same as
before. If not it is trivial enough to adjust the guest mult/shift
values for both raw and NTP adjusted clocks before they are used again,
i.e. before timekeeping_resume(). Need to look at what the best place is,
but probably the clocksource resume callback. Plus if TSC deadline
timer is used, we'd need the same adjustment there.

That's backward compatible, because if the MSR/CPUID is not there, then
the recalibration is not tried.
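
A rough user-space sketch of the adjustment described in 2), assuming the
usual conversion ns = (cycles * mult) >> shift. struct cyc2ns and all
function names are hypothetical and are not the kernel's data structures.
The real timekeeping code also carries NTP-adjusted state, which is left
out here. A reported frequency of 0 stands in for "MSR/CPUID not
available".

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical conversion state: ns = (cycles * mult) >> shift. */
struct cyc2ns {
    uint64_t freq_hz;
    uint32_t mult;
    uint32_t shift;
};

/* Rounded mult for a given frequency and a fixed shift. */
uint32_t hz_to_mult(uint64_t freq_hz, uint32_t shift)
{
    assert(freq_hz != 0 && shift < 32);
    uint64_t tmp = (NSEC_PER_SEC << shift) + freq_hz / 2;
    tmp /= freq_hz;
    assert(tmp <= UINT32_MAX);
    return (uint32_t)tmp;
}

/* On resume: recompute mult if the host reports a different frequency.
 * Returns 1 if the values were updated, 0 if they were left alone. */
int cyc2ns_resume(struct cyc2ns *c, uint64_t reported_hz)
{
    if (reported_hz == 0 || reported_hz == c->freq_hz)
        return 0;  /* nothing reported, or unchanged: keep old values */
    c->freq_hz = reported_hz;
    c->mult = hz_to_mult(reported_hz, c->shift);
    return 1;
}

/* Convert a cycle count to nanoseconds with the current mult/shift. */
uint64_t cyc2ns_convert(const struct cyc2ns *c, uint64_t cycles)
{
    return (uint64_t)(((unsigned __int128)cycles * c->mult) >> c->shift);
}
```

The rounding error of mult depends on the chosen shift, which is one place
where the accuracy concern for NTP shows up: a larger shift gives a finer
mult at the cost of earlier overflow in the multiplication.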

Whether that is accurate enough or not to make NTP happy, I can't tell, but
it's definitely worth a try.

Thanks,

tglx


2019-02-22 11:52:12

by Olaf Hering

Subject: Re: recalibrating x86 TSC during suspend/resume

On Fri, 22 Feb 2019 12:44:39 +0100 (CET),
Thomas Gleixner <[email protected]> wrote:

> Whether that is accurate enough or not to make NTP happy, I can't tell, but
> it's definitely worth a try.

Thanks Thomas, I will look into the suggestions.


Olaf



2019-02-22 12:32:02

by Paolo Bonzini

Subject: Re: recalibrating x86 TSC during suspend/resume

On 22/02/19 12:44, Thomas Gleixner wrote:
>> The specific usecase I have is a workload within VMs that makes heavy
>> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
>> because only this clocksource gives enough granularity. The default
>> paravirtualized clock will return the same values via
>> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
>> short. This does not happen with 'clocksource=tsc'.

This shouldn't happen. clock_gettime(CLOCK_MONOTONIC) should be
monotonically increasing. Do you have a testcase?

The KVM clocksource is high-resolution and also TSC-based, the
difference is that it performs two multiplications instead of one. The
first uses TSC parameters from the host. The second, which is the one
in arch/x86/entry/vdso/vclock_gettime.c's do_hres function, will have a
1:1 multiplier (excluding adjtime shearing) because kvmclock already
returns nanoseconds.
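
The two multiplications can be sketched roughly as below. This is a
simplified illustration and not the kernel code: struct host_time_params
is a hypothetical, trimmed-down stand-in for the parameters the host
provides, the 32.32 fixed-point layout of the first stage is an
assumption, and the function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down set of host-provided parameters. */
struct host_time_params {
    uint64_t tsc_timestamp;   /* TSC value when system_time_ns was sampled */
    uint64_t system_time_ns;  /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_ns_mul;   /* assumed 32.32 fixed-point multiplier */
    int8_t   tsc_shift;       /* pre-shift applied to the TSC delta */
};

/* First multiplication: TSC delta -> nanoseconds, using host parameters. */
uint64_t host_scale_to_ns(const struct host_time_params *p, uint64_t tsc)
{
    assert(tsc >= p->tsc_timestamp);
    uint64_t delta = tsc - p->tsc_timestamp;

    if (p->tsc_shift < 0)
        delta >>= -p->tsc_shift;
    else
        delta <<= p->tsc_shift;

    return p->system_time_ns +
           (uint64_t)(((unsigned __int128)delta * p->tsc_to_ns_mul) >> 32);
}

/* Second multiplication: generic (delta * mult) >> shift step, which is
 * 1:1 when the clocksource already counts nanoseconds. */
uint64_t guest_scale(uint64_t ns_delta, uint32_t mult, uint32_t shift)
{
    return (uint64_t)(((unsigned __int128)ns_delta * mult) >> shift);
}
```

For a 2 GHz host TSC the first-stage multiplier would be 0.5 in fixed
point, i.e. 1u << 31 with tsc_shift 0, and the second stage with
mult == 1u << shift passes the nanosecond value through unchanged.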

> Newer Intels support TSC scaling for VMX, which could solve the problem. It
> affects TSC readout by:
>
> TSC = (read(HWTSC) * multiplier) >> 48
>
> So you can standardize on a TSC frequency across a fleet. Not sure when
> that was introduced and no idea whether it's available on AMD.

It's Skylake (server parts only) or newer. AMD instead has had it
(almost) forever. QEMU 2.6 or newer will use it automatically across
live migration, if available.

> For a software solution we could try the following:
>
> 1) Provide the raw TSC frequency of the host to the guest in some magic
> software defined MSR or CPUID. If there is an existing mechanism, use
> that.

This shouldn't be needed for two reasons:

1) you could also use kvmclock's provided mult/shift

2) I am not convinced that kvmclock has the behavior that Olaf mentions,
and if it does it would be a bug.

Paolo

2019-02-22 14:29:20

by Olaf Hering

Subject: Re: recalibrating x86 TSC during suspend/resume

On Fri, Feb 22, Paolo Bonzini wrote:

> On 22/02/19 12:44, Thomas Gleixner wrote:
> >> The specific usecase I have is a workload within VMs that makes heavy
> >> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
> >> because only this clocksource gives enough granularity. The default
> >> paravirtualized clock will return the same values via
> >> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
> >> short. This does not happen with 'clocksource=tsc'.
>
> This shouldn't happen. clock_gettime(CLOCK_MONOTONIC) should be
> monotonically increasing. Do you have a testcase?

Two years ago I tweaked sysbench to track the execution time of the
'memory' test:

https://github.com/olafhering/sysbench
https://github.com/olafhering/sysbench/blame/pv/src/tests/memory/sb_memory.c

The checks in diff_timespec() triggered with clocksource=xen, but I
cannot reproduce it right now with 5.0 and 4.4 based kernels. I have no
data on how KVM behaves. In the end the hypervisor was tweaked to tolerate
a certain jitter in expected TSC speed before emulation kicks in. Up to
~1MHz would be ok to stay within the 500PPM limit that ntpd can handle.
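
The arithmetic behind the 500PPM figure, as a small sketch. The helper
names are made up for illustration. Note that a 1MHz difference equals
500PPM only at an expected TSC frequency of 2GHz, so the tolerable
absolute jitter scales with the nominal frequency.

```c
#include <assert.h>
#include <stdint.h>

#define NTP_MAX_PPM 500  /* limit quoted above */

/* Frequency deviation in parts per million, rounded down. */
uint64_t freq_deviation_ppm(uint64_t expected_hz, uint64_t actual_hz)
{
    assert(expected_hz != 0);
    uint64_t diff = actual_hz > expected_hz ? actual_hz - expected_hz
                                            : expected_hz - actual_hz;
    /* 128-bit intermediate avoids overflow for large differences */
    return (uint64_t)(((unsigned __int128)diff * 1000000u) / expected_hz);
}

/* Nonzero if the deviation stays within the 500PPM limit. */
int within_ntp_limit(uint64_t expected_hz, uint64_t actual_hz)
{
    return freq_deviation_ppm(expected_hz, actual_hz) <= NTP_MAX_PPM;
}
```

For example, a guest expecting 2.4GHz that lands on a 2.401GHz host is
within the limit, while the same 1MHz difference on a 1GHz TSC is not.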

But now there is that "island" issue that needs to be resolved in one
way or another.

Olaf

