2024-04-19 00:37:28

by David Woodhouse

Subject: [RFC PATCH 0/10] Cleaning up the KVM clock mess

I lied, there aren't three different definitions of the KVM clock. The
fourth is on i386, where it's still based on CLOCK_MONOTONIC (well,
boot time, which might as well be the same since we offset it anyway).

If we fix *that* to be based on CLOCK_MONOTONIC_RAW then we can rip out
a whole bunch of mechanisms which were added to cope with NTP frequency
skew.

This cleans up the mess and gets us back down to the two unavoidable
definitions of the KVM clock. When the host and guest TSCs are well
behaved and in sync, it's in "master clock" mode, where it's defined as
a simple arithmetic function of the guest TSC; otherwise it's clamped
to the host's CLOCK_MONOTONIC_RAW.
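
For reference, "a simple arithmetic function of the guest TSC" means the
usual pvclock calculation. A minimal userspace sketch of what the guest
does (mirroring the pvclock ABI, not copied from kernel code):

#include <stdint.h>

/* 32.32 fixed-point scaling, as the pvclock ABI defines it. */
static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* KVM clock as the guest computes it from its own TSC reading. */
static uint64_t kvmclock_read(uint64_t guest_tsc, uint64_t tsc_timestamp,
			      uint64_t system_time, uint32_t mul, int8_t shift)
{
	return system_time +
	       pvclock_scale_delta(guest_tsc - tsc_timestamp, mul, shift);
}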

It includes Jack's KVM_[GS]ET_CLOCK_GUEST patch to allow accurate
migration, and my KVM_VCPU_TSC_SCALE, which exposes the precise TSC
scaling factors. The latter is needed for accurate migration of the
guest TSC, and can *also* be used by userspace to get vDSO-style access
to the KVM clock. That allows hypercalls and other emulated clock
devices (e.g. PIT, HPET, ACPI timer) to be based on the KVM clock too,
giving *consistency* across a live migration.
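
As a strawman, a userspace read of the KVM clock would look something
like this, reusing pvclock_scale_delta() from the sketch above. The
struct tsc_scale layout here is invented purely for illustration (the
real ABI is whatever KVM_VCPU_TSC_SCALE ends up returning), and the
{clock, host_tsc} pair comes from a KVM_GET_CLOCK call that reported
KVM_CLOCK_HOST_TSC:

#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

struct tsc_scale {
	uint64_t ratio;		/* guest/host TSC ratio, fixed point */
	unsigned int frac_bits;	/* fractional bits in ->ratio */
};

static uint64_t kvmclock_now(uint64_t snap_clock, uint64_t snap_host_tsc,
			     struct tsc_scale scale,
			     uint32_t guest_mul, int8_t guest_shift)
{
	/* Host TSC cycles since the snapshot, scaled to guest cycles. */
	uint64_t host_delta = __rdtsc() - snap_host_tsc;
	uint64_t guest_delta = (uint64_t)(((unsigned __int128)host_delta *
					   scale.ratio) >> scale.frac_bits);

	/*
	 * guest_mul/guest_shift are derived from the guest TSC frequency,
	 * exactly as the guest's own kvmclock factors are.
	 */
	return snap_clock + pvclock_scale_delta(guest_delta, guest_mul,
						guest_shift);
}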

I do still need to fix KVM_REQ_MASTERCLOCK_UPDATE so that it doesn't
clamp the clock back to CLOCK_MONOTONIC_RAW; we should *update* the
ka->kvmclock_offset when we've been running in use_master_clock mode.
Should probably do that in kvm_arch_hardware_disable() for timekeeping
across hibernation too, but I haven't finished working that one out.
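
The kvmclock_offset update would look roughly like this (an untested
sketch, assuming it runs while ka->use_master_clock is still set):

static void kvm_update_kvmclock_offset(struct kvm *kvm)
{
	struct kvm_arch *ka = &kvm->arch;

	if (ka->use_master_clock) {
		/* The value guests currently see via the master clock... */
		u64 kvmclock_ns = get_kvmclock_ns(kvm);

		/*
		 * ...re-expressed as an offset from CLOCK_MONOTONIC_RAW,
		 * so that dropping out of master clock mode doesn't yank
		 * the clock around.
		 */
		ka->kvmclock_offset = kvmclock_ns - get_kvmclock_base_ns();
	}
}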

I think there are still some places left where KVM reads the time twice
in close(ish) succession and then assumes they were at the *same* time,
which I'll audit and fix too.
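
The anti-pattern in question looks like this (illustrative fragment,
not a real call site):

	u64 ns  = get_kvmclock_base_ns();	/* time reading #1 */
	u64 tsc = rdtsc();			/* time reading #2, a little later */

	/*
	 * ...then treating (ns, tsc) as a snapshot of a single instant,
	 * even though an NMI or preemption can land between the two
	 * reads. kvm_get_time_and_clockread() exists to derive both
	 * values from one clocksource read.
	 */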

I also need to flesh out the test cases and do some real testing from
VMMs, but I think it's ready for some heckling at least.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks

David Woodhouse (8):
KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
KVM: x86: Remove periodic global clock updates
KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE
KVM: x86: Fix KVM clock precision in __get_kvmclock()

Jack Allister (2):
KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
KVM: selftests: Add KVM/PV clock selftest to prove timer correction

Documentation/virt/kvm/api.rst | 37 ++
Documentation/virt/kvm/devices/vcpu.rst | 115 ++++--
arch/x86/include/asm/kvm_host.h | 8 +-
arch/x86/include/uapi/asm/kvm.h | 6 +
arch/x86/kvm/svm/svm.c | 3 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 438 +++++++++++++---------
arch/x86/kvm/xen.c | 4 +-
include/uapi/linux/kvm.h | 3 +
tools/testing/selftests/kvm/Makefile | 1 +
tools/testing/selftests/kvm/x86_64/pvclock_test.c | 192 ++++++++++
11 files changed, 600 insertions(+), 209 deletions(-)




2024-04-19 01:22:45

by David Woodhouse

Subject: [PATCH 08/10] KVM: x86: Remove periodic global clock updates

From: David Woodhouse <[email protected]>

This effectively reverts commit 332967a3eac0 ("x86: kvm: introduce
periodic global clock updates"). The periodic update was introduced to
propagate NTP corrections to the guest KVM clock, when the KVM clock was
based on CLOCK_MONOTONIC.

However, commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
monotonic raw clock") switched to using CLOCK_MONOTONIC_RAW as the basis
for the KVM clock, avoiding the NTP frequency skew altogether.

So the periodic update serves no purpose. Remove it.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/x86.c | 25 -------------------------
1 file changed, 25 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 44b3d2a0da5b..4ec4eb850c5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -155,9 +155,6 @@ EXPORT_SYMBOL_GPL(report_ignored_msrs);
unsigned int min_timer_period_us = 200;
module_param(min_timer_period_us, uint, 0644);

-static bool __read_mostly kvmclock_periodic_sync = true;
-module_param(kvmclock_periodic_sync, bool, 0444);
-
/* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
static u32 __read_mostly tsc_tolerance_ppm = 250;
module_param(tsc_tolerance_ppm, uint, 0644);
@@ -3434,20 +3431,6 @@ static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
KVMCLOCK_UPDATE_DELAY);
}

-#define KVMCLOCK_SYNC_PERIOD (300 * HZ)
-
-static void kvmclock_sync_fn(struct work_struct *work)
-{
- struct delayed_work *dwork = to_delayed_work(work);
- struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
- kvmclock_sync_work);
- struct kvm *kvm = container_of(ka, struct kvm, arch);
-
- schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
- schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
- KVMCLOCK_SYNC_PERIOD);
-}
-
/* These helpers are safe iff @msr is known to be an MCx bank MSR. */
static bool is_mci_control_msr(u32 msr)
{
@@ -12416,8 +12399,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)

void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
{
- struct kvm *kvm = vcpu->kvm;
-
if (mutex_lock_killable(&vcpu->mutex))
return;
vcpu_load(vcpu);
@@ -12428,10 +12409,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
vcpu->arch.msr_kvm_poll_control = 1;

mutex_unlock(&vcpu->mutex);
-
- if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
- schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
- KVMCLOCK_SYNC_PERIOD);
}

void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -12804,7 +12781,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
#endif

INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
- INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);

kvm_apicv_init(kvm);
kvm_hv_init_vm(kvm);
@@ -12844,7 +12820,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)

void kvm_arch_sync_events(struct kvm *kvm)
{
- cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
kvm_free_pit(kvm);
}
--
2.44.0


2024-04-19 01:42:29

by David Woodhouse

Subject: [PATCH 07/10] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host

From: David Woodhouse <[email protected]>

Commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") did so only for 64-bit hosts, by capturing the boot offset from
within the existing clocksource notifier update_pvclock_gtod().

That notifier was added in commit 16e8d74d2da9 ("KVM: x86: notifier for
clocksource changes") but only on x86_64, because its original purpose
was just to disable the "master clock" mode which is only supported on
x86_64.

Now that the notifier is used for more than disabling master clock mode
(well, OK, more than a decade later, but clocks are hard), enable it for
the 32-bit build too so that get_kvmclock_base_ns() can be unaffected by
NTP sync on 32-bit too.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/x86.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00a7c1188dec..44b3d2a0da5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2245,7 +2245,6 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
return kvm_set_msr_ignored_check(vcpu, index, *data, true);
}

-#ifdef CONFIG_X86_64
struct pvclock_clock {
int vclock_mode;
u64 cycle_last;
@@ -2303,13 +2302,6 @@ static s64 get_kvmclock_base_ns(void)
/* Count up from boot time, but with the frequency of the raw clock. */
return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
}
-#else
-static s64 get_kvmclock_base_ns(void)
-{
- /* Master clock not used, so we can just use CLOCK_BOOTTIME. */
- return ktime_get_boottime_ns();
-}
-#endif

static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs)
{
@@ -9819,6 +9811,7 @@ static void pvclock_irq_work_fn(struct irq_work *w)
}

static DEFINE_IRQ_WORK(pvclock_irq_work, pvclock_irq_work_fn);
+#endif

/*
* Notification about pvclock gtod data update.
@@ -9826,26 +9819,26 @@ static DEFINE_IRQ_WORK(pvclock_irq_work, pvclock_irq_work_fn);
static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
void *priv)
{
- struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
struct timekeeper *tk = priv;

update_pvclock_gtod(tk);

+#ifdef CONFIG_X86_64
/*
* Disable master clock if host does not trust, or does not use,
* TSC based clocksource. Delegate queue_work() to irq_work as
* this is invoked with tk_core.seq write held.
*/
- if (!gtod_is_based_on_tsc(gtod->clock.vclock_mode) &&
+ if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode) &&
atomic_read(&kvm_guest_has_master_clock) != 0)
irq_work_queue(&pvclock_irq_work);
+#endif
return 0;
}

static struct notifier_block pvclock_gtod_notifier = {
.notifier_call = pvclock_gtod_notify,
};
-#endif

static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
{
@@ -9984,9 +9977,10 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)

if (pi_inject_timer == -1)
pi_inject_timer = housekeeping_enabled(HK_TYPE_TIMER);
-#ifdef CONFIG_X86_64
+
pvclock_gtod_register_notifier(&pvclock_gtod_notifier);

+#ifdef CONFIG_X86_64
if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
#endif
--
2.44.0


2024-04-19 02:42:24

by David Woodhouse

Subject: [PATCH 10/10] KVM: x86: Fix KVM clock precision in __get_kvmclock()

From: David Woodhouse <[email protected]>

When in 'master clock mode' (i.e. when host and guest TSCs are behaving
sanely and in sync), the KVM clock is defined in terms of the guest TSC.

When TSC scaling is used, calculating the KVM clock directly from *host*
TSC cycles leads to a systemic drift from the values calculated by the
guest from its TSC.
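
The effect is easy to demonstrate outside the kernel. A toy version of
the two calculations (the frequencies, the 48-bit ratio format and the
shift==0 fixed-point factors are all chosen purely for illustration;
the kernel derives its factors via kvm_get_time_scale()):

#include <stdint.h>
#include <stdio.h>

/* 32.32 fixed-point multiply, as pvclock_scale_delta() with shift == 0. */
static uint64_t scale(uint64_t delta, uint32_t mul)
{
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

int main(void)
{
	const uint64_t host_hz = 2500000000ULL, guest_hz = 3000000000ULL;

	/* ns-per-cycle multipliers (shift == 0 is fine above 1 GHz). */
	uint32_t host_mul  = (uint32_t)(((unsigned __int128)1000000000 << 32) / host_hz);
	uint32_t guest_mul = (uint32_t)(((unsigned __int128)1000000000 << 32) / guest_hz);

	/* Guest/host TSC ratio with 48 fractional bits, as on Intel. */
	uint64_t ratio = (uint64_t)(((unsigned __int128)guest_hz << 48) / host_hz);

	uint64_t host_cycles = host_hz * 3600;	/* one hour of host TSC */

	/* What the guest computes: host cycles -> guest cycles -> ns. */
	uint64_t guest_cycles = (uint64_t)(((unsigned __int128)host_cycles * ratio) >> 48);
	uint64_t ns_guest = scale(guest_cycles, guest_mul);

	/* What __get_kvmclock() did: host cycles -> ns directly. */
	uint64_t ns_host = scale(host_cycles, host_mul);

	printf("guest path %llu ns, host path %llu ns, delta %lld ns\n",
	       (unsigned long long)ns_guest, (unsigned long long)ns_host,
	       (long long)(ns_guest - ns_host));
	return 0;
}

Each fixed-point step truncates differently, so the two answers slowly
diverge; only performing the same arithmetic steps as the guest keeps
the host's idea of the KVM clock consistent with the guest's.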

Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
had a simple workaround for the specific case of Xen timers, as it had an
actual vCPU to hand and could use its scaling information. That commit
noted that it was broken for the general case of get_kvmclock_ns(), and
said "I'll come back to that".

Since __get_kvmclock() is invoked without a specific vCPU, it needs to
be able to find or generate the scaling values required to perform the
correct calculation.

Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
so it isn't as complex as it might have been.

In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
kvm->arch.last_tsc_scaling_ratio. That is only protected by the
tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
shift factors to convert to nanoseconds for the corresponding KVM clock,
just as kvm_guest_time_update() would.

In __get_kvmclock(), which runs within a seqcount retry loop, use those
values to convert host to guest TSC and then to nanoseconds. Only fall
back to using get_kvmclock_base_ns() when not in master clock mode.

There was previously a code path in __get_kvmclock() which looked like
it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
even on 32-bit hosts. In practice that could never happen as the
ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
hosts it would never be set when the system clock isn't TSC-based. So
that code path is now removed.

The kvm_get_wall_clock_epoch() function had the same problem; make it
just call get_kvmclock() and subtract kvmclock from wallclock, with
the same fallback as before.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +
arch/x86/kvm/x86.c | 150 ++++++++++++++++----------------
2 files changed, 78 insertions(+), 76 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cfac72b4aa64..13f979dd14b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1353,6 +1353,7 @@ struct kvm_arch {
u64 last_tsc_write;
u32 last_tsc_khz;
u64 last_tsc_offset;
+ u64 last_tsc_scaling_ratio;
u64 cur_tsc_nsec;
u64 cur_tsc_write;
u64 cur_tsc_offset;
@@ -1366,6 +1367,9 @@ struct kvm_arch {
bool use_master_clock;
u64 master_kernel_ns;
u64 master_cycle_now;
+ u64 master_tsc_scaling_ratio;
+ u32 master_tsc_mul;
+ s8 master_tsc_shift;
struct delayed_work kvmclock_update_work;
struct delayed_work kvmclock_sync_work;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f870e29d2558..5cd92f4b4c97 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2671,6 +2671,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = tsc;
kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+ kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
kvm->arch.last_tsc_offset = offset;

vcpu->arch.last_guest_tsc = tsc;
@@ -3006,6 +3007,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
{
#ifdef CONFIG_X86_64
struct kvm_arch *ka = &kvm->arch;
+ uint64_t last_tsc_hz;
int vclock_mode;
bool host_tsc_clocksource, vcpus_matched;

@@ -3025,6 +3027,34 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
&& !ka->backwards_tsc_observed
&& !ka->boot_vcpu_runs_old_kvmclock;

+ /*
+ * When TSC scaling is in use (which can thankfully only happen
+ * with X86_FEATURE_CONSTANT_TSC), the host must calculate the
+ * KVM clock precisely as the guest would, by scaling through
+ * the guest TSC frequency. Otherwise, differences in arithmetic
+ * precision lead to systemic drift between the guest's and the
+ * host's idea of the time.
+ */
+ if (kvm_caps.has_tsc_control) {
+ /*
+ * Copy from the field protected solely by ka->tsc_write_lock,
+ * to the field protected by the ka->pvclock_sc seqlock.
+ */
+ ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
+
+ /*
+ * Calculate the scaling factors precisely the same way
+ * that kvm_guest_time_update() does.
+ */
+ last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000,
+ ka->last_tsc_scaling_ratio);
+ kvm_get_time_scale(NSEC_PER_SEC, last_tsc_hz,
+ &ka->master_tsc_shift, &ka->master_tsc_mul);
+ } else if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000,
+ &ka->master_tsc_shift, &ka->master_tsc_mul);
+ }
+
if (ka->use_master_clock)
atomic_set(&kvm_guest_has_master_clock, 1);

@@ -3097,36 +3127,49 @@ static unsigned long get_cpu_tsc_khz(void)
static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
{
struct kvm_arch *ka = &kvm->arch;
- struct pvclock_vcpu_time_info hv_clock;
+
+#ifdef CONFIG_X86_64
+ uint64_t cur_tsc_khz = 0;
+ struct timespec64 ts;

/* both __this_cpu_read() and rdtsc() should be on the same cpu */
get_cpu();

- data->flags = 0;
if (ka->use_master_clock &&
- (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
-#ifdef CONFIG_X86_64
- struct timespec64 ts;
+ (cur_tsc_khz = get_cpu_tsc_khz()) &&
+ !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
+ cur_tsc_khz = 0;

- if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
- data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
- data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
- } else
-#endif
- data->host_tsc = rdtsc();
-
- data->flags |= KVM_CLOCK_TSC_STABLE;
- hv_clock.tsc_timestamp = ka->master_cycle_now;
- hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
- kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
- &hv_clock.tsc_shift,
- &hv_clock.tsc_to_system_mul);
- data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
- } else {
- data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+ put_cpu();
+
+ if (cur_tsc_khz) {
+ uint64_t tsc_cycles;
+ uint32_t mul;
+ int8_t shift;
+
+ tsc_cycles = data->host_tsc - ka->master_cycle_now;
+
+ if (kvm_caps.has_tsc_control)
+ tsc_cycles = kvm_scale_tsc(tsc_cycles,
+ ka->master_tsc_scaling_ratio);
+
+ if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ mul = ka->master_tsc_mul;
+ shift = ka->master_tsc_shift;
+ } else {
+ kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
+ &shift, &mul);
+ }
+ data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
+ pvclock_scale_delta(tsc_cycles, mul, shift);
+ data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
+ data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
+ return;
}
+#endif

- put_cpu();
+ data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+ data->flags = 0;
}

static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
@@ -3327,68 +3370,23 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
* that and kvmclock, but even that would be subject to change over
* time.
*
- * Attempt to calculate the epoch at a given moment using the *same*
- * TSC reading via kvm_get_walltime_and_clockread() to obtain both
- * wallclock and kvmclock times, and subtracting one from the other.
+ * Use get_kvmclock() to obtain a simultaneous reading of wallclock
+ * and kvmclock times from the *same* TSC reading, and subtract one
+ * from the other.
*
* Fall back to using their values at slightly different moments by
* calling ktime_get_real_ns() and get_kvmclock_ns() separately.
*/
uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
{
-#ifdef CONFIG_X86_64
- struct pvclock_vcpu_time_info hv_clock;
- struct kvm_arch *ka = &kvm->arch;
- unsigned long seq, local_tsc_khz;
- struct timespec64 ts;
- uint64_t host_tsc;
-
- do {
- seq = read_seqcount_begin(&ka->pvclock_sc);
-
- local_tsc_khz = 0;
- if (!ka->use_master_clock)
- break;
-
- /*
- * The TSC read and the call to get_cpu_tsc_khz() must happen
- * on the same CPU.
- */
- get_cpu();
-
- local_tsc_khz = get_cpu_tsc_khz();
-
- if (local_tsc_khz &&
- !kvm_get_walltime_and_clockread(&ts, &host_tsc))
- local_tsc_khz = 0; /* Fall back to old method */
-
- put_cpu();
-
- /*
- * These values must be snapshotted within the seqcount loop.
- * After that, it's just mathematics which can happen on any
- * CPU at any time.
- */
- hv_clock.tsc_timestamp = ka->master_cycle_now;
- hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+ struct kvm_clock_data data;

- } while (read_seqcount_retry(&ka->pvclock_sc, seq));
+ get_kvmclock(kvm, &data);

- /*
- * If the conditions were right, and obtaining the wallclock+TSC was
- * successful, calculate the KVM clock at the corresponding time and
- * subtract one from the other to get the guest's epoch in nanoseconds
- * since 1970-01-01.
- */
- if (local_tsc_khz) {
- kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * NSEC_PER_USEC,
- &hv_clock.tsc_shift,
- &hv_clock.tsc_to_system_mul);
- return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
- __pvclock_read_cycles(&hv_clock, host_tsc);
- }
-#endif
- return ktime_get_real_ns() - get_kvmclock_ns(kvm);
+ if (data.flags & KVM_CLOCK_REALTIME)
+ return data.realtime - data.clock;
+ else
+ return ktime_get_real_ns() - data.clock;
}

/* These helpers are safe iff @msr is known to be an MCx bank MSR. */
--
2.44.0


2024-04-19 15:54:02

by Paul Durrant

Subject: Re: [PATCH 07/10] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <[email protected]>
>
> Commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw
> clock") did so only for 64-bit hosts, by capturing the boot offset from
> within the existing clocksource notifier update_pvclock_gtod().
>
> That notifier was added in commit 16e8d74d2da9 ("KVM: x86: notifier for
> clocksource changes") but only on x86_64, because its original purpose
> was just to disable the "master clock" mode which is only supported on
> x86_64.
>
> Now that the notifier is used for more than disabling master clock mode,
> (well, OK, more than a decade later but clocks are hard), enable it for
> the 32-bit build too so that get_kvmclock_base_ns() can be unaffected by
> NTP sync on 32-bit too.
>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/kvm/x86.c | 18 ++++++------------
> 1 file changed, 6 insertions(+), 12 deletions(-)
>

Reviewed-by: Paul Durrant <[email protected]>


2024-04-19 15:56:12

by Paul Durrant

Subject: Re: [PATCH 08/10] KVM: x86: Remove periodic global clock updates

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <[email protected]>
>
> This effectively reverts commit 332967a3eac0 ("x86: kvm: introduce
> periodic global clock updates"). The periodic update was introduced to
> propagate NTP corrections to the guest KVM clock, when the KVM clock was
> based on CLOCK_MONOTONIC.
>
> However, commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
> monotonic raw clock") switched to using CLOCK_MONOTONIC_RAW as the basis
> for the KVM clock, avoiding the NTP frequency skew altogether.
>
> So the periodic update serves no purpose. Remove it.
>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/kvm/x86.c | 25 -------------------------
> 1 file changed, 25 deletions(-)
>

Reviewed-by: Paul Durrant <[email protected]>


2024-04-19 16:08:28

by Paul Durrant

Subject: Re: [PATCH 10/10] KVM: x86: Fix KVM clock precision in __get_kvmclock()

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <[email protected]>
>
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
>
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
>
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
>
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
>
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
>
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_writE_lock, so in pvclock_update_vm_gtod_copy(), copy it into a

^ typo: capitalization of the 'E'

> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
>
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
>
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.
>
> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 +
> arch/x86/kvm/x86.c | 150 ++++++++++++++++----------------
> 2 files changed, 78 insertions(+), 76 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index cfac72b4aa64..13f979dd14b9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1353,6 +1353,7 @@ struct kvm_arch {
> u64 last_tsc_write;
> u32 last_tsc_khz;
> u64 last_tsc_offset;
> + u64 last_tsc_scaling_ratio;
> u64 cur_tsc_nsec;
> u64 cur_tsc_write;
> u64 cur_tsc_offset;
> @@ -1366,6 +1367,9 @@ struct kvm_arch {
> bool use_master_clock;
> u64 master_kernel_ns;
> u64 master_cycle_now;
> + u64 master_tsc_scaling_ratio;
> + u32 master_tsc_mul;
> + s8 master_tsc_shift;
> struct delayed_work kvmclock_update_work;
> struct delayed_work kvmclock_sync_work;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f870e29d2558..5cd92f4b4c97 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2671,6 +2671,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
> kvm->arch.last_tsc_nsec = ns;
> kvm->arch.last_tsc_write = tsc;
> kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
> + kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
> kvm->arch.last_tsc_offset = offset;
>
> vcpu->arch.last_guest_tsc = tsc;
> @@ -3006,6 +3007,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
> {
> #ifdef CONFIG_X86_64
> struct kvm_arch *ka = &kvm->arch;
> + uint64_t last_tsc_hz;
> int vclock_mode;
> bool host_tsc_clocksource, vcpus_matched;
>
> @@ -3025,6 +3027,34 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
> && !ka->backwards_tsc_observed
> && !ka->boot_vcpu_runs_old_kvmclock;
>
> + /*
> + * When TSC scaling is in use (which can thankfully only happen
> + * with X86_FEATURE_CONSTANT_TSC), the host must calculate the
> + * KVM clock precisely as the guest would, by scaling through
> + * the guest TSC frequency. Otherwise, differences in arithmetic
> + * precision lead to systemic drift between the guest's and the
> + * host's idea of the time.
> + */
> + if (kvm_caps.has_tsc_control) {
> + /*
> + * Copy from the field protected solely by ka->tsc_write_lock,
> + * to the field protected by the ka->pvclock_sc seqlock.
> + */
> + ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
> +
> + /*
> + * Calculate the scaling factors precisely the same way
> + * that kvm_guest_time_update() does.
> + */
> + last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000,
> + ka->last_tsc_scaling_ratio);
> + kvm_get_time_scale(NSEC_PER_SEC, last_tsc_hz,
> + &ka->master_tsc_shift, &ka->master_tsc_mul);
> + } else if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> + kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1009,

1009?

> + &ka->master_tsc_shift, &ka->master_tsc_mul);
> + }
> +
> if (ka->use_master_clock)
> atomic_set(&kvm_guest_has_master_clock, 1);
>


2024-04-19 16:56:18

by David Woodhouse

Subject: Re: [RFC PATCH 0/10] Cleaning up the KVM clock mess

On Thu, 2024-04-18 at 20:34 +0100, David Woodhouse wrote:
>
>       KVM: x86: Remove periodic global clock updates
>       KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE

Meh, I might have to put those back. They were originally introduced to
cope with NTP frequency skew which is no longer a problem, but we now
know there's a systemic skew even between the host CLOCK_MONOTONIC_RAW
and the KVM clock as calculated via the guest's TSC.

So at least when !ka->use_master_clock I think the forced resync does
need to happen.




2024-04-19 18:01:50

by David Woodhouse

Subject: [PATCH 11/10] KVM: x86: Fix software TSC upscaling in kvm_guest_time_update()

From: David Woodhouse <[email protected]>

There was some confusion in kvm_guest_time_update() when software needs
to advance the guest TSC.

In master clock mode, there are two points of time which need to be taken
into account. First there is the master clock reference point, stored in
kvm->arch.master_kernel_ns (and associated host TSC ->master_cycle_now).
Secondly, there is the time *now*, at the point kvm_guest_time_update()
is being called.

With software TSC upscaling, the guest TSC is getting further and further
ahead of the host TSC as time elapses. So at time "now", the guest TSC
should be further ahead of the host, than it was at master_kernel_ns.

The adjustment in kvm_guest_time_update() was not taking that into
account, and was only advancing the guest TSC by the appropriate amount
for master_kernel_ns, *not* the current time.

Fix it to calculate them both correctly.
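
Back-of-envelope illustration with made-up numbers: a 3GHz guest TSC on
a 2GHz host, so software catchup must gain 1e9 guest cycles per second
of wall time. If master_kernel_ns is one second in the past when
kvm_guest_time_update() runs:

    delta computed as of master_kernel_ns:  ~0 cycles
    delta actually needed *now*:            ~1e9 cycles

so the guest TSC is left about 1e9 cycles (a third of a guest-second)
behind where compute_guest_tsc() says it should be.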

Since the KVM clock reference point in master_kernel_ns might actually
be *earlier* than the reference point used for the guest TSC
(vcpu->arch.this_tsc_nsec), this might lead to a negative delta. Fix the
compute_guest_tsc() function to cope with negative numbers, which
then means there is no need to force a master clock update when the
guest TSC is written.

Signed-off-by: David Woodhouse <[email protected]>
---
Untested. Thrown on the pile at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks
which we'll be testing more next week...

arch/x86/kvm/x86.c | 61 +++++++++++++++++++++++++++++++++-------------
1 file changed, 44 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5cd92f4b4c97..a78adef698bd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2488,10 +2488,19 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)

static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
{
- u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
- vcpu->arch.virtual_tsc_mult,
- vcpu->arch.virtual_tsc_shift);
- tsc += vcpu->arch.this_tsc_write;
+ s64 delta = kernel_ns - vcpu->arch.this_tsc_nsec;
+ u64 tsc = vcpu->arch.this_tsc_write;
+
+ /* pvclock_scale_delta cannot cope with negative deltas */
+ if (delta >= 0)
+ tsc += pvclock_scale_delta(delta,
+ vcpu->arch.virtual_tsc_mult,
+ vcpu->arch.virtual_tsc_shift);
+ else
+ tsc -= pvclock_scale_delta(-delta,
+ vcpu->arch.virtual_tsc_mult,
+ vcpu->arch.virtual_tsc_shift);
+
return tsc;
}

@@ -2502,7 +2511,7 @@ static inline bool gtod_is_based_on_tsc(int mode)
}
#endif

-static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
+static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_X86_64
struct kvm_arch *ka = &vcpu->kvm->arch;
@@ -2519,12 +2528,9 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)

/*
* Request a masterclock update if the masterclock needs to be toggled
- * on/off, or when starting a new generation and the masterclock is
- * enabled (compute_guest_tsc() requires the masterclock snapshot to be
- * taken _after_ the new generation is created).
+ * on/off.
*/
- if ((ka->use_master_clock && new_generation) ||
- (ka->use_master_clock != use_master_clock))
+ if ((ka->use_master_clock != use_master_clock))
kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -2702,7 +2708,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;

- kvm_track_tsc_matching(vcpu, !matched);
+ kvm_track_tsc_matching(vcpu);
}

static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
@@ -3296,8 +3302,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
kernel_ns = get_kvmclock_base_ns();
}

- tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
-
/*
* We may have to catch up the TSC to match elapsed wall clock
* time for two reasons, even if kvmclock is used.
@@ -3309,11 +3313,34 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
* very slowly.
*/
if (vcpu->tsc_catchup) {
- u64 tsc = compute_guest_tsc(v, kernel_ns);
- if (tsc > tsc_timestamp) {
- adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
- tsc_timestamp = tsc;
+ uint64_t now_guest_tsc_adjusted;
+ uint64_t now_guest_tsc_unadjusted;
+ int64_t now_guest_tsc_delta;
+
+ tsc_timestamp = compute_guest_tsc(v, kernel_ns);
+
+ if (use_master_clock) {
+ uint64_t now_host_tsc;
+ int64_t now_kernel_ns;
+
+ if (!kvm_get_time_and_clockread(&now_kernel_ns, &now_host_tsc)) {
+ now_kernel_ns = get_kvmclock_base_ns();
+ now_host_tsc = rdtsc();
+ }
+ now_guest_tsc_adjusted = compute_guest_tsc(v, now_kernel_ns);
+ now_guest_tsc_unadjusted = kvm_read_l1_tsc(v, now_host_tsc);
+ } else {
+ now_guest_tsc_adjusted = tsc_timestamp;
+ now_guest_tsc_unadjusted = kvm_read_l1_tsc(v, kernel_ns);
}
+
+ now_guest_tsc_delta = now_guest_tsc_adjusted -
+ now_guest_tsc_unadjusted;
+
+ if (now_guest_tsc_delta > 0)
+ adjust_tsc_offset_guest(v, now_guest_tsc_delta);
+ } else {
+ tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
}

local_irq_restore(flags);
--
2.34.1


