2024-04-08 22:08:09

by Jack Allister

[permalink] [raw]
Subject: [PATCH 0/2] Add API to correct KVM/PV clock drift

Guest VMs can be provided with a para-virtualized clock source to
perform timekeeping. A KVM guest can map in a PV clock via the
MSR_KVM_SYSTEM_TIME/MSR_KVM_SYSTEM_TIME_NEW virtualized MSRs.
Where as on a Xen guest this can be provided via the vcpu/shared
info pages.

These PV clocks both use a common structure which is mapped between
host <-> guest to provide the PVTI (paravirtual time information)
for the clock. This reference information is a guest TSC timestamp
and a host system time at a SINGULAR point in time.

If KVM decides to update the reference information due to a
KVM_REQ_MASTERCLOCK_UPDATE, a drift between the guest TSC and
the PV clock can be observed, this is exascerbated when the guest
TSC is also scaled too.

If the reference guest TSC & system time within the structure stay
the same there is no potential for a drift between the TSC and PV
clock.

This series adds in two patches, one to add in API/ioctl to allow
a VMM to perform a correction/fixup of the PVTI structure when it
knows that KVM may have updated the KVM clock information and a
second one to verify that the drift is present & corrected.

Jack Allister (2):
KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup
KVM: selftests: Add KVM/PV clock selftest to prove timer drift
correction

Documentation/virt/kvm/api.rst | 43 ++++
arch/x86/kvm/x86.c | 87 +++++++
include/uapi/linux/kvm.h | 3 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/pvclock_test.c | 223 ++++++++++++++++++
5 files changed, 357 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c


base-commit: 8cb4a9a82b21623dbb4b3051dd30d98356cf95bc
--
2.40.1



2024-04-08 22:09:55

by Jack Allister

[permalink] [raw]
Subject: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup

There is a potential for drift between the TSC and a KVM/PV clock when the
guest TSC is scaled (as seen previously in [1]). Which fixed drift between
timers over the lifetime of a VM.

However, there is another factor which will cause a drift. In a situation
such as a kexec/live-update of the kernel or a live-migration of a VM the
PV clock information is recalculated by KVM (KVM_REQ_MASTERCLOCK_UPDATE).
This update samples a new system_time & tsc_timestamp to be used in the
structure.

For example, when a guest is running with a TSC frequency of 1.5GHz but the
host frequency is 3.0GHz upon an update of the PV time information a delta
of ~3500ns is observed between the TSC and the KVM/PV clock. There is no
reason why a fixup creating an accuracy of ±1ns cannot be achieved.

Additional interfaces are added to retrieve & fixup the PV time information
when a VMM may believe is appropriate (deserialization after live-update/
migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
currently used PV time information and then when the VMM believes a drift
may occur can then instruct KVM to perform a correction via the setter
KVM_SET_CLOCK_GUEST.

The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
TSC & kernel timstamp are sampled at a singular point in time. Using the
already known scaling/offset for L1 the guest TSC is then derived from this
information.

From here two PV time information structures are created, one which is the
original time information structure prior to whatever may have caused a PV
clock re-calculation (live-update/migration). The second is then using the
singular point in time sampled just prior. An individual KVM/PV clock for
each of the PV time information structures using the singular guest TSC.

A delta is then determined between the two calculated PV times, which is
then used as a correction offset added onto the kvmclock_offset for the VM.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae

Suggested-by: David Woodhouse <[email protected]>
Signed-off-by: Jack Allister <[email protected]>
CC: Paul Durrant <[email protected]>
---
Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
arch/x86/kvm/x86.c | 87 ++++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 3 ++
3 files changed, 133 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..5f74d8ac1002 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).

See KVM_SET_USER_MEMORY_REGION2 for additional details.

+4.143 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks.
+On x86 a PV clock is derived from the current TSC and is then scaled based
+upon the a specified multiplier and shift. The result of this is then added
+to a system time.
+
+The guest needs a way to determine the system time, multiplier and shift. This
+can be done by multiple ways, for KVM guests this can be via an MSR write to
+MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
+address KVM shall put the structure. On Xen guests this can be found in the Xen
+vcpu_info.
+
+This is structure is useful information for a VMM to also know when taking into
+account potential timer drift on live-update/migration.
+
+4.144 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
+known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
+
+If a VM is utilizing TSC scaling there is a potential for a drift between the
+KVM/PV clock and the TSC itself. This is due to the loss of precision when
+performing a multiply and shift rather than divide for the TSC.
+
+To perform the correction a delta is calculated between the original time info
+(which is assumed correct) at a singular point in time X. The KVM clock offset
+is then offset by this delta.
+
5. The kvm_run structure
========================

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..5d2e10cd1c30 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
return 0;
}

+static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
+{
+ struct kvm_vcpu *vcpu = NULL;
+ int i;
+
+ for (i = 0; i < KVM_MAX_VCPUS; i++) {
+ vcpu = kvm_get_vcpu_by_id(kvm, i);
+ if (!vcpu)
+ continue;
+
+ if (kvm_vcpu_is_reset_bsp(vcpu))
+ break;
+ }
+
+ return vcpu;
+}
+
+static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
+{
+ struct kvm_vcpu *vcpu;
+
+ vcpu = kvm_get_bsp_vcpu(kvm);
+ if (!vcpu)
+ return -EINVAL;
+
+ if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
+ return -EIO;
+
+ if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
+{
+ struct kvm_vcpu *vcpu;
+ struct pvclock_vcpu_time_info orig_pvti;
+ struct pvclock_vcpu_time_info dummy_pvti;
+ int64_t kernel_ns;
+ uint64_t host_tsc, guest_tsc;
+ uint64_t clock_orig, clock_dummy;
+ int64_t correction;
+ unsigned long i;
+
+ vcpu = kvm_get_bsp_vcpu(kvm);
+ if (!vcpu)
+ return -EINVAL;
+
+ if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
+ return -EFAULT;
+
+ /*
+ * Sample the kernel time and host TSC at a singular point.
+ * We then calculate the guest TSC using this exact point in time,
+ * From here we can then determine the delta using the
+ * PV time info requested from the user and what we currently have
+ * using the fixed point in time. This delta is then used as a
+ * correction factor to fixup the potential drift.
+ */
+ if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
+ return -EFAULT;
+
+ guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+ dummy_pvti = orig_pvti;
+ dummy_pvti.tsc_timestamp = guest_tsc;
+ dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
+
+ clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
+ clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
+
+ correction = clock_orig - clock_dummy;
+ kvm->arch.kvmclock_offset += correction;
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+ return 0;
+}
+
int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
struct kvm *kvm = filp->private_data;
@@ -7246,6 +7327,12 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
case KVM_GET_CLOCK:
r = kvm_vm_ioctl_get_clock(kvm, argp);
break;
+ case KVM_SET_CLOCK_GUEST:
+ r = kvm_vm_ioctl_set_clock_guest(kvm, argp);
+ break;
+ case KVM_GET_CLOCK_GUEST:
+ r = kvm_vm_ioctl_get_clock_guest(kvm, argp);
+ break;
case KVM_SET_TSC_KHZ: {
u32 user_tsc_khz;

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..0d306311e4d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd5, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
+
#endif /* __LINUX_KVM_H */
--
2.40.1


2024-04-08 22:10:45

by Jack Allister

[permalink] [raw]
Subject: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

This test proves that there is an inherent KVM/PV clock drift away from the
guest TSC when KVM decides to update the PV time information structure due
to a KVM_REQ_MASTERCLOCK_UPDATE. This drift is exascerbated when a guest is
using TSC scaling and running at a different frequency to the host TSC [1].
It also proves that KVM_[GS]ET_CLOCK_GUEST API is working to mitigate the
drift from TSC to within ±1ns.

The test simply records the PVTI (PV time information) at time of guest
creation, after KVM has updated it's mapped PVTI structure and once the
correction has taken place.

A singular point in time is then recorded via the guest TSC and is used to
calculate the a PV clock value using each of the 3 PVTI structures.

As seen below a drift of ~3500ns is observed if no correction has taken
place after KVM has updated the PVTI via master clock update. However,
after the correction a delta of at most 1ns can be seen.

* selftests: kvm: pvclock_test
* scaling tsc from 2999999KHz to 1499999KHz
* before=5038374946 uncorrected=5038371437 corrected=5038374945
* delta_uncorrected=3509 delta_corrected=1

Clocksource check code has been borrowed from [2].

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae
[2]: https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Jack Allister <[email protected]>
CC: David Woodhouse <[email protected]>
CC: Paul Durrant <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/pvclock_test.c | 223 ++++++++++++++++++
2 files changed, 224 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 741c7dc16afc..02ee1205bbed 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
+TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
new file mode 100644
index 000000000000..172ef4d19c60
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2024, Amazon.com, Inc. or its affiliates.
+ *
+ * Tests for pvclock API
+ * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
+ */
+#include <asm/pvclock.h>
+#include <asm/pvclock-abi.h>
+#include <sys/stat.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+enum {
+ STAGE_FIRST_BOOT,
+ STAGE_UNCORRECTED,
+ STAGE_CORRECTED,
+ NUM_STAGES
+};
+
+#define KVMCLOCK_GPA 0xc0000000ull
+#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
+
+static void trigger_pvti_update(vm_paddr_t pvti_pa)
+{
+ /*
+ * We need a way to trigger KVM to update the fields
+ * in the PV time info. The easiest way to do this is
+ * to temporarily switch to the old KVM system time
+ * method and then switch back to the new one.
+ */
+ wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
+ wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+}
+
+static void guest_code(vm_paddr_t pvti_pa)
+{
+ struct pvclock_vcpu_time_info *pvti_va =
+ (struct pvclock_vcpu_time_info *)pvti_pa;
+
+ struct pvclock_vcpu_time_info pvti_boot;
+ struct pvclock_vcpu_time_info pvti_uncorrected;
+ struct pvclock_vcpu_time_info pvti_corrected;
+ uint64_t cycles_boot;
+ uint64_t cycles_uncorrected;
+ uint64_t cycles_corrected;
+ uint64_t tsc_guest;
+
+ /*
+ * Setup the KVMCLOCK in the guest & store the original
+ * PV time structure that is used.
+ */
+ wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+ pvti_boot = *pvti_va;
+ GUEST_SYNC(STAGE_FIRST_BOOT);
+
+ /*
+ * Trigger an update of the PVTI, if we calculate
+ * the KVM clock using this structure we'll see
+ * a drift from the TSC.
+ */
+ trigger_pvti_update(pvti_pa);
+ pvti_uncorrected = *pvti_va;
+ GUEST_SYNC(STAGE_UNCORRECTED);
+
+ /*
+ * The test should have triggered the correction by this
+ * point in time. We have a copy of each of the PVTI structs
+ * at each stage now.
+ *
+ * Let's sample the timestamp at a SINGLE point in time and
+ * then calculate what the KVM clock would be using the PVTI
+ * from each stage.
+ *
+ * Then return each of these values to the tester.
+ */
+ pvti_corrected = *pvti_va;
+ tsc_guest = rdtsc();
+
+ cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
+ cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
+ cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
+
+ GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
+ cycles_corrected, 0);
+}
+
+static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+ struct ucall uc;
+ uint64_t ucall_reason;
+ struct pvclock_vcpu_time_info pvti_before;
+ uint64_t before, uncorrected, corrected;
+ int64_t delta_uncorrected, delta_corrected;
+
+ /* Loop through each stage of the test. */
+ while (true) {
+
+ /* Start/restart the running vCPU code. */
+ vcpu_run(vcpu);
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+ /* Retrieve and verify our stage. */
+ ucall_reason = get_ucall(vcpu, &uc);
+ TEST_ASSERT(ucall_reason == UCALL_SYNC,
+ "Unhandled ucall reason=%lu",
+ ucall_reason);
+
+ /* Run host specific code relating to stage. */
+ switch (uc.args[1]) {
+ case STAGE_FIRST_BOOT:
+ /* Store the KVM clock values before an update. */
+ vm_ioctl(vm, KVM_GET_CLOCK_GUEST, &pvti_before);
+
+ /* Sleep for a set amount of time to induce drift. */
+ sleep(5);
+ break;
+
+ case STAGE_UNCORRECTED:
+ /* Restore the KVM clock values. */
+ vm_ioctl(vm, KVM_SET_CLOCK_GUEST, &pvti_before);
+ break;
+
+ case STAGE_CORRECTED:
+ /* Query the clock information and verify delta. */
+ before = uc.args[2];
+ uncorrected = uc.args[3];
+ corrected = uc.args[4];
+
+ delta_uncorrected = before - uncorrected;
+ delta_corrected = before - corrected;
+
+ pr_info("before=%lu uncorrected=%lu corrected=%lu\n",
+ before, uncorrected, corrected);
+
+ pr_info("delta_uncorrected=%ld delta_corrected=%ld\n",
+ delta_uncorrected, delta_corrected);
+
+ TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1),
+ "larger than expected delta detected = %ld", delta_corrected);
+ return;
+ }
+ }
+}
+
+#define CLOCKSOURCE_PATH "/sys/devices/system/clocksource/clocksource0/current_clocksource"
+
+static void check_clocksource(void)
+{
+ char *clk_name;
+ struct stat st;
+ FILE *fp;
+
+ fp = fopen(CLOCKSOURCE_PATH, "r");
+ if (!fp) {
+ pr_info("failed to open clocksource file: %d; assuming TSC.\n",
+ errno);
+ return;
+ }
+
+ if (fstat(fileno(fp), &st)) {
+ pr_info("failed to stat clocksource file: %d; assuming TSC.\n",
+ errno);
+ goto out;
+ }
+
+ clk_name = malloc(st.st_size);
+ TEST_ASSERT(clk_name, "failed to allocate buffer to read file\n");
+
+ if (!fgets(clk_name, st.st_size, fp)) {
+ pr_info("failed to read clocksource file: %d; assuming TSC.\n",
+ ferror(fp));
+ goto out;
+ }
+
+ TEST_ASSERT(!strncmp(clk_name, "tsc\n", st.st_size),
+ "clocksource not supported: %s", clk_name);
+out:
+ fclose(fp);
+}
+
+static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+ unsigned int gpages;
+
+ gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ KVMCLOCK_GPA, 1, gpages, 0);
+ virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages);
+
+ vcpu_args_set(vcpu, 1, KVMCLOCK_GPA);
+}
+
+static void configure_scaled_tsc(struct kvm_vcpu *vcpu)
+{
+ uint64_t tsc_khz;
+
+ tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
+ pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2);
+ tsc_khz /= 2;
+ vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
+}
+
+int main(void)
+{
+ struct kvm_vm *vm;
+ struct kvm_vcpu *vcpu;
+
+ check_clocksource();
+
+ vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+ configure_pvclock(vm, vcpu);
+ configure_scaled_tsc(vcpu);
+
+ run_test(vm, vcpu);
+
+ return 0;
+}
--
2.40.1


2024-04-09 00:35:10

by Dongli Zhang

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup

Hi Jack,

On 4/8/24 15:07, Jack Allister wrote:
> There is a potential for drift between the TSC and a KVM/PV clock when the
> guest TSC is scaled (as seen previously in [1]). Which fixed drift between
> timers over the lifetime of a VM.

Those patches mentioned "TSC scaling" mutiple times. Is it a necessary to
reproduce this issue? I do not think it is necessary. The tsc scaling may speed
up the drift, but not the root cause.

How about to cite the below patch as the beginning. The below patch only
*avoids* KVM_REQ_MASTERCLOCK_UPDATE in some situations, but never solve the
problem when KVM_REQ_MASTERCLOCK_UPDATE is triggered ... therefore we need this
patchset ...

KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c52ffadc65e28ab461fd055e9991e8d8106a0056

I think this patch is only closely related to KVM_REQ_MASTERCLOCK_UPDATE, not
TSC scaling.

>
> However, there is another factor which will cause a drift. In a situation
> such as a kexec/live-update of the kernel or a live-migration of a VM the
> PV clock information is recalculated by KVM (KVM_REQ_MASTERCLOCK_UPDATE).
> This update samples a new system_time & tsc_timestamp to be used in the
> structure.
>
> For example, when a guest is running with a TSC frequency of 1.5GHz but the
> host frequency is 3.0GHz upon an update of the PV time information a delta
> of ~3500ns is observed between the TSC and the KVM/PV clock. There is no
> reason why a fixup creating an accuracy of ±1ns cannot be achieved.

Same as above. I think the key is to explain the issue when
KVM_REQ_MASTERCLOCK_UPDATE is triggered, not to emphasize the TSC scaling.
Please correct me if I am wrong.

>
> Additional interfaces are added to retrieve & fixup the PV time information
> when a VMM may believe is appropriate (deserialization after live-update/
> migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
> currently used PV time information and then when the VMM believes a drift
> may occur can then instruct KVM to perform a correction via the setter
> KVM_SET_CLOCK_GUEST.
>
> The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
> TSC & kernel timstamp are sampled at a singular point in time. Using the

Typo: "timstamp"

> already known scaling/offset for L1 the guest TSC is then derived from this

I assume you meant to derive guest TSC from TSC offset/scaling, not to derive
kvmclock. What does "TSC & kernel timstamp" mean?

> information.
>
> From here two PV time information structures are created, one which is the
> original time information structure prior to whatever may have caused a PV
> clock re-calculation (live-update/migration). The second is then using the
> singular point in time sampled just prior. An individual KVM/PV clock for
> each of the PV time information structures using the singular guest TSC.
>
> A delta is then determined between the two calculated PV times, which is
> then used as a correction offset added onto the kvmclock_offset for the VM.
>
> [1]: https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae__;!!ACWV5N9M2RV99hQ!OnMXeXj4Plz6xvAc5lYsKaR3d1GDGGGRhZkdLMbxr8Skc_VAv_O1H8qP9igQv4KPCtYDw2ShTUtEd2o3mD5R$
>
> Suggested-by: David Woodhouse <[email protected]>
> Signed-off-by: Jack Allister <[email protected]>
> CC: Paul Durrant <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
> arch/x86/kvm/x86.c | 87 ++++++++++++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 3 ++
> 3 files changed, 133 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..5f74d8ac1002 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).
>
> See KVM_SET_USER_MEMORY_REGION2 for additional details.
>
> +4.143 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks.
> +On x86 a PV clock is derived from the current TSC and is then scaled based
> +upon the a specified multiplier and shift. The result of this is then added
> +to a system time.

Typo: "the a".

> +
> +The guest needs a way to determine the system time, multiplier and shift. This
> +can be done by multiple ways, for KVM guests this can be via an MSR write to
> +MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
> +address KVM shall put the structure. On Xen guests this can be found in the Xen
> +vcpu_info.
> +
> +This is structure is useful information for a VMM to also know when taking into
> +account potential timer drift on live-update/migration.
> +
> +4.144 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
> +known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
> +
> +If a VM is utilizing TSC scaling there is a potential for a drift between the
> +KVM/PV clock and the TSC itself. This is due to the loss of precision when
> +performing a multiply and shift rather than divide for the TSC.
> +
> +To perform the correction a delta is calculated between the original time info
> +(which is assumed correct) at a singular point in time X. The KVM clock offset
> +is then offset by this delta.
> +
> 5. The kvm_run structure
> ========================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 47d9f03b7778..5d2e10cd1c30 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> return 0;
> }
>
> +static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
> +{
> + struct kvm_vcpu *vcpu = NULL;
> + int i;
> +
> + for (i = 0; i < KVM_MAX_VCPUS; i++) {
> + vcpu = kvm_get_vcpu_by_id(kvm, i);
> + if (!vcpu)
> + continue;
> +
> + if (kvm_vcpu_is_reset_bsp(vcpu))
> + break;
> + }
> +
> + return vcpu;
> +}

Would the above rely not only on TSC clocksource, but also
ka->use_master_clock==true?

3125 ka->use_master_clock = host_tsc_clocksource && vcpus_matched
3126 && !ka->backwards_tsc_observed
3127 && !ka->boot_vcpu_runs_old_kvmclock;

Should the condition of (ka->use_master_clock==true) be checked in the ioctl?

> +
> +static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
> +{
> + struct kvm_vcpu *vcpu;
> +
> + vcpu = kvm_get_bsp_vcpu(kvm);
> + if (!vcpu)
> + return -EINVAL;
> +
> + if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
> + return -EIO;
> +
> + if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
> + return -EFAULT;

What will happen if the vCPU=0 (e.g., BSP) thread is racing with here to update
the vcpu->arch.hv_clock?

It is a good idea to making assumption from the VMM (e.g., QEMU) side?

> +
> + return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
> +{
> + struct kvm_vcpu *vcpu;
> + struct pvclock_vcpu_time_info orig_pvti;
> + struct pvclock_vcpu_time_info dummy_pvti;
> + int64_t kernel_ns;
> + uint64_t host_tsc, guest_tsc;
> + uint64_t clock_orig, clock_dummy;
> + int64_t correction;
> + unsigned long i;

Please ignore me if there is not any chance to make the above (and other places
in the patchset) to honor reverse xmas tree style.

> +
> + vcpu = kvm_get_bsp_vcpu(kvm);
> + if (!vcpu)
> + return -EINVAL;
> +
> + if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
> + return -EFAULT;
> +
> + /*
> + * Sample the kernel time and host TSC at a singular point.
> + * We then calculate the guest TSC using this exact point in time,
> + * From here we can then determine the delta using the
> + * PV time info requested from the user and what we currently have
> + * using the fixed point in time. This delta is then used as a
> + * correction factor to fixup the potential drift.
> + */
> + if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
> + return -EFAULT;
> +
> + guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +
> + dummy_pvti = orig_pvti;
> + dummy_pvti.tsc_timestamp = guest_tsc;
> + dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
> +
> + clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
> + clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
> +
> + correction = clock_orig - clock_dummy;
> + kvm->arch.kvmclock_offset += correction;

I am not sure if it is a good idea to rely on userspace VMM to decide the good
timepoint to issue the ioctl, without assuming any racing.

In addition to live migration, can the user call this API any time during the VM
is running (to fix the clock drift)? Therefore, any requirement to protect the
update of kvmclock_offset from racing?


Thank you very much!

Dongli Zhang


> +
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +
> + return 0;
> +}
> +
> int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> {
> struct kvm *kvm = filp->private_data;
> @@ -7246,6 +7327,12 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> case KVM_GET_CLOCK:
> r = kvm_vm_ioctl_get_clock(kvm, argp);
> break;
> + case KVM_SET_CLOCK_GUEST:
> + r = kvm_vm_ioctl_set_clock_guest(kvm, argp);
> + break;
> + case KVM_GET_CLOCK_GUEST:
> + r = kvm_vm_ioctl_get_clock_guest(kvm, argp);
> + break;
> case KVM_SET_TSC_KHZ: {
> u32 user_tsc_khz;
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..0d306311e4d6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
> __u64 reserved[6];
> };
>
> +#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd5, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
> +
> #endif /* __LINUX_KVM_H */

2024-04-09 00:44:04

by Dongli Zhang

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

Hi Jack,

On 4/8/24 15:07, Jack Allister wrote:
> This test proves that there is an inherent KVM/PV clock drift away from the
> guest TSC when KVM decides to update the PV time information structure due
> to a KVM_REQ_MASTERCLOCK_UPDATE. This drift is exascerbated when a guest is

Typo: exacerbated

> using TSC scaling and running at a different frequency to the host TSC [1].
> It also proves that KVM_[GS]ET_CLOCK_GUEST API is working to mitigate the
> drift from TSC to within ±1ns.
>
> The test simply records the PVTI (PV time information) at time of guest
> creation, after KVM has updated it's mapped PVTI structure and once the
> correction has taken place.
>
> A singular point in time is then recorded via the guest TSC and is used to
> calculate the a PV clock value using each of the 3 PVTI structures.

Typo: "the a"

>
> As seen below a drift of ~3500ns is observed if no correction has taken
> place after KVM has updated the PVTI via master clock update. However,
> after the correction a delta of at most 1ns can be seen.
>
> * selftests: kvm: pvclock_test
> * scaling tsc from 2999999KHz to 1499999KHz
> * before=5038374946 uncorrected=5038371437 corrected=5038374945
> * delta_uncorrected=3509 delta_corrected=1
>
> Clocksource check code has been borrowed from [2].
>
> [1]: https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae__;!!ACWV5N9M2RV99hQ!JWCFSSEPaZUer553HE44W0OhktMh3-Iyz4aZNE4pcjc94q_bs4QBIrVS8ciYEzj1NigrVCamkHgpvwqP1XrV$
> [2]: https://urldefense.com/v3/__https://lore.kernel.org/kvm/[email protected]/__;!!ACWV5N9M2RV99hQ!JWCFSSEPaZUer553HE44W0OhktMh3-Iyz4aZNE4pcjc94q_bs4QBIrVS8ciYEzj1NigrVCamkHgpv7jfLpy7$
>
> Signed-off-by: Jack Allister <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: Paul Durrant <[email protected]>
> ---
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/x86_64/pvclock_test.c | 223 ++++++++++++++++++
> 2 files changed, 224 insertions(+)
> create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c
>
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index 741c7dc16afc..02ee1205bbed 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
> TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
> TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
> TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
> +TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
> TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
> TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
> TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
> diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
> new file mode 100644
> index 000000000000..172ef4d19c60
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
> @@ -0,0 +1,223 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright © 2024, Amazon.com, Inc. or its affiliates.
> + *
> + * Tests for pvclock API
> + * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
> + */
> +#include <asm/pvclock.h>
> +#include <asm/pvclock-abi.h>
> +#include <sys/stat.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +
> +enum {
> + STAGE_FIRST_BOOT,
> + STAGE_UNCORRECTED,
> + STAGE_CORRECTED,
> + NUM_STAGES
> +};
> +
> +#define KVMCLOCK_GPA 0xc0000000ull
> +#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
> +
> +static void trigger_pvti_update(vm_paddr_t pvti_pa)
> +{
> + /*
> + * We need a way to trigger KVM to update the fields
> + * in the PV time info. The easiest way to do this is
> + * to temporarily switch to the old KVM system time
> + * method and then switch back to the new one.
> + */
> + wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> +}
> +
> +static void guest_code(vm_paddr_t pvti_pa)
> +{
> + struct pvclock_vcpu_time_info *pvti_va =
> + (struct pvclock_vcpu_time_info *)pvti_pa;
> +
> + struct pvclock_vcpu_time_info pvti_boot;
> + struct pvclock_vcpu_time_info pvti_uncorrected;
> + struct pvclock_vcpu_time_info pvti_corrected;
> + uint64_t cycles_boot;
> + uint64_t cycles_uncorrected;
> + uint64_t cycles_corrected;
> + uint64_t tsc_guest;
> +
> + /*
> + * Setup the KVMCLOCK in the guest & store the original
> + * PV time structure that is used.
> + */
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> + pvti_boot = *pvti_va;
> + GUEST_SYNC(STAGE_FIRST_BOOT);
> +
> + /*
> + * Trigger an update of the PVTI, if we calculate
> + * the KVM clock using this structure we'll see
> + * a drift from the TSC.
> + */
> + trigger_pvti_update(pvti_pa);
> + pvti_uncorrected = *pvti_va;
> + GUEST_SYNC(STAGE_UNCORRECTED);
> +
> + /*
> + * The test should have triggered the correction by this
> + * point in time. We have a copy of each of the PVTI structs
> + * at each stage now.
> + *
> + * Let's sample the timestamp at a SINGLE point in time and
> + * then calculate what the KVM clock would be using the PVTI
> + * from each stage.
> + *
> + * Then return each of these values to the tester.
> + */
> + pvti_corrected = *pvti_va;
> + tsc_guest = rdtsc();
> +
> + cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
> + cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
> + cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
> +
> + GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
> + cycles_corrected, 0);
> +}
> +
> +static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> +{
> + struct ucall uc;
> + uint64_t ucall_reason;
> + struct pvclock_vcpu_time_info pvti_before;
> + uint64_t before, uncorrected, corrected;
> + int64_t delta_uncorrected, delta_corrected;
> +
> + /* Loop through each stage of the test. */
> + while (true) {
> +
> + /* Start/restart the running vCPU code. */
> + vcpu_run(vcpu);
> + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
> +
> + /* Retrieve and verify our stage. */
> + ucall_reason = get_ucall(vcpu, &uc);
> + TEST_ASSERT(ucall_reason == UCALL_SYNC,
> + "Unhandled ucall reason=%lu",
> + ucall_reason);
> +
> + /* Run host specific code relating to stage. */
> + switch (uc.args[1]) {
> + case STAGE_FIRST_BOOT:
> + /* Store the KVM clock values before an update. */
> + vm_ioctl(vm, KVM_GET_CLOCK_GUEST, &pvti_before);
> +
> + /* Sleep for a set amount of time to induce drift. */
> + sleep(5);
> + break;
> +
> + case STAGE_UNCORRECTED:
> + /* Restore the KVM clock values. */
> + vm_ioctl(vm, KVM_SET_CLOCK_GUEST, &pvti_before);
> + break;
> +
> + case STAGE_CORRECTED:
> + /* Query the clock information and verify delta. */
> + before = uc.args[2];
> + uncorrected = uc.args[3];
> + corrected = uc.args[4];
> +
> + delta_uncorrected = before - uncorrected;
> + delta_corrected = before - corrected;
> +
> + pr_info("before=%lu uncorrected=%lu corrected=%lu\n",
> + before, uncorrected, corrected);
> +
> + pr_info("delta_uncorrected=%ld delta_corrected=%ld\n",
> + delta_uncorrected, delta_corrected);
> +
> + TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1),
> + "larger than expected delta detected = %ld", delta_corrected);
> + return;
> + }
> + }
> +}
> +
> +#define CLOCKSOURCE_PATH "/sys/devices/system/clocksource/clocksource0/current_clocksource"
> +
> +static void check_clocksource(void)
> +{

AFAIR, I copied check_clocksource() from existing code during that time.

The commit e440c5f2e ("KVM: selftests: Generalize check_clocksource() from
kvm_clock_test") has introduced sys_clocksource_is_tsc(). Later it is renamed to
sys_clocksource_is_based_on_tsc().

Any chance to re-use sys_clocksource_is_based_on_tsc()?

> + char *clk_name;
> + struct stat st;
> + FILE *fp;
> +
> + fp = fopen(CLOCKSOURCE_PATH, "r");
> + if (!fp) {
> + pr_info("failed to open clocksource file: %d; assuming TSC.\n",
> + errno);
> + return;
> + }
> +
> + if (fstat(fileno(fp), &st)) {
> + pr_info("failed to stat clocksource file: %d; assuming TSC.\n",
> + errno);
> + goto out;
> + }
> +
> + clk_name = malloc(st.st_size);
> + TEST_ASSERT(clk_name, "failed to allocate buffer to read file\n");
> +
> + if (!fgets(clk_name, st.st_size, fp)) {
> + pr_info("failed to read clocksource file: %d; assuming TSC.\n",
> + ferror(fp));
> + goto out;
> + }
> +
> + TEST_ASSERT(!strncmp(clk_name, "tsc\n", st.st_size),
> + "clocksource not supported: %s", clk_name);
> +out:
> + fclose(fp);
> +}
> +
> +static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> +{
> + unsigned int gpages;
> +
> + gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
> + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
> + KVMCLOCK_GPA, 1, gpages, 0);
> + virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages);
> +
> + vcpu_args_set(vcpu, 1, KVMCLOCK_GPA);
> +}
> +
> +static void configure_scaled_tsc(struct kvm_vcpu *vcpu)
> +{
> + uint64_t tsc_khz;
> +
> + tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
> + pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2);
> + tsc_khz /= 2;
> + vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
> +}

Is configure_scaled_tsc() anecessary? Or how about to make it an option/arg?
Then I will be able to test it on a VM/server without TSC scaling.

Thank you very much!

Dongli Zhang

> +
> +int main(void)
> +{
> + struct kvm_vm *vm;
> + struct kvm_vcpu *vcpu;
> +
> + check_clocksource();
> +
> + vm = vm_create_with_one_vcpu(&vcpu, guest_code);
> +
> + configure_pvclock(vm, vcpu);
> + configure_scaled_tsc(vcpu);
> +
> + run_test(vm, vcpu);
> +
> + return 0;
> +}

2024-04-09 06:57:08

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup

On Mon, 2024-04-08 at 17:34 -0700, Dongli Zhang wrote:
> Hi Jack,
>
> On 4/8/24 15:07, Jack Allister wrote:
> > There is a potential for drift between the TSC and a KVM/PV clock when the
> > guest TSC is scaled (as seen previously in [1]). Which fixed drift between
> > timers over the lifetime of a VM.
>
> Those patches mentioned "TSC scaling" mutiple times. Is it a necessary to
> reproduce this issue? I do not think it is necessary. The tsc scaling may speed
> up the drift, but not the root cause.

Right. See the three definitions of the KVM clock I described in
https://lore.kernel.org/kvm/[email protected]/

One is based on the host CLOCK_MONOTONIC_RAW + ka->kvmclock_offset, the
second is based on the reference point in master_kernel_ns /
master_cycle_now, and runs at the rate of the (scaled) guest TSC.

It's the *third* definition, which I called 'Definition C', which is
purely a bug, and which comes from running at the rate of the
*unscaled* guest TSC. And which doesn't exist without TSC scaling.

I don't think this patch needs to talk about the TSC scaling issues at
all. Those, and the existence of 'Definition C', are just bugs.

For that matter, I'm not entirely sure I'd have mentioned 'drift'
either. Yes, the two non-buggy definitions do drift from each other but
that isn't the point of this patch. Let's try again to explain what
*is* the point of this patch...



In all sane environments, ka->use_master_clock is set and 'Definition
B' of the KVM clock is used. The KVM clock is a simple arithmetic
function of the guest TSC.

The existing KVM_GET_CLOCK function isn't just buggy because it uses
'Definition C' and ignores TSC scaling. It's also just fundamentally
wrong as an API.

First let's consider the case of live *update*, where we serialize the
guest and kexec into a new kernel, then resum the guest. So we don't
have to deal with the full horrors of TSC migration, and the broken
algorithm described in the kernel documentation for doing that; we can
just preserve the KVM_VCPU_TSC_OFFSET. (We'll come to that problem
later).

Now, how do we restore the KVM clock? There is no real excuse for this
not being cycle-accurate, and giving the *same* pvclock information
back to the guest as before. The TSC wasn't disrupted at all during the
steal time. And the KVM clock should be precisely the *same* arithmetic
function of the TSC. The pvclock structure shouldn't change by a single
bit! Although in fact we *will* tolerate the data structure changing,
but only if it gives the same *result* (±1ns for rounding as Jack
notes).

KVM_[GS]ET_CLOCK are fundamentally useless for this. KVM_SET_CLOCK can
attempt to set the KVM clock at a given UTC time, but UTC is an awful
reference. Its frequency is also skewed by NTP, which is probably going
to be unsynchronized on the newly-booted kernel and running at a
different rate w.r.t the TSC than it was on the source.

Using UTC was always stupid because of leap seconds; TAI would have
been better. And one might argue that that aside, KVM_[GS]ET_CLOCK
isn't *such* an awful thing to use for live *migration*.

But wait, we *also* have to migrate the TSC using a wallclock time as a
base, and that naturally introduces some unavoidable discontinuity in
the TSC. So why in $DEITY's name would we use wallclock time as the
reference for migrating the KVM clock, giving it a *different* time
jump from the TSC on which it is based?

Even for live migration, even though the accuracy of the guest clocks
are fundamentally limited by the accuracy of the NTP sync between the
source and destination hosts, why wouldn't the clock jump at least be
*consistent*? We want that precise arithmetic relationship from TSC to
KVM clock to be preserved even in the live *migration* case where we
have to accept that the TSC itself isn't cycle-accurate.

Because we don't have a way to accurately preserve the TSC→KVM clock
relationship, we are seeing the clocksource watchdog trigger in guests
and disable the TSC clocksource... because the *KVM clock* was used as
a reference and got corrupted :)

So, Jack's patch adds KVM_[GS]ET_CLOCK_GUEST which reads and writes the
actual struct pvclock_vcpu_time_info structure.

> How about to cite the below patch as the beginning. The below patch only
> *avoids* KVM_REQ_MASTERCLOCK_UPDATE in some situations, but never solve the
> problem when KVM_REQ_MASTERCLOCK_UPDATE is triggered ... therefore we need this
> patchset ...
>
> KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c52ffadc65e28ab461fd055e9991e8d8106a0056
>
> I think this patch is only closely related to KVM_REQ_MASTERCLOCK_UPDATE, not
> TSC scaling.
>
> ...
>
> Same as above. I think the key is to explain the issue when
> KVM_REQ_MASTERCLOCK_UPDATE is triggered, not to emphasize the TSC scaling.
> Please correct me if I am wrong.

Not sure; I think that's a separate issue too.

> >
> > Additional interfaces are added to retrieve & fixup the PV time information
> > when a VMM may believe is appropriate (deserialization after live-update/
> > migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
> > currently used PV time information and then when the VMM believes a drift
> > may occur can then instruct KVM to perform a correction via the setter
> > KVM_SET_CLOCK_GUEST.
> >
> > The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
> > TSC & kernel timstamp are sampled at a singular point in time. Using the
>
> Typo: "timstamp"
>
> > already known scaling/offset for L1 the guest TSC is then derived from this
>
> I assume you meant to derive guest TSC from TSC offset/scaling, not to derive
> kvmclock. What does "TSC & kernel timstamp" mean?

In ka->use_master_clock mode (Definition B), the KVM clock is defined
in terms of guest TSC ticks since a reference point in time, stored in
ka->master_cycle_now and ka->master_kernel_ns. That is, a TSC, and a
kernel (CLOCK_MONOTONIC_RAW, to be precise) timestamp.

Jack's KVM_SET_CLOCK_GUEST should be taking a new reference point with
kvm_get_time_and_clockread(). As we know, changing the reference point
like that has basically the same effect as a stray invocation of
KVM_REQ_MASTERCLOCK_UPDATE — it yanks the KVM clock back to 'Definition
A' based on the CLOCK_MONOTONIC_RAW. But we correct for that...

Next we calculate (correctly via guest TSC frequency) the KVM clock
value which would result with the newly changed reference point.

Then we calculate the *intended* KVM clock at that *same* host TSC
value, by converting to a guest TSC value and running it through the
pvclock information passed into the ioctl.

Then adjust ka->kvmclock_offset by the delta between the two.

And now the KVM clock at this moment is set to *precisely* what it
would was before the live {update,migration}, in relation to the guest
TSC. At least within 1ns.

> > information.
> >
> > From here two PV time information structures are created, one which is the
> > original time information structure prior to whatever may have caused a PV
> > clock re-calculation (live-update/migration). The second is then using the
> > singular point in time sampled just prior. An individual KVM/PV clock for
> > each of the PV time information structures using the singular guest TSC.
> >
> > A delta is then determined between the two calculated PV times, which is
> > then used as a correction offset added onto the kvmclock_offset for the VM.
> >
> > [1]: https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae__;!!ACWV5N9M2RV99hQ!OnMXeXj4Plz6xvAc5lYsKaR3d1GDGGGRhZkdLMbxr8Skc_VAv_O1H8qP9igQv4KPCtYDw2ShTUtEd2o3mD5R$ 
> >
> > Suggested-by: David Woodhouse <[email protected]>
> > Signed-off-by: Jack Allister <[email protected]>
> > CC: Paul Durrant <[email protected]>
> > ---
> >  Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
> >  arch/x86/kvm/x86.c             | 87 ++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       |  3 ++
> >  3 files changed, 133 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 0b5a33ee71ee..5f74d8ac1002 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).
> >  
> >  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >  
> > +4.143 KVM_GET_CLOCK_GUEST
> > +----------------------------
> > +
> > +:Capability: none
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct pvclock_vcpu_time_info (out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Retrieves the current time information structure used for KVM/PV clocks.
> > +On x86 a PV clock is derived from the current TSC and is then scaled based
> > +upon the a specified multiplier and shift. The result of this is then added
> > +to a system time.
>
> Typo: "the a".
>
> > +
> > +The guest needs a way to determine the system time, multiplier and shift. This
> > +can be done by multiple ways, for KVM guests this can be via an MSR write to
> > +MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
> > +address KVM shall put the structure. On Xen guests this can be found in the Xen
> > +vcpu_info.
> > +
> > +This is structure is useful information for a VMM to also know when taking into
> > +account potential timer drift on live-update/migration.
> > +
> > +4.144 KVM_SET_CLOCK_GUEST
> > +----------------------------
> > +
> > +:Capability: none
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct pvclock_vcpu_time_info (in)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
> > +known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
> > +
> > +If a VM is utilizing TSC scaling there is a potential for a drift between the
> > +KVM/PV clock and the TSC itself. This is due to the loss of precision when
> > +performing a multiply and shift rather than divide for the TSC.
> > +
> > +To perform the correction a delta is calculated between the original time info
> > +(which is assumed correct) at a singular point in time X. The KVM clock offset
> > +is then offset by this delta.
> > +
> >  5. The kvm_run structure
> >  ========================
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 47d9f03b7778..5d2e10cd1c30 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> >         return 0;
> >  }
> >  
> > +static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
> > +{
> > +       struct kvm_vcpu *vcpu = NULL;
> > +       int i;
> > +
> > +       for (i = 0; i < KVM_MAX_VCPUS; i++) {
> > +               vcpu = kvm_get_vcpu_by_id(kvm, i);
> > +               if (!vcpu)
> > +                       continue;
> > +
> > +               if (kvm_vcpu_is_reset_bsp(vcpu))
> > +                       break;
> > +       }
> > +
> > +       return vcpu;
> > +}
>
> Would the above rely not only on TSC clocksource, but also
> ka->use_master_clock==true?
>
>  3125         ka->use_master_clock = host_tsc_clocksource && vcpus_matched
>  3126                                 && !ka->backwards_tsc_observed
>  3127                                 && !ka->boot_vcpu_runs_old_kvmclock;
>
> Should the condition of (ka->use_master_clock==true) be checked in the ioctl?
>
> > +
> > +static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
> > +{
> > +       struct kvm_vcpu *vcpu;
> > +
> > +       vcpu = kvm_get_bsp_vcpu(kvm);
> > +       if (!vcpu)
> > +               return -EINVAL;
> > +
> > +       if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
> > +               return -EIO;
> > +
> > +       if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
> > +               return -EFAULT;
>
> What will happen if the vCPU=0 (e.g., BSP) thread is racing with here to update
> the vcpu->arch.hv_clock?
>
> It is a good idea to making assumption from the VMM (e.g., QEMU) side?

What if the vCPU has never set up the KVM clock and vcpu->arch.hv_clock
isn't even populated?

I don't think we should be using an actual vCPU at all.

I think we have to *create* the pvclock information, just just blindly
memcpy from some vCPU's ->arch.hv_clock.

Yes, we do need to scale via guest TSC frequency, but *only* in the
ka->use_master_clock case because otherwise, everything's hosed anyway.
In the ka->use_master_clock case, where we know all guests are running
at the same TSC frequency, why not just snapshot the scaling factor?
We can fix the broken __get_kvmclock_ns() that way too.

In fact, in the !ka->use_master_clock case, this ioctl should probably
return an error. Because in that case, the KVM clock *isn't* stable in
terms of the guest TSC as it would be in a sane world, and it doesn't
make sense to care about migrating it accurately.

We should think about the case where ka->use_master_clock is true on
the source, but *false* on the destination. Could KVM_SET_CLOCK_GUEST
still work in that mode, by calling compute_guest_tsc()?


> > +
> > +       return 0;
> > +}
> > +
> > +static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
> > +{
> > +       struct kvm_vcpu *vcpu;
> > +       struct pvclock_vcpu_time_info orig_pvti;
> > +       struct pvclock_vcpu_time_info dummy_pvti;
> > +       int64_t kernel_ns;
> > +       uint64_t host_tsc, guest_tsc;
> > +       uint64_t clock_orig, clock_dummy;
> > +       int64_t correction;
> > +       unsigned long i;
>
> Please ignore me if there is not any chance to make the above (and other places
> in the patchset) to honor reverse xmas tree style.
>
> > +
> > +       vcpu = kvm_get_bsp_vcpu(kvm);
> > +       if (!vcpu)
> > +               return -EINVAL;
> > +
> > +       if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
> > +               return -EFAULT;
> > +
> > +       /*
> > +        * Sample the kernel time and host TSC at a singular point.
> > +        * We then calculate the guest TSC using this exact point in time,
> > +        * From here we can then determine the delta using the
> > +        * PV time info requested from the user and what we currently have
> > +        * using the fixed point in time. This delta is then used as a
> > +        * correction factor to fixup the potential drift.
> > +        */
> > +       if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
> > +               return -EFAULT;
> > +
> > +       guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > +
> > +       dummy_pvti = orig_pvti;
> > +       dummy_pvti.tsc_timestamp = guest_tsc;
> > +       dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
> > +
> > +       clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
> > +       clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
> > +
> > +       correction = clock_orig - clock_dummy;
> > +       kvm->arch.kvmclock_offset += correction;
>
> I am not sure if it is a good idea to rely on userspace VMM to decide the good
> timepoint to issue the ioctl, without assuming any racing.
>
> In addition to live migration, can the user call this API any time during the VM
> is running (to fix the clock drift)? Therefore, any requirement to protect the
> update of kvmclock_offset from racing?
>

$DEITY no. The clock drift at *runtime* due to stray calls of
KVM_REQ_MASTERCLOCK_UPDATE is just a kernel bug. It isn't the user's
responsibility to correct it!

However... where invoked in ka->use_masterclock_mode, perhaps
pvclock_update_vm_gtod_copy() *should* perform the same kind of
calculation, and actually *adjust* ka->kvmclock_offset as the true
definition of the KVM clock drifts from CLOCK_MONOTONIC_RAW?

But internally; never seen by userspace.


Attachments:
smime.p7s (5.83 kB)

2024-04-09 11:33:00

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On Mon, 2024-04-08 at 17:43 -0700, Dongli Zhang wrote:
> Hi Jack,
>
> On 4/8/24 15:07, Jack Allister wrote:
> > This test proves that there is an inherent KVM/PV clock drift away from the
> > guest TSC when KVM decides to update the PV time information structure due
> > to a KVM_REQ_MASTERCLOCK_UPDATE. This drift is exascerbated when a guest is
>
> Typo: exacerbated
>
> > using TSC scaling and running at a different frequency to the host TSC [1].
> > It also proves that KVM_[GS]ET_CLOCK_GUEST API is working to mitigate the
> > drift from TSC to within ±1ns.

No, the KVM_[GS}ET_CLOCK_GUEST API is not about mitigating kernel bugs.

We *fix* kernel bugs, we don't make userspace work around them.

The KVM_[GS}ET_CLOCK_GUEST API allows userspace to perform accurate
live update and live migration without disrupting the relationship
between guest TSC and KVM clock.

Since a bombing run on KVM_REQ_MASTERCLOCK_UPDATE users is on my TODO
list, it's worth noting the *reason* that switching to the obsolete
MSR_KVM_SYSTEM_TIME forces ka->use_master_clock mode off.

It's not documented at all as far as I can tell, but in commit
54750f2cf042 (“KVM: x86: workaround SuSE's 2.6.16 pvclock vs
masterclock issue“) in 2015, it was done to work around a guest bug
where the guest *expected* the reference point to keep being updated
and never be too far in the past.


> Is configure_scaled_tsc() anecessary? Or how about to make it an
> option/arg?
> Then I will be able to test it on a VM/server without TSC scaling.

As discussed, TSC scaling shouldn't be needed. It *should* be part of
the unit test if possible, because there is a class of bugs it'll
trigger, but it should be optional.

In particular, the existing KVM_GET_CLOCK will return extra-wrong
results if TSC scaling is in force. But that isn't being tested here
yet because we haven't *fixed* it yet :)

For reference, here's my TODO list which Jack is working from...

• Add KVM unit test to validate that KVM clock does not change when
provoked (including by simulated live update).
It’s OK for the reference point at { tsc_timestamp, system_time } in
the pvclock structure to change, but only such that it gives the
same results for a given guest TSC — that is, if system_time
changes, then tsc_timestamp must change by a delta which precisely
corresponds in terms of the advertised guest TSC frequency. Perhaps
allow a slop of 1ns for rounding, but no more.

• Audit and fix (i.e. remove) KVM_REQ_MASTERCLOCK_UPDATE usage,
starting with kvm_xen_shared_info_init(). And work out whether it
should be sent to all vCPUs, as some call sites do, or just one?

• Add KVM_VCPU_TSC_SCALE attribute to allow userspace to know the
precise host→guest TSC scaling.
(cf. https://lore.kernel.org/all/[email protected]/)

• Expose guest’s view of KVM clock to userspace via KVM_GET_CLOCK_GUEST
ioctl. Perhaps also a memory-mapped version, as the gfn_to_pfn_cache
allows writing to userspace HVAs. With this, userspace has fast and
accurate way to calculate the KVM clock at any given moment in time.
 (Currently, userspace calls the KVM_GET_CLOCK ioctl which is slow
and returns inaccurate results). Then userspace can base other
things like PIT and HPET emulation on the KVM clock and simplify
timekeeping over migration for those too.

• Add a KVM_SET_CLOCK_GUEST ioctl which consumes the pvclock
information back again. This should not only set the kvmclock_offset
  field, but also set the reference point { master_cycle_now,
master_kernel_ns } as follows:
• Sample the kernel’s CLOCK_MONOTONIC_RAW to create a new
master_kernel_ns and master_cycle_now.
• Convert the new master_cycle_now to a guest TSC.
• Calculate the intended KVM clock with that guest TSC from the
provided pvclock information.
• Calculate the current KVM clock with that guest TSC using the
new master_cycle_now and master_kernel_ns and kvmclock_offset as
usual.
• Adjust kvmclock_offset to correct for the delta between current
and intended values.
• Raise KVM_REQ_CLOCK_UPDATE on all vCPUs.

• Fix the broken __get_kvmclock() function to scale via the guest’s
TSC frequency as it should. There isn’t necessarily a vCPU to use
for this, so it’s OK for this to work only when the frequency has
been set of the whole VM rather than only for individual vCPUs.
Likewise kvm_get_wall_clock_epoch() which has the same bug.

• Fix all other cases where KVM reads the time in two places
separately and then treats them as simultaneous.

• Fix the discontinuities in KVM_REQ_MASTERCLOCK_UPDATE by allowing
kvmclock_offset to vary while the VM is running in master clock
mode. Perhaps every call to pvclock_update_vm_gtod_copy() which
starts in master clock mode should follow the same process as the
proposed KVM_SET_CLOCK_GUEST to adjust the kvmclock_offset value
which corresponds with the new reference point. As long as we don’t
break in the case where something weird (host hibernation, etc.)
happened to the TSC, and we actually want to trust kvmclock_offset.
Maybe we should have a periodic work queue which keeps
kvmclock_offset in sync with the KVM clock while the VM is in master
clock mode?

• Correct the KVM documentation for TSC migration to take TSC
scaling
into account. Something like...

(SOURCE)
• Sample both TAI and the (source) host TSC at an arbitrary time we
  shall call Tsrc:
• Use adjtimex() to obtain tai_offset.
• Use KVM_GET_CLOCK to read UTC time and host TSC (ignoring the
actual kvm clock). These represent time Tsrc.
• Use adjtimex() to obtain tai_offset again, looping back to the
beginning if it changes.
• Convert the UTC time to TAI by adding the tai_offset.

• ∀ vCPU:
• Read the scaling information with the KVM_CPU_TSC_SCALE
attribute.
• Read the offset with the KVM_CPU_TSC_OFFSET attribute.
• Calculate this vCPU’s TSC at time Tsrc, from the host TSC
value.

• Use KVM_GET_CLOCK_GUEST to read the KVM clock (on vCPU0).

(DESTINATION)
• Sample both TAI and the (destination) host TSC at a time we shall
call Tdst:
• Use adjtimex() to obtain tai_offset.
• Use KVM_GET_CLOCK to read UTC time and host TSC.
• Use adjtimex() to obtain tai_offset again, looping back to the
beginning if it changes.
• Convert the UTC time to TAI by adding the tai_offset.

• Calculate the time (in the TAI clock) elapsed between Tsrc and
Tdst. Call this ΔT.

• ∀ vCPU:
• Calculate this vCPU’s intended TSC value at time Tdst:
• Given this vCPU’s TSC frequency, calculate the number of TSC
ticks correponding to ΔT.
• Add this to the vCPU TSC value calculated on the source
• Read the scaling information on the current host with the
KVM_CPU_TSC_SCALE attribute
• Calculate this vCPU’s scaled TSC value corresponding to the
host TSC at time Tdst without taking offsetting into account.
• Set KVM_CPU_TSC_OFFSET to the delta between that and the
intended TSC value.

• Use KVM_SET_CLOCK_GUEST to set the KVM clock (on vCPU0).





Attachments:
smime.p7s (5.83 kB)

2024-04-10 09:53:16

by Jack Allister

[permalink] [raw]
Subject: [PATCH v2 0/2] Add API for accurate KVM/PV clock migration

Guest VMs can be provided with a para-virtualized clock source to
perform timekeeping. A KVM guest can map in a PV clock via the
MSR_KVM_SYSTEM_TIME/MSR_KVM_SYSTEM_TIME_NEW virtualized MSRs.
Where as on a Xen guest this can be provided via the vcpu/shared
info pages.

These PV clocks both use a common structure which is mapped between
host <-> guest to provide the PVTI (paravirtual time information)
for the clock. This reference information is a guest TSC timestamp
and a host system time at a singular point in time.

Upon a live-update of a host or live-migration of an instance the
PVTI may be recalculated by KVM. Using the existing KVM_[GS]ET_CLOCK
functionality the relationship between the TSC and PV clock cannot
be precisely saved and restored by userspace.

This series adds in two patches, one to add in a new interface to
allow a VMM/userspace to perform a correction of the PVTI structure.
Then a second to verify the imprecision after a simulation of a
live-update/migration and then to verify the correction is to within
±1ns.

v1: https://lore.kernel.org/all/[email protected]/

v2:
- Moved new IOCTLs from vm to vcpu level.
- Adds extra error checks as suggested by Dongli Zhang / David Woodhouse.
- Adds on-demand calculation of PVTI if non currently present in vcpu.
- Adds proper synchronization for PV clock during correction.
- Added option to test without TSC scaling in sefltest.
- Updated commit messages to better explain the situation (thanks David).


Jack Allister (2):
KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
KVM: selftests: Add KVM/PV clock selftest to prove timer correction

Documentation/virt/kvm/api.rst | 37 ++++
arch/x86/kvm/x86.c | 124 +++++++++++
include/uapi/linux/kvm.h | 3 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/pvclock_test.c | 192 ++++++++++++++++++
5 files changed, 357 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c


base-commit: 8cb4a9a82b21623dbb4b3051dd30d98356cf95bc
--
2.40.1


2024-04-10 09:53:43

by Jack Allister

[permalink] [raw]
Subject: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

In the common case (where kvm->arch.use_master_clock is true), the KVM
clock is defined as a simple arithmetic function of the guest TSC, based on
a reference point stored in kvm->arch.master_kernel_ns and
kvm->arch.master_cycle_now.

The existing KVM_[GS]ET_CLOCK functionality does not allow for this
relationship to be precisely saved and restored by userspace. All it can
currently do is set the KVM clock at a given UTC reference time, which is
necessarily imprecise.

So on live update, the guest TSC can remain cycle accurate at precisely the
same offset from the host TSC, but there is no way for userspace to restore
the KVM clock accurately.

Even on live migration to a new host, where the accuracy of the guest time-
keeping is fundamentally limited by the accuracy of wallclock
synchronization between the source and destination hosts, the clock jump
experienced by the guest's TSC and its KVM clock should at least be
*consistent*. Even when the guest TSC suffers a discontinuity, its KVM
clock should still remain the *same* arithmetic function of the guest TSC,
and not suffer an *additional* discontinuity.

To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
save and restore the actual PV clock info in pvclock_vcpu_time_info.

The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
point in time just as kvm_update_masterclock() does, and calculating the
corresponding guest TSC value. This guest TSC value is then passed through
the user-provided pvclock structure to generate the *intended* KVM clock
value at that point in time, and through the *actual* KVM clock calculation.
Then kvm->arch.kvmclock_offset is adjusted to eliminate for the difference.

Where kvm->arch.use_master_clock is false (because the host TSC is
unreliable, or the guest TSCs are configured strangely), the KVM clock
is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
returns an error. In this case, as documented, userspace shall use the
legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
case since the clocks are imprecise in this mode anyway.

On *restoration*, if kvm->arch.use_master_clock is false, an error is
returned for similar reasons and userspace shall fall back to using
KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
*both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
migration data (unless the intent is to refuse to resume on a host with bad
TSC).

(It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
non-masterclock mode, as that mode is necessarily imprecise anyway. The
explicit fallback allows userspace to deliberately fail migration to a host
with misbehaving TSC where master clock mode wouldn't be active.)

Suggested-by: David Woodhouse <[email protected]>
Signed-off-by: Jack Allister <[email protected]>
CC: Paul Durrant <[email protected]>
CC: Dongli Zhang <[email protected]>
---
Documentation/virt/kvm/api.rst | 37 ++++++++++
arch/x86/kvm/x86.c | 124 +++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 3 +
3 files changed, 164 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..80fcd93bba1b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).

See KVM_SET_USER_MEMORY_REGION2 for additional details.

+4.143 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks,
+in precisely the form advertised to the guest vCPU, which gives parameters
+for a direct conversion from a guest TSC value to nanoseconds.
+
+When the KVM clock not is in "master clock" mode, for example because the
+host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
+is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
+In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
+
+4.144 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
+pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
+arithmetic relationship between guest TSC and KVM clock to be preserved by
+userspace across migration.
+
+When the KVM clock is not in "master clock" mode, and the KVM clock is actually
+defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
+may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
+choose to fail, denying migration to a host whose TSC is misbehaving.
+
5. The kvm_run structure
========================

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..d5cae3ead04d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5859,6 +5859,124 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}

+static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+ struct pvclock_vcpu_time_info *vcpu_pvti = &v->arch.hv_clock;
+ struct pvclock_vcpu_time_info local_pvti = { 0 };
+ struct kvm_arch *ka = &v->kvm->arch;
+ uint64_t host_tsc, guest_tsc;
+ bool use_master_clock;
+ uint64_t kernel_ns;
+ unsigned int seq;
+
+ /*
+ * CLOCK_MONOTONIC_RAW is not suitable for GET/SET API,
+ * see kvm_vcpu_ioctl_set_clock_guest equivalent comment.
+ */
+ if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
+ return -EINVAL;
+
+ /*
+ * Querying needs to be performed in a seqcount loop as it's possible
+ * another vCPU has triggered an update of the master clock. If so we
+ * should store the host TSC & time at this point.
+ */
+ do {
+ seq = read_seqcount_begin(&ka->pvclock_sc);
+ use_master_clock = ka->use_master_clock;
+ if (use_master_clock) {
+ host_tsc = ka->master_cycle_now;
+ kernel_ns = ka->master_kernel_ns;
+ }
+ } while (read_seqcount_retry(&ka->pvclock_sc, seq));
+
+ if (!use_master_clock)
+ return -EINVAL;
+
+ /*
+ * It's possible that this vCPU doesn't have a HVCLOCK configured
+ * but the other vCPUs may. If this is the case calculate based
+ * upon the time gathered in the seqcount but do not update the
+ * vCPU specific PVTI. If we have one, then use that.
+ */
+ if (!vcpu_pvti->tsc_timestamp && !vcpu_pvti->system_time) {
+ guest_tsc = kvm_read_l1_tsc(v, host_tsc);
+
+ local_pvti.tsc_timestamp = guest_tsc;
+ local_pvti.system_time = kernel_ns + ka->kvmclock_offset;
+ } else {
+ local_pvti = *vcpu_pvti;
+ }
+
+ if (copy_to_user(argp, &local_pvti, sizeof(local_pvti)))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+ struct pvclock_vcpu_time_info dummy_pvti;
+ struct pvclock_vcpu_time_info orig_pvti;
+ struct kvm *kvm = v->kvm;
+ struct kvm_arch *ka = &kvm->arch;
+ uint64_t clock_orig, clock_dummy;
+ uint64_t host_tsc, guest_tsc;
+ int64_t kernel_ns;
+ int64_t correction;
+ int rc = 0;
+
+ /*
+ * If a constant TSC is not provided by the host then KVM will
+ * be using CLOCK_MONOTONIC_RAW, correction using this is not
+ * precise and as such we can never sync to the precision we
+ * are requiring.
+ */
+ if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
+ return -EINVAL;
+
+ if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
+ return -EFAULT;
+
+ kvm_hv_request_tsc_page_update(kvm);
+ kvm_start_pvclock_update(kvm);
+ pvclock_update_vm_gtod_copy(kvm);
+
+ if (!ka->use_master_clock) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Sample the kernel time and host TSC at a singular point.
+ * We then calculate the guest TSC using this exact point in time.
+ * From here we can then determine the delta using the
+ * PV time info requested from the user and what we currently have
+ * using the fixed point in time. This delta is then used as a
+ * correction factor to subtract from the clock offset.
+ */
+ if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ guest_tsc = kvm_read_l1_tsc(v, host_tsc);
+
+ dummy_pvti = orig_pvti;
+ dummy_pvti.tsc_timestamp = guest_tsc;
+ dummy_pvti.system_time = kernel_ns + ka->kvmclock_offset;
+
+ clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
+ clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
+
+ correction = clock_orig - clock_dummy;
+ ka->kvmclock_offset += correction;
+
+out:
+ kvm_end_pvclock_update(kvm);
+ return rc;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -6239,6 +6357,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
srcu_read_unlock(&vcpu->kvm->srcu, idx);
break;
}
+ case KVM_SET_CLOCK_GUEST:
+ r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
+ break;
+ case KVM_GET_CLOCK_GUEST:
+ r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
+ break;
#ifdef CONFIG_KVM_HYPERV
case KVM_GET_SUPPORTED_HV_CPUID:
r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..0d306311e4d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd5, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
+
#endif /* __LINUX_KVM_H */
--
2.40.1


2024-04-10 09:54:01

by Jack Allister

[permalink] [raw]
Subject: [PATCH v2 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer correction

A VM's KVM/PV clock has an inherent relationship to it's TSC (guest). When
either the host system live-updates or the VM is live-migrated this pairing
of the two clock sources should stay the same.

In reality this is not the case without some correction taking place. Two
new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be utilized to
perform a correction on the PVTI (PV time information) structure held by
KVM to effectively fixup the kvmclock_offset prior to the guest VM resuming
in either a live-update/migration scenario.

This test proves that without the necessary fixup there is a perceived
change in the guest TSC & KVM/PV clock relationship before and after a LU/
LM takes place.

The following steps are made to verify there is a delta in the relationship
and that it can be corrected:

1. PVTI is sampled by guest at boot (let's call this PVTI0).
2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
3. PVTI is sampled by guest after change (PVTI1).
4. Correction is requested by usermode to KVM using PVTI0.
5. PVTI is sampled by guest after correction (PVTI2).

The guest the records a singular TSC reference point in time and uses it to
calculate 3 KVM clock values utilizing the 3 recorded PVTI prior. Let's
call each clock value CLK[0-2].

In a perfect world CLK[0-2] should all be the same value if the KVM clock
& TSC relationship is preserved across the LU/LM (or faked in this test),
however it is not.

A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
(and the inaccuracies associated with that). A delta of ~3500ns can be
observed if guest TSC scaling to half host TSC frequency is also enabled,
where as without scaling this is observed at ~180ns.

With the correction it should be possible to achieve a delta of ±1ns.

An option to enable guest TSC scaling is available via invoking the tester
with -s/--scale-tsc.

Example of the test output below:
* selftests: kvm: pvclock_test
* scaling tsc from 2999999KHz to 1499999KHz
* before=5038374946 uncorrected=5038371437 corrected=5038374945
* delta_uncorrected=3509 delta_corrected=1

Signed-off-by: Jack Allister <[email protected]>
CC: David Woodhouse <[email protected]>
CC: Paul Durrant <[email protected]>
CC: Dongli Zhang <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/pvclock_test.c | 192 ++++++++++++++++++
2 files changed, 193 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 741c7dc16afc..02ee1205bbed 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
+TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
new file mode 100644
index 000000000000..376ffb730a53
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2024, Amazon.com, Inc. or its affiliates.
+ *
+ * Tests for pvclock API
+ * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
+ */
+#include <asm/pvclock.h>
+#include <asm/pvclock-abi.h>
+#include <sys/stat.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+enum {
+ STAGE_FIRST_BOOT,
+ STAGE_UNCORRECTED,
+ STAGE_CORRECTED
+};
+
+#define KVMCLOCK_GPA 0xc0000000ull
+#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
+
+static void trigger_pvti_update(vm_paddr_t pvti_pa)
+{
+ /*
+ * We need a way to trigger KVM to update the fields
+ * in the PV time info. The easiest way to do this is
+ * to temporarily switch to the old KVM system time
+ * method and then switch back to the new one.
+ */
+ wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
+ wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+}
+
+static void guest_code(vm_paddr_t pvti_pa)
+{
+ struct pvclock_vcpu_time_info *pvti_va =
+ (struct pvclock_vcpu_time_info *)pvti_pa;
+
+ struct pvclock_vcpu_time_info pvti_boot;
+ struct pvclock_vcpu_time_info pvti_uncorrected;
+ struct pvclock_vcpu_time_info pvti_corrected;
+ uint64_t cycles_boot;
+ uint64_t cycles_uncorrected;
+ uint64_t cycles_corrected;
+ uint64_t tsc_guest;
+
+ /*
+ * Setup the KVMCLOCK in the guest & store the original
+ * PV time structure that is used.
+ */
+ wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+ pvti_boot = *pvti_va;
+ GUEST_SYNC(STAGE_FIRST_BOOT);
+
+ /*
+ * Trigger an update of the PVTI, if we calculate
+ * the KVM clock using this structure we'll see
+ * a delta from the TSC.
+ */
+ trigger_pvti_update(pvti_pa);
+ pvti_uncorrected = *pvti_va;
+ GUEST_SYNC(STAGE_UNCORRECTED);
+
+ /*
+ * The test should have triggered the correction by this
+ * point in time. We have a copy of each of the PVTI structs
+ * at each stage now.
+ *
+ * Let's sample the timestamp at a SINGLE point in time and
+ * then calculate what the KVM clock would be using the PVTI
+ * from each stage.
+ *
+ * Then return each of these values to the tester.
+ */
+ pvti_corrected = *pvti_va;
+ tsc_guest = rdtsc();
+
+ cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
+ cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
+ cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
+
+ GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
+ cycles_corrected, 0);
+}
+
+static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+ struct pvclock_vcpu_time_info pvti_before;
+ uint64_t before, uncorrected, corrected;
+ int64_t delta_uncorrected, delta_corrected;
+ struct ucall uc;
+ uint64_t ucall_reason;
+
+ /* Loop through each stage of the test. */
+ while (true) {
+
+ /* Start/restart the running vCPU code. */
+ vcpu_run(vcpu);
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+ /* Retrieve and verify our stage. */
+ ucall_reason = get_ucall(vcpu, &uc);
+ TEST_ASSERT(ucall_reason == UCALL_SYNC,
+ "Unhandled ucall reason=%lu",
+ ucall_reason);
+
+ /* Run host specific code relating to stage. */
+ switch (uc.args[1]) {
+ case STAGE_FIRST_BOOT:
+ /* Store the KVM clock values before an update. */
+ vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti_before);
+
+ /* Sleep for a set amount of time to increase delta. */
+ sleep(5);
+ break;
+
+ case STAGE_UNCORRECTED:
+ /* Restore the KVM clock values. */
+ vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti_before);
+ break;
+
+ case STAGE_CORRECTED:
+ /* Query the clock information and verify delta. */
+ before = uc.args[2];
+ uncorrected = uc.args[3];
+ corrected = uc.args[4];
+
+ delta_uncorrected = before - uncorrected;
+ delta_corrected = before - corrected;
+
+ pr_info("before=%lu uncorrected=%lu corrected=%lu\n",
+ before, uncorrected, corrected);
+
+ pr_info("delta_uncorrected=%ld delta_corrected=%ld\n",
+ delta_uncorrected, delta_corrected);
+
+ TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1),
+ "larger than expected delta detected = %ld", delta_corrected);
+ return;
+ }
+ }
+}
+
+static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+ unsigned int gpages;
+
+ gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ KVMCLOCK_GPA, 1, gpages, 0);
+ virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages);
+
+ vcpu_args_set(vcpu, 1, KVMCLOCK_GPA);
+}
+
+static void configure_scaled_tsc(struct kvm_vcpu *vcpu)
+{
+ uint64_t tsc_khz;
+
+ tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
+ pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2);
+ tsc_khz /= 2;
+ vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
+}
+
+int main(int argc, char *argv[])
+{
+ struct kvm_vcpu *vcpu;
+ struct kvm_vm *vm;
+ bool scale_tsc;
+
+ scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) ||
+ !strncmp(argv[1], "--scale-tsc", 10));
+
+ TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
+
+ vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+ configure_pvclock(vm, vcpu);
+
+ if (scale_tsc)
+ configure_scaled_tsc(vcpu);
+
+ run_test(vm, vcpu);
+
+ return 0;
+}
--
2.40.1


2024-04-10 10:08:47

by Allister, Jack

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup

> Would the above rely not only on TSC clocksource, but also
> ka->use_master_clock==true?

Yes this should only be in the case of master clock. Extra verification
to check this for both GET/SET should be present.

> What will happen if the vCPU=0 (e.g., BSP) thread is racing with here
> to update the vcpu->arch.hv_clock?

> In addition to live migration, can the user call this API any time
> during the VM is running (to fix the clock drift)? Therefore, any
> requirement to protect the update of kvmclock_offset from racing?

This should occur when none of the vCPUs are running, in the context of
a live-migration/update this would be after deserialization before the
resume. I may have been unintentionally misleading here describing some
of the problem space. There isn't a "drift" per say when a VM is
running and the PVTI stays constant. It's more the relationship between
the TSC & PV clock changes due to inaccuracy when KVM decides to
generate that information, e.g on a live-update/migration KVM will
perform the update. The KVM_REQ_MASTERCLOCK_UPDATE is just an example
as to how to simulate/trigger this.

I've posted a V2 which hopefully addresses some of above and makes it
clearer overall the aim behind the patches.

2024-04-10 10:15:47

by Allister, Jack

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

> AFAIR, I copied check_clocksource() from existing code during that >
time.

> The commit e440c5f2e ("KVM: selftests: Generalize check_clocksource()
> from kvm_clock_test") has introduced sys_clocksource_is_tsc(). Later
> it is renamed to sys_clocksource_is_based_on_tsc().
> Any chance to re-use sys_clocksource_is_based_on_tsc()?

Yes I'm more than happy to change it to that. I was using your original
mail as a reference and did not realise there was a utility present for
this.

> Is configure_scaled_tsc() anecessary? Or how about to make it an >
option/arg?
> Then I will be able to test it on a VM/server without TSC scaling.

So if TSC scaling from 3GHz (host) -> 1.5GHz (guest) I do see a skew of
~3500ns after the update. Where as without scaling a delta can be seen
but is roughly ~180ns.

In V2 I've adjusted the test so that now by default scaling won't take
place, however if someone wants to test with it enabled they can pass
"-s/--scale-tsc" to induce the greater delta.


Thanks you for the feedback,
Jack Allister



2024-04-10 10:29:31

by Paul Durrant

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

On 10/04/2024 10:52, Jack Allister wrote:
> In the common case (where kvm->arch.use_master_clock is true), the KVM
> clock is defined as a simple arithmetic function of the guest TSC, based on
> a reference point stored in kvm->arch.master_kernel_ns and
> kvm->arch.master_cycle_now.
>
> The existing KVM_[GS]ET_CLOCK functionality does not allow for this
> relationship to be precisely saved and restored by userspace. All it can
> currently do is set the KVM clock at a given UTC reference time, which is
> necessarily imprecise.
>
> So on live update, the guest TSC can remain cycle accurate at precisely the
> same offset from the host TSC, but there is no way for userspace to restore
> the KVM clock accurately.
>
> Even on live migration to a new host, where the accuracy of the guest time-
> keeping is fundamentally limited by the accuracy of wallclock
> synchronization between the source and destination hosts, the clock jump
> experienced by the guest's TSC and its KVM clock should at least be
> *consistent*. Even when the guest TSC suffers a discontinuity, its KVM
> clock should still remain the *same* arithmetic function of the guest TSC,
> and not suffer an *additional* discontinuity.
>
> To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
> save and restore the actual PV clock info in pvclock_vcpu_time_info.
>
> The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
> point in time just as kvm_update_masterclock() does, and calculating the
> corresponding guest TSC value. This guest TSC value is then passed through
> the user-provided pvclock structure to generate the *intended* KVM clock
> value at that point in time, and through the *actual* KVM clock calculation.
> Then kvm->arch.kvmclock_offset is adjusted to eliminate for the difference.
>
> Where kvm->arch.use_master_clock is false (because the host TSC is
> unreliable, or the guest TSCs are configured strangely), the KVM clock
> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
> returns an error. In this case, as documented, userspace shall use the
> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
> case since the clocks are imprecise in this mode anyway.
>
> On *restoration*, if kvm->arch.use_master_clock is false, an error is
> returned for similar reasons and userspace shall fall back to using
> KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
> *both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
> migration data (unless the intent is to refuse to resume on a host with bad
> TSC).
>
> (It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
> non-masterclock mode, as that mode is necessarily imprecise anyway. The
> explicit fallback allows userspace to deliberately fail migration to a host
> with misbehaving TSC where master clock mode wouldn't be active.)
>
> Suggested-by: David Woodhouse <[email protected]>
> Signed-off-by: Jack Allister <[email protected]>
> CC: Paul Durrant <[email protected]>
> CC: Dongli Zhang <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 37 ++++++++++
> arch/x86/kvm/x86.c | 124 +++++++++++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 3 +
> 3 files changed, 164 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..80fcd93bba1b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).
>
> See KVM_SET_USER_MEMORY_REGION2 for additional details.
>
> +4.143 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks,
> +in precisely the form advertised to the guest vCPU, which gives parameters
> +for a direct conversion from a guest TSC value to nanoseconds.
> +
> +When the KVM clock not is in "master clock" mode, for example because the
> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
> +
> +4.144 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
> +pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
> +arithmetic relationship between guest TSC and KVM clock to be preserved by
> +userspace across migration.
> +
> +When the KVM clock is not in "master clock" mode, and the KVM clock is actually
> +defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL.

EINVAL doesn't seem appropriate. ENOTSUP perhaps? Same for getting the
clock info I suppose.

> Userspace
> +may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
> +choose to fail, denying migration to a host whose TSC is misbehaving.
> +
> 5. The kvm_run structure
> ========================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 47d9f03b7778..d5cae3ead04d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5859,6 +5859,124 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> + struct pvclock_vcpu_time_info *vcpu_pvti = &v->arch.hv_clock;
> + struct pvclock_vcpu_time_info local_pvti = { 0 };
> + struct kvm_arch *ka = &v->kvm->arch;
> + uint64_t host_tsc, guest_tsc;
> + bool use_master_clock;
> + uint64_t kernel_ns;
> + unsigned int seq;
> +
> + /*
> + * CLOCK_MONOTONIC_RAW is not suitable for GET/SET API,
> + * see kvm_vcpu_ioctl_set_clock_guest equivalent comment.
> + */
> + if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
> + return -EINVAL;
> +
> + /*
> + * Querying needs to be performed in a seqcount loop as it's possible
> + * another vCPU has triggered an update of the master clock. If so we
> + * should store the host TSC & time at this point.
> + */
> + do {
> + seq = read_seqcount_begin(&ka->pvclock_sc);
> + use_master_clock = ka->use_master_clock;
> + if (use_master_clock) {
> + host_tsc = ka->master_cycle_now;
> + kernel_ns = ka->master_kernel_ns;
> + }
> + } while (read_seqcount_retry(&ka->pvclock_sc, seq));

You could bail from the loop if `use_master_clock` is false, couldn't you?

> +
> + if (!use_master_clock)
> + return -EINVAL;
> +
> + /*
> + * It's possible that this vCPU doesn't have a HVCLOCK configured
> + * but the other vCPUs may. If this is the case calculate based
> + * upon the time gathered in the seqcount but do not update the
> + * vCPU specific PVTI. If we have one, then use that.

Given this is a per-vCPU ioctl, why not fail in the case the vCPU
doesn't have HVCLOCK configured? Or is your intention that a GET/SET
should always work if TSC is stable?

> + */
> + if (!vcpu_pvti->tsc_timestamp && !vcpu_pvti->system_time) {
> + guest_tsc = kvm_read_l1_tsc(v, host_tsc);
> +
> + local_pvti.tsc_timestamp = guest_tsc;
> + local_pvti.system_time = kernel_ns + ka->kvmclock_offset;
> + } else {
> + local_pvti = *vcpu_pvti;
> + }
> +
> + if (copy_to_user(argp, &local_pvti, sizeof(local_pvti)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +


2024-04-10 10:44:27

by Paul Durrant

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer correction

On 10/04/2024 10:52, Jack Allister wrote:
> A VM's KVM/PV clock has an inherent relationship to it's TSC (guest). When
> either the host system live-updates or the VM is live-migrated this pairing
> of the two clock sources should stay the same.
>
> In reality this is not the case without some correction taking place. Two
> new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be utilized to
> perform a correction on the PVTI (PV time information) structure held by
> KVM to effectively fixup the kvmclock_offset prior to the guest VM resuming
> in either a live-update/migration scenario.
>
> This test proves that without the necessary fixup there is a perceived
> change in the guest TSC & KVM/PV clock relationship before and after a LU/
> LM takes place.
>
> The following steps are made to verify there is a delta in the relationship
> and that it can be corrected:
>
> 1. PVTI is sampled by guest at boot (let's call this PVTI0).
> 2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
> 3. PVTI is sampled by guest after change (PVTI1).
> 4. Correction is requested by usermode to KVM using PVTI0.
> 5. PVTI is sampled by guest after correction (PVTI2).
>
> The guest the records a singular TSC reference point in time and uses it to
> calculate 3 KVM clock values utilizing the 3 recorded PVTI prior. Let's
> call each clock value CLK[0-2].
>
> In a perfect world CLK[0-2] should all be the same value if the KVM clock
> & TSC relationship is preserved across the LU/LM (or faked in this test),
> however it is not.
>
> A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
> (and the inaccuracies associated with that). A delta of ~3500ns can be
> observed if guest TSC scaling to half host TSC frequency is also enabled,
> where as without scaling this is observed at ~180ns.
>
> With the correction it should be possible to achieve a delta of ±1ns.
>
> An option to enable guest TSC scaling is available via invoking the tester
> with -s/--scale-tsc.
>
> Example of the test output below:
> * selftests: kvm: pvclock_test
> * scaling tsc from 2999999KHz to 1499999KHz
> * before=5038374946 uncorrected=5038371437 corrected=5038374945
> * delta_uncorrected=3509 delta_corrected=1
>
> Signed-off-by: Jack Allister <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: Paul Durrant <[email protected]>
> CC: Dongli Zhang <[email protected]>
> ---
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/x86_64/pvclock_test.c | 192 ++++++++++++++++++
> 2 files changed, 193 insertions(+)
> create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c
>
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index 741c7dc16afc..02ee1205bbed 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
> TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
> TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
> TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
> +TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
> TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
> TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
> TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
> diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
> new file mode 100644
> index 000000000000..376ffb730a53
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
> @@ -0,0 +1,192 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright © 2024, Amazon.com, Inc. or its affiliates.
> + *
> + * Tests for pvclock API
> + * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
> + */
> +#include <asm/pvclock.h>
> +#include <asm/pvclock-abi.h>
> +#include <sys/stat.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +
> +enum {
> + STAGE_FIRST_BOOT,
> + STAGE_UNCORRECTED,
> + STAGE_CORRECTED
> +};
> +
> +#define KVMCLOCK_GPA 0xc0000000ull
> +#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
> +
> +static void trigger_pvti_update(vm_paddr_t pvti_pa)
> +{
> + /*
> + * We need a way to trigger KVM to update the fields
> + * in the PV time info. The easiest way to do this is
> + * to temporarily switch to the old KVM system time
> + * method and then switch back to the new one.
> + */
> + wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> +}
> +
> +static void guest_code(vm_paddr_t pvti_pa)
> +{
> + struct pvclock_vcpu_time_info *pvti_va =
> + (struct pvclock_vcpu_time_info *)pvti_pa;
> +
> + struct pvclock_vcpu_time_info pvti_boot;
> + struct pvclock_vcpu_time_info pvti_uncorrected;
> + struct pvclock_vcpu_time_info pvti_corrected;
> + uint64_t cycles_boot;
> + uint64_t cycles_uncorrected;
> + uint64_t cycles_corrected;
> + uint64_t tsc_guest;
> +
> + /*
> + * Setup the KVMCLOCK in the guest & store the original
> + * PV time structure that is used.
> + */
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> + pvti_boot = *pvti_va;
> + GUEST_SYNC(STAGE_FIRST_BOOT);
> +
> + /*
> + * Trigger an update of the PVTI, if we calculate
> + * the KVM clock using this structure we'll see
> + * a delta from the TSC.
> + */
> + trigger_pvti_update(pvti_pa);
> + pvti_uncorrected = *pvti_va;
> + GUEST_SYNC(STAGE_UNCORRECTED);
> +
> + /*
> + * The test should have triggered the correction by this
> + * point in time. We have a copy of each of the PVTI structs
> + * at each stage now.
> + *
> + * Let's sample the timestamp at a SINGLE point in time and
> + * then calculate what the KVM clock would be using the PVTI
> + * from each stage.
> + *
> + * Then return each of these values to the tester.
> + */
> + pvti_corrected = *pvti_va;
> + tsc_guest = rdtsc();
> +
> + cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
> + cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
> + cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
> +
> + GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
> + cycles_corrected, 0);
> +}
> +
> +static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> +{
> + struct pvclock_vcpu_time_info pvti_before;
> + uint64_t before, uncorrected, corrected;
> + int64_t delta_uncorrected, delta_corrected;
> + struct ucall uc;
> + uint64_t ucall_reason;
> +
> + /* Loop through each stage of the test. */
> + while (true) {
> +

Unnecessary blank line. Otherwise LGTM so with that fixed...

Reviewed-by: Paul Durrant <[email protected]>



2024-04-10 12:43:59

by Paul Durrant

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

On 10/04/2024 13:09, David Woodhouse wrote:
> On 10 April 2024 11:29:13 BST, Paul Durrant <[email protected]> wrote:
>> On 10/04/2024 10:52, Jack Allister wrote:
>>> + * It's possible that this vCPU doesn't have a HVCLOCK configured
>>> + * but the other vCPUs may. If this is the case calculate based
>>> + * upon the time gathered in the seqcount but do not update the
>>> + * vCPU specific PVTI. If we have one, then use that.
>>
>> Given this is a per-vCPU ioctl, why not fail in the case the vCPU doesn't have HVCLOCK configured? Or is your intention that a GET/SET should always work if TSC is stable?
>
> It definitely needs to work for SET even when the vCPU hasn't been run yet (and doesn't have a hvclock in vcpu->arch.hv_clock).

So would it make sense to set up hvclock earlier?


2024-04-10 13:27:16

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

On 10 April 2024 11:29:13 BST, Paul Durrant <[email protected]> wrote:
>On 10/04/2024 10:52, Jack Allister wrote:
>> + * It's possible that this vCPU doesn't have a HVCLOCK configured
>> + * but the other vCPUs may. If this is the case calculate based
>> + * upon the time gathered in the seqcount but do not update the
>> + * vCPU specific PVTI. If we have one, then use that.
>
>Given this is a per-vCPU ioctl, why not fail in the case the vCPU doesn't have HVCLOCK configured? Or is your intention that a GET/SET should always work if TSC is stable?

It definitely needs to work for SET even when the vCPU hasn't been run yet (and doesn't have a hvclock in vcpu->arch.hv_clock).

I think it should ideally work for GET too. I did try arguing that if the vCPU hasn't set up its pvclock then why would it care if it's inaccurate? But there's a pathological case of AMP where one vCPU is dedicated to an RTOS or something, and only the *other* vCPUs bring up their pvclock.

This of course brings you to the question of why we have it as a per-vCPU ioctl at all? It only needs to be done *once* to get/set the KVM-wide clock
And a function of *this* vCPU's TSC. And the point is that if we're in use_master_clock mode, that's consistent across *all* vCPUs. There would be a bunch of additional complexity in making it a VM ioctl though, especially around the question of what to do if userspace tries to restore it when there *aren't* any vCPUs yet. So we didn't do that.



2024-04-11 19:47:21

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On Wed, 2024-04-10 at 10:15 +0000, Allister, Jack wrote:
> > AFAIR, I copied check_clocksource() from existing code during that >
> time.
>
> > The commit e440c5f2e ("KVM: selftests: Generalize check_clocksource()
> > from kvm_clock_test") has introduced sys_clocksource_is_tsc(). Later
> > it is renamed to sys_clocksource_is_based_on_tsc().
> > Any chance to re-use sys_clocksource_is_based_on_tsc()?
>
> Yes I'm more than happy to change it to that. I was using your original
> mail as a reference and did not realise there was a utility present for
> this.
>
> > Is configure_scaled_tsc() anecessary? Or how about to make it an  >
> option/arg?
> > Then I will be able to test it on a VM/server without TSC scaling.
>
> So if TSC scaling from 3GHz (host) -> 1.5GHz (guest) I do see a skew of
> ~3500ns after the update. Where as without scaling a delta can be seen
> but is roughly ~180ns.

I don't think it's as simple as "TSC scaling makes the drift larger".
I suspect that's just the way the arithmetic precision works out for
those frequencies. With other frequencies of host and guest you might
find that it works out closer *with* the scaling.

Consider a graph of "time" in the Y axis, against the host TSC as the X
axis. As an example, let's assume the host has a TSC frequency of 3GHz.

Each of the three definitions of the KVM clock (A based on
CLOCK_MONOTONIC_RAW, B based on the guest TSC, C based directly on the
host TSC) will have a gradient of *roughly* 1 ns per three ticks.

Due to arithmetic precision, the gradient of each is going to vary
slightly. We hope that CLOCK_MONOTONIC_RAW is going to do the best, as
the other two are limited by the precision of the pvclock ABI that's
exposed to the guest. You can use http://david.woodhou.se/tsdrift.c to
see where the latter two land, for different TSC frequencies.

$ ./tsdrift 2500000000 3000000000 | tail -1
TSC 259200000000000, guest TSC 215999999979883, guest ns 86399999971836 host ns 86399999979883 (delta -8047)
$ ./tsdrift 2700000000 3000000000 | tail -1
TSC 259200000000000, guest TSC 233279999975860, guest ns 86399999983012 host ns 86399999979883 (delta 3129)

So after a day, let's assume CLOCK_MONOTONIC_RAW will have advanced by
86400 seconds. The KVM clock based on the host TSC will be 20µs slow,
while a KVM clock based on a guest TSC frequency of 2.5GHz would be an
*additional* 8µs slower. But a guest TSC frequency of 2.7GHz would
actually run *faster* than the host-based one, and would only be 17µs
behind reality.

Your test is measuring how *much* the host CLOCK_MONOTONIC_RAW (my
definition A) drifts from definition B which is derived from the guest
TSC.

It demonstrates the discontinuity that KVM_REQ_MASTERCLOCK_UPDATE
introduces, by clamping the KVM clock back to the 'definition A' line.

Fixing that is in the TODO list I shared. Basically it involves
realising that in use_master_clock mode, the delta between the KVM
clock and CLOCK_MONOTONIC_RAW (ka->kvmclock_offset) is *varying* over
time. So instead of just blindly using kvmclock_offset, we should
*recalculate* it in precisely the way that your KVM_SET_CLOCK_GUEST
does.

Having said all that... scaling from 3GHz to 1.5GHz *doesn't* lose any
precision; it shouldn't make any difference. But I guess your host TSC
isn't *really* 3GHz, it's measured against the PIT or something awful,
and comes out at a shade above or below 3GHz, leading to a more
interesting scaling factor?

> In V2 I've adjusted the test so that now by default scaling won't take
> place, however if someone wants to test with it enabled they can pass
> "-s/--scale-tsc" to induce the greater delta.

Please do it automatically based on the availability of the feature.


Attachments:
smime.p7s (5.83 kB)

2024-04-12 08:20:22

by Dongli Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer correction



On 4/10/24 02:52, Jack Allister wrote:
> A VM's KVM/PV clock has an inherent relationship to it's TSC (guest). When
> either the host system live-updates or the VM is live-migrated this pairing
> of the two clock sources should stay the same.
>

[snip]

> +
> +int main(int argc, char *argv[])
> +{
> + struct kvm_vcpu *vcpu;
> + struct kvm_vm *vm;
> + bool scale_tsc;
> +
> + scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) ||
> + !strncmp(argv[1], "--scale-tsc", 10));
> +
> + TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
> +
> + vm = vm_create_with_one_vcpu(&vcpu, guest_code);
> +
> + configure_pvclock(vm, vcpu);
> +
> + if (scale_tsc)
> + configure_scaled_tsc(vcpu);
> +
> + run_test(vm, vcpu);
> +
> + return 0;
> +}

David suggested make the scale_tsc automatically based on the availability of
the feature.

Therefore, if there is going to be v3, I suggest add kvm_vm_free() explicitly
(similar to close(fd)).

Thank you very much!

Dongli Zhang

2024-04-15 12:26:41

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

On Wed, 2024-04-10 at 09:52 +0000, Jack Allister wrote:
>
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +       struct pvclock_vcpu_time_info *vcpu_pvti = &v->arch.hv_clock;
> +       struct pvclock_vcpu_time_info local_pvti = { 0 };
> +       struct kvm_arch *ka = &v->kvm->arch;
> +       uint64_t host_tsc, guest_tsc;
> +       bool use_master_clock;
> +       uint64_t kernel_ns;
> +       unsigned int seq;
> +
> +       /*
> +        * CLOCK_MONOTONIC_RAW is not suitable for GET/SET API,
> +        * see kvm_vcpu_ioctl_set_clock_guest equivalent comment.
> +        */
> +       if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +               return -EINVAL;
> +
> +       /*
> +        * Querying needs to be performed in a seqcount loop as it's possible
> +        * another vCPU has triggered an update of the master clock. If so we
> +        * should store the host TSC & time at this point.
> +        */
> +       do {
> +               seq = read_seqcount_begin(&ka->pvclock_sc);
> +               use_master_clock = ka->use_master_clock;
> +               if (use_master_clock) {
> +                       host_tsc = ka->master_cycle_now;
> +                       kernel_ns = ka->master_kernel_ns;
> +               }
> +       } while (read_seqcount_retry(&ka->pvclock_sc, seq));
> +
> +       if (!use_master_clock)
> +               return -EINVAL;
> +
> +       /*
> +        * It's possible that this vCPU doesn't have a HVCLOCK configured
> +        * but the other vCPUs may. If this is the case calculate based
> +        * upon the time gathered in the seqcount but do not update the
> +        * vCPU specific PVTI. If we have one, then use that.
> +        */
> +       if (!vcpu_pvti->tsc_timestamp && !vcpu_pvti->system_time) {

|| !kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)

Otherwise you may be using out of date information.

> +               guest_tsc = kvm_read_l1_tsc(v, host_tsc);
> +
> +               local_pvti.tsc_timestamp = guest_tsc;
> +               local_pvti.system_time = kernel_ns + ka->kvmclock_offset;

This is missing the scale information in tsc_to_system_mul and
tsc_shift. Is there a reason we can't just call kvm_guest_time_update()
from here? (I think we looked at using it for the *SET* function, but
did we look at doing so for GET?)


> +       } else {
> +               local_pvti = *vcpu_pvti;
> +       }
> +
> +       if (copy_to_user(argp, &local_pvti, sizeof(local_pvti)))
> +               return -EFAULT;
> +
> +       return 0;
> +}
> +
> +static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +       struct pvclock_vcpu_time_info dummy_pvti;
> +       struct pvclock_vcpu_time_info orig_pvti;
> +       struct kvm *kvm = v->kvm;
> +       struct kvm_arch *ka = &kvm->arch;
> +       uint64_t clock_orig, clock_dummy;
> +       uint64_t host_tsc, guest_tsc;
> +       int64_t kernel_ns;
> +       int64_t correction;
> +       int rc = 0;
> +
> +       /*
> +        * If a constant TSC is not provided by the host then KVM will
> +        * be using CLOCK_MONOTONIC_RAW, correction using this is not
> +        * precise and as such we can never sync to the precision we
> +        * are requiring.
> +        */
> +       if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +               return -EINVAL;
> +
> +       if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
> +               return -EFAULT;
> +
> +       kvm_hv_request_tsc_page_update(kvm);
> +       kvm_start_pvclock_update(kvm);
> +       pvclock_update_vm_gtod_copy(kvm);
> +
> +       if (!ka->use_master_clock) {
> +               rc = -EINVAL;
> +               goto out;
> +       }
> +
> +       /*
> +        * Sample the kernel time and host TSC at a singular point.
> +        * We then calculate the guest TSC using this exact point in time.
> +        * From here we can then determine the delta using the
> +        * PV time info requested from the user and what we currently have
> +        * using the fixed point in time. This delta is then used as a
> +        * correction factor to subtract from the clock offset.
> +        */
> +       if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc)) {
> +               rc = -EFAULT;
> +               goto out;
> +       }
> +
> +       guest_tsc = kvm_read_l1_tsc(v, host_tsc);
> +
> +       dummy_pvti = orig_pvti;
> +       dummy_pvti.tsc_timestamp = guest_tsc;
> +       dummy_pvti.system_time = kernel_ns + ka->kvmclock_offset;
> +
> +       clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
> +       clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
>

In both cases here you're using the scale information directly from
userspace... that you forgot to fill in for them earlier. I think we
should we have a sanity check on it, to ensure that it matches the TSC
frequency of the vCPU?

> +       correction = clock_orig - clock_dummy;
> +       ka->kvmclock_offset += correction;
> +
> +out:
> +       kvm_end_pvclock_update(kvm);
> +       return rc;
> +}
> +
>  long kvm_arch_vcpu_ioctl(struct file *filp,
>                          unsigned int ioctl, unsigned long arg)
>  {
> @@ -6239,6 +6357,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>                 srcu_read_unlock(&vcpu->kvm->srcu, idx);
>                 break;
>         }
> +       case KVM_SET_CLOCK_GUEST:
> +               r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
> +               break;
> +       case KVM_GET_CLOCK_GUEST:
> +               r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
> +               break;
>  #ifdef CONFIG_KVM_HYPERV
>         case KVM_GET_SUPPORTED_HV_CPUID:
>                 r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..0d306311e4d6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
>         __u64 reserved[6];
>  };
>  
> +#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
> +
>  #endif /* __LINUX_KVM_H */


Attachments:
smime.p7s (5.83 kB)

2024-04-18 02:57:07

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

On Wed, 2024-04-10 at 13:43 +0100, Paul Durrant wrote:
> On 10/04/2024 13:09, David Woodhouse wrote:
> > On 10 April 2024 11:29:13 BST, Paul Durrant <[email protected]>
> > wrote:
> > > On 10/04/2024 10:52, Jack Allister wrote:
> > > > +        * It's possible that this vCPU doesn't have a HVCLOCK
> > > > configured
> > > > +        * but the other vCPUs may. If this is the case
> > > > calculate based
> > > > +        * upon the time gathered in the seqcount but do not
> > > > update the
> > > > +        * vCPU specific PVTI. If we have one, then use that.
> > >
> > > Given this is a per-vCPU ioctl, why not fail in the case the vCPU
> > > doesn't have HVCLOCK configured? Or is your intention that a
> > > GET/SET should always work if TSC is stable?
> >
> > It definitely needs to work for SET even when the vCPU hasn't been
> > run yet (and doesn't have a hvclock in vcpu->arch.hv_clock).
>
> So would it make sense to set up hvclock earlier?

Yeah, and I think we can do so just by calling kvm_guest_time_update().

The GET function can look like this:

static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
{
struct pvclock_vcpu_time_info *hv_clock = &v->arch.hv_clock;

/*
* If KVM_REQ_CLOCK_UPDATE is already pending, or if the hv_clock has
* never been generated at all, call kvm_guest_time_update() to do so.
* Might as well use the PVCLOCK_TSC_STABLE_BIT as the check for ever
* having been written.
*/
if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) ||
!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT)) {
if (kvm_guest_time_update(v))
return -EINVAL;
}

/*
* PVCLOCK_TSC_STABLE_BIT is set in use_master_clock mode where the
* KVM clock is defined in terms of the guest TSC. Otherwise, it is
* is defined by the host CLOCK_MONOTONIC_RAW, and userspace should
* use the legacy KVM_[GS]ET_CLOCK to migrate it.
*/
if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
return -EINVAL;

if (copy_to_user(argp, hv_clock, sizeof(*hv_clock)))
return -EFAULT;

return 0;
}

And the SET function doesn't even *need* the existing vCPU's hv_clock,
because we know damn well that the number of TSC cycles elapsed between
the reference time point and... erm... the reference time point... is
zero.

And everything *else* the hv_clock was being used for, either in Jack's
version or my own (where I used it for checking PVCLOCK_TSC_STABLE_BIT
and even used my new hvclock_to_hz() on it), can be done differently
too.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks
(Try not to look at the 'Improve accuracy of KVM clock' one. It'll just
make you sad. Let Jack and me get to the end of the TODO list and you
can have all the sadness in one go like pulling a band-aid off.)

static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
{
struct pvclock_vcpu_time_info user_hv_clock;
struct kvm *kvm = v->kvm;
struct kvm_arch *ka = &kvm->arch;
uint64_t curr_tsc_hz, user_tsc_hz;
uint64_t user_clk_ns;
uint64_t guest_tsc;
int rc = 0;

if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
return -EFAULT;

if (!user_hv_clock.tsc_to_system_mul)
return -EINVAL;

user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
user_hv_clock.tsc_shift);


kvm_hv_request_tsc_page_update(kvm);
kvm_start_pvclock_update(kvm);
pvclock_update_vm_gtod_copy(kvm);

/*
* If not in use_master_clock mode, do not allow userspace to set
* the clock in terms of the guest TSC. Userspace should either
* fail the migration (to a host with suboptimal TSCs), or should
* knowingly restore the KVM clock using KVM_SET_CLOCK instead.
*/
if (!ka->use_master_clock) {
rc = -EINVAL;
goto out;
}

curr_tsc_hz = get_cpu_tsc_khz() * 1000LL;
if (unlikely(curr_tsc_hz == 0)) {
rc = -EINVAL;
goto out;
}

if (kvm_caps.has_tsc_control)
curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
v->arch.l1_tsc_scaling_ratio);

/*
* The scaling factors in the hv_clock do not depend solely on the
* TSC frequency *requested* by userspace. They actually use the
* host TSC frequency that was measured/detected by the host kernel,
* scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio.
*
* So a sanity check that they *precisely* match would have false
* negatives. Allow for a discrepancy of 1 kHz either way.
*/
if (user_tsc_hz < curr_tsc_hz - 1000 ||
user_tsc_hz > curr_tsc_hz + 1000) {
rc = -ERANGE;
goto out;
}

/*
* The call to pvclock_update_vm_gtod_copy() has created a new time
* reference point in ka->master_cycle_now and ka->master_kernel_ns.
*
* Calculate the guest TSC at that moment, and the corresponding KVM
* clock value according to user_hv_clock. The value according to the
* current hv_clock will of course be ka->master_kernel_ns since no
* TSC cycles have elapsed.
*
* Adjust ka->kvmclock_offset to the delta, so that both definitions
* of the clock give precisely the same reading at the reference time.
*/
guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;

out:
kvm_end_pvclock_update(kvm);
return rc;
}


Attachments:
smime.p7s (5.83 kB)

2024-04-19 17:13:45

by Chen, Zide

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction



On 4/8/2024 3:07 PM, Jack Allister wrote:
> This test proves that there is an inherent KVM/PV clock drift away from the
> guest TSC when KVM decides to update the PV time information structure due
> to a KVM_REQ_MASTERCLOCK_UPDATE. This drift is exascerbated when a guest is
> using TSC scaling and running at a different frequency to the host TSC [1].
> It also proves that KVM_[GS]ET_CLOCK_GUEST API is working to mitigate the
> drift from TSC to within ±1ns.
>
> The test simply records the PVTI (PV time information) at time of guest
> creation, after KVM has updated it's mapped PVTI structure and once the
> correction has taken place.
>
> A singular point in time is then recorded via the guest TSC and is used to
> calculate the a PV clock value using each of the 3 PVTI structures.
>
> As seen below a drift of ~3500ns is observed if no correction has taken
> place after KVM has updated the PVTI via master clock update. However,
> after the correction a delta of at most 1ns can be seen.
>
> * selftests: kvm: pvclock_test
> * scaling tsc from 2999999KHz to 1499999KHz
> * before=5038374946 uncorrected=5038371437 corrected=5038374945
> * delta_uncorrected=3509 delta_corrected=1
>
> Clocksource check code has been borrowed from [2].
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae
> [2]: https://lore.kernel.org/kvm/[email protected]/
>
> Signed-off-by: Jack Allister <[email protected]>
> CC: David Woodhouse <[email protected]>
> CC: Paul Durrant <[email protected]>
> ---
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/x86_64/pvclock_test.c | 223 ++++++++++++++++++
> 2 files changed, 224 insertions(+)
> create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c
>
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index 741c7dc16afc..02ee1205bbed 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
> TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
> TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
> TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
> +TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
> TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
> TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
> TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
> diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
> new file mode 100644
> index 000000000000..172ef4d19c60
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
> @@ -0,0 +1,223 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright © 2024, Amazon.com, Inc. or its affiliates.
> + *
> + * Tests for pvclock API
> + * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
> + */
> +#include <asm/pvclock.h>
> +#include <asm/pvclock-abi.h>
> +#include <sys/stat.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +
> +enum {
> + STAGE_FIRST_BOOT,
> + STAGE_UNCORRECTED,
> + STAGE_CORRECTED,
> + NUM_STAGES
> +};
> +
> +#define KVMCLOCK_GPA 0xc0000000ull
> +#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
> +
> +static void trigger_pvti_update(vm_paddr_t pvti_pa)
> +{
> + /*
> + * We need a way to trigger KVM to update the fields
> + * in the PV time info. The easiest way to do this is
> + * to temporarily switch to the old KVM system time
> + * method and then switch back to the new one.
> + */
> + wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> +}
> +
> +static void guest_code(vm_paddr_t pvti_pa)
> +{
> + struct pvclock_vcpu_time_info *pvti_va =
> + (struct pvclock_vcpu_time_info *)pvti_pa;
> +
> + struct pvclock_vcpu_time_info pvti_boot;
> + struct pvclock_vcpu_time_info pvti_uncorrected;
> + struct pvclock_vcpu_time_info pvti_corrected;
> + uint64_t cycles_boot;
> + uint64_t cycles_uncorrected;
> + uint64_t cycles_corrected;
> + uint64_t tsc_guest;
> +
> + /*
> + * Setup the KVMCLOCK in the guest & store the original
> + * PV time structure that is used.
> + */
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> + pvti_boot = *pvti_va;
> + GUEST_SYNC(STAGE_FIRST_BOOT);
> +
> + /*
> + * Trigger an update of the PVTI, if we calculate
> + * the KVM clock using this structure we'll see
> + * a drift from the TSC.
> + */
> + trigger_pvti_update(pvti_pa);
> + pvti_uncorrected = *pvti_va;
> + GUEST_SYNC(STAGE_UNCORRECTED);
> +
> + /*
> + * The test should have triggered the correction by this
> + * point in time. We have a copy of each of the PVTI structs
> + * at each stage now.
> + *
> + * Let's sample the timestamp at a SINGLE point in time and
> + * then calculate what the KVM clock would be using the PVTI
> + * from each stage.
> + *
> + * Then return each of these values to the tester.
> + */
> + pvti_corrected = *pvti_va;
> + tsc_guest = rdtsc();
> +
> + cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
> + cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
> + cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
> +
> + GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
> + cycles_corrected, 0);
> +}
> +
> +static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> +{
> + struct ucall uc;
> + uint64_t ucall_reason;
> + struct pvclock_vcpu_time_info pvti_before;
> + uint64_t before, uncorrected, corrected;
> + int64_t delta_uncorrected, delta_corrected;
> +
> + /* Loop through each stage of the test. */
> + while (true) {
> +
> + /* Start/restart the running vCPU code. */
> + vcpu_run(vcpu);
> + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
> +
> + /* Retrieve and verify our stage. */
> + ucall_reason = get_ucall(vcpu, &uc);
> + TEST_ASSERT(ucall_reason == UCALL_SYNC,
> + "Unhandled ucall reason=%lu",
> + ucall_reason);
> +
> + /* Run host specific code relating to stage. */
> + switch (uc.args[1]) {
> + case STAGE_FIRST_BOOT:
> + /* Store the KVM clock values before an update. */
> + vm_ioctl(vm, KVM_GET_CLOCK_GUEST, &pvti_before);
> +
> + /* Sleep for a set amount of time to induce drift. */
> + sleep(5);
> + break;
> +
> + case STAGE_UNCORRECTED:
> + /* Restore the KVM clock values. */
> + vm_ioctl(vm, KVM_SET_CLOCK_GUEST, &pvti_before);
> + break;
> +
> + case STAGE_CORRECTED:
> + /* Query the clock information and verify delta. */
> + before = uc.args[2];
> + uncorrected = uc.args[3];
> + corrected = uc.args[4];
> +
> + delta_uncorrected = before - uncorrected;
> + delta_corrected = before - corrected;
> +
> + pr_info("before=%lu uncorrected=%lu corrected=%lu\n",
> + before, uncorrected, corrected);
> +
> + pr_info("delta_uncorrected=%ld delta_corrected=%ld\n",
> + delta_uncorrected, delta_corrected);
> +
> + TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1),
> + "larger than expected delta detected = %ld", delta_corrected);

I'm wondering what's the underling theory that we definitely can achieve
±1ns accuracy? I tested it on a Sapphire Rapids @2100MHz TSC frequency,
and I can see delta_corrected=2 in ~2% cases.

2024-04-19 20:01:39

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On 19 April 2024 19:40:06 BST, David Woodhouse <[email protected]> wrote:
>On 19 April 2024 18:13:16 BST, "Chen, Zide" <[email protected]> wrote:
>>I'm wondering what's the underling theory that we definitely can achieve
>>±1ns accuracy? I tested it on a Sapphire Rapids @2100MHz TSC frequency,
>>and I can see delta_corrected=2 in ~2% cases.
>
>Hm. Thanks for testing!
>
>So the KVM clock is based on the guest TSC. Given a delta between the guest TSC T and some reference point in time R, the KVM clock is expressed as a(T-R)+r, where little r is the value of the KVM clock when the guest TSC was R, and (a) is the rate of the guest TSC.
>
>When set the clock with KVM_SET_CLOCK_GUEST, we are changing the values of R and r to a new point in time. Call the new ones Q and q respectively.
>
>But we calculate precisely (within 1ns at least) what the KVM clock would have been with the *old* formula, and adjust our new offset (q) so that at our new reference TSC value Q, the formulae give exactly the same result.
>
>And because the *rates* are the same, they should continue to give the same results, ±1ns.
>
>Or such *was* my theory, at least.
>
>Would be interesting to see it disproven with actual numbers for the old+new pvclock structs, so I can understand where the logic goes wrong.
>
>Were you using frequency scaling?
>

Oh, also please could you test the updated version I posted yesterday, from https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/clocks

2024-04-19 23:55:47

by Chen, Zide

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction



On 4/19/2024 12:34 PM, David Woodhouse wrote:
> On 19 April 2024 18:13:16 BST, "Chen, Zide" <[email protected]> wrote:
>> I'm wondering what's the underling theory that we definitely can achieve
>> ±1ns accuracy? I tested it on a Sapphire Rapids @2100MHz TSC frequency,
>> and I can see delta_corrected=2 in ~2% cases.
>
> Hm. Thanks for testing!
>
> So the KVM clock is based on the guest TSC. Given a delta between the guest TSC T and some reference point in time R, the KVM clock is expressed as a(T-R)+r, where little r is the value of the KVM clock when the guest TSC was R, and (a) is the rate of the guest TSC.
>
> When set the clock with KVM_SET_CLOCK_GUEST, we are changing the values of R and r to a new point in time. Call the new ones Q and q respectively.
>
> But we calculate precisely (within 1ns at least) what the KVM clock would have been with the *old* formula, and adjust our new offset (q) so that at our new reference TSC value Q, the formulae give exactly the same result.
>
> And because the *rates* are the same, they should continue to give the same results, ±1ns.
>
> Or such *was* my theory, at least.

Thanks for the explanation.

>
> Would be interesting to see it disproven with actual numbers for the old+new pvclock structs, so I can understand where the logic goes wrong.
>
> Were you using frequency scaling?

I can see similar ~2% failure ratio w/ or w/o commenting out
configure_scaled_tsc().

2024-04-19 23:56:13

by Chen, Zide

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction



On 4/19/2024 11:43 AM, David Woodhouse wrote:
> On 19 April 2024 19:40:06 BST, David Woodhouse <[email protected]> wrote:
>> On 19 April 2024 18:13:16 BST, "Chen, Zide" <[email protected]> wrote:
>>> I'm wondering what's the underling theory that we definitely can achieve
>>> ±1ns accuracy? I tested it on a Sapphire Rapids @2100MHz TSC frequency,
>>> and I can see delta_corrected=2 in ~2% cases.
>>
>> Hm. Thanks for testing!
>>
>> So the KVM clock is based on the guest TSC. Given a delta between the guest TSC T and some reference point in time R, the KVM clock is expressed as a(T-R)+r, where little r is the value of the KVM clock when the guest TSC was R, and (a) is the rate of the guest TSC.
>>
>> When set the clock with KVM_SET_CLOCK_GUEST, we are changing the values of R and r to a new point in time. Call the new ones Q and q respectively.
>>
>> But we calculate precisely (within 1ns at least) what the KVM clock would have been with the *old* formula, and adjust our new offset (q) so that at our new reference TSC value Q, the formulae give exactly the same result.
>>
>> And because the *rates* are the same, they should continue to give the same results, ±1ns.
>>
>> Or such *was* my theory, at least.
>>
>> Would be interesting to see it disproven with actual numbers for the old+new pvclock structs, so I can understand where the logic goes wrong.
>>
>> Were you using frequency scaling?
>>
>
> Oh, also please could you test the updated version I posted yesterday, from https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/clocks

I failed to check out your branch, instead I downloaded the patch series
from:
https://lore.kernel.org/linux-kselftest/[email protected]/T/#t

However, the selftest hangs:

[Apr19 16:15] kselftest: Running tests in kvm
[Apr19 16:16] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ +0.000628] rcu: 78-...0: (1 GPs behind)
idle=3c8c/1/0x4000000000000000 softirq=5908/5913 fqs=14025
[ +0.000468] rcu: (detected by 104, t=60003 jiffies, g=60073,
q=3100 ncpus=128)
[ +0.000389] Sending NMI from CPU 104 to CPUs 78:
[ +0.000360] NMI backtrace for cpu 78
[ +0.000004] CPU: 78 PID: 33515 Comm: pvclock_test Tainted: G
O 6.9.0-rc1zide-l0+ #194
[ +0.000003] Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01
08/18/2023
[ +0.000002] RIP: 0010:pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000079] Code: ea 83 e1 40 48 0f 45 c2 31 d2 48 3d 00 94 35 77 76
0e 48 d1 e8 83 ea 01 48 3d 00 94 35 77 77 f2 48 3d 00 ca 9a 3b 89 c1 77
0d <01> c9 83 c2 01 81 f9 00 ca 9a 3b 76 f3 88 93 8c 95 00 00 31 c0 ba
[ +0.000002] RSP: 0018:ff368a58cfe07e30 EFLAGS: 00000087
[ +0.000002] RAX: 0000000000000000 RBX: ff368a58e0ccd000 RCX:
0000000000000000
[ +0.000001] RDX: 000000005ca49a49 RSI: 00000000000029aa RDI:
0000019ee77a1c00
[ +0.000002] RBP: ff368a58cfe07e50 R08: 0000000000000001 R09:
0000000000000000
[ +0.000000] R10: ff26383d853ab400 R11: 0000000000000002 R12:
0000000000000000
[ +0.000001] R13: ff368a58e0cd6400 R14: 0000000000000293 R15:
ff368a58e0cd69f0
[ +0.000001] FS: 00007f6946473740(0000) GS:ff26384c7fb80000(0000)
knlGS:0000000000000000
[ +0.000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000001] CR2: 00007f69463bd445 CR3: 000000016f466006 CR4:
0000000000f71ef0
[ +0.000001] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ +0.000000] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[ +0.000001] PKRU: 55555554
[ +0.000001] Call Trace:
[ +0.000004] <NMI>
[ +0.000003] ? nmi_cpu_backtrace+0x87/0xf0
[ +0.000008] ? nmi_cpu_backtrace_handler+0x11/0x20
[ +0.000005] ? nmi_handle+0x5f/0x170
[ +0.000005] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000045] ? default_do_nmi+0x79/0x1a0
[ +0.000004] ? exc_nmi+0xf0/0x130
[ +0.000001] ? end_repeat_nmi+0xf/0x53
[ +0.000006] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000041] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000040] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
[ +0.000039] </NMI>
[ +0.000000] <TASK>
[ +0.000001] ? preempt_count_add+0x73/0xa0
[ +0.000004] kvm_arch_init_vm+0xf1/0x1e0 [kvm]
[ +0.000049] kvm_create_vm+0x370/0x650 [kvm]
[ +0.000036] kvm_dev_ioctl+0x88/0x180 [kvm]
[ +0.000034] __x64_sys_ioctl+0x8e/0xd0
[ +0.000007] do_syscall_64+0x5b/0x120
[ +0.000003] entry_SYSCALL_64_after_hwframe+0x6c/0x74
[ +0.000003] RIP: 0033:0x7f694631a94f
[ +0.000002] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10
00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f
05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[ +0.000001] RSP: 002b:00007ffca91b2e50 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ +0.000002] RAX: ffffffffffffffda RBX: 0000000000434480 RCX:
00007f694631a94f
[ +0.000001] RDX: 0000000000000000 RSI: 000000000000ae01 RDI:
0000000000000005
[ +0.000000] RBP: 0000000000000009 R08: 000000000041b198 R09:
000000000041bfbf
[ +0.000001] R10: 00007f69463d8882 R11: 0000000000000246 R12:
0000000000434480
[ +0.000000] R13: 000000000041e0f0 R14: 0000000000001000 R15:
0000000000000207
[ +0.000002] </TASK>

2024-04-20 00:21:42

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On 19 April 2024 18:13:16 BST, "Chen, Zide" <[email protected]> wrote:
>I'm wondering what's the underling theory that we definitely can achieve
>±1ns accuracy? I tested it on a Sapphire Rapids @2100MHz TSC frequency,
>and I can see delta_corrected=2 in ~2% cases.

Hm. Thanks for testing!

So the KVM clock is based on the guest TSC. Given a delta between the guest TSC T and some reference point in time R, the KVM clock is expressed as a(T-R)+r, where little r is the value of the KVM clock when the guest TSC was R, and (a) is the rate of the guest TSC.

When set the clock with KVM_SET_CLOCK_GUEST, we are changing the values of R and r to a new point in time. Call the new ones Q and q respectively.

But we calculate precisely (within 1ns at least) what the KVM clock would have been with the *old* formula, and adjust our new offset (q) so that at our new reference TSC value Q, the formulae give exactly the same result.

And because the *rates* are the same, they should continue to give the same results, ±1ns.

Or such *was* my theory, at least.

Would be interesting to see it disproven with actual numbers for the old+new pvclock structs, so I can understand where the logic goes wrong.

Were you using frequency scaling?


2024-04-20 13:01:47

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On 20 April 2024 00:54:05 BST, "Chen, Zide" <[email protected]> wrote:
>
>
>On 4/19/2024 11:43 AM, David Woodhouse wrote:
>> On 19 April 2024 19:40:06 BST, David Woodhouse <[email protected]> wrote:
>>> On 19 April 2024 18:13:16 BST, "Chen, Zide" <[email protected]> wrote:
>>>> I'm wondering what's the underling theory that we definitely can achieve
>>>> ±1ns accuracy? I tested it on a Sapphire Rapids @2100MHz TSC frequency,
>>>> and I can see delta_corrected=2 in ~2% cases.
>>>
>>> Hm. Thanks for testing!
>>>
>>> So the KVM clock is based on the guest TSC. Given a delta between the guest TSC T and some reference point in time R, the KVM clock is expressed as a(T-R)+r, where little r is the value of the KVM clock when the guest TSC was R, and (a) is the rate of the guest TSC.
>>>
>>> When set the clock with KVM_SET_CLOCK_GUEST, we are changing the values of R and r to a new point in time. Call the new ones Q and q respectively.
>>>
>>> But we calculate precisely (within 1ns at least) what the KVM clock would have been with the *old* formula, and adjust our new offset (q) so that at our new reference TSC value Q, the formulae give exactly the same result.
>>>
>>> And because the *rates* are the same, they should continue to give the same results, ±1ns.
>>>
>>> Or such *was* my theory, at least.
>>>
>>> Would be interesting to see it disproven with actual numbers for the old+new pvclock structs, so I can understand where the logic goes wrong.
>>>
>>> Were you using frequency scaling?
>>>
>>
>> Oh, also please could you test the updated version I posted yesterday, from https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/clocks
>
>I failed to check out your branch, instead I downloaded the patch series
>from:
>https://lore.kernel.org/linux-kselftest/[email protected]/T/#t
>
>However, the selftest hangs:

Odd. It locks up in kvm_arch_init_vm(). Maybe when I get back to my desk something will be obvious. But please could I have your .config?

If you're able to bisect and see which patch causes that, it would also be much appreciated. Thanks!

>[Apr19 16:15] kselftest: Running tests in kvm
>[Apr19 16:16] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>[ +0.000628] rcu: 78-...0: (1 GPs behind)
>idle=3c8c/1/0x4000000000000000 softirq=5908/5913 fqs=14025
>[ +0.000468] rcu: (detected by 104, t=60003 jiffies, g=60073,
>q=3100 ncpus=128)
>[ +0.000389] Sending NMI from CPU 104 to CPUs 78:
>[ +0.000360] NMI backtrace for cpu 78
>[ +0.000004] CPU: 78 PID: 33515 Comm: pvclock_test Tainted: G
>O 6.9.0-rc1zide-l0+ #194
>[ +0.000003] Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01
>08/18/2023
>[ +0.000002] RIP: 0010:pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
>[ +0.000079] Code: ea 83 e1 40 48 0f 45 c2 31 d2 48 3d 00 94 35 77 76
>0e 48 d1 e8 83 ea 01 48 3d 00 94 35 77 77 f2 48 3d 00 ca 9a 3b 89 c1 77
>0d <01> c9 83 c2 01 81 f9 00 ca 9a 3b 76 f3 88 93 8c 95 00 00 31 c0 ba
>[ +0.000002] RSP: 0018:ff368a58cfe07e30 EFLAGS: 00000087
>[ +0.000002] RAX: 0000000000000000 RBX: ff368a58e0ccd000 RCX:
>0000000000000000
>[ +0.000001] RDX: 000000005ca49a49 RSI: 00000000000029aa RDI:
>0000019ee77a1c00
>[ +0.000002] RBP: ff368a58cfe07e50 R08: 0000000000000001 R09:
>0000000000000000
>[ +0.000000] R10: ff26383d853ab400 R11: 0000000000000002 R12:
>0000000000000000
>[ +0.000001] R13: ff368a58e0cd6400 R14: 0000000000000293 R15:
>ff368a58e0cd69f0
>[ +0.000001] FS: 00007f6946473740(0000) GS:ff26384c7fb80000(0000)
>knlGS:0000000000000000
>[ +0.000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>[ +0.000001] CR2: 00007f69463bd445 CR3: 000000016f466006 CR4:
>0000000000f71ef0
>[ +0.000001] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>0000000000000000
>[ +0.000000] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
>0000000000000400
>[ +0.000001] PKRU: 55555554
>[ +0.000001] Call Trace:
>[ +0.000004] <NMI>
>[ +0.000003] ? nmi_cpu_backtrace+0x87/0xf0
>[ +0.000008] ? nmi_cpu_backtrace_handler+0x11/0x20
>[ +0.000005] ? nmi_handle+0x5f/0x170
>[ +0.000005] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
>[ +0.000045] ? default_do_nmi+0x79/0x1a0
>[ +0.000004] ? exc_nmi+0xf0/0x130
>[ +0.000001] ? end_repeat_nmi+0xf/0x53
>[ +0.000006] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
>[ +0.000041] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
>[ +0.000040] ? pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
>[ +0.000039] </NMI>
>[ +0.000000] <TASK>
>[ +0.000001] ? preempt_count_add+0x73/0xa0
>[ +0.000004] kvm_arch_init_vm+0xf1/0x1e0 [kvm]
>[ +0.000049] kvm_create_vm+0x370/0x650 [kvm]
>[ +0.000036] kvm_dev_ioctl+0x88/0x180 [kvm]
>[ +0.000034] __x64_sys_ioctl+0x8e/0xd0
>[ +0.000007] do_syscall_64+0x5b/0x120
>[ +0.000003] entry_SYSCALL_64_after_hwframe+0x6c/0x74
>[ +0.000003] RIP: 0033:0x7f694631a94f
>[ +0.000002] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10
>00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f
>05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
>[ +0.000001] RSP: 002b:00007ffca91b2e50 EFLAGS: 00000246 ORIG_RAX:
>0000000000000010
>[ +0.000002] RAX: ffffffffffffffda RBX: 0000000000434480 RCX:
>00007f694631a94f
>[ +0.000001] RDX: 0000000000000000 RSI: 000000000000ae01 RDI:
>0000000000000005
>[ +0.000000] RBP: 0000000000000009 R08: 000000000041b198 R09:
>000000000041bfbf
>[ +0.000001] R10: 00007f69463d8882 R11: 0000000000000246 R12:
>0000000000434480
>[ +0.000000] R13: 000000000041e0f0 R14: 0000000000001000 R15:
>0000000000000207
>[ +0.000002] </TASK>


2024-04-20 17:21:44

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On Fri, 2024-04-19 at 16:54 -0700, Chen, Zide wrote:
>
> However, the selftest hangs:
>
> [Apr19 16:15] kselftest: Running tests in kvm
> [Apr19 16:16] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  +0.000628] rcu:      78-...0: (1 GPs behind) idle=3c8c/1/0x4000000000000000 softirq=5908/5913 fqs=14025
> [  +0.000468] rcu:      (detected by 104, t=60003 jiffies, g=60073, q=3100 ncpus=128)
> [  +0.000389] Sending NMI from CPU 104 to CPUs 78:
> [  +0.000360] NMI backtrace for cpu 78
> [  +0.000004] CPU: 78 PID: 33515 Comm: pvclock_test Tainted: G O       6.9.0-rc1zide-l0+ #194
> [  +0.000003] Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01 08/18/2023
> [  +0.000002] RIP: 0010:pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]

Ah, kvm_get_time_scale() doesn't much like being asked to scale to zero.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a07b60351894..45fb99986cf9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3046,7 +3046,8 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
* Copy from the field protected solely by ka->tsc_write_lock,
* to the field protected by the ka->pvclock_sc seqlock.
*/
- ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
+ ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio ? :
+ kvm_caps.default_tsc_scaling_ratio;

/*
* Calculate the scaling factors precisely the same way


Attachments:
smime.p7s (5.83 kB)

2024-04-22 22:03:46

by Chen, Zide

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction



On 4/20/2024 9:03 AM, David Woodhouse wrote:
> On Fri, 2024-04-19 at 16:54 -0700, Chen, Zide wrote:
>>
>> However, the selftest hangs:
>>
>> [Apr19 16:15] kselftest: Running tests in kvm
>> [Apr19 16:16] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  +0.000628] rcu:      78-...0: (1 GPs behind) idle=3c8c/1/0x4000000000000000 softirq=5908/5913 fqs=14025
>> [  +0.000468] rcu:      (detected by 104, t=60003 jiffies, g=60073, q=3100 ncpus=128)
>> [  +0.000389] Sending NMI from CPU 104 to CPUs 78:
>> [  +0.000360] NMI backtrace for cpu 78
>> [  +0.000004] CPU: 78 PID: 33515 Comm: pvclock_test Tainted: G O       6.9.0-rc1zide-l0+ #194
>> [  +0.000003] Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01 08/18/2023
>> [  +0.000002] RIP: 0010:pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
>
> Ah, kvm_get_time_scale() doesn't much like being asked to scale to zero.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a07b60351894..45fb99986cf9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3046,7 +3046,8 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
> * Copy from the field protected solely by ka->tsc_write_lock,
> * to the field protected by the ka->pvclock_sc seqlock.
> */
> - ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
> + ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio ? :
> + kvm_caps.default_tsc_scaling_ratio;
>
> /*
> * Calculate the scaling factors precisely the same way
> * that kvm_guest_time_update() does.
> last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000,
> ka->last_tsc_scaling_ratio);

Should be ka->master_tsc_scaling_ratio?

If I restored the KVM_REQ_GLOBAL_CLOCK_UPDATE request from
kvm_arch_vcpu_load(), the selftest works for me, and I ran the test for
1000+ iterations, w/ or w/o TSC scaling, the TEST_ASSERT(delta_corrected
<= ±1) never got hit. This is awesome!

However, without KVM_REQ_GLOBAL_CLOCK_UPDATE, it still fails on creating
a VM. Maybe the init sequence sill needs some rework.

BUG: unable to handle page fault for address: 005b29e3f221ccf0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 86 PID: 4118 Comm: pvclock_test Tainted
Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01 08/18/2023
RIP: 0010:start_creating+0x80/0x190
Code: ce ad 48 c7 c6 70 a1 ce ad 48 c7 c7 80 1c 9b ab e8 b5 10 d5 ff 4c
63 e0 45 85 e4 0f 85 cd 00 00 00 48 85 db 0f 84 b5 00 00 00 <48> 8b 43
30 48 8d b8 b8 >
RSP: 0018:ff786eaacf3cfdd0 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 005b29e3f221ccc0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffadcea170 RDI: 0000000000000000
RBP: ffffffffc06ac8cf R08: ffffffffa6ea0fe0 R09: ffffffffc06a5940
R10: ff786eaacf3cfe30 R11: 00000013a7b5feaa R12: 0000000000000000
R13: 0000000000000124 R14: ff786eaacfa11000 R15: 00000000000081a4
FS: 00007f0837c89740(0000) GS:ff4f44b6bfd80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0
CR2: 005b29e3f221ccf0 CR3: 000000014bdf8002 CR4: 0000000000f73ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x81/0x150
? do_user_addr_fault+0x64/0x6c0
? exc_page_fault+0x8a/0x1a0
? asm_exc_page_fault+0x26/0x30
? start_creating+0x80/0x190
__debugfs_create_file+0x43/0x1f0
kvm_create_vm_debugfs+0x28b/0x2d0 [kvm]
kvm_create_vm+0x457/0x650 [kvm]
kvm_dev_ioctl+0x88/0x180 [kvm]
__x64_sys_ioctl+0x8e/0xd0
do_syscall_64+0x5b/0x120
entry_SYSCALL_64_after_hwframe+0x71/0x79
RIP: 0033:0x7f0837b1a94f
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89
44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0
3d 00 f0 ff ff >
RSP: 002b:00007ffe01be3fc0 EFLAGS: 00000246 ORIG_RAX
RAX: ffffffffffffffda RBX: 0000000000434480 RCX: 00007f0837b1a94f
RDX: 0000000000000000 RSI: 000000000000ae01 RDI: 0000000000000005
RBP: 0000000000000009 R08: 000000000041b1a0 R09: 000000000041bfcf
R10: 00007f0837bd8882 R11: 0000000000000246 R12: 0000000000434480
R13: 000000000041e0f0 R14: 0000000000001000 R15: 0000000000000207
</TASK>
Modules linked in: kvm_intel(O) kvm(O) [last unloaded: kvm(O)]
CR2: 005b29e3f221ccf0

2024-04-23 11:27:28

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On Mon, 2024-04-22 at 15:02 -0700, Chen, Zide wrote:
>
>
> On 4/20/2024 9:03 AM, David Woodhouse wrote:
> > On Fri, 2024-04-19 at 16:54 -0700, Chen, Zide wrote:
> > >
> > > However, the selftest hangs:
> > >
> > > [Apr19 16:15] kselftest: Running tests in kvm
> > > [Apr19 16:16] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [  +0.000628] rcu:      78-...0: (1 GPs behind) idle=3c8c/1/0x4000000000000000 softirq=5908/5913 fqs=14025
> > > [  +0.000468] rcu:      (detected by 104, t=60003 jiffies, g=60073, q=3100 ncpus=128)
> > > [  +0.000389] Sending NMI from CPU 104 to CPUs 78:
> > > [  +0.000360] NMI backtrace for cpu 78
> > > [  +0.000004] CPU: 78 PID: 33515 Comm: pvclock_test Tainted: G O       6.9.0-rc1zide-l0+ #194
> > > [  +0.000003] Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01 08/18/2023
> > > [  +0.000002] RIP: 0010:pvclock_update_vm_gtod_copy+0xb5/0x200 [kvm]
> >
> > Ah, kvm_get_time_scale() doesn't much like being asked to scale to zero.
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a07b60351894..45fb99986cf9 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3046,7 +3046,8 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
> >                  * Copy from the field protected solely by ka->tsc_write_lock,
> >                  * to the field protected by the ka->pvclock_sc seqlock.
> >                  */
> > -               ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
> > +               ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio ? :
> > +                       kvm_caps.default_tsc_scaling_ratio;
> >  
> >                 /*
> >                  * Calculate the scaling factors precisely the same way
> >                  * that kvm_guest_time_update() does.
> >                 last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000,
> >                                           ka->last_tsc_scaling_ratio);
>
> Should be ka->master_tsc_scaling_ratio?

Oops, yes. I'll actually do some proper testing on a host with TSC
scaling this week. Thanks.

> If I restored the KVM_REQ_GLOBAL_CLOCK_UPDATE request from
> kvm_arch_vcpu_load(), the selftest works for me, and I ran the test for
> 1000+ iterations, w/ or w/o TSC scaling, the TEST_ASSERT(delta_corrected
>  <= ±1) never got hit. This is awesome!
>
> However, without KVM_REQ_GLOBAL_CLOCK_UPDATE, it still fails on creating
> a VM. Maybe the init sequence sill needs some rework.

That one confuses me. The crash is actually in debugfs, as it's
registering the per-vm or per-vcpu stats. I can't imagine *how* that's
occurring. Or see why the availability of TSC scaling would cause it to
show up for you and not me. Can I have your .config please?

First thought would be that there's some change in the KVM structures
and you have some stale object files using the old struct, but then I
realise I forgot to actually *remove* the now-unused
kvmclock_update_work from x86's struct kvm_arch anyway.

I'll try to reproduce, as I think I want to *know* what's going on
here, even if I am going to drop that patch as mentioned in 
https://lore.kernel.org/kvm/[email protected]

Are you able to load that vmlinux in gdb and
(gdb) list *start_creating+0x80
(gdb) list *kvm_create_vm_debugfs+0x28b

Thanks again.

>  BUG: unable to handle page fault for address: 005b29e3f221ccf0
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0
>  Oops: 0000 [#1] PREEMPT SMP NOPTI
>  CPU: 86 PID: 4118 Comm: pvclock_test Tainted
>  Hardware name: Inspur NF5280M7/NF5280M7, BIOS 05.08.01 08/18/2023
>  RIP: 0010:start_creating+0x80/0x190
>  Code: ce ad 48 c7 c6 70 a1 ce ad 48 c7 c7 80 1c 9b ab e8 b5 10 d5 ff 4c
> 63 e0 45 85 e4 0f 85 cd 00 00 00 48 85 db 0f 84 b5 00 00 00 <48> 8b 43
> 30 48 8d b8 b8 >
>  RSP: 0018:ff786eaacf3cfdd0 EFLAGS: 00010206
>  RAX: 0000000000000000 RBX: 005b29e3f221ccc0 RCX: 0000000000000000
>  RDX: 0000000000000001 RSI: ffffffffadcea170 RDI: 0000000000000000
>  RBP: ffffffffc06ac8cf R08: ffffffffa6ea0fe0 R09: ffffffffc06a5940
>  R10: ff786eaacf3cfe30 R11: 00000013a7b5feaa R12: 0000000000000000
>  R13: 0000000000000124 R14: ff786eaacfa11000 R15: 00000000000081a4
>  FS:  00007f0837c89740(0000) GS:ff4f44b6bfd80000(0000)
> knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0
>  CR2: 005b29e3f221ccf0 CR3: 000000014bdf8002 CR4: 0000000000f73ef0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>  PKRU: 55555554
>  Call Trace:
>   <TASK>
>   ? __die+0x24/0x70
>   ? page_fault_oops+0x81/0x150
>   ? do_user_addr_fault+0x64/0x6c0
>   ? exc_page_fault+0x8a/0x1a0
>   ? asm_exc_page_fault+0x26/0x30
>   ? start_creating+0x80/0x190
>   __debugfs_create_file+0x43/0x1f0
>   kvm_create_vm_debugfs+0x28b/0x2d0 [kvm]
>   kvm_create_vm+0x457/0x650 [kvm]
>   kvm_dev_ioctl+0x88/0x180 [kvm]
>   __x64_sys_ioctl+0x8e/0xd0
>   do_syscall_64+0x5b/0x120
>   entry_SYSCALL_64_after_hwframe+0x71/0x79
>  RIP: 0033:0x7f0837b1a94f
>  Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89
> 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0
> 3d 00 f0 ff ff >
>  RSP: 002b:00007ffe01be3fc0 EFLAGS: 00000246 ORIG_RAX
>  RAX: ffffffffffffffda RBX: 0000000000434480 RCX: 00007f0837b1a94f
>  RDX: 0000000000000000 RSI: 000000000000ae01 RDI: 0000000000000005
>  RBP: 0000000000000009 R08: 000000000041b1a0 R09: 000000000041bfcf
>  R10: 00007f0837bd8882 R11: 0000000000000246 R12: 0000000000434480
>  R13: 000000000041e0f0 R14: 0000000000001000 R15: 0000000000000207
>   </TASK>
>  Modules linked in: kvm_intel(O) kvm(O) [last unloaded: kvm(O)]
>  CR2: 005b29e3f221ccf0


Attachments:
smime.p7s (5.83 kB)

2024-04-23 18:04:48

by Chen, Zide

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction



On 4/23/2024 12:49 AM, David Woodhouse wrote:
>> If I restored the KVM_REQ_GLOBAL_CLOCK_UPDATE request from
>> kvm_arch_vcpu_load(), the selftest works for me, and I ran the test for
>> 1000+ iterations, w/ or w/o TSC scaling, the TEST_ASSERT(delta_corrected
>>  <= ±1) never got hit. This is awesome!
>>
>> However, without KVM_REQ_GLOBAL_CLOCK_UPDATE, it still fails on creating
>> a VM. Maybe the init sequence sill needs some rework.
>
> That one confuses me. The crash is actually in debugfs, as it's
> registering the per-vm or per-vcpu stats. I can't imagine *how* that's
> occurring. Or see why the availability of TSC scaling would cause it to
> show up for you and not me. Can I have your .config please?
>
> First thought would be that there's some change in the KVM structures
> and you have some stale object files using the old struct, but then I
> realise I forgot to actually *remove* the now-unused
> kvmclock_update_work from x86's struct kvm_arch anyway.
>
> I'll try to reproduce, as I think I want to *know* what's going on
> here, even if I am going to drop that patch as mentioned in 
> https://lore.kernel.org/kvm/[email protected]
>
> Are you able to load that vmlinux in gdb and
> (gdb) list *start_creating+0x80
> (gdb) list *kvm_create_vm_debugfs+0x28b
>
> Thanks again.

My apologies, it turns out the KVM_REQ_GLOBAL_CLOCK_UPDATE is not
needed. Today I can't reproduce the issue after removing it. Yesterday
I thought it may miss something related to pfncache.

To be clear, with the above mentioned change to
kvm_scale_tsc(master_tsc_scaling_ratio), the selftest runs reliably
regardless TSC scaling is enabled or not.

2024-04-23 21:03:38

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On 23 April 2024 18:59:21 BST, "Chen, Zide" <[email protected]> wrote:
>
>
>On 4/23/2024 12:49 AM, David Woodhouse wrote:
>>> If I restored the KVM_REQ_GLOBAL_CLOCK_UPDATE request from
>>> kvm_arch_vcpu_load(), the selftest works for me, and I ran the test for
>>> 1000+ iterations, w/ or w/o TSC scaling, the TEST_ASSERT(delta_corrected
>>>  <= ±1) never got hit. This is awesome!
>>>
>>> However, without KVM_REQ_GLOBAL_CLOCK_UPDATE, it still fails on creating
>>> a VM. Maybe the init sequence sill needs some rework.
>>
>> That one confuses me. The crash is actually in debugfs, as it's
>> registering the per-vm or per-vcpu stats. I can't imagine *how* that's
>> occurring. Or see why the availability of TSC scaling would cause it to
>> show up for you and not me. Can I have your .config please?
>>
>> First thought would be that there's some change in the KVM structures
>> and you have some stale object files using the old struct, but then I
>> realise I forgot to actually *remove* the now-unused
>> kvmclock_update_work from x86's struct kvm_arch anyway.
>>
>> I'll try to reproduce, as I think I want to *know* what's going on
>> here, even if I am going to drop that patch as mentioned in 
>> https://lore.kernel.org/kvm/[email protected]
>>
>> Are you able to load that vmlinux in gdb and
>> (gdb) list *start_creating+0x80
>> (gdb) list *kvm_create_vm_debugfs+0x28b
>>
>> Thanks again.
>
>My apologies, it turns out the KVM_REQ_GLOBAL_CLOCK_UPDATE is not
>needed. Today I can't reproduce the issue after removing it. Yesterday
>I thought it may miss something related to pfncache.
>
>To be clear, with the above mentioned change to
>kvm_scale_tsc(master_tsc_scaling_ratio), the selftest runs reliably
>regardless TSC scaling is enabled or not.

Thanks. That version is now in my git tree and I have tested it myself on Skylake. Then I got distracted by reverse-engineering kvm_get_time_scale() so I could actually add some comments to it.

I'm still going to have to put the clock updates back though, for the non-masterclock case.

While I'm ripping all this up I guess I ought to rename it to "reference clock" too?

2024-04-24 17:47:30

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

On Mon, 2024-04-22 at 15:02 -0700, Chen, Zide wrote:
> the selftest works for me, and I ran the test for 1000+ iterations,
> w/ or w/o TSC scaling, the TEST_ASSERT(delta_corrected <= ±1) never
> got hit. This is awesome!

I think that with further care we can get even better than that.

Let's look at where that ±1ns tolerance comes from.

Consider a 3GHz TSC. That gives us three ticks per nanosecond. Each TSC
value can be seen as (3n) (3n+1) or (3n+2) for a given nanosecond n.

If we take a new reference point at a (3n+2) TSC value and calculate
the KVM clock from that, we *know* we're going to round down and lose
two-thirds of a nanosecond.

So then we set the new KVM clock parameters to use that new reference
point, and that's why we have to allow a disruption of up to a single
nanosecond. In fact, I don't think it's ±1 ns, is it? It'll only ever
be in the same direction (rounding down)?

But if we're careful which *which* TSC value we use as the reference
point, we can reduce that error.

The TSC value we use should be *around* the current time, but what if
we were to evaluate maybe the previous 100 TSC values. Pass *each* of
them through the conversion to nanoseconds and use the one that comes
*closest* to a precise nanosecond (nnnnnnnn.000).

It's even fairly easy to calculate those, because of the way the KVM
clock ABI has us multiply and then shift right by 32 bits. We just need
to look at those low 32 bits (the fractional nanosecond) *before*
shifting them out of existence. Something like...

uint64_t tsc_candidate, tsc_candidate_last, best_tsc;
uint32_t frac_ns_min = 0xffffffff;
uint64_t frac_ns;

best_tsc = tsc_candidate = rdtsc();
tsc_candidate_last = tsc_candidate - 100;

while (tsc_candidate-- > tsc_candidate_last) {
uint64_t guest_tsc = kvm_scale_tsc(tsc_candidate, ...);
frac_ns = guest_tsc * hvclock->tsc_to_system_mul;
/* Shift *after* multiplication, not before as pvclock_scale_cycles() does. */
if (hvclock->tsc_shift < 0)
frac_ns >>= -hvclock->tsc_shift;
else
frac_ns <<= hvclock->tsc_shift;

if ( (uint32_t)frac_ns <= frac_ns_min ) {
frac_ns_min = frac_ns;
best_tsc = tsc_candidate;
}
}
printk("Best TSC to use for reference point is %lld", best_tsc);

And then you calculate your CLOCK_MONOTONIC_RAW and guest KVM clock
from *that* host TSC value, and thus minimise the discrepancies due to
rounding down?

Aside from the fact that I literally just typed that into email and
it's barely even thought through let alone entirely untested... I'm
just not sure it's even worth the runtime cost, for that ±1 ns on a
rare case.

A slop of ±1ns is probably sufficient because over the past few years
we've already shifted the definition of the KVM clock to *not* be NTP-
corrected, and we leave guests to do fine-grained synchronization
through other means anyway.

But I see talk of people offering a PPS signal to *hundreds* of guests
on the same host simultaneously, just for them all to use it to
calibrate the same underlying oscillator. Which is a little bit insane.

We *should* have a way for the host to do that once and then expose the
precise time to its guests, in a much saner way than the KVM clock
does. I'll look at adding something along those lines to this series
too, which can be driven from the host's adjtimex() adjustments (which
KVM already consumes), and fed into each guest's timekeeping as a
PTP/PPS device or something.



Attachments:
smime.p7s (5.83 kB)