2023-10-30 15:51:15

by David Woodhouse

[permalink] [raw]
Subject: [PATCH] KVM: x86/xen: improve accuracy of Xen timers

From: David Woodhouse <[email protected]>

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/x86.c | 30 ++++++++++++
arch/x86/kvm/x86.h | 1 +
arch/x86/kvm/xen.c | 111 +++++++++++++++++++++++++++++++--------------
3 files changed, 109 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41cce5031126..aeede83d65dc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2863,6 +2863,25 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
return mode;
}

+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+ struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+ unsigned long seq;
+ int mode;
+ u64 ns;
+
+ do {
+ seq = read_seqcount_begin(&gtod->seq);
+ ns = gtod->clock.base_cycles;
+ ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+ ns >>= gtod->clock.shift;
+ ns += ktime_to_ns(ktime_add(gtod->clock.offset, gtod->offs_boot));
+ } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+ *t = ns;
+
+ return mode;
+}
+
static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2895,6 +2914,17 @@ static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
tsc_timestamp));
}

+/* returns true if host is using TSC based clocksource */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+ /* checked again under seqlock below */
+ if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+ return false;
+
+ return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+ tsc_timestamp));
+}
+
/* returns true if host is using TSC based clocksource */
static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
u64 *tsc_timestamp)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1e7be1f6ab29..c08c6f729965 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -293,6 +293,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);

u64 get_kvmclock_ns(struct kvm *kvm);
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);

int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
gva_t addr, void *val, unsigned int bytes,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 0ea6016ad132..00a1e924a717 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@
#include <xen/interface/sched.h>

#include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>

#include "cpuid.h"
#include "trace.h"
@@ -144,17 +145,87 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
return HRTIMER_NORESTART;
}

-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
+ bool linux_wa)
{
+ uint64_t guest_now;
+ int64_t kernel_now, delta;
+
+ /*
+ * The guest provides the requested timeout in absolute nanoseconds
+ * of the KVM clock — as *it* sees it, based on the scaled TSC and
+ * the pvclock information provided by KVM.
+ *
+ * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
+ * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
+ * difference won't matter much as there is no cumulative effect.
+ *
+ * Calculate the time for some arbitrary point in time around "now"
+ * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
+ * delta between the kvmclock "now" value and the guest's requested
+ * timeout, apply the "Linux workaround" described below, and add
+ * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
+ * the absolute CLOCK_MONOTONIC time at which the timer should
+ * fire.
+ */
+ if (vcpu->kvm->arch.use_master_clock &&
+ static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ uint64_t host_tsc, guest_tsc;
+
+ if (!IS_ENABLED(CONFIG_64BIT) ||
+ !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+ /*
+ * Don't fall back to get_kvmclock_ns() because it's
+ * broken; it has a systemic error in its results
+ * because it scales directly from host TSC to
+ * nanoseconds, and doesn't scale first to guest TSC
+ * and then* to nanoseconds as the guest does.
+ *
+ * There is a small error introduced here because time
+ * continues to elapse between the ktime_get() and the
+ * subsequent rdtsc(). But not the systemic drift due
+ * to get_kvmclock_ns().
+ */
+ kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
+ host_tsc = rdtsc();
+ }
+
+ /* Calculate the guest kvmclock as the guest would do it. */
+ guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+ guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
+ } else {
+ /* Without CONSTANT_TSC, get_kvmclock_ns() is the only option */
+ guest_now = get_kvmclock_ns(vcpu->kvm);
+ kernel_now = ktime_get();
+ }
+
+ delta = guest_abs - guest_now;
+
+ /* Xen has a 'Linux workaround' in do_set_timer_op() which
+ * checks for negative absolute timeout values (caused by
+ * integer overflow), and for values about 13 days in the
+ * future (2^50ns) which would be caused by jiffies
+ * overflow. For those cases, it sets the timeout 100ms in
+ * the future (not *too* soon, since if a guest really did
+ * set a long timeout on purpose we don't want to keep
+ * churning CPU time by waking it up).
+ */
+ if (linux_wa) {
+ if ((unlikely((int64_t)guest_abs < 0 ||
+ (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
+ delta = 100 * NSEC_PER_MSEC;
+ guest_abs = guest_now + delta;
+ }
+ }
+
atomic_set(&vcpu->arch.xen.timer_pending, 0);
vcpu->arch.xen.timer_expires = guest_abs;

- if (delta_ns <= 0) {
+ if (delta <= 0) {
xen_timer_callback(&vcpu->arch.xen.timer);
} else {
- ktime_t ktime_now = ktime_get();
hrtimer_start(&vcpu->arch.xen.timer,
- ktime_add_ns(ktime_now, delta_ns),
+ ktime_add_ns(kernel_now, delta),
HRTIMER_MODE_ABS_HARD);
}
}
@@ -923,8 +994,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
/* Start the timer if the new value has a valid vector+expiry. */
if (data->u.timer.port && data->u.timer.expires_ns)
kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
- data->u.timer.expires_ns -
- get_kvmclock_ns(vcpu->kvm));
+ false);

r = 0;
break;
@@ -1340,7 +1410,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
{
struct vcpu_set_singleshot_timer oneshot;
struct x86_exception e;
- s64 delta;

if (!kvm_xen_timer_enabled(vcpu))
return false;
@@ -1374,13 +1443,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
return true;
}

- delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
- if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) {
- *r = -ETIME;
- return true;
- }
-
- kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
+ kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
*r = 0;
return true;

@@ -1404,25 +1467,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
return false;

if (timeout) {
- uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
- int64_t delta = timeout - guest_now;
-
- /* Xen has a 'Linux workaround' in do_set_timer_op() which
- * checks for negative absolute timeout values (caused by
- * integer overflow), and for values about 13 days in the
- * future (2^50ns) which would be caused by jiffies
- * overflow. For those cases, it sets the timeout 100ms in
- * the future (not *too* soon, since if a guest really did
- * set a long timeout on purpose we don't want to keep
- * churning CPU time by waking it up).
- */
- if (unlikely((int64_t)timeout < 0 ||
- (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
- delta = 100 * NSEC_PER_MSEC;
- timeout = guest_now + delta;
- }
-
- kvm_xen_start_timer(vcpu, timeout, delta);
+ kvm_xen_start_timer(vcpu, timeout, true);
} else {
kvm_xen_stop_timer(vcpu);
}
--
2.41.0



Attachments:
smime.p7s (5.83 kB)

2023-10-31 10:43:10

by Paul Durrant

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers

On 30/10/2023 15:50, David Woodhouse wrote:
> From: David Woodhouse <[email protected]>
>
> A test program such as http://david.woodhou.se/timerlat.c confirms user
> reports that timers are increasingly inaccurate as the lifetime of a
> guest increases. Reporting the actual delay observed when asking for
> 100µs of sleep, it starts off OK on a newly-launched guest but gets
> worse over time, giving incorrect sleep times:
>
> root@ip-10-0-193-21:~# ./timerlat -c -n 5
> 00000000 latency 103243/100000 (3.2430%)
> 00000001 latency 103243/100000 (3.2430%)
> 00000002 latency 103242/100000 (3.2420%)
> 00000003 latency 103245/100000 (3.2450%)
> 00000004 latency 103245/100000 (3.2450%)
>
> The biggest problem is that get_kvmclock_ns() returns inaccurate values
> when the guest TSC is scaled. The guest sees a TSC value scaled from the
> host TSC by a mul/shift conversion (hopefully done in hardware). The
> guest then converts that guest TSC value into nanoseconds using the
> mul/shift conversion given to it by the KVM pvclock information.
>
> But get_kvmclock_ns() performs only a single conversion directly from
> host TSC to nanoseconds, giving a different result. A test program at
> http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
> over a day.
>
> It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
> that. The actual guest hv_clock is per-CPU, and *theoretically* each
> vCPU could be running at a *different* frequency. But this patch is
> needed anyway because...
>
> The other issue with Xen timers was that the code would snapshot the
> host CLOCK_MONOTONIC at some point in time, and then... after a few
> interrupts may have occurred, some preemption perhaps... would also read
> the guest's kvmclock. Then it would proceed under the false assumption
> that those two happened at the *same* time. Any time which *actually*
> elapsed between reading the two clocks was introduced as inaccuracies
> in the time at which the timer fired.
>
> Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
> host TSC just *once*, then use the returned TSC value to calculate the
> kvmclock (making sure to do that the way the guest would instead of
> making the same mistake get_kvmclock_ns() does).
>
> Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
> timers still have to use CLOCK_MONOTONIC. In practice the difference
> between the two won't matter over the timescales involved, as the
> *absolute* values don't matter; just the delta.
>
> This does mean a new variant of kvm_get_time_and_clockread() is needed;
> called kvm_get_monotonic_and_clockread() because that's what it does.
>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/kvm/x86.c | 30 ++++++++++++
> arch/x86/kvm/x86.h | 1 +
> arch/x86/kvm/xen.c | 111 +++++++++++++++++++++++++++++++--------------
> 3 files changed, 109 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 41cce5031126..aeede83d65dc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2863,6 +2863,25 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> return mode;
> }
>
> +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> +{
> + struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> + unsigned long seq;
> + int mode;
> + u64 ns;
> +
> + do {
> + seq = read_seqcount_begin(&gtod->seq);
> + ns = gtod->clock.base_cycles;
> + ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> + ns >>= gtod->clock.shift;
> + ns += ktime_to_ns(ktime_add(gtod->clock.offset, gtod->offs_boot));
> + } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
> + *t = ns;
> +
> + return mode;
> +}
> +
> static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
> {
> struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> @@ -2895,6 +2914,17 @@ static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
> tsc_timestamp));
> }
>
> +/* returns true if host is using TSC based clocksource */
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
> +{
> + /* checked again under seqlock below */
> + if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
> + return false;
> +
> + return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
> + tsc_timestamp));
> +}
> +
> /* returns true if host is using TSC based clocksource */
> static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> u64 *tsc_timestamp)
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1e7be1f6ab29..c08c6f729965 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -293,6 +293,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
> void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
>
> u64 get_kvmclock_ns(struct kvm *kvm);
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
>
> int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
> gva_t addr, void *val, unsigned int bytes,
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 0ea6016ad132..00a1e924a717 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -24,6 +24,7 @@
> #include <xen/interface/sched.h>
>
> #include <asm/xen/cpuid.h>
> +#include <asm/pvclock.h>
>
> #include "cpuid.h"
> #include "trace.h"
> @@ -144,17 +145,87 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
> return HRTIMER_NORESTART;
> }
>
> -static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
> +static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
> + bool linux_wa)
> {
> + uint64_t guest_now;
> + int64_t kernel_now, delta;
> +
> + /*
> + * The guest provides the requested timeout in absolute nanoseconds
> + * of the KVM clock — as *it* sees it, based on the scaled TSC and
> + * the pvclock information provided by KVM.
> + *
> + * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
> + * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
> + * difference won't matter much as there is no cumulative effect.
> + *
> + * Calculate the time for some arbitrary point in time around "now"
> + * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
> + * delta between the kvmclock "now" value and the guest's requested
> + * timeout, apply the "Linux workaround" described below, and add
> + * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
> + * the absolute CLOCK_MONOTONIC time at which the timer should
> + * fire.
> + */
> + if (vcpu->kvm->arch.use_master_clock &&
> + static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> + uint64_t host_tsc, guest_tsc;
> +
> + if (!IS_ENABLED(CONFIG_64BIT) ||
> + !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
> + /*
> + * Don't fall back to get_kvmclock_ns() because it's
> + * broken; it has a systemic error in its results
> + * because it scales directly from host TSC to
> + * nanoseconds, and doesn't scale first to guest TSC
> + * and then* to nanoseconds as the guest does.
> + *
> + * There is a small error introduced here because time
> + * continues to elapse between the ktime_get() and the
> + * subsequent rdtsc(). But not the systemic drift due
> + * to get_kvmclock_ns().
> + */
> + kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
> + host_tsc = rdtsc();
> + }
> +
> + /* Calculate the guest kvmclock as the guest would do it. */
> + guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> + guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
> + } else {
> + /* Without CONSTANT_TSC, get_kvmclock_ns() is the only option */
> + guest_now = get_kvmclock_ns(vcpu->kvm);
> + kernel_now = ktime_get();
> + }
> +
> + delta = guest_abs - guest_now;
> +
> + /* Xen has a 'Linux workaround' in do_set_timer_op() which
> + * checks for negative absolute timeout values (caused by
> + * integer overflow), and for values about 13 days in the
> + * future (2^50ns) which would be caused by jiffies
> + * overflow. For those cases, it sets the timeout 100ms in
> + * the future (not *too* soon, since if a guest really did
> + * set a long timeout on purpose we don't want to keep
> + * churning CPU time by waking it up).
> + */
> + if (linux_wa) {
> + if ((unlikely((int64_t)guest_abs < 0 ||
> + (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
> + delta = 100 * NSEC_PER_MSEC;
> + guest_abs = guest_now + delta;
> + }
> + }
> +
> atomic_set(&vcpu->arch.xen.timer_pending, 0);
> vcpu->arch.xen.timer_expires = guest_abs;
>
> - if (delta_ns <= 0) {
> + if (delta <= 0) {
> xen_timer_callback(&vcpu->arch.xen.timer);
> } else {
> - ktime_t ktime_now = ktime_get();
> hrtimer_start(&vcpu->arch.xen.timer,
> - ktime_add_ns(ktime_now, delta_ns),
> + ktime_add_ns(kernel_now, delta),
> HRTIMER_MODE_ABS_HARD);
> }
> }
> @@ -923,8 +994,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> /* Start the timer if the new value has a valid vector+expiry. */
> if (data->u.timer.port && data->u.timer.expires_ns)
> kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
> - data->u.timer.expires_ns -
> - get_kvmclock_ns(vcpu->kvm));
> + false);

There is no documented ordering requirement on setting
KVM_XEN_VCPU_ATTR_TYPE_TIMER versus KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or
KVM_XEN_ATTR_TYPE_SHARED_INFO but kvm_xen_start_timer() now needs the
vCPU's pvclock to be valid. Should actually starting the timer not be
deferred until then? (Or simply add a check here and have the attribute
setting fail if the pvclock is not valid).

Paul

>
> r = 0;
> break;
> @@ -1340,7 +1410,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
> {
> struct vcpu_set_singleshot_timer oneshot;
> struct x86_exception e;
> - s64 delta;
>
> if (!kvm_xen_timer_enabled(vcpu))
> return false;
> @@ -1374,13 +1443,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
> return true;
> }
>
> - delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
> - if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) {
> - *r = -ETIME;
> - return true;
> - }
> -
> - kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
> + kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
> *r = 0;
> return true;
>
> @@ -1404,25 +1467,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
> return false;
>
> if (timeout) {
> - uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
> - int64_t delta = timeout - guest_now;
> -
> - /* Xen has a 'Linux workaround' in do_set_timer_op() which
> - * checks for negative absolute timeout values (caused by
> - * integer overflow), and for values about 13 days in the
> - * future (2^50ns) which would be caused by jiffies
> - * overflow. For those cases, it sets the timeout 100ms in
> - * the future (not *too* soon, since if a guest really did
> - * set a long timeout on purpose we don't want to keep
> - * churning CPU time by waking it up).
> - */
> - if (unlikely((int64_t)timeout < 0 ||
> - (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
> - delta = 100 * NSEC_PER_MSEC;
> - timeout = guest_now + delta;
> - }
> -
> - kvm_xen_start_timer(vcpu, timeout, delta);
> + kvm_xen_start_timer(vcpu, timeout, true);
> } else {
> kvm_xen_stop_timer(vcpu);
> }

2023-10-31 11:42:43

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers



On 31 October 2023 10:42:42 GMT, Paul Durrant <[email protected]> wrote:
>There is no documented ordering requirement on setting KVM_XEN_VCPU_ATTR_TYPE_TIMER versus KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or KVM_XEN_ATTR_TYPE_SHARED_INFO but kvm_xen_start_timer() now needs the vCPU's pvclock to be valid. Should actually starting the timer not be deferred until then? (Or simply add a check here and have the attribute setting fail if the pvclock is not valid).


There are no such dependencies and I don't want there to be. That would be the *epitome* of what my "if it needs documenting, fix it first" mantra is intended to correct.

The fact that this broke on migration because the hv_clock isn't set up yet, as we saw in our overnight testing, is just a bug. In my tree I've fixed it thus:

index 63531173dad1..e3d2d63eef34 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -182,7 +182,7 @@ static void kvm_xen_start_timer(st
ruct kvm_vcpu *vcpu, u64 guest_abs,
* the absolute CLOCK_MONOTONIC time at which
the timer should
* fire.
*/
- if (vcpu->kvm->arch.use_master_clock &&
+ if (vcpu->arch.hv_clock.version && vcpu->kvm->
arch.use_master_clock &&
static_cpu_has(X86_FEATURE_CONSTANT_TSC))
{
uint64_t host_tsc, guest_tsc;

@@ -206,9 +206,23 @@ static void kvm_xen_start_timer(s
truct kvm_vcpu *vcpu, u64 guest_abs,

/* Calculate the guest kvmclock as the
guest would do it. */
guest_tsc = kvm_read_l1_tsc(vcpu, host
_tsc);
- guest_now = __pvclock_read_cycles(&vcp
u->arch.hv_clock, guest_tsc);
+ guest_now = __pvclock_read_cycles(&vcp
u->arch.hv_clock,
+ gues
t_tsc);
} else {
- /* Without CONSTANT_TSC, get_kvmclock_
ns() is the only option */
+ /*
+ * Without CONSTANT_TSC, get_kvmclock_
ns() is the only option.
+ *
+ * Also if the guest PV clock hasn't b
een set up yet, as is
+ * likely to be the case during migrat
ion when the vCPU has
+ * not been run yet. It would be possi
ble to calculate the
+ * scaling factors properly in that ca
se but there's not much
+ * point in doing so. The get_kvmclock
_ns() drift accumulates
+ * over time, so it's OK to use it at
startup. Besides, on
+ * migration there's going to be a lit
tle bit of skew in the
+ * precise moment at which timers fire
anyway. Often they'll
+ * be in the "past" by the time the VM
is running again after
+ * migration.
+ */
guest_now = get_kvmclock_ns(vcpu->kvm)
;
kernel_now = ktime_get();
}
--
2.41.0

We *could* reset the timer when the vCPU starts to run and handles the KVM_REQ_CLOCK_UPDATE event, but I don't want to for two reasons.

Firstly, we just don't need that complexity. This approach is OK, as the newly-added comment says. And we do need to fix get_kvmclock_ns() anyway, so it should work fine. Most of this patch will still be useful as it uses a single TSC read and we *do* need to do that part even after all the kvmclock brokenness is fixed. But the complexity on KVM_REQ_CLOCK_UPDATE isn't needed in the long term.

Secondly, it's also wrong thing to do in the general case. Let's say KVM does its thing and snaps the kvmclock backwards in time on a KVM_REQ_CLOCK_UPDATE... do we really want to reinterpret existing timers against the new kvmclock? They were best left alone, I think.



















2023-10-31 12:07:18

by Paul Durrant

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers

On 31/10/2023 11:42, David Woodhouse wrote:
>
>
> On 31 October 2023 10:42:42 GMT, Paul Durrant <[email protected]> wrote:
>> There is no documented ordering requirement on setting KVM_XEN_VCPU_ATTR_TYPE_TIMER versus KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or KVM_XEN_ATTR_TYPE_SHARED_INFO but kvm_xen_start_timer() now needs the vCPU's pvclock to be valid. Should actually starting the timer not be deferred until then? (Or simply add a check here and have the attribute setting fail if the pvclock is not valid).
>
>
> There are no such dependencies and I don't want there to be. That would be the *epitome* of what my "if it needs documenting, fix it first" mantra is intended to correct.
>
> The fact that this broke on migration because the hv_clock isn't set up yet, as we saw in our overnight testing, is just a bug. In my tree I've fixed it thus:
>
> index 63531173dad1..e3d2d63eef34 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -182,7 +182,7 @@ static void kvm_xen_start_timer(st
> ruct kvm_vcpu *vcpu, u64 guest_abs,
> * the absolute CLOCK_MONOTONIC time at which
> the timer should
> * fire.
> */
> - if (vcpu->kvm->arch.use_master_clock &&
> + if (vcpu->arch.hv_clock.version && vcpu->kvm->
> arch.use_master_clock &&
> static_cpu_has(X86_FEATURE_CONSTANT_TSC))
> {
> uint64_t host_tsc, guest_tsc;
>
> @@ -206,9 +206,23 @@ static void kvm_xen_start_timer(s
> truct kvm_vcpu *vcpu, u64 guest_abs,
>
> /* Calculate the guest kvmclock as the
> guest would do it. */
> guest_tsc = kvm_read_l1_tsc(vcpu, host
> _tsc);
> - guest_now = __pvclock_read_cycles(&vcp
> u->arch.hv_clock, guest_tsc);
> + guest_now = __pvclock_read_cycles(&vcp
> u->arch.hv_clock,
> + gues
> t_tsc);
> } else {
> - /* Without CONSTANT_TSC, get_kvmclock_
> ns() is the only option */
> + /*
> + * Without CONSTANT_TSC, get_kvmclock_
> ns() is the only option.
> + *
> + * Also if the guest PV clock hasn't b
> een set up yet, as is
> + * likely to be the case during migrat
> ion when the vCPU has
> + * not been run yet. It would be possi
> ble to calculate the
> + * scaling factors properly in that ca
> se but there's not much
> + * point in doing so. The get_kvmclock
> _ns() drift accumulates
> + * over time, so it's OK to use it at
> startup. Besides, on
> + * migration there's going to be a lit
> tle bit of skew in the
> + * precise moment at which timers fire
> anyway. Often they'll
> + * be in the "past" by the time the VM
> is running again after
> + * migration.
> + */
> guest_now = get_kvmclock_ns(vcpu->kvm)
> ;
> kernel_now = ktime_get();
> }
> --
> 2.41.0
>
> We *could* reset the timer when the vCPU starts to run and handles the KVM_REQ_CLOCK_UPDATE event, but I don't want to for two reasons.
>
> Firstly, we just don't need that complexity. This approach is OK, as the newly-added comment says. And we do need to fix get_kvmclock_ns() anyway, so it should work fine. Most of this patch will still be useful as it uses a single TSC read and we *do* need to do that part even after all the kvmclock brokenness is fixed. But the complexity on KVM_REQ_CLOCK_UPDATE isn't needed in the long term.
>
> Secondly, it's also wrong thing to do in the general case. Let's say KVM does its thing and snaps the kvmclock backwards in time on a KVM_REQ_CLOCK_UPDATE... do we really want to reinterpret existing timers against the new kvmclock? They were best left alone, I think.

Do we not want to do exactly that? If the master clock is changed, why
would we not want to re-interpret the guest's idea of time? That update
will be visible to the guest when it re-reads the PV clock source.

Paul

2023-10-31 12:12:02

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers



On 31 October 2023 12:06:17 GMT, Paul Durrant <[email protected]> wrote:
>On 31/10/2023 11:42, David Woodhouse wrote:
>> Secondly, it's also wrong thing to do in the general case. Let's say KVM does its thing and snaps the kvmclock backwards in time on a KVM_REQ_CLOCK_UPDATE... do we really want to reinterpret existing timers against the new kvmclock? They were best left alone, I think.
>
>Do we not want to do exactly that? If the master clock is changed, why would we not want to re-interpret the guest's idea of time? That update will be visible to the guest when it re-reads the PV clock source.

Well no, because the guest set that timer *before* we yanked the clock from under it, and probably wants it interpreted in the time scale which was in force at the time it was set.

But more to the point, KVM shouldn't be doing that! We need to *fix* the kvmclock brokenness, not design further band-aids around it.

As I said, this patch stands even *after* we fix kvmclock, because it handles the timer delta calculation from an single TSC read.

But overengineering a timer reset on KVM_REQ_CLOCK_UPDATE would not.

2023-10-31 12:22:40

by Paul Durrant

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers

On 31/10/2023 12:11, David Woodhouse wrote:
>
>
> On 31 October 2023 12:06:17 GMT, Paul Durrant <[email protected]> wrote:
>> On 31/10/2023 11:42, David Woodhouse wrote:
>>> Secondly, it's also wrong thing to do in the general case. Let's say KVM does its thing and snaps the kvmclock backwards in time on a KVM_REQ_CLOCK_UPDATE... do we really want to reinterpret existing timers against the new kvmclock? They were best left alone, I think.
>>
>> Do we not want to do exactly that? If the master clock is changed, why would we not want to re-interpret the guest's idea of time? That update will be visible to the guest when it re-reads the PV clock source.
>
> Well no, because the guest set that timer *before* we yanked the clock from under it, and probably wants it interpreted in the time scale which was in force at the time it was set.
>
> But more to the point, KVM shouldn't be doing that! We need to *fix* the kvmclock brokenness, not design further band-aids around it.

Ok, fair enough.

>
> As I said, this patch stands even *after* we fix kvmclock, because it handles the timer delta calculation from an single TSC read.
>
> But overengineering a timer reset on KVM_REQ_CLOCK_UPDATE would not.

I'm not sure what you intend to do to kvmlock, so not sure whether we'll
still need the __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc)
but this patch (with the extra check on validity of hv_clock) does fix
the drift so...

Reviewed-by: Paul Durrant <[email protected]>

2023-10-31 16:52:19

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers

On Tue, 2023-10-31 at 12:22 +0000, Paul Durrant wrote:
>
> >
> > As I said, this patch stands even *after* we fix kvmclock, because
> > it handles the timer delta calculation from an single TSC read.
> >
> > But overengineering a timer reset on KVM_REQ_CLOCK_UPDATE would not.
>
> I'm not sure what you intend to do to kvmlock, so not sure whether we'll
> still need the __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc)
> but this patch (with the extra check on validity of hv_clock) does fix
> the drift so...
>
> Reviewed-by: Paul Durrant <[email protected]>

Ta. And no, I'm not quite sure what I'm going to do with kvmclock for
the general case yet. The more I look at it, the more I realise how
broken it is.

Last week I thought it was just about the way KVM 'jumps' the kvmclock
and yanks it back to the host's CLOCK_MONOTONIC_RAW, but I thought KVM
at *least* managed to do it right between those times. But no, this
patch is addressing the fact that even *without* those clock jumps, KVM
doesn't manage to calculate the guest clock the same way that it tells
the guest to... and thus doesn't get the same results :)

I think it involves get_kvmclock_ns() using the frequency given in the
KVM-wide (not per-vCPU) KVM_SET_TSC_KHZ ioctl, and scaling via that to
get the guest clock. That should match, without having to have a
specific vCPU's hv_clock to play with. And maybe we can also have a
get_kvmclock_ns_at() which takes a host TSC value, and the timer code
from this patch can use that instead of using __pvclock_read_cycles()
directly.

That's probably the easy part. Fixing the 'resync' to
CLOCK_MONOTONIC_RAW, while keeping things working for the now-
considered-pathological !CONSTANT_TSC case, will be slightly more fun.
As well as suspend etc. in the CONSTANT_TSC case, of course.

And replacing that stupid KVM_CLOCK_REALTIME with something that uses
CLOCK_TAI. Or maybe just making it export the tai_offset at the same
moment?


Attachments:
smime.p7s (5.83 kB)

2023-10-31 19:45:21

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers

On Mon, 2023-10-30 at 15:50 +0000, David Woodhouse wrote:
>
> +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> +{
> + struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> + unsigned long seq;
> + int mode;
> + u64 ns;
> +
> + do {
> + seq = read_seqcount_begin(&gtod->seq);
> + ns = gtod->clock.base_cycles;
> + ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> + ns >>= gtod->clock.shift;
> + ns += ktime_to_ns(ktime_add(gtod->clock.offset,
> gtod->offs_boot));
> + } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
> + *t = ns;
> +
> + return mode;
> +}
> +

Hrm, that's basically cargo-culted from do_monotonic_raw() immediately
above it. Should it be adding gtod->offs_boot?

Empirically the answer would appear to be 'no'. When gtod->offs_boot is
non-zero, I see kvm_get_monotonic_and_clockread() returning values
which are precisely that far in advance of what ktime_get() reports.



Attachments:
smime.p7s (5.83 kB)

2023-10-31 22:39:05

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers

On Tue, 2023-10-31 at 19:45 +0000, David Woodhouse wrote:
> On Mon, 2023-10-30 at 15:50 +0000, David Woodhouse wrote:
> >
> > +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> > +{
> > + struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> > + unsigned long seq;
> > + int mode;
> > + u64 ns;
> > +
> > + do {
> > + seq = read_seqcount_begin(&gtod->seq);
> > + ns = gtod->clock.base_cycles;
> > + ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> > + ns >>= gtod->clock.shift;
> > + ns += ktime_to_ns(ktime_add(gtod->clock.offset, gtod->offs_boot));
> > + } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
> > + *t = ns;
> > +
> > + return mode;
> > +}
> > +
>
> Hrm, that's basically cargo-culted from do_monotonic_raw() immediately
> above it. Should it be adding gtod->offs_boot?
>
> Empirically the answer would appear to be 'no'. When gtod->offs_boot is
> non-zero, I see kvm_get_monotonic_and_clockread() returning values
> which are precisely that far in advance of what ktime_get() reports.


.... because the do_monotonic_raw() function, despite the simple
clarity of its name... doesn't actually return the CLOCK_MONOTONIC_RAW
time. Of course it doesn't. Why would a function with that name return
the MONOTONIC_RAW clock?

It actually returns the same as get_kvmclock_base_ns(), which is

/* Count up from boot time, but with the frequency of the raw clock. */
return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));

I feel that Grey's Law is starting to apply to this clock stuff. This
is starting to be indistinguishable from malice ;)


Attachments:
smime.p7s (5.83 kB)

2023-10-31 23:14:08

by David Woodhouse

[permalink] [raw]
Subject: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

From: David Woodhouse <[email protected]>

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Signed-off-by: David Woodhouse <[email protected]>
Reviewed-by: Paul Durrant <[email protected]>

---
v2: 
• Fall back to get_kvmclock_ns() if vcpu-arch.hv_clock isn't set up
yet, with a big comment explaining why that's actually OK.
• Fix do_monotonic() *not* to add the boot time offset.
• Rename do_monotonic_raw() → do_kvmclock_base() and add a comment
to make it clear that it *does* add the boot time offset. That
was just left as a bear trap for the unwary developer, wasn't it?

arch/x86/kvm/x86.c | 61 +++++++++++++++++++++--
arch/x86/kvm/x86.h | 1 +
arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++-----------
3 files changed, 149 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6eaab714d90a..e479637af42c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2844,7 +2844,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
return v * clock->mult;
}

-static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
+/*
+ * As with get_kvmclock_base_ns(), this counts from boot time, at the
+ * frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot).
+ */
+static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
unsigned long seq;
@@ -2863,6 +2867,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
return mode;
}

+/*
+ * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
+ * no boot time offset.
+ */
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+ struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+ unsigned long seq;
+ int mode;
+ u64 ns;
+
+ do {
+ seq = read_seqcount_begin(&gtod->seq);
+ ns = gtod->clock.base_cycles;
+ ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+ ns >>= gtod->clock.shift;
+ ns += ktime_to_ns(gtod->clock.offset);
+ } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+ *t = ns;
+
+ return mode;
+}
+
static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2884,18 +2911,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
return mode;
}

-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
+ * reports the TSC value from which it do so. Returns true if host is
+ * using TSC based clocksource.
+ */
static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
{
/* checked again under seqlock below */
if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
return false;

- return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
- tsc_timestamp));
+ return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
+ tsc_timestamp));
}

-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+ /* checked again under seqlock below */
+ if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+ return false;
+
+ return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+ tsc_timestamp));
+}
+
+/*
+ * Calculates CLOCK_REALTIME and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ *
+ * DO NOT USE this for anything related to migration. You want CLOCK_TAI
+ * for that.
+ */
static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
u64 *tsc_timestamp)
{
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1e7be1f6ab29..c08c6f729965 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -293,6 +293,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);

u64 get_kvmclock_ns(struct kvm *kvm);
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);

int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
gva_t addr, void *val, unsigned int bytes,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 751d9a984668..e3d2d63eef34 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@
#include <xen/interface/sched.h>

#include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>

#include "cpuid.h"
#include "trace.h"
@@ -158,8 +159,93 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
return HRTIMER_NORESTART;
}

-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
+ bool linux_wa)
{
+ uint64_t guest_now;
+ int64_t kernel_now, delta;
+
+ /*
+ * The guest provides the requested timeout in absolute nanoseconds
+ * of the KVM clock — as *it* sees it, based on the scaled TSC and
+ * the pvclock information provided by KVM.
+ *
+ * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
+ * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
+ * difference won't matter much as there is no cumulative effect.
+ *
+ * Calculate the time for some arbitrary point in time around "now"
+ * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
+ * delta between the kvmclock "now" value and the guest's requested
+ * timeout, apply the "Linux workaround" described below, and add
+ * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
+ * the absolute CLOCK_MONOTONIC time at which the timer should
+ * fire.
+ */
+ if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
+ static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ uint64_t host_tsc, guest_tsc;
+
+ if (!IS_ENABLED(CONFIG_64BIT) ||
+ !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+ /*
+ * Don't fall back to get_kvmclock_ns() because it's
+ * broken; it has a systemic error in its results
+ * because it scales directly from host TSC to
+ * nanoseconds, and doesn't scale first to guest TSC
+ * and then* to nanoseconds as the guest does.
+ *
+ * There is a small error introduced here because time
+ * continues to elapse between the ktime_get() and the
+ * subsequent rdtsc(). But not the systemic drift due
+ * to get_kvmclock_ns().
+ */
+ kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
+ host_tsc = rdtsc();
+ }
+
+ /* Calculate the guest kvmclock as the guest would do it. */
+ guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+ guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
+ guest_tsc);
+ } else {
+ /*
+ * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
+ *
+ * Also if the guest PV clock hasn't been set up yet, as is
+ * likely to be the case during migration when the vCPU has
+ * not been run yet. It would be possible to calculate the
+ * scaling factors properly in that case but there's not much
+ * point in doing so. The get_kvmclock_ns() drift accumulates
+ * over time, so it's OK to use it at startup. Besides, on
+ * migration there's going to be a little bit of skew in the
+ * precise moment at which timers fire anyway. Often they'll
+ * be in the "past" by the time the VM is running again after
+ * migration.
+ */
+ guest_now = get_kvmclock_ns(vcpu->kvm);
+ kernel_now = ktime_get();
+ }
+
+ delta = guest_abs - guest_now;
+
+ /* Xen has a 'Linux workaround' in do_set_timer_op() which
+ * checks for negative absolute timeout values (caused by
+ * integer overflow), and for values about 13 days in the
+ * future (2^50ns) which would be caused by jiffies
+ * overflow. For those cases, it sets the timeout 100ms in
+ * the future (not *too* soon, since if a guest really did
+ * set a long timeout on purpose we don't want to keep
+ * churning CPU time by waking it up).
+ */
+ if (linux_wa) {
+ if ((unlikely((int64_t)guest_abs < 0 ||
+ (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
+ delta = 100 * NSEC_PER_MSEC;
+ guest_abs = guest_now + delta;
+ }
+ }
+
/*
* Avoid races with the old timer firing. Checking timer_expires
* to avoid calling hrtimer_cancel() will only have false positives
@@ -171,12 +257,11 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
atomic_set(&vcpu->arch.xen.timer_pending, 0);
vcpu->arch.xen.timer_expires = guest_abs;

- if (delta_ns <= 0) {
+ if (delta <= 0) {
xen_timer_callback(&vcpu->arch.xen.timer);
} else {
- ktime_t ktime_now = ktime_get();
hrtimer_start(&vcpu->arch.xen.timer,
- ktime_add_ns(ktime_now, delta_ns),
+ ktime_add_ns(kernel_now, delta),
HRTIMER_MODE_ABS_HARD);
}
}
@@ -945,8 +1030,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
/* Start the timer if the new value has a valid vector+expiry. */
if (data->u.timer.port && data->u.timer.expires_ns)
kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
- data->u.timer.expires_ns -
- get_kvmclock_ns(vcpu->kvm));
+ false);

r = 0;
break;
@@ -1389,7 +1473,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
{
struct vcpu_set_singleshot_timer oneshot;
struct x86_exception e;
- s64 delta;

if (!kvm_xen_timer_enabled(vcpu))
return false;
@@ -1423,9 +1506,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
return true;
}

- /* A delta <= 0 results in an immediate callback, which is what we want */
- delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
- kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
+ kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
*r = 0;
return true;

@@ -1449,25 +1530,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
return false;

if (timeout) {
- uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
- int64_t delta = timeout - guest_now;
-
- /* Xen has a 'Linux workaround' in do_set_timer_op() which
- * checks for negative absolute timeout values (caused by
- * integer overflow), and for values about 13 days in the
- * future (2^50ns) which would be caused by jiffies
- * overflow. For those cases, it sets the timeout 100ms in
- * the future (not *too* soon, since if a guest really did
- * set a long timeout on purpose we don't want to keep
- * churning CPU time by waking it up).
- */
- if (unlikely((int64_t)timeout < 0 ||
- (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
- delta = 100 * NSEC_PER_MSEC;
- timeout = guest_now + delta;
- }
-
- kvm_xen_start_timer(vcpu, timeout, delta);
+ kvm_xen_start_timer(vcpu, timeout, true);
} else {
kvm_xen_stop_timer(vcpu);
}
--
2.41.0



Attachments:
smime.p7s (5.83 kB)

2023-11-07 01:46:51

by Dongli Zhang

[permalink] [raw]
Subject: Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

Hi David,

On 10/31/23 16:13, David Woodhouse wrote:
> From: David Woodhouse <[email protected]>
>
> A test program such as http://david.woodhou.se/timerlat.c confirms user
> reports that timers are increasingly inaccurate as the lifetime of a
> guest increases. Reporting the actual delay observed when asking for
> 100µs of sleep, it starts off OK on a newly-launched guest but gets
> worse over time, giving incorrect sleep times:
>
> root@ip-10-0-193-21:~# ./timerlat -c -n 5
> 00000000 latency 103243/100000 (3.2430%)
> 00000001 latency 103243/100000 (3.2430%)
> 00000002 latency 103242/100000 (3.2420%)
> 00000003 latency 103245/100000 (3.2450%)
> 00000004 latency 103245/100000 (3.2450%)
>
> The biggest problem is that get_kvmclock_ns() returns inaccurate values
> when the guest TSC is scaled. The guest sees a TSC value scaled from the
> host TSC by a mul/shift conversion (hopefully done in hardware). The
> guest then converts that guest TSC value into nanoseconds using the
> mul/shift conversion given to it by the KVM pvclock information.
>
> But get_kvmclock_ns() performs only a single conversion directly from
> host TSC to nanoseconds, giving a different result. A test program at
> http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
> over a day.
>
> It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
> that. The actual guest hv_clock is per-CPU, and *theoretically* each
> vCPU could be running at a *different* frequency. But this patch is
> needed anyway because...
>
> The other issue with Xen timers was that the code would snapshot the
> host CLOCK_MONOTONIC at some point in time, and then... after a few
> interrupts may have occurred, some preemption perhaps... would also read
> the guest's kvmclock. Then it would proceed under the false assumption
> that those two happened at the *same* time. Any time which *actually*
> elapsed between reading the two clocks was introduced as inaccuracies
> in the time at which the timer fired.
>
> Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
> host TSC just *once*, then use the returned TSC value to calculate the
> kvmclock (making sure to do that the way the guest would instead of
> making the same mistake get_kvmclock_ns() does).
>
> Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
> timers still have to use CLOCK_MONOTONIC. In practice the difference
> between the two won't matter over the timescales involved, as the
> *absolute* values don't matter; just the delta.
>
> This does mean a new variant of kvm_get_time_and_clockread() is needed;
> called kvm_get_monotonic_and_clockread() because that's what it does.
>
> Signed-off-by: David Woodhouse <[email protected]>
> Reviewed-by: Paul Durrant <[email protected]>
>
> ---
> v2: 
> • Fall back to get_kvmclock_ns() if vcpu-arch.hv_clock isn't set up
> yet, with a big comment explaining why that's actually OK.
> • Fix do_monotonic() *not* to add the boot time offset.
> • Rename do_monotonic_raw() → do_kvmclock_base() and add a comment
> to make it clear that it *does* add the boot time offset. That
> was just left as a bear trap for the unwary developer, wasn't it?
>
> arch/x86/kvm/x86.c | 61 +++++++++++++++++++++--
> arch/x86/kvm/x86.h | 1 +
> arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++-----------
> 3 files changed, 149 insertions(+), 34 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6eaab714d90a..e479637af42c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2844,7 +2844,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
> return v * clock->mult;
> }
>
> -static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> +/*
> + * As with get_kvmclock_base_ns(), this counts from boot time, at the
> + * frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot).
> + */
> +static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
> {
> struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> unsigned long seq;
> @@ -2863,6 +2867,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> return mode;
> }
>
> +/*
> + * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
> + * no boot time offset.
> + */
> +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> +{
> + struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> + unsigned long seq;
> + int mode;
> + u64 ns;
> +
> + do {
> + seq = read_seqcount_begin(&gtod->seq);
> + ns = gtod->clock.base_cycles;
> + ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> + ns >>= gtod->clock.shift;
> + ns += ktime_to_ns(gtod->clock.offset);
> + } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
> + *t = ns;
> +
> + return mode;
> +}
> +
> static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
> {
> struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> @@ -2884,18 +2911,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
> return mode;
> }
>
> -/* returns true if host is using TSC based clocksource */
> +/*
> + * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
> + * reports the TSC value from which it do so. Returns true if host is
> + * using TSC based clocksource.
> + */
> static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
> {
> /* checked again under seqlock below */
> if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
> return false;
>
> - return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
> - tsc_timestamp));
> + return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
> + tsc_timestamp));
> }
>
> -/* returns true if host is using TSC based clocksource */
> +/*
> + * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
> + * so. Returns true if host is using TSC based clocksource.
> + */
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
> +{
> + /* checked again under seqlock below */
> + if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
> + return false;
> +
> + return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
> + tsc_timestamp));
> +}
> +
> +/*
> + * Calculates CLOCK_REALTIME and reports the TSC value from which it did
> + * so. Returns true if host is using TSC based clocksource.
> + *
> + * DO NOT USE this for anything related to migration. You want CLOCK_TAI
> + * for that.
> + */
> static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> u64 *tsc_timestamp)
> {
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1e7be1f6ab29..c08c6f729965 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -293,6 +293,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
> void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
>
> u64 get_kvmclock_ns(struct kvm *kvm);
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
>
> int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
> gva_t addr, void *val, unsigned int bytes,
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 751d9a984668..e3d2d63eef34 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -24,6 +24,7 @@
> #include <xen/interface/sched.h>
>
> #include <asm/xen/cpuid.h>
> +#include <asm/pvclock.h>
>
> #include "cpuid.h"
> #include "trace.h"
> @@ -158,8 +159,93 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
> return HRTIMER_NORESTART;
> }
>
> -static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
> +static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
> + bool linux_wa)
> {
> + uint64_t guest_now;
> + int64_t kernel_now, delta;
> +
> + /*
> + * The guest provides the requested timeout in absolute nanoseconds
> + * of the KVM clock — as *it* sees it, based on the scaled TSC and
> + * the pvclock information provided by KVM.
> + *
> + * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
> + * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
> + * difference won't matter much as there is no cumulative effect.
> + *
> + * Calculate the time for some arbitrary point in time around "now"
> + * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
> + * delta between the kvmclock "now" value and the guest's requested
> + * timeout, apply the "Linux workaround" described below, and add
> + * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
> + * the absolute CLOCK_MONOTONIC time at which the timer should
> + * fire.
> + */
> + if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
> + static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {

If there any reason to use both vcpu->kvm->arch.use_master_clock and
X86_FEATURE_CONSTANT_TSC?

I think even __get_kvmclock() would not require both cases at the same time?

3071 if (ka->use_master_clock &&
3072 (static_cpu_has(X86_FEATURE_CONSTANT_TSC) ||
__this_cpu_read(cpu_tsc_khz))) {

> + uint64_t host_tsc, guest_tsc;
> +
> + if (!IS_ENABLED(CONFIG_64BIT) ||
> + !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
> + /*
> + * Don't fall back to get_kvmclock_ns() because it's
> + * broken; it has a systemic error in its results
> + * because it scales directly from host TSC to
> + * nanoseconds, and doesn't scale first to guest TSC
> + * and then* to nanoseconds as the guest does.
> + *
> + * There is a small error introduced here because time
> + * continues to elapse between the ktime_get() and the
> + * subsequent rdtsc(). But not the systemic drift due
> + * to get_kvmclock_ns().
> + */
> + kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
> + host_tsc = rdtsc();
> + }
> +
> + /* Calculate the guest kvmclock as the guest would do it. */
> + guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> + guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
> + guest_tsc);
> + } else {
> + /*
> + * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
> + *
> + * Also if the guest PV clock hasn't been set up yet, as is
> + * likely to be the case during migration when the vCPU has
> + * not been run yet. It would be possible to calculate the
> + * scaling factors properly in that case but there's not much
> + * point in doing so. The get_kvmclock_ns() drift accumulates
> + * over time, so it's OK to use it at startup. Besides, on
> + * migration there's going to be a little bit of skew in the
> + * precise moment at which timers fire anyway. Often they'll
> + * be in the "past" by the time the VM is running again after
> + * migration.
> + */
> + guest_now = get_kvmclock_ns(vcpu->kvm);
> + kernel_now = ktime_get();

1. Can I assume the issue is still there if we fall into the "else" case? That
is, the increasing inaccuracy as the VM has been up for longer and longer time?

If that is the case, which may be better?

(1) get_kvmclock_ns(), or
(2) always get_kvmclock_base_ns() + ka->kvmclock_offset, when pvclock is not
enabled, regardless whether master clock is used. At least, the inaccurary is
not going to increase over guest time?


2. I see 3 scenarios here:

(1) vcpu->arch.hv_clock.version and master clock is used.

In this case, the bugfix looks good.

(2) The master clock is used. However, pv clock is not enabled.

In this case, the bug is not resolved? ... even the master clock is used.

(3) The master clock is not used.

We fall into get_kvmclock_base_ns() + ka->kvmclock_offset. The behavior is not
changed. This looks good.


Just from my own point: as this patch involves relatively complex changes, I
would suggest resolve the issue, but not use a temporary solution :)

(I see you mentioned that you will be back with get_kvmclock_ns())


Based on your bug fix, I see the below cases:

If master clock is not used:
get_kvmclock_base_ns() + ka->kvmclock_offset

If master clock is used:
If pvclock is enabled:
use the &vcpu->arch.hv_clock to get current guest time
Else
create a temporary hv_clock, based on masterclock.


Regarding the temporary solution, I would suggest create a new API to
encapsulate and fulfill above scenarios, if we are not going to touch
__get_kvmclock() at this time point.


Thank you very much!

Dongli Zhang


> + }
> +
> + delta = guest_abs - guest_now;
> +
> + /* Xen has a 'Linux workaround' in do_set_timer_op() which
> + * checks for negative absolute timeout values (caused by
> + * integer overflow), and for values about 13 days in the
> + * future (2^50ns) which would be caused by jiffies
> + * overflow. For those cases, it sets the timeout 100ms in
> + * the future (not *too* soon, since if a guest really did
> + * set a long timeout on purpose we don't want to keep
> + * churning CPU time by waking it up).
> + */
> + if (linux_wa) {
> + if ((unlikely((int64_t)guest_abs < 0 ||
> + (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
> + delta = 100 * NSEC_PER_MSEC;
> + guest_abs = guest_now + delta;
> + }
> + }
> +
> /*
> * Avoid races with the old timer firing. Checking timer_expires
> * to avoid calling hrtimer_cancel() will only have false positives
> @@ -171,12 +257,11 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
> atomic_set(&vcpu->arch.xen.timer_pending, 0);
> vcpu->arch.xen.timer_expires = guest_abs;
>
> - if (delta_ns <= 0) {
> + if (delta <= 0) {
> xen_timer_callback(&vcpu->arch.xen.timer);
> } else {
> - ktime_t ktime_now = ktime_get();
> hrtimer_start(&vcpu->arch.xen.timer,
> - ktime_add_ns(ktime_now, delta_ns),
> + ktime_add_ns(kernel_now, delta),
> HRTIMER_MODE_ABS_HARD);
> }
> }
> @@ -945,8 +1030,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> /* Start the timer if the new value has a valid vector+expiry. */
> if (data->u.timer.port && data->u.timer.expires_ns)
> kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
> - data->u.timer.expires_ns -
> - get_kvmclock_ns(vcpu->kvm));
> + false);
>
> r = 0;
> break;
> @@ -1389,7 +1473,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
> {
> struct vcpu_set_singleshot_timer oneshot;
> struct x86_exception e;
> - s64 delta;
>
> if (!kvm_xen_timer_enabled(vcpu))
> return false;
> @@ -1423,9 +1506,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
> return true;
> }
>
> - /* A delta <= 0 results in an immediate callback, which is what we want */
> - delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
> - kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
> + kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
> *r = 0;
> return true;
>
> @@ -1449,25 +1530,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
> return false;
>
> if (timeout) {
> - uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
> - int64_t delta = timeout - guest_now;
> -
> - /* Xen has a 'Linux workaround' in do_set_timer_op() which
> - * checks for negative absolute timeout values (caused by
> - * integer overflow), and for values about 13 days in the
> - * future (2^50ns) which would be caused by jiffies
> - * overflow. For those cases, it sets the timeout 100ms in
> - * the future (not *too* soon, since if a guest really did
> - * set a long timeout on purpose we don't want to keep
> - * churning CPU time by waking it up).
> - */
> - if (unlikely((int64_t)timeout < 0 ||
> - (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
> - delta = 100 * NSEC_PER_MSEC;
> - timeout = guest_now + delta;
> - }
> -
> - kvm_xen_start_timer(vcpu, timeout, delta);
> + kvm_xen_start_timer(vcpu, timeout, true);
> } else {
> kvm_xen_stop_timer(vcpu);
> }

2023-11-07 08:18:00

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

On Mon, 2023-11-06 at 17:44 -0800, Dongli Zhang wrote:
> > +       if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
> > +           static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
>
> If there any reason to use both vcpu->kvm->arch.use_master_clock and
> X86_FEATURE_CONSTANT_TSC?

Er, paranoia? I'll recheck.

> I think even __get_kvmclock() would not require both cases at the same time?
>
>  3071         if (ka->use_master_clock &&
>  3072             (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
>

But it does. That requires ka->use_master_clock (line 3071) AND that we
know the current CPU's TSC frequency (line 3072).

My code insists on the CONSTANT_TSC form of "knowing the current CPU's
TSC frequency" because even with a get_cpu(), it's not clear the guest
*was* running on this vCPU when it did its calculations. So I don't
want to go anywhere near the !CONSTANT_TSC case; it can use the
fallback.


> > +       } else {
> > +               /*
> > +                * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
> > +                *
> > +                * Also if the guest PV clock hasn't been set up yet, as is
> > +                * likely to be the case during migration when the vCPU has
> > +                * not been run yet. It would be possible to calculate the
> > +                * scaling factors properly in that case but there's not much
> > +                * point in doing so. The get_kvmclock_ns() drift accumulates
> > +                * over time, so it's OK to use it at startup. Besides, on
> > +                * migration there's going to be a little bit of skew in the
> > +                * precise moment at which timers fire anyway. Often they'll
> > +                * be in the "past" by the time the VM is running again after
> > +                * migration.
> > +                */
> > +               guest_now = get_kvmclock_ns(vcpu->kvm);
> > +               kernel_now = ktime_get();
>
> 1. Can I assume the issue is still there if we fall into the "else" case? That
> is, the increasing inaccuracy as the VM has been up for longer and longer time?
>
> If that is the case, which may be better?
>
> (1) get_kvmclock_ns(), or
> (2) always get_kvmclock_base_ns() + ka->kvmclock_offset, when pvclock is not
> enabled, regardless whether master clock is used. At least, the inaccurary is
> not going to increase over guest time?

No, those are both wrong, and drifting further away over time. They are
each *differently* wrong, which is why periodically clamping (1) to (2)
is also broken, as previously discussed. I know you've got a patch to
do that clamping more *often* which would slightly reduce the pain
because the kvmclock wouldn't jump backwards so *far* each time... but
it's still wrong to do so at all (in either direction).

>
> 2. I see 3 scenarios here:
>
> (1) vcpu->arch.hv_clock.version and master clock is used.
>
> In this case, the bugfix looks good.
>
> (2) The master clock is used. However, pv clock is not enabled.
>
> In this case, the bug is not resolved? ... even the master clock is used.

Under Xen the PV clock is always enabled. It's in the vcpu_info
structure which is required for any kind of event channel setup.

>
> (3) The master clock is not used.
>
> We fall into get_kvmclock_base_ns() + ka->kvmclock_offset. The behavior is not
> changed. This looks good.
>
>
> Just from my own point: as this patch involves relatively complex changes, I
> would suggest resolve the issue, but not use a temporary solution :)
>

This is the conversation I had with Paul on Tuesday, when he wanted me
to fix up this "(3) / behaviour is not changed" case. And yes, I argued
that we *don't* want a temporary solution for this case. Because yes:

> (I see you mentioned that you will be back with get_kvmclock_ns())

We need to fix get_kvmclock_ns() anyway. The systemic drift *and* the
fact that we periodically clamp it to a different clock and make it
jump. I was working on the former and have something half-done but was
preempted by the realisation that the QEMU soft freeze is today, and I
needed to flush my QEMU patch queue.

But even once we fix get_kvmclock_ns(), *this* patch stands. Because it
*also* addresses the "now" problem, where we get the time by one clock
... and then some time passes ... and we get the time by another clock,
and subtract one from the other as if they were the *same* time.

Using kvm_get_monotonic_and_clockread() gives us a single TSC read
corresponding to the CLOCK_MONOTONIC time, from which we can calculate
the kvmclock time. We just *happen* to calculate it correctly here,
unlike anywhere else in KVM.

> Based on your bug fix, I see the below cases:
>
> If master clock is not used:
>     get_kvmclock_base_ns() + ka->kvmclock_offset
>
> If master clock is used:
>     If pvclock is enabled:
>         use the &vcpu->arch.hv_clock to get current guest time
>     Else
>         create a temporary hv_clock, based on masterclock.

I don't want to do that last 'else' clause yet, because that feels like
a temporary workaround. It should be enough to call get_kvmclock_ns(),
once we fix it.




Attachments:
smime.p7s (5.83 kB)