2023-09-13 12:25:37

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values

On 13/9/2023 6:51 pm, David Woodhouse wrote:
> On Wed, 2023-09-13 at 18:37 +0800, Like Xu wrote:
>> From: Like Xu <[email protected]>
>>
>> The legacy API for setting the TSC is fundamentally broken, and only
>> allows userspace to set a TSC "now", without any way to account for
>> time lost to preemption between the calculation of the value, and the
>> kernel eventually handling the ioctl.
>>
>> To work around this we have had a hack which, if a TSC is set with a
>> value which is within a second's worth of a previous vCPU, assumes that
>> userspace actually intended them to be in sync and adjusts the newly-
>> written TSC value accordingly.
>>
>> Thus, when a VMM restores a guest after suspend or migration using the
>> legacy API, the TSCs aren't necessarily *right*, but at least they're
>> in sync.
>>
>> This trick falls down when restoring a guest which genuinely has been
>> running for less time than the 1 second of imprecision which we allow
>> for in the legacy API. On *creation* the first vCPU starts its TSC
>> counting from zero, and the subsequent vCPUs synchronize to that. But
>> then when the VMM tries to set the intended TSC value, because that's
>> within a second of what the last TSC synced to, it just adjusts it to
>> match that.
>>
> Proofreading my own words here... "it just adjusts it to match" is
> using the same pronoun for different things and is probably hard to
> follow. Perhaps "KVM just adjusts it to match" is nicer.
>
>> The correct answer is for the VMM not to use the legacy API of course.
>>
>> But we can pile further hacks onto our existing hackish ABI, and
>> declare that the *first* value written by userspace (on any vCPU)
>> should not be subject to this 'correction' to make it sync up with
>> values that only from the kernel's default vCPU creation.
>
> ^^
> ... that only *come* from the kernel's...
>
>
>>
>> To that end: Add a flag in kvm->arch.user_set_tsc, protected by
>> kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
>> this KVM *has* been set by userspace. Make the 1-second slop hack only
>> trigger if that flag is already set.
>>
>> Reported-by: Yong He <[email protected]>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
>> Suggested-by: Oliver Upton <[email protected]>
>> Original-by: Oliver Upton <[email protected]>
>> Original-by: Sean Christopherson <[email protected]>
>> Co-developed-by: David Woodhouse <[email protected]>
>> Signed-off-by: David Woodhouse <[email protected]>
>> Signed-off-by: Like Xu <[email protected]>
>> Tested-by: Yong He <[email protected]>
>
> Reviewed-by: David Woodhouse <[email protected]>
>
> Please remove the 'Signed-off-by' from me. You must never ever *type* a
> signed-off-by line for anyone else. You only ever cut and paste those
> intact when they have provided them for *themselves*.

Nice rule, sorry and thanks for the guidance.

>
> It's OK to remove the Co-developed-by: too. You did the actual typing
> of the code here; I just heckled :)

Thank you for reviewing it.

I'll wait for a cooling off period to see if the maintainers need me to post v7.


2023-09-13 15:06:39

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values

On Wed, Sep 13, 2023, Like Xu wrote:
> I'll wait for a cooling off period to see if the maintainers need me to post v7.

You should have waiting to post v5, let alone v6. Resurrecting a thread after a
month and not waiting even 7 hours for others to respond is extremely frustrating.

2023-09-14 08:32:03

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values

On 13/9/2023 10:47 pm, Sean Christopherson wrote:
> On Wed, Sep 13, 2023, Like Xu wrote:
>> I'll wait for a cooling off period to see if the maintainers need me to post v7.
>
> You should have waiting to post v5, let alone v6. Resurrecting a thread after a
> month and not waiting even 7 hours for others to respond is extremely frustrating.

You are right. I don't seem to be keeping up with many of other issues. Sorry
for that.
Wish there were 48 hours in a day.

Back to this issue: for commit message, I'd be more inclined to David's
understanding,
but you have the gavel; and for proposed code diff, how about the following changes:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..9a7dfef9d32d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1324,6 +1324,7 @@ struct kvm_arch {
int nr_vcpus_matched_tsc;

u32 default_tsc_khz;
+ bool user_set_tsc;

seqcount_raw_spinlock_t pvclock_sc;
bool use_master_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..faaae8b3fec4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
u64 offset, u64 tsc,
kvm_track_tsc_matching(vcpu);
}

-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
{
+ u64 data = user_value ? *user_value : 0;
struct kvm *kvm = vcpu->kvm;
u64 offset, ns, elapsed;
unsigned long flags;
@@ -2728,27 +2729,45 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
u64 data)
elapsed = ns - kvm->arch.last_tsc_nsec;

if (vcpu->arch.virtual_tsc_khz) {
+ /*
+ * Force synchronization when creating or hotplugging a vCPU,
+ * i.e. when the TSC value is '0', to help keep clocks stable.
+ * If this is NOT a hotplug/creation case, skip synchronization
+ * on the first write from userspace so as not to misconstrue
+ * state restoration after live migration as an attempt from
+ * userspace to synchronize.
+ */
if (data == 0) {
- /*
- * detection of vcpu initialization -- need to sync
- * with other vCPUs. This particularly helps to keep
- * kvm_clock stable after CPU hotplug
- */
synchronizing = true;
- } else {
+ } else if (kvm->arch.user_set_tsc) {
u64 tsc_exp = kvm->arch.last_tsc_write +
nsec_to_cycles(vcpu, elapsed);
u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
/*
- * Special case: TSC write with a small delta (1 second)
- * of virtual cycle time against real time is
- * interpreted as an attempt to synchronize the CPU.
+ * Here lies UAPI baggage: when a user-initiated TSC write has
+ * a small delta (1 second) of virtual cycle time against the
+ * previously set vCPU, we assume that they were intended to be
+ * in sync and the delta was only due to the racy nature of the
+ * legacy API.
+ *
+ * This trick falls down when restoring a guest which genuinely
+ * has been running for less time than the 1 second of imprecision
+ * which we allow for in the legacy API. In this case, the first
+ * value written by userspace (on any vCPU) should not be subject
+ * to this 'correction' to make it sync up with values that only
+ * from the kernel's default vCPU creation. Make the 1-second slop
+ * hack only trigger if the user_set_tsc flag is already set.
+ *
+ * The correct answer is for the VMM not to use the legacy API.
*/
synchronizing = data < tsc_exp + tsc_hz &&
data + tsc_hz > tsc_exp;
}
}

+ if (user_value)
+ kvm->arch.user_set_tsc = true;
+
/*
* For a reliable TSC, we can match TSC offsets, and for an unstable
* TSC, we add elapsed time in this computation. We could let the
@@ -3777,7 +3796,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
msr_data *msr_info)
break;
case MSR_IA32_TSC:
if (msr_info->host_initiated) {
- kvm_synchronize_tsc(vcpu, data);
+ kvm_synchronize_tsc(vcpu, &data);
} else {
u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
adjust_tsc_offset_guest(vcpu, adj);
@@ -5536,6 +5555,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
ns = get_kvmclock_base_ns();

+ kvm->arch.user_set_tsc = true;
__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);

@@ -11959,7 +11979,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
if (mutex_lock_killable(&vcpu->mutex))
return;
vcpu_load(vcpu);
- kvm_synchronize_tsc(vcpu, 0);
+ kvm_synchronize_tsc(vcpu, NULL);
vcpu_put(vcpu);

/* poll control enabled by default */
--
2.42.0

2023-09-14 13:27:35

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values

On Thu, 2023-09-14 at 11:50 +0800, Like Xu wrote:
> On 13/9/2023 10:47 pm, Sean Christopherson wrote:
> > On Wed, Sep 13, 2023, Like Xu wrote:
> > > I'll wait for a cooling off period to see if the maintainers need me to post v7.
> >
> > You should have waiting to post v5, let alone v6.  Resurrecting a thread after a
> > month and not waiting even 7 hours for others to respond is extremely frustrating.
>
> You are right. I don't seem to be keeping up with many of other issues. Sorry
> for that.
> Wish there were 48 hours in a day.
>
> Back to this issue: for commit message, I'd be more inclined to David's
> understanding,

The discussion that Sean and I had should probably be reflected in the
commit message too. To the end of the commit log you used for v6, after
the final 'To that end:…' paragraph, let's add:

Note that userspace can explicitly request a *synchronization* of the
TSC by writing zero. For the purpose of this patch, this counts as
"setting" the TSC. If userspace then subsequently writes an explicit
non-zero value which happens to be within 1 second of the previous
value, it will be 'corrected'. For that case, this preserves the prior
behaviour of KVM (which always applied the 1-second 'correction' 
regardless of user vs. kernel).

> @@ -2728,27 +2729,45 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
> u64 data)
>         elapsed = ns - kvm->arch.last_tsc_nsec;
>
>         if (vcpu->arch.virtual_tsc_khz) {
> +               /*
> +                * Force synchronization when creating or hotplugging a vCPU,
> +                * i.e. when the TSC value is '0', to help keep clocks stable.
> +                * If this is NOT a hotplug/creation case, skip synchronization
> +                * on the first write from userspace so as not to misconstrue
> +                * state restoration after live migration as an attempt from
> +                * userspace to synchronize.
> +                */

You cannot *misconstrue* an attempt from userspace to synchronize. If
userspace writes a zero, it's a sync attempt. If it's non-zero it's a
TSC value to be set. It's not very subtle :)

I think the 1-second slop thing is sufficiently documented in the 'else
if' clause below, so I started writing an alternative 'overall' comment
to go here and found it a bit redundant. So maybe let's just drop this
comment and add one back in the if (data == 0) case...

>                 if (data == 0) {
> -                       /*
> -                        * detection of vcpu initialization -- need to sync
> -                        * with other vCPUs. This particularly helps to keep
> -                        * kvm_clock stable after CPU hotplug
> -                        */


/*
* Force synchronization when creating a vCPU, or when
* userspace explicitly writes a zero value.
*/

>                         synchronizing = true;
> -               } else {
> +               } else if (kvm->arch.user_set_tsc) {
>                         u64 tsc_exp = kvm->arch.last_tsc_write +
>                                                 nsec_to_cycles(vcpu, elapsed);
>                         u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
>                         /*
> -                        * Special case: TSC write with a small delta (1 second)
> -                        * of virtual cycle time against real time is
> -                        * interpreted as an attempt to synchronize the CPU.
> +                        * Here lies UAPI baggage: when a user-initiated TSC write has
> +                        * a small delta (1 second) of virtual cycle time against the
> +                        * previously set vCPU, we assume that they were intended to be
> +                        * in sync and the delta was only due to the racy nature of the
> +                        * legacy API.
> +                        *
> +                        * This trick falls down when restoring a guest which genuinely
> +                        * has been running for less time than the 1 second of imprecision
> +                        * which we allow for in the legacy API. In this case, the first
> +                        * value written by userspace (on any vCPU) should not be subject
> +                        * to this 'correction' to make it sync up with values that only

Missing the word 'come' here too, in '…that only *come* from…',

> +                        * from the kernel's default vCPU creation. Make the 1-second slop
> +                        * hack only trigger if the user_set_tsc flag is already set.
> +                        *
> +                        * The correct answer is for the VMM not to use the legacy API.

Maybe we should drop this line, as we don't actually have a sane API
yet that VMMs can use instead.


Attachments:
smime.p7s (5.83 kB)

2023-09-19 14:01:18

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values

On 14/9/2023 3:31 pm, David Woodhouse wrote:
> On Thu, 2023-09-14 at 11:50 +0800, Like Xu wrote:
>> On 13/9/2023 10:47 pm, Sean Christopherson wrote:
>>> On Wed, Sep 13, 2023, Like Xu wrote:
>>>> I'll wait for a cooling off period to see if the maintainers need me to post v7.
>>>
>>> You should have waiting to post v5, let alone v6.  Resurrecting a thread after a
>>> month and not waiting even 7 hours for others to respond is extremely frustrating.
>>
>> You are right. I don't seem to be keeping up with many of other issues. Sorry
>> for that.
>> Wish there were 48 hours in a day.
>>
>> Back to this issue: for commit message, I'd be more inclined to David's
>> understanding,
>
> The discussion that Sean and I had should probably be reflected in the
> commit message too. To the end of the commit log you used for v6, after
> the final 'To that end:…' paragraph, let's add:
>
> Note that userspace can explicitly request a *synchronization* of the
> TSC by writing zero. For the purpose of this patch, this counts as
> "setting" the TSC. If userspace then subsequently writes an explicit
> non-zero value which happens to be within 1 second of the previous
> value, it will be 'corrected'. For that case, this preserves the prior
> behaviour of KVM (which always applied the 1-second 'correction'
> regardless of user vs. kernel).
>
>> @@ -2728,27 +2729,45 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
>> u64 data)
>>         elapsed = ns - kvm->arch.last_tsc_nsec;
>>
>>         if (vcpu->arch.virtual_tsc_khz) {
>> +               /*
>> +                * Force synchronization when creating or hotplugging a vCPU,
>> +                * i.e. when the TSC value is '0', to help keep clocks stable.
>> +                * If this is NOT a hotplug/creation case, skip synchronization
>> +                * on the first write from userspace so as not to misconstrue
>> +                * state restoration after live migration as an attempt from
>> +                * userspace to synchronize.
>> +                */
>
> You cannot *misconstrue* an attempt from userspace to synchronize. If
> userspace writes a zero, it's a sync attempt. If it's non-zero it's a
> TSC value to be set. It's not very subtle :)
>
> I think the 1-second slop thing is sufficiently documented in the 'else
> if' clause below, so I started writing an alternative 'overall' comment
> to go here and found it a bit redundant. So maybe let's just drop this
> comment and add one back in the if (data == 0) case...
>
>>                 if (data == 0) {
>> -                       /*
>> -                        * detection of vcpu initialization -- need to sync
>> -                        * with other vCPUs. This particularly helps to keep
>> -                        * kvm_clock stable after CPU hotplug
>> -                        */
>
>
> /*
> * Force synchronization when creating a vCPU, or when
> * userspace explicitly writes a zero value.
> */
>
>>                         synchronizing = true;
>> -               } else {
>> +               } else if (kvm->arch.user_set_tsc) {
>>                         u64 tsc_exp = kvm->arch.last_tsc_write +
>>                                                 nsec_to_cycles(vcpu, elapsed);
>>                         u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
>>                         /*
>> -                        * Special case: TSC write with a small delta (1 second)
>> -                        * of virtual cycle time against real time is
>> -                        * interpreted as an attempt to synchronize the CPU.
>> +                        * Here lies UAPI baggage: when a user-initiated TSC write has
>> +                        * a small delta (1 second) of virtual cycle time against the
>> +                        * previously set vCPU, we assume that they were intended to be
>> +                        * in sync and the delta was only due to the racy nature of the
>> +                        * legacy API.
>> +                        *
>> +                        * This trick falls down when restoring a guest which genuinely
>> +                        * has been running for less time than the 1 second of imprecision
>> +                        * which we allow for in the legacy API. In this case, the first
>> +                        * value written by userspace (on any vCPU) should not be subject
>> +                        * to this 'correction' to make it sync up with values that only
>
> Missing the word 'come' here too, in '…that only *come* from…',
>
>> +                        * from the kernel's default vCPU creation. Make the 1-second slop
>> +                        * hack only trigger if the user_set_tsc flag is already set.
>> +                        *
>> +                        * The correct answer is for the VMM not to use the legacy API.
>
> Maybe we should drop this line, as we don't actually have a sane API
> yet that VMMs can use instead.
>

Thanks for your comments, but not sure if Sean has any more concerns to move
forward:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..9a7dfef9d32d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1324,6 +1324,7 @@ struct kvm_arch {
int nr_vcpus_matched_tsc;

u32 default_tsc_khz;
+ bool user_set_tsc;

seqcount_raw_spinlock_t pvclock_sc;
bool use_master_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..11fbd2a4a370 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
u64 offset, u64 tsc,
kvm_track_tsc_matching(vcpu);
}

-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
{
+ u64 data = user_value ? *user_value : 0;
struct kvm *kvm = vcpu->kvm;
u64 offset, ns, elapsed;
unsigned long flags;
@@ -2730,25 +2731,37 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
u64 data)
if (vcpu->arch.virtual_tsc_khz) {
if (data == 0) {
/*
- * detection of vcpu initialization -- need to sync
- * with other vCPUs. This particularly helps to keep
- * kvm_clock stable after CPU hotplug
+ * Force synchronization when creating a vCPU, or when
+ * userspace explicitly writes a zero value.
*/
synchronizing = true;
- } else {
+ } else if (kvm->arch.user_set_tsc) {
u64 tsc_exp = kvm->arch.last_tsc_write +
nsec_to_cycles(vcpu, elapsed);
u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
/*
- * Special case: TSC write with a small delta (1 second)
- * of virtual cycle time against real time is
- * interpreted as an attempt to synchronize the CPU.
+ * Here lies UAPI baggage: when a user-initiated TSC write has
+ * a small delta (1 second) of virtual cycle time against the
+ * previously set vCPU, we assume that they were intended to be
+ * in sync and the delta was only due to the racy nature of the
+ * legacy API.
+ *
+ * This trick falls down when restoring a guest which genuinely
+ * has been running for less time than the 1 second of imprecision
+ * which we allow for in the legacy API. In this case, the first
+ * value written by userspace (on any vCPU) should not be subject
+ * to this 'correction' to make it sync up with values that only
+ * come from the kernel's default vCPU creation. Make the 1-second
+ * slop hack only trigger if the user_set_tsc flag is already set.
*/
synchronizing = data < tsc_exp + tsc_hz &&
data + tsc_hz > tsc_exp;
}
}

+ if (user_value)
+ kvm->arch.user_set_tsc = true;
+
/*
* For a reliable TSC, we can match TSC offsets, and for an unstable
* TSC, we add elapsed time in this computation. We could let the
@@ -3777,7 +3790,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
msr_data *msr_info)
break;
case MSR_IA32_TSC:
if (msr_info->host_initiated) {
- kvm_synchronize_tsc(vcpu, data);
+ kvm_synchronize_tsc(vcpu, &data);
} else {
u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
adjust_tsc_offset_guest(vcpu, adj);
@@ -5536,6 +5549,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
ns = get_kvmclock_base_ns();

+ kvm->arch.user_set_tsc = true;
__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);

@@ -11959,7 +11973,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
if (mutex_lock_killable(&vcpu->mutex))
return;
vcpu_load(vcpu);
- kvm_synchronize_tsc(vcpu, 0);
+ kvm_synchronize_tsc(vcpu, NULL);
vcpu_put(vcpu);

/* poll control enabled by default */

2023-09-25 08:42:44

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values

Ping for further comments and confirmation from Sean.
Could we trigger a new version for this issue ? Thanks.

On 19/9/2023 7:29 pm, Like Xu wrote:
> On 14/9/2023 3:31 pm, David Woodhouse wrote:
>> On Thu, 2023-09-14 at 11:50 +0800, Like Xu wrote:
>>> On 13/9/2023 10:47 pm, Sean Christopherson wrote:
>>>> On Wed, Sep 13, 2023, Like Xu wrote:
>>>>> I'll wait for a cooling off period to see if the maintainers need me to
>>>>> post v7.
>>>>
>>>> You should have waiting to post v5, let alone v6.  Resurrecting a thread
>>>> after a
>>>> month and not waiting even 7 hours for others to respond is extremely
>>>> frustrating.
>>>
>>> You are right. I don't seem to be keeping up with many of other issues. Sorry
>>> for that.
>>> Wish there were 48 hours in a day.
>>>
>>> Back to this issue: for commit message, I'd be more inclined to David's
>>> understanding,
>>
>> The discussion that Sean and I had should probably be reflected in the
>> commit message too. To the end of the commit log you used for v6, after
>> the final 'To that end:…' paragraph, let's add:
>>
>>   Note that userspace can explicitly request a *synchronization* of the
>>   TSC by writing zero. For the purpose of this patch, this counts as
>>   "setting" the TSC. If userspace then subsequently writes an explicit
>>   non-zero value which happens to be within 1 second of the previous
>>   value, it will be 'corrected'. For that case, this preserves the prior
>>   behaviour of KVM (which always applied the 1-second 'correction'
>>   regardless of user vs. kernel).
>>
>>> @@ -2728,27 +2729,45 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
>>> u64 data)
>>>          elapsed = ns - kvm->arch.last_tsc_nsec;
>>>
>>>          if (vcpu->arch.virtual_tsc_khz) {
>>> +               /*
>>> +                * Force synchronization when creating or hotplugging a vCPU,
>>> +                * i.e. when the TSC value is '0', to help keep clocks stable.
>>> +                * If this is NOT a hotplug/creation case, skip synchronization
>>> +                * on the first write from userspace so as not to misconstrue
>>> +                * state restoration after live migration as an attempt from
>>> +                * userspace to synchronize.
>>> +                */
>>
>> You cannot *misconstrue* an attempt from userspace to synchronize. If
>> userspace writes a zero, it's a sync attempt. If it's non-zero it's a
>> TSC value to be set. It's not very subtle :)
>>
>> I think the 1-second slop thing is sufficiently documented in the 'else
>> if' clause below, so I started writing an alternative 'overall' comment
>> to go here and found it a bit redundant. So maybe let's just drop this
>> comment and add one back in the if (data == 0) case...
>>
>>>                  if (data == 0) {
>>> -                       /*
>>> -                        * detection of vcpu initialization -- need to sync
>>> -                        * with other vCPUs. This particularly helps to keep
>>> -                        * kvm_clock stable after CPU hotplug
>>> -                        */
>>
>>
>>              /*
>>               * Force synchronization when creating a vCPU, or when
>>               * userspace explicitly writes a zero value.
>>               */
>>
>>>                          synchronizing = true;
>>> -               } else {
>>> +               } else if (kvm->arch.user_set_tsc) {
>>>                          u64 tsc_exp = kvm->arch.last_tsc_write +
>>>                                                  nsec_to_cycles(vcpu, elapsed);
>>>                          u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
>>>                          /*
>>> -                        * Special case: TSC write with a small delta (1 second)
>>> -                        * of virtual cycle time against real time is
>>> -                        * interpreted as an attempt to synchronize the CPU.
>>> +                        * Here lies UAPI baggage: when a user-initiated TSC
>>> write has
>>> +                        * a small delta (1 second) of virtual cycle time
>>> against the
>>> +                        * previously set vCPU, we assume that they were
>>> intended to be
>>> +                        * in sync and the delta was only due to the racy
>>> nature of the
>>> +                        * legacy API.
>>> +                        *
>>> +                        * This trick falls down when restoring a guest which
>>> genuinely
>>> +                        * has been running for less time than the 1 second
>>> of imprecision
>>> +                        * which we allow for in the legacy API. In this
>>> case, the first
>>> +                        * value written by userspace (on any vCPU) should
>>> not be subject
>>> +                        * to this 'correction' to make it sync up with
>>> values that only
>>
>> Missing the word 'come' here too, in '…that only *come* from…',
>>
>>> +                        * from the kernel's default vCPU creation. Make the
>>> 1-second slop
>>> +                        * hack only trigger if the user_set_tsc flag is
>>> already set.
>>> +                        *
>>> +                        * The correct answer is for the VMM not to use the
>>> legacy API.
>>
>> Maybe we should drop this line, as we don't actually have a sane API
>> yet that VMMs can use instead.
>>
>
> Thanks for your comments, but not sure if Sean has any more concerns to move
> forward:
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1a4def36d5bb..9a7dfef9d32d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1324,6 +1324,7 @@ struct kvm_arch {
>      int nr_vcpus_matched_tsc;
>
>      u32 default_tsc_khz;
> +    bool user_set_tsc;
>
>      seqcount_raw_spinlock_t pvclock_sc;
>      bool use_master_clock;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6c9c81e82e65..11fbd2a4a370 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
> u64 offset, u64 tsc,
>      kvm_track_tsc_matching(vcpu);
>  }
>
> -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
>  {
> +    u64 data = user_value ? *user_value : 0;
>      struct kvm *kvm = vcpu->kvm;
>      u64 offset, ns, elapsed;
>      unsigned long flags;
> @@ -2730,25 +2731,37 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
> u64 data)
>      if (vcpu->arch.virtual_tsc_khz) {
>          if (data == 0) {
>              /*
> -             * detection of vcpu initialization -- need to sync
> -             * with other vCPUs. This particularly helps to keep
> -             * kvm_clock stable after CPU hotplug
> +             * Force synchronization when creating a vCPU, or when
> +             * userspace explicitly writes a zero value.
>               */
>              synchronizing = true;
> -        } else {
> +        } else if (kvm->arch.user_set_tsc) {
>              u64 tsc_exp = kvm->arch.last_tsc_write +
>                          nsec_to_cycles(vcpu, elapsed);
>              u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
>              /*
> -             * Special case: TSC write with a small delta (1 second)
> -             * of virtual cycle time against real time is
> -             * interpreted as an attempt to synchronize the CPU.
> +             * Here lies UAPI baggage: when a user-initiated TSC write has
> +             * a small delta (1 second) of virtual cycle time against the
> +             * previously set vCPU, we assume that they were intended to be
> +             * in sync and the delta was only due to the racy nature of the
> +             * legacy API.
> +             *
> +             * This trick falls down when restoring a guest which genuinely
> +             * has been running for less time than the 1 second of imprecision
> +             * which we allow for in the legacy API. In this case, the first
> +             * value written by userspace (on any vCPU) should not be subject
> +             * to this 'correction' to make it sync up with values that only
> +             * come from the kernel's default vCPU creation. Make the 1-second
> +             * slop hack only trigger if the user_set_tsc flag is already set.
>               */
>              synchronizing = data < tsc_exp + tsc_hz &&
>                      data + tsc_hz > tsc_exp;
>          }
>      }
>
> +    if (user_value)
> +        kvm->arch.user_set_tsc = true;
> +
>      /*
>       * For a reliable TSC, we can match TSC offsets, and for an unstable
>       * TSC, we add elapsed time in this computation.  We could let the
> @@ -3777,7 +3790,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
> msr_data *msr_info)
>          break;
>      case MSR_IA32_TSC:
>          if (msr_info->host_initiated) {
> -            kvm_synchronize_tsc(vcpu, data);
> +            kvm_synchronize_tsc(vcpu, &data);
>          } else {
>              u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) -
> vcpu->arch.l1_tsc_offset;
>              adjust_tsc_offset_guest(vcpu, adj);
> @@ -5536,6 +5549,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
>          tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
>          ns = get_kvmclock_base_ns();
>
> +        kvm->arch.user_set_tsc = true;
>          __kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
>          raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>
> @@ -11959,7 +11973,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
>      if (mutex_lock_killable(&vcpu->mutex))
>          return;
>      vcpu_load(vcpu);
> -    kvm_synchronize_tsc(vcpu, 0);
> +    kvm_synchronize_tsc(vcpu, NULL);
>      vcpu_put(vcpu);
>
>      /* poll control enabled by default */

2023-09-25 14:03:16

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values



On 25 September 2023 09:36:47 CEST, Like Xu <[email protected]> wrote:
>Ping for further comments and confirmation from Sean.
>Could we trigger a new version for this issue ? Thanks.

I believe you just have a few tweaks to the comments; I'd resend that as v7.

2023-10-07 07:29:48

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values

On 25/9/2023 4:31 pm, David Woodhouse wrote:
>
>
> On 25 September 2023 09:36:47 CEST, Like Xu <[email protected]> wrote:
>> Ping for further comments and confirmation from Sean.
>> Could we trigger a new version for this issue ? Thanks.
>
> I believe you just have a few tweaks to the comments; I'd resend that as v7.

OK, thanks.

Since I don't seem to have seen V7 yet on the list,
just let me know if anyone has any new thoughts on this issue.

2023-10-07 12:51:40

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values



On 7 October 2023 08:29:08 BST, Like Xu <[email protected]> wrote:
>On 25/9/2023 4:31 pm, David Woodhouse wrote:
>>
>>
>> On 25 September 2023 09:36:47 CEST, Like Xu <[email protected]> wrote:
>>> Ping for further comments and confirmation from Sean.
>>> Could we trigger a new version for this issue ? Thanks.
>>
>> I believe you just have a few tweaks to the comments; I'd resend that as v7.
>
>OK, thanks.
>
>Since I don't seem to have seen V7 yet on the list,
>just let me know if anyone has any new thoughts on this issue.

Ah, the end of my sentence was intended to mean, "(if I were you) I (woul)d resend that as v7."

I didn't mean to suggest that I was actually going to do so myself. So if you didn't post a v7, you won't see it on the list yet :)