2021-04-28 02:25:33

by Zelin Deng

Subject: [PATCH] Guest system time jumps when new vCPUs is hot-added

Hello,
I have the following VM configuration:
...
<vcpu placement='static' current='1'>2</vcpu>
<cpu mode='host-passthrough'>
</cpu>
<clock offset='utc'>
<timer name='tsc' frequency='3000000000'/>
</clock>
...
After the VM has been up for a few minutes, I use "virsh setvcpus" to hot-add
a second vCPU into the VM, and the following dmesg output is observed:
[ 53.273484] CPU1 has been hot-added
[ 85.067135] SMP alternatives: switching to SMP code
[ 85.078409] x86: Booting SMP configuration:
[ 85.079027] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 85.080240] kvm-clock: cpu 1, msr 77601041, secondary cpu clock
[ 85.080450] smpboot: CPU 1 Converting physical 0 to logical die 1
[ 85.101228] TSC ADJUST compensate: CPU1 observed 169175101528 warp. Adjust: 169175101528
[ 141.513496] TSC ADJUST compensate: CPU1 observed 166 warp. Adjust: 169175101694
[ 141.513496] TSC synchronization [CPU#0 -> CPU#1]:
[ 141.513496] Measured 235 cycles TSC warp between CPUs, turning off TSC clock.
[ 141.513496] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 141.543996] KVM setup async PF for cpu 1
[ 141.544281] kvm-stealtime: cpu 1, msr 13bd2c080
[ 141.549381] Will online and init hotplugged CPU: 1

System time jumps from 85.101228 to 141.513496.

Guest:                                  KVM
-----                                   ------
check_tsc_sync_target()
wrmsrl(MSR_IA32_TSC_ADJUST,...)
                                        kvm_set_msr_common(vcpu,...)
                                        adjust_tsc_offset_guest(vcpu,...) //tsc_offset jumped
                                        vcpu_enter_guest(vcpu) //tsc_timestamp was not changed
...
rdtsc() jumped, system time jumped

tsc_timestamp must be updated before going back to the guest.
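
To make the size of the jump concrete, here is a minimal sketch of a
pvclock-style read. It is a toy model, not the kernel's implementation:
the struct and the toy_* function names are hypothetical, and it divides
by the TSC frequency directly instead of using the multiplier/shift pair
the real pvclock uses. With the 3 GHz TSC from the configuration above,
the 169175101528-cycle adjustment is about 56.4 seconds, which matches
the jump from 85.1 to 141.5 in the dmesg output.

```c
#include <stdint.h>

/* Toy per-vCPU clock record (hypothetical; field names follow this thread). */
struct toy_pvclock {
	uint64_t tsc_timestamp;  /* guest TSC when the record was last updated */
	uint64_t system_time_ns; /* guest time in ns at that same moment */
	uint64_t tsc_hz;         /* guest TSC frequency in Hz */
};

/* Guest-side read: system_time plus the TSC delta converted to ns. */
static uint64_t toy_pvclock_read_ns(const struct toy_pvclock *pv,
				    uint64_t guest_tsc)
{
	uint64_t delta = guest_tsc - pv->tsc_timestamp;

	/* Split into whole seconds and remainder to avoid 64-bit overflow. */
	return pv->system_time_ns +
	       (delta / pv->tsc_hz) * 1000000000ULL +
	       (delta % pv->tsc_hz) * 1000000000ULL / pv->tsc_hz;
}

/* Host-side refresh: re-anchor the record at the current guest TSC. */
static void toy_pvclock_update(struct toy_pvclock *pv, uint64_t guest_tsc,
			       uint64_t now_ns)
{
	pv->tsc_timestamp = guest_tsc;
	pv->system_time_ns = now_ns;
}
```

If the guest TSC is moved forward by the adjustment while tsc_timestamp
stays stale, toy_pvclock_read_ns() returns a time roughly 56.4 seconds
later. Calling toy_pvclock_update() with the adjusted TSC first removes
that delta.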

---
Zelin Deng (1):
KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset
is adjusted

arch/x86/kvm/x86.c | 4 ++++
1 file changed, 4 insertions(+)

--
1.8.3.1


2021-04-28 02:27:38

by Zelin Deng

Subject: [PATCH] KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted

When the guest writes MSR_IA32_TSC_ADJUST because of the TSC ADJUST
feature, especially when there is a large TSC warp (for example, a new
vCPU is hot-added into a VM that has been up for a long time), a large
value is added to tsc_offset before going back to the guest. Because
tsc_timestamp is not adjusted at the same time and pvclock is monotonic,
this causes the guest's system time to jump.
To fix this, notify KVM to update the vCPU's guest time before going
back to the guest.

Cc: [email protected]
Signed-off-by: Zelin Deng <[email protected]>
---
arch/x86/kvm/x86.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index efc7a82..f03294f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3095,6 +3095,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (!msr_info->host_initiated) {
 				s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
 				adjust_tsc_offset_guest(vcpu, adj);
+				/* Before back to guest, tsc_timestamp must be adjusted
+				 * as well, otherwise guest's percpu pvclock time could jump.
+				 */
+				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			}
 			vcpu->arch.ia32_tsc_adjust_msr = data;
 		}
--
1.8.3.1

2021-04-28 09:02:05

by Thomas Gleixner

Subject: Re: [PATCH] Guest system time jumps when new vCPUs is hot-added

On Wed, Apr 28 2021 at 10:22, Zelin Deng wrote:

> Hello,
> I have below VM configuration:
> ...
> <vcpu placement='static' current='1'>2</vcpu>
> <cpu mode='host-passthrough'>
> </cpu>
> <clock offset='utc'>
> <timer name='tsc' frequency='3000000000'/>
> </clock>
> ...
> After VM has been up for a few minutes, I use "virsh setvcpus" to hot-add
> second vCPU into VM, below dmesg is observed:
> [ 53.273484] CPU1 has been hot-added
> [ 85.067135] SMP alternatives: switching to SMP code
> [ 85.078409] x86: Booting SMP configuration:
> [ 85.079027] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [ 85.080240] kvm-clock: cpu 1, msr 77601041, secondary cpu clock
> [ 85.080450] smpboot: CPU 1 Converting physical 0 to logical die 1
> [ 85.101228] TSC ADJUST compensate: CPU1 observed 169175101528 warp. Adjust: 169175101528
> [ 141.513496] TSC ADJUST compensate: CPU1 observed 166 warp. Adjust: 169175101694

Why is TSC_ADJUST on CPU1 different from CPU0 in the first place?

That's broken.

Thanks,

tglx

2021-04-28 09:13:05

by Thomas Gleixner

Subject: Re: [PATCH] Guest system time jumps when new vCPUs is hot-added

On Wed, Apr 28 2021 at 11:00, Thomas Gleixner wrote:

> On Wed, Apr 28 2021 at 10:22, Zelin Deng wrote:
>
>> Hello,
>> I have below VM configuration:
>> ...
>> <vcpu placement='static' current='1'>2</vcpu>
>> <cpu mode='host-passthrough'>
>> </cpu>
>> <clock offset='utc'>
>> <timer name='tsc' frequency='3000000000'/>
>> </clock>
>> ...
>> After VM has been up for a few minutes, I use "virsh setvcpus" to hot-add
>> second vCPU into VM, below dmesg is observed:
>> [ 53.273484] CPU1 has been hot-added
>> [ 85.067135] SMP alternatives: switching to SMP code
>> [ 85.078409] x86: Booting SMP configuration:
>> [ 85.079027] smpboot: Booting Node 0 Processor 1 APIC 0x1
>> [ 85.080240] kvm-clock: cpu 1, msr 77601041, secondary cpu clock
>> [ 85.080450] smpboot: CPU 1 Converting physical 0 to logical die 1
>> [ 85.101228] TSC ADJUST compensate: CPU1 observed 169175101528 warp. Adjust: 169175101528
>> [ 141.513496] TSC ADJUST compensate: CPU1 observed 166 warp. Adjust: 169175101694
>
> Why is TSC_ADJUST on CPU1 different from CPU0 in the first place?
>
> That's broken.

Aside of that the TSC synchronization check in guests cannot work
reliably at all. Simply because there is no guarantee that vCPU0 and
vCPU1 are running in parallel.

Thanks,

tglx

2021-04-28 23:26:57

by Zelin Deng

Subject: Re: [PATCH] Guest system time jumps when new vCPUs is hot-added

On 2021/4/28 5:00 PM, Thomas Gleixner wrote:
> On Wed, Apr 28 2021 at 10:22, Zelin Deng wrote:
>
>> Hello,
>> I have below VM configuration:
>> ...
>> <vcpu placement='static' current='1'>2</vcpu>
>> <cpu mode='host-passthrough'>
>> </cpu>
>> <clock offset='utc'>
>> <timer name='tsc' frequency='3000000000'/>
>> </clock>
>> ...
>> After VM has been up for a few minutes, I use "virsh setvcpus" to hot-add
>> second vCPU into VM, below dmesg is observed:
>> [ 53.273484] CPU1 has been hot-added
>> [ 85.067135] SMP alternatives: switching to SMP code
>> [ 85.078409] x86: Booting SMP configuration:
>> [ 85.079027] smpboot: Booting Node 0 Processor 1 APIC 0x1
>> [ 85.080240] kvm-clock: cpu 1, msr 77601041, secondary cpu clock
>> [ 85.080450] smpboot: CPU 1 Converting physical 0 to logical die 1
>> [ 85.101228] TSC ADJUST compensate: CPU1 observed 169175101528 warp. Adjust: 169175101528
>> [ 141.513496] TSC ADJUST compensate: CPU1 observed 166 warp. Adjust: 169175101694
> Why is TSC_ADJUST on CPU1 different from CPU0 in the first place?

As I understand it, when KVM creates a vCPU, its tsc_offset is set to
0 - host rdtsc(), while TSC_ADJUST is 0.

Assume vCPU0 boots with tsc_offset0. If hotplug via "virsh setvcpus"
creates a new vCPU1 10000 TSC cycles later, its tsc_offset1 should be
about tsc_offset0 - 10000. There is therefore a 10000-cycle TSC warp
between rdtsc() in the guest on vCPU0 and on vCPU1, so
check_tsc_sync_target() sets TSC_ADJUST for vCPU1 when vCPU1 comes
online.

Did I miss something?

>
> That's broken.
>
> Thanks,
>
> tglx

2021-04-29 08:48:41

by Thomas Gleixner

Subject: Re: [PATCH] Guest system time jumps when new vCPUs is hot-added

On Thu, Apr 29 2021 at 07:24, Zelin Deng wrote:
> On 2021/4/28 5:00 PM, Thomas Gleixner wrote:
>> On Wed, Apr 28 2021 at 10:22, Zelin Deng wrote:
>>> [ 85.101228] TSC ADJUST compensate: CPU1 observed 169175101528 warp. Adjust: 169175101528
>>> [ 141.513496] TSC ADJUST compensate: CPU1 observed 166 warp. Adjust: 169175101694
>> Why is TSC_ADJUST on CPU1 different from CPU0 in the first place?
>
> Per my understanding when vCPU is created by KVM, it's tsc_offset = 0 -
> host rdtsc() meanwhile TSC_ADJUST is 0.
>
> Assume vCPU0 boots up with tsc_offset0, after 10000 tsc cycles, hotplug
> via "virsh setvcpus" creates a new vCPU1 whose tsc_offset1 should be
> about tsc_offset0 - 10000.  Therefore there's 10000 tsc warp between
> rdtsc() in guest of vCPU0 and vCPU1, check_tsc_sync_target() when vCPU1
> gets online will set TSC_ADJUST for vCPU1.
>
> Did I miss something?

Yes. The above is wrong.

The host has to ensure that the TSC of the vCPUs is in sync and if it
exposes TSC_ADJUST then that should be 0 and nothing else. The TSC
in a guest vCPU is

    hostTSC + host_TSC_ADJUST + vcpu_TSC_OFFSET + vcpu_guest_TSC_ADJUST

The mechanism the host has to use to ensure that the guest vCPUs are
exposing the same time is vcpu_TSC_OFFSET and nothing else. And
vcpu_TSC_OFFSET is the same for all vCPUs of a guest.
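
As an illustration only, the sum above can be written as a small C
sketch. The struct and the toy_* names are hypothetical and exist just
for this example. It shows that two vCPUs with the same
vcpu_TSC_OFFSET and a guest TSC_ADJUST of 0 read the same TSC, and that
a per-vCPU offset difference shows up directly as skew.

```c
#include <stdint.h>

/* Per-vCPU terms of the sum (hypothetical struct for illustration). */
struct toy_vcpu_tsc {
	int64_t tsc_offset;       /* vcpu_TSC_OFFSET, set by the host */
	int64_t guest_tsc_adjust; /* vcpu_guest_TSC_ADJUST, written by the guest */
};

/* hostTSC + host_TSC_ADJUST + vcpu_TSC_OFFSET + vcpu_guest_TSC_ADJUST */
static uint64_t toy_guest_tsc(uint64_t host_tsc, int64_t host_tsc_adjust,
			      const struct toy_vcpu_tsc *v)
{
	/* Unsigned arithmetic wraps modulo 2^64, like the hardware counter. */
	return host_tsc + (uint64_t)host_tsc_adjust +
	       (uint64_t)v->tsc_offset + (uint64_t)v->guest_tsc_adjust;
}

/* Signed difference between what two vCPUs read at the same host TSC. */
static int64_t toy_guest_tsc_skew(uint64_t host_tsc, int64_t host_tsc_adjust,
				  const struct toy_vcpu_tsc *a,
				  const struct toy_vcpu_tsc *b)
{
	return (int64_t)(toy_guest_tsc(host_tsc, host_tsc_adjust, a) -
			 toy_guest_tsc(host_tsc, host_tsc_adjust, b));
}
```

With identical offsets the skew is 0. If one vCPU's offset is 10000
smaller, the skew is 10000 cycles, and the guest can only hide it by
writing a non-zero TSC_ADJUST.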

Now there is another issue when vCPU0 and vCPU1 are on different
'sockets' via the topology information provided by the hypervisor.

Because we had quite some issues in the past where TSCs on a single
socket were perfectly fine, but between sockets they were skewed, we
have a sanity check there. What it does is:

    if (cpu_is_first_on_non_boot_socket(cpu))
        validate_synchronization_with_boot_socket()

And that validation expects that the CPUs involved run in a tight loop
concurrently so the TSC readouts which happen on both can be reliably
compared.

But this cannot be guaranteed on vCPUs at all, because the host can
schedule out one or both at any point during that synchronization check.

A two socket guest setup needs to have information from the host that
TSC is usable and that the socket sync check can be skipped. Anything
else is just doomed to fail in hard to diagnose ways.

Thanks,

tglx

2021-04-29 09:40:07

by Zelin Deng

Subject: Re: [PATCH] Guest system time jumps when new vCPUs is hot-added

On 2021/4/29 4:46 PM, Thomas Gleixner wrote:
> On Thu, Apr 29 2021 at 07:24, Zelin Deng wrote:
>> On 2021/4/28 5:00 PM, Thomas Gleixner wrote:
>>> On Wed, Apr 28 2021 at 10:22, Zelin Deng wrote:
>>>> [ 85.101228] TSC ADJUST compensate: CPU1 observed 169175101528 warp. Adjust: 169175101528
>>>> [ 141.513496] TSC ADJUST compensate: CPU1 observed 166 warp. Adjust: 169175101694
>>> Why is TSC_ADJUST on CPU1 different from CPU0 in the first place?
>> Per my understanding when vCPU is created by KVM, it's tsc_offset = 0 -
>> host rdtsc() meanwhile TSC_ADJUST is 0.
>>
>> Assume vCPU0 boots up with tsc_offset0, after 10000 tsc cycles, hotplug
>> via "virsh setvcpus" creates a new vCPU1 whose tsc_offset1 should be
>> about tsc_offset0 - 10000.  Therefore there's 10000 tsc warp between
>> rdtsc() in guest of vCPU0 and vCPU1, check_tsc_sync_target() when vCPU1
>> gets online will set TSC_ADJUST for vCPU1.
>>
>> Did I miss something?
> Yes. The above is wrong.
>
> The host has to ensure that the TSC of the vCPUs is in sync and if it
> exposes TSC_ADJUST then that should be 0 and nothing else. The TSC
> in a guest vCPU is
>
> hostTSC + host_TSC_ADJUST + vcpu_TSC_OFFSET + vcpu_guest_TSC_ADJUST
>
> The mechanism the host has to use to ensure that the guest vCPUs are
> exposing the same time is vcpu_TSC_OFFSET and nothing else. And
> vcpu_TSC_OFFSET is the same for all vCPUs of a guest.
Yes, make sense.
>
> Now there is another issue when vCPU0 and vCPU1 are on different
> 'sockets' via the topology information provided by the hypervisor.
>
> Because we had quite some issues in the past where TSCs on a single
> socket were perfectly fine, but between sockets they were skewed, we
> have a sanity check there. What it does is:
>
> if (cpu_is_first_on_non_boot_socket(cpu))
> validate_synchronization_with_boot_socket()
>
> And that validation expects that the CPUs involved run in a tight loop
> concurrently so the TSC readouts which happen on both can be reliably
> compared.
>
> But this cannot be guaranteed on vCPUs at all, because the host can
> schedule out one or both at any point during that synchronization check.
Is there any plan to fix this?
>
> A two socket guest setup needs to have information from the host that
> TSC is usable and that the socket sync check can be skipped. Anything
> else is just doomed to fail in hard to diagnose ways.

Yes, I had tried adding "tsc=unstable" to skip the TSC sync. However, if
a user process that is not pinned to a vCPU uses rdtsc, it can see a TSC
warp, because it can be scheduled across vCPUs. Does this mean user
applications have to guarantee for themselves that they use rdtsc only
when the TSC is reliable?

>
> Thanks,
>
> tglx

2021-04-29 16:04:22

by Thomas Gleixner

Subject: Re: [PATCH] Guest system time jumps when new vCPUs is hot-added

On Thu, Apr 29 2021 at 17:38, Zelin Deng wrote:
> On 2021/4/29 4:46 PM, Thomas Gleixner wrote:
>> And that validation expects that the CPUs involved run in a tight loop
>> concurrently so the TSC readouts which happen on both can be reliably
>> compared.
>>
>> But this cannot be guaranteed on vCPUs at all, because the host can
>> schedule out one or both at any point during that synchronization
>> check.
>
> Is there any plan to fix this?

The above cannot be fixed.

As I said before the solution is:

>> A two socket guest setup needs to have information from the host that
>> TSC is usable and that the socket sync check can be skipped. Anything
>> else is just doomed to fail in hard to diagnose ways.
>
> Yes, I had tried to add "tsc=unstable" to skip tsc sync.  However if a

tsc=unstable? Oh well.

> user process which is not pined to vCPU is using rdtsc, it can get tsc
> warp, because it can be scheduled among vCPUs.  Does it mean user

Only if the hypervisor fails to do the right thing, which is to make
sure that all vCPUs have the same TSC offset vs. the host TSC.

> applications have to guarantee itself to use rdtsc only when TSC is
> reliable?

If the TSCs of CPUs are not in sync then the kernel does the right thing
and uses some other clocksource for the various time interfaces, e.g.
the kernel provides clock_gettime() which guarantees to be correct
whether TSC is usable or not.

Any application using RDTSC directly is on its own and it's not a
kernel problem.
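
For illustration, a minimal sketch of an application taking timestamps
through clock_gettime() instead of reading the TSC directly. The helper
name monotonic_now_ns() is made up for this example. CLOCK_MONOTONIC is
one reasonable clock choice here; which clock an application needs
depends on its use case.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/*
 * Return the monotonic clock in nanoseconds, or -1 on failure.
 * The kernel picks the clocksource, so this keeps working when the
 * TSC has been marked unstable.
 */
static int64_t monotonic_now_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return -1;

	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

An interval is then the difference of two monotonic_now_ns() calls,
regardless of which vCPU the process ran on in between.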

The host kernel cannot guarantee that the hardware is sane, nor can a
guest kernel guarantee that the hypervisor is sane.

Thanks,

tglx




2021-04-29 22:41:26

by Zelin Deng

Subject: Re: [PATCH] Guest system time jumps when new vCPUs is hot-added

Got it. Many thanks, Thomas.

On 2021/4/30 12:02 AM, Thomas Gleixner wrote:

> On Thu, Apr 29 2021 at 17:38, Zelin Deng wrote:
>> On 2021/4/29 4:46 PM, Thomas Gleixner wrote:
>>> And that validation expects that the CPUs involved run in a tight loop
>>> concurrently so the TSC readouts which happen on both can be reliably
>>> compared.
>>>
>>> But this cannot be guaranteed on vCPUs at all, because the host can
>>> schedule out one or both at any point during that synchronization
>>> check.
>> Is there any plan to fix this?
> The above cannot be fixed.
>
> As I said before the solution is:
>
>>> A two socket guest setup needs to have information from the host that
>>> TSC is usable and that the socket sync check can be skipped. Anything
>>> else is just doomed to fail in hard to diagnose ways.
>> Yes, I had tried to add "tsc=unstable" to skip tsc sync.  However if a
> tsc=unstable? Oh well.
>
>> user process which is not pined to vCPU is using rdtsc, it can get tsc
>> warp, because it can be scheduled among vCPUs.  Does it mean user
> Only if the hypervisor is not doing the right thing and makes sure that
> all vCPUs have the same tsc offset vs. the host TSC.
>
>> applications have to guarantee itself to use rdtsc only when TSC is
>> reliable?
> If the TSCs of CPUs are not in sync then the kernel does the right thing
> and uses some other clocksource for the various time interfaces, e.g.
> the kernel provides clock_gettime() which guarantees to be correct
> whether TSC is usable or not.
>
> Any application using RDTSC directly is on its own and it's not a
> kernel problem.
>
> The host kernel cannot make guarantees that the hardware is sane neither
> can a guest kernel make guarantees that the hypervisor is sane.
>
> Thanks,
>
> tglx
>
>
>

2021-09-06 11:30:36

by Paolo Bonzini

Subject: Re: [PATCH] Guest system time jumps when new vCPUs is hot-added

On 28/04/21 04:22, Zelin Deng wrote:
> Hello,
> I have below VM configuration:
> ...
> <vcpu placement='static' current='1'>2</vcpu>
> <cpu mode='host-passthrough'>
> </cpu>
> <clock offset='utc'>
> <timer name='tsc' frequency='3000000000'/>
> </clock>
> ...
> After VM has been up for a few minutes, I use "virsh setvcpus" to hot-add
> second vCPU into VM, below dmesg is observed:
> [ 53.273484] CPU1 has been hot-added
> [ 85.067135] SMP alternatives: switching to SMP code
> [ 85.078409] x86: Booting SMP configuration:
> [ 85.079027] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [ 85.080240] kvm-clock: cpu 1, msr 77601041, secondary cpu clock
> [ 85.080450] smpboot: CPU 1 Converting physical 0 to logical die 1
> [ 85.101228] TSC ADJUST compensate: CPU1 observed 169175101528 warp. Adjust: 169175101528
> [ 141.513496] TSC ADJUST compensate: CPU1 observed 166 warp. Adjust: 169175101694
> [ 141.513496] TSC synchronization [CPU#0 -> CPU#1]:
> [ 141.513496] Measured 235 cycles TSC warp between CPUs, turning off TSC clock.
> [ 141.513496] tsc: Marking TSC unstable due to check_tsc_sync_source failed
> [ 141.543996] KVM setup async PF for cpu 1
> [ 141.544281] kvm-stealtime: cpu 1, msr 13bd2c080
> [ 141.549381] Will online and init hotplugged CPU: 1
>
> System time jumps from 85.101228 to 141.513496.
>
> Guest:                                  KVM
> -----                                   ------
> check_tsc_sync_target()
> wrmsrl(MSR_IA32_TSC_ADJUST,...)
>                                         kvm_set_msr_common(vcpu,...)
>                                         adjust_tsc_offset_guest(vcpu,...) //tsc_offset jumped
>                                         vcpu_enter_guest(vcpu) //tsc_timestamp was not changed
> ...
> rdtsc() jumped, system time jumped
>
> tsc_timestamp must be updated before go back to guest.
>
> ---
> Zelin Deng (1):
> KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset
> is adjusted
>
> arch/x86/kvm/x86.c | 4 ++++
> 1 file changed, 4 insertions(+)
>

While Thomas is right in general, what you found is indeed a bug with
the KVM->userspace API to set up the vCPU TSC adjust. So I'm queueing
the patch for 5.15.

Thanks,

Paolo