2014-07-16 09:53:09

by Igor Mammedov

[permalink] [raw]
Subject: [PATCH] ensure guest's kvmclock never goes backwards when TSC jumps backward

There are buggy hosts in the wild that advertise invariant
TSC and as result host uses TSC as clocksource, but TSC on
such host sometimes sporadically jumps backwards.

This causes kvmclock to go backwards if host advertises
PVCLOCK_TSC_STABLE_BIT, which turns off aggregated clock
accumulator and returns:
pvclock_vcpu_time_info.system_timestamp + offset
where 'offset' is calculated using TSC.
Since TSC is not virtualized in KVM, it makes guest see
TSC jumped backwards and leads to kvmclock going backwards
as well.

This is defensive patch that keeps per CPU last clock value
and ensures that clock will never go backwards even with
using PVCLOCK_TSC_STABLE_BIT enabled path.

Signed-off-by: Igor Mammedov <[email protected]>
---
RHBZ: 1115795

---
arch/x86/kernel/pvclock.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d2..dd9df0e 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -71,11 +71,14 @@ u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
return flags & valid_flags;
}

+static DEFINE_PER_CPU(cycle_t, last_clock);
+
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
unsigned version;
cycle_t ret;
- u64 last;
+ u64 last, *this_cpu_last;
+ s64 clock_delta;
u8 flags;

do {
@@ -87,6 +90,16 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
pvclock_touch_watchdogs();
}

+ this_cpu_last = &get_cpu_var(last_clock);
+ clock_delta = ret - *this_cpu_last;
+ if (likely(clock_delta > 0)) {
+ *this_cpu_last = ret;
+ } else {
+ ret = *this_cpu_last;
+ WARN_ONCE(1, "clock went backwards");
+ }
+ put_cpu_var(last_clock);
+
if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
(flags & PVCLOCK_TSC_STABLE_BIT))
return ret;
--
1.8.3.1


2014-07-16 10:18:45

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] ensure guest's kvmclock never goes backwards when TSC jumps backward

Il 16/07/2014 11:52, Igor Mammedov ha scritto:
> There are buggy hosts in the wild that advertise invariant
> TSC and as result host uses TSC as clocksource, but TSC on
> such host sometimes sporadically jumps backwards.
>
> This causes kvmclock to go backwards if host advertises
> PVCLOCK_TSC_STABLE_BIT, which turns off aggregated clock
> accumulator and returns:
> pvclock_vcpu_time_info.system_timestamp + offset
> where 'offset' is calculated using TSC.
> Since TSC is not virtualized in KVM, it makes guest see
> TSC jumped backwards and leads to kvmclock going backwards
> as well.
>
> This is defensive patch that keeps per CPU last clock value
> and ensures that clock will never go backwards even with
> using PVCLOCK_TSC_STABLE_BIT enabled path.

I'm not sure that a per-CPU value is enough; your patch can make the
problem much less frequent of course, but I'm not sure neither detection
nor correction are 100% reliable. Your addition is basically a faster
but less reliable version of the last_value logic.

If may be okay to have detection that is faster but not 100% reliable.
However, once you find that the host is buggy I think the correct thing
to do is to write last_value and kill PVCLOCK_TSC_STABLE_BIT from
valid_flags.

Did you check that the affected host has the latest microcode?
Alternatively, could we simply blacklist some CPU steppings? I'm not
sure who we could ask at AMD :( but perhaps there is an erratum.

Paolo

2014-07-16 11:41:22

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] ensure guest's kvmclock never goes backwards when TSC jumps backward

On Wed, Jul 16, 2014 at 12:18:37PM +0200, Paolo Bonzini wrote:
> Il 16/07/2014 11:52, Igor Mammedov ha scritto:
> >There are buggy hosts in the wild that advertise invariant
> >TSC and as result host uses TSC as clocksource, but TSC on
> >such host sometimes sporadically jumps backwards.
> >
> >This causes kvmclock to go backwards if host advertises
> >PVCLOCK_TSC_STABLE_BIT, which turns off aggregated clock
> >accumulator and returns:
> > pvclock_vcpu_time_info.system_timestamp + offset
> >where 'offset' is calculated using TSC.
> >Since TSC is not virtualized in KVM, it makes guest see
> >TSC jumped backwards and leads to kvmclock going backwards
> >as well.
> >
> >This is defensive patch that keeps per CPU last clock value
> >and ensures that clock will never go backwards even with
> >using PVCLOCK_TSC_STABLE_BIT enabled path.
>
> I'm not sure that a per-CPU value is enough; your patch can make the
> problem much less frequent of course, but I'm not sure neither
> detection nor correction are 100% reliable. Your addition is
> basically a faster but less reliable version of the last_value
> logic.
>
> If may be okay to have detection that is faster but not 100%
> reliable. However, once you find that the host is buggy I think the
> correct thing to do is to write last_value and kill
> PVCLOCK_TSC_STABLE_BIT from valid_flags.
>
> Did you check that the affected host has the latest microcode?
> Alternatively, could we simply blacklist some CPU steppings? I'm
> not sure who we could ask at AMD :( but perhaps there is an erratum.
>
> Paolo

Igor,

Can we move detection to the host TSC clocksource driver ?

Because it is responsability of the host system to provide a non
backwards clock_gettime() interface as well.

How did you prove its the host TSC in fact going backwards?
Is it cross-CPU detection?

2014-07-16 13:55:31

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] ensure guest's kvmclock never goes backwards when TSC jumps backward

On Wed, 16 Jul 2014 08:41:00 -0300
Marcelo Tosatti <[email protected]> wrote:

> On Wed, Jul 16, 2014 at 12:18:37PM +0200, Paolo Bonzini wrote:
> > Il 16/07/2014 11:52, Igor Mammedov ha scritto:
> > >There are buggy hosts in the wild that advertise invariant
> > >TSC and as result host uses TSC as clocksource, but TSC on
> > >such host sometimes sporadically jumps backwards.
> > >
> > >This causes kvmclock to go backwards if host advertises
> > >PVCLOCK_TSC_STABLE_BIT, which turns off aggregated clock
> > >accumulator and returns:
> > > pvclock_vcpu_time_info.system_timestamp + offset
> > >where 'offset' is calculated using TSC.
> > >Since TSC is not virtualized in KVM, it makes guest see
> > >TSC jumped backwards and leads to kvmclock going backwards
> > >as well.
> > >
> > >This is defensive patch that keeps per CPU last clock value
> > >and ensures that clock will never go backwards even with
> > >using PVCLOCK_TSC_STABLE_BIT enabled path.
> >
> > I'm not sure that a per-CPU value is enough; your patch can make the
> > problem much less frequent of course, but I'm not sure neither
> > detection nor correction are 100% reliable. Your addition is
> > basically a faster but less reliable version of the last_value
> > logic.
How is it less reliable than last_value logic?

Alternatively, we can panic in case of backward jump here, so that guest
won't hang in random place in case of error. There might not be OOPs
but at least coredump will point to a right place.

> >
> > If may be okay to have detection that is faster but not 100%
> > reliable. However, once you find that the host is buggy I think the
> > correct thing to do is to write last_value and kill
> > PVCLOCK_TSC_STABLE_BIT from valid_flags.
that might be an option, but what value we need to store into
last_value?
To make sure that clock won't go back we need to track
it on all CPUs and store highest value to last_value, at this point
there is no point in switching to last_value path since we have to
track per CPU anyway.

What this patch doesn't cover is switching from master_clock mode to
last_value mode (it happens at CPU hotplug time), I'd need to add
what was described above as second patch on top of this one.

> >
> > Did you check that the affected host has the latest microcode?
> > Alternatively, could we simply blacklist some CPU steppings? I'm
> > not sure who we could ask at AMD :( but perhaps there is an erratum.
I haven't found anything in this direction yet. I'm still trying to
find someone from AMD to look at the issue.

> >
> > Paolo
>
> Igor,
>
> Can we move detection to the host TSC clocksource driver ?
I haven't looked much at host side solution yet,
but to detection reliable it needs to be run constantly,
from read_native_tsc().

it's possible to put detection into check_system_tsc_reliable() but
that would increase boot time and it's not clear for how long test
should run to make detection reliable (in my case it takes ~5-10sec
to detect first failure).

Best we could at boot time is mark TSC as unstable on affected hardware,
but for this we need to figure out if it's specific machine or CPU issue
to do it properly. (I'm in process of finding out who to bug with it)

>
> Because it is responsability of the host system to provide a non
> backwards clock_gettime() interface as well.
vdso_clock_gettime() is not affected since it will use last highest
tsc value in case of jump due to usage of vread_tsc().
PS: it appears that host runs stably.

but kvm_get_time_and_clockread() is affected since it uses its own
version of do_monotonic()->vgettsc() which will lead to cycles
go backwards and overflow of nano secs in timespec. We should mimic
vread_tsc() here so not to run into this kind of issues.

>
> How did you prove its the host TSC in fact going backwards?
> Is it cross-CPU detection?
I've checked with several methods,
1. patched pvclock_clocksource_read() in guest with VCPUs pinned to host
CPUs.
2.Ingo's tsc_wrap_test, which fails miserably on affected host.
3 sytemtap script hooked to read_native_tsc(), for source see
https://bugzilla.redhat.com/show_bug.cgi?id=1115795#c12


>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-07-16 14:16:29

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] ensure guest's kvmclock never goes backwards when TSC jumps backward

Il 16/07/2014 15:55, Igor Mammedov ha scritto:
> On Wed, 16 Jul 2014 08:41:00 -0300
> Marcelo Tosatti <[email protected]> wrote:
>
>> On Wed, Jul 16, 2014 at 12:18:37PM +0200, Paolo Bonzini wrote:
>>> Il 16/07/2014 11:52, Igor Mammedov ha scritto:
>>>> There are buggy hosts in the wild that advertise invariant
>>>> TSC and as result host uses TSC as clocksource, but TSC on
>>>> such host sometimes sporadically jumps backwards.
>>>>
>>>> This causes kvmclock to go backwards if host advertises
>>>> PVCLOCK_TSC_STABLE_BIT, which turns off aggregated clock
>>>> accumulator and returns:
>>>> pvclock_vcpu_time_info.system_timestamp + offset
>>>> where 'offset' is calculated using TSC.
>>>> Since TSC is not virtualized in KVM, it makes guest see
>>>> TSC jumped backwards and leads to kvmclock going backwards
>>>> as well.
>>>>
>>>> This is defensive patch that keeps per CPU last clock value
>>>> and ensures that clock will never go backwards even with
>>>> using PVCLOCK_TSC_STABLE_BIT enabled path.
>>>
>>> I'm not sure that a per-CPU value is enough; your patch can make the
>>> problem much less frequent of course, but I'm not sure neither
>>> detection nor correction are 100% reliable. Your addition is
>>> basically a faster but less reliable version of the last_value
>>> logic.
> How is it less reliable than last_value logic?

Suppose CPU 1 is behind by 3 nanoseconds

CPU 0 CPU 1
time = 100 (at time 100)
time = 99 (at time 102)
time = 104 (at time 104)
time = 105 (at time 108)

Your patch will not detect this.

>>> If may be okay to have detection that is faster but not 100%
>>> reliable. However, once you find that the host is buggy I think the
>>> correct thing to do is to write last_value and kill
>>> PVCLOCK_TSC_STABLE_BIT from valid_flags.
> that might be an option, but what value we need to store into
> last_value?

You can write the value that was in the per-CPU variable (not perfect
correction)...

> To make sure that clock won't go back we need to track
> it on all CPUs and store highest value to last_value, at this point
> there is no point in switching to last_value path since we have to
> track per CPU anyway.

... or loop over all CPUs and find the highest value. You would only
have to do this once.

>> Can we move detection to the host TSC clocksource driver ?
>
> I haven't looked much at host side solution yet,
> but to detection reliable it needs to be run constantly,
> from read_native_tsc().
>
> it's possible to put detection into check_system_tsc_reliable() but
> that would increase boot time and it's not clear for how long test
> should run to make detection reliable (in my case it takes ~5-10sec
> to detect first failure).

Is 5-10sec the time that it takes for tsc_wrap_test to fail?

> Best we could at boot time is mark TSC as unstable on affected hardware,
> but for this we need to figure out if it's specific machine or CPU issue
> to do it properly. (I'm in process of finding out who to bug with it)

Thanks, this would be best.

> PS: it appears that host runs stably.
>
> but kvm_get_time_and_clockread() is affected since it uses its own
> version of do_monotonic()->vgettsc() which will lead to cycles
> go backwards and overflow of nano secs in timespec. We should mimic
> vread_tsc() here so not to run into this kind of issues.

I'm not sure I understand, the code is similar:

arch/x86/kvm/x86.c arch/x86/vdso/vclock_gettime.c
do_monotonic do_monotonic
vgettsc vgetsns
read_tsc vread_tsc
vget_cycles
__native_read_tsc __native_read_tsc

The VDSO inlines timespec_add_ns.

Paolo

2014-07-16 14:51:59

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] ensure guest's kvmclock never goes backwards when TSC jumps backward

On Wed, 16 Jul 2014 16:16:17 +0200
Paolo Bonzini <[email protected]> wrote:

> Il 16/07/2014 15:55, Igor Mammedov ha scritto:
> > On Wed, 16 Jul 2014 08:41:00 -0300
> > Marcelo Tosatti <[email protected]> wrote:
> >
> >> On Wed, Jul 16, 2014 at 12:18:37PM +0200, Paolo Bonzini wrote:
> >>> Il 16/07/2014 11:52, Igor Mammedov ha scritto:
> >>>> There are buggy hosts in the wild that advertise invariant
> >>>> TSC and as result host uses TSC as clocksource, but TSC on
> >>>> such host sometimes sporadically jumps backwards.
> >>>>
> >>>> This causes kvmclock to go backwards if host advertises
> >>>> PVCLOCK_TSC_STABLE_BIT, which turns off aggregated clock
> >>>> accumulator and returns:
> >>>> pvclock_vcpu_time_info.system_timestamp + offset
> >>>> where 'offset' is calculated using TSC.
> >>>> Since TSC is not virtualized in KVM, it makes guest see
> >>>> TSC jumped backwards and leads to kvmclock going backwards
> >>>> as well.
> >>>>
> >>>> This is defensive patch that keeps per CPU last clock value
> >>>> and ensures that clock will never go backwards even with
> >>>> using PVCLOCK_TSC_STABLE_BIT enabled path.
> >>>
> >>> I'm not sure that a per-CPU value is enough; your patch can make
> >>> the problem much less frequent of course, but I'm not sure neither
> >>> detection nor correction are 100% reliable. Your addition is
> >>> basically a faster but less reliable version of the last_value
> >>> logic.
> > How is it less reliable than last_value logic?
>
> Suppose CPU 1 is behind by 3 nanoseconds
>
> CPU 0 CPU 1
> time = 100 (at time 100)
> time = 99 (at time 102)
> time = 104 (at time 104)
> time = 105 (at time 108)
>
> Your patch will not detect this.
Is it possible for each cpu to have it's own time?

>
> >>> If may be okay to have detection that is faster but not 100%
> >>> reliable. However, once you find that the host is buggy I think
> >>> the correct thing to do is to write last_value and kill
> >>> PVCLOCK_TSC_STABLE_BIT from valid_flags.
> > that might be an option, but what value we need to store into
> > last_value?
>
> You can write the value that was in the per-CPU variable (not perfect
> correction)...
I'll look at this variant, it's not perfect but it doesn't involve callout
to other CPUs!

>
> > To make sure that clock won't go back we need to track
> > it on all CPUs and store highest value to last_value, at this point
> > there is no point in switching to last_value path since we have to
> > track per CPU anyway.
>
> ... or loop over all CPUs and find the highest value. You would only
> have to do this once.
>
> >> Can we move detection to the host TSC clocksource driver ?
> >
> > I haven't looked much at host side solution yet,
> > but to detection reliable it needs to be run constantly,
> > from read_native_tsc().
> >
> > it's possible to put detection into check_system_tsc_reliable() but
> > that would increase boot time and it's not clear for how long test
> > should run to make detection reliable (in my case it takes ~5-10sec
> > to detect first failure).
>
> Is 5-10sec the time that it takes for tsc_wrap_test tofail?
nope, for systemtap script hooked to native_read_tsc(), but it depend
on the load for example hotplugging VCPU causes imediate jumps.

tsc_wrap_test starts to fail almost imediately,

I'll check how much tries it takes to fail for the first time, if it is
not too much I guess we could add check to check_system_tsc_reliable().


>
> > Best we could at boot time is mark TSC as unstable on affected
> > hardware, but for this we need to figure out if it's specific
> > machine or CPU issue to do it properly. (I'm in process of finding
> > out who to bug with it)
>
> Thanks, this would be best.
>
> > PS: it appears that host runs stably.
> >
> > but kvm_get_time_and_clockread() is affected since it uses its own
> > version of do_monotonic()->vgettsc() which will lead to cycles
> > go backwards and overflow of nano secs in timespec. We should mimic
> > vread_tsc() here so not to run into this kind of issues.
>
> I'm not sure I understand, the code is similar:
>
> arch/x86/kvm/x86.c arch/x86/vdso/vclock_gettime.c
> do_monotonic do_monotonic
> vgettsc vgetsns
> read_tsc vread_tsc
> vget_cycles
> __native_read_tsc __native_read_tsc
>
> The VDSO inlines timespec_add_ns.
I'm sorry, I haven't looked inside read_tsc() in
arch/x86/kvm/x86.c, it's the same as vread_tsc().

>
> Paolo

2014-07-16 14:55:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] ensure guest's kvmclock never goes backwards when TSC jumps backward

Il 16/07/2014 16:51, Igor Mammedov ha scritto:
>>>> I'm not sure that a per-CPU value is enough; your patch can make
>>>> the problem much less frequent of course, but I'm not sure neither
>>>> detection nor correction are 100% reliable. Your addition is
>>>> basically a faster but less reliable version of the last_value
>>>> logic.
>>>
>>> How is it less reliable than last_value logic?
>>
>> Suppose CPU 1 is behind by 3 nanoseconds
>>
>> CPU 0 CPU 1
>> time = 100 (at time 100)
>> time = 99 (at time 102)
>> time = 104 (at time 104)
>> time = 105 (at time 108)
>>
>> Your patch will not detect this.
> Is it possible for each cpu to have it's own time?

Yes, that's one of the reasons for TSC not to be stable (it could also
happen simply because the value of TSC_ADJUST MSR is bogus).

> tsc_wrap_test starts to fail almost imediately,
>
> I'll check how much tries it takes to fail for the first time, if it is
> not too much I guess we could add check to check_system_tsc_reliable().

Thanks!

Paolo

2014-07-16 15:19:04

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] ensure guest's kvmclock never goes backwards when TSC jumps backward

On Wed, 16 Jul 2014 16:55:37 +0200
Paolo Bonzini <[email protected]> wrote:

> Il 16/07/2014 16:51, Igor Mammedov ha scritto:
> >>>> I'm not sure that a per-CPU value is enough; your patch can make
> >>>> the problem much less frequent of course, but I'm not sure
> >>>> neither detection nor correction are 100% reliable. Your
> >>>> addition is basically a faster but less reliable version of the
> >>>> last_value logic.
> >>>
> >>> How is it less reliable than last_value logic?
> >>
> >> Suppose CPU 1 is behind by 3 nanoseconds
> >>
> >> CPU 0 CPU 1
> >> time = 100 (at time 100)
> >> time = 99 (at time 102)
> >> time = 104 (at time 104)
> >> time = 105 (at time 108)
> >>
> >> Your patch will not detect this.
> > Is it possible for each cpu to have it's own time?
>
> Yes, that's one of the reasons for TSC not to be stable (it could
> also happen simply because the value of TSC_ADJUST MSR is bogus).
I was wondering not about TSC but kvmclock -> sched_clock.
If they are percpu and can differ for each CPU than above diagram if
fine as far as time on each CPU is monotonic and there is not need to
detect that time on each CPU is different.


>
> Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html