2023-10-18 22:18:13

by Dongli Zhang

Subject: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

As mentioned in the Linux kernel development documentation, "sched_clock() is
used for scheduling and timestamping". While there is a default native
implementation, many paravirtualizations have their own implementations.

KVM uses kvm_sched_clock_read(), and there is no way to disable only KVM's
sched_clock. The "no-kvmclock" parameter disables all paravirtualized
kvmclock features.

static inline void kvm_sched_clock_init(bool stable)
{
	if (!stable)
		clear_sched_clock_stable();
	kvm_sched_clock_offset = kvm_clock_read();
	paravirt_set_sched_clock(kvm_sched_clock_read);

	pr_info("kvm-clock: using sched offset of %llu cycles",
		kvm_sched_clock_offset);

	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
}

There is a known issue that kvmclock may drift during vCPU hotplug [1].
Although a temporary fix is available [2], we may need a way to disable pv
sched_clock. Nowadays, the TSC is more stable and has less performance
overhead than kvmclock.

This RFC proposes a global param to disable pv sched_clock for all
paravirtualizations.

Please comment if any of the other options below is better:

1. Global param (this RFC patch).

2. A kvmclock-specific param (similar to "no-vmw-sched-clock" for VMware).

I actually prefer this 2nd method.

3. Enforce native sched_clock only when TSC is invariant (the Hyper-V method).

4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
want to keep the pv sched_clock.

Introducing a param may be easier to backport to older kernel versions.

References:
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/

Thank you very much for the suggestion!

Signed-off-by: Dongli Zhang <[email protected]>
---
arch/x86/include/asm/paravirt.h | 2 +-
arch/x86/kernel/kvmclock.c | 12 +++++++-----
arch/x86/kernel/paravirt.c | 18 +++++++++++++++++-
3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6c8ff12140ae..f36edf608b6b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -24,7 +24,7 @@ u64 dummy_sched_clock(void);
 DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
 DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);

-void paravirt_set_sched_clock(u64 (*func)(void));
+int paravirt_set_sched_clock(u64 (*func)(void));

 static __always_inline u64 paravirt_sched_clock(void)
 {
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..0b8bf5677d44 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)

 static inline void kvm_sched_clock_init(bool stable)
 {
-	if (!stable)
-		clear_sched_clock_stable();
 	kvm_sched_clock_offset = kvm_clock_read();
-	paravirt_set_sched_clock(kvm_sched_clock_read);

-	pr_info("kvm-clock: using sched offset of %llu cycles",
-		kvm_sched_clock_offset);
+	if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
+		if (!stable)
+			clear_sched_clock_stable();
+
+		pr_info("kvm-clock: using sched offset of %llu cycles",
+			kvm_sched_clock_offset);
+	}

 	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
 		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 97f1436c1a20..2cfef94317b0 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -118,9 +118,25 @@ static u64 native_steal_clock(int cpu)
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);

-void paravirt_set_sched_clock(u64 (*func)(void))
+static bool no_pv_sched_clock;
+
+static int __init parse_no_pv_sched_clock(char *arg)
+{
+	no_pv_sched_clock = true;
+	return 0;
+}
+early_param("no_pv_sched_clock", parse_no_pv_sched_clock);
+
+int paravirt_set_sched_clock(u64 (*func)(void))
 {
+	if (no_pv_sched_clock) {
+		pr_info("sched_clock: not configurable\n");
+		return -EPERM;
+	}
+
 	static_call_update(pv_sched_clock, func);
+
+	return 0;
 }

 /* These are in entry.S */
--
2.34.1


2023-10-19 11:56:44

by Vitaly Kuznetsov

Subject: Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

Dongli Zhang <[email protected]> writes:

> As mentioned in the linux kernel development document, "sched_clock() is
> used for scheduling and timestamping". While there is a default native
> implementation, many paravirtualizations have their own implementations.
>
> About KVM, it uses kvm_sched_clock_read() and there is no way to only
> disable KVM's sched_clock. The "no-kvmclock" may disable all
> paravirtualized kvmclock features.
>
> 94 static inline void kvm_sched_clock_init(bool stable)
> 95 {
> 96 if (!stable)
> 97 clear_sched_clock_stable();
> 98 kvm_sched_clock_offset = kvm_clock_read();
> 99 paravirt_set_sched_clock(kvm_sched_clock_read);
> 100
> 101 pr_info("kvm-clock: using sched offset of %llu cycles",
> 102 kvm_sched_clock_offset);
> 103
> 104 BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
> 105 sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
> 106 }
>
> There is known issue that kvmclock may drift during vCPU hotplug [1].
> Although a temporary fix is available [2], we may need a way to disable pv
> sched_clock. Nowadays, the TSC is more stable and has less performance
> overhead than kvmclock.
>
> This is to propose to introduce a global param to disable pv sched_clock
> for all paravirtualizations.
>
> Please suggest and comment if other options are better:
>
> 1. Global param (this RFC patch).
>
> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>
> Indeed I like the 2nd method.
>
> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>
> 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
> want to keep the pv sched_clock.

Normally, it should be up to the hypervisor to tell the guest which
clock to use, i.e. if TSC is reliable or not. Let me put my question
this way: if TSC on the particular host is good for everything, why
does the hypervisor advertise 'kvmclock' to its guests? If for some
'historical reasons' we can't revoke features we can always introduce a
new PV feature bit saying that TSC is preferred.

1) Global param doesn't sound like a good idea to me: chances are that
people will be setting it on their guest images to work around problems
on one hypervisor (or, rather, on one public cloud which is too lazy to
fix their hypervisor) while simultaneously creating problems on another.

2) KVM specific parameter can work, but as KVM's sched_clock is the same
as kvmclock, I'm not convinced it actually makes sense to separate the
two. Like if sched_clock is known to be bad but TSC is good, why do we
need to use PV clock at all? Having a parameter for debugging purposes
may be OK though...

3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
(HV_ACCESS_TSC_INVARIANT) and not the architectural
CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the latter on
all hypervisors.

4) Personally, I'm not sure that relying on 'TSC is crap' detection is
100% reliable. I can imagine cases when we can't detect the fact that
while synchronized across CPUs and not going backwards, it is, for
example, ticking with an unstable frequency and PV sched clock is
supposed to give the right correction (all of them are rdtsc() based
anyways, aren't they?).
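
For reference, the architectural bit mentioned in 3) can be read from user
space roughly as below. This is only a sketch: the helper names are
hypothetical, __get_cpuid() is the GCC/Clang <cpuid.h> helper, and a
hypervisor is free to hide or expose this bit regardless of what the host
hardware does, which is exactly the trust concern raised above.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Invariant TSC is reported in CPUID.80000007H:EDX bit 8. */
static bool invariant_tsc_from_edx(uint32_t edx)
{
	return (edx >> 8) & 1u;
}

/*
 * Hypothetical helper: true only if the leaf exists and the bit is set.
 * On non-x86 builds, or when the leaf is unavailable, it reports false.
 */
static bool cpu_reports_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
		return false;
	return invariant_tsc_from_edx(edx);
#else
	return false;
#endif
}
```

Calling cpu_reports_invariant_tsc() inside a guest only tells what the
hypervisor chose to advertise, not whether the underlying TSC is trustworthy.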

>
> To introduce a param may be easier to backport to old kernel version.
>
> References:
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
> [3] https://lore.kernel.org/all/[email protected]/
>
> Thank you very much for the suggestion!
>
> Signed-off-by: Dongli Zhang <[email protected]>
> ---
> arch/x86/include/asm/paravirt.h | 2 +-
> arch/x86/kernel/kvmclock.c | 12 +++++++-----
> arch/x86/kernel/paravirt.c | 18 +++++++++++++++++-
> 3 files changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 6c8ff12140ae..f36edf608b6b 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -24,7 +24,7 @@ u64 dummy_sched_clock(void);
> DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
> DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
>
> -void paravirt_set_sched_clock(u64 (*func)(void));
> +int paravirt_set_sched_clock(u64 (*func)(void));
>
> static __always_inline u64 paravirt_sched_clock(void)
> {
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..0b8bf5677d44 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)
>
> static inline void kvm_sched_clock_init(bool stable)
> {
> - if (!stable)
> - clear_sched_clock_stable();
> kvm_sched_clock_offset = kvm_clock_read();
> - paravirt_set_sched_clock(kvm_sched_clock_read);
>
> - pr_info("kvm-clock: using sched offset of %llu cycles",
> - kvm_sched_clock_offset);
> + if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
> + if (!stable)
> + clear_sched_clock_stable();
> +
> + pr_info("kvm-clock: using sched offset of %llu cycles",
> + kvm_sched_clock_offset);
> + }
>
> BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
> sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 97f1436c1a20..2cfef94317b0 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -118,9 +118,25 @@ static u64 native_steal_clock(int cpu)
> DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
> DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>
> -void paravirt_set_sched_clock(u64 (*func)(void))
> +static bool no_pv_sched_clock;
> +
> +static int __init parse_no_pv_sched_clock(char *arg)
> +{
> + no_pv_sched_clock = true;
> + return 0;
> +}
> +early_param("no_pv_sched_clock", parse_no_pv_sched_clock);
> +
> +int paravirt_set_sched_clock(u64 (*func)(void))
> {
> + if (no_pv_sched_clock) {
> + pr_info("sched_clock: not configurable\n");
> + return -EPERM;
> + }
> +
> static_call_update(pv_sched_clock, func);
> +
> + return 0;
> }
>
> /* These are in entry.S */

--
Vitaly

2023-10-19 15:41:29

by Sean Christopherson

Subject: Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
> Dongli Zhang <[email protected]> writes:
>
> > As mentioned in the linux kernel development document, "sched_clock() is
> > used for scheduling and timestamping". While there is a default native
> > implementation, many paravirtualizations have their own implementations.
> >
> > About KVM, it uses kvm_sched_clock_read() and there is no way to only
> > disable KVM's sched_clock. The "no-kvmclock" may disable all
> > paravirtualized kvmclock features.

...

> > Please suggest and comment if other options are better:
> >
> > 1. Global param (this RFC patch).
> >
> > 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
> >
> > Indeed I like the 2nd method.
> >
> > 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
> >
> > 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
> > all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
> > want to keep the pv sched_clock.
>
> Normally, it should be up to the hypervisor to tell the guest which
> clock to use, i.e. if TSC is reliable or not. Let me put my question
> this way: if TSC on the particular host is good for everything, why
> does the hypervisor advertises 'kvmclock' to its guests?

I suspect there are two reasons.

  1. As is likely the case in our fleet, no one revisited the set of advertised
     PV features when defining the VM shapes for a new generation of hardware, or
     whoever did the reviews wasn't aware that advertising kvmclock is actually
     suboptimal.  All the PV clock stuff in KVM is quite labyrinthian, so it's
     not hard to imagine it getting overlooked.

  2. Legacy VMs.  If VMs have been running with a PV clock for years, forcing
     them to switch to a new clocksource is high-risk, low-reward.

> If for some 'historical reasons' we can't revoke features we can always
> introduce a new PV feature bit saying that TSC is preferred.
>
> 1) Global param doesn't sound like a good idea to me: chances are that
> people will be setting it on their guest images to workaround problems
> on one hypervisor (or, rather, on one public cloud which is too lazy to
> fix their hypervisor) while simultaneously creating problems on another.
>
> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
> as kvmclock, I'm not convinced it actually makes sense to separate the
> two. Like if sched_clock is known to be bad but TSC is good, why do we
> need to use PV clock at all? Having a parameter for debugging purposes
> may be OK though...
>
> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
> (HV_ACCESS_TSC_INVARIANT) and not the architectural
> CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the later on
> all hypervisors.
>
> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
> 100% reliable. I can imagine cases when we can't detect that fact that
> while synchronized across CPUs and not going backwards, it is, for
> example, ticking with an unstable frequency and PV sched clock is
> supposed to give the right correction (all of them are rdtsc() based
> anyways, aren't they?).

Yeah, practically speaking, the only thing adding a knob to turn off using PV
clocks for sched_clock will accomplish is creating an even bigger matrix of
combinations that can cause problems, e.g. where guests end up using kvmclock
timekeeping but not scheduling.

The explanation above and the links below fail to capture _the_ key point:
Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the clocksource
when the TSC is constant and nonstop (first spliced blob below).

What I suggested is that if the TSC is chosen over a PV clock as the clocksource,
then we have the kernel also override the sched_clock selection (second spliced
blob below).

That doesn't require the guest admin to opt-in, and doesn't create even more
combinations to support. It also provides for a smoother transition for when
customers inevitably end up creating VMs on hosts that don't advertise kvmclock
(or any PV clock).
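
A user-space model of that suggestion, as a sketch only. The ratings (tsc 300,
kvm-clock 400, lowered to 299 when the TSC is constant, nonstop and not marked
unstable) come from the spliced blobs below; every model_* name is
hypothetical, and the real clocksource core has more inputs than a single
rating comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct model_clocksource {
	const char *name;
	int rating;
};

/* kvm-clock is demoted below tsc when the TSC looks trustworthy. */
static int model_kvmclock_rating(bool constant_tsc, bool nonstop_tsc,
				 bool tsc_unstable)
{
	if (constant_tsc && nonstop_tsc && !tsc_unstable)
		return 299;
	return 400;
}

/* Pick the highest-rated entry (simplified clocksource selection). */
static const struct model_clocksource *
model_select(const struct model_clocksource *cs, size_t n)
{
	const struct model_clocksource *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (!best || cs[i].rating > best->rating)
			best = &cs[i];
	return best;
}

/* The suggestion: use the PV sched_clock only if the PV clocksource won. */
static bool model_use_pv_sched_clock(bool constant_tsc, bool nonstop_tsc,
				     bool tsc_unstable)
{
	struct model_clocksource cs[] = {
		{ "tsc", 300 },
		{ "kvm-clock", model_kvmclock_rating(constant_tsc, nonstop_tsc,
						     tsc_unstable) },
	};

	return strcmp(model_select(cs, 2)->name, "kvm-clock") == 0;
}
```

With this shape there is a single decision point, so the clocksource and
sched_clock cannot end up disagreeing about whether the PV clock is in use.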

> > To introduce a param may be easier to backport to old kernel version.
> >
> > References:
> > [1] https://lore.kernel.org/all/[email protected]/
> > [2] https://lore.kernel.org/all/[email protected]/
> > [3] https://lore.kernel.org/all/[email protected]/

On Mon, Oct 2, 2023 at 11:18 AM Sean Christopherson <[email protected]> wrote:
> > Do we need to update the documentation to always suggest TSC when it is
> > constant, as I believe many users still prefer pv clock than tsc?
> >
> > Thanks to tsc ratio scaling, the live migration will not impact tsc.
> >
> > From the source code, the rating of kvm-clock is still higher than tsc.
> >
> > BTW., how about to decrease the rating if guest detects constant tsc?
> >
> > 166 struct clocksource kvm_clock = {
> > 167 .name = "kvm-clock",
> > 168 .read = kvm_clock_get_cycles,
> > 169 .rating = 400,
> > 170 .mask = CLOCKSOURCE_MASK(64),
> > 171 .flags = CLOCK_SOURCE_IS_CONTINUOUS,
> > 172 .enable = kvm_cs_enable,
> > 173 };
> >
> > 1196 static struct clocksource clocksource_tsc = {
> > 1197 .name = "tsc",
> > 1198 .rating = 300,
> > 1199 .read = read_tsc,
>
> That's already done in kvmclock_init().
>
> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> !check_tsc_unstable())
> kvm_clock.rating = 299;
>
> See also: https://lore.kernel.org/all/ZOjF2DMBgW%[email protected]
>
> > 2. The sched_clock.
> >
> > The scheduling is impacted if there is big drift.
>
> ...
>
> > Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
> > operations (not only sched_clock), e.g., after line 300.
>
> ...
>
> > Should I introduce a new param to disable no-kvm-sched-clock only, or to
> > introduce a new param to allow the selection of sched_clock?
>
> I don't think we want a KVM-specific knob, because every flavor of paravirt guest
> would need to do the same thing. And unless there's a good reason to use a
> paravirt clock, this really shouldn't be something the guest admin needs to opt
> into using.


On Mon, Oct 2, 2023 at 2:06 PM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote:
> > Assuming the desirable thing to do is to use native_sched_clock() in this
> > scenario, do we need a separate rating system, or can we simply tie the
> > sched clock selection to the clocksource selection, e.g. override the
> > paravirt stuff if the TSC clock has higher priority and is chosen?
>
> Yeah, I see no point of another rating system. Just force the thing back
> to native (or don't set it to that other thing).

2023-10-19 15:47:39

by David Woodhouse

Subject: Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

On Thu, 2023-10-19 at 08:40 -0700, Sean Christopherson wrote:
>
> > Normally, it should be up to the hypervisor to tell the guest which
> > clock to use, i.e. if TSC is reliable or not. Let me put my question
> > this way: if TSC on the particular host is good for everything, why
> > does the hypervisor advertises 'kvmclock' to its guests?
>
> I suspect there are two reasons.
>
>   1. As is likely the case in our fleet, no one revisited the set of advertised
>      PV features when defining the VM shapes for a new generation of hardware, or
>      whoever did the reviews wasn't aware that advertising kvmclock is actually
>      suboptimal.  All the PV clock stuff in KVM is quite labyrinthian, so it's
>      not hard to imagine it getting overlooked.
>
>   2. Legacy VMs.  If VMs have been running with a PV clock for years, forcing
>      them to switch to a new clocksource is high-risk, low-reward.

Doubly true for Xen guests (given that the Xen clocksource is identical
to the KVM clocksource).

> > If for some 'historical reasons' we can't revoke features we can always
> > introduce a new PV feature bit saying that TSC is preferred.

Don't we already have one? It's the PVCLOCK_TSC_STABLE_BIT. Why would a
guest ever use kvmclock if the PVCLOCK_TSC_STABLE_BIT is set?

The *point* in the kvmclock is that the hypervisor can mess with the
epoch/scaling to try to compensate for TSC brokenness as the host
scales/sleeps/etc.

And the *problem* with the kvmclock is that it does just that, even
when the host TSC hasn't done anything wrong and the kvmclock shouldn't
have changed at all.

If the PVCLOCK_TSC_STABLE_BIT is set, a guest should just use the guest
TSC directly without looking to the kvmclock for adjusting it.

No?
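
To make the epoch/scaling point concrete, here is a simplified user-space
model of how a pvclock reading is derived from the guest TSC. The field names
follow struct pvclock_vcpu_time_info; the model_* names are hypothetical, and
the version/seqcount retry loop of the real reader is left out.

```c
#include <stdint.h>

/* Subset of the fields the hypervisor publishes to the guest. */
struct model_pvti {
	uint64_t tsc_timestamp;     /* guest TSC at the last host update */
	uint64_t system_time;       /* nanoseconds at that same moment */
	uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
	int8_t   tsc_shift;         /* shift applied to the TSC delta first */
};

/* (delta << shift) * mul / 2^32, split to stay within 64-bit arithmetic. */
static uint64_t model_scale_delta(uint64_t delta, uint32_t mul, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Time = epoch published by the host + scaled TSC ticks since that epoch. */
static uint64_t model_pvclock_read(const struct model_pvti *p, uint64_t tsc)
{
	return p->system_time +
	       model_scale_delta(tsc - p->tsc_timestamp,
				 p->tsc_to_system_mul, p->tsc_shift);
}
```

Whenever the host rewrites system_time, tsc_timestamp or the multiplier, the
guest's reading follows, which is both the feature and the source of drift
when the update was not actually needed.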




2023-10-19 16:47:14

by Sean Christopherson

Subject: Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

On Thu, Oct 19, 2023, David Woodhouse wrote:
> On Thu, 2023-10-19 at 08:40 -0700, Sean Christopherson wrote:
> > > If for some 'historical reasons' we can't revoke features we can always
> > > introduce a new PV feature bit saying that TSC is preferred.
>
> Don't we already have one? It's the PVCLOCK_TSC_STABLE_BIT. Why would a
> guest ever use kvmclock if the PVCLOCK_TSC_STABLE_BIT is set?
>
> The *point* in the kvmclock is that the hypervisor can mess with the
> epoch/scaling to try to compensate for TSC brokenness as the host
> scales/sleeps/etc.
>
> And the *problem* with the kvmclock is that it does just that, even
> when the host TSC hasn't done anything wrong and the kvmclock shouldn't
> have changed at all.
>
> If the PVCLOCK_TSC_STABLE_BIT is set, a guest should just use the guest
> TSC directly without looking to the kvmclock for adjusting it.
>
> No?

No :-)

PVCLOCK_TSC_STABLE_BIT doesn't provide the guarantees that are needed to use the
raw TSC directly. It's close, but there is at least one situation where using TSC
directly even when the TSC is stable is a bad idea: when hardware doesn't support TSC
scaling and the guest virtual TSC is running at a higher frequency than the hardware
TSC. The guest doesn't have to worry about the TSC going backwards, but using the
TSC directly would cause the guest's time calculations to be inaccurate.

And PVCLOCK_TSC_STABLE_BIT is also much more dynamic as it's tied to a given
generation/sequence. E.g. if KVM stops using its masterclock for whatever reason,
then kvm_guest_time_update() will effectively clear PVCLOCK_TSC_STABLE_BIT and the
guest-side __pvclock_clocksource_read() will be forced to do a bit of extra work
to ensure the clock is monotonically increasing.
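
The "extra work" is roughly the following shape: a global last-seen value that
every reader clamps against when the stable bit is not set. This is a
simplified C11 model with hypothetical model_* names, not the kernel source.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest value handed out so far, shared by all readers. */
static _Atomic uint64_t model_last_value;

static uint64_t model_monotonic_read(uint64_t raw, bool tsc_stable)
{
	uint64_t last;

	/* With the stable bit set, the raw reading is returned as-is. */
	if (tsc_stable)
		return raw;

	/* Otherwise never return less than what another reader already saw. */
	last = atomic_load(&model_last_value);
	do {
		if (raw <= last)
			return last;
	} while (!atomic_compare_exchange_weak(&model_last_value, &last, raw));

	return raw;
}
```

The clamp keeps time from going backwards across vCPUs, at the cost of a
shared cache line being touched on every read.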

2023-10-19 23:13:16

by Dongli Zhang

Subject: Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

Hi Vitaly, Sean and David,

On 10/19/23 08:40, Sean Christopherson wrote:
> On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
>> Dongli Zhang <[email protected]> writes:
>>
>>> As mentioned in the linux kernel development document, "sched_clock() is
>>> used for scheduling and timestamping". While there is a default native
>>> implementation, many paravirtualizations have their own implementations.
>>>
>>> About KVM, it uses kvm_sched_clock_read() and there is no way to only
>>> disable KVM's sched_clock. The "no-kvmclock" may disable all
>>> paravirtualized kvmclock features.
>
> ...
>
>>> Please suggest and comment if other options are better:
>>>
>>> 1. Global param (this RFC patch).
>>>
>>> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>>>
>>> Indeed I like the 2nd method.
>>>
>>> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>>>
>>> 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
>>> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
>>> want to keep the pv sched_clock.
>>
>> Normally, it should be up to the hypervisor to tell the guest which
>> clock to use, i.e. if TSC is reliable or not. Let me put my question
>> this way: if TSC on the particular host is good for everything, why
>> does the hypervisor advertises 'kvmclock' to its guests?
>
> I suspect there are two reasons.
>
> 1. As is likely the case in our fleet, no one revisited the set of advertised
> PV features when defining the VM shapes for a new generation of hardware, or
> whoever did the reviews wasn't aware that advertising kvmclock is actually
> suboptimal. All the PV clock stuff in KVM is quite labyrinthian, so it's
> not hard to imagine it getting overlooked.
>
> 2. Legacy VMs. If VMs have been running with a PV clock for years, forcing
> them to switch to a new clocksource is high-risk, low-reward.
>
>> If for some 'historical reasons' we can't revoke features we can always
>> introduce a new PV feature bit saying that TSC is preferred.
>>
>> 1) Global param doesn't sound like a good idea to me: chances are that
>> people will be setting it on their guest images to workaround problems
>> on one hypervisor (or, rather, on one public cloud which is too lazy to
>> fix their hypervisor) while simultaneously creating problems on another.
>>
>> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
>> as kvmclock, I'm not convinced it actually makes sense to separate the
>> two. Like if sched_clock is known to be bad but TSC is good, why do we
>> need to use PV clock at all? Having a parameter for debugging purposes
>> may be OK though...
>>
>> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
>> (HV_ACCESS_TSC_INVARIANT) and not the architectural
>> CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the later on
>> all hypervisors.
>>
>> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
>> 100% reliable. I can imagine cases when we can't detect that fact that
>> while synchronized across CPUs and not going backwards, it is, for
>> example, ticking with an unstable frequency and PV sched clock is
>> supposed to give the right correction (all of them are rdtsc() based
>> anyways, aren't they?).
>
> Yeah, practically speaking, the only thing adding a knob to turn off using PV
> clocks for sched_clock will accomplish is creating an even bigger matrix of
> combinations that can cause problems, e.g. where guests end up using kvmclock
> timekeeping but not scheduling.
>
> The explanation above and the links below fail to capture _the_ key point:
> Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the clocksource
> when the TSC is constant and nonstop (first spliced blob below).
>
> What I suggested is that if the TSC is chosen over a PV clock as the clocksource,
> then we have the kernel also override the sched_clock selection (second spliced
> blob below).
>
> That doesn't require the guest admin to opt-in, and doesn't create even more
> combinations to support. It also provides for a smoother transition for when
> customers inevitably end up creating VMs on hosts that don't advertise kvmclock
> (or any PV clock).

I would prefer to always leave the guest admin an option to change the
decision, especially for diagnostic/workaround reasons (although the kvmclock is
always buggy when the TSC is buggy).


To summarize the discussion:

1. Vitaly Kuznetsov is wary of a global param, because the same guest image
may be deployed on different hypervisors. A KVM-specific param may be OK for
debugging purposes.

2. Sean Christopherson prefers switching the sched_clock automatically,
depending on whether the TSC is chosen as the clocksource.


However, the clocksource and sched_clock are different concepts.

1. The clocksource is an arch-global concept. That is, all archs (e.g., x86,
arm, mips) share the same implementation to register/select a clocksource. In
addition, something like HPET does not have a sched_clock.

2. Some architectures have their own sched_clock implementation. E.g., x86 has
its own sched_clock implementation in arch/x86/kernel/tsc.c.

notrace u64 sched_clock(void)
{
	u64 now;
	preempt_disable_notrace();
	now = sched_clock_noinstr();
	preempt_enable_notrace();
	return now;
}

3. When !CONFIG_PARAVIRT, it is native_sched_clock().

4. When CONFIG_PARAVIRT, it is sched_clock_noinstr()->paravirt_sched_clock()
referring to paravirt specific implementation (native/kvm/xen/vmware/hyperv).

That is, the pv sched_clock is a concept under x86 when CONFIG_PARAVIRT==true.
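
The dispatch described in points 3 and 4 can be modeled in user space with a
plain function pointer standing in for the pv_sched_clock static call. This is
a sketch only; all model_* names and the returned constants are hypothetical.

```c
#include <stdint.h>

typedef uint64_t (*model_clock_fn)(void);

/* Stand-ins for native_sched_clock() and kvm_sched_clock_read(). */
static uint64_t model_native_sched_clock(void) { return 1; }
static uint64_t model_kvm_sched_clock_read(void) { return 2; }

/* Models the static call; it starts out pointing at the native reader. */
static model_clock_fn model_pv_sched_clock = model_native_sched_clock;

/* Models paravirt_set_sched_clock(): a hypervisor guest retargets the call. */
static void model_paravirt_set_sched_clock(model_clock_fn fn)
{
	model_pv_sched_clock = fn;
}

/* Models sched_clock() when CONFIG_PARAVIRT is enabled. */
static uint64_t model_sched_clock(void)
{
	return model_pv_sched_clock();
}
```

Until a guest-specific init path calls the setter, model_sched_clock() keeps
returning the native value, which is why refusing the setter is enough to stay
on the native sched_clock.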


Although such an implementation is possible, I do not like the idea of changing
arch-global code to accommodate a requirement from a leaf of the tree.


How about keeping the change within x86, as below? It won't work unless I change
'tsc_clocksource_reliable' to an early_param.

---
arch/x86/include/asm/paravirt.h | 2 +-
arch/x86/kernel/kvmclock.c | 12 +++++++-----
arch/x86/kernel/paravirt.c | 16 +++++++++++++++-
3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6c8ff12..118b793 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -24,7 +24,7 @@
 DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
 DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);

-void paravirt_set_sched_clock(u64 (*func)(void));
+int paravirt_set_sched_clock(u64 (*func)(void));

 static __always_inline u64 paravirt_sched_clock(void)
 {
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f5214..0b8bf56 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)

 static inline void kvm_sched_clock_init(bool stable)
 {
-	if (!stable)
-		clear_sched_clock_stable();
 	kvm_sched_clock_offset = kvm_clock_read();
-	paravirt_set_sched_clock(kvm_sched_clock_read);

-	pr_info("kvm-clock: using sched offset of %llu cycles",
-		kvm_sched_clock_offset);
+	if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
+		if (!stable)
+			clear_sched_clock_stable();
+
+		pr_info("kvm-clock: using sched offset of %llu cycles",
+			kvm_sched_clock_offset);
+	}

 	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
 		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 97f1436..f8ad521 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -118,9 +118,23 @@ static u64 native_steal_clock(int cpu)
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);

-void paravirt_set_sched_clock(u64 (*func)(void))
+int paravirt_set_sched_clock(u64 (*func)(void))
 {
+	if (tsc_clocksource_reliable)
+		goto refuse;
+
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+	    !check_tsc_unstable())
+		goto refuse;
+
 	static_call_update(pv_sched_clock, func);
+
+	return 0;
+
+refuse:
+	pr_info("sched_clock: use native when TSC is reliable\n");
+	return -EPERM;
 }

 /* These are in entry.S */



My actual favorite is to keep the change within kvmclock.
(This won't work until I turn 'tsc_clocksource_reliable' into an early_param.)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f5214..f16655d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -286,6 +286,7 @@ static int kvmclock_setup_percpu(unsigned int cpu)

 void __init kvmclock_init(void)
 {
+	bool prefer_tsc = false;
 	u8 flags;

 	if (!kvm_para_available() || !kvmclock)
@@ -313,19 +314,8 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
-
-	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
-	x86_platform.get_wallclock = kvm_get_wallclock;
-	x86_platform.set_wallclock = kvm_set_wallclock;
-#ifdef CONFIG_X86_LOCAL_APIC
-	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
-#endif
-	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
-	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
-	kvm_get_preset_lpj();
+	if (tsc_clocksource_reliable)
+		prefer_tsc = true;

 	/*
 	 * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
@@ -334,10 +324,31 @@ void __init kvmclock_init(void)
 	 * Invariant TSC exposed by host means kvmclock is not necessary:
 	 * can use TSC as clocksource.
 	 *
+	 * The TSC is used also when tsc_clocksource_reliable is configured
+	 * in kernel command line on purpose.
 	 */
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
 	    !check_tsc_unstable())
+		prefer_tsc = true;
+
+	if (!prefer_tsc) {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
+
+	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+	x86_platform.get_wallclock = kvm_get_wallclock;
+	x86_platform.set_wallclock = kvm_set_wallclock;
+#ifdef CONFIG_X86_LOCAL_APIC
+	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
+#endif
+	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
+	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
+	kvm_get_preset_lpj();
+
+	if (prefer_tsc)
 		kvm_clock.rating = 299;

 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);


Thank you very much!

Dongli Zhang

>
>>> To introduce a param may be easier to backport to old kernel version.
>>>
>>> References:
>>> [1] https://lore.kernel.org/all/[email protected]/
>>> [2] https://lore.kernel.org/all/[email protected]/
>>> [3] https://lore.kernel.org/all/[email protected]/
>
> On Mon, Oct 2, 2023 at 11:18 AM Sean Christopherson <[email protected]> wrote:
>>> Do we need to update the documentation to always suggest TSC when it is
>>> constant, as I believe many users still prefer pv clock than tsc?
>>>
>>> Thanks to tsc ratio scaling, the live migration will not impact tsc.
>>>
>>> From the source code, the rating of kvm-clock is still higher than tsc.
>>>
>>> BTW., how about to decrease the rating if guest detects constant tsc?
>>>
>>> 166 struct clocksource kvm_clock = {
>>> 167 .name = "kvm-clock",
>>> 168 .read = kvm_clock_get_cycles,
>>> 169 .rating = 400,
>>> 170 .mask = CLOCKSOURCE_MASK(64),
>>> 171 .flags = CLOCK_SOURCE_IS_CONTINUOUS,
>>> 172 .enable = kvm_cs_enable,
>>> 173 };
>>>
>>> 1196 static struct clocksource clocksource_tsc = {
>>> 1197 .name = "tsc",
>>> 1198 .rating = 300,
>>> 1199 .read = read_tsc,
>>
>> That's already done in kvmclock_init().
>>
>> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>> !check_tsc_unstable())
>> kvm_clock.rating = 299;
>>
>> See also: https://lore.kernel.org/all/ZOjF2DMBgW%[email protected]
>>
>>> 2. The sched_clock.
>>>
>>> The scheduling is impacted if there is big drift.
>>
>> ...
>>
>>> Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
>>> operations (not only sched_clock), e.g., after line 300.
>>
>> ...
>>
>>> Should I introduce a new param to disable no-kvm-sched-clock only, or to
>>> introduce a new param to allow the selection of sched_clock?
>>
>> I don't think we want a KVM-specific knob, because every flavor of paravirt guest
>> would need to do the same thing. And unless there's a good reason to use a
>> paravirt clock, this really shouldn't be something the guest admin needs to opt
>> into using.
>
>
> On Mon, Oct 2, 2023 at 2:06 PM Peter Zijlstra <[email protected]> wrote:
>>
>> On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote:
>>> Assuming the desirable thing to do is to use native_sched_clock() in this
>>> scenario, do we need a separate rating system, or can we simply tie the
>>> sched clock selection to the clocksource selection, e.g. override the
>>> paravirt stuff if the TSC clock has higher priority and is chosen?
>>
>> Yeah, I see no point of another rating system. Just force the thing back
>> to native (or don't set it to that other thing).