2023-10-06 01:13:18

by Vishal Annapurve

[permalink] [raw]
Subject: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs

TSC calibration for native execution gets the TSC frequency from CPUID,
but also ends up setting lapic_timer_period. When using oneshot mode
with lapic timer, predefined value of lapic_timer_period causes lapic
timer calibration to be skipped with wrong multipliers set for lapic
timer.

To avoid this issue, override the TSC calibration step for TDX VMs to
just calculate the TSC frequency using cpuid values.

Signed-off-by: Vishal Annapurve <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1d6b863c42b0..6625594f8c62 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -757,6 +757,33 @@ static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
return true;
}

+/**
+ * Determine TSC frequency via CPUID, else return 0.
+ */
+static unsigned long tdx_calibrate_tsc(void)
+{
+ unsigned int eax_denominator = 0, ebx_numerator = 0, ecx_hz = 0, edx = 0;
+ unsigned int crystal_khz;
+
+ /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
+ cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
+
+ if (ebx_numerator == 0 || eax_denominator == 0)
+ return 0;
+
+ crystal_khz = ecx_hz / 1000;
+
+ /*
+ * TSC frequency reported directly by CPUID is a "hardware reported"
+ * frequency and is the most accurate one so far we have. This
+ * is considered a known frequency.
+ */
+ if (crystal_khz != 0)
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+ return crystal_khz * ebx_numerator / eax_denominator;
+}
+
void __init tdx_early_init(void)
{
u64 cc_mask;
@@ -808,6 +835,7 @@ void __init tdx_early_init(void)

x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
+ x86_platform.calibrate_tsc = tdx_calibrate_tsc;

/*
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
--
2.42.0.609.gbb76f46606-goog


2023-10-06 10:43:59

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs


>
> +/**
> + * Determine TSC frequency via CPUID, else return 0.
> + */

Nit: looks you don't need the k-doc style comment here?

> +static unsigned long tdx_calibrate_tsc(void)
> +{
> + unsigned int eax_denominator = 0, ebx_numerator = 0, ecx_hz = 0, edx = 0;
> + unsigned int crystal_khz;
> +
> + /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> + cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
> +
> + if (ebx_numerator == 0 || eax_denominator == 0)
> + return 0;
> +
> + crystal_khz = ecx_hz / 1000;
> +
> + /*
> + * TSC frequency reported directly by CPUID is a "hardware reported"
> + * frequency and is the most accurate one so far we have. This
> + * is considered a known frequency.
> + */
> + if (crystal_khz != 0)
> + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +
> + return crystal_khz * ebx_numerator / eax_denominator;
> +}
> +

2023-10-06 14:03:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs

On 10/5/23 18:12, Vishal Annapurve wrote:
> +/**
> + * Determine TSC frequency via CPUID, else return 0.
> + */
> +static unsigned long tdx_calibrate_tsc(void)
> +{
> + unsigned int eax_denominator = 0, ebx_numerator = 0, ecx_hz = 0, edx = 0;
> + unsigned int crystal_khz;
> +
> + /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> + cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
> +
> + if (ebx_numerator == 0 || eax_denominator == 0)
> + return 0;
> +
> + crystal_khz = ecx_hz / 1000;
> +
> + /*
> + * TSC frequency reported directly by CPUID is a "hardware reported"
> + * frequency and is the most accurate one so far we have. This
> + * is considered a known frequency.
> + */
> + if (crystal_khz != 0)
> + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +
> + return crystal_khz * ebx_numerator / eax_denominator;
> +}
> +

Would it be possible to do this by refactoring the existing code and
calling it directly instead of copying and pasting so much?

2023-10-06 15:27:50

by Vishal Annapurve

[permalink] [raw]
Subject: Re: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs

On Fri, Oct 6, 2023 at 7:03 AM Dave Hansen <[email protected]> wrote:
>
> On 10/5/23 18:12, Vishal Annapurve wrote:
> > +/**
> > + * Determine TSC frequency via CPUID, else return 0.
> > + */
> > +static unsigned long tdx_calibrate_tsc(void)
> > +{
> > + unsigned int eax_denominator = 0, ebx_numerator = 0, ecx_hz = 0, edx = 0;
> > + unsigned int crystal_khz;
> > +
> > + /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> > + cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
> > +
> > + if (ebx_numerator == 0 || eax_denominator == 0)
> > + return 0;
> > +
> > + crystal_khz = ecx_hz / 1000;
> > +
> > + /*
> > + * TSC frequency reported directly by CPUID is a "hardware reported"
> > + * frequency and is the most accurate one so far we have. This
> > + * is considered a known frequency.
> > + */
> > + if (crystal_khz != 0)
> > + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > +
> > + return crystal_khz * ebx_numerator / eax_denominator;
> > +}
> > +
>
> Would it be possible to do this by refactoring the existing code and
> calling it directly instead of copying and pasting so much?

One option is to call native_calibrate_tsc from tdx_calibrate_tsc and
undo the lapic_timer_period configuration after the call. Does that
sound reasonable?

2023-10-06 15:29:38

by Vishal Annapurve

[permalink] [raw]
Subject: Re: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs

On Fri, Oct 6, 2023 at 3:43 AM Huang, Kai <[email protected]> wrote:
>
>
> >
> > +/**
> > + * Determine TSC frequency via CPUID, else return 0.
> > + */
>
> Nit: looks you don't need the k-doc style comment here?
>

Ack, will remove/update this comment in the next version.

2023-10-06 15:43:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs

On 10/6/23 08:27, Vishal Annapurve wrote:
>> Would it be possible to do this by refactoring the existing code and
>> calling it directly instead of copying and pasting so much?
> One option is to call native_calibrate_tsc from tdx_calibrate_tsc and
> undo the lapic_timer_period configuration after the call. Does that
> sound reasonable?

That sounds like a hack.

Why not just refactor the code properly?

2023-10-13 22:44:08

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs

On Fri, Oct 06, 2023, Vishal Annapurve wrote:
> TSC calibration for native execution gets the TSC frequency from CPUID,
> but also ends up setting lapic_timer_period. When using oneshot mode
> with lapic timer, predefined value of lapic_timer_period causes lapic
> timer calibration to be skipped with wrong multipliers set for lapic
> timer.
>
> To avoid this issue, override the TSC calibration step for TDX VMs to
> just calculate the TSC frequency using cpuid values.

This is a hack to workaround a KVM TDX bug. Per Intel's SDM:

The APIC timer frequency will be the processor’s bus clock or core crystal
clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID
leaf 0x15) divided by the value specified in the divide configuration register.

TDX hardcodes the core crystal frequency to 25Mhz, whereas KVM hardcodes the APIC
bus frequency to 1Ghz. Upstream KVM's *current* behavior is fine, because KVM
doesn't advertise support for CPUID 0x15, i.e. doesn't announce to host userspace
that it's safe to expose CPUID 0x15 to the guest. Because TDX makes exposing
CPUID 0x15 mandatory, KVM needs to be taught to correctly emulate the guest's
APIC bus frequency, a.k.a. the TDX guest core crystal frequency of 25Mhz.

I.e. tmict_to_ns() needs to replace APIC_BUS_CYCLE_NS with some math that makes
the guest's APIC timer actually run at 25Mhz given whatever the host APIC bus
runs at.

static inline u64 tmict_to_ns(struct kvm_lapic *apic, u32 tmict)
{
return (u64)tmict * APIC_BUS_CYCLE_NS * (u64)apic->divide_count;
}

The existing guest code "works" because the calibration code effectively discovers
the host APIC bus frequency. If we really want to force calibration, then the
best way to do that would be to add a command line option to do exactly that, not
hack around a KVM TDX bug.

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 15f97c0abc9d..ce1cec6b3c18 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -723,7 +723,8 @@ unsigned long native_calibrate_tsc(void)
* lapic_timer_period here to avoid having to calibrate the APIC
* timer later.
*/
- lapic_timer_period = crystal_khz * 1000 / HZ;
+ if (!force_lapic_timer_calibration)
+ lapic_timer_period = crystal_khz * 1000 / HZ;
#endif

return crystal_khz * ebx_numerator / eax_denominator;

But I would be very leery of forcing calibration, as effectively calibrating to
the *host* core crystal frequency will cause the guest APIC timer to be wrong if
the VM is migrated to a host with a different core crystal frequency. Relying
on CPUID 0x15, if it's available, avoids that problem because it puts the onus on
the hypervisor to account for the new host's frequency when emulating the APIC
timer. That mess exists today, but deliberate ignoring the mechanism that allows
the host to fix the mess would be asinine IMO.

Even better would be for GCE to just enumerate support for TSC deadline already,
because KVM already does the right thing to convert guest TSC frequency to host
TSC frequency. KVM TDX would still need to add full support for CPUID 0x15, but
at least any problems with using one-shot mode will unlikely to impact real guests.

2023-10-13 23:02:37

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs

On Fri, Oct 13, 2023, Sean Christopherson wrote:
> On Fri, Oct 06, 2023, Vishal Annapurve wrote:
> > TSC calibration for native execution gets the TSC frequency from CPUID,
> > but also ends up setting lapic_timer_period. When using oneshot mode
> > with lapic timer, predefined value of lapic_timer_period causes lapic
> > timer calibration to be skipped with wrong multipliers set for lapic
> > timer.
> >
> > To avoid this issue, override the TSC calibration step for TDX VMs to
> > just calculate the TSC frequency using cpuid values.
>
> This is a hack to workaround a KVM TDX bug. Per Intel's SDM:
>
> The APIC timer frequency will be the processor’s bus clock or core crystal
> clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID
> leaf 0x15) divided by the value specified in the divide configuration register.
>
> TDX hardcodes the core crystal frequency to 25Mhz, whereas KVM hardcodes the APIC
> bus frequency to 1Ghz. Upstream KVM's *current* behavior is fine, because KVM
> doesn't advertise support for CPUID 0x15, i.e. doesn't announce to host userspace
> that it's safe to expose CPUID 0x15 to the guest. Because TDX makes exposing
> CPUID 0x15 mandatory, KVM needs to be taught to correctly emulate the guest's
> APIC bus frequency, a.k.a. the TDX guest core crystal frequency of 25Mhz.
>
> I.e. tmict_to_ns() needs to replace APIC_BUS_CYCLE_NS with some math that makes
> the guest's APIC timer actually run at 25Mhz given whatever the host APIC bus
> runs at.
>
> static inline u64 tmict_to_ns(struct kvm_lapic *apic, u32 tmict)
> {
> return (u64)tmict * APIC_BUS_CYCLE_NS * (u64)apic->divide_count;
> }
>
> The existing guest code "works" because the calibration code effectively discovers
> the host APIC bus frequency. If we really want to force calibration, then the
> best way to do that would be to add a command line option to do exactly that, not
> hack around a KVM TDX bug.
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 15f97c0abc9d..ce1cec6b3c18 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -723,7 +723,8 @@ unsigned long native_calibrate_tsc(void)
> * lapic_timer_period here to avoid having to calibrate the APIC
> * timer later.
> */
> - lapic_timer_period = crystal_khz * 1000 / HZ;
> + if (!force_lapic_timer_calibration)
> + lapic_timer_period = crystal_khz * 1000 / HZ;
> #endif
>
> return crystal_khz * ebx_numerator / eax_denominator;
>
> But I would be very leery of forcing calibration, as effectively calibrating to
> the *host* core crystal frequency will cause the guest APIC timer to be wrong if
> the VM is migrated to a host with a different core crystal frequency. Relying
> on CPUID 0x15, if it's available, avoids that problem because it puts the onus on
> the hypervisor to account for the new host's frequency when emulating the APIC
> timer. That mess exists today, but deliberate ignoring the mechanism that allows
> the host to fix the mess would be asinine IMO.

Gah, I had a brainfart. tmict_to_ns() obviously converts TMICT to nanoseconds,
and then converts nanoseconds into whatever frequency the underlying host timer
runs at.

So it's really only TDX that is problematic, as TDX demands the APIC bus be
emulated at a frequency other than 1Ghz.

2023-10-21 02:01:01

by Vishal Annapurve

[permalink] [raw]
Subject: Re: [PATCH] x86/tdx: Override the tsc calibration for TDX VMs

On Sat, Oct 14, 2023 at 4:32 AM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Oct 13, 2023, Sean Christopherson wrote:
> > On Fri, Oct 06, 2023, Vishal Annapurve wrote:
> > > TSC calibration for native execution gets the TSC frequency from CPUID,
> > > but also ends up setting lapic_timer_period. When using oneshot mode
> > > with lapic timer, predefined value of lapic_timer_period causes lapic
> > > timer calibration to be skipped with wrong multipliers set for lapic
> > > timer.
> > >
> > > To avoid this issue, override the TSC calibration step for TDX VMs to
> > > just calculate the TSC frequency using cpuid values.
> >
> > This is a hack to workaround a KVM TDX bug. Per Intel's SDM:
> >
> > The APIC timer frequency will be the processor’s bus clock or core crystal
> > clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID
> > leaf 0x15) divided by the value specified in the divide configuration register.
> >
> > TDX hardcodes the core crystal frequency to 25Mhz, whereas KVM hardcodes the APIC
> > bus frequency to 1Ghz. Upstream KVM's *current* behavior is fine, because KVM
> > doesn't advertise support for CPUID 0x15, i.e. doesn't announce to host userspace
> > that it's safe to expose CPUID 0x15 to the guest. Because TDX makes exposing
> > CPUID 0x15 mandatory, KVM needs to be taught to correctly emulate the guest's
> > APIC bus frequency, a.k.a. the TDX guest core crystal frequency of 25Mhz.
> >

Ack, makes sense to pursue this fix from the KVM TDX side instead of
guest changes as per your suggestion.

>
>
> > I.e. tmict_to_ns() needs to replace APIC_BUS_CYCLE_NS with some math that makes
> > the guest's APIC timer actually run at 25Mhz given whatever the host APIC bus
> > runs at.
> >
> > static inline u64 tmict_to_ns(struct kvm_lapic *apic, u32 tmict)
> > {
> > return (u64)tmict * APIC_BUS_CYCLE_NS * (u64)apic->divide_count;
> > }
>