2019-04-22 10:16:41

by Daniel Drake

[permalink] [raw]
Subject: [PATCH 1/2] x86/tsc: use CPUID.0x16 to calculate missing crystal frequency

native_calibrate_tsc() had a hardcoded table of Intel CPU families
and crystal clock, but we have found that it is possible to
calculate the crystal clock speed, and this is preferred to a hardcoded
table.

Where crystal clock frequency was not reported by CPUID.0x15,
use CPUID.0x16 data to calculate the crystal clock.

Using CPUID dump data from http://instlatx64.atw.hu/, the calculation
results can be seen to be sufficiently close to the previously hardcoded
values:
SKYLAKE_MOBILE: 24074074 Hz
SKYLAKE_DESKTOP: 23913043 Hz
KABYLAKE_MOBILE: 23893805 Hz
KABYLAKE_DESKTOP: 24050632 Hz
GOLDMONT: 19.2MHz crystal clock correctly reported by CPUID.0x15

Additionally, crystal clock frequency for platforms that were missing
from the list (e.g. SKYLAKE_X) will now be provided.

GOLDMONT_X was left as a hardcoded value, as the CPUID data on that site
indicates that the hardware does not report crystal clock nor CPUID.0x16
data.

Going forward, Intel have hopefully now started providing crystal
frequency in CPUID.0x15. At least ApolloLake, GeminiLake and CannonLake
CPUs all provide the relevant data directly.

Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Daniel Drake <[email protected]>
---
arch/x86/kernel/tsc.c | 33 ++++++++++++++++++---------------
1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 3fae23834069..3971c837584a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -629,23 +629,26 @@ unsigned long native_calibrate_tsc(void)

crystal_khz = ecx_hz / 1000;

- if (crystal_khz == 0) {
- switch (boot_cpu_data.x86_model) {
- case INTEL_FAM6_SKYLAKE_MOBILE:
- case INTEL_FAM6_SKYLAKE_DESKTOP:
- case INTEL_FAM6_KABYLAKE_MOBILE:
- case INTEL_FAM6_KABYLAKE_DESKTOP:
- crystal_khz = 24000; /* 24.0 MHz */
- break;
- case INTEL_FAM6_ATOM_GOLDMONT_X:
- crystal_khz = 25000; /* 25.0 MHz */
- break;
- case INTEL_FAM6_ATOM_GOLDMONT:
- crystal_khz = 19200; /* 19.2 MHz */
- break;
- }
+ /*
+ * Some Intel SoCs like Skylake and Kabylake don't report the crystal
+ * clock, but we can easily calculate it by considering the crystal
+ * ratio and the CPU speed.
+ */
+ if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= 0x16) {
+ unsigned int eax_base_mhz, ebx, ecx, edx;
+ cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
+ crystal_khz = eax_base_mhz * 1000 * \
+ eax_denominator / ebx_numerator;
}

+ /*
+ * Denverton SoCs don't report crystal clock, and also don't support
+ * CPUID.0x16, so hardcode the 25MHz crystal clock.
+ */
+ if (crystal_khz == 0 &&
+ boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT_X)
+ crystal_khz = 25000;
+
if (crystal_khz == 0)
return 0;
/*
--
2.19.1


2019-04-22 10:18:22

by Daniel Drake

[permalink] [raw]
Subject: [PATCH 2/2] x86/tsc: set LAPIC timer frequency to crystal clock frequency

The APIC timer calibration (calibrate_APIC_timer()) can be skipped
in cases where we know the APIC timer frequency. On Intel SoCs,
we believe that the APIC is fed by the crystal clock; this would make
sense, and the crystal clock frequency has been verified against the
APIC timer calibration result on ApolloLake, GeminiLake, Kabylake,
CoffeeLake, WhiskeyLake and AmberLake.

Set lapic_timer_frequency based on the crystal clock frequency
accordingly.

APIC timer calibration would normally be skipped on modern CPUs
by nature of the TSC deadline timer being used instead,
however this change is still potentially useful, e.g. if the
TSC deadline timer has been disabled with a kernel parameter.
calibrate_APIC_timer() uses the legacy timer, but we are seeing
new platforms that omit such legacy functionality, so avoiding
such codepaths is becoming more important.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Daniel Drake <[email protected]>
---
arch/x86/kernel/tsc.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 3971c837584a..8750543287fc 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -665,6 +665,16 @@ unsigned long native_calibrate_tsc(void)
if (boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT)
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);

+#ifdef CONFIG_X86_LOCAL_APIC
+ /*
+ * The local APIC appears to be fed by the core crystal clock
+ * (which sounds entirely sensible). We can set the global
+ * lapic_timer_frequency here to avoid having to calibrate the APIC
+ * timer later.
+ */
+ lapic_timer_frequency = (crystal_khz * 1000) / HZ;
+#endif
+
return crystal_khz * ebx_numerator / eax_denominator;
}

--
2.19.1

2019-04-22 12:11:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tsc: set LAPIC timer frequency to crystal clock frequency


* Daniel Drake <[email protected]> wrote:

> +#ifdef CONFIG_X86_LOCAL_APIC
> + /*
> + * The local APIC appears to be fed by the core crystal clock
> + * (which sounds entirely sensible). We can set the global
> + * lapic_timer_frequency here to avoid having to calibrate the APIC
> + * timer later.
> + */
> + lapic_timer_frequency = (crystal_khz * 1000) / HZ;
> +#endif

Minor style nit: the parentheses are unnecessary, integer expressions
like this are evaluated left to right and multiplication and division has
the same precedence.

But it might also make sense to actually store crystal_mhz instead of
crystal_khz, because both CPUID 15H and 16H provides MHz values.

That way the above expression would simplify to:

lapic_timer_frequency = crystal_mhz / HZ;

Thanks,

Ingo

2019-04-23 04:31:58

by Daniel Drake

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tsc: set LAPIC timer frequency to crystal clock frequency

On Mon, Apr 22, 2019 at 8:04 PM Ingo Molnar <[email protected]> wrote:
> Minor style nit: the parentheses are unnecessary, integer expressions
> like this are evaluated left to right and multiplication and division has
> the same precedence.

Fair point, although the same could be said for cpu_khz_from_msr().

> But it might also make sense to actually store crystal_mhz instead of
> crystal_khz, because both CPUID 15H and 16H provides MHz values.
>
> That way the above expression would simplify to:
>
> lapic_timer_frequency = crystal_mhz / HZ;

Wouldn't it be
lapic_timer_frequency = crystal_mhz * 1000000 / HZ;
?

Thanks
Daniel

2019-04-23 09:10:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/tsc: use CPUID.0x16 to calculate missing crystal frequency

On Mon, Apr 22, 2019 at 06:15:25PM +0800, Daniel Drake wrote:
> Additionally, crystal clock frequency for platforms that were missing
> from the list (e.g. SKYLAKE_X) will now be provided.

Well, SKX isn't exactly 'missing'; it would be very good if we can
confirm the new code is still correct under the below mentioned
conditions.

---

commit b511203093489eb1829cb4de86e8214752205ac6
Author: Len Brown <[email protected]>
Date: Fri Dec 22 00:27:55 2017 -0500

x86/tsc: Fix erroneous TSC rate on Skylake Xeon

The INTEL_FAM6_SKYLAKE_X hardcoded crystal_khz value of 25MHZ is
problematic:

- SKX workstations (with same model # as server variants) use a 24 MHz
crystal. This results in a -4.0% time drift rate on SKX workstations.

- SKX servers subject the crystal to an EMI reduction circuit that reduces its
actual frequency by (approximately) -0.25%. This results in -1 second per
10 minute time drift as compared to network time.

This issue can also trigger a timer and power problem, on configurations
that use the LAPIC timer (versus the TSC deadline timer). Clock ticks
scheduled with the LAPIC timer arrive a few usec before the time they are
expected (according to the slow TSC). This causes Linux to poll-idle, when
it should be in an idle power saving state. The idle and clock code do not
graciously recover from this error, sometimes resulting in significant
polling and measurable power impact.

Stop using native_calibrate_tsc() for INTEL_FAM6_SKYLAKE_X.
native_calibrate_tsc() will return 0, boot will run with tsc_khz = cpu_khz,
and the TSC refined calibration will update tsc_khz to correct for the
difference.

[ tglx: Sanitized change log ]

Fixes: 6baf3d61821f ("x86/tsc: Add additional Intel CPU models to the crystal quirk list")
Signed-off-by: Len Brown <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Prarit Bhargava <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/ff6dcea166e8ff8f2f6a03c17beab2cb436aa779.1513920414.git.len.brown@intel.com

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ce4b71119c36..3bf4df7f52d7 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -602,7 +602,6 @@ unsigned long native_calibrate_tsc(void)
case INTEL_FAM6_KABYLAKE_DESKTOP:
crystal_khz = 24000; /* 24.0 MHz */
break;
- case INTEL_FAM6_SKYLAKE_X:
case INTEL_FAM6_ATOM_DENVERTON:
crystal_khz = 25000; /* 25.0 MHz */
break;

2019-04-23 10:33:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tsc: set LAPIC timer frequency to crystal clock frequency


* Daniel Drake <[email protected]> wrote:

> On Mon, Apr 22, 2019 at 8:04 PM Ingo Molnar <[email protected]> wrote:
> > Minor style nit: the parentheses are unnecessary, integer expressions
> > like this are evaluated left to right and multiplication and division has
> > the same precedence.
>
> Fair point, although the same could be said for cpu_khz_from_msr().

Yes, this standardization on mhz, if it makes sense, might have to be
propagated a bit to function names and any other variables.

Another naming quirk: what unit is 'lapic_timer_frequency' in? AFAICS
it's a "period" unit (number of clock cycles per jiffy), not a frequency
(which is number of cycles per second) - so the better name might be
lapic_timer_period?

> > But it might also make sense to actually store crystal_mhz instead of
> > crystal_khz, because both CPUID 15H and 16H provides MHz values.
> >
> > That way the above expression would simplify to:
> >
> > lapic_timer_frequency = crystal_mhz / HZ;
>
> Wouldn't it be
> lapic_timer_frequency = crystal_mhz * 1000000 / HZ;
> ?

Sorry, what I wanted to suggest is crystal_hz, not crystal_mhz.

I.e. store the raw unit that comes out of the CPUID, which appears to be
HZ, right?

That would change the calculation to:

lapic_timer_period = crystal_hz / HZ;

Note how clearly it's visible in that calculation that what we calculate
there is a 'period', not a frequency.

Thanks,

Ingo

2019-04-24 07:14:16

by Daniel Drake

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/tsc: use CPUID.0x16 to calculate missing crystal frequency

On Tue, Apr 23, 2019 at 5:08 PM Peter Zijlstra <[email protected]> wrote:
> Well, SKX isn't exactly 'missing'; it would be very good if we can
> confirm the new code is still correct under the below mentioned
> conditions.
>
> commit b511203093489eb1829cb4de86e8214752205ac6
> Author: Len Brown <[email protected]>
> Date: Fri Dec 22 00:27:55 2017 -0500
>
> x86/tsc: Fix erroneous TSC rate on Skylake Xeon
>
> The INTEL_FAM6_SKYLAKE_X hardcoded crystal_khz value of 25MHZ is
> problematic:
>
> - SKX workstations (with same model # as server variants) use a 24 MHz
> crystal. This results in a -4.0% time drift rate on SKX workstations.

Checking http://instlatx64.atw.hu/ there are 11 platforms listed as
00050654 (skylake X), doing the calculations manually:
18-Core Intel Xeon Gold 6154, 3000 MHz: 25000000
2x 16-Core Intel Xeon Gold 6130, 2100 MHz: 25000000
2x 18-Core Intel Xeon Gold 6154, 3000 MHz: 25000000
2x 28-Core Intel Xeon Platinum 8180, 2500 MHz: 25000000
2x HexaCore Intel Xeon Bronze 3106: 25000000
2x OctalCore Intel Xeon Silver 4108, 3000 MHz: 25000000
10-Core Xeon W-2155: 23913043
HexaCore Intel Core i7-7800X: 23972602
10-Core Intel Core i9-7900X, 4000 MHz: 23913043
18-Core Intel Core i9-7980XE: 24074074
28-Core Intel Xeon W-3175X: 25000000

So given that the results include 24MHz and 25MHz crystal clocks
calculated on different products then it looks like this variance is
correctly accounted for.

> - SKX servers subject the crystal to an EMI reduction circuit that reduces its
> actual frequency by (approximately) -0.25%. This results in -1 second per
> 10 minute time drift as compared to network time.
>
> This issue can also trigger a timer and power problem, on configurations
> that use the LAPIC timer (versus the TSC deadline timer). Clock ticks
> scheduled with the LAPIC timer arrive a few usec before the time they are
> expected (according to the slow TSC). This causes Linux to poll-idle, when
> it should be in an idle power saving state. The idle and clock code do not
> graciously recover from this error, sometimes resulting in significant
> polling and measurable power impact.
>
> Stop using native_calibrate_tsc() for INTEL_FAM6_SKYLAKE_X.
> native_calibrate_tsc() will return 0, boot will run with tsc_khz = cpu_khz,
> and the TSC refined calibration will update tsc_khz to correct for the
> difference.

It sounds like I should add a condition:
if (boot_cpu_data.x86_model != INTEL_FAM6_SKYLAKE_X)
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
so that refined TSC calibration will run on SKYLAKE_X.

That does raise another question though. Taking other platforms such
as KABYLAKE_MOBILE, the previous code hardcoded a precise 24MHz value,
but my new code calculates 23893805Hz. Is that small discrepancy small
enough to trigger the timer and power problem as described for
SKYLAKE_X above?

If so, maybe the logic should instead be:
if (CPUID.0x15 provided Crystal Hz, i.e. we didn't have to calculate it)
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
This would enable TSC refinement on all variants of skylake and kabylake.

Daniel