Production systems may repeatedly experience a switch to the HPET
clocksource on account of the watchdog marking the TSC as unstable. The
HPET is a slow clocksource that is not checked by the clocksource
watchdog, which in itself seems inconsistent - the TSC is fast but may
fail a watchdog check, the HPET is slow, bordering on useless on modern
machines, and systems are practically stuck with it (or acpi_pm) once the
TSC has been marked unstable. A switch to the HPET clocksource may result
in a large performance regression (of up to 90% for networking workloads,
for example), which may not be acceptable for production workloads.
Performance is restored only by a reboot that allows the TSC to become the
current clocksource again. Reoccurring and forced reboots are unwelcome on
production systems. The switches to the HPET may have their root cause in
the way the hardware works. Solving a hardware issue or replacing the
machine may take an inordinate amount of time, while the production
machine is still needed to do its job.
There are machines that experience switches to the HPET but also are known
to have handled their workload without any apparent issues before the
threshold for the watchdog got reduced in 2e27e793e280 ("clocksource:
Reduce clocksource-skew threshold"). The value
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US option can be increased to widen
the margin of error for evaluating time skews but a change like this would
be applied to all systems running a particular distribution kernel, which
is not desirable. It is almost a rule that no matter the threhold there
will be a system exceeding that threshold. So, a workaround is need for
special cases where the measured time skew exceeds the margin of error.
It is possible to pass tsc=nowatchdog to the kernel, which disables the
clocksource watchdog on the TSC, effectively leaving the operator of the
production machine blind to any clocksource malfunction. Also, the
tsc=nowatchdog parameter is intended for situations with tight latency
requirements and not for working around repeated switches to the HPET. For
machines with more than 4 NUMA nodes, the current state of the TSC code
does allow tsc=reliable tsc=watchdog to be passed to the kernel to disable
the clocksource watchdog on the TSC while allowing the TSC to act as a
watchdog but the tsc=reliable parameter also disables TSC warp detection
and the tsc_sync_check_timer even on machines with the TSC_ADJUST feature,
which is unnecessary and possibly harmful.
The sematics of the recently introduced tsc=watchdog, 0051293c5330
("clocksource: Enable TSC watchdog checking of HPET and PMTMR only when
requested"), could be changed to disable the clocksource watchdog on the
TSC no matter the number of NUMA nodes while allowing the TSC to act as a
watchdog for other clocksources. This way, a watchdog check failing would
not result in the current clocksource (TSC) getting marked unstable and
the operator would still be informed that the machine may be experiencing
a clocksource issue. Passing tsc=watchdog to the kernel could be used as a
workaround before the underlying hardware issue is resolved or as a
permanent solution with the understanding that only a warning will be
printed into the kernel log but no action will be taken if the TSC
actually becomes unstable.
There is one side effect of the sematic change of tsc=watchdog - it fixes
the HPET not having its CLOCK_SOURCE_MUST_VERIFY bit set as intended by
efc8b329c7fd ("clocksource: Verify HPET and PMTMR when TSC unverified").
The HPET never had its CLOCK_SOURCE_MUST_VERIFY bit set on account of the
order in which clocksources are initialized in x86_late_time_init(). The
HPET initialization and clocksource registration always comes before the
TSC init function unsets the CLOCK_SOURCE_MUST_VERIFY bit.
Signed-off-by: Jiri Wiesner <[email protected]>
---
I am resubmitting this patch because there has been no reaction to it.
I sent it over the Christmas period. Perhaps, it got forgotten.
Documentation/admin-guide/kernel-parameters.txt | 9 +++++----
arch/x86/include/asm/time.h | 2 +-
arch/x86/kernel/hpet.c | 2 +-
arch/x86/kernel/tsc.c | 7 +++----
drivers/clocksource/acpi_pm.c | 2 +-
5 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 65731b060e3f..665a4df9a511 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6695,10 +6695,11 @@
obtained from HW or FW using either an MSR or CPUID(0x15).
Warn if the difference is more than 500 ppm.
[x86] watchdog: Use TSC as the watchdog clocksource with
- which to check other HW timers (HPET or PM timer), but
- only on systems where TSC has been deemed trustworthy.
- This will be suppressed by an earlier tsc=nowatchdog and
- can be overridden by a later tsc=nowatchdog. A console
+ which to check other HW timers (HPET or PM timer).
+ Disables watchdog checks on TSC, even on systems where
+ TSC has not been deemed trustworthy. This will be
+ suppressed by an earlier tsc=nowatchdog and can be
+ overridden by a later tsc=nowatchdog. A console
message will flag any such suppression or overriding.
tsc_early_khz= [X86] Skip early TSC calibration and use the given
diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h
index f360104ed172..c2364b74a318 100644
--- a/arch/x86/include/asm/time.h
+++ b/arch/x86/include/asm/time.h
@@ -7,7 +7,7 @@
extern void hpet_time_init(void);
extern bool pit_timer_init(void);
-extern bool tsc_clocksource_watchdog_disabled(void);
+extern bool tsc_clocksource_as_watchdog(void);
extern struct clock_event_device *global_clock_event;
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 41eecf180b7f..9f6222cea684 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -1091,7 +1091,7 @@ int __init hpet_enable(void)
if (!hpet_counting())
goto out_nohpet;
- if (tsc_clocksource_watchdog_disabled())
+ if (tsc_clocksource_as_watchdog())
clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 15f97c0abc9d..ec1860178ea1 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1221,10 +1221,9 @@ static void __init tsc_disable_clocksource_watchdog(void)
clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
}
-bool tsc_clocksource_watchdog_disabled(void)
+bool tsc_clocksource_as_watchdog(void)
{
- return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
- tsc_as_watchdog && !no_tsc_watchdog;
+ return tsc_as_watchdog && !no_tsc_watchdog;
}
static void __init check_system_tsc_reliable(void)
@@ -1609,7 +1608,7 @@ void __init tsc_init(void)
return;
}
- if (tsc_clocksource_reliable || no_tsc_watchdog)
+ if (tsc_clocksource_reliable || no_tsc_watchdog || tsc_as_watchdog)
tsc_disable_clocksource_watchdog();
clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
index 82338773602c..9b5dad94713e 100644
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -211,7 +211,7 @@ static int __init init_acpi_pm_clocksource(void)
return -ENODEV;
}
- if (tsc_clocksource_watchdog_disabled())
+ if (tsc_clocksource_as_watchdog())
clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
}
--
2.35.3
--
Jiri Wiesner
SUSE Labs