2023-01-05 01:11:48

by Paul E. McKenney

Subject: [PATCH clocksource 0/6] Clocksource watchdog updates for v6.3

Hello!

This series contains clocksource-watchdog updates:

1. Print clocksource name when clocksource is tested unstable,
courtesy of Yunying Sun.

2. Loosen clocksource watchdog constraints.

3. Improve read-back-delay message.

4. Improve "skew is too large" messages.

5. Suspend the watchdog temporarily when high read latency detected,
courtesy of Feng Tang.

6. Verify HPET and PMTMR when TSC unverified.

Thanx, Paul

------------------------------------------------------------------------

b/arch/x86/include/asm/time.h | 1
b/arch/x86/kernel/hpet.c | 2 +
b/arch/x86/kernel/tsc.c | 5 ++
b/drivers/clocksource/acpi_pm.c | 6 ++-
b/kernel/time/Kconfig | 6 ++-
b/kernel/time/clocksource.c | 4 +-
kernel/time/clocksource.c | 72 ++++++++++++++++++++++++++++------------
7 files changed, 70 insertions(+), 26 deletions(-)


2023-01-05 01:12:01

by Paul E. McKenney

Subject: [PATCH clocksource 1/6] clocksource: Print clocksource name when clocksource is tested unstable

From: Yunying Sun <[email protected]>

Some "TSC fall back to HPET" messages appear on systems having more than
2 NUMA nodes:

clocksource: timekeeping watchdog on CPU168: hpet read-back delay of 4296200ns, attempt 4, marking unstable

The "hpet" here is misleading because the clocksource watchdog is really
doing repeated reads of "hpet" in order to check for unrelated delays.
Therefore, print the name of the clocksource under test, prefixed by
"wd-" and suffixed by "-wd", for example, "wd-tsc-wd".

Signed-off-by: Yunying Sun <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 9cf32ccda715d..4a2c3bb92e2e9 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -257,8 +257,8 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
goto skip_test;
}

- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
- smp_processor_id(), watchdog->name, wd_delay, nretries);
+ pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd read-back delay of %lldns, attempt %d, marking unstable\n",
+ smp_processor_id(), cs->name, wd_delay, nretries);
return WD_READ_UNSTABLE;

skip_test:
--
2.31.1.189.g2e36527f23

2023-01-05 01:12:08

by Paul E. McKenney

Subject: [PATCH clocksource 4/6] clocksource: Improve "skew is too large" messages

When clocksource_watchdog() detects excessive clocksource skew compared
to the watchdog clocksource, it marks the clocksource under test as
unstable and prints a several-line message. But that message
is unclear to anyone unfamiliar with the code:

clocksource: timekeeping watchdog on CPU2: Marking clocksource 'wdtest-ktime' as unstable because the skew is too large:
clocksource: 'kvm-clock' wd_nsec: 400744390 wd_now: 612625c2c wd_last: 5fa7f7c66 mask: ffffffffffffffff
clocksource: 'wdtest-ktime' cs_nsec: 600744034 cs_now: 173081397a292d4f cs_last: 17308139565a8ced mask: ffffffffffffffff
clocksource: 'kvm-clock' (not 'wdtest-ktime') is current clocksource.

Therefore, add the following line near the end of that message:

Clocksource 'wdtest-ktime' skewed 199999644 ns (199 ms) over watchdog 'kvm-clock' interval of 400744390 ns (400 ms)

This new line clearly indicates the amount of skew between the two
clocksources, along with the duration of the time interval over which
the skew occurred, both in nanoseconds and milliseconds.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Feng Tang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b59914953809f..fc486cd972635 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -446,12 +446,20 @@ static void clocksource_watchdog(struct timer_list *unused)
/* Check the deviation from the watchdog clocksource. */
md = cs->uncertainty_margin + watchdog->uncertainty_margin;
if (abs(cs_nsec - wd_nsec) > md) {
+ u64 cs_wd_msec;
+ u64 wd_msec;
+ u32 wd_rem;
+
pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
smp_processor_id(), cs->name);
pr_warn(" '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, cs_nsec, csnow, cslast, cs->mask);
+ cs_wd_msec = div_u64_rem(cs_nsec - wd_nsec, 1000U * 1000U, &wd_rem);
+ wd_msec = div_u64_rem(wd_nsec, 1000U * 1000U, &wd_rem);
+ pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
+ cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
if (curr_clocksource == cs)
pr_warn(" '%s' is current clocksource.\n", cs->name);
else if (curr_clocksource)
--
2.31.1.189.g2e36527f23

2023-01-05 01:12:34

by Paul E. McKenney

Subject: [PATCH clocksource 6/6] clocksource: Verify HPET and PMTMR when TSC unverified

On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
TSC is disabled. This works well much of the time, but there is the
occasional system that meets all of these criteria, but which still
has a TSC that skews significantly from atomic-clock time. This is
usually attributed to a firmware or hardware fault. Yes, the various
NTP daemons do express their opinions of userspace-to-atomic-clock time
skew, but they put them in various places, depending on the daemon and
distro in question. It would therefore be good for the kernel to have
some clue that there is a problem.

The old behavior of marking the TSC unstable is a non-starter because a
great many workloads simply cannot tolerate the overheads and latencies
of the various non-TSC clocksources. In addition, NTP-corrected systems
often seem to be able to tolerate significant kernel-space time skew as
long as the userspace time sources are within epsilon of atomic-clock
time.

Therefore, when watchdog verification of TSC is disabled, enable it for
HPET and PMTMR (AKA ACPI PM timer). This provides the needed in-kernel
time-skew diagnostic without degrading the system's performance.

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: <[email protected]>
Tested-by: Feng Tang <[email protected]>
---
arch/x86/include/asm/time.h | 1 +
arch/x86/kernel/hpet.c | 2 ++
arch/x86/kernel/tsc.c | 5 +++++
drivers/clocksource/acpi_pm.c | 6 ++++--
4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h
index 8ac563abb567b..a53961c64a567 100644
--- a/arch/x86/include/asm/time.h
+++ b/arch/x86/include/asm/time.h
@@ -8,6 +8,7 @@
extern void hpet_time_init(void);
extern void time_init(void);
extern bool pit_timer_init(void);
+extern bool tsc_clocksource_watchdog_disabled(void);

extern struct clock_event_device *global_clock_event;

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 71f336425e58a..c8eb1ac5125ab 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -1091,6 +1091,8 @@ int __init hpet_enable(void)
if (!hpet_counting())
goto out_nohpet;

+ if (tsc_clocksource_watchdog_disabled())
+ clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);

if (id & HPET_ID_LEGSUP) {
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a78e73da4a74b..af3782fb6200c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1186,6 +1186,11 @@ static void __init tsc_disable_clocksource_watchdog(void)
clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
}

+bool tsc_clocksource_watchdog_disabled(void)
+{
+ return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
+}
+
static void __init check_system_tsc_reliable(void)
{
#if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
index 279ddff81ab49..82338773602ca 100644
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -23,6 +23,7 @@
#include <linux/pci.h>
#include <linux/delay.h>
#include <asm/io.h>
+#include <asm/time.h>

/*
* The I/O port the PMTMR resides at.
@@ -210,8 +211,9 @@ static int __init init_acpi_pm_clocksource(void)
return -ENODEV;
}

- return clocksource_register_hz(&clocksource_acpi_pm,
- PMTMR_TICKS_PER_SEC);
+ if (tsc_clocksource_watchdog_disabled())
+ clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
+ return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
}

/* We use fs_initcall because we want the PCI fixups to have run
--
2.31.1.189.g2e36527f23

2023-01-05 01:12:39

by Paul E. McKenney

Subject: [PATCH clocksource 3/6] clocksource: Improve read-back-delay message

When cs_watchdog_read() is unable to get a qualifying clocksource read
within the limit set by max_cswd_read_retries, it prints a message
and marks the clocksource under test as unstable. But that message is
unclear to anyone unfamiliar with the code:

clocksource: timekeeping watchdog on CPU13: wd-tsc-wd read-back delay 1000614ns, attempt 3, marking unstable

Therefore, add some context so that the message appears as follows:

clocksource: timekeeping watchdog on CPU13: wd-tsc-wd excessive read-back delay of 1000614ns vs. limit of 125000ns, wd-wd read-back delay only 27ns, attempt 3, marking tsc unstable

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Feng Tang <[email protected]>
---
kernel/time/clocksource.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a3d19f6660ac7..b59914953809f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -260,8 +260,8 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
goto skip_test;
}

- pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd read-back delay of %lldns, attempt %d, marking unstable\n",
- smp_processor_id(), cs->name, wd_delay, nretries);
+ pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd excessive read-back delay of %lldns vs. limit of %ldns, wd-wd read-back delay only %lldns, attempt %d, marking %s unstable\n",
+ smp_processor_id(), cs->name, wd_delay, WATCHDOG_MAX_SKEW, wd_seq_delay, nretries, cs->name);
return WD_READ_UNSTABLE;

skip_test:
--
2.31.1.189.g2e36527f23

2023-01-05 01:24:42

by Paul E. McKenney

Subject: [PATCH clocksource 2/6] clocksource: Loosen clocksource watchdog constraints

Currently, MAX_SKEW_USEC is set to 100 microseconds, which has worked
reasonably well. However, NTP is willing to tolerate 500 microseconds
of skew per second, and a clocksource that is good enough for NTP should
be good enough for the clocksource watchdog. The watchdog's skew is
controlled by MAX_SKEW_USEC and the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
Kconfig option. However, these values are doubled before being associated
with a clocksource's ->uncertainty_margin, and the ->uncertainty_margin
values of the pair of clocksources being compared are summed before
checking against the skew.

Therefore, set both MAX_SKEW_USEC and the default for the
CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option to 125 microseconds of
skew per second, resulting in 500 microseconds of skew per second in
the clocksource watchdog's skew comparison.

Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/Kconfig | 6 +++++-
kernel/time/clocksource.c | 15 +++++++++------
2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index a41753be1a2bf..bae8f11070bef 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -200,10 +200,14 @@ config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
int "Clocksource watchdog maximum allowable skew (in μs)"
depends on CLOCKSOURCE_WATCHDOG
range 50 1000
- default 100
+ default 125
help
Specify the maximum amount of allowable watchdog skew in
microseconds before reporting the clocksource to be unstable.
+ The default is based on a half-second clocksource watchdog
+ interval and NTP's maximum frequency drift of 500 parts
+ per million. If the clocksource is good enough for NTP,
+ it is good enough for the clocksource watchdog!

endmenu
endif
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4a2c3bb92e2e9..a3d19f6660ac7 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -95,6 +95,11 @@ static char override_name[CS_NAME_LEN];
static int finished_booting;
static u64 suspend_start;

+/*
+ * Interval: 0.5sec.
+ */
+#define WATCHDOG_INTERVAL (HZ >> 1)
+
/*
* Threshold: 0.0312s, when doubled: 0.0625s.
* Also a default for cs->uncertainty_margin when registering clocks.
@@ -106,11 +111,14 @@ static u64 suspend_start;
* clocksource surrounding a read of the clocksource being validated.
* This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
* a lower bound for cs->uncertainty_margin values when registering clocks.
+ *
+ * The default of 500 parts per million is based on NTP's limits.
+ * If a clocksource is good enough for NTP, it is good enough for us!
*/
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
#define MAX_SKEW_USEC CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
#else
-#define MAX_SKEW_USEC 100
+#define MAX_SKEW_USEC (125 * WATCHDOG_INTERVAL / HZ)
#endif

#define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
@@ -140,11 +148,6 @@ static inline void clocksource_watchdog_unlock(unsigned long *flags)
static int clocksource_watchdog_kthread(void *data);
static void __clocksource_change_rating(struct clocksource *cs, int rating);

-/*
- * Interval: 0.5sec.
- */
-#define WATCHDOG_INTERVAL (HZ >> 1)
-
static void clocksource_watchdog_work(struct work_struct *work)
{
/*
--
2.31.1.189.g2e36527f23

2023-01-05 01:29:28

by Paul E. McKenney

Subject: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

From: Feng Tang <[email protected]>

Bugs have been reported on 8 sockets x86 machines in which the TSC was
wrongly disabled when the system is under heavy workload.

[ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
[ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
[ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
[ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
[ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
[ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
[ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
[ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
[ 821.067990] clocksource: Switched to clocksource hpet

This can be reproduced by running memory intensive 'stream' tests,
or some of the stress-ng subcases such as 'ioport'.

The reason for these issues is that when the system is under heavy load, the
read latency of the clocksources can be very high. Even lightweight TSC
reads can show high latencies, and latencies are much worse for external
clocksources such as HPET or the APIC PM timer. These latencies can
result in false-positive clocksource-unstable determinations.

Given that the clocksource watchdog is a continual diagnostic check that
runs twice a second, there is no need to rush it when the system
is under heavy load. Therefore, when high clocksource read latencies
are detected, suspend the watchdog timer for 5 minutes.

Signed-off-by: Feng Tang <[email protected]>
Acked-by: Waiman Long <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Feng Tang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index fc486cd972635..91836b727cef5 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -387,6 +387,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
}
EXPORT_SYMBOL_GPL(clocksource_verify_percpu);

+static inline void clocksource_reset_watchdog(void)
+{
+ struct clocksource *cs;
+
+ list_for_each_entry(cs, &watchdog_list, wd_list)
+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+}
+
+
static void clocksource_watchdog(struct timer_list *unused)
{
u64 csnow, wdnow, cslast, wdlast, delta;
@@ -394,6 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
enum wd_read_status read_ret;
+ unsigned long extra_wait = 0;
u32 md;

spin_lock(&watchdog_lock);
@@ -413,13 +423,30 @@ static void clocksource_watchdog(struct timer_list *unused)

read_ret = cs_watchdog_read(cs, &csnow, &wdnow);

- if (read_ret != WD_READ_SUCCESS) {
- if (read_ret == WD_READ_UNSTABLE)
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
+ if (read_ret == WD_READ_UNSTABLE) {
+ /* Clock readout unreliable, so give it up. */
+ __clocksource_unstable(cs);
continue;
}

+ /*
+ * When WD_READ_SKIP is returned, it means the system is likely
+ * under very heavy load, where the latency of reading
+ * watchdog/clocksource is very big, and affect the accuracy of
+ * watchdog check. So give system some space and suspend the
+ * watchdog check for 5 minutes.
+ */
+ if (read_ret == WD_READ_SKIP) {
+ /*
+ * As the watchdog timer will be suspended, and
+ * cs->last could keep unchanged for 5 minutes, reset
+ * the counters.
+ */
+ clocksource_reset_watchdog();
+ extra_wait = HZ * 300;
+ break;
+ }
+
/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
atomic_read(&watchdog_reset_pending)) {
@@ -523,7 +550,7 @@ static void clocksource_watchdog(struct timer_list *unused)
* pair clocksource_stop_watchdog() clocksource_start_watchdog().
*/
if (!timer_pending(&watchdog_timer)) {
- watchdog_timer.expires += WATCHDOG_INTERVAL;
+ watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
add_timer_on(&watchdog_timer, next_cpu);
}
out:
@@ -548,14 +575,6 @@ static inline void clocksource_stop_watchdog(void)
watchdog_running = 0;
}

-static inline void clocksource_reset_watchdog(void)
-{
- struct clocksource *cs;
-
- list_for_each_entry(cs, &watchdog_list, wd_list)
- cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
-}
-
static void clocksource_resume_watchdog(void)
{
atomic_inc(&watchdog_reset_pending);
--
2.31.1.189.g2e36527f23

2023-01-11 11:40:13

by Thomas Gleixner

Subject: Re: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

On Wed, Jan 04 2023 at 17:07, Paul E. McKenney wrote:
> This can be reproduced by running memory intensive 'stream' tests,
> or some of the stress-ng subcases such as 'ioport'.
>
> The reason for these issues is that when the system is under heavy load, the
> read latency of the clocksources can be very high. Even lightweight TSC
> reads can show high latencies, and latencies are much worse for external
> clocksources such as HPET or the APIC PM timer. These latencies can
> result in false-positive clocksource-unstable determinations.
>
> Given that the clocksource watchdog is a continual diagnostic check with
> frequency of twice a second, there is no need to rush it when the system
> is under heavy load. Therefore, when high clocksource read latencies
> are detected, suspend the watchdog timer for 5 minutes.

We should have enough heuristics in place by now to qualify the output of
the clocksource watchdog as a random number generator, right?

2023-01-11 13:11:30

by Feng Tang

Subject: Re: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 04 2023 at 17:07, Paul E. McKenney wrote:
> > This can be reproduced by running memory intensive 'stream' tests,
> > or some of the stress-ng subcases such as 'ioport'.
> >
> > The reason for these issues is that when the system is under heavy load, the
> > read latency of the clocksources can be very high. Even lightweight TSC
> > reads can show high latencies, and latencies are much worse for external
> > clocksources such as HPET or the APIC PM timer. These latencies can
> > result in false-positive clocksource-unstable determinations.
> >
> > Given that the clocksource watchdog is a continual diagnostic check with
> > frequency of twice a second, there is no need to rush it when the system
> > is under heavy load. Therefore, when high clocksource read latencies
> > are detected, suspend the watchdog timer for 5 minutes.
>
> We should have enough heuristics in place by now to qualify the output of
> the clocksource watchdog as a random number generator, right?

The issue was found on an 8-socket machine (around 400 cores, 800 CPUs),
and it seems that with ever-larger CPU counts, the spike latency
of reading HPET or even TSC is very high, which does affect the
accuracy of the clocksource watchdog check. And unfortunately, we can't
just disable the watchdog for these 8-socket machines.

We tried a debug patch which disables interrupts and does consecutive
reads with 'rdtsc', checking the delta between these reads (while the
system is running some heavy load). Sometimes we can see up to a
300-microsecond delta on a 2-socket Icelake machine. The same thing
happens if we replace the 'rdtsc' with 'rdtscp' or 'lfence;rdtsc'.
And I was told the max latency is much higher on the 8-socket
machine.

Thanks,
Feng


2023-01-11 18:04:19

by Paul E. McKenney

Subject: Re: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 04 2023 at 17:07, Paul E. McKenney wrote:
> > This can be reproduced by running memory intensive 'stream' tests,
> > or some of the stress-ng subcases such as 'ioport'.
> >
> > The reason for these issues is that when the system is under heavy load, the
> > read latency of the clocksources can be very high. Even lightweight TSC
> > reads can show high latencies, and latencies are much worse for external
> > clocksources such as HPET or the APIC PM timer. These latencies can
> > result in false-positive clocksource-unstable determinations.
> >
> > Given that the clocksource watchdog is a continual diagnostic check with
> > frequency of twice a second, there is no need to rush it when the system
> > is under heavy load. Therefore, when high clocksource read latencies
> > are detected, suspend the watchdog timer for 5 minutes.
>
> We should have enough heuristics in place by now to qualify the output of
> the clocksource watchdog as a random number generator, right?

Glad to see that you are still keeping up your style, Thomas! ;-)

We really do see the occasional clocksource skew in our fleet, and
sometimes it really is the TSC that is in disagreement with atomic-clock
time. And the watchdog does detect these, for example, the 40,000
parts-per-million case discussed recently. We therefore need a way to
check this, but without producing false positives on busy systems and
without the current kneejerk reflex of disabling TSC, thus rendering the
system useless from a performance standpoint for some important workloads.

Yes, if a system was 100% busy forever, this patch would suppress these
checks. But 100% busy forever is not the common case, due to thermal
throttling and to security updates if nothing else.

With all that said, is there a better way to get the desired effects of
this patch?

Thanx, Paul

2023-01-11 21:58:39

by Thomas Gleixner

Subject: Re: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

On Wed, Jan 11 2023 at 09:50, Paul E. McKenney wrote:
> On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> Yes, if a system was 100% busy forever, this patch would suppress these
> checks. But 100% busy forever is not the common case, due to thermal
> throttling and to security updates if nothing else.
>
> With all that said, is there a better way to get the desired effects of
> this patch?

Sane hardware?

2023-01-11 21:58:43

by Paul E. McKenney

Subject: Re: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

On Wed, Jan 11, 2023 at 10:19:50PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 11 2023 at 09:50, Paul E. McKenney wrote:
> > On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> > Yes, if a system was 100% busy forever, this patch would suppress these
> > checks. But 100% busy forever is not the common case, due to thermal
> > throttling and to security updates if nothing else.
> >
> > With all that said, is there a better way to get the desired effects of
> > this patch?
>
> Sane hardware?

I must let Feng talk to his systems, but most of the systems I saw were
production systems. A few were engineering samples, from which some
insanity might be expected behavior.

Clearly, something about the hardware or firmware was insane in order
to get this result, but that is what diagnostics are for, even on
engineering samples.

Thanx, Paul

2023-01-12 01:30:20

by Feng Tang

Subject: Re: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

On Wed, Jan 11, 2023 at 01:32:10PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 11, 2023 at 10:19:50PM +0100, Thomas Gleixner wrote:
> > On Wed, Jan 11 2023 at 09:50, Paul E. McKenney wrote:
> > > On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> > > Yes, if a system was 100% busy forever, this patch would suppress these
> > > checks. But 100% busy forever is not the common case, due to thermal
> > > throttling and to security updates if nothing else.
> > >
> > > With all that said, is there a better way to get the desired effects of
> > > this patch?
> >
> > Sane hardware?
>
> I must let Feng talk to his systems, but most of the systems I saw were
> production systems. A few were engineering samples, from which some
> insanity might be expected behavior.

I've tested with several generations of Xeon servers, and they can all
reproduce the issue with the stress-ng stress load. Those platforms
were not bought on the open market :), but they have the latest stepping
and firmware, which is close to production systems.

The issue originally came from a customer, and there were engineers
who reproduced it on production systems (even from different vendors).

Thanks,
Feng

> Clearly, something about the hardware or firmware was insane in order
> to get this result, but that is what diagnostics are for, even on
> engineering samples.
>
> Thanx, Paul