2021-04-25 22:47:50

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v10 clocksource 0/7] Do not mark clocks unstable due to delays for v5.13

Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own. This series checks for this, doing limited retries
to get a good set of clock reads. If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

Note that "sufficient delay" can be provided by SMIs, NMIs, and of course
vCPU preemption.

1. Provide module parameters to inject delays in watchdog.

2. Retry clock read if long delays detected.

3. Check per-CPU clock synchronization when marked unstable.

4. Provide a module parameter to fuzz per-CPU clock checking.

5. Limit number of CPUs checked for clock synchronization.

6. Forgive tsc_early pre-calibration drift.

7. Reduce the clock-skew limit to 200us and the maximum permissible
clock read delay to 50us.

Changes since v9:

o Forgive tsc_early drift, based on feedback from Feng Tang; Xing,
Zhengjun; and Thomas Gleixner.

o Improve CPU selection for clock-synchronization checking.

Link: https://lore.kernel.org/lkml/20210419045155.GA596058@paulmck-ThinkPad-P17-Gen-1/

Changes since v8, based on Thomas Gleixner feedback:

o Reduced clock-skew threshold to 200us and delay limit to 50us.

o Split out a cs_watchdog_read() function.

o Removed the pointless CLOCK_SOURCE_VERIFY_PERCPU from kvm_clock.

o Initialized cs_nsec_max and cs_nsec_min to avoid firsttime checks.

Link: https://lore.kernel.org/lkml/20210414043435.GA2812539@paulmck-ThinkPad-P17-Gen-1/

Changes since v7, based on Thomas Gleixner feedback:

o Fix embarrassing git-format-patch operator error.

o Merge pairwise clock-desynchronization checking into the checking
of per-CPU clock synchronization when marked unstable.

o Do selective per-CPU checking rather than blindly checking all
CPUs. Provide a clocksource.verify_n_cpus kernel boot parameter
to control this behavior, with the value -1 choosing the old
check-all-CPUs behavior. The default is to randomly check 8 CPUs.

o Fix the clock-desynchronization checking to avoid a potential
use-after-free error for dynamically allocated clocksource
structures.

o Remove redundance "wdagain_nsec < 0" from clocksource_watchdog()
clocksource skew checking.

o Update commit logs and do code-style updates.

Link: https://lore.kernel.org/lkml/20210106004013.GA11179@paulmck-ThinkPad-P72/

Changes since v5:

o Rebased to v5.12-rc5.

Changes since v4:

o Rebased to v5.12-rc1.

Changes since v3:

o Rebased to v5.11.

o Apply Randy Dunlap feedback.

Changes since v2:

o Rebased to v5.11-rc6.

o Updated Cc: list.

Changes since v1:

o Applied feedback from Rik van Riel.

o Rebased to v5.11-rc3.

o Stripped "RFC" from the subject lines.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 32 +++
b/Documentation/admin-guide/kernel-parameters.txt | 16 +
b/arch/x86/kernel/tsc.c | 3
b/include/linux/clocksource.h | 2
b/kernel/time/clocksource.c | 23 ++
kernel/time/clocksource.c | 201 ++++++++++++++++++++--
6 files changed, 263 insertions(+), 14 deletions(-)


2021-04-25 22:48:24

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v10 clocksource 4/7] clocksource: Provide a module parameter to fuzz per-CPU clock checking

Code that checks for clock desynchronization must itself be tested, so
create a new clocksource.inject_delay_shift_percpu= kernel boot parameter
that adds or subtracts a large value from the check read, using the
specified bit of the CPU ID to determine whether to add or to subtract.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Apply Randy Dunlap feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 16 ++++++++++++++++
kernel/time/clocksource.c | 10 +++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7fff95bd5504..03a8c88a6bb9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,22 @@
will be five delay-free reads followed by three
delayed reads.

+ clocksource.inject_delay_shift_percpu= [KNL]
+ Clocksource delay injection partitions the CPUs
+ into two sets, one whose clocks are moved ahead
+ and the other whose clocks are moved behind.
+ This kernel parameter selects the CPU-number
+ bit that determines which of these two sets the
+ corresponding CPU is placed into. For example,
+ setting this parameter to the value 4 will result
+ in the first set containing alternating groups
+ of 16 CPUs whose clocks are moved ahead, while
+ the second set will contain the rest of the CPUs,
+ whose clocks are moved behind.
+
+ The default value of -1 disables this type of
+ error injection.
+
clocksource.max_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a8d73e1f9431..584433448226 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -196,6 +196,8 @@ static ulong inject_delay_period;
module_param(inject_delay_period, ulong, 0644);
static ulong inject_delay_repeat = 1;
module_param(inject_delay_repeat, ulong, 0644);
+static int inject_delay_shift_percpu = -1;
+module_param(inject_delay_shift_percpu, int, 0644);
static ulong max_read_retries = 3;
module_param(max_read_retries, ulong, 0644);

@@ -252,8 +254,14 @@ static cpumask_t cpus_behind;
static void clocksource_verify_one_cpu(void *csin)
{
struct clocksource *cs = (struct clocksource *)csin;
+ s64 delta = 0;
+ int sign;

- csnow_mid = cs->read(cs);
+ if (inject_delay_shift_percpu >= 0) {
+ sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
+ delta = sign * NSEC_PER_SEC;
+ }
+ csnow_mid = cs->read(cs) + delta;
}

static void clocksource_verify_percpu(struct clocksource *cs)
--
2.31.1.189.g2e36527f23

2021-04-25 22:48:25

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v10 clocksource 2/7] clocksource: Retry clock read if long delays detected

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there are no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

Therefore, re-read the watchdog clock on either side of the read
from the clock under test. If the watchdog clock shows an excessive
time delta between its pair of reads, the reads are retried. The
maximum number of retries is specified by a new kernel boot parameter
clocksource.max_read_retries, which defaults to three, that is, up to four
reads, one initial and up to three retries. If more than one retry was
required, a message is printed on the console (the occasional single retry
is expected behavior, especially in guest OSes). If the maximum number
of retries is exceeded, the clock under test will be marked unstable.
However, the probability of this happening due to various sorts of
delays is quite small. In addition, the reason (clock-read delays)
for the unstable marking will be apparent.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
[ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
[ paulmck: Apply Thomas Gleixner feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 6 +++
kernel/time/clocksource.c | 53 ++++++++++++++++---
2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4a372037b49f..7fff95bd5504 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,12 @@
will be five delay-free reads followed by three
delayed reads.

+ clocksource.max_read_retries= [KNL]
+ Number of clocksource_watchdog() retries due to
+ external delays before the clock will be marked
+ unstable. Defaults to three retries, that is,
+ four attempts to read the clock under test.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index f1e1e6e4b387..94bfdb53f2f4 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -125,6 +125,13 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)

+/*
+ * Maximum permissible delay between two readouts of the watchdog
+ * clocksource surrounding a read of the clocksource being validated.
+ * This delay could be due to SMIs, NMIs, or to VCPU preemptions.
+ */
+#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
+
static void clocksource_watchdog_work(struct work_struct *work)
{
/*
@@ -189,6 +196,8 @@ static ulong inject_delay_period;
module_param(inject_delay_period, ulong, 0644);
static ulong inject_delay_repeat = 1;
module_param(inject_delay_repeat, ulong, 0644);
+static ulong max_read_retries = 3;
+module_param(max_read_retries, ulong, 0644);

static void clocksource_watchdog_inject_delay(void)
{
@@ -206,12 +215,42 @@ static void clocksource_watchdog_inject_delay(void)
invocations++;
}

+static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
+{
+ unsigned int nretries;
+ u64 wd_end, wd_delta;
+ int64_t wd_delay;
+
+ for (nretries = 0; nretries <= max_read_retries; nretries++) {
+ local_irq_disable();
+ *wdnow = watchdog->read(watchdog);
+ clocksource_watchdog_inject_delay();
+ *csnow = cs->read(cs);
+ wd_end = watchdog->read(watchdog);
+ local_irq_enable();
+
+ wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
+ wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
+ if (wd_delay <= WATCHDOG_MAX_SKEW) {
+ if (nretries > 1 || nretries >= max_read_retries) {
+ pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
+ smp_processor_id(), watchdog->name, nretries);
+ }
+ return true;
+ }
+ }
+
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
+ smp_processor_id(), watchdog->name, wd_delay, nretries);
+ return false;
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
- struct clocksource *cs;
u64 csnow, wdnow, cslast, wdlast, delta;
- int64_t wd_nsec, cs_nsec;
int next_cpu, reset_pending;
+ int64_t wd_nsec, cs_nsec;
+ struct clocksource *cs;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -228,11 +267,11 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

- local_irq_disable();
- csnow = cs->read(cs);
- clocksource_watchdog_inject_delay();
- wdnow = watchdog->read(watchdog);
- local_irq_enable();
+ if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
+ /* Clock readout unreliable, so give it up. */
+ __clocksource_unstable(cs);
+ continue;
+ }

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
--
2.31.1.189.g2e36527f23

2021-04-25 22:48:27

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v10 clocksource 5/7] clocksource: Limit number of CPUs checked for clock synchronization

Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU,
that clock is checked on all CPUs. This is thorough, but might not be
what you want on a system with a few tens of CPUs, let alone a few hundred
of them.

Therefore, by default check only up to eight randomly chosen CPUs.
Also provide a new clocksource.verify_n_cpus kernel boot parameter.
A value of -1 says to check all of the CPUs, and a non-negative value says
to randomly select that number of CPUs, without concern about selecting
the same CPU multiple times. However, make use of a cpumask so that a
given CPU will be checked at most once.

Suggested-by: Thomas Gleixner <[email protected]> # For verify_n_cpus=1.
Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 10 +++
kernel/time/clocksource.c | 74 ++++++++++++++++++-
2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 03a8c88a6bb9..6e5b5f19fc56 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -621,6 +621,16 @@
unstable. Defaults to three retries, that is,
four attempts to read the clock under test.

+ clocksource.verify_n_cpus= [KNL]
+ Limit the number of CPUs checked for clocksources
+ marked with CLOCK_SOURCE_VERIFY_PERCPU that
+ are marked unstable due to excessive skew.
+ A negative value says to check all CPUs, while
+ zero says not to check any. Values larger than
+ nr_cpu_ids are silently truncated to nr_cpu_ids.
+ The actual CPUs are chosen randomly, with
+ no replacement if the same CPU is chosen twice.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 584433448226..f71f375df544 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -15,6 +15,8 @@
#include <linux/tick.h>
#include <linux/kthread.h>
#include <linux/delay.h>
+#include <linux/prandom.h>
+#include <linux/cpu.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -200,6 +202,8 @@ static int inject_delay_shift_percpu = -1;
module_param(inject_delay_shift_percpu, int, 0644);
static ulong max_read_retries = 3;
module_param(max_read_retries, ulong, 0644);
+static int verify_n_cpus = 8;
+module_param(verify_n_cpus, int, 0644);

static void clocksource_watchdog_inject_delay(void)
{
@@ -250,6 +254,55 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
static u64 csnow_mid;
static cpumask_t cpus_ahead;
static cpumask_t cpus_behind;
+static cpumask_t cpus_chosen;
+
+static void clocksource_verify_choose_cpus(void)
+{
+ int cpu, i, n = verify_n_cpus;
+
+ if (n < 0) {
+ /* Check all of the CPUs. */
+ cpumask_copy(&cpus_chosen, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
+ return;
+ }
+
+ /* If no checking desired, or no other CPU to check, leave. */
+ cpumask_clear(&cpus_chosen);
+ if (n == 0 || num_online_cpus() <= 1)
+ return;
+
+ /* Make sure to select at least one CPU other than the current CPU. */
+ cpu = cpumask_next(-1, cpu_online_mask);
+ if (cpu == smp_processor_id())
+ cpu = cpumask_next(cpu, cpu_online_mask);
+ if (WARN_ON_ONCE(cpu >= nr_cpu_ids))
+ return;
+ cpumask_set_cpu(cpu, &cpus_chosen);
+
+ /* Force a sane value for the boot parameter. */
+ if (n > nr_cpu_ids)
+ n = nr_cpu_ids;
+
+ /*
+ * Randomly select the specified number of CPUs. If the same
+ * CPU is selected multiple times, that CPU is checked only once,
+ * and no replacement CPU is selected. This gracefully handles
+ * situations where verify_n_cpus is greater than the number of
+ * CPUs that are currently online.
+ */
+ for (i = 1; i < n; i++) {
+ cpu = prandom_u32() % nr_cpu_ids;
+ cpu = cpumask_next(cpu - 1, cpu_online_mask);
+ if (cpu >= nr_cpu_ids)
+ cpu = cpumask_next(-1, cpu_online_mask);
+ if (!WARN_ON_ONCE(cpu >= nr_cpu_ids))
+ cpumask_set_cpu(cpu, &cpus_chosen);
+ }
+
+ /* Don't verify ourselves. */
+ cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
+}

static void clocksource_verify_one_cpu(void *csin)
{
@@ -271,12 +324,22 @@ static void clocksource_verify_percpu(struct clocksource *cs)
int cpu, testcpu;
s64 delta;

+ if (verify_n_cpus == 0)
+ return;
cpumask_clear(&cpus_ahead);
cpumask_clear(&cpus_behind);
+ get_online_cpus();
preempt_disable();
+ clocksource_verify_choose_cpus();
+ if (cpumask_weight(&cpus_chosen) == 0) {
+ preempt_enable();
+ put_online_cpus();
+ pr_warn("Not enough CPUs to check clocksource '%s'.\n", cs->name);
+ return;
+ }
testcpu = smp_processor_id();
- pr_warn("Checking clocksource %s synchronization from CPU %d.\n", cs->name, testcpu);
- for_each_online_cpu(cpu) {
+ pr_warn("Checking clocksource %s synchronization from CPU %d to CPUs %*pbl.\n", cs->name, testcpu, cpumask_pr_args(&cpus_chosen));
+ for_each_cpu(cpu, &cpus_chosen) {
if (cpu == testcpu)
continue;
csnow_begin = cs->read(cs);
@@ -296,6 +359,7 @@ static void clocksource_verify_percpu(struct clocksource *cs)
cs_nsec_min = cs_nsec;
}
preempt_enable();
+ put_online_cpus();
if (!cpumask_empty(&cpus_ahead))
pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_ahead), testcpu, cs->name);
@@ -366,6 +430,12 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, csnow, cslast, cs->mask);
+ if (curr_clocksource == cs)
+ pr_warn(" '%s' is current clocksource.\n", cs->name);
+ else if (curr_clocksource)
+ pr_warn(" '%s' (not '%s') is current clocksource.\n", curr_clocksource->name, cs->name);
+ else
+ pr_warn(" No current clocksource.\n");
__clocksource_unstable(cs);
continue;
}
--
2.31.1.189.g2e36527f23

2021-04-25 22:48:31

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there are no shortage of things that can
delay interrupts-disabled regions of code ranging from SMI handlers to
vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

The first step is a way of injecting such delays. Therefore, provide
clocksource.inject_delay_freq and clocksource.inject_delay_run kernel boot
parameters that specify that sufficient delay be injected to cause the
clocksource_watchdog() function to mark a clock unstable. This delay is
injected every Nth set of M calls to clocksource_watchdog(), where N is
the value specified for the inject_delay_freq boot parameter and M is the
value specified for the inject_delay_run boot parameter. Values of zero
or less for either parameter disable delay injection, and the default for
clocksource.inject_delay_freq is zero, that is, disabled. The default for
clocksource.inject_delay_run is the value one, that is single-call runs.

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
[ paulmck: Apply Thomas Gleixner feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 16 +++++++++++++
kernel/time/clocksource.c | 23 +++++++++++++++++++
2 files changed, 39 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..4a372037b49f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -583,6 +583,22 @@
loops can be debugged more effectively on production
systems.

+ clocksource.inject_delay_period= [KNL]
+ Number of calls to clocksource_watchdog() before
+ delays are injected between reads from the
+ two clocksources. Values of zero disable this
+ delay injection. These delays can cause clocks
+ to be marked unstable, so use of this parameter
+ should therefore be avoided on production systems.
+ Defaults to zero (disabled).
+
+ clocksource.inject_delay_repeat= [KNL]
+ Number of repeated clocksource_watchdog() delay
+ injections per period. If inject_delay_period
+ is five and inject_delay_repeat is three, there
+ will be five delay-free reads followed by three
+ delayed reads.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a2cc7c..f1e1e6e4b387 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -14,6 +14,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/delay.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -184,6 +185,27 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static ulong inject_delay_period;
+module_param(inject_delay_period, ulong, 0644);
+static ulong inject_delay_repeat = 1;
+module_param(inject_delay_repeat, ulong, 0644);
+
+static void clocksource_watchdog_inject_delay(void)
+{
+ static unsigned int invocations = 1, injections;
+
+ if (!inject_delay_period || !inject_delay_repeat)
+ return;
+ if (!(invocations % inject_delay_period)) {
+ pr_warn("%s(): Injecting delay.\n", __func__);
+ mdelay(2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC);
+ if (++injections < inject_delay_repeat)
+ return;
+ injections = 0;
+ }
+ invocations++;
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -208,6 +230,7 @@ static void clocksource_watchdog(struct timer_list *unused)

local_irq_disable();
csnow = cs->read(cs);
+ clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
local_irq_enable();

--
2.31.1.189.g2e36527f23

2021-04-25 22:48:33

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v10 clocksource 3/7] clocksource: Check per-CPU clock synchronization when marked unstable

Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedy
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?

Therefore apply CPU-to-CPU synchronization checking to newly unstable
clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag.
Lists of desynchronized CPUs are printed, with the caveat that if it
is the reporting CPU that is itself desynchronized, it will appear that
all the other clocks are wrong. Just like in real life.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ]
[ paulmck: Apply Thomas Gleixner feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/tsc.c | 3 +-
include/linux/clocksource.h | 2 +-
kernel/time/clocksource.c | 60 +++++++++++++++++++++++++++++++++++++
3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f70dffc2771f..56289170753c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY,
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
.resume = tsc_resume,
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143db6523..83a3ebff7456 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -131,7 +131,7 @@ struct clocksource {
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-
+#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 94bfdb53f2f4..a8d73e1f9431 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -245,6 +245,60 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
return false;
}

+static u64 csnow_mid;
+static cpumask_t cpus_ahead;
+static cpumask_t cpus_behind;
+
+static void clocksource_verify_one_cpu(void *csin)
+{
+ struct clocksource *cs = (struct clocksource *)csin;
+
+ csnow_mid = cs->read(cs);
+}
+
+static void clocksource_verify_percpu(struct clocksource *cs)
+{
+ int64_t cs_nsec, cs_nsec_max = 0, cs_nsec_min = LLONG_MAX;
+ u64 csnow_begin, csnow_end;
+ int cpu, testcpu;
+ s64 delta;
+
+ cpumask_clear(&cpus_ahead);
+ cpumask_clear(&cpus_behind);
+ preempt_disable();
+ testcpu = smp_processor_id();
+ pr_warn("Checking clocksource %s synchronization from CPU %d.\n", cs->name, testcpu);
+ for_each_online_cpu(cpu) {
+ if (cpu == testcpu)
+ continue;
+ csnow_begin = cs->read(cs);
+ smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
+ if (delta < 0)
+ cpumask_set_cpu(cpu, &cpus_behind);
+ delta = (csnow_end - csnow_mid) & cs->mask;
+ if (delta < 0)
+ cpumask_set_cpu(cpu, &cpus_ahead);
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ if (cs_nsec > cs_nsec_max)
+ cs_nsec_max = cs_nsec;
+ if (cs_nsec < cs_nsec_min)
+ cs_nsec_min = cs_nsec;
+ }
+ preempt_enable();
+ if (!cpumask_empty(&cpus_ahead))
+ pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_ahead), testcpu, cs->name);
+ if (!cpumask_empty(&cpus_behind))
+ pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_behind), testcpu, cs->name);
+ if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind))
+ pr_warn(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
+ testcpu, cs_nsec_min, cs_nsec_max, cs->name);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
u64 csnow, wdnow, cslast, wdlast, delta;
@@ -469,6 +523,12 @@ static int __clocksource_watchdog_kthread(void)
unsigned long flags;
int select = 0;

+ /* Do any required per-CPU skew verification. */
+ if (curr_clocksource &&
+ curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE &&
+ curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU)
+ clocksource_verify_percpu(curr_clocksource);
+
spin_lock_irqsave(&watchdog_lock, flags);
list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
--
2.31.1.189.g2e36527f23

2021-04-25 22:49:30

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v9 clocksource 6/6] clocksource: Reduce WATCHDOG_THRESHOLD

Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed
by more than 12.5% in order to be marked unstable. Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left. And given that there are now checks for false-positive
skews due to delays between reading the two clocks, and given that current
hardware clocks all increment well in excess of 1MHz, it should be possible
to greatly decrease WATCHDOG_THRESHOLD.

Therefore, decrease WATCHDOG_THRESHOLD from the current 62.5 milliseconds
down to 200 microseconds.

Suggested-by: Thomas Gleixner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 2981e293a468..d82ad0492777 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -125,14 +125,14 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
* Interval: 0.5sec Threshold: 0.0625s
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
-#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_THRESHOLD (200 * NSEC_PER_USEC)

/*
* Maximum permissible delay between two readouts of the watchdog
* clocksource surrounding a read of the clocksource being validated.
* This delay could be due to SMIs, NMIs, or to VCPU preemptions.
*/
-#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
+#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)

static void clocksource_watchdog_work(struct work_struct *work)
{
--
2.31.1.189.g2e36527f23

2021-04-25 22:50:24

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v10 clocksource 7/7] clocksource: Reduce WATCHDOG_THRESHOLD

Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed
by more than 12.5% in order to be marked unstable. Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left. And given that there are now checks for false-positive
skews due to delays between reading the two clocks, and given that current
hardware clocks all increment well in excess of 1MHz, it should be possible
to greatly decrease WATCHDOG_THRESHOLD.

Therefore, decrease WATCHDOG_THRESHOLD from the current 62.5 milliseconds
down to 200 microseconds.

Suggested-by: Thomas Gleixner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 070b09dab815..a374cf7b6336 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -125,7 +125,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
* Interval: 0.5sec Threshold: 0.0625s
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
-#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_THRESHOLD (200 * NSEC_PER_USEC)
#define WATCHDOG_SYNC_FORGIVENESS (HZ * 60UL)

/*
@@ -133,7 +133,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
* clocksource surrounding a read of the clocksource being validated.
* This delay could be due to SMIs, NMIs, or to VCPU preemptions.
*/
-#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
+#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)

static void clocksource_watchdog_work(struct work_struct *work)
{
--
2.31.1.189.g2e36527f23

2021-04-25 22:52:15

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

Because the x86 tsc_early clocksource is given a quick and semi-accurate
calibration (by design!), it might have drift rates well in excess of
the 0.1% limit that is in the process of being adopted.

Therefore, add a max_drift field to the clocksource structure that, when
non-zero, specifies the maximum allowable drift rate in nanoseconds over
a half-second period. The tsc_early clocksource initializes this to five
miliseconds, which corresponds to the 1% drift rate limit suggested by
Xing Zhengjun. This max_drift field is intended only for early boot,
so clocksource_watchdog() splats if it encounters a non-zero value in
this field more than 60 seconds after boot, inspired by a suggestion by
Thomas Gleixner.

This was tested by setting the clocksource_tsc ->max_drift field to 1,
which, as expected, resulted in a clock-skew event.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Xing Zhengjun <[email protected]>
Cc: Feng Tang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/tsc.c | 1 +
include/linux/clocksource.h | 5 +++++
kernel/time/clocksource.c | 19 ++++++++++++++++++-
3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 56289170753c..7192b8950322 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1127,6 +1127,7 @@ static int tsc_cs_enable(struct clocksource *cs)
static struct clocksource clocksource_tsc_early = {
.name = "tsc-early",
.rating = 299,
+ .max_drift = 5 * NSEC_PER_MSEC,
.read = read_tsc,
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 83a3ebff7456..07d25a158d7a 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -42,6 +42,10 @@ struct module;
* @shift: Cycle to nanosecond divisor (power of two)
* @max_idle_ns: Maximum idle time permitted by the clocksource (nsecs)
* @maxadj: Maximum adjustment value to mult (~11%)
+ * @max_drift: Maximum drift rate in nanoseconds per half second.
+ * Zero says to use default WATCHDOG_THRESHOLD.
+ * A non-zero value for ->max_drift more than 60 seconds after boot
+ * will result in a splat.
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
@@ -93,6 +97,7 @@ struct clocksource {
u32 shift;
u64 max_idle_ns;
u32 maxadj;
+ u32 max_drift;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
struct arch_clocksource_data archdata;
#endif
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index f71f375df544..070b09dab815 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -126,6 +126,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_SYNC_FORGIVENESS (HZ * 60UL)

/*
* Maximum permissible delay between two readouts of the watchdog
@@ -377,6 +378,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int next_cpu, reset_pending;
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
+ u32 md;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -423,7 +425,22 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;

/* Check the deviation from the watchdog clocksource. */
- if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
+ if (!cs->max_drift) {
+ md = WATCHDOG_THRESHOLD;
+ } else {
+ static unsigned long first_jiffies;
+ static bool beenhere;
+
+ if (beenhere) {
+ WARN_ON_ONCE(time_after(jiffies,
+ first_jiffies + WATCHDOG_SYNC_FORGIVENESS));
+ } else {
+ beenhere = true;
+ first_jiffies = jiffies;
+ }
+ md = cs->max_drift;
+ }
+ if (abs(cs_nsec - wd_nsec) > md) {
pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
smp_processor_id(), cs->name);
pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
--
2.31.1.189.g2e36527f23

2021-04-26 04:09:01

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

> occur between the reads of the two clocks. Yes, interrupts are disabled
> across those two reads, but there are no shortage of things that can
> delay interrupts-disabled regions of code ranging from SMI handlers to
> vCPU preemption. It would be good to have some indication as to why

I assume vCPU preemption here refers to preempt RT? I didn't think
a standard kernel could preempt when interrupt are disabled.

>
> + clocksource.inject_delay_period= [KNL]
> + Number of calls to clocksource_watchdog() before
> + delays are injected between reads from the
> + two clocksources. Values of zero disable this
> + delay injection. These delays can cause clocks
> + to be marked unstable, so use of this parameter
> + should therefore be avoided on production systems.
> + Defaults to zero (disabled).
> +
> + clocksource.inject_delay_repeat= [KNL]
> + Number of repeated clocksource_watchdog() delay
> + injections per period. If inject_delay_period
> + is five and inject_delay_repeat is three, there
> + will be five delay-free reads followed by three
> + delayed reads.

I'm not sure command line options are the right way to do this.
How about integrating it with the fault injection framework in debugfs.

This way syzkaller etc. can play with it, which long term would
give much better test coverage.

This wouldn't allow boot time coverage, but presumably that's not
too important here.

-Andi

2021-04-26 04:13:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 3/7] clocksource: Check per-CPU clock synchronization when marked unstable

On Sun, Apr 25, 2021 at 03:47:04PM -0700, Paul E. McKenney wrote:
> Some sorts of per-CPU clock sources have a history of going out of
> synchronization with each other. However, this problem has purportedy
> been solved in the past ten years. Except that it is all too possible
> that the problem has instead simply been made less likely, which might
> mean that some of the occasional "Marking clocksource 'tsc' as unstable"
> messages might be due to desynchronization. How would anyone know?
>
> Therefore apply CPU-to-CPU synchronization checking to newly unstable
> clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag.
> Lists of desynchronized CPUs are printed, with the caveat that if it
> is the reporting CPU that is itself desynchronized, it will appear that
> all the other clocks are wrong. Just like in real life.

Well I could see this causing a gigantic flood of messages then.
Assume I have 300 cores, do I get all those messages 300 times repeated
then? If the console is slow this might end up taking a lot
of CPU time.

And in a larger cluster this might not be uncommon.

There must be some way to throttle this.

-Andi

2021-04-26 07:15:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

On Sun, Apr 25 2021 at 21:07, Andi Kleen wrote:
>> occur between the reads of the two clocks. Yes, interrupts are disabled
>> across those two reads, but there are no shortage of things that can
>> delay interrupts-disabled regions of code ranging from SMI handlers to
>> vCPU preemption. It would be good to have some indication as to why
>
> I assume vCPU preemption here refers to preempt RT? I didn't think
> a standard kernel could preempt when interrupt are disabled.

Neither does Preempt-RT. It's about vCPU's being preempted by the
hypervisor which can happen anytime independent of the state of the
guest.

Thanks,

tglx

2021-04-26 07:18:36

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 3/7] clocksource: Check per-CPU clock synchronization when marked unstable

On Sun, Apr 25 2021 at 21:12, Andi Kleen wrote:

> On Sun, Apr 25, 2021 at 03:47:04PM -0700, Paul E. McKenney wrote:
>> Some sorts of per-CPU clock sources have a history of going out of
>> synchronization with each other. However, this problem has purportedy
>> been solved in the past ten years. Except that it is all too possible
>> that the problem has instead simply been made less likely, which might
>> mean that some of the occasional "Marking clocksource 'tsc' as unstable"
>> messages might be due to desynchronization. How would anyone know?
>>
>> Therefore apply CPU-to-CPU synchronization checking to newly unstable
>> clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag.
>> Lists of desynchronized CPUs are printed, with the caveat that if it
>> is the reporting CPU that is itself desynchronized, it will appear that
>> all the other clocks are wrong. Just like in real life.
>
> Well I could see this causing a gigantic flood of messages then.
> Assume I have 300 cores, do I get all those messages 300 times repeated
> then? If the console is slow this might end up taking a lot
> of CPU time.

Exactly 4 pr_warn() lines in dmesg which are emitted exactly once when
the TSC deviates from the watchdog where three of them print a cpumask
are really a gigantic flood of messsages.

Thanks,

tglx

2021-04-26 15:03:40

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

Hi Paul,

On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> Because the x86 tsc_early clocksource is given a quick and semi-accurate
> calibration (by design!), it might have drift rates well in excess of
> the 0.1% limit that is in the process of being adopted.
>
> Therefore, add a max_drift field to the clocksource structure that, when
> non-zero, specifies the maximum allowable drift rate in nanoseconds over
> a half-second period. The tsc_early clocksource initializes this to five
> miliseconds, which corresponds to the 1% drift rate limit suggested by
> Xing Zhengjun. This max_drift field is intended only for early boot,
> so clocksource_watchdog() splats if it encounters a non-zero value in
> this field more than 60 seconds after boot, inspired by a suggestion by
> Thomas Gleixner.
>
> This was tested by setting the clocksource_tsc ->max_drift field to 1,
> which, as expected, resulted in a clock-skew event.

We've run the same last for this v10, and those 'unstable' thing [1] can
not be reproduced!

We've reported one case that tsc can be wrongly judged as 'unstable'
by 'refined-jiffies' watchdog [1], while reducing the threshold could
make it easier to be triggered.

It could be reproduced on the a plaform with a 115200 serial console,
and hpet been disabled (several x86 platforms has this), add
'initcall_debug' cmdline parameter to get more debug message, we can
see:

[ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
[ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
[ 1.134220] clocksource: No current clocksource.
[ 1.134222] tsc: Marking TSC unstable due to clocksource watchdog

[1]. https://lore.kernel.org/lkml/20210420064934.GE31773@xsang-OptiPlex-9020/
[2]. https://lore.kernel.org/lkml/[email protected]/

Thanks,
Feng

> Cc: John Stultz <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Stephen Boyd <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Xing Zhengjun <[email protected]>
> Cc: Feng Tang <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> arch/x86/kernel/tsc.c | 1 +
> include/linux/clocksource.h | 5 +++++
> kernel/time/clocksource.c | 19 ++++++++++++++++++-
> 3 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 56289170753c..7192b8950322 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1127,6 +1127,7 @@ static int tsc_cs_enable(struct clocksource *cs)
> static struct clocksource clocksource_tsc_early = {
> .name = "tsc-early",
> .rating = 299,
> + .max_drift = 5 * NSEC_PER_MSEC,
> .read = read_tsc,
> .mask = CLOCKSOURCE_MASK(64),
> .flags = CLOCK_SOURCE_IS_CONTINUOUS |
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 83a3ebff7456..07d25a158d7a 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -42,6 +42,10 @@ struct module;
> * @shift: Cycle to nanosecond divisor (power of two)
> * @max_idle_ns: Maximum idle time permitted by the clocksource (nsecs)
> * @maxadj: Maximum adjustment value to mult (~11%)
> + * @max_drift: Maximum drift rate in nanoseconds per half second.
> + * Zero says to use default WATCHDOG_THRESHOLD.
> + * A non-zero value for ->max_drift more than 60 seconds after boot
> + * will result in a splat.
> * @archdata: Optional arch-specific data
> * @max_cycles: Maximum safe cycle value which won't overflow on
> * multiplication
> @@ -93,6 +97,7 @@ struct clocksource {
> u32 shift;
> u64 max_idle_ns;
> u32 maxadj;
> + u32 max_drift;
> #ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
> struct arch_clocksource_data archdata;
> #endif
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index f71f375df544..070b09dab815 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -126,6 +126,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
> */
> #define WATCHDOG_INTERVAL (HZ >> 1)
> #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
> +#define WATCHDOG_SYNC_FORGIVENESS (HZ * 60UL)
>
> /*
> * Maximum permissible delay between two readouts of the watchdog
> @@ -377,6 +378,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> int next_cpu, reset_pending;
> int64_t wd_nsec, cs_nsec;
> struct clocksource *cs;
> + u32 md;
>
> spin_lock(&watchdog_lock);
> if (!watchdog_running)
> @@ -423,7 +425,22 @@ static void clocksource_watchdog(struct timer_list *unused)
> continue;
>
> /* Check the deviation from the watchdog clocksource. */
> - if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
> + if (!cs->max_drift) {
> + md = WATCHDOG_THRESHOLD;
> + } else {
> + static unsigned long first_jiffies;
> + static bool beenhere;
> +
> + if (beenhere) {
> + WARN_ON_ONCE(time_after(jiffies,
> + first_jiffies + WATCHDOG_SYNC_FORGIVENESS));
> + } else {
> + beenhere = true;
> + first_jiffies = jiffies;
> + }
> + md = cs->max_drift;
> + }
> + if (abs(cs_nsec - wd_nsec) > md) {
> pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> smp_processor_id(), cs->name);
> pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
> --
> 2.31.1.189.g2e36527f23

2021-04-26 15:26:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

On Mon, Apr 26, 2021 at 11:01:27PM +0800, Feng Tang wrote:
> Hi Paul,
>
> On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> > Because the x86 tsc_early clocksource is given a quick and semi-accurate
> > calibration (by design!), it might have drift rates well in excess of
> > the 0.1% limit that is in the process of being adopted.
> >
> > Therefore, add a max_drift field to the clocksource structure that, when
> > non-zero, specifies the maximum allowable drift rate in nanoseconds over
> > a half-second period. The tsc_early clocksource initializes this to five
> > miliseconds, which corresponds to the 1% drift rate limit suggested by
> > Xing Zhengjun. This max_drift field is intended only for early boot,
> > so clocksource_watchdog() splats if it encounters a non-zero value in
> > this field more than 60 seconds after boot, inspired by a suggestion by
> > Thomas Gleixner.
> >
> > This was tested by setting the clocksource_tsc ->max_drift field to 1,
> > which, as expected, resulted in a clock-skew event.
>
> We've run the same last for this v10, and those 'unstable' thing [1] can
> not be reproduced!

Good to hear! ;-)

> We've reported one case that tsc can be wrongly judged as 'unstable'
> by 'refined-jiffies' watchdog [1], while reducing the threshold could
> make it easier to be triggered.
>
> It could be reproduced on the a plaform with a 115200 serial console,
> and hpet been disabled (several x86 platforms has this), add
> 'initcall_debug' cmdline parameter to get more debug message, we can
> see:
>
> [ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
> [ 1.134220] clocksource: No current clocksource.
> [ 1.134222] tsc: Marking TSC unstable due to clocksource watchdog

Just to make sure I understand: "could be reproduced" as in this is the
result from v9, and v10 avoids this, correct?

Thanx, Paul

> [1]. https://lore.kernel.org/lkml/20210420064934.GE31773@xsang-OptiPlex-9020/
> [2]. https://lore.kernel.org/lkml/[email protected]/
>
> Thanks,
> Feng
>
> > Cc: John Stultz <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Stephen Boyd <[email protected]>
> > Cc: Jonathan Corbet <[email protected]>
> > Cc: Mark Rutland <[email protected]>
> > Cc: Marc Zyngier <[email protected]>
> > Cc: Andi Kleen <[email protected]>
> > Cc: Xing Zhengjun <[email protected]>
> > Cc: Feng Tang <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > arch/x86/kernel/tsc.c | 1 +
> > include/linux/clocksource.h | 5 +++++
> > kernel/time/clocksource.c | 19 ++++++++++++++++++-
> > 3 files changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index 56289170753c..7192b8950322 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1127,6 +1127,7 @@ static int tsc_cs_enable(struct clocksource *cs)
> > static struct clocksource clocksource_tsc_early = {
> > .name = "tsc-early",
> > .rating = 299,
> > + .max_drift = 5 * NSEC_PER_MSEC,
> > .read = read_tsc,
> > .mask = CLOCKSOURCE_MASK(64),
> > .flags = CLOCK_SOURCE_IS_CONTINUOUS |
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index 83a3ebff7456..07d25a158d7a 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -42,6 +42,10 @@ struct module;
> > * @shift: Cycle to nanosecond divisor (power of two)
> > * @max_idle_ns: Maximum idle time permitted by the clocksource (nsecs)
> > * @maxadj: Maximum adjustment value to mult (~11%)
> > + * @max_drift: Maximum drift rate in nanoseconds per half second.
> > + * Zero says to use default WATCHDOG_THRESHOLD.
> > + * A non-zero value for ->max_drift more than 60 seconds after boot
> > + * will result in a splat.
> > * @archdata: Optional arch-specific data
> > * @max_cycles: Maximum safe cycle value which won't overflow on
> > * multiplication
> > @@ -93,6 +97,7 @@ struct clocksource {
> > u32 shift;
> > u64 max_idle_ns;
> > u32 maxadj;
> > + u32 max_drift;
> > #ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
> > struct arch_clocksource_data archdata;
> > #endif
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index f71f375df544..070b09dab815 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -126,6 +126,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
> > */
> > #define WATCHDOG_INTERVAL (HZ >> 1)
> > #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
> > +#define WATCHDOG_SYNC_FORGIVENESS (HZ * 60UL)
> >
> > /*
> > * Maximum permissible delay between two readouts of the watchdog
> > @@ -377,6 +378,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > int next_cpu, reset_pending;
> > int64_t wd_nsec, cs_nsec;
> > struct clocksource *cs;
> > + u32 md;
> >
> > spin_lock(&watchdog_lock);
> > if (!watchdog_running)
> > @@ -423,7 +425,22 @@ static void clocksource_watchdog(struct timer_list *unused)
> > continue;
> >
> > /* Check the deviation from the watchdog clocksource. */
> > - if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
> > + if (!cs->max_drift) {
> > + md = WATCHDOG_THRESHOLD;
> > + } else {
> > + static unsigned long first_jiffies;
> > + static bool beenhere;
> > +
> > + if (beenhere) {
> > + WARN_ON_ONCE(time_after(jiffies,
> > + first_jiffies + WATCHDOG_SYNC_FORGIVENESS));
> > + } else {
> > + beenhere = true;
> > + first_jiffies = jiffies;
> > + }
> > + md = cs->max_drift;
> > + }
> > + if (abs(cs_nsec - wd_nsec) > md) {
> > pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> > smp_processor_id(), cs->name);
> > pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
> > --
> > 2.31.1.189.g2e36527f23

2021-04-26 15:29:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

On Mon, Apr 26 2021 at 23:01, Feng Tang wrote:
> On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> We've reported one case that tsc can be wrongly judged as 'unstable'
> by 'refined-jiffies' watchdog [1], while reducing the threshold could
> make it easier to be triggered.
>
> It could be reproduced on the a plaform with a 115200 serial console,
> and hpet been disabled (several x86 platforms has this), add
> 'initcall_debug' cmdline parameter to get more debug message, we can
> see:
>
> [ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
> [ 1.134220] clocksource: No current clocksource.
> [ 1.134222] tsc: Marking TSC unstable due to clocksource watchdog

Yes, refined jiffies is the worst case.

Though the worst of everything is that any of this exists at all, but
you knew that already :)

Thanks,

tglx

2021-04-26 15:29:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

On Sun, Apr 25, 2021 at 09:07:36PM -0700, Andi Kleen wrote:

[ . . . ]

> > + clocksource.inject_delay_period= [KNL]
> > + Number of calls to clocksource_watchdog() before
> > + delays are injected between reads from the
> > + two clocksources. Values of zero disable this
> > + delay injection. These delays can cause clocks
> > + to be marked unstable, so use of this parameter
> > + should therefore be avoided on production systems.
> > + Defaults to zero (disabled).
> > +
> > + clocksource.inject_delay_repeat= [KNL]
> > + Number of repeated clocksource_watchdog() delay
> > + injections per period. If inject_delay_period
> > + is five and inject_delay_repeat is three, there
> > + will be five delay-free reads followed by three
> > + delayed reads.
>
> I'm not sure command line options are the right way to do this.
> How about integrating it with the fault injection framework in debugfs.
>
> This way syzkaller etc. can play with it, which long term would
> give much better test coverage.
>
> This wouldn't allow boot time coverage, but presumably that's not
> too important here.

Boot-time coverage is important, as we saw in kbuild test robot testing
of v9 of this patchset, which triggered clocksource_tsc_early, but not
clocksource_tsc. Note that v10 avoids this triggering.

Thanx, Paul

2021-04-26 15:37:41

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

On Mon, Apr 26, 2021 at 08:25:29AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 26, 2021 at 11:01:27PM +0800, Feng Tang wrote:
> > Hi Paul,
> >
> > On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> > > Because the x86 tsc_early clocksource is given a quick and semi-accurate
> > > calibration (by design!), it might have drift rates well in excess of
> > > the 0.1% limit that is in the process of being adopted.
> > >
> > > Therefore, add a max_drift field to the clocksource structure that, when
> > > non-zero, specifies the maximum allowable drift rate in nanoseconds over
> > > a half-second period. The tsc_early clocksource initializes this to five
> > > miliseconds, which corresponds to the 1% drift rate limit suggested by
> > > Xing Zhengjun. This max_drift field is intended only for early boot,
> > > so clocksource_watchdog() splats if it encounters a non-zero value in
> > > this field more than 60 seconds after boot, inspired by a suggestion by
> > > Thomas Gleixner.
> > >
> > > This was tested by setting the clocksource_tsc ->max_drift field to 1,
> > > which, as expected, resulted in a clock-skew event.
> >
> > We've run the same last for this v10, and those 'unstable' thing [1] can
> > not be reproduced!
>
> Good to hear! ;-)
>
> > We've reported one case that tsc can be wrongly judged as 'unstable'
> > by 'refined-jiffies' watchdog [1], while reducing the threshold could
> > make it easier to be triggered.
> >
> > It could be reproduced on the a plaform with a 115200 serial console,
> > and hpet been disabled (several x86 platforms has this), add
> > 'initcall_debug' cmdline parameter to get more debug message, we can
> > see:
> >
> > [ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> > [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
> > [ 1.134220] clocksource: No current clocksource.
> > [ 1.134222] tsc: Marking TSC unstable due to clocksource watchdog
>
> Just to make sure I understand: "could be reproduced" as in this is the
> result from v9, and v10 avoids this, correct?

Sorry I didn't make it clear. This is a rarely happened case, and can
be reproduced with upstream kerenl, which has 62.5 ms threshold. 6/7 &
7/7 patch of reducing the threshold can make it easier to be triggered.

Thanks,
Feng


> Thanx, Paul
>

2021-04-26 16:01:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

> Boot-time coverage is important, as we saw in kbuild test robot testing
> of v9 of this patchset, which triggered clocksource_tsc_early, but not
> clocksource_tsc. Note that v10 avoids this triggering.

Not sure how that could happen, I didn't think the kbuild bot knows
how to set these options.

Overall I think you should follow the existing frame works instead
of inventing your own so that things can be later tested in an
uniform manner.

Perhaps the fault injection framework needs boot options, but
they should be generic.

-Andi

2021-04-26 16:15:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

On Mon, Apr 26, 2021 at 09:00:08AM -0700, Andi Kleen wrote:
> > Boot-time coverage is important, as we saw in kbuild test robot testing
> > of v9 of this patchset, which triggered clocksource_tsc_early, but not
> > clocksource_tsc. Note that v10 avoids this triggering.
>
> Not sure how that could happen, I didn't think the kbuild bot knows
> how to set these options.

It didn't need to. My initial choice of maximum skew in the v9 patch
series exceeded what clocksource_tsc_early could remain within, given its
quick calibration. In v10 of this patch series, clocksource_tsc_early
is given special dispensation.

> Overall I think you should follow the existing frame works instead
> of inventing your own so that things can be later tested in an
> uniform manner.
>
> Perhaps the fault injection framework needs boot options, but
> they should be generic.

From Documentation/fault-injection/fault-injection.rst:

------------------------------------------------------------------------

- module parameters

If the scope of the fault injection capability is limited to a
single kernel module, it is better to provide module parameters to
configure the fault attributes.

------------------------------------------------------------------------

And in this case, the fault injection capability is in fact limited to
kernel/clocksource.c.

Thanx, Paul

2021-04-26 18:00:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

> ------------------------------------------------------------------------
>
> - module parameters
>
> If the scope of the fault injection capability is limited to a
> single kernel module, it is better to provide module parameters to
> configure the fault attributes.
>
> ------------------------------------------------------------------------
>
> And in this case, the fault injection capability is in fact limited to
> kernel/clocksource.c.


I disagree with this recommendation because it prevents fuzzer coverage.

Much better to have an uniform interface that can be automatically
explored.

-Andi

2021-04-26 18:27:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

On Mon, Apr 26, 2021 at 10:56:27AM -0700, Andi Kleen wrote:
> > ------------------------------------------------------------------------
> >
> > - module parameters
> >
> > If the scope of the fault injection capability is limited to a
> > single kernel module, it is better to provide module parameters to
> > configure the fault attributes.
> >
> > ------------------------------------------------------------------------
> >
> > And in this case, the fault injection capability is in fact limited to
> > kernel/clocksource.c.
>
>
> I disagree with this recommendation because it prevents fuzzer coverage.
>
> Much better to have an uniform interface that can be automatically
> explored.

The permissions for these module parameters is 0644, so there is no
reason why the fuzzers cannot use them via sysfs.

Thanx, Paul

2021-04-26 18:28:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

On Mon, Apr 26, 2021 at 11:36:05PM +0800, Feng Tang wrote:
> On Mon, Apr 26, 2021 at 08:25:29AM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 26, 2021 at 11:01:27PM +0800, Feng Tang wrote:
> > > Hi Paul,
> > >
> > > On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> > > > Because the x86 tsc_early clocksource is given a quick and semi-accurate
> > > > calibration (by design!), it might have drift rates well in excess of
> > > > the 0.1% limit that is in the process of being adopted.
> > > >
> > > > Therefore, add a max_drift field to the clocksource structure that, when
> > > > non-zero, specifies the maximum allowable drift rate in nanoseconds over
> > > > a half-second period. The tsc_early clocksource initializes this to five
> > > > miliseconds, which corresponds to the 1% drift rate limit suggested by
> > > > Xing Zhengjun. This max_drift field is intended only for early boot,
> > > > so clocksource_watchdog() splats if it encounters a non-zero value in
> > > > this field more than 60 seconds after boot, inspired by a suggestion by
> > > > Thomas Gleixner.
> > > >
> > > > This was tested by setting the clocksource_tsc ->max_drift field to 1,
> > > > which, as expected, resulted in a clock-skew event.
> > >
> > > We've run the same last for this v10, and those 'unstable' thing [1] can
> > > not be reproduced!
> >
> > Good to hear! ;-)
> >
> > > We've reported one case that tsc can be wrongly judged as 'unstable'
> > > by 'refined-jiffies' watchdog [1], while reducing the threshold could
> > > make it easier to be triggered.
> > >
> > > It could be reproduced on the a plaform with a 115200 serial console,
> > > and hpet been disabled (several x86 platforms has this), add
> > > 'initcall_debug' cmdline parameter to get more debug message, we can
> > > see:
> > >
> > > [ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > > [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> > > [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
> > > [ 1.134220] clocksource: No current clocksource.
> > > [ 1.134222] tsc: Marking TSC unstable due to clocksource watchdog
> >
> > Just to make sure I understand: "could be reproduced" as in this is the
> > result from v9, and v10 avoids this, correct?
>
> Sorry I didn't make it clear. This is a rarely happened case, and can
> be reproduced with upstream kerenl, which has 62.5 ms threshold. 6/7 &
> 7/7 patch of reducing the threshold can make it easier to be triggered.

Ah, OK, so this could be considered to be a benefit of this series, then.

Does this happen only for tsc-early, or for tsc as well?

Has it already been triggered on v10 of this series? (I understand that
it certainly should be easier to trigger, just curious whether this has
already happened.)

Thanx, Paul

2021-04-27 01:18:56

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

On Mon, Apr 26, 2021 at 11:26:52AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 26, 2021 at 11:36:05PM +0800, Feng Tang wrote:
> > On Mon, Apr 26, 2021 at 08:25:29AM -0700, Paul E. McKenney wrote:
> > > On Mon, Apr 26, 2021 at 11:01:27PM +0800, Feng Tang wrote:
> > > > Hi Paul,
> > > >
> > > > On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> > > > > Because the x86 tsc_early clocksource is given a quick and semi-accurate
> > > > > calibration (by design!), it might have drift rates well in excess of
> > > > > the 0.1% limit that is in the process of being adopted.
> > > > >
> > > > > Therefore, add a max_drift field to the clocksource structure that, when
> > > > > non-zero, specifies the maximum allowable drift rate in nanoseconds over
> > > > > a half-second period. The tsc_early clocksource initializes this to five
> > > > > miliseconds, which corresponds to the 1% drift rate limit suggested by
> > > > > Xing Zhengjun. This max_drift field is intended only for early boot,
> > > > > so clocksource_watchdog() splats if it encounters a non-zero value in
> > > > > this field more than 60 seconds after boot, inspired by a suggestion by
> > > > > Thomas Gleixner.
> > > > >
> > > > > This was tested by setting the clocksource_tsc ->max_drift field to 1,
> > > > > which, as expected, resulted in a clock-skew event.
> > > >
> > > > We've run the same last for this v10, and those 'unstable' thing [1] can
> > > > not be reproduced!
> > >
> > > Good to hear! ;-)
> > >
> > > > We've reported one case that tsc can be wrongly judged as 'unstable'
> > > > by 'refined-jiffies' watchdog [1], while reducing the threshold could
> > > > make it easier to be triggered.
> > > >
> > > > It could be reproduced on the a plaform with a 115200 serial console,
> > > > and hpet been disabled (several x86 platforms has this), add
> > > > 'initcall_debug' cmdline parameter to get more debug message, we can
> > > > see:
> > > >
> > > > [ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > > > [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> > > > [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
> > > > [ 1.134220] clocksource: No current clocksource.
> > > > [ 1.134222] tsc: Marking TSC unstable due to clocksource watchdog
> > >
> > > Just to make sure I understand: "could be reproduced" as in this is the
> > > result from v9, and v10 avoids this, correct?
> >
> > Sorry I didn't make it clear. This is a rarely happened case, and can
> > be reproduced with upstream kerenl, which has 62.5 ms threshold. 6/7 &
> > 7/7 patch of reducing the threshold can make it easier to be triggered.
>
> Ah, OK, so this could be considered to be a benefit of this series, then.
>
> Does this happen only for tsc-early, or for tsc as well?
>
> Has it already been triggered on v10 of this series? (I understand that
> it certainly should be easier to trigger, just curious whether this has
> already happened.)

Yes, it has. The upper log is from v10 (actually it's the 'dev' branch
of your linux-rcu git, which I didn't find obvious difference) on a
client platform

[ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
[ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff

The deviation is 7537855 ns (7.5 ms). And as said before, it needs many
pre-conditions to be triggered.

Also I found the debug patch is useful, which prints out the direct
nanoseconds info when 'unstable' is detected.

kernel/time/clocksource.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a374cf7b6336..5370f0c84981 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -443,10 +443,10 @@ static void clocksource_watchdog(struct timer_list *unused)
if (abs(cs_nsec - wd_nsec) > md) {
pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
smp_processor_id(), cs->name);
- pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
- watchdog->name, wdnow, wdlast, watchdog->mask);
- pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
- cs->name, csnow, cslast, cs->mask);
+ pr_warn(" '%s' wd_nesc: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
+ watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
+ pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
+ cs->name, cs_nsec, csnow, cslast, cs->mask);
if (curr_clocksource == cs)
pr_warn(" '%s' is current clocksource.\n", cs->name);
else if (curr_clocksource)
--
2.27.0

Thanks,
Feng



>
> Thanx, Paul

2021-04-27 01:46:28

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 2/7] clocksource: Retry clock read if long delays detected

On Sun, Apr 25, 2021 at 03:47:03PM -0700, Paul E. McKenney wrote:
> When the clocksource watchdog marks a clock as unstable, this might
> be due to that clock being unstable or it might be due to delays that
> happen to occur between the reads of the two clocks. Yes, interrupts are
> disabled across those two reads, but there are no shortage of things that
> can delay interrupts-disabled regions of code ranging from SMI handlers
> to vCPU preemption. It would be good to have some indication as to why
> the clock was marked unstable.
>
> Therefore, re-read the watchdog clock on either side of the read
> from the clock under test. If the watchdog clock shows an excessive
> time delta between its pair of reads, the reads are retried. The
> maximum number of retries is specified by a new kernel boot parameter
> clocksource.max_read_retries, which defaults to three, that is, up to four
> reads, one initial and up to three retries. If more than one retry was
> required, a message is printed on the console (the occasional single retry
> is expected behavior, especially in guest OSes). If the maximum number
> of retries is exceeded, the clock under test will be marked unstable.
> However, the probability of this happening due to various sorts of
> delays is quite small. In addition, the reason (clock-read delays)
> for the unstable marking will be apparent.

I think this will help to filter some false alarms of unstable
cases, thanks!

Acked-by: Feng Tang <[email protected]>

> Cc: John Stultz <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Stephen Boyd <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Reported-by: Chris Mason <[email protected]>
> [ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
> [ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
> [ paulmck: Apply Thomas Gleixner feedback. ]
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> .../admin-guide/kernel-parameters.txt | 6 +++
> kernel/time/clocksource.c | 53 ++++++++++++++++---
> 2 files changed, 52 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 4a372037b49f..7fff95bd5504 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -599,6 +599,12 @@
> will be five delay-free reads followed by three
> delayed reads.
>
> + clocksource.max_read_retries= [KNL]
> + Number of clocksource_watchdog() retries due to
> + external delays before the clock will be marked
> + unstable. Defaults to three retries, that is,
> + four attempts to read the clock under test.
> +
> clearcpuid=BITNUM[,BITNUM...] [X86]
> Disable CPUID feature X for the kernel. See
> arch/x86/include/asm/cpufeatures.h for the valid bit
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index f1e1e6e4b387..94bfdb53f2f4 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -125,6 +125,13 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
> #define WATCHDOG_INTERVAL (HZ >> 1)
> #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
>
> +/*
> + * Maximum permissible delay between two readouts of the watchdog
> + * clocksource surrounding a read of the clocksource being validated.
> + * This delay could be due to SMIs, NMIs, or to VCPU preemptions.
> + */
> +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
> +
> static void clocksource_watchdog_work(struct work_struct *work)
> {
> /*
> @@ -189,6 +196,8 @@ static ulong inject_delay_period;
> module_param(inject_delay_period, ulong, 0644);
> static ulong inject_delay_repeat = 1;
> module_param(inject_delay_repeat, ulong, 0644);
> +static ulong max_read_retries = 3;
> +module_param(max_read_retries, ulong, 0644);
>
> static void clocksource_watchdog_inject_delay(void)
> {
> @@ -206,12 +215,42 @@ static void clocksource_watchdog_inject_delay(void)
> invocations++;
> }
>
> +static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> +{
> + unsigned int nretries;
> + u64 wd_end, wd_delta;
> + int64_t wd_delay;
> +
> + for (nretries = 0; nretries <= max_read_retries; nretries++) {
> + local_irq_disable();
> + *wdnow = watchdog->read(watchdog);
> + clocksource_watchdog_inject_delay();
> + *csnow = cs->read(cs);
> + wd_end = watchdog->read(watchdog);
> + local_irq_enable();
> +
> + wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
> + wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
> + if (wd_delay <= WATCHDOG_MAX_SKEW) {
> + if (nretries > 1 || nretries >= max_read_retries) {
> + pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> + smp_processor_id(), watchdog->name, nretries);
> + }
> + return true;
> + }
> + }
> +
> + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> + smp_processor_id(), watchdog->name, wd_delay, nretries);
> + return false;
> +}
> +
> static void clocksource_watchdog(struct timer_list *unused)
> {
> - struct clocksource *cs;
> u64 csnow, wdnow, cslast, wdlast, delta;
> - int64_t wd_nsec, cs_nsec;
> int next_cpu, reset_pending;
> + int64_t wd_nsec, cs_nsec;
> + struct clocksource *cs;
>
> spin_lock(&watchdog_lock);
> if (!watchdog_running)
> @@ -228,11 +267,11 @@ static void clocksource_watchdog(struct timer_list *unused)
> continue;
> }
>
> - local_irq_disable();
> - csnow = cs->read(cs);
> - clocksource_watchdog_inject_delay();
> - wdnow = watchdog->read(watchdog);
> - local_irq_enable();
> + if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> + /* Clock readout unreliable, so give it up. */
> + __clocksource_unstable(cs);
> + continue;
> + }
>
> /* Clocksource initialized ? */
> if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> --
> 2.31.1.189.g2e36527f23

2021-04-27 03:47:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

On Tue, Apr 27, 2021 at 09:13:55AM +0800, Feng Tang wrote:
> On Mon, Apr 26, 2021 at 11:26:52AM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 26, 2021 at 11:36:05PM +0800, Feng Tang wrote:
> > > On Mon, Apr 26, 2021 at 08:25:29AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Apr 26, 2021 at 11:01:27PM +0800, Feng Tang wrote:
> > > > > Hi Paul,
> > > > >
> > > > > On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> > > > > > Because the x86 tsc_early clocksource is given a quick and semi-accurate
> > > > > > calibration (by design!), it might have drift rates well in excess of
> > > > > > the 0.1% limit that is in the process of being adopted.
> > > > > >
> > > > > > Therefore, add a max_drift field to the clocksource structure that, when
> > > > > > non-zero, specifies the maximum allowable drift rate in nanoseconds over
> > > > > > a half-second period. The tsc_early clocksource initializes this to five
> > > > > > miliseconds, which corresponds to the 1% drift rate limit suggested by
> > > > > > Xing Zhengjun. This max_drift field is intended only for early boot,
> > > > > > so clocksource_watchdog() splats if it encounters a non-zero value in
> > > > > > this field more than 60 seconds after boot, inspired by a suggestion by
> > > > > > Thomas Gleixner.
> > > > > >
> > > > > > This was tested by setting the clocksource_tsc ->max_drift field to 1,
> > > > > > which, as expected, resulted in a clock-skew event.
> > > > >
> > > > > We've run the same last for this v10, and those 'unstable' thing [1] can
> > > > > not be reproduced!
> > > >
> > > > Good to hear! ;-)
> > > >
> > > > > We've reported one case that tsc can be wrongly judged as 'unstable'
> > > > > by 'refined-jiffies' watchdog [1], while reducing the threshold could
> > > > > make it easier to be triggered.
> > > > >
> > > > > It could be reproduced on the a plaform with a 115200 serial console,
> > > > > and hpet been disabled (several x86 platforms has this), add
> > > > > 'initcall_debug' cmdline parameter to get more debug message, we can
> > > > > see:
> > > > >
> > > > > [ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > > > > [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> > > > > [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
> > > > > [ 1.134220] clocksource: No current clocksource.
> > > > > [ 1.134222] tsc: Marking TSC unstable due to clocksource watchdog
> > > >
> > > > Just to make sure I understand: "could be reproduced" as in this is the
> > > > result from v9, and v10 avoids this, correct?
> > >
> > > Sorry I didn't make it clear. This is a rarely happened case, and can
> > > be reproduced with upstream kerenl, which has 62.5 ms threshold. 6/7 &
> > > 7/7 patch of reducing the threshold can make it easier to be triggered.
> >
> > Ah, OK, so this could be considered to be a benefit of this series, then.
> >
> > Does this happen only for tsc-early, or for tsc as well?
> >
> > Has it already been triggered on v10 of this series? (I understand that
> > it certainly should be easier to trigger, just curious whether this has
> > already happened.)
>
> Yes, it has. The upper log is from v10 (actually it's the 'dev' branch
> of your linux-rcu git, which I didn't find obvious difference) on a
> client platform
>
> [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
>
> The deviation is 7537855 ns (7.5 ms). And as said before, it needs many
> pre-conditions to be triggered.
>
> Also I found the debug patch is useful, which prints out the direct
> nanoseconds info when 'unstable' is detected.

Looks good to me!

If you give me a Signed-off-by, I would be happy to queue it.

Thanx, Paul

> kernel/time/clocksource.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index a374cf7b6336..5370f0c84981 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -443,10 +443,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> if (abs(cs_nsec - wd_nsec) > md) {
> pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> smp_processor_id(), cs->name);
> - pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
> - watchdog->name, wdnow, wdlast, watchdog->mask);
> - pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
> - cs->name, csnow, cslast, cs->mask);
> + pr_warn(" '%s' wd_nesc: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
> + watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
> + pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
> + cs->name, cs_nsec, csnow, cslast, cs->mask);
> if (curr_clocksource == cs)
> pr_warn(" '%s' is current clocksource.\n", cs->name);
> else if (curr_clocksource)
> --
> 2.27.0
>
> Thanks,
> Feng
>
>
>
> >
> > Thanx, Paul

2021-04-27 04:17:22

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

On Mon, Apr 26, 2021 at 08:46:40PM -0700, Paul E. McKenney wrote:
> On Tue, Apr 27, 2021 at 09:13:55AM +0800, Feng Tang wrote:
> > On Mon, Apr 26, 2021 at 11:26:52AM -0700, Paul E. McKenney wrote:
> > > On Mon, Apr 26, 2021 at 11:36:05PM +0800, Feng Tang wrote:
> > > > On Mon, Apr 26, 2021 at 08:25:29AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Apr 26, 2021 at 11:01:27PM +0800, Feng Tang wrote:
> > > > > > Hi Paul,
> > > > > >
> > > > > > On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> > > > > > > Because the x86 tsc_early clocksource is given a quick and semi-accurate
> > > > > > > calibration (by design!), it might have drift rates well in excess of
> > > > > > > the 0.1% limit that is in the process of being adopted.
> > > > > > >
> > > > > > > Therefore, add a max_drift field to the clocksource structure that, when
> > > > > > > non-zero, specifies the maximum allowable drift rate in nanoseconds over
> > > > > > > a half-second period. The tsc_early clocksource initializes this to five
> > > > > > > miliseconds, which corresponds to the 1% drift rate limit suggested by
> > > > > > > Xing Zhengjun. This max_drift field is intended only for early boot,
> > > > > > > so clocksource_watchdog() splats if it encounters a non-zero value in
> > > > > > > this field more than 60 seconds after boot, inspired by a suggestion by
> > > > > > > Thomas Gleixner.
> > > > > > >
> > > > > > > This was tested by setting the clocksource_tsc ->max_drift field to 1,
> > > > > > > which, as expected, resulted in a clock-skew event.
> > > > > >
> > > > > > We've run the same last for this v10, and those 'unstable' thing [1] can
> > > > > > not be reproduced!
> > > > >
> > > > > Good to hear! ;-)
> > > > >
> > > > > > We've reported one case that tsc can be wrongly judged as 'unstable'
> > > > > > by 'refined-jiffies' watchdog [1], while reducing the threshold could
> > > > > > make it easier to be triggered.
> > > > > >
> > > > > > It could be reproduced on the a plaform with a 115200 serial console,
> > > > > > and hpet been disabled (several x86 platforms has this), add
> > > > > > 'initcall_debug' cmdline parameter to get more debug message, we can
> > > > > > see:
> > > > > >
> > > > > > [ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > > > > > [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> > > > > > [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
> > > > > > [ 1.134220] clocksource: No current clocksource.
> > > > > > [ 1.134222] tsc: Marking TSC unstable due to clocksource watchdog
> > > > >
> > > > > Just to make sure I understand: "could be reproduced" as in this is the
> > > > > result from v9, and v10 avoids this, correct?
> > > >
> > > > Sorry I didn't make it clear. This is a rarely happened case, and can
> > > > be reproduced with upstream kerenl, which has 62.5 ms threshold. 6/7 &
> > > > 7/7 patch of reducing the threshold can make it easier to be triggered.
> > >
> > > Ah, OK, so this could be considered to be a benefit of this series, then.
> > >
> > > Does this happen only for tsc-early, or for tsc as well?
> > >
> > > Has it already been triggered on v10 of this series? (I understand that
> > > it certainly should be easier to trigger, just curious whether this has
> > > already happened.)
> >
> > Yes, it has. The upper log is from v10 (actually it's the 'dev' branch
> > of your linux-rcu git, which I didn't find obvious difference) on a
> > client platform
> >
> > [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff
> > [ 1.134217] clocksource: 'tsc-early' cs_nsec: 507537855 cs_now: 4e63c9d09 cs_last: 4bebd81f5 mask: ffffffffffffffff
> >
> > The deviation is 7537855 ns (7.5 ms). And as said before, it needs many
> > pre-conditions to be triggered.
> >
> > Also I found the debug patch is useful, which prints out the direct
> > nanoseconds info when 'unstable' is detected.
>
> Looks good to me!
>
> If you give me a Signed-off-by, I would be happy to queue it.

Sure, here it is. thanks!

- Feng

From ff23cc5589c84c4d5f9a009867f21ac5ce96c9e3 Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Tue, 27 Apr 2021 12:14:30 +0800
Subject: [PATCH] clocksource: print deviation in nanoseconds for unstable case

Currently when an unstable clocksource is detected, the raw counter
of that clocksource and watchdog will be printed, which can only be
understood after some math calculation. So print the existing delta
in nanoseconds for easier check.

Signed-off-by: Feng Tang <[email protected]>
---
kernel/time/clocksource.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a374cf7..5370f0c 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -443,10 +443,10 @@ static void clocksource_watchdog(struct timer_list *unused)
if (abs(cs_nsec - wd_nsec) > md) {
pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
smp_processor_id(), cs->name);
- pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
- watchdog->name, wdnow, wdlast, watchdog->mask);
- pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
- cs->name, csnow, cslast, cs->mask);
+ pr_warn(" '%s' wd_nesc: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
+ watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
+ pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
+ cs->name, cs_nsec, csnow, cslast, cs->mask);
if (curr_clocksource == cs)
pr_warn(" '%s' is current clocksource.\n", cs->name);
else if (curr_clocksource)
--
2.7.4

2021-04-27 07:10:43

by kernel test robot

[permalink] [raw]
Subject: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 8c30ace35da3b362089f5c903144d762a065b58a ("[PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift")
url: https://github.com/0day-ci/linux/commits/Paul-E-McKenney/Do-not-mark-clocks-unstable-due-to-delays-for-v5-13/20210426-064834
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2d036dfa5f10df9782f5278fc591d79d283c1fad

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 70.350923] WARNING: CPU: 0 PID: 1 at kernel/time/clocksource.c:435 clocksource_watchdog (kbuild/src/x86_64/kernel/time/clocksource.c:435)
[ 70.353152] Modules linked in:
[ 70.354074] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc3-00031-g8c30ace35da3 #2
[ 70.356180] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 70.358471] RIP: 0010:clocksource_watchdog (kbuild/src/x86_64/kernel/time/clocksource.c:435)
[ 70.359827] Code: 80 3d 9b 4b 89 03 00 45 89 d8 74 56 4c 8b 1d 87 4b 89 03 4c 8b 3d d8 8c dc 01 49 81 c3 60 ea 00 00 4d 39 fb 0f 89 32 fe ff ff <0f> 0b e9 2b fe ff ff 48 8b 85 80 00 00 00 48 85 c0 0f 84 52 fe ff
All code
========
0: 80 3d 9b 4b 89 03 00 cmpb $0x0,0x3894b9b(%rip) # 0x3894ba2
7: 45 89 d8 mov %r11d,%r8d
a: 74 56 je 0x62
c: 4c 8b 1d 87 4b 89 03 mov 0x3894b87(%rip),%r11 # 0x3894b9a
13: 4c 8b 3d d8 8c dc 01 mov 0x1dc8cd8(%rip),%r15 # 0x1dc8cf2
1a: 49 81 c3 60 ea 00 00 add $0xea60,%r11
21: 4d 39 fb cmp %r15,%r11
24: 0f 89 32 fe ff ff jns 0xfffffffffffffe5c
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 2b fe ff ff jmpq 0xfffffffffffffe5c
31: 48 8b 85 80 00 00 00 mov 0x80(%rbp),%rax
38: 48 85 c0 test %rax,%rax
3b: 0f .byte 0xf
3c: 84 52 fe test %dl,-0x2(%rdx)
3f: ff .byte 0xff

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 2b fe ff ff jmpq 0xfffffffffffffe32
7: 48 8b 85 80 00 00 00 mov 0x80(%rbp),%rax
e: 48 85 c0 test %rax,%rax
11: 0f .byte 0xf
12: 84 52 fe test %dl,-0x2(%rdx)
15: ff .byte 0xff
[ 70.364785] RSP: 0000:ffffb8e500003e58 EFLAGS: 00010293
[ 70.366296] RAX: 00000000005f0b1d RBX: 000000009b6490f0 RCX: 0000000000000018
[ 70.368352] RDX: 0000000000000033 RSI: 0000000039b0d3d7 RDI: 0000000000800000
[ 70.370237] RBP: ffffffffa82564c0 R08: 00000000004c4b40 R09: 00000038d59f836b
[ 70.372164] R10: 0000000000000017 R11: 00000000fffc5a70 R12: 000000383a3af27b
[ 70.373948] R13: 00000015011ba0ea R14: 00000014c76acd13 R15: 00000000fffc5e38
[ 70.375967] FS: 0000000000000000(0000) GS:ffff9fc56fc00000(0000) knlGS:0000000000000000
[ 70.378265] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 70.379942] CR2: 0000000000000000 CR3: 0000000366026000 CR4: 00000000000406f0
[ 70.381800] Call Trace:
[ 70.382655] <IRQ>
[ 70.383554] ? __clocksource_unstable (kbuild/src/x86_64/kernel/time/clocksource.c:376)
[ 70.384797] ? __clocksource_unstable (kbuild/src/x86_64/kernel/time/clocksource.c:376)
[ 70.386145] call_timer_fn (kbuild/src/x86_64/arch/x86/include/asm/jump_label.h:25 kbuild/src/x86_64/include/linux/jump_label.h:200 kbuild/src/x86_64/include/trace/events/timer.h:125 kbuild/src/x86_64/kernel/time/timer.c:1432)
[ 70.387393] run_timer_softirq (kbuild/src/x86_64/kernel/time/timer.c:1477 kbuild/src/x86_64/kernel/time/timer.c:1745 kbuild/src/x86_64/kernel/time/timer.c:1721 kbuild/src/x86_64/kernel/time/timer.c:1758)
[ 70.388616] ? lock_is_held_type (kbuild/src/x86_64/kernel/locking/lockdep.c:437 kbuild/src/x86_64/kernel/locking/lockdep.c:5550)
[ 70.389855] __do_softirq (kbuild/src/x86_64/arch/x86/include/asm/jump_label.h:25 kbuild/src/x86_64/include/linux/jump_label.h:200 kbuild/src/x86_64/include/trace/events/irq.h:142 kbuild/src/x86_64/kernel/softirq.c:346)
[ 70.390965] irq_exit_rcu (kbuild/src/x86_64/kernel/softirq.c:221 kbuild/src/x86_64/kernel/softirq.c:422 kbuild/src/x86_64/kernel/softirq.c:434)
[ 70.392198] sysvec_apic_timer_interrupt (kbuild/src/x86_64/arch/x86/kernel/apic/apic.c:1100 (discriminator 14))
[ 70.393575] </IRQ>
[ 70.394355] asm_sysvec_apic_timer_interrupt (kbuild/src/x86_64/arch/x86/include/asm/idtentry.h:632)
[ 70.395968] RIP: 0010:__memcpy (kbuild/src/x86_64/arch/x86/lib/memcpy_64.S:39)
[ 70.397140] Code: 84 00 00 00 00 00 66 90 48 89 fe 48 c7 c7 7e 0a a2 a7 e9 f1 fe ff ff cc 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4
All code
========
0: 84 00 test %al,(%rax)
2: 00 00 add %al,(%rax)
4: 00 00 add %al,(%rax)
6: 66 90 xchg %ax,%ax
8: 48 89 fe mov %rdi,%rsi
b: 48 c7 c7 7e 0a a2 a7 mov $0xffffffffa7a20a7e,%rdi
12: e9 f1 fe ff ff jmpq 0xffffffffffffff08
17: cc int3
18: 66 66 90 data16 xchg %ax,%ax
1b: 66 90 xchg %ax,%ax
1d: 48 89 f8 mov %rdi,%rax
20: 48 89 d1 mov %rdx,%rcx
23: 48 c1 e9 03 shr $0x3,%rcx
27: 83 e2 07 and $0x7,%edx
2a:* f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi) <-- trapping instruction
2d: 89 d1 mov %edx,%ecx
2f: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
31: c3 retq
32: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
38: 48 89 f8 mov %rdi,%rax
3b: 48 89 d1 mov %rdx,%rcx
3e: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)

Code starting with the faulting instruction
===========================================
0: f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
3: 89 d1 mov %edx,%ecx
5: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
7: c3 retq
8: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
e: 48 89 f8 mov %rdi,%rax
11: 48 89 d1 mov %rdx,%rcx
14: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
[ 70.404945] RSP: 0000:ffffb8e500013ae8 EFLAGS: 00010246
[ 70.406424] RAX: ffff9fc255c00000 RBX: 0000000000001000 RCX: 0000000000000200
[ 70.408416] RDX: 0000000000000000 RSI: ffff9fc2423e4bb0 RDI: ffff9fc255c00000
[ 70.410277] RBP: ffff9fc255c00000 R08: ffff9fc140000000 R09: 0000000000004bb0
[ 70.412300] R10: ffff9fc2402a3180 R11: 0000000000000000 R12: 0000000000001000
[ 70.414194] R13: ffffb8e500013c08 R14: 0000000000001000 R15: ffffe73dc0000000
[ 70.416239] iov_iter_copy_from_user_atomic (kbuild/src/x86_64/lib/iov_iter.c:991 (discriminator 10))
[ 70.417718] generic_perform_write (kbuild/src/x86_64/mm/filemap.c:3586)
[ 70.419241] __generic_file_write_iter (kbuild/src/x86_64/mm/filemap.c:3705)
[ 70.420517] generic_file_write_iter (kbuild/src/x86_64/include/linux/fs.h:780 kbuild/src/x86_64/mm/filemap.c:3737)
[ 70.421643] __kernel_write (kbuild/src/x86_64/fs/read_write.c:550 (discriminator 1))
[ 70.422769] kernel_write (kbuild/src/x86_64/include/linux/fs.h:2903 kbuild/src/x86_64/fs/read_write.c:580 kbuild/src/x86_64/fs/read_write.c:569)
[ 70.423999] ? write_buffer (kbuild/src/x86_64/init/initramfs.c:436)
[ 70.425104] xwrite+0x31/0x62
[ 70.426262] do_copy (kbuild/src/x86_64/init/initramfs.c:394)
[ 70.427333] ? write_buffer (kbuild/src/x86_64/init/initramfs.c:436)
[ 70.428395] write_buffer (kbuild/src/x86_64/init/initramfs.c:430 (discriminator 1))
[ 70.429461] flush_buffer (kbuild/src/x86_64/init/initramfs.c:442)
[ 70.430523] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
[ 70.431700] __gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:161)
[ 70.432696] ? bunzip2 (kbuild/src/x86_64/lib/decompress_inflate.c:39)
[ 70.433760] ? __gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:207)
[ 70.434841] gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:207)
[ 70.435927] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
[ 70.437032] unpack_to_rootfs (kbuild/src/x86_64/init/initramfs.c:500)
[ 70.438284] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
[ 70.439507] ? reserve_initrd_mem (kbuild/src/x86_64/init/initramfs.c:662)
[ 70.440794] populate_rootfs (kbuild/src/x86_64/init/initramfs.c:676)
[ 70.441946] ? reserve_initrd_mem (kbuild/src/x86_64/init/initramfs.c:662)
[ 70.443386] do_one_initcall (kbuild/src/x86_64/init/main.c:1226)
[ 70.444562] ? rcu_read_lock_sched_held (kbuild/src/x86_64/include/linux/lockdep.h:278 kbuild/src/x86_64/kernel/rcu/update.c:125)
[ 70.445893] kernel_init_freeable (kbuild/src/x86_64/init/main.c:1298 kbuild/src/x86_64/init/main.c:1315 kbuild/src/x86_64/init/main.c:1335 kbuild/src/x86_64/init/main.c:1537)
[ 70.447344] ? rest_init (kbuild/src/x86_64/init/main.c:1421)
[ 70.448383] kernel_init (kbuild/src/x86_64/init/main.c:1426)
[ 70.449219] ret_from_fork (kbuild/src/x86_64/arch/x86/entry/entry_64.S:300)
[ 70.450175] irq event stamp: 2300500
[ 70.451358] hardirqs last enabled at (2300510): console_unlock (kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:45 (discriminator 1) kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:80 (discriminator 1) kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:145 (discriminator 1) kbuild/src/x86_64/kernel/printk/printk.c:2605 (discriminator 1))
[ 70.453717] hardirqs last disabled at (2300519): console_unlock (kbuild/src/x86_64/kernel/printk/printk.c:2520 (discriminator 1))
[ 70.456262] softirqs last enabled at (2299674): __do_softirq (kbuild/src/x86_64/arch/x86/include/asm/preempt.h:27 kbuild/src/x86_64/kernel/softirq.c:373)
[ 70.458585] softirqs last disabled at (2299819): irq_exit_rcu (kbuild/src/x86_64/kernel/softirq.c:221 kbuild/src/x86_64/kernel/softirq.c:422 kbuild/src/x86_64/kernel/softirq.c:434)
[ 70.461343] ---[ end trace 5049069f8395a579 ]---
[ 199.544368] Freeing initrd memory: 592780K
[ 199.545765] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 199.547432] software IO TLB: mapped [mem 0x00000000bbfe0000-0x00000000bffe0000] (64MB)
[ 199.550168] kvm: no hardware support
[ 199.551256] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
[ 199.563791] Initialise system trusted keyrings
[ 199.565211] Key type blacklist registered
[ 199.566840] workingset: timestamp_bits=36 max_order=22 bucket_order=0
[ 199.586726] zbud: loaded
[ 199.605246] NET: Registered protocol family 38
[ 199.606639] Key type asymmetric registered
[ 199.607761] Asymmetric key parser 'x509' registered
[ 199.609141] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
[ 199.611604] io scheduler mq-deadline registered
[ 199.612900] io scheduler kyber registered
[ 199.614552] io scheduler bfq registered
[ 199.616201] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[ 199.618346] start plist test
[ 199.621225] end plist test


To reproduce:

# build kernel
cd linux
cp config-5.12.0-rc3-00031-g8c30ace35da3 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (11.94 kB)
config-5.12.0-rc3-00031-g8c30ace35da3 (176.27 kB)
job-script (4.82 kB)
dmesg.xz (14.99 kB)
Download all attachments

2021-04-27 08:46:40

by Feng Tang

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Tue, Apr 27, 2021 at 03:27:02PM +0800, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 8c30ace35da3b362089f5c903144d762a065b58a ("[PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift")
> url: https://github.com/0day-ci/linux/commits/Paul-E-McKenney/Do-not-mark-clocks-unstable-due-to-delays-for-v5-13/20210426-064834
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2d036dfa5f10df9782f5278fc591d79d283c1fad
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> [ 70.350923] WARNING: CPU: 0 PID: 1 at kernel/time/clocksource.c:435 clocksource_watchdog (kbuild/src/x86_64/kernel/time/clocksource.c:435)
> [ 70.353152] Modules linked in:
> [ 70.354074] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc3-00031-g8c30ace35da3 #2
> [ 70.356180] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [ 70.358471] RIP: 0010:clocksource_watchdog (kbuild/src/x86_64/kernel/time/clocksource.c:435)

Some log extraced from the attached dmesg.xz:

[ 8.376387] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
[ 8.890982] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[ 9.132146] clocksource: Switched to clocksource kvm-clock
[ 10.324011] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns

[ 10.397945] Trying to unpack rootfs image as initramfs...
[ 70.350923] WARNING: CPU: 0 PID: 1 at kernel/time/clocksource.c:435 clocksource_watchdog+0x3b8/0x4c0
[ 199.544368] Freeing initrd memory: 592780K

[ 199.551256] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns

Seems the initramfs took too long time, and exceeds the 60
seconds of WATCHDOG_SYNC_FORGIVENESS time, which triggers
the warning.

Also I asked Oliver about the reproduce rate, and he said the
warning was seen only once for all around 100 boot tests. From
other good boot logs, the initramfs unpacking usually only takes
about 20 seconds.

Thanks,
Feng

> [ 70.382655] <IRQ>
> [ 70.386145] call_timer_fn (kbuild/src/x86_64/arch/x86/include/asm/jump_label.h:25 kbuild/src/x86_64/include/linux/jump_label.h:200 kbuild/src/x86_64/include/trace/events/timer.h:125 kbuild/src/x86_64/kernel/time/timer.c:1432)
> [ 70.387393] run_timer_softirq (kbuild/src/x86_64/kernel/time/timer.c:1477 kbuild/src/x86_64/kernel/time/timer.c:1745 kbuild/src/x86_64/kernel/time/timer.c:1721 kbuild/src/x86_64/kernel/time/timer.c:1758)
> [ 70.389855] __do_softirq (kbuild/src/x86_64/arch/x86/include/asm/jump_label.h:25 kbuild/src/x86_64/include/linux/jump_label.h:200 kbuild/src/x86_64/include/trace/events/irq.h:142 kbuild/src/x86_64/kernel/softirq.c:346)
> [ 70.390965] irq_exit_rcu (kbuild/src/x86_64/kernel/softirq.c:221 kbuild/src/x86_64/kernel/softirq.c:422 kbuild/src/x86_64/kernel/softirq.c:434)
> [ 70.392198] sysvec_apic_timer_interrupt (kbuild/src/x86_64/arch/x86/kernel/apic/apic.c:1100 (discriminator 14))
> [ 70.393575] </IRQ>
> [ 70.394355] asm_sysvec_apic_timer_interrupt (kbuild/src/x86_64/arch/x86/include/asm/idtentry.h:632)
> [ 70.395968] RIP: 0010:__memcpy (kbuild/src/x86_64/arch/x86/lib/memcpy_64.S:39)
> [ 70.397140] Code: 84 00 00 00 00 00 66 90 48 89 fe 48 c7 c7 7e 0a a2 a7 e9 f1 fe ff ff cc 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4
> [ 70.404945] RSP: 0000:ffffb8e500013ae8 EFLAGS: 00010246
> [ 70.406424] RAX: ffff9fc255c00000 RBX: 0000000000001000 RCX: 0000000000000200
> [ 70.408416] RDX: 0000000000000000 RSI: ffff9fc2423e4bb0 RDI: ffff9fc255c00000
> [ 70.410277] RBP: ffff9fc255c00000 R08: ffff9fc140000000 R09: 0000000000004bb0
> [ 70.412300] R10: ffff9fc2402a3180 R11: 0000000000000000 R12: 0000000000001000
> [ 70.414194] R13: ffffb8e500013c08 R14: 0000000000001000 R15: ffffe73dc0000000
> [ 70.416239] iov_iter_copy_from_user_atomic (kbuild/src/x86_64/lib/iov_iter.c:991 (discriminator 10))
> [ 70.417718] generic_perform_write (kbuild/src/x86_64/mm/filemap.c:3586)
> [ 70.419241] __generic_file_write_iter (kbuild/src/x86_64/mm/filemap.c:3705)
> [ 70.420517] generic_file_write_iter (kbuild/src/x86_64/include/linux/fs.h:780 kbuild/src/x86_64/mm/filemap.c:3737)
> [ 70.421643] __kernel_write (kbuild/src/x86_64/fs/read_write.c:550 (discriminator 1))
> [ 70.422769] kernel_write (kbuild/src/x86_64/include/linux/fs.h:2903 kbuild/src/x86_64/fs/read_write.c:580 kbuild/src/x86_64/fs/read_write.c:569)
> [ 70.423999] ? write_buffer (kbuild/src/x86_64/init/initramfs.c:436)
> [ 70.425104] xwrite+0x31/0x62
> [ 70.426262] do_copy (kbuild/src/x86_64/init/initramfs.c:394)
> [ 70.427333] ? write_buffer (kbuild/src/x86_64/init/initramfs.c:436)
> [ 70.428395] write_buffer (kbuild/src/x86_64/init/initramfs.c:430 (discriminator 1))
> [ 70.429461] flush_buffer (kbuild/src/x86_64/init/initramfs.c:442)
> [ 70.430523] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> [ 70.431700] __gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:161)
> [ 70.432696] ? bunzip2 (kbuild/src/x86_64/lib/decompress_inflate.c:39)
> [ 70.433760] ? __gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:207)
> [ 70.434841] gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:207)
> [ 70.435927] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> [ 70.437032] unpack_to_rootfs (kbuild/src/x86_64/init/initramfs.c:500)
> [ 70.438284] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> [ 70.439507] ? reserve_initrd_mem (kbuild/src/x86_64/init/initramfs.c:662)
> [ 70.440794] populate_rootfs (kbuild/src/x86_64/init/initramfs.c:676)
> [ 70.441946] ? reserve_initrd_mem (kbuild/src/x86_64/init/initramfs.c:662)
> [ 70.443386] do_one_initcall (kbuild/src/x86_64/init/main.c:1226)
> [ 70.444562] ? rcu_read_lock_sched_held (kbuild/src/x86_64/include/linux/lockdep.h:278 kbuild/src/x86_64/kernel/rcu/update.c:125)
> [ 70.445893] kernel_init_freeable (kbuild/src/x86_64/init/main.c:1298 kbuild/src/x86_64/init/main.c:1315 kbuild/src/x86_64/init/main.c:1335 kbuild/src/x86_64/init/main.c:1537)
> [ 70.447344] ? rest_init (kbuild/src/x86_64/init/main.c:1421)
> [ 70.448383] kernel_init (kbuild/src/x86_64/init/main.c:1426)
> [ 70.449219] ret_from_fork (kbuild/src/x86_64/arch/x86/entry/entry_64.S:300)
> [ 70.450175] irq event stamp: 2300500
> [ 70.451358] hardirqs last enabled at (2300510): console_unlock (kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:45 (discriminator 1) kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:80 (discriminator 1) kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:145 (discriminator 1) kbuild/src/x86_64/kernel/printk/printk.c:2605 (discriminator 1))
> [ 70.453717] hardirqs last disabled at (2300519): console_unlock (kbuild/src/x86_64/kernel/printk/printk.c:2520 (discriminator 1))
> [ 70.456262] softirqs last enabled at (2299674): __do_softirq (kbuild/src/x86_64/arch/x86/include/asm/preempt.h:27 kbuild/src/x86_64/kernel/softirq.c:373)
> [ 70.458585] softirqs last disabled at (2299819): irq_exit_rcu (kbuild/src/x86_64/kernel/softirq.c:221 kbuild/src/x86_64/kernel/softirq.c:422 kbuild/src/x86_64/kernel/softirq.c:434)
> [ 70.461343] ---[ end trace 5049069f8395a579 ]---
> [ 199.544368] Freeing initrd memory: 592780K
> [ 199.545765] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> [ 199.547432] software IO TLB: mapped [mem 0x00000000bbfe0000-0x00000000bffe0000] (64MB)
> [ 199.550168] kvm: no hardware support
> [ 199.551256] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
> [ 199.563791] Initialise system trusted keyrings
> [ 199.565211] Key type blacklist registered

2021-04-27 13:38:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Tue, Apr 27, 2021 at 04:45:22PM +0800, Feng Tang wrote:
> On Tue, Apr 27, 2021 at 03:27:02PM +0800, kernel test robot wrote:
> >
> >
> > Greeting,
> >
> > FYI, we noticed the following commit (built with gcc-9):
> >
> > commit: 8c30ace35da3b362089f5c903144d762a065b58a ("[PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift")
> > url: https://github.com/0day-ci/linux/commits/Paul-E-McKenney/Do-not-mark-clocks-unstable-due-to-delays-for-v5-13/20210426-064834
> > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2d036dfa5f10df9782f5278fc591d79d283c1fad
> >
> > in testcase: boot
> >
> > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> >
> >
> >
> > If you fix the issue, kindly add following tag
> > Reported-by: kernel test robot <[email protected]>
> >
> >
> > [ 70.350923] WARNING: CPU: 0 PID: 1 at kernel/time/clocksource.c:435 clocksource_watchdog (kbuild/src/x86_64/kernel/time/clocksource.c:435)
> > [ 70.353152] Modules linked in:
> > [ 70.354074] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc3-00031-g8c30ace35da3 #2
> > [ 70.356180] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > [ 70.358471] RIP: 0010:clocksource_watchdog (kbuild/src/x86_64/kernel/time/clocksource.c:435)
>
> Some log extraced from the attached dmesg.xz:
>
> [ 8.376387] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
> [ 8.890982] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [ 9.132146] clocksource: Switched to clocksource kvm-clock
> [ 10.324011] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
>
> [ 10.397945] Trying to unpack rootfs image as initramfs...
> [ 70.350923] WARNING: CPU: 0 PID: 1 at kernel/time/clocksource.c:435 clocksource_watchdog+0x3b8/0x4c0
> [ 199.544368] Freeing initrd memory: 592780K
>
> [ 199.551256] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
>
> Seems the initramfs took too long time, and exceeds the 60
> seconds of WATCHDOG_SYNC_FORGIVENESS time, which triggers
> the warning.
>
> Also I asked Oliver about the reproduce rate, and he said the
> warning was seen only once for all around 100 boot tests. From
> other good boot logs, the initramfs unpacking usually only takes
> about 20 seconds.

Hmmm... 70 seconds. Is this a legitimate diagnostic of initramfs?
If not, is there some way to detect that initramfs took longer than
expected?

I suppose that I give it (say) 120 seconds instead of the current 60,
which might be the right thing to do, but it does feel like papering
over a very real initramfs problem. Alternatively, I could provide a
boot parameter allowing those with slow systems to adjust as needed.

Thanx, Paul

> Thanks,
> Feng
>
> > [ 70.382655] <IRQ>
> > [ 70.386145] call_timer_fn (kbuild/src/x86_64/arch/x86/include/asm/jump_label.h:25 kbuild/src/x86_64/include/linux/jump_label.h:200 kbuild/src/x86_64/include/trace/events/timer.h:125 kbuild/src/x86_64/kernel/time/timer.c:1432)
> > [ 70.387393] run_timer_softirq (kbuild/src/x86_64/kernel/time/timer.c:1477 kbuild/src/x86_64/kernel/time/timer.c:1745 kbuild/src/x86_64/kernel/time/timer.c:1721 kbuild/src/x86_64/kernel/time/timer.c:1758)
> > [ 70.389855] __do_softirq (kbuild/src/x86_64/arch/x86/include/asm/jump_label.h:25 kbuild/src/x86_64/include/linux/jump_label.h:200 kbuild/src/x86_64/include/trace/events/irq.h:142 kbuild/src/x86_64/kernel/softirq.c:346)
> > [ 70.390965] irq_exit_rcu (kbuild/src/x86_64/kernel/softirq.c:221 kbuild/src/x86_64/kernel/softirq.c:422 kbuild/src/x86_64/kernel/softirq.c:434)
> > [ 70.392198] sysvec_apic_timer_interrupt (kbuild/src/x86_64/arch/x86/kernel/apic/apic.c:1100 (discriminator 14))
> > [ 70.393575] </IRQ>
> > [ 70.394355] asm_sysvec_apic_timer_interrupt (kbuild/src/x86_64/arch/x86/include/asm/idtentry.h:632)
> > [ 70.395968] RIP: 0010:__memcpy (kbuild/src/x86_64/arch/x86/lib/memcpy_64.S:39)
> > [ 70.397140] Code: 84 00 00 00 00 00 66 90 48 89 fe 48 c7 c7 7e 0a a2 a7 e9 f1 fe ff ff cc 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4
> > [ 70.404945] RSP: 0000:ffffb8e500013ae8 EFLAGS: 00010246
> > [ 70.406424] RAX: ffff9fc255c00000 RBX: 0000000000001000 RCX: 0000000000000200
> > [ 70.408416] RDX: 0000000000000000 RSI: ffff9fc2423e4bb0 RDI: ffff9fc255c00000
> > [ 70.410277] RBP: ffff9fc255c00000 R08: ffff9fc140000000 R09: 0000000000004bb0
> > [ 70.412300] R10: ffff9fc2402a3180 R11: 0000000000000000 R12: 0000000000001000
> > [ 70.414194] R13: ffffb8e500013c08 R14: 0000000000001000 R15: ffffe73dc0000000
> > [ 70.416239] iov_iter_copy_from_user_atomic (kbuild/src/x86_64/lib/iov_iter.c:991 (discriminator 10))
> > [ 70.417718] generic_perform_write (kbuild/src/x86_64/mm/filemap.c:3586)
> > [ 70.419241] __generic_file_write_iter (kbuild/src/x86_64/mm/filemap.c:3705)
> > [ 70.420517] generic_file_write_iter (kbuild/src/x86_64/include/linux/fs.h:780 kbuild/src/x86_64/mm/filemap.c:3737)
> > [ 70.421643] __kernel_write (kbuild/src/x86_64/fs/read_write.c:550 (discriminator 1))
> > [ 70.422769] kernel_write (kbuild/src/x86_64/include/linux/fs.h:2903 kbuild/src/x86_64/fs/read_write.c:580 kbuild/src/x86_64/fs/read_write.c:569)
> > [ 70.423999] ? write_buffer (kbuild/src/x86_64/init/initramfs.c:436)
> > [ 70.425104] xwrite+0x31/0x62
> > [ 70.426262] do_copy (kbuild/src/x86_64/init/initramfs.c:394)
> > [ 70.427333] ? write_buffer (kbuild/src/x86_64/init/initramfs.c:436)
> > [ 70.428395] write_buffer (kbuild/src/x86_64/init/initramfs.c:430 (discriminator 1))
> > [ 70.429461] flush_buffer (kbuild/src/x86_64/init/initramfs.c:442)
> > [ 70.430523] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> > [ 70.431700] __gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:161)
> > [ 70.432696] ? bunzip2 (kbuild/src/x86_64/lib/decompress_inflate.c:39)
> > [ 70.433760] ? __gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:207)
> > [ 70.434841] gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:207)
> > [ 70.435927] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> > [ 70.437032] unpack_to_rootfs (kbuild/src/x86_64/init/initramfs.c:500)
> > [ 70.438284] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> > [ 70.439507] ? reserve_initrd_mem (kbuild/src/x86_64/init/initramfs.c:662)
> > [ 70.440794] populate_rootfs (kbuild/src/x86_64/init/initramfs.c:676)
> > [ 70.441946] ? reserve_initrd_mem (kbuild/src/x86_64/init/initramfs.c:662)
> > [ 70.443386] do_one_initcall (kbuild/src/x86_64/init/main.c:1226)
> > [ 70.444562] ? rcu_read_lock_sched_held (kbuild/src/x86_64/include/linux/lockdep.h:278 kbuild/src/x86_64/kernel/rcu/update.c:125)
> > [ 70.445893] kernel_init_freeable (kbuild/src/x86_64/init/main.c:1298 kbuild/src/x86_64/init/main.c:1315 kbuild/src/x86_64/init/main.c:1335 kbuild/src/x86_64/init/main.c:1537)
> > [ 70.447344] ? rest_init (kbuild/src/x86_64/init/main.c:1421)
> > [ 70.448383] kernel_init (kbuild/src/x86_64/init/main.c:1426)
> > [ 70.449219] ret_from_fork (kbuild/src/x86_64/arch/x86/entry/entry_64.S:300)
> > [ 70.450175] irq event stamp: 2300500
> > [ 70.451358] hardirqs last enabled at (2300510): console_unlock (kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:45 (discriminator 1) kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:80 (discriminator 1) kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:145 (discriminator 1) kbuild/src/x86_64/kernel/printk/printk.c:2605 (discriminator 1))
> > [ 70.453717] hardirqs last disabled at (2300519): console_unlock (kbuild/src/x86_64/kernel/printk/printk.c:2520 (discriminator 1))
> > [ 70.456262] softirqs last enabled at (2299674): __do_softirq (kbuild/src/x86_64/arch/x86/include/asm/preempt.h:27 kbuild/src/x86_64/kernel/softirq.c:373)
> > [ 70.458585] softirqs last disabled at (2299819): irq_exit_rcu (kbuild/src/x86_64/kernel/softirq.c:221 kbuild/src/x86_64/kernel/softirq.c:422 kbuild/src/x86_64/kernel/softirq.c:434)
> > [ 70.461343] ---[ end trace 5049069f8395a579 ]---
> > [ 199.544368] Freeing initrd memory: 592780K
> > [ 199.545765] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> > [ 199.547432] software IO TLB: mapped [mem 0x00000000bbfe0000-0x00000000bffe0000] (64MB)
> > [ 199.550168] kvm: no hardware support
> > [ 199.551256] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
> > [ 199.563791] Initialise system trusted keyrings
> > [ 199.565211] Key type blacklist registered

2021-04-27 17:52:31

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Tue, Apr 27, 2021 at 06:37:46AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 27, 2021 at 04:45:22PM +0800, Feng Tang wrote:
> > On Tue, Apr 27, 2021 at 03:27:02PM +0800, kernel test robot wrote:
> > >
> > >
> > > Greeting,
> > >
> > > FYI, we noticed the following commit (built with gcc-9):
> > >
> > > commit: 8c30ace35da3b362089f5c903144d762a065b58a ("[PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift")
> > > url: https://github.com/0day-ci/linux/commits/Paul-E-McKenney/Do-not-mark-clocks-unstable-due-to-delays-for-v5-13/20210426-064834
> > > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2d036dfa5f10df9782f5278fc591d79d283c1fad
> > >
> > > in testcase: boot
> > >
> > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > >
> > > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> > >
> > >
> > >
> > > If you fix the issue, kindly add following tag
> > > Reported-by: kernel test robot <[email protected]>
> > >
> > >
> > > [ 70.350923] WARNING: CPU: 0 PID: 1 at kernel/time/clocksource.c:435 clocksource_watchdog (kbuild/src/x86_64/kernel/time/clocksource.c:435)
> > > [ 70.353152] Modules linked in:
> > > [ 70.354074] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc3-00031-g8c30ace35da3 #2
> > > [ 70.356180] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > [ 70.358471] RIP: 0010:clocksource_watchdog (kbuild/src/x86_64/kernel/time/clocksource.c:435)
> >
> > Some log extraced from the attached dmesg.xz:
> >
> > [ 8.376387] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
> > [ 8.890982] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [ 9.132146] clocksource: Switched to clocksource kvm-clock
> > [ 10.324011] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> >
> > [ 10.397945] Trying to unpack rootfs image as initramfs...
> > [ 70.350923] WARNING: CPU: 0 PID: 1 at kernel/time/clocksource.c:435 clocksource_watchdog+0x3b8/0x4c0
> > [ 199.544368] Freeing initrd memory: 592780K
> >
> > [ 199.551256] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
> >
> > Seems the initramfs took too long time, and exceeds the 60
> > seconds of WATCHDOG_SYNC_FORGIVENESS time, which triggers
> > the warning.
> >
> > Also I asked Oliver about the reproduce rate, and he said the
> > warning was seen only once for all around 100 boot tests. From
> > other good boot logs, the initramfs unpacking usually only takes
> > about 20 seconds.
>
> Hmmm... 70 seconds. Is this a legitimate diagnostic of initramfs?
> If not, is there some way to detect that initramfs took longer than
> expected?
>
> I suppose that I give it (say) 120 seconds instead of the current 60,
> which might be the right thing to do, but it does feel like papering
> over a very real initramfs problem. Alternatively, I could provide a
> boot parameter allowing those with slow systems to adjust as needed.

OK, it turns out that there are systems for which boot times in excess
of one minute are expected behavior. They are a bit rare, though.
So what I will do is keep the 60-second default, add a boot parameter,
and also add a comment by the warning pointing out the boot parameter.

Thanx, Paul

> > Thanks,
> > Feng
> >
> > > [ 70.382655] <IRQ>
> > > [ 70.386145] call_timer_fn (kbuild/src/x86_64/arch/x86/include/asm/jump_label.h:25 kbuild/src/x86_64/include/linux/jump_label.h:200 kbuild/src/x86_64/include/trace/events/timer.h:125 kbuild/src/x86_64/kernel/time/timer.c:1432)
> > > [ 70.387393] run_timer_softirq (kbuild/src/x86_64/kernel/time/timer.c:1477 kbuild/src/x86_64/kernel/time/timer.c:1745 kbuild/src/x86_64/kernel/time/timer.c:1721 kbuild/src/x86_64/kernel/time/timer.c:1758)
> > > [ 70.389855] __do_softirq (kbuild/src/x86_64/arch/x86/include/asm/jump_label.h:25 kbuild/src/x86_64/include/linux/jump_label.h:200 kbuild/src/x86_64/include/trace/events/irq.h:142 kbuild/src/x86_64/kernel/softirq.c:346)
> > > [ 70.390965] irq_exit_rcu (kbuild/src/x86_64/kernel/softirq.c:221 kbuild/src/x86_64/kernel/softirq.c:422 kbuild/src/x86_64/kernel/softirq.c:434)
> > > [ 70.392198] sysvec_apic_timer_interrupt (kbuild/src/x86_64/arch/x86/kernel/apic/apic.c:1100 (discriminator 14))
> > > [ 70.393575] </IRQ>
> > > [ 70.394355] asm_sysvec_apic_timer_interrupt (kbuild/src/x86_64/arch/x86/include/asm/idtentry.h:632)
> > > [ 70.395968] RIP: 0010:__memcpy (kbuild/src/x86_64/arch/x86/lib/memcpy_64.S:39)
> > > [ 70.397140] Code: 84 00 00 00 00 00 66 90 48 89 fe 48 c7 c7 7e 0a a2 a7 e9 f1 fe ff ff cc 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4
> > > [ 70.404945] RSP: 0000:ffffb8e500013ae8 EFLAGS: 00010246
> > > [ 70.406424] RAX: ffff9fc255c00000 RBX: 0000000000001000 RCX: 0000000000000200
> > > [ 70.408416] RDX: 0000000000000000 RSI: ffff9fc2423e4bb0 RDI: ffff9fc255c00000
> > > [ 70.410277] RBP: ffff9fc255c00000 R08: ffff9fc140000000 R09: 0000000000004bb0
> > > [ 70.412300] R10: ffff9fc2402a3180 R11: 0000000000000000 R12: 0000000000001000
> > > [ 70.414194] R13: ffffb8e500013c08 R14: 0000000000001000 R15: ffffe73dc0000000
> > > [ 70.416239] iov_iter_copy_from_user_atomic (kbuild/src/x86_64/lib/iov_iter.c:991 (discriminator 10))
> > > [ 70.417718] generic_perform_write (kbuild/src/x86_64/mm/filemap.c:3586)
> > > [ 70.419241] __generic_file_write_iter (kbuild/src/x86_64/mm/filemap.c:3705)
> > > [ 70.420517] generic_file_write_iter (kbuild/src/x86_64/include/linux/fs.h:780 kbuild/src/x86_64/mm/filemap.c:3737)
> > > [ 70.421643] __kernel_write (kbuild/src/x86_64/fs/read_write.c:550 (discriminator 1))
> > > [ 70.422769] kernel_write (kbuild/src/x86_64/include/linux/fs.h:2903 kbuild/src/x86_64/fs/read_write.c:580 kbuild/src/x86_64/fs/read_write.c:569)
> > > [ 70.423999] ? write_buffer (kbuild/src/x86_64/init/initramfs.c:436)
> > > [ 70.425104] xwrite+0x31/0x62
> > > [ 70.426262] do_copy (kbuild/src/x86_64/init/initramfs.c:394)
> > > [ 70.427333] ? write_buffer (kbuild/src/x86_64/init/initramfs.c:436)
> > > [ 70.428395] write_buffer (kbuild/src/x86_64/init/initramfs.c:430 (discriminator 1))
> > > [ 70.429461] flush_buffer (kbuild/src/x86_64/init/initramfs.c:442)
> > > [ 70.430523] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> > > [ 70.431700] __gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:161)
> > > [ 70.432696] ? bunzip2 (kbuild/src/x86_64/lib/decompress_inflate.c:39)
> > > [ 70.433760] ? __gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:207)
> > > [ 70.434841] gunzip (kbuild/src/x86_64/lib/decompress_inflate.c:207)
> > > [ 70.435927] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> > > [ 70.437032] unpack_to_rootfs (kbuild/src/x86_64/init/initramfs.c:500)
> > > [ 70.438284] ? initrd_load (kbuild/src/x86_64/init/initramfs.c:44)
> > > [ 70.439507] ? reserve_initrd_mem (kbuild/src/x86_64/init/initramfs.c:662)
> > > [ 70.440794] populate_rootfs (kbuild/src/x86_64/init/initramfs.c:676)
> > > [ 70.441946] ? reserve_initrd_mem (kbuild/src/x86_64/init/initramfs.c:662)
> > > [ 70.443386] do_one_initcall (kbuild/src/x86_64/init/main.c:1226)
> > > [ 70.444562] ? rcu_read_lock_sched_held (kbuild/src/x86_64/include/linux/lockdep.h:278 kbuild/src/x86_64/kernel/rcu/update.c:125)
> > > [ 70.445893] kernel_init_freeable (kbuild/src/x86_64/init/main.c:1298 kbuild/src/x86_64/init/main.c:1315 kbuild/src/x86_64/init/main.c:1335 kbuild/src/x86_64/init/main.c:1537)
> > > [ 70.447344] ? rest_init (kbuild/src/x86_64/init/main.c:1421)
> > > [ 70.448383] kernel_init (kbuild/src/x86_64/init/main.c:1426)
> > > [ 70.449219] ret_from_fork (kbuild/src/x86_64/arch/x86/entry/entry_64.S:300)
> > > [ 70.450175] irq event stamp: 2300500
> > > [ 70.451358] hardirqs last enabled at (2300510): console_unlock (kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:45 (discriminator 1) kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:80 (discriminator 1) kbuild/src/x86_64/arch/x86/include/asm/irqflags.h:145 (discriminator 1) kbuild/src/x86_64/kernel/printk/printk.c:2605 (discriminator 1))
> > > [ 70.453717] hardirqs last disabled at (2300519): console_unlock (kbuild/src/x86_64/kernel/printk/printk.c:2520 (discriminator 1))
> > > [ 70.456262] softirqs last enabled at (2299674): __do_softirq (kbuild/src/x86_64/arch/x86/include/asm/preempt.h:27 kbuild/src/x86_64/kernel/softirq.c:373)
> > > [ 70.458585] softirqs last disabled at (2299819): irq_exit_rcu (kbuild/src/x86_64/kernel/softirq.c:221 kbuild/src/x86_64/kernel/softirq.c:422 kbuild/src/x86_64/kernel/softirq.c:434)
> > > [ 70.461343] ---[ end trace 5049069f8395a579 ]---
> > > [ 199.544368] Freeing initrd memory: 592780K
> > > [ 199.545765] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> > > [ 199.547432] software IO TLB: mapped [mem 0x00000000bbfe0000-0x00000000bffe0000] (64MB)
> > > [ 199.550168] kvm: no hardware support
> > > [ 199.551256] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x26d347b2dd9, max_idle_ns: 440795223616 ns
> > > [ 199.563791] Initialise system trusted keyrings
> > > [ 199.565211] Key type blacklist registered

2021-04-27 21:05:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift

On Mon, Apr 26 2021 at 23:01, Feng Tang wrote:
> On Sun, Apr 25, 2021 at 03:47:07PM -0700, Paul E. McKenney wrote:
> We've reported one case that tsc can be wrongly judged as 'unstable'
> by 'refined-jiffies' watchdog [1], while reducing the threshold could
> make it easier to be triggered.
>
> It could be reproduced on the a plaform with a 115200 serial console,
> and hpet been disabled (several x86 platforms has this), add
> 'initcall_debug' cmdline parameter to get more debug message, we can
> see:
>
> [ 1.134197] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> [ 1.134214] clocksource: 'refined-jiffies' wd_nesc: 500000000 wd_now: ffff8b35 wd_last: ffff8b03 mask: ffffffff

refined-jiffies is the worst of all watchdogs and this obviously cannot
be fixed at all simply because we can lose ticks in that mode. And no,
we cannot compensate for lost ticks via TSC which we in turn "monitor"
via ticks.

Even if we hack around it and make it "work" then the TSC will never
become fully trusted because refined-jiffies cannot support NOHZ/HIGHRES
mode for obvious reasons either. So the system stays in periodic mode
forever.

If there is no independent timer to validate against then the TSC is
better stable and the BIOS/SMM code has to be trusted not to wreckage
TSC. There are no other options.

So TBH, I do not care about this case at all. It's pointless to even
think about it. Either the TSC works on these systems or it doesn't. If
it doesn't, then you have to keep the pieces.

I'm so dead tired of this especially since this is known forever. But
it's obviously better to waste our time than to fix the damned hardware.

Thanks,

tglx

2021-04-27 21:11:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

Paul,

On Tue, Apr 27 2021 at 10:50, Paul E. McKenney wrote:
> On Tue, Apr 27, 2021 at 06:37:46AM -0700, Paul E. McKenney wrote:
>> I suppose that I give it (say) 120 seconds instead of the current 60,
>> which might be the right thing to do, but it does feel like papering
>> over a very real initramfs problem. Alternatively, I could provide a
>> boot parameter allowing those with slow systems to adjust as needed.
>
> OK, it turns out that there are systems for which boot times in excess
> of one minute are expected behavior. They are a bit rare, though.
> So what I will do is keep the 60-second default, add a boot parameter,
> and also add a comment by the warning pointing out the boot parameter.

Oh, no. This starts to become yet another duct tape horror show.

I'm not at all against a more robust and resilent watchdog mechanism,
but having a dozen knobs to tune and heuristics which are doomed to fail
is not a solution at all.

Thanks,

tglx

2021-04-28 01:50:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Tue, Apr 27, 2021 at 11:09:49PM +0200, Thomas Gleixner wrote:
> Paul,
>
> On Tue, Apr 27 2021 at 10:50, Paul E. McKenney wrote:
> > On Tue, Apr 27, 2021 at 06:37:46AM -0700, Paul E. McKenney wrote:
> >> I suppose that I give it (say) 120 seconds instead of the current 60,
> >> which might be the right thing to do, but it does feel like papering
> >> over a very real initramfs problem. Alternatively, I could provide a
> >> boot parameter allowing those with slow systems to adjust as needed.
> >
> > OK, it turns out that there are systems for which boot times in excess
> > of one minute are expected behavior. They are a bit rare, though.
> > So what I will do is keep the 60-second default, add a boot parameter,
> > and also add a comment by the warning pointing out the boot parameter.
>
> Oh, no. This starts to become yet another duct tape horror show.
>
> I'm not at all against a more robust and resilent watchdog mechanism,
> but having a dozen knobs to tune and heuristics which are doomed to fail
> is not a solution at all.

One problem is that I did the .max_drift patch backwards. I tightened
the skew requirements on all clocks except those specially marked, and
I should have done the reverse. With that change, all of the clocks
except for clocksource_tsc would work (or as the case might be, fail to
work) in exactly the same way that they do today, but still rejecting
false-positive skew events due to NMIs, SMIs, vCPU preemption, and so on.

Then patch v10 7/7 can go away completely, and patch 6/7 becomes much
smaller (and gets renamed), for example, as shown below.

Does that help?

Thanx, Paul

------------------------------------------------------------------------

commit ba1fca950a4bcd8a5737efc552f937529496b5fc
Author: Paul E. McKenney <[email protected]>
Date: Tue Apr 27 18:43:37 2021 -0700

clocksource: Reduce clocksource-skew threshold for TSC

Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed
by more than 12.5% in order to be marked unstable. Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left. And given that there are now checks for false-positive
skews due to delays between reading the two clocks, it should be possible
to greatly decrease WATCHDOG_THRESHOLD, at least for fine-grained clocks
such as TSC.

Therefore, decrease WATCHDOG_THRESHOLD from the current 62.5 milliseconds
down to 200 microseconds, but only for clocksource_tsc through use of
a new max_drift field in struct clocksource. Coarse-grained clocks
such as refined-jiffies retain their old skew checks, courtesy of the
default-zero initialization of the max_drift field.

Suggested-by: Thomas Gleixner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Xing Zhengjun <[email protected]>
Cc: Feng Tang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 56289170753c..c281575ed5c2 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1147,6 +1147,7 @@ static struct clocksource clocksource_tsc_early = {
static struct clocksource clocksource_tsc = {
.name = "tsc",
.rating = 300,
+ .max_drift = 200 * NSEC_PER_USEC,
.read = read_tsc,
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 83a3ebff7456..44b567fbf435 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -42,6 +42,8 @@ struct module;
* @shift: Cycle to nanosecond divisor (power of two)
* @max_idle_ns: Maximum idle time permitted by the clocksource (nsecs)
* @maxadj: Maximum adjustment value to mult (~11%)
+ * @max_drift: Maximum drift rate in nanoseconds per half second.
+ * Zero says to use default WATCHDOG_THRESHOLD.
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
@@ -93,6 +95,7 @@ struct clocksource {
u32 shift;
u64 max_idle_ns;
u32 maxadj;
+ u32 max_drift;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
struct arch_clocksource_data archdata;
#endif
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index f71f375df544..e33955c322cf 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -377,6 +377,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int next_cpu, reset_pending;
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
+ u32 md;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -423,7 +424,8 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;

/* Check the deviation from the watchdog clocksource. */
- if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
+ md = cs->max_drift ?: WATCHDOG_THRESHOLD;
+ if (abs(cs_nsec - wd_nsec) > md) {
pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
smp_processor_id(), cs->name);
pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",

2021-04-28 04:50:41

by Luming Yu

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

We 'd expect to see clock_source watchdog can avoid to do wrong thing
due to the injected delay or
in real life delay by doing tsc sync-re-check by applying the patch-set.
However , the noise is still cause wrong actions and the patch doesn't
defeat the injected's delay
please correct me if I'm wrong.

parameters]# cat *
1
1
-1
3
8

[62939.809615] clocksource: clocksource_watchdog_inject_delay():
Injecting delay.
[62939.816867] clocksource: clocksource_watchdog_inject_delay():
Injecting delay.
[62939.824094] clocksource: clocksource_watchdog_inject_delay():
Injecting delay.
[62939.831314] clocksource: clocksource_watchdog_inject_delay():
Injecting delay.
[62939.838536] clocksource: timekeeping watchdog on CPU26: hpet
read-back delay of 7220833ns, attempt 4, marking unstable
[62939.849230] tsc: Marking TSC unstable due to clocksource watchdog
[62939.855340] TSC found unstable after boot, most likely due to
broken BIOS. Use 'tsc=unstable'.
[62939.863972] sched_clock: Marking unstable (62943398530130,
-3543150114)<-(62941186607503, -1331276112)
[62939.875104] clocksource: Checking clocksource tsc synchronization
from CPU 123 to CPUs 0,6,26,62,78,97-98,137.
[62939.886518] clocksource: Switched to clocksource hpet

On Tue, Apr 27, 2021 at 2:27 AM Paul E. McKenney <[email protected]> wrote:
>
> On Mon, Apr 26, 2021 at 10:56:27AM -0700, Andi Kleen wrote:
> > > ------------------------------------------------------------------------
> > >
> > > - module parameters
> > >
> > > If the scope of the fault injection capability is limited to a
> > > single kernel module, it is better to provide module parameters to
> > > configure the fault attributes.
> > >
> > > ------------------------------------------------------------------------
> > >
> > > And in this case, the fault injection capability is in fact limited to
> > > kernel/clocksource.c.
> >
> >
> > I disagree with this recommendation because it prevents fuzzer coverage.
> >
> > Much better to have an uniform interface that can be automatically
> > explored.
>
> The permissions for these module parameters is 0644, so there is no
> reason why the fuzzers cannot use them via sysfs.
>
> Thanx, Paul

2021-04-28 13:57:43

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Tue, Apr 27 2021 at 18:48, Paul E. McKenney wrote:
> On Tue, Apr 27, 2021 at 11:09:49PM +0200, Thomas Gleixner wrote:
>> Paul,
>>
>> On Tue, Apr 27 2021 at 10:50, Paul E. McKenney wrote:
>> > On Tue, Apr 27, 2021 at 06:37:46AM -0700, Paul E. McKenney wrote:
>> >> I suppose that I give it (say) 120 seconds instead of the current 60,
>> >> which might be the right thing to do, but it does feel like papering
>> >> over a very real initramfs problem. Alternatively, I could provide a
>> >> boot parameter allowing those with slow systems to adjust as needed.
>> >
>> > OK, it turns out that there are systems for which boot times in excess
>> > of one minute are expected behavior. They are a bit rare, though.
>> > So what I will do is keep the 60-second default, add a boot parameter,
>> > and also add a comment by the warning pointing out the boot parameter.
>>
>> Oh, no. This starts to become yet another duct tape horror show.
>>
>> I'm not at all against a more robust and resilent watchdog mechanism,
>> but having a dozen knobs to tune and heuristics which are doomed to fail
>> is not a solution at all.
>
> One problem is that I did the .max_drift patch backwards. I tightened
> the skew requirements on all clocks except those specially marked, and
> I should have done the reverse. With that change, all of the clocks
> except for clocksource_tsc would work (or as the case might be, fail to
> work) in exactly the same way that they do today, but still rejecting
> false-positive skew events due to NMIs, SMIs, vCPU preemption, and so on.
>
> Then patch v10 7/7 can go away completely, and patch 6/7 becomes much
> smaller (and gets renamed), for example, as shown below.
>
> Does that help?

No. Because the problem is on both ends. We have TSC early which has
inaccurate frequency and we have watchdogs which are inaccurate,
i.e. refined jiffies.

So the threshold has to take both into account.

Thanks,

tglx

2021-04-28 17:34:59

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

Paul,

On Tue, Apr 27 2021 at 23:09, Thomas Gleixner wrote:
> On Tue, Apr 27 2021 at 10:50, Paul E. McKenney wrote:
>> OK, it turns out that there are systems for which boot times in excess
>> of one minute are expected behavior. They are a bit rare, though.
>> So what I will do is keep the 60-second default, add a boot parameter,
>> and also add a comment by the warning pointing out the boot parameter.
>
> Oh, no. This starts to become yet another duct tape horror show.
>
> I'm not at all against a more robust and resilent watchdog mechanism,
> but having a dozen knobs to tune and heuristics which are doomed to fail
> is not a solution at all.

That said, let's take a step back and look at the trainwreck from a
different POV.

Let's start with the easy part: Virtualization

While I'm still convinced that virt creates more problems than it
solves, in this particular case, the virt solution is actually
halfways trivial.

1) If the host does not trust the TSC then it clearly wants the guest
to use KVM clock and not TSC. KVM clock is still utilizing TSC, but
it's a host controlled and assisted mechanism.

2) If the host has clear indications that TSC can be trusted, then it
can tell the guest that TSC is trustworthy. A synthesized CPU
feature bit is the obvious solution for this.

That solves several virt problems at once:

- The watchdog issue

- The fact that TSC synchronization checks between two vCPUs
cannot ever work reliably.

Now for the bare metal case. We have to distinguish the following
scenarios:

1) CPU does not advertise X86_FEATURE_CONSTANT_TSC

That's a lost case and only relevant for really old hardware.

2) CPU does advertise X86_FEATURE_CONSTANT_TSC, but does not
have X86_FEATURE_NONSTOP_TSC

Mostly a lost case as well unless you disable the C-states in which
TSC stops working.

3) CPU does advertise X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC

That's the point where usable starts, which is around 2007/2008

4) CPU has X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC and
TSC_ADJUST

That's anything Intel starting from Haswell - not sure about the
ATOM parts though.

Non-Intel CPUs lack this completely.

5) CPU has X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC and
TSC_ADJUST and the not yet existing feature TSC_LOCKDOWN

We're asking for this for at least 15 years now, but we also had to
wait 10+ years to get a halfways usable TSC, so why am I
complaining?

So we really care about #3 and #4.

#4 is the easy case because we can check MSR_TSC_ADJUST to figure out
whether something has written to MSR_TSC or MSR_TSC_ADJUST and undo
the damage in a sane way.

That's currently only done when a CPU goes idle, but there are
options to do that differently (timer, TIF flag ...)

This allows to disable the clocksource watchdog completely and based
on Fengs work we are going to do so sooner than later.


The problematic case is #3 because that affects any Non-Intel CPUs
and the pre Haswell Intel ones.

The approach we have taken so far is the clocksource watchdog in the
form in which it exists today with the known limitations:

a) Broken watchdog clocksource. Not that I care much, because then
all bets are off.

This includes the jiffies based watchdog which is the worst bandaid
which we have, though that can be made 'work' by some definition of
work.

b) Long delays which prevent the watchdog from running
which in the worst case let the watchdog wrap around.

For ACPI_PMTIMER that's ~4.69 seconds and for HPET ~300 seconds.

Anything which keeps the watchdog timer from running for that long
is broken and trying to fix that at the watchdog level is putting
the cart before the horse.

Ignore virt here. See above.

c) vCPUs scheduled out between the watchdog and the TSC read

Cannot be mitigated reliably under all circumstances. See the virt
section above.

d) Long lasting NMI/SMI's between watchdog and TSC read

Can be mitigated with reread/retry as you demonstrated.

e) A too large threshold, which if reduced causes other problems as
you figured out already.

Unfortunately there is no other way than using the watchdog mechanism,
simply because we need hardware assistance for detection and a reliable
way to undo the damage, which is what we have with TSC_ADJUST. Of course
the best case would be a mechanism to prevent writes to TSC/TSC_ADJUST
completely after boot.

Now let's look at the cases which cause TSC problems in the real world:

1) Boot time (or resume time) TSC inconsistency

That's the major problem today, but that's not a watchdog issue.

On TSC_ADJUST equipped machines we fix the wreckage up, on others
we give up. For the latter case we don't need a watchdog :)

2) Runtime wreckage caused by BIOS/SMM

That used to be a wide spread problem 10 years ago and today it's
only a sporadic issue, but it's not eliminated completely which
is why we have this discussion at all.

As we have no reliable indicator whether a BIOS can be trusted and
history taught us that it can't be trusted, we need to have this
trainwreck until hardware people actually come to senses and fix the
root cause once and forever.

So let's look at the issues a - e above:

a) Is not fixable though the hillarious refined-jiffies case can
be handled to some degree. But I really don't care much about
it because it's a lost case.

b) Not interesting at all. If that ever happens, then sure the
TSC will be marked unstable for the wrong reasons, but that's
the least of the problems in that case.

And that includes virt because virt should not use the watchdog
ever.

c) Not relevant because virt has to solve the problems which virt
causes at the virt level and not here.

d) The reread/retry mechanism is sensible.

Though all the command line knobs for delay injection are really
horrible. If at all this can be avoided by having a special
watchdog clocksource module which registers a 'preferred' watchdog
clocksource and hides all the magic outside of the actual watchdog
code.

Neither I'm sure whether this IPI collect data muck is adding much
value, but shrug.

e) As I said elsewhere already this is an issue which has two
components if we want to handle the refined-jiffies case:

clocksource frequency is not accurate: early-TSC
watchdog is not accurate: refined-jiffies

So the threshold wants to be:

max(cs->uncertainty_margin, wd->uncertainty_margin)

So the right thing to do here is to set a small and reasonable
default margin ($N us) in the clock source registration code if
cs->uncertainty_margin == 0 and add the larger margins to
early-TSC and refined-jiffies.


The only other option to handle this mess is to declare the watchdog
experiment as failed, rip it out completely and emit a fat warning on
boot when a non-trustable TSC is detected:

HARDWARE-BUG: Untrusted TSC detected. For further information see:
https://www.kernel.org/doc/html/latest/admin-guide/hw-trainwrecks/tsc.html

I can live with that and maybe we should have done that 15 years ago
instead of trying to work around it at the symptom level.

In case we agree on that option, I'm volunteering to write the
documentation.

Thanks,

tglx

2021-04-28 17:35:11

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

On Wed, Apr 28, 2021 at 12:49:12PM +0800, Luming Yu wrote:
> We 'd expect to see clock_source watchdog can avoid to do wrong thing
> due to the injected delay or
> in real life delay by doing tsc sync-re-check by applying the patch-set.
> However , the noise is still cause wrong actions and the patch doesn't
> defeat the injected's delay
> please correct me if I'm wrong.

Injecting delay is just a test. In real life, if you got four delays
in a row, the cause is likely that the clock read is broken and taking
a very long time. In which case marking that clock unstable is a
reasonable response.

Other causes include having an NMI or SMI storm, getting extremely
unlucky with vCPU preemptions, and so on. In these cases, you are not
making much forward progress anyway, so marked-unstable clock is the
least of your worries.

I ran this (without injected delays) for more than a thousand hours on
rcutorture guest OSes and didn't see any instances of even two consecutive
bad reads. There was the very occasional single instance of a bad read.

Therefore, the code marks the clock unstable if it has four bad reads
in a row, as it should.

Thanx, Paul

> parameters]# cat *
> 1
> 1
> -1
> 3
> 8
>
> [62939.809615] clocksource: clocksource_watchdog_inject_delay():
> Injecting delay.
> [62939.816867] clocksource: clocksource_watchdog_inject_delay():
> Injecting delay.
> [62939.824094] clocksource: clocksource_watchdog_inject_delay():
> Injecting delay.
> [62939.831314] clocksource: clocksource_watchdog_inject_delay():
> Injecting delay.
> [62939.838536] clocksource: timekeeping watchdog on CPU26: hpet
> read-back delay of 7220833ns, attempt 4, marking unstable
> [62939.849230] tsc: Marking TSC unstable due to clocksource watchdog
> [62939.855340] TSC found unstable after boot, most likely due to
> broken BIOS. Use 'tsc=unstable'.
> [62939.863972] sched_clock: Marking unstable (62943398530130,
> -3543150114)<-(62941186607503, -1331276112)
> [62939.875104] clocksource: Checking clocksource tsc synchronization
> from CPU 123 to CPUs 0,6,26,62,78,97-98,137.
> [62939.886518] clocksource: Switched to clocksource hpet
>
> On Tue, Apr 27, 2021 at 2:27 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Mon, Apr 26, 2021 at 10:56:27AM -0700, Andi Kleen wrote:
> > > > ------------------------------------------------------------------------
> > > >
> > > > - module parameters
> > > >
> > > > If the scope of the fault injection capability is limited to a
> > > > single kernel module, it is better to provide module parameters to
> > > > configure the fault attributes.
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > And in this case, the fault injection capability is in fact limited to
> > > > kernel/clocksource.c.
> > >
> > >
> > > I disagree with this recommendation because it prevents fuzzer coverage.
> > >
> > > Much better to have an uniform interface that can be automatically
> > > explored.
> >
> > The permissions for these module parameters is 0644, so there is no
> > reason why the fuzzers cannot use them via sysfs.
> >
> > Thanx, Paul

2021-04-28 17:35:30

by Luming Yu

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

On Wed, Apr 28, 2021 at 9:57 PM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Apr 28, 2021 at 12:49:12PM +0800, Luming Yu wrote:
> > We 'd expect to see clock_source watchdog can avoid to do wrong thing
> > due to the injected delay or
> > in real life delay by doing tsc sync-re-check by applying the patch-set.
> > However , the noise is still cause wrong actions and the patch doesn't
> > defeat the injected's delay
> > please correct me if I'm wrong.
>
> Injecting delay is just a test. In real life, if you got four delays
> in a row, the cause is likely that the clock read is broken and taking
> a very long time. In which case marking that clock unstable is a
> reasonable response.
>
> Other causes include having an NMI or SMI storm, getting extremely
> unlucky with vCPU preemptions, and so on. In these cases, you are not
> making much forward progress anyway, so marked-unstable clock is the
> least of your worries.
>
> I ran this (without injected delays) for more than a thousand hours on
> rcutorture guest OSes and didn't see any instances of even two consecutive
> bad reads. There was the very occasional single instance of a bad read.
>
> Therefore, the code marks the clock unstable if it has four bad reads
> in a row, as it should.

The hard problem to solve is tsc is still in good shape and it can be verified
with a quick cross check with other thread/core's tsc counts in the
injected situation or in real life case
to prove if it is truly a tsc problem or reference clock's problem of
the watchdog.

Ideally, we could factor out hard-to-debug unstable tsc problems from
clock source watchdog problems
and get less and less tsc sightings caused by clock source watchdog.

>
> Thanx, Paul
>
> > parameters]# cat *
> > 1
> > 1
> > -1
> > 3
> > 8
> >
> > [62939.809615] clocksource: clocksource_watchdog_inject_delay():
> > Injecting delay.
> > [62939.816867] clocksource: clocksource_watchdog_inject_delay():
> > Injecting delay.
> > [62939.824094] clocksource: clocksource_watchdog_inject_delay():
> > Injecting delay.
> > [62939.831314] clocksource: clocksource_watchdog_inject_delay():
> > Injecting delay.
> > [62939.838536] clocksource: timekeeping watchdog on CPU26: hpet
> > read-back delay of 7220833ns, attempt 4, marking unstable
> > [62939.849230] tsc: Marking TSC unstable due to clocksource watchdog
> > [62939.855340] TSC found unstable after boot, most likely due to
> > broken BIOS. Use 'tsc=unstable'.
> > [62939.863972] sched_clock: Marking unstable (62943398530130,
> > -3543150114)<-(62941186607503, -1331276112)
> > [62939.875104] clocksource: Checking clocksource tsc synchronization
> > from CPU 123 to CPUs 0,6,26,62,78,97-98,137.
> > [62939.886518] clocksource: Switched to clocksource hpet
> >
> > On Tue, Apr 27, 2021 at 2:27 AM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Mon, Apr 26, 2021 at 10:56:27AM -0700, Andi Kleen wrote:
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > - module parameters
> > > > >
> > > > > If the scope of the fault injection capability is limited to a
> > > > > single kernel module, it is better to provide module parameters to
> > > > > configure the fault attributes.
> > > > >
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > And in this case, the fault injection capability is in fact limited to
> > > > > kernel/clocksource.c.
> > > >
> > > >
> > > > I disagree with this recommendation because it prevents fuzzer coverage.
> > > >
> > > > Much better to have an uniform interface that can be automatically
> > > > explored.
> > >
> > > The permissions for these module parameters is 0644, so there is no
> > > reason why the fuzzers cannot use them via sysfs.
> > >
> > > Thanx, Paul

2021-04-28 17:35:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

On Wed, Apr 28 2021 at 22:24, Luming Yu wrote:
> On Wed, Apr 28, 2021 at 9:57 PM Paul E. McKenney <[email protected]> wrote:
>> Therefore, the code marks the clock unstable if it has four bad reads
>> in a row, as it should.
>
> The hard problem to solve is tsc is still in good shape and it can be verified
> with a quick cross check with other thread/core's tsc counts in the
> injected situation or in real life case
> to prove if it is truly a tsc problem or reference clock's problem of
> the watchdog.
>
> Ideally, we could factor out hard-to-debug unstable tsc problems from
> clock source watchdog problems
> and get less and less tsc sightings caused by clock source watchdog.

The only way to fix this is at the hardware level. Everything else is
wishful thinking.

Thanks,

tglx

2021-04-28 17:40:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Wed, Apr 28, 2021 at 03:34:52PM +0200, Thomas Gleixner wrote:


> 3) CPU does advertise X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC
>
> That's the point where usable starts, which is around 2007/2008
>
> 4) CPU has X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC and
> TSC_ADJUST
>
> That's anything Intel starting from Haswell - not sure about the
> ATOM parts though.
>
> Non-Intel CPUs lack this completely.
>
> 5) CPU has X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC and
> TSC_ADJUST and the not yet existing feature TSC_LOCKDOWN
>
> We're asking for this for at least 15 years now, but we also had to
> wait 10+ years to get a halfways usable TSC, so why am I
> complaining?
>
> So we really care about #3 and #4.
>
> #4 is the easy case because we can check MSR_TSC_ADJUST to figure out
> whether something has written to MSR_TSC or MSR_TSC_ADJUST and undo
> the damage in a sane way.

This is after the fact though; userspace (and kernel space) will have
observed non-linear time and things will be broken in various subtle and
hard to tell ways.

> That's currently only done when a CPU goes idle, but there are
> options to do that differently (timer, TIF flag ...)
>
> This allows to disable the clocksource watchdog completely and based
> on Fengs work we are going to do so sooner than later.


> The problematic case is #3 because that affects any Non-Intel CPUs
> and the pre Haswell Intel ones.

Right, no clue what to do about all them AMD machines :/

> Unfortunately there is no other way than using the watchdog mechanism,
> simply because we need hardware assistance for detection and a reliable
> way to undo the damage, which is what we have with TSC_ADJUST. Of course
> the best case would be a mechanism to prevent writes to TSC/TSC_ADJUST
> completely after boot.

Yes, we needs us that TSC_LOCKDOWN thing, across all of x86.

> Now let's look at the cases which cause TSC problems in the real world:
>
> 1) Boot time (or resume time) TSC inconsistency
>
> That's the major problem today, but that's not a watchdog issue.
>
> On TSC_ADJUST equipped machines we fix the wreckage up, on others
> we give up. For the latter case we don't need a watchdog :)
>
> 2) Runtime wreckage caused by BIOS/SMM
>
> That used to be a wide spread problem 10 years ago and today it's
> only a sporadic issue, but it's not eliminated completely which
> is why we have this discussion at all.
>
> As we have no reliable indicator whether a BIOS can be trusted and
> history taught us that it can't be trusted, we need to have this
> trainwreck until hardware people actually come to senses and fix the
> root cause once and forever.

Also, realistically, people are actually still running modern kernels on
10 year old hardware :/

> The only other option to handle this mess is to declare the watchdog
> experiment as failed, rip it out completely and emit a fat warning on
> boot when a non-trustable TSC is detected:
>
> HARDWARE-BUG: Untrusted TSC detected. For further information see:
> https://www.kernel.org/doc/html/latest/admin-guide/hw-trainwrecks/tsc.html
>
> I can live with that and maybe we should have done that 15 years ago
> instead of trying to work around it at the symptom level.

Anybody that still has runtime BIOS wreckage will then silently suffer
nonlinear time, doubly so for anybody not having TSC_ADJUST. Are we sure
we can tell them all to bugger off and buy new hardware?

At the very least we need something like tsc=broken, to explicitly mark
TSC bad on machines, so that people that see TSC fail on their current
kernels can continue to use the new kernels. This requires a whole lot
of care on the part of users though, and will raise a ruckus, because I
bet a fair number of these people are not even currently aware we're
disabling TSC for them :/

2021-04-28 20:51:14

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Wed, Apr 28 2021 at 17:39, Peter Zijlstra wrote:
> On Wed, Apr 28, 2021 at 03:34:52PM +0200, Thomas Gleixner wrote:
>> #4 is the easy case because we can check MSR_TSC_ADJUST to figure out
>> whether something has written to MSR_TSC or MSR_TSC_ADJUST and undo
>> the damage in a sane way.
>
> This is after the fact though; userspace (and kernel space) will have
> observed non-linear time and things will be broken in various subtle and
> hard to tell ways.

What I observed in the recent past is that _IF_ that happens it's a
small amount of cycles so it's not a given that this can be observed
accross CPUs. But yes, it's daft.

>> I can live with that and maybe we should have done that 15 years ago
>> instead of trying to work around it at the symptom level.
>
> Anybody that still has runtime BIOS wreckage will then silently suffer
> nonlinear time, doubly so for anybody not having TSC_ADJUST. Are we sure
> we can tell them all to bugger off and buy new hardware?
>
> At the very least we need something like tsc=broken, to explicitly mark
> TSC bad on machines, so that people that see TSC fail on their current
> kernels can continue to use the new kernels. This requires a whole lot
> of care on the part of users though, and will raise a ruckus, because I
> bet a fair number of these people are not even currently aware we're
> disabling TSC for them :/

I'm still allowed to dream, right? :)

Thanks,

tglx

2021-04-28 20:53:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Wed, Apr 28, 2021 at 12:14:49PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 27 2021 at 18:48, Paul E. McKenney wrote:
> > On Tue, Apr 27, 2021 at 11:09:49PM +0200, Thomas Gleixner wrote:
> >> Paul,
> >>
> >> On Tue, Apr 27 2021 at 10:50, Paul E. McKenney wrote:
> >> > On Tue, Apr 27, 2021 at 06:37:46AM -0700, Paul E. McKenney wrote:
> >> >> I suppose that I give it (say) 120 seconds instead of the current 60,
> >> >> which might be the right thing to do, but it does feel like papering
> >> >> over a very real initramfs problem. Alternatively, I could provide a
> >> >> boot parameter allowing those with slow systems to adjust as needed.
> >> >
> >> > OK, it turns out that there are systems for which boot times in excess
> >> > of one minute are expected behavior. They are a bit rare, though.
> >> > So what I will do is keep the 60-second default, add a boot parameter,
> >> > and also add a comment by the warning pointing out the boot parameter.
> >>
> >> Oh, no. This starts to become yet another duct tape horror show.
> >>
> >> I'm not at all against a more robust and resilent watchdog mechanism,
> >> but having a dozen knobs to tune and heuristics which are doomed to fail
> >> is not a solution at all.
> >
> > One problem is that I did the .max_drift patch backwards. I tightened
> > the skew requirements on all clocks except those specially marked, and
> > I should have done the reverse. With that change, all of the clocks
> > except for clocksource_tsc would work (or as the case might be, fail to
> > work) in exactly the same way that they do today, but still rejecting
> > false-positive skew events due to NMIs, SMIs, vCPU preemption, and so on.
> >
> > Then patch v10 7/7 can go away completely, and patch 6/7 becomes much
> > smaller (and gets renamed), for example, as shown below.
> >
> > Does that help?
>
> No. Because the problem is on both ends. We have TSC early which has
> inaccurate frequency and we have watchdogs which are inaccurate,
> i.e. refined jiffies.
>
> So the threshold has to take both into account.

Got it, and will fix.

Thanx, Paul

2021-04-28 21:06:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Wed, Apr 28, 2021 at 03:34:52PM +0200, Thomas Gleixner wrote:
> Paul,
>
> On Tue, Apr 27 2021 at 23:09, Thomas Gleixner wrote:
> > On Tue, Apr 27 2021 at 10:50, Paul E. McKenney wrote:
> >> OK, it turns out that there are systems for which boot times in excess
> >> of one minute are expected behavior. They are a bit rare, though.
> >> So what I will do is keep the 60-second default, add a boot parameter,
> >> and also add a comment by the warning pointing out the boot parameter.
> >
> > Oh, no. This starts to become yet another duct tape horror show.
> >
> > I'm not at all against a more robust and resilent watchdog mechanism,
> > but having a dozen knobs to tune and heuristics which are doomed to fail
> > is not a solution at all.
>
> That said, let's take a step back and look at the trainwreck from a
> different POV.
>
> Let's start with the easy part: Virtualization
>
> While I'm still convinced that virt creates more problems than it
> solves, in this particular case, the virt solution is actually
> halfways trivial.
>
> 1) If the host does not trust the TSC then it clearly wants the guest
> to use KVM clock and not TSC. KVM clock is still utilizing TSC, but
> it's a host controlled and assisted mechanism.
>
> 2) If the host has clear indications that TSC can be trusted, then it
> can tell the guest that TSC is trustworthy. A synthesized CPU
> feature bit is the obvious solution for this.
>
> That solves several virt problems at once:
>
> - The watchdog issue
>
> - The fact that TSC synchronization checks between two vCPUs
> cannot ever work reliably.

Such checks are subject to false negatives, but not false positives.
And the probability of vCPU preemption within that small window is
low enough that repeated false negatives should be extremely rare.

Or am I missing another source of synchronization-check unreliability?

Or are you saying that the checks should be in the host OS rather than
in the guests?

> Now for the bare metal case. We have to distinguish the following
> scenarios:
>
> 1) CPU does not advertise X86_FEATURE_CONSTANT_TSC
>
> That's a lost case and only relevant for really old hardware.
>
> 2) CPU does advertise X86_FEATURE_CONSTANT_TSC, but does not
> have X86_FEATURE_NONSTOP_TSC
>
> Mostly a lost case as well unless you disable the C-states in which
> TSC stops working.
>
> 3) CPU does advertise X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC
>
> That's the point where usable starts, which is around 2007/2008
>
> 4) CPU has X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC and
> TSC_ADJUST
>
> That's anything Intel starting from Haswell - not sure about the
> ATOM parts though.
>
> Non-Intel CPUs lack this completely.

A lot of options, so thank you for summarizing!

> 5) CPU has X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC and
> TSC_ADJUST and the not yet existing feature TSC_LOCKDOWN
>
> We're asking for this for at least 15 years now, but we also had to
> wait 10+ years to get a halfways usable TSC, so why am I
> complaining?

Wait! Don't tell me... The BIOS vendors have been refusing to accept
any TSC_LOCKDOWN hardware feature? ;-)

> So we really care about #3 and #4.
>
> #4 is the easy case because we can check MSR_TSC_ADJUST to figure out
> whether something has written to MSR_TSC or MSR_TSC_ADJUST and undo
> the damage in a sane way.
>
> That's currently only done when a CPU goes idle, but there are
> options to do that differently (timer, TIF flag ...)
>
> This allows to disable the clocksource watchdog completely and based
> on Fengs work we are going to do so sooner than later.

Given correctly operating hardware, yes, the x86 arch_cpu_idle_enter()
invokes tsc_verify_tsc_adjust(), which will print a "TSC ADJUST differs:
CPU" message on the console, correct? Or did I go down the wrong
rabbit hole?

In addition, breakage due to age and environmentals is possible, and if
you have enough hardware, probable. In which case it would be good to
get a notification so that the system in question can be dealt with.

> The problematic case is #3 because that affects any Non-Intel CPUs
> and the pre Haswell Intel ones.

I am with Peter on this one, and not just for the AMD machines. :-/

> The approach we have taken so far is the clocksource watchdog in the
> form in which it exists today with the known limitations:
>
> a) Broken watchdog clocksource. Not that I care much, because then
> all bets are off.
>
> This includes the jiffies based watchdog which is the worst bandaid
> which we have, though that can be made 'work' by some definition of
> work.

Larger uncertainty_margin should handle this. Let's see: MIPS does HZ=24
(and HZ=1024?) and Alpha does HZ=32 (and HZ=1200?). Apparently all the
old HZ=10 and HZ=1 use cases have migrated to NO_HZ_FULL or something.

And even HZ=24 fits nicely into an int, even allowing a factor of four for
two measurements plus factor of safety. So looking at both clocksource's
uncertainty_margin as you suggest below should cover this case.

Give or take the inevitable surprise.

> b) Long delays which prevent the watchdog from running
> which in the worst case let the watchdog wrap around.
>
> For ACPI_PMTIMER that's ~4.69 seconds and for HPET ~300 seconds.
>
> Anything which keeps the watchdog timer from running for that long
> is broken and trying to fix that at the watchdog level is putting
> the cart before the horse.
>
> Ignore virt here. See above.

Agreed. If you are hitting five-minute vCPU preemptions, you are almost
certainly blowing your response-time constraints anyway. I mean, we might
not use full-up -rt, but we also are not a punched-card shop.

> c) vCPUs scheduled out between the watchdog and the TSC read
>
> Cannot be mitigated reliably under all circumstances. See the virt
> section above.

Your point being that the watchdog checks should be done on the KVM
hypervisor host rather than in the guest OSes, correct?

As noted in an earlier thread, I am concerned about the possibility of
some strange timer bug that affects guests but not hosts.

> d) Long lasting NMI/SMI's between watchdog and TSC read
>
> Can be mitigated with reread/retry as you demonstrated.
>
> e) A too large threshold, which if reduced causes other problems as
> you figured out already.

This whole thing has been a learning experience for me. ;-)

> Unfortunately there is no other way than using the watchdog mechanism,
> simply because we need hardware assistance for detection and a reliable
> way to undo the damage, which is what we have with TSC_ADJUST. Of course
> the best case would be a mechanism to prevent writes to TSC/TSC_ADJUST
> completely after boot.

There will be other failure modes that result in skew detection,
especially if the systems are old or under environmental stress, correct?

> Now let's look at the cases which cause TSC problems in the real world:
>
> 1) Boot time (or resume time) TSC inconsistency
>
> That's the major problem today, but that's not a watchdog issue.
>
> On TSC_ADJUST equipped machines we fix the wreckage up, on others
> we give up. For the latter case we don't need a watchdog :)

I bet that "whack the hardware over the head and reboot" is still a
popular error-handling strategy, and it would handle this case. ;-)

> 2) Runtime wreckage caused by BIOS/SMM
>
> That used to be a wide spread problem 10 years ago and today it's
> only a sporadic issue, but it's not eliminated completely which
> is why we have this discussion at all.
>
> As we have no reliable indicator whether a BIOS can be trusted and
> history taught us that it can't be trusted, we need to have this
> trainwreck until hardware people actually come to senses and fix the
> root cause once and forever.

On new hardware, it looks like checking for "TSC ADJUST differs: CPU"
console messages would be a good thing during system evaluation. Noted!

> So let's look at the issues a - e above:
>
> a) Is not fixable though the hillarious refined-jiffies case can
> be handled to some degree. But I really don't care much about
> it because it's a lost case.

Given the HZ settings available these days, your suggestion of
uncertainty_margin should avoid false positives.

> b) Not interesting at all. If that ever happens, then sure the
> TSC will be marked unstable for the wrong reasons, but that's
> the least of the problems in that case.
>
> And that includes virt because virt should not use the watchdog
> ever.
>
> c) Not relevant because virt has to solve the problems which virt
> causes at the virt level and not here.
>
> d) The reread/retry mechanism is sensible.
>
> Though all the command line knobs for delay injection are really
> horrible. If at all this can be avoided by having a special
> watchdog clocksource module which registers a 'preferred' watchdog
> clocksource and hides all the magic outside of the actual watchdog
> code.

You mean move the watchdog code to a kernel/time/clocksourcewdg.c file
or some such? Either way, the delay-injection parameters need to be
specified somehow, and there needs to be a way of reliably injecting
the types of errors required to exercise the code.

> Neither I'm sure whether this IPI collect data muck is adding much
> value, but shrug.
>
> e) As I said elsewhere already this is an issue which has two
> components if we want to handle the refined-jiffies case:
>
> clocksource frequency is not accurate: early-TSC
> watchdog is not accurate: refined-jiffies
>
> So the threshold wants to be:
>
> max(cs->uncertainty_margin, wd->uncertainty_margin)
>
> So the right thing to do here is to set a small and reasonable
> default margin ($N us) in the clock source registration code if
> cs->uncertainty_margin == 0 and add the larger margins to
> early-TSC and refined-jiffies.

That does make a lot of sense, thank you! I would be tempted to add the
two ->uncertainty_margin fields rather than take the max(), but in real
life it should not matter much.

> The only other option to handle this mess is to declare the watchdog
> experiment as failed, rip it out completely and emit a fat warning on
> boot when a non-trustable TSC is detected:
>
> HARDWARE-BUG: Untrusted TSC detected. For further information see:
> https://www.kernel.org/doc/html/latest/admin-guide/hw-trainwrecks/tsc.html
>
> I can live with that and maybe we should have done that 15 years ago
> instead of trying to work around it at the symptom level.
>
> In case we agree on that option, I'm volunteering to write the
> documentation.

My concern here is that a lot of people seem to be getting excited about
creating their own CPUs. I would rather have code that automatically
slaps their wrists for getting per-CPU timing wrong than to have to chase
it down the hard way. Much as I would hope that they would learn from
the TSC experience, that does not always appear to be the trend. :-/

Thanx, Paul

2021-04-29 07:39:28

by Feng Tang

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Wed, Apr 28, 2021 at 07:00:15PM +0200, Thomas Gleixner wrote:
> On Wed, Apr 28 2021 at 17:39, Peter Zijlstra wrote:
> > On Wed, Apr 28, 2021 at 03:34:52PM +0200, Thomas Gleixner wrote:
> >> #4 is the easy case because we can check MSR_TSC_ADJUST to figure out
> >> whether something has written to MSR_TSC or MSR_TSC_ADJUST and undo
> >> the damage in a sane way.
> >
> > This is after the fact though; userspace (and kernel space) will have
> > observed non-linear time and things will be broken in various subtle and
> > hard to tell ways.
>
> What I observed in the recent past is that _IF_ that happens it's a
> small amount of cycles so it's not a given that this can be observed
> accross CPUs. But yes, it's daft.

Currently when tsc_adjust overriden is detected, the warning msg is
"[Firmware Bug]: TSC ADJUST differs: CPU%u %lld --> %lld. Restoring",
which is kind of gentle. With Borislav's patch of preventing user space
from writing to tsc_adjust msr, the warning could be stronger? Adding
something after that like:

"Writing to TSC_ADJUST MSR is dangerous, and may cause the lost of
your best clocksource: tsc, please check with your BIOS/OS vendors"

Thanks,
Feng

> >> I can live with that and maybe we should have done that 15 years ago
> >> instead of trying to work around it at the symptom level.
> >
> > Anybody that still has runtime BIOS wreckage will then silently suffer
> > nonlinear time, doubly so for anybody not having TSC_ADJUST. Are we sure
> > we can tell them all to bugger off and buy new hardware?
> >
> > At the very least we need something like tsc=broken, to explicitly mark
> > TSC bad on machines, so that people that see TSC fail on their current
> > kernels can continue to use the new kernels. This requires a whole lot
> > of care on the part of users though, and will raise a ruckus, because I
> > bet a fair number of these people are not even currently aware we're
> > disabling TSC for them :/
>
> I'm still allowed to dream, right? :)
>
> Thanks,
>
> tglx

2021-04-29 08:28:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

Paul,

On Wed, Apr 28 2021 at 11:31, Paul E. McKenney wrote:
> On Wed, Apr 28, 2021 at 03:34:52PM +0200, Thomas Gleixner wrote:
>> 2) If the host has clear indications that TSC can be trusted, then it
>> can tell the guest that TSC is trustworthy. A synthesized CPU
>> feature bit is the obvious solution for this.
>>
>> That solves several virt problems at once:
>>
>> - The watchdog issue
>>
>> - The fact that TSC synchronization checks between two vCPUs
>> cannot ever work reliably.
>
> Such checks are subject to false negatives, but not false positives.
> And the probability of vCPU preemption within that small window is
> low enough that repeated false negatives should be extremely rare.
>
> Or am I missing another source of synchronization-check unreliability?

Oh yes. The whole CPU bringup synchronization checks which involves two
vCPUs to run side by side. That's different from the watchdog issue
which is just a single vCPU issue.

See the mess in tsc_sync.c ....

> Or are you saying that the checks should be in the host OS rather than
> in the guests?

Yes. That's where it belongs. The host has to make sure that TSC is usable
otherwise it should tell the guest not to use it. Anything else is
wishful thinking and never reliable.

>> 5) CPU has X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC and
>> TSC_ADJUST and the not yet existing feature TSC_LOCKDOWN
>>
>> We're asking for this for at least 15 years now, but we also had to
>> wait 10+ years to get a halfways usable TSC, so why am I
>> complaining?
>
> Wait! Don't tell me... The BIOS vendors have been refusing to accept
> any TSC_LOCKDOWN hardware feature? ;-)

Not sure who is refusing to accept reality. As most BIOSes keep their
hands of TSC nowadays that might be more an issue on the HW vendor side.

>> So we really care about #3 and #4.
>>
>> #4 is the easy case because we can check MSR_TSC_ADJUST to figure out
>> whether something has written to MSR_TSC or MSR_TSC_ADJUST and undo
>> the damage in a sane way.
>>
>> That's currently only done when a CPU goes idle, but there are
>> options to do that differently (timer, TIF flag ...)
>>
>> This allows to disable the clocksource watchdog completely and based
>> on Fengs work we are going to do so sooner than later.
>
> Given correctly operating hardware, yes, the x86 arch_cpu_idle_enter()
> invokes tsc_verify_tsc_adjust(), which will print a "TSC ADJUST differs:
> CPU" message on the console, correct? Or did I go down the wrong
> rabbit hole?

That's the correct rabbit whole.

> In addition, breakage due to age and environmentals is possible, and if
> you have enough hardware, probable. In which case it would be good to
> get a notification so that the system in question can be dealt with.

Are you trying to find a problem for a solution again?

>> This includes the jiffies based watchdog which is the worst bandaid
>> which we have, though that can be made 'work' by some definition of
>> work.
>
> Larger uncertainty_margin should handle this. Let's see: MIPS does HZ=24
> (and HZ=1024?) and Alpha does HZ=32 (and HZ=1200?). Apparently all the
> old HZ=10 and HZ=1 use cases have migrated to NO_HZ_FULL or something.
>
> And even HZ=24 fits nicely into an int, even allowing a factor of four for
> two measurements plus factor of safety. So looking at both clocksource's
> uncertainty_margin as you suggest below should cover this case.
>
> Give or take the inevitable surprise.

Yes, and TBH I'm not worried about this case at all.

>> b) Long delays which prevent the watchdog from running
>> which in the worst case let the watchdog wrap around.
>>
>> For ACPI_PMTIMER that's ~4.69 seconds and for HPET ~300 seconds.
>>
>> Anything which keeps the watchdog timer from running for that long
>> is broken and trying to fix that at the watchdog level is putting
>> the cart before the horse.
>>
>> Ignore virt here. See above.
>
> Agreed. If you are hitting five-minute vCPU preemptions, you are almost
> certainly blowing your response-time constraints anyway. I mean, we might
> not use full-up -rt, but we also are not a punched-card shop.

Punch-card computing was only non-deterministic up to the point where
your cards hit the reader assumed that the assistant did not drop the
pile and resorted it in the wrong order. Once the job got on the CPU it
was very deterministic :)

>> c) vCPUs scheduled out between the watchdog and the TSC read
>>
>> Cannot be mitigated reliably under all circumstances. See the virt
>> section above.
>
> Your point being that the watchdog checks should be done on the KVM
> hypervisor host rather than in the guest OSes, correct?
>
> As noted in an earlier thread, I am concerned about the possibility of
> some strange timer bug that affects guests but not hosts.

Well, you might then also build safety nets for interrupts, exceptions
and if you go fully paranoid for every single CPU instruction. :)

>> Unfortunately there is no other way than using the watchdog mechanism,
>> simply because we need hardware assistance for detection and a reliable
>> way to undo the damage, which is what we have with TSC_ADJUST. Of course
>> the best case would be a mechanism to prevent writes to TSC/TSC_ADJUST
>> completely after boot.
>
> There will be other failure modes that result in skew detection,
> especially if the systems are old or under environmental stress,
> correct?

Well, if the system goes down that road then it's most likely that other
parts will expose failures and the aging/stress applies to the watchdog
as well.

>> Now let's look at the cases which cause TSC problems in the real world:
>>
>> 1) Boot time (or resume time) TSC inconsistency
>>
>> That's the major problem today, but that's not a watchdog issue.
>>
>> On TSC_ADJUST equipped machines we fix the wreckage up, on others
>> we give up. For the latter case we don't need a watchdog :)
>
> I bet that "whack the hardware over the head and reboot" is still a
> popular error-handling strategy, and it would handle this case. ;-)

Then you might end up with a endless boot, whack, reboot cycles :)

>> 2) Runtime wreckage caused by BIOS/SMM
>>
>> That used to be a wide spread problem 10 years ago and today it's
>> only a sporadic issue, but it's not eliminated completely which
>> is why we have this discussion at all.
>>
>> As we have no reliable indicator whether a BIOS can be trusted and
>> history taught us that it can't be trusted, we need to have this
>> trainwreck until hardware people actually come to senses and fix the
>> root cause once and forever.
>
> On new hardware, it looks like checking for "TSC ADJUST differs: CPU"
> console messages would be a good thing during system evaluation.
> Noted!

Correct.

>> So let's look at the issues a - e above:
>>
>> a) Is not fixable though the hillarious refined-jiffies case can
>> be handled to some degree. But I really don't care much about
>> it because it's a lost case.
>
> Given the HZ settings available these days, your suggestion of
> uncertainty_margin should avoid false positives.
>
>> b) Not interesting at all. If that ever happens, then sure the
>> TSC will be marked unstable for the wrong reasons, but that's
>> the least of the problems in that case.
>>
>> And that includes virt because virt should not use the watchdog
>> ever.
>>
>> c) Not relevant because virt has to solve the problems which virt
>> causes at the virt level and not here.
>>
>> d) The reread/retry mechanism is sensible.
>>
>> Though all the command line knobs for delay injection are really
>> horrible. If at all this can be avoided by having a special
>> watchdog clocksource module which registers a 'preferred' watchdog
>> clocksource and hides all the magic outside of the actual watchdog
>> code.
>
> You mean move the watchdog code to a kernel/time/clocksourcewdg.c file
> or some such? Either way, the delay-injection parameters need to be
> specified somehow, and there needs to be a way of reliably injecting
> the types of errors required to exercise the code.

Sure. If you have a seperate module then you can add module params to it
obviously. But you don't need any of the muck in the actual watchdog
code because the watchdog::read() function in that module will simply
handle the delay injection. Hmm?

>> In case we agree on that option, I'm volunteering to write the
>> documentation.
>
> My concern here is that a lot of people seem to be getting excited about
> creating their own CPUs. I would rather have code that automatically
> slaps their wrists for getting per-CPU timing wrong than to have to chase
> it down the hard way. Much as I would hope that they would learn from
> the TSC experience, that does not always appear to be the trend. :-/

As I said to Peter already: Nothing prevents me from dreaming :)

Thanks,

tglx

2021-04-29 14:30:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Thu, Apr 29, 2021 at 10:27:09AM +0200, Thomas Gleixner wrote:
> Paul,
>
> On Wed, Apr 28 2021 at 11:31, Paul E. McKenney wrote:
> > On Wed, Apr 28, 2021 at 03:34:52PM +0200, Thomas Gleixner wrote:
> >> 2) If the host has clear indications that TSC can be trusted, then it
> >> can tell the guest that TSC is trustworthy. A synthesized CPU
> >> feature bit is the obvious solution for this.
> >>
> >> That solves several virt problems at once:
> >>
> >> - The watchdog issue
> >>
> >> - The fact that TSC synchronization checks between two vCPUs
> >> cannot ever work reliably.
> >
> > Such checks are subject to false negatives, but not false positives.
> > And the probability of vCPU preemption within that small window is
> > low enough that repeated false negatives should be extremely rare.
> >
> > Or am I missing another source of synchronization-check unreliability?
>
> Oh yes. The whole CPU bringup synchronization checks which involves two
> vCPUs to run side by side. That's different from the watchdog issue
> which is just a single vCPU issue.
>
> See the mess in tsc_sync.c ....

Sounds like too much fun... ;-)

> > Or are you saying that the checks should be in the host OS rather than
> > in the guests?
>
> Yes. That's where it belongs. The host has to make sure that TSC is usable
> otherwise it should tell the guest not to use it. Anything else is
> wishful thinking and never reliable.

Thank you for the confirmation. I will look into this.

> >> 5) CPU has X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC and
> >> TSC_ADJUST and the not yet existing feature TSC_LOCKDOWN
> >>
> >> We're asking for this for at least 15 years now, but we also had to
> >> wait 10+ years to get a halfways usable TSC, so why am I
> >> complaining?
> >
> > Wait! Don't tell me... The BIOS vendors have been refusing to accept
> > any TSC_LOCKDOWN hardware feature? ;-)
>
> Not sure who is refusing to accept reality. As most BIOSes keep their
> hands of TSC nowadays that might be more an issue on the HW vendor side.

Refusing to accept reality does seem to be a popular sport these days,
and perhaps it always has been. No shortage of suspects, therefore.

> >> So we really care about #3 and #4.
> >>
> >> #4 is the easy case because we can check MSR_TSC_ADJUST to figure out
> >> whether something has written to MSR_TSC or MSR_TSC_ADJUST and undo
> >> the damage in a sane way.
> >>
> >> That's currently only done when a CPU goes idle, but there are
> >> options to do that differently (timer, TIF flag ...)
> >>
> >> This allows to disable the clocksource watchdog completely and based
> >> on Fengs work we are going to do so sooner than later.
> >
> > Given correctly operating hardware, yes, the x86 arch_cpu_idle_enter()
> > invokes tsc_verify_tsc_adjust(), which will print a "TSC ADJUST differs:
> > CPU" message on the console, correct? Or did I go down the wrong
> > rabbit hole?
>
> That's the correct rabbit whole.

Thank you!

> > In addition, breakage due to age and environmentals is possible, and if
> > you have enough hardware, probable. In which case it would be good to
> > get a notification so that the system in question can be dealt with.
>
> Are you trying to find a problem for a solution again?

We really do see this thing trigger. I am trying to get rid of one
class of false positives that might be afflicting us. Along the way,
I am thinking aloud about what might be the cause of any remaining skew
reports that might trigger in the future.

> >> This includes the jiffies based watchdog which is the worst bandaid
> >> which we have, though that can be made 'work' by some definition of
> >> work.
> >
> > Larger uncertainty_margin should handle this. Let's see: MIPS does HZ=24
> > (and HZ=1024?) and Alpha does HZ=32 (and HZ=1200?). Apparently all the
> > old HZ=10 and HZ=1 use cases have migrated to NO_HZ_FULL or something.
> >
> > And even HZ=24 fits nicely into an int, even allowing a factor of four for
> > two measurements plus factor of safety. So looking at both clocksource's
> > uncertainty_margin as you suggest below should cover this case.
> >
> > Give or take the inevitable surprise.
>
> Yes, and TBH I'm not worried about this case at all.

Sounds good!

> >> b) Long delays which prevent the watchdog from running
> >> which in the worst case let the watchdog wrap around.
> >>
> >> For ACPI_PMTIMER that's ~4.69 seconds and for HPET ~300 seconds.
> >>
> >> Anything which keeps the watchdog timer from running for that long
> >> is broken and trying to fix that at the watchdog level is putting
> >> the cart before the horse.
> >>
> >> Ignore virt here. See above.
> >
> > Agreed. If you are hitting five-minute vCPU preemptions, you are almost
> > certainly blowing your response-time constraints anyway. I mean, we might
> > not use full-up -rt, but we also are not a punched-card shop.
>
> Punch-card computing was only non-deterministic up to the point where
> your cards hit the reader assumed that the assistant did not drop the
> pile and resorted it in the wrong order. Once the job got on the CPU it
> was very deterministic :)

And very slow, especially by today's standards! But yes, also extremely
deterministic by today's standards. This reminds me of Nicholas McGuire's
paper on producing hardware-quality random numbers from a tight loop
with interrupts disabled. Which would not have worked back in the
old days. ;-)

> >> c) vCPUs scheduled out between the watchdog and the TSC read
> >>
> >> Cannot be mitigated reliably under all circumstances. See the virt
> >> section above.
> >
> > Your point being that the watchdog checks should be done on the KVM
> > hypervisor host rather than in the guest OSes, correct?
> >
> > As noted in an earlier thread, I am concerned about the possibility of
> > some strange timer bug that affects guests but not hosts.
>
> Well, you might then also build safety nets for interrupts, exceptions
> and if you go fully paranoid for every single CPU instruction. :)

Fair, and I doubt that looking at failure data across a large fleet of
systems has done anything to reduce my level of paranoia. ;-)

> >> Unfortunately there is no other way than using the watchdog mechanism,
> >> simply because we need hardware assistance for detection and a reliable
> >> way to undo the damage, which is what we have with TSC_ADJUST. Of course
> >> the best case would be a mechanism to prevent writes to TSC/TSC_ADJUST
> >> completely after boot.
> >
> > There will be other failure modes that result in skew detection,
> > especially if the systems are old or under environmental stress,
> > correct?
>
> Well, if the system goes down that road then it's most likely that other
> parts will expose failures and the aging/stress applies to the watchdog
> as well.

Some fine day I should dig into what the most common senile-CPU failure
modes are. Probably someone has already done that, though such studies
likely go obsolete with each new stepping. :-/

> >> Now let's look at the cases which cause TSC problems in the real world:
> >>
> >> 1) Boot time (or resume time) TSC inconsistency
> >>
> >> That's the major problem today, but that's not a watchdog issue.
> >>
> >> On TSC_ADJUST equipped machines we fix the wreckage up, on others
> >> we give up. For the latter case we don't need a watchdog :)
> >
> > I bet that "whack the hardware over the head and reboot" is still a
> > popular error-handling strategy, and it would handle this case. ;-)
>
> Then you might end up with a endless boot, whack, reboot cycles :)

The cycle is ended when someone (or more likely, something) decides that
the system in question is beyond help. ;-)

> >> 2) Runtime wreckage caused by BIOS/SMM
> >>
> >> That used to be a wide spread problem 10 years ago and today it's
> >> only a sporadic issue, but it's not eliminated completely which
> >> is why we have this discussion at all.
> >>
> >> As we have no reliable indicator whether a BIOS can be trusted and
> >> history taught us that it can't be trusted, we need to have this
> >> trainwreck until hardware people actually come to senses and fix the
> >> root cause once and forever.
> >
> > On new hardware, it looks like checking for "TSC ADJUST differs: CPU"
> > console messages would be a good thing during system evaluation.
> > Noted!
>
> Correct.

Again, thank you!

> >> So let's look at the issues a - e above:
> >>
> >> a) Is not fixable though the hillarious refined-jiffies case can
> >> be handled to some degree. But I really don't care much about
> >> it because it's a lost case.
> >
> > Given the HZ settings available these days, your suggestion of
> > uncertainty_margin should avoid false positives.
> >
> >> b) Not interesting at all. If that ever happens, then sure the
> >> TSC will be marked unstable for the wrong reasons, but that's
> >> the least of the problems in that case.
> >>
> >> And that includes virt because virt should not use the watchdog
> >> ever.
> >>
> >> c) Not relevant because virt has to solve the problems which virt
> >> causes at the virt level and not here.
> >>
> >> d) The reread/retry mechanism is sensible.
> >>
> >> Though all the command line knobs for delay injection are really
> >> horrible. If at all this can be avoided by having a special
> >> watchdog clocksource module which registers a 'preferred' watchdog
> >> clocksource and hides all the magic outside of the actual watchdog
> >> code.
> >
> > You mean move the watchdog code to a kernel/time/clocksourcewdg.c file
> > or some such? Either way, the delay-injection parameters need to be
> > specified somehow, and there needs to be a way of reliably injecting
> > the types of errors required to exercise the code.
>
> Sure. If you have a seperate module then you can add module params to it
> obviously. But you don't need any of the muck in the actual watchdog
> code because the watchdog::read() function in that module will simply
> handle the delay injection. Hmm?

OK, first let me make sure I understand what you are suggesting.

The idea is to leave the watchdog code in kernel/time/clocksource.c,
but to move the fault injection into kernel/time/clocksourcefault.c or
some such. In this new file, new clocksource structures are created that
use some existing timebase/clocksource under the covers. These then
inject delays based on module parameters (one senstive to CPU number,
the other unconditional). They register these clocksources using the
normal interfaces, and verify that they are eventually marked unstable
when the fault-injection parameters warrant it. This is combined with
the usual checking of the console log.

Or am I missing your point?

> >> In case we agree on that option, I'm volunteering to write the
> >> documentation.
> >
> > My concern here is that a lot of people seem to be getting excited about
> > creating their own CPUs. I would rather have code that automatically
> > slaps their wrists for getting per-CPU timing wrong than to have to chase
> > it down the hard way. Much as I would hope that they would learn from
> > the TSC experience, that does not always appear to be the trend. :-/
>
> As I said to Peter already: Nothing prevents me from dreaming :)

Dreams are harmless, at least as long as they don't come true. ;-)

Thanx, Paul

2021-04-29 17:32:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

Paul,

On Thu, Apr 29 2021 at 07:26, Paul E. McKenney wrote:
> On Thu, Apr 29, 2021 at 10:27:09AM +0200, Thomas Gleixner wrote:
>> > Or are you saying that the checks should be in the host OS rather than
>> > in the guests?
>>
>> Yes. That's where it belongs. The host has to make sure that TSC is usable
>> otherwise it should tell the guest not to use it. Anything else is
>> wishful thinking and never reliable.
>
> Thank you for the confirmation. I will look into this.

So the guest might need at least some basic sanity checking unless we
declare that hypervisors are always working correctly :)

Which is admittedly more likely than making the same assumption about
BIOS and hardware.

>> > In addition, breakage due to age and environmentals is possible, and if
>> > you have enough hardware, probable. In which case it would be good to
>> > get a notification so that the system in question can be dealt with.
>>
>> Are you trying to find a problem for a solution again?
>
> We really do see this thing trigger. > I am trying to get rid of one
> class of false positives that might be afflicting us. Along the way,
> I am thinking aloud about what might be the cause of any remaining
> skew reports that might trigger in the future.

Fair enough. Admittedly this has at least entertainment value :)

>> Well, you might then also build safety nets for interrupts, exceptions
>> and if you go fully paranoid for every single CPU instruction. :)
>
> Fair, and I doubt that looking at failure data across a large fleet of
> systems has done anything to reduce my level of paranoia. ;-)

You should have known better than opening Pandoras box.

>> Sure. If you have a seperate module then you can add module params to it
>> obviously. But you don't need any of the muck in the actual watchdog
>> code because the watchdog::read() function in that module will simply
>> handle the delay injection. Hmm?
>
> OK, first let me make sure I understand what you are suggesting.
>
> The idea is to leave the watchdog code in kernel/time/clocksource.c,
> but to move the fault injection into kernel/time/clocksourcefault.c or
> some such. In this new file, new clocksource structures are created that
> use some existing timebase/clocksource under the covers. These then
> inject delays based on module parameters (one senstive to CPU number,
> the other unconditional). They register these clocksources using the
> normal interfaces, and verify that they are eventually marked unstable
> when the fault-injection parameters warrant it. This is combined with
> the usual checking of the console log.
>
> Or am I missing your point?

That's what I meant.

Thanks,

tglx

2021-04-29 23:07:02

by Andi Kleen

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

> > The idea is to leave the watchdog code in kernel/time/clocksource.c,
> > but to move the fault injection into kernel/time/clocksourcefault.c or
> > some such. In this new file, new clocksource structures are created that
> > use some existing timebase/clocksource under the covers. These then
> > inject delays based on module parameters (one senstive to CPU number,
> > the other unconditional). They register these clocksources using the
> > normal interfaces, and verify that they are eventually marked unstable
> > when the fault-injection parameters warrant it. This is combined with
> > the usual checking of the console log.
> >
> > Or am I missing your point?
>
> That's what I meant.

I still think all this stuff should be in the fault injection framework,
like other fault injections, to have a consistent discoverable interface.

I actually checked now and the standard fault injection supports boot arguments,
so needing it at boot time shouldn't be a barrier.

-Andi

2021-04-30 00:26:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Thu, Apr 29, 2021 at 04:04:11PM -0700, Andi Kleen wrote:
> > > The idea is to leave the watchdog code in kernel/time/clocksource.c,
> > > but to move the fault injection into kernel/time/clocksourcefault.c or
> > > some such. In this new file, new clocksource structures are created that
> > > use some existing timebase/clocksource under the covers. These then
> > > inject delays based on module parameters (one senstive to CPU number,
> > > the other unconditional). They register these clocksources using the
> > > normal interfaces, and verify that they are eventually marked unstable
> > > when the fault-injection parameters warrant it. This is combined with
> > > the usual checking of the console log.
> > >
> > > Or am I missing your point?
> >
> > That's what I meant.
>
> I still think all this stuff should be in the fault injection framework,
> like other fault injections, to have a consistent discoverable interface.
>
> I actually checked now and the standard fault injection supports boot arguments,
> so needing it at boot time shouldn't be a barrier.

Per Thomas's feedback, I am in the midst of converting this to a unit
test implemented as a kernel module, at which point the only fault
injection will be in the unit test.

At the moment, the code just registers, reads, unregisters, and verifies
that the bogus unit-test clocksources act normally. Fault injection is
next on the list for the fine-grained clocksource. Which, as Thomas
noted, is quite a bit simpler, as I just need to force a delay until
the clocksource gets marked unstable with no need for fancy counting.

Thanx, Paul

2021-04-30 01:01:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Thu, Apr 29, 2021 at 05:24:59PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 29, 2021 at 04:04:11PM -0700, Andi Kleen wrote:
> > > > The idea is to leave the watchdog code in kernel/time/clocksource.c,
> > > > but to move the fault injection into kernel/time/clocksourcefault.c or
> > > > some such. In this new file, new clocksource structures are created that
> > > > use some existing timebase/clocksource under the covers. These then
> > > > inject delays based on module parameters (one senstive to CPU number,
> > > > the other unconditional). They register these clocksources using the
> > > > normal interfaces, and verify that they are eventually marked unstable
> > > > when the fault-injection parameters warrant it. This is combined with
> > > > the usual checking of the console log.
> > > >
> > > > Or am I missing your point?
> > >
> > > That's what I meant.
> >
> > I still think all this stuff should be in the fault injection framework,
> > like other fault injections, to have a consistent discoverable interface.
> >
> > I actually checked now and the standard fault injection supports boot arguments,
> > so needing it at boot time shouldn't be a barrier.
>
> Per Thomas's feedback, I am in the midst of converting this to a unit
> test implemented as a kernel module, at which point the only fault
> injection will be in the unit test.
>
> At the moment, the code just registers, reads, unregisters, and verifies
> that the bogus unit-test clocksources act normally. Fault injection is
> next on the list for the fine-grained clocksource. Which, as Thomas
> noted, is quite a bit simpler, as I just need to force a delay until
> the clocksource gets marked unstable with no need for fancy counting.

And this is what I currently get on the console from a successful test:

------------------------------------------------------------------------

clocksource_wdtest: --- holdoff=20
clocksource_wdtest: --- Verify jiffies-like uncertainty margin.
clocksource: wdtest-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
clocksource_wdtest: --- Verify tsc-like uncertainty margin.
clocksource: wdtest-ktime: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
clocksource_wdtest: --- tsc-like times: 1619743817068433427 - 1619743817068432303 = 1124.
clocksource_wdtest: --- Watchdog without error injection.
clocksource_wdtest: --- Watchdog with singleton error injection.
clocksource_wdtest: --- Watchdog with doublet error injection, expect console messages.
clocksource: timekeeping watchdog on CPU4: kvm-clock retried 2 times before success
clocksource_wdtest: --- Watchdog with quadruplet error injection, expect clock skew.
clocksource: timekeeping watchdog on CPU8: kvm-clock read-back delay of 401209ns, attempt 4, marking unstable
clocksource_wdtest: --- Marking wdtest-ktime unstable due to clocksource watchdog.
clocksource_wdtest: --- Done with test.

------------------------------------------------------------------------

The code currently looks like a dog's breakfast, so I will clean it
up before sending it out. And of course add the time-readout error
injection to test the other clock-skew code path.

And yes, there are WARNs to verify that skew happens when it is supposed
to and so on.

Thanx, Paul

2021-04-30 05:15:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog

On Thu, Apr 29, 2021 at 07:30:20PM +0200, Thomas Gleixner wrote:
> Paul,
>
> On Thu, Apr 29 2021 at 07:26, Paul E. McKenney wrote:
> > On Thu, Apr 29, 2021 at 10:27:09AM +0200, Thomas Gleixner wrote:
> >> > Or are you saying that the checks should be in the host OS rather than
> >> > in the guests?
> >>
> >> Yes. That's where it belongs. The host has to make sure that TSC is usable
> >> otherwise it should tell the guest not to use it. Anything else is
> >> wishful thinking and never reliable.
> >
> > Thank you for the confirmation. I will look into this.
>
> So the guest might need at least some basic sanity checking unless we
> declare that hypervisors are always working correctly :)
>
> Which is admittedly more likely than making the same assumption about
> BIOS and hardware.

That could be a tough choice...

> >> > In addition, breakage due to age and environmentals is possible, and if
> >> > you have enough hardware, probable. In which case it would be good to
> >> > get a notification so that the system in question can be dealt with.
> >>
> >> Are you trying to find a problem for a solution again?
> >
> > We really do see this thing trigger. > I am trying to get rid of one
> > class of false positives that might be afflicting us. Along the way,
> > I am thinking aloud about what might be the cause of any remaining
> > skew reports that might trigger in the future.
>
> Fair enough. Admittedly this has at least entertainment value :)

Glad to help. I think. ;-)

> >> Well, you might then also build safety nets for interrupts, exceptions
> >> and if you go fully paranoid for every single CPU instruction. :)
> >
> > Fair, and I doubt that looking at failure data across a large fleet of
> > systems has done anything to reduce my level of paranoia. ;-)
>
> You should have known better than opening Pandoras box.

When has that ever stopped anyone?

> >> Sure. If you have a seperate module then you can add module params to it
> >> obviously. But you don't need any of the muck in the actual watchdog
> >> code because the watchdog::read() function in that module will simply
> >> handle the delay injection. Hmm?
> >
> > OK, first let me make sure I understand what you are suggesting.
> >
> > The idea is to leave the watchdog code in kernel/time/clocksource.c,
> > but to move the fault injection into kernel/time/clocksourcefault.c or
> > some such. In this new file, new clocksource structures are created that
> > use some existing timebase/clocksource under the covers. These then
> > inject delays based on module parameters (one senstive to CPU number,
> > the other unconditional). They register these clocksources using the
> > normal interfaces, and verify that they are eventually marked unstable
> > when the fault-injection parameters warrant it. This is combined with
> > the usual checking of the console log.
> >
> > Or am I missing your point?
>
> That's what I meant.

OK, here is a sneak preview. A fair amount of cleanup still needed,
but functional. I also need to rip out the old error-injection code
and rejigger the commits. And I of course leverage rcutorture to run
this thing, so there is probably at least one missing export in there
somewhere.

And yes, I did try making clocksource_wdtest_ktime be curr_clocksource,
with the expected (in retrospect, anyway) hilarious results. Hence the
direct call to clocksource_verify_percpu().

Thanx, Paul

------------------------------------------------------------------------

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 8f87c1a6f323..83b3c6639c39 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -285,4 +285,7 @@ static inline void timer_probe(void) {}
#define TIMER_ACPI_DECLARE(name, table_id, fn) \
ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)

+extern ulong max_cswd_read_retries;
+void clocksource_verify_percpu(struct clocksource *cs);
+
#endif /* _LINUX_CLOCKSOURCE_H */
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 1fb1c1ef6a19..1ed85b25b096 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -21,3 +21,4 @@ obj-$(CONFIG_HAVE_GENERIC_VDSO) += vsyscall.o
obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o
obj-$(CONFIG_TEST_UDELAY) += test_udelay.o
obj-$(CONFIG_TIME_NS) += namespace.o
+obj-$(CONFIG_TEST_CLOCKSOURCE_WATCHDOG) += clocksource-wdtest.o
diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
new file mode 100644
index 000000000000..e01f09defdf7
--- /dev/null
+++ b/kernel/time/clocksource-wdtest.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Unit test for the clocksource watchdog.
+ *
+ * Copyright (C) 2021 Facebook, Inc.
+ *
+ * Author: Paul E. McKenney <[email protected]>
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/clocksource.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
+#include <linux/tick.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
+#include <linux/prandom.h>
+#include <linux/cpu.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Paul E. McKenney <[email protected]>");
+
+static int holdoff = IS_BUILTIN(CONFIG_TEST_CLOCKSOURCE_WATCHDOG) ? 20 : 0;
+module_param(holdoff, int, 0444);
+MODULE_PARM_DESC(holdoff, "Time to wait to start test (s).");
+
+/* Watchdog kthread's task_struct pointer for debug purposes. */
+static struct task_struct *wdtest_task;
+
+static u64 wdtest_jiffies_read(struct clocksource *cs)
+{
+ return (u64)jiffies;
+}
+
+/* Assume HZ > 100. */
+#define JIFFIES_SHIFT 8
+
+static struct clocksource clocksource_wdtest_jiffies = {
+ .name = "wdtest-jiffies",
+ .rating = 1, /* lowest valid rating*/
+ .uncertainty_margin = TICK_NSEC,
+ .read = wdtest_jiffies_read,
+ .mask = CLOCKSOURCE_MASK(32),
+ .flags = CLOCK_SOURCE_MUST_VERIFY,
+ .mult = TICK_NSEC << JIFFIES_SHIFT, /* details above */
+ .shift = JIFFIES_SHIFT,
+ .max_cycles = 10,
+};
+
+static int wdtest_ktime_read_ndelays;
+static bool wdtest_ktime_read_fuzz;
+
+static u64 wdtest_ktime_read(struct clocksource *cs)
+{
+ int wkrn = READ_ONCE(wdtest_ktime_read_ndelays);
+ static int sign = 1;
+ u64 ret;
+
+ if (wkrn) {
+ udelay(cs->uncertainty_margin / 250);
+ WRITE_ONCE(wdtest_ktime_read_ndelays, wkrn - 1);
+ }
+ ret = ktime_get_real_fast_ns();
+ if (READ_ONCE(wdtest_ktime_read_fuzz)) {
+ sign = -sign;
+ ret = ret + sign * 100 * NSEC_PER_MSEC;
+ }
+ return ret;
+}
+
+static void wdtest_ktime_cs_mark_unstable(struct clocksource *cs)
+{
+ pr_info("--- Marking %s unstable due to clocksource watchdog.\n", cs->name);
+}
+
+#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
+ CLOCK_SOURCE_VALID_FOR_HRES | \
+ CLOCK_SOURCE_MUST_VERIFY | \
+ CLOCK_SOURCE_VERIFY_PERCPU)
+
+static struct clocksource clocksource_wdtest_ktime = {
+ .name = "wdtest-ktime",
+ .rating = 300,
+ .read = wdtest_ktime_read,
+ .mask = CLOCKSOURCE_MASK(64),
+ .flags = KTIME_FLAGS,
+ .mark_unstable = wdtest_ktime_cs_mark_unstable,
+ .list = LIST_HEAD_INIT(clocksource_wdtest_ktime.list),
+};
+
+/* Run the specified series of watchdog tests. */
+static int wdtest_func(void *arg)
+{
+ unsigned long j1, j2;
+ int i;
+
+ schedule_timeout_uninterruptible(holdoff * HZ);
+
+ /*
+ * Verify that jiffies-like clocksources get the manually
+ * specified uncertainty margin.
+ */
+ pr_info("--- Verify jiffies-like uncertainty margin.\n");
+ __clocksource_register(&clocksource_wdtest_jiffies);
+ WARN_ON_ONCE(clocksource_wdtest_jiffies.uncertainty_margin != TICK_NSEC);
+
+ j1 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
+ schedule_timeout_uninterruptible(HZ);
+ j2 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
+ WARN_ON_ONCE(j1 == j2);
+
+ clocksource_unregister(&clocksource_wdtest_jiffies);
+
+ /*
+ * Verify that tsc-like clocksources are assigned a reasonable
+ * uncertainty margin.
+ */
+ pr_info("--- Verify tsc-like uncertainty margin.\n");
+ clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+ WARN_ON_ONCE(clocksource_wdtest_ktime.uncertainty_margin < NSEC_PER_USEC);
+
+ j1 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
+ udelay(1);
+ j2 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
+ pr_info("--- tsc-like times: %lu - %lu = %lu.\n", j2, j1, j2 - j1);
+ WARN_ON_ONCE(time_before(j2, j1 + NSEC_PER_USEC));
+
+ /* Verify tsc-like stability without error injection. */
+ pr_info("--- Watchdog without error injection.\n");
+ schedule_timeout_uninterruptible(2 * HZ);
+ WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
+
+ /* Verify tsc-like stability with various numbers of errors injected. */
+ for (i = 1; i <= max_cswd_read_retries + 1; i++) {
+ pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n",
+ i, max_cswd_read_retries,
+ i <= 1 ? "" : i <= max_cswd_read_retries
+ ? ", expect message"
+ : ", expect clock skew");
+ WRITE_ONCE(wdtest_ktime_read_ndelays, i);
+ schedule_timeout_uninterruptible(2 * HZ);
+ WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
+ WARN_ON_ONCE((i <= max_cswd_read_retries) !=
+ !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
+ if (clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE) {
+ clocksource_unregister(&clocksource_wdtest_ktime);
+ clocksource_wdtest_ktime.flags = KTIME_FLAGS;
+ schedule_timeout_uninterruptible(HZ / 10);
+ clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+ }
+ }
+
+ /* Verify tsc-like stability with clock-value-fuzz error injection. */
+ pr_info("--- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.\n");
+ WRITE_ONCE(wdtest_ktime_read_fuzz, true);
+ schedule_timeout_uninterruptible(2 * HZ);
+ WARN_ON_ONCE(!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
+ clocksource_verify_percpu(&clocksource_wdtest_ktime);
+ WRITE_ONCE(wdtest_ktime_read_fuzz, false);
+
+ clocksource_unregister(&clocksource_wdtest_ktime);
+
+ pr_info("--- Done with test.\n");
+ return 0;
+}
+
+static void wdtest_print_module_parms(void)
+{
+ pr_alert("--- holdoff=%d\n", holdoff);
+}
+
+/* Cleanup function. */
+static void clocksource_wdtest_cleanup(void)
+{
+}
+
+static int __init clocksource_wdtest_init(void)
+{
+ int ret = 0;
+
+ wdtest_print_module_parms();
+
+ /* Create watchdog-test task. */
+ wdtest_task = kthread_run(wdtest_func, NULL, "wdtest");
+ if (IS_ERR(wdtest_task)) {
+ ret = PTR_ERR(wdtest_task);
+ pr_warn("%s: Failed to create wdtest kthread.\n", __func__);
+ wdtest_task = NULL;
+ return ret;
+ }
+
+ return 0;
+}
+
+module_init(clocksource_wdtest_init);
+module_exit(clocksource_wdtest_cleanup);
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 328a65ddb959..2b2937551bfe 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -206,8 +206,9 @@ static ulong inject_delay_repeat = 1;
module_param(inject_delay_repeat, ulong, 0644);
static int inject_delay_shift_percpu = -1;
module_param(inject_delay_shift_percpu, int, 0644);
-static ulong max_read_retries = 3;
-module_param(max_read_retries, ulong, 0644);
+ulong max_cswd_read_retries = 3;
+module_param(max_cswd_read_retries, ulong, 0644);
+EXPORT_SYMBOL_GPL(max_cswd_read_retries);
static int verify_n_cpus = 8;
module_param(verify_n_cpus, int, 0644);

@@ -233,7 +234,7 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
u64 wd_end, wd_delta;
int64_t wd_delay;

- for (nretries = 0; nretries <= max_read_retries; nretries++) {
+ for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
local_irq_disable();
*wdnow = watchdog->read(watchdog);
clocksource_watchdog_inject_delay();
@@ -244,7 +245,7 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
if (wd_delay <= WATCHDOG_MAX_SKEW) {
- if (nretries > 1 || nretries >= max_read_retries) {
+ if (nretries > 1 || nretries >= max_cswd_read_retries) {
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
}
@@ -323,7 +324,7 @@ static void clocksource_verify_one_cpu(void *csin)
csnow_mid = cs->read(cs) + delta;
}

-static void clocksource_verify_percpu(struct clocksource *cs)
+void clocksource_verify_percpu(struct clocksource *cs)
{
int64_t cs_nsec, cs_nsec_max = 0, cs_nsec_min = LLONG_MAX;
u64 csnow_begin, csnow_end;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2779c29d9981..344735b8b151 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2513,6 +2513,18 @@ config TEST_FPU

If unsure, say N.

+config TEST_CLOCKSOURCE_WATCHDOG
+ tristate "Test clocksource watchdog in kernel space"
+ depends on CLOCKSOURCE_WATCHDOG
+ help
+ Enable this option to create a kernel module that will trigger
+ a test of the clocksource watchdog. This module may be loaded
+ via modprobe or insmod in which case it will run upon being
+ loaded, or it may be built in, in which case it will run
+ shortly after boot.
+
+ If unsure, say N.
+
endif # RUNTIME_TESTING_MENU

config MEMTEST