2021-01-06 00:51:24

by Paul E. McKenney

Subject: [PATCH RFC clocksource] Do not mark clocks unstable due to delays

Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own. This series checks for this, doing limited retries
to get a good set of clock reads. If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

1. Provide module parameters to inject delays in watchdog.

2. Retry clock read if long delays detected.

3. Check per-CPU clock synchronization when marked unstable.

4. Provide a module parameter to fuzz per-CPU clock checking.

5. Do pairwise clock-desynchronization checking.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 31 ++++
arch/x86/kernel/kvmclock.c | 2
arch/x86/kernel/tsc.c | 3
include/linux/clocksource.h | 2
kernel/time/clocksource.c | 176 +++++++++++++++++++++---
5 files changed, 189 insertions(+), 25 deletions(-)


2021-01-06 00:51:31

by Paul E. McKenney

Subject: [PATCH RFC clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

From: "Paul E. McKenney" <[email protected]>

Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?

This commit therefore adds CPU-to-CPU synchronization checking
for newly unstable clocksources that are marked with the new
CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are
printed, with the caveat that if it is the reporting CPU that is itself
desynchronized, it will appear that all the other clocks are wrong.
Just like in real life.
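
In other words, the measuring CPU reads the clocksource before and after
asking the other CPUs to read it, and any remote value falling outside that
window (modulo counter wrap) flags its CPU as ahead or behind. A minimal
standalone sketch of just that comparison, using fixed-width stdint types
in place of the kernel's u64/s64 (the sample values are made up):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * begin and end are read on the measuring CPU, mid on the CPU being
 * verified.  The masked subtraction tolerates counter wrap; a result that
 * is negative when viewed as signed means mid lies outside [begin, end].
 */
static bool remote_clock_in_window(uint64_t begin, uint64_t mid,
				   uint64_t end, uint64_t mask)
{
	if ((int64_t)((mid - begin) & mask) < 0)
		return false;	/* that CPU's clock is behind */
	if ((int64_t)((end - mid) & mask) < 0)
		return false;	/* that CPU's clock is ahead */
	return true;
}

int main(void)
{
	uint64_t mask = ~0ULL;	/* a 64-bit clocksource, CLOCKSOURCE_MASK(64) */

	printf("in window: %d\n", remote_clock_in_window(1000, 1500, 2000, mask));
	printf("behind:    %d\n", remote_clock_in_window(1000,  900, 2000, mask));
	printf("ahead:     %d\n", remote_clock_in_window(1000, 2500, 2000, mask));
	return 0;
}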

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/kvmclock.c | 2 +-
arch/x86/kernel/tsc.c | 3 +-
include/linux/clocksource.h | 2 +-
kernel/time/clocksource.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa59374..337bb2c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -169,7 +169,7 @@ struct clocksource kvm_clock = {
.read = kvm_clock_get_cycles,
.rating = 400,
.mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU,
.enable = kvm_cs_enable,
};
EXPORT_SYMBOL_GPL(kvm_clock);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f70dffc..5628917 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY,
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
.resume = tsc_resume,
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143d..83a3ebf 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -131,7 +131,7 @@ struct clocksource {
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-
+#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4663b86..23bcefe 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void)
WARN_ON_ONCE(injectfail < 0);
}

+static struct clocksource *clocksource_verify_work_cs;
+static DEFINE_PER_CPU(u64, csnow_mid);
+static cpumask_t cpus_ahead;
+static cpumask_t cpus_behind;
+
+static void clocksource_verify_one_cpu(void *csin)
+{
+ struct clocksource *cs = (struct clocksource *)csin;
+
+ __this_cpu_write(csnow_mid, cs->read(cs));
+}
+
+static void clocksource_verify_percpu_wq(struct work_struct *unused)
+{
+ int cpu;
+ struct clocksource *cs;
+ int64_t cs_nsec;
+ u64 csnow_begin;
+ u64 csnow_end;
+ u64 delta;
+
+ cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
+ if (WARN_ON_ONCE(!cs))
+ return;
+ pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
+ cs->name, smp_processor_id());
+ cpumask_clear(&cpus_ahead);
+ cpumask_clear(&cpus_behind);
+ csnow_begin = cs->read(cs);
+ smp_call_function(clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ for_each_online_cpu(cpu) {
+ if (cpu == smp_processor_id())
+ continue;
+ delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_behind);
+ delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_ahead);
+ }
+ if (!cpumask_empty(&cpus_ahead))
+ pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_ahead),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_behind))
+ pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_behind),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec, cs->name);
+ }
+ smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
+}
+
+static DECLARE_WORK(clocksource_verify_work, clocksource_verify_percpu_wq);
+
+static void clocksource_verify_percpu(struct clocksource *cs)
+{
+ if (!(cs->flags & CLOCK_SOURCE_VERIFY_PERCPU))
+ return;
+ if (smp_load_acquire(&clocksource_verify_work_cs)) { // pairs with release.
+ pr_warn("Previous clocksource synchronization still in flight.\n");
+ return;
+ }
+ smp_store_release(&clocksource_verify_work_cs, cs); //pairs with acquire.
+ queue_work(system_highpri_wq, &clocksource_verify_work);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -284,6 +356,7 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, csnow, cslast, cs->mask);
+ clocksource_verify_percpu(cs);
__clocksource_unstable(cs);
continue;
}
--
2.9.5

2021-01-06 00:51:34

by Paul E. McKenney

Subject: [PATCH RFC clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

From: "Paul E. McKenney" <[email protected]>

Code that checks for clock desynchronization must itself be tested, so
this commit creates a new clocksource.inject_delay_shift_percpu= kernel
boot parameter that adds or subtracts a large value from the check read,
using the specified bit of the CPU ID to determine whether to add or
to subtract.
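
For example, with the shift count set to four, bit 4 of the CPU number
selects the direction of the injected one-second offset. A small user-space
sketch of that arithmetic (the parameter value and CPU range are
illustrative only):

#include <stdio.h>

int main(void)
{
	int inject_delay_shift_percpu = 4;	/* example boot-parameter value */
	long long nsec_per_sec = 1000000000LL;

	for (int cpu = 0; cpu < 64; cpu += 16) {
		/* Same sign selection as the patch: bit 4 of the CPU number. */
		int sign = ((cpu >> inject_delay_shift_percpu) & 0x1) * 2 - 1;

		printf("CPUs %2d-%2d: injected offset %+lld ns\n",
		       cpu, cpu + 15, sign * nsec_per_sec);
	}
	return 0;
}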

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
kernel/time/clocksource.c | 10 +++++++++-
2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 15aa3fe..613ce32 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -593,6 +593,15 @@
times the value specified for inject_delay_freq
of consecutive non-delays.

+ clocksource.inject_delay_shift_percpu= [KNL]
+ Shift count to obtain bit from CPU number to
+ determine whether to shift the time of the per-CPU
+ clock under test ahead or behind. For example,
+ setting this to the value four will result in
+ alternating groups of 16 CPUs shifting ahead and
+ the rest of the CPUs shifting behind. The default
+ value of -1 disables this type of error injection.
+
clocksource.max_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 23bcefe..67cf41c 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -190,6 +190,8 @@ static int inject_delay_freq;
module_param(inject_delay_freq, int, 0644);
static int inject_delay_run = 1;
module_param(inject_delay_run, int, 0644);
+static int inject_delay_shift_percpu = -1;
+module_param(inject_delay_shift_percpu, int, 0644);
static int max_read_retries = 3;
module_param(max_read_retries, int, 0644);

@@ -219,8 +221,14 @@ static cpumask_t cpus_behind;
static void clocksource_verify_one_cpu(void *csin)
{
struct clocksource *cs = (struct clocksource *)csin;
+ s64 delta = 0;
+ int sign;

- __this_cpu_write(csnow_mid, cs->read(cs));
+ if (inject_delay_shift_percpu >= 0) {
+ sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
+ delta = sign * NSEC_PER_SEC;
+ }
+ __this_cpu_write(csnow_mid, cs->read(cs) + delta);
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
--
2.9.5

2021-01-06 00:51:49

by Paul E. McKenney

Subject: [PATCH RFC clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there is no shortage of things that can
delay interrupts-disabled regions of code ranging from SMI handlers to
vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

The first step is a way of injecting such delays, and this
commit therefore provides a clocksource.inject_delay_freq and
clocksource.inject_delay_run kernel boot parameters that specify that
sufficient delay be injected to cause the clocksource_watchdog()
function to mark a clock unstable. This delay is injected every
Nth set of M calls to clocksource_watchdog(), where N is the value
specified for the inject_delay_freq boot parameter and M is the value
specified for the inject_delay_run boot parameter. Values of zero or
less for either parameter disable delay injection, and the default for
clocksource.inject_delay_freq is zero, that is, disabled. The default for
clocksource.inject_delay_run is the value one, that is, single-call runs.
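
To see the resulting cadence concretely, the run/frequency selection
expression can be lifted into a small user-space sketch (the parameter
values and the main() harness, which prints 'D' for a delayed call and '.'
otherwise, are illustrative only and not part of the patch):

#include <stdio.h>
#include <limits.h>

static int inject_delay_freq = 3;	/* example: every 3rd run */
static int inject_delay_run = 2;	/* example: runs of 2 delayed calls */
static int injectfail = -1;

/* Nonzero when a given clocksource_watchdog() call would inject delay. */
static int would_inject_delay(void)
{
	if (inject_delay_freq <= 0 || inject_delay_run <= 0)
		return 0;
	if (injectfail < 0 || injectfail > INT_MAX / 2)
		injectfail = inject_delay_run;
	/* Delay when (count / run) lands on a multiple of freq. */
	return !(++injectfail / inject_delay_run % inject_delay_freq);
}

int main(void)
{
	for (int i = 0; i < 30; i++)
		putchar(would_inject_delay() ? 'D' : '.');
	putchar('\n');
	return 0;
}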

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 22 +++++++++++++++++++
kernel/time/clocksource.c | 28 +++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c722ec1..15aa3fe 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -577,6 +577,28 @@
loops can be debugged more effectively on production
systems.

+ clocksource.inject_delay_freq= [KNL]
+ Number of runs of calls to clocksource_watchdog()
+ before delays are injected between reads from the
+ two clocksources. Values less than or equal to
+ zero disable this delay injection. These delays
+ can cause clocks to be marked unstable, so use
+ of this parameter should therefore be avoided on
+ production systems. Defaults to zero (disabled).
+
+ clocksource.inject_delay_run= [KNL]
+ Run lengths of clocksource_watchdog() delay
+ injections. Specifying the value 8 will result
+ in eight consecutive delays followed by eight
+ times the value specified for inject_delay_freq
+ of consecutive non-delays.
+
+ clocksource.max_read_retries= [KNL]
+ Number of clocksource_watchdog() retries due to
+ external delays before the clock will be marked
+ unstable. Defaults to three retries, that is,
+ four attempts to read the clock under test.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a..a0d9d36 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -14,6 +14,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/delay.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -184,6 +185,32 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static int inject_delay_freq;
+module_param(inject_delay_freq, int, 0644);
+static int inject_delay_run = 1;
+module_param(inject_delay_run, int, 0644);
+static int max_read_retries = 3;
+module_param(max_read_retries, int, 0644);
+
+static void clocksource_watchdog_inject_delay(void)
+{
+ int i;
+ static int injectfail = -1;
+
+ if (inject_delay_freq <= 0 || inject_delay_run <= 0)
+ return;
+ if (injectfail < 0 || injectfail > INT_MAX / 2)
+ injectfail = inject_delay_run;
+ if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
+ printk("%s(): Injecting delay.\n", __func__);
+ injectfail = 0;
+ for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
+ udelay(1000);
+ printk("%s(): Done injecting delay.\n", __func__);
+ }
+ WARN_ON_ONCE(injectfail < 0);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -208,6 +235,7 @@ static void clocksource_watchdog(struct timer_list *unused)

local_irq_disable();
csnow = cs->read(cs);
+ clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
local_irq_enable();

--
2.9.5

2021-01-06 00:52:02

by Paul E. McKenney

Subject: [PATCH RFC clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking

From: "Paul E. McKenney" <[email protected]>

Although smp_call_function() has the advantage of simplicity, using
it to check for cross-CPU clock desynchronization means that any CPU
being slow reduces the sensitivity of the checking across all CPUs.
And it is not uncommon for smp_call_function() latencies to be in the
hundreds of microseconds.

This commit therefore switches to smp_call_function_single(), so that
delays from a given CPU affect only those measurements involving that
particular CPU.
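
Rough numbers make the point: with a single smp_call_function() bracket,
the detection window for every CPU is set by the slowest responder, whereas
per-CPU brackets confine each CPU's latency to its own measurement. A toy
user-space illustration with made-up latencies:

#include <stdio.h>

int main(void)
{
	/* Hypothetical per-CPU response latencies in nanoseconds. */
	long long latency_ns[8] = {
		20000, 25000, 22000, 500000,	/* CPU 3 is slow, e.g. a preempted vCPU */
		21000, 23000, 20000, 24000,
	};
	long long worst = 0;

	for (int cpu = 0; cpu < 8; cpu++)
		if (latency_ns[cpu] > worst)
			worst = latency_ns[cpu];

	/* One bracket around all remote reads: everyone pays for the slowest. */
	printf("global bracket: window ~%lld ns for all CPUs\n", worst);

	/* One bracket per remote read: each CPU pays only for itself. */
	for (int cpu = 0; cpu < 8; cpu++)
		printf("per-CPU bracket, CPU %d: window ~%lld ns\n",
		       cpu, latency_ns[cpu]);
	return 0;
}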

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 67cf41c..31560c6 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void)
}

static struct clocksource *clocksource_verify_work_cs;
-static DEFINE_PER_CPU(u64, csnow_mid);
+static u64 csnow_mid;
static cpumask_t cpus_ahead;
static cpumask_t cpus_behind;

@@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin)
sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
delta = sign * NSEC_PER_SEC;
}
- __this_cpu_write(csnow_mid, cs->read(cs) + delta);
+ csnow_mid = cs->read(cs) + delta;
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
@@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
int cpu;
struct clocksource *cs;
int64_t cs_nsec;
+ int64_t cs_nsec_max;
+ int64_t cs_nsec_min;
u64 csnow_begin;
u64 csnow_end;
- u64 delta;
+ s64 delta;
+ bool firsttime = 1;

cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
if (WARN_ON_ONCE(!cs))
@@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
cs->name, smp_processor_id());
cpumask_clear(&cpus_ahead);
cpumask_clear(&cpus_behind);
- csnow_begin = cs->read(cs);
- smp_call_function(clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
+ preempt_disable();
for_each_online_cpu(cpu) {
if (cpu == smp_processor_id())
continue;
- delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
- if ((s64)delta < 0)
+ csnow_begin = cs->read(cs);
+ smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
- if ((s64)delta < 0)
+ delta = (csnow_end - csnow_mid) & cs->mask;
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_ahead);
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ if (firsttime || cs_nsec > cs_nsec_max)
+ cs_nsec_max = cs_nsec;
+ if (firsttime || cs_nsec < cs_nsec_min)
+ cs_nsec_min = cs_nsec;
+ firsttime = 0;
}
+ preempt_enable();
if (!cpumask_empty(&cpus_ahead))
pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_ahead),
@@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_behind),
smp_processor_id(), cs->name);
- if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
- delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
- cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
- pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
- smp_processor_id(), cs_nsec, cs->name);
- }
+ if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)))
+ pr_warn(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name);
smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
}

--
2.9.5

2021-01-06 00:53:23

by Paul E. McKenney

Subject: [PATCH RFC clocksource 2/5] clocksource: Retry clock read if long delays detected

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there is no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

This commit therefore re-reads the watchdog clock on either side of
the read from the clock under test. If the watchdog clock shows an
excessive time delta between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot
parameter clocksource.max_read_retries, which defaults to three, that
is, up to four reads, one initial and up to three retries. If retries
were required, a message is printed on the console. If the number of
retries is exceeded, the clock under test will be marked unstable.
However, the probability of this happening due to various sorts of
delays is quite small. In addition, the reason (clock-read delays)
for the unstable marking will be apparent.
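
A user-space analog of the read-back pattern, with CLOCK_MONOTONIC standing
in for the watchdog clock and CLOCK_REALTIME for the clock under test (the
threshold and retry count mirror the patch's defaults; this is a sketch of
the idea, not the kernel code):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define MAX_SKEW_NS	(1000000000LL >> 6)	/* mirrors WATCHDOG_MAX_SKEW */
#define MAX_RETRIES	3			/* mirrors max_read_retries */

static int64_t read_ns(clockid_t id)
{
	struct timespec ts;

	clock_gettime(id, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	int nretries = 0;
	int64_t wd_before, wd_after, cs_now;

	do {
		wd_before = read_ns(CLOCK_MONOTONIC);	/* "watchdog" read */
		cs_now = read_ns(CLOCK_REALTIME);	/* "clock under test" */
		wd_after = read_ns(CLOCK_MONOTONIC);
		/* A small read-back gap means nothing delayed the paired reads. */
	} while (wd_after - wd_before > MAX_SKEW_NS && nretries++ < MAX_RETRIES);

	printf("watchdog gap %lld ns after %d retries, clock under test %lld ns\n",
	       (long long)(wd_after - wd_before), nretries, (long long)cs_now);
	return 0;
}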

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
[ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a0d9d36..4663b86 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)

static void clocksource_watchdog_work(struct work_struct *work)
{
@@ -203,7 +204,6 @@ static void clocksource_watchdog_inject_delay(void)
injectfail = inject_delay_run;
if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
printk("%s(): Injecting delay.\n", __func__);
- injectfail = 0;
for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
udelay(1000);
printk("%s(): Done injecting delay.\n", __func__);
@@ -214,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void)
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
- u64 csnow, wdnow, cslast, wdlast, delta;
- int64_t wd_nsec, cs_nsec;
+ u64 csnow, wdnow, wdagain, cslast, wdlast, delta;
+ int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec;
int next_cpu, reset_pending;
+ int nretries;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -225,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused)
reset_pending = atomic_read(&watchdog_reset_pending);

list_for_each_entry(cs, &watchdog_list, wd_list) {
+ nretries = 0;

/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -233,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

+retry:
local_irq_disable();
- csnow = cs->read(cs);
- clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
+ clocksource_watchdog_inject_delay();
+ csnow = cs->read(cs);
+ wdagain = watchdog->read(watchdog);
local_irq_enable();
+ delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
+ wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+ if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
+ wderr_nsec = wdagain_nsec;
+ if (nretries++ < max_read_retries)
+ goto retry;
+ }
+ if (nretries)
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
+ smp_processor_id(), watchdog->name, wderr_nsec, nretries);

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
--
2.9.5

2021-01-06 16:30:16

by Rik van Riel

Subject: Re: [PATCH RFC clocksource 2/5] clocksource: Retry clock read if long delays detected

On Tue, 2021-01-05 at 16:41 -0800, [email protected] wrote:
>
> @@ -203,7 +204,6 @@ static void clocksource_watchdog_inject_delay(void)
> injectfail = inject_delay_run;
> if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
> printk("%s(): Injecting delay.\n", __func__);
> - injectfail = 0;
> for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
> udelay(1000);

Wait, patch 1 just added that line?

Should patch 1 not add it and this patch go without this removal? :)

+ wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+ if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
+ wderr_nsec = wdagain_nsec;
+ if (nretries++ < max_read_retries)
+ goto retry;
+ }

Given that clocksource_cyc2ns uses unsigned multiplication
followed by a right shift, do we need to test for <0?

2021-01-06 19:55:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC clocksource 2/5] clocksource: Retry clock read if long delays detected

On Wed, Jan 06, 2021 at 11:28:00AM -0500, Rik van Riel wrote:
> On Tue, 2021-01-05 at 16:41 -0800, [email protected] wrote:
> >
> > @@ -203,7 +204,6 @@ static void clocksource_watchdog_inject_delay(void)
> > injectfail = inject_delay_run;
> > if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
> > printk("%s(): Injecting delay.\n", __func__);
> > - injectfail = 0;
> > for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
> > udelay(1000);
>
> Wait, patch 1 just added that line?
>
> Should patch 1 not add it and this patch go without this removal? :)

Good catch, will fix. ;-)

> + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
> + wderr_nsec = wdagain_nsec;
> + if (nretries++ < max_read_retries)
> + goto retry;
> + }
>
> Given that clocksource_cyc2ns uses unsigned multiplication
> followed by a right shift, do we need to test for <0?

I am worried about the possibility of the "shift" argument to
clocksource_cyc2ns() being zero. For example, unless I am missing
something, clocksource_tsc has a zero .shift field.
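
A small user-space illustration of that concern (the delta and mult values
below are made up, chosen only so the product fits in 64 bits): with a
nonzero shift the high bits are shifted away, but with a shift of zero a
large product can set bit 63 and come back negative when stored in a signed
64-bit variable, which is exactly what the "< 0" check catches.

#include <stdint.h>
#include <stdio.h>

/* Same arithmetic as clocksource_cyc2ns(): ((u64)cycles * mult) >> shift. */
static int64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (int64_t)((cycles * mult) >> shift);
}

int main(void)
{
	uint64_t delta = 1ULL << 62;	/* hypothetical bogus huge cycle delta */
	uint32_t mult = 2;		/* hypothetical mult; product is 2^63 */

	printf("shift=1: %lld\n", (long long)cyc2ns(delta, mult, 1));
	printf("shift=0: %lld\n", (long long)cyc2ns(delta, mult, 0));	/* negative */
	return 0;
}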

Thanx, Paul

2021-01-06 21:03:21

by Rik van Riel

Subject: Re: [PATCH RFC clocksource 2/5] clocksource: Retry clock read if long delays detected

On Wed, 2021-01-06 at 11:53 -0800, Paul E. McKenney wrote:
> On Wed, Jan 06, 2021 at 11:28:00AM -0500, Rik van Riel wrote:
>
> > + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
> > + wderr_nsec = wdagain_nsec;
> > + if (nretries++ < max_read_retries)
> > + goto retry;
> > + }
> >
> > Given that clocksource_cyc2ns uses unsigned multiplication
> > followed by a right shift, do we need to test for <0?
>
> I am worried about the possibility of the "shift" argument to
> clocksource_cyc2ns() being zero. For example, unless I am missing
> something, clocksource_tsc has a zero .shift field.

Oh, good point!

2021-01-12 10:39:37

by Paul E. McKenney

Subject: [PATCH v2 clocksource] Do not mark clocks unstable due to delays

Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own. This series checks for this, doing limited retries
to get a good set of clock reads. If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

1. Provide module parameters to inject delays in watchdog.

2. Retry clock read if long delays detected.

3. Check per-CPU clock synchronization when marked unstable.

4. Provide a module parameter to fuzz per-CPU clock checking.

5. Do pairwise clock-desynchronization checking.

Changes since v1:

o Applied feedback from Rik van Riel.

o Rebased to v5.11-rc3.

o Stripped "RFC" from the subject lines.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 31 ++++
arch/x86/kernel/kvmclock.c | 2
arch/x86/kernel/tsc.c | 3
include/linux/clocksource.h | 2
kernel/time/clocksource.c | 174 +++++++++++++++++++++---
5 files changed, 188 insertions(+), 24 deletions(-)

2021-01-12 10:39:54

by Paul E. McKenney

Subject: [PATCH v2 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

From: "Paul E. McKenney" <[email protected]>

Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?

This commit therefore adds CPU-to-CPU synchronization checking
for newly unstable clocksources that are marked with the new
CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are
printed, with the caveat that if it is the reporting CPU that is itself
desynchronized, it will appear that all the other clocks are wrong.
Just like in real life.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/kvmclock.c | 2 +-
arch/x86/kernel/tsc.c | 3 +-
include/linux/clocksource.h | 2 +-
kernel/time/clocksource.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa59374..337bb2c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -169,7 +169,7 @@ struct clocksource kvm_clock = {
.read = kvm_clock_get_cycles,
.rating = 400,
.mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU,
.enable = kvm_cs_enable,
};
EXPORT_SYMBOL_GPL(kvm_clock);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f70dffc..5628917 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY,
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
.resume = tsc_resume,
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143d..83a3ebf 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -131,7 +131,7 @@ struct clocksource {
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-
+#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4663b86..23bcefe 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void)
WARN_ON_ONCE(injectfail < 0);
}

+static struct clocksource *clocksource_verify_work_cs;
+static DEFINE_PER_CPU(u64, csnow_mid);
+static cpumask_t cpus_ahead;
+static cpumask_t cpus_behind;
+
+static void clocksource_verify_one_cpu(void *csin)
+{
+ struct clocksource *cs = (struct clocksource *)csin;
+
+ __this_cpu_write(csnow_mid, cs->read(cs));
+}
+
+static void clocksource_verify_percpu_wq(struct work_struct *unused)
+{
+ int cpu;
+ struct clocksource *cs;
+ int64_t cs_nsec;
+ u64 csnow_begin;
+ u64 csnow_end;
+ u64 delta;
+
+ cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
+ if (WARN_ON_ONCE(!cs))
+ return;
+ pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
+ cs->name, smp_processor_id());
+ cpumask_clear(&cpus_ahead);
+ cpumask_clear(&cpus_behind);
+ csnow_begin = cs->read(cs);
+ smp_call_function(clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ for_each_online_cpu(cpu) {
+ if (cpu == smp_processor_id())
+ continue;
+ delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_behind);
+ delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_ahead);
+ }
+ if (!cpumask_empty(&cpus_ahead))
+ pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_ahead),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_behind))
+ pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_behind),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec, cs->name);
+ }
+ smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
+}
+
+static DECLARE_WORK(clocksource_verify_work, clocksource_verify_percpu_wq);
+
+static void clocksource_verify_percpu(struct clocksource *cs)
+{
+ if (!(cs->flags & CLOCK_SOURCE_VERIFY_PERCPU))
+ return;
+ if (smp_load_acquire(&clocksource_verify_work_cs)) { // pairs with release.
+ pr_warn("Previous clocksource synchronization still in flight.\n");
+ return;
+ }
+ smp_store_release(&clocksource_verify_work_cs, cs); //pairs with acquire.
+ queue_work(system_highpri_wq, &clocksource_verify_work);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -284,6 +356,7 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, csnow, cslast, cs->mask);
+ clocksource_verify_percpu(cs);
__clocksource_unstable(cs);
continue;
}
--
2.9.5

2021-01-12 10:40:21

by Paul E. McKenney

Subject: [PATCH v2 clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

From: "Paul E. McKenney" <[email protected]>

Code that checks for clock desynchronization must itself be tested, so
this commit creates a new clocksource.inject_delay_shift_percpu= kernel
boot parameter that adds or subtracts a large value from the check read,
using the specified bit of the CPU ID to determine whether to add or
to subtract.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
kernel/time/clocksource.c | 10 +++++++++-
2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4c59813..ca64b0c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -593,6 +593,15 @@
times the value specified for inject_delay_freq
of consecutive non-delays.

+ clocksource.inject_delay_shift_percpu= [KNL]
+ Shift count to obtain bit from CPU number to
+ determine whether to shift the time of the per-CPU
+ clock under test ahead or behind. For example,
+ setting this to the value four will result in
+ alternating groups of 16 CPUs shifting ahead and
+ the rest of the CPUs shifting behind. The default
+ value of -1 disables this type of error injection.
+
clocksource.max_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 23bcefe..67cf41c 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -190,6 +190,8 @@ static int inject_delay_freq;
module_param(inject_delay_freq, int, 0644);
static int inject_delay_run = 1;
module_param(inject_delay_run, int, 0644);
+static int inject_delay_shift_percpu = -1;
+module_param(inject_delay_shift_percpu, int, 0644);
static int max_read_retries = 3;
module_param(max_read_retries, int, 0644);

@@ -219,8 +221,14 @@ static cpumask_t cpus_behind;
static void clocksource_verify_one_cpu(void *csin)
{
struct clocksource *cs = (struct clocksource *)csin;
+ s64 delta = 0;
+ int sign;

- __this_cpu_write(csnow_mid, cs->read(cs));
+ if (inject_delay_shift_percpu >= 0) {
+ sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
+ delta = sign * NSEC_PER_SEC;
+ }
+ __this_cpu_write(csnow_mid, cs->read(cs) + delta);
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
--
2.9.5

2021-01-12 10:40:29

by Paul E. McKenney

Subject: [PATCH v2 clocksource 2/5] clocksource: Retry clock read if long delays detected

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there is no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

This commit therefore re-reads the watchdog clock on either side of
the read from the clock under test. If the watchdog clock shows an
excessive time delta between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot
parameter clocksource.max_read_retries, which defaults to three, that
is, up to four reads, one initial and up to three retries. If retries
were required, a message is printed on the console. If the number of
retries is exceeded, the clock under test will be marked unstable.
However, the probability of this happening due to various sorts of
delays is quite small. In addition, the reason (clock-read delays)
for the unstable marking will be apparent.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
[ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 545889c..4663b86 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)

static void clocksource_watchdog_work(struct work_struct *work)
{
@@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void)
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
- u64 csnow, wdnow, cslast, wdlast, delta;
- int64_t wd_nsec, cs_nsec;
+ u64 csnow, wdnow, wdagain, cslast, wdlast, delta;
+ int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec;
int next_cpu, reset_pending;
+ int nretries;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused)
reset_pending = atomic_read(&watchdog_reset_pending);

list_for_each_entry(cs, &watchdog_list, wd_list) {
+ nretries = 0;

/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

+retry:
local_irq_disable();
- csnow = cs->read(cs);
- clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
+ clocksource_watchdog_inject_delay();
+ csnow = cs->read(cs);
+ wdagain = watchdog->read(watchdog);
local_irq_enable();
+ delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
+ wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+ if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
+ wderr_nsec = wdagain_nsec;
+ if (nretries++ < max_read_retries)
+ goto retry;
+ }
+ if (nretries)
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
+ smp_processor_id(), watchdog->name, wderr_nsec, nretries);

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
--
2.9.5

2021-01-12 10:40:29

by Paul E. McKenney

Subject: [PATCH v2 clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking

From: "Paul E. McKenney" <[email protected]>

Although smp_call_function() has the advantage of simplicity, using
it to check for cross-CPU clock desynchronization means that any CPU
being slow reduces the sensitivity of the checking across all CPUs.
And it is not uncommon for smp_call_function() latencies to be in the
hundreds of microseconds.

This commit therefore switches to smp_call_function_single(), so that
delays from a given CPU affect only those measurements involving that
particular CPU.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 67cf41c..3bae5fb 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void)
}

static struct clocksource *clocksource_verify_work_cs;
-static DEFINE_PER_CPU(u64, csnow_mid);
+static u64 csnow_mid;
static cpumask_t cpus_ahead;
static cpumask_t cpus_behind;

@@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin)
sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
delta = sign * NSEC_PER_SEC;
}
- __this_cpu_write(csnow_mid, cs->read(cs) + delta);
+ csnow_mid = cs->read(cs) + delta;
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
@@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
int cpu;
struct clocksource *cs;
int64_t cs_nsec;
+ int64_t cs_nsec_max;
+ int64_t cs_nsec_min;
u64 csnow_begin;
u64 csnow_end;
- u64 delta;
+ s64 delta;
+ bool firsttime = 1;

cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
if (WARN_ON_ONCE(!cs))
@@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
cs->name, smp_processor_id());
cpumask_clear(&cpus_ahead);
cpumask_clear(&cpus_behind);
- csnow_begin = cs->read(cs);
- smp_call_function(clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
+ preempt_disable();
for_each_online_cpu(cpu) {
if (cpu == smp_processor_id())
continue;
- delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
- if ((s64)delta < 0)
+ csnow_begin = cs->read(cs);
+ smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
- if ((s64)delta < 0)
+ delta = (csnow_end - csnow_mid) & cs->mask;
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_ahead);
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ if (firsttime || cs_nsec > cs_nsec_max)
+ cs_nsec_max = cs_nsec;
+ if (firsttime || cs_nsec < cs_nsec_min)
+ cs_nsec_min = cs_nsec;
+ firsttime = 0;
}
+ preempt_enable();
if (!cpumask_empty(&cpus_ahead))
pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_ahead),
@@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_behind),
smp_processor_id(), cs->name);
- if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
- delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
- cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
- pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
- smp_processor_id(), cs_nsec, cs->name);
- }
+ if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)))
+ pr_warn(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name);
smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
}

--
2.9.5

2021-01-12 10:40:59

by Paul E. McKenney

Subject: [PATCH v2 clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there is no shortage of things that can
delay interrupts-disabled regions of code ranging from SMI handlers to
vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

The first step is a way of injecting such delays, and this
commit therefore provides clocksource.inject_delay_freq and
clocksource.inject_delay_run kernel boot parameters that specify that
sufficient delay be injected to cause the clocksource_watchdog()
function to mark a clock unstable. This delay is injected every
Nth set of M calls to clocksource_watchdog(), where N is the value
specified for the inject_delay_freq boot parameter and M is the value
specified for the inject_delay_run boot parameter. Values of zero or
less for either parameter disable delay injection, and the default for
clocksource.inject_delay_freq is zero, that is, disabled. The default for
clocksource.inject_delay_run is the value one, that is, single-call runs.

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 22 ++++++++++++++++++++
kernel/time/clocksource.c | 27 +++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e3cdb2..4c59813 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -577,6 +577,28 @@
loops can be debugged more effectively on production
systems.

+ clocksource.inject_delay_freq= [KNL]
+ Number of runs of calls to clocksource_watchdog()
+ before delays are injected between reads from the
+ two clocksources. Values less than or equal to
+ zero disable this delay injection. These delays
+ can cause clocks to be marked unstable, so use
+ of this parameter should therefore be avoided on
+ production systems. Defaults to zero (disabled).
+
+ clocksource.inject_delay_run= [KNL]
+ Run lengths of clocksource_watchdog() delay
+ injections. Specifying the value 8 will result
+ in eight consecutive delays followed by eight
+ times the value specified for inject_delay_freq
+ of consecutive non-delays.
+
+ clocksource.max_read_retries= [KNL]
+ Number of clocksource_watchdog() retries due to
+ external delays before the clock will be marked
+ unstable. Defaults to three retries, that is,
+ four attempts to read the clock under test.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a..545889c 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -14,6 +14,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/delay.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static int inject_delay_freq;
+module_param(inject_delay_freq, int, 0644);
+static int inject_delay_run = 1;
+module_param(inject_delay_run, int, 0644);
+static int max_read_retries = 3;
+module_param(max_read_retries, int, 0644);
+
+static void clocksource_watchdog_inject_delay(void)
+{
+ int i;
+ static int injectfail = -1;
+
+ if (inject_delay_freq <= 0 || inject_delay_run <= 0)
+ return;
+ if (injectfail < 0 || injectfail > INT_MAX / 2)
+ injectfail = inject_delay_run;
+ if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
+ printk("%s(): Injecting delay.\n", __func__);
+ for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
+ udelay(1000);
+ printk("%s(): Done injecting delay.\n", __func__);
+ }
+ WARN_ON_ONCE(injectfail < 0);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -208,6 +234,7 @@ static void clocksource_watchdog(struct timer_list *unused)

local_irq_disable();
csnow = cs->read(cs);
+ clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
local_irq_enable();

--
2.9.5

2021-02-02 23:26:27

by Paul E. McKenney

Subject: [PATCH v3 clocksource] Do not mark clocks unstable due to delays

Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own. This series checks for this, doing limited retries
to get a good set of clock reads. If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

1. Provide module parameters to inject delays in watchdog.

2. Retry clock read if long delays detected.

3. Check per-CPU clock synchronization when marked unstable.

4. Provide a module parameter to fuzz per-CPU clock checking.

5. Do pairwise clock-desynchronization checking.

Changes since v2:

o Rebased to v5.11-rc6.

o Updated Cc: list.

Changes since v1:

o Applied feedback from Rik van Riel.

o Rebased to v5.11-rc3.

o Stripped "RFC" from the subject lines.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 31 ++++
arch/x86/kernel/kvmclock.c | 2
arch/x86/kernel/tsc.c | 3
include/linux/clocksource.h | 2
kernel/time/clocksource.c | 174 +++++++++++++++++++++---
5 files changed, 188 insertions(+), 24 deletions(-)

2021-02-02 23:27:24

by Paul E. McKenney

Subject: [PATCH clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there is no shortage of things that can
delay interrupts-disabled regions of code ranging from SMI handlers to
vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

The first step is a way of injecting such delays, and this
commit therefore provides clocksource.inject_delay_freq and
clocksource.inject_delay_run kernel boot parameters that specify that
sufficient delay be injected to cause the clocksource_watchdog()
function to mark a clock unstable. This delay is injected every
Nth set of M calls to clocksource_watchdog(), where N is the value
specified for the inject_delay_freq boot parameter and M is the value
specified for the inject_delay_run boot parameter. Values of zero or
less for either parameter disable delay injection, and the default for
clocksource.inject_delay_freq is zero, that is, disabled. The default for
clocksource.inject_delay_run is the value one, that is, single-call runs.

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 22 ++++++++++++++++++++
kernel/time/clocksource.c | 27 +++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a10b545..9965266 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -577,6 +577,28 @@
loops can be debugged more effectively on production
systems.

+ clocksource.inject_delay_freq= [KNL]
+ Number of runs of calls to clocksource_watchdog()
+ before delays are injected between reads from the
+ two clocksources. Values less than or equal to
+ zero disable this delay injection. These delays
+ can cause clocks to be marked unstable, so use
+ of this parameter should therefore be avoided on
+ production systems. Defaults to zero (disabled).
+
+ clocksource.inject_delay_run= [KNL]
+ Run lengths of clocksource_watchdog() delay
+ injections. Specifying the value 8 will result
+ in eight consecutive delays followed by eight
+ times the value specified for inject_delay_freq
+ of consecutive non-delays.
+
+ clocksource.max_read_retries= [KNL]
+ Number of clocksource_watchdog() retries due to
+ external delays before the clock will be marked
+ unstable. Defaults to three retries, that is,
+ four attempts to read the clock under test.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a..545889c 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -14,6 +14,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/delay.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static int inject_delay_freq;
+module_param(inject_delay_freq, int, 0644);
+static int inject_delay_run = 1;
+module_param(inject_delay_run, int, 0644);
+static int max_read_retries = 3;
+module_param(max_read_retries, int, 0644);
+
+static void clocksource_watchdog_inject_delay(void)
+{
+ int i;
+ static int injectfail = -1;
+
+ if (inject_delay_freq <= 0 || inject_delay_run <= 0)
+ return;
+ if (injectfail < 0 || injectfail > INT_MAX / 2)
+ injectfail = inject_delay_run;
+ if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
+ printk("%s(): Injecting delay.\n", __func__);
+ for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
+ udelay(1000);
+ printk("%s(): Done injecting delay.\n", __func__);
+ }
+ WARN_ON_ONCE(injectfail < 0);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -208,6 +234,7 @@ static void clocksource_watchdog(struct timer_list *unused)

local_irq_disable();
csnow = cs->read(cs);
+ clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
local_irq_enable();

--
2.9.5

2021-02-02 23:28:45

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

From: "Paul E. McKenney" <[email protected]>

Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?

This commit therefore adds CPU-to-CPU synchronization checking
for newly unstable clocksources that are marked with the new
CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are
printed, with the caveat that if it is the reporting CPU that is itself
desynchronized, it will appear that all the other clocks are wrong.
Just like in real life.
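
For illustration only (this is not part of the patch), the ahead/behind
classification reduces to the sign of a mask-wrapped delta between the remote
CPU's mid-point read and the local begin/end reads. A minimal userspace
sketch, assuming a 64-bit clocksource mask (as for TSC and kvm-clock) and
made-up cycle counts:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t mask = ~0ULL;		/* 64-bit clocksource mask */
	uint64_t csnow_begin = 1000;	/* local read before the remote read */
	uint64_t csnow_mid = 900;	/* remote CPU's read, an earlier value */
	uint64_t csnow_end = 1100;	/* local read after the remote read */
	int64_t delta;

	delta = (int64_t)((csnow_mid - csnow_begin) & mask);
	if (delta < 0)
		printf("remote CPU behind by %lld cycles\n", (long long)-delta);
	delta = (int64_t)((csnow_end - csnow_mid) & mask);
	if (delta < 0)
		printf("remote CPU ahead\n");
	return 0;
}

Here the remote CPU's value lags csnow_begin, so it is reported as behind.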

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/kvmclock.c | 2 +-
arch/x86/kernel/tsc.c | 3 +-
include/linux/clocksource.h | 2 +-
kernel/time/clocksource.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa59374..337bb2c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -169,7 +169,7 @@ struct clocksource kvm_clock = {
.read = kvm_clock_get_cycles,
.rating = 400,
.mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU,
.enable = kvm_cs_enable,
};
EXPORT_SYMBOL_GPL(kvm_clock);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f70dffc..5628917 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY,
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
.resume = tsc_resume,
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143d..83a3ebf 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -131,7 +131,7 @@ struct clocksource {
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-
+#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4663b86..23bcefe 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void)
WARN_ON_ONCE(injectfail < 0);
}

+static struct clocksource *clocksource_verify_work_cs;
+static DEFINE_PER_CPU(u64, csnow_mid);
+static cpumask_t cpus_ahead;
+static cpumask_t cpus_behind;
+
+static void clocksource_verify_one_cpu(void *csin)
+{
+ struct clocksource *cs = (struct clocksource *)csin;
+
+ __this_cpu_write(csnow_mid, cs->read(cs));
+}
+
+static void clocksource_verify_percpu_wq(struct work_struct *unused)
+{
+ int cpu;
+ struct clocksource *cs;
+ int64_t cs_nsec;
+ u64 csnow_begin;
+ u64 csnow_end;
+ u64 delta;
+
+ cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
+ if (WARN_ON_ONCE(!cs))
+ return;
+ pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
+ cs->name, smp_processor_id());
+ cpumask_clear(&cpus_ahead);
+ cpumask_clear(&cpus_behind);
+ csnow_begin = cs->read(cs);
+ smp_call_function(clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ for_each_online_cpu(cpu) {
+ if (cpu == smp_processor_id())
+ continue;
+ delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_behind);
+ delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_ahead);
+ }
+ if (!cpumask_empty(&cpus_ahead))
+ pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_ahead),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_behind))
+ pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_behind),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec, cs->name);
+ }
+ smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
+}
+
+static DECLARE_WORK(clocksource_verify_work, clocksource_verify_percpu_wq);
+
+static void clocksource_verify_percpu(struct clocksource *cs)
+{
+ if (!(cs->flags & CLOCK_SOURCE_VERIFY_PERCPU))
+ return;
+ if (smp_load_acquire(&clocksource_verify_work_cs)) { // pairs with release.
+ pr_warn("Previous clocksource synchronization still in flight.\n");
+ return;
+ }
+ smp_store_release(&clocksource_verify_work_cs, cs); //pairs with acquire.
+ queue_work(system_highpri_wq, &clocksource_verify_work);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -284,6 +356,7 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, csnow, cslast, cs->mask);
+ clocksource_verify_percpu(cs);
__clocksource_unstable(cs);
continue;
}
--
2.9.5

2021-02-02 23:31:18

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

From: "Paul E. McKenney" <[email protected]>

Code that checks for clock desynchronization must itself be tested, so
this commit creates a new clocksource.inject_delay_shift_percpu= kernel
boot parameter that adds or subtracts a large value from the check read,
using the specified bit of the CPU ID to determine whether to add or
to subtract.
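
For illustration only (this is not part of the patch), a minimal userspace
sketch of how the selected CPU-number bit partitions CPUs into "ahead" and
"behind" sets, assuming the example value 4 for
clocksource.inject_delay_shift_percpu:

#include <stdio.h>

int main(void)
{
	int inject_delay_shift_percpu = 4;	/* assumed example value */
	int cpu;

	for (cpu = 0; cpu < 48; cpu++) {
		int sign = ((cpu >> inject_delay_shift_percpu) & 0x1) * 2 - 1;

		printf("CPU %2d: clock shifted %s\n",
		       cpu, sign > 0 ? "ahead" : "behind");
	}
	return 0;
}

With this value, CPUs 0-15 and 32-47 are shifted behind while CPUs 16-31 are
shifted ahead, that is, alternating groups of 16 CPUs.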

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
kernel/time/clocksource.c | 10 +++++++++-
2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9965266..f561e94 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -593,6 +593,15 @@
times the value specified for inject_delay_freq
of consecutive non-delays.

+ clocksource.inject_delay_shift_percpu= [KNL]
+ Shift count to obtain bit from CPU number to
+ determine whether to shift the time of the per-CPU
+ clock under test ahead or behind. For example,
+ setting this to the value four will result in
+ alternating groups of 16 CPUs shifting ahead and
+ the rest of the CPUs shifting behind. The default
+ value of -1 disable this type of error injection.
+
clocksource.max_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 23bcefe..67cf41c 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -190,6 +190,8 @@ static int inject_delay_freq;
module_param(inject_delay_freq, int, 0644);
static int inject_delay_run = 1;
module_param(inject_delay_run, int, 0644);
+static int inject_delay_shift_percpu = -1;
+module_param(inject_delay_shift_percpu, int, 0644);
static int max_read_retries = 3;
module_param(max_read_retries, int, 0644);

@@ -219,8 +221,14 @@ static cpumask_t cpus_behind;
static void clocksource_verify_one_cpu(void *csin)
{
struct clocksource *cs = (struct clocksource *)csin;
+ s64 delta = 0;
+ int sign;

- __this_cpu_write(csnow_mid, cs->read(cs));
+ if (inject_delay_shift_percpu >= 0) {
+ sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
+ delta = sign * NSEC_PER_SEC;
+ }
+ __this_cpu_write(csnow_mid, cs->read(cs) + delta);
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
--
2.9.5

2021-02-02 23:31:18

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking

From: "Paul E. McKenney" <[email protected]>

Although smp_call_function() has the advantage of simplicity, using
it to check for cross-CPU clock desynchronization means that any CPU
being slow reduces the sensitivity of the checking across all CPUs.
And it is not uncommon for smp_call_function() latencies to be in the
hundreds of microseconds.

This commit therefore switches to smp_call_function_single(), so that
delays from a given CPU affect only those measurements involving that
particular CPU.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 67cf41c..3bae5fb 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void)
}

static struct clocksource *clocksource_verify_work_cs;
-static DEFINE_PER_CPU(u64, csnow_mid);
+static u64 csnow_mid;
static cpumask_t cpus_ahead;
static cpumask_t cpus_behind;

@@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin)
sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
delta = sign * NSEC_PER_SEC;
}
- __this_cpu_write(csnow_mid, cs->read(cs) + delta);
+ csnow_mid = cs->read(cs) + delta;
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
@@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
int cpu;
struct clocksource *cs;
int64_t cs_nsec;
+ int64_t cs_nsec_max;
+ int64_t cs_nsec_min;
u64 csnow_begin;
u64 csnow_end;
- u64 delta;
+ s64 delta;
+ bool firsttime = 1;

cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
if (WARN_ON_ONCE(!cs))
@@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
cs->name, smp_processor_id());
cpumask_clear(&cpus_ahead);
cpumask_clear(&cpus_behind);
- csnow_begin = cs->read(cs);
- smp_call_function(clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
+ preempt_disable();
for_each_online_cpu(cpu) {
if (cpu == smp_processor_id())
continue;
- delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
- if ((s64)delta < 0)
+ csnow_begin = cs->read(cs);
+ smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
- if ((s64)delta < 0)
+ delta = (csnow_end - csnow_mid) & cs->mask;
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_ahead);
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ if (firsttime || cs_nsec > cs_nsec_max)
+ cs_nsec_max = cs_nsec;
+ if (firsttime || cs_nsec < cs_nsec_min)
+ cs_nsec_min = cs_nsec;
+ firsttime = 0;
}
+ preempt_enable();
if (!cpumask_empty(&cpus_ahead))
pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_ahead),
@@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_behind),
smp_processor_id(), cs->name);
- if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
- delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
- cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
- pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
- smp_processor_id(), cs_nsec, cs->name);
- }
+ if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)))
+ pr_warn(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name);
smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
}

--
2.9.5

2021-02-02 23:32:10

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 2/5] clocksource: Retry clock read if long delays detected

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there is no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

This commit therefore re-reads the watchdog clock on either side of
the read from the clock under test. If the watchdog clock shows an
excessive time delta between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot
parameter clocksource.max_read_retries, which defaults to three, that
is, up to four reads, one initial and up to three retries. If retries
were required, a message is printed on the console. If the number of
retries is exceeded, the clock under test will be marked unstable.
However, the probability of this happening due to various sorts of
delays is quite small. In addition, the reason (clock-read delays)
for the unstable marking will be apparent.
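
For illustration only (this is not part of the patch), a minimal userspace
sketch of the read-bracketing pattern, using clock_gettime() as a stand-in
for both the watchdog read and the read of the clock under test, with limits
mirroring WATCHDOG_MAX_SKEW and the default max_read_retries:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define MAX_SKEW_NS	(1000000000ULL >> 6)	/* mirrors WATCHDOG_MAX_SKEW */
#define MAX_RETRIES	3			/* mirrors max_read_retries */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	uint64_t wdnow, wdagain, csnow;
	int nretries = 0;

retry:
	wdnow = now_ns();	/* first watchdog read */
	csnow = now_ns();	/* stand-in for cs->read(cs) */
	wdagain = now_ns();	/* second watchdog read */
	if (wdagain - wdnow > MAX_SKEW_NS && nretries++ < MAX_RETRIES)
		goto retry;
	printf("clock read %llu ns after %d retries\n",
	       (unsigned long long)csnow, nretries);
	return 0;
}

If the two watchdog reads are too far apart, the whole triple of reads is
simply redone, up to the retry limit.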

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
[ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 545889c..4663b86 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)

static void clocksource_watchdog_work(struct work_struct *work)
{
@@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void)
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
- u64 csnow, wdnow, cslast, wdlast, delta;
- int64_t wd_nsec, cs_nsec;
+ u64 csnow, wdnow, wdagain, cslast, wdlast, delta;
+ int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec;
int next_cpu, reset_pending;
+ int nretries;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused)
reset_pending = atomic_read(&watchdog_reset_pending);

list_for_each_entry(cs, &watchdog_list, wd_list) {
+ nretries = 0;

/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

+retry:
local_irq_disable();
- csnow = cs->read(cs);
- clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
+ clocksource_watchdog_inject_delay();
+ csnow = cs->read(cs);
+ wdagain = watchdog->read(watchdog);
local_irq_enable();
+ delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
+ wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+ if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
+ wderr_nsec = wdagain_nsec;
+ if (nretries++ < max_read_retries)
+ goto retry;
+ }
+ if (nretries)
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
+ smp_processor_id(), watchdog->name, wderr_nsec, nretries);

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
--
2.9.5

2021-02-03 00:48:17

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

On 2/2/21 9:06 AM, [email protected] wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> Code that checks for clock desynchronization must itself be tested, so
> this commit creates a new clocksource.inject_delay_shift_percpu= kernel
> boot parameter that adds or subtracts a large value from the check read,
> using the specified bit of the CPU ID to determine whether to add or
> to subtract.
>
> Cc: John Stultz <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Stephen Boyd <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Reported-by: Chris Mason <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
> kernel/time/clocksource.c | 10 +++++++++-
> 2 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9965266..f561e94 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -593,6 +593,15 @@
> times the value specified for inject_delay_freq
> of consecutive non-delays.
>
> + clocksource.inject_delay_shift_percpu= [KNL]
> + Shift count to obtain bit from CPU number to
> + determine whether to shift the time of the per-CPU
> + clock under test ahead or behind. For example,

It's a good thing that you give an example -- it helps a little bit.
That sentence above needs to be rewritten...

> + setting this to the value four will result in
> + alternating groups of 16 CPUs shifting ahead and
> + the rest of the CPUs shifting behind. The default
> + value of -1 disable this type of error injection.

disables

> +
> clocksource.max_read_retries= [KNL]
> Number of clocksource_watchdog() retries due to
> external delays before the clock will be marked


--
~Randy

2021-02-03 01:05:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

On Tue, Feb 02, 2021 at 11:51:02AM -0800, Randy Dunlap wrote:
> On 2/2/21 9:06 AM, [email protected] wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > Code that checks for clock desynchronization must itself be tested, so
> > this commit creates a new clocksource.inject_delay_shift_percpu= kernel
> > boot parameter that adds or subtracts a large value from the check read,
> > using the specified bit of the CPU ID to determine whether to add or
> > to subtract.
> >
> > Cc: John Stultz <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Stephen Boyd <[email protected]>
> > Cc: Jonathan Corbet <[email protected]>
> > Cc: Mark Rutland <[email protected]>
> > Cc: Marc Zyngier <[email protected]>
> > Cc: Andi Kleen <[email protected]>
> > Reported-by: Chris Mason <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
> > kernel/time/clocksource.c | 10 +++++++++-
> > 2 files changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 9965266..f561e94 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -593,6 +593,15 @@
> > times the value specified for inject_delay_freq
> > of consecutive non-delays.
> >
> > + clocksource.inject_delay_shift_percpu= [KNL]
> > + Shift count to obtain bit from CPU number to
> > + determine whether to shift the time of the per-CPU
> > + clock under test ahead or behind. For example,
>
> It's a good thing that you give an example -- it helps a little bit.
> That sentence above needs to be rewritten...

That is a bit obscure, now that you mention it.

> > + setting this to the value four will result in
> > + alternating groups of 16 CPUs shifting ahead and
> > + the rest of the CPUs shifting behind. The default
> > + value of -1 disable this type of error injection.
>
> disables

Good eyes!

So how about like this?

clocksource.inject_delay_shift_percpu= [KNL]
Clocksource delay injection partitions the CPUs
into two sets, one whose clocks are moved ahead
and the other whose clocks are moved behind.
This kernel parameter selects the CPU-number
bit that determines which of these two sets the
corresponding CPU is placed into. For example,
setting this parameter to the value four will
result in the first set containing alternating
groups of 16 CPUs whose clocks are moved ahead,
while the second set will contain the rest of
the CPUs, whose clocks are moved behind.

The default value of -1 disables this type of
error injection.

Thanx, Paul

> > +
> > clocksource.max_read_retries= [KNL]
> > Number of clocksource_watchdog() retries due to
> > external delays before the clock will be marked
>
>
> --
> ~Randy
>

2021-02-03 01:34:21

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

On 2/2/21 4:50 PM, Paul E. McKenney wrote:
> On Tue, Feb 02, 2021 at 11:51:02AM -0800, Randy Dunlap wrote:
>> On 2/2/21 9:06 AM, [email protected] wrote:
>>> From: "Paul E. McKenney" <[email protected]>
>>>
>>> Code that checks for clock desynchronization must itself be tested, so
>>> this commit creates a new clocksource.inject_delay_shift_percpu= kernel
>>> boot parameter that adds or subtracts a large value from the check read,
>>> using the specified bit of the CPU ID to determine whether to add or
>>> to subtract.
>>>
>>> Cc: John Stultz <[email protected]>
>>> Cc: Thomas Gleixner <[email protected]>
>>> Cc: Stephen Boyd <[email protected]>
>>> Cc: Jonathan Corbet <[email protected]>
>>> Cc: Mark Rutland <[email protected]>
>>> Cc: Marc Zyngier <[email protected]>
>>> Cc: Andi Kleen <[email protected]>
>>> Reported-by: Chris Mason <[email protected]>
>>> Signed-off-by: Paul E. McKenney <[email protected]>
>>> ---
>>> Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
>>> kernel/time/clocksource.c | 10 +++++++++-
>>> 2 files changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index 9965266..f561e94 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -593,6 +593,15 @@
>>> times the value specified for inject_delay_freq
>>> of consecutive non-delays.
>>>
>>> + clocksource.inject_delay_shift_percpu= [KNL]
>>> + Shift count to obtain bit from CPU number to
>>> + determine whether to shift the time of the per-CPU
>>> + clock under test ahead or behind. For example,
>>
>> It's a good thing that you give an example -- it helps a little bit.
>> That sentence above needs to be rewritten...
>
> That is a bit obscure, now that you mention it.
>
>>> + setting this to the value four will result in
>>> + alternating groups of 16 CPUs shifting ahead and
>>> + the rest of the CPUs shifting behind. The default
>>> + value of -1 disable this type of error injection.
>>
>> disables
>
> Good eyes!
>
> So how about like this?

Much better, thanks.

> clocksource.inject_delay_shift_percpu= [KNL]
> Clocksource delay injection partitions the CPUs
> into two sets, one whose clocks are moved ahead
> and the other whose clocks are moved behind.
> This kernel parameter selects the CPU-number
> bit that determines which of these two sets the
> corresponding CPU is placed into. For example,
> setting this parameter to the value four will

I know that in "writing," "four" should be written out as you have it,
but IMO using "4" here would be much better. FWIW.

> result in the first set containing alternating
> groups of 16 CPUs whose clocks are moved ahead,
> while the second set will contain the rest of
> the CPUs, whose clocks are moved behind.
>
> The default value of -1 disables this type of
> error injection.
>
> Thanx, Paul
>
>>> +
>>> clocksource.max_read_retries= [KNL]
>>> Number of clocksource_watchdog() retries due to
>>> external delays before the clock will be marked


thanks!
--
~Randy

2021-02-03 01:42:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

On Tue, Feb 02, 2021 at 05:31:55PM -0800, Randy Dunlap wrote:
> On 2/2/21 4:50 PM, Paul E. McKenney wrote:
> > On Tue, Feb 02, 2021 at 11:51:02AM -0800, Randy Dunlap wrote:
> >> On 2/2/21 9:06 AM, [email protected] wrote:
> >>> From: "Paul E. McKenney" <[email protected]>
> >>>
> >>> Code that checks for clock desynchronization must itself be tested, so
> >>> this commit creates a new clocksource.inject_delay_shift_percpu= kernel
> >>> boot parameter that adds or subtracts a large value from the check read,
> >>> using the specified bit of the CPU ID to determine whether to add or
> >>> to subtract.
> >>>
> >>> Cc: John Stultz <[email protected]>
> >>> Cc: Thomas Gleixner <[email protected]>
> >>> Cc: Stephen Boyd <[email protected]>
> >>> Cc: Jonathan Corbet <[email protected]>
> >>> Cc: Mark Rutland <[email protected]>
> >>> Cc: Marc Zyngier <[email protected]>
> >>> Cc: Andi Kleen <[email protected]>
> >>> Reported-by: Chris Mason <[email protected]>
> >>> Signed-off-by: Paul E. McKenney <[email protected]>
> >>> ---
> >>> Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
> >>> kernel/time/clocksource.c | 10 +++++++++-
> >>> 2 files changed, 18 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >>> index 9965266..f561e94 100644
> >>> --- a/Documentation/admin-guide/kernel-parameters.txt
> >>> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >>> @@ -593,6 +593,15 @@
> >>> times the value specified for inject_delay_freq
> >>> of consecutive non-delays.
> >>>
> >>> + clocksource.inject_delay_shift_percpu= [KNL]
> >>> + Shift count to obtain bit from CPU number to
> >>> + determine whether to shift the time of the per-CPU
> >>> + clock under test ahead or behind. For example,
> >>
> >> It's a good thing that you give an example -- it helps a little bit.
> >> That sentence above needs to be rewritten...
> >
> > That is a bit obscure, now that you mention it.
> >
> >>> + setting this to the value four will result in
> >>> + alternating groups of 16 CPUs shifting ahead and
> >>> + the rest of the CPUs shifting behind. The default
> >>> + value of -1 disable this type of error injection.
> >>
> >> disables
> >
> > Good eyes!
> >
> > So how about like this?
>
> Much better, thanks.
>
> > clocksource.inject_delay_shift_percpu= [KNL]
> > Clocksource delay injection partitions the CPUs
> > into two sets, one whose clocks are moved ahead
> > and the other whose clocks are moved behind.
> > This kernel parameter selects the CPU-number
> > bit that determines which of these two sets the
> > corresponding CPU is placed into. For example,
> > setting this parameter to the value four will
>
> I know that in "writing," "four" should be written out as you have it,
> but IMO using "4" here would be much better. FWIW.

As long as it is "four" and not "fore"! Updated as requested. ;-)

Thanx, Paul

> > result in the first set containing alternating
> > groups of 16 CPUs whose clocks are moved ahead,
> > while the second set will contain the rest of
> > the CPUs, whose clocks are moved behind.
> >
> > The default value of -1 disables this type of
> > error injection.
> >
> > Thanx, Paul
> >
> >>> +
> >>> clocksource.max_read_retries= [KNL]
> >>> Number of clocksource_watchdog() retries due to
> >>> external delays before the clock will be marked
>
>
> thanks!
> --
> ~Randy
>

2021-02-17 21:30:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3 clocksource] Do not mark clocks unstable due to delays

Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own. This series checks for this, doing limited retries
to get a good set of clock reads. If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

1. Provide module parameters to inject delays in watchdog.

2. Retry clock read if long delays detected.

3. Check per-CPU clock synchronization when marked unstable.

4. Provide a module parameter to fuzz per-CPU clock checking.

5. Do pairwise clock-desynchronization checking.

Changes since v3:

o Rebased to v5.11.

o Apply Randy Dunlap feedback.

Changes since v2:

o Rebased to v5.11-rc6.

o Updated Cc: list.

Changes since v1:

o Applied feedback from Rik van Riel.

o Rebased to v5.11-rc3.

o Stripped "RFC" from the subject lines.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 38 +++++
arch/x86/kernel/kvmclock.c | 2
arch/x86/kernel/tsc.c | 3
include/linux/clocksource.h | 2
kernel/time/clocksource.c | 174 +++++++++++++++++++++---
5 files changed, 195 insertions(+), 24 deletions(-)

2021-02-17 21:31:06

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there is no shortage of things that can
delay interrupts-disabled regions of code ranging from SMI handlers to
vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

The first step is a way of injecting such delays, and this
commit therefore provides a clocksource.inject_delay_freq and
clocksource.inject_delay_run kernel boot parameters that specify that
sufficient delay be injected to cause the clocksource_watchdog()
function to mark a clock unstable. This delay is injected every
Nth set of M calls to clocksource_watchdog(), where N is the value
specified for the inject_delay_freq boot parameter and M is the value
specified for the inject_delay_run boot parameter. Values of zero or
less for either parameter disable delay injection, and the default for
clocksource.inject_delay_freq is zero, that is, disabled. The default for
clocksource.inject_delay_run is the value one, that is, single-call runs.

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 22 ++++++++++++++++++++
kernel/time/clocksource.c | 27 +++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a10b545..9965266 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -577,6 +577,28 @@
loops can be debugged more effectively on production
systems.

+ clocksource.inject_delay_freq= [KNL]
+ Number of runs of calls to clocksource_watchdog()
+ before delays are injected between reads from the
+ two clocksources. Values less than or equal to
+ zero disable this delay injection. These delays
+ can cause clocks to be marked unstable, so use
+ of this parameter should therefore be avoided on
+ production systems. Defaults to zero (disabled).
+
+ clocksource.inject_delay_run= [KNL]
+ Run lengths of clocksource_watchdog() delay
+ injections. Specifying the value 8 will result
+ in eight consecutive delays followed by eight
+ times the value specified for inject_delay_freq
+ of consecutive non-delays.
+
+ clocksource.max_read_retries= [KNL]
+ Number of clocksource_watchdog() retries due to
+ external delays before the clock will be marked
+ unstable. Defaults to three retries, that is,
+ four attempts to read the clock under test.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a..4be4391 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -14,6 +14,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/delay.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static int inject_delay_freq;
+module_param(inject_delay_freq, int, 0644);
+static int inject_delay_run = 1;
+module_param(inject_delay_run, int, 0644);
+static int max_read_retries = 3;
+module_param(max_read_retries, int, 0644);
+
+static void clocksource_watchdog_inject_delay(void)
+{
+ int i;
+ static int injectfail = -1;
+
+ if (inject_delay_freq <= 0 || inject_delay_run <= 0)
+ return;
+ if (injectfail < 0 || injectfail > INT_MAX / 2)
+ injectfail = inject_delay_run;
+ if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
+ pr_warn("%s(): Injecting delay.\n", __func__);
+ for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
+ udelay(1000);
+ pr_warn("%s(): Done injecting delay.\n", __func__);
+ }
+ WARN_ON_ONCE(injectfail < 0);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -208,6 +234,7 @@ static void clocksource_watchdog(struct timer_list *unused)

local_irq_disable();
csnow = cs->read(cs);
+ clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
local_irq_enable();

--
2.9.5

2021-02-17 21:31:33

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

From: "Paul E. McKenney" <[email protected]>

Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?

This commit therefore adds CPU-to-CPU synchronization checking
for newly unstable clocksources that are marked with the new
CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are
printed, with the caveat that if it is the reporting CPU that is itself
desynchronized, it will appear that all the other clocks are wrong.
Just like in real life.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/kvmclock.c | 2 +-
arch/x86/kernel/tsc.c | 3 +-
include/linux/clocksource.h | 2 +-
kernel/time/clocksource.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa59374..337bb2c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -169,7 +169,7 @@ struct clocksource kvm_clock = {
.read = kvm_clock_get_cycles,
.rating = 400,
.mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU,
.enable = kvm_cs_enable,
};
EXPORT_SYMBOL_GPL(kvm_clock);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f70dffc..5628917 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY,
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
.resume = tsc_resume,
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143d..83a3ebf 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -131,7 +131,7 @@ struct clocksource {
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-
+#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 3f734c6..663bc53 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void)
WARN_ON_ONCE(injectfail < 0);
}

+static struct clocksource *clocksource_verify_work_cs;
+static DEFINE_PER_CPU(u64, csnow_mid);
+static cpumask_t cpus_ahead;
+static cpumask_t cpus_behind;
+
+static void clocksource_verify_one_cpu(void *csin)
+{
+ struct clocksource *cs = (struct clocksource *)csin;
+
+ __this_cpu_write(csnow_mid, cs->read(cs));
+}
+
+static void clocksource_verify_percpu_wq(struct work_struct *unused)
+{
+ int cpu;
+ struct clocksource *cs;
+ int64_t cs_nsec;
+ u64 csnow_begin;
+ u64 csnow_end;
+ u64 delta;
+
+ cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
+ if (WARN_ON_ONCE(!cs))
+ return;
+ pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
+ cs->name, smp_processor_id());
+ cpumask_clear(&cpus_ahead);
+ cpumask_clear(&cpus_behind);
+ csnow_begin = cs->read(cs);
+ smp_call_function(clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ for_each_online_cpu(cpu) {
+ if (cpu == smp_processor_id())
+ continue;
+ delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_behind);
+ delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_ahead);
+ }
+ if (!cpumask_empty(&cpus_ahead))
+ pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_ahead),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_behind))
+ pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_behind),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec, cs->name);
+ }
+ smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
+}
+
+static DECLARE_WORK(clocksource_verify_work, clocksource_verify_percpu_wq);
+
+static void clocksource_verify_percpu(struct clocksource *cs)
+{
+ if (!(cs->flags & CLOCK_SOURCE_VERIFY_PERCPU))
+ return;
+ if (smp_load_acquire(&clocksource_verify_work_cs)) { // pairs with release.
+ pr_warn("Previous clocksource synchronization still in flight.\n");
+ return;
+ }
+ smp_store_release(&clocksource_verify_work_cs, cs); //pairs with acquire.
+ queue_work(system_highpri_wq, &clocksource_verify_work);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -284,6 +356,7 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, csnow, cslast, cs->mask);
+ clocksource_verify_percpu(cs);
__clocksource_unstable(cs);
continue;
}
--
2.9.5

2021-02-17 21:31:53

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

From: "Paul E. McKenney" <[email protected]>

Code that checks for clock desynchronization must itself be tested, so
this commit creates a new clocksource.inject_delay_shift_percpu= kernel
boot parameter that adds or subtracts a large value from the check read,
using the specified bit of the CPU ID to determine whether to add or
to subtract.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Apply Randy Dunlap feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 16 ++++++++++++++++
kernel/time/clocksource.c | 10 +++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9965266..628e87f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -593,6 +593,22 @@
times the value specified for inject_delay_freq
of consecutive non-delays.

+ clocksource.inject_delay_shift_percpu= [KNL]
+ Clocksource delay injection partitions the CPUs
+ into two sets, one whose clocks are moved ahead
+ and the other whose clocks are moved behind.
+ This kernel parameter selects the CPU-number
+ bit that determines which of these two sets the
+ corresponding CPU is placed into. For example,
+ setting this parameter to the value 4 will result
+ in the first set containing alternating groups
+ of 16 CPUs whose clocks are moved ahead, while
+ the second set will contain the rest of the CPUs,
+ whose clocks are moved behind.
+
+ The default value of -1 disables this type of
+ error injection.
+
clocksource.max_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 663bc53..df48416 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -190,6 +190,8 @@ static int inject_delay_freq;
module_param(inject_delay_freq, int, 0644);
static int inject_delay_run = 1;
module_param(inject_delay_run, int, 0644);
+static int inject_delay_shift_percpu = -1;
+module_param(inject_delay_shift_percpu, int, 0644);
static int max_read_retries = 3;
module_param(max_read_retries, int, 0644);

@@ -219,8 +221,14 @@ static cpumask_t cpus_behind;
static void clocksource_verify_one_cpu(void *csin)
{
struct clocksource *cs = (struct clocksource *)csin;
+ s64 delta = 0;
+ int sign;

- __this_cpu_write(csnow_mid, cs->read(cs));
+ if (inject_delay_shift_percpu >= 0) {
+ sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
+ delta = sign * NSEC_PER_SEC;
+ }
+ __this_cpu_write(csnow_mid, cs->read(cs) + delta);
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
--
2.9.5

2021-02-17 21:32:00

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking

From: "Paul E. McKenney" <[email protected]>

Although smp_call_function() has the advantage of simplicity, using
it to check for cross-CPU clock desynchronization means that any CPU
being slow reduces the sensitivity of the checking across all CPUs.
And it is not uncommon for smp_call_function() latencies to be in the
hundreds of microseconds.

This commit therefore switches to smp_call_function_single(), so that
delays from a given CPU affect only those measurements involving that
particular CPU.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index df48416..4161c84 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void)
}

static struct clocksource *clocksource_verify_work_cs;
-static DEFINE_PER_CPU(u64, csnow_mid);
+static u64 csnow_mid;
static cpumask_t cpus_ahead;
static cpumask_t cpus_behind;

@@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin)
sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
delta = sign * NSEC_PER_SEC;
}
- __this_cpu_write(csnow_mid, cs->read(cs) + delta);
+ csnow_mid = cs->read(cs) + delta;
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
@@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
int cpu;
struct clocksource *cs;
int64_t cs_nsec;
+ int64_t cs_nsec_max;
+ int64_t cs_nsec_min;
u64 csnow_begin;
u64 csnow_end;
- u64 delta;
+ s64 delta;
+ bool firsttime = 1;

cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
if (WARN_ON_ONCE(!cs))
@@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
cs->name, smp_processor_id());
cpumask_clear(&cpus_ahead);
cpumask_clear(&cpus_behind);
- csnow_begin = cs->read(cs);
- smp_call_function(clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
+ preempt_disable();
for_each_online_cpu(cpu) {
if (cpu == smp_processor_id())
continue;
- delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
- if ((s64)delta < 0)
+ csnow_begin = cs->read(cs);
+ smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
- if ((s64)delta < 0)
+ delta = (csnow_end - csnow_mid) & cs->mask;
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_ahead);
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ if (firsttime || cs_nsec > cs_nsec_max)
+ cs_nsec_max = cs_nsec;
+ if (firsttime || cs_nsec < cs_nsec_min)
+ cs_nsec_min = cs_nsec;
+ firsttime = 0;
}
+ preempt_enable();
if (!cpumask_empty(&cpus_ahead))
pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_ahead),
@@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_behind),
smp_processor_id(), cs->name);
- if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
- delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
- cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
- pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
- smp_processor_id(), cs_nsec, cs->name);
- }
+ if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)))
+ pr_warn(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name);
smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
}

--
2.9.5

2021-02-17 21:33:37

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH clocksource 2/5] clocksource: Retry clock read if long delays detected

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there is no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

This commit therefore re-reads the watchdog clock on either side of
the read from the clock under test. If the watchdog clock shows an
excessive time delta between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot
parameter clocksource.max_read_retries, which defaults to three, that
is, up to four reads, one initial and up to three retries. If retries
were required, a message is printed on the console. If the number of
retries is exceeded, the clock under test will be marked unstable.
However, the probability of this happening due to various sorts of
delays is quite small. In addition, the reason (clock-read delays)
for the unstable marking will be apparent.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
[ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4be4391..3f734c6 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)

static void clocksource_watchdog_work(struct work_struct *work)
{
@@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void)
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
- u64 csnow, wdnow, cslast, wdlast, delta;
- int64_t wd_nsec, cs_nsec;
+ u64 csnow, wdnow, wdagain, cslast, wdlast, delta;
+ int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec;
int next_cpu, reset_pending;
+ int nretries;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused)
reset_pending = atomic_read(&watchdog_reset_pending);

list_for_each_entry(cs, &watchdog_list, wd_list) {
+ nretries = 0;

/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

+retry:
local_irq_disable();
- csnow = cs->read(cs);
- clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
+ clocksource_watchdog_inject_delay();
+ csnow = cs->read(cs);
+ wdagain = watchdog->read(watchdog);
local_irq_enable();
+ delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
+ wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+ if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
+ wderr_nsec = wdagain_nsec;
+ if (nretries++ < max_read_retries)
+ goto retry;
+ }
+ if (nretries)
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
+ smp_processor_id(), watchdog->name, wderr_nsec, nretries);

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
--
2.9.5

2021-03-04 14:17:53

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v5 clocksource] Do not mark clocks unstable due to delays for v5.13

Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own. This series checks for this, doing limited retries
to get a good set of clock reads. If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

Note that "sufficient delay" can be provided by SMIs, NMIs, and of course
vCPU preemption.

1. Provide module parameters to inject delays in watchdog.

2. Retry clock read if long delays detected.

3. Check per-CPU clock synchronization when marked unstable.

4. Provide a module parameter to fuzz per-CPU clock checking.

5. Do pairwise clock-desynchronization checking.

Changes since v4:

o Rebased to v5.12-rc1.

Changes since v3:

o Rebased to v5.11.

o Apply Randy Dunlap feedback.

Changes since v2:

o Rebased to v5.11-rc6.

o Updated Cc: list.

Changes since v1:

o Applied feedback from Rik van Riel.

o Rebased to v5.11-rc3.

o Stripped "RFC" from the subject lines.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 38 +++++
arch/x86/kernel/kvmclock.c | 2
arch/x86/kernel/tsc.c | 3
include/linux/clocksource.h | 2
kernel/time/clocksource.c | 174 +++++++++++++++++++++---
5 files changed, 195 insertions(+), 24 deletions(-)

2021-03-04 14:18:04

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH kernel/time 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

From: "Paul E. McKenney" <[email protected]>

Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?

This commit therefore adds CPU-to-CPU synchronization checking
for newly unstable clocksources that are marked with the new
CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are
printed, with the caveat that if it is the reporting CPU that is itself
desynchronized, it will appear that all the other clocks are wrong.
Just like in real life.
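
A condensed sketch of the check (illustrative only; the authoritative
code is in the diff below): each other CPU samples the clock under test
between a pair of reads on the reporting CPU, and any CPU whose sample
falls outside that window is flagged.

    csnow_begin = cs->read(cs);                    /* reporting CPU */
    smp_call_function(clocksource_verify_one_cpu, cs, 1); /* others record csnow_mid */
    csnow_end = cs->read(cs);                      /* reporting CPU again */
    for_each_online_cpu(cpu) {
            if (cpu == smp_processor_id())
                    continue;
            if ((s64)((per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask) < 0)
                    cpumask_set_cpu(cpu, &cpus_behind); /* sampled "before" begin */
            if ((s64)((csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask) < 0)
                    cpumask_set_cpu(cpu, &cpus_ahead);  /* sampled "after" end */
    }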

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/kvmclock.c | 2 +-
arch/x86/kernel/tsc.c | 3 +-
include/linux/clocksource.h | 2 +-
kernel/time/clocksource.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aa59374..337bb2c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -169,7 +169,7 @@ struct clocksource kvm_clock = {
.read = kvm_clock_get_cycles,
.rating = 400,
.mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU,
.enable = kvm_cs_enable,
};
EXPORT_SYMBOL_GPL(kvm_clock);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f70dffc..5628917 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY,
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
.resume = tsc_resume,
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143d..83a3ebf 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -131,7 +131,7 @@ struct clocksource {
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-
+#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 3f734c6..663bc53 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void)
WARN_ON_ONCE(injectfail < 0);
}

+static struct clocksource *clocksource_verify_work_cs;
+static DEFINE_PER_CPU(u64, csnow_mid);
+static cpumask_t cpus_ahead;
+static cpumask_t cpus_behind;
+
+static void clocksource_verify_one_cpu(void *csin)
+{
+ struct clocksource *cs = (struct clocksource *)csin;
+
+ __this_cpu_write(csnow_mid, cs->read(cs));
+}
+
+static void clocksource_verify_percpu_wq(struct work_struct *unused)
+{
+ int cpu;
+ struct clocksource *cs;
+ int64_t cs_nsec;
+ u64 csnow_begin;
+ u64 csnow_end;
+ u64 delta;
+
+ cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
+ if (WARN_ON_ONCE(!cs))
+ return;
+ pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
+ cs->name, smp_processor_id());
+ cpumask_clear(&cpus_ahead);
+ cpumask_clear(&cpus_behind);
+ csnow_begin = cs->read(cs);
+ smp_call_function(clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ for_each_online_cpu(cpu) {
+ if (cpu == smp_processor_id())
+ continue;
+ delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_behind);
+ delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_ahead);
+ }
+ if (!cpumask_empty(&cpus_ahead))
+ pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_ahead),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_behind))
+ pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_behind),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec, cs->name);
+ }
+ smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
+}
+
+static DECLARE_WORK(clocksource_verify_work, clocksource_verify_percpu_wq);
+
+static void clocksource_verify_percpu(struct clocksource *cs)
+{
+ if (!(cs->flags & CLOCK_SOURCE_VERIFY_PERCPU))
+ return;
+ if (smp_load_acquire(&clocksource_verify_work_cs)) { // pairs with release.
+ pr_warn("Previous clocksource synchronization still in flight.\n");
+ return;
+ }
+ smp_store_release(&clocksource_verify_work_cs, cs); //pairs with acquire.
+ queue_work(system_highpri_wq, &clocksource_verify_work);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -284,6 +356,7 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, csnow, cslast, cs->mask);
+ clocksource_verify_percpu(cs);
__clocksource_unstable(cs);
continue;
}
--
2.9.5

2021-03-04 14:20:01

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH kernel/time 1/5] clocksource: Provide module parameters to inject delays in watchdog

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there is no shortage of things that can
delay interrupts-disabled regions of code ranging from SMI handlers to
vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

The first step is a way of injecting such delays, and this
commit therefore provides a clocksource.inject_delay_freq and
clocksource.inject_delay_run kernel boot parameters that specify that
sufficient delay be injected to cause the clocksource_watchdog()
function to mark a clock unstable. This delay is injected every
Nth set of M calls to clocksource_watchdog(), where N is the value
specified for the inject_delay_freq boot parameter and M is the value
specified for the inject_delay_run boot parameter. Values of zero or
less for either parameter disable delay injection, and the default for
clocksource.inject_delay_freq is zero, that is, disabled. The default for
clocksource.inject_delay_run is the value one, that is single-call runs.

This facility is intended for diagnostic use only, and should be avoided
on production systems.
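
For example (an illustrative boot line, not part of the patch), booting
with:

    clocksource.inject_delay_freq=2 clocksource.inject_delay_run=3

asks that roughly every other run of three consecutive
clocksource_watchdog() invocations have delays injected, that is, three
delayed calls followed by three undelayed calls, repeating.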

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 22 ++++++++++++++++++++
kernel/time/clocksource.c | 27 +++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0454572..fc57952 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -583,6 +583,28 @@
loops can be debugged more effectively on production
systems.

+ clocksource.inject_delay_freq= [KNL]
+ Number of runs of calls to clocksource_watchdog()
+ before delays are injected between reads from the
+ two clocksources. Values less than or equal to
+ zero disable this delay injection. These delays
+ can cause clocks to be marked unstable, so use
+ of this parameter should therefore be avoided on
+ production systems. Defaults to zero (disabled).
+
+ clocksource.inject_delay_run= [KNL]
+ Run lengths of clocksource_watchdog() delay
+ injections. Specifying the value 8 will result
+ in eight consecutive delays followed by eight
+ times the value specified for inject_delay_freq
+ of consecutive non-delays.
+
+ clocksource.max_read_retries= [KNL]
+ Number of clocksource_watchdog() retries due to
+ external delays before the clock will be marked
+ unstable. Defaults to three retries, that is,
+ four attempts to read the clock under test.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a..4be4391 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -14,6 +14,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/delay.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static int inject_delay_freq;
+module_param(inject_delay_freq, int, 0644);
+static int inject_delay_run = 1;
+module_param(inject_delay_run, int, 0644);
+static int max_read_retries = 3;
+module_param(max_read_retries, int, 0644);
+
+static void clocksource_watchdog_inject_delay(void)
+{
+ int i;
+ static int injectfail = -1;
+
+ if (inject_delay_freq <= 0 || inject_delay_run <= 0)
+ return;
+ if (injectfail < 0 || injectfail > INT_MAX / 2)
+ injectfail = inject_delay_run;
+ if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
+ pr_warn("%s(): Injecting delay.\n", __func__);
+ for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
+ udelay(1000);
+ pr_warn("%s(): Done injecting delay.\n", __func__);
+ }
+ WARN_ON_ONCE(injectfail < 0);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -208,6 +234,7 @@ static void clocksource_watchdog(struct timer_list *unused)

local_irq_disable();
csnow = cs->read(cs);
+ clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
local_irq_enable();

--
2.9.5

2021-03-04 14:20:24

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH kernel/time 5/5] clocksource: Do pairwise clock-desynchronization checking

From: "Paul E. McKenney" <[email protected]>

Although smp_call_function() has the advantage of simplicity, using
it to check for cross-CPU clock desynchronization means that any CPU
being slow reduces the sensitivity of the checking across all CPUs.
And it is not uncommon for smp_call_function() latencies to be in the
hundreds of microseconds.

This commit therefore switches to smp_call_function_single(), so that
delays from a given CPU affect only those measurements involving that
particular CPU.
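
In condensed form, the measurement pattern after this change is roughly
as follows (an illustrative sketch; see the diff below for the actual
code):

    for_each_online_cpu(cpu) {
            if (cpu == smp_processor_id())
                    continue;
            csnow_begin = cs->read(cs);   /* bracket only this CPU's sample */
            smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
            csnow_end = cs->read(cs);
            /* classify this CPU as ahead/behind, track min/max duration */
    }

A slow IPI round trip to one CPU therefore widens only that CPU's
measurement window instead of everyone's.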

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index df48416..4161c84 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void)
}

static struct clocksource *clocksource_verify_work_cs;
-static DEFINE_PER_CPU(u64, csnow_mid);
+static u64 csnow_mid;
static cpumask_t cpus_ahead;
static cpumask_t cpus_behind;

@@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin)
sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
delta = sign * NSEC_PER_SEC;
}
- __this_cpu_write(csnow_mid, cs->read(cs) + delta);
+ csnow_mid = cs->read(cs) + delta;
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
@@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
int cpu;
struct clocksource *cs;
int64_t cs_nsec;
+ int64_t cs_nsec_max;
+ int64_t cs_nsec_min;
u64 csnow_begin;
u64 csnow_end;
- u64 delta;
+ s64 delta;
+ bool firsttime = 1;

cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
if (WARN_ON_ONCE(!cs))
@@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
cs->name, smp_processor_id());
cpumask_clear(&cpus_ahead);
cpumask_clear(&cpus_behind);
- csnow_begin = cs->read(cs);
- smp_call_function(clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
+ preempt_disable();
for_each_online_cpu(cpu) {
if (cpu == smp_processor_id())
continue;
- delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
- if ((s64)delta < 0)
+ csnow_begin = cs->read(cs);
+ smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
- if ((s64)delta < 0)
+ delta = (csnow_end - csnow_mid) & cs->mask;
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_ahead);
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ if (firsttime || cs_nsec > cs_nsec_max)
+ cs_nsec_max = cs_nsec;
+ if (firsttime || cs_nsec < cs_nsec_min)
+ cs_nsec_min = cs_nsec;
+ firsttime = 0;
}
+ preempt_enable();
if (!cpumask_empty(&cpus_ahead))
pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_ahead),
@@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_behind),
smp_processor_id(), cs->name);
- if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
- delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
- cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
- pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
- smp_processor_id(), cs_nsec, cs->name);
- }
+ if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)))
+ pr_warn(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name);
smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
}

--
2.9.5

2021-03-04 14:20:24

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH kernel/time 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

From: "Paul E. McKenney" <[email protected]>

Code that checks for clock desynchronization must itself be tested, so
this commit creates a new clocksource.inject_delay_shift_percpu= kernel
boot parameter that adds or subtracts a large value from the check read,
using the specified bit of the CPU ID to determine whether to add or
to subtract.
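
For example (illustrative arithmetic based on the hunk below), with
clocksource.inject_delay_shift_percpu=4 the direction is chosen by bit 4
of the CPU number:

    sign = ((smp_processor_id() >> 4) & 0x1) * 2 - 1;
    /* CPU  7: (7 >> 4) & 1 == 0, so sign == -1: check read skewed a second behind */
    /* CPU 16: (16 >> 4) & 1 == 1, so sign == +1: check read skewed a second ahead */

so CPUs 0-15, 32-47, ... appear behind and CPUs 16-31, 48-63, ... appear
ahead, exercising both the cpus_behind and cpus_ahead reporting paths.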

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Apply Randy Dunlap feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 16 ++++++++++++++++
kernel/time/clocksource.c | 10 +++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fc57952..f9da90f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,22 @@
times the value specified for inject_delay_freq
of consecutive non-delays.

+ clocksource.inject_delay_shift_percpu= [KNL]
+ Clocksource delay injection partitions the CPUs
+ into two sets, one whose clocks are moved ahead
+ and the other whose clocks are moved behind.
+ This kernel parameter selects the CPU-number
+ bit that determines which of these two sets the
+ corresponding CPU is placed into. For example,
+ setting this parameter to the value 4 will result
+ in the first set containing alternating groups
+ of 16 CPUs whose clocks are moved ahead, while
+ the second set will contain the rest of the CPUs,
+ whose clocks are moved behind.
+
+ The default value of -1 disables this type of
+ error injection.
+
clocksource.max_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 663bc53..df48416 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -190,6 +190,8 @@ static int inject_delay_freq;
module_param(inject_delay_freq, int, 0644);
static int inject_delay_run = 1;
module_param(inject_delay_run, int, 0644);
+static int inject_delay_shift_percpu = -1;
+module_param(inject_delay_shift_percpu, int, 0644);
static int max_read_retries = 3;
module_param(max_read_retries, int, 0644);

@@ -219,8 +221,14 @@ static cpumask_t cpus_behind;
static void clocksource_verify_one_cpu(void *csin)
{
struct clocksource *cs = (struct clocksource *)csin;
+ s64 delta = 0;
+ int sign;

- __this_cpu_write(csnow_mid, cs->read(cs));
+ if (inject_delay_shift_percpu >= 0) {
+ sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
+ delta = sign * NSEC_PER_SEC;
+ }
+ __this_cpu_write(csnow_mid, cs->read(cs) + delta);
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
--
2.9.5

2021-03-04 23:57:34

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH kernel/time 2/5] clocksource: Retry clock read if long delays detected

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there is no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

This commit therefore re-reads the watchdog clock on either side of
the read from the clock under test. If the watchdog clock shows an
excessive time delta between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot
parameter clocksource.max_read_retries, which defaults to three, that
is, up to four reads, one initial and up to three retries. If retries
were required, a message is printed on the console. If the number of
retries is exceeded, the clock under test will be marked unstable.
However, the probability of this happening due to various sorts of
delays is quite small. In addition, the reason (clock-read delays)
for the unstable marking will be apparent.
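
The resulting read sequence is, in condensed form (an illustrative
sketch; the real code is in the diff below):

retry:
    local_irq_disable();
    wdnow   = watchdog->read(watchdog);   /* first watchdog read */
    csnow   = cs->read(cs);               /* clock under test */
    wdagain = watchdog->read(watchdog);   /* watchdog read-back */
    local_irq_enable();
    /* If the delta between the two watchdog reads is negative or exceeds
       WATCHDOG_MAX_SKEW, something delayed these reads; retry up to
       clocksource.max_read_retries times. */

If retries were needed, the pr_warn() added below reports how long the
reads were delayed and how many attempts were made.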

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
[ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4be4391..3f734c6 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)

static void clocksource_watchdog_work(struct work_struct *work)
{
@@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void)
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
- u64 csnow, wdnow, cslast, wdlast, delta;
- int64_t wd_nsec, cs_nsec;
+ u64 csnow, wdnow, wdagain, cslast, wdlast, delta;
+ int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec;
int next_cpu, reset_pending;
+ int nretries;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused)
reset_pending = atomic_read(&watchdog_reset_pending);

list_for_each_entry(cs, &watchdog_list, wd_list) {
+ nretries = 0;

/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

+retry:
local_irq_disable();
- csnow = cs->read(cs);
- clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
+ clocksource_watchdog_inject_delay();
+ csnow = cs->read(cs);
+ wdagain = watchdog->read(watchdog);
local_irq_enable();
+ delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
+ wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+ if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
+ wderr_nsec = wdagain_nsec;
+ if (nretries++ < max_read_retries)
+ goto retry;
+ }
+ if (nretries)
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
+ smp_processor_id(), watchdog->name, wderr_nsec, nretries);

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
--
2.9.5

2021-04-02 20:29:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v5 clocksource] Do not mark clocks unstable due to delays for v5.13

Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own. This series checks for this, doing limited retries
to get a good set of clock reads. If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

Note that "sufficient delay" can be provided by SMIs, NMIs, and of course
vCPU preemption.

1. Provide module parameters to inject delays in watchdog.

2. Retry clock read if long delays detected.

3. Check per-CPU clock synchronization when marked unstable.

4. Provide a module parameter to fuzz per-CPU clock checking.

5. Do pairwise clock-desynchronization checking.

Changes since v5:

o Rebased to v5.12-rc5.

Changes since v4:

o Rebased to v5.12-rc1.

Changes since v3:

o Rebased to v5.11.

o Apply Randy Dunlap feedback.

Changes since v2:

o Rebased to v5.11-rc6.

o Updated Cc: list.

Changes since v1:

o Applied feedback from Rik van Riel.

o Rebased to v5.11-rc3.

o Stripped "RFC" from the subject lines.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 38 +++++
arch/x86/kernel/kvmclock.c | 2
arch/x86/kernel/tsc.c | 3
include/linux/clocksource.h | 2
kernel/time/clocksource.c | 174 +++++++++++++++++++++---
5 files changed, 195 insertions(+), 24 deletions(-)

2021-04-02 20:32:11

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Retry clock read if long delays detected

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there is no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

This commit therefore re-reads the watchdog clock on either side of
the read from the clock under test. If the watchdog clock shows an
excessive time delta between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot
parameter clocksource.max_read_retries, which defaults to three, that
is, up to four reads, one initial and up to three retries. If retries
were required, a message is printed on the console. If the number of
retries is exceeded, the clock under test will be marked unstable.
However, the probability of this happening due to various sorts of
delays is quite small. In addition, the reason (clock-read delays)
for the unstable marking will be apparent.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
[ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4be4391..3f734c6 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)

static void clocksource_watchdog_work(struct work_struct *work)
{
@@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void)
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
- u64 csnow, wdnow, cslast, wdlast, delta;
- int64_t wd_nsec, cs_nsec;
+ u64 csnow, wdnow, wdagain, cslast, wdlast, delta;
+ int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec;
int next_cpu, reset_pending;
+ int nretries;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused)
reset_pending = atomic_read(&watchdog_reset_pending);

list_for_each_entry(cs, &watchdog_list, wd_list) {
+ nretries = 0;

/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

+retry:
local_irq_disable();
- csnow = cs->read(cs);
- clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
+ clocksource_watchdog_inject_delay();
+ csnow = cs->read(cs);
+ wdagain = watchdog->read(watchdog);
local_irq_enable();
+ delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
+ wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+ if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
+ wderr_nsec = wdagain_nsec;
+ if (nretries++ < max_read_retries)
+ goto retry;
+ }
+ if (nretries)
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
+ smp_processor_id(), watchdog->name, wderr_nsec, nretries);

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
--
2.9.5

2021-04-02 20:32:33

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Provide module parameters to inject delays in watchdog

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there is no shortage of things that can
delay interrupts-disabled regions of code ranging from SMI handlers to
vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

The first step is a way of injecting such delays, and this
commit therefore provides a clocksource.inject_delay_freq and
clocksource.inject_delay_run kernel boot parameters that specify that
sufficient delay be injected to cause the clocksource_watchdog()
function to mark a clock unstable. This delay is injected every
Nth set of M calls to clocksource_watchdog(), where N is the value
specified for the inject_delay_freq boot parameter and M is the value
specified for the inject_delay_run boot parameter. Values of zero or
less for either parameter disable delay injection, and the default for
clocksource.inject_delay_freq is zero, that is, disabled. The default for
clocksource.inject_delay_run is the value one, that is single-call runs.

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 22 ++++++++++++++++++++
kernel/time/clocksource.c | 27 +++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0454572..fc57952 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -583,6 +583,28 @@
loops can be debugged more effectively on production
systems.

+ clocksource.inject_delay_freq= [KNL]
+ Number of runs of calls to clocksource_watchdog()
+ before delays are injected between reads from the
+ two clocksources. Values less than or equal to
+ zero disable this delay injection. These delays
+ can cause clocks to be marked unstable, so use
+ of this parameter should therefore be avoided on
+ production systems. Defaults to zero (disabled).
+
+ clocksource.inject_delay_run= [KNL]
+ Run lengths of clocksource_watchdog() delay
+ injections. Specifying the value 8 will result
+ in eight consecutive delays followed by eight
+ times the value specified for inject_delay_freq
+ of consecutive non-delays.
+
+ clocksource.max_read_retries= [KNL]
+ Number of clocksource_watchdog() retries due to
+ external delays before the clock will be marked
+ unstable. Defaults to three retries, that is,
+ four attempts to read the clock under test.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a..4be4391 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -14,6 +14,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/delay.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static int inject_delay_freq;
+module_param(inject_delay_freq, int, 0644);
+static int inject_delay_run = 1;
+module_param(inject_delay_run, int, 0644);
+static int max_read_retries = 3;
+module_param(max_read_retries, int, 0644);
+
+static void clocksource_watchdog_inject_delay(void)
+{
+ int i;
+ static int injectfail = -1;
+
+ if (inject_delay_freq <= 0 || inject_delay_run <= 0)
+ return;
+ if (injectfail < 0 || injectfail > INT_MAX / 2)
+ injectfail = inject_delay_run;
+ if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
+ pr_warn("%s(): Injecting delay.\n", __func__);
+ for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
+ udelay(1000);
+ pr_warn("%s(): Done injecting delay.\n", __func__);
+ }
+ WARN_ON_ONCE(injectfail < 0);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -208,6 +234,7 @@ static void clocksource_watchdog(struct timer_list *unused)

local_irq_disable();
csnow = cs->read(cs);
+ clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
local_irq_enable();

--
2.9.5

2021-04-02 20:33:03

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Provide a module parameter to fuzz per-CPU clock checking

From: "Paul E. McKenney" <[email protected]>

Code that checks for clock desynchronization must itself be tested, so
this commit creates a new clocksource.inject_delay_shift_percpu= kernel
boot parameter that adds or subtracts a large value from the check read,
using the specified bit of the CPU ID to determine whether to add or
to subtract.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Apply Randy Dunlap feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 16 ++++++++++++++++
kernel/time/clocksource.c | 10 +++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fc57952..f9da90f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,22 @@
times the value specified for inject_delay_freq
of consecutive non-delays.

+ clocksource.inject_delay_shift_percpu= [KNL]
+ Clocksource delay injection partitions the CPUs
+ into two sets, one whose clocks are moved ahead
+ and the other whose clocks are moved behind.
+ This kernel parameter selects the CPU-number
+ bit that determines which of these two sets the
+ corresponding CPU is placed into. For example,
+ setting this parameter to the value 4 will result
+ in the first set containing alternating groups
+ of 16 CPUs whose clocks are moved ahead, while
+ the second set will contain the rest of the CPUs,
+ whose clocks are moved behind.
+
+ The default value of -1 disables this type of
+ error injection.
+
clocksource.max_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 663bc53..df48416 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -190,6 +190,8 @@ static int inject_delay_freq;
module_param(inject_delay_freq, int, 0644);
static int inject_delay_run = 1;
module_param(inject_delay_run, int, 0644);
+static int inject_delay_shift_percpu = -1;
+module_param(inject_delay_shift_percpu, int, 0644);
static int max_read_retries = 3;
module_param(max_read_retries, int, 0644);

@@ -219,8 +221,14 @@ static cpumask_t cpus_behind;
static void clocksource_verify_one_cpu(void *csin)
{
struct clocksource *cs = (struct clocksource *)csin;
+ s64 delta = 0;
+ int sign;

- __this_cpu_write(csnow_mid, cs->read(cs));
+ if (inject_delay_shift_percpu >= 0) {
+ sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
+ delta = sign * NSEC_PER_SEC;
+ }
+ __this_cpu_write(csnow_mid, cs->read(cs) + delta);
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
--
2.9.5

2021-04-02 20:34:25

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Do pairwise clock-desynchronization checking

From: "Paul E. McKenney" <[email protected]>

Although smp_call_function() has the advantage of simplicity, using
it to check for cross-CPU clock desynchronization means that any CPU
being slow reduces the sensitivity of the checking across all CPUs.
And it is not uncommon for smp_call_function() latencies to be in the
hundreds of microseconds.

This commit therefore switches to smp_call_function_single(), so that
delays from a given CPU affect only those measurements involving that
particular CPU.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index df48416..4161c84 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void)
}

static struct clocksource *clocksource_verify_work_cs;
-static DEFINE_PER_CPU(u64, csnow_mid);
+static u64 csnow_mid;
static cpumask_t cpus_ahead;
static cpumask_t cpus_behind;

@@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin)
sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
delta = sign * NSEC_PER_SEC;
}
- __this_cpu_write(csnow_mid, cs->read(cs) + delta);
+ csnow_mid = cs->read(cs) + delta;
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
@@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
int cpu;
struct clocksource *cs;
int64_t cs_nsec;
+ int64_t cs_nsec_max;
+ int64_t cs_nsec_min;
u64 csnow_begin;
u64 csnow_end;
- u64 delta;
+ s64 delta;
+ bool firsttime = 1;

cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
if (WARN_ON_ONCE(!cs))
@@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
cs->name, smp_processor_id());
cpumask_clear(&cpus_ahead);
cpumask_clear(&cpus_behind);
- csnow_begin = cs->read(cs);
- smp_call_function(clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
+ preempt_disable();
for_each_online_cpu(cpu) {
if (cpu == smp_processor_id())
continue;
- delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
- if ((s64)delta < 0)
+ csnow_begin = cs->read(cs);
+ smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
- if ((s64)delta < 0)
+ delta = (csnow_end - csnow_mid) & cs->mask;
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_ahead);
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ if (firsttime || cs_nsec > cs_nsec_max)
+ cs_nsec_max = cs_nsec;
+ if (firsttime || cs_nsec < cs_nsec_min)
+ cs_nsec_min = cs_nsec;
+ firsttime = 0;
}
+ preempt_enable();
if (!cpumask_empty(&cpus_ahead))
pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_ahead),
@@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_behind),
smp_processor_id(), cs->name);
- if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
- delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
- cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
- pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
- smp_processor_id(), cs_nsec, cs->name);
- }
+ if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)))
+ pr_warn(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name);
smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
}

--
2.9.5

2021-04-02 20:34:35

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Check per-CPU clock synchronization when marked unstable

From: "Paul E. McKenney" <[email protected]>

Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?

This commit therefore adds CPU-to-CPU synchronization checking
for newly unstable clocksources that are marked with the new
CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are
printed, with the caveat that if it is the reporting CPU that is itself
desynchronized, it will appear that all the other clocks are wrong.
Just like in real life.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/kvmclock.c | 2 +-
arch/x86/kernel/tsc.c | 3 +-
include/linux/clocksource.h | 2 +-
kernel/time/clocksource.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1fc0962..97eeaf1 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -169,7 +169,7 @@ struct clocksource kvm_clock = {
.read = kvm_clock_get_cycles,
.rating = 400,
.mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU,
.enable = kvm_cs_enable,
};
EXPORT_SYMBOL_GPL(kvm_clock);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f70dffc..5628917 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY,
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
.resume = tsc_resume,
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143d..83a3ebf 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -131,7 +131,7 @@ struct clocksource {
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-
+#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 3f734c6..663bc53 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void)
WARN_ON_ONCE(injectfail < 0);
}

+static struct clocksource *clocksource_verify_work_cs;
+static DEFINE_PER_CPU(u64, csnow_mid);
+static cpumask_t cpus_ahead;
+static cpumask_t cpus_behind;
+
+static void clocksource_verify_one_cpu(void *csin)
+{
+ struct clocksource *cs = (struct clocksource *)csin;
+
+ __this_cpu_write(csnow_mid, cs->read(cs));
+}
+
+static void clocksource_verify_percpu_wq(struct work_struct *unused)
+{
+ int cpu;
+ struct clocksource *cs;
+ int64_t cs_nsec;
+ u64 csnow_begin;
+ u64 csnow_end;
+ u64 delta;
+
+ cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
+ if (WARN_ON_ONCE(!cs))
+ return;
+ pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
+ cs->name, smp_processor_id());
+ cpumask_clear(&cpus_ahead);
+ cpumask_clear(&cpus_behind);
+ csnow_begin = cs->read(cs);
+ smp_call_function(clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ for_each_online_cpu(cpu) {
+ if (cpu == smp_processor_id())
+ continue;
+ delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_behind);
+ delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_ahead);
+ }
+ if (!cpumask_empty(&cpus_ahead))
+ pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_ahead),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_behind))
+ pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_behind),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec, cs->name);
+ }
+ smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
+}
+
+static DECLARE_WORK(clocksource_verify_work, clocksource_verify_percpu_wq);
+
+static void clocksource_verify_percpu(struct clocksource *cs)
+{
+ if (!(cs->flags & CLOCK_SOURCE_VERIFY_PERCPU))
+ return;
+ if (smp_load_acquire(&clocksource_verify_work_cs)) { // pairs with release.
+ pr_warn("Previous clocksource synchronization still in flight.\n");
+ return;
+ }
+ smp_store_release(&clocksource_verify_work_cs, cs); //pairs with acquire.
+ queue_work(system_highpri_wq, &clocksource_verify_work);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -284,6 +356,7 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, csnow, cslast, cs->mask);
+ clocksource_verify_percpu(cs);
__clocksource_unstable(cs);
continue;
}
--
2.9.5

2021-04-02 22:23:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Provide module parameters to inject delays in watchdog

On Fri, Apr 02 2021 at 13:31, paulmck wrote:

The subsystem prefix does not parse:

[PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Provide module parameters to inject delays in watchdog

I look at the actual code changes after the easter break.

Thanks,

tglx

2021-04-02 22:38:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Provide module parameters to inject delays in watchdog

On Sat, Apr 03, 2021 at 12:22:40AM +0200, Thomas Gleixner wrote:
> On Fri, Apr 02 2021 at 13:31, paulmck wrote:
>
> The subsystem prefix does not parse:
>
> [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Provide module parameters to inject delays in watchdog

Where is that brown bag when I need it? :-/

> I look at the actual code changes after the easter break.

I will resend a clean copy as a reply to your message.

Thanx, Paul

2021-04-02 22:49:42

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v7 clocksource] Do not mark clocks unstable due to delays for v5.13

Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own. This series checks for this, doing limited retries
to get a good set of clock reads. If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

Note that "sufficient delay" can be provided by SMIs, NMIs, and of course
vCPU preemption.

1. Provide module parameters to inject delays in watchdog.

2. Retry clock read if long delays detected.

3. Check per-CPU clock synchronization when marked unstable.

4. Provide a module parameter to fuzz per-CPU clock checking.

5. Do pairwise clock-desynchronization checking.

Changes since v6:

o Fix embarrassing git-format-patch operator error.

Changes since v5:

o Rebased to v5.12-rc5.

Changes since v4:

o Rebased to v5.12-rc1.

Changes since v3:

o Rebased to v5.11.

o Apply Randy Dunlap feedback.

Changes since v2:

o Rebased to v5.11-rc6.

o Updated Cc: list.

Changes since v1:

o Applied feedback from Rik van Riel.

o Rebased to v5.11-rc3.

o Stripped "RFC" from the subject lines.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 38 +++++
arch/x86/kernel/kvmclock.c | 2
arch/x86/kernel/tsc.c | 3
include/linux/clocksource.h | 2
kernel/time/clocksource.c | 174 +++++++++++++++++++++---
5 files changed, 195 insertions(+), 24 deletions(-)

2021-04-02 22:49:49

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v7 clocksource 2/5] clocksource: Retry clock read if long delays detected

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. Yes, interrupts are
disabled across those two reads, but there is no shortage of things that
can delay interrupts-disabled regions of code ranging from SMI handlers
to vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

This commit therefore re-reads the watchdog clock on either side of
the read from the clock under test. If the watchdog clock shows an
excessive time delta between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot
parameter clocksource.max_read_retries, which defaults to three, that
is, up to four reads, one initial and up to three retries. If retries
were required, a message is printed on the console. If the number of
retries is exceeded, the clock under test will be marked unstable.
However, the probability of this happening due to various sorts of
delays is quite small. In addition, the reason (clock-read delays)
for the unstable marking will be apparent.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ]
[ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4be4391..3f734c6 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)

static void clocksource_watchdog_work(struct work_struct *work)
{
@@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void)
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
- u64 csnow, wdnow, cslast, wdlast, delta;
- int64_t wd_nsec, cs_nsec;
+ u64 csnow, wdnow, wdagain, cslast, wdlast, delta;
+ int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec;
int next_cpu, reset_pending;
+ int nretries;

spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused)
reset_pending = atomic_read(&watchdog_reset_pending);

list_for_each_entry(cs, &watchdog_list, wd_list) {
+ nretries = 0;

/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

+retry:
local_irq_disable();
- csnow = cs->read(cs);
- clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
+ clocksource_watchdog_inject_delay();
+ csnow = cs->read(cs);
+ wdagain = watchdog->read(watchdog);
local_irq_enable();
+ delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
+ wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+ if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
+ wderr_nsec = wdagain_nsec;
+ if (nretries++ < max_read_retries)
+ goto retry;
+ }
+ if (nretries)
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
+ smp_processor_id(), watchdog->name, wderr_nsec, nretries);

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
--
2.9.5

2021-04-02 22:49:58

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v7 clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog

From: "Paul E. McKenney" <[email protected]>

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there is no shortage of things that can
delay interrupts-disabled regions of code ranging from SMI handlers to
vCPU preemption. It would be good to have some indication as to why
the clock was marked unstable.

The first step is a way of injecting such delays, and this
commit therefore provides a clocksource.inject_delay_freq and
clocksource.inject_delay_run kernel boot parameters that specify that
sufficient delay be injected to cause the clocksource_watchdog()
function to mark a clock unstable. This delay is injected every
Nth set of M calls to clocksource_watchdog(), where N is the value
specified for the inject_delay_freq boot parameter and M is the value
specified for the inject_delay_run boot parameter. Values of zero or
less for either parameter disable delay injection, and the default for
clocksource.inject_delay_freq is zero, that is, disabled. The default for
clocksource.inject_delay_run is the value one, that is single-call runs.

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 22 ++++++++++++++++++++
kernel/time/clocksource.c | 27 +++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0454572..fc57952 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -583,6 +583,28 @@
loops can be debugged more effectively on production
systems.

+ clocksource.inject_delay_freq= [KNL]
+ Number of runs of calls to clocksource_watchdog()
+ before delays are injected between reads from the
+ two clocksources. Values less than or equal to
+ zero disable this delay injection. These delays
+ can cause clocks to be marked unstable, so use
+ of this parameter should therefore be avoided on
+ production systems. Defaults to zero (disabled).
+
+ clocksource.inject_delay_run= [KNL]
+ Run lengths of clocksource_watchdog() delay
+ injections. Specifying the value 8 will result
+ in eight consecutive delays followed by eight
+ times the value specified for inject_delay_freq
+ of consecutive non-delays.
+
+ clocksource.max_read_retries= [KNL]
+ Number of clocksource_watchdog() retries due to
+ external delays before the clock will be marked
+ unstable. Defaults to three retries, that is,
+ four attempts to read the clock under test.
+
clearcpuid=BITNUM[,BITNUM...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a..4be4391 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -14,6 +14,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/delay.h>

#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static int inject_delay_freq;
+module_param(inject_delay_freq, int, 0644);
+static int inject_delay_run = 1;
+module_param(inject_delay_run, int, 0644);
+static int max_read_retries = 3;
+module_param(max_read_retries, int, 0644);
+
+static void clocksource_watchdog_inject_delay(void)
+{
+ int i;
+ static int injectfail = -1;
+
+ if (inject_delay_freq <= 0 || inject_delay_run <= 0)
+ return;
+ if (injectfail < 0 || injectfail > INT_MAX / 2)
+ injectfail = inject_delay_run;
+ if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
+ pr_warn("%s(): Injecting delay.\n", __func__);
+ for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
+ udelay(1000);
+ pr_warn("%s(): Done injecting delay.\n", __func__);
+ }
+ WARN_ON_ONCE(injectfail < 0);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -208,6 +234,7 @@ static void clocksource_watchdog(struct timer_list *unused)

local_irq_disable();
csnow = cs->read(cs);
+ clocksource_watchdog_inject_delay();
wdnow = watchdog->read(watchdog);
local_irq_enable();

--
2.9.5

2021-04-02 22:50:40

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v7 clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking

From: "Paul E. McKenney" <[email protected]>

Code that checks for clock desynchronization must itself be tested, so
this commit creates a new clocksource.inject_delay_shift_percpu= kernel
boot parameter that adds or subtracts a large value from the check read,
using the specified bit of the CPU ID to determine whether to add or
to subtract.
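
As an illustration only (not part of this patch), the following user-space
sketch shows how the sign selection works out for
clocksource.inject_delay_shift_percpu=4, using the same expression that
clocksource_verify_one_cpu() gains below:

#include <stdio.h>

int main(void)
{
        int shift = 4;  /* clocksource.inject_delay_shift_percpu=4 */
        int cpu;

        for (cpu = 0; cpu < 48; cpu += 8) {
                /* -1: check read moved one second behind, +1: one second ahead. */
                int sign = ((cpu >> shift) & 0x1) * 2 - 1;

                printf("CPU %2d: delta = %+d second(s)\n", cpu, sign);
        }
        return 0;
}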

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Apply Randy Dunlap feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 16 ++++++++++++++++
kernel/time/clocksource.c | 10 +++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fc57952..f9da90f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,22 @@
times the value specified for inject_delay_freq
of consecutive non-delays.

+ clocksource.inject_delay_shift_percpu= [KNL]
+ Clocksource delay injection partitions the CPUs
+ into two sets, one whose clocks are moved ahead
+ and the other whose clocks are moved behind.
+ This kernel parameter selects the CPU-number
+ bit that determines which of these two sets the
+ corresponding CPU is placed into. For example,
+ setting this parameter to the value 4 will result
+ in the first set containing alternating groups
+ of 16 CPUs whose clocks are moved ahead, while
+ the second set will contain the rest of the CPUs,
+ whose clocks are moved behind.
+
+ The default value of -1 disables this type of
+ error injection.
+
clocksource.max_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 663bc53..df48416 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -190,6 +190,8 @@ static int inject_delay_freq;
module_param(inject_delay_freq, int, 0644);
static int inject_delay_run = 1;
module_param(inject_delay_run, int, 0644);
+static int inject_delay_shift_percpu = -1;
+module_param(inject_delay_shift_percpu, int, 0644);
static int max_read_retries = 3;
module_param(max_read_retries, int, 0644);

@@ -219,8 +221,14 @@ static cpumask_t cpus_behind;
static void clocksource_verify_one_cpu(void *csin)
{
struct clocksource *cs = (struct clocksource *)csin;
+ s64 delta = 0;
+ int sign;

- __this_cpu_write(csnow_mid, cs->read(cs));
+ if (inject_delay_shift_percpu >= 0) {
+ sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
+ delta = sign * NSEC_PER_SEC;
+ }
+ __this_cpu_write(csnow_mid, cs->read(cs) + delta);
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
--
2.9.5

2021-04-02 22:52:03

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

From: "Paul E. McKenney" <[email protected]>

Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly
been solved in the past ten years. Except that it is all too possible
that the problem has instead simply been made less likely, which might
mean that some of the occasional "Marking clocksource 'tsc' as unstable"
messages might be due to desynchronization. How would anyone know?

This commit therefore adds CPU-to-CPU synchronization checking
for newly unstable clocksources that are marked with the new
CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are
printed, with the caveat that if it is the reporting CPU that is itself
desynchronized, it will appear that all the other clocks are wrong.
Just like in real life.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
[ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/kvmclock.c | 2 +-
arch/x86/kernel/tsc.c | 3 +-
include/linux/clocksource.h | 2 +-
kernel/time/clocksource.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1fc0962..97eeaf1 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -169,7 +169,7 @@ struct clocksource kvm_clock = {
.read = kvm_clock_get_cycles,
.rating = 400,
.mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU,
.enable = kvm_cs_enable,
};
EXPORT_SYMBOL_GPL(kvm_clock);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f70dffc..5628917 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY,
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
.resume = tsc_resume,
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143d..83a3ebf 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -131,7 +131,7 @@ struct clocksource {
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-
+#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 3f734c6..663bc53 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void)
WARN_ON_ONCE(injectfail < 0);
}

+static struct clocksource *clocksource_verify_work_cs;
+static DEFINE_PER_CPU(u64, csnow_mid);
+static cpumask_t cpus_ahead;
+static cpumask_t cpus_behind;
+
+static void clocksource_verify_one_cpu(void *csin)
+{
+ struct clocksource *cs = (struct clocksource *)csin;
+
+ __this_cpu_write(csnow_mid, cs->read(cs));
+}
+
+static void clocksource_verify_percpu_wq(struct work_struct *unused)
+{
+ int cpu;
+ struct clocksource *cs;
+ int64_t cs_nsec;
+ u64 csnow_begin;
+ u64 csnow_end;
+ u64 delta;
+
+ cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
+ if (WARN_ON_ONCE(!cs))
+ return;
+ pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
+ cs->name, smp_processor_id());
+ cpumask_clear(&cpus_ahead);
+ cpumask_clear(&cpus_behind);
+ csnow_begin = cs->read(cs);
+ smp_call_function(clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ for_each_online_cpu(cpu) {
+ if (cpu == smp_processor_id())
+ continue;
+ delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_behind);
+ delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
+ if ((s64)delta < 0)
+ cpumask_set_cpu(cpu, &cpus_ahead);
+ }
+ if (!cpumask_empty(&cpus_ahead))
+ pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_ahead),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_behind))
+ pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
+ cpumask_pr_args(&cpus_behind),
+ smp_processor_id(), cs->name);
+ if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec, cs->name);
+ }
+ smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
+}
+
+static DECLARE_WORK(clocksource_verify_work, clocksource_verify_percpu_wq);
+
+static void clocksource_verify_percpu(struct clocksource *cs)
+{
+ if (!(cs->flags & CLOCK_SOURCE_VERIFY_PERCPU))
+ return;
+ if (smp_load_acquire(&clocksource_verify_work_cs)) { // pairs with release.
+ pr_warn("Previous clocksource synchronization still in flight.\n");
+ return;
+ }
+ smp_store_release(&clocksource_verify_work_cs, cs); //pairs with acquire.
+ queue_work(system_highpri_wq, &clocksource_verify_work);
+}
+
static void clocksource_watchdog(struct timer_list *unused)
{
struct clocksource *cs;
@@ -284,6 +356,7 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, csnow, cslast, cs->mask);
+ clocksource_verify_percpu(cs);
__clocksource_unstable(cs);
continue;
}
--
2.9.5

2021-04-02 22:53:02

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH v7 clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking

From: "Paul E. McKenney" <[email protected]>

Although smp_call_function() has the advantage of simplicity, using
it to check for cross-CPU clock desynchronization means that any CPU
being slow reduces the sensitivity of the checking across all CPUs.
And it is not uncommon for smp_call_function() latencies to be in the
hundreds of microseconds.

This commit therefore switches to smp_call_function_single(), so that
delays from a given CPU affect only those measurements involving that
particular CPU.

Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Andi Kleen <[email protected]>
Reported-by: Chris Mason <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/clocksource.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index df48416..4161c84 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void)
}

static struct clocksource *clocksource_verify_work_cs;
-static DEFINE_PER_CPU(u64, csnow_mid);
+static u64 csnow_mid;
static cpumask_t cpus_ahead;
static cpumask_t cpus_behind;

@@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin)
sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1;
delta = sign * NSEC_PER_SEC;
}
- __this_cpu_write(csnow_mid, cs->read(cs) + delta);
+ csnow_mid = cs->read(cs) + delta;
}

static void clocksource_verify_percpu_wq(struct work_struct *unused)
@@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
int cpu;
struct clocksource *cs;
int64_t cs_nsec;
+ int64_t cs_nsec_max;
+ int64_t cs_nsec_min;
u64 csnow_begin;
u64 csnow_end;
- u64 delta;
+ s64 delta;
+ bool firsttime = 1;

cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
if (WARN_ON_ONCE(!cs))
@@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
cs->name, smp_processor_id());
cpumask_clear(&cpus_ahead);
cpumask_clear(&cpus_behind);
- csnow_begin = cs->read(cs);
- smp_call_function(clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
+ preempt_disable();
for_each_online_cpu(cpu) {
if (cpu == smp_processor_id())
continue;
- delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask;
- if ((s64)delta < 0)
+ csnow_begin = cs->read(cs);
+ smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
+ csnow_end = cs->read(cs);
+ delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask;
- if ((s64)delta < 0)
+ delta = (csnow_end - csnow_mid) & cs->mask;
+ if (delta < 0)
cpumask_set_cpu(cpu, &cpus_ahead);
+ delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
+ cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+ if (firsttime || cs_nsec > cs_nsec_max)
+ cs_nsec_max = cs_nsec;
+ if (firsttime || cs_nsec < cs_nsec_min)
+ cs_nsec_min = cs_nsec;
+ firsttime = 0;
}
+ preempt_enable();
if (!cpumask_empty(&cpus_ahead))
pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_ahead),
@@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused)
pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
cpumask_pr_args(&cpus_behind),
smp_processor_id(), cs->name);
- if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) {
- delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
- cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
- pr_warn(" CPU %d duration %lldns for clocksource %s.\n",
- smp_processor_id(), cs_nsec, cs->name);
- }
+ if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)))
+ pr_warn(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
+ smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name);
smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire.
}

--
2.9.5

2021-04-10 08:03:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource] Do not mark clocks unstable due to delays for v5.13

On Fri, Apr 02 2021 at 15:48, Paul E. McKenney wrote:
> Hello!
>
> If there is a sufficient delay between reading the watchdog clock and the
> clock under test, the clock under test will be marked unstable through no
> fault of its own. This series checks for this, doing limited retries
> to get a good set of clock reads. If the clock is marked unstable
> and is marked as being per-CPU, cross-CPU synchronization is checked.
> This series also provides delay injection, which may be enabled via
> kernel boot parameters to test the checking for delays.
>
> Note that "sufficient delay" can be provided by SMIs, NMIs, and of course
> vCPU preemption.

I buy the vCPU preemption part and TBH guests should not have that
watchdog thing active at all for exactly this reason.

SMI, NMI injecting 62.5ms delay? If that happens then the performance of
the clocksource is the least of your worries.

Thanks,

tglx

2021-04-10 08:42:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 2/5] clocksource: Retry clock read if long delays detected

On Fri, Apr 02 2021 at 15:49, paulmck wrote:
> This commit therefore re-reads the watchdog clock on either side of

'This commit' is not any better than 'This patch' and this sentence
makes no sense. I might be missing something, but how exactly does "the
commit" re-read the watchdog clock?

git grep 'This patch' Documentation/process/

> the read from the clock under test. If the watchdog clock shows an
> +retry:
> local_irq_disable();
> - csnow = cs->read(cs);
> - clocksource_watchdog_inject_delay();
> wdnow = watchdog->read(watchdog);
> + clocksource_watchdog_inject_delay();
> + csnow = cs->read(cs);
> + wdagain = watchdog->read(watchdog);
> local_irq_enable();
> + delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
> + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);

That variable naming is confusing as hell. This is about the delta and
not about the second readout of the watchdog.

> + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {

How exactly is this going negative especially with clocksources which
have a limited bitwidth? See clocksource_delta().

> + wderr_nsec = wdagain_nsec;
> + if (nretries++ < max_read_retries)
> + goto retry;
> + }
> + if (nretries)
> + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
> + smp_processor_id(), watchdog->name, wderr_nsec, nretries);

Lacks curly braces around the pr_warn() simply because it's not a single
line. Breaks my parser :)

But if this ever happens to exceed max_read_retries, then what's the
point of continuing at all? The data is known to be crap already.

> /* Clocksource initialized ? */
> if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||

Thanks,

tglx

2021-04-10 09:01:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Fri, Apr 02 2021 at 15:49, paulmck wrote:
>
> +static void clocksource_verify_percpu_wq(struct work_struct *unused)
> +{
> + int cpu;
> + struct clocksource *cs;
> + int64_t cs_nsec;
> + u64 csnow_begin;
> + u64 csnow_end;
> + u64 delta;

Please use reverse fir tree ordering and stick variables of the same
type together:

u64 csnow_begin, csnow_end, delta;
struct clocksource *cs;
s64 cs_nsec;
int cpu;
> +
> + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release

Please don't use tail comments. They are a horrible distraction.

> + if (WARN_ON_ONCE(!cs))
> + return;
> + pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
> + cs->name, smp_processor_id());
> + cpumask_clear(&cpus_ahead);
> + cpumask_clear(&cpus_behind);
> + csnow_begin = cs->read(cs);

So this is invoked via work and the actual clocksource change is done
via work too. Once the clocksource is not longer actively used for
timekeeping it can go away. What's guaranteeing that this runs prior to
the clocksource change and 'cs' is valid throughout this function?

> + queue_work(system_highpri_wq, &clocksource_verify_work);

This does not guarantee anything. So why does this need an extra work
function which is scheduled seperately?

Thanks,

tglx

2021-04-10 09:05:37

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking

On Fri, Apr 02 2021 at 15:49, paulmck wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> Although smp_call_function() has the advantage of simplicity, using
> it to check for cross-CPU clock desynchronization means that any CPU
> being slow reduces the sensitivity of the checking across all CPUs.
> And it is not uncommon for smp_call_function() latencies to be in the
> hundreds of microseconds.
>
> This commit therefore switches to smp_call_function_single(), so that
> delays from a given CPU affect only those measurements involving that
> particular CPU.

Is there any reason I'm missing why this is not done right in patch 3/5
which introduces this synchronization check?

Thanks,

tglx


2021-04-10 23:30:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource] Do not mark clocks unstable due to delays for v5.13

On Sat, Apr 10, 2021 at 10:01:58AM +0200, Thomas Gleixner wrote:
> On Fri, Apr 02 2021 at 15:48, Paul E. McKenney wrote:
> > Hello!
> >
> > If there is a sufficient delay between reading the watchdog clock and the
> > clock under test, the clock under test will be marked unstable through no
> > fault of its own. This series checks for this, doing limited retries
> > to get a good set of clock reads. If the clock is marked unstable
> > and is marked as being per-CPU, cross-CPU synchronization is checked.
> > This series also provides delay injection, which may be enabled via
> > kernel boot parameters to test the checking for delays.
> >
> > Note that "sufficient delay" can be provided by SMIs, NMIs, and of course
> > vCPU preemption.
>
> I buy the vCPU preemption part and TBH guests should not have that
> watchdog thing active at all for exactly this reason.

Agreed, one approach is to enable the clocksource watchdog only in
the hypervisor, and have some action on the guests triggered when the
host detects clock skew.

This works quite well, at least until something breaks in a way that
messes up clock reads from the guest but not from the host. And I
am sure that any number of hardware guys will tell me that this just
isn't possible, but if failing hardware operated according to their
expectations, that hardware wouldn't be considered to be failing.
Or it wouldn't be hardware, firmware, or clock-driver bringup, as the
case may be.

> SMI, NMI injecting 62.5ms delay? If that happens then the performance of
> the clocksource is the least of your worries.

I was kind of hoping that you would tell me why the skew must be all the
way up to 62.5ms before the clock is disabled. The watchdog currently
is quite happy with more than 10% skew between clocks.

100HZ clocks or some such?

Thanx, Paul

2021-04-10 23:53:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 2/5] clocksource: Retry clock read if long delays detected

On Sat, Apr 10, 2021 at 10:41:21AM +0200, Thomas Gleixner wrote:
> On Fri, Apr 02 2021 at 15:49, paulmck wrote:
> > This commit therefore re-reads the watchdog clock on either side of
>
> 'This commit' is not any better than 'This patch' and this sentence
> makes no sense. I might be missing something, but how exactly does "the
> commit" re-read the watchdog clock?
>
> git grep 'This patch' Documentation/process/

I will rework this.

> > the read from the clock under test. If the watchdog clock shows an
> > +retry:
> > local_irq_disable();
> > - csnow = cs->read(cs);
> > - clocksource_watchdog_inject_delay();
> > wdnow = watchdog->read(watchdog);
> > + clocksource_watchdog_inject_delay();
> > + csnow = cs->read(cs);
> > + wdagain = watchdog->read(watchdog);
> > local_irq_enable();
> > + delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
> > + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
>
> That variable naming is confusing as hell. This is about the delta and
> not about the second readout of the watchdog.

How about wdagain_delta?

> > + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
>
> How exactly is this going negative especially with clocksources which
> have a limited bitwidth? See clocksource_delta().

I thought that I had actually seen this happen, though it is of course
quite possible that it was due to a bug in an early version of my changes.

What I will do is to remove the less-than comparison and test with
a WARN_ON(). If that doesn't trigger, I will drop the WARN_ON().
If it does trigger, I will figure out why.

> > + wderr_nsec = wdagain_nsec;
> > + if (nretries++ < max_read_retries)
> > + goto retry;
> > + }
> > + if (nretries)
> > + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n",
> > + smp_processor_id(), watchdog->name, wderr_nsec, nretries);
>
> Lacks curly braces around the pr_warn() simply because it's not a single
> line. Breaks my parser :)

OK, will fix. ;-)

> But if this ever happens to exceed max_read_retries, then what's the
> point of continuing at all? The data is known to be crap already.

If there are four delays in four consecutive attempts to read out the
clocks -- with interrupts disabled -- then it is quite possible that the
delay is actually caused by the attempt to read the clock. In which case,
marking the clock bad due to skew is a reasonable choice.

On the other hand, if the four consecutive delays are caused by something
like an NMI storm, then as you say, you have worse problems.

Thanx, Paul

> > /* Clocksource initialized ? */
> > if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
>
> Thanks,
>
> tglx

2021-04-11 00:25:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Sat, Apr 10, 2021 at 11:00:25AM +0200, Thomas Gleixner wrote:
> On Fri, Apr 02 2021 at 15:49, paulmck wrote:
> >
> > +static void clocksource_verify_percpu_wq(struct work_struct *unused)
> > +{
> > + int cpu;
> > + struct clocksource *cs;
> > + int64_t cs_nsec;
> > + u64 csnow_begin;
> > + u64 csnow_end;
> > + u64 delta;
>
> Please use reverse fir tree ordering and stick variables of the same
> type together:
>
> u64 csnow_begin, csnow_end, delta;
> struct clocksource *cs;
> s64 cs_nsec;
> int cpu;

Will do.

> > +
> > + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release
>
> Please don't use tail comments. They are a horrible distraction.

I will remove it.

> > + if (WARN_ON_ONCE(!cs))
> > + return;
> > + pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
> > + cs->name, smp_processor_id());
> > + cpumask_clear(&cpus_ahead);
> > + cpumask_clear(&cpus_behind);
> > + csnow_begin = cs->read(cs);
>
> So this is invoked via work and the actual clocksource change is done
> via work too. Once the clocksource is not longer actively used for
> timekeeping it can go away. What's guaranteeing that this runs prior to
> the clocksource change and 'cs' is valid throughout this function?

From what I can see, cs->read() doesn't care whether or not the
clocksource has been marked unstable. So it should be OK to call
cs->read() before, during, or after the call to __clocksource_unstable().

Also, this is only done on clocksources marked CLOCK_SOURCE_VERIFY_PERCPU,
so any clocksource that did not like cs->read() being invoked during
or after the call to __clocksource_unstable() should leave off the
CLOCK_SOURCE_VERIFY_PERCPU bit.

Or did I take a wrong turn somewhere in the pointers to functions?

> > + queue_work(system_highpri_wq, &clocksource_verify_work);
>
> This does not guarantee anything. So why does this need an extra work
> function which is scheduled seperately?

Because I was concerned about doing smp_call_function() while holding
watchdog_lock, which is also acquired elsewhere using spin_lock_irqsave().
And it still looks like on x86 that spin_lock_irqsave() spins with irqs
disabled, which could result in deadlock. The smp_call_function_single()
would wait for the target CPU to enable interrupts, which would not
happen until after the smp_call_function_single() returned due to its
caller holding watchdog_lock.

Or is there something that I am missing that prevents this deadlock
from occurring?

Thanx, Paul

2021-04-11 00:26:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking

On Sat, Apr 10, 2021 at 11:04:54AM +0200, Thomas Gleixner wrote:
> On Fri, Apr 02 2021 at 15:49, paulmck wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > Although smp_call_function() has the advantage of simplicity, using
> > it to check for cross-CPU clock desynchronization means that any CPU
> > being slow reduces the sensitivity of the checking across all CPUs.
> > And it is not uncommon for smp_call_function() latencies to be in the
> > hundreds of microseconds.
> >
> > This commit therefore switches to smp_call_function_single(), so that
> > delays from a given CPU affect only those measurements involving that
> > particular CPU.
>
> Is there any reason I'm missing why this is not done right in patch 3/5
> which introduces this synchronization check?

None at all. I will merge this into 3/5.

Thanx, Paul

2021-04-11 10:35:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Sat, Apr 10 2021 at 17:20, Paul E. McKenney wrote:
> On Sat, Apr 10, 2021 at 11:00:25AM +0200, Thomas Gleixner wrote:
>> > + if (WARN_ON_ONCE(!cs))
>> > + return;
>> > + pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
>> > + cs->name, smp_processor_id());
>> > + cpumask_clear(&cpus_ahead);
>> > + cpumask_clear(&cpus_behind);
>> > + csnow_begin = cs->read(cs);
>>
>> So this is invoked via work and the actual clocksource change is done
>> via work too. Once the clocksource is not longer actively used for
>> timekeeping it can go away. What's guaranteeing that this runs prior to
>> the clocksource change and 'cs' is valid throughout this function?
>
> From what I can see, cs->read() doesn't care whether or not the
> clocksource has been marked unstable. So it should be OK to call
> cs->read() before, during, or after the call to __clocksource_unstable().
>
> Also, this is only done on clocksources marked CLOCK_SOURCE_VERIFY_PERCPU,
> so any clocksource that did not like cs->read() being invoked during
> or after the call to __clocksource_unstable() should leave off the
> CLOCK_SOURCE_VERIFY_PERCPU bit.
>
> Or did I take a wrong turn somewhere in the pointers to functions?

Right. cs->read() does not care, but what guarantees that cs is valid
and not freed yet? It's not an issue with TSC and KVMCLOCK, but
conceptually the following is possible:

watchdog()
   queue_work(synccheck);
   queue_work(clocksource_change);

work:
   synccheck()                    clocksource_change()
     preemption                     ...
     ...
                                    some_other_code():
                                      unregister_clocksource(cs)
                                      free(cs)
     cs->read()  <- UAF

>> > + queue_work(system_highpri_wq, &clocksource_verify_work);
>>
>> This does not guarantee anything. So why does this need an extra work
>> function which is scheduled seperately?
>
> Because I was concerned about doing smp_call_function() while holding
> watchdog_lock, which is also acquired elsewhere using spin_lock_irqsave().
> And it still looks like on x86 that spin_lock_irqsave() spins with irqs
> disabled, which could result in deadlock. The smp_call_function_single()
> would wait for the target CPU to enable interrupts, which would not
> happen until after the smp_call_function_single() returned due to its
> caller holding watchdog_lock.
>
> Or is there something that I am missing that prevents this deadlock
> from occurring?

The unstable mechanism is:

watchdog()
   __clocksource_unstable()
     schedule_work(&watchdog_work);

watchdog_work()
   kthread_run(clocksource_watchdog_thread);

cs_watchdog_thread()
   mutex_lock(&clocksource_mutex);
   if (__clocksource_watchdog_kthread())
      clocksource_select();
   mutex_unlock(&clocksource_mutex);

So what prevents you from doing that right in watchdog_work() or even in
cs_watchdog_thread() properly ordered against the actual clocksource
switch?

Hmm?

Thanks,

tglx

2021-04-11 10:59:16

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource] Do not mark clocks unstable due to delays for v5.13

On Sat, Apr 10 2021 at 16:26, Paul E. McKenney wrote:
> On Sat, Apr 10, 2021 at 10:01:58AM +0200, Thomas Gleixner wrote:
>> On Fri, Apr 02 2021 at 15:48, Paul E. McKenney wrote:
>> I buy the vCPU preemption part and TBH guests should not have that
>> watchdog thing active at all for exactly this reason.
>
> Agreed, one approch is to enable the the clocksource watchdog only in
> the hypervisor, and have some action on the guests triggered when the
> host detects clock skew.
>
> This works quite well, at least until something breaks in a way that
> messes up clock reads from the guest but not from the host. And I
> am sure that any number of hardware guys will tell me that this just
> isn't possible, but if failing hardware operated according to their
> expectations, that hardware wouldn't be considered to be failing.
> Or it wouldn't be hardware, firmware, or clock-driver bringup, as the
> case may be.

Don't tell me. The fact that this code exists at all is a horror on its
own.

>> SMI, NMI injecting 62.5ms delay? If that happens then the performance of
>> the clocksource is the least of your worries.
>
> I was kind of hoping that you would tell me why the skew must be all the
> way up to 62.5ms before the clock is disabled. The watchdog currently
> is quite happy with more than 10% skew between clocks.
>
> 100HZ clocks or some such?

Histerical raisins. When the clocksource watchdog was introduced it
replaced an x86-specific validation which was jiffies based. I have faint
memories that we wanted to have at least jiffies based checks preserved
in absence of other hardware, which had other problems and we gave up on
it. But obviously nobody thought about revisiting the threshold.

Yes, it's way too big. The slowest watchdog frequency on x86 is ~3.5 MHz
(ACPI PMtimer). Don't know about the reference frequency on MIPS which
is the only other user of this.

Thanks,

tglx

2021-04-11 17:53:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Sun, Apr 11, 2021 at 12:33:44PM +0200, Thomas Gleixner wrote:
> On Sat, Apr 10 2021 at 17:20, Paul E. McKenney wrote:
> > On Sat, Apr 10, 2021 at 11:00:25AM +0200, Thomas Gleixner wrote:
> >> > + if (WARN_ON_ONCE(!cs))
> >> > + return;
> >> > + pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
> >> > + cs->name, smp_processor_id());
> >> > + cpumask_clear(&cpus_ahead);
> >> > + cpumask_clear(&cpus_behind);
> >> > + csnow_begin = cs->read(cs);
> >>
> >> So this is invoked via work and the actual clocksource change is done
> >> via work too. Once the clocksource is not longer actively used for
> >> timekeeping it can go away. What's guaranteeing that this runs prior to
> >> the clocksource change and 'cs' is valid throughout this function?
> >
> > From what I can see, cs->read() doesn't care whether or not the
> > clocksource has been marked unstable. So it should be OK to call
> > cs->read() before, during, or after the call to __clocksource_unstable().
> >
> > Also, this is only done on clocksources marked CLOCK_SOURCE_VERIFY_PERCPU,
> > so any clocksource that did not like cs->read() being invoked during
> > or after the call to __clocksource_unstable() should leave off the
> > CLOCK_SOURCE_VERIFY_PERCPU bit.
> >
> > Or did I take a wrong turn somewhere in the pointers to functions?
>
> Right. cs->read() does not care, but what guarantees that cs is valid
> and not freed yet? It's not an issue with TSC and KVMCLOCK, but
> conceptually the following is possible:
>
> watchdog()
> queue_work(synccheck);
> queue_work(clocksource_change);
>
> work:
> synccheck() clocksource_change()
> preemption ...
> ...
> some_other_code():
> unregister_clocksource(cs)
> free(cs)
> cs->read() <- UAF

Got it, with the ingenic_tcu_init() function being a case in point.
It invokes clocksource_unregister() shortly followed by clk_put(), which,
if I found the correct clk_put(), can kfree() it.

Thank you!

> >> > + queue_work(system_highpri_wq, &clocksource_verify_work);
> >>
> >> This does not guarantee anything. So why does this need an extra work
> >> function which is scheduled seperately?
> >
> > Because I was concerned about doing smp_call_function() while holding
> > watchdog_lock, which is also acquired elsewhere using spin_lock_irqsave().
> > And it still looks like on x86 that spin_lock_irqsave() spins with irqs
> > disabled, which could result in deadlock. The smp_call_function_single()
> > would wait for the target CPU to enable interrupts, which would not
> > happen until after the smp_call_function_single() returned due to its
> > caller holding watchdog_lock.
> >
> > Or is there something that I am missing that prevents this deadlock
> > from occurring?
>
> The unstable mechanism is:
>
> watchdog()
> __clocksource_unstable()
> schedule_work(&watchdog_work);
>
> watchdog_work()
> kthread_run(clocksource_watchdog_thread);
>
> cs_watchdog_thread()
> mutex_lock(&clocksource_mutex);
> if (__clocksource_watchdog_kthread())
> clocksource_select();
> mutex_unlock(&clocksource_mutex);
>
> So what prevents you from doing that right in watchdog_work() or even in
> cs_watchdog_thread() properly ordered against the actual clocksource
> switch?
>
> Hmm?

My own confusion, apparently. :-/

So what I need to do is inline clocksource_verify_percpu_wq()
into clocksource_verify_percpu() and then move the call to
clocksource_verify_percpu() to __clocksource_watchdog_kthread(), right
before the existing call to list_del_init(). Will do!

Thanx, Paul

2021-04-11 19:35:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource] Do not mark clocks unstable due to delays for v5.13

On Sun, Apr 11, 2021 at 12:58:31PM +0200, Thomas Gleixner wrote:
> On Sat, Apr 10 2021 at 16:26, Paul E. McKenney wrote:
> > On Sat, Apr 10, 2021 at 10:01:58AM +0200, Thomas Gleixner wrote:
> >> On Fri, Apr 02 2021 at 15:48, Paul E. McKenney wrote:
> >> I buy the vCPU preemption part and TBH guests should not have that
> >> watchdog thing active at all for exactly this reason.
> >
> > Agreed, one approch is to enable the the clocksource watchdog only in
> > the hypervisor, and have some action on the guests triggered when the
> > host detects clock skew.
> >
> > This works quite well, at least until something breaks in a way that
> > messes up clock reads from the guest but not from the host. And I
> > am sure that any number of hardware guys will tell me that this just
> > isn't possible, but if failing hardware operated according to their
> > expectations, that hardware wouldn't be considered to be failing.
> > Or it wouldn't be hardware, firmware, or clock-driver bringup, as the
> > case may be.
>
> Don't tell me. The fact that this code exists at all is a horror on it's
> own.

Let's just say that I really did consider the option of just disabling
the watchdog for guest OSes, but it will likely be at least a few years
before per-CPU hardware clocks regain my trust. :-/

> >> SMI, NMI injecting 62.5ms delay? If that happens then the performance of
> >> the clocksource is the least of your worries.
> >
> > I was kind of hoping that you would tell me why the skew must be all the
> > way up to 62.5ms before the clock is disabled. The watchdog currently
> > is quite happy with more than 10% skew between clocks.
> >
> > 100HZ clocks or some such?
>
> Histerical raisins. When the clocksource watchdog was introduced it
> replaced a x86 specific validation which was jiffies based. I have faint
> memories that we wanted to have at least jiffies based checks preserved
> in absence of other hardware, which had other problems and we gave up on
> it. But obviously nobody thought about revisiting the threshold.
>
> Yes, it's way too big. The slowest watchdog frequency on x86 is ~3.5 Mhz
> (ACPI PMtimer). Don't know about the reference frequency on MIPS which
> is the only other user of this.

Whew! I will try reducing the permitted skew. My (perhaps naive) guess
is that with the delay rejection, there is no reason that it cannot
be decreased at least to 500us. If that goes well, I will send along
another patch.

Thanx, Paul

2021-04-12 04:25:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Sun, Apr 11, 2021 at 09:46:12AM -0700, Paul E. McKenney wrote:
> On Sun, Apr 11, 2021 at 12:33:44PM +0200, Thomas Gleixner wrote:
> > On Sat, Apr 10 2021 at 17:20, Paul E. McKenney wrote:
> > > On Sat, Apr 10, 2021 at 11:00:25AM +0200, Thomas Gleixner wrote:
> > >> > + if (WARN_ON_ONCE(!cs))
> > >> > + return;
> > >> > + pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
> > >> > + cs->name, smp_processor_id());
> > >> > + cpumask_clear(&cpus_ahead);
> > >> > + cpumask_clear(&cpus_behind);
> > >> > + csnow_begin = cs->read(cs);
> > >>
> > >> So this is invoked via work and the actual clocksource change is done
> > >> via work too. Once the clocksource is not longer actively used for
> > >> timekeeping it can go away. What's guaranteeing that this runs prior to
> > >> the clocksource change and 'cs' is valid throughout this function?
> > >
> > > From what I can see, cs->read() doesn't care whether or not the
> > > clocksource has been marked unstable. So it should be OK to call
> > > cs->read() before, during, or after the call to __clocksource_unstable().
> > >
> > > Also, this is only done on clocksources marked CLOCK_SOURCE_VERIFY_PERCPU,
> > > so any clocksource that did not like cs->read() being invoked during
> > > or after the call to __clocksource_unstable() should leave off the
> > > CLOCK_SOURCE_VERIFY_PERCPU bit.
> > >
> > > Or did I take a wrong turn somewhere in the pointers to functions?
> >
> > Right. cs->read() does not care, but what guarantees that cs is valid
> > and not freed yet? It's not an issue with TSC and KVMCLOCK, but
> > conceptually the following is possible:
> >
> > watchdog()
> > queue_work(synccheck);
> > queue_work(clocksource_change);
> >
> > work:
> > synccheck() clocksource_change()
> > preemption ...
> > ...
> > some_other_code():
> > unregister_clocksource(cs)
> > free(cs)
> > cs->read() <- UAF
>
> Got it, with the ingenic_tcu_init() function being case in point.
> It invokes clcoksource_unregister() shortly followed by clk_put(), which,
> if I found the correct clk_put(), can kfree() it.
>
> Thank you!
>
> > >> > + queue_work(system_highpri_wq, &clocksource_verify_work);
> > >>
> > >> This does not guarantee anything. So why does this need an extra work
> > >> function which is scheduled seperately?
> > >
> > > Because I was concerned about doing smp_call_function() while holding
> > > watchdog_lock, which is also acquired elsewhere using spin_lock_irqsave().
> > > And it still looks like on x86 that spin_lock_irqsave() spins with irqs
> > > disabled, which could result in deadlock. The smp_call_function_single()
> > > would wait for the target CPU to enable interrupts, which would not
> > > happen until after the smp_call_function_single() returned due to its
> > > caller holding watchdog_lock.
> > >
> > > Or is there something that I am missing that prevents this deadlock
> > > from occurring?
> >
> > The unstable mechanism is:
> >
> > watchdog()
> > __clocksource_unstable()
> > schedule_work(&watchdog_work);
> >
> > watchdog_work()
> > kthread_run(clocksource_watchdog_thread);
> >
> > cs_watchdog_thread()
> > mutex_lock(&clocksource_mutex);
> > if (__clocksource_watchdog_kthread())
> > clocksource_select();
> > mutex_unlock(&clocksource_mutex);
> >
> > So what prevents you from doing that right in watchdog_work() or even in
> > cs_watchdog_thread() properly ordered against the actual clocksource
> > switch?
> >
> > Hmm?
>
> My own confusion, apparently. :-/
>
> So I need to is inline clocksource_verify_percpu_wq()
> into clocksource_verify_percpu() and then move the call to
> clocksource_verify_percpu() to __clocksource_watchdog_kthread(), right
> before the existing call to list_del_init(). Will do!

Except that this triggers the WARN_ON_ONCE() in smp_call_function_single()
due to interrupts being disabled across that list_del_init().

Possibilities include:

1. Figure out why interrupts must be disabled only sometimes while
holding watchdog_lock, in the hope that they need not be across
the entire critical section for __clocksource_watchdog_kthread().
As in:

local_irq_restore(flags);
clocksource_verify_percpu(cs);
local_irq_save(flags);

Trying this first with lockdep enabled. Might be spectacular.

2. Invoke clocksource_verify_percpu() from its original
location in clocksource_watchdog(), just before the call to
__clocksource_unstable(). This relies on the fact that
clocksource_watchdog() acquires watchdog_lock without
disabling interrupts.

3. Restrict CLOCK_SOURCE_VERIFY_PERCPU to clocksource structures
that are statically allocated, thus avoiding the use-after-free
problem. Rely on KASAN to enforce this restriction.

4. Add reference counting or some such to clock sources.

5. Your ideas here.

I will give this more thought, but #2 is looking pretty good at this point.

Thanx, Paul

2021-04-12 13:09:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Sun, Apr 11 2021 at 21:21, Paul E. McKenney wrote:
> On Sun, Apr 11, 2021 at 09:46:12AM -0700, Paul E. McKenney wrote:
>> So I need to is inline clocksource_verify_percpu_wq()
>> into clocksource_verify_percpu() and then move the call to
>> clocksource_verify_percpu() to __clocksource_watchdog_kthread(), right
>> before the existing call to list_del_init(). Will do!
>
> Except that this triggers the WARN_ON_ONCE() in smp_call_function_single()
> due to interrupts being disabled across that list_del_init().
>
> Possibilities include:
>
> 1. Figure out why interrupts must be disabled only sometimes while
> holding watchdog_lock, in the hope that they need not be across
> the entire critical section for __clocksource_watchdog_kthread().
> As in:
>
> local_irq_restore(flags);
> clocksource_verify_percpu(cs);
> local_irq_save(flags);
>
> Trying this first with lockdep enabled. Might be spectacular.

Yes, it's a possible deadlock against the watchdog timer firing ...

The reason for irqsave is again historical AFAICT and nobody bothered to
clean it up. spin_lock_bh() should be sufficient to serialize against
the watchdog timer, though I haven't looked at all possible scenarios.

> 2. Invoke clocksource_verify_percpu() from its original
> location in clocksource_watchdog(), just before the call to
> __clocksource_unstable(). This relies on the fact that
> clocksource_watchdog() acquires watchdog_lock without
> disabling interrupts.

That should be fine, but this might cause the softirq to 'run' for a
very long time which is not pretty either.

Aside of that, do we really need to check _all_ online CPUs? What you
are trying to figure out is whether the wreckage is CPU local or global,
right?

Wouldn't a shirt-sleeve approach of just querying _one_ CPU be good
enough? Either the other CPU has the same wreckage, then it's global or
it hasn't which points to a per CPU local issue.

Sure it does not catch the case where a subset (>1) of all CPUs is
affected, but I'm not seeing how that really buys us anything.

Thanks,

tglx

2021-04-13 06:24:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Mon, Apr 12, 2021 at 03:08:16PM +0200, Thomas Gleixner wrote:
> On Sun, Apr 11 2021 at 21:21, Paul E. McKenney wrote:
> > On Sun, Apr 11, 2021 at 09:46:12AM -0700, Paul E. McKenney wrote:
> >> So I need to is inline clocksource_verify_percpu_wq()
> >> into clocksource_verify_percpu() and then move the call to
> >> clocksource_verify_percpu() to __clocksource_watchdog_kthread(), right
> >> before the existing call to list_del_init(). Will do!
> >
> > Except that this triggers the WARN_ON_ONCE() in smp_call_function_single()
> > due to interrupts being disabled across that list_del_init().
> >
> > Possibilities include:
> >
> > 1. Figure out why interrupts must be disabled only sometimes while
> > holding watchdog_lock, in the hope that they need not be across
> > the entire critical section for __clocksource_watchdog_kthread().
> > As in:
> >
> > local_irq_restore(flags);
> > clocksource_verify_percpu(cs);
> > local_irq_save(flags);
> >
> > Trying this first with lockdep enabled. Might be spectacular.
>
> Yes, it's a possible deadlock against the watchdog timer firing ...

And lockdep most emphatically agrees with you. ;-)

> The reason for irqsave is again historical AFAICT and nobody bothered to
> clean it up. spin_lock_bh() should be sufficient to serialize against
> the watchdog timer, though I haven't looked at all possible scenarios.

Though if BH is disabled, there is not so much advantage to
invoking it from __clocksource_watchdog_kthread(). Might as
well just invoke it directly from clocksource_watchdog().

> > 2. Invoke clocksource_verify_percpu() from its original
> > location in clocksource_watchdog(), just before the call to
> > __clocksource_unstable(). This relies on the fact that
> > clocksource_watchdog() acquires watchdog_lock without
> > disabling interrupts.
>
> That should be fine, but this might cause the softirq to 'run' for a
> very long time which is not pretty either.
>
> Aside of that, do we really need to check _all_ online CPUs? What you
> are trying to figure out is whether the wreckage is CPU local or global,
> right?
>
> Wouldn't a shirt-sleeve approach of just querying _one_ CPU be good
> enough? Either the other CPU has the same wreckage, then it's global or
> it hasn't which points to a per CPU local issue.
>
> Sure it does not catch the case where a subset (>1) of all CPUs is
> affected, but I'm not seing how that really buys us anything.

Good point! My thought is to randomly pick eight CPUs to keep the
duration reasonable while having a good chance of hitting "interesting"
CPU choices in multicore and multisocket systems.

However, if a hard-to-reproduce problem occurred, it would be good to take
the hit and scan all the CPUs. Additionally, there are some workloads
for which the switch from TSC to HPET is fatal anyway due to increased
overhead. For these workloads, the full CPU scan is no additional pain.

So I am thinking in terms of a default that probes eight randomly selected
CPUs without worrying about duplicates (as in there would be some chance
that fewer CPUs would actually be probed), but with a boot-time flag
that does all CPUs. I would add the (default) random selection as a
separate patch.
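
Something along the lines of the following rough sketch, where the helper
name and the prandom_u32()-based selection are placeholders rather than
what the eventual patch will necessarily use, and where duplicate picks
are simply tolerated:

static void clocksource_verify_some_cpus(struct clocksource *cs)
{
        u64 csnow_begin, csnow_end;
        int cpu, i;

        for (i = 0; i < 8; i++) {
                cpu = prandom_u32() % nr_cpu_ids;
                if (!cpu_online(cpu) || cpu == smp_processor_id())
                        continue;
                csnow_begin = cs->read(cs);
                smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
                csnow_end = cs->read(cs);
                /* Compare csnow_mid against csnow_begin/csnow_end as in 3/5. */
        }
}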

I will send a new series out later today, Pacific Time.

Thanx, Paul

2021-04-13 07:00:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Mon, Apr 12, 2021 at 08:54:03PM +0200, Thomas Gleixner wrote:
> Paul,
>
> On Mon, Apr 12 2021 at 11:20, Paul E. McKenney wrote:
> > On Mon, Apr 12, 2021 at 03:08:16PM +0200, Thomas Gleixner wrote:
> >> The reason for irqsave is again historical AFAICT and nobody bothered to
> >> clean it up. spin_lock_bh() should be sufficient to serialize against
> >> the watchdog timer, though I haven't looked at all possible scenarios.
> >
> > Though if BH is disabled, there is not so much advantage to
> > invoking it from __clocksource_watchdog_kthread(). Might as
> > well just invoke it directly from clocksource_watchdog().
> >
> >> > 2. Invoke clocksource_verify_percpu() from its original
> >> > location in clocksource_watchdog(), just before the call to
> >> > __clocksource_unstable(). This relies on the fact that
> >> > clocksource_watchdog() acquires watchdog_lock without
> >> > disabling interrupts.
> >>
> >> That should be fine, but this might cause the softirq to 'run' for a
> >> very long time which is not pretty either.
> >>
> >> Aside of that, do we really need to check _all_ online CPUs? What you
> >> are trying to figure out is whether the wreckage is CPU local or global,
> >> right?
> >>
> >> Wouldn't a shirt-sleeve approach of just querying _one_ CPU be good
> >> enough? Either the other CPU has the same wreckage, then it's global or
> >> it hasn't which points to a per CPU local issue.
> >>
> >> Sure it does not catch the case where a subset (>1) of all CPUs is
> >> affected, but I'm not seing how that really buys us anything.
> >
> > Good point! My thought is to randomly pick eight CPUs to keep the
> > duration reasonable while having a good chance of hitting "interesting"
> > CPU choices in multicore and multisocket systems.
> >
> > However, if a hard-to-reproduce problem occurred, it would be good to take
> > the hit and scan all the CPUs. Additionally, there are some workloads
> > for which the switch from TSC to HPET is fatal anyway due to increased
> > overhead. For these workloads, the full CPU scan is no additional pain.
> >
> > So I am thinking in terms of a default that probes eight randomly selected
> > CPUs without worrying about duplicates (as in there would be some chance
> > that fewer CPUs would actually be probed), but with a boot-time flag
> > that does all CPUs. I would add the (default) random selection as a
> > separate patch.
>
> You can't do without making it complex, right? Keep it simple is not an
> option for a RCU hacker it seems :)

But it was simple! It just hit all the CPUs.

However, you (quite rightly) pointed out that this simple approach had
a few shortcomings. ;-)

> > I will send a new series out later today, Pacific Time.
>
> Can you do me a favour and send it standalone and not as yet another
> reply to this existing thread maze. A trivial lore link to the previous
> version gives enough context.

Will do!

Of course, it turns out that lockdep also doesn't like waited-on
smp_call_function_single() invocations from timer handlers,
so I am currently looking at other options for dealing with that
potential use-after-free. I am starting to like the looks of "only set
CLOCK_SOURCE_VERIFY_PERCPU on statically allocated clocksource structures
and let KASAN enforce this restriction", but I have not quite given up
on making it more general.

Thanx, Paul

2021-04-13 07:06:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Mon, Apr 12 2021 at 12:57, Paul E. McKenney wrote:
> On Mon, Apr 12, 2021 at 08:54:03PM +0200, Thomas Gleixner wrote:
>> > I will send a new series out later today, Pacific Time.
>>
>> Can you do me a favour and send it standalone and not as yet another
>> reply to this existing thread maze. A trivial lore link to the previous
>> version gives enough context.
>
> Will do!
>
> Of course, it turns out that lockdep also doesn't like waited-on
> smp_call_function_single() invocations from timer handlers,
> so I am currently looking at other options for dealing with that
> potential use-after-free. I am starting to like the looks of "only set
> CLOCK_SOURCE_VERIFY_PERCPU on statically allocated clocksource structures
> and let KASAN enforce this restriction", but I have not quite given up
> on making it more general.

The simplest point is in the thread under the clocksource_mutex which
prevents anything from vanishing under your feet.

Thanks,

tglx

2021-04-13 11:01:16

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

Paul,

On Mon, Apr 12 2021 at 11:20, Paul E. McKenney wrote:
> On Mon, Apr 12, 2021 at 03:08:16PM +0200, Thomas Gleixner wrote:
>> The reason for irqsave is again historical AFAICT and nobody bothered to
>> clean it up. spin_lock_bh() should be sufficient to serialize against
>> the watchdog timer, though I haven't looked at all possible scenarios.
>
> Though if BH is disabled, there is not so much advantage to
> invoking it from __clocksource_watchdog_kthread(). Might as
> well just invoke it directly from clocksource_watchdog().
>
>> > 2. Invoke clocksource_verify_percpu() from its original
>> > location in clocksource_watchdog(), just before the call to
>> > __clocksource_unstable(). This relies on the fact that
>> > clocksource_watchdog() acquires watchdog_lock without
>> > disabling interrupts.
>>
>> That should be fine, but this might cause the softirq to 'run' for a
>> very long time which is not pretty either.
>>
>> Aside of that, do we really need to check _all_ online CPUs? What you
>> are trying to figure out is whether the wreckage is CPU local or global,
>> right?
>>
>> Wouldn't a shirt-sleeve approach of just querying _one_ CPU be good
>> enough? Either the other CPU has the same wreckage, then it's global or
>> it hasn't which points to a per CPU local issue.
>>
>> Sure it does not catch the case where a subset (>1) of all CPUs is
>> affected, but I'm not seeing how that really buys us anything.
>
> Good point! My thought is to randomly pick eight CPUs to keep the
> duration reasonable while having a good chance of hitting "interesting"
> CPU choices in multicore and multisocket systems.
>
> However, if a hard-to-reproduce problem occurred, it would be good to take
> the hit and scan all the CPUs. Additionally, there are some workloads
> for which the switch from TSC to HPET is fatal anyway due to increased
> overhead. For these workloads, the full CPU scan is no additional pain.
>
> So I am thinking in terms of a default that probes eight randomly selected
> CPUs without worrying about duplicates (as in there would be some chance
> that fewer CPUs would actually be probed), but with a boot-time flag
> that does all CPUs. I would add the (default) random selection as a
> separate patch.

You can't do without making it complex, right? Keep it simple is not an
option for an RCU hacker, it seems :)

> I will send a new series out later today, Pacific Time.

Can you do me a favour and send it standalone and not as yet another
reply to this existing thread maze. A trivial lore link to the previous
version gives enough context.

Thanks,

tglx

2021-04-13 11:56:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Mon, Apr 12, 2021 at 10:37:10PM +0200, Thomas Gleixner wrote:
> On Mon, Apr 12 2021 at 12:57, Paul E. McKenney wrote:
> > On Mon, Apr 12, 2021 at 08:54:03PM +0200, Thomas Gleixner wrote:
> >> > I will send a new series out later today, Pacific Time.
> >>
> >> Can you do me a favour and send it standalone and not as yet another
> >> reply to this existing thread maze. A trivial lore link to the previous
> >> version gives enough context.
> >
> > Will do!
> >
> > Of course, it turns out that lockdep also doesn't like waited-on
> > smp_call_function_single() invocations from timer handlers,
> > so I am currently looking at other options for dealing with that
> > potential use-after-free. I am starting to like the looks of "only set
> > CLOCK_SOURCE_VERIFY_PERCPU on statically allocated clocksource structures
> > and let KASAN enforce this restriction", but I have not quite given up
> > on making it more general.
>
> The simplest point is in the thread under the clocksource_mutex which
> prevents anything from vanishing under your feet.

And lockdep is -much- happier with the setup shown below, so thank
you again!

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index f047c6cb056c..34dc38b6b923 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -519,6 +515,13 @@ static int __clocksource_watchdog_kthread(void)
         unsigned long flags;
         int select = 0;
 
+        /* Do any required per-CPU skew verification. */
+        list_for_each_entry(cs, &watchdog_list, wd_list) {
+                if ((cs->flags & (CLOCK_SOURCE_UNSTABLE | CLOCK_SOURCE_VERIFY_PERCPU)) ==
+                    (CLOCK_SOURCE_UNSTABLE | CLOCK_SOURCE_VERIFY_PERCPU))
+                        clocksource_verify_percpu(cs);
+        }
+
         spin_lock_irqsave(&watchdog_lock, flags);
         list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
                 if (cs->flags & CLOCK_SOURCE_UNSTABLE) {

2021-04-14 08:14:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

Paul,

On Mon, Apr 12 2021 at 16:18, Paul E. McKenney wrote:
> On Mon, Apr 12, 2021 at 10:37:10PM +0200, Thomas Gleixner wrote:
>> On Mon, Apr 12 2021 at 12:57, Paul E. McKenney wrote:
>> > On Mon, Apr 12, 2021 at 08:54:03PM +0200, Thomas Gleixner wrote:
>> >> > I will send a new series out later today, Pacific Time.
>> >>
>> >> Can you do me a favour and send it standalone and not as yet another
>> >> reply to this existing thread maze. A trivial lore link to the previous
>> >> version gives enough context.
>> >
>> > Will do!
>> >
>> > Of course, it turns out that lockdep also doesn't like waited-on
>> > smp_call_function_single() invocations from timer handlers,
>> > so I am currently looking at other options for dealing with that
>> > potential use-after-free. I am starting to like the looks of "only set
>> > CLOCK_SOURCE_VERIFY_PERCPU on statically allocated clocksource structures
>> > and let KASAN enforce this restriction", but I have not quite given up
>> > on making it more general.
>>
>> The simplest point is in the thread under the clocksource_mutex which
>> prevents anything from vanishing under your feet.
>
> And lockdep is -much- happier with the setup shown below, so thank
> you again!

But it is too simple now :) ...

> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index f047c6cb056c..34dc38b6b923 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -519,6 +515,13 @@ static int __clocksource_watchdog_kthread(void)
>          unsigned long flags;
>          int select = 0;
>
> +        /* Do any required per-CPU skew verification. */
> +        list_for_each_entry(cs, &watchdog_list, wd_list) {
> +                if ((cs->flags & (CLOCK_SOURCE_UNSTABLE | CLOCK_SOURCE_VERIFY_PERCPU)) ==
> +                    (CLOCK_SOURCE_UNSTABLE | CLOCK_SOURCE_VERIFY_PERCPU))
> +                        clocksource_verify_percpu(cs);
> +        }

because that list is _NOT_ protected by the clocksource_mutex as you
noticed yourself already.

But you don't have to walk that list at all because the only interesting
thing is the currently active clocksource, which is about to be changed
in case the watchdog marked it unstable and cannot be changed by any
other code concurrently because clocksource_mutex is held.

So all you need is:

        if (curr_clocksource &&
            curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE &&
            curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU)
                clocksource_verify_percpu_wreckage(curr_clocksource);

Hmm?

Thanks,

tglx


2021-04-14 12:35:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

On Tue, Apr 13, 2021 at 10:49:11PM +0200, Thomas Gleixner wrote:
> Paul,
>
> On Mon, Apr 12 2021 at 16:18, Paul E. McKenney wrote:
> > On Mon, Apr 12, 2021 at 10:37:10PM +0200, Thomas Gleixner wrote:
> >> On Mon, Apr 12 2021 at 12:57, Paul E. McKenney wrote:
> >> > On Mon, Apr 12, 2021 at 08:54:03PM +0200, Thomas Gleixner wrote:
> >> >> > I will send a new series out later today, Pacific Time.
> >> >>
> >> >> Can you do me a favour and send it standalone and not as yet another
> >> >> reply to this existing thread maze. A trivial lore link to the previous
> >> >> version gives enough context.
> >> >
> >> > Will do!
> >> >
> >> > Of course, it turns out that lockdep also doesn't like waited-on
> >> > smp_call_function_single() invocations from timer handlers,
> >> > so I am currently looking at other options for dealing with that
> >> > potential use-after-free. I am starting to like the looks of "only set
> >> > CLOCK_SOURCE_VERIFY_PERCPU on statically allocated clocksource structures
> >> > and let KASAN enforce this restriction", but I have not quite given up
> >> > on making it more general.
> >>
> >> The simplest point is in the thread under the clocksource_mutex which
> >> prevents anything from vanishing under your feet.
> >
> > And lockdep is -much- happier with the setup shown below, so thank
> > you again!
>
> But it is too simple now :) ...
>
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index f047c6cb056c..34dc38b6b923 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -519,6 +515,13 @@ static int __clocksource_watchdog_kthread(void)
> >          unsigned long flags;
> >          int select = 0;
> >
> > +        /* Do any required per-CPU skew verification. */
> > +        list_for_each_entry(cs, &watchdog_list, wd_list) {
> > +                if ((cs->flags & (CLOCK_SOURCE_UNSTABLE | CLOCK_SOURCE_VERIFY_PERCPU)) ==
> > +                    (CLOCK_SOURCE_UNSTABLE | CLOCK_SOURCE_VERIFY_PERCPU))
> > +                        clocksource_verify_percpu(cs);
> > +        }
>
> because that list is _NOT_ protected by the clocksource_mutex as you
> noticed yourself already.
>
> But you don't have to walk that list at all because the only interesting
> thing is the currently active clocksource, which is about to be changed
> in case the watchdog marked it unstable and cannot be changed by any
> other code concurrently because clocksource_mutex is held.
>
> So all you need is:
>
>         if (curr_clocksource &&
>             curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE &&
>             curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU)
>                 clocksource_verify_percpu_wreckage(curr_clocksource);
>
> Hmm?

With the addition of a clocksource=tsc boot parameter, this approach
does appear to work, thank you! I sent out the updated series.
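
For reference, the per-CPU check now sits at the top of
__clocksource_watchdog_kthread(), which runs under clocksource_mutex,
roughly as follows (sketch of the new placement; the posted series has
the final form):

        /* Do any required per-CPU skew verification. */
        if (curr_clocksource &&
            (curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE) &&
            (curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU))
                clocksource_verify_percpu(curr_clocksource);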

Thanx, Paul