2021-11-18 19:15:03

by Waiman Long

Subject: [PATCH v3 0/4] clocksource: Avoid incorrect hpet fallback

It was found that when an x86 system was being stressed by running
various benchmark suites, the clocksource watchdog might occasionally
mark TSC as unstable and fall back to hpet, which has a significant
impact on system performance.

v3:
- Remove the patch 1 hunks that change uncertainty_margin from 2 *
WATCHDOG_MAX_SKEW to WATCHDOG_MAX_SKEW.
- Use pr_info() for the "clock-skew test skipped" message instead of
pr_warn().
- Move the global clock_skew_skip into a new clock_skew_skipcnt
field in the clocksource structure.
- Replace the local_irq_save()/local_irq_restore() pair in
__clocksource_select_watchdog() by local_irq_disable()/local_irq_enable().

The current watchdog clocksource skew threshold of 50us is found to be
insufficient, so patch 1 changes it back to the 100us used before commit
2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"). This
patch also skips the current clock skew check if the consecutive watchdog
read-back delay contributes a major portion of the total delay. On a
1-socket 64-thread test system, it was found that in one of the test
samples the hpet-tsc-hpet delay was 95263ns, while the corresponding
hpet-hpet delay was 94425ns. So the majority of the delay was caused by
the hpet reads.
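
For reference, the decision that patch 1 adds to cs_watchdog_read() can
be summarized by the following user-space sketch. This is only a
simplified model (the retry loop is omitted and the delay values used in
main() are hypothetical), not the kernel code itself:

#include <stdio.h>
#include <stdint.h>

#define WATCHDOG_MAX_SKEW       (100 * 1000LL)  /* 100us, in ns */

enum wd_read_status { WD_READ_SUCCESS, WD_READ_UNSTABLE, WD_READ_SKIP };

/*
 * wd_delay     - wd..cs..wd read-back delay (e.g. hpet-tsc-hpet)
 * wd_seq_delay - delay between two consecutive wd reads (e.g. hpet-hpet)
 */
static enum wd_read_status classify_read(int64_t wd_delay, int64_t wd_seq_delay)
{
        if (wd_delay <= WATCHDOG_MAX_SKEW)
                return WD_READ_SUCCESS;   /* delay small enough, do the skew check */
        if (wd_seq_delay > WATCHDOG_MAX_SKEW / 2)
                return WD_READ_SKIP;      /* system too busy, skip this round */
        return WD_READ_UNSTABLE;          /* kernel would retry first, then give up */
}

int main(void)
{
        printf("%d\n", classify_read(180000, 120000));  /* 2: WD_READ_SKIP */
        printf("%d\n", classify_read(180000,  10000));  /* 1: WD_READ_UNSTABLE */
        printf("%d\n", classify_read( 60000,   5000));  /* 0: WD_READ_SUCCESS */
        return 0;
}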

Patch 2 reduces the default clocksource_watchdog() retries to 2 as
suggested by Paul.

Patch 3 implements dynamic readjustment of the new internal
watchdog_max_skew variable in case the current value causes excessive
skipping of clock skew checks. The following reproducer, provided by
Feng Tang, was used to trigger the test skipping:

sudo stress-ng --timeout 30 --times --verify --metrics-brief --ioport <n>

where <n> is the number of CPUs in the system.
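On a system where the CPU count can be queried with nproc, the same
command can simply be invoked with "--ioport $(nproc)".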

A sample watchdog_max_skew readjustment output was:

[ 197.771144] clocksource: timekeeping watchdog on CPU8: hpet wd-wd read-back delay of 92539ns
[ 197.789589] clocksource: wd-tsc-wd read-back delay of 90933ns, clock-skew test skipped!
[ 197.807145] clocksource: timekeeping watchdog on CPU8: watchdog_max_skew increased to 185078ns

To avoid excessive increases of watchdog_max_skew, a limit of
10*WATCHDOG_MAX_SKEW is imposed, beyond which the watchdog itself will
be marked unstable and a new watchdog will be selected, if possible.
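With the default WATCHDOG_MAX_SKEW of 100us, this caps watchdog_max_skew
at 1ms, and the watchdog replacement is triggered once a single
consecutive wd-wd read-back delay exceeds 5*WATCHDOG_MAX_SKEW = 500us.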

To exercise the code, WATCHDOG_MAX_SKEW was reduced to 10us. After
skipping 10 checks, the watchdog then fell back to acpi_pm. However,
the corresponding consecutive watchdog delay was still about the same,
leading to ping-ponging between hpet and acpi_pm as the watchdog.

Patch 4 adds a Kconfig option to allow kernel builders to control the
actual WATCHDOG_MAX_SKEW threshold to be used.
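For instance, a builder of a heavily instrumented debug kernel could set
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=500 (a hypothetical choice within
the 50-1000 range accepted by the option) to raise the threshold to 500us.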

Waiman Long (4):
clocksource: Avoid accidental unstable marking of clocksources
clocksource: Reduce the default clocksource_watchdog() retries to 2
clocksource: Dynamically increase watchdog_max_skew
clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW

.../admin-guide/kernel-parameters.txt | 4 +-
include/linux/clocksource.h | 1 +
kernel/time/Kconfig | 9 ++
kernel/time/clocksource.c | 112 +++++++++++++++---
4 files changed, 110 insertions(+), 16 deletions(-)

--
2.27.0



2021-11-18 19:15:05

by Waiman Long

Subject: [PATCH v3 1/4] clocksource: Avoid accidental unstable marking of clocksources

Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold"), it has been found that a tsc clocksource fallback to hpet can
sometimes happen on both Intel and AMD systems, especially when they are
running stressful benchmarking workloads. Of the 23 systems tested with
a v5.14 kernel, 10 of them switched to the hpet clocksource during
the test run.

The result of falling back to hpet is a drastic reduction of performance
when running benchmarks. For example, the fio performance tests can
drop up to 70% whereas the iperf3 performance can drop up to 80%.

4 hpet fallbacks happened during bootup. They were:

[ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
[ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
[ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
[ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable

Other fallbacks happened while the systems were running stressful
benchmarks. For example:

[ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
[46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable

Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
changed the skew margin from 100us to 50us. I think this is too small
and can easily be exceeded when running some stressful workloads on a
thermally stressed system. So it is switched back to 100us.

Even a maximum skew margin of 100us may be too small for some systems
during bootup, especially if those systems are under thermal stress. To
eliminate the case where a large skew is due to the system being too
busy, slowing down the reading of both the watchdog and the clocksource,
an extra consecutive read of the watchdog clock is done to check this.

The consecutive watchdog read delay is compared against
WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
the system is just too busy. A message will be printed to the console
and the clock skew check is skipped for this round.

Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Signed-off-by: Waiman Long <[email protected]>
---
kernel/time/clocksource.c | 50 ++++++++++++++++++++++++++++++++-------
1 file changed, 41 insertions(+), 9 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba..bcad1a1e5dcf 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -107,7 +107,7 @@ static u64 suspend_start;
* This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
* a lower bound for cs->uncertainty_margin values when registering clocks.
*/
-#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
+#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)

#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
static void clocksource_watchdog_work(struct work_struct *work);
@@ -205,17 +205,24 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
static int verify_n_cpus = 8;
module_param(verify_n_cpus, int, 0644);

-static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
+enum wd_read_status {
+ WD_READ_SUCCESS,
+ WD_READ_UNSTABLE,
+ WD_READ_SKIP
+};
+
+static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
{
unsigned int nretries;
- u64 wd_end, wd_delta;
- int64_t wd_delay;
+ u64 wd_end, wd_end2, wd_delta;
+ int64_t wd_delay, wd_seq_delay;

for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
local_irq_disable();
*wdnow = watchdog->read(watchdog);
*csnow = cs->read(cs);
wd_end = watchdog->read(watchdog);
+ wd_end2 = watchdog->read(watchdog);
local_irq_enable();

wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
@@ -226,13 +233,34 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
}
- return true;
+ return WD_READ_SUCCESS;
}
+
+ /*
+ * Now compute delay in consecutive watchdog read to see if
+ * there is too much external interferences that cause
+ * significant delay in reading both clocksource and watchdog.
+ *
+ * If consecutive WD read-back delay > WATCHDOG_MAX_SKEW/2,
+ * report system busy, reinit the watchdog and skip the current
+ * watchdog test.
+ */
+ wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
+ wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
+ if (wd_seq_delay > WATCHDOG_MAX_SKEW/2)
+ goto skip_test;
}

pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
smp_processor_id(), watchdog->name, wd_delay, nretries);
- return false;
+ return WD_READ_UNSTABLE;
+
+skip_test:
+ pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n",
+ smp_processor_id(), watchdog->name, wd_seq_delay);
+ pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n",
+ cs->name, wd_delay);
+ return WD_READ_SKIP;
}

static u64 csnow_mid;
@@ -356,6 +384,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int next_cpu, reset_pending;
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
+ enum wd_read_status read_ret;
u32 md;

spin_lock(&watchdog_lock);
@@ -373,9 +402,12 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

- if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
+ read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
+
+ if (read_ret != WD_READ_SUCCESS) {
+ if (read_ret == WD_READ_UNSTABLE)
+ /* Clock readout unreliable, so give it up. */
+ __clocksource_unstable(cs);
continue;
}

--
2.27.0


2021-11-18 19:15:08

by Waiman Long

Subject: [PATCH v3 2/4] clocksource: Reduce the default clocksource_watchdog() retries to 2

With the previous patch, there is an extra watchdog read in each retry,
so the total number of clocksource reads per iteration is now 4.
In order to avoid increasing the clock skew check overhead, the default
maximum number of retries is reduced from 3 to 2 to maintain the same 12
clocksource reads in the worst case.
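That is, the worst case stays at 12 reads: 4 attempts x 3 reads each
before the previous patch, versus 3 attempts x 4 reads each (including
the extra consecutive watchdog read) with it.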

Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 4 ++--
kernel/time/clocksource.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9725c546a0d4..3ea934b034f7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -603,8 +603,8 @@
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
- unstable. Defaults to three retries, that is,
- four attempts to read the clock under test.
+ unstable. Defaults to two retries, that is,
+ three attempts to read the clock under test.

clocksource.verify_n_cpus= [KNL]
Limit the number of CPUs checked for clocksources
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index bcad1a1e5dcf..b7e52a642948 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -199,7 +199,7 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

-ulong max_cswd_read_retries = 3;
+ulong max_cswd_read_retries = 2;
module_param(max_cswd_read_retries, ulong, 0644);
EXPORT_SYMBOL_GPL(max_cswd_read_retries);
static int verify_n_cpus = 8;
--
2.27.0


2021-11-18 19:15:25

by Waiman Long

Subject: [PATCH v3 3/4] clocksource: Dynamically increase watchdog_max_skew

It is possible that a long-lasting intensive workload running on
a system may cause the clock skew test to be skipped for an extended
period of time. One way to avoid this is to dynamically increase the
watchdog_max_skew used in the clock skew test.

However, we also don't want watchdog_max_skew to be continuously increased
without bound, so we limit the increase to 10*WATCHDOG_MAX_SKEW. If that
limit would be exceeded, there is something wrong with the current watchdog
and we are going to mark it as unstable and select a new watchdog, if
possible.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/clocksource.h | 1 +
kernel/time/clocksource.c | 57 +++++++++++++++++++++++++++++++++----
2 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b17327..6a62a1a4da85 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -108,6 +108,7 @@ struct clocksource {
const char *name;
struct list_head list;
int rating;
+ int clock_skew_skipcnt;
enum clocksource_ids id;
enum vdso_clock_mode vdso_clock_mode;
unsigned long flags;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b7e52a642948..284910b07f0c 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -108,6 +108,14 @@ static u64 suspend_start;
* a lower bound for cs->uncertainty_margin values when registering clocks.
*/
#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
+static u64 watchdog_max_skew = WATCHDOG_MAX_SKEW;
+
+/*
+ * The clock-skew check will be skipped if the watchdog shows too much
+ * read-back delay. To avoid indefinite test skips, watchdog_max_skew will be
+ * increased after a certain number of test skips.
+ */
+#define CLOCK_SKEW_SKIP_MAX 10

#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
static void clocksource_watchdog_work(struct work_struct *work);
@@ -205,6 +213,8 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
static int verify_n_cpus = 8;
module_param(verify_n_cpus, int, 0644);

+static void __clocksource_select_watchdog(bool fallback);
+
enum wd_read_status {
WD_READ_SUCCESS,
WD_READ_UNSTABLE,
@@ -228,7 +238,7 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
watchdog->shift);
- if (wd_delay <= WATCHDOG_MAX_SKEW) {
+ if (wd_delay <= watchdog_max_skew) {
if (nretries > 1 || nretries >= max_cswd_read_retries) {
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
@@ -241,13 +251,13 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
* there is too much external interferences that cause
* significant delay in reading both clocksource and watchdog.
*
- * If consecutive WD read-back delay > WATCHDOG_MAX_SKEW/2,
+ * If consecutive WD read-back delay > watchdog_max_skew/2,
* report system busy, reinit the watchdog and skip the current
* watchdog test.
*/
wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
- if (wd_seq_delay > WATCHDOG_MAX_SKEW/2)
+ if (wd_seq_delay > watchdog_max_skew/2)
goto skip_test;
}

@@ -260,6 +270,35 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
smp_processor_id(), watchdog->name, wd_seq_delay);
pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n",
cs->name, wd_delay);
+ if (++watchdog->clock_skew_skipcnt > CLOCK_SKEW_SKIP_MAX) {
+ /*
+ * Increase watchdog_max_skew and watchdog->uncertainty_margin
+ * unless it will exceed 10*WATCHDOG_MAX_SKEW. In that case, the
+ * watchdog itself will be marked unstable.
+ */
+ watchdog->clock_skew_skipcnt = 0;
+ if (wd_seq_delay > 5 * WATCHDOG_MAX_SKEW) {
+ const char *old_wd_name = watchdog->name;
+
+ /*
+ * Consecutive watchdog delay exceed limit, mark
+ * watchdog as unstable & select a new watchdog,
+ * if possible.
+ */
+ local_irq_disable();
+ __clocksource_unstable(watchdog);
+ __clocksource_select_watchdog(true);
+ local_irq_enable();
+ pr_warn("timekeeping watchdog: old %s watchdog marked unstable, new %s watchdog selected\n",
+ old_wd_name, watchdog->name);
+ return WD_READ_SKIP;
+ }
+ watchdog_max_skew = 2 * wd_seq_delay;
+ if (wd_seq_delay > watchdog->uncertainty_margin)
+ watchdog->uncertainty_margin = wd_seq_delay;
+ pr_warn("timekeeping watchdog on CPU%d: watchdog_max_skew increased to %lldns\n",
+ smp_processor_id(), watchdog_max_skew);
+ }
return WD_READ_SKIP;
}

@@ -559,12 +598,10 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
}
}

-static void clocksource_select_watchdog(bool fallback)
+static void __clocksource_select_watchdog(bool fallback)
{
struct clocksource *cs, *old_wd;
- unsigned long flags;

- spin_lock_irqsave(&watchdog_lock, flags);
/* save current watchdog */
old_wd = watchdog;
if (fallback)
@@ -593,6 +630,14 @@ static void clocksource_select_watchdog(bool fallback)

/* Check if the watchdog timer needs to be started. */
clocksource_start_watchdog();
+}
+
+static void clocksource_select_watchdog(bool fallback)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&watchdog_lock, flags);
+ __clocksource_select_watchdog(fallback);
spin_unlock_irqrestore(&watchdog_lock, flags);
}

--
2.27.0


2021-11-18 19:15:28

by Waiman Long

Subject: [PATCH v3 4/4] clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW

A watchdog maximum skew of 100us may still be too small for some
systems or architectures. It may also be too small when some kernel
debug config options are enabled. So add a new Kconfig option
CLOCKSOURCE_WATCHDOG_MAX_SKEW_US to allow kernel builders to have more
control over the threshold for marking a clocksource as unstable.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/time/Kconfig | 9 +++++++++
kernel/time/clocksource.c | 8 +++++++-
2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 04bfd62f5e5c..27b7868b5c30 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,5 +181,14 @@ config HIGH_RES_TIMERS
hardware is not capable then this option only increases
the size of the kernel image.

+config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+ int "Clocksource watchdog maximum allowable skew (in μs)"
+ depends on CLOCKSOURCE_WATCHDOG
+ range 50 1000
+ default 100
+ help
+ Specify the maximum amount of allowable watchdog skew in
+ microseconds before reporting the clocksource to be unstable.
+
endmenu
endif
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 284910b07f0c..c00a59770397 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -107,7 +107,13 @@ static u64 suspend_start;
* This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
* a lower bound for cs->uncertainty_margin values when registering clocks.
*/
-#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
+#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#define MAX_SKEW_USEC CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#else
+#define MAX_SKEW_USEC 100
+#endif
+
+#define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
static u64 watchdog_max_skew = WATCHDOG_MAX_SKEW;

/*
--
2.27.0


2021-11-22 03:02:30

by Feng Tang

Subject: Re: [PATCH v3 1/4] clocksource: Avoid accidental unstable marking of clocksources

On Thu, Nov 18, 2021 at 02:14:36PM -0500, Waiman Long wrote:
> Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> threshold"), it is found that tsc clocksource fallback to hpet can
> sometimes happen on both Intel and AMD systems especially when they are
> running stressful benchmarking workloads. Of the 23 systems tested with
> a v5.14 kernel, 10 of them have switched to hpet clock source during
> the test run.
>
> The result of falling back to hpet is a drastic reduction of performance
> when running benchmarks. For example, the fio performance tests can
> drop up to 70% whereas the iperf3 performance can drop up to 80%.
>
> 4 hpet fallbacks happened during bootup. They were:
>
> [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
>
> Other fallbacks happen when the systems were running stressful
> benchmarks. For example:
>
> [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
>
> Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> changed the skew margin from 100us to 50us. I think this is too small
> and can easily be exceeded when running some stressful workloads on a
> thermally stressed system. So it is switched back to 100us.
>
> Even a maximum skew margin of 100us may be too small in for some systems
> when booting up especially if those systems are under thermal stress. To
> eliminate the case that the large skew is due to the system being too
> busy slowing down the reading of both the watchdog and the clocksource,
> an extra consecutive read of watchdog clock is being done to check this.
>
> The consecutive watchdog read delay is compared against
> WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
> the system is just too busy. A warning will be printed to the console
> and the clock skew check is skipped for this round.

Reviewed-by: Feng Tang <[email protected]>

Thanks,
Feng

> Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/time/clocksource.c | 50 ++++++++++++++++++++++++++++++++-------
> 1 file changed, 41 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b8a14d2fb5ba..bcad1a1e5dcf 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -107,7 +107,7 @@ static u64 suspend_start;
> * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
> * a lower bound for cs->uncertainty_margin values when registering clocks.
> */
> -#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
> +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
>
> #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> static void clocksource_watchdog_work(struct work_struct *work);
> @@ -205,17 +205,24 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> static int verify_n_cpus = 8;
> module_param(verify_n_cpus, int, 0644);
>
> -static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> +enum wd_read_status {
> + WD_READ_SUCCESS,
> + WD_READ_UNSTABLE,
> + WD_READ_SKIP
> +};
> +
> +static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> {
> unsigned int nretries;
> - u64 wd_end, wd_delta;
> - int64_t wd_delay;
> + u64 wd_end, wd_end2, wd_delta;
> + int64_t wd_delay, wd_seq_delay;
>
> for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> local_irq_disable();
> *wdnow = watchdog->read(watchdog);
> *csnow = cs->read(cs);
> wd_end = watchdog->read(watchdog);
> + wd_end2 = watchdog->read(watchdog);
> local_irq_enable();
>
> wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
> @@ -226,13 +233,34 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> smp_processor_id(), watchdog->name, nretries);
> }
> - return true;
> + return WD_READ_SUCCESS;
> }
> +
> + /*
> + * Now compute delay in consecutive watchdog read to see if
> + * there is too much external interferences that cause
> + * significant delay in reading both clocksource and watchdog.
> + *
> + * If consecutive WD read-back delay > WATCHDOG_MAX_SKEW/2,
> + * report system busy, reinit the watchdog and skip the current
> + * watchdog test.
> + */
> + wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
> + wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
> + if (wd_seq_delay > WATCHDOG_MAX_SKEW/2)
> + goto skip_test;
> }
>
> pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> smp_processor_id(), watchdog->name, wd_delay, nretries);
> - return false;
> + return WD_READ_UNSTABLE;
> +
> +skip_test:
> + pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n",
> + smp_processor_id(), watchdog->name, wd_seq_delay);
> + pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n",
> + cs->name, wd_delay);
> + return WD_READ_SKIP;
> }
>
> static u64 csnow_mid;
> @@ -356,6 +384,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> int next_cpu, reset_pending;
> int64_t wd_nsec, cs_nsec;
> struct clocksource *cs;
> + enum wd_read_status read_ret;
> u32 md;
>
> spin_lock(&watchdog_lock);
> @@ -373,9 +402,12 @@ static void clocksource_watchdog(struct timer_list *unused)
> continue;
> }
>
> - if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> - /* Clock readout unreliable, so give it up. */
> - __clocksource_unstable(cs);
> + read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> +
> + if (read_ret != WD_READ_SUCCESS) {
> + if (read_ret == WD_READ_UNSTABLE)
> + /* Clock readout unreliable, so give it up. */
> + __clocksource_unstable(cs);
> continue;
> }
>
> --
> 2.27.0

2021-11-22 05:02:32

by Feng Tang

Subject: Re: [PATCH v3 3/4] clocksource: Dynamically increase watchdog_max_skew

On Thu, Nov 18, 2021 at 02:14:38PM -0500, Waiman Long wrote:
> It is possible that a long-lasting intensive workload running on
> a system may cause the clock skew test to be skipped for extended
> period of time. One way to avoid this is to dynamically increase the
> watchdog_max_skew used in the clock skew test.
>
> However, we also don't want watchdog_max_skew to be continuously increased
> without bound. So we limit the increase up to 10*WATCHDOG_MAX_SKEW. If
> that happens, there is something wrong the current watchdog and we are
> going to mark it as unstable and select a new watchdog, if possible.

For reselecting the watchdog in these cases, I think it is the extreme
system stress causing the MMIO read of hpet to be slow (plus some locking),
so falling back to another watchdog whose read is MMIO or ioport based may
not help much. I tried this patch, and when the "acpi_pm" timer is used
instead of "hpet", a similar thing can still happen.

Thanks,
Feng

> Signed-off-by: Waiman Long <[email protected]>
> ---
> include/linux/clocksource.h | 1 +
> kernel/time/clocksource.c | 57 +++++++++++++++++++++++++++++++++----
> 2 files changed, 52 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 1d42d4b17327..6a62a1a4da85 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -108,6 +108,7 @@ struct clocksource {
> const char *name;
> struct list_head list;
> int rating;
> + int clock_skew_skipcnt;
> enum clocksource_ids id;
> enum vdso_clock_mode vdso_clock_mode;
> unsigned long flags;
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b7e52a642948..284910b07f0c 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -108,6 +108,14 @@ static u64 suspend_start;
> * a lower bound for cs->uncertainty_margin values when registering clocks.
> */
> #define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
> +static u64 watchdog_max_skew = WATCHDOG_MAX_SKEW;
> +
> +/*
> + * The clock-skew check will be skipped if the watchdog shows too much
> + * read-back delay. To avoid indefinite test skips, watchdog_max_skew will be
> + * increased after a certain number of test skips.
> + */
> +#define CLOCK_SKEW_SKIP_MAX 10
>
> #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> static void clocksource_watchdog_work(struct work_struct *work);
> @@ -205,6 +213,8 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> static int verify_n_cpus = 8;
> module_param(verify_n_cpus, int, 0644);
>
> +static void __clocksource_select_watchdog(bool fallback);
> +
> enum wd_read_status {
> WD_READ_SUCCESS,
> WD_READ_UNSTABLE,
> @@ -228,7 +238,7 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
> wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
> wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
> watchdog->shift);
> - if (wd_delay <= WATCHDOG_MAX_SKEW) {
> + if (wd_delay <= watchdog_max_skew) {
> if (nretries > 1 || nretries >= max_cswd_read_retries) {
> pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> smp_processor_id(), watchdog->name, nretries);
> @@ -241,13 +251,13 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
> * there is too much external interferences that cause
> * significant delay in reading both clocksource and watchdog.
> *
> - * If consecutive WD read-back delay > WATCHDOG_MAX_SKEW/2,
> + * If consecutive WD read-back delay > watchdog_max_skew/2,
> * report system busy, reinit the watchdog and skip the current
> * watchdog test.
> */
> wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
> wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
> - if (wd_seq_delay > WATCHDOG_MAX_SKEW/2)
> + if (wd_seq_delay > watchdog_max_skew/2)
> goto skip_test;
> }
>
> @@ -260,6 +270,35 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
> smp_processor_id(), watchdog->name, wd_seq_delay);
> pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n",
> cs->name, wd_delay);
> + if (++watchdog->clock_skew_skipcnt > CLOCK_SKEW_SKIP_MAX) {
> + /*
> + * Increase watchdog_max_skew and watchdog->uncertainty_margin
> + * unless it will exceed 10*WATCHDOG_MAX_SKEW. In that case, the
> + * watchdog itself will be marked unstable.
> + */
> + watchdog->clock_skew_skipcnt = 0;
> + if (wd_seq_delay > 5 * WATCHDOG_MAX_SKEW) {
> + const char *old_wd_name = watchdog->name;
> +
> + /*
> + * Consecutive watchdog delay exceed limit, mark
> + * watchdog as unstable & select a new watchdog,
> + * if possible.
> + */
> + local_irq_disable();
> + __clocksource_unstable(watchdog);
> + __clocksource_select_watchdog(true);
> + local_irq_enable();
> + pr_warn("timekeeping watchdog: old %s watchdog marked unstable, new %s watchdog selected\n",
> + old_wd_name, watchdog->name);
> + return WD_READ_SKIP;
> + }
> + watchdog_max_skew = 2 * wd_seq_delay;
> + if (wd_seq_delay > watchdog->uncertainty_margin)
> + watchdog->uncertainty_margin = wd_seq_delay;
> + pr_warn("timekeeping watchdog on CPU%d: watchdog_max_skew increased to %lldns\n",
> + smp_processor_id(), watchdog_max_skew);
> + }
> return WD_READ_SKIP;
> }
>
> @@ -559,12 +598,10 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
> }
> }
>
> -static void clocksource_select_watchdog(bool fallback)
> +static void __clocksource_select_watchdog(bool fallback)
> {
> struct clocksource *cs, *old_wd;
> - unsigned long flags;
>
> - spin_lock_irqsave(&watchdog_lock, flags);
> /* save current watchdog */
> old_wd = watchdog;
> if (fallback)
> @@ -593,6 +630,14 @@ static void clocksource_select_watchdog(bool fallback)
>
> /* Check if the watchdog timer needs to be started. */
> clocksource_start_watchdog();
> +}
> +
> +static void clocksource_select_watchdog(bool fallback)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&watchdog_lock, flags);
> + __clocksource_select_watchdog(fallback);
> spin_unlock_irqrestore(&watchdog_lock, flags);
> }
>
> --
> 2.27.0

2021-11-22 18:27:38

by Waiman Long

Subject: Re: [PATCH v3 3/4] clocksource: Dynamically increase watchdog_max_skew


On 11/22/21 00:02, Feng Tang wrote:
> On Thu, Nov 18, 2021 at 02:14:38PM -0500, Waiman Long wrote:
>> It is possible that a long-lasting intensive workload running on
>> a system may cause the clock skew test to be skipped for extended
>> period of time. One way to avoid this is to dynamically increase the
>> watchdog_max_skew used in the clock skew test.
>>
>> However, we also don't want watchdog_max_skew to be continuously increased
>> without bound. So we limit the increase up to 10*WATCHDOG_MAX_SKEW. If
>> that happens, there is something wrong the current watchdog and we are
>> going to mark it as unstable and select a new watchdog, if possible.
>
> For reselecting watchdog, in these cases, I think it's the extreme system
> stress causing the MMIO read of hpet to be slow (plus some lock), fallback
> to other watchdog whose read is MMIO or ioport may not help much. I tried
> this patch, and when "acpi_pm" timer is used instead of "hpet", similar
> thing can still happen.

Yes, I am aware of that. Switching to acpi_pm did not reduce the
consecutive read delay. However, the current limit for this watchdog
fallback is when the delay exceeds 5*WATCHDOG_MAX_SKEW. With a default of
100us, that means 500us, which is a really large delay suggesting that
something may be wrong with the hpet timer. Of course, the selection of
this limit is somewhat arbitrary. If you have concerns about that, we can
leave this patch out.

Cheers,
Longman


2021-11-23 03:07:05

by Feng Tang

Subject: Re: [PATCH v3 3/4] clocksource: Dynamically increase watchdog_max_skew

Hi Waiman,

On Mon, Nov 22, 2021 at 01:27:25PM -0500, Waiman Long wrote:
>
> On 11/22/21 00:02, Feng Tang wrote:
> > On Thu, Nov 18, 2021 at 02:14:38PM -0500, Waiman Long wrote:
> > > It is possible that a long-lasting intensive workload running on
> > > a system may cause the clock skew test to be skipped for extended
> > > period of time. One way to avoid this is to dynamically increase the
> > > watchdog_max_skew used in the clock skew test.
> > >
> > > However, we also don't want watchdog_max_skew to be continuously increased
> > > without bound. So we limit the increase up to 10*WATCHDOG_MAX_SKEW. If
> > > that happens, there is something wrong the current watchdog and we are
> > > going to mark it as unstable and select a new watchdog, if possible.
> > For reselecting watchdog, in these cases, I think it's the extreme system
> > stress causing the MMIO read of hpet to be slow (plus some lock), fallback
> > to other watchdog whose read is MMIO or ioport may not help much. I tried
> > this patch, and when "acpi_pm" timer is used instead of "hpet", similar
> > thing can still happen.
>
> Yes, I am aware of that. Switching to acpi_pm, did not reduce the
> consecutive read delay. However, the current limit for this watchdog
> fallback is when the delay exceed 5*WATCHDOG_MAX_SKEW. With a default of
> 100us, that mean 500us which is a really large delay that maybe there is
> something wrong with the hpet timer. Of course, the selection of this limit
> is kind of arbitrary. If you have concern about that, we can leave this
> patch out.

I'm not against the "watchdog_max_skew" concept, just wanted to state
that the hpet may be innocent in this case.

I don't have a strong opinion about reselecting the watchdog either. If
the server is always running that intensive workload, the watchdog could
fall back to another clocksource, and eventually to the NULL device,
which is not bad for that specific case.

Thanks,
Feng

> Cheers,
> Longman

2021-11-25 04:17:54

by Paul E. McKenney

Subject: Re: [PATCH v3 1/4] clocksource: Avoid accidental unstable marking of clocksources

On Mon, Nov 22, 2021 at 11:02:23AM +0800, Feng Tang wrote:
> On Thu, Nov 18, 2021 at 02:14:36PM -0500, Waiman Long wrote:
> > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > threshold"), it is found that tsc clocksource fallback to hpet can
> > sometimes happen on both Intel and AMD systems especially when they are
> > running stressful benchmarking workloads. Of the 23 systems tested with
> > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > the test run.
> >
> > The result of falling back to hpet is a drastic reduction of performance
> > when running benchmarks. For example, the fio performance tests can
> > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> >
> > 4 hpet fallbacks happened during bootup. They were:
> >
> > [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> > [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> > [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> > [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> >
> > Other fallbacks happen when the systems were running stressful
> > benchmarks. For example:
> >
> > [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> > [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> >
> > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > changed the skew margin from 100us to 50us. I think this is too small
> > and can easily be exceeded when running some stressful workloads on a
> > thermally stressed system. So it is switched back to 100us.
> >
> > Even a maximum skew margin of 100us may be too small in for some systems
> > when booting up especially if those systems are under thermal stress. To
> > eliminate the case that the large skew is due to the system being too
> > busy slowing down the reading of both the watchdog and the clocksource,
> > an extra consecutive read of watchdog clock is being done to check this.
> >
> > The consecutive watchdog read delay is compared against
> > WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
> > the system is just too busy. A warning will be printed to the console
> > and the clock skew check is skipped for this round.
>
> Reviewed-by: Feng Tang <[email protected]>

I applied #1 and #2 with Feng Tang's Reviewed-by, thank you both!

It turns out that #4 depends on #3, so rather than risk injecting errors
by sorting that out manually, I will await either an updated #3 and #4
or a rebased #4, at your option.

Thanx, Paul

> Thanks,
> Feng
>
> > Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
> > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> > Signed-off-by: Waiman Long <[email protected]>
> > ---
> > kernel/time/clocksource.c | 50 ++++++++++++++++++++++++++++++++-------
> > 1 file changed, 41 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index b8a14d2fb5ba..bcad1a1e5dcf 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -107,7 +107,7 @@ static u64 suspend_start;
> > * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
> > * a lower bound for cs->uncertainty_margin values when registering clocks.
> > */
> > -#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
> > +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
> >
> > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> > static void clocksource_watchdog_work(struct work_struct *work);
> > @@ -205,17 +205,24 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> > static int verify_n_cpus = 8;
> > module_param(verify_n_cpus, int, 0644);
> >
> > -static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > +enum wd_read_status {
> > + WD_READ_SUCCESS,
> > + WD_READ_UNSTABLE,
> > + WD_READ_SKIP
> > +};
> > +
> > +static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > {
> > unsigned int nretries;
> > - u64 wd_end, wd_delta;
> > - int64_t wd_delay;
> > + u64 wd_end, wd_end2, wd_delta;
> > + int64_t wd_delay, wd_seq_delay;
> >
> > for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> > local_irq_disable();
> > *wdnow = watchdog->read(watchdog);
> > *csnow = cs->read(cs);
> > wd_end = watchdog->read(watchdog);
> > + wd_end2 = watchdog->read(watchdog);
> > local_irq_enable();
> >
> > wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
> > @@ -226,13 +233,34 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> > smp_processor_id(), watchdog->name, nretries);
> > }
> > - return true;
> > + return WD_READ_SUCCESS;
> > }
> > +
> > + /*
> > + * Now compute delay in consecutive watchdog read to see if
> > + * there is too much external interferences that cause
> > + * significant delay in reading both clocksource and watchdog.
> > + *
> > + * If consecutive WD read-back delay > WATCHDOG_MAX_SKEW/2,
> > + * report system busy, reinit the watchdog and skip the current
> > + * watchdog test.
> > + */
> > + wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
> > + wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
> > + if (wd_seq_delay > WATCHDOG_MAX_SKEW/2)
> > + goto skip_test;
> > }
> >
> > pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> > smp_processor_id(), watchdog->name, wd_delay, nretries);
> > - return false;
> > + return WD_READ_UNSTABLE;
> > +
> > +skip_test:
> > + pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n",
> > + smp_processor_id(), watchdog->name, wd_seq_delay);
> > + pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n",
> > + cs->name, wd_delay);
> > + return WD_READ_SKIP;
> > }
> >
> > static u64 csnow_mid;
> > @@ -356,6 +384,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > int next_cpu, reset_pending;
> > int64_t wd_nsec, cs_nsec;
> > struct clocksource *cs;
> > + enum wd_read_status read_ret;
> > u32 md;
> >
> > spin_lock(&watchdog_lock);
> > @@ -373,9 +402,12 @@ static void clocksource_watchdog(struct timer_list *unused)
> > continue;
> > }
> >
> > - if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> > - /* Clock readout unreliable, so give it up. */
> > - __clocksource_unstable(cs);
> > + read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> > +
> > + if (read_ret != WD_READ_SUCCESS) {
> > + if (read_ret == WD_READ_UNSTABLE)
> > + /* Clock readout unreliable, so give it up. */
> > + __clocksource_unstable(cs);
> > continue;
> > }
> >
> > --
> > 2.27.0

2021-12-06 03:47:42

by Waiman Long

Subject: Re: [PATCH v3 1/4] clocksource: Avoid accidental unstable marking of clocksources

On 11/24/21 23:15, Paul E. McKenney wrote:
> On Mon, Nov 22, 2021 at 11:02:23AM +0800, Feng Tang wrote:
>> On Thu, Nov 18, 2021 at 02:14:36PM -0500, Waiman Long wrote:
>>> Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
>>> detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
>>> threshold"), it is found that tsc clocksource fallback to hpet can
>>> sometimes happen on both Intel and AMD systems especially when they are
>>> running stressful benchmarking workloads. Of the 23 systems tested with
>>> a v5.14 kernel, 10 of them have switched to hpet clock source during
>>> the test run.
>>>
>>> The result of falling back to hpet is a drastic reduction of performance
>>> when running benchmarks. For example, the fio performance tests can
>>> drop up to 70% whereas the iperf3 performance can drop up to 80%.
>>>
>>> 4 hpet fallbacks happened during bootup. They were:
>>>
>>> [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
>>> [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
>>> [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
>>> [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
>>>
>>> Other fallbacks happen when the systems were running stressful
>>> benchmarks. For example:
>>>
>>> [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
>>> [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
>>>
>>> Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
>>> changed the skew margin from 100us to 50us. I think this is too small
>>> and can easily be exceeded when running some stressful workloads on a
>>> thermally stressed system. So it is switched back to 100us.
>>>
>>> Even a maximum skew margin of 100us may be too small in for some systems
>>> when booting up especially if those systems are under thermal stress. To
>>> eliminate the case that the large skew is due to the system being too
>>> busy slowing down the reading of both the watchdog and the clocksource,
>>> an extra consecutive read of watchdog clock is being done to check this.
>>>
>>> The consecutive watchdog read delay is compared against
>>> WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
>>> the system is just too busy. A warning will be printed to the console
>>> and the clock skew check is skipped for this round.
>>
>> Reviewed-by: Feng Tang <[email protected]>
> I applied #1 and #2 with Feng Tang's Reviewed-by, thank you both!
>
> It turns out that #4 depends on #3, so rather than risk injecting errors
> by sorting that out manually, I will await either an updated #3 and #4
> or a rebased #4, at your option.

I just sent out an updated patch 4 that applies on top of the current
rcu/next branch of the linux-rcu git tree. I would still like to see a
Kconfig option be made available.

Feng, if you have any further suggestions on patch 3, please let me know.
This one just handles some rare cases that shouldn't happen at all,
so it is not that important.

Cheers,
Longman