It was found that when an x86 system was being stressed by running
various benchmark suites, the clocksource watchdog might
occasionally mark the TSC as unstable and fall back to hpet, which has
a significant impact on system performance.
The current watchdog clocksource skew threshold of 50us is found to be
insufficient. Patch 1 changes it back to the 100us value used before commit
2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"). Patch 2
adds a Kconfig option to allow the kernel builder to control the actual
threshold to be used.
Waiman Long (2):
clocksource: Avoid accidental unstable marking of clocksources
clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
kernel/time/Kconfig | 9 ++++++
kernel/time/clocksource.c | 58 ++++++++++++++++++++++++++++++---------
2 files changed, 54 insertions(+), 13 deletions(-)
--
2.27.0
A watchdog maximum skew of 100us may still be too small for
some systems or architectures. It may also be too small when certain kernel
debug config options are enabled. So add a new Kconfig option,
CLOCKSOURCE_WATCHDOG_MAX_SKEW_US, to allow kernel builders to have more
control over the threshold for marking a clocksource as unstable.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/time/Kconfig | 9 +++++++++
kernel/time/clocksource.c | 7 ++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 04bfd62f5e5c..27b7868b5c30 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,5 +181,14 @@ config HIGH_RES_TIMERS
hardware is not capable then this option only increases
the size of the kernel image.
+config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+ int "Clocksource watchdog maximum allowable skew (in μs)"
+ depends on CLOCKSOURCE_WATCHDOG
+ range 50 1000
+ default 100
+ help
+ Specify the maximum amount of allowable watchdog skew in
+ microseconds before reporting the clocksource to be unstable.
+
endmenu
endif
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c503f5ab4101..2e605fc40e82 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -107,7 +107,12 @@ static u64 suspend_start;
* This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
* a lower bound for cs->uncertainty_margin values when registering clocks.
*/
-#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
+#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#define MAX_SKEW_USEC CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#else
+#define MAX_SKEW_USEC 100
+#endif
+#define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
static void clocksource_watchdog_work(struct work_struct *work);
--
2.27.0
Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold"), it is found that a tsc clocksource fallback to hpet can
sometimes happen on both Intel and AMD systems, especially when they are
running stressful benchmarking workloads. Of the 23 systems tested with
a v5.14 kernel, 10 of them switched to the hpet clock source during
the test run.
The result of falling back to hpet is a drastic reduction of performance
when running benchmarks. For example, the fio performance tests can
drop up to 70% whereas the iperf3 performance can drop up to 80%.
4 hpet fallbacks happened during bootup. They were:
[ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
[ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
[ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
[ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
Other fallbacks happened when the systems were running stressful
benchmarks. For example:
[ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
[46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
changed the skew margin from 100us to 50us. I think this is too small
and can easily be exceeded when running some stressful workloads on
a thermally stressed system. So it is switched back to 100us. On
the other hand, it doesn't look like we need to increase the minimum
uncertainty margin. So it is kept the same at 100us too.
Even a maximum skew margin of 100us may be too small for some systems
at boot time, especially if those systems are under thermal stress. To
rule out the case where the large skew is due to the system being too
busy and slowing down the reads of both the watchdog and the clocksource,
a final check is done by reading the watchdog time again and comparing the
consecutive watchdog read delay against WATCHDOG_MAX_SKEW/2. If
that delay exceeds the limit, we assume that the system is just too
busy. A warning will be printed to the console and the watchdog check
is then skipped for this round. For example:
[ 8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Signed-off-by: Waiman Long <[email protected]>
---
kernel/time/clocksource.c | 53 +++++++++++++++++++++++++++++----------
1 file changed, 40 insertions(+), 13 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba..c503f5ab4101 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -107,7 +107,7 @@ static u64 suspend_start;
* This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
* a lower bound for cs->uncertainty_margin values when registering clocks.
*/
-#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
+#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
static void clocksource_watchdog_work(struct work_struct *work);
@@ -205,11 +205,17 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
static int verify_n_cpus = 8;
module_param(verify_n_cpus, int, 0644);
-static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
+enum wd_read_status {
+ WD_READ_SUCCESS,
+ WD_READ_UNSTABLE,
+ WD_READ_SKIP
+};
+
+static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
{
unsigned int nretries;
- u64 wd_end, wd_delta;
- int64_t wd_delay;
+ u64 wd_end, wd_end2, wd_delta;
+ int64_t wd_delay, wd_delay2;
for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
local_irq_disable();
@@ -226,13 +232,30 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
}
- return true;
+ return WD_READ_SUCCESS;
}
}
+ /*
+ * Now compute delay in consecutive watchdog read to see if there is
+ * too much external interferences that cause significant delay in
+ * reading both clocksource and watchdog.
+ *
+ * If the consecutive watchdog read-back delay > WATCHDOG_MAX_SKEW/2,
+ * report system busy and skip the current watchdog test.
+ */
+ wd_end2 = watchdog->read(watchdog);
+ wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
+ wd_delay2 = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
+ if (wd_delay2 > WATCHDOG_MAX_SKEW/2) {
+ pr_warn("timekeeping watchdog on CPU%d: %s consecutive read-back delay of %lldns, system too busy\n",
+ smp_processor_id(), watchdog->name, wd_delay2);
+ return WD_READ_SKIP;
+ }
+
pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
smp_processor_id(), watchdog->name, wd_delay, nretries);
- return false;
+ return WD_READ_UNSTABLE;
}
static u64 csnow_mid;
@@ -356,6 +379,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int next_cpu, reset_pending;
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
+ enum wd_read_status read_ret;
u32 md;
spin_lock(&watchdog_lock);
@@ -373,9 +397,12 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}
- if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
+ read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
+
+ if (read_ret != WD_READ_SUCCESS) {
+ if (read_ret == WD_READ_UNSTABLE)
+ /* Clock readout unreliable, so give it up. */
+ __clocksource_unstable(cs);
continue;
}
@@ -1061,7 +1088,7 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
/*
* If the uncertainty margin is not specified, calculate it.
* If both scale and freq are non-zero, calculate the clock
- * period, but bound below at 2*WATCHDOG_MAX_SKEW. However,
+ * period, but bound below at WATCHDOG_MAX_SKEW. However,
* if either of scale or freq is zero, be very conservative and
* take the tens-of-milliseconds WATCHDOG_THRESHOLD value for the
* uncertainty margin. Allow stupidly small uncertainty margins
@@ -1070,12 +1097,12 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
*/
if (scale && freq && !cs->uncertainty_margin) {
cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
- if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
- cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
+ if (cs->uncertainty_margin < WATCHDOG_MAX_SKEW)
+ cs->uncertainty_margin = WATCHDOG_MAX_SKEW;
} else if (!cs->uncertainty_margin) {
cs->uncertainty_margin = WATCHDOG_THRESHOLD;
}
- WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
+ WARN_ON_ONCE(cs->uncertainty_margin < WATCHDOG_MAX_SKEW);
/*
* Ensure clocksources that have large 'mult' values don't overflow
--
2.27.0
On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> It was found that when an x86 system was being stressed by running
> various different benchmark suites, the clocksource watchdog might
> occasionally mark TSC as unstable and fall back to hpet which will
> have a significant impact on system performance.
>
> The current watchdog clocksource skew threshold of 50us is found to be
> insufficient. So it is changed back to 100us before commit 2e27e793e280
> ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
> adds a Kconfig option to allow kernel builder to control the actual
> threshold to be used.
>
> Waiman Long (2):
> clocksource: Avoid accidental unstable marking of clocksources
> clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
The ability to control the fine-grained threshold seems useful, but is
the TSC still marked unstable when this commit from -rcu is applied?
It has passed significant testing on other workloads.
2a43fb0479aa ("clocksource: Forgive repeated long-latency watchdog clocksource reads")
If the patch below takes care of your situation, my thought is to
also take your second patch, which would allow people to set the
cutoff more loosely or more tightly, as their situation dictates.
Thoughts?
Thanx, Paul
------------------------------------------------------------------------
commit 2a43fb0479aa53c6ff866aa26df2708de0c86b12
Author: Paul E. McKenney <[email protected]>
Date: Thu May 27 11:03:28 2021 -0700
clocksource: Forgive repeated long-latency watchdog clocksource reads
Currently, the clocksource watchdog reacts to repeated long-latency
clocksource reads by marking that clocksource unstable on the theory that
these long-latency reads are a sign of a serious problem. And this theory
does in fact have real-world support in the form of firmware issues [1].
However, it is also possible to trigger this using stress-ng on what
the stress-ng man page terms "poorly designed hardware" [2]. And it
is not necessarily a bad thing for the kernel to diagnose cases where
heavy memory-contention workloads are being run on hardware that is not
designed for this sort of use.
Nevertheless, it is quite possible that real-world use will result in
some situation requiring that high-stress workloads run on hardware
not designed to accommodate them, and also requiring that the kernel
refrain from marking clocksources unstable.
Therefore, react to persistent long-latency reads by leaving the
clocksource alone, but using the old 62.5-millisecond skew-detection
threshold. In addition, the offending clocksource is marked for
re-initialization, which both restarts that clocksource with a clean bill
of health and avoids false-positive skew reports on later watchdog checks.
Once marked for re-initialization, the clocksource is not subjected to
further watchdog checking until a subsequent successful read from that
clocksource that is free of excessive delays.
However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
attempts result in long latencies, a warning (splat) will be emitted.
This kernel boot parameter defaults to 1000, and this warning can be
disabled by setting it to zero or to a negative value.
[ paulmck: Apply feedback from Chao Gao and Oliver Sang. ]
Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
Tested-by: Chao Gao <[email protected]>
Tested-by: Jiri Wiesner <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 316027c3aadce..61d2436ae9dfc 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -600,6 +600,14 @@
loops can be debugged more effectively on production
systems.
+ clocksource.max_cswd_coarse_reads= [KNL]
+ Number of consecutive clocksource_watchdog()
+ coarse reads (that is, clocksource reads that
+ were unduly delayed) that are permitted before
+ the kernel complains (gently). Set to a value
+ less than or equal to zero to suppress these
+ complaints.
+
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b173271..3e925d9ffc31c 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -110,6 +110,7 @@ struct clocksource {
int rating;
enum clocksource_ids id;
enum vdso_clock_mode vdso_clock_mode;
+ unsigned int n_coarse_reads;
unsigned long flags;
int (*enable)(struct clocksource *cs);
@@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
#define TIMER_ACPI_DECLARE(name, table_id, fn) \
ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
+extern int max_cswd_coarse_reads;
extern ulong max_cswd_read_retries;
void clocksource_verify_percpu(struct clocksource *cs);
diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
index df922f49d171b..7e82500c400b9 100644
--- a/kernel/time/clocksource-wdtest.c
+++ b/kernel/time/clocksource-wdtest.c
@@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
else if (i <= max_cswd_read_retries)
s = ", expect message";
else
- s = ", expect clock skew";
+ s = ", expect coarse-grained clock skew check and re-initialization";
pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
WRITE_ONCE(wdtest_ktime_read_ndelays, i);
schedule_timeout_uninterruptible(2 * HZ);
WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
- WARN_ON_ONCE((i <= max_cswd_read_retries) !=
- !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
+ WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
wdtest_ktime_clocksource_reset();
}
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba6..553b479c4e73e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}
+int max_cswd_coarse_reads = 1000;
+module_param(max_cswd_coarse_reads, int, 0644);
+EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
ulong max_cswd_read_retries = 3;
module_param(max_cswd_read_retries, ulong, 0644);
EXPORT_SYMBOL_GPL(max_cswd_read_retries);
@@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
}
- return true;
+ cs->n_coarse_reads = 0;
+ return false;
}
+ WARN_ONCE(max_cswd_coarse_reads > 0 &&
+ !(++cs->n_coarse_reads % max_cswd_coarse_reads),
+ "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
}
- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
- smp_processor_id(), watchdog->name, wd_delay, nretries);
- return false;
+ if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
+ smp_processor_id(), watchdog->name, wd_delay, nretries);
+ } else {
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
+ smp_processor_id(), watchdog->name, wd_delay, nretries);
+ }
+ return true;
}
static u64 csnow_mid;
@@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int next_cpu, reset_pending;
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
+ bool coarse;
u32 md;
spin_lock(&watchdog_lock);
@@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}
- if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
- continue;
- }
+ coarse = cs_watchdog_read(cs, &csnow, &wdnow);
/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
atomic_read(&watchdog_reset_pending)) {
- cs->flags |= CLOCK_SOURCE_WATCHDOG;
+ if (!coarse)
+ cs->flags |= CLOCK_SOURCE_WATCHDOG;
cs->wd_last = wdnow;
cs->cs_last = csnow;
continue;
@@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
/* Check the deviation from the watchdog clocksource. */
- md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+ if (coarse) {
+ md = 62500 * NSEC_PER_USEC;
+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+ pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
+ } else {
+ md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+ }
if (abs(cs_nsec - wd_nsec) > md) {
pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
smp_processor_id(), cs->name);
On 11/10/21 17:32, Paul E. McKenney wrote:
> On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
>> It was found that when an x86 system was being stressed by running
>> various different benchmark suites, the clocksource watchdog might
>> occasionally mark TSC as unstable and fall back to hpet which will
>> have a significant impact on system performance.
>>
>> The current watchdog clocksource skew threshold of 50us is found to be
>> insufficient. So it is changed back to 100us before commit 2e27e793e280
>> ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
>> adds a Kconfig option to allow kernel builder to control the actual
>> threshold to be used.
>>
>> Waiman Long (2):
>> clocksource: Avoid accidental unstable marking of clocksources
>> clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
> The ability to control the fine-grained threshold seems useful, but is
> the TSC still marked unstable when this commit from -rcu is applied?
> It has passed significant testing on other workloads.
>
> 2a43fb0479aa ("clocksource: Forgive repeated long-latency watchdog clocksource reads")
>
> If the patch below takes care of your situation, my thought is to
> also take your second patch, which would allow people to set the
> cutoff more loosely or more tightly, as their situation dictates.
>
> Thoughts?
That is commit 14dbb29eda51 ("clocksource: Forgive repeated long-latency
watchdog clocksource reads") in your linux-rcu git tree. From reading
the patch, I believe it should be able to address the hpet fallback
problem that Red Hat had encountered. Your patch said it was an
out-of-tree patch. Are you planning to mainline it?
Patch 1 of this series contains some testing data on the hpet
fallbacks seen in our testing runs. In summary, a clock skew threshold of
100us is found to be enough to avoid the problem during benchmark runs.
However, we have some cases where TSC was marked unstable at bootup time
with a skew of 200us or more which, I believe, was caused by the thermal
stress that the system was experiencing after running stressful benchmarks
for hours. In the end, we had to revert your clocksource patches before
shipping the RHEL9 beta last week.
Cheers,
Longman
On Wed, Nov 10, 2021 at 06:25:14PM -0500, Waiman Long wrote:
>
> On 11/10/21 17:32, Paul E. McKenney wrote:
> > On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> > > It was found that when an x86 system was being stressed by running
> > > various different benchmark suites, the clocksource watchdog might
> > > occasionally mark TSC as unstable and fall back to hpet which will
> > > have a significant impact on system performance.
> > >
> > > The current watchdog clocksource skew threshold of 50us is found to be
> > > insufficient. So it is changed back to 100us before commit 2e27e793e280
> > > ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
> > > adds a Kconfig option to allow kernel builder to control the actual
> > > threshold to be used.
> > >
> > > Waiman Long (2):
> > > clocksource: Avoid accidental unstable marking of clocksources
> > > clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
> > The ability to control the fine-grained threshold seems useful, but is
> > the TSC still marked unstable when this commit from -rcu is applied?
> > It has passed significant testing on other workloads.
> >
> > 2a43fb0479aa ("clocksource: Forgive repeated long-latency watchdog clocksource reads")
> >
> > If the patch below takes care of your situation, my thought is to
> > also take your second patch, which would allow people to set the
> > cutoff more loosely or more tightly, as their situation dictates.
> >
> > Thoughts?
>
> That is commit 14dbb29eda51 ("clocksource: Forgive repeated long-latency
> watchdog clocksource reads") in your linux-rcu git tree. From reading the
> patch, I believe it should be able to address the hpet fallback problem that
> Red Hat had encountered. Your patch said it was an out-of-tree patch. Are
> you planning to mainline it?
Yes, I expect to submit it into the next merge window (not the current
v5.16 merge window, but v5.17). However, if your situation is urgent, and
if it works for you, I could submit it as a fix for an earlier regression.
> Patch 1 of this series contains some testing data that caused hpet fallback
> in our testing runs. In summary, a clock skew of 100us is found to be enough
> to avoid the problem with benchmark runs. However, we have some cases where
> TSC was marked unstable at bootup time with a skew of 200us or more which, I
> believe, was caused by the thermal stress that the system was experiencing
> after running stressful benchmarks for hours.
This sort of thing does show some value for allowing the threshold to
be adjusted. I hope that it does not prove necessary to dynamically
adjust the threshold based on CPU clock frequency, but you never know.
> At the end, we have to revert your clocksource patches before shipping RHEL9
> beta last week.
Which has the disadvantage of leaving the initial clock-skew issues in place,
but I do understand that introducing one problem even while fixing
another one still counts as a regression.
Thanx, Paul
On 11/10/21 19:04, Paul E. McKenney wrote:
> On Wed, Nov 10, 2021 at 06:25:14PM -0500, Waiman Long wrote:
>> On 11/10/21 17:32, Paul E. McKenney wrote:
>>> On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
>>>> It was found that when an x86 system was being stressed by running
>>>> various different benchmark suites, the clocksource watchdog might
>>>> occasionally mark TSC as unstable and fall back to hpet which will
>>>> have a significant impact on system performance.
>>>>
>>>> The current watchdog clocksource skew threshold of 50us is found to be
>>>> insufficient. So it is changed back to 100us before commit 2e27e793e280
>>>> ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
>>>> adds a Kconfig option to allow kernel builder to control the actual
>>>> threshold to be used.
>>>>
>>>> Waiman Long (2):
>>>> clocksource: Avoid accidental unstable marking of clocksources
>>>> clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
>>> The ability to control the fine-grained threshold seems useful, but is
>>> the TSC still marked unstable when this commit from -rcu is applied?
>>> It has passed significant testing on other workloads.
>>>
>>> 2a43fb0479aa ("clocksource: Forgive repeated long-latency watchdog clocksource reads")
>>>
>>> If the patch below takes care of your situation, my thought is to
>>> also take your second patch, which would allow people to set the
>>> cutoff more loosely or more tightly, as their situation dictates.
>>>
>>> Thoughts?
>> That is commit 14dbb29eda51 ("clocksource: Forgive repeated long-latency
>> watchdog clocksource reads") in your linux-rcu git tree. From reading the
>> patch, I believe it should be able to address the hpet fallback problem that
>> Red Hat had encountered. Your patch said it was an out-of-tree patch. Are
>> you planning to mainline it?
> Yes, I expect to submit it into the next merge window (not the current
> v5.16 merge window, but v5.17). However, if your situation is urgent, and
> if it works for you, I could submit it as a fix for an earlier regression.
I will build a test kernel based on your patch and ask our benchmarking
group to run their test suites. It will take a day or two to get a
definitive answer even though I believe it should fix the issue.
Cheers,
Longman
Hi Waiman, Paul,
On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> It was found that when an x86 system was being stressed by running
> various different benchmark suites, the clocksource watchdog might
> occasionally mark TSC as unstable and fall back to hpet which will
> have a significant impact on system performance.
We've seen similar cases while running the 'netperf' and 'lockbus/ioport'
cases of the 'stress-ng' tool.
In those scenarios, the clocksource used by the kernel is tsc, while
hpet is used as the watchdog. When the spurious skew happens, we found
it is mostly the hpet's 'fault': when the system is under extreme
pressure, a read of hpet can take a long time, and even 2
consecutive reads of hpet can have a big gap (up to 1ms+) in between.
So the skew we saw is actually caused by hpet instead of tsc, as
a tsc read is a lightweight cpu operation.
I tried the following patch to detect this kind of problem in the watchdog
itself, and avoid wrongly judging the tsc to be unstable. It does help in
our tests, please help to review.
One further idea is to also add 2 consecutive reads of the current
clocksource, compare their gap with the watchdog's, and skip the check
if the watchdog's gap is bigger.
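As a rough illustration of that further idea (illustrative only, not part of
the RFC patch below; the helper name is made up, and it only reuses the
existing clocksource_delta()/clocksource_cyc2ns() helpers), something like
this could decide when to skip a round:

static bool wd_read_gap_dominates(struct clocksource *cs)
{
	u64 cs1, cs2, wd1, wd2;
	int64_t cs_gap, wd_gap;

	/* Two back-to-back reads of the clocksource under test. */
	cs1 = cs->read(cs);
	cs2 = cs->read(cs);
	/* Two back-to-back reads of the watchdog. */
	wd1 = watchdog->read(watchdog);
	wd2 = watchdog->read(watchdog);

	cs_gap = clocksource_cyc2ns(clocksource_delta(cs2, cs1, cs->mask),
				    cs->mult, cs->shift);
	wd_gap = clocksource_cyc2ns(clocksource_delta(wd2, wd1, watchdog->mask),
				    watchdog->mult, watchdog->shift);

	/* If the watchdog itself is the slow one, skip this round of checking. */
	return wd_gap > cs_gap;
}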
From f68c668f19abc77c409795ccc1800de3f4fcfc5a Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Wed, 10 Nov 2021 21:12:10 +0800
Subject: [PATCH RFC] clocksource: add detection for screw of watchdog
Signed-off-by: Feng Tang <[email protected]>
---
kernel/time/clocksource.c | 39 +++++++++++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 6 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba6..6fbef74b0ed74 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -205,34 +205,54 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
static int verify_n_cpus = 8;
module_param(verify_n_cpus, int, 0644);
-static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
+enum {
+ WATCHDOG_READ_SCREWED,
+ CLOCKSOURCE_READ_SCREWED,
+ BOTH_READ_FINE,
+};
+
+static int cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
{
unsigned int nretries;
- u64 wd_end, wd_delta;
- int64_t wd_delay;
+ u64 wd_end, wd_delta, wd_third;
+ int64_t wd_delay, wd_seq_delay;
+
for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
local_irq_disable();
*wdnow = watchdog->read(watchdog);
*csnow = cs->read(cs);
wd_end = watchdog->read(watchdog);
+ wd_third = watchdog->read(watchdog);
local_irq_enable();
wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
watchdog->shift);
+ /*
+ * If a big delay is seen between 2 consecutive read of watchdog,
+ * then there must be something wrong with watchdog itself, or
+ * the system is under some extreme pressures, no need to judge
+ * clocksource for this run.
+ */
+ wd_delta = clocksource_delta(wd_third, wd_end, watchdog->mask);
+ wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
+ watchdog->shift);
+ if (wd_seq_delay > WATCHDOG_MAX_SKEW / 2)
+ return WATCHDOG_READ_SCREWED;
+
if (wd_delay <= WATCHDOG_MAX_SKEW) {
if (nretries > 1 || nretries >= max_cswd_read_retries) {
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
}
- return true;
+ return BOTH_READ_FINE;
}
}
pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
smp_processor_id(), watchdog->name, wd_delay, nretries);
- return false;
+ return CLOCKSOURCE_READ_SCREWED;
}
static u64 csnow_mid;
@@ -373,10 +393,17 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}
- if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
+ switch (cs_watchdog_read(cs, &csnow, &wdnow)) {
+ case WATCHDOG_READ_SCREWED:
+ /* skip checking this 'cs' */
+ continue;
+ case CLOCKSOURCE_READ_SCREWED:
/* Clock readout unreliable, so give it up. */
__clocksource_unstable(cs);
continue;
+ case BOTH_READ_FINE:
+ default:
+ ;
}
/* Clocksource initialized ? */
--
2.27.0
Thanks,
Feng
> The current watchdog clocksource skew threshold of 50us is found to be
> insufficient. So it is changed back to 100us before commit 2e27e793e280
> ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
> adds a Kconfig option to allow kernel builder to control the actual
> threshold to be used.
>
> Waiman Long (2):
> clocksource: Avoid accidental unstable marking of clocksources
> clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
>
> kernel/time/Kconfig | 9 ++++++
> kernel/time/clocksource.c | 58 ++++++++++++++++++++++++++++++---------
> 2 files changed, 54 insertions(+), 13 deletions(-)
>
> --
> 2.27.0
On 11/10/21 20:23, Feng Tang wrote:
> Hi Waiman, Paul,
>
> On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
>> It was found that when an x86 system was being stressed by running
>> various different benchmark suites, the clocksource watchdog might
>> occasionally mark TSC as unstable and fall back to hpet which will
>> have a significant impact on system performance.
>
> We've seen similar cases while running 'netperf' and 'lockbus/ioport'
> cases of 'stress-ng' tool.
>
> In those scenarios, the clocksource used by kernel is tsc, while
> hpet is used as watchdog. And when the "screwing" happens, we found
> mostly it's the hpet's 'fault', that when system is under extreme
> pressure, the read of hpet could take a long time, and even 2
> consecutive read of hpet will have a big gap (up to 1ms+) in between.
> So the screw we saw is actually caused by hpet instead of tsc, as
> tsc read is a lightweight cpu operation
>
> I tried the following patch to detect the screw of watchdog itself,
> and avoid wrongly judging the tsc to be unstable. It does help in
> our tests, please help to review.
>
> And one further idea is to also adding 2 consecutive read of current
> clocksource, and compare its gap with watchdog's, and skip the check
> if the watchdog's is bigger.
That is what I found too. And I also did a 2nd watchdog read to compare
the consecutive delay versus half the threshold and skip the test if it
exceeds it. My patch is actually similar in concept to what your patch does.
Cheers,
Longman
On Wed, Nov 10, 2021 at 08:30:10PM -0500, Waiman Long wrote:
>
> On 11/10/21 20:23, Feng Tang wrote:
> > Hi Waiman, Paul,
> >
> > On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> > > It was found that when an x86 system was being stressed by running
> > > various different benchmark suites, the clocksource watchdog might
> > > occasionally mark TSC as unstable and fall back to hpet which will
> > have a significant impact on system performance.
> > We've seen similar cases while running 'netperf' and 'lockbus/ioport'
> > cases of 'stress-ng' tool.
> >
> > In those scenarios, the clocksource used by kernel is tsc, while
> > hpet is used as watchdog. And when the "screwing" happens, we found
> > mostly it's the hpet's 'fault', that when system is under extreme
> > pressure, the read of hpet could take a long time, and even 2
> > consecutive read of hpet will have a big gap (up to 1ms+) in between.
> > So the screw we saw is actually caused by hpet instead of tsc, as
> > tsc read is a lightweight cpu operation
> >
> > I tried the following patch to detect the screw of watchdog itself,
> > and avoid wrongly judging the tsc to be unstable. It does help in
> > our tests, please help to review.
> >
> > And one further idea is to also adding 2 consecutive read of current
> > clocksource, and compare its gap with watchdog's, and skip the check
> > if the watchdog's is bigger.
>
> That is what I found too. And I also did a 2nd watchdog read to compare the
> consecutive delay versus half the threshold and skip the test if it exceeds
> it. My patch is actually similar in concept to what your patch does.
Aha, yes, I missed that.
I just got to the office and saw the discussion around the 0/2 patch, and replied
without going through the patches, sorry about that.
0day reported some cases about stress-ng testing, and we are still testing
different cases we've seen.
Thanks,
Feng
> Cheers,
> Longman
On Thu, Nov 11, 2021 at 09:53:31AM +0800, Feng Tang wrote:
> On Wed, Nov 10, 2021 at 08:30:10PM -0500, Waiman Long wrote:
> >
> > On 11/10/21 20:23, Feng Tang wrote:
> > > Hi Waiman, Paul,
> > >
> > > On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> > > > It was found that when an x86 system was being stressed by running
> > > > various different benchmark suites, the clocksource watchdog might
> > > > occasionally mark TSC as unstable and fall back to hpet which will
> > > > have a significant impact on system performance.
> > > We've seen similar cases while running 'netperf' and 'lockbus/ioport'
> > > cases of 'stress-ng' tool.
> > >
> > > In those scenarios, the clocksource used by kernel is tsc, while
> > > hpet is used as watchdog. And when the "screwing" happens, we found
> > > mostly it's the hpet's 'fault', that when system is under extreme
> > > pressure, the read of hpet could take a long time, and even 2
> > > consecutive read of hpet will have a big gap (up to 1ms+) in between.
> > > So the screw we saw is actually caused by hpet instead of tsc, as
> > > tsc read is a lightweight cpu operation
> > >
> > > I tried the following patch to detect the screw of watchdog itself,
> > > and avoid wrongly judging the tsc to be unstable. It does help in
> > > our tests, please help to review.
> > >
> > > And one further idea is to also adding 2 consecutive read of current
> > > clocksource, and compare its gap with watchdog's, and skip the check
> > > if the watchdog's is bigger.
> >
> > That is what I found too. And I also did a 2nd watchdog read to compare the
> > consecutive delay versus half the threshold and skip the test if it exceeds
> > it. My patch is actually similar in concept to what your patch does.
>
> Aha, yes, I missed that.
>
> I just got to office, and saw the discussion around 0/2 patch and replied,
> without going through the patches, sorry about that.
>
> 0day reported some cases about stress-ng testing, and we are still testing
> different cases we've seen.
Whichever way it happens, I agree that an extra read to correctly
attribute the latency could be very effective! Good on you both!
Thanx, Paul
On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> threshold"), it is found that tsc clocksource fallback to hpet can
> sometimes happen on both Intel and AMD systems especially when they are
> running stressful benchmarking workloads. Of the 23 systems tested with
> a v5.14 kernel, 10 of them have switched to hpet clock source during
> the test run.
>
> The result of falling back to hpet is a drastic reduction of performance
> when running benchmarks. For example, the fio performance tests can
> drop up to 70% whereas the iperf3 performance can drop up to 80%.
>
> 4 hpet fallbacks happened during bootup. They were:
>
> [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
>
> Other fallbacks happen when the systems were running stressful
> benchmarks. For example:
>
> [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
>
> Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> changed the skew margin from 100us to 50us. I think this is too small
> and can easily be exceeded when running some stressful workloads on
> a thermally stressed system. So it is switched back to 100us. On
> the other hand, it doesn't look like we need to increase the minimum
> uncertainty margin. So it is kept the same at 100us too.
>
> Even a maximum skew margin of 100us may be too small for some systems
> when booting up especially if those systems are under thermal stress. To
> eliminate the case that the large skew is due to the system being too
> busy slowing down the reading of both the watchdog and the clocksource,
> a final check is done by reading watchdog time again and comparing the
> consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> that delay exceeds the limit, we assume that the system is just too
> busy. A warning will be printed to the console and the watchdog check
> is then skipped for this round. For example:
>
> [ 8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
I think it may be better to add more details about the root cause. Other
than that it looks good to me, as we tested a similar patch on our test
platforms.
Reviewed-by: Feng Tang <[email protected]>
Thanks,
Feng
> Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/time/clocksource.c | 53 +++++++++++++++++++++++++++++----------
> 1 file changed, 40 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b8a14d2fb5ba..c503f5ab4101 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -107,7 +107,7 @@ static u64 suspend_start;
> * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
> * a lower bound for cs->uncertainty_margin values when registering clocks.
> */
> -#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
> +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
>
> #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> static void clocksource_watchdog_work(struct work_struct *work);
> @@ -205,11 +205,17 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> static int verify_n_cpus = 8;
> module_param(verify_n_cpus, int, 0644);
>
> -static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> +enum wd_read_status {
> + WD_READ_SUCCESS,
> + WD_READ_UNSTABLE,
> + WD_READ_SKIP
> +};
> +
> +static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> {
> unsigned int nretries;
> - u64 wd_end, wd_delta;
> - int64_t wd_delay;
> + u64 wd_end, wd_end2, wd_delta;
> + int64_t wd_delay, wd_delay2;
>
> for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> local_irq_disable();
> @@ -226,13 +232,30 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> smp_processor_id(), watchdog->name, nretries);
> }
> - return true;
> + return WD_READ_SUCCESS;
> }
> }
>
> + /*
> + * Now compute delay in consecutive watchdog read to see if there is
> + * too much external interferences that cause significant delay in
> + * reading both clocksource and watchdog.
> + *
> + * If the consecutive watchdog read-back delay > WATCHDOG_MAX_SKEW/2,
> + * report system busy and skip the current watchdog test.
> + */
> + wd_end2 = watchdog->read(watchdog);
> + wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
> + wd_delay2 = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
> + if (wd_delay2 > WATCHDOG_MAX_SKEW/2) {
> + pr_warn("timekeeping watchdog on CPU%d: %s consecutive read-back delay of %lldns, system too busy\n",
> + smp_processor_id(), watchdog->name, wd_delay2);
> + return WD_READ_SKIP;
> + }
> +
> pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> smp_processor_id(), watchdog->name, wd_delay, nretries);
> - return false;
> + return WD_READ_UNSTABLE;
> }
>
> static u64 csnow_mid;
> @@ -356,6 +379,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> int next_cpu, reset_pending;
> int64_t wd_nsec, cs_nsec;
> struct clocksource *cs;
> + enum wd_read_status read_ret;
> u32 md;
>
> spin_lock(&watchdog_lock);
> @@ -373,9 +397,12 @@ static void clocksource_watchdog(struct timer_list *unused)
> continue;
> }
>
> - if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> - /* Clock readout unreliable, so give it up. */
> - __clocksource_unstable(cs);
> + read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> +
> + if (read_ret != WD_READ_SUCCESS) {
> + if (read_ret == WD_READ_UNSTABLE)
> + /* Clock readout unreliable, so give it up. */
> + __clocksource_unstable(cs);
> continue;
> }
>
> @@ -1061,7 +1088,7 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
> /*
> * If the uncertainty margin is not specified, calculate it.
> * If both scale and freq are non-zero, calculate the clock
> - * period, but bound below at 2*WATCHDOG_MAX_SKEW. However,
> + * period, but bound below at WATCHDOG_MAX_SKEW. However,
> * if either of scale or freq is zero, be very conservative and
> * take the tens-of-milliseconds WATCHDOG_THRESHOLD value for the
> * uncertainty margin. Allow stupidly small uncertainty margins
> @@ -1070,12 +1097,12 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
> */
> if (scale && freq && !cs->uncertainty_margin) {
> cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
> - if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
> - cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
> + if (cs->uncertainty_margin < WATCHDOG_MAX_SKEW)
> + cs->uncertainty_margin = WATCHDOG_MAX_SKEW;
> } else if (!cs->uncertainty_margin) {
> cs->uncertainty_margin = WATCHDOG_THRESHOLD;
> }
> - WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
> + WARN_ON_ONCE(cs->uncertainty_margin < WATCHDOG_MAX_SKEW);
>
> /*
> * Ensure clocksources that have large 'mult' values don't overflow
> --
> 2.27.0
On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > threshold"), it is found that tsc clocksource fallback to hpet can
> > sometimes happen on both Intel and AMD systems especially when they are
> > running stressful benchmarking workloads. Of the 23 systems tested with
> > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > the test run.
> >
> > The result of falling back to hpet is a drastic reduction of performance
> > when running benchmarks. For example, the fio performance tests can
> > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> >
> > 4 hpet fallbacks happened during bootup. They were:
> >
> > [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> > [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> > [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> > [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> >
> > Other fallbacks happen when the systems were running stressful
> > benchmarks. For example:
> >
> > [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> > [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> >
> > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > changed the skew margin from 100us to 50us. I think this is too small
> > and can easily be exceeded when running some stressful workloads on
> > a thermally stressed system. So it is switched back to 100us. On
> > the other hand, it doesn't look like we need to increase the minimum
> > uncertainty margin. So it is kept the same at 100us too.
> >
> > Even a maximum skew margin of 100us may be too small for some systems
> > when booting up especially if those systems are under thermal stress. To
> > eliminate the case that the large skew is due to the system being too
> > busy slowing down the reading of both the watchdog and the clocksource,
> > a final check is done by reading watchdog time again and comparing the
> > consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> > that delay exceeds the limit, we assume that the system is just too
> > busy. A warning will be printed to the console and the watchdog check
> > is then skipped for this round. For example:
> >
> > [ 8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
>
>
> I think it may be better to add more details about the root cause, other
> than that it looks good to me, as we tested similar patch on our test
> platforms.
>
> Reviewed-by: Feng Tang <[email protected]>
Thank you both!
I agree on the bit about root cause. Would it make sense to compare the
difference between HPET reads 1 and 2 (containing the read of the TSC)
and the difference between HPET reads 2 and 3? If the 2-1 difference was
no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
was no more than (say) 20 microseconds more than the 3-2 difference,
this could be considered a good-as-it-gets read, ending the retry loop.
Then if the 3-1 difference was greater than the default (100 microseconds
in current -rcu), that difference could be substituted for that particular
clocksource watchdog check. With a console message noting the unusually
high overhead (but not a splat).
So if it took 75 microseconds for each HPET read and 1 microsecond for
the TSC read, then 226 microseconds would be substituted for the default
of 100 microseconds for that cycle's skew cutoff. Unless the previous
skew cutoff was larger, in which case the previous cutoff should be
used instead. Either way, the current cutoff is recorded for comparison
for the next clocksource watchdog check.
If the 3-1 difference was greater than 62.5 milliseconds, a warning should
probably be emitted anyway.
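To make that concrete, here is a rough sketch in kernel-style C; wd_read_ns()
and tsc_read() are made-up stand-ins for the watchdog and clocksource reads,
and the constants simply mirror the numbers above, so treat it as an
illustration rather than a proposed implementation:

static int64_t wd_adaptive_cutoff(int64_t default_cutoff, int64_t prev_cutoff)
{
	int64_t t1, t2, t3, d21, d32, d31, cutoff;

	t1 = wd_read_ns();		/* HPET read 1 */
	(void)tsc_read();		/* clocksource read under test */
	t2 = wd_read_ns();		/* HPET read 2 */
	t3 = wd_read_ns();		/* HPET read 3 */

	d21 = t2 - t1;
	d32 = t3 - t2;
	d31 = t3 - t1;

	/* Not a good-as-it-gets read, so the caller should retry. */
	if (d21 > d32 + d32 / 7 && d21 > d32 + 20 * NSEC_PER_USEC)
		return -1;

	/* Substitute the measured read overhead for this cycle's cutoff. */
	cutoff = default_cutoff;
	if (d31 > default_cutoff)
		cutoff = max(prev_cutoff, d31);

	/* Excessive overhead still gets a console message (not a splat). */
	if (d31 > 62500 * NSEC_PER_USEC)
		pr_warn("clocksource watchdog: read-back overhead of %lldns\n", d31);

	return cutoff;	/* recorded for comparison at the next check */
}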
Or did you have something else in mind?
Thanx, Paul
> Thanks,
> Feng
>
> > Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
> > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> > Signed-off-by: Waiman Long <[email protected]>
> > ---
> > kernel/time/clocksource.c | 53 +++++++++++++++++++++++++++++----------
> > 1 file changed, 40 insertions(+), 13 deletions(-)
> >
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index b8a14d2fb5ba..c503f5ab4101 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -107,7 +107,7 @@ static u64 suspend_start;
> > * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
> > * a lower bound for cs->uncertainty_margin values when registering clocks.
> > */
> > -#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
> > +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
> >
> > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> > static void clocksource_watchdog_work(struct work_struct *work);
> > @@ -205,11 +205,17 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> > static int verify_n_cpus = 8;
> > module_param(verify_n_cpus, int, 0644);
> >
> > -static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > +enum wd_read_status {
> > + WD_READ_SUCCESS,
> > + WD_READ_UNSTABLE,
> > + WD_READ_SKIP
> > +};
> > +
> > +static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > {
> > unsigned int nretries;
> > - u64 wd_end, wd_delta;
> > - int64_t wd_delay;
> > + u64 wd_end, wd_end2, wd_delta;
> > + int64_t wd_delay, wd_delay2;
> >
> > for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> > local_irq_disable();
> > @@ -226,13 +232,30 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> > smp_processor_id(), watchdog->name, nretries);
> > }
> > - return true;
> > + return WD_READ_SUCCESS;
> > }
> > }
> >
> > + /*
> > + * Now compute delay in consecutive watchdog read to see if there is
> > + * too much external interferences that cause significant delay in
> > + * reading both clocksource and watchdog.
> > + *
> > + * If the consecutive watchdog read-back delay > WATCHDOG_MAX_SKEW/2,
> > + * report system busy and skip the current watchdog test.
> > + */
> > + wd_end2 = watchdog->read(watchdog);
> > + wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
> > + wd_delay2 = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
> > + if (wd_delay2 > WATCHDOG_MAX_SKEW/2) {
> > + pr_warn("timekeeping watchdog on CPU%d: %s consecutive read-back delay of %lldns, system too busy\n",
> > + smp_processor_id(), watchdog->name, wd_delay2);
> > + return WD_READ_SKIP;
> > + }
> > +
> > pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> > smp_processor_id(), watchdog->name, wd_delay, nretries);
> > - return false;
> > + return WD_READ_UNSTABLE;
> > }
> >
> > static u64 csnow_mid;
> > @@ -356,6 +379,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > int next_cpu, reset_pending;
> > int64_t wd_nsec, cs_nsec;
> > struct clocksource *cs;
> > + enum wd_read_status read_ret;
> > u32 md;
> >
> > spin_lock(&watchdog_lock);
> > @@ -373,9 +397,12 @@ static void clocksource_watchdog(struct timer_list *unused)
> > continue;
> > }
> >
> > - if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> > - /* Clock readout unreliable, so give it up. */
> > - __clocksource_unstable(cs);
> > + read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> > +
> > + if (read_ret != WD_READ_SUCCESS) {
> > + if (read_ret == WD_READ_UNSTABLE)
> > + /* Clock readout unreliable, so give it up. */
> > + __clocksource_unstable(cs);
> > continue;
> > }
> >
> > @@ -1061,7 +1088,7 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
> > /*
> > * If the uncertainty margin is not specified, calculate it.
> > * If both scale and freq are non-zero, calculate the clock
> > - * period, but bound below at 2*WATCHDOG_MAX_SKEW. However,
> > + * period, but bound below at WATCHDOG_MAX_SKEW. However,
> > * if either of scale or freq is zero, be very conservative and
> > * take the tens-of-milliseconds WATCHDOG_THRESHOLD value for the
> > * uncertainty margin. Allow stupidly small uncertainty margins
> > @@ -1070,12 +1097,12 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
> > */
> > if (scale && freq && !cs->uncertainty_margin) {
> > cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
> > - if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
> > - cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
> > + if (cs->uncertainty_margin < WATCHDOG_MAX_SKEW)
> > + cs->uncertainty_margin = WATCHDOG_MAX_SKEW;
> > } else if (!cs->uncertainty_margin) {
> > cs->uncertainty_margin = WATCHDOG_THRESHOLD;
> > }
> > - WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
> > + WARN_ON_ONCE(cs->uncertainty_margin < WATCHDOG_MAX_SKEW);
> >
> > /*
> > * Ensure clocksources that have large 'mult' values don't overflow
> > --
> > 2.27.0
On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > > threshold"), it is found that tsc clocksource fallback to hpet can
> > > sometimes happen on both Intel and AMD systems especially when they are
> > > running stressful benchmarking workloads. Of the 23 systems tested with
> > > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > > the test run.
> > >
> > > The result of falling back to hpet is a drastic reduction of performance
> > > when running benchmarks. For example, the fio performance tests can
> > > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> > >
> > > 4 hpet fallbacks happened during bootup. They were:
> > >
> > > [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> > > [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> > > [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> > > [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> > >
> > > Other fallbacks happen when the systems were running stressful
> > > benchmarks. For example:
> > >
> > > [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> > > [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> > >
> > > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > > changed the skew margin from 100us to 50us. I think this is too small
> > > and can easily be exceeded when running some stressful workloads on
> > > a thermally stressed system. So it is switched back to 100us. On
> > > the other hand, it doesn't look like we need to increase the minimum
> > > uncertainty margin. So it is kept the same at 100us too.
> > >
> > > Even a maximum skew margin of 100us may be too small for some systems
> > > when booting up especially if those systems are under thermal stress. To
> > > eliminate the case that the large skew is due to the system being too
> > > busy slowing down the reading of both the watchdog and the clocksource,
> > > a final check is done by reading watchdog time again and comparing the
> > > consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> > > that delay exceeds the limit, we assume that the system is just too
> > > busy. A warning will be printed to the console and the watchdog check
> > > is then skipped for this round. For example:
> > >
> > > [ 8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
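As a rough sketch of the final check described above (a userspace
illustration with assumed names, not the patch itself): after the normal
watchdog/clocksource/watchdog read sequence, the watchdog is read one
more time, and if that extra watchdog read-back alone exceeds
WATCHDOG_MAX_SKEW/2, the round is skipped as "system too busy":

#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC		1000ULL
#define WATCHDOG_MAX_SKEW	(100 * NSEC_PER_USEC)

/*
 * wd_end is the watchdog (hpet) value read right after the clocksource,
 * wd_again is one more watchdog read taken immediately afterwards.  If
 * two back-to-back watchdog reads already take more than half the skew
 * threshold, blame system load and skip this round of the check.
 */
static bool wd_system_too_busy(uint64_t wd_end, uint64_t wd_again)
{
	return (wd_again - wd_end) > WATCHDOG_MAX_SKEW / 2;
}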
> >
> >
> > I think it may be better to add more details about the root cause; other
> > than that, it looks good to me, as we tested a similar patch on our test
> > platforms.
> >
> > Reviewed-by: Feng Tang <[email protected]>
>
> Thank you both!
>
> I agree on the bit about root cause. Would it make sense to compare the
> difference between HPET reads 1 and 2 (containing the read of the TSC)
> and the difference between HPET reads 2 and 3? If the 2-1 difference was
> no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> was no more than (say) 20 microseconds more than the 3-2 difference,
> this could be considered a good-as-it-gets read, ending the retry loop.
> Then if the 3-1 difference was greater than the default (100 microseconds
> in current -rcu), that difference could be substituted for that particular
> clocksource watchdog check. With a console message noting the unusually
> high overhead (but not a splat).
>
> So if it took 75 microseconds for each HPET read and 1 microsecond for
> the TSC read, then 226 microseconds would be substituted for the default
> of 100 microseconds for that cycle's skew cutoff. Unless the previous
> skew cutoff was larger, in which case the previous cutoff should be
> used instead. Either way, the current cutoff is recorded for comparison
> for the next clocksource watchdog check.
>
> If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> probably be emitted anyway.
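For concreteness, here is a userspace sketch of the comparison suggested
above (illustration only, with assumed names; wd1, wd2 and wd3 are the
three HPET readings in ns, and the TSC is read between wd1 and wd2):

#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC	1000ULL
#define DEFAULT_CUTOFF	(100 * NSEC_PER_USEC)	/* default skew cutoff */

/* "Good as it gets": the 2-1 delay is no more than 8/7ths of the 3-2
 * delay, or no more than 20us larger than it. */
static bool wd_read_good_as_it_gets(uint64_t wd1, uint64_t wd2, uint64_t wd3)
{
	uint64_t d21 = wd2 - wd1;	/* contains the TSC read */
	uint64_t d32 = wd3 - wd2;	/* pure HPET read-back */

	return d21 * 7 <= d32 * 8 || d21 <= d32 + 20 * NSEC_PER_USEC;
}

/* If the 3-1 span exceeds the default cutoff, substitute it for this
 * cycle's skew cutoff, unless the previously recorded cutoff was larger;
 * either way, record the current value for the next check. */
static uint64_t wd_skew_cutoff(uint64_t wd1, uint64_t wd3, uint64_t *prev)
{
	uint64_t d31 = wd3 - wd1;
	uint64_t cur = d31 > DEFAULT_CUTOFF ? d31 : DEFAULT_CUTOFF;
	uint64_t use = cur > *prev ? cur : *prev;

	*prev = cur;
	return use;
}

With 75 microseconds per HPET read and 1 microsecond for the TSC read,
the 3-1 span works out to 226 microseconds, which would replace the
100-microsecond default for that cycle.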
I can test the patch with our cases that could reproduce the problem.
> Or did you have something else in mind?
I'm not sure about the details of Waiman's cases, but in our cases
(stress-ng) the delays between the watchdog's (HPET here) reads were not
linear; from the debug data, sometimes the 3-2 difference could be
bigger, or much bigger, than the 2-1 difference.
The reason could be that the gap between two reads depends hugely on the
system pressure at the time the three HPET reads happen. On our test box
(a 2-socket Cascade Lake AP server), the 2-1 and 3-2 differences are
stably about 2.5 us, while under stress they can be bumped up to
anywhere from 6 us to 2800 us.
So I think checking the 3-2 difference plus increasing the max retries
to 10 may be a simple way forward: if the watchdog read is found to be
abnormally long, we skip this round of the check.
Thanks,
Feng
>
> Thanx, Paul
On Fri, Nov 12, 2021 at 01:44:17PM +0800, Feng Tang wrote:
> On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > > [...]
> > >
> > >
> > > I think it may be better to add more details about the root cause, other
> > > than that it looks good to me, as we tested similar patch on our test
> > > platforms.
> > >
> > > Reviewed-by: Feng Tang <[email protected]>
> >
> > Thank you both!
> >
> > I agree on the bit about root cause. Would it make sense to compare the
> > difference between HPET reads 1 and 2 (containing the read of the TSC)
> > and the difference between HPET reads 2 and 3? If the 2-1 difference was
> > no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> > was no more than (say) 20 microseconds more than the 3-2 difference,
> > this could be considered a good-as-it-gets read, ending the retry loop.
> > Then if the 3-1 difference was greater than the default (100 microseconds
> > in current -rcu), that difference could be substituted for that particular
> > clocksource watchdog check. With a console message noting the unusually
> > high overhead (but not a splat).
> >
> > So if it took 75 microseconds for each HPET read and 1 microsecond for
> > the TSC read, then 226 microseconds would be substituted for the default
> > of 100 microseconds for that cycle's skew cutoff. Unless the previous
> > skew cutoff was larger, in which case the previous cutoff should be
> > used instead. Either way, the current cutoff is recorded for comparison
> > for the next clocksource watchdog check.
> >
> > If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> > probably be emitted anyway.
>
> I can test the patch with our cases that could reproduce the problem.
Much appreciated!
> > Or did you have something else in mind?
>
> I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
> the delay between watchdog's (HPET here) read were not linear, that
> from debug data, sometimes the 3-2 difference could be bigger or much
> bigger than the 2-1 difference.
>
> The reason could be the gap between 2 reads depends hugely on the system
> pressure at that time that 3 HPET read happens. On our test box (a
> 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> about 2.5 us, while under the stress it could be bumped to from 6 us
> to 2800 us.
>
> So I think checking the 3-2 difference plus increasing the max retries
> to 10 may be a simple way, if the watchdog read is found to be
> abnormally long, we skip this round of check.
OK, given your data, it might indeed be best to keep things simple for
the time being.
So when you suggested adding more details about root cause, what did you
have in mind? More details on Waiman's commit log? Some other strategy
for extracting additional latency information? Something else entirely?
Thanx, Paul
On 11/12/21 00:44, Feng Tang wrote:
> On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
>> On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
>>> On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
>>>> [...]
>>>
>>>
>>> I think it may be better to add more details about the root cause, other
>>> than that it looks good to me, as we tested similar patch on our test
>>> platforms.
>>>
>>> Reviewed-by: Feng Tang <[email protected]>
>> Thank you both!
>>
>> I agree on the bit about root cause. Would it make sense to compare the
>> difference between HPET reads 1 and 2 (containing the read of the TSC)
>> and the difference between HPET reads 2 and 3? If the 2-1 difference was
>> no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
>> was no more than (say) 20 microseconds more than the 3-2 difference,
>> this could be considered a good-as-it-gets read, ending the retry loop.
>> Then if the 3-1 difference was greater than the default (100 microseconds
>> in current -rcu), that difference could be substituted for that particular
>> clocksource watchdog check. With a console message noting the unusually
>> high overhead (but not a splat).
>>
>> So if it took 75 microseconds for each HPET read and 1 microsecond for
>> the TSC read, then 226 microseconds would be substituted for the default
>> of 100 microseconds for that cycle's skew cutoff. Unless the previous
>> skew cutoff was larger, in which case the previous cutoff should be
>> used instead. Either way, the current cutoff is recorded for comparison
>> for the next clocksource watchdog check.
>>
>> If the 3-1 difference was greater than 62.5 milliseconds, a warning should
>> probably be emitted anyway.
>
> I can test the patch with our cases that could reproduce the problem.
>
>> Or did you have something else in mind?
> I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
> the delay between watchdog's (HPET here) read were not linear, that
> from debug data, sometimes the 3-2 difference could be bigger or much
> bigger than the 2-1 difference.
>
> The reason could be the gap between 2 reads depends hugely on the system
> pressure at that time that 3 HPET read happens. On our test box (a
> 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> about 2.5 us, while under the stress it could be bumped to from 6 us
> to 2800 us.
>
> So I think checking the 3-2 difference plus increasing the max retries
> to 10 may be a simple way, if the watchdog read is found to be
> abnormally long, we skip this round of check.
On one of the test systems, I measured that the normal delay
(hpet->tsc->hpet) was a bit over 2us. It was a bit more than 4us at
bootup time. However, the same system under stress could see a delay of
over 200us at bootup time. When I measured the consecutive hpet delay,
it was about 180us, so the hpet reads did dominate the total clocksource
read delay.
I would not suggest increasing the max retries, as the check may still
fail in most cases because the system stress will likely not go away
within a short time, so we would just be wasting cpu time. I believe we
should simply skip the check if it is the watchdog read that is causing
most of the delay.
Cheers,
Longman
On Fri, Nov 12, 2021 at 10:43:15PM -0500, Waiman Long wrote:
> On 11/12/21 00:44, Feng Tang wrote:
> > On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > > > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > > > [...]
> > > >
> > > > I think it may be better to add more details about the root cause, other
> > > > than that it looks good to me, as we tested similar patch on our test
> > > > platforms.
> > > >
> > > > Reviewed-by: Feng Tang <[email protected]>
> > > Thank you both!
> > >
> > > I agree on the bit about root cause. Would it make sense to compare the
> > > difference between HPET reads 1 and 2 (containing the read of the TSC)
> > > and the difference between HPET reads 2 and 3? If the 2-1 difference was
> > > no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> > > was no more than (say) 20 microseconds more than the 3-2 difference,
> > > this could be considered a good-as-it-gets read, ending the retry loop.
> > > Then if the 3-1 difference was greater than the default (100 microseconds
> > > in current -rcu), that difference could be substituted for that particular
> > > clocksource watchdog check. With a console message noting the unusually
> > > high overhead (but not a splat).
> > >
> > > So if it took 75 microseconds for each HPET read and 1 microsecond for
> > > the TSC read, then 226 microseconds would be substituted for the default
> > > of 100 microseconds for that cycle's skew cutoff. Unless the previous
> > > skew cutoff was larger, in which case the previous cutoff should be
> > > used instead. Either way, the current cutoff is recorded for comparison
> > > for the next clocksource watchdog check.
> > >
> > > If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> > > probably be emitted anyway.
> > I can test the patch with our cases that could reproduce the problem.
> >
> > > Or did you have something else in mind?
> > I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
> > the delay between watchdog's (HPET here) read were not linear, that
> > from debug data, sometimes the 3-2 difference could be bigger or much
> > bigger than the 2-1 difference.
> >
> > The reason could be the gap between 2 reads depends hugely on the system
> > pressure at that time that 3 HPET read happens. On our test box (a
> > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > about 2.5 us, while under the stress it could be bumped to from 6 us
> > to 2800 us.
> >
> > So I think checking the 3-2 difference plus increasing the max retries
> > to 10 may be a simple way, if the watchdog read is found to be
> > abnormally long, we skip this round of check.
>
> On one of the test system, I had measured that normal delay
> (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> bootup time. However, the same system under stress could have a delay of
> over 200us at bootup time. When I measured the consecutive hpet delay, it
> was about 180us. So hpet read did dominate the total clocksource read delay.
Thank you both for the data!
> I would not suggest increasing the max retries as it may still fail in most
> cases because the system stress will likely not be going away within a short
> time. So we are likely just wasting cpu times. I believe we should just skip
> it if it is the watchdog read that is causing most of the delay.
If anything, adding that extra read would cause me to -reduce- the number
of retries to avoid increasing the per-watchdog overhead.
Thoughts?
Thanx, Paul
On Sun, Nov 14, 2021 at 07:54:07AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 12, 2021 at 10:43:15PM -0500, Waiman Long wrote:
> > On 11/12/21 00:44, Feng Tang wrote:
> > > On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> > > > On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > > > > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > > > > [...]
> > > > >
> > > > > I think it may be better to add more details about the root cause, other
> > > > > than that it looks good to me, as we tested similar patch on our test
> > > > > platforms.
> > > > >
> > > > > Reviewed-by: Feng Tang <[email protected]>
> > > > Thank you both!
> > > >
> > > > I agree on the bit about root cause. Would it make sense to compare the
> > > > difference between HPET reads 1 and 2 (containing the read of the TSC)
> > > > and the difference between HPET reads 2 and 3? If the 2-1 difference was
> > > > no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> > > > was no more than (say) 20 microseconds more than the 3-2 difference,
> > > > this could be considered a good-as-it-gets read, ending the retry loop.
> > > > Then if the 3-1 difference was greater than the default (100 microseconds
> > > > in current -rcu), that difference could be substituted for that particular
> > > > clocksource watchdog check. With a console message noting the unusually
> > > > high overhead (but not a splat).
> > > >
> > > > So if it took 75 microseconds for each HPET read and 1 microsecond for
> > > > the TSC read, then 226 microseconds would be substituted for the default
> > > > of 100 microseconds for that cycle's skew cutoff. Unless the previous
> > > > skew cutoff was larger, in which case the previous cutoff should be
> > > > used instead. Either way, the current cutoff is recorded for comparison
> > > > for the next clocksource watchdog check.
> > > >
> > > > If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> > > > probably be emitted anyway.
> > > I can test the patch with our cases that could reproduce the problem.
> > >
> > > > Or did you have something else in mind?
> > > I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
> > > the delay between watchdog's (HPET here) read were not linear, that
> > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > bigger than the 2-1 difference.
> > >
> > > The reason could be the gap between 2 reads depends hugely on the system
> > > pressure at that time that 3 HPET read happens. On our test box (a
> > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > about 2.5 us, while under the stress it could be bumped to from 6 us
> > > to 2800 us.
> > >
> > > So I think checking the 3-2 difference plus increasing the max retries
> > > to 10 may be a simple way, if the watchdog read is found to be
> > > abnormally long, we skip this round of check.
> >
> > On one of the test system, I had measured that normal delay
> > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > bootup time. However, the same system under stress could have a delay of
> > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > was about 180us. So hpet read did dominate the total clocksource read delay.
>
> Thank you both for the data!
>
> > I would not suggest increasing the max retries as it may still fail in most
> > cases because the system stress will likely not be going away within a short
> > time. So we are likely just wasting cpu times. I believe we should just skip
> > it if it is the watchdog read that is causing most of the delay.
>
> If anything, adding that extra read would cause me to -reduce- the number
> of retries to avoid increasing the per-watchdog overhead.
I understand Waiman's concern here. In our test patch, the check of the
delay between two consecutive watchdog reads is done inside the retry
loop that accompanies the 'cs' read, and once an abnormal delay is
found, the watchdog check is skipped without waiting for the max retries
to complete.
Our test data shows that the consecutive-read delay is not always big
even when the system is heavily stressed, which is why I suggest
increasing the retries.
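Roughly, the early bail-out described here could look like the following
userspace sketch (helper names assumed, not the actual test patch): the
consecutive watchdog read-back is checked on every retry, so one
abnormally slow HPET read skips the round at once instead of burning all
the retries:

#include <stdint.h>

#define NSEC_PER_USEC		1000ULL
#define WATCHDOG_MAX_SKEW	(100 * NSEC_PER_USEC)
#define WD_MAX_RETRIES		10	/* suggested retry count */

typedef uint64_t (*read_fn)(void);	/* returns a counter value in ns */

enum wd_result { WD_READ_OK, WD_ROUND_SKIPPED };

/* read_wd stands in for the HPET watchdog, read_cs for the TSC under test. */
static enum wd_result wd_read_retry(read_fn read_wd, read_fn read_cs,
				    uint64_t *wd, uint64_t *cs)
{
	for (int i = 0; i < WD_MAX_RETRIES; i++) {
		uint64_t wd1 = read_wd();
		uint64_t csnow = read_cs();
		uint64_t wd2 = read_wd();

		if (wd2 - wd1 <= WATCHDOG_MAX_SKEW) {
			*wd = wd2;
			*cs = csnow;
			return WD_READ_OK;	/* usable read pair */
		}
		/* Check the watchdog itself inside the loop: if a single
		 * back-to-back watchdog read-back is already slow, bail
		 * out now rather than exhausting the retries. */
		if (read_wd() - wd2 > WATCHDOG_MAX_SKEW / 2)
			return WD_ROUND_SKIPPED;
	}
	return WD_ROUND_SKIPPED;	/* retries exhausted: skip this round */
}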
Thanks,
Feng
> Thoughts?
>
> Thanx, Paul
On 11/14/21 21:08, Feng Tang wrote:
> Or did you have something else in mind?
>>>> I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
>>>> the delay between watchdog's (HPET here) read were not linear, that
>>>> from debug data, sometimes the 3-2 difference could be bigger or much
>>>> bigger than the 2-1 difference.
>>>>
>>>> The reason could be the gap between 2 reads depends hugely on the system
>>>> pressure at that time that 3 HPET read happens. On our test box (a
>>>> 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
>>>> about 2.5 us, while under the stress it could be bumped to from 6 us
>>>> to 2800 us.
>>>>
>>>> So I think checking the 3-2 difference plus increasing the max retries
>>>> to 10 may be a simple way, if the watchdog read is found to be
>>>> abnormally long, we skip this round of check.
>>> On one of the test system, I had measured that normal delay
>>> (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
>>> bootup time. However, the same system under stress could have a delay of
>>> over 200us at bootup time. When I measured the consecutive hpet delay, it
>>> was about 180us. So hpet read did dominate the total clocksource read delay.
>> Thank you both for the data!
>>
>>> I would not suggest increasing the max retries as it may still fail in most
>>> cases because the system stress will likely not be going away within a short
>>> time. So we are likely just wasting cpu times. I believe we should just skip
>>> it if it is the watchdog read that is causing most of the delay.
>> If anything, adding that extra read would cause me to -reduce- the number
>> of retries to avoid increasing the per-watchdog overhead.
>
> I understand Waiman's concern here, and in our test patch, the 2
> consecutive watchdog read delay check is done inside this retrying
> loop accompanying the 'cs' read, and once an abnormal delay is found,
> the watchdog check is skipped without waiting for the max-retries to
> complete.
>
> Our test data shows the consecutive delay is not always big even when
> the system is much stressed, that's why I suggest to increase the
> retries.
If we need a large number of retries to avoid triggering the unstable
TSC message, we should consider increasing the threshold instead, right?
That is why my patch 2 makes the max skew value a configurable option,
so that we can tune it if necessary.
Cheers,
Longman
On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
>
> On 11/14/21 21:08, Feng Tang wrote:
> > Or did you have something else in mind?
> > > > > I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
> > > > > the delay between watchdog's (HPET here) read were not linear, that
> > > > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > > > bigger than the 2-1 difference.
> > > > >
> > > > > The reason could be the gap between 2 reads depends hugely on the system
> > > > > pressure at that time that 3 HPET read happens. On our test box (a
> > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > > > about 2.5 us, while under the stress it could be bumped to from 6 us
> > > > > to 2800 us.
> > > > >
> > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > to 10 may be a simple way, if the watchdog read is found to be
> > > > > abnormally long, we skip this round of check.
> > > > On one of the test system, I had measured that normal delay
> > > > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > > > bootup time. However, the same system under stress could have a delay of
> > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > was about 180us. So hpet read did dominate the total clocksource read delay.
> > > Thank you both for the data!
> > >
> > > > I would not suggest increasing the max retries as it may still fail in most
> > > > cases because the system stress will likely not be going away within a short
> > > > time. So we are likely just wasting cpu times. I believe we should just skip
> > > > it if it is the watchdog read that is causing most of the delay.
> > > If anything, adding that extra read would cause me to -reduce- the number
> > > of retries to avoid increasing the per-watchdog overhead.
> > I understand Waiman's concern here, and in our test patch, the 2
> > consecutive watchdog read delay check is done inside this retrying
> > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > the watchdog check is skipped without waiting for the max-retries to
> > complete.
> >
> > Our test data shows the consecutive delay is not always big even when
> > the system is much stressed, that's why I suggest to increase the
> > retries.
>
> If we need a large number of retries to avoid triggering the unstable TSC
> message, we should consider increase the threshod instead. Right?
>
> That is why my patch 2 makes the max skew value a configurable option so
> that we can tune it if necessary.
I'm fine with it, though the ideal case I expected is that, with
carefully picked values for max_retries/skew_threshold, we could save
users from having to configure these. But given the complexity of all
the HW out there, it's not an easy goal.
And I still suggest putting the consecutive watchdog read check inside
the retry loop, so that it can bail out early when it detects an
abnormal delay.
Another thing is that we may need to set 'watchdog_reset_pending', as
under stress there could be many consecutive rounds of skipped watchdog
checks, and the saved values of 'cs' and 'watchdog' should be reset.
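The last point could amount to something like this sketch (structure and
names assumed; not the kernel's actual watchdog_reset_pending handling):
after one or more skipped rounds, the saved 'cs' and 'watchdog' readings
are re-captured before the next comparison instead of being reused:

#include <stdbool.h>
#include <stdint.h>

struct wd_state {
	uint64_t cs_last;	/* last saved clocksource (TSC) reading */
	uint64_t wd_last;	/* last saved watchdog (HPET) reading */
	bool reset_pending;	/* set whenever a round is skipped */
};

/* Called instead of the skew comparison when the system is too busy. */
static void wd_skip_round(struct wd_state *s)
{
	s->reset_pending = true;
}

/* Called on a successful read pair. */
static void wd_check(struct wd_state *s, uint64_t csnow, uint64_t wdnow)
{
	if (s->reset_pending) {
		/* Baselines from before the skipped rounds are stale and
		 * must not be used for a delta; just re-capture them. */
		s->reset_pending = false;
	} else {
		/* ... compare (csnow - cs_last) against (wdnow - wd_last)
		 * and the skew threshold here ... */
	}
	s->cs_last = csnow;
	s->wd_last = wdnow;
}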
Thanks,
Feng
> Cheers,
> Longman
On Mon, Nov 15, 2021 at 03:59:15PM +0800, Feng Tang wrote:
> On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
> >
> > On 11/14/21 21:08, Feng Tang wrote:
> > > Or did you have something else in mind?
> > > > > > I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
> > > > > > the delay between watchdog's (HPET here) read were not linear, that
> > > > > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > > > > bigger than the 2-1 difference.
> > > > > >
> > > > > > The reason could be the gap between 2 reads depends hugely on the system
> > > > > > pressure at that time that 3 HPET read happens. On our test box (a
> > > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > > > > about 2.5 us, while under the stress it could be bumped to from 6 us
> > > > > > to 2800 us.
> > > > > >
> > > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > > to 10 may be a simple way, if the watchdog read is found to be
> > > > > > abnormally long, we skip this round of check.
> > > > > On one of the test system, I had measured that normal delay
> > > > > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > > > > bootup time. However, the same system under stress could have a delay of
> > > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > > was about 180us. So hpet read did dominate the total clocksource read delay.
> > > > Thank you both for the data!
> > > >
> > > > > I would not suggest increasing the max retries as it may still fail in most
> > > > > cases because the system stress will likely not be going away within a short
> > > > > time. So we are likely just wasting cpu times. I believe we should just skip
> > > > > it if it is the watchdog read that is causing most of the delay.
> > > > If anything, adding that extra read would cause me to -reduce- the number
> > > > of retries to avoid increasing the per-watchdog overhead.
> > > I understand Waiman's concern here, and in our test patch, the 2
> > > consecutive watchdog read delay check is done inside this retrying
> > > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > > the watchdog check is skipped without waiting for the max-retries to
> > > complete.
> > >
> > > Our test data shows the consecutive delay is not always big even when
> > > the system is much stressed, that's why I suggest to increase the
> > > retries.
> >
> > If we need a large number of retries to avoid triggering the unstable TSC
> > message, we should consider increase the threshod instead. Right?
> >
> > That is why my patch 2 makes the max skew value a configurable option so
> > that we can tune it if necessary.
>
> I'm fine with it, though the ideal case I expected is with carefully
> picked values for max_retries/screw_threshhold, we could save the users
> from configuring these. But given the complexity of all HWs out there,
> it's not an easy goal.
That is my goal as well, but I expect that more experience, testing,
and patches will be required to reach that goal.
> And I still suggest to put the consecutive watchdog read check inside
> the retry loop, so that it could bail out early when detecting the
> abnormal delay.
If the HPET read shows abnormal delay, agreed. But if the abnormal
delay is only in the clocksource under test (TSC in this case), then
a re-read seems to me to make sense.
> Another thing is we may need to set the 'watchdog_reset_pending', as
> under the stress, there could be consecutive many times of "skipping"
> watchdog check, and the saved value of 'cs' and 'watchdog' should be
> reset.
My thought was to count a read failure only if the HPET read did not
have excessive delays. This means that a cache-buster workload could
indefinitely delay a clock-skew check, which was one reason that I
was thinking in terms of using the actual measured delays to set the
clock-skew check criterion.
Either way, something like Waiman's patch checking the HPET delay looks
to me to be valuable.
Thoughts?
Thanx, Paul
On 11/15/21 02:59, Feng Tang wrote:
> On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
>> On 11/14/21 21:08, Feng Tang wrote:
>>> Or did you have something else in mind?
>>>>>> I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
>>>>>> the delay between watchdog's (HPET here) read were not linear, that
>>>>>> from debug data, sometimes the 3-2 difference could be bigger or much
>>>>>> bigger than the 2-1 difference.
>>>>>>
>>>>>> The reason could be the gap between 2 reads depends hugely on the system
>>>>>> pressure at that time that 3 HPET read happens. On our test box (a
>>>>>> 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
>>>>>> about 2.5 us, while under the stress it could be bumped to from 6 us
>>>>>> to 2800 us.
>>>>>>
>>>>>> So I think checking the 3-2 difference plus increasing the max retries
>>>>>> to 10 may be a simple way, if the watchdog read is found to be
>>>>>> abnormally long, we skip this round of check.
>>>>> On one of the test system, I had measured that normal delay
>>>>> (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
>>>>> bootup time. However, the same system under stress could have a delay of
>>>>> over 200us at bootup time. When I measured the consecutive hpet delay, it
>>>>> was about 180us. So hpet read did dominate the total clocksource read delay.
>>>> Thank you both for the data!
>>>>
>>>>> I would not suggest increasing the max retries as it may still fail in most
>>>>> cases because the system stress will likely not be going away within a short
>>>>> time. So we are likely just wasting cpu times. I believe we should just skip
>>>>> it if it is the watchdog read that is causing most of the delay.
>>>> If anything, adding that extra read would cause me to -reduce- the number
>>>> of retries to avoid increasing the per-watchdog overhead.
>>> I understand Waiman's concern here, and in our test patch, the 2
>>> consecutive watchdog read delay check is done inside this retrying
>>> loop accompanying the 'cs' read, and once an abnormal delay is found,
>>> the watchdog check is skipped without waiting for the max-retries to
>>> complete.
>>>
>>> Our test data shows the consecutive delay is not always big even when
>>> the system is much stressed, that's why I suggest to increase the
>>> retries.
>> If we need a large number of retries to avoid triggering the unstable TSC
>> message, we should consider increase the threshod instead. Right?
>>
>> That is why my patch 2 makes the max skew value a configurable option so
>> that we can tune it if necessary.
> I'm fine with it, though the ideal case I expected is with carefully
> picked values for max_retries/screw_threshhold, we could save the users
> from configuring these. But given the complexity of all HWs out there,
> it's not an easy goal.
>
> And I still suggest to put the consecutive watchdog read check inside
> the retry loop, so that it could bail out early when detecting the
> abnormal delay.
Yes, I think that makes sense.
>
> Another thing is we may need to set the 'watchdog_reset_pending', as
> under the stress, there could be consecutive many times of "skipping"
> watchdog check, and the saved value of 'cs' and 'watchdog' should be
> reset.
I agree with that too. IOW, we combine the best parts of the patches.
I will post an updated patch for that.
Cheers,
Longman
On Mon, Nov 15, 2021 at 06:07:09AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 15, 2021 at 03:59:15PM +0800, Feng Tang wrote:
> > On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
> > >
> > > On 11/14/21 21:08, Feng Tang wrote:
> > > > Or did you have something else in mind?
> > > > > > > I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
> > > > > > > the delay between watchdog's (HPET here) read were not linear, that
> > > > > > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > > > > > bigger than the 2-1 difference.
> > > > > > >
> > > > > > > The reason could be the gap between 2 reads depends hugely on the system
> > > > > > > pressure at that time that 3 HPET read happens. On our test box (a
> > > > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > > > > > about 2.5 us, while under the stress it could be bumped to from 6 us
> > > > > > > to 2800 us.
> > > > > > >
> > > > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > > > to 10 may be a simple way, if the watchdog read is found to be
> > > > > > > abnormally long, we skip this round of check.
> > > > > > On one of the test system, I had measured that normal delay
> > > > > > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > > > > > bootup time. However, the same system under stress could have a delay of
> > > > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > > > was about 180us. So hpet read did dominate the total clocksource read delay.
> > > > > Thank you both for the data!
> > > > >
> > > > > > I would not suggest increasing the max retries as it may still fail in most
> > > > > > cases because the system stress will likely not be going away within a short
> > > > > > time. So we are likely just wasting cpu times. I believe we should just skip
> > > > > > it if it is the watchdog read that is causing most of the delay.
> > > > > If anything, adding that extra read would cause me to -reduce- the number
> > > > > of retries to avoid increasing the per-watchdog overhead.
> > > > I understand Waiman's concern here, and in our test patch, the 2
> > > > consecutive watchdog read delay check is done inside this retrying
> > > > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > > > the watchdog check is skipped without waiting for the max-retries to
> > > > complete.
> > > >
> > > > Our test data shows the consecutive delay is not always big even when
> > > > the system is much stressed, that's why I suggest to increase the
> > > > retries.
> > >
> > > If we need a large number of retries to avoid triggering the unstable TSC
> > > message, we should consider increase the threshod instead. Right?
> > >
> > > That is why my patch 2 makes the max skew value a configurable option so
> > > that we can tune it if necessary.
> >
> > I'm fine with it, though the ideal case I expected is with carefully
> > picked values for max_retries/screw_threshhold, we could save the users
> > from configuring these. But given the complexity of all HWs out there,
> > it's not an easy goal.
>
> That is my goal as well, but I expect that more experience, testing,
> and patches will be required to reach that goal.
>
> > And I still suggest to put the consecutive watchdog read check inside
> > the retry loop, so that it could bail out early when detecting the
> > abnormal delay.
>
> If the HPET read shows abnormal delay, agreed. But if the abnormal
> delay is only in the clocksource under test (TSC in this case), then
> a re-read seems to me to make sense.
Yes, I agree. The retry logic you introduced does help to filter out
many false alarms from the watchdog.
> > Another thing is we may need to set the 'watchdog_reset_pending', as
> > under the stress, there could be consecutive many times of "skipping"
> > watchdog check, and the saved value of 'cs' and 'watchdog' should be
> > reset.
>
> My thought was to count a read failure only if the HPET read did not
> have excessive delays. This means that a cache-buster workload could
> indefinitely delay a clock-skew check, which was one reason that I
> was thinking in terms of using the actual measured delays to set the
> clock-skew check criterion.
>
> Either way, something like Waiman's patch checking the HPET delay looks
> to me to be valuable.
Yes, and Waiman is working on a new version.
btw, here is our easy reproducer (the case you have worked on with
Oliver Sang), running stress-ng's ioport case (192 is the number of CPUs
on the test box):
sudo stress-ng --timeout 30 --times --verify --metrics-brief --ioport 192
Thanks,
Feng
> Thoughts?
>
> Thanx, Paul
On Tue, Nov 16, 2021 at 09:36:51AM +0800, Feng Tang wrote:
> On Mon, Nov 15, 2021 at 06:07:09AM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 15, 2021 at 03:59:15PM +0800, Feng Tang wrote:
> > > On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
> > > >
> > > > On 11/14/21 21:08, Feng Tang wrote:
> > > > > Or did you have something else in mind?
> > > > > > > > I'm not sure the detail in Waiman's cases, and in our cases (stress-ng)
> > > > > > > > the delay between watchdog's (HPET here) read were not linear, that
> > > > > > > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > > > > > > bigger than the 2-1 difference.
> > > > > > > >
> > > > > > > > The reason could be the gap between 2 reads depends hugely on the system
> > > > > > > > pressure at that time that 3 HPET read happens. On our test box (a
> > > > > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > > > > > > about 2.5 us, while under the stress it could be bumped to from 6 us
> > > > > > > > to 2800 us.
> > > > > > > >
> > > > > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > > > > to 10 may be a simple way, if the watchdog read is found to be
> > > > > > > > abnormally long, we skip this round of check.
> > > > > > > On one of the test system, I had measured that normal delay
> > > > > > > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > > > > > > bootup time. However, the same system under stress could have a delay of
> > > > > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > > > > was about 180us. So hpet read did dominate the total clocksource read delay.
> > > > > > Thank you both for the data!
> > > > > >
> > > > > > > I would not suggest increasing the max retries as it may still fail in most
> > > > > > > cases because the system stress will likely not be going away within a short
> > > > > > > time. So we are likely just wasting cpu times. I believe we should just skip
> > > > > > > it if it is the watchdog read that is causing most of the delay.
> > > > > > If anything, adding that extra read would cause me to -reduce- the number
> > > > > > of retries to avoid increasing the per-watchdog overhead.
> > > > > I understand Waiman's concern here, and in our test patch, the 2
> > > > > consecutive watchdog read delay check is done inside this retrying
> > > > > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > > > > the watchdog check is skipped without waiting for the max-retries to
> > > > > complete.
> > > > >
> > > > > Our test data shows the consecutive delay is not always big even when
> > > > > the system is much stressed, that's why I suggest to increase the
> > > > > retries.
> > > >
> > > > If we need a large number of retries to avoid triggering the unstable TSC
> > > > message, we should consider increase the threshod instead. Right?
> > > >
> > > > That is why my patch 2 makes the max skew value a configurable option so
> > > > that we can tune it if necessary.
> > >
> > > I'm fine with it, though the ideal case I expected is with carefully
> > > picked values for max_retries/screw_threshhold, we could save the users
> > > from configuring these. But given the complexity of all HWs out there,
> > > it's not an easy goal.
> >
> > That is my goal as well, but I expect that more experience, testing,
> > and patches will be required to reach that goal.
> >
> > > And I still suggest to put the consecutive watchdog read check inside
> > > the retry loop, so that it could bail out early when detecting the
> > > abnormal delay.
> >
> > If the HPET read shows abnormal delay, agreed. But if the abnormal
> > delay is only in the clocksource under test (TSC in this case), then
> > a re-read seems to me to make sense.
>
> Yes, I agree. The retry logic you introeduced does help to filter
> many false alarms from a watchdog.
>
> > > Another thing is we may need to set the 'watchdog_reset_pending', as
> > > under the stress, there could be consecutive many times of "skipping"
> > > watchdog check, and the saved value of 'cs' and 'watchdog' should be
> > > reset.
> >
> > My thought was to count a read failure only if the HPET read did not
> > have excessive delays. This means that a cache-buster workload could
> > indefinitely delay a clock-skew check, which was one reason that I
> > was thinking in terms of using the actual measured delays to set the
> > clock-skew check criterion.
> >
> > Either way, something like Waiman's patch checking the HPET delay looks
> > to me to be valuable.
>
> Yes, and Wainman is working on a new version.
Looking forward to seeing it!
> btw, here is our easy reproducer (the case you have worked with Oliver
> Sang), running the stress-ng's case (192 is the CPU number of the test
> box):
>
> sudo stress-ng --timeout 30 --times --verify --metrics-brief --ioport 192
Good to know, thank you!
Thanx, Paul