There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:
> clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> clocksource: 'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> clocksource: 'tsc' is current clocksource.
> tsc: Marking TSC unstable due to clocksource watchdog
> TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> clocksource: Switched to clocksource hpet
The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:
> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
The kernel (based on 5.14.21) used 200 microseconds for the
uncertainty_margin of both the clocksource and watchdog, resulting in a
threshold of 400 microseconds. The discrepancy is that the measured
clocksource skew was evaluated against a threshold suited for watchdog
intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
interval was circa 14.5 seconds. Since the watchdog is executed in softirq
context the expiration of the watchdog timer can get severely delayed on
account of a ksoftirqd thread not getting to run in a timely manner.
Surely, a system with such belated softirq execution is not working well
and the scheduling issue should be looked into but the clocksource
watchdog should be able to deal with it accordingly.
The solution in this patch skips the current watchdog check if the
watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
uncertainty margin (of the TSC and HPET) corresponds to a limit on
clocksource skew of 333 ppm (microseconds of skew per second). To keep the
limit imposed by NTP (500 microseconds of skew per second) for all
possible watchdog intervals, the margins would have to be scaled so that
the threshold value is proportional to the length of the actual watchdog
interval.
v2: fixed interger overflow in WATCHDOG_INTR_MAX_NS on i386
Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <[email protected]>
Reviewed-by: Feng Tang <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Jiri Wiesner <[email protected]>
---
kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c108ed8a9804..e7f8d0a1b95c 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -98,7 +98,9 @@ static u64 suspend_start;
/*
* Interval: 0.5sec.
*/
-#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
+ * (NSEC_PER_SEC / HZ))
/*
* Threshold: 0.0312s, when doubled: 0.0625s.
@@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
static DEFINE_SPINLOCK(watchdog_lock);
static int watchdog_running;
static atomic_t watchdog_reset_pending;
+static int64_t watchdog_max_intr;
static inline void clocksource_watchdog_lock(unsigned long *flags)
{
@@ -399,8 +402,8 @@ static inline void clocksource_reset_watchdog(void)
static void clocksource_watchdog(struct timer_list *unused)
{
u64 csnow, wdnow, cslast, wdlast, delta;
+ int64_t wd_nsec, cs_nsec, interval;
int next_cpu, reset_pending;
- int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
enum wd_read_status read_ret;
unsigned long extra_wait = 0;
@@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused)
if (atomic_read(&watchdog_reset_pending))
continue;
+ /*
+ * The processing of timer softirqs can get delayed (usually
+ * on account of ksoftirqd not getting to run in a timely
+ * manner), which causes the watchdog interval to stretch.
+ * Some clocksources, e.g. acpi_pm, cannot tolerate
+ * watchdog intervals longer than a few seconds.
+ * Skew detection may fail for longer watchdog intervals
+ * on account of fixed margins being used.
+ */
+ interval = max(cs_nsec, wd_nsec);
+ if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
+ if (system_state > SYSTEM_SCHEDULING &&
+ interval > 2 * watchdog_max_intr) {
+ watchdog_max_intr = interval;
+ pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
+ cs_nsec, wd_nsec);
+ }
+ watchdog_timer.expires = jiffies;
+ continue;
+ }
+
/* Check the deviation from the watchdog clocksource. */
md = cs->uncertainty_margin + watchdog->uncertainty_margin;
if (abs(cs_nsec - wd_nsec) > md) {
--
2.35.3
--
Jiri Wiesner
SUSE Labs
On Wed, Jan 10 2024 at 20:26, Jiri Wiesner wrote:
> The measured clocksource skew - the absolute difference between cs_nsec
> and wd_nsec - was 668 microseconds:
>> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
>
> The kernel (based on 5.14.21) used 200 microseconds for the
> uncertainty_margin of both the clocksource and watchdog, resulting in a
> threshold of 400 microseconds. The discrepancy is that the measured
> clocksource skew was evaluated against a threshold suited for watchdog
> intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5
> second.
This really took some time to decode. What you are trying to explain is:
The comparison between the clocksource and the watchdog is not
working for large readout intervals because the conversion to
nanoseconds is imprecise. The reason is that the initialization
values of the shift/mult pairs which are used for conversion are not
sufficiently accurate and the accumulated inaccuracy causes the
comparison to exceed the threshold.
Right?
So yes, limiting the maximum readout interval and skipping the check is
sensible.
> Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
> interval was circa 14.5 seconds. Since the watchdog is executed in softirq
> context the expiration of the watchdog timer can get severely delayed on
> account of a ksoftirqd thread not getting to run in a timely manner.
> Surely, a system with such belated softirq execution is not working well
> and the scheduling issue should be looked into but the clocksource
> watchdog should be able to deal with it accordingly.
>
> The solution in this patch skips the current watchdog check if the
s/The solution in this patch skips/Prevent this by skipping/
We already know that this is a patch, no?
> v2: fixed interger overflow in WATCHDOG_INTR_MAX_NS on i386
Please put the version log after the --- separator. It's not part of the
changelog.
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> Suggested-by: Feng Tang <[email protected]>
> Reviewed-by: Feng Tang <[email protected]>
> Tested-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Jiri Wiesner <[email protected]>
> ---
> kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..e7f8d0a1b95c 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -98,7 +98,9 @@ static u64 suspend_start;
> /*
> * Interval: 0.5sec.
> */
> -#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> + * (NSEC_PER_SEC / HZ))
That 1.5 * WATCHDOG_INTERVAL seems to be rather arbitrary. One second
should be safe enough, no?
>
> + /*
> + * The processing of timer softirqs can get delayed (usually
> + * on account of ksoftirqd not getting to run in a timely
> + * manner), which causes the watchdog interval to stretch.
> + * Some clocksources, e.g. acpi_pm, cannot tolerate
> + * watchdog intervals longer than a few seconds.
What ensures that the watchdog did not wrap around then?
> + * Skew detection may fail for longer watchdog intervals
> + * on account of fixed margins being used.
> + */
> + interval = max(cs_nsec, wd_nsec);
> + if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
> + if (system_state > SYSTEM_SCHEDULING &&
> + interval > 2 * watchdog_max_intr) {
watchdog_max_intr is a misnomer. Why not naming it watchdog_max_interval
to make it entirly clear what this is about?
> + watchdog_max_intr = interval;
> + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> + cs_nsec, wd_nsec);
This really wants to have a proper indication why the check was skipped,
i,e. due to a long readout interval, no?
Thanks,
tglx
On Fri, Jan 12, 2024 at 05:48:22PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 10 2024 at 20:26, Jiri Wiesner wrote:
> > The measured clocksource skew - the absolute difference between cs_nsec
> > and wd_nsec - was 668 microseconds:
> >> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
> >
> > The kernel (based on 5.14.21) used 200 microseconds for the
> > uncertainty_margin of both the clocksource and watchdog, resulting in a
> > threshold of 400 microseconds. The discrepancy is that the measured
> > clocksource skew was evaluated against a threshold suited for watchdog
> > intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5
> > second.
>
> This really took some time to decode. What you are trying to explain is:
>
> The comparison between the clocksource and the watchdog is not
> working for large readout intervals because the conversion to
> nanoseconds is imprecise. The reason is that the initialization
> values of the shift/mult pairs which are used for conversion are not
> sufficiently accurate and the accumulated inaccuracy causes the
> comparison to exceed the threshold.
The root cause of the bug does not concern the precision of the conversion
to nanoseconds. The shift/mult pair of the TSC can convert diffs as large
as 600 seconds. The HPET is limited to 179.0 seconds on account of being a
32-bit counter. The acpi_pm can convert only 4.7 seconds. With the
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE option enabled, the ranges are
reduced to a half. The example above showed the TSC as the clocksource and
the HPET as a watchdog both of which should be able to convert a diff of
14.5 seconds to nanoseconds with sufficient precision.
I could change the description to:
The kernel (based on 5.14.21) used 200 microseconds for the
uncertainty_margin of both the clocksource and watchdog, resulting in a
threshold of 400 microseconds (the md variable). The root cause of the
issue is that the measured clocksource skew, 668 microseconds, was
evaluated against a threshold (the md variable) which is suited for
watchdog intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is
0.5 second. Both the cs_nsec and the wd_nsec value indicate that the
watchdog interval was circa 14.5 seconds.
The intention of 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use
a clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use default uncertainty_margin. When the
watchdog interval gets stretched the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is stricter than the skew with which NTP can deal. The longer the
watchdog interval is the larger the threshold should be. For evaluating
skew in a watchdog interval of 14.5 seconds, a proportional threshold
should be used, which should be 14500 microseconds (7250 coming from the
TSC, 7250 coming from the HPET).
> So yes, limiting the maximum readout interval and skipping the check is
> sensible.
It is a bug to mark a clocksource unstable if the skew is 668 microseconds
in 14.5 seconds. One possible solution is to skip the check. I originally
posted a patch scaling the uncertainty_margin of clocksources but it got
no support and the feedback I got was to avoid the calculation and skip
the current check in order to keep the code simple:
https://lore.kernel.org/lkml/20231221160517.GA22919@incl/#t
Since skipping the check solves issue as well I sent a patch.
> > /*
> > * Interval: 0.5sec.
> > */
> > -#define WATCHDOG_INTERVAL (HZ >> 1)
> > +#define WATCHDOG_INTERVAL (HZ >> 1)
> > +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> > + * (NSEC_PER_SEC / HZ))
>
> That 1.5 * WATCHDOG_INTERVAL seems to be rather arbitrary. One second
> should be safe enough, no?
Yes, it is arbitrary. The concern is how strict can we allow the skew
check to get. 2 * WATCHDOG_INTERVAL would mean imposing a skew threshold
of 250 microseconds per second for intervals that are close in their value
to 2 * WATCHDOG_INTERVAL. Even using 2 * WATCHDOG_INTERVAL would still be
many times better than using 500 microseconds to check skew in a 14.5-long
watchdog interval.
> > + /*
> > + * The processing of timer softirqs can get delayed (usually
> > + * on account of ksoftirqd not getting to run in a timely
> > + * manner), which causes the watchdog interval to stretch.
> > + * Some clocksources, e.g. acpi_pm, cannot tolerate
> > + * watchdog intervals longer than a few seconds.
>
> What ensures that the watchdog did not wrap around then?
Nothing. It has always been this way. The check usually fails when the
watchdog wraps around, in which case the clocksource is marked unstable
for no fault of its own.
> > + watchdog_max_intr = interval;
> > + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> > + cs_nsec, wd_nsec);
>
> This really wants to have a proper indication why the check was skipped,
> i,e. due to a long readout interval, no?
It could be changed to:
pr_warn("Large watchdog interval, skipping check: cs_nsec: %lld wd_nsec: %lld\n",
I will send a v3 incorporation all the suggestions after we have made the
description intelligible. Thank you for the feedback.
--
Jiri Wiesner
SUSE Labs
On Sat, Jan 13 2024 at 12:44, Jiri Wiesner wrote:
> On Fri, Jan 12, 2024 at 05:48:22PM +0100, Thomas Gleixner wrote:
>> On Wed, Jan 10 2024 at 20:26, Jiri Wiesner wrote:
>> > The measured clocksource skew - the absolute difference between cs_nsec
>> > and wd_nsec - was 668 microseconds:
>> >> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
>> >
>> > The kernel (based on 5.14.21) used 200 microseconds for the
>> > uncertainty_margin of both the clocksource and watchdog, resulting in a
>> > threshold of 400 microseconds. The discrepancy is that the measured
>> > clocksource skew was evaluated against a threshold suited for watchdog
>> > intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5
>> > second.
>>
>> This really took some time to decode. What you are trying to explain is:
>>
>> The comparison between the clocksource and the watchdog is not
>> working for large readout intervals because the conversion to
>> nanoseconds is imprecise. The reason is that the initialization
>> values of the shift/mult pairs which are used for conversion are not
>> sufficiently accurate and the accumulated inaccuracy causes the
>> comparison to exceed the threshold.
>
> The root cause of the bug does not concern the precision of the conversion
> to nanoseconds. The shift/mult pair of the TSC can convert diffs as large
> as 600 seconds. The HPET is limited to 179.0 seconds on account of being a
> 32-bit counter. The acpi_pm can convert only 4.7 seconds. With the
> CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE option enabled, the ranges are
> reduced to a half. The example above showed the TSC as the clocksource and
> the HPET as a watchdog both of which should be able to convert a diff of
> 14.5 seconds to nanoseconds with sufficient precision.
No. It _is_ an initialization and conversion precision problem, nothing
else.
Assume a perfect world where the frequency of the TSC and the frequency
of the HPET are precisely known at initialization time and the
conversion factors to nanoseconds are precise as well. Assume further
that the clock frequencies wont drift over time differently.
Then the relationship between the nanosecond converted readouts of TSC
and HPET (or any other watchdog clocksource) would be constant
independent of the readout interval with the following prerequisites:
1) freq(TSC) / freq(HPET) == constant
2) Conversion from TSC to nanoseconds is correct independent of the
interval
3) Conversion from HPET to nanoseconds is correct independent of the
interval
So if all apply the uncertainty margin would be correct for any
valid non wrapped around readout interval, which is a completely
different issue to take care of.
Though because neither the frequency values are precise nor the
conversion factors to nanoseconds are precise for a larger interval nor
there is a guarantee that the clock frequencies of both clocks can't
drift differently all prerequisites #1 - #3 above are not there.
As a consequence there _is_ a limitation to the readout interval where
the comparison is valid.
See?
If you still don't agree then I'm happy to write up the corresponding
formulas for you.
Thanks,
tglx
On Sun, Jan 14, 2024 at 01:22:18AM +0100, Thomas Gleixner wrote:
> On Sat, Jan 13 2024 at 12:44, Jiri Wiesner wrote:
> > On Fri, Jan 12, 2024 at 05:48:22PM +0100, Thomas Gleixner wrote:
> >> On Wed, Jan 10 2024 at 20:26, Jiri Wiesner wrote:
> >> > The measured clocksource skew - the absolute difference between cs_nsec
> >> > and wd_nsec - was 668 microseconds:
> >> >> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
> >> >
> >> > The kernel (based on 5.14.21) used 200 microseconds for the
> >> > uncertainty_margin of both the clocksource and watchdog, resulting in a
> >> > threshold of 400 microseconds. The discrepancy is that the measured
> >> > clocksource skew was evaluated against a threshold suited for watchdog
> >> > intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5
> >> > second.
> >>
> >> This really took some time to decode. What you are trying to explain is:
> >>
> >> The comparison between the clocksource and the watchdog is not
> >> working for large readout intervals because the conversion to
> >> nanoseconds is imprecise. The reason is that the initialization
> >> values of the shift/mult pairs which are used for conversion are not
> >> sufficiently accurate and the accumulated inaccuracy causes the
> >> comparison to exceed the threshold.
> >
> > The root cause of the bug does not concern the precision of the conversion
> > to nanoseconds. The shift/mult pair of the TSC can convert diffs as large
> > as 600 seconds. The HPET is limited to 179.0 seconds on account of being a
> > 32-bit counter. The acpi_pm can convert only 4.7 seconds. With the
> > CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE option enabled, the ranges are
> > reduced to a half. The example above showed the TSC as the clocksource and
> > the HPET as a watchdog both of which should be able to convert a diff of
> > 14.5 seconds to nanoseconds with sufficient precision.
>
> No. It _is_ an initialization and conversion precision problem, nothing
> else.
>
> Assume a perfect world where the frequency of the TSC and the frequency
> of the HPET are precisely known at initialization time and the
> conversion factors to nanoseconds are precise as well. Assume further
> that the clock frequencies wont drift over time differently.
>
> Then the relationship between the nanosecond converted readouts of TSC
> and HPET (or any other watchdog clocksource) would be constant
> independent of the readout interval with the following prerequisites:
>
> 1) freq(TSC) / freq(HPET) == constant
>
> 2) Conversion from TSC to nanoseconds is correct independent of the
> interval
>
> 3) Conversion from HPET to nanoseconds is correct independent of the
> interval
>
> So if all apply the uncertainty margin would be correct for any
> valid non wrapped around readout interval
Agreed.
> valid non wrapped around readout interval, which is a completely
> different issue to take care of.
It is. I did not mean to address that with the patch but did only
coincidentally.
> Though because neither the frequency values are precise nor the
> conversion factors to nanoseconds are precise for a larger interval nor
> there is a guarantee that the clock frequencies of both clocks can't
> drift differently all prerequisites #1 - #3 above are not there.
>
> As a consequence there _is_ a limitation to the readout interval where
> the comparison is valid.
I must admit the root cause of both inaccuracy of the shift/mult pairs and
conversion precision fits the observed behaviour. The observed behaviour
is that watchdog checks may fail for large readout intervals on 8 NUMA node
machines. So, my hypothesis was that the size of the skew is directly
proportinal to the length of the readout interval. This is what the patch
means to address.
--
Jiri Wiesner
SUSE Labs