LinuxLists.cc - [PATCH] clocksource: Skip watchdog check for large watchdog intervals

2024-01-03 11:21:28

Subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:
> clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> clocksource: 'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> clocksource: 'tsc' is current clocksource.
> tsc: Marking TSC unstable due to clocksource watchdog
> TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> clocksource: Switched to clocksource hpet

The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:
> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612

The kernel (based on 5.14.21) used 200 microseconds for the
uncertainty_margin of both the clocksource and watchdog, resulting in a
threshold of 400 microseconds. The discrepancy is that the measured
clocksource skew was evaluated against a threshold suited for watchdog
intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
interval was circa 14.5 seconds. Since the watchdog is executed in softirq
context the expiration of the watchdog timer can get severely delayed on
account of a ksoftirqd thread not getting to run in a timely manner.
Surely, a system with such belated softirq execution is not working well
and the scheduling issue should be looked into but the clocksource
watchdog should be able to deal with it accordingly.

The solution in this patch skips the current watchdog check if the
watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
uncertainty margin (of the TSC and HPET) corresponds to a limit on
clocksource skew of 333 ppm (microseconds of skew per second). To keep the
limit imposed by NTP (500 microseconds of skew per second) for all
possible watchdog intervals, the margins would have to be scaled so that
the threshold value is proportional to the length of the actual watchdog
interval.

Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <[email protected]>
Signed-off-by: Jiri Wiesner <[email protected]>
---
kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c108ed8a9804..ac5cb0ff278b 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -98,7 +98,9 @@ static u64 suspend_start;
/*
* Interval: 0.5sec.
*/
-#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
+ * NSEC_PER_SEC / HZ)

/*
* Threshold: 0.0312s, when doubled: 0.0625s.
@@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
static DEFINE_SPINLOCK(watchdog_lock);
static int watchdog_running;
static atomic_t watchdog_reset_pending;
+static int64_t watchdog_max_intr;

static inline void clocksource_watchdog_lock(unsigned long *flags)
{
@@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
{
u64 csnow, wdnow, cslast, wdlast, delta;
int next_cpu, reset_pending;
- int64_t wd_nsec, cs_nsec;
+ int64_t wd_nsec, cs_nsec, interval;
struct clocksource *cs;
enum wd_read_status read_ret;
unsigned long extra_wait = 0;
@@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused)
if (atomic_read(&watchdog_reset_pending))
continue;

+ /*
+ * The processing of timer softirqs can get delayed (usually
+ * on account of ksoftirqd not getting to run in a timely
+ * manner), which causes the watchdog interval to stretch.
+ * Some clocksources, e.g. acpi_pm, cannot tolerate
+ * watchdog intervals longer than a few seconds.
+ * Skew detection may fail for longer watchdog intervals
+ * on account of fixed margins being used.
+ */
+ interval = max(cs_nsec, wd_nsec);
+ if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
+ if (system_state > SYSTEM_SCHEDULING &&
+ interval > 2 * watchdog_max_intr) {
+ watchdog_max_intr = interval;
+ pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
+ cs_nsec, wd_nsec);
+ }
+ watchdog_timer.expires = jiffies;
+ continue;
+ }
+
/* Check the deviation from the watchdog clocksource. */
md = cs->uncertainty_margin + watchdog->uncertainty_margin;
if (abs(cs_nsec - wd_nsec) > md) {
--
2.35.3

--
Jiri Wiesner
SUSE Labs

2024-01-03 22:08:16

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote:
> There have been reports of the watchdog marking clocksources unstable on
> machines with 8 NUMA nodes:
> > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> > clocksource: 'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> > clocksource: 'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> > clocksource: 'tsc' is current clocksource.
> > tsc: Marking TSC unstable due to clocksource watchdog
> > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> > clocksource: Switched to clocksource hpet
>
> The measured clocksource skew - the absolute difference between cs_nsec
> and wd_nsec - was 668 microseconds:
> > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
>
> The kernel (based on 5.14.21) used 200 microseconds for the
> uncertainty_margin of both the clocksource and watchdog, resulting in a
> threshold of 400 microseconds. The discrepancy is that the measured
> clocksource skew was evaluated against a threshold suited for watchdog
> intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
> Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
> interval was circa 14.5 seconds. Since the watchdog is executed in softirq
> context the expiration of the watchdog timer can get severely delayed on
> account of a ksoftirqd thread not getting to run in a timely manner.
> Surely, a system with such belated softirq execution is not working well
> and the scheduling issue should be looked into but the clocksource
> watchdog should be able to deal with it accordingly.
>
> The solution in this patch skips the current watchdog check if the
> watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
> watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
> uncertainty margin (of the TSC and HPET) corresponds to a limit on
> clocksource skew of 333 ppm (microseconds of skew per second). To keep the
> limit imposed by NTP (500 microseconds of skew per second) for all
> possible watchdog intervals, the margins would have to be scaled so that
> the threshold value is proportional to the length of the actual watchdog
> interval.
>
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> Suggested-by: Feng Tang <[email protected]>
> Signed-off-by: Jiri Wiesner <[email protected]>

I believe that there were concerns about a similar approach in the case
where the jiffies counter is the clocksource, but in the meantime:

Tested-by: Paul E. McKenney <[email protected]>

> ---
> kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..ac5cb0ff278b 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -98,7 +98,9 @@ static u64 suspend_start;
> /*
> * Interval: 0.5sec.
> */
> -#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> + * NSEC_PER_SEC / HZ)
>
> /*
> * Threshold: 0.0312s, when doubled: 0.0625s.
> @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> static DEFINE_SPINLOCK(watchdog_lock);
> static int watchdog_running;
> static atomic_t watchdog_reset_pending;
> +static int64_t watchdog_max_intr;
>
> static inline void clocksource_watchdog_lock(unsigned long *flags)
> {
> @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> {
> u64 csnow, wdnow, cslast, wdlast, delta;
> int next_cpu, reset_pending;
> - int64_t wd_nsec, cs_nsec;
> + int64_t wd_nsec, cs_nsec, interval;
> struct clocksource *cs;
> enum wd_read_status read_ret;
> unsigned long extra_wait = 0;
> @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused)
> if (atomic_read(&watchdog_reset_pending))
> continue;
>
> + /*
> + * The processing of timer softirqs can get delayed (usually
> + * on account of ksoftirqd not getting to run in a timely
> + * manner), which causes the watchdog interval to stretch.
> + * Some clocksources, e.g. acpi_pm, cannot tolerate
> + * watchdog intervals longer than a few seconds.
> + * Skew detection may fail for longer watchdog intervals
> + * on account of fixed margins being used.
> + */
> + interval = max(cs_nsec, wd_nsec);
> + if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
> + if (system_state > SYSTEM_SCHEDULING &&
> + interval > 2 * watchdog_max_intr) {
> + watchdog_max_intr = interval;
> + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> + cs_nsec, wd_nsec);
> + }
> + watchdog_timer.expires = jiffies;
> + continue;
> + }
> +
> /* Check the deviation from the watchdog clocksource. */
> md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> if (abs(cs_nsec - wd_nsec) > md) {
> --
> 2.35.3
>
>
> --
> Jiri Wiesner
> SUSE Labs

2024-01-04 00:47:32

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

2024-01-04 00:58:26

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

Hi Jiri,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/timers/core]
[also build test WARNING on linus/master v6.7-rc8 next-20240103]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257
base: tip/timers/core
patch link: https://lore.kernel.org/r/20240103112113.GA6108%40incl
patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals
config: i386-randconfig-r071-20240104 (https://download.01.org/0day-ci/archive/20240104/[email protected]/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240104/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> kernel/time/clocksource.c:486:27: warning: overflow in expression; result is -1978561024 with type 'long' [-Winteger-overflow]
486 | if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
| ^
kernel/time/clocksource.c:103:6: note: expanded from macro 'WATCHDOG_INTR_MAX_NS'
103 | * NSEC_PER_SEC / HZ)
| ^
1 warning generated.

vim +/long +486 kernel/time/clocksource.c

400
401
402 static void clocksource_watchdog(struct timer_list *unused)
403 {
404 u64 csnow, wdnow, cslast, wdlast, delta;
405 int next_cpu, reset_pending;
406 int64_t wd_nsec, cs_nsec, interval;
407 struct clocksource *cs;
408 enum wd_read_status read_ret;
409 unsigned long extra_wait = 0;
410 u32 md;
411
412 spin_lock(&watchdog_lock);
413 if (!watchdog_running)
414 goto out;
415
416 reset_pending = atomic_read(&watchdog_reset_pending);
417
418 list_for_each_entry(cs, &watchdog_list, wd_list) {
419
420 /* Clocksource already marked unstable? */
421 if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
422 if (finished_booting)
423 schedule_work(&watchdog_work);
424 continue;
425 }
426
427 read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
428
429 if (read_ret == WD_READ_UNSTABLE) {
430 /* Clock readout unreliable, so give it up. */
431 __clocksource_unstable(cs);
432 continue;
433 }
434
435 /*
436 * When WD_READ_SKIP is returned, it means the system is likely
437 * under very heavy load, where the latency of reading
438 * watchdog/clocksource is very big, and affect the accuracy of
439 * watchdog check. So give system some space and suspend the
440 * watchdog check for 5 minutes.
441 */
442 if (read_ret == WD_READ_SKIP) {
443 /*
444 * As the watchdog timer will be suspended, and
445 * cs->last could keep unchanged for 5 minutes, reset
446 * the counters.
447 */
448 clocksource_reset_watchdog();
449 extra_wait = HZ * 300;
450 break;
451 }
452
453 /* Clocksource initialized ? */
454 if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
455 atomic_read(&watchdog_reset_pending)) {
456 cs->flags |= CLOCK_SOURCE_WATCHDOG;
457 cs->wd_last = wdnow;
458 cs->cs_last = csnow;
459 continue;
460 }
461
462 delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
463 wd_nsec = clocksource_cyc2ns(delta, watchdog->mult,
464 watchdog->shift);
465
466 delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
467 cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
468 wdlast = cs->wd_last; /* save these in case we print them */
469 cslast = cs->cs_last;
470 cs->cs_last = csnow;
471 cs->wd_last = wdnow;
472
473 if (atomic_read(&watchdog_reset_pending))
474 continue;
475
476 /*
477 * The processing of timer softirqs can get delayed (usually
478 * on account of ksoftirqd not getting to run in a timely
479 * manner), which causes the watchdog interval to stretch.
480 * Some clocksources, e.g. acpi_pm, cannot tolerate
481 * watchdog intervals longer than a few seconds.
482 * Skew detection may fail for longer watchdog intervals
483 * on account of fixed margins being used.
484 */
485 interval = max(cs_nsec, wd_nsec);
> 486 if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
487 if (system_state > SYSTEM_SCHEDULING &&
488 interval > 2 * watchdog_max_intr) {
489 watchdog_max_intr = interval;
490 pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
491 cs_nsec, wd_nsec);
492 }
493 watchdog_timer.expires = jiffies;
494 continue;
495 }
496
497 /* Check the deviation from the watchdog clocksource. */
498 md = cs->uncertainty_margin + watchdog->uncertainty_margin;
499 if (abs(cs_nsec - wd_nsec) > md) {
500 s64 cs_wd_msec;
501 s64 wd_msec;
502 u32 wd_rem;
503
504 pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
505 smp_processor_id(), cs->name);
506 pr_warn(" '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
507 watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
508 pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
509 cs->name, cs_nsec, csnow, cslast, cs->mask);
510 cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem);
511 wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem);
512 pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
513 cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
514 if (curr_clocksource == cs)
515 pr_warn(" '%s' is current clocksource.\n", cs->name);
516 else if (curr_clocksource)
517 pr_warn(" '%s' (not '%s') is current clocksource.\n", curr_clocksource->name, cs->name);
518 else
519 pr_warn(" No current clocksource.\n");
520 __clocksource_unstable(cs);
521 continue;
522 }
523
524 if (cs == curr_clocksource && cs->tick_stable)
525 cs->tick_stable(cs);
526
527 if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
528 (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
529 (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
530 /* Mark it valid for high-res. */
531 cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
532
533 /*
534 * clocksource_done_booting() will sort it if
535 * finished_booting is not set yet.
536 */
537 if (!finished_booting)
538 continue;
539
540 /*
541 * If this is not the current clocksource let
542 * the watchdog thread reselect it. Due to the
543 * change to high res this clocksource might
544 * be preferred now. If it is the current
545 * clocksource let the tick code know about
546 * that change.
547 */
548 if (cs != curr_clocksource) {
549 cs->flags |= CLOCK_SOURCE_RESELECT;
550 schedule_work(&watchdog_work);
551 } else {
552 tick_clock_notify();
553 }
554 }
555 }
556
557 /*
558 * We only clear the watchdog_reset_pending, when we did a
559 * full cycle through all clocksources.
560 */
561 if (reset_pending)
562 atomic_dec(&watchdog_reset_pending);
563
564 /*
565 * Cycle through CPUs to check if the CPUs stay synchronized
566 * to each other.
567 */
568 next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
569 if (next_cpu >= nr_cpu_ids)
570 next_cpu = cpumask_first(cpu_online_mask);
571
572 /*
573 * Arm timer if not already pending: could race with concurrent
574 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
575 */
576 if (!timer_pending(&watchdog_timer)) {
577 watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
578 add_timer_on(&watchdog_timer, next_cpu);
579 }
580 out:
581 spin_unlock(&watchdog_lock);
582 }
583

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-01-04 06:04:17

by Feng Tang

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote:
> There have been reports of the watchdog marking clocksources unstable on
> machines with 8 NUMA nodes:
> > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> > clocksource: 'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> > clocksource: 'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> > clocksource: 'tsc' is current clocksource.
> > tsc: Marking TSC unstable due to clocksource watchdog
> > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> > clocksource: Switched to clocksource hpet
>
> The measured clocksource skew - the absolute difference between cs_nsec
> and wd_nsec - was 668 microseconds:
> > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
>
> The kernel (based on 5.14.21) used 200 microseconds for the
> uncertainty_margin of both the clocksource and watchdog, resulting in a
> threshold of 400 microseconds. The discrepancy is that the measured
> clocksource skew was evaluated against a threshold suited for watchdog
> intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
> Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
> interval was circa 14.5 seconds. Since the watchdog is executed in softirq
> context the expiration of the watchdog timer can get severely delayed on
> account of a ksoftirqd thread not getting to run in a timely manner.
> Surely, a system with such belated softirq execution is not working well
> and the scheduling issue should be looked into but the clocksource
> watchdog should be able to deal with it accordingly.
>
> The solution in this patch skips the current watchdog check if the
> watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
> watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
> uncertainty margin (of the TSC and HPET) corresponds to a limit on
> clocksource skew of 333 ppm (microseconds of skew per second). To keep the
> limit imposed by NTP (500 microseconds of skew per second) for all
> possible watchdog intervals, the margins would have to be scaled so that
> the threshold value is proportional to the length of the actual watchdog
> interval.
>
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")

IIUC, the issue could happen even before this commit? Though I do think
this is good stuff for stable kernel.

> Suggested-by: Feng Tang <[email protected]>
> Signed-off-by: Jiri Wiesner <[email protected]>

0Day robot reported some compiling issue. Other than that, it looks
good to me, thanks!

Reviewed-by: Feng Tang <[email protected]>

>
> ---
> kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..ac5cb0ff278b 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -98,7 +98,9 @@ static u64 suspend_start;
> /*
> * Interval: 0.5sec.
> */
> -#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> + * NSEC_PER_SEC / HZ)
>
> /*
> * Threshold: 0.0312s, when doubled: 0.0625s.
> @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> static DEFINE_SPINLOCK(watchdog_lock);
> static int watchdog_running;
> static atomic_t watchdog_reset_pending;
> +static int64_t watchdog_max_intr;
>
> static inline void clocksource_watchdog_lock(unsigned long *flags)
> {
> @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> {
> u64 csnow, wdnow, cslast, wdlast, delta;
> int next_cpu, reset_pending;
> - int64_t wd_nsec, cs_nsec;
> + int64_t wd_nsec, cs_nsec, interval;
> struct clocksource *cs;
> enum wd_read_status read_ret;
> unsigned long extra_wait = 0;
> @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused)
> if (atomic_read(&watchdog_reset_pending))
> continue;
>
> + /*
> + * The processing of timer softirqs can get delayed (usually
> + * on account of ksoftirqd not getting to run in a timely
> + * manner), which causes the watchdog interval to stretch.
> + * Some clocksources, e.g. acpi_pm, cannot tolerate
> + * watchdog intervals longer than a few seconds.
> + * Skew detection may fail for longer watchdog intervals
> + * on account of fixed margins being used.
> + */
> + interval = max(cs_nsec, wd_nsec);
> + if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
> + if (system_state > SYSTEM_SCHEDULING &&
> + interval > 2 * watchdog_max_intr) {
> + watchdog_max_intr = interval;
> + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> + cs_nsec, wd_nsec);
> + }
> + watchdog_timer.expires = jiffies;
> + continue;
> + }
> +
> /* Check the deviation from the watchdog clocksource. */
> md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> if (abs(cs_nsec - wd_nsec) > md) {
> --
> 2.35.3
>
>
> --
> Jiri Wiesner
> SUSE Labs

2024-01-04 16:31:28

by Jiri Wiesner

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> I believe that there were concerns about a similar approach in the case
> where the jiffies counter is the clocksource

I ran a few simple tests on a 2 NUMA node Intel machine and found nothing
so far. I tried booting with clocksource=jiffies and I changed the
"nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable
the watchdog on my machine. I have a debugging module that monitors
clocksource and watchdog reads in clocksource_watchdog() with kprobes. I
see the cs/wd reads executed roughly every 0.5 second, as expected. When
the machine is idle the average watchdog interval is 501.61 milliseconds
(+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The
result is similar when the CPUs of the machine are fully saturated with
netperf processes. I also tried booting with clocksource=jiffies and
tsc=watchdog. The watchdog interval was similar to the previous test.

AFAIK, the jiffies clocksource does get checked by the watchdog itself.
And with that, I have run out of ideas.
--
Jiri Wiesner
SUSE Labs

2024-01-04 16:48:45

by Jiri Wiesner

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

On Thu, Jan 04, 2024 at 01:55:21PM +0800, Feng Tang wrote:
> On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote:
> > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
>
> IIUC, the issue could happen even before this commit? Though I do think
> this is good stuff for stable kernel.

Yes, it could happen before 2e27e793e280 but the threshold was so loose
that it probably was an extremely rare event. Commit 2e27e793e280 lowered
the threshold and this significantly increased the likelihood of skew
detection failing for large watchdog intervals. We got a report of a
kernel with 2e27e793e280 switching to the HPET but there were no failures
with a kernel without 2e27e793e280 on the same fleet of 8 NUMA node machines.

> 0Day robot reported some compiling issue.

I am going to submit a v2 of the patch fixing the error after we clarify
the jiffies concern.
--
Jiri Wiesner
SUSE Labs

2024-01-04 19:20:17

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote:
> On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> > I believe that there were concerns about a similar approach in the case
> > where the jiffies counter is the clocksource
>
> I ran a few simple tests on a 2 NUMA node Intel machine and found nothing
> so far. I tried booting with clocksource=jiffies and I changed the
> "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable
> the watchdog on my machine. I have a debugging module that monitors
> clocksource and watchdog reads in clocksource_watchdog() with kprobes. I
> see the cs/wd reads executed roughly every 0.5 second, as expected. When
> the machine is idle the average watchdog interval is 501.61 milliseconds
> (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The
> result is similar when the CPUs of the machine are fully saturated with
> netperf processes. I also tried booting with clocksource=jiffies and
> tsc=watchdog. The watchdog interval was similar to the previous test.
>
> AFAIK, the jiffies clocksource does get checked by the watchdog itself.
> And with that, I have run out of ideas.

If I recall correctly (ha!), the concern was that with the jiffies as
clocksource, we would be using jiffies (via timers) to check jiffies
(the clocksource), and that this could cause issues if the jiffies got
behind, then suddenly updated while the clocksource watchdog was running.

Thoughts?

Thanx, Paul

2024-01-06 03:04:15

by Feng Tang

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

On Thu, Jan 04, 2024 at 11:19:56AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote:
> > On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> > > I believe that there were concerns about a similar approach in the case
> > > where the jiffies counter is the clocksource
> >
> > I ran a few simple tests on a 2 NUMA node Intel machine and found nothing
> > so far. I tried booting with clocksource=jiffies and I changed the
> > "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable
> > the watchdog on my machine. I have a debugging module that monitors
> > clocksource and watchdog reads in clocksource_watchdog() with kprobes. I
> > see the cs/wd reads executed roughly every 0.5 second, as expected. When
> > the machine is idle the average watchdog interval is 501.61 milliseconds
> > (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The
> > result is similar when the CPUs of the machine are fully saturated with
> > netperf processes. I also tried booting with clocksource=jiffies and
> > tsc=watchdog. The watchdog interval was similar to the previous test.
> >
> > AFAIK, the jiffies clocksource does get checked by the watchdog itself.
> > And with that, I have run out of ideas.
>
> If I recall correctly (ha!), the concern was that with the jiffies as
> clocksource, we would be using jiffies (via timers) to check jiffies
> (the clocksource), and that this could cause issues if the jiffies got
> behind, then suddenly updated while the clocksource watchdog was running.

Yes, we also met problem when 'jiffies' was used as clocksource/watchdog,
but don't know if it's the same problem you mentioned. Our problem
('jiffies' as watchdog marks clocksource TSC as unstable) only happens
in early boot phase with serial earlyprintk enabled, that the updating
of 'jiffies' relies on HW timer's periodic interrupt, but early printk
will disable interrupt during printing and cause some timer interrupts
lost, and hence big lagging in 'jiffies'. Rui once proposed a patch to
prevent 'jiffies' from being a watchdog due to it unreliability [1].

And I think skipping the watchdog check one time when detecting some
abnormal condition won't hurt the overall check much.

[1]. https://lore.kernel.org/lkml/[email protected]/

Thanks,
Feng

> Thoughts?
>
> Thanx, Paul

2024-01-06 12:04:24

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

On Sat, Jan 06, 2024 at 10:55:09AM +0800, Feng Tang wrote:
> On Thu, Jan 04, 2024 at 11:19:56AM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote:
> > > On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> > > > I believe that there were concerns about a similar approach in the case
> > > > where the jiffies counter is the clocksource
> > >
> > > I ran a few simple tests on a 2 NUMA node Intel machine and found nothing
> > > so far. I tried booting with clocksource=jiffies and I changed the
> > > "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable
> > > the watchdog on my machine. I have a debugging module that monitors
> > > clocksource and watchdog reads in clocksource_watchdog() with kprobes. I
> > > see the cs/wd reads executed roughly every 0.5 second, as expected. When
> > > the machine is idle the average watchdog interval is 501.61 milliseconds
> > > (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The
> > > result is similar when the CPUs of the machine are fully saturated with
> > > netperf processes. I also tried booting with clocksource=jiffies and
> > > tsc=watchdog. The watchdog interval was similar to the previous test.
> > >
> > > AFAIK, the jiffies clocksource does get checked by the watchdog itself.
> > > And with that, I have run out of ideas.
> >
> > If I recall correctly (ha!), the concern was that with the jiffies as
> > clocksource, we would be using jiffies (via timers) to check jiffies
> > (the clocksource), and that this could cause issues if the jiffies got
> > behind, then suddenly updated while the clocksource watchdog was running.
>
> Yes, we also met problem when 'jiffies' was used as clocksource/watchdog,
> but don't know if it's the same problem you mentioned. Our problem
> ('jiffies' as watchdog marks clocksource TSC as unstable) only happens
> in early boot phase with serial earlyprintk enabled, that the updating
> of 'jiffies' relies on HW timer's periodic interrupt, but early printk
> will disable interrupt during printing and cause some timer interrupts
> lost, and hence big lagging in 'jiffies'. Rui once proposed a patch to
> prevent 'jiffies' from being a watchdog due to it unreliability [1].
>
> And I think skipping the watchdog check one time when detecting some
> abnormal condition won't hurt the overall check much.

Works for me!

Thanx, Paul

> [1]. https://lore.kernel.org/lkml/[email protected]/
>
> Thanks,
> Feng
>
> > Thoughts?
> >
> > Thanx, Paul

2024-01-08 13:45:13

by Oliver Sang

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

Hello,

kernel test robot noticed "WARNING:at_kernel/time/clocksource-wdtest.c:#wdtest_func" on:

commit: ceacf0b27b5d9cdc574f8d0a0bcb11d3272e7b9f ("[PATCH] clocksource: Skip watchdog check for large watchdog intervals")
url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5
patch link: https://lore.kernel.org/all/20240103112113.GA6108@incl/
patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

in testcase: boot

compiler: gcc-9
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)

+--------------------------------------------------------------------+------------+------------+
| | da65f29dad | ceacf0b27b |
+--------------------------------------------------------------------+------------+------------+
| WARNING:at_kernel/time/clocksource-wdtest.c:#wdtest_func | 0 | 14 |
| EIP:wdtest_func | 0 | 14 |
+--------------------------------------------------------------------+------------+------------+

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]

[ 54.778895][ T41] ------------[ cut here ]------------
[ 54.781279][ T41] WARNING: CPU: 0 PID: 41 at kernel/time/clocksource-wdtest.c:162 wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1))
[ 54.785405][ T41] Modules linked in:
[ 54.787152][ T41] CPU: 0 PID: 41 Comm: wdtest Tainted: G N 6.7.0-rc2-00014-gceacf0b27b5d #1 30e4ebd22e7da702dfdab1313ae74e5a246df970
[ 54.791633][ T41] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 54.795124][ T41] EIP: wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1))
[ 54.796800][ T41] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b e9 ee fd ff ff 8d b4 26 00 00 00 00 66 90 0f 0b e9 08 fe ff ff 8d b4 26 00 00 00 00 66 90 <0f> 0b e9 76 ff ff ff cc cc cc cc cc cc cc cc cc 3e 8d 74 26 00 55
All code
========
0: ff 8d b4 26 00 00 decl 0x26b4(%rbp)
6: 00 00 add %al,(%rax)
8: 66 90 xchg %ax,%ax
a: 0f 0b ud2
c: e9 ee fd ff ff jmpq 0xfffffffffffffdff
11: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
18: 66 90 xchg %ax,%ax
1a: 0f 0b ud2
1c: e9 08 fe ff ff jmpq 0xfffffffffffffe29
21: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
28: 66 90 xchg %ax,%ax
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 76 ff ff ff jmpq 0xffffffffffffffa7
31: cc int3
32: cc int3
33: cc int3
34: cc int3
35: cc int3
36: cc int3
37: cc int3
38: cc int3
39: cc int3
3a: 3e 8d 74 26 00 lea %ds:0x0(%rsi,%riz,1),%esi
3f: 55 push %rbp

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 76 ff ff ff jmpq 0xffffffffffffff7d
7: cc int3
8: cc int3
9: cc int3
a: cc int3
b: cc int3
c: cc int3
d: cc int3
e: cc int3
f: cc int3
10: 3e 8d 74 26 00 lea %ds:0x0(%rsi,%riz,1),%esi
15: 55 push %rbp
[ 54.801794][ T41] EAX: c4a1fe8c EBX: 00000004 ECX: 00000000 EDX: 00000000
[ 54.803569][ T41] ESI: c4a1ff2e EDI: 00000000 EBP: c74aff70 ESP: c74aff60
[ 54.805314][ T41] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 54.807278][ T41] CR0: 80050033 CR2: ffda9000 CR3: 05979000 CR4: 000406d0
[ 54.808856][ T41] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 54.810216][ T41] DR6: fffe0ff0 DR7: 00000400
[ 54.811218][ T41] Call Trace:
[ 54.811950][ T41] ? show_regs (arch/x86/kernel/dumpstack.c:479 arch/x86/kernel/dumpstack.c:465)
[ 54.812840][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1))
[ 54.813801][ T41] ? __warn (kernel/panic.c:677)
[ 54.814643][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1))
[ 54.815628][ T41] ? report_bug (lib/bug.c:201 lib/bug.c:219)
[ 54.816581][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250)
[ 54.817506][ T41] ? handle_bug (arch/x86/kernel/traps.c:216)
[ 54.818420][ T41] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1))
[ 54.819425][ T41] ? handle_exception (arch/x86/entry/entry_32.S:1056)
[ 54.820438][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250)
[ 54.821374][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1))
[ 54.822318][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250)
[ 54.823284][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1))
[ 54.824248][ T41] kthread (kernel/kthread.c:388)
[ 54.825093][ T41] ? wdtest_ktime_read (kernel/time/clocksource-wdtest.c:105)
[ 54.826132][ T41] ? kthread_complete_and_exit (kernel/kthread.c:341)
[ 54.827307][ T41] ret_from_fork (arch/x86/kernel/process.c:153)
[ 54.828219][ T41] ? kthread_complete_and_exit (kernel/kthread.c:341)
[ 54.829331][ T41] ret_from_fork_asm (arch/x86/entry/entry_32.S:741)
[ 54.830310][ T41] entry_INT80_32 (arch/x86/entry/entry_32.S:947)
[ 54.831287][ T41] irq event stamp: 879
[ 54.832154][ T41] hardirqs last enabled at (889): console_unlock (kernel/printk/printk.c:341 kernel/printk/printk.c:2706 kernel/printk/printk.c:3038)
[ 54.833838][ T41] hardirqs last disabled at (898): console_unlock (kernel/printk/printk.c:339 kernel/printk/printk.c:2706 kernel/printk/printk.c:3038)
[ 54.835567][ T41] softirqs last enabled at (918): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582)
[ 54.837246][ T41] softirqs last disabled at (907): do_softirq_own_stack (arch/x86/kernel/irq_32.c:57 arch/x86/kernel/irq_32.c:147)
[ 54.839049][ T41] ---[ end trace 0000000000000000 ]---
[ 54.840178][ T41] clocksource: Not enough CPUs to check clocksource 'wdtest-ktime'.
[ 54.841774][ T41] clocksource_wdtest: --- Done with test.

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240108/[email protected]

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-01-10 18:36:31

by Jiri Wiesner

[permalink] [raw]

Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

On Mon, Jan 08, 2024 at 09:44:31PM +0800, kernel test robot wrote:
> commit: ceacf0b27b5d9cdc574f8d0a0bcb11d3272e7b9f ("[PATCH] clocksource: Skip watchdog check for large watchdog intervals")
> url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5
> patch link: https://lore.kernel.org/all/20240103112113.GA6108@incl/
> patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals
>
> [ 54.778895][ T41] ------------[ cut here ]------------
> [ 54.781279][ T41] WARNING: CPU: 0 PID: 41 at kernel/time/clocksource-wdtest.c:162 wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1))
> [ 54.785405][ T41] Modules linked in:
> [ 54.787152][ T41] CPU: 0 PID: 41 Comm: wdtest Tainted: G N 6.7.0-rc2-00014-gceacf0b27b5d #1 30e4ebd22e7da702dfdab1313ae74e5a246df970
> [ 54.791633][ T41] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 54.795124][ T41] EIP: wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1))

The code:
pr_info("--- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.\n");
WRITE_ONCE(wdtest_ktime_read_fuzz, true);
schedule_timeout_uninterruptible(2 * HZ);
WARN_ON_ONCE(!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));

The warning is the consequence of the watchdog not running on i386 builds of the kernel. The watchdog does not run because the "interval > WATCHDOG_INTR_MAX_NS" check is always true on account of the WATCHDOG_INTR_MAX_NS constant evaluating to a negative number on account of integer overflow.

> [ 54.801794][ T41] EAX: c4a1fe8c EBX: 00000004 ECX: 00000000 EDX: 00000000
> [ 54.803569][ T41] ESI: c4a1ff2e EDI: 00000000 EBP: c74aff70 ESP: c74aff60
> [ 54.805314][ T41] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010246
> [ 54.807278][ T41] CR0: 80050033 CR2: ffda9000 CR3: 05979000 CR4: 000406d0
> [ 54.808856][ T41] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [ 54.810216][ T41] DR6: fffe0ff0 DR7: 00000400
> [ 54.811218][ T41] Call Trace:

--
Jiri Wiesner
SUSE Labs