MIME-Version: 1.0
In-Reply-To: <984ee7ee5ba91bdaa155a16e741f5b83d70233d2.1460422356.git.shli@fb.com>
References: <09c4f19409012995595db6fd0a12f326c292af1a.1460422356.git.shli@fb.com>
	<984ee7ee5ba91bdaa155a16e741f5b83d70233d2.1460422356.git.shli@fb.com>
Date: Mon, 18 Apr 2016 10:31:09 -0700
Message-ID: <CALAqxLV7RbewzJjP2uXsFjkmgeWiuzAayxbjj0gGMdBw3KT-ng@mail.gmail.com>
Subject: Re: [RFC 2/2] time: double check if watchdog clocksource is correct
From: John Stultz <john.stultz@linaro.org>
To: Shaohua Li <shli@fb.com>
Cc: lkml <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>,
        calvinowens@fb.com, Gratian Crisan <gratian.crisan@ni.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3571
Lines: 76

On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li <shli@fb.com> wrote:
> We use watchdog clocksource to detect unstable clocksource. This assumes
> watchdog clocksource is correct. But it's possible watchdog clocksource
> is crappy, please see previous patch. Double check if watchdog interval
> is too long and bypass potential wrong watchdog clocksource.
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  kernel/time/clocksource.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 56ece14..36aff4e 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
>  static void __clocksource_change_rating(struct clocksource *cs, int rating);
>
>  /*
> - * Interval: 0.5sec Threshold: 0.0625s
> + * Interval: 0.5sec, Max Interval: 0.75sec, Threshold: 0.0625s
>   */
>  #define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_MAX_INTERVAL ((NSEC_PER_SEC >> 1) + (NSEC_PER_SEC >> 2))

Is there a reason this #define is so obtusely stated? Its all going to
be pre-computed by the compiler, so I'm not sure why to make it more
opaque to the casual reader.  Maybe something more like:
(750*MSEC_PER_SEC)

>  #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
>
>  static void clocksource_watchdog_work(struct work_struct *work)
> @@ -217,7 +218,8 @@ static void clocksource_watchdog(unsigned long data)
>                         continue;
>
>                 /* Check the deviation from the watchdog clocksource. */
> -               if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
> +               if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD &&
> +                   wd_nsec < WATCHDOG_MAX_INTERVAL) {
>                         pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
>                                 smp_processor_id(), cs->name);
>                         pr_warn("                      '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
> --


My main concern with these tweaks have been that we might have a case
where there's a slow TSC (old cpufreq effected style, or maybe halts
in deep idle),  so everything is moving slowly and the timers are
firing late. Then the watchdog interval would *look* to be too long
and we wouldn't have a good tool for disqualifying that bad TSC.

That said, I am continuing to hear cases of problematic watchdog
disqualifications where the watchdog is to blame, due to wraparound
caused by VM delays, PREEMPT_RT, or softirq processing, so something
here would be nice.

Gratin (cc'ed) also was looking at this, and we had some thoughts
about trying to avoid the watchdog wrap-around issue by checking that
the clocksource's counter divided by the watchdog wraparound interval
(in nsecs)  matched the watchdog's counter (again, in nsecs).

But it seems like for your case where the watchdog hardware is bad,
that still wouldn't catch it. Something like a watchdog watchdog w/
something like the RTC might be an approach, but I worry that we will
hit the same problematic "can't trust the RTC on hardware foo" issue
down the road.  Having a watchdog quorum between multiple counters
might be the only way there, but on many devices there's really only
two main options. :/

Thomas: Any other thoughts? Are these just oddball cases that have to
be dealt with via tsc=reliable boot arguments? Should we print
something informative about that option when we disqualify the TSC?

thanks
-john