Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751944AbcDRRbL (ORCPT ); Mon, 18 Apr 2016 13:31:11 -0400 Received: from mail-oi0-f54.google.com ([209.85.218.54]:36010 "EHLO mail-oi0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750727AbcDRRbK (ORCPT ); Mon, 18 Apr 2016 13:31:10 -0400 MIME-Version: 1.0 In-Reply-To: <984ee7ee5ba91bdaa155a16e741f5b83d70233d2.1460422356.git.shli@fb.com> References: <09c4f19409012995595db6fd0a12f326c292af1a.1460422356.git.shli@fb.com> <984ee7ee5ba91bdaa155a16e741f5b83d70233d2.1460422356.git.shli@fb.com> Date: Mon, 18 Apr 2016 10:31:09 -0700 Message-ID: Subject: Re: [RFC 2/2] time: double check if watchdog clocksource is correct From: John Stultz To: Shaohua Li Cc: lkml , Thomas Gleixner , calvinowens@fb.com, Gratian Crisan Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3571 Lines: 76 On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li wrote: > We use watchdog clocksource to detect unstable clocksource. This assumes > watchdog clocksource is correct. But it's possible watchdog clocksource > is crappy, please see previous patch. Double check if watchdog interval > is too long and bypass potential wrong watchdog clocksource. > > Signed-off-by: Shaohua Li > --- > kernel/time/clocksource.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index 56ece14..36aff4e 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data); > static void __clocksource_change_rating(struct clocksource *cs, int rating); > > /* > - * Interval: 0.5sec Threshold: 0.0625s > + * Interval: 0.5sec, Max Interval: 0.75sec, Threshold: 0.0625s > */ > #define WATCHDOG_INTERVAL (HZ >> 1) > +#define WATCHDOG_MAX_INTERVAL ((NSEC_PER_SEC >> 1) + (NSEC_PER_SEC >> 2)) Is there a reason this #define is so obtusely stated? Its all going to be pre-computed by the compiler, so I'm not sure why to make it more opaque to the casual reader. Maybe something more like: (750*MSEC_PER_SEC) > #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4) > > static void clocksource_watchdog_work(struct work_struct *work) > @@ -217,7 +218,8 @@ static void clocksource_watchdog(unsigned long data) > continue; > > /* Check the deviation from the watchdog clocksource. */ > - if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) { > + if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD && > + wd_nsec < WATCHDOG_MAX_INTERVAL) { > pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n", > smp_processor_id(), cs->name); > pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n", > -- My main concern with these tweaks have been that we might have a case where there's a slow TSC (old cpufreq effected style, or maybe halts in deep idle), so everything is moving slowly and the timers are firing late. Then the watchdog interval would *look* to be too long and we wouldn't have a good tool for disqualifying that bad TSC. That said, I am continuing to hear cases of problematic watchdog disqualifications where the watchdog is to blame, due to wraparound caused by VM delays, PREEMPT_RT, or softirq processing, so something here would be nice. Gratin (cc'ed) also was looking at this, and we had some thoughts about trying to avoid the watchdog wrap-around issue by checking that the clocksource's counter divided by the watchdog wraparound interval (in nsecs) matched the watchdog's counter (again, in nsecs). But it seems like for your case where the watchdog hardware is bad, that still wouldn't catch it. Something like a watchdog watchdog w/ something like the RTC might be an approach, but I worry that we will hit the same problematic "can't trust the RTC on hardware foo" issue down the road. Having a watchdog quorum between multiple counters might be the only way there, but on many devices there's really only two main options. :/ Thomas: Any other thoughts? Are these just oddball cases that have to be dealt with via tsc=reliable boot arguments? Should we print something informative about that option when we disqualify the TSC? thanks -john