Date: Mon, 17 Aug 2015 19:57:06 -0700
From: Shaohua Li <shli@fb.com>
To: John Stultz <john.stultz@linaro.org>
CC: Thomas Gleixner <tglx@linutronix.de>, lkml <linux-kernel@vger.kernel.org>,
        Prarit Bhargava <prarit@redhat.com>,
        Richard Cochran <richardcochran@gmail.com>,
        Daniel Lezcano <daniel.lezcano@linaro.org>,
        Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection
Message-ID: <20150818025704.GA1129225@devbig257.prn2.facebook.com>
References: <1439844063-7957-1-git-send-email-john.stultz@linaro.org>
 <1439844063-7957-9-git-send-email-john.stultz@linaro.org>
 <alpine.DEB.2.11.1508172326020.3873@nanos>
 <CALAqxLWOzkbbFeAUKrE9B=O786vuJVmWQdagbfxkpLJDm2mXwQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <CALAqxLWOzkbbFeAUKrE9B=O786vuJVmWQdagbfxkpLJDm2mXwQ@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-12-10)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5306
Lines: 128

On Mon, Aug 17, 2015 at 03:17:28PM -0700, John Stultz wrote:
> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Mon, 17 Aug 2015, John Stultz wrote:
> >
> >> From: Shaohua Li <shli@fb.com>
> >>
> >> >From time to time we saw TSC is marked as unstable in our systems, while
> >
> > Stray '>'
> >
> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> >> detection, there are two problems:
> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> >>   can wrap in about 5 minutes.
> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
> >>
> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> interrupt for a very long time.
> >
> > And they hold off the timer softirq for more than a second? Don't you
> > think that's the problem which needs to be fixed?
> 
> Though this is an issue I've experienced (and tried unsuccessfully to
> fix in a more complicated way) with the RT kernel, where high priority
> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> 
> Ideally that sort of high-priority RT busyness would be avoided, but
> its also a pain to have false positive trigger when doing things like
> stress testing.
> 
> 
> >> The first problem is mostly we are suffering I think.
> >
> > So you think that's the root cause and because your patch makes it go
> > away it's not necessary to know for sure, right?
> >
> >> Here is a simple patch to fix the issues. If the waterdog doesn't run
> >
> > waterdog?
> 
> Allergen-free. :)
> 
> 
> >> for a long time, we ignore the detection.
> >
> > What's 'long time'? Please explain the numbers chosen.
> >
> >> This should work for the two
> >
> > Emphasis on 'should'?
> >
> >> problems. For the second one, we probably doen't need to scale if the
> >> interval isn't very long.
> >
> > -ENOPARSE
> >
> >> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
> >>  static void __clocksource_change_rating(struct clocksource *cs, int rating);
> >>
> >>  /*
> >> - * Interval: 0.5sec Threshold: 0.0625s
> >> + * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
> >>   */
> >>  #define WATCHDOG_INTERVAL (HZ >> 1)
> >> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> >>  #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
> >>
> >>  static void clocksource_watchdog_work(struct work_struct *work)
> >> @@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
> >>                       continue;
> >>
> >>               /* Check the deviation from the watchdog clocksource. */
> >> -             if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
> >> +             if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
> >> +                 cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> >> +                 wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> >
> > So that adds a new opportunity for undiscovered wreckage:
> >
> >    clocksource_watchdog();
> >    ....                         <--- SMI skews TSC
> >    looong_irq_disabled_region();
> >    ....
> >    clocksource_watchdog();      <--- Does not detect skew
> >
> > and it will not detect it later on if that SMI was a one time event.
> >
> > So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
> > prevents the watchdog to run is the proper thing to do.
> 
> I'm not sure here. I feel like these delay-caused false positives
> (I've seen similar reports w/ VMs being stalled) are more common then
> one-off SMI TSC skews.
> 
> There are hard lines in the timekeeping code, where we do say: Don't
> delay us past X or we can't really handle it, but in this case, the
> main clocksource is fine and the limit is being caused by the
> watchdog. So I think some sort of a solution to remove this
> restriction would be good. We don't want to needlessly punish fine
> hardware because our checks for bad hardware add extra restrictions.
> 
> That said, I agree the "should"s and other vague qualifiers in the
> commit description you point out should have more specifics to back
> things up. And I'm fine delaying this (and the follow-on) patch until
> those details are provided.

It's not something I guess. We do see the issue from time to time. The
IPMI driver accesses some IO ports in softirq and hog cpu for a very
long time, then the watchdog alert. The false alert on the other hand
has very worse effect. It forces to use HPET as clocksource, which has
very big performance penality. We can't even manually switch back to TSC
as current interface doesn't allow us to do it, then we can only reboot
the system. I agree the driver should be fixed, but the watchdog has
false alert, we definitively should fix it.

The 1s interval is arbitary. If you think there is better way to fix the
issue, please let me know.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/