Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752142AbbHRC5Z (ORCPT ); Mon, 17 Aug 2015 22:57:25 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:54964 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751403AbbHRC5Y (ORCPT ); Mon, 17 Aug 2015 22:57:24 -0400 Date: Mon, 17 Aug 2015 19:57:06 -0700 From: Shaohua Li To: John Stultz CC: Thomas Gleixner , lkml , Prarit Bhargava , Richard Cochran , Daniel Lezcano , Ingo Molnar Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection Message-ID: <20150818025704.GA1129225@devbig257.prn2.facebook.com> References: <1439844063-7957-1-git-send-email-john.stultz@linaro.org> <1439844063-7957-9-git-send-email-john.stultz@linaro.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-12-10) X-Originating-IP: [192.168.52.123] X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.14.151,1.0.33,0.0.0000 definitions=2015-08-18_01:2015-08-17,2015-08-17,1970-01-01 signatures=0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5306 Lines: 128 On Mon, Aug 17, 2015 at 03:17:28PM -0700, John Stultz wrote: > On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner wrote: > > On Mon, 17 Aug 2015, John Stultz wrote: > > > >> From: Shaohua Li > >> > >> >From time to time we saw TSC is marked as unstable in our systems, while > > > > Stray '>' > > > >> the CPUs declare to have stable TSC. Looking at the clocksource unstable > >> detection, there are two problems: > >> - watchdog clock source wrap. HPET is the most common watchdog clock > >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter > >> can wrap in about 5 minutes. > >> - threshold isn't scaled against interval. The threshold is 0.0625s in > >> 0.5s interval. What if the actual interval is bigger than 0.5s? > >> > >> The watchdog runs in a timer bh, so hard/soft irq can defer its running. > >> Heavy network stack softirq can hog a cpu. IPMI driver can disable > >> interrupt for a very long time. > > > > And they hold off the timer softirq for more than a second? Don't you > > think that's the problem which needs to be fixed? > > Though this is an issue I've experienced (and tried unsuccessfully to > fix in a more complicated way) with the RT kernel, where high priority > tasks blocked the watchdog long enough that we'd disqualify the TSC. > > Ideally that sort of high-priority RT busyness would be avoided, but > its also a pain to have false positive trigger when doing things like > stress testing. > > > >> The first problem is mostly we are suffering I think. > > > > So you think that's the root cause and because your patch makes it go > > away it's not necessary to know for sure, right? > > > >> Here is a simple patch to fix the issues. If the waterdog doesn't run > > > > waterdog? > > Allergen-free. :) > > > >> for a long time, we ignore the detection. > > > > What's 'long time'? Please explain the numbers chosen. > > > >> This should work for the two > > > > Emphasis on 'should'? > > > >> problems. For the second one, we probably doen't need to scale if the > >> interval isn't very long. > > > > -ENOPARSE > > > >> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data); > >> static void __clocksource_change_rating(struct clocksource *cs, int rating); > >> > >> /* > >> - * Interval: 0.5sec Threshold: 0.0625s > >> + * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s > >> */ > >> #define WATCHDOG_INTERVAL (HZ >> 1) > >> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC) > >> #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4) > >> > >> static void clocksource_watchdog_work(struct work_struct *work) > >> @@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data) > >> continue; > >> > >> /* Check the deviation from the watchdog clocksource. */ > >> - if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) { > >> + if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) && > >> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS && > >> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) { > > > > So that adds a new opportunity for undiscovered wreckage: > > > > clocksource_watchdog(); > > .... <--- SMI skews TSC > > looong_irq_disabled_region(); > > .... > > clocksource_watchdog(); <--- Does not detect skew > > > > and it will not detect it later on if that SMI was a one time event. > > > > So 'fixing' the watchdog is the wrong approach. Fixing the stuff which > > prevents the watchdog to run is the proper thing to do. > > I'm not sure here. I feel like these delay-caused false positives > (I've seen similar reports w/ VMs being stalled) are more common then > one-off SMI TSC skews. > > There are hard lines in the timekeeping code, where we do say: Don't > delay us past X or we can't really handle it, but in this case, the > main clocksource is fine and the limit is being caused by the > watchdog. So I think some sort of a solution to remove this > restriction would be good. We don't want to needlessly punish fine > hardware because our checks for bad hardware add extra restrictions. > > That said, I agree the "should"s and other vague qualifiers in the > commit description you point out should have more specifics to back > things up. And I'm fine delaying this (and the follow-on) patch until > those details are provided. It's not something I guess. We do see the issue from time to time. The IPMI driver accesses some IO ports in softirq and hog cpu for a very long time, then the watchdog alert. The false alert on the other hand has very worse effect. It forces to use HPET as clocksource, which has very big performance penality. We can't even manually switch back to TSC as current interface doesn't allow us to do it, then we can only reboot the system. I agree the driver should be fixed, but the watchdog has false alert, we definitively should fix it. The 1s interval is arbitary. If you think there is better way to fix the issue, please let me know. Thanks, Shaohua -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/