Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id ADB33C433F5 for ; Mon, 22 Nov 2021 03:02:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238521AbhKVDFe (ORCPT ); Sun, 21 Nov 2021 22:05:34 -0500 Received: from mga02.intel.com ([134.134.136.20]:48333 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229870AbhKVDFd (ORCPT ); Sun, 21 Nov 2021 22:05:33 -0500 X-IronPort-AV: E=McAfee;i="6200,9189,10175"; a="221933965" X-IronPort-AV: E=Sophos;i="5.87,253,1631602800"; d="scan'208";a="221933965" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 21 Nov 2021 19:02:27 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.87,253,1631602800"; d="scan'208";a="496697414" Received: from shbuild999.sh.intel.com (HELO localhost) ([10.239.146.189]) by orsmga007.jf.intel.com with ESMTP; 21 Nov 2021 19:02:24 -0800 Date: Mon, 22 Nov 2021 11:02:23 +0800 From: Feng Tang To: Waiman Long Cc: John Stultz , Thomas Gleixner , Stephen Boyd , "Paul E. McKenney" , linux-kernel@vger.kernel.org, Peter Zijlstra , Cassio Neri , Linus Walleij , Frederic Weisbecker Subject: Re: [PATCH v3 1/4] clocksource: Avoid accidental unstable marking of clocksources Message-ID: <20211122030223.GG34844@shbuild999.sh.intel.com> References: <20211118191439.1000012-1-longman@redhat.com> <20211118191439.1000012-2-longman@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20211118191439.1000012-2-longman@redhat.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 18, 2021 at 02:14:36PM -0500, Waiman Long wrote: > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew > threshold"), it is found that tsc clocksource fallback to hpet can > sometimes happen on both Intel and AMD systems especially when they are > running stressful benchmarking workloads. Of the 23 systems tested with > a v5.14 kernel, 10 of them have switched to hpet clock source during > the test run. > > The result of falling back to hpet is a drastic reduction of performance > when running benchmarks. For example, the fio performance tests can > drop up to 70% whereas the iperf3 performance can drop up to 80%. > > 4 hpet fallbacks happened during bootup. They were: > > [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable > [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable > [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable > [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable > > Other fallbacks happen when the systems were running stressful > benchmarks. For example: > > [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable > [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable > > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"), > changed the skew margin from 100us to 50us. I think this is too small > and can easily be exceeded when running some stressful workloads on a > thermally stressed system. So it is switched back to 100us. > > Even a maximum skew margin of 100us may be too small in for some systems > when booting up especially if those systems are under thermal stress. To > eliminate the case that the large skew is due to the system being too > busy slowing down the reading of both the watchdog and the clocksource, > an extra consecutive read of watchdog clock is being done to check this. > > The consecutive watchdog read delay is compared against > WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that > the system is just too busy. A warning will be printed to the console > and the clock skew check is skipped for this round. Reviewed-by: Feng Tang Thanks, Feng > Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected") > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") > Signed-off-by: Waiman Long > --- > kernel/time/clocksource.c | 50 ++++++++++++++++++++++++++++++++------- > 1 file changed, 41 insertions(+), 9 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index b8a14d2fb5ba..bcad1a1e5dcf 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -107,7 +107,7 @@ static u64 suspend_start; > * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as > * a lower bound for cs->uncertainty_margin values when registering clocks. > */ > -#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC) > +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC) > > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG > static void clocksource_watchdog_work(struct work_struct *work); > @@ -205,17 +205,24 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries); > static int verify_n_cpus = 8; > module_param(verify_n_cpus, int, 0644); > > -static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow) > +enum wd_read_status { > + WD_READ_SUCCESS, > + WD_READ_UNSTABLE, > + WD_READ_SKIP > +}; > + > +static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow) > { > unsigned int nretries; > - u64 wd_end, wd_delta; > - int64_t wd_delay; > + u64 wd_end, wd_end2, wd_delta; > + int64_t wd_delay, wd_seq_delay; > > for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) { > local_irq_disable(); > *wdnow = watchdog->read(watchdog); > *csnow = cs->read(cs); > wd_end = watchdog->read(watchdog); > + wd_end2 = watchdog->read(watchdog); > local_irq_enable(); > > wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask); > @@ -226,13 +233,34 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow) > pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n", > smp_processor_id(), watchdog->name, nretries); > } > - return true; > + return WD_READ_SUCCESS; > } > + > + /* > + * Now compute delay in consecutive watchdog read to see if > + * there is too much external interferences that cause > + * significant delay in reading both clocksource and watchdog. > + * > + * If consecutive WD read-back delay > WATCHDOG_MAX_SKEW/2, > + * report system busy, reinit the watchdog and skip the current > + * watchdog test. > + */ > + wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask); > + wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift); > + if (wd_seq_delay > WATCHDOG_MAX_SKEW/2) > + goto skip_test; > } > > pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n", > smp_processor_id(), watchdog->name, wd_delay, nretries); > - return false; > + return WD_READ_UNSTABLE; > + > +skip_test: > + pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n", > + smp_processor_id(), watchdog->name, wd_seq_delay); > + pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n", > + cs->name, wd_delay); > + return WD_READ_SKIP; > } > > static u64 csnow_mid; > @@ -356,6 +384,7 @@ static void clocksource_watchdog(struct timer_list *unused) > int next_cpu, reset_pending; > int64_t wd_nsec, cs_nsec; > struct clocksource *cs; > + enum wd_read_status read_ret; > u32 md; > > spin_lock(&watchdog_lock); > @@ -373,9 +402,12 @@ static void clocksource_watchdog(struct timer_list *unused) > continue; > } > > - if (!cs_watchdog_read(cs, &csnow, &wdnow)) { > - /* Clock readout unreliable, so give it up. */ > - __clocksource_unstable(cs); > + read_ret = cs_watchdog_read(cs, &csnow, &wdnow); > + > + if (read_ret != WD_READ_SUCCESS) { > + if (read_ret == WD_READ_UNSTABLE) > + /* Clock readout unreliable, so give it up. */ > + __clocksource_unstable(cs); > continue; > } > > -- > 2.27.0