Received: by 2002:a05:6a10:af89:0:0:0:0 with SMTP id iu9csp3586348pxb; Mon, 24 Jan 2022 12:53:30 -0800 (PST) X-Google-Smtp-Source: ABdhPJwSNCrFtyjtFhWEXY8QnMS39EITy6kb062oJu/B6IbaC9sj1QA0EqaGzYJ4BsuiGe3fiCWB X-Received: by 2002:a05:6a00:234e:b0:4ae:2e0d:cc68 with SMTP id j14-20020a056a00234e00b004ae2e0dcc68mr15794228pfj.60.1643057610649; Mon, 24 Jan 2022 12:53:30 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1643057610; cv=none; d=google.com; s=arc-20160816; b=gBvw2TvYPGBzDU8A/HGMcomy0rg2yw2Bj8N5uwA1T1OW5sPFfYokGea0EZ6hFvjSc7 tbSqsKvIOKUcBI2JZb5uJkOCQ0Kmeo8j2M4tsvcz8iXsSZBF2ZB1ujE3jry6k1DzJapv Y87fjIsW9352wc15XIDf3lcxfNtpJiNYdYhuPiGH6UQeON4bTHaE4yJiHzoTf2RfIakC mPXEAnfrggDG7vetOGoWIu8DwVDuNoLb5nNrOomFtk8QeJ5qFjH9Wj8rcvpTjWqhRm3z fs/yXPMBkFaP+9q09jqTeItB9WwPqTujgUlMVwUrDFQvjRNlhKSM+nA6/tymG22JojQ3 B0pg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=CoHfv3BlGxGFpGcsnOGoA0QQEJPx31sLtyxC/wFhVG0=; b=VaAb6Fgv84gNutDO063nlMMlu+1gE0rJvJjJkVEVy2WpdgX/pYZHuaDXWWQ5oEQYWy cRBLRLfKW7fY4wQcYv8QJBgUdFZPf56hVMg8eaeC53C32G/danaKhF8XQnWm5sgoKHYE ZDhMB2bX3KVo7QzIeQHbMfxEvb9pZ5yx58Q2YMxc4uwPfaWIqJouT8ph5b4O263mF8zR fFHz3SZM3jvmm1uVUJbO8eObIus4GnXqrvA4PnzP/Me3fCn1P1Qtfr2U+FJZKxaECd+K 4xdgsejpDsnzs4JN3OxP3SZ110S5/CsPyvYRo2vOzZRKAn2O5Qv8rEU4o1eCMvH8VCEM hQdg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linuxfoundation.org header.s=korg header.b=op4XRG0V; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id b6si13307546plx.548.2022.01.24.12.53.18; Mon, 24 Jan 2022 12:53:30 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linuxfoundation.org header.s=korg header.b=op4XRG0V; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1348623AbiAXUM3 (ORCPT + 99 others); Mon, 24 Jan 2022 15:12:29 -0500 Received: from dfw.source.kernel.org ([139.178.84.217]:49206 "EHLO dfw.source.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344298AbiAXTxv (ORCPT ); Mon, 24 Jan 2022 14:53:51 -0500 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 48A0C601B6; Mon, 24 Jan 2022 19:53:49 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 308B5C340E5; Mon, 24 Jan 2022 19:53:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1643054028; bh=lOJCY7buUeLtrTZHe3ywJbkc8zqpQTJvczT/ZH8x97Q=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=op4XRG0VRxo1CxhAFEmKH8ZTzdogegbVt8GyyIHXyUsgQIq51oIgXx+hQTOlgEGRX K03TT/CfqMOZdfvj7JDtzh6D0zNSaC5rb/H5zn12HGDPPpfNrigm0QtwJ/4v3mWF9M njJPO7Ea980N/njNGpzWBFJ3osKS7pf+Ler8xnxA= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Waiman Long , "Paul E. McKenney" , Sasha Levin Subject: [PATCH 5.10 256/563] clocksource: Avoid accidental unstable marking of clocksources Date: Mon, 24 Jan 2022 19:40:21 +0100 Message-Id: <20220124184033.272071853@linuxfoundation.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20220124184024.407936072@linuxfoundation.org> References: <20220124184024.407936072@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Waiman Long [ Upstream commit c86ff8c55b8ae68837b2fa59dc0c203907e9a15f ] Since commit db3a34e17433 ("clocksource: Retry clock read if long delays detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"), it is found that tsc clocksource fallback to hpet can sometimes happen on both Intel and AMD systems especially when they are running stressful benchmarking workloads. Of the 23 systems tested with a v5.14 kernel, 10 of them have switched to hpet clock source during the test run. The result of falling back to hpet is a drastic reduction of performance when running benchmarks. For example, the fio performance tests can drop up to 70% whereas the iperf3 performance can drop up to 80%. 4 hpet fallbacks happened during bootup. They were: [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable Other fallbacks happen when the systems were running stressful benchmarks. For example: [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"), changed the skew margin from 100us to 50us. I think this is too small and can easily be exceeded when running some stressful workloads on a thermally stressed system. So it is switched back to 100us. Even a maximum skew margin of 100us may be too small in for some systems when booting up especially if those systems are under thermal stress. To eliminate the case that the large skew is due to the system being too busy slowing down the reading of both the watchdog and the clocksource, an extra consecutive read of watchdog clock is being done to check this. The consecutive watchdog read delay is compared against WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that the system is just too busy. A warning will be printed to the console and the clock skew check is skipped for this round. Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected") Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") Signed-off-by: Waiman Long Signed-off-by: Paul E. McKenney Signed-off-by: Sasha Levin --- kernel/time/clocksource.c | 50 ++++++++++++++++++++++++++++++++------- 1 file changed, 41 insertions(+), 9 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index d0803a69a2009..e34ceb91f4c5a 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -105,7 +105,7 @@ static u64 suspend_start; * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as * a lower bound for cs->uncertainty_margin values when registering clocks. */ -#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC) +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC) #ifdef CONFIG_CLOCKSOURCE_WATCHDOG static void clocksource_watchdog_work(struct work_struct *work); @@ -200,17 +200,24 @@ void clocksource_mark_unstable(struct clocksource *cs) static ulong max_cswd_read_retries = 3; module_param(max_cswd_read_retries, ulong, 0644); -static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow) +enum wd_read_status { + WD_READ_SUCCESS, + WD_READ_UNSTABLE, + WD_READ_SKIP +}; + +static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow) { unsigned int nretries; - u64 wd_end, wd_delta; - int64_t wd_delay; + u64 wd_end, wd_end2, wd_delta; + int64_t wd_delay, wd_seq_delay; for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) { local_irq_disable(); *wdnow = watchdog->read(watchdog); *csnow = cs->read(cs); wd_end = watchdog->read(watchdog); + wd_end2 = watchdog->read(watchdog); local_irq_enable(); wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask); @@ -221,13 +228,34 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow) pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n", smp_processor_id(), watchdog->name, nretries); } - return true; + return WD_READ_SUCCESS; } + + /* + * Now compute delay in consecutive watchdog read to see if + * there is too much external interferences that cause + * significant delay in reading both clocksource and watchdog. + * + * If consecutive WD read-back delay > WATCHDOG_MAX_SKEW/2, + * report system busy, reinit the watchdog and skip the current + * watchdog test. + */ + wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask); + wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift); + if (wd_seq_delay > WATCHDOG_MAX_SKEW/2) + goto skip_test; } pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n", smp_processor_id(), watchdog->name, wd_delay, nretries); - return false; + return WD_READ_UNSTABLE; + +skip_test: + pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n", + smp_processor_id(), watchdog->name, wd_seq_delay); + pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n", + cs->name, wd_delay); + return WD_READ_SKIP; } static u64 csnow_mid; @@ -290,6 +318,7 @@ static void clocksource_watchdog(struct timer_list *unused) int next_cpu, reset_pending; int64_t wd_nsec, cs_nsec; struct clocksource *cs; + enum wd_read_status read_ret; u32 md; spin_lock(&watchdog_lock); @@ -307,9 +336,12 @@ static void clocksource_watchdog(struct timer_list *unused) continue; } - if (!cs_watchdog_read(cs, &csnow, &wdnow)) { - /* Clock readout unreliable, so give it up. */ - __clocksource_unstable(cs); + read_ret = cs_watchdog_read(cs, &csnow, &wdnow); + + if (read_ret != WD_READ_SUCCESS) { + if (read_ret == WD_READ_UNSTABLE) + /* Clock readout unreliable, so give it up. */ + __clocksource_unstable(cs); continue; } -- 2.34.1