From: Feng Tang
To: John Stultz, Thomas Gleixner, Stephen Boyd, x86@kernel.org, Peter Zijlstra, "Paul E . McKenney"
Cc: linux-kernel@vger.kernel.org, Waiman Long, Tim Chen, Feng Tang
Subject: [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read latency detected
Date: Tue, 20 Dec 2022 16:25:12 +0800
Message-Id: <20221220082512.186283-1-feng.tang@intel.com>

Bugs were reported on 8-socket x86 machines where the TSC was wrongly
disabled while the system was under heavy workload:

  [  818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
  [  818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
  [  819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
  [  819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
  [  819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
  [  819.936243] tsc: Marking TSC unstable due to clocksource watchdog
  [  820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
  [  820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
  [  820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
  [  821.067990] clocksource: Switched to clocksource hpet

This can be reproduced when the system is running the memory-intensive
'stream' test, or some stress-ng subcases like 'ioport'.

The reason is that when the system is under heavy load, the read latency
of a clocksource can be very high. This is visible even with the
lightweight TSC read, and is much worse for external clocksources that
are read through MMIO or IO ports, which makes the watchdog check
inaccurate.

As the clocksource watchdog is a lifetime check that runs twice a
second, there is no need to rush it when the system is under heavy load
and the clocksource read latency is very high. In that case, suspend
the watchdog timer for 5 minutes.

Signed-off-by: Feng Tang
---
 kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 9cf32ccda715..8cd74b89d577 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
 }
 EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
 
+static inline void clocksource_reset_watchdog(void)
+{
+	struct clocksource *cs;
+
+	list_for_each_entry(cs, &watchdog_list, wd_list)
+		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+}
+
+
 static void clocksource_watchdog(struct timer_list *unused)
 {
 	u64 csnow, wdnow, cslast, wdlast, delta;
@@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
 	enum wd_read_status read_ret;
+	unsigned long extra_wait = 0;
 	u32 md;
 
 	spin_lock(&watchdog_lock);
@@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
 
 		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
 
-		if (read_ret != WD_READ_SUCCESS) {
-			if (read_ret == WD_READ_UNSTABLE)
-				/* Clock readout unreliable, so give it up. */
-				__clocksource_unstable(cs);
+		if (read_ret == WD_READ_UNSTABLE) {
+			/* Clock readout unreliable, so give it up. */
+			__clocksource_unstable(cs);
 			continue;
 		}
 
+		/*
+		 * When WD_READ_SKIP is returned, it means the system is likely
+		 * under very heavy load, where the latency of reading
+		 * watchdog/clocksource is very high, and affects the accuracy
+		 * of the watchdog check. So give the system some space and
+		 * suspend the watchdog check for 5 minutes.
+		 */
+		if (read_ret == WD_READ_SKIP) {
+			/*
+			 * As the watchdog timer will be suspended, and
+			 * cs->last could stay unchanged for 5 minutes, reset
+			 * the counters.
+			 */
+			clocksource_reset_watchdog();
+			extra_wait = HZ * 300;
+			break;
+		}
+
 		/* Clocksource initialized ? */
 		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
 		    atomic_read(&watchdog_reset_pending)) {
@@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
 	 */
 	if (!timer_pending(&watchdog_timer)) {
-		watchdog_timer.expires += WATCHDOG_INTERVAL;
+		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
 		add_timer_on(&watchdog_timer, next_cpu);
 	}
 out:
@@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
 	watchdog_running = 0;
 }
 
-static inline void clocksource_reset_watchdog(void)
-{
-	struct clocksource *cs;
-
-	list_for_each_entry(cs, &watchdog_list, wd_list)
-		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
-}
-
 static void clocksource_resume_watchdog(void)
 {
 	atomic_inc(&watchdog_reset_pending);
-- 
2.34.1
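
For readers who want to follow the control flow outside the kernel, the rescheduling logic of the patch can be approximated in user space. This is only a rough sketch under stated assumptions, not kernel code: the HZ value, struct sim_clocksource, and watchdog_pass() are hypothetical names introduced for illustration, and the "watched" field merely stands in for the CLOCK_SOURCE_WATCHDOG flag. Only the three read statuses, the twice-a-second interval, and the HZ * 300 back-off come from the patch and its description.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tick rate, for illustration only; the kernel takes HZ from its config. */
#define HZ                250UL
/* The watchdog runs twice a second, per the commit message. */
#define WATCHDOG_INTERVAL (HZ >> 1)
/* The 5-minute suspension the patch adds (extra_wait = HZ * 300). */
#define WATCHDOG_SUSPEND  (HZ * 300)

/* The three outcomes of a watchdog read that the patch distinguishes. */
enum wd_read_status { WD_READ_SUCCESS, WD_READ_UNSTABLE, WD_READ_SKIP };

/* Hypothetical stand-in for one entry on the watchdog list. */
struct sim_clocksource {
	const char *name;
	int watched;   /* stands in for the CLOCK_SOURCE_WATCHDOG flag */
	int unstable;  /* set when the readout is judged unreliable */
};

/*
 * Hypothetical helper: run one watchdog pass over n clocksources, given
 * the read status observed for each, and return the next timer expiry.
 */
static unsigned long watchdog_pass(struct sim_clocksource *cs,
				   const enum wd_read_status *st,
				   size_t n, unsigned long expires)
{
	unsigned long extra_wait = 0;
	size_t i, j;

	assert(cs != NULL && st != NULL);

	for (i = 0; i < n; i++) {
		if (st[i] == WD_READ_UNSTABLE) {
			/* Readout unreliable: mark it and move to the next source. */
			cs[i].unstable = 1;
			continue;
		}
		if (st[i] == WD_READ_SKIP) {
			/*
			 * High read latency: clear the "watched" state of every
			 * source so stale readings are not compared after the
			 * pause, then stop this pass and back off.
			 */
			for (j = 0; j < n; j++)
				cs[j].watched = 0;
			extra_wait = WATCHDOG_SUSPEND;
			break;
		}
		/* Normal case: the source has a fresh baseline reading. */
		cs[i].watched = 1;
	}

	/* Same arithmetic as the patched timer re-arm. */
	return expires + WATCHDOG_INTERVAL + extra_wait;
}
```

Calling watchdog_pass() with all statuses set to WD_READ_SUCCESS advances the expiry by one WATCHDOG_INTERVAL, while a single WD_READ_SKIP clears every "watched" field and pushes the expiry out by an additional HZ * 300 ticks. The sketch omits locking, the CPU rotation done by add_timer_on(), and the actual skew comparison.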