Received: by 2002:a05:6a10:5bc5:0:0:0:0 with SMTP id os5csp1341229pxb; Wed, 20 Oct 2021 03:11:30 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyImjGbtoDqjZcIUQjK9Ik+qAl3K2MiQISZhu+MgPfDBf4SzPpW1VMXCTsncFThQ43FZhej X-Received: by 2002:a17:90b:3a88:: with SMTP id om8mr6152504pjb.164.1634724689962; Wed, 20 Oct 2021 03:11:29 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1634724689; cv=none; d=google.com; s=arc-20160816; b=T4VclTd+0Wen4JYSGeereHal5eFyYopAF+i+XhkE8pczmYAftEmHVxyHHQqX+12jMn v0z5Y7CxGjE+CWZmZZsS0OstSiNmsL64VSx5/VkC1G4NmUAvyn0ac92op52I3pLzvhPN m8AHETLOIFM00sgVA/DE0CC/Rkym/rhQLaohf6ov1G/w9zJ2RAFVYGcjhnFJQLc8zOYM hbuxHkGfWDMz4ExV7F1AYZvo+wof6ZByLhagscWxvDP3WgfwlElu/eOyhX8y211RjQYg grpM3KvdkDH3apGXLEX/oF2BnsDqqBDTd3p81/Q2R3yy9WYaEQv4L432T5tEZFkMjvHC Y1tw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=7zPb9FQ20/SoAUVK7Nq9S40W9S5lPi5XKxOe4b1japQ=; b=hP6M+VF7B7p/Ct6Wg6wFiAoWFgTEuZPAox/yz55k8QTO/qsONnfGqXjjuzhfYtUIcA m53KX6hnXWM+GBiKeEtNcg2y/sYrBeFOKiXyJU03evdjA0PddDk1n690TJmgoaj9y65x 12Bf9AXQtq0PkPUWStQDsV1NewvSn4U4phWX9RB9vhcwtzSsFv13OTfSj3oATKVtTCp/ FwzUlB3JU6YEF6KTzS3Jf4fx3cWyBCUHbat4/665b6lukUuB9P/21bUP3NNhiAOrX0+3 YGl7LK5TxEZidl9UU2UL8B3KvjGUnyvcO4Plb8uOgK1xl46T3kHtZyHnlsG3WsOEAWjH MRVQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=BIhJQu4I; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id u3si2423040plq.96.2021.10.20.03.11.17; Wed, 20 Oct 2021 03:11:29 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=BIhJQu4I; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230017AbhJTKLQ (ORCPT + 99 others); Wed, 20 Oct 2021 06:11:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47816 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229878AbhJTKLO (ORCPT ); Wed, 20 Oct 2021 06:11:14 -0400 Received: from mail-qv1-xf35.google.com (mail-qv1-xf35.google.com [IPv6:2607:f8b0:4864:20::f35]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E92B4C061746 for ; Wed, 20 Oct 2021 03:09:00 -0700 (PDT) Received: by mail-qv1-xf35.google.com with SMTP id z15so1735630qvj.7 for ; Wed, 20 Oct 2021 03:09:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=7zPb9FQ20/SoAUVK7Nq9S40W9S5lPi5XKxOe4b1japQ=; b=BIhJQu4IfQHH3q2ARLadVGMsQpFmI8VifnbOEe1Fmc/AB2QJC9AZnDTkz02Quvc0c9 unB9dZ+SHOHt2z7OfnRPSoU/EYNTjV8ywHJdabQlwAz/SC19QPoA4FQwMKA4LQK+LXj5 OIfP6U/vawOHxt5PQWHHXXa2Bu7qZ9YeTb7tL1rqRAoGOnzM/CYVMbZxIxO+9v/aWJ6i ekuSF5b5q3eSbzya25tsWk+YGjKY6YTU5Q7tXquyhArzHHMO5D71/ygDfxdWAQwLFXXX c6q9ICrMwJtK2sntuWQly3NWEIFELdB9aTQ+FKhhz+G9Ga/V/XANTz+rcUfP5fXIe4dt jDEA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=7zPb9FQ20/SoAUVK7Nq9S40W9S5lPi5XKxOe4b1japQ=; b=kcl+4t8GSbPbFRrMnG/atzppZ3vic2qPMFD2Yl2JAW/1z8pFe+tb0zt7cF30g0FyfU dXdsllYySBhWw4lCCRPMLIoTwuq2/oozmzZdbXLP2GSpYrYkTISsdgh5q1Eiq/kevRq/ j/tscKnacEy+F4asFblOLCNWwyMM3YMoX7YSov113/szyLYZ34IZ4P2r/JXBHDxY+mvk 5cuLd+b/X44WJm05d7NHUbBqJ01nXtUUDvy+fmmymhOpxa0pLA+CpLO7fwv3Bu/+ugqP ZfgZzXsjX97CakRmauUX7yvdikNCtVXrJt/Ak83gSHNSk84M2SyjxxsLfofXjYgfskzH 0maQ== X-Gm-Message-State: AOAM532nXOCqO29PXZJV2p34x77by8k/SS99SNWGPbHIZ2CG/ANDJh64 +HQ203wlZr3utZr5SL0j1ylmjS4xXUij+cNgMSM= X-Received: by 2002:a05:6214:224d:: with SMTP id c13mr5201789qvc.37.1634724540084; Wed, 20 Oct 2021 03:09:00 -0700 (PDT) MIME-Version: 1.0 References: <20211008080305.13401-1-yanghui.def@bytedance.com> <6b715fb7-9850-04f3-4ab8-1a2a8a2cdfbf@gmail.com> <95c1a031-6751-f90f-d003-b74fbec0e9d8@gmail.com> <61381153-634e-489b-848f-7077ce46049a@bytedance.com> In-Reply-To: From: Luming Yu Date: Wed, 20 Oct 2021 06:09:58 -0400 Message-ID: Subject: Re: [PATCH] Clocksource: Avoid misjudgment of clocksource To: John Stultz Cc: yanghui , Thomas Gleixner , Stephen Boyd , lkml , Shaohua Li , paulmck@kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 19, 2021 at 1:04 AM John Stultz wrote: > > On Mon, Oct 18, 2021 at 9:14 PM yanghui wrote= : > > =E5=9C=A8 2021/10/19 =E4=B8=8A=E5=8D=8812:14, John Stultz =E5=86=99=E9= =81=93: > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu wrote: > > >> John Stultz wrote on 2021/10/12 13:29: > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu wro= te: > > >>>> John Stultz wrote on 2021/10/12 12:52 =E4=B8=8B=E5=8D=88: > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu wro= te: > > >>>> If we record the watchdog's start_time in clocksource_start_watchd= og(), and then > > >>>> when we verify cycles in clocksource_watchdog(), check whether the= clocksource > > >>>> watchdog is blocked. Due to MSB verification, if the blocked time = is greater than > > >>>> half of the watchdog timer max_cycles, then we can safely ignore t= he current > > >>>> verification? Do you think this idea is okay? > > >>> > > >>> I can't say I totally understand the idea. Maybe could you clarify = with a patch? > > >>> > > >> > > >> Sorry, it looks almost as follows: > > >> > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > > >> index b8a14d2..87f3b67 100644 > > >> --- a/kernel/time/clocksource.c > > >> +++ b/kernel/time/clocksource.c > > >> @@ -119,6 +119,7 @@ > > >> static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > > >> static DEFINE_SPINLOCK(watchdog_lock); > > >> static int watchdog_running; > > >> +static unsigned long watchdog_start_time; > > >> static atomic_t watchdog_reset_pending; > > >> > > >> static inline void clocksource_watchdog_lock(unsigned long *flags) > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_li= st *unused) > > >> int next_cpu, reset_pending; > > >> int64_t wd_nsec, cs_nsec; > > >> struct clocksource *cs; > > >> + unsigned long max_jiffies; > > >> u32 md; > > >> > > >> spin_lock(&watchdog_lock); > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_l= ist *unused) > > >> if (atomic_read(&watchdog_reset_pending)) > > >> continue; > > >> > > >> + max_jiffies =3D nsecs_to_jiffies(cs->max_idle_ns); > > >> + if (time_is_before_jiffies(watchdog_start_time + max= _jiffies)) > > >> + continue; > > >> + > > > > > > Sorry, what is the benefit of using jiffies here? Jiffies are > > > updated by counting the number of tick intervals on the current > > > clocksource. > > > > > > This seems like circular logic, where we're trying to judge the > > > current clocksource by using something we derived from the current > > > clocksource. > > > That's why the watchdog clocksource is important, as it's supposed to > > > be a separate counter that is more reliable (but likely slower) then > > > the preferred clocksource. > > > > > > So I'm not really sure how this helps. > > > > > > The earlier patch by yanghui at least used the watchdog interval to > > > decide if the watchdog timer had expired late. Which seemed > > > reasonable, but I thought it might be helpful to add some sort of a > > > counter so if the case is happening repeatedly (timers constantly > > > being delayed) we have a better signal that the watchdog and current > > > clocksource are out of sync. Because again, timers are fired based o= n > > > > I think only have a signal ls not enough. we need to prevent > > clocksource from being incorrectly switched. > > Right, but we also have to ensure that we also properly disqualify > clocksources that are misbehaving. > > In the case that the current clocksource is running very slow (imagine > old TSCs that lowered freq with cpufreq), then system time slows down, > so timers fire late. > So it would constantly seem like the irqs are being delayed, so with > your logic we would not disqualify a clearly malfunctioning > clocksource.. > > > The Timer callback function clocksource_watchdog() is executed in the > > context of softirq(run_timer_softirq()). So if softirq is disabled for > > long time(One situation is long time softlockup), clocksource_watchdog(= ) > > will be delay executed. > > Yes. The reality is that timers are often spuriously delayed. We don't > want a short burst of timer misbehavior to disqualify a good > clocksource. > > But the problem is that this situation and the one above (with the > freq changing TSC), will look exactly the same. > > So having a situation where if the watchdog clocksource thinks too > much time has passed between watchdog timers, we can skip judgement, > assuming its a spurious delay. But I think we need to keep a counter > so that if this happens 3-5 times in a row, we stop ignoring the > misbehavior and judge the current clocksource, as it may be running > slowly. > > > > > > I think it will be better to add this to my patch: > > /* > > * Interval: 0.5sec. > > - * MaxInterval: 1s. > > + * MaxInterval: 20s. > > */ > > #define WATCHDOG_INTERVAL (HZ >> 1) > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC) > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC) > > > > Some watchdog counters wrap within 20 seconds, so I don't think this > is a good idea. > > The other proposal to calculate the error rate, rather than a fixed > error boundary might be useful too, as if the current clocksource and > watchdog are close, a long timer delay won't disqualify them if we > scale the error bounds to be within an given error rate. In most of tsc unstable trouble shooting on modern servers we experienced, it usually ends up in a false alarm triggered by the clock source watchdog for tsc. I think Paul has a proposal to make a clock source watchdog to be more intelligent. Its job is to find a real problem instead of causing a problem. so disabling it for known good-tsc might be a reasonable good idea that can save manpower for other more valuable problems to solve, or at least make it statistically a problem less chance to happen. > > thanks > -john