Received: by 2002:a05:7412:d8a:b0:e2:908c:2ebd with SMTP id b10csp2775660rdg; Mon, 16 Oct 2023 14:48:12 -0700 (PDT) X-Google-Smtp-Source: AGHT+IELyKLnCO0QsRB3qezIpu5iu4/T6ExHzHAq3b2Q6/zdjHKJPvMJayiawbxJdlobiPm97FAn X-Received: by 2002:a05:6870:51:b0:1e9:858e:ff23 with SMTP id 17-20020a056870005100b001e9858eff23mr388534oaz.55.1697492891873; Mon, 16 Oct 2023 14:48:11 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697492891; cv=none; d=google.com; s=arc-20160816; b=nV3gBOD+GK+4E0OAGLcr2v0P8YFFSkc1fCAWroukJYtekYJTBW+8jnnQLTfMG4DDyn qtCbjaoGaDuMIOO3SKMWfs5swApb7YTHk5zSO2QLF3uTrC1P++D0PSA5pZdu2KmZ4ZpQ cfdaJCoBRA5iIzW6H1OM1ofYQ4LQY7mKYU7+Cy26Egc8RwkarRI6i45ohbtj80FiVLqp 99kIhztuNMgLcAKSz3SWm1qfBaU3Md7uQrmor7lbVw0Mk1IK06B1Xa0oWQVE25+m0HgJ nfztAUBTJBu6CWir5LRnPwYwWbSddjN0ALI0YOc0UwclKk+TIp8DAB1ksc3dbThPbgGK 4twQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:references:in-reply-to:subject:cc:to:dkim-signature :dkim-signature:from; bh=f59bG+a2eADGxW7wAhinP+pAokYq41IIQASxg536iqM=; fh=47JzpF+58ywdmX5qGAtwVXYe0Q+OMmtzSSUA+V088dA=; b=A0pHUrMSgr9phxP/6Xk34/usVb/cU9hBCcFLY7WLJchZa5TayYF25g/H6FFm8lqWOv S5OvFm8h4IFRnNjeRt9oc65kSmq++EkzhTIU74RfjCilHQkbCjzoxOWOOPN76AF1o16s TnqiSS4rGuyXIYlun4R3018yjV0+Z8OJtGpfVEyM0iumk/5Ugc0QAEz2a8eTk/GxAvYc 2Rco5tO1X9Ar4SDAKoEcty8uiWDuMIeUJ8ljHckim8hQDq7nrZUIBFMo7ZZfOLY3jIcV 9DKZoM9fAtAowRCnYRJuBLGIYN6zCXLXjzpx3WPajkhM6ZTpa8+iUG7WsnS+MseuzMWp qaVg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=4b0hZ1jw; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e header.b="KL/R+mWD"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Return-Path: Received: from agentk.vger.email (agentk.vger.email. [23.128.96.32]) by mx.google.com with ESMTPS id h4-20020a655184000000b00563a0bacbb4si223200pgq.694.2023.10.16.14.48.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 16 Oct 2023 14:48:11 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) client-ip=23.128.96.32; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=4b0hZ1jw; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e header.b="KL/R+mWD"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id 7026480944EE; Mon, 16 Oct 2023 14:48:08 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232896AbjJPVsB (ORCPT + 99 others); Mon, 16 Oct 2023 17:48:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38482 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231955AbjJPVsA (ORCPT ); Mon, 16 Oct 2023 17:48:00 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2A99EA1 for ; Mon, 16 Oct 2023 14:47:58 -0700 (PDT) From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1697492876; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=f59bG+a2eADGxW7wAhinP+pAokYq41IIQASxg536iqM=; b=4b0hZ1jwsVNGf8ZQPQlclMglY6PnHSAkZrLg2te83n/7CXJMIrXZ8JrGVYeHG8Ic2ti78j coC+APj/tRVbMCCreHh7mTGQIqyNk2JZigCNy4xORZkYQneQIYu3zEbiwex9k75KQ2bzBZ EJmptX+xvD7mgT4rcB2aGE2F4JziCcx8O5nKncfdWVqfTM6aj3D36UE110LvW91CMr9c5h 0mQWB7MD52MFOj/td2ZYClKG5Gg1gXrc8ez1kApqMtCBhmJGFkage6KcCM65GWeNJ0YuN6 vaQKNhSkjYogDnFE4CW1k7QQDNzKajhMDgISoGGJalJDoKeNFIxoeIERbZKarQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1697492876; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=f59bG+a2eADGxW7wAhinP+pAokYq41IIQASxg536iqM=; b=KL/R+mWD7BMAFG9eXLCF0MlASOO4RHg6RcR30TofofC7Hy9oo9/jI6MBnnqjeDvUYX3Cj5 Mj5+6SWTbnm12hBg== To: John Stultz , Tetsuo Handa Cc: Stephen Boyd , LKML , Sebastian Andrzej Siewior , "Paul E. McKenney" , x86@kernel.org Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock. In-Reply-To: References: <80ff5036-8449-44a6-ba2f-0130d3be6b57@I-love.SAKURA.ne.jp> Date: Mon, 16 Oct 2023 23:47:55 +0200 Message-ID: <878r826xys.ffs@tglx> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Mon, 16 Oct 2023 14:48:08 -0700 (PDT) On Mon, Oct 16 2023 at 10:46, John Stultz wrote: > On Fri, Oct 13, 2023 at 7:51=E2=80=AFAM Tetsuo Handa > wrote: >> >> Lockdep found that spin_lock(&watchdog_lock) from call_timer_fn() >> is not safe. Use spin_lock_irqsave(&watchdog_lock, flags) instead. >> >> [ 0.378387] TSC synchronization [CPU#0 -> CPU#1]: >> [ 0.378387] Measured 55060 cycles TSC warp between CPUs, turning off = TSC clock. Please trim splats down to the relevant content. Timestamps and the bulk of the backtraces are completely uninteresting. Something like this is completely sufficient and readable: tsc: Marking TSC unstable due to check_tsc_sync_source failed =20=20 WARNING: inconsistent lock state inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0= x23/0x5a0 {IN-HARDIRQ-W} state was registered at: _raw_spin_lock_irqsave+0x3f/0x60 clocksource_mark_unstable+0x1b/0x90 mark_tsc_unstable+0x41/0x50 check_tsc_sync_source+0x14f/0x180 sysvec_call_function_single+0x69/0x90 Possible unsafe locking scenario: lock(watchdog_lock); lock(watchdog_lock); stack backtrace: _raw_spin_lock+0x30/0x40 clocksource_watchdog+0x23/0x5a0 run_timer_softirq+0x2a/0x50 sysvec_apic_timer_interrupt+0x6e/0x90 20 lines of useful information vs. ~100 lines containing mostly noise. No? Documentation/process/* clearly asks for this. >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c >> index c108ed8a9804..4e8fc0a5ca9d 100644 >> --- a/kernel/time/clocksource.c >> +++ b/kernel/time/clocksource.c >> @@ -405,8 +405,9 @@ static void clocksource_watchdog(struct timer_list *= unused) >> enum wd_read_status read_ret; >> unsigned long extra_wait =3D 0; >> u32 md; >> + unsigned long flags; https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variab= le-declarations >> - spin_lock(&watchdog_lock); >> + spin_lock_irqsave(&watchdog_lock, flags); That looks obvious, but it's just making the problem worse. See below. > At first glance this looks sane to me. Emphasis on "first glance" :) > The only thing that gives me pause is that I'm a little shocked this > hasn't been caught earlier, and I don't right off see recent changes > that would have caused this to trip. But Thomas may have more insight. From looking at the (for you neatly trimmed) backtrace: {IN-HARDIRQ-W} state was registered at: _raw_spin_lock_irqsave+0x3f/0x60 clocksource_mark_unstable+0x1b/0x90 mark_tsc_unstable+0x41/0x50 check_tsc_sync_source+0x14f/0x180 sysvec_call_function_single+0x69/0x90 it's bloody obvious which change caused that: 9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based") Even if you don't recognize it immediately then: git grep -l check_tsc_sync_source | xargs git blame | grep check_tsc_sync= _source is not really rocket science either, right? Vs. you being shocked: As long as that synchronization function does not try to mark the TSC unstable, lockdep won't notice. Definitely surprising that this took a few months to surface. But that's not what shocks me... That spinlock_irqsave() conversion is not a solution. Why? If that happens on RT then this is even worse. mark_tsc_unstable() is invoked from hard interrupt context via smp function call which is a NONO due to watchdog_lock being a spinlock and the _irqsave() change won't help anything. Yes, I broke it and did not think about that particular side effect when moving check_tsc_sync_source() from thread to smp function call context. But that's just a distraction. What's even worse is that the patch blindly slaps spinlock_irqsave() at the problem without actually looking what the function does and why it does not use the irqsave variant in the first place. The watchdog function deliberately runs with interrupts enabled and only explicitly disables them for the actual readout because there is no point in delaying interrupts for this horrorshow accross the whole function. But the explicit local_irq_disable()/enable() pair around the readout means that the patch above is obviously broken: spin_lock_irqsave(&watchdog_lock, flags); cs_watchdog_read() local_irq_disable(); ... local_irq_enable(); <--- FAIL Lockdep should have yelled at that too, but even if it failed to do so then it's still very obviously wrong. Something like the uncompiled/untested below should cure it for real. It really does not matter whether the TSC unstable event happens a bit later. The system is unhappy no matter what. That said, this whole clocksource watchdog mess wants a proper overhaul. It has become a pile of warts and duct tape by now and after staring at it long enough there is no real reason to run it in a timer callback anymore. It just can move into delayed work and the whole locking problem can be reduced to the clocksource_mutex and some well thought out atomic operations to handle the mark unstable case. But that's a different story and not relevant for curing the problem at hand. Thanks, tglx --- --- a/arch/x86/kernel/tsc_sync.c +++ b/arch/x86/kernel/tsc_sync.c @@ -15,6 +15,7 @@ * ( The serial nature of the boot logic and the CPU hotplug lock * protects against more than 2 CPUs entering this code. ) */ +#include #include #include #include @@ -342,6 +343,13 @@ static inline unsigned int loop_timeout( return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ? 2 : 20; } =20 +static void tsc_sync_mark_tsc_unstable(struct work_struct *work) +{ + mark_tsc_unstable("check_tsc_sync_source failed"); +} + +static DECLARE_WORK(tsc_sync_work, tsc_sync_mark_tsc_unstable); + /* * The freshly booted CPU initiates this via an async SMP function call. */ @@ -395,7 +403,7 @@ static void check_tsc_sync_source(void * "turning off TSC clock.\n", max_warp); if (random_warps) pr_warn("TSC warped randomly between CPUs\n"); - mark_tsc_unstable("check_tsc_sync_source failed"); + schedule_work(&tsc_sync_work); } =20 /*