Date: Fri, 21 Apr 2023 16:59:27 -0700
Subject: Re: [PATCH] hardlockup: detect hard lockups using secondary (buddy) cpus
To: Douglas Anderson, Petr Mladek, Andrew Morton
Cc: Lecopzer Chen, Daniel Thompson, Stephen Boyd, Chen-Yu Tsai,
    linux-arm-kernel@lists.infradead.org,
    kgdb-bugreport@lists.sourceforge.net, Marc Zyngier,
    linux-perf-users@vger.kernel.org, Mark Rutland, Masayoshi Mizuma,
    Will Deacon, ito-yuichi@fujitsu.com, Sumit Garg, Catalin Marinas,
    Colin Cross, Matthias Kaehlcke, Guenter Roeck, Tzung-Bi Shih,
    Alexander Potapenko, AngeloGioacchino Del Regno, Dan Williams,
    Geert Uytterhoeven, Ingo Molnar, John Ogness, Josh Poimboeuf,
    Juergen Gross, Kees Cook, Laurent Dufour, Liam Howlett, Marco Elver,
    Matthias Brugger, Michael Ellerman, Miguel Ojeda, Nathan Chancellor,
    Nick Desaulniers, "Paul E. McKenney", Peter Zijlstra,
    Rasmus Villemoes, Sami Tolvanen, Stefano Stabellini, Vlastimil Babka,
    Zhaoyang Huang, Zhen Lei, linux-kernel@vger.kernel.org,
    linux-mediatek@lists.infradead.org
From: Randy Dunlap
In-Reply-To: <20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid>

Hi--

On 4/21/23 15:53, Douglas Anderson wrote:
> From: Colin Cross
>
> Implement a hardlockup detector that can be enabled on SMP systems
> that don't have an arch provided one or one implemented atop perf by

Is that one or more?

> using interrupts on other cpus. Each cpu will use its softlockup
> hrtimer to check that the next cpu is processing hrtimer interrupts by
> verifying that a counter is increasing.
>
> NOTE: unlike the other hard lockup detectors, the buddy one can't
> easily provide a backtrace on the CPU that locked up. It relies on
> some other mechanism in the system to get information about the locked
> up CPUs. This could be support for NMI backtraces like [1], it could
> be a mechanism for printing the PC of locked CPUs like [2], or it
> could be something else.
>
> This style of hardlockup detector originated in some downstream
> Android trees and has been rebased on / carried in ChromeOS trees for
> quite a long time for use on arm and arm64 boards. Historically on
> these boards we've leveraged mechanism [2] to get information about
> hung CPUs, but we could move to [1].
>
> NOTE: the buddy system is not really useful to enable on any
> architectures that have a better mechanism. On arm64 folks have been
> trying to get a better mechanism for years and there has even been
> recent posts of patches adding support [3]. However, nothing about the
> buddy system is tied to arm64 and several archs (even arm32, where it
> was originally developed) could find it useful.
>
> [1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org
> [2] https://issuetracker.google.com/172213129
> [3] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/
>
> Signed-off-by: Colin Cross
> Signed-off-by: Matthias Kaehlcke
> Signed-off-by: Guenter Roeck
> Signed-off-by: Tzung-Bi Shih
> Signed-off-by: Douglas Anderson
> ---
> This patch has been rebased in ChromeOS kernel trees many times, and
> each time someone had to do work on it they added their
> Signed-off-by. I've included those here. I've also left the author as
> Colin Cross since the core code is still his.
>
> I'll also note that the CC list is pretty giant, but that's what
> get_maintainers came up with (plus a few other folks I thought would
> be interested). As far as I can tell, there's no true MAINTAINER
> listed for the existing watchdog code. Assuming people don't hate
> this, maybe it would go through Andrew Morton's tree?
>
>  include/linux/nmi.h         |  18 ++++-
>  kernel/Makefile             |   1 +
>  kernel/watchdog.c           |  24 ++++--
>  kernel/watchdog_buddy_cpu.c | 141 ++++++++++++++++++++++++++++++++++++
>  lib/Kconfig.debug           |  19 ++++-
>  5 files changed, 192 insertions(+), 11 deletions(-)
>  create mode 100644 kernel/watchdog_buddy_cpu.c
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 39d1d93164bd..9eb86bc9f5ee 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1036,6 +1036,9 @@ config HARDLOCKUP_DETECTOR_PERF
>  config HARDLOCKUP_CHECK_TIMESTAMP
>  	bool
>
> +config HARDLOCKUP_DETECTOR_CORE
> +	bool
> +
>  #
>  # arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
>  # lockup detector rather than the perf based detector.
> @@ -1045,6 +1048,7 @@ config HARDLOCKUP_DETECTOR
>  	depends on DEBUG_KERNEL && !S390
>  	depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH
>  	select LOCKUP_DETECTOR
> +	select HARDLOCKUP_DETECTOR_CORE
>  	select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF
>  	help
>  	  Say Y here to enable the kernel to act as a watchdog to detect
> @@ -1055,9 +1059,22 @@ config HARDLOCKUP_DETECTOR
>  	  chance to run. The current stack trace is displayed upon detection
>  	  and the system will stay locked up.
>
> +config HARDLOCKUP_DETECTOR_BUDDY_CPU
> +	bool "Buddy CPU hardlockup detector"
> +	depends on DEBUG_KERNEL && SMP
> +	depends on !HARDLOCKUP_DETECTOR && !HAVE_NMI_WATCHDOG
> +	depends on !S390
> +	select HARDLOCKUP_DETECTOR_CORE
> +	select SOFTLOCKUP_DETECTOR
> +	help
> +	  Say Y here to enable a hardlockup detector where CPUs check
> +	  each other for lockup. Each cpu uses its softlockup hrtimer

Preferably CPU

> +	  to check that the next cpu is processing hrtimer interrupts by

and CPU

> +	  verifying that a counter is increasing.
> +
>  config BOOTPARAM_HARDLOCKUP_PANIC
>  	bool "Panic (Reboot) On Hard Lockups"
> -	depends on HARDLOCKUP_DETECTOR
> +	depends on HARDLOCKUP_DETECTOR_CORE
>  	help
>  	  Say Y here to enable the kernel to panic on "hard lockups",
>  	  which are bugs that cause the kernel to loop in kernel

-- 
~Randy