Date: Tue, 10 Oct 2023 06:47:27 -0700
From: "Paul E. McKenney"
To: Imran Khan
Cc: Peter Zijlstra, Valentin Schneider, Juergen Gross, Leonardo Bras, Kernel Mailing List
Subject: Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long
Reply-To: paulmck@kernel.org
In-Reply-To: <89062478-fa97-c265-3b18-de55eeae3c1f@oracle.com>

On Tue, Oct 10, 2023 at 03:58:43PM +1100, Imran Khan wrote:
> Hello Paul,
>
> On 10/10/2023 3:39 am, Paul E. McKenney wrote:
> > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote:
> >> Hello Paul,
> >>
> >> On 6/10/2023 3:48 am, Paul E. McKenney wrote:
> >>> The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> >>> temporarily, it usually gets released in a few seconds, and sometimes
> >>> up to one or two minutes.
> >>>
> >>> If the CSD lock stays stuck for more than several minutes, it never
> >>> seems to get unstuck, and gradually more and more things in the system
> >>> end up also getting stuck.
> >>>
> >>> In the latter case, we should just give up, so the system can dump out
> >>> a little more information about what went wrong, and, with panic_on_oops
> >>> and a kdump kernel loaded, dump a whole bunch more information about
> >>> what might have gone wrong.
> >>>
> >>> Question: should this have its own panic_on_ipistall switch in
> >>> /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> >>> way than via BUG_ON?
> >>>
> >> panic_on_ipistall (set to 1 by default) looks better option to me. For systems
> >> where such delay is acceptable and system can eventually get back to sane state,
> >> this option (set to 0 after boot) would prevent crashing the system for
> >> apparently benign CSD hangs of long duration.
> >
> > Good point! How about like the following?
> >
>
> Yes, this looks good.
> Just realized that keeping panic_on_ipistall set by default (as suggested earlier
> by me) would not follow convention of other similar switches like
> hard/softlockup_panic etc. which are 0 by default.
> So default value of 0 looks better choice for panic_on_ipistall as well.

Plus if a new option is set by default and causes problems, people get
(understandably) annoyed. ;-)

> > ------------------------------------------------------------------------
> >
> > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3
> > Author: Rik van Riel
> > Date: Mon Aug 21 16:04:09 2023 -0400
> >
> >     smp,csd: Throw an error if a CSD lock is stuck for too long
> >
> >     The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> >     temporarily, it usually gets released in a few seconds, and sometimes
> >     up to one or two minutes.
> >
> >     If the CSD lock stays stuck for more than several minutes, it never
> >     seems to get unstuck, and gradually more and more things in the system
> >     end up also getting stuck.
> >
> >     In the latter case, we should just give up, so the system can dump out
> >     a little more information about what went wrong, and, with panic_on_oops
> >     and a kdump kernel loaded, dump a whole bunch more information about what
> >     might have gone wrong. In addition, there is an smp.panic_on_ipistall
> >     kernel boot parameter that by default retains the old behavior, but when
> >     set enables the panic after the CSD lock has been stuck for more than
> >     five minutes.
> >
> >     [ paulmck: Apply Imran Khan feedback. ]
> >
> >     Link: https://urldefense.com/v3/__https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/__;!!ACWV5N9M2RV99hQ!PDFpjgGTCPjxqCyusua5IZWkvdWEMf51igFDc-yb9cVK9PYr7FpEE1oGpWp09YK4lc15C2taMdcuEOqyH8k$
> >     Signed-off-by: Rik van Riel
> >     Signed-off-by: Paul E. McKenney
> >     Cc: Peter Zijlstra
> >     Cc: Valentin Schneider
> >     Cc: Juergen Gross
> >     Cc: Jonathan Corbet
> >     Cc: Randy Dunlap
> >
> Reviewed-by: Imran Khan

Thank you, and I will apply this on my next rebase.

							Thanx, Paul

> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 0a1731a0f0ef..592935267ce2 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5858,6 +5858,11 @@
> >  			This feature may be more efficiently disabled
> >  			using the csdlock_debug- kernel parameter.
> >
> > +	smp.panic_on_ipistall= [KNL]
> > +			If a csd_lock_timeout extends for more than
> > +			five minutes, panic the system. By default, let
> > +			CSD-lock acquisition take as long as they take.
> > +
> >  	smsc-ircc2.nopnp	[HW] Don't use PNP to discover SMC devices
> >  	smsc-ircc2.ircc_cfg=	[HW] Device configuration I/O port
> >  	smsc-ircc2.ircc_sir=	[HW] SIR base I/O port
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 8455a53465af..b6a0773a7015 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -170,6 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info);
> >
> >  static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */
> >  module_param(csd_lock_timeout, ulong, 0444);
> > +static bool panic_on_ipistall;
> > +module_param(panic_on_ipistall, bool, 0444);
> >
> >  static atomic_t csd_bug_count = ATOMIC_INIT(0);
> >
> > @@ -230,6 +232,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> >  	}
> >
> >  	ts2 = sched_clock();
> > +	/* How long since we last checked for a stuck CSD lock.*/
> >  	ts_delta = ts2 - *ts1;
> >  	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
> >  		return false;
> > @@ -243,9 +246,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> >  	else
> >  		cpux = cpu;
> >  	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
> > +	/* How long since this CSD lock was stuck. */
> > +	ts_delta = ts2 - ts0;
> >  	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
> > -		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
> > +		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
> >  		 cpu, csd->func, csd->info);
> > +	/*
> > +	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
> > +	 * to become unstuck. Use a signed comparison to avoid triggering
> > +	 * on underflows when the TSC is out of sync between sockets.
> > +	 */
> > +	BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL);
> >  	if (cpu_cur_csd && csd != cpu_cur_csd) {
> >  		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
> >  			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),