Date: Mon, 16 Oct 2023 11:27:51 -0700
From: "Paul E. McKenney"
To: Leonardo Bras
Cc: Imran Khan, Peter Zijlstra, Valentin Schneider, Juergen Gross, Kernel Mailing List
Subject: Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long
Reply-To: paulmck@kernel.org

On Fri, Oct 13, 2023 at 12:26:22PM -0300, Leonardo Bras wrote:
> On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote:
> > > Hello Paul,
> > >
> > > On 6/10/2023 3:48 am, Paul E. McKenney wrote:
> > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> > > > temporarily, it usually gets released in a few seconds, and sometimes
> > > > up to one or two minutes.
> > > >
> > > > If the CSD lock stays stuck for more than several minutes, it never
> > > > seems to get unstuck, and gradually more and more things in the system
> > > > end up also getting stuck.
> > > >
> > > > In the latter case, we should just give up, so the system can dump out
> > > > a little more information about what went wrong, and, with panic_on_oops
> > > > and a kdump kernel loaded, dump a whole bunch more information about
> > > > what might have gone wrong.
> > > >
> > > > Question: should this have its own panic_on_ipistall switch in
> > > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> > > > way than via BUG_ON?
> > > >
> > > panic_on_ipistall (set to 1 by default) looks like the better option to me.
> > > For systems where such a delay is acceptable and the system can eventually
> > > get back to a sane state, this option (set to 0 after boot) would prevent
> > > crashing the system for apparently benign CSD hangs of long duration.
> >
> > Good point! How about like the following?
> >
> > 							Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3
> > Author: Rik van Riel
> > Date:   Mon Aug 21 16:04:09 2023 -0400
> >
> >     smp,csd: Throw an error if a CSD lock is stuck for too long
> >
> >     The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> >     temporarily, it usually gets released in a few seconds, and sometimes
> >     up to one or two minutes.
> >
> >     If the CSD lock stays stuck for more than several minutes, it never
> >     seems to get unstuck, and gradually more and more things in the system
> >     end up also getting stuck.
> >
> >     In the latter case, we should just give up, so the system can dump out
> >     a little more information about what went wrong, and, with panic_on_oops
> >     and a kdump kernel loaded, dump a whole bunch more information about what
> >     might have gone wrong. In addition, there is an smp.panic_on_ipistall
> >     kernel boot parameter that by default retains the old behavior, but when
> >     set enables the panic after the CSD lock has been stuck for more than
> >     five minutes.
> >
> >     [ paulmck: Apply Imran Khan feedback. ]
> >
> >     Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/
> >     Signed-off-by: Rik van Riel
> >     Signed-off-by: Paul E. McKenney
> >     Cc: Peter Zijlstra
> >     Cc: Valentin Schneider
> >     Cc: Juergen Gross
> >     Cc: Jonathan Corbet
> >     Cc: Randy Dunlap
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 0a1731a0f0ef..592935267ce2 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5858,6 +5858,11 @@
> >  			This feature may be more efficiently disabled
> >  			using the csdlock_debug- kernel parameter.
> >
> > +	smp.panic_on_ipistall= [KNL]
> > +			If a csd_lock_timeout extends for more than
> > +			five minutes, panic the system. By default, let
> > +			CSD-lock acquisition take as long as they take.
> > +
>
> It could be interesting to have it as an s64 parameter (in {milli,}seconds)
> instead of bool; this way the user could pick the time to wait before the
> panic happens. 0 or -1 could mean disabled.
>
> What do you think?
>
> Other than that,
> Reviewed-by: Leonardo Bras

Thank you for looking this over!

How about with the diff shown below, to be folded into the original?
I went with int instead of s64 because I am having some difficulty
imagining anyone specifying more than a 24-day timeout. ;-)

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ccb7621eff79..ea5ae9deb753 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5931,8 +5931,10 @@
 
 	smp.panic_on_ipistall= [KNL]
 			If a csd_lock_timeout extends for more than
-			five minutes, panic the system. By default, let
-			CSD-lock acquisition take as long as they take.
+			the specified number of milliseconds, panic the
+			system. By default, let CSD-lock acquisition
+			take as long as they take. Specifying 300,000
+			for this value provides a 5-minute timeout.
 
 	smsc-ircc2.nopnp	[HW] Don't use PNP to discover SMC devices
 	smsc-ircc2.ircc_cfg=	[HW] Device configuration I/O port
diff --git a/kernel/smp.c b/kernel/smp.c
index b6a0773a7015..d3ca47f32f38 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info);
 
 static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */
 module_param(csd_lock_timeout, ulong, 0444);
-static bool panic_on_ipistall;
-module_param(panic_on_ipistall, bool, 0444);
+static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for five minutes. */
+module_param(panic_on_ipistall, int, 0444);
 
 static atomic_t csd_bug_count = ATOMIC_INIT(0);
 
@@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	 * to become unstuck. Use a signed comparison to avoid triggering
 	 * on underflows when the TSC is out of sync between sockets.
 	 */
-	BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL);
+	BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC));
 	if (cpu_cur_csd && csd != cpu_cur_csd) {
 		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
 			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
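
For anyone wanting to double-check the unit arithmetic behind the millisecond parameter, here is a minimal userspace sketch. It is not part of the patch. The helper name ipistall_timed_out() is hypothetical, the two macros are local stand-ins for the kernel constants of the same names, and the 24-day check assumes a 32-bit int. It shows that 300,000 ms works out to five minutes, that INT_MAX milliseconds is a bit under 25 days, and why the timeout is widened to 64 bits before being scaled to nanoseconds.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel constants of the same names. */
#define MSEC_PER_SEC  1000LL
#define NSEC_PER_MSEC 1000000LL

/* 300,000 ms is 300 s, which is five minutes. */
static_assert(300000 / MSEC_PER_SEC / 60 == 5, "300000 ms is 5 minutes");

/* Assuming a 32-bit int, INT_MAX ms is 24 whole days plus a fraction. */
static_assert(INT_MAX / MSEC_PER_SEC / 86400 == 24, "INT_MAX ms is about 24 days");

/*
 * Hypothetical helper mirroring the BUG_ON() condition in the diff:
 * a timeout of zero or less disables the check. Otherwise the timeout
 * is cast to 64 bits before the multiply, because INT_MAX * 1000000
 * would overflow a 32-bit int. The comparison is signed, so a negative
 * delta (TSC out of sync between sockets) never triggers.
 */
static inline bool ipistall_timed_out(int timeout_ms, int64_t ts_delta_ns)
{
	return timeout_ms > 0 &&
	       ts_delta_ns > (int64_t)timeout_ms * NSEC_PER_MSEC;
}
```

With these definitions, ipistall_timed_out(300000, delta) becomes true only once delta exceeds 300,000,000,000 ns, the same threshold the old hard-coded BUG_ON() used, and a timeout of 0 or -1 never fires.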