Date: Mon, 9 Oct 2023 09:39:38 -0700
From: "Paul E. McKenney"
To: Imran Khan
Cc: Peter Zijlstra, Valentin Schneider, Juergen Gross, Leonardo Bras, Kernel Mailing List
Subject: Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long
Reply-To: paulmck@kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote:
> Hello Paul,
>
> On 6/10/2023 3:48 am, Paul E. McKenney wrote:
> > The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> > temporarily, it usually gets released in a few seconds, and sometimes
> > up to one or two minutes.
> >
> > If the CSD lock stays stuck for more than several minutes, it never
> > seems to get unstuck, and gradually more and more things in the system
> > end up also getting stuck.
> >
> > In the latter case, we should just give up, so the system can dump out
> > a little more information about what went wrong, and, with panic_on_oops
> > and a kdump kernel loaded, dump a whole bunch more information about
> > what might have gone wrong.
> >
> > Question: should this have its own panic_on_ipistall switch in
> > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> > way than via BUG_ON?
> >
> panic_on_ipistall (set to 1 by default) looks better option to me. For systems
> where such delay is acceptable and system can eventually get back to sane state,
> this option (set to 0 after boot) would prevent crashing the system for
> apparently benign CSD hangs of long duration.

Good point!

How about like the following?

							Thanx, Paul

------------------------------------------------------------------------

commit 6bcf3786291b86f13b3e13d51e998737a8009ec3
Author: Rik van Riel
Date:   Mon Aug 21 16:04:09 2023 -0400

    smp,csd: Throw an error if a CSD lock is stuck for too long

    The CSD lock seems to get stuck in 2 "modes". When it gets stuck
    temporarily, it usually gets released in a few seconds, and sometimes
    up to one or two minutes.

    If the CSD lock stays stuck for more than several minutes, it never
    seems to get unstuck, and gradually more and more things in the system
    end up also getting stuck.

    In the latter case, we should just give up, so the system can dump out
    a little more information about what went wrong, and, with panic_on_oops
    and a kdump kernel loaded, dump a whole bunch more information about
    what might have gone wrong.

    In addition, there is an smp.panic_on_ipistall kernel boot parameter
    that by default retains the old behavior, but when set enables the
    panic after the CSD lock has been stuck for more than five minutes.

    [ paulmck: Apply Imran Khan feedback. ]

    Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/
    Signed-off-by: Rik van Riel
    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Valentin Schneider
    Cc: Juergen Gross
    Cc: Jonathan Corbet
    Cc: Randy Dunlap

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a1731a0f0ef..592935267ce2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5858,6 +5858,11 @@
 			This feature may be more efficiently disabled
 			using the csdlock_debug- kernel parameter.
 
+	smp.panic_on_ipistall= [KNL]
+			If a csd_lock_timeout extends for more than
+			five minutes, panic the system. By default, let
+			CSD-lock acquisition take as long as they take.
+
 	smsc-ircc2.nopnp	[HW] Don't use PNP to discover SMC devices
 	smsc-ircc2.ircc_cfg=	[HW] Device configuration I/O port
 	smsc-ircc2.ircc_sir=	[HW] SIR base I/O port
diff --git a/kernel/smp.c b/kernel/smp.c
index 8455a53465af..b6a0773a7015 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -170,6 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info);
 
 static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */
 module_param(csd_lock_timeout, ulong, 0444);
+static bool panic_on_ipistall;
+module_param(panic_on_ipistall, bool, 0444);
 
 static atomic_t csd_bug_count = ATOMIC_INIT(0);
 
@@ -230,6 +232,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	}
 
 	ts2 = sched_clock();
+	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
 		return false;
@@ -243,9 +246,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	else
 		cpux = cpu;
 	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
+	/* How long since this CSD lock was stuck. */
+	ts_delta = ts2 - ts0;
 	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
-		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
 		 cpu, csd->func, csd->info);
+	/*
+	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
+	 * to become unstuck. Use a signed comparison to avoid triggering
+	 * on underflows when the TSC is out of sync between sockets.
+	 */
+	BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL);
 	if (cpu_cur_csd && csd != cpu_cur_csd) {
 		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
 			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),