Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp3032209imu; Wed, 7 Nov 2018 04:00:34 -0800 (PST) X-Google-Smtp-Source: AJdET5fbC7m4Q3u5sOdQMhCN9Ta0b/thUZjsnSX8tFN8FIjgyP+gj7zRaxhB0Kz7h+JiZBG8X+ja X-Received: by 2002:a63:741:: with SMTP id 62mr1276765pgh.352.1541592034602; Wed, 07 Nov 2018 04:00:34 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1541592034; cv=none; d=google.com; s=arc-20160816; b=Vt1Oe6h0Z13W+yPHl+jny3H0ebKB/zta3MgKawPGSvu3QQ29ff29yboWTnU9deHRS0 srEfNXQJoyyasy5PNN/uoaWxC03tUm85lO/UvtzAaXCLM/hcWALfP0hEUDaSQ4j5nTvo TbUB/WTdOOJp3WG3WVconplz2L3haB1jiKElXAJsENDLCjbWWrD80dMCI9kNMM2n7ziY A9EaOYwzxc4q8+Mf6vOLreqa3S0auvAknbMnY7nqbIyR2vM8jAHGVMlYzNYUB0igAdfH OfD8k2yJerkss0n/ht8Hx6W3uATVuA3gvBMp3gAr7tncJC2Lf6/JSBWwEj7FPdulXv9P ar3g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=j01Qxhiz39nito7ewQN1W1FtxPRlmQuaymrKYUa51rI=; b=jaNcY2EyFjftPexq+KneyfuP0j00Ln8BwNUrCal1DiLMaEpG+fczEwxIcLsU6Gkihi dZyodYQnNBkz2lPDWHTRrqtqyyHoqztoMI/obrT2LTzpqN/YEySp94M23JJ4W5KVMV4h XQLzZMjAMJslI19pJueTbF8j0TocSwaP9XU6ZaPjK15klPGzRa3tghQUEmrHzUNhPlTq nbSpHhyaoItZ7mD27lJWbuL325MdbVU3jol0Zfh3Fo+pQ33jd0kpAluOx5wku6DfZKCk mcND4KW3vdbUTtfSD/FZW3bTYqAAZOBy9o7CLSoTnmMsqZVVRytc3KYLI/yJo2TY7EQH RqOw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=ILCVg5bV; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d8-v6si410995pll.220.2018.11.07.04.00.18; Wed, 07 Nov 2018 04:00:34 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=ILCVg5bV; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726525AbeKGVaB (ORCPT + 99 others); Wed, 7 Nov 2018 16:30:01 -0500 Received: from mail-wr1-f67.google.com ([209.85.221.67]:35413 "EHLO mail-wr1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726241AbeKGVaB (ORCPT ); Wed, 7 Nov 2018 16:30:01 -0500 Received: by mail-wr1-f67.google.com with SMTP id z16-v6so17121098wrv.2 for ; Wed, 07 Nov 2018 03:59:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=j01Qxhiz39nito7ewQN1W1FtxPRlmQuaymrKYUa51rI=; b=ILCVg5bVwam473Ss2BjzMaODuDwBjw07aFa2Z47GB04e1Yx1b8DuAd3ndBFUmV90E1 e6hoXrzvubeSqCTdkwH03ucvU3p//HWJ2NH20QtPIgK7qqV6NddGJ9+CBYFgX/+XJeFH OJlApgGSfiiMrwPkhJjDnL66qBNlD986jQOCQ= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=j01Qxhiz39nito7ewQN1W1FtxPRlmQuaymrKYUa51rI=; b=M/4uPwCj0hHr3BP1yoYDTpEJ8Phfyvct/aUKYD27c01a4OdE3Jc6DWcrswO8dU8J9A xJ4KF0R/sWp5d3CBKdCf5zlPwiCKWEqYXIvtWNUAkBCBF/cO7kO4nopVpHS5L/fbWgDE SJg/RTLZDLzIah3rPxmebHnFChE9v2jg4GV2dBppoSdrp+QWHlPjrBZPa4rTt0v0cpXH hlq4G3fY21aLTzpUIEgObWUis/dd/144foIOf9+8pnqym7JJ2MA1ISSeW7RH83tCwtgC ZbKcQ9pTpBkPTaFhEp95A8uXwuVay25PMN0LgMze97Oys/zBcRJWcoWVxtY4KY6X9t6Q a8iA== X-Gm-Message-State: AGRZ1gI3A9rvdBQK/6VUV2sxVJwvP4m/Vu+un4z/5YY2NbJEZaPXWTS0 yxj4PLk/lI1D3ekH+xuBTnL6jQ== X-Received: by 2002:adf:fac4:: with SMTP id a4-v6mr1698489wrs.202.1541591996873; Wed, 07 Nov 2018 03:59:56 -0800 (PST) Received: from holly.lan (cpc141214-aztw34-2-0-cust773.18-1.cable.virginm.net. [86.9.19.6]) by smtp.gmail.com with ESMTPSA id v11-v6sm564320wrt.40.2018.11.07.03.59.55 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Wed, 07 Nov 2018 03:59:55 -0800 (PST) Date: Wed, 7 Nov 2018 11:59:53 +0000 From: Daniel Thompson To: Douglas Anderson Cc: Jason Wessel , kgdb-bugreport@lists.sourceforge.net, Peter Zijlstra , linux-kernel@vger.kernel.org Subject: Re: [PATCH v3 3/4] kgdb: Don't round up a CPU that failed rounding up before Message-ID: <20181107115953.cnar4u2axi4poaxe@holly.lan> References: <20181107010028.184543-1-dianders@chromium.org> <20181107010028.184543-4-dianders@chromium.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20181107010028.184543-4-dianders@chromium.org> User-Agent: NeoMutt/20180716 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Nov 06, 2018 at 05:00:27PM -0800, Douglas Anderson wrote: > If we're using the default implementation of kgdb_roundup_cpus() that > uses smp_call_function_single_async() we can end up hanging > kgdb_roundup_cpus() if we try to round up a CPU that failed to round > up before. > > Specifically smp_call_function_single_async() will try to wait on the > csd lock for the CPU that we're trying to round up. If the previous > round up never finished then that lock could still be held and we'll > just sit there hanging. > > There's not a lot of use trying to round up a CPU that failed to round > up before. Let's keep a flag that indicates whether the CPU started > but didn't finish to round up before. If we see that flag set then > we'll skip the next round up. > > In general we have a few goals here: > - We never want to end up calling smp_call_function_single_async() > when the csd is still locked. This is accomplished because > flush_smp_call_function_queue() unlocks the csd _before_ invoking > the callback. That means that when kgdb_nmicallback() runs we know > for sure the the csd is no longer locked. Thus when we set > "rounding_up = false" we know for sure that the csd is unlocked. > - If there are no timeouts rounding up we should never skip a round > up. > > NOTE #1: In general trying to continue running after failing to round > up CPUs doesn't appear to be supported in the debugger. When I > simulate this I find that kdb reports "Catastrophic error detected" > when I try to continue. I can overrule and continue anyway, but it > should be noted that we may be entering the land of dragons here. It's been quite a while but AFAIR I decided to set the catastrophic error here *because* the stuck csd lock would make restarting fragile. So arguably we are now able to remove the code that sets this flag when a CPU fails to round up. > NOTE #3: setting 'kgdb_info[cpu].rounding_up = false' is in > kgdb_nmicallback() instead of kgdb_call_nmi_hook() because some > implementations override kgdb_call_nmi_hook(). It shouldn't hurt to > have it in kgdb_nmicallback() in any case. Slightly icky but I guess this is OK. > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c > index 23f2b5613afa..324cba8917f1 100644 > --- a/kernel/debug/debug_core.c > +++ b/kernel/debug/debug_core.c > @@ -246,6 +246,20 @@ void __weak kgdb_roundup_cpus(void) > continue; > > csd = &per_cpu(kgdb_roundup_csd, cpu); > + > + /* > + * If it didn't round up last time, don't try again > + * since smp_call_function_single_async() will block. > + * > + * If rounding_up is false then we know that the > + * previous call must have at least started and that > + * means smp_call_function_single_async() won't block. > + */ > + smp_mb(); Not commented and I suspect this may have no useful effect. What (harmful) orderings does this barrier render impossible? > + if (kgdb_info[cpu].rounding_up) > + continue; > + kgdb_info[cpu].rounding_up = true; > + > csd->func = kgdb_call_nmi_hook; > smp_call_function_single_async(cpu, csd); > } > @@ -782,6 +796,9 @@ int kgdb_nmicallback(int cpu, void *regs) > struct kgdb_state kgdb_var; > struct kgdb_state *ks = &kgdb_var; > > + kgdb_info[cpu].rounding_up = false; > + smp_mb(); Also not commented. Here I think the barrier may have a purpose (to ensure rounding_up gets cleared before we peek at dbg_master_lock) but if that is the case we need to comment it. Daniel.