Date: Thu, 20 May 2021 11:21:04 +0200
From: Borislav Petkov
To: James Feeney
Cc: linux-smp@vger.kernel.org, Jens Axboe, lkml
Subject: Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 19, 2021 at 09:12:04PM -0600, James Feeney wrote:
> $ diff .config .config.old
> 4983c4983,4984
> < # CONFIG_X86_THERMAL_VECTOR is not set
> ---
> > CONFIG_X86_THERMAL_VECTOR=y
> > CONFIG_X86_PKG_TEMP_THERMAL=m
>
> No joy. Still have the same soft lockups and full boots - the full
> boots interrupted by some mystery delay.

Which means, even with therm_throt disabled, it still locks up. Which
can't be caused by my patch.

> I don't know about these patches, modifying and moving the location of
> therm_throt.c, so I'm not in a position to draw any conclusion from
> these results.

They're just moving the thermal interrupt functionality from the MCE
code, where it doesn't belong, to the thermal code, where it does.
Otherwise there should be no change.

> build 5.11? There are lots of 5.11 kernels from the Arch distribution
> that I have run. Are you looking for a dmesg log from 5.11?

Take the .config you're normally using, make sure it has

CONFIG_X86_THERMAL_VECTOR=y

and build a plain 5.11 kernel with it. No patches on top, no nothing.

Then add

debug ignore_loglevel log_buf_len=16M no_console_suspend systemd.log_target=null console=ttyS0,115200 console=tty0

to its kernel command line and send me full dmesg again pls.

Seeing how it sometimes boots and sometimes locks up, try that a
couple of times.

> So far, something looks quirky - somewhere. Timing related failures
> can be a pain. Is there no useful information being provided by the
> Call Trace in the dmesg log?

What I'm seeing is that *sometimes* - not always - your CPU0 is not
responding to the TLB flush IPI. Which is really weird. Have you had
those always or did they start appearing with 5.12?

That's why I'm still scratching my head over how my patch would cause
CPU0 to not respond to IPIs.

Well, *maybe* there's a little difference which my patch made: it reads
APIC_LVTTHMR only on the BSP. And *maybe* there's a problem there, who
knows with those old CPUs.

So here's two more things to try:

1. On plain 5.12, with the same kernel cmdline params, also add
"idle=nomwait" and boot with it a couple of times to see whether it
still locks up.

2. On plain 5.12, with the same kernel cmdline params, apply this hunk
on top:

---
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index f8e882592ba5..42db48cd4666 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -630,9 +630,8 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
 	if (!intel_thermal_supported(c))
 		return;
 
-	/* On the BSP? */
-	if (c == &boot_cpu_data)
-		lvtthmr_init = apic_read(APIC_LVTTHMR);
+	lvtthmr_init = apic_read(APIC_LVTTHMR);
+	pr_info("%s: CPU%d, lvtthmr_init: 0x%x\n", __func__, cpu, lvtthmr_init);
 
 	/*
 	 * First check if its enabled already, in which case there might
---

That'll tell us the thermal sensor LVT on both CPUs. Also do that a
couple of times - it'll be interesting to see what those values are
*when* the box locks up.

As always, catch full dmesg each time pls.

Thx.

-- 
Regards/Gruss,
    Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg
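
For comparing the values across several boots, here is a minimal sketch of a helper that pulls the lines printed by the pr_info() in the hunk above out of a saved dmesg and splits each value into the generic local APIC LVT fields (vector in bits 7:0, delivery mode in bits 10:8, mask in bit 16). It is not part of the original mail. The function names and the sample log lines are hypothetical, and the regular expression assumes the message appears exactly as the format string in the hunk produces it, possibly behind a timestamp or other prefix.

```python
import re

# Matches the output of the pr_info() in the hunk above. Anything before
# the function name (dmesg timestamp, log prefix) is ignored by finditer().
LVT_LINE = re.compile(
    r"intel_init_thermal: CPU(\d+), lvtthmr_init: (0x[0-9a-fA-F]+)"
)


def decode_lvt(value):
    """Split a raw LVT register value into the generic LVT fields."""
    return {
        "vector": value & 0xFF,               # bits 7:0
        "delivery_mode": (value >> 8) & 0x7,  # bits 10:8
        "masked": bool(value & (1 << 16)),    # bit 16
    }


def lvtthmr_by_cpu(dmesg_text):
    """Return {cpu number: raw lvtthmr_init value} for each matching line."""
    found = {}
    for match in LVT_LINE.finditer(dmesg_text):
        found[int(match.group(1))] = int(match.group(2), 16)
    return found


if __name__ == "__main__":
    # Made-up sample lines for illustration only, not taken from a real log.
    sample = (
        "[    0.100000] intel_init_thermal: CPU0, lvtthmr_init: 0x10000\n"
        "[    0.200000] intel_init_thermal: CPU1, lvtthmr_init: 0x200\n"
    )
    for cpu, raw in sorted(lvtthmr_by_cpu(sample).items()):
        print(cpu, hex(raw), decode_lvt(raw))
```

Running it over the dmesg of a boot that locked up and one that did not makes it easy to see whether the two CPUs report different LVT values in the failing case.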