Date: Thu, 28 May 2020 08:46:14 -0400
From: Steven Rostedt
To: Joerg Vehlow
Cc: linux-kernel@vger.kernel.org, Joerg Vehlow, Thomas Gleixner, Sebastian Andrzej Siewior, Huang Ying, Andrew Morton
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
Message-ID: <20200528084614.0c949e8d@gandalf.local.home>
In-Reply-To: <2c243f59-6d10-7abb-bab4-e7b1796cd54f@jv-coder.de>

Hi Joerg,

This does look like Andrew's commit (from 2008) is buggy (and this is a
mainline bug, not an RT one).
(top posting this so Andrew knows to look further ;-)

On Thu, 28 May 2020 13:41:08 +0200
Joerg Vehlow wrote:

> Hi,
>
> I think I found a bug in the kernel with rt patches (or maybe even without).
> This applies to all kernels, probably starting at 2.6.27.
>
> When a kernel panic is triggered from an interrupt handler, the dump-capture
> kernel is not started; instead the system acts as if it was not installed.
> The reason for this is that panic calls __crash_kexec, which is protected
> by a mutex. On an rt kernel this mutex is an rt mutex, and when trylock is
> called on an rt mutex, the first check is whether the current kthread is in
> an nmi or irq handler. If it is, the function just returns 0 -> locking
> failed.
>
> According to the rt_mutex_trylock documentation, it is not allowed to call
> this function from an irq handler, but panic can be called from everywhere
> and thus rt_mutex_trylock can be called from everywhere. Actually even
> mutex_trylock has the comment that it is not supposed to be used from
> interrupt context, but it still locks the mutex. I guess this could also be
> a bug in the non-rt kernel.
>
> I found this problem using a test module that triggers the softlockup
> detection. It is a pretty simple module that creates a kthread, which
> disables preemption, spins 60 seconds in an endless loop and then reenables
> preemption and terminates the thread. This reliably triggers the softlockup
> detection and if kernel.softlockup_panic=0, the system resumes perfectly
> fine afterwards. If kernel.softlockup_panic=1 I would expect the
> dump-capture kernel to be executed, but it is not due to the bug (without
> rt patches it works); instead the panic function runs all the way to the
> endless loop at its end.
>
>
> A stacktrace captured at the trylock call inside kexec_core looks like this:
> #0  __rt_mutex_trylock (lock=0xffffffff81701aa0 ) at /usr/src/kernel/kernel/locking/rtmutex.c:2110
> #1  0xffffffff8087601a in _mutex_trylock (lock=) at /usr/src/kernel/kernel/locking/mutex-rt.c:185
> #2  0xffffffff803022a0 in __crash_kexec (regs=0x0 ) at /usr/src/kernel/kernel/kexec_core.c:941
> #3  0xffffffff8027af59 in panic (fmt=0xffffffff80fa3d66 "softlockup: hung tasks") at /usr/src/kernel/kernel/panic.c:198
> #4  0xffffffff80325b6d in watchdog_timer_fn (hrtimer=) at /usr/src/kernel/kernel/watchdog.c:464
> #5  0xffffffff802e6b90 in __run_hrtimer (flags=, now=, timer=, base=, cpu_base=) at /usr/src/kernel/kernel/time/hrtimer.c:1417
> #6  __hrtimer_run_queues (cpu_base=0xffff88807db1c000, now=, flags=, active_mask=) at /usr/src/kernel/kernel/time/hrtimer.c:1479
> #7  0xffffffff802e7704 in hrtimer_interrupt (dev=) at /usr/src/kernel/kernel/time/hrtimer.c:1539
> #8  0xffffffff80a020f2 in local_apic_timer_interrupt () at /usr/src/kernel/arch/x86/kernel/apic/apic.c:1067
> #9  smp_apic_timer_interrupt (regs=) at /usr/src/kernel/arch/x86/kernel/apic/apic.c:1092
> #10 0xffffffff80a015df in apic_timer_interrupt () at /usr/src/kernel/arch/x86/entry/entry_64.S:909
>
>
> Obviously and as expected the panic was triggered in the context of the apic
> interrupt. So in_irq() is true and trylock fails.
>
>
> About 12 years ago this was not implemented using a mutex, but using xchg.
> See: 8c5a1cf0ad3ac5fcdf51314a63b16a440870f6a2

Yes, that commit is wrong, because mutex_trylock() is not to be called in
interrupt context, where crash_kexec() looks like it can be called. Unless
back then crash_kexec() wasn't called in interrupt context, in which case
the commit that started calling it from there, combined with this commit,
is the issue.

-- Steve

>
>
> Since my knowledge about mutexes inside the kernel is very limited, I do
> not know how this can be fixed and whether it should be fixed in the rt
> patches or if this really is a bug in the mainline kernel (because trylock
> is also not allowed to be used in interrupt handlers).
>
>
> Jörg