Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
From:   Michael Ellerman <mpe@ellerman.id.au>
To:     Leonardo Bras <leonardo@linux.ibm.com>,
        Benjamin Herrenschmidt <benh@kernel.crashing.org>,
        Paul Mackerras <paulus@samba.org>,
        Enrico Weigelt <info@metux.net>,
        Leonardo Bras <leonardo@linux.ibm.com>,
        Alexios Zavras <alexios.zavras@intel.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        Christophe Leroy <christophe.leroy@c-s.fr>,
        peterz@infradead.org
Cc:     linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash
In-Reply-To: <20200401000020.590447-1-leonardo@linux.ibm.com>
References: <20200401000020.590447-1-leonardo@linux.ibm.com>
Date:   Thu, 02 Apr 2020 22:28:09 +1100
Message-ID: <871rp6t9di.fsf@mpe.ellerman.id.au>
MIME-Version: 1.0
Content-Type: text/plain
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

Leonardo Bras <leonardo@linux.ibm.com> writes:
> During a crash, there is chance that the cpus that handle the NMI IPI
> are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> will cause a deadlock. (rtas.lock and printk logbuf_lock as of today)
>
> This is a problem if the system has kdump set up, given if it crashes
> for any reason kdump may not be saved for crash analysis.
>
> After NMI IPI is sent to all other cpus, force unlock all spinlocks
> needed for finishing crash routine.

I'm not convinced this is the right approach.

Busting locks is risky, it could easily cause a crash if data structures
are left in some inconsistent state.

I think we need to make this code more careful about what it's doing.
There's a clue at the top of default_machine_crash_shutdown(), which
calls crash_kexec_prepare_cpus():

	 * This function is only called after the system
	 * has panicked or is otherwise in a critical state.
	 * The minimum amount of code to allow a kexec'd kernel
	 * to run successfully needs to happen here.


You said the "IPI complete" message was the cause of one lockup:

  #0  arch_spin_lock 
  #1  do_raw_spin_lock 
  #2  __raw_spin_lock 
  #3  _raw_spin_lock 
  #4  vprintk_emit 
  #5  vprintk_func
  #7  crash_kexec_prepare_cpus 
  #8  default_machine_crash_shutdown
  #9  machine_crash_shutdown 
  #10 __crash_kexec
  #11 crash_kexec
  #12 oops_end

TBH I think we could just drop that printk() entirely.

Or we could tell printk() that we're in NMI context so that it uses the
percpu buffers.

We should probably do the latter anyway, in case there's any other code
we call that inadvertently calls printk().


The RTAS trace you sent was:

  #0 arch_spin_lock
  #1  lock_rtas () 
  #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
  #3  ics_rtas_mask_real_irq (hw_irq=4100) 
  #4  machine_kexec_mask_interrupts
  #5  default_machine_crash_shutdown
  #6  machine_crash_shutdown 
  #7  __crash_kexec
  #8  crash_kexec
  #9  oops_end


Which doesn't make it clear who holds the RTAS lock. We really shouldn't
be crashing while holding the RTAS lock, but I guess it could happen.
Can you get a full backtrace?


PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
except for a very small list of RTAS calls. So if we bust the RTAS lock
there's a risk we violate that part of PAPR and crash even harder.

Also it's not specific to kdump, we can't even get through a normal
reboot if we crash with the RTAS lock held.

Anyway here's a patch with some ideas. That allows me to get from a
crash with the RTAS lock held through kdump into the 2nd kernel. But it
only works if it's the crashing CPU that holds the RTAS lock.

cheers

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index c5fa251b8950..44ce74966d60 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -25,6 +25,7 @@
 #include <linux/reboot.h>
 #include <linux/syscalls.h>
 
+#include <asm/debugfs.h>
 #include <asm/prom.h>
 #include <asm/rtas.h>
 #include <asm/hvcall.h>
@@ -65,6 +66,8 @@ unsigned long rtas_rmo_buf;
 void (*rtas_flash_term_hook)(int);
 EXPORT_SYMBOL(rtas_flash_term_hook);
 
+static int rtas_lock_holder = -1;
+
 /* RTAS use home made raw locking instead of spin_lock_irqsave
  * because those can be called from within really nasty contexts
  * such as having the timebase stopped which would lockup with
@@ -76,7 +79,20 @@ static unsigned long lock_rtas(void)
 
 	local_irq_save(flags);
 	preempt_disable();
-	arch_spin_lock(&rtas.lock);
+
+	if (!arch_spin_trylock(&rtas.lock)) {
+		// Couldn't get the lock, do we already hold it?
+		if (rtas_lock_holder == smp_processor_id())
+			// Yes, so we would have deadlocked on ourself. Assume
+			// we're crashing and continue on hopefully ...
+			return flags;
+
+		// No, wait on the lock
+		arch_spin_lock(&rtas.lock);
+	}
+
+	rtas_lock_holder = smp_processor_id();
+
 	return flags;
 }
 
@@ -85,6 +101,8 @@ static void unlock_rtas(unsigned long flags)
 	arch_spin_unlock(&rtas.lock);
 	local_irq_restore(flags);
 	preempt_enable();
+
+	rtas_lock_holder = -1;
 }
 
 /*
@@ -1263,3 +1281,24 @@ void rtas_take_timebase(void)
 	timebase = 0;
 	arch_spin_unlock(&timebase_lock);
 }
+
+static int rtas_crash_set(void *data, u64 val)
+{
+	printk("%s: Taking RTAS lock and then crashing ...\n", __func__);
+	lock_rtas();
+
+	*((volatile int *) 0) = 0;
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_rtas_crash, NULL, rtas_crash_set, "%llu\n");
+
+static __init int rtas_crash_debugfs_init(void)
+{
+	debugfs_create_file_unsafe("crash_in_rtas", 0200,
+				   powerpc_debugfs_root, NULL,
+				   &fops_rtas_crash);
+	return 0;
+}
+device_initcall(rtas_crash_debugfs_init);
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index d488311efab1..4c52cb58e889 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -15,6 +15,7 @@
 #include <linux/crash_dump.h>
 #include <linux/delay.h>
 #include <linux/irq.h>
+#include <linux/printk.h>
 #include <linux/types.h>
 
 #include <asm/processor.h>
@@ -311,6 +312,8 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
 	unsigned int i;
 	int (*old_handler)(struct pt_regs *regs);
 
+	printk_nmi_enter();
+
 	/*
 	 * This function is only called after the system
 	 * has panicked or is otherwise in a critical state.