2020-04-01 00:04:11

by Leonardo Bras

Subject: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

During a crash, there is a chance that the CPUs handling the NMI IPI
are holding a spin_lock. If this spin_lock is also needed by the
crashing CPU, it will cause a deadlock. (rtas.lock and printk's
logbuf_lock as of today.)

This is a problem if the system has kdump set up: if it crashes for
any reason, the kdump may not be saved for crash analysis.

After the NMI IPI is sent to all other CPUs, force-unlock all spinlocks
needed to finish the crash routine.

Signed-off-by: Leonardo Bras <[email protected]>

---
Changes from v2:
- Instead of skipping spinlocks, unlock the needed ones.

Changes from v1:
- Exported variable
---
arch/powerpc/kexec/crash.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index d488311efab1..8d63fca3242c 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -24,6 +24,7 @@
#include <asm/smp.h>
#include <asm/setjmp.h>
#include <asm/debug.h>
+#include <asm/rtas.h>

/*
* The primary CPU waits a while for all secondary CPUs to enter. This is to
@@ -49,6 +50,8 @@ static int time_to_dump;
*/
int crash_wake_offline;

+extern raw_spinlock_t logbuf_lock;
+
#define CRASH_HANDLER_MAX 3
/* List of shutdown handles */
static crash_shutdown_t crash_shutdown_handles[CRASH_HANDLER_MAX];
@@ -129,6 +132,13 @@ static void crash_kexec_prepare_cpus(int cpu)
/* Would it be better to replace the trap vector here? */

if (atomic_read(&cpus_in_crash) >= ncpus) {
+ /*
+ * At this point no other CPU is running, and some of them may
+ * have been interrupted while holding one of the locks needed
+ * to complete crashing. Free them so there is no deadlock.
+ */
+ arch_spin_unlock(&logbuf_lock.raw_lock);
+ arch_spin_unlock(&rtas.lock);
printk(KERN_EMERG "IPI complete\n");
return;
}
--
2.25.1


2020-04-01 03:09:10

by kernel test robot

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Hi Leonardo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on paulus-powerpc/kvm-ppc-next linus/master linux/master v5.6 next-20200331]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. We also suggest using the '--base' option to specify the
base tree in git format-patch; please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Leonardo-Bras/ppc-crash-Reset-spinlocks-during-crash/20200401-091600
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-rhel-kconfig (attached as .config)
compiler: powerpc64le-linux-gcc (GCC) 9.3.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=9.3.0 make.cross ARCH=powerpc

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <[email protected]>

All errors (new ones prefixed by >>):

powerpc64le-linux-ld: warning: orphan section `.gnu.hash' from `linker stubs' being placed in section `.gnu.hash'
>> powerpc64le-linux-ld: arch/powerpc/kexec/crash.o:(.toc+0x0): undefined reference to `rtas'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]



2020-04-01 09:27:53

by Peter Zijlstra

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

On Tue, Mar 31, 2020 at 09:00:21PM -0300, Leonardo Bras wrote:
> During a crash, there is a chance that the CPUs handling the NMI IPI
> are holding a spin_lock. If this spin_lock is also needed by the
> crashing CPU, it will cause a deadlock. (rtas.lock and printk's
> logbuf_lock as of today.)
>
> This is a problem if the system has kdump set up: if it crashes for
> any reason, the kdump may not be saved for crash analysis.
>
> After the NMI IPI is sent to all other CPUs, force-unlock all spinlocks
> needed to finish the crash routine.
>
> Signed-off-by: Leonardo Bras <[email protected]>

> @@ -129,6 +132,13 @@ static void crash_kexec_prepare_cpus(int cpu)
> /* Would it be better to replace the trap vector here? */
>
> if (atomic_read(&cpus_in_crash) >= ncpus) {
> + /*
> + * At this point no other CPU is running, and some of them may
> + * have been interrupted while holding one of the locks needed
> + * to complete crashing. Free them so there is no deadlock.
> + */
> + arch_spin_unlock(&logbuf_lock.raw_lock);
> + arch_spin_unlock(&rtas.lock);
> printk(KERN_EMERG "IPI complete\n");
> return;
> }

You might want to add a note to your asm/spinlock.h that you rely on
spin_unlock() unconditionally clearing a lock.

This isn't naturally true for all lock implementations. Consider ticket
locks, doing a surplus unlock will wreck your lock state in that case.
So anybody poking at the powerpc spinlock implementation had better know
you rely on this.
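
To make that concrete, here is a stand-alone C sketch (plain C11
atomics, not the powerpc implementation) of the difference: a
test-and-set style unlock unconditionally clears the lock word, so a
surplus unlock is a no-op, while a ticket unlock only advances the
owner counter, so a surplus unlock corrupts the lock state.

#include <stdatomic.h>

/* Test-and-set style lock: unlock unconditionally clears the word,
 * so an extra unlock of an already-free lock is harmless. */
struct tas_lock { atomic_int locked; };

static void tas_unlock(struct tas_lock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Ticket lock: unlock advances the owner counter by one. */
struct ticket_lock {
	atomic_uint next;	/* ticket handed to the next acquirer */
	atomic_uint owner;	/* ticket currently being served */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* spin until our ticket comes up */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/*
	 * A surplus release pushes owner past next: a later acquirer can
	 * then "take" a lock that is still held, or spin forever on a
	 * ticket that has already gone by.
	 */
	atomic_fetch_add(&l->owner, 1);
}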

2020-04-01 23:55:16

by Leonardo Bras

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Hello Peter,

On Wed, 2020-04-01 at 11:26 +0200, Peter Zijlstra wrote:
> You might want to add a note to your asm/spinlock.h that you rely on
> spin_unlock() unconditionally clearing a lock.
>
> This isn't naturally true for all lock implementations. Consider ticket
> locks, doing a surplus unlock will wreck your lock state in that case.
> So anybody poking at the powerpc spinlock implementation had better know
> you rely on this.

Good idea. I will add this to my changes and generate a v4.

Thank you,



2020-04-02 11:28:48

by Michael Ellerman

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Leonardo Bras <[email protected]> writes:
> During a crash, there is a chance that the CPUs handling the NMI IPI
> are holding a spin_lock. If this spin_lock is also needed by the
> crashing CPU, it will cause a deadlock. (rtas.lock and printk's
> logbuf_lock as of today.)
>
> This is a problem if the system has kdump set up: if it crashes for
> any reason, the kdump may not be saved for crash analysis.
>
> After the NMI IPI is sent to all other CPUs, force-unlock all spinlocks
> needed to finish the crash routine.

I'm not convinced this is the right approach.

Busting locks is risky, it could easily cause a crash if data structures
are left in some inconsistent state.

I think we need to make this code more careful about what it's doing.
There's a clue at the top of default_machine_crash_shutdown(), which
calls crash_kexec_prepare_cpus():

* This function is only called after the system
* has panicked or is otherwise in a critical state.
* The minimum amount of code to allow a kexec'd kernel
* to run successfully needs to happen here.


You said the "IPI complete" message was the cause of one lockup:

#0 arch_spin_lock
#1 do_raw_spin_lock
#2 __raw_spin_lock
#3 _raw_spin_lock
#4 vprintk_emit
#5 vprintk_func
#7 crash_kexec_prepare_cpus
#8 default_machine_crash_shutdown
#9 machine_crash_shutdown
#10 __crash_kexec
#11 crash_kexec
#12 oops_end

TBH I think we could just drop that printk() entirely.

Or we could tell printk() that we're in NMI context so that it uses the
percpu buffers.

We should probably do the latter anyway, in case there's any other code
we call that inadvertently calls printk().


The RTAS trace you sent was:

#0 arch_spin_lock
#1 lock_rtas ()
#2 rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
#3 ics_rtas_mask_real_irq (hw_irq=4100)
#4 machine_kexec_mask_interrupts
#5 default_machine_crash_shutdown
#6 machine_crash_shutdown
#7 __crash_kexec
#8 crash_kexec
#9 oops_end


Which doesn't make it clear who holds the RTAS lock. We really shouldn't
be crashing while holding the RTAS lock, but I guess it could happen.
Can you get a full backtrace?


PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
except for a very small list of RTAS calls. So if we bust the RTAS lock
there's a risk we violate that part of PAPR and crash even harder.

Also it's not specific to kdump, we can't even get through a normal
reboot if we crash with the RTAS lock held.

Anyway here's a patch with some ideas. That allows me to get from a
crash with the RTAS lock held through kdump into the 2nd kernel. But it
only works if it's the crashing CPU that holds the RTAS lock.

cheers

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index c5fa251b8950..44ce74966d60 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -25,6 +25,7 @@
#include <linux/reboot.h>
#include <linux/syscalls.h>

+#include <asm/debugfs.h>
#include <asm/prom.h>
#include <asm/rtas.h>
#include <asm/hvcall.h>
@@ -65,6 +66,8 @@ unsigned long rtas_rmo_buf;
void (*rtas_flash_term_hook)(int);
EXPORT_SYMBOL(rtas_flash_term_hook);

+static int rtas_lock_holder = -1;
+
/* RTAS use home made raw locking instead of spin_lock_irqsave
* because those can be called from within really nasty contexts
* such as having the timebase stopped which would lockup with
@@ -76,7 +79,20 @@ static unsigned long lock_rtas(void)

local_irq_save(flags);
preempt_disable();
- arch_spin_lock(&rtas.lock);
+
+ if (!arch_spin_trylock(&rtas.lock)) {
+ // Couldn't get the lock, do we already hold it?
+ if (rtas_lock_holder == smp_processor_id())
+ // Yes, so we would have deadlocked on ourself. Assume
+ // we're crashing and continue on hopefully ...
+ return flags;
+
+ // No, wait on the lock
+ arch_spin_lock(&rtas.lock);
+ }
+
+ rtas_lock_holder = smp_processor_id();
+
return flags;
}

@@ -85,6 +101,8 @@ static void unlock_rtas(unsigned long flags)
arch_spin_unlock(&rtas.lock);
local_irq_restore(flags);
preempt_enable();
+
+ rtas_lock_holder = -1;
}

/*
@@ -1263,3 +1281,24 @@ void rtas_take_timebase(void)
timebase = 0;
arch_spin_unlock(&timebase_lock);
}
+
+static int rtas_crash_set(void *data, u64 val)
+{
+ printk("%s: Taking RTAS lock and then crashing ...\n", __func__);
+ lock_rtas();
+
+ *((volatile int *) 0) = 0;
+
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_rtas_crash, NULL, rtas_crash_set, "%llu\n");
+
+static __init int rtas_crash_debugfs_init(void)
+{
+ debugfs_create_file_unsafe("crash_in_rtas", 0200,
+ powerpc_debugfs_root, NULL,
+ &fops_rtas_crash);
+ return 0;
+}
+device_initcall(rtas_crash_debugfs_init);
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index d488311efab1..4c52cb58e889 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -15,6 +15,7 @@
#include <linux/crash_dump.h>
#include <linux/delay.h>
#include <linux/irq.h>
+#include <linux/printk.h>
#include <linux/types.h>

#include <asm/processor.h>
@@ -311,6 +312,8 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
unsigned int i;
int (*old_handler)(struct pt_regs *regs);

+ printk_nmi_enter();
+
/*
* This function is only called after the system
* has panicked or is otherwise in a critical state.


2020-04-03 00:39:10

by Leonardo Bras

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

On Thu, 2020-04-02 at 22:28 +1100, Michael Ellerman wrote:
> Leonardo Bras <[email protected]> writes:
> > During a crash, there is a chance that the CPUs handling the NMI IPI
> > are holding a spin_lock. If this spin_lock is also needed by the
> > crashing CPU, it will cause a deadlock. (rtas.lock and printk's
> > logbuf_lock as of today.)
> >
> > This is a problem if the system has kdump set up: if it crashes for
> > any reason, the kdump may not be saved for crash analysis.
> >
> > After the NMI IPI is sent to all other CPUs, force-unlock all spinlocks
> > needed to finish the crash routine.
>
> I'm not convinced this is the right approach.

Me neither. I think it's a very hacky solution, but I couldn't think of
anything better at the time.

> Busting locks is risky, it could easily cause a crash if data structures
> are left in some inconsistent state.
>
> I think we need to make this code more careful about what it's doing.
> There's a clue at the top of default_machine_crash_shutdown(), which
> calls crash_kexec_prepare_cpus():
>
> * This function is only called after the system
> * has panicked or is otherwise in a critical state.
> * The minimum amount of code to allow a kexec'd kernel
> * to run successfully needs to happen here.
>
>
> You said the "IPI complete" message was the cause of one lockup:
>
> #0 arch_spin_lock
> #1 do_raw_spin_lock
> #2 __raw_spin_lock
> #3 _raw_spin_lock
> #4 vprintk_emit
> #5 vprintk_func
> #7 crash_kexec_prepare_cpus
> #8 default_machine_crash_shutdown
> #9 machine_crash_shutdown
> #10 __crash_kexec
> #11 crash_kexec
> #12 oops_end
>
> TBH I think we could just drop that printk() entirely.
>
> Or we could tell printk() that we're in NMI context so that it uses the
> percpu buffers.
>
> We should probably do the latter anyway, in case there's any other code
> we call that inadvertently calls printk().
>

I was not aware that printk could use per-CPU buffers. It may be a
nice solution.

There is another printk call there:
printk("kexec: Starting switchover sequence.\n");
in default_machine_kexec().

Two printks and one RTAS call: that's all I could see taking a
spinlock after 'IPI complete'.

>
> The RTAS trace you sent was:
>
> #0 arch_spin_lock
> #1 lock_rtas ()
> #2 rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
> #3 ics_rtas_mask_real_irq (hw_irq=4100)
> #4 machine_kexec_mask_interrupts
> #5 default_machine_crash_shutdown
> #6 machine_crash_shutdown
> #7 __crash_kexec
> #8 crash_kexec
> #9 oops_end
>
>
> Which doesn't make it clear who holds the RTAS lock. We really shouldn't
> be crashing while holding the RTAS lock, but I guess it could happen.
> Can you get a full backtrace?
>

Oh, all the traces are from the thread that triggered the crash by
writing 'c' to sysrq-trigger. That is what I am using to reproduce
the issue.

#10 bad_page_fault
#11 handle_page_fault
#12 __handle_sysrq (key=99, check_mask=false)
#13 write_sysrq_trigger
#14 proc_reg_write
#15 __vfs_write
#16 vfs_write
#17 SYSC_write
#18 SyS_write
#19 system_call

>
> PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
> except for a very small list of RTAS calls. So if we bust the RTAS lock
> there's a risk we violate that part of PAPR and crash even harder.

Interesting, I was not aware.

>
> Also it's not specific to kdump, we can't even get through a normal
> reboot if we crash with the RTAS lock held.
>
> Anyway here's a patch with some ideas. That allows me to get from a
> crash with the RTAS lock held through kdump into the 2nd kernel. But it
> only works if it's the crashing CPU that holds the RTAS lock.
>

Nice idea.
But my test environment is just triggering a crash from sysrq, so I
think it would not improve the result, given that this thread is
probably not holding the lock at that point.

I noticed that when rtas is locked, irqs and preemption are also
disabled.

Should the IPI sent by the crash code be able to interrupt a thread with
disabled irqs?

Best regards,
Leonardo Bras


> cheers
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index c5fa251b8950..44ce74966d60 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -25,6 +25,7 @@
> #include <linux/reboot.h>
> #include <linux/syscalls.h>
>
> +#include <asm/debugfs.h>
> #include <asm/prom.h>
> #include <asm/rtas.h>
> #include <asm/hvcall.h>
> @@ -65,6 +66,8 @@ unsigned long rtas_rmo_buf;
> void (*rtas_flash_term_hook)(int);
> EXPORT_SYMBOL(rtas_flash_term_hook);
>
> +static int rtas_lock_holder = -1;
> +
> /* RTAS use home made raw locking instead of spin_lock_irqsave
> * because those can be called from within really nasty contexts
> * such as having the timebase stopped which would lockup with
> @@ -76,7 +79,20 @@ static unsigned long lock_rtas(void)
>
> local_irq_save(flags);
> preempt_disable();
> - arch_spin_lock(&rtas.lock);
> +
> + if (!arch_spin_trylock(&rtas.lock)) {
> + // Couldn't get the lock, do we already hold it?
> + if (rtas_lock_holder == smp_processor_id())
> + // Yes, so we would have deadlocked on ourself. Assume
> + // we're crashing and continue on hopefully ...
> + return flags;
> +
> + // No, wait on the lock
> + arch_spin_lock(&rtas.lock);
> + }
> +
> + rtas_lock_holder = smp_processor_id();
> +
> return flags;
> }
>
> @@ -85,6 +101,8 @@ static void unlock_rtas(unsigned long flags)
> arch_spin_unlock(&rtas.lock);
> local_irq_restore(flags);
> preempt_enable();
> +
> + rtas_lock_holder = -1;
> }
>
> /*
> @@ -1263,3 +1281,24 @@ void rtas_take_timebase(void)
> timebase = 0;
> arch_spin_unlock(&timebase_lock);
> }
> +
> +static int rtas_crash_set(void *data, u64 val)
> +{
> + printk("%s: Taking RTAS lock and then crashing ...\n", __func__);
> + lock_rtas();
> +
> + *((volatile int *) 0) = 0;
> +
> + return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(fops_rtas_crash, NULL, rtas_crash_set, "%llu\n");
> +
> +static __init int rtas_crash_debugfs_init(void)
> +{
> + debugfs_create_file_unsafe("crash_in_rtas", 0200,
> + powerpc_debugfs_root, NULL,
> + &fops_rtas_crash);
> + return 0;
> +}
> +device_initcall(rtas_crash_debugfs_init);
> diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
> index d488311efab1..4c52cb58e889 100644
> --- a/arch/powerpc/kexec/crash.c
> +++ b/arch/powerpc/kexec/crash.c
> @@ -15,6 +15,7 @@
> #include <linux/crash_dump.h>
> #include <linux/delay.h>
> #include <linux/irq.h>
> +#include <linux/printk.h>
> #include <linux/types.h>
>
> #include <asm/processor.h>
> @@ -311,6 +312,8 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
> unsigned int i;
> int (*old_handler)(struct pt_regs *regs);
>
> + printk_nmi_enter();
> +
> /*
> * This function is only called after the system
> * has panicked or is otherwise in a critical state.
>
>



2020-04-03 06:48:15

by Nicholas Piggin

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Leonardo Bras's on April 3, 2020 10:37 am:
> On Thu, 2020-04-02 at 22:28 +1100, Michael Ellerman wrote:
>> Leonardo Bras <[email protected]> writes:
>> > During a crash, there is a chance that the CPUs handling the NMI IPI
>> > are holding a spin_lock. If this spin_lock is also needed by the
>> > crashing CPU, it will cause a deadlock. (rtas.lock and printk's
>> > logbuf_lock as of today.)
>> >
>> > This is a problem if the system has kdump set up: if it crashes for
>> > any reason, the kdump may not be saved for crash analysis.
>> >
>> > After the NMI IPI is sent to all other CPUs, force-unlock all spinlocks
>> > needed to finish the crash routine.
>>
>> I'm not convinced this is the right approach.
>
> Me neither. I think it's a very hacky solution, but I couldn't think of
> anything better at the time.
>
>> Busting locks is risky, it could easily cause a crash if data structures
>> are left in some inconsistent state.
>>
>> I think we need to make this code more careful about what it's doing.
>> There's a clue at the top of default_machine_crash_shutdown(), which
>> calls crash_kexec_prepare_cpus():
>>
>> * This function is only called after the system
>> * has panicked or is otherwise in a critical state.
>> * The minimum amount of code to allow a kexec'd kernel
>> * to run successfully needs to happen here.
>>
>>
>> You said the "IPI complete" message was the cause of one lockup:
>>
>> #0 arch_spin_lock
>> #1 do_raw_spin_lock
>> #2 __raw_spin_lock
>> #3 _raw_spin_lock
>> #4 vprintk_emit
>> #5 vprintk_func
>> #7 crash_kexec_prepare_cpus
>> #8 default_machine_crash_shutdown
>> #9 machine_crash_shutdown
>> #10 __crash_kexec
>> #11 crash_kexec
>> #12 oops_end
>>
>> TBH I think we could just drop that printk() entirely.
>>
>> Or we could tell printk() that we're in NMI context so that it uses the
>> percpu buffers.
>>
>> We should probably do the latter anyway, in case there's any other code
>> we call that inadvertently calls printk().
>>
>
> I was not aware that printk could use per-CPU buffers. It may be a
> nice solution.
>
> There is another printk call there:
> printk("kexec: Starting switchover sequence.\n");
> in default_machine_kexec().
>
> Two printks and one RTAS call: that's all I could see taking a
> spinlock after 'IPI complete'.
>
>>
>> The RTAS trace you sent was:
>>
>> #0 arch_spin_lock
>> #1 lock_rtas ()
>> #2 rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
>> #3 ics_rtas_mask_real_irq (hw_irq=4100)
>> #4 machine_kexec_mask_interrupts
>> #5 default_machine_crash_shutdown
>> #6 machine_crash_shutdown
>> #7 __crash_kexec
>> #8 crash_kexec
>> #9 oops_end
>>
>>
>> Which doesn't make it clear who holds the RTAS lock. We really shouldn't
>> be crashing while holding the RTAS lock, but I guess it could happen.
>> Can you get a full backtrace?
>>
>
> Oh, all the traces are from the thread that triggered the crash by
> writing 'c' to sysrq-trigger. That is what I am using to reproduce
> the issue.
>
> #10 bad_page_fault
> #11 handle_page_fault
> #12 __handle_sysrq (key=99, check_mask=false)
> #13 write_sysrq_trigger
> #14 proc_reg_write
> #15 __vfs_write
> #16 vfs_write
> #17 SYSC_write
> #18 SyS_write
> #19 system_call
>
>>
>> PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
>> except for a very small list of RTAS calls. So if we bust the RTAS lock
>> there's a risk we violate that part of PAPR and crash even harder.
>
> Interesting, I was not aware.
>
>>
>> Also it's not specific to kdump, we can't even get through a normal
>> reboot if we crash with the RTAS lock held.
>>
>> Anyway here's a patch with some ideas. That allows me to get from a
>> crash with the RTAS lock held through kdump into the 2nd kernel. But it
>> only works if it's the crashing CPU that holds the RTAS lock.
>>
>
> Nice idea.
> But my test environment is just triggering a crash from sysrq, so I
> think it would not improve the result, given that this thread is
> probably not holding the lock at that point.

Crash paths should not take that RTAS lock, it's a massive pain. I'm
fixing it for machine check, for other crashes I think it can be removed
too, it just needs to be unpicked. The good thing with crashing is that
you can reasonably *know* that you're single threaded, so you can
usually reason through situations like above.

> I noticed that when rtas is locked, irqs and preemption are also
> disabled.
>
> Should the IPI sent by the crash code be able to interrupt a thread with
> disabled irqs?

Yes. It's been a bit painful, but in the long term it means that a CPU
which hangs with interrupts off can be debugged, and it means we can
take it offline to crash without risking that it will be clobbering what
we're doing.

Arguably what I should have done is try a regular IPI first, wait a few
seconds, then NMI IPI.

A couple of problems with that. Firstly it probably avoids this issue
you hit almost all the time, so it won't get fixed. So when we really
need the NMI IPI in the field, it'll still be riddled with deadlocks.

Secondly, sending the IPI first in theory can be more intrusive to the
state that we want to debug. It uses the currently running stack, paca
save areas, etc. NMI IPI uses its own stack and save regions so it's a
little more isolated. Maybe this is only a small advantage but I'd like
to have it if we can.

Thanks,
Nick

2020-04-06 18:50:12

by Leonardo Bras

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

On Thu, 2020-04-02 at 22:28 +1100, Michael Ellerman wrote:
> Leonardo Bras <[email protected]>
> TBH I think we could just drop that printk() entirely.
>
> Or we could tell printk() that we're in NMI context so that it uses the
> percpu buffers.
>
> We should probably do the latter anyway, in case there's any other code
> we call that inadvertently calls printk().

Done:
http://patchwork.ozlabs.org/patch/1266956/

About the rtas-call, I think it will take more time, because I have to
study it properly.

Thank you,



2020-04-08 02:39:28

by Leonardo Bras

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Hello Nick, Michael,

On Fri, 2020-04-03 at 16:41 +1000, Nicholas Piggin wrote:
[...]
> > > PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
> > > except for a very small list of RTAS calls. So if we bust the RTAS lock
> > > there's a risk we violate that part of PAPR and crash even harder.
> >
> > Interesting, I was not aware.
> >
> > > Also it's not specific to kdump, we can't even get through a normal
> > > reboot if we crash with the RTAS lock held.
> > >
> > > Anyway here's a patch with some ideas. That allows me to get from a
> > > crash with the RTAS lock held through kdump into the 2nd kernel. But it
> > > only works if it's the crashing CPU that holds the RTAS lock.
> > >
> >
> > Nice idea.
> > But my test environment is just triggering a crash from sysrq, so I
> > think it would not improve the result, given that this thread is
> > probably not holding the lock at that point.
>
> Crash paths should not take that RTAS lock, it's a massive pain. I'm
> fixing it for machine check, for other crashes I think it can be removed
> too, it just needs to be unpicked. The good thing with crashing is that
> you can reasonably *know* that you're single threaded, so you can
> usually reason through situations like above.
>
> > I noticed that when rtas is locked, irqs and preemption are also
> > disabled.
> >
> > Should the IPI sent by the crash code be able to interrupt a thread with
> > disabled irqs?
>
> Yes. It's been a bit painful, but in the long term it means that a CPU
> which hangs with interrupts off can be debugged, and it means we can
> take it offline to crash without risking that it will be clobbering what
> we're doing.
>
> Arguably what I should have done is try a regular IPI first, wait a few
> seconds, then NMI IPI.
>
> A couple of problems with that. Firstly it probably avoids this issue
> you hit almost all the time, so it won't get fixed. So when we really
> need the NMI IPI in the field, it'll still be riddled with deadlocks.
>
> Secondly, sending the IPI first in theory can be more intrusive to the
> state that we want to debug. It uses the currently running stack, paca
> save areas, etc. NMI IPI uses its own stack and save regions so it's a
> little more isolated. Maybe this is only a small advantage but I'd like
> to have it if we can.
>
> Thanks,
> Nick


I think the printk issue is solved (I sent a patch for that); what is
still missing is the rtas call spinlock.

I noticed that rtas.lock is taken in machine_kexec_mask_interrupts(),
which happens after crashing the other threads and getting into
real mode.

The following RTAS calls are made for each IRQ with a valid interrupt
descriptor:
ibm,int-off : reset the mask bit for that interrupt
ibm,set-xive : set the XIVE priority to 0xff

From what I could understand, these RTAS calls are there to put the
next kexec kernel (the kdump kernel) in a safer environment, so I
think it's not safe to just remove them.

(See commit d6c1a9081080c6c4658acf2a06d851feb2855933)
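
To be concrete, the masking path does roughly the following for each
such IRQ. This is a condensed sketch of the XICS ICS-RTAS backend, not
a verbatim copy; the token and server variable names are illustrative.

#include <asm/rtas.h>

static int ibm_int_off_token;	/* from rtas_token("ibm,int-off") */
static int ibm_set_xive_token;	/* from rtas_token("ibm,set-xive") */
static int default_server;	/* server the masked source points at */

static void mask_real_irq_via_rtas(unsigned int hw_irq)
{
	int rc;

	/* "ibm,int-off": mask (disable) the interrupt source */
	rc = rtas_call(ibm_int_off_token, 1, 1, NULL, hw_irq);
	if (rc)
		return;

	/* "ibm,set-xive": priority 0xff so the source can be reclaimed */
	rtas_call(ibm_set_xive_token, 3, 1, NULL, hw_irq,
		  default_server, 0xff);
}

Both rtas_call()s go through lock_rtas() and therefore take rtas.lock,
which is exactly the lock shown in the backtrace earlier in this thread.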

On the other hand, busting the rtas.lock could be dangerous, because
it's code we can't control.

According to LoPAR, for both of these RTAS calls, we have:

For the PowerPC External Interrupt option: The call must be reentrant
to the number of processors on the platform.
For the PowerPC External Interrupt option: The argument call buffer for
each simultaneous call must be physically unique.

Which I think means these RTAS calls can be done simultaneously.
Would it mean that busting the rtas.lock for these calls would be safe?

Best regards,
Leonardo Bras



2020-04-08 13:47:53

by Michael Ellerman

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Leonardo Bras <[email protected]> writes:
> Hello Nick, Michael,
>
> On Fri, 2020-04-03 at 16:41 +1000, Nicholas Piggin wrote:
> [...]
>> > > PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
>> > > except for a very small list of RTAS calls. So if we bust the RTAS lock
>> > > there's a risk we violate that part of PAPR and crash even harder.
>> >
>> > Interesting, I was not aware.
>> >
>> > > Also it's not specific to kdump, we can't even get through a normal
>> > > reboot if we crash with the RTAS lock held.
>> > >
>> > > Anyway here's a patch with some ideas. That allows me to get from a
>> > > crash with the RTAS lock held through kdump into the 2nd kernel. But it
>> > > only works if it's the crashing CPU that holds the RTAS lock.
>> > >
>> >
>> > Nice idea.
>> > But my test environment is just triggering a crash from sysrq, so I
>> > think it would not improve the result, given that this thread is
>> > probably not holding the lock at that point.
>>
>> Crash paths should not take that RTAS lock, it's a massive pain. I'm
>> fixing it for machine check, for other crashes I think it can be removed
>> too, it just needs to be unpicked. The good thing with crashing is that
>> you can reasonably *know* that you're single threaded, so you can
>> usually reason through situations like above.
>>
>> > I noticed that when rtas is locked, irqs and preemption are also
>> > disabled.
>> >
>> > Should the IPI sent by the crash code be able to interrupt a thread with
>> > disabled irqs?
>>
>> Yes. It's been a bit painful, but in the long term it means that a CPU
>> which hangs with interrupts off can be debugged, and it means we can
>> take it offline to crash without risking that it will be clobbering what
>> we're doing.
>>
>> Arguably what I should have done is try a regular IPI first, wait a few
>> seconds, then NMI IPI.
>>
>> A couple of problems with that. Firstly it probably avoids this issue
>> you hit almost all the time, so it won't get fixed. So when we really
>> need the NMI IPI in the field, it'll still be riddled with deadlocks.
>>
>> Secondly, sending the IPI first in theory can be more intrusive to the
>> state that we want to debug. It uses the currently running stack, paca
>> save areas, etc. NMI IPI uses its own stack and save regions so it's a
>> little more isolated. Maybe this is only a small advantage but I'd like
>> to have it if we can.
>
> I think the printk issue is solved (I sent a patch for that); what is
> still missing is the rtas call spinlock.
>
> I noticed that rtas.lock is taken in machine_kexec_mask_interrupts(),
> which happens after crashing the other threads and getting into
> real mode.
>
> The following RTAS calls are made for each IRQ with a valid interrupt
> descriptor:
> ibm,int-off : reset the mask bit for that interrupt
> ibm,set-xive : set the XIVE priority to 0xff
>
> From what I could understand, these RTAS calls are there to put the
> next kexec kernel (the kdump kernel) in a safer environment, so I
> think it's not safe to just remove them.

Yes.

> (See commit d6c1a9081080c6c4658acf2a06d851feb2855933)

In hindsight the person who wrote that commit was being lazy. We
*should* have made the 2nd kernel robust against the IRQ state being
messed up.

> On the other hand, busting the rtas.lock could be dangerous, because
> it's code we can't control.
>
> According to LoPAR, for both of these RTAS calls, we have:
>
> For the PowerPC External Interrupt option: The call must be reentrant
> to the number of processors on the platform.
> For the PowerPC External Interrupt option: The argument call buffer for
> each simultaneous call must be physically unique.

Oh well spotted. Where is that in the doc?

> Which I think means these RTAS calls can be done simultaneously.

I think so too. I'll read PAPR in the morning and make sure.

> Would it mean that busting the rtas.lock for these calls would be safe?

What would be better is to make those specific calls not take the global
RTAS lock to begin with.

We should be able to just allocate the rtas_args on the stack, it's only
~80 odd bytes. And then we can use rtas_call_unlocked() which doesn't
take the global lock.
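
Something like this untested sketch, for "ibm,int-off" (the token
variable is illustrative):

#include <asm/rtas.h>

static void int_off_unlocked(int ibm_int_off_token, unsigned int hw_irq)
{
	struct rtas_args args;	/* ~80 bytes on the stack */

	/* fills in args itself and never touches rtas.lock */
	rtas_call_unlocked(&args, ibm_int_off_token, 1, 1, hw_irq);

	/* on return the call status is in args.rets[0] (big-endian) */
}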

cheers

2020-04-08 19:40:19

by Leonardo Bras

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

On Wed, 2020-04-08 at 22:21 +1000, Michael Ellerman wrote:
[...]
> > On the other hand, busting the rtas.lock could be dangerous, because
> > it's code we can't control.
> >
> > According to LoPAR, for both of these RTAS calls, we have:
> >
> > For the PowerPC External Interrupt option: The call must be reentrant
> > to the number of processors on the platform.
> > For the PowerPC External Interrupt option: The argument call buffer for
> > each simultaneous call must be physically unique.
>
> Oh well spotted. Where is that in the doc?

In the current LoPAR available on OpenPower Foundation, it's on page
170, '7.3.10.2 ibm,set-xive' and '7.3.10.3 ibm,int-off'.

> > Which I think means these RTAS calls can be done simultaneously.
>
> I think so too. I'll read PAPR in the morning and make sure.
>
> > Would it mean that busting the rtas.lock for these calls would be safe?
>
> What would be better is to make those specific calls not take the global
> RTAS lock to begin with.
>
> We should be able to just allocate the rtas_args on the stack, it's only
> ~80 odd bytes. And then we can use rtas_call_unlocked() which doesn't
> take the global lock.

Good idea. I will try getting some work done on this.

Best regards,



2020-04-08 19:54:09

by Leonardo Bras

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

On Wed, 2020-04-08 at 22:21 +1000, Michael Ellerman wrote:
> We should be able to just allocate the rtas_args on the stack, it's only
> ~80 odd bytes. And then we can use rtas_call_unlocked() which doesn't
> take the global lock.

At this point, would it be a problem using kmalloc?

Best regards,



2020-04-08 22:57:29

by Leonardo Bras

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

On Wed, 2020-04-08 at 22:21 +1000, Michael Ellerman wrote:
[...]
> > According to LoPAR, for both of these RTAS calls, we have:
> >
> > For the PowerPC External Interrupt option: The call must be reentrant
> > to the number of processors on the platform.
> > For the PowerPC External Interrupt option: The argument call buffer for
> > each simultaneous call must be physically unique.
>
> Oh well spotted. Where is that in the doc?
>
> > Which I think means these RTAS calls can be done simultaneously.
>
> I think so too. I'll read PAPR in the morning and make sure.
>
> > Would it mean that busting the rtas.lock for these calls would be safe?
>
> What would be better is to make those specific calls not take the global
> RTAS lock to begin with.
>
> We should be able to just allocate the rtas_args on the stack, it's only
> ~80 odd bytes. And then we can use rtas_call_unlocked() which doesn't
> take the global lock.

Hello Michael,

I did the simplest possible version of this change:
http://patchwork.ozlabs.org/patch/1268371/

There I create rtas_call_reentrant() and replace rtas_call() with it
in all the possible callers of "ibm,int-on", "ibm,int-off",
"ibm,get-xive" and "ibm,set-xive".

At first, I was planning on creating a function that tells whether the
requested token is one of the above, before automatically choosing
between the common and reentrant versions. But it seemed like
unnecessary overhead, since there are only a few such calls and they
are very straightforward.

What do you think of this?

Best regards,
Leonardo Bras



2020-04-09 00:28:47

by Paul Mackerras

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

On Wed, Apr 08, 2020 at 10:21:29PM +1000, Michael Ellerman wrote:
>
> We should be able to just allocate the rtas_args on the stack, it's only
> ~80 odd bytes. And then we can use rtas_call_unlocked() which doesn't
> take the global lock.

Do we instantiate a 64-bit RTAS these days, or is it still 32-bit?
In the old days we had to make sure the RTAS argument buffer was
below the 4GB point. If that's still necessary then perhaps putting
rtas_args inside the PACA would be the way to go.

Paul.

2020-05-12 03:52:39

by Leonardo Brás

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Hello Paul, thanks for the reply!

On Thu, 2020-04-09 at 10:27 +1000, Paul Mackerras wrote:
> On Wed, Apr 08, 2020 at 10:21:29PM +1000, Michael Ellerman wrote:
> > We should be able to just allocate the rtas_args on the stack, it's only
> > ~80 odd bytes. And then we can use rtas_call_unlocked() which doesn't
> > take the global lock.
>
> Do we instantiate a 64-bit RTAS these days, or is it still 32-bit?

According to LoPAR, we can use instantiate-rtas or instantiate-rtas-64.
It looks like we do instantiate-rtas today (grep pointed only to
prom_instantiate_rtas()).

> In the old days we had to make sure the RTAS argument buffer was
> below the 4GB point. If that's still necessary then perhaps putting
> rtas_args inside the PACA would be the way to go.

Yes, we still need to make sure of this. I will study the PACA a bit
more and try to implement it that way.

Best regards,
Leonardo Bras



2020-05-12 10:43:57

by Michael Ellerman

Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Paul Mackerras <[email protected]> writes:
> On Wed, Apr 08, 2020 at 10:21:29PM +1000, Michael Ellerman wrote:
>>
>> We should be able to just allocate the rtas_args on the stack, it's only
>> ~80 odd bytes. And then we can use rtas_call_unlocked() which doesn't
>> take the global lock.
>
> Do we instantiate a 64-bit RTAS these days, or is it still 32-bit?

No, yes.

> In the old days we had to make sure the RTAS argument buffer was
> below the 4GB point.

Yes you're right, that's still true.

I was thinking we were on the emergency stack, but we may not be.

> If that's still necessary then perhaps putting rtas_args inside the
> PACA would be the way to go.

Yeah I guess. Allocating a struct within the RMO for each CPU is not
that simple vs just putting it in the paca.
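
Roughly like this untested sketch (all names invented; the paca has to
be real-mode accessible anyway, so a buffer embedded there should
satisfy the addressing constraint):

#include <asm/byteorder.h>
#include <asm/paca.h>
#include <asm/rtas.h>

/* add to struct paca_struct in asm/paca.h:
 *	struct rtas_args rtas_args_reentrant;
 */

static int rtas_call_reentrant_sketch(int token, unsigned int hw_irq)
{
	struct rtas_args *args = &local_paca->rtas_args_reentrant;

	/* per-CPU buffer, no rtas.lock; only valid for the calls that
	 * LoPAR explicitly allows to be made concurrently */
	rtas_call_unlocked(args, token, 1, 1, hw_irq);

	return be32_to_cpu(args->rets[0]);
}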

cheers