On 10/16/2015 12:30 AM, Laurent Vivier wrote:
> On kexec, all offline secondary CPUs are onlined before
> starting the new kernel; this is not done in the case of kdump.
>
> If kdump is configured and a kernel crash occurs while
> some secondary CPUs are offline (SMT=off),
> the new kernel is not able to start them and displays
> "Processor X is stuck." messages.
>
> Starting with POWER8, subcore logic relies on all threads of
> a core being booted. So, on startup, the kernel tries to start all
> threads, and asks OPAL (or RTAS) to start all CPUs (including
> threads). If a CPU has been offlined by the previous kernel,
> it has not been returned to OPAL, and thus OPAL cannot restart
> it: this CPU has been lost...
>
> Signed-off-by: Laurent Vivier<[email protected]>

Hi Laurent,

Sorry for jumping into this so late.
Are you seeing this issue even with the patches below?

pseries:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c1caae3de46a072d0855729aed6e793e536a4a55

opal/powernv:
https://github.com/open-power/skiboot/commit/9ee56b5

Thanks
Hari

> ---
> arch/powerpc/kernel/crash.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/arch/powerpc/kernel/crash.c b/arch/powerpc/kernel/crash.c
> index 51dbace..3ca9452 100644
> --- a/arch/powerpc/kernel/crash.c
> +++ b/arch/powerpc/kernel/crash.c
> @@ -19,6 +19,7 @@
> #include <linux/delay.h>
> #include <linux/irq.h>
> #include <linux/types.h>
> +#include <linux/cpu.h>
>
> #include <asm/processor.h>
> #include <asm/machdep.h>
> @@ -299,11 +300,30 @@ int crash_shutdown_unregister(crash_shutdown_t handler)
> }
> EXPORT_SYMBOL(crash_shutdown_unregister);
>
> +/*
> + * The next kernel will try to start all secondary CPUs and if
> + * they are not online it will fail to start them.
> + *
> + */
> +static void wake_offline_cpus(void)
> +{
> +	int cpu = 0;
> +
> +	for_each_present_cpu(cpu) {
> +		if (!cpu_online(cpu)) {
> +			pr_info("kexec: Waking offline cpu %d.\n", cpu);
> +			cpu_up(cpu);
> +		}
> +	}
> +}
> +
> void default_machine_crash_shutdown(struct pt_regs *regs)
> {
> unsigned int i;
> int (*old_handler)(struct pt_regs *regs);
>
> + wake_offline_cpus();
> +
> /*
> * This function is only called after the system
> * has panicked or is otherwise in a critical state.
On 04/11/2015 13:34, Hari Bathini wrote:
> Hi Laurent,
Hi Hari,
> Sorry for jumping too late into this.
better late than never :)
> Are you seeing this issue even with the below patches:
>
> pseries:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c1caae3de46a072d0855729aed6e793e536a4a55
>
>
> opal/powernv:
> https://github.com/open-power/skiboot/commit/9ee56b5
Very interesting. Is there a way to get a firmware with the fix?

Thanks,
Laurent
> Thanks
> Hari
On Wed, 4 Nov 2015 14:54:51 +0100
Laurent Vivier <[email protected]> wrote:
>
>
> On 04/11/2015 13:34, Hari Bathini wrote:
> > Hi Laurent,
>
> Hi Hari,
>
> > Sorry for jumping too late into this.
>
> better late than never :)
>
> > Are you seeing this issue even with the below patches:
> >
> > pseries:
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c1caae3de46a072d0855729aed6e793e536a4a55
Unfortunately, this is unlikely to be relevant - this fixes a failure
while setting up the kexec. The problem we see occurs once we've
booted the second kernel and it's attempting to bring up secondary CPUs.
> > opal/powernv:
> > https://github.com/open-power/skiboot/commit/9ee56b5
>
> Very interesting. Is there a way to have a firmware with the fix ?
From Laurent's analysis of the crash, I don't think this will be
relevant either, but I'm not sure. It would be very interesting to
know which (if any) released firmware versions include this patch so
we can test it.
--
David Gibson <[email protected]>
Senior Software Engineer, Virtualization, Red Hat
David Gibson <[email protected]> writes:
>> > opal/powernv:
>> > https://github.com/open-power/skiboot/commit/9ee56b5
>>
>> Very interesting. Is there a way to have a firmware with the fix ?
>
> From Laurent's analysis of the crash, I don't think this will be
> relevant either, but I'm not sure. It would be very interesting to
> know which (if any) released firmwares include this patch so we can
> test it.
It'll be on the (just released) IBM LC machines (the ones with the AMI
BMC) and will be in FW840, the next major firmware version for FSP-based
machines (the -L machines), which should be out within the next
month. Let me know if you want a build of that; we should be able to get
one to you.

For any OpenPower machine you can always build a custom skiboot and
flash it :)
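
A note for anyone trying to reproduce the condition from the top of the
thread: the "Processor X is stuck." messages depend on some present CPUs
being offline at the time of the crash. The sketch below is not part of
the patch. It is a small userspace C helper, with function names made up
for illustration, that compares the sysfs files
/sys/devices/system/cpu/present and /sys/devices/system/cpu/online and
counts the CPUs that are present but offline. It assumes both files use
the usual cpulist format (for example "0-7,16") and caps CPU numbers at
an arbitrary MAX_CPUS.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 4096	/* arbitrary upper bound chosen for this sketch */

/*
 * Parse a cpulist string such as "0-3,8,10-11" and set mask[cpu] for
 * every listed CPU. Returns 0 on success, -1 on malformed input or on
 * a CPU number >= MAX_CPUS.
 */
int parse_cpulist(const char *list, bool mask[MAX_CPUS])
{
	const char *p = list;

	memset(mask, 0, MAX_CPUS * sizeof(mask[0]));
	while (*p != '\0' && *p != '\n') {
		char *end;
		long first, last;

		first = strtol(p, &end, 10);
		if (end == p)
			return -1;
		last = first;
		p = end;
		if (*p == '-') {		/* range "first-last" */
			p++;
			last = strtol(p, &end, 10);
			if (end == p)
				return -1;
			p = end;
		}
		if (first < 0 || last < first || last >= MAX_CPUS)
			return -1;
		for (long cpu = first; cpu <= last; cpu++)
			mask[cpu] = true;
		if (*p == ',')
			p++;
		else if (*p != '\0' && *p != '\n')
			return -1;
	}
	return 0;
}

/* Count CPUs that are set in present[] but not in online[]. */
int count_offline(const bool present[MAX_CPUS], const bool online[MAX_CPUS])
{
	int n = 0;

	for (int cpu = 0; cpu < MAX_CPUS; cpu++)
		if (present[cpu] && !online[cpu])
			n++;
	return n;
}

/* Read the first line of a file into buf. Returns 0 on success. */
int read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, (int)len, f))
		ret = 0;
	fclose(f);
	return ret;
}

/* Returns the number of present-but-offline CPUs, or -1 on error. */
int offline_cpus_from_sysfs(void)
{
	static bool present[MAX_CPUS], online[MAX_CPUS];
	char buf[4096];

	if (read_line("/sys/devices/system/cpu/present", buf, sizeof(buf)) ||
	    parse_cpulist(buf, present))
		return -1;
	if (read_line("/sys/devices/system/cpu/online", buf, sizeof(buf)) ||
	    parse_cpulist(buf, online))
		return -1;
	return count_offline(present, online);
}
```

Calling offline_cpus_from_sysfs() from a small main() before triggering
a test crash shows whether the system is in the state the patch is
about: a non-zero result (as expected with SMT=off) means the kdump
kernel would have to start CPUs the crashed kernel had offlined.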