2023-12-13 07:03:04

by James Gowans

Subject: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

syscore_shutdown() runs driver and module callbacks to get the system
into a state where it can be correctly shut down. In commit
6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
syscore_shutdown() was removed from kernel_restart_prepare() and hence
got (incorrectly?) removed from the kexec flow. This was innocuous until
commit 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
changed the way that KVM registered its shutdown callbacks, switching from
reboot notifiers to syscore_ops.shutdown. As syscore_shutdown() is
missing from kexec, KVM's shutdown hook is not run and virtualisation is
left enabled on the boot CPU which results in triple faults when
switching to the new kernel on Intel x86 VT-x with VMXE enabled.
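
To make the failure mode concrete, here is a rough userspace model of the
mechanism described above. It is only a sketch: every model_* name is made
up for illustration, the registry is a fixed array rather than the kernel's
list, and model_virt_enabled merely stands in for "VMXE is set on the boot
CPU". The reverse walk reflects my understanding of how syscore_shutdown()
iterates and should be treated as an assumption of the model.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct syscore_ops; only .shutdown is modelled. */
struct model_syscore_ops {
	void (*shutdown)(void);
};

#define MODEL_MAX_OPS 8
static struct model_syscore_ops *model_ops[MODEL_MAX_OPS];
static size_t model_nr_ops;

/* Stands in for "virtualisation (VMXE) is enabled on the boot CPU". */
static bool model_virt_enabled;

/* Model of KVM's shutdown hook: turn virtualisation off. */
static void model_kvm_shutdown(void)
{
	model_virt_enabled = false;
}

static struct model_syscore_ops model_kvm_ops = {
	.shutdown = model_kvm_shutdown,
};

/* Append ops to the registry; returns -1 if the registry is full. */
static int model_register_syscore_ops(struct model_syscore_ops *ops)
{
	if (model_nr_ops == MODEL_MAX_OPS)
		return -1;
	model_ops[model_nr_ops++] = ops;
	return 0;
}

/* Call every registered shutdown hook, newest registration first. */
static void model_syscore_shutdown(void)
{
	for (size_t i = model_nr_ops; i > 0; i--)
		if (model_ops[i - 1]->shutdown)
			model_ops[i - 1]->shutdown();
}

/*
 * Model of the kexec path. Returns true if the jump to the new kernel
 * would happen with virtualisation already disabled.
 */
static bool model_kernel_kexec(bool call_syscore_shutdown)
{
	if (call_syscore_shutdown)
		model_syscore_shutdown();
	return !model_virt_enabled;
}
```

After registering model_kvm_ops and setting model_virt_enabled, calling
model_kernel_kexec(false) returns false, which corresponds to the pre-patch
behaviour, while model_kernel_kexec(true) returns true.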

Fix this by adding syscore_shutdown() to the kexec sequence. In terms of
where to add it, it is being added after migrating the kexec task to the
boot CPU, but before APs are shut down. It is not totally clear if this
is the best place: in commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
it is stated that "syscore_ops operations should be carried with one
CPU on-line and interrupts disabled." APs are only offlined later in
machine_shutdown(), so this syscore_shutdown() is being run while APs
are still online. This seems to be the correct place as it matches where
syscore_shutdown() is run in the reboot and halt flows - they also run
it before APs are shut down. The assumption is that the commit message
in commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
is no longer valid.
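
The ordering described above can be sketched in a few lines of userspace C.
This is purely illustrative: the order_* functions are hypothetical
stand-ins named after the kernel steps, ORDER_NR_CPUS is an arbitrary
value, and the only thing the model tracks is how many CPUs are online when
the syscore shutdown step runs.

```c
#define ORDER_NR_CPUS 4

static int order_online_cpus = ORDER_NR_CPUS;
/* Number of CPUs that were online when the syscore shutdown step ran. */
static int order_cpus_seen_by_syscore = -1;

/* Only moves the calling task to the boot CPU; no CPU is offlined. */
static void order_migrate_to_reboot_cpu(void)
{
}

/* Record how many CPUs are still online at this point in the sequence. */
static void order_syscore_shutdown(void)
{
	order_cpus_seen_by_syscore = order_online_cpus;
}

/* APs are offlined here, leaving only the boot CPU. */
static void order_machine_shutdown(void)
{
	order_online_cpus = 1;
}

/* The sequence with this patch applied, reduced to the three steps. */
static void order_kernel_kexec(void)
{
	order_migrate_to_reboot_cpu();
	order_syscore_shutdown();
	order_machine_shutdown();
}
```

Running order_kernel_kexec() leaves order_cpus_seen_by_syscore equal to
ORDER_NR_CPUS, i.e. the hooks run while the APs are still online.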

KVM has been discussed here as it is what broke loudly by not having
syscore_shutdown() in kexec, but this change impacts more than just KVM;
all drivers/modules which register a syscore_ops.shutdown callback will
now be invoked in the kexec flow. Looking at some of them like x86 MCE
it is probably more correct to also shut these down during kexec.
Maintainers of all drivers which use syscore_ops.shutdown are added on
CC for visibility. They are:

arch/powerpc/platforms/cell/spu_base.c .shutdown = spu_shutdown,
arch/x86/kernel/cpu/mce/core.c .shutdown = mce_syscore_shutdown,
arch/x86/kernel/i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-sun6i-r.c .shutdown = sun6i_r_intc_shutdown,
drivers/leds/trigger/ledtrig-cpu.c .shutdown = ledtrig_cpu_syscore_shutdown,
drivers/power/reset/sc27xx-poweroff.c .shutdown = sc27xx_poweroff_shutdown,
kernel/irq/generic-chip.c .shutdown = irq_gc_shutdown,
virt/kvm/kvm_main.c .shutdown = kvm_shutdown,

This has been tested by doing a kexec on x86_64 and aarch64.

Fixes: 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")

Signed-off-by: James Gowans <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Chen-Yu Tsai <[email protected]>
Cc: Jernej Skrabec <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Sebastian Reichel <[email protected]>
Cc: Orson Zhai <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Jan H. Schoenherr <[email protected]>
---
kernel/kexec_core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index be5642a4ec49..b926c4db8a91 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1254,6 +1254,7 @@ int kernel_kexec(void)
 		kexec_in_progress = true;
 		kernel_restart_prepare("kexec reboot");
 		migrate_to_reboot_cpu();
+		syscore_shutdown();
 
 		/*
 		 * migrate_to_reboot_cpu() disables CPU hotplug assuming that
--
2.34.1


2023-12-13 16:41:37

by Eric W. Biederman

Subject: Re: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

From the 10,000 foot perspective:
Acked-by: "Eric W. Biederman" <[email protected]>


Eric


2023-12-18 12:43:02

by James Gowans

Subject: Re: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

Hi Eric,

On Wed, 2023-12-13 at 10:39 -0600, Eric W. Biederman wrote:
> From the 10,000 foot perspective:
> Acked-by: "Eric W. Biederman" <[email protected]>

Thanks for the ACK!
What's the next step to get this into the kexec tree?

JG


2023-12-19 04:23:27

by Baoquan He

Subject: Re: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

Add Andrew to CC as Andrew helps to pick kexec/kdump patches.

On 12/13/23 at 08:40am, James Gowans wrote:
......
> This has been tested by doing a kexec on x86_64 and aarch64.

Hi James,

Thanks for this great patch. My colleagues have opened a bug in RHEL to
track this and are trying to verify this patch. However, they can't reproduce
the issue this patch is fixing. Could you tell us more about where and how
to reproduce it so that we can understand it better? Thanks in advance.

Thanks
Baoquan



2023-12-19 07:42:11

by James Gowans

Subject: Re: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

On Tue, 2023-12-19 at 12:22 +0800, Baoquan He wrote:
> Add Andrew to CC as Andrew helps to pick kexec/kdump patches.

Ah, thanks, I didn't realise that Andrew pulls in the kexec patches.
>
> On 12/13/23 at 08:40am, James Gowans wrote:
> ......
> > This has been tested by doing a kexec on x86_64 and aarch64.
>
> Hi James,
>
> Thanks for this great patch. My colleagues have opened a bug in RHEL to
> track this and are trying to verify this patch. However, they can't reproduce
> the issue this patch is fixing. Could you tell us more about where and how
> to reproduce it so that we can understand it better? Thanks in advance.

Sure! The TL;DR is: run a VMX (Intel x86) KVM VM on Linux v6.4+ and do a
kexec while the KVM VM is still running. Before this patch the system
will triple fault.

In more detail:
Run a bare metal host on a modern Intel CPU with VMX support. The kernel
I was using was 6.7.0-rc5+.
You can totally do this with a QEMU "host" as well, btw, that's how I
did the debugging and attached GDB to it to figure out what was up.

If you want a virtual "host" launch with:

-cpu host -M q35,kernel-irqchip=split,accel=kvm -enable-kvm

Launch a KVM guest VM, eg:

qemu-system-x86_64 \
-enable-kvm \
-cdrom alpine-virt-3.19.0-x86_64.iso \
-nodefaults -nographic -M q35 \
-serial mon:stdio

While the guest VM is *still running* do a kexec on the host, eg:

kexec -l --reuse-cmdline --initrd=config-6.7.0-rc5+ vmlinuz-6.7.0-rc5+ && \
kexec -e

The kexec can be to anything, but I generally just kexec to the same
kernel/ramdisk as is currently running. Ie: same-version kexec.

Before this patch the kexec will get stuck; after it, the kexec will go
smoothly and the system will end up in the new kernel in a few seconds.

I hope those steps are clear and you can repro this?

BTW, the reason that it's important for the KVM VM to still be running
when the host does the kexec is because KVM internally maintains a usage
counter and will disable virtualisation once all VMs have been
terminated, via:

__fput(kvm_fd)
  kvm_vm_release
    kvm_destroy_vm
      hardware_disable_all
        hardware_disable_all_nolock
          kvm_usage_count--;
          if (!kvm_usage_count)
            on_each_cpu(hardware_disable_nolock, NULL, 1);

So if all KVM fds are closed then kexec will work because VMXE is
cleared on all CPUs when the last VM is destroyed. If the KVM fds are
still open (ie: QEMU process still exists) then the issue manifests. It
sounds nasty to do a kexec while QEMU processes are still around but
this is a perfectly normal flow for live update:
1. Pause and serialise VM state
2. kexec
3. Deserialise and resume VMs
In that flow there's no need to actually kill the QEMU process, as long
as the VM is *paused* and has been serialised we can happily kexec.
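
The counter behaviour can be modelled in userspace with a small sketch. All
sketch_* names and SKETCH_NR_CPUS are invented for illustration. Two
assumptions go beyond what is described above: that virtualisation gets
enabled on every CPU when the first VM is created, and that the shutdown
hook disables it on every CPU regardless of the counter.

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS 4

/* true means virtualisation (VMXE in the VMX case) is enabled on that CPU. */
static bool sketch_virt_on[SKETCH_NR_CPUS];
static int sketch_usage_count;

static void sketch_set_all_cpus(bool on)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sketch_virt_on[cpu] = on;
}

/* Assumption: creating the first VM enables virtualisation on every CPU. */
static void sketch_create_vm(void)
{
	if (sketch_usage_count++ == 0)
		sketch_set_all_cpus(true);
}

/* Destroying the last VM disables it again, as in the call chain above. */
static void sketch_destroy_vm(void)
{
	if (--sketch_usage_count == 0)
		sketch_set_all_cpus(false);
}

/* Assumption: the shutdown hook disables virtualisation whatever the count. */
static void sketch_shutdown_hook(void)
{
	sketch_set_all_cpus(false);
}

/* CPU 0 plays the boot CPU; kexec is safe only with virtualisation off. */
static bool sketch_boot_cpu_safe_for_kexec(void)
{
	return !sketch_virt_on[0];
}
```

With one VM created and not destroyed, sketch_boot_cpu_safe_for_kexec()
returns false until either sketch_destroy_vm() or sketch_shutdown_hook() is
called, which mirrors why the repro needs the VM to still exist at kexec
time.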

JG

2023-12-19 08:26:50

by Baoquan He

Subject: Re: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

On 12/19/23 at 07:41am, Gowans, James wrote:
> Sure! The TL;DR is: run a VMX (Intel x86) KVM VM on Linux v6.4+ and do a
> kexec while the KVM VM is still running. Before this patch the system
> will triple fault.

Thanks a lot for these details, I will forward this to our QE to try.



2024-01-09 07:00:06

by James Gowans

Subject: Re: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

+ akpm

Hi Eric and Andrew,
Just checking in on this patch.
Would be keen to get the fix merged if you're okay with it, or some
feedback.

I'm also still keen for input from the driver maintainers on CC on whether
they support, or object to, their shutdown hooks being invoked on
kexec.

JG
