Fix a bug where enabling hardware virtualization can race with a forced
reboot, e.g. `reboot -f`, and result in virt hardware being enabled when
the reboot is attempted, and thus hanging the reboot.
Found by inspection, confirmed by hacking the reboot flow to wait until
KVM loads (the problematic window is ridiculously small).
Tested only on x86, though there would have to be some seriously subtle
arch and/or driver code for this to break other architectures.
Sean Christopherson (2):
KVM: Use syscore_ops instead of reboot_notifier to hook
restart/shutdown
KVM: Don't enable hardware after a restart/shutdown is initiated
virt/kvm/kvm_main.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)
base-commit: 45dd9bc75d9adc9483f0c7d662ba6e73ed698a0b
--
2.40.0.rc1.284.g88254d51c5-goog
Use syscore_ops.shutdown to disable hardware virtualization during a
reboot instead of using the dedicated reboot_notifier so that KVM disables
virtualization _after_ system_state has been updated. This will allow
fixing a race in KVM's handling of a forced reboot where KVM can end up
enabling hardware virtualization between kernel_restart_prepare() and
machine_restart().
Cc: Marc Zyngier <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: James Morse <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Zenghui Yu <[email protected]>
Cc: [email protected]
Cc: Huacai Chen <[email protected]>
Cc: Aleksandar Markovic <[email protected]>
Cc: Anup Patel <[email protected]>
Cc: Atish Patra <[email protected]>
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..6cdfbb2c641b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5211,8 +5211,7 @@ static int hardware_enable_all(void)
return r;
}
-static int kvm_reboot(struct notifier_block *notifier, unsigned long val,
- void *v)
+static void kvm_reboot(void)
{
/*
* Some (well, at least mine) BIOSes hang on reboot if
@@ -5223,14 +5222,8 @@ static int kvm_reboot(struct notifier_block *notifier, unsigned long val,
pr_info("kvm: exiting hardware virtualization\n");
kvm_rebooting = true;
on_each_cpu(hardware_disable_nolock, NULL, 1);
- return NOTIFY_OK;
}
-static struct notifier_block kvm_reboot_notifier = {
- .notifier_call = kvm_reboot,
- .priority = 0,
-};
-
static int kvm_suspend(void)
{
/*
@@ -5261,6 +5254,8 @@ static void kvm_resume(void)
static struct syscore_ops kvm_syscore_ops = {
.suspend = kvm_suspend,
.resume = kvm_resume,
+ .shutdown = kvm_reboot,
+
};
#else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
static int hardware_enable_all(void)
@@ -5967,7 +5962,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
if (r)
return r;
- register_reboot_notifier(&kvm_reboot_notifier);
register_syscore_ops(&kvm_syscore_ops);
#endif
@@ -6039,7 +6033,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
err_vcpu_cache:
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
unregister_syscore_ops(&kvm_syscore_ops);
- unregister_reboot_notifier(&kvm_reboot_notifier);
cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
#endif
return r;
@@ -6065,7 +6058,6 @@ void kvm_exit(void)
kvm_async_pf_deinit();
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
unregister_syscore_ops(&kvm_syscore_ops);
- unregister_reboot_notifier(&kvm_reboot_notifier);
cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
#endif
kvm_irqfd_exit();
--
2.40.0.rc1.284.g88254d51c5-goog
Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
been initiated to avoid re-enabling hardware between kvm_reboot() and
machine_{halt,power_off,restart}(). The restart case is especially
problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
is unable to wake and rendezvous with APs.
Note, this bug, and the original issue that motivated the addition of
kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
In a "normal" reboot, userspace will gracefully teardown userspace before
triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
that might do ioctl(KVM_CREATE_VM) is long gone.
Fixes: 8e1c18157d87 ("KVM: VMX: Disable VMX when system shutdown")
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6cdfbb2c641b..b2bf4c105181 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5182,7 +5182,20 @@ static void hardware_disable_all(void)
static int hardware_enable_all(void)
{
atomic_t failed = ATOMIC_INIT(0);
- int r = 0;
+ int r;
+
+ /*
+ * Do not enable hardware virtualization if the system is going down.
+ * If userspace initiated a forced reboot, e.g. reboot -f, then it's
+ * possible for an in-flight KVM_CREATE_VM to trigger hardware enabling
+ * after kvm_reboot() is called. Note, this relies on system_state
+ * being set _before_ kvm_reboot(), which is why KVM uses a syscore ops
+ * hook instead of registering a dedicated reboot notifier (the latter
+ * runs before system_state is updated).
+ */
+ if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
+ system_state == SYSTEM_RESTART)
+ return -EBUSY;
/*
* When onlining a CPU, cpu_online_mask is set before kvm_online_cpu()
@@ -5195,6 +5208,8 @@ static int hardware_enable_all(void)
cpus_read_lock();
mutex_lock(&kvm_lock);
+ r = 0;
+
kvm_usage_count++;
if (kvm_usage_count == 1) {
on_each_cpu(hardware_enable_nolock, &failed, 1);
--
2.40.0.rc1.284.g88254d51c5-goog
On Fri, 10 Mar 2023 22:14:13 +0000,
Sean Christopherson <[email protected]> wrote:
>
> Use syscore_ops.shutdown to disable hardware virtualization during a
> reboot instead of using the dedicated reboot_notifier so that KVM disables
> virtualization _after_ system_state has been updated. This will allow
> fixing a race in KVM's handling of a forced reboot where KVM can end up
> enabling hardware virtualization between kernel_restart_prepare() and
> machine_restart().
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Oliver Upton <[email protected]>
> Cc: James Morse <[email protected]>
> Cc: Suzuki K Poulose <[email protected]>
> Cc: Zenghui Yu <[email protected]>
> Cc: [email protected]
> Cc: Huacai Chen <[email protected]>
> Cc: Aleksandar Markovic <[email protected]>
> Cc: Anup Patel <[email protected]>
> Cc: Atish Patra <[email protected]>
> Cc: [email protected]
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/kvm_main.c | 14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d255964ec331..6cdfbb2c641b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -5211,8 +5211,7 @@ static int hardware_enable_all(void)
> return r;
> }
>
> -static int kvm_reboot(struct notifier_block *notifier, unsigned long val,
> - void *v)
> +static void kvm_reboot(void)
> {
> /*
> * Some (well, at least mine) BIOSes hang on reboot if
> @@ -5223,14 +5222,8 @@ static int kvm_reboot(struct notifier_block *notifier, unsigned long val,
> pr_info("kvm: exiting hardware virtualization\n");
> kvm_rebooting = true;
> on_each_cpu(hardware_disable_nolock, NULL, 1);
> - return NOTIFY_OK;
> }
>
> -static struct notifier_block kvm_reboot_notifier = {
> - .notifier_call = kvm_reboot,
> - .priority = 0,
> -};
> -
> static int kvm_suspend(void)
> {
> /*
> @@ -5261,6 +5254,8 @@ static void kvm_resume(void)
> static struct syscore_ops kvm_syscore_ops = {
> .suspend = kvm_suspend,
> .resume = kvm_resume,
> + .shutdown = kvm_reboot,
> +
nit: consider renaming the kvm_reboot to kvm_shutdown to match the
syscore structure, and drop the spurious blank line.
M.
--
Without deviation from the norm, progress is not possible.
On Fri, 10 Mar 2023 22:14:14 +0000,
Sean Christopherson <[email protected]> wrote:
>
> Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
> been initiated to avoid re-enabling hardware between kvm_reboot() and
> machine_{halt,power_off,restart}(). The restart case is especially
> problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
> SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
> is unable to wake and rendezvous with APs.
>
> Note, this bug, and the original issue that motivated the addition of
> kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
> In a "normal" reboot, userspace will gracefully teardown userspace before
> triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
> that might do ioctl(KVM_CREATE_VM) is long gone.
>
> Fixes: 8e1c18157d87 ("KVM: VMX: Disable VMX when system shutdown")
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/kvm_main.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6cdfbb2c641b..b2bf4c105181 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -5182,7 +5182,20 @@ static void hardware_disable_all(void)
> static int hardware_enable_all(void)
> {
> atomic_t failed = ATOMIC_INIT(0);
> - int r = 0;
> + int r;
> +
> + /*
> + * Do not enable hardware virtualization if the system is going down.
> + * If userspace initiated a forced reboot, e.g. reboot -f, then it's
> + * possible for an in-flight KVM_CREATE_VM to trigger hardware enabling
> + * after kvm_reboot() is called. Note, this relies on system_state
> + * being set _before_ kvm_reboot(), which is why KVM uses a syscore ops
> + * hook instead of registering a dedicated reboot notifier (the latter
> + * runs before system_state is updated).
> + */
> + if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
> + system_state == SYSTEM_RESTART)
> + return -EBUSY;
Since we now seem to be relying on system_state for most things, is
there any use for 'kvm_rebooting' other than the ease of evaluation in
__svm_vcpu_run?
M.
--
Without deviation from the norm, progress is not possible.
On Sun, Mar 12, 2023, Marc Zyngier wrote:
> On Fri, 10 Mar 2023 22:14:14 +0000,
> Sean Christopherson <[email protected]> wrote:
> >
> > Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
> > been initiated to avoid re-enabling hardware between kvm_reboot() and
> > machine_{halt,power_off,restart}(). The restart case is especially
> > problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
> > SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
> > is unable to wake and rendezvous with APs.
> >
> > Note, this bug, and the original issue that motivated the addition of
> > kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
> > In a "normal" reboot, userspace will gracefully teardown userspace before
> > triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
> > that might do ioctl(KVM_CREATE_VM) is long gone.
> >
> > Fixes: 8e1c18157d87 ("KVM: VMX: Disable VMX when system shutdown")
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > virt/kvm/kvm_main.c | 17 ++++++++++++++++-
> > 1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 6cdfbb2c641b..b2bf4c105181 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -5182,7 +5182,20 @@ static void hardware_disable_all(void)
> > static int hardware_enable_all(void)
> > {
> > atomic_t failed = ATOMIC_INIT(0);
> > - int r = 0;
> > + int r;
> > +
> > + /*
> > + * Do not enable hardware virtualization if the system is going down.
> > + * If userspace initiated a forced reboot, e.g. reboot -f, then it's
> > + * possible for an in-flight KVM_CREATE_VM to trigger hardware enabling
> > + * after kvm_reboot() is called. Note, this relies on system_state
> > + * being set _before_ kvm_reboot(), which is why KVM uses a syscore ops
> > + * hook instead of registering a dedicated reboot notifier (the latter
> > + * runs before system_state is updated).
> > + */
> > + if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
> > + system_state == SYSTEM_RESTART)
> > + return -EBUSY;
>
> Since we now seem to be relying on system_state for most things, is
> there any use for 'kvm_rebooting' other than the ease of evaluation in
> __svm_vcpu_run?
Sadly, yes. The x86 implementations of emergency_restart(), __crash_kexec() and
other emergency reboot flows disable virtualization and set 'kvm_rebooting'
without touching system_state. VMX and SVM rely on 'kvm_rebooting' being set to
avoid triggering (another) BUG() during the emergency.
On my todo list is to better understand whether or not the other architectures
that utilize the generic hardware enabling (ARM, RISC-V, MIPS) truly need to disable
virtualization during a reboot, versus KVM simply being polite. E.g. on x86, if VMX
is left enabled, reboot may hang depending on how the reboot is performed. If
other architectures really truly need to disable virtualization, then they likely
need something similar to x86's emergency reboot shenanigans.
On Mon, 13 Mar 2023 15:02:27 +0000,
Sean Christopherson <[email protected]> wrote:
>
> On my todo list is to better understand whether or not the other architectures
> that utilize the generic hardware enabling (ARM, RISC-V, MIPS) truly need to disable
> virtualization during a reboot, versus KVM simply being polite. E.g. on x86, if VMX
> is left enabled, reboot may hang depending on how the reboot is performed. If
> other architectures really truly need to disable virtualization, then they likely
> need something similar to x86's emergency reboot shenanigans.
At least pre-CCA, there isn't much to do, because there is no such
thing as "disabling virtualisation". For kexec, the only things we
need to do are to go back to EL2 in the nVHE case, and in any case to
put all other CPUs back into the firmware (PSCI CPU_OFF).
CCA may well add other things into the picture, because it is a
parallel exception level that KVM doesn't really control. That's one
of the many open questions I have about this "lovely" piece of
architecture.
Of course, if we were to completely ignore CCA and instead use the
underlying HW (aka RME), things would be a lot simpler and we'd be
back to my original statement...
M.
--
Without deviation from the norm, progress is not possible.