Register KVM's cpuhp and syscore callbacks when enabling virtualization in
hardware, as the sole purpose of said callbacks is to disable and re-enable
virtualization as needed.
The primary motivation for this series is to simplify dealing with enabling
virtualization for Intel's TDX, which needs to enable virtualization
when kvm-intel.ko is loaded, i.e. long before the first VM is created. TDX
doesn't _need_ to keep virtualization enabled, but doing so is much simpler
for KVM (see patch 3).
That said, this is a nice cleanup on its own, assuming I haven't broken
something. By registering the callbacks on-demand, the callbacks themselves
don't need to check kvm_usage_count, because their very existence implies a
non-zero count.
The meat is in patch 1. Patch 2 renames the helpers so that patch 3 is
less awkward. Patch 3 adds a module param to enable virtualization when KVM
is loaded. Patches 4-6 are tangentially related x86 cleanups that register
KVM's "emergency disable" callback on-demand, same as the syscore callbacks.
Moderately well tested on x86, though I haven't (yet) done due diligence on
the suspend/resume and cpuhp paths. Compile tested on other architectures,
but that's all for this version.
v2:
- Use a dedicated mutex to avoid lock inversion issues between kvm_lock and
the cpuhp lock.
- Register emergency disable callbacks on-demand. [Kai]
- Drop an unintended s/junk/ign rename. [Kai]
- Decrement kvm_usage_count on failure. [Chao]
v1: https://lore.kernel.org/all/[email protected]
Sean Christopherson (6):
KVM: Register cpuhp and syscore callbacks when enabling hardware
KVM: Rename functions related to enabling virtualization hardware
KVM: Add a module param to allow enabling virtualization when KVM is
loaded
KVM: Add arch hooks for enabling/disabling virtualization
x86/reboot: Unconditionally define cpu_emergency_virt_cb typedef
KVM: x86: Register "emergency disable" callbacks when virt is enabled
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/include/asm/reboot.h | 2 +-
arch/x86/kvm/svm/svm.c | 5 +-
arch/x86/kvm/vmx/main.c | 2 +
arch/x86/kvm/vmx/vmx.c | 6 +-
arch/x86/kvm/vmx/x86_ops.h | 1 +
arch/x86/kvm/x86.c | 10 ++
include/linux/kvm_host.h | 2 +
virt/kvm/kvm_main.c | 258 ++++++++++++++++----------------
9 files changed, 150 insertions(+), 139 deletions(-)
base-commit: 4aad0b1893a141f114ba40ed509066f3c9bc24b0
--
2.45.0.215.g3402c0e53f-goog
Register KVM's cpuhp and syscore callbacks when enabling virtualization
in hardware instead of registering the callbacks during initialization,
and let the CPU up/down framework invoke the inner enable/disable
functions. Registering the callbacks during initialization makes things
more complex than they need to be, as KVM needs to be very careful about
handling races between CPUs being onlined/offlined and hardware
being enabled/disabled.
Intel TDX support will require KVM to enable virtualization during KVM
initialization, i.e. will add another wrinkle to things, at which point
sorting out the potential races with kvm_usage_count would become even
more complex.
Use a dedicated mutex to guard kvm_usage_count, as taking kvm_lock outside
cpu_hotplug_lock is disallowed. Ideally, KVM would *always* take kvm_lock
outside cpu_hotplug_lock, but KVM x86 takes kvm_lock in several notifiers
that may be called under cpus_read_lock(). kvmclock_cpufreq_notifier() in
particular has callchains that are infeasible to guarantee will never be
called with cpu_hotplug_lock held. And practically speaking, using a
dedicated mutex is a non-issue as the cost is a few bytes for all of KVM.
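To illustrate the inversion being avoided, here is a minimal sketch (not a
literal callchain; the lock names and kvmclock_cpufreq_notifier() are the
ones mentioned above):

    /* kvm_enable_virtualization(): kvm_usage_lock, then cpu_hotplug_lock */
    mutex_lock(&kvm_usage_lock);
    cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, ...);  /* takes cpus_read_lock() */

    /* x86 notifiers, e.g. kvmclock_cpufreq_notifier(): the reverse order */
    cpus_read_lock();
    mutex_lock(&kvm_lock);

Reusing kvm_lock for the usage count would establish both
kvm_lock => cpu_hotplug_lock and cpu_hotplug_lock => kvm_lock orderings,
i.e. an ABBA deadlock, whereas kvm_usage_lock is never taken inside a
cpus_read_lock() section, so only one ordering exists.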
Note, using the cpuhp framework has a subtle behavioral change: enabling
will be done serially across all CPUs, whereas KVM currently sends an IPI
to all CPUs in parallel. While serializing virtualization enabling could
create undesirable latency, the issue is limited to creation of KVM's
first VM, and even that can be mitigated, e.g. by letting userspace force
virtualization to be enabled when KVM is initialized.
Cc: Chao Gao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 189 ++++++++++++++++----------------------------
1 file changed, 69 insertions(+), 120 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a1756d5077ee..97783d6987e9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5499,9 +5499,10 @@ __visible bool kvm_rebooting;
EXPORT_SYMBOL_GPL(kvm_rebooting);
static DEFINE_PER_CPU(bool, hardware_enabled);
+static DEFINE_MUTEX(kvm_usage_lock);
static int kvm_usage_count;
-static int __hardware_enable_nolock(void)
+static int hardware_enable_nolock(void)
{
if (__this_cpu_read(hardware_enabled))
return 0;
@@ -5516,34 +5517,18 @@ static int __hardware_enable_nolock(void)
return 0;
}
-static void hardware_enable_nolock(void *failed)
-{
- if (__hardware_enable_nolock())
- atomic_inc(failed);
-}
-
static int kvm_online_cpu(unsigned int cpu)
{
- int ret = 0;
-
/*
* Abort the CPU online process if hardware virtualization cannot
* be enabled. Otherwise running VMs would encounter unrecoverable
* errors when scheduled to this CPU.
*/
- mutex_lock(&kvm_lock);
- if (kvm_usage_count)
- ret = __hardware_enable_nolock();
- mutex_unlock(&kvm_lock);
- return ret;
+ return hardware_enable_nolock();
}
static void hardware_disable_nolock(void *junk)
{
- /*
- * Note, hardware_disable_all_nolock() tells all online CPUs to disable
- * hardware, not just CPUs that successfully enabled hardware!
- */
if (!__this_cpu_read(hardware_enabled))
return;
@@ -5554,78 +5539,10 @@ static void hardware_disable_nolock(void *junk)
static int kvm_offline_cpu(unsigned int cpu)
{
- mutex_lock(&kvm_lock);
- if (kvm_usage_count)
- hardware_disable_nolock(NULL);
- mutex_unlock(&kvm_lock);
+ hardware_disable_nolock(NULL);
return 0;
}
-static void hardware_disable_all_nolock(void)
-{
- BUG_ON(!kvm_usage_count);
-
- kvm_usage_count--;
- if (!kvm_usage_count)
- on_each_cpu(hardware_disable_nolock, NULL, 1);
-}
-
-static void hardware_disable_all(void)
-{
- cpus_read_lock();
- mutex_lock(&kvm_lock);
- hardware_disable_all_nolock();
- mutex_unlock(&kvm_lock);
- cpus_read_unlock();
-}
-
-static int hardware_enable_all(void)
-{
- atomic_t failed = ATOMIC_INIT(0);
- int r;
-
- /*
- * Do not enable hardware virtualization if the system is going down.
- * If userspace initiated a forced reboot, e.g. reboot -f, then it's
- * possible for an in-flight KVM_CREATE_VM to trigger hardware enabling
- * after kvm_reboot() is called. Note, this relies on system_state
- * being set _before_ kvm_reboot(), which is why KVM uses a syscore ops
- * hook instead of registering a dedicated reboot notifier (the latter
- * runs before system_state is updated).
- */
- if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
- system_state == SYSTEM_RESTART)
- return -EBUSY;
-
- /*
- * When onlining a CPU, cpu_online_mask is set before kvm_online_cpu()
- * is called, and so on_each_cpu() between them includes the CPU that
- * is being onlined. As a result, hardware_enable_nolock() may get
- * invoked before kvm_online_cpu(), which also enables hardware if the
- * usage count is non-zero. Disable CPU hotplug to avoid attempting to
- * enable hardware multiple times.
- */
- cpus_read_lock();
- mutex_lock(&kvm_lock);
-
- r = 0;
-
- kvm_usage_count++;
- if (kvm_usage_count == 1) {
- on_each_cpu(hardware_enable_nolock, &failed, 1);
-
- if (atomic_read(&failed)) {
- hardware_disable_all_nolock();
- r = -EBUSY;
- }
- }
-
- mutex_unlock(&kvm_lock);
- cpus_read_unlock();
-
- return r;
-}
-
static void kvm_shutdown(void)
{
/*
@@ -5648,27 +5565,25 @@ static int kvm_suspend(void)
{
/*
* Secondary CPUs and CPU hotplug are disabled across the suspend/resume
- * callbacks, i.e. no need to acquire kvm_lock to ensure the usage count
- * is stable. Assert that kvm_lock is not held to ensure the system
- * isn't suspended while KVM is enabling hardware. Hardware enabling
- * can be preempted, but the task cannot be frozen until it has dropped
- * all locks (userspace tasks are frozen via a fake signal).
+ * callbacks, i.e. no need to acquire kvm_usage_lock to ensure the usage
+ * count is stable. Assert that kvm_usage_lock is not held to ensure
+ * the system isn't suspended while KVM is enabling hardware. Hardware
+ * enabling can be preempted, but the task cannot be frozen until it has
+ * dropped all locks (userspace tasks are frozen via a fake signal).
*/
- lockdep_assert_not_held(&kvm_lock);
+ lockdep_assert_not_held(&kvm_usage_lock);
lockdep_assert_irqs_disabled();
- if (kvm_usage_count)
- hardware_disable_nolock(NULL);
+ hardware_disable_nolock(NULL);
return 0;
}
static void kvm_resume(void)
{
- lockdep_assert_not_held(&kvm_lock);
+ lockdep_assert_not_held(&kvm_usage_lock);
lockdep_assert_irqs_disabled();
- if (kvm_usage_count)
- WARN_ON_ONCE(__hardware_enable_nolock());
+ WARN_ON_ONCE(hardware_enable_nolock());
}
static struct syscore_ops kvm_syscore_ops = {
@@ -5676,6 +5591,60 @@ static struct syscore_ops kvm_syscore_ops = {
.resume = kvm_resume,
.shutdown = kvm_shutdown,
};
+
+static int hardware_enable_all(void)
+{
+ int r;
+
+ guard(mutex)(&kvm_usage_lock);
+
+ if (kvm_usage_count++)
+ return 0;
+
+ r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
+ kvm_online_cpu, kvm_offline_cpu);
+ if (r)
+ goto err_cpuhp;
+
+ register_syscore_ops(&kvm_syscore_ops);
+
+ /*
+ * Undo virtualization enabling and bail if the system is going down.
+ * If userspace initiated a forced reboot, e.g. reboot -f, then it's
+ * possible for an in-flight operation to enable virtualization after
+ * syscore_shutdown() is called, i.e. without kvm_shutdown() being
+ * invoked. Note, this relies on system_state being set _before_
+ * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
+ * or this CPU observes the impending shutdown. Which is why KVM uses
+ * a syscore ops hook instead of registering a dedicated reboot
+ * notifier (the latter runs before system_state is updated).
+ */
+ if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
+ system_state == SYSTEM_RESTART) {
+ r = -EBUSY;
+ goto err_rebooting;
+ }
+
+ return 0;
+
+err_rebooting:
+ unregister_syscore_ops(&kvm_syscore_ops);
+ cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
+err_cpuhp:
+ --kvm_usage_count;
+ return r;
+}
+
+static void hardware_disable_all(void)
+{
+ guard(mutex)(&kvm_usage_lock);
+
+ if (--kvm_usage_count)
+ return;
+
+ unregister_syscore_ops(&kvm_syscore_ops);
+ cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
+}
#else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
static int hardware_enable_all(void)
{
@@ -6381,15 +6350,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
int r;
int cpu;
-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
- kvm_online_cpu, kvm_offline_cpu);
- if (r)
- return r;
-
- register_syscore_ops(&kvm_syscore_ops);
-#endif
-
/* A kmem cache lets us meet the alignment requirements of fx_save. */
if (!vcpu_align)
vcpu_align = __alignof__(struct kvm_vcpu);
@@ -6400,10 +6360,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
offsetofend(struct kvm_vcpu, stats_id)
- offsetof(struct kvm_vcpu, arch),
NULL);
- if (!kvm_vcpu_cache) {
- r = -ENOMEM;
- goto err_vcpu_cache;
- }
+ if (!kvm_vcpu_cache)
+ return -ENOMEM;
for_each_possible_cpu(cpu) {
if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
@@ -6460,11 +6418,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
for_each_possible_cpu(cpu)
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
kmem_cache_destroy(kvm_vcpu_cache);
-err_vcpu_cache:
-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- unregister_syscore_ops(&kvm_syscore_ops);
- cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
-#endif
return r;
}
EXPORT_SYMBOL_GPL(kvm_init);
@@ -6486,10 +6439,6 @@ void kvm_exit(void)
kmem_cache_destroy(kvm_vcpu_cache);
kvm_vfio_ops_exit();
kvm_async_pf_deinit();
-#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
- unregister_syscore_ops(&kvm_syscore_ops);
- cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
-#endif
kvm_irqfd_exit();
}
EXPORT_SYMBOL_GPL(kvm_exit);
--
2.45.0.215.g3402c0e53f-goog
Add an off-by-default module param, enable_virt_at_load, to let userspace
force virtualization to be enabled in hardware when KVM is initialized,
i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
during KVM initialization allows userspace to avoid the additional latency
when creating/destroying the first/last VM. Now that KVM uses the cpuhp
framework to do per-CPU enabling, the latency could be non-trivial as the
cpuhp bringup/teardown is serialized across CPUs, e.g. the latency could
be problematic for use cases that need to spin up VMs quickly.
Enabling virtualization during initialization will also allow KVM to set up
the Intel TDX Module, which requires VMX to be fully enabled, without
needing additional APIs to temporarily enable virtualization.
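For example (illustrative usage only, standard module parameter plumbing):

    modprobe kvm enable_virt_at_load=1

or, when kvm.ko is autoloaded, pass kvm.enable_virt_at_load=1 on the kernel
command line (or set "options kvm enable_virt_at_load=1" in modprobe.d).
The 0444 permissions expose the current value read-only at
/sys/module/kvm/parameters/enable_virt_at_load.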
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8ba2861e7788..8d331ab3aef7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5495,6 +5495,9 @@ static struct miscdevice kvm_dev = {
};
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
+static bool enable_virt_at_load;
+module_param(enable_virt_at_load, bool, 0444);
+
__visible bool kvm_rebooting;
EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -5645,15 +5648,41 @@ static void kvm_disable_virtualization(void)
unregister_syscore_ops(&kvm_syscore_ops);
cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
}
+
+static int kvm_init_virtualization(void)
+{
+ if (enable_virt_at_load)
+ return kvm_enable_virtualization();
+
+ return 0;
+}
+
+static void kvm_uninit_virtualization(void)
+{
+ if (enable_virt_at_load)
+ kvm_disable_virtualization();
+
+ WARN_ON(kvm_usage_count);
+}
#else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
static int kvm_enable_virtualization(void)
{
return 0;
}
+static int kvm_init_virtualization(void)
+{
+ return 0;
+}
+
static void kvm_disable_virtualization(void)
{
+}
+
+static void kvm_uninit_virtualization(void)
+{
+
}
#endif /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
@@ -6395,6 +6424,10 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
kvm_gmem_init(module);
+ r = kvm_init_virtualization();
+ if (r)
+ goto err_virt;
+
/*
* Registration _must_ be the very last thing done, as this exposes
* /dev/kvm to userspace, i.e. all infrastructure must be setup!
@@ -6408,6 +6441,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
return 0;
err_register:
+ kvm_uninit_virtualization();
+err_virt:
kvm_vfio_ops_exit();
err_vfio:
kvm_async_pf_deinit();
@@ -6433,6 +6468,8 @@ void kvm_exit(void)
*/
misc_deregister(&kvm_dev);
+ kvm_uninit_virtualization();
+
debugfs_remove_recursive(kvm_debugfs_dir);
for_each_possible_cpu(cpu)
free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
--
2.45.0.215.g3402c0e53f-goog
Rename the various functions that enable virtualization to prepare for
upcoming changes, and to clean up artifacts of KVM's previous behavior,
which required manually juggling locks around kvm_usage_count.
Drop the "nolock" qualifier from per-CPU functions now that there are no
"nolock" implementations of the "all" variants, i.e. now that calling a
non-nolock function from a nolock function isn't confusing (unlike this
sentence).
Drop "all" from the outer helpers as they no longer manually iterate
over all CPUs, and because it might not be obvious what "all" refers to.
Instead, use double-underscores to communicate that the per-CPU functions
are helpers to the outer APIs.
Opportunistically prepend "kvm" to all functions to help make it clear
that they are KVM helpers, but mostly there's no reason not to.
Lastly, use "virtualization" instead of "hardware", because while the
functions do enable virtualization in hardware, there are a _lot_ of
things that KVM enables in hardware.
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 97783d6987e9..8ba2861e7788 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -139,8 +139,8 @@ static int kvm_no_compat_open(struct inode *inode, struct file *file)
#define KVM_COMPAT(c) .compat_ioctl = kvm_no_compat_ioctl, \
.open = kvm_no_compat_open
#endif
-static int hardware_enable_all(void);
-static void hardware_disable_all(void);
+static int kvm_enable_virtualization(void);
+static void kvm_disable_virtualization(void);
static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
@@ -1216,7 +1216,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (r)
goto out_err_no_arch_destroy_vm;
- r = hardware_enable_all();
+ r = kvm_enable_virtualization();
if (r)
goto out_err_no_disable;
@@ -1259,7 +1259,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
#endif
out_err_no_mmu_notifier:
- hardware_disable_all();
+ kvm_disable_virtualization();
out_err_no_disable:
kvm_arch_destroy_vm(kvm);
out_err_no_arch_destroy_vm:
@@ -1354,7 +1354,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
#endif
kvm_arch_free_vm(kvm);
preempt_notifier_dec();
- hardware_disable_all();
+ kvm_disable_virtualization();
mmdrop(mm);
}
@@ -5502,7 +5502,7 @@ static DEFINE_PER_CPU(bool, hardware_enabled);
static DEFINE_MUTEX(kvm_usage_lock);
static int kvm_usage_count;
-static int hardware_enable_nolock(void)
+static int __kvm_enable_virtualization(void)
{
if (__this_cpu_read(hardware_enabled))
return 0;
@@ -5524,10 +5524,10 @@ static int kvm_online_cpu(unsigned int cpu)
* be enabled. Otherwise running VMs would encounter unrecoverable
* errors when scheduled to this CPU.
*/
- return hardware_enable_nolock();
+ return __kvm_enable_virtualization();
}
-static void hardware_disable_nolock(void *junk)
+static void __kvm_disable_virtualization(void *ign)
{
if (!__this_cpu_read(hardware_enabled))
return;
@@ -5539,7 +5539,7 @@ static void hardware_disable_nolock(void *junk)
static int kvm_offline_cpu(unsigned int cpu)
{
- hardware_disable_nolock(NULL);
+ __kvm_disable_virtualization(NULL);
return 0;
}
@@ -5558,7 +5558,7 @@ static void kvm_shutdown(void)
*/
pr_info("kvm: exiting hardware virtualization\n");
kvm_rebooting = true;
- on_each_cpu(hardware_disable_nolock, NULL, 1);
+ on_each_cpu(__kvm_disable_virtualization, NULL, 1);
}
static int kvm_suspend(void)
@@ -5574,7 +5574,7 @@ static int kvm_suspend(void)
lockdep_assert_not_held(&kvm_usage_lock);
lockdep_assert_irqs_disabled();
- hardware_disable_nolock(NULL);
+ __kvm_disable_virtualization(NULL);
return 0;
}
@@ -5583,7 +5583,7 @@ static void kvm_resume(void)
lockdep_assert_not_held(&kvm_usage_lock);
lockdep_assert_irqs_disabled();
- WARN_ON_ONCE(hardware_enable_nolock());
+ WARN_ON_ONCE(__kvm_enable_virtualization());
}
static struct syscore_ops kvm_syscore_ops = {
@@ -5592,7 +5592,7 @@ static struct syscore_ops kvm_syscore_ops = {
.shutdown = kvm_shutdown,
};
-static int hardware_enable_all(void)
+static int kvm_enable_virtualization(void)
{
int r;
@@ -5635,7 +5635,7 @@ static int hardware_enable_all(void)
return r;
}
-static void hardware_disable_all(void)
+static void kvm_disable_virtualization(void)
{
guard(mutex)(&kvm_usage_lock);
@@ -5646,12 +5646,12 @@ static void hardware_disable_all(void)
cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
}
#else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */
-static int hardware_enable_all(void)
+static int kvm_enable_virtualization(void)
{
return 0;
}
-static void hardware_disable_all(void)
+static void kvm_disable_virtualization(void)
{
}
--
2.45.0.215.g3402c0e53f-goog
Add arch hooks that are invoked when KVM enables/disables virtualization.
x86 will use the hooks to register an "emergency disable" callback, which
is essentially an x86-specific shutdown notifier that is used when the
kernel is doing an emergency reboot/shutdown/kexec.
Signed-off-by: Sean Christopherson <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b57878c8c18..701b16290932 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1514,6 +1514,8 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
#endif
#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
+void kvm_arch_enable_virtualization(void);
+void kvm_arch_disable_virtualization(void);
int kvm_arch_hardware_enable(void);
void kvm_arch_hardware_disable(void);
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8d331ab3aef7..16adeb6b5bef 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5505,6 +5505,16 @@ static DEFINE_PER_CPU(bool, hardware_enabled);
static DEFINE_MUTEX(kvm_usage_lock);
static int kvm_usage_count;
+__weak void kvm_arch_enable_virtualization(void)
+{
+
+}
+
+__weak void kvm_arch_disable_virtualization(void)
+{
+
+}
+
static int __kvm_enable_virtualization(void)
{
if (__this_cpu_read(hardware_enabled))
@@ -5604,6 +5614,8 @@ static int kvm_enable_virtualization(void)
if (kvm_usage_count++)
return 0;
+ kvm_arch_enable_virtualization();
+
r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
kvm_online_cpu, kvm_offline_cpu);
if (r)
@@ -5634,6 +5646,7 @@ static int kvm_enable_virtualization(void)
unregister_syscore_ops(&kvm_syscore_ops);
cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
err_cpuhp:
+ kvm_arch_disable_virtualization();
--kvm_usage_count;
return r;
}
@@ -5647,6 +5660,7 @@ static void kvm_disable_virtualization(void)
unregister_syscore_ops(&kvm_syscore_ops);
cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
+ kvm_arch_disable_virtualization();
}
static int kvm_init_virtualization(void)
--
2.45.0.215.g3402c0e53f-goog
Define cpu_emergency_virt_cb even if the kernel is being built without KVM
support so that KVM can reference the typedef in asm/kvm_host.h without
needing yet more #ifdefs.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/reboot.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 6536873f8fc0..d0ef2a678d66 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,8 +25,8 @@ void __noreturn machine_real_restart(unsigned int type);
#define MRR_BIOS 0
#define MRR_APM 1
-#if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
typedef void (cpu_emergency_virt_cb)(void);
+#if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback);
void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback);
void cpu_emergency_disable_virtualization(void);
--
2.45.0.215.g3402c0e53f-goog
Register the "disable virtualization in an emergency" callback just
before KVM enables virtualization in hardware, as there is no functional
need to keep the callbacks registered while KVM happens to be loaded, but
is inactive, i.e. if KVM hasn't enabled virtualization.
Note, unregistering the callback every time the last VM is destroyed could
have measurable latency due to the synchronize_rcu() needed to ensure all
references to the callback are dropped before KVM is unloaded. But the
latency should be a small fraction of the total latency of disabling
virtualization across all CPUs, and userspace can set enable_virt_at_load
to completely eliminate the runtime overhead.
Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
There is no reason to force vendor code to do the registration, and either
way KVM would need a new kvm_x86_ops hook.
Suggested-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/svm/svm.c | 5 +----
arch/x86/kvm/vmx/main.c | 2 ++
arch/x86/kvm/vmx/vmx.c | 6 +-----
arch/x86/kvm/vmx/x86_ops.h | 1 +
arch/x86/kvm/x86.c | 10 ++++++++++
6 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aabf1648a56a..66698f5bcc85 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -36,6 +36,7 @@
#include <asm/kvm_page_track.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/hyperv-tlfs.h>
+#include <asm/reboot.h>
#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
@@ -1613,6 +1614,8 @@ struct kvm_x86_ops {
int (*hardware_enable)(void);
void (*hardware_disable)(void);
+ cpu_emergency_virt_cb *emergency_disable;
+
void (*hardware_unsetup)(void);
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3d0549ca246f..9c55d0c9cb59 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4952,6 +4952,7 @@ static void svm_vcpu_unblocking(struct kvm_vcpu *vcpu)
static struct kvm_x86_ops svm_x86_ops __initdata = {
.name = KBUILD_MODNAME,
+ .emergency_disable = svm_emergency_disable,
.check_processor_compatibility = svm_check_processor_compat,
.hardware_unsetup = svm_hardware_unsetup,
@@ -5389,8 +5390,6 @@ static struct kvm_x86_init_ops svm_init_ops __initdata = {
static void __svm_exit(void)
{
kvm_x86_vendor_exit();
-
- cpu_emergency_unregister_virt_callback(svm_emergency_disable);
}
static int __init svm_init(void)
@@ -5406,8 +5405,6 @@ static int __init svm_init(void)
if (r)
return r;
- cpu_emergency_register_virt_callback(svm_emergency_disable);
-
/*
* Common KVM initialization _must_ come last, after this, /dev/kvm is
* exposed to userspace!
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7c546ad3e4c9..3f423afc263b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -24,6 +24,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.hardware_enable = vmx_hardware_enable,
.hardware_disable = vmx_hardware_disable,
+ .emergency_disable = vmx_emergency_disable,
+
.has_emulated_msr = vmx_has_emulated_msr,
.vm_size = sizeof(struct kvm_vmx),
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 51b2cd13250a..eac505299a7b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -753,7 +753,7 @@ static int kvm_cpu_vmxoff(void)
return -EIO;
}
-static void vmx_emergency_disable(void)
+void vmx_emergency_disable(void)
{
int cpu = raw_smp_processor_id();
struct loaded_vmcs *v;
@@ -8613,8 +8613,6 @@ static void __vmx_exit(void)
{
allow_smaller_maxphyaddr = false;
- cpu_emergency_unregister_virt_callback(vmx_emergency_disable);
-
vmx_cleanup_l1d_flush();
}
@@ -8661,8 +8659,6 @@ static int __init vmx_init(void)
pi_init_cpu(cpu);
}
- cpu_emergency_register_virt_callback(vmx_emergency_disable);
-
vmx_check_vmcs12_offsets();
/*
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 502704596c83..afddfe3747dd 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -15,6 +15,7 @@ void vmx_hardware_unsetup(void);
int vmx_check_processor_compat(void);
int vmx_hardware_enable(void);
void vmx_hardware_disable(void);
+void vmx_emergency_disable(void);
int vmx_vm_init(struct kvm *kvm);
void vmx_vm_destroy(struct kvm *kvm);
int vmx_vcpu_precreate(struct kvm *kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d750546ec934..84b34696a76c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12464,6 +12464,16 @@ void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
}
EXPORT_SYMBOL_GPL(kvm_vcpu_deliver_sipi_vector);
+void kvm_arch_enable_virtualization(void)
+{
+ cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_disable);
+}
+
+void kvm_arch_disable_virtualization(void)
+{
+ cpu_emergency_unregister_virt_callback(kvm_x86_ops.emergency_disable);
+}
+
int kvm_arch_hardware_enable(void)
{
struct kvm *kvm;
--
2.45.0.215.g3402c0e53f-goog
On Tue, May 21, 2024 at 07:28:22PM -0700, Sean Christopherson wrote:
>Register KVM's cpuhp and syscore callback when enabling virtualization
>in hardware instead of registering the callbacks during initialization,
>and let the CPU up/down framework invoke the inner enable/disable
>functions. Registering the callbacks during initialization makes things
>more complex than they need to be, as KVM needs to be very careful about
>handling races between enabling CPUs being onlined/offlined and hardware
>being enabled/disabled.
>
>Intel TDX support will require KVM to enable virtualization during KVM
>initialization, i.e. will add another wrinkle to things, at which point
>sorting out the potential races with kvm_usage_count would become even
>more complex.
>
>Use a dedicated mutex to guard kvm_usage_count, as taking kvm_lock outside
>cpu_hotplug_lock is disallowed. Ideally, KVM would *always* take kvm_lock
>outside cpu_hotplug_lock, but KVM x86 takes kvm_lock in several notifiers
>that may be called under cpus_read_lock(). kvmclock_cpufreq_notifier() in
>particular has callchains that are infeasible to guarantee will never be
>called with cpu_hotplug_lock held. And practically speaking, using a
>dedicated mutex is a non-issue as the cost is a few bytes for all of KVM.
Shouldn't this part go to a separate patch?
I think so, because you posted a lockdep splat which indicates the existing
locking order is problematic. So, using a dedicated mutex actually fixes
some bug and needs a "Fixes:" tag, so that it can be backported separately.
And Documentation/virt/kvm/locking.rst needs to be updated accordingly.
Actually, you are doing a partial revert to the commit:
0bf50497f03b ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Perhaps you can handle this as a revert. After that, change the lock from
a raw_spinlock_t to a mutex.
On Tue, May 21, 2024 at 07:28:23PM -0700, Sean Christopherson wrote:
>Rename the various functions that enable virtualization to prepare for
>upcoming changes, and to clean up artifacts of KVM's previous behavior,
>which required manually juggling locks around kvm_usage_count.
>
>Drop the "nolock" qualifier from per-CPU functions now that there are no
>"nolock" implementations of the "all" variants, i.e. now that calling a
>non-nolock function from a nolock function isn't confusing (unlike this
>sentence).
>
>Drop "all" from the outer helpers as they no longer manually iterate
>over all CPUs, and because it might not be obvious what "all" refers to.
>Instead, use double-underscores to communicate that the per-CPU functions
>are helpers to the outer APIs.
>
>Opportunistically prepend "kvm" to all functions to help make it clear
>that they are KVM helpers, but mostly there's no reason not to.
>
>Lastly, use "virtualization" instead of "hardware", because while the
>functions do enable virtualization in hardware, there are a _lot_ of
>things that KVM enables in hardware.
>
>Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
On 22/05/2024 2:28 pm, Sean Christopherson wrote:
> Add an off-by-default module param, enable_virt_at_load, to let userspace
> force virtualization to be enabled in hardware when KVM is initialized,
> i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
> during KVM initialization allows userspace to avoid the additional latency
> when creating/destroying the first/last VM. Now that KVM uses the cpuhp
> framework to do per-CPU enabling, the latency could be non-trivial as the
> cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could
> be problematic for use case that need to spin up VMs quickly.
How about we defer this until there's a real complaint that this isn't
acceptable? To me it doesn't sound like "latency of creating the first VM"
matters a lot in real CSP deployments.
The concern of adding a new module param is once we add it, we need to
maintain it even it is no longer needed in the future for backward
compatibility. Especially this param is in kvm.ko, and for all ARCHs.
E.g., I think _IF_ the core cpuhp code is enhanced to call those
callbacks in parallel in cpuhp_setup_state(), then this issue could be
mitigated to an unnoticeable level.
Or we just still do:
cpus_read_lock();
on_each_cpu(hardware_enable_nolock, ...);
cpuhp_setup_state_nocalls_cpuslocked(...);
cpus_read_unlock();
I think the main benefit of series is to put all virtualization enabling
related things into one single function. Whether using
cpuhp_setup_state() or using on_each_cpu() shouldn't be the main point.
On 22/05/2024 2:28 pm, Sean Christopherson wrote:
> Add arch hooks that are invoked when KVM enables/disable virtualization.
> x86 will use the hooks to register an "emergency disable" callback, which
> is essentially an x86-specific shutdown notifier that is used when the
> kernel is doing an emergency reboot/shutdown/kexec.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
[...]
> +
> static int __kvm_enable_virtualization(void)
> {
> if (__this_cpu_read(hardware_enabled))
> @@ -5604,6 +5614,8 @@ static int kvm_enable_virtualization(void)
> if (kvm_usage_count++)
> return 0;
>
> + kvm_arch_enable_virtualization();
> +
> r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> kvm_online_cpu, kvm_offline_cpu);
Nit: is kvm_arch_pre_enable_virtualization() a better name?
On 22/05/2024 2:28 pm, Sean Christopherson wrote:
> Rename the various functions that enable virtualization to prepare for
> upcoming changes, and to clean up artifacts of KVM's previous behavior,
> which required manually juggling locks around kvm_usage_count.
>
> Drop the "nolock" qualifier from per-CPU functions now that there are no
> "nolock" implementations of the "all" variants, i.e. now that calling a
> non-nolock function from a nolock function isn't confusing (unlike this
> sentence).
>
> Drop "all" from the outer helpers as they no longer manually iterate
> over all CPUs, and because it might not be obvious what "all" refers to.
> Instead, use double-underscores to communicate that the per-CPU functions
> are helpers to the outer APIs.
>
> Opportunistically prepend "kvm" to all functions to help make it clear
> that they are KVM helpers, but mostly there's no reason not to.
>
> Lastly, use "virtualization" instead of "hardware", because while the
> functions do enable virtualization in hardware, there are a _lot_ of
> things that KVM enables in hardware.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
Reviewed-by: Kai Huang <[email protected]>
On 22/05/2024 2:28 pm, Sean Christopherson wrote:
> Define cpu_emergency_virt_cb even if the kernel is being built without KVM
> support so that KVM can reference the typedef in asm/kvm_host.h without
> needing yet more #ifdefs.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Acked-by: Kai Huang <[email protected]>
On 22/05/2024 2:28 pm, Sean Christopherson wrote:
> Register the "disable virtualization in an emergency" callback just
> before KVM enables virtualization in hardware, as there is no functional
> need to keep the callbacks registered while KVM happens to be loaded, but
> is inactive, i.e. if KVM hasn't enabled virtualization.
>
> Note, unregistering the callback every time the last VM is destroyed could
> have measurable latency due to the synchronize_rcu() needed to ensure all
> references to the callback are dropped before KVM is unloaded. But the
> latency should be a small fraction of the total latency of disabling
> virtualization across all CPUs, and userspace can set enable_virt_at_load
> to completely eliminate the runtime overhead.
>
> Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
> There is no reason to force vendor code to do the registration, and either
> way KVM would need a new kvm_x86_ops hook.
>
> Suggested-by: Kai Huang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
Reviewed-by: Kai Huang <[email protected]>
On Thu, May 23, 2024 at 10:27:53AM +1200, Huang, Kai wrote:
>
>
>On 22/05/2024 2:28 pm, Sean Christopherson wrote:
>> Add an off-by-default module param, enable_virt_at_load, to let userspace
>> force virtualization to be enabled in hardware when KVM is initialized,
>> i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
>> during KVM initialization allows userspace to avoid the additional latency
>> when creating/destroying the first/last VM. Now that KVM uses the cpuhp
>> framework to do per-CPU enabling, the latency could be non-trivial as the
>> cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could
>> be problematic for use case that need to spin up VMs quickly.
>
>How about we defer this until there's a real complain that this isn't
>acceptable? To me it doesn't sound "latency of creating the first VM"
>matters a lot in the real CSP deployments.
I suspect kselftest and kvm-unit-tests will be impacted a lot because
hundreds of tests are run serially. And it looks clumsy to reload the KVM
module to set enable_virt_at_load to make tests run faster. I think the
test slowdown is a more realistic problem than running an off-tree
hypervisor, so I vote to make enabling virtualization at load time the
default behavior and if we really want to support an off-tree hypervisor,
we can add a new module param to opt in enabling virtualization at runtime.
>
>The concern of adding a new module param is once we add it, we need to
>maintain it even it is no longer needed in the future for backward
>compatibility. Especially this param is in kvm.ko, and for all ARCHs.
>
>E.g., I think _IF_ the core cpuhp code is enhanced to call those callbacks in
>parallel in cpuhp_setup_state(), then this issue could be mitigated to an
>unnoticeable level.
>
>Or we just still do:
>
> cpus_read_lock();
> on_each_cpu(hardware_enable_nolock, ...);
> cpuhp_setup_state_nocalls_cpuslocked(...);
> cpus_read_unlock();
>
>I think the main benefit of series is to put all virtualization enabling
>related things into one single function. Whether using cpuhp_setup_state()
>or using on_each_cpu() shouldn't be the main point.
>
On Tue, May 21, 2024 at 07:28:25PM -0700, Sean Christopherson wrote:
>Add arch hooks that are invoked when KVM enables/disable virtualization.
>x86 will use the hooks to register an "emergency disable" callback, which
>is essentially an x86-specific shutdown notifier that is used when the
>kernel is doing an emergency reboot/shutdown/kexec.
>
>Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
On Tue, May 21, 2024 at 07:28:26PM -0700, Sean Christopherson wrote:
>Define cpu_emergency_virt_cb even if the kernel is being built without KVM
>support so that KVM can reference the typedef in asm/kvm_host.h without
>needing yet more #ifdefs.
>
>No functional change intended.
>
>Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Perhaps x86 maintainers need to be CC'ed.
>---
> arch/x86/include/asm/reboot.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
>index 6536873f8fc0..d0ef2a678d66 100644
>--- a/arch/x86/include/asm/reboot.h
>+++ b/arch/x86/include/asm/reboot.h
>@@ -25,8 +25,8 @@ void __noreturn machine_real_restart(unsigned int type);
> #define MRR_BIOS 0
> #define MRR_APM 1
>
>-#if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
> typedef void (cpu_emergency_virt_cb)(void);
>+#if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
> void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback);
> void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback);
> void cpu_emergency_disable_virtualization(void);
>--
>2.45.0.215.g3402c0e53f-goog
>
On Tue, May 21, 2024 at 07:28:27PM -0700, Sean Christopherson wrote:
>Register the "disable virtualization in an emergency" callback just
>before KVM enables virtualization in hardware, as there is no functional
>need to keep the callbacks registered while KVM happens to be loaded, but
>is inactive, i.e. if KVM hasn't enabled virtualization.
>
>Note, unregistering the callback every time the last VM is destroyed could
>have measurable latency due to the synchronize_rcu() needed to ensure all
>references to the callback are dropped before KVM is unloaded. But the
>latency should be a small fraction of the total latency of disabling
>virtualization across all CPUs, and userspace can set enable_virt_at_load
>to completely eliminate the runtime overhead.
>
>Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
>There is no reason to force vendor code to do the registration, and either
>way KVM would need a new kvm_x86_ops hook.
>
>Suggested-by: Kai Huang <[email protected]>
>Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
..
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -753,7 +753,7 @@ static int kvm_cpu_vmxoff(void)
> return -EIO;
> }
>
>-static void vmx_emergency_disable(void)
>+void vmx_emergency_disable(void)
> {
> int cpu = raw_smp_processor_id();
> struct loaded_vmcs *v;
>@@ -8613,8 +8613,6 @@ static void __vmx_exit(void)
> {
> allow_smaller_maxphyaddr = false;
>
>- cpu_emergency_unregister_virt_callback(vmx_emergency_disable);
>-
> vmx_cleanup_l1d_flush();
> }
>
>@@ -8661,8 +8659,6 @@ static int __init vmx_init(void)
> pi_init_cpu(cpu);
> }
>
>- cpu_emergency_register_virt_callback(vmx_emergency_disable);
>-
Nit: with the removal of calls to cpu_emergency_(un)register_virt_callback,
there is no need to include asm/reboot.h in vmx.c anymore, right?
On 23/05/2024 4:23 pm, Chao Gao wrote:
> On Thu, May 23, 2024 at 10:27:53AM +1200, Huang, Kai wrote:
>>
>>
>> On 22/05/2024 2:28 pm, Sean Christopherson wrote:
>>> Add an off-by-default module param, enable_virt_at_load, to let userspace
>>> force virtualization to be enabled in hardware when KVM is initialized,
>>> i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
>>> during KVM initialization allows userspace to avoid the additional latency
>>> when creating/destroying the first/last VM. Now that KVM uses the cpuhp
>>> framework to do per-CPU enabling, the latency could be non-trivial as the
>>> cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could
>>> be problematic for use case that need to spin up VMs quickly.
>>
>> How about we defer this until there's a real complain that this isn't
>> acceptable? To me it doesn't sound "latency of creating the first VM"
>> matters a lot in the real CSP deployments.
>
> I suspect kselftest and kvm-unit-tests will be impacted a lot because
> hundreds of tests are run serially. And it looks clumsy to reload KVM
> module to set enable_virt_at_load to make tests run faster. I think the
> test slowdown is a more realistic problem than running an off-tree
> hypervisor, so I vote to make enabling virtualization at load time the
> default behavior and if we really want to support an off-tree hypervisor,
> we can add a new module param to opt in enabling virtualization at runtime.
I am not following why off-tree hypervisor is ever related to this.
Could you elaborate?
The problem of enabling virt during module loading by default is that it
impacts all ARCHs. Given that this performance downgrade (if we care) can be
resolved by explicitly doing on_each_cpu() below, I am not sure why we
want to choose this radical approach.
>> Or we just still do:
>>
>> cpus_read_lock();
>> on_each_cpu(hardware_enable_nolock, ...);
>> cpuhp_setup_state_nocalls_cpuslocked(...);
>> cpus_read_unlock();
>>
>> I think the main benefit of series is to put all virtualization enabling
>> related things into one single function. Whether using cpuhp_setup_state()
>> or using on_each_cpu() shouldn't be the main point.
>>
On Fri, May 24, 2024 at 11:11:37AM +1200, Huang, Kai wrote:
>
>
>On 23/05/2024 4:23 pm, Chao Gao wrote:
>> On Thu, May 23, 2024 at 10:27:53AM +1200, Huang, Kai wrote:
>> >
>> >
>> > On 22/05/2024 2:28 pm, Sean Christopherson wrote:
>> > > Add an off-by-default module param, enable_virt_at_load, to let userspace
>> > > force virtualization to be enabled in hardware when KVM is initialized,
>> > > i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
>> > > during KVM initialization allows userspace to avoid the additional latency
>> > > when creating/destroying the first/last VM. Now that KVM uses the cpuhp
>> > > framework to do per-CPU enabling, the latency could be non-trivial as the
>> > > cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could
>> > > be problematic for use case that need to spin up VMs quickly.
>> >
>> > How about we defer this until there's a real complain that this isn't
>> > acceptable? To me it doesn't sound "latency of creating the first VM"
>> > matters a lot in the real CSP deployments.
>>
>> I suspect kselftest and kvm-unit-tests will be impacted a lot because
>> hundreds of tests are run serially. And it looks clumsy to reload KVM
>> module to set enable_virt_at_load to make tests run faster. I think the
>> test slowdown is a more realistic problem than running an off-tree
>> hypervisor, so I vote to make enabling virtualization at load time the
>> default behavior and if we really want to support an off-tree hypervisor,
>> we can add a new module param to opt in enabling virtualization at runtime.
>
>I am not following why off-tree hypervisor is ever related to this.
Enabling virtualization at runtime was added to support an off-tree hypervisor
(see the commit below). To me, supporting an off-tree hypervisor while KVM is
autoloaded is a niche usage, so my preference is to make enabling
virtualization at runtime opt-in rather than the default.
commit 10474ae8945ce08622fd1f3464e55bd817bf2376
Author: Alexander Graf <[email protected]>
Date: Tue Sep 15 11:37:46 2009 +0200
KVM: Activate Virtualization On Demand
X86 CPUs need to have some magic happening to enable the virtualization
extensions on them. This magic can result in unpleasant results for
users, like blocking other VMMs from working (vmx) or using invalid TLB
entries (svm).
Currently KVM activates virtualization when the respective kernel module
is loaded. This blocks us from autoloading KVM modules without breaking
other VMMs.
To circumvent this problem at least a bit, this patch introduces on
demand activation of virtualization. This means, that instead
virtualization is enabled on creation of the first virtual machine
and disabled on destruction of the last one.
So using this, KVM can be easily autoloaded, while keeping other
hypervisors usable.
>
>Could you elaborate?
>
>The problem of enabling virt during module loading by default is it impacts
>all ARCHs. Given this performance downgrade (if we care) can be resolved by
>explicitly doing on_each_cpu() below, I am not sure why we want to choose
>this radical approach.
IIUC, we plan to set up TDX module at KVM load time; we need to enable virt
at load time at least for TDX. Definitely, on_each_cpu() can solve the perf
concern. But a solution which can also satisfy TDX's need is better to me.
>
>
>> > Or we just still do:
>> >
>> > cpus_read_lock();
>> > on_each_cpu(hardware_enable_nolock, ...);
>> > cpuhp_setup_state_nocalls_cpuslocked(...);
>> > cpus_read_unlock();
>> >
>> > I think the main benefit of series is to put all virtualization enabling
>> > related things into one single function. Whether using cpuhp_setup_state()
>> > or using on_each_cpu() shouldn't be the main point.
>> >
On 24/05/2024 2:39 pm, Chao Gao wrote:
> On Fri, May 24, 2024 at 11:11:37AM +1200, Huang, Kai wrote:
>>
>>
>> On 23/05/2024 4:23 pm, Chao Gao wrote:
>>> On Thu, May 23, 2024 at 10:27:53AM +1200, Huang, Kai wrote:
>>>>
>>>>
>>>> On 22/05/2024 2:28 pm, Sean Christopherson wrote:
>>>>> Add an off-by-default module param, enable_virt_at_load, to let userspace
>>>>> force virtualization to be enabled in hardware when KVM is initialized,
>>>>> i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
>>>>> during KVM initialization allows userspace to avoid the additional latency
>>>>> when creating/destroying the first/last VM. Now that KVM uses the cpuhp
>>>>> framework to do per-CPU enabling, the latency could be non-trivial as the
>>>>> cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could
>>>>> be problematic for use case that need to spin up VMs quickly.
>>>>
>>>> How about we defer this until there's a real complain that this isn't
>>>> acceptable? To me it doesn't sound "latency of creating the first VM"
>>>> matters a lot in the real CSP deployments.
>>>
>>> I suspect kselftest and kvm-unit-tests will be impacted a lot because
>>> hundreds of tests are run serially. And it looks clumsy to reload KVM
>>> module to set enable_virt_at_load to make tests run faster. I think the
>>> test slowdown is a more realistic problem than running an off-tree
>>> hypervisor, so I vote to make enabling virtualization at load time the
>>> default behavior and if we really want to support an off-tree hypervisor,
>>> we can add a new module param to opt in enabling virtualization at runtime.
>>
>> I am not following why off-tree hypervisor is ever related to this.
>
> Enabling virtualization at runtime was added to support an off-tree hypervisor
> (see the commit below).
Oh, ok. I was thinking something else.
>>
>> The problem of enabling virt during module loading by default is it impacts
>> all ARCHs. Given this performance downgrade (if we care) can be resolved by
>> explicitly doing on_each_cpu() below, I am not sure why we want to choose
>> this radical approach.
>
> IIUC, we plan to set up TDX module at KVM load time; we need to enable virt
> at load time at least for TDX. Definitely, on_each_cpu() can solve the perf
> concern. But a solution which can also satisfy TDX's need is better to me.
>
Doing on_each_cpu() explicitly can also meet TDX's purpose. We just
explicitly enable virtualization during module loading if we are going
to enable TDX. For all other cases, the behaviour remains the same,
unless they want to change when to enable virtualization, e.g., when
loading module by default.
For always, or by default enabling virtualization during module loading,
we somehow discussed before:
https://lore.kernel.org/kvm/[email protected]/
My true comment is introducing a module parameter, which is a userspace
ABI, to just fix some performance downgrade seems overkill when it can
be mitigated by the kernel.
On Thu, May 23, 2024, Kai Huang wrote:
> On 22/05/2024 2:28 pm, Sean Christopherson wrote:
> > static int __kvm_enable_virtualization(void)
> > {
> > if (__this_cpu_read(hardware_enabled))
> > @@ -5604,6 +5614,8 @@ static int kvm_enable_virtualization(void)
> > if (kvm_usage_count++)
> > return 0;
> > + kvm_arch_enable_virtualization();
> > +
> > r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> > kvm_online_cpu, kvm_offline_cpu);
>
>
> Nit: is kvm_arch_pre_enable_virtualization() a better name?
Hmm, yes? I don't have a strong preference either way. I did consider a more
verbose name, but omitted the "pre" because the hook is called only on the 0=>1
transition of kvm_usage_count, and for some reason that made me think "pre" would
be confusing.
On the other hand, "pre" very clearly communicates that the hook is invoked,
and _needs_ to be invoked (for x86), before KVM enables virtualization.
So I'm leaning towards kvm_arch_pre_enable_virtualization().
Anyone else have an opinion?
On Wed, May 22, 2024, Chao Gao wrote:
> On Tue, May 21, 2024 at 07:28:22PM -0700, Sean Christopherson wrote:
> >Register KVM's cpuhp and syscore callback when enabling virtualization
> >in hardware instead of registering the callbacks during initialization,
> >and let the CPU up/down framework invoke the inner enable/disable
> >functions. Registering the callbacks during initialization makes things
> >more complex than they need to be, as KVM needs to be very careful about
> >handling races between enabling CPUs being onlined/offlined and hardware
> >being enabled/disabled.
> >
> >Intel TDX support will require KVM to enable virtualization during KVM
> >initialization, i.e. will add another wrinkle to things, at which point
> >sorting out the potential races with kvm_usage_count would become even
> >more complex.
> >
>
> >Use a dedicated mutex to guard kvm_usage_count, as taking kvm_lock outside
> >cpu_hotplug_lock is disallowed. Ideally, KVM would *always* take kvm_lock
> >outside cpu_hotplug_lock, but KVM x86 takes kvm_lock in several notifiers
> >that may be called under cpus_read_lock(). kvmclock_cpufreq_notifier() in
> >particular has callchains that are infeasible to guarantee will never be
> >called with cpu_hotplug_lock held. And practically speaking, using a
> >dedicated mutex is a non-issue as the cost is a few bytes for all of KVM.
>
> Shouldn't this part go to a separate patch?
>
> I think so because you post a lockdep splat which indicates the existing
> locking order is problematic. So, using a dedicated mutex actually fixes
> some bug and needs a "Fixes:" tag, so that it can be backported separately.
Oooh, good point. I'll try to re-decipher the lockdep splat, and go this route
if using a dedicated lock does indeed fix a real issue.
> And Documentation/virt/kvm/locking.rst needs to be updated accordingly.
>
> Actually, you are doing a partial revert to the commit:
>
> 0bf50497f03b ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
>
> Perhaps you can handle this as a revert. After that, change the lock from
> a raw_spinlock_t to a mutex.
Hmm, I'd prefer to not revert to a spinlock, even temporarily.
On Tue, May 28, 2024, Kai Huang wrote:
> On 24/05/2024 2:39 pm, Chao Gao wrote:
> > On Fri, May 24, 2024 at 11:11:37AM +1200, Huang, Kai wrote:
> > >
> > >
> > > On 23/05/2024 4:23 pm, Chao Gao wrote:
> > > > On Thu, May 23, 2024 at 10:27:53AM +1200, Huang, Kai wrote:
> > > > >
> > > > >
> > > > > On 22/05/2024 2:28 pm, Sean Christopherson wrote:
> > > > > > Add an off-by-default module param, enable_virt_at_load, to let userspace
> > > > > > force virtualization to be enabled in hardware when KVM is initialized,
> > > > > > i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
> > > > > > during KVM initialization allows userspace to avoid the additional latency
> > > > > > when creating/destroying the first/last VM. Now that KVM uses the cpuhp
> > > > > > framework to do per-CPU enabling, the latency could be non-trivial as the
> > > > > > cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could
> > > > > > be problematic for use case that need to spin up VMs quickly.
> > > > >
> > > > > How about we defer this until there's a real complain that this isn't
> > > > > acceptable? To me it doesn't sound "latency of creating the first VM"
> > > > > matters a lot in the real CSP deployments.
> > > >
> > > > I suspect kselftest and kvm-unit-tests will be impacted a lot because
> > > > hundreds of tests are run serially. And it looks clumsy to reload KVM
> > > > module to set enable_virt_at_load to make tests run faster. I think the
> > > > test slowdown is a more realistic problem than running an off-tree
> > > > hypervisor, so I vote to make enabling virtualization at load time the
> > > > default behavior and if we really want to support an off-tree hypervisor,
> > > > we can add a new module param to opt in enabling virtualization at runtime.
I definitely don't object to making it the default behavior, though I would do so
in a separate patch, e.g. in case enabling virtualization by default somehow
causes problems.
We could also add a Kconfig to control the default behavior, though IMO that'd be
overkill without an actual use case for having virtualization off by default.
> > > I am not following why off-tree hypervisor is ever related to this.
> >
> > Enabling virtualization at runtime was added to support an off-tree hypervisor
> > (see the commit below).
>
> Oh, ok. I was thinking something else.
>
> > >
> > > The problem of enabling virt during module loading by default is it impacts
> > > all ARCHs.
Practically speaking, Intel is the only vendor where enabling virtualization is
interesting enough for anyone to care. On SVM and all other architectures,
enabling virtualization doesn't meaningfully change the functionality of the
current mode.
And impacting all architectures isn't a bad thing. Yes, it requires getting buy-in
from more people, and maybe additional testing, though in this case we should get
that for "free" by virtue of being in linux-next. But those are one-time costs,
and not particularly high costs.
> > > Given this performance downgrade (if we care) can be resolved by
> > > explicitly doing on_each_cpu() below, I am not sure why we want to choose
> > > this radical approach.
Because it's not radical? And manually doing on_each_cpu() requires complexity
that we would ideally avoid.
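
For posterity, a bare-bones sketch of that alternative (the wrapper names below
are made up for illustration, and I'm assuming kvm_arch_hardware_enable() as
the per-CPU hook):

  /*
   * Illustrative sketch only, not actual KVM code: enable virtualization on
   * all currently-online CPUs via IPI.  CPUs onlined later would still need
   * a cpuhp callback, which is part of the extra complexity.
   */
  static void __enable_virt_cpu(void *ign)
  {
          if (kvm_arch_hardware_enable())
                  pr_err("kvm: failed to enable virt on CPU%d\n",
                         raw_smp_processor_id());
  }

  static void enable_virt_on_all_cpus(void)
  {
          on_each_cpu(__enable_virt_cpu, NULL, 1);
  }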
15+ years ago, when VMX and SVM were nascent technologies, there was likely a
good argument from a security perspective for leaving virtualization disabled.
E.g. the ucode flows were new _and_ massive, and thus a potentially juicy attack
surface.
But these days, AFAIK enabling virtualization is not considered to be a security
risk, nor are there performance or stability downsides. E.g. it's not all that
different than the kernel enabling CR4.PKE even though it's entirely possible
userspace will never actually use protection keys.
> > IIUC, we plan to set up TDX module at KVM load time; we need to enable virt
> > at load time at least for TDX. Definitely, on_each_cpu() can solve the perf
> > concern. But a solution which can also satisfy TDX's need is better to me.
> >
>
> Doing on_each_cpu() explicitly can also meet TDX's purpose. We just
> explicitly enable virtualization during module loading if we are going to
> enable TDX. For all other cases, the behaviour remains the same, unless
> they want to change when to enable virtualization, e.g., to do it by default
> when loading the module.
>
> As for always, or by default, enabling virtualization during module loading,
> we discussed this a bit before:
>
> https://lore.kernel.org/kvm/[email protected]/
>
> My real point is that introducing a module parameter, which is a userspace ABI,
Module params aren't strictly ABI, and even if they were, this would only be
problematic if we wanted to remove the module param *and* doing so was a breaking
change. Enabling virtualization should be entirely transparent to userspace,
at least from a functional perspective; if changing how KVM enables virtualization
breaks userspace then we likely have bigger problems.
> to just fix some performance downgrade seems overkill when it can be
> mitigated by the kernel.
Performance is secondary for me, the primary motivation is simplifying the overall
KVM code base. Yes, we _could_ use on_each_cpu() and enable virtualization
on-demand for TDX, but as above, it's extra complexity without any meaningful
benefit, at least AFAICT.
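
And for reference, the param itself would be tiny, something along these lines
(kvm_enable_virtualization() and kvm_init_virtualization() are placeholder
names based on this thread, not necessarily what the final patch uses):

  /* Minimal sketch only, not the actual patch. */
  static bool enable_virt_at_load;
  module_param(enable_virt_at_load, bool, 0444);

  static int kvm_init_virtualization(void)
  {
          if (enable_virt_at_load)
                  return kvm_enable_virtualization();

          return 0;
  }

  /* ...invoked from kvm_init(), just before /dev/kvm is registered. */

A setup that wants to avoid the first-VM latency, e.g. a TDX-enabled config or
a test box churning through selftests, would simply load kvm with
enable_virt_at_load=1; everything else behaves exactly as it does today.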
On Wed, 2024-05-29 at 08:01 -0700, Sean Christopherson wrote:
> On Tue, May 28, 2024, Kai Huang wrote:
> > On 24/05/2024 2:39 pm, Chao Gao wrote:
> > > On Fri, May 24, 2024 at 11:11:37AM +1200, Huang, Kai wrote:
> > > >
> > > >
> > > > On 23/05/2024 4:23 pm, Chao Gao wrote:
> > > > > On Thu, May 23, 2024 at 10:27:53AM +1200, Huang, Kai wrote:
> > > > > >
> > > > > >
> > > > > > On 22/05/2024 2:28 pm, Sean Christopherson wrote:
> > > > > > > Add an off-by-default module param, enable_virt_at_load, to let userspace
> > > > > > > force virtualization to be enabled in hardware when KVM is initialized,
> > > > > > > i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
> > > > > > > during KVM initialization allows userspace to avoid the additional latency
> > > > > > > when creating/destroying the first/last VM. Now that KVM uses the cpuhp
> > > > > > > framework to do per-CPU enabling, the latency could be non-trivial as the
> > > > > > > cpuhp bringup/teardown is serialized across CPUs, e.g. the latency could
> > > > > > > be problematic for use cases that need to spin up VMs quickly.
> > > > > >
> > > > > > How about we defer this until there's a real complaint that this isn't
> > > > > > acceptable? To me it doesn't sound like the "latency of creating the first
> > > > > > VM" matters a lot in real CSP deployments.
> > > > >
> > > > > I suspect kselftest and kvm-unit-tests will be impacted a lot because
> > > > > hundreds of tests are run serially. And it looks clumsy to reload the KVM
> > > > > module to set enable_virt_at_load to make tests run faster. I think the
> > > > > test slowdown is a more realistic problem than running an off-tree
> > > > > hypervisor, so I vote to make enabling virtualization at load time the
> > > > > default behavior, and if we really want to support an off-tree hypervisor,
> > > > > we can add a new module param to opt in to enabling virtualization at runtime.
>
> I definitely don't object to making it the default behavior, though I would do so
> in a separate patch, e.g. in case enabling virtualization by default somehow
> causes problems.
>
> We could also add a Kconfig to control the default behavior, though IMO that'd be
> overkill without an actual use case for having virtualization off by default.
>
> > > > I am not following why an off-tree hypervisor is related to this at all.
> > >
> > > Enabling virtualization at runtime was added to support an off-tree hypervisor
> > > (see the commit below).
> >
> > Oh, ok. I was thinking something else.
> >
> > > >
> > > > The problem of enabling virt during module loading by default is it impacts
> > > > all ARCHs.
>
> Practically speaking, Intel is the only vendor where enabling virtualization is
> interesting enough for anyone to care. On SVM and all other architectures,
> enabling virtualization doesn't meaningfully change the functionality of the
> current mode.
>
> And impacting all architectures isn't a bad thing. Yes, it requires getting buy-in
> from more people, and maybe additional testing, though in this case we should get
> that for "free" by virtue of being in linux-next. But those are one-time costs,
> and not particularly high costs.
>
> > > > Given this performance downgrade (if we care) can be resolved by
> > > > explicitly doing on_each_cpu() below, I am not sure why we want to choose
> > > > this radical approach.
>
> Because it's not radical? And manually doing on_each_cpu() requires complexity
> that we would ideally avoid.
>
> 15+ years ago, when VMX and SVM were nascent technologies, there was likely a
> good argument from a security perspective for leaving virtualization disabled.
> E.g. the ucode flows were new _and_ massive, and thus a potentially juicy attack
> surface.
>
> But these days, AFAIK enabling virtualization is not considered to be a security
> risk, nor are there performance or stability downsides. E.g. it's not all that
> different than the kernel enabling CR4.PKE even though it's entirely possible
> userspace will never actually use protection keys.
Agreed, this is not a security risk. If all other ARCHs are fine with enabling
virtualization on module load, then I don't see any real problem.
>
> > > IIUC, we plan to set up TDX module at KVM load time; we need to enable virt
> > > at load time at least for TDX. Definitely, on_each_cpu() can solve the perf
> > > concern. But a solution which can also satisfy TDX's need is better to me.
> > >
> >
> > Doing on_each_cpu() explicitly can also meet TDX's purpose. We just
> > explicitly enable virtualization during module loading if we are going to
> > enable TDX. For all other cases, the behaviour remains the same, unless
> > they want to change when to enable virtualization, e.g., to do it by default
> > when loading the module.
> >
> > As for always, or by default, enabling virtualization during module loading,
> > we discussed this a bit before:
> >
> > https://lore.kernel.org/kvm/[email protected]/
> >
> > My real point is that introducing a module parameter, which is a userspace ABI,
>
> Module params aren't strictly ABI, and even if they were, this would only be
> problematic if we wanted to remove the module param *and* doing so was a breaking
> change.
>
Yes. The concern is that, once introduced, I don't think we can easily remove
it even if it becomes useless.
> Enabling virtualization should be entirely transparent to userspace,
> at least from a functional perspective; if changing how KVM enables virtualization
> breaks userspace then we likely have bigger problems.
I am not sure how I should interpret this.
"having a module param" doesn't necessarily mean "entirely transparent to
userspace", right? :-)
>
> > to just fix some performance downgrade seems overkill when it can be
> > mitigated by the kernel.
>
> Performance is secondary for me, the primary motivation is simplifying the overall
> KVM code base. Yes, we _could_ use on_each_cpu() and enable virtualization
> on-demand for TDX, but as above, it's extra complexity without any meaningful
> benefit, at least AFAICT.
Either way works for me.
I just think using a module param to resolve some problem while there can
be a solution completely in the kernel seems overkill :-)
On Wed, May 29, 2024, Kai Huang wrote:
> On Wed, 2024-05-29 at 08:01 -0700, Sean Christopherson wrote:
> > Enabling virtualization should be entirely transparent to userspace,
> > at least from a functional perspective; if changing how KVM enables virtualization
> > breaks userspace then we likely have bigger problems.
>
> I am not sure how I should interpret this.
>
> "having a module param" doesn't necessarily mean "entirely transparent to
> userspace", right? :-)
Ah, sorry, that was unclear. By "transparent to userspace" I meant the
functionality of userspace VMMs wouldn't be affected if we add (or delete) a
module param. E.g. QEMU should work exactly the same regardless of when KVM
enables virtualization.
> > Performance is secondary for me, the primary motivation is simplifying the overall
> > KVM code base. Yes, we _could_ use on_each_cpu() and enable virtualization
> > on-demand for TDX, but as above, it's extra complexity without any meaningful
> > benefit, at least AFAICT.
>
> Either way works for me.
>
> I just think using a module param to resolve some problem while there can
> be a solution completely in the kernel seems overkill :-)
The module param doesn't solve the problem, e.g. we could solve this entirely
in-kernel simply by having KVM unconditionally enable virtualization during
initialization. The module param is mostly there to continue playing nice with
out-of-tree hypervisors, and to a lesser extent to give us a "break in case of
fire" knob.
On Wed, 2024-05-29 at 16:07 -0700, Sean Christopherson wrote:
> On Wed, May 29, 2024, Kai Huang wrote:
> > On Wed, 2024-05-29 at 08:01 -0700, Sean Christopherson wrote:
> > > Enabling virtualization should be entirely transparent to userspace,
> > > at least from a functional perspective; if changing how KVM enables virtualization
> > > breaks userspace then we likely have bigger problems.
> >
> > I am not sure how I should interpret this.
> >
> > "having a module param" doesn't necessarily mean "entirely transparent to
> > userspace", right? :-)
>
> Ah, sorry, that was unclear. By "transparent to userspace" I meant the
> functionality of userspace VMMs wouldn't be affected if we add (or delete) a
> module param. E.g. QEMU should work exactly the same regardless of when KVM
> enables virtualization.
>
> > > Performance is secondary for me, the primary motivation is simplifying the overall
> > > KVM code base. Yes, we _could_ use on_each_cpu() and enable virtualization
> > > on-demand for TDX, but as above, it's extra complexity without any meaningful
> > > benefit, at least AFAICT.
> >
> > Either way works for me.
> >
> > I just think using a module param to resolve some problem while there can
> > be a solution completely in the kernel seems overkill :-)
>
> The module param doesn't solve the problem, e.g. we could solve this entirely
> in-kernel simply by having KVM unconditionally enable virtualization during
> initialization. The module param is mostly there to continue playing nice with
> out-of-tree hypervisors, and to a lesser extent to give us a "break in case of
> fire" knob.
OK. Now I understand: you want to make enabling virtualization at load time
the default behaviour, and make the module param a way to defer enabling
virt until the first VM is created.