2022-08-16 08:31:51

by Sean Christopherson

Subject: [PATCH 0/3] KVM: kvm_create_vm() bug fixes and cleanup

Fix two (embarrassing) bugs in kvm_create_vm() where KVM fails to properly
unwind VM creation, which most often manifests as a not-present page fault
due to use-after-free when walking the global vm_list (VM is added and
freed, but never removed from the list). Patch 3 is a loosely related
clean up.

I discovered the try_module_get() bug by inspection[*]. syzkaller found
the debugfs bug around the same time.

The try_module_get() bug is especially bad/amusing. The "rmmod --wait"
behavior KVM is trying to handle was removed ~9 years ago...

[*] https://lore.kernel.org/all/[email protected]

Sean Christopherson (3):
KVM: Properly unwind VM creation if creating debugfs fails
KVM: Unconditionally get a ref to /dev/kvm module when creating a VM
KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()

virt/kvm/kvm_main.c | 39 +++++++++++++++++----------------------
1 file changed, 17 insertions(+), 22 deletions(-)


base-commit: 19a7cc817a380f7a412d7d76e145e9e2bc47e52f
--
2.37.1.595.g718a3a8f04-goog


2022-08-16 08:44:53

by Sean Christopherson

Subject: [PATCH 2/3] KVM: Unconditionally get a ref to /dev/kvm module when creating a VM

Unconditionally get a reference to the /dev/kvm module when creating a VM
instead of using try_module_get(), which will fail if the module is in
the process of being forcefully unloaded. The error handling when
try_module_get() fails doesn't properly unwind all that has been done,
e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
from the global list. Not removing VMs from the global list tends to be
fatal, e.g. leads to use-after-free explosions.

The obvious alternative would be to add proper unwinding, but the
justification for using try_module_get(), "rmmod --wait", is completely
bogus as support for "rmmod --wait", i.e. delete_module() without
O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
--wait option.") nearly a decade ago.

It's still possible for try_module_get() to fail due to the module dying
(more like being killed), as the module will be tagged MODULE_STATE_GOING
by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
with forced unloading is an exercise in futility and gives a false sense
of security. Using try_module_get() only prevents acquiring _new_
references, it doesn't magically put the references held by other VMs,
and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
guaranteed to cause spectacular fireworks; the window where KVM will fail
try_module_get() is tiny compared to the window where KVM is building and
running the VM with an elevated module refcount.
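
The difference between the two ways of taking a reference can be sketched
with a small userspace model. This is only an illustration under stated
assumptions: every toy_* name is hypothetical, the refcount is a plain int
rather than an atomic, and the "forced unload" merely flips a state flag.
It is not how the kernel implements module references.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum toy_state { TOY_LIVE, TOY_GOING };

struct toy_module {
	enum toy_state state;
	int refcnt;
};

/* Models try_module_get(): refuses new references once the module is going. */
static bool toy_try_get(struct toy_module *m)
{
	if (m->state != TOY_LIVE)
		return false;
	m->refcnt++;
	return true;
}

/*
 * Models __module_get(): the caller is assumed to already hold a reference
 * (here, the one taken when the device file was opened), so there is no
 * failure case to handle.
 */
static void toy_get(struct toy_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt++;
}

/* Drops one reference. */
static void toy_put(struct toy_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Models a forced unload: flips the state but does not wait for anyone. */
static void toy_force_unload(struct toy_module *m)
{
	m->state = TOY_GOING;
}
```

In this model, calling toy_force_unload() makes later toy_try_get() calls
fail but leaves the existing count untouched, which mirrors the point
above: refusing new references does nothing about the ones already held.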

Addressing KVM's inability to play nice with "rmmod --force" is firmly
out-of-scope. Forcefully unloading any module taints the kernel (for obvious
reasons) _and_ requires the kernel to be built with
CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
amusing disclaimer that it's "mainly for kernel developers and desperate
users". In other words, KVM is free to scoff at bug reports due to using
"rmmod --force" while VMs may be running.

Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Cc: [email protected]
Cc: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee5f48cc100b..15e304e059d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1134,6 +1134,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (!kvm)
return ERR_PTR(-ENOMEM);

+ /* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */
+ __module_get(kvm_chardev_ops.owner);
+
KVM_MMU_LOCK_INIT(kvm);
mmgrab(current->mm);
kvm->mm = current->mm;
@@ -1226,16 +1229,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);

- /*
- * When the fd passed to this ioctl() is opened it pins the module,
- * but try_module_get() also prevents getting a reference if the module
- * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
- */
- if (!try_module_get(kvm_chardev_ops.owner)) {
- r = -ENODEV;
- goto out_err;
- }
-
return kvm;

out_err:
@@ -1259,6 +1252,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_srcu:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
+ module_put(kvm_chardev_ops.owner);
return ERR_PTR(r);
}

--
2.37.1.595.g718a3a8f04-goog

2022-08-16 08:47:47

by Sean Christopherson

Subject: [PATCH 3/3] KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()

Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating
and initializing coalesced MMIO objects is separate from registering any
associated devices. Moving coalesced MMIO cleans up the last oddity
where KVM does VM creation/initialization after kvm_create_vm(), and more
importantly after kvm_arch_post_init_vm() is called and the VM is added
to the global vm_list, i.e. after the VM is fully created as far as KVM
is concerned.

Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but
the original implementation was completely devoid of error handling.
Commit 6ce5a090a9a0 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s
error handling") fixed the various bugs, and in doing so rightly moved the
call to after kvm_create_vm() because kvm_coalesced_mmio_init() also
registered the coalesced MMIO device. Commit 2b3c246a682c ("KVM: Make
coalesced mmio use a device per zone") cleaned up that mess by having
each zone register a separate device, i.e. moved device registration to
its logical home in kvm_vm_ioctl_register_coalesced_mmio(). As a result,
kvm_coalesced_mmio_init() is now a "pure" initialization helper and can
be safely called from kvm_create_vm().

Opportunistically drop the #ifdef, as KVM provides stubs for
kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (arm).
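
The stub pattern that makes the #ifdef unnecessary at the call site can be
sketched as follows. This is a standalone illustration, not the KVM code:
STUB_CONFIG_MMIO and all stub_* names are hypothetical stand-ins for the
real config option and helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct stub_vm {
	void *mmio_ring;	/* only allocated when STUB_CONFIG_MMIO is defined */
};

#ifdef STUB_CONFIG_MMIO
/* Real implementation: allocate the ring, report failure to the caller. */
static int stub_mmio_init(struct stub_vm *vm)
{
	vm->mmio_ring = calloc(1, 4096);
	return vm->mmio_ring ? 0 : -ENOMEM;
}

static void stub_mmio_free(struct stub_vm *vm)
{
	free(vm->mmio_ring);
	vm->mmio_ring = NULL;
}
#else
/* Feature compiled out: init trivially succeeds, free does nothing. */
static inline int stub_mmio_init(struct stub_vm *vm) { (void)vm; return 0; }
static inline void stub_mmio_free(struct stub_vm *vm) { (void)vm; }
#endif

/* The caller is identical in both configurations, with no #ifdef. */
static int stub_create_vm(struct stub_vm *vm)
{
	int r;

	vm->mmio_ring = NULL;

	r = stub_mmio_init(vm);
	if (r < 0)
		return r;

	return 0;
}

static void stub_destroy_vm(struct stub_vm *vm)
{
	stub_mmio_free(vm);
}
```

Building with or without -DSTUB_CONFIG_MMIO changes only which helper
bodies are compiled; stub_create_vm() stays the same either way.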

Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 15e304e059d4..44b92d773156 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1214,6 +1214,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (r)
goto out_err_no_mmu_notifier;

+ r = kvm_coalesced_mmio_init(kvm);
+ if (r < 0)
+ goto out_no_coalesced_mmio;
+
r = kvm_create_vm_debugfs(kvm, fdname);
if (r)
goto out_err_no_debugfs;
@@ -1234,6 +1238,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err:
kvm_destroy_vm_debugfs(kvm);
out_err_no_debugfs:
+ kvm_coalesced_mmio_free(kvm);
+out_no_coalesced_mmio:
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
if (kvm->mmu_notifier.ops)
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
@@ -4907,11 +4913,6 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
goto put_fd;
}

-#ifdef CONFIG_KVM_MMIO
- r = kvm_coalesced_mmio_init(kvm);
- if (r < 0)
- goto put_kvm;
-#endif
file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
if (IS_ERR(file)) {
r = PTR_ERR(file);
--
2.37.1.595.g718a3a8f04-goog

2022-08-16 09:08:33

by Sean Christopherson

Subject: [PATCH 1/3] KVM: Properly unwind VM creation if creating debugfs fails

Properly unwind VM creation if kvm_create_vm_debugfs() fails. A recent
change to invoke kvm_create_vm_debug() in kvm_create_vm() was led astray
by buggy try_module_get() handling added by commit 5f6de5cbebee ("KVM:
Prevent module exit until all VMs are freed"). The debugfs error path
effectively inherits the bad error path of try_module_get(), e.g. KVM
leaves the to-be-freed VM on vm_list even though KVM appears to do the
right thing by calling module_put() and falling through.
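
The failure mode can be sketched in userspace with a singly linked list
standing in for vm_list. All sketch_* names are hypothetical and locking is
omitted. Note that the patch itself sidesteps the problem for debugfs by
hoisting the fallible step above list_add(); the sketch only shows, in
general terms, what a late error path has to undo once the object is
already on a global list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_vm {
	struct sketch_vm *next;
};

/* Global list of live VMs, a stand-in for vm_list (no locking here). */
static struct sketch_vm *sketch_vm_list;

static void sketch_list_add(struct sketch_vm *vm)
{
	vm->next = sketch_vm_list;
	sketch_vm_list = vm;
}

static void sketch_list_del(struct sketch_vm *vm)
{
	struct sketch_vm **pp;

	for (pp = &sketch_vm_list; *pp; pp = &(*pp)->next) {
		if (*pp == vm) {
			*pp = vm->next;
			return;
		}
	}
}

/* late_step_fails simulates a failure after the VM is already listed. */
static struct sketch_vm *sketch_create_vm(bool late_step_fails)
{
	struct sketch_vm *vm = calloc(1, sizeof(*vm));

	if (!vm)
		return NULL;

	sketch_list_add(vm);

	if (late_step_fails)
		goto out_err;

	return vm;

out_err:
	/*
	 * Undo in reverse order of setup. Skipping the list removal would
	 * leave a freed object reachable from sketch_vm_list, so the next
	 * list walk would touch freed memory.
	 */
	sketch_list_del(vm);
	free(vm);
	return NULL;
}
```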

Opportunistically hoist kvm_create_vm_debugfs() above the call to
kvm_arch_post_init_vm() so that the "post-init" arch hook is actually
invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init()
for the moment). x86 is the only non-nop implementation of the post-init
hook, and it doesn't allocate/initialize any objects that are reachable
via debugfs code (spawns a kthread worker for the NX huge page mitigation).

Leave the buggy try_module_get() alone for now; it will be fixed in a
separate commit.

Fixes: b74ed7a68ec1 ("KVM: Actually create debugfs in kvm_create_vm()")
Reported-by: [email protected]
Cc: Oliver Upton <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 515dfe9d3bcf..ee5f48cc100b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1211,9 +1211,13 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (r)
goto out_err_no_mmu_notifier;

+ r = kvm_create_vm_debugfs(kvm, fdname);
+ if (r)
+ goto out_err_no_debugfs;
+
r = kvm_arch_post_init_vm(kvm);
if (r)
- goto out_err_mmu_notifier;
+ goto out_err;

mutex_lock(&kvm_lock);
list_add(&kvm->vm_list, &vm_list);
@@ -1229,18 +1233,14 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
*/
if (!try_module_get(kvm_chardev_ops.owner)) {
r = -ENODEV;
- goto out_err_mmu_notifier;
- }
-
- r = kvm_create_vm_debugfs(kvm, fdname);
- if (r)
goto out_err;
+ }

return kvm;

out_err:
- module_put(kvm_chardev_ops.owner);
-out_err_mmu_notifier:
+ kvm_destroy_vm_debugfs(kvm);
+out_err_no_debugfs:
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
if (kvm->mmu_notifier.ops)
mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
--
2.37.1.595.g718a3a8f04-goog

2022-08-16 17:43:54

by David Matlack

Subject: Re: [PATCH 2/3] KVM: Unconditionally get a ref to /dev/kvm module when creating a VM

On Tue, Aug 16, 2022 at 05:39:36AM +0000, Sean Christopherson wrote:
> Unconditionally get a reference to the /dev/kvm module when creating a VM
> instead of using try_get_module(), which will fail if the module is in
> the process of being forcefully unloaded. The error handling when
> try_get_module() fails doesn't properly unwind all that has been done,
> e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
> from the global list. Not removing VMs from the global list tends to be
> fatal, e.g. leads to use-after-free explosions.
>
> The obvious alternative would be to add proper unwinding, but the
> justification for using try_get_module(), "rmmod --wait", is completely
> bogus as support for "rmmod --wait", i.e. delete_module() without
> O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
> --wait option.") nearly a decade ago.

Ah! include/linux/module.h may also need a cleanup then. The comment
above __module_get() explicitly mentions "rmmod --wait", which is what
led me to use try_module_get() for commit 5f6de5cbebee ("KVM: Prevent
module exit until all VMs are freed").

2022-08-16 17:54:29

by Oliver Upton

Subject: Re: [PATCH 1/3] KVM: Properly unwind VM creation if creating debugfs fails

On Tue, Aug 16, 2022 at 05:39:35AM +0000, Sean Christopherson wrote:
> Properly unwind VM creation if kvm_create_vm_debugfs() fails. A recent
> change to invoke kvm_create_vm_debug() in kvm_create_vm() was led astray

typo: kvm_create_vm_debugfs()

> by buggy try_get_module() handling adding by commit 5f6de5cbebee ("KVM:
> Prevent module exit until all VMs are freed"). The debugfs error path
> effectively inherits the bad error path of try_module_get(), e.g. KVM
> leaves the to-be-free VM on vm_list even though KVM appears to do the
> right thing by calling module_put() and falling through.
>
> Opportunistically hoist kvm_create_vm_debugfs() above the call to
> kvm_arch_post_init_vm() so that the "post-init" arch hook is actually
> invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init()
> for the moment). x86 is the only non-nop implementation of the post-init
> hook, and it doesn't allocate/initialize any objects that are reachable
> via debugfs code (spawns a kthread worker for the NX huge page mitigation).
>
> Leave the buggy try_get_module() alone for now, it will be fixed in a
> separate commit.
>
> Fixes: b74ed7a68ec1 ("KVM: Actually create debugfs in kvm_create_vm()")
> Reported-by: [email protected]
> Cc: Oliver Upton <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>

Fun times! Thanks for the fix Sean.

Reviewed-by: Oliver Upton <[email protected]>

--
Best,
Oliver

2022-08-16 18:36:41

by Oliver Upton

Subject: Re: [PATCH 3/3] KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()

On Tue, Aug 16, 2022 at 05:39:37AM +0000, Sean Christopherson wrote:
> Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating
> and initializing coalesced MMIO objects is separate from registering any
> associated devices. Moving coalesced MMIO cleans up the last oddity
> where KVM does VM creation/initialization after kvm_create_vm(), and more
> importantly after kvm_arch_post_init_vm() is called and the VM is added
> to the global vm_list, i.e. after the VM is fully created as far as KVM
> is concerned.
>
> Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but
> the original implementation was completely devoid of error handling.
> Commit 6ce5a090a9a0 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s
> error handling" fixed the various bugs, and in doing so rightly moved the
> call to after kvm_create_vm() because kvm_coalesced_mmio_init() also
> registered the coalesced MMIO device. Commit 2b3c246a682c ("KVM: Make
> coalesced mmio use a device per zone") cleaned up that mess by having
> each zone register a separate device, i.e. moved device registration to
> its logical home in kvm_vm_ioctl_register_coalesced_mmio(). As a result,
> kvm_coalesced_mmio_init() is now a "pure" initialization helper and can
> be safely called from kvm_create_vm().
>
> Opportunstically drop the #ifdef, KVM provides stubs for
> kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (arm).
^^^
We have CONFIG_KVM_MMIO=y on arm64. Is it actually s390?

--
Thanks,
Oliver

2022-08-16 19:33:58

by Sean Christopherson

Subject: Re: [PATCH 3/3] KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()

On Tue, Aug 16, 2022, Oliver Upton wrote:
> On Tue, Aug 16, 2022 at 05:39:37AM +0000, Sean Christopherson wrote:
> > Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating
> > and initializing coalesced MMIO objects is separate from registering any
> > associated devices. Moving coalesced MMIO cleans up the last oddity
> > where KVM does VM creation/initialization after kvm_create_vm(), and more
> > importantly after kvm_arch_post_init_vm() is called and the VM is added
> > to the global vm_list, i.e. after the VM is fully created as far as KVM
> > is concerned.
> >
> > Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but
> > the original implementation was completely devoid of error handling.
> > Commit 6ce5a090a9a0 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s
> > error handling" fixed the various bugs, and in doing so rightly moved the
> > call to after kvm_create_vm() because kvm_coalesced_mmio_init() also
> > registered the coalesced MMIO device. Commit 2b3c246a682c ("KVM: Make
> > coalesced mmio use a device per zone") cleaned up that mess by having
> > each zone register a separate device, i.e. moved device registration to
> > its logical home in kvm_vm_ioctl_register_coalesced_mmio(). As a result,
> > kvm_coalesced_mmio_init() is now a "pure" initialization helper and can
> > be safely called from kvm_create_vm().
> >
> > Opportunstically drop the #ifdef, KVM provides stubs for
> > kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (arm).
> ^^^
> We have CONFIG_KVM_MMIO=y on arm64. Is it actually s390?

Yes, I apparently can't read.

2022-08-16 22:45:59

by Sean Christopherson

Subject: Re: [PATCH 2/3] KVM: Unconditionally get a ref to /dev/kvm module when creating a VM

On Tue, Aug 16, 2022, David Matlack wrote:
> On Tue, Aug 16, 2022 at 05:39:36AM +0000, Sean Christopherson wrote:
> > Unconditionally get a reference to the /dev/kvm module when creating a VM
> > instead of using try_get_module(), which will fail if the module is in
> > the process of being forcefully unloaded. The error handling when
> > try_get_module() fails doesn't properly unwind all that has been done,
> > e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
> > from the global list. Not removing VMs from the global list tends to be
> > fatal, e.g. leads to use-after-free explosions.
> >
> > The obvious alternative would be to add proper unwinding, but the
> > justification for using try_get_module(), "rmmod --wait", is completely
> > bogus as support for "rmmod --wait", i.e. delete_module() without
> > O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
> > --wait option.") nearly a decade ago.
>
> Ah! include/linux/module.h may also need a cleanup then. The comment
> above __module_get() explicitly mentions "rmmod --wait", which is what
> led me to use try_module_get() for commit 5f6de5cbebee ("KVM: Prevent
> module exit until all VMs are freed").

Ugh, I didn't see that one. The whole thing is a mess. try_module_get() also
has a comment (just below the "rmmod --wait" comment) saying that it's the one
true way of doing things, but that's at best misleading for cases like this where
a module is taking a reference of _itself_.

The man pages are also woefully out of date :-/

2022-08-17 09:52:45

by Paolo Bonzini

Subject: Re: [PATCH 0/3] KVM: kvm_create_vm() bug fixes and cleanup

Queued, thanks (with the arm/s390 confusion fixed in the last
commit message).

Paolo