Fix a use-after-module-unload bug in the async #PF code by ensuring all
workqueue items fully complete before tearing down vCPUs. Do a bit of
cleanup to try and make the code slightly more readable.
Side topic, I'm pretty sure s390's flic_set_attr() is broken/racy. The async #PF
code assumes that only the vCPU can invoke
kvm_clear_async_pf_completion_queue(), as there are multiple assets that
are effectively protected by vcpu->mutex. I don't think any real world VMMs
trigger the race(s), but AFAICT it's a bug. I think/assume taking all
vCPUs' mutexes would plug the hole?
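E.g. a minimal sketch of what I'm thinking (helper name and error handling
invented purely for illustration, this isn't actual KVM code; IIRC arm64's
vGIC code carries a similar lock_all_vcpus() helper):

	static int lock_all_vcpus(struct kvm *kvm)
	{
		struct kvm_vcpu *vcpu;
		unsigned long i, j;

		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (mutex_lock_killable(&vcpu->mutex))
				goto unwind;
		}
		return 0;

	unwind:
		/* Drop only the mutexes that were acquired before the failure. */
		kvm_for_each_vcpu(j, vcpu, kvm) {
			if (j == i)
				break;
			mutex_unlock(&vcpu->mutex);
		}
		return -EINTR;
	}

flic_set_attr() would take every vCPU's mutex before touching state that is
normally protected by a single vcpu->mutex, and release them all on the way
out.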
Sean Christopherson (4):
KVM: Always flush async #PF workqueue when vCPU is being destroyed
KVM: Put mm immediately after async #PF worker completes remote gup()
KVM: Get reference to VM's address space in the async #PF worker
KVM: Nullify async #PF worker's "apf" pointer as soon as it might be
freed
include/linux/kvm_host.h | 1 -
virt/kvm/async_pf.c | 79 ++++++++++++++++++++++++++++------------
2 files changed, 55 insertions(+), 25 deletions(-)
base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
--
2.43.0.472.g3155946c3a-goog
Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
completion queue, e.g. when a VM and all its vCPUs are being destroyed.
KVM must ensure that none of its workqueue callbacks is running when the
last reference to the KVM _module_ is put. Gifting a reference to the
associated VM prevents the workqueue callback from dereferencing freed
vCPU/VM memory, but does not prevent the KVM module from being unloaded
before the callback completes.
Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: events async_pf_execute [kvm]
RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
Call Trace:
<TASK>
async_pf_execute+0x198/0x260 [kvm]
process_one_work+0x145/0x2d0
worker_thread+0x27e/0x3a0
kthread+0xba/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
---[ end trace 0000000000000000 ]---
INFO: task kworker/8:1:251 blocked for more than 120 seconds.
Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
Workqueue: events async_pf_execute [kvm]
Call Trace:
<TASK>
__schedule+0x33f/0xa40
schedule+0x53/0xc0
schedule_timeout+0x12a/0x140
__wait_for_common+0x8d/0x1d0
__flush_work.isra.0+0x19f/0x2c0
kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
kvm_put_kvm+0x1c1/0x320 [kvm]
async_pf_execute+0x198/0x260 [kvm]
process_one_work+0x145/0x2d0
worker_thread+0x27e/0x3a0
kthread+0xba/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
then there's no need to gift async_pf_execute() a reference because all
invocations of async_pf_execute() will be forced to complete before the
vCPU and its VM are destroyed/freed. And that in turn fixes the module
unloading bug as __fput() won't do module_put() on the last vCPU reference
until the vCPU has been freed, e.g. if closing the vCPU file also puts the
last reference to the KVM module.
Note that kvm_check_async_pf_completion() may also take the work item off
the completion queue and so also needs to flush the work queue, as the
work will not be seen by kvm_clear_async_pf_completion_queue(). Waiting
on the workqueue could theoretically delay a vCPU due to waiting for the
work to complete, but that's a very, very small chance, and likely a very
small delay. kvm_arch_async_page_present_queued() unconditionally makes a
new request, i.e. will effectively delay entering the guest, so the
remaining work is really just:
trace_kvm_async_pf_completed(addr, cr2_or_gpa);
__kvm_vcpu_wake_up(vcpu);
mmput(mm);
and mmput() can't drop the last reference to the page tables if the vCPU is
still alive, i.e. the vCPU won't get stuck tearing down page tables.
Add a helper to do the flushing, specifically to deal with "wakeup all"
work items, as they aren't actually work items, i.e. are never placed in a
workqueue. Trying to flush a bogus workqueue entry rightly makes
__flush_work() complain (kudos to whoever added that sanity check).
Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
freed") *tried* to fix the module refcounting issue by having VMs grab a
reference to the module, but that only made the bug slightly harder to hit
as it gave async_pf_execute() a bit more time to complete before the KVM
module could be unloaded.
Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
Cc: [email protected]
Cc: David Matlack <[email protected]>
Cc: Xu Yilun <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index e033c79d528e..876927a558ad 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
__kvm_vcpu_wake_up(vcpu);
mmput(mm);
- kvm_put_kvm(vcpu->kvm);
+}
+
+static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
+{
+ /*
+ * The async #PF is "done", but KVM must wait for the work item itself,
+ * i.e. async_pf_execute(), to run to completion. If KVM is a module,
+ * KVM must ensure *no* code owned by KVM (the module) can be run
+ * after the last call to module_put(), i.e. after the last reference
+ * to the last vCPU's file is put.
+ *
+ * "Wake all" events skip the queue and go straight to done, i.e. don't
+ * need to be flushed (but sanity check that the work wasn't queued).
+ */
+ if (work->wakeup_all)
+ WARN_ON_ONCE(work->work.func);
+ else
+ flush_work(&work->work);
+ kmem_cache_free(async_pf_cache, work);
}
void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
@@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
#else
if (cancel_work_sync(&work->work)) {
mmput(work->mm);
- kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
kmem_cache_free(async_pf_cache, work);
}
#endif
@@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
list_first_entry(&vcpu->async_pf.done,
typeof(*work), link);
list_del(&work->link);
- kmem_cache_free(async_pf_cache, work);
+
+ spin_unlock(&vcpu->async_pf.lock);
+
+ /*
+ * The async #PF is "done", but KVM must wait for the work item
+ * itself, i.e. async_pf_execute(), to run to completion. If
+ * KVM is a module, KVM must ensure *no* code owned by KVM
+ * (the module) can be run after the last call to module_put(),
+ * i.e. after the last reference to the last vCPU's file is put.
+ */
+ kvm_flush_and_free_async_pf_work(work);
+ spin_lock(&vcpu->async_pf.lock);
}
spin_unlock(&vcpu->async_pf.lock);
@@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
list_del(&work->queue);
vcpu->async_pf.queued--;
- kmem_cache_free(async_pf_cache, work);
+ kvm_flush_and_free_async_pf_work(work);
}
}
@@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
work->arch = *arch;
work->mm = current->mm;
mmget(work->mm);
- kvm_get_kvm(work->vcpu->kvm);
INIT_WORK(&work->work, async_pf_execute);
--
2.43.0.472.g3155946c3a-goog
Get a reference to the target VM's address space in async_pf_execute()
instead of gifting a reference from kvm_setup_async_pf(). Keeping the
address space alive just to service an async #PF is counter-productive,
i.e. if the process is exiting and all vCPUs are dead, then NOT doing
get_user_pages_remote() and freeing the address space asap is desirable.
Handling the mm reference entirely within async_pf_execute() also
simplifies the async #PF flows as a whole, e.g. it's not immediately
obvious when the worker task vs. the vCPU task is responsible for putting
the gifted mm reference.
Signed-off-by: Sean Christopherson <[email protected]>
---
include/linux/kvm_host.h | 1 -
virt/kvm/async_pf.c | 32 ++++++++++++++++++--------------
2 files changed, 18 insertions(+), 15 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..bbfefd7e612f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -238,7 +238,6 @@ struct kvm_async_pf {
struct list_head link;
struct list_head queue;
struct kvm_vcpu *vcpu;
- struct mm_struct *mm;
gpa_t cr2_or_gpa;
unsigned long addr;
struct kvm_arch_async_pf arch;
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d5dc50318aa6..c3f4f351a2ae 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
{
struct kvm_async_pf *apf =
container_of(work, struct kvm_async_pf, work);
- struct mm_struct *mm = apf->mm;
struct kvm_vcpu *vcpu = apf->vcpu;
+ struct mm_struct *mm = vcpu->kvm->mm;
unsigned long addr = apf->addr;
gpa_t cr2_or_gpa = apf->cr2_or_gpa;
int locked = 1;
@@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
might_sleep();
/*
- * This work is run asynchronously to the task which owns
- * mm and might be done in another context, so we must
- * access remotely.
+ * Attempt to pin the VM's host address space, and simply skip gup() if
+ * acquiring a pin fails, i.e. if the process is exiting. Note, KVM
+ * holds a reference to its associated mm_struct until the very end of
+ * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
+ * work item is fully processed.
*/
- mmap_read_lock(mm);
- get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
- if (locked)
- mmap_read_unlock(mm);
- mmput(mm);
+ if (mmget_not_zero(mm)) {
+ mmap_read_lock(mm);
+ get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
+ if (locked)
+ mmap_read_unlock(mm);
+ mmput(mm);
+ }
+ /*
+ * Notify and kick the vCPU even if faulting in the page failed, e.g.
+ * so that the vCPU can retry the fault synchronously.
+ */
if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
kvm_arch_async_page_present(vcpu, apf);
@@ -129,10 +137,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
#ifdef CONFIG_KVM_ASYNC_PF_SYNC
flush_work(&work->work);
#else
- if (cancel_work_sync(&work->work)) {
- mmput(work->mm);
+ if (cancel_work_sync(&work->work))
kmem_cache_free(async_pf_cache, work);
- }
#endif
spin_lock(&vcpu->async_pf.lock);
}
@@ -211,8 +217,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
work->cr2_or_gpa = cr2_or_gpa;
work->addr = hva;
work->arch = *arch;
- work->mm = current->mm;
- mmget(work->mm);
INIT_WORK(&work->work, async_pf_execute);
--
2.43.0.472.g3155946c3a-goog
Put the async #PF worker's reference to the VM's address space as soon as
the worker is done with the mm. This will allow deferring getting a
reference to the worker itself without having to track whether or not
getting a reference succeeded.
Note, if the vCPU is still alive, there is no danger of the worker getting
stuck with tearing down the host page tables, as userspace also holds a
reference (obviously), i.e. there is no risk of delaying the page-present
notification due to triggering the slow path in mmput().
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/async_pf.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 876927a558ad..d5dc50318aa6 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -64,6 +64,7 @@ static void async_pf_execute(struct work_struct *work)
get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
if (locked)
mmap_read_unlock(mm);
+ mmput(mm);
if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
kvm_arch_async_page_present(vcpu, apf);
@@ -85,8 +86,6 @@ static void async_pf_execute(struct work_struct *work)
trace_kvm_async_pf_completed(addr, cr2_or_gpa);
__kvm_vcpu_wake_up(vcpu);
-
- mmput(mm);
}
static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
--
2.43.0.472.g3155946c3a-goog
Nullify the async #PF worker's local "apf" pointer immediately after the
point where the structure can be freed by the vCPU. The existing comment
is helpful, but easy to overlook as there is no associated code.
Update the comment to clarify that it can be freed as soon as the lock
is dropped, as "after this point" isn't strictly accurate, nor does it
help understand what prevents the structure from being freed earlier.
Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/async_pf.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index c3f4f351a2ae..1088c6628de9 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -83,13 +83,14 @@ static void async_pf_execute(struct work_struct *work)
apf->vcpu = NULL;
spin_unlock(&vcpu->async_pf.lock);
- if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
- kvm_arch_async_page_present_queued(vcpu);
-
/*
- * apf may be freed by kvm_check_async_pf_completion() after
- * this point
+ * The apf struct may freed by kvm_check_async_pf_completion() as soon
+ * as the lock is dropped. Nullify it to prevent improper usage.
*/
+ apf = NULL;
+
+ if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
+ kvm_arch_async_page_present_queued(vcpu);
trace_kvm_async_pf_completed(addr, cr2_or_gpa);
--
2.43.0.472.g3155946c3a-goog
On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> KVM must ensure that none of its workqueue callbacks is running when the
> last reference to the KVM _module_ is put. Gifting a reference to the
> associated VM prevents the workqueue callback from dereferencing freed
> vCPU/VM memory, but does not prevent the KVM module from being unloaded
> before the callback completes.
>
> Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
> finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
>
> WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Workqueue: events async_pf_execute [kvm]
> RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> Call Trace:
> <TASK>
> async_pf_execute+0x198/0x260 [kvm]
> process_one_work+0x145/0x2d0
> worker_thread+0x27e/0x3a0
> kthread+0xba/0xe0
> ret_from_fork+0x2d/0x50
> ret_from_fork_asm+0x11/0x20
> </TASK>
> ---[ end trace 0000000000000000 ]---
> INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
> Workqueue: events async_pf_execute [kvm]
> Call Trace:
> <TASK>
> __schedule+0x33f/0xa40
> schedule+0x53/0xc0
> schedule_timeout+0x12a/0x140
> __wait_for_common+0x8d/0x1d0
> __flush_work.isra.0+0x19f/0x2c0
> kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> kvm_put_kvm+0x1c1/0x320 [kvm]
> async_pf_execute+0x198/0x260 [kvm]
> process_one_work+0x145/0x2d0
> worker_thread+0x27e/0x3a0
> kthread+0xba/0xe0
> ret_from_fork+0x2d/0x50
> ret_from_fork_asm+0x11/0x20
> </TASK>
>
> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> then there's no need to gift async_pf_execute() a reference because all
> invocations of async_pf_execute() will be forced to complete before the
> vCPU and its VM are destroyed/freed. And that in turn fixes the module
> unloading bug as __fput() won't do module_put() on the last vCPU reference
> until the vCPU has been freed, e.g. if closing the vCPU file also puts the
I'm not sure why __fput() of the vCPU fd should be mentioned here. I assume
we just need to say that vCPUs are freed before module_put(KVM the module)
in kvm_destroy_vm(); then the whole logic of the module unloading fix is:
1. All workqueue callbacks complete once kvm_clear_async_pf_completion_queue(vcpu) returns.
2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before the vCPU is freed.
3. vCPUs must be freed before module_put(KVM the module).
So all workqueue callbacks complete before module_put(KVM the module).
__fput() of the vCPU fd is not the only trigger of kvm_destroy_vm(), which
distracts me from the reason for the fix.
> last reference to the KVM module.
>
> Note that kvm_check_async_pf_completion() may also take the work item off
> the completion queue and so also needs to flush the work queue, as the
> work will not be seen by kvm_clear_async_pf_completion_queue(). Waiting
> on the workqueue could theoretically delay a vCPU due to waiting for the
> work to complete, but that's a very, very small chance, and likely a very
> small delay. kvm_arch_async_page_present_queued() unconditionally makes a
> new request, i.e. will effectively delay entering the guest, so the
> remaining work is really just:
>
> trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
> __kvm_vcpu_wake_up(vcpu);
>
> mmput(mm);
>
> and mmput() can't drop the last reference to the page tables if the vCPU is
> still alive, i.e. the vCPU won't get stuck tearing down page tables.
>
> Add a helper to do the flushing, specifically to deal with "wakeup all"
> work items, as they aren't actually work items, i.e. are never placed in a
> workqueue. Trying to flush a bogus workqueue entry rightly makes
> __flush_work() complain (kudos to whoever added that sanity check).
>
> Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> freed") *tried* to fix the module refcounting issue by having VMs grab a
> reference to the module, but that only made the bug slightly harder to hit
> as it gave async_pf_execute() a bit more time to complete before the KVM
> module could be unloaded.
>
> Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> Cc: [email protected]
> Cc: David Matlack <[email protected]>
> Cc: Xu Yilun <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
> 1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index e033c79d528e..876927a558ad 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
> __kvm_vcpu_wake_up(vcpu);
>
> mmput(mm);
> - kvm_put_kvm(vcpu->kvm);
> +}
> +
> +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> +{
> + /*
> + * The async #PF is "done", but KVM must wait for the work item itself,
> + * i.e. async_pf_execute(), to run to completion. If KVM is a module,
> + * KVM must ensure *no* code owned by the KVM (the module) can be run
> + * after the last call to module_put(), i.e. after the last reference
> + * to the last vCPU's file is put.
Maybe drop the "i.e."? It is not exactly true; other components like VFIO
may also be the last one to put the KVM reference?
> + *
> + * Wake all events skip the queue and go straight done, i.e. don't
> + * need to be flushed (but sanity check that the work wasn't queued).
> + */
> + if (work->wakeup_all)
> + WARN_ON_ONCE(work->work.func);
> + else
> + flush_work(&work->work);
> + kmem_cache_free(async_pf_cache, work);
> }
>
> void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> #else
> if (cancel_work_sync(&work->work)) {
> mmput(work->mm);
> - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> kmem_cache_free(async_pf_cache, work);
> }
> #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> list_first_entry(&vcpu->async_pf.done,
> typeof(*work), link);
> list_del(&work->link);
> - kmem_cache_free(async_pf_cache, work);
> +
> + spin_unlock(&vcpu->async_pf.lock);
> +
> + /*
> + * The async #PF is "done", but KVM must wait for the work item
> + * itself, i.e. async_pf_execute(), to run to completion. If
> + * KVM is a module, KVM must ensure *no* code owned by the KVM
> + * (the module) can be run after the last call to module_put(),
> + * i.e. after the last reference to the last vCPU's file is put.
> + */
> + kvm_flush_and_free_async_pf_work(work);
> + spin_lock(&vcpu->async_pf.lock);
> }
> spin_unlock(&vcpu->async_pf.lock);
>
> @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>
> list_del(&work->queue);
> vcpu->async_pf.queued--;
> - kmem_cache_free(async_pf_cache, work);
> + kvm_flush_and_free_async_pf_work(work);
> }
> }
>
> @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->arch = *arch;
> work->mm = current->mm;
> mmget(work->mm);
> - kvm_get_kvm(work->vcpu->kvm);
>
> INIT_WORK(&work->work, async_pf_execute);
Overall LGTM, Reviewed-by: Xu Yilun <[email protected]>
>
> --
> 2.43.0.472.g3155946c3a-goog
>
On Tue, Jan 09, 2024 at 05:15:31PM -0800, Sean Christopherson wrote:
> Put the async #PF worker's reference to the VM's address space as soon as
> the worker is done with the mm. This will allow deferring getting a
> reference to the worker itself without having to track whether or not
> getting a reference succeeded.
>
> Note, if the vCPU is still alive, there is no danger of the worker getting
> stuck with tearing down the host page tables, as userspace also holds a
> reference (obviously), i.e. there is no risk of delaying the page-present
> notification due to triggering the slow path in mmput().
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Xu Yilun <[email protected]>
> ---
> virt/kvm/async_pf.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 876927a558ad..d5dc50318aa6 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -64,6 +64,7 @@ static void async_pf_execute(struct work_struct *work)
> get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> if (locked)
> mmap_read_unlock(mm);
> + mmput(mm);
>
> if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
> kvm_arch_async_page_present(vcpu, apf);
> @@ -85,8 +86,6 @@ static void async_pf_execute(struct work_struct *work)
> trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
> __kvm_vcpu_wake_up(vcpu);
> -
> - mmput(mm);
> }
>
> static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> --
> 2.43.0.472.g3155946c3a-goog
>
On Tue, Jan 09, 2024 at 05:15:32PM -0800, Sean Christopherson wrote:
> Get a reference to the target VM's address space in async_pf_execute()
> instead of gifting a reference from kvm_setup_async_pf(). Keeping the
> address space alive just to service an async #PF is counter-productive,
> i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> get_user_pages_remote() and freeing the address space asap is desirable.
>
> Handling the mm reference entirely within async_pf_execute() also
> simplifies the async #PF flows as a whole, e.g. it's not immediately
> obvious when the worker task vs. the vCPU task is responsible for putting
> the gifted mm reference.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> include/linux/kvm_host.h | 1 -
> virt/kvm/async_pf.c | 32 ++++++++++++++++++--------------
> 2 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..bbfefd7e612f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -238,7 +238,6 @@ struct kvm_async_pf {
> struct list_head link;
> struct list_head queue;
> struct kvm_vcpu *vcpu;
> - struct mm_struct *mm;
> gpa_t cr2_or_gpa;
> unsigned long addr;
> struct kvm_arch_async_pf arch;
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d5dc50318aa6..c3f4f351a2ae 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
> {
> struct kvm_async_pf *apf =
> container_of(work, struct kvm_async_pf, work);
> - struct mm_struct *mm = apf->mm;
> struct kvm_vcpu *vcpu = apf->vcpu;
> + struct mm_struct *mm = vcpu->kvm->mm;
> unsigned long addr = apf->addr;
> gpa_t cr2_or_gpa = apf->cr2_or_gpa;
> int locked = 1;
> @@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
> might_sleep();
>
> /*
> - * This work is run asynchronously to the task which owns
> - * mm and might be done in another context, so we must
> - * access remotely.
> + * Attempt to pin the VM's host address space, and simply skip gup() if
> + * acquiring a pin fail, i.e. if the process is exiting. Note, KVM
> + * holds a reference to its associated mm_struct until the very end of
> + * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
> + * work item is fully processed.
> */
> - mmap_read_lock(mm);
> - get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> - if (locked)
> - mmap_read_unlock(mm);
> - mmput(mm);
> + if (mmget_not_zero(mm)) {
> + mmap_read_lock(mm);
> + get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> + if (locked)
> + mmap_read_unlock(mm);
> + mmput(mm);
> + }
>
> + /*
> + * Notify and kick the vCPU even if faulting in the page failed, e.g.
How about when the process is exiting? Could we just skip the following?
Thanks,
Yilun
> + * so that the vCPU can retry the fault synchronously.
> + */
> if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
> kvm_arch_async_page_present(vcpu, apf);
>
> @@ -129,10 +137,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> #ifdef CONFIG_KVM_ASYNC_PF_SYNC
> flush_work(&work->work);
> #else
> - if (cancel_work_sync(&work->work)) {
> - mmput(work->mm);
> + if (cancel_work_sync(&work->work))
> kmem_cache_free(async_pf_cache, work);
> - }
> #endif
> spin_lock(&vcpu->async_pf.lock);
> }
> @@ -211,8 +217,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->cr2_or_gpa = cr2_or_gpa;
> work->addr = hva;
> work->arch = *arch;
> - work->mm = current->mm;
> - mmget(work->mm);
>
> INIT_WORK(&work->work, async_pf_execute);
>
> --
> 2.43.0.472.g3155946c3a-goog
>
On Tue, Jan 09, 2024 at 05:15:33PM -0800, Sean Christopherson wrote:
> Nullify the async #PF worker's local "apf" pointer immediately after the
> point where the structure can be freed by the vCPU. The existing comment
> is helpful, but easy to overlook as there is no associated code.
>
> Update the comment to clarify that it can be freed by as soon as the lock
> is dropped, as "after this point" isn't strictly accurate, nor does it
> help understand what prevents the structure from being freed earlier.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Xu Yilun <[email protected]>
> ---
> virt/kvm/async_pf.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index c3f4f351a2ae..1088c6628de9 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -83,13 +83,14 @@ static void async_pf_execute(struct work_struct *work)
> apf->vcpu = NULL;
> spin_unlock(&vcpu->async_pf.lock);
>
> - if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
> - kvm_arch_async_page_present_queued(vcpu);
> -
> /*
> - * apf may be freed by kvm_check_async_pf_completion() after
> - * this point
> + * The apf struct may freed by kvm_check_async_pf_completion() as soon
> + * as the lock is dropped. Nullify it to prevent improper usage.
> */
> + apf = NULL;
> +
> + if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
> + kvm_arch_async_page_present_queued(vcpu);
>
> trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
> --
> 2.43.0.472.g3155946c3a-goog
>
On Sat, Jan 20, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:15:32PM -0800, Sean Christopherson wrote:
> > Get a reference to the target VM's address space in async_pf_execute()
> > instead of gifting a reference from kvm_setup_async_pf(). Keeping the
> > address space alive just to service an async #PF is counter-productive,
> > i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> > get_user_pages_remote() and freeing the address space asap is desirable.
> >
> > Handling the mm reference entirely within async_pf_execute() also
> > simplifies the async #PF flows as a whole, e.g. it's not immediately
> > obvious when the worker task vs. the vCPU task is responsible for putting
> > the gifted mm reference.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > include/linux/kvm_host.h | 1 -
> > virt/kvm/async_pf.c | 32 ++++++++++++++++++--------------
> > 2 files changed, 18 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 7e7fd25b09b3..bbfefd7e612f 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -238,7 +238,6 @@ struct kvm_async_pf {
> > struct list_head link;
> > struct list_head queue;
> > struct kvm_vcpu *vcpu;
> > - struct mm_struct *mm;
> > gpa_t cr2_or_gpa;
> > unsigned long addr;
> > struct kvm_arch_async_pf arch;
> > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > index d5dc50318aa6..c3f4f351a2ae 100644
> > --- a/virt/kvm/async_pf.c
> > +++ b/virt/kvm/async_pf.c
> > @@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
> > {
> > struct kvm_async_pf *apf =
> > container_of(work, struct kvm_async_pf, work);
> > - struct mm_struct *mm = apf->mm;
> > struct kvm_vcpu *vcpu = apf->vcpu;
> > + struct mm_struct *mm = vcpu->kvm->mm;
> > unsigned long addr = apf->addr;
> > gpa_t cr2_or_gpa = apf->cr2_or_gpa;
> > int locked = 1;
> > @@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
> > might_sleep();
> >
> > /*
> > - * This work is run asynchronously to the task which owns
> > - * mm and might be done in another context, so we must
> > - * access remotely.
> > + * Attempt to pin the VM's host address space, and simply skip gup() if
> > + * acquiring a pin fail, i.e. if the process is exiting. Note, KVM
> > + * holds a reference to its associated mm_struct until the very end of
> > + * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
> > + * work item is fully processed.
> > */
> > - mmap_read_lock(mm);
> > - get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> > - if (locked)
> > - mmap_read_unlock(mm);
> > - mmput(mm);
> > + if (mmget_not_zero(mm)) {
> > + mmap_read_lock(mm);
> > + get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> > + if (locked)
> > + mmap_read_unlock(mm);
> > + mmput(mm);
> > + }
> >
> > + /*
> > + * Notify and kick the vCPU even if faulting in the page failed, e.g.
>
> How about when the process is exiting? Could we just skip the following?
Maybe? I'm not opposed to trimming this down even more, but I doubt it will make
much of a difference. The vCPU can't be running so async_pf.lock shouldn't be
contended, no IPIs will be issued for kicks, etc. So for this patch at least,
I want to take the most conservative approach while still cleaning up the mm_struct
usage.
> > + * so that the vCPU can retry the fault synchronously.
> > + */
> > if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
> > kvm_arch_async_page_present(vcpu, apf);
> >
> > @@ -129,10 +137,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > #ifdef CONFIG_KVM_ASYNC_PF_SYNC
> > flush_work(&work->work);
> > #else
> > - if (cancel_work_sync(&work->work)) {
> > - mmput(work->mm);
> > + if (cancel_work_sync(&work->work))
> > kmem_cache_free(async_pf_cache, work);
> > - }
> > #endif
> > spin_lock(&vcpu->async_pf.lock);
> > }
> > @@ -211,8 +217,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > work->cr2_or_gpa = cr2_or_gpa;
> > work->addr = hva;
> > work->arch = *arch;
> > - work->mm = current->mm;
> > - mmget(work->mm);
> >
> > INIT_WORK(&work->work, async_pf_execute);
> >
> > --
> > 2.43.0.472.g3155946c3a-goog
> >
On Sat, Jan 20, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> > Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> > completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> > KVM must ensure that none of its workqueue callbacks is running when the
> > last reference to the KVM _module_ is put. Gifting a reference to the
> > associated VM prevents the workqueue callback from dereferencing freed
> > vCPU/VM memory, but does not prevent the KVM module from being unloaded
> > before the callback completes.
> >
> > Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> > async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> > result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
> > finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> >
> > WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> > Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> > CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > Workqueue: events async_pf_execute [kvm]
> > RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> > Call Trace:
> > <TASK>
> > async_pf_execute+0x198/0x260 [kvm]
> > process_one_work+0x145/0x2d0
> > worker_thread+0x27e/0x3a0
> > kthread+0xba/0xe0
> > ret_from_fork+0x2d/0x50
> > ret_from_fork_asm+0x11/0x20
> > </TASK>
> > ---[ end trace 0000000000000000 ]---
> > INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> > Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
> > Workqueue: events async_pf_execute [kvm]
> > Call Trace:
> > <TASK>
> > __schedule+0x33f/0xa40
> > schedule+0x53/0xc0
> > schedule_timeout+0x12a/0x140
> > __wait_for_common+0x8d/0x1d0
> > __flush_work.isra.0+0x19f/0x2c0
> > kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> > kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> > kvm_put_kvm+0x1c1/0x320 [kvm]
> > async_pf_execute+0x198/0x260 [kvm]
> > process_one_work+0x145/0x2d0
> > worker_thread+0x27e/0x3a0
> > kthread+0xba/0xe0
> > ret_from_fork+0x2d/0x50
> > ret_from_fork_asm+0x11/0x20
> > </TASK>
> >
> > If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> > then there's no need to gift async_pf_execute() a reference because all
> > invocations of async_pf_execute() will be forced to complete before the
> > vCPU and its VM are destroyed/freed. And that in turn fixes the module
> > unloading bug as __fput() won't do module_put() on the last vCPU reference
> > until the vCPU has been freed, e.g. if closing the vCPU file also puts the
>
> I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
> we just need to say that vCPUs are freed before module_put(KVM the module)
> in kvm_destroy_vm(), then the whole logic for module unloading fix is:
>
> 1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
> 2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
> 3. vCPUs must be freed before module_put(KVM the module).
>
> So all workqueue callbacks complete before module_put(KVM the module).
>
>
> __fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
> makes me distracted from reason of the fix.
My goal was to call out that (a) the vCPU file descriptor is what ensures kvm.ko
is alive at this point and (b) that __fput() very deliberately ensures module_put()
is called after all module function callbacks/hooks complete, as there was quite
a bit of confusion around who/what can safely do module_put().
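For reference, the ordering in __fput() looks roughly like this (trimmed
paraphrase of fs/file_table.c, not the verbatim source):

	static void __fput(struct file *file)
	{
		...
		if (file->f_op->release)
			file->f_op->release(inode, file);  /* kvm_vcpu_release() */
		...
		fops_put(file->f_op);  /* module_put(f_op->owner), i.e. puts kvm.ko */
		...
	}

i.e. the final module_put() doesn't happen until after the module's release()
callback has returned.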
> > last reference to the KVM module.
> >
> > Note that kvm_check_async_pf_completion() may also take the work item off
> > the completion queue and so also needs to flush the work queue, as the
> > work will not be seen by kvm_clear_async_pf_completion_queue(). Waiting
> > on the workqueue could theoretically delay a vCPU due to waiting for the
> > work to complete, but that's a very, very small chance, and likely a very
> > small delay. kvm_arch_async_page_present_queued() unconditionally makes a
> > new request, i.e. will effectively delay entering the guest, so the
> > remaining work is really just:
> >
> > trace_kvm_async_pf_completed(addr, cr2_or_gpa);
> >
> > __kvm_vcpu_wake_up(vcpu);
> >
> > mmput(mm);
> >
> > and mmput() can't drop the last reference to the page tables if the vCPU is
> > still alive, i.e. the vCPU won't get stuck tearing down page tables.
> >
> > Add a helper to do the flushing, specifically to deal with "wakeup all"
> > work items, as they aren't actually work items, i.e. are never placed in a
> > workqueue. Trying to flush a bogus workqueue entry rightly makes
> > __flush_work() complain (kudos to whoever added that sanity check).
> >
> > Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> > freed") *tried* to fix the module refcounting issue by having VMs grab a
> > reference to the module, but that only made the bug slightly harder to hit
> > as it gave async_pf_execute() a bit more time to complete before the KVM
> > module could be unloaded.
> >
> > Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> > Cc: [email protected]
> > Cc: David Matlack <[email protected]>
> > Cc: Xu Yilun <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
> > 1 file changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > index e033c79d528e..876927a558ad 100644
> > --- a/virt/kvm/async_pf.c
> > +++ b/virt/kvm/async_pf.c
> > @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
> > __kvm_vcpu_wake_up(vcpu);
> >
> > mmput(mm);
> > - kvm_put_kvm(vcpu->kvm);
> > +}
> > +
> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> > +{
> > + /*
> > + * The async #PF is "done", but KVM must wait for the work item itself,
> > + * i.e. async_pf_execute(), to run to completion. If KVM is a module,
> > + * KVM must ensure *no* code owned by the KVM (the module) can be run
> > + * after the last call to module_put(), i.e. after the last reference
> > + * to the last vCPU's file is put.
>
> Maybe drop the i.e? It is not exactly true, other components like VFIO
> may also be the last one to put KVM reference?
Ah, yeah, agreed. I'll drop that last snippet, it doesn't add much value.
On Wed, Jan 24, 2024 at 10:52:12AM -0800, Sean Christopherson wrote:
> On Sat, Jan 20, 2024, Xu Yilun wrote:
> > On Tue, Jan 09, 2024 at 05:15:32PM -0800, Sean Christopherson wrote:
> > > Get a reference to the target VM's address space in async_pf_execute()
> > > instead of gifting a reference from kvm_setup_async_pf(). Keeping the
> > > address space alive just to service an async #PF is counter-productive,
> > > i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> > > get_user_pages_remote() and freeing the address space asap is desirable.
> > >
> > > Handling the mm reference entirely within async_pf_execute() also
> > > simplifies the async #PF flows as a whole, e.g. it's not immediately
> > > obvious when the worker task vs. the vCPU task is responsible for putting
> > > the gifted mm reference.
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > include/linux/kvm_host.h | 1 -
> > > virt/kvm/async_pf.c | 32 ++++++++++++++++++--------------
> > > 2 files changed, 18 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 7e7fd25b09b3..bbfefd7e612f 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -238,7 +238,6 @@ struct kvm_async_pf {
> > > struct list_head link;
> > > struct list_head queue;
> > > struct kvm_vcpu *vcpu;
> > > - struct mm_struct *mm;
> > > gpa_t cr2_or_gpa;
> > > unsigned long addr;
> > > struct kvm_arch_async_pf arch;
> > > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > > index d5dc50318aa6..c3f4f351a2ae 100644
> > > --- a/virt/kvm/async_pf.c
> > > +++ b/virt/kvm/async_pf.c
> > > @@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
> > > {
> > > struct kvm_async_pf *apf =
> > > container_of(work, struct kvm_async_pf, work);
> > > - struct mm_struct *mm = apf->mm;
> > > struct kvm_vcpu *vcpu = apf->vcpu;
> > > + struct mm_struct *mm = vcpu->kvm->mm;
> > > unsigned long addr = apf->addr;
> > > gpa_t cr2_or_gpa = apf->cr2_or_gpa;
> > > int locked = 1;
> > > @@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
> > > might_sleep();
> > >
> > > /*
> > > - * This work is run asynchronously to the task which owns
> > > - * mm and might be done in another context, so we must
> > > - * access remotely.
> > > + * Attempt to pin the VM's host address space, and simply skip gup() if
> > > + * acquiring a pin fail, i.e. if the process is exiting. Note, KVM
> > > + * holds a reference to its associated mm_struct until the very end of
> > > + * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
> > > + * work item is fully processed.
> > > */
> > > - mmap_read_lock(mm);
> > > - get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> > > - if (locked)
> > > - mmap_read_unlock(mm);
> > > - mmput(mm);
> > > + if (mmget_not_zero(mm)) {
> > > + mmap_read_lock(mm);
> > > + get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> > > + if (locked)
> > > + mmap_read_unlock(mm);
> > > + mmput(mm);
> > > + }
> > >
> > > + /*
> > > + * Notify and kick the vCPU even if faulting in the page failed, e.g.
> >
> > How about when the process is exiting? Could we just skip the following?
>
> Maybe? I'm not opposed to trimming this down even more, but I doubt it will make
> much of a difference. The vCPU can't be running so async_pf.lock shouldn't be
> contended, no IPIs will be issued for kicks, etc. So for this patch at least,
> I want to take the most conservative approach while still cleaning up the mm_struct
> usage.
Sounds good to me.
Reviewed-by: Xu Yilun <[email protected]>
On Wed, Jan 24, 2024 at 11:04:55AM -0800, Sean Christopherson wrote:
> On Sat, Jan 20, 2024, Xu Yilun wrote:
> > On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> > > Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> > > completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> > > KVM must ensure that none of its workqueue callbacks is running when the
> > > last reference to the KVM _module_ is put. Gifting a reference to the
> > > associated VM prevents the workqueue callback from dereferencing freed
> > > vCPU/VM memory, but does not prevent the KVM module from being unloaded
> > > before the callback completes.
> > >
> > > Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> > > async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> > > result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
> > > finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> > >
> > > WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> > > Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> > > CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > > Workqueue: events async_pf_execute [kvm]
> > > RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> > > Call Trace:
> > > <TASK>
> > > async_pf_execute+0x198/0x260 [kvm]
> > > process_one_work+0x145/0x2d0
> > > worker_thread+0x27e/0x3a0
> > > kthread+0xba/0xe0
> > > ret_from_fork+0x2d/0x50
> > > ret_from_fork_asm+0x11/0x20
> > > </TASK>
> > > ---[ end trace 0000000000000000 ]---
> > > INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> > > Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
> > > Workqueue: events async_pf_execute [kvm]
> > > Call Trace:
> > > <TASK>
> > > __schedule+0x33f/0xa40
> > > schedule+0x53/0xc0
> > > schedule_timeout+0x12a/0x140
> > > __wait_for_common+0x8d/0x1d0
> > > __flush_work.isra.0+0x19f/0x2c0
> > > kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> > > kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> > > kvm_put_kvm+0x1c1/0x320 [kvm]
> > > async_pf_execute+0x198/0x260 [kvm]
> > > process_one_work+0x145/0x2d0
> > > worker_thread+0x27e/0x3a0
> > > kthread+0xba/0xe0
> > > ret_from_fork+0x2d/0x50
> > > ret_from_fork_asm+0x11/0x20
> > > </TASK>
> > >
> > > If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> > > then there's no need to gift async_pf_execute() a reference because all
> > > invocations of async_pf_execute() will be forced to complete before the
> > > vCPU and its VM are destroyed/freed. And that in turn fixes the module
> > > unloading bug as __fput() won't do module_put() on the last vCPU reference
> > > until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> >
> > I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
> > we just need to say that vCPUs are freed before module_put(KVM the module)
> > in kvm_destroy_vm(), then the whole logic for module unloading fix is:
> >
> > 1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
> > 2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
> > 3. vCPUs must be freed before module_put(KVM the module).
> >
> > So all workqueue callbacks complete before module_put(KVM the module).
> >
> >
> > __fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
> > makes me distracted from reason of the fix.
>
> My goal was to call out that (a) the vCPU file descriptor is what ensures kvm.ko
> is alive at this point and (b) that __fput() very deliberately ensures module_put()
> is called after all module function callbacks/hooks complete, as there was quite
Ah, I understand now. These are ensured by your previous fix, which sets
the module owner on kvm_vcpu_fops.
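I.e. (sketching from memory, the real thing lives in virt/kvm/kvm_main.c):

	static const struct file_operations kvm_vcpu_fops = {
		.owner		= THIS_MODULE,	/* __fput() pins kvm.ko until release() returns */
		.release	= kvm_vcpu_release,
		.unlocked_ioctl	= kvm_vcpu_ioctl,
		...
	};

LGTM now.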
Thanks,
Yilun
Sean Christopherson <[email protected]> writes:
> Put the async #PF worker's reference to the VM's address space as soon as
> the worker is done with the mm. This will allow deferring getting a
> reference to the worker itself without having to track whether or not
> getting a reference succeeded.
>
> Note, if the vCPU is still alive, there is no danger of the worker getting
> stuck with tearing down the host page tables, as userspace also holds a
> reference (obviously), i.e. there is no risk of delaying the page-present
> notification due to triggering the slow path in mmput().
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/async_pf.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 876927a558ad..d5dc50318aa6 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -64,6 +64,7 @@ static void async_pf_execute(struct work_struct *work)
> get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> if (locked)
> mmap_read_unlock(mm);
> + mmput(mm);
>
> if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
> kvm_arch_async_page_present(vcpu, apf);
> @@ -85,8 +86,6 @@ static void async_pf_execute(struct work_struct *work)
> trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
> __kvm_vcpu_wake_up(vcpu);
> -
> - mmput(mm);
> }
>
> static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
Reviewed-by: Vitaly Kuznetsov <[email protected]>
--
Vitaly
Sean Christopherson <[email protected]> writes:
> Get a reference to the target VM's address space in async_pf_execute()
> instead of gifting a reference from kvm_setup_async_pf(). Keeping the
> address space alive just to service an async #PF is counter-productive,
> i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> get_user_pages_remote() and freeing the address space asap is
> desirable.
It took me a while to realize why all vCPU fds are managed by the same
mm which did KVM_CREATE_VM, as (AFAIU) fds can be passed around. Turns
out, we explicitly forbid this in kvm_vcpu_ioctl():
if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
return -EIO;
so this indeed means that grabbing current->mm in kvm_setup_async_pf()
can be avoided. I'm not sure whether it's just me, or whether an "all vCPUs
are required to be managed by the same mm" comment somewhere would be helpful.
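Something like this above the check, maybe (wording is just a strawman):

	/*
	 * All vCPU ioctls must be issued from the mm that created the VM:
	 * vCPU fds that were passed to another process are rejected here.
	 * Consequently, every vCPU of a VM is managed by the same mm, and
	 * e.g. async #PF can use vcpu->kvm->mm instead of caching
	 * current->mm at setup time.
	 */
	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
		return -EIO;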
>
> Handling the mm reference entirely within async_pf_execute() also
> simplifies the async #PF flows as a whole, e.g. it's not immediately
> obvious when the worker task vs. the vCPU task is responsible for putting
> the gifted mm reference.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> include/linux/kvm_host.h | 1 -
> virt/kvm/async_pf.c | 32 ++++++++++++++++++--------------
> 2 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..bbfefd7e612f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -238,7 +238,6 @@ struct kvm_async_pf {
> struct list_head link;
> struct list_head queue;
> struct kvm_vcpu *vcpu;
> - struct mm_struct *mm;
> gpa_t cr2_or_gpa;
> unsigned long addr;
> struct kvm_arch_async_pf arch;
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d5dc50318aa6..c3f4f351a2ae 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
> {
> struct kvm_async_pf *apf =
> container_of(work, struct kvm_async_pf, work);
> - struct mm_struct *mm = apf->mm;
> struct kvm_vcpu *vcpu = apf->vcpu;
> + struct mm_struct *mm = vcpu->kvm->mm;
> unsigned long addr = apf->addr;
> gpa_t cr2_or_gpa = apf->cr2_or_gpa;
> int locked = 1;
> @@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
> might_sleep();
>
> /*
> - * This work is run asynchronously to the task which owns
> - * mm and might be done in another context, so we must
> - * access remotely.
> + * Attempt to pin the VM's host address space, and simply skip gup() if
> + * acquiring a pin fail, i.e. if the process is exiting. Note, KVM
> + * holds a reference to its associated mm_struct until the very end of
> + * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
> + * work item is fully processed.
> */
> - mmap_read_lock(mm);
> - get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> - if (locked)
> - mmap_read_unlock(mm);
> - mmput(mm);
> + if (mmget_not_zero(mm)) {
> + mmap_read_lock(mm);
> + get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> + if (locked)
> + mmap_read_unlock(mm);
> + mmput(mm);
> + }
>
> + /*
> + * Notify and kick the vCPU even if faulting in the page failed, e.g.
> + * so that the vCPU can retry the fault synchronously.
> + */
> if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
> kvm_arch_async_page_present(vcpu, apf);
>
> @@ -129,10 +137,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> #ifdef CONFIG_KVM_ASYNC_PF_SYNC
> flush_work(&work->work);
> #else
> - if (cancel_work_sync(&work->work)) {
> - mmput(work->mm);
> + if (cancel_work_sync(&work->work))
> kmem_cache_free(async_pf_cache, work);
> - }
> #endif
> spin_lock(&vcpu->async_pf.lock);
> }
> @@ -211,8 +217,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->cr2_or_gpa = cr2_or_gpa;
> work->addr = hva;
> work->arch = *arch;
> - work->mm = current->mm;
> - mmget(work->mm);
>
> INIT_WORK(&work->work, async_pf_execute);
Reviewed-by: Vitaly Kuznetsov <[email protected]>
--
Vitaly
Sean Christopherson <[email protected]> writes:
> Nullify the async #PF worker's local "apf" pointer immediately after the
> point where the structure can be freed by the vCPU. The existing comment
> is helpful, but easy to overlook as there is no associated code.
>
> Update the comment to clarify that it can be freed by as soon as the lock
> is dropped, as "after this point" isn't strictly accurate, nor does it
> help understand what prevents the structure from being freed earlier.
>
"No functional change intended." must be made a requirement, especially
for those who made it their trademark)
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/async_pf.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index c3f4f351a2ae..1088c6628de9 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -83,13 +83,14 @@ static void async_pf_execute(struct work_struct *work)
> apf->vcpu = NULL;
> spin_unlock(&vcpu->async_pf.lock);
>
> - if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
> - kvm_arch_async_page_present_queued(vcpu);
> -
> /*
> - * apf may be freed by kvm_check_async_pf_completion() after
> - * this point
> + * The apf struct may freed by kvm_check_async_pf_completion() as soon
Nit: "may be freed"/"may get freed" maybe?
> + * as the lock is dropped. Nullify it to prevent improper usage.
> */
> + apf = NULL;
> +
> + if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
> + kvm_arch_async_page_present_queued(vcpu);
>
> trace_kvm_async_pf_completed(addr, cr2_or_gpa);
Reviewed-by: Vitaly Kuznetsov <[email protected]>
--
Vitaly
On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > Get a reference to the target VM's address space in async_pf_execute()
> > instead of gifting a reference from kvm_setup_async_pf(). Keeping the
> > address space alive just to service an async #PF is counter-productive,
> > i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> > get_user_pages_remote() and freeing the address space asap is
> > desirable.
>
> It took me a while to realize why all vCPU fds are managed by the same
> mm which did KVM_CREATE_VM as (AFAIU) fds can be passed around. Turns
> out, we explicitly forbid this in kvm_vcpu_ioctl():
>
> if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
> return -EIO;
>
> so this indeed means that grabbing current->mm in kvm_setup_async_pf()
> can be avoided. I'm not sure whether it's just me or a "all vCPUs are
> quired to be managed by the same mm" comment somewhere would be helpful.
It's definitely not just you. Documentation/virt/kvm/* could use thorough
documentation of what all in KVM relies on vCPUs, and all meaningful ioctls(),
to be executed in the same mm_struct (address space). Because that requirement
is pervasive throughout KVM. E.g. sharing KVM page tables across vCPUs is safe
iff all vCPUs are in the same address space, otherwise the hva=>pfn translations
through the memslot would diverge, mmu_notifiers would be all kinds of broken, etc.
Sean Christopherson <[email protected]> writes:
> Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> KVM must ensure that none of its workqueue callbacks is running when the
> last reference to the KVM _module_ is put. Gifting a reference to the
> associated VM prevents the workqueue callback from dereferencing freed
> vCPU/VM memory, but does not prevent the KVM module from being unloaded
> before the callback completes.
>
> Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
> finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
>
> WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Workqueue: events async_pf_execute [kvm]
> RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> Call Trace:
> <TASK>
> async_pf_execute+0x198/0x260 [kvm]
> process_one_work+0x145/0x2d0
> worker_thread+0x27e/0x3a0
> kthread+0xba/0xe0
> ret_from_fork+0x2d/0x50
> ret_from_fork_asm+0x11/0x20
> </TASK>
> ---[ end trace 0000000000000000 ]---
> INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
> Workqueue: events async_pf_execute [kvm]
> Call Trace:
> <TASK>
> __schedule+0x33f/0xa40
> schedule+0x53/0xc0
> schedule_timeout+0x12a/0x140
> __wait_for_common+0x8d/0x1d0
> __flush_work.isra.0+0x19f/0x2c0
> kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> kvm_put_kvm+0x1c1/0x320 [kvm]
> async_pf_execute+0x198/0x260 [kvm]
> process_one_work+0x145/0x2d0
> worker_thread+0x27e/0x3a0
> kthread+0xba/0xe0
> ret_from_fork+0x2d/0x50
> ret_from_fork_asm+0x11/0x20
> </TASK>
>
> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> then there's no need to gift async_pf_execute() a reference because all
> invocations of async_pf_execute() will be forced to complete before the
> vCPU and its VM are destroyed/freed. And that in turn fixes the module
> unloading bug as __fput() won't do module_put() on the last vCPU reference
> until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> last reference to the KVM module.
>
> Note that kvm_check_async_pf_completion() may also take the work item off
> the completion queue and so also needs to flush the work queue, as the
> work will not be seen by kvm_clear_async_pf_completion_queue(). Waiting
> on the workqueue could theoretically delay a vCPU due to waiting for the
> work to complete, but that's a very, very small chance, and likely a very
> small delay. kvm_arch_async_page_present_queued() unconditionally makes a
> new request, i.e. will effectively delay entering the guest, so the
> remaining work is really just:
>
> trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
> __kvm_vcpu_wake_up(vcpu);
>
> mmput(mm);
>
> and mmput() can't drop the last reference to the page tables if the vCPU is
> still alive, i.e. the vCPU won't get stuck tearing down page tables.
>
> Add a helper to do the flushing, specifically to deal with "wakeup all"
> work items, as they aren't actually work items, i.e. are never placed in a
> workqueue. Trying to flush a bogus workqueue entry rightly makes
> __flush_work() complain (kudos to whoever added that sanity check).
>
> Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> freed") *tried* to fix the module refcounting issue by having VMs grab a
> reference to the module, but that only made the bug slightly harder to hit
> as it gave async_pf_execute() a bit more time to complete before the KVM
> module could be unloaded.
>
> Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> Cc: [email protected]
> Cc: David Matlack <[email protected]>
> Cc: Xu Yilun <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
> 1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index e033c79d528e..876927a558ad 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
> __kvm_vcpu_wake_up(vcpu);
>
> mmput(mm);
> - kvm_put_kvm(vcpu->kvm);
> +}
> +
> +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> +{
> + /*
> + * The async #PF is "done", but KVM must wait for the work item itself,
> + * i.e. async_pf_execute(), to run to completion. If KVM is a module,
> + * KVM must ensure *no* code owned by the KVM (the module) can be run
> + * after the last call to module_put(), i.e. after the last reference
> + * to the last vCPU's file is put.
> + *
Do I understand correctly that the problem is also present on the
"normal" path, i.e.:
KVM_REQ_APF_READY
kvm_check_async_pf_completion()
kmem_cache_free(,work)
on one CPU can actually finish _before_ work is fully flushed on the
other (async_pf_execute() has already added an item to 'done' list but
hasn't completed)? Is it just the fact that the window of opportunity
to get the freed item re-purposed is so short that no real issue was
ever noticed? In that case I'd suggest we emphasize that in the comment
as currently it sounds like kvm_arch_destroy_vm() is the only
problematic path.
> + * Wake all events skip the queue and go straight done, i.e. don't
> + * need to be flushed (but sanity check that the work wasn't queued).
> + */
> + if (work->wakeup_all)
> + WARN_ON_ONCE(work->work.func);
> + else
> + flush_work(&work->work);
> + kmem_cache_free(async_pf_cache, work);
> }
>
> void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> #else
> if (cancel_work_sync(&work->work)) {
> mmput(work->mm);
> - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> kmem_cache_free(async_pf_cache, work);
> }
> #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> list_first_entry(&vcpu->async_pf.done,
> typeof(*work), link);
> list_del(&work->link);
> - kmem_cache_free(async_pf_cache, work);
> +
> + spin_unlock(&vcpu->async_pf.lock);
> +
> + /*
> + * The async #PF is "done", but KVM must wait for the work item
> + * itself, i.e. async_pf_execute(), to run to completion. If
> + * KVM is a module, KVM must ensure *no* code owned by the KVM
> + * (the module) can be run after the last call to module_put(),
> + * i.e. after the last reference to the last vCPU's file is put.
> + */
> + kvm_flush_and_free_async_pf_work(work);
> + spin_lock(&vcpu->async_pf.lock);
> }
> spin_unlock(&vcpu->async_pf.lock);
>
> @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>
> list_del(&work->queue);
> vcpu->async_pf.queued--;
> - kmem_cache_free(async_pf_cache, work);
> + kvm_flush_and_free_async_pf_work(work);
> }
> }
>
> @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->arch = *arch;
> work->mm = current->mm;
> mmget(work->mm);
> - kvm_get_kvm(work->vcpu->kvm);
>
> INIT_WORK(&work->work, async_pf_execute);
Reviewed-by: Vitaly Kuznetsov <[email protected]>
--
Vitaly
On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> > +{
> > + /*
> > + * The async #PF is "done", but KVM must wait for the work item itself,
> > + * i.e. async_pf_execute(), to run to completion. If KVM is a module,
> > + * KVM must ensure *no* code owned by the KVM (the module) can be run
> > + * after the last call to module_put(), i.e. after the last reference
> > + * to the last vCPU's file is put.
> > + *
>
> Do I understand correctly that the problem is also present on the
> "normal" path, i.e.:
>
> KVM_REQ_APF_READY
> kvm_check_async_pf_completion()
> kmem_cache_free(,work)
>
> on one CPU can actually finish _before_ work is fully flushed on the
> other (async_pf_execute() has already added an item to 'done' list but
> hasn't completed)? Is it just the fact that the window of opportunity
> to get the freed item re-purposed is so short that no real issue was
> ever noticed?
Sort of? It's not a problem with accessing a freed object, the issue is that
async_pf_execute() can still be executing while KVM-the-module is unloaded.
The reason the "normal" path is problematic is because it removes the
work entry from async_pf.done and from async_pf.queue, i.e. makes it invisible
to kvm_arch_destroy_vm(). So to hit the bug where the "normal" path processes
the completed work, userspace would need to terminate the VM (i.e. exit() or
close all fds), and KVM would need to completely tear down the VM, all before
async_pf_execute() finished its last few instructions.
Which is possible, just comically unlikely :-)
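If it helps to see the rule in isolation, here's a minimal, hypothetical
sketch (deliberately not KVM code) of the invariant this series enforces:
whoever frees a work item must first wait for the callback to run to
completion, because the callback's final instructions are module text.

  #include <linux/slab.h>
  #include <linux/workqueue.h>

  struct demo_work {
          struct work_struct work;
  };

  static void demo_fn(struct work_struct *work)
  {
          /*
           * The tail of this function is module text; nothing may free
           * the item or unload the module until it returns.
           */
  }

  static struct demo_work *demo_start(void)
  {
          struct demo_work *dw = kzalloc(sizeof(*dw), GFP_KERNEL);

          if (!dw)
                  return NULL;

          INIT_WORK(&dw->work, demo_fn);
          schedule_work(&dw->work);
          return dw;
  }

  static void demo_flush_and_free(struct demo_work *dw)
  {
          flush_work(&dw->work);  /* wait for demo_fn() to return */
          kfree(dw);              /* only now is freeing the item safe */
  }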
> In that case I'd suggest we emphasize that in the comment as currently it
> sounds like kvm_arch_destroy_vm() is the only problematic path.
How's this?
/*
* The async #PF is "done", but KVM must wait for the work item itself,
* i.e. async_pf_execute(), to run to completion. If KVM is a module,
* KVM must ensure *no* code owned by the KVM (the module) can be run
* after the last call to module_put(). Note, flushing the work item
* is always required when the item is taken off the completion queue.
* E.g. even if the vCPU handles the item in the "normal" path, the VM
* could be terminated before async_pf_execute() completes.
*
* Wake all events skip the queue and go straight done, i.e. don't
* need to be flushed (but sanity check that the work wasn't queued).
*/
> > + * Wake all events skip the queue and go straight done, i.e. don't
> > + * need to be flushed (but sanity check that the work wasn't queued).
> > + */
> > + if (work->wakeup_all)
> > + WARN_ON_ONCE(work->work.func);
> > + else
> > + flush_work(&work->work);
> > + kmem_cache_free(async_pf_cache, work);
> > }
> >
> > void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > #else
> > if (cancel_work_sync(&work->work)) {
> > mmput(work->mm);
> > - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> > kmem_cache_free(async_pf_cache, work);
> > }
> > #endif
> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > list_first_entry(&vcpu->async_pf.done,
> > typeof(*work), link);
> > list_del(&work->link);
> > - kmem_cache_free(async_pf_cache, work);
> > +
> > + spin_unlock(&vcpu->async_pf.lock);
> > +
> > + /*
> > + * The async #PF is "done", but KVM must wait for the work item
> > + * itself, i.e. async_pf_execute(), to run to completion. If
> > + * KVM is a module, KVM must ensure *no* code owned by the KVM
> > + * (the module) can be run after the last call to module_put(),
> > + * i.e. after the last reference to the last vCPU's file is put.
> > + */
Doh, this is a duplicate comment, I'll delete it.
> > + kvm_flush_and_free_async_pf_work(work);
> > + spin_lock(&vcpu->async_pf.lock);
> > }
> > spin_unlock(&vcpu->async_pf.lock);
> >
> > @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> >
> > list_del(&work->queue);
> > vcpu->async_pf.queued--;
> > - kmem_cache_free(async_pf_cache, work);
> > + kvm_flush_and_free_async_pf_work(work);
> > }
> > }
> >
> > @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > work->arch = *arch;
> > work->mm = current->mm;
> > mmget(work->mm);
> > - kvm_get_kvm(work->vcpu->kvm);
> >
> > INIT_WORK(&work->work, async_pf_execute);
>
> Reviewed-by: Vitaly Kuznetsov <[email protected]>
>
> --
> Vitaly
>
Sean Christopherson <[email protected]> writes:
> On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
>> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
>> > +{
>> > + /*
>> > + * The async #PF is "done", but KVM must wait for the work item itself,
>> > + * i.e. async_pf_execute(), to run to completion. If KVM is a module,
>> > + * KVM must ensure *no* code owned by the KVM (the module) can be run
>> > + * after the last call to module_put(), i.e. after the last reference
>> > + * to the last vCPU's file is put.
>> > + *
>>
>> Do I understand correctly that the problem is also present on the
>> "normal" path, i.e.:
>>
>> KVM_REQ_APF_READY
>> kvm_check_async_pf_completion()
>> kmem_cache_free(,work)
>>
>> on one CPU can actually finish _before_ work is fully flushed on the
>> other (async_pf_execute() has already added an item to 'done' list but
>> hasn't completed)? Is it just the fact that the window of opportunity
>> to get the freed item re-purposed is so short that no real issue was
>> ever noticed?
>
> Sort of? It's not a problem with accessing a freed object, the issue is that
> async_pf_execute() can still be executing while KVM-the-module is unloaded.
>
> The reason the "normal" path is problematic is because it removes the
> work entry from async_pf.done and from async_pf.queue, i.e. makes it invisible
> to kvm_arch_destroy_vm(). So to hit the bug where the "normal" path processes
> the completed work, userspace would need to terminate the VM (i.e. exit() or
> close all fds), and KVM would need to completely tear down the VM, all before
> async_pf_execute() finished its last few instructions.
>
> Which is possible, just comically unlikely :-)
>
>> In that case I'd suggest we emphasize that in the comment as currently it
>> sounds like kvm_arch_destroy_vm() is the only problematic path.
>
> How's this?
>
> /*
> * The async #PF is "done", but KVM must wait for the work item itself,
> * i.e. async_pf_execute(), to run to completion. If KVM is a module,
> * KVM must ensure *no* code owned by the KVM (the module) can be run
> * after the last call to module_put(). Note, flushing the work item
> * is always required when the item is taken off the completion queue.
> * E.g. even if the vCPU handles the item in the "normal" path, the VM
> * could be terminated before async_pf_execute() completes.
> *
> * Wake all events skip the queue and go straight done, i.e. don't
> * need to be flushed (but sanity check that the work wasn't queued).
> */
>
Sounds good to me, thanks!
>> > + * Wake all events skip the queue and go straight done, i.e. don't
>> > + * need to be flushed (but sanity check that the work wasn't queued).
>> > + */
>> > + if (work->wakeup_all)
>> > + WARN_ON_ONCE(work->work.func);
>> > + else
>> > + flush_work(&work->work);
>> > + kmem_cache_free(async_pf_cache, work);
>> > }
>> >
>> > void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> > #else
>> > if (cancel_work_sync(&work->work)) {
>> > mmput(work->mm);
>> > - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>> > kmem_cache_free(async_pf_cache, work);
>> > }
>> > #endif
>> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> > list_first_entry(&vcpu->async_pf.done,
>> > typeof(*work), link);
>> > list_del(&work->link);
>> > - kmem_cache_free(async_pf_cache, work);
>> > +
>> > + spin_unlock(&vcpu->async_pf.lock);
>> > +
>> > + /*
>> > + * The async #PF is "done", but KVM must wait for the work item
>> > + * itself, i.e. async_pf_execute(), to run to completion. If
>> > + * KVM is a module, KVM must ensure *no* code owned by the KVM
>> > + * (the module) can be run after the last call to module_put(),
>> > + * i.e. after the last reference to the last vCPU's file is put.
>> > + */
>
> Doh, this is a duplicate comment, I'll delete it.
>
>> > + kvm_flush_and_free_async_pf_work(work);
>> > + spin_lock(&vcpu->async_pf.lock);
>> > }
>> > spin_unlock(&vcpu->async_pf.lock);
>> >
>> > @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>> >
>> > list_del(&work->queue);
>> > vcpu->async_pf.queued--;
>> > - kmem_cache_free(async_pf_cache, work);
>> > + kvm_flush_and_free_async_pf_work(work);
>> > }
>> > }
>> >
>> > @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>> > work->arch = *arch;
>> > work->mm = current->mm;
>> > mmget(work->mm);
>> > - kvm_get_kvm(work->vcpu->kvm);
>> >
>> > INIT_WORK(&work->work, async_pf_execute);
>>
>> Reviewed-by: Vitaly Kuznetsov <[email protected]>
>>
>> --
>> Vitaly
>>
>
--
Vitaly
On Sat, Jan 20, 2024, Xu Yilun wrote:
> Overall LGTM, Reviewed-by: Xu Yilun <[email protected]>
A very random thing. AFAICT, b4 doesn't appear to pick up tags that are "buried"
partway through a line. You didn't do anything wrong (IMO, at least :-D), and
it was easy enough to apply manually, just an FYI for the future.
On Tue, 09 Jan 2024 17:15:29 -0800, Sean Christopherson wrote:
> Fix a use-after-module-unload bug in the async #PF code by ensuring all
> workqueue items fully complete before tearing down vCPUs. Do a bit of
> cleanup to try and make the code slightly more readable.
>
> Side topic, I'm pretty sure s390's flic_set_attr() is broken/racy. The async #PF
> code assumes that only the vCPU can invoke
> kvm_clear_async_pf_completion_queue(), as there are multiple assets that
> are effectively protected by vcpu->mutex. I don't think any real world VMMs
> trigger the race(s), but AFAICT it's a bug. I think/assume taking all
> vCPUs' mutexes would plug the hole?
>
> [...]
Applied to kvm-x86 asyncpf, with comment tweaks as per Vitaly. Thanks!
[1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
https://github.com/kvm-x86/linux/commit/3d75b8aa5c29
[2/4] KVM: Put mm immediately after async #PF worker completes remote gup()
https://github.com/kvm-x86/linux/commit/422eeb543ac9
[3/4] KVM: Get reference to VM's address space in the async #PF worker
https://github.com/kvm-x86/linux/commit/8284765f03b7
[4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed
https://github.com/kvm-x86/linux/commit/c2744ed2230a
--
https://github.com/kvm-x86/linux/tree/next
On Mon, Feb 19, 2024, Xu Yilun wrote:
> > void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > #else
> > if (cancel_work_sync(&work->work)) {
> > mmput(work->mm);
> > - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> > kmem_cache_free(async_pf_cache, work);
> > }
> > #endif
> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > list_first_entry(&vcpu->async_pf.done,
> > typeof(*work), link);
> > list_del(&work->link);
> > - kmem_cache_free(async_pf_cache, work);
> > +
> > + spin_unlock(&vcpu->async_pf.lock);
> > +
> > + /*
> > + * The async #PF is "done", but KVM must wait for the work item
> > + * itself, i.e. async_pf_execute(), to run to completion. If
> > + * KVM is a module, KVM must ensure *no* code owned by the KVM
> > + * (the module) can be run after the last call to module_put(),
> > + * i.e. after the last reference to the last vCPU's file is put.
> > + */
> > + kvm_flush_and_free_async_pf_work(work);
>
> I have a new concern when I re-visit this patchset.
>
> From kvm_check_async_pf_completion(), I see async_pf.queue is always a
> superset of async_pf.done (except wake-all work, which is not a concern
> here). And done work would be skipped from the sync (cancel_work_sync()) by:
>
> if (!work->vcpu)
> continue;
>
> But now with this patch we also sync done works, so how about we just sync
> all queued work instead?
Hmm, IIUC, I think we can simply revert commit 22583f0d9c85 ("KVM: async_pf: avoid
recursive flushing of work items").
> void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> #else
> if (cancel_work_sync(&work->work)) {
> mmput(work->mm);
> - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> kmem_cache_free(async_pf_cache, work);
> }
> #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> list_first_entry(&vcpu->async_pf.done,
> typeof(*work), link);
> list_del(&work->link);
> - kmem_cache_free(async_pf_cache, work);
> +
> + spin_unlock(&vcpu->async_pf.lock);
> +
> + /*
> + * The async #PF is "done", but KVM must wait for the work item
> + * itself, i.e. async_pf_execute(), to run to completion. If
> + * KVM is a module, KVM must ensure *no* code owned by the KVM
> + * (the module) can be run after the last call to module_put(),
> + * i.e. after the last reference to the last vCPU's file is put.
> + */
> + kvm_flush_and_free_async_pf_work(work);
I have a new concern when I re-visit this patchset.
From kvm_check_async_pf_completion(), I see async_pf.queue is always a
superset of async_pf.done (except wake-all work, which is not a concern
here). And done work would be skipped from the sync (cancel_work_sync()) by:
if (!work->vcpu)
continue;
But now with this patch we also sync done works, so how about we just sync
all queued work instead?
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index e033c79d528e..2268f16a36c2 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -71,7 +71,6 @@ static void async_pf_execute(struct work_struct *work)
         spin_lock(&vcpu->async_pf.lock);
         first = list_empty(&vcpu->async_pf.done);
         list_add_tail(&apf->link, &vcpu->async_pf.done);
-        apf->vcpu = NULL;
         spin_unlock(&vcpu->async_pf.lock);
 
         if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
@@ -101,13 +100,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
                                  typeof(*work), queue);
                 list_del(&work->queue);
 
-                /*
-                 * We know it's present in vcpu->async_pf.done, do
-                 * nothing here.
-                 */
-                if (!work->vcpu)
-                        continue;
-
                 spin_unlock(&vcpu->async_pf.lock);
 #ifdef CONFIG_KVM_ASYNC_PF_SYNC
                 flush_work(&work->work);
This way we don't have to sync twice for the same purpose, and we could also
avoid reusing work->vcpu as a "done" flag, which confused me a bit.
Thanks,
Yilun
> + spin_lock(&vcpu->async_pf.lock);
> }
> spin_unlock(&vcpu->async_pf.lock);
>
On Mon, Feb 19, 2024 at 07:51:24AM -0800, Sean Christopherson wrote:
> On Mon, Feb 19, 2024, Xu Yilun wrote:
> > > void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > > #else
> > > if (cancel_work_sync(&work->work)) {
> > > mmput(work->mm);
> > > - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> > > kmem_cache_free(async_pf_cache, work);
> > > }
> > > #endif
> > > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > > list_first_entry(&vcpu->async_pf.done,
> > > typeof(*work), link);
> > > list_del(&work->link);
> > > - kmem_cache_free(async_pf_cache, work);
> > > +
> > > + spin_unlock(&vcpu->async_pf.lock);
> > > +
> > > + /*
> > > + * The async #PF is "done", but KVM must wait for the work item
> > > + * itself, i.e. async_pf_execute(), to run to completion. If
> > > + * KVM is a module, KVM must ensure *no* code owned by the KVM
> > > + * (the module) can be run after the last call to module_put(),
> > > + * i.e. after the last reference to the last vCPU's file is put.
> > > + */
> > > + kvm_flush_and_free_async_pf_work(work);
> >
> > I have a new concern when I re-visit this patchset.
> >
> > From kvm_check_async_pf_completion(), I see async_pf.queue is always a
> > superset of async_pf.done (except wake-all work, which is not a concern
> > here). And done work would be skipped from the sync (cancel_work_sync()) by:
> >
> > if (!work->vcpu)
> > continue;
> >
> > But now with this patch we also sync done works, so how about we just sync
> > all queued work instead?
>
> Hmm, IIUC, I think we can simply revert commit 22583f0d9c85 ("KVM: async_pf: avoid
> recursive flushing of work items").
Ah, yes. This also clears up the history of the confusing spin_lock for me.
Reverting is fine by me.
Thanks,
Yilun