This new API allows userspace to stop all running
vcpus with the KVM_KICK_ALL_RUNNING_VCPUS ioctl, and to resume them with
KVM_RESUME_ALL_KICKED_VCPUS.
A "running" vcpu is a vcpu that is currently executing the KVM_RUN ioctl.
This series is especially helpful to userspace hypervisors like
QEMU when they need to perform operations on memslots without the
risk of a vcpu reading them in the meantime.
By "memslot operations" we mean growing, shrinking, merging and
splitting memslots, which are not "atomic" because there is a time
window between the DELETE memslot operation and the CREATE one.
Currently, each memslot operation is performed with one or more
ioctls.
For example, merging two memslots into one would imply:
DELETE(m1)
DELETE(m2)
CREATE(m1+m2)
And a vcpu could attempt to read m2 right after it is deleted, but
before the new one is created.
Therefore the simplest solution is to pause all vcpus on the KVM
side, so that:
- userspace just needs to call the new API before making memslot
changes, keeping modifications to a minimum
- dirty page updates can also be performed while all vcpus are
blocked, so there is no time window between the dirty page ioctl
and the memslot modifications
- there is no need to modify the existing memslots API
Emanuele Giuseppe Esposito (4):
linux-headers/linux/kvm.h: introduce kvm_userspace_memory_region_list
ioctl
KVM: introduce kvm_clear_all_cpus_request
KVM: introduce memory transaction semaphore
KVM: use signals to abort enter_guest/blocking and retry
Documentation/virt/kvm/vcpu-requests.rst | 3 ++
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 8 +++++
include/uapi/linux/kvm.h | 3 ++
virt/kvm/kvm_main.c | 45 ++++++++++++++++++++++++
5 files changed, 61 insertions(+)
--
2.31.1
Introduce the new KVM_KICK_ALL_RUNNING_VCPUS and KVM_RESUME_ALL_KICKED_VCPUS
ioctls, used respectively to pause and then resume all vcpus
currently executing KVM_RUN in KVM.
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
include/uapi/linux/kvm.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index eed0315a77a6..d3cba8d4ca91 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2227,4 +2227,7 @@ struct kvm_s390_zpci_op {
/* flags for kvm_s390_zpci_op->u.reg_aen.flags */
#define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
+#define KVM_KICK_ALL_RUNNING_VCPUS _IO(KVMIO, 0xd2)
+#define KVM_RESUME_ALL_KICKED_VCPUS _IO(KVMIO, 0xd3)
+
#endif /* __LINUX_KVM_H */
--
2.31.1
Once a vcpu executes KVM_RUN, it can be in one of two states:
running in guest mode, or blocked/halted.
Use a signal to allow a vcpu to exit guest mode or unblock,
so that it can exit KVM_RUN and release the read semaphore,
allowing a pending KVM_KICK_ALL_RUNNING_VCPUS to continue.
Note that the signal is not cleared here: it is used to propagate the
exit reason up to vcpu_run(), and will be cleared only by
KVM_RESUME_ALL_KICKED_VCPUS. This allows the vcpu to keep trying
to enter KVM_RUN, performing again all the checks done in
kvm_arch_vcpu_ioctl_run() before entering guest mode, where it
will return again if the request is still set.
However, the userspace hypervisor should also try to avoid
continuously calling KVM_RUN after invoking KVM_KICK_ALL_RUNNING_VCPUS,
because each such call will just translate into a back-to-back down_read()
and up_read() (thanks to the signal).
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 8 ++++++++
virt/kvm/kvm_main.c | 21 +++++++++++++++++++++
3 files changed, 31 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aa381ab69a19..d5c37f344d65 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -108,6 +108,8 @@
KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_USERSPACE_KICK \
+ KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0c47b41c264..2af5f427b4e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10270,6 +10270,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
if (kvm_request_pending(vcpu)) {
+ if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu)) {
+ r = -EINTR;
+ goto out;
+ }
if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
r = -EIO;
goto out;
@@ -10701,6 +10705,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
r = vcpu_block(vcpu);
}
+ /* vcpu exited guest/unblocked because of this request */
+ if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
+ return -EINTR;
+
if (r <= 0)
break;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ae0240928a4a..13fa7229b85d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3431,6 +3431,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
goto out;
if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
goto out;
+ if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
+ goto out;
ret = 0;
out:
@@ -4668,6 +4670,25 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
break;
}
+ case KVM_KICK_ALL_RUNNING_VCPUS: {
+ /*
+ * Notify all running vcpus that they have to stop.
+ * Caught in kvm_arch_vcpu_ioctl_run()
+ */
+ kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
+
+ /*
+ * Use wr semaphore to wait for all vcpus to exit from KVM_RUN.
+ */
+ down_write(&memory_transaction);
+ up_write(&memory_transaction);
+ break;
+ }
+ case KVM_RESUME_ALL_KICKED_VCPUS: {
+ /* Remove all requests sent with KVM_KICK_ALL_RUNNING_VCPUS */
+ kvm_clear_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
+ break;
+ }
case KVM_SET_USER_MEMORY_REGION: {
struct kvm_userspace_memory_region kvm_userspace_mem;
--
2.31.1
Clear the given request in all vcpus of the VM with struct kvm.
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
Documentation/virt/kvm/vcpu-requests.rst | 3 +++
virt/kvm/kvm_main.c | 10 ++++++++++
2 files changed, 13 insertions(+)
diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
index 31f62b64e07b..468410dfe84d 100644
--- a/Documentation/virt/kvm/vcpu-requests.rst
+++ b/Documentation/virt/kvm/vcpu-requests.rst
@@ -36,6 +36,9 @@ its TLB with a VCPU request. The API consists of the following functions::
/* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
+ /* Clear request @req of all VCPUs of the VM with struct kvm @kvm. */
+ void kvm_clear_all_cpus_request(struct kvm *kvm, unsigned int req);
+
Typically a requester wants the VCPU to perform the activity as soon
as possible after making the request. This means most requests
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 584a5bab3af3..c080b93edc0d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -355,6 +355,16 @@ bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
}
EXPORT_SYMBOL_GPL(kvm_make_all_cpus_request);
+void kvm_clear_all_cpus_request(struct kvm *kvm, unsigned int req)
+{
+ unsigned long i;
+ struct kvm_vcpu *vcpu;
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_clear_request(req, vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_clear_all_cpus_request);
+
#ifndef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
void kvm_flush_remote_tlbs(struct kvm *kvm)
{
--
2.31.1
Right now the semaphore is only used to signal that a vcpu
has entered KVM_RUN (not necessarily in guest mode; it could also be
blocked/halted).
Later it will be used by specific ioctls (writers) to wait until
all vcpus (readers) have exited KVM_RUN.
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
virt/kvm/kvm_main.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c080b93edc0d..ae0240928a4a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -119,6 +119,8 @@ static const struct file_operations stat_fops_per_vm;
static struct file_operations kvm_chardev_ops;
+static DECLARE_RWSEM(memory_transaction);
+
static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl,
unsigned long arg);
#ifdef CONFIG_KVM_COMPAT
@@ -4074,7 +4076,19 @@ static long kvm_vcpu_ioctl(struct file *filp,
synchronize_rcu();
put_pid(oldpid);
}
+ /*
+ * Notify that a vcpu wants to run, and thus could be reading
+ * memslots.
+ * If KVM_KICK_ALL_RUNNING_VCPUS runs afterwards, it will have
+ * to wait until KVM_RUN has exited and up_read() has been called.
+ * If KVM_KICK_ALL_RUNNING_VCPUS has already returned but
+ * KVM_RESUME_ALL_KICKED_VCPUS has not started yet, there is
+ * a request pending for the vcpu that will cause it to
+ * exit KVM_RUN.
+ */
+ down_read(&memory_transaction);
r = kvm_arch_vcpu_ioctl_run(vcpu);
+ up_read(&memory_transaction);
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
}
--
2.31.1
On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
> Once a vcpu exectues KVM_RUN, it could enter two states:
> enter guest mode, or block/halt.
> Use a signal to allow a vcpu to exit the guest state or unblock,
> so that it can exit KVM_RUN and release the read semaphore,
> allowing a pending KVM_KICK_ALL_RUNNING_VCPUS to continue.
>
> Note that the signal is not deleted and used to propagate the
> exit reason till vcpu_run(). It will be clearead only by
> KVM_RESUME_ALL_KICKED_VCPUS. This allows the vcpu to keep try
> entering KVM_RUN and perform again all checks done in
> kvm_arch_vcpu_ioctl_run() before entering the guest state,
> where it will return again if the request is still set.
>
> However, the userspace hypervisor should also try to avoid
> continuously calling KVM_RUN after invoking KVM_KICK_ALL_RUNNING_VCPUS,
> because such call will just translate in a back-to-back down_read()
> and up_read() (thanks to the signal).
Since the userspace should anyway avoid going into this effectively-busy
wait, what about clearing the request after the first exit? The
cancellation ioctl can be kept for vCPUs that are never entered after
KVM_KICK_ALL_RUNNING_VCPUS. Alternatively, kvm_clear_all_cpus_request
could be done right before up_write().
Paolo
On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
> +static DECLARE_RWSEM(memory_transaction);
This cannot be global, it must be per-struct kvm. Otherwise one VM can
keep the rwsem indefinitely while a second VM hangs in
KVM_KICK_ALL_RUNNING_VCPUS.
It can also be changed to an SRCU (with the down_write+up_write sequence
changed to synchronize_srcu_expedited) which has similar characteristics
to your use of the rwsem.
Paolo
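Paolo's per-VM SRCU variant could look roughly like this (an untested sketch of kernel-side code, not part of the posted series; the field name and the init/cleanup placement are assumptions):

```c
/* sketch: per-VM SRCU replacing the global rwsem */
struct kvm {
	/* ... */
	struct srcu_struct memory_transaction;	/* init in kvm_create_vm() */
};

/* vcpu side, kvm_vcpu_ioctl(), KVM_RUN case: */
idx = srcu_read_lock(&kvm->memory_transaction);
r = kvm_arch_vcpu_ioctl_run(vcpu);
srcu_read_unlock(&kvm->memory_transaction, idx);

/* VM side, KVM_KICK_ALL_RUNNING_VCPUS: */
kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
/* waits for all readers (vcpus inside KVM_RUN) to leave the read side */
synchronize_srcu_expedited(&kvm->memory_transaction);
```

Like the rwsem's down_write()/up_write() pair, the synchronize call returns only once every vcpu has left KVM_RUN, but SRCU readers never block each other or the writer's future readers.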
Am 23/10/2022 um 19:48 schrieb Paolo Bonzini:
> On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
>> Once a vcpu exectues KVM_RUN, it could enter two states:
>> enter guest mode, or block/halt.
>> Use a signal to allow a vcpu to exit the guest state or unblock,
>> so that it can exit KVM_RUN and release the read semaphore,
>> allowing a pending KVM_KICK_ALL_RUNNING_VCPUS to continue.
>>
>> Note that the signal is not deleted and used to propagate the
>> exit reason till vcpu_run(). It will be clearead only by
>> KVM_RESUME_ALL_KICKED_VCPUS. This allows the vcpu to keep try
>> entering KVM_RUN and perform again all checks done in
>> kvm_arch_vcpu_ioctl_run() before entering the guest state,
>> where it will return again if the request is still set.
>>
>> However, the userspace hypervisor should also try to avoid
>> continuously calling KVM_RUN after invoking KVM_KICK_ALL_RUNNING_VCPUS,
>> because such call will just translate in a back-to-back down_read()
>> and up_read() (thanks to the signal).
>
> Since the userspace should anyway avoid going into this effectively-busy
> wait, what about clearing the request after the first exit? The
> cancellation ioctl can be kept for vCPUs that are never entered after
> KVM_KICK_ALL_RUNNING_VCPUS. Alternatively, kvm_clear_all_cpus_request
> could be done right before up_write().
Clearing makes sense, but should we "trust" the userspace not to go into
a busy wait?
What's the typical "contract" between KVM and the userspace? Meaning,
should we cover basic usage mistakes like busy-waiting on
KVM_RUN?
If we don't, I can add a comment when clearing and of course also
mention it in the API documentation (which I forgot to update, sorry :D)
Emanuele
>
> Paolo
>
Am 24/10/2022 um 09:43 schrieb Emanuele Giuseppe Esposito:
>
>
> Am 23/10/2022 um 19:48 schrieb Paolo Bonzini:
>> On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
>>> Once a vcpu exectues KVM_RUN, it could enter two states:
>>> enter guest mode, or block/halt.
>>> Use a signal to allow a vcpu to exit the guest state or unblock,
>>> so that it can exit KVM_RUN and release the read semaphore,
>>> allowing a pending KVM_KICK_ALL_RUNNING_VCPUS to continue.
>>>
>>> Note that the signal is not deleted and used to propagate the
>>> exit reason till vcpu_run(). It will be clearead only by
>>> KVM_RESUME_ALL_KICKED_VCPUS. This allows the vcpu to keep try
>>> entering KVM_RUN and perform again all checks done in
>>> kvm_arch_vcpu_ioctl_run() before entering the guest state,
>>> where it will return again if the request is still set.
>>>
>>> However, the userspace hypervisor should also try to avoid
>>> continuously calling KVM_RUN after invoking KVM_KICK_ALL_RUNNING_VCPUS,
>>> because such call will just translate in a back-to-back down_read()
>>> and up_read() (thanks to the signal).
>>
>> Since the userspace should anyway avoid going into this effectively-busy
>> wait, what about clearing the request after the first exit? The
>> cancellation ioctl can be kept for vCPUs that are never entered after
>> KVM_KICK_ALL_RUNNING_VCPUS. Alternatively, kvm_clear_all_cpus_request
>> could be done right before up_write().
>
> Clearing makes sense, but should we "trust" the userspace not to go into
> busy wait?
Actually, since this change is just a s/test/check, I would rather keep
kvm_test_request(). If userspace does things wrong, this mechanism will
still work properly.
> What's the typical "contract" between KVM and the userspace? Meaning,
> should we cover the basic usage mistakes like forgetting to busy wait on
> KVM_RUN?
>
> If we don't, I can add a comment when clearing and of course also
> mention it in the API documentation (that I forgot to update, sorry :D)
>
> Emanuele
>
>>
>> Paolo
>>
Am 22.10.22 um 17:48 schrieb Emanuele Giuseppe Esposito:
> This new API allows the userspace to stop all running
> vcpus using KVM_KICK_ALL_RUNNING_VCPUS ioctl, and resume them with
> KVM_RESUME_ALL_KICKED_VCPUS.
> A "running" vcpu is a vcpu that is executing the KVM_RUN ioctl.
>
> This serie is especially helpful to userspace hypervisors like
> QEMU when they need to perform operations on memslots without the
> risk of having a vcpu reading them in the meanwhile.
> With "memslots operations" we mean grow, shrink, merge and split
> memslots, which are not "atomic" because there is a time window
> between the DELETE memslot operation and the CREATE one.
> Currently, each memslot operation is performed with one or more
> ioctls.
> For example, merging two memslots into one would imply:
> DELETE(m1)
> DELETE(m2)
> CREATE(m1+m2)
>
> And a vcpu could attempt to read m2 right after it is deleted, but
> before the new one is created.
>
> Therefore the simplest solution is to pause all vcpus in the kvm
> side, so that:
> - userspace just needs to call the new API before making memslots
> changes, keeping modifications to the minimum
> - dirty page updates are also performed when vcpus are blocked, so
> there is no time window between the dirty page ioctl and memslots
> modifications, since vcpus are all stopped.
> - no need to modify the existing memslots API
Isn't QEMU able to achieve the same goal today by forcing all vCPUs
into userspace with a signal? Can you provide some rationale in the
cover letter or patch description for why this approach is better?
Am 24/10/2022 um 09:56 schrieb Christian Borntraeger:
> Isnt QEMU able to achieve the same goal today by forcing all vCPUs
> into userspace with a signal? Can you provide some rationale why this
> is better in the cover letter or patch description?
>
David Hildenbrand tried to propose something similar here:
https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a
While it is not optimized, I think it's more complex than the current
series, since QEMU would also have to make sure all running ioctls finish
and prevent new ones from being executed.
Also, we can't use pause_all_vcpus()/resume_all_vcpus() because they drop
the BQL.
Would that be OK as a rationale?
Thank you,
Emanuele
Am 24.10.22 um 10:33 schrieb Emanuele Giuseppe Esposito:
>
>
> Am 24/10/2022 um 09:56 schrieb Christian Borntraeger:
>> Isnt QEMU able to achieve the same goal today by forcing all vCPUs
>> into userspace with a signal? Can you provide some rationale why this
>> is better in the cover letter or patch description?
>>
> David Hildenbrand tried to propose something similar here:
> https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a
>
> While it is not optimized, I think it's more complex that the current
> serie, since qemu should also make sure all running ioctls finish and
> prevent the new ones from getting executed.
>
> Also we can't use pause_all_vcpus()/resume_all_vcpus() because they drop
> the BQL.
>
> Would that be ok as rationale?
Yes, that helps and should be part of the cover letter in the next iteration.
Am 23/10/2022 um 19:50 schrieb Paolo Bonzini:
> On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
>> +static DECLARE_RWSEM(memory_transaction);
>
> This cannot be global, it must be per-struct kvm. Otherwise one VM can
> keep the rwsem indefinitely while a second VM hangs in
> KVM_KICK_ALL_RUNNING_VCPUS.
>
> It can also be changed to an SRCU (with the down_write+up_write sequence
> changed to synchronize_srcu_expedited) which has similar characteristics
> to your use of the rwsem.
>
Makes sense, but why synchronize_srcu_expedited and not synchronize_srcu?
Thank you,
Emanuele
On Mon, Oct 24, 2022, Christian Borntraeger wrote:
> Am 24.10.22 um 10:33 schrieb Emanuele Giuseppe Esposito:
> > Am 24/10/2022 um 09:56 schrieb Christian Borntraeger:
> > > > Therefore the simplest solution is to pause all vcpus in the kvm
> > > > side, so that:
Simplest for QEMU maybe, most definitely not simplest for KVM.
> > > > - userspace just needs to call the new API before making memslots
> > > > changes, keeping modifications to the minimum
> > > > - dirty page updates are also performed when vcpus are blocked, so
> > > > there is no time window between the dirty page ioctl and memslots
> > > > modifications, since vcpus are all stopped.
> > > > - no need to modify the existing memslots API
> > > Isnt QEMU able to achieve the same goal today by forcing all vCPUs
> > > into userspace with a signal? Can you provide some rationale why this
> > > is better in the cover letter or patch description?
> > >
> > David Hildenbrand tried to propose something similar here:
> > https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a
> >
> > While it is not optimized, I think it's more complex that the current
> > serie, since qemu should also make sure all running ioctls finish and
> > prevent the new ones from getting executed.
> >
> > Also we can't use pause_all_vcpus()/resume_all_vcpus() because they drop
> > the BQL.
> >
> > Would that be ok as rationale?
>
> Yes that helps and should be part of the cover letter for the next iterations.
But that doesn't explain why KVM needs to get involved, it only explains why QEMU
can't use its existing pause_all_vcpus(). I do not understand why this is a
problem QEMU needs KVM's help to solve.
On 10/25/22 00:45, Sean Christopherson wrote:
>> Yes that helps and should be part of the cover letter for the next iterations.
> But that doesn't explain why KVM needs to get involved, it only explains why QEMU
> can't use its existing pause_all_vcpus(). I do not understand why this is a
> problem QEMU needs KVM's help to solve.
I agree that it's not KVM's problem that QEMU cannot use
pause_all_vcpus(). Having an ioctl in KVM, rather than coding the same
in QEMU, is *mostly* a matter of programmer and computer efficiency,
because the code is pretty simple.
That said, I believe the limited memslot API makes it more than just a
QEMU problem. Because KVM_GET_DIRTY_LOG cannot be combined atomically
with KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log
regions while the VM is running is liable to lose the dirty status of
some pages. That's also a reason to provide this API in KVM.
Paolo
On 10/24/22 09:43, Emanuele Giuseppe Esposito wrote:
>> Since the userspace should anyway avoid going into this effectively-busy
>> wait, what about clearing the request after the first exit? The
>> cancellation ioctl can be kept for vCPUs that are never entered after
>> KVM_KICK_ALL_RUNNING_VCPUS. Alternatively, kvm_clear_all_cpus_request
>> could be done right before up_write().
>
> Clearing makes sense, but should we "trust" userspace not to go into a
> busy wait?
I think so, there are many other ways for userspace to screw up.
> What's the typical "contract" between KVM and the userspace? Meaning,
> should we cover basic usage mistakes like busy-waiting on
> KVM_RUN?
Being able to remove the second ioctl if you do (sort-of pseudocode
based on this v1)
    kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
    down_write(&kvm->memory_transaction);
    up_write(&kvm->memory_transaction);
    kvm_clear_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
would be worth it, I think.
Paolo
On 10/24/22 14:57, Emanuele Giuseppe Esposito wrote:
>
>
> Am 23/10/2022 um 19:50 schrieb Paolo Bonzini:
>> On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
>>> +static DECLARE_RWSEM(memory_transaction);
>>
>> This cannot be global, it must be per-struct kvm. Otherwise one VM can
>> keep the rwsem indefinitely while a second VM hangs in
>> KVM_KICK_ALL_RUNNING_VCPUS.
>>
>> It can also be changed to an SRCU (with the down_write+up_write sequence
>> changed to synchronize_srcu_expedited) which has similar characteristics
>> to your use of the rwsem.
>>
>
> Makes sense, but why synchronize_srcu_expedited and not synchronize_srcu?
Because (thanks to the kick) you expect the grace period to end almost
immediately, and synchronize_srcu() will noticeably slow down changes
to the memory map.
Paolo
On Tue, Oct 25, 2022, Paolo Bonzini wrote:
> On 10/25/22 00:45, Sean Christopherson wrote:
> > > Yes that helps and should be part of the cover letter for the next iterations.
> > But that doesn't explain why KVM needs to get involved, it only explains why QEMU
> > can't use its existing pause_all_vcpus(). I do not understand why this is a
> > problem QEMU needs KVM's help to solve.
>
> I agree that it's not KVM's problem that QEMU cannot use pause_all_vcpus().
> Having an ioctl in KVM, rather than coding the same in QEMU, is *mostly* a
> matter of programmer and computer efficiency because the code is pretty
> simple.
>
> That said, I believe the limited memslot API makes it more than just a QEMU
> problem. Because KVM_GET_DIRTY_LOG cannot be combined atomically with
> KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log regions
> while the VM is running is liable to lose the dirty status of some pages.
... and doesn't already do the sane thing and pause vCPUs _and anything else that
can touch guest memory_ before modifying memslots. I honestly think QEMU is the
only VMM that would ever use this API.
> That's also a reason to provide this API in KVM.
It's frankly a terrible API though. Providing a way to force vCPUs out of KVM_RUN
is at best half of the solution.
Userspace still needs:
- a refcounting scheme to track the number of "holds" put on the system
- serialization to ensure KVM_RESUME_ALL_KICKED_VCPUS completes before a new
KVM_KICK_ALL_RUNNING_VCPUS is initiated
- to prevent _all_ ioctls() because it's not just KVM_RUN that consumes memslots
- to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
- to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)
And because of the nature of KVM, to support this API on all architectures, KVM
needs to make change on all architectures, whereas userspace should be able to
implement a generic solution.
On 10/25/22 17:55, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Paolo Bonzini wrote:
>> That said, I believe the limited memslot API makes it more than just a QEMU
>> problem. Because KVM_GET_DIRTY_LOG cannot be combined atomically with
>> KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log regions
>> while the VM is running is liable to lose the dirty status of some pages.
>
> ... and doesn't already do the sane thing and pause vCPUs _and anything else that
> can touch guest memory_ before modifying memslots. I honestly think QEMU is the
> only VMM that would ever use this API. Providing a way to force vCPUs out of
> KVM_RUN is at best half of the solution.
I agree this is not a full solution (and I do want to remove
KVM_RESUME_ALL_KICKED_VCPUS).
> - a refcounting scheme to track the number of "holds" put on the system
> - serialization to ensure KVM_RESUME_ALL_KICKED_VCPUS completes before a new
> KVM_KICK_ALL_RUNNING_VCPUS is initiated
Both of these can be just a mutex, the others are potentially more
interesting but I'm not sure I understand them:
> - to prevent _all_ ioctls() because it's not just KVM_RUN that consumes memslots
This is perhaps an occasion to solve another disagreement: I still think
that accessing memory outside KVM_RUN (for example KVM_SET_NESTED_STATE
loading the APICv pages from VMCS12) is a bug, on the other hand we
disagreed on that and you wanted to kill KVM_REQ_GET_NESTED_STATE_PAGES.
> - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider
it a guest bug to access the memory (i.e. ignore the strange read-only
changes which only happen at boot, and which I agree are QEMU-specific)?
> - to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
> outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)
This is the more important one but why would it livelock?
> And because of the nature of KVM, to support this API on all architectures, KVM
> needs to make change on all architectures, whereas userspace should be able to
> implement a generic solution.
Yes, I agree that this is essentially just a more efficient kill().
Emanuele, perhaps you can put together a patch to x86/vmexit.c in
kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs
are in a for(;;) busy wait, to measure the various ways to do it?
Paolo
On Tue, Oct 25, 2022, Paolo Bonzini wrote:
> On 10/25/22 17:55, Sean Christopherson wrote:
> > On Tue, Oct 25, 2022, Paolo Bonzini wrote:
> > - to prevent _all_ ioctls() because it's not just KVM_RUN that consumes memslots
>
> This is perhaps an occasion to solve another disagreement: I still think
> that accessing memory outside KVM_RUN (for example KVM_SET_NESTED_STATE
> loading the APICv pages from VMCS12) is a bug, on the other hand we
> disagreed on that and you wanted to kill KVM_REQ_GET_NESTED_STATE_PAGES.
I don't think it's realistic to make accesses outside of KVM_RUN go away, e.g.
see the ARM ITS discussion in the dirty ring thread. kvm_xen_set_evtchn() also
explicitly depends on writing guest memory without going through KVM_RUN (and
apparently can be invoked from a kernel thread?!?).
In theory, I do actually like the idea of restricting memory access to KVM_RUN,
but in reality I just think that forcing everything into KVM_RUN creates far more
problems than it solves. E.g. my complaint with KVM_REQ_GET_NESTED_STATE_PAGES
is that instead of synchronously telling userspace it has a problem, KVM chugs
along as if everything is fine and only fails at a later point in time. I doubt
userspace would actually do anything differently, i.e. the VM is likely hosed no
matter what, but deferring work adds complexity in KVM and makes it more difficult
to debug problems when they occur.
> > - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
>
> Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
> a guest bug to access the memory (i.e. ignore the strange read-only changes
> which only happen at boot, and which I agree are QEMU-specific)?
Yes? I don't know exactly what "the KVM_GET_DIRTY_LOG case" is.
> > - to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
> > outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)
>
> This is the more important one but why would it livelock?
Livelock may not be the right word. Peter's series is addressing a scenario where
a vCPU gets stuck faulting in a page because the page never arrives over the
network. The solution is to recognize non-fatal signals while trying to fault in
the page.
KVM_KICK_ALL_RUNNING_VCPUS doesn't handle that case because it's obviously not
realistic to check for pending KVM requests while buried deep in mm/ code. I.e.
userspace also needs to send SIGUSR1 or whatever to ensure all vCPUs get kicked
out of non-KVM code.
That's not the end of the world, and they probably end up being orthogonal things
in userspace code, but it yields a weird API because KVM_KICK_ALL_RUNNING_VCPUS
ends up with the caveat of "oh, by the way, userspace also needs to signal all
vCPU tasks too, otherwise KVM_KICK_ALL_RUNNING_VCPUS might hang".
> > And because of the nature of KVM, to support this API on all architectures, KVM
> > needs to make change on all architectures, whereas userspace should be able to
> > implement a generic solution.
>
> Yes, I agree that this is essentially just a more efficient kill().
> Emanuele, perhaps you can put together a patch to x86/vmexit.c in
> kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in
> a for(;;) busy wait, to measure the various ways to do it?
I'm a bit confused. Is the goal of this to simplify QEMU, dedup VMM code, provide
a more performant solution, something else entirely? I.e. why measure the
performance of x86/vmexit.c? I have a hard time believing the overhead of pausing
vCPUs is going to be the long pole when it comes to memslot changes. I assume
rebuilding KVM's page tables because of the "zap all" behavior would
completely dwarf any overhead from pausing vCPUs.
On 10/26/22 01:07, Sean Christopherson wrote:
> I don't think it's realistic to make accesses outside of KVM_RUN go away, e.g.
> see the ARM ITS discussion in the dirty ring thread. kvm_xen_set_evtchn() also
> explicitly depends on writing guest memory without going through KVM_RUN (and
> apparently can be invoked from a kernel thread?!?).
Yeah, those are the pages that must be considered dirty when using the
dirty ring.
> In theory, I do actually like the idea of restricting memory access to KVM_RUN,
> but in reality I just think that forcing everything into KVM_RUN creates far more
> problems than it solves. E.g. my complaint with KVM_REQ_GET_NESTED_STATE_PAGES
> is that instead of synchronously telling userspace it has a problem, KVM chugs
> along as if everything is fine and only fails at a later point in time. I doubt
> userspace would actually do anything differently, i.e. the VM is likely hosed no
> matter what, but deferring work adds complexity in KVM and makes it more difficult
> to debug problems when they occur.
>
>>> - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
>>
>> Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
>> a guest bug to access the memory (i.e. ignore the strange read-only changes
>> which only happen at boot, and which I agree are QEMU-specific)?
>
> Yes? I don't know exactly what "the KVM_GET_DIRTY_LOG case" is.
It is not possible to atomically read the dirty bitmap and delete a
memslot. When you delete a memslot, the bitmap is gone. In this case
however memory accesses to the deleted memslot are a guest bug, so
stopping KVM-GT would not be necessary.
So while I'm being slowly convinced that QEMU should find a way to pause
its vCPUs around memslot changes, I'm not sure that pausing everything
is needed in general.
>>> And because of the nature of KVM, to support this API on all architectures, KVM
>>> needs to make change on all architectures, whereas userspace should be able to
>>> implement a generic solution.
>>
>> Yes, I agree that this is essentially just a more efficient kill().
>> Emanuele, perhaps you can put together a patch to x86/vmexit.c in
>> kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in
>> a for(;;) busy wait, to measure the various ways to do it?
>
> I'm a bit confused. Is the goal of this to simplify QEMU, dedup VMM code, provide
> a more performant solution, something else entirely?
Well, a bit of all of them and perhaps that's the problem. And while
the issues at hand *are* self-inflicted wounds on the part of QEMU, it seems
to me that the underlying issues are general.
For example, Alex Graf and I looked back at your proposal of a userspace
exit for "bad" accesses to memory, wondering if it could help with
Hyper-V VTLs too. To recap, the "higher privileged" code at VTL1 can
set up VM-wide restrictions on access to some pages through a hypercall
(HvModifyVtlProtectionMask). After the hypercall, VTL0 would not be
able to access those pages. The hypercall would be handled in userspace
and would invoke a KVM_SET_MEMORY_REGION_PERM ioctl to restrict the RWX
permissions, and this ioctl would set up a VM-wide permission bitmap
that would be used when building page tables.
Using such a bitmap instead of memslots makes it possible to cause
userspace vmexits on VTL mapping violations with efficient data
structures. And it would also be possible to use this mechanism around
KVM_GET_DIRTY_LOG, to read the KVM dirty bitmap just before removing a
memslot.
However, external accesses to the regions (ITS, Xen, KVM-GT, non KVM_RUN
ioctls) would not be blocked, due to the lack of a way to report the
exit. The intersection of these features with VTLs should be very small
(sometimes zero since VTLs are x86 only), but the ioctls would be a
problem so I'm wondering what your thoughts are on this.
Also, while the exit API could be the same, it is not clear to me that
the permission bitmap would be a good match for entirely "void" memslots
used to work around non-atomic memslot changes. So for now let's leave
this aside and only consider the KVM_GET_DIRTY_LOG case.
Paolo
On Wed, Oct 26, 2022, Paolo Bonzini wrote:
> On 10/26/22 01:07, Sean Christopherson wrote:
> > > > - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
> > >
> > > Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
> > > a guest bug to access the memory (i.e. ignore the strange read-only changes
> > > which only happen at boot, and which I agree are QEMU-specific)?
> >
> > Yes? I don't know exactly what "the KVM_GET_DIRTY_LOG case" is.
>
> It is not possible to atomically read the dirty bitmap and delete a memslot.
> When you delete a memslot, the bitmap is gone. In this case however memory
> accesses to the deleted memslot are a guest bug, so stopping KVM-GT would
> not be necessary.
If accesses to the deleted memslot are a guest bug, why do you care about pausing
vCPUs? I don't mean to be belligerent, I'm genuinely confused.
> So while I'm being slowly convinced that QEMU should find a way to pause its
> vCPUs around memslot changes, I'm not sure that pausing everything is needed
> in general.
>
> > > > And because of the nature of KVM, to support this API on all architectures, KVM
> > > > needs to make change on all architectures, whereas userspace should be able to
> > > > implement a generic solution.
> > >
> > > Yes, I agree that this is essentially just a more efficient kill().
> > > Emanuele, perhaps you can put together a patch to x86/vmexit.c in
> > > kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in
> > > a for(;;) busy wait, to measure the various ways to do it?
> >
> > I'm a bit confused. Is the goal of this to simplify QEMU, dedup VMM code, provide
> > a more performant solution, something else entirely?
>
> Well, a bit of all of them and perhaps that's the problem. And while the
> issues at hand *are* self-inflicted wounds on the part of QEMU, it seems to me
> that the underlying issues are general.
>
> For example, Alex Graf and I looked back at your proposal of a userspace
> exit for "bad" accesses to memory, wondering if it could help with Hyper-V
> VTLs too. To recap, the "higher privileged" code at VTL1 can set up VM-wide
> restrictions on access to some pages through a hypercall
> (HvModifyVtlProtectionMask). After the hypercall, VTL0 would not be able to
> access those pages. The hypercall would be handled in userspace and would
> invoke a KVM_SET_MEMORY_REGION_PERM ioctl to restrict the RWX permissions,
> and this ioctl would set up a VM-wide permission bitmap that would be used
> when building page tables.
>
> Using such a bitmap instead of memslots makes it possible to cause userspace
> vmexits on VTL mapping violations with efficient data structures. And it
> would also be possible to use this mechanism around KVM_GET_DIRTY_LOG, to
> read the KVM dirty bitmap just before removing a memslot.
What exactly is the behavior you're trying to achieve for KVM_GET_DIRTY_LOG => delete?
If KVM provides KVM_EXIT_MEMORY_FAULT, can you not achieve the desired behavior by
doing mprotect(PROT_NONE) => KVM_GET_DIRTY_LOG => delete? If PROT_NONE causes the
memory to be freed, won't mprotect(PROT_READ) do what you want even without
KVM_EXIT_MEMORY_FAULT?
> However, external accesses to the regions (ITS, Xen, KVM-GT, non KVM_RUN
> ioctls) would not be blocked, due to the lack of a way to report the exit.
Aren't all of those out of scope? E.g. in a very hypothetical world where Xen's
event channel is being used with VTLs, if VTL1 makes the event channel inaccessible,
that's a guest and/or userspace configuration issue and the guest is hosed no matter
what KVM does. Ditto for the case where KVM-GT's buffer is blocked. I'm guessing
the ITS is similar?
> The intersection of these features with VTLs should be very small (sometimes
> zero since VTLs are x86 only), but the ioctls would be a problem so I'm
> wondering what your thoughts are on this.
How do the ioctls() map to VTLs? I.e. are they considered VTL0, VTL1, out-of-band?
> Also, while the exit API could be the same, it is not clear to me that the
> permission bitmap would be a good match for entirely "void" memslots used to
> work around non-atomic memslot changes. So for now let's leave this aside
> and only consider the KVM_GET_DIRTY_LOG case.
As above, can't userspace just mprotect() the entire memslot to prevent writes
between getting the dirty log and deleting the memslot?