v2:
* move request_exits debug counter patch right after introduction of
KVM_REQ_EXIT [3/5]
* use vcpu ioctl instead of vm one [4/5]
* shrink kvm_user_exit from 64 to 32 bytes [4/5]
* new [5/5]
QEMU uses SIGUSR1 to force a userspace exit and also to queue an early
exit before calling KVM_RUN -- the signal is blocked in userspace and
temporarily unblocked in KVM_RUN.
The temporary unblocking by sigprocmask() in kvm_arch_vcpu_ioctl_run()
takes a shared siglock, which leads to cacheline bouncing on NUMA
systems.
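For reference, the replaced flow looks roughly like this on the userspace
side (a sketch that only approximates QEMU's usage, not copied from its
source; vcpu_fd and vcpu_thread are assumptions of the example):

	/* VCPU thread setup: keep SIGUSR1 blocked while in userspace so
	 * a kick stays pending ... */
	sigset_t set;

	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	pthread_sigmask(SIG_BLOCK, &set, NULL);

	/* ... and have KVM_RUN temporarily unblock it; this unblocking
	 * is the sigprocmask() path that takes the shared siglock. */
	pthread_sigmask(SIG_SETMASK, NULL, &set);
	sigdelset(&set, SIGUSR1);
	struct kvm_signal_mask *mask = malloc(sizeof(*mask) + sizeof(set));
	mask->len = 8;	/* the kernel's sigset size */
	memcpy(mask->sigset, &set, sizeof(set));
	ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, mask);

	/* Any thread then forces an exit from KVM_RUN with: */
	pthread_kill(vcpu_thread, SIGUSR1);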
This series achieves the same with a new request bit and a VCPU ioctl
that marks and kicks the target VCPU, so no unblocking is needed.
The inl_from_{pmtimer,qemu} vmexit benchmarks from kvm-unit-tests show a
~5% speedup for 1-4 VCPUs (300-2000 saved cycles) without noticeably
regressing kernel VM exits.
(Paolo did a quick run of an older version of this series on a NUMA system
and the speedup was around 35% when utilizing more nodes.)
Radim Krčmář (5):
KVM: add kvm_has_request wrapper
KVM: add KVM_REQ_EXIT request for userspace exit
KVM: x86: add request_exits debug counter
KVM: add KVM_USER_EXIT vcpu ioctl for userspace exit
KVM: refactor asynchronous vcpu ioctl dispatch
Documentation/virtual/kvm/api.txt | 30 ++++++++++++++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx.c | 4 ++--
arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 15 +++++++++++++--
include/uapi/linux/kvm.h | 8 ++++++++
virt/kvm/kvm_main.c | 14 +++++++++-----
7 files changed, 95 insertions(+), 9 deletions(-)
--
2.5.0
We want to have requests abstracted from bit operations.
Signed-off-by: Radim Krčmář <[email protected]>
---
arch/x86/kvm/vmx.c | 2 +-
include/linux/kvm_host.h | 7 ++++++-
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 217f66343dc8..17514fe7d2cb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5879,7 +5879,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
if (intr_window_requested && vmx_interrupt_allowed(vcpu))
return handle_interrupt_window(&vmx->vcpu);
- if (test_bit(KVM_REQ_EVENT, &vcpu->requests))
+ if (kvm_has_request(KVM_REQ_EVENT, vcpu))
return 1;
err = emulate_instruction(vcpu, EMULTYPE_NO_REEXECUTE);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27ccdf91a465..52e388367a26 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1089,9 +1089,14 @@ static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
set_bit(req, &vcpu->requests);
}
+static inline bool kvm_has_request(int req, struct kvm_vcpu *vcpu)
+{
+ return test_bit(req, &vcpu->requests);
+}
+
static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
{
- if (test_bit(req, &vcpu->requests)) {
+ if (kvm_has_request(req, vcpu)) {
clear_bit(req, &vcpu->requests);
return true;
} else {
--
2.5.0
When userspace wants KVM to exit to userspace, it sends a signal.
This has the disadvantage of requiring a change to the signal mask because
the signal needs to be blocked in userspace to stay pending when sent to
self.
Using a request flag allows us to shave 200-300 cycles from every
userspace exit, and the speedup grows with NUMA because unblocking
touches a shared spinlock.
The disadvantage is that it adds the overhead of one bit check to all
kernel exits. Quick tracing shows that the ratio of userspace exits
right after boot is about 1/5 and drops to about 1/60 during a subsequent
run of nmap and a kernel compile, so the check should not regress global
performance.
All signal_pending() calls are userspace exit requests, so we add a
check for KVM_REQ_EXIT there. One call in kvm_vcpu_run is omitted
because KVM_REQ_EXIT is implied by an earlier check for requests.
Signed-off-by: Radim Krčmář <[email protected]>
---
arch/x86/kvm/vmx.c | 2 +-
arch/x86/kvm/x86.c | 6 ++++++
include/linux/kvm_host.h | 8 +++++++-
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 2 +-
5 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 17514fe7d2cb..496f6ba1c812 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5903,7 +5903,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
goto out;
}
- if (signal_pending(current))
+ if (kvm_need_exit(vcpu))
goto out;
if (need_resched())
schedule();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ec7bf2b16948..c5d790fdfc2e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6547,6 +6547,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
++vcpu->stat.signal_exits;
break;
}
+ if (unlikely(kvm_has_request(KVM_REQ_EXIT, vcpu))) {
+ r = 0;
+ vcpu->run->exit_reason = KVM_EXIT_REQUEST;
+ break;
+ }
if (need_resched()) {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
cond_resched();
@@ -6683,6 +6688,7 @@ out:
post_kvm_run_save(vcpu);
if (vcpu->sigset_active)
sigprocmask(SIG_SETMASK, &sigsaved, NULL);
+ clear_bit(KVM_REQ_EXIT, &vcpu->requests);
return r;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 52e388367a26..dcc57171e3ec 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -121,7 +121,7 @@ static inline bool is_error_page(struct page *page)
#define KVM_REQ_UNHALT 6
#define KVM_REQ_MMU_SYNC 7
#define KVM_REQ_CLOCK_UPDATE 8
-#define KVM_REQ_KICK 9
+#define KVM_REQ_EXIT 9
#define KVM_REQ_DEACTIVATE_FPU 10
#define KVM_REQ_EVENT 11
#define KVM_REQ_APF_HALT 12
@@ -1104,6 +1104,12 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
}
}
+static inline bool kvm_need_exit(struct kvm_vcpu *vcpu)
+{
+ return signal_pending(current) ||
+ kvm_has_request(KVM_REQ_EXIT, vcpu);
+}
+
extern bool kvm_rebooting;
struct kvm_device {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 26daafbba9ec..d996a7cdb4d2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -184,6 +184,7 @@ struct kvm_s390_skeys {
#define KVM_EXIT_SYSTEM_EVENT 24
#define KVM_EXIT_S390_STSI 25
#define KVM_EXIT_IOAPIC_EOI 26
+#define KVM_EXIT_REQUEST 27
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 093b3d10b411..b34a328fdac1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1914,7 +1914,7 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
}
if (kvm_cpu_has_pending_timer(vcpu))
return -EINTR;
- if (signal_pending(current))
+ if (kvm_need_exit(vcpu))
return -EINTR;
return 0;
--
2.5.0
We are still interested in the number of exits userspace requested, and
signal_exits doesn't cover that anymore.
Signed-off-by: Radim Krčmář <[email protected]>
---
v2: move request_exits debug counter patch right after introduction of
KVM_REQ_EXIT [3/5]
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6e851d518b24..c875f70df02a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -717,6 +717,7 @@ struct kvm_vcpu_stat {
u32 hypercalls;
u32 irq_injections;
u32 nmi_injections;
+ u32 request_exits;
};
struct x86_instruction_info;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5d790fdfc2e..3493457ad0a1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -145,6 +145,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "io_exits", VCPU_STAT(io_exits) },
{ "mmio_exits", VCPU_STAT(mmio_exits) },
{ "signal_exits", VCPU_STAT(signal_exits) },
+ { "request_exits", VCPU_STAT(request_exits) },
{ "irq_window", VCPU_STAT(irq_window_exits) },
{ "nmi_window", VCPU_STAT(nmi_window_exits) },
{ "halt_exits", VCPU_STAT(halt_exits) },
@@ -6550,6 +6551,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(kvm_has_request(KVM_REQ_EXIT, vcpu))) {
r = 0;
vcpu->run->exit_reason = KVM_EXIT_REQUEST;
+ ++vcpu->stat.request_exits;
break;
}
if (need_resched()) {
--
2.5.0
Userspace can use KVM_USER_EXIT instead of signal-based exiting to
userspace. Availability depends on KVM_CAP_USER_EXIT.
Only x86 is implemented so far.
Signed-off-by: Radim Krčmář <[email protected]>
---
v2:
* use vcpu ioctl instead of vm one [4/5]
* shrink kvm_user_exit from 64 to 32 bytes [4/5]
Documentation/virtual/kvm/api.txt | 30 ++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
include/uapi/linux/kvm.h | 7 +++++++
virt/kvm/kvm_main.c | 5 +++--
4 files changed, 64 insertions(+), 2 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c714d43a717..c5844f0b8e7c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error
Queues an SMI on the thread's vcpu.
+
+4.97 KVM_USER_EXIT
+
+Capability: KVM_CAP_USER_EXIT
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_user_exit (in)
+Returns: 0 on success,
+ -EFAULT if the parameter couldn't be read,
+ -EINVAL if 'reserved' is not zeroed,
+
+struct kvm_user_exit {
+ __u8 reserved[32];
+};
+
+The ioctl is asynchronous to VCPU execution and can be issued from all threads.
+
+Make the VCPU exit to userspace as soon as possible. If the VCPU is not running
+in the kernel at the time, it will exit early on the next call to KVM_RUN.
+If the VCPU was going to exit because of other reasons when KVM_USER_EXIT was
+issued, it will keep the original exit reason and not exit early on the next
+KVM_RUN.
+If the VCPU exited because of KVM_USER_EXIT, the exit reason is KVM_EXIT_REQUEST.
+
+This ioctl has an effect very similar (the same, apart from some races on
+userspace exit) to sending a signal (one that is blocked in userspace and set
+in KVM_SET_SIGNAL_MASK) to the VCPU thread.
+
+
+
5. The kvm_run structure
------------------------
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3493457ad0a1..27d777eb34e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2466,6 +2466,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ASSIGN_DEV_IRQ:
case KVM_CAP_PCI_2_3:
#endif
+ case KVM_CAP_USER_EXIT:
r = 1;
break;
case KVM_CAP_X86_SMM:
@@ -3077,6 +3078,20 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
return 0;
}
+int kvm_vcpu_ioctl_user_exit(struct kvm_vcpu *vcpu, struct kvm_user_exit *info)
+{
+ struct kvm_user_exit valid = {};
+ BUILD_BUG_ON(sizeof(valid) != 32);
+
+ if (memcmp(info, &valid, sizeof(valid)))
+ return -EINVAL;
+
+ kvm_make_request(KVM_REQ_EXIT, vcpu);
+ kvm_vcpu_kick(vcpu);
+
+ return 0;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -3341,6 +3356,15 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = kvm_set_guest_paused(vcpu);
goto out;
}
+ case KVM_USER_EXIT: {
+ struct kvm_user_exit info;
+
+ r = -EFAULT;
+ if (copy_from_user(&info, argp, sizeof(info)))
+ goto out;
+ r = kvm_vcpu_ioctl_user_exit(vcpu, &info);
+ break;
+ }
default:
r = -EINVAL;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d996a7cdb4d2..bc5a1abe9626 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -826,6 +826,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_X86_SMM 117
#define KVM_CAP_MULTI_ADDRESS_SPACE 118
#define KVM_CAP_SPLIT_IRQCHIP 119
+#define KVM_CAP_USER_EXIT 120
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1008,6 +1009,10 @@ struct kvm_device_attr {
__u64 addr; /* userspace address of attr data */
};
+struct kvm_user_exit {
+ __u8 reserved[32];
+};
+
#define KVM_DEV_VFIO_GROUP 1
#define KVM_DEV_VFIO_GROUP_ADD 1
#define KVM_DEV_VFIO_GROUP_DEL 2
@@ -1119,6 +1124,8 @@ struct kvm_s390_ucas_mapping {
#define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO, 0xab, struct kvm_arm_device_addr)
/* Available with KVM_CAP_PPC_RTAS */
#define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO, 0xac, struct kvm_rtas_token_args)
+/* Available with KVM_CAP_USER_EXIT */
+#define KVM_USER_EXIT _IOW(KVMIO, 0xad, struct kvm_user_exit)
/* ioctl for vm fd */
#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b34a328fdac1..d7ffe6090520 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2248,15 +2248,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
return -EINVAL;
-#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
/*
* Special cases: vcpu ioctls that are asynchronous to vcpu execution,
* so vcpu_load() would break it.
*/
+#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
#endif
-
+ if (ioctl == KVM_USER_EXIT)
+ return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
r = vcpu_load(vcpu);
if (r)
--
2.5.0
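For illustration, the userspace side of the new kick (a sketch, not part of
the series; vcpu_fd is an open VCPU descriptor whose KVM_RUN may be
executing in another thread):

	struct kvm_user_exit info = {};	/* 'reserved' must be zeroed */

	if (ioctl(vcpu_fd, KVM_USER_EXIT, &info) < 0)
		perror("KVM_USER_EXIT");
	/* The target's KVM_RUN returns with exit_reason == KVM_EXIT_REQUEST
	 * in the mmapped struct kvm_run. */

Compared to the pthread_kill() scheme, no signal handler, sigmask juggling,
or KVM_SET_SIGNAL_MASK setup is needed.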
I find the switch easier to read and modify.
Signed-off-by: Radim Krčmář <[email protected]>
---
v2: new
virt/kvm/kvm_main.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d7ffe6090520..71598554deed 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2252,12 +2252,15 @@ static long kvm_vcpu_ioctl(struct file *filp,
* Special cases: vcpu ioctls that are asynchronous to vcpu execution,
* so vcpu_load() would break it.
*/
+ switch (ioctl) {
#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
- if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
- return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
+ case KVM_S390_INTERRUPT:
+ case KVM_S390_IRQ:
+ case KVM_INTERRUPT:
#endif
- if (ioctl == KVM_USER_EXIT)
+ case KVM_USER_EXIT:
return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
+ }
r = vcpu_load(vcpu);
if (r)
--
2.5.0
On 05/08/2015 18:33, Radim Krčmář wrote:
> Userspace can use KVM_USER_EXIT instead of signal-based exiting to
> userspace. Availability depends on KVM_CAP_USER_EXIT.
> Only x86 is implemented so far.
>
> Signed-off-by: Radim Krčmář <[email protected]>
> ---
> v2:
> * use vcpu ioctl instead of vm one [4/5]
> * shrink kvm_user_exit from 64 to 32 bytes [4/5]
>
> Documentation/virtual/kvm/api.txt | 30 ++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 7 +++++++
> virt/kvm/kvm_main.c | 5 +++--
> 4 files changed, 64 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 3c714d43a717..c5844f0b8e7c 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error
>
> Queues an SMI on the thread's vcpu.
>
> +
> +4.97 KVM_USER_EXIT
> +
> +Capability: KVM_CAP_USER_EXIT
> +Architectures: x86
> +Type: vcpu ioctl
> +Parameters: struct kvm_user_exit (in)
> +Returns: 0 on success,
> + -EFAULT if the parameter couldn't be read,
> + -EINVAL if 'reserved' is not zeroed,
> +
> +struct kvm_user_exit {
> + __u8 reserved[32];
> +};
> +
> +The ioctl is asynchronous to VCPU execution and can be issued from all threads.
> +
> +Make the VCPU exit to userspace as soon as possible. If the VCPU is not running
> +in kernel at the time, it will exit early on the next call to KVM_RUN.
> +If the VCPU was going to exit because of other reasons when KVM_USER_EXIT was
> +issued, it will keep the original exit reason and not exit early on next
> +KVM_RUN.
> +If VCPU exited because of KVM_USER_EXIT, the exit reason is KVM_EXIT_REQUEST.
> +
> +This ioctl has very similar effect (same sans some races on userspace exit) as
> +sending a signal (that is blocked in userspace and set in KVM_SET_SIGNAL_MASK)
> +to the VCPU thread.
Can we just return EINVAL if the parameter is not NULL?
Paolo
> +
> +
> +
> 5. The kvm_run structure
> ------------------------
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3493457ad0a1..27d777eb34e4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2466,6 +2466,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ASSIGN_DEV_IRQ:
> case KVM_CAP_PCI_2_3:
> #endif
> + case KVM_CAP_USER_EXIT:
> r = 1;
> break;
> case KVM_CAP_X86_SMM:
> @@ -3077,6 +3078,20 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> +int kvm_vcpu_ioctl_user_exit(struct kvm_vcpu *vcpu, struct kvm_user_exit *info)
> +{
> + struct kvm_user_exit valid = {};
> + BUILD_BUG_ON(sizeof(valid) != 32);
> +
> + if (memcmp(info, &valid, sizeof(valid)))
> + return -EINVAL;
> +
> + kvm_make_request(KVM_REQ_EXIT, vcpu);
> + kvm_vcpu_kick(vcpu);
> +
> + return 0;
> +}
> +
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -3341,6 +3356,15 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> r = kvm_set_guest_paused(vcpu);
> goto out;
> }
> + case KVM_USER_EXIT: {
> + struct kvm_user_exit info;
> +
> + r = -EFAULT;
> + if (copy_from_user(&info, argp, sizeof(info)))
> + goto out;
> + r = kvm_vcpu_ioctl_user_exit(vcpu, &info);
> + break;
> + }
> default:
> r = -EINVAL;
> }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index d996a7cdb4d2..bc5a1abe9626 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -826,6 +826,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_X86_SMM 117
> #define KVM_CAP_MULTI_ADDRESS_SPACE 118
> #define KVM_CAP_SPLIT_IRQCHIP 119
> +#define KVM_CAP_USER_EXIT 120
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1008,6 +1009,10 @@ struct kvm_device_attr {
> __u64 addr; /* userspace address of attr data */
> };
>
> +struct kvm_user_exit {
> + __u8 reserved[32];
> +};
> +
> #define KVM_DEV_VFIO_GROUP 1
> #define KVM_DEV_VFIO_GROUP_ADD 1
> #define KVM_DEV_VFIO_GROUP_DEL 2
> @@ -1119,6 +1124,8 @@ struct kvm_s390_ucas_mapping {
> #define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO, 0xab, struct kvm_arm_device_addr)
> /* Available with KVM_CAP_PPC_RTAS */
> #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO, 0xac, struct kvm_rtas_token_args)
> +/* Available with KVM_CAP_USER_EXIT */
> +#define KVM_USER_EXIT _IOW(KVMIO, 0xad, struct kvm_user_exit)
>
> /* ioctl for vm fd */
> #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b34a328fdac1..d7ffe6090520 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2248,15 +2248,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
> if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
> return -EINVAL;
>
> -#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
> /*
> * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
> * so vcpu_load() would break it.
> */
> +#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
> if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
> return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> #endif
> -
> + if (ioctl == KVM_USER_EXIT)
> + return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
>
> r = vcpu_load(vcpu);
> if (r)
>
2015-08-05 18:36+0200, Paolo Bonzini:
> On 05/08/2015 18:33, Radim Krčmář wrote:
>> +4.97 KVM_USER_EXIT
>> +
>> +Capability: KVM_CAP_USER_EXIT
>> +Architectures: x86
>> +Type: vcpu ioctl
>> +Parameters: struct kvm_user_exit (in)
>> +Returns: 0 on success,
>> + -EFAULT if the parameter couldn't be read,
>> + -EINVAL if 'reserved' is not zeroed,
>> +
>> +struct kvm_user_exit {
>> + __u8 reserved[32];
>> +};
>
> Can we just return EINVAL if the parameter is not NULL?
It complicates handling if we extend the ioctl, but removes the useless
clearing/copying/checking now ...
The two obvious extensions are flags to skip kvm_make_request() or
kvm_vcpu_kick(), both of dubious use. Another possibility is setting up
conditional exits, but that would be better as a separate control, like
most other sophisticated extensions.
I think that u32 flags would be sufficient -- is casting the 'unsigned
long arg' (data pointer) to a value still an accepted solution?
On 06/08/2015 15:44, Radim Krčmář wrote:
>> > Can we just return EINVAL if the parameter is not NULL?
> It complicates handling if we extend the ioctl, but removes the useless
> clearing/copying/checking now ...
Yes.
> The two obvious extensions are flags to skip kvm_make_request() or
> kvm_vcpu_kick(), both of dubious use.
Skipping kvm_make_request() would make some sense if you can set
vcpu->run->request_interrupt_window asynchronously. So you could do
vcpu->run->request_interrupt_window = 1;
ioctl(vcpu_fd, KVM_USER_EXIT, KVM_USER_EXIT_LAZY);
and only cause a lightweight vmexit if the interrupt window is currently
closed. I haven't thought of any races that could happen, but it looks
like it could work.
Skipping kvm_vcpu_kick() makes much less sense.
> Another possibility is setting up
> conditional exits, but that would be better as a separate control, like
> most other sophisticated extensions.
>
> I think that u32 flags would be sufficient -- is casting the 'unsigned
> long arg' (data pointer) to a value still an accepted solution?
Yeah, that would work for me as well. Also because, for now, you'd
return EINVAL if the unsigned long is not zero, which boils down to
"return EINVAL if the parameter is not NULL". :)
Paolo
2015-08-06 15:52+0200, Paolo Bonzini:
> On 06/08/2015 15:44, Radim Krčmář wrote:
>> The two obvious extensions are flags to skip kvm_make_request() or
>> kvm_vcpu_kick(), both of dubious use.
>
> Skipping kvm_make_request() would make some sense if you can set
> vcpu->run->request_interrupt_window asynchronously. So you could do
>
> vcpu->run->request_interrupt_window = 1;
> ioctl(vcpu_fd, KVM_USER_EXIT, KVM_USER_EXIT_LAZY);
>
> and only cause a lightweight vmexit if the interrupt window is currently
> closed. I haven't thought of any races that could happen, but it looks
> like it could work.
Seems doable, kvm_run should have been better protected :)
> Skipping kvm_vcpu_kick() makes much less sense.
Could save some cycles when queuing an early exit from the VCPU thread.
>> Another possibility is setting up
>> conditional exits, but that would be better as a separate control, like
>> most other sophisticated extensions.
>>
>> I think that u32 flags would be sufficient -- is casting the 'unsigned
>> long arg' (data pointer) to a value still an accepted solution?
>
> Yeah, that would work for me as well. Also because, for now, you'd
> return EINVAL if the unsigned long is not zero, which boils down to
> "return EINVAL if the parameter is not NULL". :)
Exactly, only the ioctl number will change.
On 05/08/2015 18:32, Radim Krčmář wrote:
> We want to have requests abstracted from bit operations.
>
> Signed-off-by: Radim Krčmář <[email protected]>
> ---
kvm_check_request is now somewhat of a misnomer (what is the difference
between test and check?), but still
Acked-by: Christian Borntraeger <[email protected]>
for the new interface. Maybe we can rename kvm_check_request in a separate
patch at some point.
> arch/x86/kvm/vmx.c | 2 +-
> include/linux/kvm_host.h | 7 ++++++-
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 217f66343dc8..17514fe7d2cb 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5879,7 +5879,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> if (intr_window_requested && vmx_interrupt_allowed(vcpu))
> return handle_interrupt_window(&vmx->vcpu);
>
> - if (test_bit(KVM_REQ_EVENT, &vcpu->requests))
> + if (kvm_has_request(KVM_REQ_EVENT, vcpu))
> return 1;
>
> err = emulate_instruction(vcpu, EMULTYPE_NO_REEXECUTE);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 27ccdf91a465..52e388367a26 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1089,9 +1089,14 @@ static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
> set_bit(req, &vcpu->requests);
> }
>
> +static inline bool kvm_has_request(int req, struct kvm_vcpu *vcpu)
> +{
> + return test_bit(req, &vcpu->requests);
> +}
> +
> static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
> {
> - if (test_bit(req, &vcpu->requests)) {
> + if (kvm_has_request(req, vcpu)) {
> clear_bit(req, &vcpu->requests);
> return true;
> } else {
>
On 05/08/2015 18:33, Radim Krčmář wrote:
> I find the switch easier to read and modify.
yes.
> Signed-off-by: Radim Krčmář <[email protected]>
> ---
> v2: new
>
> virt/kvm/kvm_main.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d7ffe6090520..71598554deed 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2252,12 +2252,15 @@ static long kvm_vcpu_ioctl(struct file *filp,
> * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
> * so vcpu_load() would break it.
> */
> + switch (ioctl) {
> #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
> - if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
> - return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> + case KVM_S390_INTERRUPT:
> + case KVM_S390_IRQ:
> + case KVM_INTERRUPT:
While you are at it, you might want to put the KVM_S390* cases within CONFIG_S390 and
KVM_INTERRUPT within CONFIG_PPC || CONFIG_MIPS.
This might speed up the switch statement for s390/ppc/mips a tiny bit. It will add
another ifdef, though. Paolo?
> #endif
> - if (ioctl == KVM_USER_EXIT)
> + case KVM_USER_EXIT:
> return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> + }
>
> r = vcpu_load(vcpu);
> if (r)
>
2015-08-12 22:03+0200, Christian Borntraeger:
> On 05/08/2015 18:33, Radim Krčmář wrote:
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> @@ -2252,12 +2252,15 @@ static long kvm_vcpu_ioctl(struct file *filp,
>> * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
>> * so vcpu_load() would break it.
>> */
>> + switch (ioctl) {
>> #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
>> - if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
>> - return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
>> + case KVM_S390_INTERRUPT:
>> + case KVM_S390_IRQ:
>> + case KVM_INTERRUPT:
>
> While you are at it, you might want to put the KVM_S390* cases within CONFIG_S390 and
> KVM_INTERRUPT within CONFIG_PPC || CONFIG_MIPS.
Sure, thanks.
> This might speed up the switch statement for s390/ppc/mips a tiny bit. It will add
> another ifdef, though. Paolo?
For v3, I will name the decision as an inline function, which should
make the #ifing more acceptable (at the cost of not having ioctls #defs
in the body of kvm_vcpu_ioctl). Something like this,
static inline bool kvm_asynchronous_ioctl(unsigned ioctl)
{
	switch (ioctl) {
#if defined(CONFIG_S390)
	case KVM_S390_INTERRUPT:
	case KVM_S390_IRQ:
#endif
#if defined(CONFIG_PPC) || defined(CONFIG_MIPS)
	case KVM_INTERRUPT:
#endif
	case KVM_USER_EXIT:
		return true;
	}
	return false;
}

[...]

	if (kvm_asynchronous_ioctl(ioctl))
		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
2015-08-12 21:57+0200, Christian Borntraeger:
> kvm_check_request is now somewhat a misnomer (what is the difference between test and check?)
kvm_check_request has always been poetic; it uses two meanings of
check, "examine" and "tick off", at the same time.
We also want something that clears the request, so kvm_drop_request was
my best candidate so far.
> for the new interface. maybe we can rename kvm_check_request in a separate patch somewhen.
I wonder why haven't we copied the naming convention from bit operations
(or if programming would be better if German was its language),
kvm_test_request
kvm_set_request
kvm_clear_request
kvm_test_and_clear_request
The only disadvantage is that
kvm_test_and_clear_request
is longer than
kvm_check_request
123456789
by a whopping 9 characters.
I could live with that.
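For concreteness, the renamed helpers would be direct wrappers over the bit
operations (a sketch, assuming vcpu->requests stays an unsigned long bitmap):

	static inline void kvm_set_request(int req, struct kvm_vcpu *vcpu)
	{
		set_bit(req, &vcpu->requests);
	}

	static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
	{
		return test_bit(req, &vcpu->requests);
	}

	static inline void kvm_clear_request(int req, struct kvm_vcpu *vcpu)
	{
		clear_bit(req, &vcpu->requests);
	}

	static inline bool kvm_test_and_clear_request(int req, struct kvm_vcpu *vcpu)
	{
		return test_and_clear_bit(req, &vcpu->requests);
	}

(Note that test_and_clear_bit() is atomic, unlike the separate
test_bit()+clear_bit() pair in today's kvm_check_request().)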
On 12/08/2015 22:03, Christian Borntraeger wrote:
>> > #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
>> > - if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
>> > - return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
>> > + case KVM_S390_INTERRUPT:
>> > + case KVM_S390_IRQ:
>> > + case KVM_INTERRUPT:
> While you are at it, you might want to put the KVM_S390* cases within CONFIG_S390 and
> KVM_INTERRUPT within CONFIG_PPC || CONFIG_MIPS
>
> This might speed up the switch statement for s390/ppc/mips a tiny bit. It will add
> another ifdef, though. Paolo?
Sure. I wasn't sure of KVM_INTERRUPT's usage on s390.
I'm okay with keeping the switch inline too, but if Radim prefers a
function that's also fine.
Paolo
On 13/08/2015 11:11, Radim Krčmář wrote:
>> > for the new interface. maybe we can rename kvm_check_request in a separate patch somewhen.
> I wonder why haven't we copied the naming convention from bit operations
> (or if programming would be better if German was its language),
>
> kvm_test_request
> kvm_set_request
> kvm_clear_request
> kvm_test_and_clear_request
>
> The only disadvantage is that
> kvm_test_and_clear_request
> is longer than
> kvm_check_request
> 123456789
> by a whopping 9 characters.
>
> I could live with that.
Yes, that would be much better.
Paolo
On 13/08/2015 11:29, Paolo Bonzini wrote:
>
>
> On 13/08/2015 11:11, Radim Krčmář wrote:
>>>> for the new interface. maybe we can rename kvm_check_request in a separate patch somewhen.
>> I wonder why haven't we copied the naming convention from bit operations
>> (or if programming would be better if German was its language),
>>
>> kvm_test_request
>> kvm_set_request
>> kvm_clear_request
>> kvm_test_and_clear_request
>>
>> The only disadvantage is that
>> kvm_test_and_clear_request
>> is longer than
>> kvm_check_request
>> 123456789
>> by a whopping 9 characters.
>>
>> I could live with that.
>
> Yes, that would be much better.
+1
2015-08-13 12:03+0200, Christian Borntraeger:
> On 13/08/2015 11:29, Paolo Bonzini wrote:
>> On 13/08/2015 11:11, Radim Krčmář wrote:
>>>>> for the new interface. maybe we can rename kvm_check_request in a separate patch somewhen.
>>> I wonder why haven't we copied the naming convention from bit operations
| [...]
>>
>> Yes, that would be much better.
>
> +1
I'll send patches later. Hope you won't mind keeping the doomed
kvm_has_request() in v3.
On 08/05/2015 07:33 PM, Radim Krčmář wrote:
> Userspace can use KVM_USER_EXIT instead of signal-based exiting to
> userspace. Availability depends on KVM_CAP_USER_EXIT.
> Only x86 is implemented so far.
>
> Signed-off-by: Radim Krčmář <[email protected]>
> ---
> v2:
> * use vcpu ioctl instead of vm one [4/5]
> * shrink kvm_user_exit from 64 to 32 bytes [4/5]
>
> Documentation/virtual/kvm/api.txt | 30 ++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 7 +++++++
> virt/kvm/kvm_main.c | 5 +++--
> 4 files changed, 64 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 3c714d43a717..c5844f0b8e7c 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error
>
> Queues an SMI on the thread's vcpu.
>
> +
> +4.97 KVM_USER_EXIT
> +
> +Capability: KVM_CAP_USER_EXIT
> +Architectures: x86
> +Type: vcpu ioctl
> +Parameters: struct kvm_user_exit (in)
> +Returns: 0 on success,
> + -EFAULT if the parameter couldn't be read,
> + -EINVAL if 'reserved' is not zeroed,
> +
> +struct kvm_user_exit {
> + __u8 reserved[32];
> +};
> +
> +The ioctl is asynchronous to VCPU execution and can be issued from all threads.
This breaks an invariant of vcpu ioctls, and also forces a cacheline
bounce when we fget() the vcpu fd.
We should really try to avoid this. One option is to have a
KVM_VCPU_MAKE_EXITFD vcpu ioctl, which returns an eventfd that you then
write into. You can make as many exitfds as you like, one for each
waking thread, so they never cause cacheline conflicts.
Edit: I see the invariant was already broken. But the other comment stands.
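Sketch of the suggested usage (KVM_VCPU_MAKE_EXITFD is hypothetical;
nothing in this series defines it):

	/* once, per waking thread: */
	int exitfd = ioctl(vcpu_fd, KVM_VCPU_MAKE_EXITFD, 0);

	/* per kick: a plain eventfd write, no vcpu fd, no fget() */
	uint64_t one = 1;
	write(exitfd, &one, sizeof(one));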
> +Make the VCPU exit to userspace as soon as possible. If the VCPU is not running
> +in kernel at the time, it will exit early on the next call to KVM_RUN.
> +If the VCPU was going to exit because of other reasons when KVM_USER_EXIT was
> +issued, it will keep the original exit reason and not exit early on next
> +KVM_RUN.
> +If VCPU exited because of KVM_USER_EXIT, the exit reason is KVM_EXIT_REQUEST.
> +
> +This ioctl has very similar effect (same sans some races on userspace exit) as
> +sending a signal (that is blocked in userspace and set in KVM_SET_SIGNAL_MASK)
> +to the VCPU thread.
> +
> +
> +
> 5. The kvm_run structure
> ------------------------
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3493457ad0a1..27d777eb34e4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2466,6 +2466,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ASSIGN_DEV_IRQ:
> case KVM_CAP_PCI_2_3:
> #endif
> + case KVM_CAP_USER_EXIT:
> r = 1;
> break;
> case KVM_CAP_X86_SMM:
> @@ -3077,6 +3078,20 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
> return 0;
> }
>
> +int kvm_vcpu_ioctl_user_exit(struct kvm_vcpu *vcpu, struct kvm_user_exit *info)
> +{
> + struct kvm_user_exit valid = {};
> + BUILD_BUG_ON(sizeof(valid) != 32);
> +
> + if (memcmp(info, &valid, sizeof(valid)))
> + return -EINVAL;
> +
> + kvm_make_request(KVM_REQ_EXIT, vcpu);
> + kvm_vcpu_kick(vcpu);
> +
> + return 0;
> +}
> +
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -3341,6 +3356,15 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> r = kvm_set_guest_paused(vcpu);
> goto out;
> }
> + case KVM_USER_EXIT: {
> + struct kvm_user_exit info;
> +
> + r = -EFAULT;
> + if (copy_from_user(&info, argp, sizeof(info)))
> + goto out;
> + r = kvm_vcpu_ioctl_user_exit(vcpu, &info);
> + break;
> + }
> default:
> r = -EINVAL;
> }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index d996a7cdb4d2..bc5a1abe9626 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -826,6 +826,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_X86_SMM 117
> #define KVM_CAP_MULTI_ADDRESS_SPACE 118
> #define KVM_CAP_SPLIT_IRQCHIP 119
> +#define KVM_CAP_USER_EXIT 120
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1008,6 +1009,10 @@ struct kvm_device_attr {
> __u64 addr; /* userspace address of attr data */
> };
>
> +struct kvm_user_exit {
> + __u8 reserved[32];
> +};
> +
> #define KVM_DEV_VFIO_GROUP 1
> #define KVM_DEV_VFIO_GROUP_ADD 1
> #define KVM_DEV_VFIO_GROUP_DEL 2
> @@ -1119,6 +1124,8 @@ struct kvm_s390_ucas_mapping {
> #define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO, 0xab, struct kvm_arm_device_addr)
> /* Available with KVM_CAP_PPC_RTAS */
> #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO, 0xac, struct kvm_rtas_token_args)
> +/* Available with KVM_CAP_USER_EXIT */
> +#define KVM_USER_EXIT _IOW(KVMIO, 0xad, struct kvm_user_exit)
>
> /* ioctl for vm fd */
> #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b34a328fdac1..d7ffe6090520 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2248,15 +2248,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
> if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
> return -EINVAL;
>
> -#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
> /*
> * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
> * so vcpu_load() would break it.
> */
> +#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
> if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_S390_IRQ || ioctl == KVM_INTERRUPT)
> return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> #endif
> -
> + if (ioctl == KVM_USER_EXIT)
> + return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
>
> r = vcpu_load(vcpu);
> if (r)
On 16/08/2015 13:27, Avi Kivity wrote:
> On 08/05/2015 07:33 PM, Radim Krčmář wrote:
>> Userspace can use KVM_USER_EXIT instead of signal-based exiting to
>> userspace. Availability depends on KVM_CAP_USER_EXIT.
>> Only x86 is implemented so far.
>>
>> Signed-off-by: Radim Krčmář <[email protected]>
>> ---
>> v2:
>> * use vcpu ioctl instead of vm one [4/5]
>> * shrink kvm_user_exit from 64 to 32 bytes [4/5]
>>
>> Documentation/virtual/kvm/api.txt | 30 ++++++++++++++++++++++++++++++
>> arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
>> include/uapi/linux/kvm.h | 7 +++++++
>> virt/kvm/kvm_main.c | 5 +++--
>> 4 files changed, 64 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt
>> b/Documentation/virtual/kvm/api.txt
>> index 3c714d43a717..c5844f0b8e7c 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error
>> Queues an SMI on the thread's vcpu.
>> +
>> +4.97 KVM_USER_EXIT
>> +
>> +Capability: KVM_CAP_USER_EXIT
>> +Architectures: x86
>> +Type: vcpu ioctl
>> +Parameters: struct kvm_user_exit (in)
>> +Returns: 0 on success,
>> + -EFAULT if the parameter couldn't be read,
>> + -EINVAL if 'reserved' is not zeroed,
>> +
>> +struct kvm_user_exit {
>> + __u8 reserved[32];
>> +};
>> +
>> +The ioctl is asynchronous to VCPU execution and can be issued from
>> all threads.
>
> This breaks an invariant of vcpu ioctls, and also forces a cacheline
> bounce when we fget() the vcpu fd.
KVM_USER_EXIT in practice should be so rare (at least with in-kernel
LAPIC) that I don't think this matters. KVM_USER_EXIT is relatively
uninteresting, it only exists to provide an alternative to signals that
doesn't require expensive atomics on each and every KVM_RUN. :(
In addition, when you do it you have to transfer information anyway from
the signaling thread to the VCPU thread, which causes cacheline bounces
too. For example in QEMU both the iothread mutex and
cpu->interrupt_request cachelines bounce.
> We should really try to avoid this. One options is to have a
> KVM_VCPU_MAKE_EXITFD vcpu ioctl, which returns an eventfd that you then
> write into. You can make as many exitfds as you like, one for each
> waking thread, so they never cause cacheline conflicts.
> Edit: I see the invariant was already broken.
Yeah, that was a long time ago. This is when it became apparent in
kvm_vcpu_ioctl, but it was broken even before that for these 2-3
special asynchronous vcpu ioctls:
commit 2122ff5eab8faec853e43f6de886e8dc8f31e317
Author: Avi Kivity <[email protected]>
Date: Thu May 13 11:25:04 2010 +0300
KVM: move vcpu locking to dispatcher for generic vcpu ioctls
All vcpu ioctls need to be locked, so instead of locking each one
specifically we lock at the generic dispatcher.
This patch only updates generic ioctls and leaves arch specific
ioctls alone.
Signed-off-by: Avi Kivity <[email protected]>
Paolo
On 08/17/2015 04:15 PM, Paolo Bonzini wrote:
>
> On 16/08/2015 13:27, Avi Kivity wrote:
>> On 08/05/2015 07:33 PM, Radim Krčmář wrote:
>>> Userspace can use KVM_USER_EXIT instead of signal-based exiting to
>>> userspace. Availability depends on KVM_CAP_USER_EXIT.
>>> Only x86 is implemented so far.
>>>
>>> Signed-off-by: Radim Krčmář <[email protected]>
>>> ---
>>> v2:
>>> * use vcpu ioctl instead of vm one [4/5]
>>> * shrink kvm_user_exit from 64 to 32 bytes [4/5]
>>>
>>> Documentation/virtual/kvm/api.txt | 30 ++++++++++++++++++++++++++++++
>>> arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
>>> include/uapi/linux/kvm.h | 7 +++++++
>>> virt/kvm/kvm_main.c | 5 +++--
>>> 4 files changed, 64 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/virtual/kvm/api.txt
>>> b/Documentation/virtual/kvm/api.txt
>>> index 3c714d43a717..c5844f0b8e7c 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -3020,6 +3020,36 @@ Returns: 0 on success, -1 on error
>>> Queues an SMI on the thread's vcpu.
>>> +
>>> +4.97 KVM_USER_EXIT
>>> +
>>> +Capability: KVM_CAP_USER_EXIT
>>> +Architectures: x86
>>> +Type: vcpu ioctl
>>> +Parameters: struct kvm_user_exit (in)
>>> +Returns: 0 on success,
>>> + -EFAULT if the parameter couldn't be read,
>>> + -EINVAL if 'reserved' is not zeroed,
>>> +
>>> +struct kvm_user_exit {
>>> + __u8 reserved[32];
>>> +};
>>> +
>>> +The ioctl is asynchronous to VCPU execution and can be issued from
>>> all threads.
>> This breaks an invariant of vcpu ioctls, and also forces a cacheline
>> bounce when we fget() the vcpu fd.
> KVM_USER_EXIT in practice should be so rare (at least with in-kernel
> LAPIC) that I don't think this matters. KVM_USER_EXIT is relatively
> uninteresting, it only exists to provide an alternative to signals that
> doesn't require expensive atomics on each and every KVM_RUN. :(
Ah, so the idea is to remove the cost of changing the signal mask?
Yes, although it looks like a thread-local operation, it takes a
process-wide lock.
I expect most user wakeups are via irqfd, so indeed the performance of
KVM_USER_EXIT is uninteresting.
On 18/08/2015 11:30, Avi Kivity wrote:
>> KVM_USER_EXIT in practice should be so rare (at least with in-kernel
>> LAPIC) that I don't think this matters. KVM_USER_EXIT is relatively
>> uninteresting, it only exists to provide an alternative to signals that
>> doesn't require expensive atomics on each and every KVM_RUN. :(
>
> Ah, so the idea is to remove the cost of changing the signal mask?
Yes, it's explained in the cover letter.
> Yes, although it looks like a thread-local operation, it takes a
> process-wide lock.
IIRC the lock was only task-wide and uncontended. Problem is, it's on
the node that created the thread rather than the node that is running
it, and inter-node atomics are really, really slow.
For guests spanning >1 host NUMA nodes it's not really practical to
ensure that the thread is created on the right node. Even for guests
that fit into 1 host node, if you rely on AutoNUMA the VCPUs are created
too early for AutoNUMA to have any effect. And newer machines have
frighteningly small nodes (two nodes per socket, so it's something like
7 pCPUs if you don't have hyper-threading enabled). True, the NUMA
penalty within the same socket is not huge, but it still costs a few
thousand clock cycles on vmexit.flat and this feature sweeps it away
completely.
> I expect most user wakeups are via irqfd, so indeed the performance of
> KVM_USER_EXIT is uninteresting.
Yup, either irqfd or KVM_SET_SIGNAL_MSI.
Paolo
On 08/18/2015 10:57 PM, Paolo Bonzini wrote:
>
> On 18/08/2015 11:30, Avi Kivity wrote:
>>> KVM_USER_EXIT in practice should be so rare (at least with in-kernel
>>> LAPIC) that I don't think this matters. KVM_USER_EXIT is relatively
>>> uninteresting, it only exists to provide an alternative to signals that
>>> doesn't require expensive atomics on each and every KVM_RUN. :(
>> Ah, so the idea is to remove the cost of changing the signal mask?
> Yes, it's explained in the cover letter.
>
>> Yes, although it looks like a thread-local operation, it takes a
>> process-wide lock.
> IIRC the lock was only task-wide and uncontended. Problem is, it's on
> the node that created the thread rather than the node that is running
> it, and inter-node atomics are really, really slow.
Cached inter-node atomics are (relatively) fast, but I think it really
is a process-wide lock:
sigprocmask calls:
void __set_current_blocked(const sigset_t *newset)
{
	struct task_struct *tsk = current;

	spin_lock_irq(&tsk->sighand->siglock);
	__set_task_blocked(tsk, newset);
	spin_unlock_irq(&tsk->sighand->siglock);
}

struct sighand_struct {
	atomic_t		count;
	struct k_sigaction	action[_NSIG];
	spinlock_t		siglock;
	wait_queue_head_t	signalfd_wqh;
};
Since sigaction is usually process-wide, I conclude that tsk->sighand
will be too.
>
> For guests spanning >1 host NUMA nodes it's not really practical to
> ensure that the thread is created on the right node. Even for guests
> that fit into 1 host node, if you rely on AutoNUMA the VCPUs are created
> too early for AutoNUMA to have any effect. And newer machines have
> frighteningly small nodes (two nodes per socket, so it's something like
> 7 pCPUs if you don't have hyper-threading enabled). True, the NUMA
> penalty within the same socket is not huge, but it still costs a few
> thousand clock cycles on vmexit.flat and this feature sweeps it away
> completely.
>
>> I expect most user wakeups are via irqfd, so indeed the performance of
>> KVM_USER_EXIT is uninteresting.
> Yup, either irqfd or KVM_SET_SIGNAL_MSI.
>
> Paolo