From: Thomas Prescher <[email protected]>

When a vCPU is interrupted by a signal while running a nested guest,
KVM will exit to userspace with L2 state. However, userspace has no
way to know whether it sees L1 or L2 state (besides calling
KVM_GET_STATS_FD, which does not have a stable ABI).

This causes multiple problems:

The simplest one is L2 state corruption when userspace marks the sregs
as dirty. See this mailing list thread [1] for a complete discussion.

Another problem is that if userspace decides to continue by emulating
instructions, it will unknowingly emulate with L2 state as if L1
doesn't exist, which can be considered a weird guest escape.

This patch introduces a new flag KVM_RUN_X86_GUEST_MODE in the kvm_run
data structure, which is set when the vCPU exited while running a
nested guest. Userspace can then handle this situation.

To see whether this functionality is available, this patch also
introduces a new capability KVM_CAP_X86_GUEST_MODE.

[1] https://lore.kernel.org/kvm/[email protected]/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55

Signed-off-by: Thomas Prescher <[email protected]>
Signed-off-by: Julian Stecklina <[email protected]>
---
Documentation/virt/kvm/api.rst | 17 +++++++++++++++++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/x86.c | 3 +++
include/uapi/linux/kvm.h | 1 +
4 files changed, 22 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..7748c3eb98e0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6419,6 +6419,9 @@ affect the device's behavior. Current defined flags::
#define KVM_RUN_X86_SMM (1 << 0)
/* x86, set if bus lock detected in VM */
#define KVM_RUN_BUS_LOCK (1 << 1)
+ /* x86, set if the VCPU exited from a nested (L2) guest */
+ #define KVM_RUN_X86_GUEST_MODE (1 << 2)
+
/* arm64, set for KVM_EXIT_DEBUG */
#define KVM_DEBUG_ARCH_HSR_HIGH_VALID (1 << 0)
@@ -8063,6 +8066,20 @@ error/annotated fault.
See KVM_EXIT_MEMORY_FAULT for more information.
+7.34 KVM_CAP_X86_GUEST_MODE
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will update the
+KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the
+vCPU was executing nested guest code when it exited.
+
+KVM exits with the register state of either the L1 or L2 guest
+depending on which executed at the time of an exit. Userspace must
+take care to differentiate between these cases.
+
8. Other capabilities.
======================
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index ef11aa4cab42..ff4ed82a2d06 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -106,6 +106,7 @@ struct kvm_ioapic_state {
#define KVM_RUN_X86_SMM (1 << 0)
#define KVM_RUN_X86_BUS_LOCK (1 << 1)
+#define KVM_RUN_X86_GUEST_MODE (1 << 2)
/* for KVM_GET_REGS and KVM_SET_REGS */
struct kvm_regs {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 91478b769af0..64f2cba9345e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4714,6 +4714,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_MEMORY_FAULT_INFO:
+	case KVM_CAP_X86_GUEST_MODE:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -10200,6 +10201,8 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu)
 	if (is_smm(vcpu))
 		kvm_run->flags |= KVM_RUN_X86_SMM;
+	if (is_guest_mode(vcpu))
+		kvm_run->flags |= KVM_RUN_X86_GUEST_MODE;
 }
static void update_cr8_intercept(struct kvm_vcpu *vcpu)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..ccb12f6a656d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -917,6 +917,7 @@ struct kvm_enable_cap {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_X86_GUEST_MODE 236
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.42.0
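For illustration only, here is a rough sketch of how a VMM might consume
the new flag after KVM_RUN returns. The flag values are copied from the
patch above; the enum and the two helper names are hypothetical and not
part of any existing VMM or of the KVM UAPI. Real code would read the
value from kvm_run->flags instead of taking it as a parameter.

```c
#include <stdint.h>

/* Flag bits as added/defined in arch/x86/include/uapi/asm/kvm.h above. */
#ifndef KVM_RUN_X86_SMM
#define KVM_RUN_X86_SMM        (1 << 0)
#endif
#ifndef KVM_RUN_X86_BUS_LOCK
#define KVM_RUN_X86_BUS_LOCK   (1 << 1)
#endif
#ifndef KVM_RUN_X86_GUEST_MODE
#define KVM_RUN_X86_GUEST_MODE (1 << 2)
#endif

/* Hypothetical: which guest level the exposed register state belongs to. */
enum vcpu_state_owner {
	VCPU_STATE_L1,
	VCPU_STATE_L2,
};

/* Classify the register state seen by userspace from kvm_run.flags. */
static inline enum vcpu_state_owner classify_exit_state(uint16_t run_flags)
{
	if (run_flags & KVM_RUN_X86_GUEST_MODE)
		return VCPU_STATE_L2;
	return VCPU_STATE_L1;
}

/*
 * Hypothetical policy: only emulate instructions or mark sregs dirty
 * when the state belongs to L1. With L2 state, doing either would hit
 * the problems described in the commit message.
 */
static inline int may_treat_state_as_l1(uint16_t run_flags)
{
	return classify_exit_state(run_flags) == VCPU_STATE_L1;
}
```

A VMM following this sketch would call may_treat_state_as_l1() after
every KVM_RUN and, when it returns 0, leave the registers untouched and
re-enter the guest. What to do in that case is a VMM policy decision;
the patch only makes the situation detectable.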
Hey Sean,
does this patch go in the right direction?
Julian
On Wed, May 15, 2024, Julian Stecklina wrote:
> Hey Sean,
>
> does this patch go in the right direction?
At a glance, yes. We're in a "quiet period" until 6.10-rc1, so it'll be a few
weeks before I take a closer look at this (or really anything that's destined
for 6.11 or later).
On Wed, 08 May 2024 15:25:01 +0200, Julian Stecklina wrote:
> When a vCPU is interrupted by a signal while running a nested guest,
> KVM will exit to userspace with L2 state. However, userspace has no
> way to know whether it sees L1 or L2 state (besides calling
> KVM_GET_STATS_FD, which does not have a stable ABI).
>
> This causes multiple problems:
>
> [...]
Applied to kvm-x86 misc. Note, the capability got number 237, as 236 was
claimed by KVM_CAP_X86_APIC_BUS_CYCLES_NS. The number might also change again,
e.g. if a different arch adds a capability and x86 loses the race.
Thanks!
[1/1] KVM: x86: add KVM_RUN_X86_GUEST_MODE kvm_run flag
https://github.com/kvm-x86/linux/commit/85542adb65ec
--
https://github.com/kvm-x86/linux/tree/next
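
Since the capability number moved while the patch was being applied,
userspace is probably best served by relying on the symbolic name from
the installed UAPI header rather than a literal number. A minimal
sketch, assuming a Linux build environment with <linux/kvm.h>; the
helper name is hypothetical, and the #ifdef only covers headers that
predate the capability:

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Returns 1 if KVM reports KVM_CAP_X86_GUEST_MODE on the given KVM
 * file descriptor, 0 otherwise (including on ioctl failure or when the
 * installed headers do not know the capability).
 */
static inline int kvm_reports_guest_mode_flag(int kvm_fd)
{
#ifdef KVM_CAP_X86_GUEST_MODE
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_X86_GUEST_MODE) > 0;
#else
	(void)kvm_fd;
	return 0;
#endif
}
```

If the helper returns 0, the KVM_RUN_X86_GUEST_MODE bit in
kvm_run.flags carries no information and a VMM would have to fall back
to whatever it did before.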