2022-01-26 10:53:05

by Sean Christopherson

Subject: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

Forcibly leave nested virtualization operation if userspace toggles SMM
state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace
forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
vmxon=false and smm.vmxon=false, but all other nVMX state allocated.

Don't attempt to gracefully handle the transition as (a) most transitions
are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
sufficient information to handle all transitions, e.g. SVM wants access
to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
KVM_SET_NESTED_STATE during state restore as the latter disallows putting
the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
being post-VMXON in SMM if SMM is not active.

Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
in an architecturally impossible state.
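
For illustration, a rough userspace sketch of the problematic toggle described
above (the guest-side VMXON setup is not shown; toggle_smm and vcpu_fd are
placeholder names, not part of this patch):

  #include <linux/kvm.h>
  #include <string.h>
  #include <sys/ioctl.h>

  /*
   * Force the vCPU's SMM flag from userspace and, when leaving SMM, queue
   * an SMI so the next SMM entry goes back through vmx_enter_smm().
   * vcpu_fd is assumed to be an existing vCPU fd whose guest is already
   * post-VMXON (placeholder, setup not shown).
   */
  static int toggle_smm(int vcpu_fd, int smm_on)
  {
          struct kvm_vcpu_events events;

          memset(&events, 0, sizeof(events));
          if (ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events))
                  return -1;

          events.flags = KVM_VCPUEVENT_VALID_SMM;
          events.smi.smm = smm_on;
          events.smi.pending = smm_on ? 0 : 1;

          return ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
  }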

WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
Modules linked in:
CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
Call Trace:
<TASK>
kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
__fput+0x286/0x9f0 fs/file_table.c:311
task_work_run+0xdd/0x1a0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0xb29/0x2a30 kernel/exit.c:806
do_group_exit+0xd2/0x2f0 kernel/exit.c:935
get_signal+0x4b0/0x28c0 kernel/signal.c:2862
arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>

Cc: [email protected]
Reported-by: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---

Peeking at QEMU source, AFAICT QEMU restores nested state before events,
but I don't see how that can possibly work. I assume QEMU does something
where it restores the "run" state first and then does a full restore?
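
For reference, a minimal sketch of the restore ordering point (c) implies,
assuming state previously captured with KVM_GET_VCPU_EVENTS and
KVM_GET_NESTED_STATE (vcpu_fd and the function name are placeholders):

  /*
   * SMM state has to be in place before nested state is restored, because
   * KVM_SET_NESTED_STATE validates against the SMM flags.  vcpu_fd, events
   * and nested are placeholders for state saved on the source side.
   */
  static int restore_virt_state(int vcpu_fd, struct kvm_vcpu_events *events,
                                struct kvm_nested_state *nested)
  {
          if (ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, events))
                  return -1;

          return ioctl(vcpu_fd, KVM_SET_NESTED_STATE, nested);
  }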

arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/svm/nested.c | 9 +++++----
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/svm/svm.h | 2 +-
arch/x86/kvm/vmx/nested.c | 1 +
arch/x86/kvm/x86.c | 4 +++-
6 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 682ad02a4e58..df22b04b11c3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1495,6 +1495,7 @@ struct kvm_x86_ops {
};

struct kvm_x86_nested_ops {
+ void (*leave_nested)(struct kvm_vcpu *vcpu);
int (*check_events)(struct kvm_vcpu *vcpu);
bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
void (*triple_fault)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index cf206855ebf0..1218b5a342fc 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -983,9 +983,9 @@ void svm_free_nested(struct vcpu_svm *svm)
/*
* Forcibly leave nested mode in order to be able to reset the VCPU later on.
*/
-void svm_leave_nested(struct vcpu_svm *svm)
+void svm_leave_nested(struct kvm_vcpu *vcpu)
{
- struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct vcpu_svm *svm = to_svm(vcpu);

if (is_guest_mode(vcpu)) {
svm->nested.nested_run_pending = 0;
@@ -1411,7 +1411,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
return -EINVAL;

if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) {
- svm_leave_nested(svm);
+ svm_leave_nested(vcpu);
svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET));
return 0;
}
@@ -1478,7 +1478,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
*/

if (is_guest_mode(vcpu))
- svm_leave_nested(svm);
+ svm_leave_nested(vcpu);
else
svm->nested.vmcb02.ptr->save = svm->vmcb01.ptr->save;

@@ -1532,6 +1532,7 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
}

struct kvm_x86_nested_ops svm_nested_ops = {
+ .leave_nested = svm_leave_nested,
.check_events = svm_check_nested_events,
.triple_fault = nested_svm_triple_fault,
.get_nested_state_pages = svm_get_nested_state_pages,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6d31d357a83b..78123ed3906f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -290,7 +290,7 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)

if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
if (!(efer & EFER_SVME)) {
- svm_leave_nested(svm);
+ svm_leave_nested(vcpu);
svm_set_gif(svm, true);
/* #GP intercept is still needed for vmware backdoor */
if (!enable_vmware_backdoor)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 47ef8f4a9358..c55d9936bb8b 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -525,7 +525,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)

int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
-void svm_leave_nested(struct vcpu_svm *svm);
+void svm_leave_nested(struct kvm_vcpu *vcpu);
void svm_free_nested(struct vcpu_svm *svm);
int svm_allocate_nested(struct vcpu_svm *svm);
int nested_svm_vmrun(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f235f77cbc03..7eebfdf7204f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -6771,6 +6771,7 @@ __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *))
}

struct kvm_x86_nested_ops vmx_nested_ops = {
+ .leave_nested = vmx_leave_nested,
.check_events = vmx_check_nested_events,
.hv_timer_pending = nested_vmx_preemption_timer_pending,
.triple_fault = nested_vmx_triple_fault,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 55518b7d3b96..22040c682d4a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4860,8 +4860,10 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
vcpu->arch.apic->sipi_vector = events->sipi_vector;

if (events->flags & KVM_VCPUEVENT_VALID_SMM) {
- if (!!(vcpu->arch.hflags & HF_SMM_MASK) != events->smi.smm)
+ if (!!(vcpu->arch.hflags & HF_SMM_MASK) != events->smi.smm) {
+ kvm_x86_ops.nested_ops->leave_nested(vcpu);
kvm_smm_changed(vcpu, events->smi.smm);
+ }

vcpu->arch.smi_pending = events->smi.pending;


base-commit: e2e83a73d7ce66f62c7830a85619542ef59c90e4
--
2.35.0.rc0.227.g00780c9af4-goog


2022-01-26 20:33:19

by Maxim Levitsky

Subject: Re: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

On Tue, 2022-01-25 at 22:03 +0000, Sean Christopherson wrote:
> Forcibly leave nested virtualization operation if userspace toggles SMM
> state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace
> forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
> vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
> vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
>
> Don't attempt to gracefully handle the transition as (a) most transitions
> are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
> sufficient information to handle all transitions, e.g. SVM wants access
> to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
> KVM_SET_NESTED_STATE during state restore as the latter disallows putting
> the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
> being post-VMXON in SMM if SMM is not active.
>
> Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
> due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
> just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
> in an architecturally impossible state.
>
> WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
> WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
> Modules linked in:
> CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
> RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
> Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
> Call Trace:
> <TASK>
> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
> kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
> kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
> kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
> kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
> kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
> kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
> kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
> __fput+0x286/0x9f0 fs/file_table.c:311
> task_work_run+0xdd/0x1a0 kernel/task_work.c:164
> exit_task_work include/linux/task_work.h:32 [inline]
> do_exit+0xb29/0x2a30 kernel/exit.c:806
> do_group_exit+0xd2/0x2f0 kernel/exit.c:935
> get_signal+0x4b0/0x28c0 kernel/signal.c:2862
> arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
> handle_signal_work kernel/entry/common.c:148 [inline]
> exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
> exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
> __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
> syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
> do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
> entry_SYSCALL_64_after_hwframe+0x44/0xae
> </TASK>
>
> Cc: [email protected]
> Reported-by: [email protected]
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
>
> Peeking at QEMU source, AFAICT QEMU restores nested state before events,
> but I don't see how that can possibly work. I assume QEMU does something
> where it restores the "run" state first and then does a full restore?

Well, according to my testing, nested migration with SMM *is* still quite broken
(on both SVM and VMX), resulting in various issues up to an L1 crash.

When I last tackled SMM, I fixed most of the issues that happen just when L2 is
running and I inject a flood of SMIs into L1 - even that was crashing things all
around, so that might well be the reason for this.

I'll get back to it soon.

Best regards,
Maxim Levitsky



2022-01-26 21:08:38

by Paolo Bonzini

Subject: Re: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

On 1/25/22 23:03, Sean Christopherson wrote:
> Forcibly leave nested virtualization operation if userspace toggles SMM
> state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace
> forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
> vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
> vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
>
> Don't attempt to gracefully handle the transition as (a) most transitions
> are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
> sufficient information to handle all transitions, e.g. SVM wants access
> to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
> KVM_SET_NESTED_STATE during state restore as the latter disallows putting
> the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
> being post-VMXON in SMM if SMM is not active.
>
> Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
> due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
> just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
> in an architecturally impossible state.
>
> WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
> WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
> Modules linked in:
> CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
> RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
> Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
> Call Trace:
> <TASK>
> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
> kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
> kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
> kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
> kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
> kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
> kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
> kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
> __fput+0x286/0x9f0 fs/file_table.c:311
> task_work_run+0xdd/0x1a0 kernel/task_work.c:164
> exit_task_work include/linux/task_work.h:32 [inline]
> do_exit+0xb29/0x2a30 kernel/exit.c:806
> do_group_exit+0xd2/0x2f0 kernel/exit.c:935
> get_signal+0x4b0/0x28c0 kernel/signal.c:2862
> arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
> handle_signal_work kernel/entry/common.c:148 [inline]
> exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
> exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
> __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
> syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
> do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
> entry_SYSCALL_64_after_hwframe+0x44/0xae
> </TASK>
>
> Cc: [email protected]
> Reported-by: [email protected]
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
>
> Peeking at QEMU source, AFAICT QEMU restores nested state before events,
> but I don't see how that can possibly work. I assume QEMU does something
> where it restores the "run" state first and then does a full restore?
>
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/svm/nested.c | 9 +++++----
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/svm/svm.h | 2 +-
> arch/x86/kvm/vmx/nested.c | 1 +
> arch/x86/kvm/x86.c | 4 +++-
> 6 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 682ad02a4e58..df22b04b11c3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1495,6 +1495,7 @@ struct kvm_x86_ops {
> };
>
> struct kvm_x86_nested_ops {
> + void (*leave_nested)(struct kvm_vcpu *vcpu);
> int (*check_events)(struct kvm_vcpu *vcpu);
> bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
> void (*triple_fault)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index cf206855ebf0..1218b5a342fc 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -983,9 +983,9 @@ void svm_free_nested(struct vcpu_svm *svm)
> /*
> * Forcibly leave nested mode in order to be able to reset the VCPU later on.
> */
> -void svm_leave_nested(struct vcpu_svm *svm)
> +void svm_leave_nested(struct kvm_vcpu *vcpu)
> {
> - struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct vcpu_svm *svm = to_svm(vcpu);
>
> if (is_guest_mode(vcpu)) {
> svm->nested.nested_run_pending = 0;
> @@ -1411,7 +1411,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> return -EINVAL;
>
> if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) {
> - svm_leave_nested(svm);
> + svm_leave_nested(vcpu);
> svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET));
> return 0;
> }
> @@ -1478,7 +1478,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> */
>
> if (is_guest_mode(vcpu))
> - svm_leave_nested(svm);
> + svm_leave_nested(vcpu);
> else
> svm->nested.vmcb02.ptr->save = svm->vmcb01.ptr->save;
>
> @@ -1532,6 +1532,7 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
> }
>
> struct kvm_x86_nested_ops svm_nested_ops = {
> + .leave_nested = svm_leave_nested,
> .check_events = svm_check_nested_events,
> .triple_fault = nested_svm_triple_fault,
> .get_nested_state_pages = svm_get_nested_state_pages,
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 6d31d357a83b..78123ed3906f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -290,7 +290,7 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
>
> if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
> if (!(efer & EFER_SVME)) {
> - svm_leave_nested(svm);
> + svm_leave_nested(vcpu);
> svm_set_gif(svm, true);
> /* #GP intercept is still needed for vmware backdoor */
> if (!enable_vmware_backdoor)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 47ef8f4a9358..c55d9936bb8b 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -525,7 +525,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
>
> int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
> u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
> -void svm_leave_nested(struct vcpu_svm *svm);
> +void svm_leave_nested(struct kvm_vcpu *vcpu);
> void svm_free_nested(struct vcpu_svm *svm);
> int svm_allocate_nested(struct vcpu_svm *svm);
> int nested_svm_vmrun(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index f235f77cbc03..7eebfdf7204f 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -6771,6 +6771,7 @@ __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *))
> }
>
> struct kvm_x86_nested_ops vmx_nested_ops = {
> + .leave_nested = vmx_leave_nested,
> .check_events = vmx_check_nested_events,
> .hv_timer_pending = nested_vmx_preemption_timer_pending,
> .triple_fault = nested_vmx_triple_fault,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 55518b7d3b96..22040c682d4a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4860,8 +4860,10 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
> vcpu->arch.apic->sipi_vector = events->sipi_vector;
>
> if (events->flags & KVM_VCPUEVENT_VALID_SMM) {
> - if (!!(vcpu->arch.hflags & HF_SMM_MASK) != events->smi.smm)
> + if (!!(vcpu->arch.hflags & HF_SMM_MASK) != events->smi.smm) {
> + kvm_x86_ops.nested_ops->leave_nested(vcpu);
> kvm_smm_changed(vcpu, events->smi.smm);
> + }
>
> vcpu->arch.smi_pending = events->smi.pending;
>
>
> base-commit: e2e83a73d7ce66f62c7830a85619542ef59c90e4

Queued, thanks.

Paolo

2022-01-26 22:15:12

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

On Wed, Jan 26, 2022, Maxim Levitsky wrote:
> On Tue, 2022-01-25 at 22:03 +0000, Sean Christopherson wrote:
> > Peeking at QEMU source, AFAICT QEMU restores nested state before events,
> > but I don't see how that can possibly work. I assume QEMU does something
> > where it restores the "run" state first and then does a full restore?
>
> Well, according to my testing, nested migration with SMM *is* still quite broken
> (on both SVM and VMX), resulting in various issues up to an L1 crash.
>
> When I last tackled SMM, I fixed most of the issues that happen just when L2 is
> running and I inject a flood of SMIs into L1 - even that was crashing things all
> around, so that might well be the reason for this.

Heh, that would certainly explain why QEMU's code looks broken. Thanks!

2022-02-18 00:04:51

by Tadeusz Struk

Subject: Re: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

On 1/25/22 14:03, Sean Christopherson wrote:
> Forcibly leave nested virtualization operation if userspace toggles SMM
> state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace
> forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
> vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
> vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
>
> Don't attempt to gracefully handle the transition as (a) most transitions
> are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
> sufficient information to handle all transitions, e.g. SVM wants access
> to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
> KVM_SET_NESTED_STATE during state restore as the latter disallows putting
> the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
> being post-VMXON in SMM if SMM is not active.
>
> Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
> due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
> just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
> in an architecturally impossible state.
>
> WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
> WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
> Modules linked in:
> CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
> RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
> Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
> Call Trace:
> <TASK>
> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
> kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
> kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
> kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
> kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
> kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
> kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
> kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
> __fput+0x286/0x9f0 fs/file_table.c:311
> task_work_run+0xdd/0x1a0 kernel/task_work.c:164
> exit_task_work include/linux/task_work.h:32 [inline]
> do_exit+0xb29/0x2a30 kernel/exit.c:806
> do_group_exit+0xd2/0x2f0 kernel/exit.c:935
> get_signal+0x4b0/0x28c0 kernel/signal.c:2862
> arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
> handle_signal_work kernel/entry/common.c:148 [inline]
> exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
> exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
> __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
> syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
> do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
> entry_SYSCALL_64_after_hwframe+0x44/0xae
> </TASK>
>
> Cc: [email protected]
> Reported-by: [email protected]
> Signed-off-by: Sean Christopherson <[email protected]>

Sean,
I can reliably reproduce my original issue [1] that this was supposed to fix
on 5.17-rc4, with the same reproducer [2]. Here is a screen dump [3].
Maybe we do still need my patch; it fixed the issue.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://syzkaller.appspot.com/text?tag=ReproC&x=173085bdb00000
[3] https://termbin.com/fkm8f

--
Thanks,
Tadeusz

2022-02-18 17:51:23

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

On Thu, Feb 17, 2022, Tadeusz Struk wrote:
> On 1/25/22 14:03, Sean Christopherson wrote:
> > Forcibly leave nested virtualization operation if userspace toggles SMM
> > state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace
> > forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
> > vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
> > vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
> >
> > Don't attempt to gracefully handle the transition as (a) most transitions
> > are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
> > sufficient information to handle all transitions, e.g. SVM wants access
> > to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
> > KVM_SET_NESTED_STATE during state restore as the latter disallows putting
> > the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
> > being post-VMXON in SMM if SMM is not active.
> >
> > Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
> > due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
> > just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
> > in an architecturally impossible state.
> >
> > WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
> > WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
> > Modules linked in:
> > CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
> > RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
> > Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
> > Call Trace:
> > <TASK>
> > kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
> > kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
> > kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
> > kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
> > kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
> > kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
> > kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
> > kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
> > __fput+0x286/0x9f0 fs/file_table.c:311
> > task_work_run+0xdd/0x1a0 kernel/task_work.c:164
> > exit_task_work include/linux/task_work.h:32 [inline]
> > do_exit+0xb29/0x2a30 kernel/exit.c:806
> > do_group_exit+0xd2/0x2f0 kernel/exit.c:935
> > get_signal+0x4b0/0x28c0 kernel/signal.c:2862
> > arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
> > handle_signal_work kernel/entry/common.c:148 [inline]
> > exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
> > exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
> > __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
> > syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
> > do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
> > entry_SYSCALL_64_after_hwframe+0x44/0xae
> > </TASK>
> >
> > Cc: [email protected]
> > Reported-by: [email protected]
> > Signed-off-by: Sean Christopherson <[email protected]>
>
> Sean,
> I can reliably reproduce my original issue [1] that this was supposed to fix
> on 5.17-rc4, with the same reproducer [2]. Here is a screen dump [3].
> Maybe we do still need my patch; it fixed the issue.

This SMM-specific patch fixes something different; the bug that you are still
hitting is the FNAME(cmpxchg_gpte) mess. The uaccess CMPXCHG series[*] that
properly fixes that issue hasn't been merged yet.

==================================================================
BUG: KASAN: use-after-free in ept_cmpxchg_gpte.constprop.0+0x3c3/0x590
Write of size 8 at addr ffff888010000000 by task repro/5633

[*] https://lore.kernel.org/all/[email protected]

>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://syzkaller.appspot.com/text?tag=ReproC&x=173085bdb00000
> [3] https://termbin.com/fkm8f
>
> --
> Thanks,
> Tadeusz

2022-02-18 20:01:57

by Paolo Bonzini

Subject: Re: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

On 2/18/22 18:22, Tadeusz Struk wrote:
> On 2/18/22 08:58, Sean Christopherson wrote:
>> This SMM-specific patch fixes something different, the bug that you
>> are still
>> hitting is the FNAME(cmpxchg_gpte) mess.  The uaccess CMPXCHG
>> series[*] that
>> properly fixes that issue hasn't been merged yet.
>>
>>    ==================================================================
>>    BUG: KASAN: use-after-free in ept_cmpxchg_gpte.constprop.0+0x3c3/0x590
>>    Write of size 8 at addr ffff888010000000 by task repro/5633
>>
>> [*]https://lore.kernel.org/all/[email protected]
>>
>
> Ok, that's good. I will keep an eye on it and give it a try then.

I'll poke PeterZ for a review next week.

Paolo

2022-02-19 00:48:10

by Tadeusz Struk

Subject: Re: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

On 2/18/22 08:58, Sean Christopherson wrote:
> This SMM-specific patch fixes something different, the bug that you are still
> hitting is the FNAME(cmpxchg_gpte) mess. The uaccess CMPXCHG series[*] that
> properly fixes that issue hasn't been merged yet.
>
> ==================================================================
> BUG: KASAN: use-after-free in ept_cmpxchg_gpte.constprop.0+0x3c3/0x590
> Write of size 8 at addr ffff888010000000 by task repro/5633
>
> [*]https://lore.kernel.org/all/[email protected]
>

Ok, that's good. I will keep an eye on it and give it a try then.
--
Thanks,
Tadeusz

2022-02-24 19:46:57

by Tadeusz Struk

Subject: Re: [PATCH] KVM: x86: Forcibly leave nested virt when SMM state is toggled

On 2/18/22 10:14, Paolo Bonzini wrote:
> On 2/18/22 18:22, Tadeusz Struk wrote:
>> On 2/18/22 08:58, Sean Christopherson wrote:
>>> This SMM-specific patch fixes something different, the bug that you are still
>>> hitting is the FNAME(cmpxchg_gpte) mess.  The uaccess CMPXCHG series[*] that
>>> properly fixes that issue hasn't been merged yet.
>>>
>>>    ==================================================================
>>>    BUG: KASAN: use-after-free in ept_cmpxchg_gpte.constprop.0+0x3c3/0x590
>>>    Write of size 8 at addr ffff888010000000 by task repro/5633
>>>
>>> [*]https://lore.kernel.org/all/[email protected]
>>>
>>
>> Ok, that's good. I will keep an eye on it and give it a try then.
>
> I'll poke PeterZ for a review next week.
>

Paolo, do you know if PeterZ has had a chance to look at the uaccess patches yet?

--
Thanks,
Tadeusz