2023-12-22 16:46:34

by Wilczynski, Michal

Subject: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction

Syzkaller found a warning triggered in nested_vmx_vmexit().
vmx->nested.nested_run_pending is non-zero, even though we're in
nested_vmx_vmexit(). Generally, trying to cancel a pending entry is
considered a bug. However, in this particular scenario, the kernel
behavior seems correct.

Syzkaller scenario:
1) Set up vCPUs
2) Run some code with KVM_RUN in L2 as a nested guest
3) Return from KVM_RUN
4) Inject KVM_SMI into the VCPU
5) Change the EFER register with KVM_SET_SREGS to value 0x2501
6) Run some code on the VCPU using KVM_RUN
7) Observe following behavior:

kvm_smm_transition: vcpu 0: entering SMM, smbase 0x30000
kvm_entry: vcpu 0, rip 0x8000
kvm_entry: vcpu 0, rip 0x8000
kvm_entry: vcpu 0, rip 0x8002
kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x30000
kvm_nested_vmenter: rip: 0x0000000000008002 vmcs: 0x0000000000007000
nested_rip: 0x0000000000000000 int_ctl: 0x00000000
event_inj: 0x00000000 nested_ept=n guest
cr3: 0x0000000000002000
kvm_nested_vmexit_inject: reason: TRIPLE_FAULT ext_inf1: 0x0000000000000000
ext_inf2: 0x0000000000000000 ext_int: 0x00000000
ext_int_err: 0x00000000

What happened here is an SMI was injected immediately and the handler was
called at address 0x8000; all is good. Later, an RSM instruction is
executed in the emulator to return to the nested VM. em_rsm() is called,
which leads to emulator_leave_smm(). Part of this function calls the VMX/SVM
callback, in this case vmx_leave_smm(). It attempts to set up a pending
reentry to the guest VM by calling nested_vmx_enter_non_root_mode() and sets
vmx->nested.nested_run_pending to one. Unfortunately, later in
emulator_leave_smm(), rsm_load_state_64() fails while trying to write the
invalid EFER value to the MSR. This results in em_rsm() calling the
triple_fault callback. At this point it's clear that KVM should perform the
vmexit, but vmx->nested.nested_run_pending is left set to 1. To fix this,
reset the vmx->nested.nested_run_pending flag in the triple_fault handler.

TL;DR (courtesy of Yuan Yao)
Clear nested_run_pending in case of RSM failure on return from L2 SMM.
The pending VMENTRY to L2 should be cancelled, because such a failure leads
to a triple fault, which should be injected into L1.

Possible alternative approach:
While the proposed approach works, it would arguably be simpler and more
readable to cancel nested_run_pending in em_rsm() instead. This would,
however, require introducing a new callback, e.g. post_leave_smm(), that
would cancel nested_run_pending in case of a failure to resume from SMM.

Additionally, while the proposed code fixes a VMX-specific issue, SVM might
suffer from a similar problem, as it also uses its own nested_run_pending
variable.

Reported-by: Zheyu Ma <[email protected]>
Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com
Signed-off-by: Michal Wilczynski <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c5ec0ef51ff7..44432e19eea6 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,

static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+
kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
+ /* In case of a triple fault, cancel the nested reentry. This may occur
+ * when the RSM instruction fails while attempting to restore the state
+ * from SMRAM.
+ */
+ vmx->nested.nested_run_pending = 0;
+
nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
}

--
2.41.0



2024-01-02 20:06:32

by Sean Christopherson

Subject: Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction

+Maxim

On Fri, Dec 22, 2023, Michal Wilczynski wrote:
> Syzkaller found a warning triggered in nested_vmx_vmexit().
> vmx->nested.nested_run_pending is non-zero, even though we're in
> nested_vmx_vmexit(). Generally, trying to cancel a pending entry is
> considered a bug. However in this particular scenario, the kernel
> behavior seems correct.
>
> Syzkaller scenario:
> 1) Set up VCPU's
> 2) Run some code with KVM_RUN in L2 as a nested guest
> 3) Return from KVM_RUN
> 4) Inject KVM_SMI into the VCPU
> 5) Change the EFER register with KVM_SET_SREGS to value 0x2501
> 6) Run some code on the VCPU using KVM_RUN
> 7) Observe following behavior:
>
> kvm_smm_transition: vcpu 0: entering SMM, smbase 0x30000
> kvm_entry: vcpu 0, rip 0x8000
> kvm_entry: vcpu 0, rip 0x8000
> kvm_entry: vcpu 0, rip 0x8002
> kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x30000
> kvm_nested_vmenter: rip: 0x0000000000008002 vmcs: 0x0000000000007000
> nested_rip: 0x0000000000000000 int_ctl: 0x00000000
> event_inj: 0x00000000 nested_ept=n guest
> cr3: 0x0000000000002000
> kvm_nested_vmexit_inject: reason: TRIPLE_FAULT ext_inf1: 0x0000000000000000
> ext_inf2: 0x0000000000000000 ext_int: 0x00000000
> ext_int_err: 0x00000000
>
> What happened here is an SMI was injected immediately and the handler was
> called at address 0x8000; all is good. Later, an RSM instruction is
> executed in an emulator to return to the nested VM. em_rsm() is called,
> which leads to emulator_leave_smm(). A part of this function calls VMX/SVM
> callback, in this case vmx_leave_smm(). It attempts to set up a pending
> reentry to guest VM by calling nested_vmx_enter_non_root_mode() and sets
> vmx->nested.nested_run_pending to one. Unfortunately, later in
> emulator_leave_smm(), rsm_load_state_64() fails to write invalid EFER to
> the MSR. This results in em_rsm() calling triple_fault callback. At this
> point it's clear that the KVM should call the vmexit, but
> vmx->nested.nested_run_pending is left set to 1. To fix this reset the
> vmx->nested.nested_run_pending flag in triple_fault handler.
>
> TL;DR (courtesy of Yuan Yao)
> Clear nested_run_pending in case of RSM failure on return from L2 SMM.

KVM doesn't emulate SMM for L2. On an injected SMI, KVM forces a synthetic nested
VM-Exit to get from L2 to L1, and then emulates SMM in the context of L1.

> The pending VMENTRY to L2 should be cancelled due to such failure leads
> to triple fault which should be injected to L1.
>
> Possible alternative approach:
> While the proposed approach works, the concern is that it would be
> simpler, and more readable to cancel the nested_run_pending in em_rsm().
> This would, however, require introducing new callback e.g,
> post_leave_smm(), that would cancel nested_run_pending in case of a
> failure to resume from SMM.
>
> Additionally, while the proposed code fixes VMX specific issue, SVM also
> might suffer from similar problem as it also uses it's own
> nested_run_pending variable.
>
> Reported-by: Zheyu Ma <[email protected]>
> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com

Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM")

> Signed-off-by: Michal Wilczynski <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c5ec0ef51ff7..44432e19eea6 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>
> static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
> {
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> +
> + /* In case of a triple fault, cancel the nested reentry. This may occur

/*
* Multi-line comments should look like this. Blah blah blab blah blah
* blah blah blah blah.
*/

> + * when the RSM instruction fails while attempting to restore the state
> + * from SMRAM.
> + */
> + vmx->nested.nested_run_pending = 0;

Argh. KVM's handling of SMIs while L2 is active is complete garbage. As explained
by the comment in vmx_enter_smm(), the L2<->SMM transitions should have a completely
custom flow and not piggyback/usurp nested VM-Exit/VM-Entry.

/*
* TODO: Implement custom flows for forcing the vCPU out/in of L2 on
* SMI and RSM. Using the common VM-Exit + VM-Enter routines is wrong
* SMI and RSM only modify state that is saved and restored via SMRAM.
* E.g. most MSRs are left untouched, but many are modified by VM-Exit
* and VM-Enter, and thus L2's values may be corrupted on SMI+RSM.
*/

The Fixes: commit above added a hack on top of the hack. But it's not entirely
clear from the changelog exactly what was being fixed.

    While RSM induced VM entries are not full VM entries,
    they still need to be followed by actual VM entry to complete it,
    unlike setting the nested state.

    This patch fixes boot of hyperv and SMM enabled
    windows VM running nested on KVM, which fail due
    to this issue combined with lack of dirty bit setting.

My first guess would be event injection, but that shouldn't be relevant to RSM.
Architecturally, events (SMIs, NMIs, IRQs, etc.) are recognized at instruction
boundaries, but except for SMIs (see below), KVM naturally defers injection until
an instruction boundary by virtue of delivering events via the VMCS/VMCB, i.e. by
waiting to deliver events until successfully re-entering the guest.

Nested VM-Enter is a special snowflake because KVM needs to finish injecting events
from vmcs12 before injecting any synthetic events, i.e. nested_run_pending ensures
that KVM wouldn't clobber/override an L2 event coming from L1. In other words,
nested_run_pending is much more specific than just needing to wait for an instruction
boundary.

So while the "wait until the CPU is at an instruction boundary" applies to RSM,
it's not immediately obvious to me why setting nested_run_pending is necessary.
And setting nested_run_pending *after* calling nested_vmx_enter_non_root_mode()
is nasty. nested_vmx_enter_non_root_mode() and its helpers use nested_run_pending
to determine whether or not to enforce various consistency checks and other
behaviors. And a more minor issue is that stat.nested_run will be incorrectly
incremented.

As a stop gap, something like this patch is not awful, though I would strongly
prefer to be more precise and not clear it on all triple faults. We've had KVM
bugs where KVM prematurely synthesizes triple fault on an actual nested VM-Enter,
and those would be covered up by this fix.

But due to nested_run_pending being (unnecessarily) buried in vendor structs, it
might actually be easier to do a cleaner fix. E.g. add yet another flag to track
that a hardware VM-Enter needs to be completed in order to complete instruction
emulation.

And as alluded to above, there's another bug lurking. Events that are *emulated*
by KVM must not be emulated until KVM knows the vCPU is at an instruction boundary.
Specifically, enter_smm() shouldn't be invoked while KVM is in the middle of
instruction emulation (even if "emulation" is just setting registers and skipping
the instruction). Theoretically, that could be fixed by honoring the existing
at_instruction_boundary flag for SMIs, but that'd be a rather large change and
at_instruction_boundary is nowhere near accurate enough to use right now.

Anyways, before we do anything, I'd like to get Maxim's input on what exactly was
addressed by 759cbd59674a.

2024-01-03 07:26:30

by Yuan Yao

Subject: Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction

On Tue, Jan 02, 2024 at 11:57:26AM -0800, Sean Christopherson wrote:
> +Maxim
>
> On Fri, Dec 22, 2023, Michal Wilczynski wrote:
> > Syzkaller found a warning triggered in nested_vmx_vmexit().
> > vmx->nested.nested_run_pending is non-zero, even though we're in
> > nested_vmx_vmexit(). Generally, trying to cancel a pending entry is
> > considered a bug. However in this particular scenario, the kernel
> > behavior seems correct.
> >
> > Syzkaller scenario:
> > 1) Set up VCPU's
> > 2) Run some code with KVM_RUN in L2 as a nested guest
> > 3) Return from KVM_RUN
> > 4) Inject KVM_SMI into the VCPU
> > 5) Change the EFER register with KVM_SET_SREGS to value 0x2501
> > 6) Run some code on the VCPU using KVM_RUN
> > 7) Observe following behavior:
> >
> > kvm_smm_transition: vcpu 0: entering SMM, smbase 0x30000
> > kvm_entry: vcpu 0, rip 0x8000
> > kvm_entry: vcpu 0, rip 0x8000
> > kvm_entry: vcpu 0, rip 0x8002
> > kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x30000
> > kvm_nested_vmenter: rip: 0x0000000000008002 vmcs: 0x0000000000007000
> > nested_rip: 0x0000000000000000 int_ctl: 0x00000000
> > event_inj: 0x00000000 nested_ept=n guest
> > cr3: 0x0000000000002000
> > kvm_nested_vmexit_inject: reason: TRIPLE_FAULT ext_inf1: 0x0000000000000000
> > ext_inf2: 0x0000000000000000 ext_int: 0x00000000
> > ext_int_err: 0x00000000
> >
> > What happened here is an SMI was injected immediately and the handler was
> > called at address 0x8000; all is good. Later, an RSM instruction is
> > executed in an emulator to return to the nested VM. em_rsm() is called,
> > which leads to emulator_leave_smm(). A part of this function calls VMX/SVM
> > callback, in this case vmx_leave_smm(). It attempts to set up a pending
> > reentry to guest VM by calling nested_vmx_enter_non_root_mode() and sets
> > vmx->nested.nested_run_pending to one. Unfortunately, later in
> > emulator_leave_smm(), rsm_load_state_64() fails to write invalid EFER to
> > the MSR. This results in em_rsm() calling triple_fault callback. At this
> > point it's clear that the KVM should call the vmexit, but
> > vmx->nested.nested_run_pending is left set to 1. To fix this reset the
> > vmx->nested.nested_run_pending flag in triple_fault handler.
> >
> > TL;DR (courtesy of Yuan Yao)
> > Clear nested_run_pending in case of RSM failure on return from L2 SMM.
>
> KVM doesn't emulate SMM for L2. On an injected SMI, KVM forces a synthetic nested
> VM-Exit to get from L2 to L1, and then emulates SMM in the context of L1.

Ah right!
I was thinking of L1 at that time... Anyway, my fault here.

>
> > The pending VMENTRY to L2 should be cancelled due to such failure leads
> > to triple fault which should be injected to L1.
> >
> > Possible alternative approach:
> > While the proposed approach works, the concern is that it would be
> > simpler, and more readable to cancel the nested_run_pending in em_rsm().
> > This would, however, require introducing new callback e.g,
> > post_leave_smm(), that would cancel nested_run_pending in case of a
> > failure to resume from SMM.
> >
> > Additionally, while the proposed code fixes VMX specific issue, SVM also
> > might suffer from similar problem as it also uses it's own
> > nested_run_pending variable.
> >
> > Reported-by: Zheyu Ma <[email protected]>
> > Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com
>
> Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM")
>
> > Signed-off-by: Michal Wilczynski <[email protected]>
> > ---
> > arch/x86/kvm/vmx/nested.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index c5ec0ef51ff7..44432e19eea6 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> >
> > static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
> > {
> > + struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> > +
> > + /* In case of a triple fault, cancel the nested reentry. This may occur
>
> /*
> * Multi-line comments should look like this. Blah blah blab blah blah
> * blah blah blah blah.
> */
>
> > + * when the RSM instruction fails while attempting to restore the state
> > + * from SMRAM.
> > + */
> > + vmx->nested.nested_run_pending = 0;
>
> Argh. KVM's handling of SMIs while L2 is active is complete garbage. As explained
> by the comment in vmx_enter_smm(), the L2<->SMM transitions should have a completely
> custom flow and not piggyback/usurp nested VM-Exit/VM-Entry.
>
> /*
> * TODO: Implement custom flows for forcing the vCPU out/in of L2 on
> * SMI and RSM. Using the common VM-Exit + VM-Enter routines is wrong
> * SMI and RSM only modify state that is saved and restored via SMRAM.
> * E.g. most MSRs are left untouched, but many are modified by VM-Exit
> * and VM-Enter, and thus L2's values may be corrupted on SMI+RSM.
> */
>
> The Fixes: commit above added a hack on top of the hack. But it's not entirely
> clear from the changelog exactly what was being fixed.
>
> While RSM induced VM entries are not full VM entries,
> they still need to be followed by actual VM entry to complete it,
> unlike setting the nested state.
>
> This patch fixes boot of hyperv and SMM enabled
> windows VM running nested on KVM, which fail due
> to this issue combined with lack of dirty bit setting.
>
> My first guess would be event injection, but that shouldn't be relevant to RSM.
> Architecturally, events (SMIs, NMIs, IRQs, etc.) are recognized at instruction
> boundaries, but except for SMIs (see below), KVM naturally defers injection until
> an instruction boundary by virtue of delivering events via the VMCS/VMCB, i.e. by
> waiting to deliver events until successfully re-entering the guest.
>
> Nested VM-Enter is a special snowflake because KVM needs to finish injecting events
> from vmcs12 before injecting any synthetic events, i.e. nested_run_pending ensures
> that KVM wouldn't clobber/override an L2 event coming from L1. In other words,
> nested_run_pending is much more specific than just needing to wait for an instruction
> boundary.
>
> So while the "wait until the CPU is at an instruction boundary" applies to RSM,
> it's not immediately obvious to me why setting nested_run_pending is necessary.
> And setting nested_run_pending *after* calling nested_vmx_enter_non_root_mode()
> is nasty. nested_vmx_enter_non_root_mode() and its helpers use nested_run_pending
> to determine whether or not to enforce various consistency checks and other
> behaviors. And a more minor issue is that stat.nested_run will be incorrectly
> incremented.
>
> As a stop gap, something like this patch is not awful, though I would strongly
> prefer to be more precise and not clear it on all triple faults. We've had KVM
> bugs where KVM prematurely synthesizes triple fault on an actual nested VM-Enter,
> and those would be covered up by this fix.
>
> But due to nested_run_pending being (unnecessarily) buried in vendor structs, it
> might actually be easier to do a cleaner fix. E.g. add yet another flag to track
> that a hardware VM-Enter needs to be completed in order to complete instruction
> emulation.
>
> And as alluded to above, there's another bug lurking. Events that are *emulated*
> by KVM must not be emulated until KVM knows the vCPU is at an instruction boundary.
> Specifically, enter_smm() shouldn't be invoked while KVM is in the middle of
> instruction emulation (even if "emulation" is just setting registers and skipping
> the instruction). Theoretically, that could be fixed by honoring the existing
> at_instruction_boundary flag for SMIs, but that'd be a rather large change and
> at_instruction_boundary is nowhere near accurate enough to use right now.
>
> Anyways, before we do anything, I'd like to get Maxim's input on what exactly was
> addressed by 759cbd59674a.
>

2024-01-03 23:03:33

by Wilczynski, Michal

Subject: Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction



On 1/2/2024 8:57 PM, Sean Christopherson wrote:
>>
>> Additionally, while the proposed code fixes VMX specific issue, SVM also
>> might suffer from similar problem as it also uses it's own
>> nested_run_pending variable.
>>
>> Reported-by: Zheyu Ma <[email protected]>
>> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com
>
> Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM")

Thanks !

>
>> Signed-off-by: Michal Wilczynski <[email protected]>
>> ---
>> arch/x86/kvm/vmx/nested.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index c5ec0ef51ff7..44432e19eea6 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>>
>> static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
>> {
>> + struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>> +
>> + /* In case of a triple fault, cancel the nested reentry. This may occur
>
> /*
> * Multi-line comments should look like this. Blah blah blab blah blah
> * blah blah blah blah.
> */

Sorry, I didn't notice, and checkpatch didn't complain. In other
subsystems, e.g. networking, this is not enforced. I will make sure to
remember this next time.

>
>> + * when the RSM instruction fails while attempting to restore the state
>> + * from SMRAM.
>> + */
>> + vmx->nested.nested_run_pending = 0;
>
> Argh. KVM's handling of SMIs while L2 is active is complete garbage. As explained
> by the comment in vmx_enter_smm(), the L2<->SMM transitions should have a completely
> custom flow and not piggyback/usurp nested VM-Exit/VM-Entry.
>
> /*
> * TODO: Implement custom flows for forcing the vCPU out/in of L2 on
> * SMI and RSM. Using the common VM-Exit + VM-Enter routines is wrong
> * SMI and RSM only modify state that is saved and restored via SMRAM.
> * E.g. most MSRs are left untouched, but many are modified by VM-Exit
> * and VM-Enter, and thus L2's values may be corrupted on SMI+RSM.
> */

I noticed this while working on the issue, and I would be very
interested in taking on this task and implementing the custom flows
mentioned. Hope you're fine with this.


> As a stop gap, something like this patch is not awful, though I would strongly
> prefer to be more precise and not clear it on all triple faults. We've had KVM
> bugs where KVM prematurely synthesizes triple fault on an actual nested VM-Enter,
> and those would be covered up by this fix.
>
> But due to nested_run_pending being (unnecessarily) buried in vendor structs, it
> might actually be easier to do a cleaner fix. E.g. add yet another flag to track
> that a hardware VM-Enter needs to be completed in order to complete instruction
> emulation.

Sounds like a good idea. I will experiment with that approach.

>
> And as alluded to above, there's another bug lurking. Events that are *emulated*
> by KVM must not be emulated until KVM knows the vCPU is at an instruction boundary.
> Specifically, enter_smm() shouldn't be invoked while KVM is in the middle of
> instruction emulation (even if "emulation" is just setting registers and skipping
> the instruction). Theoretically, that could be fixed by honoring the existing
> at_instruction_boundary flag for SMIs, but that'd be a rather large change and
> at_instruction_boundary is nowhere near accurate enough to use right now.
>
> Anyways, before we do anything, I'd like to get Maxim's input on what exactly was
> addressed by 759cbd59674a.

Thank you very much for such a comprehensive review! I've learned a lot.
Will try to help with the mentioned problems.

Michał

2024-01-12 17:59:09

by Wilczynski, Michal

Subject: Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction



On 1/2/2024 8:57 PM, Sean Christopherson wrote:
> +Maxim

> Anyways, before we do anything, I'd like to get Maxim's input on what exactly was
> addressed by 759cbd59674a.

I think enough time has passed, and it's safe to say that this e-mail
got lost somewhere in Maxim's inbox. I'll go ahead and post the next
revision of this patch next week.

Thanks,
Michał

2024-01-23 15:03:06

by Maxim Levitsky

Subject: Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction

On Tue, 2024-01-02 at 11:57 -0800, Sean Christopherson wrote:
> +Maxim
>
> On Fri, Dec 22, 2023, Michal Wilczynski wrote:
> > Syzkaller found a warning triggered in nested_vmx_vmexit().
> > vmx->nested.nested_run_pending is non-zero, even though we're in
> > nested_vmx_vmexit(). Generally, trying to cancel a pending entry is
> > considered a bug. However in this particular scenario, the kernel
> > behavior seems correct.
> >
> > Syzkaller scenario:
> > 1) Set up VCPU's
> > 2) Run some code with KVM_RUN in L2 as a nested guest
> > 3) Return from KVM_RUN
> > 4) Inject KVM_SMI into the VCPU
> > 5) Change the EFER register with KVM_SET_SREGS to value 0x2501
> > 6) Run some code on the VCPU using KVM_RUN
> > 7) Observe following behavior:
> >
> > kvm_smm_transition: vcpu 0: entering SMM, smbase 0x30000
> > kvm_entry: vcpu 0, rip 0x8000
> > kvm_entry: vcpu 0, rip 0x8000
> > kvm_entry: vcpu 0, rip 0x8002
> > kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x30000
> > kvm_nested_vmenter: rip: 0x0000000000008002 vmcs: 0x0000000000007000
> >                     nested_rip: 0x0000000000000000 int_ctl: 0x00000000
> >                     event_inj: 0x00000000 nested_ept=n guest
> >                     cr3: 0x0000000000002000
> > kvm_nested_vmexit_inject: reason: TRIPLE_FAULT ext_inf1: 0x0000000000000000
> >                           ext_inf2: 0x0000000000000000 ext_int: 0x00000000
> >                           ext_int_err: 0x00000000
> >
> > What happened here is an SMI was injected immediately and the handler was
> > called at address 0x8000; all is good. Later, an RSM instruction is
> > executed in an emulator to return to the nested VM. em_rsm() is called,
> > which leads to emulator_leave_smm(). A part of this function calls VMX/SVM
> > callback, in this case vmx_leave_smm(). It attempts to set up a pending
> > reentry to guest VM by calling nested_vmx_enter_non_root_mode() and sets
> > vmx->nested.nested_run_pending to one. Unfortunately, later in
> > emulator_leave_smm(), rsm_load_state_64() fails to write invalid EFER to
> > the MSR. This results in em_rsm() calling triple_fault callback. At this
> > point it's clear that the KVM should call the vmexit, but
> > vmx->nested.nested_run_pending is left set to 1. To fix this reset the
> > vmx->nested.nested_run_pending flag in triple_fault handler.
> >
> > TL;DR (courtesy of Yuan Yao)
> > Clear nested_run_pending in case of RSM failure on return from L2 SMM.
>
> KVM doesn't emulate SMM for L2. On an injected SMI, KVM forces a synthetic nested
> VM-Exit to get from L2 to L1, and then emulates SMM in the context of L1.
>
> > The pending VMENTRY to L2 should be cancelled due to such failure leads
> > to triple fault which should be injected to L1.
> >
> > Possible alternative approach:
> > While the proposed approach works, the concern is that it would be
> > simpler, and more readable to cancel the nested_run_pending in em_rsm().
> > This would, however, require introducing new callback e.g,
> > post_leave_smm(), that would cancel nested_run_pending in case of a
> > failure to resume from SMM.
> >
> > Additionally, while the proposed code fixes VMX specific issue, SVM also
> > might suffer from similar problem as it also uses it's own
> > nested_run_pending variable.
> >
> > Reported-by: Zheyu Ma <[email protected]>
> > Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com
>
> Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM")
>
> > Signed-off-by: Michal Wilczynski <[email protected]>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index c5ec0ef51ff7..44432e19eea6 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> >
> >  static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
> >  {
> > + struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> >   kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> > +
> > + /* In case of a triple fault, cancel the nested reentry. This may occur
>
> /*
> * Multi-line comments should look like this. Blah blah blab blah blah
> * blah blah blah blah.
> */
>
> > + * when the RSM instruction fails while attempting to restore the state
> > + * from SMRAM.
> > + */
> > + vmx->nested.nested_run_pending = 0;
>
> Argh. KVM's handling of SMIs while L2 is active is complete garbage. As explained
> by the comment in vmx_enter_smm(), the L2<->SMM transitions should have a completely
> custom flow and not piggyback/usurp nested VM-Exit/VM-Entry.
>
> /*
> * TODO: Implement custom flows for forcing the vCPU out/in of L2 on
> * SMI and RSM. Using the common VM-Exit + VM-Enter routines is wrong
> * SMI and RSM only modify state that is saved and restored via SMRAM.
> * E.g. most MSRs are left untouched, but many are modified by VM-Exit
> * and VM-Enter, and thus L2's values may be corrupted on SMI+RSM.
> */
>
> The Fixes: commit above added a hack on top of the hack. But it's not entirely
> clear from the changelog exactly what was being fixed.
>
>     While RSM induced VM entries are not full VM entries,
>     they still need to be followed by actual VM entry to complete it,
>     unlike setting the nested state.
>
>     This patch fixes boot of hyperv and SMM enabled
>     windows VM running nested on KVM, which fail due
>     to this issue combined with lack of dirty bit setting.
>
> My first guess would be event injection, but that shouldn't be relevant to RSM.
> Architecturally, events (SMIs, NMIs, IRQs, etc.) are recognized at instruction
> boundaries, but except for SMIs (see below), KVM naturally defers injection until
> an instruction boundary by virtue of delivering events via the VMCS/VMCB, i.e. by
> waiting to deliver events until successfully re-entering the guest.
>
> Nested VM-Enter is a special snowflake because KVM needs to finish injecting events
> from vmcs12 before injecting any synthetic events, i.e. nested_run_pending ensures
> that KVM wouldn't clobber/override an L2 event coming from L1. In other words,
> nested_run_pending is much more specific than just needing to wait for an instruction
> boundary.
>
> So while the "wait until the CPU is at an instruction boundary" applies to RSM,
> it's not immediately obvious to me why setting nested_run_pending is necessary.

If nested_run_pending is not set, then the nested entry can be aborted
by an event (e.g. another #SMI, or just an interrupt, since after VM
entry interrupts are usually enabled, and usually intercepted).

It is even possible for a nested migration to happen right after the
RSM is executed but before the actual VMENTER.

> > > > And setting nested_run_pending *after* calling
> > > > nested_vmx_enter_non_root_mode()
> > > > is nasty. nested_vmx_enter_non_root_mode() and its helpers use
> > > > nested_run_pending
> > > > to determine whether or not to enforce various consistency
> > > > checks and other
> > > > behaviors. And a more minor issue is that stat.nested_run will
> > > > be incorrectly
> > > > incremented.
> > > >
> > > > As a stop gap, something like this patch is not awful, though I
> > > > would strongly
> > > > prefer to be more precise and not clear it on all triple
> > > > faults. We've had KVM
> > > > bugs where KVM prematurely synthesizes triple fault on an
> > > > actual nested VM-Enter,
> > > > and those would be covered up by this fix.
> > > >
> > > > But due to nested_run_pending being (unnecessarily) buried in
> > > > vendor structs, it
> > > > might actually be easier to do a cleaner fix. E.g. add yet
> > > > another flag to track
> > > > that a hardware VM-Enter needs to be completed in order to
> > > > complete instruction
> > > > emulation.
> > > >
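Something like the dedicated-flag idea could look roughly like this
toy sketch (plain C, all names hypothetical, not a real
implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy sketch of a dedicated flag; all names are made up. */
struct toy_vcpu {
	bool nested_run_pending;
	bool rsm_vmenter_pending;   /* hypothetical: VM-Enter owed to RSM */
};

static void toy_leave_smm(struct toy_vcpu *v)
{
	/* RSM re-entered L2: a hardware VM-Enter is still owed. */
	v->nested_run_pending = true;
	v->rsm_vmenter_pending = true;
}

static void toy_triple_fault(struct toy_vcpu *v)
{
	/*
	 * Cancel only entries that RSM itself started, instead of
	 * clearing nested_run_pending on every triple fault.
	 */
	if (v->rsm_vmenter_pending) {
		v->nested_run_pending = false;
		v->rsm_vmenter_pending = false;
	}
}

int main(void)
{
	struct toy_vcpu v = { 0 };

	/* RSM then triple fault: the pending entry is cancelled. */
	toy_leave_smm(&v);
	toy_triple_fault(&v);
	assert(!v.nested_run_pending);

	/* A triple fault on a genuine nested VM-Enter is left alone. */
	v = (struct toy_vcpu){ .nested_run_pending = true };
	toy_triple_fault(&v);
	assert(v.nested_run_pending);

	printf("ok\n");
	return 0;
}
```
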
> > > > And as alluded to above, there's another bug lurking. Events
> > > > that are *emulated*
> > > > by KVM must not be emulated until KVM knows the vCPU is at an
> > > > instruction boundary.
> > > > Specifically, enter_smm() shouldn't be invoked while KVM is in
> > > > the middle of
> > > > instruction emulation (even if "emulation" is just setting
> > > > registers and skipping
> > > > the instruction). Theoretically, that could be fixed by
> > > > honoring the existing
> > > > at_instruction_boundary flag for SMIs, but that'd be a rather
> > > > large change and
> > > > at_instruction_boundary is nowhere near accurate enough to use
> > > > right now.
> > > >
> > > > Anyways, before we do anything, I'd like to get Maxim's input
> > > > on what exactly was
> > > > addressed by 759cbd59674a.
> > > >

There is a good explanation in commit 759cbd59674a and in
commit e8efa4ff0037 which comes before it.

In a nutshell, there were two problems that together created the
conditions for the guest boot failure:

1. When a non-intercepted #SMI arrives while L2 is running, it is
almost like a VM exit from L2 to L1, except that instead of restoring
the host state, KVM jumps to the SMI handler in real mode in L1.
To be able to run a regular VM exit handler later, KVM stores the
host state (which is present in vmcb01) to the HSAVE area, and on RSM
restores it from HSAVE back to vmcb01.
The bug was that KVM wasn't marking vmcb01 as dirty when restoring
the host state.
Usually this is not a problem, because in this case we resume L2 and
thus switch to vmcb02.

2. Due to nested_run_pending not being set, the above switch to
vmcb02 might not complete if there are pending events. This is
architecturally allowed because RSM doesn't have an interrupt shadow;
thus once we have finished emulating RSM, we are in the L2 guest,
likely with interrupts enabled.

If the switch to vmcb02 doesn't complete, we end up still running
with vmcb01 and with lots of fields that were changed but not
marked as dirty.

Now usually this works, because the CPU isn't that aggressive about
caching vmcb fields, but if the whole thing is running nested under
KVM, then KVM always caches a few 'rare' vmcb fields, such as the
IDTR, so we end up with a stale IDTR, and on the next interrupt the
guest explodes.


PS: this is one of the mails that I forgot to send, because I recently
moved to Canada and was offline for some time because of that.

Best regards,
Maxim Levitsky