2021-06-23 07:45:57

by Vitaly Kuznetsov

Subject: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

VMCB split commit 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the
nested L2 guest") broke return from SMM when SMM was entered from guest
(L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem
manifests itself like this:

kvm_exit: reason EXIT_RSM rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:7ffbb280: 0f aa
kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x7ffb3000
kvm_nested_vmrun: rip: 0x000000007ffbb280 vmcb: 0x0000000008224000
nrip: 0xffffffffffbbe119 int_ctl: 0x01020000 event_inj: 0x00000000
npt: on
kvm_nested_intercepts: cr_read: 0000 cr_write: 0010 excp: 40060002
intercepts: fd44bfeb 0000217f 00000000
kvm_entry: vcpu 0, rip 0xffffffffffbbe119
kvm_exit: reason EXIT_NPF rip 0xffffffffffbbe119 info
200000006 1ab000
kvm_nested_vmexit: vcpu 0 reason npf rip 0xffffffffffbbe119 info1
0x0000000200000006 info2 0x00000000001ab000 intr_info 0x00000000
error_code 0x00000000
kvm_page_fault: address 1ab000 error_code 6
kvm_nested_vmexit_inject: reason EXIT_NPF info1 200000006 info2 1ab000
int_info 0 int_info_err 0
kvm_entry: vcpu 0, rip 0x7ffbb280
kvm_exit: reason EXIT_EXCP_GP rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:7ffbb280: 0f aa
kvm_inj_exception: #GP (0x0)

Note: the return to L2 succeeded, but upon the first exit to L1 its RIP
points to the 'RSM' instruction even though we're no longer in SMM.

The problem appears to be that VMCB01 gets irreversibly destroyed during
SMM execution. Previously, we used to have the 'hsave' VMCB, where regular
(pre-SMM) L1 state was saved upon nested_svm_vmexit(), but now we just
switch from VMCB02 to VMCB01.

Pre-split (working) flow looked like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() restores L1's state from 'hsave'
- SMM -> RSM
- enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have
pre-SMM (and pre L2 VMRUN) L1's state there
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from 'hsave'.

This was always broken with regards to svm_get_nested_state()/
svm_set_nested_state(): 'hsave' was never a part of what's being
saved and restored, so migration happening during SMM triggered from L2 would
never restore L1's state correctly.

Post-split flow (broken) looks like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() switches to VMCB01 from VMCB02
- SMM -> RSM
- enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01
is already lost.
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from VMCB01 but it is corrupted
(reflects the state during 'RSM' execution).

VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest
and host state so when we switch back to VMCS02 L1's state is intact there.

To resolve the issue we need to save L1's state somewhere. We could've
created a third VMCB for SMM but that would require us to modify the saved
state format. L1's architectural HSAVE area (pointed to by MSR_VM_HSAVE_PA)
seems appropriate: L0 is free to save any (or none) of L1's state there.
Currently, KVM does 'none'.

Note, for nested state migration to succeed, both source and destination
hypervisors must have the fix. We, however, don't need to create a new
flag indicating that the HSAVE area is now populated, as migration
during SMM triggered from L2 was always broken.

Fixes: 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
- RFC: I'm not 100% sure my 'smart' idea to use currently-unused HSAVE area
is that smart. Also, we don't even seem to check that L1 set it up upon
nested VMRUN so hypervisors which don't do that may remain broken. A very
much needed selftest is also missing.
---
arch/x86/kvm/svm/svm.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 12c06ea28f5c..d110bfe0e208 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4286,6 +4286,7 @@ static int svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 static int svm_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	struct kvm_host_map map_save;
 	int ret;
 
 	if (is_guest_mode(vcpu)) {
@@ -4301,6 +4302,13 @@ static int svm_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 		ret = nested_svm_vmexit(svm);
 		if (ret)
 			return ret;
+
+		/* Save L1 state to L1 HSAVE area as vmcb01 will be used in SMM */
+		if (kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.hsave_msr),
+				 &map_save) == -EINVAL)
+			return 1;
+		memcpy(map_save.hva, &svm->vmcb01.ptr->save, sizeof(svm->vmcb01.ptr->save));
+		kvm_vcpu_unmap(vcpu, &map_save, true);
 	}
 	return 0;
 }
@@ -4308,7 +4316,7 @@ static int svm_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 static int svm_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
-	struct kvm_host_map map;
+	struct kvm_host_map map, map_save;
 	int ret = 0;
 
 	if (guest_cpuid_has(vcpu, X86_FEATURE_LM)) {
@@ -4332,6 +4340,13 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
 
 			ret = enter_svm_guest_mode(vcpu, vmcb12_gpa, map.hva);
 			kvm_vcpu_unmap(vcpu, &map, true);
+
+			/* Restore L1 state from L1 HSAVE area as vmcb01 was used in SMM */
+			if (kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.hsave_msr),
+					 &map_save) == -EINVAL)
+				return 1;
+			memcpy(&svm->vmcb01.ptr->save, map_save.hva, sizeof(svm->vmcb01.ptr->save));
+			kvm_vcpu_unmap(vcpu, &map_save, true);
 		}
 	}

--
2.31.1


2021-06-23 09:41:50

by Paolo Bonzini

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On 23/06/21 09:44, Vitaly Kuznetsov wrote:
> - RFC: I'm not 100% sure my 'smart' idea to use currently-unused HSAVE area
> is that smart. Also, we don't even seem to check that L1 set it up upon
> nested VMRUN so hypervisors which don't do that may remain broken. A very
> much needed selftest is also missing.

It's certainly a bit weird, but I guess it counts as smart too. It
needs a few more comments, but I think it's a good solution.

One could delay the backwards memcpy until vmexit time, but that would
require a new flag so it's not worth it for what is a pretty rare and
already expensive case.

Paolo

2021-06-23 11:40:26

by Maxim Levitsky

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Wed, 2021-06-23 at 11:39 +0200, Paolo Bonzini wrote:
> On 23/06/21 09:44, Vitaly Kuznetsov wrote:
> > - RFC: I'm not 100% sure my 'smart' idea to use currently-unused HSAVE area
> > is that smart. Also, we don't even seem to check that L1 set it up upon
> > nested VMRUN so hypervisors which don't do that may remain broken. A very
> > much needed selftest is also missing.
>
> It's certainly a bit weird, but I guess it counts as smart too. It
> needs a few more comments, but I think it's a good solution.
>
> One could delay the backwards memcpy until vmexit time, but that would
> require a new flag so it's not worth it for what is a pretty rare and
> already expensive case.
>
> Paolo
>

I wonder what would happen if SMM entry is triggered by L1 (say with the ICR)
on a VCPU which is in L2. Such an exit should go straight to L1 SMM mode.

Very soon, maybe even today, I will start testing SMM with my migration
tests and such. I hope I will find more bugs in this area.

Thanks for fixing this issue!

Best regards,
Maxim Levitsky

2021-06-23 12:01:20

by Paolo Bonzini

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On 23/06/21 13:39, Maxim Levitsky wrote:
> On Wed, 2021-06-23 at 11:39 +0200, Paolo Bonzini wrote:
>> On 23/06/21 09:44, Vitaly Kuznetsov wrote:
>>> - RFC: I'm not 100% sure my 'smart' idea to use currently-unused HSAVE area
>>> is that smart. Also, we don't even seem to check that L1 set it up upon
>>> nested VMRUN so hypervisors which don't do that may remain broken. A very
>>> much needed selftest is also missing.
>>
>> It's certainly a bit weird, but I guess it counts as smart too. It
>> needs a few more comments, but I think it's a good solution.
>>
>> One could delay the backwards memcpy until vmexit time, but that would
>> require a new flag so it's not worth it for what is a pretty rare and
>> already expensive case.
>
> I wonder what would happen if SMM entry is triggered by L1 (say with the ICR)
> on a VCPU which is in L2. Such an exit should go straight to L1 SMM mode.

Yes, it does, but it still records the L2 state in the guest's SMM state
save area. Everything works right as long as the guest stays in L2 (the
vmcb12 control save area is still there in svm->nested and is
saved/restored by KVM_GET/SET_NESTED_STATE); the problem that Vitaly
found is the destruction of the saved L1 host state.

Paolo


2021-06-23 13:05:00

by Maxim Levitsky

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Wed, 2021-06-23 at 11:39 +0200, Paolo Bonzini wrote:
> On 23/06/21 09:44, Vitaly Kuznetsov wrote:
> > - RFC: I'm not 100% sure my 'smart' idea to use currently-unused HSAVE area
> > is that smart. Also, we don't even seem to check that L1 set it up upon
> > nested VMRUN so hypervisors which don't do that may remain broken. A very
> > much needed selftest is also missing.
>
> It's certainly a bit weird, but I guess it counts as smart too. It
> needs a few more comments, but I think it's a good solution.
>
> One could delay the backwards memcpy until vmexit time, but that would
> require a new flag so it's not worth it for what is a pretty rare and
> already expensive case.
>
> Paolo
>

Hi!

I did some homework on this now and I would like to share a few of my thoughts:

First of all, my attention was caught by the way we intercept #SMI
(this isn't 100% related to the bug but still worth talking about IMHO)

A. Bare metal: SVM allows intercepting SMI with SVM_EXIT_SMI, with the
intention of then entering the BIOS SMM handler manually using the SMM_CTL MSR.
On bare metal we do set INTERCEPT_SMI but we emulate the exit as a nop.
I guess on bare metal there are some undocumented bits that the BIOS sets which
make the CPU ignore that SMI intercept and still take the #SMI handler
normally, but I wonder if we could still break some motherboard
code due to that.


B. Nested: If #SMI is intercepted, then it causes a nested VMEXIT.
Since KVM does enable the SMI intercept, when it runs nested this means that all SMIs
the nested KVM gets are emulated as NOPs, and L1's SMI handler is not run.


About the issue that was fixed in this patch: let me try to understand how
it would work on bare metal:

1. A guest is entered. Host state is saved to the VM_HSAVE_PA area (or stashed
somewhere in the CPU)

2. #SMI (without intercept) happens

3. The CPU has to exit SVM and start running the host SMI handler: it loads the
SMM state without touching VM_HSAVE_PA, runs the SMI handler, and then, once it
RSMs, it restores the guest state from the SMM area and continues the guest

4. Once a normal VMexit happens, the host state is restored from VM_HSAVE_PA

So host state indeed can't be saved to vmcb01.

To be honest, I think I would prefer not to use L1's hsave area but rather add back
our own 'hsave' in KVM and always store the L1 host state there on nested entry.

This way we will avoid touching vmcb01 at all, and both solve the issue and
reduce code complexity.
(Copying L1 host state to what is basically the L1 guest state area and back
even has a comment explaining why it was possible to do so,
before you discovered that this doesn't work with SMM.)

Thanks again for fixing this bug!

Best regards,
Maxim Levitsky

2021-06-23 13:09:19

by Maxim Levitsky

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Wed, 2021-06-23 at 16:01 +0300, Maxim Levitsky wrote:
> [...]
> To be honest, I think I would prefer not to use L1's hsave area but rather add back
> our own 'hsave' in KVM and always store the L1 host state there on nested entry.
>
> This way we will avoid touching vmcb01 at all, and both solve the issue and
> reduce code complexity.
> (Copying L1 host state to what is basically the L1 guest state area and back
> even has a comment explaining why it was possible to do so,
> before you discovered that this doesn't work with SMM.)

I need more coffee today. The comment is somewhat wrong actually.
When L1 switches to L2, its HSAVE area holds L1 guest state, but
L1 is a "host" vs L2, so it is host state.
The copying is more between kvm's register cache and the vmcb.

So maybe backing it up as this patch does is the best solution yet.
I will take a more in-depth look at this soon.

Best regards,
Maxim Levitsky


2021-06-23 13:23:15

by Paolo Bonzini

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On 23/06/21 15:01, Maxim Levitsky wrote:
> I did some homework on this now and I would like to share a few of my thoughts:
>
> First of all, my attention was caught by the way we intercept #SMI
> (this isn't 100% related to the bug but still worth talking about IMHO)
>
> A. Bare metal: SVM allows intercepting SMI with SVM_EXIT_SMI, with the
> intention of then entering the BIOS SMM handler manually using the SMM_CTL MSR.

... or just using STGI, which is what happens for KVM. This is in the
manual: "The hypervisor may respond to the #VMEXIT(SMI) by executing the
STGI instruction, which causes the pending SMI to be taken immediately".

It *should* work for KVM to just not intercept SMI, but it adds more
complexity for no particular gain.

> On bare metal we do set INTERCEPT_SMI but we emulate the exit as a nop.
> I guess on bare metal there are some undocumented bits that the BIOS sets which
> make the CPU ignore that SMI intercept and still take the #SMI handler
> normally, but I wonder if we could still break some motherboard
> code due to that.
>
> B. Nested: If #SMI is intercepted, then it causes a nested VMEXIT.
> Since KVM does enable the SMI intercept, when it runs nested this means that all SMIs
> the nested KVM gets are emulated as NOPs, and L1's SMI handler is not run.

No, this is incorrect. Note that svm_check_nested_events does not clear
smi_pending the way vmx_check_nested_events does it for nmi_pending. So
the interrupt is still there and will be injected on the next STGI.
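
For reference, the SMI bit of svm_check_nested_events currently looks
roughly like this (abridged from memory, not a verbatim quote of the tree):

	if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
		if (block_nested_events)
			return -EBUSY;
		if (!nested_exit_on_smi(svm))
			return 0;	/* SMI stays pending, taken at the next STGI */
		nested_svm_simple_vmexit(svm, SVM_EXIT_SMI);
		return 0;
	}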

Paolo


2021-06-23 13:33:50

by Vitaly Kuznetsov

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

Maxim Levitsky <[email protected]> writes:

> On Wed, 2021-06-23 at 16:01 +0300, Maxim Levitsky wrote:
>> [...]
>
> I need more coffee today. The comment is somewhat wrong actually.
> When L1 switches to L2, its HSAVE area holds L1 guest state, but
> L1 is a "host" vs L2, so it is host state.
> The copying is more between kvm's register cache and the vmcb.
>
> So maybe backing it up as this patch does is the best solution yet.
> I will take a more in-depth look at this soon.

We can resurrect 'hsave' and keep it internally indeed, but to make this
migratable, we'd have to add it to the nested state acquired through
svm_get_nested_state(). Using L1's HSAVE area (pointed to by
MSR_VM_HSAVE_PA) avoids that as we have everything in L1's memory. And,
as far as I understand, we comply with the spec as 1) L1 has to set it
up and 2) L1 is not supposed to expect any particular format there; it's
completely volatile.

--
Vitaly

2021-06-23 14:09:37

by Maxim Levitsky

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Wed, 2021-06-23 at 15:21 +0200, Paolo Bonzini wrote:
> On 23/06/21 15:01, Maxim Levitsky wrote:
> > I did some homework on this now and I would like to share a few of my thoughts:
> >
> > First of all, my attention was caught by the way we intercept #SMI
> > (this isn't 100% related to the bug but still worth talking about IMHO)
> >
> > A. Bare metal: SVM allows intercepting SMI with SVM_EXIT_SMI, with the
> > intention of then entering the BIOS SMM handler manually using the SMM_CTL MSR.
>
> ... or just using STGI, which is what happens for KVM.  This is in the
> manual: "The hypervisor may respond to the #VMEXIT(SMI) by executing the
> STGI instruction, which causes the pending SMI to be taken immediately".

Right, I didn't notice that, that makes sense.
Thanks for the explanation!

>
> It *should* work for KVM to just not intercept SMI, but it adds more
> complexity for no particular gain.

It would be nice to do so to increase testing coverage of running
a nested KVM. I'll add a hack for that in my nested kernel.

>
> > On bare metal we do set INTERCEPT_SMI but we emulate the exit as a nop.
> > I guess on bare metal there are some undocumented bits that the BIOS sets which
> > make the CPU ignore that SMI intercept and still take the #SMI handler
> > normally, but I wonder if we could still break some motherboard
> > code due to that.
> >
> > B. Nested: If #SMI is intercepted, then it causes a nested VMEXIT.
> > Since KVM does enable the SMI intercept, when it runs nested this means that all SMIs
> > the nested KVM gets are emulated as NOPs, and L1's SMI handler is not run.
>
> No, this is incorrect.  Note that svm_check_nested_events does not clear
> smi_pending the way vmx_check_nested_events does it for nmi_pending.  So
> the interrupt is still there and will be injected on the next STGI.

I didn't check the code, but just assumed that the same issue would be
present. Now it makes sense. I totally forgot about STGI.


Thanks,
Best regards,
Maxim Levitsky


2021-06-23 14:42:50

by Maxim Levitsky

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Wed, 2021-06-23 at 15:32 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <[email protected]> writes:
>
> > [...]
>
> We can resurrect 'hsave' and keep it internally indeed, but to make this
> migratable, we'd have to add it to the nested state acquired through
> svm_get_nested_state(). Using L1's HSAVE area (pointed to by
> MSR_VM_HSAVE_PA) avoids that as we have everything in L1's memory. And,
> as far as I understand, we comply with the spec [...]

Hi!

I think I would prefer to avoid touching guest memory as much
as possible to avoid the shenanigans of accessing it:

For example, on nested state read we are not allowed to write guest
memory since at that point it is already migrated, and when setting
nested state we are not allowed to even read the guest memory since
the memory map might not be up to date. A malicious guest can also
always change its memory, which can cause issues as well.

Since it didn't work before and both sides of migration need a fix,
adding a new flag and adding the hsave area to the nested state seems
like a very good thing.

I think, though, that I would use that SMM hsave area just like you
did in the patch, just not save it to guest memory, migrating it as
new state instead.

I would call it smm_l1_hsave_area or something like that, with a
comment explaining why it is needed.
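
Something like this, just as a sketch (the field and flag names are made
up to illustrate the idea):

	/* In struct svm_nested_state: */
	/*
	 * L1 host state stashed across SMM entered from L2; migrated via a
	 * new (hypothetical) KVM_STATE_NESTED_* flag rather than guest memory.
	 */
	struct vmcb_save_area smm_l1_hsave_area;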

This way we still avoid the overhead of copying the hsave area
on each nested entry.

What do you think?

Best regards,
Maxim Levitsky


2021-06-23 16:11:50

by Sean Christopherson

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Wed, Jun 23, 2021, Maxim Levitsky wrote:
> On Wed, 2021-06-23 at 15:32 +0200, Vitaly Kuznetsov wrote:
> > Maxim Levitsky <[email protected]> writes:
> >
> > > On Wed, 2021-06-23 at 16:01 +0300, Maxim Levitsky wrote:
> > > > On Wed, 2021-06-23 at 11:39 +0200, Paolo Bonzini wrote:
> > > > > On 23/06/21 09:44, Vitaly Kuznetsov wrote:
> > > > > > - RFC: I'm not 100% sure my 'smart' idea to use currently-unused
> > > > > > HSAVE area is that smart. Also, we don't even seem to check that L1
> > > > > > set it up upon nested VMRUN so hypervisors which don't do that may
> > > > > > remain broken.

Ya, per the APM, nested_svm_vmrun() needs to verify the MSR is non-zero, and
svm_set_msr() needs to verify the incoming value is a legal, page-aligned address.
Both conditions are #GP.
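
Something along these lines, I'd think (untested sketch):

	/* nested_svm_vmrun(): VMRUN with no valid HSAVE area set up -> #GP */
	if (!svm->nested.hsave_msr) {
		kvm_inject_gp(vcpu, 0);
		return 1;
	}

	/* svm_set_msr(), MSR_VM_HSAVE_PA: reject illegal/unaligned values */
	case MSR_VM_HSAVE_PA:
		if (!msr_info->host_initiated && !page_address_valid(vcpu, data))
			return 1;
		svm->nested.hsave_msr = data;
		break;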

> > > > > > A very much needed selftest is also missing.
> > > > >
> > > > > It's certainly a bit weird, but I guess it counts as smart too.  It
> > > > > needs a few more comments, but I think it's a good solution.
> > > > >
> > > > > One could delay the backwards memcpy until vmexit time, but that
> > > > > would require a new flag so it's not worth it for what is a pretty
> > > > > rare and already expensive case.

And it's _almost_ architecturally legal, but the APM does clearly state that the
HSAVE area is used on VMRUN and #VMEXIT. We'd definitely be taking a few
liberties by accessing the area at SMI/RSM.

> > We can resurrect 'hsave' and keep it internally indeed, but to make this
> > migratable, we'd have to add it to the nested state acquired through
> > svm_get_nested_state(). Using L1's HSAVE area (pointed to by
> > MSR_VM_HSAVE_PA) avoids that as we have everything in L1's memory.

> I think I would prefer to avoid touching guest memory as much
> as possible to avoid the shenanigans of accessing it:
>
> For example, on nested state read we are not allowed to write guest
> memory since at that point it is already migrated, and when setting
> nested state we are not allowed to even read the guest memory since
> the memory map might not be up to date.

But that shouldn't conflict with using hsave, because hsave won't be accessed in
this case until KVM_RUN, correct?

> A malicious guest can also always change its memory, which can cause
> issues as well.

The APM very clearly states that touching hsave is not allowed:

Software must not attempt to read or write the host save-state area directly.

> Since it didn't work before and both sides of migration need a fix,
> adding a new flag and adding the hsave area to the nested state seems
> like a very good thing.

...

> This way we still avoid overhead of copying the hsave area
> on each nested entry.
>
> What do you think?

Hmm, I don't like allocating an extra page for every vCPU.

And I believe this hackery is necessary only because nested_svm_vmexit() isn't
following the architecture in the first place. I.e. using vmcb01 to restore
host state is flat out wrong. If the restore code (below) is instead
converted to use the hsave area, then this bug is fixed _and_ we become (more)
compliant with the spec. And the save/restore to HSAVE could be selective, i.e.
only save/restore state that is actually consumed by nested_svm_vmexit().

It would require saving/restoring select segment _selectors_, but that is also a
bug fix since the APM pseudocode clearly states that segments are reloaded from
the descriptor table (APM says "GDT", though I assume LDT is also legal). E.g.
while software can't touch HSAVE, L2 can _legally_ change L1's GDT, so it's
technically legal to have host segment registers magically change across VMRUN.
There might also be side effects we're missing by not reloading segment registers,
e.g. accessed bit updates?

	/*
	 * Restore processor state that had been saved in vmcb01  <-- bad KVM
	 */
	kvm_set_rflags(vcpu, svm->vmcb->save.rflags);
	svm_set_efer(vcpu, svm->vmcb->save.efer);
	svm_set_cr0(vcpu, svm->vmcb->save.cr0 | X86_CR0_PE);
	svm_set_cr4(vcpu, svm->vmcb->save.cr4);
	kvm_rax_write(vcpu, svm->vmcb->save.rax);
	kvm_rsp_write(vcpu, svm->vmcb->save.rsp);
	kvm_rip_write(vcpu, svm->vmcb->save.rip);

	svm->vcpu.arch.dr7 = DR7_FIXED_1;
	kvm_update_dr7(&svm->vcpu);

	...

	kvm_vcpu_unmap(vcpu, &map, true);

	nested_svm_transition_tlb_flush(vcpu);

	nested_svm_uninit_mmu_context(vcpu);

	rc = nested_svm_load_cr3(vcpu, svm->vmcb->save.cr3, false, true);
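
For illustration only, an hsave-based variant might look something like
this (rough sketch; the mapping/unmapping of the area and the exact layout
of the host save-state page are glossed over):

	struct vmcb_save_area *hsave = map_save.hva;	/* L1's MSR_VM_HSAVE_PA page */

	kvm_set_rflags(vcpu, hsave->rflags);
	svm_set_efer(vcpu, hsave->efer);
	svm_set_cr0(vcpu, hsave->cr0 | X86_CR0_PE);
	svm_set_cr4(vcpu, hsave->cr4);
	kvm_rax_write(vcpu, hsave->rax);
	kvm_rsp_write(vcpu, hsave->rsp);
	kvm_rip_write(vcpu, hsave->rip);
	...
	rc = nested_svm_load_cr3(vcpu, hsave->cr3, false, true);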


2021-06-23 16:23:49

by Sean Christopherson

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Wed, Jun 23, 2021, Sean Christopherson wrote:
> And I believe this hackery is necessary only because nested_svm_vmexit() isn't
> following the architecture in the first place. I.e. using vmcb01 to restore
> host state is flat out wrong.

Ah, that's not true, using vmcb01 is allowed by "may store some or all host state
in hidden on-chip memory".

From a performance perspective, I do like the SMI/RSM shenanigans. I'm not
totally opposed to the trickery since I think it will break a guest if and only
if the L1 guest is also violating the APM. And we're not fudging the spec thaat
much :-)

2021-06-23 20:40:22

by Paolo Bonzini

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On 23/06/21 18:21, Sean Christopherson wrote:
> On Wed, Jun 23, 2021, Sean Christopherson wrote:
>> And I believe this hackery is necessary only because nested_svm_vmexit() isn't
>> following the architecture in the first place. I.e. using vmcb01 to restore
>> host state is flat out wrong.
>
> Ah, that's not true, using vmcb01 is allowed by "may store some or all host state
> in hidden on-chip memory".

And also, "Different implementations may choose to save the hidden parts
of the host’s segment registers as well as the selectors".

> From a performance perspective, I do like the SMI/RSM shenanigans. I'm not
> totally opposed to the trickery since I think it will break a guest if and only
> if the L1 guest is also violating the APM. And we're not fudging the spec thaat
> much :-)

Yeah, that was my reasoning as well. Any reference to "hidden on-chip
memory", plus the fact that modifying the host save area is forbidden, sort
of implies that the processor can actually flush that hidden on-chip
memory for whatever reason (such as on some sleep states?!?).

Paolo

2021-06-24 07:43:54

by Vitaly Kuznetsov

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

Paolo Bonzini <[email protected]> writes:

> On 23/06/21 18:21, Sean Christopherson wrote:
>> On Wed, Jun 23, 2021, Sean Christopherson wrote:
>>> And I believe this hackery is necessary only because nested_svm_vmexit() isn't
>>> following the architecture in the first place. I.e. using vmcb01 to restore
>>> host state is flat out wrong.
>>
>> Ah, that's not true, using vmcb01 is allowed by "may store some or all host state
>> in hidden on-chip memory".
>
> And also, "Different implementations may choose to save the hidden parts
> of the host’s segment registers as well as the selectors".
>
>> From a performance perspective, I do like the SMI/RSM shenanigans. I'm not
>> totally opposed to the trickery since I think it will break a guest if and only
>> if the L1 guest is also violating the APM. And we're not fudging the spec thaat
>> much :-)
>
> Yeah, that was my reasoning as well. Any reference to "hidden on-chip
> memory", plus the fact that modifying the host save area is forbidden, sort
> of implies that the processor can actually flush that hidden on-chip
> memory for whatever reason (such as on some sleep states?!?).

Ok, so it seems nobody feels strongly against the idea I've implemented
in the RFC. We could have avoided saving L1 host state in vmcb01 upon L2
entry altogether, true, but we would still need some sort of a cache
emulating "hidden on-chip memory" for performance reasons. Resurrecting
'hsave' / allocating a 'special' vmcb01_smm/... and modifying nested state
seems to be unneeded; the L2->SMM case should be rare indeed.

I'll add a testcase to the smm selftest and submit v1 then, thanks!
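
(For the host side of such a test, the key bit is just injecting an SMI
while the vCPU sits in L2; a hypothetical sketch using the selftests
helpers:

	/* vCPU is known to be executing in L2 at this point */
	vcpu_ioctl(vm, VCPU_ID, KVM_SMI, NULL);
	vcpu_run(vm, VCPU_ID);
	/*
	 * The guest's SMM handler RSMs back to L2; L2 then exits to L1,
	 * which checks that its own (L1) state was not corrupted.
	 */

with the pass/fail reporting done via the usual ucall plumbing.)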

--
Vitaly

2021-06-24 08:22:51

by Maxim Levitsky

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Wed, 2021-06-23 at 22:37 +0200, Paolo Bonzini wrote:
> On 23/06/21 18:21, Sean Christopherson wrote:
> > On Wed, Jun 23, 2021, Sean Christopherson wrote:
> > > And I believe this hackery is necessary only because nested_svm_vmexit() isn't
> > > following the architecture in the first place. I.e. using vmcb01 to restore
> > > host state is flat out wrong.
> >
> > Ah, that's not true, using vmcb01 is allowed by "may store some or all host state
> > in hidden on-chip memory".
>
> And also, "Different implementations may choose to save the hidden parts
> of the host’s segment registers as well as the selectors".
>
> > From a performance perspective, I do like the SMI/RSM shenanigans. I'm not
> > totally opposed to the trickery since I think it will break a guest if and only
> > if the L1 guest is also violating the APM. And we're not fudging the spec thaat
> > much :-)
>
> Yeah, that was my reasoning as well. Any reference to "hidden on-chip
> memory", plus the fact that modifying the host save area is forbidden, sort
> of implies that the processor can actually flush that hidden on-chip
> memory for whatever reason (such as on some sleep states?!?).
>
> Paolo
>

Let me explain my thoughts about this again, now that I hopefully
understand this correctly:

L1 register state is stored in VMCB01 since the CPU stores it there
every time L1 VMexits.

Now L1 is a host for L2, and therefore this is also L1 "host" state.

When we switch to L2, it is therefore natural to keep that state in
vmcb01, and not copy it to VM_HSAVE_PA area.

So when running SMM code in L1, since we must not corrupt this state,
it makes sense to back it up somewhere.
As I said, I would prefer to store/migrate it out of band instead of
saving it in guest memory, although Paolo's reasoning that the CPU
might *sometimes* write to VM_HSAVE_PA, and that *sometimes* is allowed
to be when we enter SMM, does make sense.
Thus I don't have a strong opinion on this anymore, as long as it works.

Something else to note, just for our information, is that KVM
these days does vmsave/vmload to VM_HSAVE_PA to store/restore
the additional host state, something that is frowned upon in the spec,
but there is some justification for doing this in the commit message,
citing an old spec which allowed this.
This shouldn't affect anything, but just FYI.

https://patchwork.kernel.org/project/kvm/patch/[email protected]/

Best regards,
Maxim Levitsky

2021-06-24 10:40:53

by Paolo Bonzini

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On 24/06/21 10:20, Maxim Levitsky wrote:
> Something else to note, just for our information, is that KVM
> these days does vmsave/vmload to VM_HSAVE_PA to store/restore
> the additional host state, something that is frowned upon in the spec,
> but there is some justification for doing this in the commit message,
> citing an old spec which allowed this.

True that. And there is no mention in the specification for VMRUN that
the host state-save area is a subset of the VMCB format (i.e., that it
uses VMCB offsets for whatever subset of the state it saves in the
VMCB), so the spec reference in the commit message is incorrect. It
would be nice if the spec guaranteed that. Michael, Tom?

In fact, Vitaly's patch *will* overwrite the vmsave/vmload parts of
VM_HSAVE_PA, and it will store the L2 values rather than the L1 values,
because KVM always does its vmload/vmrun/vmsave sequence using
vmload(vmcb01) and vmsave(vmcb01)! So that has to be changed to use
code similar to svm_set_nested_state (which can be moved to a separate
function and reused):

	dest->es = src->es;
	dest->cs = src->cs;
	dest->ss = src->ss;
	dest->ds = src->ds;
	dest->gdtr = src->gdtr;
	dest->idtr = src->idtr;
	dest->rflags = src->rflags | X86_EFLAGS_FIXED;
	dest->efer = src->efer;
	dest->cr0 = src->cr0;
	dest->cr3 = src->cr3;
	dest->cr4 = src->cr4;
	dest->rax = src->rax;
	dest->rsp = src->rsp;
	dest->rip = src->rip;
	dest->cpl = 0;
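
Sketched as a reusable helper, it might look like this (the function name
is made up here; whatever actually lands in the tree may differ):

	/*
	 * Copy the host state consumed by VMRUN/#VMEXIT between two save
	 * areas; the body is exactly the assignment list above, callable
	 * from svm_set_nested_state() and the SMM paths.
	 */
	static void svm_copy_vmrun_state(struct vmcb_save_area *dest,
					 struct vmcb_save_area *src);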


Paolo

2021-06-24 14:34:03

by Tom Lendacky

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On 6/24/21 5:38 AM, Paolo Bonzini wrote:
> On 24/06/21 10:20, Maxim Levitsky wrote:
>> Something else to note, just for our information, is that KVM
>> these days does vmsave/vmload to VM_HSAVE_PA to store/restore
>> the additional host state, something that is frowned upon in the spec,
>> but there is some justification for doing this in the commit message,
>> citing an old spec which allowed this.
>
> True that.  And there is no mention in the specification for VMRUN that
> the host state-save area is a subset of the VMCB format (i.e., that it
> uses VMCB offsets for whatever subset of the state it saves in the VMCB),
> so the spec reference in the commit message is incorrect.  It would be
> nice if the spec guaranteed that.  Michael, Tom?

So that is (now) stated in APM volume 2, Appendix B in the paragraph after
Table B-3, where it starts "The format of the host save area is identical
to the guest save area described in the table below, except that ..."

Thanks,
Tom


2021-06-24 15:38:09

by Maxim Levitsky

Subject: Re: [PATCH RFC] KVM: nSVM: Fix L1 state corruption upon return from SMM

On Thu, 2021-06-24 at 09:32 -0500, Tom Lendacky wrote:
> On 6/24/21 5:38 AM, Paolo Bonzini wrote:
> > On 24/06/21 10:20, Maxim Levitsky wrote:
> > > Something else to note, just for our information, is that KVM
> > > these days does vmsave/vmload to VM_HSAVE_PA to store/restore
> > > the additional host state, something that is frowned upon in the spec,
> > > but there is some justification for doing this in the commit message,
> > > citing an old spec which allowed this.
> >
> > True that. And there is no mention in the specification for VMRUN that
> > the host state-save area is a subset of the VMCB format (i.e., that it
> > uses VMCB offsets for whatever subset of the state it saves in the VMCB),
> > so the spec reference in the commit message is incorrect. It would be
> > nice if the spec guaranteed that. Michael, Tom?
>
> So that is (now) stated in APM volume 2, Appendix B in the paragraph after
> Table B-3, where it starts "The format of the host save area is identical
> to the guest save area described in the table below, except that ..."

This is a very good find! I wish it had been written in the commit message
of the commit that added the vmsave/vmload to the VM_HSAVE_PA area.

Maybe we should add a comment to the code pointing to this location in the APM.

Thanks,
Best regards,
Maxim Levitsky
