2022-03-11 05:26:23

by Maciej S. Szmigiero

[permalink] [raw]
Subject: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

From: "Maciej S. Szmigiero" <[email protected]>

The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
This field value (instead of the saved guest RIP) in used by the CPU for
the return address pushed on stack when injecting a software interrupt or
INT3 or INTO exception.

Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
loading a nested state.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/x86/kvm/svm/nested.c | 4 ++++
arch/x86/kvm/svm/svm.h | 1 +
2 files changed, 5 insertions(+)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d736ec6514ca..9656f0d6815c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -366,6 +366,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
to->nested_ctl = from->nested_ctl;
to->event_inj = from->event_inj;
to->event_inj_err = from->event_inj_err;
+ to->next_rip = from->next_rip;
to->nested_cr3 = from->nested_cr3;
to->virt_ext = from->virt_ext;
to->pause_filter_count = from->pause_filter_count;
@@ -638,6 +639,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
svm->vmcb->control.int_state = svm->nested.ctl.int_state;
svm->vmcb->control.event_inj = svm->nested.ctl.event_inj;
svm->vmcb->control.event_inj_err = svm->nested.ctl.event_inj_err;
+ /* The return address pushed on stack by the CPU for some injected events */
+ svm->vmcb->control.next_rip = svm->nested.ctl.next_rip;

if (!nested_vmcb_needs_vls_intercept(svm))
svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
@@ -1348,6 +1351,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
dst->nested_ctl = from->nested_ctl;
dst->event_inj = from->event_inj;
dst->event_inj_err = from->event_inj_err;
+ dst->next_rip = from->next_rip;
dst->nested_cr3 = from->nested_cr3;
dst->virt_ext = from->virt_ext;
dst->pause_filter_count = from->pause_filter_count;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 93502d2a52ce..f757400fc933 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -138,6 +138,7 @@ struct vmcb_ctrl_area_cached {
u64 nested_ctl;
u32 event_inj;
u32 event_inj_err;
+ u64 next_rip;
u64 nested_cr3;
u64 virt_ext;
u32 clean;


2022-04-03 16:47:06

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> On 1.04.2022 20:32, Sean Christopherson wrote:
> > On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> > > + /* The return address pushed on stack by the CPU for some injected events */
> > > + svm->vmcb->control.next_rip = svm->nested.ctl.next_rip;
> >
> > This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.
> >
> > if (svm->nrips_enabled)
> > vmcb02->control.next_rip = svm->nested.ctl.next_rip;
>
> It can be done, however what if we run on a nrips-capable CPU,
> but don't expose this capability to the L1?

Oh, right, because the field will be populated by the CPU on VM-Exit. Ah, the
correct behavior is to grab RIP from vmcb12 to emulate nrips=0 hardware simply
not updating RIP. E.g. zeroing it out would send L2 into the weeds on IRET due
the CPU pushing '0' on the stack when vectoring the injected event.

if (svm->nrips_enabled)
vmcb02->control.next_rip = svm->nested.ctl.next_rip;
else if (boot_cpu_has(X86_FEATURE_NRIPS))
vmcb02->control.next_rip = vmcb12_rip;

> The CPU will then push whatever value was left in this field as
> the return address for some L1 injected events.
>
> Although without nrips feature the L1 shouldn't even attempt event
> injection, copying this field anyway will make it work if L1 just
> expects this capability based on the current CPU model rather than
> by checking specific CPUID feature bits.

L1 may still inject the exception, it just advances the RIP manually. As above,
the really messy thing is that, because there's no flag to say "don't use NextRIP!",
the CPU will still consume NextRIP and push '0' on the stack for the return RIP
from the INTn/INT3/INTO. Yay.

I found that out the hard way (patch in-progress). The way to handle event
injection if KVM is loaded with nrips=0 but nrips is supported in hardware is to
stuff NextRIP on event injection even if nrips=0, otherwise the guest is hosed.

> > > + u64 next_rip;
> > > u64 nested_cr3;
> > > u64 virt_ext;
> > > u32 clean;
> >
> > I don't know why this struct has
> >
> > u8 reserved_sw[32];
> >
> > but presumably it's for padding, i.e. probably should be reduced to 24 bytes.
>
> Apparently the "reserved_sw" field stores Hyper-V enlightenments state -
> see commit 66c03a926f18 ("KVM: nSVM: Implement Enlightened MSR-Bitmap feature")
> and nested_svm_vmrun_msrpm() in nested.c.

Argh, that's a terrible name. Thanks for doing the homework, I was being lazy.

2022-04-04 21:26:59

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
> This field value (instead of the saved guest RIP) in used by the CPU for
> the return address pushed on stack when injecting a software interrupt or
> INT3 or INTO exception.
>
> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
> loading a nested state.
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---
> arch/x86/kvm/svm/nested.c | 4 ++++
> arch/x86/kvm/svm/svm.h | 1 +
> 2 files changed, 5 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index d736ec6514ca..9656f0d6815c 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -366,6 +366,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> to->nested_ctl = from->nested_ctl;
> to->event_inj = from->event_inj;
> to->event_inj_err = from->event_inj_err;
> + to->next_rip = from->next_rip;
> to->nested_cr3 = from->nested_cr3;
> to->virt_ext = from->virt_ext;
> to->pause_filter_count = from->pause_filter_count;
> @@ -638,6 +639,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> svm->vmcb->control.int_state = svm->nested.ctl.int_state;
> svm->vmcb->control.event_inj = svm->nested.ctl.event_inj;
> svm->vmcb->control.event_inj_err = svm->nested.ctl.event_inj_err;
> + /* The return address pushed on stack by the CPU for some injected events */
> + svm->vmcb->control.next_rip = svm->nested.ctl.next_rip;

This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.

if (svm->nrips_enabled)
vmcb02->control.next_rip = svm->nested.ctl.next_rip;

though I don't see any reason to add the condition to the copy to/from cache flows.

> if (!nested_vmcb_needs_vls_intercept(svm))
> svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
> @@ -1348,6 +1351,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
> dst->nested_ctl = from->nested_ctl;
> dst->event_inj = from->event_inj;
> dst->event_inj_err = from->event_inj_err;
> + dst->next_rip = from->next_rip;
> dst->nested_cr3 = from->nested_cr3;
> dst->virt_ext = from->virt_ext;
> dst->pause_filter_count = from->pause_filter_count;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 93502d2a52ce..f757400fc933 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -138,6 +138,7 @@ struct vmcb_ctrl_area_cached {
> u64 nested_ctl;
> u32 event_inj;
> u32 event_inj_err;
> + u64 next_rip;
> u64 nested_cr3;
> u64 virt_ext;
> u32 clean;

I don't know why this struct has

u8 reserved_sw[32];

but presumably it's for padding, i.e. probably should be reduced to 24 bytes.

2022-04-05 01:03:28

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On Fri, 2022-04-01 at 21:51 +0000, Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> > On 1.04.2022 20:32, Sean Christopherson wrote:
> > > On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> > > > + /* The return address pushed on stack by the CPU for some injected events */
> > > > + svm->vmcb->control.next_rip = svm->nested.ctl.next_rip;
> > >
> > > This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.
> > >
> > > if (svm->nrips_enabled)
> > > vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> >
> > It can be done, however what if we run on a nrips-capable CPU,
> > but don't expose this capability to the L1?
>
> Oh, right, because the field will be populated by the CPU on VM-Exit. Ah, the
> correct behavior is to grab RIP from vmcb12 to emulate nrips=0 hardware simply
> not updating RIP. E.g. zeroing it out would send L2 into the weeds on IRET due
> the CPU pushing '0' on the stack when vectoring the injected event.
>
> if (svm->nrips_enabled)
> vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> else if (boot_cpu_has(X86_FEATURE_NRIPS))
> vmcb02->control.next_rip = vmcb12_rip;
>
> > The CPU will then push whatever value was left in this field as
> > the return address for some L1 injected events.

This makes sense.

Note that even AMD's PRM has a note about this:

"
15.20 Event Injection
...
Software interrupts cannot be properly injected if the processor does not support the NextRIP field.
Support is indicated by CPUID Fn8000_000A_EDX[NRIPS] = 1. Hypervisor software should
emulate the event injection of software interrupts if NextRIP is not supported
"



> >
> > Although without nrips feature the L1 shouldn't even attempt event
> > injection, copying this field anyway will make it work if L1 just
> > expects this capability based on the current CPU model rather than
> > by checking specific CPUID feature bits.

The guest really ought to check CPUID bits. Plus the CPU model is also
usually virtualized (for named machine types in Qemu for example).

>
> L1 may still inject the exception, it just advances the RIP manually. As above,
> the really messy thing is that, because there's no flag to say "don't use NextRIP!",
> the CPU will still consume NextRIP and push '0' on the stack for the return RIP
> from the INTn/INT3/INTO. Yay.
>
> I found that out the hard way (patch in-progress). The way to handle event
> injection if KVM is loaded with nrips=0 but nrips is supported in hardware is to
> stuff NextRIP on event injection even if nrips=0, otherwise the guest is hosed.
>
> > > > + u64 next_rip;
> > > > u64 nested_cr3;
> > > > u64 virt_ext;
> > > > u32 clean;
> > >
> > > I don't know why this struct has
> > >
> > > u8 reserved_sw[32];
> > >
> > > but presumably it's for padding, i.e. probably should be reduced to 24 bytes.
> >
> > Apparently the "reserved_sw" field stores Hyper-V enlightenments state -
> > see commit 66c03a926f18 ("KVM: nSVM: Implement Enlightened MSR-Bitmap feature")
> > and nested_svm_vmrun_msrpm() in nested.c.
>
> Argh, that's a terrible name. Thanks for doing the homework, I was being lazy.


That was added around the commit

1183646a67d01 ("KVM: SVM: hyper-v: Direct Virtual Flush support")

Seems to be used by HV to store 'struct hv_enlightenments',
but I don't know 100% if that is the only thing that can be stored
in this area.



Best regards,
Maxim Levitsky


>


2022-04-05 01:25:18

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On 1.04.2022 20:32, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <[email protected]>
>>
>> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
>> This field value (instead of the saved guest RIP) in used by the CPU for
>> the return address pushed on stack when injecting a software interrupt or
>> INT3 or INTO exception.
>>
>> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
>> loading a nested state.
>>
>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>> ---
>> arch/x86/kvm/svm/nested.c | 4 ++++
>> arch/x86/kvm/svm/svm.h | 1 +
>> 2 files changed, 5 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index d736ec6514ca..9656f0d6815c 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -366,6 +366,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
>> to->nested_ctl = from->nested_ctl;
>> to->event_inj = from->event_inj;
>> to->event_inj_err = from->event_inj_err;
>> + to->next_rip = from->next_rip;
>> to->nested_cr3 = from->nested_cr3;
>> to->virt_ext = from->virt_ext;
>> to->pause_filter_count = from->pause_filter_count;
>> @@ -638,6 +639,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
>> svm->vmcb->control.int_state = svm->nested.ctl.int_state;
>> svm->vmcb->control.event_inj = svm->nested.ctl.event_inj;
>> svm->vmcb->control.event_inj_err = svm->nested.ctl.event_inj_err;
>> + /* The return address pushed on stack by the CPU for some injected events */
>> + svm->vmcb->control.next_rip = svm->nested.ctl.next_rip;
>
> This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.
>
> if (svm->nrips_enabled)
> vmcb02->control.next_rip = svm->nested.ctl.next_rip;

It can be done, however what if we run on a nrips-capable CPU,
but don't expose this capability to the L1?

The CPU will then push whatever value was left in this field as
the return address for some L1 injected events.

Although without nrips feature the L1 shouldn't even attempt event
injection, copying this field anyway will make it work if L1 just
expects this capability based on the current CPU model rather than
by checking specific CPUID feature bits.

> though I don't see any reason to add the condition to the copy to/from cache flows.
>
>> if (!nested_vmcb_needs_vls_intercept(svm))
>> svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
>> @@ -1348,6 +1351,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
>> dst->nested_ctl = from->nested_ctl;
>> dst->event_inj = from->event_inj;
>> dst->event_inj_err = from->event_inj_err;
>> + dst->next_rip = from->next_rip;
>> dst->nested_cr3 = from->nested_cr3;
>> dst->virt_ext = from->virt_ext;
>> dst->pause_filter_count = from->pause_filter_count;
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index 93502d2a52ce..f757400fc933 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -138,6 +138,7 @@ struct vmcb_ctrl_area_cached {
>> u64 nested_ctl;
>> u32 event_inj;
>> u32 event_inj_err;
>> + u64 next_rip;
>> u64 nested_cr3;
>> u64 virt_ext;
>> u32 clean;
>
> I don't know why this struct has
>
> u8 reserved_sw[32];
>
> but presumably it's for padding, i.e. probably should be reduced to 24 bytes.

Apparently the "reserved_sw" field stores Hyper-V enlightenments state -
see commit 66c03a926f18 ("KVM: nSVM: Implement Enlightened MSR-Bitmap feature")
and nested_svm_vmrun_msrpm() in nested.c.

Thanks,
Maciej