2022-04-04 21:43:21

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

From: Maciej S. Szmigiero <[email protected]>

The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
This field value (instead of the saved guest RIP) in used by the CPU for
the return address pushed on stack when injecting a software interrupt or
INT3 or INTO exception.

Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
loading a nested state and NRIPS is exposed to L1. If NRIPS is supported
in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
saves RIP on the stack as-is).

Signed-off-by: Maciej S. Szmigiero <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/nested.c | 22 +++++++++++++++++++---
arch/x86/kvm/svm/svm.h | 1 +
2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 73b545278f5f..9a6dc2b38fcf 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -369,6 +369,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
to->nested_ctl = from->nested_ctl;
to->event_inj = from->event_inj;
to->event_inj_err = from->event_inj_err;
+ to->next_rip = from->next_rip;
to->nested_cr3 = from->nested_cr3;
to->virt_ext = from->virt_ext;
to->pause_filter_count = from->pause_filter_count;
@@ -606,7 +607,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
}
}

-static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
+static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
+ unsigned long vmcb12_rip)
{
u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
@@ -660,6 +662,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
vmcb02->control.event_inj = svm->nested.ctl.event_inj;
vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;

+ /*
+ * next_rip is consumed on VMRUN as the return address pushed on the
+ * stack for injected soft exceptions/interrupts. If nrips is exposed
+ * to L1, take it verbatim from vmcb12. If nrips is supported in
+ * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
+ * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
+ * prior to injecting the event).
+ */
+ if (svm->nrips_enabled)
+ vmcb02->control.next_rip = svm->nested.ctl.next_rip;
+ else if (boot_cpu_has(X86_FEATURE_NRIPS))
+ vmcb02->control.next_rip = vmcb12_rip;
+
vmcb02->control.virt_ext = vmcb01->control.virt_ext &
LBR_CTL_ENABLE_MASK;
if (svm->lbrv_enabled)
@@ -743,7 +758,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);

svm_switch_vmcb(svm, &svm->nested.vmcb02);
- nested_vmcb02_prepare_control(svm);
+ nested_vmcb02_prepare_control(svm, vmcb12->save.rip);
nested_vmcb02_prepare_save(svm, vmcb12);

ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
@@ -1422,6 +1437,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
dst->nested_ctl = from->nested_ctl;
dst->event_inj = from->event_inj;
dst->event_inj_err = from->event_inj_err;
+ dst->next_rip = from->next_rip;
dst->nested_cr3 = from->nested_cr3;
dst->virt_ext = from->virt_ext;
dst->pause_filter_count = from->pause_filter_count;
@@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
nested_copy_vmcb_control_to_cache(svm, ctl);

svm_switch_vmcb(svm, &svm->nested.vmcb02);
- nested_vmcb02_prepare_control(svm);
+ nested_vmcb02_prepare_control(svm, save->rip);

/*
* While the nested guest CR3 is already checked and set by
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e246793cbeae..47e7427d0395 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -139,6 +139,7 @@ struct vmcb_ctrl_area_cached {
u64 nested_ctl;
u32 event_inj;
u32 event_inj_err;
+ u64 next_rip;
u64 nested_cr3;
u64 virt_ext;
u32 clean;
--
2.35.1.1094.g7c7d902a7c-goog


2022-04-04 23:41:11

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On 2.04.2022 03:08, Sean Christopherson wrote:
> From: Maciej S. Szmigiero <[email protected]>
>
> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
> This field value (instead of the saved guest RIP) in used by the CPU for
> the return address pushed on stack when injecting a software interrupt or
> INT3 or INTO exception.
>
> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
> loading a nested state and NRIPS is exposed to L1. If NRIPS is supported
> in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
> vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
> saves RIP on the stack as-is).
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/svm/nested.c | 22 +++++++++++++++++++---
> arch/x86/kvm/svm/svm.h | 1 +
> 2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 73b545278f5f..9a6dc2b38fcf 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -369,6 +369,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> to->nested_ctl = from->nested_ctl;
> to->event_inj = from->event_inj;
> to->event_inj_err = from->event_inj_err;
> + to->next_rip = from->next_rip;
> to->nested_cr3 = from->nested_cr3;
> to->virt_ext = from->virt_ext;
> to->pause_filter_count = from->pause_filter_count;
> @@ -606,7 +607,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
> }
> }
>
> -static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> +static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> + unsigned long vmcb12_rip)
> {
> u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
> u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
> @@ -660,6 +662,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> vmcb02->control.event_inj = svm->nested.ctl.event_inj;
> vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
>
> + /*
> + * next_rip is consumed on VMRUN as the return address pushed on the
> + * stack for injected soft exceptions/interrupts. If nrips is exposed
> + * to L1, take it verbatim from vmcb12. If nrips is supported in
> + * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
> + * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
> + * prior to injecting the event).
> + */
> + if (svm->nrips_enabled)
> + vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> + else if (boot_cpu_has(X86_FEATURE_NRIPS))
> + vmcb02->control.next_rip = vmcb12_rip;
> +
> vmcb02->control.virt_ext = vmcb01->control.virt_ext &
> LBR_CTL_ENABLE_MASK;
> if (svm->lbrv_enabled)
> @@ -743,7 +758,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
> nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>
> svm_switch_vmcb(svm, &svm->nested.vmcb02);
> - nested_vmcb02_prepare_control(svm);
> + nested_vmcb02_prepare_control(svm, vmcb12->save.rip);
> nested_vmcb02_prepare_save(svm, vmcb12);
>
> ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
> @@ -1422,6 +1437,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
> dst->nested_ctl = from->nested_ctl;
> dst->event_inj = from->event_inj;
> dst->event_inj_err = from->event_inj_err;
> + dst->next_rip = from->next_rip;
> dst->nested_cr3 = from->nested_cr3;
> dst->virt_ext = from->virt_ext;
> dst->pause_filter_count = from->pause_filter_count;
> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> nested_copy_vmcb_control_to_cache(svm, ctl);
>
> svm_switch_vmcb(svm, &svm->nested.vmcb02);
> - nested_vmcb02_prepare_control(svm);
> + nested_vmcb02_prepare_control(svm, save->rip);
>

^
I guess this should be "svm->vmcb->save.rip", since
KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
not vmcb{0,1}2 (in contrast to the "control" field).


Thanks,
Maciej

2022-04-05 00:13:43

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On 4.04.2022 19:21, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>>> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>>> nested_copy_vmcb_control_to_cache(svm, ctl);
>>> svm_switch_vmcb(svm, &svm->nested.vmcb02);
>>> - nested_vmcb02_prepare_control(svm);
>>> + nested_vmcb02_prepare_control(svm, save->rip);
>>
>> ^
>> I guess this should be "svm->vmcb->save.rip", since
>> KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
>> not vmcb{0,1}2 (in contrast to the "control" field).
>
> Argh, yes. Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
> If not, this will result in garbage being loaded into vmcb02.

I'm not sure about particular KVM API guarantees,
but looking at the code I guess it is supposed to handle both cases:
1) VMM loads the usual basic KVM state via KVM_SET_{S,}REGS then immediately
issues KVM_SET_NESTED_STATE to load the remaining nested data.

Assuming that it was the L2 that was running at the save time,
at first the basic L2 state will be loaded into vmcb01,
then at KVM_SET_NESTED_STATE time:
> if (is_guest_mode(vcpu))
> svm_leave_nested(vcpu);
> else
> svm->nested.vmcb02.ptr->save = svm->vmcb01.ptr->save;

The !is_guest_mode(vcpu) branch will be taken (since the new VM haven't entered
the guest mode yet), which will copy the basic L2 state from vmcb01 to vmcb02 and
then the remaining code will restore vmcb01 save and vmcb{0,1}2 control normally and
then enter the guest mode.

2) VMM first issues KVM_SET_NESTED_STATE then immediately loads the basic state.

Sane as the above, only some initial VM state will be copied into vmcb02 from vmcb01
by the code mentioned above, then vmcb01 save and vmcb{0,1}2 control will be restored
and guest mode will be entered.
If the VMM then immediately issues KVM_SET_{S,}REGS then it will restore L2 basic state
straight into vmcb02.

However, this all is my guess work from just looking at the relevant code,
I haven't run any tests to make sure that I haven't missed something.

Thanks,
Maciej

2022-04-05 00:45:40

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> > @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> > nested_copy_vmcb_control_to_cache(svm, ctl);
> > svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > - nested_vmcb02_prepare_control(svm);
> > + nested_vmcb02_prepare_control(svm, save->rip);
>
> ^
> I guess this should be "svm->vmcb->save.rip", since
> KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
> not vmcb{0,1}2 (in contrast to the "control" field).

Argh, yes. Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
If not, this will result in garbage being loaded into vmcb02.

2022-04-05 02:36:51

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On Sat, 2022-04-02 at 01:08 +0000, Sean Christopherson wrote:
> From: Maciej S. Szmigiero <[email protected]>
>
> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
> This field value (instead of the saved guest RIP) in used by the CPU for
> the return address pushed on stack when injecting a software interrupt or
> INT3 or INTO exception.
>
> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
> loading a nested state and NRIPS is exposed to L1. If NRIPS is supported
> in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
> vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
> saves RIP on the stack as-is).
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/svm/nested.c | 22 +++++++++++++++++++---
> arch/x86/kvm/svm/svm.h | 1 +
> 2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 73b545278f5f..9a6dc2b38fcf 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -369,6 +369,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> to->nested_ctl = from->nested_ctl;
> to->event_inj = from->event_inj;
> to->event_inj_err = from->event_inj_err;
> + to->next_rip = from->next_rip;
> to->nested_cr3 = from->nested_cr3;
> to->virt_ext = from->virt_ext;
> to->pause_filter_count = from->pause_filter_count;
> @@ -606,7 +607,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
> }
> }
>
> -static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> +static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> + unsigned long vmcb12_rip)
> {
> u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
> u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
> @@ -660,6 +662,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> vmcb02->control.event_inj = svm->nested.ctl.event_inj;
> vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
>
> + /*
> + * next_rip is consumed on VMRUN as the return address pushed on the
> + * stack for injected soft exceptions/interrupts. If nrips is exposed
> + * to L1, take it verbatim from vmcb12. If nrips is supported in
> + * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
> + * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
> + * prior to injecting the event).
> + */
> + if (svm->nrips_enabled)
> + vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> + else if (boot_cpu_has(X86_FEATURE_NRIPS))
> + vmcb02->control.next_rip = vmcb12_rip;
> +
> vmcb02->control.virt_ext = vmcb01->control.virt_ext &
> LBR_CTL_ENABLE_MASK;
> if (svm->lbrv_enabled)
> @@ -743,7 +758,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
> nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>
> svm_switch_vmcb(svm, &svm->nested.vmcb02);
> - nested_vmcb02_prepare_control(svm);
> + nested_vmcb02_prepare_control(svm, vmcb12->save.rip);
> nested_vmcb02_prepare_save(svm, vmcb12);
>
> ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
> @@ -1422,6 +1437,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
> dst->nested_ctl = from->nested_ctl;
> dst->event_inj = from->event_inj;
> dst->event_inj_err = from->event_inj_err;
> + dst->next_rip = from->next_rip;
> dst->nested_cr3 = from->nested_cr3;
> dst->virt_ext = from->virt_ext;
> dst->pause_filter_count = from->pause_filter_count;
> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> nested_copy_vmcb_control_to_cache(svm, ctl);
>
> svm_switch_vmcb(svm, &svm->nested.vmcb02);
> - nested_vmcb02_prepare_control(svm);
> + nested_vmcb02_prepare_control(svm, save->rip);
>
> /*
> * While the nested guest CR3 is already checked and set by
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index e246793cbeae..47e7427d0395 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -139,6 +139,7 @@ struct vmcb_ctrl_area_cached {
> u64 nested_ctl;
> u32 event_inj;
> u32 event_inj_err;
> + u64 next_rip;
> u64 nested_cr3;
> u64 virt_ext;
> u32 clean;

Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky

2022-04-22 04:17:55

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On 20.04.2022 17:00, Paolo Bonzini wrote:
> On 4/4/22 19:21, Sean Christopherson wrote:
>> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>>>> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>>>>        nested_copy_vmcb_control_to_cache(svm, ctl);
>>>>        svm_switch_vmcb(svm, &svm->nested.vmcb02);
>>>> -    nested_vmcb02_prepare_control(svm);
>>>> +    nested_vmcb02_prepare_control(svm, save->rip);
>>>
>>>                        ^
>>> I guess this should be "svm->vmcb->save.rip", since
>>> KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
>>> not vmcb{0,1}2 (in contrast to the "control" field).
>>
>> Argh, yes.  Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
>> If not, this will result in garbage being loaded into vmcb02.
>>
>
> Let's just require X86_FEATURE_NRIPS, either in general or just to
> enable nested virtualiazation

????

>
> If I looked it up correctly it was introduced around 2010-2011.

A quick Internet search showed that the first CPUs with NextRIP were
the second-generation Family 10h CPUs (Phenom II, Athlon II, etc.).
They started being released in early 2009.

> Paolo

Thanks,
Maciej

2022-04-22 06:03:01

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On 4/4/22 19:21, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>>> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>>> nested_copy_vmcb_control_to_cache(svm, ctl);
>>> svm_switch_vmcb(svm, &svm->nested.vmcb02);
>>> - nested_vmcb02_prepare_control(svm);
>>> + nested_vmcb02_prepare_control(svm, save->rip);
>>
>> ^
>> I guess this should be "svm->vmcb->save.rip", since
>> KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
>> not vmcb{0,1}2 (in contrast to the "control" field).
>
> Argh, yes. Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
> If not, this will result in garbage being loaded into vmcb02.
>

Let's just require X86_FEATURE_NRIPS, either in general or just to
enable nested virtualiazation, i.e.:

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fc1725b7d05f..f8fc8a1b09f1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4904,10 +4904,12 @@ static __init int svm_hardware_setup(void)
goto err;
}

- if (nrips) {
- if (!boot_cpu_has(X86_FEATURE_NRIPS))
- nrips = false;
- }
+ if (!boot_cpu_has(X86_FEATURE_NRIPS))
+ nrips = false;
+ if (nested & !nrips) {
+ pr_warn("Next RIP Save not available, disabling nested virtualization\n");
+ nested = false;
+ }

enable_apicv = avic = avic && npt_enabled && (boot_cpu_has(X86_FEATURE_AVIC) || force_avic);


If I looked it up correctly it was introduced around 2010-2011.

Paolo

2022-04-22 19:43:17

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On Wed, Apr 20, 2022, Maciej S. Szmigiero wrote:
> On 20.04.2022 17:00, Paolo Bonzini wrote:
> > On 4/4/22 19:21, Sean Christopherson wrote:
> > > On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> > > > > @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> > > > >        nested_copy_vmcb_control_to_cache(svm, ctl);
> > > > >        svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > > > > -    nested_vmcb02_prepare_control(svm);
> > > > > +    nested_vmcb02_prepare_control(svm, save->rip);
> > > >
> > > >                        ^
> > > > I guess this should be "svm->vmcb->save.rip", since
> > > > KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
> > > > not vmcb{0,1}2 (in contrast to the "control" field).
> > >
> > > Argh, yes.  Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
> > > If not, this will result in garbage being loaded into vmcb02.
> > >
> >
> > Let's just require X86_FEATURE_NRIPS, either in general or just to
> > enable nested virtualiazation
>
> ????

Hmm, so requiring NRIPS for nested doesn't actually buy us anything. KVM still
has to deal with userspace hiding NRIPS from L1, so unless I'm overlooking something,
the only change would be:

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bdf8375a718b..7bed4e05aaea 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -686,7 +686,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
*/
if (svm->nrips_enabled)
vmcb02->control.next_rip = svm->nested.ctl.next_rip;
- else if (boot_cpu_has(X86_FEATURE_NRIPS))
+ else
vmcb02->control.next_rip = vmcb12_rip;

if (is_evtinj_soft(vmcb02->control.event_inj)) {

And sadly, because SVM doesn't provide the instruction length if an exit occurs
while vectoring a software interrupt/exception, making NRIPS mandatory doesn't buy
us much either.

I believe the below diff is the total savings (plus the above nested thing) against
this series if NRIPS is mandatory (ignoring the setup code, which is a wash). It
does eliminate the rewind in svm_complete_soft_interrupt() and the funky logic in
svm_update_soft_interrupt_rip(), but that's it AFAICT. The most obnoxious code of
having to unwind EMULTYPE_SKIP when retrieving the next RIP for software int/except
injection doesn't go away :-(

I'm not totally opposed to requiring NRIPS, but I'm not in favor of it either.

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 66cfd533aaf8..6b48af423246 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -354,7 +354,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
if (sev_es_guest(vcpu->kvm))
goto done;

- if (nrips && svm->vmcb->control.next_rip != 0) {
+ if (svm->vmcb->control.next_rip != 0) {
WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
svm->next_rip = svm->vmcb->control.next_rip;
}
@@ -401,7 +401,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
* in use, the skip must not commit any side effects such as clearing
* the interrupt shadow or RFLAGS.RF.
*/
- if (!__svm_skip_emulated_instruction(vcpu, !nrips))
+ if (!__svm_skip_emulated_instruction(vcpu, false))
return -EIO;

rip = kvm_rip_read(vcpu);
@@ -420,11 +420,8 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
svm->soft_int_old_rip = old_rip;
svm->soft_int_next_rip = rip;

- if (nrips)
- kvm_rip_write(vcpu, old_rip);
-
- if (static_cpu_has(X86_FEATURE_NRIPS))
- svm->vmcb->control.next_rip = rip;
+ kvm_rip_write(vcpu, old_rip);
+ svm->vmcb->control.next_rip = rip;

return 0;
}
@@ -3738,20 +3735,9 @@ static void svm_complete_soft_interrupt(struct kvm_vcpu *vcpu, u8 vector,
* the same event, i.e. if the event is a soft exception/interrupt,
* otherwise next_rip is unused on VMRUN.
*/
- if (nrips && (is_soft || (is_exception && kvm_exception_is_soft(vector))) &&
+ if ((is_soft || (is_exception && kvm_exception_is_soft(vector))) &&
kvm_is_linear_rip(vcpu, svm->soft_int_old_rip + svm->soft_int_csbase))
svm->vmcb->control.next_rip = svm->soft_int_next_rip;
- /*
- * If NextRIP isn't enabled, KVM must manually advance RIP prior to
- * injecting the soft exception/interrupt. That advancement needs to
- * be unwound if vectoring didn't complete. Note, the new event may
- * not be the injected event, e.g. if KVM injected an INTn, the INTn
- * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
- * be the reported vectored event, but RIP still needs to be unwound.
- */
- else if (!nrips && (is_soft || is_exception) &&
- kvm_is_linear_rip(vcpu, svm->soft_int_next_rip + svm->soft_int_csbase))
- kvm_rip_write(vcpu, svm->soft_int_old_rip);
}

static void svm_complete_interrupts(struct kvm_vcpu *vcpu)

2022-04-22 19:53:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On Wed, Apr 20, 2022, Paolo Bonzini wrote:
> On 4/20/22 18:15, Sean Christopherson wrote:
> > > > Let's just require X86_FEATURE_NRIPS, either in general or just to
> > > > enable nested virtualiazation
> > > ????
> > Hmm, so requiring NRIPS for nested doesn't actually buy us anything. KVM still
> > has to deal with userspace hiding NRIPS from L1, so unless I'm overlooking something,
> > the only change would be:
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index bdf8375a718b..7bed4e05aaea 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -686,7 +686,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> > */
> > if (svm->nrips_enabled)
> > vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> > - else if (boot_cpu_has(X86_FEATURE_NRIPS))
> > + else
> > vmcb02->control.next_rip = vmcb12_rip;
> >
> > if (is_evtinj_soft(vmcb02->control.event_inj)) {
> >
> > And sadly, because SVM doesn't provide the instruction length if an exit occurs
> > while vectoring a software interrupt/exception, making NRIPS mandatory doesn't buy
> > us much either.
> >
> > I believe the below diff is the total savings (plus the above nested thing) against
> > this series if NRIPS is mandatory (ignoring the setup code, which is a wash). It
> > does eliminate the rewind in svm_complete_soft_interrupt() and the funky logic in
> > svm_update_soft_interrupt_rip(), but that's it AFAICT. The most obnoxious code of
> > having to unwind EMULTYPE_SKIP when retrieving the next RIP for software int/except
> > injection doesn't go away:-(
> >
> > I'm not totally opposed to requiring NRIPS, but I'm not in favor of it either.
>
> Yeah, you're right. However:
>
> * the rewind might already be worth it;

FWIW, I don't actually care about supporting nrips=false, it's the ability to test
EMULTYPE_SKIP that I find valuable. I also find the extra perspective of how RIP
interacts with software interrupts/exceptions to be very helpful/educational, though
that's of debatable value going forward.

> * if we require NRIPS for nested, we can also assume that the SVM save state
> data has a valid next_rip; even if !svm->nrips_enabled. There's the pesky
> issue of restoring from an old system that did not have NRIPS, but let's
> assume for now that NRIPS was set on the source as well.

How about I throw a Not-signed-off-by patch on the end of the series to make NRIPS
mandatory, that way we (in theory) have a fully functional snapshot for nrips=false
if we want to go back in time. And we probably need to give a deprecation grace
period, i.e. wait a release or two before disappearing nrips?

2022-04-22 21:54:28

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On 4/20/22 18:15, Sean Christopherson wrote:
>>> Let's just require X86_FEATURE_NRIPS, either in general or just to
>>> enable nested virtualiazation
>> ????
> Hmm, so requiring NRIPS for nested doesn't actually buy us anything. KVM still
> has to deal with userspace hiding NRIPS from L1, so unless I'm overlooking something,
> the only change would be:
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index bdf8375a718b..7bed4e05aaea 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -686,7 +686,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> */
> if (svm->nrips_enabled)
> vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> - else if (boot_cpu_has(X86_FEATURE_NRIPS))
> + else
> vmcb02->control.next_rip = vmcb12_rip;
>
> if (is_evtinj_soft(vmcb02->control.event_inj)) {
>
> And sadly, because SVM doesn't provide the instruction length if an exit occurs
> while vectoring a software interrupt/exception, making NRIPS mandatory doesn't buy
> us much either.
>
> I believe the below diff is the total savings (plus the above nested thing) against
> this series if NRIPS is mandatory (ignoring the setup code, which is a wash). It
> does eliminate the rewind in svm_complete_soft_interrupt() and the funky logic in
> svm_update_soft_interrupt_rip(), but that's it AFAICT. The most obnoxious code of
> having to unwind EMULTYPE_SKIP when retrieving the next RIP for software int/except
> injection doesn't go away:-(
>
> I'm not totally opposed to requiring NRIPS, but I'm not in favor of it either.

Yeah, you're right. However:

* the rewind might already be worth it;

* if we require NRIPS for nested, we can also assume that the SVM save
state data has a valid next_rip; even if !svm->nrips_enabled. There's
the pesky issue of restoring from an old system that did not have NRIPS,
but let's assume for now that NRIPS was set on the source as well.

Paolo