2022-04-23 03:49:04

by Sean Christopherson

Subject: [PATCH v2 00/11] KVM: SVM: Fix soft int/ex re-injection

Fix soft interrupt/exception reinjection on SVM.

The underlying issue is that SVM simply retries INT* instructions instead
of reinjecting the soft interrupt/exception if an exception VM-Exit occurred
during vectoring. Lack of reinjection breaks nested virtualization if
the injected event came from L1 and the VM-Exit is not forwarded to L1,
as there is no instruction to retry. More fundamentally, retrying the
instruction is wrong as it can produce side effects that shouldn't occur,
e.g. code #DBs.
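
To illustrate the direction of the fix, a much-simplified sketch (not the
literal diff; see the individual patches for the real logic) of the
exception path in svm_complete_interrupts(), using helpers that appear in
the diffs below:

	if (kvm_exception_is_soft(vector)) {
		/*
		 * Old behavior: drop the event and let the guest retry the
		 * INT* instruction, which falls apart if the event came
		 * from L1 and there is no instruction to retry.
		 *
		 * New behavior (sketched here): re-inject the original
		 * soft event instead.
		 */
		kvm_requeue_exception(vcpu, vector);
	}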

VMX has been fixed since commit 66fd3f7f901f ("KVM: Do not re-execute
INTn instruction."), but SVM was left behind, probably because fixing
SVM is a mess: NRIPS isn't supported on all CPUs, and it is poorly
implemented (with respect to soft events) when it is supported.

Opportunistically clean up related tracepoints to make debugging related
issues less painful in the future.

The last patch is not-signed-off-by as I think it needs broader review
and feedback before KVM drops support for CPUs that are old, but not
thaaaaat old.

The tracepoint output looks like:

kvm_inj_exception: #GP (0x0)
kvm_inj_exception: #UD
kvm_inj_exception: #DE
kvm_inj_exception: #DE [reinjected]
kvm_inj_exception: #BP [reinjected]
kvm_inj_exception: #NP (0x18) [reinjected]

and for "irqs":

kvm_inj_virq: Soft/INTn 0x20 [reinjected]
kvm_inj_virq: Soft/INTn 0x19 [reinjected]
kvm_inj_virq: IRQ 0x20
kvm_inj_virq: IRQ 0xf1

v2:
- Collect reviews. [Maxim]
- Drop a stale comment midway through. [Paolo]
- Correctly handle (at least as correctly as SVM allows) the scenario
where an injected soft interrupt/exception has no backing insn. [Maxim]
- Tag reinjected exceptions in the tracepoint. [Maxim]
- Use the correct L2 RIP (hopefully) in svm_set_nested_state. [Maciej]
- Fix a BUG that can be triggered by userspace.
- Fix the error code FIXME in the exception tracepoint.
- Differentiate soft vs. hard "IRQ" injection in tracepoint.
- Assert that the first soft int is injected on the correct RIP in
the selftest (see the sketch below).
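
A rough sketch of that check (identifier names here are illustrative, not
necessarily the ones used in the actual test): the guest's interrupt
handler verifies that the return RIP pushed for the injected soft
interrupt points at the expected instruction:

	static unsigned long expected_rip;

	/* Guest-side handler for the injected vector (sketch). */
	static void guest_int_handler(struct ex_regs *regs)
	{
		/* expected_rip is set by the guest code before injection. */
		GUEST_ASSERT(regs->rip == expected_rip);
	}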

v1:
https://lore.kernel.org/all/[email protected]

Maciej's original series:
https://lore.kernel.org/all/[email protected]

Maciej S. Szmigiero (3):
KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
KVM: SVM: Don't BUG if userspace injects a soft interrupt with GIF=0
KVM: selftests: nSVM: Add svm_nested_soft_inject_test

Sean Christopherson (8):
KVM: SVM: Unwind "speculative" RIP advancement if INTn injection
"fails"
KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is
supported
KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"
KVM: x86: Trace re-injected exceptions
KVM: x86: Print error code in exception injection tracepoint iff valid
KVM: x86: Differentiate Soft vs. Hard IRQs vs. reinjected in
tracepoint
KVM: SVM: Drop support for CPUs without NRIPS (NextRIP Save) support

arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/svm/nested.c | 46 ++++-
arch/x86/kvm/svm/svm.c | 169 ++++++++++++------
arch/x86/kvm/svm/svm.h | 7 +-
arch/x86/kvm/trace.h | 31 ++--
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 20 ++-
tools/testing/selftests/kvm/.gitignore | 3 +-
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/svm_util.h | 2 +
.../kvm/x86_64/svm_nested_soft_inject_test.c | 149 +++++++++++++++
11 files changed, 351 insertions(+), 83 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c


base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3
--
2.36.0.rc2.479.g8af0fa9b8e-goog


2022-04-23 05:04:07

by Sean Christopherson

Subject: [PATCH v2 01/11] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

From: Maciej S. Szmigiero <[email protected]>

The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
This field's value (instead of the saved guest RIP) is used by the CPU as
the return address pushed on the stack when injecting a software interrupt
or an INT3 or INTO exception.

Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
loading a nested state and NRIPS is exposed to L1. If NRIPS is supported
in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
saves RIP on the stack as-is).
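
For context, a hedged illustration (not part of this patch) of the behavior
described above: an L1 hypervisor injecting a software interrupt into L2 on
an NRIPS-capable CPU programs EVENTINJ plus next_rip, and the CPU pushes
next_rip, not the saved RIP, as the return address. Flag and field names
below are from <asm/svm.h>; "vmcb12", "vector" and "rip_after_int" are
placeholders:

	/* L1 queues a soft interrupt for its L2 guest before VMRUN. */
	vmcb12->control.event_inj = vector | SVM_EVTINJ_VALID |
				    SVM_EVTINJ_TYPE_SOFT;
	/* Return address the CPU will push when delivering the event. */
	vmcb12->control.next_rip = rip_after_int;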

Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Maciej S. Szmigiero <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/nested.c | 22 +++++++++++++++++++---
arch/x86/kvm/svm/svm.h | 1 +
2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bed5e1692cef..461c5f247801 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -371,6 +371,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
to->nested_ctl = from->nested_ctl;
to->event_inj = from->event_inj;
to->event_inj_err = from->event_inj_err;
+ to->next_rip = from->next_rip;
to->nested_cr3 = from->nested_cr3;
to->virt_ext = from->virt_ext;
to->pause_filter_count = from->pause_filter_count;
@@ -608,7 +609,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
}
}

-static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
+static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
+ unsigned long vmcb12_rip)
{
u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
@@ -662,6 +664,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
vmcb02->control.event_inj = svm->nested.ctl.event_inj;
vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;

+ /*
+ * next_rip is consumed on VMRUN as the return address pushed on the
+ * stack for injected soft exceptions/interrupts. If nrips is exposed
+ * to L1, take it verbatim from vmcb12. If nrips is supported in
+ * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
+ * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
+ * prior to injecting the event).
+ */
+ if (svm->nrips_enabled)
+ vmcb02->control.next_rip = svm->nested.ctl.next_rip;
+ else if (boot_cpu_has(X86_FEATURE_NRIPS))
+ vmcb02->control.next_rip = vmcb12_rip;
+
vmcb02->control.virt_ext = vmcb01->control.virt_ext &
LBR_CTL_ENABLE_MASK;
if (svm->lbrv_enabled)
@@ -745,7 +760,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);

svm_switch_vmcb(svm, &svm->nested.vmcb02);
- nested_vmcb02_prepare_control(svm);
+ nested_vmcb02_prepare_control(svm, vmcb12->save.rip);
nested_vmcb02_prepare_save(svm, vmcb12);

ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
@@ -1418,6 +1433,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
dst->nested_ctl = from->nested_ctl;
dst->event_inj = from->event_inj;
dst->event_inj_err = from->event_inj_err;
+ dst->next_rip = from->next_rip;
dst->nested_cr3 = from->nested_cr3;
dst->virt_ext = from->virt_ext;
dst->pause_filter_count = from->pause_filter_count;
@@ -1602,7 +1618,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
nested_copy_vmcb_control_to_cache(svm, ctl);

svm_switch_vmcb(svm, &svm->nested.vmcb02);
- nested_vmcb02_prepare_control(svm);
+ nested_vmcb02_prepare_control(svm, save->rip);

/*
* While the nested guest CR3 is already checked and set by
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 32220a1b0ea2..7d97e4d18c8b 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -139,6 +139,7 @@ struct vmcb_ctrl_area_cached {
u64 nested_ctl;
u32 event_inj;
u32 event_inj_err;
+ u64 next_rip;
u64 nested_cr3;
u64 virt_ext;
u32 clean;
--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-23 09:10:33

by Sean Christopherson

Subject: [PATCH v2 03/11] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails"

Unwind the RIP advancement done by svm_queue_exception() when injecting
an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while
vectoring the injected event, even if the exception reported by the CPU
isn't the same event that was injected. If vectoring INT3 encounters an
exception, e.g. #NP, and vectoring the #NP encounters an intercepted
exception, e.g. #PF when KVM is using shadow paging, then the #NP will
be reported as the event that was in-progress.

Note, this is still imperfect, as it will get a false positive if the
INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3
handler in the guest, the instruction following the INT3 generates an
exception (directly or indirectly), _and_ vectoring that exception
encounters an exception that is intercepted by KVM. The false positives
could theoretically be solved by further analyzing the vectoring event,
e.g. by comparing the error code against the error code that would be
expected if an exception occurred while vectoring the original injected
exception, but SVM without NRIPS is a complete disaster, and trying to
make it 100% correct is a waste of time.

Reviewed-by: Maxim Levitsky <[email protected]>
Fixes: 66b7138f9136 ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 151fba0b405f..82175a13c668 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3700,6 +3700,18 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;

+ /*
+ * If NextRIP isn't enabled, KVM must manually advance RIP prior to
+ * injecting the soft exception/interrupt. That advancement needs to
+ * be unwound if vectoring didn't complete. Note, the _new_ event may
+ * not be the injected event, e.g. if KVM injected an INTn, the INTn
+ * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
+ * be the reported vectored event, but RIP still needs to be unwound.
+ */
+ if (int3_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
+ kvm_is_linear_rip(vcpu, svm->int3_rip))
+ kvm_rip_write(vcpu, kvm_rip_read(vcpu) - int3_injected);
+
switch (type) {
case SVM_EXITINTINFO_TYPE_NMI:
vcpu->arch.nmi_injected = true;
@@ -3713,16 +3725,11 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)

/*
* In case of software exceptions, do not reinject the vector,
- * but re-execute the instruction instead. Rewind RIP first
- * if we emulated INT3 before.
+ * but re-execute the instruction instead.
*/
- if (kvm_exception_is_soft(vector)) {
- if (vector == BP_VECTOR && int3_injected &&
- kvm_is_linear_rip(vcpu, svm->int3_rip))
- kvm_rip_write(vcpu,
- kvm_rip_read(vcpu) - int3_injected);
+ if (kvm_exception_is_soft(vector))
break;
- }
+
if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
u32 err = svm->vmcb->control.exit_int_info_err;
kvm_requeue_exception_e(vcpu, vector, err);
--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-26 07:09:09

by Maciej S. Szmigiero

Subject: Re: [PATCH v2 00/11] KVM: SVM: Fix soft int/ex re-injection

On 23.04.2022 04:14, Sean Christopherson wrote:
> Fix soft interrupt/exception reinjection on SVM.
>

Thanks for the patch set, Sean. I can't see anything obviously wrong
from a static code review - just small nits.

Will test it practically tomorrow and report the results.

Thanks,
Maciej

2022-04-27 18:41:48

by Maciej S. Szmigiero

Subject: Re: [PATCH v2 00/11] KVM: SVM: Fix soft int/ex re-injection

On 26.04.2022 01:01, Maciej S. Szmigiero wrote:
> On 23.04.2022 04:14, Sean Christopherson wrote:
>> Fix soft interrupt/exception reinjection on SVM.
>>
>
> Thanks for the patch set, Sean. I can't see anything obviously wrong
> from a static code review - just small nits.
>
> Will test it practically tomorrow and report the results.

I've tested this patch set and it seems to work fine with respect
to soft {exception,interrupt} re-injection and next_rip field consistency.

I have prepared a draft of an updated version at [1] with the following
further changes:
* "Downgraded" the commit affecting !nrips CPUs to just drop nested SVM
support for such parts instead of SVM support in general,

* Added a fix for L1/L2 NMI state confusion during L1 -> L2 NMI re-injection,

* Updated the new KVM self-test to also check for the NMI injection
scenario being fixed (that was found causing issues with a real guest),

* Changed "kvm_inj_virq" trace event "reinjected" field type to bool.

Will post a v3 patch set (with proper SoBs, etc.) if there are no further
comments or objections.

Thanks,
Maciej

[1]: https://github.com/maciejsszmigiero/linux/commits/svm_next_rip-sc

2022-04-28 10:24:34

by Maxim Levitsky

Subject: Re: [PATCH v2 01/11] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

On Sat, 2022-04-23 at 02:14 +0000, Sean Christopherson wrote:
> From: Maciej S. Szmigiero <[email protected]>
>
> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
> This field's value (instead of the saved guest RIP) is used by the CPU as
> the return address pushed on the stack when injecting a software interrupt
> or an INT3 or INTO exception.
>
> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
> loading a nested state and NRIPS is exposed to L1. If NRIPS is supported
> in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
> vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
> saves RIP on the stack as-is).
>
> Reviewed-by: Maxim Levitsky <[email protected]>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/svm/nested.c | 22 +++++++++++++++++++---
> arch/x86/kvm/svm/svm.h | 1 +
> 2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index bed5e1692cef..461c5f247801 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -371,6 +371,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> to->nested_ctl = from->nested_ctl;
> to->event_inj = from->event_inj;
> to->event_inj_err = from->event_inj_err;
> + to->next_rip = from->next_rip;
> to->nested_cr3 = from->nested_cr3;
> to->virt_ext = from->virt_ext;
> to->pause_filter_count = from->pause_filter_count;
> @@ -608,7 +609,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
> }
> }
>
> -static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> +static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> + unsigned long vmcb12_rip)

I know that I already reviewed this, but why do we need to pass an extra
parameter to nested_vmcb02_prepare_control? Let's just put that value in
the cache to be consistent with the rest.

Best regards,
Maxim Levitsky


> {
> u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
> u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
> @@ -662,6 +664,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> vmcb02->control.event_inj = svm->nested.ctl.event_inj;
> vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
>
> + /*
> + * next_rip is consumed on VMRUN as the return address pushed on the
> + * stack for injected soft exceptions/interrupts. If nrips is exposed
> + * to L1, take it verbatim from vmcb12. If nrips is supported in
> + * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
> + * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
> + * prior to injecting the event).
> + */
> + if (svm->nrips_enabled)
> + vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> + else if (boot_cpu_has(X86_FEATURE_NRIPS))
> + vmcb02->control.next_rip = vmcb12_rip;
> +
> vmcb02->control.virt_ext = vmcb01->control.virt_ext &
> LBR_CTL_ENABLE_MASK;
> if (svm->lbrv_enabled)
> @@ -745,7 +760,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
> nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>
> svm_switch_vmcb(svm, &svm->nested.vmcb02);
> - nested_vmcb02_prepare_control(svm);
> + nested_vmcb02_prepare_control(svm, vmcb12->save.rip);
> nested_vmcb02_prepare_save(svm, vmcb12);
>
> ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
> @@ -1418,6 +1433,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
> dst->nested_ctl = from->nested_ctl;
> dst->event_inj = from->event_inj;
> dst->event_inj_err = from->event_inj_err;
> + dst->next_rip = from->next_rip;
> dst->nested_cr3 = from->nested_cr3;
> dst->virt_ext = from->virt_ext;
> dst->pause_filter_count = from->pause_filter_count;
> @@ -1602,7 +1618,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> nested_copy_vmcb_control_to_cache(svm, ctl);
>
> svm_switch_vmcb(svm, &svm->nested.vmcb02);
> - nested_vmcb02_prepare_control(svm);
> + nested_vmcb02_prepare_control(svm, save->rip);
>
> /*
> * While the nested guest CR3 is already checked and set by
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 32220a1b0ea2..7d97e4d18c8b 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -139,6 +139,7 @@ struct vmcb_ctrl_area_cached {
> u64 nested_ctl;
> u32 event_inj;
> u32 event_inj_err;
> + u64 next_rip;
> u64 nested_cr3;
> u64 virt_ext;
> u32 clean;