2022-03-11 21:45:02

by Maciej S. Szmigiero

Subject: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

From: "Maciej S. Szmigiero" <[email protected]>

In SVM, synthetic software interrupts or INT3/INTO exceptions that L1
wants to inject into its L2 guest are forgotten if there is an intervening
L0 VMEXIT during their delivery.

They are re-injected correctly with VMX, however.

This is because SVM assumes that such exceptions will be re-delivered by
simply re-executing the current instruction, which might not be true if
this is a synthetic exception injected by L1: in that case the re-executed
instruction will be one already in L2, not the VMRUN instruction in L1
that attempted the injection.

Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
it is either re-injected successfully or returned to L1 upon a nested
VMEXIT.
Make sure to always re-queue such an event if it is returned in EXITINTINFO.

The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
unforeseen regressions.

Signed-off-by: Maciej S. Szmigiero <[email protected]>
---
arch/x86/kvm/svm/nested.c | 65 +++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/svm/svm.c | 17 ++++++++--
arch/x86/kvm/svm/svm.h | 47 ++++++++++++++++++++++++++++
3 files changed, 125 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 9656f0d6815c..75017bf77955 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -420,8 +420,17 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
{
u32 mask;
- svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
- svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
+
+ /*
+ * Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err}
+ * if its re-injection is needed
+ */
+ if (!exit_during_event_injection(svm, svm->nested.ctl.event_inj,
+ svm->nested.ctl.event_inj_err)) {
+ WARN_ON_ONCE(svm->vmcb->control.event_inj & SVM_EVTINJ_VALID);
+ svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
+ svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
+ }

/* Only a few fields of int_ctl are written by the processor. */
mask = V_IRQ_MASK | V_TPR_MASK;
@@ -669,6 +678,54 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
}

+void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned int vector, type;
+ u32 exitintinfo = svm->vmcb->control.exit_int_info;
+
+ if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
+ return;
+
+ /*
+ * No L1 -> L2 event to re-inject?
+ *
+ * In this case event_inj will be cleared by
+ * nested_sync_control_from_vmcb02().
+ */
+ if (!(svm->nested.ctl.event_inj & SVM_EVTINJ_VALID))
+ return;
+
+ /* If the last event injection was successful there shouldn't be any pending event */
+ if (WARN_ON_ONCE(!(exitintinfo & SVM_EXITINTINFO_VALID)))
+ return;
+
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+ vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
+ type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
+
+ switch (type) {
+ case SVM_EXITINTINFO_TYPE_NMI:
+ vcpu->arch.nmi_injected = true;
+ break;
+ case SVM_EXITINTINFO_TYPE_EXEPT:
+ if (exitintinfo & SVM_EXITINTINFO_VALID_ERR)
+ kvm_requeue_exception_e(vcpu, vector,
+ svm->vmcb->control.exit_int_info_err);
+ else
+ kvm_requeue_exception(vcpu, vector);
+ break;
+ case SVM_EXITINTINFO_TYPE_SOFT:
+ case SVM_EXITINTINFO_TYPE_INTR:
+ kvm_queue_interrupt(vcpu, vector, type == SVM_EXITINTINFO_TYPE_SOFT);
+ break;
+ default:
+ vcpu_unimpl(vcpu, "unknown L1 -> L2 exitintinfo type 0x%x\n", type);
+ break;
+ }
+}
+
int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
struct vmcb *vmcb12, bool from_vmrun)
{
@@ -898,6 +955,10 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
if (svm->nrips_enabled)
vmcb12->control.next_rip = vmcb->control.next_rip;

+ /* Forget about any pending L1 event injection since it's a L1 worry now */
+ svm->nested.ctl.event_inj = 0;
+ svm->nested.ctl.event_inj_err = 0;
+
vmcb12->control.int_ctl = svm->nested.ctl.int_ctl;
vmcb12->control.tlb_ctl = svm->nested.ctl.tlb_ctl;
vmcb12->control.event_inj = svm->nested.ctl.event_inj;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1e5d904aeec3..5b128baa5e57 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3322,13 +3322,18 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);

- WARN_ON(!gif_set(svm));
+ WARN_ON(!(vcpu->arch.interrupt.soft || gif_set(svm)));

trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
++vcpu->stat.irq_injections;

svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
- SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
+ SVM_EVTINJ_VALID;
+ if (vcpu->arch.interrupt.soft) {
+ svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_SOFT;
+ } else {
+ svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_INTR;
+ }
}

void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
@@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
if (!(exitintinfo & SVM_EXITINTINFO_VALID))
return;

+ /* L1 -> L2 event re-injection needs a different handling */
+ if (is_guest_mode(vcpu) &&
+ exit_during_event_injection(svm, svm->nested.ctl.event_inj,
+ svm->nested.ctl.event_inj_err)) {
+ nested_svm_maybe_reinject(vcpu);
+ return;
+ }
+
kvm_make_request(KVM_REQ_EVENT, vcpu);

vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f757400fc933..7cafc2e6c82a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -488,6 +488,52 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
}

+static inline bool event_inj_same(u32 event_inj1, u32 event_inj_err1,
+ u32 event_inj2, u32 event_inj_err2)
+{
+ unsigned int vector_1, vector_2, type_1, type_2;
+
+ /* Either of them not valid? */
+ if (!(event_inj1 & SVM_EVTINJ_VALID) ||
+ !(event_inj2 & SVM_EVTINJ_VALID))
+ return false;
+
+ vector_1 = event_inj1 & SVM_EVTINJ_VEC_MASK;
+ type_1 = event_inj1 & SVM_EVTINJ_TYPE_MASK;
+ vector_2 = event_inj2 & SVM_EVTINJ_VEC_MASK;
+ type_2 = event_inj2 & SVM_EVTINJ_TYPE_MASK;
+
+ /* Different vector or type? */
+ if (vector_1 != vector_2 || type_1 != type_2)
+ return false;
+
+ /* Different error code presence flag? */
+ if ((event_inj1 & SVM_EVTINJ_VALID_ERR) !=
+ (event_inj2 & SVM_EVTINJ_VALID_ERR))
+ return false;
+
+ /* No error code? */
+ if (!(event_inj1 & SVM_EVTINJ_VALID_ERR))
+ return true;
+
+ /* Same error code? */
+ return event_inj_err1 == event_inj_err2;
+}
+
+/* Did the last VMEXIT happen when attempting to inject that event? */
+static inline bool exit_during_event_injection(struct vcpu_svm *svm,
+ u32 event_inj, u32 event_inj_err)
+{
+ BUILD_BUG_ON(SVM_EXITINTINFO_VEC_MASK != SVM_EVTINJ_VEC_MASK ||
+ SVM_EXITINTINFO_TYPE_MASK != SVM_EVTINJ_TYPE_MASK ||
+ SVM_EXITINTINFO_VALID != SVM_EVTINJ_VALID ||
+ SVM_EXITINTINFO_VALID_ERR != SVM_EVTINJ_VALID_ERR);
+
+ return event_inj_same(svm->vmcb->control.exit_int_info,
+ svm->vmcb->control.exit_int_info_err,
+ event_inj, event_inj_err);
+}
+
/* svm.c */
#define MSR_INVALID 0xffffffffU

@@ -540,6 +586,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
}

+void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu);
int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
void svm_leave_nested(struct kvm_vcpu *vcpu);


2022-03-31 02:39:00

by Maciej S. Szmigiero

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On 30.03.2022 23:59, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <[email protected]>
>>
>> In SVM synthetic software interrupts or INT3 or INTO exception that L1
>> wants to inject into its L2 guest are forgotten if there is an intervening
>> L0 VMEXIT during their delivery.
>>
>> They are re-injected correctly with VMX, however.
>>
>> This is because there is an assumption in SVM that such exceptions will be
>> re-delivered by simply re-executing the current instruction.
>> Which might not be true if this is a synthetic exception injected by L1,
>> since in this case the re-executed instruction will be one already in L2,
>> not the VMRUN instruction in L1 that attempted the injection.
>>
>> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
>> it is either re-injected successfully or returned to L1 upon a nested
>> VMEXIT.
>> Make sure to always re-queue such event if returned in EXITINTINFO.
>>
>> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
>> unforeseen regressions.
>>
>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>> ---
>
> ...
>
>> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>> if (!(exitintinfo & SVM_EXITINTINFO_VALID))
>> return;
>>
>> + /* L1 -> L2 event re-injection needs a different handling */
>> + if (is_guest_mode(vcpu) &&
>> + exit_during_event_injection(svm, svm->nested.ctl.event_inj,
>> + svm->nested.ctl.event_inj_err)) {
>> + nested_svm_maybe_reinject(vcpu);
>
> Why is this manually re-injecting? More specifically, why does the below (out of
> sight in the diff) code that re-queues the exception/interrupt not work? The
> re-queued event should be picked up by nested_save_pending_event_to_vmcb12() and
> propagated to vmcb12.

An L1 -> L2 injected event should either be re-injected until it is
successfully delivered to L2, or propagated to VMCB12 if there is a nested
VMEXIT during its delivery.

svm_complete_interrupts() does not do such re-injection in some cases
(soft interrupts, soft exceptions, #VC) - it tries to resort to
emulation instead, which is incorrect in this case.

I think it's better to split out this L1 -> L2 nested case into a
separate function in nested.c rather than fill
svm_complete_interrupts() in the already very large svm.c with "if" blocks
here and there.

Thanks,
Maciej

2022-03-31 04:01:18

by Sean Christopherson

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> In SVM synthetic software interrupts or INT3 or INTO exception that L1
> wants to inject into its L2 guest are forgotten if there is an intervening
> L0 VMEXIT during their delivery.
>
> They are re-injected correctly with VMX, however.
>
> This is because there is an assumption in SVM that such exceptions will be
> re-delivered by simply re-executing the current instruction.
> Which might not be true if this is a synthetic exception injected by L1,
> since in this case the re-executed instruction will be one already in L2,
> not the VMRUN instruction in L1 that attempted the injection.
>
> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
> it is either re-injected successfully or returned to L1 upon a nested
> VMEXIT.
> Make sure to always re-queue such event if returned in EXITINTINFO.
>
> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
> unforeseen regressions.
>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---

...

> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
> if (!(exitintinfo & SVM_EXITINTINFO_VALID))
> return;
>
> + /* L1 -> L2 event re-injection needs a different handling */
> + if (is_guest_mode(vcpu) &&
> + exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> + svm->nested.ctl.event_inj_err)) {
> + nested_svm_maybe_reinject(vcpu);

Why is this manually re-injecting? More specifically, why does the below (out of
sight in the diff) code that re-queues the exception/interrupt not work? The
re-queued event should be picked up by nested_save_pending_event_to_vmcb12() and
propagated to vmcb12.

2022-03-31 04:59:45

by Sean Christopherson

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On Thu, Mar 31, 2022, Maciej S. Szmigiero wrote:
> On 30.03.2022 23:59, Sean Christopherson wrote:
> > On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> > > @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
> > > if (!(exitintinfo & SVM_EXITINTINFO_VALID))
> > > return;
> > > + /* L1 -> L2 event re-injection needs a different handling */
> > > + if (is_guest_mode(vcpu) &&
> > > + exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> > > + svm->nested.ctl.event_inj_err)) {
> > > + nested_svm_maybe_reinject(vcpu);
> >
> > Why is this manually re-injecting? More specifically, why does the below (out of
> > sight in the diff) code that re-queues the exception/interrupt not work? The
> > re-queued event should be picked up by nested_save_pending_event_to_vmcb12() and
> > > propagated to vmcb12.
>
> A L1 -> L2 injected event should either be re-injected until successfully
> injected into L2 or propagated to VMCB12 if there is a nested VMEXIT
> during its delivery.
>
> svm_complete_interrupts() does not do such re-injection in some cases
> (soft interrupts, soft exceptions, #VC) - it is trying to resort to
> emulation instead, which is incorrect in this case.
>
> I think it's better to split out this L1 -> L2 nested case to a
> separate function in nested.c rather than to fill
> svm_complete_interrupts() in already very large svm.c with "if" blocks
> here and there.

Ah, I see it now. WTF.

Ugh, commit 66fd3f7f901f ("KVM: Do not re-execute INTn instruction.") fixed VMX,
but left SVM broken.

Re-executing the INTn is wrong, the instruction has already completed decode and
execution. E.g. if there's a code breakpoint on the INTn, rewinding will
cause a spurious #DB.

KVM's INT3 shenanigans are bonkers, but I guess there's no better option given
that the APM says "Software interrupts cannot be properly injected if the processor
does not support the NextRIP field.". What a mess.

Anyways, for the common nrips=true case, I strongly prefer that we properly fix
the non-nested case and re-inject software interrupts, which should in turn
naturally fix this nested case. And for nrips=false, my vote is to either punt
and document it as a "KVM erratum", or straight up make nested require nrips.

Note, that also requires updating svm_queue_exception(), which assumes it will
only be handed hardware exceptions, i.e. hardcodes type EXEPT. That's blatantly
wrong, e.g. if userspace injects a software exception via KVM_SET_VCPU_EVENTS.

2022-04-01 06:39:25

by Sean Christopherson

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> On 31.03.2022 01:20, Sean Christopherson wrote:
> > Re-executing the INTn is wrong, the instruction has already completed decode and
> > execution. E.g. if there's a code breakpoint on the INTn, rewinding will
> > cause a spurious #DB.
> >
> > KVM's INT3 shenanigans are bonkers, but I guess there's no better option given
> > that the APM says "Software interrupts cannot be properly injected if the processor
> > does not support the NextRIP field.". What a mess.
>
> Note that KVM currently always tries to re-execute the current instruction
> when asked to re-inject a #BP or a #OF, even when nrips are enabled.

Yep, and my vote is to fix that.

> Also, #BP (and #OF, too) is returned as type SVM_EXITINTINFO_TYPE_EXEPT,
> not as SVM_EXITINTINFO_TYPE_SOFT (soft interrupt), so it should be
> re-injected accordingly.

Ahhh, SVM doesn't differentiate between software exceptions and hardware exceptions.
Finally found the relevant section in the APM:

Despite the instruction name, the events raised by the INT1 (also known as ICEBP),
INT3 and INTO instructions (opcodes F1h, CCh and CEh) are considered exceptions for
the purposes of EXITINTINFO, not software interrupts. Only events raised by the INTn
instruction (opcode CDh) are considered software interrupts.

VMX has separate identifiers for software interrupts and for software exceptions,
where as SVM unconditionally treats #BP and #OF as soft:

Injecting an exception (TYPE = 3) with vectors 3 or 4 behaves like a trap raised by
INT3 and INTO instructions

Now I'm curious why Intel doesn't do the same...

> > Anyways, for the common nrips=true case, I strongly prefer that we properly fix
> > the non-nested case and re-inject software interrupts, which should in turn
> > naturally fix this nested case.
>
> This would also need making the #BP or #OF current instruction
> re-execution conditional on (at least) nrips support.
>
> I am not sure, however, whether this won't introduce any regressions.
> That's why this patch set changed the behavior here only for the
> L1 -> L2 case.
>
> Another issue is whether a L1 hypervisor can legally inject a #VC
> into its L2 (since these are never re-injected).

I would expect it to work, and it's easy to find out. I know VMX allows injecting
non-existent exceptions, but the APM is vague as usual and says VMRUN will fail...

If the VMM attempts to inject an event that is impossible for the guest mode

> We still need L1 -> L2 event injection detection to restore the NextRIP
> field when re-injecting an event that uses it.

You lost me on this one. KVM L0 is only (and always!) responsible for saving the
relevant info into vmcb12, why does it need to detect where the vectoring exception
came from?

> > And for nrips=false, my vote is to either punt
> > and document it as a "KVM erratum", or straight up make nested require nrips.
>
> A quick Internet search shows that the first CPUs with NextRIP were the
> second-generation Family 10h CPUs (Phenom II, Athlon II, etc.).
> They started being released in early 2009, so we probably don't need to
> worry about the non-nrips case too much.
>
> For the nested case, orthodox reading of the aforementioned APM sentence
> would mean that a L1 hypervisor is not allowed either to make use of such
> event injection in the non-nrips case.

Heh, my reading of it is that it's not disallowed, it just won't work correctly,
i.e. the INTn won't be skipped.

2022-04-02 14:54:59

by Maciej S. Szmigiero

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On 31.03.2022 01:20, Sean Christopherson wrote:
> On Thu, Mar 31, 2022, Maciej S. Szmigiero wrote:
>> On 30.03.2022 23:59, Sean Christopherson wrote:
>>> On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
>>>> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>>>> if (!(exitintinfo & SVM_EXITINTINFO_VALID))
>>>> return;
>>>> + /* L1 -> L2 event re-injection needs a different handling */
>>>> + if (is_guest_mode(vcpu) &&
>>>> + exit_during_event_injection(svm, svm->nested.ctl.event_inj,
>>>> + svm->nested.ctl.event_inj_err)) {
>>>> + nested_svm_maybe_reinject(vcpu);
>>>
>>> Why is this manually re-injecting? More specifically, why does the below (out of
>>> sight in the diff) code that re-queues the exception/interrupt not work? The
>>> re-queued event should be picked up by nested_save_pending_event_to_vmcb12() and
>>> propagated to vmcb12.
>>
>> A L1 -> L2 injected event should either be re-injected until successfully
>> injected into L2 or propagated to VMCB12 if there is a nested VMEXIT
>> during its delivery.
>>
>> svm_complete_interrupts() does not do such re-injection in some cases
>> (soft interrupts, soft exceptions, #VC) - it is trying to resort to
>> emulation instead, which is incorrect in this case.
>>
>> I think it's better to split out this L1 -> L2 nested case to a
>> separate function in nested.c rather than to fill
>> svm_complete_interrupts() in already very large svm.c with "if" blocks
>> here and there.
>
> Ah, I see it now. WTF.
>
> Ugh, commit 66fd3f7f901f ("KVM: Do not re-execute INTn instruction.") fixed VMX,
> but left SVM broken.
>
> Re-executing the INTn is wrong, the instruction has already completed decode and
> execution. E.g. if there's a code breakpoint on the INTn, rewinding will
> cause a spurious #DB.
>
> KVM's INT3 shenanigans are bonkers, but I guess there's no better option given
> that the APM says "Software interrupts cannot be properly injected if the processor
> does not support the NextRIP field.". What a mess.

Note that KVM currently always tries to re-execute the current instruction
when asked to re-inject a #BP or a #OF, even when nrips is enabled.

Also, #BP (and #OF, too) is returned as type SVM_EXITINTINFO_TYPE_EXEPT,
not as SVM_EXITINTINFO_TYPE_SOFT (soft interrupt), so it should be
re-injected accordingly.

> Anyways, for the common nrips=true case, I strongly prefer that we properly fix
> the non-nested case and re-inject software interrupts, which should in turn
> naturally fix this nested case.

This would also require making the current-instruction re-execution for
#BP or #OF conditional on (at least) nrips support.

I am not sure, however, whether this won't introduce any regressions.
That's why this patch set changed the behavior here only for the
L1 -> L2 case.

Another issue is whether an L1 hypervisor can legally inject a #VC
into its L2 (since these are never re-injected).

We still need L1 -> L2 event injection detection to restore the NextRIP
field when re-injecting an event that uses it.

> And for nrips=false, my vote is to either punt
> and document it as a "KVM erratum", or straight up make nested require nrips.

A quick Internet search shows that the first CPUs with NextRIP were the
second-generation Family 10h CPUs (Phenom II, Athlon II, etc.).
They started being released in early 2009, so we probably don't need to
worry about the non-nrips case too much.

For the nested case, a strict reading of the aforementioned APM sentence
would mean that an L1 hypervisor is not allowed to make use of such
event injection in the non-nrips case either.

> Note, that also requires updating svm_queue_exception(), which assumes it will
> only be handed hardware exceptions, i.e. hardcodes type EXEPT. That's blatantly
> wrong, e.g. if userspace injects a software exception via KVM_SET_VCPU_EVENTS.

svm_queue_exception() uses SVM_EVTINJ_TYPE_EXEPT, which is correct even
for software exceptions (#BP or #OF).
It does indeed work, as the selftest included in this patch set
demonstrates.

Thanks,
Maciej

2022-04-04 13:42:11

by Maciej S. Szmigiero

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On 1.04.2022 02:08, Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
>> On 31.03.2022 01:20, Sean Christopherson wrote:
>>> Re-executing the INTn is wrong, the instruction has already completed decode and
>>> execution. E.g. if there's a code breakpoint on the INTn, rewinding will
>>> cause a spurious #DB.
>>>
>>> KVM's INT3 shenanigans are bonkers, but I guess there's no better option given
>>> that the APM says "Software interrupts cannot be properly injected if the processor
>>> does not support the NextRIP field.". What a mess.
>>
>> Note that KVM currently always tries to re-execute the current instruction
>> when asked to re-inject a #BP or a #OF, even when nrips are enabled.
>
> Yep, and my vote is to fix that.

The only issue I have with making SVM re-inject a #BP or a #OF in the
nrips case is that this might introduce some regression with respect to
host-side guest debugging.

On the other hand, VMX does re-inject these unconditionally so it might
not be a problem after all.

>> Also, #BP (and #OF, too) is returned as type SVM_EXITINTINFO_TYPE_EXEPT,
>> not as SVM_EXITINTINFO_TYPE_SOFT (soft interrupt), so it should be
>> re-injected accordingly.
>
> Ahhh, SVM doesn't differentiate between software exceptions and hardware exceptions.
> Finally found the relevant section in the APM:
>
> Despite the instruction name, the events raised by the INT1 (also known as ICEBP),
> INT3 and INTO instructions (opcodes F1h, CCh and CEh) are considered exceptions for
> the purposes of EXITINTINFO, not software interrupts. Only events raised by the INTn
> instruction (opcode CDh) are considered software interrupts.
>
> VMX has separate identifiers for software interrupts and for software exceptions,

I guess the sentence above was supposed to read "for *hardware exceptions*
and for software exceptions", just like in the previous paragraph about SVM.

> where as SVM unconditionally treats #BP and #OF as soft:
>
> Injecting an exception (TYPE = 3) with vectors 3 or 4 behaves like a trap raised by
> INT3 and INTO instructions
>
> Now I'm curious why Intel doesn't do the same...

Perhaps it's just a result of VMX and SVM being developed independently,
in parallel.

>>> Anyways, for the common nrips=true case, I strongly prefer that we properly fix
>>> the non-nested case and re-inject software interrupts, which should in turn
>>> naturally fix this nested case.
>>
>> This would also need making the #BP or #OF current instruction
>> re-execution conditional on (at least) nrips support.
>>
>> I am not sure, however, whether this won't introduce any regressions.
>> That's why this patch set changed the behavior here only for the
>> L1 -> L2 case.
>>
>> Another issue is whether a L1 hypervisor can legally inject a #VC
>> into its L2 (since these are never re-injected).
>
> I would expect to work, and it's easy to find out. I know VMX allows injecting
> non-existent exceptions, but the APM is vague as usual and says VMRUN will fail...

I've done a quick test right now and a VMRUN attempt with #VC event
injection does seem to fail.
So we don't need to worry about not re-injecting a #VC.

> If the VMM attempts to inject an event that is impossible for the guest mode
>
>> We still need L1 -> L2 event injection detection to restore the NextRIP
>> field when re-injecting an event that uses it.
>
> You lost me on this one. KVM L0 is only (and always!) responsible for saving the
> relevant info into vmcb12, why does it need to detect where the vectoring exception
> came from?

Look at the description of patch 4 of this patch set - after some
L2 VMEXITs handled by L0 (in other words, ones that do *not* result in
a nested VMEXIT to L1) the VMCB02 NextRIP field will be zero
(APM 15.7.1 "State Saved on Exit" describes when this happens).

If we attempt to re-inject some types of events while the NextRIP field
is zero, the return address pushed on the stack will also be zero, which
will obviously do bad things to L2 when it returns from the
(exception|interrupt) handler.
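
Roughly something like the following sketch would be needed (illustrative
only, not patch 4 itself; "saved_next_rip" is a hypothetical value captured
at the original VMRUN, no such field exists in struct vcpu_svm today):

/*
 * Illustrative sketch only: put back a NextRIP value captured at the
 * original VMRUN before re-injecting an L1 -> L2 event whose delivery
 * pushes a return address, i.e. soft interrupts and the INT3/INTO
 * exception vectors.
 */
static void restore_next_rip_for_reinjection(struct vcpu_svm *svm,
					     u64 saved_next_rip)
{
	u32 event_inj = svm->nested.ctl.event_inj;
	unsigned int vector = event_inj & SVM_EVTINJ_VEC_MASK;
	unsigned int type = event_inj & SVM_EVTINJ_TYPE_MASK;

	if (!(event_inj & SVM_EVTINJ_VALID))
		return;

	if (type == SVM_EVTINJ_TYPE_SOFT ||
	    (type == SVM_EVTINJ_TYPE_EXEPT &&
	     (vector == BP_VECTOR || vector == OF_VECTOR)))
		svm->vmcb->control.next_rip = saved_next_rip;
}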

>>> And for nrips=false, my vote is to either punt
>>> and document it as a "KVM erratum", or straight up make nested require nrips.
>>
>> A quick Internet search shows that the first CPUs with NextRIP were the
>> second-generation Family 10h CPUs (Phenom II, Athlon II, etc.).
>> They started being released in early 2009, so we probably don't need to
>> worry about the non-nrips case too much.
>>
>> For the nested case, orthodox reading of the aforementioned APM sentence
>> would mean that a L1 hypervisor is not allowed either to make use of such
>> event injection in the non-nrips case.
>
> Heh, my reading of it is that it's not disallowed, it just won't work correctly,
> i.e. the INTn won't be skipped.

Either way, we probably don't need to worry about this case not getting
handled 100% right.

Thanks,
Maciej

2022-04-04 22:08:51

by Maxim Levitsky

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On Thu, 2022-03-10 at 22:38 +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <[email protected]>
>
> In SVM synthetic software interrupts or INT3 or INTO exception that L1
> wants to inject into its L2 guest are forgotten if there is an intervening
> L0 VMEXIT during their delivery.
>
> They are re-injected correctly with VMX, however.
>
> This is because there is an assumption in SVM that such exceptions will be
> re-delivered by simply re-executing the current instruction.
> Which might not be true if this is a synthetic exception injected by L1,
> since in this case the re-executed instruction will be one already in L2,
> not the VMRUN instruction in L1 that attempted the injection.
>
> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
> it is either re-injected successfully or returned to L1 upon a nested
> VMEXIT.
> Make sure to always re-queue such event if returned in EXITINTINFO.
>
> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
> unforeseen regressions.

Some time ago I noticed this too, but haven't dug into it too much.
I remember I even had some half-baked patch for this that I never posted,
because I didn't think about the possibility of this synthetic injection.

Just to be clear that I understand this correctly:

1. What is happening is that L1 is injecting INTn/INTO/INT3 but the L2 code
doesn't actually contain an INTn/INTO/INT3 instruction.
This is a weird but legal thing to do (see the sketch after this list).
Again, if L2 actually contained the instruction, it would have worked?


2. When an actual INTn/INTO/INT3 is intercepted on SVM, save.RIP points to
the address of the instruction, and control.next_rip points to the address
of the next instruction after it (as expected)

3. When EVENTINJ is used with '(TYPE = 3) with vectors 3 or 4'
or 'TYPE = 4', then next_rip is pushed on the stack, while save.RIP is
pretty much ignored, and execution jumps to the handler in the IDT.
Also, at least for INT3/INTO, the PRM states that the IDT entry's DPL field
is checked before dispatch, meaning that we can get a legit #GP during
delivery.
(This looks like another legit reason to fix exception merging in KVM.)
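
For point 1, a minimal L1-side sketch of what such a synthetic injection
looks like (selftest-style and purely illustrative; the helper name is made
up, the fields are the ones from struct vmcb):

/*
 * Hypothetical L1 helper: deliver a "synthetic" #BP to L2 via EVENTINJ
 * even though the code at l2_rip contains no INT3.  On VMRUN the CPU
 * vectors through L2's IDT and (with NextRIP support) pushes
 * control.next_rip as the return address, so L2 never executes an INT3.
 */
static void l1_inject_synthetic_int3(struct vmcb *vmcb, u64 l2_rip)
{
	vmcb->save.rip = l2_rip;		/* where L2 would have resumed */
	vmcb->control.next_rip = l2_rip;	/* return address pushed for L2 */
	vmcb->control.event_inj = BP_VECTOR | SVM_EVTINJ_VALID |
				  SVM_EVTINJ_TYPE_EXEPT;
	vmcb->control.event_inj_err = 0;
	/* ...followed by VMRUN on this VMCB. */
}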


Best regards,
Maxim Levitsky


>
> Signed-off-by: Maciej S. Szmigiero <[email protected]>
> ---
> arch/x86/kvm/svm/nested.c | 65 +++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.c | 17 ++++++++--
> arch/x86/kvm/svm/svm.h | 47 ++++++++++++++++++++++++++++
> 3 files changed, 125 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 9656f0d6815c..75017bf77955 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -420,8 +420,17 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
> void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
> {
> u32 mask;
> - svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
> - svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
> +
> + /*
> + * Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err}
> + * if its re-injection is needed
> + */
> + if (!exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> + svm->nested.ctl.event_inj_err)) {
> + WARN_ON_ONCE(svm->vmcb->control.event_inj & SVM_EVTINJ_VALID);
> + svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
> + svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
> + }

Beware that this could backfire in regard to nested migration.

I once chased a nasty bug related to this.

The bug was:

- L1 does VMRUN with injection (EVENTINJ set)

- VMRUN exit handler is 'executed' by KVM,
This copies the EVENTINJ fields from VMCB12 to VMCB02

- Once VMRUN exit handler is done executing, we exit to userspace to start migration
(basically static_call(kvm_x86_handle_exit)(...) handles the SVM_EXIT_VMRUN,
and that is all, vcpu_enter_guest isn't called again, so injection is not canceled)

- migration happens and it migrates the control area of vmcb02 with EVENTINJ fields set.

- on the migration target, we inject another interrupt into the guest via EVENTINJ
(svm_check_nested_events() checks nested_run_pending to avoid doing this,
but nested_run_pending was not migrated correctly)
and overwrite the EVENTINJ - the injection is lost.

Paolo back then proposed to me that instead of doing a direct copy from VMCB12 to VMCB02
we should instead go through 'vcpu->arch.interrupt' and such.
I had a prototype of this but never got around to cleaning it up for upstream,
knowing that the current way also works.
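
A rough sketch of what that alternative could look like (illustrative only,
not the actual prototype, and mirroring the dispatch in
nested_svm_maybe_reinject() above; whether the event should end up as
"pending" or already "injected" is one of the details such a rework would
have to settle):

/*
 * Illustrative only: instead of copying EVENTINJ from VMCB12 to VMCB02,
 * parse it into KVM's generic injection state at nested VMRUN time so
 * that migration naturally preserves the pending event.
 */
static void nested_vmrun_parse_eventinj(struct kvm_vcpu *vcpu,
					u32 event_inj, u32 event_inj_err)
{
	unsigned int vector = event_inj & SVM_EVTINJ_VEC_MASK;
	unsigned int type = event_inj & SVM_EVTINJ_TYPE_MASK;

	if (!(event_inj & SVM_EVTINJ_VALID))
		return;

	switch (type) {
	case SVM_EVTINJ_TYPE_NMI:
		vcpu->arch.nmi_injected = true;
		break;
	case SVM_EVTINJ_TYPE_EXEPT:
		if (event_inj & SVM_EVTINJ_VALID_ERR)
			kvm_requeue_exception_e(vcpu, vector, event_inj_err);
		else
			kvm_requeue_exception(vcpu, vector);
		break;
	case SVM_EVTINJ_TYPE_SOFT:
	case SVM_EVTINJ_TYPE_INTR:
		kvm_queue_interrupt(vcpu, vector,
				    type == SVM_EVTINJ_TYPE_SOFT);
		break;
	}
}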




>
> /* Only a few fields of int_ctl are written by the processor. */
> mask = V_IRQ_MASK | V_TPR_MASK;
> @@ -669,6 +678,54 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
> to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
> }
>
> +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu)

A personal taste note: I don't like the 'maybe' for some reason.
But I won't fight over this.

> +{
> + struct vcpu_svm *svm = to_svm(vcpu);
> + unsigned int vector, type;
> + u32 exitintinfo = svm->vmcb->control.exit_int_info;
> +
> + if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
> + return;
> +
> + /*
> + * No L1 -> L2 event to re-inject?
> + *
> + * In this case event_inj will be cleared by
> + * nested_sync_control_from_vmcb02().
> + */
> + if (!(svm->nested.ctl.event_inj & SVM_EVTINJ_VALID))
> + return;
> +
> + /* If the last event injection was successful there shouldn't be any pending event */
> + if (WARN_ON_ONCE(!(exitintinfo & SVM_EXITINTINFO_VALID)))
> + return;
> +
> + kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
> + vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
> + type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
> +
> + switch (type) {
> + case SVM_EXITINTINFO_TYPE_NMI:
> + vcpu->arch.nmi_injected = true;
> + break;
> + case SVM_EXITINTINFO_TYPE_EXEPT:
> + if (exitintinfo & SVM_EXITINTINFO_VALID_ERR)
> + kvm_requeue_exception_e(vcpu, vector,
> + svm->vmcb->control.exit_int_info_err);
> + else
> + kvm_requeue_exception(vcpu, vector);
> + break;
> + case SVM_EXITINTINFO_TYPE_SOFT:

Note that AFAIK, SVM_EXITINTINFO_TYPE_SOFT is only for INTn instructions,
while INT3 and INTO are considered normal exceptions, but the EVENTINJ
handling has a special case for them.


On VMX it is a bit cleaner:
It has:

3 - normal stock exception caused by CPU itself, like #PF and such

4 - INTn
* does DPL check and uses VM_EXIT_INSTRUCTION_LEN

5 - ICEBP/INT1, which SVM doesn't support re-injecting
* doesn't do DPL check, but uses VM_EXIT_INSTRUCTION_LEN I think

6 - software exception (INT3/INTO)
* does DPL check and uses VM_EXIT_INSTRUCTION_LEN as well

I don't know if there is any difference between 4 and 6.




> + case SVM_EXITINTINFO_TYPE_INTR:
> + kvm_queue_interrupt(vcpu, vector, type == SVM_EXITINTINFO_TYPE_SOFT);
> + break;
> + default:
> + vcpu_unimpl(vcpu, "unknown L1 -> L2 exitintinfo type 0x%x\n", type);
> + break;
> + }
> +}
> +
> int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
> struct vmcb *vmcb12, bool from_vmrun)
> {
> @@ -898,6 +955,10 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
> if (svm->nrips_enabled)
> vmcb12->control.next_rip = vmcb->control.next_rip;
>
> + /* Forget about any pending L1 event injection since it's a L1 worry now */
> + svm->nested.ctl.event_inj = 0;
> + svm->nested.ctl.event_inj_err = 0;
> +
> vmcb12->control.int_ctl = svm->nested.ctl.int_ctl;
> vmcb12->control.tlb_ctl = svm->nested.ctl.tlb_ctl;
> vmcb12->control.event_inj = svm->nested.ctl.event_inj;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 1e5d904aeec3..5b128baa5e57 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3322,13 +3322,18 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
>
> - WARN_ON(!gif_set(svm));
> + WARN_ON(!(vcpu->arch.interrupt.soft || gif_set(svm)));
>
> trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
> ++vcpu->stat.irq_injections;
>
> svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
> - SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
> + SVM_EVTINJ_VALID;
> + if (vcpu->arch.interrupt.soft) {
> + svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_SOFT;
> + } else {
> + svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_INTR;
> + }
> }
>
> void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
> if (!(exitintinfo & SVM_EXITINTINFO_VALID))
> return;
>
> + /* L1 -> L2 event re-injection needs a different handling */
> + if (is_guest_mode(vcpu) &&
> + exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> + svm->nested.ctl.event_inj_err)) {
> + nested_svm_maybe_reinject(vcpu);
> + return;
> + }
> +
> kvm_make_request(KVM_REQ_EVENT, vcpu);
>
> vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index f757400fc933..7cafc2e6c82a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -488,6 +488,52 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> }
>
> +static inline bool event_inj_same(u32 event_inj1, u32 event_inj_err1,
> + u32 event_inj2, u32 event_inj_err2)
> +{
> + unsigned int vector_1, vector_2, type_1, type_2;
> +
> + /* Either of them not valid? */
> + if (!(event_inj1 & SVM_EVTINJ_VALID) ||
> + !(event_inj2 & SVM_EVTINJ_VALID))
> + return false;
> +
> + vector_1 = event_inj1 & SVM_EVTINJ_VEC_MASK;
> + type_1 = event_inj1 & SVM_EVTINJ_TYPE_MASK;
> + vector_2 = event_inj2 & SVM_EVTINJ_VEC_MASK;
> + type_2 = event_inj2 & SVM_EVTINJ_TYPE_MASK;
> +
> + /* Different vector or type? */
> + if (vector_1 != vector_2 || type_1 != type_2)
> + return false;
> +
> + /* Different error code presence flag? */
> + if ((event_inj1 & SVM_EVTINJ_VALID_ERR) !=
> + (event_inj2 & SVM_EVTINJ_VALID_ERR))
> + return false;
> +
> + /* No error code? */
> + if (!(event_inj1 & SVM_EVTINJ_VALID_ERR))
> + return true;
> +
> + /* Same error code? */
> + return event_inj_err1 == event_inj_err2;
> +}
> +
> +/* Did the last VMEXIT happen when attempting to inject that event? */
> +static inline bool exit_during_event_injection(struct vcpu_svm *svm,
> + u32 event_inj, u32 event_inj_err)
> +{
> + BUILD_BUG_ON(SVM_EXITINTINFO_VEC_MASK != SVM_EVTINJ_VEC_MASK ||
> + SVM_EXITINTINFO_TYPE_MASK != SVM_EVTINJ_TYPE_MASK ||
> + SVM_EXITINTINFO_VALID != SVM_EVTINJ_VALID ||
> + SVM_EXITINTINFO_VALID_ERR != SVM_EVTINJ_VALID_ERR);
> +
> + return event_inj_same(svm->vmcb->control.exit_int_info,
> + svm->vmcb->control.exit_int_info_err,
> + event_inj, event_inj_err);
> +}
> +
> /* svm.c */
> #define MSR_INVALID 0xffffffffU
>
> @@ -540,6 +586,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
> return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
> }
>
> +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu);
> int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
> u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
> void svm_leave_nested(struct kvm_vcpu *vcpu);
>


I will also review Sean's take on this, let's see which one is simpler.


Best regards,
Maxim Levitsky



2022-04-05 00:26:54

by Maciej S. Szmigiero

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On 4.04.2022 11:53, Maxim Levitsky wrote:
> On Thu, 2022-03-10 at 22:38 +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <[email protected]>
>>
>> In SVM synthetic software interrupts or INT3 or INTO exception that L1
>> wants to inject into its L2 guest are forgotten if there is an intervening
>> L0 VMEXIT during their delivery.
>>
>> They are re-injected correctly with VMX, however.
>>
>> This is because there is an assumption in SVM that such exceptions will be
>> re-delivered by simply re-executing the current instruction.
>> Which might not be true if this is a synthetic exception injected by L1,
>> since in this case the re-executed instruction will be one already in L2,
>> not the VMRUN instruction in L1 that attempted the injection.
>>
>> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
>> it is either re-injected successfully or returned to L1 upon a nested
>> VMEXIT.
>> Make sure to always re-queue such event if returned in EXITINTINFO.
>>
>> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
>> unforeseen regressions.
>
> Some time ago I noticed this too, but haven't dug into this too much.
> I rememeber I even had some half-baked patch for this I never posted,
> because I didn't think about the possibility of this syntetic injection.
>
> Just to be clear that I understand this correctly:
>
> 1. What is happening is that L1 is injecting INTn/INTO/INT3 but L2 code
> doesn't actualy contain an INTn/INTO/INT3 instruction.
> This is wierd but legal thing to do.
> Again, if L2 actually contained the instruction, it would have worked?

I think so (haven't tested it though).

>
> 2. When actual INTn/INT0/INT3 are intercepted on SVM, then
> save.RIP points to address of the instruction, and control.next_rip
> points to address of next instruction after (as expected)

Yes.

> 3. When EVENTINJ is used with '(TYPE = 3) with vectors 3 or 4'
> or 'TYPE=4', then next_rip is pushed on the stack, while save.RIP is
> pretty much ignored, and exectution jumps to the handler in the IDT.

Yes.

> also at least for INT3/INTO, PRM states that IDT's DPL field is checked
> before dispatch, meaning that we can get legit #GP during delivery.
> (this looks like another legit reason to fix exception merging in KVM)
>

That's right.

> Best regards,
> Maxim Levitsky
>
>
>>
>> Signed-off-by: Maciej S. Szmigiero <[email protected]>
>> ---
>> arch/x86/kvm/svm/nested.c | 65 +++++++++++++++++++++++++++++++++++++--
>> arch/x86/kvm/svm/svm.c | 17 ++++++++--
>> arch/x86/kvm/svm/svm.h | 47 ++++++++++++++++++++++++++++
>> 3 files changed, 125 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index 9656f0d6815c..75017bf77955 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -420,8 +420,17 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
>> void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
>> {
>> u32 mask;
>> - svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
>> - svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
>> +
>> + /*
>> + * Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err}
>> + * if its re-injection is needed
>> + */
>> + if (!exit_during_event_injection(svm, svm->nested.ctl.event_inj,
>> + svm->nested.ctl.event_inj_err)) {
>> + WARN_ON_ONCE(svm->vmcb->control.event_inj & SVM_EVTINJ_VALID);
>> + svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
>> + svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
>> + }
>
> Beware that this could backfire in regard to nested migration.
>
> I once chased a nasty bug related to this.
>
> The bug was:
>
> - L1 does VMRUN with injection (EVENTINJ set)
>
> - VMRUN exit handler is 'executed' by KVM,
> This copies EVENINJ fields from VMCB12 to VMCB02
>
> - Once VMRUN exit handler is done executing, we exit to userspace to start migration
> (basically static_call(kvm_x86_handle_exit)(...) handles the SVM_EXIT_VMRUN,
> and that is all, vcpu_enter_guest isn't called again, so injection is not canceled)
>
> - migration happens and it migrates the control area of vmcb02 with EVENTINJ fields set.
>
> - on migration target, we inject another interrupt to the guest via EVENTINJ
> because svm_check_nested_events checks nested_run_pending to avoid doing this
> but nested_run_pending was not migrated correctly,
> and overwrites the EVENTINJ - injection is lost.
>
> Paolo back then proposed to me that instead of doing direct copy from VMCB12 to VMCB02
> we should instead go through 'vcpu->arch.interrupt' and such.
> I had a prototype of this but never gotten to clean it up to be accepted upstream,
> knowing that current way also works.
>

This sounds like a valid, but different, bug - to be honest, it would
look much cleaner to me, too, if EVENTINJ were parsed from VMCB12 into the
relevant KVM injection structures on a nested VMRUN rather than following
a hybrid approach:
1) Copy the field from VMCB12 to VMCB02 directly on a nested VMRUN,

2) Parse the EXITINTINFO into KVM injection structures when re-injecting.

>>
>> /* Only a few fields of int_ctl are written by the processor. */
>> mask = V_IRQ_MASK | V_TPR_MASK;
>> @@ -669,6 +678,54 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
>> to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
>> }
>>
>> +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu)
>
> A personal taste note: I don't like the 'maybe' for some reason.
> But I won't fight over this.

What's your proposed name then?

>> +{
>> + struct vcpu_svm *svm = to_svm(vcpu);
>> + unsigned int vector, type;
>> + u32 exitintinfo = svm->vmcb->control.exit_int_info;
>> +
>> + if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
>> + return;
>> +
>> + /*
>> + * No L1 -> L2 event to re-inject?
>> + *
>> + * In this case event_inj will be cleared by
>> + * nested_sync_control_from_vmcb02().
>> + */
>> + if (!(svm->nested.ctl.event_inj & SVM_EVTINJ_VALID))
>> + return;
>> +
>> + /* If the last event injection was successful there shouldn't be any pending event */
>> + if (WARN_ON_ONCE(!(exitintinfo & SVM_EXITINTINFO_VALID)))
>> + return;
>> +
>> + kvm_make_request(KVM_REQ_EVENT, vcpu);
>> +
>> + vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
>> + type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
>> +
>> + switch (type) {
>> + case SVM_EXITINTINFO_TYPE_NMI:
>> + vcpu->arch.nmi_injected = true;
>> + break;
>> + case SVM_EXITINTINFO_TYPE_EXEPT:
>> + if (exitintinfo & SVM_EXITINTINFO_VALID_ERR)
>> + kvm_requeue_exception_e(vcpu, vector,
>> + svm->vmcb->control.exit_int_info_err);
>> + else
>> + kvm_requeue_exception(vcpu, vector);
>> + break;
>> + case SVM_EXITINTINFO_TYPE_SOFT:
>
> Note that AFAIK, SVM_EXITINTINFO_TYPE_SOFT is only for INTn instructions,
> while INT3 and INTO are considered normal exceptions but EVENTINJ
> handling has special case for them.
>
That's right.

> On VMX it is a bit cleaner:
> It has:
>
> 3 - normal stock exception caused by CPU itself, like #PF and such
>
> 4 - INTn
> * does DPL check and uses VM_EXIT_INSTRUCTION_LEN
>
> 5 - ICEBP/INT1, which SVM doesnt support to re-inject
> * doesn't do DPL check, but uses VM_EXIT_INSTRUCTION_LEN I think
>
> 6 - software exception (INT3/INTO)
> * does DPL check and uses VM_EXIT_INSTRUCTION_LEN as well
>
> I don't know if there is any difference between 4 and 6.
>
>
>
>
(..)
>
>
> I will also review Sean's take on this, let see which one is simplier.

Since Sean's patch set is supposed to supersede this one, let's continue
the discussion there.

> Best regards,
> Maxim Levitsky

Thanks,
Maciej

2022-04-05 01:32:18

by Sean Christopherson

Subject: Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events

On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> On 1.04.2022 02:08, Sean Christopherson wrote:
> > where as SVM unconditionally treats #BP and #OF as soft:
> >
> > Injecting an exception (TYPE = 3) with vectors 3 or 4 behaves like a trap raised by
> > INT3 and INTO instructions
> >
> > Now I'm curious why Intel doesn't do the same...
>
> Perhaps it's just a result of VMX and SVM being developed independently,
> in parallel.

That definitely factors in, but nothing in VMX is spurious/random, there's always
a reason/motivation behind the architecture, even if that reason isn't obvious.
It bugs me that I can't figure out the "why" :-)

Ugh, but SVM still needs to fix injection for software interrupts, i.e. svm_inject_irq()
is broken.

> > > We still need L1 -> L2 event injection detection to restore the NextRIP
> > > field when re-injecting an event that uses it.
> >
> > You lost me on this one. KVM L0 is only (and always!) responsible for saving the
> > relevant info into vmcb12, why does it need to detect where the vectoring exception
> > came from?
>
> Look at the description of patch 4 of this patch set - after some
> L2 VMEXITs handled by L0 (in other words, which do *not* result in
> a nested VMEXIT to L1) the VMCB02 NextRIP field will be zero
> (APM 15.7.1 "State Saved on Exit" describes when this happens).
>
> If we attempt to re-inject some types of events with the NextRIP field
> being zero the return address pushed on stack will also be zero, which
> will obviously do bad things to the L2 when it returns from
> an (exception|interrupt) handler.

Right, but that issue isn't unique to L2 exits handled by L0. E.g. if KVM is using
shadow paging and a #PF "owned" by KVM occurs while vectoring INT3/INTO, then KVM
needs to restore NextRIP after resolving the #PF.

Argh, but NextRIP being zeroed makes it painful to handle the scenario where the
INT3/INTO isn't injected, because KVM doesn't have a record of the instruction length.
That's a big fail on SVM's part :-/

Huh, actually, it's not too bad to support. SVM already has the code to compute
the next RIP via the emulator, it's easy enough to reuse that code to calculate
NextRIP and then unwind RIP.
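
The idea, roughly (a sketch only, not the attached patch; it reuses the
existing EMULTYPE_SKIP emulation path to find the next RIP and then puts
RIP back, and the helper name is made up):

/*
 * Sketch only: compute the RIP following the INTn/INT3/INTO by letting
 * the emulator skip the instruction, stash it for the NextRIP field,
 * then unwind RIP so the injection itself doesn't advance the guest.
 */
static int svm_compute_soft_event_next_rip(struct kvm_vcpu *vcpu, u64 *next_rip)
{
	unsigned long old_rip = kvm_rip_read(vcpu);

	if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
		return -EIO;

	*next_rip = kvm_rip_read(vcpu);
	kvm_rip_write(vcpu, old_rip);
	return 0;
}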

With your patch 1, your selftest and all KVM unit tests pass[*] with the attached
patch. I'll split it up and send out a proper series with your other fixes and
selftests. I apologize for usurping your series, I was sketching up a diff to
show what I had in mind for reinjecting and it ended up being less code than I
expected to actually get it working.

[*] The new SVM KUT tests don't pass, but that's a different mess, and anything
that uses INVPCID doesn't pass with nrips=0 && npt=0 because KVM's emulator
doesn't know how to decode INVPCID, but KVM needs to intercept INVPCID when
using shadow paging.


Attachments:
(No filename) (2.82 kB)
0001-KVM-SVM-Re-inject-soft-interrupts-instead-of-retryin.patch (6.55 kB)