2022-04-05 01:34:53

by Sean Christopherson

Subject: [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection

This is a continuation/alternative of Maciej's series[*] to fix soft
interrupt/exception reinjection. The core difference is that this version
fixes the underlying issue of not doing proper reinjection, which
manifests most readily as the nested virtualization bugs Maciej's series
addressed.

The underlying issue is that SVM simply retries INT* instructions instead
of reinjecting the soft interrupt/exception if an exception VM-Exit occurred
during vectoring. Lack of reinjection breaks nested virtualization if
the injected event came from L1 and the VM-Exit is not forwarded to L1,
as there is no instruction to retry. More fundamentally, retrying the
instruction is wrong as it can produce side effects that shouldn't occur,
e.g. code #DBs.

VMX has been fixed since commit 66fd3f7f901f ("KVM: Do not re-execute
INTn instruction."), but SVM was left behind, probably because fixing
SVM is a mess: NRIPS isn't supported on all CPUs, and where it is
supported it is poorly implemented with respect to soft events.

[*] https://lore.kernel.org/all/[email protected]

Maciej S. Szmigiero (3):
KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq()
KVM: selftests: nSVM: Add svm_nested_soft_inject_test

Sean Christopherson (5):
KVM: SVM: Unwind "speculative" RIP advancement if INTn injection
"fails"
KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is
supported
KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"
KVM: x86: Trace re-injected exceptions

arch/x86/kvm/svm/nested.c | 22 ++-
arch/x86/kvm/svm/svm.c | 135 ++++++++++++----
arch/x86/kvm/svm/svm.h | 5 +-
arch/x86/kvm/x86.c | 8 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/svm_util.h | 2 +
.../kvm/x86_64/svm_nested_soft_inject_test.c | 147 ++++++++++++++++++
8 files changed, 279 insertions(+), 42 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c


base-commit: 81d50efcff6cf4310aaf6a19806416ffeccf1cdb
--
2.35.1.1094.g7c7d902a7c-goog


2022-04-05 03:30:52

by Sean Christopherson

Subject: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

Re-inject INT3/INTO instead of retrying the instruction if the CPU
encountered an intercepted exception while vectoring the software
exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
shadow paging. Retrying the instruction is architecturally wrong, e.g.
will result in a spurious #DB if there's a code breakpoint on the INT3/INTO,
and lack of re-injection also breaks nested virtualization, e.g. if L1
injects a software exception and vectoring the injected exception
encounters an exception that is intercepted by L0 but not L1.

Due to, ahem, deficiencies in the SVM architecture, acquiring the next
RIP may require flowing through the emulator even if NRIPS is supported,
as the CPU clears next_rip if the VM-Exit is due to an exception other
than "exceptions caused by the INT3, INTO, and BOUND instructions". To
deal with this, "skip" the instruction to calculate next_rip, and then
unwind the RIP write and any side effects (RFLAGS updates).

Reported-by: Maciej S. Szmigiero <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 111 ++++++++++++++++++++++++++++-------------
arch/x86/kvm/svm/svm.h | 4 +-
2 files changed, 79 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6ea8f16e39ac..ecc828d6921e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -341,9 +341,11 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)

}

-static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
+ bool commit_side_effects)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long old_rflags;

/*
* SEV-ES does not expose the next RIP. The RIP update is controlled by
@@ -358,18 +360,71 @@ static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
}

if (!svm->next_rip) {
+ if (unlikely(!commit_side_effects))
+ old_rflags = svm->vmcb->save.rflags;
+
if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
return 0;
+
+ if (unlikely(!commit_side_effects))
+ svm->vmcb->save.rflags = old_rflags;
} else {
kvm_rip_write(vcpu, svm->next_rip);
}

done:
- svm_set_interrupt_shadow(vcpu, 0);
+ if (likely(commit_side_effects))
+ svm_set_interrupt_shadow(vcpu, 0);

return 1;
}

+static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+{
+ return __svm_skip_emulated_instruction(vcpu, true);
+}
+
+static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip, old_rip = kvm_rip_read(vcpu);
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ /*
+ * Due to architectural shortcomings, the CPU doesn't always provide
+ * NextRIP, e.g. if KVM intercepted an exception that occurred while
+ * the CPU was vectoring an INTO/INT3 in the guest. Temporarily skip
+ * the instruction even if NextRIP is supported to acquire the next
+ * RIP so that it can be shoved into the NextRIP field, otherwise
+ * hardware will fail to advance guest RIP during event injection.
+ * Drop the exception/interrupt if emulation fails and effectively
+ * retry the instruction, it's the least awful option. If NRIPS is
+ * in use, the skip must not commit any side effects such as clearing
+ * the interrupt shadow or RFLAGS.RF.
+ */
+ if (!__svm_skip_emulated_instruction(vcpu, !nrips))
+ return -EIO;
+
+ rip = kvm_rip_read(vcpu);
+
+ /*
+ * If NextRIP is supported, rewind RIP and update NextRip. If NextRip
+ * isn't supported, keep the result of the skip as the CPU obviously
+ * won't advance RIP, but stash away the injection information so that
+ * RIP can be unwound if injection fails.
+ */
+ if (nrips) {
+ kvm_rip_write(vcpu, old_rip);
+ svm->vmcb->control.next_rip = rip;
+ } else {
+ if (boot_cpu_has(X86_FEATURE_NRIPS))
+ svm->vmcb->control.next_rip = rip;
+
+ svm->soft_int_linear_rip = rip + svm->vmcb->save.cs.base;
+ svm->soft_int_injected = rip - old_rip;
+ }
+ return 0;
+}
+
static void svm_queue_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
@@ -379,25 +434,9 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)

kvm_deliver_exception_payload(vcpu);

- if (nr == BP_VECTOR && !nrips) {
- unsigned long rip, old_rip = kvm_rip_read(vcpu);
-
- /*
- * For guest debugging where we have to reinject #BP if some
- * INT3 is guest-owned:
- * Emulate nRIP by moving RIP forward. Will fail if injection
- * raises a fault that is not intercepted. Still better than
- * failing in all cases.
- */
- (void)svm_skip_emulated_instruction(vcpu);
- rip = kvm_rip_read(vcpu);
-
- if (boot_cpu_has(X86_FEATURE_NRIPS))
- svm->vmcb->control.next_rip = rip;
-
- svm->int3_rip = rip + svm->vmcb->save.cs.base;
- svm->int3_injected = rip - old_rip;
- }
+ if (kvm_exception_is_soft(nr) &&
+ svm_update_soft_interrupt_rip(vcpu))
+ return;

svm->vmcb->control.event_inj = nr
| SVM_EVTINJ_VALID
@@ -3676,9 +3715,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
u8 vector;
int type;
u32 exitintinfo = svm->vmcb->control.exit_int_info;
- unsigned int3_injected = svm->int3_injected;
+ unsigned soft_int_injected = svm->soft_int_injected;

- svm->int3_injected = 0;
+ svm->soft_int_injected = 0;

/*
* If we've made progress since setting HF_IRET_MASK, we've
@@ -3698,6 +3737,18 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
if (!(exitintinfo & SVM_EXITINTINFO_VALID))
return;

+ /*
+ * If NextRIP isn't enabled, KVM must manually advance RIP prior to
+ * injecting the soft exception/interrupt. That advancement needs to
+ * be unwound if vectoring didn't complete. Note, the _new_ event may
+ * not be the injected event, e.g. if KVM injected an INTn, the INTn
+ * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
+ * be the reported vectored event, but RIP still needs to be unwound.
+ */
+ if (soft_int_injected &&
+ kvm_is_linear_rip(vcpu, to_svm(vcpu)->soft_int_linear_rip))
+ kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);
+
kvm_make_request(KVM_REQ_EVENT, vcpu);

vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
@@ -3711,9 +3762,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
* hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
* be the reported vectored event, but RIP still needs to be unwound.
*/
- if (int3_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
- kvm_is_linear_rip(vcpu, svm->int3_rip))
- kvm_rip_write(vcpu, kvm_rip_read(vcpu) - int3_injected);
+ if (soft_int_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
+ kvm_is_linear_rip(vcpu, svm->soft_int_linear_rip))
+ kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);

switch (type) {
case SVM_EXITINTINFO_TYPE_NMI:
@@ -3726,14 +3777,6 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
if (vector == X86_TRAP_VC)
break;

- /*
- * In case of software exceptions, do not reinject the vector,
- * but re-execute the instruction instead. Rewind RIP first
- * if we emulated INT3 before.
- */
- if (kvm_exception_is_soft(vector))
- break;
-
if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
u32 err = svm->vmcb->control.exit_int_info_err;
kvm_requeue_exception_e(vcpu, vector, err);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 47e7427d0395..a770a1c7ddd2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -230,8 +230,8 @@ struct vcpu_svm {
bool nmi_singlestep;
u64 nmi_singlestep_guest_rflags;

- unsigned int3_injected;
- unsigned long int3_rip;
+ unsigned soft_int_injected;
+ unsigned long soft_int_linear_rip;

/* optional nested SVM features that are enabled for this guest */
bool nrips_enabled : 1;
--
2.35.1.1094.g7c7d902a7c-goog

2022-04-06 16:41:51

by Maciej S. Szmigiero

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On 6.04.2022 03:48, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
(..)
>> Also, I'm not sure that even the proposed updated code above will
>> actually restore the L1-requested next_rip correctly on L1 -> L2
>> re-injection (will review once the full version is available).
>
> Spoiler alert, it doesn't. Save yourself the review time. :-)
>
> The missing piece is stashing away the injected event on nested VMRUN. Those
> events don't get routed through the normal interrupt/exception injection code and
> so the next_rip info is lost on the subsequent #NPF.
>
> Treating soft interrupts/exceptions like they were injected by KVM (which they
> are, technically) works and doesn't seem too gross. E.g. when prepping vmcb02
>
> if (svm->nrips_enabled)
> vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> else if (boot_cpu_has(X86_FEATURE_NRIPS))
> vmcb02->control.next_rip = vmcb12_rip;
>
> if (is_evtinj_soft(vmcb02->control.event_inj)) {
> svm->soft_int_injected = true;
> svm->soft_int_csbase = svm->vmcb->save.cs.base;
> svm->soft_int_old_rip = vmcb12_rip;
> if (svm->nrips_enabled)
> svm->soft_int_next_rip = svm->nested.ctl.next_rip;
> else
> svm->soft_int_next_rip = vmcb12_rip;
> }
>
> And then the VMRUN error path just needs to clear soft_int_injected.

I am also a fan of parsing EVENTINJ from VMCB12 into relevant KVM
injection structures (much like EXITINTINFO is parsed), as I said to
Maxim two days ago [1].
Not only for software {interrupts,exceptions} but for all incoming
events (again, just like EXITINTINFO).

However, there is another issue related to L1 -> L2 event re-injection
using standard KVM event injection mechanism: it mixes the L1 injection
state with the L2 one.

Specifically for SVM:
* When re-injecting a NMI into L2 NMI-blocking is enabled in
vcpu->arch.hflags (shared between L1 and L2) and IRET intercept is
enabled.

This is incorrect, since it is L1 that is responsible for enforcing NMI
blocking for NMIs that it injects into its L2.
Also, *L2* being the target of such injection definitely should not block
further NMIs for *L1*.

* When re-injecting a *hardware* IRQ into L2 GIF is checked (previously
even on the BUG_ON() level), while L1 should be able to inject even when
L2 GIF is off,

With the code in my previous patch set I planned to use
exit_during_event_injection() to detect such case, but if we implement
VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
comes from L1, so its normal injection side-effects should be skipped.
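
As a rough sketch of that idea (the svm->nested.l1_injected_event flag below is
hypothetical, not existing KVM code; it would be set when parsing VMCB12 EVENTINJ
on nested VMRUN and cleared once the event has been injected or cancelled):

static void svm_inject_nmi(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;

	/*
	 * NMI blocking and the IRET intercept track *L1's* NMI window, so
	 * skip them when the NMI being injected was requested by L1 for L2.
	 */
	if (svm->nested.l1_injected_event)
		return;

	vcpu->arch.hflags |= HF_NMI_MASK;
	if (!sev_es_guest(vcpu->kvm))
		svm_set_intercept(svm, INTERCEPT_IRET);
}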

By the way, the relevant VMX code also looks rather suspicious,
especially for the !enable_vnmi case.

Thanks,
Maciej

[1]: https://lore.kernel.org/kvm/[email protected]/

2022-04-06 20:04:26

by Sean Christopherson

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 03:48, Sean Christopherson wrote:
> > On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> (..)
> > > Also, I'm not sure that even the proposed updated code above will
> > > actually restore the L1-requested next_rip correctly on L1 -> L2
> > > re-injection (will review once the full version is available).
> >
> > Spoiler alert, it doesn't. Save yourself the review time. :-)
> >
> > The missing piece is stashing away the injected event on nested VMRUN. Those
> > events don't get routed through the normal interrupt/exception injection code and
> > so the next_rip info is lost on the subsequent #NPF.
> >
> > Treating soft interrupts/exceptions like they were injected by KVM (which they
> > are, technically) works and doesn't seem too gross. E.g. when prepping vmcb02
> >
> > if (svm->nrips_enabled)
> > vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> > else if (boot_cpu_has(X86_FEATURE_NRIPS))
> > vmcb02->control.next_rip = vmcb12_rip;
> >
> > if (is_evtinj_soft(vmcb02->control.event_inj)) {
> > svm->soft_int_injected = true;
> > svm->soft_int_csbase = svm->vmcb->save.cs.base;
> > svm->soft_int_old_rip = vmcb12_rip;
> > if (svm->nrips_enabled)
> > svm->soft_int_next_rip = svm->nested.ctl.next_rip;
> > else
> > svm->soft_int_next_rip = vmcb12_rip;
> > }
> >
> > And then the VMRUN error path just needs to clear soft_int_injected.
>
> I am also a fan of parsing EVENTINJ from VMCB12 into relevant KVM
> injection structures (much like EXITINTINFO is parsed), as I said to
> Maxim two days ago [1].
> Not only for software {interrupts,exceptions} but for all incoming
> events (again, just like EXITINTINFO).

Ahh, I saw that fly by, but somehow I managed to misread what you intended.

I like the idea of populating vcpu->arch.interrupt/exception as "injected" events.
KVM prioritizes "injected" over other nested events, so in theory it should work
without too much fuss. I've run through a variety of edge cases in my head and
haven't found anything that would be fundamentally broken. I think even live
migration would work.

I think I'd prefer to do that in a follow-up series so that nVMX can be converted
at the same time? It's a bit absurd to add the above soft int code knowing that,
at least in theory, simply populating the right software structs would automagically
fix the bug. But manually handling the soft int case first would be safer in the
sense that we'd still have a fix for the soft int case if it turns out that populating
vcpu->arch.interrupt/exception isn't as straightforward as it seems.

> However, there is another issue related to L1 -> L2 event re-injection
> using standard KVM event injection mechanism: it mixes the L1 injection
> state with the L2 one.
>
> Specifically for SVM:
> * When re-injecting a NMI into L2 NMI-blocking is enabled in
> vcpu->arch.hflags (shared between L1 and L2) and IRET intercept is
> enabled.
>
> This is incorrect, since it is L1 that is responsible for enforcing NMI
> blocking for NMIs that it injects into its L2.

Ah, I see what you're saying. I think :-) IIUC, we can fix this bug without any
new flags, just skip the side effects if the NMI is being injected into L2.

@@ -3420,6 +3424,10 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);

svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
+
+ if (is_guest_mode(vcpu))
+ return;
+
vcpu->arch.hflags |= HF_NMI_MASK;
if (!sev_es_guest(vcpu->kvm))
svm_set_intercept(svm, INTERCEPT_IRET);

and for nVMX:

@@ -4598,6 +4598,9 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

+ if (is_guest_mode(vcpu))
+ goto inject_nmi;
+
if (!enable_vnmi) {
/*
* Tracking the NMI-blocked state in software is built upon
@@ -4619,6 +4622,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
return;
}

+inject_nmi:
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);


> Also, *L2* being the target of such injection definitely should not block
> further NMIs for *L1*.

Actually, it should block NMIs for L1. From L1's perspective, the injection is
part of VM-Entry. That's a single gigantic instruction, thus there is no NMI window
until VM-Entry completes from L1's perspective. Any exit that occurs on vectoring
an injected event and is handled by L0 should not be visible to L1, because from
L1's perspective it's all part of VMRUN/VMLAUNCH/VMRESUME. So blocking new events
because an NMI (or any event) needs to be reinjected for L2 is correct.

> * When re-injecting a *hardware* IRQ into L2 GIF is checked (previously
> even on the BUG_ON() level), while L1 should be able to inject even when
> L2 GIF is off,

Isn't that just a matter of tweaking the assertion to ignore GIF if L2 is
active? Hmm, or deleting the assertion altogether, it's likely doing more harm
than good at this point.

> With the code in my previous patch set I planned to use
> exit_during_event_injection() to detect such case, but if we implement
> VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
> comes from L1, so its normal injection side-effects should be skipped.

Do we still need a flag based on the above? Honest question... I've been staring
at all this for the better part of an hour and may have lost track of things.

> By the way, the relevant VMX code also looks rather suspicious,
> especially for the !enable_vnmi case.

I think it's safe to say we can ignore edge cases for !enable_vnmi. It might even
be worth trying to remove that support again (Paolo tried years ago), IIRC the
only Intel CPUs that don't support virtual NMIs are some funky Yonah SKUs.

> Thanks,
> Maciej
>
> [1]: https://lore.kernel.org/kvm/[email protected]/

2022-04-06 22:12:18

by Sean Christopherson

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 19:10, Sean Christopherson wrote:
> > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> And what if it's L0 that is trying to inject a NMI into L2?
> In this case is_guest_mode() is true, but the full NMI injection machinery
> should be used.

Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
was thinking that NMIs always exit. The "L1 wants" part should be conditioned on
NMI exiting being enabled. It's benign because KVM always wants "real" NMIs, and
so the path is never encountered.

@@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
switch ((u16)exit_reason.basic) {
case EXIT_REASON_EXCEPTION_NMI:
intr_info = vmx_get_intr_info(vcpu);
- if (is_nmi(intr_info))
+ if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
return true;
else if (is_page_fault(intr_info))
return true;


> > > Also, *L2* being the target of such injection definitely should not block
> > > further NMIs for *L1*.
> >
> > Actually, it should block NMIs for L1. From L1's perspective, the injection is
> > part of VM-Entry. That's a single gigantic instruction, thus there is no NMI window
> > until VM-Entry completes from L1's perspective. Any exit that occurs on vectoring
> > an injected event and is handled by L0 should not be visible to L1, because from
> > L1's perspective it's all part of VMRUN/VMLAUNCH/VMRESUME. So blocking new events
> > because an NMI (or any event) needs to be reinjected for L2 is correct.
>
> I think this kind of NMI blocking will be already handled by having
> the pending new NMI in vcpu->arch.nmi_pending but the one that needs
> re-injecting in vcpu->arch.nmi_injected.
>
> The pending new NMI in vcpu->arch.nmi_pending won't be handled until
> vcpu->arch.nmi_injected gets cleared (that is, until re-injection is
> successful).

Yep.

> It is incorrect however, to wait for L2 to execute IRET to unblock
> L0 -> L1 NMIs or L1 -> L2 NMIs, in these two cases we (L0) just need the CPU
> to vector that L2 NMI so it no longer shows in EXITINTINFO.

Yep, and the pending NMI should cause KVM to request an "immediate" exit that
occurs after vectoring completes.

> It is also incorrect to block L1 -> L2 NMI injection because either L1
> or L2 is currently under NMI blocking: the first case is obvious,
> the second because it's L1 that is supposed to take care of proper NMI
> blocking for L2 when injecting an NMI there.

Yep, but I don't think there's a bug here. At least not for nVMX.

> > > * When re-injecting a *hardware* IRQ into L2 GIF is checked (previously
> > > even on the BUG_ON() level), while L1 should be able to inject even when
> > > L2 GIF is off,
> >
> > Isn't that just a matter of tweaking the assertion to ignore GIF if L2 is
> > active? Hmm, or deleting the assertion altogether, it's likely doing more harm
> > than good at this point.
>
> I assume this assertion is meant to catch the case when KVM itself (L0) is
> trying to erroneously inject a hardware interrupt into L1 or L2, so it will
> need to be skipped only for L1 -> L2 event injection.

Yeah, that's what my git archaeology came up with too.

> Whether this assertion benefits outweigh its costs is debatable - don't have
> a strong opinion here (BUG_ON() is for sure too strong, but WARN_ON_ONCE()
> might make sense to catch latent bugs).
>
> > > With the code in my previous patch set I planned to use
> > > exit_during_event_injection() to detect such case, but if we implement
> > > VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
> > > comes from L1, so its normal injection side-effects should be skipped.
> >
> > Do we still need a flag based on the above? Honest question... I've been staring
> > at all this for the better part of an hour and may have lost track of things.
>
> If checking just is_guest_mode() is not enough due to reasons I described
> above then we need to somehow determine in the NMI / IRQ injection handler
> whether the event to be injected into L2 comes from L0 or L1.
> For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.

Yes :-( And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.

> > > By the way, the relevant VMX code also looks rather suspicious,
> > > especially for the !enable_vnmi case.
> >
> > I think it's safe to say we can ignore edge cases for !enable_vnmi. It might even
> > be worth trying to remove that support again (Paolo tried years ago), IIRC the
> > only Intel CPUs that don't support virtual NMIs are some funky Yonah SKUs.
>
> Ack, we could at least disable nested on !enable_vnmi.
>
> BTW, I think that besides Yonah cores very early Core 2 CPUs also lacked
> vNMI support, that's why !enable_vnmi support was reinstated.
> But that's hardware even older than !nrips AMD parts.

2022-04-07 01:13:44

by Sean Christopherson

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 21:48, Sean Christopherson wrote:
> > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > On 6.04.2022 19:10, Sean Christopherson wrote:
> > > > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > And what if it's L0 that is trying to inject a NMI into L2?
> > > In this case is_guest_mode() is true, but the full NMI injection machinery
> > > should be used.
> >
> > Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
> > was thinking that NMIs always exit. The "L1 wants" part should be conditioned on
> > NMI exiting being enabled. It's benign because KVM always wants "real" NMIs, and
> > so the path is never encountered.
> >
> > @@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
> > switch ((u16)exit_reason.basic) {
> > case EXIT_REASON_EXCEPTION_NMI:
> > intr_info = vmx_get_intr_info(vcpu);
> > - if (is_nmi(intr_info))
> > + if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
> > return true;
> > else if (is_page_fault(intr_info))
> > return true;
> >
>
> I guess you mean "benign" when having KVM as L1, since other hypervisors may
> let their L2s handle NMIs themselves.

No, this one's truly benign. The nVMX exit processing is:

if (nested_vmx_l0_wants_exit())
handle in L0 / KVM;

if (nested_vmx_l1_wants_exit())
handle in L1

handle in L0 / KVM

Since this is for actual hardware NMIs, the first "L0 wants" check always returns
true for NMIs, so the fact that KVM screws up L1's wants is a non-issue.

> > > It is also incorrect to block L1 -> L2 NMI injection because either L1
> > > or L2 is currently under NMI blocking: the first case is obvious,
> > > the second because it's L1 that is supposed to take care of proper NMI
> > > blocking for L2 when injecting an NMI there.
> >
> > Yep, but I don't think there's a bug here. At least not for nVMX.
>
> I agree this scenario should currently work (including on nSVM) - mentioned
> it just as a constraint on solution space.
>
> > > > > With the code in my previous patch set I planned to use
> > > > > exit_during_event_injection() to detect such case, but if we implement
> > > > > VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
> > > > > comes from L1, so its normal injection side-effects should be skipped.
> > > >
> > > > Do we still need a flag based on the above? Honest question... I've been staring
> > > > at all this for the better part of an hour and may have lost track of things.
> > >
> > > If checking just is_guest_mode() is not enough due to reasons I described
> > > above then we need to somehow determine in the NMI / IRQ injection handler
> > > whether the event to be injected into L2 comes from L0 or L1.
> > > For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
> >
> > Yes :-( And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.
> >
>
> Another option for saving and restoring a VM would be to add it to
> KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
> control area?).

Ooh. What if we keep nested_run_pending=true until the injection completes? Then
we don't even need an extra flag because nested_run_pending effectively says that
any and all injected events are for L1=>L2. In KVM_GET_NESTED_STATE, shove the
to-be-injected event into the normal vmc*12 injection field, and ignore all
to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.

That should work even for migrating to an older KVM, as keeping nested_run_pending
will cause the target to reprocess the event injection as if it were from nested
VM-Enter, which it technically is.

We could probably get away with completely dropping the intermediate event as
the vmc*12 should still have the original event, but that technically could result
in architecturally incorrect behavior, e.g. if vectoring up to the point of
interception sets A/D bits in the guest.
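
For the KVM_GET_NESTED_STATE side on SVM, a rough sketch of what "shove the
to-be-injected event into the vmc*12 injection field" could look like (the helper
name and the exact spot it would be called from in svm_get_nested_state() are
assumptions, and a matching error code would go into event_inj_err):

static u32 svm_injected_event_to_eventinj(struct kvm_vcpu *vcpu)
{
	/* Only one event can be in the "injected" state at a time. */
	if (vcpu->arch.nmi_injected)
		return SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;

	if (vcpu->arch.interrupt.injected)
		return SVM_EVTINJ_VALID | vcpu->arch.interrupt.nr |
		       (vcpu->arch.interrupt.soft ? SVM_EVTINJ_TYPE_SOFT
						  : SVM_EVTINJ_TYPE_INTR);

	if (vcpu->arch.exception.injected) {
		u32 event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT |
				vcpu->arch.exception.nr;

		if (vcpu->arch.exception.has_error_code)
			event_inj |= SVM_EVTINJ_VALID_ERR;
		return event_inj;
	}

	return 0;
}

The result would only be written into the saved vmcb12 control area when
nested_run_pending is set, leaving the vCPU's own injection state untouched so
that getting state stays non-destructive.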

2022-04-07 01:25:40

by Maciej S. Szmigiero

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On 6.04.2022 22:52, Sean Christopherson wrote:
> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 21:48, Sean Christopherson wrote:
>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>>>> On 6.04.2022 19:10, Sean Christopherson wrote:
>>>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>>>> And what if it's L0 that is trying to inject a NMI into L2?
>>>> In this case is_guest_mode() is true, but the full NMI injection machinery
>>>> should be used.
>>>
>>> Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
>>> was thinking that NMIs always exit. The "L1 wants" part should be conditioned on
>>> NMI exiting being enabled. It's benign because KVM always wants "real" NMIs, and
>>> so the path is never encountered.
>>>
>>> @@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
>>> switch ((u16)exit_reason.basic) {
>>> case EXIT_REASON_EXCEPTION_NMI:
>>> intr_info = vmx_get_intr_info(vcpu);
>>> - if (is_nmi(intr_info))
>>> + if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
>>> return true;
>>> else if (is_page_fault(intr_info))
>>> return true;
>>>
>>
>> I guess you mean "benign" when having KVM as L1, since other hypervisors may
>> let their L2s handle NMIs themselves.
>
> No, this one's truly benign. The nVMX exit processing is:
>
> if (nested_vmx_l0_wants_exit())
> handle in L0 / KVM;
>
> if (nested_vmx_l1_wants_exit())
> handle in L1
>
> handle in L0 / KVM
>
> Since this is for actual hardware NMIs, the first "L0 wants" check always returns
> true for NMIs, so the fact that KVM screws up L1's wants is a non-issue.

Got it, forgot the path was for actual hardware NMIs, which obviously
can't go directly to L1 or L2.

>>>>>> With the code in my previous patch set I planned to use
>>>>>> exit_during_event_injection() to detect such case, but if we implement
>>>>>> VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
>>>>>> comes from L1, so its normal injection side-effects should be skipped.
>>>>>
>>>>> Do we still need a flag based on the above? Honest question... I've been staring
>>>>> at all this for the better part of an hour and may have lost track of things.
>>>>
>>>> If checking just is_guest_mode() is not enough due to reasons I described
>>>> above then we need to somehow determine in the NMI / IRQ injection handler
>>>> whether the event to be injected into L2 comes from L0 or L1.
>>>> For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
>>>
>>> Yes :-( And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.
>>>
>>
>> Another option for saving and restoring a VM would be to add it to
>> KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
>> control area?).
>
> Ooh. What if we keep nested_run_pending=true until the injection completes? Then
> we don't even need an extra flag because nested_run_pending effectively says that
> any and all injected events are for L1=>L2. In KVM_GET_NESTED_STATE, shove the
> to-be-injected event into the normal vmc*12 injection field, and ignore all
> to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.
>
> That should work even for migrating to an older KVM, as keeping nested_run_pending
> will cause the target to reprocess the event injection as if it were from nested
> VM-Enter, which it technically is.

I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
mean *moving* back the L1 -> L2 event to be injected from KVM internal data
structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
VMCB12 EVENTINJ field (or its VMX equivalent).

But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
the moving), only then KVM_GET_VCPU_EVENTS (which will then no longer show
these events as pending).
And their setters in the opposite order when restoring the VM.

Although, if the VMCB12 EVENTINJ field contents perfectly matches things like
arch.nmi_injected there should be no problem in principle if these events are
restored by VMM to both places by KVM_SET_NESTED_STATE and KVM_SET_VCPU_EVENTS.

Another option would be to ignore either the first KVM_SET_VCPU_EVENTS call after
a KVM_SET_NESTED_STATE with KVM_STATE_NESTED_RUN_PENDING set, or every such call
while nested_run_pending is still true (but wouldn't either of these changes
break the KVM API?).

I'm not sure, however, that there isn't some corner case lurking here :(

> We could probably get away with completely dropping the intermediate event as
> the vmc*12 should still have the original event, but that technically could result
> in architecturally incorrect behavior, e.g. if vectoring up to the point of
> interception sets A/D bits in the guest.

Thanks,
Maciej

2022-04-07 03:18:04

by Sean Christopherson

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On Thu, Apr 07, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 22:52, Sean Christopherson wrote:
> > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > Another option for saving and restoring a VM would be to add it to
> > > KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
> > > control area?).
> >
> > Ooh. What if we keep nested_run_pending=true until the injection completes? Then
> > we don't even need an extra flag because nested_run_pending effectively says that
> > any and all injected events are for L1=>L2. In KVM_GET_NESTED_STATE, shove the
> > to-be-injected event into the normal vmc*12 injection field, and ignore all
> > to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.
> >
> > That should work even for migrating to an older KVM, as keeping nested_run_pending
> > will cause the target to reprocess the event injection as if it were from nested
> > VM-Enter, which it technically is.
>
> I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
> mean *moving* back the L1 -> L2 event to be injected from KVM internal data
> structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
> VMCB12 EVENTINJ field (or its VMX equivalent).
>
> But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
> the moving), only then KVM_GET_VCPU_EVENTS (which will then no longer show
> these events as pending).
> And their setters in the opposite order when restoring the VM.

I wasn't thinking of actually moving things in the source VM, only ignoring events
in KVM_GET_VCPU_EVENTS. Getting state shouldn't be destructive, e.g. the source VM
should still be able to continue running.

Ahahahaha, and actually looking at the code, there's this gem in KVM_GET_VCPU_EVENTS

/*
* The API doesn't provide the instruction length for software
* exceptions, so don't report them. As long as the guest RIP
* isn't advanced, we should expect to encounter the exception
* again.
*/
if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
events->exception.injected = 0;
events->exception.pending = 0;
}

and again for soft interrupts

events->interrupt.injected =
vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;

so through KVM's own incompetency, it's already doing half the work.

This is roughly what I had in mind. It will "require" moving nested_run_pending
to kvm_vcpu_arch, but I've been itching for an excuse to do that anyways.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb71727acecb..62c48f6a0815 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4846,6 +4846,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
struct kvm_vcpu_events *events)
{
+ bool drop_injected_events = vcpu->arch.nested_run_pending;
+
process_nmi(vcpu);

if (kvm_check_request(KVM_REQ_SMI, vcpu))
@@ -4872,7 +4874,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
* isn't advanced, we should expect to encounter the exception
* again.
*/
- if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
+ if (drop_injected_events ||
+ kvm_exception_is_soft(vcpu->arch.exception.nr)) {
events->exception.injected = 0;
events->exception.pending = 0;
} else {
@@ -4893,13 +4896,14 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
events->exception_has_payload = vcpu->arch.exception.has_payload;
events->exception_payload = vcpu->arch.exception.payload;

- events->interrupt.injected =
- vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
+ events->interrupt.injected = vcpu->arch.interrupt.injected &&
+ !vcpu->arch.interrupt.soft &&
+ !drop_injected_events;
events->interrupt.nr = vcpu->arch.interrupt.nr;
events->interrupt.soft = 0;
events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);

- events->nmi.injected = vcpu->arch.nmi_injected;
+ events->nmi.injected = vcpu->arch.nmi_injected && !drop_injected_events;
events->nmi.pending = vcpu->arch.nmi_pending != 0;
events->nmi.masked = static_call(kvm_x86_get_nmi_mask)(vcpu);
events->nmi.pad = 0;

2022-04-07 06:35:37

by Maciej S. Szmigiero

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On 6.04.2022 21:48, Sean Christopherson wrote:
> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 19:10, Sean Christopherson wrote:
>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> And what if it's L0 that is trying to inject a NMI into L2?
>> In this case is_guest_mode() is true, but the full NMI injection machinery
>> should be used.
>
> Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
> was thinking that NMIs always exit. The "L1 wants" part should be conditioned on
> NMI exiting being enabled. It's benign because KVM always wants "real" NMIs, and
> so the path is never encountered.
>
> @@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
> switch ((u16)exit_reason.basic) {
> case EXIT_REASON_EXCEPTION_NMI:
> intr_info = vmx_get_intr_info(vcpu);
> - if (is_nmi(intr_info))
> + if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
> return true;
> else if (is_page_fault(intr_info))
> return true;
>

I guess you mean "benign" when having KVM as L1, since other hypervisors may
let their L2s handle NMIs themselves.

>> It is also incorrect to block L1 -> L2 NMI injection because either L1
>> or L2 is currently under NMI blocking: the first case is obvious,
>> the second because it's L1 that is supposed to take care of proper NMI
>> blocking for L2 when injecting an NMI there.
>
> Yep, but I don't think there's a bug here. At least not for nVMX.

I agree this scenario should currently work (including on nSVM) - mentioned
it just as a constraint on solution space.

>>>> With the code in my previous patch set I planned to use
>>>> exit_during_event_injection() to detect such case, but if we implement
>>>> VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
>>>> comes from L1, so its normal injection side-effects should be skipped.
>>>
>>> Do we still need a flag based on the above? Honest question... I've been staring
>>> at all this for the better part of an hour and may have lost track of things.
>>
>> If checking just is_guest_mode() is not enough due to reasons I described
>> above then we need to somehow determine in the NMI / IRQ injection handler
>> whether the event to be injected into L2 comes from L0 or L1.
>> For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
>
> Yes :-( And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.
>

Another option for saving and restoring a VM would be to add it to
KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
control area?).

Thanks,
Maciej

2022-04-07 20:08:22

by Maciej S. Szmigiero

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On 6.04.2022 19:10, Sean Christopherson wrote:
> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 03:48, Sean Christopherson wrote:
>>> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>> (..)
>>>> Also, I'm not sure that even the proposed updated code above will
>>>> actually restore the L1-requested next_rip correctly on L1 -> L2
>>>> re-injection (will review once the full version is available).
>>>
>>> Spoiler alert, it doesn't. Save yourself the review time. :-)
>>>
>>> The missing piece is stashing away the injected event on nested VMRUN. Those
>>> events don't get routed through the normal interrupt/exception injection code and
>>> so the next_rip info is lost on the subsequent #NPF.
>>>
>>> Treating soft interrupts/exceptions like they were injected by KVM (which they
>>> are, technically) works and doesn't seem too gross. E.g. when prepping vmcb02
>>>
>>> if (svm->nrips_enabled)
>>> vmcb02->control.next_rip = svm->nested.ctl.next_rip;
>>> else if (boot_cpu_has(X86_FEATURE_NRIPS))
>>> vmcb02->control.next_rip = vmcb12_rip;
>>>
>>> if (is_evtinj_soft(vmcb02->control.event_inj)) {
>>> svm->soft_int_injected = true;
>>> svm->soft_int_csbase = svm->vmcb->save.cs.base;
>>> svm->soft_int_old_rip = vmcb12_rip;
>>> if (svm->nrips_enabled)
>>> svm->soft_int_next_rip = svm->nested.ctl.next_rip;
>>> else
>>> svm->soft_int_next_rip = vmcb12_rip;
>>> }
>>>
>>> And then the VMRUN error path just needs to clear soft_int_injected.
>>
>> I am also a fan of parsing EVENTINJ from VMCB12 into relevant KVM
>> injection structures (much like EXITINTINFO is parsed), as I said to
>> Maxim two days ago [1].
>> Not only for software {interrupts,exceptions} but for all incoming
>> events (again, just like EXITINTINFO).
>
> Ahh, I saw that fly by, but somehow I managed to misread what you intended.
>
> I like the idea of populating vcpu->arch.interrupt/exception as "injected" events.
> KVM prioritizes "injected" over other nested events, so in theory it should work
> without too much fuss. I've run through a variety of edge cases in my head and
> haven't found anything that would be fundamentally broken. I think even live
> migration would work.
>
> I think I'd prefer to do that in a follow-up series so that nVMX can be converted
> at the same time? It's a bit absurd to add the above soft int code knowing that,
> at least in theory, simply populating the right software structs would automagically
> fix the bug. But manually handling the soft int case first would be safer in the
> sense that we'd still have a fix for the soft int case if it turns out that populating
> vcpu->arch.interrupt/exception isn't as straightforward as it seems.

I don't see any problem with having two patch series, the second series
depending on the first.

I planned to at least fix the bugs that I described in my previous message
(the NMI one actually breaks a real Windows guest) but that needs
knowledge how the event injection code will finally look like after the
first series of fixes.

>> However, there is another issue related to L1 -> L2 event re-injection
>> using standard KVM event injection mechanism: it mixes the L1 injection
>> state with the L2 one.
>>
>> Specifically for SVM:
>> * When re-injecting a NMI into L2 NMI-blocking is enabled in
>> vcpu->arch.hflags (shared between L1 and L2) and IRET intercept is
>> enabled.
>>
>> This is incorrect, since it is L1 that is responsible for enforcing NMI
>> blocking for NMIs that it injects into its L2.
>
> Ah, I see what you're saying. I think :-) IIUC, we can fix this bug without any
> new flags, just skip the side effects if the NMI is being injected into L2.
>
> @@ -3420,6 +3424,10 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
> struct vcpu_svm *svm = to_svm(vcpu);
>
> svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
> +
> + if (is_guest_mode(vcpu))
> + return;
> +
> vcpu->arch.hflags |= HF_NMI_MASK;
> if (!sev_es_guest(vcpu->kvm))
> svm_set_intercept(svm, INTERCEPT_IRET);
>
> and for nVMX:
>
> @@ -4598,6 +4598,9 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> + if (is_guest_mode(vcpu))
> + goto inject_nmi;
> +
> if (!enable_vnmi) {
> /*
> * Tracking the NMI-blocked state in software is built upon
> @@ -4619,6 +4622,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
> return;
> }
>
> +inject_nmi:
> vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
>

And what if it's L0 that is trying to inject a NMI into L2?
In this case is_guest_mode() is true, but the full NMI injection machinery
should be used.

>> Also, *L2* being the target of such injection definitely should not block
>> further NMIs for *L1*.
>
> Actually, it should block NMIs for L1. From L1's perspective, the injection is
> part of VM-Entry. That's a single gigantic instruction, thus there is no NMI window
> until VM-Entry completes from L1's perspective. Any exit that occurs on vectoring
> an injected event and is handled by L0 should not be visible to L1, because from
> L1's perspective it's all part of VMRUN/VMLAUNCH/VMRESUME. So blocking new events
> because an NMI (or any event) needs to be reinjected for L2 is correct.

I think this kind of NMI blocking will be already handled by having
the pending new NMI in vcpu->arch.nmi_pending but the one that needs
re-injecting in vcpu->arch.nmi_injected.

The pending new NMI in vcpu->arch.nmi_pending won't be handled until
vcpu->arch.nmi_injected gets cleared (that is, until re-injection is
successful).
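
For reference, a much-simplified sketch of that ordering (paraphrasing the event
injection logic in arch/x86/kvm/x86.c, not a verbatim copy of it):

	/* Events already delivered to the guest are re-injected first... */
	if (vcpu->arch.nmi_injected) {
		static_call(kvm_x86_inject_nmi)(vcpu);
	/*
	 * ... and a brand new NMI from nmi_pending is only queued once there
	 * is nothing left to re-inject and the NMI window is actually open.
	 */
	} else if (vcpu->arch.nmi_pending &&
		   static_call(kvm_x86_nmi_allowed)(vcpu, true)) {
		--vcpu->arch.nmi_pending;
		vcpu->arch.nmi_injected = true;
		static_call(kvm_x86_inject_nmi)(vcpu);
	}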

It is incorrect however, to wait for L2 to execute IRET to unblock
L0 -> L1 NMIs or L1 -> L2 NMIs, in these two cases we (L0) just need the CPU
to vector that L2 NMI so it no longer shows in EXITINTINFO.

It is also incorrect to block L1 -> L2 NMI injection because either L1
or L2 is currently under NMI blocking: the first case is obvious,
the second because it's L1 that is supposed to take care of proper NMI
blocking for L2 when injecting an NMI there.

>> * When re-injecting a *hardware* IRQ into L2 GIF is checked (previously
>> even on the BUG_ON() level), while L1 should be able to inject even when
>> L2 GIF is off,
>
> Isn't that just a matter of tweaking the assertion to ignore GIF if L2 is
> active? Hmm, or deleting the assertion altogether, it's likely doing more harm
> than good at this point.

I assume this assertion is meant to catch the case when KVM itself (L0) is
trying to erroneously inject a hardware interrupt into L1 or L2, so it will
need to be skipped only for L1 -> L2 event injection.

Whether this assertion benefits outweigh its costs is debatable - don't have
a strong opinion here (BUG_ON() is for sure too strong, but WARN_ON_ONCE()
might make sense to catch latent bugs).

>> With the code in my previous patch set I planned to use
>> exit_during_event_injection() to detect such case, but if we implement
>> VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
>> comes from L1, so its normal injection side-effects should be skipped.
>
> Do we still need a flag based on the above? Honest question... I've been staring
> at all this for the better part of an hour and may have lost track of things.

If checking just is_guest_mode() is not enough due to reasons I described
above then we need to somehow determine in the NMI / IRQ injection handler
whether the event to be injected into L2 comes from L0 or L1.
For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.

>> By the way, the relevant VMX code also looks rather suspicious,
>> especially for the !enable_vnmi case.
>
> I think it's safe to say we can ignore edge cases for !enable_vnmi. It might even
> be worth trying to remove that support again (Paolo tried years ago), IIRC the
> only Intel CPUs that don't support virtual NMIs are some funky Yonah SKUs.

Ack, we could at least disable nested on !enable_vnmi.

BTW, I think that besides Yonah cores very early Core 2 CPUs also lacked
vNMI support, that's why !enable_vnmi support was reinstated.
But that's hardware even older than !nrips AMD parts.

Thanks,
Maciej

2022-04-07 20:16:19

by Maciej S. Szmigiero

Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

On 7.04.2022 01:03, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 22:52, Sean Christopherson wrote:
>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>>>> Another option for saving and restoring a VM would be to add it to
>>>> KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
>>>> control area?).
>>>
>>> Ooh. What if we keep nested_run_pending=true until the injection completes? Then
>>> we don't even need an extra flag because nested_run_pending effectively says that
>>> any and all injected events are for L1=>L2. In KVM_GET_NESTED_STATE, shove the
>>> to-be-injected event into the normal vmc*12 injection field, and ignore all
>>> to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.
>>>
>>> That should work even for migrating to an older KVM, as keeping nested_run_pending
>>> will cause the target to reprocess the event injection as if it were from nested
>>> VM-Enter, which it technically is.
>>
>> I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
>> mean *moving* back the L1 -> L2 event to be injected from KVM internal data
>> structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
>> VMCB12 EVENTINJ field (or its VMX equivalent).
>>
>> But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
>> the moving), only then KVM_GET_VCPU_EVENTS (which will then no longer show
>> these events as pending).
>> And their setters in the opposite order when restoring the VM.
>
> I wasn't thinking of actually moving things in the source VM, only ignoring events
> in KVM_GET_VCPU_EVENTS. Getting state shouldn't be destructive, e.g. the source VM
> should still be able to continue running.

Right, getters should not change state.

> Ahahahaha, and actually looking at the code, there's this gem in KVM_GET_VCPU_EVENTS
>
> /*
> * The API doesn't provide the instruction length for software
> * exceptions, so don't report them. As long as the guest RIP
> * isn't advanced, we should expect to encounter the exception
> * again.
> */
> if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> events->exception.injected = 0;
> events->exception.pending = 0;
> }
>
> and again for soft interrupts
>
> events->interrupt.injected =
> vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
>
> so through KVM's own incompetency, it's already doing half the work.

So KVM_GET_VCPU_EVENTS was basically already explicitly broken for this
case (where RIP does not point to an INT3/INTO/INT x instruction).

> This is roughly what I had in mind. It will "require" moving nested_run_pending
> to kvm_vcpu_arch, but I've been itching for an excuse to do that anyways.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eb71727acecb..62c48f6a0815 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4846,6 +4846,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
> static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> struct kvm_vcpu_events *events)
> {
> + bool drop_injected_events = vcpu->arch.nested_run_pending;
> +
> process_nmi(vcpu);
>
> if (kvm_check_request(KVM_REQ_SMI, vcpu))
> @@ -4872,7 +4874,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> * isn't advanced, we should expect to encounter the exception
> * again.
> */
> - if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> + if (drop_injected_events ||
> + kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> events->exception.injected = 0;
> events->exception.pending = 0;
> } else {
> @@ -4893,13 +4896,14 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> events->exception_has_payload = vcpu->arch.exception.has_payload;
> events->exception_payload = vcpu->arch.exception.payload;
>
> - events->interrupt.injected =
> - vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> + events->interrupt.injected = vcpu->arch.interrupt.injected &&
> + !vcpu->arch.interrupt.soft &&
> + !drop_injected_events;
> events->interrupt.nr = vcpu->arch.interrupt.nr;
> events->interrupt.soft = 0;
> events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
>
> - events->nmi.injected = vcpu->arch.nmi_injected;
> + events->nmi.injected = vcpu->arch.nmi_injected && !drop_injected_events;
> events->nmi.pending = vcpu->arch.nmi_pending != 0;
> events->nmi.masked = static_call(kvm_x86_get_nmi_mask)(vcpu);
> events->nmi.pad = 0;
>

So the VMM will get VMCB12 with EVENTINJ field filled with the event
to re-inject from KVM_GET_NESTED_STATE and events.{exception,interrupt,nmi}.injected
unset from KVM_GET_VCPU_EVENTS.

Let's say now that the VMM uses this data to restore a VM: it restores nested
state by using KVM_SET_NESTED_STATE and then events by calling KVM_SET_VCPU_EVENTS.

So it ends with VMCB12 EVENTINJ field filled, but KVM injection structures
(arch.{exception,interrupt,nmi}.injected) zeroed by that later KVM_SET_VCPU_EVENTS
call.

Assuming that L1 -> L2 event injection is always based on KVM injection structures
(like we discussed earlier), rather than on a direct copy of the EVENTINJ field like
it is now, before doing the first VM entry KVM will need to re-parse the VMCB12
EVENTINJ field into KVM's arch.{exception,interrupt,nmi}.injected to make it work
properly.
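
A rough sketch of such re-parsing, modeled on how svm_complete_interrupts()
decodes EXITINTINFO (the helper name is made up, and soft exceptions/interrupts
would additionally need the next_rip/instruction-length bookkeeping discussed in
this thread, which the sketch omits):

static void nested_svm_parse_event_inj(struct kvm_vcpu *vcpu, u32 event_inj,
				       u32 event_inj_err)
{
	u8 vector = event_inj & SVM_EVTINJ_VEC_MASK;

	if (!(event_inj & SVM_EVTINJ_VALID))
		return;

	switch (event_inj & SVM_EVTINJ_TYPE_MASK) {
	case SVM_EVTINJ_TYPE_NMI:
		vcpu->arch.nmi_injected = true;
		break;
	case SVM_EVTINJ_TYPE_EXEPT:
		if (event_inj & SVM_EVTINJ_VALID_ERR)
			kvm_requeue_exception_e(vcpu, vector, event_inj_err);
		else
			kvm_requeue_exception(vcpu, vector);
		break;
	case SVM_EVTINJ_TYPE_INTR:
		kvm_queue_interrupt(vcpu, vector, false);
		break;
	case SVM_EVTINJ_TYPE_SOFT:
		kvm_queue_interrupt(vcpu, vector, true);
		break;
	}
}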

Thanks,
Maciej