2022-03-11 22:24:40

by Sean Christopherson

Subject: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

The main goal of this series is to fix KVM's longstanding bug of not
honoring L1's exception intercepts when handling an exception that
occurs during delivery of a different exception. E.g. if L0 and L1 are
using shadow paging, and L2 hits a #PF, and then hits another #PF while
vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
so that the #PF is routed to L1, not injected into L2 as a #DF.
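
As a rough sketch of the intended ordering (illustrative only; the names below
are made up for this example and are not KVM's actual helpers):

/*
 * Illustrative sketch, not KVM code: decide where a second exception that
 * occurs while vectoring a first exception should go.
 */
#include <stdbool.h>

enum ex_route { ROUTE_VMEXIT_TO_L1, ROUTE_MERGE_FOR_L2 };

static enum ex_route route_second_exception(bool is_guest_mode,
                                            bool l1_intercepts_second)
{
        /*
         * Check L1's intercepts *before* applying the #PF => #PF => #DF
         * merge; if L1 wants the new exception, synthesize a nested VM-Exit
         * so L1 can handle it (e.g. fault in its shadow page for the IDT).
         */
        if (is_guest_mode && l1_intercepts_second)
                return ROUTE_VMEXIT_TO_L1;

        /* Otherwise merge the two exceptions (e.g. into #DF) and inject into L2. */
        return ROUTE_MERGE_FOR_L2;
}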

nVMX has hacked around the bug for years by overriding the #PF injector
for shadow paging to go straight to VM-Exit, and nSVM has started doing
the same. The hacks mostly work, but they're incomplete, confusing, and
lead to other hacky code, e.g. bailing from the emulator because #PF
injection forced a VM-Exit and suddenly KVM is back in L1.

Everything leading up to that consists of related fixes and cleanups I encountered
along the way: some found through code inspection, some through tests (I truly
thought this series was finished 10 commits and 3 days ago...).

Nothing in here is all that urgent; all bugs tagged for stable have been
around for multiple releases (years in most cases).

Sean Christopherson (21):
KVM: x86: Return immediately from x86_emulate_instruction() on code
#DB
KVM: nVMX: Unconditionally purge queued/injected events on nested
"exit"
KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
KVM: x86: Don't check for code breakpoints when emulating on exception
KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
KVM: x86: Treat #DBs from the emulator as fault-like (code and
DR7.GD=1)
KVM: x86: Use DR7_GD macro instead of open coding check in emulator
KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
KVM: x86: Make kvm_queued_exception a properly named, visible struct
KVM: x86: Formalize blocking of nested pending exceptions
KVM: x86: Use kvm_queue_exception_e() to queue #DF
KVM: x86: Hoist nested event checks above event injection logic
KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
VM-Exit
KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
behavior
KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
KVM: selftests: Add an x86-only test to verify nested exception
queueing

arch/x86/include/asm/kvm-x86-ops.h | 2 +-
arch/x86/include/asm/kvm_host.h | 33 +-
arch/x86/kvm/emulate.c | 3 +-
arch/x86/kvm/svm/nested.c | 100 ++---
arch/x86/kvm/svm/svm.c | 18 +-
arch/x86/kvm/vmx/nested.c | 322 +++++++++-----
arch/x86/kvm/vmx/sgx.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 53 ++-
arch/x86/kvm/x86.c | 409 ++++++++++++------
arch/x86/kvm/x86.h | 10 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/svm_util.h | 5 +-
.../selftests/kvm/include/x86_64/vmx.h | 51 +--
.../kvm/x86_64/nested_exceptions_test.c | 307 +++++++++++++
15 files changed, 914 insertions(+), 403 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c


base-commit: 4a204f7895878363ca8211f50ec610408c8c70aa
--
2.35.1.723.g4982287a31-goog


2022-03-11 22:32:15

by Sean Christopherson

Subject: [PATCH 04/21] KVM: x86: Don't check for code breakpoints when emulating on exception

Don't check for code breakpoints during instruction emulation if the
emulation was triggered by exception interception. Code breakpoints are
the highest priority fault-like exception, and KVM only emulates on
exceptions that are fault-like. Thus, if hardware signaled a different
exception, then the vCPU is already past the stage of checking for
hardware breakpoints.

This is likely a glorified nop in terms of functionality; it's mostly for
clarification, though technically it's an optimization. Intel's SDM explicitly
states vmcs.GUEST_RFLAGS.RF on exception interception is the same as the
value that would have been saved on the stack had the exception not been
intercepted, i.e. will be '1' due to all fault-like exceptions setting RF
to '1'. AMD says "guest state saved ... is the processor state as of the
moment the intercept triggers", but that begs the question, "when does
the intercept trigger?".

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index feacc0901c24..3636206ed3e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8212,8 +8212,24 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_skip_emulated_instruction);

-static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu, int *r)
+static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu,
+ int emulation_type, int *r)
{
+ WARN_ON_ONCE(emulation_type & EMULTYPE_NO_DECODE);
+
+ /*
+ * Do not check for code breakpoints if hardware has already done the
+ * checks, as inferred from the emulation type. On NO_DECODE and SKIP,
+ * the instruction has passed all exception checks, and all intercepted
+ * exceptions that trigger emulation have lower priority than code
+ * breakpoints, i.e. the fact that the intercepted exception occurred
+ * means any code breakpoints have already been serviced.
+ */
+ if (emulation_type & (EMULTYPE_NO_DECODE | EMULTYPE_SKIP |
+ EMULTYPE_TRAP_UD | EMULTYPE_TRAP_UD_FORCED |
+ EMULTYPE_VMWARE_GP | EMULTYPE_PF))
+ return false;
+
if (unlikely(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) &&
(vcpu->arch.guest_debug_dr7 & DR7_BP_EN_MASK)) {
struct kvm_run *kvm_run = vcpu->run;
@@ -8335,8 +8351,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* are fault-like and are higher priority than any faults on
* the code fetch itself.
*/
- if (!(emulation_type & EMULTYPE_SKIP) &&
- kvm_vcpu_check_code_breakpoint(vcpu, &r))
+ if (kvm_vcpu_check_code_breakpoint(vcpu, emulation_type, &r))
return r;

r = x86_decode_emulated_instruction(vcpu, emulation_type,
--
2.35.1.723.g4982287a31-goog

2022-03-11 22:54:40

by Sean Christopherson

Subject: [PATCH 11/21] KVM: VMX: Inject #PF on ENCLS as "emulated" #PF

Treat #PFs that occur during emulation of ENCLS as, wait for it, emulated
page faults. Practically speaking, this is a glorified nop as the
exception is never of the nested flavor, and it's extremely unlikely the
guest is relying on the side effect of an implicit INVLPG on the faulting
address.

Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/sgx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/sgx.c b/arch/x86/kvm/vmx/sgx.c
index 35e7ec91ae86..966cfa228f2a 100644
--- a/arch/x86/kvm/vmx/sgx.c
+++ b/arch/x86/kvm/vmx/sgx.c
@@ -129,7 +129,7 @@ static int sgx_inject_fault(struct kvm_vcpu *vcpu, gva_t gva, int trapnr)
ex.address = gva;
ex.error_code_valid = true;
ex.nested_page_fault = false;
- kvm_inject_page_fault(vcpu, &ex);
+ kvm_inject_emulated_page_fault(vcpu, &ex);
} else {
kvm_inject_gp(vcpu, 0);
}
--
2.35.1.723.g4982287a31-goog

2022-03-11 23:06:19

by Sean Christopherson

Subject: [PATCH 02/21] KVM: nVMX: Unconditionally purge queued/injected events on nested "exit"

Drop pending exceptions and events queued for re-injection when leaving
nested guest mode, even if the "exit" is due to VM-Fail, SMI, or forced
by host userspace. Failure to purge events could result in an event
belonging to L2 being injected into L1.

This _should_ never happen for VM-Fail as all events should be blocked by
nested_run_pending, but it's possible if KVM, not the L1 hypervisor, is
the source of VM-Fail when running vmcs02.

SMI is a nop (barring unknown bugs) as recognition of SMI and thus entry
to SMM is blocked by pending exceptions and re-injected events.

Forced exit is definitely buggy, but has likely gone unnoticed because
userspace probably follows the forced exit with KVM_SET_VCPU_EVENTS (or
some other ioctl() that purges the queue).

Fixes: 4f350c6dbcb9 ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f18744f7ff82..f09c6eff7af9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4233,14 +4233,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
nested_vmx_abort(vcpu,
VMX_ABORT_SAVE_GUEST_MSR_FAIL);
}
-
- /*
- * Drop what we picked up for L2 via vmx_complete_interrupts. It is
- * preserved above and would only end up incorrectly in L1.
- */
- vcpu->arch.nmi_injected = false;
- kvm_clear_exception_queue(vcpu);
- kvm_clear_interrupt_queue(vcpu);
}

/*
@@ -4582,6 +4574,17 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
WARN_ON_ONCE(nested_early_check);
}

+ /*
+ * Drop events/exceptions that were queued for re-injection to L2
+ * (picked up via vmx_complete_interrupts()), as well as exceptions
+ * that were pending for L2. Note, this must NOT be hoisted above
+ * prepare_vmcs12(), events/exceptions queued for re-injection need to
+ * be captured in vmcs12 (see vmcs12_save_pending_event()).
+ */
+ vcpu->arch.nmi_injected = false;
+ kvm_clear_exception_queue(vcpu);
+ kvm_clear_interrupt_queue(vcpu);
+
vmx_switch_vmcs(vcpu, &vmx->vmcs01);

/* Update any VMCS fields that might have changed while L2 ran */
--
2.35.1.723.g4982287a31-goog

2022-03-11 23:07:12

by Sean Christopherson

Subject: [PATCH 12/21] KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception

Rename the kvm_x86_ops hook for exception injection to better reflect
reality, and to align with pretty much every other related function name
in KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 +-
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/svm/svm.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 4 ++--
arch/x86/kvm/x86.c | 2 +-
5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 29affccb353c..656fa1626dc1 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -66,7 +66,7 @@ KVM_X86_OP(get_interrupt_shadow)
KVM_X86_OP(patch_hypercall)
KVM_X86_OP(inject_irq)
KVM_X86_OP(inject_nmi)
-KVM_X86_OP(queue_exception)
+KVM_X86_OP(inject_exception)
KVM_X86_OP(cancel_injection)
KVM_X86_OP(interrupt_allowed)
KVM_X86_OP(nmi_allowed)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3a2c855f04e3..4f891fe00767 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1391,7 +1391,7 @@ struct kvm_x86_ops {
unsigned char *hypercall_addr);
void (*inject_irq)(struct kvm_vcpu *vcpu);
void (*inject_nmi)(struct kvm_vcpu *vcpu);
- void (*queue_exception)(struct kvm_vcpu *vcpu);
+ void (*inject_exception)(struct kvm_vcpu *vcpu);
void (*cancel_injection)(struct kvm_vcpu *vcpu);
int (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fc5222a0f506..8b7f3c4e383f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -382,7 +382,7 @@ static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
return 1;
}

-static void svm_queue_exception(struct kvm_vcpu *vcpu)
+static void svm_inject_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
unsigned nr = vcpu->arch.exception.nr;
@@ -4580,7 +4580,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.patch_hypercall = svm_patch_hypercall,
.inject_irq = svm_inject_irq,
.inject_nmi = svm_inject_nmi,
- .queue_exception = svm_queue_exception,
+ .inject_exception = svm_inject_exception,
.cancel_injection = svm_cancel_injection,
.interrupt_allowed = svm_interrupt_allowed,
.nmi_allowed = svm_nmi_allowed,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a8ebe91fe9a5..f3f16271fa2c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1602,7 +1602,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
}

-static void vmx_queue_exception(struct kvm_vcpu *vcpu)
+static void vmx_inject_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned nr = vcpu->arch.exception.nr;
@@ -7783,7 +7783,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.patch_hypercall = vmx_patch_hypercall,
.inject_irq = vmx_inject_irq,
.inject_nmi = vmx_inject_nmi,
- .queue_exception = vmx_queue_exception,
+ .inject_exception = vmx_inject_exception,
.cancel_injection = vmx_cancel_injection,
.interrupt_allowed = vmx_interrupt_allowed,
.nmi_allowed = vmx_nmi_allowed,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 507e5f26ebbf..452fbb55d9d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9312,7 +9312,7 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.exception.error_code && !is_protmode(vcpu))
vcpu->arch.exception.error_code = false;
- static_call(kvm_x86_queue_exception)(vcpu);
+ static_call(kvm_x86_inject_exception)(vcpu);
}

static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
--
2.35.1.723.g4982287a31-goog

2022-03-11 23:30:43

by Sean Christopherson

Subject: [PATCH 19/21] KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior

Document the oddities of ICEBP interception (trap-like #DB is intercepted
as a fault-like exception), and how using VMX's inner "skip" helper
deliberately bypasses the pending MTF and single-step #DB logic.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0420bc6d418a..ae88d42289ce 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1570,9 +1570,13 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)

/*
* Per the SDM, MTF takes priority over debug-trap exceptions besides
- * T-bit traps. As instruction emulation is completed (i.e. at the
- * instruction boundary), any #DB exception pending delivery must be a
- * debug-trap. Record the pending MTF state to be delivered in
+ * TSS T-bit traps and ICEBP (INT1). KVM doesn't emulate T-bit traps
+ * or ICEBP (in the emulator proper), and skipping of ICEBP after an
+ * intercepted #DB deliberately avoids single-step #DB and MTF updates
+ * as ICEBP is higher priority than both. As instruction emulation is
+ * completed at this point (i.e. KVM is at the instruction boundary),
+ * any #DB exception pending delivery must be a debug-trap of lower
+ * priority than MTF. Record the pending MTF state to be delivered in
* vmx_check_nested_events().
*/
if (nested_cpu_has_mtf(vmcs12) &&
@@ -4924,8 +4928,10 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
* instruction. ICEBP generates a trap-like #DB, but
* despite its interception control being tied to #DB,
* is an instruction intercept, i.e. the VM-Exit occurs
- * on the ICEBP itself. Note, skipping ICEBP also
- * clears STI and MOVSS blocking.
+ * on the ICEBP itself. Use the inner "skip" helper to
+ * avoid single-step #DB and MTF updates, as ICEBP is
+ * higher priority. Note, skipping ICEBP still clears
+ * STI and MOVSS blocking.
*
* For all other #DBs, set vmcs.PENDING_DBG_EXCEPTIONS.BS
* if single-step is enabled in RFLAGS and STI or MOVSS
--
2.35.1.723.g4982287a31-goog

2022-03-13 11:01:17

by Maxim Levitsky

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On Fri, 2022-03-11 at 03:27 +0000, Sean Christopherson wrote:
> The main goal of this series is to fix KVM's longstanding bug of not
> honoring L1's exception intercepts wants when handling an exception that
> occurs during delivery of a different exception. E.g. if L0 and L1 are
> using shadow paging, and L2 hits a #PF, and then hits another #PF while
> vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> so that the #PF is routed to L1, not injected into L2 as a #DF.
>
> nVMX has hacked around the bug for years by overriding the #PF injector
> for shadow paging to go straight to VM-Exit, and nSVM has started doing
> the same. The hacks mostly work, but they're incomplete, confusing, and
> lead to other hacky code, e.g. bailing from the emulator because #PF
> injection forced a VM-Exit and suddenly KVM is back in L1.
>
> Everything leading up to that are related fixes and cleanups I encountered
> along the way; some through code inspection, some through tests (I truly
> thought this series was finished 10 commits and 3 days ago...).
>
> Nothing in here is all that urgent; all bugs tagged for stable have been
> around for multiple releases (years in most cases).
>
> Sean Christopherson (21):
> KVM: x86: Return immediately from x86_emulate_instruction() on code
> #DB
> KVM: nVMX: Unconditionally purge queued/injected events on nested
> "exit"
> KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
> KVM: x86: Don't check for code breakpoints when emulating on exception
> KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
> KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
> KVM: x86: Treat #DBs from the emulator as fault-like (code and
> DR7.GD=1)
> KVM: x86: Use DR7_GD macro instead of open coding check in emulator
> KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
> KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
> KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
> KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
> KVM: x86: Make kvm_queued_exception a properly named, visible struct
> KVM: x86: Formalize blocking of nested pending exceptions
> KVM: x86: Use kvm_queue_exception_e() to queue #DF
> KVM: x86: Hoist nested event checks above event injection logic
> KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
> VM-Exit
> KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
> KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
> behavior
> KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
> KVM: selftests: Add an x86-only test to verify nested exception
> queueing
>
> arch/x86/include/asm/kvm-x86-ops.h | 2 +-
> arch/x86/include/asm/kvm_host.h | 33 +-
> arch/x86/kvm/emulate.c | 3 +-
> arch/x86/kvm/svm/nested.c | 100 ++---
> arch/x86/kvm/svm/svm.c | 18 +-
> arch/x86/kvm/vmx/nested.c | 322 +++++++++-----
> arch/x86/kvm/vmx/sgx.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 53 ++-
> arch/x86/kvm/x86.c | 409 ++++++++++++------
> arch/x86/kvm/x86.h | 10 +-
> tools/testing/selftests/kvm/.gitignore | 1 +
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/include/x86_64/svm_util.h | 5 +-
> .../selftests/kvm/include/x86_64/vmx.h | 51 +--
> .../kvm/x86_64/nested_exceptions_test.c | 307 +++++++++++++
> 15 files changed, 914 insertions(+), 403 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
>
>
> base-commit: 4a204f7895878363ca8211f50ec610408c8c70aa

I am just curious. Are you aware that I worked on this a few months ago?
I am sure that you even reviewed some of my code back then.

If so, could you have at least mentioned this and/or pinged me to continue
working on this instead of re-implementing it?

Best regards,
Maxim Levitsky


2022-03-25 18:07:48

by Sean Christopherson

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On Sun, Mar 13, 2022, Maxim Levitsky wrote:
> On Fri, 2022-03-11 at 03:27 +0000, Sean Christopherson wrote:
> > The main goal of this series is to fix KVM's longstanding bug of not
> > honoring L1's exception intercepts wants when handling an exception that
> > occurs during delivery of a different exception. E.g. if L0 and L1 are
> > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > so that the #PF is routed to L1, not injected into L2 as a #DF.
> >
> > nVMX has hacked around the bug for years by overriding the #PF injector
> > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > the same. The hacks mostly work, but they're incomplete, confusing, and
> > lead to other hacky code, e.g. bailing from the emulator because #PF
> > injection forced a VM-Exit and suddenly KVM is back in L1.
> >
> > Everything leading up to that are related fixes and cleanups I encountered
> > along the way; some through code inspection, some through tests (I truly
> > thought this series was finished 10 commits and 3 days ago...).
> >
> > Nothing in here is all that urgent; all bugs tagged for stable have been
> > around for multiple releases (years in most cases).
> >
> I am just curious. Are you aware that I worked on this few months ago?

Ah, so that's why I had a feeling of deja vu when factoring out kvm_queued_exception.
I completely forgot about it :-/ In my defense, that was nearly a year ago[1][2], though
I suppose one could argue 11 == "a few" :-)

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]

> I am sure that you even reviewed some of my code back then.

Yep, now that I've found the threads I remember discussing the mechanics.

> If so, could you have had at least mentioned this and/or pinged me to continue
> working on this instead of re-implementing it?

I'm invoking Hanlon's razor[*]; I certainly didn't intend to stomp over your
work, I simply forgot.

As for the technical aspects, looking back at your series, I strongly considered
taking the same approach of splitting pending vs. injected (again, without any
recollection of your work). I ultimately opted to go with the "immediate morph
to pending VM-Exit" approach as it allows KVM to do the right thing in almost every
case without requiring new ABI, and even if KVM screws up, e.g. queues multiple
pending exceptions. It also neatly handles one-off things like async #PF in L2.

However, I hadn't considered your approach, which addresses the ABI conundrum by
processing pending=>injected immediately after handling the VM-Exit. I can't think
of any reason that wouldn't work, but I really don't like splitting the event
priority logic, nor do I like having two event injection sites (getting rid of the
extra calls to kvm_check_nested_events() is still on my wish list). If we could go
back in time, I would likely vote for properly tracking injected vs. pending, but
since we're mostly stuck with KVM's ABI, I prefer the "immediately morph to pending
VM-Exit" hack over the "immediately morph to 'injected' exception" hack.

[*] https://en.wikipedia.org/wiki/Hanlon%27s_razor

2022-03-25 21:53:06

by Maciej S. Szmigiero

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On 24.03.2022 22:31, Sean Christopherson wrote:
> On Sun, Mar 13, 2022, Maxim Levitsky wrote:
>> On Fri, 2022-03-11 at 03:27 +0000, Sean Christopherson wrote:
>>> The main goal of this series is to fix KVM's longstanding bug of not
>>> honoring L1's exception intercepts wants when handling an exception that
>>> occurs during delivery of a different exception. E.g. if L0 and L1 are
>>> using shadow paging, and L2 hits a #PF, and then hits another #PF while
>>> vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
>>> KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
>>> so that the #PF is routed to L1, not injected into L2 as a #DF.
>>>
>>> nVMX has hacked around the bug for years by overriding the #PF injector
>>> for shadow paging to go straight to VM-Exit, and nSVM has started doing
>>> the same. The hacks mostly work, but they're incomplete, confusing, and
>>> lead to other hacky code, e.g. bailing from the emulator because #PF
>>> injection forced a VM-Exit and suddenly KVM is back in L1.
>>>
>>> Everything leading up to that are related fixes and cleanups I encountered
>>> along the way; some through code inspection, some through tests (I truly
>>> thought this series was finished 10 commits and 3 days ago...).
>>>
>>> Nothing in here is all that urgent; all bugs tagged for stable have been
>>> around for multiple releases (years in most cases).
>>>
>> I am just curious. Are you aware that I worked on this few months ago?
>
> Ah, so that's why I had a feeling of deja vu when factoring out kvm_queued_exception.
> I completely forgot about it :-/ In my defense, that was nearly a year ago[1][2], though
> I suppose one could argue 11 == "a few" :-)
>
> [1] https://lore.kernel.org/all/[email protected]
> [2] https://lore.kernel.org/all/[email protected]
>
>> I am sure that you even reviewed some of my code back then.
>
> Yep, now that I've found the threads I remember discussing the mechanics.
>
>> If so, could you have had at least mentioned this and/or pinged me to continue
>> working on this instead of re-implementing it?
>
> I'm invoking Hanlon's razor[*]; I certainly didn't intended to stomp over your
> work, I simply forgot.
>
> As for the technical aspects, looking back at your series, I strongly considered
> taking the same approach of splitting pending vs. injected (again, without any
> recollection of your work). I ultimately opted to go with the "immediated morph
> to pending VM-Exit" approach as it allows KVM to do the right thing in almost every
> case without requiring new ABI, and even if KVM screws up, e.g. queues multiple
> pending exceptions. It also neatly handles one-off things like async #PF in L2.
>
> However, I hadn't considered your approach, which addresses the ABI conundrum by
> processing pending=>injected immediately after handling the VM-Exit. I can't think
> of any reason that wouldn't work, but I really don't like splitting the event
> priority logic, nor do I like having two event injection sites (getting rid of the
> extra calls to kvm_check_nested_events() is still on my wish list). If we could go
> back in time, I would likely vote for properly tracking injected vs. pending, but
> since we're mostly stuck with KVM's ABI, I prefer the "immediately morph to pending
> VM-Exit" hack over the "immediately morph to 'injected' exception" hack.

So, what's the plan here: is your patch set, Sean, considered to supersede
Maxim's earlier proposed changes, or will you post an updated patch set
incorporating at least some of them?

I am asking because I have a series that touches the same general area
of KVM [1] and would preferably have it based on the final form of the
event injection code to avoid unforeseen negative interactions between
these changes.

Thanks,
Maciej

[1]: https://lore.kernel.org/kvm/[email protected]/

2022-03-25 23:27:47

by Sean Christopherson

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On Fri, Mar 25, 2022, Maciej S. Szmigiero wrote:
> So, what's the plan here: is your patch set Sean considered to supersede
> Maxim's earlier proposed changes or will you post an updated patch set
> incorporating at least some of them?

The next step is to reach a consensus on how we want to solve the problem (or, if we
can't reach consensus, to wait until Paolo uses his special powers). I definitely won't
post anything new until there's more conversation.

> I am asking because I have a series that touches the same general area
> of KVM [1] and would preferably have it based on the final form of the
> event injection code to avoid unforeseen negative interactions between
> these changes.

I don't think you need to do anything; at a glance your changes are orthogonal
even though they have similar themes. Any conflicts should be minor.

2022-03-26 02:29:16

by Maciej S. Szmigiero

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On 26.03.2022 00:02, Sean Christopherson wrote:
> On Fri, Mar 25, 2022, Maciej S. Szmigiero wrote:
>> So, what's the plan here: is your patch set Sean considered to supersede
>> Maxim's earlier proposed changes or will you post an updated patch set
>> incorporating at least some of them?
>
> Next step is to reach a consensus on how we want to solve the problem (or if we
> can't reach consensus, until Paolo uses his special powers). I definitely won't
> post anything new until there's more conversation.

Ack.

>> I am asking because I have a series that touches the same general area
>> of KVM [1] and would preferably have it based on the final form of the
>> event injection code to avoid unforeseen negative interactions between
>> these changes.
>
> I don't think you need to do anything, at a glance your changes are orthogonal
> even though they have similar themes. Any conflicts should be minor.

All right - I will try to keep an eye on the developments related to the
event injection code just to be sure there are no obvious conflicts here.

Thanks,
Maciej

2022-03-28 20:40:23

by Maxim Levitsky

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On Thu, 2022-03-24 at 21:31 +0000, Sean Christopherson wrote:
> On Sun, Mar 13, 2022, Maxim Levitsky wrote:
> > On Fri, 2022-03-11 at 03:27 +0000, Sean Christopherson wrote:
> > > The main goal of this series is to fix KVM's longstanding bug of not
> > > honoring L1's exception intercepts wants when handling an exception that
> > > occurs during delivery of a different exception. E.g. if L0 and L1 are
> > > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > > so that the #PF is routed to L1, not injected into L2 as a #DF.
> > >
> > > nVMX has hacked around the bug for years by overriding the #PF injector
> > > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > > the same. The hacks mostly work, but they're incomplete, confusing, and
> > > lead to other hacky code, e.g. bailing from the emulator because #PF
> > > injection forced a VM-Exit and suddenly KVM is back in L1.
> > >
> > > Everything leading up to that are related fixes and cleanups I encountered
> > > along the way; some through code inspection, some through tests (I truly
> > > thought this series was finished 10 commits and 3 days ago...).
> > >
> > > Nothing in here is all that urgent; all bugs tagged for stable have been
> > > around for multiple releases (years in most cases).
> > >
> > I am just curious. Are you aware that I worked on this few months ago?
>
> Ah, so that's why I had a feeling of deja vu when factoring out kvm_queued_exception.
> I completely forgot about it :-/ In my defense, that was nearly a year ago[1][2], though
> I suppose one could argue 11 == "a few" :-)
>
> [1] https://lore.kernel.org/all/[email protected]
> [2] https://lore.kernel.org/all/[email protected]
>
> > I am sure that you even reviewed some of my code back then.
>
> Yep, now that I've found the threads I remember discussing the mechanics.
>
> > If so, could you have had at least mentioned this and/or pinged me to continue
> > working on this instead of re-implementing it?
>
> I'm invoking Hanlon's razor[*]; I certainly didn't intended to stomp over your
> work, I simply forgot.

Thank you very much for the explanation, and I am glad that it was an honest mistake.

Other than that, I am actually very happy that you posted this patch series,
as it gives a better chance that this long-standing issue will be fixed,
and if your patches are better/simpler/less invasive to KVM and still address the issue,
I fully support using them instead of mine.

I totally agree with your thoughts about splitting pending/injected exceptions;
I also can't say I liked my approach that much, for the same reasons you mentioned.

That is also the main reason I put the whole thing on the backlog lately:
I felt that I was changing too much of KVM for a relatively theoretical issue.


I will review your patches, compare them to mine, and check if you or I missed something.

PS:

Back then, I also did an extensive review of the few cases where QEMU injects exceptions itself,
which it thankfully does rarely. There are several (theoretical) issues there.
I don't remember the details; I need to refresh my memory.

AFAIK, QEMU sometimes injects #MC when it gets one from the kernel in the form of a signal,
if I recall correctly, and it also reflects #DB back to the guest when guest debug is enabled
(which is the reason for some work I did in this area, like the KVM_GUESTDBG_BLOCKIRQ thing).

QEMU does this without considering nested and/or pending exceptions, etc.
It just kind of abuses KVM_SET_VCPU_EVENTS for that.

Best regards,
Maxim Levitsky


>
> As for the technical aspects, looking back at your series, I strongly considered
> taking the same approach of splitting pending vs. injected (again, without any
> recollection of your work). I ultimately opted to go with the "immediated morph
> to pending VM-Exit" approach as it allows KVM to do the right thing in almost every
> case without requiring new ABI, and even if KVM screws up, e.g. queues multiple
> pending exceptions. It also neatly handles one-off things like async #PF in L2.
>
> However, I hadn't considered your approach, which addresses the ABI conundrum by
> processing pending=>injected immediately after handling the VM-Exit. I can't think
> of any reason that wouldn't work, but I really don't like splitting the event
> priority logic, nor do I like having two event injection sites (getting rid of the
> extra calls to kvm_check_nested_events() is still on my wish list). If we could go
> back in time, I would likely vote for properly tracking injected vs. pending, but
> since we're mostly stuck with KVM's ABI, I prefer the "immediately morph to pending
> VM-Exit" hack over the "immediately morph to 'injected' exception" hack.
>
> [*] https://en.wikipedia.org/wiki/Hanlon%27s_razor
>


2022-03-28 22:17:24

by Sean Christopherson

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On Sun, Mar 27, 2022, Maxim Levitsky wrote:
> Other than that I am actually very happy that you posted this patch series,
> as this gives more chance that this long standing issue will be fixed,
> and if your patches are better/simpler/less invasive to KVM and still address the issue,
> I fully support using them instead of mine.

I highly doubt they're simpler or less invasive, but I do hope that the approach
will be easier to maintain.

> Totally agree with you about your thoughts about splitting pending/injected exception,
> I also can't say I liked my approach that much, for the same reasons you mentioned.
>
> It is also the main reason I put the whole thing on the backlog lately,
> because I was feeling that I am changing too much of the KVM,
> for a relatively theoretical issue.
>
>
> I will review your patches, compare them to mine, and check if you or I missed something.
>
> PS:
>
> Back then, I also did an extensive review on few cases when qemu injects exceptions itself,
> which it does thankfully rarely. There are several (theoretical) issues there.
> I don't remember those details, I need to refresh my memory.
>
> AFAIK, qemu injects #MC sometimes when it gets it from the kernel in form of a signal,
> if I recall this correctly, and it also reflects back #DB, when guest debug was enabled
> (and that is the reason for some work I did in this area, like the KVM_GUESTDBG_BLOCKIRQ thing)
>
> Qemu does this without considering nested and/or pending exception/etc.
> It just kind of abuses the KVM_SET_VCPU_EVENTS for that.

I wouldn't call that abuse; the ioctl() isn't just for migration. Not checking for
a pending exception is firmly a userspace bug and not something KVM should try to
fix.

For #DB, I suspect it's a non-issue. The exit is synchronous, so unless userspace
is deferring the reflection, which would be architecturally wrong in and of itself,
there can never be another pending exception.

For #MC, I think the correct behavior would be to defer the synthesized #MC if there's
a pending exception and resume the guest until the exception is injected. E.g. if a
different task encounters the real #MC, the synthesized #MC will be fully asynchronous
and may be coincident with a pending exception that is unrelated to the #MC. That
would require to userspace to enable KVM_CAP_EXCEPTION_PAYLOAD, otherwise userspace
won't be able to differentiate between a pending and injected exception, e.g. if the
#MC occurs during exception vectoring, userspace should override the injected exception
and synthesize #MC, otherwise it would likely soft hang the guest.
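
In the simple "defer and retry" case, something along these lines on the
userspace side would do. This is a rough sketch against the KVM uAPI assuming
KVM_CAP_EXCEPTION_PAYLOAD is enabled; it is not QEMU's actual code, error
handling is omitted, and the "override during vectoring" case is left out:

#include <string.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MC_VECTOR 18

/*
 * Returns true if the synthesized #MC was queued, false if an exception is
 * already in flight and the caller should resume the guest and retry later.
 */
static bool try_synthesize_mce(int vcpu_fd)
{
        struct kvm_vcpu_events events;

        memset(&events, 0, sizeof(events));
        ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);

        /* Defer if KVM already has a pending or injected exception. */
        if (events.exception.pending || events.exception.injected)
                return false;

        /* Reuse the snapshot (and its flags) so other state is written back as-is. */
        events.exception.injected = 1;
        events.exception.nr = MC_VECTOR;
        events.exception.has_error_code = 0;

        ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
        return true;
}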

2022-03-29 15:36:13

by Maxim Levitsky

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On Mon, 2022-03-28 at 17:50 +0000, Sean Christopherson wrote:
> On Sun, Mar 27, 2022, Maxim Levitsky wrote:
> > Other than that I am actually very happy that you posted this patch series,
> > as this gives more chance that this long standing issue will be fixed,
> > and if your patches are better/simpler/less invasive to KVM and still address the issue,
> > I fully support using them instead of mine.
>
> I highly doubt they're simpler or less invasive, but I do hope that the approach
> wil be easier to maintain.
>
> > Totally agree with you about your thoughts about splitting pending/injected exception,
> > I also can't say I liked my approach that much, for the same reasons you mentioned.
> >
> > It is also the main reason I put the whole thing on the backlog lately,
> > because I was feeling that I am changing too much of the KVM,
> > for a relatively theoretical issue.
> >
> >
> > I will review your patches, compare them to mine, and check if you or I missed something.
> >
> > PS:
> >
> > Back then, I also did an extensive review on few cases when qemu injects exceptions itself,
> > which it does thankfully rarely. There are several (theoretical) issues there.
> > I don't remember those details, I need to refresh my memory.
> >
> > AFAIK, qemu injects #MC sometimes when it gets it from the kernel in form of a signal,
> > if I recall this correctly, and it also reflects back #DB, when guest debug was enabled
> > (and that is the reason for some work I did in this area, like the KVM_GUESTDBG_BLOCKIRQ thing)
> >
> > Qemu does this without considering nested and/or pending exception/etc.
> > It just kind of abuses the KVM_SET_VCPU_EVENTS for that.
>
> I wouldn't call that abuse, the ioctl() isn't just for migration. Not checking for
> a pending exception is firmly a userspace bug and not something KVM should try to
> fix.

Yes, but to make the right decision, userspace has to know if there is a pending
exception, and if there is, merge it (which might even involve a triple fault).
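
For reference, the merge rules in question boil down to roughly this (a
simplified sketch of the SDM's double-fault conditions, not KVM code, ignoring
newer vectors like #CP):

/*
 * Simplified illustration of the architectural merge rules (SDM "Conditions
 * for Generating a Double Fault"); "first" is the exception being delivered,
 * "second" is the exception that occurs during delivery.
 */
enum merge_result { DELIVER_SECOND, DOUBLE_FAULT, TRIPLE_FAULT };

static int is_contributory(int vector)
{
        /* #DE, #TS, #NP, #SS, #GP */
        return vector == 0 || vector == 10 || vector == 11 ||
               vector == 12 || vector == 13;
}

static enum merge_result merge_exceptions(int first, int second)
{
        /* A contributory exception or #PF during #DF delivery => shutdown. */
        if (first == 8 && (is_contributory(second) || second == 14))
                return TRIPLE_FAULT;

        /* Contributory during contributory, or contributory/#PF during #PF. */
        if ((is_contributory(first) && is_contributory(second)) ||
            (first == 14 && (is_contributory(second) || second == 14)))
                return DOUBLE_FAULT;

        /* Benign combinations: just deliver the second exception. */
        return DELIVER_SECOND;
}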

On top of that, it is possible that the pending exception is not intercepted by L1
but the merged result is, so injecting the exception will cause a nested VM-Exit,
which is something that is hard for userspace to model.

I think that the cleanest way to do this is to add a new ioctl, KVM_INJECT_EXCEPTION,
which can do the right thing in the kernel, but I am not sure that it is worth it,
knowing that thankfully userspace doesn't inject exceptions much.



>
> For #DB, I suspect it's a non-issue. The exit is synchronous, so unless userspace
> is deferring the reflection, which would be architecturally wrong in and of itself,
> there can never be another pending exception.
Could very well be, but still there could be corner cases. Like, what if you set a data-fetch
breakpoint on an IDT entry of some exception? I guess during delivery of that exception
there might be a #DB, but I am not an expert on exactly when and how #DB is generated, so
I can't be sure. Anyway, #DB isn't a big deal because QEMU only re-injects it when
guest debug is enabled, and that is broken anyway and does worse things, like leaking EFLAGS.TF
to the guest stack and such.


>
> For #MC, I think the correct behavior would be to defer the synthesized #MC if there's
> a pending exception and resume the guest until the exception is injected. E.g. if a
> different task encounters the real #MC, the synthesized #MC will be fully asynchronous
> and may be coincident with a pending exception that is unrelated to the #MC. That
> would require to userspace to enable KVM_CAP_EXCEPTION_PAYLOAD, otherwise userspace
> won't be able to differentiate between a pending and injected exception, e.g. if the
> #MC occurs during exception vectoring, userspace should override the injected exception
> and synthesize #MC, otherwise it would likely soft hang the guest.
Something like that; I don't remember all the details.

Best regards,
Maxim Levitsky

>


2022-03-30 02:01:08

by Sean Christopherson

Subject: Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

On Tue, Mar 29, 2022, Maxim Levitsky wrote:
> On Mon, 2022-03-28 at 17:50 +0000, Sean Christopherson wrote:
> > I wouldn't call that abuse, the ioctl() isn't just for migration. Not checking for
> > a pending exception is firmly a userspace bug and not something KVM should try to
> > fix.
>
> yes, but to make the right decision, the userspace has to know if there is a pending
> exception, and if there is, then merge it (which might even involve triple fault),

There's no need for userspace to ever merge exceptions unless KVM supports
exiting to userspace on an exception that can occur during exception delivery, or
userspace itself is emulating exception delivery. Outside of debug scenarios, #PF
is likely the only exception that might ever be forwarded to userspace. But in
those scenarios, userspace is almost always going to fix the #PF and resume the
guest. If userspace doesn't fix the #PF, the guest is completely hosed because
its IDT will trigger #PF, i.e. it's headed to shutdown regardless of KVM's ABI.

VM introspection is the only use case I can think of that might possibly want to
emulate exception delivery in userspace, and VMI is a completely new set of APIs,
in no small part because supporting something like this in KVM would require far
more hooks than KVM provides.

> On top of that it is possible that pending excpetion is not intercepted by L1,
> but merged result is, so injecting the exception will cause nested VMexit,
> which is something that is hard for userspace to model.
>
> I think that the cleanest way to do this is to add new ioctl, KVM_INJECT_EXCEPTION,
> which can do the right thing in the kernel, but I am not sure that it is worth it,
> knowing that thankfully userspace doesn't inject exceptions much.
>
> >
> > For #DB, I suspect it's a non-issue. The exit is synchronous, so unless userspace
> > is deferring the reflection, which would be architecturally wrong in and of itself,
> > there can never be another pending exception.
> Could very be, but still there could be corner cases. Like what if you set data fetch
> breakpoint on a IDT entry of some exception? I guess during delivery of that exception
> there might be #DB, but I am not 100% expert on when and how #DB is generated, so
> I can't be sure.

Data #DBs are trap-like. The #DB will arrive after exception delivery completes,
i.e. will occur "on" the first instruction in the exception handler.