2022-07-15 21:06:46

by Sean Christopherson

Subject: [PATCH v2 00/24] KVM: x86: Event/exception fixes and cleanups

The main goal of this series is to fix KVM's longstanding bug of not
honoring L1's exception intercept wants when handling an exception that
occurs during delivery of a different exception. E.g. if L0 and L1 are
using shadow paging, and L2 hits a #PF, and then hits another #PF while
vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
so that the #PF is routed to L1, not injected into L2 as a #DF.
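
For illustration only, here is a minimal standalone sketch (not KVM code;
the l1_exception_intercepts bitmap and queue_exception_for_l2() are
invented for this example, and the #DF rule is simplified to the #PF+#PF
case above) of the ordering the series enforces, i.e. consult L1's
intercepts before applying the #DF escalation:

#include <stdbool.h>
#include <stdio.h>

#define PF_VECTOR 14
#define DF_VECTOR 8

/* Stand-in for L1's exception bitmap (bit N = L1 intercepts vector N). */
static unsigned int l1_exception_intercepts = 1u << PF_VECTOR;

static bool l1_intercepts(unsigned int vector)
{
	return l1_exception_intercepts & (1u << vector);
}

/*
 * A new exception that hits while a previously injected exception is being
 * delivered to L2 must be routed to L1 if L1 intercepts it; only exceptions
 * that stay in L2's domain are eligible for escalation to #DF (escalation
 * rule simplified here to the #PF+#PF case from the changelog).
 */
static void queue_exception_for_l2(unsigned int injected, unsigned int pending)
{
	if (l1_intercepts(pending)) {
		printf("synthesize a VM-Exit to L1 for vector %u\n", pending);
		return;
	}
	if (injected == PF_VECTOR && pending == PF_VECTOR) {
		printf("escalate to #DF (vector %u) for L2\n", DF_VECTOR);
		return;
	}
	printf("inject vector %u into L2\n", pending);
}

int main(void)
{
	/* Second #PF while vectoring the first: routed to L1, not #DF in L2. */
	queue_exception_for_l2(PF_VECTOR, PF_VECTOR);
	return 0;
}

Prior to this series, the intercept check happened only at nested VM-Entry,
i.e. after the #PF had already been escalated.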

nVMX has hacked around the bug for years by overriding the #PF injector
for shadow paging to go straight to VM-Exit, and nSVM has started doing
the same. The hacks mostly work, but they're incomplete, confusing, and
lead to other hacky code, e.g. bailing from the emulator because #PF
injection forced a VM-Exit and suddenly KVM is back in L1.

Maxim, I believe I addressed all of your comments, holler if I missed
something.

v2:
- Collect reviews. [Maxim, Jim]
- Split a few patches into more consumable chunks. [Maxim]
- Document that KVM doesn't correctly handle SMI+MTF (or SMI priority). [Maxim]
- Add comment to document the instruction boundary (event window) aspect
of block_nested_events. [Maxim]
- Add a patch to rename inject_pending_events() and add a comment to
document KVM's not-quite-architecturally-correct handling of instruction
boundaries and asynchronous events. [Maxim]

v1: https://lore.kernel.org/all/[email protected]

Sean Christopherson (24):
KVM: nVMX: Unconditionally purge queued/injected events on nested
"exit"
KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
KVM: x86: Don't check for code breakpoints when emulating on exception
KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
KVM: x86: Treat #DBs from the emulator as fault-like (code and
DR7.GD=1)
KVM: x86: Use DR7_GD macro instead of open coding check in emulator
KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
KVM: x86: Make kvm_queued_exception a properly named, visible struct
KVM: x86: Formalize blocking of nested pending exceptions
KVM: x86: Use kvm_queue_exception_e() to queue #DF
KVM: x86: Hoist nested event checks above event injection logic
KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
VM-Exit
KVM: nVMX: Add a helper to identify low-priority #DB traps
KVM: nVMX: Document priority of all known events on Intel CPUs
KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
behavior
KVM: x86: Rename inject_pending_events() to
kvm_check_and_inject_events()
KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
KVM: selftests: Add an x86-only test to verify nested exception
queueing

arch/x86/include/asm/kvm-x86-ops.h | 2 +-
arch/x86/include/asm/kvm_host.h | 35 +-
arch/x86/kvm/emulate.c | 3 +-
arch/x86/kvm/svm/nested.c | 110 ++---
arch/x86/kvm/svm/svm.c | 20 +-
arch/x86/kvm/vmx/nested.c | 329 ++++++++-----
arch/x86/kvm/vmx/sgx.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 53 ++-
arch/x86/kvm/x86.c | 450 ++++++++++++------
arch/x86/kvm/x86.h | 11 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/svm_util.h | 7 +-
.../selftests/kvm/include/x86_64/vmx.h | 51 +-
.../kvm/x86_64/nested_exceptions_test.c | 295 ++++++++++++
15 files changed, 950 insertions(+), 420 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c


base-commit: 8031d87aa9953ddeb047a5356ebd0b240c30f233
--
2.37.0.170.g444d1eabd0-goog


2022-07-15 21:06:59

by Sean Christopherson

Subject: [PATCH v2 19/24] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time

Morph pending exceptions to pending VM-Exits (due to interception) when
the exception is queued instead of waiting until nested events are
checked at VM-Entry. This fixes a longstanding bug where KVM fails to
handle an exception that occurs during delivery of a previous exception,
KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow
paging), and KVM determines that the exception is in the guest's domain,
i.e. queues the new exception for L2. Deferring the interception check
causes KVM to escalate various combinations of injected+pending exceptions
to double fault (#DF) without consulting L1's interception desires, and
ends up injecting a spurious #DF into L2.

KVM has fudged around the issue for #PF by special casing emulated #PF
injection for shadow paging, but the underlying issue is not unique to
shadow paging in L0, e.g. if KVM is intercepting #PF because the guest
has a smaller maxphyaddr and L1 (but not L0) is using shadow paging.
Other exceptions are affected as well, e.g. if KVM is intercepting #GP
for one of SVM's workarounds or for the VMware backdoor emulation stuff.
The other cases have gone unnoticed because the #DF is spurious if and
only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1
would have injected #DF anyways.

The hack-a-fix has also led to ugly code, e.g. bailing from the emulator
if #PF injection forced a nested VM-Exit and the emulator finds itself
back in L1. Allowing for direct-to-VM-Exit queueing also neatly solves
the async #PF in L2 mess; no need to set a magic flag and token, simply
queue a #PF nested VM-Exit.

Deal with event migration by flagging that a pending exception was queued
by userspace and check for interception at the next KVM_RUN, e.g. so that
KVM does the right thing regardless of the order in which userspace
restores nested state vs. event state.
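
Purely as an illustration (a hypothetical migration-target fragment, not
part of this patch; restore_vcpu() and vcpu_fd are made up), the point is
that userspace can issue the two "set" ioctls in either order and the
interception check still happens at KVM_RUN:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Hypothetical migration-target fragment: the order of the two "set" ioctls
 * below no longer matters, because interception of a userspace-stuffed
 * pending exception is (re)evaluated at the next KVM_RUN.
 */
int restore_vcpu(int vcpu_fd, struct kvm_nested_state *nested,
		 struct kvm_vcpu_events *events)
{
	if (ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, events))	/* pending #PF, etc. */
		return -1;
	if (ioctl(vcpu_fd, KVM_SET_NESTED_STATE, nested))	/* L2 active */
		return -1;
	return ioctl(vcpu_fd, KVM_RUN, 0);	/* interception check happens here */
}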

When "getting" events from userspace, simply drop any pending exception
that is destined to be intercepted if there is also an injected exception
to be migrated. Ideally, KVM would migrate both events, but that would
require new ABI, and practically speaking losing the event is unlikely to
be noticed, let alone fatal. The injected exception is captured, RIP
still points at the original faulting instruction, etc... So either the
injection on the target will trigger the same intercepted exception, or
the source of the intercepted exception was transient and/or
non-deterministic, thus dropping it is ok-ish.

Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0")
Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2")
Cc: Jim Mattson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 12 ++-
arch/x86/kvm/svm/nested.c | 41 +++-----
arch/x86/kvm/vmx/nested.c | 109 ++++++++++------------
arch/x86/kvm/vmx/vmx.c | 6 +-
arch/x86/kvm/x86.c | 159 ++++++++++++++++++++++----------
arch/x86/kvm/x86.h | 7 ++
6 files changed, 187 insertions(+), 147 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0a6a05e25f24..6bcbffb42420 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -647,7 +647,6 @@ struct kvm_queued_exception {
u32 error_code;
unsigned long payload;
bool has_payload;
- u8 nested_apf;
};

struct kvm_vcpu_arch {
@@ -748,8 +747,12 @@ struct kvm_vcpu_arch {

u8 event_exit_inst_len;

+ bool exception_from_userspace;
+
/* Exceptions to be injected to the guest. */
struct kvm_queued_exception exception;
+ /* Exception VM-Exits to be synthesized to L1. */
+ struct kvm_queued_exception exception_vmexit;

struct kvm_queued_interrupt {
bool injected;
@@ -860,7 +863,6 @@ struct kvm_vcpu_arch {
u32 id;
bool send_user_only;
u32 host_apf_flags;
- unsigned long nested_apf_token;
bool delivery_as_pf_vmexit;
bool pageready_pending;
} apf;
@@ -1636,9 +1638,9 @@ struct kvm_x86_ops {

struct kvm_x86_nested_ops {
void (*leave_nested)(struct kvm_vcpu *vcpu);
+ bool (*is_exception_vmexit)(struct kvm_vcpu *vcpu, u8 vector,
+ u32 error_code);
int (*check_events)(struct kvm_vcpu *vcpu);
- bool (*handle_page_fault_workaround)(struct kvm_vcpu *vcpu,
- struct x86_exception *fault);
bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
void (*triple_fault)(struct kvm_vcpu *vcpu);
int (*get_state)(struct kvm_vcpu *vcpu,
@@ -1865,7 +1867,7 @@ void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long pay
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
-bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
+void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
struct x86_exception *fault);
bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr);
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index f5676c2679d0..0a8ee5f28319 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -55,28 +55,6 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
nested_svm_vmexit(svm);
}

-static bool nested_svm_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
- struct x86_exception *fault)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- struct vmcb *vmcb = svm->vmcb;
-
- WARN_ON(!is_guest_mode(vcpu));
-
- if (vmcb12_is_intercept(&svm->nested.ctl,
- INTERCEPT_EXCEPTION_OFFSET + PF_VECTOR) &&
- !WARN_ON_ONCE(svm->nested.nested_run_pending)) {
- vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
- vmcb->control.exit_code_hi = 0;
- vmcb->control.exit_info_1 = fault->error_code;
- vmcb->control.exit_info_2 = fault->address;
- nested_svm_vmexit(svm);
- return true;
- }
-
- return false;
-}
-
static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
{
struct vcpu_svm *svm = to_svm(vcpu);
@@ -1302,16 +1280,17 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
return 0;
}

-static bool nested_exit_on_exception(struct vcpu_svm *svm)
+static bool nested_svm_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
+ u32 error_code)
{
- unsigned int vector = svm->vcpu.arch.exception.vector;
+ struct vcpu_svm *svm = to_svm(vcpu);

return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(vector));
}

static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
{
- struct kvm_queued_exception *ex = &vcpu->arch.exception;
+ struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb *vmcb = svm->vmcb;

@@ -1379,15 +1358,19 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
return 0;
}

- if (vcpu->arch.exception.pending) {
+ if (vcpu->arch.exception_vmexit.pending) {
if (block_nested_exceptions)
return -EBUSY;
- if (!nested_exit_on_exception(svm))
- return 0;
nested_svm_inject_exception_vmexit(vcpu);
return 0;
}

+ if (vcpu->arch.exception.pending) {
+ if (block_nested_exceptions)
+ return -EBUSY;
+ return 0;
+ }
+
if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
if (block_nested_events)
return -EBUSY;
@@ -1725,8 +1708,8 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)

struct kvm_x86_nested_ops svm_nested_ops = {
.leave_nested = svm_leave_nested,
+ .is_exception_vmexit = nested_svm_is_exception_vmexit,
.check_events = svm_check_nested_events,
- .handle_page_fault_workaround = nested_svm_handle_page_fault_workaround,
.triple_fault = nested_svm_triple_fault,
.get_nested_state_pages = svm_get_nested_state_pages,
.get_state = svm_get_nested_state,
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 981f98ef96f1..5a6ba62dcd49 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -439,59 +439,22 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
return inequality ^ bit;
}

-
-/*
- * KVM wants to inject page-faults which it got to the guest. This function
- * checks whether in a nested guest, we need to inject them to L1 or L2.
- */
-static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
-{
- struct kvm_queued_exception *ex = &vcpu->arch.exception;
- struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-
- if (ex->vector == PF_VECTOR) {
- if (ex->nested_apf) {
- *exit_qual = vcpu->arch.apf.nested_apf_token;
- return 1;
- }
- if (nested_vmx_is_page_fault_vmexit(vmcs12, ex->error_code)) {
- *exit_qual = ex->has_payload ? ex->payload : vcpu->arch.cr2;
- return 1;
- }
- } else if (vmcs12->exception_bitmap & (1u << ex->vector)) {
- if (ex->vector == DB_VECTOR) {
- if (ex->has_payload) {
- *exit_qual = ex->payload;
- } else {
- *exit_qual = vcpu->arch.dr6;
- *exit_qual &= ~DR6_BT;
- *exit_qual ^= DR6_ACTIVE_LOW;
- }
- } else
- *exit_qual = 0;
- return 1;
- }
-
- return 0;
-}
-
-static bool nested_vmx_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
- struct x86_exception *fault)
+static bool nested_vmx_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
+ u32 error_code)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

- WARN_ON(!is_guest_mode(vcpu));
+ /*
+ * Drop bits 31:16 of the error code when performing the #PF mask+match
+ * check. All VMCS fields involved are 32 bits, but Intel CPUs never
+ * set bits 31:16 and VMX disallows setting bits 31:16 in the injected
+ * error code. Including the to-be-dropped bits in the check might
+ * result in an "impossible" or missed exit from L1's perspective.
+ */
+ if (vector == PF_VECTOR)
+ return nested_vmx_is_page_fault_vmexit(vmcs12, (u16)error_code);

- if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
- !WARN_ON_ONCE(to_vmx(vcpu)->nested.nested_run_pending)) {
- vmcs12->vm_exit_intr_error_code = fault->error_code;
- nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
- PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
- INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
- fault->address);
- return true;
- }
- return false;
+ return (vmcs12->exception_bitmap & (1u << vector));
}

static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
@@ -3812,12 +3775,24 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
return -ENXIO;
}

-static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
- unsigned long exit_qual)
+static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu)
{
- struct kvm_queued_exception *ex = &vcpu->arch.exception;
+ struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+ unsigned long exit_qual;
+
+ if (ex->has_payload) {
+ exit_qual = ex->payload;
+ } else if (ex->vector == PF_VECTOR) {
+ exit_qual = vcpu->arch.cr2;
+ } else if (ex->vector == DB_VECTOR) {
+ exit_qual = vcpu->arch.dr6;
+ exit_qual &= ~DR6_BT;
+ exit_qual ^= DR6_ACTIVE_LOW;
+ } else {
+ exit_qual = 0;
+ }

if (ex->has_error_code) {
/*
@@ -3988,7 +3963,6 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
struct vcpu_vmx *vmx = to_vmx(vcpu);
- unsigned long exit_qual;
/*
* Only a pending nested run blocks a pending exception. If there is a
* previously injected event, the pending exception occurred while said
@@ -4042,14 +4016,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
* across SMI/RSM as it should; that needs to be addressed in order to
* prioritize SMI over MTF and trap-like #DBs.
*/
+ if (vcpu->arch.exception_vmexit.pending &&
+ !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
+ if (block_nested_exceptions)
+ return -EBUSY;
+
+ nested_vmx_inject_exception_vmexit(vcpu);
+ return 0;
+ }
+
if (vcpu->arch.exception.pending &&
!vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
if (block_nested_exceptions)
return -EBUSY;
- if (!nested_vmx_check_exception(vcpu, &exit_qual))
- goto no_vmexit;
- nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
- return 0;
+ goto no_vmexit;
}

if (vmx->nested.mtf_pending) {
@@ -4060,13 +4040,18 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
return 0;
}

+ if (vcpu->arch.exception_vmexit.pending) {
+ if (block_nested_exceptions)
+ return -EBUSY;
+
+ nested_vmx_inject_exception_vmexit(vcpu);
+ return 0;
+ }
+
if (vcpu->arch.exception.pending) {
if (block_nested_exceptions)
return -EBUSY;
- if (!nested_vmx_check_exception(vcpu, &exit_qual))
- goto no_vmexit;
- nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
- return 0;
+ goto no_vmexit;
}

if (nested_vmx_preemption_timer_pending(vcpu)) {
@@ -6952,8 +6937,8 @@ __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *))

struct kvm_x86_nested_ops vmx_nested_ops = {
.leave_nested = vmx_leave_nested,
+ .is_exception_vmexit = nested_vmx_is_exception_vmexit,
.check_events = vmx_check_nested_events,
- .handle_page_fault_workaround = nested_vmx_handle_page_fault_workaround,
.hv_timer_pending = nested_vmx_preemption_timer_pending,
.triple_fault = nested_vmx_triple_fault,
.get_state = vmx_get_nested_state,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7d3abe2a206a..5302b046110f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1585,7 +1585,9 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
*/
if (nested_cpu_has_mtf(vmcs12) &&
(!vcpu->arch.exception.pending ||
- vcpu->arch.exception.vector == DB_VECTOR))
+ vcpu->arch.exception.vector == DB_VECTOR) &&
+ (!vcpu->arch.exception_vmexit.pending ||
+ vcpu->arch.exception_vmexit.vector == DB_VECTOR))
vmx->nested.mtf_pending = true;
else
vmx->nested.mtf_pending = false;
@@ -5637,7 +5639,7 @@ static bool vmx_emulation_required_with_pending_exception(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);

return vmx->emulation_required && !vmx->rmode.vm86_active &&
- (vcpu->arch.exception.pending || vcpu->arch.exception.injected);
+ (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected);
}

static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 795c799fc767..9be2fdf834ad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -609,6 +609,21 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
}
EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);

+static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vector,
+ bool has_error_code, u32 error_code,
+ bool has_payload, unsigned long payload)
+{
+ struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
+
+ ex->vector = vector;
+ ex->injected = false;
+ ex->pending = true;
+ ex->has_error_code = has_error_code;
+ ex->error_code = error_code;
+ ex->has_payload = has_payload;
+ ex->payload = payload;
+}
+
static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
unsigned nr, bool has_error, u32 error_code,
bool has_payload, unsigned long payload, bool reinject)
@@ -618,18 +633,31 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,

kvm_make_request(KVM_REQ_EVENT, vcpu);

+ /*
+ * If the exception is destined for L2 and isn't being reinjected,
+ * morph it to a VM-Exit if L1 wants to intercept the exception. A
+ * previously injected exception is not checked because it was checked
+ * when it was originally queued, and re-checking is incorrect if _L1_
+ * injected the exception, in which case it's exempt from interception.
+ */
+ if (!reinject && is_guest_mode(vcpu) &&
+ kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
+ kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
+ has_payload, payload);
+ return;
+ }
+
if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
queue:
if (reinject) {
/*
- * On vmentry, vcpu->arch.exception.pending is only
- * true if an event injection was blocked by
- * nested_run_pending. In that case, however,
- * vcpu_enter_guest requests an immediate exit,
- * and the guest shouldn't proceed far enough to
- * need reinjection.
+ * On VM-Entry, an exception can be pending if and only
+ * if event injection was blocked by nested_run_pending.
+ * In that case, however, vcpu_enter_guest() requests an
+ * immediate exit, and the guest shouldn't proceed far
+ * enough to need reinjection.
*/
- WARN_ON_ONCE(vcpu->arch.exception.pending);
+ WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
vcpu->arch.exception.injected = true;
if (WARN_ON_ONCE(has_payload)) {
/*
@@ -732,20 +760,22 @@ static int complete_emulated_insn_gp(struct kvm_vcpu *vcpu, int err)
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
{
++vcpu->stat.pf_guest;
- vcpu->arch.exception.nested_apf =
- is_guest_mode(vcpu) && fault->async_page_fault;
- if (vcpu->arch.exception.nested_apf) {
- vcpu->arch.apf.nested_apf_token = fault->address;
- kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
- } else {
+
+ /*
+ * Async #PF in L2 is always forwarded to L1 as a VM-Exit regardless of
+ * whether or not L1 wants to intercept "regular" #PF.
+ */
+ if (is_guest_mode(vcpu) && fault->async_page_fault)
+ kvm_queue_exception_vmexit(vcpu, PF_VECTOR,
+ true, fault->error_code,
+ true, fault->address);
+ else
kvm_queue_exception_e_p(vcpu, PF_VECTOR, fault->error_code,
fault->address);
- }
}
EXPORT_SYMBOL_GPL(kvm_inject_page_fault);

-/* Returns true if the page fault was immediately morphed into a VM-Exit. */
-bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
+void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
struct x86_exception *fault)
{
struct kvm_mmu *fault_mmu;
@@ -763,26 +793,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
fault_mmu->root.hpa);

- /*
- * A workaround for KVM's bad exception handling. If KVM injected an
- * exception into L2, and L2 encountered a #PF while vectoring the
- * injected exception, manually check to see if L1 wants to intercept
- * #PF, otherwise queuing the #PF will lead to #DF or a lost exception.
- * In all other cases, defer the check to nested_ops->check_events(),
- * which will correctly handle priority (this does not). Note, other
- * exceptions, e.g. #GP, are theoretically affected, #PF is simply the
- * most problematic, e.g. when L0 and L1 are both intercepting #PF for
- * shadow paging.
- *
- * TODO: Rewrite exception handling to track injected and pending
- * (VM-Exit) exceptions separately.
- */
- if (unlikely(vcpu->arch.exception.injected && is_guest_mode(vcpu)) &&
- kvm_x86_ops.nested_ops->handle_page_fault_workaround(vcpu, fault))
- return true;
-
fault_mmu->inject_page_fault(vcpu, fault);
- return false;
}
EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);

@@ -4820,7 +4831,7 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
return (kvm_arch_interrupt_allowed(vcpu) &&
kvm_cpu_accept_dm_intr(vcpu) &&
!kvm_event_needs_reinjection(vcpu) &&
- !vcpu->arch.exception.pending);
+ !kvm_is_exception_pending(vcpu));
}

static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
@@ -4995,13 +5006,27 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
struct kvm_vcpu_events *events)
{
- struct kvm_queued_exception *ex = &vcpu->arch.exception;
+ struct kvm_queued_exception *ex;

process_nmi(vcpu);

if (kvm_check_request(KVM_REQ_SMI, vcpu))
process_smi(vcpu);

+ /*
+ * KVM's ABI only allows for one exception to be migrated. Luckily,
+ * the only time there can be two queued exceptions is if there's a
+ * non-exiting _injected_ exception, and a pending exiting exception.
+ * In that case, ignore the VM-Exiting exception as it's an extension
+ * of the injected exception.
+ */
+ if (vcpu->arch.exception_vmexit.pending &&
+ !vcpu->arch.exception.pending &&
+ !vcpu->arch.exception.injected)
+ ex = &vcpu->arch.exception_vmexit;
+ else
+ ex = &vcpu->arch.exception;
+
/*
* In guest mode, payload delivery should be deferred if the exception
* will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
@@ -5108,6 +5133,19 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
return -EINVAL;

process_nmi(vcpu);
+
+ /*
+ * Flag that userspace is stuffing an exception, the next KVM_RUN will
+ * morph the exception to a VM-Exit if appropriate. Do this only for
+ * pending exceptions, already-injected exceptions are not subject to
+ * interception. Note, userspace that conflates pending and injected
+ * is hosed, and will incorrectly convert an injected exception into a
+ * pending exception, which in turn may cause a spurious VM-Exit.
+ */
+ vcpu->arch.exception_from_userspace = events->exception.pending;
+
+ vcpu->arch.exception_vmexit.pending = false;
+
vcpu->arch.exception.injected = events->exception.injected;
vcpu->arch.exception.pending = events->exception.pending;
vcpu->arch.exception.vector = events->exception.nr;
@@ -8130,18 +8168,17 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
}
}

-static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
+static void inject_emulated_exception(struct kvm_vcpu *vcpu)
{
struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+
if (ctxt->exception.vector == PF_VECTOR)
- return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
-
- if (ctxt->exception.error_code_valid)
+ kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
+ else if (ctxt->exception.error_code_valid)
kvm_queue_exception_e(vcpu, ctxt->exception.vector,
ctxt->exception.error_code);
else
kvm_queue_exception(vcpu, ctxt->exception.vector);
- return false;
}

static struct x86_emulate_ctxt *alloc_emulate_ctxt(struct kvm_vcpu *vcpu)
@@ -8754,8 +8791,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,

if (ctxt->have_exception) {
r = 1;
- if (inject_emulated_exception(vcpu))
- return r;
+ inject_emulated_exception(vcpu);
} else if (vcpu->arch.pio.count) {
if (!vcpu->arch.pio.in) {
/* FIXME: return into emulator if single-stepping. */
@@ -9695,7 +9731,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
*/
if (vcpu->arch.exception.injected)
kvm_inject_exception(vcpu);
- else if (vcpu->arch.exception.pending)
+ else if (kvm_is_exception_pending(vcpu))
; /* see above */
else if (vcpu->arch.nmi_injected)
static_call(kvm_x86_inject_nmi)(vcpu);
@@ -9722,6 +9758,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
if (r < 0)
goto out;

+ /*
+ * A pending exception VM-Exit should either result in nested VM-Exit
+ * or force an immediate re-entry and exit to/from L2, and exception
+ * VM-Exits cannot be injected (flag should _never_ be set).
+ */
+ WARN_ON_ONCE(vcpu->arch.exception_vmexit.injected ||
+ vcpu->arch.exception_vmexit.pending);
+
/*
* New events, other than exceptions, cannot be injected if KVM needs
* to re-inject a previous event. See above comments on re-injecting
@@ -9821,7 +9865,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
*req_immediate_exit = true;

- WARN_ON(vcpu->arch.exception.pending);
+ WARN_ON(kvm_is_exception_pending(vcpu));
return 0;

out:
@@ -10839,6 +10883,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)

int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
+ struct kvm_queued_exception *ex = &vcpu->arch.exception;
struct kvm_run *kvm_run = vcpu->run;
int r;

@@ -10897,6 +10942,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
}
}

+ /*
+ * If userspace set a pending exception and L2 is active, convert it to
+ * a pending VM-Exit if L1 wants to intercept the exception.
+ */
+ if (vcpu->arch.exception_from_userspace && is_guest_mode(vcpu) &&
+ kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, ex->vector,
+ ex->error_code)) {
+ kvm_queue_exception_vmexit(vcpu, ex->vector,
+ ex->has_error_code, ex->error_code,
+ ex->has_payload, ex->payload);
+ ex->injected = false;
+ ex->pending = false;
+ }
+ vcpu->arch.exception_from_userspace = false;
+
if (unlikely(vcpu->arch.complete_userspace_io)) {
int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
vcpu->arch.complete_userspace_io = NULL;
@@ -11003,6 +11063,7 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);

vcpu->arch.exception.pending = false;
+ vcpu->arch.exception_vmexit.pending = false;

kvm_make_request(KVM_REQ_EVENT, vcpu);
}
@@ -11370,7 +11431,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,

if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
r = -EBUSY;
- if (vcpu->arch.exception.pending)
+ if (kvm_is_exception_pending(vcpu))
goto out;
if (dbg->control & KVM_GUESTDBG_INJECT_DB)
kvm_queue_exception(vcpu, DB_VECTOR);
@@ -12554,7 +12615,7 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
if (vcpu->arch.pv.pv_unhalted)
return true;

- if (vcpu->arch.exception.pending)
+ if (kvm_is_exception_pending(vcpu))
return true;

if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
@@ -12809,7 +12870,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
{
if (unlikely(!lapic_in_kernel(vcpu) ||
kvm_event_needs_reinjection(vcpu) ||
- vcpu->arch.exception.pending))
+ kvm_is_exception_pending(vcpu)))
return false;

if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index dc2af0146220..eee259e387d3 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -82,10 +82,17 @@ static inline unsigned int __shrink_ple_window(unsigned int val,
void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu);
int kvm_check_nested_events(struct kvm_vcpu *vcpu);

+static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.exception.pending ||
+ vcpu->arch.exception_vmexit.pending;
+}
+
static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
{
vcpu->arch.exception.pending = false;
vcpu->arch.exception.injected = false;
+ vcpu->arch.exception_vmexit.pending = false;
}

static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,
--
2.37.0.170.g444d1eabd0-goog

2022-07-15 21:08:05

by Sean Christopherson

Subject: [PATCH v2 05/24] KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag

Service TSS T-flag #DBs prior to pending MTFs, as such #DBs are higher
priority than MTF. KVM itself doesn't emulate TSS #DBs, and any such
exceptions injected from L1 will be handled by hardware (or morphed to
a fault-like exception if injection fails), but theoretically userspace
could pend a TSS T-flag #DB in conjunction with a pending MTF.

Note, there's no known use case this fixes; it's purely to be technically
correct with respect to Intel's SDM.
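
For reference, a toy standalone model (not the KVM helper; the name
db_trap_beats_mtf() is invented, and it assumes the pending #DB's DR6
payload uses positive-polarity bits) of the new "& ~DR6_BT" check below:

#include <stdbool.h>
#include <stdio.h>

/* Architectural DR6 bits. */
#define DR6_BT	(1u << 15)	/* task-switch (TSS T flag) #DB */
#define DR6_BS	(1u << 14)	/* single-step #DB */

/*
 * A pending trap-like #DB is serviced before a pending MTF VM-Exit unless it
 * is a "low priority" trap (data/IO breakpoint, single-step).  A payload of 0
 * corresponds to a pending exception that isn't a trap-like #DB at all.
 */
static bool db_trap_beats_mtf(unsigned int pending_dr6_payload)
{
	return !(pending_dr6_payload & ~DR6_BT);
}

int main(void)
{
	printf("TSS T-flag #DB before MTF?  %d\n", db_trap_beats_mtf(DR6_BT));	/* 1 */
	printf("single-step #DB before MTF? %d\n", db_trap_beats_mtf(DR6_BS));	/* 0 */
	return 0;
}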

Cc: Oliver Upton <[email protected]>
Cc: Peter Shier <[email protected]>
Fixes: 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation")
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2409ed8dbc71..bc5759f82a3f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3932,15 +3932,17 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
}

/*
- * Process any exceptions that are not debug traps before MTF.
+ * Process exceptions that are higher priority than Monitor Trap Flag:
+ * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
+ * could theoretically come in from userspace), and ICEBP (INT1).
*
* Note that only a pending nested run can block a pending exception.
* Otherwise an injected NMI/interrupt should either be
* lost or delivered to the nested hypervisor in the IDT_VECTORING_INFO,
* while delivering the pending exception.
*/
-
- if (vcpu->arch.exception.pending && !vmx_get_pending_dbg_trap(vcpu)) {
+ if (vcpu->arch.exception.pending &&
+ !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
if (vmx->nested.nested_run_pending)
return -EBUSY;
if (!nested_vmx_check_exception(vcpu, &exit_qual))
--
2.37.0.170.g444d1eabd0-goog

2022-07-15 21:10:42

by Sean Christopherson

Subject: [PATCH v2 06/24] KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1)

Add a dedicated "exception type" for #DBs, as #DBs can be fault-like or
trap-like depending on the sub-type of #DB, and effectively defer the
decision of what to do with the #DB to the caller.

For the emulator's two calls to exception_type(), treat the #DB as
fault-like, as the emulator handles only code breakpoint and general
detect #DBs, both of which are fault-like.

For event injection, which uses exception_type() to determine whether to
set EFLAGS.RF=1 on the stack, keep the current behavior of not setting
RF=1 for #DBs. Intel and AMD explicitly state RF isn't set on code #DBs,
so exempting them via the failed "== EXCPT_FAULT" check is correct. The
only other fault-like #DB is General Detect, and despite Intel and AMD
both strongly implying (through omission) that General Detect #DBs should
set RF=1, hardware (multiple generations of both Intel and AMD) in fact
does not. Through insider knowledge, extreme foresight, sheer dumb luck, or
some combination thereof, KVM correctly handled RF for General Detect #DBs.
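
As a standalone illustration (a toy model, not the kernel's
exception_type(), which covers more vectors), the RF decision reduces to:

#include <stdio.h>

#define DB_VECTOR 1
#define BP_VECTOR 3
#define GP_VECTOR 13

#define X86_EFLAGS_RF (1u << 16)

enum excpt_type { EXCPT_FAULT, EXCPT_TRAP, EXCPT_DB };

/* Toy classifier: #DB gets its own class, #BP is a trap, the other vectors
 * modeled here are faults. */
static enum excpt_type exception_type(int vector)
{
	if (vector == DB_VECTOR)
		return EXCPT_DB;
	if (vector == BP_VECTOR)
		return EXCPT_TRAP;
	return EXCPT_FAULT;
}

int main(void)
{
	unsigned int rflags = 0;
	int vector = GP_VECTOR;	/* try DB_VECTOR: RF stays clear */

	/* RF=1 is pushed only for fault-class exceptions other than #DB. */
	if (exception_type(vector) == EXCPT_FAULT)
		rflags |= X86_EFLAGS_RF;

	printf("vector %d -> RF=%d\n", vector, !!(rflags & X86_EFLAGS_RF));
	return 0;
}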

Fixes: 38827dbd3fb8 ("KVM: x86: Do not update EFLAGS on faulting emulation")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4efdb61e60ba..37d686dbb6aa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -529,6 +529,7 @@ static int exception_class(int vector)
#define EXCPT_TRAP 1
#define EXCPT_ABORT 2
#define EXCPT_INTERRUPT 3
+#define EXCPT_DB 4

static int exception_type(int vector)
{
@@ -539,8 +540,14 @@ static int exception_type(int vector)

mask = 1 << vector;

- /* #DB is trap, as instruction watchpoints are handled elsewhere */
- if (mask & ((1 << DB_VECTOR) | (1 << BP_VECTOR) | (1 << OF_VECTOR)))
+ /*
+ * #DBs can be trap-like or fault-like, the caller must check other CPU
+ * state, e.g. DR6, to determine whether a #DB is a trap or fault.
+ */
+ if (mask & (1 << DB_VECTOR))
+ return EXCPT_DB;
+
+ if (mask & ((1 << BP_VECTOR) | (1 << OF_VECTOR)))
return EXCPT_TRAP;

if (mask & ((1 << DF_VECTOR) | (1 << MC_VECTOR)))
@@ -8785,6 +8792,12 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
unsigned long rflags = static_call(kvm_x86_get_rflags)(vcpu);
toggle_interruptibility(vcpu, ctxt->interruptibility);
vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
+
+ /*
+ * Note, EXCPT_DB is assumed to be fault-like as the emulator
+ * only supports code breakpoints and general detect #DB, both
+ * of which are fault-like.
+ */
if (!ctxt->have_exception ||
exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
@@ -9701,6 +9714,16 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)

/* try to inject new event if pending */
if (vcpu->arch.exception.pending) {
+ /*
+ * Fault-class exceptions, except #DBs, set RF=1 in the RFLAGS
+ * value pushed on the stack. Trap-like exceptions and all #DBs
+ * leave RF as-is (KVM follows Intel's behavior in this regard;
+ * AMD states that code breakpoint #DBs explicitly clear RF).
+ *
+ * Note, most versions of Intel's SDM and AMD's APM incorrectly
+ * describe the behavior of General Detect #DBs, which are
+ * fault-like. They do _not_ set RF, a la code breakpoints.
+ */
if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
__kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
X86_EFLAGS_RF);
--
2.37.0.170.g444d1eabd0-goog

2022-07-15 21:11:14

by Sean Christopherson

Subject: [PATCH v2 24/24] KVM: selftests: Add an x86-only test to verify nested exception queueing

Add a test to verify that KVM_{G,S}ET_EVENTS play nice with pending vs.
injected exceptions when an exception is being queued for L2, and that
KVM correctly handles L1's exception intercept wants.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/nested_exceptions_test.c | 295 ++++++++++++++++++
3 files changed, 297 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 91429330faea..6c0c37fe81b6 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -28,6 +28,7 @@
/x86_64/max_vcpuid_cap_test
/x86_64/mmio_warning_test
/x86_64/monitor_mwait_test
+/x86_64/nested_exceptions_test
/x86_64/nx_huge_pages_test
/x86_64/platform_info_test
/x86_64/pmu_event_filter_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 6b22fb1ce2b9..c6a67c389ac7 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -88,6 +88,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
+TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
diff --git a/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c b/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
new file mode 100644
index 000000000000..ac33835f78f4
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE /* for program_invocation_short_name */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "vmx.h"
+#include "svm_util.h"
+
+#define L2_GUEST_STACK_SIZE 256
+
+/*
+ * Arbitrary, never shoved into KVM/hardware, just need to avoid conflict with
+ * the "real" exceptions used, #SS/#GP/#DF (12/13/8).
+ */
+#define FAKE_TRIPLE_FAULT_VECTOR 0xaa
+
+/* Arbitrary 32-bit error code injected by this test. */
+#define SS_ERROR_CODE 0xdeadbeef
+
+/*
+ * Bit '0' is set on Intel if the exception occurs while delivering a previous
+ * event/exception. AMD's wording is ambiguous, but presumably the bit is set
+ * if the exception occurs while delivering an external event, e.g. NMI or INTR,
+ * but not for exceptions that occur when delivering other exceptions or
+ * software interrupts.
+ *
+ * Note, Intel's name for it, "External event", is misleading and much more
+ * aligned with AMD's behavior, but the SDM is quite clear on its behavior.
+ */
+#define ERROR_CODE_EXT_FLAG BIT(0)
+
+/*
+ * Bit '1' is set if the fault occurred when looking up a descriptor in the
+ * IDT, which is the case here as the IDT is empty/NULL.
+ */
+#define ERROR_CODE_IDT_FLAG BIT(1)
+
+/*
+ * The #GP that occurs when vectoring #SS should show the index into the IDT
+ * for #SS, plus have the "IDT flag" set.
+ */
+#define GP_ERROR_CODE_AMD ((SS_VECTOR * 8) | ERROR_CODE_IDT_FLAG)
+#define GP_ERROR_CODE_INTEL ((SS_VECTOR * 8) | ERROR_CODE_IDT_FLAG | ERROR_CODE_EXT_FLAG)
+
+/*
+ * Intel and AMD both shove '0' into the error code on #DF, regardless of what
+ * led to the double fault.
+ */
+#define DF_ERROR_CODE 0
+
+#define INTERCEPT_SS (BIT_ULL(SS_VECTOR))
+#define INTERCEPT_SS_DF (INTERCEPT_SS | BIT_ULL(DF_VECTOR))
+#define INTERCEPT_SS_GP_DF (INTERCEPT_SS_DF | BIT_ULL(GP_VECTOR))
+
+static void l2_ss_pending_test(void)
+{
+ GUEST_SYNC(SS_VECTOR);
+}
+
+static void l2_ss_injected_gp_test(void)
+{
+ GUEST_SYNC(GP_VECTOR);
+}
+
+static void l2_ss_injected_df_test(void)
+{
+ GUEST_SYNC(DF_VECTOR);
+}
+
+static void l2_ss_injected_tf_test(void)
+{
+ GUEST_SYNC(FAKE_TRIPLE_FAULT_VECTOR);
+}
+
+static void svm_run_l2(struct svm_test_data *svm, void *l2_code, int vector,
+ uint32_t error_code)
+{
+ struct vmcb *vmcb = svm->vmcb;
+ struct vmcb_control_area *ctrl = &vmcb->control;
+
+ vmcb->save.rip = (u64)l2_code;
+ run_guest(vmcb, svm->vmcb_gpa);
+
+ if (vector == FAKE_TRIPLE_FAULT_VECTOR)
+ return;
+
+ GUEST_ASSERT_EQ(ctrl->exit_code, (SVM_EXIT_EXCP_BASE + vector));
+ GUEST_ASSERT_EQ(ctrl->exit_info_1, error_code);
+}
+
+static void l1_svm_code(struct svm_test_data *svm)
+{
+ struct vmcb_control_area *ctrl = &svm->vmcb->control;
+ unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+ generic_svm_setup(svm, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+ svm->vmcb->save.idtr.limit = 0;
+ ctrl->intercept |= BIT_ULL(INTERCEPT_SHUTDOWN);
+
+ ctrl->intercept_exceptions = INTERCEPT_SS_GP_DF;
+ svm_run_l2(svm, l2_ss_pending_test, SS_VECTOR, SS_ERROR_CODE);
+ svm_run_l2(svm, l2_ss_injected_gp_test, GP_VECTOR, GP_ERROR_CODE_AMD);
+
+ ctrl->intercept_exceptions = INTERCEPT_SS_DF;
+ svm_run_l2(svm, l2_ss_injected_df_test, DF_VECTOR, DF_ERROR_CODE);
+
+ ctrl->intercept_exceptions = INTERCEPT_SS;
+ svm_run_l2(svm, l2_ss_injected_tf_test, FAKE_TRIPLE_FAULT_VECTOR, 0);
+ GUEST_ASSERT_EQ(ctrl->exit_code, SVM_EXIT_SHUTDOWN);
+
+ GUEST_DONE();
+}
+
+static void vmx_run_l2(void *l2_code, int vector, uint32_t error_code)
+{
+ GUEST_ASSERT(!vmwrite(GUEST_RIP, (u64)l2_code));
+
+ GUEST_ASSERT_EQ(vector == SS_VECTOR ? vmlaunch() : vmresume(), 0);
+
+ if (vector == FAKE_TRIPLE_FAULT_VECTOR)
+ return;
+
+ GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_EXCEPTION_NMI);
+ GUEST_ASSERT_EQ((vmreadz(VM_EXIT_INTR_INFO) & 0xff), vector);
+ GUEST_ASSERT_EQ(vmreadz(VM_EXIT_INTR_ERROR_CODE), error_code);
+}
+
+static void l1_vmx_code(struct vmx_pages *vmx)
+{
+ unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+ GUEST_ASSERT_EQ(prepare_for_vmx_operation(vmx), true);
+
+ GUEST_ASSERT_EQ(load_vmcs(vmx), true);
+
+ prepare_vmcs(vmx, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+ GUEST_ASSERT_EQ(vmwrite(GUEST_IDTR_LIMIT, 0), 0);
+
+ /*
+ * VMX disallows injecting an exception with error_code[31:16] != 0,
+ * and hardware will never generate a VM-Exit with bits 31:16 set.
+ * KVM should likewise truncate the "bad" userspace value.
+ */
+ GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS_GP_DF), 0);
+ vmx_run_l2(l2_ss_pending_test, SS_VECTOR, (u16)SS_ERROR_CODE);
+ vmx_run_l2(l2_ss_injected_gp_test, GP_VECTOR, GP_ERROR_CODE_INTEL);
+
+ GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS_DF), 0);
+ vmx_run_l2(l2_ss_injected_df_test, DF_VECTOR, DF_ERROR_CODE);
+
+ GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS), 0);
+ vmx_run_l2(l2_ss_injected_tf_test, FAKE_TRIPLE_FAULT_VECTOR, 0);
+ GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_TRIPLE_FAULT);
+
+ GUEST_DONE();
+}
+
+static void __attribute__((__flatten__)) l1_guest_code(void *test_data)
+{
+ if (this_cpu_has(X86_FEATURE_SVM))
+ l1_svm_code(test_data);
+ else
+ l1_vmx_code(test_data);
+}
+
+static void assert_ucall_vector(struct kvm_vcpu *vcpu, int vector)
+{
+ struct kvm_run *run = vcpu->run;
+ struct ucall uc;
+
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Unexpected exit reason: %u (%s),\n",
+ run->exit_reason, exit_reason_str(run->exit_reason));
+
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_SYNC:
+ TEST_ASSERT(vector == uc.args[1],
+ "Expected L2 to ask for %d, got %ld", vector, uc.args[1]);
+ break;
+ case UCALL_DONE:
+ TEST_ASSERT(vector == -1,
+ "Expected L2 to ask for %d, L2 says it's done", vector);
+ break;
+ case UCALL_ABORT:
+ TEST_FAIL("%s at %s:%ld (0x%lx != 0x%lx)",
+ (const char *)uc.args[0], __FILE__, uc.args[1],
+ uc.args[2], uc.args[3]);
+ break;
+ default:
+ TEST_FAIL("Expected L2 to ask for %d, got unexpected ucall %lu", vector, uc.cmd);
+ }
+}
+
+static void queue_ss_exception(struct kvm_vcpu *vcpu, bool inject)
+{
+ struct kvm_vcpu_events events;
+
+ vcpu_events_get(vcpu, &events);
+
+ TEST_ASSERT(!events.exception.pending,
+ "Vector %d unexpectedly pending", events.exception.nr);
+ TEST_ASSERT(!events.exception.injected,
+ "Vector %d unexpectedly injected", events.exception.nr);
+
+ events.flags = KVM_VCPUEVENT_VALID_PAYLOAD;
+ events.exception.pending = !inject;
+ events.exception.injected = inject;
+ events.exception.nr = SS_VECTOR;
+ events.exception.has_error_code = true;
+ events.exception.error_code = SS_ERROR_CODE;
+ vcpu_events_set(vcpu, &events);
+}
+
+/*
+ * Verify KVM_{G,S}ET_EVENTS play nice with pending vs. injected exceptions
+ * when an exception is being queued for L2. Specifically, verify that KVM
+ * honors L1 exception intercept controls when a #SS is pending/injected,
+ * triggers a #GP on vectoring the #SS, morphs to #DF if #GP isn't intercepted
+ * by L1, and finally causes (nested) SHUTDOWN if #DF isn't intercepted by L1.
+ */
+int main(int argc, char *argv[])
+{
+ vm_vaddr_t nested_test_data_gva;
+ struct kvm_vcpu_events events;
+ struct kvm_vcpu *vcpu;
+ struct kvm_vm *vm;
+
+ TEST_REQUIRE(kvm_has_cap(KVM_CAP_EXCEPTION_PAYLOAD));
+ TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM) || kvm_cpu_has(X86_FEATURE_VMX));
+
+ vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
+ vm_enable_cap(vm, KVM_CAP_EXCEPTION_PAYLOAD, -2ul);
+
+ if (kvm_cpu_has(X86_FEATURE_SVM))
+ vcpu_alloc_svm(vm, &nested_test_data_gva);
+ else
+ vcpu_alloc_vmx(vm, &nested_test_data_gva);
+
+ vcpu_args_set(vcpu, 1, nested_test_data_gva);
+
+ /* Run L1 => L2. L2 should sync and request #SS. */
+ vcpu_run(vcpu);
+ assert_ucall_vector(vcpu, SS_VECTOR);
+
+ /* Pend #SS and request immediate exit. #SS should still be pending. */
+ queue_ss_exception(vcpu, false);
+ vcpu->run->immediate_exit = true;
+ vcpu_run_complete_io(vcpu);
+
+ /* Verify the pending event comes back out the same as it went in. */
+ vcpu_events_get(vcpu, &events);
+ ASSERT_EQ(events.flags & KVM_VCPUEVENT_VALID_PAYLOAD,
+ KVM_VCPUEVENT_VALID_PAYLOAD);
+ ASSERT_EQ(events.exception.pending, true);
+ ASSERT_EQ(events.exception.nr, SS_VECTOR);
+ ASSERT_EQ(events.exception.has_error_code, true);
+ ASSERT_EQ(events.exception.error_code, SS_ERROR_CODE);
+
+ /*
+ * Run for real with the pending #SS, L1 should get a VM-Exit due to
+ * #SS interception and re-enter L2 to request #GP (via injected #SS).
+ */
+ vcpu->run->immediate_exit = false;
+ vcpu_run(vcpu);
+ assert_ucall_vector(vcpu, GP_VECTOR);
+
+ /*
+ * Inject #SS, the #SS should bypass interception and cause #GP, which
+ * L1 should intercept before KVM morphs it to #DF. L1 should then
+ * disable #GP interception and run L2 to request #DF (via #SS => #GP).
+ */
+ queue_ss_exception(vcpu, true);
+ vcpu_run(vcpu);
+ assert_ucall_vector(vcpu, DF_VECTOR);
+
+ /*
+ * Inject #SS, the #SS should bypass interception and cause #GP, which
+ * L1 is no longer intercepting, and so should see a #DF VM-Exit. L1
+ * should then signal that it is done.
+ */
+ queue_ss_exception(vcpu, true);
+ vcpu_run(vcpu);
+ assert_ucall_vector(vcpu, FAKE_TRIPLE_FAULT_VECTOR);
+
+ /*
+ * Inject #SS yet again. L1 is not intercepting #GP or #DF, and so
+ * should see nested TRIPLE_FAULT / SHUTDOWN.
+ */
+ queue_ss_exception(vcpu, true);
+ vcpu_run(vcpu);
+ assert_ucall_vector(vcpu, -1);
+
+ kvm_vm_free(vm);
+}
--
2.37.0.170.g444d1eabd0-goog

2022-07-15 21:17:04

by Sean Christopherson

Subject: [PATCH v2 08/24] KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS

Fall through to handling other pending exceptions/events for L2 if SIPI
is pending while the CPU is not in Wait-for-SIPI. KVM correctly ignores
the event, but incorrectly returns immediately, e.g. a SIPI coincident
with another event could lead to KVM incorrectly routing the event to L1
instead of L2.

Fixes: bf0cd88ce363 ("KVM: x86: emulate wait-for-SIPI and SIPI-VMExit")
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index bc5759f82a3f..104f233ddd5d 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3925,10 +3925,12 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
return -EBUSY;

clear_bit(KVM_APIC_SIPI, &apic->pending_events);
- if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED)
+ if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
nested_vmx_vmexit(vcpu, EXIT_REASON_SIPI_SIGNAL, 0,
apic->sipi_vector & 0xFFUL);
- return 0;
+ return 0;
+ }
+ /* Fallthrough, the SIPI is completely ignored. */
}

/*
--
2.37.0.170.g444d1eabd0-goog

2022-07-15 21:22:48

by Sean Christopherson

Subject: [PATCH v2 14/24] KVM: x86: Use kvm_queue_exception_e() to queue #DF

Queue #DF by recursing on kvm_multiple_exception() by way of
kvm_queue_exception_e() instead of open coding the behavior. This will
allow KVM to Just Work when a future commit moves exception interception
checks (for L2 => L1) into kvm_multiple_exception().
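
For context, a toy standalone model (not the kernel code) of the
contributory-exception rule in kvm_multiple_exception() that the recursive
#DF queueing relies on:

#include <stdbool.h>
#include <stdio.h>

#define DE_VECTOR  0
#define DF_VECTOR  8
#define TS_VECTOR 10
#define NP_VECTOR 11
#define SS_VECTOR 12
#define GP_VECTOR 13
#define PF_VECTOR 14

enum excpt_class { EXCPT_BENIGN, EXCPT_CONTRIBUTORY, EXCPT_PF };

/* Toy version of the SDM's contributory-exception classification. */
static enum excpt_class exception_class(int vector)
{
	switch (vector) {
	case PF_VECTOR:
		return EXCPT_PF;
	case DE_VECTOR:
	case TS_VECTOR:
	case NP_VECTOR:
	case SS_VECTOR:
	case GP_VECTOR:
		return EXCPT_CONTRIBUTORY;
	default:
		return EXCPT_BENIGN;
	}
}

/* #DF is generated when a contributory exception hits while delivering a
 * contributory exception, or when a non-benign exception hits while
 * delivering a #PF. */
static bool escalates_to_df(int prev, int cur)
{
	enum excpt_class c1 = exception_class(prev), c2 = exception_class(cur);

	return (c1 == EXCPT_CONTRIBUTORY && c2 == EXCPT_CONTRIBUTORY) ||
	       (c1 == EXCPT_PF && c2 != EXCPT_BENIGN);
}

int main(void)
{
	printf("#SS then #GP -> #DF? %d\n", escalates_to_df(SS_VECTOR, GP_VECTOR)); /* 1 */
	printf("#PF then #PF -> #DF? %d\n", escalates_to_df(PF_VECTOR, PF_VECTOR)); /* 1 */
	printf("#PF then #DB -> #DF? %d\n", escalates_to_df(PF_VECTOR, 1));         /* 0 */
	return 0;
}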

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c4f758edf353..5aaa1d8de0b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -663,25 +663,22 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
}
class1 = exception_class(prev_nr);
class2 = exception_class(nr);
- if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
- || (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
+ if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY) ||
+ (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
/*
- * Generate double fault per SDM Table 5-5. Set
- * exception.pending = true so that the double fault
- * can trigger a nested vmexit.
+ * Synthesize #DF. Clear the previously injected or pending
+ * exception so as not to incorrectly trigger shutdown.
*/
- vcpu->arch.exception.pending = true;
vcpu->arch.exception.injected = false;
- vcpu->arch.exception.has_error_code = true;
- vcpu->arch.exception.vector = DF_VECTOR;
- vcpu->arch.exception.error_code = 0;
- vcpu->arch.exception.has_payload = false;
- vcpu->arch.exception.payload = 0;
- } else
+ vcpu->arch.exception.pending = false;
+
+ kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
+ } else {
/* replace previous exception with a new one in a hope
that instruction re-execution will regenerate lost
exception */
goto queue;
+ }
}

void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
--
2.37.0.170.g444d1eabd0-goog

2022-07-18 13:13:36

by Maxim Levitsky

Subject: Re: [PATCH v2 19/24] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time

On Fri, 2022-07-15 at 20:42 +0000, Sean Christopherson wrote:
> Morph pending exceptions to pending VM-Exits (due to interception) when
> the exception is queued instead of waiting until nested events are
> checked at VM-Entry.  This fixes a longstanding bug where KVM fails to
> handle an exception that occurs during delivery of a previous exception,
> KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow
> paging), and KVM determines that the exception is in the guest's domain,
> i.e. queues the new exception for L2.  Deferring the interception check
> causes KVM to escalate various combinations of injected+pending exceptions
> to double fault (#DF) without consulting L1's interception desires, and
> ends up injecting a spurious #DF into L2.
>
> KVM has fudged around the issue for #PF by special casing emulated #PF
> injection for shadow paging, but the underlying issue is not unique to
> shadow paging in L0, e.g. if KVM is intercepting #PF because the guest
> has a smaller maxphyaddr and L1 (but not L0) is using shadow paging.
> Other exceptions are affected as well, e.g. if KVM is intercepting #GP
> for one of SVM's workarounds or for the VMware backdoor emulation stuff.
> The other cases have gone unnoticed because the #DF is spurious if and
> only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1
> would have injected #DF anyways.
>
> The hack-a-fix has also led to ugly code, e.g. bailing from the emulator
> if #PF injection forced a nested VM-Exit and the emulator finds itself
> back in L1.  Allowing for direct-to-VM-Exit queueing also neatly solves
> the async #PF in L2 mess; no need to set a magic flag and token, simply
> queue a #PF nested VM-Exit.
>
> Deal with event migration by flagging that a pending exception was queued
> by userspace and check for interception at the next KVM_RUN, e.g. so that
> KVM does the right thing regardless of the order in which userspace
> restores nested state vs. event state.
>
> When "getting" events from userspace, simply drop any pending exception
> that is destined to be intercepted if there is also an injected exception
> to be migrated.  Ideally, KVM would migrate both events, but that would
> require new ABI, and practically speaking losing the event is unlikely to
> be noticed, let alone fatal.  The injected exception is captured, RIP
> still points at the original faulting instruction, etc...  So either the
> injection on the target will trigger the same intercepted exception, or
> the source of the intercepted exception was transient and/or
> non-deterministic, thus dropping it is ok-ish.
>
> Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0")
> Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2")
> Cc: Jim Mattson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
>  arch/x86/include/asm/kvm_host.h |  12 ++-
>  arch/x86/kvm/svm/nested.c       |  41 +++-----
>  arch/x86/kvm/vmx/nested.c       | 109 ++++++++++------------
>  arch/x86/kvm/vmx/vmx.c          |   6 +-
>  arch/x86/kvm/x86.c              | 159 ++++++++++++++++++++++----------
>  arch/x86/kvm/x86.h              |   7 ++
>  6 files changed, 187 insertions(+), 147 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0a6a05e25f24..6bcbffb42420 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -647,7 +647,6 @@ struct kvm_queued_exception {
>         u32 error_code;
>         unsigned long payload;
>         bool has_payload;
> -       u8 nested_apf;
>  };
>  
>  struct kvm_vcpu_arch {
> @@ -748,8 +747,12 @@ struct kvm_vcpu_arch {
>  
>         u8 event_exit_inst_len;
>  
> +       bool exception_from_userspace;
> +
>         /* Exceptions to be injected to the guest. */
>         struct kvm_queued_exception exception;
> +       /* Exception VM-Exits to be synthesized to L1. */
> +       struct kvm_queued_exception exception_vmexit;
>  
>         struct kvm_queued_interrupt {
>                 bool injected;
> @@ -860,7 +863,6 @@ struct kvm_vcpu_arch {
>                 u32 id;
>                 bool send_user_only;
>                 u32 host_apf_flags;
> -               unsigned long nested_apf_token;
>                 bool delivery_as_pf_vmexit;
>                 bool pageready_pending;
>         } apf;
> @@ -1636,9 +1638,9 @@ struct kvm_x86_ops {
>  
>  struct kvm_x86_nested_ops {
>         void (*leave_nested)(struct kvm_vcpu *vcpu);
> +       bool (*is_exception_vmexit)(struct kvm_vcpu *vcpu, u8 vector,
> +                                   u32 error_code);
>         int (*check_events)(struct kvm_vcpu *vcpu);
> -       bool (*handle_page_fault_workaround)(struct kvm_vcpu *vcpu,
> -                                            struct x86_exception *fault);
>         bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
>         void (*triple_fault)(struct kvm_vcpu *vcpu);
>         int (*get_state)(struct kvm_vcpu *vcpu,
> @@ -1865,7 +1867,7 @@ void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long pay
>  void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
>  void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
>  void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
> -bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> +void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
>                                     struct x86_exception *fault);
>  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
>  bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr);
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index f5676c2679d0..0a8ee5f28319 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -55,28 +55,6 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
>         nested_svm_vmexit(svm);
>  }
>  
> -static bool nested_svm_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
> -                                                   struct x86_exception *fault)
> -{
> -       struct vcpu_svm *svm = to_svm(vcpu);
> -       struct vmcb *vmcb = svm->vmcb;
> -
> -       WARN_ON(!is_guest_mode(vcpu));
> -
> -       if (vmcb12_is_intercept(&svm->nested.ctl,
> -                               INTERCEPT_EXCEPTION_OFFSET + PF_VECTOR) &&
> -           !WARN_ON_ONCE(svm->nested.nested_run_pending)) {
> -               vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
> -               vmcb->control.exit_code_hi = 0;
> -               vmcb->control.exit_info_1 = fault->error_code;
> -               vmcb->control.exit_info_2 = fault->address;
> -               nested_svm_vmexit(svm);
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
>  static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
> @@ -1302,16 +1280,17 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
>         return 0;
>  }
>  
> -static bool nested_exit_on_exception(struct vcpu_svm *svm)
> +static bool nested_svm_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
> +                                          u32 error_code)
>  {
> -       unsigned int vector = svm->vcpu.arch.exception.vector;
> +       struct vcpu_svm *svm = to_svm(vcpu);
>  
>         return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(vector));
>  }
>  
>  static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
>  {
> -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
>         struct vcpu_svm *svm = to_svm(vcpu);
>         struct vmcb *vmcb = svm->vmcb;
>  
> @@ -1379,15 +1358,19 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
>                 return 0;
>         }
>  
> -       if (vcpu->arch.exception.pending) {
> +       if (vcpu->arch.exception_vmexit.pending) {
>                 if (block_nested_exceptions)
>                          return -EBUSY;
> -               if (!nested_exit_on_exception(svm))
> -                       return 0;
>                 nested_svm_inject_exception_vmexit(vcpu);
>                 return 0;
>         }
>  
> +       if (vcpu->arch.exception.pending) {
> +               if (block_nested_exceptions)
> +                       return -EBUSY;
> +               return 0;
> +       }
> +
>         if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
>                 if (block_nested_events)
>                         return -EBUSY;
> @@ -1725,8 +1708,8 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
>  
>  struct kvm_x86_nested_ops svm_nested_ops = {
>         .leave_nested = svm_leave_nested,
> +       .is_exception_vmexit = nested_svm_is_exception_vmexit,
>         .check_events = svm_check_nested_events,
> -       .handle_page_fault_workaround = nested_svm_handle_page_fault_workaround,
>         .triple_fault = nested_svm_triple_fault,
>         .get_nested_state_pages = svm_get_nested_state_pages,
>         .get_state = svm_get_nested_state,
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 981f98ef96f1..5a6ba62dcd49 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -439,59 +439,22 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
>         return inequality ^ bit;
>  }
>  
> -
> -/*
> - * KVM wants to inject page-faults which it got to the guest. This function
> - * checks whether in a nested guest, we need to inject them to L1 or L2.
> - */
> -static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
> -{
> -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> -       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -
> -       if (ex->vector == PF_VECTOR) {
> -               if (ex->nested_apf) {
> -                       *exit_qual = vcpu->arch.apf.nested_apf_token;
> -                       return 1;
> -               }
> -               if (nested_vmx_is_page_fault_vmexit(vmcs12, ex->error_code)) {
> -                       *exit_qual = ex->has_payload ? ex->payload : vcpu->arch.cr2;
> -                       return 1;
> -               }
> -       } else if (vmcs12->exception_bitmap & (1u << ex->vector)) {
> -               if (ex->vector == DB_VECTOR) {
> -                       if (ex->has_payload) {
> -                               *exit_qual = ex->payload;
> -                       } else {
> -                               *exit_qual = vcpu->arch.dr6;
> -                               *exit_qual &= ~DR6_BT;
> -                               *exit_qual ^= DR6_ACTIVE_LOW;
> -                       }
> -               } else
> -                       *exit_qual = 0;
> -               return 1;
> -       }
> -
> -       return 0;
> -}
> -
> -static bool nested_vmx_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
> -                                                   struct x86_exception *fault)
> +static bool nested_vmx_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
> +                                          u32 error_code)
>  {
>         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>  
> -       WARN_ON(!is_guest_mode(vcpu));
> +       /*
> +        * Drop bits 31:16 of the error code when performing the #PF mask+match
> +        * check.  All VMCS fields involved are 32 bits, but Intel CPUs never
> +        * set bits 31:16 and VMX disallows setting bits 31:16 in the injected
> +        * error code.  Including the to-be-dropped bits in the check might
> +        * result in an "impossible" or missed exit from L1's perspective.
> +        */
> +       if (vector == PF_VECTOR)
> +               return nested_vmx_is_page_fault_vmexit(vmcs12, (u16)error_code);
>  
> -       if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
> -           !WARN_ON_ONCE(to_vmx(vcpu)->nested.nested_run_pending)) {
> -               vmcs12->vm_exit_intr_error_code = fault->error_code;
> -               nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
> -                                 PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
> -                                 INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
> -                                 fault->address);
> -               return true;
> -       }
> -       return false;
> +       return (vmcs12->exception_bitmap & (1u << vector));
>  }
>  
>  static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
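
(Not part of the patch; purely for illustration.) To make the PFEC mask/match
semantics that nested_vmx_is_page_fault_vmexit() evaluates concrete, here is a
minimal, self-contained sketch. The vmcs12 field names are real, but the
struct, helper name and example values below are made up for this sketch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PF_VECTOR 14

/* Illustrative stand-in for the relevant vmcs12 fields. */
struct pf_intercept_ctl {
	uint32_t exception_bitmap;
	uint32_t page_fault_error_code_mask;
	uint32_t page_fault_error_code_match;
};

/* #PF VM-Exit to L1 iff (error code matches) == (PF bit set in the bitmap). */
static bool is_page_fault_vmexit(const struct pf_intercept_ctl *ctl,
				 uint16_t error_code)
{
	bool bit = ctl->exception_bitmap & (1u << PF_VECTOR);
	bool mismatch = (error_code & ctl->page_fault_error_code_mask) !=
			ctl->page_fault_error_code_match;

	return mismatch ^ bit;
}

int main(void)
{
	/* L1 intercepts #PF, but only for not-present faults (P bit clear). */
	struct pf_intercept_ctl ctl = {
		.exception_bitmap = 1u << PF_VECTOR,
		.page_fault_error_code_mask  = 0x1,
		.page_fault_error_code_match = 0x0,
	};

	printf("not-present write (ec=0x2): %d\n", is_page_fault_vmexit(&ctl, 0x2)); /* 1: exit to L1 */
	printf("protection  write (ec=0x3): %d\n", is_page_fault_vmexit(&ctl, 0x3)); /* 0: stays in L2 */
	return 0;
}

Note that the (u16) truncation in the caller above keeps bits 31:16 of the
error code out of this comparison, per the comment in the hunk.
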
> @@ -3812,12 +3775,24 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>         return -ENXIO;
>  }
>  
> -static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
> -                                              unsigned long exit_qual)
> +static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu)
>  {
> -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
>         u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
>         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +       unsigned long exit_qual;
> +
> +       if (ex->has_payload) {
> +               exit_qual = ex->payload;
> +       } else if (ex->vector == PF_VECTOR) {
> +               exit_qual = vcpu->arch.cr2;
> +       } else if (ex->vector == DB_VECTOR) {
> +               exit_qual = vcpu->arch.dr6;
> +               exit_qual &= ~DR6_BT;
> +               exit_qual ^= DR6_ACTIVE_LOW;
> +       } else {
> +               exit_qual = 0;
> +       }
>  
>         if (ex->has_error_code) {
>                 /*
> @@ -3988,7 +3963,6 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  {
>         struct kvm_lapic *apic = vcpu->arch.apic;
>         struct vcpu_vmx *vmx = to_vmx(vcpu);
> -       unsigned long exit_qual;
>         /*
>          * Only a pending nested run blocks a pending exception.  If there is a
>          * previously injected event, the pending exception occurred while said
> @@ -4042,14 +4016,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>          * across SMI/RSM as it should; that needs to be addressed in order to
>          * prioritize SMI over MTF and trap-like #DBs.
>          */
> +       if (vcpu->arch.exception_vmexit.pending &&
> +           !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
> +               if (block_nested_exceptions)
> +                       return -EBUSY;
> +
> +               nested_vmx_inject_exception_vmexit(vcpu);
> +               return 0;
> +       }
> +
>         if (vcpu->arch.exception.pending &&
>             !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
>                 if (block_nested_exceptions)
>                         return -EBUSY;
> -               if (!nested_vmx_check_exception(vcpu, &exit_qual))
> -                       goto no_vmexit;
> -               nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
> -               return 0;
> +               goto no_vmexit;
>         }
>  
>         if (vmx->nested.mtf_pending) {
> @@ -4060,13 +4040,18 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>                 return 0;
>         }
>  
> +       if (vcpu->arch.exception_vmexit.pending) {
> +               if (block_nested_exceptions)
> +                       return -EBUSY;
> +
> +               nested_vmx_inject_exception_vmexit(vcpu);
> +               return 0;
> +       }
> +
>         if (vcpu->arch.exception.pending) {
>                 if (block_nested_exceptions)
>                         return -EBUSY;
> -               if (!nested_vmx_check_exception(vcpu, &exit_qual))
> -                       goto no_vmexit;
> -               nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
> -               return 0;
> +               goto no_vmexit;
>         }
>  
>         if (nested_vmx_preemption_timer_pending(vcpu)) {
> @@ -6952,8 +6937,8 @@ __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *))
>  
>  struct kvm_x86_nested_ops vmx_nested_ops = {
>         .leave_nested = vmx_leave_nested,
> +       .is_exception_vmexit = nested_vmx_is_exception_vmexit,
>         .check_events = vmx_check_nested_events,
> -       .handle_page_fault_workaround = nested_vmx_handle_page_fault_workaround,
>         .hv_timer_pending = nested_vmx_preemption_timer_pending,
>         .triple_fault = nested_vmx_triple_fault,
>         .get_state = vmx_get_nested_state,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7d3abe2a206a..5302b046110f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1585,7 +1585,9 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
>          */
>         if (nested_cpu_has_mtf(vmcs12) &&
>             (!vcpu->arch.exception.pending ||
> -            vcpu->arch.exception.vector == DB_VECTOR))
> +            vcpu->arch.exception.vector == DB_VECTOR) &&
> +           (!vcpu->arch.exception_vmexit.pending ||
> +            vcpu->arch.exception_vmexit.vector == DB_VECTOR))
>                 vmx->nested.mtf_pending = true;
>         else
>                 vmx->nested.mtf_pending = false;
> @@ -5637,7 +5639,7 @@ static bool vmx_emulation_required_with_pending_exception(struct kvm_vcpu *vcpu)
>         struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
>         return vmx->emulation_required && !vmx->rmode.vm86_active &&
> -              (vcpu->arch.exception.pending || vcpu->arch.exception.injected);
> +              (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected);
>  }
>  
>  static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 795c799fc767..9be2fdf834ad 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -609,6 +609,21 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
>  }
>  EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);
>  
> +static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vector,
> +                                      bool has_error_code, u32 error_code,
> +                                      bool has_payload, unsigned long payload)
> +{
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
> +
> +       ex->vector = vector;
> +       ex->injected = false;
> +       ex->pending = true;
> +       ex->has_error_code = has_error_code;
> +       ex->error_code = error_code;
> +       ex->has_payload = has_payload;
> +       ex->payload = payload;
> +}
> +
>  static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>                 unsigned nr, bool has_error, u32 error_code,
>                 bool has_payload, unsigned long payload, bool reinject)
> @@ -618,18 +633,31 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>  
>         kvm_make_request(KVM_REQ_EVENT, vcpu);
>  
> +       /*
> +        * If the exception is destined for L2 and isn't being reinjected,
> +        * morph it to a VM-Exit if L1 wants to intercept the exception.  A
> +        * previously injected exception is not checked because it was checked
>  +        * when it was originally queued, and re-checking is incorrect if _L1_
> +        * injected the exception, in which case it's exempt from interception.
> +        */
> +       if (!reinject && is_guest_mode(vcpu) &&
> +           kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
> +               kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
> +                                          has_payload, payload);
> +               return;
> +       }
> +
>         if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
>         queue:
>                 if (reinject) {
>                         /*
> -                        * On vmentry, vcpu->arch.exception.pending is only
> -                        * true if an event injection was blocked by
> -                        * nested_run_pending.  In that case, however,
> -                        * vcpu_enter_guest requests an immediate exit,
> -                        * and the guest shouldn't proceed far enough to
> -                        * need reinjection.
> +                        * On VM-Entry, an exception can be pending if and only
> +                        * if event injection was blocked by nested_run_pending.
> +                        * In that case, however, vcpu_enter_guest() requests an
> +                        * immediate exit, and the guest shouldn't proceed far
> +                        * enough to need reinjection.
>                          */
> -                       WARN_ON_ONCE(vcpu->arch.exception.pending);
> +                       WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
>                         vcpu->arch.exception.injected = true;
>                         if (WARN_ON_ONCE(has_payload)) {
>                                 /*
> @@ -732,20 +760,22 @@ static int complete_emulated_insn_gp(struct kvm_vcpu *vcpu, int err)
>  void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
>  {
>         ++vcpu->stat.pf_guest;
> -       vcpu->arch.exception.nested_apf =
> -               is_guest_mode(vcpu) && fault->async_page_fault;
> -       if (vcpu->arch.exception.nested_apf) {
> -               vcpu->arch.apf.nested_apf_token = fault->address;
> -               kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
> -       } else {
> +
> +       /*
> +        * Async #PF in L2 is always forwarded to L1 as a VM-Exit regardless of
> +        * whether or not L1 wants to intercept "regular" #PF.
> +        */
> +       if (is_guest_mode(vcpu) && fault->async_page_fault)
> +               kvm_queue_exception_vmexit(vcpu, PF_VECTOR,
> +                                          true, fault->error_code,
> +                                          true, fault->address);
> +       else
>                 kvm_queue_exception_e_p(vcpu, PF_VECTOR, fault->error_code,
>                                         fault->address);
> -       }
>  }
>  EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
>  
> -/* Returns true if the page fault was immediately morphed into a VM-Exit. */
> -bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> +void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
>                                     struct x86_exception *fault)
>  {
>         struct kvm_mmu *fault_mmu;
> @@ -763,26 +793,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
>                 kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
>                                        fault_mmu->root.hpa);
>  
> -       /*
> -        * A workaround for KVM's bad exception handling.  If KVM injected an
> -        * exception into L2, and L2 encountered a #PF while vectoring the
> -        * injected exception, manually check to see if L1 wants to intercept
> -        * #PF, otherwise queuing the #PF will lead to #DF or a lost exception.
> -        * In all other cases, defer the check to nested_ops->check_events(),
> -        * which will correctly handle priority (this does not).  Note, other
> -        * exceptions, e.g. #GP, are theoretically affected, #PF is simply the
> -        * most problematic, e.g. when L0 and L1 are both intercepting #PF for
> -        * shadow paging.
> -        *
> -        * TODO: Rewrite exception handling to track injected and pending
> -        *       (VM-Exit) exceptions separately.
> -        */
> -       if (unlikely(vcpu->arch.exception.injected && is_guest_mode(vcpu)) &&
> -           kvm_x86_ops.nested_ops->handle_page_fault_workaround(vcpu, fault))
> -               return true;
> -
>         fault_mmu->inject_page_fault(vcpu, fault);
> -       return false;
>  }
>  EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);
>  
> @@ -4820,7 +4831,7 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
>         return (kvm_arch_interrupt_allowed(vcpu) &&
>                 kvm_cpu_accept_dm_intr(vcpu) &&
>                 !kvm_event_needs_reinjection(vcpu) &&
> -               !vcpu->arch.exception.pending);
> +               !kvm_is_exception_pending(vcpu));
>  }
>  
>  static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
> @@ -4995,13 +5006,27 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
>  static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>                                                struct kvm_vcpu_events *events)
>  {
> -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +       struct kvm_queued_exception *ex;
>  
>         process_nmi(vcpu);
>  
>         if (kvm_check_request(KVM_REQ_SMI, vcpu))
>                 process_smi(vcpu);
>  
> +       /*
> +        * KVM's ABI only allows for one exception to be migrated.  Luckily,
> +        * the only time there can be two queued exceptions is if there's a
> +        * non-exiting _injected_ exception, and a pending exiting exception.
> +        * In that case, ignore the VM-Exiting exception as it's an extension
> +        * of the injected exception.
> +        */
> +       if (vcpu->arch.exception_vmexit.pending &&
> +           !vcpu->arch.exception.pending &&
> +           !vcpu->arch.exception.injected)
> +               ex = &vcpu->arch.exception_vmexit;
> +       else
> +               ex = &vcpu->arch.exception;
> +
>         /*
>          * In guest mode, payload delivery should be deferred if the exception
>          * will be intercepted by L1, e.g. KVM should not modify CR2 if L1
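
(Illustrative only, not part of the patch.) To make the single-exception ABI
described in the hunk above concrete, a minimal userspace sketch of what a
migrating VMM observes via KVM_GET_VCPU_EVENTS; the helper name is invented
and vcpu_fd is assumed to be an already-open KVM vCPU file descriptor:

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

static int dump_exception_state(int vcpu_fd)
{
	struct kvm_vcpu_events events;

	if (ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events))
		return -1;

	/*
	 * Only one exception is reported.  If an exception was already
	 * injected into L2 and a second, intercepted exception is pending
	 * as a VM-Exit to L1, only the injected exception shows up here;
	 * the pending VM-Exit is dropped and is expected to re-occur when
	 * the injected exception is re-delivered on the destination.
	 */
	printf("exception: nr %d, injected %d, pending %d, error_code %#x\n",
	       events.exception.nr, events.exception.injected,
	       events.exception.pending, events.exception.error_code);
	return 0;
}
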
> @@ -5108,6 +5133,19 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>                 return -EINVAL;
>  
>         process_nmi(vcpu);
> +
> +       /*
> +        * Flag that userspace is stuffing an exception, the next KVM_RUN will
> +        * morph the exception to a VM-Exit if appropriate.  Do this only for
>  +        * pending exceptions; already-injected exceptions are not subject to
>  +        * interception.  Note, userspace that conflates pending and injected
> +        * is hosed, and will incorrectly convert an injected exception into a
> +        * pending exception, which in turn may cause a spurious VM-Exit.
> +        */
> +       vcpu->arch.exception_from_userspace = events->exception.pending;
> +
> +       vcpu->arch.exception_vmexit.pending = false;
> +
>         vcpu->arch.exception.injected = events->exception.injected;
>         vcpu->arch.exception.pending = events->exception.pending;
>         vcpu->arch.exception.vector = events->exception.nr;
> @@ -8130,18 +8168,17 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
>         }
>  }
>  
> -static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
> +static void inject_emulated_exception(struct kvm_vcpu *vcpu)
>  {
>         struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> +
>         if (ctxt->exception.vector == PF_VECTOR)
> -               return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
> -
> -       if (ctxt->exception.error_code_valid)
> +               kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
> +       else if (ctxt->exception.error_code_valid)
>                 kvm_queue_exception_e(vcpu, ctxt->exception.vector,
>                                       ctxt->exception.error_code);
>         else
>                 kvm_queue_exception(vcpu, ctxt->exception.vector);
> -       return false;
>  }
>  
>  static struct x86_emulate_ctxt *alloc_emulate_ctxt(struct kvm_vcpu *vcpu)
> @@ -8754,8 +8791,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  
>         if (ctxt->have_exception) {
>                 r = 1;
> -               if (inject_emulated_exception(vcpu))
> -                       return r;
> +               inject_emulated_exception(vcpu);
>         } else if (vcpu->arch.pio.count) {
>                 if (!vcpu->arch.pio.in) {
>                         /* FIXME: return into emulator if single-stepping.  */
> @@ -9695,7 +9731,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>          */
>         if (vcpu->arch.exception.injected)
>                 kvm_inject_exception(vcpu);
> -       else if (vcpu->arch.exception.pending)
> +       else if (kvm_is_exception_pending(vcpu))
>                 ; /* see above */
>         else if (vcpu->arch.nmi_injected)
>                 static_call(kvm_x86_inject_nmi)(vcpu);
> @@ -9722,6 +9758,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>         if (r < 0)
>                 goto out;
>  
> +       /*
> +        * A pending exception VM-Exit should either result in nested VM-Exit
> +        * or force an immediate re-entry and exit to/from L2, and exception
> +        * VM-Exits cannot be injected (flag should _never_ be set).
> +        */
> +       WARN_ON_ONCE(vcpu->arch.exception_vmexit.injected ||
> +                    vcpu->arch.exception_vmexit.pending);
> +
>         /*
>          * New events, other than exceptions, cannot be injected if KVM needs
>          * to re-inject a previous event.  See above comments on re-injecting
> @@ -9821,7 +9865,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>             kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
>                 *req_immediate_exit = true;
>  
> -       WARN_ON(vcpu->arch.exception.pending);
> +       WARN_ON(kvm_is_exception_pending(vcpu));
>         return 0;
>  
>  out:
> @@ -10839,6 +10883,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
>  
>  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  {
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
>         struct kvm_run *kvm_run = vcpu->run;
>         int r;
>  
> @@ -10897,6 +10942,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>                 }
>         }
>  
> +       /*
> +        * If userspace set a pending exception and L2 is active, convert it to
> +        * a pending VM-Exit if L1 wants to intercept the exception.
> +        */
> +       if (vcpu->arch.exception_from_userspace && is_guest_mode(vcpu) &&
> +           kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, ex->vector,
> +                                                       ex->error_code)) {
> +               kvm_queue_exception_vmexit(vcpu, ex->vector,
> +                                          ex->has_error_code, ex->error_code,
> +                                          ex->has_payload, ex->payload);
> +               ex->injected = false;
> +               ex->pending = false;
> +       }
> +       vcpu->arch.exception_from_userspace = false;
> +
>         if (unlikely(vcpu->arch.complete_userspace_io)) {
>                 int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
>                 vcpu->arch.complete_userspace_io = NULL;
> @@ -11003,6 +11063,7 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>         kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);
>  
>         vcpu->arch.exception.pending = false;
> +       vcpu->arch.exception_vmexit.pending = false;
>  
>         kvm_make_request(KVM_REQ_EVENT, vcpu);
>  }
> @@ -11370,7 +11431,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
>  
>         if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
>                 r = -EBUSY;
> -               if (vcpu->arch.exception.pending)
> +               if (kvm_is_exception_pending(vcpu))
>                         goto out;
>                 if (dbg->control & KVM_GUESTDBG_INJECT_DB)
>                         kvm_queue_exception(vcpu, DB_VECTOR);
> @@ -12554,7 +12615,7 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>         if (vcpu->arch.pv.pv_unhalted)
>                 return true;
>  
> -       if (vcpu->arch.exception.pending)
> +       if (kvm_is_exception_pending(vcpu))
>                 return true;
>  
>         if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
> @@ -12809,7 +12870,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  {
>         if (unlikely(!lapic_in_kernel(vcpu) ||
>                      kvm_event_needs_reinjection(vcpu) ||
> -                    vcpu->arch.exception.pending))
> +                    kvm_is_exception_pending(vcpu)))
>                 return false;
>  
>         if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index dc2af0146220..eee259e387d3 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -82,10 +82,17 @@ static inline unsigned int __shrink_ple_window(unsigned int val,
>  void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu);
>  int kvm_check_nested_events(struct kvm_vcpu *vcpu);
>  
> +static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
> +{
> +       return vcpu->arch.exception.pending ||
> +              vcpu->arch.exception_vmexit.pending;
> +}
> +
>  static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
>  {
>         vcpu->arch.exception.pending = false;
>         vcpu->arch.exception.injected = false;
> +       vcpu->arch.exception_vmexit.pending = false;
>  }
>  
>  static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,


Reviewed-by: Maxim Levitsky <[email protected]>

The patch is large; I could have missed something, though.

Best regards,
Maxim Levitsky