INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
Not tainted 4.12.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gnome-terminal- D 0 1734 1015 0x00000000
Call Trace:
__schedule+0x3cd/0xb30
schedule+0x40/0x90
kvm_async_pf_task_wait+0x1cc/0x270
? __vfs_read+0x37/0x150
? prepare_to_swait+0x22/0x70
do_async_page_fault+0x77/0xb0
? do_async_page_fault+0x77/0xb0
async_page_fault+0x28/0x30
This is triggered by running both win7 and win2016 guests on L1 KVM
simultaneously and then putting memory under stress on L1. I can observe this
hang on L1 when at least ~70% of the swap area is occupied on L0.
The cause is that an async pf which should be injected into L1 is injected
into L2 instead. The L2 guest starts receiving page faults with a bogus %cr2
(actually the apf token from the host), and the L1 guest starts accumulating
tasks stuck in D state in kvm_async_pf_task_wait() because the matching
PAGE_READY async_pfs never arrive.
This patchset fixes it according to Radim's proposal "force a nested VM exit
from nested_vmx_check_exception if the injected #PF is async_pf and handle
the #PF VM exit in L1". https://www.spinics.net/lists/kvm/msg142498.html
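The routing rule in the proposal can be illustrated with a small standalone
model. This is only a sketch, not kernel code: the struct, enum and function
names are hypothetical, and the real logic lives in
nested_vmx_check_exception().

```c
#include <stdbool.h>
#include <stdint.h>

#define PF_VECTOR 14

/* Hypothetical, simplified stand-in for the vCPU state used by the series. */
struct route_vcpu {
	bool guest_mode;           /* true while L2 is running */
	uint32_t exception_bitmap; /* exception bitmap requested by L1 */
	uint8_t nr;                /* vector of the pending exception */
	bool nested_apf;           /* pending #PF is an async page fault */
};

enum route_target { ROUTE_TO_CURRENT_GUEST, ROUTE_VMEXIT_TO_L1 };

/* Decide whether a pending exception must be reflected to L1. */
static enum route_target route_exception(const struct route_vcpu *v)
{
	if (!v->guest_mode)
		return ROUTE_TO_CURRENT_GUEST;
	/* An async #PF always goes to L1, whatever L1's bitmap says. */
	if (v->nr == PF_VECTOR && v->nested_apf)
		return ROUTE_VMEXIT_TO_L1;
	/* Otherwise honour the exception bitmap that L1 configured. */
	if (v->nr < 32 && (v->exception_bitmap & (1u << v->nr)))
		return ROUTE_VMEXIT_TO_L1;
	return ROUTE_TO_CURRENT_GUEST;
}
```

With an empty exception bitmap, an ordinary #PF stays in L2 while an async #PF
still forces the exit to L1, which is the behaviour the series aims for.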
Note: The patchset barely touches SVM since I don't have an AMD CPU to verify
the modifications.
v5 -> v6:
* move vcpu_svm's apf_reason to vcpu->arch.apf.host_apf_reason
* introduce function kvm_handle_page_fault() to be used by both VMX/SVM
* include the SVM code posted by Paolo
* introduce nested_apf
* better set MSR_KVM_ASYNC_PF_EN
v4 -> v5:
* utilize wrmsr_safe for MSR_KVM_ASYNC_PF_EN
v3 -> v4:
* reuse pad field in kvm_vcpu_events for async_page_fault
* update kvm_vcpu_events API documentation
* change async_page_fault type in vcpu->arch.exception from bool to u8
v2 -> v3:
* add the flag to the userspace interface(KVM_GET/PUT_VCPU_EVENTS)
v1 -> v2:
* remove nested_vmx_check_exception nr parameter
* construct a simple special vm-exit information field for async pf
* introduce nested_apf_token to vcpu->arch.apf to avoid changing the CR2
visible in the L2 guest
* avoid passing an apf directed towards L1 into L2 if there is an L3
at the moment
Wanpeng Li (4):
KVM: x86: Simplify kvm_x86_ops->queue_exception parameter
KVM: async_pf: Add L1 guest async_pf #PF vmexit handler
KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf
KVM: async_pf: Let host know whether the guest supports delivery of async_pf as #PF vmexit
Documentation/virtual/kvm/api.txt | 8 +++--
Documentation/virtual/kvm/msr.txt | 5 +--
arch/x86/include/asm/kvm_emulate.h | 1 +
arch/x86/include/asm/kvm_host.h | 8 +++--
arch/x86/include/uapi/asm/kvm.h | 3 +-
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 7 ++++-
arch/x86/kvm/mmu.c | 35 ++++++++++++++++++++-
arch/x86/kvm/mmu.h | 2 ++
arch/x86/kvm/svm.c | 58 ++++++++++++-----------------------
arch/x86/kvm/vmx.c | 39 ++++++++++++++---------
arch/x86/kvm/x86.c | 29 ++++++++++++------
tools/arch/x86/include/uapi/asm/kvm.h | 3 +-
13 files changed, 125 insertions(+), 74 deletions(-)
--
2.7.4
From: Wanpeng Li <[email protected]>
This patch removes all arguments except the first from
kvm_x86_ops->queue_exception, since the implementations can extract the
remaining values from vcpu->arch.exception themselves. Do the same in
nested_{vmx,svm}_check_exception.
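The shape of this refactoring can be sketched outside the kernel as follows.
All qe_* names are hypothetical and only mimic the pattern: the callback takes
the vcpu alone and reads the pending exception from the per-vcpu state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the fields kept in vcpu->arch.exception. */
struct qe_exception {
	uint8_t nr;
	bool has_error_code;
	bool reinject;
	uint32_t error_code;
};

struct qe_vcpu {
	struct qe_exception exception;
	/* What the model backend "programmed" on the last call. */
	bool queued;
	unsigned int last_vector;
	uint32_t last_error_code;
};

struct qe_ops {
	/* After the change: one argument, everything else comes from vcpu. */
	void (*queue_exception)(struct qe_vcpu *vcpu);
};

static void qe_backend_queue_exception(struct qe_vcpu *vcpu)
{
	/* The backend extracts the values it needs by itself. */
	unsigned int nr = vcpu->exception.nr;
	bool has_error_code = vcpu->exception.has_error_code;

	vcpu->last_vector = nr;
	vcpu->last_error_code = has_error_code ? vcpu->exception.error_code : 0;
	vcpu->queued = true;
}

static const struct qe_ops qe_backend_ops = {
	.queue_exception = qe_backend_queue_exception,
};

/* The caller no longer unpacks vcpu->exception on the backend's behalf. */
static void qe_inject_pending(struct qe_vcpu *vcpu)
{
	qe_backend_ops.queue_exception(vcpu);
}
```

Keeping a single source of truth in the vcpu also means later patches can add
fields to the exception state without touching the callback signature again.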
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +---
arch/x86/kvm/svm.c | 8 +++++---
arch/x86/kvm/vmx.c | 8 +++++---
arch/x86/kvm/x86.c | 5 +----
4 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 695605e..1f01bfb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -948,9 +948,7 @@ struct kvm_x86_ops {
unsigned char *hypercall_addr);
void (*set_irq)(struct kvm_vcpu *vcpu);
void (*set_nmi)(struct kvm_vcpu *vcpu);
- void (*queue_exception)(struct kvm_vcpu *vcpu, unsigned nr,
- bool has_error_code, u32 error_code,
- bool reinject);
+ void (*queue_exception)(struct kvm_vcpu *vcpu);
void (*cancel_injection)(struct kvm_vcpu *vcpu);
int (*interrupt_allowed)(struct kvm_vcpu *vcpu);
int (*nmi_allowed)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ba9891a..e1f8e89 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -631,11 +631,13 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
svm_set_interrupt_shadow(vcpu, 0);
}
-static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
- bool has_error_code, u32 error_code,
- bool reinject)
+static void svm_queue_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned nr = vcpu->arch.exception.nr;
+ bool has_error_code = vcpu->arch.exception.has_error_code;
+ bool reinject = vcpu->arch.exception.reinject;
+ u32 error_code = vcpu->arch.exception.error_code;
/*
* If we are within a nested VM we'd better #VMEXIT and let the guest
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b9..df825bb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2431,11 +2431,13 @@ static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
return 1;
}
-static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
- bool has_error_code, u32 error_code,
- bool reinject)
+static void vmx_queue_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ unsigned nr = vcpu->arch.exception.nr;
+ bool has_error_code = vcpu->arch.exception.has_error_code;
+ bool reinject = vcpu->arch.exception.reinject;
+ u32 error_code = vcpu->arch.exception.error_code;
u32 intr_info = nr | INTR_INFO_VALID_MASK;
if (!reinject && is_guest_mode(vcpu) &&
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0e846f0..7511c0a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6347,10 +6347,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
kvm_update_dr7(vcpu);
}
- kvm_x86_ops->queue_exception(vcpu, vcpu->arch.exception.nr,
- vcpu->arch.exception.has_error_code,
- vcpu->arch.exception.error_code,
- vcpu->arch.exception.reinject);
+ kvm_x86_ops->queue_exception(vcpu);
return 0;
}
--
2.7.4
From: Wanpeng Li <[email protected]>
This patch adds the L1 guest async page fault #PF vmexit handler. Such a
#PF is converted into a vmexit from L2 to L1, which L1 then handles in the
same way as an ordinary async page fault.
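The dispatch done by the shared handler can be modelled as below. This is a
sketch only: the apf_* names are hypothetical, the real handler calls
kvm_async_pf_task_wait()/kvm_async_pf_task_wake() with interrupts disabled,
and the two reason values are assumed to match the definitions in the uapi
kvm_para.h header.

```c
#include <stdint.h>

/* Assumed to match the uapi kvm_para.h values. */
#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
#define KVM_PV_REASON_PAGE_READY       2

enum apf_action {
	APF_HANDLE_AS_REGULAR_FAULT, /* go through the normal MMU fault path */
	APF_WAIT_FOR_PAGE,           /* put the faulting task to sleep */
	APF_WAKE_WAITER,             /* wake the task waiting on this token */
};

struct apf_state {
	uint32_t host_apf_reason; /* latched right after the vmexit */
};

/* Classify a #PF vmexit and consume the latched reason, as the patch does. */
static enum apf_action apf_classify(struct apf_state *s)
{
	switch (s->host_apf_reason) {
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		s->host_apf_reason = 0;
		return APF_WAIT_FOR_PAGE;
	case KVM_PV_REASON_PAGE_READY:
		s->host_apf_reason = 0;
		return APF_WAKE_WAITER;
	default:
		return APF_HANDLE_AS_REGULAR_FAULT;
	}
}
```

Clearing the reason once it has been consumed keeps a later, unrelated #PF
from being misread as an async page fault.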
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.c | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu.h | 2 ++
arch/x86/kvm/svm.c | 36 +++++-------------------------------
arch/x86/kvm/vmx.c | 12 +++++-------
5 files changed, 46 insertions(+), 38 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1f01bfb..e20d8a8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -645,6 +645,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u32 id;
bool send_user_only;
+ u32 host_apf_reason;
} apf;
/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cb82259..4a7dc00 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -46,6 +46,7 @@
#include <asm/io.h>
#include <asm/vmx.h>
#include <asm/kvm_page_track.h>
+#include "trace.h"
/*
* When setting this variable to true it enables Two-Dimensional-Paging
@@ -3736,6 +3737,38 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
return false;
}
+int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
+ u64 fault_address)
+{
+ int r = 1;
+
+ switch (vcpu->arch.apf.host_apf_reason) {
+ default:
+ /* TDP won't cause page fault directly */
+ WARN_ON_ONCE(tdp_enabled);
+ trace_kvm_page_fault(fault_address, error_code);
+
+ if (kvm_event_needs_reinjection(vcpu))
+ kvm_mmu_unprotect_page_virt(vcpu, fault_address);
+ r = kvm_mmu_page_fault(vcpu, fault_address, error_code, NULL, 0);
+ break;
+ case KVM_PV_REASON_PAGE_NOT_PRESENT:
+ vcpu->arch.apf.host_apf_reason = 0;
+ local_irq_disable();
+ kvm_async_pf_task_wait(fault_address);
+ local_irq_enable();
+ break;
+ case KVM_PV_REASON_PAGE_READY:
+ vcpu->arch.apf.host_apf_reason = 0;
+ local_irq_disable();
+ kvm_async_pf_task_wake(fault_address);
+ local_irq_enable();
+ break;
+ }
+ return r;
+}
+EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
+
static bool
check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
{
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 330bf3a..2ae88f0 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -77,6 +77,8 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu);
void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
bool accessed_dirty);
bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
+int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
+ u64 fault_address);
static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
{
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e1f8e89..8f263bf 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -192,7 +192,6 @@ struct vcpu_svm {
unsigned int3_injected;
unsigned long int3_rip;
- u32 apf_reason;
/* cached guest cpuid flags for faster access */
bool nrips_enabled : 1;
@@ -2071,34 +2070,9 @@ static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
static int pf_interception(struct vcpu_svm *svm)
{
u64 fault_address = svm->vmcb->control.exit_info_2;
- u64 error_code;
- int r = 1;
+ u64 error_code = svm->vmcb->control.exit_info_1;
- switch (svm->apf_reason) {
- default:
- error_code = svm->vmcb->control.exit_info_1;
-
- trace_kvm_page_fault(fault_address, error_code);
- if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
- kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
- r = kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code,
- svm->vmcb->control.insn_bytes,
- svm->vmcb->control.insn_len);
- break;
- case KVM_PV_REASON_PAGE_NOT_PRESENT:
- svm->apf_reason = 0;
- local_irq_disable();
- kvm_async_pf_task_wait(fault_address);
- local_irq_enable();
- break;
- case KVM_PV_REASON_PAGE_READY:
- svm->apf_reason = 0;
- local_irq_disable();
- kvm_async_pf_task_wake(fault_address);
- local_irq_enable();
- break;
- }
- return r;
+ return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address);
}
static int db_interception(struct vcpu_svm *svm)
@@ -2551,7 +2525,7 @@ static int nested_svm_exit_special(struct vcpu_svm *svm)
break;
case SVM_EXIT_EXCP_BASE + PF_VECTOR:
/* When we're shadowing, trap PFs, but not async PF */
- if (!npt_enabled && svm->apf_reason == 0)
+ if (!npt_enabled && svm->vcpu.arch.apf.host_apf_reason == 0)
return NESTED_EXIT_HOST;
break;
default:
@@ -2594,7 +2568,7 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
vmexit = NESTED_EXIT_DONE;
/* async page fault always cause vmexit */
else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) &&
- svm->apf_reason != 0)
+ svm->vcpu.arch.apf.host_apf_reason != 0)
vmexit = NESTED_EXIT_DONE;
break;
}
@@ -4891,7 +4865,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
/* if exit due to PF check for async PF */
if (svm->vmcb->control.exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR)
- svm->apf_reason = kvm_read_and_reset_pf_reason();
+ svm->vcpu.arch.apf.host_apf_reason = kvm_read_and_reset_pf_reason();
if (npt_enabled) {
vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index df825bb..d20f794 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5648,14 +5648,8 @@ static int handle_exception(struct kvm_vcpu *vcpu)
}
if (is_page_fault(intr_info)) {
- /* EPT won't cause page fault directly */
- BUG_ON(enable_ept);
cr2 = vmcs_readl(EXIT_QUALIFICATION);
- trace_kvm_page_fault(cr2, error_code);
-
- if (kvm_event_needs_reinjection(vcpu))
- kvm_mmu_unprotect_page_virt(vcpu, cr2);
- return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
+ return kvm_handle_page_fault(vcpu, error_code, cr2);
}
ex_no = intr_info & INTR_INFO_VECTOR_MASK;
@@ -8602,6 +8596,10 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
vmx->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
exit_intr_info = vmx->exit_intr_info;
+ /* if exit due to PF check for async PF */
+ if (is_page_fault(exit_intr_info))
+ vmx->vcpu.arch.apf.host_apf_reason = kvm_read_and_reset_pf_reason();
+
/* Handle machine checks before interrupts are enabled */
if (is_machine_check(exit_intr_info))
kvm_machine_check();
--
2.7.4
From: Wanpeng Li <[email protected]>
Add a nested_apf field to vcpu->arch.exception to identify an async page
fault, and construct the expected vm-exit information fields. Force a
nested VM exit from nested_vmx_check_exception() if the injected #PF is an
async page fault. Extend the userspace interface KVM_GET_VCPU_EVENTS
and KVM_SET_VCPU_EVENTS for live migration.
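How the token is kept away from the CR2 value visible in L2 can be sketched
with a standalone model. The inj_* names are hypothetical; the field names
follow the ones this patch adds, and the real code is in
kvm_inject_page_fault().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the relevant struct x86_exception fields. */
struct inj_fault {
	uint64_t address;      /* faulting address, or the apf token */
	bool async_page_fault; /* set by the async pf injection paths */
};

/* Hypothetical mirror of the relevant vcpu->arch fields. */
struct inj_vcpu {
	bool guest_mode;           /* true while L2 is running */
	uint64_t cr2;              /* CR2 as seen by the running guest */
	uint64_t nested_apf_token; /* token to hand to L1 on the vmexit */
	bool nested_apf;           /* pending #PF is an async pf for L1 */
};

static void inj_page_fault(struct inj_vcpu *v, const struct inj_fault *f)
{
	v->nested_apf = v->guest_mode && f->async_page_fault;
	if (v->nested_apf)
		v->nested_apf_token = f->address; /* L2's CR2 stays intact */
	else
		v->cr2 = f->address;              /* ordinary #PF path */
}
```

The token is later reported to L1 as the exit qualification (VMX) or
exit_info_2 (SVM) of the forced vmexit, so L2 never observes it.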
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
Documentation/virtual/kvm/api.txt | 8 ++++++--
arch/x86/include/asm/kvm_emulate.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/include/uapi/asm/kvm.h | 3 ++-
arch/x86/kvm/svm.c | 16 ++++++++++------
arch/x86/kvm/vmx.c | 17 ++++++++++++++---
arch/x86/kvm/x86.c | 19 +++++++++++++++----
tools/arch/x86/include/uapi/asm/kvm.h | 3 ++-
8 files changed, 52 insertions(+), 17 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 4029943..a991a7c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -832,7 +832,7 @@ struct kvm_vcpu_events {
__u8 injected;
__u8 nr;
__u8 has_error_code;
- __u8 pad;
+ __u8 nested_apf;
__u32 error_code;
} exception;
struct {
@@ -857,7 +857,7 @@ struct kvm_vcpu_events {
} smi;
};
-Only two fields are defined in the flags field:
+Only three fields are defined in the flags field:
- KVM_VCPUEVENT_VALID_SHADOW may be set in the flags field to signal that
interrupt.shadow contains a valid state.
@@ -865,6 +865,9 @@ Only two fields are defined in the flags field:
- KVM_VCPUEVENT_VALID_SMM may be set in the flags field to signal that
smi contains a valid state.
+- KVM_VCPUEVENT_VALID_ASYNC_PF may be set in the flags field to signal that
+ the exception is an async page fault.
+
4.32 KVM_SET_VCPU_EVENTS
Capability: KVM_CAP_VCPU_EVENTS
@@ -887,6 +890,7 @@ suppress overwriting the current in-kernel state. The bits are:
KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel
KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector
KVM_VCPUEVENT_VALID_SMM - transfer the smi sub-struct.
+KVM_VCPUEVENT_VALID_ASYNC_PF - transfer async page fault
If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in
the flags field to signal that interrupt.shadow contains a valid state and
diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 722d0e5..fde36f1 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -23,6 +23,7 @@ struct x86_exception {
u16 error_code;
bool nested_page_fault;
u64 address; /* cr2 or nested page fault gpa */
+ u8 async_page_fault;
};
/*
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e20d8a8..71aef4b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -545,6 +545,7 @@ struct kvm_vcpu_arch {
bool reinject;
u8 nr;
u32 error_code;
+ u8 nested_apf;
} exception;
struct kvm_queued_interrupt {
@@ -646,6 +647,7 @@ struct kvm_vcpu_arch {
u32 id;
bool send_user_only;
u32 host_apf_reason;
+ unsigned long nested_apf_token;
} apf;
/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index c2824d0..c9556ec 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -287,6 +287,7 @@ struct kvm_reinject_control {
#define KVM_VCPUEVENT_VALID_SIPI_VECTOR 0x00000002
#define KVM_VCPUEVENT_VALID_SHADOW 0x00000004
#define KVM_VCPUEVENT_VALID_SMM 0x00000008
+#define KVM_VCPUEVENT_VALID_ASYNC_PF 0x00000010
/* Interrupt shadow states */
#define KVM_X86_SHADOW_INT_MOV_SS 0x01
@@ -298,7 +299,7 @@ struct kvm_vcpu_events {
__u8 injected;
__u8 nr;
__u8 has_error_code;
- __u8 pad;
+ __u8 nested_apf;
__u32 error_code;
} exception;
struct {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8f263bf..49cdb8e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2367,15 +2367,19 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
if (!is_guest_mode(&svm->vcpu))
return 0;
+ vmexit = nested_svm_intercept(svm);
+ if (vmexit != NESTED_EXIT_DONE)
+ return 0;
+
svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
svm->vmcb->control.exit_code_hi = 0;
svm->vmcb->control.exit_info_1 = error_code;
- svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
-
- vmexit = nested_svm_intercept(svm);
- if (vmexit == NESTED_EXIT_DONE)
- svm->nested.exit_required = true;
+ if (svm->vcpu.arch.exception.nested_apf)
+ svm->vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token;
+ else
+ svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
+ svm->nested.exit_required = true;
return vmexit;
}
@@ -2568,7 +2572,7 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
vmexit = NESTED_EXIT_DONE;
/* async page fault always cause vmexit */
else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) &&
- svm->vcpu.arch.apf.host_apf_reason != 0)
+ svm->vcpu.arch.exception.nested_apf != 0)
vmexit = NESTED_EXIT_DONE;
break;
}
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d20f794..8724ea6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2418,13 +2418,24 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
* KVM wants to inject page-faults which it got to the guest. This function
* checks whether in a nested guest, we need to inject them to L1 or L2.
*/
-static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
+static int nested_vmx_check_exception(struct kvm_vcpu *vcpu)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+ unsigned int nr = vcpu->arch.exception.nr;
- if (!(vmcs12->exception_bitmap & (1u << nr)))
+ if (!((vmcs12->exception_bitmap & (1u << nr)) ||
+ (nr == PF_VECTOR && vcpu->arch.exception.nested_apf)))
return 0;
+ if (vcpu->arch.exception.nested_apf) {
+ vmcs_write32(VM_EXIT_INTR_ERROR_CODE, vcpu->arch.exception.error_code);
+ nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
+ PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
+ INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
+ vcpu->arch.apf.nested_apf_token);
+ return 1;
+ }
+
nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
vmcs_read32(VM_EXIT_INTR_INFO),
vmcs_readl(EXIT_QUALIFICATION));
@@ -2441,7 +2452,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
u32 intr_info = nr | INTR_INFO_VALID_MASK;
if (!reinject && is_guest_mode(vcpu) &&
- nested_vmx_check_exception(vcpu, nr))
+ nested_vmx_check_exception(vcpu))
return;
if (has_error_code) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7511c0a..5756811 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -452,7 +452,12 @@ EXPORT_SYMBOL_GPL(kvm_complete_insn_gp);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
{
++vcpu->stat.pf_guest;
- vcpu->arch.cr2 = fault->address;
+ vcpu->arch.exception.nested_apf =
+ is_guest_mode(vcpu) && fault->async_page_fault;
+ if (vcpu->arch.exception.nested_apf)
+ vcpu->arch.apf.nested_apf_token = fault->address;
+ else
+ vcpu->arch.cr2 = fault->address;
kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
}
EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
@@ -3072,7 +3077,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
!kvm_exception_is_soft(vcpu->arch.exception.nr);
events->exception.nr = vcpu->arch.exception.nr;
events->exception.has_error_code = vcpu->arch.exception.has_error_code;
- events->exception.pad = 0;
+ events->exception.nested_apf = vcpu->arch.exception.nested_apf;
events->exception.error_code = vcpu->arch.exception.error_code;
events->interrupt.injected =
@@ -3096,7 +3101,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
events->flags = (KVM_VCPUEVENT_VALID_NMI_PENDING
| KVM_VCPUEVENT_VALID_SHADOW
- | KVM_VCPUEVENT_VALID_SMM);
+ | KVM_VCPUEVENT_VALID_SMM
+ | KVM_VCPUEVENT_VALID_ASYNC_PF);
memset(&events->reserved, 0, sizeof(events->reserved));
}
@@ -3108,7 +3114,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
if (events->flags & ~(KVM_VCPUEVENT_VALID_NMI_PENDING
| KVM_VCPUEVENT_VALID_SIPI_VECTOR
| KVM_VCPUEVENT_VALID_SHADOW
- | KVM_VCPUEVENT_VALID_SMM))
+ | KVM_VCPUEVENT_VALID_SMM
+ | KVM_VCPUEVENT_VALID_ASYNC_PF))
return -EINVAL;
if (events->exception.injected &&
@@ -3126,6 +3133,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
vcpu->arch.exception.pending = events->exception.injected;
vcpu->arch.exception.nr = events->exception.nr;
vcpu->arch.exception.has_error_code = events->exception.has_error_code;
+ if (events->flags & KVM_VCPUEVENT_VALID_ASYNC_PF)
+ vcpu->arch.exception.nested_apf = events->exception.nested_apf;
vcpu->arch.exception.error_code = events->exception.error_code;
vcpu->arch.interrupt.pending = events->interrupt.injected;
@@ -8573,6 +8582,7 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
fault.error_code = 0;
fault.nested_page_fault = false;
fault.address = work->arch.token;
+ fault.async_page_fault = true;
kvm_inject_page_fault(vcpu, &fault);
}
}
@@ -8595,6 +8605,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
fault.error_code = 0;
fault.nested_page_fault = false;
fault.address = work->arch.token;
+ fault.async_page_fault = true;
kvm_inject_page_fault(vcpu, &fault);
}
vcpu->arch.apf.halted = false;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index c2824d0..c9556ec 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -287,6 +287,7 @@ struct kvm_reinject_control {
#define KVM_VCPUEVENT_VALID_SIPI_VECTOR 0x00000002
#define KVM_VCPUEVENT_VALID_SHADOW 0x00000004
#define KVM_VCPUEVENT_VALID_SMM 0x00000008
+#define KVM_VCPUEVENT_VALID_ASYNC_PF 0x00000010
/* Interrupt shadow states */
#define KVM_X86_SHADOW_INT_MOV_SS 0x01
@@ -298,7 +299,7 @@ struct kvm_vcpu_events {
__u8 injected;
__u8 nr;
__u8 has_error_code;
- __u8 pad;
+ __u8 nested_apf;
__u32 error_code;
} exception;
struct {
--
2.7.4
From: Wanpeng Li <[email protected]>
Add another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1, async
page faults are delivered to L1 as #PF vmexits; if bit 2 is 0,
kvm_can_do_async_pf() returns false while the vcpu is in guest mode.
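The resulting MSR layout can be checked with a small decoder. This is a
sketch: the three flag macros are the ones defined in kvm_para.h by this
series, while the apf_msr_* names and the two mask macros are hypothetical
helpers introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define KVM_ASYNC_PF_ENABLED               (1 << 0)
#define KVM_ASYNC_PF_SEND_ALWAYS           (1 << 1)
#define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT (1 << 2)

/* Hypothetical helper masks: bits 5:3 reserved, bits 63:6 hold the address. */
#define APF_MSR_RESERVED_MASK 0x38ULL
#define APF_MSR_GPA_MASK      (~0x3fULL)

struct apf_msr_fields {
	uint64_t gpa; /* 64-byte aligned guest physical address */
	bool enabled;
	bool send_always;
	bool delivery_as_pf_vmexit;
};

/* Returns 0 on success, 1 if a reserved bit is set (as the MSR write would). */
static int apf_msr_decode(uint64_t data, struct apf_msr_fields *out)
{
	if (data & APF_MSR_RESERVED_MASK)
		return 1;

	out->gpa = data & APF_MSR_GPA_MASK;
	out->enabled = (data & KVM_ASYNC_PF_ENABLED) != 0;
	out->send_always = (data & KVM_ASYNC_PF_SEND_ALWAYS) != 0;
	out->delivery_as_pf_vmexit =
		(data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT) != 0;
	return 0;
}
```

A host without this patch still treats bit 2 as reserved and rejects the
write, which is why the guest side falls back to a plain write without the
new flag.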
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
Documentation/virtual/kvm/msr.txt | 5 +++--
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 7 ++++++-
arch/x86/kvm/mmu.c | 2 +-
arch/x86/kvm/vmx.c | 2 +-
arch/x86/kvm/x86.c | 5 +++--
7 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 0a9ea51..1ebecc1 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -166,10 +166,11 @@ MSR_KVM_SYSTEM_TIME: 0x12
MSR_KVM_ASYNC_PF_EN: 0x4b564d02
data: Bits 63-6 hold 64-byte aligned physical address of a
64 byte memory area which must be in guest RAM and must be
- zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
+ zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
when asynchronous page faults are enabled on the vcpu 0 when
disabled. Bit 1 is 1 if asynchronous page faults can be injected
- when vcpu is in cpl == 0.
+ when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
+ are delivered to L1 as #PF vmexits.
First 4 byte of 64 byte memory location will be written to by
the hypervisor at the time of asynchronous page fault (APF)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 71aef4b..a981ab8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -648,6 +648,7 @@ struct kvm_vcpu_arch {
bool send_user_only;
u32 host_apf_reason;
unsigned long nested_apf_token;
+ bool delivery_as_pf_vmexit;
} apf;
/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index cff0bb6..a965e5b 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -67,6 +67,7 @@ struct kvm_clock_pairing {
#define KVM_ASYNC_PF_ENABLED (1 << 0)
#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)
+#define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT (1 << 2)
/* Operations for KVM_HC_MMU_OP */
#define KVM_MMU_OP_WRITE_PTE 1
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 43e10d6..71c17a5 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -330,7 +330,12 @@ static void kvm_guest_cpu_init(void)
#ifdef CONFIG_PREEMPT
pa |= KVM_ASYNC_PF_SEND_ALWAYS;
#endif
- wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
+ pa |= KVM_ASYNC_PF_ENABLED;
+
+ /* Async page fault support for L1 hypervisor is optional */
+ if (wrmsr_safe(MSR_KVM_ASYNC_PF_EN,
+ (pa | KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT) & 0xffffffff, pa >> 32) < 0)
+ wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
__this_cpu_write(apf_reason.enabled, 1);
printk(KERN_INFO"KVM setup async PF for cpu %d\n",
smp_processor_id());
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4a7dc00..fb8c35f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3705,7 +3705,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
kvm_event_needs_reinjection(vcpu)))
return false;
- if (is_guest_mode(vcpu))
+ if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
return false;
return kvm_x86_ops->interrupt_allowed(vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8724ea6..4f616db 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8001,7 +8001,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
if (is_nmi(intr_info))
return false;
else if (is_page_fault(intr_info))
- return enable_ept;
+ return !vmx->vcpu.arch.apf.host_apf_reason && enable_ept;
else if (is_no_device(intr_info) &&
!(vmcs12->guest_cr0 & X86_CR0_TS))
return false;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5756811..7254a11 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2065,8 +2065,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
{
gpa_t gpa = data & ~0x3f;
- /* Bits 2:5 are reserved, Should be zero */
- if (data & 0x3c)
+ /* Bits 3:5 are reserved, Should be zero */
+ if (data & 0x38)
return 1;
vcpu->arch.apf.msr_val = data;
@@ -2082,6 +2082,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
return 1;
vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
+ vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
kvm_async_pf_wakeup_all(vcpu);
return 0;
}
--
2.7.4
On 28/06/2017 14:25, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> Add an nested_apf field to vcpu->arch.exception to identify an async page
> fault, and constructs the expected vm-exit information fields. Force a
> nested VM exit from nested_vmx_check_exception() if the injected #PF is
> async page fault. Extending the userspace interface KVM_GET_VCPU_EVENTS
> and KVM_SET_VCPU_EVENTS for live migration.
>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
Radim, Wanpeng,
the patch is nice now but I'm still not 100% sure about the live
migration part. Why do we need to pass nested_apf to userspace, but not
nested_apf_token?
Paolo
2017-06-28 14:56+0200, Paolo Bonzini:
> On 28/06/2017 14:25, Wanpeng Li wrote:
>> From: Wanpeng Li <[email protected]>
>>
>> Add an nested_apf field to vcpu->arch.exception to identify an async page
>> fault, and constructs the expected vm-exit information fields. Force a
>> nested VM exit from nested_vmx_check_exception() if the injected #PF is
>> async page fault. Extending the userspace interface KVM_GET_VCPU_EVENTS
>> and KVM_SET_VCPU_EVENTS for live migration.
>>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Radim Krčmář <[email protected]>
>> Signed-off-by: Wanpeng Li <[email protected]>
>> ---
>
> Radim, Wanpeng,
>
> the patch is nice now but I'm still not 100% sure about the live
> migration part. Why do we need to pass nested_apf to userspace, but not
> nested_apf_token?
We do not need it for migration, but unavailable nested_apf_token
already breaks checkpoint & restore from userspace ... I think the
cleanest way would be to add a new paravirtual event for nested_apf.
(Or just keep delaying the apf.)
Migration does a "async-pf-broadcast" while setting the async-pf MSR on
destination, which resumes all async-pf waiters.
Userspace actually has to drop the async-pf event on migration, because
the destination has invalid nested_apf_token. (It's a horrible design.)
nested_apf is not #PF: if we didn't pass nested_apf, then the exception
would be injected as #PF to L2 after migration. (Local KVM could
remember that the #PF is nested_apf and do some ugly hacks.)
On 28/06/2017 15:38, Radim Krčmář wrote:
>> Radim, Wanpeng,
>>
>> the patch is nice now but I'm still not 100% sure about the live
>> migration part. Why do we need to pass nested_apf to userspace, but not
>> nested_apf_token?
>
> We do not need it for migration, but unavailable nested_apf_token
> already breaks checkpoint & restore from userspace ... I think the
> cleanest way would be to add a new paravirtual event for nested_apf.
> (Or just keep delaying the apf.)
Indeed. With Jim's plans to migrate nested virt data, I was wondering
if nested_apf and nested_apf_token would be better placed in that ioctl,
rather than GET/SET_VCPU_EVENTS.
Nested-virt migration is broken anyway until we have Jim's patches, so
there's little point in migrating nested_apf only. Do you agree?
> Migration does a "async-pf-broadcast" while setting the async-pf MSR on
> destination, which resumes all async-pf waiters.
> Userspace actually has to drop the async-pf event on migration, because
> the destination has invalid nested_apf_token. (It's a horrible design.)
Yes, this was my question essentially. I would still migrate
nested_apf_token (as part of nested virt state), and then clear it in
KVM when doing the async-pf broadcast.
Paolo
> nested_apf is not #PF: if we didn't pass nested_apf, then the exception
> would be injected as #PF to L2 after migration. (Local KVM could
> remember that the #PF is nested_apf and do some ugly hacks.)
2017-06-28 21:48 GMT+08:00 Paolo Bonzini <[email protected]>:
>
>
> On 28/06/2017 15:38, Radim Krčmář wrote:
>>> Radim, Wanpeng,
>>>
>>> the patch is nice now but I'm still not 100% sure about the live
>>> migration part. Why do we need to pass nested_apf to userspace, but not
>>> nested_apf_token?
>>
>> We do not need it for migration, but unavailable nested_apf_token
>> already breaks checkpoint & restore from userspace ... I think the
>> cleanest way would be to add a new paravirtual event for nested_apf.
>> (Or just keep delaying the apf.)
>
> Indeed. With Jim's plans to migrate nested virt data, I was wondering
> if nested_apf and nested_apf_token would be better placed in that ioctl,
> rather than GET/SET_VCPU_EVENTS.
>
> Nested-virt migration is broken anyway until we have Jim's patches, so
> there's little point in migrating nested_apf only. Do you agree?
>
>> Migration does a "async-pf-broadcast" while setting the async-pf MSR on
>> destination, which resumes all async-pf waiters.
>> Userspace actually has to drop the async-pf event on migration, because
>> the destination has invalid nested_apf_token. (It's a horrible design.)
>
> Yes, this was my question essentially. I would still migrate
> nested_apf_token (as part of nested virt state), and then clear it in
> KVM when doing the async-pf broadcast.
Do you mean I should save nested_apf_token via GET_VCPU_EVENTS and
restore it via SET_VCPU_EVENTS? I reuse the "u8 pad" slot in
kvm_vcpu_events to hold nested_apf; however, nested_apf_token is an
unsigned long.
Regards,
Wanpeng Li
On 28/06/2017 16:09, Wanpeng Li wrote:
>> Yes, this was my question essentially. I would still migrate
>> nested_apf_token (as part of nested virt state), and then clear it in
>> KVM when doing the async-pf broadcast.
> Do you mean I should save nested_apf_token by GET_VCPU_EVENTS and
> restore it by SET_VCPU_EVENTS? I utilize the place of "u8 pad" in
> kvm_vcpu_events to hold nested_apf, however nested_apf_token is
> unsigned long.
If for now we can leave out the GET/SET_VCPU_EVENTS changes, that would
be best. nested_apf and nested_apf_token should be migrated together
with the rest of the nested virt state.
Paolo
2017-06-28 22:11 GMT+08:00 Paolo Bonzini <[email protected]>:
>
>
> On 28/06/2017 16:09, Wanpeng Li wrote:
>>> Yes, this was my question essentially. I would still migrate
>>> nested_apf_token (as part of nested virt state), and then clear it in
>>> KVM when doing the async-pf broadcast.
>> Do you mean I should save nested_apf_token by GET_VCPU_EVENTS and
>> restore it by SET_VCPU_EVENTS? I utilize the place of "u8 pad" in
>> kvm_vcpu_events to hold nested_apf, however nested_apf_token is
>> unsigned long.
>
> If for now we can leave out the GET/SET_VCPU_EVENTS changes, that would
> be best. nested_apf and nested_apf_token should be migrated together
> with the rest of the nested virt state.
Radim explained why we need at least nested_apf here:
> nested_apf is not #PF: if we didn't pass nested_apf, then the exception would be injected as #PF to L2 after migration.
Do you mean we can ignore it here and depend on Jim's patches to
handle it completely?
Regards,
Wanpeng Li
On 28/06/2017 16:17, Wanpeng Li wrote:
>> If for now we can leave out the GET/SET_VCPU_EVENTS changes, that would
>> be best. nested_apf and nested_apf_token should be migrated together
>> with the rest of the nested virt state.
> Radim explains why we at least needs nested_apf here:
>
>> nested_apf is not #PF: if we didn't pass nested_apf, then the exception would be injected as #PF to L2 after migration.
Yes, but migration of a L1 hypervisor is broken anyway.
> Do you mean we can ignore it here and depends on Jim's patches to
> completely handle it?
Ignore it here, remember it when someone picks up Jim's patches, and
also serialize nested_apf_token.
Paolo
2017-06-28 22:20 GMT+08:00 Paolo Bonzini <[email protected]>:
>
>
> On 28/06/2017 16:17, Wanpeng Li wrote:
>>> If for now we can leave out the GET/SET_VCPU_EVENTS changes, that would
>>> be best. nested_apf and nested_apf_token should be migrated together
>>> with the rest of the nested virt state.
>> Radim explains why we at least needs nested_apf here:
>>
>>> nested_apf is not #PF: if we didn't pass nested_apf, then the exception would be injected as #PF to L2 after migration.
>
> Yes, but migration of a L1 hypervisor is broken anyway.
>
>> Do you mean we can ignore it here and depends on Jim's patches to
>> completely handle it?
>
> Ignore it here, remember it when someone picks up Jim's patches, and
> also serialize nested_apf_token.
Ok, I will remove the GET/SET_VCPU_EVENTS changes in the next version.
Regards,
Wanpeng Li