2017-06-13 06:08:20

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH 0/4] KVM: async_pf: Fix async_pf exception injection

INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
Not tainted 4.12.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gnome-terminal- D 0 1734 1015 0x00000000
Call Trace:
__schedule+0x3cd/0xb30
schedule+0x40/0x90
kvm_async_pf_task_wait+0x1cc/0x270
? __vfs_read+0x37/0x150
? prepare_to_swait+0x22/0x70
do_async_page_fault+0x77/0xb0
? do_async_page_fault+0x77/0xb0
async_page_fault+0x28/0x30

This is triggered by running both win7 and win2016 on L1 KVM simultaneously
and then putting memory pressure on L1. I can observe this hang on L1 when
at least ~70% of the swap area on L0 is occupied.

This happens because an async_pf that should have been injected into L1 was
injected into L2 instead: the L2 guest starts receiving page faults with a
bogus %cr2 (actually the apf token from the host), and the L1 guest starts
accumulating tasks stuck in D state in kvm_async_pf_task_wait() because the
corresponding PAGE_READY async_pfs never arrive.

This patchset fixes it according to Radim's proposal "force a nested VM exit
from nested_vmx_check_exception if the injected #PF is async_pf and handle
the #PF VM exit in L1". https://www.spinics.net/lists/kvm/msg142498.html

Note: The patchset barely touches SVM since I don't have an AMD CPU to verify
the modification.

Wanpeng Li (4):
KVM: x86: Simplify kvm_x86_ops->queue_exception parameter list
KVM: async_pf: Add L1 guest async_pf #PF vmexit handler
KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf
KVM: async_pf: Let host know whether the guest supports delivering async_pf as a #PF vmexit

Documentation/virtual/kvm/msr.txt | 5 +--
arch/x86/include/asm/kvm_emulate.h | 1 +
arch/x86/include/asm/kvm_host.h | 6 ++--
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 1 +
arch/x86/kvm/mmu.c | 2 +-
arch/x86/kvm/svm.c | 8 +++--
arch/x86/kvm/vmx.c | 63 +++++++++++++++++++++++++++++-------
arch/x86/kvm/x86.c | 13 ++++----
9 files changed, 73 insertions(+), 27 deletions(-)

--
2.7.4


2017-06-13 06:08:29

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH 1/4] KVM: x86: Simplify kvm_x86_ops->queue_exception parameter list

From: Wanpeng Li <[email protected]>

This patch removes all arguments except the first from kvm_x86_ops->queue_exception,
since the vmx/svm implementations can extract those arguments from
vcpu->arch.exception themselves; do the same in nested_{vmx,svm}_check_exception.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +---
arch/x86/kvm/svm.c | 8 +++++---
arch/x86/kvm/vmx.c | 8 +++++---
arch/x86/kvm/x86.c | 5 +----
4 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 695605e..1f01bfb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -948,9 +948,7 @@ struct kvm_x86_ops {
unsigned char *hypercall_addr);
void (*set_irq)(struct kvm_vcpu *vcpu);
void (*set_nmi)(struct kvm_vcpu *vcpu);
- void (*queue_exception)(struct kvm_vcpu *vcpu, unsigned nr,
- bool has_error_code, u32 error_code,
- bool reinject);
+ void (*queue_exception)(struct kvm_vcpu *vcpu);
void (*cancel_injection)(struct kvm_vcpu *vcpu);
int (*interrupt_allowed)(struct kvm_vcpu *vcpu);
int (*nmi_allowed)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ba9891a..e1f8e89 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -631,11 +631,13 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
svm_set_interrupt_shadow(vcpu, 0);
}

-static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
- bool has_error_code, u32 error_code,
- bool reinject)
+static void svm_queue_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned int nr = vcpu->arch.exception.nr;
+ bool has_error_code = vcpu->arch.exception.has_error_code;
+ bool reinject = vcpu->arch.exception.reinject;
+ u32 error_code = vcpu->arch.exception.error_code;

/*
* If we are within a nested VM we'd better #VMEXIT and let the guest
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b9..df825bb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2431,11 +2431,13 @@ static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
return 1;
}

-static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
- bool has_error_code, u32 error_code,
- bool reinject)
+static void vmx_queue_exception(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ unsigned int nr = vcpu->arch.exception.nr;
+ bool has_error_code = vcpu->arch.exception.has_error_code;
+ bool reinject = vcpu->arch.exception.reinject;
+ u32 error_code = vcpu->arch.exception.error_code;
u32 intr_info = nr | INTR_INFO_VALID_MASK;

if (!reinject && is_guest_mode(vcpu) &&
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 87d3cb9..1b28a31 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6345,10 +6345,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
kvm_update_dr7(vcpu);
}

- kvm_x86_ops->queue_exception(vcpu, vcpu->arch.exception.nr,
- vcpu->arch.exception.has_error_code,
- vcpu->arch.exception.error_code,
- vcpu->arch.exception.reinject);
+ kvm_x86_ops->queue_exception(vcpu);
return 0;
}

--
2.7.4

2017-06-13 06:08:34

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH 2/4] KVM: async_pf: Add L1 guest async_pf #PF vmexit handler

From: Wanpeng Li <[email protected]>

This patch adds an L1 guest async page fault #PF vmexit handler: such a
#PF is converted into a vmexit from L2 to L1, which L1 then handles like
an ordinary async page fault.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index df825bb..f533cc1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -616,6 +616,7 @@ struct vcpu_vmx {
bool emulation_required;

u32 exit_reason;
+ u32 apf_reason;

/* Posted interrupt descriptor */
struct pi_desc pi_desc;
@@ -5648,14 +5649,31 @@ static int handle_exception(struct kvm_vcpu *vcpu)
}

if (is_page_fault(intr_info)) {
- /* EPT won't cause page fault directly */
- BUG_ON(enable_ept);
cr2 = vmcs_readl(EXIT_QUALIFICATION);
- trace_kvm_page_fault(cr2, error_code);
+ switch (vmx->apf_reason) {
+ default:
+ /* EPT won't cause page fault directly */
+ BUG_ON(enable_ept);
+ trace_kvm_page_fault(cr2, error_code);

- if (kvm_event_needs_reinjection(vcpu))
- kvm_mmu_unprotect_page_virt(vcpu, cr2);
- return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
+ if (kvm_event_needs_reinjection(vcpu))
+ kvm_mmu_unprotect_page_virt(vcpu, cr2);
+ return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
+ break;
+ case KVM_PV_REASON_PAGE_NOT_PRESENT:
+ vmx->apf_reason = 0;
+ local_irq_disable();
+ kvm_async_pf_task_wait(cr2);
+ local_irq_enable();
+ break;
+ case KVM_PV_REASON_PAGE_READY:
+ vmx->apf_reason = 0;
+ local_irq_disable();
+ kvm_async_pf_task_wake(cr2);
+ local_irq_enable();
+ break;
+ }
+ return 0;
}

ex_no = intr_info & INTR_INFO_VECTOR_MASK;
@@ -8602,6 +8620,10 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
vmx->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
exit_intr_info = vmx->exit_intr_info;

+ /* if exit due to PF check for async PF */
+ if (is_page_fault(exit_intr_info))
+ vmx->apf_reason = kvm_read_and_reset_pf_reason();
+
/* Handle machine checks before interrupts are enabled */
if (is_machine_check(exit_intr_info))
kvm_machine_check();
--
2.7.4

2017-06-13 06:08:39

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

From: Wanpeng Li <[email protected]>

Add an async_page_fault field to vcpu->arch.exception to identify an async
page fault, and construct the expected VM-exit information fields. Force
a nested VM exit from nested_vmx_check_exception() if the injected #PF
is an async page fault.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/include/asm/kvm_emulate.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx.c | 21 ++++++++++++++++++---
arch/x86/kvm/x86.c | 3 +++
4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 0559626..b5bcad9 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -23,6 +23,7 @@ struct x86_exception {
u16 error_code;
bool nested_page_fault;
u64 address; /* cr2 or nested page fault gpa */
+ bool async_page_fault;
};

/*
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1f01bfb..d30576a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -545,6 +545,7 @@ struct kvm_vcpu_arch {
bool reinject;
u8 nr;
u32 error_code;
+ bool async_page_fault;
} exception;

struct kvm_queued_interrupt {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f533cc1..da81e48 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2422,13 +2422,28 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+ u32 intr_info = 0;
+ unsigned long exit_qualification = 0;

- if (!(vmcs12->exception_bitmap & (1u << nr)))
+ if (!((vmcs12->exception_bitmap & (1u << nr)) ||
+ (nr == PF_VECTOR && vcpu->arch.exception.async_page_fault)))
return 0;

+ intr_info = nr | INTR_INFO_VALID_MASK;
+ exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+ if (vcpu->arch.exception.has_error_code) {
+ vmcs_write32(VM_EXIT_INTR_ERROR_CODE, vcpu->arch.exception.error_code);
+ intr_info |= INTR_INFO_DELIVER_CODE_MASK;
+ }
+ if (kvm_exception_is_soft(nr))
+ intr_info |= INTR_TYPE_SOFT_EXCEPTION;
+ else
+ intr_info |= INTR_TYPE_HARD_EXCEPTION;
+ if (nr == PF_VECTOR && vcpu->arch.exception.async_page_fault)
+ exit_qualification = vcpu->arch.cr2;
+
nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
- vmcs_read32(VM_EXIT_INTR_INFO),
- vmcs_readl(EXIT_QUALIFICATION));
+ intr_info, exit_qualification);
return 1;
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b28a31..d7f1a49 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -453,6 +453,7 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
{
++vcpu->stat.pf_guest;
vcpu->arch.cr2 = fault->address;
+ vcpu->arch.exception.async_page_fault = fault->async_page_fault;
kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
}
EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
@@ -8571,6 +8572,7 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
fault.error_code = 0;
fault.nested_page_fault = false;
fault.address = work->arch.token;
+ fault.async_page_fault = true;
kvm_inject_page_fault(vcpu, &fault);
}
}
@@ -8593,6 +8595,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
fault.error_code = 0;
fault.nested_page_fault = false;
fault.address = work->arch.token;
+ fault.async_page_fault = true;
kvm_inject_page_fault(vcpu, &fault);
}
vcpu->arch.apf.halted = false;
--
2.7.4

2017-06-13 06:09:06

by Wanpeng Li

[permalink] [raw]
Subject: [PATCH 4/4] KVM: async_pf: Let host know whether the guest supports delivering async_pf as a #PF vmexit

From: Wanpeng Li <[email protected]>

Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1, async
page faults are delivered to L1 as #PF vmexits; if bit 2 is 0, kvm_can_do_async_pf
returns 0 if in guest mode.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
Documentation/virtual/kvm/msr.txt | 5 +++--
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 1 +
arch/x86/kvm/mmu.c | 2 +-
arch/x86/kvm/x86.c | 5 +++--
6 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 0a9ea51..1ebecc1 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -166,10 +166,11 @@ MSR_KVM_SYSTEM_TIME: 0x12
MSR_KVM_ASYNC_PF_EN: 0x4b564d02
data: Bits 63-6 hold 64-byte aligned physical address of a
64 byte memory area which must be in guest RAM and must be
- zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
+ zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
when asynchronous page faults are enabled on the vcpu 0 when
disabled. Bit 1 is 1 if asynchronous page faults can be injected
- when vcpu is in cpl == 0.
+ when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
+ are delivered to L1 as #PF vmexits.

First 4 byte of 64 byte memory location will be written to by
the hypervisor at the time of asynchronous page fault (APF)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d30576a..2766b2c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -646,6 +646,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u32 id;
bool send_user_only;
+ bool delivery_as_pf_vmexit;
} apf;

/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index cff0bb6..a965e5b 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -67,6 +67,7 @@ struct kvm_clock_pairing {

#define KVM_ASYNC_PF_ENABLED (1 << 0)
#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)
+#define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT (1 << 2)

/* Operations for KVM_HC_MMU_OP */
#define KVM_MMU_OP_WRITE_PTE 1
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 43e10d6..1e29a77 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -330,6 +330,7 @@ static void kvm_guest_cpu_init(void)
#ifdef CONFIG_PREEMPT
pa |= KVM_ASYNC_PF_SEND_ALWAYS;
#endif
+ pa |= KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
__this_cpu_write(apf_reason.enabled, 1);
printk(KERN_INFO"KVM setup async PF for cpu %d\n",
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cb82259..c49aecd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3704,7 +3704,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
kvm_event_needs_reinjection(vcpu)))
return false;

- if (is_guest_mode(vcpu))
+ if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
return false;

return kvm_x86_ops->interrupt_allowed(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7f1a49..74cb944 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2061,8 +2061,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
{
gpa_t gpa = data & ~0x3f;

- /* Bits 2:5 are reserved, Should be zero */
- if (data & 0x3c)
+ /* Bits 3:5 are reserved, Should be zero */
+ if (data & 0x38)
return 1;

vcpu->arch.apf.msr_val = data;
@@ -2078,6 +2078,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
return 1;

vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
+ vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
kvm_async_pf_wakeup_all(vcpu);
return 0;
}
--
2.7.4

2017-06-13 18:19:30

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 4/4] KVM: async_pf: Let host know whether the guest supports delivering async_pf as a #PF vmexit

2017-06-12 23:08-0700, Wanpeng Li:
> From: Wanpeng Li <[email protected]>
>
> Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1, async
> page faults are delivered to L1 as #PF vmexits; if bit 2 is 0, kvm_can_do_async_pf
> returns 0 if in guest mode.
>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---

I think KVM (L1) should also do something like

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index dd274db9bf77..c15a9f178e60 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7991,7 +7991,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
if (is_nmi(intr_info))
return false;
else if (is_page_fault(intr_info))
- return enable_ept;
+ return !vmx->apf_reason && enable_ept;
else if (is_no_device(intr_info) &&
!(vmcs12->guest_cr0 & X86_CR0_TS))
return false;

so it doesn't pass the APF directed towards it (L1) into L2 if there is
L3 at the moment.

2017-06-13 18:55:34

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-12 23:08-0700, Wanpeng Li:
> From: Wanpeng Li <[email protected]>
>
> Add an async_page_fault field to vcpu->arch.exception to identify an async
> page fault, and constructs the expected vm-exit information fields. Force
> a nested VM exit from nested_vmx_check_exception() if the injected #PF
> is async page fault.
>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -2422,13 +2422,28 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
> static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)

This function could use the same treatment as vmx_queue_exception(), so
we are not mixing 'nr' with 'vcpu->arch.exception.*'.

> {
> struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> + u32 intr_info = 0;
> + unsigned long exit_qualification = 0;
>
> - if (!(vmcs12->exception_bitmap & (1u << nr)))
> + if (!((vmcs12->exception_bitmap & (1u << nr)) ||
> + (nr == PF_VECTOR && vcpu->arch.exception.async_page_fault)))
> return 0;
>
> + intr_info = nr | INTR_INFO_VALID_MASK;
> + exit_qualification = vmcs_readl(EXIT_QUALIFICATION);

This part still uses EXIT_QUALIFICATION(), which means it is not general,
and I think it would be nicer to just do a simple special case on top:

if (vcpu->arch.exception.async_page_fault) {
vmcs_write32(VM_EXIT_INTR_ERROR_CODE, vcpu->arch.exception.error_code);
nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
vcpu->arch.cr2);
return 1;
}

Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
exits; isn't this going to change the CR2 visible in L2 guest after a
nested VM entry?

Btw. nested_vmx_check_exception() didn't support emulated exceptions at
all (it only passed through ones we got from hardware), or have I missed
something?

Thanks.

2017-06-14 01:01:39

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 4/4] KVM: async_pf: Let host know whether the guest supports delivering async_pf as a #PF vmexit

2017-06-14 2:19 GMT+08:00 Radim Krčmář <[email protected]>:
> 2017-06-12 23:08-0700, Wanpeng Li:
>> From: Wanpeng Li <[email protected]>
>>
>> Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1, async
>> page faults are delivered to L1 as #PF vmexits; if bit 2 is 0, kvm_can_do_async_pf
>> returns 0 if in guest mode.
>>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Radim Krčmář <[email protected]>
>> Signed-off-by: Wanpeng Li <[email protected]>
>> ---
>
> I think KVM (L1) should also do something like
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index dd274db9bf77..c15a9f178e60 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7991,7 +7991,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
> if (is_nmi(intr_info))
> return false;
> else if (is_page_fault(intr_info))
> - return enable_ept;
> + return !vmx->apf_reason && enable_ept;
> else if (is_no_device(intr_info) &&
> !(vmcs12->guest_cr0 & X86_CR0_TS))
> return false;
>
> so it doesn't pass the APF directed towards it (L1) into L2 if there is
> L3 at the moment.

Agreed. I will do this in v2.

Regards,
Wanpeng Li

2017-06-14 01:07:35

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-14 2:55 GMT+08:00 Radim Krčmář <[email protected]>:
> 2017-06-12 23:08-0700, Wanpeng Li:
>> From: Wanpeng Li <[email protected]>
>>
>> Add an async_page_fault field to vcpu->arch.exception to identify an async
>> page fault, and constructs the expected vm-exit information fields. Force
>> a nested VM exit from nested_vmx_check_exception() if the injected #PF
>> is async page fault.
>>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Radim Krčmář <[email protected]>
>> Signed-off-by: Wanpeng Li <[email protected]>
>> ---
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> @@ -2422,13 +2422,28 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
>> static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
>
> This function could use the same treatment as vmx_queue_exception(), so
> we are not mixing 'nr' with 'vcpu->arch.exception.*'.
>
>> {
>> struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>> + u32 intr_info = 0;
>> + unsigned long exit_qualification = 0;
>>
>> - if (!(vmcs12->exception_bitmap & (1u << nr)))
>> + if (!((vmcs12->exception_bitmap & (1u << nr)) ||
>> + (nr == PF_VECTOR && vcpu->arch.exception.async_page_fault)))
>> return 0;
>>
>> + intr_info = nr | INTR_INFO_VALID_MASK;
>> + exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>
> This part still uses EXIT_QUALIFICATION(), which means it is not general
> and I think it would be nicer to just do simple special case on top:
>
> if (vcpu->arch.exception.async_page_fault) {
> vmcs_write32(VM_EXIT_INTR_ERROR_CODE, vcpu->arch.exception.error_code);
> nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
> PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
> INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
> vcpu->arch.cr2);
> return 1;
> }

Good point.

>
> Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
> exits; isn't this going to change the CR2 visible in L2 guest after a
> nested VM entry?

Sorry, I don't fully understand the question. As you know, this
vcpu->arch.cr2, which holds the token, is set before the async_pf injection,
and L1 will pick it up from EXIT_QUALIFICATION during the nested vmexit, so
why would it change the CR2 visible in the L2 guest after a nested VM entry?

>
> Btw. nested_vmx_check_exception() didn't support emulated exceptions at
> all (it only passed through ones we got from hardware),

I think so.

Regards,
Wanpeng Li

2017-06-14 12:52:30

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-14 09:07+0800, Wanpeng Li:
> 2017-06-14 2:55 GMT+08:00 Radim Krčmář <[email protected]>:
> > Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
> > exits; isn't this going to change the CR2 visible in L2 guest after a
> > nested VM entry?
>
> Sorry, I don't fully understand the question. As you know this
> vcpu->arch.cr2 which includes token is set before async pf injection,

Yes, I'm thinking that setting vcpu->arch.cr2 is a mistake in this case.

> and L1 will intercept it from EXIT_QUALIFICATION during nested vmexit,

Right, so we do not need to have the token in CR2, because L1 is not
going to look at it.

> why it can change the CR2 visible in L2 guest after a nested VM entry?

Sorry, the situation is too convoluted to be expressed in one sentence:

1) L2 is running with CR2 = L2CR2
3) VMX exits (say, unrelated EXTERNAL_INTERRUPT) and L0 stores L2CR2 in
vcpu->arch.cr2
2) APF for L1 has completed
4) L0 KVM wants to inject APF and sets vcpu->arch.cr2 = APFT
5) L0 KVM does a nested VM exit to L1, EXIT_QUALIFICATION = APFT
6) L0 KVM enters L1 with CR2 = vcpu->arch.cr2 = APFT
7) L1 stores APFT as L2's CR2
8) L1 handles APF, maybe reschedules, but eventually comes back to this
L2's thread
9) after some time, L1 enters L2 with CR2 = APFT
10) L2 is running with CR2 = APFT

The original L2CR2 is lost and we'd introduce a bug if L2 wanted to look
at it, e.g. it was in a process of handling its #PF.

2017-06-14 13:02:27

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-14 20:52 GMT+08:00 Radim Krčmář <[email protected]>:
> 2017-06-14 09:07+0800, Wanpeng Li:
>> 2017-06-14 2:55 GMT+08:00 Radim Krčmář <[email protected]>:
>> > Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
>> > exits; isn't this going to change the CR2 visible in L2 guest after a
>> > nested VM entry?
>>
>> Sorry, I don't fully understand the question. As you know this
>> vcpu->arch.cr2 which includes token is set before async pf injection,
>
> Yes, I'm thinking that setting vcpu->arch.cr2 is a mistake in this case.
>
>> and L1 will intercept it from EXIT_QUALIFICATION during nested vmexit,
>
> Right, so we do not need to have the token in CR2, because L1 is not
> going to look at it.
>
>> why it can change the CR2 visible in L2 guest after a nested VM entry?
>
> Sorry, the situation is too convoluted to be expressed in one sentence:
>
> 1) L2 is running with CR2 = L2CR2
> 3) VMX exits (say, unrelated EXTERNAL_INTERRUPT) and L0 stores L2CR2 in
> vcpu->arch.cr2
> 2) APF for L1 has completed
> 4) L0 KVM wants to inject APF and sets vcpu->arch.cr2 = APFT
> 5) L0 KVM does a nested VM exit to L1, EXIT_QUALIFICATION = APFT
> 6) L0 KVM enters L1 with CR2 = vcpu->arch.cr2 = APFT
> 7) L1 stores APFT as L2's CR2
> 8) L1 handles APF, maybe reschedules, but eventually comes back to this
> L2's thread
> 9) after some time, L1 enters L2 with CR2 = APFT
> 10) L2 is running with CR2 = APTF
>
> The original L2CR2 is lost and we'd introduce a bug if L2 wanted to look
> at it, e.g. it was in a process of handling its #PF.

Good point. What's your proposal? :)

Regards,
Wanpeng Li

2017-06-14 13:20:46

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-14 21:02+0800, Wanpeng Li:
> 2017-06-14 20:52 GMT+08:00 Radim Krčmář <[email protected]>:
> > 2017-06-14 09:07+0800, Wanpeng Li:
> >> 2017-06-14 2:55 GMT+08:00 Radim Krčmář <[email protected]>:
> >> > Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
> >> > exits; isn't this going to change the CR2 visible in L2 guest after a
> >> > nested VM entry?
> >>
> >> Sorry, I don't fully understand the question. As you know this
> >> vcpu->arch.cr2 which includes token is set before async pf injection,
> >
> > Yes, I'm thinking that setting vcpu->arch.cr2 is a mistake in this case.
> >
> >> and L1 will intercept it from EXIT_QUALIFICATION during nested vmexit,
> >
> > Right, so we do not need to have the token in CR2, because L1 is not
> > going to look at it.
> >
> >> why it can change the CR2 visible in L2 guest after a nested VM entry?
> >
> > Sorry, the situation is too convoluted to be expressed in one sentence:
> >
> > 1) L2 is running with CR2 = L2CR2
> > 3) VMX exits (say, unrelated EXTERNAL_INTERRUPT) and L0 stores L2CR2 in
> > vcpu->arch.cr2
> > 2) APF for L1 has completed
> > 4) L0 KVM wants to inject APF and sets vcpu->arch.cr2 = APFT
> > 5) L0 KVM does a nested VM exit to L1, EXIT_QUALIFICATION = APFT
> > 6) L0 KVM enters L1 with CR2 = vcpu->arch.cr2 = APFT
> > 7) L1 stores APFT as L2's CR2
> > 8) L1 handles APF, maybe reschedules, but eventually comes back to this
> > L2's thread
> > 9) after some time, L1 enters L2 with CR2 = APFT
> > 10) L2 is running with CR2 = APTF
> >
> > The original L2CR2 is lost and we'd introduce a bug if L2 wanted to look
> > at it, e.g. it was in a process of handling its #PF.
>
> Good point. What's your proposal? :)

Get rid of async_pf. :) Optimal solutions aside, I think it would be
best to add a new injection function for APF. One that injects a normal
#PF for non-nested guests and directly triggers a #PF VM exit otherwise,
and call it from kvm_arch_async_page_*present().
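
Roughly something like this -- just an untested sketch, and both
kvm_inject_async_page_fault and the nested_apf_vmexit hook are names I am
making up here to illustrate the shape:

	static void kvm_inject_async_page_fault(struct kvm_vcpu *vcpu,
						struct x86_exception *fault)
	{
		if (!is_guest_mode(vcpu)) {
			/* Non-nested guest: keep the current behavior. */
			kvm_inject_page_fault(vcpu, fault);
			return;
		}
		/*
		 * Nested guest: trigger the #PF VM exit to L1 right here,
		 * passing the token as the exit qualification, instead of
		 * going through vcpu->arch.exception and CR2.
		 * (nested_apf_vmexit is a hypothetical vmx/svm hook.)
		 */
		kvm_x86_ops->nested_apf_vmexit(vcpu, fault->error_code,
					       fault->address);
	}

kvm_arch_async_page_not_present()/kvm_arch_async_page_present() would then
call this instead of kvm_inject_page_fault().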

Do you think that just moving the nested VM exit from
nested_vmx_check_exception() would work?

Thanks.

2017-06-14 13:32:09

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-14 21:20 GMT+08:00 Radim Krčmář <[email protected]>:
> 2017-06-14 21:02+0800, Wanpeng Li:
>> 2017-06-14 20:52 GMT+08:00 Radim Krčmář <[email protected]>:
>> > 2017-06-14 09:07+0800, Wanpeng Li:
>> >> 2017-06-14 2:55 GMT+08:00 Radim Krčmář <[email protected]>:
>> >> > Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
>> >> > exits; isn't this going to change the CR2 visible in L2 guest after a
>> >> > nested VM entry?
>> >>
>> >> Sorry, I don't fully understand the question. As you know this
>> >> vcpu->arch.cr2 which includes token is set before async pf injection,
>> >
>> > Yes, I'm thinking that setting vcpu->arch.cr2 is a mistake in this case.
>> >
>> >> and L1 will intercept it from EXIT_QUALIFICATION during nested vmexit,
>> >
>> > Right, so we do not need to have the token in CR2, because L1 is not
>> > going to look at it.
>> >
>> >> why it can change the CR2 visible in L2 guest after a nested VM entry?
>> >
>> > Sorry, the situation is too convoluted to be expressed in one sentence:
>> >
>> > 1) L2 is running with CR2 = L2CR2
>> > 3) VMX exits (say, unrelated EXTERNAL_INTERRUPT) and L0 stores L2CR2 in
>> > vcpu->arch.cr2
>> > 2) APF for L1 has completed
>> > 4) L0 KVM wants to inject APF and sets vcpu->arch.cr2 = APFT
>> > 5) L0 KVM does a nested VM exit to L1, EXIT_QUALIFICATION = APFT
>> > 6) L0 KVM enters L1 with CR2 = vcpu->arch.cr2 = APFT
>> > 7) L1 stores APFT as L2's CR2
>> > 8) L1 handles APF, maybe reschedules, but eventually comes back to this
>> > L2's thread
>> > 9) after some time, L1 enters L2 with CR2 = APFT
>> > 10) L2 is running with CR2 = APTF
>> >
>> > The original L2CR2 is lost and we'd introduce a bug if L2 wanted to look
>> > at it, e.g. it was in a process of handling its #PF.
>>
>> Good point. What's your proposal? :)
>
> Get rid of async_pf. :) Optimal solutions aside, I think it would be
> best to add a new injection function for APF. One that injects a normal
> #PF for non-nested guests and directly triggers a #PF VM exit otherwise,
> and call it from kvm_arch_async_page_*present().
>
> Do you think that just moving the nested VM exit from
> nested_vmx_check_exception() would work?

That's the same as what commit 9bc1f09f6fa76 (KVM: async_pf: avoid
async pf injection when in guest mode) does, I think. How about
introducing a field like nested_apf_token to vcpu->arch to keep the
token in kvm_inject_page_fault() when
(vcpu->arch.exception.async_page_fault && is_guest_mode(vcpu)) is
true?
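
Roughly this (untested sketch; I'm assuming the field would live in
vcpu->arch.apf):

	void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
	{
		++vcpu->stat.pf_guest;
		vcpu->arch.exception.async_page_fault = fault->async_page_fault;
		if (vcpu->arch.exception.async_page_fault && is_guest_mode(vcpu))
			/* Keep the token out of CR2; L1 reads it from EXIT_QUALIFICATION. */
			vcpu->arch.apf.nested_apf_token = fault->address;
		else
			vcpu->arch.cr2 = fault->address;
		kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
	}

nested_vmx_check_exception() would then use nested_apf_token instead of
vcpu->arch.cr2 as the exit qualification, so the L2 CR2 is left alone.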

Regards,
Wanpeng Li

2017-06-14 14:32:28

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-14 21:20 GMT+08:00 Radim Krčmář <[email protected]>:
> 2017-06-14 21:02+0800, Wanpeng Li:
>> 2017-06-14 20:52 GMT+08:00 Radim Krčmář <[email protected]>:
>> > 2017-06-14 09:07+0800, Wanpeng Li:
>> >> 2017-06-14 2:55 GMT+08:00 Radim Krčmář <[email protected]>:
>> >> > Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
>> >> > exits; isn't this going to change the CR2 visible in L2 guest after a
>> >> > nested VM entry?
>> >>
>> >> Sorry, I don't fully understand the question. As you know this
>> >> vcpu->arch.cr2 which includes token is set before async pf injection,
>> >
>> > Yes, I'm thinking that setting vcpu->arch.cr2 is a mistake in this case.
>> >
>> >> and L1 will intercept it from EXIT_QUALIFICATION during nested vmexit,
>> >
>> > Right, so we do not need to have the token in CR2, because L1 is not
>> > going to look at it.
>> >
>> >> why it can change the CR2 visible in L2 guest after a nested VM entry?
>> >
>> > Sorry, the situation is too convoluted to be expressed in one sentence:
>> >
>> > 1) L2 is running with CR2 = L2CR2
>> > 3) VMX exits (say, unrelated EXTERNAL_INTERRUPT) and L0 stores L2CR2 in
>> > vcpu->arch.cr2
>> > 2) APF for L1 has completed
>> > 4) L0 KVM wants to inject APF and sets vcpu->arch.cr2 = APFT
>> > 5) L0 KVM does a nested VM exit to L1, EXIT_QUALIFICATION = APFT
>> > 6) L0 KVM enters L1 with CR2 = vcpu->arch.cr2 = APFT
>> > 7) L1 stores APFT as L2's CR2
>> > 8) L1 handles APF, maybe reschedules, but eventually comes back to this
>> > L2's thread
>> > 9) after some time, L1 enters L2 with CR2 = APFT
>> > 10) L2 is running with CR2 = APTF
>> >
>> > The original L2CR2 is lost and we'd introduce a bug if L2 wanted to look
>> > at it, e.g. it was in a process of handling its #PF.
>>
>> Good point. What's your proposal? :)
>
> Get rid of async_pf. :) Optimal solutions aside, I think it would be
> best to add a new injection function for APF. One that injects a normal
> #PF for non-nested guests and directly triggers a #PF VM exit otherwise,
> and call it from kvm_arch_async_page_*present().

In addition, triggering the nested vmexit directly in kvm_arch_async_page_*present()
instead of going through inject_pending_event() before vmentry, i.e. a nested
vmexit right after a vmexit on L0, looks strange. So how about the
nested_apf_token proposal? Radim, Paolo?

Regards,
Wanpeng Li

2017-06-14 16:18:33

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-14 22:32+0800, Wanpeng Li:
> 2017-06-14 21:20 GMT+08:00 Radim Krčmář <[email protected]>:
> > 2017-06-14 21:02+0800, Wanpeng Li:
> >> 2017-06-14 20:52 GMT+08:00 Radim Krčmář <[email protected]>:
> >> > 2017-06-14 09:07+0800, Wanpeng Li:
> >> >> 2017-06-14 2:55 GMT+08:00 Radim Krčmář <[email protected]>:
> >> >> > Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
> >> >> > exits; isn't this going to change the CR2 visible in L2 guest after a
> >> >> > nested VM entry?
> >> >>
> >> >> Sorry, I don't fully understand the question. As you know this
> >> >> vcpu->arch.cr2 which includes token is set before async pf injection,
> >> >
> >> > Yes, I'm thinking that setting vcpu->arch.cr2 is a mistake in this case.
> >> >
> >> >> and L1 will intercept it from EXIT_QUALIFICATION during nested vmexit,
> >> >
> >> > Right, so we do not need to have the token in CR2, because L1 is not
> >> > going to look at it.
> >> >
> >> >> why it can change the CR2 visible in L2 guest after a nested VM entry?
> >> >
> >> > Sorry, the situation is too convoluted to be expressed in one sentence:
> >> >
> >> > 1) L2 is running with CR2 = L2CR2
> >> > 3) VMX exits (say, unrelated EXTERNAL_INTERRUPT) and L0 stores L2CR2 in
> >> > vcpu->arch.cr2
> >> > 2) APF for L1 has completed
> >> > 4) L0 KVM wants to inject APF and sets vcpu->arch.cr2 = APFT
> >> > 5) L0 KVM does a nested VM exit to L1, EXIT_QUALIFICATION = APFT
> >> > 6) L0 KVM enters L1 with CR2 = vcpu->arch.cr2 = APFT
> >> > 7) L1 stores APFT as L2's CR2
> >> > 8) L1 handles APF, maybe reschedules, but eventually comes back to this
> >> > L2's thread
> >> > 9) after some time, L1 enters L2 with CR2 = APFT
> >> > 10) L2 is running with CR2 = APTF
> >> >
> >> > The original L2CR2 is lost and we'd introduce a bug if L2 wanted to look
> >> > at it, e.g. it was in a process of handling its #PF.
> >>
> >> Good point. What's your proposal? :)
> >
> > Get rid of async_pf. :) Optimal solutions aside, I think it would be
> > best to add a new injection function for APF. One that injects a normal
> > #PF for non-nested guests and directly triggers a #PF VM exit otherwise,
> > and call it from kvm_arch_async_page_*present().
>
> In addition, nested vmexit in kvm_arch_async_page_*present() directly
> instead of through inject_pending_event() before vmentry, or nested
> vmexit after vmexit on L0 looks strange.

Right, it might be tricky if another exception can get queued in
between. (Which shouldn't happen, though, because async_pf exceptions
must not cause double faults for no good reason.)

> So how about the proposal of
> the nested_apf_token stuff? Radim, Paolo?

I think it is worth exploring. We need to make sure that interfacing
with userspace through kvm_vcpu_ioctl_x86_{set,get}_vcpu_events() is
right, but it should be possible without any extension as migration is
already covered by unconditional async_pf wakeup on the destination.

At this point, using a structure other than arch.exception would make
sense too -- async_pf uses the exception injection path mostly for
convenience, but the paravirt exception does not want to mix with real
exceptions.
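
Just as a strawman (the names here are invented), vcpu->arch could grow
something like

	struct kvm_queued_apf {
		bool pending;
		u32 reason;	/* KVM_PV_REASON_PAGE_NOT_PRESENT or _READY */
		u32 token;
	} apf_exception;

and the injection path would look at that instead of arch.exception.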

2017-06-15 02:44:02

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 3/4] KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf

2017-06-15 0:18 GMT+08:00 Radim Krčmář <[email protected]>:
> 2017-06-14 22:32+0800, Wanpeng Li:
>> 2017-06-14 21:20 GMT+08:00 Radim Krčmář <[email protected]>:
>> > 2017-06-14 21:02+0800, Wanpeng Li:
>> >> 2017-06-14 20:52 GMT+08:00 Radim Krčmář <[email protected]>:
>> >> > 2017-06-14 09:07+0800, Wanpeng Li:
>> >> >> 2017-06-14 2:55 GMT+08:00 Radim Krčmář <[email protected]>:
>> >> >> > Using vcpu->arch.cr2 is suspicious as VMX doesn't update CR2 on VM
>> >> >> > exits; isn't this going to change the CR2 visible in L2 guest after a
>> >> >> > nested VM entry?
>> >> >>
>> >> >> Sorry, I don't fully understand the question. As you know this
>> >> >> vcpu->arch.cr2 which includes token is set before async pf injection,
>> >> >
>> >> > Yes, I'm thinking that setting vcpu->arch.cr2 is a mistake in this case.
>> >> >
>> >> >> and L1 will intercept it from EXIT_QUALIFICATION during nested vmexit,
>> >> >
>> >> > Right, so we do not need to have the token in CR2, because L1 is not
>> >> > going to look at it.
>> >> >
>> >> >> why it can change the CR2 visible in L2 guest after a nested VM entry?
>> >> >
>> >> > Sorry, the situation is too convoluted to be expressed in one sentence:
>> >> >
>> >> > 1) L2 is running with CR2 = L2CR2
>> >> > 3) VMX exits (say, unrelated EXTERNAL_INTERRUPT) and L0 stores L2CR2 in
>> >> > vcpu->arch.cr2
>> >> > 2) APF for L1 has completed
>> >> > 4) L0 KVM wants to inject APF and sets vcpu->arch.cr2 = APFT
>> >> > 5) L0 KVM does a nested VM exit to L1, EXIT_QUALIFICATION = APFT
>> >> > 6) L0 KVM enters L1 with CR2 = vcpu->arch.cr2 = APFT
>> >> > 7) L1 stores APFT as L2's CR2
>> >> > 8) L1 handles APF, maybe reschedules, but eventually comes back to this
>> >> > L2's thread
>> >> > 9) after some time, L1 enters L2 with CR2 = APFT
>> >> > 10) L2 is running with CR2 = APTF
>> >> >
>> >> > The original L2CR2 is lost and we'd introduce a bug if L2 wanted to look
>> >> > at it, e.g. it was in a process of handling its #PF.
>> >>
>> >> Good point. What's your proposal? :)
>> >
>> > Get rid of async_pf. :) Optimal solutions aside, I think it would be
>> > best to add a new injection function for APF. One that injects a normal
>> > #PF for non-nested guests and directly triggers a #PF VM exit otherwise,
>> > and call it from kvm_arch_async_page_*present().
>>
>> In addition, nested vmexit in kvm_arch_async_page_*present() directly
>> instead of through inject_pending_event() before vmentry, or nested
>> vmexit after vmexit on L0 looks strange.
>
> Right, it might be tricky if another exception can get queued in
> between. (Which shouldn't happen, though, because async_pf exceptions
> must not cause double faults for no good reason.)
>
>> So how about the proposal of
>> the nested_apf_token stuff? Radim, Paolo?
>
> I think it is worth exploring. We need to make sure that interfacing
> with userspace through kvm_vcpu_ioctl_x86_{set,get}_vcpu_events() is
> right, but it should be possible without any extension as migration is
> already covered by unconditional async_pf wakeup on the destination.
>
> At this point, using a structure other than arch.exception would make
> sense too -- async_pf uses the exception injection path mostly for
> convenience, but the paravirt exception does not want to mix with real
> exceptions.

Yeah, but that would need more restructuring; I just sent out v2 to fix it
simply and avoid being too aggressive. :)

Regards,
Wanpeng Li