2017-12-21 12:43:58

by Paolo Bonzini

Subject: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

That's about 800-1000 clock cycles more that can be easily peeled, by
saving about 60 VMWRITEs on every exit.

My numbers so far have been collected on a Haswell system vs. the
Broadwell that Jim used for his KVM Forum talk, and I am now down
from 22000 (compared to 18000 that Jim gave as the baseline) to 14000.
Also the guest is running 4.14, so it didn't have the XSETBV and DEBUGCTL
patches; that removes two ancillary exits to L1, each costing about 1000
cycles on my machine. So we are probably pretty close to VMware's
6500 cycles on Broadwell.

After these patches there may still be some low-hanging fruit; the remaining
large deltas between non-nested and nested workloads with lots of vmexits are:

4.80% vmx_set_cr3
4.35% native_read_msr
3.73% vmcs_load
3.65% update_permission_bitmask
2.49% _raw_spin_lock
2.37% sync_vmcs12
2.20% copy_shadow_to_vmcs12
1.19% kvm_load_guest_fpu

There is a large cost associated with resetting the MMU. Making that smarter
could probably be worth a 10-15% improvement; not easy, but actually even
more worthwhile than that on SMP nested guests because that's where the
spinlock contention comes from.

The MSR accesses are probably also interesting, but I haven't tried to see
what they are about. One somewhat crazy idea in that area is to set
CR4.FSGSBASE at vcpu_load/sched_in and clear it at vcpu_put/sched_out.
Then we could skip the costly setup of the FS/GS/kernelGS base MSRs.
However the cost of writes to CR4 might make it less appealing for
userspace exits; I haven't benchmarked it.
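
Just to illustrate the idea, something like the following, completely
untested (cr4_set_bits/cr4_clear_bits are the generic CR4 update helpers;
the exact hook points and the interaction with any host use of FSGSBASE
would still need to be checked):

	/* vcpu_load/sched_in: with CR4.FSGSBASE set, WRFSBASE/WRGSBASE
	 * could replace the FS/GS/kernelGS base MSR writes.
	 */
	cr4_set_bits(X86_CR4_FSGSBASE);

	/* vcpu_put/sched_out: clear the bit again. */
	cr4_clear_bits(X86_CR4_FSGSBASE);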

Paolo

Paolo Bonzini (4):
KVM: VMX: split list of shadowed VMCS field to a separate file
KVM: nVMX: track dirty state of non-shadowed VMCS fields
KVM: nVMX: move descriptor cache handling to prepare_vmcs02_full
KVM: nVMX: move other simple fields to prepare_vmcs02_full

arch/x86/kvm/vmx.c | 301 +++++++++++++++++++--------------------
arch/x86/kvm/vmx_shadow_fields.h | 71 +++++++++
2 files changed, 214 insertions(+), 158 deletions(-)
create mode 100644 arch/x86/kvm/vmx_shadow_fields.h

--
1.8.3.1


2017-12-21 12:44:13

by Paolo Bonzini

Subject: [PATCH 4/4] KVM: nVMX: initialize more non-shadowed fields in prepare_vmcs02_full

These fields are also simple copies of the data in the vmcs12 struct.
For some of them, prepare_vmcs02 was skipping the copy when the field
was unused. In prepare_vmcs02_full, we copy them always as long as the
field exists on the host, because the corresponding execution control
might be one of the shadowed fields.

Optimization opportunities remain for MSRs that, depending on the
entry/exit controls, have to be copied from either the vmcs01 or
the vmcs12: EFER (whose value is partly stored in the entry controls
too), PAT, DEBUGCTL (and also DR7). Before moving these three and
the entry/exit controls to prepare_vmcs02_full, KVM would have to set
dirty_vmcs12 on writes to the L1 MSRs.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx.c | 158 +++++++++++++++++++++++++++--------------------------
1 file changed, 80 insertions(+), 78 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cc42994f4ba2..e2ac5bf64ad5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10623,6 +10623,85 @@ static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
+
+ vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs);
+ vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+ vmcs12->guest_pending_dbg_exceptions);
+ vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp);
+ vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip);
+
+ if (nested_cpu_has_xsaves(vmcs12))
+ vmcs_write64(XSS_EXIT_BITMAP, vmcs12->xss_exit_bitmap);
+ vmcs_write64(VMCS_LINK_POINTER, -1ull);
+
+ /*
+ * Whether page-faults are trapped is determined by a combination of
+ * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
+ * If enable_ept, L0 doesn't care about page faults and we should
+ * set all of these to L1's desires. However, if !enable_ept, L0 does
+ * care about (at least some) page faults, and because it is not easy
+ * (if at all possible?) to merge L0 and L1's desires, we simply ask
+ * to exit on each and every L2 page fault. This is done by setting
+ * MASK=MATCH=0 and (see below) EB.PF=1.
+ * Note that below we don't need special code to set EB.PF beyond the
+ * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
+ * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
+ * !enable_ept, EB.PF is 1, so the "or" will always be 1.
+ */
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+ enable_ept ? vmcs12->page_fault_error_code_mask : 0);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+ enable_ept ? vmcs12->page_fault_error_code_match : 0);
+
+ /* All VMFUNCs are currently emulated through L0 vmexits. */
+ if (cpu_has_vmx_vmfunc())
+ vmcs_write64(VM_FUNCTION_CONTROL, 0);
+
+ if (cpu_has_vmx_apicv()) {
+ vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0);
+ vmcs_write64(EOI_EXIT_BITMAP1, vmcs12->eoi_exit_bitmap1);
+ vmcs_write64(EOI_EXIT_BITMAP2, vmcs12->eoi_exit_bitmap2);
+ vmcs_write64(EOI_EXIT_BITMAP3, vmcs12->eoi_exit_bitmap3);
+ }
+
+ /*
+ * Set host-state according to L0's settings (vmcs12 is irrelevant here)
+ * Some constant fields are set here by vmx_set_constant_host_state().
+ * Other fields are different per CPU, and will be set later when
+ * vmx_vcpu_load() is called, and when vmx_save_host_state() is called.
+ */
+ vmx_set_constant_host_state(vmx);
+
+ /*
+ * Set the MSR load/store lists to match L0's settings.
+ */
+ vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
+ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
+ vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
+ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
+ vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
+
+ set_cr4_guest_host_mask(vmx);
+
+ if (vmx_mpx_supported())
+ vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+ if (enable_vpid) {
+ if (nested_cpu_has_vpid(vmcs12) && vmx->nested.vpid02)
+ vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
+ else
+ vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+ }
+
+ /*
+ * L1 may access the L2's PDPTR, so save them to construct vmcs12
+ */
+ if (enable_ept) {
+ vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+ vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+ vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+ vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+ }
}

/*
@@ -10680,16 +10759,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
} else {
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
}
- vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs);
vmx_set_rflags(vcpu, vmcs12->guest_rflags);
- vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
- vmcs12->guest_pending_dbg_exceptions);
- vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp);
- vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip);
-
- if (nested_cpu_has_xsaves(vmcs12))
- vmcs_write64(XSS_EXIT_BITMAP, vmcs12->xss_exit_bitmap);
- vmcs_write64(VMCS_LINK_POINTER, -1ull);

exec_control = vmcs12->pin_based_vm_exec_control;

@@ -10714,25 +10784,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
if (nested_cpu_has_preemption_timer(vmcs12))
vmx_start_preemption_timer(vcpu);

- /*
- * Whether page-faults are trapped is determined by a combination of
- * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
- * If enable_ept, L0 doesn't care about page faults and we should
- * set all of these to L1's desires. However, if !enable_ept, L0 does
- * care about (at least some) page faults, and because it is not easy
- * (if at all possible?) to merge L0 and L1's desires, we simply ask
- * to exit on each and every L2 page fault. This is done by setting
- * MASK=MATCH=0 and (see below) EB.PF=1.
- * Note that below we don't need special code to set EB.PF beyond the
- * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
- * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
- * !enable_ept, EB.PF is 1, so the "or" will always be 1.
- */
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
- enable_ept ? vmcs12->page_fault_error_code_mask : 0);
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
- enable_ept ? vmcs12->page_fault_error_code_match : 0);
-
if (cpu_has_secondary_exec_ctrls()) {
exec_control = vmx->secondary_exec_control;

@@ -10751,22 +10802,9 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
exec_control |= vmcs12_exec_ctrl;
}

- /* All VMFUNCs are currently emulated through L0 vmexits. */
- if (exec_control & SECONDARY_EXEC_ENABLE_VMFUNC)
- vmcs_write64(VM_FUNCTION_CONTROL, 0);
-
- if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) {
- vmcs_write64(EOI_EXIT_BITMAP0,
- vmcs12->eoi_exit_bitmap0);
- vmcs_write64(EOI_EXIT_BITMAP1,
- vmcs12->eoi_exit_bitmap1);
- vmcs_write64(EOI_EXIT_BITMAP2,
- vmcs12->eoi_exit_bitmap2);
- vmcs_write64(EOI_EXIT_BITMAP3,
- vmcs12->eoi_exit_bitmap3);
+ if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY)
vmcs_write16(GUEST_INTR_STATUS,
vmcs12->guest_intr_status);
- }

/*
* Write an illegal value to APIC_ACCESS_ADDR. Later,
@@ -10779,24 +10817,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
}

-
- /*
- * Set host-state according to L0's settings (vmcs12 is irrelevant here)
- * Some constant fields are set here by vmx_set_constant_host_state().
- * Other fields are different per CPU, and will be set later when
- * vmx_vcpu_load() is called, and when vmx_save_host_state() is called.
- */
- vmx_set_constant_host_state(vmx);
-
- /*
- * Set the MSR load/store lists to match L0's settings.
- */
- vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
- vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.nr);
- vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
-
/*
* HOST_RSP is normally set correctly in vmx_vcpu_run() just before
* entry, but only if the current (host) sp changed from the value
@@ -10866,12 +10886,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
}

- set_cr4_guest_host_mask(vmx);
-
- if (from_vmentry &&
- vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)
- vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
-
if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
vmcs_write64(TSC_OFFSET,
vcpu->arch.tsc_offset + vmcs12->tsc_offset);
@@ -10890,13 +10904,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
* even if spawn a lot of nested vCPUs.
*/
if (nested_cpu_has_vpid(vmcs12) && vmx->nested.vpid02) {
- vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
if (vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
vmx->nested.last_vpid = vmcs12->virtual_processor_id;
__vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02, true);
}
} else {
- vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
vmx_flush_tlb(vcpu, true);
}
}
@@ -10960,16 +10972,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
if (!enable_ept)
vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;

- /*
- * L1 may access the L2's PDPTR, so save them to construct vmcs12
- */
- if (enable_ept) {
- vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
- vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
- vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
- vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
- }
-
kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
return 0;
--
1.8.3.1

2017-12-21 12:44:40

by Paolo Bonzini

Subject: [PATCH 3/4] KVM: nVMX: initialize descriptor cache fields in prepare_vmcs02_full

This part is separate for ease of review, because git prefers to move
prepare_vmcs02 below the initial long sequence of vmcs_write* operations.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx.c | 56 +++++++++++++++++++++++++++++++++---------------------
1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8b6013b529b3..cc42994f4ba2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10590,27 +10590,9 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
bool from_vmentry)
{
-}
-
-/*
- * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
- * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
- * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
- * guest in a way that will both be appropriate to L1's requests, and our
- * needs. In addition to modifying the active vmcs (which is vmcs02), this
- * function also has additional necessary side-effects, like setting various
- * vcpu->arch fields.
- * Returns 0 on success, 1 on failure. Invalid state exit qualification code
- * is assigned to entry_failure_code on failure.
- */
-static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
- bool from_vmentry, u32 *entry_failure_code)
-{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- u32 exec_control, vmcs12_exec_ctrl;

vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
- vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
@@ -10618,7 +10600,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
- vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit);
vmcs_write32(GUEST_SS_LIMIT, vmcs12->guest_ss_limit);
vmcs_write32(GUEST_DS_LIMIT, vmcs12->guest_ds_limit);
vmcs_write32(GUEST_FS_LIMIT, vmcs12->guest_fs_limit);
@@ -10628,15 +10609,12 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit);
vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit);
vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes);
- vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes);
vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes);
vmcs_write32(GUEST_GS_AR_BYTES, vmcs12->guest_gs_ar_bytes);
vmcs_write32(GUEST_LDTR_AR_BYTES, vmcs12->guest_ldtr_ar_bytes);
vmcs_write32(GUEST_TR_AR_BYTES, vmcs12->guest_tr_ar_bytes);
- vmcs_writel(GUEST_ES_BASE, vmcs12->guest_es_base);
- vmcs_writel(GUEST_CS_BASE, vmcs12->guest_cs_base);
vmcs_writel(GUEST_SS_BASE, vmcs12->guest_ss_base);
vmcs_writel(GUEST_DS_BASE, vmcs12->guest_ds_base);
vmcs_writel(GUEST_FS_BASE, vmcs12->guest_fs_base);
@@ -10645,6 +10623,40 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
+}
+
+/*
+ * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
+ * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
+ * guest in a way that will both be appropriate to L1's requests, and our
+ * needs. In addition to modifying the active vmcs (which is vmcs02), this
+ * function also has additional necessary side-effects, like setting various
+ * vcpu->arch fields.
+ * Returns 0 on success, 1 on failure. Invalid state exit qualification code
+ * is assigned to entry_failure_code on failure.
+ */
+static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
+ bool from_vmentry, u32 *entry_failure_code)
+{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+ u32 exec_control, vmcs12_exec_ctrl;
+
+ /*
+ * First, the fields that are shadowed. This must be kept in sync
+ * with vmx_shadow_fields.h.
+ */
+
+ vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
+ vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit);
+ vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
+ vmcs_writel(GUEST_ES_BASE, vmcs12->guest_es_base);
+ vmcs_writel(GUEST_CS_BASE, vmcs12->guest_cs_base);
+
+ /*
+ * Not in vmcs02: GUEST_PML_INDEX, HOST_FS_SELECTOR, HOST_GS_SELECTOR,
+ * HOST_FS_BASE, HOST_GS_BASE.
+ */

if (from_vmentry &&
(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
--
1.8.3.1


2017-12-21 12:45:03

by Paolo Bonzini

Subject: [PATCH 1/4] KVM: VMX: split list of shadowed VMCS field to a separate file

Prepare for multiple inclusions of the list.
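
Each user defines SHADOW_FIELD_RO/SHADOW_FIELD_RW to whatever it needs
before including the header; for instance, patch 2 reuses the list to
build the case labels of a switch statement in handle_vmwrite:

	switch (field) {
#define SHADOW_FIELD_RW(x) case x:
#include "vmx_shadow_fields.h"
		break;
	default:
		vmx->nested.dirty_vmcs12 = true;
		break;
	}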

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx.c | 64 +++---------------------------------
arch/x86/kvm/vmx_shadow_fields.h | 71 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+), 60 deletions(-)
create mode 100644 arch/x86/kvm/vmx_shadow_fields.h

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0ac665f88cc0..2ee842990976 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -715,71 +715,15 @@ static struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)


static u16 shadow_read_only_fields[] = {
- /*
- * We do NOT shadow fields that are modified when L0
- * traps and emulates any vmx instruction (e.g. VMPTRLD,
- * VMXON...) executed by L1.
- * For example, VM_INSTRUCTION_ERROR is read
- * by L1 if a vmx instruction fails (part of the error path).
- * Note the code assumes this logic. If for some reason
- * we start shadowing these fields then we need to
- * force a shadow sync when L0 emulates vmx instructions
- * (e.g. force a sync if VM_INSTRUCTION_ERROR is modified
- * by nested_vmx_failValid)
- */
- /* 32-bits */
- VM_EXIT_REASON,
- VM_EXIT_INTR_INFO,
- VM_EXIT_INSTRUCTION_LEN,
- IDT_VECTORING_INFO_FIELD,
- IDT_VECTORING_ERROR_CODE,
- VM_EXIT_INTR_ERROR_CODE,
-
- /* Natural width */
- EXIT_QUALIFICATION,
- GUEST_LINEAR_ADDRESS,
-
- /* 64-bit */
- GUEST_PHYSICAL_ADDRESS,
- GUEST_PHYSICAL_ADDRESS_HIGH,
+#define SHADOW_FIELD_RO(x) x,
+#include "vmx_shadow_fields.h"
};
static int max_shadow_read_only_fields =
ARRAY_SIZE(shadow_read_only_fields);

static u16 shadow_read_write_fields[] = {
- /* 16-bits */
- GUEST_CS_SELECTOR,
- GUEST_INTR_STATUS,
- GUEST_PML_INDEX,
- HOST_FS_SELECTOR,
- HOST_GS_SELECTOR,
-
- /* 32-bits */
- CPU_BASED_VM_EXEC_CONTROL,
- EXCEPTION_BITMAP,
- VM_ENTRY_EXCEPTION_ERROR_CODE,
- VM_ENTRY_INTR_INFO_FIELD,
- VM_ENTRY_INSTRUCTION_LEN,
- TPR_THRESHOLD,
- GUEST_CS_LIMIT,
- GUEST_CS_AR_BYTES,
- GUEST_INTERRUPTIBILITY_INFO,
- VMX_PREEMPTION_TIMER_VALUE,
-
- /* Natural width */
- GUEST_RIP,
- GUEST_RSP,
- GUEST_CR0,
- GUEST_CR3,
- GUEST_CR4,
- GUEST_RFLAGS,
- GUEST_CS_BASE,
- GUEST_ES_BASE,
- CR0_GUEST_HOST_MASK,
- CR0_READ_SHADOW,
- CR4_READ_SHADOW,
- HOST_FS_BASE,
- HOST_GS_BASE,
+#define SHADOW_FIELD_RW(x) x,
+#include "vmx_shadow_fields.h"
};
static int max_shadow_read_write_fields =
ARRAY_SIZE(shadow_read_write_fields);
diff --git a/arch/x86/kvm/vmx_shadow_fields.h b/arch/x86/kvm/vmx_shadow_fields.h
new file mode 100644
index 000000000000..31d7a15338ac
--- /dev/null
+++ b/arch/x86/kvm/vmx_shadow_fields.h
@@ -0,0 +1,71 @@
+#ifndef SHADOW_FIELD_RO
+#define SHADOW_FIELD_RO(x)
+#endif
+#ifndef SHADOW_FIELD_RW
+#define SHADOW_FIELD_RW(x)
+#endif
+
+/*
+ * We do NOT shadow fields that are modified when L0
+ * traps and emulates any vmx instruction (e.g. VMPTRLD,
+ * VMXON...) executed by L1.
+ * For example, VM_INSTRUCTION_ERROR is read
+ * by L1 if a vmx instruction fails (part of the error path).
+ * Note the code assumes this logic. If for some reason
+ * we start shadowing these fields then we need to
+ * force a shadow sync when L0 emulates vmx instructions
+ * (e.g. force a sync if VM_INSTRUCTION_ERROR is modified
+ * by nested_vmx_failValid)
+ *
+ * Keeping the fields ordered by size is an attempt at improving
+ * branch prediction in vmcs_read_any and vmcs_write_any.
+ */
+
+/* 16-bits */
+SHADOW_FIELD_RW(GUEST_CS_SELECTOR)
+SHADOW_FIELD_RW(GUEST_INTR_STATUS)
+SHADOW_FIELD_RW(GUEST_PML_INDEX)
+SHADOW_FIELD_RW(HOST_FS_SELECTOR)
+SHADOW_FIELD_RW(HOST_GS_SELECTOR)
+
+/* 32-bits */
+SHADOW_FIELD_RO(VM_EXIT_REASON)
+SHADOW_FIELD_RO(VM_EXIT_INTR_INFO)
+SHADOW_FIELD_RO(VM_EXIT_INSTRUCTION_LEN)
+SHADOW_FIELD_RO(IDT_VECTORING_INFO_FIELD)
+SHADOW_FIELD_RO(IDT_VECTORING_ERROR_CODE)
+SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE)
+SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL)
+SHADOW_FIELD_RW(EXCEPTION_BITMAP)
+SHADOW_FIELD_RW(VM_ENTRY_EXCEPTION_ERROR_CODE)
+SHADOW_FIELD_RW(VM_ENTRY_INTR_INFO_FIELD)
+SHADOW_FIELD_RW(VM_ENTRY_INSTRUCTION_LEN)
+SHADOW_FIELD_RW(TPR_THRESHOLD)
+SHADOW_FIELD_RW(GUEST_CS_LIMIT)
+SHADOW_FIELD_RW(GUEST_CS_AR_BYTES)
+SHADOW_FIELD_RW(GUEST_INTERRUPTIBILITY_INFO)
+SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE)
+
+/* Natural width */
+SHADOW_FIELD_RO(EXIT_QUALIFICATION)
+SHADOW_FIELD_RO(GUEST_LINEAR_ADDRESS)
+SHADOW_FIELD_RW(GUEST_RIP)
+SHADOW_FIELD_RW(GUEST_RSP)
+SHADOW_FIELD_RW(GUEST_CR0)
+SHADOW_FIELD_RW(GUEST_CR3)
+SHADOW_FIELD_RW(GUEST_CR4)
+SHADOW_FIELD_RW(GUEST_RFLAGS)
+SHADOW_FIELD_RW(GUEST_CS_BASE)
+SHADOW_FIELD_RW(GUEST_ES_BASE)
+SHADOW_FIELD_RW(CR0_GUEST_HOST_MASK)
+SHADOW_FIELD_RW(CR0_READ_SHADOW)
+SHADOW_FIELD_RW(CR4_READ_SHADOW)
+SHADOW_FIELD_RW(HOST_FS_BASE)
+SHADOW_FIELD_RW(HOST_GS_BASE)
+
+/* 64-bit */
+SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS)
+SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH)
+
+#undef SHADOW_FIELD_RO
+#undef SHADOW_FIELD_RW
--
1.8.3.1


2017-12-21 12:45:00

by Paolo Bonzini

Subject: [PATCH 2/4] KVM: nVMX: track dirty state of non-shadowed VMCS fields

VMCS12 fields that are not handled through shadow VMCS are rarely
written, and thus they are also almost constant in the vmcs02. We can
thus optimize prepare_vmcs02 by skipping all the work for non-shadowed
fields in the common case.

This patch introduces the (pretty simple) tracking infrastructure; the
next patches will move work to prepare_vmcs02_full and save a few hundred
clock cycles per VMRESUME on a Haswell Xeon E5 system:

before after
cpuid 14159 13869
vmcall 15290 14951
inl_from_kernel 17703 17447
outl_to_kernel 16011 14692
self_ipi_sti_nop 16763 15825
self_ipi_tpr_sti_nop 17341 15935
wr_tsc_adjust_msr 14510 14264
rd_tsc_adjust_msr 15018 14311
mmio-wildcard-eventfd:pci-mem 16381 14947
mmio-datamatch-eventfd:pci-mem 18620 17858
portio-wildcard-eventfd:pci-io 15121 14769
portio-datamatch-eventfd:pci-io 15761 14831

(average savings 748, stdev 460).

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2ee842990976..8b6013b529b3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -441,6 +441,7 @@ struct nested_vmx {
* data hold by vmcs12
*/
bool sync_shadow_vmcs;
+ bool dirty_vmcs12;

bool change_vmcs01_virtual_x2apic_mode;
/* L2 must run next, and mustn't decide to exit to L1. */
@@ -7879,8 +7880,10 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
{
unsigned long field;
gva_t gva;
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
/* The value to write might be 32 or 64 bits, depending on L1's long
* mode, and eventually we need to write that into a field of several
* possible lengths. The code below first zero-extends the value to 64
@@ -7923,6 +7926,20 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}

+ switch (field) {
+#define SHADOW_FIELD_RW(x) case x:
+#include "vmx_shadow_fields.h"
+ /*
+ * The fields that can be updated by L1 without a vmexit are
+ * always updated in the vmcs02, the others go down the slow
+ * path of prepare_vmcs02.
+ */
+ break;
+ default:
+ vmx->nested.dirty_vmcs12 = true;
+ break;
+ }
+
nested_vmx_succeed(vcpu);
return kvm_skip_emulated_instruction(vcpu);
}
@@ -7937,6 +7954,7 @@ static void set_current_vmptr(struct vcpu_vmx *vmx, gpa_t vmptr)
__pa(vmx->vmcs01.shadow_vmcs));
vmx->nested.sync_shadow_vmcs = true;
}
+ vmx->nested.dirty_vmcs12 = true;
}

/* Emulate the VMPTRLD instruction */
@@ -10569,6 +10587,11 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
return 0;
}

+static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
+ bool from_vmentry)
+{
+}
+
/*
* prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
* L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -10864,7 +10887,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
vmx_flush_tlb(vcpu, true);
}
-
}

if (enable_pml) {
@@ -10913,6 +10935,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
/* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
vmx_set_efer(vcpu, vcpu->arch.efer);

+ if (vmx->nested.dirty_vmcs12) {
+ prepare_vmcs02_full(vcpu, vmcs12, from_vmentry);
+ vmx->nested.dirty_vmcs12 = false;
+ }
+
/* Shadow page tables on either EPT or shadow page tables. */
if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
entry_failure_code))
--
1.8.3.1


2017-12-21 22:51:12

by Jim Mattson

Subject: Re: [PATCH 1/4] KVM: VMX: split list of shadowed VMCS field to a separate file

Reviewed-by: Jim Mattson <[email protected]>

On Thu, Dec 21, 2017 at 4:43 AM, Paolo Bonzini <[email protected]> wrote:
> Prepare for multiple inclusions of the list.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx.c | 64 +++---------------------------------
> arch/x86/kvm/vmx_shadow_fields.h | 71 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 75 insertions(+), 60 deletions(-)
> create mode 100644 arch/x86/kvm/vmx_shadow_fields.h
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 0ac665f88cc0..2ee842990976 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -715,71 +715,15 @@ static struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
>
>
> static u16 shadow_read_only_fields[] = {
> - /*
> - * We do NOT shadow fields that are modified when L0
> - * traps and emulates any vmx instruction (e.g. VMPTRLD,
> - * VMXON...) executed by L1.
> - * For example, VM_INSTRUCTION_ERROR is read
> - * by L1 if a vmx instruction fails (part of the error path).
> - * Note the code assumes this logic. If for some reason
> - * we start shadowing these fields then we need to
> - * force a shadow sync when L0 emulates vmx instructions
> - * (e.g. force a sync if VM_INSTRUCTION_ERROR is modified
> - * by nested_vmx_failValid)
> - */
> - /* 32-bits */
> - VM_EXIT_REASON,
> - VM_EXIT_INTR_INFO,
> - VM_EXIT_INSTRUCTION_LEN,
> - IDT_VECTORING_INFO_FIELD,
> - IDT_VECTORING_ERROR_CODE,
> - VM_EXIT_INTR_ERROR_CODE,
> -
> - /* Natural width */
> - EXIT_QUALIFICATION,
> - GUEST_LINEAR_ADDRESS,
> -
> - /* 64-bit */
> - GUEST_PHYSICAL_ADDRESS,
> - GUEST_PHYSICAL_ADDRESS_HIGH,
> +#define SHADOW_FIELD_RO(x) x,
> +#include "vmx_shadow_fields.h"
> };
> static int max_shadow_read_only_fields =
> ARRAY_SIZE(shadow_read_only_fields);
>
> static u16 shadow_read_write_fields[] = {
> - /* 16-bits */
> - GUEST_CS_SELECTOR,
> - GUEST_INTR_STATUS,
> - GUEST_PML_INDEX,
> - HOST_FS_SELECTOR,
> - HOST_GS_SELECTOR,
> -
> - /* 32-bits */
> - CPU_BASED_VM_EXEC_CONTROL,
> - EXCEPTION_BITMAP,
> - VM_ENTRY_EXCEPTION_ERROR_CODE,
> - VM_ENTRY_INTR_INFO_FIELD,
> - VM_ENTRY_INSTRUCTION_LEN,
> - TPR_THRESHOLD,
> - GUEST_CS_LIMIT,
> - GUEST_CS_AR_BYTES,
> - GUEST_INTERRUPTIBILITY_INFO,
> - VMX_PREEMPTION_TIMER_VALUE,
> -
> - /* Natural width */
> - GUEST_RIP,
> - GUEST_RSP,
> - GUEST_CR0,
> - GUEST_CR3,
> - GUEST_CR4,
> - GUEST_RFLAGS,
> - GUEST_CS_BASE,
> - GUEST_ES_BASE,
> - CR0_GUEST_HOST_MASK,
> - CR0_READ_SHADOW,
> - CR4_READ_SHADOW,
> - HOST_FS_BASE,
> - HOST_GS_BASE,
> +#define SHADOW_FIELD_RW(x) x,
> +#include "vmx_shadow_fields.h"
> };
> static int max_shadow_read_write_fields =
> ARRAY_SIZE(shadow_read_write_fields);
> diff --git a/arch/x86/kvm/vmx_shadow_fields.h b/arch/x86/kvm/vmx_shadow_fields.h
> new file mode 100644
> index 000000000000..31d7a15338ac
> --- /dev/null
> +++ b/arch/x86/kvm/vmx_shadow_fields.h
> @@ -0,0 +1,71 @@
> +#ifndef SHADOW_FIELD_RO
> +#define SHADOW_FIELD_RO(x)
> +#endif
> +#ifndef SHADOW_FIELD_RW
> +#define SHADOW_FIELD_RW(x)
> +#endif
> +
> +/*
> + * We do NOT shadow fields that are modified when L0
> + * traps and emulates any vmx instruction (e.g. VMPTRLD,
> + * VMXON...) executed by L1.
> + * For example, VM_INSTRUCTION_ERROR is read
> + * by L1 if a vmx instruction fails (part of the error path).
> + * Note the code assumes this logic. If for some reason
> + * we start shadowing these fields then we need to
> + * force a shadow sync when L0 emulates vmx instructions
> + * (e.g. force a sync if VM_INSTRUCTION_ERROR is modified
> + * by nested_vmx_failValid)
> + *
> + * Keeping the fields ordered by size is an attempt at improving
> + * branch prediction in vmcs_read_any and vmcs_write_any.
> + */
> +
> +/* 16-bits */
> +SHADOW_FIELD_RW(GUEST_CS_SELECTOR)
> +SHADOW_FIELD_RW(GUEST_INTR_STATUS)
> +SHADOW_FIELD_RW(GUEST_PML_INDEX)
> +SHADOW_FIELD_RW(HOST_FS_SELECTOR)
> +SHADOW_FIELD_RW(HOST_GS_SELECTOR)
> +
> +/* 32-bits */
> +SHADOW_FIELD_RO(VM_EXIT_REASON)
> +SHADOW_FIELD_RO(VM_EXIT_INTR_INFO)
> +SHADOW_FIELD_RO(VM_EXIT_INSTRUCTION_LEN)
> +SHADOW_FIELD_RO(IDT_VECTORING_INFO_FIELD)
> +SHADOW_FIELD_RO(IDT_VECTORING_ERROR_CODE)
> +SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE)
> +SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL)
> +SHADOW_FIELD_RW(EXCEPTION_BITMAP)
> +SHADOW_FIELD_RW(VM_ENTRY_EXCEPTION_ERROR_CODE)
> +SHADOW_FIELD_RW(VM_ENTRY_INTR_INFO_FIELD)
> +SHADOW_FIELD_RW(VM_ENTRY_INSTRUCTION_LEN)
> +SHADOW_FIELD_RW(TPR_THRESHOLD)
> +SHADOW_FIELD_RW(GUEST_CS_LIMIT)
> +SHADOW_FIELD_RW(GUEST_CS_AR_BYTES)
> +SHADOW_FIELD_RW(GUEST_INTERRUPTIBILITY_INFO)
> +SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE)
> +
> +/* Natural width */
> +SHADOW_FIELD_RO(EXIT_QUALIFICATION)
> +SHADOW_FIELD_RO(GUEST_LINEAR_ADDRESS)
> +SHADOW_FIELD_RW(GUEST_RIP)
> +SHADOW_FIELD_RW(GUEST_RSP)
> +SHADOW_FIELD_RW(GUEST_CR0)
> +SHADOW_FIELD_RW(GUEST_CR3)
> +SHADOW_FIELD_RW(GUEST_CR4)
> +SHADOW_FIELD_RW(GUEST_RFLAGS)
> +SHADOW_FIELD_RW(GUEST_CS_BASE)
> +SHADOW_FIELD_RW(GUEST_ES_BASE)
> +SHADOW_FIELD_RW(CR0_GUEST_HOST_MASK)
> +SHADOW_FIELD_RW(CR0_READ_SHADOW)
> +SHADOW_FIELD_RW(CR4_READ_SHADOW)
> +SHADOW_FIELD_RW(HOST_FS_BASE)
> +SHADOW_FIELD_RW(HOST_GS_BASE)
> +
> +/* 64-bit */
> +SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS)
> +SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH)
> +
> +#undef SHADOW_FIELD_RO
> +#undef SHADOW_FIELD_RW
> --
> 1.8.3.1
>
>

2017-12-21 22:57:14

by Jim Mattson

Subject: Re: [PATCH 2/4] KVM: nVMX: track dirty state of non-shadowed VMCS fields

Reviewed-by: Jim Mattson <[email protected]>

On Thu, Dec 21, 2017 at 4:43 AM, Paolo Bonzini <[email protected]> wrote:
> VMCS12 fields that are not handled through shadow VMCS are rarely
> written, and thus they are also almost constant in the vmcs02. We can
> thus optimize prepare_vmcs02 by skipping all the work for non-shadowed
> fields in the common case.
>
> This patch introduces the (pretty simple) tracking infrastructure; the
> next patches will move work to prepare_vmcs02_full and save a few hundred
> clock cycles per VMRESUME on a Haswell Xeon E5 system:
>
> before after
> cpuid 14159 13869
> vmcall 15290 14951
> inl_from_kernel 17703 17447
> outl_to_kernel 16011 14692
> self_ipi_sti_nop 16763 15825
> self_ipi_tpr_sti_nop 17341 15935
> wr_tsc_adjust_msr 14510 14264
> rd_tsc_adjust_msr 15018 14311
> mmio-wildcard-eventfd:pci-mem 16381 14947
> mmio-datamatch-eventfd:pci-mem 18620 17858
> portio-wildcard-eventfd:pci-io 15121 14769
> portio-datamatch-eventfd:pci-io 15761 14831
>
> (average savings 748, stdev 460).
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx.c | 29 ++++++++++++++++++++++++++++-
> 1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2ee842990976..8b6013b529b3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -441,6 +441,7 @@ struct nested_vmx {
> * data hold by vmcs12
> */
> bool sync_shadow_vmcs;
> + bool dirty_vmcs12;
>
> bool change_vmcs01_virtual_x2apic_mode;
> /* L2 must run next, and mustn't decide to exit to L1. */
> @@ -7879,8 +7880,10 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
> {
> unsigned long field;
> gva_t gva;
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +
> /* The value to write might be 32 or 64 bits, depending on L1's long
> * mode, and eventually we need to write that into a field of several
> * possible lengths. The code below first zero-extends the value to 64
> @@ -7923,6 +7926,20 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
> return kvm_skip_emulated_instruction(vcpu);
> }
>
> + switch (field) {
> +#define SHADOW_FIELD_RW(x) case x:
> +#include "vmx_shadow_fields.h"
> + /*
> + * The fields that can be updated by L1 without a vmexit are
> + * always updated in the vmcs02, the others go down the slow
> + * path of prepare_vmcs02.
> + */
> + break;
> + default:
> + vmx->nested.dirty_vmcs12 = true;
> + break;
> + }
> +
> nested_vmx_succeed(vcpu);
> return kvm_skip_emulated_instruction(vcpu);
> }
> @@ -7937,6 +7954,7 @@ static void set_current_vmptr(struct vcpu_vmx *vmx, gpa_t vmptr)
> __pa(vmx->vmcs01.shadow_vmcs));
> vmx->nested.sync_shadow_vmcs = true;
> }
> + vmx->nested.dirty_vmcs12 = true;
> }
>
> /* Emulate the VMPTRLD instruction */
> @@ -10569,6 +10587,11 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
> return 0;
> }
>
> +static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> + bool from_vmentry)
> +{
> +}
> +
> /*
> * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
> * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> @@ -10864,7 +10887,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> vmx_flush_tlb(vcpu, true);
> }
> -
> }
>
> if (enable_pml) {
> @@ -10913,6 +10935,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
> vmx_set_efer(vcpu, vcpu->arch.efer);
>
> + if (vmx->nested.dirty_vmcs12) {
> + prepare_vmcs02_full(vcpu, vmcs12, from_vmentry);
> + vmx->nested.dirty_vmcs12 = false;
> + }
> +
> /* Shadow page tables on either EPT or shadow page tables. */
> if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> entry_failure_code))
> --
> 1.8.3.1
>
>

2017-12-25 03:04:02

by Wanpeng Li

Subject: Re: [PATCH 2/4] KVM: nVMX: track dirty state of non-shadowed VMCS fields

2017-12-21 20:43 GMT+08:00 Paolo Bonzini <[email protected]>:
> VMCS12 fields that are not handled through shadow VMCS are rarely
> written, and thus they are also almost constant in the vmcs02. We can
> thus optimize prepare_vmcs02 by skipping all the work for non-shadowed
> fields in the common case.
>
> This patch introduces the (pretty simple) tracking infrastructure; the
> next patches will move work to prepare_vmcs02_full and save a few hundred
> clock cycles per VMRESUME on a Haswell Xeon E5 system:
>
> before after
> cpuid 14159 13869
> vmcall 15290 14951
> inl_from_kernel 17703 17447
> outl_to_kernel 16011 14692
> self_ipi_sti_nop 16763 15825
> self_ipi_tpr_sti_nop 17341 15935
> wr_tsc_adjust_msr 14510 14264
> rd_tsc_adjust_msr 15018 14311
> mmio-wildcard-eventfd:pci-mem 16381 14947
> mmio-datamatch-eventfd:pci-mem 18620 17858
> portio-wildcard-eventfd:pci-io 15121 14769
> portio-datamatch-eventfd:pci-io 15761 14831
>
> (average savings 748, stdev 460).
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/vmx.c | 29 ++++++++++++++++++++++++++++-
> 1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2ee842990976..8b6013b529b3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -441,6 +441,7 @@ struct nested_vmx {
> * data hold by vmcs12
> */
> bool sync_shadow_vmcs;
> + bool dirty_vmcs12;
>
> bool change_vmcs01_virtual_x2apic_mode;
> /* L2 must run next, and mustn't decide to exit to L1. */
> @@ -7879,8 +7880,10 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
> {
> unsigned long field;
> gva_t gva;
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +
> /* The value to write might be 32 or 64 bits, depending on L1's long
> * mode, and eventually we need to write that into a field of several
> * possible lengths. The code below first zero-extends the value to 64
> @@ -7923,6 +7926,20 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
> return kvm_skip_emulated_instruction(vcpu);
> }
>
> + switch (field) {
> +#define SHADOW_FIELD_RW(x) case x:
> +#include "vmx_shadow_fields.h"

What will happen here if enable_shadow_vmcs == false?

Regards,
Wanpeng Li

> + /*
> + * The fields that can be updated by L1 without a vmexit are
> + * always updated in the vmcs02, the others go down the slow
> + * path of prepare_vmcs02.
> + */
> + break;
> + default:
> + vmx->nested.dirty_vmcs12 = true;
> + break;
> + }
> +
> nested_vmx_succeed(vcpu);
> return kvm_skip_emulated_instruction(vcpu);
> }
> @@ -7937,6 +7954,7 @@ static void set_current_vmptr(struct vcpu_vmx *vmx, gpa_t vmptr)
> __pa(vmx->vmcs01.shadow_vmcs));
> vmx->nested.sync_shadow_vmcs = true;
> }
> + vmx->nested.dirty_vmcs12 = true;
> }
>
> /* Emulate the VMPTRLD instruction */
> @@ -10569,6 +10587,11 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
> return 0;
> }
>
> +static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> + bool from_vmentry)
> +{
> +}
> +
> /*
> * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
> * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> @@ -10864,7 +10887,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> vmx_flush_tlb(vcpu, true);
> }
> -
> }
>
> if (enable_pml) {
> @@ -10913,6 +10935,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
> vmx_set_efer(vcpu, vcpu->arch.efer);
>
> + if (vmx->nested.dirty_vmcs12) {
> + prepare_vmcs02_full(vcpu, vmcs12, from_vmentry);
> + vmx->nested.dirty_vmcs12 = false;
> + }
> +
> /* Shadow page tables on either EPT or shadow page tables. */
> if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> entry_failure_code))
> --
> 1.8.3.1
>
>

2017-12-25 03:09:12

by Wanpeng Li

Subject: Re: [PATCH 4/4] KVM: nVMX: initialize more non-shadowed fields in prepare_vmcs02_full

2017-12-21 20:43 GMT+08:00 Paolo Bonzini <[email protected]>:
> These fields are also simple copies of the data in the vmcs12 struct.
> For some of them, prepare_vmcs02 was skipping the copy when the field
> was unused. In prepare_vmcs02_full, we copy them always as long as the
> field exists on the host, because the corresponding execution control
> might be one of the shadowed fields.

Why didn't we need to always copy them before this patchset?

Regards,
Wanpeng Li

2017-12-25 10:07:14

by Wanpeng Li

Subject: Re: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

2017-12-21 20:43 GMT+08:00 Paolo Bonzini <[email protected]>:
> That's about 800-1000 clock cycles more that can be easily peeled, by
> saving about 60 VMWRITEs on every exit.
>
> My numbers so far have been collected on a Haswell system vs. the
> Broadwell that Jim used for his KVM Forum talk, and I am now down
> from 22000 (compared to 18000 that Jim gave as the baseline) to 14000.
> Also the guest is running 4.14, so it didn't have the XSETBV and DEBUGCTL
> patches; that removes two ancillary exits to L1, each costing about 1000
> cycles on my machine. So we are probably pretty close to VMware's
> 6500 cycles on Broadwell.
>
> After these patches there may still be some low-hanging fruit; the remaining
> large deltas between non-nested and nested workloads with lots of vmexits are:
>
> 4.80% vmx_set_cr3
> 4.35% native_read_msr
> 3.73% vmcs_load
> 3.65% update_permission_bitmask
> 2.49% _raw_spin_lock
> 2.37% sync_vmcs12
> 2.20% copy_shadow_to_vmcs12
> 1.19% kvm_load_guest_fpu
>
> There is a large cost associated with resetting the MMU. Making that smarter
> could probably be worth a 10-15% improvement; not easy, but actually even
> more worthwhile than that on SMP nested guests because that's where the
> spinlock contention comes from.
>
> The MSR accesses are probably also interesting, but I haven't tried to see
> what they are about. One somewhat crazy idea in that area is to set
> CR4.FSGSBASE at vcpu_load/sched_in and clear it at vcpu_put/sched_out.
> Then we could skip the costly setup of the FS/GS/kernelGS base MSRs.
> However the cost of writes to CR4 might make it less appealing for
> userspace exits; I haven't benchmarked it.
>
> Paolo
>
> Paolo Bonzini (4):
> KVM: VMX: split list of shadowed VMCS field to a separate file
> KVM: nVMX: track dirty state of non-shadowed VMCS fields
> KVM: nVMX: move descriptor cache handling to prepare_vmcs02_full
> KVM: nVMX: move other simple fields to prepare_vmcs02_full
>
> arch/x86/kvm/vmx.c | 301 +++++++++++++++++++--------------------
> arch/x86/kvm/vmx_shadow_fields.h | 71 +++++++++
> 2 files changed, 214 insertions(+), 158 deletions(-)
> create mode 100644 arch/x86/kvm/vmx_shadow_fields.h

I observe an L1 (latest kvm/queue) panic and an L0 (latest kvm/queue)
call trace; I'm not sure whether it is caused by this patchset.

L1:

[ 114.941243] BUG: unable to handle kernel paging request at ffffa6e6831dfbbe
[ 114.943423] IP: native_load_gdt+0x0/0x10
[ 114.944249] PGD 42ed2e067 P4D 42ed2e067 PUD 42ed2f067 PMD 3e7fc7067
PTE 800000040c09d163
[ 114.945911] Oops: 0009 [#1] SMP
[ 114.946615] Modules linked in: kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
binfmt_misc i2c_piix4 aes_x86_64 joydev input_leds crypto_simd
serio_raw glue_helper cryptd mac_hid parport_pc ppdev lp parport
autofs4 hid_generic usbhid hid floppy psmouse pata_acpi
[ 114.952293] CPU: 13 PID: 11077 Comm: qemu-system-x86 Not tainted
4.15.0-rc3+ #6
[ 114.954108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[ 114.957213] RIP: 0010:native_load_gdt+0x0/0x10
[ 114.958360] RSP: 0018:ffffa6e6831dfbb0 EFLAGS: 00010286
[ 114.959868] RAX: 000000000000007f RBX: ffff8a78c9620000 RCX: 00000000c0000102
[ 114.961933] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa6e6831dfbbe
[ 114.964147] RBP: ffffa6e6831dfc00 R08: 000000000000002c R09: 0000000000000001
[ 114.966190] R10: ffffa6e6831dfc18 R11: 0000000000000000 R12: 0000000000000000
[ 114.968008] R13: 0000000000000000 R14: ffff8a78ebe86000 R15: ffff8a78c9620000
[ 114.969456] FS: 00007fd452168700(0000) GS:ffff8a78ef540000(0000)
knlGS:0000000000000000
[ 114.977682] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 114.979207] CR2: ffffa6e6831dfbbe CR3: 0000000429932005 CR4: 0000000000162ee0
[ 114.980691] Call Trace:
[ 114.981209] load_fixmap_gdt+0x30/0x40
[ 114.982005] __vmx_load_host_state.part.82+0xfc/0x190 [kvm_intel]
[ 114.983367] ? __gfn_to_pfn_memslot+0x2ed/0x3c0 [kvm]
[ 114.984520] ? vmx_switch_vmcs+0x26/0x40 [kvm_intel]
[ 114.985548] vmx_switch_vmcs+0x26/0x40 [kvm_intel]
[ 114.986995] nested_vmx_vmexit+0x86/0x770 [kvm_intel]
[ 114.988640] ? enter_vmx_non_root_mode+0x720/0x10e0 [kvm_intel]
[ 114.990613] ? enter_vmx_non_root_mode+0x720/0x10e0 [kvm_intel]
[ 114.992607] ? vmx_handle_exit+0xb22/0x1530 [kvm_intel]
[ 114.994394] vmx_handle_exit+0xb22/0x1530 [kvm_intel]
[ 114.996226] ? atomic_switch_perf_msrs+0x6f/0xa0 [kvm_intel]
[ 114.998113] ? vmx_vcpu_run+0x3ae/0x4b0 [kvm_intel]
[ 114.999803] kvm_arch_vcpu_ioctl_run+0x9ed/0x15e0 [kvm]
[ 115.001595] ? file_update_time+0x60/0x110
[ 115.003001] ? kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
[ 115.004539] kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
[ 115.005982] do_vfs_ioctl+0x9f/0x5e0
[ 115.006981] ? vfs_write+0x14f/0x1a0
[ 115.007732] SyS_ioctl+0x74/0x80
[ 115.008393] entry_SYSCALL_64_fastpath+0x1e/0x81
[ 115.009345] RIP: 0033:0x7fd470412f07
[ 115.010141] RSP: 002b:00007fd452167978 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 115.011742] RAX: ffffffffffffffda RBX: 00007fd4757c8001 RCX: 00007fd470412f07
[ 115.013280] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
[ 115.014900] RBP: 0000000000000001 R08: 000055844b5c1650 R09: 0000000000000001
[ 115.016487] R10: 00007fd452167790 R11: 0000000000000246 R12: 0000000000000000
[ 115.017971] R13: 000055844b5abe40 R14: 00007fd4757c7000 R15: 000055844c277530
[ 115.023574] RIP: native_load_gdt+0x0/0x10 RSP: ffffa6e6831dfbb0
[ 115.024812] CR2: ffffa6e6831dfbbe
[ 115.025485] ---[ end trace 3d70820b36036f21 ]---


L0:

[ 149.013514] WARNING: CPU: 2 PID: 2073 at arch/x86/kvm/vmx.c:6376
handle_desc+0x2d/0x40 [kvm_intel]
[ 149.013517] Modules linked in: binfmt_misc nls_iso8859_1
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm
x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul
snd_seq_midi snd_seq_midi_event pcbc snd_rawmidi snd_seq aesni_intel
aes_x86_64 crypto_simd cryptd snd_seq_device joydev glue_helper
input_leds snd_timer snd mei_me shpchp mei soundcore wmi_bmof lpc_ich
mac_hid kvm_intel kvm irqbypass parport_pc ppdev lp parport autofs4
hid_generic usbhid hid i915 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops drm e1000e ahci ptp libahci pps_core
wmi video
[ 149.013687] CPU: 2 PID: 2073 Comm: qemu-system-x86 Tainted: G
W 4.15.0-rc3+ #1
[ 149.013690] Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY,
BIOS FBKTC1AUS 02/16/2016
[ 149.013696] RIP: 0010:handle_desc+0x2d/0x40 [kvm_intel]
[ 149.013699] RSP: 0018:ffffa2c341a07ca0 EFLAGS: 00010246
[ 149.013705] RAX: ffffffffc04d5160 RBX: 000000000000002f RCX: 0000000000000001
[ 149.013709] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffff95befbb48000
[ 149.013712] RBP: ffffa2c341a07ca0 R08: 000000008a728c6a R09: f1ee3e3400000000
[ 149.013715] R10: 0000000000000000 R11: 0000000000000001 R12: ffff95befbaa0000
[ 149.013718] R13: 0000000000000000 R14: 0000000000000000 R15: ffff95befbb48000
[ 149.013721] FS: 00007fee79ffb700(0000) GS:ffff95bf0d800000(0000)
knlGS:0000000000000000
[ 149.013724] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 149.013727] CR2: ffffffffffffe000 CR3: 00000003fba1d004 CR4: 00000000001626e0
[ 149.013730] Call Trace:
[ 149.013737] vmx_handle_exit+0xbd/0xe20 [kvm_intel]
[ 149.013752] ? kvm_arch_vcpu_ioctl_run+0xcea/0x1c20 [kvm]
[ 149.013773] kvm_arch_vcpu_ioctl_run+0xd66/0x1c20 [kvm]
[ 149.013800] kvm_vcpu_ioctl+0x3e9/0x720 [kvm]
[ 149.013813] ? kvm_vcpu_ioctl+0x3e9/0x720 [kvm]
[ 149.013821] ? __fget+0xfc/0x210
[ 149.013826] ? __fget+0xfc/0x210
[ 149.013835] do_vfs_ioctl+0xa4/0x6a0
[ 149.013840] ? __fget+0x11d/0x210
[ 149.013850] SyS_ioctl+0x79/0x90
[ 149.013859] entry_SYSCALL_64_fastpath+0x1f/0x96
[ 149.013862] RIP: 0033:0x7fee94e26f07
[ 149.013865] RSP: 002b:00007fee79ffa8b8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 149.013871] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fee94e26f07
[ 149.013874] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
[ 149.013877] RBP: 00005558752c6960 R08: 0000000000000000 R09: 0000000000000001
[ 149.013881] R10: 0000000000000058 R11: 0000000000000246 R12: 0000000000000000
[ 149.013884] R13: 00007fee974f9000 R14: 0000000000000000 R15: 00005558752c6960
[ 149.013900] Code: 44 00 00 f6 87 f1 03 00 00 08 55 48 89 e5 74 1b
45 31 c0 31 c9 31 f6 ba 10 00 00 00 e8 2d 0e f8 ff 85 c0 0f 94 c0 0f
b6 c0 5d c3 <0f> ff eb e1 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
0f 1f
[ 149.014080] ---[ end trace 53c0bffb9d8f6939 ]---

Regards,
Wanpeng Li

2017-12-25 10:08:52

by Wanpeng Li

Subject: Re: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

2017-12-25 18:07 GMT+08:00 Wanpeng Li <[email protected]>:
> 2017-12-21 20:43 GMT+08:00 Paolo Bonzini <[email protected]>:
>> That's about 800-1000 clock cycles more that can be easily peeled, by
>> saving about 60 VMWRITEs on every exit.
>>
>> My numbers so far have been collected on a Haswell system vs. the
>> Broadwell that Jim used for his KVM Forum talk, and I am now down
>> from 22000 (compared to 18000 that Jim gave as the baseline) to 14000.
>> Also the guest is running 4.14, so it didn't have the XSETBV and DEBUGCTL
>> patches; that removes two ancillary exits to L1, each costing about 1000
>> cycles on my machine. So we are probably pretty close to VMware's
>> 6500 cycles on Broadwell.
>>
>> After these patches there may still be some low-hanging fruit; the remaining
>> large deltas between non-nested and nested workloads with lots of vmexits are:
>>
>> 4.80% vmx_set_cr3
>> 4.35% native_read_msr
>> 3.73% vmcs_load
>> 3.65% update_permission_bitmask
>> 2.49% _raw_spin_lock
>> 2.37% sync_vmcs12
>> 2.20% copy_shadow_to_vmcs12
>> 1.19% kvm_load_guest_fpu
>>
>> There is a large cost associated with resetting the MMU. Making that smarter
>> could probably be worth a 10-15% improvement; not easy, but actually even
>> more worthwhile than that on SMP nested guests because that's where the
>> spinlock contention comes from.
>>
>> The MSR accesses are probably also interesting, but I haven't tried to see
>> what they are about. One somewhat crazy idea in that area is to set
>> CR4.FSGSBASE at vcpu_load/sched_in and clear it at vcpu_put/sched_out.
>> Then we could skip the costly setup of the FS/GS/kernelGS base MSRs.
>> However the cost of writes to CR4 might make it less appealing for
>> userspace exits; I haven't benchmarked it.
>>
>> Paolo
>>
>> Paolo Bonzini (4):
>> KVM: VMX: split list of shadowed VMCS field to a separate file
>> KVM: nVMX: track dirty state of non-shadowed VMCS fields
>> KVM: nVMX: move descriptor cache handling to prepare_vmcs02_full
>> KVM: nVMX: move other simple fields to prepare_vmcs02_full
>>
>> arch/x86/kvm/vmx.c | 301 +++++++++++++++++++--------------------
>> arch/x86/kvm/vmx_shadow_fields.h | 71 +++++++++
>> 2 files changed, 214 insertions(+), 158 deletions(-)
>> create mode 100644 arch/x86/kvm/vmx_shadow_fields.h
>
> I observe an L1 (latest kvm/queue) panic and an L0 (latest kvm/queue)
> call trace; I'm not sure whether it is caused by this patchset.

It can be reproduced reliably by running kvm-unit-tests in L1.

Regards,
Wanpeng Li

>
> L1:
>
> [ 114.941243] BUG: unable to handle kernel paging request at ffffa6e6831dfbbe
> [ 114.943423] IP: native_load_gdt+0x0/0x10
> [ 114.944249] PGD 42ed2e067 P4D 42ed2e067 PUD 42ed2f067 PMD 3e7fc7067
> PTE 800000040c09d163
> [ 114.945911] Oops: 0009 [#1] SMP
> [ 114.946615] Modules linked in: kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> binfmt_misc i2c_piix4 aes_x86_64 joydev input_leds crypto_simd
> serio_raw glue_helper cryptd mac_hid parport_pc ppdev lp parport
> autofs4 hid_generic usbhid hid floppy psmouse pata_acpi
> [ 114.952293] CPU: 13 PID: 11077 Comm: qemu-system-x86 Not tainted
> 4.15.0-rc3+ #6
> [ 114.954108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
> [ 114.957213] RIP: 0010:native_load_gdt+0x0/0x10
> [ 114.958360] RSP: 0018:ffffa6e6831dfbb0 EFLAGS: 00010286
> [ 114.959868] RAX: 000000000000007f RBX: ffff8a78c9620000 RCX: 00000000c0000102
> [ 114.961933] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa6e6831dfbbe
> [ 114.964147] RBP: ffffa6e6831dfc00 R08: 000000000000002c R09: 0000000000000001
> [ 114.966190] R10: ffffa6e6831dfc18 R11: 0000000000000000 R12: 0000000000000000
> [ 114.968008] R13: 0000000000000000 R14: ffff8a78ebe86000 R15: ffff8a78c9620000
> [ 114.969456] FS: 00007fd452168700(0000) GS:ffff8a78ef540000(0000)
> knlGS:0000000000000000
> [ 114.977682] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 114.979207] CR2: ffffa6e6831dfbbe CR3: 0000000429932005 CR4: 0000000000162ee0
> [ 114.980691] Call Trace:
> [ 114.981209] load_fixmap_gdt+0x30/0x40
> [ 114.982005] __vmx_load_host_state.part.82+0xfc/0x190 [kvm_intel]
> [ 114.983367] ? __gfn_to_pfn_memslot+0x2ed/0x3c0 [kvm]
> [ 114.984520] ? vmx_switch_vmcs+0x26/0x40 [kvm_intel]
> [ 114.985548] vmx_switch_vmcs+0x26/0x40 [kvm_intel]
> [ 114.986995] nested_vmx_vmexit+0x86/0x770 [kvm_intel]
> [ 114.988640] ? enter_vmx_non_root_mode+0x720/0x10e0 [kvm_intel]
> [ 114.990613] ? enter_vmx_non_root_mode+0x720/0x10e0 [kvm_intel]
> [ 114.992607] ? vmx_handle_exit+0xb22/0x1530 [kvm_intel]
> [ 114.994394] vmx_handle_exit+0xb22/0x1530 [kvm_intel]
> [ 114.996226] ? atomic_switch_perf_msrs+0x6f/0xa0 [kvm_intel]
> [ 114.998113] ? vmx_vcpu_run+0x3ae/0x4b0 [kvm_intel]
> [ 114.999803] kvm_arch_vcpu_ioctl_run+0x9ed/0x15e0 [kvm]
> [ 115.001595] ? file_update_time+0x60/0x110
> [ 115.003001] ? kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> [ 115.004539] kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> [ 115.005982] do_vfs_ioctl+0x9f/0x5e0
> [ 115.006981] ? vfs_write+0x14f/0x1a0
> [ 115.007732] SyS_ioctl+0x74/0x80
> [ 115.008393] entry_SYSCALL_64_fastpath+0x1e/0x81
> [ 115.009345] RIP: 0033:0x7fd470412f07
> [ 115.010141] RSP: 002b:00007fd452167978 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 115.011742] RAX: ffffffffffffffda RBX: 00007fd4757c8001 RCX: 00007fd470412f07
> [ 115.013280] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
> [ 115.014900] RBP: 0000000000000001 R08: 000055844b5c1650 R09: 0000000000000001
> [ 115.016487] R10: 00007fd452167790 R11: 0000000000000246 R12: 0000000000000000
> [ 115.017971] R13: 000055844b5abe40 R14: 00007fd4757c7000 R15: 000055844c277530
> [ 115.023574] RIP: native_load_gdt+0x0/0x10 RSP: ffffa6e6831dfbb0
> [ 115.024812] CR2: ffffa6e6831dfbbe
> [ 115.025485] ---[ end trace 3d70820b36036f21 ]---
>
>
> L0:
>
> [ 149.013514] WARNING: CPU: 2 PID: 2073 at arch/x86/kvm/vmx.c:6376
> handle_desc+0x2d/0x40 [kvm_intel]
> [ 149.013517] Modules linked in: binfmt_misc nls_iso8859_1
> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
> snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm
> x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul
> snd_seq_midi snd_seq_midi_event pcbc snd_rawmidi snd_seq aesni_intel
> aes_x86_64 crypto_simd cryptd snd_seq_device joydev glue_helper
> input_leds snd_timer snd mei_me shpchp mei soundcore wmi_bmof lpc_ich
> mac_hid kvm_intel kvm irqbypass parport_pc ppdev lp parport autofs4
> hid_generic usbhid hid i915 i2c_algo_bit drm_kms_helper syscopyarea
> sysfillrect sysimgblt fb_sys_fops drm e1000e ahci ptp libahci pps_core
> wmi video
> [ 149.013687] CPU: 2 PID: 2073 Comm: qemu-system-x86 Tainted: G
> W 4.15.0-rc3+ #1
> [ 149.013690] Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY,
> BIOS FBKTC1AUS 02/16/2016
> [ 149.013696] RIP: 0010:handle_desc+0x2d/0x40 [kvm_intel]
> [ 149.013699] RSP: 0018:ffffa2c341a07ca0 EFLAGS: 00010246
> [ 149.013705] RAX: ffffffffc04d5160 RBX: 000000000000002f RCX: 0000000000000001
> [ 149.013709] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffff95befbb48000
> [ 149.013712] RBP: ffffa2c341a07ca0 R08: 000000008a728c6a R09: f1ee3e3400000000
> [ 149.013715] R10: 0000000000000000 R11: 0000000000000001 R12: ffff95befbaa0000
> [ 149.013718] R13: 0000000000000000 R14: 0000000000000000 R15: ffff95befbb48000
> [ 149.013721] FS: 00007fee79ffb700(0000) GS:ffff95bf0d800000(0000)
> knlGS:0000000000000000
> [ 149.013724] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 149.013727] CR2: ffffffffffffe000 CR3: 00000003fba1d004 CR4: 00000000001626e0
> [ 149.013730] Call Trace:
> [ 149.013737] vmx_handle_exit+0xbd/0xe20 [kvm_intel]
> [ 149.013752] ? kvm_arch_vcpu_ioctl_run+0xcea/0x1c20 [kvm]
> [ 149.013773] kvm_arch_vcpu_ioctl_run+0xd66/0x1c20 [kvm]
> [ 149.013800] kvm_vcpu_ioctl+0x3e9/0x720 [kvm]
> [ 149.013813] ? kvm_vcpu_ioctl+0x3e9/0x720 [kvm]
> [ 149.013821] ? __fget+0xfc/0x210
> [ 149.013826] ? __fget+0xfc/0x210
> [ 149.013835] do_vfs_ioctl+0xa4/0x6a0
> [ 149.013840] ? __fget+0x11d/0x210
> [ 149.013850] SyS_ioctl+0x79/0x90
> [ 149.013859] entry_SYSCALL_64_fastpath+0x1f/0x96
> [ 149.013862] RIP: 0033:0x7fee94e26f07
> [ 149.013865] RSP: 002b:00007fee79ffa8b8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 149.013871] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fee94e26f07
> [ 149.013874] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
> [ 149.013877] RBP: 00005558752c6960 R08: 0000000000000000 R09: 0000000000000001
> [ 149.013881] R10: 0000000000000058 R11: 0000000000000246 R12: 0000000000000000
> [ 149.013884] R13: 00007fee974f9000 R14: 0000000000000000 R15: 00005558752c6960
> [ 149.013900] Code: 44 00 00 f6 87 f1 03 00 00 08 55 48 89 e5 74 1b
> 45 31 c0 31 c9 31 f6 ba 10 00 00 00 e8 2d 0e f8 ff 85 c0 0f 94 c0 0f
> b6 c0 5d c3 <0f> ff eb e1 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
> 0f 1f
> [ 149.014080] ---[ end trace 53c0bffb9d8f6939 ]---
>
> Regards,
> Wanpeng Li

2017-12-27 09:54:29

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 4/4] KVM: nVMX: initialize more non-shadowed fields in prepare_vmcs02_full

On 25/12/2017 04:09, Wanpeng Li wrote:
> 2017-12-21 20:43 GMT+08:00 Paolo Bonzini <[email protected]>:
>> These fields are also simple copies of the data in the vmcs12 struct.
>> For some of them, prepare_vmcs02 was skipping the copy when the field
>> was unused. In prepare_vmcs02_full, we copy them always as long as the
>> field exists on the host, because the corresponding execution control
>> might be one of the shadowed fields.
>
> Why didn't we need to always copy them before this patchset?

Before these patches, we only copied these fields if the corresponding
processor control was enabled; for example, we only copied the EOI exit
bitmap if APICv was enabled by L1. That was fine because prepare_vmcs02
redid the copy on every vmentry. With the dirty tracking introduced by
this series we could instead have

write to EOI exit bitmap (sets dirty_vmcs12)
vmlaunch (calls prepare_vmcs02_full, which clears dirty_vmcs12 but
skips the copy because APICv is off)
enable APICv (a shadowed control write, so dirty_vmcs12 stays clear)
vmresume (doesn't call prepare_vmcs02_full)

Because vmresume doesn't call prepare_vmcs02_full, the EOI exit bitmap
must be copied every time prepare_vmcs02_full runs, whether or not APICv
is currently enabled.
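
Concretely, that means something like the following in prepare_vmcs02_full
(just a sketch, not the exact hunk from patch 4/4; cpu_has_vmx_apicv()
stands in for whatever "the field exists on the host" check is used):

	/*
	 * Copy the EOI exit bitmaps whenever the host supports the fields,
	 * even if L1 has not enabled virtual interrupt delivery: a later
	 * write to the secondary controls is shadowed and therefore will
	 * not set dirty_vmcs12, so this is the only chance to do it.
	 */
	if (cpu_has_vmx_apicv()) {
		vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0);
		vmcs_write64(EOI_EXIT_BITMAP1, vmcs12->eoi_exit_bitmap1);
		vmcs_write64(EOI_EXIT_BITMAP2, vmcs12->eoi_exit_bitmap2);
		vmcs_write64(EOI_EXIT_BITMAP3, vmcs12->eoi_exit_bitmap3);
	}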

Paolo

2017-12-27 14:28:43

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

On 25/12/2017 11:08, Wanpeng Li wrote:
>> I observe L1(latest kvm/queue) panic and L0(latest kvm/queue)
>> calltrace, I'm not sure whether it is caused by this patchset.
> It can be reproduced steadily by running kvm-unit-tests in L1.

It works here; can you show the L0 call trace and/or bisect it?

Paolo

2017-12-28 02:07:13

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 4/4] KVM: nVMX: initialize more non-shadowed fields in prepare_vmcs02_full

2017-12-27 17:54 GMT+08:00 Paolo Bonzini <[email protected]>:
> On 25/12/2017 04:09, Wanpeng Li wrote:
>> 2017-12-21 20:43 GMT+08:00 Paolo Bonzini <[email protected]>:
>>> These fields are also simple copies of the data in the vmcs12 struct.
>>> For some of them, prepare_vmcs02 was skipping the copy when the field
>>> was unused. In prepare_vmcs02_full, we copy them always as long as the
>>> field exists on the host, because the corresponding execution control
>>> might be one of the shadowed fields.
>>
> >> Why didn't we need to always copy them before this patchset?
>
> Before these patches, we only copy them if the corresponding processor
> control is enabled. For example, we only copy the EOI exit bitmap if
> APICv is enabled by L1. Here we could have
>
> write to EOI exit bitmap
> vmlaunch (calls prepare_vmcs02_full)
> enable APICv (but EOI exit bitmap fields are clean)
> vmresume (doesn't call prepare_vmcs02_full)
>
> The vmresume doesn't call prepare_vmcs02_full, so the EOI exit bitmap
> must be copied every time prepare_vmcs02_full runs.

I see. What about my question on patch 2/4? When enable_shadow_vmcs ==
false, writes to vmcs fields always cause a vmexit, yet
vmx->nested.dirty_vmcs12 is still not set for the "shadow vmcs fields";
this can result in prepare_vmcs02_full() not being called even though
some processor controls were modified.

Regards,
Wanpeng Li

2017-12-28 08:39:51

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

2017-12-27 22:28 GMT+08:00 Paolo Bonzini <[email protected]>:
> On 25/12/2017 11:08, Wanpeng Li wrote:
>>> I observe L1(latest kvm/queue) panic and L0(latest kvm/queue)
>>> calltrace, I'm not sure whether it is caused by this patchset.
>> It can be reproduced steadily by running kvm-unit-tests in L1.
>
> It works here, can you show the L0 call trace and/or bisect it?

The L0 call trace has already been posted here:
https://lkml.org/lkml/2017/12/25/53
In addition, the splat is still there after I revert the last 9 nVMX
optimization patches in kvm/queue, so it is not caused by this patchset. :)

Regards,
Wanpeng Li

2017-12-31 08:08:19

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/4] KVM: nVMX: track dirty state of non-shadowed VMCS fields

On 25/12/2017 04:03, Wanpeng Li wrote:
> 2017-12-21 20:43 GMT+08:00 Paolo Bonzini <[email protected]>:
>> VMCS12 fields that are not handled through shadow VMCS are rarely
>> written, and thus they are also almost constant in the vmcs02. We can
>> thus optimize prepare_vmcs02 by skipping all the work for non-shadowed
>> fields in the common case.
>>
>> This patch introduces the (pretty simple) tracking infrastructure; the
>> next patches will move work to prepare_vmcs02_full and save a few hundred
>> clock cycles per VMRESUME on a Haswell Xeon E5 system:
>>
>> before after
>> cpuid 14159 13869
>> vmcall 15290 14951
>> inl_from_kernel 17703 17447
>> outl_to_kernel 16011 14692
>> self_ipi_sti_nop 16763 15825
>> self_ipi_tpr_sti_nop 17341 15935
>> wr_tsc_adjust_msr 14510 14264
>> rd_tsc_adjust_msr 15018 14311
>> mmio-wildcard-eventfd:pci-mem 16381 14947
>> mmio-datamatch-eventfd:pci-mem 18620 17858
>> portio-wildcard-eventfd:pci-io 15121 14769
>> portio-datamatch-eventfd:pci-io 15761 14831
>>
>> (average savings 748, stdev 460).
>>
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>> arch/x86/kvm/vmx.c | 29 ++++++++++++++++++++++++++++-
>> 1 file changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 2ee842990976..8b6013b529b3 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -441,6 +441,7 @@ struct nested_vmx {
>> * data hold by vmcs12
>> */
>> bool sync_shadow_vmcs;
>> + bool dirty_vmcs12;
>>
>> bool change_vmcs01_virtual_x2apic_mode;
>> /* L2 must run next, and mustn't decide to exit to L1. */
>> @@ -7879,8 +7880,10 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
>> {
>> unsigned long field;
>> gva_t gva;
>> + struct vcpu_vmx *vmx = to_vmx(vcpu);
>> unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>> u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>> +
>> /* The value to write might be 32 or 64 bits, depending on L1's long
>> * mode, and eventually we need to write that into a field of several
>> * possible lengths. The code below first zero-extends the value to 64
>> @@ -7923,6 +7926,20 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
>> return kvm_skip_emulated_instruction(vcpu);
>> }
>>
>> + switch (field) {
>> +#define SHADOW_FIELD_RW(x) case x:
>> +#include "vmx_shadow_fields.h"
>
> What will happen here if enable_shadow_vmcs == false?

If enable_shadow_vmcs == true, these fields never even go through
handle_vmwrite, so prepare_vmcs02 always has to write them. Other
fields do go through handle_vmwrite and set dirty_vmcs12.

If enable_shadow_vmcs == false, or if a field does not exist in the
hardware VMCS (e.g. the PML index on Haswell systems), writes to these
fields do go through handle_vmwrite, but they still don't need to
trigger prepare_vmcs02_full. The switch statement recognizes these
fields so that writes to them don't set dirty_vmcs12.
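
For reference, the header just expands into a list of case labels; its
entries look roughly like this (a sketch, with field names assumed from
the existing shadow_read_write_fields/shadow_read_only_fields tables in
vmx.c rather than the final contents of the file):

	/* vmx_shadow_fields.h (excerpt, illustrative) */
	SHADOW_FIELD_RO(VM_EXIT_REASON)
	SHADOW_FIELD_RO(VM_EXIT_INTR_INFO)
	SHADOW_FIELD_RW(GUEST_RIP)
	SHADOW_FIELD_RW(GUEST_RSP)
	SHADOW_FIELD_RW(GUEST_CR0)
	SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL)

so the "#define SHADOW_FIELD_RW(x) case x:" above turns each RW entry
into a case label of the switch in handle_vmwrite.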

Thanks,

Paolo
>
> Regards,
> Wanpeng Li
>
>> + /*
>> + * The fields that can be updated by L1 without a vmexit are
>> + * always updated in the vmcs02, the others go down the slow
>> + * path of prepare_vmcs02.
>> + */
>> + break;
>> + default:
>> + vmx->nested.dirty_vmcs12 = true;
>> + break;
>> + }
>> +
>> nested_vmx_succeed(vcpu);
>> return kvm_skip_emulated_instruction(vcpu);
>> }
>> @@ -7937,6 +7954,7 @@ static void set_current_vmptr(struct vcpu_vmx *vmx, gpa_t vmptr)
>> __pa(vmx->vmcs01.shadow_vmcs));
>> vmx->nested.sync_shadow_vmcs = true;
>> }
>> + vmx->nested.dirty_vmcs12 = true;
>> }
>>
>> /* Emulate the VMPTRLD instruction */
>> @@ -10569,6 +10587,11 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
>> return 0;
>> }
>>
>> +static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>> + bool from_vmentry)
>> +{
>> +}
>> +
>> /*
>> * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
>> * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
>> @@ -10864,7 +10887,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>> vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
>> vmx_flush_tlb(vcpu, true);
>> }
>> -
>> }
>>
>> if (enable_pml) {
>> @@ -10913,6 +10935,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>> /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
>> vmx_set_efer(vcpu, vcpu->arch.efer);
>>
>> + if (vmx->nested.dirty_vmcs12) {
>> + prepare_vmcs02_full(vcpu, vmcs12, from_vmentry);
>> + vmx->nested.dirty_vmcs12 = false;
>> + }
>> +
>> /* Shadow page tables on either EPT or shadow page tables. */
>> if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
>> entry_failure_code))
>> --
>> 1.8.3.1
>>
>>
>

2017-12-31 22:48:52

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 2/4] KVM: nVMX: track dirty state of non-shadowed VMCS fields

2017-12-31 16:08 GMT+08:00 Paolo Bonzini <[email protected]>:
> On 25/12/2017 04:03, Wanpeng Li wrote:
>> 2017-12-21 20:43 GMT+08:00 Paolo Bonzini <[email protected]>:
>>> VMCS12 fields that are not handled through shadow VMCS are rarely
>>> written, and thus they are also almost constant in the vmcs02. We can
>>> thus optimize prepare_vmcs02 by skipping all the work for non-shadowed
>>> fields in the common case.
>>>
>>> This patch introduces the (pretty simple) tracking infrastructure; the
>>> next patches will move work to prepare_vmcs02_full and save a few hundred
>>> clock cycles per VMRESUME on a Haswell Xeon E5 system:
>>>
>>> before after
>>> cpuid 14159 13869
>>> vmcall 15290 14951
>>> inl_from_kernel 17703 17447
>>> outl_to_kernel 16011 14692
>>> self_ipi_sti_nop 16763 15825
>>> self_ipi_tpr_sti_nop 17341 15935
>>> wr_tsc_adjust_msr 14510 14264
>>> rd_tsc_adjust_msr 15018 14311
>>> mmio-wildcard-eventfd:pci-mem 16381 14947
>>> mmio-datamatch-eventfd:pci-mem 18620 17858
>>> portio-wildcard-eventfd:pci-io 15121 14769
>>> portio-datamatch-eventfd:pci-io 15761 14831
>>>
>>> (average savings 748, stdev 460).
>>>
>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>> ---
>>> arch/x86/kvm/vmx.c | 29 ++++++++++++++++++++++++++++-
>>> 1 file changed, 28 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 2ee842990976..8b6013b529b3 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -441,6 +441,7 @@ struct nested_vmx {
>>> * data hold by vmcs12
>>> */
>>> bool sync_shadow_vmcs;
>>> + bool dirty_vmcs12;
>>>
>>> bool change_vmcs01_virtual_x2apic_mode;
>>> /* L2 must run next, and mustn't decide to exit to L1. */
>>> @@ -7879,8 +7880,10 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
>>> {
>>> unsigned long field;
>>> gva_t gva;
>>> + struct vcpu_vmx *vmx = to_vmx(vcpu);
>>> unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>> u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>>> +
>>> /* The value to write might be 32 or 64 bits, depending on L1's long
>>> * mode, and eventually we need to write that into a field of several
>>> * possible lengths. The code below first zero-extends the value to 64
>>> @@ -7923,6 +7926,20 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
>>> return kvm_skip_emulated_instruction(vcpu);
>>> }
>>>
>>> + switch (field) {
>>> +#define SHADOW_FIELD_RW(x) case x:
>>> +#include "vmx_shadow_fields.h"
>>
> >> What will happen here if enable_shadow_vmcs == false?
>
> If enable_shadow_vmcs == true, these fields never even go through
> handle_vmwrite, so prepare_vmcs02 always has to write them. Other
> fields do go through handle_vmwrite and set dirty_vmcs12.
>
> If enable_shadow_vmcs == false, or if a field does not exist in the
> hardware VMCS (e.g. the PML index on Haswell systems), writes to these
> fields do go through handle_vmwrite, but they still don't need to
> trigger prepare_vmcs02_full. The switch statement recognizes these
> fields so that writes to them don't set dirty_vmcs12.

You are right.

Regards,
Wanpeng Li

2018-01-01 09:36:28

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

On 28/12/2017 09:39, Wanpeng Li wrote:
> 2017-12-27 22:28 GMT+08:00 Paolo Bonzini <[email protected]>:
>> On 25/12/2017 11:08, Wanpeng Li wrote:
>>>> I observe L1(latest kvm/queue) panic and L0(latest kvm/queue)
>>>> calltrace, I'm not sure whether it is caused by this patchset.
>>> It can be reproduced steadily by running kvm-unit-tests in L1.
>>
>> It works here, can you show the L0 call trace and/or bisect it?
>
> L0 call trace has already been posted here.
> https://lkml.org/lkml/2017/12/25/53 In addition, the splatting is
> still there after I revert the last 9 nVMX optimization patches in
> kvm/queue. So it is not caused by this patchset. :)

Hmm, maybe you're using "-cpu host,+umip"? I'll check when I get back
to work tomorrow.

Paolo

2018-01-01 23:01:09

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

On 01/01/2018 10:36, Paolo Bonzini wrote:
> On 28/12/2017 09:39, Wanpeng Li wrote:
>> 2017-12-27 22:28 GMT+08:00 Paolo Bonzini <[email protected]>:
>>> On 25/12/2017 11:08, Wanpeng Li wrote:
>>>>> I observe L1(latest kvm/queue) panic and L0(latest kvm/queue)
>>>>> calltrace, I'm not sure whether it is caused by this patchset.
>>>> It can be reproduced steadily by running kvm-unit-tests in L1.
>>>
>>> It works here, can you show the L0 call trace and/or bisect it?
>>
>> L0 call trace has already been posted here.
>> https://lkml.org/lkml/2017/12/25/53 In addition, the splatting is
>> still there after I revert the last 9 nVMX optimization patches in
>> kvm/queue. So it is not caused by this patchset. :)
>
> Hmm, maybe you're using "-cpu host,+umip"? I'll check when I get back
> to work tomorrow.

Yeah, I think this could be it:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 30e6115d4f09..6404e96179b4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10780,6 +10780,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_ENABLE_INVPCID |
SECONDARY_EXEC_RDTSCP |
+ SECONDARY_EXEC_DESC |
SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_APIC_REGISTER_VIRT |

Paolo

2018-01-02 01:05:35

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

2018-01-02 7:01 GMT+08:00 Paolo Bonzini <[email protected]>:
> On 01/01/2018 10:36, Paolo Bonzini wrote:
>> On 28/12/2017 09:39, Wanpeng Li wrote:
>>> 2017-12-27 22:28 GMT+08:00 Paolo Bonzini <[email protected]>:
>>>> On 25/12/2017 11:08, Wanpeng Li wrote:
>>>>>> I observe L1(latest kvm/queue) panic and L0(latest kvm/queue)
>>>>>> calltrace, I'm not sure whether it is caused by this patchset.
>>>>> It can be reproduced steadily by running kvm-unit-tests in L1.
>>>>
>>>> It works here, can you show the L0 call trace and/or bisect it?
>>>
>>> L0 call trace has already been posted here.
>>> https://lkml.org/lkml/2017/12/25/53 In addition, the splatting is
>>> still there after I revert the last 9 nVMX optimization patches in
>>> kvm/queue. So it is not caused by this patchset. :)
>>
>> Hmm, maybe you're using "-cpu host,+umip"? I'll check when I get back
>> to work tomorrow.
>
> Yeah, I think this could be it:
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 30e6115d4f09..6404e96179b4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -10780,6 +10780,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
> SECONDARY_EXEC_ENABLE_INVPCID |
> SECONDARY_EXEC_RDTSCP |
> + SECONDARY_EXEC_DESC |
> SECONDARY_EXEC_XSAVES |
> SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
> SECONDARY_EXEC_APIC_REGISTER_VIRT |

The issue is still there after applying this to both L0 and L1;
actually, I can observe a vmentry failure just before the splat in L0
both with and without the above change. In addition, I commented out
all the testcases in unittests.cfg except vmx_controls, then ran
./run_tests.sh.

[334079.689931] nested_vmx_exit_reflected failed vm entry 7
[334079.689980] WARNING: CPU: 6 PID: 6911 at
/home/kernel/data/kvm/arch/x86/kvm//vmx.c:6376 handle_desc+0x2d/0x40
[kvm_intel]
[334079.689982] Modules linked in: kvm_intel(OE) kvm(OE) binfmt_misc
nls_iso8859_1 snd_hda_codec_hdmi snd_hda_codec_realtek
snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core
snd_hwdep x86_pkg_temp_thermal snd_pcm intel_powerclamp coretemp
crc32_pclmul snd_seq_midi pcbc snd_seq_midi_event snd_rawmidi
aesni_intel snd_seq aes_x86_64 crypto_simd cryptd glue_helper joydev
input_leds wmi_bmof snd_seq_device snd_timer mei_me snd mei shpchp
lpc_ich soundcore mac_hid irqbypass parport_pc ppdev lp parport
autofs4 hid_generic usbhid hid i915 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops drm e1000e ahci ptp
libahci pps_core wmi video [last unloaded: kvm]
[334079.690080] CPU: 6 PID: 6911 Comm: qemu-system-x86 Tainted: G
OE 4.15.0-rc3+ #1
[334079.690082] Hardware name: LENOVO ThinkCentre
M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
[334079.690086] RIP: 0010:handle_desc+0x2d/0x40 [kvm_intel]
[334079.690088] RSP: 0018:ffffaf010029bca0 EFLAGS: 00010246
[334079.690091] RAX: ffffffffc0785160 RBX: 000000000000002e RCX:
0000000000000001
[334079.690093] RDX: 0000000000000000 RSI: 00000000ffffffff RDI:
ffffa0512c740000
[334079.690094] RBP: ffffaf010029bca0 R08: 00000000c61dd84f R09:
1e890e0400000000
[334079.690096] R10: 0000000000000000 R11: 0000000000000001 R12:
ffffa051aa6c0000
[334079.690098] R13: 0000000000000000 R14: 0000000000000001 R15:
ffffa0512c740000
[334079.690100] FS: 00007f1cf5450700(0000) GS:ffffa051ce000000(0000)
knlGS:0000000000000000
[334079.690102] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[334079.690104] CR2: 0000000000000000 CR3: 00000003eb98d006 CR4:
00000000001626e0
[334079.690106] Call Trace:
[334079.690111] vmx_handle_exit+0xbd/0xe20 [kvm_intel]
[334079.690131] ? kvm_arch_vcpu_ioctl_run+0xcea/0x1c20 [kvm]
[334079.690144] kvm_arch_vcpu_ioctl_run+0xd66/0x1c20 [kvm]
[334079.690159] kvm_vcpu_ioctl+0x3e9/0x720 [kvm]
[334079.690167] ? kvm_vcpu_ioctl+0x3e9/0x720 [kvm]
[334079.690173] ? __fget+0xfc/0x210
[334079.690176] ? __fget+0xfc/0x210
[334079.690181] do_vfs_ioctl+0xa4/0x6a0
[334079.690184] ? __fget+0x11d/0x210
[334079.690190] SyS_ioctl+0x79/0x90
[334079.690195] entry_SYSCALL_64_fastpath+0x1f/0x96
[334079.690197] RIP: 0033:0x7f1d01faef07
[334079.690199] RSP: 002b:00007f1cf544f8b8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[334079.690203] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX:
00007f1d01faef07
[334079.690204] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
0000000000000011
[334079.690206] RBP: 0000561713c40f70 R08: 0000000000000000 R09:
0000000000000001
[334079.690208] R10: 0000000000000058 R11: 0000000000000246 R12:
0000000000000000
[334079.690209] R13: 00007f1d0468d000 R14: 0000000000000000 R15:
0000561713c40f70
[334079.690218] Code: 44 00 00 f6 87 f1 03 00 00 08 55 48 89 e5 74 1b
45 31 c0 31 c9 31 f6 ba 10 00 00 00 e8 2d 4e dc ff 85 c0 0f 94 c0 0f
b6 c0 5d c3 <0f> ff eb e1 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
0f 1f
[334079.690314] ---[ end trace 16ed3250b9b651d2 ]---

Regards,
Wanpeng Li

2018-01-02 13:02:29

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/4] KVM: nVMX: prepare_vmcs02 optimizations

On 02/01/2018 02:05, Wanpeng Li wrote:
> The issue is still there after applying this to both L0 and L1;
> actually, I can observe a vmentry failure just before the splat in L0
> both with and without the above change. In addition, I commented out
> all the testcases in unittests.cfg except vmx_controls, then ran
> ./run_tests.sh.

Ok, I hadn't realized that you're running vmx.flat inside L1 (thus
creating an L3 guest). The patch I sent is wrong, because
SECONDARY_EXEC_DESC should always be clear in vmx->secondary_exec_control.
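
One way to keep the bit out of secondary_exec_control while still
supporting UMIP emulation is to toggle it directly in the VMCS from
vmx_set_cr4 (just a sketch against the current vmx.c; the exact guard
for nested guests is an assumption):

	/* In vmx_set_cr4(): set descriptor-table exiting only while it
	 * is needed to emulate UMIP, and clear it otherwise, but don't
	 * touch it while running an L2 whose vmcs12 asked for it. */
	if (cr4 & X86_CR4_UMIP) {
		vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
			      SECONDARY_EXEC_DESC);
		hw_cr4 &= ~X86_CR4_UMIP;
	} else if (!is_guest_mode(vcpu) ||
		   !nested_cpu_has2(get_vmcs12(vcpu), SECONDARY_EXEC_DESC)) {
		vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
				SECONDARY_EXEC_DESC);
	}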

Note that even with "-cpu host,+vmx,-umip" I get a hang at

PASS: Use TPR shadow enabled: TPR threshold 0x10: vmlaunch fails

So that might not be entirely related. However, the failures with
descriptor exiting are new.

Paolo