UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions.
UMONITOR arms address monitoring hardware using an address. A store
to an address within the specified address range triggers the
monitoring hardware to wake up the processor waiting in UMWAIT.
UMWAIT instructs the processor to enter an implementation-dependent
optimized state while monitoring a range of addresses. The optimized
state may be either a light-weight power/performance optimized state
(C0.1 state) or an improved power/performance optimized state
(C0.2 state).
TPAUSE instructs the processor to enter an implementation-dependent
optimized state (C0.1 or C0.2) and wake up when the time-stamp counter
reaches the specified timeout.
Availability of the user wait instructions is indicated by the presence
of the CPUID feature flag WAITPKG CPUID.0x07.0x0:ECX[5].
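For illustration, a minimal user-space sketch of detecting WAITPKG and
issuing a timed wait (assumes GCC/Clang with -mwaitpkg; the intrinsics
come via x86intrin.h, the CPUID helper from cpuid.h):

#include <cpuid.h>      /* __get_cpuid_count() */
#include <x86intrin.h>  /* _tpause(), __rdtsc(); build with -mwaitpkg */
#include <stdio.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* WAITPKG is CPUID.0x07.0x0:ECX[5] */
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
            !(ecx & (1u << 5))) {
                puts("WAITPKG not supported");
                return 1;
        }

        /* Control bit 0 clear requests the deeper C0.2 state */
        _tpause(0, __rdtsc() + 100000);
        return 0;
}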
The patches enable the UMONITOR, UMWAIT and TPAUSE features in KVM.
Because UMWAIT and TPAUSE can put a (physical) CPU into a power-saving
state, by default we don't expose them to the guest and enable them only
when guest CPUID has them. If the instruction causes a delay, the amount
of time delayed is called here the physical delay. The physical delay is
first computed by determining the virtual delay (the time to delay
relative to the VM’s timestamp counter).
The reference document is linked below:
Intel 64 and IA-32 Architectures Software Developer's Manual,
https://software.intel.com/sites/default/files/\
managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
Changelog:
v7:
Add nested support for user wait instructions (Paolo)
Use the test on vmx->secondary_exec_control to replace
guest_cpuid_has (Paolo)
v6:
Add check of msr_info->host_initiated in get/set msr (Xiaoyao)
Restore the atomic_switch_umwait_control_msr() (Xiaoyao)
Tao Xu (3):
KVM: x86: add support for user wait instructions
KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL
KVM: vmx: handle vm-exit for UMWAIT and TPAUSE
arch/x86/include/asm/vmx.h | 1 +
arch/x86/include/uapi/asm/vmx.h | 6 ++-
arch/x86/kernel/cpu/umwait.c | 3 +-
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/vmx/nested.c | 4 ++
arch/x86/kvm/vmx/vmx.c | 69 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 9 +++++
arch/x86/kvm/x86.c | 1 +
8 files changed, 92 insertions(+), 3 deletions(-)
--
2.20.1
UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions.
This patch adds support for user wait instructions in KVM. Availability
of the user wait instructions is indicated by the presence of the CPUID
feature flag WAITPKG CPUID.0x07.0x0:ECX[5]. User wait instructions may
be executed at any privilege level, and use the IA32_UMWAIT_CONTROL MSR
to set the maximum wait time.
The behavior of user wait instructions in VMX non-root operation is
determined first by the setting of the "enable user wait and pause"
secondary processor-based VM-execution control (bit 26).
If the VM-execution control is 0, UMONITOR/UMWAIT/TPAUSE cause
an invalid-opcode exception (#UD).
If the VM-execution control is 1, treatment is based on the
setting of the “RDTSC exiting” VM-execution control. Because KVM never
enables RDTSC exiting, if the instruction causes a delay, the amount of
time delayed is called here the physical delay. The physical delay is
first computed by determining the virtual delay. If
IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in
EDX:EAX minus the value that RDTSC would return; if
IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay is the minimum
of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH).
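As a sketch of that computation (names are illustrative, not the actual
KVM code):

static u64 umwait_virtual_delay(u64 edx_eax, u64 guest_tsc, u32 umwait_ctrl)
{
        u64 delay, cap;

        if (edx_eax <= guest_tsc)
                return 0;               /* deadline already reached */

        delay = edx_eax - guest_tsc;    /* EDX:EAX minus RDTSC value */

        cap = umwait_ctrl & ~3u;        /* AND(IA32_UMWAIT_CONTROL, FFFFFFFCH) */
        if (cap && cap < delay)
                delay = cap;

        return delay;
}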
Because UMWAIT and TPAUSE can put a (physical) CPU into a power-saving
state, by default we don't expose them to the guest and enable them only
when guest CPUID has them.
Detailed information about user wait instructions can be found in the
latest Intel 64 and IA-32 Architectures Software Developer's Manual.
Co-developed-by: Jingqi Liu <[email protected]>
Signed-off-by: Jingqi Liu <[email protected]>
Signed-off-by: Tao Xu <[email protected]>
---
Changes in v7:
- Add nested support for user wait instructions (Paolo)
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/vmx/nested.c | 1 +
arch/x86/kvm/vmx/vmx.c | 20 ++++++++++++++++++++
4 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index a39136b0d509..8f00882664d3 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -69,6 +69,7 @@
#define SECONDARY_EXEC_PT_USE_GPA 0x01000000
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC 0x00400000
#define SECONDARY_EXEC_TSC_SCALING 0x02000000
+#define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE 0x04000000
#define PIN_BASED_EXT_INTR_MASK 0x00000001
#define PIN_BASED_NMI_EXITING 0x00000008
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4992e7c99588..7d2cd4066f64 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -402,7 +402,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ |
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
- F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B);
+ F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/;
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 46af3a5e9209..a4d5da34b306 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2048,6 +2048,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
SECONDARY_EXEC_ENABLE_INVPCID |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_XSAVES |
+ SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_ENABLE_VMFUNC);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d98eac371c0a..f411c9ae5589 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2247,6 +2247,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
SECONDARY_EXEC_RDRAND_EXITING |
SECONDARY_EXEC_ENABLE_PML |
SECONDARY_EXEC_TSC_SCALING |
+ SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
SECONDARY_EXEC_PT_USE_GPA |
SECONDARY_EXEC_PT_CONCEAL_VMX |
SECONDARY_EXEC_ENABLE_VMFUNC |
@@ -3984,6 +3985,25 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
}
}
+ if (vmcs_config.cpu_based_2nd_exec_ctrl &
+ SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE) {
+ /* Exposing WAITPKG only when WAITPKG is exposed */
+ bool waitpkg_enabled =
+ guest_cpuid_has(vcpu, X86_FEATURE_WAITPKG);
+
+ if (!waitpkg_enabled)
+ exec_control &= ~SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
+
+ if (nested) {
+ if (waitpkg_enabled)
+ vmx->nested.msrs.secondary_ctls_high |=
+ SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
+ else
+ vmx->nested.msrs.secondary_ctls_high &=
+ ~SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
+ }
+ }
+
vmx->secondary_exec_control = exec_control;
}
--
2.20.1
UMWAIT and TPAUSE instructions use IA32_UMWAIT_CONTROL at MSR index E1H
to determine the maximum time in TSC-quanta that the processor can reside
in either C0.1 or C0.2.
This patch emulates MSR IA32_UMWAIT_CONTROL in the guest and
differentiates IA32_UMWAIT_CONTROL between host and guest. The variable
umwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR
value, so this patch uses it to avoid frequent rdmsr of
IA32_UMWAIT_CONTROL.
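For reference, the control layout described by the SDM (only
MSR_IA32_UMWAIT_CONTROL is an existing name; the other macros are
illustrative):

#define MSR_IA32_UMWAIT_CONTROL         0xe1
#define UMWAIT_CTRL_C02_DISABLE         BIT(0)          /* 1 = C0.2 not allowed */
#define UMWAIT_CTRL_RESERVED            BIT(1)          /* must be zero */
#define UMWAIT_CTRL_MAX_TIME            GENMASK(31, 2)  /* max wait in TSC-quanta */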
Co-developed-by: Jingqi Liu <[email protected]>
Signed-off-by: Jingqi Liu <[email protected]>
Signed-off-by: Tao Xu <[email protected]>
---
Changes in v7:
- Use the test on vmx->secondary_exec_control to replace
guest_cpuid_has (Paolo)
---
arch/x86/kernel/cpu/umwait.c | 3 ++-
arch/x86/kvm/vmx/vmx.c | 33 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 9 +++++++++
arch/x86/kvm/x86.c | 1 +
4 files changed, 45 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/umwait.c b/arch/x86/kernel/cpu/umwait.c
index 6a204e7336c1..631152a67c6e 100644
--- a/arch/x86/kernel/cpu/umwait.c
+++ b/arch/x86/kernel/cpu/umwait.c
@@ -15,7 +15,8 @@
* Cache IA32_UMWAIT_CONTROL MSR. This is a systemwide control. By default,
* umwait max time is 100000 in TSC-quanta and C0.2 is enabled
*/
-static u32 umwait_control_cached = UMWAIT_CTRL_VAL(100000, UMWAIT_C02_ENABLE);
+u32 umwait_control_cached = UMWAIT_CTRL_VAL(100000, UMWAIT_C02_ENABLE);
+EXPORT_SYMBOL_GPL(umwait_control_cached);
/*
* Serialize access to umwait_control_cached and IA32_UMWAIT_CONTROL MSR in
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f411c9ae5589..0787f140d155 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1676,6 +1676,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
#endif
case MSR_EFER:
return kvm_get_msr_common(vcpu, msr_info);
+ case MSR_IA32_UMWAIT_CONTROL:
+ if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
+ return 1;
+
+ msr_info->data = vmx->msr_ia32_umwait_control;
+ break;
case MSR_IA32_SPEC_CTRL:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
@@ -1838,6 +1844,16 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
vmcs_write64(GUEST_BNDCFGS, data);
break;
+ case MSR_IA32_UMWAIT_CONTROL:
+ if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
+ return 1;
+
+ /* The reserved bit IA32_UMWAIT_CONTROL[1] should be zero */
+ if (data & BIT_ULL(1))
+ return 1;
+
+ vmx->msr_ia32_umwait_control = data;
+ break;
case MSR_IA32_SPEC_CTRL:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
@@ -4139,6 +4155,8 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx->rmode.vm86_active = 0;
vmx->spec_ctrl = 0;
+ vmx->msr_ia32_umwait_control = 0;
+
vcpu->arch.microcode_version = 0x100000000ULL;
vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
kvm_set_cr8(vcpu, 0);
@@ -6352,6 +6370,19 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
msrs[i].host, false);
}
+static void atomic_switch_umwait_control_msr(struct vcpu_vmx *vmx)
+{
+ if (!vmx_has_waitpkg(vmx))
+ return;
+
+ if (vmx->msr_ia32_umwait_control != umwait_control_cached)
+ add_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL,
+ vmx->msr_ia32_umwait_control,
+ umwait_control_cached, false);
+ else
+ clear_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL);
+}
+
static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
{
vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, val);
@@ -6460,6 +6491,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
atomic_switch_perf_msrs(vmx);
+ atomic_switch_umwait_control_msr(vmx);
+
vmx_update_hv_timer(vcpu);
/*
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 61128b48c503..b4ca34f7a2da 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -14,6 +14,8 @@
extern const u32 vmx_msr_index[];
extern u64 host_efer;
+extern u32 umwait_control_cached;
+
#define MSR_TYPE_R 1
#define MSR_TYPE_W 2
#define MSR_TYPE_RW 3
@@ -194,6 +196,7 @@ struct vcpu_vmx {
#endif
u64 spec_ctrl;
+ u64 msr_ia32_umwait_control;
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
@@ -523,6 +526,12 @@ static inline void decache_tsc_multiplier(struct vcpu_vmx *vmx)
vmcs_write64(TSC_MULTIPLIER, vmx->current_tsc_ratio);
}
+static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
+{
+ return vmx->secondary_exec_control &
+ SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
+}
+
void dump_vmcs(void);
#endif /* __KVM_X86_VMX_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 63bb1ee8258e..e914e4d03cce 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1136,6 +1136,7 @@ static u32 msrs_to_save[] = {
MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
+ MSR_IA32_UMWAIT_CONTROL,
};
static unsigned num_msrs_to_save;
--
2.20.1
As per the latest Intel 64 and IA-32 Architectures Software Developer's
Manual, UMWAIT and TPAUSE instructions cause a VM exit if the
"RDTSC exiting" and "enable user wait and pause" VM-execution
controls are both 1.
This patch handles the VM exits for UMWAIT and TPAUSE; since KVM never
enables RDTSC exiting, these exits should never happen.
Co-developed-by: Jingqi Liu <[email protected]>
Signed-off-by: Jingqi Liu <[email protected]>
Signed-off-by: Tao Xu <[email protected]>
---
Changes in v7:
- Add nested exit reason for UMWAIT and TPAUSE (Paolo)
---
arch/x86/include/uapi/asm/vmx.h | 6 +++++-
arch/x86/kvm/vmx/nested.c | 3 +++
arch/x86/kvm/vmx/vmx.c | 16 ++++++++++++++++
3 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index d213ec5c3766..d88d7a68849b 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -85,6 +85,8 @@
#define EXIT_REASON_PML_FULL 62
#define EXIT_REASON_XSAVES 63
#define EXIT_REASON_XRSTORS 64
+#define EXIT_REASON_UMWAIT 67
+#define EXIT_REASON_TPAUSE 68
#define VMX_EXIT_REASONS \
{ EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \
@@ -142,7 +144,9 @@
{ EXIT_REASON_RDSEED, "RDSEED" }, \
{ EXIT_REASON_PML_FULL, "PML_FULL" }, \
{ EXIT_REASON_XSAVES, "XSAVES" }, \
- { EXIT_REASON_XRSTORS, "XRSTORS" }
+ { EXIT_REASON_XRSTORS, "XRSTORS" }, \
+ { EXIT_REASON_UMWAIT, "UMWAIT" }, \
+ { EXIT_REASON_TPAUSE, "TPAUSE" }
#define VMX_ABORT_SAVE_GUEST_MSR_FAIL 1
#define VMX_ABORT_LOAD_HOST_PDPTE_FAIL 2
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a4d5da34b306..9f91f834ec43 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5213,6 +5213,9 @@ bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason)
case EXIT_REASON_ENCLS:
/* SGX is never exposed to L1 */
return false;
+ case EXIT_REASON_UMWAIT: case EXIT_REASON_TPAUSE:
+ return nested_cpu_has2(vmcs12,
+ SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE);
default:
return true;
}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0787f140d155..e026b1313dc3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5349,6 +5349,20 @@ static int handle_monitor(struct kvm_vcpu *vcpu)
return handle_nop(vcpu);
}
+static int handle_umwait(struct kvm_vcpu *vcpu)
+{
+ kvm_skip_emulated_instruction(vcpu);
+ WARN(1, "this should never happen\n");
+ return 1;
+}
+
+static int handle_tpause(struct kvm_vcpu *vcpu)
+{
+ kvm_skip_emulated_instruction(vcpu);
+ WARN(1, "this should never happen\n");
+ return 1;
+}
+
static int handle_invpcid(struct kvm_vcpu *vcpu)
{
u32 vmx_instruction_info;
@@ -5559,6 +5573,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_VMFUNC] = handle_vmx_instruction,
[EXIT_REASON_PREEMPTION_TIMER] = handle_preemption_timer,
[EXIT_REASON_ENCLS] = handle_encls,
+ [EXIT_REASON_UMWAIT] = handle_umwait,
+ [EXIT_REASON_TPAUSE] = handle_tpause,
};
static const int kvm_vmx_max_exit_handlers =
--
2.20.1
On Fri, Jul 12, 2019 at 04:29:05PM +0800, Tao Xu wrote:
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 46af3a5e9209..a4d5da34b306 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2048,6 +2048,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> SECONDARY_EXEC_ENABLE_INVPCID |
> SECONDARY_EXEC_RDTSCP |
> SECONDARY_EXEC_XSAVES |
> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
> SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
> SECONDARY_EXEC_APIC_REGISTER_VIRT |
> SECONDARY_EXEC_ENABLE_VMFUNC);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d98eac371c0a..f411c9ae5589 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2247,6 +2247,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> SECONDARY_EXEC_RDRAND_EXITING |
> SECONDARY_EXEC_ENABLE_PML |
> SECONDARY_EXEC_TSC_SCALING |
> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
> SECONDARY_EXEC_PT_USE_GPA |
> SECONDARY_EXEC_PT_CONCEAL_VMX |
> SECONDARY_EXEC_ENABLE_VMFUNC |
> @@ -3984,6 +3985,25 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
> }
> }
>
> + if (vmcs_config.cpu_based_2nd_exec_ctrl &
> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE) {
This should be aligned with the beginning of the conditional.
Alternatively, add a vmx_waitpkg_supported() helper, which is fairly
ubiquitous even when there is only a single call site.
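E.g., a sketch of such a helper (name and placement illustrative),
following the existing cpu_has_vmx_*() pattern:

static inline bool vmx_waitpkg_supported(void)
{
	return vmcs_config.cpu_based_2nd_exec_ctrl &
		SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
}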
> + /* Exposing WAITPKG only when WAITPKG is exposed */
No need for this comment. It's also oddly worded, e.g. the second
"exposed" should probably be "enabled"?
> + bool waitpkg_enabled =
> + guest_cpuid_has(vcpu, X86_FEATURE_WAITPKG);
> +
> + if (!waitpkg_enabled)
> + exec_control &= ~SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
> +
> + if (nested) {
> + if (waitpkg_enabled)
> + vmx->nested.msrs.secondary_ctls_high |=
> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
> + else
> + vmx->nested.msrs.secondary_ctls_high &=
> + ~SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
> + }
> + }
> +
> vmx->secondary_exec_control = exec_control;
> }
>
> --
> 2.20.1
>
On Fri, Jul 12, 2019 at 04:29:06PM +0800, Tao Xu wrote:
> diff --git a/arch/x86/kernel/cpu/umwait.c b/arch/x86/kernel/cpu/umwait.c
> index 6a204e7336c1..631152a67c6e 100644
> --- a/arch/x86/kernel/cpu/umwait.c
> +++ b/arch/x86/kernel/cpu/umwait.c
> @@ -15,7 +15,8 @@
> * Cache IA32_UMWAIT_CONTROL MSR. This is a systemwide control. By default,
> * umwait max time is 100000 in TSC-quanta and C0.2 is enabled
> */
> -static u32 umwait_control_cached = UMWAIT_CTRL_VAL(100000, UMWAIT_C02_ENABLE);
> +u32 umwait_control_cached = UMWAIT_CTRL_VAL(100000, UMWAIT_C02_ENABLE);
> +EXPORT_SYMBOL_GPL(umwait_control_cached);
It'd probably be better to add an accessor to expose umwait_control_cached
given that umwait.c is using {READ,WRITE}_ONCE() and there shouldn't be a
need to write it outside of umwait.c.
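E.g., a sketch of such an accessor (name illustrative), which would let
umwait_control_cached stay static:

u32 umwait_get_control(void)
{
	return READ_ONCE(umwait_control_cached);
}
EXPORT_SYMBOL_GPL(umwait_get_control);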
> /*
> * Serialize access to umwait_control_cached and IA32_UMWAIT_CONTROL MSR in
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f411c9ae5589..0787f140d155 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1676,6 +1676,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> #endif
> case MSR_EFER:
> return kvm_get_msr_common(vcpu, msr_info);
> + case MSR_IA32_UMWAIT_CONTROL:
> + if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
> + return 1;
> +
> + msr_info->data = vmx->msr_ia32_umwait_control;
> + break;
> case MSR_IA32_SPEC_CTRL:
> if (!msr_info->host_initiated &&
> !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> @@ -1838,6 +1844,16 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> return 1;
> vmcs_write64(GUEST_BNDCFGS, data);
> break;
> + case MSR_IA32_UMWAIT_CONTROL:
> + if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
> + return 1;
> +
> + /* The reserved bit IA32_UMWAIT_CONTROL[1] should be zero */
> + if (data & BIT_ULL(1))
> + return 1;
> +
> + vmx->msr_ia32_umwait_control = data;
The SDM only defines bits 31:0, and the kernel uses a u32 to cache its
value. I assume bits 63:32 are reserved? I'm guessing we also need an
SDM update...
> + break;
> case MSR_IA32_SPEC_CTRL:
> if (!msr_info->host_initiated &&
> !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> @@ -4139,6 +4155,8 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vmx->rmode.vm86_active = 0;
> vmx->spec_ctrl = 0;
>
> + vmx->msr_ia32_umwait_control = 0;
> +
> vcpu->arch.microcode_version = 0x100000000ULL;
> vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
> kvm_set_cr8(vcpu, 0);
> @@ -6352,6 +6370,19 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> msrs[i].host, false);
> }
>
On Fri, Jul 12, 2019 at 04:29:07PM +0800, Tao Xu wrote:
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -5213,6 +5213,9 @@ bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason)
> case EXIT_REASON_ENCLS:
> /* SGX is never exposed to L1 */
> return false;
> + case EXIT_REASON_UMWAIT: case EXIT_REASON_TPAUSE:
Grouped case statements are usually stacked vertically, e.g.:
case EXIT_REASON_UMWAIT:
case EXIT_REASON_TPAUSE:
> + return nested_cpu_has2(vmcs12,
> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE);
> default:
> return true;
> }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0787f140d155..e026b1313dc3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5349,6 +5349,20 @@ static int handle_monitor(struct kvm_vcpu *vcpu)
> return handle_nop(vcpu);
> }
>
> +static int handle_umwait(struct kvm_vcpu *vcpu)
> +{
> + kvm_skip_emulated_instruction(vcpu);
> + WARN(1, "this should never happen\n");
Blech. I'm guessing this code was copy-pasted from handle_xsaves() and
handle_xrstors(). The blurb of "this should never happen" isn't very
helpful, e.g. the WARN itself makes it pretty obvious that we don't expect
to reach this point. WARN_ONCE would also be preferable, no need to spam
the log in the event things go completely haywire.
Rather than propagate ugly code, what about defining a common helper, e.g.
static int handle_unexpected_vmexit(struct kvm_vcpu *vcpu)
{
kvm_skip_emulated_instruction(vcpu);
WARN_ONCE(1, "Unexpected VM-Exit = 0x%x", vmcs_read32(VM_EXIT_REASON));
return 1;
}
...
{
[EXIT_REASON_XSAVES] = handle_unexpected_vmexit,
[EXIT_REASON_XRSTORS] = handle_unexpected_vmexit,
[EXIT_REASON_UMWAIT] = handle_unexpected_vmexit,
[EXIT_REASON_TPAUSE] = handle_unexpected_vmexit,
}
> + return 1;
> +}
> +
> +static int handle_tpause(struct kvm_vcpu *vcpu)
> +{
> + kvm_skip_emulated_instruction(vcpu);
> + WARN(1, "this should never happen\n");
> + return 1;
> +}
> +
> static int handle_invpcid(struct kvm_vcpu *vcpu)
> {
> u32 vmx_instruction_info;
> @@ -5559,6 +5573,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> [EXIT_REASON_VMFUNC] = handle_vmx_instruction,
> [EXIT_REASON_PREEMPTION_TIMER] = handle_preemption_timer,
> [EXIT_REASON_ENCLS] = handle_encls,
> + [EXIT_REASON_UMWAIT] = handle_umwait,
> + [EXIT_REASON_TPAUSE] = handle_tpause,
> };
>
> static const int kvm_vmx_max_exit_handlers =
> --
> 2.20.1
>
On 7/13/2019 12:03 AM, Sean Christopherson wrote:
> On Fri, Jul 12, 2019 at 04:29:07PM +0800, Tao Xu wrote:
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -5213,6 +5213,9 @@ bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason)
>> case EXIT_REASON_ENCLS:
>> /* SGX is never exposed to L1 */
>> return false;
>> + case EXIT_REASON_UMWAIT: case EXIT_REASON_TPAUSE:
>
> Grouped case statements are usually stacked vertically, e.g.:
>
> case EXIT_REASON_UMWAIT:
> case EXIT_REASON_TPAUSE:
Ok, thank you for your suggestion.
>> + return nested_cpu_has2(vmcs12,
>> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE);
>> default:
>> return true;
>> }
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 0787f140d155..e026b1313dc3 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -5349,6 +5349,20 @@ static int handle_monitor(struct kvm_vcpu *vcpu)
>> return handle_nop(vcpu);
>> }
>>
>> +static int handle_umwait(struct kvm_vcpu *vcpu)
>> +{
>> + kvm_skip_emulated_instruction(vcpu);
>> + WARN(1, "this should never happen\n");
>
> Blech. I'm guessing this code was copy-pasted from handle_xsaves() and
> handle_xrstors(). The blurb of "this should never happen" isn't very
> helpful, e.g. the WARN itself makes it pretty obvious that we don't expect
> to reach this point. WARN_ONCE would also be preferable, no need to spam
> the log in the event things go completely haywire.
>
> Rather than propagate ugly code, what about defining a common helper, e.g.
>
> static int handle_unexpected_vmexit(struct kvm_vcpu *vcpu)
> {
> kvm_skip_emulated_instruction(vcpu);
> WARN_ONCE(1, "Unexpected VM-Exit = 0x%x", vmcs_read32(VM_EXIT_REASON));
> return 1;
> }
>
> ...
> {
> [EXIT_REASON_XSAVES] = handle_unexpected_vmexit,
> [EXIT_REASON_XRSTORS] = handle_unexpected_vmexit,
>
> [EXIT_REASON_UMWAIT] = handle_unexpected_vmexit,
> [EXIT_REASON_TPAUSE] = handle_unexpected_vmexit,
>
> }
>
Thank you Sean, I will do this in the next version of the patch.
>> + return 1;
>> +}
>> +
>> +static int handle_tpause(struct kvm_vcpu *vcpu)
>> +{
>> + kvm_skip_emulated_instruction(vcpu);
>> + WARN(1, "this should never happen\n");
>> + return 1;
>> +}
>> +
>> static int handle_invpcid(struct kvm_vcpu *vcpu)
>> {
>> u32 vmx_instruction_info;
>> @@ -5559,6 +5573,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>> [EXIT_REASON_VMFUNC] = handle_vmx_instruction,
>> [EXIT_REASON_PREEMPTION_TIMER] = handle_preemption_timer,
>> [EXIT_REASON_ENCLS] = handle_encls,
>> + [EXIT_REASON_UMWAIT] = handle_umwait,
>> + [EXIT_REASON_TPAUSE] = handle_tpause,
>> };
>>
>> static const int kvm_vmx_max_exit_handlers =
>> --
>> 2.20.1
>>
On 7/12/2019 11:13 PM, Sean Christopherson wrote:
> On Fri, Jul 12, 2019 at 04:29:05PM +0800, Tao Xu wrote:
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 46af3a5e9209..a4d5da34b306 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -2048,6 +2048,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>> SECONDARY_EXEC_ENABLE_INVPCID |
>> SECONDARY_EXEC_RDTSCP |
>> SECONDARY_EXEC_XSAVES |
>> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
>> SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
>> SECONDARY_EXEC_APIC_REGISTER_VIRT |
>> SECONDARY_EXEC_ENABLE_VMFUNC);
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d98eac371c0a..f411c9ae5589 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2247,6 +2247,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>> SECONDARY_EXEC_RDRAND_EXITING |
>> SECONDARY_EXEC_ENABLE_PML |
>> SECONDARY_EXEC_TSC_SCALING |
>> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
>> SECONDARY_EXEC_PT_USE_GPA |
>> SECONDARY_EXEC_PT_CONCEAL_VMX |
>> SECONDARY_EXEC_ENABLE_VMFUNC |
>> @@ -3984,6 +3985,25 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
>> }
>> }
>>
>> + if (vmcs_config.cpu_based_2nd_exec_ctrl &
>> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE) {
>
> This should be aligned with the beginning of the conditional.
> Alternatively, add a vmx_waitpkg_supported() helper, which is fairly
> ubiquitous even when there is only a single call site.
>
OK, thank you for your suggestion.
>> + /* Exposing WAITPKG only when WAITPKG is exposed */
> No need for this comment. It's also oddly worded, e.g. the second
> "exposed" should probably be "enabled"?
>
>> + bool waitpkg_enabled =
>> + guest_cpuid_has(vcpu, X86_FEATURE_WAITPKG);
>> +
>> + if (!waitpkg_enabled)
>> + exec_control &= ~SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
>> +
>> + if (nested) {
>> + if (waitpkg_enabled)
>> + vmx->nested.msrs.secondary_ctls_high |=
>> + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
>> + else
>> + vmx->nested.msrs.secondary_ctls_high &=
>> + ~SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
>> + }
>> + }
>> +
>> vmx->secondary_exec_control = exec_control;
>> }
>>
>> --
>> 2.20.1
>>
On 7/12/2019 11:52 PM, Sean Christopherson wrote:
> On Fri, Jul 12, 2019 at 04:29:06PM +0800, Tao Xu wrote:
>> diff --git a/arch/x86/kernel/cpu/umwait.c b/arch/x86/kernel/cpu/umwait.c
>> index 6a204e7336c1..631152a67c6e 100644
>> --- a/arch/x86/kernel/cpu/umwait.c
>> +++ b/arch/x86/kernel/cpu/umwait.c
>> @@ -15,7 +15,8 @@
>> * Cache IA32_UMWAIT_CONTROL MSR. This is a systemwide control. By default,
>> * umwait max time is 100000 in TSC-quanta and C0.2 is enabled
>> */
>> -static u32 umwait_control_cached = UMWAIT_CTRL_VAL(100000, UMWAIT_C02_ENABLE);
>> +u32 umwait_control_cached = UMWAIT_CTRL_VAL(100000, UMWAIT_C02_ENABLE);
>> +EXPORT_SYMBOL_GPL(umwait_control_cached);
>
> It'd probably be better to add an accessor to expose umwait_control_cached
> given that umwait.c is using {READ,WRITE}_ONCE() and there shouldn't be a
> need to write it outside of umwait.c.
>
Okay.
>> /*
>> * Serialize access to umwait_control_cached and IA32_UMWAIT_CONTROL MSR in
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index f411c9ae5589..0787f140d155 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -1676,6 +1676,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> #endif
>> case MSR_EFER:
>> return kvm_get_msr_common(vcpu, msr_info);
>> + case MSR_IA32_UMWAIT_CONTROL:
>> + if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
>> + return 1;
>> +
>> + msr_info->data = vmx->msr_ia32_umwait_control;
>> + break;
>> case MSR_IA32_SPEC_CTRL:
>> if (!msr_info->host_initiated &&
>> !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
>> @@ -1838,6 +1844,16 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> return 1;
>> vmcs_write64(GUEST_BNDCFGS, data);
>> break;
>> + case MSR_IA32_UMWAIT_CONTROL:
>> + if (!msr_info->host_initiated && !vmx_has_waitpkg(vmx))
>> + return 1;
>> +
>> + /* The reserved bit IA32_UMWAIT_CONTROL[1] should be zero */
>> + if (data & BIT_ULL(1))
>> + return 1;
>> +
>> + vmx->msr_ia32_umwait_control = data;
>
> The SDM only defines bits 31:0, and the kernel uses a u32 to cache its
> value. I assume bits 63:32 are reserved? I'm guessing we also need an
> SDM update...
>
The SDM defines IA32_UMWAIT_CONTROL as a 32-bit MSR. So do I need to
treat bits 63:32 as reserved?
>> + break;
>> case MSR_IA32_SPEC_CTRL:
>> if (!msr_info->host_initiated &&
>> !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
>> @@ -4139,6 +4155,8 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> vmx->rmode.vm86_active = 0;
>> vmx->spec_ctrl = 0;
>>
>> + vmx->msr_ia32_umwait_control = 0;
>> +
>> vcpu->arch.microcode_version = 0x100000000ULL;
>> vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
>> kvm_set_cr8(vcpu, 0);
>> @@ -6352,6 +6370,19 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>> msrs[i].host, false);
>> }
>>
On Mon, Jul 15, 2019 at 09:22:14AM +0800, Tao Xu wrote:
> On 7/12/2019 11:52 PM, Sean Christopherson wrote:
> >The SDM only defines bits 31:0, and the kernel uses a u32 to cache its
> >value. I assume bits 63:32 are reserved? I'm guessing we also need an
> >SDM update...
> >
>
> The SDM defines IA32_UMWAIT_CONTROL as a 32-bit MSR. So do I need to
> treat bits 63:32 as reserved?
Huh, I didn't realize the SDM allows 32-bit MSRs; I assumed all bits
needed to be explicitly defined even if the underlying implementation only
tracked 32 bits.
RDMSR:
If fewer than 64 bits are implemented in the MSR being read, the values
return in EDX:EAX in unimplemented bit locations are undefined.
WRMSR:
Undefined or reserved bits in an MSR should be set to values previously
read.
From KVM's perspective, bits 63:32 should be treated as reserved-to-zero.
This also means that struct vcpu_vmx can track a u32 instead of a u64
for msr_ia32_umwait_control.
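E.g., the vmx_set_msr() check could then become (sketch):

	/* Bit 1 and bits 63:32 are reserved, must be zero */
	if (data & (BIT_ULL(1) | GENMASK_ULL(63, 32)))
		return 1;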
On Fri, Jul 12, 2019 at 04:29:06PM +0800, Tao Xu wrote:
> UMWAIT and TPAUSE instructions use IA32_UMWAIT_CONTROL at MSR index E1H
> to determine the maximum time in TSC-quanta that the processor can reside
> in either C0.1 or C0.2.
>
> This patch emulates MSR IA32_UMWAIT_CONTROL in the guest and
> differentiates IA32_UMWAIT_CONTROL between host and guest. The variable
> umwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR
> value, so this patch uses it to avoid frequent rdmsr of
> IA32_UMWAIT_CONTROL.
>
> Co-developed-by: Jingqi Liu <[email protected]>
> Signed-off-by: Jingqi Liu <[email protected]>
> Signed-off-by: Tao Xu <[email protected]>
> ---
[...]
> +static void atomic_switch_umwait_control_msr(struct vcpu_vmx *vmx)
> +{
> + if (!vmx_has_waitpkg(vmx))
> + return;
> +
> + if (vmx->msr_ia32_umwait_control != umwait_control_cached)
> + add_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL,
> + vmx->msr_ia32_umwait_control,
> + umwait_control_cached, false);
How exactly do we ensure NR_AUTOLOAD_MSRS (8) is still large enough?
I see 3 existing add_atomic_switch_msr() calls, but the one at
atomic_switch_perf_msrs() is in a loop. Are we absolutely sure
that perf_guest_get_msrs() will never return more than 5 MSRs?
> + else
> + clear_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL);
> +}
> +
> static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
> {
> vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, val);
[...]
--
Eduardo
On 7/17/2019 12:03 AM, Eduardo Habkost wrote:
> On Fri, Jul 12, 2019 at 04:29:06PM +0800, Tao Xu wrote:
>> UMWAIT and TPAUSE instructions use IA32_UMWAIT_CONTROL at MSR index E1H
>> to determine the maximum time in TSC-quanta that the processor can reside
>> in either C0.1 or C0.2.
>>
>> This patch emulates MSR IA32_UMWAIT_CONTROL in the guest and
>> differentiates IA32_UMWAIT_CONTROL between host and guest. The variable
>> umwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR
>> value, so this patch uses it to avoid frequent rdmsr of
>> IA32_UMWAIT_CONTROL.
>>
>> Co-developed-by: Jingqi Liu <[email protected]>
>> Signed-off-by: Jingqi Liu <[email protected]>
>> Signed-off-by: Tao Xu <[email protected]>
>> ---
> [...]
>> +static void atomic_switch_umwait_control_msr(struct vcpu_vmx *vmx)
>> +{
>> + if (!vmx_has_waitpkg(vmx))
>> + return;
>> +
>> + if (vmx->msr_ia32_umwait_control != umwait_control_cached)
>> + add_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL,
>> + vmx->msr_ia32_umwait_control,
>> + umwait_control_cached, false);
>
> How exactly do we ensure NR_AUTOLOAD_MSRS (8) is still large enough?
>
> I see 3 existing add_atomic_switch_msr() calls, but the one at
> atomic_switch_perf_msrs() is in a loop. Are we absolutely sure
> that perf_guest_get_msrs() will never return more than 5 MSRs?
>
Quoting the code of intel_guest_get_msrs():
static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
{
[...]
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
if (x86_pmu.flags & PMU_FL_PEBS_ALL)
arr[0].guest &= ~cpuc->pebs_enabled;
else
arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
*nr = 1;
if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) {
[...]
arr[1].msr = MSR_IA32_PEBS_ENABLE;
arr[1].host = cpuc->pebs_enabled;
arr[1].guest = 0;
*nr = 2;
[...]
There are at most 2 MSRs now. By default umwait is disabled in KVM, so by
default there is no MSR_IA32_UMWAIT_CONTROL added into
add_atomic_switch_msr().
Thanks.
>
>> + else
>> + clear_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL);
>> +}
>> +
>> static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
>> {
>> vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, val);
> [...]
>
>
On 7/17/2019 9:17 AM, Tao Xu wrote:
> On 7/17/2019 12:03 AM, Eduardo Habkost wrote:
>> On Fri, Jul 12, 2019 at 04:29:06PM +0800, Tao Xu wrote:
>>> UMWAIT and TPAUSE instructions use IA32_UMWAIT_CONTROL at MSR index E1H
>>> to determine the maximum time in TSC-quanta that the processor can
>>> reside in either C0.1 or C0.2.
>>>
>>> This patch emulates MSR IA32_UMWAIT_CONTROL in the guest and
>>> differentiates IA32_UMWAIT_CONTROL between host and guest. The variable
>>> umwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR
>>> value, so this patch uses it to avoid frequent rdmsr of
>>> IA32_UMWAIT_CONTROL.
>>>
>>> Co-developed-by: Jingqi Liu <[email protected]>
>>> Signed-off-by: Jingqi Liu <[email protected]>
>>> Signed-off-by: Tao Xu <[email protected]>
>>> ---
>> [...]
>>> +static void atomic_switch_umwait_control_msr(struct vcpu_vmx *vmx)
>>> +{
>>> + if (!vmx_has_waitpkg(vmx))
>>> + return;
>>> +
>>> + if (vmx->msr_ia32_umwait_control != umwait_control_cached)
>>> + add_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL,
>>> + vmx->msr_ia32_umwait_control,
>>> + umwait_control_cached, false);
>>
>> How exactly do we ensure NR_AUTOLOAD_MSRS (8) is still large enough?
>>
>> I see 3 existing add_atomic_switch_msr() calls, but the one at
>> atomic_switch_perf_msrs() is in a loop. Are we absolutely sure
>> that perf_guest_get_msrs() will never return more than 5 MSRs?
>>
>
> Quoting the code of intel_guest_get_msrs():
>
> static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
> {
> [...]
> arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> if (x86_pmu.flags & PMU_FL_PEBS_ALL)
> arr[0].guest &= ~cpuc->pebs_enabled;
> else
> arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
> *nr = 1;
>
> if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) {
> [...]
> arr[1].msr = MSR_IA32_PEBS_ENABLE;
> arr[1].host = cpuc->pebs_enabled;
> arr[1].guest = 0;
> *nr = 2;
> [...]
>
> There are at most 2 MSRs now. By default umwait is disabled in KVM, so by
> default there is no MSR_IA32_UMWAIT_CONTROL added into
> add_atomic_switch_msr().
>
> Thanks.
And for old hardware, KVM uses core_guest_get_msrs(), but UMWAIT is only
available on new hardware; if the host CPU doesn't have the CPUID feature,
there is no MSR_IA32_UMWAIT_CONTROL in KVM either.
>>
>>> + else
>>> + clear_atomic_switch_msr(vmx, MSR_IA32_UMWAIT_CONTROL);
>>> +}
>>> +
>>> static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
>>> {
>>> vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, val);
>> [...]
>>
>>
>