Currently, issuing an IPI (other than a self-IPI) in a guest on an
Intel CPU always causes a VM-exit. This can add non-negligible overhead
to workloads that send IPIs frequently when running in VMs.
IPI virtualization is a new VT-x feature that aims to eliminate
VM-exits on the source vCPU when it issues unicast, physical-addressing
IPIs. Once enabled, the processor virtualizes the following kinds of
IPI-sending operations without causing VM-exits (see the sketch after
this list):
- Memory-mapped ICR writes
- MSR-mapped ICR writes
- SENDUIPI execution
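As a concrete illustration, the snippet below shows the kind of
guest-side operation that becomes exit-free: a fixed-vector,
physical-destination, unicast IPI sent through a WRMSR to the x2APIC
ICR. This is only a sketch of guest behavior, not code from this
series; X2APIC_ICR_MSR is defined here for illustration, and
APIC_DM_FIXED/APIC_DEST_PHYSICAL are the guest kernel's usual
apicdef.h constants.

        /*
         * Illustrative guest-side IPI send in x2APIC mode. With IPI
         * virtualization enabled, this WRMSR is handled entirely in
         * VMX non-root operation, i.e. no VM-exit on the sending vCPU.
         */
        #define X2APIC_ICR_MSR  0x830   /* MSR-mapped ICR */

        static void send_unicast_ipi(u32 dest_apic_id, u8 vector)
        {
                u64 icr = ((u64)dest_apic_id << 32) |   /* destination field */
                          APIC_DEST_PHYSICAL |          /* physical addressing */
                          APIC_DM_FIXED |               /* fixed delivery mode */
                          vector;

                wrmsrl(X2APIC_ICR_MSR, icr);
        }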
This patch series implements IPI virtualization support in KVM.
Patches 1-4 add the framework for tertiary processor-based
VM-execution controls, which is used to enumerate IPI virtualization.
Patch 5 handles the APIC-write VM-exit caused by writes to the ICR MSR
when the guest runs in x2APIC mode. This is a new case introduced by
Intel VT-x.
Patch 6 implements the IPI virtualization functionality itself:
enabling the feature through the tertiary processor-based VM-execution
controls in the various VMCS configuration scenarios, setting up the
PID table at vCPU creation, and accounting for vCPU blocking.
Patch 7 lets the guest modify a vCPU's APIC ID at runtime in xAPIC mode
when IPI virtualization is enabled.
Patch 8 changes the memory allocation policy to resize the PID-pointer
table on demand, reducing the overall memory footprint.
Documentation for IPI virtualization is available in the latest "Intel
Architecture Instruction Set Extensions Programming Reference".
Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
We ran an experiment with a KVM unit test to measure the average time
from a source vCPU sending an IPI to the target vCPU completing the IPI
handling, with and without IPI virtualization. With IPI virtualization
enabled, cycle consumption is reduced by 22.21% in xAPIC mode and
15.98% in x2APIC mode.
--------------------------------------
KVM unit test: vmexit/ipi
2 vCPUs; the AP was modified to run in an idle loop instead of halting,
to ensure no VM-exit impact on the target vCPU.
Cycles of IPI
xAPIC mode x2APIC mode
test w/o IPIv w/ IPIv w/o IPIv w/ IPIv
1 6106 4816 4265 3768
2 6244 4656 4404 3546
3 6165 4658 4233 3474
4 5992 4710 4363 3430
5 6083 4741 4215 3551
6 6238 4904 4304 3547
7 6164 4617 4263 3709
8 5984 4763 4518 3779
9 5931 4712 4645 3667
10 5955 4530 4332 3724
11 5897 4673 4283 3569
12 6140 4794 4178 3598
13 6183 4728 4363 3628
14 5991 4994 4509 3842
15 5866 4665 4520 3739
16 6032 4654 4229 3701
17 6050 4653 4185 3726
18 6004 4792 4319 3746
19 5961 4626 4196 3392
20 6194 4576 4433 3760
Average cycles 6059 4713.1 4337.85 3644.8
%Reduction -22.21% -15.98%
--------------------------------------
IPI microbenchmark:
(https://lore.kernel.org/kvm/[email protected])
2 vCPUs, 1:1 pin vCPU to pCPU, guest VM runs with idle=poll, x2APIC mode
Result with IPIv enabled:
Dry-run: 0, 272798 ns
Self-IPI: 5094123, 11114037 ns
Normal IPI: 131697087, 173321200 ns
Broadcast IPI: 0, 155649075 ns
Broadcast lock: 0, 161518031 ns
Result with IPIv disabled:
Dry-run: 0, 272766 ns
Self-IPI: 5091788, 11123699 ns
Normal IPI: 145215772, 174558920 ns
Broadcast IPI: 0, 175785384 ns
Broadcast lock: 0, 149076195 ns
Since IPIv benefits unicast IPIs to other CPUs, the Normal IPI test
case gains about 9.73% time saving on average over 15 test runs when
IPIv is enabled.
Normal IPI statistics (unit:ns):
test w/o IPIv w/ IPIv
1 153346049 140907046
2 147218648 141660618
3 145215772 117890672
4 146621682 136430470
5 144821472 136199421
6 144704378 131676928
7 141403224 131697087
8 144775766 125476250
9 140658192 137263330
10 144768626 138593127
11 145166679 131946752
12 145020451 116852889
13 148161353 131406280
14 148378655 130174353
15 148903652 127969674
Average time 145944306.6 131742993.1 ns
%Reduction -9.73%
--------------------------------------
hackbench:
8 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 100000
w/o IPIv w/ IPIv
Time 91.887 74.605
%Reduction -18.808%
96 vCPUs, guest VM free run, x2APIC mode
./hackbench -p -l 1000000
w/o IPIv w/ IPIv
Time 287.504 235.185
%Reduction -18.198%
--------------------------------------
v4 -> v5:
1. Handle the enable_ipiv parameter following the current
   VMCS configuration rules.
2. Allocate memory for PID-pointer table dynamically
3. Support guest runtime modify APIC ID in xAPIC mode
4. Add a helper to determine whether PI blocking can be taken in the IPIv case
v3 -> v4:
1. Refine code style of patch 2
2. Move tertiary control shadow build into patch 3
3. Make vmx_tertiary_exec_control a static function
v2 -> v3:
1. Misc changes to the tertiary execution control
   definition and capability setup
2. Alternative way to get the tertiary execution
   control configuration
v1 -> v2:
1. Refine the IPIv enabling logic for the VM.
   Remove the per-vCPU ipiv_active definition.
--------------------------------------
Gao Chao (1):
KVM: VMX: enable IPI virtualization
Robert Hoo (4):
x86/cpu: Add new VMX feature, Tertiary VM-Execution control
KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit
variation
KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
KVM: VMX: dump_vmcs() reports tertiary_exec_control field as well
Zeng Guang (3):
KVM: x86: Support interrupt dispatch in x2APIC mode with APIC-write VM
exit
KVM: VMX: Update PID-pointer table entry when APIC ID is changed
KVM: VMX: Resize PID-pointer table on demand for IPI virtualization
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 4 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 11 ++
arch/x86/include/asm/vmxfeatures.h | 5 +-
arch/x86/kernel/cpu/feat_ctl.c | 9 +-
arch/x86/kvm/lapic.c | 19 ++-
arch/x86/kvm/lapic.h | 5 +
arch/x86/kvm/vmx/capabilities.h | 14 +++
arch/x86/kvm/vmx/evmcs.c | 2 +
arch/x86/kvm/vmx/evmcs.h | 1 +
arch/x86/kvm/vmx/posted_intr.c | 9 +-
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmx.c | 184 +++++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmx.h | 69 ++++++-----
arch/x86/kvm/x86.c | 2 +
16 files changed, 292 insertions(+), 45 deletions(-)
--
2.27.0
From: Robert Hoo <[email protected]>
A new 64-bit control field, "tertiary processor-based VM-execution
controls", is defined [1]. It is activated by bit 17 of the primary
processor-based VM-execution controls.
Unlike its sibling VM-execution control fields, this tertiary field is
64 bits wide, so it occupies two vmx_feature_leafs entries,
TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.
Its companion VMX capability reporting MSR, MSR_IA32_VMX_PROCBASED_CTLS3
(0x492), is also semantically different from its siblings: all 64 bits
are allowed-1 settings, rather than a 32-bit allowed-0 half plus a
32-bit allowed-1 half [1][2]. Therefore, its handling in
init_vmx_capabilities() differs slightly from the others (see the
sketch after the references).
[1] ISE 6.2 "VMCS Changes"
https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
[2] SDM Vol3. Appendix A.3
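For reference, this is the semantic difference in how the capability
MSRs decode. The snippet is a hand-written contrast for illustration,
not code from this patch:

        u32 allowed0, allowed1; /* classic control MSR: two 32-bit halves */
        u64 ctls3;              /* CTLS3: one 64-bit allowed-1 mask */

        /* Low half = allowed-0 (must-be-1 bits), high half = allowed-1. */
        rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, allowed0, allowed1);

        /* All 64 bits are allowed-1 (may-be-1) settings; no allowed-0 half. */
        rdmsrl(MSR_IA32_VMX_PROCBASED_CTLS3, ctls3);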
Signed-off-by: Robert Hoo <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmxfeatures.h | 3 ++-
arch/x86/kernel/cpu/feat_ctl.c | 9 ++++++++-
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 01e2650b9585..4914de76ea51 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -921,6 +921,7 @@
#define MSR_IA32_VMX_TRUE_EXIT_CTLS 0x0000048f
#define MSR_IA32_VMX_TRUE_ENTRY_CTLS 0x00000490
#define MSR_IA32_VMX_VMFUNC 0x00000491
+#define MSR_IA32_VMX_PROCBASED_CTLS3 0x00000492
/* VMX_BASIC bits and bitmasks */
#define VMX_BASIC_VMCS_SIZE_SHIFT 32
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index d9a74681a77d..ff20776dc83b 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -5,7 +5,7 @@
/*
* Defines VMX CPU feature bits
*/
-#define NVMXINTS 3 /* N 32-bit words worth of info */
+#define NVMXINTS 5 /* N 32-bit words worth of info */
/*
* Note: If the comment begins with a quoted string, that string is used
@@ -43,6 +43,7 @@
#define VMX_FEATURE_RDTSC_EXITING ( 1*32+ 12) /* "" VM-Exit on RDTSC */
#define VMX_FEATURE_CR3_LOAD_EXITING ( 1*32+ 15) /* "" VM-Exit on writes to CR3 */
#define VMX_FEATURE_CR3_STORE_EXITING ( 1*32+ 16) /* "" VM-Exit on reads from CR3 */
+#define VMX_FEATURE_TERTIARY_CONTROLS ( 1*32+ 17) /* "" Enable Tertiary VM-Execution Controls */
#define VMX_FEATURE_CR8_LOAD_EXITING ( 1*32+ 19) /* "" VM-Exit on writes to CR8 */
#define VMX_FEATURE_CR8_STORE_EXITING ( 1*32+ 20) /* "" VM-Exit on reads from CR8 */
#define VMX_FEATURE_VIRTUAL_TPR ( 1*32+ 21) /* "vtpr" TPR virtualization, a.k.a. TPR shadow */
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index da696eb4821a..993697e71854 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -15,6 +15,8 @@ enum vmx_feature_leafs {
MISC_FEATURES = 0,
PRIMARY_CTLS,
SECONDARY_CTLS,
+ TERTIARY_CTLS_LOW,
+ TERTIARY_CTLS_HIGH,
NR_VMX_FEATURE_WORDS,
};
@@ -22,7 +24,7 @@ enum vmx_feature_leafs {
static void init_vmx_capabilities(struct cpuinfo_x86 *c)
{
- u32 supported, funcs, ept, vpid, ign;
+ u32 supported, funcs, ept, vpid, ign, low, high;
BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
@@ -42,6 +44,11 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
c->vmx_capability[SECONDARY_CTLS] = supported;
+ /* All 64 bits of tertiary controls MSR are allowed-1 settings. */
+ rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
+ c->vmx_capability[TERTIARY_CTLS_LOW] = low;
+ c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
+
rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
--
2.27.0
From: Robert Hoo <[email protected]>
The tertiary VM-execution control field, unlike the previous control
fields, is 64 bits wide. So extend BUILD_CONTROLS_SHADOW() by adding a
'bits' parameter, to support building the auxiliary functions for both
32-bit and 64-bit fields.
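For illustration, with the 'bits' parameter the invocation
BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
(added in a subsequent patch) hand-expands to helpers along these
lines:

        static inline void tertiary_exec_controls_set(struct vcpu_vmx *vmx, u64 val)
        {
                if (vmx->loaded_vmcs->controls_shadow.tertiary_exec != val) {
                        vmcs_write64(TERTIARY_VM_EXEC_CONTROL, val);
                        vmx->loaded_vmcs->controls_shadow.tertiary_exec = val;
                }
        }

        static inline u64 tertiary_exec_controls_get(struct vcpu_vmx *vmx)
        {
                return __tertiary_exec_controls_get(vmx->loaded_vmcs);
        }

i.e. a 64-bit field gets vmcs_write64() and u64 accessors, while the
existing 32-bit users keep vmcs_write32() and u32, all from one macro.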
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Robert Hoo <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
---
arch/x86/kvm/vmx/vmx.h | 59 ++++++++++++++++++++++--------------------
1 file changed, 31 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 4df2ac24ffc1..07e1753225bf 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -443,35 +443,38 @@ static inline u8 vmx_get_rvi(void)
return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
}
-#define BUILD_CONTROLS_SHADOW(lname, uname) \
-static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val) \
-{ \
- if (vmx->loaded_vmcs->controls_shadow.lname != val) { \
- vmcs_write32(uname, val); \
- vmx->loaded_vmcs->controls_shadow.lname = val; \
- } \
-} \
-static inline u32 __##lname##_controls_get(struct loaded_vmcs *vmcs) \
-{ \
- return vmcs->controls_shadow.lname; \
-} \
-static inline u32 lname##_controls_get(struct vcpu_vmx *vmx) \
-{ \
- return __##lname##_controls_get(vmx->loaded_vmcs); \
-} \
-static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val) \
-{ \
- lname##_controls_set(vmx, lname##_controls_get(vmx) | val); \
-} \
-static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u32 val) \
-{ \
- lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val); \
+#define BUILD_CONTROLS_SHADOW(lname, uname, bits) \
+static inline \
+void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val) \
+{ \
+ if (vmx->loaded_vmcs->controls_shadow.lname != val) { \
+ vmcs_write##bits(uname, val); \
+ vmx->loaded_vmcs->controls_shadow.lname = val; \
+ } \
+} \
+static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs)\
+{ \
+ return vmcs->controls_shadow.lname; \
+} \
+static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx) \
+{ \
+ return __##lname##_controls_get(vmx->loaded_vmcs); \
+} \
+static inline \
+void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val) \
+{ \
+ lname##_controls_set(vmx, lname##_controls_get(vmx) | val); \
+} \
+static inline \
+void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val) \
+{ \
+ lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val); \
}
-BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS)
-BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS)
-BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL)
-BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
+BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
{
--
2.27.0
From: Robert Hoo <[email protected]>
Detect the VMX feature for tertiary execution controls during VMCS
config setup. Sub-features in the tertiary execution controls to be
enabled are adjusted according to hardware capabilities, although no
sub-feature is enabled in this patch yet.
eVMCS v1 doesn't support the tertiary VM-execution controls, so disable
them when eVMCS v1 is in use. Also define the auxiliary functions for
the tertiary control field here, using the new BUILD_CONTROLS_SHADOW().
Signed-off-by: Robert Hoo <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
---
arch/x86/include/asm/vmx.h | 3 +++
arch/x86/kvm/vmx/capabilities.h | 7 ++++++
arch/x86/kvm/vmx/evmcs.c | 2 ++
arch/x86/kvm/vmx/evmcs.h | 1 +
arch/x86/kvm/vmx/vmcs.h | 1 +
arch/x86/kvm/vmx/vmx.c | 38 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 1 +
7 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..8c929596a299 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -31,6 +31,7 @@
#define CPU_BASED_RDTSC_EXITING VMCS_CONTROL_BIT(RDTSC_EXITING)
#define CPU_BASED_CR3_LOAD_EXITING VMCS_CONTROL_BIT(CR3_LOAD_EXITING)
#define CPU_BASED_CR3_STORE_EXITING VMCS_CONTROL_BIT(CR3_STORE_EXITING)
+#define CPU_BASED_ACTIVATE_TERTIARY_CONTROLS VMCS_CONTROL_BIT(TERTIARY_CONTROLS)
#define CPU_BASED_CR8_LOAD_EXITING VMCS_CONTROL_BIT(CR8_LOAD_EXITING)
#define CPU_BASED_CR8_STORE_EXITING VMCS_CONTROL_BIT(CR8_STORE_EXITING)
#define CPU_BASED_TPR_SHADOW VMCS_CONTROL_BIT(VIRTUAL_TPR)
@@ -221,6 +222,8 @@ enum vmcs_field {
ENCLS_EXITING_BITMAP_HIGH = 0x0000202F,
TSC_MULTIPLIER = 0x00002032,
TSC_MULTIPLIER_HIGH = 0x00002033,
+ TERTIARY_VM_EXEC_CONTROL = 0x00002034,
+ TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x00002401,
VMCS_LINK_POINTER = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 4705ad55abb5..38d414f64e61 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -59,6 +59,7 @@ struct vmcs_config {
u32 pin_based_exec_ctrl;
u32 cpu_based_exec_ctrl;
u32 cpu_based_2nd_exec_ctrl;
+ u64 cpu_based_3rd_exec_ctrl;
u32 vmexit_ctrl;
u32 vmentry_ctrl;
struct nested_vmx_msrs nested;
@@ -131,6 +132,12 @@ static inline bool cpu_has_secondary_exec_ctrls(void)
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
}
+static inline bool cpu_has_tertiary_exec_ctrls(void)
+{
+ return vmcs_config.cpu_based_exec_ctrl &
+ CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
+}
+
static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
{
return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c
index ba6f99f584ac..03e7c80186fb 100644
--- a/arch/x86/kvm/vmx/evmcs.c
+++ b/arch/x86/kvm/vmx/evmcs.c
@@ -298,8 +298,10 @@ const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);
__init void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
{
+ vmcs_conf->cpu_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_EXEC_CTRL;
vmcs_conf->pin_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_PINCTRL;
vmcs_conf->cpu_based_2nd_exec_ctrl &= ~EVMCS1_UNSUPPORTED_2NDEXEC;
+ vmcs_conf->cpu_based_3rd_exec_ctrl = 0;
vmcs_conf->vmexit_ctrl &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
vmcs_conf->vmentry_ctrl &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
index 16731d2cf231..65fd2b9f893c 100644
--- a/arch/x86/kvm/vmx/evmcs.h
+++ b/arch/x86/kvm/vmx/evmcs.h
@@ -50,6 +50,7 @@ DECLARE_STATIC_KEY_FALSE(enable_evmcs);
*/
#define EVMCS1_UNSUPPORTED_PINCTRL (PIN_BASED_POSTED_INTR | \
PIN_BASED_VMX_PREEMPTION_TIMER)
+#define EVMCS1_UNSUPPORTED_EXEC_CTRL (CPU_BASED_ACTIVATE_TERTIARY_CONTROLS)
#define EVMCS1_UNSUPPORTED_2NDEXEC \
(SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | \
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | \
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 6e5de2e2b0da..b9d18cfcf837 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -50,6 +50,7 @@ struct vmcs_controls_shadow {
u32 pin;
u32 exec;
u32 secondary_exec;
+ u64 tertiary_exec;
};
/*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5aadad3e7367..fb0f600368c6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2383,6 +2383,21 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
return 0;
}
+static __init int adjust_vmx_controls_64(u64 ctl_min, u64 ctl_opt,
+ u32 msr, u64 *result)
+{
+ u64 allowed1;
+
+ rdmsrl(msr, allowed1);
+
+ /* Ensure minimum (required) set of control bits are supported. */
+ if (ctl_min & ~allowed1)
+ return -EIO;
+
+ *result = (ctl_min | ctl_opt) & allowed1;
+ return 0;
+}
+
static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
struct vmx_capability *vmx_cap)
{
@@ -2391,6 +2406,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
u32 _pin_based_exec_control = 0;
u32 _cpu_based_exec_control = 0;
u32 _cpu_based_2nd_exec_control = 0;
+ u64 _cpu_based_3rd_exec_control = 0;
u32 _vmexit_control = 0;
u32 _vmentry_control = 0;
@@ -2412,7 +2428,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
opt = CPU_BASED_TPR_SHADOW |
CPU_BASED_USE_MSR_BITMAPS |
- CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+ CPU_BASED_ACTIVATE_SECONDARY_CONTROLS |
+ CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
&_cpu_based_exec_control) < 0)
return -EIO;
@@ -2486,6 +2503,16 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
"1-setting enable VPID VM-execution control\n");
}
+ if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
+ u64 opt3 = 0;
+ u64 min3 = 0;
+
+ if (adjust_vmx_controls_64(min3, opt3,
+ MSR_IA32_VMX_PROCBASED_CTLS3,
+ &_cpu_based_3rd_exec_control))
+ return -EIO;
+ }
+
min = VM_EXIT_SAVE_DEBUG_CONTROLS | VM_EXIT_ACK_INTR_ON_EXIT;
#ifdef CONFIG_X86_64
min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -2573,6 +2600,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
+ vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
vmcs_conf->vmexit_ctrl = _vmexit_control;
vmcs_conf->vmentry_ctrl = _vmentry_control;
@@ -4128,6 +4156,11 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
return exec_control;
}
+static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
+{
+ return vmcs_config.cpu_based_3rd_exec_ctrl;
+}
+
/*
* Adjust a single secondary execution control bit to intercept/allow an
* instruction in the guest. This is usually done based on whether or not a
@@ -4293,6 +4326,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
if (cpu_has_secondary_exec_ctrls())
secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
+ if (cpu_has_tertiary_exec_ctrls())
+ tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
+
if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
vmcs_write64(EOI_EXIT_BITMAP0, 0);
vmcs_write64(EOI_EXIT_BITMAP1, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 07e1753225bf..ee94068ca8fb 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -475,6 +475,7 @@ BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
+BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
{
--
2.27.0
From: Robert Hoo <[email protected]>
Report the tertiary_exec_control field in dump_vmcs().
Signed-off-by: Robert Hoo <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fb0f600368c6..5716db9704c0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5729,6 +5729,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 vmentry_ctl, vmexit_ctl;
u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
+ u64 tertiary_exec_control = 0;
unsigned long cr4;
int efer_slot;
@@ -5746,6 +5747,9 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (cpu_has_secondary_exec_ctrls())
secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+ if (cpu_has_tertiary_exec_ctrls())
+ tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
+
pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
pr_err("*** Guest State ***\n");
@@ -5844,8 +5848,9 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
pr_err("*** Control State ***\n");
- pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
- pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
+ pr_err("PinBased=0x%08x CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
+ pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control,
+ tertiary_exec_control);
pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
vmcs_read32(EXCEPTION_BITMAP),
--
2.27.0
In VMX non-root operation, new behavior applies to virtualizing WRMSR
to the vICR in x2APIC mode. Depending on the settings of the
VM-execution controls, the CPU produces an APIC-write VM-exit after the
64-bit value is written to offset 300H on the virtual-APIC page (vICR).
KVM needs to retrieve the value written by the CPU and emulate the vICR
write to deliver the interrupt.
Currently KVM doesn't handle a 64-bit value on the vICR in the
trap-like APIC-write VM-exit, because emulating writes to APIC_ICR with
kvm_lapic_reg_write() requires APIC_ICR2 to already be programmed
correctly. But in the above APIC-write VM-exit, the CPU writes the
whole 64 bits to APIC_ICR rather than programming the upper 32 bits
into APIC_ICR2 and the lower 32 bits into APIC_ICR separately. So KVM
needs to retrieve the whole 64-bit value and program the upper 32 bits
into APIC_ICR2 first.
Signed-off-by: Zeng Guang <[email protected]>
---
arch/x86/kvm/lapic.c | 12 +++++++++---
arch/x86/kvm/lapic.h | 5 +++++
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f206fc35deff..3ce7142ba00e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2186,15 +2186,21 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
/* emulate APIC access in a trap manner */
void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
{
- u32 val = 0;
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ u64 val = 0;
/* hw has done the conditional check and inst decode */
offset &= 0xff0;
- kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);
+ /* exception dealing with 64bit data on vICR in x2apic mode */
+ if ((offset == APIC_ICR) && apic_x2apic_mode(apic)) {
+ val = kvm_lapic_get_reg64(apic, offset);
+ kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(val>>32));
+ } else
+ kvm_lapic_reg_read(apic, offset, 4, &val);
/* TODO: optimize to just emulate side effect w/o one more write */
- kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
+ kvm_lapic_reg_write(apic, offset, (u32)val);
}
EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2b44e533fc8d..91864e401a64 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -158,6 +158,11 @@ static inline u32 kvm_lapic_get_reg(struct kvm_lapic *apic, int reg_off)
return *((u32 *) (apic->regs + reg_off));
}
+static inline u64 kvm_lapic_get_reg64(struct kvm_lapic *apic, int reg_off)
+{
+ return *((u64 *) (apic->regs + reg_off));
+}
+
static inline void __kvm_lapic_set_reg(char *regs, int reg_off, u32 val)
{
*((u32 *) (regs + reg_off)) = val;
--
2.27.0
In xAPIC mode, the guest is allowed to modify the APIC ID at runtime.
If IPI virtualization is enabled, the corresponding entry in the
PID-pointer table needs to change accordingly.
Signed-off-by: Zeng Guang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/lapic.c | 7 +++++--
arch/x86/kvm/vmx/vmx.c | 12 ++++++++++++
3 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2164b9f4c7b0..753bf2a7cebc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1493,6 +1493,7 @@ struct kvm_x86_ops {
int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+ void (*update_ipiv_pid_entry)(struct kvm_vcpu *vcpu, u8 old_id, u8 new_id);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3ce7142ba00e..83c2c7594bcd 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2007,9 +2007,12 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
switch (reg) {
case APIC_ID: /* Local APIC ID */
- if (!apic_x2apic_mode(apic))
+ if (!apic_x2apic_mode(apic)) {
+ u8 old_id = kvm_lapic_get_reg(apic, APIC_ID) >> 24;
+
kvm_apic_set_xapic_id(apic, val >> 24);
- else
+ kvm_x86_ops.update_ipiv_pid_entry(apic->vcpu, old_id, val >> 24);
+ } else
ret = 1;
break;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2e65464d6dee..f21ce15c5eb8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7633,6 +7633,17 @@ static void vmx_vm_destroy(struct kvm *kvm)
free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);
}
+static void vmx_update_ipiv_pid_entry(struct kvm_vcpu *vcpu, u8 old_id, u8 new_id)
+{
+ if (enable_ipiv && kvm_vcpu_apicv_active(vcpu)) {
+ u64 *pid_table = to_kvm_vmx(vcpu->kvm)->pid_table;
+
+ WRITE_ONCE(pid_table[old_id], 0);
+ WRITE_ONCE(pid_table[new_id], __pa(&to_vmx(vcpu)->pi_desc) |
+ PID_TABLE_ENTRY_VALID);
+ }
+}
+
static struct kvm_x86_ops vmx_x86_ops __initdata = {
.name = "kvm_intel",
@@ -7770,6 +7781,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+ .update_ipiv_pid_entry = vmx_update_ipiv_pid_entry,
};
static __init void vmx_setup_user_return_msrs(void)
--
2.27.0
Currently KVM allocates 8 pages in advance for the Posted-Interrupt
Descriptor pointer (PID-pointer) table, to accommodate vCPUs with APIC
IDs up to KVM_MAX_VCPU_IDS - 1. This policy wastes memory because most
VMs have fewer than 512 vCPUs and thus need only one page.
To reduce memory consumption for most VMs, KVM now initially allocates
one page for each VM's PID-pointer table and grows the table on demand
according to the maximum APIC ID across all vCPUs of the VM. Growing
the PID-pointer table involves allocating a new table, requesting all
vCPUs to update the related VMCS fields, and freeing the old table.
In the worst case, when the new memory allocation fails, KVM keeps
using the present PID-pointer table. IPI virtualization then simply
won't take effect for the vCPUs that have no entry in the table,
without impacting the others.
Signed-off-by: Zeng Guang <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kvm/vmx/vmx.c | 77 +++++++++++++++++++++++++-----
arch/x86/kvm/vmx/vmx.h | 6 +++
arch/x86/kvm/x86.c | 2 +
5 files changed, 78 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..847246f2537d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -121,6 +121,7 @@ KVM_X86_OP_NULL(enable_direct_tlbflush)
KVM_X86_OP_NULL(migrate_timers)
KVM_X86_OP(msr_filter_changed)
KVM_X86_OP_NULL(complete_emulated_msr)
+KVM_X86_OP(update_ipiv_pid_table)
#undef KVM_X86_OP
#undef KVM_X86_OP_NULL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 753bf2a7cebc..24990d4e94c4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -102,6 +102,8 @@
#define KVM_REQ_MSR_FILTER_CHANGED KVM_ARCH_REQ(29)
#define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_PID_TABLE_UPDATE \
+ KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -1494,6 +1496,7 @@ struct kvm_x86_ops {
void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
void (*update_ipiv_pid_entry)(struct kvm_vcpu *vcpu, u8 old_id, u8 new_id);
+ void (*update_ipiv_pid_table)(struct kvm_vcpu *vcpu);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f21ce15c5eb8..fb8e2b52b5f7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -228,8 +228,9 @@ static const struct {
#define L1D_CACHE_ORDER 4
-/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
-#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
+/* Each entry in PID(Posted-Interrupt Descriptor)-pointer table is 8 bytes */
+#define table_index_to_size(index) ((index) << 3)
+#define table_size_to_index(size) ((size) >> 3)
#define PID_TABLE_ENTRY_VALID 1
static void *vmx_l1d_flush_pages;
@@ -4332,6 +4333,42 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
return exec_control;
}
+static int vmx_alloc_pid_table(struct kvm_vmx *kvm_vmx, int order)
+{
+ u64 *pid_table;
+
+ pid_table = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+ if (!pid_table)
+ return -ENOMEM;
+
+ kvm_vmx->pid_table = pid_table;
+ kvm_vmx->pid_last_index = table_size_to_index(PAGE_SIZE << order) - 1;
+ return 0;
+}
+
+static int vmx_expand_pid_table(struct kvm_vmx *kvm_vmx, int entry_idx)
+{
+ u64 *last_pid_table;
+ int last_table_size, new_order;
+
+ if (entry_idx <= kvm_vmx->pid_last_index)
+ return 0;
+
+ last_pid_table = kvm_vmx->pid_table;
+ last_table_size = table_index_to_size(kvm_vmx->pid_last_index + 1);
+ new_order = get_order(table_index_to_size(entry_idx + 1));
+
+ if (vmx_alloc_pid_table(kvm_vmx, new_order))
+ return -ENOMEM;
+
+ memcpy(kvm_vmx->pid_table, last_pid_table, last_table_size);
+ kvm_make_all_cpus_request(&kvm_vmx->kvm, KVM_REQ_PID_TABLE_UPDATE);
+
+ /* Now old PID table can be freed safely as no vCPU is using it. */
+ free_pages((unsigned long)last_pid_table, get_order(last_table_size));
+ return 0;
+}
+
#define VMX_XSS_EXIT_BITMAP 0
static void init_vmcs(struct vcpu_vmx *vmx)
@@ -4370,10 +4407,19 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
if (enable_ipiv) {
- WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
- __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
+ down_write(&kvm_vmx->pid_table_lock);
+
+ /*
+ * In case new memory allocation for PID table fails,
+ * skip setting Posted-Interrupt descriptor of current
+ * vCPU which index is beyond present table limit.
+ */
+ if (!vmx_expand_pid_table(kvm_vmx, vcpu->vcpu_id))
+ WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
+ __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
+ up_write(&kvm_vmx->pid_table_lock);
}
}
@@ -7001,14 +7047,11 @@ static int vmx_vm_init(struct kvm *kvm)
}
if (enable_ipiv) {
- struct page *pages;
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
- pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PID_TABLE_ORDER);
- if (!pages)
+ if (vmx_alloc_pid_table(kvm_vmx, 0))
return -ENOMEM;
-
- to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
- to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_IDS - 1;
+ init_rwsem(&kvm_vmx->pid_table_lock);
}
return 0;
@@ -7630,7 +7673,18 @@ static void vmx_vm_destroy(struct kvm *kvm)
struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
if (kvm_vmx->pid_table)
- free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);
+ free_pages((unsigned long)kvm_vmx->pid_table,
+ get_order(table_index_to_size(kvm_vmx->pid_last_index)));
+}
+
+static void vmx_update_ipiv_pid_table(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
+
+ down_read(&kvm_vmx->pid_table_lock);
+ vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
+ vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
+ up_read(&kvm_vmx->pid_table_lock);
}
static void vmx_update_ipiv_pid_entry(struct kvm_vcpu *vcpu, u8 old_id, u8 new_id)
@@ -7782,6 +7836,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
.update_ipiv_pid_entry = vmx_update_ipiv_pid_entry,
+ .update_ipiv_pid_table = vmx_update_ipiv_pid_table,
};
static __init void vmx_setup_user_return_msrs(void)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index c8ae1458eb9e..8c437a7be08a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -356,6 +356,12 @@ struct kvm_vmx {
/* PID table for IPI virtualization */
u64 *pid_table;
u16 pid_last_index;
+ /*
+ * Protects accesses to pid_table and pid_last_index.
+ * Request to reallocate and update PID table could
+ * happen on multiple vCPUs simultaneously.
+ */
+ struct rw_semaphore pid_table_lock;
};
bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a2972fdae82..97ec2adb76bd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9783,6 +9783,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+ if (kvm_check_request(KVM_REQ_PID_TABLE_UPDATE, vcpu))
+ static_call(kvm_x86_update_ipiv_pid_table)(vcpu);
}
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
--
2.27.0
From: Gao Chao <[email protected]>
With IPI virtualization enabled, the processor emulates writes to the
APIC registers that would send IPIs. The processor sets the bit
corresponding to the vector in the target vCPU's PIR and may send a
notification (IPI) specified by the NDST and NV fields in the target
vCPU's Posted-Interrupt Descriptor (PID). This is similar to what the
IOMMU engine does when dealing with posted interrupts from devices.
The processor uses a PID-pointer table to locate a vCPU's PID from the
vCPU's APIC ID.
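Conceptually, the lookup the processor performs on a virtualized IPI
can be modeled as below. This is illustrative pseudo-C of processor
behavior, not KVM code; PID_TABLE_ENTRY_VALID (bit 0) and
pi_test_and_set_pir() are reused from KVM only to make the model
readable:

        /*
         * Model: the destination APIC ID indexes the PID-pointer table.
         * A valid entry has bit 0 set and otherwise holds the physical
         * address of the target vCPU's Posted-Interrupt Descriptor.
         */
        u64 entry = pid_table[dest_apic_id];

        if (entry & PID_TABLE_ENTRY_VALID) {
                struct pi_desc *pid = __va(entry & ~(u64)PID_TABLE_ENTRY_VALID);

                /* Post the vector; hardware may then send the NV IPI to NDST. */
                pi_test_and_set_pir(vector, pid);
        }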
As with VT-d posted interrupts, if a vCPU goes into the blocked state,
the VMM needs to switch its notification vector to the wakeup vector.
This ensures that when an IPI for a blocked vCPU arrives, the VMM gets
control and wakes up the vCPU. And if a vCPU is preempted, its
posted-interrupt notification is suppressed.
Note that IPI virtualization can only virtualize physical-addressing,
flat-mode, unicast IPIs. Sending other kinds of IPIs still causes a
trap-like APIC-write VM-exit that must be handled by the VMM.
Signed-off-by: Gao Chao <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
---
arch/x86/include/asm/vmx.h | 8 ++++
arch/x86/include/asm/vmxfeatures.h | 2 +
arch/x86/kvm/vmx/capabilities.h | 7 +++
arch/x86/kvm/vmx/posted_intr.c | 9 +++-
arch/x86/kvm/vmx/vmx.c | 74 +++++++++++++++++++++++++++---
arch/x86/kvm/vmx/vmx.h | 3 ++
6 files changed, 94 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8c929596a299..b79b6438acaa 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -76,6 +76,11 @@
#define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
#define SECONDARY_EXEC_BUS_LOCK_DETECTION VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
+/*
+ * Definitions of Tertiary Processor-Based VM-Execution Controls.
+ */
+#define TERTIARY_EXEC_IPI_VIRT VMCS_CONTROL_BIT(IPI_VIRT)
+
#define PIN_BASED_EXT_INTR_MASK VMCS_CONTROL_BIT(INTR_EXITING)
#define PIN_BASED_NMI_EXITING VMCS_CONTROL_BIT(NMI_EXITING)
#define PIN_BASED_VIRTUAL_NMIS VMCS_CONTROL_BIT(VIRTUAL_NMIS)
@@ -159,6 +164,7 @@ static inline int vmx_misc_mseg_revid(u64 vmx_misc)
enum vmcs_field {
VIRTUAL_PROCESSOR_ID = 0x00000000,
POSTED_INTR_NV = 0x00000002,
+ LAST_PID_POINTER_INDEX = 0x00000008,
GUEST_ES_SELECTOR = 0x00000800,
GUEST_CS_SELECTOR = 0x00000802,
GUEST_SS_SELECTOR = 0x00000804,
@@ -224,6 +230,8 @@ enum vmcs_field {
TSC_MULTIPLIER_HIGH = 0x00002033,
TERTIARY_VM_EXEC_CONTROL = 0x00002034,
TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
+ PID_POINTER_TABLE = 0x00002042,
+ PID_POINTER_TABLE_HIGH = 0x00002043,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
GUEST_PHYSICAL_ADDRESS_HIGH = 0x00002401,
VMCS_LINK_POINTER = 0x00002800,
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index ff20776dc83b..7ce616af2db2 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -86,4 +86,6 @@
#define VMX_FEATURE_ENCLV_EXITING ( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
#define VMX_FEATURE_BUS_LOCK_DETECTION ( 2*32+ 30) /* "" VM-Exit when bus lock caused */
+/* Tertiary Processor-Based VM-Execution Controls, word 3 */
+#define VMX_FEATURE_IPI_VIRT (3*32 + 4) /* "" Enable IPI virtualization */
#endif /* _ASM_X86_VMXFEATURES_H */
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 38d414f64e61..78b0525dd991 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -12,6 +12,7 @@ extern bool __read_mostly enable_ept;
extern bool __read_mostly enable_unrestricted_guest;
extern bool __read_mostly enable_ept_ad_bits;
extern bool __read_mostly enable_pml;
+extern bool __read_mostly enable_ipiv;
extern int __read_mostly pt_mode;
#define PT_MODE_SYSTEM 0
@@ -283,6 +284,12 @@ static inline bool cpu_has_vmx_apicv(void)
cpu_has_vmx_posted_intr();
}
+static inline bool cpu_has_vmx_ipiv(void)
+{
+ return vmcs_config.cpu_based_3rd_exec_ctrl &
+ TERTIARY_EXEC_IPI_VIRT;
+}
+
static inline bool cpu_has_vmx_flexpriority(void)
{
return cpu_has_vmx_tpr_shadow() &&
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 1c94783b5a54..bd9c9a89726a 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -85,11 +85,16 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
irq_remapping_cap(IRQ_POSTING_CAP);
}
+static bool vmx_can_use_ipiv_pi(struct kvm *kvm)
+{
+ return irqchip_in_kernel(kvm) && enable_apicv && enable_ipiv;
+}
+
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- if (!vmx_can_use_vtd_pi(vcpu->kvm))
+ if (!(vmx_can_use_ipiv_pi(vcpu->kvm) || vmx_can_use_vtd_pi(vcpu->kvm)))
return;
/* Set SN when the vCPU is preempted */
@@ -147,7 +152,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
struct pi_desc old, new;
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- if (!vmx_can_use_vtd_pi(vcpu->kvm))
+ if (!(vmx_can_use_ipiv_pi(vcpu->kvm) || vmx_can_use_vtd_pi(vcpu->kvm)))
return 0;
WARN_ON(irqs_disabled());
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5716db9704c0..2e65464d6dee 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -104,6 +104,9 @@ module_param(fasteoi, bool, S_IRUGO);
module_param(enable_apicv, bool, S_IRUGO);
+bool __read_mostly enable_ipiv = true;
+module_param(enable_ipiv, bool, 0444);
+
/*
* If nested=1, nested virtualization is supported, i.e., guests may use
* VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -224,6 +227,11 @@ static const struct {
};
#define L1D_CACHE_ORDER 4
+
+/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
+#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
+#define PID_TABLE_ENTRY_VALID 1
+
static void *vmx_l1d_flush_pages;
static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
@@ -2504,7 +2512,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
}
if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
- u64 opt3 = 0;
+ u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
u64 min3 = 0;
if (adjust_vmx_controls_64(min3, opt3,
@@ -3841,6 +3849,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
+ vmx_set_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),
+ MSR_TYPE_RW, !enable_ipiv);
}
}
@@ -4117,14 +4127,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
if (cpu_has_secondary_exec_ctrls()) {
- if (kvm_vcpu_apicv_active(vcpu))
+ if (kvm_vcpu_apicv_active(vcpu)) {
secondary_exec_controls_setbit(vmx,
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
- else
+ if (cpu_has_tertiary_exec_ctrls() && enable_ipiv)
+ tertiary_exec_controls_setbit(vmx,
+ TERTIARY_EXEC_IPI_VIRT);
+ } else {
secondary_exec_controls_clearbit(vmx,
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+ if (cpu_has_tertiary_exec_ctrls())
+ tertiary_exec_controls_clearbit(vmx,
+ TERTIARY_EXEC_IPI_VIRT);
+ }
}
vmx_update_msr_bitmap_x2apic(vcpu);
@@ -4158,7 +4175,16 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
static u64 vmx_tertiary_exec_control(struct vcpu_vmx *vmx)
{
- return vmcs_config.cpu_based_3rd_exec_ctrl;
+ u64 exec_control = vmcs_config.cpu_based_3rd_exec_ctrl;
+
+ /*
+ * IPI virtualization relies on APICv. Disable IPI
+ * virtualization if APICv is inhibited.
+ */
+ if (!enable_ipiv || !kvm_vcpu_apicv_active(&vmx->vcpu))
+ exec_control &= ~TERTIARY_EXEC_IPI_VIRT;
+
+ return exec_control;
}
/*
@@ -4310,6 +4336,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
static void init_vmcs(struct vcpu_vmx *vmx)
{
+ struct kvm_vcpu *vcpu = &vmx->vcpu;
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
+
if (nested)
nested_vmx_set_vmcs_shadowing_bitmap();
@@ -4329,7 +4358,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
if (cpu_has_tertiary_exec_ctrls())
tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
- if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
+ if (kvm_vcpu_apicv_active(vcpu)) {
vmcs_write64(EOI_EXIT_BITMAP0, 0);
vmcs_write64(EOI_EXIT_BITMAP1, 0);
vmcs_write64(EOI_EXIT_BITMAP2, 0);
@@ -4339,6 +4368,13 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+
+ if (enable_ipiv) {
+ WRITE_ONCE(kvm_vmx->pid_table[vcpu->vcpu_id],
+ __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
+ vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
+ vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
+ }
}
if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
@@ -4390,7 +4426,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
}
- vmx_write_encls_bitmap(&vmx->vcpu, NULL);
+ vmx_write_encls_bitmap(vcpu, NULL);
if (vmx_pt_mode_is_host_guest()) {
memset(&vmx->pt_desc, 0, sizeof(vmx->pt_desc));
@@ -4406,7 +4442,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
if (cpu_has_vmx_tpr_shadow()) {
vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
- if (cpu_need_tpr_shadow(&vmx->vcpu))
+ if (cpu_need_tpr_shadow(vcpu))
vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
__pa(vmx->vcpu.arch.apic->regs));
vmcs_write32(TPR_THRESHOLD, 0);
@@ -6963,6 +6999,18 @@ static int vmx_vm_init(struct kvm *kvm)
break;
}
}
+
+ if (enable_ipiv) {
+ struct page *pages;
+
+ pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PID_TABLE_ORDER);
+ if (!pages)
+ return -ENOMEM;
+
+ to_kvm_vmx(kvm)->pid_table = (void *)page_address(pages);
+ to_kvm_vmx(kvm)->pid_last_index = KVM_MAX_VCPU_IDS - 1;
+ }
+
return 0;
}
@@ -7577,6 +7625,14 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
return supported & BIT(bit);
}
+static void vmx_vm_destroy(struct kvm *kvm)
+{
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
+
+ if (kvm_vmx->pid_table)
+ free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);
+}
+
static struct kvm_x86_ops vmx_x86_ops __initdata = {
.name = "kvm_intel",
@@ -7589,6 +7645,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vm_size = sizeof(struct kvm_vmx),
.vm_init = vmx_vm_init,
+ .vm_destroy = vmx_vm_destroy,
.vcpu_create = vmx_create_vcpu,
.vcpu_free = vmx_free_vcpu,
@@ -7828,6 +7885,9 @@ static __init int hardware_setup(void)
if (!enable_apicv)
vmx_x86_ops.sync_pir_to_irr = NULL;
+ if (!enable_apicv || !cpu_has_vmx_ipiv())
+ enable_ipiv = false;
+
if (cpu_has_vmx_tsc_scaling()) {
kvm_has_tsc_control = true;
kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index ee94068ca8fb..c8ae1458eb9e 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -353,6 +353,9 @@ struct kvm_vmx {
unsigned int tss_addr;
bool ept_identity_pagetable_done;
gpa_t ept_identity_map_addr;
+ /* PID table for IPI virtualization */
+ u64 *pid_table;
+ u16 pid_last_index;
};
bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
--
2.27.0
On 12/31/21 8:28 AM, Zeng Guang wrote:
> In xAPIC mode, guest is allowed to modify APIC ID at runtime.
> If IPI virtualization is enabled, corresponding entry in
> PID-pointer table need change accordingly.
>
> Signed-off-by: Zeng Guang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/lapic.c | 7 +++++--
> arch/x86/kvm/vmx/vmx.c | 12 ++++++++++++
> 3 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2164b9f4c7b0..753bf2a7cebc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1493,6 +1493,7 @@ struct kvm_x86_ops {
> int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
>
> void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
> + void (*update_ipiv_pid_entry)(struct kvm_vcpu *vcpu, u8 old_id, u8 new_id);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 3ce7142ba00e..83c2c7594bcd 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2007,9 +2007,12 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>
> switch (reg) {
> case APIC_ID: /* Local APIC ID */
> - if (!apic_x2apic_mode(apic))
> + if (!apic_x2apic_mode(apic)) {
> + u8 old_id = kvm_lapic_get_reg(apic, APIC_ID) >> 24;
> +
> kvm_apic_set_xapic_id(apic, val >> 24);
> - else
> + kvm_x86_ops.update_ipiv_pid_entry(apic->vcpu, old_id, val >> 24);
Won't this blow up on AMD since there is no corresponding SVM op?
Thanks,
Tom
> + } else
> ret = 1;
> break;
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 2e65464d6dee..f21ce15c5eb8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7633,6 +7633,17 @@ static void vmx_vm_destroy(struct kvm *kvm)
> free_pages((unsigned long)kvm_vmx->pid_table, MAX_PID_TABLE_ORDER);
> }
>
> +static void vmx_update_ipiv_pid_entry(struct kvm_vcpu *vcpu, u8 old_id, u8 new_id)
> +{
> + if (enable_ipiv && kvm_vcpu_apicv_active(vcpu)) {
> + u64 *pid_table = to_kvm_vmx(vcpu->kvm)->pid_table;
> +
> + WRITE_ONCE(pid_table[old_id], 0);
> + WRITE_ONCE(pid_table[new_id], __pa(&to_vmx(vcpu)->pi_desc) |
> + PID_TABLE_ENTRY_VALID);
> + }
> +}
> +
> static struct kvm_x86_ops vmx_x86_ops __initdata = {
> .name = "kvm_intel",
>
> @@ -7770,6 +7781,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
> .complete_emulated_msr = kvm_complete_insn_gp,
>
> .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
> + .update_ipiv_pid_entry = vmx_update_ipiv_pid_entry,
> };
>
> static __init void vmx_setup_user_return_msrs(void)
>
On 1/6/2022 3:13 AM, Tom Lendacky wrote:
> On 12/31/21 8:28 AM, Zeng Guang wrote:
>> In xAPIC mode, guest is allowed to modify APIC ID at runtime.
>> If IPI virtualization is enabled, corresponding entry in
>> PID-pointer table need change accordingly.
>>
>> Signed-off-by: Zeng Guang <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 1 +
>> arch/x86/kvm/lapic.c | 7 +++++--
>> arch/x86/kvm/vmx/vmx.c | 12 ++++++++++++
>> 3 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 2164b9f4c7b0..753bf2a7cebc 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1493,6 +1493,7 @@ struct kvm_x86_ops {
>> int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
>>
>> void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
>> + void (*update_ipiv_pid_entry)(struct kvm_vcpu *vcpu, u8 old_id, u8 new_id);
>> };
>>
>> struct kvm_x86_nested_ops {
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 3ce7142ba00e..83c2c7594bcd 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2007,9 +2007,12 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>>
>> switch (reg) {
>> case APIC_ID: /* Local APIC ID */
>> - if (!apic_x2apic_mode(apic))
>> + if (!apic_x2apic_mode(apic)) {
>> + u8 old_id = kvm_lapic_get_reg(apic, APIC_ID) >> 24;
>> +
>> kvm_apic_set_xapic_id(apic, val >> 24);
>> - else
>> + kvm_x86_ops.update_ipiv_pid_entry(apic->vcpu, old_id, val >> 24);
> Won't this blow up on AMD since there is no corresponding SVM op?
>
> Thanks,
> Tom
Right, we need to check that the op is valid to avoid breaking AMD
systems. The same consideration applies to the "update_ipiv_pid_table"
op in patch 8. I will revise in the next version. Thanks.
>> + } else
>> ret = 1;
>> break;
>>
>>
On 1/5/22 7:44 PM, Zeng Guang wrote:
> On 1/6/2022 3:13 AM, Tom Lendacky wrote:
>> On 12/31/21 8:28 AM, Zeng Guang wrote:
>> Won't this blow up on AMD since there is no corresponding SVM op?
>>
>> Thanks,
>> Tom
> Right, need check ops validness to avoid ruining AMD system. Same
> consideration on ops "update_ipiv_pid_table" in patch8.
Not necessarily for patch8. That is "protected" by the
kvm_check_request(KVM_REQ_PID_TABLE_UPDATE, vcpu) test, but it couldn't hurt.
Thanks,
Tom
> I will revise in next version. Thanks.
>>> + } else
>>> ret = 1;
>>> break;
>>>
On 1/6/2022 10:06 PM, Tom Lendacky wrote:
> On 1/5/22 7:44 PM, Zeng Guang wrote:
>> On 1/6/2022 3:13 AM, Tom Lendacky wrote:
>>> On 12/31/21 8:28 AM, Zeng Guang wrote:
>>> Won't this blow up on AMD since there is no corresponding SVM op?
>>>
>>> Thanks,
>>> Tom
>> Right, need check ops validness to avoid ruining AMD system. Same
>> consideration on ops "update_ipiv_pid_table" in patch8.
> Not necessarily for patch8. That is "protected" by the
> kvm_check_request(KVM_REQ_PID_TABLE_UPDATE, vcpu) test, but it couldn't hurt.
OK, make sense. Thanks.
> Thanks,
> Tom
>
>> I will revise in next version. Thanks.
>>>> + } else
>>>> ret = 1;
>>>> break;
>>>>
On Fri, 2022-01-07 at 16:05 +0800, Zeng Guang wrote:
> On 1/6/2022 10:06 PM, Tom Lendacky wrote:
> > On 1/5/22 7:44 PM, Zeng Guang wrote:
> > > On 1/6/2022 3:13 AM, Tom Lendacky wrote:
> > > > On 12/31/21 8:28 AM, Zeng Guang wrote:
> > > > Won't this blow up on AMD since there is no corresponding SVM op?
> > > >
> > > > Thanks,
> > > > Tom
> > > Right, need check ops validness to avoid ruining AMD system. Same
> > > consideration on ops "update_ipiv_pid_table" in patch8.
> > Not necessarily for patch8. That is "protected" by the
> > kvm_check_request(KVM_REQ_PID_TABLE_UPDATE, vcpu) test, but it couldn't hurt.
>
> OK, make sense. Thanks.
I haven't fully reviewed this patch series yet,
and I will soon.
I just want to point out a few things:
1. AMD's AVIC also has a PID table (it's called the AVIC physical ID
table). It stores the addresses of the vCPUs' APIC backing pages, and
their real APIC IDs.
avic_init_backing_page initializes the entry (assuming apic_id == vcpu_id)
(which is doubly confusing).
2. For some reason KVM supports writable APIC IDs. Does anyone use these?
Even Intel's PRM strongly discourages their use, and in x2APIC mode
the APIC ID is read-only.
Because of this we have quite some bookkeeping in lapic.c
(things like kvm_recalculate_apic_map and such).
Also AVIC has its own handling for writes to APIC_ID, APIC_LDR and
APIC_DFR, which tries to update its physical and logical ID tables.
(It also used to handle the APIC base; I removed that since the APIC
base was otherwise always hardcoded to the default value.)
Note that avic_handle_apic_id_update is broken - it always copies the entry
from the default (apicid == vcpu_id) location to the new location and zeros
the old location, which will fail in many cases, e.g. even if the guest
were just to swap a few APIC IDs.
Also, writable APIC IDs mean that two vCPUs can have the same APIC ID.
There is no way we handle this correctly, and no way APICv/AVIC does.
Best regards,
Maxim Levitsky
>
> > Thanks,
> > Tom
> >
> > > I will revise in next version. Thanks.
> > > > > + } else
> > > > > ret = 1;
> > > > > break;
> > > > >
On Fri, Jan 07, 2022 at 10:31:59AM +0200, Maxim Levitsky wrote:
>On Fri, 2022-01-07 at 16:05 +0800, Zeng Guang wrote:
>> On 1/6/2022 10:06 PM, Tom Lendacky wrote:
>> > On 1/5/22 7:44 PM, Zeng Guang wrote:
>> > > On 1/6/2022 3:13 AM, Tom Lendacky wrote:
>> > > > On 12/31/21 8:28 AM, Zeng Guang wrote:
>> > > > Won't this blow up on AMD since there is no corresponding SVM op?
>> > > >
>> > > > Thanks,
>> > > > Tom
>> > > Right, need check ops validness to avoid ruining AMD system. Same
>> > > consideration on ops "update_ipiv_pid_table" in patch8.
>> > Not necessarily for patch8. That is "protected" by the
>> > kvm_check_request(KVM_REQ_PID_TABLE_UPDATE, vcpu) test, but it couldn't hurt.
>>
>> OK, make sense. Thanks.
>
>I haven't fully reviewed this patch series yet,
>and I will soon.
>
>I just want to point out few things:
Thanks for pointing them out.
>
>1. AMD's AVIC also has a PID table (its calle AVIC physical ID table).
>It stores addressses of vCPUs apic backing pages,
>and thier real APIC IDs.
>
>avic_init_backing_page initializes the entry (assuming apic_id == vcpu_id)
>(which is double confusing)
>
>2. For some reason KVM supports writable APIC IDs. Does anyone use these?
>Even Intel's PRM strongly discourages users from using them and in X2APIC mode,
>the APIC ID is read only.
>
>Because of this we have quite some bookkeeping in lapic.c,
>(things like kvm_recalculate_apic_map and such)
>
>Also AVIC has its own handling for writes to APIC_ID,APIC_LDR,APIC_DFR
>which tries to update its physical and logical ID tables.
Intel's IPI virtualization doesn't handle logical-addressing IPIs. They cause
an APIC-write VM-exit as usual. So this series doesn't handle APIC_LDR/DFR.
>
>(it used also to handle apic base and I removed this as apic base otherwise
>was always hardcoded to the default vaule)
>
>Note that avic_handle_apic_id_update is broken - it always copies the entry
>from the default (apicid == vcpu_id) location to new location and zeros
>the old location, which will fail in many cases, like even if the guest
>were to swap few apic ids.
This series differs slightly from avic_handle_apic_id_update:
If a vCPU's APIC ID is changed, this series zeros the old entry in the
PID-pointer table and programs the vCPU's PID into the new entry
(rather than copying from the old entry).
But this series is also problematic if the guest swaps two vCPUs' APIC
IDs without going through another free APIC ID; one of them would end
up with no valid entry.
One solution in my mind is:
when a vCPU's APIC ID is changed, KVM traverses all vCPUs to count vCPUs using
the old APIC ID and the new APIC ID, programs corrsponding entries following
below rules:
1. populate an entry with a vCPU's PID if the corrsponding APIC ID is
exclusively used by that vCPU.
2. zero an entry for other cases.
Proper locking is needed in this process to prevent changes to vCPUs' APIC IDs.
Or if it doesn't worth it, we can disable IPI virtualization for a guest on its
first attempt to change xAPIC ID.
Let us know which option is preferred.
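For illustration, a rough sketch of option 1 (the function name and the locking are
hypothetical; pid_table, to_vmx()->pi_desc and PID_TABLE_ENTRY_VALID are as used by
this series, and the caller would invoke this for both the old and the new APIC ID
while holding a VM-level lock that serializes APIC ID changes):

static void ipiv_recalc_pid_entry(struct kvm_vmx *kvm_vmx, u8 apic_id)
{
	struct kvm_vcpu *vcpu, *owner = NULL;
	unsigned long i;
	int users = 0;

	/* Count the vCPUs whose current xAPIC ID is @apic_id. */
	kvm_for_each_vcpu(i, vcpu, &kvm_vmx->kvm) {
		if (kvm_xapic_id(vcpu->arch.apic) == apic_id) {
			users++;
			owner = vcpu;
		}
	}

	/* Rule 1: exclusively used -> point the entry at the owner's PID. */
	if (users == 1)
		WRITE_ONCE(kvm_vmx->pid_table[apic_id],
			   __pa(&to_vmx(owner)->pi_desc) | PID_TABLE_ENTRY_VALID);
	else	/* Rule 2: unused or duplicated -> zero the entry. */
		WRITE_ONCE(kvm_vmx->pid_table[apic_id], 0);
}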
On Mon, 2022-01-10 at 15:45 +0800, Chao Gao wrote:
> On Fri, Jan 07, 2022 at 10:31:59AM +0200, Maxim Levitsky wrote:
> > On Fri, 2022-01-07 at 16:05 +0800, Zeng Guang wrote:
> > > On 1/6/2022 10:06 PM, Tom Lendacky wrote:
> > > > On 1/5/22 7:44 PM, Zeng Guang wrote:
> > > > > On 1/6/2022 3:13 AM, Tom Lendacky wrote:
> > > > > > On 12/31/21 8:28 AM, Zeng Guang wrote:
> > > > > > Won't this blow up on AMD since there is no corresponding SVM op?
> > > > > >
> > > > > > Thanks,
> > > > > > Tom
> > > > > Right, we need to check the ops' validity to avoid breaking AMD systems. Same
> > > > > consideration applies to the "update_ipiv_pid_table" op in patch8.
> > > > Not necessarily for patch8. That is "protected" by the
> > > > kvm_check_request(KVM_REQ_PID_TABLE_UPDATE, vcpu) test, but it couldn't hurt.
> > >
> > > OK, makes sense. Thanks.
> >
> > I haven't fully reviewed this patch series yet,
> > and I will soon.
> >
> > I just want to point out a few things:
>
> Thanks for pointing them out.
>
> > 1. AMD's AVIC also has a PID table (it's called the AVIC physical ID table).
> > It stores addresses of vCPUs' APIC backing pages,
> > and their real APIC IDs.
> >
> > avic_init_backing_page initializes the entry (assuming apic_id == vcpu_id)
> > (which is double confusing)
> >
> > 2. For some reason KVM supports writable APIC IDs. Does anyone use these?
> > Even Intel's PRM strongly discourages users from using them and in X2APIC mode,
> > the APIC ID is read only.
> >
> > Because of this we have quite some bookkeeping in lapic.c,
> > (things like kvm_recalculate_apic_map and such)
> >
> > Also AVIC has its own handling for writes to APIC_ID,APIC_LDR,APIC_DFR
> > which tries to update its physical and logical ID tables.
>
> Intel's IPI virtualization doesn't handle logical-addressing IPIs. They cause
> APIC-write vm-exit as usual. So, this series doesn't handle APIC_LDR/DFR.
>
> > (it also used to handle the APIC base, and I removed this as the APIC base otherwise
> > was always hardcoded to the default value)
> >
> > Note that avic_handle_apic_id_update is broken - it always copies the entry
> > from the default (apicid == vcpu_id) location to the new location and zeros
> > the old location, which will fail in many cases, e.g. even if the guest
> > were to swap a few APIC IDs.
>
> This series differs from avic_handle_apic_id_update slightly:
>
> If a vCPU's APIC ID is changed, this series zeros the old entry in the PID-pointer
> table and programs the vCPU's PID into the new entry (rather than copying from the
> old entry).
Yes. The AVIC code is pretty much totally busted in this regard, which I noticed recently.
It will fail the 2nd time it is called because it zeroes the entry it copies,
and even if the guest changes the APIC ID only once, this code will still fail because
it is called after each AVIC inhibition.
>
> But this series is also problematic if the guest swaps two vCPUs' APIC IDs without
> using another free APIC ID; one of them would end up with no valid entry.
Yes, exactly. I wanted to fix the AVIC code and also noticed that.
Plus, the guest can in theory assign the same APIC ID to two vCPUs and keep it that
way, which complicates things further from the point of view of what malicious guests can do.
>
> One solution in my mind is:
>
> when a vCPU's APIC ID is changed, KVM traverses all vCPUs to count the vCPUs using
> the old APIC ID and the new APIC ID, and programs the corresponding entries following
> the rules below:
> 1. populate an entry with a vCPU's PID if the corresponding APIC ID is
> exclusively used by that vCPU.
> 2. zero an entry in all other cases.
Yes, that's what I was thinking as well - but zeroing *both* entries when they are
duplicates is not something I had thought of, and it is a very good idea IMHO.
>
> Proper locking is needed in this process to prevent changes to vCPUs' APIC IDs.
Yes.
>
> Or, if it isn't worth it, we can disable IPI virtualization for a guest on its
> first attempt to change the xAPIC ID.
Yes, and this brings up the main question: are there any OSes that actually change the APIC ID?
I tested winxp and win10 32-bit, and they seem to work just fine when I don't allow them to change the apic id.
Older/more obscure OSes like win98/95/dos and such don't use the APIC at all.
That leaves only modern OSes like *BSD and such; I'll try to check a few of them soon.
I just don't see any reason whatsoever to change the APIC ID. It seems that it was initially writable
just because Intel forgot to make it read-only, and then they even made it read-only with x2apic.
Both Intel and AMD's PRM also state that changing APIC ID is implementation dependent.
I vote to forbid changing apic id, at least in the case any APIC acceleration is used, be that APICv or AVIC.
Best regards,
Maxim Levitsky
>
> Let us know which option is preferred.
>
On Fri, Dec 31, 2021, Zeng Guang wrote:
> From: Robert Hoo <[email protected]>
>
> Add tertiary_exec_control field report in dump_vmcs()
>
> Signed-off-by: Robert Hoo <[email protected]>
> Signed-off-by: Zeng Guang <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index fb0f600368c6..5716db9704c0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5729,6 +5729,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> u32 vmentry_ctl, vmexit_ctl;
> u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
> + u64 tertiary_exec_control = 0;
> unsigned long cr4;
> int efer_slot;
>
> @@ -5746,6 +5747,9 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> if (cpu_has_secondary_exec_ctrls())
> secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
Gah, this (not your) code is silly. I had to go look at the full source to see
that secondary_exec_control isn't accessed uninitialized...
Can you opportunistically tweak it to the below, and use the same pattern for the
tertiary controls?
if (cpu_has_secondary_exec_ctrls())
secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
else
secondary_exec_control = 0;
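and presumably the same shape for the tertiary controls:

	if (cpu_has_tertiary_exec_ctrls())
		tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
	else
		tertiary_exec_control = 0;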
>
> + if (cpu_has_tertiary_exec_ctrls())
> + tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
> +
> pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
> vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
> pr_err("*** Guest State ***\n");
> @@ -5844,8 +5848,9 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
>
> pr_err("*** Control State ***\n");
> - pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
> - pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
> + pr_err("PinBased=0x%08x CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
> + pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control,
> + tertiary_exec_control);
Can you provide a sample dump? It's hard to visualize the output, e.g. I'm worried
this will be overly long and harder to read than putting tertiary controls on their
own line.
> pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
> pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
> vmcs_read32(EXCEPTION_BITMAP),
> --
> 2.27.0
>
On Fri, Dec 31, 2021, Zeng Guang wrote:
> In VMX non-root operation, new behavior applies to
"new behavior" is ambiguous, it's not clear if it refers to new hardware behavior,
new KVM behavior, etc...
> virtualize WRMSR to vICR in x2APIC mode. Depending
Please wrap at ~75 chars, this is too narrow.
> on settings of the VM-execution controls, CPU would
> produce APIC-write VM-exit following the 64-bit value
> written to offset 300H on the virtual-APIC page(vICR).
> KVM needs to retrieve the value written by CPU and
> emulate the vICR write to deliver an interrupt.
>
> Current KVM doesn't consider to handle the 64-bit setting
> on vICR in trap-like APIC-write VM-exit. Because using
> kvm_lapic_reg_write() to emulate writes to APIC_ICR requires
> the APIC_ICR2 is already programmed correctly. But in the
> above APIC-write VM-exit, CPU writes the whole 64 bits to
> APIC_ICR rather than program higher 32 bits and lower 32
> bits to APIC_ICR2 and APIC_ICR respectively. So, KVM needs
> to retrieve the whole 64-bit value and program higher 32 bits
> to APIC_ICR2 first.
I think this is simply saying:
Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
the WRMSR. Add support for handling "nodecode" x2APIC writes, which were
previously impossible.
Note, x2APIC MSR writes are 64 bits wide.
and then the shortlog can be:
KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
The "interrupt dispatch" part is quite confusing because it's not really germane
to the change; yes, the vICR write does (eventually) dispatch an IRQ, but that
has nothing to do with the code being modified.
> Signed-off-by: Zeng Guang <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 12 +++++++++---
> arch/x86/kvm/lapic.h | 5 +++++
> 2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index f206fc35deff..3ce7142ba00e 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2186,15 +2186,21 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
> /* emulate APIC access in a trap manner */
> void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
> {
> - u32 val = 0;
> + struct kvm_lapic *apic = vcpu->arch.apic;
> + u64 val = 0;
>
> /* hw has done the conditional check and inst decode */
> offset &= 0xff0;
>
> - kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);
> + /* exception dealing with 64bit data on vICR in x2apic mode */
> + if ((offset == APIC_ICR) && apic_x2apic_mode(apic)) {
Sorry, I failed to reply to your response in the previous version. I suggested
a WARN_ON(offset != APIC_ICR), but you were concerned that apic_x2apic_mode()
would be expensive to check before @offset. I don't think that's a valid concern
as apic_x2apic_mode() is simply:
apic->vcpu->arch.apic_base & X2APIC_ENABLE
And is likely well-predicted by the CPU, especially in single tenant or pinned
scenarios where the pCPU is running a single VM/vCPU, i.e. will almost never see
X2APIC_ENABLE toggling.
So I stand behind my previous feedback[*] that we should split on x2APIC.
> + val = kvm_lapic_get_reg64(apic, offset);
> + kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(val>>32));
> + } else
> + kvm_lapic_reg_read(apic, offset, 4, &val);
Needs curly braces. But again, I stand behind my previous feedback that this
would be better written as:
if (apic_x2apic_mode(apic)) {
if (WARN_ON_ONCE(offset != APIC_ICR))
return 1;
kvm_lapic_reg_read(apic, offset, 8, &val);
kvm_lapic_reg_write64(apic, offset, val);
} else {
kvm_lapic_reg_read(apic, offset, 4, &val);
kvm_lapic_reg_write(apic, offset, val);
}
after a patch (provided in earlier feedback) to introduce kvm_lapic_reg_write64().
[*] https://lore.kernel.org/all/[email protected]
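(For reference, a plausible shape for that helper, mirroring the ICR2-before-ICR
ordering used in the snippet above; the actual patch from the earlier feedback may
differ:)

static int kvm_lapic_reg_write64(struct kvm_lapic *apic, u32 reg, u64 data)
{
	/* For 64-bit ICR writes, program ICR2 (dest) before ICR (command). */
	if (reg == APIC_ICR)
		kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
	return kvm_lapic_reg_write(apic, reg, (u32)data);
}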
> /* TODO: optimize to just emulate side effect w/o one more write */
> - kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
> + kvm_lapic_reg_write(apic, offset, (u32)val);
> }
> EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
>
On Fri, Dec 31, 2021, Zeng Guang wrote:
> +/* Tertiary Processor-Based VM-Execution Controls, word 3 */
> +#define VMX_FEATURE_IPI_VIRT (3*32 + 4) /* "" Enable IPI virtualization */
> #endif /* _ASM_X86_VMXFEATURES_H */
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index 38d414f64e61..78b0525dd991 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -12,6 +12,7 @@ extern bool __read_mostly enable_ept;
> extern bool __read_mostly enable_unrestricted_guest;
> extern bool __read_mostly enable_ept_ad_bits;
> extern bool __read_mostly enable_pml;
> +extern bool __read_mostly enable_ipiv;
> extern int __read_mostly pt_mode;
>
> #define PT_MODE_SYSTEM 0
> @@ -283,6 +284,12 @@ static inline bool cpu_has_vmx_apicv(void)
> cpu_has_vmx_posted_intr();
> }
>
> +static inline bool cpu_has_vmx_ipiv(void)
> +{
> + return vmcs_config.cpu_based_3rd_exec_ctrl &
> + TERTIARY_EXEC_IPI_VIRT;
Unnecessary newline, that fits on a single line.
> +}
> +
> static inline bool cpu_has_vmx_flexpriority(void)
> {
> return cpu_has_vmx_tpr_shadow() &&
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 1c94783b5a54..bd9c9a89726a 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -85,11 +85,16 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
> irq_remapping_cap(IRQ_POSTING_CAP);
> }
>
> +static bool vmx_can_use_ipiv_pi(struct kvm *kvm)
> +{
> + return irqchip_in_kernel(kvm) && enable_apicv && enable_ipiv;
enable_ipiv should be cleared if !enable_apicv, i.e. the enable_apicv check
here should be unnecessary.
> +}
> +
> void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> {
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>
> - if (!vmx_can_use_vtd_pi(vcpu->kvm))
> + if (!(vmx_can_use_ipiv_pi(vcpu->kvm) || vmx_can_use_vtd_pi(vcpu->kvm)))
Purely because I am beyond terrible at reading !(A || B) and !(A && B), can we
write this as:
if (!vmx_can_use_ipiv_pi(vcpu->kvm) && !vmx_can_use_vtd_pi(vcpu->kvm))
return;
Or better, add a helper. We could even drop vmx_can_use_ipiv_pi() altogether, e.g.
static bool vmx_can_use_posted_interrupts(struct kvm *kvm)
{
return irqchip_in_kernel(kvm) &&
(enable_ipiv || vmx_can_use_vtd_pi(kvm));
}
Or with both helpers:
static bool vmx_can_use_posted_interrupts(struct kvm *kvm)
{
return vmx_can_use_ipiv_pi(kvm) || vmx_can_use_vtd_pi(kvm);
}
I don't think I have a strong preference over whether or not to drop
vmx_can_use_ipiv_pi(). I think it's marginally easier to read with the extra
helper?
> return;
>
> /* Set SN when the vCPU is preempted */
> @@ -147,7 +152,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> struct pi_desc old, new;
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>
> - if (!vmx_can_use_vtd_pi(vcpu->kvm))
> + if (!(vmx_can_use_ipiv_pi(vcpu->kvm) || vmx_can_use_vtd_pi(vcpu->kvm)))
> return 0;
>
> WARN_ON(irqs_disabled());
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5716db9704c0..2e65464d6dee 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -104,6 +104,9 @@ module_param(fasteoi, bool, S_IRUGO);
>
> module_param(enable_apicv, bool, S_IRUGO);
>
> +bool __read_mostly enable_ipiv = true;
> +module_param(enable_ipiv, bool, 0444);
> +
> /*
> * If nested=1, nested virtualization is supported, i.e., guests may use
> * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> @@ -224,6 +227,11 @@ static const struct {
> };
>
> #define L1D_CACHE_ORDER 4
> +
> +/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
> +#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
> +#define PID_TABLE_ENTRY_VALID 1
> +
> static void *vmx_l1d_flush_pages;
>
> static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
> @@ -2504,7 +2512,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> }
>
> if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
> - u64 opt3 = 0;
> + u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
> u64 min3 = 0;
>
> if (adjust_vmx_controls_64(min3, opt3,
> @@ -3841,6 +3849,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
> vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
> vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
> + vmx_set_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),
> + MSR_TYPE_RW, !enable_ipiv);
Please align this, e.g.
vmx_set_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),
MSR_TYPE_RW, !enable_ipiv);
though I think I'd actually prefer we do:
if (enable_ipiv)
vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR), MSR_TYPE_RW);
and just let it poke out. That makes it much more obvious that interception is
disabled when IPI virtualization is enabled. Using vmx_set_intercept_for_msr()
implies that it could go either way, but that's not true as vmx_reset_x2apic_msrs()
sets the bitmap to intercept all x2APIC MSRs.
> }
> }
>
On Fri, Dec 31, 2021, Zeng Guang wrote:
> +static int vmx_expand_pid_table(struct kvm_vmx *kvm_vmx, int entry_idx)
> +{
> + u64 *last_pid_table;
> + int last_table_size, new_order;
> +
> + if (entry_idx <= kvm_vmx->pid_last_index)
> + return 0;
> +
> + last_pid_table = kvm_vmx->pid_table;
> + last_table_size = table_index_to_size(kvm_vmx->pid_last_index + 1);
> + new_order = get_order(table_index_to_size(entry_idx + 1));
> +
> + if (vmx_alloc_pid_table(kvm_vmx, new_order))
> + return -ENOMEM;
> +
> + memcpy(kvm_vmx->pid_table, last_pid_table, last_table_size);
> + kvm_make_all_cpus_request(&kvm_vmx->kvm, KVM_REQ_PID_TABLE_UPDATE);
> +
> + /* Now old PID table can be freed safely as no vCPU is using it. */
> + free_pages((unsigned long)last_pid_table, get_order(last_table_size));
This is terrifying. I think it's safe? But it's still terrifying.
Rather than dynamically react as vCPUs are created, what about we make max_vcpus
common[*], extend KVM_CAP_MAX_VCPUS to allow userspace to override max_vcpus,
and then have the IPIv support allocate the PID table on first vCPU creation
instead of in vmx_vm_init()?
That will give userspace an opportunity to lower max_vcpus to reduce memory
consumption without needing to dynamically muck with the table in KVM. Then
this entire patch goes away.
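A minimal sketch of that approach, assuming a common, userspace-adjustable
kvm->max_vcpus (which doesn't exist for x86 yet; to_kvm_vmx() and pid_table are
from this series):

static int vmx_alloc_ipiv_pid_table(struct kvm *kvm)
{
	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
	unsigned long pages;

	if (!enable_ipiv || kvm_vmx->pid_table)
		return 0;

	/* Sized once, on first vCPU creation, by the userspace-capped limit. */
	pages = __get_free_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO,
				 get_order(kvm->max_vcpus * sizeof(u64)));
	if (!pages)
		return -ENOMEM;

	kvm_vmx->pid_table = (u64 *)pages;
	return 0;
}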
On Tue, Jan 11, 2022, Maxim Levitsky wrote:
> Both Intel and AMD's PRM also state that changing APIC ID is implementation
> dependent.
>
> I vote to forbid changing apic id, at least in the case any APIC acceleration
> is used, be that APICv or AVIC.
That has my vote as well. For IPIv in particular there's not much concern with
backwards compatibility, i.e. we can tie the behavior to enable_ipiv.
On Mon, Jan 10, 2022 at 03:45:25PM +0800, Chao Gao wrote:
> On Fri, Jan 07, 2022 at 10:31:59AM +0200, Maxim Levitsky wrote:
> >On Fri, 2022-01-07 at 16:05 +0800, Zeng Guang wrote:
> >> On 1/6/2022 10:06 PM, Tom Lendacky wrote:
> >> > On 1/5/22 7:44 PM, Zeng Guang wrote:
> >> > > On 1/6/2022 3:13 AM, Tom Lendacky wrote:
> >> > > > On 12/31/21 8:28 AM, Zeng Guang wrote:
> >> > > > Won't this blow up on AMD since there is no corresponding SVM op?
> >> > > >
> >> > > > Thanks,
> >> > > > Tom
> >> > > Right, we need to check the ops' validity to avoid breaking AMD systems. Same
> >> > > consideration applies to the "update_ipiv_pid_table" op in patch8.
> >> > Not necessarily for patch8. That is "protected" by the
> >> > kvm_check_request(KVM_REQ_PID_TABLE_UPDATE, vcpu) test, but it couldn't hurt.
> >>
> >> OK, makes sense. Thanks.
> >
> >I haven't fully reviewed this patch series yet,
> >and I will soon.
> >
> >I just want to point out a few things:
>
> Thanks for pointing them out.
>
> >
> >1. AMD's AVIC also has a PID table (it's called the AVIC physical ID table).
> >It stores addresses of vCPUs' APIC backing pages,
> >and their real APIC IDs.
> >
> >avic_init_backing_page initializes the entry (assuming apic_id == vcpu_id)
> >(which is double confusing)
> >
> >2. For some reason KVM supports writable APIC IDs. Does anyone use these?
> >Even Intel's PRM strongly discourages users from using them and in X2APIC mode,
> >the APIC ID is read only.
> >
> >Because of this we have quite some bookkeeping in lapic.c,
> >(things like kvm_recalculate_apic_map and such)
> >
> >Also AVIC has its own handling for writes to APIC_ID,APIC_LDR,APIC_DFR
> >which tries to update its physical and logical ID tables.
>
> Intel's IPI virtualization doesn't handle logical-addressing IPIs. They cause
> APIC-write vm-exit as usual. So, this series doesn't handle APIC_LDR/DFR.
>
> >
> >(it also used to handle the APIC base, and I removed this as the APIC base otherwise
> >was always hardcoded to the default value)
> >
> >Note that avic_handle_apic_id_update is broken - it always copies the entry
> >from the default (apicid == vcpu_id) location to the new location and zeros
> >the old location, which will fail in many cases, e.g. even if the guest
> >were to swap a few APIC IDs.
>
> This series differs from avic_handle_apic_id_update slightly:
>
> If a vCPU's APIC ID is changed, this series zeros the old entry in the PID-pointer
> table and programs the vCPU's PID into the new entry (rather than copying from the
> old entry).
>
> But this series is also problematic if the guest swaps two vCPUs' APIC IDs without
> using another free APIC ID; one of them would end up with no valid entry.
>
> One solution in my mind is:
>
> when a vCPU's APIC ID is changed, KVM traverses all vCPUs to count the vCPUs using
> the old APIC ID and the new APIC ID, and programs the corresponding entries following
> the rules below:
> 1. populate an entry with a vCPU's PID if the corresponding APIC ID is
> exclusively used by that vCPU.
> 2. zero an entry in all other cases.
I think there's no need to traverse; just don't zero the old entry if it doesn't
belong to the vCPU:

	/* take a new or an existing VM-level lock */
	if (__pa(&to_vmx(vcpu)->pi_desc) == (pid_table[old_id] & ~PID_TABLE_ENTRY_VALID))
		WRITE_ONCE(pid_table[old_id], 0);
	WRITE_ONCE(pid_table[new_id], __pa(&to_vmx(vcpu)->pi_desc) |
		   PID_TABLE_ENTRY_VALID);
	/* release the VM-level lock */
>
> Proper locking is needed in this process to prevent changes to vCPUs' APIC IDs.
>
> Or, if it isn't worth it, we can disable IPI virtualization for a guest on its
> first attempt to change the xAPIC ID.
>
> Let us know which option is preferred.
On Thu, Jan 13, 2022 at 10:19:21PM +0000, Sean Christopherson wrote:
>On Tue, Jan 11, 2022, Maxim Levitsky wrote:
>> Both Intel and AMD's PRM also state that changing APIC ID is implementation
>> dependent.
>>
>> I vote to forbid changing apic id, at least in the case any APIC acceleration
>> is used, be that APICv or AVIC.
>
>That has my vote as well. For IPIv in particular there's not much concern with
>backwards compability, i.e. we can tie the behavior to enable_ipiv.
Hi Sean and Levitsky,
Let's align on the implementation.
To disable changes for xAPIC ID when IPIv/AVIC is enabled:
1. introduce a variable (forbid_apicid_change) for this behavior in kvm.ko
and export it so that kvm-intel, kvm-amd can set it when IPIv/AVIC is
enabled. To reduce complexity, this variable is a module level setting.
2. when the guest attempts to change the xAPIC ID but it is forbidden, KVM prints
a warning on the host and injects a #GP into the guest.
3. remove AVIC code that deals with changes to xAPIC ID.
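For point 2 above, the check could sit in kvm_lapic_reg_write()'s APIC_ID case,
roughly like this (forbid_apicid_change is the knob proposed above; the else branch
mirrors the existing x2APIC handling):

	case APIC_ID:		/* xAPIC ID, writable only in xAPIC mode */
		if (forbid_apicid_change) {
			pr_warn_once("KVM: guest attempted to change its xAPIC ID\n");
			ret = 1;	/* reported as an illegal write, i.e. #GP */
			break;
		}
		if (!apic_x2apic_mode(apic))
			kvm_apic_set_xapic_id(apic, val >> 24);
		else
			ret = 1;
		break;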
On 1/14/2022 5:03 AM, Sean Christopherson wrote:
> On Fri, Dec 31, 2021, Zeng Guang wrote:
>> From: Robert Hoo <[email protected]>
>>
>> Add tertiary_exec_control field report in dump_vmcs()
>>
>> Signed-off-by: Robert Hoo <[email protected]>
>> Signed-off-by: Zeng Guang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index fb0f600368c6..5716db9704c0 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -5729,6 +5729,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>> u32 vmentry_ctl, vmexit_ctl;
>> u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
>> + u64 tertiary_exec_control = 0;
>> unsigned long cr4;
>> int efer_slot;
>>
>> @@ -5746,6 +5747,9 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>> if (cpu_has_secondary_exec_ctrls())
>> secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> Gah, this (not your) code is silly. I had to go look at the full source to see
> that secondary_exec_control isn't accessed uninitialized...
>
> Can you opportunistically tweak it to the below, and use the same pattern for the
> tertiary controls?
>
> if (cpu_has_secondary_exec_ctrls())
> secondary_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> else
> secondary_exec_control = 0;
Actually, secondary_exec_control was zero-initialized beforehand. Yeah, it's
better to unify the code for both.
>>
>> + if (cpu_has_tertiary_exec_ctrls())
>> + tertiary_exec_control = vmcs_read64(TERTIARY_VM_EXEC_CONTROL);
>> +
>> pr_err("VMCS %p, last attempted VM-entry on CPU %d\n",
>> vmx->loaded_vmcs->vmcs, vcpu->arch.last_vmentry_cpu);
>> pr_err("*** Guest State ***\n");
>> @@ -5844,8 +5848,9 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>> vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
>>
>> pr_err("*** Control State ***\n");
>> - pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
>> - pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control);
>> + pr_err("PinBased=0x%08x CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
>> + pin_based_exec_ctrl, cpu_based_exec_ctrl, secondary_exec_control,
>> + tertiary_exec_control);
> Can you provide a sample dump? It's hard to visualize the output, e.g. I'm worried
> this will be overly long and harder to read than putting tertiary controls on their
> own line.
Sample dump here.
*** Control State ***
PinBased=0x000000ff CPUBased=0xb5a26dfa SecondaryExec=0x061037eb
TertiaryExec=0x0000000000000010
EntryControls=0000d1ff ExitControls=002befff
ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
VMEntry: intr_info=00000000 errcode=00000000 ilen=00000000
VMExit: intr_info=00000000 errcode=00000000 ilen=00000003
reason=00000030 qualification=0000000000000784
>> pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
>> pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
>> vmcs_read32(EXCEPTION_BITMAP),
>> --
>> 2.27.0
>>
On 1/14/2022 5:47 AM, Sean Christopherson wrote:
> On Fri, Dec 31, 2021, Zeng Guang wrote:
>> +/* Tertiary Processor-Based VM-Execution Controls, word 3 */
>> +#define VMX_FEATURE_IPI_VIRT (3*32 + 4) /* "" Enable IPI virtualization */
>> #endif /* _ASM_X86_VMXFEATURES_H */
>> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
>> index 38d414f64e61..78b0525dd991 100644
>> --- a/arch/x86/kvm/vmx/capabilities.h
>> +++ b/arch/x86/kvm/vmx/capabilities.h
>> @@ -12,6 +12,7 @@ extern bool __read_mostly enable_ept;
>> extern bool __read_mostly enable_unrestricted_guest;
>> extern bool __read_mostly enable_ept_ad_bits;
>> extern bool __read_mostly enable_pml;
>> +extern bool __read_mostly enable_ipiv;
>> extern int __read_mostly pt_mode;
>>
>> #define PT_MODE_SYSTEM 0
>> @@ -283,6 +284,12 @@ static inline bool cpu_has_vmx_apicv(void)
>> cpu_has_vmx_posted_intr();
>> }
>>
>> +static inline bool cpu_has_vmx_ipiv(void)
>> +{
>> + return vmcs_config.cpu_based_3rd_exec_ctrl &
>> + TERTIARY_EXEC_IPI_VIRT;
> Unnecessary newline, that fits on a single line.
OK.
>> +}
>> +
>> static inline bool cpu_has_vmx_flexpriority(void)
>> {
>> return cpu_has_vmx_tpr_shadow() &&
>> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
>> index 1c94783b5a54..bd9c9a89726a 100644
>> --- a/arch/x86/kvm/vmx/posted_intr.c
>> +++ b/arch/x86/kvm/vmx/posted_intr.c
>> @@ -85,11 +85,16 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
>> irq_remapping_cap(IRQ_POSTING_CAP);
>> }
>>
>> +static bool vmx_can_use_ipiv_pi(struct kvm *kvm)
>> +{
>> + return irqchip_in_kernel(kvm) && enable_apicv && enable_ipiv;
> enable_ipiv should be cleared if !enable_apicv, i.e. the enable_apicv check
> here should be unnecessary.
Right, it's more concise. Thanks.
>> +}
>> +
>> void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>> {
>> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>>
>> - if (!vmx_can_use_vtd_pi(vcpu->kvm))
>> + if (!(vmx_can_use_ipiv_pi(vcpu->kvm) || vmx_can_use_vtd_pi(vcpu->kvm)))
> Purely because I am beyond terrible at reading !(A || B) and !(A && B), can we
> write this as:
>
> if (!vmx_can_use_ipiv_pi(vcpu->kvm) && !vmx_can_use_vtd_pi(vcpu->kvm))
> return;
>
> Or better, add a helper. We could even drop vmx_can_use_ipiv_pi() altogether, e.g.
>
> static bool vmx_can_use_posted_interrupts(struct kvm *kvm)
> {
> return irqchip_in_kernel(kvm) &&
> (enable_ipiv || vmx_can_use_vtd_pi(kvm));
> }
>
> Or with both helpers:
>
> static bool vmx_can_use_posted_interrupts(struct kvm *kvm)
> {
> return vmx_can_use_ipiv_pi(kvm) || vmx_can_use_vtd_pi(kvm);
> }
>
> I don't think I have a strong preference over whether or not to drop
> vmx_can_use_ipiv_pi(). I think it's marginally easier to read with the extra
> helper?
I'd like to add the helper without dropping vmx_can_use_ipiv_pi(), which keeps the
logic clear and independent.
>> return;
>>
>> /* Set SN when the vCPU is preempted */
>> @@ -147,7 +152,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
>> struct pi_desc old, new;
>> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>>
>> - if (!vmx_can_use_vtd_pi(vcpu->kvm))
>> + if (!(vmx_can_use_ipiv_pi(vcpu->kvm) || vmx_can_use_vtd_pi(vcpu->kvm)))
>> return 0;
>>
>> WARN_ON(irqs_disabled());
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 5716db9704c0..2e65464d6dee 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -104,6 +104,9 @@ module_param(fasteoi, bool, S_IRUGO);
>>
>> module_param(enable_apicv, bool, S_IRUGO);
>>
>> +bool __read_mostly enable_ipiv = true;
>> +module_param(enable_ipiv, bool, 0444);
>> +
>> /*
>> * If nested=1, nested virtualization is supported, i.e., guests may use
>> * VMX and be a hypervisor for its own guests. If nested=0, guests may not
>> @@ -224,6 +227,11 @@ static const struct {
>> };
>>
>> #define L1D_CACHE_ORDER 4
>> +
>> +/* PID(Posted-Interrupt Descriptor)-pointer table entry is 64-bit long */
>> +#define MAX_PID_TABLE_ORDER get_order(KVM_MAX_VCPU_IDS * sizeof(u64))
>> +#define PID_TABLE_ENTRY_VALID 1
>> +
>> static void *vmx_l1d_flush_pages;
>>
>> static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
>> @@ -2504,7 +2512,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>> }
>>
>> if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
>> - u64 opt3 = 0;
>> + u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
>> u64 min3 = 0;
>>
>> if (adjust_vmx_controls_64(min3, opt3,
>> @@ -3841,6 +3849,8 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
>> vmx_enable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_RW);
>> vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
>> vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
>> + vmx_set_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),
>> + MSR_TYPE_RW, !enable_ipiv);
> Please align this, e.g.
>
> vmx_set_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR),
> MSR_TYPE_RW, !enable_ipiv);
>
> though I think I'd actually prefer we do:
>
>
> if (enable_ipiv)
> vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_ICR), MSR_TYPE_RW);
>
> and just let it poke out. That makes it much more obvious that interception is
> disabled when IPI virtualization is enabled. Using vmx_set_intercept_for_msr()
> implies that it could go either way, but that's not true as vmx_reset_x2apic_msrs()
> sets the bitmap to intercept all x2APIC MSRs.
Makes sense. Will do.
On 1/14/2022 5:29 AM, Sean Christopherson wrote:
> On Fri, Dec 31, 2021, Zeng Guang wrote:
>> In VMX non-root operation, new behavior applies to
> "new behavior" is ambiguous, it's not clear if it refers to new hardware behavior,
> new KVM behavior, etc...
>
>> virtualize WRMSR to vICR in x2APIC mode. Depending
> Please wrap at ~75 chars, this is too narrow.
>
>> on settings of the VM-execution controls, CPU would
>> produce APIC-write VM-exit following the 64-bit value
>> written to offset 300H on the virtual-APIC page(vICR).
>> KVM needs to retrieve the value written by CPU and
>> emulate the vICR write to deliver an interrupt.
>>
>> Current KVM doesn't consider to handle the 64-bit setting
>> on vICR in trap-like APIC-write VM-exit. Because using
>> kvm_lapic_reg_write() to emulate writes to APIC_ICR requires
>> the APIC_ICR2 is already programmed correctly. But in the
>> above APIC-write VM-exit, CPU writes the whole 64 bits to
>> APIC_ICR rather than program higher 32 bits and lower 32
>> bits to APIC_ICR2 and APIC_ICR respectively. So, KVM needs
>> to retrieve the whole 64-bit value and program higher 32 bits
>> to APIC_ICR2 first.
> I think this is simply saying:
>
> Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
> i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
> the WRMSR. Add support for handling "nodecode" x2APIC writes, which were
> previously impossible.
>
> Note, x2APIC MSR writes are 64 bits wide.
>
> and then the shortlog can be:
>
> KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
>
> The "interrupt dispatch" part is quite confusing because it's not really germane
> to the change; yes, the vICR write does (eventually) dispatch an IRQ, but that
> has nothing to do with the code being modified.
I'll take the commit message as you suggested. Thanks.
>> Signed-off-by: Zeng Guang <[email protected]>
>> ---
>> arch/x86/kvm/lapic.c | 12 +++++++++---
>> arch/x86/kvm/lapic.h | 5 +++++
>> 2 files changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index f206fc35deff..3ce7142ba00e 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2186,15 +2186,21 @@ EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
>> /* emulate APIC access in a trap manner */
>> void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
>> {
>> - u32 val = 0;
>> + struct kvm_lapic *apic = vcpu->arch.apic;
>> + u64 val = 0;
>>
>> /* hw has done the conditional check and inst decode */
>> offset &= 0xff0;
>>
>> - kvm_lapic_reg_read(vcpu->arch.apic, offset, 4, &val);
>> + /* exception dealing with 64bit data on vICR in x2apic mode */
>> + if ((offset == APIC_ICR) && apic_x2apic_mode(apic)) {
> Sorry, I failed to reply to your response in the previous version. I suggested
> a WARN_ON(offset != APIC_ICR), but you were concerned that apic_x2apic_mode()
> would be expensive to check before @offset. I don't think that's a valid concern
> as apic_x2apic_mode() is simply:
>
> apic->vcpu->arch.apic_base & X2APIC_ENABLE
>
> And is likely well-predicted by the CPU, especially in single tenant or pinned
> scenarios where the pCPU is running a single VM/vCPU, i.e. will almost never see
> X2APIC_ENABLE toggling.
>
> So I stand behind my previous feedback[*] that we should split on x2APIC.
>
>> + val = kvm_lapic_get_reg64(apic, offset);
>> + kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(val>>32));
>> + } else
>> + kvm_lapic_reg_read(apic, offset, 4, &val);
> Needs curly braces. But again, I stand behind my previous feedback that this
> would be better written as:
>
> if (apic_x2apic_mode(apic)) {
> if (WARN_ON_ONCE(offset != APIC_ICR))
> return 1;
>
> kvm_lapic_reg_read(apic, offset, 8, &val);
> kvm_lapic_reg_write64(apic, offset, val);
> } else {
> kvm_lapic_reg_read(apic, offset, 4, &val);
> kvm_lapic_reg_write(apic, offset, val);
> }
>
> after a patch (provided in earlier feedback) to introduce kvm_lapic_reg_write64().
>
> [*] https://lore.kernel.org/all/[email protected]
kvm_lapic_reg_read() is limited to reading up to 4 bytes. It would need an
extension to support a 64-bit read. Another concern is that here the register
value comes only from the vICR (no other registers need to be taken care of),
so going through the whole kvm_lapic_reg_read() path could be unnecessarily
time-consuming. Is it proper to call kvm_lapic_get_reg64() to retrieve the
vICR value directly?
The change could look like the following:
if (apic_x2apic_mode(apic)) {
if (WARN_ON_ONCE(offset != APIC_ICR))
return 1;
val = kvm_lapic_get_reg64(apic, offset);
kvm_lapic_reg_write64(apic, offset, val);
} else {
kvm_lapic_reg_read(apic, offset, 4, &val);
kvm_lapic_reg_write(apic, offset, val);
}
>> /* TODO: optimize to just emulate side effect w/o one more write */
>> - kvm_lapic_reg_write(vcpu->arch.apic, offset, val);
>> + kvm_lapic_reg_write(apic, offset, (u32)val);
>> }
>> EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
>>
On Fri, 2022-01-14 at 10:58 +0800, Chao Gao wrote:
> On Thu, Jan 13, 2022 at 10:19:21PM +0000, Sean Christopherson wrote:
> > On Tue, Jan 11, 2022, Maxim Levitsky wrote:
> > > Both Intel and AMD's PRM also state that changing APIC ID is implementation
> > > dependent.
> > >
> > > I vote to forbid changing apic id, at least in the case any APIC acceleration
> > > is used, be that APICv or AVIC.
> >
> > That has my vote as well. For IPIv in particular there's not much concern with
> > backwards compability, i.e. we can tie the behavior to enable_ipiv.
Great!
>
> Hi Sean and Levitsky,
>
> Let's align on the implementation.
>
> To disable changes for xAPIC ID when IPIv/AVIC is enabled:
>
> 1. introduce a variable (forbid_apicid_change) for this behavior in kvm.ko
> and export it so that kvm-intel, kvm-amd can set it when IPIv/AVIC is
> enabled. To reduce complexity, this variable is a module level setting.
>
> 2. when the guest attempts to change the xAPIC ID but it is forbidden, KVM prints
> a warning on the host and injects a #GP into the guest.
>
> 3. remove AVIC code that deals with changes to xAPIC ID.
>
I have patches for both; I attached them.
I haven't tested either of these patches that much other than a smoke test,
but I did test all of the guests I have and none broke in regard to boot.
I will send those patches as part of a larger patch series that implements
nesting for AVIC. I hope to do this next week.
Best regards,
Maxim Levitsky
On 1/14/2022 6:09 AM, Sean Christopherson wrote:
> On Fri, Dec 31, 2021, Zeng Guang wrote:
>> +static int vmx_expand_pid_table(struct kvm_vmx *kvm_vmx, int entry_idx)
>> +{
>> + u64 *last_pid_table;
>> + int last_table_size, new_order;
>> +
>> + if (entry_idx <= kvm_vmx->pid_last_index)
>> + return 0;
>> +
>> + last_pid_table = kvm_vmx->pid_table;
>> + last_table_size = table_index_to_size(kvm_vmx->pid_last_index + 1);
>> + new_order = get_order(table_index_to_size(entry_idx + 1));
>> +
>> + if (vmx_alloc_pid_table(kvm_vmx, new_order))
>> + return -ENOMEM;
>> +
>> + memcpy(kvm_vmx->pid_table, last_pid_table, last_table_size);
>> + kvm_make_all_cpus_request(&kvm_vmx->kvm, KVM_REQ_PID_TABLE_UPDATE);
>> +
>> + /* Now old PID table can be freed safely as no vCPU is using it. */
>> + free_pages((unsigned long)last_pid_table, get_order(last_table_size));
> This is terrifying. I think it's safe? But it's still terrifying.
Freeing the old PID table here is safe, as KVM makes the KVM_REQ_PID_TABLE_UPDATE
request with the KVM_REQUEST_WAIT flag, forcing all vCPUs to trigger a VM-exit and
update the VMCS field to point at the newly allocated PID table. At that point, it
is guaranteed that the old PID table is not referenced by any vCPU.
Do you mean it still has a potential problem?
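For reference, a sketch of the per-vCPU side of that request (the VMCS field names
are assumed to be the ones this series defines):

	/* Ran by each vCPU before re-entering the guest. KVM_REQUEST_WAIT
	 * ensures kvm_make_all_cpus_request() does not return until every
	 * vCPU has exited, so the old table is unreferenced when freed.
	 */
	if (kvm_check_request(KVM_REQ_PID_TABLE_UPDATE, vcpu)) {
		vmcs_write64(PID_POINTER_TABLE, __pa(kvm_vmx->pid_table));
		vmcs_write16(LAST_PID_POINTER_INDEX, kvm_vmx->pid_last_index);
	}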
> Rather than dynamically react as vCPUs are created, what about we make max_vcpus
> common[*], extend KVM_CAP_MAX_VCPUS to allow userspace to override max_vcpus,
> and then have the IPIv support allocate the PID table on first vCPU creation
> instead of in vmx_vm_init()?
>
> That will give userspace an opportunity to lower max_vcpus to reduce memory
> consumption without needing to dynamically muck with the table in KVM. Then
> this entire patch goes away.
IIUC, it's risky to rely on userspace. That way, userspace also has the chance to
assign a large max_vcpus but not use them at all. This cannot achieve the goal of
saving as much memory as possible, similar to using KVM_MAX_VCPU_IDS to allocate
the PID table.
On Fri, Jan 14, 2022, Zeng Guang wrote:
> On 1/14/2022 6:09 AM, Sean Christopherson wrote:
> > On Fri, Dec 31, 2021, Zeng Guang wrote:
> > > +static int vmx_expand_pid_table(struct kvm_vmx *kvm_vmx, int entry_idx)
> > > +{
> > > + u64 *last_pid_table;
> > > + int last_table_size, new_order;
> > > +
> > > + if (entry_idx <= kvm_vmx->pid_last_index)
> > > + return 0;
> > > +
> > > + last_pid_table = kvm_vmx->pid_table;
> > > + last_table_size = table_index_to_size(kvm_vmx->pid_last_index + 1);
> > > + new_order = get_order(table_index_to_size(entry_idx + 1));
> > > +
> > > + if (vmx_alloc_pid_table(kvm_vmx, new_order))
> > > + return -ENOMEM;
> > > +
> > > + memcpy(kvm_vmx->pid_table, last_pid_table, last_table_size);
> > > + kvm_make_all_cpus_request(&kvm_vmx->kvm, KVM_REQ_PID_TABLE_UPDATE);
> > > +
> > > + /* Now old PID table can be freed safely as no vCPU is using it. */
> > > + free_pages((unsigned long)last_pid_table, get_order(last_table_size));
> > This is terrifying. I think it's safe? But it's still terrifying.
>
> Freeing the old PID table here is safe, as KVM makes the KVM_REQ_PID_TABLE_UPDATE
> request with the KVM_REQUEST_WAIT flag, forcing all vCPUs to trigger a VM-exit and
> update the VMCS field to point at the newly allocated PID table. At that point, it
> is guaranteed that the old PID table is not referenced by any vCPU.
> Do you mean it still has a potential problem?
No, I do think it's safe, but it is still terrifying :-)
> > Rather than dynamically react as vCPUs are created, what about we make max_vcpus
> > common[*], extend KVM_CAP_MAX_VCPUS to allow userspace to override max_vcpus,
> > and then have the IPIv support allocate the PID table on first vCPU creation
> > instead of in vmx_vm_init()?
> >
> > That will give userspace an opportunity to lower max_vcpus to reduce memory
> > consumption without needing to dynamically muck with the table in KVM. Then
> > this entire patch goes away.
> IIUC, it's risky to rely on userspace.
That's why we have cgroups, rlimits, etc...
> That way, userspace also has the chance to assign a large max_vcpus but not use
> them at all. This cannot achieve the goal of saving as much memory as possible,
> similar to using KVM_MAX_VCPU_IDS to allocate the PID table.
Userspace can simply do KVM_CREATE_VCPU until it hits KVM_MAX_VCPU_IDS...
On Fri, Jan 14, 2022, Zeng Guang wrote:
> kvm_lapic_reg_read() is limited to reading up to 4 bytes. It would need an
> extension to support a 64-bit read.
Ah, right.
> Another concern is that here the register value comes only from the vICR (no
> other registers need to be taken care of), so going through the whole
> kvm_lapic_reg_read() path could be unnecessarily time-consuming. Is it proper
> to call kvm_lapic_get_reg64() to retrieve the vICR value directly?
Hmm, no, I don't think that's proper. Retrieving a 64-bit value really is unique
to vICR. Yes, the code does WARN on that, but if future architectural extensions
ever generate APIC-write exits on other registers, then using kvm_lapic_get_reg64()
would be wrong and this code would need to be updated again.
What about tweaking my prep patch from before to the below? That would yield:
if (apic_x2apic_mode(apic)) {
if (WARN_ON_ONCE(offset != APIC_ICR))
return 1;
kvm_lapic_msr_read(apic, offset, &val);
kvm_lapic_msr_write(apic, offset, val);
} else {
kvm_lapic_reg_read(apic, offset, 4, &val);
kvm_lapic_reg_write(apic, offset, val);
}
I like that the above has "msr" in the low level x2apic helpers, and it maximizes
code reuse. Compile tested only...
From: Sean Christopherson <[email protected]>
Date: Fri, 14 Jan 2022 09:29:34 -0800
Subject: [PATCH] KVM: x86: Add helpers to handle 64-bit APIC MSR read/writes
Add helpers to handle 64-bit APIC read/writes via MSRs to deduplicate the
x2APIC and Hyper-V code needed to service reads/writes to ICR. Future
support for IPI virtualization will add yet another path where KVM must
handle 64-bit APIC MSR reads/writes (to ICR).
Opportunistically fix the comment in the write path; ICR2 holds the
destination (if there's no shorthand), not the vector.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 59 ++++++++++++++++++++++----------------------
1 file changed, 29 insertions(+), 30 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f206fc35deff..cc4531eb448f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2787,6 +2787,30 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
return 0;
}
+static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data)
+{
+ u32 low, high = 0;
+
+ if (kvm_lapic_reg_read(apic, reg, 4, &low))
+ return 1;
+
+ if (reg == APIC_ICR &&
+ WARN_ON_ONCE(kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high)))
+ return 1;
+
+ *data = (((u64)high) << 32) | low;
+
+ return 0;
+}
+
+static int kvm_lapic_msr_write(struct kvm_lapic *apic, u32 reg, u64 data)
+{
+ /* For 64-bit ICR writes, set ICR2 (dest) before ICR (command). */
+ if (reg == APIC_ICR)
+ kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
+ return kvm_lapic_reg_write(apic, reg, (u32)data);
+}
+
int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
struct kvm_lapic *apic = vcpu->arch.apic;
@@ -2798,16 +2822,13 @@ int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (reg == APIC_ICR2)
return 1;
- /* if this is ICR write vector before command */
- if (reg == APIC_ICR)
- kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
- return kvm_lapic_reg_write(apic, reg, (u32)data);
+ return kvm_lapic_msr_write(apic, reg, data);
}
int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u32 reg = (msr - APIC_BASE_MSR) << 4, low, high = 0;
+ u32 reg = (msr - APIC_BASE_MSR) << 4;
if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(apic))
return 1;
@@ -2815,45 +2836,23 @@ int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
if (reg == APIC_DFR || reg == APIC_ICR2)
return 1;
- if (kvm_lapic_reg_read(apic, reg, 4, &low))
- return 1;
- if (reg == APIC_ICR)
- kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high);
-
- *data = (((u64)high) << 32) | low;
-
- return 0;
+ return kvm_lapic_msr_read(apic, reg, data);
}
int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 reg, u64 data)
{
- struct kvm_lapic *apic = vcpu->arch.apic;
-
if (!lapic_in_kernel(vcpu))
return 1;
- /* if this is ICR write vector before command */
- if (reg == APIC_ICR)
- kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
- return kvm_lapic_reg_write(apic, reg, (u32)data);
+ return kvm_lapic_msr_write(vcpu->arch.apic, reg, data);
}
int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
{
- struct kvm_lapic *apic = vcpu->arch.apic;
- u32 low, high = 0;
-
if (!lapic_in_kernel(vcpu))
return 1;
- if (kvm_lapic_reg_read(apic, reg, 4, &low))
- return 1;
- if (reg == APIC_ICR)
- kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high);
-
- *data = (((u64)high) << 32) | low;
-
- return 0;
+ return kvm_lapic_msr_read(vcpu->arch.apic, reg, data);
}
int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
--
On 1/15/2022 1:34 AM, Sean Christopherson wrote:
> On Fri, Jan 14, 2022, Zeng Guang wrote:
>> kvm_lapic_reg_read() is limited to reading up to 4 bytes. It would need an
>> extension to support a 64-bit read.
> Ah, right.
>
>> Another concern is that here the register value comes only from the vICR (no
>> other registers need to be taken care of), so going through the whole
>> kvm_lapic_reg_read() path could be unnecessarily time-consuming. Is it proper
>> to call kvm_lapic_get_reg64() to retrieve the vICR value directly?
> Hmm, no, I don't think that's proper. Retrieving a 64-bit value really is unique
> to vICR. Yes, the code does WARN on that, but if future architectural extensions
> ever generate APIC-write exits on other registers, then using kvm_lapic_get_reg64()
> would be wrong and this code would need to be updated again.
Splitting on x2apic and the WARN on (offset != APIC_ICR) already limit the
register read to the vICR only. Actually, we just need to deal with the 64-bit
data specific to the vICR in APIC-write exits. From this point of view, the
previous design remains compatible with handling other registers even if future
architectural extensions change things. :)
>
> What about tweaking my prep patch from before to the below? That would yield:
>
> if (apic_x2apic_mode(apic)) {
> if (WARN_ON_ONCE(offset != APIC_ICR))
> return 1;
>
> kvm_lapic_msr_read(apic, offset, &val);
I think it's problematic to use kvm_lapic_msr_read() in this case. It presumes
the high 32-bit value is already valid in APIC_ICR2, while when handling
"nodecode" x2APIC writes we need to get the contiguous 64-bit data from offset
300H first and then prepare the emulation of the APIC_ICR2 write. At that
point, APIC_ICR2 is not ready yet.
> kvm_lapic_msr_write(apic, offset, val);
> } else {
> kvm_lapic_reg_read(apic, offset, 4, &val);
> kvm_lapic_reg_write(apic, offset, val);
> }
>
> I like that the above has "msr" in the low level x2apic helpers, and it maximizes
> code reuse. Compile tested only...
>
> From: Sean Christopherson <[email protected]>
> Date: Fri, 14 Jan 2022 09:29:34 -0800
> Subject: [PATCH] KVM: x86: Add helpers to handle 64-bit APIC MSR read/writes
>
> Add helpers to handle 64-bit APIC read/writes via MSRs to deduplicate the
> x2APIC and Hyper-V code needed to service reads/writes to ICR. Future
> support for IPI virtualization will add yet another path where KVM must
> handle 64-bit APIC MSR reads/writes (to ICR).
>
> Opportunistically fix the comment in the write path; ICR2 holds the
> destination (if there's no shorthand), not the vector.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 59 ++++++++++++++++++++++----------------------
> 1 file changed, 29 insertions(+), 30 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index f206fc35deff..cc4531eb448f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2787,6 +2787,30 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
> return 0;
> }
>
> +static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data)
> +{
> + u32 low, high = 0;
> +
> + if (kvm_lapic_reg_read(apic, reg, 4, &low))
> + return 1;
> +
> + if (reg == APIC_ICR &&
> + WARN_ON_ONCE(kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high)))
> + return 1;
> +
> + *data = (((u64)high) << 32) | low;
> +
> + return 0;
> +}
> +
> +static int kvm_lapic_msr_write(struct kvm_lapic *apic, u32 reg, u64 data)
> +{
> + /* For 64-bit ICR writes, set ICR2 (dest) before ICR (command). */
> + if (reg == APIC_ICR)
> + kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> + return kvm_lapic_reg_write(apic, reg, (u32)data);
> +}
> +
> int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> {
> struct kvm_lapic *apic = vcpu->arch.apic;
> @@ -2798,16 +2822,13 @@ int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> if (reg == APIC_ICR2)
> return 1;
>
> - /* if this is ICR write vector before command */
> - if (reg == APIC_ICR)
> - kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> - return kvm_lapic_reg_write(apic, reg, (u32)data);
> + return kvm_lapic_msr_write(apic, reg, data);
> }
>
> int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
> {
> struct kvm_lapic *apic = vcpu->arch.apic;
> - u32 reg = (msr - APIC_BASE_MSR) << 4, low, high = 0;
> + u32 reg = (msr - APIC_BASE_MSR) << 4;
>
> if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(apic))
> return 1;
> @@ -2815,45 +2836,23 @@ int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
> if (reg == APIC_DFR || reg == APIC_ICR2)
> return 1;
>
> - if (kvm_lapic_reg_read(apic, reg, 4, &low))
> - return 1;
> - if (reg == APIC_ICR)
> - kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high);
> -
> - *data = (((u64)high) << 32) | low;
> -
> - return 0;
> + return kvm_lapic_msr_read(apic, reg, data);
> }
>
> int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 reg, u64 data)
> {
> - struct kvm_lapic *apic = vcpu->arch.apic;
> -
> if (!lapic_in_kernel(vcpu))
> return 1;
>
> - /* if this is ICR write vector before command */
> - if (reg == APIC_ICR)
> - kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> - return kvm_lapic_reg_write(apic, reg, (u32)data);
> + return kvm_lapic_msr_write(vcpu->arch.apic, reg, data);
> }
>
> int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
> {
> - struct kvm_lapic *apic = vcpu->arch.apic;
> - u32 low, high = 0;
> -
> if (!lapic_in_kernel(vcpu))
> return 1;
>
> - if (kvm_lapic_reg_read(apic, reg, 4, &low))
> - return 1;
> - if (reg == APIC_ICR)
> - kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high);
> -
> - *data = (((u64)high) << 32) | low;
> -
> - return 0;
> + return kvm_lapic_msr_read(vcpu->arch.apic, reg, data);
> }
>
> int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
> --
On Fri, Jan 14, 2022 at 10:17:34AM +0200, Maxim Levitsky wrote:
>On Fri, 2022-01-14 at 10:58 +0800, Chao Gao wrote:
>> On Thu, Jan 13, 2022 at 10:19:21PM +0000, Sean Christopherson wrote:
>> > On Tue, Jan 11, 2022, Maxim Levitsky wrote:
>> > > Both Intel and AMD's PRM also state that changing APIC ID is implementation
>> > > dependent.
>> > >
>> > > I vote to forbid changing apic id, at least in the case any APIC acceleration
>> > > is used, be that APICv or AVIC.
>> >
>> > That has my vote as well. For IPIv in particular there's not much concern with
>> > backwards compability, i.e. we can tie the behavior to enable_ipiv.
>Great!
>>
>> Hi Sean and Levitsky,
>>
>> Let's align on the implementation.
>>
>> To disable changes for xAPIC ID when IPIv/AVIC is enabled:
>>
>> 1. introduce a variable (forbid_apicid_change) for this behavior in kvm.ko
>> and export it so that kvm-intel, kvm-amd can set it when IPIv/AVIC is
>> enabled. To reduce complexity, this variable is a module level setting.
>>
>> 2. when the guest attempts to change the xAPIC ID but it is forbidden, KVM prints
>> a warning on the host and injects a #GP into the guest.
>>
>> 3. remove AVIC code that deals with changes to xAPIC ID.
>>
>
>I have patches for both; I attached them.
Looks good to me. We will drop this patch and rely on the first attached patch
to forbid the guest from changing the xAPIC ID.
>I haven't tested either of these patches that much other than a smoke test,
>but I did test all of the guests I have and none broke in regard to boot.
>
>I will send those patches as part of a larger patch series that implements
>nesting for AVIC. I hope to do this next week.
Thanks.
On 1/15/2022 12:18 AM, Sean Christopherson wrote:
> On Fri, Jan 14, 2022, Zeng Guang wrote:
>> On 1/14/2022 6:09 AM, Sean Christopherson wrote:
>>> On Fri, Dec 31, 2021, Zeng Guang wrote:
>>>> +static int vmx_expand_pid_table(struct kvm_vmx *kvm_vmx, int entry_idx)
>>>> +{
>>>> + u64 *last_pid_table;
>>>> + int last_table_size, new_order;
>>>> +
>>>> + if (entry_idx <= kvm_vmx->pid_last_index)
>>>> + return 0;
>>>> +
>>>> + last_pid_table = kvm_vmx->pid_table;
>>>> + last_table_size = table_index_to_size(kvm_vmx->pid_last_index + 1);
>>>> + new_order = get_order(table_index_to_size(entry_idx + 1));
>>>> +
>>>> + if (vmx_alloc_pid_table(kvm_vmx, new_order))
>>>> + return -ENOMEM;
>>>> +
>>>> + memcpy(kvm_vmx->pid_table, last_pid_table, last_table_size);
>>>> + kvm_make_all_cpus_request(&kvm_vmx->kvm, KVM_REQ_PID_TABLE_UPDATE);
>>>> +
>>>> + /* Now old PID table can be freed safely as no vCPU is using it. */
>>>> + free_pages((unsigned long)last_pid_table, get_order(last_table_size));
>>> This is terrifying. I think it's safe? But it's still terrifying.
>> Freeing the old PID table here is safe, as KVM makes the KVM_REQ_PID_TABLE_UPDATE
>> request with the KVM_REQUEST_WAIT flag, forcing all vCPUs to trigger a VM-exit and
>> update the VMCS field to point at the newly allocated PID table. At that point, it
>> is guaranteed that the old PID table is not referenced by any vCPU.
>> Do you mean it still has a potential problem?
> No, I do think it's safe, but it is still terrifying :-)
>
>>> Rather than dynamically react as vCPUs are created, what about we make max_vcpus
>>> common[*], extend KVM_CAP_MAX_VCPUS to allow userspace to override max_vcpus,
>>> and then have the IPIv support allocate the PID table on first vCPU creation
>>> instead of in vmx_vm_init()?
>>>
>>> That will give userspace an opportunity to lower max_vcpus to reduce memory
>>> consumption without needing to dynamically muck with the table in KVM. Then
>>> this entire patch goes away.
>> IIUC, it's risky to rely on userspace.
> That's why we have cgroups, rlimits, etc...
>
>> That way, userspace also has the chance to assign a large max_vcpus but not use
>> them at all. This cannot achieve the goal of saving as much memory as possible,
>> similar to using KVM_MAX_VCPU_IDS to allocate the PID table.
> Userspace can simply do KVM_CREATE_VCPU until it hits KVM_MAX_VCPU_IDS...
IIUC, what you proposed is to add max_vcpus to KVM for the x86 arch (currently
not present yet) and provide a new API for userspace to tell KVM how many vCPUs
the current VM session has, prior to vCPU creation. Thus IPIv could set up the
PID table with this information in one shot. I'm thinking this leaves several
things uncertain:

1. We cannot identify the exact max APIC ID corresponding to max_vcpus.
APIC ID assignment is platform dependent. A large APIC ID could in theory be
assigned to one vCPU even when running with a small max_vcpus. We cannot figure
out the maximum supported APIC ID from max_vcpus alone.

2. We cannot minimize the memory consumption of the PID table at run-time.
In the case of "-smp=small_n,maxcpus=large_N", KVM has to allocate memory to
accommodate large_N vCPUs from the beginning, regardless of whether all maxcpus
will ever run.

3. There is a potential backward-compatibility problem.
When running with an old QEMU version, KVM cannot get the expected information
and so has to fall back to using KVM_MAX_VCPU_IDS by default. That's feasible,
but brings no benefit for the memory optimization of the PID table.

What's your opinion? Thanks.
On Sat, Jan 15, 2022 at 10:08:10AM +0800, Zeng Guang wrote:
> On 1/15/2022 1:34 AM, Sean Christopherson wrote:
> > On Fri, Jan 14, 2022, Zeng Guang wrote:
> > > kvm_lapic_reg_read() is limited to reading up to 4 bytes. It needs an
> > > extension to support 64-bit reads.
> > Ah, right.
> >
> > > And another concern: here we only need the register value from vICR (no
> > > other regs need be taken care of), so going through the whole path of
> > > kvm_lapic_reg_read() could be unnecessarily time-consuming. Is it proper
> > > to call kvm_lapic_get_reg64() to retrieve the vICR value directly?
> > Hmm, no, I don't think that's proper. Retrieving a 64-bit value really is unique
> > to vICR. Yes, the code does WARN on that, but if future architectural extensions
> > ever generate APIC-write exits on other registers, then using kvm_lapic_get_reg64()
> > would be wrong and this code would need to be updated again.
> Splitting on x2apic mode and WARNing on (offset != APIC_ICR) already limit
> the register read to vICR only. Actually we just need to handle the 64-bit
> data specific to vICR in APIC-write exits. From this point of view, the
> previous design stays compatible with handling other registers even if
> future architectural extensions change. :)
> >
> > What about tweaking my prep patch from before to the below? That would yield:
> >
> > if (apic_x2apic_mode(apic)) {
> > if (WARN_ON_ONCE(offset != APIC_ICR))
> > return 1;
> >
> > kvm_lapic_msr_read(apic, offset, &val);
>
> I think it's problematic to use kvm_lapic_msr_read() in this case. It
> presumes the high 32-bit value is already valid in APIC_ICR2, while in
> handling "nodecode" x2APIC writes we need to fetch the contiguous 64-bit
> data from offset 300H first and then prepare emulation of the APIC_ICR2
> write. At that point, APIC_ICR2 is not ready yet.
How about combining them? Then you can handle the ICR-write VM-exit for
IPI virtualization and Sean's patch still works, reusing the code like below:
if (apic_x2apic_mode(apic)) {
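/*
 * vICR must be read as a raw 64-bit value from offset 300H because its
 * high half has not been propagated to ICR2 yet; kvm_lapic_msr_read()
 * only serves the unexpected (WARNed) offsets.
 */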
if (WARN_ON_ONCE(offset != APIC_ICR))
kvm_lapic_msr_read(apic, offset, &val);
else
kvm_lapic_get_reg64(apic, offset, &val);
kvm_lapic_msr_write(apic, offset, val);
} else {
kvm_lapic_reg_read(apic, offset, 4, &val);
kvm_lapic_reg_write(apic, offset, val);
}
>
> > kvm_lapic_msr_write(apic, offset, val);
> > } else {
> > kvm_lapic_reg_read(apic, offset, 4, &val);
> > kvm_lapic_reg_write(apic, offset, val);
> > }
> >
> > I like that the above has "msr" in the low level x2apic helpers, and it maximizes
> > code reuse. Compile tested only...
> >
> > From: Sean Christopherson <[email protected]>
> > Date: Fri, 14 Jan 2022 09:29:34 -0800
> > Subject: [PATCH] KVM: x86: Add helpers to handle 64-bit APIC MSR read/writes
> >
> > Add helpers to handle 64-bit APIC read/writes via MSRs to deduplicate the
> > x2APIC and Hyper-V code needed to service reads/writes to ICR. Future
> > support for IPI virtualization will add yet another path where KVM must
> > handle 64-bit APIC MSR reads/writes (to ICR).
> >
> > Opportunistically fix the comment in the write path; ICR2 holds the
> > destination (if there's no shorthand), not the vector.
> >
> > No functional change intended.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/lapic.c | 59 ++++++++++++++++++++++----------------------
> > 1 file changed, 29 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index f206fc35deff..cc4531eb448f 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2787,6 +2787,30 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
> > return 0;
> > }
> >
> > +static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data)
> > +{
> > + u32 low, high = 0;
> > +
> > + if (kvm_lapic_reg_read(apic, reg, 4, &low))
> > + return 1;
> > +
> > + if (reg == APIC_ICR &&
> > + WARN_ON_ONCE(kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high)))
> > + return 1;
> > +
> > + *data = (((u64)high) << 32) | low;
> > +
> > + return 0;
> > +}
> > +
> > +static int kvm_lapic_msr_write(struct kvm_lapic *apic, u32 reg, u64 data)
> > +{
> > + /* For 64-bit ICR writes, set ICR2 (dest) before ICR (command). */
> > + if (reg == APIC_ICR)
> > + kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> > + return kvm_lapic_reg_write(apic, reg, (u32)data);
> > +}
> > +
> > int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > {
> > struct kvm_lapic *apic = vcpu->arch.apic;
> > @@ -2798,16 +2822,13 @@ int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > if (reg == APIC_ICR2)
> > return 1;
> >
> > - /* if this is ICR write vector before command */
> > - if (reg == APIC_ICR)
> > - kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> > - return kvm_lapic_reg_write(apic, reg, (u32)data);
> > + return kvm_lapic_msr_write(apic, reg, data);
> > }
> >
> > int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
> > {
> > struct kvm_lapic *apic = vcpu->arch.apic;
> > - u32 reg = (msr - APIC_BASE_MSR) << 4, low, high = 0;
> > + u32 reg = (msr - APIC_BASE_MSR) << 4;
> >
> > if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(apic))
> > return 1;
> > @@ -2815,45 +2836,23 @@ int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
> > if (reg == APIC_DFR || reg == APIC_ICR2)
> > return 1;
> >
> > - if (kvm_lapic_reg_read(apic, reg, 4, &low))
> > - return 1;
> > - if (reg == APIC_ICR)
> > - kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high);
> > -
> > - *data = (((u64)high) << 32) | low;
> > -
> > - return 0;
> > + return kvm_lapic_msr_read(apic, reg, data);
> > }
> >
> > int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 reg, u64 data)
> > {
> > - struct kvm_lapic *apic = vcpu->arch.apic;
> > -
> > if (!lapic_in_kernel(vcpu))
> > return 1;
> >
> > - /* if this is ICR write vector before command */
> > - if (reg == APIC_ICR)
> > - kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
> > - return kvm_lapic_reg_write(apic, reg, (u32)data);
> > + return kvm_lapic_msr_write(vcpu->arch.apic, reg, data);
> > }
> >
> > int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
> > {
> > - struct kvm_lapic *apic = vcpu->arch.apic;
> > - u32 low, high = 0;
> > -
> > if (!lapic_in_kernel(vcpu))
> > return 1;
> >
> > - if (kvm_lapic_reg_read(apic, reg, 4, &low))
> > - return 1;
> > - if (reg == APIC_ICR)
> > - kvm_lapic_reg_read(apic, APIC_ICR2, 4, &high);
> > -
> > - *data = (((u64)high) << 32) | low;
> > -
> > - return 0;
> > + return kvm_lapic_msr_read(vcpu->arch.apic, reg, data);
> > }
> >
> > int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
> > --
On 1/18/2022 8:44 AM, Yuan Yao wrote:
> ...
> How about combining them? Then you can handle the ICR-write VM-exit for
> IPI virtualization and Sean's patch still works, reusing the code like below:
>
> if (apic_x2apic_mode(apic)) {
> if (WARN_ON_ONCE(offset != APIC_ICR))
> kvm_lapic_msr_read(apic, offset, &val);
> else
> kvm_lapic_get_reg64(apic, offset, &val);
>
> kvm_lapic_msr_write(apic, offset, val);
> } else {
> kvm_lapic_reg_read(apic, offset, 4, &val);
> kvm_lapic_reg_write(apic, offset, val);
> }
Alternatively we can merge them like this, if Sean thinks it's OK to call
kvm_lapic_get_reg64() to retrieve the 64-bit data from vICR directly.
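For reference, kvm_lapic_get_reg64() in this series is assumed to be just a
raw 64-bit read of the virtual-APIC page, roughly:

static inline u64 kvm_lapic_get_reg64(struct kvm_lapic *apic, int reg)
{
	/* Raw read: vICR at offset 300H comes back as one 64-bit value. */
	return *((u64 *)(apic->regs + reg));
}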
On Mon, Jan 17, 2022, Zeng Guang wrote:
> On 1/15/2022 12:18 AM, Sean Christopherson wrote:
> > Userspace can simply do KVM_CREATE_VCPU until it hits KVM_MAX_VCPU_IDS...
> IIUC, what you proposed is to use max_vcpus in kvm for x86 arch (currently
> not present yet) and
> provide new api for userspace to notify kvm how many vcpus in current vm
> session prior to vCPU creation.
> Thus IPIv can setup PID-table with this information in one shot.
> I'm thinking this may have several things uncertain:
> 1. cannot identify the exact max APIC ID corresponding to max vcpus
> APIC ID definition is platform dependent. A large APIC ID could be assigned
> to one vCPU in theory even running with
> small max_vcpus. We cannot figure out max APIC ID supported mapping to
> max_vcpus.
Gah, I conflated KVM_CAP_MAX_VCPUS and KVM_MAX_VCPU_IDS. But the underlying idea
still works: extend KVM_MAX_VCPU_IDS to allow userspace to lower the max allowed
vCPU ID to reduce the memory footprint of densely "packed" and/or small VMs.
> 2. cannot optimize the memory consumption on PID table to the least at
> run-time
> In case "-smp=small_n,maxcpus=large_N", kvm has to allocate memory to
> accommodate large_N vcpus at the
> beginning no matter whether all maxcpus will run.
That's a feature. E.g. if userspace defines a max vCPU ID that is larger than
what is required at boot, e.g. to hotplug vCPUs, then consuming a few extra pages
of memory to ensure that IPIv will be supported for hotplugged vCPUs is very
desirable behavior. Observing poor performance on hotplugged vCPUs because the
host was under memory pressure is far worse.
And the goal isn't to achieve the smallest memory footprint possible, it's to
avoid allocating 32kb of memory when userspace wants to run a VM with only a
handful of vCPUs, i.e. when 4kb will suffice. Consuming 32kb of memory for a VM
with hundreds of vCPUs is a non-issue, e.g. it's highly unlikely to be running
multiple such VMs on a single host, and such hosts will likely have hundreds of
gb of RAM. Conversely, hosts running small VMs will likely run tens or hundreds
of small VMs, e.g. for container scenarios, in which case reducing the per-VM memory
footprint is much more valuable and also easier to achieve.
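(Concretely, assuming the 8-byte PID-pointer entries from this series and
KVM_MAX_VCPU_IDS = 4096: 4096 * 8 bytes is the 32kb table, while a single
4kb page already covers vCPU IDs 0-511.)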
> 3. Potential backward-compatible problem
> If running with old QEMU version, kvm cannot get expected information so as
> to make a fallback to use
> KVM_MAX_VCPU_IDS by default. It's feasible but not benefit on memory
> optimization for PID table.
That's totally fine. This is purely a memory optimization, IPIv will still work
as intended if userspace doesn't lower the max vCPU ID, it'll just consume a bit
more memory.
On Sat, Jan 15, 2022, Zeng Guang wrote:
> > What about tweaking my prep patch from before to the below? That would yield:
> >
> > if (apic_x2apic_mode(apic)) {
> > if (WARN_ON_ONCE(offset != APIC_ICR))
> > return 1;
> >
> > kvm_lapic_msr_read(apic, offset, &val);
>
> I think it's problematic to use kvm_lapic_msr_read() in this case. It
> premises the high 32bit value already valid at APIC_ICR2, while in handling
> "nodecode" x2APIC writes we need get continuous 64bit data from offset 300H
> first and prepare emulation of APIC_ICR2 write.
Ah, I read this part of the spec:
All 64 bits of the ICR are written by using WRMSR to access the MSR with index 830H.
If ECX = 830H, WRMSR writes the 64-bit value in EDX:EAX to the ICR, causing the APIC
to send an IPI. If any of bits 13, 17:16, or 31:20 are set in EAX, WRMSR detects a
reserved-bit violation and causes a general-protection exception (#GP).
but not the part down below that explicitly says
VICR refers to the 64-bit field at offset 300H on the virtual-APIC page. When the
“virtualize x2APIC mode” VM-execution control is 1 (indicating virtualization of
x2APIC mode), this field is used to virtualize the entire ICR.
But that's indicative of an existing KVM problem. KVM's emulation of x2APIC is
broken. The SDM, in section 10.12.9 ICR Operation in x2APIC Mode, clearly states
that the ICR is extended to 64-bits. ICR2 does not exist in x2APIC mode, full stop.
KVM botched things by effectively aliasing ICR[63:32] to ICR2.
We can and should fix that issue before merging IPIv support, that way we don't
further propagate KVM's incorrect behavior. KVM will need to manipulate the APIC
state in KVM_{G,S}ET_LAPIC so as not to "break" migration, "break" in quotes because
I highly doubt any kernel reads ICR[63:32] for anything but debug purposes. But
we'd need to do that anyways for IPIv, otherwise migration from an IPIv host to
a non-IPIv host would suffer the same migration bug.
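E.g. the KVM_{G,S}ET_LAPIC munging would be roughly the below (untested
sketch, helper name invented):

static void kvm_x2apic_icr_fixup(struct kvm_lapic_state *s, bool to_user)
{
	u32 *icr  = (u32 *)(s->regs + APIC_ICR);
	u32 *icr2 = (u32 *)(s->regs + APIC_ICR2);

	if (to_user)
		*icr2 = icr[1];	/* GET: mirror vICR[63:32] into ICR2 */
	else
		icr[1] = *icr2;	/* SET: rebuild vICR[63:32] from ICR2 */
}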
I'll post a series this week, in theory we should be able to reduce the IPIv
support patch to only touching kvm_apic_write_nodecode().
On 1/19/2022 2:17 AM, Sean Christopherson wrote:
> ...
>
> I'll post a series this week, in theory we should be able to reduce the IPIv
> support patch to only touching kvm_apic_write_nodecode().
OK, I'll adapt the patch after your fix is ready. I assume
kvm_lapic_msr_{write,read} won't need to emulate ICR2 writes/reads anymore.
On 1/19/2022 1:15 AM, Sean Christopherson wrote:
> On Mon, Jan 17, 2022, Zeng Guang wrote:
>> On 1/15/2022 12:18 AM, Sean Christopherson wrote:
>>> Userspace can simply do KVM_CREATE_VCPU until it hits KVM_MAX_VCPU_IDS...
>> IIUC, what you proposed is to use max_vcpus in kvm for x86 arch (currently
>> not present yet) and
>> provide new api for userspace to notify kvm how many vcpus in current vm
>> session prior to vCPU creation.
>> Thus IPIv can setup PID-table with this information in one shot.
>> I'm thinking this may have several things uncertain:
>> 1. cannot identify the exact max APIC ID corresponding to max vcpus
>> APIC ID definition is platform dependent. A large APIC ID could be assigned
>> to one vCPU in theory even running with
>> small max_vcpus. We cannot figure out max APIC ID supported mapping to
>> max_vcpus.
> Gah, I conflated KVM_CAP_MAX_VCPUS and KVM_MAX_VCPU_IDS. But the underlying idea
> still works: extend KVM_MAX_VCPU_IDS to allow userspace to lower the max allowed
> vCPU ID to reduce the memory footprint of densely "packed" and/or small VMs.
Possibly it may not work as well as expected. From the user's perspective,
assigning a max APIC ID requires knowledge of the APIC ID implementation on
each platform. It's hard for users to determine an appropriate value for
every VM session.
Users may know their exact demand for vCPU resources, such as the SMP CPU
count or the max CPUs for CPU hotplug, but quite possibly neither know nor
care what the APIC IDs should be. If an improper value is provided, we not
only fail to reduce the memory footprint, but may also hit unexpected
failures in vCPU creation, e.g. when the actual vCPU ID (= APIC ID) is
larger than the assigned max APIC ID. So this solution still seems to have
potential problems.
Besides, it also requires changing the userspace hypervisor (QEMU etc.) and
KVM (arch state, vCPU creation policy etc.), which unnecessarily couples
those modules together. From this point of view, it offers little advantage
other than simplifying IPIv's memory management of the PID table.
>> 2. cannot optimize the memory consumption on PID table to the least at
>> run-time
>> In case "-smp=small_n,maxcpus=large_N", kvm has to allocate memory to
>> accommodate large_N vcpus at the
>> beginning no matter whether all maxcpus will run.
> That's a feature. E.g. if userspace defines a max vCPU ID that is larger than
> what is required at boot, e.g. to hotplug vCPUs, then consuming a few extra pages
> of memory to ensure that IPIv will be supported for hotplugged vCPUs is very
> desirable behavior. Observing poor performance on hotplugged vCPUs because the
> host was under memory pressure is far worse.
>
> And the goal isn't to achieve the smallest memory footprint possible, it's to
> avoid allocating 32kb of memory when userspace wants to run a VM with only a
> handful of vCPUs, i.e. when 4kb will suffice. Consuming 32kb of memory for a VM
> with hundreds of vCPUs is a non-issue, e.g. it's highly unlikely to be running
> multiple such VMs on a single host, and such hosts will likely have hundreds of
> gb of RAM. Conversely, hosts running small VMs will likely run tens or hundreds
> of small VMs, e.g. for container scenarios, in which case reducing the per-VM memory
> footprint is much more valuable and also easier to achieve.
Agree, that is the purpose of this patch. With the solution we proposed, IPIv
uses as little memory as possible in all kinds of scenarios, staying at 4kb
in most cases instead of 32kb. It's self-adaptive, a standalone module in KVM
that introduces no extra limitation, and it scales with any future extension
of KVM_MAX_VCPU_IDS or new APIC ID implementations.
What do you think? :)
>> 3. Potential backward-compatible problem
>> If running with old QEMU version, kvm cannot get expected information so as
>> to make a fallback to use
>> KVM_MAX_VCPU_IDS by default. It's feasible but not benefit on memory
>> optimization for PID table.
> That's totally fine. This is purely a memory optimization, IPIv will still work
> as intended if userspace doesn't lower the max vCPU ID, it'll just consume a bit
> more memory.
On Wed, Jan 19, 2022, Zeng Guang wrote:
> It's self-adaptive, a standalone module in KVM that introduces no extra
> limitation
I disagree. Its failure mode on OOM is to degrade guest performance, _that_ is
a limitation. OOM is absolutely something that should be immediately communicated
to userspace in a way that userspace can take action.
> and it scales with any future extension of KVM_MAX_VCPU_IDS or new APIC ID
> implementations.
>
> What do you think? :)
Heh, I think I've made it quite clear that I think it's unnecessary complexity in
KVM. It's not a hill I'll die on, e.g. if Paolo and others feel it's the right
approach then so be it, but I really, really dislike the idea of dynamically
changing the table, KVM has a long and sordid history of botching those types
of flows/features.
On Fri, Jan 14, 2022, Zeng Guang wrote:
> On 1/14/2022 5:03 AM, Sean Christopherson wrote:
> > Can you provide a sample dump? It's hard to visualize the output, e.g. I'm worried
> > this will be overly long and harder to read than putting tertiary controls on their
> > own line.
>
> Sample dump here.
> *** Control State ***
>
> PinBased=0x000000ff CPUBased=0xb5a26dfa SecondaryExec=0x061037eb
> TertiaryExec=0x0000000000000010
That's quite the line. What if we reorganize the code to generate output like:
CPUBased=0xb5a26dfa SecondaryExec=0x061037eb TertiaryExec=0x0000000000000010
PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff
That keeps the lines reasonable and IMO is better organization too, e.g. it captures
the relationship between primary, secondary, and tertiary controls.
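i.e. roughly the below, reusing dump_vmcs()'s existing locals (with
tertiary_exec_control as added by this series; untested):

	pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
	       cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
	pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
	       pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);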
> EntryControls=0000d1ff ExitControls=002befff
> ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
> VMEntry: intr_info=00000000 errcode=00000000 ilen=00000000
> VMExit: intr_info=00000000 errcode=00000000 ilen=00000003
>         reason=00000030 qualification=0000000000000784
> > > pr_err("EntryControls=%08x ExitControls=%08x\n", vmentry_ctl, vmexit_ctl);
> > > pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
> > > vmcs_read32(EXCEPTION_BITMAP),
> > > --
> > > 2.27.0
> > >
On 1/20/2022 9:06 AM, Sean Christopherson wrote:
> On Fri, Jan 14, 2022, Zeng Guang wrote:
>> On 1/14/2022 5:03 AM, Sean Christopherson wrote:
>>> Can you provide a sample dump? It's hard to visualize the output, e.g. I'm worried
>>> this will be overly long and harder to read than putting tertiary controls on their
>>> own line.
>> Sample dump here.
>> *** Control State ***
>>
>> PinBased=0x000000ff CPUBased=0xb5a26dfa SecondaryExec=0x061037eb
>> TertiaryExec=0x0000000000000010
> That's quite the line. What if we reorganize the code to generate output like:
>
> CPUBased=0xb5a26dfa SecondaryExec=0x061037eb TertiaryExec=0x0000000000000010
> PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff
>
> That keeps the lines reasonable and IMO is better organization too, e.g. it captures
> the relationship between primary, secondary, and tertiary controls.
Makes sense. I'll change it as you suggest. Thanks.
On 1/20/2022 9:01 AM, Sean Christopherson wrote:
> On Wed, Jan 19, 2022, Zeng Guang wrote:
>> It's self-adaptive, a standalone module in KVM that introduces no extra
>> limitation
> I disagree. Its failure mode on OOM is to degrade guest performance, _that_ is
> a limitation. OOM is absolutely something that should be immediately communicated
> to userspace in a way that userspace can take action.
If memory allocation fails, the PID-pointer table stops being updated and the
old one is kept in use. All IPIs from other vCPUs to the newly created vCPU
then go through APIC-write VM-exits and don't get the performance improvement
of IPI virtualization. Right, it's a limitation, though it doesn't impact the
effectiveness of IPI virtualization among the existing vCPUs.
>> and it scales with any future extension of KVM_MAX_VCPU_IDS or new APIC ID
>> implementations.
>>
>> What do you think? :)
> Heh, I think I've made it quite clear that I think it's unnecessary complexity in
> KVM. It's not a hill I'll die on, e.g. if Paolo and others feel it's the right
> approach then so be it, but I really, really dislike the idea of dynamically
> changing the table, KVM has a long and sordid history of botching those types
> of flows/features.
To follow your proposal, we think a feasible implementation is as follows:
1. Define a new field apic_id_limit in struct kvm_arch, initialized to
KVM_MAX_VCPU_IDS by default.
2. Add a new VM ioctl KVM_SET_APICID_LIMIT that lets userspace set the
maximum APIC ID the VM session may require, before vCPU creation. QEMU
already calculates such a limit from the max CPUs assigned for hotpluggable
CPUs: it simply uses the package/die/core/smt model to get the bit width of
the ID field at each level (not totally compliant with CPUID leaves 1FH/0BH)
and derives the APIC ID for a given vCPU index. We can pass that limit to
KVM to size the PID table accordingly.
3. On vCPU creation, check that the vCPU ID is less than
min(apic_id_limit, KVM_MAX_VCPU_IDS), and return an error otherwise.
4. Allocate PID-table memory covering vCPU IDs up to apic_id_limit during
the first vCPU creation. Proper locking is still needed to protect the
PID-table setup from races. If OOM happens, that vCPU creation fails and the
error is returned to userspace. A rough sketch of step 2 follows below.
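(Sketch only; KVM_SET_APICID_LIMIT and kvm->arch.apic_id_limit are proposed
names that don't exist yet:)

static int kvm_vm_ioctl_set_apic_id_limit(struct kvm *kvm, u32 limit)
{
	int r = -EINVAL;

	mutex_lock(&kvm->lock);
	/* Must be set before any vCPU exists and within the supported range. */
	if (!kvm->created_vcpus && limit && limit <= KVM_MAX_VCPU_IDS) {
		kvm->arch.apic_id_limit = limit;
		r = 0;
	}
	mutex_unlock(&kvm->lock);

	return r;
}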
Please let us know whether we should pursue this solution further. Thanks.
On Thu, Jan 13, 2022, Sean Christopherson wrote:
> On Tue, Jan 11, 2022, Maxim Levitsky wrote:
> > Both Intel and AMD's PRM also state that changing APIC ID is implementation
> > dependent.
> >
> > I vote to forbid changing apic id, at least in the case any APIC acceleration
> > is used, be that APICv or AVIC.
>
> That has my vote as well. For IPIv in particular there's not much concern with
> backwards compatibility, i.e. we can tie the behavior to enable_ipiv.
Hrm, it may not be that simple. There's some crusty (really, really crusty) code
in Linux's boot code that writes APIC_ID. IIUC, the intent is to play nice with
running a UP crash dump kernel on "BSP" that isn't "the BSP", e.g. has a non-zero
APIC ID.
static void __init apic_bsp_up_setup(void)
{
#ifdef CONFIG_X86_64
apic_write(APIC_ID, apic->set_apic_id(boot_cpu_physical_apicid));
#else
/*
* Hack: In case of kdump, after a crash, kernel might be booting
* on a cpu with non-zero lapic id. But boot_cpu_physical_apicid
* might be zero if read from MP tables. Get it from LAPIC.
*/
# ifdef CONFIG_CRASH_DUMP
boot_cpu_physical_apicid = read_apic_id();
# endif
#endif
}
The most helpful comment is in generic_processor_info():
/*
* boot_cpu_physical_apicid is designed to have the apicid
* returned by read_apic_id(), i.e, the apicid of the
* currently booting-up processor. However, on some platforms,
* it is temporarily modified by the apicid reported as BSP
* through MP table. Concretely:
*
* - arch/x86/kernel/mpparse.c: MP_processor_info()
* - arch/x86/mm/amdtopology.c: amd_numa_init()
*
* This function is executed with the modified
* boot_cpu_physical_apicid. So, disabled_cpu_apicid kernel
* parameter doesn't work to disable APs on kdump 2nd kernel.
*
* Since fixing handling of boot_cpu_physical_apicid requires
* another discussion and tests on each platform, we leave it
* for now and here we use read_apic_id() directly in this
* function, generic_processor_info().
*/
It's entirely possible that this path is unused in a KVM guest, but I don't think
we can know that with 100% certainty.
But I also completely agree that attempting to keep the tables up-to-date is ugly
and a waste of time and effort, e.g. as Maxim pointed out, the current AVIC code
is comically broken.
Rather than disallowing the write, what if we add yet another inhibit that disables
APICv if IPI virtualization is enabled and a vCPU has an APIC ID != vcpu_id? KVM
is equipped to handle the emulation, so it just means that a guest that's doing
weird things loses a bit of performance.
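Something like the below, perhaps (completely untested, and the inhibit
name/value is invented; enable_ipiv is the module param proposed by this
series):

#define APICV_INHIBIT_REASON_APIC_ID_MODIFIED	8	/* next free bit */

static void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id)
{
	kvm_lapic_set_reg(apic, APIC_ID, id << 24);

	if (enable_ipiv && id != apic->vcpu->vcpu_id)
		kvm_request_apicv_update(apic->vcpu->kvm, false,
					 APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
}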
On Wed, Feb 02, 2022, Sean Christopherson wrote:
> On Thu, Jan 13, 2022, Sean Christopherson wrote:
> > On Tue, Jan 11, 2022, Maxim Levitsky wrote:
> > > Both Intel and AMD's PRM also state that changing APIC ID is implementation
> > > dependent.
> > >
> > > I vote to forbid changing apic id, at least in the case any APIC acceleration
> > > is used, be that APICv or AVIC.
> >
> > That has my vote as well. For IPIv in particular there's not much concern with
> > backwards compatibility, i.e. we can tie the behavior to enable_ipiv.
>
> Hrm, it may not be that simple. There's some crusty (really, really crusty) code
> in Linux's boot code that writes APIC_ID. IIUC, the intent is to play nice with
> running a UP crash dump kernel on "BSP" that isn't "the BSP", e.g. has a non-zero
> APIC ID.
...
> It's entirely possible that this path is unused in a KVM guest, but I don't think
> we can know that with 100% certainty.
>
> But I also completely agree that attempting to keep the tables up-to-date is ugly
> and a waste of time and effort, e.g. as Maxim pointed out, the current AVIC code
> is comically broken.
>
> Rather than disallowing the write, what if we add yet another inhibit that disables
> APICv if IPI virtualization is enabled and a vCPU has an APIC ID != vcpu_id? KVM
> is equipped to handle the emulation, so it just means that a guest that's doing
> weird things loses a bit of performance.
LOL, this is all such a mess. The x2apic ID is actually indirectly writable on
AMD CPUs. Per the APM:
A value previously written by software to the 8-bit APIC_ID register (MMIO offset 30h) is
converted by hardware into the appropriate format and reflected into the 32-bit x2APIC_ID
register (MSR 802h).
I confirmed this on hardware (Milan). That means KVM's handling of the x2APIC ID
in kvm_lapic_set_base() is wrong, at least with respect to AMD.
Intel's SDM is a bit vague. I _think_ it means Intel CPUs treat the x2APIC ID
and xAPIC ID as two completely independent assets. I haven't been able to glean any
info from hardware because writes to the legacy xAPIC ID are ignored on all CPUs
I've tested (Haswell and Cascade lake). Of the xAPIC => x2APIC transition, the
SDM states:
The x2APIC ID (32 bits) and the legacy local xAPIC ID (8 bits) are preserved
across this transition.
Given that the xAPIC ID appears to no longer be writable on Intel CPUs, it's
impossible that _generic_ kernel code can rely on xAPIC ID being writable. That
just leaves the aforementioned amd_numa_init() crud.
Linux's handling of that is:
void __init x86_numa_init(void)
{
if (!numa_off) {
#ifdef CONFIG_ACPI_NUMA
if (!numa_init(x86_acpi_numa_init))
return;
#endif
#ifdef CONFIG_AMD_NUMA
if (!numa_init(amd_numa_init))
return;
#endif
}
numa_init(dummy_numa_init);
}
i.e. ACPI_NUMA gets priority and thus amd_numa_init() will never be reached if
the NUMA topology is enumerated in the ACPI tables. Furthermore, the VMM would
have to actually emulate an old AMD northbridge, which is also extremely unlikely.
The odds of breaking a guest are further diminished given that KVM doesn't emulate
the xAPIC ID => x2APIC ID hilarity on AMD CPUs and no one has complained.
So, rather than tie this to IPI virtualization, I think we should either make
the xAPIC ID read-only across the board, or if we want to hedge in case someone
has a crazy use case, make the xAPIC ID read-only by default, add a module param
to let userspace opt-in to a writable xAPIC ID, and report x2APIC and APICv as
unsupported if the xAPIC ID is writable. E.g. roughly this, plus your AVIC patches
if we want to hedge.
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 28be02adc669..32854ac403a8 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -539,8 +539,15 @@ void kvm_set_cpu_caps(void)
0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
F(F16C) | F(RDRAND)
);
- /* KVM emulates x2apic in software irrespective of host support. */
- kvm_cpu_cap_set(X86_FEATURE_X2APIC);
+ /*
+ * KVM emulates x2apic in software irrespective of host support. Due
+ * to architectural differences between Intel and AMD, x2APIC is not
+ * supported if the xAPIC ID is writable.
+ */
+ if (!xapic_id_writable)
+ kvm_cpu_cap_set(X86_FEATURE_X2APIC);
+ else
+ kvm_cpu_cap_clear(X86_FEATURE_X2APIC);
kvm_cpu_cap_mask(CPUID_1_EDX,
F(FPU) | F(VME) | F(DE) | F(PSE) |
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 670361bf1d81..6b42b65e7a42 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2047,10 +2047,10 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
switch (reg) {
case APIC_ID: /* Local APIC ID */
- if (!apic_x2apic_mode(apic))
+ if (apic_x2apic_mode(apic))
+ ret = 1;
+ else if (xapic_id_writable)
kvm_apic_set_xapic_id(apic, val >> 24);
- else
- ret = 1;
break;
case APIC_TASKPRI:
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9cef8e4598df..71a3bcdb3317 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4743,7 +4743,8 @@ static __init int svm_hardware_setup(void)
nrips = false;
}
- enable_apicv = avic = avic && npt_enabled && boot_cpu_has(X86_FEATURE_AVIC);
+ enable_apicv = avic = avic && !xapic_id_writable && npt_enabled &&
+ boot_cpu_has(X86_FEATURE_AVIC);
if (enable_apicv) {
pr_info("AVIC enabled\n");
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1b135473677b..fad7b36fbb1d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7939,7 +7939,7 @@ static __init int hardware_setup(void)
ple_window_shrink = 0;
}
- if (!cpu_has_vmx_apicv())
+ if (!cpu_has_vmx_apicv() || xapic_id_writable)
enable_apicv = 0;
if (!enable_apicv)
vmx_x86_ops.sync_pir_to_irr = NULL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eaad2f485b64..ef6eba8c832a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -174,6 +174,10 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
static int __read_mostly lapic_timer_advance_ns = -1;
module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR);
+bool __read_mostly xapic_id_writable;
+module_param(xapic_id_writable, bool, 0444);
+EXPORT_SYMBOL_GPL(xapic_id_writable);
+
static bool __read_mostly vector_hashing = true;
module_param(vector_hashing, bool, S_IRUGO);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1ebd5a7594da..142663ff9cba 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -346,6 +346,7 @@ static inline bool kvm_mpx_supported(void)
extern unsigned int min_timer_period_us;
+extern bool xapic_id_writable;
extern bool enable_vmware_backdoor;
extern int pi_inject_timer;
On Wed, 2022-02-23 at 14:10 +0800, Chao Gao wrote:
> On Thu, Feb 03, 2022 at 08:22:13PM +0000, Sean Christopherson wrote:
> > i.e. ACPI_NUMA gets priority and thus amd_numa_init() will never be reached if
> > the NUMA topology is enumerated in the ACPI tables. Furthermore, the VMM would
> > have to actually emulate an old AMD northbridge, which is also extremely unlikely.
> >
> > The odds of breaking a guest are further diminished given that KVM doesn't emulate
> > the xAPIC ID => x2APIC ID hilarity on AMD CPUs and no one has complained.
> >
> > So, rather than tie this to IPI virtualization, I think we should either make
> > the xAPIC ID read-only across the board,
>
> We will go this way and defer the introduction of "xapic_id_writable" to the
> emergence of the "crazy" use case.
>
> Levitsky, we plan to revise your patch 13 "[PATCH RESEND 13/30] KVM: x86: lapic:
> don't allow to change APIC ID when apic acceleration is enabled" to make xAPIC
> ID read-only regardless of APICv/AVIC and include it into IPI virtualization
> series (to eliminate the dependency on your AVIC series). Is it fine with you?
Absolutely!
> And does this patch 13 depend on other patches in your fixes?
This patch doesn't depend on anything.
There is also patch 14 in this series which closes a case where malicious
userspace could upload a non-default _x2apic id_. I haven't yet written a unit test
to demonstrate this, but I will soon.
You don't need that patch for now IMHO.
>
> > or if we want to hedge in case someone
> > has a crazy use case, make the xAPIC ID read-only by default, add a module param
> > to let userspace opt-in to a writable xAPIC ID, and report x2APIC and APICv as
> > unsupported if the xAPIC ID is writable. E.g. roughly this, plus your AVIC patches
> > if we want to hedge.
Best regards,
Maxim Levitsky