For the record, I went into this thinking it was going to be a simple code
shuffle between {svm,vmx}_vcpu_reset() and kvm_vcpu_reset(). The actual
goal is to consolidate the RESET/INIT code, both to deduplicate code and
to try to avoid divergent behavior/bugs, e.g. SVM only recently started
updating vcpu->arch.cr4 on INIT.
The TL;DR of why it takes 40+ patches to get there is that the RESET/INIT
flows have multiple latent bugs and hidden dependencies, but "work"
because they're rarely touched, are mostly fixed flows in both KVM and the
guest, and because guests don't sanity check state after INIT.
While several of the patches have Fixes tags, I am absolutely terrified of
backporting most of them due to the likelihood of breaking a different
version of KVM. And, for the most part the bugs are benign in the sense
no guest has actually encountered any of these bugs. For that reason, I
intentionally omitted stable@ entirely. The only patches I would consider
even remotely safe for backporting are the first two patches in the series.
Sean Christopherson (43):
KVM: nVMX: Set LDTR to its architecturally defined value on nested
VM-Exit
KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping
KVM: SVM: Require exact CPUID.0x1 match when stuffing EDX at INIT
KVM: SVM: Fall back to KVM's hardcoded value for EDX at RESET/INIT
KVM: x86: Split out CR0/CR4 MMU role change detectors to separate
helpers
KVM: x86: Properly reset MMU context at vCPU RESET/INIT
KVM: VMX: Remove explicit MMU reset in enter_rmode()
KVM: SVM: Drop explicit MMU reset at RESET/INIT
KVM: SVM: Drop a redundant init_vmcb() from svm_create_vcpu()
KVM: VMX: Move init_vmcs() invocation to vmx_vcpu_reset()
KVM: x86: WARN if the APIC map is dirty without an in-kernel local
APIC
KVM: x86: Remove defunct BSP "update" in local APIC reset
KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSP
KVM: x86: Don't force set BSP bit when local APIC is managed by
userspace
KVM: x86: Set BSP bit in reset BSP vCPU's APIC base by default
KVM: VMX: Stuff vcpu->arch.apic_base directly at vCPU RESET
KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU
RESET
KVM: x86: Consolidate APIC base RESET initialization code
KVM: x86: Move EDX initialization at vCPU RESET to common code
KVM: SVM: Don't bother writing vmcb->save.rip at vCPU RESET/INIT
KVM: VMX: Invert handling of CR0.WP for EPT without unrestricted guest
KVM: VMX: Remove direct write to vcpu->arch.cr0 during vCPU RESET/INIT
KVM: VMX: Fold ept_update_paging_mode_cr0() back into vmx_set_cr0()
KVM: nVMX: Do not clear CR3 load/store exiting bits if L1 wants 'em
KVM: VMX: Pull GUEST_CR3 from the VMCS iff CR3 load exiting is
disabled
KVM: VMX: Process CR0.PG side effects after setting CR0 assets
KVM: VMX: Skip emulation required checks during pmode/rmode
transitions
KVM: nVMX: Don't evaluate "emulation required" on VM-Exit
KVM: SVM: Tweak order of cr0/cr4/efer writes at RESET/INIT
KVM: SVM: Drop redundant writes to vmcb->save.cr4 at RESET/INIT
KVM: SVM: Stuff save->dr6 at during VMSA sync, not at RESET/INIT
KVM: VMX: Skip pointless MSR bitmap update when setting EFER
KVM: VMX: Refresh list of user return MSRs after setting guest CPUID
KVM: VMX: Don't _explicitly_ reconfigure user return MSRs on vCPU INIT
KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86
KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INIT
KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions
KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is
changed
KVM: VMX: Remove unnecessary initialization of msr_bitmap_mode
KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function
KVM: VMX: Remove redundant write to set vCPU as active at RESET/INIT
KVM: VMX: Drop VMWRITEs to zero fields at vCPU RESET
KVM: x86: Drop pointless @reset_roots from kvm_init_mmu()
arch/x86/include/asm/kvm_host.h | 5 -
arch/x86/kvm/i8254.c | 3 +-
arch/x86/kvm/lapic.c | 26 +--
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 13 +-
arch/x86/kvm/svm/nested.c | 2 +-
arch/x86/kvm/svm/sev.c | 1 +
arch/x86/kvm/svm/svm.c | 33 +---
arch/x86/kvm/vmx/nested.c | 26 ++-
arch/x86/kvm/vmx/vmx.c | 271 +++++++++++++-------------------
arch/x86/kvm/vmx/vmx.h | 5 +-
arch/x86/kvm/x86.c | 51 +++++-
12 files changed, 189 insertions(+), 249 deletions(-)
--
2.31.1.498.g6c1eba8ee3d-goog
Set EDX at RESET/INIT based on the userspace-defined CPUID model when
possible, i.e. when CPUID.0x1.EAX is defined by userspace. At
RESET/INIT, all CPUs that support CPUID set EDX to the FMS enumerated in
CPUID.0x1.EAX. If no CPUID match is found, fall back to KVM's default
of 0x600 (Family '6'), which is the least awful approximation of KVM's
virtual CPU model.
Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a182cae71044..9c00d013af59 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4497,6 +4497,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct msr_data apic_base_msr;
+ u32 eax, dummy;
u64 cr0;
vmx->rmode.vm86_active = 0;
@@ -4504,7 +4505,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx->msr_ia32_umwait_control = 0;
- vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
+ eax = 1;
+ if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
+ eax = get_rdx_init_val();
+ kvm_rdx_write(vcpu, eax);
+
vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);
--
2.31.1.498.g6c1eba8ee3d-goog
Split out the post-CR0/CR4 MMU role change detectors to separate helpers,
they will be used during vCPU RESET/INIT to conditionally reset the MMU
in a future patch.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3bf52ba5f2bb..0bc783fc6c9b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -823,16 +823,21 @@ bool pdptrs_changed(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(pdptrs_changed);
+static bool kvm_cr0_mmu_role_changed(unsigned long old_cr0, unsigned long cr0)
+{
+ unsigned long mmu_role_bits = X86_CR0_PG | X86_CR0_WP;
+
+ return (cr0 ^ old_cr0) & mmu_role_bits;
+}
+
void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
{
- unsigned long update_bits = X86_CR0_PG | X86_CR0_WP;
-
if ((cr0 ^ old_cr0) & X86_CR0_PG) {
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
}
- if ((cr0 ^ old_cr0) & update_bits)
+ if (kvm_cr0_mmu_role_changed(old_cr0, cr0))
kvm_mmu_reset_context(vcpu);
if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
@@ -1009,13 +1014,18 @@ bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
}
EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
-void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
+static bool kvm_cr4_mmu_role_changed(unsigned long old_cr4, unsigned long cr4)
{
unsigned long mmu_role_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
- if (((cr4 ^ old_cr4) & mmu_role_bits) ||
- (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
+ return (((cr4 ^ old_cr4) & mmu_role_bits) ||
+ (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)));
+}
+
+void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
+{
+ if (kvm_cr4_mmu_role_changed(old_cr4, cr4))
kvm_mmu_reset_context(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
--
2.31.1.498.g6c1eba8ee3d-goog
Post-process the CR0 and CR4 changes at vCPU INIT (and RESET for good
measure) to effect a MMU context reset when necessary. Simply
re-initializing the current MMU is not sufficient as the current root
HPA may not be usable in the new context. E.g. if TDP is disabled and
INIT arrives while the vCPU is in long mode, KVM will fail to switch to
the 32-bit pae_root and bomb on the next VM-Enter due to running with a
64-bit CR3 in 32-bit mode.
This bug was papered over in both VMX and SVM.
In VMX, the INIT issue is specific to running without unrestricted guest
since unrestricted guest is available if and only if EPT is enabled.
Commit 8668a3c468ed ("KVM: VMX: Reset mmu context when entering real
mode") resolved the issue by forcing a reset when entering emulated real
mode.
In SVM, commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset") forced
a MMU reset on every INIT to workaround the flaw in common x86. Note, at
the time the bug was fixed, the SVM problem was exacerbated by a complete
lack of a CR4 update.
The VMX and SVM fixes are not technically wrong, but lack of precision
makes it difficult to reason about why a context reset is needed. The VMX
code in particular is nasty. The vendor resets will be reverted in future
patches, primarily to aid bisection in case there are non-INIT flows that
rely on the existing VMX logic.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0bc783fc6c9b..b87193190a73 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10341,7 +10341,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
kvm_vcpu_mtrr_init(vcpu);
vcpu_load(vcpu);
kvm_vcpu_reset(vcpu, false);
- kvm_init_mmu(vcpu, false);
vcpu_put(vcpu);
return 0;
@@ -10415,6 +10414,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
+ unsigned long old_cr0 = kvm_read_cr0(vcpu);
+ unsigned long old_cr4 = kvm_read_cr4(vcpu);
+
kvm_lapic_reset(vcpu, init_event);
vcpu->arch.hflags = 0;
@@ -10483,6 +10485,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->arch.ia32_xss = 0;
static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
+
+ if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
+ kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
+ kvm_mmu_reset_context(vcpu);
}
void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
--
2.31.1.498.g6c1eba8ee3d-goog
Drop an extra init_vmcb() from svm_create_vcpu(), svm_vcpu_reset() is
guaranteed to call init_vmcb() and there are no consumers of the VMCB
data between ->vcpu_create() and ->vcpu_reset(). Keep the call to
svm_switch_vmcb() as sev_es_create_vcpu() touches the current VMCB, but
hoist it up a few lines to associate the switch with the allocation of
vmcb01.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fbea2f45de9a..6c73ea3d20c6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1371,15 +1371,13 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
svm->vmcb01.ptr = page_address(vmcb01_page);
svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
+ svm_switch_vmcb(svm, &svm->vmcb01);
if (vmsa_page)
svm->vmsa = page_address(vmsa_page);
svm->guest_state_loaded = false;
- svm_switch_vmcb(svm, &svm->vmcb01);
- init_vmcb(vcpu);
-
svm_init_osvw(vcpu);
vcpu->arch.microcode_version = 0x01000065;
--
2.31.1.498.g6c1eba8ee3d-goog
Drop an explicit MMU reset in SVM's vCPU RESET/INIT flow now that the
common x86 path correctly handles conditional MMU resets, e.g. if INIT
arrives while the vCPU is in 64-bit mode.
This reverts commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset").
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d4d7720ce42f..fbea2f45de9a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1216,7 +1216,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
* It also updates the guest-visible cr0 value.
*/
svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
- kvm_mmu_reset_context(vcpu);
save->cr4 = X86_CR4_PAE;
--
2.31.1.498.g6c1eba8ee3d-goog
Initialize constant VMCS state in vcpu_vcpu_reset() instead of in
vmx_vcpu_create(), which allows for the removal of the open coded "vCPU
load" sequence since ->vcpu_reset() is invoked while the vCPU is properly
loaded (which is the entire point of vCPU reset...).
Deferring initialization is effectively a nop as it's impossible to
safely access the VMCS between the current call site and its new home, as
both the vCPU and the pCPU are put immediately after init_vmcs(), i.e.
the VMCS isn't guaranteed to be loaded.
Note, task preemption is not a problem as vmx_sched_in() _can't_ touch
the VMCS as ->sched_in() is invoked before the vCPU, and thus VMCS, is
reloaded. I.e. the preemption path also can't consume VMCS state.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e90952ca6087..856aa44b17d5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4396,10 +4396,6 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
#define VMX_XSS_EXIT_BITMAP 0
-/*
- * Noting that the initialization of Guest-state Area of VMCS is in
- * vmx_vcpu_reset().
- */
static void init_vmcs(struct vcpu_vmx *vmx)
{
if (nested)
@@ -4498,6 +4494,9 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u32 eax, dummy;
u64 cr0;
+ if (!init_event)
+ init_vmcs(vmx);
+
vmx->rmode.vm86_active = 0;
vmx->spec_ctrl = 0;
@@ -6905,7 +6904,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx;
- int i, cpu, err;
+ int i, err;
BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
vmx = to_vmx(vcpu);
@@ -6991,12 +6990,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
vmx->msr_bitmap_mode = 0;
vmx->loaded_vmcs = &vmx->vmcs01;
- cpu = get_cpu();
- vmx_vcpu_load(vcpu, cpu);
- vcpu->cpu = cpu;
- init_vmcs(vmx);
- vmx_vcpu_put(vcpu);
- put_cpu();
+
if (cpu_need_virtualize_apic_accesses(vcpu)) {
err = alloc_apic_access_page(vcpu->kvm);
if (err)
--
2.31.1.498.g6c1eba8ee3d-goog
Do not allow an inexact CPUID "match" when querying the guest's CPUID.0x1
to stuff EDX during INIT. In the common case, where the guest CPU model
is an AMD variant, allowing an inexact match is a nop since KVM doesn't
emulate Intel's goofy "out-of-range" logic for AMD and Hygon. If the
vCPU model happens to be an Intel variant, an inexact match is possible
if and only if the max CPUID leaf is precisely '0'. Aside from the fact
that there's probably no CPU in existence with a single CPUID leaf, if
the max CPUID leaf is '0', that means that CPUID.0.EAX is '0', and thus
an inexact match for CPUID.0x1.EAX will also yield '0'.
So, with lots of twisty logic, no functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 14ff7f0963e9..b9e3229ddc27 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1299,7 +1299,7 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
}
init_vmcb(vcpu);
- kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, false);
+ kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true);
kvm_rdx_write(vcpu, eax);
if (kvm_vcpu_apicv_active(vcpu) && !init_event)
--
2.31.1.498.g6c1eba8ee3d-goog
At vCPU RESET/INIT (mostly RESET), stuff EDX with KVM's hardcoded,
default Family-Model-Stepping ID of 0x600 if CPUID.0x1 isn't defined.
At RESET, the CPUID lookup is guaranteed to "miss" because KVM emulates
RESET before exposing the vCPU to userspace, i.e. userspace can't
possibly have done set the vCPU's CPUID model, and thus KVM will always
write '0'. At INIT, using 0x600 is less bad than using '0'.
While initializing EDX to '0' is _extremely_ unlikely to be noticed by
the guest, let alone break the guest, and can be overridden by
userspace for the RESET case, using 0x600 is preferable as it will allow
consolidating the relevant VMX and SVM RESET/INIT logic in the future.
And, digging through old specs suggests that neither Intel nor AMD have
ever shipped a CPU that initialized EDX to '0' at RESET.
Regarding 0x600 as KVM's default Family, it is a sane default and in
many ways the most appropriate. Prior to the 386 implementations, DX
was undefined at RESET. With the 386, 486, 586/P5, and 686/P6/Athlon,
both Intel and AMD set EDX to 3, 4, 5, and 6 respectively. AMD switched
to using '15' as its primary Family with the introduction of AMD64, but
Intel has continued using '6' for the last few decades.
So, '6' is a valid Family for both Intel and AMD CPUs, is compatible
with both 32-bit and 64-bit CPUs (albeit not a perfect fit for 64-bit
AMD), and of the common Families (3 - 6), is the best fit with respect to
KVM's virtual CPU model. E.g. prior to the P6, Intel CPUs did not have a
STI window. Modern operating systems, Linux included, rely on the STI
window, e.g. for "safe halt", and KVM unconditionally assumes the virtual
CPU has an STI window. Thus enumerating a Family ID of 3, 4, or 5 would
be provably wrong.
Opportunistically remove a stale comment.
Fixes: 66f7b72e1171 ("KVM: x86: Make register state after reset conform to specification")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b9e3229ddc27..d4d7720ce42f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1219,7 +1219,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
kvm_mmu_reset_context(vcpu);
save->cr4 = X86_CR4_PAE;
- /* rdx = ?? */
if (npt_enabled) {
/* Setup VMCB for Nested Paging */
@@ -1299,7 +1298,15 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
}
init_vmcb(vcpu);
- kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true);
+ /*
+ * Fall back to KVM's default Family/Model/Stepping if no CPUID match
+ * is found. Note, it's impossible to get a match at RESET since KVM
+ * emulates RESET before exposing the vCPU to userspace, i.e. it's
+ * impossible for kvm_cpuid() to find a valid entry on RESET. But, go
+ * through the motions in case that's ever remedied, and to be pedantic.
+ */
+ if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
+ eax = get_rdx_init_val();
kvm_rdx_write(vcpu, eax);
if (kvm_vcpu_apicv_active(vcpu) && !init_event)
--
2.31.1.498.g6c1eba8ee3d-goog
Set L1's LDTR on VM-Exit per the Intel SDM:
The host-state area does not contain a selector field for LDTR. LDTR is
established as follows on all VM exits: the selector is cleared to
0000H, the segment is marked unusable and is otherwise undefined
(although the base address is always canonical).
This is likely a benign bug since the LDTR is unusable, as it means the
L1 VMM is conditioned to reload its LDTR in order to function properly on
bare metal.
Fixes: 4704d0befb07 ("KVM: nVMX: Exiting from L2 to L1")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 00339d624c92..32126fa0c4d8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4276,6 +4276,10 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
};
vmx_set_segment(vcpu, &seg, VCPU_SREG_TR);
+ memset(&seg, 0, sizeof(seg));
+ seg.unusable = 1;
+ vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
+
kvm_set_dr(vcpu, 7, 0x400);
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
--
2.31.1.498.g6c1eba8ee3d-goog
WARN if KVM ends up in a state where it thinks its APIC map needs to be
recalculated, but KVM is not emulating the local APIC. This is mostly
to document KVM's "rules" in order to provide clarity in future cleanups.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 152591f9243a..655eb1d07344 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -192,6 +192,9 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
if (atomic_read_acquire(&kvm->arch.apic_map_dirty) == CLEAN)
return;
+ WARN_ONCE(!irqchip_in_kernel(kvm),
+ "Dirty APIC map without an in-kernel local APIC");
+
mutex_lock(&kvm->arch.apic_map_lock);
/*
* Read kvm->arch.apic_map_dirty before kvm->arch.apic_map
--
2.31.1.498.g6c1eba8ee3d-goog
Make vcpu0 the arbitrary owner of the PIT, as was intended when PIT
migration was added by commit 2f5997140f22 ("KVM: migrate PIT timer").
The PIT was unintentionally turned into being owned by the BSP by commit
c5af89b68abb ("KVM: Introduce kvm_vcpu_is_bsp() function."), and was then
unintentionally converted to a shared ownership model when
kvm_vcpu_is_bsp() was modified to check the APIC base MSR instead of
hardcoding vcpu0 as the BSP.
Functionally, this just means the PIT's hrtimer is migrated less often.
The real motivation is to remove the usage of kvm_vcpu_is_bsp(), so that
more legacy/broken crud can be removed in a future patch.
Fixes: 58d269d8cccc ("KVM: x86: BSP in MSR_IA32_APICBASE is writable")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/i8254.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index a6e218c6140d..5a69cce4d72d 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -220,7 +220,8 @@ void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
struct kvm_pit *pit = vcpu->kvm->arch.vpit;
struct hrtimer *timer;
- if (!kvm_vcpu_is_bsp(vcpu) || !pit)
+ /* Somewhat arbitrarily make vcpu0 the owner of the PIT. */
+ if (vcpu->vcpu_id || !pit)
return;
timer = &pit->pit_state.timer;
--
2.31.1.498.g6c1eba8ee3d-goog
Don't set the BSP bit in vcpu->arch.apic_base when the local APIC is
managed by userspace. Forcing all vCPUs to be BSPs is non-sensical, and
was dead code when it was added by commit 97222cc83163 ("KVM: Emulate
local APIC in kernel"). At the time, kvm_lapic_set_base() was invoked
if and only if the local APIC was in-kernel (and it couldn't be called
before the vCPU created its APIC).
kvm_lapic_set_base() eventually gained generic usage, but the latent bug
escaped notice because the only true consumer would be the guest itself
in the form of an explicit RDMSRs on APs. Out of Linux, SeaBIOS, and
EDK2/OVMF, only OVMF consume the BSP bit from the APIC_BASE MSR. For
the vast majority of usage in OVMF, BSP confusion would be benign.
OVMF's BSP election upon SMI rendezvous might be broken, but practically
no one runs KVM with an out-of-kernel local APIC, let alone does so while
utilizing SMIs with OVMF.
Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b99630c6d7fe..c11f23753a5b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2252,9 +2252,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
u64 old_value = vcpu->arch.apic_base;
struct kvm_lapic *apic = vcpu->arch.apic;
- if (!apic)
- value |= MSR_IA32_APICBASE_BSP;
-
vcpu->arch.apic_base = value;
if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE)
--
2.31.1.498.g6c1eba8ee3d-goog
Set the BSP bit appropriately during local APIC "reset" instead of
relying on vendor code to clean up at a later point. This is a step
towards consolidating the local APIC, VMX, and SVM xAPIC initialization
code.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c11f23753a5b..b088f6984b37 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2305,6 +2305,7 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_apicv);
void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct kvm_lapic *apic = vcpu->arch.apic;
+ u64 msr_val;
int i;
if (!apic)
@@ -2314,8 +2315,10 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
hrtimer_cancel(&apic->lapic_timer.timer);
if (!init_event) {
- kvm_lapic_set_base(vcpu, APIC_DEFAULT_PHYS_BASE |
- MSR_IA32_APICBASE_ENABLE);
+ msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
+ if (kvm_vcpu_is_reset_bsp(vcpu))
+ msr_val |= MSR_IA32_APICBASE_BSP;
+ kvm_lapic_set_base(vcpu, msr_val);
kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
}
kvm_apic_set_version(apic->vcpu);
--
2.31.1.498.g6c1eba8ee3d-goog
Drop an explicit MMU reset when entering emulated real mode now that the
vCPU INIT/RESET path correctly handles conditional MMU resets, e.g. if
INIT arrives while the vCPU is in 64-bit mode.
Note, while there are multiple other direct calls to vmx_set_cr0(), i.e.
paths that change CR0 without invoking kvm_post_set_cr0(), only the INIT
emulation can reach enter_rmode(). CLTS emulation only toggles CR.TS,
VM-Exit (and late VM-Fail) emulation cannot architecturally transition to
Real Mode, and VM-Enter to Real Mode is possible if and only if
Unrestricted Guest is enabled (exposed to L1).
This effectively reverts commit 8668a3c468ed ("KVM: VMX: Reset mmu
context when entering real mode")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9c00d013af59..e90952ca6087 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2960,8 +2960,6 @@ static void enter_rmode(struct kvm_vcpu *vcpu)
fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.segs[VCPU_SREG_DS]);
fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.segs[VCPU_SREG_GS]);
fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.segs[VCPU_SREG_FS]);
-
- kvm_mmu_reset_context(vcpu);
}
int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
--
2.31.1.498.g6c1eba8ee3d-goog
Write vcpu->arch.apic_base directly instead of bouncing through
kvm_set_apic_base(). This is a glorified nop, and is a step towards
cleaning up the mess that is local APIC creation.
When using an in-kernel APIC, kvm_create_lapic() explicitly sets
vcpu->arch.apic_base to MSR_IA32_APICBASE_ENABLE to avoid its own
kvm_lapic_set_base() call in kvm_lapic_reset() from triggering state
changes. That call during RESET exists purely to set apic->base_address
to the default base value. As a result, by the time VMX gets control,
the only missing piece is the BSP bit being set for the reset BSP.
For a userspace APIC, there are no side effects to process (for the APIC).
In both cases, the call to kvm_update_cpuid_runtime() is a nop because
the vCPU hasn't yet been exposed to userspace, i.e. there can't be any
CPUID entries.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 856aa44b17d5..fa14e9a74b96 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4490,7 +4490,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- struct msr_data apic_base_msr;
u32 eax, dummy;
u64 cr0;
@@ -4511,12 +4510,10 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_set_cr8(vcpu, 0);
if (!init_event) {
- apic_base_msr.data = APIC_DEFAULT_PHYS_BASE |
- MSR_IA32_APICBASE_ENABLE;
+ vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
+ MSR_IA32_APICBASE_ENABLE;
if (kvm_vcpu_is_reset_bsp(vcpu))
- apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
- apic_base_msr.host_initiated = true;
- kvm_set_apic_base(vcpu, &apic_base_msr);
+ vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
}
vmx_segment_cache_clear(vmx);
--
2.31.1.498.g6c1eba8ee3d-goog
Stuff vcpu->arch.apic_base and apic->base_address directly during APIC
reset, as opposed to bouncing through kvm_set_apic_base() while fudging
the ENABLE bit during creation to avoid the other, unwanted side effects.
This is a step towards consolidating the APIC RESET logic across x86,
VMX, and SVM.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b088f6984b37..b1366df46d1d 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2305,7 +2305,6 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_apicv);
void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u64 msr_val;
int i;
if (!apic)
@@ -2315,10 +2314,13 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
hrtimer_cancel(&apic->lapic_timer.timer);
if (!init_event) {
- msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
+ vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
+ MSR_IA32_APICBASE_ENABLE;
if (kvm_vcpu_is_reset_bsp(vcpu))
- msr_val |= MSR_IA32_APICBASE_BSP;
- kvm_lapic_set_base(vcpu, msr_val);
+ vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
+
+ apic->base_address = MSR_IA32_APICBASE_ENABLE;
+
kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
}
kvm_apic_set_version(apic->vcpu);
@@ -2461,11 +2463,6 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
lapic_timer_advance_dynamic = false;
}
- /*
- * APIC is created enabled. This will prevent kvm_lapic_set_base from
- * thinking that APIC state has changed.
- */
- vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE;
static_branch_inc(&apic_sw_disabled.key); /* sw disabled at reset */
kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
--
2.31.1.498.g6c1eba8ee3d-goog
Consolidate the APIC base RESET logic, which is currently spread out
across both x86 and vendor code. For an in-kernel APIC, the vendor code
is redundant. But for a userspace APIC, KVM relies on the vendor code
to initialize vcpu->arch.apic_base. Hoist the vcpu->arch.apic_base
initialization above the !apic check so that it applies to both flavors
of APIC emulation, and delete the vendor code.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 14 ++++++++------
arch/x86/kvm/svm/svm.c | 6 ------
arch/x86/kvm/vmx/vmx.c | 7 -------
3 files changed, 8 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b1366df46d1d..07cfa4d181da 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2307,18 +2307,20 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
struct kvm_lapic *apic = vcpu->arch.apic;
int i;
- if (!apic)
- return;
-
- /* Stop the timer in case it's a reset to an active apic */
- hrtimer_cancel(&apic->lapic_timer.timer);
-
if (!init_event) {
vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
MSR_IA32_APICBASE_ENABLE;
if (kvm_vcpu_is_reset_bsp(vcpu))
vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
+ }
+ if (!apic)
+ return;
+
+ /* Stop the timer in case it's a reset to an active apic */
+ hrtimer_cancel(&apic->lapic_timer.timer);
+
+ if (!init_event) {
apic->base_address = MSR_IA32_APICBASE_ENABLE;
kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6c73ea3d20c6..271b6def087f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1289,12 +1289,6 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
svm->spec_ctrl = 0;
svm->virt_spec_ctrl = 0;
- if (!init_event) {
- vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
- MSR_IA32_APICBASE_ENABLE;
- if (kvm_vcpu_is_reset_bsp(vcpu))
- vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
- }
init_vmcb(vcpu);
/*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fa14e9a74b96..40a4ac23d54f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4509,13 +4509,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);
- if (!init_event) {
- vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
- MSR_IA32_APICBASE_ENABLE;
- if (kvm_vcpu_is_reset_bsp(vcpu))
- vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
- }
-
vmx_segment_cache_clear(vmx);
seg_setup(VCPU_SREG_CS);
--
2.31.1.498.g6c1eba8ee3d-goog
Remove a BSP APIC update in kvm_lapic_reset() that is a glorified and
confusing nop. When the code was originally added, kvm_vcpu_is_bsp()
queried kvm->arch.bsp_vcpu, i.e. the intent was to set the BSP bit in the
BSP vCPU's APIC. But, stuffing the BSP bit at INIT was wrong since the
guest can change its BSP(s); this was fixed by commit 58d269d8cccc ("KVM:
x86: BSP in MSR_IA32_APICBASE is writable").
In other words, kvm_vcpu_is_bsp() is now purely a reflection of
vcpu->arch.apic_base.MSR_IA32_APICBASE_BSP, thus the update will always
set the current value and kvm_lapic_set_base() is effectively a nop if
the new and old values match. The RESET case, which does need to stuff
the BSP for the reset vCPU, is handled by vendor code (though this will
soon be moved to common code).
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 655eb1d07344..b99630c6d7fe 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2351,9 +2351,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
apic->highest_isr_cache = -1;
update_divide_count(apic);
atomic_set(&apic->lapic_timer.pending, 0);
- if (kvm_vcpu_is_bsp(vcpu))
- kvm_lapic_set_base(vcpu,
- vcpu->arch.apic_base | MSR_IA32_APICBASE_BSP);
+
vcpu->arch.pv_eoi.msr_val = 0;
apic_update_ppr(apic);
if (vcpu->arch.apicv_active) {
--
2.31.1.498.g6c1eba8ee3d-goog
Remove a bogus write to vcpu->arch.cr0 that immediately precedes
vmx_set_cr0() during vCPU RESET/INIT. For RESET, this is a nop since
the "old" CR0 value is meaningless. But for INIT, if the vCPU is coming
from paging enabled mode, crushing vcpu->arch.cr0 will cause the various
is_paging() checks in vmx_set_cr0() to get false negatives.
For the exit_lmode() case, the false negative is benign as vmx_set_efer()
is called immediately after vmx_set_cr0().
For EPT without unrestricted guest, the false negative will cause KVM to
unnecessarily run with CR3 load/store exiting. But again, this is
benign, albeit sub-optimal.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d0050c140b4d..5795de909609 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4486,7 +4486,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- u64 cr0;
if (!init_event)
init_vmcs(vmx);
@@ -4557,9 +4556,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
- cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
- vmx->vcpu.arch.cr0 = cr0;
- vmx_set_cr0(vcpu, cr0); /* enter rmode */
+ vmx_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
vmx_set_cr4(vcpu, 0);
vmx_set_efer(vcpu, 0);
--
2.31.1.498.g6c1eba8ee3d-goog
Don't refresh "emulation required" when stuffing segments during
transitions to/from real mode when running without unrestricted guest.
The checks are unnecessary as vmx_set_cr0() unconditionally rechecks
"emulation required". They also happen to be broken, as enter_pmode()
and enter_rmode() run with a stale vcpu->arch.cr0.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5f30181fd240..37bfa20a04dd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2827,6 +2827,8 @@ static __init int alloc_kvm_area(void)
return 0;
}
+static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+
static void fix_pmode_seg(struct kvm_vcpu *vcpu, int seg,
struct kvm_segment *save)
{
@@ -2843,7 +2845,7 @@ static void fix_pmode_seg(struct kvm_vcpu *vcpu, int seg,
save->dpl = save->selector & SEGMENT_RPL_MASK;
save->s = 1;
}
- vmx_set_segment(vcpu, save, seg);
+ __vmx_set_segment(vcpu, save, seg);
}
static void enter_pmode(struct kvm_vcpu *vcpu)
@@ -2864,7 +2866,7 @@ static void enter_pmode(struct kvm_vcpu *vcpu)
vmx->rmode.vm86_active = 0;
- vmx_set_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_TR], VCPU_SREG_TR);
+ __vmx_set_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_TR], VCPU_SREG_TR);
flags = vmcs_readl(GUEST_RFLAGS);
flags &= RMODE_GUEST_OWNED_EFLAGS_BITS;
@@ -3399,7 +3401,7 @@ static u32 vmx_segment_access_rights(struct kvm_segment *var)
return ar;
}
-void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
@@ -3412,7 +3414,7 @@ void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
vmcs_write16(sf->selector, var->selector);
else if (var->s)
fix_rmode_seg(seg, &vmx->rmode.segs[seg]);
- goto out;
+ return;
}
vmcs_writel(sf->base, var->base);
@@ -3434,9 +3436,13 @@ void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
var->type |= 0x1; /* Accessed */
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
+}
-out:
- vmx->emulation_required = emulation_required(vcpu);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+{
+ __vmx_set_segment(vcpu, var, seg);
+
+ to_vmx(vcpu)->emulation_required = emulation_required(vcpu);
}
static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
--
2.31.1.498.g6c1eba8ee3d-goog
Use the "internal" variants of setting segment registers when stuffing
state on nested VM-Exit in order to skip the "emulation required"
updates. VM-Exit must always go to protected mode, and all segments are
mostly hardcoded (to valid values) on VM-Exit. The bits of the segments
that aren't hardcoded are explicitly checked during VM-Enter, e.g. the
selector RPLs must all be zero.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 16 ++++++++--------
arch/x86/kvm/vmx/vmx.c | 6 ++----
arch/x86/kvm/vmx/vmx.h | 2 +-
3 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 32126fa0c4d8..f811bb7f2dc3 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4245,7 +4245,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
seg.l = 1;
else
seg.db = 1;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_CS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_CS);
seg = (struct kvm_segment) {
.base = 0,
.limit = 0xFFFFFFFF,
@@ -4256,17 +4256,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
.g = 1
};
seg.selector = vmcs12->host_ds_selector;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_DS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_DS);
seg.selector = vmcs12->host_es_selector;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_ES);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_ES);
seg.selector = vmcs12->host_ss_selector;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_SS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_SS);
seg.selector = vmcs12->host_fs_selector;
seg.base = vmcs12->host_fs_base;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_FS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_FS);
seg.selector = vmcs12->host_gs_selector;
seg.base = vmcs12->host_gs_base;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_GS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_GS);
seg = (struct kvm_segment) {
.base = vmcs12->host_tr_base,
.limit = 0x67,
@@ -4274,11 +4274,11 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
.type = 11,
.present = 1
};
- vmx_set_segment(vcpu, &seg, VCPU_SREG_TR);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_TR);
memset(&seg, 0, sizeof(seg));
seg.unusable = 1;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
kvm_set_dr(vcpu, 7, 0x400);
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 37bfa20a04dd..594975dc3f94 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2827,8 +2827,6 @@ static __init int alloc_kvm_area(void)
return 0;
}
-static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
-
static void fix_pmode_seg(struct kvm_vcpu *vcpu, int seg,
struct kvm_segment *save)
{
@@ -3401,7 +3399,7 @@ static u32 vmx_segment_access_rights(struct kvm_segment *var)
return ar;
}
-static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
@@ -3438,7 +3436,7 @@ static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, in
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
}
-void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
__vmx_set_segment(vcpu, var, seg);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 19fe09fad2fe..1283ad0e592d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -376,7 +376,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
void set_cr4_guest_host_mask(struct vcpu_vmx *vmx);
void ept_save_pdptrs(struct kvm_vcpu *vcpu);
void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
-void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
--
2.31.1.498.g6c1eba8ee3d-goog
Move the EDX initialization at vCPU RESET, which is now identical between
VMX and SVM, into common code.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 -----
arch/x86/kvm/svm/svm.c | 13 -------------
arch/x86/kvm/vmx/vmx.c | 6 ------
arch/x86/kvm/x86.c | 13 +++++++++++++
4 files changed, 13 insertions(+), 24 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3e5fc80a35c8..2b9799366dad 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1701,11 +1701,6 @@ static inline unsigned long read_msr(unsigned long msr)
}
#endif
-static inline u32 get_rdx_init_val(void)
-{
- return 0x600; /* P6 family */
-}
-
static inline void kvm_inject_gp(struct kvm_vcpu *vcpu, u32 error_code)
{
kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271b6def087f..5c12ba725186 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1283,25 +1283,12 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_svm *svm = to_svm(vcpu);
- u32 dummy;
- u32 eax = 1;
svm->spec_ctrl = 0;
svm->virt_spec_ctrl = 0;
init_vmcb(vcpu);
- /*
- * Fall back to KVM's default Family/Model/Stepping if no CPUID match
- * is found. Note, it's impossible to get a match at RESET since KVM
- * emulates RESET before exposing the vCPU to userspace, i.e. it's
- * impossible for kvm_cpuid() to find a valid entry on RESET. But, go
- * through the motions in case that's ever remedied, and to be pedantic.
- */
- if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
- eax = get_rdx_init_val();
- kvm_rdx_write(vcpu, eax);
-
if (kvm_vcpu_apicv_active(vcpu) && !init_event)
avic_update_vapic_bar(svm, APIC_DEFAULT_PHYS_BASE);
}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 40a4ac23d54f..805888541142 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4490,7 +4490,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- u32 eax, dummy;
u64 cr0;
if (!init_event)
@@ -4501,11 +4500,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx->msr_ia32_umwait_control = 0;
- eax = 1;
- if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
- eax = get_rdx_init_val();
- kvm_rdx_write(vcpu, eax);
-
vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b87193190a73..167c650d1187 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10416,6 +10416,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
unsigned long old_cr0 = kvm_read_cr0(vcpu);
unsigned long old_cr4 = kvm_read_cr4(vcpu);
+ u32 eax, dummy;
kvm_lapic_reset(vcpu, init_event);
@@ -10482,6 +10483,18 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->arch.regs_avail = ~0;
vcpu->arch.regs_dirty = ~0;
+ /*
+ * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon)
+ * if no CPUID match is found. Note, it's impossible to get a match at
+ * RESET since KVM emulates RESET before exposing the vCPU to userspace,
+ * i.e. it'simpossible for kvm_cpuid() to find a valid entry on RESET.
+ * But, go through the motions in case that's ever remedied.
+ */
+ eax = 1;
+ if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
+ eax = 0x600;
+ kvm_rdx_write(vcpu, eax);
+
vcpu->arch.ia32_xss = 0;
static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
--
2.31.1.498.g6c1eba8ee3d-goog
Remove unnecessary initialization of vmcb->save.rip during vCPU RESET/INIT,
as svm_vcpu_run() unconditionally propagates VCPU_REGS_RIP to save.rip.
No true functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5c12ba725186..4ea100c08cb3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1208,8 +1208,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
svm_set_efer(vcpu, 0);
save->dr6 = 0xffff0ff0;
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
- save->rip = 0x0000fff0;
- vcpu->arch.regs[VCPU_REGS_RIP] = save->rip;
+ vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
/*
* svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
--
2.31.1.498.g6c1eba8ee3d-goog
Opt-in to forcing CR0.WP=1 for shadow paging, and stop lying about WP
being "always on" for unrestricted guest. In addition to making KVM a
wee bit more honest, this paves the way for additional cleanup.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 805888541142..d0050c140b4d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -135,8 +135,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
#define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
#define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
#define KVM_VM_CR0_ALWAYS_ON \
- (KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | \
- X86_CR0_WP | X86_CR0_PG | X86_CR0_PE)
+ (KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | X86_CR0_PG | X86_CR0_PE)
#define KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR4_VMXE
#define KVM_PMODE_VM_CR4_ALWAYS_ON (X86_CR4_PAE | X86_CR4_VMXE)
@@ -3103,9 +3102,7 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
}
-static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
- unsigned long cr0,
- struct kvm_vcpu *vcpu)
+static void ept_update_paging_mode_cr0(unsigned long cr0, struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3124,9 +3121,6 @@ static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
vcpu->arch.cr0 = cr0;
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
}
-
- if (!(cr0 & X86_CR0_WP))
- *hw_cr0 &= ~X86_CR0_WP;
}
void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
@@ -3139,6 +3133,8 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST;
else {
hw_cr0 |= KVM_VM_CR0_ALWAYS_ON;
+ if (!enable_ept)
+ hw_cr0 |= X86_CR0_WP;
if (vmx->rmode.vm86_active && (cr0 & X86_CR0_PE))
enter_pmode(vcpu);
@@ -3157,7 +3153,7 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
#endif
if (enable_ept && !is_unrestricted_guest(vcpu))
- ept_update_paging_mode_cr0(&hw_cr0, cr0, vcpu);
+ ept_update_paging_mode_cr0(cr0, vcpu);
vmcs_writel(CR0_READ_SHADOW, cr0);
vmcs_writel(GUEST_CR0, hw_cr0);
--
2.31.1.498.g6c1eba8ee3d-goog
Hoist svm_set_cr0() up in the sequence of register initialization during
vCPU RESET/INIT, purely to match VMX so that a future patch can move the
sequences to common x86.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 4ea100c08cb3..88d34fa93d8b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1204,18 +1204,13 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
+ svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
svm_set_cr4(vcpu, 0);
svm_set_efer(vcpu, 0);
save->dr6 = 0xffff0ff0;
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
- /*
- * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
- * It also updates the guest-visible cr0 value.
- */
- svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
-
save->cr4 = X86_CR4_PAE;
if (npt_enabled) {
--
2.31.1.498.g6c1eba8ee3d-goog
Drop direct writes to vmcb->save.cr4 during vCPU RESET/INIT, as the
values being written are fully redundant with respect to
svm_set_cr4(vcpu, 0) a few lines earlier. Note, svm_set_cr4() also
correctly forces X86_CR4_PAE when NPT is disabled.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 88d34fa93d8b..558329f53709 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1211,8 +1211,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
- save->cr4 = X86_CR4_PAE;
-
if (npt_enabled) {
/* Setup VMCB for Nested Paging */
control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
@@ -1222,7 +1220,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
svm_clr_intercept(svm, INTERCEPT_CR3_WRITE);
save->g_pat = vcpu->arch.pat;
save->cr3 = 0;
- save->cr4 = 0;
}
svm->current_vmcb->asid_generation = 0;
svm->asid = 0;
--
2.31.1.498.g6c1eba8ee3d-goog
Move code to stuff vmcb->save.dr6 to its architectural init value from
svm_vcpu_reset() into sev_es_sync_vmsa(). Except for protected guests,
a.k.a. SEV-ES guests, vmcb->save.dr6 is set during VM-Enter, i.e. the
extra write is unnecessary. For SEV-ES, stuffing save->dr6 handles a
theoretical case where the VMSA could be encrypted before the first
KVM_RUN.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/sev.c | 1 +
arch/x86/kvm/svm/svm.c | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a9d8d6aafdb8..b81ebeb4c426 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -573,6 +573,7 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
save->xcr0 = svm->vcpu.arch.xcr0;
save->pkru = svm->vcpu.arch.pkru;
save->xss = svm->vcpu.arch.ia32_xss;
+ save->dr6 = svm->vcpu.arch.dr6;
/*
* SEV-ES will use a VMSA that is pointed to by the VMCB, not
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 558329f53709..996a6b03e338 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1207,7 +1207,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
svm_set_cr4(vcpu, 0);
svm_set_efer(vcpu, 0);
- save->dr6 = 0xffff0ff0;
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
--
2.31.1.498.g6c1eba8ee3d-goog
Split setup_msrs() into vmx_setup_uret_msrs() and an open coded refresh
of the MSR bitmap, and skip the latter when refreshing the user return
MSRs during an EFER load. Only the x2APIC MSRs are dynamically exposed
and hidden, and those are not affected by a change in EFER.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 594975dc3f94..bdfb3def8526 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1767,11 +1767,12 @@ static void vmx_setup_uret_msr(struct vcpu_vmx *vmx, unsigned int msr)
}
/*
- * Set up the vmcs to automatically save and restore system
- * msrs. Don't touch the 64-bit msrs if the guest is in legacy
- * mode, as fiddling with msrs is very expensive.
+ * Configuring user return MSRs to automatically save, load, and restore MSRs
+ * that need to be shoved into hardware when running the guest. Note, omitting
+ * an MSR here does _NOT_ mean it's not emulated, only that it will not be
+ * loaded into hardware when running the guest.
*/
-static void setup_msrs(struct vcpu_vmx *vmx)
+static void vmx_setup_uret_msrs(struct vcpu_vmx *vmx)
{
vmx->guest_uret_msrs_loaded = false;
vmx->nr_active_uret_msrs = 0;
@@ -1793,9 +1794,6 @@ static void setup_msrs(struct vcpu_vmx *vmx)
vmx_setup_uret_msr(vmx, MSR_TSC_AUX);
vmx_setup_uret_msr(vmx, MSR_IA32_TSX_CTRL);
-
- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(&vmx->vcpu);
}
static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
@@ -2982,7 +2980,7 @@ int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
msr->data = efer & ~EFER_LME;
}
- setup_msrs(vmx);
+ vmx_setup_uret_msrs(vmx);
return 0;
}
@@ -4572,7 +4570,10 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (kvm_mpx_supported())
vmcs_write64(GUEST_BNDCFGS, 0);
- setup_msrs(vmx);
+ vmx_setup_uret_msrs(vmx);
+
+ if (cpu_has_vmx_msr_bitmap())
+ vmx_update_msr_bitmap(&vmx->vcpu);
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
--
2.31.1.498.g6c1eba8ee3d-goog
Move the CR0/CR3/CR4 shenanigans for EPT without unrestricted guest back
into vmx_set_cr0(). This will allow a future patch to eliminate the
rather gross stuffing of vcpu->arch.cr0 in the paging transition cases
by snapshotting the old CR0.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 40 +++++++++++++++++-----------------------
1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5795de909609..c9322cd55390 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3102,27 +3102,6 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
}
-static void ept_update_paging_mode_cr0(unsigned long cr0, struct kvm_vcpu *vcpu)
-{
- struct vcpu_vmx *vmx = to_vmx(vcpu);
-
- if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
- vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
- if (!(cr0 & X86_CR0_PG)) {
- /* From paging/starting to nonpaging */
- exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING);
- vcpu->arch.cr0 = cr0;
- vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
- } else if (!is_paging(vcpu)) {
- /* From nonpaging to paging */
- exec_controls_clearbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING);
- vcpu->arch.cr0 = cr0;
- vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
- }
-}
-
void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3152,8 +3131,23 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
}
#endif
- if (enable_ept && !is_unrestricted_guest(vcpu))
- ept_update_paging_mode_cr0(cr0, vcpu);
+ if (enable_ept && !is_unrestricted_guest(vcpu)) {
+ if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
+ vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
+ if (!(cr0 & X86_CR0_PG)) {
+ /* From paging/starting to nonpaging */
+ exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
+ CPU_BASED_CR3_STORE_EXITING);
+ vcpu->arch.cr0 = cr0;
+ vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
+ } else if (!is_paging(vcpu)) {
+ /* From nonpaging to paging */
+ exec_controls_clearbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
+ CPU_BASED_CR3_STORE_EXITING);
+ vcpu->arch.cr0 = cr0;
+ vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
+ }
+ }
vmcs_writel(CR0_READ_SHADOW, cr0);
vmcs_writel(GUEST_CR0, hw_cr0);
--
2.31.1.498.g6c1eba8ee3d-goog
Keep CR3 load/store exiting enable as needed when running L2 in order to
honor L1's desires. This fixes a largely theoretical bug where L1 could
intercept CR3 but not CR0.PG and end up not getting the desired CR3 exits
when L2 enables paging. In other words, the existing !is_paging() check
inadvertantly handles the normal case for L2 where vmx_set_cr0() is
called during VM-Enter, which is guaranteed to run with paging enabled,
and thus will never clear the bits.
Removing the !is_paging() check will also allow future consolidation and
cleanup of the related code. From a performance perspective, this is
all a nop, as the VMCS controls shadow will optimize away the VMWRITE
when the controls are in the desired state.
Add a comment explaining why CR3 is intercepted, with a big disclaimer
about not querying the old CR3. Because vmx_set_cr0() is used for flows
that are not directly tied to MOV CR3, e.g. vCPU RESET/INIT and nested
VM-Enter, it's possible that is_paging() is not synchronized with CR3
load/store exiting. This is actually guaranteed in the current code, as
KVM starts with CR3 interception disabled. Obviously that can be fixed,
but there's no good reason to play whack-a-mole, and it tends to end
poorly, e.g. descriptor table exiting for UMIP emulation attempted to be
precise in the past and ended up botching the interception toggling.
Fixes: fe3ef05c7572 ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++++---------
1 file changed, 37 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c9322cd55390..e42ae77e4b82 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3102,10 +3102,14 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
}
+#define CR3_EXITING_BITS (CPU_BASED_CR3_LOAD_EXITING | \
+ CPU_BASED_CR3_STORE_EXITING)
+
void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long hw_cr0;
+ u32 tmp;
hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);
if (is_unrestricted_guest(vcpu))
@@ -3132,18 +3136,42 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
#endif
if (enable_ept && !is_unrestricted_guest(vcpu)) {
+ /*
+ * Ensure KVM has an up-to-date snapshot of the guest's CR3. If
+ * the below code _enables_ CR3 exiting, vmx_cache_reg() will
+ * (correctly) stop reading vmcs.GUEST_CR3 because it thinks
+ * KVM's CR3 is installed.
+ */
if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
+
+ /*
+ * When running with EPT but not unrestricted guest, KVM must
+ * intercept CR3 accesses when paging is _disabled_. This is
+ * necessary because restricted guests can't actually run with
+ * paging disabled, and so KVM stuffs its own CR3 in order to
+ * run the guest when identity mapped page tables.
+ *
+ * Do _NOT_ check the old CR0.PG, e.g. to optimize away the
+ * update, it may be stale with respect to CR3 interception,
+ * e.g. after nested VM-Enter.
+ *
+ * Lastly, honor L1's desires, i.e. intercept CR3 loads and/or
+ * stores to forward them to L1, even if KVM does not need to
+ * intercept them to preserve its identity mapped page tables.
+ */
if (!(cr0 & X86_CR0_PG)) {
- /* From paging/starting to nonpaging */
- exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING);
- vcpu->arch.cr0 = cr0;
- vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
- } else if (!is_paging(vcpu)) {
- /* From nonpaging to paging */
- exec_controls_clearbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING);
+ exec_controls_setbit(vmx, CR3_EXITING_BITS);
+ } else if (!is_guest_mode(vcpu)) {
+ exec_controls_clearbit(vmx, CR3_EXITING_BITS);
+ } else {
+ tmp = exec_controls_get(vmx);
+ tmp &= ~CR3_EXITING_BITS;
+ tmp |= get_vmcs12(vcpu)->cpu_based_vm_exec_control & CR3_EXITING_BITS;
+ exec_controls_set(vmx, tmp);
+ }
+
+ if (!is_paging(vcpu) != !(cr0 & X86_CR0_PG)) {
vcpu->arch.cr0 = cr0;
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
}
--
2.31.1.498.g6c1eba8ee3d-goog
Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to
common x86. VMX and SVM now have near-identical sequences, the only
difference between that VMX updates the exception bitmap. Updating the
bitmap on SVM is unnecessary, but benign. Unfortunately it can't be left
behind in VMX due to the need to update exception intercepts after the
control registers are set.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 6 ------
arch/x86/kvm/vmx/vmx.c | 9 ---------
arch/x86/kvm/x86.c | 8 ++++++++
3 files changed, 8 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 996a6b03e338..23f880268ff5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1204,12 +1204,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
- svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
- svm_set_cr4(vcpu, 0);
- svm_set_efer(vcpu, 0);
- kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
- vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
-
if (npt_enabled) {
/* Setup VMCB for Nested Paging */
control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8c982e049cbb..d8afca144e11 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4557,9 +4557,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
}
- kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
- kvm_rip_write(vcpu, 0xfff0);
-
vmcs_writel(GUEST_GDTR_BASE, 0);
vmcs_write32(GUEST_GDTR_LIMIT, 0xffff);
@@ -4587,12 +4584,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
- vmx_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
- vmx_set_cr4(vcpu, 0);
- vmx_set_efer(vcpu, 0);
-
- vmx_update_exception_bitmap(vcpu);
-
vpid_sync_context(vmx->vpid);
if (init_event)
vmx_clear_hlt(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 167c650d1187..97d8e3e74bab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10499,6 +10499,14 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
+ kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
+ kvm_rip_write(vcpu, 0xfff0);
+
+ static_call(kvm_x86_set_cr0)(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
+ static_call(kvm_x86_set_cr4)(vcpu, 0);
+ static_call(kvm_x86_set_efer)(vcpu, 0);
+ static_call(kvm_x86_update_exception_bitmap)(vcpu);
+
if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
kvm_mmu_reset_context(vcpu);
--
2.31.1.498.g6c1eba8ee3d-goog
Tweak the logic for grabbing vmcs.GUEST_CR3 in vmx_cache_reg() to look
directly at the execution controls, as opposed to effectively inferring
the controls from vCPUs state. Inferring the controls isn't wrong, but it
creates a very subtle dependency between the caching logic, the state of
vcpu->arch.cr0 (via is_paging()), and the behavior of vmx_set_cr0().
Using the execution controls doesn't completely eliminate the dependency
in vmx_set_cr0(), e.g. neglecting to cache CR3 before enabling
interception would still break the guest, but it does reduce the
code dependency and mostly eliminate the logical dependency (that CR3
loads are intercepted in certain scenarios). Eliminating the sublte
read of vcpu->arch.cr0 will also allow for additional cleanup in
vmx_set_cr0().
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e42ae77e4b82..596c8f9766ac 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2370,8 +2370,11 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
break;
case VCPU_EXREG_CR3:
- if (is_unrestricted_guest(vcpu) ||
- (enable_ept && is_paging(vcpu)))
+ /*
+ * When intercepting CR3 loads, e.g. for shadowing paging, KVM's
+ * CR3 is loaded into hardware, not the guest's CR3.
+ */
+ if (!(exec_controls_get(to_vmx(vcpu)) & CPU_BASED_CR3_LOAD_EXITING))
vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
break;
case VCPU_EXREG_CR4:
--
2.31.1.498.g6c1eba8ee3d-goog
Move the long mode and EPT w/o unrestricted guest side effect processing
down in vmx_set_cr0() so that the EPT && !URG case doesn't have to stuff
vcpu->arch.cr0 early. This also fixes an oddity where CR0 might not be
marked available, i.e. the early vcpu->arch.cr0 write would appear to be
in danger of being overwritten, though that can't actually happen in the
current code since CR0.TS is the only guest-owned bit, and CR0.TS is not
read by vmx_set_cr4().
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 596c8f9766ac..5f30181fd240 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3111,9 +3111,11 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- unsigned long hw_cr0;
+ unsigned long hw_cr0, old_cr0_pg;
u32 tmp;
+ old_cr0_pg = kvm_read_cr0_bits(vcpu, X86_CR0_PG);
+
hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);
if (is_unrestricted_guest(vcpu))
hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST;
@@ -3129,11 +3131,16 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
enter_rmode(vcpu);
}
+ vmcs_writel(CR0_READ_SHADOW, cr0);
+ vmcs_writel(GUEST_CR0, hw_cr0);
+ vcpu->arch.cr0 = cr0;
+ kvm_register_mark_available(vcpu, VCPU_EXREG_CR0);
+
#ifdef CONFIG_X86_64
if (vcpu->arch.efer & EFER_LME) {
- if (!is_paging(vcpu) && (cr0 & X86_CR0_PG))
+ if (!old_cr0_pg && (cr0 & X86_CR0_PG))
enter_lmode(vcpu);
- if (is_paging(vcpu) && !(cr0 & X86_CR0_PG))
+ else if (old_cr0_pg && !(cr0 & X86_CR0_PG))
exit_lmode(vcpu);
}
#endif
@@ -3174,17 +3181,11 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
exec_controls_set(vmx, tmp);
}
- if (!is_paging(vcpu) != !(cr0 & X86_CR0_PG)) {
- vcpu->arch.cr0 = cr0;
+ /* Note, vmx_set_cr4() consumes the new vcpu->arch.cr0. */
+ if ((old_cr0_pg ^ cr0) & X86_CR0_PG)
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
- }
}
- vmcs_writel(CR0_READ_SHADOW, cr0);
- vmcs_writel(GUEST_CR0, hw_cr0);
- vcpu->arch.cr0 = cr0;
- kvm_register_mark_available(vcpu, VCPU_EXREG_CR0);
-
/* depends on vcpu->arch.cr0 to be set to a new value */
vmx->emulation_required = emulation_required(vcpu);
}
--
2.31.1.498.g6c1eba8ee3d-goog
Remove an unnecessary MSR bitmap refresh during vCPU RESET/INIT. In both
cases, the MSR bitmap already has the desired values and state.
At RESET, the vCPU is guaranteed to be running with x2APIC disabled, the
x2APIC MSRs are guaranteed to be intercepted due to the MSR bitmap being
initialized to all ones by alloc_loaded_vmcs(), and vmx->msr_bitmap_mode
is guaranteed to be zero, i.e. reflecting x2APIC disabled.
At INIT, the APIC_BASE MSR is not modified, thus there can't be any
change in x2APIC state.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d8afca144e11..acfb87f30979 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4569,9 +4569,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (kvm_mpx_supported())
vmcs_write64(GUEST_BNDCFGS, 0);
- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(&vmx->vcpu);
-
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
if (cpu_has_vmx_tpr_shadow() && !init_event) {
--
2.31.1.498.g6c1eba8ee3d-goog
Drop unnecessary MSR bitmap updates during nested transitions, as L1's
APIC_BASE MSR is not modified by the standard VM-Enter/VM-Exit flows,
and L2's MSR bitmap is managed separately. In the unlikely event that L1
is pathological and loads APIC_BASE via the VM-Exit load list, KVM will
handle updating the bitmap in its normal WRMSR flows.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 6 ------
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/vmx.h | 1 -
3 files changed, 1 insertion(+), 8 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f811bb7f2dc3..9dcdf158a405 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4283,9 +4283,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
kvm_set_dr(vcpu, 7, 0x400);
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(vcpu);
-
if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
vmcs12->vm_exit_msr_load_count))
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
@@ -4364,9 +4361,6 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
kvm_mmu_reset_context(vcpu);
- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(vcpu);
-
/*
* This nasty bit of open coding is a compromise between blindly
* loading L1's MSRs using the exit load lists (incorrect emulation
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index acfb87f30979..45a013631f63 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3984,7 +3984,7 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
}
}
-void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
+static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u8 mode = vmx_msr_bitmap_mode(vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 1283ad0e592d..e46df3253a21 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -380,7 +380,6 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
-void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
bool vmx_nmi_blocked(struct kvm_vcpu *vcpu);
bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu);
bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
--
2.31.1.498.g6c1eba8ee3d-goog
Don't bother initializing msr_bitmap_mode to 0, all of struct vcpu_vmx is
zero initialized.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index beaf9fefddad..cfd986aae7b7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6982,7 +6982,6 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
}
- vmx->msr_bitmap_mode = 0;
vmx->loaded_vmcs = &vmx->vmcs01;
--
2.31.1.498.g6c1eba8ee3d-goog
After a CPUID update, refresh the list of user return MSRs that are
loaded into hardware when running the vCPU. This is necessary to handle
the oddball case where userspace exposes X86_FEATURE_RDTSCP to the guest
after the vCPU is running.
Fixes: 0023ef39dc35 ("kvm: vmx: Set IA32_TSC_AUX for legacy mode guests")
Fixes: 4e47c7a6d714 ("KVM: VMX: Add instruction rdtscp support for guest")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bdfb3def8526..57cabef3ffd9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7313,6 +7313,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
/* xsaves_enabled is recomputed in vmx_compute_secondary_exec_control(). */
vcpu->arch.xsaves_enabled = false;
+ vmx_setup_uret_msrs(vmx);
+
if (cpu_has_secondary_exec_ctrls()) {
vmx_compute_secondary_exec_control(vmx);
vmcs_set_secondary_exec_control(vmx);
--
2.31.1.498.g6c1eba8ee3d-goog
Remove the @reset_roots param from kvm_init_mmu(), the one user,
kvm_mmu_reset_context() has already unloaded the MMU and thus freed and
invalidated all roots. This also happens to be why the reset_roots=true
paths doesn't leak roots; they're already invalid.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 13 ++-----------
arch/x86/kvm/svm/nested.c | 2 +-
arch/x86/kvm/vmx/nested.c | 2 +-
4 files changed, 5 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 88d0ed5225a4..63b49725fb24 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -65,7 +65,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
void
reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
-void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
+void kvm_init_mmu(struct kvm_vcpu *vcpu);
void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer,
gpa_t nested_cr3);
void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 930ac8a7e7c9..ff3e200b32dd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4793,17 +4793,8 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
update_last_nonleaf_level(vcpu, g_context);
}
-void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots)
+void kvm_init_mmu(struct kvm_vcpu *vcpu)
{
- if (reset_roots) {
- uint i;
-
- vcpu->arch.mmu->root_hpa = INVALID_PAGE;
-
- for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
- vcpu->arch.mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
- }
-
if (mmu_is_nested(vcpu))
init_kvm_nested_mmu(vcpu);
else if (tdp_enabled)
@@ -4829,7 +4820,7 @@ kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu)
void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
{
kvm_mmu_unload(vcpu);
- kvm_init_mmu(vcpu, true);
+ kvm_init_mmu(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 540d43ba2cf4..a0b48a8f32ed 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -406,7 +406,7 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
vcpu->arch.cr3 = cr3;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
- kvm_init_mmu(vcpu, false);
+ kvm_init_mmu(vcpu);
return 0;
}
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 9dcdf158a405..3a5b86932a5e 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1137,7 +1137,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
vcpu->arch.cr3 = cr3;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
- kvm_init_mmu(vcpu, false);
+ kvm_init_mmu(vcpu);
return 0;
}
--
2.31.1.498.g6c1eba8ee3d-goog
When emulating vCPU INIT, do not unconditionally refresh the list of user
return MSRs that need to be loaded into hardware when running the guest.
Unconditionally refreshing the list is confusing, as the vast majority of
MSRs are not modified on INIT. The real motivation is to handle the case
where an INIT during long mode obviates the need to load the SYSCALL MSRs,
and that is handled as needed by vmx_set_efer().
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 57cabef3ffd9..8c982e049cbb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4509,6 +4509,8 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->pt_desc.guest.output_mask = 0x7F;
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
+
+ vmx_setup_uret_msrs(vmx);
}
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -4570,8 +4572,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (kvm_mpx_supported())
vmcs_write64(GUEST_BNDCFGS, 0);
- vmx_setup_uret_msrs(vmx);
-
if (cpu_has_vmx_msr_bitmap())
vmx_update_msr_bitmap(&vmx->vcpu);
--
2.31.1.498.g6c1eba8ee3d-goog
Drop an explicit call to update the x2APIC MSRs when the userspace MSR
filter is modified. The x2APIC MSRs are deliberately exempt from
userspace filtering.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 45a013631f63..beaf9fefddad 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4055,7 +4055,6 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
}
pt_update_intercept_for_msr(vcpu);
- vmx_update_msr_bitmap_x2apic(vcpu, vmx_msr_bitmap_mode(vcpu));
}
static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
--
2.31.1.498.g6c1eba8ee3d-goog
Consolidate all of the dynamic MSR bitmap adjustments into
vmx_update_msr_bitmap_x2apic(), and rename the mode tracker to reflect
that it is x2APIC specific. If KVM gains more cases of dynamic MSR
pass-through, odds are very good that those new cases will be better off
with their own logic, e.g. see Intel PT MSRs and MSR_IA32_SPEC_CTRL.
Attempting to handle all updates in a common helper did more harm than
good, as KVM ended up collecting a large number of useless "updates".
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 55 ++++++++++++++++--------------------------
arch/x86/kvm/vmx/vmx.h | 2 +-
2 files changed, 22 insertions(+), 35 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cfd986aae7b7..dcef189b0108 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3931,21 +3931,6 @@ void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu,
vmx_disable_intercept_for_msr(vcpu, msr, type);
}
-static u8 vmx_msr_bitmap_mode(struct kvm_vcpu *vcpu)
-{
- u8 mode = 0;
-
- if (cpu_has_secondary_exec_ctrls() &&
- (secondary_exec_controls_get(to_vmx(vcpu)) &
- SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
- mode |= MSR_BITMAP_MODE_X2APIC;
- if (enable_apicv && kvm_vcpu_apicv_active(vcpu))
- mode |= MSR_BITMAP_MODE_X2APIC_APICV;
- }
-
- return mode;
-}
-
static void vmx_reset_x2apic_msrs(struct kvm_vcpu *vcpu, u8 mode)
{
unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
@@ -3963,11 +3948,29 @@ static void vmx_reset_x2apic_msrs(struct kvm_vcpu *vcpu, u8 mode)
}
}
-static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
+static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+ u8 mode;
+
if (!cpu_has_vmx_msr_bitmap())
return;
+ if (cpu_has_secondary_exec_ctrls() &&
+ (secondary_exec_controls_get(vmx) &
+ SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
+ mode = MSR_BITMAP_MODE_X2APIC;
+ if (enable_apicv && kvm_vcpu_apicv_active(vcpu))
+ mode |= MSR_BITMAP_MODE_X2APIC_APICV;
+ } else {
+ mode = 0;
+ }
+
+ if (!(mode ^ vmx->x2apic_msr_bitmap_mode))
+ return;
+
+ vmx->x2apic_msr_bitmap_mode = mode;
+
vmx_reset_x2apic_msrs(vcpu, mode);
/*
@@ -3984,21 +3987,6 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
}
}
-static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
-{
- struct vcpu_vmx *vmx = to_vmx(vcpu);
- u8 mode = vmx_msr_bitmap_mode(vcpu);
- u8 changed = mode ^ vmx->msr_bitmap_mode;
-
- if (!changed)
- return;
-
- if (changed & (MSR_BITMAP_MODE_X2APIC | MSR_BITMAP_MODE_X2APIC_APICV))
- vmx_update_msr_bitmap_x2apic(vcpu, mode);
-
- vmx->msr_bitmap_mode = mode;
-}
-
void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4258,8 +4246,7 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
}
- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(vcpu);
+ vmx_update_msr_bitmap_x2apic(vcpu);
}
u32 vmx_exec_control(struct vcpu_vmx *vmx)
@@ -6287,7 +6274,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
}
secondary_exec_controls_set(vmx, sec_exec_control);
- vmx_update_msr_bitmap(vcpu);
+ vmx_update_msr_bitmap_x2apic(vcpu);
}
static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e46df3253a21..5d2f047dafdf 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -229,7 +229,7 @@ struct nested_vmx {
struct vcpu_vmx {
struct kvm_vcpu vcpu;
u8 fail;
- u8 msr_bitmap_mode;
+ u8 x2apic_msr_bitmap_mode;
/*
* If true, host state has been stored in vmx->loaded_vmcs for
--
2.31.1.498.g6c1eba8ee3d-goog
Drop a call to vmx_clear_hlt() during vCPU INIT, the guest's activity
state is unconditionally set to "active" a few lines earlier in
vmx_vcpu_reset().
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dcef189b0108..78d17adce7e6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4568,8 +4568,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
vpid_sync_context(vmx->vpid);
- if (init_event)
- vmx_clear_hlt(vcpu);
}
static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
--
2.31.1.498.g6c1eba8ee3d-goog
Don't waste time writing zeros via VMWRITE during vCPU RESET, the VMCS
is zero allocated.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 29 -----------------------------
1 file changed, 29 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 78d17adce7e6..74258ba4832a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4427,13 +4427,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
}
if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
- vmcs_write64(EOI_EXIT_BITMAP0, 0);
- vmcs_write64(EOI_EXIT_BITMAP1, 0);
- vmcs_write64(EOI_EXIT_BITMAP2, 0);
- vmcs_write64(EOI_EXIT_BITMAP3, 0);
-
- vmcs_write16(GUEST_INTR_STATUS, 0);
-
vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
}
@@ -4444,23 +4437,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->ple_window_dirty = true;
}
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
- vmcs_write32(CR3_TARGET_COUNT, 0); /* 22.2.1 */
-
- vmcs_write16(HOST_FS_SELECTOR, 0); /* 22.2.4 */
- vmcs_write16(HOST_GS_SELECTOR, 0); /* 22.2.4 */
vmx_set_constant_host_state(vmx);
- vmcs_writel(HOST_FS_BASE, 0); /* 22.2.4 */
- vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
- if (cpu_has_vmx_vmfunc())
- vmcs_write64(VM_FUNCTION_CONTROL, 0);
-
- vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
- vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
- vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
@@ -4493,7 +4472,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
memset(&vmx->pt_desc, 0, sizeof(vmx->pt_desc));
/* Bit[6~0] are forced to 1, writes are ignored. */
vmx->pt_desc.guest.output_mask = 0x7F;
- vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
vmx_setup_uret_msrs(vmx);
@@ -4536,13 +4514,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmcs_write32(GUEST_LDTR_LIMIT, 0xffff);
vmcs_write32(GUEST_LDTR_AR_BYTES, 0x00082);
- if (!init_event) {
- vmcs_write32(GUEST_SYSENTER_CS, 0);
- vmcs_writel(GUEST_SYSENTER_ESP, 0);
- vmcs_writel(GUEST_SYSENTER_EIP, 0);
- vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
- }
-
vmcs_writel(GUEST_GDTR_BASE, 0);
vmcs_write32(GUEST_GDTR_LIMIT, 0xffff);
--
2.31.1.498.g6c1eba8ee3d-goog
> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> + unsigned long old_cr0 = kvm_read_cr0(vcpu);
> + unsigned long old_cr4 = kvm_read_cr4(vcpu);
> +
> kvm_lapic_reset(vcpu, init_event);
>
> vcpu->arch.hflags = 0;
> @@ -10483,6 +10485,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vcpu->arch.ia32_xss = 0;
>
> static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
> +
> + if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> + kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> + kvm_mmu_reset_context(vcpu);
> }
I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
for a change in EFER.NX as well.
Thanks,
Reiji
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1204,12 +1204,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
> init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
> init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
>
> - svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
> - svm_set_cr4(vcpu, 0);
> - svm_set_efer(vcpu, 0);
> - kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
> - vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
Reviewed-by: Reiji Watanabe <[email protected]>
Those your vCPU RESET/INIT changes look great.
I think the change in init_vmcb() basically assumes that the
function is called from kvm_vcpu_reset(via svm_vcpu_reset()).
Although shutdown_interception() directly calls init_mcb(),
I would think the change doesn't matter for the shutdown
interception case.
IMHO it would be a bit misleading that a function named 'init_vmcb',
which is called from other than kvm_vcpu_reset (svm_vcpu_reset()),
only partially resets the vmcb (probably just to me though).
So, I personally think it would be better if its name or comment
can give some more specific information about the assumption.
BTW, it looks like two lines of "vcpu->arch.hflags = 0;"
can be also removed from the init_vmcb() as well.
Thanks,
Reiji
On Mon, May 17, 2021, Reiji Watanabe wrote:
> > void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > {
> > + unsigned long old_cr0 = kvm_read_cr0(vcpu);
> > + unsigned long old_cr4 = kvm_read_cr4(vcpu);
> > +
> > kvm_lapic_reset(vcpu, init_event);
> >
> > vcpu->arch.hflags = 0;
> > @@ -10483,6 +10485,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > vcpu->arch.ia32_xss = 0;
> >
> > static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
> > +
> > + if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> > + kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> > + kvm_mmu_reset_context(vcpu);
> > }
>
> I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
> for a change in EFER.NX as well.
Oooh. So there _should_ be no need. Paging has to be enabled for EFER.NX to
be relevant, and INIT toggles CR0.PG 1=>0 if paging was enabled and so is
guaranteed to trigger a context reset. And we do want to skip the context reset,
e.g. INIT-SIPI-SIPI when the vCPU has paging disabled should continue using the
same MMU.
But, kvm_calc_mmu_role_common() neglects to ignore NX if CR0.PG=0, and so the
MMU role will be stale if INIT clears EFER.NX without forcing a context reset.
However, that's benign from a functionality perspective because the context
itself correctly incorporates CR0.PG, it's only the role that's borked. I.e.
KVM will fail to reuse a page/context due to the spurious role.nxe, but the
permission checks are always be correct.
I'll add a comment here and send a patch to fix the role calculation.
On Fri, Apr 23, 2021 at 5:48 PM Sean Christopherson <[email protected]> wrote:
>
> Drop an extra init_vmcb() from svm_create_vcpu(), svm_vcpu_reset() is
> guaranteed to call init_vmcb() and there are no consumers of the VMCB
> data between ->vcpu_create() and ->vcpu_reset(). Keep the call to
> svm_switch_vmcb() as sev_es_create_vcpu() touches the current VMCB, but
> hoist it up a few lines to associate the switch with the allocation of
> vmcb01.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:48 PM Sean Christopherson <[email protected]> wrote:
>
> Initialize constant VMCS state in vcpu_vcpu_reset() instead of in
> vmx_vcpu_create(), which allows for the removal of the open coded "vCPU
> load" sequence since ->vcpu_reset() is invoked while the vCPU is properly
> loaded (which is the entire point of vCPU reset...).
>
> Deferring initialization is effectively a nop as it's impossible to
> safely access the VMCS between the current call site and its new home, as
> both the vCPU and the pCPU are put immediately after init_vmcs(), i.e.
> the VMCS isn't guaranteed to be loaded.
>
> Note, task preemption is not a problem as vmx_sched_in() _can't_ touch
> the VMCS as ->sched_in() is invoked before the vCPU, and thus VMCS, is
> reloaded. I.e. the preemption path also can't consume VMCS state.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:51 PM Sean Christopherson <[email protected]> wrote:
>
> Remove unnecessary initialization of vmcb->save.rip during vCPU RESET/INIT,
> as svm_vcpu_run() unconditionally propagates VCPU_REGS_RIP to save.rip.
>
> No true functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:47 PM Sean Christopherson <[email protected]> wrote:
>
> Set L1's LDTR on VM-Exit per the Intel SDM:
>
> The host-state area does not contain a selector field for LDTR. LDTR is
> established as follows on all VM exits: the selector is cleared to
> 0000H, the segment is marked unusable and is otherwise undefined
> (although the base address is always canonical).
>
> This is likely a benign bug since the LDTR is unusable, as it means the
> L1 VMM is conditioned to reload its LDTR in order to function properly on
> bare metal.
>
> Fixes: 4704d0befb07 ("KVM: nVMX: Exiting from L2 to L1")
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:47 PM Sean Christopherson <[email protected]> wrote:
>
> Split out the post-CR0/CR4 MMU role change detectors to separate helpers,
> they will be used during vCPU RESET/INIT to conditionally reset the MMU
> in a future patch.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:47 PM Sean Christopherson <[email protected]> wrote:
>
> Do not allow an inexact CPUID "match" when querying the guest's CPUID.0x1
> to stuff EDX during INIT. In the common case, where the guest CPU model
> is an AMD variant, allowing an inexact match is a nop since KVM doesn't
> emulate Intel's goofy "out-of-range" logic for AMD and Hygon. If the
> vCPU model happens to be an Intel variant, an inexact match is possible
> if and only if the max CPUID leaf is precisely '0'. Aside from the fact
> that there's probably no CPU in existence with a single CPUID leaf, if
> the max CPUID leaf is '0', that means that CPUID.0.EAX is '0', and thus
> an inexact match for CPUID.0x1.EAX will also yield '0'.
>
> So, with lots of twisty logic, no functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:48 PM Sean Christopherson <[email protected]> wrote:
>
> Drop an explicit MMU reset in SVM's vCPU RESET/INIT flow now that the
> common x86 path correctly handles conditional MMU resets, e.g. if INIT
> arrives while the vCPU is in 64-bit mode.
>
> This reverts commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset").
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:51 PM Sean Christopherson <[email protected]> wrote:
>
> Move the EDX initialization at vCPU RESET, which is now identical between
> VMX and SVM, into common code.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
All of those refactorings look great to me.
Thanks,
Reiji
On Fri, Apr 23, 2021 at 5:54 PM Sean Christopherson <[email protected]> wrote:
>
> Drop direct writes to vmcb->save.cr4 during vCPU RESET/INIT, as the
> values being written are fully redundant with respect to
> svm_set_cr4(vcpu, 0) a few lines earlier. Note, svm_set_cr4() also
> correctly forces X86_CR4_PAE when NPT is disabled.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
> @@ -4504,7 +4505,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>
> vmx->msr_ia32_umwait_control = 0;
>
> - vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
> + eax = 1;
> + if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
> + eax = get_rdx_init_val();
> + kvm_rdx_write(vcpu, eax);
Reviewed-by: Reiji Watanabe <[email protected]>
For RESET, I assume that rdx should be set by userspace
when userspace changes CPUID.0x1.EAX.
BTW, I would think having a default CPUID for CPUID.(EAX=0x1)
would be better for consistency of a vCPU state for RESET.
I would think it doesn't matter practically anyway though.
Thanks,
Reiji
On Fri, Apr 23, 2021 at 5:51 PM Sean Christopherson <[email protected]> wrote:
>
> Remove a bogus write to vcpu->arch.cr0 that immediately precedes
> vmx_set_cr0() during vCPU RESET/INIT. For RESET, this is a nop since
> the "old" CR0 value is meaningless. But for INIT, if the vCPU is coming
> from paging enabled mode, crushing vcpu->arch.cr0 will cause the various
> is_paging() checks in vmx_set_cr0() to get false negatives.
>
> For the exit_lmode() case, the false negative is benign as vmx_set_efer()
> is called immediately after vmx_set_cr0().
>
> For EPT without unrestricted guest, the false negative will cause KVM to
> unnecessarily run with CR3 load/store exiting. But again, this is
> benign, albeit sub-optimal.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:55 PM Sean Christopherson <[email protected]> wrote:
>
> After a CPUID update, refresh the list of user return MSRs that are
> loaded into hardware when running the vCPU. This is necessary to handle
> the oddball case where userspace exposes X86_FEATURE_RDTSCP to the guest
> after the vCPU is running.
>
> Fixes: 0023ef39dc35 ("kvm: vmx: Set IA32_TSC_AUX for legacy mode guests")
> Fixes: 4e47c7a6d714 ("KVM: VMX: Add instruction rdtscp support for guest")
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:47 PM Sean Christopherson <[email protected]> wrote:
>
> At vCPU RESET/INIT (mostly RESET), stuff EDX with KVM's hardcoded,
> default Family-Model-Stepping ID of 0x600 if CPUID.0x1 isn't defined.
> At RESET, the CPUID lookup is guaranteed to "miss" because KVM emulates
> RESET before exposing the vCPU to userspace, i.e. userspace can't
> possibly have done set the vCPU's CPUID model, and thus KVM will always
> write '0'. At INIT, using 0x600 is less bad than using '0'.
>
> While initializing EDX to '0' is _extremely_ unlikely to be noticed by
> the guest, let alone break the guest, and can be overridden by
> userspace for the RESET case, using 0x600 is preferable as it will allow
> consolidating the relevant VMX and SVM RESET/INIT logic in the future.
> And, digging through old specs suggests that neither Intel nor AMD have
> ever shipped a CPU that initialized EDX to '0' at RESET.
>
> Regarding 0x600 as KVM's default Family, it is a sane default and in
> many ways the most appropriate. Prior to the 386 implementations, DX
> was undefined at RESET. With the 386, 486, 586/P5, and 686/P6/Athlon,
> both Intel and AMD set EDX to 3, 4, 5, and 6 respectively. AMD switched
> to using '15' as its primary Family with the introduction of AMD64, but
> Intel has continued using '6' for the last few decades.
>
> So, '6' is a valid Family for both Intel and AMD CPUs, is compatible
> with both 32-bit and 64-bit CPUs (albeit not a perfect fit for 64-bit
> AMD), and of the common Families (3 - 6), is the best fit with respect to
> KVM's virtual CPU model. E.g. prior to the P6, Intel CPUs did not have a
> STI window. Modern operating systems, Linux included, rely on the STI
> window, e.g. for "safe halt", and KVM unconditionally assumes the virtual
> CPU has an STI window. Thus enumerating a Family ID of 3, 4, or 5 would
> be provably wrong.
>
> Opportunistically remove a stale comment.
>
> Fixes: 66f7b72e1171 ("KVM: x86: Make register state after reset conform to specification")
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Tue, May 18, 2021, Reiji Watanabe wrote:
> > > > + if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> > > > + kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> > > > + kvm_mmu_reset_context(vcpu);
> > > > }
> > >
> > > I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
> > > for a change in EFER.NX as well.
> >
> > Oooh. So there _should_ be no need. Paging has to be enabled for EFER.NX to
> > be relevant, and INIT toggles CR0.PG 1=>0 if paging was enabled and so is
> > guaranteed to trigger a context reset. And we do want to skip the context reset,
> > e.g. INIT-SIPI-SIPI when the vCPU has paging disabled should continue using the
> > same MMU.
> >
> > But, kvm_calc_mmu_role_common() neglects to ignore NX if CR0.PG=0, and so the
> > MMU role will be stale if INIT clears EFER.NX without forcing a context reset.
> > However, that's benign from a functionality perspective because the context
> > itself correctly incorporates CR0.PG, it's only the role that's borked. I.e.
> > KVM will fail to reuse a page/context due to the spurious role.nxe, but the
> > permission checks are always be correct.
> >
> > I'll add a comment here and send a patch to fix the role calculation.
>
> Thank you so much for the explanation !
> I understand your intention and why it would be benign.
>
> Then, I'm wondering if kvm_cr4_mmu_role_changed() needs to be
> called here. Looking at the Intel SDM, in my understanding,
> all the bits kvm_cr4_mmu_role_changed() checks are relevant
> only if paging is enabled. (Or is my understanding incorrect ??)
Duh, yes. And it goes even beyond that, CR0.WP is only relevant if CR0.PG=1,
i.e. INIT with CR0.PG=0 and CR0.WP=1 will incorrectly trigger a MMU reset with
the current logic.
Sadly, simply omitting the CR4 check puts us in an awkward situation where, due
to the MMU role CR4 calculations not accounting for CR0.PG=0, KVM will run with
a stale role.
The other consideration is that kvm_post_set_cr4() and kvm_post_set_cr0() should
also skip kvm_mmu_reset_context() if CR0.PG=0, but again that requires fixing
the role calculations first (or at the same time).
I think I'll throw in those cleanups to the beginning of this series. The result
is going to be disgustingly long, but I really don't want to introduce code that
knowingly leaves KVM in an inconsistent state, nor do I want to add useless
checks on CR4 and EFER.
> @@ -1204,18 +1204,13 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
> init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
> init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
>
> + svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
> svm_set_cr4(vcpu, 0);
> svm_set_efer(vcpu, 0);
> save->dr6 = 0xffff0ff0;
> kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
> vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
>
> - /*
> - * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
> - * It also updates the guest-visible cr0 value.
> - */
> - svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
AMD's APM Vol2 (Table 14-1 in Revision 3.37) says CR0 After INIT will be:
CD and NW are unchanged
Bit 4 (reserved) = 1
All others = 0
(CR0 will be 0x60000010 after RESET)
So, it looks the CR0 value that init_vmcb() sets could be
different from what is indicated in the APM for INIT.
BTW, Intel's SDM (April 2021 version) says CR0 for Power up/Reset/INIT
will be 0x60000010 with the following note.
-------------------------------------------------
The CD and NW flags are unchanged,
bit 4 is set to 1, all other bits are cleared.
-------------------------------------------------
The note is attached as '2' to all Power up/Reset/INIT cases
looking at the SDM. I would guess it is erroneous that
the note is attached to Power up/Reset though.
Thanks,
Reiji
On Tue, May 18, 2021, Reiji Watanabe wrote:
> > @@ -4504,7 +4505,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >
> > vmx->msr_ia32_umwait_control = 0;
> >
> > - vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
> > + eax = 1;
> > + if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
> > + eax = get_rdx_init_val();
> > + kvm_rdx_write(vcpu, eax);
>
> Reviewed-by: Reiji Watanabe <[email protected]>
>
> For RESET, I assume that rdx should be set by userspace
> when userspace changes CPUID.0x1.EAX.
Ya, although the ideal solution is to add a proper RESET ioctl() so userspace can
configure the vCPU model and then pull RESET#.
> BTW, I would think having a default CPUID for CPUID.(EAX=0x1) would be better
> for consistency of a vCPU state for RESET. I would think it doesn't matter
> practically anyway though.
Probably, but that would require defining default values for all of CPUID.0x0 and
CPUID.0x1, which is a can of worms I'd rather not open. E.g. vendor info, basic
feature set, APIC ID, etc... would all need default values. On the other hand,
the EDX value stuffing predates CPUID, so using 0x600 isn't provably wrong, just
a bit anachronistic. :-)
On Wed, May 19, 2021, Reiji Watanabe wrote:
> > @@ -1204,18 +1204,13 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
> > init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
> > init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
> >
> > + svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
> > svm_set_cr4(vcpu, 0);
> > svm_set_efer(vcpu, 0);
> > save->dr6 = 0xffff0ff0;
> > kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
> > vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
> >
> > - /*
> > - * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
> > - * It also updates the guest-visible cr0 value.
> > - */
> > - svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
>
> AMD's APM Vol2 (Table 14-1 in Revision 3.37) says CR0 After INIT will be:
>
> CD and NW are unchanged
> Bit 4 (reserved) = 1
> All others = 0
>
> (CR0 will be 0x60000010 after RESET)
>
> So, it looks the CR0 value that init_vmcb() sets could be
> different from what is indicated in the APM for INIT.
>
> BTW, Intel's SDM (April 2021 version) says CR0 for Power up/Reset/INIT
> will be 0x60000010 with the following note.
> -------------------------------------------------
> The CD and NW flags are unchanged,
> bit 4 is set to 1, all other bits are cleared.
> -------------------------------------------------
> The note is attached as '2' to all Power up/Reset/INIT cases
> looking at the SDM. I would guess it is erroneous that
> the note is attached to Power up/Reset though.
Agreed. I'll double check that CD and NW are preserved by hardware on INIT,
and will also ping Intel folks to fix the POWER-UP and RESET footnote.
Hah! Reading through that section yet again, there's another SDM bug. It
contradicts itself with respect to the TLBs after INIT.
9.1 INITIALIZATION OVERVIEW:
The major difference is that during an INIT, the internal caches, MSRs,
MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
are invalidated as with a hardware reset)
while Table 9-1 says:
Register Power up Reset INIT
Data and Code Cache, TLBs: Invalid[6] Invalid[6] Unchanged
I'm pretty sure that Intel CPUs are supposed to flush the TLB, i.e. Tabel 9-1 is
wrong. Back in my Intel validation days, I remember being involved in a Core2
bug that manifested as a triple fault after INIT due to global TLB entries not
being flushed. Looks like that wasn't fixed:
https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
AZ28. INIT Does Not Clear Global Entries in the TLB
Problem: INIT may not flush a TLB entry when:
• The processor is in protected mode with paging enabled and the page global enable
flag is set (PGE bit of CR4 register)
• G bit for the page table entry is set
• TLB entry is present in TLB when INIT occurs
• Software may encounter unexpected page fault or incorrect address translation due
to a TLB entry erroneously left in TLB after INIT.
Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
bits PG or PE) registers before writing to memory early in BIOS
code to clear all the global entries from TLB.
Status: For the steppings affected, see the Summary Tables of Changes.
AMD's APM also appears to contradict itself, though that depends on one's
interpretation of "external intialization". Like the SDM, its table states that
the TLBs are not flushed on INIT:
Table 14-1. Initial Processor State
Processor Resource Value after RESET Value after INIT
Instruction and Data TLBs Invalidated Unchanged
but a blurb later on says:
5.5.3 TLB Management
Implicit Invalidations. The following operations cause the entire TLB to be
invalidated, including global pages:
• External initialization of the processor.
All in all, that means KVM also has a bug in the form of a missing guest TLB
flush on INIT, at least for VMX and probably for SVM. I'll add a patch to flush
the guest TLBs on INIT irrespective of vendor. Even if AMD CPUs don't flush the
TLB, I see no reason to bank on all guests being paranoid enough to flush the
TLB immediately after INIT.
On Mon, May 17, 2021, Reiji Watanabe wrote:
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -1204,12 +1204,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
> > init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
> > init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
> >
> > - svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
> > - svm_set_cr4(vcpu, 0);
> > - svm_set_efer(vcpu, 0);
> > - kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
> > - vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
>
> Reviewed-by: Reiji Watanabe <[email protected]>
>
> Those your vCPU RESET/INIT changes look great.
>
> I think the change in init_vmcb() basically assumes that the
> function is called from kvm_vcpu_reset(via svm_vcpu_reset()).
> Although shutdown_interception() directly calls init_mcb(),
> I would think the change doesn't matter for the shutdown
> interception case.
>
> IMHO it would be a bit misleading that a function named 'init_vmcb',
> which is called from other than kvm_vcpu_reset (svm_vcpu_reset()),
> only partially resets the vmcb (probably just to me though).
It's not just you, that code is funky. If I could go back in time, I would lobby
to not automatically init the VMCB and instead put the vCPU into
KVM_MP_STATE_UNINITIALIZED and force userspace to explicitly INIT or RESET the
vCPU. Even better would be to add KVM_MP_STATE_SHUTDOWN, since technically NMI
can break shutdown (and SMI on Intel CPUs).
Anyways, that ship has sailed, but we might be able to get away with replacing
init_vmcb() with kvm_vcpu_reset(vcpu, true), i.e. effecting a full INIT. That
would ensure the VMCB is consistent with KVM's software model, which I'm guessing
is not true with the direct init_vmcb() call. It would also have some connection
to reality since there exist bare metal platforms that automatically INIT the CPU
if it hits shutdown (maybe only for the BSP?).
Side topic, the NMI thing got me looking through init_vmcb() to see how it
handles the IDT and GDT, and surprise, surprise, it fails to zero IDTR.base and
GDTR.base. I'll add a patch to fix that, and maybe try to consolidate the VMX
and SVM segmentation logic.
> So, I personally think it would be better if its name or comment
> can give some more specific information about the assumption.
>
> BTW, it looks like two lines of "vcpu->arch.hflags = 0;"
> can be also removed from the init_vmcb() as well.
Ya, I'll add that to the pile.
Thanks!
> > > + if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> > > + kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> > > + kvm_mmu_reset_context(vcpu);
> > > }
> >
> > I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
> > for a change in EFER.NX as well.
>
> Oooh. So there _should_ be no need. Paging has to be enabled for EFER.NX to
> be relevant, and INIT toggles CR0.PG 1=>0 if paging was enabled and so is
> guaranteed to trigger a context reset. And we do want to skip the context reset,
> e.g. INIT-SIPI-SIPI when the vCPU has paging disabled should continue using the
> same MMU.
>
> But, kvm_calc_mmu_role_common() neglects to ignore NX if CR0.PG=0, and so the
> MMU role will be stale if INIT clears EFER.NX without forcing a context reset.
> However, that's benign from a functionality perspective because the context
> itself correctly incorporates CR0.PG, it's only the role that's borked. I.e.
> KVM will fail to reuse a page/context due to the spurious role.nxe, but the
> permission checks are always be correct.
>
> I'll add a comment here and send a patch to fix the role calculation.
Thank you so much for the explanation !
I understand your intention and why it would be benign.
Then, I'm wondering if kvm_cr4_mmu_role_changed() needs to be
called here. Looking at the Intel SDM, in my understanding,
all the bits kvm_cr4_mmu_role_changed() checks are relevant
only if paging is enabled. (Or is my understanding incorrect ??)
Thanks,
Reiji
On Tue, May 18, 2021 at 2:45 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, May 17, 2021, Reiji Watanabe wrote:
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -1204,12 +1204,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
> > > init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
> > > init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
> > >
> > > - svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
> > > - svm_set_cr4(vcpu, 0);
> > > - svm_set_efer(vcpu, 0);
> > > - kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
> > > - vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
> >
> > Reviewed-by: Reiji Watanabe <[email protected]>
> >
> > Those your vCPU RESET/INIT changes look great.
> >
> > I think the change in init_vmcb() basically assumes that the
> > function is called from kvm_vcpu_reset(via svm_vcpu_reset()).
> > Although shutdown_interception() directly calls init_mcb(),
> > I would think the change doesn't matter for the shutdown
> > interception case.
> >
> > IMHO it would be a bit misleading that a function named 'init_vmcb',
> > which is called from other than kvm_vcpu_reset (svm_vcpu_reset()),
> > only partially resets the vmcb (probably just to me though).
>
> It's not just you, that code is funky. If I could go back in time, I would lobby
> to not automatically init the VMCB and instead put the vCPU into
> KVM_MP_STATE_UNINITIALIZED and force userspace to explicitly INIT or RESET the
> vCPU. Even better would be to add KVM_MP_STATE_SHUTDOWN, since technically NMI
> can break shutdown (and SMI on Intel CPUs).
I see. Adding KVM_MP_STATE_SHUTDOWN sounds right to me
given the real CPU's behavior.
> Anyways, that ship has sailed, but we might be able to get away with replacing
> init_vmcb() with kvm_vcpu_reset(vcpu, true), i.e. effecting a full INIT. That
> would ensure the VMCB is consistent with KVM's software model, which I'm guessing
> is not true with the direct init_vmcb() call. It would also have some connection
> to reality since there exist bare metal platforms that automatically INIT the CPU
> if it hits shutdown (maybe only for the BSP?).
Yes, calling kvm_vcpu_reset(vcpu, true) sounds better than
the direct init_vmcb() call.
> Side topic, the NMI thing got me looking through init_vmcb() to see how it
> handles the IDT and GDT, and surprise, surprise, it fails to zero IDTR.base and
> GDTR.base. I'll add a patch to fix that, and maybe try to consolidate the VMX
> and SVM segmentation logic.
That's surprising...
It seems init_vmcb() was used only for RESET when the function was
originally introduced and the entire vmcb was zero-cleared before
init_vmcb() was called. So, I would suspect init_vmcb() was originally
implemented assuming that all the vmcb fields were zero.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6aa8b732ca01c3d7a54e93f4d701b8aabbe60fb7
Thanks,
Reiji
On Wed, May 19, 2021 at 11:47 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, May 18, 2021, Reiji Watanabe wrote:
> > > @@ -4504,7 +4505,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > >
> > > vmx->msr_ia32_umwait_control = 0;
> > >
> > > - vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
> > > + eax = 1;
> > > + if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
> > > + eax = get_rdx_init_val();
> > > + kvm_rdx_write(vcpu, eax);
> >
> > Reviewed-by: Reiji Watanabe <[email protected]>
> >
> > For RESET, I assume that rdx should be set by userspace
> > when userspace changes CPUID.0x1.EAX.
>
> Ya, although the ideal solution is to add a proper RESET ioctl() so userspace can
> configure the vCPU model and then pull RESET#.
Thank you so much for the answer !
It sounds like a good idea in terms of userspace handling as well
as KVM implementation (I assume it would be something similar to
KVM_ARM_VCPU_INIT).
> > BTW, I would think having a default CPUID for CPUID.(EAX=0x1) would be better
> > for consistency of a vCPU state for RESET. I would think it doesn't matter
> > practically anyway though.
>
> Probably, but that would require defining default values for all of CPUID.0x0 and
> CPUID.0x1, which is a can of worms I'd rather not open. E.g. vendor info, basic
> feature set, APIC ID, etc... would all need default values. On the other hand,
> the EDX value stuffing predates CPUID, so using 0x600 isn't provably wrong, just
> a bit anachronistic. :-)
I see... Then I don't think it's worth doing...
Just out of curiosity, can't we simply use a vcpu_id for the APIC ID ?
Also, can't we simply use the same values that KVM_GET_SUPPORTED_CPUID
provides for other CPUID fields ?
Thanks,
Reiji
On Fri, May 21, 2021, Reiji Watanabe wrote:
> On Wed, May 19, 2021 at 11:47 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Tue, May 18, 2021, Reiji Watanabe wrote:
> > > BTW, I would think having a default CPUID for CPUID.(EAX=0x1) would be better
> > > for consistency of a vCPU state for RESET. I would think it doesn't matter
> > > practically anyway though.
> >
> > Probably, but that would require defining default values for all of CPUID.0x0 and
> > CPUID.0x1, which is a can of worms I'd rather not open. E.g. vendor info, basic
> > feature set, APIC ID, etc... would all need default values. On the other hand,
> > the EDX value stuffing predates CPUID, so using 0x600 isn't provably wrong, just
> > a bit anachronistic. :-)
>
> I see... Then I don't think it's worth doing...
> Just out of curiosity, can't we simply use a vcpu_id for the APIC ID ?
That would mostly work, but theoretically we could overflow the 8 bit field
because max vCPUs is 288. Thanks Larrabee.
commit 682f732ecf7396e9d6fe24d44738966699fae6c0
Author: Radim Krčmář <[email protected]>
Date: Tue Jul 12 22:09:29 2016 +0200
KVM: x86: bump MAX_VCPUS to 288
288 is in high demand because of Knights Landing CPU.
We cannot set the limit to 640k, because that would be wasting space.
> Also, can't we simply use the same values that KVM_GET_SUPPORTED_CPUID
> provides for other CPUID fields ?
Yes, that would mostly work. It's certainly possible to have a moderately sane
default, but there's essentially zero benefit in doing so since practically
speaking all userspace VMMs will override CPUID anyways. KVM could completely
default to the host CPUID, but again, it wouldn't provide any meaningful benefit,
while doing so would step on userspace's toes since KVM's approach is that KVM is
"just" an accelerator, while userspace defines the CPU model, devices, etc...
And it would also mean KVM has to start worrying about silly corner cases like
the max vCPUs thing. That's why I say it's a can of worms :-)
> > AMD's APM Vol2 (Table 14-1 in Revision 3.37) says CR0 After INIT will be:
> >
> > CD and NW are unchanged
> > Bit 4 (reserved) = 1
> > All others = 0
> >
> > (CR0 will be 0x60000010 after RESET)
> >
> > So, it looks the CR0 value that init_vmcb() sets could be
> > different from what is indicated in the APM for INIT.
> >
> > BTW, Intel's SDM (April 2021 version) says CR0 for Power up/Reset/INIT
> > will be 0x60000010 with the following note.
> > -------------------------------------------------
> > The CD and NW flags are unchanged,
> > bit 4 is set to 1, all other bits are cleared.
> > -------------------------------------------------
> > The note is attached as '2' to all Power up/Reset/INIT cases
> > looking at the SDM. I would guess it is erroneous that
> > the note is attached to Power up/Reset though.
>
> Agreed. I'll double check that CD and NW are preserved by hardware on INIT,
> and will also ping Intel folks to fix the POWER-UP and RESET footnote.
>
> Hah! Reading through that section yet again, there's another SDM bug. It
> contradicts itself with respect to the TLBs after INIT.
>
> 9.1 INITIALIZATION OVERVIEW:
> The major difference is that during an INIT, the internal caches, MSRs,
> MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
> are invalidated as with a hardware reset)
>
> while Table 9-1 says:
>
> Register Power up Reset INIT
> Data and Code Cache, TLBs: Invalid[6] Invalid[6] Unchanged
>
> I'm pretty sure that Intel CPUs are supposed to flush the TLB, i.e. Tabel 9-1 is
> wrong. Back in my Intel validation days, I remember being involved in a Core2
> bug that manifested as a triple fault after INIT due to global TLB entries not
> being flushed. Looks like that wasn't fixed:
>
> https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
>
> AZ28. INIT Does Not Clear Global Entries in the TLB
> Problem: INIT may not flush a TLB entry when:
> • The processor is in protected mode with paging enabled and the page global enable
> flag is set (PGE bit of CR4 register)
> • G bit for the page table entry is set
> • TLB entry is present in TLB when INIT occurs
> • Software may encounter unexpected page fault or incorrect address translation due
> to a TLB entry erroneously left in TLB after INIT.
>
> Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
> bits PG or PE) registers before writing to memory early in BIOS
> code to clear all the global entries from TLB.
>
> Status: For the steppings affected, see the Summary Tables of Changes.
>
> AMD's APM also appears to contradict itself, though that depends on one's
> interpretation of "external intialization". Like the SDM, its table states that
> the TLBs are not flushed on INIT:
>
> Table 14-1. Initial Processor State
>
> Processor Resource Value after RESET Value after INIT
> Instruction and Data TLBs Invalidated Unchanged
>
> but a blurb later on says:
>
> 5.5.3 TLB Management
>
> Implicit Invalidations. The following operations cause the entire TLB to be
> invalidated, including global pages:
>
> • External initialization of the processor.
"Table 8-9. Simultaneous Interrupt Priorities" of AMD's APM has
the words "External Processor Initialization (INIT)", which make
me guess "the External initialization of the processor" in 5.5.3
TLB Management means INIT.
> All in all, that means KVM also has a bug in the form of a missing guest TLB
> flush on INIT, at least for VMX and probably for SVM. I'll add a patch to flush
> the guest TLBs on INIT irrespective of vendor. Even if AMD CPUs don't flush the
> TLB, I see no reason to bank on all guests being paranoid enough to flush the
> TLB immediately after INIT.
Yes, I agree that would be better.
Thank you so much for all the helpful information !
Regards,
Reiji
> > > On Tue, May 18, 2021, Reiji Watanabe wrote:
> > > > BTW, I would think having a default CPUID for CPUID.(EAX=0x1) would be better
> > > > for consistency of a vCPU state for RESET. I would think it doesn't matter
> > > > practically anyway though.
> > >
> > > Probably, but that would require defining default values for all of CPUID.0x0 and
> > > CPUID.0x1, which is a can of worms I'd rather not open. E.g. vendor info, basic
> > > feature set, APIC ID, etc... would all need default values. On the other hand,
> > > the EDX value stuffing predates CPUID, so using 0x600 isn't provably wrong, just
> > > a bit anachronistic. :-)
> >
> > I see... Then I don't think it's worth doing...
> > Just out of curiosity, can't we simply use a vcpu_id for the APIC ID ?
>
> That would mostly work, but theoretically we could overflow the 8 bit field
> because max vCPUs is 288. Thanks Larrabee.
>
> commit 682f732ecf7396e9d6fe24d44738966699fae6c0
> Author: Radim Krčmář <[email protected]>
> Date: Tue Jul 12 22:09:29 2016 +0200
>
> KVM: x86: bump MAX_VCPUS to 288
>
> 288 is in high demand because of Knights Landing CPU.
> We cannot set the limit to 640k, because that would be wasting space.
>
> > Also, can't we simply use the same values that KVM_GET_SUPPORTED_CPUID
> > provides for other CPUID fields ?
>
> Yes, that would mostly work. It's certainly possible to have a moderately sane
> default, but there's essentially zero benefit in doing so since practically
> speaking all userspace VMMs will override CPUID anyways. KVM could completely
> default to the host CPUID, but again, it wouldn't provide any meaningful benefit,
> while doing so would step on userspace's toes since KVM's approach is that KVM is
> "just" an accelerator, while userspace defines the CPU model, devices, etc...
> And it would also mean KVM has to start worrying about silly corner cases like
> the max vCPUs thing. That's why I say it's a can of worms :-)
Ah, I see. Thank you for the answer and the helpful information !
Regards,
Reiji
On Wed, May 19, 2021 at 10:16 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, May 18, 2021, Reiji Watanabe wrote:
> > > > > + if (kvm_cr0_mmu_role_changed(old_cr0, kvm_read_cr0(vcpu)) ||
> > > > > + kvm_cr4_mmu_role_changed(old_cr4, kvm_read_cr4(vcpu)))
> > > > > + kvm_mmu_reset_context(vcpu);
> > > > > }
> > > >
> > > > I'm wondering if kvm_vcpu_reset() should call kvm_mmu_reset_context()
> > > > for a change in EFER.NX as well.
> > >
> > > Oooh. So there _should_ be no need. Paging has to be enabled for EFER.NX to
> > > be relevant, and INIT toggles CR0.PG 1=>0 if paging was enabled and so is
> > > guaranteed to trigger a context reset. And we do want to skip the context reset,
> > > e.g. INIT-SIPI-SIPI when the vCPU has paging disabled should continue using the
> > > same MMU.
> > >
> > > But, kvm_calc_mmu_role_common() neglects to ignore NX if CR0.PG=0, and so the
> > > MMU role will be stale if INIT clears EFER.NX without forcing a context reset.
> > > However, that's benign from a functionality perspective because the context
> > > itself correctly incorporates CR0.PG, it's only the role that's borked. I.e.
> > > KVM will fail to reuse a page/context due to the spurious role.nxe, but the
> > > permission checks are always be correct.
> > >
> > > I'll add a comment here and send a patch to fix the role calculation.
> >
> > Thank you so much for the explanation !
> > I understand your intention and why it would be benign.
> >
> > Then, I'm wondering if kvm_cr4_mmu_role_changed() needs to be
> > called here. Looking at the Intel SDM, in my understanding,
> > all the bits kvm_cr4_mmu_role_changed() checks are relevant
> > only if paging is enabled. (Or is my understanding incorrect ??)
>
> Duh, yes. And it goes even beyond that, CR0.WP is only relevant if CR0.PG=1,
> i.e. INIT with CR0.PG=0 and CR0.WP=1 will incorrectly trigger a MMU reset with
> the current logic.
>
> Sadly, simply omitting the CR4 check puts us in an awkward situation where, due
> to the MMU role CR4 calculations not accounting for CR0.PG=0, KVM will run with
> a stale role.
>
> The other consideration is that kvm_post_set_cr4() and kvm_post_set_cr0() should
> also skip kvm_mmu_reset_context() if CR0.PG=0, but again that requires fixing
> the role calculations first (or at the same time).
>
> I think I'll throw in those cleanups to the beginning of this series. The result
> is going to be disgustingly long, but I really don't want to introduce code that
> knowingly leaves KVM in an inconsistent state, nor do I want to add useless
> checks on CR4 and EFER.
Yes, I would think having the cleanups would be better.
Thank you !
Reiji
On 24/04/21 02:46, Sean Christopherson wrote:
> Don't waste time writing zeros via VMWRITE during vCPU RESET, the VMCS
> is zero allocated.
Is this guaranteed to be valid, or could the VMCS in principle use some
weird encoding? (Like it does for the access rights, even though this
does not matter for this patch).
Paolo
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 29 -----------------------------
> 1 file changed, 29 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 78d17adce7e6..74258ba4832a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4427,13 +4427,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> }
>
> if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
> - vmcs_write64(EOI_EXIT_BITMAP0, 0);
> - vmcs_write64(EOI_EXIT_BITMAP1, 0);
> - vmcs_write64(EOI_EXIT_BITMAP2, 0);
> - vmcs_write64(EOI_EXIT_BITMAP3, 0);
> -
> - vmcs_write16(GUEST_INTR_STATUS, 0);
> -
> vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
> vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> }
> @@ -4444,23 +4437,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> vmx->ple_window_dirty = true;
> }
>
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
> - vmcs_write32(CR3_TARGET_COUNT, 0); /* 22.2.1 */
> -
> - vmcs_write16(HOST_FS_SELECTOR, 0); /* 22.2.4 */
> - vmcs_write16(HOST_GS_SELECTOR, 0); /* 22.2.4 */
> vmx_set_constant_host_state(vmx);
> - vmcs_writel(HOST_FS_BASE, 0); /* 22.2.4 */
> - vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
>
> - if (cpu_has_vmx_vmfunc())
> - vmcs_write64(VM_FUNCTION_CONTROL, 0);
> -
> - vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
> - vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
> vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
> - vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
> vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val));
>
> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> @@ -4493,7 +4472,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> memset(&vmx->pt_desc, 0, sizeof(vmx->pt_desc));
> /* Bit[6~0] are forced to 1, writes are ignored. */
> vmx->pt_desc.guest.output_mask = 0x7F;
> - vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
> }
>
> vmx_setup_uret_msrs(vmx);
> @@ -4536,13 +4514,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vmcs_write32(GUEST_LDTR_LIMIT, 0xffff);
> vmcs_write32(GUEST_LDTR_AR_BYTES, 0x00082);
>
> - if (!init_event) {
> - vmcs_write32(GUEST_SYSENTER_CS, 0);
> - vmcs_writel(GUEST_SYSENTER_ESP, 0);
> - vmcs_writel(GUEST_SYSENTER_EIP, 0);
> - vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
> - }
> -
> vmcs_writel(GUEST_GDTR_BASE, 0);
> vmcs_write32(GUEST_GDTR_LIMIT, 0xffff);
>
>
On Mon, May 24, 2021 at 2:15 PM Paolo Bonzini <[email protected]> wrote:
>
> On 24/04/21 02:46, Sean Christopherson wrote:
> > Don't waste time writing zeros via VMWRITE during vCPU RESET, the VMCS
> > is zero allocated.
>
> Is this guaranteed to be valid, or could the VMCS in principle use some
> weird encoding? (Like it does for the access rights, even though this
> does not matter for this patch).
I see nothing in the SDM that would indicate that zero must be encoded as zero.
On Mon, May 24, 2021, Paolo Bonzini wrote:
> On 24/04/21 02:46, Sean Christopherson wrote:
> > Don't waste time writing zeros via VMWRITE during vCPU RESET, the VMCS
> > is zero allocated.
>
> Is this guaranteed to be valid, or could the VMCS in principle use some
> weird encoding? (Like it does for the access rights, even though this does
> not matter for this patch).
Phooey. In principle, the CPU can do whatever it wants, e.g. the SDM states that
software should never write to the data portion of the VMCS under any circumstance.
In practice, I would be flabbergasted if Intel ever ships a CPU that doesn't play
nice with zero initiazing the VMCS via software writes. I'd bet dollars to
donuts that KVM isn't the only software that relies on that behavior.
That said, I'm not against switching to VMWRITE for everything, but regardless
of which route we choose, we should commit to one or the other. I.e. double down
on memset() and bet that Intel won't break KVM, or replace the memset() in
alloc_vmcs_cpu() with a sequence that writes all known (possible?) fields. The
current approach of zeroing the memory in software but initializing _some_ fields
is the worst option, e.g. I highly doubt vmcs01 and vmcs02 do VMWRITE(..., 0) on
the same fields.
On Mon, May 24, 2021 at 3:28 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, May 24, 2021, Paolo Bonzini wrote:
> > On 24/04/21 02:46, Sean Christopherson wrote:
> > > Don't waste time writing zeros via VMWRITE during vCPU RESET, the VMCS
> > > is zero allocated.
> >
> > Is this guaranteed to be valid, or could the VMCS in principle use some
> > weird encoding? (Like it does for the access rights, even though this does
> > not matter for this patch).
>
> Phooey. In principle, the CPU can do whatever it wants, e.g. the SDM states that
> software should never write to the data portion of the VMCS under any circumstance.
>
> In practice, I would be flabbergasted if Intel ever ships a CPU that doesn't play
> nice with zero initiazing the VMCS via software writes. I'd bet dollars to
> donuts that KVM isn't the only software that relies on that behavior.
It's not just Intel. It's any manufacturer of physical or virtual CPUs
that implement VT-x. Non-architected behavior isn't guaranteed.
> That said, I'm not against switching to VMWRITE for everything, but regardless
> of which route we choose, we should commit to one or the other. I.e. double down
> on memset() and bet that Intel won't break KVM, or replace the memset() in
> alloc_vmcs_cpu() with a sequence that writes all known (possible?) fields. The
> current approach of zeroing the memory in software but initializing _some_ fields
> is the worst option, e.g. I highly doubt vmcs01 and vmcs02 do VMWRITE(..., 0) on
> the same fields.
The memset should probably be dropped, unless it is there to prevent
information leakage. However, it is not necessary to VMWRITE all known
(or possible) fields--just those that aren't guarded by an enable bit.
On Mon, May 24, 2021, Jim Mattson wrote:
> On Mon, May 24, 2021 at 3:28 PM Sean Christopherson <[email protected]> wrote:
> > That said, I'm not against switching to VMWRITE for everything, but regardless
> > of which route we choose, we should commit to one or the other. I.e. double down
> > on memset() and bet that Intel won't break KVM, or replace the memset() in
> > alloc_vmcs_cpu() with a sequence that writes all known (possible?) fields. The
> > current approach of zeroing the memory in software but initializing _some_ fields
> > is the worst option, e.g. I highly doubt vmcs01 and vmcs02 do VMWRITE(..., 0) on
> > the same fields.
>
> The memset should probably be dropped, unless it is there to prevent
> information leakage. However, it is not necessary to VMWRITE all known
> (or possible) fields--just those that aren't guarded by an enable bit.
Yeah, I was thinking of defense-in-depth, e.g. better to have VM-Enter consume
'0' than random garbage because KVM botched an enabling sequence. We essentially
get that today via the memset(). I'll fiddle with the sequence and see how much
overhead a paranoid and/or really paranoid approach would incur.
On Fri, Apr 23, 2021 at 5:50 PM Sean Christopherson <[email protected]> wrote:
>
> Don't set the BSP bit in vcpu->arch.apic_base when the local APIC is
> managed by userspace. Forcing all vCPUs to be BSPs is non-sensical, and
> was dead code when it was added by commit 97222cc83163 ("KVM: Emulate
> local APIC in kernel"). At the time, kvm_lapic_set_base() was invoked
> if and only if the local APIC was in-kernel (and it couldn't be called
> before the vCPU created its APIC).
>
> kvm_lapic_set_base() eventually gained generic usage, but the latent bug
> escaped notice because the only true consumer would be the guest itself
> in the form of an explicit RDMSRs on APs. Out of Linux, SeaBIOS, and
> EDK2/OVMF, only OVMF consume the BSP bit from the APIC_BASE MSR. For
> the vast majority of usage in OVMF, BSP confusion would be benign.
> OVMF's BSP election upon SMI rendezvous might be broken, but practically
> no one runs KVM with an out-of-kernel local APIC, let alone does so while
> utilizing SMIs with OVMF.
>
> Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:49 PM Sean Christopherson <[email protected]> wrote:
>
> Remove a BSP APIC update in kvm_lapic_reset() that is a glorified and
> confusing nop. When the code was originally added, kvm_vcpu_is_bsp()
> queried kvm->arch.bsp_vcpu, i.e. the intent was to set the BSP bit in the
> BSP vCPU's APIC. But, stuffing the BSP bit at INIT was wrong since the
> guest can change its BSP(s); this was fixed by commit 58d269d8cccc ("KVM:
> x86: BSP in MSR_IA32_APICBASE is writable").
>
> In other words, kvm_vcpu_is_bsp() is now purely a reflection of
> vcpu->arch.apic_base.MSR_IA32_APICBASE_BSP, thus the update will always
> set the current value and kvm_lapic_set_base() is effectively a nop if
> the new and old values match. The RESET case, which does need to stuff
> the BSP for the reset vCPU, is handled by vendor code (though this will
> soon be moved to common code).
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:50 PM Sean Christopherson <[email protected]> wrote:
>
> Set the BSP bit appropriately during local APIC "reset" instead of
> relying on vendor code to clean up at a later point. This is a step
> towards consolidating the local APIC, VMX, and SVM xAPIC initialization
> code.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:50 PM Sean Christopherson <[email protected]> wrote:
>
> Write vcpu->arch.apic_base directly instead of bouncing through
> kvm_set_apic_base(). This is a glorified nop, and is a step towards
> cleaning up the mess that is local APIC creation.
>
> When using an in-kernel APIC, kvm_create_lapic() explicitly sets
> vcpu->arch.apic_base to MSR_IA32_APICBASE_ENABLE to avoid its own
> kvm_lapic_set_base() call in kvm_lapic_reset() from triggering state
> changes. That call during RESET exists purely to set apic->base_address
> to the default base value. As a result, by the time VMX gets control,
> the only missing piece is the BSP bit being set for the reset BSP.
>
> For a userspace APIC, there are no side effects to process (for the APIC).
>
> In both cases, the call to kvm_update_cpuid_runtime() is a nop because
> the vCPU hasn't yet been exposed to userspace, i.e. there can't be any
> CPUID entries.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Fri, Apr 23, 2021 at 5:51 PM Sean Christopherson <[email protected]> wrote:
>
> Stuff vcpu->arch.apic_base and apic->base_address directly during APIC
> reset, as opposed to bouncing through kvm_set_apic_base() while fudging
> the ENABLE bit during creation to avoid the other, unwanted side effects.
>
> This is a step towards consolidating the APIC RESET logic across x86,
> VMX, and SVM.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 15 ++++++---------
> 1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b088f6984b37..b1366df46d1d 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2305,7 +2305,6 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_apicv);
> void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> struct kvm_lapic *apic = vcpu->arch.apic;
> - u64 msr_val;
> int i;
>
> if (!apic)
> @@ -2315,10 +2314,13 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> hrtimer_cancel(&apic->lapic_timer.timer);
>
> if (!init_event) {
> - msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
> + vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
> + MSR_IA32_APICBASE_ENABLE;
> if (kvm_vcpu_is_reset_bsp(vcpu))
> - msr_val |= MSR_IA32_APICBASE_BSP;
> - kvm_lapic_set_base(vcpu, msr_val);
> + vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
> +
> + apic->base_address = MSR_IA32_APICBASE_ENABLE;
I think you wanted to make the code above set apic->base_address
to APIC_DEFAULT_PHYS_BASE (not MSR_IA32_APICBASE_ENABLE).
Thanks,
Reiji
On Fri, Apr 23, 2021 at 5:51 PM Sean Christopherson <[email protected]> wrote:
>
> Consolidate the APIC base RESET logic, which is currently spread out
> across both x86 and vendor code. For an in-kernel APIC, the vendor code
> is redundant. But for a userspace APIC, KVM relies on the vendor code
> to initialize vcpu->arch.apic_base. Hoist the vcpu->arch.apic_base
> initialization above the !apic check so that it applies to both flavors
> of APIC emulation, and delete the vendor code.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Reiji Watanabe <[email protected]>
On Wed, May 26, 2021, Reiji Watanabe wrote:
> On Fri, Apr 23, 2021 at 5:51 PM Sean Christopherson <[email protected]> wrote:
> >
> > Stuff vcpu->arch.apic_base and apic->base_address directly during APIC
> > reset, as opposed to bouncing through kvm_set_apic_base() while fudging
> > the ENABLE bit during creation to avoid the other, unwanted side effects.
> >
> > This is a step towards consolidating the APIC RESET logic across x86,
> > VMX, and SVM.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/lapic.c | 15 ++++++---------
> > 1 file changed, 6 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index b088f6984b37..b1366df46d1d 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2305,7 +2305,6 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_apicv);
> > void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> > {
> > struct kvm_lapic *apic = vcpu->arch.apic;
> > - u64 msr_val;
> > int i;
> >
> > if (!apic)
> > @@ -2315,10 +2314,13 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> > hrtimer_cancel(&apic->lapic_timer.timer);
> >
> > if (!init_event) {
> > - msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
> > + vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
> > + MSR_IA32_APICBASE_ENABLE;
> > if (kvm_vcpu_is_reset_bsp(vcpu))
> > - msr_val |= MSR_IA32_APICBASE_BSP;
> > - kvm_lapic_set_base(vcpu, msr_val);
> > + vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
> > +
> > + apic->base_address = MSR_IA32_APICBASE_ENABLE;
>
> I think you wanted to make the code above set apic->base_address
> to APIC_DEFAULT_PHYS_BASE (not MSR_IA32_APICBASE_ENABLE).
Indeed! It also means I need to double check that I'm testing a guest without
x2apic enabled. Thanks much!
On Thu, May 27, 2021, Sean Christopherson wrote:
> On Fri, Apr 23, 2021, Sean Christopherson wrote:
> > Remove the @reset_roots param from kvm_init_mmu(), the one user,
> > kvm_mmu_reset_context() has already unloaded the MMU and thus freed and
> > invalidated all roots. This also happens to be why the reset_roots=true
> > paths doesn't leak roots; they're already invalid.
> >
> > No functional change intended.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/mmu.h | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 13 ++-----------
> > arch/x86/kvm/svm/nested.c | 2 +-
> > arch/x86/kvm/vmx/nested.c | 2 +-
> > 4 files changed, 5 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 88d0ed5225a4..63b49725fb24 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -65,7 +65,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> > void
> > reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
> >
> > -void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
> > +void kvm_init_mmu(struct kvm_vcpu *vcpu);
> > void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer,
> > gpa_t nested_cr3);
> > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 930ac8a7e7c9..ff3e200b32dd 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4793,17 +4793,8 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
> > update_last_nonleaf_level(vcpu, g_context);
> > }
> >
> > -void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots)
> > +void kvm_init_mmu(struct kvm_vcpu *vcpu)
> > {
> > - if (reset_roots) {
> > - uint i;
> > -
> > - vcpu->arch.mmu->root_hpa = INVALID_PAGE;
> > -
> > - for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> > - vcpu->arch.mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
>
> Egad! This is wrong. mmu->root_hpa is guaranteed to be INVALID_PAGE, but the
> prev_roots are not! I'll drop this patch and do cleanup of this code in a
> separate series.
*sigh* Jumped the gun, I was right the first time. kvm_mmu_free_roots() does
invalidate prev_roots[*] via mmu_free_root_page(). I still think I'll drop this
patch from this series, I don't think there's anything in this series that is
needed to purge @reset_roots.
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
mmu_free_root_page(kvm, &mmu->prev_roots[i].hpa, <-- tricky little devil
&invalid_list);
>
> > - }
> > -
> > if (mmu_is_nested(vcpu))
> > init_kvm_nested_mmu(vcpu);
> > else if (tdp_enabled)
On Fri, Apr 23, 2021, Sean Christopherson wrote:
> Remove the @reset_roots param from kvm_init_mmu(), the one user,
> kvm_mmu_reset_context() has already unloaded the MMU and thus freed and
> invalidated all roots. This also happens to be why the reset_roots=true
> paths doesn't leak roots; they're already invalid.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 13 ++-----------
> arch/x86/kvm/svm/nested.c | 2 +-
> arch/x86/kvm/vmx/nested.c | 2 +-
> 4 files changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 88d0ed5225a4..63b49725fb24 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -65,7 +65,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> void
> reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
>
> -void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
> +void kvm_init_mmu(struct kvm_vcpu *vcpu);
> void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer,
> gpa_t nested_cr3);
> void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 930ac8a7e7c9..ff3e200b32dd 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4793,17 +4793,8 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
> update_last_nonleaf_level(vcpu, g_context);
> }
>
> -void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots)
> +void kvm_init_mmu(struct kvm_vcpu *vcpu)
> {
> - if (reset_roots) {
> - uint i;
> -
> - vcpu->arch.mmu->root_hpa = INVALID_PAGE;
> -
> - for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> - vcpu->arch.mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
Egad! This is wrong. mmu->root_hpa is guaranteed to be INVALID_PAGE, but the
prev_roots are not! I'll drop this patch and do cleanup of this code in a
separate series.
> - }
> -
> if (mmu_is_nested(vcpu))
> init_kvm_nested_mmu(vcpu);
> else if (tdp_enabled)
On 24/04/21 02:46, Sean Christopherson wrote:
> For the record, I went into this thinking it was going to be a simple code
> shuffle between {svm,vmx}_vcpu_reset() and kvm_vcpu_reset(). The actual
> goal is to consolidate the RESET/INIT code, both to deduplicate code and
> to try to avoid divergent behavior/bugs, e.g. SVM only recently started
> updating vcpu->arch.cr4 on INIT.
>
> The TL;DR of why it takes 40+ patches to get there is that the RESET/INIT
> flows have multiple latent bugs and hidden dependencies, but "work"
> because they're rarely touched, are mostly fixed flows in both KVM and the
> guest, and because guests don't sanity check state after INIT.
>
> While several of the patches have Fixes tags, I am absolutely terrified of
> backporting most of them due to the likelihood of breaking a different
> version of KVM. And, for the most part the bugs are benign in the sense
> no guest has actually encountered any of these bugs. For that reason, I
> intentionally omitted stable@ entirely. The only patches I would consider
> even remotely safe for backporting are the first two patches in the series.
>
>
> Sean Christopherson (43):
> KVM: nVMX: Set LDTR to its architecturally defined value on nested
> VM-Exit
> KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping
> KVM: SVM: Require exact CPUID.0x1 match when stuffing EDX at INIT
> KVM: SVM: Fall back to KVM's hardcoded value for EDX at RESET/INIT
> KVM: x86: Split out CR0/CR4 MMU role change detectors to separate
> helpers
> KVM: x86: Properly reset MMU context at vCPU RESET/INIT
> KVM: VMX: Remove explicit MMU reset in enter_rmode()
> KVM: SVM: Drop explicit MMU reset at RESET/INIT
> KVM: SVM: Drop a redundant init_vmcb() from svm_create_vcpu()
> KVM: VMX: Move init_vmcs() invocation to vmx_vcpu_reset()
> KVM: x86: WARN if the APIC map is dirty without an in-kernel local
> APIC
> KVM: x86: Remove defunct BSP "update" in local APIC reset
> KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSP
> KVM: x86: Don't force set BSP bit when local APIC is managed by
> userspace
> KVM: x86: Set BSP bit in reset BSP vCPU's APIC base by default
> KVM: VMX: Stuff vcpu->arch.apic_base directly at vCPU RESET
> KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU
> RESET
> KVM: x86: Consolidate APIC base RESET initialization code
> KVM: x86: Move EDX initialization at vCPU RESET to common code
> KVM: SVM: Don't bother writing vmcb->save.rip at vCPU RESET/INIT
> KVM: VMX: Invert handling of CR0.WP for EPT without unrestricted guest
> KVM: VMX: Remove direct write to vcpu->arch.cr0 during vCPU RESET/INIT
> KVM: VMX: Fold ept_update_paging_mode_cr0() back into vmx_set_cr0()
> KVM: nVMX: Do not clear CR3 load/store exiting bits if L1 wants 'em
> KVM: VMX: Pull GUEST_CR3 from the VMCS iff CR3 load exiting is
> disabled
> KVM: VMX: Process CR0.PG side effects after setting CR0 assets
> KVM: VMX: Skip emulation required checks during pmode/rmode
> transitions
> KVM: nVMX: Don't evaluate "emulation required" on VM-Exit
> KVM: SVM: Tweak order of cr0/cr4/efer writes at RESET/INIT
> KVM: SVM: Drop redundant writes to vmcb->save.cr4 at RESET/INIT
> KVM: SVM: Stuff save->dr6 at during VMSA sync, not at RESET/INIT
> KVM: VMX: Skip pointless MSR bitmap update when setting EFER
> KVM: VMX: Refresh list of user return MSRs after setting guest CPUID
> KVM: VMX: Don't _explicitly_ reconfigure user return MSRs on vCPU INIT
> KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86
> KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INIT
> KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions
> KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is
> changed
> KVM: VMX: Remove unnecessary initialization of msr_bitmap_mode
> KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function
> KVM: VMX: Remove redundant write to set vCPU as active at RESET/INIT
> KVM: VMX: Drop VMWRITEs to zero fields at vCPU RESET
> KVM: x86: Drop pointless @reset_roots from kvm_init_mmu()
>
> arch/x86/include/asm/kvm_host.h | 5 -
> arch/x86/kvm/i8254.c | 3 +-
> arch/x86/kvm/lapic.c | 26 +--
> arch/x86/kvm/mmu.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 13 +-
> arch/x86/kvm/svm/nested.c | 2 +-
> arch/x86/kvm/svm/sev.c | 1 +
> arch/x86/kvm/svm/svm.c | 33 +---
> arch/x86/kvm/vmx/nested.c | 26 ++-
> arch/x86/kvm/vmx/vmx.c | 271 +++++++++++++-------------------
> arch/x86/kvm/vmx/vmx.h | 5 +-
> arch/x86/kvm/x86.c | 51 +++++-
> 12 files changed, 189 insertions(+), 249 deletions(-)
>
I'm waiting for a v2 of this; it applies with relatively few conflicts,
but there were some comments so it's better if you take care of updating it.
Paolo
On Thu, Jun 10, 2021, Paolo Bonzini wrote:
> I'm waiting for a v2 of this; it applies with relatively few conflicts, but
> there were some comments so it's better if you take care of updating it.
Ya, slowly getting there... Something in this series (I can't even remember what)
sent me into the morass that is unsync shadow pages and I've been thrashing around
in there for a while.