2021-07-13 16:35:06

by Sean Christopherson

Subject: [PATCH v2 00/46] KVM: x86: vCPU RESET/INIT fixes and consolidation

The end goal of this series is to consolidate the RESET/INIT code, both to
deduplicate code and to try to avoid divergent behavior/bugs, e.g. SVM only
recently started updating vcpu->arch.cr4 on INIT.

The TL;DR of why it takes 40+ patches to get there is that the RESET/INIT
flows have multiple latent bugs and hidden dependencies, but "work"
because they're rarely touched, the flows are mostly fixed in both KVM and
the guest, and guests don't sanity check state after INIT.

While several of the patches have Fixes tags, I am absolutely terrified of
backporting most of them due to the likelihood of breaking a different
version of KVM. And, for the most part, the bugs are benign in the sense
that no guest has actually encountered any of them. For that reason, I
intentionally omitted stable@ entirely. The only patches I would consider
even remotely safe for backporting are the first four patches in the series.

v2:
- Collect Reviews. [Reiji]
- Fix an apic->base_address initialization goof. [Reiji]
- Add patch to flush TLB on INIT. [Reiji]
- Add patch to preserve CR0.CD/NW on INIT. [Reiji]
- Add patch to emulate #INIT after shutdown on SVM. [Reiji]
- Add patch to consolidate arch.hflags code. [Reiji]
- Drop patch to omit VMWRITE zeroing. [Paolo, Jim]
- Drop several MMU patches (moved to other series).

v1: https://lkml.kernel.org/r/[email protected]

Sean Christopherson (46):
KVM: x86: Flush the guest's TLB on INIT
KVM: nVMX: Set LDTR to its architecturally defined value on nested
VM-Exit
KVM: SVM: Zero out GDTR.base and IDTR.base on INIT
KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping
KVM: SVM: Require exact CPUID.0x1 match when stuffing EDX at INIT
KVM: SVM: Fall back to KVM's hardcoded value for EDX at RESET/INIT
KVM: VMX: Remove explicit MMU reset in enter_rmode()
KVM: SVM: Drop explicit MMU reset at RESET/INIT
KVM: SVM: Drop a redundant init_vmcb() from svm_create_vcpu()
KVM: VMX: Move init_vmcs() invocation to vmx_vcpu_reset()
KVM: x86: WARN if the APIC map is dirty without an in-kernel local
APIC
KVM: x86: Remove defunct BSP "update" in local APIC reset
KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSP
KVM: x86: Don't force set BSP bit when local APIC is managed by
userspace
KVM: x86: Set BSP bit in reset BSP vCPU's APIC base by default
KVM: VMX: Stuff vcpu->arch.apic_base directly at vCPU RESET
KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU
RESET
KVM: x86: Consolidate APIC base RESET initialization code
KVM: x86: Move EDX initialization at vCPU RESET to common code
KVM: SVM: Don't bother writing vmcb->save.rip at vCPU RESET/INIT
KVM: VMX: Invert handling of CR0.WP for EPT without unrestricted guest
KVM: VMX: Remove direct write to vcpu->arch.cr0 during vCPU RESET/INIT
KVM: VMX: Fold ept_update_paging_mode_cr0() back into vmx_set_cr0()
KVM: nVMX: Do not clear CR3 load/store exiting bits if L1 wants 'em
KVM: VMX: Pull GUEST_CR3 from the VMCS iff CR3 load exiting is
disabled
KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0
KVM: VMX: Process CR0.PG side effects after setting CR0 assets
KVM: VMX: Skip emulation required checks during pmode/rmode
transitions
KVM: nVMX: Don't evaluate "emulation required" on nested VM-Exit
KVM: SVM: Tweak order of cr0/cr4/efer writes at RESET/INIT
KVM: SVM: Drop redundant writes to vmcb->save.cr4 at RESET/INIT
KVM: SVM: Stuff save->dr6 during VMSA sync, not at RESET/INIT
KVM: VMX: Skip pointless MSR bitmap update when setting EFER
KVM: VMX: Refresh list of user return MSRs after setting guest CPUID
KVM: VMX: Don't _explicitly_ reconfigure user return MSRs on vCPU INIT
KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86
KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INIT
KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions
KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is
changed
KVM: VMX: Remove unnecessary initialization of msr_bitmap_mode
KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function
KVM: VMX: Remove redundant write to set vCPU as active at RESET/INIT
KVM: VMX: Move RESET-only VMWRITE sequences to init_vmcs()
KVM: SVM: Emulate #INIT in response to triple fault shutdown
KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESET
KVM: x86: Preserve guest's CR0.CD/NW on INIT

arch/x86/include/asm/kvm_host.h | 5 -
arch/x86/kvm/i8254.c | 3 +-
arch/x86/kvm/lapic.c | 26 +--
arch/x86/kvm/svm/sev.c | 1 +
arch/x86/kvm/svm/svm.c | 48 ++----
arch/x86/kvm/vmx/nested.c | 24 ++-
arch/x86/kvm/vmx/vmx.c | 270 +++++++++++++++-----------------
arch/x86/kvm/vmx/vmx.h | 5 +-
arch/x86/kvm/x86.c | 52 +++++-
9 files changed, 211 insertions(+), 223 deletions(-)

--
2.32.0.93.g670b81a890-goog


2021-07-13 16:35:14

by Sean Christopherson

Subject: [PATCH v2 01/46] KVM: x86: Flush the guest's TLB on INIT

Flush the guest's TLB on INIT, as required by Intel's SDM. Although
AMD's APM states that the TLBs are unchanged by INIT, it's not clear that
that's correct, as the APM also states that the TLB is flushed on "External
initialization of the processor." Regardless, relying on the guest to be
paranoid is unnecessarily risky, while an unnecessary flush is benign
from a functional perspective and likely has no measurable impact on
guest performance.

Note, as of the April 2021 version, Intel's SDM also contradicts
itself with respect to TLB flushing. The overview of INIT explicitly
calls out the TLBs as being invalidated, while a table later in the same
section says they are unchanged.

9.1 INITIALIZATION OVERVIEW:
The major difference is that during an INIT, the internal caches, MSRs,
MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
are invalidated as with a hardware reset)

Table 9-1:

Register Power up Reset INIT
Data and Code Cache, TLBs: Invalid[6] Invalid[6] Unchanged

Given Core2's erratum[*] about global TLB entries not being flushed on INIT,
it's safe to assume that the table is simply wrong.

AZ28. INIT Does Not Clear Global Entries in the TLB
Problem: INIT may not flush a TLB entry when:
• The processor is in protected mode with paging enabled and the page global enable
flag is set (PGE bit of CR4 register)
• G bit for the page table entry is set
• TLB entry is present in TLB when INIT occurs
• Software may encounter unexpected page fault or incorrect address translation due
to a TLB entry erroneously left in TLB after INIT.

Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
bits PG or PE) registers before writing to memory early in BIOS
code to clear all the global entries from TLB.

Status: For the steppings affected, see the Summary Tables of Changes.

[*] https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
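
For reference, a sketch of how guest firmware might implement the erratum
workaround; toggling CR4.PGE (bit 7) flushes all TLB entries, including
global ones. This is illustrative code, not taken from this series or the
erratum:

	static void flush_tlb_including_globals(void)
	{
		unsigned long cr4;

		asm volatile("mov %%cr4, %0" : "=r" (cr4));
		/* Clearing CR4.PGE flushes global TLB entries... */
		asm volatile("mov %0, %%cr4" :: "r" (cr4 & ~(1UL << 7)) : "memory");
		/* ...and restoring it leaves paging behavior unchanged. */
		asm volatile("mov %0, %%cr4" :: "r" (cr4) : "memory");
	}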

Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8166ad113fb2..4ffc4ca7d7b0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10867,6 +10867,18 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
if (old_cr0 & X86_CR0_PG)
kvm_mmu_reset_context(vcpu);
+
+ /*
+ * Intel's SDM states that all TLB entries are flushed on INIT. AMD's
+ * APM states the TLBs are untouched by INIT, but it also states that
+ * the TLBs are flushed on "External initialization of the processor."
+ * Flush the guest TLB regardless of vendor, there is no meaningful
+ * benefit in relying on the guest to flush the TLB immediately after
+ * INIT. A spurious TLB flush is benign and likely negligible from a
+ * performance perspective.
+ */
+ if (init_event)
+ kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
}

void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:35:14

by Sean Christopherson

Subject: [PATCH v2 02/46] KVM: nVMX: Set LDTR to its architecturally defined value on nested VM-Exit

Set L1's LDTR on VM-Exit per the Intel SDM:

The host-state area does not contain a selector field for LDTR. LDTR is
established as follows on all VM exits: the selector is cleared to
0000H, the segment is marked unusable and is otherwise undefined
(although the base address is always canonical).

This is likely a benign bug: because the LDTR is left unusable after every
VM-Exit on bare metal, the L1 VMM is conditioned to reload its LDTR in
order to function properly.

Fixes: 4704d0befb07 ("KVM: nVMX: Exiting from L2 to L1")
Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1a52134b0c42..7f8184f432b4 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4298,6 +4298,10 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
};
vmx_set_segment(vcpu, &seg, VCPU_SREG_TR);

+ memset(&seg, 0, sizeof(seg));
+ seg.unusable = 1;
+ vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
+
kvm_set_dr(vcpu, 7, 0x400);
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:35:22

by Sean Christopherson

Subject: [PATCH v2 03/46] KVM: SVM: Zero out GDTR.base and IDTR.base on INIT

Explicitly set GDTR.base and IDTR.base to zero when initializing the VMCB.
Functionally this only affects INIT, as the bases are implicitly set to
zero on RESET by virtue of the VMCB being zero allocated.

Per AMD's APM, GDTR.base and IDTR.base are zeroed after RESET and INIT.

Fixes: 04d2cc7780d4 ("KVM: Move main vcpu loop into subarch independent code")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 616b9679ddcc..2150642e1bef 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1254,7 +1254,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
SVM_SELECTOR_S_MASK | SVM_SELECTOR_CODE_MASK;
save->cs.limit = 0xffff;

+ save->gdtr.base = 0;
save->gdtr.limit = 0xffff;
+ save->idtr.base = 0;
save->idtr.limit = 0xffff;

init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:35:41

by Sean Christopherson

Subject: [PATCH v2 06/46] KVM: SVM: Fall back to KVM's hardcoded value for EDX at RESET/INIT

At vCPU RESET/INIT (mostly RESET), stuff EDX with KVM's hardcoded,
default Family-Model-Stepping ID of 0x600 if CPUID.0x1 isn't defined.
At RESET, the CPUID lookup is guaranteed to "miss" because KVM emulates
RESET before exposing the vCPU to userspace, i.e. userspace can't
possibly have set the vCPU's CPUID model, and thus KVM will always
write '0'. At INIT, using 0x600 is less bad than using '0'.

While initializing EDX to '0' is _extremely_ unlikely to be noticed by
the guest, let alone break the guest, and can be overridden by
userspace for the RESET case, using 0x600 is preferable as it will allow
consolidating the relevant VMX and SVM RESET/INIT logic in the future.
And, digging through old specs suggests that neither Intel nor AMD has
ever shipped a CPU that initialized EDX to '0' at RESET.

Regarding 0x600 as KVM's default Family, it is a sane default and in
many ways the most appropriate. Prior to the 386 implementations, DX
was undefined at RESET. With the 386, 486, 586/P5, and 686/P6/Athlon,
both Intel and AMD set EDX to 3, 4, 5, and 6 respectively. AMD switched
to using '15' as its primary Family with the introduction of AMD64, but
Intel has continued using '6' for the last few decades.

So, '6' is a valid Family for both Intel and AMD CPUs, is compatible
with both 32-bit and 64-bit CPUs (albeit not a perfect fit for 64-bit
AMD), and of the common Families (3 - 6), is the best fit with respect to
KVM's virtual CPU model. E.g. prior to the P6, Intel CPUs did not have a
STI window. Modern operating systems, Linux included, rely on the STI
window, e.g. for "safe halt", and KVM unconditionally assumes the virtual
CPU has an STI window. Thus enumerating a Family ID of 3, 4, or 5 would
be provably wrong.
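
For reference, 0x600 decodes to Family 6 because EDX at RESET/INIT uses the
same Family/Model/Stepping encoding as CPUID.0x1.EAX. A sketch of the
architectural layout, ignoring the extended Family/Model fields; the
helpers are illustrative, not KVM code:

	/* F/M/S layout: stepping in bits 3:0, model in 7:4, family in 11:8. */
	static inline u32 fms_family(u32 fms)   { return (fms >> 8) & 0xf; }
	static inline u32 fms_model(u32 fms)    { return (fms >> 4) & 0xf; }
	static inline u32 fms_stepping(u32 fms) { return fms & 0xf; }

	/* 0x600 == (6 << 8) -> Family 6, Model 0, Stepping 0. */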

Opportunistically remove a stale comment.

Fixes: 66f7b72e1171 ("KVM: x86: Make register state after reset conform to specification")
Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 12e49dc16efe..7da214660c64 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1277,7 +1277,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
kvm_mmu_reset_context(vcpu);

save->cr4 = X86_CR4_PAE;
- /* rdx = ?? */

if (npt_enabled) {
/* Setup VMCB for Nested Paging */
@@ -1359,7 +1358,15 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
}
init_vmcb(vcpu);

- kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true);
+ /*
+ * Fall back to KVM's default Family/Model/Stepping if no CPUID match
+ * is found. Note, it's impossible to get a match at RESET since KVM
+ * emulates RESET before exposing the vCPU to userspace, i.e. it's
+ * impossible for kvm_cpuid() to find a valid entry on RESET. But, go
+ * through the motions in case that's ever remedied, and to be pedantic.
+ */
+ if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
+ eax = get_rdx_init_val();
kvm_rdx_write(vcpu, eax);

if (kvm_vcpu_apicv_active(vcpu) && !init_event)
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:35:41

by Sean Christopherson

Subject: [PATCH v2 04/46] KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping

Set EDX at RESET/INIT based on the userspace-defined CPUID model when
possible, i.e. when CPUID.0x1.EAX is defined by userspace. At RESET/INIT,
all CPUs that support CPUID set EDX to the FMS enumerated in
CPUID.0x1.EAX. If no CPUID match is found, fall back to KVM's default
of 0x600 (Family '6'), which is the least awful approximation of KVM's
virtual CPU model.

Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 927a552393b9..825197f21700 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4394,6 +4394,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct msr_data apic_base_msr;
+ u32 eax, dummy;
u64 cr0;

vmx->rmode.vm86_active = 0;
@@ -4401,7 +4402,11 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

vmx->msr_ia32_umwait_control = 0;

- vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
+ eax = 1;
+ if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
+ eax = get_rdx_init_val();
+ kvm_rdx_write(vcpu, eax);
+
vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:35:41

by Sean Christopherson

Subject: [PATCH v2 05/46] KVM: SVM: Require exact CPUID.0x1 match when stuffing EDX at INIT

Do not allow an inexact CPUID "match" when querying the guest's CPUID.0x1
to stuff EDX during INIT. In the common case, where the guest CPU model
is an AMD variant, allowing an inexact match is a nop since KVM doesn't
emulate Intel's goofy "out-of-range" logic for AMD and Hygon. If the
vCPU model happens to be an Intel variant, an inexact match is possible
if and only if the max CPUID leaf is precisely '0'. Aside from the fact
that there's probably no CPU in existence with a single CPUID leaf, if
the max CPUID leaf is '0', that means that CPUID.0.EAX is '0', and thus
an inexact match for CPUID.0x1.EAX will also yield '0'.

So, with lots of twisty logic, no functional change intended.
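
For readers unfamiliar with the out-of-range behavior referenced above, a
rough sketch (helper names hypothetical, not KVM's actual code): requesting
a leaf above the max basic leaf returns the output of the max basic leaf.

	entry = find_cpuid_entry(vcpu, function, index);
	if (!entry && !exact_only && guest_vendor_is_intel(vcpu)) {
		basic = find_cpuid_entry(vcpu, 0, 0);	/* CPUID.0.EAX = max leaf */
		if (basic && function > basic->eax)
			entry = find_cpuid_entry(vcpu, basic->eax, index);
	}

	/* With a max leaf of '0', CPUID.0.EAX is '0', so CPUID.0x1 yields '0'. */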

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2150642e1bef..12e49dc16efe 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1359,7 +1359,7 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
}
init_vmcb(vcpu);

- kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, false);
+ kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true);
kvm_rdx_write(vcpu, eax);

if (kvm_vcpu_apicv_active(vcpu) && !init_event)
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:35:45

by Sean Christopherson

Subject: [PATCH v2 08/46] KVM: SVM: Drop explicit MMU reset at RESET/INIT

Drop an explicit MMU reset in SVM's vCPU RESET/INIT flow now that the
common x86 path correctly handles conditional MMU resets, e.g. if INIT
arrives while the vCPU is in 64-bit mode.

This reverts commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset").

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7da214660c64..44248548be7d 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1274,7 +1274,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
* It also updates the guest-visible cr0 value.
*/
svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
- kvm_mmu_reset_context(vcpu);

save->cr4 = X86_CR4_PAE;

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:35:53

by Sean Christopherson

Subject: [PATCH v2 07/46] KVM: VMX: Remove explicit MMU reset in enter_rmode()

Drop an explicit MMU reset when entering emulated real mode now that the
vCPU INIT/RESET path correctly handles conditional MMU resets, e.g. if
INIT arrives while the vCPU is in 64-bit mode.

Note, while there are multiple other direct calls to vmx_set_cr0(), i.e.
paths that change CR0 without invoking kvm_post_set_cr0(), only the INIT
emulation can reach enter_rmode(). CLTS emulation only toggles CR0.TS,
VM-Exit (and late VM-Fail) emulation cannot architecturally transition to
Real Mode, and VM-Enter to Real Mode is possible if and only if
Unrestricted Guest is enabled (exposed to L1).

This effectively reverts commit 8668a3c468ed ("KVM: VMX: Reset mmu
context when entering real mode").

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 825197f21700..0f5e97a904e5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2852,8 +2852,6 @@ static void enter_rmode(struct kvm_vcpu *vcpu)
fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.segs[VCPU_SREG_DS]);
fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.segs[VCPU_SREG_GS]);
fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.segs[VCPU_SREG_FS]);
-
- kvm_mmu_reset_context(vcpu);
}

int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:35:59

by Sean Christopherson

Subject: [PATCH v2 09/46] KVM: SVM: Drop a redundant init_vmcb() from svm_create_vcpu()

Drop an extra init_vmcb() from svm_create_vcpu(); svm_vcpu_reset() is
guaranteed to call init_vmcb(), and there are no consumers of the VMCB
data between ->vcpu_create() and ->vcpu_reset(). Keep the call to
svm_switch_vmcb() as sev_es_create_vcpu() touches the current VMCB, but
hoist it up a few lines to associate the switch with the allocation of
vmcb01.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 44248548be7d..cef9520fe77f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1431,15 +1431,13 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)

svm->vmcb01.ptr = page_address(vmcb01_page);
svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
+ svm_switch_vmcb(svm, &svm->vmcb01);

if (vmsa_page)
svm->vmsa = page_address(vmsa_page);

svm->guest_state_loaded = false;

- svm_switch_vmcb(svm, &svm->vmcb01);
- init_vmcb(vcpu);
-
svm_init_osvw(vcpu);
vcpu->arch.microcode_version = 0x01000065;

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:36:05

by Sean Christopherson

Subject: [PATCH v2 10/46] KVM: VMX: Move init_vmcs() invocation to vmx_vcpu_reset()

Initialize constant VMCS state in vmx_vcpu_reset() instead of in
vmx_create_vcpu(), which allows for the removal of the open coded "vCPU
load" sequence since ->vcpu_reset() is invoked while the vCPU is properly
loaded (which is the entire point of vCPU reset...).

Deferring initialization is effectively a nop as it's impossible to
safely access the VMCS between the current call site and its new home, as
both the vCPU and the pCPU are put immediately after init_vmcs(), i.e.
the VMCS isn't guaranteed to be loaded.

Note, task preemption is not a problem as vmx_sched_in() _can't_ touch
the VMCS as ->sched_in() is invoked before the vCPU, and thus VMCS, is
reloaded. I.e. the preemption path also can't consume VMCS state.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0f5e97a904e5..26c0e776827c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4293,10 +4293,6 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)

#define VMX_XSS_EXIT_BITMAP 0

-/*
- * Noting that the initialization of Guest-state Area of VMCS is in
- * vmx_vcpu_reset().
- */
static void init_vmcs(struct vcpu_vmx *vmx)
{
if (nested)
@@ -4395,6 +4391,9 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u32 eax, dummy;
u64 cr0;

+ if (!init_event)
+ init_vmcs(vmx);
+
vmx->rmode.vm86_active = 0;
vmx->spec_ctrl = 0;

@@ -6782,7 +6781,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
{
struct vmx_uret_msr *tsx_ctrl;
struct vcpu_vmx *vmx;
- int i, cpu, err;
+ int i, err;

BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
vmx = to_vmx(vcpu);
@@ -6844,12 +6843,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
vmx->msr_bitmap_mode = 0;

vmx->loaded_vmcs = &vmx->vmcs01;
- cpu = get_cpu();
- vmx_vcpu_load(vcpu, cpu);
- vcpu->cpu = cpu;
- init_vmcs(vmx);
- vmx_vcpu_put(vcpu);
- put_cpu();
+
if (cpu_need_virtualize_apic_accesses(vcpu)) {
err = alloc_apic_access_page(vcpu->kvm);
if (err)
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:36:11

by Sean Christopherson

Subject: [PATCH v2 11/46] KVM: x86: WARN if the APIC map is dirty without an in-kernel local APIC

WARN if KVM ends up in a state where it thinks its APIC map needs to be
recalculated, but KVM is not emulating the local APIC. This is mostly
to document KVM's "rules" in order to provide clarity in future cleanups.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ba5a27879f1d..add4dd1e3528 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -192,6 +192,9 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
if (atomic_read_acquire(&kvm->arch.apic_map_dirty) == CLEAN)
return;

+ WARN_ONCE(!irqchip_in_kernel(kvm),
+ "Dirty APIC map without an in-kernel local APIC");
+
mutex_lock(&kvm->arch.apic_map_lock);
/*
* Read kvm->arch.apic_map_dirty before kvm->arch.apic_map
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:36:22

by Sean Christopherson

Subject: [PATCH v2 13/46] KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSP

Make vcpu0 the arbitrary owner of the PIT, as was intended when PIT
migration was added by commit 2f5997140f22 ("KVM: migrate PIT timer").
The PIT was unintentionally turned into being owned by the BSP by commit
c5af89b68abb ("KVM: Introduce kvm_vcpu_is_bsp() function."), and was then
unintentionally converted to a shared ownership model when
kvm_vcpu_is_bsp() was modified to check the APIC base MSR instead of
hardcoding vcpu0 as the BSP.

Functionally, this just means the PIT's hrtimer is migrated less often.
The real motivation is to remove the usage of kvm_vcpu_is_bsp(), so that
more legacy/broken crud can be removed in a future patch.

Fixes: 58d269d8cccc ("KVM: x86: BSP in MSR_IA32_APICBASE is writable")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/i8254.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index a6e218c6140d..5a69cce4d72d 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -220,7 +220,8 @@ void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
struct kvm_pit *pit = vcpu->kvm->arch.vpit;
struct hrtimer *timer;

- if (!kvm_vcpu_is_bsp(vcpu) || !pit)
+ /* Somewhat arbitrarily make vcpu0 the owner of the PIT. */
+ if (vcpu->vcpu_id || !pit)
return;

timer = &pit->pit_state.timer;
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:36:24

by Sean Christopherson

Subject: [PATCH v2 14/46] KVM: x86: Don't force set BSP bit when local APIC is managed by userspace

Don't set the BSP bit in vcpu->arch.apic_base when the local APIC is
managed by userspace. Forcing all vCPUs to be BSPs is nonsensical, and
was dead code when it was added by commit 97222cc83163 ("KVM: Emulate
local APIC in kernel"). At the time, kvm_lapic_set_base() was invoked
if and only if the local APIC was in-kernel (and it couldn't be called
before the vCPU created its APIC).

kvm_lapic_set_base() eventually gained generic usage, but the latent bug
escaped notice because the only true consumer would be the guest itself
in the form of explicit RDMSRs on APs. Out of Linux, SeaBIOS, and
EDK2/OVMF, only OVMF consumes the BSP bit from the APIC_BASE MSR. For
the vast majority of usage in OVMF, BSP confusion would be benign.
OVMF's BSP election upon SMI rendezvous might be broken, but practically
no one runs KVM with an out-of-kernel local APIC, let alone does so while
utilizing SMIs with OVMF.
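
For reference, the relevant IA32_APIC_BASE bits (architectural layout; the
BSP and ENABLE names appear in this series' diffs, the x2APIC bit is shown
only for completeness):

	#define MSR_IA32_APICBASE_BSP     (1UL << 8)   /* this CPU is the BSP */
	#define X2APIC_ENABLE             (1UL << 10)  /* x2APIC mode enable */
	#define MSR_IA32_APICBASE_ENABLE  (1UL << 11)  /* xAPIC global enable */
	/* Bits 12 and up hold the APIC base; the default is 0xfee00000. */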

Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a24ce8fe93e5..acb201d16b5e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2268,9 +2268,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
u64 old_value = vcpu->arch.apic_base;
struct kvm_lapic *apic = vcpu->arch.apic;

- if (!apic)
- value |= MSR_IA32_APICBASE_BSP;
-
vcpu->arch.apic_base = value;

if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE)
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:36:43

by Sean Christopherson

Subject: [PATCH v2 15/46] KVM: x86: Set BSP bit in reset BSP vCPU's APIC base by default

Set the BSP bit appropriately during local APIC "reset" instead of
relying on vendor code to clean up at a later point. This is a step
towards consolidating the local APIC, VMX, and SVM xAPIC initialization
code.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index acb201d16b5e..0fb282b64c8f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2321,6 +2321,7 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_apicv);
void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct kvm_lapic *apic = vcpu->arch.apic;
+ u64 msr_val;
int i;

if (!apic)
@@ -2330,8 +2331,10 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
hrtimer_cancel(&apic->lapic_timer.timer);

if (!init_event) {
- kvm_lapic_set_base(vcpu, APIC_DEFAULT_PHYS_BASE |
- MSR_IA32_APICBASE_ENABLE);
+ msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
+ if (kvm_vcpu_is_reset_bsp(vcpu))
+ msr_val |= MSR_IA32_APICBASE_BSP;
+ kvm_lapic_set_base(vcpu, msr_val);
kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
}
kvm_apic_set_version(apic->vcpu);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:36:49

by Sean Christopherson

Subject: [PATCH v2 27/46] KVM: VMX: Process CR0.PG side effects after setting CR0 assets

Move the long mode and EPT w/o unrestricted guest side effect processing
down in vmx_set_cr0() so that the EPT && !URG case doesn't have to stuff
vcpu->arch.cr0 early. This also fixes an oddity where CR0 might not be
marked available, i.e. the early vcpu->arch.cr0 write would appear to be
in danger of being overwritten, though that can't actually happen in the
current code since CR0.TS is the only guest-owned bit, and CR0.TS is not
read by vmx_set_cr4().
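
For context, the caching pattern in play, paraphrased from KVM's
kvm_cache_regs.h (reads of guest-owned CR0 bits pull from the VMCS unless
CR0 has been marked available):

	static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
	{
		ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;

		/* Guest-owned bits may be stale until CR0 is (re)cached. */
		if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
		    !kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
			static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR0);
		return vcpu->arch.cr0 & mask;
	}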

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d632c0a16f12..45b123bb5aaa 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3003,9 +3003,11 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- unsigned long hw_cr0;
+ unsigned long hw_cr0, old_cr0_pg;
u32 tmp;

+ old_cr0_pg = kvm_read_cr0_bits(vcpu, X86_CR0_PG);
+
hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);
if (is_unrestricted_guest(vcpu))
hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST;
@@ -3021,11 +3023,16 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
enter_rmode(vcpu);
}

+ vmcs_writel(CR0_READ_SHADOW, cr0);
+ vmcs_writel(GUEST_CR0, hw_cr0);
+ vcpu->arch.cr0 = cr0;
+ kvm_register_mark_available(vcpu, VCPU_EXREG_CR0);
+
#ifdef CONFIG_X86_64
if (vcpu->arch.efer & EFER_LME) {
- if (!is_paging(vcpu) && (cr0 & X86_CR0_PG))
+ if (!old_cr0_pg && (cr0 & X86_CR0_PG))
enter_lmode(vcpu);
- if (is_paging(vcpu) && !(cr0 & X86_CR0_PG))
+ else if (old_cr0_pg && !(cr0 & X86_CR0_PG))
exit_lmode(vcpu);
}
#endif
@@ -3066,17 +3073,11 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
exec_controls_set(vmx, tmp);
}

- if (!is_paging(vcpu) != !(cr0 & X86_CR0_PG)) {
- vcpu->arch.cr0 = cr0;
+ /* Note, vmx_set_cr4() consumes the new vcpu->arch.cr0. */
+ if ((old_cr0_pg ^ cr0) & X86_CR0_PG)
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
- }
}

- vmcs_writel(CR0_READ_SHADOW, cr0);
- vmcs_writel(GUEST_CR0, hw_cr0);
- vcpu->arch.cr0 = cr0;
- kvm_register_mark_available(vcpu, VCPU_EXREG_CR0);
-
/* depends on vcpu->arch.cr0 to be set to a new value */
vmx->emulation_required = emulation_required(vcpu);
}
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:36:51

by Sean Christopherson

Subject: [PATCH v2 16/46] KVM: VMX: Stuff vcpu->arch.apic_base directly at vCPU RESET

Write vcpu->arch.apic_base directly instead of bouncing through
kvm_set_apic_base(). This is a glorified nop, and is a step towards
cleaning up the mess that is local APIC creation.

When using an in-kernel APIC, kvm_create_lapic() explicitly sets
vcpu->arch.apic_base to MSR_IA32_APICBASE_ENABLE to avoid its own
kvm_lapic_set_base() call in kvm_lapic_reset() from triggering state
changes. That call during RESET exists purely to set apic->base_address
to the default base value. As a result, by the time VMX gets control,
the only missing piece is the BSP bit being set for the reset BSP.

For a userspace APIC, there are no side effects to process (for the APIC).

In both cases, the call to kvm_update_cpuid_runtime() is a nop because
the vCPU hasn't yet been exposed to userspace, i.e. there can't be any
CPUID entries.

No functional change intended.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 26c0e776827c..e6cc389ec697 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4387,7 +4387,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- struct msr_data apic_base_msr;
u32 eax, dummy;
u64 cr0;

@@ -4408,12 +4407,10 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_set_cr8(vcpu, 0);

if (!init_event) {
- apic_base_msr.data = APIC_DEFAULT_PHYS_BASE |
- MSR_IA32_APICBASE_ENABLE;
+ vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
+ MSR_IA32_APICBASE_ENABLE;
if (kvm_vcpu_is_reset_bsp(vcpu))
- apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
- apic_base_msr.host_initiated = true;
- kvm_set_apic_base(vcpu, &apic_base_msr);
+ vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
}

vmx_segment_cache_clear(vmx);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:07

by Sean Christopherson

Subject: [PATCH v2 20/46] KVM: SVM: Don't bother writing vmcb->save.rip at vCPU RESET/INIT

Drop unnecessary initialization of vmcb->save.rip during vCPU RESET/INIT,
as svm_vcpu_run() unconditionally propagates VCPU_REGS_RIP to save.rip.

No true functional change intended.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 268580713938..0101646e42e0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1266,8 +1266,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
svm_set_efer(vcpu, 0);
save->dr6 = 0xffff0ff0;
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
- save->rip = 0x0000fff0;
- vcpu->arch.regs[VCPU_REGS_RIP] = save->rip;
+ vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;

/*
* svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:06

by Sean Christopherson

Subject: [PATCH v2 22/46] KVM: VMX: Remove direct write to vcpu->arch.cr0 during vCPU RESET/INIT

Remove a bogus write to vcpu->arch.cr0 that immediately precedes
vmx_set_cr0() during vCPU RESET/INIT. For RESET, this is a nop since
the "old" CR0 value is meaningless. But for INIT, if the vCPU is coming
from paging enabled mode, crushing vcpu->arch.cr0 will cause the various
is_paging() checks in vmx_set_cr0() to get false negatives.

For the exit_lmode() case, the false negative is benign as vmx_set_efer()
is called immediately after vmx_set_cr0().

For EPT without unrestricted guest, the false negative will cause KVM to
unnecessarily run with CR3 load/store exiting. But again, this is
benign, albeit sub-optimal.
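
For context, is_paging() consumes the cached vcpu->arch.cr0 (a paraphrase
of the helper in arch/x86/kvm/x86.h), which is why overwriting
vcpu->arch.cr0 ahead of vmx_set_cr0() produces the false negatives
described above:

	static inline bool is_paging(struct kvm_vcpu *vcpu)
	{
		/* Reads the cached CR0.PG, i.e. sees the "crushed" value. */
		return likely(kvm_read_cr0_bits(vcpu, X86_CR0_PG));
	}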

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 02aec75ec6f6..ed631564c651 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4383,7 +4383,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- u64 cr0;

if (!init_event)
init_vmcs(vmx);
@@ -4454,9 +4453,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);

- cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
- vmx->vcpu.arch.cr0 = cr0;
- vmx_set_cr0(vcpu, cr0); /* enter rmode */
+ vmx_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
vmx_set_cr4(vcpu, 0);
vmx_set_efer(vcpu, 0);

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:07

by Sean Christopherson

Subject: [PATCH v2 23/46] KVM: VMX: Fold ept_update_paging_mode_cr0() back into vmx_set_cr0()

Move the CR0/CR3/CR4 shenanigans for EPT without unrestricted guest back
into vmx_set_cr0(). This will allow a future patch to eliminate the
rather gross stuffing of vcpu->arch.cr0 in the paging transition cases
by snapshotting the old CR0.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 40 +++++++++++++++++-----------------------
1 file changed, 17 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ed631564c651..db70fe463aa1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2994,27 +2994,6 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
}

-static void ept_update_paging_mode_cr0(unsigned long cr0, struct kvm_vcpu *vcpu)
-{
- struct vcpu_vmx *vmx = to_vmx(vcpu);
-
- if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
- vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
- if (!(cr0 & X86_CR0_PG)) {
- /* From paging/starting to nonpaging */
- exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING);
- vcpu->arch.cr0 = cr0;
- vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
- } else if (!is_paging(vcpu)) {
- /* From nonpaging to paging */
- exec_controls_clearbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING);
- vcpu->arch.cr0 = cr0;
- vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
- }
-}
-
void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3044,8 +3023,23 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
}
#endif

- if (enable_ept && !is_unrestricted_guest(vcpu))
- ept_update_paging_mode_cr0(cr0, vcpu);
+ if (enable_ept && !is_unrestricted_guest(vcpu)) {
+ if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
+ vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
+ if (!(cr0 & X86_CR0_PG)) {
+ /* From paging/starting to nonpaging */
+ exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
+ CPU_BASED_CR3_STORE_EXITING);
+ vcpu->arch.cr0 = cr0;
+ vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
+ } else if (!is_paging(vcpu)) {
+ /* From nonpaging to paging */
+ exec_controls_clearbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
+ CPU_BASED_CR3_STORE_EXITING);
+ vcpu->arch.cr0 = cr0;
+ vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
+ }
+ }

vmcs_writel(CR0_READ_SHADOW, cr0);
vmcs_writel(GUEST_CR0, hw_cr0);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:15

by Sean Christopherson

Subject: [PATCH v2 25/46] KVM: VMX: Pull GUEST_CR3 from the VMCS iff CR3 load exiting is disabled

Tweak the logic for grabbing vmcs.GUEST_CR3 in vmx_cache_reg() to look
directly at the execution controls, as opposed to effectively inferring
the controls based on vCPUs. Inferring the controls isn't wrong, but it
creates a very subtle dependency between the caching logic, the state of
vcpu->arch.cr0 (via is_paging()), and the behavior of vmx_set_cr0().

Using the execution controls doesn't completely eliminate the dependency
in vmx_set_cr0(), e.g. neglecting to cache CR3 before enabling
interception would still break the guest, but it does reduce the
code dependency and mostly eliminate the logical dependency (that CR3
loads are intercepted in certain scenarios). Eliminating the subtle
read of vcpu->arch.cr0 will also allow for additional cleanup in
vmx_set_cr0().

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 58c6d7b98624..d632c0a16f12 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2262,8 +2262,11 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
break;
case VCPU_EXREG_CR3:
- if (is_unrestricted_guest(vcpu) ||
- (enable_ept && is_paging(vcpu)))
+ /*
+ * When intercepting CR3 loads, e.g. for shadow paging, KVM's
+ * CR3 is loaded into hardware, not the guest's CR3.
+ */
+ if (!(exec_controls_get(to_vmx(vcpu)) & CPU_BASED_CR3_LOAD_EXITING))
vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
break;
case VCPU_EXREG_CR4:
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:19

by Sean Christopherson

Subject: [PATCH v2 24/46] KVM: nVMX: Do not clear CR3 load/store exiting bits if L1 wants 'em

Keep CR3 load/store exiting enabled as needed when running L2 in order to
honor L1's desires. This fixes a largely theoretical bug where L1 could
intercept CR3 but not CR0.PG and end up not getting the desired CR3 exits
when L2 enables paging. In other words, the existing !is_paging() check
inadvertently handles the normal case for L2 where vmx_set_cr0() is
called during VM-Enter, which is guaranteed to run with paging enabled,
and thus will never clear the bits.

Removing the !is_paging() check will also allow future consolidation and
cleanup of the related code. From a performance perspective, this is
all a nop, as the VMCS controls shadow will optimize away the VMWRITE
when the controls are in the desired state.

Add a comment explaining why CR3 is intercepted, with a big disclaimer
about not querying the old CR0.PG. Because vmx_set_cr0() is used for flows
that are not directly tied to MOV CR3, e.g. vCPU RESET/INIT and nested
VM-Enter, it's possible that is_paging() is not synchronized with CR3
load/store exiting. This is actually guaranteed in the current code, as
KVM starts with CR3 interception disabled. Obviously that can be fixed,
but there's no good reason to play whack-a-mole, and it tends to end
poorly, e.g. descriptor table exiting for UMIP emulation attempted to be
precise in the past and ended up botching the interception toggling.

Fixes: fe3ef05c7572 ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++++---------
1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index db70fe463aa1..58c6d7b98624 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2994,10 +2994,14 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
}

+#define CR3_EXITING_BITS (CPU_BASED_CR3_LOAD_EXITING | \
+ CPU_BASED_CR3_STORE_EXITING)
+
void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long hw_cr0;
+ u32 tmp;

hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);
if (is_unrestricted_guest(vcpu))
@@ -3024,18 +3028,42 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
#endif

if (enable_ept && !is_unrestricted_guest(vcpu)) {
+ /*
+ * Ensure KVM has an up-to-date snapshot of the guest's CR3. If
+ * the below code _enables_ CR3 exiting, vmx_cache_reg() will
+ * (correctly) stop reading vmcs.GUEST_CR3 because it thinks
+ * KVM's CR3 is installed.
+ */
if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
+
+ /*
+ * When running with EPT but not unrestricted guest, KVM must
+ * intercept CR3 accesses when paging is _disabled_. This is
+ * necessary because restricted guests can't actually run with
+ * paging disabled, and so KVM stuffs its own CR3 in order to
+ * run the guest with identity mapped page tables.
+ *
+ * Do _NOT_ check the old CR0.PG, e.g. to optimize away the
+ * update, it may be stale with respect to CR3 interception,
+ * e.g. after nested VM-Enter.
+ *
+ * Lastly, honor L1's desires, i.e. intercept CR3 loads and/or
+ * stores to forward them to L1, even if KVM does not need to
+ * intercept them to preserve its identity mapped page tables.
+ */
if (!(cr0 & X86_CR0_PG)) {
- /* From paging/starting to nonpaging */
- exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING);
- vcpu->arch.cr0 = cr0;
- vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
- } else if (!is_paging(vcpu)) {
- /* From nonpaging to paging */
- exec_controls_clearbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
- CPU_BASED_CR3_STORE_EXITING);
+ exec_controls_setbit(vmx, CR3_EXITING_BITS);
+ } else if (!is_guest_mode(vcpu)) {
+ exec_controls_clearbit(vmx, CR3_EXITING_BITS);
+ } else {
+ tmp = exec_controls_get(vmx);
+ tmp &= ~CR3_EXITING_BITS;
+ tmp |= get_vmcs12(vcpu)->cpu_based_vm_exec_control & CR3_EXITING_BITS;
+ exec_controls_set(vmx, tmp);
+ }
+
+ if (!is_paging(vcpu) != !(cr0 & X86_CR0_PG)) {
vcpu->arch.cr0 = cr0;
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
}
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:35

by Sean Christopherson

Subject: [PATCH v2 17/46] KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET

Stuff vcpu->arch.apic_base and apic->base_address directly during APIC
reset, as opposed to bouncing through kvm_set_apic_base() while fudging
the ENABLE bit during creation to avoid the other, unwanted side effects.

This is a step towards consolidating the APIC RESET logic across x86,
VMX, and SVM.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0fb282b64c8f..295a9d02a9a5 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2321,7 +2321,6 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_apicv);
void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u64 msr_val;
int i;

if (!apic)
@@ -2331,10 +2330,13 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
hrtimer_cancel(&apic->lapic_timer.timer);

if (!init_event) {
- msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
+ vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
+ MSR_IA32_APICBASE_ENABLE;
if (kvm_vcpu_is_reset_bsp(vcpu))
- msr_val |= MSR_IA32_APICBASE_BSP;
- kvm_lapic_set_base(vcpu, msr_val);
+ vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
+
+ apic->base_address = APIC_DEFAULT_PHYS_BASE;
+
kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
}
kvm_apic_set_version(apic->vcpu);
@@ -2477,11 +2479,6 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
lapic_timer_advance_dynamic = false;
}

- /*
- * APIC is created enabled. This will prevent kvm_lapic_set_base from
- * thinking that APIC state has changed.
- */
- vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE;
static_branch_inc(&apic_sw_disabled.key); /* sw disabled at reset */
kvm_iodevice_init(&apic->dev, &apic_mmio_ops);

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:38

by Sean Christopherson

Subject: [PATCH v2 30/46] KVM: SVM: Tweak order of cr0/cr4/efer writes at RESET/INIT

Hoist svm_set_cr0() up in the sequence of register initialization during
vCPU RESET/INIT, purely to match VMX so that a future patch can move the
sequences to common x86.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0101646e42e0..875d68c4cb9b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1262,18 +1262,13 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);

+ svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
svm_set_cr4(vcpu, 0);
svm_set_efer(vcpu, 0);
save->dr6 = 0xffff0ff0;
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;

- /*
- * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
- * It also updates the guest-visible cr0 value.
- */
- svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
-
save->cr4 = X86_CR4_PAE;

if (npt_enabled) {
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:48

by Sean Christopherson

Subject: [PATCH v2 31/46] KVM: SVM: Drop redundant writes to vmcb->save.cr4 at RESET/INIT

Drop direct writes to vmcb->save.cr4 during vCPU RESET/INIT, as the
values being written are fully redundant with respect to
svm_set_cr4(vcpu, 0) a few lines earlier. Note, svm_set_cr4() also
correctly forces X86_CR4_PAE when NPT is disabled.

No functional change intended.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 875d68c4cb9b..6eff7f1a4672 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1269,8 +1269,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;

- save->cr4 = X86_CR4_PAE;
-
if (npt_enabled) {
/* Setup VMCB for Nested Paging */
control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
@@ -1280,7 +1278,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
svm_clr_intercept(svm, INTERCEPT_CR3_WRITE);
save->g_pat = vcpu->arch.pat;
save->cr3 = 0;
- save->cr4 = 0;
}
svm->current_vmcb->asid_generation = 0;
svm->asid = 0;
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:49

by Sean Christopherson

Subject: [PATCH v2 36/46] KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86

Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to
common x86. VMX and SVM now have near-identical sequences, the only
difference being that VMX updates the exception bitmap. Updating the
bitmap on SVM is unnecessary, but benign. Unfortunately, the update can't be left
behind in VMX due to the need to update exception intercepts after the
control registers are set.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 6 ------
arch/x86/kvm/vmx/vmx.c | 9 ---------
arch/x86/kvm/x86.c | 8 ++++++++
3 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 251b230b2fef..ea4bea428078 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1262,12 +1262,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);

- svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
- svm_set_cr4(vcpu, 0);
- svm_set_efer(vcpu, 0);
- kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
- vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;
-
if (npt_enabled) {
/* Setup VMCB for Nested Paging */
control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 555235d6c17e..ef92ec40d3d9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4454,9 +4454,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
}

- kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
- kvm_rip_write(vcpu, 0xfff0);
-
vmcs_writel(GUEST_GDTR_BASE, 0);
vmcs_write32(GUEST_GDTR_LIMIT, 0xffff);

@@ -4484,12 +4481,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);

- vmx_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
- vmx_set_cr4(vcpu, 0);
- vmx_set_efer(vcpu, 0);
-
- vmx_update_exception_bitmap(vcpu);
-
vpid_sync_context(vmx->vpid);
if (init_event)
vmx_clear_hlt(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6a11ec5d38ac..3aa952edd5f4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10870,6 +10870,14 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

static_call(kvm_x86_vcpu_reset)(vcpu, init_event);

+ kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
+ kvm_rip_write(vcpu, 0xfff0);
+
+ static_call(kvm_x86_set_cr0)(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
+ static_call(kvm_x86_set_cr4)(vcpu, 0);
+ static_call(kvm_x86_set_efer)(vcpu, 0);
+ static_call(kvm_x86_update_exception_bitmap)(vcpu);
+
/*
* Reset the MMU context if paging was enabled prior to INIT (which is
* implied if CR0.PG=1 as CR0 will be '0' prior to RESET). Unlike the
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:50

by Sean Christopherson

Subject: [PATCH v2 34/46] KVM: VMX: Refresh list of user return MSRs after setting guest CPUID

After a CPUID update, refresh the list of user return MSRs that are
loaded into hardware when running the vCPU. This is necessary to handle
the oddball case where userspace exposes X86_FEATURE_RDTSCP to the guest
after the vCPU is running.
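
For illustration, the RDTSCP-dependent piece of the refresh, paraphrased
from vmx_setup_uret_msrs() (whether MSR_TSC_AUX is loaded into hardware
depends on guest CPUID, which can change after the vCPU has already run):

	vmx_setup_uret_msr(vmx, MSR_TSC_AUX,
			   guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP) ||
			   guest_cpuid_has(vcpu, X86_FEATURE_RDPID));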

Fixes: 0023ef39dc35 ("kvm: vmx: Set IA32_TSC_AUX for legacy mode guests")
Fixes: 4e47c7a6d714 ("KVM: VMX: Add instruction rdtscp support for guest")
Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d7a4db15a169..3045daa3ec30 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7159,6 +7159,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
/* xsaves_enabled is recomputed in vmx_compute_secondary_exec_control(). */
vcpu->arch.xsaves_enabled = false;

+ vmx_setup_uret_msrs(vmx);
+
if (cpu_has_secondary_exec_ctrls()) {
vmx_compute_secondary_exec_control(vmx);
vmcs_set_secondary_exec_control(vmx);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:50

by Sean Christopherson

Subject: [PATCH v2 35/46] KVM: VMX: Don't _explicitly_ reconfigure user return MSRs on vCPU INIT

When emulating vCPU INIT, do not unconditionally refresh the list of user
return MSRs that need to be loaded into hardware when running the guest.
Unconditionally refreshing the list is confusing, as the vast majority of
MSRs are not modified on INIT. The real motivation is to handle the case
where an INIT during long mode obviates the need to load the SYSCALL MSRs,
and that is handled as needed by vmx_set_efer().
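
For context, the long mode dependency referenced above, paraphrased from
vmx_setup_uret_msrs(); an INIT drops the vCPU out of long mode, so the
vmx_set_efer() call during RESET/INIT naturally drops the SYSCALL MSRs:

	load_syscall_msrs = is_long_mode(&vmx->vcpu) &&
			    (vmx->vcpu.arch.efer & EFER_SCE);

	vmx_setup_uret_msr(vmx, MSR_STAR, load_syscall_msrs);
	vmx_setup_uret_msr(vmx, MSR_LSTAR, load_syscall_msrs);
	vmx_setup_uret_msr(vmx, MSR_SYSCALL_MASK, load_syscall_msrs);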

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3045daa3ec30..555235d6c17e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4406,6 +4406,8 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->pt_desc.guest.output_mask = 0x7F;
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
+
+ vmx_setup_uret_msrs(vmx);
}

static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -4467,8 +4469,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (kvm_mpx_supported())
vmcs_write64(GUEST_BNDCFGS, 0);

- vmx_setup_uret_msrs(vmx);
-
if (cpu_has_vmx_msr_bitmap())
vmx_update_msr_bitmap(&vmx->vcpu);

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:37:51

by Sean Christopherson

Subject: [PATCH v2 33/46] KVM: VMX: Skip pointless MSR bitmap update when setting EFER

Split setup_msrs() into vmx_setup_uret_msrs() and an open coded refresh
of the MSR bitmap, and skip the latter when refreshing the user return
MSRs during an EFER load. Only the x2APIC MSRs are dynamically exposed
and hidden, and those are not affected by a change in EFER.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a1e5706fd27b..d7a4db15a169 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1647,11 +1647,12 @@ static void vmx_setup_uret_msr(struct vcpu_vmx *vmx, unsigned int msr,
}

/*
- * Set up the vmcs to automatically save and restore system
- * msrs. Don't touch the 64-bit msrs if the guest is in legacy
- * mode, as fiddling with msrs is very expensive.
+ * Configure user return MSRs to automatically save, load, and restore MSRs
+ * that need to be shoved into hardware when running the guest. Note, omitting
+ * an MSR here does _NOT_ mean it's not emulated, only that it will not be
+ * loaded into hardware when running the guest.
*/
-static void setup_msrs(struct vcpu_vmx *vmx)
+static void vmx_setup_uret_msrs(struct vcpu_vmx *vmx)
{
#ifdef CONFIG_X86_64
bool load_syscall_msrs;
@@ -1681,9 +1682,6 @@ static void setup_msrs(struct vcpu_vmx *vmx)
*/
vmx_setup_uret_msr(vmx, MSR_IA32_TSX_CTRL, boot_cpu_has(X86_FEATURE_RTM));

- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(&vmx->vcpu);
-
/*
* The set of MSRs to load may have changed, reload MSRs before the
* next VM-Enter.
@@ -2874,7 +2872,7 @@ int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)

msr->data = efer & ~EFER_LME;
}
- setup_msrs(vmx);
+ vmx_setup_uret_msrs(vmx);
return 0;
}

@@ -4469,7 +4467,10 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (kvm_mpx_supported())
vmcs_write64(GUEST_BNDCFGS, 0);

- setup_msrs(vmx);
+ vmx_setup_uret_msrs(vmx);
+
+ if (cpu_has_vmx_msr_bitmap())
+ vmx_update_msr_bitmap(&vmx->vcpu);

vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:03

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 19/46] KVM: x86: Move EDX initialization at vCPU RESET to common code

Move the EDX initialization at vCPU RESET, which is now identical between
VMX and SVM, into common code.

No functional change intended.
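
For reference, EDX at RESET carries the CPUID.0x1.EAX Family/Model/Stepping
encoding, so the 0x600 fallback decodes as (base fields only):

	0x600 -> family = 6 (bits 11:8), model = 0 (bits 7:4), stepping = 0 (bits 3:0)

i.e. a generic P6-class CPU.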

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 -----
arch/x86/kvm/svm/svm.c | 13 -------------
arch/x86/kvm/vmx/vmx.c | 6 ------
arch/x86/kvm/x86.c | 13 +++++++++++++
4 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 974cbfb1eefe..0ec988778db1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1772,11 +1772,6 @@ static inline unsigned long read_msr(unsigned long msr)
}
#endif

-static inline u32 get_rdx_init_val(void)
-{
- return 0x600; /* P6 family */
-}
-
static inline void kvm_inject_gp(struct kvm_vcpu *vcpu, u32 error_code)
{
kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f7486b1645de..268580713938 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1343,25 +1343,12 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_svm *svm = to_svm(vcpu);
- u32 dummy;
- u32 eax = 1;

svm->spec_ctrl = 0;
svm->virt_spec_ctrl = 0;

init_vmcb(vcpu);

- /*
- * Fall back to KVM's default Family/Model/Stepping if no CPUID match
- * is found. Note, it's impossible to get a match at RESET since KVM
- * emulates RESET before exposing the vCPU to userspace, i.e. it's
- * impossible for kvm_cpuid() to find a valid entry on RESET. But, go
- * through the motions in case that's ever remedied, and to be pedantic.
- */
- if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
- eax = get_rdx_init_val();
- kvm_rdx_write(vcpu, eax);
-
if (kvm_vcpu_apicv_active(vcpu) && !init_event)
avic_update_vapic_bar(svm, APIC_DEFAULT_PHYS_BASE);
}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ff82c05b948b..f506b94539ab 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4387,7 +4387,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- u32 eax, dummy;
u64 cr0;

if (!init_event)
@@ -4398,11 +4397,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

vmx->msr_ia32_umwait_control = 0;

- eax = 1;
- if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
- eax = get_rdx_init_val();
- kvm_rdx_write(vcpu, eax);
-
vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ffc4ca7d7b0..fd9026437fdd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10787,6 +10787,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
unsigned long old_cr0 = kvm_read_cr0(vcpu);
+ u32 eax, dummy;

kvm_lapic_reset(vcpu, init_event);

@@ -10853,6 +10854,18 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->arch.regs_avail = ~0;
vcpu->arch.regs_dirty = ~0;

+ /*
+ * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon)
+ * if no CPUID match is found. Note, it's impossible to get a match at
+ * RESET since KVM emulates RESET before exposing the vCPU to userspace,
+ * i.e. it's impossible for kvm_cpuid() to find a valid entry on RESET.
+ * But, go through the motions in case that's ever remedied.
+ */
+ eax = 1;
+ if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true))
+ eax = 0x600;
+ kvm_rdx_write(vcpu, eax);
+
vcpu->arch.ia32_xss = 0;

static_call(kvm_x86_vcpu_reset)(vcpu, init_event);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:11

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 40/46] KVM: VMX: Remove unnecessary initialization of msr_bitmap_mode

Don't bother initializing msr_bitmap_mode to 0; all of struct vcpu_vmx is
zero-initialized.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d5a174ff20f9..bc09a2f7cb5f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6835,7 +6835,6 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
}
- vmx->msr_bitmap_mode = 0;

vmx->loaded_vmcs = &vmx->vmcs01;

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:20

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 38/46] KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions

Drop unnecessary MSR bitmap updates during nested transitions, as L1's
APIC_BASE MSR is not modified by the standard VM-Enter/VM-Exit flows,
and L2's MSR bitmap is managed separately. In the unlikely event that L1
is pathological and loads APIC_BASE via the VM-Exit load list, KVM will
handle updating the bitmap in its normal WRMSR flows.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 6 ------
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/vmx.h | 1 -
3 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a77cfc8bcf11..0d0dd6580cfd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4305,9 +4305,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
kvm_set_dr(vcpu, 7, 0x400);
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);

- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(vcpu);
-
if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
vmcs12->vm_exit_msr_load_count))
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
@@ -4386,9 +4383,6 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)

kvm_mmu_reset_context(vcpu);

- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(vcpu);
-
/*
* This nasty bit of open coding is a compromise between blindly
* loading L1's MSRs using the exit load lists (incorrect emulation
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7e99535a4cbb..f605b43d28e1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3865,7 +3865,7 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
}
}

-void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
+static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u8 mode = vmx_msr_bitmap_mode(vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index b584e41bed44..1b3dd5ddf235 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -378,7 +378,6 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);

bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu);
void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
-void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
bool vmx_nmi_blocked(struct kvm_vcpu *vcpu);
bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu);
bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:29

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 44/46] KVM: SVM: Emulate #INIT in response to triple fault shutdown

Emulate a full #INIT instead of simply initializing the VMCB if the
guest hits a shutdown. Initializing the VMCB but not other vCPU state,
much of which is mirrored by the VMCB, results in incoherent and broken
vCPU state.

Ideally, KVM would not automatically init anything on shutdown, and
instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force
userspace to explicitly INIT or RESET the vCPU. Even better would be to
add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown
(and SMI on Intel CPUs).

But, that ship has sailed, and emulating #INIT is the next best thing as
that has at least some connection with reality since there exist bare
metal platforms that automatically INIT the CPU if it hits shutdown.

Fixes: 46fe4ddd9dbb ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 10 +++++++---
arch/x86/kvm/x86.c | 1 +
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ea4bea428078..285587a7fe80 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2058,11 +2058,15 @@ static int shutdown_interception(struct kvm_vcpu *vcpu)
return -EINVAL;

/*
- * VMCB is undefined after a SHUTDOWN intercept
- * so reinitialize it.
+ * VMCB is undefined after a SHUTDOWN intercept. INIT the vCPU to put
+ * the VMCB in a known good state. Unfortunately, KVM doesn't have
+ * KVM_MP_STATE_SHUTDOWN and can't add it without potentially breaking
+ * userspace. From a platform view, INIT is acceptable behavior as
+ * there exist bare metal platforms that automatically INIT the CPU
+ * in response to shutdown.
*/
clear_page(svm->vmcb);
- init_vmcb(vcpu);
+ kvm_vcpu_reset(vcpu, true);

kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
return 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3aa952edd5f4..f35dd8192c32 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10901,6 +10901,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (init_event)
kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
}
+EXPORT_SYMBOL_GPL(kvm_vcpu_reset);

void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
{
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:28

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 41/46] KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function

Consolidate all of the dynamic MSR bitmap adjustments into
vmx_update_msr_bitmap_x2apic(), and rename the mode tracker to reflect
that it is x2APIC specific. If KVM gains more cases of dynamic MSR
pass-through, odds are very good that those new cases will be better off
with their own logic, e.g. see Intel PT MSRs and MSR_IA32_SPEC_CTRL.

Attempting to handle all updates in a common helper did more harm than
good, as KVM ended up collecting a large number of useless "updates".

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 55 ++++++++++++++++--------------------------
arch/x86/kvm/vmx/vmx.h | 2 +-
2 files changed, 22 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bc09a2f7cb5f..cdde1dfaa574 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3812,21 +3812,6 @@ void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type)
vmx_set_msr_bitmap_write(msr_bitmap, msr);
}

-static u8 vmx_msr_bitmap_mode(struct kvm_vcpu *vcpu)
-{
- u8 mode = 0;
-
- if (cpu_has_secondary_exec_ctrls() &&
- (secondary_exec_controls_get(to_vmx(vcpu)) &
- SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
- mode |= MSR_BITMAP_MODE_X2APIC;
- if (enable_apicv && kvm_vcpu_apicv_active(vcpu))
- mode |= MSR_BITMAP_MODE_X2APIC_APICV;
- }
-
- return mode;
-}
-
static void vmx_reset_x2apic_msrs(struct kvm_vcpu *vcpu, u8 mode)
{
unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
@@ -3844,11 +3829,29 @@ static void vmx_reset_x2apic_msrs(struct kvm_vcpu *vcpu, u8 mode)
}
}

-static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
+static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu)
{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+ u8 mode;
+
if (!cpu_has_vmx_msr_bitmap())
return;

+ if (cpu_has_secondary_exec_ctrls() &&
+ (secondary_exec_controls_get(vmx) &
+ SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
+ mode = MSR_BITMAP_MODE_X2APIC;
+ if (enable_apicv && kvm_vcpu_apicv_active(vcpu))
+ mode |= MSR_BITMAP_MODE_X2APIC_APICV;
+ } else {
+ mode = 0;
+ }
+
+ if (!(mode ^ vmx->x2apic_msr_bitmap_mode))
+ return;
+
+ vmx->x2apic_msr_bitmap_mode = mode;
+
vmx_reset_x2apic_msrs(vcpu, mode);

/*
@@ -3865,21 +3868,6 @@ static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode)
}
}

-static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
-{
- struct vcpu_vmx *vmx = to_vmx(vcpu);
- u8 mode = vmx_msr_bitmap_mode(vcpu);
- u8 changed = mode ^ vmx->msr_bitmap_mode;
-
- if (!changed)
- return;
-
- if (changed & (MSR_BITMAP_MODE_X2APIC | MSR_BITMAP_MODE_X2APIC_APICV))
- vmx_update_msr_bitmap_x2apic(vcpu, mode);
-
- vmx->msr_bitmap_mode = mode;
-}
-
void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4139,8 +4127,7 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
}

- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(vcpu);
+ vmx_update_msr_bitmap_x2apic(vcpu);
}

u32 vmx_exec_control(struct vcpu_vmx *vmx)
@@ -6186,7 +6173,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
}
secondary_exec_controls_set(vmx, sec_exec_control);

- vmx_update_msr_bitmap(vcpu);
+ vmx_update_msr_bitmap_x2apic(vcpu);
}

static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 1b3dd5ddf235..e370091d57c6 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -229,7 +229,7 @@ struct nested_vmx {
struct vcpu_vmx {
struct kvm_vcpu vcpu;
u8 fail;
- u8 msr_bitmap_mode;
+ u8 x2apic_msr_bitmap_mode;

/*
* If true, host state has been stored in vmx->loaded_vmcs for
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:30

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 29/46] KVM: nVMX: Don't evaluate "emulation required" on nested VM-Exit

Use the "internal" variants of setting segment registers when stuffing
state on nested VM-Exit in order to skip the "emulation required"
updates. VM-Exit must always go to protected mode, and all segments are
mostly hardcoded (to valid values) on VM-Exit. The bits of the segments
that aren't hardcoded are explicitly checked during VM-Enter, e.g. the
selector RPLs must all be zero.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 16 ++++++++--------
arch/x86/kvm/vmx/vmx.c | 6 ++----
arch/x86/kvm/vmx/vmx.h | 2 +-
3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 7f8184f432b4..a77cfc8bcf11 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4267,7 +4267,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
seg.l = 1;
else
seg.db = 1;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_CS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_CS);
seg = (struct kvm_segment) {
.base = 0,
.limit = 0xFFFFFFFF,
@@ -4278,17 +4278,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
.g = 1
};
seg.selector = vmcs12->host_ds_selector;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_DS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_DS);
seg.selector = vmcs12->host_es_selector;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_ES);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_ES);
seg.selector = vmcs12->host_ss_selector;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_SS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_SS);
seg.selector = vmcs12->host_fs_selector;
seg.base = vmcs12->host_fs_base;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_FS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_FS);
seg.selector = vmcs12->host_gs_selector;
seg.base = vmcs12->host_gs_base;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_GS);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_GS);
seg = (struct kvm_segment) {
.base = vmcs12->host_tr_base,
.limit = 0x67,
@@ -4296,11 +4296,11 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
.type = 11,
.present = 1
};
- vmx_set_segment(vcpu, &seg, VCPU_SREG_TR);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_TR);

memset(&seg, 0, sizeof(seg));
seg.unusable = 1;
- vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
+ __vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);

kvm_set_dr(vcpu, 7, 0x400);
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7ab493708b06..a1e5706fd27b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2719,8 +2719,6 @@ static __init int alloc_kvm_area(void)
return 0;
}

-static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
-
static void fix_pmode_seg(struct kvm_vcpu *vcpu, int seg,
struct kvm_segment *save)
{
@@ -3293,7 +3291,7 @@ static u32 vmx_segment_access_rights(struct kvm_segment *var)
return ar;
}

-static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
@@ -3330,7 +3328,7 @@ static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, in
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
}

-void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
__vmx_set_segment(vcpu, var, seg);

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 3979a947933a..b584e41bed44 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -373,7 +373,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
void set_cr4_guest_host_mask(struct vcpu_vmx *vmx);
void ept_save_pdptrs(struct kvm_vcpu *vcpu);
void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
-void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);

bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:30

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 12/46] KVM: x86: Remove defunct BSP "update" in local APIC reset

Remove a BSP APIC update in kvm_lapic_reset() that is a glorified and
confusing nop. When the code was originally added, kvm_vcpu_is_bsp()
queried kvm->arch.bsp_vcpu, i.e. the intent was to set the BSP bit in the
BSP vCPU's APIC. But, stuffing the BSP bit at INIT was wrong since the
guest can change its BSP(s); this was fixed by commit 58d269d8cccc ("KVM:
x86: BSP in MSR_IA32_APICBASE is writable").

In other words, kvm_vcpu_is_bsp() is now purely a reflection of
vcpu->arch.apic_base.MSR_IA32_APICBASE_BSP, thus the update will always
set the current value and kvm_lapic_set_base() is effectively a nop if
the new and old values match. The RESET case, which does need to stuff
the BSP for the reset vCPU, is handled by vendor code (though this will
soon be moved to common code).
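
Concretely, kvm_vcpu_is_bsp() is now roughly:

	static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
	{
		/* purely a reflection of the vCPU's current APIC base */
		return (vcpu->arch.apic_base & MSR_IA32_APICBASE_BSP) != 0;
	}

so OR'ing MSR_IA32_APICBASE_BSP back into apic_base when the helper returns
true can only rewrite a bit that is already set.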

No functional change intended.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index add4dd1e3528..a24ce8fe93e5 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2367,9 +2367,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
apic->highest_isr_cache = -1;
update_divide_count(apic);
atomic_set(&apic->lapic_timer.pending, 0);
- if (kvm_vcpu_is_bsp(vcpu))
- kvm_lapic_set_base(vcpu,
- vcpu->arch.apic_base | MSR_IA32_APICBASE_BSP);
+
vcpu->arch.pv_eoi.msr_val = 0;
apic_update_ppr(apic);
if (vcpu->arch.apicv_active) {
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:30

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 28/46] KVM: VMX: Skip emulation required checks during pmode/rmode transitions

Don't refresh "emulation required" when stuffing segments during
transitions to/from real mode when running without unrestricted guest.
The checks are unnecessary as vmx_set_cr0() unconditionally rechecks
"emulation required". They also happen to be broken, as enter_pmode()
and enter_rmode() run with a stale vcpu->arch.cr0.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 45b123bb5aaa..7ab493708b06 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2719,6 +2719,8 @@ static __init int alloc_kvm_area(void)
return 0;
}

+static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+
static void fix_pmode_seg(struct kvm_vcpu *vcpu, int seg,
struct kvm_segment *save)
{
@@ -2735,7 +2737,7 @@ static void fix_pmode_seg(struct kvm_vcpu *vcpu, int seg,
save->dpl = save->selector & SEGMENT_RPL_MASK;
save->s = 1;
}
- vmx_set_segment(vcpu, save, seg);
+ __vmx_set_segment(vcpu, save, seg);
}

static void enter_pmode(struct kvm_vcpu *vcpu)
@@ -2756,7 +2758,7 @@ static void enter_pmode(struct kvm_vcpu *vcpu)

vmx->rmode.vm86_active = 0;

- vmx_set_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_TR], VCPU_SREG_TR);
+ __vmx_set_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_TR], VCPU_SREG_TR);

flags = vmcs_readl(GUEST_RFLAGS);
flags &= RMODE_GUEST_OWNED_EFLAGS_BITS;
@@ -3291,7 +3293,7 @@ static u32 vmx_segment_access_rights(struct kvm_segment *var)
return ar;
}

-void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+static void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
@@ -3304,7 +3306,7 @@ void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
vmcs_write16(sf->selector, var->selector);
else if (var->s)
fix_rmode_seg(seg, &vmx->rmode.segs[seg]);
- goto out;
+ return;
}

vmcs_writel(sf->base, var->base);
@@ -3326,9 +3328,13 @@ void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
var->type |= 0x1; /* Accessed */

vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
+}

-out:
- vmx->emulation_required = emulation_required(vcpu);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+{
+ __vmx_set_segment(vcpu, var, seg);
+
+ to_vmx(vcpu)->emulation_required = emulation_required(vcpu);
}

static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:43

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 45/46] KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESET

Drop redundant clears of vcpu->arch.hflags in init_vmcb() now that
init_vmcb() is invoked only through kvm_vcpu_reset(), which always clears
hflags. And of course, the second clearing in init_vmcb() was always
redundant.

Suggested-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/svm.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 285587a7fe80..46d341f57e26 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1177,8 +1177,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
struct vmcb_control_area *control = &svm->vmcb->control;
struct vmcb_save_area *save = &svm->vmcb->save;

- vcpu->arch.hflags = 0;
-
svm_set_intercept(svm, INTERCEPT_CR0_READ);
svm_set_intercept(svm, INTERCEPT_CR3_READ);
svm_set_intercept(svm, INTERCEPT_CR4_READ);
@@ -1277,7 +1275,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)

svm->nested.vmcb12_gpa = INVALID_GPA;
svm->nested.last_vmcb12_gpa = INVALID_GPA;
- vcpu->arch.hflags = 0;

if (!kvm_pause_in_guest(vcpu->kvm)) {
control->pause_filter_count = pause_filter_count;
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:38:46

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 18/46] KVM: x86: Consolidate APIC base RESET initialization code

Consolidate the APIC base RESET logic, which is currently spread out
across both x86 and vendor code. For an in-kernel APIC, the vendor code
is redundant. But for a userspace APIC, KVM relies on the vendor code
to initialize vcpu->arch.apic_base. Hoist the vcpu->arch.apic_base
initialization above the !apic check so that it applies to both flavors
of APIC emulation, and delete the vendor code.

Reviewed-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/lapic.c | 14 ++++++++------
arch/x86/kvm/svm/svm.c | 6 ------
arch/x86/kvm/vmx/vmx.c | 7 -------
3 files changed, 8 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 295a9d02a9a5..76fb00921203 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2323,18 +2323,20 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
struct kvm_lapic *apic = vcpu->arch.apic;
int i;

- if (!apic)
- return;
-
- /* Stop the timer in case it's a reset to an active apic */
- hrtimer_cancel(&apic->lapic_timer.timer);
-
if (!init_event) {
vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
MSR_IA32_APICBASE_ENABLE;
if (kvm_vcpu_is_reset_bsp(vcpu))
vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
+ }

+ if (!apic)
+ return;
+
+ /* Stop the timer in case it's a reset to an active apic */
+ hrtimer_cancel(&apic->lapic_timer.timer);
+
+ if (!init_event) {
apic->base_address = APIC_DEFAULT_PHYS_BASE;

kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index cef9520fe77f..f7486b1645de 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1349,12 +1349,6 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
svm->spec_ctrl = 0;
svm->virt_spec_ctrl = 0;

- if (!init_event) {
- vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
- MSR_IA32_APICBASE_ENABLE;
- if (kvm_vcpu_is_reset_bsp(vcpu))
- vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
- }
init_vmcb(vcpu);

/*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e6cc389ec697..ff82c05b948b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4406,13 +4406,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);

- if (!init_event) {
- vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
- MSR_IA32_APICBASE_ENABLE;
- if (kvm_vcpu_is_reset_bsp(vcpu))
- vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
- }
-
vmx_segment_cache_clear(vmx);

seg_setup(VCPU_SREG_CS);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:39:01

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 21/46] KVM: VMX: Invert handling of CR0.WP for EPT without unrestricted guest

Opt in to forcing CR0.WP=1 for shadow paging, and stop lying about WP
being "always on" for unrestricted guest. In addition to making KVM a
wee bit more honest, this paves the way for additional cleanup.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f506b94539ab..02aec75ec6f6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -136,8 +136,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
#define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
#define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
#define KVM_VM_CR0_ALWAYS_ON \
- (KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | \
- X86_CR0_WP | X86_CR0_PG | X86_CR0_PE)
+ (KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | X86_CR0_PG | X86_CR0_PE)

#define KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR4_VMXE
#define KVM_PMODE_VM_CR4_ALWAYS_ON (X86_CR4_PAE | X86_CR4_VMXE)
@@ -2995,9 +2994,7 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
}

-static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
- unsigned long cr0,
- struct kvm_vcpu *vcpu)
+static void ept_update_paging_mode_cr0(unsigned long cr0, struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -3016,9 +3013,6 @@ static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
vcpu->arch.cr0 = cr0;
vmx_set_cr4(vcpu, kvm_read_cr4(vcpu));
}
-
- if (!(cr0 & X86_CR0_WP))
- *hw_cr0 &= ~X86_CR0_WP;
}

void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
@@ -3031,6 +3025,8 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST;
else {
hw_cr0 |= KVM_VM_CR0_ALWAYS_ON;
+ if (!enable_ept)
+ hw_cr0 |= X86_CR0_WP;

if (vmx->rmode.vm86_active && (cr0 & X86_CR0_PE))
enter_pmode(vcpu);
@@ -3049,7 +3045,7 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
#endif

if (enable_ept && !is_unrestricted_guest(vcpu))
- ept_update_paging_mode_cr0(&hw_cr0, cr0, vcpu);
+ ept_update_paging_mode_cr0(cr0, vcpu);

vmcs_writel(CR0_READ_SHADOW, cr0);
vmcs_writel(GUEST_CR0, hw_cr0);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:39:18

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 26/46] KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0

Skip the MMU permission_fault() check if paging is disabled when
verifying the cached MMIO GVA is usable. The check is unnecessary and
can theoretically get a false positive since the MMU doesn't zero out
"permissions" or "pkru_mask" when guest paging is disabled.

The obvious alternative is to zero out all the bitmasks when configuring
nonpaging MMUs, but that's unnecessary work and doesn't align with the
MMU's general approach of doing as little as possible for flows that are
supposed to be unreachable.

This is nearly a nop as the false positive is nothing more than an
insignificant performance blip, and more or less limited to string MMIO
when L1 is running with paging disabled. KVM doesn't cache MMIO if L2 is
active with nested TDP since the "GVA" is really an L2 GPA. If L2 is
active without nested TDP, then paging can't be disabled as neither VMX
nor SVM allows entering the guest without paging of some form.

Jumping back to L1 with paging disabled, in that case direct_map is true
and so KVM will use CR2 as a GPA; the only time it doesn't is if the fault
from the emulator doesn't match, or emulator_can_use_gpa() fails, and those
cases arise only on string MMIO and other instructions with multiple memory
operands.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd9026437fdd..6a11ec5d38ac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6557,9 +6557,9 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
* there is no pkey in EPT page table for L1 guest or EPT
* shadow page table for L2 guest.
*/
- if (vcpu_match_mmio_gva(vcpu, gva)
- && !permission_fault(vcpu, vcpu->arch.walk_mmu,
- vcpu->arch.mmio_access, 0, access)) {
+ if (vcpu_match_mmio_gva(vcpu, gva) && (!is_paging(vcpu) ||
+ !permission_fault(vcpu, vcpu->arch.walk_mmu,
+ vcpu->arch.mmio_access, 0, access))) {
*gpa = vcpu->arch.mmio_gfn << PAGE_SHIFT |
(gva & (PAGE_SIZE - 1));
trace_vcpu_match_mmio(gva, *gpa, write, false);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:39:21

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 32/46] KVM: SVM: Stuff save->dr6 during VMSA sync, not at RESET/INIT

Move code to stuff vmcb->save.dr6 to its architectural init value from
svm_vcpu_reset() into sev_es_sync_vmsa(). Except for protected guests,
a.k.a. SEV-ES guests, vmcb->save.dr6 is set during VM-Enter, i.e. the
extra write is unnecessary. For SEV-ES, stuffing save->dr6 handles a
theoretical case where the VMSA could be encrypted before the first
KVM_RUN.
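
For the non-protected case, the VM-Enter write referred to above happens in
the run loop; svm_vcpu_run() does roughly:

	if (unlikely(svm->vmcb->save.dr6 != vcpu->arch.dr6))
		svm_set_dr6(svm, vcpu->arch.dr6);

so save.dr6 is resynced from vcpu->arch.dr6 before the guest ever runs.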

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/sev.c | 1 +
arch/x86/kvm/svm/svm.c | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 8d36f0c73071..e34ee60fc9d7 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -583,6 +583,7 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
save->xcr0 = svm->vcpu.arch.xcr0;
save->pkru = svm->vcpu.arch.pkru;
save->xss = svm->vcpu.arch.ia32_xss;
+ save->dr6 = svm->vcpu.arch.dr6;

/*
* SEV-ES will use a VMSA that is pointed to by the VMCB, not
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6eff7f1a4672..251b230b2fef 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1265,7 +1265,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
svm_set_cr4(vcpu, 0);
svm_set_efer(vcpu, 0);
- save->dr6 = 0xffff0ff0;
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
vcpu->arch.regs[VCPU_REGS_RIP] = 0x0000fff0;

--
2.32.0.93.g670b81a890-goog

2021-07-13 16:39:37

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 37/46] KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INIT

Remove an unnecessary MSR bitmap refresh during vCPU RESET/INIT. In both
cases, the MSR bitmap already has the desired values and state.

At RESET, the vCPU is guaranteed to be running with x2APIC disabled, the
x2APIC MSRs are guaranteed to be intercepted due to the MSR bitmap being
initialized to all ones by alloc_loaded_vmcs(), and vmx->msr_bitmap_mode
is guaranteed to be zero, i.e. reflecting x2APIC disabled.
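
The "all ones" guarantee comes straight from VMCS allocation, i.e.
alloc_loaded_vmcs() starts out intercepting everything (roughly):

	/* default: intercept reads and writes of every MSR */
	memset(loaded_vmcs->msr_bitmap, 0xff, PAGE_SIZE);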

At INIT, the APIC_BASE MSR is not modified, thus there can't be any
change in x2APIC state.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ef92ec40d3d9..7e99535a4cbb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4466,9 +4466,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (kvm_mpx_supported())
vmcs_write64(GUEST_BNDCFGS, 0);

- if (cpu_has_vmx_msr_bitmap())
- vmx_update_msr_bitmap(&vmx->vcpu);
-
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */

if (cpu_has_vmx_tpr_shadow() && !init_event) {
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:39:38

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 43/46] KVM: VMX: Move RESET-only VMWRITE sequences to init_vmcs()

Move VMWRITE sequences in vmx_vcpu_reset() guarded by !init_event into
init_vmcs() to make it more obvious that they're, uh, initializing the
VMCS.

No meaningful functional change intended (though the order of VMWRITEs
and whatnot is different).

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++---------------
1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4acfb2f450e6..97fa2aa676bd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4393,6 +4393,19 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}

+ vmcs_write32(GUEST_SYSENTER_CS, 0);
+ vmcs_writel(GUEST_SYSENTER_ESP, 0);
+ vmcs_writel(GUEST_SYSENTER_EIP, 0);
+ vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
+
+ if (cpu_has_vmx_tpr_shadow()) {
+ vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
+ if (cpu_need_tpr_shadow(&vmx->vcpu))
+ vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
+ __pa(vmx->vcpu.arch.apic->regs));
+ vmcs_write32(TPR_THRESHOLD, 0);
+ }
+
vmx_setup_uret_msrs(vmx);
}

@@ -4433,13 +4446,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmcs_write32(GUEST_LDTR_LIMIT, 0xffff);
vmcs_write32(GUEST_LDTR_AR_BYTES, 0x00082);

- if (!init_event) {
- vmcs_write32(GUEST_SYSENTER_CS, 0);
- vmcs_writel(GUEST_SYSENTER_ESP, 0);
- vmcs_writel(GUEST_SYSENTER_EIP, 0);
- vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
- }
-
vmcs_writel(GUEST_GDTR_BASE, 0);
vmcs_write32(GUEST_GDTR_LIMIT, 0xffff);

@@ -4454,14 +4460,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */

- if (cpu_has_vmx_tpr_shadow() && !init_event) {
- vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0);
- if (cpu_need_tpr_shadow(vcpu))
- vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
- __pa(vcpu->arch.apic->regs));
- vmcs_write32(TPR_THRESHOLD, 0);
- }
-
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);

vpid_sync_context(vmx->vpid);
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:40:17

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 39/46] KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is changed

Drop an explicit call to update the x2APIC MSRs when the userspace MSR
filter is modified. The x2APIC MSRs are deliberately exempt from
userspace filtering.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f605b43d28e1..d5a174ff20f9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3936,7 +3936,6 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
}

pt_update_intercept_for_msr(vcpu);
- vmx_update_msr_bitmap_x2apic(vcpu, vmx_msr_bitmap_mode(vcpu));
}

static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:40:23

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 42/46] KVM: VMX: Remove redundant write to set vCPU as active at RESET/INIT

Drop a call to vmx_clear_hlt() during vCPU INIT; the guest's activity
state is unconditionally set to "active" a few lines earlier in
vmx_vcpu_reset().

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cdde1dfaa574..4acfb2f450e6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4465,8 +4465,6 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);

vpid_sync_context(vmx->vpid);
- if (init_event)
- vmx_clear_hlt(vcpu);
}

static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
--
2.32.0.93.g670b81a890-goog

2021-07-13 16:40:39

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 46/46] KVM: x86: Preserve guest's CR0.CD/NW on INIT

Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
defined by both Intel's SDM and AMD's APM.

Note, current versions of Intel's SDM are very poorly written with
respect to INIT behavior. Table 9-1. "IA-32 and Intel 64 Processor
States Following Power-up, Reset, or INIT" quite clearly lists power-up,
RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1. But the SDM
then attempts to qualify CD/NW behavior in a footnote:

2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
are cleared.

Presumably that footnote is only meant for INIT, as the RESET case and
especially the power-up case are rather non-sensical. Another footnote
all but confirms that:

6. Internal caches are invalid after power-up and RESET, but left
unchanged with an INIT.

Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
else can hack their BIOS to check RESET and power-up :-D).
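
For reference, CR0=60000010H is exactly the three bits in question and
nothing else:

	0x60000010 = BIT(30) | BIT(29) | BIT(4)
	           =  CR0.CD |  CR0.NW |  CR0.ET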

Reported-by: Reiji Watanabe <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f35dd8192c32..3f0226259496 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10787,6 +10787,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
unsigned long old_cr0 = kvm_read_cr0(vcpu);
+ unsigned long new_cr0;
u32 eax, dummy;

kvm_lapic_reset(vcpu, init_event);
@@ -10873,7 +10874,18 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
kvm_rip_write(vcpu, 0xfff0);

- static_call(kvm_x86_set_cr0)(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
+ /*
+ * CR0.CD/NW are set on RESET, preserved on INIT. Note, some versions
+ * of Intel's SDM list CD/NW as being set on INIT, but they contradict
+ * (or qualify) that with a footnote stating that CD/NW are preserved.
+ */
+ new_cr0 = X86_CR0_ET;
+ if (init_event)
+ new_cr0 |= (old_cr0 & (X86_CR0_NW | X86_CR0_CD));
+ else
+ new_cr0 |= X86_CR0_NW | X86_CR0_CD;
+
+ static_call(kvm_x86_set_cr0)(vcpu, new_cr0);
static_call(kvm_x86_set_cr4)(vcpu, 0);
static_call(kvm_x86_set_efer)(vcpu, 0);
static_call(kvm_x86_update_exception_bitmap)(vcpu);
--
2.32.0.93.g670b81a890-goog

2021-07-20 04:39:12

by Reiji Watanabe

[permalink] [raw]
Subject: Re: [PATCH v2 45/46] KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESET

On Tue, Jul 13, 2021 at 9:35 AM Sean Christopherson <[email protected]> wrote:
>
> Drop redundant clears of vcpu->arch.hflags in init_vmcb() now that
> init_vmcb() is invoked only through kvm_vcpu_reset(), which always clears
> hflags. And of course, the second clearing in init_vmcb() was always
> redundant.
>
> Suggested-by: Reiji Watanabe <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Reiji Watanabe <[email protected]>

Thank you for removing the redundant code.

Regards,
Reiji

2021-07-20 04:39:57

by Reiji Watanabe

[permalink] [raw]
Subject: Re: [PATCH v2 46/46] KVM: x86: Preserve guest's CR0.CD/NW on INIT

On Tue, Jul 13, 2021 at 9:35 AM Sean Christopherson <[email protected]> wrote:
>
> Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
> defined by both Intel's SDM and AMD's APM.
>
> Note, current versions of Intel's SDM are very poorly written with
> respect to INIT behavior. Table 9-1. "IA-32 and Intel 64 Processor
> States Following Power-up, Reset, or INIT" quite clearly lists power-up,
> RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1. But the SDM
> then attempts to qualify CD/NW behavior in a footnote:
>
> 2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
> are cleared.
>
> Presumably that footnote is only meant for INIT, as the RESET case and
> especially the power-up case are rather non-sensical. Another footnote
> all but confirms that:
>
> 6. Internal caches are invalid after power-up and RESET, but left
> unchanged with an INIT.
>
> Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
> else can hack their BIOS to check RESET and power-up :-D).
>
> Reported-by: Reiji Watanabe <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: Reiji Watanabe <[email protected]>

Thank you for the fix and checking the CD/NW with the bare metal testing.

Regards,
Reiji

2021-07-26 20:35:51

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 09/46] KVM: SVM: Drop a redundant init_vmcb() from svm_create_vcpu()

On 13/07/21 18:32, Sean Christopherson wrote:
> Drop an extra init_vmcb() from svm_create_vcpu(), svm_vcpu_reset() is
> guaranteed to call init_vmcb() and there are no consumers of the VMCB
> data between ->vcpu_create() and ->vcpu_reset(). Keep the call to
> svm_switch_vmcb() as sev_es_create_vcpu() touches the current VMCB, but
> hoist it up a few lines to associate the switch with the allocation of
> vmcb01.
>
> Reviewed-by: Reiji Watanabe <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/svm/svm.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 44248548be7d..cef9520fe77f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1431,15 +1431,13 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
>
> svm->vmcb01.ptr = page_address(vmcb01_page);
> svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
> + svm_switch_vmcb(svm, &svm->vmcb01);
>
> if (vmsa_page)
> svm->vmsa = page_address(vmsa_page);
>
> svm->guest_state_loaded = false;
>
> - svm_switch_vmcb(svm, &svm->vmcb01);
> - init_vmcb(vcpu);
> -
> svm_init_osvw(vcpu);
> vcpu->arch.microcode_version = 0x01000065;

While this patch makes sense, I'd rather not include it to reduce the
part of the code in which svm->vmcb is NULL. See for example the issue
reported by the static analyzer in which
svm_hv_vmcb_dirty_nested_enlightenments (called by svm_vcpu_init_msrpm)
dereferences svm->vmcb.

Paolo

2021-07-26 21:02:10

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 41/46] KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function

On 13/07/21 18:33, Sean Christopherson wrote:
> + if (!(mode ^ vmx->x2apic_msr_bitmap_mode))
> + return;

Just !=, I guess?

Paolo

> + vmx->x2apic_msr_bitmap_mode = mode;

2021-07-26 21:06:18

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 45/46] KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESET

On 13/07/21 18:33, Sean Christopherson wrote:
> Drop redundant clears of vcpu->arch.hflags in init_vmcb() now that
> init_vmcb() is invoked only through kvm_vcpu_reset(),

Not true if patch 9 is kept, but at this point hflags is zero anyway, so
the patch is okay.

Paolo

> which always clears
> hflags. And of course, the second clearing in init_vmcb() was always
> redundant.

2021-07-26 21:14:25

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/46] KVM: x86: vCPU RESET/INIT fixes and consolidation

On 13/07/21 18:32, Sean Christopherson wrote:
> The end goal of this series is to consolidate the RESET/INIT code, both to
> deduplicate code and to try to avoid divergent behavior/bugs, e.g. SVM only
> recently started updating vcpu->arch.cr4 on INIT.
>
> The TL;DR of why it takes 40+ patches to get there is that the RESET/INIT
> flows have multiple latent bugs and hidden dependencies, but "work"
> because they're rarely touched, are mostly fixed flows in both KVM and the
> guest, and because guests don't sanity check state after INIT.
>
> While several of the patches have Fixes tags, I am absolutely terrified of
> backporting most of them due to the likelihood of breaking a different
> version of KVM. And, for the most part the bugs are benign in the sense
> no guest has actually encountered any of these bugs. For that reason, I
> intentionally omitted stable@ entirely. The only patches I would consider
> even remotely safe for backporting are the first four patches in the series.
>
> v2:
> - Collect Reviews. [Reiji]
> - Fix an apic->base_address initialization goof. [Reiji]
> - Add patch to flush TLB on INIT. [Reiji]
> - Add patch to preserved CR0.CD/NW on INIT. [Reiji]
> - Add patch to emulate #INIT after shutdown on SVM. [Reiji]
> - Add patch to consolidate arch.hflags code. [Reiji]
> - Drop patch to omit VMWRITE zeroing. [Paolo, Jim]
> - Drop several MMU patches (moved to other series).
>
> v1: https://lkml.kernel.org/r/[email protected]
>
> Sean Christopherson (46):
> KVM: x86: Flush the guest's TLB on INIT
> KVM: nVMX: Set LDTR to its architecturally defined value on nested
> VM-Exit
> KVM: SVM: Zero out GDTR.base and IDTR.base on INIT
> KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping
> KVM: SVM: Require exact CPUID.0x1 match when stuffing EDX at INIT
> KVM: SVM: Fall back to KVM's hardcoded value for EDX at RESET/INIT
> KVM: VMX: Remove explicit MMU reset in enter_rmode()
> KVM: SVM: Drop explicit MMU reset at RESET/INIT
> KVM: SVM: Drop a redundant init_vmcb() from svm_create_vcpu()
> KVM: VMX: Move init_vmcs() invocation to vmx_vcpu_reset()
> KVM: x86: WARN if the APIC map is dirty without an in-kernel local
> APIC
> KVM: x86: Remove defunct BSP "update" in local APIC reset
> KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSP
> KVM: x86: Don't force set BSP bit when local APIC is managed by
> userspace
> KVM: x86: Set BSP bit in reset BSP vCPU's APIC base by default
> KVM: VMX: Stuff vcpu->arch.apic_base directly at vCPU RESET
> KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU
> RESET
> KVM: x86: Consolidate APIC base RESET initialization code
> KVM: x86: Move EDX initialization at vCPU RESET to common code
> KVM: SVM: Don't bother writing vmcb->save.rip at vCPU RESET/INIT
> KVM: VMX: Invert handling of CR0.WP for EPT without unrestricted guest
> KVM: VMX: Remove direct write to vcpu->arch.cr0 during vCPU RESET/INIT
> KVM: VMX: Fold ept_update_paging_mode_cr0() back into vmx_set_cr0()
> KVM: nVMX: Do not clear CR3 load/store exiting bits if L1 wants 'em
> KVM: VMX: Pull GUEST_CR3 from the VMCS iff CR3 load exiting is
> disabled
> KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0
> KVM: VMX: Process CR0.PG side effects after setting CR0 assets
> KVM: VMX: Skip emulation required checks during pmode/rmode
> transitions
> KVM: nVMX: Don't evaluate "emulation required" on nested VM-Exit
> KVM: SVM: Tweak order of cr0/cr4/efer writes at RESET/INIT
> KVM: SVM: Drop redundant writes to vmcb->save.cr4 at RESET/INIT
> KVM: SVM: Stuff save->dr6 at during VMSA sync, not at RESET/INIT
> KVM: VMX: Skip pointless MSR bitmap update when setting EFER
> KVM: VMX: Refresh list of user return MSRs after setting guest CPUID
> KVM: VMX: Don't _explicitly_ reconfigure user return MSRs on vCPU INIT
> KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86
> KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INIT
> KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions
> KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is
> changed
> KVM: VMX: Remove unnecessary initialization of msr_bitmap_mode
> KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function
> KVM: VMX: Remove redundant write to set vCPU as active at RESET/INIT
> KVM: VMX: Move RESET-only VMWRITE sequences to init_vmcs()
> KVM: SVM: Emulate #INIT in response to triple fault shutdown
> KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESET
> KVM: x86: Preserve guest's CR0.CD/NW on INIT
>
> arch/x86/include/asm/kvm_host.h | 5 -
> arch/x86/kvm/i8254.c | 3 +-
> arch/x86/kvm/lapic.c | 26 +--
> arch/x86/kvm/svm/sev.c | 1 +
> arch/x86/kvm/svm/svm.c | 48 ++----
> arch/x86/kvm/vmx/nested.c | 24 ++-
> arch/x86/kvm/vmx/vmx.c | 270 +++++++++++++++-----------------
> arch/x86/kvm/vmx/vmx.h | 5 +-
> arch/x86/kvm/x86.c | 52 +++++-
> 9 files changed, 211 insertions(+), 223 deletions(-)
>

Queued, except for patches 9-10.

I'd rather have init_vmcb/svm_vcpu_reset look more like the VMX code,
with the INIT code moved to svm_vcpu_reset and the rest remaining in
init_vmcb.

Paolo

2021-07-26 22:24:06

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 41/46] KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function

On 27/07/21 00:21, Sean Christopherson wrote:
> On Mon, Jul 26, 2021, Paolo Bonzini wrote:
>> On 13/07/21 18:33, Sean Christopherson wrote:
>>> + if (!(mode ^ vmx->x2apic_msr_bitmap_mode))
>>> + return;
>> Just !=, I guess?
> Ha, yeah. Forgot to do a bit of critical thinking after refactoring.
>

Well, == even since I assume you don't want the ! in front. :)

Paolo

2021-07-26 22:24:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 41/46] KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function

On Mon, Jul 26, 2021, Paolo Bonzini wrote:
> On 13/07/21 18:33, Sean Christopherson wrote:
> > + if (!(mode ^ vmx->x2apic_msr_bitmap_mode))
> > + return;
>
> Just !=, I guess?

Ha, yeah. Forgot to do a bit of critical thinking after refactoring.

2021-07-26 22:27:45

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 09/46] KVM: SVM: Drop a redundant init_vmcb() from svm_create_vcpu()

On Mon, Jul 26, 2021, Paolo Bonzini wrote:
> On 13/07/21 18:32, Sean Christopherson wrote:
> > Drop an extra init_vmcb() from svm_create_vcpu(), svm_vcpu_reset() is
> > guaranteed to call init_vmcb() and there are no consumers of the VMCB
> > data between ->vcpu_create() and ->vcpu_reset(). Keep the call to
> > svm_switch_vmcb() as sev_es_create_vcpu() touches the current VMCB, but
> > hoist it up a few lines to associate the switch with the allocation of
> > vmcb01.
> >
> > Reviewed-by: Reiji Watanabe <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/svm/svm.c | 4 +---
> > 1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 44248548be7d..cef9520fe77f 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -1431,15 +1431,13 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
> > svm->vmcb01.ptr = page_address(vmcb01_page);
> > svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
> > + svm_switch_vmcb(svm, &svm->vmcb01);
> > if (vmsa_page)
> > svm->vmsa = page_address(vmsa_page);
> > svm->guest_state_loaded = false;
> > - svm_switch_vmcb(svm, &svm->vmcb01);
> > - init_vmcb(vcpu);
> > -
> > svm_init_osvw(vcpu);
> > vcpu->arch.microcode_version = 0x01000065;
>
> While this patch makes sense, I'd rather not include it to reduce the part
> of the code in which svm->vmcb is NULL.

Not sure I follow. The svm_switch_vmcb() is kept, and even moved earlier.

2021-07-27 00:03:27

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 46/46] KVM: x86: Preserve guest's CR0.CD/NW on INIT


> On Jul 19, 2021, at 9:37 PM, Reiji Watanabe <[email protected]> wrote:
>
> On Tue, Jul 13, 2021 at 9:35 AM Sean Christopherson <[email protected]> wrote:
>>
>> Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
>> defined by both Intel's SDM and AMD's APM.
>>
>> Note, current versions of Intel's SDM are very poorly written with
>> respect to INIT behavior. Table 9-1. "IA-32 and Intel 64 Processor
>> States Following Power-up, Reset, or INIT" quite clearly lists power-up,
>> RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1. But the SDM
>> then attempts to qualify CD/NW behavior in a footnote:
>>
>> 2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
>> are cleared.
>>
>> Presumably that footnote is only meant for INIT, as the RESET case and
>> especially the power-up case are rather non-sensical. Another footnote
>> all but confirms that:
>>
>> 6. Internal caches are invalid after power-up and RESET, but left
>> unchanged with an INIT.
>>
>> Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
>> else can hack their BIOS to check RESET and power-up :-D).
>>
>> Reported-by: Reiji Watanabe <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>
> Reviewed-by: Reiji Watanabe <[email protected]>
>
> Thank you for the fix and checking the CD/NW with the bare metal testing.

Interesting.

Is there a kvm-unit-test to reproduce the issue by any chance?

2021-07-28 20:47:14

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 46/46] KVM: x86: Preserve guest's CR0.CD/NW on INIT

On Mon, Jul 26, 2021, Nadav Amit wrote:
>
> > On Jul 19, 2021, at 9:37 PM, Reiji Watanabe <[email protected]> wrote:
> >
> > On Tue, Jul 13, 2021 at 9:35 AM Sean Christopherson <[email protected]> wrote:
> >>
> >> Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
> >> defined by both Intel's SDM and AMD's APM.
> >>
> >> Note, current versions of Intel's SDM are very poorly written with
> >> respect to INIT behavior. Table 9-1. "IA-32 and Intel 64 Processor
> >> States Following Power-up, Reset, or INIT" quite clearly lists power-up,
> >> RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1. But the SDM
> >> then attempts to qualify CD/NW behavior in a footnote:
> >>
> >> 2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
> >> are cleared.
> >>
> >> Presumably that footnote is only meant for INIT, as the RESET case and
> >> especially the power-up case are rather non-sensical. Another footnote
> >> all but confirms that:
> >>
> >> 6. Internal caches are invalid after power-up and RESET, but left
> >> unchanged with an INIT.
> >>
> >> Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
> >> else can hack their BIOS to check RESET and power-up :-D).
> >>
> >> Reported-by: Reiji Watanabe <[email protected]>
> >> Signed-off-by: Sean Christopherson <[email protected]>
> >
> > Reviewed-by: Reiji Watanabe <[email protected]>
> >
> > Thank you for the fix and checking the CD/NW with the bare metal testing.
>
> Interesting.
>
> Is there a kvm-unit-test to reproduce the issue by any chance?

No :-/