Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides two
sub-features, Shadow Stack (SHSTK) and Indirect Branch Tracking (IBT), to
defend against these control-flow subversion attacks.
Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.
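As a rough illustration of the check described above, here is a tiny
user-space toy model of the SHSTK semantics (illustrative only, it does not
use the real CPU feature; all names are made up):

#include <stdio.h>
#include <stdlib.h>

static unsigned long data_stack[64], shadow_stack[64];
static int dsp, ssp;

static void do_call(unsigned long return_rip)
{
	data_stack[dsp++]   = return_rip;	/* normal return address */
	shadow_stack[ssp++] = return_rip;	/* protected second copy */
}

static unsigned long do_ret(void)
{
	unsigned long rip  = data_stack[--dsp];
	unsigned long srip = shadow_stack[--ssp];

	if (rip != srip) {			/* mismatch -> #CP on real HW */
		fprintf(stderr, "#CP: return address mismatch\n");
		abort();
	}
	return rip;
}

int main(void)
{
	do_call(0x401000);
	data_stack[dsp - 1] = 0xdeadbeef;	/* simulate a stack overwrite */
	do_ret();				/* the simulated #CP fires here */
	return 0;
}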
Indirect Branch Tracking (IBT):
IBT introduces a new instruction (ENDBRANCH) to mark valid target addresses
of indirect branches (CALL, JMP, etc.). If an indirect branch is executed and
the next instruction is _not_ an ENDBRANCH, the processor generates a #CP.
The instruction behaves as a NOP on platforms that do not support CET.
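Similarly, a toy model of the IBT tracking rule (illustrative only, not real
hardware or kernel code):

#include <stdio.h>
#include <stdlib.h>

enum insn { INSN_ENDBR, INSN_OTHER };

/*
 * After an indirect CALL/JMP, the first instruction at the target must be
 * ENDBRANCH, otherwise the CPU raises #CP.
 */
static void indirect_branch_to(enum insn first_insn_at_target)
{
	if (first_insn_at_target != INSN_ENDBR) {
		fprintf(stderr, "#CP: target does not start with ENDBRANCH\n");
		abort();
	}
}

int main(void)
{
	indirect_branch_to(INSN_ENDBR);		/* valid target, no fault */
	indirect_branch_to(INSN_OTHER);		/* simulated #CP */
	return 0;
}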
Dependency:
--------------------------------------------------------------------------
The native CET series for user mode shadow stack has already been merged in
the v6.6 mainline kernel.
The first 7 kernel patches are prerequisites for this KVM patch series since
the guest CET user mode and supervisor mode states depend on the kernel FPU
framework to properly save/restore the states whenever an FPU context switch
is required, e.g., after VM-Exit and before the vCPU thread exits to
userspace.
In this series, a guest supervisor SHSTK mitigation solution isn't introduced
for Intel platforms, therefore the guest SSS_CET bit, CPUID(0x7,1):EDX[bit 18],
is cleared. See the SDM (Vol. 1, Section 17.2.3) for details.
CET states management:
--------------------------------------------------------------------------
KVM cooperates with the host kernel FPU framework to manage guest CET
registers. With CET supervisor mode state support in this series, KVM can
save/restore the full set of guest CET xsave-managed states.
The CET user mode and supervisor mode xstates, i.e., MSR_IA32_{U_CET,PL3_SSP}
and MSR_IA32_PL{0,1,2}_SSP, rely on the host FPU framework to swap guest and
host xstates. After VM-Exit, guest CET xstates are saved to the guest fpu area
and host CET xstates are loaded from the task/thread context before the vCPU
returns to userspace, and vice versa before VM-Entry. See
kvm_{load,put}_guest_fpu() for details. So guest CET xstate management depends
on the CET xstate bits (U_CET/S_CET) being set in the host XSS MSR.
CET supervisor mode states are grouped into two categories: XSAVE-managed and
non-XSAVE-managed. The former includes MSR_IA32_PL{0,1,2}_SSP and is
controlled by the CET supervisor mode bit (S_CET) in XSS; the latter consists
of MSR_IA32_S_CET and MSR_IA32_INT_SSP_TAB.
VMX introduces new VMCS fields, {GUEST,HOST}_{S_CET,SSP,INTR_SSP_TABLE}, to
handle the guest/host non-XSAVES-managed states. When the VMX CET entry/exit
load bits are set, guest states are loaded from the GUEST_* fields at VM-Entry
and host states from the HOST_* fields at VM-Exit. With these new fields, such
supervisor states require no additional save/reload actions in KVM.
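For reference, the split can be summarized as follows (names as used in this
series):
  XSAVES-managed, swapped by the kernel FPU framework (gated by the U_CET and
  S_CET bits in host IA32_XSS):
    MSR_IA32_U_CET, MSR_IA32_PL3_SSP            - user mode CET xstate
    MSR_IA32_PL0_SSP/PL1_SSP/PL2_SSP            - supervisor mode CET xstate
  Non-XSAVES-managed, loaded via dedicated VMCS fields when the VM-Entry/
  VM-Exit load-CET-state controls are set:
    MSR_IA32_S_CET        <-> {GUEST,HOST}_S_CET
    SSP                   <-> {GUEST,HOST}_SSP
    MSR_IA32_INT_SSP_TAB  <-> {GUEST,HOST}_INTR_SSP_TABLE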
Tests:
--------------------------------------------------------------------------
This series passed the basic CET user shadow stack test and the kernel IBT
test in both L1 and L2 guests.
The patch series _does_ affect existing vmx test cases in KVM-unit-tests; the
resulting failures have been fixed in [1].
A new selftest app [2] is introduced for testing CET MSR accessibility.
Note, this series hasn't been tested on AMD platforms yet.
To run the user SHSTK test and the kernel IBT test in a guest, a CET-capable
platform is required, e.g., a Sapphire Rapids server. Follow the steps below
to build the binaries:
1. Host kernel: Apply this series to a mainline kernel (>= v6.6) and build.
2. Guest kernel: Pull a kernel (>= v6.6), enable the CONFIG_X86_KERNEL_IBT
and CONFIG_X86_USER_SHADOW_STACK options (see the config fragment below), and
build with a CET-enabled gcc (>= 8.5.0).
3. Apply the CET QEMU patches [3] before building mainline QEMU.
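The guest kernel options from step 2, as a config fragment:
CONFIG_X86_USER_SHADOW_STACK=y
CONFIG_X86_KERNEL_IBT=y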
Check kernel selftest test_shadow_stack_64 output:
[INFO] new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO] changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO] ssp is now 7f8c82101000
[OK] Shadow stack pivot
[OK] Shadow stack faults
[INFO] Corrupting shadow stack
[INFO] Generated shadow stack violation successfully
[OK] Shadow stack violation test
[INFO] Gup read -> shstk access success
[INFO] Gup write -> shstk access success
[INFO] Violation from normal write
[INFO] Gup read -> write access success
[INFO] Violation from normal write
[INFO] Gup write -> write access success
[INFO] Cow gup write -> write access success
[OK] Shadow gup test
[INFO] Violation from shstk access
[OK] mprotect() test
[SKIP] Userfaultfd unavailable.
[OK] 32 bit test
Check kernel IBT with dmesg | grep CET:
CET detected: Indirect Branch Tracking enabled
--------------------------------------------------------------------------
Changes in v7:
1. Introduced a dedicated guest config for guest-related xstate fixup. [Sean, Maxim]
2. Refined CET supervisor state handling for guest fpstate. [Dave]
3. Included Sean's fixup patch for the kernel xstate issue. [Sean]
4. Refined CET MSR read/write handling flow. [Sean, Maxim]
5. Added CET VMCS fields sync between vmcs12 and vmcs02. [Chao, Maxim]
6. Added reset handling for CET xstate-managed MSRs.
7. Other minor changes due to community review feedback. [Sean, Maxim, Chao]
8. Rebased to: https://github.com/kvm-x86/linux tag: kvm-x86-next-2023.11.01
[1]: KVM-unit-tests fixup:
https://lore.kernel.org/all/[email protected]/
[2]: Selftest for CET MSRs:
https://lore.kernel.org/all/[email protected]/
[3]: QEMU patch:
https://lore.kernel.org/all/[email protected]/
[4]: v6 patchset:
https://lore.kernel.org/all/[email protected]/
Patch 1-7: Fixup patches for kernel xstate and enable CET supervisor xstate.
Patch 8-11: Cleanup patches for KVM.
Patch 12-15: Enable KVM XSS MSR support.
Patch 16: Fault check for CR4.CET setting.
Patch 17: Report CET MSRs to userspace.
Patch 18: Introduce CET VMCS fields.
Patch 19: Add SHSTK/IBT to the KVM-governed framework (to be deprecated).
Patch 20: Emulate CET MSR access.
Patch 21: Handle SSP at entry/exit to SMM.
Patch 22: Set up CET MSR interception.
Patch 23: Initialize host constant supervisor state.
Patch 24: Add CET virtualization settings.
Patch 25-26: Add CET nested support.
Sean Christopherson (4):
x86/fpu/xstate: Always preserve non-user xfeatures/flags in
__state_perm
KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
KVM: x86: Report XSS as to-be-saved if there are supported features
KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
Yang Weijiang (22):
x86/fpu/xstate: Refine CET user xstate bit enabling
x86/fpu/xstate: Add CET supervisor mode state support
x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
x86/fpu/xstate: Create guest fpstate with guest specific config
x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal fpstate
KVM: x86: Rename kvm_{g,s}et_msr() to manifest emulation operations
KVM: x86: Refine xsave-managed guest register/MSR reset handling
KVM: x86: Add kvm_msr_{read,write}() helpers
KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
KVM: x86: Initialize kvm_caps.supported_xss
KVM: x86: Add fault checks for guest CR4.CET setting
KVM: x86: Report KVM supported CET MSRs as to-be-saved
KVM: VMX: Introduce CET VMCS fields and control bits
KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
KVM: VMX: Emulate read and write to CET MSRs
KVM: x86: Save and reload SSP to/from SMRAM
KVM: VMX: Set up interception for CET MSRs
KVM: VMX: Set host constant supervisor states to VMCS fields
KVM: x86: Enable CET virtualization for VMX and advertise to userspace
KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
KVM: nVMX: Enable CET support for nested guest
arch/x86/include/asm/fpu/types.h | 16 +-
arch/x86/include/asm/fpu/xstate.h | 11 +-
arch/x86/include/asm/kvm_host.h | 13 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 8 +
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/fpu/core.c | 62 +++++--
arch/x86/kernel/fpu/xstate.c | 46 +++--
arch/x86/kernel/fpu/xstate.h | 4 +
arch/x86/kvm/cpuid.c | 69 +++++---
arch/x86/kvm/governed_features.h | 2 +
arch/x86/kvm/smm.c | 12 +-
arch/x86/kvm/smm.h | 2 +-
arch/x86/kvm/vmx/capabilities.h | 10 ++
arch/x86/kvm/vmx/nested.c | 88 ++++++++--
arch/x86/kvm/vmx/nested.h | 5 +
arch/x86/kvm/vmx/vmcs12.c | 6 +
arch/x86/kvm/vmx/vmcs12.h | 14 +-
arch/x86/kvm/vmx/vmx.c | 110 +++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +-
arch/x86/kvm/x86.c | 254 +++++++++++++++++++++++++--
arch/x86/kvm/x86.h | 28 +++
22 files changed, 669 insertions(+), 99 deletions(-)
--
2.27.0
Add supervisor mode state support within FPU xstate management framework.
Although supervisor shadow stack is not enabled/used in the kernel today, KVM
requires the support because when KVM advertises the shadow stack feature to a
guest, it architecturally claims support for both user and supervisor modes
for guest OSes (Linux or non-Linux).
CET supervisor state includes not only the PL{0,1,2}_SSP MSRs but also the
IA32_S_CET MSR; the latter, however, is not xsave-managed. In the
virtualization world, guest IA32_S_CET is saved to/loaded from the VM control
structure. With supervisor xstate support, guest supervisor mode shadow stack
state can be properly saved/restored when 1) guest/host FPU contexts are
swapped and 2) the vCPU thread is scheduled out/in.
The alternative was to enable it within the KVM domain, but the KVM
maintainers NAKed that solution. The discussion can be found at [*]; it ended
up with adding the support in the kernel instead of in the KVM domain.
Note, in the KVM case, guest CET supervisor state, i.e., the
IA32_PL{0,1,2}_SSP MSRs, is preserved after VM-Exit until the host/guest
fpstates are swapped, but since host supervisor shadow stack is disabled, the
preserved MSRs won't hurt the host.
[*]: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 14 ++++++++++++--
arch/x86/include/asm/fpu/xstate.h | 6 +++---
arch/x86/kernel/fpu/xstate.c | 6 +++++-
3 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb810074f1e7..c6fd13a17205 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -116,7 +116,7 @@ enum xfeature {
XFEATURE_PKRU,
XFEATURE_PASID,
XFEATURE_CET_USER,
- XFEATURE_CET_KERNEL_UNUSED,
+ XFEATURE_CET_KERNEL,
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
@@ -139,7 +139,7 @@ enum xfeature {
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
-#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
@@ -264,6 +264,16 @@ struct cet_user_state {
u64 user_ssp;
};
+/*
+ * State component 12 is Control-flow Enforcement supervisor states
+ */
+struct cet_supervisor_state {
+ /* supervisor ssp pointers */
+ u64 pl0_ssp;
+ u64 pl1_ssp;
+ u64 pl2_ssp;
+};
+
/*
* State component 15: Architectural LBR configuration state.
* The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index d4427b88ee12..3b4a038d3c57 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -51,7 +51,8 @@
/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
- XFEATURE_MASK_CET_USER)
+ XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)
/*
* A supervisor state component may not always contain valuable information,
@@ -78,8 +79,7 @@
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
- XFEATURE_MASK_CET_KERNEL)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 6e50a4251e2b..b57d909facca 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -51,7 +51,7 @@ static const char *xfeature_names[] =
"Protection Keys User registers",
"PASID state",
"Control-flow User registers",
- "Control-flow Kernel registers (unused)",
+ "Control-flow Kernel registers",
"unknown xstate feature",
"unknown xstate feature",
"unknown xstate feature",
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_OSPKE,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -277,6 +278,7 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
print_xstate_feature(XFEATURE_MASK_CET_USER);
+ print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}
@@ -346,6 +348,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID | \
XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL | \
XFEATURE_MASK_XTILE)
/*
@@ -546,6 +549,7 @@ static bool __init check_xstate_against_struct(int nr)
case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+ case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
default:
XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
--
2.27.0
From: Sean Christopherson <[email protected]>
Rework and rename cpuid_get_supported_xcr0() to explicitly operate on
vCPU state, i.e. on a vCPU's CPUID state, now that the only usage of
the helper is to retrieve a vCPU's already-set CPUID.
Prior to commit 275a87244ec8 ("KVM: x86: Don't adjust guest's CPUID.0x12.1
(allowed SGX enclave XFRM)"), KVM incorrectly fudged guest CPUID at runtime,
which in turn necessitated massaging the incoming CPUID state for
KVM_SET_CPUID{2} so as not to run afoul of kvm_cpuid_check_equal().
I.e. KVM also invoked cpuid_get_supported_xcr0() with the incoming CPUID
state, and thus without an explicit vCPU object.
Opportunistically move the helper below kvm_update_cpuid_runtime() to make
it harder to repeat the mistake of querying supported XCR0 for runtime
updates.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/cpuid.c | 33 ++++++++++++++++-----------------
1 file changed, 16 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index dda6fc4cfae8..d0315e469d92 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -247,21 +247,6 @@ void kvm_update_pv_runtime(struct kvm_vcpu *vcpu)
vcpu->arch.pv_cpuid.features = best->eax;
}
-/*
- * Calculate guest's supported XCR0 taking into account guest CPUID data and
- * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
- */
-static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
-{
- struct kvm_cpuid_entry2 *best;
-
- best = cpuid_entry2_find(entries, nent, 0xd, 0);
- if (!best)
- return 0;
-
- return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
-}
-
static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
int nent)
{
@@ -312,6 +297,21 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
+/*
+ * Calculate guest's supported XCR0 taking into account guest CPUID data and
+ * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
+ */
+static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 0);
+ if (!best)
+ return 0;
+
+ return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
+}
+
static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
{
struct kvm_cpuid_entry2 *entry;
@@ -357,8 +357,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_apic_set_version(vcpu);
}
- vcpu->arch.guest_supported_xcr0 =
- cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
+ vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
kvm_update_pv_runtime(vcpu);
--
2.27.0
Rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
more obvious that KVM uses these helpers to emulate guest behavior,
i.e., host_initiated == false in these helpers.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++--
arch/x86/kvm/smm.c | 4 ++--
arch/x86/kvm/vmx/nested.c | 13 +++++++------
arch/x86/kvm/x86.c | 10 +++++-----
4 files changed, 16 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d7036982332e..5cfa18aaf33f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1967,8 +1967,8 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
void kvm_enable_efer_bits(u64);
bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
-int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);
+int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
+int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_as_nop(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index dc3d95fdca7d..45c855389ea7 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -535,7 +535,7 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
vcpu->arch.smbase = smstate->smbase;
- if (kvm_set_msr(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
+ if (kvm_emulate_msr_write(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
return X86EMUL_UNHANDLEABLE;
rsm_load_seg_64(vcpu, &smstate->tr, VCPU_SREG_TR);
@@ -626,7 +626,7 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
/* And finally go back to 32-bit mode. */
efer = 0;
- kvm_set_msr(vcpu, MSR_EFER, efer);
+ kvm_emulate_msr_write(vcpu, MSR_EFER, efer);
}
#endif
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c5ec0ef51ff7..2034337681f9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -927,7 +927,7 @@ static u32 nested_vmx_load_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count)
__func__, i, e.index, e.reserved);
goto fail;
}
- if (kvm_set_msr(vcpu, e.index, e.value)) {
+ if (kvm_emulate_msr_write(vcpu, e.index, e.value)) {
pr_debug_ratelimited(
"%s cannot write MSR (%u, 0x%x, 0x%llx)\n",
__func__, i, e.index, e.value);
@@ -963,7 +963,7 @@ static bool nested_vmx_get_vmexit_msr_value(struct kvm_vcpu *vcpu,
}
}
- if (kvm_get_msr(vcpu, msr_index, data)) {
+ if (kvm_emulate_msr_read(vcpu, msr_index, data)) {
pr_debug_ratelimited("%s cannot read MSR (0x%x)\n", __func__,
msr_index);
return false;
@@ -2649,7 +2649,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)) &&
- WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
+ WARN_ON_ONCE(kvm_emulate_msr_write(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
vmcs12->guest_ia32_perf_global_ctrl))) {
*entry_failure_code = ENTRY_FAIL_DEFAULT;
return -EINVAL;
@@ -4524,8 +4524,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
}
if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) &&
kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)))
- WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
- vmcs12->host_ia32_perf_global_ctrl));
+ WARN_ON_ONCE(kvm_emulate_msr_write(vcpu,
+ MSR_CORE_PERF_GLOBAL_CTRL,
+ vmcs12->host_ia32_perf_global_ctrl));
/* Set L1 segment info according to Intel SDM
27.5.2 Loading Host Segment and Descriptor-Table Registers */
@@ -4700,7 +4701,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
goto vmabort;
}
- if (kvm_set_msr(vcpu, h.index, h.value)) {
+ if (kvm_emulate_msr_write(vcpu, h.index, h.value)) {
pr_debug_ratelimited(
"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
__func__, j, h.index, h.value);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c924075f6f1..b9c2c0cd4cf5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1973,17 +1973,17 @@ static int kvm_set_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 data)
return kvm_set_msr_ignored_check(vcpu, index, data, false);
}
-int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
{
return kvm_get_msr_ignored_check(vcpu, index, data, false);
}
-EXPORT_SYMBOL_GPL(kvm_get_msr);
+EXPORT_SYMBOL_GPL(kvm_emulate_msr_read);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
+int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
{
return kvm_set_msr_ignored_check(vcpu, index, data, false);
}
-EXPORT_SYMBOL_GPL(kvm_set_msr);
+EXPORT_SYMBOL_GPL(kvm_emulate_msr_write);
static void complete_userspace_rdmsr(struct kvm_vcpu *vcpu)
{
@@ -8329,7 +8329,7 @@ static int emulator_set_msr_with_filter(struct x86_emulate_ctxt *ctxt,
static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
u32 msr_index, u64 *pdata)
{
- return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata);
+ return kvm_emulate_msr_read(emul_to_vcpu(ctxt), msr_index, pdata);
}
static int emulator_check_pmc(struct x86_emulate_ctxt *ctxt,
--
2.27.0
Define a new XFEATURE_MASK_KERNEL_DYNAMIC set consisting of the features that
can be optionally enabled by kernel components, i.e., features that are
required only by specific kernel components. Currently it's used by KVM to
configure the guest-dedicated fpstate, e.g., for calculating the xfeature mask
and fpstate storage size.
The kernel dynamic xfeatures currently contain only XFEATURE_CET_KERNEL, which
is supported by the host as it is enabled in the xsaves/xrstors operating
xfeature set (XCR0 | XSS), but the relevant CPU feature, i.e., supervisor
shadow stack, is not enabled in the host kernel, so it can be omitted from
normal fpstate by default.
Remove the kernel dynamic features from fpu_kernel_cfg.default_features so
that the corresponding bits in xstate_bv and xcomp_bv are cleared and
xsaves/xrstors can be optimized by HW for normal fpstate.
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/fpu/xstate.h | 5 ++++-
arch/x86/kernel/fpu/xstate.c | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 3b4a038d3c57..a212d3851429 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -46,9 +46,12 @@
#define XFEATURE_MASK_USER_RESTORE \
(XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
-/* Features which are dynamically enabled for a process on request */
+/* Features which are dynamically enabled per userspace request */
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
+/* Features which are dynamically enabled per kernel side request */
+#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
+
/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
XFEATURE_MASK_CET_USER | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index b57d909facca..ba4172172afd 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
/* Clean out dynamic features from default */
fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+ fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
fpu_user_cfg.default_features = fpu_user_cfg.max_features;
fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
--
2.27.0
From: Sean Christopherson <[email protected]>
When granting userspace or a KVM guest access to an xfeature, preserve the
entity's existing supervisor and software-defined permissions as tracked
by __state_perm, i.e. use __state_perm to track *all* permissions even
though all supported supervisor xfeatures are granted to all FPUs and
FPU_GUEST_PERM_LOCKED disallows changing permissions.
Effectively clobbering supervisor permissions results in inconsistent
behavior, as xstate_get_group_perm() will report supervisor features for
processes that do NOT request access to dynamic user xfeatures, whereas any
and all supervisor features will be absent from the set of permissions for
any process that is granted access to one or more dynamic xfeatures (which
right now means AMX).
The inconsistency isn't problematic because fpu_xstate_prctl() already
strips out everything except user xfeatures:
case ARCH_GET_XCOMP_PERM:
/*
* Lockless snapshot as it can also change right after the
* dropping the lock.
*/
permitted = xstate_get_host_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);
case ARCH_GET_XCOMP_GUEST_PERM:
permitted = xstate_get_guest_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);
and similarly KVM doesn't apply the __state_perm to supervisor states
(kvm_get_filtered_xcr0() incorporates xstate_get_guest_group_perm()):
case 0xd: {
u64 permitted_xcr0 = kvm_get_filtered_xcr0();
u64 permitted_xss = kvm_caps.supported_xss;
But if KVM in particular were to ever change, dropping supervisor
permissions would result in subtle bugs in KVM's reporting of supported
CPUID settings. And the above behavior also means that having supervisor
xfeatures in __state_perm is correctly handled by all users.
Dropping supervisor permissions also creates another landmine for KVM. If
more dynamic user xfeatures are ever added, requesting access to multiple
xfeatures in separate ARCH_REQ_XCOMP_GUEST_PERM calls will result in the
second invocation of __xstate_request_perm() computing the wrong ksize, as
the mask passed to xstate_calculate_size() would not contain *any*
supervisor features.
Commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE
permissions") fudged around the size issue for userspace FPUs, but for
reasons unknown skipped guest FPUs. Lack of a fix for KVM "works" only
because KVM doesn't yet support virtualizing features that have supervisor
xfeatures, i.e. as of today, KVM guest FPUs will never need the relevant
xfeatures.
Simply extending the hack-a-fix for guests would temporarily solve the
ksize issue, but wouldn't address the inconsistency issue and would leave
another lurking pitfall for KVM. KVM support for virtualizing CET will
likely add CET_KERNEL as a guest-only xfeature, i.e. CET_KERNEL will not
be set in xfeatures_mask_supervisor() and would again be dropped when
granting access to dynamic xfeatures.
Note, the existing clobbering behavior is rather subtle. The @permitted
parameter to __xstate_request_perm() comes from:
permitted = xstate_get_group_perm(guest);
which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm,
where __state_perm is initialized to:
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
and copied to the guest side of things:
/* Same defaults for guests */
fpu->guest_perm = fpu->perm;
fpu_kernel_cfg.default_features contains everything except the dynamic
xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:
fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
When __xstate_request_perm() restricts the local "mask" variable to
compute the user state size:
mask &= XFEATURE_MASK_USER_SUPPORTED;
usize = xstate_calculate_size(mask, false);
it subtly overwrites the target __state_perm with "mask" containing only
user xfeatures:
perm = guest ? &fpu->guest_perm : &fpu->perm;
/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
WRITE_ONCE(perm->__state_perm, mask);
Cc: Maxim Levitsky <[email protected]>
Cc: Weijiang Yang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Chao Gao <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: John Allen <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index ef6906107c54..73f6bc00d178 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
if ((permitted & requested) == requested)
return 0;
- /* Calculate the resulting kernel state size */
+ /*
+ * Calculate the resulting kernel state size. Note, @permitted also
+ * contains supervisor xfeatures even though supervisor states are always
+ * permitted for kernel and guest FPUs, and never permitted for user
+ * FPUs.
+ */
mask = permitted | requested;
- /* Take supervisor states into account on the host */
- if (!guest)
- mask |= xfeatures_mask_supervisor();
ksize = xstate_calculate_size(mask, compacted);
- /* Calculate the resulting user state size */
- mask &= XFEATURE_MASK_USER_SUPPORTED;
- usize = xstate_calculate_size(mask, false);
+ /*
+ * Calculate the resulting user state size. Take care not to clobber
+ * the supervisor xfeatures in the new mask!
+ */
+ usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);
if (!guest) {
ret = validate_sigaltstack(usize);
--
2.27.0
Wrap __kvm_{get,set}_msr() into two new helpers for KVM-internal usage and use
the helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs to
get/set an MSR value while emulating CPU behavior, i.e., host_initiated ==
%true in the helpers.
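As a usage sketch (the MSR_IA32_XSS write mirrors the hunk below; the read is
a hypothetical caller, shown only to illustrate the host_initiated == true
semantics):

	u64 xss;

	/* KVM-internal read, bypassing guest-initiated fault checks. */
	if (!kvm_msr_read(vcpu, MSR_IA32_XSS, &xss))
		pr_debug("guest XSS = 0x%llx\n", xss);

	/* Reset the guest MSR to its architectural default from within KVM. */
	kvm_msr_write(vcpu, MSR_IA32_XSS, 0);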
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/x86.c | 16 +++++++++++++---
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5cfa18aaf33f..499bd42e3a32 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1966,9 +1966,10 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
void kvm_enable_efer_bits(u64);
bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_as_nop(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index d0315e469d92..0351e311168a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1527,7 +1527,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
*edx = entry->edx;
if (function == 7 && index == 0) {
u64 data;
- if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
+ if (!kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
(data & TSX_CTRL_CPUID_CLEAR))
*ebx &= ~(F(RTM) | F(HLE));
} else if (function == 0x80000007) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 16b4f2dd138a..360f4b8a4944 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1917,8 +1917,8 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
- bool host_initiated)
+static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
+ bool host_initiated)
{
struct msr_data msr;
int ret;
@@ -1944,6 +1944,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
return ret;
}
+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+ return __kvm_set_msr(vcpu, index, data, true);
+}
+
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+{
+ return __kvm_get_msr(vcpu, index, data, true);
+}
+
static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
u32 index, u64 *data, bool host_initiated)
{
@@ -12224,7 +12234,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
__kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
- __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
+ kvm_msr_write(vcpu, MSR_IA32_XSS, 0);
}
/* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
--
2.27.0
Tweak the code a bit to facilitate resetting more xstate components in
the future, e.g., adding CET's xstate-managed MSRs.
No functional change intended.
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/x86.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b9c2c0cd4cf5..16b4f2dd138a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12132,6 +12132,11 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
static_branch_dec(&kvm_has_noapic_vcpu);
}
+static inline bool is_xstate_reset_needed(void)
+{
+ return kvm_cpu_cap_has(X86_FEATURE_MPX);
+}
+
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct kvm_cpuid_entry2 *cpuid_0x1;
@@ -12189,7 +12194,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;
- if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
+ if (vcpu->arch.guest_fpu.fpstate && is_xstate_reset_needed()) {
struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
/*
@@ -12199,8 +12204,12 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (init_event)
kvm_put_guest_fpu(vcpu);
- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
+ if (kvm_cpu_cap_has(X86_FEATURE_MPX)) {
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_BNDREGS);
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_BNDCSR);
+ }
if (init_event)
kvm_load_guest_fpu(vcpu);
--
2.27.0
From: Sean Christopherson <[email protected]>
Add MSR_IA32_XSS to the list of MSRs reported to userspace if supported_xss
is non-zero, i.e. KVM supports at least one XSS-based feature.
Before the CET virtualization series is enabled, guest MSR_IA32_XSS is
guaranteed to be 0, i.e., XSAVES/XRSTORS is executed in non-root mode with
XSS == 0, which is equivalent to XSAVE/XRSTOR.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 360f4b8a4944..f7d4cc61bc55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1464,6 +1464,7 @@ static const u32 msrs_to_save_base[] = {
MSR_IA32_UMWAIT_CONTROL,
MSR_IA32_XFD, MSR_IA32_XFD_ERR,
+ MSR_IA32_XSS,
};
static const u32 msrs_to_save_pmu[] = {
@@ -7317,6 +7318,10 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
return;
break;
+ case MSR_IA32_XSS:
+ if (!kvm_caps.supported_xss)
+ return;
+ break;
default:
break;
}
--
2.27.0
Use fpu_guest_cfg to calculate guest fpstate settings and open code
__fpstate_reset() to avoid using the kernel FPU config.
The following configuration steps are currently enforced to get a guest
fpstate:
1) The kernel sets up guest FPU settings in fpu__init_system_xstate().
2) User space sets the vCPU thread group xstate permits via arch_prctl().
3) User space creates the guest fpstate via __fpu_alloc_init_guest_fpstate()
for the vcpu thread.
4) User space enables guest dynamic xfeatures and re-allocates the guest
fpstate.
By adding the kernel dynamic xfeatures in #1 and #2 above, the guest xstate
area size is expanded to hold (fpu_kernel_cfg.default_features | kernel
dynamic xfeatures | user dynamic xfeatures), so that host xsaves/xrstors can
operate on all guest xfeatures.
The user_* fields remain unchanged for compatibility with KVM uAPIs.
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kernel/fpu/core.c | 48 ++++++++++++++++++++++++++++--------
arch/x86/kernel/fpu/xstate.c | 2 +-
arch/x86/kernel/fpu/xstate.h | 1 +
3 files changed, 40 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 516af626bf6a..985eaf8b55e0 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -194,8 +194,6 @@ void fpu_reset_from_exception_fixup(void)
}
#if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
-
static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
{
struct fpu_state_perm *fpuperm;
@@ -216,25 +214,55 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
}
-bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
{
+ bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+ unsigned int gfpstate_size, size;
struct fpstate *fpstate;
- unsigned int size;
- size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
+ /*
+ * fpu_guest_cfg.default_features includes all enabled xfeatures
+ * except the user dynamic xfeatures. If the user dynamic xfeatures
+ * are enabled, the guest fpstate will be re-allocated to hold all
+ * guest enabled xfeatures, so omit user dynamic xfeatures here.
+ */
+ gfpstate_size = xstate_calculate_size(fpu_guest_cfg.default_features,
+ compacted);
+
+ size = gfpstate_size + ALIGN(offsetof(struct fpstate, regs), 64);
+
fpstate = vzalloc(size);
if (!fpstate)
- return false;
+ return NULL;
+ /*
+ * Initialize sizes and feature masks, use fpu_user_cfg.*
+ * for user_* settings for compatibility with existing uAPIs.
+ */
+ fpstate->size = gfpstate_size;
+ fpstate->xfeatures = fpu_guest_cfg.default_features;
+ fpstate->user_size = fpu_user_cfg.default_size;
+ fpstate->user_xfeatures = fpu_user_cfg.default_features;
+ fpstate->xfd = 0;
- /* Leave xfd to 0 (the reset value defined by spec) */
- __fpstate_reset(fpstate, 0);
fpstate_init_user(fpstate);
fpstate->is_valloc = true;
fpstate->is_guest = true;
gfpu->fpstate = fpstate;
- gfpu->xfeatures = fpu_user_cfg.default_features;
- gfpu->perm = fpu_user_cfg.default_features;
+ gfpu->xfeatures = fpu_guest_cfg.default_features;
+ gfpu->perm = fpu_guest_cfg.default_features;
+
+ return fpstate;
+}
+
+bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+{
+ struct fpstate *fpstate;
+
+ fpstate = __fpu_alloc_init_guest_fpstate(gfpu);
+
+ if (!fpstate)
+ return false;
/*
* KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index aa8f8595cd41..253944cb2298 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -559,7 +559,7 @@ static bool __init check_xstate_against_struct(int nr)
return true;
}
-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
{
unsigned int topmost = fls64(xfeatures) - 1;
unsigned int offset = xstate_offsets[topmost];
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 3518fb26d06b..c032acb56306 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -55,6 +55,7 @@ extern void fpu__init_cpu_xstate(void);
extern void fpu__init_system_xstate(unsigned int legacy_size);
extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+extern unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
static inline u64 xfeatures_mask_supervisor(void)
{
--
2.27.0
Define a new fpu_guest_cfg to hold all guest FPU settings so that they can
differ from the generic kernel FPU settings, e.g., CET supervisor xstate is
enabled by default for guest fpstate while it remains disabled in the kernel
FPU config.
The kernel dynamic xfeatures are now used only by guest fpstate; add the mask
to the guest config so that guest_perm.__state_perm ==
(fpu_kernel_cfg.default_features | XFEATURE_MASK_KERNEL_DYNAMIC). And if the
guest fpstate is re-allocated to hold user dynamic xfeatures, the resulting
permissions are consumed before calculating the new guest fpstate size.
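The intended mask relationship can be sketched as (informal summary of the
hunks below):

	fpu_guest_cfg.max_features     = fpu_kernel_cfg.max_features;
	fpu_guest_cfg.default_features = fpu_guest_cfg.max_features &
					 ~XFEATURE_MASK_USER_DYNAMIC;
	/*
	 * Unlike fpu_kernel_cfg.default_features, XFEATURE_MASK_KERNEL_DYNAMIC
	 * (i.e. CET_KERNEL) is not cleared here, so guest fpstate keeps the
	 * CET supervisor state by default.
	 */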
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 2 +-
arch/x86/kernel/fpu/core.c | 14 +++++++++++---
arch/x86/kernel/fpu/xstate.c | 10 ++++++++++
3 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c6fd13a17205..306825ad6bc0 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -602,6 +602,6 @@ struct fpu_state_config {
};
/* FPU state configuration information */
-extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;
+extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg, fpu_guest_cfg;
#endif /* _ASM_X86_FPU_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a21a4d0ecc34..516af626bf6a 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -33,9 +33,10 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
DEFINE_PER_CPU(u64, xfd_state);
#endif
-/* The FPU state configuration data for kernel and user space */
+/* The FPU state configuration data for kernel, user space and guest. */
struct fpu_state_config fpu_kernel_cfg __ro_after_init;
struct fpu_state_config fpu_user_cfg __ro_after_init;
+struct fpu_state_config fpu_guest_cfg __ro_after_init;
/*
* Represents the initial FPU state. It's mostly (but not completely) zeroes,
@@ -536,8 +537,15 @@ void fpstate_reset(struct fpu *fpu)
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
fpu->perm.__state_size = fpu_kernel_cfg.default_size;
fpu->perm.__user_state_size = fpu_user_cfg.default_size;
- /* Same defaults for guests */
- fpu->guest_perm = fpu->perm;
+
+ /* Guest permission settings */
+ fpu->guest_perm.__state_perm = fpu_guest_cfg.default_features;
+ fpu->guest_perm.__state_size = fpu_guest_cfg.default_size;
+ /*
+ * Set guest's __user_state_size to fpu_user_cfg.default_size so that
+ * existing uAPIs can still work.
+ */
+ fpu->guest_perm.__user_state_size = fpu_user_cfg.default_size;
}
static inline void fpu_inherit_perms(struct fpu *dst_fpu)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index ba4172172afd..aa8f8595cd41 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -681,6 +681,7 @@ static int __init init_xstate_size(void)
{
/* Recompute the context size for enabled features: */
unsigned int user_size, kernel_size, kernel_default_size;
+ unsigned int guest_default_size;
bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
/* Uncompacted user space size */
@@ -702,13 +703,18 @@ static int __init init_xstate_size(void)
kernel_default_size =
xstate_calculate_size(fpu_kernel_cfg.default_features, compacted);
+ guest_default_size =
+ xstate_calculate_size(fpu_guest_cfg.default_features, compacted);
+
if (!paranoid_xstate_size_valid(kernel_size))
return -EINVAL;
fpu_kernel_cfg.max_size = kernel_size;
fpu_user_cfg.max_size = user_size;
+ fpu_guest_cfg.max_size = kernel_size;
fpu_kernel_cfg.default_size = kernel_default_size;
+ fpu_guest_cfg.default_size = guest_default_size;
fpu_user_cfg.default_size =
xstate_calculate_size(fpu_user_cfg.default_features, false);
@@ -829,6 +835,10 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
fpu_user_cfg.default_features = fpu_user_cfg.max_features;
fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+ fpu_guest_cfg.max_features = fpu_kernel_cfg.max_features;
+ fpu_guest_cfg.default_features = fpu_guest_cfg.max_features;
+ fpu_guest_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+
/* Store it for paranoia check at the end */
xfeatures = fpu_kernel_cfg.max_features;
--
2.27.0
Add CET MSRs to the list of MSRs reported to userspace if the feature
associated with the MSRs, i.e. IBT or SHSTK, is supported by KVM.
SSP can only be read via RDSSP. Writing it requires destructive and
potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper for
the GUEST_SSP field of the VMCS.
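As an illustration of how a VMM could consume the pseudo-MSR once it is
enumerated, here is a hypothetical user-space snippet (not part of this
series) using the regular KVM_GET_MSRS vCPU ioctl; vcpu_fd and the updated
uapi headers providing MSR_KVM_SSP are assumed:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <asm/kvm_para.h>

static int read_guest_ssp(int vcpu_fd, unsigned long long *ssp)
{
	struct kvm_msrs *msrs;
	int ret;

	msrs = calloc(1, sizeof(*msrs) + sizeof(struct kvm_msr_entry));
	if (!msrs)
		return -1;

	msrs->nmsrs = 1;
	msrs->entries[0].index = MSR_KVM_SSP;

	/* KVM_GET_MSRS returns the number of MSRs successfully read. */
	ret = ioctl(vcpu_fd, KVM_GET_MSRS, msrs) == 1 ? 0 : -1;
	if (!ret)
		*ssp = msrs->entries[0].data;

	free(msrs);
	return ret;
}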
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/x86.c | 18 ++++++++++++++++++
3 files changed, 21 insertions(+)
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..9864bbcf2470 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -58,6 +58,7 @@
#define MSR_KVM_ASYNC_PF_INT 0x4b564d06
#define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
#define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
+#define MSR_KVM_SSP 0x4b564d09
struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index be20a60047b1..d3d0d74fef70 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7009,6 +7009,8 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
case MSR_AMD64_TSC_RATIO:
/* This is AMD only. */
return false;
+ case MSR_KVM_SSP:
+ return kvm_cpu_cap_has(X86_FEATURE_SHSTK);
default:
return true;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 44b8cf459dfc..74d2d00a1681 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1476,6 +1476,9 @@ static const u32 msrs_to_save_base[] = {
MSR_IA32_XFD, MSR_IA32_XFD_ERR,
MSR_IA32_XSS,
+ MSR_IA32_U_CET, MSR_IA32_S_CET,
+ MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP,
+ MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB,
};
static const u32 msrs_to_save_pmu[] = {
@@ -1576,6 +1579,7 @@ static const u32 emulated_msrs_all[] = {
MSR_K7_HWCR,
MSR_KVM_POLL_CONTROL,
+ MSR_KVM_SSP,
};
static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -7371,6 +7375,20 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!kvm_caps.supported_xss)
return;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+ !kvm_cpu_cap_has(X86_FEATURE_IBT))
+ return;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!kvm_cpu_cap_has(X86_FEATURE_LM))
+ return;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ return;
+ break;
default:
break;
}
--
2.27.0
Check potential faults for CR4.CET setting per Intel SDM requirements.
CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET == 1
faults if CR0.WP == 0, and setting CR0.WP == 0 faults if CR4.CET == 1.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd48b825510c..44b8cf459dfc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1006,6 +1006,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
(is_64_bit_mode(vcpu) || kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)))
return 1;
+ if (!(cr0 & X86_CR0_WP) && kvm_is_cr4_bit_set(vcpu, X86_CR4_CET))
+ return 1;
+
static_call(kvm_x86_set_cr0)(vcpu, cr0);
kvm_post_set_cr0(vcpu, old_cr0, cr0);
@@ -1217,6 +1220,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
return 1;
}
+ if ((cr4 & X86_CR4_CET) && !kvm_is_cr0_bit_set(vcpu, X86_CR0_WP))
+ return 1;
+
static_call(kvm_x86_set_cr4)(vcpu, cr4);
kvm_post_set_cr4(vcpu, old_cr4, cr4);
--
2.27.0
Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides two
sub-features, Shadow Stack (SHSTK) and Indirect Branch Tracking (IBT), to
defend against these control-flow subversion attacks.
Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.
Indirect Branch Tracking (IBT):
IBT introduces a new instruction (ENDBRANCH) to mark valid target addresses of
indirect branches (CALL, JMP, etc.). If an indirect branch is executed and the
next instruction is _not_ an ENDBRANCH, the processor generates a #CP. The
instruction behaves as a NOP on platforms that do not support CET.
Several new CET MSRs are defined to support CET:
MSR_IA32_{U,S}_CET: CET settings for {user,supervisor} CET respectively.
MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.
MSR_IA32_INT_SSP_TAB: Linear address of the SHSTK pointer table, whose entries
are indexed by the IST field of an interrupt gate descriptor.
Two XSAVES state bits are introduced for CET:
IA32_XSS[bit 11]: Controls saving/restoring user mode CET states.
IA32_XSS[bit 12]: Controls saving/restoring supervisor mode CET states.
Six VMCS fields are introduced for CET:
{HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
{HOST,GUEST}_SSP: Stores current active SSP.
{HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.
On Intel platforms, two additional bits are defined in VM_EXIT and VM_ENTRY
control fields:
If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from following
VMCS fields at VM-Exit:
HOST_S_CET
HOST_SSP
HOST_INTR_SSP_TABLE
If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from following
VMCS fields at VM-Entry:
GUEST_S_CET
GUEST_SSP
GUEST_INTR_SSP_TABLE
Co-developed-by: Zhang Yi Z <[email protected]>
Signed-off-by: Zhang Yi Z <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/vmx.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..451fd4f4fedc 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -104,6 +104,7 @@
#define VM_EXIT_CLEAR_BNDCFGS 0x00800000
#define VM_EXIT_PT_CONCEAL_PIP 0x01000000
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
+#define VM_EXIT_LOAD_CET_STATE 0x10000000
#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff
@@ -117,6 +118,7 @@
#define VM_ENTRY_LOAD_BNDCFGS 0x00010000
#define VM_ENTRY_PT_CONCEAL_PIP 0x00020000
#define VM_ENTRY_LOAD_IA32_RTIT_CTL 0x00040000
+#define VM_ENTRY_LOAD_CET_STATE 0x00100000
#define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x000011ff
@@ -345,6 +347,9 @@ enum vmcs_field {
GUEST_PENDING_DBG_EXCEPTIONS = 0x00006822,
GUEST_SYSENTER_ESP = 0x00006824,
GUEST_SYSENTER_EIP = 0x00006826,
+ GUEST_S_CET = 0x00006828,
+ GUEST_SSP = 0x0000682a,
+ GUEST_INTR_SSP_TABLE = 0x0000682c,
HOST_CR0 = 0x00006c00,
HOST_CR3 = 0x00006c02,
HOST_CR4 = 0x00006c04,
@@ -357,6 +362,9 @@ enum vmcs_field {
HOST_IA32_SYSENTER_EIP = 0x00006c12,
HOST_RSP = 0x00006c14,
HOST_RIP = 0x00006c16,
+ HOST_S_CET = 0x00006c18,
+ HOST_SSP = 0x00006c1a,
+ HOST_INTR_SSP_TABLE = 0x00006c1c
};
/*
--
2.27.0
Add an emulation interface for CET MSR access. The emulation code is split
into a common part and a vendor-specific part. The former does common checks
for the MSRs, e.g., accessibility, data validity etc., then passes the
operation on either to the XSAVE-managed MSRs via the helpers or to the CET
VMCS fields.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 18 +++++++++
arch/x86/kvm/x86.c | 88 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 106 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6ad5ba5d518..554f665e59c3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2111,6 +2111,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
else
msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
break;
+ case MSR_IA32_S_CET:
+ msr_info->data = vmcs_readl(GUEST_S_CET);
+ break;
+ case MSR_KVM_SSP:
+ msr_info->data = vmcs_readl(GUEST_SSP);
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
+ break;
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
break;
@@ -2420,6 +2429,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
else
vmx->pt_desc.guest.addr_a[index / 2] = data;
break;
+ case MSR_IA32_S_CET:
+ vmcs_writel(GUEST_S_CET, data);
+ break;
+ case MSR_KVM_SSP:
+ vmcs_writel(GUEST_SSP, data);
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ vmcs_writel(GUEST_INTR_SSP_TABLE, data);
+ break;
case MSR_IA32_PERF_CAPABILITIES:
if (data && !vcpu_to_pmu(vcpu)->version)
return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 74d2d00a1681..5792ed16e61b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1847,6 +1847,36 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
}
EXPORT_SYMBOL_GPL(kvm_msr_allowed);
+#define CET_US_RESERVED_BITS GENMASK(9, 6)
+#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
+#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
+#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
+
+static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
+ bool host_initiated)
+{
+ bool msr_ctrl = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
+
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return true;
+
+ if (msr_ctrl && guest_can_use(vcpu, X86_FEATURE_IBT))
+ return true;
+
+ /*
+ * If KVM supports the MSR, i.e. has enumerated the MSR existence to
+ * userspace, then userspace is allowed to write '0' irrespective of
+ * whether or not the MSR is exposed to the guest.
+ */
+ if (!host_initiated || data)
+ return false;
+
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ return true;
+
+ return msr_ctrl && kvm_cpu_cap_has(X86_FEATURE_IBT);
+}
+
/*
* Write @data into the MSR specified by @index. Select MSR specific fault
* checks are bypassed if @host_initiated is %true.
@@ -1906,6 +1936,43 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
data = (u32)data;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
+ return 1;
+ if (data & CET_US_RESERVED_BITS)
+ return 1;
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+ (data & CET_US_SHSTK_MASK_BITS))
+ return 1;
+ if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
+ (data & CET_US_IBT_MASK_BITS))
+ return 1;
+ if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
+ return 1;
+ /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
+ if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
+ return 1;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated) ||
+ !guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ return 1;
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ break;
+ case MSR_KVM_SSP:
+ if (!host_initiated)
+ return 1;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
+ return 1;
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ if (!IS_ALIGNED(data, 4))
+ return 1;
+ break;
}
msr.data = data;
@@ -1949,6 +2016,19 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
!guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
return 1;
break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
+ !guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ return 1;
+ break;
+ case MSR_KVM_SSP:
+ if (!host_initiated)
+ return 1;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return 1;
+ break;
}
msr.index = index;
@@ -4118,6 +4198,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.guest_fpu.xfd_err = data;
break;
#endif
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ kvm_set_xstate_msr(vcpu, msr_info);
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr))
return kvm_pmu_set_msr(vcpu, msr_info);
@@ -4475,6 +4559,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vcpu->arch.guest_fpu.xfd_err;
break;
#endif
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ kvm_get_xstate_msr(vcpu, msr_info);
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
return kvm_pmu_get_msr(vcpu, msr_info);
--
2.27.0
Kernel dynamic xfeatures are now enabled __ONLY__ for guest fpstate, i.e.,
never for normal kernel fpstate. The bits are added when the guest FPU config
is initialized, and guest fpstate is allocated with fpstate->is_guest set to
%true.
For normal fpstate, the bits should already have been removed when the kernel
FPU config settings are initialized, so WARN_ONCE() if the kernel detects that
a normal fpstate's xfeatures contain kernel dynamic xfeatures before executing
XSAVES.
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kernel/fpu/xstate.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index c032acb56306..d45f3e570e69 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -186,6 +186,9 @@ static inline void os_xsave(struct fpstate *fpstate)
WARN_ON_FPU(!alternatives_patched);
xfd_validate_state(fpstate, mask, false);
+ WARN_ON_FPU(!fpstate->is_guest &&
+ (mask & XFEATURE_MASK_KERNEL_DYNAMIC));
+
XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
/* We should never fault when copying to a kernel buffer: */
--
2.27.0
From: Sean Christopherson <[email protected]>
Load the guest's FPU state if userspace is accessing MSRs whose values
are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(),
to facilitate access to such kind of MSRs.
If MSRs supported in kvm_caps.supported_xss are passed through to the guest,
the guest MSRs are swapped with the host's before the vCPU exits to userspace
and swapped back after it re-enters the kernel, before the next VM-entry.
Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
explicitly check @vcpu is non-null before attempting to load guest state.
The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without
loading guest FPU state (which doesn't exist).
Note that guest_cpuid_has() is not queried as host userspace is allowed to
access MSRs that have not been exposed to the guest, e.g. it might do
KVM_SET_MSRS prior to KVM_SET_CPUID2.
The two helpers are placed here to make it explicit that accessing
XSAVE-managed MSRs requires special checks and handling to guarantee correct
reads/writes of those MSRs.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Yang Weijiang <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.h | 24 ++++++++++++++++++++++++
2 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 607082aca80d..fd48b825510c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
static DEFINE_MUTEX(vendor_module_lock);
+static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
+static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
+
struct kvm_x86_ops kvm_x86_ops __read_mostly;
#define KVM_X86_OP(func) \
@@ -4482,6 +4485,21 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
EXPORT_SYMBOL_GPL(kvm_get_msr_common);
+/*
+ * Returns true if the MSR in question is managed via XSTATE, i.e. is context
+ * switched with the rest of guest FPU state.
+ */
+static bool is_xstate_managed_msr(u32 index)
+{
+ switch (index) {
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ return true;
+ default:
+ return false;
+ }
+}
+
/*
* Read or write a bunch of msrs. All parameters are kernel addresses.
*
@@ -4492,11 +4510,26 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
int (*do_msr)(struct kvm_vcpu *vcpu,
unsigned index, u64 *data))
{
+ bool fpu_loaded = false;
int i;
- for (i = 0; i < msrs->nmsrs; ++i)
+ for (i = 0; i < msrs->nmsrs; ++i) {
+ /*
+ * If userspace is accessing one or more XSTATE-managed MSRs,
+ * temporarily load the guest's FPU state so that the guest's
+ * MSR value(s) is resident in hardware, i.e. so that KVM can
+ * get/set the MSR via RDMSR/WRMSR.
+ */
+ if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
+ is_xstate_managed_msr(entries[i].index)) {
+ kvm_load_guest_fpu(vcpu);
+ fpu_loaded = true;
+ }
if (do_msr(vcpu, entries[i].index, &entries[i].data))
break;
+ }
+ if (fpu_loaded)
+ kvm_put_guest_fpu(vcpu);
return i;
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 5184fde1dc54..6e42ede335f5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -541,4 +541,28 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
unsigned int port, void *data, unsigned int count,
int in);
+/*
+ * Lock and/or reload guest FPU and access xstate MSRs. For accesses initiated
+ * by host, guest FPU is loaded in __msr_io(). For accesses initiated by guest,
+ * guest FPU should have been loaded already.
+ */
+
+static inline void kvm_get_xstate_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info)
+{
+ KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
+ kvm_fpu_get();
+ rdmsrl(msr_info->index, msr_info->data);
+ kvm_fpu_put();
+}
+
+static inline void kvm_set_xstate_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info)
+{
+ KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
+ kvm_fpu_get();
+ wrmsrl(msr_info->index, msr_info->data);
+ kvm_fpu_put();
+}
+
#endif
--
2.27.0
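For illustration only (not part of the series), a minimal userspace sketch of
how a VMM could read an XSTATE-managed CET MSR through KVM_GET_MSRS, the path
that triggers the temporary guest-FPU load in __msr_io() above; the vcpu_fd
handle and the choice of IA32_PL3_SSP (index 0x6a7) are assumptions made for
the example:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Read one guest MSR; returns 0 on failure. For an XSTATE-managed index
 * such as IA32_PL3_SSP (0x6a7), KVM temporarily loads the guest FPU in
 * __msr_io() so the value read is the guest's, not the host's. */
static uint64_t vcpu_get_msr(int vcpu_fd, uint32_t index)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entry;
	} req;

	memset(&req, 0, sizeof(req));
	req.hdr.nmsrs = 1;
	req.entry.index = index;

	/* KVM_GET_MSRS returns the number of MSRs successfully read. */
	if (ioctl(vcpu_fd, KVM_GET_MSRS, &req) != 1)
		return 0;

	return req.entry.data;
}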
Use the governed feature framework to track whether the X86_FEATURE_SHSTK
and X86_FEATURE_IBT features can be used by userspace and the guest, i.e.,
a feature can be used iff both KVM and the guest CPUID support it.
TODO: remove this patch once Sean's refactor to "KVM-governed" framework
is upstreamed. See the work here [*].
[*]: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/governed_features.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
index 423a73395c10..db7e21c5ecc2 100644
--- a/arch/x86/kvm/governed_features.h
+++ b/arch/x86/kvm/governed_features.h
@@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
KVM_GOVERNED_X86_FEATURE(VGIF)
KVM_GOVERNED_X86_FEATURE(VNMI)
+KVM_GOVERNED_X86_FEATURE(SHSTK)
+KVM_GOVERNED_X86_FEATURE(IBT)
#undef KVM_GOVERNED_X86_FEATURE
#undef KVM_GOVERNED_FEATURE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d3d0d74fef70..f6ad5ba5d518 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7762,6 +7762,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
vmx_setup_uret_msrs(vmx);
--
2.27.0
Expose CET features to the guest if KVM/host can support them, and clear the
CPUID feature bits if KVM/host cannot.
Set the CPUID feature bits so that CET features are available in guest CPUID.
Add CR4.CET bit support so that the guest can set the CET master control bit.
Disable the KVM CET feature if unrestricted_guest is unsupported/disabled, as
KVM does not support emulating CET.
Don't expose the CET feature if either of the {U,S}_CET xstate bits is cleared
in host XSS or if XSAVES isn't supported.
Set the CET load-bits in the VM_ENTRY/VM_EXIT control fields to keep guest CET
states isolated from the host's. All platforms that support CET enumerate
VMX_BASIC[bit56] as 1, so clear the CET feature bits if the bit doesn't read 1.
Per Arch confirmation, the CET MSR contents after reset, power-up and INIT are
set to 0, so clear the relevant guest fpstate areas so that the guest MSRs are
reset to 0 after these events.
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kvm/cpuid.c | 19 +++++++++++++++++--
arch/x86/kvm/vmx/capabilities.h | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 29 ++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 ++++--
arch/x86/kvm/x86.c | 26 ++++++++++++++++++++++++--
arch/x86/kvm/x86.h | 3 +++
8 files changed, 85 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f536102f1eca..fd110a0b712f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -133,7 +133,8 @@
| X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
| X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
- | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+ | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+ | X86_CR4_CET))
#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 389f9594746e..25ae7ceb5b39 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1097,6 +1097,7 @@
#define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
#define VMX_BASIC_MEM_TYPE_WB 6LLU
#define VMX_BASIC_INOUT 0x0040000000000000LLU
+#define VMX_BASIC_NO_HW_ERROR_CODE_CC 0x0100000000000000LLU
/* Resctrl MSRs: */
/* - Intel: */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1d9843b34196..6d758054f994 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -648,7 +648,7 @@ void kvm_set_cpu_caps(void)
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
- F(SGX_LC) | F(BUS_LOCK_DETECT)
+ F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
);
/* Set LA57 based on hardware capability. */
if (cpuid_ecx(7) & F(LA57))
@@ -666,7 +666,8 @@ void kvm_set_cpu_caps(void)
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
- F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
+ F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
+ F(IBT)
);
/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -679,6 +680,20 @@ void kvm_set_cpu_caps(void)
kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
+ /*
+ * Don't use boot_cpu_has() to check availability of IBT because the
+ * feature bit is cleared in boot_cpu_data when ibt=off is applied
+ * in host cmdline.
+ *
+ * As currently there's no HW bug which requires disabling IBT feature
+ * while CPU can enumerate it, host cmdline option ibt=off is most
+ * likely due to administrative reason on host side, so KVM refers to
+ * CPU CPUID enumeration to enable the feature. In future if there's
+ * actually some bug clobbered ibt=off option, then enforce additional
+ * check here to disable the support in KVM.
+ */
+ if (cpuid_edx(7) & F(IBT))
+ kvm_cpu_cap_set(X86_FEATURE_IBT);
kvm_cpu_cap_mask(CPUID_7_1_EAX,
F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index ee8938818c8a..e12bc233d88b 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -79,6 +79,12 @@ static inline bool cpu_has_vmx_basic_inout(void)
return (((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
}
+static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
+{
+ return ((u64)vmcs_config.basic_cap << 32) &
+ VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
static inline bool cpu_has_virtual_nmis(void)
{
return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c658f2f230df..a1aae8709939 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2614,6 +2614,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
{ VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
{ VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
{ VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
+ { VM_ENTRY_LOAD_CET_STATE, VM_EXIT_LOAD_CET_STATE },
};
memset(vmcs_conf, 0, sizeof(*vmcs_conf));
@@ -4935,6 +4936,15 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ vmcs_writel(GUEST_SSP, 0);
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
+ kvm_cpu_cap_has(X86_FEATURE_IBT))
+ vmcs_writel(GUEST_S_CET, 0);
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+ IS_ENABLED(CONFIG_X86_64))
+ vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
+
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
vpid_sync_context(vmx->vpid);
@@ -6354,6 +6364,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);
+ if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE) {
+ pr_err("S_CET = 0x%016lx\n", vmcs_readl(GUEST_S_CET));
+ pr_err("SSP = 0x%016lx\n", vmcs_readl(GUEST_SSP));
+ pr_err("INTR SSP TABLE = 0x%016lx\n",
+ vmcs_readl(GUEST_INTR_SSP_TABLE));
+ }
pr_err("*** Host State ***\n");
pr_err("RIP = 0x%016lx RSP = 0x%016lx\n",
vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
@@ -6431,6 +6447,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
pr_err("Virtual processor ID = 0x%04x\n",
vmcs_read16(VIRTUAL_PROCESSOR_ID));
+ if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE) {
+ pr_err("S_CET = 0x%016lx\n", vmcs_readl(HOST_S_CET));
+ pr_err("SSP = 0x%016lx\n", vmcs_readl(HOST_SSP));
+ pr_err("INTR SSP TABLE = 0x%016lx\n",
+ vmcs_readl(HOST_INTR_SSP_TABLE));
+ }
}
/*
@@ -7964,7 +7986,6 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_set(X86_FEATURE_UMIP);
/* CPUID 0xD.1 */
- kvm_caps.supported_xss = 0;
if (!cpu_has_vmx_xsaves())
kvm_cpu_cap_clear(X86_FEATURE_XSAVES);
@@ -7976,6 +7997,12 @@ static __init void vmx_set_cpu_caps(void)
if (cpu_has_vmx_waitpkg())
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
+
+ if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
+ !cpu_has_vmx_basic_no_hw_errcode()) {
+ kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+ kvm_cpu_cap_clear(X86_FEATURE_IBT);
+ }
}
static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index c2130d2c8e24..fb72819fbb41 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -480,7 +480,8 @@ static inline u8 vmx_get_rvi(void)
VM_ENTRY_LOAD_IA32_EFER | \
VM_ENTRY_LOAD_BNDCFGS | \
VM_ENTRY_PT_CONCEAL_PIP | \
- VM_ENTRY_LOAD_IA32_RTIT_CTL)
+ VM_ENTRY_LOAD_IA32_RTIT_CTL | \
+ VM_ENTRY_LOAD_CET_STATE)
#define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
(VM_EXIT_SAVE_DEBUG_CONTROLS | \
@@ -502,7 +503,8 @@ static inline u8 vmx_get_rvi(void)
VM_EXIT_LOAD_IA32_EFER | \
VM_EXIT_CLEAR_BNDCFGS | \
VM_EXIT_PT_CONCEAL_PIP | \
- VM_EXIT_CLEAR_IA32_RTIT_CTL)
+ VM_EXIT_CLEAR_IA32_RTIT_CTL | \
+ VM_EXIT_LOAD_CET_STATE)
#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
(PIN_BASED_EXT_INTR_MASK | \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c6b57ede0f57..2bcf3c7923bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
-#define KVM_SUPPORTED_XSS 0
+#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);
@@ -9854,6 +9855,15 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
kvm_caps.supported_xss = 0;
+ if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
+ XFEATURE_MASK_CET_KERNEL)) !=
+ (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
+ kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+ kvm_cpu_cap_clear(X86_FEATURE_IBT);
+ kvm_caps.supported_xss &= ~XFEATURE_CET_USER;
+ kvm_caps.supported_xss &= ~XFEATURE_CET_KERNEL;
+ }
+
#define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
#undef __kvm_cpu_cap_has
@@ -12319,7 +12329,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
static inline bool is_xstate_reset_needed(void)
{
- return kvm_cpu_cap_has(X86_FEATURE_MPX);
+ return kvm_cpu_cap_has(X86_FEATURE_MPX) ||
+ kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
+ kvm_cpu_cap_has(X86_FEATURE_IBT);
}
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -12396,6 +12408,16 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
XFEATURE_BNDCSR);
}
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_CET_USER);
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_CET_KERNEL);
+ } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_CET_USER);
+ }
+
if (init_event)
kvm_load_guest_fpu(vcpu);
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d9cc352cf421..dc79dcd733ac 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -531,6 +531,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
__reserved_bits |= X86_CR4_VMXE; \
if (!__cpu_has(__c, X86_FEATURE_PCID)) \
__reserved_bits |= X86_CR4_PCIDE; \
+ if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
+ !__cpu_has(__c, X86_FEATURE_IBT)) \
+ __reserved_bits |= X86_CR4_CET; \
__reserved_bits; \
})
--
2.27.0
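As a rough guest-side sketch (not part of this patch) of what the CR4.CET
plumbing above enables, the sequence a 64-bit guest kernel would use to turn
on supervisor IBT looks like the following; the MSR index and bit positions
are taken from the SDM, and the function name is made up for the example:

#include <stdint.h>

#define MSR_IA32_S_CET		0x6a2
#define CET_ENDBR_EN		(1ULL << 2)	/* enable ENDBRANCH tracking */
#define X86_CR4_CET		(1UL << 23)	/* CET master enable */

static void guest_enable_supervisor_ibt(void)
{
	unsigned long cr4;
	uint64_t s_cet = CET_ENDBR_EN;

	/* CR4.CET is the master enable; setting it #GPs unless the VMM
	 * advertised SHSTK or IBT in the guest's CPUID. */
	asm volatile("mov %%cr4, %0" : "=r"(cr4));
	cr4 |= X86_CR4_CET;
	asm volatile("mov %0, %%cr4" : : "r"(cr4));

	/* Enable indirect branch tracking for CPL < 3 via IA32_S_CET. */
	asm volatile("wrmsr" : : "c"(MSR_IA32_S_CET),
		     "a"((uint32_t)s_cet), "d"((uint32_t)(s_cet >> 32)));
}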
Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the current required xstate size
whenever the XSS MSR is modified.
CPUID.(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
xstate features in (XCR0 | IA32_XSS). The guest can use this CPUID value to
allocate a sufficiently large xsave buffer.
Note, KVM does not yet support any XSS-based features, i.e. supported_xss is
guaranteed to be zero at this time.
Opportunistically modify the XSS write access logic: if XSAVES is not enabled
in the guest CPUID, forbid setting the IA32_XSS MSR to anything but 0, even if
the write is host initiated.
Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Zhang Yi Z <[email protected]>
Signed-off-by: Zhang Yi Z <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/cpuid.c | 15 ++++++++++++++-
arch/x86/kvm/x86.c | 16 ++++++++++++----
3 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 499bd42e3a32..f536102f1eca 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -756,7 +756,6 @@ struct kvm_vcpu_arch {
bool at_instruction_boundary;
bool tpr_access_reporting;
bool xfd_no_write_intercept;
- u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
u64 perf_capabilities;
@@ -812,6 +811,8 @@ struct kvm_vcpu_arch {
u64 xcr0;
u64 guest_supported_xcr0;
+ u64 guest_supported_xss;
+ u64 ia32_xss;
struct kvm_pio_request pio;
void *pio_data;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0351e311168a..1d9843b34196 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
best = cpuid_entry2_find(entries, nent, 0xD, 1);
if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
- best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
+ best->ebx = xstate_required_size(vcpu->arch.xcr0 |
+ vcpu->arch.ia32_xss, true);
best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
if (kvm_hlt_in_guest(vcpu->kvm) && best &&
@@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
}
+static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
+ if (!best)
+ return 0;
+
+ return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
+}
+
static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
{
struct kvm_cpuid_entry2 *entry;
@@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
}
vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
+ vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
kvm_update_pv_runtime(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f7d4cc61bc55..649a100ffd25 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3901,20 +3901,28 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.ia32_tsc_adjust_msr += adj;
}
break;
- case MSR_IA32_XSS:
- if (!msr_info->host_initiated &&
- !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
+ case MSR_IA32_XSS: {
+ /*
+ * If KVM reported support of XSS MSR, even guest CPUID doesn't
+ * support XSAVES, still allow userspace to set default value(0)
+ * to this MSR.
+ */
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
+ !(msr_info->host_initiated && data == 0))
return 1;
/*
* KVM supports exposing PT to the guest, but does not support
* IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
* XSAVES/XRSTORS to save/restore PT MSRs.
*/
- if (data & ~kvm_caps.supported_xss)
+ if (data & ~vcpu->arch.guest_supported_xss)
return 1;
+ if (vcpu->arch.ia32_xss == data)
+ break;
vcpu->arch.ia32_xss = data;
kvm_update_cpuid_runtime(vcpu);
break;
+ }
case MSR_SMI_COUNT:
if (!msr_info->host_initiated)
return 1;
--
2.27.0
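For reference, a hedged guest-side sketch of how the updated leaf is consumed:
after programming XCR0 and IA32_XSS, the guest reads CPUID.(EAX=0DH,ECX=1).EBX
to learn how large its XSAVES buffer must be. The helper name is illustrative.

#include <stdint.h>
#include <cpuid.h>

/* Size (in bytes) of the compacted XSAVES area covering every state
 * component currently enabled in XCR0 | IA32_XSS. */
static uint32_t guest_xsaves_buffer_size(void)
{
	uint32_t eax, ebx, ecx, edx;

	__cpuid_count(0x0d, 1, eax, ebx, ecx, edx);
	return ebx;
}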
Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields explicitly.
Kernel IBT is supported and the MSR_IA32_S_CET setting is static after boot
(the exception is the BIOS call case, but a vCPU thread never crosses it), so
KVM doesn't need to refresh the HOST_S_CET field before every VM-Entry/VM-Exit
sequence.
Host supervisor shadow stack is not enabled now and SSP is not accessible to
kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP VMCS fields
to 0. When shadow stack is enabled for CPL3, SSP is reloaded from PL3_SSP
before execution returns to userspace. Check SDM Vol. 2A/B Chapters 3/4 for
SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/CALL etc.
Prevent KVM module loading if the host supervisor shadow stack SHSTK_EN bit is
set in MSR_IA32_S_CET, as KVM cannot co-exist with it correctly.
Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/vmx/capabilities.h | 4 ++++
arch/x86/kvm/vmx/vmx.c | 15 +++++++++++++++
arch/x86/kvm/x86.c | 14 ++++++++++++++
arch/x86/kvm/x86.h | 1 +
4 files changed, 34 insertions(+)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..ee8938818c8a 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -106,6 +106,10 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
}
+static inline bool cpu_has_load_cet_ctrl(void)
+{
+ return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_CET_STATE);
+}
static inline bool cpu_has_vmx_mpx(void)
{
return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e484333eddb0..c658f2f230df 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4375,6 +4375,21 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
if (cpu_has_load_ia32_efer())
vmcs_write64(HOST_IA32_EFER, host_efer);
+
+ /*
+ * Supervisor shadow stack is not enabled on host side, i.e.,
+ * host IA32_S_CET.SHSTK_EN bit is guaranteed to 0 now, per SDM
+ * description(RDSSP instruction), SSP is not readable in CPL0,
+ * so resetting the two registers to 0s at VM-Exit does no harm
+ * to kernel execution. When execution flow exits to userspace,
+ * SSP is reloaded from IA32_PL3_SSP. Check SDM Vol.2A/B Chapter
+ * 3 and 4 for details.
+ */
+ if (cpu_has_load_cet_ctrl()) {
+ vmcs_writel(HOST_S_CET, host_s_cet);
+ vmcs_writel(HOST_SSP, 0);
+ vmcs_writel(HOST_INTR_SSP_TABLE, 0);
+ }
}
void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5792ed16e61b..c6b57ede0f57 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -114,6 +114,8 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
#endif
static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
+u64 __read_mostly host_s_cet;
+EXPORT_SYMBOL_GPL(host_s_cet);
#define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_MAP_GPA_RANGE)
@@ -9773,6 +9775,18 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
return -EIO;
}
+ if (boot_cpu_has(X86_FEATURE_SHSTK)) {
+ rdmsrl(MSR_IA32_S_CET, host_s_cet);
+ /*
+ * Linux doesn't yet support supervisor shadow stacks (SSS), so
+ * KVM doesn't save/restore the associated MSRs, i.e. KVM may
+ * clobber the host values. Yell and refuse to load if SSS is
+ * unexpectedly enabled, e.g. to avoid crashing the host.
+ */
+ if (WARN_ON_ONCE(host_s_cet & CET_SHSTK_EN))
+ return -EIO;
+ }
+
x86_emulator_cache = kvm_alloc_emulator_cache();
if (!x86_emulator_cache) {
pr_err("failed to allocate cache for x86 emulator\n");
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6e42ede335f5..d9cc352cf421 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -325,6 +325,7 @@ fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
extern u64 host_xcr0;
extern u64 host_xss;
extern u64 host_arch_capabilities;
+extern u64 host_s_cet;
extern struct kvm_caps kvm_caps;
--
2.27.0
Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates the HW
architectural behavior when the guest enters/leaves SMM, i.e., it saves
registers to SMRAM on SMM entry and reloads them on SMM exit. Per the SDM,
SSP is one of these registers on 64-bit architectures, so add support for SSP.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/smm.c | 8 ++++++++
arch/x86/kvm/smm.h | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index 45c855389ea7..7aac9c54c353 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
+
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
+ vcpu->kvm);
}
#endif
@@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
ctxt->interruptibility = (u8)smstate->int_shadow;
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
+ vcpu->kvm);
+
return X86EMUL_CONTINUE;
}
#endif
diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..1e2a3e18207f 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
u32 smbase;
u32 reserved4[5];
- /* ssp and svm_* fields below are not implemented by KVM */
u64 ssp;
+ /* svm_* fields below are not implemented by KVM */
u64 svm_guest_pat;
u64 svm_host_efer;
u64 svm_host_cr4;
--
2.27.0
Initialize kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS) if XSAVES
is supported. host_xss contains the host-supported xstate feature bits used
for thread FPU context switching, and KVM_SUPPORTED_XSS includes all XSS
feature bits enabled by KVM; the resulting value represents the supervisor
xstates that are available to the guest and are backed by the host FPU
framework for swapping {guest,host} XSAVE-managed registers/MSRs.
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 649a100ffd25..607082aca80d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -226,6 +226,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
+#define KVM_SUPPORTED_XSS 0
+
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);
@@ -9648,12 +9650,13 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
}
+ if (boot_cpu_has(X86_FEATURE_XSAVES)) {
+ rdmsrl(MSR_IA32_XSS, host_xss);
+ kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+ }
rdmsrl_safe(MSR_EFER, &host_efer);
- if (boot_cpu_has(X86_FEATURE_XSAVES))
- rdmsrl(MSR_IA32_XSS, host_xss);
-
kvm_init_pmu_capability(ops->pmu_ops);
if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
--
2.27.0
Set up the CET MSRs, the related VM_ENTRY/EXIT control bits and the fixed CR4
setting to enable CET for nested VMs.
Note, generally the L1 VMM only touches the CET VMCS fields on live migration
or when vmcs_{read,write}() to the fields happens, so the fields only need to
be synced in these "rare" cases. This patch only considers the case where the
L1 VMM has set VM_ENTRY_LOAD_CET_STATE in its VMCS vm_entry_controls, as that
is the common usage.
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 48 +++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmcs12.c | 6 +++++
arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++-
arch/x86/kvm/vmx/vmx.c | 2 ++
4 files changed, 67 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d8c32682ca76..965173650542 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
+ /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_U_CET, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_S_CET, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL0_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL1_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL2_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL3_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
+
kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
vmx->nested.force_msr_bitmap_recalc = false;
@@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+ if (vmx->nested.nested_run_pending &&
+ (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
+ if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
+ vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
+ vmcs_writel(GUEST_INTR_SSP_TABLE,
+ vmcs12->guest_ssp_tbl);
+ }
+ if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
+ guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
+ vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
+ }
}
if (nested_cpu_has_xsaves(vmcs12))
@@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+ if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
+ vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
+ vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
+ }
+ if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
+ guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
+ vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
+ }
+
vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
}
@@ -6798,7 +6841,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
VM_EXIT_HOST_ADDR_SPACE_SIZE |
#endif
VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
- VM_EXIT_CLEAR_BNDCFGS;
+ VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
msrs->exit_ctls_high |=
VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6820,7 +6863,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
#ifdef CONFIG_X86_64
VM_ENTRY_IA32E_MODE |
#endif
- VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+ VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+ VM_ENTRY_LOAD_CET_STATE;
msrs->entry_ctls_high |=
(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..4233b5ca9461 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+ FIELD(GUEST_S_CET, guest_s_cet),
+ FIELD(GUEST_SSP, guest_ssp),
+ FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
FIELD(HOST_CR0, host_cr0),
FIELD(HOST_CR3, host_cr3),
FIELD(HOST_CR4, host_cr4),
@@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
FIELD(HOST_RSP, host_rsp),
FIELD(HOST_RIP, host_rip),
+ FIELD(HOST_S_CET, host_s_cet),
+ FIELD(HOST_SSP, host_ssp),
+ FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
};
const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 01936013428b..3884489e7f7e 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -117,7 +117,13 @@ struct __packed vmcs12 {
natural_width host_ia32_sysenter_eip;
natural_width host_rsp;
natural_width host_rip;
- natural_width paddingl[8]; /* room for future expansion */
+ natural_width host_s_cet;
+ natural_width host_ssp;
+ natural_width host_ssp_tbl;
+ natural_width guest_s_cet;
+ natural_width guest_ssp;
+ natural_width guest_ssp_tbl;
+ natural_width paddingl[2]; /* room for future expansion */
u32 pin_based_vm_exec_control;
u32 cpu_based_vm_exec_control;
u32 exception_bitmap;
@@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_ia32_sysenter_eip, 656);
CHECK_OFFSET(host_rsp, 664);
CHECK_OFFSET(host_rip, 672);
+ CHECK_OFFSET(host_s_cet, 680);
+ CHECK_OFFSET(host_ssp, 688);
+ CHECK_OFFSET(host_ssp_tbl, 696);
+ CHECK_OFFSET(guest_s_cet, 704);
+ CHECK_OFFSET(guest_ssp, 712);
+ CHECK_OFFSET(guest_ssp_tbl, 720);
CHECK_OFFSET(pin_based_vm_exec_control, 744);
CHECK_OFFSET(cpu_based_vm_exec_control, 748);
CHECK_OFFSET(exception_bitmap, 752);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a1aae8709939..947028ff2e25 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7734,6 +7734,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
+ cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
+ cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));
#undef cr4_fixed1_update
}
--
2.27.0
Enable/disable CET MSR interception per the associated feature configuration.
The Shadow Stack feature requires all CET MSRs to be passed through to the
guest to support it in user and supervisor mode, while the IBT feature only
depends on MSR_IA32_{U,S}_CET to enable user and supervisor IBT.
Note, this MSR design introduces an architectural limitation on SHSTK and IBT
control for the guest, i.e., when SHSTK is exposed, IBT is also available to
the guest from an architectural perspective since IBT relies on a subset of
the SHSTK-relevant MSRs.
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 554f665e59c3..e484333eddb0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -699,6 +699,10 @@ static bool is_valid_passthrough_msr(u32 msr)
case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
return true;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
+ return true;
}
r = possible_passthrough_msr_slot(msr) != -ENOENT;
@@ -7766,6 +7770,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}
+static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
+{
+ bool incpt;
+
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+ incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
+ MSR_TYPE_RW, incpt);
+ if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
+ MSR_TYPE_RW, incpt);
+ if (!incpt)
+ return;
+ }
+
+ if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
+ incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
+ MSR_TYPE_RW, incpt);
+ }
+}
+
static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7843,6 +7883,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
+
+ vmx_update_intercept_for_cet_msr(vcpu);
}
static u64 vmx_get_perf_capabilities(void)
--
2.27.0
Per the SDM description (Vol. 3D, Appendix A.1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector"
Modify the has_error_code check before injecting events to the nested guest:
require that no error code is delivered when the guest is in real mode or the
event is not a hardware exception, and only enforce the vector-based
error-code consistency check when the platform doesn't enumerate bit 56 in
VMX_BASIC; in all other cases skip the check to keep the logic consistent
with the SDM.
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 27 ++++++++++++++++++---------
arch/x86/kvm/vmx/nested.h | 5 +++++
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2034337681f9..d8c32682ca76 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
{
const u64 feature_and_reserved =
/* feature (except bit 48; see below) */
- BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
+ BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
/* reserved */
- BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
+ BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
u64 vmx_basic = vmcs_config.nested.basic;
if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
@@ -2829,7 +2829,6 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
u8 vector = intr_info & INTR_INFO_VECTOR_MASK;
u32 intr_type = intr_info & INTR_INFO_INTR_TYPE_MASK;
bool has_error_code = intr_info & INTR_INFO_DELIVER_CODE_MASK;
- bool should_have_error_code;
bool urg = nested_cpu_has2(vmcs12,
SECONDARY_EXEC_UNRESTRICTED_GUEST);
bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
@@ -2846,12 +2845,20 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
return -EINVAL;
- /* VM-entry interruption-info field: deliver error code */
- should_have_error_code =
- intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
- x86_exception_has_error_code(vector);
- if (CC(has_error_code != should_have_error_code))
- return -EINVAL;
+ /*
+ * Cannot deliver error code in real mode or if the interrupt
+ * type is not hardware exception. For other cases, do the
+ * consistency check only if the vCPU doesn't enumerate
+ * VMX_BASIC_NO_HW_ERROR_CODE_CC.
+ */
+ if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
+ if (CC(has_error_code))
+ return -EINVAL;
+ } else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
+ if (CC(has_error_code !=
+ x86_exception_has_error_code(vector)))
+ return -EINVAL;
+ }
/* VM-entry exception error code */
if (CC(has_error_code &&
@@ -6969,6 +6976,8 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
if (cpu_has_vmx_basic_inout())
msrs->basic |= VMX_BASIC_INOUT;
+ if (cpu_has_vmx_basic_no_hw_errcode())
+ msrs->basic |= VMX_BASIC_NO_HW_ERROR_CODE_CC;
}
static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index b4b9d51438c6..26842da6857d 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -284,6 +284,11 @@ static inline bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
__kvm_is_valid_cr4(vcpu, val);
}
+static inline bool nested_cpu_has_no_hw_errcode_cc(struct kvm_vcpu *vcpu)
+{
+ return to_vmx(vcpu)->nested.msrs.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
/* No difference in the restrictions on guest and host CR4 in VMX operation. */
#define nested_guest_cr4_valid nested_cr4_valid
#define nested_host_cr4_valid nested_cr4_valid
--
2.27.0
On Fri, Nov 24, 2023 at 12:53:07AM -0500, Yang Weijiang wrote:
> Note, in KVM case, guest CET supervisor state i.e., IA32_PL{0,1,2}_MSRs,
> are preserved after VM-Exit until host/guest fpstates are swapped, but
> since host supervisor shadow stack is disabled, the preserved MSRs won't
> hurt host.
Just to be clear, with FRED all this changes, right? Then we get more
VMCS fields for SSS state.
On 11/24/2023 5:45 PM, Peter Zijlstra wrote:
> On Fri, Nov 24, 2023 at 12:53:07AM -0500, Yang Weijiang wrote:
>
>> Note, in KVM case, guest CET supervisor state i.e., IA32_PL{0,1,2}_MSRs,
>> are preserved after VM-Exit until host/guest fpstates are swapped, but
>> since host supervisor shadow stack is disabled, the preserved MSRs won't
>> hurt host.
> Just to be clear, with FRED all this changes, right? Then we get more
> VMCS fields for SSS state.
Yes, I think so, KVM needs to properly handle guest SSS state and host FRED states.
Thanks!
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Add supervisor mode state support within FPU xstate management
> framework.
> Although supervisor shadow stack is not enabled/used today in
> kernel,KVM
> requires the support because when KVM advertises shadow stack feature
> to
> guest, architecturally it claims the support for both user and
> supervisor
> modes for guest OSes(Linux or non-Linux).
>
> CET supervisor states not only includes PL{0,1,2}_SSP but also
> IA32_S_CET
> MSR, but the latter is not xsave-managed. In virtualization world,
> guest
> IA32_S_CET is saved/stored into/from VM control structure. With
> supervisor
> xstate support, guest supervisor mode shadow stack state can be
> properly
> saved/restored when 1) guest/host FPU context is swapped 2) vCPU
> thread is sched out/in.
>
> The alternative is to enable it in KVM domain, but KVM maintainers
> NAKed
> the solution. The external discussion can be found at [*], it ended
> up
> with adding the support in kernel instead of KVM domain.
>
> Note, in KVM case, guest CET supervisor state i.e.,
> IA32_PL{0,1,2}_MSRs,
> are preserved after VM-Exit until host/guest fpstates are swapped,
> but
> since host supervisor shadow stack is disabled, the preserved MSRs
> won't
> hurt host.
>
> [*]:
> https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Define new XFEATURE_MASK_KERNEL_DYNAMIC set including the features
> can be
> optionally enabled by kernel components, i.e., the features are
> required by
> specific kernel components.
The above is a bit tough to parse. Does any of this seem clearer?
Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features
that can be optionally enabled by kernel components. This is similar to
XFEATURE_MASK_USER_DYNAMIC in that it contains optional xfeatures
that allow the FPU buffer to be dynamically sized. The difference
is that the KERNEL variant contains supervisor features and will be
enabled by kernel components that need them, and not directly by the
user.
> Currently it's used by KVM to configure guest
> dedicated fpstate for calculating the xfeature and fpstate storage
> size etc.
>
> The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL,
> which is
> supported by host as they're enabled in xsaves/xrstors operating
> xfeature set
> (XCR0 | XSS), but the relevant CPU feature, i.e., supervisor shadow
> stack, is
> not enabled in host kernel so it can be omitted for normal fpstate by
> default.
>
> Remove the kernel dynamic feature from
> fpu_kernel_cfg.default_features so that
> the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors can
> be
> optimized by HW for normal fpstate.
Thanks for breaking these into small patches.
On 11/28/2023 9:46 AM, Edgecombe, Rick P wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Define new XFEATURE_MASK_KERNEL_DYNAMIC set including the features
>> can be
>> optionally enabled by kernel components, i.e., the features are
>> required by
>> specific kernel components.
> The above is a bit tough to parse. Does any of this seem clearer?
>
> Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features
> that can be optionally enabled by kernel components. This is similar to
> XFEATURE_MASK_USER_DYNAMIC in that it contains optional xfeatures
> that allow the FPU buffer to be dynamically sized. The difference
> is that the KERNEL variant contains supervisor features and will be
> enabled by kernel components that need them, and not directly by the
> user.
Definitely the wording is much better, I'll apply it to next version, thanks a lot!
>> Currently it's used by KVM to configure guest
>> dedicated fpstate for calculating the xfeature and fpstate storage
>> size etc.
>>
>> The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL,
>> which is
>> supported by host as they're enabled in xsaves/xrstors operating
>> xfeature set
>> (XCR0 | XSS), but the relevant CPU feature, i.e., supervisor shadow
>> stack, is
>> not enabled in host kernel so it can be omitted for normal fpstate by
>> default.
>>
>> Remove the kernel dynamic feature from
>> fpu_kernel_cfg.default_features so that
>> the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors can
>> be
>> optimized by HW for normal fpstate.
> Thanks for breaking these into small patches.
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> + /*
> + * Set guest's __user_state_size to fpu_user_cfg.default_size
> so that
> + * existing uAPIs can still work.
> + */
> + fpu->guest_perm.__user_state_size =
> fpu_user_cfg.default_size;
It seems like an appropriate value, but where does this come into play
exactly for guest FPUs?
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct
> fpu_guest *gfpu)
> {
> + bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
> + unsigned int gfpstate_size, size;
> struct fpstate *fpstate;
> - unsigned int size;
>
> - size = fpu_user_cfg.default_size + ALIGN(offsetof(struct
> fpstate, regs), 64);
> + /*
> + * fpu_guest_cfg.default_features includes all enabled
> xfeatures
> + * except the user dynamic xfeatures. If the user dynamic
> xfeatures
> + * are enabled, the guest fpstate will be re-allocated to
> hold all
> + * guest enabled xfeatures, so omit user dynamic xfeatures
> here.
> + */
> + gfpstate_size =
> xstate_calculate_size(fpu_guest_cfg.default_features,
> + compacted);
Why not fpu_guest_cfg.default_size here?
> +
> + size = gfpstate_size + ALIGN(offsetof(struct fpstate, regs),
> 64);
On 11/28/2023 10:58 PM, Edgecombe, Rick P wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> + /*
>> + * Set guest's __user_state_size to fpu_user_cfg.default_size
>> so that
>> + * existing uAPIs can still work.
>> + */
>> + fpu->guest_perm.__user_state_size =
>> fpu_user_cfg.default_size;
> It seems like an appropriate value, but where does this come into play
> exactly for guest FPUs?
I don't see any special usage of this field for a vCPU in VMM userspace (QEMU).
Maybe it's mainly for handling userspace faults resulting from AMX? For a vCPU
thread, it's only referenced when AMX is enabled via __xfd_enable_feature().
On 11/28/2023 11:19 PM, Edgecombe, Rick P wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct
>> fpu_guest *gfpu)
>> {
>> + bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
>> + unsigned int gfpstate_size, size;
>> struct fpstate *fpstate;
>> - unsigned int size;
>>
>> - size = fpu_user_cfg.default_size + ALIGN(offsetof(struct
>> fpstate, regs), 64);
>> + /*
>> + * fpu_guest_cfg.default_features includes all enabled
>> xfeatures
>> + * except the user dynamic xfeatures. If the user dynamic
>> xfeatures
>> + * are enabled, the guest fpstate will be re-allocated to
>> hold all
>> + * guest enabled xfeatures, so omit user dynamic xfeatures
>> here.
>> + */
>> + gfpstate_size =
>> xstate_calculate_size(fpu_guest_cfg.default_features,
>> + compacted);
> Why not fpu_guest_cfg.default_size here?
Nice catch!
I should use fpu_guest_cfg.default_size directly instead of re-calculating it in the same manner. Thanks!
>
>> +
>> + size = gfpstate_size + ALIGN(offsetof(struct fpstate, regs),
>> 64);
On 11/28/2023 11:25 PM, Edgecombe, Rick P wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Kernel dynamic xfeatures now are __ONLY__ enabled for guest fpstate,
>> i.e.,
>> none for normal kernel fpstate. The bits are added when guest FPU
> ^never?
Sure, will change it, thank you!
>> config
>> is initialized. Guest fpstate is allocated with fpstate->is_guest set
>> to
>> %true.
>>
>> For normal fpstate, the bits should have been removed when
>> initializes
>> kernel FPU config settings, WARN_ONCE() if kernel detects normal
>> fpstate
>> xfeatures contains kernel dynamic xfeatures before executes xsaves.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
> Otherwise...
>
> Reviewed-by: Rick Edgecombe <[email protected]>
On Wed, 2023-11-29 at 22:12 +0800, Yang, Weijiang wrote:
> On 11/28/2023 10:58 PM, Edgecombe, Rick P wrote:
> > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > + /*
> > > + * Set guest's __user_state_size to
> > > fpu_user_cfg.default_size
> > > so that
> > > + * existing uAPIs can still work.
> > > + */
> > > + fpu->guest_perm.__user_state_size =
> > > fpu_user_cfg.default_size;
> > It seems like an appropriate value, but where does this come into
> > play
> > exactly for guest FPUs?
>
> I don't see there's special usage of this field for vCPU in VMM
> userspace(QEMU).
> Maybe it's mainly for AMX resulted usespace fault handling? For vCPU
> thread,
> it's only referenced when AMX is enabled via __xfd_enable_feature()
> .
>
In that case the "so that existing uAPIs can still work" comment seems
misleading. Maybe "this doesn't come into play for guest FPUs, but set
it to a reasonable value"?
On 11/30/2023 1:08 AM, Edgecombe, Rick P wrote:
> On Wed, 2023-11-29 at 22:12 +0800, Yang, Weijiang wrote:
>> On 11/28/2023 10:58 PM, Edgecombe, Rick P wrote:
>>> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>>>> + /*
>>>> + * Set guest's __user_state_size to
>>>> fpu_user_cfg.default_size
>>>> so that
>>>> + * existing uAPIs can still work.
>>>> + */
>>>> + fpu->guest_perm.__user_state_size =
>>>> fpu_user_cfg.default_size;
>>> It seems like an appropriate value, but where does this come into
>>> play
>>> exactly for guest FPUs?
>> I don't see there's special usage of this field for vCPU in VMM
>> userspace(QEMU).
>> Maybe it's mainly for AMX resulted usespace fault handling? For vCPU
>> thread,
>> it's only referenced when AMX is enabled via __xfd_enable_feature()
>> .
>>
> In that case the "so that existing uAPIs can still work" comment seems
> misleading. Maybe "this doesn't come into play for guest FPUs, but set
> it to a reasonable value"?
Ah, I mistook it for uabi_size and added comments. Will reword it properly.
Thank you for bringing it up!
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> From: Sean Christopherson <[email protected]>
>
> When granting userspace or a KVM guest access to an xfeature, preserve the
> entity's existing supervisor and software-defined permissions as tracked
> by __state_perm, i.e. use __state_perm to track *all* permissions even
> though all supported supervisor xfeatures are granted to all FPUs and
> FPU_GUEST_PERM_LOCKED disallows changing permissions.
>
> Effectively clobbering supervisor permissions results in inconsistent
> behavior, as xstate_get_group_perm() will report supervisor features for
> process that do NOT request access to dynamic user xfeatures, whereas any
> and all supervisor features will be absent from the set of permissions for
> any process that is granted access to one or more dynamic xfeatures (which
> right now means AMX).
>
> The inconsistency isn't problematic because fpu_xstate_prctl() already
> strips out everything except user xfeatures:
>
> case ARCH_GET_XCOMP_PERM:
> /*
> * Lockless snapshot as it can also change right after the
> * dropping the lock.
> */
> permitted = xstate_get_host_group_perm();
> permitted &= XFEATURE_MASK_USER_SUPPORTED;
> return put_user(permitted, uptr);
>
> case ARCH_GET_XCOMP_GUEST_PERM:
> permitted = xstate_get_guest_group_perm();
> permitted &= XFEATURE_MASK_USER_SUPPORTED;
> return put_user(permitted, uptr);
>
> and similarly KVM doesn't apply the __state_perm to supervisor states
> (kvm_get_filtered_xcr0() incorporates xstate_get_guest_group_perm()):
>
> case 0xd: {
> u64 permitted_xcr0 = kvm_get_filtered_xcr0();
> u64 permitted_xss = kvm_caps.supported_xss;
>
> But if KVM in particular were to ever change, dropping supervisor
> permissions would result in subtle bugs in KVM's reporting of supported
> CPUID settings. And the above behavior also means that having supervisor
> xfeatures in __state_perm is correctly handled by all users.
>
> Dropping supervisor permissions also creates another landmine for KVM. If
> more dynamic user xfeatures are ever added, requesting access to multiple
> xfeatures in separate ARCH_REQ_XCOMP_GUEST_PERM calls will result in the
> second invocation of __xstate_request_perm() computing the wrong ksize, as
> the mask passed to xstate_calculate_size() would not contain *any*
> supervisor features.
>
> Commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE
> permissions") fudged around the size issue for userspace FPUs, but for
> reasons unknown skipped guest FPUs. Lack of a fix for KVM "works" only
> because KVM doesn't yet support virtualizing features that have supervisor
> xfeatures, i.e. as of today, KVM guest FPUs will never need the relevant
> xfeatures.
>
> Simply extending the hack-a-fix for guests would temporarily solve the
> ksize issue, but wouldn't address the inconsistency issue and would leave
> another lurking pitfall for KVM. KVM support for virtualizing CET will
> likely add CET_KERNEL as a guest-only xfeature, i.e. CET_KERNEL will not
> be set in xfeatures_mask_supervisor() and would again be dropped when
> granting access to dynamic xfeatures.
>
> Note, the existing clobbering behavior is rather subtle. The @permitted
> parameter to __xstate_request_perm() comes from:
>
> permitted = xstate_get_group_perm(guest);
>
> which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm,
> where __state_perm is initialized to:
>
> fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
>
> and copied to the guest side of things:
>
> /* Same defaults for guests */
> fpu->guest_perm = fpu->perm;
>
> fpu_kernel_cfg.default_features contains everything except the dynamic
> xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:
>
> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>
> When __xstate_request_perm() restricts the local "mask" variable to
> compute the user state size:
>
> mask &= XFEATURE_MASK_USER_SUPPORTED;
> usize = xstate_calculate_size(mask, false);
>
> it subtly overwrites the target __state_perm with "mask" containing only
> user xfeatures:
>
> perm = guest ? &fpu->guest_perm : &fpu->perm;
> /* Pairs with the READ_ONCE() in xstate_get_group_perm() */
> WRITE_ONCE(perm->__state_perm, mask);
>
> Cc: Maxim Levitsky <[email protected]>
> Cc: Weijiang Yang <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Chao Gao <[email protected]>
> Cc: Rick Edgecombe <[email protected]>
> Cc: John Allen <[email protected]>
> Cc: [email protected]
> Link: https://lore.kernel.org/all/[email protected]
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index ef6906107c54..73f6bc00d178 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> if ((permitted & requested) == requested)
> return 0;
>
> - /* Calculate the resulting kernel state size */
> + /*
> + * Calculate the resulting kernel state size. Note, @permitted also
> + * contains supervisor xfeatures even though supervisor are always
> + * permitted for kernel and guest FPUs, and never permitted for user
> + * FPUs.
> + */
> mask = permitted | requested;
> - /* Take supervisor states into account on the host */
> - if (!guest)
> - mask |= xfeatures_mask_supervisor();
> ksize = xstate_calculate_size(mask, compacted);
>
> - /* Calculate the resulting user state size */
> - mask &= XFEATURE_MASK_USER_SUPPORTED;
> - usize = xstate_calculate_size(mask, false);
> + /*
> + * Calculate the resulting user state size. Take care not to clobber
> + * the supervisor xfeatures in the new mask!
> + */
> + usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);
>
> if (!guest) {
> ret = validate_sigaltstack(usize);
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Use fpu_guest_cfg to calculate guest fpstate settings, and open code
> __fpstate_reset() to avoid using the kernel FPU config.
>
> The following configuration steps are currently enforced to set up guest fpstate:
> 1) Kernel sets up guest FPU settings in fpu__init_system_xstate().
> 2) User space sets vCPU thread group xstate permits via arch_prctl().
> 3) User space creates guest fpstate via __fpu_alloc_init_guest_fpstate()
> for vcpu thread.
> 4) User space enables guest dynamic xfeatures and re-allocates guest
> fpstate.
>
> By adding kernel dynamic xfeatures in above #1 and #2, guest xstate area
> size is expanded to hold (fpu_kernel_cfg.default_features | kernel dynamic
> xfeatures | user dynamic xfeatures), then host xsaves/xrstors can operate
> for all guest xfeatures.
>
> The user_* fields remain unchanged for compatibility with KVM uAPIs.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kernel/fpu/core.c | 48 ++++++++++++++++++++++++++++--------
> arch/x86/kernel/fpu/xstate.c | 2 +-
> arch/x86/kernel/fpu/xstate.h | 1 +
> 3 files changed, 40 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 516af626bf6a..985eaf8b55e0 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -194,8 +194,6 @@ void fpu_reset_from_exception_fixup(void)
> }
>
> #if IS_ENABLED(CONFIG_KVM)
> -static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
> -
> static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
> {
> struct fpu_state_perm *fpuperm;
> @@ -216,25 +214,55 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
> gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
> }
>
> -bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
> {
> + bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
> + unsigned int gfpstate_size, size;
> struct fpstate *fpstate;
> - unsigned int size;
>
> - size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
> + /*
> + * fpu_guest_cfg.default_features includes all enabled xfeatures
> + * except the user dynamic xfeatures. If the user dynamic xfeatures
> + * are enabled, the guest fpstate will be re-allocated to hold all
> + * guest enabled xfeatures, so omit user dynamic xfeatures here.
> + */
This is a very good comment to have, although I don't think there is any way
to ensure that the whole thing is not utterly confusing.....
> + gfpstate_size = xstate_calculate_size(fpu_guest_cfg.default_features,
> + compacted);
> +
> + size = gfpstate_size + ALIGN(offsetof(struct fpstate, regs), 64);
> +
> fpstate = vzalloc(size);
> if (!fpstate)
> - return false;
> + return NULL;
> + /*
> + * Initialize sizes and feature masks, use fpu_user_cfg.*
> + * for user_* settings for compatibility of exiting uAPIs.
> + */
> + fpstate->size = gfpstate_size;
> + fpstate->xfeatures = fpu_guest_cfg.default_features;
> + fpstate->user_size = fpu_user_cfg.default_size;
> + fpstate->user_xfeatures = fpu_user_cfg.default_features;
The whole thing makes my head spin like the good old CD/DVD writers used to ....
So just to summarize this is what we have:
KERNEL FPU CONFIG
/*
all known and CPU supported user and supervisor features except
- "dynamic" kernel features" (CET_S)
- "independent" kernel features (XFEATURE_LBR)
*/
fpu_kernel_cfg.max_features;
/*
all known and CPU supported user and supervisor features except
- "dynamic" kernel features" (CET_S)
- "independent" kernel features (arch LBRs)
- "dynamic" userspace features (AMX state)
*/
fpu_kernel_cfg.default_features;
// size of compacted buffer with 'fpu_kernel_cfg.max_features'
fpu_kernel_cfg.max_size;
// size of compacted buffer with 'fpu_kernel_cfg.default_features'
fpu_kernel_cfg.default_size;
USER FPU CONFIG
/*
all known and CPU supported user features
*/
fpu_user_cfg.max_features;
/*
all known and CPU supported user features except
- "dynamic" userspace features (AMX state)
*/
fpu_user_cfg.default_features;
// size of non compacted buffer with 'fpu_user_cfg.max_features'
fpu_user_cfg.max_size;
// size of non compacted buffer with 'fpu_user_cfg.default_features'
fpu_user_cfg.default_size;
GUEST FPU CONFIG
/*
all known and CPU supported user and supervisor features except
- "independent" kernel features (XFEATURE_LBR)
*/
fpu_guest_cfg.max_features;
/*
all known and CPU supported user and supervisor features except
- "independent" kernel features (arch LBRs)
- "dynamic" userspace features (AMX state)
*/
fpu_guest_cfg.default_features;
// size of compacted buffer with 'fpu_guest_cfg.max_features'
fpu_guest_cfg.max_size;
// size of compacted buffer with 'fpu_guest_cfg.default_features'
fpu_guest_cfg.default_size;
---
So in essence, the guest FPU config is the guest *kernel* FPU config, and that is why
'fpu_user_cfg.default_size' had to be used above.
How about having fpu_guest_kernel_config and fpu_guest_user_config instead, to make
the whole horrible thing maybe even more complicated but at least a bit more orthogonal?
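A very rough sketch of what that split could look like (purely hypothetical, just to
illustrate the idea; the config objects and init helper below do not exist anywhere):

struct fpu_state_config fpu_guest_kernel_config;	/* kernel/XSAVES view */
struct fpu_state_config fpu_guest_user_config;	/* uAPI (non-compacted) view */

static void __init fpu__init_guest_configs(void)
{
        bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);

        /*
         * Kernel-format features a guest may use: the kernel defaults plus
         * the guest-only "kernel dynamic" xfeatures (CET_S), still without
         * the user dynamic xfeatures (AMX) until permission is granted.
         */
        fpu_guest_kernel_config = fpu_kernel_cfg;
        fpu_guest_kernel_config.default_features |= XFEATURE_MASK_KERNEL_DYNAMIC;
        fpu_guest_kernel_config.default_size =
                xstate_calculate_size(fpu_guest_kernel_config.default_features,
                                      compacted);

        /* What the existing uAPIs (KVM_GET_XSAVE2 etc.) keep seeing. */
        fpu_guest_user_config = fpu_user_cfg;
}

That would at least make it explicit which "side" of the guest config each user_*
field is supposed to come from.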
Best regards,
Maxim Levitsky
> + fpstate->xfd = 0;
>
> - /* Leave xfd to 0 (the reset value defined by spec) */
> - __fpstate_reset(fpstate, 0);
> fpstate_init_user(fpstate);
> fpstate->is_valloc = true;
> fpstate->is_guest = true;
>
> gfpu->fpstate = fpstate;
> - gfpu->xfeatures = fpu_user_cfg.default_features;
> - gfpu->perm = fpu_user_cfg.default_features;
> + gfpu->xfeatures = fpu_guest_cfg.default_features;
> + gfpu->perm = fpu_guest_cfg.default_features;
> +
> + return fpstate;
> +}
> +
> +bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +{
> + struct fpstate *fpstate;
> +
> + fpstate = __fpu_alloc_init_guest_fpstate(gfpu);
> +
> + if (!fpstate)
> + return false;
>
> /*
> * KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index aa8f8595cd41..253944cb2298 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -559,7 +559,7 @@ static bool __init check_xstate_against_struct(int nr)
> return true;
> }
>
> -static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
> +unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
> {
> unsigned int topmost = fls64(xfeatures) - 1;
> unsigned int offset = xstate_offsets[topmost];
> diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
> index 3518fb26d06b..c032acb56306 100644
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -55,6 +55,7 @@ extern void fpu__init_cpu_xstate(void);
> extern void fpu__init_system_xstate(unsigned int legacy_size);
>
> extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
> +extern unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
>
> static inline u64 xfeatures_mask_supervisor(void)
> {
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Define a new XFEATURE_MASK_KERNEL_DYNAMIC set including the features that can be
I am not sure though that this name is correct, but I don't know if I can
suggest a better name.
> optionally enabled by kernel components, i.e., the features are required by
> specific kernel components. Currently it's used by KVM to configure guest
> dedicated fpstate for calculating the xfeature and fpstate storage size etc.
>
> The kernel dynamic xfeatures currently contain only XFEATURE_CET_KERNEL, which is
> supported by the host since it is enabled in the xsaves/xrstors operating xfeature
> set (XCR0 | XSS). However, the relevant CPU feature, i.e., supervisor shadow stack,
> is not enabled in the host kernel, so it can be omitted from the normal fpstate by default.
>
> Remove the kernel dynamic feature from fpu_kernel_cfg.default_features so that
> the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors can be
> optimized by HW for normal fpstate.
>
> Suggested-by: Dave Hansen <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/fpu/xstate.h | 5 ++++-
> arch/x86/kernel/fpu/xstate.c | 1 +
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> index 3b4a038d3c57..a212d3851429 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -46,9 +46,12 @@
> #define XFEATURE_MASK_USER_RESTORE \
> (XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
>
> -/* Features which are dynamically enabled for a process on request */
> +/* Features which are dynamically enabled per userspace request */
> #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
>
> +/* Features which are dynamically enabled per kernel side request */
I suggest explaining this a bit better. How about something like this:
"Kernel features that are not enabled by default for all processes, but can
still be used by some processes, for example to support guest virtualization"
But feel free to keep it as is or propose something else. IMHO this will
be confusing one way or another.
Another question: the kernel already has a notion of 'independent features',
which are currently kernel features that are enabled in IA32_XSS but not present in 'fpu_kernel_cfg.max_features'.
Currently only 'XFEATURE_LBR' is in this set. These features are saved/restored manually
from an independent buffer (in the case of LBRs, the perf code takes care of this).
Does it make sense to add CET_S there as well instead of having XFEATURE_MASK_KERNEL_DYNAMIC, and maybe rename
'XFEATURE_MASK_INDEPENDENT' to something like 'XFEATURES_THE_KERNEL_DOESNT_CARE_ABOUT'?
(Terrible name, but you might think of a better one.)
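For illustration only, a hypothetical sketch of that alternative (nobody has posted
this; it just shows what the define would become if CET_S were treated like arch LBR):

/*
 * Features the core kernel never context-switches itself: they are set in
 * IA32_XSS but excluded from fpu_kernel_cfg.max_features, and whoever enables
 * them (perf for arch LBR, KVM for guest CET_S in this hypothetical scheme)
 * has to save/restore them from its own buffer.
 */
#define XFEATURE_MASK_INDEPENDENT	(XFEATURE_MASK_LBR | \
					 XFEATURE_MASK_CET_KERNEL)

The trade-off is that KVM would then have to manage the CET_S state manually instead
of relying on the FPU core, which is the approach discussed elsewhere in this thread.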
> +#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
> +
> /* All currently supported supervisor features */
> #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
> XFEATURE_MASK_CET_USER | \
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index b57d909facca..ba4172172afd 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> /* Clean out dynamic features from default */
> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> + fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
>
> fpu_user_cfg.default_features = fpu_user_cfg.max_features;
> fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Expose CET features to the guest if KVM and the host can support them;
> otherwise clear the CPUID feature bits.
>
> Set CPUID feature bits so that CET features are available in guest CPUID.
> Add CR4.CET bit support in order to allow the guest to set the CET master
> control bit.
>
> Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
> KVM does not support emulating CET.
> Don't expose CET feature if either of {U,S}_CET xstate bits is cleared
> in host XSS or if XSAVES isn't supported.
>
> The CET load-bits in the VM_ENTRY/VM_EXIT control fields should be set to keep
> guest CET xstates isolated from the host's. All platforms that support CET
> enumerate VMX_BASIC[bit56] as 1, so clear the CET feature bits if the bit doesn't
> read 1.
>
> Per Arch confirmation, CET MSR contents after reset, power-up and INIT are
> set to 0s, so clear the relevant guest fpstate areas so that the guest MSRs
> are reset to 0s after these events.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kvm/cpuid.c | 19 +++++++++++++++++--
> arch/x86/kvm/vmx/capabilities.h | 6 ++++++
> arch/x86/kvm/vmx/vmx.c | 29 ++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.h | 6 ++++--
> arch/x86/kvm/x86.c | 26 ++++++++++++++++++++++++--
> arch/x86/kvm/x86.h | 3 +++
> 8 files changed, 85 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f536102f1eca..fd110a0b712f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -133,7 +133,8 @@
> | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
> | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
> | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
> - | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
> + | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
> + | X86_CR4_CET))
>
> #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 389f9594746e..25ae7ceb5b39 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1097,6 +1097,7 @@
> #define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
> #define VMX_BASIC_MEM_TYPE_WB 6LLU
> #define VMX_BASIC_INOUT 0x0040000000000000LLU
> +#define VMX_BASIC_NO_HW_ERROR_CODE_CC 0x0100000000000000LLU
>
> /* Resctrl MSRs: */
> /* - Intel: */
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1d9843b34196..6d758054f994 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -648,7 +648,7 @@ void kvm_set_cpu_caps(void)
> F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
> F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
> F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
> - F(SGX_LC) | F(BUS_LOCK_DETECT)
> + F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
> );
> /* Set LA57 based on hardware capability. */
> if (cpuid_ecx(7) & F(LA57))
> @@ -666,7 +666,8 @@ void kvm_set_cpu_caps(void)
> F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
> F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
> F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
> - F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
> + F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
> + F(IBT)
> );
>
> /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
> @@ -679,6 +680,20 @@ void kvm_set_cpu_caps(void)
> kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
> if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
> kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> + /*
> + * Don't use boot_cpu_has() to check availability of IBT because the
> + * feature bit is cleared in boot_cpu_data when ibt=off is applied
> + * in host cmdline.
> + *
> + * As currently there's no HW bug which requires disabling IBT feature
> + * while CPU can enumerate it, host cmdline option ibt=off is most
> + * likely due to administrative reason on host side, so KVM refers to
> + * CPU CPUID enumeration to enable the feature. In future if there's
> + * actually some bug clobbered ibt=off option, then enforce additional
> + * check here to disable the support in KVM.
> + */
This is a reasonable explanation.
> + if (cpuid_edx(7) & F(IBT))
> + kvm_cpu_cap_set(X86_FEATURE_IBT);
>
> kvm_cpu_cap_mask(CPUID_7_1_EAX,
> F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index ee8938818c8a..e12bc233d88b 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -79,6 +79,12 @@ static inline bool cpu_has_vmx_basic_inout(void)
> return (((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
> }
>
> +static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
> +{
> + return ((u64)vmcs_config.basic_cap << 32) &
> + VMX_BASIC_NO_HW_ERROR_CODE_CC;
> +}
I still think that we should add a comment explaining why this check is needed,
as I said in the previous review.
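One possible wording for such a comment, based on my reading of the SDM and of the
commit message above (so please double-check before using it verbatim):

/*
 * IA32_VMX_BASIC[56] reports that VM entry may deliver a hardware exception
 * with or without an error code, regardless of vector.  #CP pushes an error
 * code but is not in the legacy list of vectors allowed to do so, so KVM can
 * only inject #CP into the guest when this bit is set; CET-capable parts are
 * expected to enumerate the bit as 1.
 */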
> +
> static inline bool cpu_has_virtual_nmis(void)
> {
> return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c658f2f230df..a1aae8709939 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2614,6 +2614,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> { VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
> { VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
> { VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
> + { VM_ENTRY_LOAD_CET_STATE, VM_EXIT_LOAD_CET_STATE },
> };
>
> memset(vmcs_conf, 0, sizeof(*vmcs_conf));
> @@ -4935,6 +4936,15 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>
> vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
>
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> + vmcs_writel(GUEST_SSP, 0);
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> + kvm_cpu_cap_has(X86_FEATURE_IBT))
> + vmcs_writel(GUEST_S_CET, 0);
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
> + IS_ENABLED(CONFIG_X86_64))
> + vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
Looks reasonable now.
> +
> kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>
> vpid_sync_context(vmx->vpid);
> @@ -6354,6 +6364,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
> vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);
>
> + if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE) {
> + pr_err("S_CET = 0x%016lx\n", vmcs_readl(GUEST_S_CET));
> + pr_err("SSP = 0x%016lx\n", vmcs_readl(GUEST_SSP));
> + pr_err("INTR SSP TABLE = 0x%016lx\n",
> + vmcs_readl(GUEST_INTR_SSP_TABLE));
> + }
> pr_err("*** Host State ***\n");
> pr_err("RIP = 0x%016lx RSP = 0x%016lx\n",
> vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
> @@ -6431,6 +6447,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
> pr_err("Virtual processor ID = 0x%04x\n",
> vmcs_read16(VIRTUAL_PROCESSOR_ID));
> + if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE) {
> + pr_err("S_CET = 0x%016lx\n", vmcs_readl(HOST_S_CET));
> + pr_err("SSP = 0x%016lx\n", vmcs_readl(HOST_SSP));
> + pr_err("INTR SSP TABLE = 0x%016lx\n",
> + vmcs_readl(HOST_INTR_SSP_TABLE));
> + }
> }
>
> /*
> @@ -7964,7 +7986,6 @@ static __init void vmx_set_cpu_caps(void)
> kvm_cpu_cap_set(X86_FEATURE_UMIP);
>
> /* CPUID 0xD.1 */
> - kvm_caps.supported_xss = 0;
> if (!cpu_has_vmx_xsaves())
> kvm_cpu_cap_clear(X86_FEATURE_XSAVES);
>
> @@ -7976,6 +7997,12 @@ static __init void vmx_set_cpu_caps(void)
>
> if (cpu_has_vmx_waitpkg())
> kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> +
> + if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
> + !cpu_has_vmx_basic_no_hw_errcode()) {
> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
> + }
My review feedback from the previous version still applies here; I don't
know why it was not addressed....
"I think that here we also need to clear kvm_caps.supported_xss,
or even better, lets set the CET bits in kvm_caps.supported_xss only
once CET is fully enabled (both this check and check in __kvm_x86_vendor_init pass).
"
In addition to that, I just checked and, unless I am mistaken,
vmx_set_cpu_caps() is called from vmx's hardware_setup(), which is called
from __kvm_x86_vendor_init().
After this call, __kvm_x86_vendor_init() does clear kvm_caps.supported_xss
(when XSAVES is unsupported), but it doesn't clear the CET bits when the above
code cleared X86_FEATURE_SHSTK/X86_FEATURE_IBT.
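E.g. something along these lines in vmx_set_cpu_caps() would cover it (just a sketch
of the suggestion, untested; the ordering relative to the checks in
__kvm_x86_vendor_init() would need to be double-checked):

	if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
	    !cpu_has_vmx_basic_no_hw_errcode()) {
		kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
		kvm_cpu_cap_clear(X86_FEATURE_IBT);
		/* Keep the xfeature advertisement in sync with the caps. */
		kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
					    XFEATURE_MASK_CET_KERNEL);
	}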
> }
>
> static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index c2130d2c8e24..fb72819fbb41 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -480,7 +480,8 @@ static inline u8 vmx_get_rvi(void)
> VM_ENTRY_LOAD_IA32_EFER | \
> VM_ENTRY_LOAD_BNDCFGS | \
> VM_ENTRY_PT_CONCEAL_PIP | \
> - VM_ENTRY_LOAD_IA32_RTIT_CTL)
> + VM_ENTRY_LOAD_IA32_RTIT_CTL | \
> + VM_ENTRY_LOAD_CET_STATE)
>
> #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
> (VM_EXIT_SAVE_DEBUG_CONTROLS | \
> @@ -502,7 +503,8 @@ static inline u8 vmx_get_rvi(void)
> VM_EXIT_LOAD_IA32_EFER | \
> VM_EXIT_CLEAR_BNDCFGS | \
> VM_EXIT_PT_CONCEAL_PIP | \
> - VM_EXIT_CLEAR_IA32_RTIT_CTL)
> + VM_EXIT_CLEAR_IA32_RTIT_CTL | \
> + VM_EXIT_LOAD_CET_STATE)
>
> #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
> (PIN_BASED_EXT_INTR_MASK | \
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c6b57ede0f57..2bcf3c7923bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
> | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
> | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
>
> -#define KVM_SUPPORTED_XSS 0
> +#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
> + XFEATURE_MASK_CET_KERNEL)
>
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
> @@ -9854,6 +9855,15 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> kvm_caps.supported_xss = 0;
>
> + if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
> + XFEATURE_MASK_CET_KERNEL)) !=
> + (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
> + kvm_caps.supported_xss &= ~XFEATURE_CET_USER;
> + kvm_caps.supported_xss &= ~XFEATURE_CET_KERNEL;
> + }
> +
> #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
> cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
> #undef __kvm_cpu_cap_has
> @@ -12319,7 +12329,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>
> static inline bool is_xstate_reset_needed(void)
> {
> - return kvm_cpu_cap_has(X86_FEATURE_MPX);
> + return kvm_cpu_cap_has(X86_FEATURE_MPX) ||
> + kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> + kvm_cpu_cap_has(X86_FEATURE_IBT);
> }
>
> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> @@ -12396,6 +12408,16 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> XFEATURE_BNDCSR);
> }
>
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_USER);
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_KERNEL);
> + } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_USER);
> + }
> +
> if (init_event)
> kvm_load_guest_fpu(vcpu);
> }
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index d9cc352cf421..dc79dcd733ac 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -531,6 +531,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
> __reserved_bits |= X86_CR4_VMXE; \
> if (!__cpu_has(__c, X86_FEATURE_PCID)) \
> __reserved_bits |= X86_CR4_PCIDE; \
> + if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
> + !__cpu_has(__c, X86_FEATURE_IBT)) \
> + __reserved_bits |= X86_CR4_CET; \
> __reserved_bits; \
> })
>
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
> more obvious that KVM uses these helpers to emulate guest behaviors,
> i.e., host_initiated == false in these helpers.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ++--
> arch/x86/kvm/smm.c | 4 ++--
> arch/x86/kvm/vmx/nested.c | 13 +++++++------
> arch/x86/kvm/x86.c | 10 +++++-----
> 4 files changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d7036982332e..5cfa18aaf33f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1967,8 +1967,8 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
> void kvm_enable_efer_bits(u64);
> bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
> int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
> -int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
> -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);
> +int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
> +int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
> int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
> int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu);
> int kvm_emulate_as_nop(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index dc3d95fdca7d..45c855389ea7 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -535,7 +535,7 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
>
> vcpu->arch.smbase = smstate->smbase;
>
> - if (kvm_set_msr(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
> + if (kvm_emulate_msr_write(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
> return X86EMUL_UNHANDLEABLE;
>
> rsm_load_seg_64(vcpu, &smstate->tr, VCPU_SREG_TR);
> @@ -626,7 +626,7 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
>
> /* And finally go back to 32-bit mode. */
> efer = 0;
> - kvm_set_msr(vcpu, MSR_EFER, efer);
> + kvm_emulate_msr_write(vcpu, MSR_EFER, efer);
> }
> #endif
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c5ec0ef51ff7..2034337681f9 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -927,7 +927,7 @@ static u32 nested_vmx_load_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count)
> __func__, i, e.index, e.reserved);
> goto fail;
> }
> - if (kvm_set_msr(vcpu, e.index, e.value)) {
> + if (kvm_emulate_msr_write(vcpu, e.index, e.value)) {
> pr_debug_ratelimited(
> "%s cannot write MSR (%u, 0x%x, 0x%llx)\n",
> __func__, i, e.index, e.value);
> @@ -963,7 +963,7 @@ static bool nested_vmx_get_vmexit_msr_value(struct kvm_vcpu *vcpu,
> }
> }
>
> - if (kvm_get_msr(vcpu, msr_index, data)) {
> + if (kvm_emulate_msr_read(vcpu, msr_index, data)) {
> pr_debug_ratelimited("%s cannot read MSR (0x%x)\n", __func__,
> msr_index);
> return false;
> @@ -2649,7 +2649,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>
> if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
> kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)) &&
> - WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
> + WARN_ON_ONCE(kvm_emulate_msr_write(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
> vmcs12->guest_ia32_perf_global_ctrl))) {
> *entry_failure_code = ENTRY_FAIL_DEFAULT;
> return -EINVAL;
> @@ -4524,8 +4524,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> }
> if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) &&
> kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)))
> - WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
> - vmcs12->host_ia32_perf_global_ctrl));
> + WARN_ON_ONCE(kvm_emulate_msr_write(vcpu,
> + MSR_CORE_PERF_GLOBAL_CTRL,
> + vmcs12->host_ia32_perf_global_ctrl));
>
> /* Set L1 segment info according to Intel SDM
> 27.5.2 Loading Host Segment and Descriptor-Table Registers */
> @@ -4700,7 +4701,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> goto vmabort;
> }
>
> - if (kvm_set_msr(vcpu, h.index, h.value)) {
> + if (kvm_emulate_msr_write(vcpu, h.index, h.value)) {
> pr_debug_ratelimited(
> "%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> __func__, j, h.index, h.value);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2c924075f6f1..b9c2c0cd4cf5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1973,17 +1973,17 @@ static int kvm_set_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 data)
> return kvm_set_msr_ignored_check(vcpu, index, data, false);
> }
>
> -int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
> +int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
> {
> return kvm_get_msr_ignored_check(vcpu, index, data, false);
> }
> -EXPORT_SYMBOL_GPL(kvm_get_msr);
> +EXPORT_SYMBOL_GPL(kvm_emulate_msr_read);
>
> -int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
> +int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
> {
> return kvm_set_msr_ignored_check(vcpu, index, data, false);
> }
> -EXPORT_SYMBOL_GPL(kvm_set_msr);
> +EXPORT_SYMBOL_GPL(kvm_emulate_msr_write);
>
> static void complete_userspace_rdmsr(struct kvm_vcpu *vcpu)
> {
> @@ -8329,7 +8329,7 @@ static int emulator_set_msr_with_filter(struct x86_emulate_ctxt *ctxt,
> static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
> u32 msr_index, u64 *pdata)
> {
> - return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata);
> + return kvm_emulate_msr_read(emul_to_vcpu(ctxt), msr_index, pdata);
> }
>
> static int emulator_check_pmc(struct x86_emulate_ctxt *ctxt,
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Update CPUID.(EAX=0DH,ECX=1).EBX to reflect current required xstate size
> due to XSS MSR modification.
> CPUID(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
> xstate features in (XCR0 | IA32_XSS). The guest can use this CPUID value to
> allocate a sufficiently sized xsave buffer.
>
> Note, KVM does not yet support any XSS based features, i.e. supported_xss
> is guaranteed to be zero at this time.
>
> Opportunistically modify XSS write access logic as:
> If XSAVES is not enabled in the guest CPUID, forbid setting IA32_XSS msr
> to anything but 0, even if the write is host initiated.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Co-developed-by: Zhang Yi Z <[email protected]>
> Signed-off-by: Zhang Yi Z <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/cpuid.c | 15 ++++++++++++++-
> arch/x86/kvm/x86.c | 16 ++++++++++++----
> 3 files changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 499bd42e3a32..f536102f1eca 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -756,7 +756,6 @@ struct kvm_vcpu_arch {
> bool at_instruction_boundary;
> bool tpr_access_reporting;
> bool xfd_no_write_intercept;
> - u64 ia32_xss;
> u64 microcode_version;
> u64 arch_capabilities;
> u64 perf_capabilities;
> @@ -812,6 +811,8 @@ struct kvm_vcpu_arch {
>
> u64 xcr0;
> u64 guest_supported_xcr0;
> + u64 guest_supported_xss;
> + u64 ia32_xss;
>
> struct kvm_pio_request pio;
> void *pio_data;
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0351e311168a..1d9843b34196 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
> best = cpuid_entry2_find(entries, nent, 0xD, 1);
> if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
> cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
> - best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
> + best->ebx = xstate_required_size(vcpu->arch.xcr0 |
> + vcpu->arch.ia32_xss, true);
>
> best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
> if (kvm_hlt_in_guest(vcpu->kvm) && best &&
> @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> }
>
> +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpuid_entry2 *best;
> +
> + best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
> + if (!best)
> + return 0;
> +
> + return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
> +}
> +
> static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
> {
> struct kvm_cpuid_entry2 *entry;
> @@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> }
>
> vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
> + vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
>
> kvm_update_pv_runtime(vcpu);
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f7d4cc61bc55..649a100ffd25 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3901,20 +3901,28 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vcpu->arch.ia32_tsc_adjust_msr += adj;
> }
> break;
> - case MSR_IA32_XSS:
> - if (!msr_info->host_initiated &&
> - !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
> + case MSR_IA32_XSS: {
> + /*
> + * If KVM reported support of XSS MSR, even guest CPUID doesn't
> + * support XSAVES, still allow userspace to set default value(0)
> + * to this MSR.
> + */
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
> + !(msr_info->host_initiated && data == 0))
> return 1;
> /*
> * KVM supports exposing PT to the guest, but does not support
> * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
> * XSAVES/XRSTORS to save/restore PT MSRs.
> */
> - if (data & ~kvm_caps.supported_xss)
> + if (data & ~vcpu->arch.guest_supported_xss)
> return 1;
> + if (vcpu->arch.ia32_xss == data)
> + break;
> vcpu->arch.ia32_xss = data;
> kvm_update_cpuid_runtime(vcpu);
> break;
> + }
> case MSR_SMI_COUNT:
> if (!msr_info->host_initiated)
> return 1;
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Tue, 2023-11-28 at 14:58 +0000, Edgecombe, Rick P wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > + /*
> > + * Set guest's __user_state_size to fpu_user_cfg.default_size
> > so that
> > + * existing uAPIs can still work.
> > + */
> > + fpu->guest_perm.__user_state_size =
> > fpu_user_cfg.default_size;
>
> It seems like an appropriate value, but where does this come into play
> exactly for guest FPUs?
It is used because the permission API is used for guest FPU state as well (for user features),
and it affects two things:
1. If permission is not requested, KVM will fail to resize the FPU state to match guest CPUID.
2. It affects the output size of the KVM_GET_XSAVE2 ioctl, which outputs a buffer similar to
other FPU state buffers exposed to userspace (like the one saved on the signal stack, or the one obtained via ptrace).
Best regards,
Maxim Levitsky
On Thu, 2023-11-30 at 19:29 +0200, Maxim Levitsky wrote:
> 2. It will affect output size of the KVM_GET_XSAVE2 ioctl, which
> outputs buffer similar to
> other FPU state buffers exposed to userspace (like one saved on
> signal stack, or one obtained via ptrace).
Ah! I missed this part. Thanks for the correction. So the original
comment is important and correct.
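For reference, this is roughly how that size reaches userspace, as I understand the
uAPI (minimal sketch, most error handling omitted; names are illustrative):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) on the VM fd returns the buffer size
 * userspace must allocate for KVM_GET_XSAVE2, which is where the guest FPU's
 * user-state sizing ultimately becomes visible.
 */
static struct kvm_xsave *read_guest_xsave(int vm_fd, int vcpu_fd)
{
	int size = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XSAVE2);
	struct kvm_xsave *buf;

	if (size < (int)sizeof(*buf))
		size = sizeof(*buf);

	buf = calloc(1, size);
	if (buf && ioctl(vcpu_fd, KVM_GET_XSAVE2, buf)) {
		free(buf);
		buf = NULL;
	}
	return buf;
}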
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> to enable CET for nested VM.
>
> Note, generally L1 VMM only touches CET VMCS fields when live migration or
> vmcs_{read,write}() to the fields happens, so the fields only need to be
> synced in these "rare" cases.
To be honest we can't assume anything about L1, but what we can assume
is that if a vmcs12 field is not shadowed, then L1 vmwrite/vmread will
always be intercepted, and the fields can be synced during the interception;
however, I studied this area long ago and I might be mistaken.
> And here only considers the case that L1 VMM
> has set VM_ENTRY_LOAD_CET_STATE in its VMCS vm_entry_controls as it's the
> common usage.
>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 48 +++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/vmx/vmcs12.c | 6 +++++
> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++-
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 4 files changed, 67 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index d8c32682ca76..965173650542 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>
> + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_U_CET, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_S_CET, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> +
> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>
> vmx->nested.force_msr_bitmap_recalc = false;
> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> +
> + if (vmx->nested.nested_run_pending &&
I don't think that nested.nested_run_pending check is needed.
prepare_vmcs02_rare is not going to be called unless the nested run is pending.
> + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> + vmcs_writel(GUEST_INTR_SSP_TABLE,
> + vmcs12->guest_ssp_tbl);
> + }
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> + }
> }
>
> if (nested_cpu_has_xsaves(vmcs12))
> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> vmcs12->guest_pending_dbg_exceptions =
> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> + }
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> + }
The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
was loaded, then it must be updated on exit - this is usually how VMX works.
Also I don't see any mention of VM_EXIT_LOAD_CET_STATE, which, if set,
should reset the L1 CET state to the values in 'host_s_cet/host_ssp/host_ssp_tbl'.
(This is also a common theme in VMX - host state is reset to the values that the
hypervisor sets in the VMCS, and the hypervisor must take care to update these
fields itself.)
As a rule of thumb, if you add a field to vmcs12, you should use it somewhere,
and you should never use it unconditionally, as almost always its use
depends on entry or exit controls.
Same is true for entry/exit/execution controls - if you add one, you almost
always have to use it somewhere.
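Concretely, something like this in sync_vmcs02_to_vmcs12_rare() would match that
expectation (untested sketch of the suggestion, reusing the code quoted above):

	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
			vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
			vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
		}
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
		    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
			vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
	}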
Best regards,
Maxim Levitsky
> +
> vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
> }
>
> @@ -6798,7 +6841,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> VM_EXIT_HOST_ADDR_SPACE_SIZE |
> #endif
> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> - VM_EXIT_CLEAR_BNDCFGS;
> + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> msrs->exit_ctls_high |=
> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> @@ -6820,7 +6863,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> #ifdef CONFIG_X86_64
> VM_ENTRY_IA32E_MODE |
> #endif
> - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> + VM_ENTRY_LOAD_CET_STATE;
> msrs->entry_ctls_high |=
> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> index 106a72c923ca..4233b5ca9461 100644
> --- a/arch/x86/kvm/vmx/vmcs12.c
> +++ b/arch/x86/kvm/vmx/vmcs12.c
> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> + FIELD(GUEST_S_CET, guest_s_cet),
> + FIELD(GUEST_SSP, guest_ssp),
> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> FIELD(HOST_CR0, host_cr0),
> FIELD(HOST_CR3, host_cr3),
> FIELD(HOST_CR4, host_cr4),
> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
> FIELD(HOST_RSP, host_rsp),
> FIELD(HOST_RIP, host_rip),
> + FIELD(HOST_S_CET, host_s_cet),
> + FIELD(HOST_SSP, host_ssp),
> + FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
> };
> const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> index 01936013428b..3884489e7f7e 100644
> --- a/arch/x86/kvm/vmx/vmcs12.h
> +++ b/arch/x86/kvm/vmx/vmcs12.h
> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
> natural_width host_ia32_sysenter_eip;
> natural_width host_rsp;
> natural_width host_rip;
> - natural_width paddingl[8]; /* room for future expansion */
> + natural_width host_s_cet;
> + natural_width host_ssp;
> + natural_width host_ssp_tbl;
> + natural_width guest_s_cet;
> + natural_width guest_ssp;
> + natural_width guest_ssp_tbl;
> + natural_width paddingl[2]; /* room for future expansion */
> u32 pin_based_vm_exec_control;
> u32 cpu_based_vm_exec_control;
> u32 exception_bitmap;
> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
> CHECK_OFFSET(host_ia32_sysenter_eip, 656);
> CHECK_OFFSET(host_rsp, 664);
> CHECK_OFFSET(host_rip, 672);
> + CHECK_OFFSET(host_s_cet, 680);
> + CHECK_OFFSET(host_ssp, 688);
> + CHECK_OFFSET(host_ssp_tbl, 696);
> + CHECK_OFFSET(guest_s_cet, 704);
> + CHECK_OFFSET(guest_ssp, 712);
> + CHECK_OFFSET(guest_ssp_tbl, 720);
> CHECK_OFFSET(pin_based_vm_exec_control, 744);
> CHECK_OFFSET(cpu_based_vm_exec_control, 748);
> CHECK_OFFSET(exception_bitmap, 752);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index a1aae8709939..947028ff2e25 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7734,6 +7734,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
> cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
> cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
> cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
> + cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
> + cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));
>
> #undef cr4_fixed1_update
> }
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> From: Sean Christopherson <[email protected]>
>
> Load the guest's FPU state if userspace is accessing MSRs whose values
> are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(),
> to facilitate access to such kind of MSRs.
>
> If MSRs supported in kvm_caps.supported_xss are passed through to guest,
> the guest MSRs are swapped with host's before vCPU exits to userspace and
> after it reenters kernel before next VM-entry.
>
> Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
> explicitly check @vcpu is non-null before attempting to load guest state.
> The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without
> loading guest FPU state (which doesn't exist).
>
> Note that guest_cpuid_has() is not queried as host userspace is allowed to
> access MSRs that have not been exposed to the guest, e.g. it might do
> KVM_SET_MSRS prior to KVM_SET_CPUID2.
>
> The two helpers are placed here to make it explicit that accessing XSAVE-managed
> MSRs requires special checks and handling to guarantee correct reads/writes of
> the MSRs.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Co-developed-by: Yang Weijiang <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.h | 24 ++++++++++++++++++++++++
> 2 files changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 607082aca80d..fd48b825510c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
> static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
>
> static DEFINE_MUTEX(vendor_module_lock);
> +static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
> +static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
> +
> struct kvm_x86_ops kvm_x86_ops __read_mostly;
>
> #define KVM_X86_OP(func) \
> @@ -4482,6 +4485,21 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> }
> EXPORT_SYMBOL_GPL(kvm_get_msr_common);
>
> +/*
> + * Returns true if the MSR in question is managed via XSTATE, i.e. is context
> + * switched with the rest of guest FPU state.
> + */
> +static bool is_xstate_managed_msr(u32 index)
> +{
> + switch (index) {
> + case MSR_IA32_U_CET:
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
> /*
> * Read or write a bunch of msrs. All parameters are kernel addresses.
> *
> @@ -4492,11 +4510,26 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
> int (*do_msr)(struct kvm_vcpu *vcpu,
> unsigned index, u64 *data))
> {
> + bool fpu_loaded = false;
> int i;
>
> - for (i = 0; i < msrs->nmsrs; ++i)
> + for (i = 0; i < msrs->nmsrs; ++i) {
> + /*
> + * If userspace is accessing one or more XSTATE-managed MSRs,
> + * temporarily load the guest's FPU state so that the guest's
> + * MSR value(s) is resident in hardware, i.e. so that KVM can
> + * get/set the MSR via RDMSR/WRMSR.
> + */
> + if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
> + is_xstate_managed_msr(entries[i].index)) {
> + kvm_load_guest_fpu(vcpu);
> + fpu_loaded = true;
> + }
> if (do_msr(vcpu, entries[i].index, &entries[i].data))
> break;
> + }
> + if (fpu_loaded)
> + kvm_put_guest_fpu(vcpu);
>
> return i;
> }
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 5184fde1dc54..6e42ede335f5 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -541,4 +541,28 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
> unsigned int port, void *data, unsigned int count,
> int in);
>
> +/*
> + * Lock and/or reload guest FPU and access xstate MSRs. For accesses initiated
> + * by host, guest FPU is loaded in __msr_io(). For accesses initiated by guest,
> + * guest FPU should have been loaded already.
> + */
> +
> +static inline void kvm_get_xstate_msr(struct kvm_vcpu *vcpu,
> + struct msr_data *msr_info)
> +{
> + KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
> + kvm_fpu_get();
> + rdmsrl(msr_info->index, msr_info->data);
> + kvm_fpu_put();
> +}
> +
> +static inline void kvm_set_xstate_msr(struct kvm_vcpu *vcpu,
> + struct msr_data *msr_info)
> +{
> + KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
> + kvm_fpu_get();
> + wrmsrl(msr_info->index, msr_info->data);
> + kvm_fpu_put();
> +}
> +
> #endif
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Add CET MSRs to the list of MSRs reported to userspace if the feature,
> i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.
>
> SSP can only be read via RDSSP. Writing even requires destructive and
> potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
> SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
> for the GUEST_SSP field of the VMCS.
>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm_para.h | 1 +
> arch/x86/kvm/vmx/vmx.c | 2 ++
> arch/x86/kvm/x86.c | 18 ++++++++++++++++++
> 3 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 6e64b27b2c1e..9864bbcf2470 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -58,6 +58,7 @@
> #define MSR_KVM_ASYNC_PF_INT 0x4b564d06
> #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
> #define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
> +#define MSR_KVM_SSP 0x4b564d09
>
> struct kvm_steal_time {
> __u64 steal;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index be20a60047b1..d3d0d74fef70 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7009,6 +7009,8 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
> case MSR_AMD64_TSC_RATIO:
> /* This is AMD only. */
> return false;
> + case MSR_KVM_SSP:
> + return kvm_cpu_cap_has(X86_FEATURE_SHSTK);
> default:
> return true;
> }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 44b8cf459dfc..74d2d00a1681 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1476,6 +1476,9 @@ static const u32 msrs_to_save_base[] = {
>
> MSR_IA32_XFD, MSR_IA32_XFD_ERR,
> MSR_IA32_XSS,
> + MSR_IA32_U_CET, MSR_IA32_S_CET,
> + MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP,
> + MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB,
> };
>
> static const u32 msrs_to_save_pmu[] = {
> @@ -1576,6 +1579,7 @@ static const u32 emulated_msrs_all[] = {
>
> MSR_K7_HWCR,
> MSR_KVM_POLL_CONTROL,
> + MSR_KVM_SSP,
> };
>
> static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
> @@ -7371,6 +7375,20 @@ static void kvm_probe_msr_to_save(u32 msr_index)
> if (!kvm_caps.supported_xss)
> return;
> break;
> + case MSR_IA32_U_CET:
> + case MSR_IA32_S_CET:
> + if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
> + !kvm_cpu_cap_has(X86_FEATURE_IBT))
> + return;
> + break;
> + case MSR_IA32_INT_SSP_TAB:
> + if (!kvm_cpu_cap_has(X86_FEATURE_LM))
> + return;
> + fallthrough;
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> + return;
> + break;
> default:
> break;
> }
I still think that pseudo MSR is a hack that might backfire,
but I am not going to argue much about this.
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
> behavior when the guest enters/leaves SMM mode, i.e., registers are saved to
> SMRAM on entry to SMM and reloaded on exit from SMM. Per the SDM, SSP is one
> such register on 64-bit architectures, so add the support for SSP.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/smm.c | 8 ++++++++
> arch/x86/kvm/smm.h | 2 +-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index 45c855389ea7..7aac9c54c353 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
> enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>
> smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> +
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
> + vcpu->kvm);
> }
> #endif
>
> @@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
> static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
> ctxt->interruptibility = (u8)smstate->int_shadow;
>
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
> + vcpu->kvm);
> +
> return X86EMUL_CONTINUE;
> }
> #endif
> diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
> index a1cf2ac5bd78..1e2a3e18207f 100644
> --- a/arch/x86/kvm/smm.h
> +++ b/arch/x86/kvm/smm.h
> @@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
> u32 smbase;
> u32 reserved4[5];
>
> - /* ssp and svm_* fields below are not implemented by KVM */
> u64 ssp;
> + /* svm_* fields below are not implemented by KVM */
> u64 svm_guest_pat;
> u64 svm_host_efer;
> u64 svm_host_cr4;
My review feedback from the previous patch series still applies, and I don't
know why it was not addressed/replied to:
I still think that it is worth having a check that CET is not enabled in
enter_smm_save_state_32(), which is called for pure 32-bit guests (guests that
don't have X86_FEATURE_LM enabled).
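Something as small as the following in enter_smm_save_state_32() would do
(untested sketch; the exact condition - CR4.CET vs. just the capability bit -
is up for debate):

	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
		KVM_BUG_ON(kvm_is_cr4_bit_set(vcpu, X86_CR4_CET), vcpu->kvm);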
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Add supervisor mode state support within FPU xstate management framework.
> Although supervisor shadow stack is not enabled/used today in kernel,KVM
> requires the support because when KVM advertises shadow stack feature to
> guest, architecturally it claims the support for both user and supervisor
> modes for guest OSes(Linux or non-Linux).
>
> CET supervisor states not only includes PL{0,1,2}_SSP but also IA32_S_CET
> MSR, but the latter is not xsave-managed. In virtualization world, guest
> IA32_S_CET is saved/stored into/from VM control structure. With supervisor
> xstate support, guest supervisor mode shadow stack state can be properly
> saved/restored when 1) guest/host FPU context is swapped 2) vCPU
> thread is sched out/in.
>
> The alternative is to enable it in KVM domain, but KVM maintainers NAKed
> the solution. The external discussion can be found at [*], it ended up
> with adding the support in kernel instead of KVM domain.
>
> Note, in KVM case, guest CET supervisor state i.e., IA32_PL{0,1,2}_MSRs,
> are preserved after VM-Exit until host/guest fpstates are swapped, but
> since host supervisor shadow stack is disabled, the preserved MSRs won't
> hurt host.
>
> [*]: https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/fpu/types.h | 14 ++++++++++++--
> arch/x86/include/asm/fpu/xstate.h | 6 +++---
> arch/x86/kernel/fpu/xstate.c | 6 +++++-
> 3 files changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index eb810074f1e7..c6fd13a17205 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -116,7 +116,7 @@ enum xfeature {
> XFEATURE_PKRU,
> XFEATURE_PASID,
> XFEATURE_CET_USER,
> - XFEATURE_CET_KERNEL_UNUSED,
> + XFEATURE_CET_KERNEL,
> XFEATURE_RSRVD_COMP_13,
> XFEATURE_RSRVD_COMP_14,
> XFEATURE_LBR,
> @@ -139,7 +139,7 @@ enum xfeature {
> #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
> #define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
> #define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
> -#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
> +#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
> #define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
> #define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
> #define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
> @@ -264,6 +264,16 @@ struct cet_user_state {
> u64 user_ssp;
> };
>
> +/*
> + * State component 12 is Control-flow Enforcement supervisor states
> + */
> +struct cet_supervisor_state {
> + /* supervisor ssp pointers */
> + u64 pl0_ssp;
> + u64 pl1_ssp;
> + u64 pl2_ssp;
> +};
> +
> /*
> * State component 15: Architectural LBR configuration state.
> * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> index d4427b88ee12..3b4a038d3c57 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -51,7 +51,8 @@
>
> /* All currently supported supervisor features */
> #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
> - XFEATURE_MASK_CET_USER)
> + XFEATURE_MASK_CET_USER | \
> + XFEATURE_MASK_CET_KERNEL)
>
> /*
> * A supervisor state component may not always contain valuable information,
> @@ -78,8 +79,7 @@
> * Unsupported supervisor features. When a supervisor feature in this mask is
> * supported in the future, move it to the supported supervisor feature mask.
> */
> -#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
> - XFEATURE_MASK_CET_KERNEL)
> +#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
>
> /* All supervisor states including supported and unsupported states. */
> #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 6e50a4251e2b..b57d909facca 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -51,7 +51,7 @@ static const char *xfeature_names[] =
> "Protection Keys User registers",
> "PASID state",
> "Control-flow User registers",
> - "Control-flow Kernel registers (unused)",
> + "Control-flow Kernel registers",
> "unknown xstate feature",
> "unknown xstate feature",
> "unknown xstate feature",
> @@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
> [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
> [XFEATURE_PKRU] = X86_FEATURE_OSPKE,
> [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
> + [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
> [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
> [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
> };
> @@ -277,6 +278,7 @@ static void __init print_xstate_features(void)
> print_xstate_feature(XFEATURE_MASK_PKRU);
> print_xstate_feature(XFEATURE_MASK_PASID);
> print_xstate_feature(XFEATURE_MASK_CET_USER);
> + print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
> print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
> print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
> }
> @@ -346,6 +348,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
> XFEATURE_MASK_BNDCSR | \
> XFEATURE_MASK_PASID | \
> XFEATURE_MASK_CET_USER | \
> + XFEATURE_MASK_CET_KERNEL | \
> XFEATURE_MASK_XTILE)
>
> /*
> @@ -546,6 +549,7 @@ static bool __init check_xstate_against_struct(int nr)
> case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
> case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
> case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
> + case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
> case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
> default:
> XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
Any reason why my reviewed-by was not added to this patch?
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Define a new fpu_guest_cfg to hold all guest FPU settings so that they can
> differ from the generic kernel FPU settings, e.g., enabling CET supervisor
> xstate by default for guest fpstate while it remains disabled in the
> kernel FPU config.
>
> The kernel dynamic xfeatures are now specifically used by guest fpstate;
> add the mask for guest fpstate so that guest_perm.__state_perm ==
> (fpu_kernel_cfg.default_features | XFEATURE_MASK_KERNEL_DYNAMIC). And
> if guest fpstate is re-allocated to hold user dynamic xfeatures, the
> resulting permissions are consumed before calculating the new guest fpstate.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/fpu/types.h | 2 +-
> arch/x86/kernel/fpu/core.c | 14 +++++++++++---
> arch/x86/kernel/fpu/xstate.c | 10 ++++++++++
> 3 files changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index c6fd13a17205..306825ad6bc0 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -602,6 +602,6 @@ struct fpu_state_config {
> };
>
> /* FPU state configuration information */
> -extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;
> +extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg, fpu_guest_cfg;
>
> #endif /* _ASM_X86_FPU_H */
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index a21a4d0ecc34..516af626bf6a 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -33,9 +33,10 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
> DEFINE_PER_CPU(u64, xfd_state);
> #endif
>
> -/* The FPU state configuration data for kernel and user space */
> +/* The FPU state configuration data for kernel, user space and guest. */
> struct fpu_state_config fpu_kernel_cfg __ro_after_init;
> struct fpu_state_config fpu_user_cfg __ro_after_init;
> +struct fpu_state_config fpu_guest_cfg __ro_after_init;
>
> /*
> * Represents the initial FPU state. It's mostly (but not completely) zeroes,
> @@ -536,8 +537,15 @@ void fpstate_reset(struct fpu *fpu)
> fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
> fpu->perm.__state_size = fpu_kernel_cfg.default_size;
> fpu->perm.__user_state_size = fpu_user_cfg.default_size;
> - /* Same defaults for guests */
> - fpu->guest_perm = fpu->perm;
> +
> + /* Guest permission settings */
> + fpu->guest_perm.__state_perm = fpu_guest_cfg.default_features;
> + fpu->guest_perm.__state_size = fpu_guest_cfg.default_size;
> + /*
> + * Set guest's __user_state_size to fpu_user_cfg.default_size so that
> + * existing uAPIs can still work.
> + */
> + fpu->guest_perm.__user_state_size = fpu_user_cfg.default_size;
> }
>
> static inline void fpu_inherit_perms(struct fpu *dst_fpu)
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index ba4172172afd..aa8f8595cd41 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -681,6 +681,7 @@ static int __init init_xstate_size(void)
> {
> /* Recompute the context size for enabled features: */
> unsigned int user_size, kernel_size, kernel_default_size;
> + unsigned int guest_default_size;
> bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
>
> /* Uncompacted user space size */
> @@ -702,13 +703,18 @@ static int __init init_xstate_size(void)
> kernel_default_size =
> xstate_calculate_size(fpu_kernel_cfg.default_features, compacted);
>
> + guest_default_size =
> + xstate_calculate_size(fpu_guest_cfg.default_features, compacted);
> +
> if (!paranoid_xstate_size_valid(kernel_size))
> return -EINVAL;
>
> fpu_kernel_cfg.max_size = kernel_size;
> fpu_user_cfg.max_size = user_size;
> + fpu_guest_cfg.max_size = kernel_size;
>
> fpu_kernel_cfg.default_size = kernel_default_size;
> + fpu_guest_cfg.default_size = guest_default_size;
> fpu_user_cfg.default_size =
> xstate_calculate_size(fpu_user_cfg.default_features, false);
>
> @@ -829,6 +835,10 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> fpu_user_cfg.default_features = fpu_user_cfg.max_features;
> fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>
> + fpu_guest_cfg.max_features = fpu_kernel_cfg.max_features;
> + fpu_guest_cfg.default_features = fpu_guest_cfg.max_features;
> + fpu_guest_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> +
> /* Store it for paranoia check at the end */
> xfeatures = fpu_kernel_cfg.max_features;
>
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Tweak the code a bit to facilitate resetting more xstate components in
> the future, e.g., adding CET's xstate-managed MSRs.
>
> No functional change intended.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/x86.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b9c2c0cd4cf5..16b4f2dd138a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12132,6 +12132,11 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> static_branch_dec(&kvm_has_noapic_vcpu);
> }
>
> +static inline bool is_xstate_reset_needed(void)
> +{
> + return kvm_cpu_cap_has(X86_FEATURE_MPX);
> +}
> +
> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> struct kvm_cpuid_entry2 *cpuid_0x1;
> @@ -12189,7 +12194,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> kvm_async_pf_hash_reset(vcpu);
> vcpu->arch.apf.halted = false;
>
> - if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
> + if (vcpu->arch.guest_fpu.fpstate && is_xstate_reset_needed()) {
> struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
>
> /*
> @@ -12199,8 +12204,12 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> if (init_event)
> kvm_put_guest_fpu(vcpu);
>
> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
> + if (kvm_cpu_cap_has(X86_FEATURE_MPX)) {
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_BNDREGS);
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_BNDCSR);
> + }
>
> if (init_event)
> kvm_load_guest_fpu(vcpu);
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Enable/disable CET MSR interception per the associated feature configuration.
> The Shadow Stack feature requires all CET MSRs to be passed through to the
> guest to support it in user and supervisor mode, while the IBT feature only
> depends on MSR_IA32_{U,S}_CET to enable user and supervisor IBT.
>
> Note, this MSR design introduces an architectural limitation on SHSTK and
> IBT control for the guest, i.e., when SHSTK is exposed, IBT is also
> available to the guest from an architectural perspective, since IBT relies
> on a subset of the SHSTK-relevant MSRs.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 42 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 554f665e59c3..e484333eddb0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -699,6 +699,10 @@ static bool is_valid_passthrough_msr(u32 msr)
> case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> return true;
> + case MSR_IA32_U_CET:
> + case MSR_IA32_S_CET:
> + case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
> + return true;
> }
>
> r = possible_passthrough_msr_slot(msr) != -ENOENT;
> @@ -7766,6 +7770,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> }
>
> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
> +{
> + bool incpt;
> +
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
> +
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
> + MSR_TYPE_RW, incpt);
> + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> + MSR_TYPE_RW, incpt);
> + if (!incpt)
> + return;
> + }
> +
> + if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
> +
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> + MSR_TYPE_RW, incpt);
> + }
> +}
> +
> static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -7843,6 +7883,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>
> /* Refresh #PF interception to account for MAXPHYADDR changes. */
> vmx_update_exception_bitmap(vcpu);
> +
> + vmx_update_intercept_for_cet_msr(vcpu);
> }
>
> static u64 vmx_get_perf_capabilities(void)
My review feedback from the previous patch still applies as well:
I still think that we should either try a best-effort approach to plug
this virtualization hole, or at least fail guest creation
if the virtualization hole is present, as I said:
"Another, much simpler option is to fail the guest creation if the shadow stack + indirect branch tracking
state differs between host and the guest, unless both are disabled in the guest.
(in essence don't let the guest be created if (2) or (3) happen)"
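A rough sketch of that option (illustrative only; where exactly to hook it,
e.g. into CPUID validation, is left open):

	bool guest_shstk = guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
	bool guest_ibt   = guest_cpuid_has(vcpu, X86_FEATURE_IBT);

	/* Allow "both off"; otherwise require the guest config to match the host. */
	if ((guest_shstk || guest_ibt) &&
	    (guest_shstk != boot_cpu_has(X86_FEATURE_SHSTK) ||
	     guest_ibt != boot_cpu_has(X86_FEATURE_IBT)))
		return -EINVAL;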
Please at least tell me what you think about this.
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Use the governed feature framework to track whether X86_FEATURE_SHSTK
> and X86_FEATURE_IBT can be used by userspace and the guest, i.e.,
> the features can be used iff both KVM and the guest CPUID support them.
>
> TODO: remove this patch once Sean's refactor to "KVM-governed" framework
> is upstreamed. See the work here [*].
>
> [*]: https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/governed_features.h | 2 ++
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
> index 423a73395c10..db7e21c5ecc2 100644
> --- a/arch/x86/kvm/governed_features.h
> +++ b/arch/x86/kvm/governed_features.h
> @@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
> KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
> KVM_GOVERNED_X86_FEATURE(VGIF)
> KVM_GOVERNED_X86_FEATURE(VNMI)
> +KVM_GOVERNED_X86_FEATURE(SHSTK)
> +KVM_GOVERNED_X86_FEATURE(IBT)
>
> #undef KVM_GOVERNED_X86_FEATURE
> #undef KVM_GOVERNED_FEATURE
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d3d0d74fef70..f6ad5ba5d518 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7762,6 +7762,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
>
> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
> + kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
> + kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
>
> vmx_setup_uret_msrs(vmx);
>
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Add an emulation interface for CET MSR access. The emulation code is split
> into a common part and a vendor-specific part. The former does common checks
> for the MSRs, e.g., accessibility, data validity, etc., then passes the
> operation on to either the XSAVE-managed MSRs (via helpers) or the CET VMCS
> fields.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 18 +++++++++
> arch/x86/kvm/x86.c | 88 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 106 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f6ad5ba5d518..554f665e59c3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2111,6 +2111,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> else
> msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
> break;
> + case MSR_IA32_S_CET:
> + msr_info->data = vmcs_readl(GUEST_S_CET);
> + break;
> + case MSR_KVM_SSP:
> + msr_info->data = vmcs_readl(GUEST_SSP);
> + break;
> + case MSR_IA32_INT_SSP_TAB:
> + msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
> + break;
> case MSR_IA32_DEBUGCTLMSR:
> msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
> break;
> @@ -2420,6 +2429,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> else
> vmx->pt_desc.guest.addr_a[index / 2] = data;
> break;
> + case MSR_IA32_S_CET:
> + vmcs_writel(GUEST_S_CET, data);
> + break;
> + case MSR_KVM_SSP:
> + vmcs_writel(GUEST_SSP, data);
> + break;
> + case MSR_IA32_INT_SSP_TAB:
> + vmcs_writel(GUEST_INTR_SSP_TABLE, data);
> + break;
> case MSR_IA32_PERF_CAPABILITIES:
> if (data && !vcpu_to_pmu(vcpu)->version)
> return 1;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 74d2d00a1681..5792ed16e61b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1847,6 +1847,36 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
> }
> EXPORT_SYMBOL_GPL(kvm_msr_allowed);
>
> +#define CET_US_RESERVED_BITS GENMASK(9, 6)
> +#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
> +#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
> +#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
> +
> +static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
> + bool host_initiated)
> +{
> + bool msr_ctrl = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
> +
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + return true;
> +
> + if (msr_ctrl && guest_can_use(vcpu, X86_FEATURE_IBT))
> + return true;
> +
> + /*
> + * If KVM supports the MSR, i.e. has enumerated the MSR existence to
> + * userspace, then userspace is allowed to write '0' irrespective of
> + * whether or not the MSR is exposed to the guest.
> + */
> + if (!host_initiated || data)
> + return false;
> +
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> + return true;
> +
> + return msr_ctrl && kvm_cpu_cap_has(X86_FEATURE_IBT);
This is reasonable.
> +}
> +
> /*
> * Write @data into the MSR specified by @index. Select MSR specific fault
> * checks are bypassed if @host_initiated is %true.
> @@ -1906,6 +1936,43 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
>
> data = (u32)data;
> break;
> + case MSR_IA32_U_CET:
> + case MSR_IA32_S_CET:
> + if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> + return 1;
> + if (data & CET_US_RESERVED_BITS)
> + return 1;
> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> + (data & CET_US_SHSTK_MASK_BITS))
> + return 1;
> + if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> + (data & CET_US_IBT_MASK_BITS))
> + return 1;
> + if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> + return 1;
> + /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> + if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> + return 1;
> + break;
> + case MSR_IA32_INT_SSP_TAB:
> + if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated) ||
> + !guest_cpuid_has(vcpu, X86_FEATURE_LM))
> + return 1;
> + if (is_noncanonical_address(data, vcpu))
> + return 1;
> + break;
> + case MSR_KVM_SSP:
> + if (!host_initiated)
> + return 1;
> + fallthrough;
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> + return 1;
> + if (is_noncanonical_address(data, vcpu))
> + return 1;
> + if (!IS_ALIGNED(data, 4))
> + return 1;
> + break;
> }
>
> msr.data = data;
> @@ -1949,6 +2016,19 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
> return 1;
> break;
> + case MSR_IA32_INT_SSP_TAB:
> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> + !guest_cpuid_has(vcpu, X86_FEATURE_LM))
> + return 1;
> + break;
> + case MSR_KVM_SSP:
> + if (!host_initiated)
> + return 1;
> + fallthrough;
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + return 1;
> + break;
> }
>
> msr.index = index;
> @@ -4118,6 +4198,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vcpu->arch.guest_fpu.xfd_err = data;
> break;
> #endif
> + case MSR_IA32_U_CET:
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + kvm_set_xstate_msr(vcpu, msr_info);
> + break;
> default:
> if (kvm_pmu_is_valid_msr(vcpu, msr))
> return kvm_pmu_set_msr(vcpu, msr_info);
> @@ -4475,6 +4559,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> msr_info->data = vcpu->arch.guest_fpu.xfd_err;
> break;
> #endif
> + case MSR_IA32_U_CET:
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + kvm_get_xstate_msr(vcpu, msr_info);
> + break;
> default:
> if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
> return kvm_pmu_get_msr(vcpu, msr_info);
Overall looks OK to me, although I still object to the idea of having the MSR_KVM_SSP.
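For reference, the kvm_{get,set}_xstate_msr() helpers used in
kvm_{get,set}_msr_common() above are not shown in this diff. Conceptually they
access the XSAVE-managed MSR against the guest's FPU/xstate, roughly along
these lines (the helper internals below are an assumption for illustration,
not the series' exact code):

	static void kvm_get_xstate_msr(struct kvm_vcpu *vcpu,
				       struct msr_data *msr_info)
	{
		kvm_fpu_get();                           /* make guest fpstate resident */
		rdmsrl(msr_info->index, msr_info->data); /* MSR reflects guest xstate */
		kvm_fpu_put();                           /* switch back to host state */
	}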
Best regards,
Maxim Levitsky
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Wrap __kvm_{get,set}_msr() in two new helpers for KVM usage and use the
> helpers to replace existing usage of the raw functions.
> kvm_msr_{read,write}() are KVM-internal helpers, i.e., used when KVM needs
> to get/set an MSR value for emulating CPU behavior, with host_initiated ==
> %true in the helpers.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/cpuid.c | 2 +-
> arch/x86/kvm/x86.c | 16 +++++++++++++---
> 3 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5cfa18aaf33f..499bd42e3a32 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1966,9 +1966,10 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
>
> void kvm_enable_efer_bits(u64);
> bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
> -int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
> int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
> int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
> +int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
> +int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
> int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
> int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu);
> int kvm_emulate_as_nop(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index d0315e469d92..0351e311168a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1527,7 +1527,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
> *edx = entry->edx;
> if (function == 7 && index == 0) {
> u64 data;
> - if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
> + if (!kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
> (data & TSX_CTRL_CPUID_CLEAR))
> *ebx &= ~(F(RTM) | F(HLE));
> } else if (function == 0x80000007) {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 16b4f2dd138a..360f4b8a4944 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1917,8 +1917,8 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
> * Returns 0 on success, non-0 otherwise.
> * Assumes vcpu_load() was already called.
> */
> -int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> - bool host_initiated)
> +static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> + bool host_initiated)
> {
> struct msr_data msr;
> int ret;
> @@ -1944,6 +1944,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> return ret;
> }
>
> +int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
> +{
> + return __kvm_set_msr(vcpu, index, data, true);
> +}
> +
> +int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
> +{
> + return __kvm_get_msr(vcpu, index, data, true);
> +}
> +
> static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
> u32 index, u64 *data, bool host_initiated)
> {
> @@ -12224,7 +12234,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
>
> __kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
> - __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
> + kvm_msr_write(vcpu, MSR_IA32_XSS, 0);
> }
>
> /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Thu, Nov 30, 2023 at 07:42:44PM +0200, Maxim Levitsky wrote:
>On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
>> behavior when the guest enters/leaves SMM mode, i.e., it saves registers
>> to SMRAM on entry to SMM and reloads them on exit from SMM. Per the SDM,
>> SSP is one such register on 64-bit architectures, so add the support for SSP.
>>
>> Suggested-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/smm.c | 8 ++++++++
>> arch/x86/kvm/smm.h | 2 +-
>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
>> index 45c855389ea7..7aac9c54c353 100644
>> --- a/arch/x86/kvm/smm.c
>> +++ b/arch/x86/kvm/smm.c
>> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
>> enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>>
>> smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
>> +
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
>> + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
>> + vcpu->kvm);
>> }
>> #endif
>>
>> @@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
>> static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
>> ctxt->interruptibility = (u8)smstate->int_shadow;
>>
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
>> + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
>> + vcpu->kvm);
>> +
>> return X86EMUL_CONTINUE;
>> }
>> #endif
>> diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
>> index a1cf2ac5bd78..1e2a3e18207f 100644
>> --- a/arch/x86/kvm/smm.h
>> +++ b/arch/x86/kvm/smm.h
>> @@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
>> u32 smbase;
>> u32 reserved4[5];
>>
>> - /* ssp and svm_* fields below are not implemented by KVM */
>> u64 ssp;
>> + /* svm_* fields below are not implemented by KVM */
>> u64 svm_guest_pat;
>> u64 svm_host_efer;
>> u64 svm_host_cr4;
>
>
>My review feedback from the previous patch series still applies, and I don't
>know why it was not addressed/replied to:
>
>I still think that it is worth it to have a check that CET is not enabled in
>enter_smm_save_state_32 which is called for pure 32 bit guests (guests that don't
>have X86_FEATURE_LM enabled)
Can KVM just reject a KVM_SET_CPUID ioctl which attempts to expose shadow stack
(or even any CET feature) to a 32-bit guest in the first place? I think it is simpler.
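A minimal sketch of that check (hooking it into the KVM_SET_CPUID path, i.e.
validating the incoming entries before they are committed, is omitted here;
guest_cpuid_has() is the existing helper):

	/* No CET of any kind for a guest that cannot enable long mode. */
	if ((guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) ||
	     guest_cpuid_has(vcpu, X86_FEATURE_IBT)) &&
	    !guest_cpuid_has(vcpu, X86_FEATURE_LM))
		return -EINVAL;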
On Thu, Nov 30, 2023 at 07:44:45PM +0200, Maxim Levitsky wrote:
>On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Enable/disable CET MSR interception per the associated feature configuration.
>> The Shadow Stack feature requires all CET MSRs to be passed through to the
>> guest to support it in user and supervisor mode, while the IBT feature only
>> depends on MSR_IA32_{U,S}_CET to enable user and supervisor IBT.
>>
>> Note, this MSR design introduces an architectural limitation on SHSTK and
>> IBT control for the guest, i.e., when SHSTK is exposed, IBT is also
>> available to the guest from an architectural perspective, since IBT relies
>> on a subset of the SHSTK-relevant MSRs.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 42 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 554f665e59c3..e484333eddb0 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -699,6 +699,10 @@ static bool is_valid_passthrough_msr(u32 msr)
>> case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
>> /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
>> return true;
>> + case MSR_IA32_U_CET:
>> + case MSR_IA32_S_CET:
>> + case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
>> + return true;
>> }
>>
>> r = possible_passthrough_msr_slot(msr) != -ENOENT;
>> @@ -7766,6 +7770,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
>> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>> }
>>
>> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
>> +{
>> + bool incpt;
>> +
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
>> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
>> +
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
>> + MSR_TYPE_RW, incpt);
>> + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
>> + MSR_TYPE_RW, incpt);
>> + if (!incpt)
>> + return;
>> + }
>> +
>> + if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
>> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
>> +
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
>> + MSR_TYPE_RW, incpt);
>> + }
>> +}
>> +
>> static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>> @@ -7843,6 +7883,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>
>> /* Refresh #PF interception to account for MAXPHYADDR changes. */
>> vmx_update_exception_bitmap(vcpu);
>> +
>> + vmx_update_intercept_for_cet_msr(vcpu);
>> }
>>
>> static u64 vmx_get_perf_capabilities(void)
>
>My review feedback from the previous patch still applies as well,
>
>I still think that we should either try a best effort approach to plug
>this virtualization hole, or we at least should fail guest creation
>if the virtualization hole is present as I said:
>
>"Another, much simpler option is to fail the guest creation if the shadow stack + indirect branch tracking
>state differs between host and the guest, unless both are disabled in the guest.
>(in essence don't let the guest be created if (2) or (3) happen)"
Enforcing a "none" or "all" policy is a temporary solution. in future, if some
reserved bits in S/U_CET MSRs are extended for new features, there will be:
platform A supports SS + IBT
platform B supports SS + IBT + new feature
Guests running on B inevitably have the same virtualization hole. and if kvm
continues enforcing the policy on B, then VM migration from A to B would be
impossible.
To me, intercepting S/U_CET MSR and CET_S/U xsave components is intricate and
yields marginal benefits. And I also doubt any reasonable OS implementation
would depend on #GP of WRMSR to S/U_CET MSRs for functionalities. So, I vote
to leave the patch as-is.
>
>Please at least tell me what you think about this.
>
>Best regards,
> Maxim Levitsky
>
>
>
On 12/1/2023 1:27 AM, Maxim Levitsky wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Add supervisor mode state support within FPU xstate management framework.
>> Although supervisor shadow stack is not enabled/used in the kernel today,
>> KVM requires the support because when KVM advertises the shadow stack
>> feature to a guest, it architecturally claims support for both user and
>> supervisor modes for guest OSes (Linux or non-Linux).
>>
>> CET supervisor state includes not only PL{0,1,2}_SSP but also the
>> IA32_S_CET MSR, but the latter is not xsave-managed. In the virtualization
>> world, guest IA32_S_CET is saved to/loaded from the VM control structure.
>> With supervisor xstate support, guest supervisor mode shadow stack state
>> can be properly saved/restored when 1) guest/host FPU context is swapped
>> and 2) the vCPU thread is scheduled out/in.
>>
>> The alternative is to enable it in the KVM domain, but the KVM maintainers
>> NAKed that solution. The external discussion can be found at [*]; it ended
>> up with adding the support in the kernel instead of in the KVM domain.
>>
>> Note, in the KVM case, guest CET supervisor state, i.e., the
>> IA32_PL{0,1,2}_SSP MSRs, is preserved after VM-Exit until host/guest
>> fpstates are swapped, but since host supervisor shadow stack is disabled,
>> the preserved MSRs won't hurt the host.
>>
>> [*]: https://lore.kernel.org/all/[email protected]/
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/include/asm/fpu/types.h | 14 ++++++++++++--
>> arch/x86/include/asm/fpu/xstate.h | 6 +++---
>> arch/x86/kernel/fpu/xstate.c | 6 +++++-
>> 3 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
>> index eb810074f1e7..c6fd13a17205 100644
>> --- a/arch/x86/include/asm/fpu/types.h
>> +++ b/arch/x86/include/asm/fpu/types.h
>> @@ -116,7 +116,7 @@ enum xfeature {
>> XFEATURE_PKRU,
>> XFEATURE_PASID,
>> XFEATURE_CET_USER,
>> - XFEATURE_CET_KERNEL_UNUSED,
>> + XFEATURE_CET_KERNEL,
>> XFEATURE_RSRVD_COMP_13,
>> XFEATURE_RSRVD_COMP_14,
>> XFEATURE_LBR,
>> @@ -139,7 +139,7 @@ enum xfeature {
>> #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
>> #define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
>> #define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
>> -#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
>> +#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
>> #define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
>> #define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
>> #define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
>> @@ -264,6 +264,16 @@ struct cet_user_state {
>> u64 user_ssp;
>> };
>>
>> +/*
>> + * State component 12 is Control-flow Enforcement supervisor states
>> + */
>> +struct cet_supervisor_state {
>> + /* supervisor ssp pointers */
>> + u64 pl0_ssp;
>> + u64 pl1_ssp;
>> + u64 pl2_ssp;
>> +};
>> +
>> /*
>> * State component 15: Architectural LBR configuration state.
>> * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
>> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
>> index d4427b88ee12..3b4a038d3c57 100644
>> --- a/arch/x86/include/asm/fpu/xstate.h
>> +++ b/arch/x86/include/asm/fpu/xstate.h
>> @@ -51,7 +51,8 @@
>>
>> /* All currently supported supervisor features */
>> #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
>> - XFEATURE_MASK_CET_USER)
>> + XFEATURE_MASK_CET_USER | \
>> + XFEATURE_MASK_CET_KERNEL)
>>
>> /*
>> * A supervisor state component may not always contain valuable information,
>> @@ -78,8 +79,7 @@
>> * Unsupported supervisor features. When a supervisor feature in this mask is
>> * supported in the future, move it to the supported supervisor feature mask.
>> */
>> -#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
>> - XFEATURE_MASK_CET_KERNEL)
>> +#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
>>
>> /* All supervisor states including supported and unsupported states. */
>> #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index 6e50a4251e2b..b57d909facca 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -51,7 +51,7 @@ static const char *xfeature_names[] =
>> "Protection Keys User registers",
>> "PASID state",
>> "Control-flow User registers",
>> - "Control-flow Kernel registers (unused)",
>> + "Control-flow Kernel registers",
>> "unknown xstate feature",
>> "unknown xstate feature",
>> "unknown xstate feature",
>> @@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
>> [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
>> [XFEATURE_PKRU] = X86_FEATURE_OSPKE,
>> [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
>> + [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
>> [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
>> [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
>> };
>> @@ -277,6 +278,7 @@ static void __init print_xstate_features(void)
>> print_xstate_feature(XFEATURE_MASK_PKRU);
>> print_xstate_feature(XFEATURE_MASK_PASID);
>> print_xstate_feature(XFEATURE_MASK_CET_USER);
>> + print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
>> print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
>> print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
>> }
>> @@ -346,6 +348,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
>> XFEATURE_MASK_BNDCSR | \
>> XFEATURE_MASK_PASID | \
>> XFEATURE_MASK_CET_USER | \
>> + XFEATURE_MASK_CET_KERNEL | \
>> XFEATURE_MASK_XTILE)
>>
>> /*
>> @@ -546,6 +549,7 @@ static bool __init check_xstate_against_struct(int nr)
>> case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
>> case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
>> case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
>> + case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
>> case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
>> default:
>> XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
> Any reason why my reviewed-by was not added to this patch?
My apologies again! I missed the Reviewed-by tag for this patch!
I appreciate your careful review of this series!
> Reviewed-by: Maxim Levitsky <[email protected]>
>
> Best regards,
> Maxim Levitsky
>
>
>
On 12/1/2023 1:33 AM, Maxim Levitsky wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Define new XFEATURE_MASK_KERNEL_DYNAMIC set including the features can be
> I am not sure though that this name is correct, but I don't know if I can
> suggest a better name.
It's symmetric with XFEATURE_MASK_USER_DYNAMIC ;-)
>> optionally enabled by kernel components, i.e., the features are required by
>> specific kernel components. Currently it's used by KVM to configure guest
>> dedicated fpstate for calculating the xfeature and fpstate storage size etc.
>>
>> The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL, which is
>> supported by host as they're enabled in xsaves/xrstors operating xfeature set
>> (XCR0 | XSS), but the relevant CPU feature, i.e., supervisor shadow stack, is
>> not enabled in host kernel so it can be omitted for normal fpstate by default.
>>
>> Remove the kernel dynamic feature from fpu_kernel_cfg.default_features so that
>> the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors can be
>> optimized by HW for normal fpstate.
>>
>> Suggested-by: Dave Hansen <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/include/asm/fpu/xstate.h | 5 ++++-
>> arch/x86/kernel/fpu/xstate.c | 1 +
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
>> index 3b4a038d3c57..a212d3851429 100644
>> --- a/arch/x86/include/asm/fpu/xstate.h
>> +++ b/arch/x86/include/asm/fpu/xstate.h
>> @@ -46,9 +46,12 @@
>> #define XFEATURE_MASK_USER_RESTORE \
>> (XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
>>
>> -/* Features which are dynamically enabled for a process on request */
>> +/* Features which are dynamically enabled per userspace request */
>> #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
>>
>> +/* Features which are dynamically enabled per kernel side request */
> I suggest to explain this a bit better. How about something like that:
>
> "Kernel features that are not enabled by default for all processes, but can
> be still used by some processes, for example to support guest virtualization"
It looks good to me, will apply it in next version, thanks!
> But feel free to keep it as is or propose something else. IMHO this will
> be confusing this way or another.
>
>
> Another question: kernel already has a notion of 'independent features'
> which are currently kernel features that are enabled in IA32_XSS but not present in 'fpu_kernel_cfg.max_features'
>
> Currently only 'XFEATURE_LBR' is in this set. These features are saved/restored manually
> from independent buffer (in case of LBRs, perf code cares for this).
>
> Does it make sense to add CET_S to there as well instead of having XFEATURE_MASK_KERNEL_DYNAMIC,
CET_S here refers to PL{0,1,2}_SSP, right?
IMHO, perf relies on dedicated code to switch the LBR MSRs for various reasons, e.g., overhead:
the feature owns dozens of MSRs, and removing the xfeature bit offloads that burden from the
common FPU/xsave framework.
But CET only has 3 supervisor MSRs, and they need to be managed together with the user mode MSRs.
Enabling them in the common FPU framework makes the switch/swap much easier, without additional
support code.
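For illustration, once XFEATURE_CET_KERNEL is part of the guest fpstate, the
PL{0,1,2}_SSP MSRs simply ride along with the XSAVES/XRSTORS performed by the
swap KVM already does (sketch of the existing call path, not new code):

	/* kvm_load_guest_fpu(), before entering the guest: */
	fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, true);

	/* kvm_put_guest_fpu(), before returning to userspace: */
	fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);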
> and maybe rename the
> 'XFEATURE_MASK_INDEPENDENT' to something like 'XFEATURES_THE_KERNEL_DOESNT_CARE_ABOUT'
> (terrible name, but you might think of a better name)
>
>
>> +#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
>> +
>> /* All currently supported supervisor features */
>> #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
>> XFEATURE_MASK_CET_USER | \
>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index b57d909facca..ba4172172afd 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>> /* Clean out dynamic features from default */
>> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
>> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>> + fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
>>
>> fpu_user_cfg.default_features = fpu_user_cfg.max_features;
>> fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>
>
> Best regards,
> Maxim Levitsky
>
>
>
>
On 12/1/2023 1:36 AM, Maxim Levitsky wrote:
[...]
>> + fpstate->user_size = fpu_user_cfg.default_size;
>> + fpstate->user_xfeatures = fpu_user_cfg.default_features;
> The whole thing makes my head spin like the good old CD/DVD writers used to ....
>
> So just to summarize this is what we have:
>
>
> KERNEL FPU CONFIG
>
> /*
> all known and CPU supported user and supervisor features except
> - "dynamic" kernel features" (CET_S)
> - "independent" kernel features (XFEATURE_LBR)
> */
> fpu_kernel_cfg.max_features;
>
> /*
> all known and CPU supported user and supervisor features except
> - "dynamic" kernel features" (CET_S)
> - "independent" kernel features (arch LBRs)
> - "dynamic" userspace features (AMX state)
> */
> fpu_kernel_cfg.default_features;
>
>
> // size of compacted buffer with 'fpu_kernel_cfg.max_features'
> fpu_kernel_cfg.max_size;
>
>
> // size of compacted buffer with 'fpu_kernel_cfg.default_features'
> fpu_kernel_cfg.default_size;
>
>
> USER FPU CONFIG
>
> /*
> all known and CPU supported user features
> */
> fpu_user_cfg.max_features;
>
> /*
> all known and CPU supported user features except
> - "dynamic" userspace features (AMX state)
> */
> fpu_user_cfg.default_features;
>
> // size of non compacted buffer with 'fpu_user_cfg.max_features'
> fpu_user_cfg.max_size;
>
> // size of non compacted buffer with 'fpu_user_cfg.default_features'
> fpu_user_cfg.default_size;
>
>
> GUEST FPU CONFIG
> /*
> all known and CPU supported user and supervisor features except
> - "independent" kernel features (XFEATURE_LBR)
> */
> fpu_guest_cfg.max_features;
>
> /*
> all known and CPU supported user and supervisor features except
> - "independent" kernel features (arch LBRs)
> - "dynamic" userspace features (AMX state)
> */
> fpu_guest_cfg.default_features;
>
> // size of compacted buffer with 'fpu_guest_cfg.max_features'
> fpu_guest_cfg.max_size;
>
> // size of compacted buffer with 'fpu_guest_cfg.default_features'
> fpu_guest_cfg.default_size;
Good suggestion! Thanks!
How about adding them in patch 5 so that the summary is spelled out there?
> ---
>
>
> So in essence, guest FPU config is guest kernel fpu config and that is why
> 'fpu_user_cfg.default_size' had to be used above.
>
> How about that we have fpu_guest_kernel_config and fpu_guest_user_config instead
> to make the whole horrible thing maybe even more complicated but at least a bit more orthogonal?
I think that would become necessary if there were more guest user/kernel xfeatures requiring
special handling like the CET-S MSRs; then it would be much more reasonable to split the guest
config into two. But for now we only have one single outstanding xfeature for the guest. IMHO,
the existing definitions still work with a few comments.
But I really like your idea of making things clean and tidy :-)
On 12/1/2023 1:42 AM, Maxim Levitsky wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
>> behavior when the guest enters/leaves SMM mode, i.e., it saves registers
>> to SMRAM on entry to SMM and reloads them on exit from SMM. Per the SDM,
>> SSP is one such register on 64-bit architectures, so add the support for SSP.
>>
>> Suggested-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/smm.c | 8 ++++++++
>> arch/x86/kvm/smm.h | 2 +-
>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
>> index 45c855389ea7..7aac9c54c353 100644
>> --- a/arch/x86/kvm/smm.c
>> +++ b/arch/x86/kvm/smm.c
>> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
>> enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>>
>> smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
>> +
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
>> + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
>> + vcpu->kvm);
>> }
>> #endif
>>
>> @@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
>> static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
>> ctxt->interruptibility = (u8)smstate->int_shadow;
>>
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
>> + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
>> + vcpu->kvm);
>> +
>> return X86EMUL_CONTINUE;
>> }
>> #endif
>> diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
>> index a1cf2ac5bd78..1e2a3e18207f 100644
>> --- a/arch/x86/kvm/smm.h
>> +++ b/arch/x86/kvm/smm.h
>> @@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
>> u32 smbase;
>> u32 reserved4[5];
>>
>> - /* ssp and svm_* fields below are not implemented by KVM */
>> u64 ssp;
>> + /* svm_* fields below are not implemented by KVM */
>> u64 svm_guest_pat;
>> u64 svm_host_efer;
>> u64 svm_host_cr4;
>
> My review feedback from the previous patch series still applies, and I don't
> know why it was not addressed/replied to:
>
> I still think that it is worth it to have a check that CET is not enabled in
> enter_smm_save_state_32 which is called for pure 32 bit guests (guests that don't
> have X86_FEATURE_LM enabled)
I'm not sure whether it's worth doing so for CET on a 32-bit guest since you said:
"it lacks several fields because it is no longer maintained".
I'll kick the ball to Sean and Paolo to get their opinion on this.
On 12/1/2023 1:44 AM, Maxim Levitsky wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Enable/disable CET MSR interception per the associated feature configuration.
>> The Shadow Stack feature requires all CET MSRs to be passed through to the
>> guest to support it in user and supervisor mode, while the IBT feature only
>> depends on MSR_IA32_{U,S}_CET to enable user and supervisor IBT.
>>
>> Note, this MSR design introduces an architectural limitation on SHSTK and
>> IBT control for the guest, i.e., when SHSTK is exposed, IBT is also
>> available to the guest from an architectural perspective, since IBT relies
>> on a subset of the SHSTK-relevant MSRs.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 42 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 554f665e59c3..e484333eddb0 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -699,6 +699,10 @@ static bool is_valid_passthrough_msr(u32 msr)
>> case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
>> /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
>> return true;
>> + case MSR_IA32_U_CET:
>> + case MSR_IA32_S_CET:
>> + case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
>> + return true;
>> }
>>
>> r = possible_passthrough_msr_slot(msr) != -ENOENT;
>> @@ -7766,6 +7770,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
>> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>> }
>>
>> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
>> +{
>> + bool incpt;
>> +
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
>> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
>> +
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
>> + MSR_TYPE_RW, incpt);
>> + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
>> + MSR_TYPE_RW, incpt);
>> + if (!incpt)
>> + return;
>> + }
>> +
>> + if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
>> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
>> +
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
>> + MSR_TYPE_RW, incpt);
>> + }
>> +}
>> +
>> static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>> @@ -7843,6 +7883,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>
>> /* Refresh #PF interception to account for MAXPHYADDR changes. */
>> vmx_update_exception_bitmap(vcpu);
>> +
>> + vmx_update_intercept_for_cet_msr(vcpu);
>> }
>>
>> static u64 vmx_get_perf_capabilities(void)
> My review feedback from the previous patch still applies as well,
>
> I still think that we should either try a best effort approach to plug
> this virtualization hole, or we at least should fail guest creation
> if the virtualization hole is present as I said:
>
> "Another, much simpler option is to fail the guest creation if the shadow stack + indirect branch tracking
> state differs between host and the guest, unless both are disabled in the guest.
> (in essence don't let the guest be created if (2) or (3) happen)"
>
> Please at least tell me what you think about this.
Oh, I thought I had replied to this patch in v6, but I failed to send it out!
Let me explain a bit. At an early stage of this series, I considered checking the relevant host
feature enablement status before exposing the CET features to the guest, but that proved
unnecessary and user-unfriendly.
E.g., host CET features are frequently disabled on the host for whatever reason, and then
the features cannot be used/tested in the guest at all. Technically, the guest should be allowed
to run the features as long as the dependencies (i.e., xsave-related support) are enabled
on the host and using the features in the guest brings no risks.
I think cloud computing shares a similar pain point when deploying CET for virtualization
use cases.
On 12/1/2023 1:46 AM, Maxim Levitsky wrote:
[...]
>>
>> +static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
>> +{
>> + return ((u64)vmcs_config.basic_cap << 32) &
>> + VMX_BASIC_NO_HW_ERROR_CODE_CC;
>> +}
> I still think that we should add a comment explaining why this check is needed,
> as I said in the previous review.
OK, I'll add some comments above the function. Thanks!
>> +
>> static inline bool cpu_has_virtual_nmis(void)
>> {
>> return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index c658f2f230df..a1aae8709939 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2614,6 +2614,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>> { VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
>> { VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
>> { VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
>> + { VM_ENTRY_LOAD_CET_STATE, VM_EXIT_LOAD_CET_STATE },
>> };
>>
>> memset(vmcs_conf, 0, sizeof(*vmcs_conf));
>> @@ -4935,6 +4936,15 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>
>> vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
>>
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
>> + vmcs_writel(GUEST_SSP, 0);
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
>> + kvm_cpu_cap_has(X86_FEATURE_IBT))
>> + vmcs_writel(GUEST_S_CET, 0);
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
>> + IS_ENABLED(CONFIG_X86_64))
>> + vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
> Looks reasonable now.
>> +
>> kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>>
>> vpid_sync_context(vmx->vpid);
>> @@ -6354,6 +6364,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>> if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
>> vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);
>>
>> + if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE) {
>> + pr_err("S_CET = 0x%016lx\n", vmcs_readl(GUEST_S_CET));
>> + pr_err("SSP = 0x%016lx\n", vmcs_readl(GUEST_SSP));
>> + pr_err("INTR SSP TABLE = 0x%016lx\n",
>> + vmcs_readl(GUEST_INTR_SSP_TABLE));
>> + }
>> pr_err("*** Host State ***\n");
>> pr_err("RIP = 0x%016lx RSP = 0x%016lx\n",
>> vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
>> @@ -6431,6 +6447,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>> if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
>> pr_err("Virtual processor ID = 0x%04x\n",
>> vmcs_read16(VIRTUAL_PROCESSOR_ID));
>> + if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE) {
>> + pr_err("S_CET = 0x%016lx\n", vmcs_readl(HOST_S_CET));
>> + pr_err("SSP = 0x%016lx\n", vmcs_readl(HOST_SSP));
>> + pr_err("INTR SSP TABLE = 0x%016lx\n",
>> + vmcs_readl(HOST_INTR_SSP_TABLE));
>> + }
>> }
>>
>> /*
>> @@ -7964,7 +7986,6 @@ static __init void vmx_set_cpu_caps(void)
>> kvm_cpu_cap_set(X86_FEATURE_UMIP);
>>
>> /* CPUID 0xD.1 */
>> - kvm_caps.supported_xss = 0;
>> if (!cpu_has_vmx_xsaves())
>> kvm_cpu_cap_clear(X86_FEATURE_XSAVES);
>>
>> @@ -7976,6 +7997,12 @@ static __init void vmx_set_cpu_caps(void)
>>
>> if (cpu_has_vmx_waitpkg())
>> kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
>> +
>> + if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
>> + !cpu_has_vmx_basic_no_hw_errcode()) {
>> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
>> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
>> + }
> My review feedback from previous version still applies here, I don't have an
> idea why this was not addressed....
>
> "I think that here we also need to clear kvm_caps.supported_xss,
> or even better, lets set the CET bits in kvm_caps.supported_xss only
> once CET is fully enabled (both this check and check in __kvm_x86_vendor_init pass)."
Ah, previously I had a helper to check whether the CET bits were enabled in kvm_caps.supported_xss,
so the bits needed to be set earlier, before vmx's hardware_setup. I still want to keep the code as-is
in case other features need to check that their related bits are set before configuring something in
vmx hardware_setup.
> In addition to that I just checked and unless I am mistaken:
>
> vmx_set_cpu_caps() is called from vmx's hardware_setup(), which is called
> from __kvm_x86_vendor_init.
>
> After this call, __kvm_x86_vendor_init does clear kvm_caps.supported_xss,
> but doesn't do this if the above code cleared X86_FEATURE_SHSTK/X86_FEATURE_IBT.
>
Yeah, I checked the history; similar logic was there until v6. I can pick it up, thanks!
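For illustration, the direction being agreed on could look roughly like this in __kvm_x86_vendor_init (a sketch only; the exact placement and form are still being discussed):

	if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
	    !kvm_cpu_cap_has(X86_FEATURE_IBT))
		kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
					    XFEATURE_MASK_CET_KERNEL);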
>> }
>>
>> static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
>> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>> index c2130d2c8e24..fb72819fbb41 100644
>> --- a/arch/x86/kvm/vmx/vmx.h
>> +++ b/arch/x86/kvm/vmx/vmx.h
>> @@ -480,7 +480,8 @@ static inline u8 vmx_get_rvi(void)
>> VM_ENTRY_LOAD_IA32_EFER | \
>> VM_ENTRY_LOAD_BNDCFGS | \
>> VM_ENTRY_PT_CONCEAL_PIP | \
>> - VM_ENTRY_LOAD_IA32_RTIT_CTL)
>> + VM_ENTRY_LOAD_IA32_RTIT_CTL | \
>> + VM_ENTRY_LOAD_CET_STATE)
>>
>> #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
>> (VM_EXIT_SAVE_DEBUG_CONTROLS | \
>> @@ -502,7 +503,8 @@ static inline u8 vmx_get_rvi(void)
>> VM_EXIT_LOAD_IA32_EFER | \
>> VM_EXIT_CLEAR_BNDCFGS | \
>> VM_EXIT_PT_CONCEAL_PIP | \
>> - VM_EXIT_CLEAR_IA32_RTIT_CTL)
>> + VM_EXIT_CLEAR_IA32_RTIT_CTL | \
>> + VM_EXIT_LOAD_CET_STATE)
>>
>> #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
>> (PIN_BASED_EXT_INTR_MASK | \
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c6b57ede0f57..2bcf3c7923bf 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
>> | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
>> | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
>>
>> -#define KVM_SUPPORTED_XSS 0
>> +#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
>> + XFEATURE_MASK_CET_KERNEL)
>>
>> u64 __read_mostly host_efer;
>> EXPORT_SYMBOL_GPL(host_efer);
>> @@ -9854,6 +9855,15 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>> if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
>> kvm_caps.supported_xss = 0;
>>
>> + if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
>> + XFEATURE_MASK_CET_KERNEL)) !=
>> + (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
>> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
>> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
>> + kvm_caps.supported_xss &= ~XFEATURE_CET_USER;
>> + kvm_caps.supported_xss &= ~XFEATURE_CET_KERNEL;
>> + }
>> +
>> #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
>> cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
>> #undef __kvm_cpu_cap_has
>> @@ -12319,7 +12329,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>>
>> static inline bool is_xstate_reset_needed(void)
>> {
>> - return kvm_cpu_cap_has(X86_FEATURE_MPX);
>> + return kvm_cpu_cap_has(X86_FEATURE_MPX) ||
>> + kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
>> + kvm_cpu_cap_has(X86_FEATURE_IBT);
>> }
>>
>> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> @@ -12396,6 +12408,16 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> XFEATURE_BNDCSR);
>> }
>>
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
>> + fpstate_clear_xstate_component(fpstate,
>> + XFEATURE_CET_USER);
>> + fpstate_clear_xstate_component(fpstate,
>> + XFEATURE_CET_KERNEL);
>> + } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
>> + fpstate_clear_xstate_component(fpstate,
>> + XFEATURE_CET_USER);
>> + }
>> +
>> if (init_event)
>> kvm_load_guest_fpu(vcpu);
>> }
>> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
>> index d9cc352cf421..dc79dcd733ac 100644
>> --- a/arch/x86/kvm/x86.h
>> +++ b/arch/x86/kvm/x86.h
>> @@ -531,6 +531,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
>> __reserved_bits |= X86_CR4_VMXE; \
>> if (!__cpu_has(__c, X86_FEATURE_PCID)) \
>> __reserved_bits |= X86_CR4_PCIDE; \
>> + if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
>> + !__cpu_has(__c, X86_FEATURE_IBT)) \
>> + __reserved_bits |= X86_CR4_CET; \
>> __reserved_bits; \
>> })
>>
>
> Best regards,
> Maxim Levitsky
>
>
>
On 12/1/2023 10:23 AM, Chao Gao wrote:
> On Thu, Nov 30, 2023 at 07:42:44PM +0200, Maxim Levitsky wrote:
>> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>>> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
>>> behavior when guest enters/leaves SMM mode,i.e., save registers to SMRAM
>>> at the entry of SMM and reload them at the exit to SMM. Per SDM, SSP is
>>> one of such registers on 64bit Arch, so add the support for SSP.
>>>
>>> Suggested-by: Sean Christopherson <[email protected]>
>>> Signed-off-by: Yang Weijiang <[email protected]>
>>> ---
>>> arch/x86/kvm/smm.c | 8 ++++++++
>>> arch/x86/kvm/smm.h | 2 +-
>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
>>> index 45c855389ea7..7aac9c54c353 100644
>>> --- a/arch/x86/kvm/smm.c
>>> +++ b/arch/x86/kvm/smm.c
>>> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
>>> enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>>>
>>> smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
>>> +
>>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
>>> + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
>>> + vcpu->kvm);
>>> }
>>> #endif
>>>
>>> @@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
>>> static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
>>> ctxt->interruptibility = (u8)smstate->int_shadow;
>>>
>>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
>>> + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
>>> + vcpu->kvm);
>>> +
>>> return X86EMUL_CONTINUE;
>>> }
>>> #endif
>>> diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
>>> index a1cf2ac5bd78..1e2a3e18207f 100644
>>> --- a/arch/x86/kvm/smm.h
>>> +++ b/arch/x86/kvm/smm.h
>>> @@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
>>> u32 smbase;
>>> u32 reserved4[5];
>>>
>>> - /* ssp and svm_* fields below are not implemented by KVM */
>>> u64 ssp;
>>> + /* svm_* fields below are not implemented by KVM */
>>> u64 svm_guest_pat;
>>> u64 svm_host_efer;
>>> u64 svm_host_cr4;
>>
>> My review feedback from the previous patch series still applies, and I don't
>> know why it was not addressed/replied to:
>>
>> I still think that it is worth it to have a check that CET is not enabled in
>> enter_smm_save_state_32 which is called for pure 32 bit guests (guests that don't
>> have X86_FEATURE_LM enabled)
> can KVM just reject a KVM_SET_CPUID ioctl which attempts to expose shadow stack
> (or even any CET feature) to 32-bit guest in the first place? I think it is simpler.
I favor adding an early defensive check for the issue under discussion if we want to handle the case.
Crashing the VM at runtime when a guest SMI is delivered seems not user-friendly.
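As an illustration of the early-check idea, a minimal sketch (placement and exact form are assumptions, e.g. somewhere in the KVM_SET_CPUID validation path):

	/* Shadow stack without long mode is not a supported combination. */
	if (guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) &&
	    !guest_cpuid_has(vcpu, X86_FEATURE_LM))
		return -EINVAL;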
On 12/1/2023 1:53 AM, Maxim Levitsky wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>> to enable CET for nested VM.
>>
>> Note, generally L1 VMM only touches CET VMCS fields when live migration or
>> vmcs_{read,write}() to the fields happens, so the fields only need to be
>> synced in these "rare" cases.
> To be honest we can't assume anything about L1, but what we can assume
>
> is that if vmcs12 field is not shadowed, then L1 vmwrite/vmread will
> be always intercepted and during the interception the fields can be synced,
> however I studied this area long ago and I might be mistaken.
The changelog wording failed to express what I meant to say:
vmcs12 and vmcs02 should be synced to reflect the correct CET states that L1 or L2 is expected
to see. In the live migration case, the nested CET states should also be synced between L1 and L2 via the
control structures.
Will reword them, thanks for pointing it out!
>> And here only considers the case that L1 VMM
>> has set VM_ENTRY_LOAD_CET_STATE in its VMCS vm_entry_controls as it's the
>> common usage.
>>
>> Suggested-by: Chao Gao <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/nested.c | 48 +++++++++++++++++++++++++++++++++++++--
>> arch/x86/kvm/vmx/vmcs12.c | 6 +++++
>> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++-
>> arch/x86/kvm/vmx/vmx.c | 2 ++
>> 4 files changed, 67 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index d8c32682ca76..965173650542 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>
>> + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_U_CET, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_S_CET, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>> +
>> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>
>> vmx->nested.force_msr_bitmap_recalc = false;
>> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>> if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>> (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>> vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>> +
>> + if (vmx->nested.nested_run_pending &&
> I don't think that nested.nested_run_pending check is needed.
> prepare_vmcs02_rare is not going to be called unless the nested run is pending.
But there are other paths that call prepare_vmcs02_rare(), e.g., vmx_set_nested_state() -> nested_vmx_enter_non_root_mode() -> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In that case nested.nested_run_pending == false,
and we don't need to update the vmcs02 fields at that point until L2 is resumed.
>> + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>> + vmcs_writel(GUEST_INTR_SSP_TABLE,
>> + vmcs12->guest_ssp_tbl);
>> + }
>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>> + }
>> }
>>
>> if (nested_cpu_has_xsaves(vmcs12))
>> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>> vmcs12->guest_pending_dbg_exceptions =
>> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>
>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>> + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>> + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>> + }
>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>> + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>> + }
> The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> was loaded, then it must be updated on exit - this is usually how VMX works.
I think this is not about L2's VM_ENTRY_LOAD_CET_STATE; that is handled in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when a VM-exit happens,
so these fields for the L2 guest should be synced to L1 unconditionally.
> Also I don't see any mention of usage of VM_EXIT_LOAD_CET_STATE, which if set,
> should reset the L1 CET state to values in 'host_s_cet/host_ssp/host_ssp_tbl'
> (This is also a common theme in VMX - host state is reset to values that the hypervisor
> sets in VMCS, and the hypervisor must care to update these fields itself).
Yes, the host CET states for L1 also should be synced, I'll add the missing part, thanks!
> As a rule of thumb, if you add a field to vmcs12, you should use it somewhere,
> and you should never use it unconditionally, as almost always its use
> depends on entry or exit controls.
>
> Same is true for entry/exit/execution controls - if you add one, you almost
> always have to use it somewhere.
I'll double check if anything is lost in various cases, thanks!
> Best regards,
> Maxim Levitsky
>
>> +
>> vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
>> }
>>
>> @@ -6798,7 +6841,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>> VM_EXIT_HOST_ADDR_SPACE_SIZE |
>> #endif
>> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> - VM_EXIT_CLEAR_BNDCFGS;
>> + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>> msrs->exit_ctls_high |=
>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>> @@ -6820,7 +6863,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>> #ifdef CONFIG_X86_64
>> VM_ENTRY_IA32E_MODE |
>> #endif
>> - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>> + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>> + VM_ENTRY_LOAD_CET_STATE;
>> msrs->entry_ctls_high |=
>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>> index 106a72c923ca..4233b5ca9461 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.c
>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>> + FIELD(GUEST_S_CET, guest_s_cet),
>> + FIELD(GUEST_SSP, guest_ssp),
>> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
>> FIELD(HOST_CR0, host_cr0),
>> FIELD(HOST_CR3, host_cr3),
>> FIELD(HOST_CR4, host_cr4),
>> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
>> FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
>> FIELD(HOST_RSP, host_rsp),
>> FIELD(HOST_RIP, host_rip),
>> + FIELD(HOST_S_CET, host_s_cet),
>> + FIELD(HOST_SSP, host_ssp),
>> + FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
>> };
>> const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
>> index 01936013428b..3884489e7f7e 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.h
>> +++ b/arch/x86/kvm/vmx/vmcs12.h
>> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
>> natural_width host_ia32_sysenter_eip;
>> natural_width host_rsp;
>> natural_width host_rip;
>> - natural_width paddingl[8]; /* room for future expansion */
>> + natural_width host_s_cet;
>> + natural_width host_ssp;
>> + natural_width host_ssp_tbl;
>> + natural_width guest_s_cet;
>> + natural_width guest_ssp;
>> + natural_width guest_ssp_tbl;
>> + natural_width paddingl[2]; /* room for future expansion */
>> u32 pin_based_vm_exec_control;
>> u32 cpu_based_vm_exec_control;
>> u32 exception_bitmap;
>> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
>> CHECK_OFFSET(host_ia32_sysenter_eip, 656);
>> CHECK_OFFSET(host_rsp, 664);
>> CHECK_OFFSET(host_rip, 672);
>> + CHECK_OFFSET(host_s_cet, 680);
>> + CHECK_OFFSET(host_ssp, 688);
>> + CHECK_OFFSET(host_ssp_tbl, 696);
>> + CHECK_OFFSET(guest_s_cet, 704);
>> + CHECK_OFFSET(guest_ssp, 712);
>> + CHECK_OFFSET(guest_ssp_tbl, 720);
>> CHECK_OFFSET(pin_based_vm_exec_control, 744);
>> CHECK_OFFSET(cpu_based_vm_exec_control, 748);
>> CHECK_OFFSET(exception_bitmap, 752);
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index a1aae8709939..947028ff2e25 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7734,6 +7734,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
>> cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
>> cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
>> cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
>> + cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
>> + cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));
>>
>> #undef cr4_fixed1_update
>> }
>
>
On Fri, 2023-12-01 at 15:01 +0800, Yang, Weijiang wrote:
> On 12/1/2023 1:27 AM, Maxim Levitsky wrote:
> > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > Add supervisor mode state support within FPU xstate management framework.
> > > Although supervisor shadow stack is not enabled/used today in kernel,KVM
> > > requires the support because when KVM advertises shadow stack feature to
> > > guest, architecturally it claims the support for both user and supervisor
> > > modes for guest OSes(Linux or non-Linux).
> > >
> > > CET supervisor states not only includes PL{0,1,2}_SSP but also IA32_S_CET
> > > MSR, but the latter is not xsave-managed. In virtualization world, guest
> > > IA32_S_CET is saved/stored into/from VM control structure. With supervisor
> > > xstate support, guest supervisor mode shadow stack state can be properly
> > > saved/restored when 1) guest/host FPU context is swapped 2) vCPU
> > > thread is sched out/in.
> > >
> > > The alternative is to enable it in KVM domain, but KVM maintainers NAKed
> > > the solution. The external discussion can be found at [*], it ended up
> > > with adding the support in kernel instead of KVM domain.
> > >
> > > Note, in KVM case, guest CET supervisor state i.e., IA32_PL{0,1,2}_MSRs,
> > > are preserved after VM-Exit until host/guest fpstates are swapped, but
> > > since host supervisor shadow stack is disabled, the preserved MSRs won't
> > > hurt host.
> > >
> > > [*]: https://lore.kernel.org/all/[email protected]/
> > >
> > > Signed-off-by: Yang Weijiang <[email protected]>
> > > ---
> > > arch/x86/include/asm/fpu/types.h | 14 ++++++++++++--
> > > arch/x86/include/asm/fpu/xstate.h | 6 +++---
> > > arch/x86/kernel/fpu/xstate.c | 6 +++++-
> > > 3 files changed, 20 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> > > index eb810074f1e7..c6fd13a17205 100644
> > > --- a/arch/x86/include/asm/fpu/types.h
> > > +++ b/arch/x86/include/asm/fpu/types.h
> > > @@ -116,7 +116,7 @@ enum xfeature {
> > > XFEATURE_PKRU,
> > > XFEATURE_PASID,
> > > XFEATURE_CET_USER,
> > > - XFEATURE_CET_KERNEL_UNUSED,
> > > + XFEATURE_CET_KERNEL,
> > > XFEATURE_RSRVD_COMP_13,
> > > XFEATURE_RSRVD_COMP_14,
> > > XFEATURE_LBR,
> > > @@ -139,7 +139,7 @@ enum xfeature {
> > > #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
> > > #define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
> > > #define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
> > > -#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
> > > +#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
> > > #define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
> > > #define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
> > > #define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
> > > @@ -264,6 +264,16 @@ struct cet_user_state {
> > > u64 user_ssp;
> > > };
> > >
> > > +/*
> > > + * State component 12 is Control-flow Enforcement supervisor states
> > > + */
> > > +struct cet_supervisor_state {
> > > + /* supervisor ssp pointers */
> > > + u64 pl0_ssp;
> > > + u64 pl1_ssp;
> > > + u64 pl2_ssp;
> > > +};
> > > +
> > > /*
> > > * State component 15: Architectural LBR configuration state.
> > > * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
> > > diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> > > index d4427b88ee12..3b4a038d3c57 100644
> > > --- a/arch/x86/include/asm/fpu/xstate.h
> > > +++ b/arch/x86/include/asm/fpu/xstate.h
> > > @@ -51,7 +51,8 @@
> > >
> > > /* All currently supported supervisor features */
> > > #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
> > > - XFEATURE_MASK_CET_USER)
> > > + XFEATURE_MASK_CET_USER | \
> > > + XFEATURE_MASK_CET_KERNEL)
> > >
> > > /*
> > > * A supervisor state component may not always contain valuable information,
> > > @@ -78,8 +79,7 @@
> > > * Unsupported supervisor features. When a supervisor feature in this mask is
> > > * supported in the future, move it to the supported supervisor feature mask.
> > > */
> > > -#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
> > > - XFEATURE_MASK_CET_KERNEL)
> > > +#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
> > >
> > > /* All supervisor states including supported and unsupported states. */
> > > #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
> > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > index 6e50a4251e2b..b57d909facca 100644
> > > --- a/arch/x86/kernel/fpu/xstate.c
> > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > @@ -51,7 +51,7 @@ static const char *xfeature_names[] =
> > > "Protection Keys User registers",
> > > "PASID state",
> > > "Control-flow User registers",
> > > - "Control-flow Kernel registers (unused)",
> > > + "Control-flow Kernel registers",
> > > "unknown xstate feature",
> > > "unknown xstate feature",
> > > "unknown xstate feature",
> > > @@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
> > > [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
> > > [XFEATURE_PKRU] = X86_FEATURE_OSPKE,
> > > [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
> > > + [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
> > > [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
> > > [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
> > > };
> > > @@ -277,6 +278,7 @@ static void __init print_xstate_features(void)
> > > print_xstate_feature(XFEATURE_MASK_PKRU);
> > > print_xstate_feature(XFEATURE_MASK_PASID);
> > > print_xstate_feature(XFEATURE_MASK_CET_USER);
> > > + print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
> > > print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
> > > print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
> > > }
> > > @@ -346,6 +348,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
> > > XFEATURE_MASK_BNDCSR | \
> > > XFEATURE_MASK_PASID | \
> > > XFEATURE_MASK_CET_USER | \
> > > + XFEATURE_MASK_CET_KERNEL | \
> > > XFEATURE_MASK_XTILE)
> > >
> > > /*
> > > @@ -546,6 +549,7 @@ static bool __init check_xstate_against_struct(int nr)
> > > case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
> > > case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
> > > case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
> > > + case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
> > > case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
> > > default:
> > > XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
> > Any reason why my reviewed-by was not added to this patch?
>
> My apologies again! I missed the Reviewed-by tag for this patch!
>
> I appreciate your careful review of this series!
Thank you very much!
Best regards,
Maxim Levitsky
>
> > Reviewed-by: Maxim Levitsky <[email protected]>
> >
> > Best regards,
> > Maxim Levitsky
> >
> >
> >
On Fri, 2023-12-01 at 16:36 +0800, Yang, Weijiang wrote:
> On 12/1/2023 1:36 AM, Maxim Levitsky wrote:
>
> [...]
>
> > > + fpstate->user_size = fpu_user_cfg.default_size;
> > > + fpstate->user_xfeatures = fpu_user_cfg.default_features;
> > The whole thing makes my head spin like the good old CD/DVD writers used to ....
> >
> > So just to summarize this is what we have:
> >
> >
> > KERNEL FPU CONFIG
> >
> > /*
> > all known and CPU supported user and supervisor features except
> > - "dynamic" kernel features" (CET_S)
> > - "independent" kernel features (XFEATURE_LBR)
> > */
> > fpu_kernel_cfg.max_features;
> >
> > /*
> > all known and CPU supported user and supervisor features except
> > - "dynamic" kernel features" (CET_S)
> > - "independent" kernel features (arch LBRs)
> > - "dynamic" userspace features (AMX state)
> > */
> > fpu_kernel_cfg.default_features;
> >
> >
> > // size of compacted buffer with 'fpu_kernel_cfg.max_features'
> > fpu_kernel_cfg.max_size;
> >
> >
> > // size of compacted buffer with 'fpu_kernel_cfg.default_features'
> > fpu_kernel_cfg.default_size;
> >
> >
> > USER FPU CONFIG
> >
> > /*
> > all known and CPU supported user features
> > */
> > fpu_user_cfg.max_features;
> >
> > /*
> > all known and CPU supported user features except
> > - "dynamic" userspace features (AMX state)
> > */
> > fpu_user_cfg.default_features;
> >
> > // size of non compacted buffer with 'fpu_user_cfg.max_features'
> > fpu_user_cfg.max_size;
> >
> > // size of non compacted buffer with 'fpu_user_cfg.default_features'
> > fpu_user_cfg.default_size;
> >
> >
> > GUEST FPU CONFIG
> > /*
> > all known and CPU supported user and supervisor features except
> > - "independent" kernel features (XFEATURE_LBR)
> > */
> > fpu_guest_cfg.max_features;
> >
> > /*
> > all known and CPU supported user and supervisor features except
> > - "independent" kernel features (arch LBRs)
> > - "dynamic" userspace features (AMX state)
> > */
> > fpu_guest_cfg.default_features;
> >
> > // size of compacted buffer with 'fpu_guest_cfg.max_features'
> > fpu_guest_cfg.max_size;
> >
> > // size of compacted buffer with 'fpu_guest_cfg.default_features'
> > fpu_guest_cfg.default_size;
>
> Good suggestion! Thanks!
> How about adding them in patch 5 so these summaries are spelled out there?
I don't know if we want to add these comments to the source - I made them
up for myself/you to understand the subtle differences between each of these variables.
There is some documentation on the struct fields, it is reasonable, but
I do think that it will help a lot to add documentation to each of
fpu_kernel_cfg, fpu_user_cfg and fpu_guest_cfg.
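If such documentation is added, one hedged example of the shape it could take next to the fpu_guest_cfg declaration (wording follows the summary above; this is a suggestion, not code from the series):

/*
 * fpu_guest_cfg - xfeature configuration used for guest fpstates.
 * @max_features:     all user and supervisor xfeatures KVM may expose,
 *                    excluding "independent" features such as arch LBR.
 * @default_features: @max_features minus dynamically-enabled user
 *                    features such as AMX tile data.
 * @max_size/@default_size: compacted buffer sizes for the two masks above.
 */
struct fpu_state_config fpu_guest_cfg __ro_after_init;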
>
> > ---
> >
> >
> > So in essence, guest FPU config is guest kernel fpu config and that is why
> > 'fpu_user_cfg.default_size' had to be used above.
> >
> > How about that we have fpu_guest_kernel_config and fpu_guest_user_config instead
> > to make the whole horrible thing maybe even more complicated but at least a bit more orthogonal?
>
> I think it becomes necessary when there are more guest user/kernel xfeatures requiring
> special handling like the CET-S MSRs; then it looks much more reasonable to split the guest config into two,
> but for now we only have one single outstanding xfeature for the guest. IMHO, the existing definitions still
> work with a few comments.
It's all up to you to decide. The thing is one big mess; IMHO no comment can really make it understandable
without hours of research.
However, as usual, the more comments the better - comments do help.
Best regards,
Maxim Levitsky
>
> But I really like your ideas of making things clean and tidy :-)
>
>
On Mon, 2023-12-04 at 08:45 +0800, Yang, Weijiang wrote:
> On 12/1/2023 10:23 AM, Chao Gao wrote:
> > On Thu, Nov 30, 2023 at 07:42:44PM +0200, Maxim Levitsky wrote:
> > > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > > Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
> > > > behavior when guest enters/leaves SMM mode,i.e., save registers to SMRAM
> > > > at the entry of SMM and reload them at the exit to SMM. Per SDM, SSP is
> > > > one of such registers on 64bit Arch, so add the support for SSP.
> > > >
> > > > Suggested-by: Sean Christopherson <[email protected]>
> > > > Signed-off-by: Yang Weijiang <[email protected]>
> > > > ---
> > > > arch/x86/kvm/smm.c | 8 ++++++++
> > > > arch/x86/kvm/smm.h | 2 +-
> > > > 2 files changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> > > > index 45c855389ea7..7aac9c54c353 100644
> > > > --- a/arch/x86/kvm/smm.c
> > > > +++ b/arch/x86/kvm/smm.c
> > > > @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
> > > > enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
> > > >
> > > > smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> > > > +
> > > > + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > > + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
> > > > + vcpu->kvm);
> > > > }
> > > > #endif
> > > >
> > > > @@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
> > > > static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
> > > > ctxt->interruptibility = (u8)smstate->int_shadow;
> > > >
> > > > + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > > + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
> > > > + vcpu->kvm);
> > > > +
> > > > return X86EMUL_CONTINUE;
> > > > }
> > > > #endif
> > > > diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
> > > > index a1cf2ac5bd78..1e2a3e18207f 100644
> > > > --- a/arch/x86/kvm/smm.h
> > > > +++ b/arch/x86/kvm/smm.h
> > > > @@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
> > > > u32 smbase;
> > > > u32 reserved4[5];
> > > >
> > > > - /* ssp and svm_* fields below are not implemented by KVM */
> > > > u64 ssp;
> > > > + /* svm_* fields below are not implemented by KVM */
> > > > u64 svm_guest_pat;
> > > > u64 svm_host_efer;
> > > > u64 svm_host_cr4;
> > >
> > > My review feedback from the previous patch series still applies, and I don't
> > > know why it was not addressed/replied to:
> > >
> > > I still think that it is worth it to have a check that CET is not enabled in
> > > enter_smm_save_state_32 which is called for pure 32 bit guests (guests that don't
> > > have X86_FEATURE_LM enabled)
> > can KVM just reject a KVM_SET_CPUID ioctl which attempts to expose shadow stack
> > (or even any CET feature) to 32-bit guest in the first place? I think it is simpler.
>
> I favor adding an early defensive check for the issue under discussion if we want to handle the case.
> Crashing the VM at runtime when guest SMI is kicked seems not user friendly.
>
I don't mind. I remember that I was told that crashing a guest when an #SMI arrives while a nested guest is running
is OK for 32-bit guests; I don't know what the justification was.
Sean, Paolo, do you remember?
IMHO the chances of a pure 32-bit guest (only qemu-system-386 creates these) running with CET are very low,
but I just wanted to have a cheap check to keep the 32-bit and 64-bit SMM save/restore code similar,
so that nobody in the future will ask 'why does this code do this or that'.
Also it is trivial to add the SSP to the 32-bit SMRAM image - the layout is not really compliant with the x86 spec
and is never consumed by the hardware, so you can just put it somewhere in the image, e.g. in place of one of the reserved fields.
From my point of view I want the code to be as orthogonal as possible.
Best regards,
Maxim Levitsky
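A minimal sketch of the cheap 32-bit check discussed above (assuming it would sit in enter_smm_save_state_32; the exact form is up to the author):

	/*
	 * Pure 32-bit guests must not have shadow stack exposed; catch a
	 * broken configuration loudly instead of silently dropping SSP.
	 */
	KVM_BUG_ON(guest_can_use(vcpu, X86_FEATURE_SHSTK), vcpu->kvm);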
On Fri, 2023-12-01 at 14:33 +0800, Chao Gao wrote:
> On Thu, Nov 30, 2023 at 07:44:45PM +0200, Maxim Levitsky wrote:
> > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > Enable/disable CET MSRs interception per associated feature configuration.
> > > Shadow Stack feature requires all CET MSRs passed through to guest to make
> > > it supported in user and supervisor mode while IBT feature only depends on
> > > MSR_IA32_{U,S}_CETS_CET to enable user and supervisor IBT.
> > >
> > > Note, this MSR design introduced an architectural limitation of SHSTK and
> > > IBT control for guest, i.e., when SHSTK is exposed, IBT is also available
> > > to guest from architectual perspective since IBT relies on subset of SHSTK
> > > relevant MSRs.
> > >
> > > Signed-off-by: Yang Weijiang <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 42 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 554f665e59c3..e484333eddb0 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -699,6 +699,10 @@ static bool is_valid_passthrough_msr(u32 msr)
> > > case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> > > /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> > > return true;
> > > + case MSR_IA32_U_CET:
> > > + case MSR_IA32_S_CET:
> > > + case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
> > > + return true;
> > > }
> > >
> > > r = possible_passthrough_msr_slot(msr) != -ENOENT;
> > > @@ -7766,6 +7770,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> > > vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> > > }
> > >
> > > +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
> > > +{
> > > + bool incpt;
> > > +
> > > + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> > > + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
> > > +
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
> > > + MSR_TYPE_RW, incpt);
> > > + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> > > + MSR_TYPE_RW, incpt);
> > > + if (!incpt)
> > > + return;
> > > + }
> > > +
> > > + if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> > > + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
> > > +
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> > > + MSR_TYPE_RW, incpt);
> > > + }
> > > +}
> > > +
> > > static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > {
> > > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > @@ -7843,6 +7883,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >
> > > /* Refresh #PF interception to account for MAXPHYADDR changes. */
> > > vmx_update_exception_bitmap(vcpu);
> > > +
> > > + vmx_update_intercept_for_cet_msr(vcpu);
> > > }
> > >
> > > static u64 vmx_get_perf_capabilities(void)
> >
> > My review feedback from the previous patch still applies as well,
> >
> > I still think that we should either try a best effort approach to plug
> > this virtualization hole, or we at least should fail guest creation
> > if the virtualization hole is present as I said:
> >
> > "Another, much simpler option is to fail the guest creation if the shadow stack + indirect branch tracking
> > state differs between host and the guest, unless both are disabled in the guest.
> > (in essence don't let the guest be created if (2) or (3) happen)"
>
> Enforcing a "none" or "all" policy is a temporary solution. In the future, if some
> reserved bits in the S/U_CET MSRs are extended for new features, there will be:
>
> platform A supports SS + IBT
> platform B supports SS + IBT + new feature
>
> Guests running on B inevitably have the same virtualization hole. And if KVM
> continues enforcing the policy on B, then VM migration from A to B would be
> impossible.
>
> To me, intercepting the S/U_CET MSRs and CET_S/U xsave components is intricate and
> yields marginal benefits. And I also doubt any reasonable OS implementation
> would depend on a #GP from WRMSR to the S/U_CET MSRs for functionality. So, I vote
> to leave the patch as-is.
To some extent I do agree with you, but this can become a huge mess in the future.
I think we need at least to tell Intel/AMD about this to ensure that they don't make this thing worse
than it already is.
Also, the very least we can do if we opt to keep things as-is
is to document this virtualization hole - we have Documentation/virt/kvm/x86/errata.rst for that.
Best regards,
Maxim Levitsky
>
> > Please at least tell me what do you think about this.
> > Best regards,
> > Maxim Levitsky
> >
> >
> >
On Fri, 2023-12-01 at 17:45 +0800, Yang, Weijiang wrote:
> On 12/1/2023 1:44 AM, Maxim Levitsky wrote:
> > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > Enable/disable CET MSRs interception per associated feature configuration.
> > > Shadow Stack feature requires all CET MSRs passed through to guest to make
> > > it supported in user and supervisor mode while IBT feature only depends on
> > > MSR_IA32_{U,S}_CETS_CET to enable user and supervisor IBT.
> > >
> > > Note, this MSR design introduced an architectural limitation of SHSTK and
> > > IBT control for guest, i.e., when SHSTK is exposed, IBT is also available
> > > to guest from architectual perspective since IBT relies on subset of SHSTK
> > > relevant MSRs.
> > >
> > > Signed-off-by: Yang Weijiang <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 42 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 554f665e59c3..e484333eddb0 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -699,6 +699,10 @@ static bool is_valid_passthrough_msr(u32 msr)
> > > case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> > > /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> > > return true;
> > > + case MSR_IA32_U_CET:
> > > + case MSR_IA32_S_CET:
> > > + case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
> > > + return true;
> > > }
> > >
> > > r = possible_passthrough_msr_slot(msr) != -ENOENT;
> > > @@ -7766,6 +7770,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> > > vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> > > }
> > >
> > > +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
> > > +{
> > > + bool incpt;
> > > +
> > > + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> > > + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
> > > +
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
> > > + MSR_TYPE_RW, incpt);
> > > + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> > > + MSR_TYPE_RW, incpt);
> > > + if (!incpt)
> > > + return;
> > > + }
> > > +
> > > + if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> > > + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
> > > +
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> > > + MSR_TYPE_RW, incpt);
> > > + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> > > + MSR_TYPE_RW, incpt);
> > > + }
> > > +}
> > > +
> > > static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > {
> > > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > @@ -7843,6 +7883,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >
> > > /* Refresh #PF interception to account for MAXPHYADDR changes. */
> > > vmx_update_exception_bitmap(vcpu);
> > > +
> > > + vmx_update_intercept_for_cet_msr(vcpu);
> > > }
> > >
> > > static u64 vmx_get_perf_capabilities(void)
> > My review feedback from the previous patch still applies as well,
> >
> > I still think that we should either try a best effort approach to plug
> > this virtualization hole, or we at least should fail guest creation
> > if the virtualization hole is present as I said:
> >
> > "Another, much simpler option is to fail the guest creation if the shadow stack + indirect branch tracking
> > state differs between host and the guest, unless both are disabled in the guest.
> > (in essence don't let the guest be created if (2) or (3) happen)"
> >
> > Please at least tell me what do you think about this.
>
> Oh, I thought I had replied this patch in v6 but I failed to send it out!
> Let me explain it a bit, at early stage of this series, I thought of checking relevant host
> feature enabling status before exposing guest CET features, but it's proved
> unnecessary and user unfriendly.
>
> E.g., we frequently disable host CET features due to whatever reasons on host, then
> the features cannot be used/tested in guest at all. Technically, guest should be allowed
> to run the features so long as the dependencies(i.e., xsave related support) are enabled
> on host and there're no risks brought up by using of the features in guest.
To be honest this is a dangerous POV in regard to guest migration: if the VMM is lax with the features
that it exposes to the guest, then guests will start to make assumptions instead of checking CPUID,
and then a guest will mysteriously fail when migrated to a machine which actually lacks the features
(in addition to not advertising them in CPUID).
In other words, leaving "undocumented" features working opens a slippery slope of later having to support this
undocumented behavior.
I understand though that CET is problematic, and I overall won't object much to leaving things as-is,
but a part of me thinks that we will regret it later.
Best regards,
Maxim Levitsky
>
> I think cloud-computing should share the similar pain point when deploy CET into virtualization
> usages.
>
>
On Sat, 2023-12-02 at 00:15 +0800, Yang, Weijiang wrote:
> On 12/1/2023 1:46 AM, Maxim Levitsky wrote:
>
> [...]
>
> > >
> > > +static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
> > > +{
> > > + return ((u64)vmcs_config.basic_cap << 32) &
> > > + VMX_BASIC_NO_HW_ERROR_CODE_CC;
> > > +}
> > I still think that we should add a comment explaining why this check is needed,
> > as I said in the previous review.
>
> OK, I'll add some comments above the function. Thanks!
>
> > > +
> > > static inline bool cpu_has_virtual_nmis(void)
> > > {
> > > return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index c658f2f230df..a1aae8709939 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -2614,6 +2614,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> > > { VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
> > > { VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
> > > { VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
> > > + { VM_ENTRY_LOAD_CET_STATE, VM_EXIT_LOAD_CET_STATE },
> > > };
> > >
> > > memset(vmcs_conf, 0, sizeof(*vmcs_conf));
> > > @@ -4935,6 +4936,15 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > >
> > > vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
> > >
> > > + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > + vmcs_writel(GUEST_SSP, 0);
> > > + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> > > + kvm_cpu_cap_has(X86_FEATURE_IBT))
> > > + vmcs_writel(GUEST_S_CET, 0);
> > > + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
> > > + IS_ENABLED(CONFIG_X86_64))
> > > + vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
> > Looks reasonable now.
> > > +
> > > kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
> > >
> > > vpid_sync_context(vmx->vpid);
> > > @@ -6354,6 +6364,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> > > if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
> > > vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);
> > >
> > > + if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE) {
> > > + pr_err("S_CET = 0x%016lx\n", vmcs_readl(GUEST_S_CET));
> > > + pr_err("SSP = 0x%016lx\n", vmcs_readl(GUEST_SSP));
> > > + pr_err("INTR SSP TABLE = 0x%016lx\n",
> > > + vmcs_readl(GUEST_INTR_SSP_TABLE));
> > > + }
> > > pr_err("*** Host State ***\n");
> > > pr_err("RIP = 0x%016lx RSP = 0x%016lx\n",
> > > vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
> > > @@ -6431,6 +6447,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> > > if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
> > > pr_err("Virtual processor ID = 0x%04x\n",
> > > vmcs_read16(VIRTUAL_PROCESSOR_ID));
> > > + if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE) {
> > > + pr_err("S_CET = 0x%016lx\n", vmcs_readl(HOST_S_CET));
> > > + pr_err("SSP = 0x%016lx\n", vmcs_readl(HOST_SSP));
> > > + pr_err("INTR SSP TABLE = 0x%016lx\n",
> > > + vmcs_readl(HOST_INTR_SSP_TABLE));
> > > + }
> > > }
> > >
> > > /*
> > > @@ -7964,7 +7986,6 @@ static __init void vmx_set_cpu_caps(void)
> > > kvm_cpu_cap_set(X86_FEATURE_UMIP);
> > >
> > > /* CPUID 0xD.1 */
> > > - kvm_caps.supported_xss = 0;
> > > if (!cpu_has_vmx_xsaves())
> > > kvm_cpu_cap_clear(X86_FEATURE_XSAVES);
> > >
> > > @@ -7976,6 +7997,12 @@ static __init void vmx_set_cpu_caps(void)
> > >
> > > if (cpu_has_vmx_waitpkg())
> > > kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> > > +
> > > + if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
> > > + !cpu_has_vmx_basic_no_hw_errcode()) {
> > > + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> > > + kvm_cpu_cap_clear(X86_FEATURE_IBT);
> > > + }
> > My review feedback from previous version still applies here, I don't have an
> > idea why this was not addressed....
> >
> > "I think that here we also need to clear kvm_caps.supported_xss,
> > or even better, lets set the CET bits in kvm_caps.supported_xss only
> > once CET is fully enabled (both this check and check in __kvm_x86_vendor_init pass)."
>
> Ah, previously I had a helper to check whether CET bits were enabled in kvm_caps.supported_xss,
> so need to set the bits earlier before vmx's hardware_setup. I still want to keep the code as-is
> in case other features need to check the their related bits set before configure something in
> vmx hardware_setup.
As long as the code is correct I won't object.
Best regards,
Maxim Levitsky
>
> > In addition to that I just checked and unless I am mistaken:
> >
> > vmx_set_cpu_caps() is called from vmx's hardware_setup(), which is called
> > from __kvm_x86_vendor_init.
> >
> > After this call, __kvm_x86_vendor_init does clear kvm_caps.supported_xss,
> > but doesn't do this if the above code cleared X86_FEATURE_SHSTK/X86_FEATURE_IBT.
> >
> Yeah, I checked the history, the similar logic was there until v6, I can pick it up, thanks!
>
> > > }
> > >
> > > static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
> > > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > > index c2130d2c8e24..fb72819fbb41 100644
> > > --- a/arch/x86/kvm/vmx/vmx.h
> > > +++ b/arch/x86/kvm/vmx/vmx.h
> > > @@ -480,7 +480,8 @@ static inline u8 vmx_get_rvi(void)
> > > VM_ENTRY_LOAD_IA32_EFER | \
> > > VM_ENTRY_LOAD_BNDCFGS | \
> > > VM_ENTRY_PT_CONCEAL_PIP | \
> > > - VM_ENTRY_LOAD_IA32_RTIT_CTL)
> > > + VM_ENTRY_LOAD_IA32_RTIT_CTL | \
> > > + VM_ENTRY_LOAD_CET_STATE)
> > >
> > > #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
> > > (VM_EXIT_SAVE_DEBUG_CONTROLS | \
> > > @@ -502,7 +503,8 @@ static inline u8 vmx_get_rvi(void)
> > > VM_EXIT_LOAD_IA32_EFER | \
> > > VM_EXIT_CLEAR_BNDCFGS | \
> > > VM_EXIT_PT_CONCEAL_PIP | \
> > > - VM_EXIT_CLEAR_IA32_RTIT_CTL)
> > > + VM_EXIT_CLEAR_IA32_RTIT_CTL | \
> > > + VM_EXIT_LOAD_CET_STATE)
> > >
> > > #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
> > > (PIN_BASED_EXT_INTR_MASK | \
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index c6b57ede0f57..2bcf3c7923bf 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
> > > | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
> > > | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
> > >
> > > -#define KVM_SUPPORTED_XSS 0
> > > +#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
> > > + XFEATURE_MASK_CET_KERNEL)
> > >
> > > u64 __read_mostly host_efer;
> > > EXPORT_SYMBOL_GPL(host_efer);
> > > @@ -9854,6 +9855,15 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> > > if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> > > kvm_caps.supported_xss = 0;
> > >
> > > + if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
> > > + XFEATURE_MASK_CET_KERNEL)) !=
> > > + (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
> > > + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> > > + kvm_cpu_cap_clear(X86_FEATURE_IBT);
> > > + kvm_caps.supported_xss &= ~XFEATURE_CET_USER;
> > > + kvm_caps.supported_xss &= ~XFEATURE_CET_KERNEL;
> > > + }
> > > +
> > > #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
> > > cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
> > > #undef __kvm_cpu_cap_has
> > > @@ -12319,7 +12329,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> > >
> > > static inline bool is_xstate_reset_needed(void)
> > > {
> > > - return kvm_cpu_cap_has(X86_FEATURE_MPX);
> > > + return kvm_cpu_cap_has(X86_FEATURE_MPX) ||
> > > + kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> > > + kvm_cpu_cap_has(X86_FEATURE_IBT);
> > > }
> > >
> > > void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > @@ -12396,6 +12408,16 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > XFEATURE_BNDCSR);
> > > }
> > >
> > > + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> > > + fpstate_clear_xstate_component(fpstate,
> > > + XFEATURE_CET_USER);
> > > + fpstate_clear_xstate_component(fpstate,
> > > + XFEATURE_CET_KERNEL);
> > > + } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> > > + fpstate_clear_xstate_component(fpstate,
> > > + XFEATURE_CET_USER);
> > > + }
> > > +
> > > if (init_event)
> > > kvm_load_guest_fpu(vcpu);
> > > }
> > > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > > index d9cc352cf421..dc79dcd733ac 100644
> > > --- a/arch/x86/kvm/x86.h
> > > +++ b/arch/x86/kvm/x86.h
> > > @@ -531,6 +531,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
> > > __reserved_bits |= X86_CR4_VMXE; \
> > > if (!__cpu_has(__c, X86_FEATURE_PCID)) \
> > > __reserved_bits |= X86_CR4_PCIDE; \
> > > + if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
> > > + !__cpu_has(__c, X86_FEATURE_IBT)) \
> > > + __reserved_bits |= X86_CR4_CET; \
> > > __reserved_bits; \
> > > })
> > >
> >
> > Best regards,
> > Maxim Levitsky
> >
> >
> >
On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
> On 12/1/2023 1:53 AM, Maxim Levitsky wrote:
> > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> > > to enable CET for nested VM.
> > >
> > > Note, generally L1 VMM only touches CET VMCS fields when live migration or
> > > vmcs_{read,write}() to the fields happens, so the fields only need to be
> > > synced in these "rare" cases.
> > To be honest we can't assume anything about L1, but what we can assume
> >
> > is that if vmcs12 field is not shadowed, then L1 vmwrite/vmread will
> > be always intercepted and during the interception the fields can be synced,
> > however I studied this area long ago and I might be mistaken.
>
> The changelog wording failed to express what I meant to say:
> vmcs12 and vmcs02 should be synced to reflect the correct CET states L1 or L2 are expected
> to see. In LM case, the nested CET states should also be synced between L1 or L2 via the
> control structures.
>
> Will reword them, thanks for pointing it out!
> > > And here only considers the case that L1 VMM
> > > has set VM_ENTRY_LOAD_CET_STATE in its VMCS vm_entry_controls as it's the
> > > common usage.
> > >
> > > Suggested-by: Chao Gao <[email protected]>
> > > Signed-off-by: Yang Weijiang <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/nested.c | 48 +++++++++++++++++++++++++++++++++++++--
> > > arch/x86/kvm/vmx/vmcs12.c | 6 +++++
> > > arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++-
> > > arch/x86/kvm/vmx/vmx.c | 2 ++
> > > 4 files changed, 67 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index d8c32682ca76..965173650542 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > > nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
> > >
> > > + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> > > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > + MSR_IA32_U_CET, MSR_TYPE_RW);
> > > +
> > > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > + MSR_IA32_S_CET, MSR_TYPE_RW);
> > > +
> > > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> > > +
> > > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> > > +
> > > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> > > +
> > > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> > > +
> > > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> > > +
> > > kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
> > >
> > > vmx->nested.force_msr_bitmap_recalc = false;
> > > @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> > > if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> > > (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> > > vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> > > +
> > > + if (vmx->nested.nested_run_pending &&
> > I don't think that nested.nested_run_pending check is needed.
> > prepare_vmcs02_rare is not going to be called unless the nested run is pending.
>
> But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
> we don't need to update vmcs02's fields at the point until L2 is being resumed.
- If we restore a VM from the migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested VMCS to load.
- If we restore a VM from the migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02), then 'nested_run_pending' will be true; it is restored
from the migration stream.
- If we migrate while the nested guest has run at least once but hasn't done a VM-exit to L1 yet, then yes, nested.nested_run_pending will indeed be false,
but we still need to set up vmcs02, otherwise it will be left with default zero values.
Remember that prior to setting nested state the VM usually hasn't run even once, unlike when the guest enters nested state normally.
>
> > > + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> > > + vmcs_writel(GUEST_INTR_SSP_TABLE,
> > > + vmcs12->guest_ssp_tbl);
> > > + }
> > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> > > + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> > > + }
> > > }
> > >
> > > if (nested_cpu_has_xsaves(vmcs12))
> > > @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> > > vmcs12->guest_pending_dbg_exceptions =
> > > vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> > >
> > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> > > + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> > > + }
> > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> > > + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> > > + }
> > The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> > was loaded, then it must be updated on exit - this is usually how VMX works.
>
> I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
> so these fields for L2 guest should be synced to L1 unconditionally.
"the guest registers will be saved into VMCS fields unconditionally"
This is not true, unless there is a bug. The vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as-is to vmcs02, so if the nested guest doesn't set this bit,
the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt, for example, to save/load
this state manually or use MSR load/store lists instead.
Regardless of this,
if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then the vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit
(that is, copied back to vmcs12); this is what is written in the VMX spec.
Best regards,
Maxim Levitsky
>
> > Also I don't see any mention of usage of VM_EXIT_LOAD_CET_STATE, which if set,
> > should reset the L1 CET state to values in 'host_s_cet/host_ssp/host_ssp_tbl'
> > (This is also a common theme in VMX - host state is reset to values that the hypervisor
> > sets in VMCS, and the hypervisor must care to update these fields itself).
>
> Yes, the host CET states for L1 also should be synced, I'll add the missing part, thanks!
>
> > As a rule of thumb, if you add a field to vmcs12, you should use it somewhere,
> > and you should never use it unconditionally, as almost always its use
> > depends on entry or exit controls.
> >
> > Same is true for entry/exit/execution controls - if you add one, you almost
> > always have to use it somewhere.
>
> I'll double check if anything is lost in various cases, thanks!
>
> > Best regards,
> > Maxim Levitsky
> >
> > > +
> > > vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
> > > }
> > >
> > > @@ -6798,7 +6841,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> > > VM_EXIT_HOST_ADDR_SPACE_SIZE |
> > > #endif
> > > VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> > > - VM_EXIT_CLEAR_BNDCFGS;
> > > + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> > > msrs->exit_ctls_high |=
> > > VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> > > VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> > > @@ -6820,7 +6863,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> > > #ifdef CONFIG_X86_64
> > > VM_ENTRY_IA32E_MODE |
> > > #endif
> > > - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> > > + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> > > + VM_ENTRY_LOAD_CET_STATE;
> > > msrs->entry_ctls_high |=
> > > (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> > > VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> > > diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> > > index 106a72c923ca..4233b5ca9461 100644
> > > --- a/arch/x86/kvm/vmx/vmcs12.c
> > > +++ b/arch/x86/kvm/vmx/vmcs12.c
> > > @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> > > FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> > > FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> > > FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> > > + FIELD(GUEST_S_CET, guest_s_cet),
> > > + FIELD(GUEST_SSP, guest_ssp),
> > > + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> > > FIELD(HOST_CR0, host_cr0),
> > > FIELD(HOST_CR3, host_cr3),
> > > FIELD(HOST_CR4, host_cr4),
> > > @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
> > > FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
> > > FIELD(HOST_RSP, host_rsp),
> > > FIELD(HOST_RIP, host_rip),
> > > + FIELD(HOST_S_CET, host_s_cet),
> > > + FIELD(HOST_SSP, host_ssp),
> > > + FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
> > > };
> > > const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
> > > diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> > > index 01936013428b..3884489e7f7e 100644
> > > --- a/arch/x86/kvm/vmx/vmcs12.h
> > > +++ b/arch/x86/kvm/vmx/vmcs12.h
> > > @@ -117,7 +117,13 @@ struct __packed vmcs12 {
> > > natural_width host_ia32_sysenter_eip;
> > > natural_width host_rsp;
> > > natural_width host_rip;
> > > - natural_width paddingl[8]; /* room for future expansion */
> > > + natural_width host_s_cet;
> > > + natural_width host_ssp;
> > > + natural_width host_ssp_tbl;
> > > + natural_width guest_s_cet;
> > > + natural_width guest_ssp;
> > > + natural_width guest_ssp_tbl;
> > > + natural_width paddingl[2]; /* room for future expansion */
> > > u32 pin_based_vm_exec_control;
> > > u32 cpu_based_vm_exec_control;
> > > u32 exception_bitmap;
> > > @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
> > > CHECK_OFFSET(host_ia32_sysenter_eip, 656);
> > > CHECK_OFFSET(host_rsp, 664);
> > > CHECK_OFFSET(host_rip, 672);
> > > + CHECK_OFFSET(host_s_cet, 680);
> > > + CHECK_OFFSET(host_ssp, 688);
> > > + CHECK_OFFSET(host_ssp_tbl, 696);
> > > + CHECK_OFFSET(guest_s_cet, 704);
> > > + CHECK_OFFSET(guest_ssp, 712);
> > > + CHECK_OFFSET(guest_ssp_tbl, 720);
> > > CHECK_OFFSET(pin_based_vm_exec_control, 744);
> > > CHECK_OFFSET(cpu_based_vm_exec_control, 748);
> > > CHECK_OFFSET(exception_bitmap, 752);
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index a1aae8709939..947028ff2e25 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7734,6 +7734,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
> > > cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
> > > cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
> > > cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
> > > + cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
> > > + cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));
> > >
> > > #undef cr4_fixed1_update
> > > }
On Fri, 2023-12-01 at 15:49 +0800, Yang, Weijiang wrote:
> On 12/1/2023 1:33 AM, Maxim Levitsky wrote:
> > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > Define new XFEATURE_MASK_KERNEL_DYNAMIC set including the features can be
> > I am not sure though that this name is correct, but I don't know if I can
> > suggest a better name.
>
> It's a symmetry of XFEATURE_MASK_USER_DYNAMIC ;-)
> > > optionally enabled by kernel components, i.e., the features are required by
> > > specific kernel components. Currently it's used by KVM to configure guest
> > > dedicated fpstate for calculating the xfeature and fpstate storage size etc.
> > >
> > > The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL, which is
> > > supported by host as they're enabled in xsaves/xrstors operating xfeature set
> > > (XCR0 | XSS), but the relevant CPU feature, i.e., supervisor shadow stack, is
> > > not enabled in host kernel so it can be omitted for normal fpstate by default.
> > >
> > > Remove the kernel dynamic feature from fpu_kernel_cfg.default_features so that
> > > the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors can be
> > > optimized by HW for normal fpstate.
> > >
> > > Suggested-by: Dave Hansen <[email protected]>
> > > Signed-off-by: Yang Weijiang <[email protected]>
> > > ---
> > > arch/x86/include/asm/fpu/xstate.h | 5 ++++-
> > > arch/x86/kernel/fpu/xstate.c | 1 +
> > > 2 files changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> > > index 3b4a038d3c57..a212d3851429 100644
> > > --- a/arch/x86/include/asm/fpu/xstate.h
> > > +++ b/arch/x86/include/asm/fpu/xstate.h
> > > @@ -46,9 +46,12 @@
> > > #define XFEATURE_MASK_USER_RESTORE \
> > > (XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
> > >
> > > -/* Features which are dynamically enabled for a process on request */
> > > +/* Features which are dynamically enabled per userspace request */
> > > #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
> > >
> > > +/* Features which are dynamically enabled per kernel side request */
> > I suggest to explain this a bit better. How about something like that:
> >
> > "Kernel features that are not enabled by default for all processes, but can
> > be still used by some processes, for example to support guest virtualization"
>
> It looks good to me, will apply it in next version, thanks!
>
> > But feel free to keep it as is or propose something else. IMHO this will
> > be confusing this way or another.
> >
> >
> > Another question: kernel already has a notion of 'independent features'
> > which are currently kernel features that are enabled in IA32_XSS but not present in 'fpu_kernel_cfg.max_features'
> >
> > Currently only 'XFEATURE_LBR' is in this set. These features are saved/restored manually
> > from independent buffer (in case of LBRs, perf code cares for this).
> >
> > Does it make sense to add CET_S to there as well instead of having XFEATURE_MASK_KERNEL_DYNAMIC,
>
> CET_S here refers to PL{0,1,2}_SSP, right?
>
> IMHO, perf relies on dedicated code to switch LBR MSRs for various reason, e.g., overhead, the feature
> owns dozens of MSRs, remove xfeature bit will offload the burden of common FPU/xsave framework.
This is true, but the question that begs to be asked is what the true purpose of the 'independent features' is
from the POV of the kernel FPU framework. IMHO these are features that the framework is not aware of, except
that it enables them in IA32_XSS (and in XCR0 in the future).
For guest-only features, like CET_S, it is also kind of the same thing (xsave, but to the guest state area only).
I don't insist that we add CET_S to the independent features; I just gave an idea that maybe it is better,
from a complexity point of view, to add CET there. It's up to you to decide.
Sean what do you think?
Best regards,
Maxim Levitsky
>
> But CET only has 3 supervisor MSRs and they need to be managed together with user mode MSRs.
> Enabling it in common FPU framework would make the switch/swap much easier without additional
> support code.
>
> > and maybe rename the
> > 'XFEATURE_MASK_INDEPENDENT' to something like 'XFEATURES_THE_KERNEL_DOESNT_CARE_ABOUT'
> > (terrible name, but you might think of a better name)
> >
> >
> > > +#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
> > > +
> > > /* All currently supported supervisor features */
> > > #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
> > > XFEATURE_MASK_CET_USER | \
> > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > index b57d909facca..ba4172172afd 100644
> > > --- a/arch/x86/kernel/fpu/xstate.c
> > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > @@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> > > /* Clean out dynamic features from default */
> > > fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> > > fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> > > + fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
> > >
> > > fpu_user_cfg.default_features = fpu_user_cfg.max_features;
> > > fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> >
> > Best regards,
> > Maxim Levitsky
> >
> >
> >
> >
On 12/5/2023 5:55 PM, Maxim Levitsky wrote:
> On Fri, 2023-12-01 at 15:49 +0800, Yang, Weijiang wrote:
>> On 12/1/2023 1:33 AM, Maxim Levitsky wrote:
>>> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>>>> Define new XFEATURE_MASK_KERNEL_DYNAMIC set including the features can be
>>> I am not sure though that this name is correct, but I don't know if I can
>>> suggest a better name.
>> It's a symmetry of XFEATURE_MASK_USER_DYNAMIC ;-)
>>>> optionally enabled by kernel components, i.e., the features are required by
>>>> specific kernel components. Currently it's used by KVM to configure guest
>>>> dedicated fpstate for calculating the xfeature and fpstate storage size etc.
>>>>
>>>> The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL, which is
>>>> supported by host as they're enabled in xsaves/xrstors operating xfeature set
>>>> (XCR0 | XSS), but the relevant CPU feature, i.e., supervisor shadow stack, is
>>>> not enabled in host kernel so it can be omitted for normal fpstate by default.
>>>>
>>>> Remove the kernel dynamic feature from fpu_kernel_cfg.default_features so that
>>>> the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors can be
>>>> optimized by HW for normal fpstate.
>>>>
>>>> Suggested-by: Dave Hansen <[email protected]>
>>>> Signed-off-by: Yang Weijiang <[email protected]>
>>>> ---
>>>> arch/x86/include/asm/fpu/xstate.h | 5 ++++-
>>>> arch/x86/kernel/fpu/xstate.c | 1 +
>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
>>>> index 3b4a038d3c57..a212d3851429 100644
>>>> --- a/arch/x86/include/asm/fpu/xstate.h
>>>> +++ b/arch/x86/include/asm/fpu/xstate.h
>>>> @@ -46,9 +46,12 @@
>>>> #define XFEATURE_MASK_USER_RESTORE \
>>>> (XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
>>>>
>>>> -/* Features which are dynamically enabled for a process on request */
>>>> +/* Features which are dynamically enabled per userspace request */
>>>> #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
>>>>
>>>> +/* Features which are dynamically enabled per kernel side request */
>>> I suggest to explain this a bit better. How about something like that:
>>>
>>> "Kernel features that are not enabled by default for all processes, but can
>>> be still used by some processes, for example to support guest virtualization"
>> It looks good to me, will apply it in next version, thanks!
>>
>>> But feel free to keep it as is or propose something else. IMHO this will
>>> be confusing this way or another.
>>>
>>>
>>> Another question: kernel already has a notion of 'independent features'
>>> which are currently kernel features that are enabled in IA32_XSS but not present in 'fpu_kernel_cfg.max_features'
>>>
>>> Currently only 'XFEATURE_LBR' is in this set. These features are saved/restored manually
>>> from independent buffer (in case of LBRs, perf code cares for this).
>>>
>>> Does it make sense to add CET_S to there as well instead of having XFEATURE_MASK_KERNEL_DYNAMIC,
>> CET_S here refers to PL{0,1,2}_SSP, right?
>>
>> IMHO, perf relies on dedicated code to switch LBR MSRs for various reason, e.g., overhead, the feature
>> owns dozens of MSRs, remove xfeature bit will offload the burden of common FPU/xsave framework.
> This is true, but the question that begs to be asked, is what is the true purpose of the 'independent features' is
> from the POV of the kernel FPU framework. IMHO these are features that the framework is not aware of, except
> that it enables it in IA32_XSS (and in XCR0 in the future).
This was the original intention for introducing independent features (initially called dynamic features, renamed later); judging from the
changelog, the major concern is overhead:
commit f0dccc9da4c0fda049e99326f85db8c242fd781f
Author: Kan Liang <[email protected]>
Date: Fri Jul 3 05:49:26 2020 -0700
x86/fpu/xstate: Support dynamic supervisor feature for LBR
"However, the kernel should not save/restore the LBR state component at
each context switch, like other state components, because of the
following unique features of LBR:
- The LBR state component only contains valuable information when LBR
is enabled in the perf subsystem, but for most of the time, LBR is
disabled.
- The size of the LBR state component is huge. For the current
platform, it's 808 bytes.
If the kernel saves/restores the LBR state at each context switch, for
most of the time, it is just a waste of space and cycles."
>
> For the guest only features, like CET_S, it is also kind of the same thing (xsave but to guest state area only).
> I don't insist that we add CET_S to independent features, but I just gave an idea that maybe that is better
> from complexity point of view to add CET there. It's up to you to decide.
>
> Sean what do you think?
>
> Best regards,
> Maxim Levitsky
>
>
>> But CET only has 3 supervisor MSRs and they need to be managed together with user mode MSRs.
>> Enabling it in common FPU framework would make the switch/swap much easier without additional
>> support code.
>>
>>> and maybe rename the
>>> 'XFEATURE_MASK_INDEPENDENT' to something like 'XFEATURES_THE_KERNEL_DOESNT_CARE_ABOUT'
>>> (terrible name, but you might think of a better name)
>>>
>>>
>>>> +#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
>>>> +
>>>> /* All currently supported supervisor features */
>>>> #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
>>>> XFEATURE_MASK_CET_USER | \
>>>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>>>> index b57d909facca..ba4172172afd 100644
>>>> --- a/arch/x86/kernel/fpu/xstate.c
>>>> +++ b/arch/x86/kernel/fpu/xstate.c
>>>> @@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>>>> /* Clean out dynamic features from default */
>>>> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
>>>> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>>>> + fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
>>>>
>>>> fpu_user_cfg.default_features = fpu_user_cfg.max_features;
>>>> fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>>> Best regards,
>>> Maxim Levitsky
>>>
>>>
>>>
>>>
>
>
>
On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
> On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
[...]
>>>> vmx->nested.force_msr_bitmap_recalc = false;
>>>> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>>>> if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>>>> (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>>>> vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>>>> +
>>>> + if (vmx->nested.nested_run_pending &&
>>> I don't think that nested.nested_run_pending check is needed.
>>> prepare_vmcs02_rare is not going to be called unless the nested run is pending.
>> But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
>> we don't need to update vmcs02's fields at the point until L2 is being resumed.
> - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
> because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
>
> - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
> but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
> from the migration stream.
>
> - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
> but we still need to setup vmcs02, otherwise it will be left with default zero values.
Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove the nested.nested_run_pending check.
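Just to spell it out, the hunk would then key purely off the L1 entry control, i.e., roughly the below (untested sketch only,
reusing the field/helper names from the patch above):

	/*
	 * Propagate L1's CET guest fields into vmcs02 only when L1 asked
	 * for them to be loaded on VM entry.
	 */
	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
			vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
			vmcs_writel(GUEST_INTR_SSP_TABLE, vmcs12->guest_ssp_tbl);
		}
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
		    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
			vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
	}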
> Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
>
>>>> + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>>>> + vmcs_writel(GUEST_INTR_SSP_TABLE,
>>>> + vmcs12->guest_ssp_tbl);
>>>> + }
>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>>>> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>>>> + }
>>>> }
>>>>
>>>> if (nested_cpu_has_xsaves(vmcs12))
>>>> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>>>> vmcs12->guest_pending_dbg_exceptions =
>>>> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>>>
>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>> + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>>>> + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>>>> + }
>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>>>> + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>>>> + }
>>> The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
>>> was loaded, then it must be updated on exit - this is usually how VMX works.
>> I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
>> so these fields for L2 guest should be synced to L1 unconditionally.
> "the guest registers will be saved into VMCS fields unconditionally"
> This is not true, unless there is a bug.
I checked the latest SDM; there's no such wording regarding the CET entry/exit control bits. The wording comes from
the standalone CET spec:
"10.6 VM Exit
On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
But since it doesn't appear in the SDM, I shouldn't take it for granted.
> the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
> the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
> this state manually or use msr load/store lists instead.
Right, although the use case should be rare; I will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
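For the vmcs02 -> vmcs12 sync side, what I have in mind is roughly the below (untested sketch only, same names as in the
patch above; whether the guard is really needed on the save side is discussed further down in this thread):

	/*
	 * Reflect the CET fields back into vmcs12 only when L1 requested
	 * them to be loaded for L2 in the first place.
	 */
	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
			vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
			vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
		}
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
		    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
			vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
	}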
> Regardless of this,
> if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
> (that is copied back to vmcs12) this is what is written in the VMX spec.
What's the VMX spec you're referring to here?
On Wed, 2023-12-06 at 11:00 +0800, Yang, Weijiang wrote:
> On 12/5/2023 5:55 PM, Maxim Levitsky wrote:
> > On Fri, 2023-12-01 at 15:49 +0800, Yang, Weijiang wrote:
> > > On 12/1/2023 1:33 AM, Maxim Levitsky wrote:
> > > > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > > > Define new XFEATURE_MASK_KERNEL_DYNAMIC set including the features can be
> > > > I am not sure though that this name is correct, but I don't know if I can
> > > > suggest a better name.
> > > It's a symmetry of XFEATURE_MASK_USER_DYNAMIC ;-)
> > > > > optionally enabled by kernel components, i.e., the features are required by
> > > > > specific kernel components. Currently it's used by KVM to configure guest
> > > > > dedicated fpstate for calculating the xfeature and fpstate storage size etc.
> > > > >
> > > > > The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL, which is
> > > > > supported by host as they're enabled in xsaves/xrstors operating xfeature set
> > > > > (XCR0 | XSS), but the relevant CPU feature, i.e., supervisor shadow stack, is
> > > > > not enabled in host kernel so it can be omitted for normal fpstate by default.
> > > > >
> > > > > Remove the kernel dynamic feature from fpu_kernel_cfg.default_features so that
> > > > > the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors can be
> > > > > optimized by HW for normal fpstate.
> > > > >
> > > > > Suggested-by: Dave Hansen <[email protected]>
> > > > > Signed-off-by: Yang Weijiang <[email protected]>
> > > > > ---
> > > > > arch/x86/include/asm/fpu/xstate.h | 5 ++++-
> > > > > arch/x86/kernel/fpu/xstate.c | 1 +
> > > > > 2 files changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> > > > > index 3b4a038d3c57..a212d3851429 100644
> > > > > --- a/arch/x86/include/asm/fpu/xstate.h
> > > > > +++ b/arch/x86/include/asm/fpu/xstate.h
> > > > > @@ -46,9 +46,12 @@
> > > > > #define XFEATURE_MASK_USER_RESTORE \
> > > > > (XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
> > > > >
> > > > > -/* Features which are dynamically enabled for a process on request */
> > > > > +/* Features which are dynamically enabled per userspace request */
> > > > > #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
> > > > >
> > > > > +/* Features which are dynamically enabled per kernel side request */
> > > > I suggest to explain this a bit better. How about something like that:
> > > >
> > > > "Kernel features that are not enabled by default for all processes, but can
> > > > be still used by some processes, for example to support guest virtualization"
> > > It looks good to me, will apply it in next version, thanks!
> > >
> > > > But feel free to keep it as is or propose something else. IMHO this will
> > > > be confusing this way or another.
> > > >
> > > >
> > > > Another question: kernel already has a notion of 'independent features'
> > > > which are currently kernel features that are enabled in IA32_XSS but not present in 'fpu_kernel_cfg.max_features'
> > > >
> > > > Currently only 'XFEATURE_LBR' is in this set. These features are saved/restored manually
> > > > from independent buffer (in case of LBRs, perf code cares for this).
> > > >
> > > > Does it make sense to add CET_S to there as well instead of having XFEATURE_MASK_KERNEL_DYNAMIC,
> > > CET_S here refers to PL{0,1,2}_SSP, right?
> > >
> > > IMHO, perf relies on dedicated code to switch LBR MSRs for various reason, e.g., overhead, the feature
> > > owns dozens of MSRs, remove xfeature bit will offload the burden of common FPU/xsave framework.
> > This is true, but the question that begs to be asked, is what is the true purpose of the 'independent features' is
> > from the POV of the kernel FPU framework. IMHO these are features that the framework is not aware of, except
> > that it enables it in IA32_XSS (and in XCR0 in the future).
>
> This is the origin intention for introducing independent features(firstly called dynamic feature, renamed later), from the
> changelog the major concern is overhead:
Yes, and to some extent the reason why we want the CET supervisor state not saved in a normal thread's FPU state is also overhead,
because in theory, if the kernel did save it, the MSRs would be in the INIT state and thus XSAVES shouldn't have any functional impact,
even if it saves/restores them for nothing.
In other words, as I said, independent features = features that the FPU state doesn't manage, and are just optionally enabled,
so that custom code can do a custom xsave(s)/xrstor(s), likely from/to a custom area to save/load these features.
It might make sense to rename independent features again to something like 'unmanaged features' or 'manual features' or something
like that.
Another interesting question that arises here is that once KVM supports arch LBRs, it will likely need to expose XFEATURE_LBR
to the guest and will need to context switch it similarly to the CET_S state, which strengthens the argument that CET_S should
be in the same group as the 'independent features'.
Depending on the performance impact, XFEATURE_LBR might even need to be dynamically allocated.
For the reference this is the patch series that introduced the arch LBRs to KVM:
https://www.spinics.net/lists/kvm/msg296507.html
Best regards,
Maxim Levitsky
>
> commit f0dccc9da4c0fda049e99326f85db8c242fd781f
> Author: Kan Liang <[email protected]>
> Date: Fri Jul 3 05:49:26 2020 -0700
>
> x86/fpu/xstate: Support dynamic supervisor feature for LBR
>
> "However, the kernel should not save/restore the LBR state component at
> each context switch, like other state components, because of the
> following unique features of LBR:
> - The LBR state component only contains valuable information when LBR
> is enabled in the perf subsystem, but for most of the time, LBR is
> disabled.
> - The size of the LBR state component is huge. For the current
> platform, it's 808 bytes.
> If the kernel saves/restores the LBR state at each context switch, for
> most of the time, it is just a waste of space and cycles."
>
> > For the guest only features, like CET_S, it is also kind of the same thing (xsave but to guest state area only).
> > I don't insist that we add CET_S to independent features, but I just gave an idea that maybe that is better
> > from complexity point of view to add CET there. It's up to you to decide.
> >
> > Sean what do you think?
> >
> > Best regards,
> > Maxim Levitsky
> >
> >
> > > But CET only has 3 supervisor MSRs and they need to be managed together with user mode MSRs.
> > > Enabling it in common FPU framework would make the switch/swap much easier without additional
> > > support code.
> > >
> > > > and maybe rename the
> > > > 'XFEATURE_MASK_INDEPENDENT' to something like 'XFEATURES_THE_KERNEL_DOESNT_CARE_ABOUT'
> > > > (terrible name, but you might think of a better name)
> > > >
> > > >
> > > > > +#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
> > > > > +
> > > > > /* All currently supported supervisor features */
> > > > > #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
> > > > > XFEATURE_MASK_CET_USER | \
> > > > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > > > index b57d909facca..ba4172172afd 100644
> > > > > --- a/arch/x86/kernel/fpu/xstate.c
> > > > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > > > @@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> > > > > /* Clean out dynamic features from default */
> > > > > fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> > > > > fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> > > > > + fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
> > > > >
> > > > > fpu_user_cfg.default_features = fpu_user_cfg.max_features;
> > > > > fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> > > > Best regards,
> > > > Maxim Levitsky
> > > >
> > > >
> > > >
> > > >
> >
> >
On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
> On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
> > On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
>
> [...]
>
> > > > > vmx->nested.force_msr_bitmap_recalc = false;
> > > > > @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> > > > > if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> > > > > (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> > > > > vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> > > > > +
> > > > > + if (vmx->nested.nested_run_pending &&
> > > > I don't think that nested.nested_run_pending check is needed.
> > > > prepare_vmcs02_rare is not going to be called unless the nested run is pending.
> > > But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
> > > we don't need to update vmcs02's fields at the point until L2 is being resumed.
> > - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
> > because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
> >
> > - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
> > but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
> > from the migration stream.
> >
> > - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
> > but we still need to setup vmcs02, otherwise it will be left with default zero values.
>
> Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
> > Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
> >
> > > > > + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> > > > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> > > > > + vmcs_writel(GUEST_INTR_SSP_TABLE,
> > > > > + vmcs12->guest_ssp_tbl);
> > > > > + }
> > > > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> > > > > + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> > > > > + }
> > > > > }
> > > > >
> > > > > if (nested_cpu_has_xsaves(vmcs12))
> > > > > @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> > > > > vmcs12->guest_pending_dbg_exceptions =
> > > > > vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> > > > >
> > > > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> > > > > + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> > > > > + }
> > > > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> > > > > + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> > > > > + }
> > > > The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> > > > was loaded, then it must be updated on exit - this is usually how VMX works.
> > > I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
> > > so these fields for L2 guest should be synced to L1 unconditionally.
> > "the guest registers will be saved into VMCS fields unconditionally"
> > This is not true, unless there is a bug.
>
> I checked the latest SDM, there's no such kind of wording regarding CET entry/exit control bits. The wording comes from
> the individual CET spec.:
> "10.6 VM Exit
> On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
> But since it doesn't appear in SDM, I shouldn't take it for granted.
SDM spec from September 2023:
28.3.1 Saving Control Registers, Debug Registers, and MSRs
"If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
support Intel 64 architecture, bits 63:32 of these MSRs are not saved."
Honestly it's not 100% clear if the “load CET” control has to be set to 1 to trigger that save, or whether this control just needs to be
supported on the CPU.
It does feel like you are right here, that the CPU always saves the guest state, but allows not loading it on VM entry via the
“load CET” VM-entry control.
IMHO it's best to check what bare metal does by rigging a test: patch the host kernel to not set the 'load CET' control,
and see if the CPU still updates the guest CET fields on VM exit.
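For illustration, one rough way to rig that (a hypothetical debug hack only, not meant for merging, assuming the usual
vm_entry_controls_clearbit() helper and the GUEST_* fields added by this series):

	/* Debug only: run the guest without "load CET" on VM entry. */
	vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_CET_STATE);

	/*
	 * ... then, on the VM-exit path, see whether the CPU still updated
	 * the guest CET fields.
	 */
	pr_info("exit: GUEST_S_CET=%lx GUEST_SSP=%lx\n",
		vmcs_readl(GUEST_S_CET), vmcs_readl(GUEST_SSP));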
>
> > the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
> > the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
> > this state manually or use msr load/store lists instead.
>
> Right although the use case should be rare, will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
> > Regardless of this,
> > if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
> > (that is copied back to vmcs12) this is what is written in the VMX spec.
>
> What's the VMX spec. your're referring to here?
SDM.
In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the
VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.
And about the other direction (vmcs02 back to vmcs12), assuming that I made a mistake as I said above, that one is indeed unconditional.
Sorry for a bit of confusion.
Best regards,
Maxim Levitsky
>
>
On 12/7/2023 1:24 AM, Maxim Levitsky wrote:
> On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
>> On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
>>> On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
>> [...]
>>
>>>>>> vmx->nested.force_msr_bitmap_recalc = false;
>>>>>> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>>>>>> if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>>>>>> (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>>>>>> vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>>>>>> +
>>>>>> + if (vmx->nested.nested_run_pending &&
>>>>> I don't think that nested.nested_run_pending check is needed.
>>>>> prepare_vmcs02_rare is not going to be called unless the nested run is pending.
>>>> But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
>>>> we don't need to update vmcs02's fields at the point until L2 is being resumed.
>>> - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
>>> because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
>>>
>>> - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
>>> but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
>>> from the migration stream.
>>>
>>> - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
>>> but we still need to setup vmcs02, otherwise it will be left with default zero values.
>> Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
>>> Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
>>>
>>>>>> + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
>>>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>>>> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>>>>>> + vmcs_writel(GUEST_INTR_SSP_TABLE,
>>>>>> + vmcs12->guest_ssp_tbl);
>>>>>> + }
>>>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>>>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>>>>>> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> if (nested_cpu_has_xsaves(vmcs12))
>>>>>> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>>>>>> vmcs12->guest_pending_dbg_exceptions =
>>>>>> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>>>>>
>>>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>>>> + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>>>>>> + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>>>>>> + }
>>>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>>>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>>>>>> + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>>>>>> + }
>>>>> The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
>>>>> was loaded, then it must be updated on exit - this is usually how VMX works.
>>>> I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
>>>> so these fields for L2 guest should be synced to L1 unconditionally.
>>> "the guest registers will be saved into VMCS fields unconditionally"
>>> This is not true, unless there is a bug.
>> I checked the latest SDM, there's no such kind of wording regarding CET entry/exit control bits. The wording comes from
>> the individual CET spec.:
>> "10.6 VM Exit
>> On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
>> But since it doesn't appear in SDM, I shouldn't take it for granted.
> SDM spec from September 2023:
>
> 28.3.1 Saving Control Registers, Debug Registers, and MSRs
>
> "If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
> IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
> support Intel 64 architecture, bits 63:32 of these MSRs are not saved."
>
> Honestly it's not 100% clear if the “load CET” should be set to 1 to trigger the restore, or that this control just needs to be
> supported on the CPU.
> It does feel like you are right here, that CPU always saves the guest state, but allows to not load it on VM entry via
> “load CET” VM entry control.
>
> IMHO its best to check what the bare metal does by rigging a test by patching the host kernel to not set the 'load CET' control,
> and see if the CPU still updates the guest CET fields on the VM exit.
OK, I'll do some tests to see what's happening, thanks!
>>> the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
>>> the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
>>> this state manually or use msr load/store lists instead.
>> Right although the use case should be rare, will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
>
>>> Regardless of this,
>>> if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
>>> (that is copied back to vmcs12) this is what is written in the VMX spec.
>> What's the VMX spec. your're referring to here?
> SDM.
>
> In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the
> VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.
>
> And about the other way around, assuming that I made a mistake as I said above, then the other way around is indeed unconditional.
>
>
> Sorry for a bit of a confusion.
NP, I'll also double-check it with the HW architects and get back.
Thanks for raising these questions!
> Best regards,
> Maxim Levitsky
>
>
>>
>
On Fri, 2023-12-08 at 23:15 +0800, Yang, Weijiang wrote:
> On 12/7/2023 1:24 AM, Maxim Levitsky wrote:
> > On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
> > > On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
> > > > On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
> > > [...]
> > >
> > > > > > > vmx->nested.force_msr_bitmap_recalc = false;
> > > > > > > @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> > > > > > > if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> > > > > > > (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> > > > > > > vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> > > > > > > +
> > > > > > > + if (vmx->nested.nested_run_pending &&
> > > > > > I don't think that nested.nested_run_pending check is needed.
> > > > > > prepare_vmcs02_rare is not going to be called unless the nested run is pending.
> > > > > But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
> > > > > we don't need to update vmcs02's fields at the point until L2 is being resumed.
> > > > - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
> > > > because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
> > > >
> > > > - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
> > > > but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
> > > > from the migration stream.
> > > >
> > > > - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
> > > > but we still need to setup vmcs02, otherwise it will be left with default zero values.
> > > Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
> > > > Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
> > > >
> > > > > > > + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> > > > > > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > > > + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> > > > > > > + vmcs_writel(GUEST_INTR_SSP_TABLE,
> > > > > > > + vmcs12->guest_ssp_tbl);
> > > > > > > + }
> > > > > > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > > > + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> > > > > > > + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> > > > > > > + }
> > > > > > > }
> > > > > > >
> > > > > > > if (nested_cpu_has_xsaves(vmcs12))
> > > > > > > @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> > > > > > > vmcs12->guest_pending_dbg_exceptions =
> > > > > > > vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> > > > > > >
> > > > > > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > > > + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> > > > > > > + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> > > > > > > + }
> > > > > > > + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > > > + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> > > > > > > + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> > > > > > > + }
> > > > > > The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> > > > > > was loaded, then it must be updated on exit - this is usually how VMX works.
> > > > > I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
> > > > > so these fields for L2 guest should be synced to L1 unconditionally.
> > > > "the guest registers will be saved into VMCS fields unconditionally"
> > > > This is not true, unless there is a bug.
> > > I checked the latest SDM, there's no such kind of wording regarding CET entry/exit control bits. The wording comes from
> > > the individual CET spec.:
> > > "10.6 VM Exit
> > > On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
> > > But since it doesn't appear in SDM, I shouldn't take it for granted.
> > SDM spec from September 2023:
> >
> > 28.3.1 Saving Control Registers, Debug Registers, and MSRs
> >
> > "If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
> > IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
> > support Intel 64 architecture, bits 63:32 of these MSRs are not saved."
> >
> > Honestly it's not 100% clear if the “load CET” should be set to 1 to trigger the restore, or that this control just needs to be
> > supported on the CPU.
> > It does feel like you are right here, that CPU always saves the guest state, but allows to not load it on VM entry via
> > “load CET” VM entry control.
> >
> > IMHO its best to check what the bare metal does by rigging a test by patching the host kernel to not set the 'load CET' control,
> > and see if the CPU still updates the guest CET fields on the VM exit.
>
> OK, I'll do some tests to see what's happening, thanks!
> > > > the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
> > > > the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
> > > > this state manually or use msr load/store lists instead.
> > > Right although the use case should be rare, will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
> > > > Regardless of this,
> > > > if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
> > > > (that is copied back to vmcs12) this is what is written in the VMX spec.
> > > What's the VMX spec. your're referring to here?
> > SDM.
> >
> > In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the
> > VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.
I noticed a typo. I meant that the CPU shouldn't care about their values in the vmcs02.
> >
> > And about the other way around, assuming that I made a mistake as I said above, then the other way around is indeed unconditional.
> >
> >
> > Sorry for a bit of a confusion.
>
> NP, I also double check it with HW Arch and get it back.
> Thanks for raising these questions!
Thanks to you too!
Best regards,
Maxim Levitsky
>
> > Best regards,
> > Maxim Levitsky
> >
> >
On 12/7/2023 12:11 AM, Maxim Levitsky wrote:
> On Wed, 2023-12-06 at 11:00 +0800, Yang, Weijiang wrote:
>> On 12/5/2023 5:55 PM, Maxim Levitsky wrote:
>>> On Fri, 2023-12-01 at 15:49 +0800, Yang, Weijiang wrote:
>>>> On 12/1/2023 1:33 AM, Maxim Levitsky wrote:
>>>>> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>>>>>> Define new XFEATURE_MASK_KERNEL_DYNAMIC set including the features can be
>>>>> I am not sure though that this name is correct, but I don't know if I can
>>>>> suggest a better name.
>>>> It's a symmetry of XFEATURE_MASK_USER_DYNAMIC ;-)
>>>>>> optionally enabled by kernel components, i.e., the features are required by
>>>>>> specific kernel components. Currently it's used by KVM to configure guest
>>>>>> dedicated fpstate for calculating the xfeature and fpstate storage size etc.
>>>>>>
>>>>>> The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL, which is
>>>>>> supported by host as they're enabled in xsaves/xrstors operating xfeature set
>>>>>> (XCR0 | XSS), but the relevant CPU feature, i.e., supervisor shadow stack, is
>>>>>> not enabled in host kernel so it can be omitted for normal fpstate by default.
>>>>>>
>>>>>> Remove the kernel dynamic feature from fpu_kernel_cfg.default_features so that
>>>>>> the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors can be
>>>>>> optimized by HW for normal fpstate.
>>>>>>
>>>>>> Suggested-by: Dave Hansen <[email protected]>
>>>>>> Signed-off-by: Yang Weijiang <[email protected]>
>>>>>> ---
>>>>>> arch/x86/include/asm/fpu/xstate.h | 5 ++++-
>>>>>> arch/x86/kernel/fpu/xstate.c | 1 +
>>>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
>>>>>> index 3b4a038d3c57..a212d3851429 100644
>>>>>> --- a/arch/x86/include/asm/fpu/xstate.h
>>>>>> +++ b/arch/x86/include/asm/fpu/xstate.h
>>>>>> @@ -46,9 +46,12 @@
>>>>>> #define XFEATURE_MASK_USER_RESTORE \
>>>>>> (XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
>>>>>>
>>>>>> -/* Features which are dynamically enabled for a process on request */
>>>>>> +/* Features which are dynamically enabled per userspace request */
>>>>>> #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
>>>>>>
>>>>>> +/* Features which are dynamically enabled per kernel side request */
>>>>> I suggest to explain this a bit better. How about something like that:
>>>>>
>>>>> "Kernel features that are not enabled by default for all processes, but can
>>>>> be still used by some processes, for example to support guest virtualization"
>>>> It looks good to me, will apply it in next version, thanks!
>>>>
>>>>> But feel free to keep it as is or propose something else. IMHO this will
>>>>> be confusing this way or another.
>>>>>
>>>>>
>>>>> Another question: kernel already has a notion of 'independent features'
>>>>> which are currently kernel features that are enabled in IA32_XSS but not present in 'fpu_kernel_cfg.max_features'
>>>>>
>>>>> Currently only 'XFEATURE_LBR' is in this set. These features are saved/restored manually
>>>>> from independent buffer (in case of LBRs, perf code cares for this).
>>>>>
>>>>> Does it make sense to add CET_S to there as well instead of having XFEATURE_MASK_KERNEL_DYNAMIC,
>>>> CET_S here refers to PL{0,1,2}_SSP, right?
>>>>
>>>> IMHO, perf relies on dedicated code to switch LBR MSRs for various reason, e.g., overhead, the feature
>>>> owns dozens of MSRs, remove xfeature bit will offload the burden of common FPU/xsave framework.
>>> This is true, but the question that begs to be asked, is what is the true purpose of the 'independent features' is
>>> from the POV of the kernel FPU framework. IMHO these are features that the framework is not aware of, except
>>> that it enables it in IA32_XSS (and in XCR0 in the future).
>> This is the origin intention for introducing independent features(firstly called dynamic feature, renamed later), from the
>> changelog the major concern is overhead:
> Yes, and to some extent the reason why we want to have CET supervisor state not saved on normal thread's FPU state is also overhead,
> because in theory if the kernel did save it, the MSRs will be in INIT state and thus XSAVES shouldn't have any functional impact,
> even if it saves/restores them for nothing.
The CET supervisor state in a normal thread's FPU state won't always be in the INIT state. Per the SDM, its INIT state is defined only when the 3 MSRs are 0,
but if a guest is using supervisor CET, then with vCPU migration between pCPUs, more and more MSRs would hold non-zero contents.
This doesn't impact host kernel behavior because host CET_S is still disabled, but it does impact host XSAVES/XRSTORS behavior.
> In other words, as I said, independent features = features that FPU state doesn't manage, and are just optionally enabled,
> so that a custom code can do a custom xsave(s)/xrstor(s), likely from/to a custom area to save/load these features.
>
> It might make sense to rename independent features again to something like 'unmanaged features' or 'manual features' or something
> like that.
>
>
> Another interesting question that arises here, is once KVM supports arch LBRs, it will likely need to expose the XFEATURE_LBR
> to the guest and will need to context switch it similar to CET_S state, which strengthens the argument that CET_S should
> be in the same group as the 'independent features'.
>
> Depending on the performance impact, XFEATURE_LBR might even need to be dynamically allocated.
This is most likely true for fpu_guest_cfg instead of fpu_kernel_cfg; let me think it over. Thanks for bringing up this brilliant idea :-)
> For the reference this is the patch series that introduced the arch LBRs to KVM:
> https://www.spinics.net/lists/kvm/msg296507.html
>
>
> Best regards,
> Maxim Levitsky
>
>> commit f0dccc9da4c0fda049e99326f85db8c242fd781f
>> Author: Kan Liang <[email protected]>
>> Date: Fri Jul 3 05:49:26 2020 -0700
>>
>> x86/fpu/xstate: Support dynamic supervisor feature for LBR
>>
>> "However, the kernel should not save/restore the LBR state component at
>> each context switch, like other state components, because of the
>> following unique features of LBR:
>> - The LBR state component only contains valuable information when LBR
>> is enabled in the perf subsystem, but for most of the time, LBR is
>> disabled.
>> - The size of the LBR state component is huge. For the current
>> platform, it's 808 bytes.
>> If the kernel saves/restores the LBR state at each context switch, for
>> most of the time, it is just a waste of space and cycles."
>>
>>> For the guest only features, like CET_S, it is also kind of the same thing (xsave but to guest state area only).
>>> I don't insist that we add CET_S to independent features, but I just gave an idea that maybe that is better
>>> from complexity point of view to add CET there. It's up to you to decide.
>>>
>>> Sean what do you think?
>>>
>>> Best regards,
>>> Maxim Levitsky
>>>
>>>
>>>> But CET only has 3 supervisor MSRs and they need to be managed together with user mode MSRs.
>>>> Enabling it in common FPU framework would make the switch/swap much easier without additional
>>>> support code.
>>>>
>>>>> and maybe rename the
>>>>> 'XFEATURE_MASK_INDEPENDENT' to something like 'XFEATURES_THE_KERNEL_DOESNT_CARE_ABOUT'
>>>>> (terrible name, but you might think of a better name)
>>>>>
>>>>>
>>>>>> +#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
>>>>>> +
>>>>>> /* All currently supported supervisor features */
>>>>>> #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
>>>>>> XFEATURE_MASK_CET_USER | \
>>>>>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>>>>>> index b57d909facca..ba4172172afd 100644
>>>>>> --- a/arch/x86/kernel/fpu/xstate.c
>>>>>> +++ b/arch/x86/kernel/fpu/xstate.c
>>>>>> @@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>>>>>> /* Clean out dynamic features from default */
>>>>>> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
>>>>>> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>>>>>> + fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
>>>>>>
>>>>>> fpu_user_cfg.default_features = fpu_user_cfg.max_features;
>>>>>> fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>>>>> Best regards,
>>>>> Maxim Levitsky
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>
On 12/8/2023 11:22 PM, Maxim Levitsky wrote:
> On Fri, 2023-12-08 at 23:15 +0800, Yang, Weijiang wrote:
>> On 12/7/2023 1:24 AM, Maxim Levitsky wrote:
>>> On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
>>>> On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
>>>>> On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
>>>> [...]
>>>>
>>>>>>>> vmx->nested.force_msr_bitmap_recalc = false;
>>>>>>>> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>>>>>>>> if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>>>>>>>> (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>>>>>>>> vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>>>>>>>> +
>>>>>>>> + if (vmx->nested.nested_run_pending &&
>>>>>>> I don't think that nested.nested_run_pending check is needed.
>>>>>>> prepare_vmcs02_rare is not going to be called unless the nested run is pending.
>>>>>> But there are other paths that call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
>>>>>> and we don't need to update vmcs02's fields at this point until L2 is resumed.
>>>>> - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
>>>>> because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
>>>>>
>>>>> - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
>>>>> but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
>>>>> from the migration stream.
>>>>>
>>>>> - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
>>>>> but we still need to setup vmcs02, otherwise it will be left with default zero values.
>>>> Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
>>>>> Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
>>>>>
>>>>>>>> + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
>>>>>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>>>>>> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>>>>>>>> + vmcs_writel(GUEST_INTR_SSP_TABLE,
>>>>>>>> + vmcs12->guest_ssp_tbl);
>>>>>>>> + }
>>>>>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>>>>>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>>>>>>>> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>>>>>>>> + }
>>>>>>>> }
>>>>>>>>
>>>>>>>> if (nested_cpu_has_xsaves(vmcs12))
>>>>>>>> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>>>>>>>> vmcs12->guest_pending_dbg_exceptions =
>>>>>>>> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>>>>>>>
>>>>>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>>>>>> + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>>>>>>>> + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>>>>>>>> + }
>>>>>>>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>>>>>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>>>>>>>> + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>>>>>>>> + }
>>>>>>> The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
>>>>>>> was loaded, then it must be updated on exit - this is usually how VMX works.
>>>>>> I think this is not about L2's VM_ENTRY_LOAD_CET_STATE; that is handled in prepare_vmcs02_rare(). IIUC, the guest registers are saved into the VMCS fields unconditionally when a VM-exit happens,
>>>>>> so these fields should be synced to L1 unconditionally for the L2 guest.
>>>>> "the guest registers will be saved into VMCS fields unconditionally"
>>>>> This is not true, unless there is a bug.
>>>> I checked the latest SDM; there's no such wording regarding the CET entry/exit control bits. The wording comes from
>>>> the standalone CET spec:
>>>> "10.6 VM Exit
>>>> On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
>>>> But since it doesn't appear in SDM, I shouldn't take it for granted.
>>> SDM spec from September 2023:
>>>
>>> 28.3.1 Saving Control Registers, Debug Registers, and MSRs
>>>
>>> "If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
>>> IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
>>> support Intel 64 architecture, bits 63:32 of these MSRs are not saved."
>>>
>>> Honestly it's not 100% clear if the “load CET” should be set to 1 to trigger the restore, or that this control just needs to be
>>> supported on the CPU.
>>> It does feel like you are right here, that CPU always saves the guest state, but allows to not load it on VM entry via
>>> “load CET” VM entry control.
>>>
>>> IMHO it's best to check what bare metal does by rigging a test: patch the host kernel to not set the 'load CET' control,
>>> and see whether the CPU still updates the guest CET fields on VM exit.
>> OK, I'll do some tests to see what's happening, thanks!
>>>>> the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
>>>>> the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
>>>>> this state manually or use msr load/store lists instead.
>>>> Right; although the use case should be rare, I will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
>>>>> Regardless of this,
>>>>> if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
>>>>> (that is copied back to vmcs12) this is what is written in the VMX spec.
>>>> What's the VMX spec you're referring to here?
>>> SDM.
>>>
>>> In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the
>>> VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.
> I noticed a typo. I meant that the CPU shouldn't care about their values in the vmcs02.
>
>>> And about the other way around, assuming that I made a mistake as I said above, then the other way around is indeed unconditional.
>>>
>>>
>>> Sorry for a bit of a confusion.
>> NP, I'll also double-check it with the HW architects and get back.
>> Thanks for raising these questions!
I got a reply from the HW architects; the guest CET state is saved unconditionally:
"On the state save side, uCode doesn’t check for an exit control (or the load CET VM-entry control), but rather since it supports (as of TGL/SPR) CET,
it unconditionally saves the state to the VMCS guest-state area. "
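Given that confirmation, here is a hedged sketch of how the two nested paths could end up looking in v8 (field and helper names as used in this series; per the discussion above, the vmcs12->vmcs02 copy could also be done unconditionally since the CPU ignores the fields when the "load CET" control is clear):

	/*
	 * prepare_vmcs02_rare(): load L2's CET state into vmcs02 only when L1 set
	 * the "load CET state" VM-entry control; the nested_run_pending check is
	 * dropped per the earlier discussion.
	 */
	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
			vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
			vmcs_writel(GUEST_INTR_SSP_TABLE, vmcs12->guest_ssp_tbl);
		}
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
		    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
			vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
	}

	/*
	 * sync_vmcs02_to_vmcs12_rare(): the CPU saves guest CET state on every
	 * VM-exit regardless of the entry control, so reflect it back to vmcs12
	 * unconditionally.
	 */
	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
	}
	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);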
> Thanks to you too!
>
>
> Best regards,
> Maxim Levitsky
>
>>> Best regards,
>>> Maxim Levitsky
>>>
>>>
>
On Tue, 2023-12-12 at 16:56 +0800, Yang, Weijiang wrote:
>
> [...]
>
> I got a reply from the HW architects; the guest CET state is saved unconditionally:
>
> "On the state save side, uCode doesn’t check for an exit control (or the load CET VM-entry control), but rather since it supports (as of TGL/SPR) CET,
> it unconditionally saves the state to the VMCS guest-state area. "
Great!
Best regards,
Maxim Levitsky
Hi Sean,
Do you have additional comments on this version?
I'll incorporate Maxim's and Chao's feedback in v8.
And what's your plan for this series? I'd like to get your direction on the next step.
I appreciate your persistent attention to this series over the past years!
On 11/24/2023 1:53 PM, Yang Weijiang wrote:
> Control-flow Enforcement Technology (CET) is a kind of CPU feature used
> to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP) attacks.
> It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
> style control-flow subversion attacks.
>
> Shadow Stack (SHSTK):
> A shadow stack is a second stack used exclusively for control transfer
> operations. The shadow stack is separate from the data/normal stack and
> can be enabled individually in user and kernel mode. When shadow stack
> is enabled, CALL pushes the return address on both the data and shadow
> stack. RET pops the return address from both stacks and compares them.
> If the return addresses from the two stacks do not match, the processor
> generates a #CP.
>
> Indirect Branch Tracking (IBT):
> IBT introduces new instruction(ENDBRANCH)to mark valid target addresses of
> indirect branches (CALL, JMP etc...). If an indirect branch is executed
> and the next instruction is _not_ an ENDBRANCH, the processor generates a
> #CP. These instruction behaves as a NOP on platforms that doesn't support
> CET.
>
> Dependency:
> --------------------------------------------------------------------------
> CET native series for user mode shadow stack has already been merged in v6.6
> mainline kernel.
>
> The first 7 kernel patches are prerequisites for this KVM patch series since
> guest CET user mode and supervisor mode states depends on kernel FPU framework
> to properly save/restore the states whenever FPU context switch is required,
> e.g., after VM-Exit and before vCPU thread exits to userspace.
>
> In this series, guest supervisor SHSTK mitigation solution isn't introduced
> for Intel platform therefore guest SSS_CET bit of CPUID(0x7,1):EDX[bit18] is
> cleared. Check SDM (Vol 1, Section 17.2.3) for details.
>
> CET states management:
> --------------------------------------------------------------------------
> KVM cooperates with host kernel FPU framework to manage guest CET registers.
> With CET supervisor mode state support in this series, KVM can save/restore
> full guest CET xsave-managed states.
>
> CET user mode and supervisor mode xstates, i.e., MSR_IA32_{U_CET,PL3_SSP}
> and MSR_IA32_PL{0,1,2}, depend on host FPU framework to swap guest and host
> xstates. On VM-Exit, guest CET xstates are saved to guest fpu area and host
> CET xstates are loaded from task/thread context before vCPU returns to
> userspace, vice-versa on VM-Entry. See details in kvm_{load,put}_guest_fpu().
> So guest CET xstates management depends on CET xstate bits(U_CET/S_CET bit)
> set in host XSS MSR.
>
> CET supervisor mode states are grouped into two categories : XSAVE-managed
> and non-XSAVE-managed, the former includes MSR_IA32_PL{0,1,2}_SSP and are
> controlled by CET supervisor mode bit(S_CET bit) in XSS, the later consists
> of MSR_IA32_S_CET and MSR_IA32_INTR_SSP_TBL.
>
> VMX introduces new VMCS fields, {GUEST|HOST}_{S_CET,SSP,INTR_SSP_TABL}, to
> facilitate guest/host non-XSAVES-managed states. When VMX CET entry/exit load
> bits are set, guest/host MSR_IA32_{S_CET,INTR_SSP_TBL,SSP} are loaded from
> equivalent fields at VM-Exit/Entry. With these new fields, such supervisor
> states require no addtional KVM save/reload actions.
>
> Tests:
> --------------------------------------------------------------------------
> This series passed basic CET user shadow stack test and kernel IBT test in L1
> and L2 guest.
> The patch series _has_ impact to existing vmx test cases in KVM-unit-tests,the
> failures have been fixed here [1].
> One new selftest app [2] is introduced for testing CET MSRs accessibilities.
>
> Note, this series hasn't been tested on AMD platform yet.
>
> To run user SHSTK test and kernel IBT test in guest, an CET capable platform
> is required, e.g., a Sapphire Rapids server; follow the steps below to build
> the binaries:
>
> 1. Host kernel: Apply this series to mainline kernel (>= v6.6) and build.
>
> 2. Guest kernel: Pull a kernel (>= v6.6), opt in to the CONFIG_X86_KERNEL_IBT
> and CONFIG_X86_USER_SHADOW_STACK options. Build with a CET-enabled gcc
> (>= 8.5.0).
>
> 3. Apply the CET QEMU patches [3] before building mainline QEMU.
>
> Check kernel selftest test_shadow_stack_64 output:
>
> [INFO] new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
> [INFO] changing ssp from 7f8c82900ff0 to 7f8c82100ff8
> [INFO] ssp is now 7f8c82101000
> [OK] Shadow stack pivot
> [OK] Shadow stack faults
> [INFO] Corrupting shadow stack
> [INFO] Generated shadow stack violation successfully
> [OK] Shadow stack violation test
> [INFO] Gup read -> shstk access success
> [INFO] Gup write -> shstk access success
> [INFO] Violation from normal write
> [INFO] Gup read -> write access success
> [INFO] Violation from normal write
> [INFO] Gup write -> write access success
> [INFO] Cow gup write -> write access success
> [OK] Shadow gup test
> [INFO] Violation from shstk access
> [OK] mprotect() test
> [SKIP] Userfaultfd unavailable.
> [OK] 32 bit test
>
>
> Check kernel IBT with dmesg | grep CET:
>
> CET detected: Indirect Branch Tracking enabled
>
> --------------------------------------------------------------------------
> Changes in v7:
> 1. Introduced guest dedicated config for guest related xstate fixup. [Sean, Maxim]
> 2. Refined CET supervisor state handling for guest fpstate. [Dave]
> 3. Enclosed Sean's fixup patch for kernel xstate issue. [Sean]
> 4. Refined CET MSR read/write handling flow. [Sean, Maxim]
> 5. Added CET VMCS fields sync between vmcs12 and vmcs02. [Chao, Maxim]
> 6. Added reset handling for CET xstate-managed MSRs.
> 7. Other minor changes due to community review feedback. [Sean, Maxim, Chao]
> 8. Rebased to: https://github.com/kvm-x86/linux tag: kvm-x86-next-2023.11.01
>
>
> [1]: KVM-unit-tests fixup:
> https://lore.kernel.org/all/[email protected]/
> [2]: Selftest for CET MSRs:
> https://lore.kernel.org/all/[email protected]/
> [3]: QEMU patch:
> https://lore.kernel.org/all/[email protected]/
> [4]: v6 patchset:
> https://lore.kernel.org/all/[email protected]/
>
> Patch 1-7: Fixup patches for kernel xstate and enable CET supervisor xstate.
> Patch 8-11: Cleanup patches for KVM.
> Patch 12-15: Enable KVM XSS MSR support.
> Patch 16: Fault check for CR4.CET setting.
> Patch 17: Report CET MSRs to userspace.
> Patch 18: Introduce CET VMCS fields.
> Patch 19: Add SHSTK/IBT to KVM-governed framework.(to be deprecated)
> Patch 20: Emulate CET MSR access.
> Patch 21: Handle SSP at entry/exit to SMM.
> Patch 22: Set up CET MSR interception.
> Patch 23: Initialize host constant supervisor state.
> Patch 24: Add CET virtualization settings.
> Patch 25-26: Add CET nested support.
>
>
> Sean Christopherson (4):
> x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> __state_perm
> KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
> KVM: x86: Report XSS as to-be-saved if there are supported features
> KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
>
> Yang Weijiang (22):
> x86/fpu/xstate: Refine CET user xstate bit enabling
> x86/fpu/xstate: Add CET supervisor mode state support
> x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
> x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
> x86/fpu/xstate: Create guest fpstate with guest specific config
> x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal fpstate
> KVM: x86: Rename kvm_{g,s}et_msr() to menifest emulation operations
> KVM: x86: Refine xsave-managed guest register/MSR reset handling
> KVM: x86: Add kvm_msr_{read,write}() helpers
> KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
> KVM: x86: Initialize kvm_caps.supported_xss
> KVM: x86: Add fault checks for guest CR4.CET setting
> KVM: x86: Report KVM supported CET MSRs as to-be-saved
> KVM: VMX: Introduce CET VMCS fields and control bits
> KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
> KVM: VMX: Emulate read and write to CET MSRs
> KVM: x86: Save and reload SSP to/from SMRAM
> KVM: VMX: Set up interception for CET MSRs
> KVM: VMX: Set host constant supervisor states to VMCS fields
> KVM: x86: Enable CET virtualization for VMX and advertise to userspace
> KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
> KVM: nVMX: Enable CET support for nested guest
>
> arch/x86/include/asm/fpu/types.h | 16 +-
> arch/x86/include/asm/fpu/xstate.h | 11 +-
> arch/x86/include/asm/kvm_host.h | 13 +-
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/include/asm/vmx.h | 8 +
> arch/x86/include/uapi/asm/kvm_para.h | 1 +
> arch/x86/kernel/fpu/core.c | 62 +++++--
> arch/x86/kernel/fpu/xstate.c | 46 +++--
> arch/x86/kernel/fpu/xstate.h | 4 +
> arch/x86/kvm/cpuid.c | 69 +++++---
> arch/x86/kvm/governed_features.h | 2 +
> arch/x86/kvm/smm.c | 12 +-
> arch/x86/kvm/smm.h | 2 +-
> arch/x86/kvm/vmx/capabilities.h | 10 ++
> arch/x86/kvm/vmx/nested.c | 88 ++++++++--
> arch/x86/kvm/vmx/nested.h | 5 +
> arch/x86/kvm/vmx/vmcs12.c | 6 +
> arch/x86/kvm/vmx/vmcs12.h | 14 +-
> arch/x86/kvm/vmx/vmx.c | 110 +++++++++++-
> arch/x86/kvm/vmx/vmx.h | 6 +-
> arch/x86/kvm/x86.c | 254 +++++++++++++++++++++++++--
> arch/x86/kvm/x86.h | 28 +++
> 22 files changed, 669 insertions(+), 99 deletions(-)
>