Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Jump-Oriented Programming (ROP/JOP) attacks. CET introduces a new
exception type, Control Protection (#CP), and two sub-features(SHSTK,IBT)
to defend against ROP/JOP style control-flow subversion attacks.
Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.
Indirect Branch Tracking (IBT):
IBT adds a new instruction, ENDBRANCH, to mark valid target addresses of
indirect branches (CALL, JMP etc...). If an indirect branch is executed
and the next instruction is _not_ an ENDBRANCH, the processor generates a
#CP. These instruction behaves as a NOP on platforms that doesn't support
CET.
Dependency:
--------------------------------------------------------------------------
The first 5 patches are taken over from CET native series [1] in linux-next.
They're prerequisites for enabling guest user mode SHSTK. Patch this full
series before build host kernel for guest CET testing. Also apply CET enabling
patches in [2] to build qualified QEMU. These kernel dependent patches will
be enclosed in KVM series until CET native series is merged in mainline tree.
Implementation:
--------------------------------------------------------------------------
Historically, the early KVM patches can support both user SHSTK and IBT,
and most of the early patches are carried forward with changes in this new
series. And with kernel IBT feature merged in 5.18, a new patch was added
to support the feature in guest. The last patch is introduced to support
supervisor SHSTK but the feature is not enabled on Intel platform for now,
the main purpose of this patch is to facilitate AMD folks to enable the
feature.
In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
doesn't fully support CET supervisor SHSTK, the enabling work is left for
the future.
Supported CET sub-features:
|
User SHSTK | User IBT (user mode)
--------------------------------------------------
s-SHSTK (X) | Kernel IBT (kernel mode)
|
Guest user mode SHSTK/IBT relies on host side XSAVES support(XSS[bit 11])
to swap CET states. Guest kernel IBT doesn't have dependency on host XSAVES.
The supervisor SHSTK relies on host side XSAVES support(XSS[bit 12]) for
supervisor mode CET states save/restore.
This version removed unnecessary checks of host CET enabling status before
expose CET features to guest, making guest CET enabling apart from host.
By doing so, it's expected to be more friendly to cloud computing scenarios.
CET states management:
--------------------------------------------------------------------------
CET user mode states, MSR_IA32_{U_CET,PL3_SSP} depends on {XSAVES,XRSTORS}
instructions to swap guest/host context when vm-exit/vm-entry happens.
On vm-exit, the guest CET states are stored to guest fpu area and host user
mode states are loaded from thread/process context before vCPU returns to
userspace, vice-versa on vm-entry. See details in kvm_{load|put}_guest_fpu().
So the user mode state validity depends on host side U_CET bit set in MSR_XSS.
CET supervisor mode states are grouped into two categories - XSAVES dependent
and non-dependent, the former includes MSR_IA32_PL{0,1,2}_SSP, the later
consists of MSR_IA32_S_CET and MSR_IA32_INTR_SSP_TBL. The XSAVES dependent
MSR's save/restore depends on S_CET bit set in MSR_XSS. Since native series
doesn't enable S_CET support, these s-SHSTK shadow stack pointers are invalid.
New VMCS fields, {GUEST|HOST}_{S_CET,SSP,INTR_SSP_TABL}, are introduced for
guest/host non-XSAVES managed states switch. When CET entry/exit load bits are
set, guest/host MSR_IA32_{S_CET,INTR_SSP_TBL,SSP} are loaded from these fields
at vm-exit/entry. With these new fields, current guest kernel IBT enabling
doesn't depend on S_CET bit in XSS, i.e., host {XSAVES|XRSTORS} support.
Tests:
--------------------------------------------------------------------------
This series passed basic CET user shadow stack test and kernel IBT test in
L1 and L2 guest. It also works with CET KVM-unit-test application.
Executed all KVM-unit-test cases and KVM selftests against this series, all
test cases passed except the vmx test, the failure is due to CR4_CET bit
testing in test_vmxon_bad_cr(). After add CR4_CET bit to skip list, the test
passed. I'll send a patch to fix this issue later.
To run user shadow stack test and kernel IBT test in VM, you need an CET
capable platform, e.g., Sapphire Rapids server, and follow below steps to
build host/guest kernel properly:
1. Build host kernel. Patch this series to kernel tree and build kernel.
2. Build guest kernel. Patch CET native series to kernel tree and opt-in
CONFIG_X86_KERNEL_IBT and CONFIG_X86_USER_SHADOW_STACK options. Build with
CET enabled gcc versions(>= 8.5.0).
3. Use patched QEMU to launch a VM.
Check kernel selftest test_shadow_stack_64 output:
[INFO] new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO] changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO] ssp is now 7f8c82101000
[OK] Shadow stack pivot
[OK] Shadow stack faults
[INFO] Corrupting shadow stack
[INFO] Generated shadow stack violation successfully
[OK] Shadow stack violation test
[INFO] Gup read -> shstk access success
[INFO] Gup write -> shstk access success
[INFO] Violation from normal write
[INFO] Gup read -> write access success
[INFO] Violation from normal write
[INFO] Gup write -> write access success
[INFO] Cow gup write -> write access success
[OK] Shadow gup test
[INFO] Violation from shstk access
[OK] mprotect() test
[SKIP] Userfaultfd unavailable.
[OK] 32 bit test
Check kernel IBT with dmesg | grep CET:
CET detected: Indirect Branch Tracking enabled
--------------------------------------------------------------------------
Changes in v3:
1. Moved MSR access check helper to x86 common file. [Mike]
2. Modified cover letter, commit logs and code per review comments. [PeterZ, Binbin, Rick]
3. Fixed an issue on host MSR_IA32_S_CET reload at vm-exit.
5. Rebase on kvm-x86/next [4].
[1]: linux-next: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/?h=next-20230420
[2]: QEMU patch: https://lore.kernel.org/all/[email protected]/
[3]: v2 patchset: https://lore.kernel.org/all/[email protected]/
[4]: Rebase branch: https://github.com/kvm-x86/linux.git, commit: 5c291b93e5d6 (tag: kvm-x86-next-2023.04.26)
Rick Edgecombe (5):
x86/shstk: Add Kconfig option for shadow stack
x86/cpufeatures: Add CPU feature flags for shadow stacks
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
x86/fpu: Add helper for modifying xstate
Sean Christopherson (2):
KVM:x86: Report XSS as to-be-saved if there are supported features
KVM:x86: Load guest FPU state when accessing xsaves-managed MSRs
Yang Weijiang (14):
KVM:x86: Refresh CPUID on write to guest MSR_IA32_XSS
KVM:x86: Init kvm_caps.supported_xss with supported feature bits
KVM:x86: Add #CP support in guest exception classification
KVM:VMX: Introduce CET VMCS fields and control bits
KVM:x86: Add fault checks for guest CR4.CET setting
KVM:VMX: Emulate reads and writes to CET MSRs
KVM:VMX: Add a synthetic MSR to allow userspace to access GUEST_SSP
KVM:x86: Report CET MSRs as to-be-saved if CET is supported
KVM:x86: Save/Restore GUEST_SSP to/from SMM state save area
KVM:VMX: Pass through user CET MSRs to the guest
KVM:x86: Enable CET virtualization for VMX and advertise to userspace
KVM:nVMX: Enable user CET support for nested VMX
KVM:x86: Enable kernel IBT support for guest
KVM:x86: Support CET supervisor shadow stack MSR access
arch/x86/Kconfig | 24 +++++
arch/x86/Kconfig.assembler | 5 +
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/fpu/api.h | 9 ++
arch/x86/include/asm/fpu/types.h | 16 ++-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/kvm_host.h | 3 +-
arch/x86/include/asm/vmx.h | 8 ++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/cpu/common.c | 35 +++++--
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/fpu/core.c | 19 ++++
arch/x86/kernel/fpu/xstate.c | 90 ++++++++--------
arch/x86/kvm/cpuid.c | 19 +++-
arch/x86/kvm/cpuid.h | 6 ++
arch/x86/kvm/smm.c | 20 ++++
arch/x86/kvm/vmx/capabilities.h | 4 +
arch/x86/kvm/vmx/nested.c | 29 +++++-
arch/x86/kvm/vmx/vmcs12.c | 6 ++
arch/x86/kvm/vmx/vmcs12.h | 14 ++-
arch/x86/kvm/vmx/vmx.c | 124 ++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +-
arch/x86/kvm/x86.c | 122 ++++++++++++++++++++--
arch/x86/kvm/x86.h | 47 ++++++++-
26 files changed, 543 insertions(+), 82 deletions(-)
base-commit: 5c291b93e5d665380dbecc6944973583f9565ee5
--
2.27.0
Introduce a host-only synthetic MSR, MSR_KVM_GUEST_SSP, so that the VMM
can read/write the guest's SSP, e.g. to migrate CET state. Use a synthetic
MSR, e.g. as opposed to a VCPU_REG_, as GUEST_SSP is subject to the same
consistency checks as the PL*_SSP MSRs, i.e. can share code.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++---
arch/x86/kvm/x86.c | 4 ++++
3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..7af465e4e0bd 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -58,6 +58,7 @@
#define MSR_KVM_ASYNC_PF_INT 0x4b564d06
#define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
#define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
+#define MSR_KVM_GUEST_SSP 0x4b564d09
struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0ccaa467d7d3..72149156bbd3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2095,9 +2095,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_U_CET:
case MSR_IA32_PL3_SSP:
+ case MSR_KVM_GUEST_SSP:
if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
return 1;
- kvm_get_xsave_msr(msr_info);
+ if (msr_info->index == MSR_KVM_GUEST_SSP)
+ msr_info->data = vmcs_readl(GUEST_SSP);
+ else
+ kvm_get_xsave_msr(msr_info);
break;
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
@@ -2413,15 +2417,20 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_U_CET:
case MSR_IA32_PL3_SSP:
+ case MSR_KVM_GUEST_SSP:
if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
return 1;
if (is_noncanonical_address(data, vcpu))
return 1;
if (msr_index == MSR_IA32_U_CET && (data & GENMASK(9, 6)))
return 1;
- if (msr_index == MSR_IA32_PL3_SSP && (data & GENMASK(2, 0)))
+ if ((msr_index == MSR_IA32_PL3_SSP ||
+ msr_index == MSR_KVM_GUEST_SSP) && (data & GENMASK(2, 0)))
return 1;
- kvm_set_xsave_msr(msr_info);
+ if (msr_index == MSR_KVM_GUEST_SSP)
+ vmcs_writel(GUEST_SSP, data);
+ else
+ kvm_set_xsave_msr(msr_info);
break;
case MSR_IA32_PERF_CAPABILITIES:
if (data && !vcpu_to_pmu(vcpu)->version)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e3a39c9297c..baac6acebd40 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13642,6 +13642,10 @@ bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr)
!guest_cpuid_has(vcpu, X86_FEATURE_IBT))
return false;
+ /* The synthetic MSR is for userspace access only. */
+ if (msr->index == MSR_KVM_GUEST_SSP)
+ return false;
+
if (msr->index == MSR_IA32_PL3_SSP &&
!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK))
return false;
--
2.27.0
Add handling for Control Protection (#CP) exceptions(vector 21).
The new vector is introduced for Intel's Control-Flow Enforcement
Technology (CET) relevant violation cases.
Although #CP belongs contributory exception class, but the actual
effect is conditional on CET being exposed to guest. If CET is not
available to guest, #CP falls back to non-contributory and doesn't
have an error code. The rational is used to fix one unit test failure
encountered in L1. Although the issue now is fixed in unit test case,
keep the handling is reasonable. cr4_guest_rsvd_bits is used to avoid
guest_cpuid_has() lookups.
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/vmx/nested.c | 2 +-
arch/x86/kvm/x86.c | 10 +++++++---
arch/x86/kvm/x86.h | 13 ++++++++++---
4 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 7f467fe05d42..1c002abe2be8 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -33,6 +33,7 @@
#define MC_VECTOR 18
#define XM_VECTOR 19
#define VE_VECTOR 20
+#define CP_VECTOR 21
/* Select x86 specific features in <linux/kvm.h> */
#define __KVM_HAVE_PIT
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 96ede74a6067..7bc62cd72748 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2850,7 +2850,7 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
/* VM-entry interruption-info field: deliver error code */
should_have_error_code =
intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
- x86_exception_has_error_code(vector);
+ x86_exception_has_error_code(vcpu, vector);
if (CC(has_error_code != should_have_error_code))
return -EINVAL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7788646bbf1f..a768cbf3fbb7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -520,11 +520,15 @@ EXPORT_SYMBOL_GPL(kvm_spurious_fault);
#define EXCPT_CONTRIBUTORY 1
#define EXCPT_PF 2
-static int exception_class(int vector)
+static int exception_class(struct kvm_vcpu *vcpu, int vector)
{
switch (vector) {
case PF_VECTOR:
return EXCPT_PF;
+ case CP_VECTOR:
+ if (vcpu->arch.cr4_guest_rsvd_bits & X86_CR4_CET)
+ return EXCPT_BENIGN;
+ return EXCPT_CONTRIBUTORY;
case DE_VECTOR:
case TS_VECTOR:
case NP_VECTOR:
@@ -707,8 +711,8 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
return;
}
- class1 = exception_class(prev_nr);
- class2 = exception_class(nr);
+ class1 = exception_class(vcpu, prev_nr);
+ class2 = exception_class(vcpu, nr);
if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY) ||
(class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
/*
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..2ba7c7fc4846 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -171,13 +171,20 @@ static inline bool is_64_bit_hypercall(struct kvm_vcpu *vcpu)
return vcpu->arch.guest_state_protected || is_64_bit_mode(vcpu);
}
-static inline bool x86_exception_has_error_code(unsigned int vector)
+static inline bool x86_exception_has_error_code(struct kvm_vcpu *vcpu,
+ unsigned int vector)
{
static u32 exception_has_error_code = BIT(DF_VECTOR) | BIT(TS_VECTOR) |
BIT(NP_VECTOR) | BIT(SS_VECTOR) | BIT(GP_VECTOR) |
- BIT(PF_VECTOR) | BIT(AC_VECTOR);
+ BIT(PF_VECTOR) | BIT(AC_VECTOR) | BIT(CP_VECTOR);
- return (1U << vector) & exception_has_error_code;
+ if (!((1U << vector) & exception_has_error_code))
+ return false;
+
+ if (vector == CP_VECTOR)
+ return !(vcpu->arch.cr4_guest_rsvd_bits & X86_CR4_CET);
+
+ return true;
}
static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
--
2.27.0
From: Rick Edgecombe <[email protected]>
Just like user xfeatures, supervisor xfeatures can be active in the
registers or present in the task FPU buffer. If the registers are
active, the registers can be modified directly. If the registers are
not active, the modification must be performed on the task FPU buffer.
When the state is not active, the kernel could perform modifications
directly to the buffer. But in order for it to do that, it needs
to know where in the buffer the specific state it wants to modify is
located. Doing this is not robust against optimizations that compact
the FPU buffer, as each access would require computing where in the
buffer it is.
The easiest way to modify supervisor xfeature data is to force restore
the registers and write directly to the MSRs. Often times this is just fine
anyway as the registers need to be restored before returning to userspace.
Do this for now, leaving buffer writing optimizations for the future.
Add a new function fpregs_lock_and_load() that can simultaneously call
fpregs_lock() and do this restore. Also perform some extra sanity
checks in this function since this will be used in non-fpu focused code.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Tested-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/all/20230319001535.23210-7-rick.p.edgecombe%40intel.com
---
arch/x86/include/asm/fpu/api.h | 9 +++++++++
arch/x86/kernel/fpu/core.c | 18 ++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 503a577814b2..aadc6893dcaa 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -82,6 +82,15 @@ static inline void fpregs_unlock(void)
preempt_enable();
}
+/*
+ * FPU state gets lazily restored before returning to userspace. So when in the
+ * kernel, the valid FPU state may be kept in the buffer. This function will force
+ * restore all the fpu state to the registers early if needed, and lock them from
+ * being automatically saved/restored. Then FPU state can be modified safely in the
+ * registers, before unlocking with fpregs_unlock().
+ */
+void fpregs_lock_and_load(void);
+
#ifdef CONFIG_X86_DEBUG_FPU
extern void fpregs_assert_state_consistent(void);
#else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..f851558b673f 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -753,6 +753,24 @@ void switch_fpu_return(void)
}
EXPORT_SYMBOL_GPL(switch_fpu_return);
+void fpregs_lock_and_load(void)
+{
+ /*
+ * fpregs_lock() only disables preemption (mostly). So modifying state
+ * in an interrupt could screw up some in progress fpregs operation.
+ * Warn about it.
+ */
+ WARN_ON_ONCE(!irq_fpu_usable());
+ WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+ fpregs_lock();
+
+ fpregs_assert_state_consistent();
+
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ fpregs_restore_userregs();
+}
+
#ifdef CONFIG_X86_DEBUG_FPU
/*
* If current FPU state according to its tracking (loaded FPU context on this
--
2.27.0
Add MSR access interfaces for supervisor shadow stack, i.e.,
MSR_IA32_PL{0,1,2} and MSR_IA32_INT_SSP_TAB, meanwhile pass through
them to {L1,L2} guests when {L0,L1} KVM supports supervisor shadow
stack.
Note, currently supervisor shadow stack is not supported on Intel
platforms, i.e., VMX always clears CPUID(EAX=07H,ECX=1):EDX[bit 18].
The main purpose of this patch is to facilitate AMD folks to enable
supervisor shadow stack for their platforms.
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/cpuid.h | 6 ++++++
arch/x86/kvm/vmx/nested.c | 12 ++++++++++++
arch/x86/kvm/vmx/vmx.c | 25 ++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 21 ++++++++++++++++++---
4 files changed, 60 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index b1658c0de847..019a16b25b88 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -232,4 +232,10 @@ static __always_inline bool guest_pv_has(struct kvm_vcpu *vcpu,
return vcpu->arch.pv_cpuid.features & (1u << kvm_feature);
}
+static __always_inline bool kvm_cet_kernel_shstk_supported(void)
+{
+ return !IS_ENABLED(CONFIG_KVM_INTEL) &&
+ kvm_cpu_cap_has(X86_FEATURE_SHSTK);
+}
+
#endif
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index bf690827bfee..aaaae92dc9f6 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -670,6 +670,18 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_PL3_SSP, MSR_TYPE_RW);
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL0_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL1_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL2_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
+
kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
vmx->nested.force_msr_bitmap_recalc = false;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1d0151f9e575..d70f2e94b187 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -713,6 +713,9 @@ static bool is_valid_passthrough_msr(u32 msr)
case MSR_IA32_PL3_SSP:
case MSR_IA32_S_CET:
return true;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL2_SSP:
+ case MSR_IA32_INT_SSP_TAB:
+ return true;
}
r = possible_passthrough_msr_slot(msr) != -ENOENT;
@@ -2101,12 +2104,16 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_S_CET:
case MSR_IA32_PL3_SSP:
case MSR_KVM_GUEST_SSP:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL2_SSP:
+ case MSR_IA32_INT_SSP_TAB:
if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
return 1;
if (msr_info->index == MSR_KVM_GUEST_SSP) {
msr_info->data = vmcs_readl(GUEST_SSP);
} else if (msr_info->index == MSR_IA32_S_CET) {
msr_info->data = vmcs_readl(GUEST_S_CET);
+ } else if (msr_info->index == MSR_IA32_INT_SSP_TAB) {
+ msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
} else {
kvm_get_xsave_msr(msr_info);
}
@@ -2427,6 +2434,8 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_S_CET:
case MSR_IA32_PL3_SSP:
case MSR_KVM_GUEST_SSP:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL2_SSP:
+ case MSR_IA32_INT_SSP_TAB:
if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
return 1;
if (is_noncanonical_address(data, vcpu))
@@ -2440,6 +2449,8 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmcs_writel(GUEST_SSP, data);
} else if (msr_index == MSR_IA32_S_CET) {
vmcs_writel(GUEST_S_CET, data);
+ } else if (msr_index == MSR_IA32_INT_SSP_TAB) {
+ vmcs_writel(GUEST_INTR_SSP_TABLE, data);
} else {
kvm_set_xsave_msr(msr_info);
}
@@ -7764,6 +7775,17 @@ static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
*/
incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET, MSR_TYPE_RW, incpt);
+
+ /*
+ * Supervisor shadow stack is not supported in VMX now, intercept all
+ * related MSRs.
+ */
+ incpt = !kvm_cet_kernel_shstk_supported();
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP, MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP, MSR_TYPE_RW, incpt);
}
static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
@@ -7834,7 +7856,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
- if (kvm_cet_user_supported() || kvm_cpu_cap_has(X86_FEATURE_IBT))
+ if (kvm_cet_user_supported() || kvm_cpu_cap_has(X86_FEATURE_IBT) ||
+ kvm_cpu_cap_has(X86_FEATURE_SHSTK))
vmx_update_intercept_for_cet_msr(vcpu);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b450361b94ef..a9ab01293420 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1472,6 +1472,8 @@ static const u32 msrs_to_save_base[] = {
MSR_IA32_XSS,
MSR_IA32_U_CET, MSR_IA32_PL3_SSP, MSR_KVM_GUEST_SSP,
MSR_IA32_S_CET,
+ MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP,
+ MSR_IA32_INT_SSP_TAB,
};
static const u32 msrs_to_save_pmu[] = {
@@ -13653,8 +13655,11 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
+ u64 mask;
+
if (!kvm_cet_user_supported() &&
- !kvm_cpu_cap_has(X86_FEATURE_IBT))
+ !(kvm_cpu_cap_has(X86_FEATURE_IBT) ||
+ kvm_cpu_cap_has(X86_FEATURE_SHSTK)))
return false;
if (msr->host_initiated)
@@ -13668,14 +13673,24 @@ bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr)
if (msr->index == MSR_KVM_GUEST_SSP)
return false;
+ if (msr->index == MSR_IA32_U_CET)
+ return true;
+
if (msr->index == MSR_IA32_S_CET)
- return guest_cpuid_has(vcpu, X86_FEATURE_IBT);
+ return guest_cpuid_has(vcpu, X86_FEATURE_IBT) ||
+ kvm_cet_kernel_shstk_supported();
+
+ if (msr->index == MSR_IA32_INT_SSP_TAB)
+ return guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) &&
+ kvm_cet_kernel_shstk_supported();
if (msr->index == MSR_IA32_PL3_SSP &&
!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK))
return false;
- return true;
+ mask = (msr->index == MSR_IA32_PL3_SSP) ? XFEATURE_MASK_CET_USER :
+ XFEATURE_MASK_CET_KERNEL;
+ return !!(kvm_caps.supported_xss & mask);
}
EXPORT_SYMBOL_GPL(kvm_cet_is_msr_accessible);
--
2.27.0
From: Rick Edgecombe <[email protected]>
Shadow stack provides protection for applications against function return
address corruption. It is active when the processor supports it, the
kernel has CONFIG_X86_SHADOW_STACK enabled, and the application is built
for the feature. This is only implemented for the 64-bit kernel. When it
is enabled, legacy non-shadow stack applications continue to work, but
without protection.
Since there is another feature that utilizes CET (Kernel IBT) that will
share implementation with shadow stacks, create CONFIG_CET to signify
that at least one CET feature is configured.
Co-developed-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Tested-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/all/20230319001535.23210-3-rick.p.edgecombe%40intel.com
---
arch/x86/Kconfig | 24 ++++++++++++++++++++++++
arch/x86/Kconfig.assembler | 5 +++++
2 files changed, 29 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..f03791b73f9f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1851,6 +1851,11 @@ config CC_HAS_IBT
(CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
$(as-instr,endbr64)
+config X86_CET
+ def_bool n
+ help
+ CET features configured (Shadow stack or IBT)
+
config X86_KERNEL_IBT
prompt "Indirect Branch Tracking"
def_bool y
@@ -1858,6 +1863,7 @@ config X86_KERNEL_IBT
# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
depends on !LD_IS_LLD || LLD_VERSION >= 140000
select OBJTOOL
+ select X86_CET
help
Build the kernel with support for Indirect Branch Tracking, a
hardware support course-grain forward-edge Control Flow Integrity
@@ -1952,6 +1958,24 @@ config X86_SGX
If unsure, say N.
+config X86_USER_SHADOW_STACK
+ bool "X86 userspace shadow stack"
+ depends on AS_WRUSS
+ depends on X86_64
+ select ARCH_USES_HIGH_VMA_FLAGS
+ select X86_CET
+ help
+ Shadow stack protection is a hardware feature that detects function
+ return address corruption. This helps mitigate ROP attacks.
+ Applications must be enabled to use it, and old userspace does not
+ get protection "for free".
+
+ CPUs supporting shadow stacks were first released in 2020.
+
+ See Documentation/x86/shstk.rst for more information.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index b88f784cb02e..8ad41da301e5 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -24,3 +24,8 @@ config AS_GFNI
def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2)
help
Supported by binutils >= 2.30 and LLVM integrated assembler
+
+config AS_WRUSS
+ def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+ help
+ Supported by binutils >= 2.31 and LLVM integrated assembler
--
2.27.0
From: Sean Christopherson <[email protected]>
Add MSR_IA32_XSS to the list of MSRs reported to userspace if
supported_xss is non-zero, i.e. KVM supports at least one XSS based
feature.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/x86.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e7f78fe79b32..33a780fe820b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1454,6 +1454,7 @@ static const u32 msrs_to_save_base[] = {
MSR_IA32_UMWAIT_CONTROL,
MSR_IA32_XFD, MSR_IA32_XFD_ERR,
+ MSR_IA32_XSS,
};
static const u32 msrs_to_save_pmu[] = {
--
2.27.0
Control-flow Enforcement Technology(CET) is a CPU feature used to prevent
Return/Jump-Oriented Programming (ROP/JOP) attacks. CET introduces a new
exception type, Control Protection (#CP), and two sub-features(SHSTK,IBT)
to defend against ROP/JOP style control-flow subversion attacks.
Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.
Indirect Branch Tracking (IBT):
IBT adds a new instrution, ENDBRANCH, that is used to mark valid target
addresses of indirect branches(CALL, JMP, ENCLU[EEXIT], etc...). If an
indirect branch is executed and the next instruction is _not_ an ENDBRANCH,
the processor generates a #CP.
Several new CET MSRs are defined to support CET:
MSR_IA32_{U,S}_CET: Controls the CET settings for user mode and kernel
mode respectively.
MSR_IA32_PL{0,1,2,3}_SSP: Stores shadow stack pointers for CPL-0,1,2,3
protection respectively.
MSR_IA32_INT_SSP_TAB: Linear address of shadow stack pointer table,the
entry is indexed by IST of interrupt gate desc.
Two XSAVES state bits are introduced for CET:
IA32_XSS:[bit 11]: Control saving/restoring user mode CET states
IA32_XSS:[bit 12]: Control saving/restoring kernel mode CET states.
Six VMCS fields are introduced for CET:
{HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
{HOST,GUEST}_SSP: Stores shadow stack pointer of current active task/thread.
{HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.
If VM_EXIT_LOAD_HOST_CET_STATE = 1, the host CET states are restored from
the following VMCS fields at VM-Exit:
HOST_S_CET
HOST_SSP
HOST_INTR_SSP_TABLE
If VM_ENTRY_LOAD_GUEST_CET_STATE = 1, the guest CET states are loaded from
the following VMCS fields at VM-Entry:
GUEST_S_CET
GUEST_SSP
GUEST_INTR_SSP_TABLE
Co-developed-by: Zhang Yi Z <[email protected]>
Signed-off-by: Zhang Yi Z <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/vmx.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 498dc600bd5c..fe2aff27df8c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -102,6 +102,7 @@
#define VM_EXIT_CLEAR_BNDCFGS 0x00800000
#define VM_EXIT_PT_CONCEAL_PIP 0x01000000
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
+#define VM_EXIT_LOAD_CET_STATE 0x10000000
#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff
@@ -115,6 +116,7 @@
#define VM_ENTRY_LOAD_BNDCFGS 0x00010000
#define VM_ENTRY_PT_CONCEAL_PIP 0x00020000
#define VM_ENTRY_LOAD_IA32_RTIT_CTL 0x00040000
+#define VM_ENTRY_LOAD_CET_STATE 0x00100000
#define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x000011ff
@@ -343,6 +345,9 @@ enum vmcs_field {
GUEST_PENDING_DBG_EXCEPTIONS = 0x00006822,
GUEST_SYSENTER_ESP = 0x00006824,
GUEST_SYSENTER_EIP = 0x00006826,
+ GUEST_S_CET = 0x00006828,
+ GUEST_SSP = 0x0000682a,
+ GUEST_INTR_SSP_TABLE = 0x0000682c,
HOST_CR0 = 0x00006c00,
HOST_CR3 = 0x00006c02,
HOST_CR4 = 0x00006c04,
@@ -355,6 +360,9 @@ enum vmcs_field {
HOST_IA32_SYSENTER_EIP = 0x00006c12,
HOST_RSP = 0x00006c14,
HOST_RIP = 0x00006c16,
+ HOST_S_CET = 0x00006c18,
+ HOST_SSP = 0x00006c1a,
+ HOST_INTR_SSP_TABLE = 0x00006c1c
};
/*
--
2.27.0
On 5/11/2023 12:08 PM, Yang Weijiang wrote:
> Introduce a host-only synthetic MSR, MSR_KVM_GUEST_SSP, so that the VMM
> can read/write the guest's SSP, e.g. to migrate CET state. Use a synthetic
> MSR, e.g. as opposed to a VCPU_REG_, as GUEST_SSP is subject to the same
> consistency checks as the PL*_SSP MSRs, i.e. can share code.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm_para.h | 1 +
> arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++---
> arch/x86/kvm/x86.c | 4 ++++
> 3 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 6e64b27b2c1e..7af465e4e0bd 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -58,6 +58,7 @@
> #define MSR_KVM_ASYNC_PF_INT 0x4b564d06
> #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
> #define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
> +#define MSR_KVM_GUEST_SSP 0x4b564d09
>
> struct kvm_steal_time {
> __u64 steal;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0ccaa467d7d3..72149156bbd3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2095,9 +2095,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> break;
> case MSR_IA32_U_CET:
> case MSR_IA32_PL3_SSP:
> + case MSR_KVM_GUEST_SSP:
> if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
> return 1;
> - kvm_get_xsave_msr(msr_info);
> + if (msr_info->index == MSR_KVM_GUEST_SSP)
> + msr_info->data = vmcs_readl(GUEST_SSP);
According to the change of the kvm_cet_is_msr_accessible() below,
kvm_cet_is_msr_accessible() will return false for MSR_KVM_GUEST_SSP,
then this code is unreachable?
> + else
> + kvm_get_xsave_msr(msr_info);
> break;
> case MSR_IA32_DEBUGCTLMSR:
> msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
> @@ -2413,15 +2417,20 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> break;
> case MSR_IA32_U_CET:
> case MSR_IA32_PL3_SSP:
> + case MSR_KVM_GUEST_SSP:
> if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
> return 1;
> if (is_noncanonical_address(data, vcpu))
> return 1;
> if (msr_index == MSR_IA32_U_CET && (data & GENMASK(9, 6)))
> return 1;
> - if (msr_index == MSR_IA32_PL3_SSP && (data & GENMASK(2, 0)))
> + if ((msr_index == MSR_IA32_PL3_SSP ||
> + msr_index == MSR_KVM_GUEST_SSP) && (data & GENMASK(2, 0)))
> return 1;
> - kvm_set_xsave_msr(msr_info);
> + if (msr_index == MSR_KVM_GUEST_SSP)
> + vmcs_writel(GUEST_SSP, data);
> + else
> + kvm_set_xsave_msr(msr_info);
> break;
> case MSR_IA32_PERF_CAPABILITIES:
> if (data && !vcpu_to_pmu(vcpu)->version)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2e3a39c9297c..baac6acebd40 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13642,6 +13642,10 @@ bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr)
> !guest_cpuid_has(vcpu, X86_FEATURE_IBT))
> return false;
>
> + /* The synthetic MSR is for userspace access only. */
> + if (msr->index == MSR_KVM_GUEST_SSP)
> + return false;
> +
> if (msr->index == MSR_IA32_PL3_SSP &&
> !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK))
> return false;
On 5/23/2023 4:57 PM, Binbin Wu wrote:
>
>
> On 5/11/2023 12:08 PM, Yang Weijiang wrote:
>> Introduce a host-only synthetic MSR, MSR_KVM_GUEST_SSP, so that the VMM
>> can read/write the guest's SSP, e.g. to migrate CET state. Use a
>> synthetic
>> MSR, e.g. as opposed to a VCPU_REG_, as GUEST_SSP is subject to the same
>> consistency checks as the PL*_SSP MSRs, i.e. can share code.
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/include/uapi/asm/kvm_para.h | 1 +
>> arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++---
>> arch/x86/kvm/x86.c | 4 ++++
>> 3 files changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h
>> b/arch/x86/include/uapi/asm/kvm_para.h
>> index 6e64b27b2c1e..7af465e4e0bd 100644
>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -58,6 +58,7 @@
>> #define MSR_KVM_ASYNC_PF_INT 0x4b564d06
>> #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
>> #define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
>> +#define MSR_KVM_GUEST_SSP 0x4b564d09
>> struct kvm_steal_time {
>> __u64 steal;
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 0ccaa467d7d3..72149156bbd3 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2095,9 +2095,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu,
>> struct msr_data *msr_info)
>> break;
>> case MSR_IA32_U_CET:
>> case MSR_IA32_PL3_SSP:
>> + case MSR_KVM_GUEST_SSP:
>> if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
>> return 1;
>> - kvm_get_xsave_msr(msr_info);
>> + if (msr_info->index == MSR_KVM_GUEST_SSP)
>> + msr_info->data = vmcs_readl(GUEST_SSP);
> According to the change of the kvm_cet_is_msr_accessible() below,
> kvm_cet_is_msr_accessible() will return false for MSR_KVM_GUEST_SSP,
> then this code is unreachable?
No, when the access is initiated from host side,
kvm_cet_is_msr_accessible() return true for MSR_KVM_GUEST_SSP.
So the code is reachable:
if (msr->host_initiated)
return true;
[...]
On Thu, May 11, 2023 at 12:08:42AM -0400, Yang Weijiang wrote:
>From: Sean Christopherson <[email protected]>
>
>Add MSR_IA32_XSS to the list of MSRs reported to userspace if
>supported_xss is non-zero, i.e. KVM supports at least one XSS based
>feature.
The changelog doesn't match what the patch does.
Do you need to check if supported_xss is non-zero in kvm_probe_msr_to_save(),
e.g.,
case MSR_IA32_XSS:
if (!kvm_caps.supported_xss)
return;
break;
>
>Signed-off-by: Sean Christopherson <[email protected]>
>Signed-off-by: Yang Weijiang <[email protected]>
>---
> arch/x86/kvm/x86.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index e7f78fe79b32..33a780fe820b 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -1454,6 +1454,7 @@ static const u32 msrs_to_save_base[] = {
> MSR_IA32_UMWAIT_CONTROL,
>
> MSR_IA32_XFD, MSR_IA32_XFD_ERR,
>+ MSR_IA32_XSS,
> };
>
> static const u32 msrs_to_save_pmu[] = {
>--
>2.27.0
>
On 5/24/2023 3:06 PM, Chao Gao wrote:
> On Thu, May 11, 2023 at 12:08:42AM -0400, Yang Weijiang wrote:
>> From: Sean Christopherson <[email protected]>
>>
>> Add MSR_IA32_XSS to the list of MSRs reported to userspace if
>> supported_xss is non-zero, i.e. KVM supports at least one XSS based
>> feature.
> The changelog doesn't match what the patch does.
>
> Do you need to check if supported_xss is non-zero in kvm_probe_msr_to_save(),
> e.g.,
> case MSR_IA32_XSS:
> if (!kvm_caps.supported_xss)
> return;
> break;
I looked back the history of this patch, there's similar check
originally, however it's lost
during following rebases. I'll add it back. Thanks for pointing it out!
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index e7f78fe79b32..33a780fe820b 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1454,6 +1454,7 @@ static const u32 msrs_to_save_base[] = {
>> MSR_IA32_UMWAIT_CONTROL,
>>
>> MSR_IA32_XFD, MSR_IA32_XFD_ERR,
>> + MSR_IA32_XSS,
>> };
>>
>> static const u32 msrs_to_save_pmu[] = {
>> --
>> 2.27.0
>>
On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
>Add handling for Control Protection (#CP) exceptions(vector 21).
>The new vector is introduced for Intel's Control-Flow Enforcement
>Technology (CET) relevant violation cases.
>
>Although #CP belongs contributory exception class, but the actual
>effect is conditional on CET being exposed to guest. If CET is not
>available to guest, #CP falls back to non-contributory and doesn't
>have an error code.
This sounds weird. is this the hardware behavior? If yes, could you
point us to where this behavior is documented?
On 6/6/2023 5:08 PM, Chao Gao wrote:
> On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
>> Add handling for Control Protection (#CP) exceptions(vector 21).
>> The new vector is introduced for Intel's Control-Flow Enforcement
>> Technology (CET) relevant violation cases.
>>
>> Although #CP belongs contributory exception class, but the actual
>> effect is conditional on CET being exposed to guest. If CET is not
>> available to guest, #CP falls back to non-contributory and doesn't
>> have an error code.
> This sounds weird. is this the hardware behavior? If yes, could you
> point us to where this behavior is documented?
It's not SDM documented behavior.
The original description is provided by Sean here:
Re: [PATCH v15 04/14] KVM: x86: Add #CP support in guest exception
dispatch - Sean Christopherson (kernel.org)
<https://lore.kernel.org/all/[email protected]/>
I also verified the issue on my side. If the KVM CET patches are there
in L1 but CET is not enabled, and running some unit test can
trigger unit test failure although the #CP induced one has been fixed in
KVM unit tests.
On Thu, May 11, 2023, Yang Weijiang wrote:
> The last patch is introduced to support supervisor SHSTK but the feature is
> not enabled on Intel platform for now, the main purpose of this patch is to
> facilitate AMD folks to enable the feature.
I am beyond confused by the SDM's wording of CET_SSS.
First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
more appropriate phrasing).
Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 1).
But then it says says VMMs shouldn't set the bit.
When emulating the CPUID instruction, a virtual-machine monitor should return
this bit as 0 if those pushes can cause VM exits.
Based on the Xen code (which is sadly a far better source of information than the
SDM), I *think* that what the SDM is trying to say is that VMMs should not set
CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because
if the SDM really means "VMMs should never set the bit", then what on earth is the
point of the bit.
> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> doesn't fully support CET supervisor SHSTK, the enabling work is left for
> the future.
Why? If my interpretation of the SDM is correct, then all the pieces are there.
> Executed all KVM-unit-test cases and KVM selftests against this series, all
> test cases passed except the vmx test, the failure is due to CR4_CET bit
> testing in test_vmxon_bad_cr(). After add CR4_CET bit to skip list, the test
> passed. I'll send a patch to fix this issue later.
Your cover letter from v2 back in April said the same thing. Why hasn't the patch
been posted? And what exactly is the issue? IIUC, setting CR4.CET with
MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
a KVM bug. And if that's the case, the next obvious questions is, why are you
posting known buggy code?
On Thu, Jun 08, 2023, Weijiang Yang wrote:
>
> On 6/6/2023 5:08 PM, Chao Gao wrote:
> > On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
> > > Add handling for Control Protection (#CP) exceptions(vector 21).
> > > The new vector is introduced for Intel's Control-Flow Enforcement
> > > Technology (CET) relevant violation cases.
> > >
> > > Although #CP belongs contributory exception class, but the actual
> > > effect is conditional on CET being exposed to guest. If CET is not
> > > available to guest, #CP falls back to non-contributory and doesn't
> > > have an error code.
> > This sounds weird. is this the hardware behavior? If yes, could you
> > point us to where this behavior is documented?
>
> It's not SDM documented behavior.
The #CP behavior needs to be documented. Please pester whoever you need to in
order to make that happen.
On Thu, Jun 15, 2023, Sean Christopherson wrote:
> Your cover letter from v2 back in April said the same thing. Why hasn't the patch
> been posted? And what exactly is the issue? IIUC, setting CR4.CET with
> MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
> a KVM bug. And if that's the case, the next obvious questions is, why are you
> posting known buggy code?
Ah, is the problem that the test doesn't set CR0.WP as required by CR4.CET=1?
On 6/16/2023 8:00 AM, Sean Christopherson wrote:
> On Thu, Jun 15, 2023, Sean Christopherson wrote:
>> Your cover letter from v2 back in April said the same thing. Why hasn't the patch
>> been posted? And what exactly is the issue? IIUC, setting CR4.CET with
>> MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
>> a KVM bug. And if that's the case, the next obvious questions is, why are you
>> posting known buggy code?
> Ah, is the problem that the test doesn't set CR0.WP as required by CR4.CET=1?
Thanks for taking time to review this series!
Yes, due to CR0.WP bit is not set while CR4.CET is being set.
The check is imposed by patch-12.
I'll add the fixup patch together with next the version.
On 6/16/2023 7:58 AM, Sean Christopherson wrote:
> On Thu, Jun 08, 2023, Weijiang Yang wrote:
>> On 6/6/2023 5:08 PM, Chao Gao wrote:
>>> On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
>>>> Add handling for Control Protection (#CP) exceptions(vector 21).
>>>> The new vector is introduced for Intel's Control-Flow Enforcement
>>>> Technology (CET) relevant violation cases.
>>>>
>>>> Although #CP belongs contributory exception class, but the actual
>>>> effect is conditional on CET being exposed to guest. If CET is not
>>>> available to guest, #CP falls back to non-contributory and doesn't
>>>> have an error code.
>>> This sounds weird. is this the hardware behavior? If yes, could you
>>> point us to where this behavior is documented?
>> It's not SDM documented behavior.
> The #CP behavior needs to be documented. Please pester whoever you need to in
> order to make that happen.
Do you mean documentation for #CP as an generic exception or the
behavior in KVM as
this patch shows?
On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> On Thu, May 11, 2023, Yang Weijiang wrote:
>> The last patch is introduced to support supervisor SHSTK but the feature is
>> not enabled on Intel platform for now, the main purpose of this patch is to
>> facilitate AMD folks to enable the feature.
> I am beyond confused by the SDM's wording of CET_SSS.
>
> First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
> more appropriate phrasing).
>
> Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
> shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
> will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
> Architectures Software Developer’s Manual, Volume 1).
>
> But then it says says VMMs shouldn't set the bit.
>
> When emulating the CPUID instruction, a virtual-machine monitor should return
> this bit as 0 if those pushes can cause VM exits.
>
> Based on the Xen code (which is sadly a far better source of information than the
> SDM), I *think* that what the SDM is trying to say is that VMMs should not set
> CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because
> if the SDM really means "VMMs should never set the bit", then what on earth is the
> point of the bit.
I need to double check for the vague description.
From my understanding, on bare metal side, if the bit is 1, OS can
enable SSS if pushes won't cause
page fault. But for VM case, it's not recommended(regardless of the bit
state) to set the bit as vm-exits
caused by guest SSS pushes cannot be fully excluded.
In other word, the bit is mainly for bare metal guidance now.
>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>> the future.
> Why? If my interpretation of the SDM is correct, then all the pieces are there.
My assumption is, VM supervisor SHSTK depends bare metal kernel support
as PL0_SSP MSR is
backed by XSAVES via IA32_XSS:bit12(CET_S), but this part of support is
not there in Rick's native series.
And also based on above SDM description, I don't want to add the support
blindly now.
> [...]
On Fri, Jun 16, 2023, Weijiang Yang wrote:
>
> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> > On Thu, May 11, 2023, Yang Weijiang wrote:
> > > The last patch is introduced to support supervisor SHSTK but the feature is
> > > not enabled on Intel platform for now, the main purpose of this patch is to
> > > facilitate AMD folks to enable the feature.
> > I am beyond confused by the SDM's wording of CET_SSS.
> >
> > First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
> > more appropriate phrasing).
> >
> > Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
> > shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
> > will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
> > Architectures Software Developer’s Manual, Volume 1).
> >
> > But then it says says VMMs shouldn't set the bit.
> >
> > When emulating the CPUID instruction, a virtual-machine monitor should return
> > this bit as 0 if those pushes can cause VM exits.
> >
> > Based on the Xen code (which is sadly a far better source of information than the
> > SDM), I *think* that what the SDM is trying to say is that VMMs should not set
> > CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because
> > if the SDM really means "VMMs should never set the bit", then what on earth is the
> > point of the bit.
>
> I need to double check for the vague description.
>
> From my understanding, on bare metal side, if the bit is 1, OS can enable
> SSS if pushes won't cause page fault. But for VM case, it's not recommended
> (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
> pushes cannot be fully excluded.
>
> In other word, the bit is mainly for bare metal guidance now.
>
> > > In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> > > doesn't fully support CET supervisor SHSTK, the enabling work is left for
> > > the future.
> > Why? If my interpretation of the SDM is correct, then all the pieces are there.
...
> And also based on above SDM description, I don't want to add the support
> blindly now.
*sigh*
I got filled in on the details offlist.
1) In the next version of this series, please rework it to reincorporate Supervisor
Shadow Stack support into the main series, i.e. pretend Intel's implemenation
isn't horribly flawed. KVM can't guarantee that a VM-Exit won't occur, i.e.
can't advertise CET_SS, but I want the baseline support to be implemented,
otherwise the series as a whole is a big confusing mess with unanswered question
left, right, and center. And more importantly, architecturally SSS exists if
X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize
SSS if it so chooses, with the obvious caveat that there's a non-zero chance
the guest risks death by doing so. Or if userspace can ensure no VM-Exit will
occur, which is difficult but feasible (ignoring #MC), e.g. by statically
partitioning memory, prefaulting all memory in guest firmware, and not dirty
logging SSS pages. In such an extreme setup, userspace can enumerate CET_SSS
to the guest, and KVM should support that.
2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
While Intel is apparently ok with treating KVM developers like mushrooms, I
am not.
---
From: Sean Christopherson <[email protected]>
Date: Fri, 16 Jun 2023 10:04:37 -0700
Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise
CET_SSS
Explicitly call out that KVM must NOT advertise CET_SSS to userspace,
i.e. must not tell userspace and thus the guest that it is safe for the
guest to enable Supervisor Shadow Stacks (SSS).
Intel's implementation of SSS is fatally flawed for virtualized
environments, as despite wording in the SDM that suggests otherwise,
Intel CPUs' handling of shadow stack switches are NOT fully atomic. Only
the check-and-update of the supervisor shadow stack token's busy bit is
atomic. Per the SDM:
If the far CALL or event delivery pushes a stack frame after the token
is acquired and any of the pushes causes a fault or VM exit, the
processor will revert to the old shadow stack and the busy bit in the
new shadow stack's token remains set.
Or more bluntly, any fault or VM-Exit that occurs when pushing to the
shadow stack after the busy bit is set is fatal to the kernel, i.e. to
the guest in KVM's case. The (guest) kernel can protect itself against
faults, e.g. by ensuring that the shadow stack always has a valid mapping,
but a guest kernel obviously has no control over, or even knowledge of,
VM-Exits due to host activity.
To help software determine when it is safe to use SSS, Intel defined
CPUID.0x7.1.EDX bit (CET_SSS) and updated Intel CPUs to enumerate CET_SS,
i.e. bare metal Intel CPUs advertise to software that it is safe to enable
SSS.
If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is
sufficient for an operating system to ensure that none of the pushes can
cause a page fault.
But CET_SS also comes with an major caveat that is kinda sorta documented
in the SDM:
When emulating the CPUID instruction, a virtual-machine monitor should
return this bit as 0 if those pushes can cause VM exits.
In other words, CET_SSS (bit 18) does NOT enumerate that the underlying
CPU prevents VM-Exits, only that the environment in which the software is
running will not generate VM-Exits. I.e. CET_SSS is a stopgap to stem the
bleeding and allow kernels to enable SSS, not an indication that the
underlying CPU is immune to the VM-Exit problem.
And unfortunately, KVM itself effectively has zero chance of ensuring that
a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs
when any memslot is deleted, enabling dirty logging write-protects SPTEs,
etc. A sufficiently motivated userspace can, at least in theory, provide
a safe environment for SSS, e.g. by statically partitioning and
prefaulting (in guest firmware) all memory, disabling PML, never
write-protecting guest shadow stacks, etc. But such a setup is far, far
beyond typical KVM deployments.
Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full
shadow stack switch atomically so long as the stack is mapped WB and does
not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved
guest play nice with SSS without additional shenanigans.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/cpuid.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1e3ee96c879b..ecf4a68aaa08 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void)
);
kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
- F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI)
+ F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) |
+
+ /*
+ * Do NOT advertise CET_SSS, i.e. do not tell userspace and the
+ * guest that it is safe to use Supervisor Shadow Stacks under
+ * KVM when running on Intel CPUs. KVM itself cannot guarantee
+ * that a VM-Exit won't occur during a shadow stack update.
+ */
+ 0 /* F(CET_SSS) */
);
kvm_cpu_cap_mask(CPUID_D_1_EAX,
base-commit: 9305c14847719870e9e08294034861360577ce08
--
On Fri, Jun 16, 2023, Weijiang Yang wrote:
>
> On 6/16/2023 7:58 AM, Sean Christopherson wrote:
> > On Thu, Jun 08, 2023, Weijiang Yang wrote:
> > > On 6/6/2023 5:08 PM, Chao Gao wrote:
> > > > On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
> > > > > Add handling for Control Protection (#CP) exceptions(vector 21).
> > > > > The new vector is introduced for Intel's Control-Flow Enforcement
> > > > > Technology (CET) relevant violation cases.
> > > > >
> > > > > Although #CP belongs contributory exception class, but the actual
> > > > > effect is conditional on CET being exposed to guest. If CET is not
> > > > > available to guest, #CP falls back to non-contributory and doesn't
> > > > > have an error code.
> > > > This sounds weird. is this the hardware behavior? If yes, could you
> > > > point us to where this behavior is documented?
> > > It's not SDM documented behavior.
> > The #CP behavior needs to be documented. Please pester whoever you need to in
> > order to make that happen.
>
> Do you mean documentation for #CP as an generic exception or the behavior in
> KVM as this patch shows?
As I pointed out two *years* ago, this entry in the SDM
— The field's deliver-error-code bit (bit 11) is 1 if each of the following
holds: (1) the interruption type is hardware exception; (2) bit 0
(corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
(3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
indicates one of the following exceptions: #DF (vector 8), #TS (10),
#NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
needs to read something like
— The field's deliver-error-code bit (bit 11) is 1 if each of the following
holds: (1) the interruption type is hardware exception; (2) bit 0
(corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
(3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
indicates one of the following exceptions: #DF (vector 8), #TS (10),
#NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP (21)[1]
[1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
support for the 1-setting of CR4.CET.
On 6/17/2023 1:56 AM, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
>>> On Thu, May 11, 2023, Yang Weijiang wrote:
>>>> The last patch is introduced to support supervisor SHSTK but the feature is
>>>> not enabled on Intel platform for now, the main purpose of this patch is to
>>>> facilitate AMD folks to enable the feature.
>>> I am beyond confused by the SDM's wording of CET_SSS.
>>>
>>> First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
>>> more appropriate phrasing).
>>>
>>> Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
>>> shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
>>> will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
>>> Architectures Software Developer’s Manual, Volume 1).
>>>
>>> But then it says says VMMs shouldn't set the bit.
>>>
>>> When emulating the CPUID instruction, a virtual-machine monitor should return
>>> this bit as 0 if those pushes can cause VM exits.
>>>
>>> Based on the Xen code (which is sadly a far better source of information than the
>>> SDM), I *think* that what the SDM is trying to say is that VMMs should not set
>>> CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because
>>> if the SDM really means "VMMs should never set the bit", then what on earth is the
>>> point of the bit.
>> I need to double check for the vague description.
>>
>> From my understanding, on bare metal side, if the bit is 1, OS can enable
>> SSS if pushes won't cause page fault. But for VM case, it's not recommended
>> (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
>> pushes cannot be fully excluded.
>>
>> In other word, the bit is mainly for bare metal guidance now.
>>
>>>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>>>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>>>> the future.
>>> Why? If my interpretation of the SDM is correct, then all the pieces are there.
> ...
>
>> And also based on above SDM description, I don't want to add the support
>> blindly now.
> *sigh*
>
> I got filled in on the details offlist.
>
> 1) In the next version of this series, please rework it to reincorporate Supervisor
> Shadow Stack support into the main series, i.e. pretend Intel's implemenation
> isn't horribly flawed.
Let me make it clear, you want me to do two things:
1)Add Supervisor Shadow Stack state support(i.e., XSS.bit12(CET_S))
into kernel so that host can
support guest Supervisor Shadow Stack MSRs in g/h FPU context switch.
2) Add Supervisor Shadow stack support into KVM part so that guest OS is
able to use SSS with risk.
is it correct?
> KVM can't guarantee that a VM-Exit won't occur, i.e.
> can't advertise CET_SS, but I want the baseline support to be implemented,
> otherwise the series as a whole is a big confusing mess with unanswered question
> left, right, and center. And more importantly, architecturally SSS exists if
> X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize
> SSS if it so chooses, with the obvious caveat that there's a non-zero chance
> the guest risks death by doing so. Or if userspace can ensure no VM-Exit will
> occur, which is difficult but feasible (ignoring #MC), e.g. by statically
> partitioning memory, prefaulting all memory in guest firmware, and not dirty
> logging SSS pages. In such an extreme setup, userspace can enumerate CET_SSS
> to the guest, and KVM should support that.
Make sense, provide support but take risk on your own.
>
> 2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
> While Intel is apparently ok with treating KVM developers like mushrooms, I
> am not.
will add it, thanks a lot for detailed change logs!
>
> ---
> From: Sean Christopherson <[email protected]>
> Date: Fri, 16 Jun 2023 10:04:37 -0700
> Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise
> CET_SSS
>
> Explicitly call out that KVM must NOT advertise CET_SSS to userspace,
> i.e. must not tell userspace and thus the guest that it is safe for the
> guest to enable Supervisor Shadow Stacks (SSS).
>
> Intel's implementation of SSS is fatally flawed for virtualized
> environments, as despite wording in the SDM that suggests otherwise,
> Intel CPUs' handling of shadow stack switches are NOT fully atomic. Only
> the check-and-update of the supervisor shadow stack token's busy bit is
> atomic. Per the SDM:
>
> If the far CALL or event delivery pushes a stack frame after the token
> is acquired and any of the pushes causes a fault or VM exit, the
> processor will revert to the old shadow stack and the busy bit in the
> new shadow stack's token remains set.
>
> Or more bluntly, any fault or VM-Exit that occurs when pushing to the
> shadow stack after the busy bit is set is fatal to the kernel, i.e. to
> the guest in KVM's case. The (guest) kernel can protect itself against
> faults, e.g. by ensuring that the shadow stack always has a valid mapping,
> but a guest kernel obviously has no control over, or even knowledge of,
> VM-Exits due to host activity.
>
> To help software determine when it is safe to use SSS, Intel defined
> CPUID.0x7.1.EDX bit (CET_SSS) and updated Intel CPUs to enumerate CET_SS,
> i.e. bare metal Intel CPUs advertise to software that it is safe to enable
> SSS.
>
> If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is
> sufficient for an operating system to ensure that none of the pushes can
> cause a page fault.
>
> But CET_SS also comes with an major caveat that is kinda sorta documented
> in the SDM:
>
> When emulating the CPUID instruction, a virtual-machine monitor should
> return this bit as 0 if those pushes can cause VM exits.
>
> In other words, CET_SSS (bit 18) does NOT enumerate that the underlying
> CPU prevents VM-Exits, only that the environment in which the software is
> running will not generate VM-Exits. I.e. CET_SSS is a stopgap to stem the
> bleeding and allow kernels to enable SSS, not an indication that the
> underlying CPU is immune to the VM-Exit problem.
>
> And unfortunately, KVM itself effectively has zero chance of ensuring that
> a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs
> when any memslot is deleted, enabling dirty logging write-protects SPTEs,
> etc. A sufficiently motivated userspace can, at least in theory, provide
> a safe environment for SSS, e.g. by statically partitioning and
> prefaulting (in guest firmware) all memory, disabling PML, never
> write-protecting guest shadow stacks, etc. But such a setup is far, far
> beyond typical KVM deployments.
>
> Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full
> shadow stack switch atomically so long as the stack is mapped WB and does
> not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved
> guest play nice with SSS without additional shenanigans.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1e3ee96c879b..ecf4a68aaa08 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void)
> );
>
> kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
> - F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI)
> + F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) |
> +
> + /*
> + * Do NOT advertise CET_SSS, i.e. do not tell userspace and the
> + * guest that it is safe to use Supervisor Shadow Stacks under
> + * KVM when running on Intel CPUs. KVM itself cannot guarantee
> + * that a VM-Exit won't occur during a shadow stack update.
> + */
> + 0 /* F(CET_SSS) */
> );
>
> kvm_cpu_cap_mask(CPUID_D_1_EAX,
>
> base-commit: 9305c14847719870e9e08294034861360577ce08
On 6/17/2023 2:57 AM, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>> On 6/16/2023 7:58 AM, Sean Christopherson wrote:
>>> On Thu, Jun 08, 2023, Weijiang Yang wrote:
>>>> On 6/6/2023 5:08 PM, Chao Gao wrote:
>>>>> On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
>>>>>> Add handling for Control Protection (#CP) exceptions(vector 21).
>>>>>> The new vector is introduced for Intel's Control-Flow Enforcement
>>>>>> Technology (CET) relevant violation cases.
>>>>>>
>>>>>> Although #CP belongs contributory exception class, but the actual
>>>>>> effect is conditional on CET being exposed to guest. If CET is not
>>>>>> available to guest, #CP falls back to non-contributory and doesn't
>>>>>> have an error code.
>>>>> This sounds weird. is this the hardware behavior? If yes, could you
>>>>> point us to where this behavior is documented?
>>>> It's not SDM documented behavior.
>>> The #CP behavior needs to be documented. Please pester whoever you need to in
>>> order to make that happen.
>> Do you mean documentation for #CP as an generic exception or the behavior in
>> KVM as this patch shows?
> As I pointed out two *years* ago, this entry in the SDM
>
> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
> holds: (1) the interruption type is hardware exception; (2) bit 0
> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
> indicates one of the following exceptions: #DF (vector 8), #TS (10),
> #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
>
> needs to read something like
>
> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
> holds: (1) the interruption type is hardware exception; (2) bit 0
> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
> indicates one of the following exceptions: #DF (vector 8), #TS (10),
> #NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP (21)[1]
>
> [1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
> support for the 1-setting of CR4.CET.
OK, I'll route the messages to related person, thanks!
On Mon, Jun 19, 2023, Weijiang Yang wrote:
>
> On 6/17/2023 1:56 AM, Sean Christopherson wrote:
> > On Fri, Jun 16, 2023, Weijiang Yang wrote:
> > > On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> > > > On Thu, May 11, 2023, Yang Weijiang wrote:
> > > > > The last patch is introduced to support supervisor SHSTK but the feature is
> > > > > not enabled on Intel platform for now, the main purpose of this patch is to
> > > > > facilitate AMD folks to enable the feature.
> > > > I am beyond confused by the SDM's wording of CET_SSS.
> > > >
> > > > First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
> > > > more appropriate phrasing).
> > > >
> > > > Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
> > > > shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
> > > > will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
> > > > Architectures Software Developer’s Manual, Volume 1).
> > > >
> > > > But then it says says VMMs shouldn't set the bit.
> > > >
> > > > When emulating the CPUID instruction, a virtual-machine monitor should return
> > > > this bit as 0 if those pushes can cause VM exits.
> > > >
> > > > Based on the Xen code (which is sadly a far better source of information than the
> > > > SDM), I *think* that what the SDM is trying to say is that VMMs should not set
> > > > CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because
> > > > if the SDM really means "VMMs should never set the bit", then what on earth is the
> > > > point of the bit.
> > > I need to double check for the vague description.
> > >
> > > From my understanding, on bare metal side, if the bit is 1, OS can enable
> > > SSS if pushes won't cause page fault. But for VM case, it's not recommended
> > > (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
> > > pushes cannot be fully excluded.
> > >
> > > In other word, the bit is mainly for bare metal guidance now.
> > >
> > > > > In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> > > > > doesn't fully support CET supervisor SHSTK, the enabling work is left for
> > > > > the future.
> > > > Why? If my interpretation of the SDM is correct, then all the pieces are there.
> > ...
> >
> > > And also based on above SDM description, I don't want to add the support
> > > blindly now.
> > *sigh*
> >
> > I got filled in on the details offlist.
> >
> > 1) In the next version of this series, please rework it to reincorporate Supervisor
> > Shadow Stack support into the main series, i.e. pretend Intel's implemenation
> > isn't horribly flawed.
>
> Let me make it clear, you want me to do two things:
>
> 1)Add Supervisor Shadow Stack state support(i.e., XSS.bit12(CET_S)) into
> kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
> context switch.
If that's necessary for correct functionality, yes.
> 2) Add Supervisor Shadow stack support into KVM part so that guest OS is
> able to use SSS with risk.
Yes. Architecturally, if KVM advertises X86_FEATURE_SHSTK, then KVM needs to
provide both User and Supervisor support. CET_SSS doesn't change the architecture,
it's little more than a hint. And even if the guest follows SDM's recommendation
to not enable shadow stacks, a clever kernel can still utilize SSS assets, e.g. use
the MSRs as scratch registers.
On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> On Mon, Jun 19, 2023, Weijiang Yang wrote:
>> On 6/17/2023 1:56 AM, Sean Christopherson wrote:
>>> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>>>> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
>>>>> On Thu, May 11, 2023, Yang Weijiang wrote:
>>>>>> The last patch is introduced to support supervisor SHSTK but the feature is
>>>>>> not enabled on Intel platform for now, the main purpose of this patch is to
>>>>>> facilitate AMD folks to enable the feature.
>>>>> I am beyond confused by the SDM's wording of CET_SSS.
>>>>>
>>>>> First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
>>>>> more appropriate phrasing).
>>>>>
>>>>> Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
>>>>> shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
>>>>> will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
>>>>> Architectures Software Developer’s Manual, Volume 1).
>>>>>
>>>>> But then it says says VMMs shouldn't set the bit.
>>>>>
>>>>> When emulating the CPUID instruction, a virtual-machine monitor should return
>>>>> this bit as 0 if those pushes can cause VM exits.
>>>>>
>>>>> Based on the Xen code (which is sadly a far better source of information than the
>>>>> SDM), I *think* that what the SDM is trying to say is that VMMs should not set
>>>>> CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because
>>>>> if the SDM really means "VMMs should never set the bit", then what on earth is the
>>>>> point of the bit.
>>>> I need to double check for the vague description.
>>>>
>>>> From my understanding, on bare metal side, if the bit is 1, OS can enable
>>>> SSS if pushes won't cause page fault. But for VM case, it's not recommended
>>>> (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
>>>> pushes cannot be fully excluded.
>>>>
>>>> In other word, the bit is mainly for bare metal guidance now.
>>>>
>>>>>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>>>>>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>>>>>> the future.
>>>>> Why? If my interpretation of the SDM is correct, then all the pieces are there.
>>> ...
>>>
>>>> And also based on above SDM description, I don't want to add the support
>>>> blindly now.
>>> *sigh*
>>>
>>> I got filled in on the details offlist.
>>>
>>> 1) In the next version of this series, please rework it to reincorporate Supervisor
>>> Shadow Stack support into the main series, i.e. pretend Intel's implemenation
>>> isn't horribly flawed.
>> Let me make it clear, you want me to do two things:
>>
>> 1)Add Supervisor Shadow Stack state support(i.e., XSS.bit12(CET_S)) into
>> kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
>> context switch.
> If that's necessary for correct functionality, yes.
>
>> 2) Add Supervisor Shadow stack support into KVM part so that guest OS is
>> able to use SSS with risk.
> Yes. Architecturally, if KVM advertises X86_FEATURE_SHSTK, then KVM needs to
> provide both User and Supervisor support. CET_SSS doesn't change the architecture,
> it's little more than a hint. And even if the guest follows SDM's recommendation
> to not enable shadow stacks, a clever kernel can still utilize SSS assets, e.g. use
> the MSRs as scratch registers.
Understood, thanks!
On 6/17/2023 2:57 AM, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>> On 6/16/2023 7:58 AM, Sean Christopherson wrote:
>>> On Thu, Jun 08, 2023, Weijiang Yang wrote:
>>>> On 6/6/2023 5:08 PM, Chao Gao wrote:
>>>>> On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
>>>>>> Add handling for Control Protection (#CP) exceptions(vector 21).
>>>>>> The new vector is introduced for Intel's Control-Flow Enforcement
>>>>>> Technology (CET) relevant violation cases.
>>>>>>
>>>>>> Although #CP belongs contributory exception class, but the actual
>>>>>> effect is conditional on CET being exposed to guest. If CET is not
>>>>>> available to guest, #CP falls back to non-contributory and doesn't
>>>>>> have an error code.
>>>>> This sounds weird. is this the hardware behavior? If yes, could you
>>>>> point us to where this behavior is documented?
>>>> It's not SDM documented behavior.
>>> The #CP behavior needs to be documented. Please pester whoever you need to in
>>> order to make that happen.
>> Do you mean documentation for #CP as an generic exception or the behavior in
>> KVM as this patch shows?
> As I pointed out two *years* ago, this entry in the SDM
>
> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
> holds: (1) the interruption type is hardware exception; (2) bit 0
> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
> indicates one of the following exceptions: #DF (vector 8), #TS (10),
> #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
>
> needs to read something like
>
> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
> holds: (1) the interruption type is hardware exception; (2) bit 0
> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
> indicates one of the following exceptions: #DF (vector 8), #TS (10),
> #NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP (21)[1]
>
> [1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
> support for the 1-setting of CR4.CET.
Hi, Sean,
I sent above change request to Gil(added in cc), but he shared different
opinion on this issue:
"It is the case that all CET-capable parts enumerate IA32_VMX_BASIC[56]
as 1.
However, there were earlier parts without CET that enumerated
IA32_VMX_BASIC[56] as 0.
On those parts, an attempt to inject an exception with vector 21 (#CP)
with an error code would fail.
(Injection of exception 21 with no error code would be allowed.)
It may make things clearer if we document the statement above (all
CET-capable parts enumerate IA32_VMX_BASIC[56] as 1).
I will see if we can update future revisions of the SDM to clarify this."
Then if this is the case, kvm needs to check IA32_VMX_BASIC[56] before
inject exception to nested VM.
And this patch could be removed, instead need another patch like below:
diff --git a/arch/x86/include/asm/msr-index.h
b/arch/x86/include/asm/msr-index.h
index ad35355ee43e..6b33aacc8587 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1076,6 +1076,7 @@
#define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
#define VMX_BASIC_MEM_TYPE_WB 6LLU
#define VMX_BASIC_INOUT 0x0040000000000000LLU
+#define VMX_BASIC_CHECK_ERRCODE 0x0140000000000000LLU
/* Resctrl MSRs: */
/* - Intel: */
diff --git a/arch/x86/kvm/vmx/capabilities.h
b/arch/x86/kvm/vmx/capabilities.h
index 85cffeae7f10..4b1ed4dc03bc 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -79,6 +79,11 @@ static inline bool cpu_has_vmx_basic_inout(void)
return (((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
}
+static inline bool cpu_has_vmx_basic_check_errcode(void)
+{
+ return (((u64)vmcs_config.basic_cap << 32) &
VMX_BASIC_CHECK_ERRCODE);
+}
+
static inline bool cpu_has_virtual_nmis(void)
{
return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 78524daa2cb2..92aa4fc3d233 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1227,9 +1227,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx
*vmx, u64 data)
{
const u64 feature_and_reserved =
/* feature (except bit 48; see below) */
- BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
+ BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
/* reserved */
- BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
+ BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
u64 vmx_basic = vmcs_config.nested.basic;
if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
@@ -2873,7 +2873,8 @@ static int nested_check_vm_entry_controls(struct
kvm_vcpu *vcpu,
should_have_error_code =
intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
x86_exception_has_error_code(vector);
- if (CC(has_error_code != should_have_error_code))
+ if (!cpu_has_vmx_basic_check_errcode() &&
+ CC(has_error_code != should_have_error_code))
return -EINVAL;
/* VM-entry exception error code */
@@ -6986,6 +6987,8 @@ static void nested_vmx_setup_basic(struct
nested_vmx_msrs *msrs)
if (cpu_has_vmx_basic_inout())
msrs->basic |= VMX_BASIC_INOUT;
+ if (cpu_has_vmx_basic_check_errcode())
+ msrs->basic |= VMX_BASIC_CHECK_ERRCODE;
}
static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d70f2e94b187..95c0eab7805c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2748,7 +2748,7 @@ static int setup_vmcs_config(struct vmcs_config
*vmcs_conf,
rdmsrl(MSR_IA32_VMX_MISC, misc_msr);
vmcs_conf->size = vmx_msr_high & 0x1fff;
- vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
+ vmcs_conf->basic_cap = vmx_msr_high & ~0x7fff;
vmcs_conf->revision_id = vmx_msr_low;
On Fri, Jun 30, 2023 at 05:34:28PM +0800, Yang, Weijiang wrote:
>
>On 6/17/2023 2:57 AM, Sean Christopherson wrote:
>> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>> > On 6/16/2023 7:58 AM, Sean Christopherson wrote:
>> > > On Thu, Jun 08, 2023, Weijiang Yang wrote:
>> > > > On 6/6/2023 5:08 PM, Chao Gao wrote:
>> > > > > On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
>> > > > > > Add handling for Control Protection (#CP) exceptions(vector 21).
>> > > > > > The new vector is introduced for Intel's Control-Flow Enforcement
>> > > > > > Technology (CET) relevant violation cases.
>> > > > > >
>> > > > > > Although #CP belongs contributory exception class, but the actual
>> > > > > > effect is conditional on CET being exposed to guest. If CET is not
>> > > > > > available to guest, #CP falls back to non-contributory and doesn't
>> > > > > > have an error code.
>> > > > > This sounds weird. is this the hardware behavior? If yes, could you
>> > > > > point us to where this behavior is documented?
>> > > > It's not SDM documented behavior.
>> > > The #CP behavior needs to be documented. Please pester whoever you need to in
>> > > order to make that happen.
>> > Do you mean documentation for #CP as an generic exception or the behavior in
>> > KVM as this patch shows?
>> As I pointed out two *years* ago, this entry in the SDM
>>
>> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
>> holds: (1) the interruption type is hardware exception; (2) bit 0
>> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
>> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
>> indicates one of the following exceptions: #DF (vector 8), #TS (10),
>> #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
>>
>> needs to read something like
>>
>> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
>> holds: (1) the interruption type is hardware exception; (2) bit 0
>> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
>> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
>> indicates one of the following exceptions: #DF (vector 8), #TS (10),
>> #NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP (21)[1]
>>
>> [1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
>> support for the 1-setting of CR4.CET.
>
>Hi, Sean,
>
>I sent above change request to Gil(added in cc), but he shared different
>opinion on this issue:
>
>
>"It is the case that all CET-capable parts enumerate IA32_VMX_BASIC[56] as 1.
>
> However, there were earlier parts without CET that enumerated
>IA32_VMX_BASIC[56] as 0.
>
> On those parts, an attempt to inject an exception with vector 21 (#CP) with
>an error code would fail.
>
>(Injection of exception 21 with no error code would be allowed.)
>
> It may make things clearer if we document the statement above (all
>CET-capable parts enumerate IA32_VMX_BASIC[56] as 1).
>
>I will see if we can update future revisions of the SDM to clarify this."
>
>
>Then if this is the case, kvm needs to check IA32_VMX_BASIC[56] before
>inject exception to nested VM.
And KVM can hide CET from guests if IA32_VMX_BASIC[56] is 0.
>
>And this patch could be removed, instead need another patch like below:
>
>diff --git a/arch/x86/include/asm/msr-index.h
>b/arch/x86/include/asm/msr-index.h
>index ad35355ee43e..6b33aacc8587 100644
>--- a/arch/x86/include/asm/msr-index.h
>+++ b/arch/x86/include/asm/msr-index.h
>@@ -1076,6 +1076,7 @@
> #define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
> #define VMX_BASIC_MEM_TYPE_WB 6LLU
> #define VMX_BASIC_INOUT 0x0040000000000000LLU
>+#define VMX_BASIC_CHECK_ERRCODE 0x0140000000000000LLU
>
> /* Resctrl MSRs: */
> /* - Intel: */
>diff --git a/arch/x86/kvm/vmx/capabilities.h
>b/arch/x86/kvm/vmx/capabilities.h
>index 85cffeae7f10..4b1ed4dc03bc 100644
>--- a/arch/x86/kvm/vmx/capabilities.h
>+++ b/arch/x86/kvm/vmx/capabilities.h
>@@ -79,6 +79,11 @@ static inline bool cpu_has_vmx_basic_inout(void)
> return (((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
> }
>
>+static inline bool cpu_has_vmx_basic_check_errcode(void)
>+{
>+ return (((u64)vmcs_config.basic_cap << 32) &
>VMX_BASIC_CHECK_ERRCODE);
>+}
>+
> static inline bool cpu_has_virtual_nmis(void)
> {
> return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
>diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>index 78524daa2cb2..92aa4fc3d233 100644
>--- a/arch/x86/kvm/vmx/nested.c
>+++ b/arch/x86/kvm/vmx/nested.c
>@@ -1227,9 +1227,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx,
>u64 data)
> {
> const u64 feature_and_reserved =
> /* feature (except bit 48; see below) */
>- BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
>+ BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
> /* reserved */
>- BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
>+ BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
> u64 vmx_basic = vmcs_config.nested.basic;
>
> if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
>@@ -2873,7 +2873,8 @@ static int nested_check_vm_entry_controls(struct
>kvm_vcpu *vcpu,
> should_have_error_code =
> intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
> x86_exception_has_error_code(vector);
>- if (CC(has_error_code != should_have_error_code))
>+ if (!cpu_has_vmx_basic_check_errcode() &&
We can skip computing should_have_error_code. and we should check if
IA32_VMX_BASIC[56] is set for this vCPU (i.e. in vmx->nested.msrs.basic)
rather than host/kvm capability.
>+ CC(has_error_code != should_have_error_code))
> return -EINVAL;
>
> /* VM-entry exception error code */
>@@ -6986,6 +6987,8 @@ static void nested_vmx_setup_basic(struct
>nested_vmx_msrs *msrs)
>
> if (cpu_has_vmx_basic_inout())
> msrs->basic |= VMX_BASIC_INOUT;
>+ if (cpu_has_vmx_basic_check_errcode())
>+ msrs->basic |= VMX_BASIC_CHECK_ERRCODE;
> }
>
> static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index d70f2e94b187..95c0eab7805c 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -2748,7 +2748,7 @@ static int setup_vmcs_config(struct vmcs_config
>*vmcs_conf,
> rdmsrl(MSR_IA32_VMX_MISC, misc_msr);
>
> vmcs_conf->size = vmx_msr_high & 0x1fff;
>- vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
>+ vmcs_conf->basic_cap = vmx_msr_high & ~0x7fff;
>
> vmcs_conf->revision_id = vmx_msr_low;
>
>
On 6/30/2023 6:27 PM, Chao Gao wrote:
> On Fri, Jun 30, 2023 at 05:34:28PM +0800, Yang, Weijiang wrote:
>> On 6/17/2023 2:57 AM, Sean Christopherson wrote:
>>> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>>>> On 6/16/2023 7:58 AM, Sean Christopherson wrote:
>>>>> On Thu, Jun 08, 2023, Weijiang Yang wrote:
>>>>>> On 6/6/2023 5:08 PM, Chao Gao wrote:
>>>>>>> On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
>>>>>>>> Add handling for Control Protection (#CP) exceptions(vector 21).
>>>>>>>> The new vector is introduced for Intel's Control-Flow Enforcement
>>>>>>>> Technology (CET) relevant violation cases.
>>>>>>>>
>>>>>>>> Although #CP belongs contributory exception class, but the actual
>>>>>>>> effect is conditional on CET being exposed to guest. If CET is not
>>>>>>>> available to guest, #CP falls back to non-contributory and doesn't
>>>>>>>> have an error code.
>>>>>>> This sounds weird. is this the hardware behavior? If yes, could you
>>>>>>> point us to where this behavior is documented?
>>>>>> It's not SDM documented behavior.
>>>>> The #CP behavior needs to be documented. Please pester whoever you need to in
>>>>> order to make that happen.
>>>> Do you mean documentation for #CP as an generic exception or the behavior in
>>>> KVM as this patch shows?
>>> As I pointed out two *years* ago, this entry in the SDM
>>>
>>> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
>>> holds: (1) the interruption type is hardware exception; (2) bit 0
>>> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
>>> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
>>> indicates one of the following exceptions: #DF (vector 8), #TS (10),
>>> #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
>>>
>>> needs to read something like
>>>
>>> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
>>> holds: (1) the interruption type is hardware exception; (2) bit 0
>>> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
>>> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
>>> indicates one of the following exceptions: #DF (vector 8), #TS (10),
>>> #NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP (21)[1]
>>>
>>> [1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
>>> support for the 1-setting of CR4.CET.
>> Hi, Sean,
>>
>> I sent above change request to Gil(added in cc), but he shared different
>> opinion on this issue:
>>
>>
>> "It is the case that all CET-capable parts enumerate IA32_VMX_BASIC[56] as 1.
>>
>> However, there were earlier parts without CET that enumerated
>> IA32_VMX_BASIC[56] as 0.
>>
>> On those parts, an attempt to inject an exception with vector 21 (#CP) with
>> an error code would fail.
>>
>> (Injection of exception 21 with no error code would be allowed.)
>>
>> It may make things clearer if we document the statement above (all
>> CET-capable parts enumerate IA32_VMX_BASIC[56] as 1).
>>
>> I will see if we can update future revisions of the SDM to clarify this."
>>
>>
>> Then if this is the case, kvm needs to check IA32_VMX_BASIC[56] before
>> inject exception to nested VM.
> And KVM can hide CET from guests if IA32_VMX_BASIC[56] is 0.
Yes, this scratch patch didn't cover cross-check with CET enabling, thanks!
>
>> And this patch could be removed, instead need another patch like below:
>>
>> diff --git a/arch/x86/include/asm/msr-index.h
>> b/arch/x86/include/asm/msr-index.h
>> index ad35355ee43e..6b33aacc8587 100644
>> --- a/arch/x86/include/asm/msr-index.h
>> +++ b/arch/x86/include/asm/msr-index.h
>> @@ -1076,6 +1076,7 @@
>> #define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
>> #define VMX_BASIC_MEM_TYPE_WB 6LLU
>> #define VMX_BASIC_INOUT 0x0040000000000000LLU
>> +#define VMX_BASIC_CHECK_ERRCODE 0x0140000000000000LLU
>>
>> /* Resctrl MSRs: */
>> /* - Intel: */
>> diff --git a/arch/x86/kvm/vmx/capabilities.h
>> b/arch/x86/kvm/vmx/capabilities.h
>> index 85cffeae7f10..4b1ed4dc03bc 100644
>> --- a/arch/x86/kvm/vmx/capabilities.h
>> +++ b/arch/x86/kvm/vmx/capabilities.h
>> @@ -79,6 +79,11 @@ static inline bool cpu_has_vmx_basic_inout(void)
>> return (((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
>> }
>>
>> +static inline bool cpu_has_vmx_basic_check_errcode(void)
>> +{
>> + return (((u64)vmcs_config.basic_cap << 32) &
>> VMX_BASIC_CHECK_ERRCODE);
>> +}
>> +
>> static inline bool cpu_has_virtual_nmis(void)
>> {
>> return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 78524daa2cb2..92aa4fc3d233 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -1227,9 +1227,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx,
>> u64 data)
>> {
>> const u64 feature_and_reserved =
>> /* feature (except bit 48; see below) */
>> - BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
>> + BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
>> /* reserved */
>> - BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
>> + BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
>> u64 vmx_basic = vmcs_config.nested.basic;
>>
>> if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
>> @@ -2873,7 +2873,8 @@ static int nested_check_vm_entry_controls(struct
>> kvm_vcpu *vcpu,
>> should_have_error_code =
>> intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
>> x86_exception_has_error_code(vector);
>> - if (CC(has_error_code != should_have_error_code))
>> + if (!cpu_has_vmx_basic_check_errcode() &&
> We can skip computing should_have_error_code. and we should check if
> IA32_VMX_BASIC[56] is set for this vCPU (i.e. in vmx->nested.msrs.basic)
> rather than host/kvm capability.
Oops, I confused myself, yes, need to reshape the code a bit and use
msrs.basic
to check the bit status, thanks!
>
>> + CC(has_error_code != should_have_error_code))
>> return -EINVAL;
>>
>> /* VM-entry exception error code */
>> @@ -6986,6 +6987,8 @@ static void nested_vmx_setup_basic(struct
>> nested_vmx_msrs *msrs)
>>
>> if (cpu_has_vmx_basic_inout())
>> msrs->basic |= VMX_BASIC_INOUT;
>> + if (cpu_has_vmx_basic_check_errcode())
>> + msrs->basic |= VMX_BASIC_CHECK_ERRCODE;
>> }
>>
>> static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d70f2e94b187..95c0eab7805c 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2748,7 +2748,7 @@ static int setup_vmcs_config(struct vmcs_config
>> *vmcs_conf,
>> rdmsrl(MSR_IA32_VMX_MISC, misc_msr);
>>
>> vmcs_conf->size = vmx_msr_high & 0x1fff;
>> - vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
>> + vmcs_conf->basic_cap = vmx_msr_high & ~0x7fff;
>>
>> vmcs_conf->revision_id = vmx_msr_low;
>>
>>
On Fri, Jun 30, 2023, Weijiang Yang wrote:
>
> On 6/17/2023 2:57 AM, Sean Christopherson wrote:
> > > Do you mean documentation for #CP as an generic exception or the behavior in
> > > KVM as this patch shows?
> > As I pointed out two *years* ago, this entry in the SDM
> >
> > — The field's deliver-error-code bit (bit 11) is 1 if each of the following
> > holds: (1) the interruption type is hardware exception; (2) bit 0
> > (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> > (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
> > indicates one of the following exceptions: #DF (vector 8), #TS (10),
> > #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
> >
> > needs to read something like
> >
> > — The field's deliver-error-code bit (bit 11) is 1 if each of the following
> > holds: (1) the interruption type is hardware exception; (2) bit 0
> > (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> > (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
> > indicates one of the following exceptions: #DF (vector 8), #TS (10),
> > #NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP (21)[1]
> >
> > [1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
> > support for the 1-setting of CR4.CET.
>
> Hi, Sean,
>
> I sent above change request to Gil(added in cc), but he shared different
> opinion on this issue:
Heh, "opinion".
> It may make things clearer if we document the statement above (all
> CET-capable parts enumerate IA32_VMX_BASIC[56] as 1).
>
> I will see if we can update future revisions of the SDM to clarify this."
That would be helpful. Though to be perfectly honest, I simply overlooked the
existence of IA32_VMX_BASIC[56].
Thanks!
> Then if this is the case, kvm needs to check IA32_VMX_BASIC[56] before
> inject exception to nested VM.
>
> And this patch could be removed, instead need another patch like below:
>
> diff --git a/arch/x86/include/asm/msr-index.h
> b/arch/x86/include/asm/msr-index.h
> index ad35355ee43e..6b33aacc8587 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1076,6 +1076,7 @@
> #define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
> #define VMX_BASIC_MEM_TYPE_WB 6LLU
> #define VMX_BASIC_INOUT 0x0040000000000000LLU
> +#define VMX_BASIC_CHECK_ERRCODE 0x0140000000000000LLU
"Check Error Code" isn't a great description. The flag enumerates that there the
CPU does *not* perform consistency checks on the error code when injecting hardware
exceptions.
So something like this?
VMX_BASIC_NO_HW_ERROR_CODE_CC
or maybe
VMX_BASIC_PM_NO_HW_ERROR_CODE_CC
if we want to capture that only protected mode is exempt (I personally prefer
just VMX_BASIC_NO_HW_ERROR_CODE_CC as "PM" is a bit ambiguous).
> @@ -2873,7 +2873,8 @@ static int nested_check_vm_entry_controls(struct
> kvm_vcpu *vcpu,
> should_have_error_code =
> intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
> x86_exception_has_error_code(vector);
> - if (CC(has_error_code != should_have_error_code))
> + if (!cpu_has_vmx_basic_check_errcode() &&
> + CC(has_error_code != should_have_error_code))
This is wrong on mutiple fronts:
1. The new feature flag only excempts hardware exceptions delivered to guests
with CR0.PE=1. The above will skip the consistency check for all event injection.
2. KVM needs to check the CPU model that is exposed to L1, not the capabilities
of the host CPU.
Highlighting the key phrases in the SDM:
The field's deliver-error-code bit (bit 11) is 1 if each of the following holds: (1) the interruption type is
^^^^^^^
hardware exception; (2) bit 0 (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
(3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector indicates one of the following
exceptions: #DF (vector 8), #TS (10), #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
The field's deliver-error-code bit is 0 if any of the following holds: (1) the interruption type is not hardware
^^^^^^
exception; (2) bit 0 is clear in the CR0 field in the guest-state area; or (3) IA32_VMX_BASIC[56] is read as
0 and the vector is in one of the following ranges: 0–7, 9, 15, 16, or 18–31.
I think what we want is:
/* VM-entry interruption-info field: deliver error code */
if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
!nested_cpu_has_no_hw_error_code_cc(vcpu)) {
should_have_error_code =
intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
x86_exception_has_error_code(vector);
if (CC(has_error_code != should_have_error_code))
return -EINVAL;
}
Intel will not produce any CPU with CET that does not enumerate IA32_VMX_BASIC[56] as 1.
One can check that bit before injecting a #CP with error code, but it should not be necessary if CET is enumerated.
Of course, if the KVM may run as a guest of another VMM/hypervisor, it may be that the virtual CPU in which KVM operates may enumerate CET but clear the bit in IA32_VMX_BASIC.
- Gil
-----Original Message-----
From: Yang, Weijiang <[email protected]>
Sent: Friday, June 30, 2023 05:05
To: Gao, Chao <[email protected]>
Cc: Christopherson,, Sean <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Edgecombe, Rick P <[email protected]>; [email protected]; Neiger, Gil <[email protected]>
Subject: Re: [PATCH v3 10/21] KVM:x86: Add #CP support in guest exception classification
On 6/30/2023 6:27 PM, Chao Gao wrote:
> On Fri, Jun 30, 2023 at 05:34:28PM +0800, Yang, Weijiang wrote:
>> On 6/17/2023 2:57 AM, Sean Christopherson wrote:
>>> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>>>> On 6/16/2023 7:58 AM, Sean Christopherson wrote:
>>>>> On Thu, Jun 08, 2023, Weijiang Yang wrote:
>>>>>> On 6/6/2023 5:08 PM, Chao Gao wrote:
>>>>>>> On Thu, May 11, 2023 at 12:08:46AM -0400, Yang Weijiang wrote:
>>>>>>>> Add handling for Control Protection (#CP) exceptions(vector 21).
>>>>>>>> The new vector is introduced for Intel's Control-Flow
>>>>>>>> Enforcement Technology (CET) relevant violation cases.
>>>>>>>>
>>>>>>>> Although #CP belongs contributory exception class, but the
>>>>>>>> actual effect is conditional on CET being exposed to guest. If
>>>>>>>> CET is not available to guest, #CP falls back to
>>>>>>>> non-contributory and doesn't have an error code.
>>>>>>> This sounds weird. is this the hardware behavior? If yes, could
>>>>>>> you point us to where this behavior is documented?
>>>>>> It's not SDM documented behavior.
>>>>> The #CP behavior needs to be documented. Please pester whoever
>>>>> you need to in order to make that happen.
>>>> Do you mean documentation for #CP as an generic exception or the
>>>> behavior in KVM as this patch shows?
>>> As I pointed out two *years* ago, this entry in the SDM
>>>
>>> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
>>> holds: (1) the interruption type is hardware exception; (2) bit 0
>>> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
>>> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
>>> indicates one of the following exceptions: #DF (vector 8), #TS (10),
>>> #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
>>>
>>> needs to read something like
>>>
>>> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
>>> holds: (1) the interruption type is hardware exception; (2) bit 0
>>> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
>>> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
>>> indicates one of the following exceptions: #DF (vector 8), #TS (10),
>>> #NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP
>>> (21)[1]
>>>
>>> [1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
>>> support for the 1-setting of CR4.CET.
>> Hi, Sean,
>>
>> I sent above change request to Gil(added in cc), but he shared
>> different opinion on this issue:
>>
>>
>> "It is the case that all CET-capable parts enumerate IA32_VMX_BASIC[56] as 1.
>>
>> However, there were earlier parts without CET that enumerated
>> IA32_VMX_BASIC[56] as 0.
>>
>> On those parts, an attempt to inject an exception with vector 21
>> (#CP) with an error code would fail.
>>
>> (Injection of exception 21 with no error code would be allowed.)
>>
>> It may make things clearer if we document the statement above (all
>> CET-capable parts enumerate IA32_VMX_BASIC[56] as 1).
>>
>> I will see if we can update future revisions of the SDM to clarify this."
>>
>>
>> Then if this is the case, kvm needs to check IA32_VMX_BASIC[56]
>> before inject exception to nested VM.
> And KVM can hide CET from guests if IA32_VMX_BASIC[56] is 0.
Yes, this scratch patch didn't cover cross-check with CET enabling, thanks!
>
>> And this patch could be removed, instead need another patch like below:
>>
>> diff --git a/arch/x86/include/asm/msr-index.h
>> b/arch/x86/include/asm/msr-index.h
>> index ad35355ee43e..6b33aacc8587 100644
>> --- a/arch/x86/include/asm/msr-index.h
>> +++ b/arch/x86/include/asm/msr-index.h
>> @@ -1076,6 +1076,7 @@
>> #define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
>> #define VMX_BASIC_MEM_TYPE_WB 6LLU
>> #define VMX_BASIC_INOUT 0x0040000000000000LLU
>> +#define VMX_BASIC_CHECK_ERRCODE 0x0140000000000000LLU
>>
>> /* Resctrl MSRs: */
>> /* - Intel: */
>> diff --git a/arch/x86/kvm/vmx/capabilities.h
>> b/arch/x86/kvm/vmx/capabilities.h index 85cffeae7f10..4b1ed4dc03bc
>> 100644
>> --- a/arch/x86/kvm/vmx/capabilities.h
>> +++ b/arch/x86/kvm/vmx/capabilities.h
>> @@ -79,6 +79,11 @@ static inline bool cpu_has_vmx_basic_inout(void)
>> return (((u64)vmcs_config.basic_cap << 32) &
>> VMX_BASIC_INOUT);
>> }
>>
>> +static inline bool cpu_has_vmx_basic_check_errcode(void)
>> +{
>> + return (((u64)vmcs_config.basic_cap << 32) &
>> VMX_BASIC_CHECK_ERRCODE);
>> +}
>> +
>> static inline bool cpu_has_virtual_nmis(void)
>> {
>> return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS
>> && diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 78524daa2cb2..92aa4fc3d233 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -1227,9 +1227,9 @@ static int vmx_restore_vmx_basic(struct
>> vcpu_vmx *vmx,
>> u64 data)
>> {
>> const u64 feature_and_reserved =
>> /* feature (except bit 48; see below) */
>> - BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
>> + BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
>> /* reserved */
>> - BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
>> + BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
>> u64 vmx_basic = vmcs_config.nested.basic;
>>
>> if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
>> @@ -2873,7 +2873,8 @@ static int
>> nested_check_vm_entry_controls(struct
>> kvm_vcpu *vcpu,
>> should_have_error_code =
>> intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
>> x86_exception_has_error_code(vector);
>> - if (CC(has_error_code != should_have_error_code))
>> + if (!cpu_has_vmx_basic_check_errcode() &&
> We can skip computing should_have_error_code. and we should check if
> IA32_VMX_BASIC[56] is set for this vCPU (i.e. in
> vmx->nested.msrs.basic) rather than host/kvm capability.
Oops, I confused myself, yes, need to reshape the code a bit and use msrs.basic
to check the bit status, thanks!
>
>> + CC(has_error_code != should_have_error_code))
>> return -EINVAL;
>>
>> /* VM-entry exception error code */ @@ -6986,6 +6987,8 @@
>> static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
>>
>> if (cpu_has_vmx_basic_inout())
>> msrs->basic |= VMX_BASIC_INOUT;
>> + if (cpu_has_vmx_basic_check_errcode())
>> + msrs->basic |= VMX_BASIC_CHECK_ERRCODE;
>> }
>>
>> static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
>> d70f2e94b187..95c0eab7805c 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2748,7 +2748,7 @@ static int setup_vmcs_config(struct vmcs_config
>> *vmcs_conf,
>> rdmsrl(MSR_IA32_VMX_MISC, misc_msr);
>>
>> vmcs_conf->size = vmx_msr_high & 0x1fff;
>> - vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
>> + vmcs_conf->basic_cap = vmx_msr_high & ~0x7fff;
>>
>> vmcs_conf->revision_id = vmx_msr_low;
>>
>>
On Fri, Jun 30, 2023, Gil Neiger wrote:
> Intel will not produce any CPU with CET that does not enumerate IA32_VMX_BASIC[56] as 1.
>
> One can check that bit before injecting a #CP with error code, but it should
> not be necessary if CET is enumerated.
>
> Of course, if the KVM may run as a guest of another VMM/hypervisor, it may be
> that the virtual CPU in which KVM operates may enumerate CET but clear the
> bit in IA32_VMX_BASIC.
Yeah, I think KVM should be paranoid and expose CET to the guest if and only if
IA32_VMX_BASIC[56] is 1. That'll also help validate nested support, e.g. will
make it more obvious if userspace+KVM provides a "bad" model to L1.
Just in case it is not clear: event delivery in real mode never includes an error code. That is why the PE bit in CR0 is checked.
- Gil
-----Original Message-----
From: Sean Christopherson <[email protected]>
Sent: Friday, June 30, 2023 08:08
To: Yang, Weijiang <[email protected]>
Cc: Gao, Chao <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Edgecombe, Rick P <[email protected]>; [email protected]; Neiger, Gil <[email protected]>
Subject: Re: [PATCH v3 10/21] KVM:x86: Add #CP support in guest exception classification
On Fri, Jun 30, 2023, Weijiang Yang wrote:
>
> On 6/17/2023 2:57 AM, Sean Christopherson wrote:
> > > Do you mean documentation for #CP as an generic exception or the
> > > behavior in KVM as this patch shows?
> > As I pointed out two *years* ago, this entry in the SDM
> >
> > — The field's deliver-error-code bit (bit 11) is 1 if each of the following
> > holds: (1) the interruption type is hardware exception; (2) bit 0
> > (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> > (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
> > indicates one of the following exceptions: #DF (vector 8), #TS (10),
> > #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
> >
> > needs to read something like
> >
> > — The field's deliver-error-code bit (bit 11) is 1 if each of the following
> > holds: (1) the interruption type is hardware exception; (2) bit 0
> > (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> > (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
> > indicates one of the following exceptions: #DF (vector 8), #TS (10),
> > #NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP
> > (21)[1]
> >
> > [1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
> > support for the 1-setting of CR4.CET.
>
> Hi, Sean,
>
> I sent above change request to Gil(added in cc), but he shared
> different opinion on this issue:
Heh, "opinion".
> It may make things clearer if we document the statement above (all
> CET-capable parts enumerate IA32_VMX_BASIC[56] as 1).
>
> I will see if we can update future revisions of the SDM to clarify this."
That would be helpful. Though to be perfectly honest, I simply overlooked the existence of IA32_VMX_BASIC[56].
Thanks!
> Then if this is the case, kvm needs to check IA32_VMX_BASIC[56]
> before inject exception to nested VM.
>
> And this patch could be removed, instead need another patch like below:
>
> diff --git a/arch/x86/include/asm/msr-index.h
> b/arch/x86/include/asm/msr-index.h
> index ad35355ee43e..6b33aacc8587 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1076,6 +1076,7 @@
> #define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
> #define VMX_BASIC_MEM_TYPE_WB 6LLU
> #define VMX_BASIC_INOUT 0x0040000000000000LLU
> +#define VMX_BASIC_CHECK_ERRCODE 0x0140000000000000LLU
"Check Error Code" isn't a great description. The flag enumerates that there the CPU does *not* perform consistency checks on the error code when injecting hardware exceptions.
So something like this?
VMX_BASIC_NO_HW_ERROR_CODE_CC
or maybe
VMX_BASIC_PM_NO_HW_ERROR_CODE_CC
if we want to capture that only protected mode is exempt (I personally prefer just VMX_BASIC_NO_HW_ERROR_CODE_CC as "PM" is a bit ambiguous).
> @@ -2873,7 +2873,8 @@ static int nested_check_vm_entry_controls(struct
> kvm_vcpu *vcpu,
> should_have_error_code =
> intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
> x86_exception_has_error_code(vector);
> - if (CC(has_error_code != should_have_error_code))
> + if (!cpu_has_vmx_basic_check_errcode() &&
> + CC(has_error_code != should_have_error_code))
This is wrong on mutiple fronts:
1. The new feature flag only excempts hardware exceptions delivered to guests
with CR0.PE=1. The above will skip the consistency check for all event injection.
2. KVM needs to check the CPU model that is exposed to L1, not the capabilities
of the host CPU.
Highlighting the key phrases in the SDM:
The field's deliver-error-code bit (bit 11) is 1 if each of the following holds: (1) the interruption type is
^^^^^^^
hardware exception; (2) bit 0 (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
(3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector indicates one of the following
exceptions: #DF (vector 8), #TS (10), #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
The field's deliver-error-code bit is 0 if any of the following holds: (1) the interruption type is not hardware
^^^^^^
exception; (2) bit 0 is clear in the CR0 field in the guest-state area; or (3) IA32_VMX_BASIC[56] is read as
0 and the vector is in one of the following ranges: 0–7, 9, 15, 16, or 18–31.
I think what we want is:
/* VM-entry interruption-info field: deliver error code */
if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
!nested_cpu_has_no_hw_error_code_cc(vcpu)) {
should_have_error_code =
intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
x86_exception_has_error_code(vector);
if (CC(has_error_code != should_have_error_code))
return -EINVAL;
}
On 6/30/2023 11:07 PM, Sean Christopherson wrote:
> On Fri, Jun 30, 2023, Weijiang Yang wrote:
>> On 6/17/2023 2:57 AM, Sean Christopherson wrote:
>>>> Do you mean documentation for #CP as an generic exception or the behavior in
>>>> KVM as this patch shows?
>>> As I pointed out two *years* ago, this entry in the SDM
>>>
>>> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
>>> holds: (1) the interruption type is hardware exception; (2) bit 0
>>> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
>>> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
>>> indicates one of the following exceptions: #DF (vector 8), #TS (10),
>>> #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
>>>
>>> needs to read something like
>>>
>>> — The field's deliver-error-code bit (bit 11) is 1 if each of the following
>>> holds: (1) the interruption type is hardware exception; (2) bit 0
>>> (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
>>> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector
>>> indicates one of the following exceptions: #DF (vector 8), #TS (10),
>>> #NP (11), #SS (12), #GP (13), #PF (14), #AC (17), or #CP (21)[1]
>>>
>>> [1] #CP has an error code if and only if IA32_VMX_CR4_FIXED1 enumerates
>>> support for the 1-setting of CR4.CET.
>> Hi, Sean,
>>
>> I sent above change request to Gil(added in cc), but he shared different
>> opinion on this issue:
> Heh, "opinion".
>
>> It may make things clearer if we document the statement above (all
>> CET-capable parts enumerate IA32_VMX_BASIC[56] as 1).
>>
>> I will see if we can update future revisions of the SDM to clarify this."
> That would be helpful. Though to be perfectly honest, I simply overlooked the
> existence of IA32_VMX_BASIC[56].
>
> Thanks!
>
>> Then if this is the case, kvm needs to check IA32_VMX_BASIC[56] before
>> inject exception to nested VM.
>>
>> And this patch could be removed, instead need another patch like below:
>>
>> diff --git a/arch/x86/include/asm/msr-index.h
>> b/arch/x86/include/asm/msr-index.h
>> index ad35355ee43e..6b33aacc8587 100644
>> --- a/arch/x86/include/asm/msr-index.h
>> +++ b/arch/x86/include/asm/msr-index.h
>> @@ -1076,6 +1076,7 @@
>> #define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
>> #define VMX_BASIC_MEM_TYPE_WB 6LLU
>> #define VMX_BASIC_INOUT 0x0040000000000000LLU
>> +#define VMX_BASIC_CHECK_ERRCODE 0x0140000000000000LLU
> "Check Error Code" isn't a great description. The flag enumerates that there the
> CPU does *not* perform consistency checks on the error code when injecting hardware
> exceptions.
>
> So something like this?
>
> VMX_BASIC_NO_HW_ERROR_CODE_CC
>
> or maybe
>
> VMX_BASIC_PM_NO_HW_ERROR_CODE_CC
>
> if we want to capture that only protected mode is exempt (I personally prefer
> just VMX_BASIC_NO_HW_ERROR_CODE_CC as "PM" is a bit ambiguous).
I like VMX_BASIC_NO_HW_ERROR_CODE_CC too :-), thanks!
>
>> @@ -2873,7 +2873,8 @@ static int nested_check_vm_entry_controls(struct
>> kvm_vcpu *vcpu,
>> should_have_error_code =
>> intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
>> x86_exception_has_error_code(vector);
>> - if (CC(has_error_code != should_have_error_code))
>> + if (!cpu_has_vmx_basic_check_errcode() &&
>> + CC(has_error_code != should_have_error_code))
> This is wrong on mutiple fronts:
>
> 1. The new feature flag only excempts hardware exceptions delivered to guests
> with CR0.PE=1. The above will skip the consistency check for all event injection.
>
> 2. KVM needs to check the CPU model that is exposed to L1, not the capabilities
> of the host CPU.
>
> Highlighting the key phrases in the SDM:
>
> The field's deliver-error-code bit (bit 11) is 1 if each of the following holds: (1) the interruption type is
> ^^^^^^^
> hardware exception; (2) bit 0 (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
> (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector indicates one of the following
> exceptions: #DF (vector 8), #TS (10), #NP (11), #SS (12), #GP (13), #PF (14), or #AC (17).
>
> The field's deliver-error-code bit is 0 if any of the following holds: (1) the interruption type is not hardware
> ^^^^^^
> exception; (2) bit 0 is clear in the CR0 field in the guest-state area; or (3) IA32_VMX_BASIC[56] is read as
> 0 and the vector is in one of the following ranges: 0–7, 9, 15, 16, or 18–31.
>
> I think what we want is:
>
> /* VM-entry interruption-info field: deliver error code */
> if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
> !nested_cpu_has_no_hw_error_code_cc(vcpu)) {
> should_have_error_code =
> intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
> x86_exception_has_error_code(vector);
> if (CC(has_error_code != should_have_error_code))
> return -EINVAL;
> }
It looks good to me, will take it, thanks a lot!
On 6/30/2023 11:15 PM, Sean Christopherson wrote:
> On Fri, Jun 30, 2023, Gil Neiger wrote:
>> Intel will not produce any CPU with CET that does not enumerate IA32_VMX_BASIC[56] as 1.
>>
>> One can check that bit before injecting a #CP with error code, but it should
>> not be necessary if CET is enumerated.
>>
>> Of course, if the KVM may run as a guest of another VMM/hypervisor, it may be
>> that the virtual CPU in which KVM operates may enumerate CET but clear the
>> bit in IA32_VMX_BASIC.
> Yeah, I think KVM should be paranoid and expose CET to the guest if and only if
> IA32_VMX_BASIC[56] is 1. That'll also help validate nested support, e.g. will
> make it more obvious if userspace+KVM provides a "bad" model to L1.
OK, will do it, thanks you two!
On 6/30/2023 11:05 PM, Neiger, Gil wrote:
> Intel will not produce any CPU with CET that does not enumerate IA32_VMX_BASIC[56] as 1.
>
> One can check that bit before injecting a #CP with error code, but it should not be necessary if CET is enumerated.
>
> Of course, if the KVM may run as a guest of another VMM/hypervisor, it may be that the virtual CPU in which KVM operates may enumerate CET but clear the bit in IA32_VMX_BASIC.
>
> - Gil
Thanks Gil for clarity!
>
> -----Original Message-----
> From: Yang, Weijiang <[email protected]>
> Sent: Friday, June 30, 2023 05:05
> To: Gao, Chao <[email protected]>
> Cc: Christopherson,, Sean <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Edgecombe, Rick P <[email protected]>; [email protected]; Neiger, Gil <[email protected]>
> Subject: Re: [PATCH v3 10/21] KVM:x86: Add #CP support in guest exception classification
>
>
> On 6/30/2023 6:27 PM, Chao Gao wrote:
> [...]
> *sigh*
>
> I got filled in on the details offlist.
>
> 1) In the next version of this series, please rework it to reincorporate Supervisor
> Shadow Stack support into the main series, i.e. pretend Intel's implemenation
> isn't horribly flawed. KVM can't guarantee that a VM-Exit won't occur, i.e.
> can't advertise CET_SS, but I want the baseline support to be implemented,
> otherwise the series as a whole is a big confusing mess with unanswered question
> left, right, and center. And more importantly, architecturally SSS exists if
> X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize
> SSS if it so chooses, with the obvious caveat that there's a non-zero chance
> the guest risks death by doing so. Or if userspace can ensure no VM-Exit will
> occur, which is difficult but feasible (ignoring #MC), e.g. by statically
> partitioning memory, prefaulting all memory in guest firmware, and not dirty
> logging SSS pages. In such an extreme setup, userspace can enumerate CET_SSS
> to the guest, and KVM should support that.
>
> 2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
> While Intel is apparently ok with treating KVM developers like mushrooms, I
> am not.
>
> ---
> From: Sean Christopherson<[email protected]>
> Date: Fri, 16 Jun 2023 10:04:37 -0700
> Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise
> CET_SSS
>
> Explicitly call out that KVM must NOT advertise CET_SSS to userspace,
> i.e. must not tell userspace and thus the guest that it is safe for the
> guest to enable Supervisor Shadow Stacks (SSS).
>
> Intel's implementation of SSS is fatally flawed for virtualized
> environments, as despite wording in the SDM that suggests otherwise,
> Intel CPUs' handling of shadow stack switches are NOT fully atomic. Only
> the check-and-update of the supervisor shadow stack token's busy bit is
> atomic. Per the SDM:
>
> If the far CALL or event delivery pushes a stack frame after the token
> is acquired and any of the pushes causes a fault or VM exit, the
> processor will revert to the old shadow stack and the busy bit in the
> new shadow stack's token remains set.
>
> Or more bluntly, any fault or VM-Exit that occurs when pushing to the
> shadow stack after the busy bit is set is fatal to the kernel, i.e. to
> the guest in KVM's case. The (guest) kernel can protect itself against
> faults, e.g. by ensuring that the shadow stack always has a valid mapping,
> but a guest kernel obviously has no control over, or even knowledge of,
> VM-Exits due to host activity.
>
> To help software determine when it is safe to use SSS, Intel defined
> CPUID.0x7.1.EDX bit (CET_SSS) and updated Intel CPUs to enumerate CET_SS,
> i.e. bare metal Intel CPUs advertise to software that it is safe to enable
> SSS.
>
> If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is
> sufficient for an operating system to ensure that none of the pushes can
> cause a page fault.
>
> But CET_SS also comes with an major caveat that is kinda sorta documented
> in the SDM:
>
> When emulating the CPUID instruction, a virtual-machine monitor should
> return this bit as 0 if those pushes can cause VM exits.
>
> In other words, CET_SSS (bit 18) does NOT enumerate that the underlying
> CPU prevents VM-Exits, only that the environment in which the software is
> running will not generate VM-Exits. I.e. CET_SSS is a stopgap to stem the
> bleeding and allow kernels to enable SSS, not an indication that the
> underlying CPU is immune to the VM-Exit problem.
>
> And unfortunately, KVM itself effectively has zero chance of ensuring that
> a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs
> when any memslot is deleted, enabling dirty logging write-protects SPTEs,
> etc. A sufficiently motivated userspace can, at least in theory, provide
> a safe environment for SSS, e.g. by statically partitioning and
> prefaulting (in guest firmware) all memory, disabling PML, never
> write-protecting guest shadow stacks, etc. But such a setup is far, far
> beyond typical KVM deployments.
>
> Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full
> shadow stack switch atomically so long as the stack is mapped WB and does
> not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved
> guest play nice with SSS without additional shenanigans.
>
> Signed-off-by: Sean Christopherson<[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1e3ee96c879b..ecf4a68aaa08 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void)
> );
>
> kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
> - F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI)
> + F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) |
> +
> + /*
> + * Do NOT advertise CET_SSS, i.e. do not tell userspace and the
> + * guest that it is safe to use Supervisor Shadow Stacks under
> + * KVM when running on Intel CPUs. KVM itself cannot guarantee
> + * that a VM-Exit won't occur during a shadow stack update.
> + */
> + 0 /* F(CET_SSS) */
> );
>
> kvm_cpu_cap_mask(CPUID_D_1_EAX,
>
> base-commit: 9305c14847719870e9e08294034861360577ce08
Hi, Sean,
Gil reminded me SDM has been updated CET SSS related topics
recently(June release):
======================================================================
Section 17.2.3 (Supervisor Shadow Stack Token) in Volume 1 of the SDM:
If the far CALL or event delivery pushes a stack frame after the
token is acquired and any of the pushes causes a
fault or VM exit, the processor will revert to the old shadow stack
and the busy bit in the new shadow stack's token
remains set. The new shadow stack is said to be prematurely busy.
Software should enable supervisor shadow
stacks only if it is certain that this situation cannot occur. If
CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated
as 1, it is sufficient for an operating system to ensure that none
of the pushes can cause a page fault.
Volume 2A: CPUID.(EAX=07H,ECX=1H):EDX[bit 18] as follows:
CET_SSS. If 1, indicates that an operating system can enable
supervisor shadow stacks as long as
it ensures that a supervisor shadow stack cannot become prematurely
busy due to page faults (see Section
17.2.3 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1). When
emulating the CPUID instruction, a virtual-machine monitor (VMM)
should return this bit as 1 only if it
ensures that VM exits cannot cause a guest supervisor shadow stack
to appear to be prematurely busy.
Such a VMM could set the “prematurely busy shadow stack” VM-exit
control and use the additional information
that it provides.
Volume 3C: new “prematurely busy shadow stack” VM-exit control.
========================================================================
And Gil told me additional information was planed to be released later
in the summer.
Maybe you need modify above changelog a bit per the update.
Given the updated parts are technical forecast, I don't plan to
implement it in this series and still enumerate
CET_SSS ==0 for guest. What's your thoughts?
On Mon, Jul 10, 2023, Weijiang Yang wrote:
> Maybe you need modify above changelog a bit per the update.
Ya, I'll make sure the changelog gets updated before CET support is merged, though
feel free to post the next version without waiting for new changelog.
> Given the updated parts are technical forecast, I don't plan to implement it
> in this series and still enumerate
>
> CET_SSS ==0 for guest. What's your thoughts?
Yes, definitely punt shadow-stack fixup to future enabling work.
On 7/11/2023 6:18 AM, Sean Christopherson wrote:
> On Mon, Jul 10, 2023, Weijiang Yang wrote:
>> Maybe you need modify above changelog a bit per the update.
> Ya, I'll make sure the changelog gets updated before CET support is merged, though
> feel free to post the next version without waiting for new changelog.
Sure, thanks!
>> Given the updated parts are technical forecast, I don't plan to implement it
>> in this series and still enumerate
>>
>> CET_SSS ==0 for guest. What's your thoughts?
> Yes, definitely punt shadow-stack fixup to future enabling work.
Got it.
On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> On Mon, Jun 19, 2023, Weijiang Yang wrote:
>> On 6/17/2023 1:56 AM, Sean Christopherson wrote:
>>> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>>>> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
>>>>> On Thu, May 11, 2023, Yang Weijiang wrote:
>>>>>
[...]
>> Let me make it clear, you want me to do two things:
>>
>> 1)Add Supervisor Shadow Stack state support(i.e., XSS.bit12(CET_S)) into
>> kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
>> context switch.
> If that's necessary for correct functionality, yes.
Hi, Sean,
I held off posting the new version and want to sync up with you on this
point to avoid
surprising you.
After discussed adding the patch in kernel with Rick and Chao, we got
blow conclusions
on doing so:
the Pros:
- Super easy to implement for KVM.
- Automatically avoids saving and restoring this data when the vmexit
is handled within KVM.
the Cons:
- Unnecessarily restores XFEATURE_CET_KERNEL when switching to
non-KVM task's userspace.
- Forces allocating space for this state on all tasks, whether or not
they use KVM, and with likely zero users today and the near future.
- Complicates the FPU optimization thinking by including things that
can have no affect on userspace in the FPU
Given above reasons, I implemented guest CET supervisor states management
in KVM instead of adding a kernel patch for it.
Below are 3 KVM patches to support it:
Patch 1: Save/reload guest CET supervisor states when necessary:
=======================================================================
commit 16147ede75dee29583b7d42a6621d10d55b63595
Author: Yang Weijiang <[email protected]>
Date: Tue Jul 11 02:26:17 2023 -0400
KVM:x86: Make guest supervisor states as non-XSAVE managed
Save and reload guest CET supervisor states, i.e.,PL{0,1,2}_SSP,
when vCPU context is being swapped before and after userspace
<->kernel entry, also do the same operation when vCPU is sched-in
or sched-out.
Enabling CET supervisor state management only in KVM due to:
1) Currently, suervisor SHSTK is not enabled on host side, only
KVM needs to care about the guest's suervisor SHSTK states.
2) Enabling them in kernel FPU state framework has global effects
to all threads on host kernel, but the majority of the threads
are free to CET supervisor states. And it requires additional
storage size of thread FPU state area.
Add a new helper kvm_arch_sched_out() for that purpose. Adding
the support in kvm_arch_vcpu_put/load() without the new helper
looks possible, but the put/load functions are also called in
vcpu_put()/load(), the latter are heavily used in KVM, so adding
new helper makes the implementation clearer.
Signed-off-by: Yang Weijiang <[email protected]>
diff --git a/arch/arm64/include/asm/kvm_host.h
b/arch/arm64/include/asm/kvm_host.h
index 7e7e19ef6993..98235cb3d258 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1023,6 +1023,7 @@ void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu);
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
void kvm_arm_init_debug(void);
void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
diff --git a/arch/mips/include/asm/kvm_host.h
b/arch/mips/include/asm/kvm_host.h
index 957121a495f0..56c5e85ba5a3 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -893,6 +893,7 @@ static inline void kvm_arch_free_memslot(struct kvm
*kvm,
struct kvm_memory_slot *slot) {}
static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
diff --git a/arch/powerpc/include/asm/kvm_host.h
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..11587d953bf6 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -880,6 +880,7 @@ static inline void kvm_arch_sync_events(struct kvm
*kvm) {}
static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
diff --git a/arch/riscv/include/asm/kvm_host.h
b/arch/riscv/include/asm/kvm_host.h
index ee0acccb1d3b..6ff4a04fe0f2 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -244,6 +244,7 @@ struct kvm_vcpu_arch {
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
#define KVM_ARCH_WANT_MMU_NOTIFIER
diff --git a/arch/s390/include/asm/kvm_host.h
b/arch/s390/include/asm/kvm_host.h
index 2bbc3d54959d..d1750a6a86cf 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -1033,6 +1033,7 @@ extern int kvm_s390_gisc_unregister(struct kvm
*kvm, u32 gisc);
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_free_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot) {}
static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e2c549f147a5..7d9cfb7e2fe8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
*vcpu)
trace_kvm_fpu(0);
}
+static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
+ rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
+ rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
+ rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
+ wrmsrl(MSR_IA32_PL0_SSP, 0);
+ wrmsrl(MSR_IA32_PL1_SSP, 0);
+ wrmsrl(MSR_IA32_PL2_SSP, 0);
+ }
+ preempt_enable();
+}
+
+static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
+ wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
+ wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
+ wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
+ }
+ preempt_enable();
+}
+
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
struct kvm_queued_exception *ex = &vcpu->arch.exception;
@@ -11222,6 +11247,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_sigset_activate(vcpu);
kvm_run->flags = 0;
kvm_load_guest_fpu(vcpu);
+ kvm_reload_cet_supervisor_ssp(vcpu);
kvm_vcpu_srcu_read_lock(vcpu);
if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11310,6 +11336,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
r = vcpu_run(vcpu);
out:
+ kvm_save_cet_supervisor_ssp(vcpu);
kvm_put_guest_fpu(vcpu);
if (kvm_run->kvm_valid_regs)
store_regs(vcpu);
@@ -12398,9 +12425,17 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu,
int cpu)
pmu->need_cleanup = true;
kvm_make_request(KVM_REQ_PMU, vcpu);
}
+
+ kvm_reload_cet_supervisor_ssp(vcpu);
+
static_call(kvm_x86_sched_in)(vcpu, cpu);
}
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu)
+{
+ kvm_save_cet_supervisor_ssp(vcpu);
+}
+
void kvm_arch_free_vm(struct kvm *kvm)
{
kfree(to_kvm_hv(kvm)->hv_pa_pg);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d90331f16db1..b3032a5f0641 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1423,6 +1423,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct
kvm_vcpu *vcpu,
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);
void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu);
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66c1447d3c7f..42f28e8905e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5885,6 +5885,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+ kvm_arch_sched_out(vcpu, 0);
if (current->on_rq) {
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
Patch 2: optimization patch for above one:
===================================================================
commit ae5fe7c81cc3b93193758d1b7b4ab74a92a51dad
Author: Yang Weijiang <[email protected]>
Date: Fri Jul 14 20:03:52 2023 -0400
KVM:x86: Optimize CET supervisor SSP save/reload
Make PL{0,1,2}_SSP as write-intercepted to detect whether
guest is using these MSRs. Disable intercept to the MSRs
if they're written with non-zero values. KVM does save/
reload for the MSRs only if they're used by guest.
Signed-off-by: Yang Weijiang <[email protected]>
diff --git a/arch/x86/include/asm/kvm_host.h
b/arch/x86/include/asm/kvm_host.h
index 69cbc9d9b277..c50b555234fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -748,6 +748,7 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
bool xsaves_enabled;
bool xfd_no_write_intercept;
+ bool cet_sss_active;
u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 90ce1c7d3fd7..21c89d200c88 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2156,6 +2156,18 @@ static u64 vmx_get_supported_debugctl(struct
kvm_vcpu *vcpu, bool host_initiated
return debugctl;
}
+static void vmx_disable_write_intercept_sss_msr(struct kvm_vcpu *vcpu)
+{
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
+ MSR_TYPE_RW, false);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
+ MSR_TYPE_RW, false);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
+ MSR_TYPE_RW, false);
+ }
+}
+
/*
* Writes msr value into the appropriate "register".
* Returns 0 on success, non-0 otherwise.
@@ -2427,7 +2439,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu,
struct msr_data *msr_info)
#define VMX_CET_CONTROL_MASK (~GENMASK_ULL(9,6))
#define LEG_BITMAP_BASE(data) ((data) >> 12)
case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
- return kvm_set_msr_common(vcpu, msr_info);
+ if (kvm_set_msr_common(vcpu, msr_info))
+ return 1;
+ /*
+ * Write to the base SSP MSRs should happen ahead of
toggling
+ * of IA32_S_CET.SH_STK_EN bit.
+ */
+ if (!msr_info->host_initiated &&
+ msr_index != MSR_IA32_PL3_SSP && data) {
+ vmx_disable_write_intercept_sss_msr(vcpu);
+ wrmsrl(msr_index, data);
+ }
break;
case MSR_IA32_U_CET:
case MSR_IA32_S_CET:
@@ -7773,12 +7795,17 @@ static void
vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
MSR_TYPE_RW, false);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
MSR_TYPE_RW, false);
+ /*
+ * Supervisor shadow stack MSRs are intercepted until
+ * they're written by guest, this is designed to
+ * optimize the save/restore overhead.
+ */
vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
- MSR_TYPE_RW, false);
+ MSR_TYPE_R, false);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
- MSR_TYPE_RW, false);
+ MSR_TYPE_R, false);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
- MSR_TYPE_RW, false);
+ MSR_TYPE_R, false);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cab31dbb2bec..06dc5111da3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4049,8 +4049,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
struct msr_data *msr_info)
if (!IS_ALIGNED(data, 4))
return 1;
if (msr == MSR_IA32_PL0_SSP || msr == MSR_IA32_PL1_SSP ||
- msr == MSR_IA32_PL2_SSP)
+ msr == MSR_IA32_PL2_SSP) {
+ if (!msr_info->host_initiated && data)
+ vcpu->arch.cet_sss_active = true;
vcpu->arch.cet_s_ssp[msr - MSR_IA32_PL0_SSP] =
data;
+ }
else if (msr == MSR_IA32_PL3_SSP)
kvm_set_xsave_msr(msr_info);
break;
@@ -11250,7 +11253,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_sigset_activate(vcpu);
kvm_run->flags = 0;
kvm_load_guest_fpu(vcpu);
- kvm_reload_cet_supervisor_ssp(vcpu);
+ if (vcpu->arch.cet_sss_active)
+ kvm_reload_cet_supervisor_ssp(vcpu);
kvm_vcpu_srcu_read_lock(vcpu);
if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11339,7 +11343,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
r = vcpu_run(vcpu);
out:
- kvm_save_cet_supervisor_ssp(vcpu);
+ if (vcpu->arch.cet_sss_active)
+ kvm_save_cet_supervisor_ssp(vcpu);
kvm_put_guest_fpu(vcpu);
if (kvm_run->kvm_valid_regs)
store_regs(vcpu);
kvm_vcpu_srcu_read_lock(vcpu);
if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11339,7 +11343,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
r = vcpu_run(vcpu);
out:
- kvm_save_cet_supervisor_ssp(vcpu);
+ if (vcpu->arch.cet_sss_active)
+ kvm_save_cet_supervisor_ssp(vcpu);
kvm_put_guest_fpu(vcpu);
if (kvm_run->kvm_valid_regs)
store_regs(vcpu);
@@ -12428,15 +12433,16 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu,
int cpu)
pmu->need_cleanup = true;
kvm_make_request(KVM_REQ_PMU, vcpu);
}
-
- kvm_reload_cet_supervisor_ssp(vcpu);
+ if (vcpu->arch.cet_sss_active)
+ kvm_reload_cet_supervisor_ssp(vcpu);
static_call(kvm_x86_sched_in)(vcpu, cpu);
}
void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu)
{
- kvm_save_cet_supervisor_ssp(vcpu);
+ if (vcpu->arch.cet_sss_active)
+ kvm_save_cet_supervisor_ssp(vcpu);
}
void kvm_arch_free_vm(struct kvm *kvm)
=============================================================
Patch 3: support guest CET supervisor xstate bit:
commit 2708b3c959db56fb9243f9a157884c2120b8810c
Author: Yang Weijiang <[email protected]>
Date: Sat Jul 15 20:56:37 2023 -0400
KVM:x86: Enable guest CET supervisor xstate bit support
Add S_CET bit in kvm_caps.supported_xss so that guest can enumerate
the feature in CPUID(0xd,1).ECX.
Guest S_CET xstate bit is specially handled, i.e., it can be exposed
without related enabling on host side, because KVM manually
saves/reloads
guest supervisor SHSTK SSPs and current XSS swap logic for
host/guest aslo
supports doing so, thus it's safe to enable the bit without host
support.
Signed-off-by: Yang Weijiang <[email protected]>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2653e5eb54ee..071bcdedc530 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -228,7 +228,8 @@ static struct kvm_user_return_msrs __percpu
*user_return_msrs;
| XFEATURE_MASK_BNDCSR |
XFEATURE_MASK_AVX512 \
| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
-#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER)
+#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);
@@ -9639,6 +9640,7 @@ static int __kvm_x86_vendor_init(struct
kvm_x86_init_ops *ops)
if (boot_cpu_has(X86_FEATURE_XSAVES)) {
rdmsrl(MSR_IA32_XSS, host_xss);
kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+ kvm_caps.supported_xss |= XFEATURE_MASK_CET_KERNEL;
}
kvm_init_pmu_capability(ops->pmu_ops);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f8f042c91728..df187d7c3e74 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -362,7 +362,7 @@ static inline bool kvm_mpx_supported(void)
== (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
}
-#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER)
+#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
/*
* Shadow Stack and Indirect Branch Tracking feature enabling depends on
* whether host side CET user xstate bit is supported or not.
=================================================================
What's your thoughts on the solution? Is it appropriate for KVM?
Thanks!
[...]
On Mon, Jul 17, 2023, Weijiang Yang wrote:
>
> On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> > > 1)Add Supervisor Shadow Stack� state support(i.e., XSS.bit12(CET_S)) into
> > > kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
> > > context switch.
> > If that's necessary for correct functionality, yes.
...
> the Pros:
> �- Super easy to implement for KVM.
> �- Automatically avoids saving and restoring this data when the vmexit
> �� is handled within KVM.
>
> the Cons:
> �- Unnecessarily restores XFEATURE_CET_KERNEL when switching to
> �� non-KVM task's userspace.
> �- Forces allocating space for this state on all tasks, whether or not
> �� they use KVM, and with likely zero users today and the near future.
> �- Complicates the FPU optimization thinking by including things that
> �� can have no affect on userspace in the FPU
>
> Given above reasons, I implemented guest CET supervisor states management
> in KVM instead of adding a kernel patch for it.
>
> Below are 3 KVM patches to support it:
>
> Patch 1: Save/reload guest CET supervisor states when necessary:
>
> =======================================================================
>
> commit 16147ede75dee29583b7d42a6621d10d55b63595
> Author: Yang Weijiang <[email protected]>
> Date:�� Tue Jul 11 02:26:17 2023 -0400
>
> ��� KVM:x86: Make guest supervisor states as non-XSAVE managed
>
> ��� Save and reload guest CET supervisor states, i.e.,PL{0,1,2}_SSP,
> ��� when vCPU context is being swapped before and after userspace
> ��� <->kernel entry, also do the same operation when vCPU is sched-in
> ��� or sched-out.
...
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e2c549f147a5..7d9cfb7e2fe8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
> *vcpu)
> ������� trace_kvm_fpu(0);
> �}
>
> +static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
> +{
> +������ preempt_disable();
> +������ if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
> +�������������� rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
> +�������������� rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
> +�������������� rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
> +�������������� wrmsrl(MSR_IA32_PL0_SSP, 0);
> +�������������� wrmsrl(MSR_IA32_PL1_SSP, 0);
> +�������������� wrmsrl(MSR_IA32_PL2_SSP, 0);
> +������ }
> +������ preempt_enable();
> +}
> +
> +static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
> +{
> +������ preempt_disable();
> +������ if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
> +�������������� wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
> +�������������� wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
> +�������������� wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
> +������ }
> +������ preempt_enable();
> +}
My understanding is that PL[0-2]_SSP are used only on transitions to the
corresponding privilege level from a *different* privilege level. That means
KVM should be able to utilize the user_return_msr framework to load the host
values. Though if Linux ever supports SSS, I'm guessing the core kernel will
have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
per-task, on every context switch.
But note my original wording: **If that's necessary**
If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
IA32_S_CET, then running host stuff with guest values should be ok. KVM only
needs to guarantee that it doesn't leak values between guests. But that should
Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
And regardless of what the mechanism ends up managing SSP MSRs, it should only
ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
never consume PL{1,2}_SSP.
Am I missing something?
On Wed, Jul 19, 2023, Sean Christopherson wrote:
> On Mon, Jul 17, 2023, Weijiang Yang wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e2c549f147a5..7d9cfb7e2fe8 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
> > *vcpu)
> > ������� trace_kvm_fpu(0);
> > �}
Huh. After a bit of debugging, the mangling is due to mutt's default for send_charset
being
"us-ascii:iso-8859-1:utf-8"
and selecting iso-8859-1 instead of utf-8 as the encoding despite the original
mail being utf-8. In this case, mutt ran afoul of nbsp (u+00a0).
AFAICT, the solution is to essentially tell mutt to never try to use iso-8859-1
for sending mail
set send_charset="us-ascii:utf-8"
On Wed, Jul 19, 2023 at 12:41:47PM -0700, Sean Christopherson wrote:
> My understanding is that PL[0-2]_SSP are used only on transitions to the
> corresponding privilege level from a *different* privilege level. That means
> KVM should be able to utilize the user_return_msr framework to load the host
> values. Though if Linux ever supports SSS, I'm guessing the core kernel will
> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> per-task, on every context switch.
>
> But note my original wording: **If that's necessary**
>
> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> IA32_S_CET, then running host stuff with guest values should be ok. KVM only
> needs to guarantee that it doesn't leak values between guests. But that should
> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
>
> And regardless of what the mechanism ends up managing SSP MSRs, it should only
> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> never consume PL{1,2}_SSP.
To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
On 7/20/2023 3:41 AM, Sean Christopherson wrote:
> [...]
> My understanding is that PL[0-2]_SSP are used only on transitions to the
> corresponding privilege level from a *different* privilege level. That means
> KVM should be able to utilize the user_return_msr framework to load the host
> values. Though if Linux ever supports SSS, I'm guessing the core kernel will
> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> per-task, on every context switch.
>
> But note my original wording: **If that's necessary**
Thanks!
I think host SSS enabling won't happen in short-term, handling the guest
supervisor
states in KVM side is doable.
>
> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> IA32_S_CET, then running host stuff with guest values should be ok. KVM only
> needs to guarantee that it doesn't leak values between guests. But that should
> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
Yes, these handling have been covered by the new version.
>
> And regardless of what the mechanism ends up managing SSP MSRs, it should only
> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> never consume PL{1,2}_SSP.
>
> Am I missing something?
I think, guest PL{0,1,2}_SSP can be handled as a bundle to make the
handling easy(instead of handling each
separately) because guest can be non-Linux systems, as you said before
they could even be used as scratch registers.
But for host side as it's Linux, I can omit reloading/resetting host
PL{1,2}_SSP when vCPU thread is preempted.
I will post new version to community if above is minor divergence.
On 7/20/2023 4:26 AM, Sean Christopherson wrote:
> On Wed, Jul 19, 2023, Sean Christopherson wrote:
>> On Mon, Jul 17, 2023, Weijiang Yang wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index e2c549f147a5..7d9cfb7e2fe8 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
>>> *vcpu)
>>> ������� trace_kvm_fpu(0);
>>> �}
> Huh. After a bit of debugging, the mangling is due to mutt's default for send_charset
> being
>
> "us-ascii:iso-8859-1:utf-8"
>
> and selecting iso-8859-1 instead of utf-8 as the encoding despite the original
> mail being utf-8. In this case, mutt ran afoul of nbsp (u+00a0).
>
> AFAICT, the solution is to essentially tell mutt to never try to use iso-8859-1
> for sending mail
>
> set send_charset="us-ascii:utf-8"
It made me feel a bit guilty as I thought it could be resulted from
wrong settings of my email system :-)
> > My understanding is that PL[0-2]_SSP are used only on transitions to the
> > corresponding privilege level from a *different* privilege level. That means
> > KVM should be able to utilize the user_return_msr framework to load the host
> > values. Though if Linux ever supports SSS, I'm guessing the core kernel will
> > have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> > userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> > per-task, on every context switch.
> >
> > But note my original wording: **If that's necessary**
> >
> > If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> > IA32_S_CET, then running host stuff with guest values should be ok. KVM only
> > needs to guarantee that it doesn't leak values between guests. But that should
> > Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> > guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> >
> > And regardless of what the mechanism ends up managing SSP MSRs, it should only
> > ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> > never consume PL{1,2}_SSP.
>
> To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
Trying to understand more what prevents SSS to enable in pre FRED, Is
it better #CP exception
handling with other nested exceptions?
Won't same problems (to some extent) happen in user-mode shadow stack
(and in case of guest, SSS inside VM)?
Thanks,
Pankaj
On Thu, Jul 20, 2023 at 07:26:04AM +0200, Pankaj Gupta wrote:
> > > My understanding is that PL[0-2]_SSP are used only on transitions to the
> > > corresponding privilege level from a *different* privilege level. That means
> > > KVM should be able to utilize the user_return_msr framework to load the host
> > > values. Though if Linux ever supports SSS, I'm guessing the core kernel will
> > > have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> > > userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> > > per-task, on every context switch.
> > >
> > > But note my original wording: **If that's necessary**
> > >
> > > If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> > > IA32_S_CET, then running host stuff with guest values should be ok. KVM only
> > > needs to guarantee that it doesn't leak values between guests. But that should
> > > Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> > > guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> > >
> > > And regardless of what the mechanism ends up managing SSP MSRs, it should only
> > > ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> > > never consume PL{1,2}_SSP.
> >
> > To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
>
> Trying to understand more what prevents SSS to enable in pre FRED, Is
> it better #CP exception
> handling with other nested exceptions?
SSS took the syscall gap and made it worse -- as in *way* worse.
To top it off, the whole SSS busy bit thing is fundamentally
incompatible with how we manage to survive nested exceptions in NMI
context.
Basically, the whole x86 exception / stack switching logic was already
borderline impossible (consider taking an MCE in the early NMI path
where we set up, but have not finished, the re-entrancy stuff), and
pushed it over the edge and set it on fire.
And NMI isn't the only problem, the various new virt exceptions #VC and
#HV are on their own already near impossible, adding SSS again pushes
the whole thing into clear insanity.
There's a good exposition of the whole trainwreck by Andrew here:
https://www.youtube.com/watch?v=qcORS8CN0ow
(that is, sorry for the youtube link, but Google is failing me in
finding the actual Google Doc that talk is based on, or even the slide
deck :/)
FRED solves all that by:
- removing the stack gap, cc/ip/ss/sp/ssp/gs will all be switched
atomically and consistently for every transition.
- removing the non-reentrant IST mechanism and replacing it with stack
levels
- adding an explicit NMI latch
- re-organising the actual shadow stacks and doing away with that busy
bit thing (I need to re-read the FRED spec on this detail again).
Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
sorry.
On Thu, Jul 20, 2023 at 10:03:58AM +0200, Peter Zijlstra wrote:
> > Trying to understand more what prevents SSS to enable in pre FRED, Is
> > it better #CP exception
> > handling with other nested exceptions?
>
> SSS took the syscall gap and made it worse -- as in *way* worse.
>
> To top it off, the whole SSS busy bit thing is fundamentally
> incompatible with how we manage to survive nested exceptions in NMI
> context.
>
> Basically, the whole x86 exception / stack switching logic was already
> borderline impossible (consider taking an MCE in the early NMI path
> where we set up, but have not finished, the re-entrancy stuff), and
SSS
> pushed it over the edge and set it on fire.
>
> And NMI isn't the only problem, the various new virt exceptions #VC and
> #HV are on their own already near impossible, adding SSS again pushes
> the whole thing into clear insanity.
>
> There's a good exposition of the whole trainwreck by Andrew here:
>
> https://www.youtube.com/watch?v=qcORS8CN0ow
>
> (that is, sorry for the youtube link, but Google is failing me in
> finding the actual Google Doc that talk is based on, or even the slide
> deck :/)
>
>
>
> FRED solves all that by:
>
> - removing the stack gap, cc/ip/ss/sp/ssp/gs will all be switched
> atomically and consistently for every transition.
>
> - removing the non-reentrant IST mechanism and replacing it with stack
> levels
>
> - adding an explicit NMI latch
>
> - re-organising the actual shadow stacks and doing away with that busy
> bit thing (I need to re-read the FRED spec on this detail again).
>
>
>
> Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
> sorry.
> > > Trying to understand more what prevents SSS to enable in pre FRED, Is
> > > it better #CP exception
> > > handling with other nested exceptions?
> >
> > SSS took the syscall gap and made it worse -- as in *way* worse.
> >
> > To top it off, the whole SSS busy bit thing is fundamentally
> > incompatible with how we manage to survive nested exceptions in NMI
> > context.
> >
> > Basically, the whole x86 exception / stack switching logic was already
> > borderline impossible (consider taking an MCE in the early NMI path
> > where we set up, but have not finished, the re-entrancy stuff), and
>
> SSS
>
> > pushed it over the edge and set it on fire.
ah I see. SSS takes it to the next level.
> >
> > And NMI isn't the only problem, the various new virt exceptions #VC and
> > #HV are on their own already near impossible, adding SSS again pushes
> > the whole thing into clear insanity.
> >
> > There's a good exposition of the whole trainwreck by Andrew here:
> >
> > https://www.youtube.com/watch?v=qcORS8CN0ow
> >
> > (that is, sorry for the youtube link, but Google is failing me in
> > finding the actual Google Doc that talk is based on, or even the slide
> > deck :/)
I think I got the link:
https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?pli=1
> >
> >
> >
> > FRED solves all that by:
> >
> > - removing the stack gap, cc/ip/ss/sp/ssp/gs will all be switched
> > atomically and consistently for every transition.
> >
> > - removing the non-reentrant IST mechanism and replacing it with stack
> > levels
> >
> > - adding an explicit NMI latch
> >
> > - re-organising the actual shadow stacks and doing away with that busy
> > bit thing (I need to re-read the FRED spec on this detail again).
> >
Thank you for explaining. I will also study the FRED spec and
corresponding kernel
patches posted in the mailing list.
> >
> >
> > Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
> > sorry.
ya, interesting.
Best regards,
Pankaj