2023-12-21 09:11:32

by Yang, Weijiang

Subject: [PATCH v8 00/26] Enable CET Virtualization

Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides two
sub-features, Shadow Stack (SHSTK) and Indirect Branch Tracking (IBT), to
defend against these control-flow subversion attacks.

Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.
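
A minimal model of this RET-time check, written in plain C purely for
illustration (the real check is done in hardware), could look like:

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch of the architectural near-RET check with shadow
 * stacks enabled: the CPU pops the return address from both the regular
 * stack and the shadow stack and raises #CP(NEAR-RET) on a mismatch.
 */
bool shstk_ret_ok(uint64_t data_stack_ra, uint64_t shadow_stack_ra)
{
	/* false means the processor would generate #CP */
	return data_stack_ra == shadow_stack_ra;
}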

Indirect Branch Tracking (IBT):
IBT introduces a new instruction (ENDBRANCH) to mark valid target addresses
of indirect branches (CALL, JMP, etc.). If an indirect branch is executed
and the next instruction is _not_ an ENDBRANCH, the processor generates a
#CP. The instruction behaves as a NOP on platforms that do not support CET.
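
For example, when code is built with gcc/clang -fcf-protection=branch, the
compiler emits an ENDBR64 at every indirect-branch target; a minimal sketch
(illustrative only) is:

/*
 * With -fcf-protection=branch, 'handler' starts with endbr64, so the
 * indirect call below is a valid, tracked transfer. An indirect branch
 * landing on any other instruction raises #CP(ENDBRANCH) when IBT is
 * enabled.
 */
static void handler(void)
{
}

void dispatch(void (*fptr)(void))
{
	fptr();		/* CPU expects ENDBR64 at the target */
}

int main(void)
{
	dispatch(handler);
	return 0;
}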

Dependency:
=====================
The CET native series for user mode shadow stack has already been merged into
the v6.6 mainline kernel.

The first 7 kernel patches are prerequisites for this KVM patch series since
guest CET user mode and supervisor mode states depend on the kernel FPU
framework to properly save/restore the states whenever an FPU context switch
is required, e.g., after VM-Exit and before the vCPU thread exits to userspace.

In this series, a guest supervisor SHSTK mitigation solution isn't introduced
for the Intel platform, therefore the guest SSS_CET bit, CPUID(0x7,1):EDX[bit 18],
is cleared. Check the SDM (Vol 1, Section 17.2.3) for details.

CET states management:
======================
KVM cooperates with the host kernel FPU framework to manage guest CET
registers. With the CET supervisor mode state support in this series, KVM can
save/restore the full set of guest CET xsave-managed states.

CET user mode and supervisor mode xstates, i.e., MSR_IA32_{U_CET,PL3_SSP}
and MSR_IA32_PL{0,1,2}_SSP, depend on the host FPU framework to swap guest
and host xstates. After VM-Exit, guest CET xstates are saved to the guest fpu
area and host CET xstates are loaded from the task/thread context before the
vCPU returns to userspace, and vice versa before VM-Entry. See details in
kvm_{load,put}_guest_fpu(). So guest CET xstate management depends on the CET
xstate bits (U_CET/S_CET) being set in the host XSS MSR.
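
A simplified sketch of those two swap points, modeled on the existing
kvm_{load,put}_guest_fpu() helpers (details and error handling omitted):

/*
 * Simplified sketch only; the real helpers live in arch/x86/kvm/x86.c.
 * fpu_swap_kvm_fpstate() saves the currently active fpstate, including
 * the CET xstate components enabled in host XSS, and loads the other one.
 */
static void load_guest_fpu_sketch(struct kvm_vcpu *vcpu)
{
	/* host xstates saved to the task, guest fpstate loaded */
	fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, true);
}

static void put_guest_fpu_sketch(struct kvm_vcpu *vcpu)
{
	/* guest xstates saved to the guest fpu area, host fpstate restored */
	fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);
}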

CET supervisor mode states are grouped into two categories: XSAVE-managed
and non-XSAVE-managed. The former includes MSR_IA32_PL{0,1,2}_SSP and is
controlled by the CET supervisor mode bit (S_CET) in XSS; the latter consists
of MSR_IA32_S_CET and MSR_IA32_INT_SSP_TAB.

VMX introduces new VMCS fields, {GUEST,HOST}_{S_CET,SSP,INTR_SSP_TABLE}, to
facilitate the guest/host non-XSAVE-managed states. When the VMX CET entry/exit
load bits are set, guest/host MSR_IA32_S_CET, MSR_IA32_INT_SSP_TAB and SSP are
loaded from the equivalent fields at VM-Entry/VM-Exit. With these new fields,
such supervisor states require no additional KVM save/reload actions.
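
For illustration, once the entry/exit load bits are enabled, such a state
lives in its VMCS field and KVM-side accesses reduce to VMCS reads/writes; a
minimal sketch (the real emulation is in the "Emulate read and write to CET
MSRs" patch) might be:

/* Illustrative sketch only: the running guest's SSP lives in GUEST_SSP. */
static u64 get_guest_ssp_sketch(void)
{
	return vmcs_readl(GUEST_SSP);
}

static void set_guest_ssp_sketch(u64 data)
{
	vmcs_writel(GUEST_SSP, data);
}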

Tests:
======================
This series passed the basic CET user shadow stack test and the kernel IBT
test in L1 and L2 guests.
The patch series does impact existing vmx test cases in KVM-unit-tests; the
failures have been fixed here [1].
One new selftest app [2] is introduced for testing CET MSR accessibility.

Note, this series hasn't been tested on AMD platforms yet.

To run the user SHSTK test and the kernel IBT test in a guest, a CET-capable
platform is required, e.g., a Sapphire Rapids server. Follow the steps below
to build the binaries:

1. Host kernel: Apply this series to mainline kernel (>= v6.6) and build.

2. Guest kernel: Pull a mainline kernel (>= v6.6), opt in to the
CONFIG_X86_KERNEL_IBT and CONFIG_X86_USER_SHADOW_STACK options, and build with
a CET-enabled gcc (>= 8.5.0).

3. Apply the CET QEMU patches[3] before building mainline QEMU.

Check kernel selftest test_shadow_stack_64 output:
[INFO] new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO] changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO] ssp is now 7f8c82101000
[OK] Shadow stack pivot
[OK] Shadow stack faults
[INFO] Corrupting shadow stack
[INFO] Generated shadow stack violation successfully
[OK] Shadow stack violation test
[INFO] Gup read -> shstk access success
[INFO] Gup write -> shstk access success
[INFO] Violation from normal write
[INFO] Gup read -> write access success
[INFO] Violation from normal write
[INFO] Gup write -> write access success
[INFO] Cow gup write -> write access success
[OK] Shadow gup test
[INFO] Violation from shstk access
[OK] mprotect() test
[SKIP] Userfaultfd unavailable.
[OK] 32 bit test


Check kernel IBT with dmesg | grep CET:
CET detected: Indirect Branch Tracking enabled

Changes in v8:
=====================
1. Add annotation for fpu_{kernel,user,guest}_cfg fields. [Maxim]
2. Remove CET state bits in kvm_caps.supported_xss if CET is disabled. [Maxim]
3. Prevent 32-bit guest launch if CET is enabled in CPUID. [Maxim, Chao]
4. Use fpu_guest_cfg.default_size to calculate guest default fpstate size. [Rick]
5. Sync CET host states in vmcs12 to vmcs01 before L2 exits to L1. [Maxim]
7. Other minor changes due to review comments. [Rick, Maxim]
8. Rebased to: https://github.com/kvm-x86/linux tag:kvm-x86-next-2023.11.30


[1]: KVM-unit-tests fixup:
https://lore.kernel.org/all/[email protected]/
[2]: Selftest for CET MSRs:
https://lore.kernel.org/all/[email protected]/
[3]: QEMU patch:
https://lore.kernel.org/all/[email protected]/
[4]: v7 patchset:
https://lore.kernel.org/all/[email protected]/

Patch 1-7: Fixup patches for kernel xstate and enable CET supervisor xstate.
Patch 8-11: Cleanup patches for KVM.
Patch 12-15: Enable KVM XSS MSR support.
Patch 16: Fault check for CR4.CET setting.
Patch 17: Report CET MSRs to userspace.
Patch 18: Introduce CET VMCS fields.
Patch 19: Add SHSTK/IBT to KVM-governed framework (to be deprecated).
Patch 20: Emulate CET MSR access.
Patch 21: Handle SSP at entry/exit to SMM.
Patch 22: Set up CET MSR interception.
Patch 23: Initialize host constant supervisor state.
Patch 24: Enable CET virtualization settings.
Patch 25-26: Add CET nested support.


Sean Christopherson (4):
x86/fpu/xstate: Always preserve non-user xfeatures/flags in
__state_perm
KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
KVM: x86: Report XSS as to-be-saved if there are supported features
KVM: x86: Load guest FPU state when access XSAVE-managed MSRs

Yang Weijiang (22):
x86/fpu/xstate: Refine CET user xstate bit enabling
x86/fpu/xstate: Add CET supervisor mode state support
x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
x86/fpu/xstate: Create guest fpstate with guest specific config
x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal fpstate
KVM: x86: Rename kvm_{g,s}et_msr() to menifest emulation operations
KVM: x86: Refine xsave-managed guest register/MSR reset handling
KVM: x86: Add kvm_msr_{read,write}() helpers
KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
KVM: x86: Initialize kvm_caps.supported_xss
KVM: x86: Add fault checks for guest CR4.CET setting
KVM: x86: Report KVM supported CET MSRs as to-be-saved
KVM: VMX: Introduce CET VMCS fields and control bits
KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
KVM: VMX: Emulate read and write to CET MSRs
KVM: x86: Save and reload SSP to/from SMRAM
KVM: VMX: Set up interception for CET MSRs
KVM: VMX: Set host constant supervisor states to VMCS fields
KVM: x86: Enable CET virtualization for VMX and advertise to userspace
KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
KVM: nVMX: Enable CET support for nested guest

arch/x86/include/asm/fpu/types.h | 16 +-
arch/x86/include/asm/fpu/xstate.h | 11 +-
arch/x86/include/asm/kvm_host.h | 12 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 8 +
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/fpu/core.c | 117 ++++++++++--
arch/x86/kernel/fpu/xstate.c | 44 ++++-
arch/x86/kernel/fpu/xstate.h | 3 +
arch/x86/kvm/cpuid.c | 80 ++++++---
arch/x86/kvm/governed_features.h | 2 +
arch/x86/kvm/smm.c | 12 +-
arch/x86/kvm/smm.h | 2 +-
arch/x86/kvm/vmx/capabilities.h | 10 ++
arch/x86/kvm/vmx/nested.c | 97 ++++++++--
arch/x86/kvm/vmx/nested.h | 5 +
arch/x86/kvm/vmx/vmcs12.c | 6 +
arch/x86/kvm/vmx/vmcs12.h | 14 +-
arch/x86/kvm/vmx/vmx.c | 110 +++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +-
arch/x86/kvm/x86.c | 259 +++++++++++++++++++++++++--
arch/x86/kvm/x86.h | 28 +++
22 files changed, 746 insertions(+), 98 deletions(-)


base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
--
2.39.3



2023-12-21 09:12:08

by Yang, Weijiang

Subject: [PATCH v8 04/26] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features
that can be optionally enabled by kernel components. This is similar to
XFEATURE_MASK_USER_DYNAMIC in that it contains optional xfeatures that
allow the FPU buffer to be dynamically sized. The difference is that the
KERNEL variant contains supervisor features and will be enabled by kernel
components that need them, not directly by the user. Currently it is used
by KVM to configure the guest's dedicated fpstate, e.g., for calculating
the xfeature masks and fpstate storage size.

The kernel dynamic xfeatures currently only contain XFEATURE_CET_KERNEL,
which is supported by the host, i.e., enabled in the kernel XSS MSR setting,
but the relevant CPU feature, supervisor shadow stack, is not enabled in the
host kernel, so it can be omitted from the normal fpstate by default.

Remove the kernel dynamic feature from fpu_kernel_cfg.default_features
so that the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors
can be optimized by HW for normal fpstate.

Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/fpu/xstate.h | 5 ++++-
arch/x86/kernel/fpu/xstate.c | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 3b4a038d3c57..a212d3851429 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -46,9 +46,12 @@
#define XFEATURE_MASK_USER_RESTORE \
(XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)

-/* Features which are dynamically enabled for a process on request */
+/* Features which are dynamically enabled per userspace request */
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA

+/* Features which are dynamically enabled per kernel side request */
+#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
+
/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
XFEATURE_MASK_CET_USER | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 03e166a87d61..ca4b83c142eb 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
/* Clean out dynamic features from default */
fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+ fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;

fpu_user_cfg.default_features = fpu_user_cfg.max_features;
fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
--
2.39.3


2023-12-21 09:12:19

by Yang, Weijiang

Subject: [PATCH v8 03/26] x86/fpu/xstate: Add CET supervisor mode state support

Add supervisor mode state support within the FPU xstate management framework.
Although supervisor shadow stack is not enabled/used today in the kernel, KVM
requires the support because when KVM advertises the shadow stack feature to a
guest, architecturally it claims the support for both user and supervisor
modes for guest OSes (Linux or non-Linux).

CET supervisor state includes not only the PL{0,1,2}_SSP MSRs but also the
IA32_S_CET MSR; the latter is not xsave-managed. In the virtualization world,
guest IA32_S_CET is saved to/loaded from the VM control structure. With
supervisor xstate support, guest supervisor mode shadow stack state can be
properly saved/restored when 1) guest/host FPU context is swapped or 2) the
vCPU thread is scheduled out/in.

The alternative is to enable it in the KVM domain, but the KVM maintainers
NAKed that solution. The external discussion can be found at [*]; it ended up
with adding the support in the kernel instead of in KVM.

Note, in the KVM case, guest CET supervisor state, i.e., the IA32_PL{0,1,2}_SSP
MSRs, is preserved after VM-Exit until the host/guest fpstates are swapped, but
since host supervisor shadow stack is disabled, the preserved MSRs won't hurt
the host.

[*]: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 14 ++++++++++++--
arch/x86/include/asm/fpu/xstate.h | 6 +++---
arch/x86/kernel/fpu/xstate.c | 6 +++++-
3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb810074f1e7..c6fd13a17205 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -116,7 +116,7 @@ enum xfeature {
XFEATURE_PKRU,
XFEATURE_PASID,
XFEATURE_CET_USER,
- XFEATURE_CET_KERNEL_UNUSED,
+ XFEATURE_CET_KERNEL,
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
@@ -139,7 +139,7 @@ enum xfeature {
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
-#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
@@ -264,6 +264,16 @@ struct cet_user_state {
u64 user_ssp;
};

+/*
+ * State component 12 is Control-flow Enforcement supervisor states
+ */
+struct cet_supervisor_state {
+ /* supervisor ssp pointers */
+ u64 pl0_ssp;
+ u64 pl1_ssp;
+ u64 pl2_ssp;
+};
+
/*
* State component 15: Architectural LBR configuration state.
* The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index d4427b88ee12..3b4a038d3c57 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -51,7 +51,8 @@

/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
- XFEATURE_MASK_CET_USER)
+ XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)

/*
* A supervisor state component may not always contain valuable information,
@@ -78,8 +79,7 @@
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
- XFEATURE_MASK_CET_KERNEL)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)

/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index f6b98693da59..03e166a87d61 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -51,7 +51,7 @@ static const char *xfeature_names[] =
"Protection Keys User registers",
"PASID state",
"Control-flow User registers",
- "Control-flow Kernel registers (unused)",
+ "Control-flow Kernel registers",
"unknown xstate feature",
"unknown xstate feature",
"unknown xstate feature",
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_OSPKE,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -277,6 +278,7 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
print_xstate_feature(XFEATURE_MASK_CET_USER);
+ print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}
@@ -346,6 +348,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID | \
XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL | \
XFEATURE_MASK_XTILE)

/*
@@ -546,6 +549,7 @@ static bool __init check_xstate_against_struct(int nr)
case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+ case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
default:
XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
--
2.39.3


2023-12-21 09:13:06

by Yang, Weijiang

Subject: [PATCH v8 07/26] x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal fpstate

Kernel dynamic xfeatures are now __ONLY__ enabled for guest fpstate, i.e.,
never for normal kernel fpstate. The bits are added when the guest FPU config
is initialized, and guest fpstate is allocated with fpstate->is_guest set to
%true.

For normal fpstate, the bits should have been removed when the kernel FPU
config settings were initialized, so warn if the kernel detects that a normal
fpstate's xfeatures contain kernel dynamic xfeatures before XSAVES is
executed.

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
---
arch/x86/kernel/fpu/xstate.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 3518fb26d06b..83ebf1e1cbb4 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -185,6 +185,9 @@ static inline void os_xsave(struct fpstate *fpstate)
WARN_ON_FPU(!alternatives_patched);
xfd_validate_state(fpstate, mask, false);

+ WARN_ON_FPU(!fpstate->is_guest &&
+ (mask & XFEATURE_MASK_KERNEL_DYNAMIC));
+
XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);

/* We should never fault when copying to a kernel buffer: */
--
2.39.3


2023-12-21 09:13:29

by Yang, Weijiang

Subject: [PATCH v8 06/26] x86/fpu/xstate: Create guest fpstate with guest specific config

Use fpu_guest_cfg to calculate guest fpstate settings, and open code
__fpstate_reset() to avoid using the kernel FPU config.

The configuration steps below are currently enforced to set up guest fpstate:
1) Kernel sets up guest FPU settings in fpu__init_system_xstate().
2) User space sets vCPU thread group xstate permits via arch_prctl().
3) User space creates guest fpstate via __fpu_alloc_init_guest_fpstate()
for vcpu thread.
4. User space enables guest dynamic xfeatures and re-allocates the guest
fpstate.

By adding the kernel dynamic xfeatures in #1 and #2 above, the guest xstate
area size is expanded to hold (fpu_kernel_cfg.default_features | kernel dynamic
xfeatures | user dynamic xfeatures), so that host xsaves/xrstors can operate
on all guest xfeatures.

The user_* fields remain unchanged for compatibility with KVM uAPIs.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kernel/fpu/core.c | 47 ++++++++++++++++++++++++++++++--------
1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 976f519721e2..0e0bf151418f 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -250,8 +250,6 @@ void fpu_reset_from_exception_fixup(void)
}

#if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
-
static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
{
struct fpu_state_perm *fpuperm;
@@ -272,25 +270,54 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
}

-bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
{
+ bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+ unsigned int gfpstate_size, size;
struct fpstate *fpstate;
- unsigned int size;

- size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
+ /*
+ * fpu_guest_cfg.default_size is initialized to hold all enabled
+ * xfeatures except the user dynamic xfeatures. If the user dynamic
+ * xfeatures are enabled, the guest fpstate will be re-allocated to
+ * hold all guest enabled xfeatures, so omit user dynamic xfeatures
+ * here.
+ */
+ size = fpu_guest_cfg.default_size +
+ ALIGN(offsetof(struct fpstate, regs), 64);
+ gfpstate_size = xstate_calculate_size(fpu_guest_cfg.default_features,
+ compacted);
+
fpstate = vzalloc(size);
if (!fpstate)
- return false;
+ return NULL;
+ /*
+ * Initialize sizes and feature masks, using fpu_user_cfg.*
+ * for the user_* settings for compatibility with existing uAPIs.
+ */
+ fpstate->size = gfpstate_size;
+ fpstate->xfeatures = fpu_guest_cfg.default_features;
+ fpstate->user_size = fpu_user_cfg.default_size;
+ fpstate->user_xfeatures = fpu_user_cfg.default_features;
+ fpstate->xfd = 0;

- /* Leave xfd to 0 (the reset value defined by spec) */
- __fpstate_reset(fpstate, 0);
fpstate_init_user(fpstate);
fpstate->is_valloc = true;
fpstate->is_guest = true;

gfpu->fpstate = fpstate;
- gfpu->xfeatures = fpu_user_cfg.default_features;
- gfpu->perm = fpu_user_cfg.default_features;
+ gfpu->xfeatures = fpu_guest_cfg.default_features;
+ gfpu->perm = fpu_guest_cfg.default_features;
+
+ return fpstate;
+}
+
+bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+{
+ struct fpstate *fpstate;
+
+ fpstate = __fpu_alloc_init_guest_fpstate(gfpu);
+
+ if (!fpstate)
+ return false;

/*
* KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
--
2.39.3


2023-12-21 09:13:56

by Yang, Weijiang

Subject: [PATCH v8 05/26] x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration

Define a new fpu_guest_cfg to hold all guest FPU settings so that they can
differ from the generic kernel FPU settings, e.g., enabling CET supervisor
xstate by default for guest fpstate while it remains disabled in the kernel
FPU config.

The kernel dynamic xfeatures are now specifically used by guest fpstate, so
add the mask to the guest defaults so that guest_perm.__state_perm ==
(fpu_kernel_cfg.default_features | XFEATURE_MASK_KERNEL_DYNAMIC). And if the
guest fpstate is re-allocated to hold user dynamic xfeatures, the resulting
permissions are consumed before calculating the new guest fpstate size.

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 2 +-
arch/x86/kernel/fpu/core.c | 70 ++++++++++++++++++++++++++++++--
arch/x86/kernel/fpu/xstate.c | 10 +++++
3 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c6fd13a17205..306825ad6bc0 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -602,6 +602,6 @@ struct fpu_state_config {
};

/* FPU state configuration information */
-extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;
+extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg, fpu_guest_cfg;

#endif /* _ASM_X86_FPU_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a21a4d0ecc34..976f519721e2 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -33,10 +33,67 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
DEFINE_PER_CPU(u64, xfd_state);
#endif

-/* The FPU state configuration data for kernel and user space */
+/* The FPU state configuration data for kernel, user space and guest. */
+/*
+ * kernel FPU config:
+ *
+ * all known and CPU supported user and supervisor features except
+ * - independent kernel features (XFEATURE_LBR)
+ * @fpu_kernel_cfg.max_features;
+ *
+ * all known and CPU supported user and supervisor features except
+ * - dynamic kernel features (CET_S)
+ * - independent kernel features (XFEATURE_LBR)
+ * - dynamic userspace features (AMX state)
+ * @fpu_kernel_cfg.default_features;
+ *
+ * size of compacted buffer with 'fpu_kernel_cfg.max_features'
+ * @fpu_kernel_cfg.max_size;
+ *
+ * size of compacted buffer with 'fpu_kernel_cfg.default_features'
+ * @fpu_kernel_cfg.default_size;
+ */
struct fpu_state_config fpu_kernel_cfg __ro_after_init;
+
+/*
+ * user FPU config:
+ *
+ * all known and CPU supported user features
+ * @fpu_user_cfg.max_features;
+ *
+ * all known and CPU supported user features except
+ * - dynamic userspace features (AMX state)
+ * @fpu_user_cfg.default_features;
+ *
+ * size of non-compacted buffer with 'fpu_user_cfg.max_features'
+ * @fpu_user_cfg.max_size;
+ *
+ * size of non-compacted buffer with 'fpu_user_cfg.default_features'
+ * @fpu_user_cfg.default_size;
+ */
struct fpu_state_config fpu_user_cfg __ro_after_init;

+/*
+ * guest FPU config:
+ *
+ * all known and CPU supported user and supervisor features except
+ * - independent kernel features (XFEATURE_LBR)
+ * @fpu_guest_cfg.max_features;
+ *
+ * all known and CPU supported user and supervisor features except
+ * - independent kernel features (XFEATURE_LBR)
+ * - dynamic userspace features (AMX state)
+ * @fpu_guest_cfg.default_features;
+ *
+ * size of compacted buffer with 'fpu_guest_cfg.max_features'
+ * @fpu_guest_cfg.max_size;
+ *
+ * size of compacted buffer with 'fpu_guest_cfg.default_features'
+ * @fpu_guest_cfg.default_size;
+ */
+
+struct fpu_state_config fpu_guest_cfg __ro_after_init;
+
/*
* Represents the initial FPU state. It's mostly (but not completely) zeroes,
* depending on the FPU hardware format:
@@ -536,8 +593,15 @@ void fpstate_reset(struct fpu *fpu)
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
fpu->perm.__state_size = fpu_kernel_cfg.default_size;
fpu->perm.__user_state_size = fpu_user_cfg.default_size;
- /* Same defaults for guests */
- fpu->guest_perm = fpu->perm;
+
+ /* Guest permission settings */
+ fpu->guest_perm.__state_perm = fpu_guest_cfg.default_features;
+ fpu->guest_perm.__state_size = fpu_guest_cfg.default_size;
+ /*
+ * Set guest's __user_state_size to fpu_user_cfg.default_size so that
+ * existing uAPIs can still work.
+ */
+ fpu->guest_perm.__user_state_size = fpu_user_cfg.default_size;
}

static inline void fpu_inherit_perms(struct fpu *dst_fpu)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index ca4b83c142eb..9cbdc83d1eab 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -681,6 +681,7 @@ static int __init init_xstate_size(void)
{
/* Recompute the context size for enabled features: */
unsigned int user_size, kernel_size, kernel_default_size;
+ unsigned int guest_default_size;
bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);

/* Uncompacted user space size */
@@ -702,13 +703,18 @@ static int __init init_xstate_size(void)
kernel_default_size =
xstate_calculate_size(fpu_kernel_cfg.default_features, compacted);

+ guest_default_size =
+ xstate_calculate_size(fpu_guest_cfg.default_features, compacted);
+
if (!paranoid_xstate_size_valid(kernel_size))
return -EINVAL;

fpu_kernel_cfg.max_size = kernel_size;
fpu_user_cfg.max_size = user_size;
+ fpu_guest_cfg.max_size = kernel_size;

fpu_kernel_cfg.default_size = kernel_default_size;
+ fpu_guest_cfg.default_size = guest_default_size;
fpu_user_cfg.default_size =
xstate_calculate_size(fpu_user_cfg.default_features, false);

@@ -829,6 +835,10 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
fpu_user_cfg.default_features = fpu_user_cfg.max_features;
fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;

+ fpu_guest_cfg.max_features = fpu_kernel_cfg.max_features;
+ fpu_guest_cfg.default_features = fpu_guest_cfg.max_features;
+ fpu_guest_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+
/* Store it for paranoia check at the end */
xfeatures = fpu_kernel_cfg.max_features;

--
2.39.3


2023-12-21 09:14:30

by Yang, Weijiang

Subject: [PATCH v8 13/26] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the currently required xstate
size when the guest's XSS MSR is modified.
CPUID.(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
xstate features in (XCR0 | IA32_XSS). The CPUID value can be used by the
guest before it allocates a sufficiently sized xsave buffer.
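
For example, a guest could size its XSAVES buffer from that leaf roughly as
sketched below (illustrative user-level C; the helper name is made up):

#include <cpuid.h>
#include <stdint.h>

/*
 * Illustrative sketch: after changing XCR0 or IA32_XSS, re-query
 * CPUID.(EAX=0DH,ECX=1).EBX to learn the XSAVES buffer size needed for
 * all currently enabled (XCR0 | IA32_XSS) features.
 */
uint32_t xsaves_buffer_size(void)
{
	uint32_t eax, ebx, ecx, edx;

	__cpuid_count(0xd, 1, eax, ebx, ecx, edx);
	return ebx;	/* required size in bytes */
}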

Note, KVM does not yet support any XSS based features, i.e. supported_xss
is guaranteed to be zero at this time.

Opportunistically modify XSS write access logic as:
If XSAVES is not enabled in the guest CPUID, forbid setting IA32_XSS msr
to anything but 0, even if the write is host initiated.

Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Zhang Yi Z <[email protected]>
Signed-off-by: Zhang Yi Z <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/cpuid.c | 15 ++++++++++++++-
arch/x86/kvm/x86.c | 16 ++++++++++++----
3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 40dd796ea085..6efaaaa15945 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -772,7 +772,6 @@ struct kvm_vcpu_arch {
bool at_instruction_boundary;
bool tpr_access_reporting;
bool xfd_no_write_intercept;
- u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
u64 perf_capabilities;
@@ -828,6 +827,8 @@ struct kvm_vcpu_arch {

u64 xcr0;
u64 guest_supported_xcr0;
+ u64 guest_supported_xss;
+ u64 ia32_xss;

struct kvm_pio_request pio;
void *pio_data;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index acc360c76318..3ab133530573 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
best = cpuid_entry2_find(entries, nent, 0xD, 1);
if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
- best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
+ best->ebx = xstate_required_size(vcpu->arch.xcr0 |
+ vcpu->arch.ia32_xss, true);

best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
if (kvm_hlt_in_guest(vcpu->kvm) && best &&
@@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
}

+static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
+ if (!best)
+ return 0;
+
+ return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
+}
+
static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
{
#ifdef CONFIG_KVM_HYPERV
@@ -362,6 +374,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
}

vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
+ vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);

kvm_update_pv_runtime(vcpu);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b3a39886e418..7b7a15aab3aa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3924,20 +3924,28 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.ia32_tsc_adjust_msr += adj;
}
break;
- case MSR_IA32_XSS:
- if (!msr_info->host_initiated &&
- !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
+ case MSR_IA32_XSS: {
+ /*
+ * If KVM reported support of XSS MSR, even guest CPUID doesn't
+ * support XSAVES, still allow userspace to set default value(0)
+ * to this MSR.
+ */
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
+ !(msr_info->host_initiated && data == 0))
return 1;
/*
* KVM supports exposing PT to the guest, but does not support
* IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
* XSAVES/XRSTORS to save/restore PT MSRs.
*/
- if (data & ~kvm_caps.supported_xss)
+ if (data & ~vcpu->arch.guest_supported_xss)
return 1;
+ if (vcpu->arch.ia32_xss == data)
+ break;
vcpu->arch.ia32_xss = data;
kvm_update_cpuid_runtime(vcpu);
break;
+ }
case MSR_SMI_COUNT:
if (!msr_info->host_initiated)
return 1;
--
2.39.3


2023-12-21 09:14:34

by Yang, Weijiang

Subject: [PATCH v8 10/26] KVM: x86: Refine xsave-managed guest register/MSR reset handling

Tweak the code a bit to facilitate resetting more xstate components in
the future, e.g., adding CET's xstate-managed MSRs.

No functional change intended.

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0e7dc3398293..3671f4868d1b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12205,6 +12205,11 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
static_branch_dec(&kvm_has_noapic_vcpu);
}

+static inline bool is_xstate_reset_needed(void)
+{
+ return kvm_cpu_cap_has(X86_FEATURE_MPX);
+}
+
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct kvm_cpuid_entry2 *cpuid_0x1;
@@ -12262,7 +12267,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;

- if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
+ if (vcpu->arch.guest_fpu.fpstate && is_xstate_reset_needed()) {
struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;

/*
@@ -12272,8 +12277,12 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (init_event)
kvm_put_guest_fpu(vcpu);

- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
+ if (kvm_cpu_cap_has(X86_FEATURE_MPX)) {
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_BNDREGS);
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_BNDCSR);
+ }

if (init_event)
kvm_load_guest_fpu(vcpu);
--
2.39.3


2023-12-21 09:14:58

by Yang, Weijiang

Subject: [PATCH v8 08/26] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data

From: Sean Christopherson <[email protected]>

Rework and rename cpuid_get_supported_xcr0() to explicitly operate on
vCPU state, i.e. on a vCPU's CPUID state, now that the only usage of
the helper is to retrieve a vCPU's already-set CPUID.

Prior to commit 275a87244ec8 ("KVM: x86: Don't adjust guest's CPUID.0x12.1
(allowed SGX enclave XFRM)"), KVM incorrectly fudged guest CPUID at runtime,
which in turn necessitated massaging the incoming CPUID state for
KVM_SET_CPUID{2} so as not to run afoul of kvm_cpuid_check_equal().
I.e. KVM also invoked cpuid_get_supported_xcr0() with the incoming CPUID
state, and thus without an explicit vCPU object.

Opportunistically move the helper below kvm_update_cpuid_runtime() to make
it harder to repeat the mistake of querying supported XCR0 for runtime
updates.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/cpuid.c | 33 ++++++++++++++++-----------------
1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 294e5bd5f8a0..624954203b40 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -247,21 +247,6 @@ void kvm_update_pv_runtime(struct kvm_vcpu *vcpu)
vcpu->arch.pv_cpuid.features = best->eax;
}

-/*
- * Calculate guest's supported XCR0 taking into account guest CPUID data and
- * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
- */
-static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
-{
- struct kvm_cpuid_entry2 *best;
-
- best = cpuid_entry2_find(entries, nent, 0xd, 0);
- if (!best)
- return 0;
-
- return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
-}
-
static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
int nent)
{
@@ -312,6 +297,21 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);

+/*
+ * Calculate guest's supported XCR0 taking into account guest CPUID data and
+ * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
+ */
+static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 0);
+ if (!best)
+ return 0;
+
+ return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
+}
+
static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
{
#ifdef CONFIG_KVM_HYPERV
@@ -361,8 +361,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_apic_set_version(vcpu);
}

- vcpu->arch.guest_supported_xcr0 =
- cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
+ vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);

kvm_update_pv_runtime(vcpu);

--
2.39.3


2023-12-21 09:15:21

by Yang, Weijiang

Subject: [PATCH v8 14/26] KVM: x86: Initialize kvm_caps.supported_xss

Set the initial kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS) if
XSAVES is supported. host_xss contains the host supported xstate feature bits
for thread FPU context switching, while KVM_SUPPORTED_XSS includes all
KVM-enabled XSS feature bits. The resulting value represents the supervisor
xstates that are available to the guest and are backed by the host FPU
framework for swapping {guest,host} XSAVE-managed registers/MSRs.

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7b7a15aab3aa..f50c5a523b92 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -226,6 +226,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)

+#define KVM_SUPPORTED_XSS 0
+
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);

@@ -9715,12 +9717,13 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
}
+ if (boot_cpu_has(X86_FEATURE_XSAVES)) {
+ rdmsrl(MSR_IA32_XSS, host_xss);
+ kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+ }

rdmsrl_safe(MSR_EFER, &host_efer);

- if (boot_cpu_has(X86_FEATURE_XSAVES))
- rdmsrl(MSR_IA32_XSS, host_xss);
-
kvm_init_pmu_capability(ops->pmu_ops);

if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
--
2.39.3


2023-12-21 09:16:19

by Yang, Weijiang

Subject: [PATCH v8 19/26] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"

Use the governed feature framework to track whether the X86_FEATURE_SHSTK
and X86_FEATURE_IBT features can be used by userspace and the guest, i.e.,
the features can be used iff both KVM and the guest CPUID support them.

TODO: remove this patch once Sean's refactor to "KVM-governed" framework
is upstreamed. See the work here [*].

[*]: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/governed_features.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
index ad463b1ed4e4..daf0c0a3e29c 100644
--- a/arch/x86/kvm/governed_features.h
+++ b/arch/x86/kvm/governed_features.h
@@ -17,6 +17,8 @@ KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
KVM_GOVERNED_X86_FEATURE(VGIF)
KVM_GOVERNED_X86_FEATURE(VNMI)
KVM_GOVERNED_X86_FEATURE(LAM)
+KVM_GOVERNED_X86_FEATURE(SHSTK)
+KVM_GOVERNED_X86_FEATURE(IBT)

#undef KVM_GOVERNED_X86_FEATURE
#undef KVM_GOVERNED_FEATURE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b2f6bcf3bf9b..29a0fd3e83c5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7764,6 +7764,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)

kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);

vmx_setup_uret_msrs(vmx);

--
2.39.3


2023-12-21 09:16:49

by Yang, Weijiang

Subject: [PATCH v8 17/26] KVM: x86: Report KVM supported CET MSRs as to-be-saved

Add CET MSRs to the list of MSRs reported to userspace if the feature,
i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.

SSP can only be read via RDSSP. Writing even requires destructive and
potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
for the GUEST_SSP field of the VMCS.
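
For context, a userspace VMM would save/restore the reported MSRs, including
the MSR_KVM_SSP pseudo-MSR, through the usual KVM_GET_MSRS/KVM_SET_MSRS path;
a minimal sketch (error handling and MSR-index-list discovery elided) might
look like:

#include <linux/kvm.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

#ifndef MSR_KVM_SSP
#define MSR_KVM_SSP	0x4b564d09
#endif

/* Illustrative sketch: read the guest SSP through the pseudo-MSR. */
uint64_t read_guest_ssp(int vcpu_fd)
{
	struct kvm_msrs *msrs;
	uint64_t val = 0;

	msrs = calloc(1, sizeof(*msrs) + sizeof(struct kvm_msr_entry));
	msrs->nmsrs = 1;
	msrs->entries[0].index = MSR_KVM_SSP;

	if (ioctl(vcpu_fd, KVM_GET_MSRS, msrs) == 1)
		val = msrs->entries[0].data;

	free(msrs);
	return val;
}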

Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/x86.c | 18 ++++++++++++++++++
3 files changed, 21 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..9864bbcf2470 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -58,6 +58,7 @@
#define MSR_KVM_ASYNC_PF_INT 0x4b564d06
#define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
#define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
+#define MSR_KVM_SSP 0x4b564d09

struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d21f55f323ea..b2f6bcf3bf9b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7007,6 +7007,8 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
case MSR_AMD64_TSC_RATIO:
/* This is AMD only. */
return false;
+ case MSR_KVM_SSP:
+ return kvm_cpu_cap_has(X86_FEATURE_SHSTK);
default:
return true;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b418e4f5277b..a7368adad6b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1476,6 +1476,9 @@ static const u32 msrs_to_save_base[] = {

MSR_IA32_XFD, MSR_IA32_XFD_ERR,
MSR_IA32_XSS,
+ MSR_IA32_U_CET, MSR_IA32_S_CET,
+ MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP,
+ MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB,
};

static const u32 msrs_to_save_pmu[] = {
@@ -1579,6 +1582,7 @@ static const u32 emulated_msrs_all[] = {

MSR_K7_HWCR,
MSR_KVM_POLL_CONTROL,
+ MSR_KVM_SSP,
};

static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -7428,6 +7432,20 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!kvm_caps.supported_xss)
return;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+ !kvm_cpu_cap_has(X86_FEATURE_IBT))
+ return;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!kvm_cpu_cap_has(X86_FEATURE_LM))
+ return;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ return;
+ break;
default:
break;
}
--
2.39.3


2023-12-21 09:16:57

by Yang, Weijiang

Subject: [PATCH v8 02/26] x86/fpu/xstate: Refine CET user xstate bit enabling

Remove XFEATURE_CET_USER entry from dependency array as the entry doesn't
reflect true dependency between CET features and the user xstate bit.
Enable the bit in fpu_kernel_cfg.max_features when either SHSTK or IBT is
available.

Both user mode shadow stack and indirect branch tracking features depend
on XFEATURE_CET_USER bit in XSS to automatically save/restore user mode
xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever necessary.

Note, the issue, i.e., CPUID enumerating IBT but not SHSTK, results from the
CET KVM series, which synthesizes guest CPUIDs based on userspace settings;
in the real world the case is rare. In other words, the existing dependency
check is correct when only user mode SHSTK is available.

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Tested-by: Rick Edgecombe <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 07911532b108..f6b98693da59 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_OSPKE,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
- [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
}

+ /*
+ * CET user mode xstate bit has been cleared by above sanity check.
+ * Now pick it up if either SHSTK or IBT is available. Either feature
+ * depends on the xstate bit to save/restore user mode states.
+ */
+ if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
+ fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
+
if (!cpu_feature_enabled(X86_FEATURE_XFD))
fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;

--
2.39.3


2023-12-21 09:17:54

by Yang, Weijiang

Subject: [PATCH v8 24/26] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

Expose CET features to the guest if KVM/host can support them, and clear the
CPUID feature bits if KVM/host cannot.

Set the CPUID feature bits so that CET features are available in guest CPUID.
Add CR4.CET bit support in order to allow the guest to set the CET master
control bit.

Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
KVM does not support emulating CET.

The CET load-bits in VM_ENTRY/VM_EXIT control fields should be set to make
guest CET xstates isolated from host's.

On platforms with VMX_BASIC[bit 56] == 0, injecting a #CP at VM entry with an
error code will fail, while if VMX_BASIC[bit 56] == 1, #CP injection with or
without an error code is allowed. Disable the CET feature bits if the MSR bit
is cleared so that a nested VMM can inject #CP if and only if
VMX_BASIC[bit 56] == 1.

Don't expose CET feature if either of {U,S}_CET xstate bits is cleared
in host XSS or if XSAVES isn't supported.

CET MSR contents after reset, power-up and INIT are set to 0s, so clear the
guest fpstate fields so that the guest MSRs are reset to 0s after these
events.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kvm/cpuid.c | 19 +++++++++++++++++--
arch/x86/kvm/vmx/capabilities.h | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 29 ++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 ++++--
arch/x86/kvm/x86.c | 31 +++++++++++++++++++++++++++++--
arch/x86/kvm/x86.h | 3 +++
8 files changed, 89 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6efaaaa15945..161d0552be5f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -134,7 +134,7 @@
| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
| X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
| X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
- | X86_CR4_LAM_SUP))
+ | X86_CR4_LAM_SUP | X86_CR4_CET))

#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d51e1850ed0..233e00c01e62 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1102,6 +1102,7 @@
#define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
#define VMX_BASIC_MEM_TYPE_WB 6LLU
#define VMX_BASIC_INOUT 0x0040000000000000LLU
+#define VMX_BASIC_NO_HW_ERROR_CODE_CC 0x0100000000000000LLU

/* Resctrl MSRs: */
/* - Intel: */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index cfc0ac8ddb4a..18d1a0eb0f64 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -665,7 +665,7 @@ void kvm_set_cpu_caps(void)
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
- F(SGX_LC) | F(BUS_LOCK_DETECT)
+ F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
);
/* Set LA57 based on hardware capability. */
if (cpuid_ecx(7) & F(LA57))
@@ -683,7 +683,8 @@ void kvm_set_cpu_caps(void)
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
- F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
+ F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
+ F(IBT)
);

/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
+ /*
+ * Don't use boot_cpu_has() to check availability of IBT because the
+ * feature bit is cleared in boot_cpu_data when ibt=off is applied
+ * in host cmdline.
+ *
+ * As currently there's no HW bug which requires disabling the IBT
+ * feature while the CPU can enumerate it, the host cmdline option
+ * ibt=off is most likely applied for administrative reasons on the
+ * host side, so KVM refers to the CPU CPUID enumeration to enable the
+ * feature. In the future, if some bug actually requires honoring
+ * ibt=off, enforce an additional check here to disable the support in
+ * KVM.
+ */
+ if (cpuid_edx(7) & F(IBT))
+ kvm_cpu_cap_set(X86_FEATURE_IBT);

kvm_cpu_cap_mask(CPUID_7_1_EAX,
F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index ee8938818c8a..e12bc233d88b 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -79,6 +79,12 @@ static inline bool cpu_has_vmx_basic_inout(void)
return (((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
}

+static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
+{
+ return ((u64)vmcs_config.basic_cap << 32) &
+ VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
static inline bool cpu_has_virtual_nmis(void)
{
return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e9c0b571b3bb..c802e790c0d5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2609,6 +2609,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
{ VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
{ VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
{ VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
+ { VM_ENTRY_LOAD_CET_STATE, VM_EXIT_LOAD_CET_STATE },
};

memset(vmcs_conf, 0, sizeof(*vmcs_conf));
@@ -4934,6 +4935,15 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */

+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ vmcs_writel(GUEST_SSP, 0);
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
+ kvm_cpu_cap_has(X86_FEATURE_IBT))
+ vmcs_writel(GUEST_S_CET, 0);
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+ IS_ENABLED(CONFIG_X86_64))
+ vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
+
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);

vpid_sync_context(vmx->vpid);
@@ -6353,6 +6363,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);

+ if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE) {
+ pr_err("S_CET = 0x%016lx\n", vmcs_readl(GUEST_S_CET));
+ pr_err("SSP = 0x%016lx\n", vmcs_readl(GUEST_SSP));
+ pr_err("INTR SSP TABLE = 0x%016lx\n",
+ vmcs_readl(GUEST_INTR_SSP_TABLE));
+ }
pr_err("*** Host State ***\n");
pr_err("RIP = 0x%016lx RSP = 0x%016lx\n",
vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
@@ -6430,6 +6446,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
pr_err("Virtual processor ID = 0x%04x\n",
vmcs_read16(VIRTUAL_PROCESSOR_ID));
+ if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE) {
+ pr_err("S_CET = 0x%016lx\n", vmcs_readl(HOST_S_CET));
+ pr_err("SSP = 0x%016lx\n", vmcs_readl(HOST_SSP));
+ pr_err("INTR SSP TABLE = 0x%016lx\n",
+ vmcs_readl(HOST_INTR_SSP_TABLE));
+ }
}

/*
@@ -7966,7 +7988,6 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_set(X86_FEATURE_UMIP);

/* CPUID 0xD.1 */
- kvm_caps.supported_xss = 0;
if (!cpu_has_vmx_xsaves())
kvm_cpu_cap_clear(X86_FEATURE_XSAVES);

@@ -7978,6 +7999,12 @@ static __init void vmx_set_cpu_caps(void)

if (cpu_has_vmx_waitpkg())
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
+
+ if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
+ !cpu_has_vmx_basic_no_hw_errcode()) {
+ kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+ kvm_cpu_cap_clear(X86_FEATURE_IBT);
+ }
}

static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e3b0985bb74a..d0cad2624564 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -484,7 +484,8 @@ static inline u8 vmx_get_rvi(void)
VM_ENTRY_LOAD_IA32_EFER | \
VM_ENTRY_LOAD_BNDCFGS | \
VM_ENTRY_PT_CONCEAL_PIP | \
- VM_ENTRY_LOAD_IA32_RTIT_CTL)
+ VM_ENTRY_LOAD_IA32_RTIT_CTL | \
+ VM_ENTRY_LOAD_CET_STATE)

#define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
(VM_EXIT_SAVE_DEBUG_CONTROLS | \
@@ -506,7 +507,8 @@ static inline u8 vmx_get_rvi(void)
VM_EXIT_LOAD_IA32_EFER | \
VM_EXIT_CLEAR_BNDCFGS | \
VM_EXIT_PT_CONCEAL_PIP | \
- VM_EXIT_CLEAR_IA32_RTIT_CTL)
+ VM_EXIT_CLEAR_IA32_RTIT_CTL | \
+ VM_EXIT_LOAD_CET_STATE)

#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
(PIN_BASED_EXT_INTR_MASK | \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9596763fae8d..5058c9c5f4cc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)

-#define KVM_SUPPORTED_XSS 0
+#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)

u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);
@@ -9921,6 +9922,20 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
kvm_caps.supported_xss = 0;

+ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+ !kvm_cpu_cap_has(X86_FEATURE_IBT))
+ kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
+ XFEATURE_MASK_CET_KERNEL);
+
+ if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
+ XFEATURE_MASK_CET_KERNEL)) !=
+ (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
+ kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+ kvm_cpu_cap_clear(X86_FEATURE_IBT);
+ kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
+ XFEATURE_MASK_CET_KERNEL);
+ }
+
#define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
#undef __kvm_cpu_cap_has
@@ -12392,7 +12407,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)

static inline bool is_xstate_reset_needed(void)
{
- return kvm_cpu_cap_has(X86_FEATURE_MPX);
+ return kvm_cpu_cap_has(X86_FEATURE_MPX) ||
+ kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
+ kvm_cpu_cap_has(X86_FEATURE_IBT);
}

void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -12469,6 +12486,16 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
XFEATURE_BNDCSR);
}

+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_CET_USER);
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_CET_KERNEL);
+ } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
+ fpstate_clear_xstate_component(fpstate,
+ XFEATURE_CET_USER);
+ }
+
if (init_event)
kvm_load_guest_fpu(vcpu);
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 656107e64c93..cc585051d24b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -533,6 +533,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
__reserved_bits |= X86_CR4_PCIDE; \
if (!__cpu_has(__c, X86_FEATURE_LAM)) \
__reserved_bits |= X86_CR4_LAM_SUP; \
+ if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
+ !__cpu_has(__c, X86_FEATURE_IBT)) \
+ __reserved_bits |= X86_CR4_CET; \
__reserved_bits; \
})

--
2.39.3


2023-12-21 09:17:56

by Yang, Weijiang

Subject: [PATCH v8 25/26] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

Per SDM description (Vol.3D, Appendix A.1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector"

Modify the has_error_code check before injecting events to the nested
guest: reject an error code outright when the guest is in real mode or the
event is not a hardware exception, and otherwise enforce the consistency
check only when the vCPU doesn't enumerate bit 56 in VMX_BASIC. This makes
the logic consistent with the SDM.

Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
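Not part of the patch, only an illustration of how an L1 hypervisor could
probe the new capability. rdmsrl() and MSR_IA32_VMX_BASIC are existing
kernel symbols, VMX_BASIC_NO_HW_ERROR_CODE_CC is the bit defined earlier in
this series, and the sketch assumes kernel context on VMX-capable hardware:

	u64 basic;
	bool no_hw_errcode_cc;

	rdmsrl(MSR_IA32_VMX_BASIC, basic);
	/* If set, VM entry may deliver a hardware exception with or
	 * without an error code, regardless of vector (SDM A.1, bit 56). */
	no_hw_errcode_cc = basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;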
arch/x86/kvm/vmx/nested.c | 27 ++++++++++++++++++---------
arch/x86/kvm/vmx/nested.h | 5 +++++
2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b2e9853584b8..468a7cf75035 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1230,9 +1230,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
{
const u64 feature_and_reserved =
/* feature (except bit 48; see below) */
- BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
+ BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
/* reserved */
- BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
+ BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
u64 vmx_basic = vmcs_config.nested.basic;

if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
@@ -2865,7 +2865,6 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
u8 vector = intr_info & INTR_INFO_VECTOR_MASK;
u32 intr_type = intr_info & INTR_INFO_INTR_TYPE_MASK;
bool has_error_code = intr_info & INTR_INFO_DELIVER_CODE_MASK;
- bool should_have_error_code;
bool urg = nested_cpu_has2(vmcs12,
SECONDARY_EXEC_UNRESTRICTED_GUEST);
bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
@@ -2882,12 +2881,20 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
return -EINVAL;

- /* VM-entry interruption-info field: deliver error code */
- should_have_error_code =
- intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
- x86_exception_has_error_code(vector);
- if (CC(has_error_code != should_have_error_code))
- return -EINVAL;
+ /*
+ * Cannot deliver error code in real mode or if the interrupt
+ * type is not hardware exception. For other cases, do the
+ * consistency check only if the vCPU doesn't enumerate
+ * VMX_BASIC_NO_HW_ERROR_CODE_CC.
+ */
+ if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
+ if (CC(has_error_code))
+ return -EINVAL;
+ } else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
+ if (CC(has_error_code !=
+ x86_exception_has_error_code(vector)))
+ return -EINVAL;
+ }

/* VM-entry exception error code */
if (CC(has_error_code &&
@@ -7011,6 +7018,8 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)

if (cpu_has_vmx_basic_inout())
msrs->basic |= VMX_BASIC_INOUT;
+ if (cpu_has_vmx_basic_no_hw_errcode())
+ msrs->basic |= VMX_BASIC_NO_HW_ERROR_CODE_CC;
}

static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index cce4e2aa30fb..747061c2aeb9 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -285,6 +285,11 @@ static inline bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
__kvm_is_valid_cr4(vcpu, val);
}

+static inline bool nested_cpu_has_no_hw_errcode_cc(struct kvm_vcpu *vcpu)
+{
+ return to_vmx(vcpu)->nested.msrs.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
/* No difference in the restrictions on guest and host CR4 in VMX operation. */
#define nested_guest_cr4_valid nested_cr4_valid
#define nested_host_cr4_valid nested_cr4_valid
--
2.39.3


2023-12-21 09:18:22

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 09/26] KVM: x86: Rename kvm_{g,s}et_msr() to manifest emulation operations

Rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
more obvious that KVM uses these helpers to emulate guest behaviors,
i.e., host_initiated == false in these helpers.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++--
arch/x86/kvm/smm.c | 4 ++--
arch/x86/kvm/vmx/nested.c | 13 +++++++------
arch/x86/kvm/x86.c | 10 +++++-----
4 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7bc1daf68741..5c665165024c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2013,8 +2013,8 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
void kvm_enable_efer_bits(u64);
bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
-int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);
+int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
+int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_as_nop(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index dc3d95fdca7d..45c855389ea7 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -535,7 +535,7 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,

vcpu->arch.smbase = smstate->smbase;

- if (kvm_set_msr(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
+ if (kvm_emulate_msr_write(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
return X86EMUL_UNHANDLEABLE;

rsm_load_seg_64(vcpu, &smstate->tr, VCPU_SREG_TR);
@@ -626,7 +626,7 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)

/* And finally go back to 32-bit mode. */
efer = 0;
- kvm_set_msr(vcpu, MSR_EFER, efer);
+ kvm_emulate_msr_write(vcpu, MSR_EFER, efer);
}
#endif

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index db0ad1e6ec4b..b2e9853584b8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -958,7 +958,7 @@ static u32 nested_vmx_load_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count)
__func__, i, e.index, e.reserved);
goto fail;
}
- if (kvm_set_msr(vcpu, e.index, e.value)) {
+ if (kvm_emulate_msr_write(vcpu, e.index, e.value)) {
pr_debug_ratelimited(
"%s cannot write MSR (%u, 0x%x, 0x%llx)\n",
__func__, i, e.index, e.value);
@@ -994,7 +994,7 @@ static bool nested_vmx_get_vmexit_msr_value(struct kvm_vcpu *vcpu,
}
}

- if (kvm_get_msr(vcpu, msr_index, data)) {
+ if (kvm_emulate_msr_read(vcpu, msr_index, data)) {
pr_debug_ratelimited("%s cannot read MSR (0x%x)\n", __func__,
msr_index);
return false;
@@ -2686,7 +2686,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,

if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)) &&
- WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
+ WARN_ON_ONCE(kvm_emulate_msr_write(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
vmcs12->guest_ia32_perf_global_ctrl))) {
*entry_failure_code = ENTRY_FAIL_DEFAULT;
return -EINVAL;
@@ -4568,8 +4568,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
}
if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) &&
kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)))
- WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
- vmcs12->host_ia32_perf_global_ctrl));
+ WARN_ON_ONCE(kvm_emulate_msr_write(vcpu,
+ MSR_CORE_PERF_GLOBAL_CTRL,
+ vmcs12->host_ia32_perf_global_ctrl));

/* Set L1 segment info according to Intel SDM
27.5.2 Loading Host Segment and Descriptor-Table Registers */
@@ -4744,7 +4745,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
goto vmabort;
}

- if (kvm_set_msr(vcpu, h.index, h.value)) {
+ if (kvm_emulate_msr_write(vcpu, h.index, h.value)) {
pr_debug_ratelimited(
"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
__func__, j, h.index, h.value);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 27e23714e960..0e7dc3398293 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1976,17 +1976,17 @@ static int kvm_set_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 data)
return kvm_set_msr_ignored_check(vcpu, index, data, false);
}

-int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
{
return kvm_get_msr_ignored_check(vcpu, index, data, false);
}
-EXPORT_SYMBOL_GPL(kvm_get_msr);
+EXPORT_SYMBOL_GPL(kvm_emulate_msr_read);

-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
+int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
{
return kvm_set_msr_ignored_check(vcpu, index, data, false);
}
-EXPORT_SYMBOL_GPL(kvm_set_msr);
+EXPORT_SYMBOL_GPL(kvm_emulate_msr_write);

static void complete_userspace_rdmsr(struct kvm_vcpu *vcpu)
{
@@ -8386,7 +8386,7 @@ static int emulator_set_msr_with_filter(struct x86_emulate_ctxt *ctxt,
static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
u32 msr_index, u64 *pdata)
{
- return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata);
+ return kvm_emulate_msr_read(emul_to_vcpu(ctxt), msr_index, pdata);
}

static int emulator_check_pmc(struct x86_emulate_ctxt *ctxt,
--
2.39.3


2023-12-21 09:18:29

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 20/26] KVM: VMX: Emulate read and write to CET MSRs

Add an emulation interface for CET MSR access. The emulation code is split
into a common part and a vendor-specific part. The former does the common
checks for the MSRs, e.g., accessibility and data validity, then passes the
operation on to either the XSAVE-managed MSRs via the helpers or the CET
VMCS fields.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
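For illustration only (not part of the patch): with the masks below, writes
to MSR_IA32_U_CET behave roughly as sketched for a guest that enumerates
SHSTK and IBT. The CET_* bit macros are the existing definitions from
msr-index.h; a return value of 1 means the write is rejected (#GP for a
guest-initiated access).

	/* Accepted: shadow-stack enable bits (bits 1:0). */
	kvm_emulate_msr_write(vcpu, MSR_IA32_U_CET, CET_SHSTK_EN | CET_WRSS_EN);

	/* Rejected: bits 9:6 are reserved (CET_US_RESERVED_BITS). */
	kvm_emulate_msr_write(vcpu, MSR_IA32_U_CET, BIT_ULL(6));

	/* Rejected: SUPPRESS and WAIT_ENDBR must not be set together. */
	kvm_emulate_msr_write(vcpu, MSR_IA32_U_CET,
			      CET_ENDBR_EN | CET_SUPPRESS | CET_WAIT_ENDBR);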
arch/x86/kvm/vmx/vmx.c | 18 +++++++++
arch/x86/kvm/x86.c | 88 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 106 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 29a0fd3e83c5..064a5fe87948 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2106,6 +2106,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
else
msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
break;
+ case MSR_IA32_S_CET:
+ msr_info->data = vmcs_readl(GUEST_S_CET);
+ break;
+ case MSR_KVM_SSP:
+ msr_info->data = vmcs_readl(GUEST_SSP);
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
+ break;
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
break;
@@ -2415,6 +2424,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
else
vmx->pt_desc.guest.addr_a[index / 2] = data;
break;
+ case MSR_IA32_S_CET:
+ vmcs_writel(GUEST_S_CET, data);
+ break;
+ case MSR_KVM_SSP:
+ vmcs_writel(GUEST_SSP, data);
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ vmcs_writel(GUEST_INTR_SSP_TABLE, data);
+ break;
case MSR_IA32_PERF_CAPABILITIES:
if (data && !vcpu_to_pmu(vcpu)->version)
return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a7368adad6b8..cf0f9e4474a4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1850,6 +1850,36 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
}
EXPORT_SYMBOL_GPL(kvm_msr_allowed);

+#define CET_US_RESERVED_BITS GENMASK(9, 6)
+#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
+#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
+#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
+
+static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
+ bool host_initiated)
+{
+ bool msr_ctrl = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
+
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return true;
+
+ if (msr_ctrl && guest_can_use(vcpu, X86_FEATURE_IBT))
+ return true;
+
+ /*
+ * If KVM supports the MSR, i.e. has enumerated the MSR existence to
+ * userspace, then userspace is allowed to write '0' irrespective of
+ * whether or not the MSR is exposed to the guest.
+ */
+ if (!host_initiated || data)
+ return false;
+
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ return true;
+
+ return msr_ctrl && kvm_cpu_cap_has(X86_FEATURE_IBT);
+}
+
/*
* Write @data into the MSR specified by @index. Select MSR specific fault
* checks are bypassed if @host_initiated is %true.
@@ -1909,6 +1939,43 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,

data = (u32)data;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
+ return 1;
+ if (data & CET_US_RESERVED_BITS)
+ return 1;
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+ (data & CET_US_SHSTK_MASK_BITS))
+ return 1;
+ if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
+ (data & CET_US_IBT_MASK_BITS))
+ return 1;
+ if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
+ return 1;
+ /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
+ if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
+ return 1;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated) ||
+ !guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ return 1;
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ break;
+ case MSR_KVM_SSP:
+ if (!host_initiated)
+ return 1;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
+ return 1;
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ if (!IS_ALIGNED(data, 4))
+ return 1;
+ break;
}

msr.data = data;
@@ -1952,6 +2019,19 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
!guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
return 1;
break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
+ !guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ return 1;
+ break;
+ case MSR_KVM_SSP:
+ if (!host_initiated)
+ return 1;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return 1;
+ break;
}

msr.index = index;
@@ -4143,6 +4223,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.guest_fpu.xfd_err = data;
break;
#endif
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ kvm_set_xstate_msr(vcpu, msr_info);
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr))
return kvm_pmu_set_msr(vcpu, msr_info);
@@ -4502,6 +4586,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vcpu->arch.guest_fpu.xfd_err;
break;
#endif
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ kvm_get_xstate_msr(vcpu, msr_info);
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
return kvm_pmu_get_msr(vcpu, msr_info);
--
2.39.3


2023-12-21 09:18:59

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 12/26] KVM: x86: Report XSS as to-be-saved if there are supported features

From: Sean Christopherson <[email protected]>

Add MSR_IA32_XSS to the list of MSRs reported to userspace if supported_xss
is non-zero, i.e. KVM supports at least one XSS-based feature.

Before the CET virtualization series, guest IA32_XSS is guaranteed to be 0,
i.e., XSAVES/XRSTORS is executed in non-root mode with XSS == 0, which is
equivalent to XSAVE/XRSTOR.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 594c9e025f95..b3a39886e418 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1464,6 +1464,7 @@ static const u32 msrs_to_save_base[] = {
MSR_IA32_UMWAIT_CONTROL,

MSR_IA32_XFD, MSR_IA32_XFD_ERR,
+ MSR_IA32_XSS,
};

static const u32 msrs_to_save_pmu[] = {
@@ -7374,6 +7375,10 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
return;
break;
+ case MSR_IA32_XSS:
+ if (!kvm_caps.supported_xss)
+ return;
+ break;
default:
break;
}
--
2.39.3


2023-12-21 09:19:57

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 26/26] KVM: nVMX: Enable CET support for nested guest

Set up CET MSRs, the related VM_ENTRY/EXIT control bits and the CR4 fixed
bits to enable CET for nested VMs.

vmcs12 and vmcs02 need to be synced when L2 exits to L1 or when L1 wants
to resume L2, so that the correct CET state is observed by both.

Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
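Purely illustrative (not part of the patch): an L1 hypervisor that wants
L2's CET state loaded on VM-entry programs its vmcs12 along these lines,
using the control bit and fields introduced by this series. The l2_* values
are placeholders, and the vmcs_* accessors stand in for whatever VMCS
access helpers L1 actually uses:

	u32 entry_ctls = vmcs_read32(VM_ENTRY_CONTROLS);

	vmcs_write32(VM_ENTRY_CONTROLS, entry_ctls | VM_ENTRY_LOAD_CET_STATE);
	vmcs_writel(GUEST_S_CET, l2_s_cet);
	vmcs_writel(GUEST_SSP, l2_ssp);
	vmcs_writel(GUEST_INTR_SSP_TABLE, l2_ssp_tbl);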
arch/x86/kvm/vmx/nested.c | 57 +++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmcs12.c | 6 +++++
arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++-
arch/x86/kvm/vmx/vmx.c | 2 ++
4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 468a7cf75035..dee718c65255 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -691,6 +691,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_FLUSH_CMD, MSR_TYPE_W);

+ /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_U_CET, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_S_CET, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL0_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL1_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL2_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL3_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
+
kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);

vmx->nested.force_msr_bitmap_recalc = false;
@@ -2506,6 +2528,17 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+ if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
+ if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
+ vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
+ vmcs_writel(GUEST_INTR_SSP_TABLE,
+ vmcs12->guest_ssp_tbl);
+ }
+ if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
+ guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
+ vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
+ }
}

if (nested_cpu_has_xsaves(vmcs12))
@@ -4344,6 +4377,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);

+ if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
+ vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
+ vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
+ }
+ if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
+ guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
+ vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
+ }
+
vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
}

@@ -4569,6 +4611,16 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
vmcs_write64(GUEST_BNDCFGS, 0);

+ if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_CET_STATE) {
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
+ vmcs_writel(HOST_SSP, vmcs12->host_ssp);
+ vmcs_writel(HOST_INTR_SSP_TABLE, vmcs12->host_ssp_tbl);
+ }
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
+ guest_can_use(vcpu, X86_FEATURE_IBT))
+ vmcs_writel(HOST_S_CET, vmcs12->host_s_cet);
+ }
+
if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) {
vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
vcpu->arch.pat = vmcs12->host_ia32_pat;
@@ -6840,7 +6892,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
VM_EXIT_HOST_ADDR_SPACE_SIZE |
#endif
VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
- VM_EXIT_CLEAR_BNDCFGS;
+ VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
msrs->exit_ctls_high |=
VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6862,7 +6914,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
#ifdef CONFIG_X86_64
VM_ENTRY_IA32E_MODE |
#endif
- VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+ VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+ VM_ENTRY_LOAD_CET_STATE;
msrs->entry_ctls_high |=
(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..4233b5ca9461 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+ FIELD(GUEST_S_CET, guest_s_cet),
+ FIELD(GUEST_SSP, guest_ssp),
+ FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
FIELD(HOST_CR0, host_cr0),
FIELD(HOST_CR3, host_cr3),
FIELD(HOST_CR4, host_cr4),
@@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
FIELD(HOST_RSP, host_rsp),
FIELD(HOST_RIP, host_rip),
+ FIELD(HOST_S_CET, host_s_cet),
+ FIELD(HOST_SSP, host_ssp),
+ FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
};
const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 01936013428b..3884489e7f7e 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -117,7 +117,13 @@ struct __packed vmcs12 {
natural_width host_ia32_sysenter_eip;
natural_width host_rsp;
natural_width host_rip;
- natural_width paddingl[8]; /* room for future expansion */
+ natural_width host_s_cet;
+ natural_width host_ssp;
+ natural_width host_ssp_tbl;
+ natural_width guest_s_cet;
+ natural_width guest_ssp;
+ natural_width guest_ssp_tbl;
+ natural_width paddingl[2]; /* room for future expansion */
u32 pin_based_vm_exec_control;
u32 cpu_based_vm_exec_control;
u32 exception_bitmap;
@@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_ia32_sysenter_eip, 656);
CHECK_OFFSET(host_rsp, 664);
CHECK_OFFSET(host_rip, 672);
+ CHECK_OFFSET(host_s_cet, 680);
+ CHECK_OFFSET(host_ssp, 688);
+ CHECK_OFFSET(host_ssp_tbl, 696);
+ CHECK_OFFSET(guest_s_cet, 704);
+ CHECK_OFFSET(guest_ssp, 712);
+ CHECK_OFFSET(guest_ssp_tbl, 720);
CHECK_OFFSET(pin_based_vm_exec_control, 744);
CHECK_OFFSET(cpu_based_vm_exec_control, 748);
CHECK_OFFSET(exception_bitmap, 752);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c802e790c0d5..7ddd3f6fe8ab 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7732,6 +7732,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
+ cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
+ cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));

entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
cr4_fixed1_update(X86_CR4_LAM_SUP, eax, feature_bit(LAM));
--
2.39.3


2023-12-21 09:20:27

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 21/26] KVM: x86: Save and reload SSP to/from SMRAM

Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates the
architectural behavior when the guest enters/leaves SMM, i.e., it saves
registers to SMRAM on SMM entry and reloads them on SMM exit. Per the SDM,
SSP is one such register on 64-bit architectures, so add support for SSP.
Note, on 32-bit architectures SSP is not defined in SMRAM, so fail the
launch of a 32-bit CET guest.

Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/cpuid.c | 11 +++++++++++
arch/x86/kvm/smm.c | 8 ++++++++
arch/x86/kvm/smm.h | 2 +-
3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 3ab133530573..cfc0ac8ddb4a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -149,6 +149,17 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
return -EINVAL;
}
+ /*
+ * Prevent 32-bit guest from being launched if CET is exposed as SSP
+ * state is not defined for 32-bit SMRAM.
+ */
+ best = cpuid_entry2_find(entries, nent, 0x80000001,
+ KVM_CPUID_INDEX_NOT_SIGNIFICANT);
+ if (best && !(best->edx & F(LM))) {
+ best = cpuid_entry2_find(entries, nent, 0x7, 0);
+ if (best && ((best->ecx & F(SHSTK)) || (best->edx & F(IBT))))
+ return -EINVAL;
+ }

/*
* Exposing dynamic xfeatures to the guest requires additional
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index 45c855389ea7..7aac9c54c353 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);

smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
+
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
+ vcpu->kvm);
}
#endif

@@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
ctxt->interruptibility = (u8)smstate->int_shadow;

+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
+ vcpu->kvm);
+
return X86EMUL_CONTINUE;
}
#endif
diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..1e2a3e18207f 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
u32 smbase;
u32 reserved4[5];

- /* ssp and svm_* fields below are not implemented by KVM */
u64 ssp;
+ /* svm_* fields below are not implemented by KVM */
u64 svm_guest_pat;
u64 svm_host_efer;
u64 svm_host_cr4;
--
2.39.3


2023-12-21 09:20:32

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 23/26] KVM: VMX: Set host constant supervisor states to VMCS fields

Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields explicitly.
Kernel IBT is supported and the MSR_IA32_S_CET setting is static after boot
(the exception is the BIOS call case, but a vCPU thread never crosses it),
so KVM doesn't need to refresh the HOST_S_CET field before every VM-Entry/
VM-Exit sequence.

Host supervisor shadow stack is not enabled now and SSP is not accessible
to kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP VMCS
fields to 0. When shadow stack is enabled for CPL3, SSP is reloaded from
PL3_SSP before execution returns to userspace. See SDM Vol. 2A/B, Chapters
3 and 4, for SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/CALL etc.

Prevent KVM module loading if host supervisor shadow stack SHSTK_EN is set
in MSR_IA32_S_CET, as KVM cannot co-exist with it correctly.

Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/vmx/capabilities.h | 4 ++++
arch/x86/kvm/vmx/vmx.c | 15 +++++++++++++++
arch/x86/kvm/x86.c | 14 ++++++++++++++
arch/x86/kvm/x86.h | 1 +
4 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..ee8938818c8a 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -106,6 +106,10 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
}

+static inline bool cpu_has_load_cet_ctrl(void)
+{
+ return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_CET_STATE);
+}
static inline bool cpu_has_vmx_mpx(void)
{
return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 08058b182893..e9c0b571b3bb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4371,6 +4371,21 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)

if (cpu_has_load_ia32_efer())
vmcs_write64(HOST_IA32_EFER, host_efer);
+
+ /*
+ * Supervisor shadow stack is not enabled on host side, i.e.,
+ * host IA32_S_CET.SHSTK_EN bit is guaranteed to 0 now, per SDM
+ * description(RDSSP instruction), SSP is not readable in CPL0,
+ * so resetting the two registers to 0s at VM-Exit does no harm
+ * to kernel execution. When execution flow exits to userspace,
+ * SSP is reloaded from IA32_PL3_SSP. Check SDM Vol.2A/B Chapter
+ * 3 and 4 for details.
+ */
+ if (cpu_has_load_cet_ctrl()) {
+ vmcs_writel(HOST_S_CET, host_s_cet);
+ vmcs_writel(HOST_SSP, 0);
+ vmcs_writel(HOST_INTR_SSP_TABLE, 0);
+ }
}

void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf0f9e4474a4..9596763fae8d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -114,6 +114,8 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
#endif

static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
+u64 __read_mostly host_s_cet;
+EXPORT_SYMBOL_GPL(host_s_cet);

#define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_MAP_GPA_RANGE)

@@ -9840,6 +9842,18 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
return -EIO;
}

+ if (boot_cpu_has(X86_FEATURE_SHSTK)) {
+ rdmsrl(MSR_IA32_S_CET, host_s_cet);
+ /*
+ * Linux doesn't yet support supervisor shadow stacks (SSS), so
+ * KVM doesn't save/restore the associated MSRs, i.e. KVM may
+ * clobber the host values. Yell and refuse to load if SSS is
+ * unexpectedly enabled, e.g. to avoid crashing the host.
+ */
+ if (WARN_ON_ONCE(host_s_cet & CET_SHSTK_EN))
+ return -EIO;
+ }
+
x86_emulator_cache = kvm_alloc_emulator_cache();
if (!x86_emulator_cache) {
pr_err("failed to allocate cache for x86 emulator\n");
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9c19dfb5011d..656107e64c93 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -325,6 +325,7 @@ fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
extern u64 host_xcr0;
extern u64 host_xss;
extern u64 host_arch_capabilities;
+extern u64 host_s_cet;

extern struct kvm_caps kvm_caps;

--
2.39.3


2023-12-21 09:24:04

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 11/26] KVM: x86: Add kvm_msr_{read,write}() helpers

Wrap __kvm_{get,set}_msr() into two new helpers for KVM-internal usage and
use the helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
to get/set an MSR value while emulating CPU behavior, i.e., host_initiated ==
%true in the helpers.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/x86.c | 16 +++++++++++++---
3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5c665165024c..40dd796ea085 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2012,9 +2012,10 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);

void kvm_enable_efer_bits(u64);
bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_as_nop(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 624954203b40..acc360c76318 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1548,7 +1548,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
*edx = entry->edx;
if (function == 7 && index == 0) {
u64 data;
- if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
+ if (!kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
(data & TSX_CTRL_CPUID_CLEAR))
*ebx &= ~(F(RTM) | F(HLE));
} else if (function == 0x80000007) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3671f4868d1b..594c9e025f95 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1920,8 +1920,8 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
- bool host_initiated)
+static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
+ bool host_initiated)
{
struct msr_data msr;
int ret;
@@ -1947,6 +1947,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
return ret;
}

+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+ return __kvm_set_msr(vcpu, index, data, true);
+}
+
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+{
+ return __kvm_get_msr(vcpu, index, data, true);
+}
+
static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
u32 index, u64 *data, bool host_initiated)
{
@@ -12296,7 +12306,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;

__kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
- __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
+ kvm_msr_write(vcpu, MSR_IA32_XSS, 0);
}

/* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
--
2.39.3


2023-12-21 09:24:22

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 01/26] x86/fpu/xstate: Always preserve non-user xfeatures/flags in __state_perm

From: Sean Christopherson <[email protected]>

When granting userspace or a KVM guest access to an xfeature, preserve the
entity's existing supervisor and software-defined permissions as tracked
by __state_perm, i.e. use __state_perm to track *all* permissions even
though all supported supervisor xfeatures are granted to all FPUs and
FPU_GUEST_PERM_LOCKED disallows changing permissions.

Effectively clobbering supervisor permissions results in inconsistent
behavior, as xstate_get_group_perm() will report supervisor features for
processes that do NOT request access to dynamic user xfeatures, whereas any
and all supervisor features will be absent from the set of permissions for
any process that is granted access to one or more dynamic xfeatures (which
right now means AMX).

The inconsistency isn't problematic because fpu_xstate_prctl() already
strips out everything except user xfeatures:

case ARCH_GET_XCOMP_PERM:
/*
* Lockless snapshot as it can also change right after the
* dropping the lock.
*/
permitted = xstate_get_host_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

case ARCH_GET_XCOMP_GUEST_PERM:
permitted = xstate_get_guest_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

and similarly KVM doesn't apply the __state_perm to supervisor states
(kvm_get_filtered_xcr0() incorporates xstate_get_guest_group_perm()):

case 0xd: {
u64 permitted_xcr0 = kvm_get_filtered_xcr0();
u64 permitted_xss = kvm_caps.supported_xss;

But if KVM in particular were to ever change, dropping supervisor
permissions would result in subtle bugs in KVM's reporting of supported
CPUID settings. And the above behavior also means that having supervisor
xfeatures in __state_perm is correctly handled by all users.

Dropping supervisor permissions also creates another landmine for KVM. If
more dynamic user xfeatures are ever added, requesting access to multiple
xfeatures in separate ARCH_REQ_XCOMP_GUEST_PERM calls will result in the
second invocation of __xstate_request_perm() computing the wrong ksize, as
the mask passed to xstate_calculate_size() would not contain *any*
supervisor features.

Commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE
permissions") fudged around the size issue for userspace FPUs, but for
reasons unknown skipped guest FPUs. Lack of a fix for KVM "works" only
because KVM doesn't yet support virtualizing features that have supervisor
xfeatures, i.e. as of today, KVM guest FPUs will never need the relevant
xfeatures.

Simply extending the hack-a-fix for guests would temporarily solve the
ksize issue, but wouldn't address the inconsistency issue and would leave
another lurking pitfall for KVM. KVM support for virtualizing CET will
likely add CET_KERNEL as a guest-only xfeature, i.e. CET_KERNEL will not
be set in xfeatures_mask_supervisor() and would again be dropped when
granting access to dynamic xfeatures.

Note, the existing clobbering behavior is rather subtle. The @permitted
parameter to __xstate_request_perm() comes from:

permitted = xstate_get_group_perm(guest);

which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm,
where __state_perm is initialized to:

fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

and copied to the guest side of things:

/* Same defaults for guests */
fpu->guest_perm = fpu->perm;

fpu_kernel_cfg.default_features contains everything except the dynamic
xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:

fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;

When __xstate_request_perm() restricts the local "mask" variable to
compute the user state size:

mask &= XFEATURE_MASK_USER_SUPPORTED;
usize = xstate_calculate_size(mask, false);

it subtly overwrites the target __state_perm with "mask" containing only
user xfeatures:

perm = guest ? &fpu->guest_perm : &fpu->perm;
/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
WRITE_ONCE(perm->__state_perm, mask);

Cc: Maxim Levitsky <[email protected]>
Cc: Weijiang Yang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Chao Gao <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: John Allen <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 117e74c44e75..07911532b108 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
if ((permitted & requested) == requested)
return 0;

- /* Calculate the resulting kernel state size */
+ /*
+ * Calculate the resulting kernel state size. Note, @permitted also
+ * contains supervisor xfeatures even though supervisor are always
+ * permitted for kernel and guest FPUs, and never permitted for user
+ * FPUs.
+ */
mask = permitted | requested;
- /* Take supervisor states into account on the host */
- if (!guest)
- mask |= xfeatures_mask_supervisor();
ksize = xstate_calculate_size(mask, compacted);

- /* Calculate the resulting user state size */
- mask &= XFEATURE_MASK_USER_SUPPORTED;
- usize = xstate_calculate_size(mask, false);
+ /*
+ * Calculate the resulting user state size. Take care not to clobber
+ * the supervisor xfeatures in the new mask!
+ */
+ usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);

if (!guest) {
ret = validate_sigaltstack(usize);
--
2.39.3


2023-12-21 09:26:00

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 16/26] KVM: x86: Add fault checks for guest CR4.CET setting

Check potential faults for CR4.CET setting per Intel SDM requirements.
CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET ==
1 faults if CR0.WP == 0 and setting CR0.WP == 0 fails if CR4.CET == 1.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
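Not part of the patch, just the guest-visible ordering that falls out of
these checks; read_cr0()/write_cr0()/cr4_set_bits() are existing kernel
helpers and the two-line sequence is only a sketch:

	write_cr0(read_cr0() | X86_CR0_WP);	/* CR0.WP must be set first...      */
	cr4_set_bits(X86_CR4_CET);		/* ...otherwise CR4.CET would #GP.  */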
arch/x86/kvm/x86.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bde780ae69bf..b418e4f5277b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1006,6 +1006,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
(is_64_bit_mode(vcpu) || kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)))
return 1;

+ if (!(cr0 & X86_CR0_WP) && kvm_is_cr4_bit_set(vcpu, X86_CR4_CET))
+ return 1;
+
static_call(kvm_x86_set_cr0)(vcpu, cr0);

kvm_post_set_cr0(vcpu, old_cr0, cr0);
@@ -1217,6 +1220,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
return 1;
}

+ if ((cr4 & X86_CR4_CET) && !kvm_is_cr0_bit_set(vcpu, X86_CR0_WP))
+ return 1;
+
static_call(kvm_x86_set_cr4)(vcpu, cr4);

kvm_post_set_cr4(vcpu, old_cr4, cr4);
--
2.39.3


2023-12-21 09:27:39

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 18/26] KVM: VMX: Introduce CET VMCS fields and control bits

Control-flow Enforcement Technology (CET) is a kind of CPU feature used
to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP) attacks.
It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
style control-flow subversion attacks.

Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.

Indirect Branch Tracking (IBT):
IBT introduces a new instruction (ENDBRANCH) to mark valid target addresses
of indirect branches (CALL, JMP etc.). If an indirect branch is executed
and the next instruction is _not_ an ENDBRANCH, the processor generates
a #CP. This instruction behaves as a NOP on platforms that have no CET.

Several new CET MSRs are defined to support CET:
MSR_IA32_{U,S}_CET: CET settings for user and supervisor mode respectively.

MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.

MSR_IA32_INT_SSP_TAB: Linear address of SHSTK pointer table, whose entry
is indexed by IST of interrupt gate desc.

Two XSAVES state bits are introduced for CET:
IA32_XSS:[bit 11]: Control saving/restoring user mode CET states
IA32_XSS:[bit 12]: Control saving/restoring supervisor mode CET states.

Six VMCS fields are introduced for CET:
{HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
{HOST,GUEST}_SSP: Stores current active SSP.
{HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.

On Intel platforms, two additional bits are defined in VM_EXIT and VM_ENTRY
control fields:
If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from following
VMCS fields at VM-Exit:
HOST_S_CET
HOST_SSP
HOST_INTR_SSP_TABLE

If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from following
VMCS fields at VM-Entry:
GUEST_S_CET
GUEST_SSP
GUEST_INTR_SSP_TABLE

Co-developed-by: Zhang Yi Z <[email protected]>
Signed-off-by: Zhang Yi Z <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
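For reference only (not part of the patch): enabling the two XSS bits
described above from kernel context looks roughly like this;
XFEATURE_MASK_CET_USER/XFEATURE_MASK_CET_KERNEL are the existing mask
definitions for bits 11 and 12:

	u64 msr_val;

	rdmsrl(MSR_IA32_XSS, msr_val);
	/* Bit 11: user mode CET state, bit 12: supervisor mode CET state. */
	wrmsrl(MSR_IA32_XSS, msr_val |
	       XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL);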
arch/x86/include/asm/vmx.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..451fd4f4fedc 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -104,6 +104,7 @@
#define VM_EXIT_CLEAR_BNDCFGS 0x00800000
#define VM_EXIT_PT_CONCEAL_PIP 0x01000000
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
+#define VM_EXIT_LOAD_CET_STATE 0x10000000

#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff

@@ -117,6 +118,7 @@
#define VM_ENTRY_LOAD_BNDCFGS 0x00010000
#define VM_ENTRY_PT_CONCEAL_PIP 0x00020000
#define VM_ENTRY_LOAD_IA32_RTIT_CTL 0x00040000
+#define VM_ENTRY_LOAD_CET_STATE 0x00100000

#define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x000011ff

@@ -345,6 +347,9 @@ enum vmcs_field {
GUEST_PENDING_DBG_EXCEPTIONS = 0x00006822,
GUEST_SYSENTER_ESP = 0x00006824,
GUEST_SYSENTER_EIP = 0x00006826,
+ GUEST_S_CET = 0x00006828,
+ GUEST_SSP = 0x0000682a,
+ GUEST_INTR_SSP_TABLE = 0x0000682c,
HOST_CR0 = 0x00006c00,
HOST_CR3 = 0x00006c02,
HOST_CR4 = 0x00006c04,
@@ -357,6 +362,9 @@ enum vmcs_field {
HOST_IA32_SYSENTER_EIP = 0x00006c12,
HOST_RSP = 0x00006c14,
HOST_RIP = 0x00006c16,
+ HOST_S_CET = 0x00006c18,
+ HOST_SSP = 0x00006c1a,
+ HOST_INTR_SSP_TABLE = 0x00006c1c
};

/*
--
2.39.3


2023-12-21 09:28:07

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 15/26] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs

From: Sean Christopherson <[email protected]>

Load the guest's FPU state if userspace is accessing MSRs whose values
are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(),
to facilitate access to such MSRs.

If MSRs supported in kvm_caps.supported_xss are passed through to the guest,
the guest's MSR values are swapped with the host's before the vCPU exits to
userspace and after it re-enters the kernel before the next VM-entry.

Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
explicitly check @vcpu is non-null before attempting to load guest state.
The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without
loading guest FPU state (which doesn't exist).

Note that guest_cpuid_has() is not queried as host userspace is allowed to
access MSRs that have not been exposed to the guest, e.g. it might do
KVM_SET_MSRS prior to KVM_SET_CPUID2.

The two helpers are placed here to make it explicit that accessing
XSAVE-managed MSRs requires special checks and handling to guarantee that
reads and writes of the MSRs are correct.

Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Yang Weijiang <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
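As a rough illustration (not part of this patch; the real users appear
later in the series), a common-code MSR read path for an XSAVE-managed MSR
is expected to use the helper like this:

	case MSR_IA32_PL3_SSP:
		/* Guest FPU is already loaded, either by __msr_io() for a
		 * host-initiated access or because the vCPU is running. */
		kvm_get_xstate_msr(vcpu, msr_info);
		break;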
arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.h | 24 ++++++++++++++++++++++++
2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f50c5a523b92..bde780ae69bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);

static DEFINE_MUTEX(vendor_module_lock);
+static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
+static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
+
struct kvm_x86_ops kvm_x86_ops __read_mostly;

#define KVM_X86_OP(func) \
@@ -4509,6 +4512,21 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
EXPORT_SYMBOL_GPL(kvm_get_msr_common);

+/*
+ * Returns true if the MSR in question is managed via XSTATE, i.e. is context
+ * switched with the rest of guest FPU state.
+ */
+static bool is_xstate_managed_msr(u32 index)
+{
+ switch (index) {
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ return true;
+ default:
+ return false;
+ }
+}
+
/*
* Read or write a bunch of msrs. All parameters are kernel addresses.
*
@@ -4519,11 +4537,26 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
int (*do_msr)(struct kvm_vcpu *vcpu,
unsigned index, u64 *data))
{
+ bool fpu_loaded = false;
int i;

- for (i = 0; i < msrs->nmsrs; ++i)
+ for (i = 0; i < msrs->nmsrs; ++i) {
+ /*
+ * If userspace is accessing one or more XSTATE-managed MSRs,
+ * temporarily load the guest's FPU state so that the guest's
+ * MSR value(s) is resident in hardware, i.e. so that KVM can
+ * get/set the MSR via RDMSR/WRMSR.
+ */
+ if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
+ is_xstate_managed_msr(entries[i].index)) {
+ kvm_load_guest_fpu(vcpu);
+ fpu_loaded = true;
+ }
if (do_msr(vcpu, entries[i].index, &entries[i].data))
break;
+ }
+ if (fpu_loaded)
+ kvm_put_guest_fpu(vcpu);

return i;
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2f7e19166658..9c19dfb5011d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -543,4 +543,28 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
unsigned int port, void *data, unsigned int count,
int in);

+/*
+ * Lock and/or reload guest FPU and access xstate MSRs. For accesses initiated
+ * by host, guest FPU is loaded in __msr_io(). For accesses initiated by guest,
+ * guest FPU should have been loaded already.
+ */
+
+static inline void kvm_get_xstate_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info)
+{
+ KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
+ kvm_fpu_get();
+ rdmsrl(msr_info->index, msr_info->data);
+ kvm_fpu_put();
+}
+
+static inline void kvm_set_xstate_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info)
+{
+ KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
+ kvm_fpu_get();
+ wrmsrl(msr_info->index, msr_info->data);
+ kvm_fpu_put();
+}
+
#endif
--
2.39.3


2023-12-21 09:30:30

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v8 22/26] KVM: VMX: Set up interception for CET MSRs

Enable/disable interception of the CET MSRs per the associated feature
configuration. The Shadow Stack feature requires all CET MSRs to be passed
through to the guest so it is supported in both user and supervisor mode,
while the IBT feature only depends on MSR_IA32_{U,S}_CET to enable user and
supervisor IBT.

Note, this MSR design introduces an architectural limitation on SHSTK and
IBT control for the guest, i.e., when SHSTK is exposed, IBT is also
available to the guest from an architectural perspective since IBT relies
on a subset of the SHSTK-relevant MSRs.

Signed-off-by: Yang Weijiang <[email protected]>
---
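The net effect, summarized here only as a reviewer aid (not part of the
patch):

	/* After vmx_update_intercept_for_cet_msr():
	 *  - guest enumerates SHSTK: all CET MSRs are passed through
	 *    (MSR_IA32_INT_SSP_TAB only for 64-bit capable guests);
	 *  - guest enumerates only IBT: only MSR_IA32_{U,S}_CET are passed
	 *    through;
	 *  - guest enumerates neither: all CET MSRs stay intercepted.
	 */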
arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 064a5fe87948..08058b182893 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -692,6 +692,10 @@ static bool is_valid_passthrough_msr(u32 msr)
case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
return true;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
+ return true;
}

r = possible_passthrough_msr_slot(msr) != -ENOENT;
@@ -7767,6 +7771,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}

+static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
+{
+ bool incpt;
+
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+ incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
+ MSR_TYPE_RW, incpt);
+ if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
+ MSR_TYPE_RW, incpt);
+ if (!incpt)
+ return;
+ }
+
+ if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
+ incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
+ MSR_TYPE_RW, incpt);
+ }
+}
+
static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7845,6 +7885,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)

/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
+
+ vmx_update_intercept_for_cet_msr(vcpu);
}

static u64 vmx_get_perf_capabilities(void)
--
2.39.3


2024-01-02 22:25:30

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 02/26] x86/fpu/xstate: Refine CET user xstate bit enabling

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Remove XFEATURE_CET_USER entry from dependency array as the entry doesn't
> reflect true dependency between CET features and the user xstate bit.
> Enable the bit in fpu_kernel_cfg.max_features when either SHSTK or IBT is
> available.
>
> Both user mode shadow stack and indirect branch tracking features depend
> on XFEATURE_CET_USER bit in XSS to automatically save/restore user mode
> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever necessary.
>
> Note, the issue, i.e., CPUID only enumerates IBT but no SHSTK is resulted
> from CET KVM series which synthesizes guest CPUIDs based on userspace
> settings,in real world the case is rare. In other words, the existing
> dependency check is correct when only user mode SHSTK is available.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> Reviewed-by: Rick Edgecombe <[email protected]>
> Tested-by: Rick Edgecombe <[email protected]>
> ---
> arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 07911532b108..f6b98693da59 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
> [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
> [XFEATURE_PKRU] = X86_FEATURE_OSPKE,
> [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
> - [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
> [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
> [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
> };
> @@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
> }
>
> + /*
> + * CET user mode xstate bit has been cleared by above sanity check.
> + * Now pick it up if either SHSTK or IBT is available. Either feature
> + * depends on the xstate bit to save/restore user mode states.
> + */
> + if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
> + fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
> +

I am still not convinced that this is not a workaround for a bug in the sanity check code,
and I don't really like this, but whatever, as long as the code works,
I don't intend to fight over this. Let it be.

Best regards,
Maxim Levitsky


> if (!cpu_feature_enabled(X86_FEATURE_XFD))
> fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>





2024-01-02 22:25:55

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 04/26] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features
> that can be optionally enabled by kernel components. This is similar to
> XFEATURE_MASK_USER_DYNAMIC in that it contains optional xfeatures that
> can allows the FPU buffer to be dynamically sized. The difference is that
> the KERNEL variant contains supervisor features and will be enabled by
> kernel components that need them, and not directly by the user. Currently
> it's used by KVM to configure guest dedicated fpstate for calculating
> the xfeature and fpstate storage size etc.
>
> The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL, which
> is supported by host as they're enabled in kernel XSS MSR setting but
> relevant CPU feature, i.e., supervisor shadow stack, is not enabled in
> host kernel therefore it can be omitted for normal fpstate by default.
>
> Remove the kernel dynamic feature from fpu_kernel_cfg.default_features
> so that the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors
> can be optimized by HW for normal fpstate.
>
> Suggested-by: Dave Hansen <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/fpu/xstate.h | 5 ++++-
> arch/x86/kernel/fpu/xstate.c | 1 +
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> index 3b4a038d3c57..a212d3851429 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -46,9 +46,12 @@
> #define XFEATURE_MASK_USER_RESTORE \
> (XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
>
> -/* Features which are dynamically enabled for a process on request */
> +/* Features which are dynamically enabled per userspace request */
> #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
>
> +/* Features which are dynamically enabled per kernel side request */
> +#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
> +
> /* All currently supported supervisor features */
> #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
> XFEATURE_MASK_CET_USER | \
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 03e166a87d61..ca4b83c142eb 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> /* Clean out dynamic features from default */
> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> + fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
>
> fpu_user_cfg.default_features = fpu_user_cfg.max_features;
> fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;


I still think that we should consider adding XFEATURE_MASK_CET_KERNEL to
XFEATURE_MASK_INDEPENDENT or at least have a good conversation on why this doesn't make sense,
but I also don't intend to fight over this, as long as the code works.
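
For reference, a minimal sketch of what that alternative could look like
in arch/x86/include/asm/fpu/xstate.h (an illustration of the suggestion
only, not what this series does):

/*
 * Treat CET_S like arch LBR: an "independent" feature that the FPU core
 * never includes in XSAVES/XRSTORS of task fpstates, and that KVM would
 * then have to save/restore explicitly around VM entry/exit.
 */
#define XFEATURE_MASK_INDEPENDENT	(XFEATURE_MASK_LBR | \
					 XFEATURE_MASK_CET_KERNEL)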

Best regards,
Maxim Levitsky


2024-01-02 22:33:01

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 05/26] x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Define new fpu_guest_cfg to hold all guest FPU settings so that it can
> differ from generic kernel FPU settings, e.g., enabling CET supervisor
> xstate by default for guest fpstate while it remains disabled in
> kernel FPU config.
>
> The kernel dynamic xfeatures are specifically used by guest fpstate now,
> add the mask for guest fpstate so that guest_perm.__state_perm ==
> (fpu_kernel_cfg.default_xfeature | XFEATURE_MASK_KERNEL_DYNAMIC). And
> if guest fpstate is re-allocated to hold user dynamic xfeatures, the
> resulting permissions are consumed before calculating the new guest fpstate.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> Reviewed-by: Maxim Levitsky <[email protected]>
> ---
> arch/x86/include/asm/fpu/types.h | 2 +-
> arch/x86/kernel/fpu/core.c | 70 ++++++++++++++++++++++++++++++--
> arch/x86/kernel/fpu/xstate.c | 10 +++++
> 3 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index c6fd13a17205..306825ad6bc0 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -602,6 +602,6 @@ struct fpu_state_config {
> };
>
> /* FPU state configuration information */
> -extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;
> +extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg, fpu_guest_cfg;
>
> #endif /* _ASM_X86_FPU_H */
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index a21a4d0ecc34..976f519721e2 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -33,10 +33,67 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
> DEFINE_PER_CPU(u64, xfd_state);
> #endif
>
> -/* The FPU state configuration data for kernel and user space */
> +/* The FPU state configuration data for kernel, user space and guest. */
> +/*
> + * kernel FPU config:
> + *
> + * all known and CPU supported user and supervisor features except
> + * - independent kernel features (XFEATURE_LBR)
> + * @fpu_kernel_cfg.max_features;
> + *
> + * all known and CPU supported user and supervisor features except
> + * - dynamic kernel features (CET_S)
> + * - independent kernel features (XFEATURE_LBR)
> + * - dynamic userspace features (AMX state)
> + * @fpu_kernel_cfg.default_features;
> + *
> + * size of compacted buffer with 'fpu_kernel_cfg.max_features'
> + * @fpu_kernel_cfg.max_size;
> + *
> + * size of compacted buffer with 'fpu_kernel_cfg.default_features'
> + * @fpu_kernel_cfg.default_size;
> + */
> struct fpu_state_config fpu_kernel_cfg __ro_after_init;
> +
> +/*
> + * user FPU config:
> + *
> + * all known and CPU supported user features
> + * @fpu_user_cfg.max_features;
> + *
> + * all known and CPU supported user features except
> + * - dynamic userspace features (AMX state)
> + * @fpu_user_cfg.default_features;
> + *
> + * size of non-compacted buffer with 'fpu_user_cfg.max_features'
> + * @fpu_user_cfg.max_size;
> + *
> + * size of non-compacted buffer with 'fpu_user_cfg.default_features'
> + * @fpu_user_cfg.default_size;
> + */
> struct fpu_state_config fpu_user_cfg __ro_after_init;
>
> +/*
> + * guest FPU config:
> + *
> + * all known and CPU supported user and supervisor features except
> + * - independent kernel features (XFEATURE_LBR)
> + * @fpu_guest_cfg.max_features;
> + *
> + * all known and CPU supported user and supervisor features except
> + * - independent kernel features (XFEATURE_LBR)
> + * - dynamic userspace features (AMX state)
> + * @fpu_guest_cfg.default_features;
> + *
> + * size of compacted buffer with 'fpu_guest_cfg.max_features'
> + * @fpu_guest_cfg.max_size;
> + *
> + * size of compacted buffer with 'fpu_guest_cfg.default_features'
> + * @fpu_guest_cfg.default_size;
> + */


IMHO this comment is too verbose. I didn't intend it to be copied verbatim
to the kernel, but rather to explain the meaning of the fpu context fields
to both of us (I also keep on forgetting what each combination means...).

At least this comment should not include examples because xfeatures
are subject to change.


Best regards,
Maxim Levitsky


> +
> +struct fpu_state_config fpu_guest_cfg __ro_after_init;
> +
> /*
> * Represents the initial FPU state. It's mostly (but not completely) zeroes,
> * depending on the FPU hardware format:
> @@ -536,8 +593,15 @@ void fpstate_reset(struct fpu *fpu)
> fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
> fpu->perm.__state_size = fpu_kernel_cfg.default_size;
> fpu->perm.__user_state_size = fpu_user_cfg.default_size;
> - /* Same defaults for guests */
> - fpu->guest_perm = fpu->perm;
> +
> + /* Guest permission settings */
> + fpu->guest_perm.__state_perm = fpu_guest_cfg.default_features;
> + fpu->guest_perm.__state_size = fpu_guest_cfg.default_size;
> + /*
> + * Set guest's __user_state_size to fpu_user_cfg.default_size so that
> + * existing uAPIs can still work.
> + */
> + fpu->guest_perm.__user_state_size = fpu_user_cfg.default_size;
> }
>
> static inline void fpu_inherit_perms(struct fpu *dst_fpu)
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index ca4b83c142eb..9cbdc83d1eab 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -681,6 +681,7 @@ static int __init init_xstate_size(void)
> {
> /* Recompute the context size for enabled features: */
> unsigned int user_size, kernel_size, kernel_default_size;
> + unsigned int guest_default_size;
> bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
>
> /* Uncompacted user space size */
> @@ -702,13 +703,18 @@ static int __init init_xstate_size(void)
> kernel_default_size =
> xstate_calculate_size(fpu_kernel_cfg.default_features, compacted);
>
> + guest_default_size =
> + xstate_calculate_size(fpu_guest_cfg.default_features, compacted);
> +
> if (!paranoid_xstate_size_valid(kernel_size))
> return -EINVAL;
>
> fpu_kernel_cfg.max_size = kernel_size;
> fpu_user_cfg.max_size = user_size;
> + fpu_guest_cfg.max_size = kernel_size;
>
> fpu_kernel_cfg.default_size = kernel_default_size;
> + fpu_guest_cfg.default_size = guest_default_size;
> fpu_user_cfg.default_size =
> xstate_calculate_size(fpu_user_cfg.default_features, false);
>
> @@ -829,6 +835,10 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> fpu_user_cfg.default_features = fpu_user_cfg.max_features;
> fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>
> + fpu_guest_cfg.max_features = fpu_kernel_cfg.max_features;
> + fpu_guest_cfg.default_features = fpu_guest_cfg.max_features;
> + fpu_guest_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> +
> /* Store it for paranoia check at the end */
> xfeatures = fpu_kernel_cfg.max_features;
>



2024-01-02 22:33:15

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 06/26] x86/fpu/xstate: Create guest fpstate with guest specific config

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Use fpu_guest_cfg to calculate guest fpstate settings, and open-code
> __fpstate_reset() to avoid using the kernel FPU config.
>
> Below configuration steps are currently enforced to get guest fpstate:
> 1) Kernel sets up guest FPU settings in fpu__init_system_xstate().
> 2) User space sets vCPU thread group xstate permits via arch_prctl().
> 3) User space creates guest fpstate via __fpu_alloc_init_guest_fpstate()
> for vcpu thread.
> 4) User space enables guest dynamic xfeatures and re-allocate guest
> fpstate.
>
> By adding kernel dynamic xfeatures in above #1 and #2, guest xstate area
> size is expanded to hold (fpu_kernel_cfg.default_features | kernel dynamic
> xfeatures | user dynamic xfeatures), then host xsaves/xrstors can operate
> for all guest xfeatures.
>
> The user_* fields remain unchanged for compatibility with KVM uAPIs.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kernel/fpu/core.c | 47 ++++++++++++++++++++++++++++++--------
> 1 file changed, 37 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 976f519721e2..0e0bf151418f 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -250,8 +250,6 @@ void fpu_reset_from_exception_fixup(void)
> }
>
> #if IS_ENABLED(CONFIG_KVM)
> -static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
> -
> static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
> {
> struct fpu_state_perm *fpuperm;
> @@ -272,25 +270,54 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
> gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
> }
>
> -bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
> {
> + bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
> + unsigned int gfpstate_size, size;
> struct fpstate *fpstate;
> - unsigned int size;
>
> - size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
> + /*
> + * fpu_guest_cfg.default_size is initialized to hold all enabled
> + * xfeatures except the user dynamic xfeatures. If the user dynamic
> + * xfeatures are enabled, the guest fpstate will be re-allocated to
> + * hold all guest enabled xfeatures, so omit user dynamic xfeatures
> + * here.
> + */
> + size = fpu_guest_cfg.default_size +
> + ALIGN(offsetof(struct fpstate, regs), 64);
> +
> fpstate = vzalloc(size);
> if (!fpstate)
> - return false;
> + return NULL;
> + /*
> + * Initialize sizes and feature masks, use fpu_user_cfg.*
> + * for user_* settings for compatibility of exiting uAPIs.
> + */
> + fpstate->size = gfpstate_size;
> + fpstate->xfeatures = fpu_guest_cfg.default_features;
> + fpstate->user_size = fpu_user_cfg.default_size;
> + fpstate->user_xfeatures = fpu_user_cfg.default_features;
> + fpstate->xfd = 0;
>
> - /* Leave xfd to 0 (the reset value defined by spec) */
> - __fpstate_reset(fpstate, 0);
> fpstate_init_user(fpstate);
> fpstate->is_valloc = true;
> fpstate->is_guest = true;
>
> gfpu->fpstate = fpstate;
> - gfpu->xfeatures = fpu_user_cfg.default_features;
> - gfpu->perm = fpu_user_cfg.default_features;
> + gfpu->xfeatures = fpu_guest_cfg.default_features;
> + gfpu->perm = fpu_guest_cfg.default_features;
> +
> + return fpstate;
> +}
> +
> +bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +{
> + struct fpstate *fpstate;
> +
> + fpstate = __fpu_alloc_init_guest_fpstate(gfpu);
> +
> + if (!fpstate)
> + return false;
>
> /*
> * KVM sets the FP+SSE bits in the XSAVE header when copying FPU state

Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky




2024-01-02 22:33:39

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 07/26] x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal fpstate

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Kernel dynamic xfeatures are now enabled __ONLY__ for guest fpstate, i.e.,
> never for normal kernel fpstate. The bits are added when the guest FPU
> config is initialized. Guest fpstate is allocated with fpstate->is_guest
> set to %true.
>
> For normal fpstate, the bits should have been removed when the kernel FPU
> config settings are initialized, so WARN_ONCE() if the kernel detects that
> a normal fpstate's xfeatures contain kernel dynamic xfeatures before
> executing xsaves.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> Reviewed-by: Rick Edgecombe <[email protected]>
> ---
> arch/x86/kernel/fpu/xstate.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
> index 3518fb26d06b..83ebf1e1cbb4 100644
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -185,6 +185,9 @@ static inline void os_xsave(struct fpstate *fpstate)
> WARN_ON_FPU(!alternatives_patched);
> xfd_validate_state(fpstate, mask, false);
>
> + WARN_ON_FPU(!fpstate->is_guest &&
> + (mask & XFEATURE_MASK_KERNEL_DYNAMIC));
> +
> XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
>
> /* We should never fault when copying to a kernel buffer: */

Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky


2024-01-02 22:34:06

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 08/26] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> From: Sean Christopherson <[email protected]>
>
> Rework and rename cpuid_get_supported_xcr0() to explicitly operate on
> vCPU state, i.e. on a vCPU's CPUID state, now that the only usage of
> the helper is to retrieve a vCPU's already-set CPUID.
>
> Prior to commit 275a87244ec8 ("KVM: x86: Don't adjust guest's CPUID.0x12.1
> (allowed SGX enclave XFRM)"), KVM incorrectly fudged guest CPUID at runtime,
> which in turn necessitated massaging the incoming CPUID state for
> KVM_SET_CPUID{2} so as not to run afoul of kvm_cpuid_check_equal().
> I.e. KVM also invoked cpuid_get_supported_xcr0() with the incoming CPUID
> state, and thus without an explicit vCPU object.
>
> Opportunistically move the helper below kvm_update_cpuid_runtime() to make
> it harder to repeat the mistake of querying supported XCR0 for runtime
> updates.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 33 ++++++++++++++++-----------------
> 1 file changed, 16 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..624954203b40 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -247,21 +247,6 @@ void kvm_update_pv_runtime(struct kvm_vcpu *vcpu)
> vcpu->arch.pv_cpuid.features = best->eax;
> }
>
> -/*
> - * Calculate guest's supported XCR0 taking into account guest CPUID data and
> - * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
> - */
> -static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
> -{
> - struct kvm_cpuid_entry2 *best;
> -
> - best = cpuid_entry2_find(entries, nent, 0xd, 0);
> - if (!best)
> - return 0;
> -
> - return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> -}
> -
> static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
> int nent)
> {
> @@ -312,6 +297,21 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
>
> +/*
> + * Calculate guest's supported XCR0 taking into account guest CPUID data and
> + * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
> + */
> +static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpuid_entry2 *best;
> +
> + best = kvm_find_cpuid_entry_index(vcpu, 0xd, 0);
> + if (!best)
> + return 0;
> +
> + return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> +}
> +
> static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
> {
> #ifdef CONFIG_KVM_HYPERV
> @@ -361,8 +361,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> kvm_apic_set_version(vcpu);
> }
>
> - vcpu->arch.guest_supported_xcr0 =
> - cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
> + vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
>
> kvm_update_pv_runtime(vcpu);
>

Looks like I forgot to add my reviewed-by:

Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky


2024-01-02 22:34:41

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 21/26] KVM: x86: Save and reload SSP to/from SMRAM

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates the HW
> architectural behavior when the guest enters/leaves SMM mode, i.e., it
> saves registers to SMRAM on SMM entry and reloads them on SMM exit. Per
> the SDM, SSP is one such register on 64-bit architectures, so add support
> for SSP. Note, on 32-bit architectures SSP is not defined in SMRAM, so
> fail launch of a 32-bit CET guest.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 11 +++++++++++
> arch/x86/kvm/smm.c | 8 ++++++++
> arch/x86/kvm/smm.h | 2 +-
> 3 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 3ab133530573..cfc0ac8ddb4a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -149,6 +149,17 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
> if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
> return -EINVAL;
> }
> + /*
> + * Prevent 32-bit guest from being launched if CET is exposed as SSP
> + * state is not defined for 32-bit SMRAM.
> + */
> + best = cpuid_entry2_find(entries, nent, 0x80000001,
> + KVM_CPUID_INDEX_NOT_SIGNIFICANT);
> + if (best && !(best->edx & F(LM))) {
> + best = cpuid_entry2_find(entries, nent, 0x7, 0);
> + if (best && ((best->ecx & F(SHSTK)) || (best->edx & F(IBT))))
> + return -EINVAL;
> + }

I honestly prefer a check in enter_smm_save_state_32 because SMM might not even
be enabled/used for the guest, and for consistency with the SVM check that I added,
but whatever.
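
To illustrate, the kind of check meant here could look roughly like the
following in arch/x86/kvm/smm.c (a sketch only; the series instead rejects
the combination at KVM_SET_CPUID time):

static void enter_smm_save_state_32(struct kvm_vcpu *vcpu,
				    struct kvm_smram_state_32 *smram)
{
	/*
	 * 32-bit SMRAM has no SSP slot, so shadow stack state cannot be
	 * preserved across SMM; flag it here instead of failing guest
	 * creation.
	 */
	KVM_BUG_ON(guest_can_use(vcpu, X86_FEATURE_SHSTK), vcpu->kvm);

	/* ... existing 32-bit SMRAM save logic ... */
}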

Reviewed-by: Maxim Levitsky <[email protected]>


Best regards,
Maxim Levitsky

>
> /*
> * Exposing dynamic xfeatures to the guest requires additional
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index 45c855389ea7..7aac9c54c353 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
> enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>
> smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> +
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
> + vcpu->kvm);
> }
> #endif
>
> @@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
> static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
> ctxt->interruptibility = (u8)smstate->int_shadow;
>
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
> + vcpu->kvm);
> +
> return X86EMUL_CONTINUE;
> }
> #endif
> diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
> index a1cf2ac5bd78..1e2a3e18207f 100644
> --- a/arch/x86/kvm/smm.h
> +++ b/arch/x86/kvm/smm.h
> @@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
> u32 smbase;
> u32 reserved4[5];
>
> - /* ssp and svm_* fields below are not implemented by KVM */
> u64 ssp;
> + /* svm_* fields below are not implemented by KVM */
> u64 svm_guest_pat;
> u64 svm_host_efer;
> u64 svm_host_cr4;


Best regards,
Maxim Levitsky


2024-01-02 22:35:05

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 22/26] KVM: VMX: Set up interception for CET MSRs

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Enable/disable CET MSR interception per the associated feature configuration.
> The Shadow Stack feature requires all CET MSRs to be passed through to the
> guest to support it in user and supervisor mode, while the IBT feature only
> depends on MSR_IA32_{U,S}_CET to enable user and supervisor IBT.
>
> Note, this MSR design introduces an architectural limitation on SHSTK and
> IBT control for the guest, i.e., when SHSTK is exposed, IBT is also
> available to the guest from an architectural perspective since IBT relies
> on a subset of the SHSTK-relevant MSRs.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 42 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 064a5fe87948..08058b182893 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -692,6 +692,10 @@ static bool is_valid_passthrough_msr(u32 msr)
> case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> return true;
> + case MSR_IA32_U_CET:
> + case MSR_IA32_S_CET:
> + case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
> + return true;
> }
>
> r = possible_passthrough_msr_slot(msr) != -ENOENT;
> @@ -7767,6 +7771,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> }
>
> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
> +{
> + bool incpt;
> +
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
> +
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
> + MSR_TYPE_RW, incpt);
> + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> + MSR_TYPE_RW, incpt);
> + if (!incpt)
> + return;
> + }
> +
> + if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
> +
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> + MSR_TYPE_RW, incpt);
> + }
> +}
> +
> static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -7845,6 +7885,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>
> /* Refresh #PF interception to account for MAXPHYADDR changes. */
> vmx_update_exception_bitmap(vcpu);
> +
> + vmx_update_intercept_for_cet_msr(vcpu);
> }
>
> static u64 vmx_get_perf_capabilities(void)

Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky


2024-01-02 22:35:19

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 24/26] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Expose CET features to the guest if KVM/host can support them; clear the
> CPUID feature bits otherwise.
>
> Set CPUID feature bits so that CET features are available in guest CPUID.
> Add CR4.CET bit support in order to allow the guest to set the CET master
> control bit.
>
> Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
> KVM does not support emulating CET.
>
> The CET load-bits in VM_ENTRY/VM_EXIT control fields should be set to make
> guest CET xstates isolated from host's.
>
> On platforms with VMX_BASIC[bit56] == 0, injecting #CP at VMX entry with an
> error code will fail, while if VMX_BASIC[bit56] == 1, #CP injection with or
> without an error code is allowed. Disable the CET feature bits if the MSR
> bit is cleared so that a nested VMM can inject #CP if and only if
> VMX_BASIC[bit56] == 1.

This is a good explanation but IMHO it should be in the code and not in the commit message,
because it's hard to trace things to commit messages just to figure out what
the code is doing.
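
For example, the rationale could sit next to the capability helper this
patch adds, something along these lines (a sketch of the comment placement,
reusing the helper below):

static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
{
	/*
	 * VMX_BASIC[bit56]: if 0, injecting #CP at VM entry with an error
	 * code fails; if 1, #CP may be injected with or without an error
	 * code.  CET is only exposed when the bit is set so that a nested
	 * VMM can inject #CP correctly.
	 */
	return ((u64)vmcs_config.basic_cap << 32) &
	       VMX_BASIC_NO_HW_ERROR_CODE_CC;
}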

>
> Don't expose CET feature if either of {U,S}_CET xstate bits is cleared
> in host XSS or if XSAVES isn't supported.
>
> CET MSR contents after reset, power-up and INIT are defined to be 0s, so
> clear the guest fpstate fields so that the guest MSRs are reset to 0s after
> these events.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kvm/cpuid.c | 19 +++++++++++++++++--
> arch/x86/kvm/vmx/capabilities.h | 6 ++++++
> arch/x86/kvm/vmx/vmx.c | 29 ++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.h | 6 ++++--
> arch/x86/kvm/x86.c | 31 +++++++++++++++++++++++++++++--
> arch/x86/kvm/x86.h | 3 +++
> 8 files changed, 89 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6efaaaa15945..161d0552be5f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -134,7 +134,7 @@
> | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
> | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
> | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
> - | X86_CR4_LAM_SUP))
> + | X86_CR4_LAM_SUP | X86_CR4_CET))
>
> #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d51e1850ed0..233e00c01e62 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1102,6 +1102,7 @@
> #define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
> #define VMX_BASIC_MEM_TYPE_WB 6LLU
> #define VMX_BASIC_INOUT 0x0040000000000000LLU
> +#define VMX_BASIC_NO_HW_ERROR_CODE_CC 0x0100000000000000LLU
>
> /* Resctrl MSRs: */
> /* - Intel: */
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index cfc0ac8ddb4a..18d1a0eb0f64 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -665,7 +665,7 @@ void kvm_set_cpu_caps(void)
> F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
> F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
> F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
> - F(SGX_LC) | F(BUS_LOCK_DETECT)
> + F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
> );
> /* Set LA57 based on hardware capability. */
> if (cpuid_ecx(7) & F(LA57))
> @@ -683,7 +683,8 @@ void kvm_set_cpu_caps(void)
> F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
> F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
> F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
> - F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
> + F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
> + F(IBT)
> );
>
> /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
> @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
> kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
> if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
> kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> + /*
> + * Don't use boot_cpu_has() to check availability of IBT because the
> + * feature bit is cleared in boot_cpu_data when ibt=off is applied
> + * in host cmdline.
> + *
> + * As currently there's no HW bug which requires disabling IBT feature
> + * while CPU can enumerate it, host cmdline option ibt=off is most
> + * likely due to administrative reason on host side, so KVM refers to
> + * CPU CPUID enumeration to enable the feature. In future if there's
> + * actually some bug clobbered ibt=off option, then enforce additional
> + * check here to disable the support in KVM.
> + */
> + if (cpuid_edx(7) & F(IBT))
> + kvm_cpu_cap_set(X86_FEATURE_IBT);
>
> kvm_cpu_cap_mask(CPUID_7_1_EAX,
> F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index ee8938818c8a..e12bc233d88b 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -79,6 +79,12 @@ static inline bool cpu_has_vmx_basic_inout(void)
> return (((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
> }
>
> +static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
> +{
> + return ((u64)vmcs_config.basic_cap << 32) &
> + VMX_BASIC_NO_HW_ERROR_CODE_CC;
> +}
> +
> static inline bool cpu_has_virtual_nmis(void)
> {
> return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index e9c0b571b3bb..c802e790c0d5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2609,6 +2609,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> { VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
> { VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
> { VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
> + { VM_ENTRY_LOAD_CET_STATE, VM_EXIT_LOAD_CET_STATE },
> };
>
> memset(vmcs_conf, 0, sizeof(*vmcs_conf));
> @@ -4934,6 +4935,15 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>
> vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
>
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> + vmcs_writel(GUEST_SSP, 0);
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> + kvm_cpu_cap_has(X86_FEATURE_IBT))
> + vmcs_writel(GUEST_S_CET, 0);
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
> + IS_ENABLED(CONFIG_X86_64))
> + vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
> +
> kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>
> vpid_sync_context(vmx->vpid);
> @@ -6353,6 +6363,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
> vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);
>
> + if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE) {
> + pr_err("S_CET = 0x%016lx\n", vmcs_readl(GUEST_S_CET));
> + pr_err("SSP = 0x%016lx\n", vmcs_readl(GUEST_SSP));
> + pr_err("INTR SSP TABLE = 0x%016lx\n",
> + vmcs_readl(GUEST_INTR_SSP_TABLE));
> + }
> pr_err("*** Host State ***\n");
> pr_err("RIP = 0x%016lx RSP = 0x%016lx\n",
> vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
> @@ -6430,6 +6446,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
> pr_err("Virtual processor ID = 0x%04x\n",
> vmcs_read16(VIRTUAL_PROCESSOR_ID));
> + if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE) {
> + pr_err("S_CET = 0x%016lx\n", vmcs_readl(HOST_S_CET));
> + pr_err("SSP = 0x%016lx\n", vmcs_readl(HOST_SSP));
> + pr_err("INTR SSP TABLE = 0x%016lx\n",
> + vmcs_readl(HOST_INTR_SSP_TABLE));
> + }
> }
>
> /*
> @@ -7966,7 +7988,6 @@ static __init void vmx_set_cpu_caps(void)
> kvm_cpu_cap_set(X86_FEATURE_UMIP);
>
> /* CPUID 0xD.1 */
> - kvm_caps.supported_xss = 0;
> if (!cpu_has_vmx_xsaves())
> kvm_cpu_cap_clear(X86_FEATURE_XSAVES);
>
> @@ -7978,6 +7999,12 @@ static __init void vmx_set_cpu_caps(void)
>
> if (cpu_has_vmx_waitpkg())
> kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> +
> + if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
> + !cpu_has_vmx_basic_no_hw_errcode()) {
> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
> + }
> }
>
> static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index e3b0985bb74a..d0cad2624564 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -484,7 +484,8 @@ static inline u8 vmx_get_rvi(void)
> VM_ENTRY_LOAD_IA32_EFER | \
> VM_ENTRY_LOAD_BNDCFGS | \
> VM_ENTRY_PT_CONCEAL_PIP | \
> - VM_ENTRY_LOAD_IA32_RTIT_CTL)
> + VM_ENTRY_LOAD_IA32_RTIT_CTL | \
> + VM_ENTRY_LOAD_CET_STATE)
>
> #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
> (VM_EXIT_SAVE_DEBUG_CONTROLS | \
> @@ -506,7 +507,8 @@ static inline u8 vmx_get_rvi(void)
> VM_EXIT_LOAD_IA32_EFER | \
> VM_EXIT_CLEAR_BNDCFGS | \
> VM_EXIT_PT_CONCEAL_PIP | \
> - VM_EXIT_CLEAR_IA32_RTIT_CTL)
> + VM_EXIT_CLEAR_IA32_RTIT_CTL | \
> + VM_EXIT_LOAD_CET_STATE)
>
> #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
> (PIN_BASED_EXT_INTR_MASK | \
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9596763fae8d..5058c9c5f4cc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
> | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
> | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
>
> -#define KVM_SUPPORTED_XSS 0
> +#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
> + XFEATURE_MASK_CET_KERNEL)
>
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
> @@ -9921,6 +9922,20 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> kvm_caps.supported_xss = 0;
>
> + if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
> + !kvm_cpu_cap_has(X86_FEATURE_IBT))
> + kvm_caps.supported_xss &= ~(XFEATURE_CET_USER |
> + XFEATURE_CET_KERNEL);

OK.

> +
> + if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
> + XFEATURE_MASK_CET_KERNEL)) !=
> + (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
> + kvm_caps.supported_xss &= ~(XFEATURE_CET_USER |
> + XFEATURE_CET_KERNEL);
> + }
> +
> #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
> cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
> #undef __kvm_cpu_cap_has
> @@ -12392,7 +12407,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>
> static inline bool is_xstate_reset_needed(void)
> {
> - return kvm_cpu_cap_has(X86_FEATURE_MPX);
> + return kvm_cpu_cap_has(X86_FEATURE_MPX) ||
> + kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> + kvm_cpu_cap_has(X86_FEATURE_IBT);
> }
>
> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> @@ -12469,6 +12486,16 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> XFEATURE_BNDCSR);
> }
>
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_USER);
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_KERNEL);
> + } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_USER);
> + }
> +
> if (init_event)
> kvm_load_guest_fpu(vcpu);
> }
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 656107e64c93..cc585051d24b 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -533,6 +533,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
> __reserved_bits |= X86_CR4_PCIDE; \
> if (!__cpu_has(__c, X86_FEATURE_LAM)) \
> __reserved_bits |= X86_CR4_LAM_SUP; \
> + if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
> + !__cpu_has(__c, X86_FEATURE_IBT)) \
> + __reserved_bits |= X86_CR4_CET; \
> __reserved_bits; \
> })
>

Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky





2024-01-02 22:35:52

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v8 26/26] KVM: nVMX: Enable CET support for nested guest

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> to enable CET for nested VM.
>
> vmcs12 and vmcs02 need to be synced when L2 exits to L1 or when L1 wants
> to resume L2, so that correct CET states can be observed by one another.
>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 57 +++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/vmx/vmcs12.c | 6 +++++
> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++-
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 4 files changed, 76 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 468a7cf75035..dee718c65255 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -691,6 +691,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>
> + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_U_CET, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_S_CET, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> +
> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>
> vmx->nested.force_msr_bitmap_recalc = false;
> @@ -2506,6 +2528,17 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> +
> + if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> + vmcs_writel(GUEST_INTR_SSP_TABLE,
> + vmcs12->guest_ssp_tbl);
> + }
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> + }
> }

Looks good.
>
> if (nested_cpu_has_xsaves(vmcs12))
> @@ -4344,6 +4377,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> vmcs12->guest_pending_dbg_exceptions =
> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> + }
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> + }
> +
> vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
> }

Looks good.

>
> @@ -4569,6 +4611,16 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
> vmcs_write64(GUEST_BNDCFGS, 0);
>
> + if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_CET_STATE) {
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
> + vmcs_writel(HOST_SSP, vmcs12->host_ssp);
> + vmcs_writel(HOST_INTR_SSP_TABLE, vmcs12->host_ssp_tbl);
> + }
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> + guest_can_use(vcpu, X86_FEATURE_IBT))
> + vmcs_writel(HOST_S_CET, vmcs12->host_s_cet);
> + }
> +

Looks good.

> if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) {
> vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
> vcpu->arch.pat = vmcs12->host_ia32_pat;
> @@ -6840,7 +6892,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> VM_EXIT_HOST_ADDR_SPACE_SIZE |
> #endif
> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> - VM_EXIT_CLEAR_BNDCFGS;
> + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> msrs->exit_ctls_high |=
> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> @@ -6862,7 +6914,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> #ifdef CONFIG_X86_64
> VM_ENTRY_IA32E_MODE |
> #endif
> - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> + VM_ENTRY_LOAD_CET_STATE;
> msrs->entry_ctls_high |=
> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> index 106a72c923ca..4233b5ca9461 100644
> --- a/arch/x86/kvm/vmx/vmcs12.c
> +++ b/arch/x86/kvm/vmx/vmcs12.c
> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> + FIELD(GUEST_S_CET, guest_s_cet),
> + FIELD(GUEST_SSP, guest_ssp),
> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> FIELD(HOST_CR0, host_cr0),
> FIELD(HOST_CR3, host_cr3),
> FIELD(HOST_CR4, host_cr4),
> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
> FIELD(HOST_RSP, host_rsp),
> FIELD(HOST_RIP, host_rip),
> + FIELD(HOST_S_CET, host_s_cet),
> + FIELD(HOST_SSP, host_ssp),
> + FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
> };
> const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> index 01936013428b..3884489e7f7e 100644
> --- a/arch/x86/kvm/vmx/vmcs12.h
> +++ b/arch/x86/kvm/vmx/vmcs12.h
> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
> natural_width host_ia32_sysenter_eip;
> natural_width host_rsp;
> natural_width host_rip;
> - natural_width paddingl[8]; /* room for future expansion */
> + natural_width host_s_cet;
> + natural_width host_ssp;
> + natural_width host_ssp_tbl;
> + natural_width guest_s_cet;
> + natural_width guest_ssp;
> + natural_width guest_ssp_tbl;
> + natural_width paddingl[2]; /* room for future expansion */
> u32 pin_based_vm_exec_control;
> u32 cpu_based_vm_exec_control;
> u32 exception_bitmap;
> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
> CHECK_OFFSET(host_ia32_sysenter_eip, 656);
> CHECK_OFFSET(host_rsp, 664);
> CHECK_OFFSET(host_rip, 672);
> + CHECK_OFFSET(host_s_cet, 680);
> + CHECK_OFFSET(host_ssp, 688);
> + CHECK_OFFSET(host_ssp_tbl, 696);
> + CHECK_OFFSET(guest_s_cet, 704);
> + CHECK_OFFSET(guest_ssp, 712);
> + CHECK_OFFSET(guest_ssp_tbl, 720);
> CHECK_OFFSET(pin_based_vm_exec_control, 744);
> CHECK_OFFSET(cpu_based_vm_exec_control, 748);
> CHECK_OFFSET(exception_bitmap, 752);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c802e790c0d5..7ddd3f6fe8ab 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7732,6 +7732,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
> cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
> cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
> cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
> + cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
> + cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));
>
> entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
> cr4_fixed1_update(X86_CR4_LAM_SUP, eax, feature_bit(LAM));

Looks good to me, but I might have missed something. Nesting is always tricky to get right,
so this should be very well tested.


Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky



2024-01-03 09:10:57

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 04/26] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On 1/3/2024 6:25 AM, Maxim Levitsky wrote:
> On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
>> Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features
>> that can be optionally enabled by kernel components. This is similar to
>> XFEATURE_MASK_USER_DYNAMIC in that it contains optional xfeatures that
>> can allow the FPU buffer to be dynamically sized. The difference is that
>> the KERNEL variant contains supervisor features and will be enabled by
>> kernel components that need them, and not directly by the user. Currently
>> it's used by KVM to configure guest dedicated fpstate for calculating
>> the xfeature and fpstate storage size etc.
>>
>> The kernel dynamic xfeatures currently contain only XFEATURE_CET_KERNEL,
>> which is supported by the host since it is enabled in the kernel XSS MSR
>> setting, but the relevant CPU feature, i.e., supervisor shadow stack, is
>> not enabled in the host kernel, so it can be omitted from the normal
>> fpstate by default.
>>
>> Remove the kernel dynamic feature from fpu_kernel_cfg.default_features
>> so that the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors
>> can be optimized by HW for normal fpstate.
>>
>> Suggested-by: Dave Hansen <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/include/asm/fpu/xstate.h | 5 ++++-
>> arch/x86/kernel/fpu/xstate.c | 1 +
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
>> index 3b4a038d3c57..a212d3851429 100644
>> --- a/arch/x86/include/asm/fpu/xstate.h
>> +++ b/arch/x86/include/asm/fpu/xstate.h
>> @@ -46,9 +46,12 @@
>> #define XFEATURE_MASK_USER_RESTORE \
>> (XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
>>
>> -/* Features which are dynamically enabled for a process on request */
>> +/* Features which are dynamically enabled per userspace request */
>> #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
>>
>> +/* Features which are dynamically enabled per kernel side request */
>> +#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
>> +
>> /* All currently supported supervisor features */
>> #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
>> XFEATURE_MASK_CET_USER | \
>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index 03e166a87d61..ca4b83c142eb 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>> /* Clean out dynamic features from default */
>> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
>> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>> + fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;
>>
>> fpu_user_cfg.default_features = fpu_user_cfg.max_features;
>> fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>
> I still think that we should consider adding XFEATURE_MASK_CET_KERNEL to
> XFEATURE_MASK_INDEPENDENT or at least have a good conversation on why this doesn't make sense,

Hi, Maxim,
Thanks for continuously adding feedback on this series! Appreciated!

I think the discussion is not closed at this point, but we need maintainers to indicate the preferred
approach. So far I'm following the alignment previously reached in community discussion, but it's still
open for discussion.

IMHO, folding XFEATURE_MASK_CET_KERNEL into XFEATURE_MASK_INDEPENDENT isn't necessarily cheap; we may
have to touch more code that works pretty well these days.

On the KVM side, currently after VM-exit the guest arch-lbr MSRs are not saved/restored unless the vCPU
thread is preempted, in which case the host kernel arch-lbr save/restore code handles the MSRs. But for
guest CET supervisor xstate, the host kernel doesn't have a similar mechanism to handle the CET
supervisor MSRs, so it would require relatively "eager" handling after VM-exit. If we mix the two
different flavors in XFEATURE_MASK_INDEPENDENT, it would make it harder to handle guest xstates.

Note, arch-lbr support for guests hasn't been upstreamed yet; the above is based on my previous upstream
solution. Maybe I missed something, but this looks true for both guest features.
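
To make the point concrete, the kind of eager handling this would imply is
roughly the following (a sketch only; the MSRs are real but the helper and
the vcpu->arch fields are invented for illustration):

static void kvm_save_guest_cet_supervisor(struct kvm_vcpu *vcpu)
{
	/* explicit per-MSR save shortly after VM-exit */
	rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.guest_pl0_ssp);
	rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.guest_pl1_ssp);
	rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.guest_pl2_ssp);
}

With CET_S kept in the guest fpstate instead, these MSRs are simply covered
by the XSAVES/XRSTORS done in the existing fpu_swap_kvm_fpstate() path.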
> but I also don't intend to fight over this, as long as the code works.
>
> Best regards,
> Maxim Levitsky
>


2024-01-03 09:19:01

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 05/26] x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration

On 1/3/2024 6:32 AM, Maxim Levitsky wrote:
> On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
>> Define new fpu_guest_cfg to hold all guest FPU settings so that it can
>> differ from generic kernel FPU settings, e.g., enabling CET supervisor
>> xstate by default for guest fpstate while it remains disabled in
>> kernel FPU config.
>>
>> The kernel dynamic xfeatures are specifically used by guest fpstate now,
>> add the mask for guest fpstate so that guest_perm.__state_perm ==
>> (fpu_kernel_cfg.default_xfeature | XFEATURE_MASK_KERNEL_DYNAMIC). And
>> if guest fpstate is re-allocated to hold user dynamic xfeatures, the
>> resulting permissions are consumed before calculating the new guest fpstate.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> Reviewed-by: Maxim Levitsky <[email protected]>
>> ---
>> arch/x86/include/asm/fpu/types.h | 2 +-
>> arch/x86/kernel/fpu/core.c | 70 ++++++++++++++++++++++++++++++--
>> arch/x86/kernel/fpu/xstate.c | 10 +++++
>> 3 files changed, 78 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
>> index c6fd13a17205..306825ad6bc0 100644
>> --- a/arch/x86/include/asm/fpu/types.h
>> +++ b/arch/x86/include/asm/fpu/types.h
>> @@ -602,6 +602,6 @@ struct fpu_state_config {
>> };
>>
>> /* FPU state configuration information */
>> -extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;
>> +extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg, fpu_guest_cfg;
>>
>> #endif /* _ASM_X86_FPU_H */
>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>> index a21a4d0ecc34..976f519721e2 100644
>> --- a/arch/x86/kernel/fpu/core.c
>> +++ b/arch/x86/kernel/fpu/core.c
>> @@ -33,10 +33,67 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
>> DEFINE_PER_CPU(u64, xfd_state);
>> #endif
>>
>> -/* The FPU state configuration data for kernel and user space */
>> +/* The FPU state configuration data for kernel, user space and guest. */
>> +/*
>> + * kernel FPU config:
>> + *
>> + * all known and CPU supported user and supervisor features except
>> + * - independent kernel features (XFEATURE_LBR)
>> + * @fpu_kernel_cfg.max_features;
>> + *
>> + * all known and CPU supported user and supervisor features except
>> + * - dynamic kernel features (CET_S)
>> + * - independent kernel features (XFEATURE_LBR)
>> + * - dynamic userspace features (AMX state)
>> + * @fpu_kernel_cfg.default_features;
>> + *
>> + * size of compacted buffer with 'fpu_kernel_cfg.max_features'
>> + * @fpu_kernel_cfg.max_size;
>> + *
>> + * size of compacted buffer with 'fpu_kernel_cfg.default_features'
>> + * @fpu_kernel_cfg.default_size;
>> + */
>> struct fpu_state_config fpu_kernel_cfg __ro_after_init;
>> +
>> +/*
>> + * user FPU config:
>> + *
>> + * all known and CPU supported user features
>> + * @fpu_user_cfg.max_features;
>> + *
>> + * all known and CPU supported user features except
>> + * - dynamic userspace features (AMX state)
>> + * @fpu_user_cfg.default_features;
>> + *
>> + * size of non-compacted buffer with 'fpu_user_cfg.max_features'
>> + * @fpu_user_cfg.max_size;
>> + *
>> + * size of non-compacted buffer with 'fpu_user_cfg.default_features'
>> + * @fpu_user_cfg.default_size;
>> + */
>> struct fpu_state_config fpu_user_cfg __ro_after_init;
>>
>> +/*
>> + * guest FPU config:
>> + *
>> + * all known and CPU supported user and supervisor features except
>> + * - independent kernel features (XFEATURE_LBR)
>> + * @fpu_guest_cfg.max_features;
>> + *
>> + * all known and CPU supported user and supervisor features except
>> + * - independent kernel features (XFEATURE_LBR)
>> + * - dynamic userspace features (AMX state)
>> + * @fpu_guest_cfg.default_features;
>> + *
>> + * size of compacted buffer with 'fpu_guest_cfg.max_features'
>> + * @fpu_guest_cfg.max_size;
>> + *
>> + * size of compacted buffer with 'fpu_guest_cfg.default_features'
>> + * @fpu_guest_cfg.default_size;
>> + */
>
> IMHO this comment is too verbose. I didn't intend it to be copied verbatim,
> to the kernel, but rather to explain the meaning of the fpu context fields
> to both of us (I also keep on forgetting what each combination means...).
>
> At least this comment should not include examples because xfeatures
> are subject to change.

Yeah, I cannot find a better place to put these annotations, but I feel putting them here
is not too bad :-). How about putting them in the commit log?

The inlined examples are just to make it clearer to readers how the fields are used; I will
surely remove them later.


2024-01-03 18:17:08

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 06/26] x86/fpu/xstate: Create guest fpstate with guest specific config

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
>  #if IS_ENABLED(CONFIG_KVM)
> -static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
> -
>  static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
>  {
>         struct fpu_state_perm *fpuperm;
> @@ -272,25 +270,54 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
>         gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
>  }
>  
> -bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
>  {
> +       bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);

With CONFIG_WERROR I get:
arch/x86/kernel/fpu/core.c: In function ‘__fpu_alloc_init_guest_fpstate’:
arch/x86/kernel/fpu/core.c:275:14: error: unused variable ‘compacted’ [-Werror=unused-variable]
  275 |         bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);

2024-01-03 18:51:15

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Control-flow Enforcement Technology (CET) is a kind of CPU feature
> used
> to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP)
> attacks.
> It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
> style control-flow subversion attacks.
>
> Shadow Stack (SHSTK):
>   A shadow stack is a second stack used exclusively for control
> transfer
>   operations. The shadow stack is separate from the data/normal stack
> and
>   can be enabled individually in user and kernel mode. When shadow
> stack
>   is enabled, CALL pushes the return address on both the data and
> shadow
>   stack. RET pops the return address from both stacks and compares
> them.
>   If the return addresses from the two stacks do not match, the
> processor
>   generates a #CP.
>
> Indirect Branch Tracking (IBT):
>   IBT introduces new instruction(ENDBRANCH)to mark valid target
> addresses of
>   indirect branches (CALL, JMP etc...). If an indirect branch is
> executed
>   and the next instruction is _not_ an ENDBRANCH, the processor
> generates a
>   #CP. These instruction behaves as a NOP on platforms that doesn't
> support
>   CET.

What is the design around CET and the KVM emulator?

My understanding is that the KVM emulator kind of does what it has to
keep things running, and isn't expected to emulate every possible
instruction. With CET though, it is changing the behavior of existing
supported instructions. I could imagine a guest could skip over CET
enforcement by causing an MMIO exit and racing to overwrite the exit-
causing instruction from a different vcpu to be an indirect CALL/RET,
etc. With reasonable assumptions around the threat model in use by the
guest this is probably not a huge problem. And I guess also reasonable
assumptions about functional expectations, as a mishandled CALL or RET
by the emulator would corrupt the shadow stack.

But, another thing to do could be to just return X86EMUL_UNHANDLEABLE
or X86EMUL_RETRY_INSTR when CET is active and RET or CALL are emulated.
And I guess also for all instructions if the TRACKER bit is set. It
might tie up that loose end without too much trouble.

Anyway, was there a conscious decision to just punt on CET enforcement
in the emulator?

2024-01-04 02:17:26

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 06/26] x86/fpu/xstate: Create guest fpstate with guest specific config

On 1/4/2024 2:16 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
>>  #if IS_ENABLED(CONFIG_KVM)
>> -static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
>> -
>>  static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
>>  {
>>         struct fpu_state_perm *fpuperm;
>> @@ -272,25 +270,54 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
>>         gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
>>  }
>>
>> -bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
>> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct
>> fpu_guest *gfpu)
>>  {
>> +       bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
> With CONFIG_WERROR I get:
> arch/x86/kernel/fpu/core.c: In function
> ‘__fpu_alloc_init_guest_fpstate’:
> arch/x86/kernel/fpu/core.c:275:14: error: unused variable ‘compacted’
> [-Werror=unused-variable]
> 275 | bool compacted =
> cpu_feature_enabled(X86_FEATURE_XCOMPACTED);

Nice catch! Will remove this unused variable, thanks!

>


2024-01-04 07:12:13

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On 1/4/2024 2:50 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
>> Control-flow Enforcement Technology (CET) is a kind of CPU feature
>> used
>> to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP)
>> attacks.
>> It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
>> style control-flow subversion attacks.
>>
>> Shadow Stack (SHSTK):
>>   A shadow stack is a second stack used exclusively for control
>> transfer
>>   operations. The shadow stack is separate from the data/normal stack
>> and
>>   can be enabled individually in user and kernel mode. When shadow
>> stack
>>   is enabled, CALL pushes the return address on both the data and
>> shadow
>>   stack. RET pops the return address from both stacks and compares
>> them.
>>   If the return addresses from the two stacks do not match, the
>> processor
>>   generates a #CP.
>>
>> Indirect Branch Tracking (IBT):
>>   IBT introduces new instruction(ENDBRANCH)to mark valid target
>> addresses of
>>   indirect branches (CALL, JMP etc...). If an indirect branch is
>> executed
>>   and the next instruction is _not_ an ENDBRANCH, the processor
>> generates a
>>   #CP. These instruction behaves as a NOP on platforms that doesn't
>> support
>>   CET.
> What is the design around CET and the KVM emulator?

KVM doesn't emulate CET HW behavior for guest CET; instead it leaves the CET-related
checks and handling to the guest kernel. E.g., if an emulated JMP/CALL in the emulator
triggers a mismatch of data stack and shadow stack contents, the #CP is generated in
non-root mode instead of being injected by KVM.  KVM only emulates basic x86 HW
behaviors, e.g., call/jmp/ret/in/out etc.

> My understanding is that the KVM emulator kind of does what it has to
> keep things running, and isn't expected to emulate every possible
> instruction. With CET though, it is changing the behavior of existing
> supported instructions. I could imagine a guest could skip over CET
> enforcement by causing an MMIO exit and racing to overwrite the exit-
> causing instruction from a different vcpu to be an indirect CALL/RET,
> etc.

Can you elaborate on the case? I cannot figure out how it works.

> With reasonable assumptions around the threat model in use by the
> guest this is probably not a huge problem. And I guess also reasonable
> assumptions about functional expectations, as a misshandled CALL or RET
> by the emulator would corrupt the shadow stack.

KVM emulates general x86 HW behaviors; if something goes wrong after emulation,
it could equally go wrong on bare metal, i.e., the guest SW most likely went wrong
somewhere and it's expected to trigger CET exceptions in the guest kernel.

> But, another thing to do could be to just return X86EMUL_UNHANDLEABLE
> or X86EMUL_RETRY_INSTR when CET is active and RET or CALL are emulated.

IMHO, translating the CET-induced exceptions into X86EMUL_UNHANDLEABLE or
X86EMUL_RETRY_INSTR would confuse the guest kernel or even the VMM; I prefer
letting the guest kernel handle #CP directly.
> And I guess also for all instructions if the TRACKER bit is set. It
> might tie up that loose end without too much trouble.
>
> Anyway, was there a conscious decision to just punt on CET enforcement
> in the emulator?

I don't remember us ever discussing it in the community, but since the KVM maintainers
have reviewed the CET virtualization series for a long time, I assume we're moving in
the right direction :-)



2024-01-04 21:10:41

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Thu, 2024-01-04 at 15:11 +0800, Yang, Weijiang wrote:
> > What is the design around CET and the KVM emulator?
>
> KVM doesn't emulate CET HW behavior for guest CET, instead it leaves
> CET related
> checks and handling in guest kernel. E.g., if emulated JMP/CALL in
> emulator triggers
> mismatch of data stack and shadow stack contents, #CP is generated in
> non-root
> mode instead of being injected by KVM.  KVM only emulates basic x86
> HW behaviors,
> e.g., call/jmp/ret/in/out etc.

Right. In the case of CET those basic behaviors (call/jmp/ret) now have
host emulation behavior that doesn't match what guest execution would
do.

>
> > My understanding is that the KVM emulator kind of does what it has
> > to
> > keep things running, and isn't expected to emulate every possible
> > instruction. With CET though, it is changing the behavior of
> > existing
> > supported instructions. I could imagine a guest could skip over CET
> > enforcement by causing an MMIO exit and racing to overwrite the
> > exit-
> > causing instruction from a different vcpu to be an indirect
> > CALL/RET,
> > etc.
>
> Can you elaborate the case? I cannot figure out how it works.

The point that it should be possible for KVM to emulate call/ret with
CET enabled. Not saying the specific case is critical, but the one I
used as an example was that the KVM emulator can (or at least in the
not too distant past) be forced to emulate arbitrary instructions if
the guest overwrites the instruction between the exit and the SW fetch
from the host. 

The steps are:
vcpu 1                                  vcpu 2
------------------------------------------------------------------
mov to mmio addr
vm exit ept_misconfig
                                        overwrite mov instruction to call %rax
host emulator fetches
host emulates call instruction

So then the guest call operation will skip the endbranch check. But I'm
not sure there aren't less exotic cases that would run across it.
I see a bunch of cases where write-protected memory kicks to the
emulator as well. I'm not sure of the exact scenarios and whether this could
happen naturally in races during live migration, dirty tracking, etc.
Again, I'm more just asking about the exposure and thinking on it.

>
> > With reasonable assumptions around the threat model in use by the
> > guest this is probably not a huge problem. And I guess also
> > reasonable
> > assumptions about functional expectations, as a misshandled CALL or
> > RET
> > by the emulator would corrupt the shadow stack.
>
> KVM emulates general x86 HW behaviors, if something wrong happens
> after emulation
> then it can happen even on bare metal, i.e., guest SW most likely
> gets wrong somewhere
> and it's expected to trigger CET exceptions in guest kernel.
>
> > But, another thing to do could be to just return
> > X86EMUL_UNHANDLEABLE
> > or X86EMUL_RETRY_INSTR when CET is active and RET or CALL are
> > emulated.
>
> IMHO, translating the CET induced exceptions into
> X86EMUL_UNHANDLEABLE or X86EMUL_RETRY_INSTR would confuse guest
> kernel or even VMM, I prefer letting guest kernel handle #CP
> directly.

Doesn't X86EMUL_RETRY_INSTR kick it back to the guest which is what you
want? Today it will do the operations without the special CET behavior.

But I do see how this could be tricky to avoid the guest getting stuck
in a loop with X86EMUL_RETRY_INSTR. I guess the question is if this
situation is encountered, when KVM can't handle the emulation
correctly, what should happen? I think usually it returns
KVM_INTERNAL_ERROR_EMULATION to userspace? So I don't see why the CET
case is different.

If the scenario (call/ret emulation with CET enabled) doesn't happen,
how can the guest be confused? If it does happen, won't it be an issue?

> > And I guess also for all instructions if the TRACKER bit is set. It
> > might tie up that loose end without too much trouble.
> >
> > Anyway, was there a conscious decision to just punt on CET
> > enforcement
> > in the emulator?
>
> I don't remember we ever discussed it in community, but since KVM
> maintainers reviewed
> the CET virtualization series for a long time, I assume we're moving
> on the right way :-)

It seems like kind of a leap that, if it never came up, they must be
approving of the specific detail. Don't know. Maybe they will chime in.

2024-01-04 22:26:37

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 04/26] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On Wed, 2024-01-03 at 00:25 +0200, Maxim Levitsky wrote:
> I still think that we should consider adding XFEATURE_MASK_CET_KERNEL
> to
> XFEATURE_MASK_INDEPENDENT or at least have a good conversation on why
> this doesn't make sense,
> but I also don't intend to fight over this, as long as the code
> works.

Hi,

Using XFEATURE_MASK_INDEPENDENT would be pretty close to what we
initially discussed when this series resumed:
https://lore.kernel.org/lkml/[email protected]/

Except that it used manual MSR operations instead of xsaves. But the
gist is the same I think - the state is managed manually by KVM.

An XFEATURE_MASK_INDEPENDENT solution seems reasonable to me. I kind of
liked that the MSR version didn't complicate the overly complex FPU
code. But there was an idea to give XFEATURE_MASK_KERNEL_DYNAMIC a try,
to see if it turned out easy. I think it turned out "ok" complexity
wise. So it isn't a clear win one way or the other for me.

I guess it might be slightly more efficient as in this patch because it
gets to use the lazy FPU stuff. It won't need to save/restore if the
exit is handled within KVM, or the kernel switches to a kernel thread
and back. I think that tilts it in favor of
XFEATURE_MASK_KERNEL_DYNAMIC.

Rick


2024-01-04 22:27:00

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 04/26] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the
> features
> that can be optionally enabled by kernel components. This is similar
> to
> XFEATURE_MASK_USER_DYNAMIC in that it contains optional xfeatures
> that
> can allows the FPU buffer to be dynamically sized. The difference is
> that
> the KERNEL variant contains supervisor features and will be enabled
> by
> kernel components that need them, and not directly by the user.
> Currently
> it's used by KVM to configure guest dedicated fpstate for calculating
> the xfeature and fpstate storage size etc.
>
> The kernel dynamic xfeatures now only contain XFEATURE_CET_KERNEL,
> which
> is supported by host as they're enabled in kernel XSS MSR setting but
> relevant CPU feature, i.e., supervisor shadow stack, is not enabled
> in
> host kernel therefore it can be omitted for normal fpstate by
> default.
>
> Remove the kernel dynamic feature from
> fpu_kernel_cfg.default_features
> so that the bits in xstate_bv and xcomp_bv are cleared and
> xsaves/xrstors
> can be optimized by HW for normal fpstate.
>
> Suggested-by: Dave Hansen <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>

Reviewed-by: Rick Edgecombe <[email protected]>

2024-01-04 22:30:13

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Tests:
> ======================
> This series passed basic CET user shadow stack test and kernel IBT
> test in L1
> and L2 guest.

With the build fix, reproduced the basic IBT and user shadow stack
tests, plus the CET enabled glibc unit tests.

Tested-by: Rick Edgecombe <[email protected]>

2024-01-04 22:42:27

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 05/26] x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration

On Wed, 2024-01-03 at 00:32 +0200, Maxim Levitsky wrote:
> At least this comment should not include examples because xfeatures
> are subject to change.

+1 to this.

2024-01-04 22:47:40

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 06/26] x86/fpu/xstate: Create guest fpstate with guest specific config

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct
> fpu_guest *gfpu)
>  {
> +       bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
> +       unsigned int gfpstate_size, size;
>         struct fpstate *fpstate;
> -       unsigned int size;
>  
> -       size = fpu_user_cfg.default_size + ALIGN(offsetof(struct
> fpstate, regs), 64);
> +       /*
> +        * fpu_guest_cfg.default_size is initialized to hold all
> enabled
> +        * xfeatures except the user dynamic xfeatures. If the user
> dynamic
> +        * xfeatures are enabled, the guest fpstate will be re-
> allocated to
> +        * hold all guest enabled xfeatures, so omit user dynamic
> xfeatures
> +        * here.
> +        */
> +       size = fpu_guest_cfg.default_size +
> +              ALIGN(offsetof(struct fpstate, regs), 64);
> +
>         fpstate = vzalloc(size);
>         if (!fpstate)
> -               return false;
> +               return NULL;
> +       /*
> +        * Initialize sizes and feature masks, use fpu_user_cfg.*
> +        * for user_* settings for compatibility of exiting uAPIs.
> +        */
> +       fpstate->size           = gfpstate_size;

gfpstate_size is used uninitialized.

2024-01-05 00:22:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Thu, Jan 04, 2024, Rick P Edgecombe wrote:
> On Thu, 2024-01-04 at 15:11 +0800, Yang, Weijiang wrote:
> > > What is the design around CET and the KVM emulator?
> >
> > KVM doesn't emulate CET HW behavior for guest CET, instead it leaves CET
> > related checks and handling in guest kernel. E.g., if emulated JMP/CALL in
> > emulator triggers mismatch of data stack and shadow stack contents, #CP is
> > generated in non-root mode instead of being injected by KVM.  KVM only
> > emulates basic x86 HW behaviors, e.g., call/jmp/ret/in/out etc.
>
> Right. In the case of CET those basic behaviors (call/jmp/ret) now have
> host emulation behavior that doesn't match what guest execution would
> do.

I wouldn't say that KVM emulates "basic" x86. KVM emulates instructions that
BIOS and kernels execute in Big Real Mode (and other "illegal" modes prior to Intel
adding unrestricted guest), instructions that guests commonly use for MMIO, I/O,
and page table modifications, and a few other tidbits that have cropped up over the
years.

In other words, as Weijiang suspects below, KVM's emulator handles juuust enough
stuff to squeak by and not barf on real world guests. It is not, and has never
been, anything remotely resembling a fully capable architectural emulator.

> > > My understanding is that the KVM emulator kind of does what it has to
> > > keep things running, and isn't expected to emulate every possible
> > > instruction. With CET though, it is changing the behavior of existing
> > > supported instructions. I could imagine a guest could skip over CET
> > > enforcement by causing an MMIO exit and racing to overwrite the exit-
> > > causing instruction from a different vcpu to be an indirect CALL/RET,
> > > etc.
> >
> > Can you elaborate the case? I cannot figure out how it works.
>
> The point that it should be possible for KVM to emulate call/ret with
> CET enabled. Not saying the specific case is critical, but the one I
> used as an example was that the KVM emulator can (or at least in the
> not too distant past) be forced to emulate arbitrary instructions if
> the guest overwrites the instruction between the exit and the SW fetch
> from the host. 
>
> The steps are:
> vcpu 1 vcpu 2
> -------------------------------------
> mov to mmio addr
> vm exit ept_misconfig
> overwrite mov instruction to call %rax
> host emulator fetches
> host emulates call instruction
>
> So then the guest call operation will skip the endbranch check. But I'm
> not sure that there are not less exotic cases that would run across it.
> I see a bunch of cases where write protected memory kicks to the
> emulator as well. Not sure the exact scenarios and whether this could
> happen naturally in races during live migration, dirty tracking, etc.

It's for shadow paging. Instead of _immediately_ zapping SPTEs on any write to
a shadowed guest PTE, KVM instead tries to emulate the faulting instruction (and
then still zaps SPTE). If KVM can't emulate the instruction for whatever reason,
then KVM will _usually_ just zap the SPTE and resume the guest, i.e. retry the
faulting instruction.

The reason KVM doesn't automatically/unconditionally zap and retry is that there
are circumstances where the guest can't make forward progress, e.g. if an
instruction is using a guest PTE that it is writing, if L2 is modifying L1 PTEs,
and probably a few other edge cases I'm forgetting.

> Again, I'm more just asking the exposure and thinking on it.

If you care about exposure to the emulator from a guest security perspective,
assume that a compromised guest can coerce KVM into attempting to emulate
arbitrary bytes. As in the situation described above, it's not _that_ difficult
to play games with TLBs and instruction vs. data caches.

If all you care about is not breaking misbehaving guests, I wouldn't worry too
much about it.

> > > With reasonable assumptions around the threat model in use by the guest
> > > this is probably not a huge problem. And I guess also reasonable
> > > assumptions about functional expectations, as a misshandled CALL or RET
> > > by the emulator would corrupt the shadow stack.
> >
> > KVM emulates general x86 HW behaviors, if something wrong happens after
> > emulation then it can happen even on bare metal, i.e., guest SW most likely
> > gets wrong somewhere and it's expected to trigger CET exceptions in guest
> > kernel.

No, the days of KVM making shit up are done. IIUC, you're advocating that
it's ok for KVM to induce a #CP that architecturally should not happen. That is
not acceptable, full stop.

Retrying the instruction in the guest, exiting to userspace, and even terminating
the VM are all perfectly acceptable behaviors if KVM encounters something it can't
*correctly* emulate. But clobbering the shadow stack or not detecting a CFI
violation, even if the guest is misbehaving, is not ok.

> > > But, another thing to do could be to just return X86EMUL_UNHANDLEABLE or
> > > X86EMUL_RETRY_INSTR when CET is active and RET or CALL are emulated.
> >
> > IMHO, translating the CET induced exceptions into X86EMUL_UNHANDLEABLE or
> > X86EMUL_RETRY_INSTR would confuse guest kernel or even VMM, I prefer
> > letting guest kernel handle #CP directly.
>
> Doesn't X86EMUL_RETRY_INSTR kick it back to the guest which is what you
> want? Today it will do the operations without the special CET behavior.
>
> But I do see how this could be tricky to avoid the guest getting stuck
> in a loop with X86EMUL_RETRY_INSTR. I guess the question is if this
> situation is encountered, when KVM can't handle the emulation
> correctly, what should happen? I think usually it returns
> KVM_INTERNAL_ERROR_EMULATION to userspace? So I don't see why the CET
> case is different.
>
> If the scenario (call/ret emulation with CET enabled) doesn't happen,
> how can the guest be confused? If it does happen, won't it be an issue?
>
> > > And I guess also for all instructions if the TRACKER bit is set. It
> > > might tie up that loose end without too much trouble.
> > >
> > > Anyway, was there a conscious decision to just punt on CET enforcement in
> > > the emulator?
> >
> > I don't remember we ever discussed it in community, but since KVM
> > maintainers reviewed the CET virtualization series for a long time, I
> > assume we're moving on the right way :-)
>
> It seems like kind of leap that if it never came up that they must be
> approving of the specific detail. Don't know. Maybe they will chime in.

Yeah, I don't even know what the TRACKER bit does (I don't feel like reading the
SDM right now), let alone if what KVM does or doesn't do in response is remotely
correct.

For CALL/RET (and presumably any branch instructions with IBT?) and other instructions
that are directly affected by CET, the simplest thing would probably be to disable
those in KVM's emulator if shadow stacks and/or IBT are enabled, and let KVM's
failure paths take it from there.

Then, *if* a use case comes along where the guest is utilizing CET and "needs"
KVM to emulate affected instructions, we can add the necessary support to the emulator.

Alternatively, if teaching KVM's emulator to play nice with shadow stacks and IBT
is easy-ish, just do that.

2024-01-05 00:34:27

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Thu, 2024-01-04 at 16:22 -0800, Sean Christopherson wrote:
> No, the days of KVM making shit up from are done.  IIUC, you're
> advocating that
> it's ok for KVM to induce a #CP that architecturally should not
> happen.  That is
> not acceptable, full stop.

Nope, not advocating that at all. I'm noticing that in this series KVM
has special emulator behavior that doesn't match the HW when CET is
enabled. It *skips* emitting #CPs (and other CET behaviors SW
depends on), and I'm wondering if that is a problem.

I'm worried that there is some way attackers will induce the host to
emulate an instruction and skip CET enforcement that the HW would
normally do.

>
> Retrying the instruction in the guest, exiting to userspace, and even
> terminating
> the VM are all perfectly acceptable behaviors if KVM encounters
> something it can't
> *correctly* emulate.  But clobbering the shadow stack or not
> detecting a CFI
> violation, even if the guest is misbehaving, is not ok.
>
[snip]
> Yeah, I don't even know what the TRACKER bit does (I don't feel like
> reading the
> SDM right now), let alone if what KVM does or doesn't do in response
> is remotely
> correct.
>
> For CALL/RET (and presumably any branch instructions with IBT?) other
> instructions
> that are directly affected by CET, the simplest thing would probably
> be to disable
> those in KVM's emulator if shadow stacks and/or IBT are enabled, and
> let KVM's
> failure paths take it from there.

Right, that is what I was wondering might be the normal solution for
situations like this.

>
> Then, *if* a use case comes along where the guest is utilizing CET
> and "needs"
> KVM to emulate affected instructions, we can add the necessary
> support the emulator.
>
> Alternatively, if teaching KVM's emulator to play nice with shadow
> stacks and IBT
> is easy-ish, just do that.

I think it will not be very easy.

2024-01-05 00:45:40

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Thu, Jan 4, 2024 at 4:34 PM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Thu, 2024-01-04 at 16:22 -0800, Sean Christopherson wrote:
> > No, the days of KVM making shit up from are done. IIUC, you're
> > advocating that
> > it's ok for KVM to induce a #CP that architecturally should not
> > happen. That is
> > not acceptable, full stop.
>
> Nope, not advocating that at all. I'm noticing that in this series KVM
> has special emulator behavior that doesn't match the HW when CET is
> enabled. That it *skips* emitting #CPs (and other CET behaviors SW
> depends on), and wondering if it is a problem.
>
> I'm worried that there is some way attackers will induce the host to
> emulate an instruction and skip CET enforcement that the HW would
> normally do.
>
> >
> > Retrying the instruction in the guest, exiting to userspace, and even
> > terminating
> > the VM are all perfectly acceptable behaviors if KVM encounters
> > something it can't
> > *correctly* emulate. But clobbering the shadow stack or not
> > detecting a CFI
> > violation, even if the guest is misbehaving, is not ok.
> >
> [snip]
> > Yeah, I don't even know what the TRACKER bit does (I don't feel like
> > reading the
> > SDM right now), let alone if what KVM does or doesn't do in response
> > is remotely
> > correct.
> >
> > For CALL/RET (and presumably any branch instructions with IBT?) other
> > instructions
> > that are directly affected by CET, the simplest thing would probably
> > be to disable
> > those in KVM's emulator if shadow stacks and/or IBT are enabled, and
> > let KVM's
> > failure paths take it from there.
>
> Right, that is what I was wondering might be the normal solution for
> situations like this.

On AMD CPUs and on Intel CPUs with "unrestricted guest," I don't think
there is any need to emulate an instruction that doesn't either (a)
cause a VM-exit by opcode (e.g. CPUID) or (b) access memory. I think
we should probably disable emulation of anything else, for both
security and sanity.

> >
> > Then, *if* a use case comes along where the guest is utilizing CET
> > and "needs"
> > KVM to emulate affected instructions, we can add the necessary
> > support the emulator.
> >
> > Alternatively, if teaching KVM's emulator to play nice with shadow
> > stacks and IBT
> > is easy-ish, just do that.
>
> I think it will not be very easy.

2024-01-05 00:55:09

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
> On Thu, 2024-01-04 at 16:22 -0800, Sean Christopherson wrote:
> > No, the days of KVM making shit up from are done.  IIUC, you're advocating
> > that it's ok for KVM to induce a #CP that architecturally should not
> > happen.  That is not acceptable, full stop.
>
> Nope, not advocating that at all.

Heh, wrong "you". That "you" was directed at Weijiang, who I *think* is saying
that clobbering the shadow stack by emulating CALL+RET and thus inducing a bogus
#CP in the guest is ok.

> I'm noticing that in this series KVM has special emulator behavior that
> doesn't match the HW when CET is enabled. That it *skips* emitting #CPs (and
> other CET behaviors SW depends on), and wondering if it is a problem.

Yes, it's a problem. But IIUC, as is KVM would also induce bogus #CPs (which is
probably less of a problem in practice, but still not acceptable).

> I'm worried that there is some way attackers will induce the host to
> emulate an instruction and skip CET enforcement that the HW would
> normally do.

Yep. The best behavior for this is likely KVM's existing behavior, i.e. retry
the instruction in the guest, and if that doesn't work, kick out to userspace and
let userspace try to sort things out.

> > For CALL/RET (and presumably any branch instructions with IBT?) other
> > instructions that are directly affected by CET, the simplest thing would
> > probably be to disable those in KVM's emulator if shadow stacks and/or IBT
> > are enabled, and let KVM's failure paths take it from there.
>
> Right, that is what I was wondering might be the normal solution for
> situations like this.

If KVM can't emulate something, it either retries the instruction (with some
decent logic to guard against infinite retries) or punts to userspace.

Or if the platform owner likes to play with fire and doesn't enable
KVM_CAP_EXIT_ON_EMULATION_FAILURE, KVM will inject a #UD (and still exit to
userspace if the emulation happened at CPL0). And yes, that #UD is 100% KVM
making shit up, and yes, it has caused problems and confusion. :-)

> > Then, *if* a use case comes along where the guest is utilizing CET and
> > "needs" KVM to emulate affected instructions, we can add the necessary
> > support the emulator.
> >
> > Alternatively, if teaching KVM's emulator to play nice with shadow stacks
> > and IBT is easy-ish, just do that.
>
> I think it will not be very easy.

Yeah. As Jim alluded to, I think it's probably time to admit that emulating
instructions for modern CPUs is a fool's errand and KVM should simply stop trying.
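
For reference, a minimal userspace sketch of the punt-to-userspace path discussed
above (illustration only, not from this series; vm_fd and the mmap'd run struct
are assumed to come from the usual KVM_CREATE_VM / KVM_RUN setup):

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Opt in so emulation failures exit to userspace instead of injecting #UD. */
static int enable_emulation_failure_exits(int vm_fd)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_EXIT_ON_EMULATION_FAILURE,
                .args[0] = 1,
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/* In the vcpu run loop: recognize an emulation failure and bail out. */
static void handle_internal_error(struct kvm_run *run)
{
        if (run->exit_reason == KVM_EXIT_INTERNAL_ERROR &&
            run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
                if (run->emulation_failure.flags &
                    KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES)
                        fprintf(stderr, "emulation failed, %u insn bytes provided\n",
                                run->emulation_failure.insn_size);
                /* Typically terminate (or debug) the VM here. */
        }
}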

2024-01-05 08:17:23

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 06/26] x86/fpu/xstate: Create guest fpstate with guest specific config

On 1/5/2024 6:47 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
>> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct
>> fpu_guest *gfpu)
>>  {
>> +       bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
>> +       unsigned int gfpstate_size, size;
>>         struct fpstate *fpstate;
>> -       unsigned int size;
>>
>> -       size = fpu_user_cfg.default_size + ALIGN(offsetof(struct
>> fpstate, regs), 64);
>> +       /*
>> +        * fpu_guest_cfg.default_size is initialized to hold all
>> enabled
>> +        * xfeatures except the user dynamic xfeatures. If the user
>> dynamic
>> +        * xfeatures are enabled, the guest fpstate will be re-
>> allocated to
>> +        * hold all guest enabled xfeatures, so omit user dynamic
>> xfeatures
>> +        * here.
>> +        */
>> +       size = fpu_guest_cfg.default_size +
>> +              ALIGN(offsetof(struct fpstate, regs), 64);
>> +
>>         fpstate = vzalloc(size);
>>         if (!fpstate)
>> -               return false;
>> +               return NULL;
>> +       /*
>> +        * Initialize sizes and feature masks, use fpu_user_cfg.*
>> +        * for user_* settings for compatibility of exiting uAPIs.
>> +        */
>> +       fpstate->size           = gfpstate_size;
> gfpstate_size is used uninitialized.

Ah, this is another leftover of introducing fpu_guest_cfg.*; it should be replaced with
fpu_guest_cfg.default_size. Thanks for catching it!
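
For the record, the intended correction reduces to something like the following
(sketch only, reusing the fpu_guest_cfg introduced earlier in this series):

        /* Size the guest fpstate from the guest FPU config directly. */
        fpstate->size = fpu_guest_cfg.default_size;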



2024-01-05 09:05:32

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On 1/5/2024 5:10 AM, Edgecombe, Rick P wrote:
> On Thu, 2024-01-04 at 15:11 +0800, Yang, Weijiang wrote:
[...]
>>> My understanding is that the KVM emulator kind of does what it has
>>> to
>>> keep things running, and isn't expected to emulate every possible
>>> instruction. With CET though, it is changing the behavior of
>>> existing
>>> supported instructions. I could imagine a guest could skip over CET
>>> enforcement by causing an MMIO exit and racing to overwrite the
>>> exit-
>>> causing instruction from a different vcpu to be an indirect
>>> CALL/RET,
>>> etc.
>> Can you elaborate the case? I cannot figure out how it works.
> The point that it should be possible for KVM to emulate call/ret with
> CET enabled. Not saying the specific case is critical, but the one I
> used as an example was that the KVM emulator can (or at least in the
> not too distant past) be forced to emulate arbitrary instructions if
> the guest overwrites the instruction between the exit and the SW fetch
> from the host.
>
> The steps are:
> vcpu 1 vcpu 2
> -------------------------------------
> mov to mmio addr
> vm exit ept_misconfig
> overwrite mov instruction to call %rax
> host emulator fetches
> host emulates call instruction
>
> So then the guest call operation will skip the endbranch check. But I'm
> not sure that there are not less exotic cases that would run across it.
> I see a bunch of cases where write protected memory kicks to the
> emulator as well. Not sure the exact scenarios and whether this could
> happen naturally in races during live migration, dirty tracking, etc.
> Again, I'm more just asking the exposure and thinking on it.

Now I get your point. I didn't think of exposure from the guest and only considered the
normal execution flow in the guest, so I said to let the guest handle #CP directly.

Yes, I think we need to take these cases into account. As Sean suggested in the following
replies, stopping emulation of JMP/CALL/RET etc. instructions when guest CET is enabled
is effective and simple; I'll investigate the emulator code.

Thanks for raising the concerns!


2024-01-05 09:28:59

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On 1/5/2024 8:54 AM, Sean Christopherson wrote:
> On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>> On Thu, 2024-01-04 at 16:22 -0800, Sean Christopherson wrote:
>>> No, the days of KVM making shit up from are done.  IIUC, you're advocating
>>> that it's ok for KVM to induce a #CP that architecturally should not
>>> happen.  That is not acceptable, full stop.
>> Nope, not advocating that at all.
> Heh, wrong "you". That "you" was directed at Weijiang, who I *think* is saying
> that clobbering the shadow stack by emulating CALL+RET and thus inducing a bogus
> #CP in the guest is ok.

My fault, I just thought of the normal execution instead of the subverting cases :-)

>
>> I'm noticing that in this series KVM has special emulator behavior that
>> doesn't match the HW when CET is enabled. That it *skips* emitting #CPs (and
>> other CET behaviors SW depends on), and wondering if it is a problem.
> Yes, it's a problem. But IIUC, as is KVM would also induce bogus #CPs (which is
> probably less of a problem in practice, but still not acceptable).

I'd choose to stop emulating the CET-sensitive instructions while CET is enabled in the
guest, as re-entering the guest after emulation would raise some kind of risk, but I
don't know how to stop the emulation cleanly.

>> I'm worried that there is some way attackers will induce the host to
>> emulate an instruction and skip CET enforcement that the HW would
>> normally do.
> Yep. The best behavior for this is likely KVM's existing behavior, i.e. retry
> the instruction in the guest, and if that doesn't work, kick out to userspace and
> let userspace try to sort things out.
>
>>> For CALL/RET (and presumably any branch instructions with IBT?) other
>>> instructions that are directly affected by CET, the simplest thing would
>>> probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>>> are enabled, and let KVM's failure paths take it from there.
>> Right, that is what I was wondering might be the normal solution for
>> situations like this.
> If KVM can't emulate something, it either retries the instruction (with some
> decent logic to guard against infinite retries) or punts to userspace.

What kind of error is proper if KVM has to punt to userspace? Or should it just
inject #UD into the guest on detecting this case?

>
> Or if the platform owner likes to play with fire and doesn't enable
> KVM_CAP_EXIT_ON_EMULATION_FAILURE, KVM will inject a #UD (and still exit to
> userspace if the emulation happened at CPL0). And yes, that #UD is 100% KVM
> making shit up, and yes, it has caused problems and confusion. :-)
>
>>> Then, *if* a use case comes along where the guest is utilizing CET and
>>> "needs" KVM to emulate affected instructions, we can add the necessary
>>> support the emulator.
>>>
>>> Alternatively, if teaching KVM's emulator to play nice with shadow stacks
>>> and IBT is easy-ish, just do that.
>> I think it will not be very easy.
> Yeah. As Jim alluded to, I think it's probably time to admit that emulating
> instructions for modern CPUs is a fools errand and KVM should simply stop trying.
>


2024-01-05 16:21:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Fri, Jan 05, 2024, Weijiang Yang wrote:
> On 1/5/2024 8:54 AM, Sean Christopherson wrote:
> > On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
> > > > For CALL/RET (and presumably any branch instructions with IBT?) other
> > > > instructions that are directly affected by CET, the simplest thing would
> > > > probably be to disable those in KVM's emulator if shadow stacks and/or IBT
> > > > are enabled, and let KVM's failure paths take it from there.
> > > Right, that is what I was wondering might be the normal solution for
> > > situations like this.
> > If KVM can't emulate something, it either retries the instruction (with some
> > decent logic to guard against infinite retries) or punts to userspace.
>
> What kind of error is proper if KVM has to punt to userspace?

KVM_INTERNAL_ERROR_EMULATION. See prepare_emulation_failure_exit().

> Or just inject #UD into guest on detecting this case?

No, do not inject #UD or do anything else that deviates from architecturally
defined behavior.

2024-01-05 17:53:08

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Fri, 2024-01-05 at 08:21 -0800, Sean Christopherson wrote:
> No, do not inject #UD or do anything else that deviates from
> architecturally
> defined behavior.

Here is a, at least partial, list of CET touch points I just created by
searching the SDM:
1. The emulator SW fetch with TRACKER=1
2. CALL, RET, JMP, IRET, INT, SYSCALL, SYSENTER, SYSEXIT, SYSRET
3. Task switching
4. The new CET instructions (which I guess should be handled by
default): CLRSSBSY, INCSSPD, RSTORSSP, SAVEPREVSSP, SETSSBSY, WRSS,
WRUSS

Not all of those are security checks, but would have some functional
implications. It's still not clear to me if this could happen naturally
(the TDP shadowing stuff), or only via strange attacker behavior. If we
only care about the attacker case, then we could have a smaller list.

It also sounds like the instructions in 2 could maybe be filtered by
mode instead of caring about CET being enabled. But maybe it's not good
to mix the CET problem with the bigger emulator issues. Don't know.

2024-01-05 18:10:04

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Fri, Jan 5, 2024 at 9:53 AM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Fri, 2024-01-05 at 08:21 -0800, Sean Christopherson wrote:
> > No, do not inject #UD or do anything else that deviates from
> > architecturally
> > defined behavior.
>
> Here is a, at least partial, list of CET touch points I just created by
> searching the SDM:
> 1. The emulator SW fetch with TRACKER=1
> 2. CALL, RET, JMP, IRET, INT, SYSCALL, SYSENTER, SYSEXIT, SYSRET
> 3. Task switching

Sigh. KVM is forced to emulate task switch, because the hardware is
incapable of virtualizing it. How hard would it be to make KVM's
task-switch emulation CET-aware?

> 4. The new CET instructions (which I guess should be handled by
> default): CLRSSBSY, INCSSPD, RSTORSSP, SAVEPREVSSP, SETSSBSYY, WRSS,
> WRUSS
>
> Not all of those are security checks, but would have some functional
> implications. It's still not clear to me if this could happen naturally
> (the TDP shadowing stuff), or only via strange attacker behavior. If we
> only care about the attacker case, then we could have a smaller list.
>
> It also sounds like the instructions in 2 could maybe be filtered by
> mode instead of caring about CET being enabled. But maybe it's not good
> to mix the CET problem with the bigger emulator issues. Don't know.

2024-01-05 18:58:50

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Fri, 2024-01-05 at 10:09 -0800, Jim Mattson wrote:
> > 3. Task switching
>
> Sigh. KVM is forced to emulate task switch, because the hardware is
> incapable of virtualizing it. How hard would it be to make KVM's
> task-switch emulation CET-aware?

(I am not too familiar with this part of the arch).

See SDM Vol 3a, chapter 7.3, number 8 and 15. The behavior is around
actual task switching. At first glance, it looks annoying at least. It
would need to do a CMPXCHG to guest memory at some points and take care
to not implement the "Complex Shadow-Stack Updates" behavior.

But, would anyone use it? I'm not aware of any 32 bit supervisor shadow
stack support out there. So maybe it is ok to just punt to userspace in
this case?


2024-01-05 19:34:19

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-01-05 at 10:09 -0800, Jim Mattson wrote:
> > > 3. Task switching
> >
> > Sigh. KVM is forced to emulate task switch, because the hardware is
> > incapable of virtualizing it. How hard would it be to make KVM's
> > task-switch emulation CET-aware?
>
> (I am not too familiar with this part of the arch).
>
> See SDM Vol 3a, chapter 7.3, number 8 and 15. The behavior is around
> actual task switching. At first glance, it looks annoying at least. It
> would need to do a CMPXCHG to guest memory at some points and take care
> to not implement the "Complex Shadow-Stack Updates" behavior.
>
> But, would anyone use it? I'm not aware of any 32 bit supervisor shadow
> stack support out there. So maybe it is ok to just punt to userspace in
> this case?

Yeah, I think KVM can punt.

2024-01-08 14:18:22

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On 1/6/2024 12:21 AM, Sean Christopherson wrote:
> On Fri, Jan 05, 2024, Weijiang Yang wrote:
>> On 1/5/2024 8:54 AM, Sean Christopherson wrote:
>>> On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>>>>> For CALL/RET (and presumably any branch instructions with IBT?) other
>>>>> instructions that are directly affected by CET, the simplest thing would
>>>>> probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>>>>> are enabled, and let KVM's failure paths take it from there.
>>>> Right, that is what I was wondering might be the normal solution for
>>>> situations like this.
>>> If KVM can't emulate something, it either retries the instruction (with some
>>> decent logic to guard against infinite retries) or punts to userspace.
>> What kind of error is proper if KVM has to punt to userspace?
> KVM_INTERNAL_ERROR_EMULATION. See prepare_emulation_failure_exit().
>
>> Or just inject #UD into guest on detecting this case?
> No, do not inject #UD or do anything else that deviates from architecturally
> defined behavior.

Thanks!
But based on the current KVM implementation and patch 24, it seems that if CET is exposed
to the guest, the emulation code or shadow paging mode can't be active at the same time:

In vmx.c,
hardware_setup(void):
if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
        enable_unrestricted_guest = 0;

in vmx_set_cr0():
[...]
        if (enable_unrestricted_guest)
                hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST;
        else {
                hw_cr0 |= KVM_VM_CR0_ALWAYS_ON;
                if (!enable_ept)
                        hw_cr0 |= X86_CR0_WP;

                if (vmx->rmode.vm86_active && (cr0 & X86_CR0_PE))
                        enter_pmode(vcpu);

                if (!vmx->rmode.vm86_active && !(cr0 & X86_CR0_PE))
                        enter_rmode(vcpu);
        }
[...]

And in patch 24:

+   if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
+       !cpu_has_vmx_basic_no_hw_errcode()) {
+       kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+       kvm_cpu_cap_clear(X86_FEATURE_IBT);
+   }

Not sure if I missed anything.



2024-01-09 15:10:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Mon, Jan 08, 2024, Weijiang Yang wrote:
> On 1/6/2024 12:21 AM, Sean Christopherson wrote:
> > On Fri, Jan 05, 2024, Weijiang Yang wrote:
> > > On 1/5/2024 8:54 AM, Sean Christopherson wrote:
> > > > On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
> > > > > > For CALL/RET (and presumably any branch instructions with IBT?) other
> > > > > > instructions that are directly affected by CET, the simplest thing would
> > > > > > probably be to disable those in KVM's emulator if shadow stacks and/or IBT
> > > > > > are enabled, and let KVM's failure paths take it from there.
> > > > > Right, that is what I was wondering might be the normal solution for
> > > > > situations like this.
> > > > If KVM can't emulate something, it either retries the instruction (with some
> > > > decent logic to guard against infinite retries) or punts to userspace.
> > > What kind of error is proper if KVM has to punt to userspace?
> > KVM_INTERNAL_ERROR_EMULATION. See prepare_emulation_failure_exit().
> >
> > > Or just inject #UD into guest on detecting this case?
> > No, do not inject #UD or do anything else that deviates from architecturally
> > defined behavior.
>
> Thanks!
> But based on current KVM implementation and patch 24, seems that if CET is exposed
> to guest, the emulation code or shadow paging mode couldn't be activated at the same time:

No, requiring unrestricted guest only disables the paths where KVM *deliberately*
emulates the entire guest code stream. In no way, shape, or form does it prevent
KVM from attempting to emulate arbitrary instructions.

> In vmx.c,
> hardware_setup(void):
> if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
>         enable_unrestricted_guest = 0;
>
> in vmx_set_cr0():
> [...]
>         if (enable_unrestricted_guest)
>                 hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST;
>         else {
>                 hw_cr0 |= KVM_VM_CR0_ALWAYS_ON;
>                 if (!enable_ept)
>                         hw_cr0 |= X86_CR0_WP;
>
>                 if (vmx->rmode.vm86_active && (cr0 & X86_CR0_PE))
>                         enter_pmode(vcpu);
>
>                 if (!vmx->rmode.vm86_active && !(cr0 & X86_CR0_PE))
>                         enter_rmode(vcpu);
>         }
> [...]
>
> And in patch 24:
>
> +   if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
> +       !cpu_has_vmx_basic_no_hw_errcode()) {
> +       kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> +       kvm_cpu_cap_clear(X86_FEATURE_IBT);
> +   }
>
> Not sure if I missed anything.
>
>

2024-01-11 15:00:50

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On 1/9/2024 11:10 PM, Sean Christopherson wrote:
> On Mon, Jan 08, 2024, Weijiang Yang wrote:
>> On 1/6/2024 12:21 AM, Sean Christopherson wrote:
>>> On Fri, Jan 05, 2024, Weijiang Yang wrote:
>>>> On 1/5/2024 8:54 AM, Sean Christopherson wrote:
>>>>> On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>>>>>>> For CALL/RET (and presumably any branch instructions with IBT?) other
>>>>>>> instructions that are directly affected by CET, the simplest thing would
>>>>>>> probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>>>>>>> are enabled, and let KVM's failure paths take it from there.
>>>>>> Right, that is what I was wondering might be the normal solution for
>>>>>> situations like this.
>>>>> If KVM can't emulate something, it either retries the instruction (with some
>>>>> decent logic to guard against infinite retries) or punts to userspace.
>>>> What kind of error is proper if KVM has to punt to userspace?
>>> KVM_INTERNAL_ERROR_EMULATION. See prepare_emulation_failure_exit().
>>>
>>>> Or just inject #UD into guest on detecting this case?
>>> No, do not inject #UD or do anything else that deviates from architecturally
>>> defined behavior.
>> Thanks!
>> But based on current KVM implementation and patch 24, seems that if CET is exposed
>> to guest, the emulation code or shadow paging mode couldn't be activated at the same time:
> No, requiring unrestricted guest only disables the paths where KVM *delibeately*
> emulates the entire guest code stream. In no way, shape, or form does it prevent
> KVM from attempting to emulate arbitrary instructions.

Yes, we also need to prevent sporadic emulation; how about adding the patch below to the emulator?


diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e223043ef5b2..e817d8560ceb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -178,6 +178,7 @@
 #define IncSP       ((u64)1 << 54)  /* SP is incremented before ModRM calc */
 #define TwoMemOp    ((u64)1 << 55)  /* Instruction has two memory operand */
 #define IsBranch    ((u64)1 << 56)  /* Instruction is considered a branch. */
+#define IsProtected ((u64)1 << 57)  /* Instruction is protected by CET. */

 #define DstXacc     (DstAccLo | SrcAccHi | SrcWrite)

@@ -4098,9 +4099,9 @@ static const struct opcode group4[] = {
 static const struct opcode group5[] = {
        F(DstMem | SrcNone | Lock,              em_inc),
        F(DstMem | SrcNone | Lock,              em_dec),
-       I(SrcMem | NearBranch | IsBranch,       em_call_near_abs),
-       I(SrcMemFAddr | ImplicitOps | IsBranch, em_call_far),
-       I(SrcMem | NearBranch | IsBranch,       em_jmp_abs),
+       I(SrcMem | NearBranch | IsBranch | IsProtected, em_call_near_abs),
+       I(SrcMemFAddr | ImplicitOps | IsBranch | IsProtected, em_call_far),
+       I(SrcMem | NearBranch | IsBranch | IsProtected, em_jmp_abs),
        I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
        I(SrcMem | Stack | TwoMemOp,            em_push), D(Undefined),
 };
@@ -4362,11 +4363,11 @@ static const struct opcode opcode_table[256] = {
        /* 0xC8 - 0xCF */
        I(Stack | SrcImmU16 | Src2ImmByte | IsBranch, em_enter),
        I(Stack | IsBranch, em_leave),
-       I(ImplicitOps | SrcImmU16 | IsBranch, em_ret_far_imm),
-       I(ImplicitOps | IsBranch, em_ret_far),
-       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch, intn),
+       I(ImplicitOps | SrcImmU16 | IsBranch | IsProtected, em_ret_far_imm),
+       I(ImplicitOps | IsBranch | IsProtected, em_ret_far),
+       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch | IsProtected, intn),
        D(ImplicitOps | No64 | IsBranch),
-       II(ImplicitOps | IsBranch, em_iret, iret),
+       II(ImplicitOps | IsBranch | IsProtected, em_iret, iret),
        /* 0xD0 - 0xD7 */
        G(Src2One | ByteOp, group2), G(Src2One, group2),
        G(Src2CL | ByteOp, group2), G(Src2CL, group2),
@@ -4382,7 +4383,7 @@ static const struct opcode opcode_table[256] = {
        I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
        I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
        /* 0xE8 - 0xEF */
-       I(SrcImm | NearBranch | IsBranch, em_call),
+       I(SrcImm | NearBranch | IsBranch | IsProtected, em_call),
        D(SrcImm | ImplicitOps | NearBranch | IsBranch),
        I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
        D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
@@ -4401,7 +4402,7 @@ static const struct opcode opcode_table[256] = {
 static const struct opcode twobyte_table[256] = {
        /* 0x00 - 0x0F */
        G(0, group6), GD(0, &group7), N, N,
-       N, I(ImplicitOps | EmulateOnUD | IsBranch, em_syscall),
+       N, I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_syscall),
        II(ImplicitOps | Priv, em_clts, clts), N,
        DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
        N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
@@ -4432,8 +4433,8 @@ static const struct opcode twobyte_table[256] = {
        IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
        II(ImplicitOps | Priv, em_rdmsr, rdmsr),
        IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
-       I(ImplicitOps | EmulateOnUD | IsBranch, em_sysenter),
-       I(ImplicitOps | Priv | EmulateOnUD | IsBranch, em_sysexit),
+       I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_sysenter),
+       I(ImplicitOps | Priv | EmulateOnUD | IsBranch | IsProtected, em_sysexit),
        N, N,
        N, N, N, N, N, N, N, N,
        /* 0x40 - 0x4F */
@@ -4971,6 +4972,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
        if (ctxt->d == 0)
                return EMULATION_FAILED;
+       if ((opcode.flags & IsProtected) &&
+           (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET)) {
+               WARN_ONCE(1, "CET is active, emulation aborted.\n");
+               return EMULATION_FAILED;
+       }
+
        ctxt->execute = opcode.u.execute;

        if (unlikely(emulation_type & EMULTYPE_TRAP_UD) &&


2024-01-15 01:56:19

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On Thu, Jan 11, 2024 at 10:56:55PM +0800, Yang, Weijiang wrote:
>On 1/9/2024 11:10 PM, Sean Christopherson wrote:
>> On Mon, Jan 08, 2024, Weijiang Yang wrote:
>> > On 1/6/2024 12:21 AM, Sean Christopherson wrote:
>> > > On Fri, Jan 05, 2024, Weijiang Yang wrote:
>> > > > On 1/5/2024 8:54 AM, Sean Christopherson wrote:
>> > > > > On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>> > > > > > > For CALL/RET (and presumably any branch instructions with IBT?) other
>> > > > > > > instructions that are directly affected by CET, the simplest thing would
>> > > > > > > probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>> > > > > > > are enabled, and let KVM's failure paths take it from there.
>> > > > > > Right, that is what I was wondering might be the normal solution for
>> > > > > > situations like this.
>> > > > > If KVM can't emulate something, it either retries the instruction (with some
>> > > > > decent logic to guard against infinite retries) or punts to userspace.
>> > > > What kind of error is proper if KVM has to punt to userspace?
>> > > KVM_INTERNAL_ERROR_EMULATION. See prepare_emulation_failure_exit().
>> > >
>> > > > Or just inject #UD into guest on detecting this case?
>> > > No, do not inject #UD or do anything else that deviates from architecturally
>> > > defined behavior.
>> > Thanks!
>> > But based on current KVM implementation and patch 24, seems that if CET is exposed
>> > to guest, the emulation code or shadow paging mode couldn't be activated at the same time:
>> No, requiring unrestricted guest only disables the paths where KVM *delibeately*
>> emulates the entire guest code stream. In no way, shape, or form does it prevent
>> KVM from attempting to emulate arbitrary instructions.
>
>Yes, also need to prevent sporadic emulation, how about adding below patch in emulator?
>
>
>diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>index e223043ef5b2..e817d8560ceb 100644
>--- a/arch/x86/kvm/emulate.c
>+++ b/arch/x86/kvm/emulate.c
>@@ -178,6 +178,7 @@
> #define IncSP       ((u64)1 << 54)  /* SP is incremented before ModRM calc */
> #define TwoMemOp    ((u64)1 << 55)  /* Instruction has two memory operand */
> #define IsBranch    ((u64)1 << 56)  /* Instruction is considered a branch. */
>+#define IsProtected ((u64)1 << 57)  /* Instruction is protected by CET. */
>
> #define DstXacc     (DstAccLo | SrcAccHi | SrcWrite)
>
>@@ -4098,9 +4099,9 @@ static const struct opcode group4[] = {
> static const struct opcode group5[] = {
>        F(DstMem | SrcNone | Lock,              em_inc),
>        F(DstMem | SrcNone | Lock,              em_dec),
>-       I(SrcMem | NearBranch | IsBranch,       em_call_near_abs),
>-       I(SrcMemFAddr | ImplicitOps | IsBranch, em_call_far),
>-       I(SrcMem | NearBranch | IsBranch,       em_jmp_abs),
>+       I(SrcMem | NearBranch | IsBranch | IsProtected, em_call_near_abs),
>+       I(SrcMemFAddr | ImplicitOps | IsBranch | IsProtected, em_call_far),
>+       I(SrcMem | NearBranch | IsBranch | IsProtected, em_jmp_abs),
>        I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
>        I(SrcMem | Stack | TwoMemOp,            em_push), D(Undefined),
> };
>@@ -4362,11 +4363,11 @@ static const struct opcode opcode_table[256] = {
>        /* 0xC8 - 0xCF */
>        I(Stack | SrcImmU16 | Src2ImmByte | IsBranch, em_enter),
>        I(Stack | IsBranch, em_leave),
>-       I(ImplicitOps | SrcImmU16 | IsBranch, em_ret_far_imm),
>-       I(ImplicitOps | IsBranch, em_ret_far),
>-       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch, intn),
>+       I(ImplicitOps | SrcImmU16 | IsBranch | IsProtected, em_ret_far_imm),
>+       I(ImplicitOps | IsBranch | IsProtected, em_ret_far),
>+       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch | IsProtected, intn),
>        D(ImplicitOps | No64 | IsBranch),
>-       II(ImplicitOps | IsBranch, em_iret, iret),
>+       II(ImplicitOps | IsBranch | IsProtected, em_iret, iret),
>        /* 0xD0 - 0xD7 */
>        G(Src2One | ByteOp, group2), G(Src2One, group2),
>        G(Src2CL | ByteOp, group2), G(Src2CL, group2),
>@@ -4382,7 +4383,7 @@ static const struct opcode opcode_table[256] = {
>        I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
>        I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
>        /* 0xE8 - 0xEF */
>-       I(SrcImm | NearBranch | IsBranch, em_call),
>+       I(SrcImm | NearBranch | IsBranch | IsProtected, em_call),
>        D(SrcImm | ImplicitOps | NearBranch | IsBranch),
>        I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
>        D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
>@@ -4401,7 +4402,7 @@ static const struct opcode opcode_table[256] = {
> static const struct opcode twobyte_table[256] = {
>        /* 0x00 - 0x0F */
>        G(0, group6), GD(0, &group7), N, N,
>-       N, I(ImplicitOps | EmulateOnUD | IsBranch, em_syscall),
>+       N, I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_syscall),
>        II(ImplicitOps | Priv, em_clts, clts), N,
>        DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
>        N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
>@@ -4432,8 +4433,8 @@ static const struct opcode twobyte_table[256] = {
>        IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
>        II(ImplicitOps | Priv, em_rdmsr, rdmsr),
>        IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
>-       I(ImplicitOps | EmulateOnUD | IsBranch, em_sysenter),
>-       I(ImplicitOps | Priv | EmulateOnUD | IsBranch, em_sysexit),
>+       I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_sysenter),
>+       I(ImplicitOps | Priv | EmulateOnUD | IsBranch | IsProtected, em_sysexit),
>        N, N,
>        N, N, N, N, N, N, N, N,
>        /* 0x40 - 0x4F */
>@@ -4971,6 +4972,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>        if (ctxt->d == 0)
>                return EMULATION_FAILED;
>+       if ((opcode.flags & IsProtected) &&
>+           (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET)) {

CR4.CET doesn't necessarily mean IBT or shadow stack is enabled. why not check
CPL and IA32_S/U_CET?

>+               WARN_ONCE(1, "CET is active, emulation aborted.\n");

remove this WARN_ONCE(). Guest can trigger this at will and overflow host dmesg.

if you really want to tell userspace the emulation_failure is due to CET, maybe
you can add a new flag like KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES.
for now, I won't bother to add this because probably userspace just terminates
the VM on any instruction failure (i.e., won't try to figure out the reason of
the instruction failure and fix it).
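
A minimal sketch of the CPL/IA32_S/U_CET-based check suggested above, assuming the
emulator's existing ->cpl()/->get_msr() ops and the CET_SHSTK_EN/CET_ENDBR_EN bit
definitions from msr-index.h, might look like:

	if (opcode.flags & IsProtected) {
		u64 cet;
		u32 msr = ctxt->ops->cpl(ctxt) == 3 ? MSR_IA32_U_CET :
						      MSR_IA32_S_CET;

		/*
		 * Refuse to emulate only when SHSTK or IBT is actually
		 * enabled for the current privilege level.
		 */
		if (!ctxt->ops->get_msr(ctxt, msr, &cet) &&
		    (cet & (CET_SHSTK_EN | CET_ENDBR_EN)))
			return EMULATION_FAILED;
	}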

2024-01-15 09:59:33

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v8 22/26] KVM: VMX: Set up interception for CET MSRs

On Thu, Dec 21, 2023 at 09:02:35AM -0500, Yang Weijiang wrote:
> Enable/disable CET MSRs interception per associated feature configuration.
> Shadow Stack feature requires all CET MSRs passed through to guest to make
> it supported in user and supervisor mode while IBT feature only depends on
> MSR_IA32_{U,S}_CET to enable user and supervisor IBT.
>
> Note, this MSR design introduced an architectural limitation of SHSTK and
> IBT control for guest, i.e., when SHSTK is exposed, IBT is also available
> to guest from architectural perspective since IBT relies on subset of SHSTK
> relevant MSRs.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 42 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 064a5fe87948..08058b182893 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -692,6 +692,10 @@ static bool is_valid_passthrough_msr(u32 msr)
> case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> return true;
> + case MSR_IA32_U_CET:
> + case MSR_IA32_S_CET:
> + case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
> + return true;
> }
>
> r = possible_passthrough_msr_slot(msr) != -ENOENT;
> @@ -7767,6 +7771,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> }
>
> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
> +{
> + bool incpt;
> +
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
> +
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
> + MSR_TYPE_RW, incpt);
> + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))

Looks this leading to MSR_IA32_INT_SSP_TAB not intercepted
after below steps:

Step 1. User space set cpuid w/ X86_FEATURE_LM, w/ SHSTK.
Step 2. User space set cpuid w/o X86_FEATURE_LM, w/o SHSTK.

Then MSR_IA32_INT_SSP_TAB won't be intercepted even w/o SHSTK
on guest cpuid, will this lead to inconsistency when do
rdmsr(MSR_IA32_INT_SSP_TAB) from guest in this scenario ?

> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> + MSR_TYPE_RW, incpt);
> + if (!incpt)
> + return;
> + }
> +
> + if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
> +
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> + MSR_TYPE_RW, incpt);
> + }
> +}
> +
> static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -7845,6 +7885,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>
> /* Refresh #PF interception to account for MAXPHYADDR changes. */
> vmx_update_exception_bitmap(vcpu);
> +
> + vmx_update_intercept_for_cet_msr(vcpu);
> }
>
> static u64 vmx_get_perf_capabilities(void)
> --
> 2.39.3
>
>

2024-01-16 07:22:49

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v8 26/26] KVM: nVMX: Enable CET support for nested guest

On Thu, Dec 21, 2023 at 09:02:39AM -0500, Yang Weijiang wrote:
> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> to enable CET for nested VM.
>
> vmcs12 and vmcs02 needs to be synced when L2 exits to L1 or when L1 wants
> to resume L2, that way correct CET states can be observed by one another.
>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 57 +++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/vmx/vmcs12.c | 6 +++++
> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++-
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 4 files changed, 76 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 468a7cf75035..dee718c65255 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -691,6 +691,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>
> + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_U_CET, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_S_CET, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> +
> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>
> vmx->nested.force_msr_bitmap_recalc = false;
> @@ -2506,6 +2528,17 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> +
> + if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> + vmcs_writel(GUEST_INTR_SSP_TABLE,
> + vmcs12->guest_ssp_tbl);
> + }
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> + }
> }
>
> if (nested_cpu_has_xsaves(vmcs12))
> @@ -4344,6 +4377,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> vmcs12->guest_pending_dbg_exceptions =
> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> + }
> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> + }
> +
> vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
> }
>
> @@ -4569,6 +4611,16 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
> vmcs_write64(GUEST_BNDCFGS, 0);
>
> + if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_CET_STATE) {
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
> + vmcs_writel(HOST_SSP, vmcs12->host_ssp);

Should be GUEST_xxx here.

Now KVM does "vmexit" from L2 to L1, thus should sync
vmcs01's guest state with vmcs12's host state, so KVM
can emulate "vmexit" from L2 -> L1 directly by vmlaunch
with vmcs01.

> + vmcs_writel(HOST_INTR_SSP_TABLE, vmcs12->host_ssp_tbl);

Ditto.

> + }
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> + guest_can_use(vcpu, X86_FEATURE_IBT))
> + vmcs_writel(HOST_S_CET, vmcs12->host_s_cet);

Ditto.

> + }
> +
> if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) {
> vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
> vcpu->arch.pat = vmcs12->host_ia32_pat;
> @@ -6840,7 +6892,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> VM_EXIT_HOST_ADDR_SPACE_SIZE |
> #endif
> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> - VM_EXIT_CLEAR_BNDCFGS;
> + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> msrs->exit_ctls_high |=
> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> @@ -6862,7 +6914,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> #ifdef CONFIG_X86_64
> VM_ENTRY_IA32E_MODE |
> #endif
> - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> + VM_ENTRY_LOAD_CET_STATE;
> msrs->entry_ctls_high |=
> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> index 106a72c923ca..4233b5ca9461 100644
> --- a/arch/x86/kvm/vmx/vmcs12.c
> +++ b/arch/x86/kvm/vmx/vmcs12.c
> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> + FIELD(GUEST_S_CET, guest_s_cet),
> + FIELD(GUEST_SSP, guest_ssp),
> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> FIELD(HOST_CR0, host_cr0),
> FIELD(HOST_CR3, host_cr3),
> FIELD(HOST_CR4, host_cr4),
> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
> FIELD(HOST_RSP, host_rsp),
> FIELD(HOST_RIP, host_rip),
> + FIELD(HOST_S_CET, host_s_cet),
> + FIELD(HOST_SSP, host_ssp),
> + FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
> };
> const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> index 01936013428b..3884489e7f7e 100644
> --- a/arch/x86/kvm/vmx/vmcs12.h
> +++ b/arch/x86/kvm/vmx/vmcs12.h
> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
> natural_width host_ia32_sysenter_eip;
> natural_width host_rsp;
> natural_width host_rip;
> - natural_width paddingl[8]; /* room for future expansion */
> + natural_width host_s_cet;
> + natural_width host_ssp;
> + natural_width host_ssp_tbl;
> + natural_width guest_s_cet;
> + natural_width guest_ssp;
> + natural_width guest_ssp_tbl;
> + natural_width paddingl[2]; /* room for future expansion */
> u32 pin_based_vm_exec_control;
> u32 cpu_based_vm_exec_control;
> u32 exception_bitmap;
> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
> CHECK_OFFSET(host_ia32_sysenter_eip, 656);
> CHECK_OFFSET(host_rsp, 664);
> CHECK_OFFSET(host_rip, 672);
> + CHECK_OFFSET(host_s_cet, 680);
> + CHECK_OFFSET(host_ssp, 688);
> + CHECK_OFFSET(host_ssp_tbl, 696);
> + CHECK_OFFSET(guest_s_cet, 704);
> + CHECK_OFFSET(guest_ssp, 712);
> + CHECK_OFFSET(guest_ssp_tbl, 720);
> CHECK_OFFSET(pin_based_vm_exec_control, 744);
> CHECK_OFFSET(cpu_based_vm_exec_control, 748);
> CHECK_OFFSET(exception_bitmap, 752);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c802e790c0d5..7ddd3f6fe8ab 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7732,6 +7732,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
> cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
> cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
> cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
> + cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
> + cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));
>
> entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
> cr4_fixed1_update(X86_CR4_LAM_SUP, eax, feature_bit(LAM));
> --
> 2.39.3
>
>

2024-01-16 07:26:10

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v8 24/26] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Thu, Dec 21, 2023 at 09:02:37AM -0500, Yang Weijiang wrote:
> Expose CET features to guest if KVM/host can support them, clear CPUID
> feature bits if KVM/host cannot support.
>
> Set CPUID feature bits so that CET features are available in guest CPUID.
> Add CR4.CET bit support in order to allow guest set CET master control
> bit.
>
> Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
> KVM does not support emulating CET.
>
> The CET load-bits in VM_ENTRY/VM_EXIT control fields should be set to make
> guest CET xstates isolated from host's.
>
> On platforms with VMX_BASIC[bit56] == 0, inject #CP at VMX entry with error
> code will fail, and if VMX_BASIC[bit56] == 1, #CP injection with or without
> error code is allowed. Disable CET feature bits if the MSR bit is cleared
> so that nested VMM can inject #CP if and only if VMX_BASIC[bit56] == 1.
>
> Don't expose CET feature if either of {U,S}_CET xstate bits is cleared
> in host XSS or if XSAVES isn't supported.
>
> CET MSR contents after reset, power-up and INIT are set to 0s, clears the
> guest fpstate fields so that the guest MSRs are reset to 0s after the events.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kvm/cpuid.c | 19 +++++++++++++++++--
> arch/x86/kvm/vmx/capabilities.h | 6 ++++++
> arch/x86/kvm/vmx/vmx.c | 29 ++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.h | 6 ++++--
> arch/x86/kvm/x86.c | 31 +++++++++++++++++++++++++++++--
> arch/x86/kvm/x86.h | 3 +++
> 8 files changed, 89 insertions(+), 8 deletions(-)
..
> -#define KVM_SUPPORTED_XSS 0
> +#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
> + XFEATURE_MASK_CET_KERNEL)
>
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
> @@ -9921,6 +9922,20 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> kvm_caps.supported_xss = 0;
>
> + if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
> + !kvm_cpu_cap_has(X86_FEATURE_IBT))
> + kvm_caps.supported_xss &= ~(XFEATURE_CET_USER |
> + XFEATURE_CET_KERNEL);

Looks like these should be XFEATURE_MASK_xxx.

> +
> + if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
> + XFEATURE_MASK_CET_KERNEL)) !=
> + (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
> + kvm_caps.supported_xss &= ~(XFEATURE_CET_USER |
> + XFEATURE_CET_KERNEL);

Ditto.

> + }
> +
> #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
> cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
> #undef __kvm_cpu_cap_has
> @@ -12392,7 +12407,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>
> static inline bool is_xstate_reset_needed(void)
> {
> - return kvm_cpu_cap_has(X86_FEATURE_MPX);
> + return kvm_cpu_cap_has(X86_FEATURE_MPX) ||
> + kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> + kvm_cpu_cap_has(X86_FEATURE_IBT);
> }
>
> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> @@ -12469,6 +12486,16 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> XFEATURE_BNDCSR);
> }
>
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_USER);
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_KERNEL);
> + } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> + fpstate_clear_xstate_component(fpstate,
> + XFEATURE_CET_USER);
> + }
> +
> if (init_event)
> kvm_load_guest_fpu(vcpu);
> }
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 656107e64c93..cc585051d24b 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -533,6 +533,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
> __reserved_bits |= X86_CR4_PCIDE; \
> if (!__cpu_has(__c, X86_FEATURE_LAM)) \
> __reserved_bits |= X86_CR4_LAM_SUP; \
> + if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
> + !__cpu_has(__c, X86_FEATURE_IBT)) \
> + __reserved_bits |= X86_CR4_CET; \
> __reserved_bits; \
> })
>
> --
> 2.39.3
>
>

2024-01-17 00:54:21

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 00/26] Enable CET Virtualization

On 1/15/2024 9:55 AM, Chao Gao wrote:
> On Thu, Jan 11, 2024 at 10:56:55PM +0800, Yang, Weijiang wrote:
>> On 1/9/2024 11:10 PM, Sean Christopherson wrote:
>>> On Mon, Jan 08, 2024, Weijiang Yang wrote:
>>>> On 1/6/2024 12:21 AM, Sean Christopherson wrote:
>>>>> On Fri, Jan 05, 2024, Weijiang Yang wrote:
>>>>>> On 1/5/2024 8:54 AM, Sean Christopherson wrote:
>>>>>>> On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>>>>>>>>> For CALL/RET (and presumably any branch instructions with IBT?) other
>>>>>>>>> instructions that are directly affected by CET, the simplest thing would
>>>>>>>>> probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>>>>>>>>> are enabled, and let KVM's failure paths take it from there.
>>>>>>>> Right, that is what I was wondering might be the normal solution for
>>>>>>>> situations like this.
>>>>>>> If KVM can't emulate something, it either retries the instruction (with some
>>>>>>> decent logic to guard against infinite retries) or punts to userspace.
>>>>>> What kind of error is proper if KVM has to punt to userspace?
>>>>> KVM_INTERNAL_ERROR_EMULATION. See prepare_emulation_failure_exit().
>>>>>
>>>>>> Or just inject #UD into guest on detecting this case?
>>>>> No, do not inject #UD or do anything else that deviates from architecturally
>>>>> defined behavior.
>>>> Thanks!
>>>> But based on current KVM implementation and patch 24, seems that if CET is exposed
>>>> to guest, the emulation code or shadow paging mode couldn't be activated at the same time:
>>> No, requiring unrestricted guest only disables the paths where KVM *delibeately*
>>> emulates the entire guest code stream. In no way, shape, or form does it prevent
>>> KVM from attempting to emulate arbitrary instructions.
>> Yes, also need to prevent sporadic emulation, how about adding below patch in emulator?
>>
>>
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index e223043ef5b2..e817d8560ceb 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -178,6 +178,7 @@
>>  #define IncSP       ((u64)1 << 54)  /* SP is incremented before ModRM calc */
>>  #define TwoMemOp    ((u64)1 << 55)  /* Instruction has two memory operand */
>>  #define IsBranch    ((u64)1 << 56)  /* Instruction is considered a branch. */
>> +#define IsProtected ((u64)1 << 57)  /* Instruction is protected by CET. */
>>
>>  #define DstXacc     (DstAccLo | SrcAccHi | SrcWrite)
>>
>> @@ -4098,9 +4099,9 @@ static const struct opcode group4[] = {
>>  static const struct opcode group5[] = {
>>         F(DstMem | SrcNone | Lock,              em_inc),
>>         F(DstMem | SrcNone | Lock,              em_dec),
>> -       I(SrcMem | NearBranch | IsBranch,       em_call_near_abs),
>> -       I(SrcMemFAddr | ImplicitOps | IsBranch, em_call_far),
>> -       I(SrcMem | NearBranch | IsBranch,       em_jmp_abs),
>> +       I(SrcMem | NearBranch | IsBranch | IsProtected, em_call_near_abs),
>> +       I(SrcMemFAddr | ImplicitOps | IsBranch | IsProtected, em_call_far),
>> +       I(SrcMem | NearBranch | IsBranch | IsProtected, em_jmp_abs),
>>         I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
>>         I(SrcMem | Stack | TwoMemOp,            em_push), D(Undefined),
>>  };
>> @@ -4362,11 +4363,11 @@ static const struct opcode opcode_table[256] = {
>>         /* 0xC8 - 0xCF */
>>         I(Stack | SrcImmU16 | Src2ImmByte | IsBranch, em_enter),
>>         I(Stack | IsBranch, em_leave),
>> -       I(ImplicitOps | SrcImmU16 | IsBranch, em_ret_far_imm),
>> -       I(ImplicitOps | IsBranch, em_ret_far),
>> -       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch, intn),
>> +       I(ImplicitOps | SrcImmU16 | IsBranch | IsProtected, em_ret_far_imm),
>> +       I(ImplicitOps | IsBranch | IsProtected, em_ret_far),
>> +       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch | IsProtected, intn),
>>         D(ImplicitOps | No64 | IsBranch),
>> -       II(ImplicitOps | IsBranch, em_iret, iret),
>> +       II(ImplicitOps | IsBranch | IsProtected, em_iret, iret),
>>         /* 0xD0 - 0xD7 */
>>         G(Src2One | ByteOp, group2), G(Src2One, group2),
>>         G(Src2CL | ByteOp, group2), G(Src2CL, group2),
>> @@ -4382,7 +4383,7 @@ static const struct opcode opcode_table[256] = {
>>         I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
>>         I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
>>         /* 0xE8 - 0xEF */
>> -       I(SrcImm | NearBranch | IsBranch, em_call),
>> +       I(SrcImm | NearBranch | IsBranch | IsProtected, em_call),
>>         D(SrcImm | ImplicitOps | NearBranch | IsBranch),
>>         I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
>>         D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
>> @@ -4401,7 +4402,7 @@ static const struct opcode opcode_table[256] = {
>>  static const struct opcode twobyte_table[256] = {
>>         /* 0x00 - 0x0F */
>>         G(0, group6), GD(0, &group7), N, N,
>> -       N, I(ImplicitOps | EmulateOnUD | IsBranch, em_syscall),
>> +       N, I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_syscall),
>>         II(ImplicitOps | Priv, em_clts, clts), N,
>>         DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
>>         N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
>> @@ -4432,8 +4433,8 @@ static const struct opcode twobyte_table[256] = {
>>         IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
>>         II(ImplicitOps | Priv, em_rdmsr, rdmsr),
>>         IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
>> -       I(ImplicitOps | EmulateOnUD | IsBranch, em_sysenter),
>> -       I(ImplicitOps | Priv | EmulateOnUD | IsBranch, em_sysexit),
>> +       I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_sysenter),
>> +       I(ImplicitOps | Priv | EmulateOnUD | IsBranch | IsProtected, em_sysexit),
>>         N, N,
>>         N, N, N, N, N, N, N, N,
>>         /* 0x40 - 0x4F */
>> @@ -4971,6 +4972,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>>         if (ctxt->d == 0)
>>                 return EMULATION_FAILED;
>> +       if ((opcode.flags & IsProtected) &&
>> +           (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET)) {
> CR4.CET doesn't necessarily mean IBT or shadow stack is enabled. why not check
> CPL and IA32_S/U_CET?

CR4.CET is the master control bit for CET features; a sane guest should set the bit iff it wants
to activate CET features. By contrast, the IBT/SHSTK bits in IA32_S/U_CET only mean
the feature is enabled but may not be active at the moment the emulator is working, so there is no need
to stop emulation in that case.

>
>> +               WARN_ONCE(1, "CET is active, emulation aborted.\n");
> remove this WARN_ONCE(). Guest can trigger this at will and overflow host dmesg.

OK, the purpose was to give an informative message when the guest hits the prohibited cases.
I can remove it. Thanks!
>
> if you really want to tell userspace the emulation_failure is due to CET, maybe
> you can add a new flag like KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES.
> for now, I won't bother to add this because probably userspace just terminates
> the VM on any instruction failure (i.e., won't try to figure out the reason of
> the instruction failure and fix it).

Agreed, no need to add another flag to indicate this is due to CET being enabled.
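
With the WARN removed, the decode-time check would then presumably reduce to:

	if ((opcode.flags & IsProtected) &&
	    (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET))
		return EMULATION_FAILED;

i.e. KVM simply punts to userspace through the normal emulation-failure path.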



2024-01-17 01:42:21

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 22/26] KVM: VMX: Set up interception for CET MSRs

On 1/15/2024 5:58 PM, Yuan Yao wrote:
> On Thu, Dec 21, 2023 at 09:02:35AM -0500, Yang Weijiang wrote:
>> Enable/disable CET MSRs interception per associated feature configuration.
>> Shadow Stack feature requires all CET MSRs passed through to guest to make
>> it supported in user and supervisor mode while IBT feature only depends on
>> MSR_IA32_{U,S}_CET to enable user and supervisor IBT.
>>
>> Note, this MSR design introduced an architectural limitation of SHSTK and
>> IBT control for guest, i.e., when SHSTK is exposed, IBT is also available
>> to guest from architectural perspective since IBT relies on subset of SHSTK
>> relevant MSRs.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 42 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 064a5fe87948..08058b182893 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -692,6 +692,10 @@ static bool is_valid_passthrough_msr(u32 msr)
>> case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
>> /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
>> return true;
>> + case MSR_IA32_U_CET:
>> + case MSR_IA32_S_CET:
>> + case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
>> + return true;
>> }
>>
>> r = possible_passthrough_msr_slot(msr) != -ENOENT;
>> @@ -7767,6 +7771,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
>> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>> }
>>
>> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
>> +{
>> + bool incpt;
>> +
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
>> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
>> +
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
>> + MSR_TYPE_RW, incpt);
>> + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
> Looks this leading to MSR_IA32_INT_SSP_TAB not intercepted
> after below steps:
>
> Step 1. User space set cpuid w/ X86_FEATURE_LM, w/ SHSTK.
> Step 2. User space set cpuid w/o X86_FEATURE_LM, w/o SHSTK.
>
> Then MSR_IA32_INT_SSP_TAB won't be intercepted even w/o SHSTK
> on guest cpuid, will this lead to inconsistency when do
> rdmsr(MSR_IA32_INT_SSP_TAB) from guest in this scenario ?

Yes, theoretically it's possible, how about changing it as below?

vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
                          MSR_TYPE_RW,
                          incpt | !guest_cpuid_has(vcpu, X86_FEATURE_LM));

>
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
>> + MSR_TYPE_RW, incpt);
>> + if (!incpt)
>> + return;
>> + }
>> +
>> + if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
>> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
>> +
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
>> + MSR_TYPE_RW, incpt);
>> + }
>> +}
>> +
>> static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>> @@ -7845,6 +7885,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>
>> /* Refresh #PF interception to account for MAXPHYADDR changes. */
>> vmx_update_exception_bitmap(vcpu);
>> +
>> + vmx_update_intercept_for_cet_msr(vcpu);
>> }
>>
>> static u64 vmx_get_perf_capabilities(void)
>> --
>> 2.39.3
>>
>>


2024-01-17 01:43:51

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 24/26] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 1/16/2024 3:25 PM, Yuan Yao wrote:
> On Thu, Dec 21, 2023 at 09:02:37AM -0500, Yang Weijiang wrote:
>> Expose CET features to guest if KVM/host can support them, clear CPUID
>> feature bits if KVM/host cannot support.
>>
>> Set CPUID feature bits so that CET features are available in guest CPUID.
>> Add CR4.CET bit support in order to allow guest set CET master control
>> bit.
>>
>> Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
>> KVM does not support emulating CET.
>>
>> The CET load-bits in VM_ENTRY/VM_EXIT control fields should be set to make
>> guest CET xstates isolated from host's.
>>
>> On platforms with VMX_BASIC[bit56] == 0, inject #CP at VMX entry with error
>> code will fail, and if VMX_BASIC[bit56] == 1, #CP injection with or without
>> error code is allowed. Disable CET feature bits if the MSR bit is cleared
>> so that nested VMM can inject #CP if and only if VMX_BASIC[bit56] == 1.
>>
>> Don't expose CET feature if either of {U,S}_CET xstate bits is cleared
>> in host XSS or if XSAVES isn't supported.
>>
>> CET MSR contents after reset, power-up and INIT are set to 0s, clears the
>> guest fpstate fields so that the guest MSRs are reset to 0s after the events.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 2 +-
>> arch/x86/include/asm/msr-index.h | 1 +
>> arch/x86/kvm/cpuid.c | 19 +++++++++++++++++--
>> arch/x86/kvm/vmx/capabilities.h | 6 ++++++
>> arch/x86/kvm/vmx/vmx.c | 29 ++++++++++++++++++++++++++++-
>> arch/x86/kvm/vmx/vmx.h | 6 ++++--
>> arch/x86/kvm/x86.c | 31 +++++++++++++++++++++++++++++--
>> arch/x86/kvm/x86.h | 3 +++
>> 8 files changed, 89 insertions(+), 8 deletions(-)
> ...
>> -#define KVM_SUPPORTED_XSS 0
>> +#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
>> + XFEATURE_MASK_CET_KERNEL)
>>
>> u64 __read_mostly host_efer;
>> EXPORT_SYMBOL_GPL(host_efer);
>> @@ -9921,6 +9922,20 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>> if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
>> kvm_caps.supported_xss = 0;
>>
>> + if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
>> + !kvm_cpu_cap_has(X86_FEATURE_IBT))
>> + kvm_caps.supported_xss &= ~(XFEATURE_CET_USER |
>> + XFEATURE_CET_KERNEL);
> Looks like these should be XFEATURE_MASK_xxx.

Good catch! Thanks!
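
Presumably the fix is just to switch to the mask macros, e.g.:

	if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
	    !kvm_cpu_cap_has(X86_FEATURE_IBT))
		kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
					    XFEATURE_MASK_CET_KERNEL);

and the same for the second hunk below.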

>
>> +
>> + if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
>> + XFEATURE_MASK_CET_KERNEL)) !=
>> + (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
>> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
>> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
>> + kvm_caps.supported_xss &= ~(XFEATURE_CET_USER |
>> + XFEATURE_CET_KERNEL);
> Ditto.

Yes.

>
>> + }
>> +
>> #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
>> cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
>> #undef __kvm_cpu_cap_has
>> @@ -12392,7 +12407,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>>
>> static inline bool is_xstate_reset_needed(void)
>> {
>> - return kvm_cpu_cap_has(X86_FEATURE_MPX);
>> + return kvm_cpu_cap_has(X86_FEATURE_MPX) ||
>> + kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
>> + kvm_cpu_cap_has(X86_FEATURE_IBT);
>> }
>>
>> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> @@ -12469,6 +12486,16 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> XFEATURE_BNDCSR);
>> }
>>
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
>> + fpstate_clear_xstate_component(fpstate,
>> + XFEATURE_CET_USER);
>> + fpstate_clear_xstate_component(fpstate,
>> + XFEATURE_CET_KERNEL);
>> + } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
>> + fpstate_clear_xstate_component(fpstate,
>> + XFEATURE_CET_USER);
>> + }
>> +
>> if (init_event)
>> kvm_load_guest_fpu(vcpu);
>> }
>> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
>> index 656107e64c93..cc585051d24b 100644
>> --- a/arch/x86/kvm/x86.h
>> +++ b/arch/x86/kvm/x86.h
>> @@ -533,6 +533,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
>> __reserved_bits |= X86_CR4_PCIDE; \
>> if (!__cpu_has(__c, X86_FEATURE_LAM)) \
>> __reserved_bits |= X86_CR4_LAM_SUP; \
>> + if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
>> + !__cpu_has(__c, X86_FEATURE_IBT)) \
>> + __reserved_bits |= X86_CR4_CET; \
>> __reserved_bits; \
>> })
>>
>> --
>> 2.39.3
>>
>>


2024-01-17 01:56:47

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 26/26] KVM: nVMX: Enable CET support for nested guest

On 1/16/2024 3:22 PM, Yuan Yao wrote:
> On Thu, Dec 21, 2023 at 09:02:39AM -0500, Yang Weijiang wrote:
>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>> to enable CET for nested VM.
>>
>> vmcs12 and vmcs02 needs to be synced when L2 exits to L1 or when L1 wants
>> to resume L2, that way correct CET states can be observed by one another.
>>
>> Suggested-by: Chao Gao <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/nested.c | 57 +++++++++++++++++++++++++++++++++++++--
>> arch/x86/kvm/vmx/vmcs12.c | 6 +++++
>> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++-
>> arch/x86/kvm/vmx/vmx.c | 2 ++
>> 4 files changed, 76 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 468a7cf75035..dee718c65255 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -691,6 +691,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>
>> + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_U_CET, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_S_CET, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>> +
>> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>
>> vmx->nested.force_msr_bitmap_recalc = false;
>> @@ -2506,6 +2528,17 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>> if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>> (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>> vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>> +
>> + if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>> + vmcs_writel(GUEST_INTR_SSP_TABLE,
>> + vmcs12->guest_ssp_tbl);
>> + }
>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>> + }
>> }
>>
>> if (nested_cpu_has_xsaves(vmcs12))
>> @@ -4344,6 +4377,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>> vmcs12->guest_pending_dbg_exceptions =
>> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>
>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>> + vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>> + vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>> + }
>> + if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>> + guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>> + vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>> + }
>> +
>> vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
>> }
>>
>> @@ -4569,6 +4611,16 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>> if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
>> vmcs_write64(GUEST_BNDCFGS, 0);
>>
>> + if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_CET_STATE) {
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
>> + vmcs_writel(HOST_SSP, vmcs12->host_ssp);
> Should be GUEST_xxx here.
>
> Now KVM does "vmexit" from L2 to L1, thus should sync
> vmcs01's guest state with vmcs12's host state, so KVM
> can emulate "vmexit" from L2 -> L1 directly by vmlaunch
> with vmcs01.

Right, I'll change it, thanks for pointing it out!

>
>> + vmcs_writel(HOST_INTR_SSP_TABLE, vmcs12->host_ssp_tbl);
> Ditto.

Yes,

>> + }
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
>> + guest_can_use(vcpu, X86_FEATURE_IBT))
>> + vmcs_writel(HOST_S_CET, vmcs12->host_s_cet);
> Ditto.

Yes.
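
So the corrected hunk in load_vmcs12_host_state() would presumably become:

	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_CET_STATE) {
		if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
			vmcs_writel(GUEST_SSP, vmcs12->host_ssp);
			vmcs_writel(GUEST_INTR_SSP_TABLE, vmcs12->host_ssp_tbl);
		}
		if (guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
		    guest_can_use(vcpu, X86_FEATURE_IBT))
			vmcs_writel(GUEST_S_CET, vmcs12->host_s_cet);
	}

i.e. vmcs12's host values are loaded into vmcs01's guest fields when emulating the L2->L1 VM-exit.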

>
>> + }
>> +
>> if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) {
>> vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
>> vcpu->arch.pat = vmcs12->host_ia32_pat;
>> @@ -6840,7 +6892,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>> VM_EXIT_HOST_ADDR_SPACE_SIZE |
>> #endif
>> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> - VM_EXIT_CLEAR_BNDCFGS;
>> + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>> msrs->exit_ctls_high |=
>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>> @@ -6862,7 +6914,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>> #ifdef CONFIG_X86_64
>> VM_ENTRY_IA32E_MODE |
>> #endif
>> - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>> + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>> + VM_ENTRY_LOAD_CET_STATE;
>> msrs->entry_ctls_high |=
>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>> index 106a72c923ca..4233b5ca9461 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.c
>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>> + FIELD(GUEST_S_CET, guest_s_cet),
>> + FIELD(GUEST_SSP, guest_ssp),
>> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
>> FIELD(HOST_CR0, host_cr0),
>> FIELD(HOST_CR3, host_cr3),
>> FIELD(HOST_CR4, host_cr4),
>> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
>> FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
>> FIELD(HOST_RSP, host_rsp),
>> FIELD(HOST_RIP, host_rip),
>> + FIELD(HOST_S_CET, host_s_cet),
>> + FIELD(HOST_SSP, host_ssp),
>> + FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
>> };
>> const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
>> index 01936013428b..3884489e7f7e 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.h
>> +++ b/arch/x86/kvm/vmx/vmcs12.h
>> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
>> natural_width host_ia32_sysenter_eip;
>> natural_width host_rsp;
>> natural_width host_rip;
>> - natural_width paddingl[8]; /* room for future expansion */
>> + natural_width host_s_cet;
>> + natural_width host_ssp;
>> + natural_width host_ssp_tbl;
>> + natural_width guest_s_cet;
>> + natural_width guest_ssp;
>> + natural_width guest_ssp_tbl;
>> + natural_width paddingl[2]; /* room for future expansion */
>> u32 pin_based_vm_exec_control;
>> u32 cpu_based_vm_exec_control;
>> u32 exception_bitmap;
>> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
>> CHECK_OFFSET(host_ia32_sysenter_eip, 656);
>> CHECK_OFFSET(host_rsp, 664);
>> CHECK_OFFSET(host_rip, 672);
>> + CHECK_OFFSET(host_s_cet, 680);
>> + CHECK_OFFSET(host_ssp, 688);
>> + CHECK_OFFSET(host_ssp_tbl, 696);
>> + CHECK_OFFSET(guest_s_cet, 704);
>> + CHECK_OFFSET(guest_ssp, 712);
>> + CHECK_OFFSET(guest_ssp_tbl, 720);
>> CHECK_OFFSET(pin_based_vm_exec_control, 744);
>> CHECK_OFFSET(cpu_based_vm_exec_control, 748);
>> CHECK_OFFSET(exception_bitmap, 752);
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index c802e790c0d5..7ddd3f6fe8ab 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7732,6 +7732,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
>> cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
>> cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
>> cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
>> + cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
>> + cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));
>>
>> entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
>> cr4_fixed1_update(X86_CR4_LAM_SUP, eax, feature_bit(LAM));
>> --
>> 2.39.3
>>
>>


2024-01-17 01:59:16

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 22/26] KVM: VMX: Set up interception for CET MSRs

On 1/17/2024 9:41 AM, Yang, Weijiang wrote:
> On 1/15/2024 5:58 PM, Yuan Yao wrote:
>> On Thu, Dec 21, 2023 at 09:02:35AM -0500, Yang Weijiang wrote:
[...]
>> Looks this leading to MSR_IA32_INT_SSP_TAB not intercepted
>> after below steps:
>>
>> Step 1. User space set cpuid w/ X86_FEATURE_LM, w/ SHSTK.
>> Step 2. User space set cpuid w/o X86_FEATURE_LM, w/o SHSTK.
>>
>> Then MSR_IA32_INT_SSP_TAB won't be intercepted even w/o SHSTK
>> on guest cpuid, will this lead to inconsistency when do
>> rdmsr(MSR_IA32_INT_SSP_TAB) from guest in this scenario ?
> Yes, theoretically it's possible, how about changing it as below?
>
> vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> MSR_TYPE_RW,
> incpt | !guest_cpuid_has(vcpu, X86_FEATURE_LM));
>
Oops, should be : incpt || !guest_cpuid_has(vcpu, X86_FEATURE_LM)

2024-01-17 06:16:37

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v8 22/26] KVM: VMX: Set up interception for CET MSRs

On 1/17/2024 1:31 PM, Yuan Yao wrote:
> On Wed, Jan 17, 2024 at 09:58:40AM +0800, Yang, Weijiang wrote:
>> On 1/17/2024 9:41 AM, Yang, Weijiang wrote:
>>> On 1/15/2024 5:58 PM, Yuan Yao wrote:
>>>> On Thu, Dec 21, 2023 at 09:02:35AM -0500, Yang Weijiang wrote:
>> [...]
>>>> Looks this leading to MSR_IA32_INT_SSP_TAB not intercepted
>>>> after below steps:
>>>>
>>>> Step 1. User space set cpuid w/ X86_FEATURE_LM, w/ SHSTK.
>>>> Step 2. User space set cpuid w/o X86_FEATURE_LM, w/o SHSTK.
>>>>
>>>> Then MSR_IA32_INT_SSP_TAB won't be intercepted even w/o SHSTK
>>>> on guest cpuid, will this lead to inconsistency when do
>>>> rdmsr(MSR_IA32_INT_SSP_TAB) from guest in this scenario ?
>>> Yes, theoretically it's possible, how about changing it as below?
>>>
>>> vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
>>> MSR_TYPE_RW,
>>> incpt | !guest_cpuid_has(vcpu, X86_FEATURE_LM));
>>>
>> Oops, should be : incpt || !guest_cpuid_has(vcpu, X86_FEATURE_LM)
> It means guest cpuid:
>
> "has X86_FEATURE_SHSTK" + "doesn't have X86_FEATURE_LM"

No, this combination is invalid within this series. With patch 21 preventing SHSTK in 32-bit guests, I think the check of LM here
32-bit guest, I think the check of LM here can be omitted.

Then
vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW, incpt);
is OK.

>
> not sure this is valid combination or not.
> If yes it's ok, else just relies on incpt is enough ?
>


2024-01-17 11:41:31

by Yuan Yao

[permalink] [raw]
Subject: Re: [PATCH v8 22/26] KVM: VMX: Set up interception for CET MSRs

On Wed, Jan 17, 2024 at 09:58:40AM +0800, Yang, Weijiang wrote:
> On 1/17/2024 9:41 AM, Yang, Weijiang wrote:
> > On 1/15/2024 5:58 PM, Yuan Yao wrote:
> > > On Thu, Dec 21, 2023 at 09:02:35AM -0500, Yang Weijiang wrote:
> [...]
> > > Looks this leading to MSR_IA32_INT_SSP_TAB not intercepted
> > > after below steps:
> > >
> > > Step 1. User space set cpuid w/ X86_FEATURE_LM, w/ SHSTK.
> > > Step 2. User space set cpuid w/o X86_FEATURE_LM, w/o SHSTK.
> > >
> > > Then MSR_IA32_INT_SSP_TAB won't be intercepted even w/o SHSTK
> > > on guest cpuid, will this lead to inconsistency when do
> > > rdmsr(MSR_IA32_INT_SSP_TAB) from guest in this scenario ?
> > Yes, theoretically it's possible, how about changing it as below?
> >
> > vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> > MSR_TYPE_RW,
> > incpt | !guest_cpuid_has(vcpu, X86_FEATURE_LM));
> >
> Oops, should be : incpt || !guest_cpuid_has(vcpu, X86_FEATURE_LM)

It means guest cpuid:

"has X86_FEATURE_SHSTK" + "doesn't have X86_FEATURE_LM"

not sure this is valid combination or not.
If yes it's ok, else just relies on incpt is enough ?