2024-02-19 07:50:38

by Yang, Weijiang

Subject: [PATCH v10 00/27] Enable CET Virtualization

Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides
two sub-features, Shadow Stack (SHSTK) and Indirect Branch Tracking (IBT),
to defend against this style of control-flow subversion.

Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.

Indirect Branch Tracking (IBT):
IBT introduces a new instruction (ENDBRANCH) to mark valid target addresses
of indirect branches (CALL, JMP, etc.). If an indirect branch is executed
and the next instruction is _not_ an ENDBRANCH, the processor generates a
#CP. The instruction behaves as a NOP on platforms that do not support
CET.
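
As a rough illustration (not from this series): with IBT, every indirect
branch target must begin with an ENDBR64/ENDBR32 instruction. Compilers
emit it automatically (e.g., gcc >= 8.5 with -fcf-protection=branch); a
hand-written sketch looks like:

  /* Minimal sketch: a legal indirect-branch target under IBT. The
   * ENDBR64 encoding decodes as a NOP on CPUs without CET, so this
   * builds and runs on any x86-64 machine. */
  void indirect_target(void)
  {
          asm volatile("endbr64");        /* valid landing pad */
          /* ... function body ... */
  }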

Dependency:
=====================
CET native series for user mode shadow stack has already been merged in v6.6
mainline kernel.

The first 7 kernel patches are prerequisites for this KVM patch series since
guest CET user mode and supervisor mode states depend on the kernel FPU
framework to properly save/restore the states whenever an FPU context switch
is required, e.g., after VM-Exit and before the vCPU thread exits to
userspace.

In this series, a guest supervisor SHSTK solution is not introduced for
Intel platforms, therefore the guest SSS_CET bit, CPUID(0x7,1):EDX[bit 18],
is cleared. See the SDM (Vol 1, Section 17.2.3) for details.

CET states management:
======================
KVM cooperates with the host kernel FPU framework to manage guest CET
registers. With the CET supervisor mode state support in this series, KVM
can save/restore the full set of guest CET xsave-managed states.

CET user mode and supervisor mode xstates, i.e., MSR_IA32_{U_CET,PL3_SSP}
and MSR_IA32_PL{0,1,2}_SSP, depend on the host FPU framework to swap guest
and host xstates. On VM-Exit, guest CET xstates are saved to the guest fpu
area and host CET xstates are loaded from task/thread context before the
vCPU returns to userspace, and vice versa on VM-Entry. See details in
kvm_{load,put}_guest_fpu(). Guest CET xstate management thus depends on the
CET xstate bits (U_CET/S_CET) being set in the host XSS MSR.

CET supervisor mode states are grouped into two categories: XSAVE-managed
and non-XSAVE-managed. The former includes MSR_IA32_PL{0,1,2}_SSP and is
controlled by the CET supervisor mode bit (S_CET) in XSS; the latter
consists of MSR_IA32_S_CET and MSR_IA32_INT_SSP_TAB.

VMX introduces new VMCS fields, {GUEST,HOST}_{S_CET,SSP,INTR_SSP_TABLE}, to
manage the guest/host non-XSAVE-managed states. When the VMX CET entry/exit
load bits are set, guest/host MSR_IA32_{S_CET,INT_SSP_TAB} and SSP are
loaded from the corresponding fields at VM-Entry/VM-Exit. With these new
fields, such supervisor states require no additional KVM save/reload
actions.
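
To make the two categories concrete, here is a minimal sketch of the
resulting split (kvm_load_guest_fpu()/kvm_put_guest_fpu() are the real
helpers referenced above; the body is illustrative, not the series'
actual code):

  /* Sketch: XSAVE-managed CET state rides along with the regular guest
   * FPU swap, while the non-XSAVE-managed state is switched by the CPU
   * itself via the VMCS entry/exit load bits. */
  static void cet_context_switch_sketch(struct kvm_vcpu *vcpu)
  {
          kvm_load_guest_fpu(vcpu);  /* XRSTORS guest U_CET/PL{0-3}_SSP */

          /* VM-Entry loads GUEST_{S_CET,SSP,INTR_SSP_TABLE} when
           * VM_ENTRY_LOAD_CET_STATE is set; VM-Exit loads the HOST_*
           * fields when VM_EXIT_LOAD_CET_STATE is set. */

          kvm_put_guest_fpu(vcpu);   /* XSAVES guest, restore host state */
  }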

Tests:
======================
This series passed the basic CET user shadow stack test and kernel IBT test
in both L1 and L2 guests.
The patch series _does_ impact existing VMX test cases in KVM-unit-tests;
the failures have been fixed here[1].
One new selftest app[2] is introduced for testing CET MSR accessibility.

Note, this series hasn't been tested on AMD platform yet.

To run the user SHSTK test and kernel IBT test in a guest, a CET-capable
platform is required, e.g., a Sapphire Rapids server. Follow the steps
below to build the binaries:

1. Host kernel: Apply this series to mainline kernel (>= v6.6) and build.

2. Guest kernel: Pull a kernel (>= v6.6), opt in to the CONFIG_X86_KERNEL_IBT
and CONFIG_X86_USER_SHADOW_STACK options, and build with a CET-enabled gcc
(>= 8.5.0).

3. Apply the CET QEMU patches[3] before building mainline QEMU.

Check kernel selftest test_shadow_stack_64 output:
[INFO] new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO] changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO] ssp is now 7f8c82101000
[OK] Shadow stack pivot
[OK] Shadow stack faults
[INFO] Corrupting shadow stack
[INFO] Generated shadow stack violation successfully
[OK] Shadow stack violation test
[INFO] Gup read -> shstk access success
[INFO] Gup write -> shstk access success
[INFO] Violation from normal write
[INFO] Gup read -> write access success
[INFO] Violation from normal write
[INFO] Gup write -> write access success
[INFO] Cow gup write -> write access success
[OK] Shadow gup test
[INFO] Violation from shstk access
[OK] mprotect() test
[SKIP] Userfaultfd unavailable.
[OK] 32 bit test


Check kernel IBT with dmesg | grep CET:
CET detected: Indirect Branch Tracking enabled

Changes in v10:
=====================
1. Add Reviewed-by tags from Chao and Rick. [Chao, Rick]
2. Use two bit flags to check CET guarded instructions in KVM emulator. [Chao]
3. Refine reset handling of xsave-managed guest FPU states. [Chao]
4. Add nested CET MSR sync when entry/exit-load-bit is not set. [Chao]
5. Other minor changes per comments from Chao and Rick.
6. Rebased on https://github.com/kvm-x86/linux commit: c0f8b0752b09


[1]: KVM-unit-tests fixup:
https://lore.kernel.org/all/[email protected]/
[2]: Selftest for CET MSRs:
https://lore.kernel.org/all/[email protected]/
[3]: QEMU patch:
https://lore.kernel.org/all/[email protected]/
[4]: v9 patchset:
https://lore.kernel.org/all/[email protected]/


Patch 1-7: Fixup patches for kernel xstate and enable CET supervisor xstate.
Patch 8-11: Cleanup patches for KVM.
Patch 12-15: Enable KVM XSS MSR support.
Patch 16: Fault check for CR4.CET setting.
Patch 17: Report CET MSRs to userspace.
Patch 18: Introduce CET VMCS fields.
Patch 19: Add SHSTK/IBT to the KVM-governed framework (to be deprecated).
Patch 20: Emulate CET MSR access.
Patch 21: Handle SSP at entry/exit to SMM.
Patch 22: Set up CET MSR interception.
Patch 23: Initialize host constant supervisor state.
Patch 24: Enable CET virtualization settings.
Patch 25-26: Add CET nested support.
Patch 27: KVM emulation handling for branch instructions


Sean Christopherson (4):
x86/fpu/xstate: Always preserve non-user xfeatures/flags in
__state_perm
KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
KVM: x86: Report XSS as to-be-saved if there are supported features
KVM: x86: Load guest FPU state when access XSAVE-managed MSRs

Yang Weijiang (23):
x86/fpu/xstate: Refine CET user xstate bit enabling
x86/fpu/xstate: Add CET supervisor mode state support
x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
x86/fpu/xstate: Create guest fpstate with guest specific config
x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal
fpstate
KVM: x86: Rename kvm_{g,s}et_msr()* to manifest emulation operations
KVM: x86: Refine xsave-managed guest register/MSR reset handling
KVM: x86: Add kvm_msr_{read,write}() helpers
KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
KVM: x86: Initialize kvm_caps.supported_xss
KVM: x86: Add fault checks for guest CR4.CET setting
KVM: x86: Report KVM supported CET MSRs as to-be-saved
KVM: VMX: Introduce CET VMCS fields and control bits
KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT
enabled"
KVM: VMX: Emulate read and write to CET MSRs
KVM: x86: Save and reload SSP to/from SMRAM
KVM: VMX: Set up interception for CET MSRs
KVM: VMX: Set host constant supervisor states to VMCS fields
KVM: x86: Enable CET virtualization for VMX and advertise to userspace
KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery
to L1
KVM: nVMX: Enable CET support for nested guest
KVM: x86: Don't emulate instructions guarded by CET

arch/x86/include/asm/fpu/types.h | 16 +-
arch/x86/include/asm/fpu/xstate.h | 11 +-
arch/x86/include/asm/kvm_host.h | 12 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 8 +
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/fpu/core.c | 53 +++--
arch/x86/kernel/fpu/xstate.c | 44 ++++-
arch/x86/kernel/fpu/xstate.h | 3 +
arch/x86/kvm/cpuid.c | 80 ++++++--
arch/x86/kvm/emulate.c | 46 +++--
arch/x86/kvm/governed_features.h | 2 +
arch/x86/kvm/smm.c | 12 +-
arch/x86/kvm/smm.h | 2 +-
arch/x86/kvm/vmx/capabilities.h | 10 +
arch/x86/kvm/vmx/nested.c | 120 ++++++++++--
arch/x86/kvm/vmx/nested.h | 5 +
arch/x86/kvm/vmx/vmcs12.c | 6 +
arch/x86/kvm/vmx/vmcs12.h | 14 +-
arch/x86/kvm/vmx/vmx.c | 112 ++++++++++-
arch/x86/kvm/vmx/vmx.h | 9 +-
arch/x86/kvm/x86.c | 280 ++++++++++++++++++++++++---
arch/x86/kvm/x86.h | 28 +++
23 files changed, 761 insertions(+), 114 deletions(-)


base-commit: c0f8b0752b0988e5116c78e8b6c3cfdf89806e45
--
2.43.0



2024-02-19 07:51:32

by Yang, Weijiang

Subject: [PATCH v10 06/27] x86/fpu/xstate: Create guest fpstate with guest specific config

Use fpu_guest_cfg to calculate the guest fpstate settings, and open code
__fpstate_reset() to avoid using the kernel FPU config.

The following configuration steps are currently enforced to set up guest
fpstate:
1) The kernel sets up guest FPU settings in fpu__init_system_xstate().
2) User space sets vCPU thread group xstate permissions via arch_prctl().
3) User space creates guest fpstate via __fpu_alloc_init_guest_fpstate()
   for the vCPU thread.
4) User space enables guest dynamic xfeatures and re-allocates guest
   fpstate (see the sketch at the end of this message).

By adding the kernel dynamic xfeatures in steps 1 and 2 above, the guest
xstate area size is expanded to hold (fpu_kernel_cfg.default_features |
kernel dynamic xfeatures | user dynamic xfeatures), so that host
XSAVES/XRSTORS can operate on all guest xfeatures.

The user_* fields remain unchanged for compatibility with KVM uAPIs.
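
A hedged userspace-side sketch of steps 2-4 above; ARCH_REQ_XCOMP_GUEST_PERM
is the existing arch_prctl() uAPI, and XTILE_DATA is used purely as an
example of a user dynamic xfeature:

  #include <unistd.h>
  #include <sys/syscall.h>

  #ifndef ARCH_REQ_XCOMP_GUEST_PERM
  #define ARCH_REQ_XCOMP_GUEST_PERM       0x1025
  #endif
  #define XFEATURE_XTILE_DATA             18

  /* Step 2: expand the vCPU thread group's guest xstate permissions. */
  static long request_guest_amx(void)
  {
          return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM,
                         XFEATURE_XTILE_DATA);
  }

  /* Step 3 then happens inside KVM_CREATE_VCPU, and step 4 re-allocates
   * guest fpstate once the dynamic xfeatures are actually enabled. */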

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
---
arch/x86/kernel/fpu/core.c | 39 +++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index e8205e261a24..dc2d2641fda7 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -194,8 +194,6 @@ void fpu_reset_from_exception_fixup(void)
}

#if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
-
static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
{
struct fpu_state_perm *fpuperm;
@@ -216,25 +214,48 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
}

-bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
{
struct fpstate *fpstate;
unsigned int size;

- size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
+ /*
+ * fpu_guest_cfg.default_size is initialized to hold all enabled
+ * xfeatures except the user dynamic xfeatures. If the user dynamic
+ * xfeatures are enabled, the guest fpstate will be re-allocated to
+ * hold all guest enabled xfeatures, so omit user dynamic xfeatures
+ * here.
+ */
+ size = fpu_guest_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
+
fpstate = vzalloc(size);
if (!fpstate)
- return false;
+ return NULL;
+ /*
+ * Initialize sizes and feature masks, use fpu_user_cfg.*
+ * for user_* settings for compatibility with existing uAPIs.
+ */
+ fpstate->size = fpu_guest_cfg.default_size;
+ fpstate->xfeatures = fpu_guest_cfg.default_features;
+ fpstate->user_size = fpu_user_cfg.default_size;
+ fpstate->user_xfeatures = fpu_user_cfg.default_features;
+ fpstate->xfd = 0;

- /* Leave xfd to 0 (the reset value defined by spec) */
- __fpstate_reset(fpstate, 0);
fpstate_init_user(fpstate);
fpstate->is_valloc = true;
fpstate->is_guest = true;

gfpu->fpstate = fpstate;
- gfpu->xfeatures = fpu_user_cfg.default_features;
- gfpu->perm = fpu_user_cfg.default_features;
+ gfpu->xfeatures = fpu_guest_cfg.default_features;
+ gfpu->perm = fpu_guest_cfg.default_features;
+
+ return fpstate;
+}
+
+bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+{
+ if (!__fpu_alloc_init_guest_fpstate(gfpu))
+ return false;

/*
* KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
--
2.43.0


2024-02-19 07:51:40

by Yang, Weijiang

Subject: [PATCH v10 03/27] x86/fpu/xstate: Add CET supervisor mode state support

Add supervisor mode state support within the FPU xstate management
framework. Although supervisor shadow stack is not enabled/used in the
kernel today, KVM requires the support because when KVM advertises the
shadow stack feature to a guest, architecturally it claims support for
both user and supervisor modes for guest OSes (Linux or non-Linux).

CET supervisor state includes not only the PL{0,1,2}_SSP MSRs but also the
IA32_S_CET MSR; the latter is not xsave-managed. In the virtualization
world, guest IA32_S_CET is saved to/loaded from the VM control structure.
With supervisor xstate support, guest supervisor mode shadow stack state
can be properly saved/restored when 1) guest/host FPU context is swapped
or 2) the vCPU thread is scheduled out/in.

The alternative is to enable it within the KVM domain, but the KVM
maintainers NAKed that solution. The external discussion can be found at
[*]; it ended up with adding the support in the kernel instead of in KVM.

Note, in the KVM case, guest CET supervisor state, i.e., the
IA32_PL{0,1,2}_SSP MSRs, is preserved after VM-Exit until the host/guest
fpstates are swapped, but since host supervisor shadow stack is disabled,
the preserved MSRs won't hurt the host.

[*]: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 14 ++++++++++++--
arch/x86/include/asm/fpu/xstate.h | 6 +++---
arch/x86/kernel/fpu/xstate.c | 6 +++++-
3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index ace9aa3b78a3..fe12724c50cc 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -118,7 +118,7 @@ enum xfeature {
XFEATURE_PKRU,
XFEATURE_PASID,
XFEATURE_CET_USER,
- XFEATURE_CET_KERNEL_UNUSED,
+ XFEATURE_CET_KERNEL,
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
@@ -141,7 +141,7 @@ enum xfeature {
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
-#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
@@ -266,6 +266,16 @@ struct cet_user_state {
u64 user_ssp;
};

+/*
+ * State component 12 is Control-flow Enforcement supervisor states
+ */
+struct cet_supervisor_state {
+ /* supervisor ssp pointers */
+ u64 pl0_ssp;
+ u64 pl1_ssp;
+ u64 pl2_ssp;
+};
+
/*
* State component 15: Architectural LBR configuration state.
* The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index d4427b88ee12..3b4a038d3c57 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -51,7 +51,8 @@

/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
- XFEATURE_MASK_CET_USER)
+ XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)

/*
* A supervisor state component may not always contain valuable information,
@@ -78,8 +79,7 @@
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
- XFEATURE_MASK_CET_KERNEL)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)

/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index f6b98693da59..03e166a87d61 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -51,7 +51,7 @@ static const char *xfeature_names[] =
"Protection Keys User registers",
"PASID state",
"Control-flow User registers",
- "Control-flow Kernel registers (unused)",
+ "Control-flow Kernel registers",
"unknown xstate feature",
"unknown xstate feature",
"unknown xstate feature",
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_OSPKE,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -277,6 +278,7 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
print_xstate_feature(XFEATURE_MASK_CET_USER);
+ print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}
@@ -346,6 +348,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID | \
XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL | \
XFEATURE_MASK_XTILE)

/*
@@ -546,6 +549,7 @@ static bool __init check_xstate_against_struct(int nr)
case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+ case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
default:
XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
--
2.43.0


2024-02-19 07:52:29

by Yang, Weijiang

Subject: [PATCH v10 05/27] x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration

Define a new fpu_guest_cfg to hold all guest FPU settings so that they can
differ from the generic kernel FPU settings, e.g., enabling the CET
supervisor xstate by default for guest fpstate while it remains disabled
in the kernel FPU config.

The kernel dynamic xfeatures are now specifically used by guest fpstate;
add the mask for guest fpstate so that guest_perm.__state_perm ==
(fpu_kernel_cfg.default_features | XFEATURE_MASK_KERNEL_DYNAMIC). And if
guest fpstate is re-allocated to hold user dynamic xfeatures, the
resulting permissions are consumed before calculating the new guest
fpstate.

With the new guest FPU config added, there are three categories of FPU
configs in the kernel; their usages and key fields are recapped below.

kernel FPU config:
@fpu_kernel_cfg.max_features
- all known and CPU supported user and supervisor features except
independent kernel features

@fpu_kernel_cfg.default_features
- all known and CPU supported user and supervisor features except
dynamic kernel features, independent kernel features and dynamic
userspace features.

@fpu_kernel_cfg.max_size
- size of compacted buffer with 'fpu_kernel_cfg.max_features'

@fpu_kernel_cfg.default_size
- size of compacted buffer with 'fpu_kernel_cfg.default_features'

user FPU config:
@fpu_user_cfg.max_features
- all known and CPU supported user features

@fpu_user_cfg.default_features
- all known and CPU supported user features except dynamic userspace
features.

@fpu_user_cfg.max_size
- size of non-compacted buffer with 'fpu_user_cfg.max_features'

@fpu_user_cfg.default_size
- size of non-compacted buffer with 'fpu_user_cfg.default_features'

guest FPU config:
@fpu_guest_cfg.max_features
- all known and CPU supported user and supervisor features except
independent kernel features.

@fpu_guest_cfg.default_features
- all known and CPU supported user and supervisor features except
independent kernel features and dynamic userspace features.

@fpu_guest_cfg.max_size
- size of compacted buffer with 'fpu_guest_cfg.max_features'

@fpu_guest_cfg.default_size
- size of compacted buffer with 'fpu_guest_cfg.default_features'

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 2 +-
arch/x86/kernel/fpu/core.c | 14 +++++++++++---
arch/x86/kernel/fpu/xstate.c | 10 ++++++++++
3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index fe12724c50cc..aa00a9617832 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -604,6 +604,6 @@ struct fpu_state_config {
};

/* FPU state configuration information */
-extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;
+extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg, fpu_guest_cfg;

#endif /* _ASM_X86_FPU_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 520deb411a70..e8205e261a24 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -33,9 +33,10 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
DEFINE_PER_CPU(u64, xfd_state);
#endif

-/* The FPU state configuration data for kernel and user space */
+/* The FPU state configuration data for kernel, user space and guest. */
struct fpu_state_config fpu_kernel_cfg __ro_after_init;
struct fpu_state_config fpu_user_cfg __ro_after_init;
+struct fpu_state_config fpu_guest_cfg __ro_after_init;

/*
* Represents the initial FPU state. It's mostly (but not completely) zeroes,
@@ -536,8 +537,15 @@ void fpstate_reset(struct fpu *fpu)
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
fpu->perm.__state_size = fpu_kernel_cfg.default_size;
fpu->perm.__user_state_size = fpu_user_cfg.default_size;
- /* Same defaults for guests */
- fpu->guest_perm = fpu->perm;
+
+ /* Guest permission settings */
+ fpu->guest_perm.__state_perm = fpu_guest_cfg.default_features;
+ fpu->guest_perm.__state_size = fpu_guest_cfg.default_size;
+ /*
+ * Set guest's __user_state_size to fpu_user_cfg.default_size so that
+ * existing uAPIs can still work.
+ */
+ fpu->guest_perm.__user_state_size = fpu_user_cfg.default_size;
}

static inline void fpu_inherit_perms(struct fpu *dst_fpu)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index ca4b83c142eb..9cbdc83d1eab 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -681,6 +681,7 @@ static int __init init_xstate_size(void)
{
/* Recompute the context size for enabled features: */
unsigned int user_size, kernel_size, kernel_default_size;
+ unsigned int guest_default_size;
bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);

/* Uncompacted user space size */
@@ -702,13 +703,18 @@ static int __init init_xstate_size(void)
kernel_default_size =
xstate_calculate_size(fpu_kernel_cfg.default_features, compacted);

+ guest_default_size =
+ xstate_calculate_size(fpu_guest_cfg.default_features, compacted);
+
if (!paranoid_xstate_size_valid(kernel_size))
return -EINVAL;

fpu_kernel_cfg.max_size = kernel_size;
fpu_user_cfg.max_size = user_size;
+ fpu_guest_cfg.max_size = kernel_size;

fpu_kernel_cfg.default_size = kernel_default_size;
+ fpu_guest_cfg.default_size = guest_default_size;
fpu_user_cfg.default_size =
xstate_calculate_size(fpu_user_cfg.default_features, false);

@@ -829,6 +835,10 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
fpu_user_cfg.default_features = fpu_user_cfg.max_features;
fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;

+ fpu_guest_cfg.max_features = fpu_kernel_cfg.max_features;
+ fpu_guest_cfg.default_features = fpu_guest_cfg.max_features;
+ fpu_guest_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+
/* Store it for paranoia check at the end */
xfeatures = fpu_kernel_cfg.max_features;

--
2.43.0


2024-02-19 07:52:40

by Yang, Weijiang

Subject: [PATCH v10 08/27] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data

From: Sean Christopherson <[email protected]>

Rework and rename cpuid_get_supported_xcr0() to explicitly operate on
vCPU state, i.e. on a vCPU's CPUID state, now that the only usage of
the helper is to retrieve a vCPU's already-set CPUID.

Prior to commit 275a87244ec8 ("KVM: x86: Don't adjust guest's CPUID.0x12.1
(allowed SGX enclave XFRM)"), KVM incorrectly fudged guest CPUID at runtime,
which in turn necessitated massaging the incoming CPUID state for
KVM_SET_CPUID{2} so as not to run afoul of kvm_cpuid_check_equal().
I.e. KVM also invoked cpuid_get_supported_xcr0() with the incoming CPUID
state, and thus without an explicit vCPU object.

Opportunistically move the helper below kvm_update_cpuid_runtime() to make
it harder to repeat the mistake of querying supported XCR0 for runtime
updates.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
---
arch/x86/kvm/cpuid.c | 33 ++++++++++++++++-----------------
1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index adba49afb5fe..d57a6255b19f 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -247,21 +247,6 @@ void kvm_update_pv_runtime(struct kvm_vcpu *vcpu)
vcpu->arch.pv_cpuid.features = best->eax;
}

-/*
- * Calculate guest's supported XCR0 taking into account guest CPUID data and
- * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
- */
-static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
-{
- struct kvm_cpuid_entry2 *best;
-
- best = cpuid_entry2_find(entries, nent, 0xd, 0);
- if (!best)
- return 0;
-
- return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
-}
-
static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
int nent)
{
@@ -312,6 +297,21 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);

+/*
+ * Calculate guest's supported XCR0 taking into account guest CPUID data and
+ * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
+ */
+static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 0);
+ if (!best)
+ return 0;
+
+ return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
+}
+
static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
{
#ifdef CONFIG_KVM_HYPERV
@@ -361,8 +361,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_apic_set_version(vcpu);
}

- vcpu->arch.guest_supported_xcr0 =
- cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
+ vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);

kvm_update_pv_runtime(vcpu);

--
2.43.0


2024-02-19 07:53:40

by Yang, Weijiang

Subject: [PATCH v10 09/27] KVM: x86: Rename kvm_{g,s}et_msr()* to manifest emulation operations

Rename kvm_{g,s}et_msr()* to kvm_emulate_msr_{read,write}()* to make it
more obvious that KVM uses these helpers to emulate guest behaviors,
i.e., host_initiated == false in these helpers.

Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++--
arch/x86/kvm/smm.c | 4 ++--
arch/x86/kvm/vmx/nested.c | 13 +++++++------
arch/x86/kvm/x86.c | 24 +++++++++++++-----------
4 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aaf5a25ea7ed..5ab122f8843e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2016,8 +2016,8 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
void kvm_enable_efer_bits(u64);
bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
-int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);
+int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
+int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_as_nop(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index dc3d95fdca7d..45c855389ea7 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -535,7 +535,7 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,

vcpu->arch.smbase = smstate->smbase;

- if (kvm_set_msr(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
+ if (kvm_emulate_msr_write(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
return X86EMUL_UNHANDLEABLE;

rsm_load_seg_64(vcpu, &smstate->tr, VCPU_SREG_TR);
@@ -626,7 +626,7 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)

/* And finally go back to 32-bit mode. */
efer = 0;
- kvm_set_msr(vcpu, MSR_EFER, efer);
+ kvm_emulate_msr_write(vcpu, MSR_EFER, efer);
}
#endif

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 994e014f8a50..4be0078ca713 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -958,7 +958,7 @@ static u32 nested_vmx_load_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count)
__func__, i, e.index, e.reserved);
goto fail;
}
- if (kvm_set_msr(vcpu, e.index, e.value)) {
+ if (kvm_emulate_msr_write(vcpu, e.index, e.value)) {
pr_debug_ratelimited(
"%s cannot write MSR (%u, 0x%x, 0x%llx)\n",
__func__, i, e.index, e.value);
@@ -994,7 +994,7 @@ static bool nested_vmx_get_vmexit_msr_value(struct kvm_vcpu *vcpu,
}
}

- if (kvm_get_msr(vcpu, msr_index, data)) {
+ if (kvm_emulate_msr_read(vcpu, msr_index, data)) {
pr_debug_ratelimited("%s cannot read MSR (0x%x)\n", __func__,
msr_index);
return false;
@@ -2686,7 +2686,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,

if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)) &&
- WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
+ WARN_ON_ONCE(kvm_emulate_msr_write(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
vmcs12->guest_ia32_perf_global_ctrl))) {
*entry_failure_code = ENTRY_FAIL_DEFAULT;
return -EINVAL;
@@ -4568,8 +4568,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
}
if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) &&
kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)))
- WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
- vmcs12->host_ia32_perf_global_ctrl));
+ WARN_ON_ONCE(kvm_emulate_msr_write(vcpu,
+ MSR_CORE_PERF_GLOBAL_CTRL,
+ vmcs12->host_ia32_perf_global_ctrl));

/* Set L1 segment info according to Intel SDM
27.5.2 Loading Host Segment and Descriptor-Table Registers */
@@ -4744,7 +4745,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
goto vmabort;
}

- if (kvm_set_msr(vcpu, h.index, h.value)) {
+ if (kvm_emulate_msr_write(vcpu, h.index, h.value)) {
pr_debug_ratelimited(
"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
__func__, j, h.index, h.value);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcd3258c7ece..10847e1cc413 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1961,31 +1961,33 @@ static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
return ret;
}

-static int kvm_get_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+static int kvm_emulate_msr_read_with_filter(struct kvm_vcpu *vcpu, u32 index,
+ u64 *data)
{
if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ))
return KVM_MSR_RET_FILTERED;
return kvm_get_msr_ignored_check(vcpu, index, data, false);
}

-static int kvm_set_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 data)
+static int kvm_emulate_msr_write_with_filter(struct kvm_vcpu *vcpu, u32 index,
+ u64 data)
{
if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE))
return KVM_MSR_RET_FILTERED;
return kvm_set_msr_ignored_check(vcpu, index, data, false);
}

-int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
{
return kvm_get_msr_ignored_check(vcpu, index, data, false);
}
-EXPORT_SYMBOL_GPL(kvm_get_msr);
+EXPORT_SYMBOL_GPL(kvm_emulate_msr_read);

-int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
+int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
{
return kvm_set_msr_ignored_check(vcpu, index, data, false);
}
-EXPORT_SYMBOL_GPL(kvm_set_msr);
+EXPORT_SYMBOL_GPL(kvm_emulate_msr_write);

static void complete_userspace_rdmsr(struct kvm_vcpu *vcpu)
{
@@ -2057,7 +2059,7 @@ int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu)
u64 data;
int r;

- r = kvm_get_msr_with_filter(vcpu, ecx, &data);
+ r = kvm_emulate_msr_read_with_filter(vcpu, ecx, &data);

if (!r) {
trace_kvm_msr_read(ecx, data);
@@ -2082,7 +2084,7 @@ int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu)
u64 data = kvm_read_edx_eax(vcpu);
int r;

- r = kvm_set_msr_with_filter(vcpu, ecx, data);
+ r = kvm_emulate_msr_write_with_filter(vcpu, ecx, data);

if (!r) {
trace_kvm_msr_write(ecx, data);
@@ -8365,7 +8367,7 @@ static int emulator_get_msr_with_filter(struct x86_emulate_ctxt *ctxt,
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
int r;

- r = kvm_get_msr_with_filter(vcpu, msr_index, pdata);
+ r = kvm_emulate_msr_read_with_filter(vcpu, msr_index, pdata);
if (r < 0)
return X86EMUL_UNHANDLEABLE;

@@ -8388,7 +8390,7 @@ static int emulator_set_msr_with_filter(struct x86_emulate_ctxt *ctxt,
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
int r;

- r = kvm_set_msr_with_filter(vcpu, msr_index, data);
+ r = kvm_emulate_msr_write_with_filter(vcpu, msr_index, data);
if (r < 0)
return X86EMUL_UNHANDLEABLE;

@@ -8408,7 +8410,7 @@ static int emulator_set_msr_with_filter(struct x86_emulate_ctxt *ctxt,
static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
u32 msr_index, u64 *pdata)
{
- return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata);
+ return kvm_emulate_msr_read(emul_to_vcpu(ctxt), msr_index, pdata);
}

static int emulator_check_rdpmc_early(struct x86_emulate_ctxt *ctxt, u32 pmc)
--
2.43.0


2024-02-19 07:53:55

by Yang, Weijiang

Subject: [PATCH v10 12/27] KVM: x86: Report XSS as to-be-saved if there are supported features

From: Sean Christopherson <[email protected]>

Add MSR_IA32_XSS to list of MSRs reported to userspace if supported_xss
is non-zero, i.e. KVM supports at least one XSS based feature.

Before the CET virtualization series is enabled, guest MSR IA32_XSS is
guaranteed to be 0, i.e., XSAVES/XRSTORS is executed in non-root mode
with XSS == 0, which is equivalent in effect to XSAVE/XRSTOR.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
---
arch/x86/kvm/x86.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cbd44f904ba8..9eb5c8dbd4fb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1464,6 +1464,7 @@ static const u32 msrs_to_save_base[] = {
MSR_IA32_UMWAIT_CONTROL,

MSR_IA32_XFD, MSR_IA32_XFD_ERR,
+ MSR_IA32_XSS,
};

static const u32 msrs_to_save_pmu[] = {
@@ -7388,6 +7389,10 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
return;
break;
+ case MSR_IA32_XSS:
+ if (!kvm_caps.supported_xss)
+ return;
+ break;
default:
break;
}
--
2.43.0


2024-02-19 07:54:18

by Yang, Weijiang

Subject: [PATCH v10 11/27] KVM: x86: Add kvm_msr_{read,write}() helpers

Wrap __kvm_{get,set}_msr() into two new helpers for KVM-internal usage and
use the helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers, i.e., used when KVM needs
to get/set an MSR value for emulating CPU behavior, with host_initiated ==
%true in the helpers.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/x86.c | 16 +++++++++++++---
3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5ab122f8843e..f95e93975242 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2015,9 +2015,10 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);

void kvm_enable_efer_bits(u64);
bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
int kvm_emulate_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu);
int kvm_emulate_as_nop(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index d57a6255b19f..39529e14ae59 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1548,7 +1548,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
*edx = entry->edx;
if (function == 7 && index == 0) {
u64 data;
- if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
+ if (!kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
(data & TSX_CTRL_CPUID_CLEAR))
*ebx &= ~(F(RTM) | F(HLE));
} else if (function == 0x80000007) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5a9c07751c0e..cbd44f904ba8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1919,8 +1919,8 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
- bool host_initiated)
+static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
+ bool host_initiated)
{
struct msr_data msr;
int ret;
@@ -1946,6 +1946,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
return ret;
}

+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+ return __kvm_set_msr(vcpu, index, data, true);
+}
+
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+{
+ return __kvm_get_msr(vcpu, index, data, true);
+}
+
static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
u32 index, u64 *data, bool host_initiated)
{
@@ -12323,7 +12333,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;

__kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
- __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
+ kvm_msr_write(vcpu, MSR_IA32_XSS, 0);
}

/* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
--
2.43.0


2024-02-19 07:54:39

by Yang, Weijiang

Subject: [PATCH v10 13/27] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the currently required xstate
size whenever the XSS MSR is modified.
CPUID.(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
xstate features in (XCR0 | IA32_XSS). The guest can use this CPUID value
to allocate a sufficiently large xsave buffer (see the sketch below).

Note, KVM does not yet support any XSS based features, i.e. supported_xss
is guaranteed to be zero at this time.

Opportunistically modify the XSS write access logic: if XSAVES is not
enabled in the guest CPUID, forbid setting the IA32_XSS MSR to anything
but 0, even if the write is host-initiated.
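
For reference, a minimal sketch of how a guest could size its xsave buffer
with this CPUID leaf, using the compiler-provided <cpuid.h> helper:

  #include <cpuid.h>

  /* CPUID.(EAX=0DH,ECX=1).EBX: bytes required for the XSAVES area
   * covering all features currently enabled in (XCR0 | IA32_XSS). */
  static unsigned int xsaves_buffer_bytes(void)
  {
          unsigned int eax, ebx, ecx, edx;

          __cpuid_count(0xd, 1, eax, ebx, ecx, edx);
          return ebx;
  }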

Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Zhang Yi Z <[email protected]>
Signed-off-by: Zhang Yi Z <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/cpuid.c | 15 ++++++++++++++-
arch/x86/kvm/x86.c | 13 ++++++++++---
3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f95e93975242..79f7c18c487b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -773,7 +773,6 @@ struct kvm_vcpu_arch {
bool at_instruction_boundary;
bool tpr_access_reporting;
bool xfd_no_write_intercept;
- u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
u64 perf_capabilities;
@@ -829,6 +828,8 @@ struct kvm_vcpu_arch {

u64 xcr0;
u64 guest_supported_xcr0;
+ u64 guest_supported_xss;
+ u64 ia32_xss;

struct kvm_pio_request pio;
void *pio_data;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 39529e14ae59..2bb1931103ad 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
best = cpuid_entry2_find(entries, nent, 0xD, 1);
if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
- best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
+ best->ebx = xstate_required_size(vcpu->arch.xcr0 |
+ vcpu->arch.ia32_xss, true);

best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
if (kvm_hlt_in_guest(vcpu->kvm) && best &&
@@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
}

+static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
+ if (!best)
+ return 0;
+
+ return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
+}
+
static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
{
#ifdef CONFIG_KVM_HYPERV
@@ -362,6 +374,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
}

vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
+ vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);

kvm_update_pv_runtime(vcpu);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9eb5c8dbd4fb..b502d68a2576 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3926,16 +3926,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
break;
case MSR_IA32_XSS:
- if (!msr_info->host_initiated &&
- !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
+ /*
+ * If KVM reported support for the XSS MSR, still allow userspace to
+ * set the default value (0) for this MSR even if the guest CPUID
+ * doesn't support XSAVES.
+ */
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
+ !(msr_info->host_initiated && data == 0))
return 1;
/*
* KVM supports exposing PT to the guest, but does not support
* IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
* XSAVES/XRSTORS to save/restore PT MSRs.
*/
- if (data & ~kvm_caps.supported_xss)
+ if (data & ~vcpu->arch.guest_supported_xss)
return 1;
+ if (vcpu->arch.ia32_xss == data)
+ break;
vcpu->arch.ia32_xss = data;
kvm_update_cpuid_runtime(vcpu);
break;
--
2.43.0


2024-02-19 07:55:17

by Yang, Weijiang

Subject: [PATCH v10 15/27] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs

From: Sean Christopherson <[email protected]>

Load the guest's FPU state if userspace is accessing MSRs whose values
are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(),
to facilitate access to such kind of MSRs.

If the MSRs supported in kvm_caps.supported_xss are passed through to the
guest, the guest MSRs are swapped with the host's before the vCPU exits
to userspace and after it re-enters the kernel before the next VM-Entry.

Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
explicitly check @vcpu is non-null before attempting to load guest state.
The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without
loading guest FPU state (which doesn't exist).

Note that guest_cpuid_has() is not queried as host userspace is allowed to
access MSRs that have not been exposed to the guest, e.g. it might do
KVM_SET_MSRS prior to KVM_SET_CPUID2.

The two helpers are put here in order to make explicit that accessing
xsave-managed MSRs requires special checks and handling to guarantee the
correctness of reads and writes to the MSRs.

Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Yang Weijiang <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.h | 24 ++++++++++++++++++++++++
2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 60b574fc04d1..906307757159 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);

static DEFINE_MUTEX(vendor_module_lock);
+static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
+static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
+
struct kvm_x86_ops kvm_x86_ops __read_mostly;

#define KVM_X86_OP(func) \
@@ -4509,6 +4512,21 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
EXPORT_SYMBOL_GPL(kvm_get_msr_common);

+/*
+ * Returns true if the MSR in question is managed via XSTATE, i.e. is context
+ * switched with the rest of guest FPU state.
+ */
+static bool is_xstate_managed_msr(u32 index)
+{
+ switch (index) {
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ return true;
+ default:
+ return false;
+ }
+}
+
/*
* Read or write a bunch of msrs. All parameters are kernel addresses.
*
@@ -4519,11 +4537,26 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
int (*do_msr)(struct kvm_vcpu *vcpu,
unsigned index, u64 *data))
{
+ bool fpu_loaded = false;
int i;

- for (i = 0; i < msrs->nmsrs; ++i)
+ for (i = 0; i < msrs->nmsrs; ++i) {
+ /*
+ * If userspace is accessing one or more XSTATE-managed MSRs,
+ * temporarily load the guest's FPU state so that the guest's
+ * MSR value(s) is resident in hardware, i.e. so that KVM can
+ * get/set the MSR via RDMSR/WRMSR.
+ */
+ if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
+ is_xstate_managed_msr(entries[i].index)) {
+ kvm_load_guest_fpu(vcpu);
+ fpu_loaded = true;
+ }
if (do_msr(vcpu, entries[i].index, &entries[i].data))
break;
+ }
+ if (fpu_loaded)
+ kvm_put_guest_fpu(vcpu);

return i;
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2f7e19166658..9c19dfb5011d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -543,4 +543,28 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
unsigned int port, void *data, unsigned int count,
int in);

+/*
+ * Lock and/or reload guest FPU and access xstate MSRs. For accesses initiated
+ * by host, guest FPU is loaded in __msr_io(). For accesses initiated by guest,
+ * guest FPU should have been loaded already.
+ */
+
+static inline void kvm_get_xstate_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info)
+{
+ KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
+ kvm_fpu_get();
+ rdmsrl(msr_info->index, msr_info->data);
+ kvm_fpu_put();
+}
+
+static inline void kvm_set_xstate_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info)
+{
+ KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
+ kvm_fpu_get();
+ wrmsrl(msr_info->index, msr_info->data);
+ kvm_fpu_put();
+}
+
#endif
--
2.43.0


2024-02-19 07:55:21

by Yang, Weijiang

Subject: [PATCH v10 16/27] KVM: x86: Add fault checks for guest CR4.CET setting

Check potential faults for CR4.CET setting per Intel SDM requirements.
CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET ==
1 faults if CR0.WP == 0 and setting CR0.WP == 0 fails if CR4.CET == 1.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 906307757159..5f5df7e38d3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1006,6 +1006,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
(is_64_bit_mode(vcpu) || kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)))
return 1;

+ if (!(cr0 & X86_CR0_WP) && kvm_is_cr4_bit_set(vcpu, X86_CR4_CET))
+ return 1;
+
static_call(kvm_x86_set_cr0)(vcpu, cr0);

kvm_post_set_cr0(vcpu, old_cr0, cr0);
@@ -1217,6 +1220,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
return 1;
}

+ if ((cr4 & X86_CR4_CET) && !kvm_is_cr0_bit_set(vcpu, X86_CR0_WP))
+ return 1;
+
static_call(kvm_x86_set_cr4)(vcpu, cr4);

kvm_post_set_cr4(vcpu, old_cr4, cr4);
--
2.43.0


2024-02-19 07:55:49

by Yang, Weijiang

Subject: [PATCH v10 18/27] KVM: VMX: Introduce CET VMCS fields and control bits

Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides
two sub-features, Shadow Stack (SHSTK) and Indirect Branch Tracking (IBT),
to defend against this style of control-flow subversion.

Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.

Indirect Branch Tracking (IBT):
IBT introduces a new instruction (ENDBRANCH) to mark valid target addresses
of indirect branches (CALL, JMP, etc.). If an indirect branch is executed
and the next instruction is _not_ an ENDBRANCH, the processor generates
a #CP. The instruction behaves as a NOP on platforms without CET.

Several new CET MSRs are defined to support CET:
MSR_IA32_{U,S}_CET: CET settings for {user,supervisor} CET respectively.

MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.

MSR_IA32_INT_SSP_TAB: Linear address of the SHSTK pointer table, whose
entries are indexed by the IST field of an interrupt gate descriptor.

Two XSAVES state bits are introduced for CET:
IA32_XSS[bit 11]: Controls saving/restoring user mode CET states.
IA32_XSS[bit 12]: Controls saving/restoring supervisor mode CET states.

Six VMCS fields are introduced for CET:
{HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
{HOST,GUEST}_SSP: Stores current active SSP.
{HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.

On Intel platforms, two additional bits are defined in VM_EXIT and VM_ENTRY
control fields:
If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from following
VMCS fields at VM-Exit:
HOST_S_CET
HOST_SSP
HOST_INTR_SSP_TABLE

If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from following
VMCS fields at VM-Entry:
GUEST_S_CET
GUEST_SSP
GUEST_INTR_SSP_TABLE
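
As an illustrative sketch (this patch only adds the definitions; the
consumers come later in the series), an emulated guest write to
MSR_IA32_S_CET would be propagated to the new natural-width field with
KVM's existing vmcs_writel() helper:

  /* Sketch: the written value takes effect at the next VM-Entry when
   * VM_ENTRY_LOAD_CET_STATE is set in the VM-Entry controls. */
  static void vmx_set_guest_s_cet(u64 data)
  {
          vmcs_writel(GUEST_S_CET, data);
  }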

Co-developed-by: Zhang Yi Z <[email protected]>
Signed-off-by: Zhang Yi Z <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/vmx.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..451fd4f4fedc 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -104,6 +104,7 @@
#define VM_EXIT_CLEAR_BNDCFGS 0x00800000
#define VM_EXIT_PT_CONCEAL_PIP 0x01000000
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
+#define VM_EXIT_LOAD_CET_STATE 0x10000000

#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff

@@ -117,6 +118,7 @@
#define VM_ENTRY_LOAD_BNDCFGS 0x00010000
#define VM_ENTRY_PT_CONCEAL_PIP 0x00020000
#define VM_ENTRY_LOAD_IA32_RTIT_CTL 0x00040000
+#define VM_ENTRY_LOAD_CET_STATE 0x00100000

#define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x000011ff

@@ -345,6 +347,9 @@ enum vmcs_field {
GUEST_PENDING_DBG_EXCEPTIONS = 0x00006822,
GUEST_SYSENTER_ESP = 0x00006824,
GUEST_SYSENTER_EIP = 0x00006826,
+ GUEST_S_CET = 0x00006828,
+ GUEST_SSP = 0x0000682a,
+ GUEST_INTR_SSP_TABLE = 0x0000682c,
HOST_CR0 = 0x00006c00,
HOST_CR3 = 0x00006c02,
HOST_CR4 = 0x00006c04,
@@ -357,6 +362,9 @@ enum vmcs_field {
HOST_IA32_SYSENTER_EIP = 0x00006c12,
HOST_RSP = 0x00006c14,
HOST_RIP = 0x00006c16,
+ HOST_S_CET = 0x00006c18,
+ HOST_SSP = 0x00006c1a,
+ HOST_INTR_SSP_TABLE = 0x00006c1c
};

/*
--
2.43.0


2024-02-19 07:56:02

by Yang, Weijiang

Subject: [PATCH v10 17/27] KVM: x86: Report KVM supported CET MSRs as to-be-saved

Add CET MSRs to the list of MSRs reported to userspace if the feature,
i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.

SSP can only be read via RDSSP. Even writing it requires destructive and
potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
for the GUEST_SSP field of the VMCS.
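
A hedged sketch of how host userspace could read this pseudo-MSR through
the existing KVM_GET_MSRS vCPU ioctl (error handling elided; the index
value is the one defined in this patch):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static unsigned long long read_guest_ssp(int vcpu_fd)
  {
          struct {
                  struct kvm_msrs hdr;
                  struct kvm_msr_entry entry;
          } msrs = {
                  .hdr.nmsrs   = 1,
                  .entry.index = 0x4b564d09,      /* MSR_KVM_SSP */
          };

          ioctl(vcpu_fd, KVM_GET_MSRS, &msrs);
          return msrs.entry.data;
  }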

Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/x86.c | 18 ++++++++++++++++++
3 files changed, 21 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 605899594ebb..9d08c0bec477 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -58,6 +58,7 @@
#define MSR_KVM_ASYNC_PF_INT 0x4b564d06
#define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
#define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
+#define MSR_KVM_SSP 0x4b564d09

struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9239a89dea22..46042bc6e2fa 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7007,6 +7007,8 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
case MSR_AMD64_TSC_RATIO:
/* This is AMD only. */
return false;
+ case MSR_KVM_SSP:
+ return kvm_cpu_cap_has(X86_FEATURE_SHSTK);
default:
return true;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5f5df7e38d3d..c0ed69353674 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1476,6 +1476,9 @@ static const u32 msrs_to_save_base[] = {

MSR_IA32_XFD, MSR_IA32_XFD_ERR,
MSR_IA32_XSS,
+ MSR_IA32_U_CET, MSR_IA32_S_CET,
+ MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP,
+ MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB,
};

static const u32 msrs_to_save_pmu[] = {
@@ -1579,6 +1582,7 @@ static const u32 emulated_msrs_all[] = {

MSR_K7_HWCR,
MSR_KVM_POLL_CONTROL,
+ MSR_KVM_SSP,
};

static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -7441,6 +7445,20 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!kvm_caps.supported_xss)
return;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+ !kvm_cpu_cap_has(X86_FEATURE_IBT))
+ return;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!kvm_cpu_cap_has(X86_FEATURE_LM))
+ return;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ return;
+ break;
default:
break;
}
--
2.43.0


2024-02-19 07:56:39

by Yang, Weijiang

Subject: [PATCH v10 02/27] x86/fpu/xstate: Refine CET user xstate bit enabling

Remove XFEATURE_CET_USER entry from dependency array as the entry doesn't
reflect true dependency between CET features and the user xstate bit.
Enable the bit in fpu_kernel_cfg.max_features when either SHSTK or IBT is
available.

Both user mode shadow stack and indirect branch tracking features depend
on XFEATURE_CET_USER bit in XSS to automatically save/restore user mode
xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever necessary.

Note, the issue, i.e., CPUID enumerating IBT but not SHSTK, results from
the CET KVM series, which synthesizes guest CPUIDs based on userspace
settings; in the real world the case is rare. In other words, the existing
dependency check is correct when only user mode SHSTK is available.

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Tested-by: Rick Edgecombe <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 07911532b108..f6b98693da59 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_OSPKE,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
- [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
}

+ /*
+ * CET user mode xstate bit has been cleared by above sanity check.
+ * Now pick it up if either SHSTK or IBT is available. Either feature
+ * depends on the xstate bit to save/restore user mode states.
+ */
+ if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
+ fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
+
if (!cpu_feature_enabled(X86_FEATURE_XFD))
fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;

--
2.43.0


2024-02-19 07:56:39

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 19/27] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"

Use the governed feature framework to track whether X86_FEATURE_SHSTK
and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
the features can be used iff both KVM and guest CPUID can support them.
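
Conceptually (a sketch, not the literal implementation), the governed bit
caches the following computation so it doesn't have to be redone on hot
paths:

/* What guest_can_use(vcpu, X86_FEATURE_SHSTK) reflects: */
bool usable = kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
              guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);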

TODO: remove this patch once Sean's refactor to "KVM-governed" framework
is upstreamed. See the work here [*].

[*]: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/governed_features.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
index ad463b1ed4e4..daf0c0a3e29c 100644
--- a/arch/x86/kvm/governed_features.h
+++ b/arch/x86/kvm/governed_features.h
@@ -17,6 +17,8 @@ KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
KVM_GOVERNED_X86_FEATURE(VGIF)
KVM_GOVERNED_X86_FEATURE(VNMI)
KVM_GOVERNED_X86_FEATURE(LAM)
+KVM_GOVERNED_X86_FEATURE(SHSTK)
+KVM_GOVERNED_X86_FEATURE(IBT)

#undef KVM_GOVERNED_X86_FEATURE
#undef KVM_GOVERNED_FEATURE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 46042bc6e2fa..6cb94754c2a9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7764,6 +7764,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)

kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_LAM);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);

vmx_setup_uret_msrs(vmx);

--
2.43.0


2024-02-19 07:56:56

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 20/27] KVM: VMX: Emulate read and write to CET MSRs

Add an emulation interface for CET MSR access. The emulation code is
split into a common part and a vendor-specific part. The former performs
common checks for the MSRs, e.g., accessibility, data validity etc., then
hands the operation off to either the XSAVE-managed MSR helpers or the
CET VMCS fields.
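
For illustration, a few example guest WRMSR values for MSR_IA32_U_CET and
how the common checks below treat them (a sketch; bit names per the SDM):

/*
 * data = 0x1   (SH_STK_EN,  bit 0)  -> valid only if SHSTK is exposed
 * data = 0x4   (ENDBR_EN,   bit 2)  -> valid only if IBT is exposed
 * data = 0x40  (bit 6, reserved)    -> always rejected
 * data = 0xc00 (SUPPRESS + TRACKER=WAIT_ENDBR) -> always rejected
 */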

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 18 +++++++++
arch/x86/kvm/x86.c | 88 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 106 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6cb94754c2a9..ff2296fa7d39 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2106,6 +2106,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
else
msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
break;
+ case MSR_IA32_S_CET:
+ msr_info->data = vmcs_readl(GUEST_S_CET);
+ break;
+ case MSR_KVM_SSP:
+ msr_info->data = vmcs_readl(GUEST_SSP);
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
+ break;
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
break;
@@ -2415,6 +2424,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
else
vmx->pt_desc.guest.addr_a[index / 2] = data;
break;
+ case MSR_IA32_S_CET:
+ vmcs_writel(GUEST_S_CET, data);
+ break;
+ case MSR_KVM_SSP:
+ vmcs_writel(GUEST_SSP, data);
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ vmcs_writel(GUEST_INTR_SSP_TABLE, data);
+ break;
case MSR_IA32_PERF_CAPABILITIES:
if (data && !vcpu_to_pmu(vcpu)->version)
return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c0ed69353674..281c3fe728c5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1849,6 +1849,36 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
}
EXPORT_SYMBOL_GPL(kvm_msr_allowed);

+#define CET_US_RESERVED_BITS GENMASK(9, 6)
+#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
+#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
+#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
+
+static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
+ bool host_initiated)
+{
+ bool msr_ctrl = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
+
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return true;
+
+ if (msr_ctrl && guest_can_use(vcpu, X86_FEATURE_IBT))
+ return true;
+
+ /*
+ * If KVM supports the MSR, i.e. has enumerated the MSR existence to
+ * userspace, then userspace is allowed to write '0' irrespective of
+ * whether or not the MSR is exposed to the guest.
+ */
+ if (!host_initiated || data)
+ return false;
+
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ return true;
+
+ return msr_ctrl && kvm_cpu_cap_has(X86_FEATURE_IBT);
+}
+
/*
* Write @data into the MSR specified by @index. Select MSR specific fault
* checks are bypassed if @host_initiated is %true.
@@ -1908,6 +1938,42 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,

data = (u32)data;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
+ return 1;
+ if (data & CET_US_RESERVED_BITS)
+ return 1;
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+ (data & CET_US_SHSTK_MASK_BITS))
+ return 1;
+ if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
+ (data & CET_US_IBT_MASK_BITS))
+ return 1;
+ if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
+ return 1;
+ /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
+ if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
+ return 1;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
+ return 1;
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ break;
+ case MSR_KVM_SSP:
+ if (!host_initiated)
+ return 1;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
+ return 1;
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ if (!IS_ALIGNED(data, 4))
+ return 1;
+ break;
}

msr.data = data;
@@ -1951,6 +2017,20 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
!guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
return 1;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+ !guest_can_use(vcpu, X86_FEATURE_IBT))
+ return 1;
+ break;
+ case MSR_KVM_SSP:
+ if (!host_initiated)
+ return 1;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return 1;
+ break;
}

msr.index = index;
@@ -4143,6 +4223,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.guest_fpu.xfd_err = data;
break;
#endif
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ kvm_set_xstate_msr(vcpu, msr_info);
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr))
return kvm_pmu_set_msr(vcpu, msr_info);
@@ -4502,6 +4586,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vcpu->arch.guest_fpu.xfd_err;
break;
#endif
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ kvm_get_xstate_msr(vcpu, msr_info);
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
return kvm_pmu_get_msr(vcpu, msr_info);
--
2.43.0


2024-02-19 07:57:12

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 01/27] x86/fpu/xstate: Always preserve non-user xfeatures/flags in __state_perm

From: Sean Christopherson <[email protected]>

When granting userspace or a KVM guest access to an xfeature, preserve the
entity's existing supervisor and software-defined permissions as tracked
by __state_perm, i.e. use __state_perm to track *all* permissions even
though all supported supervisor xfeatures are granted to all FPUs and
FPU_GUEST_PERM_LOCKED disallows changing permissions.

Effectively clobbering supervisor permissions results in inconsistent
behavior, as xstate_get_group_perm() will report supervisor features for
processes that do NOT request access to dynamic user xfeatures, whereas any
and all supervisor features will be absent from the set of permissions for
any process that is granted access to one or more dynamic xfeatures (which
right now means AMX).

The inconsistency isn't problematic because fpu_xstate_prctl() already
strips out everything except user xfeatures:

case ARCH_GET_XCOMP_PERM:
/*
* Lockless snapshot as it can also change right after the
* dropping the lock.
*/
permitted = xstate_get_host_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

case ARCH_GET_XCOMP_GUEST_PERM:
permitted = xstate_get_guest_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

and similarly KVM doesn't apply the __state_perm to supervisor states
(kvm_get_filtered_xcr0() incorporates xstate_get_guest_group_perm()):

case 0xd: {
u64 permitted_xcr0 = kvm_get_filtered_xcr0();
u64 permitted_xss = kvm_caps.supported_xss;

But if KVM in particular were to ever change, dropping supervisor
permissions would result in subtle bugs in KVM's reporting of supported
CPUID settings. And the above behavior also means that having supervisor
xfeatures in __state_perm is correctly handled by all users.

Dropping supervisor permissions also creates another landmine for KVM. If
more dynamic user xfeatures are ever added, requesting access to multiple
xfeatures in separate ARCH_REQ_XCOMP_GUEST_PERM calls will result in the
second invocation of __xstate_request_perm() computing the wrong ksize, as
the mask passed to xstate_calculate_size() would not contain *any*
supervisor features.

Commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE
permissions") fudged around the size issue for userspace FPUs, but for
reasons unknown skipped guest FPUs. Lack of a fix for KVM "works" only
because KVM doesn't yet support virtualizing features that have supervisor
xfeatures, i.e. as of today, KVM guest FPUs will never need the relevant
xfeatures.

Simply extending the hack-a-fix for guests would temporarily solve the
ksize issue, but wouldn't address the inconsistency issue and would leave
another lurking pitfall for KVM. KVM support for virtualizing CET will
likely add CET_KERNEL as a guest-only xfeature, i.e. CET_KERNEL will not
be set in xfeatures_mask_supervisor() and would again be dropped when
granting access to dynamic xfeatures.

Note, the existing clobbering behavior is rather subtle. The @permitted
parameter to __xstate_request_perm() comes from:

permitted = xstate_get_group_perm(guest);

which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm,
where __state_perm is initialized to:

fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

and copied to the guest side of things:

/* Same defaults for guests */
fpu->guest_perm = fpu->perm;

fpu_kernel_cfg.default_features contains everything except the dynamic
xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:

fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;

When __xstate_request_perm() restricts the local "mask" variable to
compute the user state size:

mask &= XFEATURE_MASK_USER_SUPPORTED;
usize = xstate_calculate_size(mask, false);

it subtly overwrites the target __state_perm with "mask" containing only
user xfeatures:

perm = guest ? &fpu->guest_perm : &fpu->perm;
/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
WRITE_ONCE(perm->__state_perm, mask);

Cc: Maxim Levitsky <[email protected]>
Cc: Weijiang Yang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Chao Gao <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: John Allen <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 117e74c44e75..07911532b108 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
if ((permitted & requested) == requested)
return 0;

- /* Calculate the resulting kernel state size */
+ /*
+ * Calculate the resulting kernel state size. Note, @permitted also
+ * contains supervisor xfeatures even though supervisor are always
+ * permitted for kernel and guest FPUs, and never permitted for user
+ * FPUs.
+ */
mask = permitted | requested;
- /* Take supervisor states into account on the host */
- if (!guest)
- mask |= xfeatures_mask_supervisor();
ksize = xstate_calculate_size(mask, compacted);

- /* Calculate the resulting user state size */
- mask &= XFEATURE_MASK_USER_SUPPORTED;
- usize = xstate_calculate_size(mask, false);
+ /*
+ * Calculate the resulting user state size. Take care not to clobber
+ * the supervisor xfeatures in the new mask!
+ */
+ usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);

if (!guest) {
ret = validate_sigaltstack(usize);
--
2.43.0


2024-02-19 07:57:58

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 04/27] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features
that can be optionally enabled by kernel components. This is similar to
XFEATURE_MASK_USER_DYNAMIC in that it contains optional xfeatures that
allow the FPU buffer to be dynamically sized. The difference is that
the KERNEL variant contains supervisor features and will be enabled by
kernel components that need them, and not directly by the user. Currently
it's used by KVM to configure guest dedicated fpstate for calculating
the xfeature and fpstate storage size etc.

The kernel dynamic xfeatures currently contain only XFEATURE_CET_KERNEL.
The host can enable it in the kernel XSS MSR setting, but the relevant
CPU feature, i.e., supervisor shadow stack, is not enabled in the host
kernel, so the xfeature can be omitted from normal fpstate by default.

Remove the kernel dynamic feature from fpu_kernel_cfg.default_features
so that the bits in xstate_bv and xcomp_bv are cleared and xsaves/xrstors
can be optimized by HW for normal fpstate.

Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
---
arch/x86/include/asm/fpu/xstate.h | 5 ++++-
arch/x86/kernel/fpu/xstate.c | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 3b4a038d3c57..a212d3851429 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -46,9 +46,12 @@
#define XFEATURE_MASK_USER_RESTORE \
(XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)

-/* Features which are dynamically enabled for a process on request */
+/* Features which are dynamically enabled per userspace request */
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA

+/* Features which are dynamically enabled per kernel side request */
+#define XFEATURE_MASK_KERNEL_DYNAMIC XFEATURE_MASK_CET_KERNEL
+
/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
XFEATURE_MASK_CET_USER | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 03e166a87d61..ca4b83c142eb 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -824,6 +824,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
/* Clean out dynamic features from default */
fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+ fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;

fpu_user_cfg.default_features = fpu_user_cfg.max_features;
fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
--
2.43.0


2024-02-19 07:58:18

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 25/27] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

Per the SDM description (Vol.3D, Appendix A.1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector"

Modify the has_error_code check before injecting events to the nested
guest: delivering an error code is never valid when the guest is in real
mode or the event is not a hardware exception; for protected-mode
hardware exceptions, enforce the vector-based consistency check only if
the vCPU doesn't enumerate bit 56 in VMX_BASIC, and skip it otherwise,
making the logic consistent with the SDM.
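
The resulting rules can be summarized as (sketch, mirroring the code
below):

/*
 * real mode, or event is not a hardware exception:
 *         an error code is never valid
 * protected-mode hardware exception:
 *         VMX_BASIC[56] == 0 -> error code must match the vector
 *         VMX_BASIC[56] == 1 -> any combination is allowed
 */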

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 27 ++++++++++++++++++---------
arch/x86/kvm/vmx/nested.h | 5 +++++
2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 4be0078ca713..0439208523b8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1230,9 +1230,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
{
const u64 feature_and_reserved =
/* feature (except bit 48; see below) */
- BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
+ BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
/* reserved */
- BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
+ BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
u64 vmx_basic = vmcs_config.nested.basic;

if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
@@ -2865,7 +2865,6 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
u8 vector = intr_info & INTR_INFO_VECTOR_MASK;
u32 intr_type = intr_info & INTR_INFO_INTR_TYPE_MASK;
bool has_error_code = intr_info & INTR_INFO_DELIVER_CODE_MASK;
- bool should_have_error_code;
bool urg = nested_cpu_has2(vmcs12,
SECONDARY_EXEC_UNRESTRICTED_GUEST);
bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
@@ -2882,12 +2881,20 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
return -EINVAL;

- /* VM-entry interruption-info field: deliver error code */
- should_have_error_code =
- intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
- x86_exception_has_error_code(vector);
- if (CC(has_error_code != should_have_error_code))
- return -EINVAL;
+ /*
+ * Cannot deliver error code in real mode or if the interrupt
+ * type is not hardware exception. For other cases, do the
+ * consistency check only if the vCPU doesn't enumerate
+ * VMX_BASIC_NO_HW_ERROR_CODE_CC.
+ */
+ if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
+ if (CC(has_error_code))
+ return -EINVAL;
+ } else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
+ if (CC(has_error_code !=
+ x86_exception_has_error_code(vector)))
+ return -EINVAL;
+ }

/* VM-entry exception error code */
if (CC(has_error_code &&
@@ -7011,6 +7018,8 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)

if (cpu_has_vmx_basic_inout())
msrs->basic |= VMX_BASIC_INOUT;
+ if (cpu_has_vmx_basic_no_hw_errcode())
+ msrs->basic |= VMX_BASIC_NO_HW_ERROR_CODE_CC;
}

static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index cce4e2aa30fb..747061c2aeb9 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -285,6 +285,11 @@ static inline bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
__kvm_is_valid_cr4(vcpu, val);
}

+static inline bool nested_cpu_has_no_hw_errcode_cc(struct kvm_vcpu *vcpu)
+{
+ return to_vmx(vcpu)->nested.msrs.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
/* No difference in the restrictions on guest and host CR4 in VMX operation. */
#define nested_guest_cr4_valid nested_cr4_valid
#define nested_host_cr4_valid nested_cr4_valid
--
2.43.0


2024-02-19 07:58:37

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 23/27] KVM: VMX: Set host constant supervisor states to VMCS fields

Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields
explicitly. Kernel IBT is supported and the setting in MSR_IA32_S_CET is
static after boot (the exception is the BIOS call case, but a vCPU thread
never crosses it), so KVM doesn't need to refresh the HOST_S_CET field
before every VM-Enter/VM-Exit sequence.

Host supervisor shadow stack is not enabled for now and SSP is not
accessible in kernel mode, thus it's safe to set the host
IA32_INT_SSP_TAB/SSP VMCS fields to 0. When shadow stack is enabled for
CPL3, SSP is reloaded from PL3_SSP before the thread exits to userspace.
Check SDM Vol 2A/B Chapter 3/4 for SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/
CALL etc.

Prevent KVM module loading if host supervisor shadow stack, i.e.,
SHSTK_EN set in MSR_IA32_S_CET, is enabled, as KVM cannot co-exist with
it correctly.

Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
---
arch/x86/kvm/vmx/capabilities.h | 4 ++++
arch/x86/kvm/vmx/vmx.c | 15 +++++++++++++++
arch/x86/kvm/x86.c | 14 ++++++++++++++
arch/x86/kvm/x86.h | 1 +
4 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..ee8938818c8a 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -106,6 +106,10 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
}

+static inline bool cpu_has_load_cet_ctrl(void)
+{
+ return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_CET_STATE);
+}
static inline bool cpu_has_vmx_mpx(void)
{
return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 24e921c4e7e3..342b5b94c892 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4371,6 +4371,21 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)

if (cpu_has_load_ia32_efer())
vmcs_write64(HOST_IA32_EFER, host_efer);
+
+ /*
+ * Supervisor shadow stack is not enabled on the host side, i.e.,
+ * the host IA32_S_CET.SHSTK_EN bit is guaranteed to be 0. Per the SDM
+ * description (RDSSP instruction), SSP is not readable in CPL0,
+ * so resetting the two registers to 0s at VM-Exit does no harm
+ * to kernel execution. When execution flow exits to userspace,
+ * SSP is reloaded from IA32_PL3_SSP. Check SDM Vol.2A/B Chapter
+ * 3 and 4 for details.
+ */
+ if (cpu_has_load_cet_ctrl()) {
+ vmcs_writel(HOST_S_CET, host_s_cet);
+ vmcs_writel(HOST_SSP, 0);
+ vmcs_writel(HOST_INTR_SSP_TABLE, 0);
+ }
}

void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 281c3fe728c5..73a55d388dd9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -114,6 +114,8 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
#endif

static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
+u64 __read_mostly host_s_cet;
+EXPORT_SYMBOL_GPL(host_s_cet);

#define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_MAP_GPA_RANGE)

@@ -9862,6 +9864,18 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
return -EIO;
}

+ if (boot_cpu_has(X86_FEATURE_SHSTK)) {
+ rdmsrl(MSR_IA32_S_CET, host_s_cet);
+ /*
+ * Linux doesn't yet support supervisor shadow stacks (SSS), so
+ * KVM doesn't save/restore the associated MSRs, i.e. KVM may
+ * clobber the host values. Yell and refuse to load if SSS is
+ * unexpectedly enabled, e.g. to avoid crashing the host.
+ */
+ if (WARN_ON_ONCE(host_s_cet & CET_SHSTK_EN))
+ return -EIO;
+ }
+
x86_emulator_cache = kvm_alloc_emulator_cache();
if (!x86_emulator_cache) {
pr_err("failed to allocate cache for x86 emulator\n");
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9c19dfb5011d..656107e64c93 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -325,6 +325,7 @@ fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
extern u64 host_xcr0;
extern u64 host_xss;
extern u64 host_arch_capabilities;
+extern u64 host_s_cet;

extern struct kvm_caps kvm_caps;

--
2.43.0


2024-02-19 07:59:07

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 07/27] x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal fpstate

Kernel dynamic xfeatures are now __ONLY__ enabled for guest fpstate, i.e.,
never for normal kernel fpstate. The bits are added when guest FPU config
is initialized. Guest fpstate is allocated with fpstate->is_guest set to
%true.

For normal fpstate, the bits should have been removed when the kernel FPU
config settings were initialized, so WARN_ONCE() if the kernel detects
that normal fpstate xfeatures contain kernel dynamic xfeatures before
executing XSAVES.

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kernel/fpu/xstate.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 3518fb26d06b..83ebf1e1cbb4 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -185,6 +185,9 @@ static inline void os_xsave(struct fpstate *fpstate)
WARN_ON_FPU(!alternatives_patched);
xfd_validate_state(fpstate, mask, false);

+ WARN_ON_FPU(!fpstate->is_guest &&
+ (mask & XFEATURE_MASK_KERNEL_DYNAMIC));
+
XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);

/* We should never fault when copying to a kernel buffer: */
--
2.43.0


2024-02-19 07:59:29

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 27/27] KVM: x86: Don't emulate instructions guarded by CET

Don't emulate branch instructions, e.g., CALL/RET/JMP etc., when CET is
active in the guest; return KVM_INTERNAL_ERROR_EMULATION to userspace to
handle it.

KVM doesn't emulate the CPU behaviors that enforce CET protections while
emulating guest instructions; instead it stops emulation upon detecting
that the instruction being processed is CET-protected. By doing so, it
avoids generating a bogus #CP in the guest and prevents the guest from
subverting CET-protected execution flows via the emulator.
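
From userspace's point of view, the failure surfaces as an emulation
error exit. A minimal handling sketch, assuming run points at the
mmap()ed struct kvm_run of the vCPU:

if (run->exit_reason == KVM_EXIT_INTERNAL_ERROR &&
    run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
        /* e.g., KVM refused to emulate CALL/RET/JMP with CET active */
        fprintf(stderr, "emulation failed, possibly CET-protected insn\n");
}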

Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/emulate.c | 46 ++++++++++++++++++++++++++++++++----------
1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8ccc17eb78ca..c18616d24ac9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -178,6 +178,8 @@
#define IncSP ((u64)1 << 54) /* SP is incremented before ModRM calc */
#define TwoMemOp ((u64)1 << 55) /* Instruction has two memory operand */
#define IsBranch ((u64)1 << 56) /* Instruction is considered a branch. */
+#define ShadowStack ((u64)1 << 57) /* Instruction protected by Shadow Stack. */
+#define IndirBrnTrk ((u64)1 << 58) /* Instruction protected by IBT. */

#define DstXacc (DstAccLo | SrcAccHi | SrcWrite)

@@ -4100,9 +4102,11 @@ static const struct opcode group4[] = {
static const struct opcode group5[] = {
F(DstMem | SrcNone | Lock, em_inc),
F(DstMem | SrcNone | Lock, em_dec),
- I(SrcMem | NearBranch | IsBranch, em_call_near_abs),
- I(SrcMemFAddr | ImplicitOps | IsBranch, em_call_far),
- I(SrcMem | NearBranch | IsBranch, em_jmp_abs),
+ I(SrcMem | NearBranch | IsBranch | ShadowStack | IndirBrnTrk,
+ em_call_near_abs),
+ I(SrcMemFAddr | ImplicitOps | IsBranch | ShadowStack | IndirBrnTrk,
+ em_call_far),
+ I(SrcMem | NearBranch | IsBranch | IndirBrnTrk, em_jmp_abs),
I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
I(SrcMem | Stack | TwoMemOp, em_push), D(Undefined),
};
@@ -4364,11 +4368,11 @@ static const struct opcode opcode_table[256] = {
/* 0xC8 - 0xCF */
I(Stack | SrcImmU16 | Src2ImmByte | IsBranch, em_enter),
I(Stack | IsBranch, em_leave),
- I(ImplicitOps | SrcImmU16 | IsBranch, em_ret_far_imm),
- I(ImplicitOps | IsBranch, em_ret_far),
- D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch, intn),
+ I(ImplicitOps | SrcImmU16 | IsBranch | ShadowStack, em_ret_far_imm),
+ I(ImplicitOps | IsBranch | ShadowStack, em_ret_far),
+ D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch | ShadowStack, intn),
D(ImplicitOps | No64 | IsBranch),
- II(ImplicitOps | IsBranch, em_iret, iret),
+ II(ImplicitOps | IsBranch | ShadowStack, em_iret, iret),
/* 0xD0 - 0xD7 */
G(Src2One | ByteOp, group2), G(Src2One, group2),
G(Src2CL | ByteOp, group2), G(Src2CL, group2),
@@ -4384,7 +4388,7 @@ static const struct opcode opcode_table[256] = {
I2bvIP(SrcImmUByte | DstAcc, em_in, in, check_perm_in),
I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
/* 0xE8 - 0xEF */
- I(SrcImm | NearBranch | IsBranch, em_call),
+ I(SrcImm | NearBranch | IsBranch | ShadowStack, em_call),
D(SrcImm | ImplicitOps | NearBranch | IsBranch),
I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
@@ -4403,7 +4407,8 @@ static const struct opcode opcode_table[256] = {
static const struct opcode twobyte_table[256] = {
/* 0x00 - 0x0F */
G(0, group6), GD(0, &group7), N, N,
- N, I(ImplicitOps | EmulateOnUD | IsBranch, em_syscall),
+ N, I(ImplicitOps | EmulateOnUD | IsBranch | ShadowStack | IndirBrnTrk,
+ em_syscall),
II(ImplicitOps | Priv, em_clts, clts), N,
DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
@@ -4434,8 +4439,9 @@ static const struct opcode twobyte_table[256] = {
IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
II(ImplicitOps | Priv, em_rdmsr, rdmsr),
IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
- I(ImplicitOps | EmulateOnUD | IsBranch, em_sysenter),
- I(ImplicitOps | Priv | EmulateOnUD | IsBranch, em_sysexit),
+ I(ImplicitOps | EmulateOnUD | IsBranch | ShadowStack | IndirBrnTrk,
+ em_sysenter),
+ I(ImplicitOps | Priv | EmulateOnUD | IsBranch | ShadowStack, em_sysexit),
N, N,
N, N, N, N, N, N, N, N,
/* 0x40 - 0x4F */
@@ -4973,6 +4979,24 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
if (ctxt->d == 0)
return EMULATION_FAILED;

+ if (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET) {
+ u64 u_cet, s_cet;
+ bool stop_em;
+
+ if (ctxt->ops->get_msr(ctxt, MSR_IA32_U_CET, &u_cet) ||
+ ctxt->ops->get_msr(ctxt, MSR_IA32_S_CET, &s_cet))
+ return EMULATION_FAILED;
+
+ stop_em = ((u_cet & CET_SHSTK_EN) || (s_cet & CET_SHSTK_EN)) &&
+ (opcode.flags & ShadowStack);
+
+ stop_em |= ((u_cet & CET_ENDBR_EN) || (s_cet & CET_ENDBR_EN)) &&
+ (opcode.flags & IndirBrnTrk);
+
+ if (stop_em)
+ return EMULATION_FAILED;
+ }
+
ctxt->execute = opcode.u.execute;

if (unlikely(emulation_type & EMULTYPE_TRAP_UD) &&
--
2.43.0


2024-02-19 07:59:35

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 26/27] KVM: nVMX: Enable CET support for nested guest

Set up CET MSRs, the related VM_ENTRY/EXIT control bits and the fixed
CR4 setting to enable CET for nested VMs.

vmcs12 and vmcs02 need to be synced when L2 exits to L1 and when L1
resumes L2 so that the correct CET states are observed by each.
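
For example (a sketch of L1 hypervisor code, not KVM code), L1 can now
request that L2's CET context be loaded on VM-entry; vmread()/vmwrite()
stand in for the VMREAD/VMWRITE instructions and l2_* are values L1
tracks for L2:

vmwrite(VM_ENTRY_CONTROLS,
        vmread(VM_ENTRY_CONTROLS) | VM_ENTRY_LOAD_CET_STATE);
vmwrite(GUEST_S_CET, l2_s_cet);
vmwrite(GUEST_SSP, l2_ssp);
vmwrite(GUEST_INTR_SSP_TABLE, l2_ssp_tbl);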

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 80 ++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmcs12.c | 6 +++
arch/x86/kvm/vmx/vmcs12.h | 14 ++++++-
arch/x86/kvm/vmx/vmx.c | 2 +
arch/x86/kvm/vmx/vmx.h | 3 ++
5 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0439208523b8..d0311260270c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -691,6 +691,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_FLUSH_CMD, MSR_TYPE_W);

+ /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_U_CET, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_S_CET, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL0_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL1_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL2_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL3_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
+
kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);

vmx->nested.force_msr_bitmap_recalc = false;
@@ -2438,6 +2460,30 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
}
}

+static inline void cet_vmcs_fields_get(struct kvm_vcpu *vcpu, u64 *ssp,
+ u64 *s_cet, u64 *ssp_tbl)
+{
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
+ *ssp = vmcs_readl(GUEST_SSP);
+ *s_cet = vmcs_readl(GUEST_S_CET);
+ *ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
+ } else if (guest_can_use(vcpu, X86_FEATURE_IBT)) {
+ *s_cet = vmcs_readl(GUEST_S_CET);
+ }
+}
+
+static inline void cet_vmcs_fields_put(struct kvm_vcpu *vcpu, u64 ssp,
+ u64 s_cet, u64 ssp_tbl)
+{
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
+ vmcs_writel(GUEST_SSP, ssp);
+ vmcs_writel(GUEST_S_CET, s_cet);
+ vmcs_writel(GUEST_INTR_SSP_TABLE, ssp_tbl);
+ } else if (guest_can_use(vcpu, X86_FEATURE_IBT)) {
+ vmcs_writel(GUEST_S_CET, s_cet);
+ }
+}
+
static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
{
struct hv_enlightened_vmcs *hv_evmcs = nested_vmx_evmcs(vmx);
@@ -2553,6 +2599,11 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);

+ if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)
+ cet_vmcs_fields_put(&vmx->vcpu, vmcs12->guest_ssp,
+ vmcs12->guest_s_cet,
+ vmcs12->guest_ssp_tbl);
+
set_cr4_guest_host_mask(vmx);
}

@@ -2591,6 +2642,13 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
}
+
+ if (!vmx->nested.nested_run_pending ||
+ !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE))
+ cet_vmcs_fields_put(vcpu, vmx->nested.pre_vmenter_ssp,
+ vmx->nested.pre_vmenter_s_cet,
+ vmx->nested.pre_vmenter_ssp_tbl);
+
if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs);
@@ -3471,6 +3529,12 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
vmx->nested.pre_vmenter_bndcfgs = vmcs_read64(GUEST_BNDCFGS);

+ if (!vmx->nested.nested_run_pending ||
+ !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE))
+ cet_vmcs_fields_get(vcpu, &vmx->nested.pre_vmenter_ssp,
+ &vmx->nested.pre_vmenter_s_cet,
+ &vmx->nested.pre_vmenter_ssp_tbl);
+
/*
* Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
* nested early checks are disabled. In the event of a "late" VM-Fail,
@@ -4294,6 +4358,9 @@ static bool is_vmcs12_ext_field(unsigned long field)
case GUEST_IDTR_BASE:
case GUEST_PENDING_DBG_EXCEPTIONS:
case GUEST_BNDCFGS:
+ case GUEST_SSP:
+ case GUEST_S_CET:
+ case GUEST_INTR_SSP_TABLE:
return true;
default:
break;
@@ -4344,6 +4411,10 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);

+ cet_vmcs_fields_get(&vmx->vcpu, &vmcs12->guest_ssp,
+ &vmcs12->guest_s_cet,
+ &vmcs12->guest_ssp_tbl);
+
vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
}

@@ -4569,6 +4640,10 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
vmcs_write64(GUEST_BNDCFGS, 0);

+ if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_CET_STATE)
+ cet_vmcs_fields_put(vcpu, vmcs12->host_ssp, vmcs12->host_s_cet,
+ vmcs12->host_ssp_tbl);
+
if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) {
vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
vcpu->arch.pat = vmcs12->host_ia32_pat;
@@ -6840,7 +6915,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
VM_EXIT_HOST_ADDR_SPACE_SIZE |
#endif
VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
- VM_EXIT_CLEAR_BNDCFGS;
+ VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
msrs->exit_ctls_high |=
VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6862,7 +6937,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
#ifdef CONFIG_X86_64
VM_ENTRY_IA32E_MODE |
#endif
- VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+ VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+ VM_ENTRY_LOAD_CET_STATE;
msrs->entry_ctls_high |=
(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..4233b5ca9461 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+ FIELD(GUEST_S_CET, guest_s_cet),
+ FIELD(GUEST_SSP, guest_ssp),
+ FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
FIELD(HOST_CR0, host_cr0),
FIELD(HOST_CR3, host_cr3),
FIELD(HOST_CR4, host_cr4),
@@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
FIELD(HOST_RSP, host_rsp),
FIELD(HOST_RIP, host_rip),
+ FIELD(HOST_S_CET, host_s_cet),
+ FIELD(HOST_SSP, host_ssp),
+ FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
};
const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 01936013428b..3884489e7f7e 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -117,7 +117,13 @@ struct __packed vmcs12 {
natural_width host_ia32_sysenter_eip;
natural_width host_rsp;
natural_width host_rip;
- natural_width paddingl[8]; /* room for future expansion */
+ natural_width host_s_cet;
+ natural_width host_ssp;
+ natural_width host_ssp_tbl;
+ natural_width guest_s_cet;
+ natural_width guest_ssp;
+ natural_width guest_ssp_tbl;
+ natural_width paddingl[2]; /* room for future expansion */
u32 pin_based_vm_exec_control;
u32 cpu_based_vm_exec_control;
u32 exception_bitmap;
@@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_ia32_sysenter_eip, 656);
CHECK_OFFSET(host_rsp, 664);
CHECK_OFFSET(host_rip, 672);
+ CHECK_OFFSET(host_s_cet, 680);
+ CHECK_OFFSET(host_ssp, 688);
+ CHECK_OFFSET(host_ssp_tbl, 696);
+ CHECK_OFFSET(guest_s_cet, 704);
+ CHECK_OFFSET(guest_ssp, 712);
+ CHECK_OFFSET(guest_ssp_tbl, 720);
CHECK_OFFSET(pin_based_vm_exec_control, 744);
CHECK_OFFSET(cpu_based_vm_exec_control, 748);
CHECK_OFFSET(exception_bitmap, 752);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9df25c9e80f5..210724d2151c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7727,6 +7727,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
+ cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
+ cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));

entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
cr4_fixed1_update(X86_CR4_LAM_SUP, eax, feature_bit(LAM));
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d0cad2624564..3c1de37728fe 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -224,6 +224,9 @@ struct nested_vmx {
*/
u64 pre_vmenter_debugctl;
u64 pre_vmenter_bndcfgs;
+ u64 pre_vmenter_ssp;
+ u64 pre_vmenter_s_cet;
+ u64 pre_vmenter_ssp_tbl;

/* to migrate it to L1 if L2 writes to L1's CR8 directly */
int l1_tpr_threshold;
--
2.43.0


2024-02-19 08:00:19

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 10/27] KVM: x86: Refine xsave-managed guest register/MSR reset handling

Tweak the code a bit to facilitate resetting more xstate components in
the future, e.g., CET's xstate-managed MSRs.
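
For instance (a sketch of the intended follow-up, not part of this
patch), adding the CET user state later reduces to extending the mask and
the capability switch:

#define XSTATE_NEED_RESET_MASK (XFEATURE_MASK_BNDREGS | \
                                XFEATURE_MASK_BNDCSR | \
                                XFEATURE_MASK_CET_USER)

/* ... plus a matching case in kvm_vcpu_has_xstate(): */
case XFEATURE_MASK_CET_USER:
        return kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
               kvm_cpu_cap_has(X86_FEATURE_IBT);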

No functional change intended.

Suggested-by: Chao Gao <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/x86.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 10847e1cc413..5a9c07751c0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12217,11 +12217,27 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
static_branch_dec(&kvm_has_noapic_vcpu);
}

+#define XSTATE_NEED_RESET_MASK (XFEATURE_MASK_BNDREGS | \
+ XFEATURE_MASK_BNDCSR)
+
+static bool kvm_vcpu_has_xstate(unsigned long xfeature)
+{
+ switch (xfeature) {
+ case XFEATURE_MASK_BNDREGS:
+ case XFEATURE_MASK_BNDCSR:
+ return kvm_cpu_cap_has(X86_FEATURE_MPX);
+ default:
+ return false;
+ }
+}
+
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct kvm_cpuid_entry2 *cpuid_0x1;
unsigned long old_cr0 = kvm_read_cr0(vcpu);
+ DECLARE_BITMAP(reset_mask, 64);
unsigned long new_cr0;
+ unsigned int i;

/*
* Several of the "set" flows, e.g. ->set_cr0(), read other registers
@@ -12274,7 +12290,12 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;

- if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
+ bitmap_from_u64(reset_mask, (kvm_caps.supported_xcr0 |
+ kvm_caps.supported_xss) &
+ XSTATE_NEED_RESET_MASK);
+
+ if (vcpu->arch.guest_fpu.fpstate &&
+ !bitmap_empty(reset_mask, XFEATURE_MAX)) {
struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;

/*
@@ -12284,8 +12305,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
if (init_event)
kvm_put_guest_fpu(vcpu);

- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
+ for_each_set_bit(i, reset_mask, XFEATURE_MAX) {
+ if (!kvm_vcpu_has_xstate(i))
+ continue;
+ fpstate_clear_xstate_component(fpstate, i);
+ }

if (init_event)
kvm_load_guest_fpu(vcpu);
--
2.43.0


2024-02-19 08:00:53

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 14/27] KVM: x86: Initialize kvm_caps.supported_xss

Set the initial kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS)
if XSAVES is supported. host_xss contains the host supported xstate
feature bits for thread FPU context switch, and KVM_SUPPORTED_XSS
includes all KVM enabled XSS feature bits; the resulting value represents
the supervisor xstates that are available to the guest and are backed by
the host FPU framework for swapping {guest,host} XSAVE-managed
registers/MSRs.
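
As an illustration (a sketch; a later patch in this series extends the
mask with the CET bits), the computation then becomes:

#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
                           XFEATURE_MASK_CET_KERNEL)

/* Only bits supported by both the host and KVM survive. */
kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;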

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
---
arch/x86/kvm/x86.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b502d68a2576..60b574fc04d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -226,6 +226,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)

+#define KVM_SUPPORTED_XSS 0
+
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);

@@ -9737,12 +9739,13 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
}
+ if (boot_cpu_has(X86_FEATURE_XSAVES)) {
+ rdmsrl(MSR_IA32_XSS, host_xss);
+ kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+ }

rdmsrl_safe(MSR_EFER, &host_efer);

- if (boot_cpu_has(X86_FEATURE_XSAVES))
- rdmsrl(MSR_IA32_XSS, host_xss);
-
kvm_init_pmu_capability(ops->pmu_ops);

if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
--
2.43.0


2024-02-19 08:05:50

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 22/27] KVM: VMX: Set up interception for CET MSRs

Enable/disable CET MSR interception per the associated feature
configuration. The Shadow Stack feature requires all CET MSRs to be
passed through to the guest to support it in both user and supervisor
mode, while the IBT feature only depends on MSR_IA32_{U,S}_CET to enable
user and supervisor IBT.

Note, this MSR design introduces an architectural limitation of SHSTK and
IBT control for the guest, i.e., when SHSTK is exposed, IBT is also
available to the guest from an architectural perspective since IBT relies
on a subset of the SHSTK-relevant MSRs.
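
The resulting interception matrix (sketch):

/*
 * guest CPUID             U_CET/S_CET     PL{0..3}_SSP/INT_SSP_TAB
 * SHSTK (w/ or w/o IBT)   pass-through    pass-through
 * IBT only                pass-through    intercepted
 * neither                 intercepted     intercepted
 */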

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 43 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ff2296fa7d39..24e921c4e7e3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -159,7 +159,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);

/*
* List of MSRs that can be directly passed to the guest.
- * In addition to these x2apic and PT MSRs are handled specially.
+ * In addition to these x2apic/PT/CET MSRs are handled specially.
*/
static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_IA32_SPEC_CTRL,
@@ -692,6 +692,10 @@ static bool is_valid_passthrough_msr(u32 msr)
case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
return true;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
+ return true;
}

r = possible_passthrough_msr_slot(msr) != -ENOENT;
@@ -7767,6 +7771,41 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}

+static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
+{
+ bool incpt;
+
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+ incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
+ MSR_TYPE_RW, incpt);
+ if (!incpt)
+ return;
+ }
+
+ if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
+ incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
+
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
+ MSR_TYPE_RW, incpt);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
+ MSR_TYPE_RW, incpt);
+ }
+}
+
static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7845,6 +7884,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)

/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
+
+ vmx_update_intercept_for_cet_msr(vcpu);
}

static u64 vmx_get_perf_capabilities(void)
--
2.43.0


2024-02-19 08:06:49

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 21/27] KVM: x86: Save and reload SSP to/from SMRAM

Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates the HW
architectural behavior when the guest enters/leaves SMM mode, i.e.,
registers are saved to SMRAM on SMM entry and reloaded on SMM exit. Per
the SDM, SSP is one such register on 64-bit architectures, so add support
for SSP. Note, on 32-bit architectures SSP is not defined in SMRAM, so
fail 32-bit CET guest launch.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/cpuid.c | 11 +++++++++++
arch/x86/kvm/smm.c | 8 ++++++++
arch/x86/kvm/smm.h | 2 +-
3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 2bb1931103ad..c0e13040e35b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -149,6 +149,17 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
return -EINVAL;
}
+ /*
+ * Prevent 32-bit guest launch if shadow stack is exposed as SSP
+ * state is not defined for 32-bit SMRAM.
+ */
+ best = cpuid_entry2_find(entries, nent, 0x80000001,
+ KVM_CPUID_INDEX_NOT_SIGNIFICANT);
+ if (best && !(best->edx & F(LM))) {
+ best = cpuid_entry2_find(entries, nent, 0x7, 0);
+ if (best && (best->ecx & F(SHSTK)))
+ return -EINVAL;
+ }

/*
* Exposing dynamic xfeatures to the guest requires additional
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index 45c855389ea7..7aac9c54c353 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);

smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
+
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
+ vcpu->kvm);
}
#endif

@@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
ctxt->interruptibility = (u8)smstate->int_shadow;

+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
+ vcpu->kvm);
+
return X86EMUL_CONTINUE;
}
#endif
diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..1e2a3e18207f 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
u32 smbase;
u32 reserved4[5];

- /* ssp and svm_* fields below are not implemented by KVM */
u64 ssp;
+ /* svm_* fields below are not implemented by KVM */
u64 svm_guest_pat;
u64 svm_host_efer;
u64 svm_host_cr4;
--
2.43.0


2024-02-19 08:07:56

by Yang, Weijiang

[permalink] [raw]
Subject: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

Expose CET features to the guest if KVM/host can support them, and clear
the CPUID feature bits otherwise.

Set the CPUID feature bits so that CET features are available in guest
CPUID. Add CR4.CET bit support in order to allow the guest to set the CET
master control bit.

Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
KVM does not support emulating CET.

The CET load-bits in the VM_ENTRY/VM_EXIT control fields should be set to
keep guest CET xstates isolated from the host's.

On platforms with VMX_BASIC[bit56] == 0, injecting #CP at VMX entry with
an error code will fail, while on platforms with VMX_BASIC[bit56] == 1,
#CP injection with or without an error code is allowed. Disable the CET
feature bits if the MSR bit is cleared so that a nested VMM can inject
#CP if and only if VMX_BASIC[bit56] == 1.

Don't expose CET feature if either of {U,S}_CET xstate bits is cleared
in host XSS or if XSAVES isn't supported.

CET MSR contents after reset, power-up and INIT are defined as 0, so
clear the guest fpstate fields so that the guest MSRs read as 0 after
these events.
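
Putting the pieces together, a guest kernel would turn CET on roughly as
follows (a sketch using guest-side Linux helpers; the exact flow in the
guest differs):

cr4_set_bits(X86_CR4_CET);              /* CET master enable, now permitted */
wrmsrl(MSR_IA32_S_CET, CET_ENDBR_EN);   /* kernel IBT */
wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);   /* user shadow stack, per task */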

Signed-off-by: Yang Weijiang <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kvm/cpuid.c | 25 ++++++++++++++++++++-----
arch/x86/kvm/vmx/capabilities.h | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 30 +++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 ++++--
arch/x86/kvm/x86.c | 26 ++++++++++++++++++++++++--
arch/x86/kvm/x86.h | 3 +++
8 files changed, 88 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 79f7c18c487b..3b263fa171a1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -134,7 +134,7 @@
| X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
| X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
| X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
- | X86_CR4_LAM_SUP))
+ | X86_CR4_LAM_SUP | X86_CR4_CET))

#define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f1bd7b91b3c6..4aa9aaa295f0 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1110,6 +1110,7 @@
#define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU
#define VMX_BASIC_MEM_TYPE_WB 6LLU
#define VMX_BASIC_INOUT 0x0040000000000000LLU
+#define VMX_BASIC_NO_HW_ERROR_CODE_CC 0x0100000000000000LLU

/* Resctrl MSRs: */
/* - Intel: */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c0e13040e35b..d37f41472043 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -150,14 +150,14 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
return -EINVAL;
}
/*
- * Prevent 32-bit guest launch if shadow stack is exposed as SSP
- * state is not defined for 32-bit SMRAM.
+ * CET is not supported for 32-bit guest, prevent guest launch if
+ * shadow stack or IBT is enabled for 32-bit guest.
*/
best = cpuid_entry2_find(entries, nent, 0x80000001,
KVM_CPUID_INDEX_NOT_SIGNIFICANT);
if (best && !(best->edx & F(LM))) {
best = cpuid_entry2_find(entries, nent, 0x7, 0);
- if (best && (best->ecx & F(SHSTK)))
+ if (best && ((best->ecx & F(SHSTK)) || (best->edx & F(IBT))))
return -EINVAL;
}

@@ -665,7 +665,7 @@ void kvm_set_cpu_caps(void)
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
- F(SGX_LC) | F(BUS_LOCK_DETECT)
+ F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
);
/* Set LA57 based on hardware capability. */
if (cpuid_ecx(7) & F(LA57))
@@ -683,7 +683,8 @@ void kvm_set_cpu_caps(void)
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
- F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
+ F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
+ F(IBT)
);

/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
+ /*
+ * Don't use boot_cpu_has() to check availability of IBT because the
+ * feature bit is cleared in boot_cpu_data when ibt=off is applied
+ * in host cmdline.
+ *
+ * As currently there's no HW bug which requires disabling IBT feature
+ * while CPU can enumerate it, host cmdline option ibt=off is most
+ * likely due to administrative reason on host side, so KVM refers to
+ * CPU CPUID enumeration to enable the feature. In the future, if
+ * there's actually a bug for which ibt=off is applied, then enforce
+ * an additional check here to disable the support in KVM.
+ */
+ if (cpuid_edx(7) & F(IBT))
+ kvm_cpu_cap_set(X86_FEATURE_IBT);

kvm_cpu_cap_mask(CPUID_7_1_EAX,
F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index ee8938818c8a..e12bc233d88b 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -79,6 +79,12 @@ static inline bool cpu_has_vmx_basic_inout(void)
return (((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
}

+static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
+{
+ return ((u64)vmcs_config.basic_cap << 32) &
+ VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
static inline bool cpu_has_virtual_nmis(void)
{
return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 342b5b94c892..9df25c9e80f5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2609,6 +2609,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
{ VM_ENTRY_LOAD_IA32_EFER, VM_EXIT_LOAD_IA32_EFER },
{ VM_ENTRY_LOAD_BNDCFGS, VM_EXIT_CLEAR_BNDCFGS },
{ VM_ENTRY_LOAD_IA32_RTIT_CTL, VM_EXIT_CLEAR_IA32_RTIT_CTL },
+ { VM_ENTRY_LOAD_CET_STATE, VM_EXIT_LOAD_CET_STATE },
};

memset(vmcs_conf, 0, sizeof(*vmcs_conf));
@@ -4934,6 +4935,14 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)

vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */

+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+ vmcs_writel(GUEST_SSP, 0);
+ vmcs_writel(GUEST_S_CET, 0);
+ vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
+ } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
+ vmcs_writel(GUEST_S_CET, 0);
+ }
+
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);

vpid_sync_context(vmx->vpid);
@@ -6353,6 +6362,10 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);

+ if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE)
+ pr_err("S_CET = 0x%016lx, SSP = 0x%016lx, SSP TABLE = 0x%016lx\n",
+ vmcs_readl(GUEST_S_CET), vmcs_readl(GUEST_SSP),
+ vmcs_readl(GUEST_INTR_SSP_TABLE));
pr_err("*** Host State ***\n");
pr_err("RIP = 0x%016lx RSP = 0x%016lx\n",
vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
@@ -6383,6 +6396,10 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
if (vmcs_read32(VM_EXIT_MSR_LOAD_COUNT) > 0)
vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
+ if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE)
+ pr_err("S_CET = 0x%016lx, SSP = 0x%016lx, SSP TABLE = 0x%016lx\n",
+ vmcs_readl(HOST_S_CET), vmcs_readl(HOST_SSP),
+ vmcs_readl(HOST_INTR_SSP_TABLE));

pr_err("*** Control State ***\n");
pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
@@ -7965,7 +7982,6 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_set(X86_FEATURE_UMIP);

/* CPUID 0xD.1 */
- kvm_caps.supported_xss = 0;
if (!cpu_has_vmx_xsaves())
kvm_cpu_cap_clear(X86_FEATURE_XSAVES);

@@ -7977,6 +7993,18 @@ static __init void vmx_set_cpu_caps(void)

if (cpu_has_vmx_waitpkg())
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
+
+ /*
+ * Disable CET if unrestricted_guest is unsupported as KVM doesn't
+ * enforce CET HW behaviors in the emulator. On platforms with
+ * VMX_BASIC[bit56] == 0, injecting #CP with an error code at VM-Entry
+ * fails, so disable CET in this case too.
+ */
+ if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
+ !cpu_has_vmx_basic_no_hw_errcode()) {
+ kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+ kvm_cpu_cap_clear(X86_FEATURE_IBT);
+ }
}

static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e3b0985bb74a..d0cad2624564 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -484,7 +484,8 @@ static inline u8 vmx_get_rvi(void)
VM_ENTRY_LOAD_IA32_EFER | \
VM_ENTRY_LOAD_BNDCFGS | \
VM_ENTRY_PT_CONCEAL_PIP | \
- VM_ENTRY_LOAD_IA32_RTIT_CTL)
+ VM_ENTRY_LOAD_IA32_RTIT_CTL | \
+ VM_ENTRY_LOAD_CET_STATE)

#define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
(VM_EXIT_SAVE_DEBUG_CONTROLS | \
@@ -506,7 +507,8 @@ static inline u8 vmx_get_rvi(void)
VM_EXIT_LOAD_IA32_EFER | \
VM_EXIT_CLEAR_BNDCFGS | \
VM_EXIT_PT_CONCEAL_PIP | \
- VM_EXIT_CLEAR_IA32_RTIT_CTL)
+ VM_EXIT_CLEAR_IA32_RTIT_CTL | \
+ VM_EXIT_LOAD_CET_STATE)

#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
(PIN_BASED_EXT_INTR_MASK | \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 73a55d388dd9..cd656099fbfd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)

-#define KVM_SUPPORTED_XSS 0
+#define KVM_SUPPORTED_XSS (XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)

u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);
@@ -9943,6 +9944,20 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
kvm_caps.supported_xss = 0;

+ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+ !kvm_cpu_cap_has(X86_FEATURE_IBT))
+ kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
+ XFEATURE_MASK_CET_KERNEL);
+
+ if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
+ XFEATURE_MASK_CET_KERNEL)) !=
+ (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
+ kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+ kvm_cpu_cap_clear(X86_FEATURE_IBT);
+ kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
+ XFEATURE_MASK_CET_KERNEL);
+ }
+
#define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
#undef __kvm_cpu_cap_has
@@ -12402,7 +12417,9 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
}

#define XSTATE_NEED_RESET_MASK (XFEATURE_MASK_BNDREGS | \
- XFEATURE_MASK_BNDCSR)
+ XFEATURE_MASK_BNDCSR | \
+ XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)

static bool kvm_vcpu_has_xstate(unsigned long xfeature)
{
@@ -12410,6 +12427,11 @@ static bool kvm_vcpu_has_xstate(unsigned long xfeature)
case XFEATURE_MASK_BNDREGS:
case XFEATURE_MASK_BNDCSR:
return kvm_cpu_cap_has(X86_FEATURE_MPX);
+ case XFEATURE_CET_USER:
+ return kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
+ kvm_cpu_cap_has(X86_FEATURE_IBT);
+ case XFEATURE_CET_KERNEL:
+ return kvm_cpu_cap_has(X86_FEATURE_SHSTK);
default:
return false;
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 656107e64c93..cc585051d24b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -533,6 +533,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
__reserved_bits |= X86_CR4_PCIDE; \
if (!__cpu_has(__c, X86_FEATURE_LAM)) \
__reserved_bits |= X86_CR4_LAM_SUP; \
+ if (!__cpu_has(__c, X86_FEATURE_SHSTK) && \
+ !__cpu_has(__c, X86_FEATURE_IBT)) \
+ __reserved_bits |= X86_CR4_CET; \
__reserved_bits; \
})

--
2.43.0


2024-02-20 03:05:00

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v10 10/27] KVM: x86: Refine xsave-managed guest register/MSR reset handling

On Sun, Feb 18, 2024 at 11:47:16PM -0800, Yang Weijiang wrote:
>Tweak the code a bit to facilitate resetting more xstate components in
>the future, e.g., CET's xstate-managed MSRs.
>

>No functional change intended.

Strictly speaking, there is a functional change. In the previous logic, if
either BNDCSR or BNDREGS state is not supported (kvm_mpx_supported() will
return false), KVM won't reset either of them. Since this behavior changes,
I vote to drop 'No functional change ...'

>
>Suggested-by: Chao Gao <[email protected]>
>Signed-off-by: Yang Weijiang <[email protected]>
>---
> arch/x86/kvm/x86.c | 30 +++++++++++++++++++++++++++---
> 1 file changed, 27 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 10847e1cc413..5a9c07751c0e 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -12217,11 +12217,27 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> static_branch_dec(&kvm_has_noapic_vcpu);
> }
>
>+#define XSTATE_NEED_RESET_MASK (XFEATURE_MASK_BNDREGS | \
>+ XFEATURE_MASK_BNDCSR)
>+
>+static bool kvm_vcpu_has_xstate(unsigned long xfeature)

kvm_vcpu_has_xstate is a misnomer because it doesn't take a vCPU.

>+{
>+ switch (xfeature) {
>+ case XFEATURE_MASK_BNDREGS:
>+ case XFEATURE_MASK_BNDCSR:
>+ return kvm_cpu_cap_has(X86_FEATURE_MPX);
>+ default:
>+ return false;
>+ }
>+}
>+
> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> struct kvm_cpuid_entry2 *cpuid_0x1;
> unsigned long old_cr0 = kvm_read_cr0(vcpu);
>+ DECLARE_BITMAP(reset_mask, 64);
> unsigned long new_cr0;
>+ unsigned int i;
>
> /*
> * Several of the "set" flows, e.g. ->set_cr0(), read other registers
>@@ -12274,7 +12290,12 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> kvm_async_pf_hash_reset(vcpu);
> vcpu->arch.apf.halted = false;
>
>- if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
>+ bitmap_from_u64(reset_mask, (kvm_caps.supported_xcr0 |
>+ kvm_caps.supported_xss) &
>+ XSTATE_NEED_RESET_MASK);
>+
>+ if (vcpu->arch.guest_fpu.fpstate &&
>+ !bitmap_empty(reset_mask, XFEATURE_MAX)) {
> struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
>
> /*
>@@ -12284,8 +12305,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> if (init_event)
> kvm_put_guest_fpu(vcpu);
>
>- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
>- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
>+ for_each_set_bit(i, reset_mask, XFEATURE_MAX) {
>+ if (!kvm_vcpu_has_xstate(i))
>+ continue;

The kvm_vcpu_has_xstate() check is superfluous because @i is derived from
kvm_caps.supported_xcr0/xss, which already guarantees that all unsupported
xfeatures are filtered out.

I recommend dropping this check. W/ this change,

Reviewed-by: Chao Gao <[email protected]>

>+ fpstate_clear_xstate_component(fpstate, i);
>+ }
>
> if (init_event)
> kvm_load_guest_fpu(vcpu);
>--
>2.43.0
>

2024-02-20 08:52:31

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v10 13/27] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

On Sun, Feb 18, 2024 at 11:47:19PM -0800, Yang Weijiang wrote:
>Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the current required xstate size
>whenever the XSS MSR is modified.
>CPUID(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
>xstate features in (XCR0 | IA32_XSS). The guest can use this CPUID value to
>allocate a sufficiently large xsave buffer.
>
>Note, KVM does not yet support any XSS based features, i.e. supported_xss
>is guaranteed to be zero at this time.
>
>Opportunistically modify the XSS write access logic as follows: if XSAVES
>is not enabled in guest CPUID, forbid setting the IA32_XSS MSR to anything
>but 0, even if the write is host initiated.
>
>Suggested-by: Sean Christopherson <[email protected]>
>Co-developed-by: Zhang Yi Z <[email protected]>
>Signed-off-by: Zhang Yi Z <[email protected]>
>Signed-off-by: Yang Weijiang <[email protected]>
>Reviewed-by: Maxim Levitsky <[email protected]>

Reviewed-by: Chao Gao <[email protected]>
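
For concreteness, a minimal sketch of the refresh described in the changelog
above, assuming the existing kvm_update_cpuid_runtime() helper and the usual
kvm_set_msr_common() structure (illustrative only, not the patch itself):

	case MSR_IA32_XSS:
		/* Reject bits KVM doesn't virtualize. */
		if (data & ~kvm_caps.supported_xss)
			return 1;
		vcpu->arch.ia32_xss = data;
		/*
		 * Refresh CPUID.(EAX=0xD,ECX=1).EBX so the guest sees the
		 * XSAVES buffer size for the now-enabled (XCR0 | XSS) set.
		 */
		kvm_update_cpuid_runtime(vcpu);
		break;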

2024-02-20 13:23:34

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 10/27] KVM: x86: Refine xsave-managed guest register/MSR reset handling

On 2/20/2024 11:04 AM, Chao Gao wrote:
> On Sun, Feb 18, 2024 at 11:47:16PM -0800, Yang Weijiang wrote:
>> Tweak the code a bit to facilitate resetting more xstate components in
>> the future, e.g., CET's xstate-managed MSRs.
>>
>> No functional change intended.
> Strictly speaking, there is a functional change. In the previous logic, if
> either BNDCSR or BNDREGS state is not supported (kvm_mpx_supported() will
> return false), KVM won't reset either of them. Since this behavior changes,
> I vote to drop 'No functional change ...'

Yes, I'll remove it since the existing logic is slightly changed.

>> Suggested-by: Chao Gao <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 30 +++++++++++++++++++++++++++---
>> 1 file changed, 27 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 10847e1cc413..5a9c07751c0e 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -12217,11 +12217,27 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>> static_branch_dec(&kvm_has_noapic_vcpu);
>> }
>>
>> +#define XSTATE_NEED_RESET_MASK (XFEATURE_MASK_BNDREGS | \
>> + XFEATURE_MASK_BNDCSR)
>> +
>> +static bool kvm_vcpu_has_xstate(unsigned long xfeature)
> kvm_vcpu_has_xstate is a misnomer because it doesn't take a vCPU.

True, I'll change it, thanks!

>> +{
>> + switch (xfeature) {
>> + case XFEATURE_MASK_BNDREGS:
>> + case XFEATURE_MASK_BNDCSR:
>> + return kvm_cpu_cap_has(X86_FEATURE_MPX);
>> + default:
>> + return false;
>> + }
>> +}
>> +
>> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> {
>> struct kvm_cpuid_entry2 *cpuid_0x1;
>> unsigned long old_cr0 = kvm_read_cr0(vcpu);
>> + DECLARE_BITMAP(reset_mask, 64);
>> unsigned long new_cr0;
>> + unsigned int i;
>>
>> /*
>> * Several of the "set" flows, e.g. ->set_cr0(), read other registers
>> @@ -12274,7 +12290,12 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> kvm_async_pf_hash_reset(vcpu);
>> vcpu->arch.apf.halted = false;
>>
>> - if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
>> + bitmap_from_u64(reset_mask, (kvm_caps.supported_xcr0 |
>> + kvm_caps.supported_xss) &
>> + XSTATE_NEED_RESET_MASK);
>> +
>> + if (vcpu->arch.guest_fpu.fpstate &&
>> + !bitmap_empty(reset_mask, XFEATURE_MAX)) {
>> struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
>>
>> /*
>> @@ -12284,8 +12305,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> if (init_event)
>> kvm_put_guest_fpu(vcpu);
>>
>> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
>> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
>> + for_each_set_bit(i, reset_mask, XFEATURE_MAX) {
>> + if (!kvm_vcpu_has_xstate(i))
>> + continue;
> The kvm_vcpu_has_xstate() check is superfluous because @i is derived from
> kvm_caps.supported_xcr0/xss, which already guarantees that all unsupported
> xfeatures are filtered out.

Yeah, at least currently I can skip the check for CET/MPX features; I'll remove it, thanks! (The reduced loop is sketched after the quoted patch below.)

>
> I recommend dropping this check. W/ this change,
>
> Reviewed-by: Chao Gao <[email protected]>
>
>> + fpstate_clear_xstate_component(fpstate, i);
>> + }
>>
>> if (init_event)
>> kvm_load_guest_fpu(vcpu);
>> --
>> 2.43.0
>>
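
With the superfluous check dropped as agreed above, the INIT reset loop
reduces to a direct iteration (a sketch of the agreed result only, reusing
the names from the patch):

	for_each_set_bit(i, reset_mask, XFEATURE_MAX)
		fpstate_clear_xstate_component(fpstate, i);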


2024-03-06 14:44:57

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 00/27] Enable CET Virtualization

Hi, Dave,

Could you kindly review the kernel patches(patch 1-7) at your convenience?
Rick has added RB tags on these patches, so I'd get your opinions on them.

Thanks a lot!

On 2/19/2024 3:47 PM, Yang Weijiang wrote:
> [...]
>
> To run user SHSTK test and kernel IBT test in guest, a CET-capable platform
> is required, e.g., Sapphire Rapids server, and follow below steps to build
> the binaries:
>
> 1. Host kernel: Apply this series to mainline kernel (>= v6.6) and build.
>
> 2. Guest kernel: Pull kernel (>= v6.6), opt in to the CONFIG_X86_KERNEL_IBT
> and CONFIG_X86_USER_SHADOW_STACK options. Build with a CET-enabled gcc
> version (>= 8.5.0).
>
> 3. Apply CET QEMU patches[3] before building mainline QEMU.
>
> Check kernel selftest test_shadow_stack_64 output:
> [INFO] new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
> [INFO] changing ssp from 7f8c82900ff0 to 7f8c82100ff8
> [INFO] ssp is now 7f8c82101000
> [OK] Shadow stack pivot
> [OK] Shadow stack faults
> [INFO] Corrupting shadow stack
> [INFO] Generated shadow stack violation successfully
> [OK] Shadow stack violation test
> [INFO] Gup read -> shstk access success
> [INFO] Gup write -> shstk access success
> [INFO] Violation from normal write
> [INFO] Gup read -> write access success
> [INFO] Violation from normal write
> [INFO] Gup write -> write access success
> [INFO] Cow gup write -> write access success
> [OK] Shadow gup test
> [INFO] Violation from shstk access
> [OK] mprotect() test
> [SKIP] Userfaultfd unavailable.
> [OK] 32 bit test
>
>
> Check kernel IBT with dmesg | grep CET:
> CET detected: Indirect Branch Tracking enabled
>
> Changes in v10:
> =====================
> 1. Add Reviewed-by tags from Chao and Rick. [Chao, Rick]
> 2. Use two bit flags to check CET guarded instructions in KVM emulator. [Chao]
> 3. Refine reset handling of xsave-managed guest FPU states. [Chao]
> 4. Add nested CET MSR sync when entry/exit-load-bit is not set. [Chao]
> 5. Other minor changes per comments from Chao and Rick.
> 6. Rebased on https://github.com/kvm-x86/linux commit: c0f8b0752b09
>
>
> [1]: KVM-unit-tests fixup:
> https://lore.kernel.org/all/[email protected]/
> [2]: Selftest for CET MSRs:
> https://lore.kernel.org/all/[email protected]/
> [3]: QEMU patch:
> https://lore.kernel.org/all/[email protected]/
> [4]: v9 patchset:
> https://lore.kernel.org/all/[email protected]/
>
>
> Patch 1-7: Fixup patches for kernel xstate and enable CET supervisor xstate.
> Patch 8-11: Cleanup patches for KVM.
> Patch 12-15: Enable KVM XSS MSR support.
> Patch 16: Fault check for CR4.CET setting.
> Patch 17: Report CET MSRs to userspace.
> Patch 18: Introduce CET VMCS fields.
> Patch 19: Add SHSTK/IBT to KVM-governed framework (to be deprecated).
> Patch 20: Emulate CET MSR access.
> Patch 21: Handle SSP at entry/exit to SMM.
> Patch 22: Set up CET MSR interception.
> Patch 23: Initialize host constant supervisor state.
> Patch 24: Enable CET virtualization settings.
> Patch 25-26: Add CET nested support.
> Patch 27: KVM emulation handling for branch instructions
>
>
> Sean Christopherson (4):
> x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> __state_perm
> KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
> KVM: x86: Report XSS as to-be-saved if there are supported features
> KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
>
> Yang Weijiang (23):
> x86/fpu/xstate: Refine CET user xstate bit enabling
> x86/fpu/xstate: Add CET supervisor mode state support
> x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
> x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
> x86/fpu/xstate: Create guest fpstate with guest specific config
> x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal
> fpstate
> KVM: x86: Rename kvm_{g,s}et_msr()* to menifest emulation operations
> KVM: x86: Refine xsave-managed guest register/MSR reset handling
> KVM: x86: Add kvm_msr_{read,write}() helpers
> KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
> KVM: x86: Initialize kvm_caps.supported_xss
> KVM: x86: Add fault checks for guest CR4.CET setting
> KVM: x86: Report KVM supported CET MSRs as to-be-saved
> KVM: VMX: Introduce CET VMCS fields and control bits
> KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT
> enabled"
> KVM: VMX: Emulate read and write to CET MSRs
> KVM: x86: Save and reload SSP to/from SMRAM
> KVM: VMX: Set up interception for CET MSRs
> KVM: VMX: Set host constant supervisor states to VMCS fields
> KVM: x86: Enable CET virtualization for VMX and advertise to userspace
> KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery
> to L1
> KVM: nVMX: Enable CET support for nested guest
> KVM: x86: Don't emulate instructions guarded by CET
>
> arch/x86/include/asm/fpu/types.h | 16 +-
> arch/x86/include/asm/fpu/xstate.h | 11 +-
> arch/x86/include/asm/kvm_host.h | 12 +-
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/include/asm/vmx.h | 8 +
> arch/x86/include/uapi/asm/kvm_para.h | 1 +
> arch/x86/kernel/fpu/core.c | 53 +++--
> arch/x86/kernel/fpu/xstate.c | 44 ++++-
> arch/x86/kernel/fpu/xstate.h | 3 +
> arch/x86/kvm/cpuid.c | 80 ++++++--
> arch/x86/kvm/emulate.c | 46 +++--
> arch/x86/kvm/governed_features.h | 2 +
> arch/x86/kvm/smm.c | 12 +-
> arch/x86/kvm/smm.h | 2 +-
> arch/x86/kvm/vmx/capabilities.h | 10 +
> arch/x86/kvm/vmx/nested.c | 120 ++++++++++--
> arch/x86/kvm/vmx/nested.h | 5 +
> arch/x86/kvm/vmx/vmcs12.c | 6 +
> arch/x86/kvm/vmx/vmcs12.h | 14 +-
> arch/x86/kvm/vmx/vmx.c | 112 ++++++++++-
> arch/x86/kvm/vmx/vmx.h | 9 +-
> arch/x86/kvm/x86.c | 280 ++++++++++++++++++++++++---
> arch/x86/kvm/x86.h | 28 +++
> 23 files changed, 761 insertions(+), 114 deletions(-)
>
>
> base-commit: c0f8b0752b0988e5116c78e8b6c3cfdf89806e45


2024-03-13 09:43:46

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 20/27] KVM: VMX: Emulate read and write to CET MSRs

On 3/13/2024 6:55 AM, Sean Christopherson wrote:
> -non-KVM people, +Mingwei, Aaron, Oliver, and Jim
>
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> case MSR_IA32_PERF_CAPABILITIES:
>> if (data && !vcpu_to_pmu(vcpu)->version)
>> return 1;
> Ha, perfect, this is already in the diff context.
>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c0ed69353674..281c3fe728c5 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1849,6 +1849,36 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
>> }
>> EXPORT_SYMBOL_GPL(kvm_msr_allowed);
>>
>> +#define CET_US_RESERVED_BITS GENMASK(9, 6)
>> +#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
>> +#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
>> +#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
>> +
>> +static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
>> + bool host_initiated)
>> +{
> ...
>
>> + /*
>> + * If KVM supports the MSR, i.e. has enumerated the MSR existence to
>> + * userspace, then userspace is allowed to write '0' irrespective of
>> + * whether or not the MSR is exposed to the guest.
>> + */
>> + if (!host_initiated || data)
>> + return false;
> ...
>
>> @@ -1951,6 +2017,20 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
>> !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
>> return 1;
>> break;
>> + case MSR_IA32_U_CET:
>> + case MSR_IA32_S_CET:
>> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
>> + !guest_can_use(vcpu, X86_FEATURE_IBT))
>> + return 1;
> As pointed out by Mingwei in a conversation about PERF_CAPABILITIES, rejecting
> host *reads* while allowing host writes of '0' is inconsistent. Which, while
> arguably par for the course for KVM's ABI, will likely result in the exact problem
> we're trying to avoid: killing userspace because it attempts to access an MSR KVM
> has said exists.

Thank you for the notification!
Agreed.

>
> PERF_CAPABILITIES has a similar, but opposite, problem where KVM returns a non-zero
> value on reads, but rejects that same non-zero value on write. PERF_CAPABILITIES
> is even more complicated because KVM stuffs a non-zero value at vCPU creation, but
> that's not really relevant to this discussion, just another data point for how
> messed up this all is.
>
> Also relevant to this discussion are KVM's PV MSRs, e.g. MSR_KVM_ASYNC_PF_ACK,
> as KVM rejects attempts to write '0' if the guest doesn't support the MSR, but
> if and only if userspace has enabled KVM_CAP_ENFORCE_PV_FEATURE_CPUID.
>
> Coming to the point, this mess is getting too hard to maintain, both from a code
> perspective and "what is KVM's ABI?" perspective.
>
> Rather than play whack-a-mole and inevitably end up with bugs and/or inconsistencies,
> what if we (a) return KVM_MSR_RET_INVALID when an MSR access is denied based on
> guest CPUID,

Can we define a new return value, KVM_MSR_RET_REJECTED, for this case in order to tell it apart from KVM_MSR_RET_INVALID, which means the MSR index doesn't exist?
> (b) wrap userspace MSR accesses at the very top level and convert
> KVM_MSR_RET_INVALID to "success" when KVM reported the MSR as savable and userspace
> is reading or writing '0',

Yes, this would limit the changes on the KVM side.

> and (c) drop all of the host_initiated checks that
> exist purely to exempt userspace access from guest CPUID checks.
>
> The only possible hiccup I can think of is that this could subtly break userspace
> that is setting CPUID _after_ MSRs, but my understanding is that we've agreed to
> draw a line and say that that's unsupported.

Yeah, that would mess things up.

> And I think it's low risk, because
> I don't see how code like this:
>
> case MSR_TSC_AUX:
> if (!kvm_is_supported_user_return_msr(MSR_TSC_AUX))
> return 1;
>
> if (!host_initiated &&
> !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP) &&
> !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
> return 1;
>
> if (guest_cpuid_is_intel(vcpu) && (data >> 32) != 0)
> return 1;
>
> can possibly work if userspace sets MSRs first. The RDTSCP/RDPID checks are
> exempt, but the vendor in guest CPUID would be '0', not Intel's magic string,
> and so setting MSRs before CPUID would fail, at least if the target vCPU model
> is Intel.
>
> P.S. I also want to rename KVM_MSR_RET_INVALID => KVM_MSR_RET_UNSUPPORTED, because
> I can never remember that "invalid" doesn't mean the value was invalid, it means
> the MSR index was invalid.

So do I :-)

>
> It'll take a few patches, but I believe we can end up with something like this:
>
> static bool kvm_is_msr_to_save(u32 msr_index)
> {
> unsigned int i;
>
> for (i = 0; i < num_msrs_to_save; i++) {
> if (msrs_to_save[i] == msr_index)
> return true;
> }

Should we also check emulated_msrs list here since KVM_GET_MSR_INDEX_LIST exposes it too?

>
> return false;
> }
> typedef int (*msr_uaccess_t)(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> bool host_initiated);
>
> static __always_inline int kvm_do_msr_uaccess(struct kvm_vcpu *vcpu, u32 msr,
> u64 *data, bool host_initiated,
> enum kvm_msr_access rw,
> msr_uaccess_t msr_uaccess_fn)
> {
> const char *op = rw == MSR_TYPE_W ? "wrmsr" : "rdmsr";
> int ret;
>
> BUILD_BUG_ON(rw != MSR_TYPE_R && rw != MSR_TYPE_W);
>
> /*
> * Zero the data on read failures to avoid leaking stack data to the
> * guest and/or userspace, e.g. if the failure is ignored below.
> */
> ret = msr_uaccess_fn(vcpu, msr, data, host_initiated);
> if (ret && rw == MSR_TYPE_R)
> *data = 0;
>
> if (ret != KVM_MSR_RET_UNSUPPORTED)
> return ret;
>
> /*
> * Userspace is allowed to read MSRs, and write '0' to MSRs, that KVM
> * reports as to-be-saved, even if an MSR isn't fully supported.
> * Simply check that @data is '0', which covers both the write '0' case
> * and all reads (in which case @data is zeroed on failure; see above).
> */
> if (kvm_is_msr_to_save(msr) && !*data)
> return 0;
>
> if (!ignore_msrs) {
> kvm_debug_ratelimited("unhandled %s: 0x%x data 0x%llx\n",
> op, msr, *data);
> return ret;
> }
>
> if (report_ignored_msrs)
> kvm_pr_unimpl("ignored %s: 0x%x data 0x%llx\n", op, msr, *data);
>
> return 0;
> }

The handling flow looks good to me. Thanks a lot!



2024-03-13 16:00:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 20/27] KVM: VMX: Emulate read and write to CET MSRs

On Wed, Mar 13, 2024, Weijiang Yang wrote:
> On 3/13/2024 6:55 AM, Sean Christopherson wrote:
> > PERF_CAPABILITIES has a similar, but opposite, problem where KVM returns a non-zero
> > value on reads, but rejects that same non-zero value on write. PERF_CAPABILITIES
> > is even more complicated because KVM stuff a non-zero value at vCPU creation, but
> > that's not really relevant to this discussion, just another data point for how
> > messed up this all is.
> >
> > Also relevant to this discussion are KVM's PV MSRs, e.g. MSR_KVM_ASYNC_PF_ACK,
> > as KVM rejects attempts to write '0' if the guest doesn't support the MSR, but
> > if and only userspace has enabled KVM_CAP_ENFORCE_PV_FEATURE_CPUID.
> >
> > Coming to the point, this mess is getting too hard to maintain, both from a code
> > perspective and "what is KVM's ABI?" perspective.
> >
> > Rather than play whack-a-mole and inevitably end up with bugs and/or inconsistencies,
> > what if we (a) return KVM_MSR_RET_INVALID when an MSR access is denied based on
> > guest CPUID,
>
> Can we define a new return value, KVM_MSR_RET_REJECTED, for this case in order
> to tell it apart from KVM_MSR_RET_INVALID, which means the MSR index doesn't exist?

No. Well, I mean, we could, but I don't see any reason to define another return
value, because the semantics further up the stack need to be identical. And
unfortunately, correctly differentiating between the two scenarios would require
quite a bit of surgery to play nice with PMU MSRs.

Hmm, I suppose we could WARN if a _completely_ unhandled MSR ends up in the
msrs_to_save or emulated_msrs lists. But because of the PMU MSRs complications,
this is definitely not worth doing right away, if ever.

> > static bool kvm_is_msr_to_save(u32 msr_index)
> > {
> > unsigned int i;
> >
> > for (i = 0; i < num_msrs_to_save; i++) {
> > if (msrs_to_save[i] == msr_index)
> > return true;
> > }
>
> Should we also check emulated_msrs list here since KVM_GET_MSR_INDEX_LIST
> exposes it too?

Ah, yes. I was thinking msrs_to_save was a superset, but KVM_GET_MSR_INDEX_LIST
is where the lists get smushed together.
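
For completeness, a sketch of the helper extended to also scan emulated_msrs,
assuming the existing emulated_msrs/num_emulated_msrs arrays in x86.c (the
kvm_is_advertised_msr name is illustrative):

static bool kvm_is_advertised_msr(u32 msr_index)
{
	unsigned int i;

	/* KVM_GET_MSR_INDEX_LIST reports entries from both lists. */
	for (i = 0; i < num_msrs_to_save; i++) {
		if (msrs_to_save[i] == msr_index)
			return true;
	}

	for (i = 0; i < num_emulated_msrs; i++) {
		if (emulated_msrs[i] == msr_index)
			return true;
	}

	return false;
}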

2024-03-12 22:56:30

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 20/27] KVM: VMX: Emulate read and write to CET MSRs

-non-KVM people, +Mingwei, Aaron, Oliver, and Jim

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> case MSR_IA32_PERF_CAPABILITIES:
> if (data && !vcpu_to_pmu(vcpu)->version)
> return 1;

Ha, perfect, this is already in the diff context.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c0ed69353674..281c3fe728c5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1849,6 +1849,36 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
> }
> EXPORT_SYMBOL_GPL(kvm_msr_allowed);
>
> +#define CET_US_RESERVED_BITS GENMASK(9, 6)
> +#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
> +#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
> +#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
> +
> +static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
> + bool host_initiated)
> +{

..

> + /*
> + * If KVM supports the MSR, i.e. has enumerated the MSR existence to
> + * userspace, then userspace is allowed to write '0' irrespective of
> + * whether or not the MSR is exposed to the guest.
> + */
> + if (!host_initiated || data)
> + return false;

..

> @@ -1951,6 +2017,20 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
> return 1;
> break;
> + case MSR_IA32_U_CET:
> + case MSR_IA32_S_CET:
> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> + !guest_can_use(vcpu, X86_FEATURE_IBT))
> + return 1;

As pointed out by Mingwei in a conversation about PERF_CAPABILITIES, rejecting
host *reads* while allowing host writes of '0' is inconsistent. Which, while
arguably par for the course for KVM's ABI, will likely result in the exact problem
we're trying to avoid: killing userspace because it attempts to access an MSR KVM
has said exists.

PERF_CAPABILITIES has a similar, but opposite, problem where KVM returns a non-zero
value on reads, but rejects that same non-zero value on write. PERF_CAPABILITIES
is even more complicated because KVM stuffs a non-zero value at vCPU creation, but
that's not really relevant to this discussion, just another data point for how
messed up this all is.

Also relevant to this discussion are KVM's PV MSRs, e.g. MSR_KVM_ASYNC_PF_ACK,
as KVM rejects attempts to write '0' if the guest doesn't support the MSR, but
if and only if userspace has enabled KVM_CAP_ENFORCE_PV_FEATURE_CPUID.

Coming to the point, this mess is getting too hard to maintain, both from a code
perspective and "what is KVM's ABI?" perspective.

Rather than play whack-a-mole and inevitably end up with bugs and/or inconsistencies,
what if we (a) return KVM_MSR_RET_INVALID when an MSR access is denied based on
guest CPUID, (b) wrap userspace MSR accesses at the very top level and convert
KVM_MSR_RET_INVALID to "success" when KVM reported the MSR as savable and userspace
is reading or writing '0', and (c) drop all of the host_initiated checks that
exist purely to exempt userspace access from guest CPUID checks.

The only possible hiccup I can think of is that this could subtly break userspace
that is setting CPUID _after_ MSRs, but my understanding is that we've agreed to
draw a line and say that that's unsupported. And I think it's low risk, because
I don't see how code like this:

case MSR_TSC_AUX:
if (!kvm_is_supported_user_return_msr(MSR_TSC_AUX))
return 1;

if (!host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP) &&
!guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
return 1;

if (guest_cpuid_is_intel(vcpu) && (data >> 32) != 0)
return 1;

can possibly work if userspace sets MSRs first. The RDTSCP/RDPID checks are
exempt, but the vendor in guest CPUID would be '0', not Intel's magic string,
and so setting MSRs before CPUID would fail, at least if the target vCPU model
is Intel.

P.S. I also want to rename KVM_MSR_RET_INVALID => KVM_MSR_RET_UNSUPPORTED, because
I can never remember that "invalid" doesn't mean the value was invalid, it means
the MSR index was invalid.

It'll take a few patches, but I believe we can end up with something like this:

static bool kvm_is_msr_to_save(u32 msr_index)
{
unsigned int i;

for (i = 0; i < num_msrs_to_save; i++) {
if (msrs_to_save[i] == msr_index)
return true;
}

return false;
}
typedef int (*msr_uaccess_t)(struct kvm_vcpu *vcpu, u32 index, u64 *data,
bool host_initiated);

static __always_inline int kvm_do_msr_uaccess(struct kvm_vcpu *vcpu, u32 msr,
u64 *data, bool host_initiated,
enum kvm_msr_access rw,
msr_uaccess_t msr_uaccess_fn)
{
const char *op = rw == MSR_TYPE_W ? "wrmsr" : "rdmsr";
int ret;

BUILD_BUG_ON(rw != MSR_TYPE_R && rw != MSR_TYPE_W);

/*
* Zero the data on read failures to avoid leaking stack data to the
* guest and/or userspace, e.g. if the failure is ignored below.
*/
ret = msr_uaccess_fn(vcpu, msr, data, host_initiated);
if (ret && rw == MSR_TYPE_R)
*data = 0;

if (ret != KVM_MSR_RET_UNSUPPORTED)
return ret;

/*
* Userspace is allowed to read MSRs, and write '0' to MSRs, that KVM
* reports as to-be-saved, even if an MSR isn't fully supported.
* Simply check that @data is '0', which covers both the write '0' case
* and all reads (in which case @data is zeroed on failure; see above).
*/
if (kvm_is_msr_to_save(msr) && !*data)
return 0;

if (!ignore_msrs) {
kvm_debug_ratelimited("unhandled %s: 0x%x data 0x%llx\n",
op, msr, *data);
return ret;
}

if (report_ignored_msrs)
kvm_pr_unimpl("ignored %s: 0x%x data 0x%llx\n", op, msr, *data);

return 0;
}
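
For context, a sketch of how the wrapper above might be invoked from a
userspace read path, reusing the existing __kvm_get_msr() (the exact plumbing
is illustrative, not part of the proposal above):

static int do_get_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
{
	return kvm_do_msr_uaccess(vcpu, index, data, true, MSR_TYPE_R,
				  __kvm_get_msr);
}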

2024-05-01 18:45:40

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 04/27] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features

I still don't understand why this is being called DYNAMIC. CET_SS isn't dynamic,
as KVM is _always_ allowed to save/restore CET_SS, i.e. whether or not KVM can
expose CET_SS to a guest is a static, boot-time decision. Whether or not a guest
XSS actually enables CET_SS is "dynamic", but that's true of literally every
xfeature in XCR0 and XSS.

XFEATURE_MASK_XTILE_DATA is labeled as dynamic because userspace has to explicitly
request that XTILE_DATA be enabled, and thus whether or not KVM is allowed to
expose XTILE_DATA to the guest is a dynamic, runtime decision.

So IMO, the umbrella macro should be XFEATURE_MASK_KERNEL_GUEST_ONLY.
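
For concreteness, the rename would amount to something like this, assuming
CET supervisor state remains the only member of the set (a sketch, not the
actual patch):

/*
 * Kernel xfeatures that are saved/restored only for KVM guests, i.e. are
 * never enabled in the host's own default fpstate.
 */
#define XFEATURE_MASK_KERNEL_GUEST_ONLY	XFEATURE_MASK_CET_KERNEL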

2024-05-01 18:54:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 09/27] KVM: x86: Rename kvm_{g,s}et_msr()* to menifest emulation operations

s/menifest/manifest, though I find the shortlog confusing irrespective of the
typo. I think this would be more grammatically correct:

KVM: x86: Rename kvm_{g,s}et_msr()* to manifest their emulation operations

but I still find that unnecessarily "fancy". What about this instead?

KVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses

It's not perfect, e.g. it might be read as saying they emulate guest RDMSR and
WRMSR, but for a shortlog I think that's fine.

2024-05-01 20:40:40

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 10/27] KVM: x86: Refine xsave-managed guest register/MSR reset handling

The shortlog is a bit stale now that it only deals with XSTATE. This?

KVM: x86: Zero XSTATE components on INIT by iterating over supported features

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> Tweak the code a bit to facilitate resetting more xstate components in
> the future, e.g., CET's xstate-managed MSRs.
>
> No functional change intended.
>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/x86.c | 30 +++++++++++++++++++++++++++---
> 1 file changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 10847e1cc413..5a9c07751c0e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12217,11 +12217,27 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> static_branch_dec(&kvm_has_noapic_vcpu);
> }
>
> +#define XSTATE_NEED_RESET_MASK (XFEATURE_MASK_BNDREGS | \
> + XFEATURE_MASK_BNDCSR)
> +
> +static bool kvm_vcpu_has_xstate(unsigned long xfeature)
> +{
> + switch (xfeature) {
> + case XFEATURE_MASK_BNDREGS:
> + case XFEATURE_MASK_BNDCSR:
> + return kvm_cpu_cap_has(X86_FEATURE_MPX);
> + default:
> + return false;
> + }
> +}
> +
> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> struct kvm_cpuid_entry2 *cpuid_0x1;
> unsigned long old_cr0 = kvm_read_cr0(vcpu);
> + DECLARE_BITMAP(reset_mask, 64);

I vote to use a u64 instead of a bitmap. The resulting cast isn't exactly pretty,
but it's not all that uncommon, and it's easy enough to make it "safe" by adding
BUILD_BUG_ON().

On the flip side, using the bitmap_*() APIs for super simple bitwise-OR/AND/TEST
operations makes the code harder to read.

> unsigned long new_cr0;
> + unsigned int i;
>
> /*
> * Several of the "set" flows, e.g. ->set_cr0(), read other registers
> @@ -12274,7 +12290,12 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> kvm_async_pf_hash_reset(vcpu);
> vcpu->arch.apf.halted = false;
>
> - if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
> + bitmap_from_u64(reset_mask, (kvm_caps.supported_xcr0 |
> + kvm_caps.supported_xss) &
> + XSTATE_NEED_RESET_MASK);
> +
> + if (vcpu->arch.guest_fpu.fpstate &&
> + !bitmap_empty(reset_mask, XFEATURE_MAX)) {
> struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
>
> /*
> @@ -12284,8 +12305,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> if (init_event)
> kvm_put_guest_fpu(vcpu);
>
> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
> + for_each_set_bit(i, reset_mask, XFEATURE_MAX) {
> + if (!kvm_vcpu_has_xstate(i))
> + continue;
> + fpstate_clear_xstate_component(fpstate, i);
> + }

A few intertwined thoughts:

1. fpstate is zero allocated, and KVM absolutely relies on that, e.g. KVM doesn't
manually zero out the XSAVE fields that are preserved on INIT, but zeroed on
RESET.

2. That means there is no need to manually clear XSTATE components during RESET,
as KVM doesn't support standalone RESET, i.e. it's only cleared during vCPU
creation, when guest FPU state is guaranteed to be '0'.

3. That makes XSTATE_NEED_RESET_MASK a misnomer, as it's literally the !RESET
path that is relevant. E.g. it should be XSTATE_CLEAR_ON_INIT_MASK or so.

4. If we add a helper, then XSTATE_NEED_RESET_MASK is probably unneeded.

So, what if we slot in the below (compile tested only) patch as prep work? Then
this patch becomes:

---
arch/x86/kvm/x86.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b441bf61b541..b00730353a28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12220,6 +12220,8 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
static void kvm_xstate_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
+ u64 xfeatures_mask;
+ int i;

/*
* Guest FPU state is zero allocated and so doesn't need to be manually
@@ -12233,16 +12235,20 @@ static void kvm_xstate_reset(struct kvm_vcpu *vcpu, bool init_event)
* are unchanged. Currently, the only components that are zeroed and
* supported by KVM are MPX related.
*/
- if (!kvm_mpx_supported())
+ xfeatures_mask = (kvm_caps.supported_xcr0 | kvm_caps.supported_xss) &
+ (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
+ if (!xfeatures_mask)
return;

+ BUILD_BUG_ON(XFEATURE_MAX >= sizeof(xfeatures_mask) * BITS_PER_BYTE);
+
/*
* All paths that lead to INIT are required to load the guest's FPU
* state (because most paths are buried in KVM_RUN).
*/
kvm_put_guest_fpu(vcpu);
- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
+ for_each_set_bit(i, (unsigned long *)&xfeatures_mask, XFEATURE_MAX)
+ fpstate_clear_xstate_component(fpstate, i);
kvm_load_guest_fpu(vcpu);
}


base-commit: efca8b27900dfec160b6ba90820fa2ced81de904
--


and the final code looks like:


static void kvm_xstate_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
u64 xfeatures_mask;
int i;

/*
* Guest FPU state is zero allocated and so doesn't need to be manually
* cleared on RESET, i.e. during vCPU creation.
*/
if (!init_event || !fpstate)
return;

/*
* On INIT, only select XSTATE components are zeroed, most components
* are unchanged. Currently, the only components that are zeroed and
* supported by KVM are MPX and CET related.
*/
xfeatures_mask = (kvm_caps.supported_xcr0 | kvm_caps.supported_xss) &
(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR |
XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL);
if (!xfeatures_mask)
return;

BUILD_BUG_ON(XFEATURE_MAX >= sizeof(xfeatures_mask) * BITS_PER_BYTE);

/*
* All paths that lead to INIT are required to load the guest's FPU
* state (because most paths are buried in KVM_RUN).
*/
kvm_put_guest_fpu(vcpu);
for_each_set_bit(i, (unsigned long *)&xfeatures_mask, XFEATURE_MAX)
fpstate_clear_xstate_component(fpstate, i);
kvm_load_guest_fpu(vcpu);
}


--
From: Sean Christopherson <[email protected]>
Date: Wed, 1 May 2024 12:12:31 -0700
Subject: [PATCH] KVM: x86: Manually clear MPX state only on INIT

Don't manually clear/zero MPX state on RESET, as the guest FPU state is
zero allocated and KVM only does RESET during vCPU creation, i.e. the
relevant state is guaranteed to be all zeroes.

Opportunistically move the relevant code into a helper in anticipation of
adding support for CET shadow stacks, which also has state that is zeroed
on INIT.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/x86.c | 46 ++++++++++++++++++++++++++++++----------------
1 file changed, 30 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 10847e1cc413..b441bf61b541 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12217,6 +12217,35 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
static_branch_dec(&kvm_has_noapic_vcpu);
}

+static void kvm_xstate_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
+
+ /*
+ * Guest FPU state is zero allocated and so doesn't need to be manually
+ * cleared on RESET, i.e. during vCPU creation.
+ */
+ if (!init_event || !fpstate)
+ return;
+
+ /*
+ * On INIT, only select XSTATE components are zeroed, most components
+ * are unchanged. Currently, the only components that are zeroed and
+ * supported by KVM are MPX related.
+ */
+ if (!kvm_mpx_supported())
+ return;
+
+ /*
+ * All paths that lead to INIT are required to load the guest's FPU
+ * state (because most paths are buried in KVM_RUN).
+ */
+ kvm_put_guest_fpu(vcpu);
+ fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
+ fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
+ kvm_load_guest_fpu(vcpu);
+}
+
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct kvm_cpuid_entry2 *cpuid_0x1;
@@ -12274,22 +12303,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;

- if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
- struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
-
- /*
- * All paths that lead to INIT are required to load the guest's
- * FPU state (because most paths are buried in KVM_RUN).
- */
- if (init_event)
- kvm_put_guest_fpu(vcpu);
-
- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
- fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
-
- if (init_event)
- kvm_load_guest_fpu(vcpu);
- }
+ kvm_xstate_reset(vcpu, init_event);

if (!init_event) {
vcpu->arch.smbase = 0x30000;

base-commit: 1a89965fa9dae1ae04f44679860ef6bc008c2003
--

2024-05-01 20:44:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 13/27] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9eb5c8dbd4fb..b502d68a2576 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3926,16 +3926,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> }
> break;
> case MSR_IA32_XSS:
> - if (!msr_info->host_initiated &&
> - !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
> + /*
> + * If KVM reported support of XSS MSR, even guest CPUID doesn't
> + * support XSAVES, still allow userspace to set default value(0)
> + * to this MSR.
> + */
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
> + !(msr_info->host_initiated && data == 0))

With my proposed MSR access cleanup[*], I think (hope?) this simply becomes:

if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
return KVM_MSR_RET_UNSUPPORTED;

with no comment needed as the "host && !data" case is handled in common code.

[*] https://lore.kernel.org/all/[email protected]

2024-05-01 22:41:04

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 17/27] KVM: x86: Report KVM supported CET MSRs as to-be-saved

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> Add CET MSRs to the list of MSRs reported to userspace if the feature,
> i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.
>
> SSP can only be read via RDSSP. Writing even requires destructive and
> potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
> SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
> for the GUEST_SSP field of the VMCS.
>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm_para.h | 1 +
> arch/x86/kvm/vmx/vmx.c | 2 ++
> arch/x86/kvm/x86.c | 18 ++++++++++++++++++
> 3 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 605899594ebb..9d08c0bec477 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -58,6 +58,7 @@
> #define MSR_KVM_ASYNC_PF_INT 0x4b564d06
> #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
> #define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
> +#define MSR_KVM_SSP 0x4b564d09

We never resolved the conversation from v6[*], but I still agree with Maxim's
view that defining a synthetic MSR, which "steals" an MSR from KVM's MSR address
space, is a bad idea.

And I still also think that KVM_SET_ONE_REG is the best way forward. Completely
untested, but I think this is all that is needed to wire up KVM_{G,S}ET_ONE_REG
to support MSRs, and carve out room for 250+ other register types, plus room for
more future stuff as needed.

We'll still need a KVM-defined MSR for SSP, but it can be KVM internal, not uAPI,
e.g. the "index" exposed to userspace can simply be '0' for a register type of
KVM_X86_REG_SYNTHETIC_MSR, and then the translated internal index can be any
value that doesn't conflict.

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index ef11aa4cab42..ca2a47a85fa1 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -410,6 +410,16 @@ struct kvm_xcrs {
__u64 padding[16];
};

+#define KVM_X86_REG_MSR (1 << 2)
+#define KVM_X86_REG_SYNTHETIC_MSR (1 << 3)
+
+struct kvm_x86_reg_id {
+ __u32 index;
+ __u8 type;
+ __u8 rsvd;
+ __u16 rsvd16;
+};
+
#define KVM_SYNC_X86_REGS (1UL << 0)
#define KVM_SYNC_X86_SREGS (1UL << 1)
#define KVM_SYNC_X86_EVENTS (1UL << 2)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..53f2b43b4651 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2244,6 +2244,30 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
return kvm_set_msr_ignored_check(vcpu, index, *data, true);
}

+static int kvm_get_one_msr(struct kvm_vcpu *vcpu, u32 msr, u64 __user *value)
+{
+ u64 val;
+ int r;
+
+ r = do_get_msr(vcpu, msr, &val);
+ if (r)
+ return r;
+
+ if (put_user(val, value))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_set_one_msr(struct kvm_vcpu *vcpu, u32 msr, u64 __user *value)
+{
+ u64 val;
+
+ if (get_user(val, value))
+ return -EFAULT;
+
+ return do_set_msr(vcpu, msr, &val);
+}
+
#ifdef CONFIG_X86_64
struct pvclock_clock {
int vclock_mode;
@@ -5976,6 +6000,39 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
srcu_read_unlock(&vcpu->kvm->srcu, idx);
break;
}
+ case KVM_GET_ONE_REG:
+ case KVM_SET_ONE_REG: {
+ struct kvm_x86_reg_id id;
+ struct kvm_one_reg reg;
+ u64 __user *value;
+
+ r = -EFAULT;
+ if (copy_from_user(&reg, argp, sizeof(reg)))
+ break;
+
+ r = -EINVAL;
+ memcpy(&id, &reg.id, sizeof(id));
+ if (id.rsvd || id.rsvd16)
+ break;
+
+ if (id.type != KVM_X86_REG_MSR &&
+ id.type != KVM_X86_REG_SYNTHETIC_MSR)
+ break;
+
+ if (id.type == KVM_X86_REG_SYNTHETIC_MSR) {
+ id.type = KVM_X86_REG_MSR;
+ r = kvm_translate_synthetic_msr(&id.index);
+ if (r)
+ break;
+ }
+
+ value = u64_to_user_ptr(reg.addr);
+ if (ioctl == KVM_GET_ONE_REG)
+ r = kvm_get_one_msr(vcpu, id.index, value);
+ else
+ r = kvm_set_one_msr(vcpu, id.index, value);
+ break;
+ }
case KVM_TPR_ACCESS_REPORTING: {
struct kvm_tpr_access_ctl tac;
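
A hedged sketch of the kvm_translate_synthetic_msr() helper referenced above;
the internal index value is made up, and any value that cannot collide with a
real or KVM paravirtual MSR index would do:

/* KVM-internal index for guest SSP; arbitrary, never exposed as uAPI. */
#define MSR_KVM_INTERNAL_GUEST_SSP	0x63f040e8

static int kvm_translate_synthetic_msr(u32 *index)
{
	switch (*index) {
	case 0:	/* userspace-visible synthetic ID '0' == guest SSP */
		*index = MSR_KVM_INTERNAL_GUEST_SSP;
		return 0;
	default:
		return -EINVAL;
	}
}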



2024-05-01 22:50:31

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 21/27] KVM: x86: Save and reload SSP to/from SMRAM

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
> behavior when the guest enters/leaves SMM mode, i.e., it saves registers
> to SMRAM on SMM entry and reloads them on SMM exit. Per the SDM, SSP is
> one such register on 64-bit arch, so add support for SSP. Note, on
> 32-bit arch, SSP is not defined in SMRAM, so fail 32-bit CET guest
> launch.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> Reviewed-by: Maxim Levitsky <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 11 +++++++++++
> arch/x86/kvm/smm.c | 8 ++++++++
> arch/x86/kvm/smm.h | 2 +-
> 3 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 2bb1931103ad..c0e13040e35b 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -149,6 +149,17 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
> if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
> return -EINVAL;
> }
> + /*
> + * Prevent 32-bit guest launch if shadow stack is exposed as SSP
> + * state is not defined for 32-bit SMRAM.

Why? Lack of save/restore for SSP on 32-bit guests is a gap in Intel's
architecture, I don't see why KVM should diverge from hardware. I.e. just do
nothing for SSP on SMI/RSM, because that's exactly what the architecture says
will happen.

> + */
> + best = cpuid_entry2_find(entries, nent, 0x80000001,
> + KVM_CPUID_INDEX_NOT_SIGNIFICANT);
> + if (best && !(best->edx & F(LM))) {
> + best = cpuid_entry2_find(entries, nent, 0x7, 0);
> + if (best && (best->ecx & F(SHSTK)))
> + return -EINVAL;
> + }
>
> /*
> * Exposing dynamic xfeatures to the guest requires additional
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index 45c855389ea7..7aac9c54c353 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
> enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>
> smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> +
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
> + vcpu->kvm);
> }
> #endif
>
> @@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
> static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
> ctxt->interruptibility = (u8)smstate->int_shadow;
>
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
> + vcpu->kvm);


This should synthesize triple-fault, not WARN and kill the VM, as the value to
be restored is guest controlled (the guest can scribble SMRAM from within the
SMI handler).

At that point, I would just synthesize triple-fault for the read path too.
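
A sketch of the suggested RSM path, assuming the emulator's existing
convention that a non-X86EMUL_CONTINUE return from rsm_load_state_64() is
escalated to a synthesized shutdown/triple fault:

	if (guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
	    kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp))
		return X86EMUL_UNHANDLEABLE;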

2024-05-01 23:10:38

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 22/27] KVM: VMX: Set up interception for CET MSRs

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> @@ -7767,6 +7771,41 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> }
>
> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
> +{
> + bool incpt;
> +
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
> +
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
> + MSR_TYPE_RW, incpt);
> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> + MSR_TYPE_RW, incpt);
> + if (!incpt)
> + return;

Hmm, I find this unnecessarily confusing and brittle. E.g. in the unlikely
event more CET stuff comes along, this lurking return could cause problems.

Why not handle S_CET and U_CET in a single common path? IMO, this is less error
prone, and more clearly captures the relationship between S/U_CET, SHSTK, and IBT.
Updating MSR intercepts is not a hot path, so the overhead of checking guest CPUID
multiple times should be a non-issue. And eventually KVM should effectively cache
all of those lookups, i.e. the cost will be negligible.

bool incpt;

if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);

vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
MSR_TYPE_RW, incpt);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
MSR_TYPE_RW, incpt);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
MSR_TYPE_RW, incpt);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
MSR_TYPE_RW, incpt);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
MSR_TYPE_RW, incpt);
}

if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
kvm_cpu_cap_has(X86_FEATURE_IBT)) {
incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT) &&
!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);

vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
MSR_TYPE_RW, incpt);
vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
MSR_TYPE_RW, incpt);
}

2024-05-01 23:17:34

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
> kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
> if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
> kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> + /*
> + * Don't use boot_cpu_has() to check availability of IBT because the
> + * feature bit is cleared in boot_cpu_data when ibt=off is applied
> + * in host cmdline.

I'm not convinced this is a good reason to diverge from the host kernel. E.g.
PCID and many other features honor the host setup, I don't see what makes IBT
special.

LA57 is special because it's entirely reasonable, likely even, for a host to
only want to use 48-bit virtual addresses, but still want to let the guest enable
LA57.

> @@ -4934,6 +4935,14 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>
> vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
>
> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> + vmcs_writel(GUEST_SSP, 0);
> + vmcs_writel(GUEST_S_CET, 0);
> + vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
> + } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> + vmcs_writel(GUEST_S_CET, 0);
> + }

Similar to my comments about MSR interception, I think it would be better to
explicitly handle the "common" field. At first glance, code like the above makes
it look like IBT is mutually exclusive with SHSTK, e.g. a reader that isn't
looking closely could easily miss that both paths write GUEST_S_CET.

if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
vmcs_writel(GUEST_SSP, 0);
vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
}
if (kvm_cpu_cap_has(X86_FEATURE_IBT) ||
kvm_cpu_cap_has(X86_FEATURE_SHSTK))
vmcs_writel(GUEST_S_CET, 0);

2024-05-01 23:19:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 25/27] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

The shortlog is wrong, or perhaps just misleading. This isn't "for event error_code
delivery to L1", it's for event injection error code delivery for _L2_, i.e. from L1
to L2.

The shortlog should be something more like:

KVM: nVMX: Virtualize NO_HW_ERROR_CODE_CC for L1 event injection to L2

2024-05-01 23:23:39

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 26/27] KVM: nVMX: Enable CET support for nested guest

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> @@ -2438,6 +2460,30 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
> }
> }
>
> +static inline void cet_vmcs_fields_get(struct kvm_vcpu *vcpu, u64 *ssp,
> + u64 *s_cet, u64 *ssp_tbl)
> +{
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
> + *ssp = vmcs_readl(GUEST_SSP);
> + *s_cet = vmcs_readl(GUEST_S_CET);
> + *ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> + } else if (guest_can_use(vcpu, X86_FEATURE_IBT)) {
> + *s_cet = vmcs_readl(GUEST_S_CET);
> + }

Same comments about accessing S_CET, please do so in a dedicated path.

> +}
> +
> +static inline void cet_vmcs_fields_put(struct kvm_vcpu *vcpu, u64 ssp,
> + u64 s_cet, u64 ssp_tbl)

This should probably use "set" instead of "put". I can't think of a single case
where KVM uses "put" to describe writing state, e.g. "put" is always used when
putting a reference or unloading state.

> +{
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
> + vmcs_writel(GUEST_SSP, ssp);
> + vmcs_writel(GUEST_S_CET, s_cet);
> + vmcs_writel(GUEST_INTR_SSP_TABLE, ssp_tbl);
> + } else if (guest_can_use(vcpu, X86_FEATURE_IBT)) {
> + vmcs_writel(GUEST_S_CET, s_cet);
> + }

And here.
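
(For illustration, applying that to both helpers might look like the sketch
below; the "set" name follows the comment above:)

static void cet_vmcs_fields_get(struct kvm_vcpu *vcpu, u64 *ssp,
                                u64 *s_cet, u64 *ssp_tbl)
{
        if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
                *ssp = vmcs_readl(GUEST_SSP);
                *ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
        }

        /* S_CET is common to SHSTK and IBT, handle it in a dedicated path. */
        if (guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
            guest_can_use(vcpu, X86_FEATURE_IBT))
                *s_cet = vmcs_readl(GUEST_S_CET);
}

/* cet_vmcs_fields_set() would mirror the above with vmcs_writel(). */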

2024-05-01 23:24:37

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 27/27] KVM: x86: Don't emulate instructions guarded by CET

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> Don't emulate the branch instructions, e.g., CALL/RET/JMP etc., when CET
> is active in the guest; instead, return KVM_INTERNAL_ERROR_EMULATION to
> userspace to handle it.
>
> KVM doesn't emulate CPU behaviors to check CET protected state while
> emulating guest instructions; instead, it stops emulation on detecting
> that the instructions in process are CET protected. By doing so, it can
> avoid generating a bogus #CP in the guest and prevent CET protected
> execution flow subversion from the guest side.
>
> Suggested-by: Chao Gao <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---

This should be ordered before CET is exposed to userspace, e.g. so that KVM's
ABI is well defined when CET support becomes usable.

2024-05-01 23:25:15

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Wed, 2024-05-01 at 16:15 -0700, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
> > @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
> >                 kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
> >         if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
> >                 kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> > +       /*
> > +        * Don't use boot_cpu_has() to check availability of IBT because the
> > +        * feature bit is cleared in boot_cpu_data when ibt=off is applied
> > +        * in host cmdline.
>
> I'm not convinced this is a good reason to diverge from the host kernel.  E.g.
> PCID and many other features honor the host setup, I don't see what makes IBT
> special.
>
> LA57 is special because it's entirely reasonable, likely even, for a host to
> only want to use 48-bit virtual addresses, but still want to let the guest
> enable
> LA57.

Definitely. I swear we (Weijiang and I) had a back and forth at some point where
we agreed to match the host support. Plus I think the CET FPU stuff triggers off
of host support for CET. So if the host doesn't have X86_FEATURE_SHSTK or
X86_FEATURE_IBT then... hopefully it's caught later. But then don't report it as
supported.

2024-05-01 23:28:04

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 00/27] Enable CET Virtualization

On Sun, Feb 18, 2024, Yang Weijiang wrote:
> Sean Christopherson (4):
> x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> __state_perm
> KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
> KVM: x86: Report XSS as to-be-saved if there are supported features
> KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
>
> Yang Weijiang (23):
> x86/fpu/xstate: Refine CET user xstate bit enabling
> x86/fpu/xstate: Add CET supervisor mode state support
> x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
> x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
> x86/fpu/xstate: Create guest fpstate with guest specific config
> x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal
> fpstate
> KVM: x86: Rename kvm_{g,s}et_msr()* to menifest emulation operations
> KVM: x86: Refine xsave-managed guest register/MSR reset handling
> KVM: x86: Add kvm_msr_{read,write}() helpers
> KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
> KVM: x86: Initialize kvm_caps.supported_xss
> KVM: x86: Add fault checks for guest CR4.CET setting
> KVM: x86: Report KVM supported CET MSRs as to-be-saved
> KVM: VMX: Introduce CET VMCS fields and control bits
> KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT
> enabled"
> KVM: VMX: Emulate read and write to CET MSRs
> KVM: x86: Save and reload SSP to/from SMRAM
> KVM: VMX: Set up interception for CET MSRs
> KVM: VMX: Set host constant supervisor states to VMCS fields
> KVM: x86: Enable CET virtualization for VMX and advertise to userspace
> KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery
> to L1
> KVM: nVMX: Enable CET support for nested guest
> KVM: x86: Don't emulate instructions guarded by CET

A decent number of comments, but almost all of them are quite minor. The big
open is how to handle save/restore of SSP from userspace.

Instead of spinning a full v10, maybe send an RFC for KVM_{G,S}ET_ONE_REG idea?
That will make it easier to review, and if you delay v11 a bit, I should be able
to get various series applied that have minor conflicts/dependencies, e.g. the
MSR access and the kvm_host series.

2024-05-02 17:46:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v10 04/27] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On 5/1/24 11:45, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features
> I still don't understand why this is being called DYNAMIC. CET_SS isn't dynamic,
> as KVM is _always_ allowed to save/restore CET_SS, i.e. whether or not KVM can
> expose CET_SS to a guest is a static, boot-time decision. Whether or not a guest
> XSS actually enables CET_SS is "dynamic", but that's true of literally every
> xfeature in XCR0 and XSS.
>
> XFEATURE_MASK_XTILE_DATA is labeled as dynamic because userspace has to explicitly
> request that XTILE_DATA be enabled, and thus whether or not KVM is allowed to
> expose XTILE_DATA to the guest is a dynamic, runtime decision.
>
> So IMO, the umbrella macro should be XFEATURE_MASK_KERNEL_GUEST_ONLY.

Here's how I got that naming. First, "static" features are always
there. "Dynamic" features might or might not be there. I was also much
more focused on what's in the XSAVE buffer than on the enabling itself,
which are _slightly_ different.

Then, it's a matter of whether the feature is user or supervisor. The
kernel might need new state for multiple reasons. Think of LBR state as
an example. The kernel might want LBR state around for perf _or_ so it
can be exposed to a guest.

I just didn't want to tie it to "GUEST" too much in case we have more of
these things come along that get used for things unrelated to KVM.
Obviously, at this point, we've only got one and KVM is the only user so
the delta that I was worried about doesn't actually exist.

So I still prefer calling it "KERNEL" over "GUEST". But I also don't
feel strongly about it and I've said my piece. I won't NAK it one way
or the other.

2024-05-06 05:59:02

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 09/27] KVM: x86: Rename kvm_{g,s}et_msr()* to menifest emulation operations

On 5/2/2024 2:54 AM, Sean Christopherson wrote:
> s/menifest/manifest, though I find the shortlog confusing irrespective of the
> typo. I think this would be more grammatically correct:
>
> KVM: x86: Rename kvm_{g,s}et_msr()* to manifest their emulation operations
>
> but I still find that unnecessarily "fancy". What about this instead?
>
> KVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses
>
> It's not perfect, e.g. it might be read as saying they emulate guest RDMSR and
> WRMSR, but for a shortlog I think that's fine.

Sorry for the delayed reply!
It looks good to me, will change it in next version, thanks!



2024-05-06 07:27:37

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 10/27] KVM: x86: Refine xsave-managed guest register/MSR reset handling

On 5/2/2024 4:40 AM, Sean Christopherson wrote:
> The shortlog is a bit stale now that it only deals with XSTATE. This?
>
> KVM: x86: Zero XSTATE components on INIT by iterating over supported features

OK, will change it.

>
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> Tweak the code a bit to facilitate resetting more xstate components in
>> the future, e.g., CET's xstate-managed MSRs.
>>
>> No functional change intended.
>>
>> Suggested-by: Chao Gao <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 30 +++++++++++++++++++++++++++---
>> 1 file changed, 27 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 10847e1cc413..5a9c07751c0e 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -12217,11 +12217,27 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>> static_branch_dec(&kvm_has_noapic_vcpu);
>> }
>>
>> +#define XSTATE_NEED_RESET_MASK (XFEATURE_MASK_BNDREGS | \
>> + XFEATURE_MASK_BNDCSR)
>> +
>> +static bool kvm_vcpu_has_xstate(unsigned long xfeature)
>> +{
>> + switch (xfeature) {
>> + case XFEATURE_MASK_BNDREGS:
>> + case XFEATURE_MASK_BNDCSR:
>> + return kvm_cpu_cap_has(X86_FEATURE_MPX);
>> + default:
>> + return false;
>> + }
>> +}
>> +
>> void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> {
>> struct kvm_cpuid_entry2 *cpuid_0x1;
>> unsigned long old_cr0 = kvm_read_cr0(vcpu);
>> + DECLARE_BITMAP(reset_mask, 64);
> I vote to use a u64 instead of a bitmask. The resulting cast isn't exactly pretty,
> but it's not all that uncommon, and it's easy enough to make it "safe" by adding
> BUILD_BUG_ON().
>
> On the flip side, using the bitmap_*() APIs for super simple bitwise-OR/AND/TEST
> operations makes the code harder to read.

Make sense.

>
>> unsigned long new_cr0;
>> + unsigned int i;
>>
>> /*
>> * Several of the "set" flows, e.g. ->set_cr0(), read other registers
>> @@ -12274,7 +12290,12 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> kvm_async_pf_hash_reset(vcpu);
>> vcpu->arch.apf.halted = false;
>>
>> - if (vcpu->arch.guest_fpu.fpstate && kvm_mpx_supported()) {
>> + bitmap_from_u64(reset_mask, (kvm_caps.supported_xcr0 |
>> + kvm_caps.supported_xss) &
>> + XSTATE_NEED_RESET_MASK);
>> +
>> + if (vcpu->arch.guest_fpu.fpstate &&
>> + !bitmap_empty(reset_mask, XFEATURE_MAX)) {
>> struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
>>
>> /*
>> @@ -12284,8 +12305,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> if (init_event)
>> kvm_put_guest_fpu(vcpu);
>>
>> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
>> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
>> + for_each_set_bit(i, reset_mask, XFEATURE_MAX) {
>> + if (!kvm_vcpu_has_xstate(i))
>> + continue;
>> + fpstate_clear_xstate_component(fpstate, i);
>> + }
> A few intertwined thoughts:
>
> 1. fpstate is zero allocated, and KVM absolutely relies on that, e.g. KVM doesn't
> manually zero out the XSAVE fields that are preserved on INIT, but zeroed on
> RESET.
>
> 2. That means there is no need to manually clear XSTATE components during RESET,
> as KVM doesn't support standalone RESET, i.e. it's only cleared during vCPU
> creation, when guest FPU state is guaranteed to be '0'.
>
> 3. That makes XSTATE_NEED_RESET_MASK a misnomer, as it's literally the !RESET
> path that is relevant. E.g. it should be XSTATE_CLEAR_ON_INIT_MASK or so.
>
> 4. If we add a helper, then XSTATE_NEED_RESET_MASK is probably unneeded.

Fair enough.
For #4, I still prefer to add all relevant xstate bits in a macro so that the xfeatures_mask initialization line stays short and constant.
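
(A sketch of what such a macro could look like; the name and exact membership,
e.g. whether the CET user bit belongs in it, are assumptions:)

#define XSTATE_CLEAR_ON_INIT_MASK       (XFEATURE_MASK_BNDREGS | \
                                         XFEATURE_MASK_BNDCSR | \
                                         XFEATURE_MASK_CET_USER)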
>
> So, what if we slot in the below (compile tested only) patch as prep work? Then
> this patch becomes:

Thanks! I'll replace this patch with the one you attached.

>
> ---
> arch/x86/kvm/x86.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b441bf61b541..b00730353a28 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12220,6 +12220,8 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> static void kvm_xstate_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
> + u64 xfeatures_mask;
> + int i;
>
> /*
> * Guest FPU state is zero allocated and so doesn't need to be manually
> @@ -12233,16 +12235,20 @@ static void kvm_xstate_reset(struct kvm_vcpu *vcpu, bool init_event)
> * are unchanged. Currently, the only components that are zeroed and
> * supported by KVM are MPX related.
> */
> - if (!kvm_mpx_supported())
> + xfeatures_mask = (kvm_caps.supported_xcr0 | kvm_caps.supported_xss) &
> + (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
> + if (!xfeatures_mask)
> return;
>
> + BUILD_BUG_ON(XFEATURE_MAX > sizeof(xfeatures_mask) * BITS_PER_BYTE);
> +
> /*
> * All paths that lead to INIT are required to load the guest's FPU
> * state (because most paths are buried in KVM_RUN).
> */
> kvm_put_guest_fpu(vcpu);
> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDREGS);
> - fpstate_clear_xstate_component(fpstate, XFEATURE_BNDCSR);
> + for_each_set_bit(i, (unsigned long *)&xfeatures_mask, XFEATURE_MAX)
> + fpstate_clear_xstate_component(fpstate, i);
> kvm_load_guest_fpu(vcpu);
> }
>
>
> base-commit: efca8b27900dfec160b6ba90820fa2ced81de904


2024-05-06 07:32:02

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 13/27] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

On 5/2/2024 4:43 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 9eb5c8dbd4fb..b502d68a2576 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3926,16 +3926,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> }
>> break;
>> case MSR_IA32_XSS:
>> - if (!msr_info->host_initiated &&
>> - !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>> + /*
>> + * If KVM reports support for the XSS MSR, still allow userspace
>> + * to set the default value (0) for this MSR even if guest CPUID
>> + * doesn't support XSAVES.
>> + */
>> + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
>> + !(msr_info->host_initiated && data == 0))
> With my proposed MSR access cleanup[*], I think (hope?) this simply becomes:
>
> if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
> return KVM_MSR_RET_UNSUPPORTED;
>
> with no comment needed as the "host && !data" case is handled in common code.
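
(For context, a sketch of the common-code rule being referred to; the exact form
lives in the linked cleanup series and the names here are assumptions:)

        /* Common MSR access path: a host-initiated write of '0' to an MSR
         * reported as unsupported is silently accepted.
         */
        if (ret == KVM_MSR_RET_UNSUPPORTED &&
            msr_info->host_initiated && !data)
                ret = 0;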

Right, I'll change this part after the cleanup series is merged. Thanks!

>
> [*] https://lore.kernel.org/all/[email protected]


2024-05-06 08:32:09

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 17/27] KVM: x86: Report KVM supported CET MSRs as to-be-saved

On 5/2/2024 6:40 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> Add CET MSRs to the list of MSRs reported to userspace if the feature,
>> i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.
>>
>> SSP can only be read via RDSSP. Even writing it requires destructive and
>> potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
>> SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
>> for the GUEST_SSP field of the VMCS.
>>
>> Suggested-by: Chao Gao <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/include/uapi/asm/kvm_para.h | 1 +
>> arch/x86/kvm/vmx/vmx.c | 2 ++
>> arch/x86/kvm/x86.c | 18 ++++++++++++++++++
>> 3 files changed, 21 insertions(+)
>>
>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
>> index 605899594ebb..9d08c0bec477 100644
>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -58,6 +58,7 @@
>> #define MSR_KVM_ASYNC_PF_INT 0x4b564d06
>> #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
>> #define MSR_KVM_MIGRATION_CONTROL 0x4b564d08
>> +#define MSR_KVM_SSP 0x4b564d09
> We never resolved the conversation from v6[*], but I still agree with Maxim's
> view that defining a synthetic MSR, which "steals" an MSR from KVM's MSR address
> space, is a bad idea.
>
> And I still also think that KVM_SET_ONE_REG is the best way forward. Completely
> untested, but I think this is all that is needed to wire up KVM_{G,S}ET_ONE_REG
> to support MSRs, and carve out room for 250+ other register types, plus room for
> more future stuff as needed.

Got your point now.

>
> We'll still need a KVM-defined MSR for SSP, but it can be KVM internal, not uAPI,
> e.g. the "index" exposed to userspace can simply be '0' for a register type of
> KVM_X86_REG_SYNTHETIC_MSR, and then the translated internal index can be any
> value that doesn't conflict.

Let me try to understand it: for your reference code below, id.type is meant to separate
the normal (HW-defined) MSR namespace from the synthetic MSR namespace, right? For the
latter, IIUC KVM still needs to expose the index within the synthetic namespace so that
userspace can read/write the intended MSRs, though of course not via the existing uAPI.
But you said the "index" exposed to userspace can simply be '0' in this case; then how
are the synthetic MSRs distinguished between userspace and KVM? And how can userspace
be aware of the synthetic MSR index allocation in KVM?

Per your comments in [*], if we can use bits 39:32 to identify MSR classes/types, then
under each class/type or namespace we still need to define the relevant index for each
synthetic MSR.

[*]: https://lore.kernel.org/all/[email protected]/
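
(To make the question concrete, a hypothetical userspace snippet under Sean's
sketch, where the register type sits in bits 39:32 of reg.id and the synthetic
index for SSP is assumed to be '0':)

        u64 ssp_val;
        struct kvm_one_reg reg = {
                .id   = ((u64)KVM_X86_REG_SYNTHETIC_MSR << 32) | 0 /* SSP */,
                .addr = (u64)&ssp_val,
        };

        ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);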

>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index ef11aa4cab42..ca2a47a85fa1 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -410,6 +410,16 @@ struct kvm_xcrs {
> __u64 padding[16];
> };
>
> +#define KVM_X86_REG_MSR (1 << 2)
> +#define KVM_X86_REG_SYNTHETIC_MSR (1 << 3)
> +
> +struct kvm_x86_reg_id {
> + __u32 index;
> + __u8 type;
> + __u8 rsvd;
> + __u16 rsvd16;
> +};
> +
> #define KVM_SYNC_X86_REGS (1UL << 0)
> #define KVM_SYNC_X86_SREGS (1UL << 1)
> #define KVM_SYNC_X86_EVENTS (1UL << 2)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 47d9f03b7778..53f2b43b4651 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2244,6 +2244,30 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
> return kvm_set_msr_ignored_check(vcpu, index, *data, true);
> }
>
> +static int kvm_get_one_msr(struct kvm_vcpu *vcpu, u32 msr, u64 __user *value)
> +{
> + u64 val;
> + int r;
> +
> + r = do_get_msr(vcpu, msr, &val);
> + if (r)
> + return r;
> +
> + if (put_user(val, value))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static int kvm_set_one_msr(struct kvm_vcpu *vcpu, u32 msr, u64 __user *value)
> +{
> + u64 val;
> +
> + if (get_user(val, value))
> + return -EFAULT;
> +
> + return do_set_msr(vcpu, msr, &val);
> +}
> +
> #ifdef CONFIG_X86_64
> struct pvclock_clock {
> int vclock_mode;
> @@ -5976,6 +6000,39 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> srcu_read_unlock(&vcpu->kvm->srcu, idx);
> break;
> }
> + case KVM_GET_ONE_REG:
> + case KVM_SET_ONE_REG: {
> + struct kvm_x86_reg_id id;
> + struct kvm_one_reg reg;
> + u64 __user *value;
> +
> + r = -EFAULT;
> + if (copy_from_user(&reg, argp, sizeof(reg)))
> + break;
> +
> + r = -EINVAL;
> + memcpy(&id, &reg.id, sizeof(id));
> + if (id.rsvd || id.rsvd16)
> + break;
> +
> + if (id.type != KVM_X86_REG_MSR &&
> + id.type != KVM_X86_REG_SYNTHETIC_MSR)
> + break;
> +
> + if (id.type == KVM_X86_REG_SYNTHETIC_MSR) {
> + id.type = KVM_X86_REG_MSR;
> + r = kvm_translate_synthetic_msr(&id.index);
> + if (r)
> + break;
> + }
> +
> + value = u64_to_user_ptr(reg.addr);
> + if (ioctl == KVM_GET_ONE_REG)
> + r = kvm_get_one_msr(vcpu, id.index, value);
> + else
> + r = kvm_set_one_msr(vcpu, id.index, value);
> + break;
> + }
> case KVM_TPR_ACCESS_REPORTING: {
> struct kvm_tpr_access_ctl tac;
>
>


2024-05-06 08:41:31

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 21/27] KVM: x86: Save and reload SSP to/from SMRAM

On 5/2/2024 6:50 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
>> behavior when the guest enters/leaves SMM mode, i.e., saves registers to
>> SMRAM at the entry to SMM and reloads them at the exit from SMM. Per the
>> SDM, SSP is one such register on 64-bit arch, so add support for SSP.
>> Note, on 32-bit arch, SSP is not defined in SMRAM, so fail 32-bit CET
>> guest launch.
>>
>> Suggested-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> Reviewed-by: Maxim Levitsky <[email protected]>
>> ---
>> arch/x86/kvm/cpuid.c | 11 +++++++++++
>> arch/x86/kvm/smm.c | 8 ++++++++
>> arch/x86/kvm/smm.h | 2 +-
>> 3 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 2bb1931103ad..c0e13040e35b 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -149,6 +149,17 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
>> if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
>> return -EINVAL;
>> }
>> + /*
>> + * Prevent 32-bit guest launch if shadow stack is exposed, as SSP
>> + * state is not defined for 32-bit SMRAM.
> Why? Lack of save/restore for SSP on 32-bit guests is a gap in Intel's
> architecture, I don't see why KVM should diverge from hardware. I.e. just do
> nothing for SSP on SMI/RSM, because that's exactly what the architecture says
> will happen.

OK, will remove the check. I just wanted to avoid any undocumented hole if SHSTK is
exposed in CPUID.

>
>> + */
>> + best = cpuid_entry2_find(entries, nent, 0x80000001,
>> + KVM_CPUID_INDEX_NOT_SIGNIFICANT);
>> + if (best && !(best->edx & F(LM))) {
>> + best = cpuid_entry2_find(entries, nent, 0x7, 0);
>> + if (best && (best->ecx & F(SHSTK)))
>> + return -EINVAL;
>> + }
>>
>> /*
>> * Exposing dynamic xfeatures to the guest requires additional
>> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
>> index 45c855389ea7..7aac9c54c353 100644
>> --- a/arch/x86/kvm/smm.c
>> +++ b/arch/x86/kvm/smm.c
>> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
>> enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>>
>> smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
>> +
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
>> + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
>> + vcpu->kvm);
>> }
>> #endif
>>
>> @@ -564,6 +568,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
>> static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
>> ctxt->interruptibility = (u8)smstate->int_shadow;
>>
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
>> + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
>> + vcpu->kvm);
>
> This should synthesize triple-fault, not WARN and kill the VM, as the value to
> be restored is guest controlled (the guest can scribble SMRAM from within the
> SMI handler).
>
> At that point, I would just synthesize triple-fault for the read path too.

Ah, yes, will fail with triple-fault in next version, thanks!

>


2024-05-06 08:49:22

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 22/27] KVM: VMX: Set up interception for CET MSRs

On 5/2/2024 7:07 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> @@ -7767,6 +7771,41 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
>> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>> }
>>
>> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
>> +{
>> + bool incpt;
>> +
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
>> + incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
>> +
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
>> + MSR_TYPE_RW, incpt);
>> + vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
>> + MSR_TYPE_RW, incpt);
>> + if (!incpt)
>> + return;
> Hmm, I find this unnecessarily confusing and brittle. E.g. in the unlikely
> event more CET stuff comes along, this lurking return could cause problems.
>
> Why not handle S_CET and U_CET in a single common path? IMO, this is less error
> prone, and more clearly captures the relationship between S/U_CET, SHSTK, and IBT.
> Updating MSR intercepts is not a hot path, so the overhead of checking guest CPUID
> multiple times should be a non-issue. And eventually KVM should effectively cache
> all of those lookups, i.e. the cost will be negligible.
>
> bool incpt;
>
> if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
>
> vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
> MSR_TYPE_RW, incpt);
> vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
> MSR_TYPE_RW, incpt);
> vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
> MSR_TYPE_RW, incpt);
> vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
> MSR_TYPE_RW, incpt);
> vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> MSR_TYPE_RW, incpt);
> }
>
> if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT) &&
> !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
>
> vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> MSR_TYPE_RW, incpt);
> vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> MSR_TYPE_RW, incpt);
> }

It looks fine to me, will apply it, thanks!



2024-05-06 09:19:58

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/2/2024 7:15 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
>> kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
>> if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
>> kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
>> + /*
>> + * Don't use boot_cpu_has() to check availability of IBT because the
>> + * feature bit is cleared in boot_cpu_data when ibt=off is applied
>> + * in host cmdline.
> I'm not convinced this is a good reason to diverge from the host kernel. E.g.
> PCID and many other features honor the host setup, I don't see what makes IBT
> special.

This is mostly based on our user experience and a hypothesis about cloud computing:
when we evolve host kernels, we constantly encounter issues when kernel IBT is on,
so we have to disable kernel IBT by adding ibt=off. But we still need to test the CET
features in a VM; if we simply refer to the host boot cpuid data, then IBT cannot be
enabled in the VM, which makes the CET features incomplete in the guest.

I guess cloud computing could run into a similar dilemma. In that case, the tenant
cannot benefit from the feature just because of a host SW problem. I know that KVM
currently honors host feature configuration for everything except LA57, but in the
CET case there could be divergence wrt honoring the host configuration, as long as
there's no quirk for the feature.

But I think the issue is still open for discussion...

>
> LA57 is special because it's entirely reasonable, likely even, for a host to
> only want to use 48-bit virtual addresses, but still want to let the guest enable
> LA57.
>
>> @@ -4934,6 +4935,14 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>
>> vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */
>>
>> + if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
>> + vmcs_writel(GUEST_SSP, 0);
>> + vmcs_writel(GUEST_S_CET, 0);
>> + vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
>> + } else if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
>> + vmcs_writel(GUEST_S_CET, 0);
>> + }
> Similar to my comments about MSR interception, I think it would be better to
> explicitly handle the "common" field. At first glance, code like the above makes
> it look like IBT is mutually exclusive with SHSTK, e.g. a reader that isn't
> looking closely could easily miss that both paths write GUEST_S_CET.

Sure, thanks!

>
> if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> vmcs_writel(GUEST_SSP, 0);
> vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
> }
> if (kvm_cpu_cap_has(X86_FEATURE_IBT) ||
> kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> vmcs_writel(GUEST_S_CET, 0);


2024-05-06 09:20:23

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 25/27] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

On 5/2/2024 7:19 AM, Sean Christopherson wrote:
> The shortlog is wrong, or perhaps just misleading. This isn't "for event error_code
> delivery to L1", it's for event injection error code delivery for _L2_, i.e. from L1
> to L2.
>
> The shortlog should be something more like:
>
> KVM: nVMX: Virtualize NO_HW_ERROR_CODE_CC for L1 event injection to L2

OK, will change it, thanks!



2024-05-06 09:26:15

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 26/27] KVM: nVMX: Enable CET support for nested guest

On 5/2/2024 7:23 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> @@ -2438,6 +2460,30 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
>> }
>> }
>>
>> +static inline void cet_vmcs_fields_get(struct kvm_vcpu *vcpu, u64 *ssp,
>> + u64 *s_cet, u64 *ssp_tbl)
>> +{
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
>> + *ssp = vmcs_readl(GUEST_SSP);
>> + *s_cet = vmcs_readl(GUEST_S_CET);
>> + *ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>> + } else if (guest_can_use(vcpu, X86_FEATURE_IBT)) {
>> + *s_cet = vmcs_readl(GUEST_S_CET);
>> + }
> Same comments about accessing S_CET, please do so in a dedicated path.

Will change it, thanks!

>
>> +}
>> +
>> +static inline void cet_vmcs_fields_put(struct kvm_vcpu *vcpu, u64 ssp,
>> + u64 s_cet, u64 ssp_tbl)
> This should probably use "set" instead of "put". I can't think of a single case
> where KVM uses "put" to describe writing state, e.g. "put" is always used when
> putting a reference or unloading state.

Yes, "put" is not proper in this case, will change it.


>
>> +{
>> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
>> + vmcs_writel(GUEST_SSP, ssp);
>> + vmcs_writel(GUEST_S_CET, s_cet);
>> + vmcs_writel(GUEST_INTR_SSP_TABLE, ssp_tbl);
>> + } else if (guest_can_use(vcpu, X86_FEATURE_IBT)) {
>> + vmcs_writel(GUEST_S_CET, s_cet);
>> + }
> And here.

OK.



2024-05-06 09:26:57

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 27/27] KVM: x86: Don't emulate instructions guarded by CET

On 5/2/2024 7:24 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> Don't emulate the branch instructions, e.g., CALL/RET/JMP etc., when CET
>> is active in the guest; instead, return KVM_INTERNAL_ERROR_EMULATION to
>> userspace to handle it.
>>
>> KVM doesn't emulate CPU behaviors to check CET protected state while
>> emulating guest instructions; instead, it stops emulation on detecting
>> that the instructions in process are CET protected. By doing so, it can
>> avoid generating a bogus #CP in the guest and prevent CET protected
>> execution flow subversion from the guest side.
>>
>> Suggested-by: Chao Gao <[email protected]>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
> This should be ordered before CET is exposed to userspace, e.g. so that KVM's
> ABI is well defined when CET support becomes usable.

Sure, thanks!



2024-05-06 09:33:48

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 00/27] Enable CET Virtualization

On 5/2/2024 7:27 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> Sean Christopherson (4):
>> x86/fpu/xstate: Always preserve non-user xfeatures/flags in
>> __state_perm
>> KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
>> KVM: x86: Report XSS as to-be-saved if there are supported features
>> KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
>>
>> Yang Weijiang (23):
>> x86/fpu/xstate: Refine CET user xstate bit enabling
>> x86/fpu/xstate: Add CET supervisor mode state support
>> x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
>> x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
>> x86/fpu/xstate: Create guest fpstate with guest specific config
>> x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal
>> fpstate
>> KVM: x86: Rename kvm_{g,s}et_msr()* to menifest emulation operations
>> KVM: x86: Refine xsave-managed guest register/MSR reset handling
>> KVM: x86: Add kvm_msr_{read,write}() helpers
>> KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
>> KVM: x86: Initialize kvm_caps.supported_xss
>> KVM: x86: Add fault checks for guest CR4.CET setting
>> KVM: x86: Report KVM supported CET MSRs as to-be-saved
>> KVM: VMX: Introduce CET VMCS fields and control bits
>> KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT
>> enabled"
>> KVM: VMX: Emulate read and write to CET MSRs
>> KVM: x86: Save and reload SSP to/from SMRAM
>> KVM: VMX: Set up interception for CET MSRs
>> KVM: VMX: Set host constant supervisor states to VMCS fields
>> KVM: x86: Enable CET virtualization for VMX and advertise to userspace
>> KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery
>> to L1
>> KVM: nVMX: Enable CET support for nested guest
>> KVM: x86: Don't emulate instructions guarded by CET
> A decent number of comments, but almost all of them are quite minor. The big
> open is how to handle save/restore of SSP from userspace.
>
> Instead of spinning a full v10, maybe send an RFC for KVM_{G,S}ET_ONE_REG idea?

OK, I'll send an RFC patch after the relevant discussion is settled.

> That will make it easier to review, and if you delay v11 a bit, I should be able
> to get various series applied that have minor conflicts/dependencies, e.g. the
> MSR access and the kvm_host series.
I can wait until those series land in the kvm-x86 tree.
Thanks for your review and comments!



2024-05-06 09:42:16

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/2/2024 7:34 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> @@ -665,7 +665,7 @@ void kvm_set_cpu_caps(void)
>> F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
>> F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
>> F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
>> - F(SGX_LC) | F(BUS_LOCK_DETECT)
>> + F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
>> );
>> /* Set LA57 based on hardware capability. */
>> if (cpuid_ecx(7) & F(LA57))
>> @@ -683,7 +683,8 @@ void kvm_set_cpu_caps(void)
>> F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
>> F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
>> F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
>> - F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
>> + F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
>> + F(IBT)
>> );
> ...
>
>> @@ -7977,6 +7993,18 @@ static __init void vmx_set_cpu_caps(void)
>>
>> if (cpu_has_vmx_waitpkg())
>> kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
>> +
>> + /*
>> + * Disable CET if unrestricted_guest is unsupported as KVM doesn't
>> + * enforce CET HW behaviors in emulator. On platforms with
>> + * VMX_BASIC[bit56] == 0, inject #CP at VMX entry with error code
>> + * fails, so disable CET in this case too.
>> + */
>> + if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
>> + !cpu_has_vmx_basic_no_hw_errcode()) {
>> + kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
>> + kvm_cpu_cap_clear(X86_FEATURE_IBT);
>> + }
> Oh! Almost missed it. This patch should explicitly kvm_cpu_cap_clear()
> X86_FEATURE_SHSTK and X86_FEATURE_IBT. We *know* there are upcoming AMD CPUs
> that support at least SHSTK, so enumerating support for common code would yield
> a version of KVM that incorrectly advertises support for SHSTK.
>
> I hope to land both Intel and AMD virtualization in the same kernel release, but
> there are no guarantees that will happen. And explicitly clearing both SHSTK and
> IBT would guard against IBT showing up in some future AMD CPU in advance of KVM
> gaining full support.

Let me be clear on this: you want me to disable SHSTK/IBT with kvm_cpu_cap_clear() unconditionally
for now in this patch, wait until both AMD's SVM patches and this series are ready for guest CET,
and then remove the disabling code in this patch for the final merge, am I right?



2024-05-06 16:55:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Mon, May 06, 2024, Weijiang Yang wrote:
> On 5/2/2024 7:15 AM, Sean Christopherson wrote:
> > On Sun, Feb 18, 2024, Yang Weijiang wrote:
> > > @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
> > > kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
> > > if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
> > > kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> > > + /*
> > > + * Don't use boot_cpu_has() to check availability of IBT because the
> > > + * feature bit is cleared in boot_cpu_data when ibt=off is applied
> > > + * in host cmdline.
> > I'm not convinced this is a good reason to diverge from the host kernel. E.g.
> > PCID and many other features honor the host setup, I don't see what makes IBT
> > special.
>
> This is mostly based on our user experience and the hypothesis for cloud
> computing: When we evolve host kernels, we constantly encounter issues when
> kernel IBT is on, so we have to disable kernel IBT by adding ibt=off. But we
> need to test the CET features in VM, if we just simply refer to host boot
> cpuid data, then IBT cannot be enabled in VM which makes CET features
> incomplete in guest.
>
> I guess in cloud computing, it could run into similar dilemma. In this case,
> the tenant cannot benefit the feature just because of host SW problem.

Hmm, but such issues should be found before deploying a kernel to production.

The one scenario that comes to mind where I can see someone wanting to disable
IBT would be running a out-of-tree and/or third party module.

> I know currently KVM except LA57 always honors host feature configurations,
> but in CET case, there could be divergence wrt honoring host configuration as
> long as there's no quirk for the feature.
>
> But I think the issue is still open for discussion...

I'm not totally opposed to the idea.

Somewhat off-topic, the existing LA57 code upon which the IBT check is based is
flawed, as it doesn't account for the max supported CPUID leaf. On Intel CPUs,
that could result in a false positive due to CPUID (stupidly) returning the value
of the last implemented CPUID leaf, not zeros. In practice, it doesn't cause
problems because CPUID.0x7 has been supported since forever, but it's still a
bug.
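
(For illustration, the missing guard might look like this sketch, using the
existing LA57 check as the example; cpuid_eax(0) returns the max supported
basic leaf:)

        /* Only trust CPUID.0x7 output if the leaf actually exists. */
        if (cpuid_eax(0) >= 7 && (cpuid_ecx(7) & F(LA57)))
                kvm_cpu_cap_set(X86_FEATURE_LA57);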

Hmm, actually, __kvm_cpu_cap_mask() has the exact same bug. And that's much less
theoretical, e.g. kvm_cpu_cap_init_kvm_defined() in particular is likely to cause
problems at some point.

And I really don't like that KVM open codes calls to cpuid_<reg>() for these
"raw" features. One option would be to and helpers to change this:

if (cpuid_edx(7) & F(IBT))
kvm_cpu_cap_set(X86_FEATURE_IBT);

to this:

if (raw_cpuid_has(X86_FEATURE_IBT))
kvm_cpu_cap_set(X86_FEATURE_IBT);

but I think we can do better, and harden the CPUID code in the process. If we
do kvm_cpu_cap_set() _before_ kvm_cpu_cap_mask(), then incorporating the raw host
CPUID will happen automagically, as __kvm_cpu_cap_mask() will clear bits that
aren't in host CPUID.

The most obvious approach would be to simply call kvm_cpu_cap_set() before
kvm_cpu_cap_mask(), but that's more than a bit confusing, and would open the door
for potential bugs due to calling kvm_cpu_cap_set() after kvm_cpu_cap_mask().
And detecting such bugs would be difficult, because there are features that KVM
fully emulates, i.e. _must_ be stuffed after kvm_cpu_cap_mask().

Instead of calling kvm_cpu_cap_set() directly, we can take advantage of the fact
that the F() masks are fed into kvm_cpu_cap_mask(), i.e. are naturally processed
before the corresponding kvm_cpu_cap_mask().

If we add an array to track which capabilities have been initialized, then F()
can WARN on improper usage. That would allow detecting bad "raw" usage, *and*
would detect (some) scenarios where a F() is fed into the wrong leaf, e.g. if
we added F(LA57) to CPUID_7_EDX instead of CPUID_7_ECX.

#define F(name) \
({ \
u32 __leaf = __feature_leaf(X86_FEATURE_##name); \
\
BUILD_BUG_ON(__leaf >= ARRAY_SIZE(kvm_cpu_cap_initialized)); \
WARN_ON_ONCE(kvm_cpu_cap_initialized[__leaf]); \
\
feature_bit(name); \
})

/*
* Raw Feature - For features that KVM supports based purely on raw host CPUID,
* i.e. that KVM virtualizes even if the host kernel doesn't use the feature.
* Simply force set the feature in KVM's capabilities, raw CPUID support will
* be factored in by kvm_cpu_cap_mask().
*/
#define RAW_F(name) \
({ \
kvm_cpu_cap_set(X86_FEATURE_##name); \
F(name); \
})
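
(For illustration, a hypothetical call site under this scheme; whether IBT
actually qualifies as a "raw" feature is exactly what's being debated in this
thread:)

        kvm_cpu_cap_mask(CPUID_7_EDX,
                F(AVX512_FP16) | F(FLUSH_L1D) | RAW_F(IBT)
        );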

Assuming testing doesn't poke a hole in my idea, I'll post a small series.

2024-05-06 17:08:14

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Mon, 2024-05-06 at 17:19 +0800, Yang, Weijiang wrote:
> On 5/2/2024 7:15 AM, Sean Christopherson wrote:
> > On Sun, Feb 18, 2024, Yang Weijiang wrote:
> > > @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
> > >                 kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
> > >         if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
> > >                 kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> > > +       /*
> > > +        * Don't use boot_cpu_has() to check availability of IBT because
> > > the
> > > +        * feature bit is cleared in boot_cpu_data when ibt=off is applied
> > > +        * in host cmdline.
> > I'm not convinced this is a good reason to diverge from the host kernel. 
> > E.g.
> > PCID and many other features honor the host setup, I don't see what makes
> > IBT
> > special.
>
> This is mostly based on our user experience and the hypothesis for cloud
> computing:
> When we evolve host kernels, we constantly encounter issues when kernel IBT is
> on,
> so we have to disable kernel IBT by adding ibt=off. But we need to test the
> CET features
> in VM, if we just simply refer to host boot cpuid data, then IBT cannot be
> enabled in
> VM which makes CET features incomplete in guest.
>
> I guess in cloud computing, it could run into similar dilemma. In this case,
> the tenant
> cannot benefit the feature just because of host SW problem. I know currently
> KVM
> except LA57 always honors host feature configurations, but in CET case, there
> could be
> divergence wrt honoring host configuration as long as there's no quirk for the
> feature.
>
> But I think the issue is still open for discussion...

I think the back and forth I remembered was actually around SGX IBT, but I did
find this thread:
https://lore.kernel.org/lkml/[email protected]/

Disabling kernel IBT enforcement without disabling KVM IBT seems worthwhile. But
the solution is not to not honor host support. It is to have kernel IBT not
clear the feature flag and instead clear something else. This can be done
independently of the KVM series.

2024-05-06 23:35:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Mon, May 06, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-05-06 at 17:19 +0800, Yang, Weijiang wrote:
> > On 5/2/2024 7:15 AM, Sean Christopherson wrote:
> > > On Sun, Feb 18, 2024, Yang Weijiang wrote:
> > > > @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
> > > >                 kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
> > > >         if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
> > > >                 kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> > > > +       /*
> > > > +        * Don't use boot_cpu_has() to check availability of IBT because
> > > > the
> > > > +        * feature bit is cleared in boot_cpu_data when ibt=off is applied
> > > > +        * in host cmdline.
> > > I'm not convinced this is a good reason to diverge from the host kernel. 
> > > E.g. PCID and many other features honor the host setup, I don't see what
> > > makes IBT special.
> >
> > This is mostly based on our user experience and the hypothesis for cloud
> > computing: When we evolve host kernels, we constantly encounter issues when
> > kernel IBT is on, so we have to disable kernel IBT by adding ibt=off. But
> > we need to test the CET features in VM, if we just simply refer to host
> > boot cpuid data, then IBT cannot be enabled in VM which makes CET features
> > incomplete in guest.
> >
> > I guess in cloud computing, it could run into similar dilemma. In this
> > case, the tenant cannot benefit the feature just because of host SW
> > problem. I know currently KVM except LA57 always honors host feature
> > configurations, but in CET case, there could be divergence wrt honoring
> > host configuration as long as there's no quirk for the feature.
> >
> > But I think the issue is still open for discussion...
>
> I think the back and forth I remembered was actually around SGX IBT, but I did
> find this thread:
> https://lore.kernel.org/lkml/[email protected]/
>
> Disabling kernel IBT enforcement without disabling KVM IBT seems worthwhile. But
> the solution is not to not honor host support. It is to have kernel IBT not
> clear the feature flag and instead clear something else. This can be done
> independently of the KVM series.

Hmm, I don't disagree, but I'm not sure it makes sense to wait on that discussion
to exempt IBT from the "it must be supported in the host" rule. I suspect that
tweaking the handling X86_FEATURE_IBT of will open a much larger can of worms,
as overhauling feature flag handling is very much on the x86 maintainers todo
list.

IMO, the odds of there being a hardware bug that necessitates hard disabling IBT
are lower than the odds of KVM support for CET landing well before the feature
stuff is sorted out.

2024-05-06 23:54:21

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Mon, 2024-05-06 at 16:33 -0700, Sean Christopherson wrote:
>
> Hmm, I don't disagree, but I'm not sure it makes sense to wait on that
> discussion
> to exempt IBT from the "it must be supported in the host" rule.  I suspect
> that
> tweaking the handling X86_FEATURE_IBT of will open a much larger can of worms,
> as overhauling feature flag handling is very much on the x86 maintainers todo
> list.
>
> IMO, the odds of there being a hardware bug that necessitates hard disabling
> IBT
> are lower than the odds of KVM support for CET landing well before the feature
> stuff is sorted out.

I see a few reasons to tie to the host cpu feature:
1. Disabling it because of some HW concern. I agree with your assessment on the
chances.
2. Having the cpu feature be disabled by some kernel parameter, and having KVM
try to use it without the necessary FPU or other host support. It could cause
lots of problems, guest->host DOS, etc.
3. Confusion for the user about which CPU features are actually in use on the
system. There is a fair chance for compatibility issues to show up with CET.
Today there is clearcpuid. If a user was having issues with CET and just wanted
to turn it off, they might use clearcpuid or something else to just disable CET.
Then boot a VM and find it was still enabled. For shadow stack there is also
nousershstk.

So, my two cents, it's just all easier to reason about for everyone if you tie
it to host cpu features.

I don't immediately see what trouble there would be in giving kernel IBT a disable
parameter that doesn't touch X86_FEATURE_IBT at some point in the future. Sorry
if I'm missing the point. So like, ibt=off disables all IBT including kernel
IBT, kernel_ibt=off disables kernel IBT enforcement via a global bool.
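
(For illustration, a sketch of the hypothetical kernel_ibt= parameter described
above; neither the parameter nor the variable exists today:)

static bool __ro_after_init kernel_ibt_enforced = true;

static int __init kernel_ibt_setup(char *str)
{
        /* "kernel_ibt=off" disables enforcement without clearing
         * X86_FEATURE_IBT, so KVM can still virtualize IBT.
         */
        if (str && !strcmp(str, "off"))
                kernel_ibt_enforced = false;
        return 1;
}
__setup("kernel_ibt=", kernel_ibt_setup);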

2024-05-07 02:38:15

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/7/2024 12:54 AM, Sean Christopherson wrote:
> On Mon, May 06, 2024, Weijiang Yang wrote:
>> On 5/2/2024 7:15 AM, Sean Christopherson wrote:
>>> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>>>> @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
>>>> kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
>>>> if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
>>>> kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
>>>> + /*
>>>> + * Don't use boot_cpu_has() to check availability of IBT because the
>>>> + * feature bit is cleared in boot_cpu_data when ibt=off is applied
>>>> + * in host cmdline.
>>> I'm not convinced this is a good reason to diverge from the host kernel. E.g.
>>> PCID and many other features honor the host setup, I don't see what makes IBT
>>> special.
>> This is mostly based on our user experience and the hypothesis for cloud
>> computing: When we evolve host kernels, we constantly encounter issues when
>> kernel IBT is on, so we have to disable kernel IBT by adding ibt=off. But we
>> need to test the CET features in VM, if we just simply refer to host boot
>> cpuid data, then IBT cannot be enabled in VM which makes CET features
>> incomplete in guest.
>>
>> I guess in cloud computing, it could run into similar dilemma. In this case,
>> the tenant cannot benefit the feature just because of host SW problem.
> Hmm, but such issues should be found before deploying a kernel to production.
>
> The one scenario that comes to mind where I can see someone wanting to disable
> IBT would be running a out-of-tree and/or third party module.

Yes, the developers may neglect IBT violations in modules/kernel components and deploy
them; in that case, the host side has to either fix the issues or disable IBT.

>
>> I know currently KVM except LA57 always honors host feature configurations,
>> but in CET case, there could be divergence wrt honoring host configuration as
>> long as there's no quirk for the feature.
>>
>> But I think the issue is still open for discussion...
> I'm not totally opposed to the idea.
>
> Somewhat off-topic, the existing LA57 code upon which the IBT check is based is
> flawed, as it doesn't account for the max supported CPUID leaf. On Intel CPUs,
> that could result in a false positive due to CPUID (stupidly) returning the value
> of the last implemented CPUID leaf, not zeros. In practice, it doesn't cause
> problems because CPUID.0x7 has been supported since forever, but it's still a
> bug.
>
> Hmm, actually, __kvm_cpu_cap_mask() has the exact same bug. And that's much less
> theoretical, e.g. kvm_cpu_cap_init_kvm_defined() in particular is likely to cause
> problems at some point.
>
> And I really don't like that KVM open codes calls to cpuid_<reg>() for these
> "raw" features. One option would be to and helpers to change this:
>
> if (cpuid_edx(7) & F(IBT))
> kvm_cpu_cap_set(X86_FEATURE_IBT);
>
> to this:
>
> if (raw_cpuid_has(X86_FEATURE_IBT))
> kvm_cpu_cap_set(X86_FEATURE_IBT);
>
> but I think we can do better, and harden the CPUID code in the process. If we
> do kvm_cpu_cap_set() _before_ kvm_cpu_cap_mask(), then incorporating the raw host
> CPUID will happen automagically, as __kvm_cpu_cap_mask() will clear bits that
> aren't in host CPUID.
>
> The most obvious approach would be to simply call kvm_cpu_cap_set() before
> kvm_cpu_cap_mask(), but that's more than a bit confusing, and would open the door
> for potential bugs due to calling kvm_cpu_cap_set() after kvm_cpu_cap_mask().
> And detecting such bugs would be difficult, because there are features that KVM
> fully emulates, i.e. _must_ be stuffed after kvm_cpu_cap_mask().
>
> Instead of calling kvm_cpu_cap_set() directly, we can take advantage of the fact
> that the F() masks are fed into kvm_cpu_cap_mask(), i.e. are naturally processed
> before the corresponding kvm_cpu_cap_mask().
>
> If we add an array to track which capabilities have been initialized, then F()
> can WARN on improper usage. That would allow detecting bad "raw" usage, *and*
> would detect (some) scenarios where an F() is fed into the wrong leaf, e.g. if
> we added F(LA57) to CPUID_7_EDX instead of CPUID_7_ECX.
>
> #define F(name)						\
> ({								\
> 	u32 __leaf = __feature_leaf(X86_FEATURE_##name);	\
> 								\
> 	BUILD_BUG_ON(__leaf >= ARRAY_SIZE(kvm_cpu_cap_initialized)); \
> 	WARN_ON_ONCE(kvm_cpu_cap_initialized[__leaf]);		\
> 								\
> 	feature_bit(name);					\
> })
>
> /*
>  * Raw Feature - For features that KVM supports based purely on raw host CPUID,
>  * i.e. that KVM virtualizes even if the host kernel doesn't use the feature.
>  * Simply force set the feature in KVM's capabilities, raw CPUID support will
>  * be factored in by kvm_cpu_cap_mask().
>  */
> #define RAW_F(name)						\
> ({								\
> 	kvm_cpu_cap_set(X86_FEATURE_##name);			\
> 	F(name);						\
> })
>
> Assuming testing doesn't poke a hole in my idea, I'll post a small series.

Fancy enough! But I like the idea :-)
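
For context, a minimal sketch of the tracking that Sean's F()/RAW_F() macros
assume. The array itself and the point where a leaf gets marked as finalized
are not shown in the snippets above, so the placement here is an assumption:

	/*
	 * Sketch only: one flag per capability word, recording which leafs
	 * have been through kvm_cpu_cap_mask().  F() then WARNs if a mask is
	 * built for an already-finalized leaf, catching late misuse.
	 */
	static bool kvm_cpu_cap_initialized[NR_KVM_CPU_CAPS];

	static __always_inline void kvm_cpu_cap_mask(enum cpuid_leafs leaf, u32 mask)
	{
		/*
		 * Mark the leaf as finalized; @mask was built via F()/RAW_F()
		 * before this call, so the WARN above can't misfire here.
		 */
		kvm_cpu_cap_initialized[leaf] = true;

		kvm_cpu_caps[leaf] &= mask;

		/* __kvm_cpu_cap_mask() ANDs KVM's caps with raw host CPUID. */
		__kvm_cpu_cap_mask(leaf);
	}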



2024-05-07 14:24:32

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Mon, May 06, 2024, Rick P Edgecombe wrote:
> I don't immediately see what trouble will be in giving kernel IBT a disable
> parameter that doesn't touch X86_FEATURE_IBT at some point in the future.

Keeping X86_FEATURE_IBT set will result in "ibt" being reported in /proc/cpuinfo,
i.e. will mislead userspace into thinking IBT is supported and fully enabled by
the kernel. For a security feature, that's a pretty big issue.

To fudge around that, we could add a synthetic feature flag to let the kernel
tell KVM whether or not it's safe to virtualize IBT, but I don't see what value
that adds over KVM checking raw host CPUID.

2024-05-07 14:46:13

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Tue, 2024-05-07 at 07:21 -0700, Sean Christopherson wrote:
>
> Keeping X86_FEATURE_IBT set will result in "ibt" being reported in
> /proc/cpuinfo,
> i.e. will mislead userspace into thinking IBT is supported and fully enabled
> by
> the kernel.  For a security feature, that's a pretty big issue.

Since the beginning, if you don't configure kernel IBT in Kconfig but the HW
supports it, "ibt" will appear in /proc/cpuinfo. It never was a reliable
indicator of kernel IBT enforcement. It is just an indicator of whether the IBT
feature is usable. I think tying kernel IBT enforcement to the CPU feature is
wrong. But if you disable the HW feature, it makes sense that the enforcement
would be disabled.

CET is something that requires a fair amount of SW enablement. SW needs to do
things in special ways or things will go wrong. So whether IBT is in use and
whether it is supported by the HW are useful to maintain as separate concepts.

>
> To fudge around that, we could add a synthetic feature flag to let the kernel
> tell KVM whether or not it's safe to virtualize IBT, but I don't see what
> value
> that adds over KVM checking raw host CPUID.

A synthetic feature flag for kernel IBT seems reasonable to me. It's what I
suggested on that thread I linked earlier. But Peterz was advocating for a bool.
How enforcement would be exposed would just be dmesg, I guess. Having a new
feature flag still makes sense to me. Maybe he could be convinced.

2024-05-07 15:08:40

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Tue, May 07, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-05-07 at 07:21 -0700, Sean Christopherson wrote:
> >
> > Keeping X86_FEATURE_IBT set will result in "ibt" being reported in
> > /proc/cpuinfo,
> > i.e. will mislead userspace into thinking IBT is supported and fully enabled
> > by
> > the kernel.  For a security feature, that's a pretty big issue.
>
> Since the beginning, if you don't configure kernel IBT in Kconfig but the HW
> supports it, "ibt" will appear in /proc/cpuinfo. It never was a reliable
> indicator of kernel IBT enforcement.

Ah, good to know.

> It is just an indicator of whether the IBT feature is usable.

Does ibt=off make IBT unusable for userspace? Huh. Looking at the #CP handler,
I take it userspace support for IBT hasn't landed yet?

> I think tying kernel IBT enforcement to the CPU feature is wrong. But if you
> disable the HW feature, it makes sense that the enforcement would be
> disabled.
>
> CET is something that requires a fair amount of SW enablement. SW needs to do
> things in special ways or things will go wrong. So whether IBT is in use and
> whether it is supported by the HW are useful to maintain as separate concepts.
>
> >
> > To fudge around that, we could add a synthetic feature flag to let the
> > kernel tell KVM whether or not it's safe to virtualize IBT, but I don't see
> > what value that adds over KVM checking raw host CPUID.
>
> A synthetic feature flag for kernel IBT seems reasonable to me. It's what I
> suggested on that thread I linked earlier. But Peterz was advocating for a bool.
> How enforcement would be exposed, would just be dmesg I guess. Having a new
> feature flag still makes sense to me. Maybe he could be convinced.

If there's a need for IBT and KERNEL_IBT, I agree a synthetic flag makes sense.
But as above, it's not clear to me why both are needed, at least for KVM's sake.
Is the need more apparent when userspace IBT support comes along?

2024-05-07 15:33:57

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Tue, 2024-05-07 at 08:08 -0700, Sean Christopherson wrote:
> > It is just an indicator of whether the IBT feature is usable.
>
> Does ibt=off make IBT unusable for userspace?  Huh.  Looking at the #CP
> handler,
> I take it userspace support for IBT hasn't landed yet?

Sadly it remains a small internal dev branch that was preempted by the enormous
TDX ramp. It's the #1 thing I want to get back to when time reappears. Despite
IBT being an old feature at this point, there is a contingent of HW CFI fans that
still want to see it, including on the gcc/glibc and distro side. So I still
have hope.

>
> > I think tying kernel IBT enforcement to the CPU feature is wrong. But if you
> > disable the HW feature, it makes sense that the enforcement would be
> > disabled.
> >
> > CET is something that requires a fair amount of SW enablement. SW needs to
> > do
> > things in special ways or things will go wrong. So whether IBT is in use and
> > whether it is supported by the HW are useful to maintain as separate
> > concepts.
> >
> > >
> > > To fudge around that, we could add a synthetic feature flag to let the
> > > kernel tell KVM whether or not it's safe to virtualize IBT, but I don't
> > > see
> > > what value that adds over KVM checking raw host CPUID.
> >
> > A synthetic feature flag for kernel IBT seems reasonable to me. It's what I
> > suggested on that thread I linked earlier. But Peterz was advocating for a
> > bool.
> > How enforcement would be exposed, would just be dmesg I guess. Having a new
> > feature flag still makes sense to me. Maybe he could be convinced.
>
> If there's a need for IBT and KERNEL_IBT, I agree a synthetic flag makes
> sense.
> But as above, it's not clear to me why both are needed, at least for KVM's
> sake.
> Is the need more apparent when userspace IBT support comes along?

Isn't KVM CET kind of a second user though? It doesn't depend on CR4.CET like
the rest, but it does depend on the same host FPU support. Let me try to ping
peterz re the synthetic flag.

For shadow stack we also have user_shstk. This was done because of the expected
introduction of the CET_SSS bit (the shadow stack fracturing "busy" bit). It
can be treated as something like a supervisor shadow stack support bit. So for
guests, you might have user shadow stack support and not supervisor. At least
that was the idea.

2024-05-07 23:01:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 04/27] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On Thu, May 02, 2024, Dave Hansen wrote:
> On 5/1/24 11:45, Sean Christopherson wrote:
> > On Sun, Feb 18, 2024, Yang Weijiang wrote:
> >> Define a new XFEATURE_MASK_KERNEL_DYNAMIC mask to specify the features
> > I still don't understand why this is being called DYNAMIC. CET_SS isn't dynamic,
> > as KVM is _always_ allowed to save/restore CET_SS, i.e. whether or not KVM can
> > expose CET_SS to a guest is a static, boot-time decision. Whether or not a guest
> > XSS actually enables CET_SS is "dynamic", but that's true of literally every
> > xfeature in XCR0 and XSS.
> >
> > XFEATURE_MASK_XTILE_DATA is labeled as dynamic because userspace has to explicitly
> > request that XTILE_DATA be enabled, and thus whether or not KVM is allowed to
> > expose XTILE_DATA to the guest is a dynamic, runtime decision.
> >
> > So IMO, the umbrella macro should be XFEATURE_MASK_KERNEL_GUEST_ONLY.
>
> Here's how I got that naming. First, "static" features are always
> there. "Dynamic" features might or might not be there. I was also much
> more focused on what's in the XSAVE buffer than on the enabling itself,
> which are _slightly_ different.

Ah, and CET_KERNEL will be '0' in XSTATE_BV for non-guest buffers, but '1' for
guest buffers.

> Then, it's a matter of whether the feature is user or supervisor. The
> kernel might need new state for multiple reasons. Think of LBR state as
> an example. The kernel might want LBR state around for perf _or_ so it
> can be exposed to a guest.
>
> I just didn't want to tie it to "GUEST" too much in case we have more of
> these things come along that get used for things unrelated to KVM.
> Obviously, at this point, we've only got one and KVM is the only user so
> the delta that I was worried about doesn't actually exist.
>
> So I still prefer calling it "KERNEL" over "GUEST". But I also don't
> feel strongly about it and I've said my piece. I won't NAK it one way
> or the other.

I assume you mean "DYNAMIC" over "GUEST"? I'm ok with DYNAMIC, reflecting the
impact on each buffer makes sense.

My one request would be to change the WARN in os_xsave() to fire on CET_KERNEL,
not KERNEL_DYNAMIC, because it's specifically CET_KERNEL that is guest-only.
Future dynamic xfeatures could be guest-only, but they could also be dynamic for
some completely different reason. That was my other hang-up with "DYNAMIC";
as-is, os_xsave() implies that it really truly is GUEST_ONLY.

diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 83ebf1e1cbb4..2a1ff49ccfd5 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -185,8 +185,7 @@ static inline void os_xsave(struct fpstate *fpstate)
 	WARN_ON_FPU(!alternatives_patched);
 	xfd_validate_state(fpstate, mask, false);
 
-	WARN_ON_FPU(!fpstate->is_guest &&
-		    (mask & XFEATURE_MASK_KERNEL_DYNAMIC));
+	WARN_ON_FPU(!fpstate->is_guest && (mask & XFEATURE_MASK_CET_KERNEL));
 
 	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);

2024-05-08 00:18:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v10 04/27] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On 5/7/24 15:57, Sean Christopherson wrote:
>> So I still prefer calling it "KERNEL" over "GUEST". But I also don't
>> feel strongly about it and I've said my peace. I won't NAK it one way
>> or the other.
> I assume you mean "DYNAMIC" over "GUEST"? I'm ok with DYNAMIC, reflecting the
> impact on each buffer makes sense.

Yes. Silly thinko/typo on my part.

> My one request would be to change the WARN in os_xsave() to fire on CET_KERNEL,
> not KERNEL_DYNAMIC, because it's specifically CET_KERNEL that is guest-only.
> Future dynamic xfeatures could be guest-only, but they could also be dynamic for
> some completely different reason. That was my other hang-up with "DYNAMIC";
> as-is, os_xsave() implies that it really truly is GUEST_ONLY.
>
> diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
> index 83ebf1e1cbb4..2a1ff49ccfd5 100644
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -185,8 +185,7 @@ static inline void os_xsave(struct fpstate *fpstate)
>  	WARN_ON_FPU(!alternatives_patched);
>  	xfd_validate_state(fpstate, mask, false);
>  
> -	WARN_ON_FPU(!fpstate->is_guest &&
> -		    (mask & XFEATURE_MASK_KERNEL_DYNAMIC));
> +	WARN_ON_FPU(!fpstate->is_guest && (mask & XFEATURE_MASK_CET_KERNEL));
>  
>  	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);

Yeah, that would make a lot of sense. We could add a more generic
#define for it later if another feature gets added like this.

2024-05-08 01:20:25

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 04/27] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set

On 5/8/2024 7:17 AM, Dave Hansen wrote:
> On 5/7/24 15:57, Sean Christopherson wrote:

[...]

>> My one request would be to change the WARN in os_xsave() to fire on CET_KERNEL,
>> not KERNEL_DYNAMIC, because it's specifically CET_KERNEL that is guest-only.
>> Future dynamic xfeatures could be guest-only, but they could also be dynamic for
>> some completely different reason. That was my other hang-up with "DYNAMIC";
>> as-is, os_xsave() implies that it really truly is GUEST_ONLY.
>>
>> diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
>> index 83ebf1e1cbb4..2a1ff49ccfd5 100644
>> --- a/arch/x86/kernel/fpu/xstate.h
>> +++ b/arch/x86/kernel/fpu/xstate.h
>> @@ -185,8 +185,7 @@ static inline void os_xsave(struct fpstate *fpstate)
>>  	WARN_ON_FPU(!alternatives_patched);
>>  	xfd_validate_state(fpstate, mask, false);
>>  
>> -	WARN_ON_FPU(!fpstate->is_guest &&
>> -		    (mask & XFEATURE_MASK_KERNEL_DYNAMIC));
>> +	WARN_ON_FPU(!fpstate->is_guest && (mask & XFEATURE_MASK_CET_KERNEL));
>>  
>>  	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
> Yeah, that would make a lot of sense. We could add a more generic
> #define for it later if another feature gets added like this.

Thank you for getting alignment! I will change the code accordingly.



2024-05-08 07:00:49

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 17/27] KVM: x86: Report KVM supported CET MSRs as to-be-saved

On 5/8/2024 1:27 AM, Sean Christopherson wrote:
> On Mon, May 06, 2024, Weijiang Yang wrote:
>> On 5/2/2024 6:40 AM, Sean Christopherson wrote:
>>> On Sun, Feb 18, 2024, Yang Weijiang wrote:

[...]

>> For the latter, IIUC KVM still needs to expose the index within the synthetic
>> namespace so that userspace can read/write the intended MSRs, of course without
>> exposing the synthetic MSR index via the existing uAPI. But you said the "index"
>> exposed to userspace can simply be '0' in this case, so how are the synthetic
>> MSRs distinguished between userspace and KVM? And how can userspace be aware of
>> the synthetic MSR index allocation in KVM?
> The idea is to have a synthetic index that is exposed to userspace, and a separate
> KVM-internal index for emulating accesses. The value that is exposed to userspace
> can start at 0 and be a simple incrementing value as we add synthetic MSRs, as the
> .type == SYNTHETIC makes it impossible for the value to collide with a "real" MSR.
>
> Translating to a KVM-internal index is a hack to avoid having to plumb a 64-bit
> index into all the MSR code. We could do that, i.e. pass the full kvm_x86_reg_id
> into the MSR helpers, but I'm not convinced it'd be worth the churn. That said,
> I'm not opposed to the idea either, if others prefer that approach.
>
> E.g.
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 738c449e4f9e..21152796238a 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -420,6 +420,8 @@ struct kvm_x86_reg_id {
>  	__u16 rsvd16;
>  };
>  
> +#define MSR_KVM_GUEST_SSP	0
> +
>  #define KVM_SYNC_X86_REGS	(1UL << 0)
>  #define KVM_SYNC_X86_SREGS	(1UL << 1)
>  #define KVM_SYNC_X86_EVENTS	(1UL << 2)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f45cdd9d8c1f..1a9e1e0c9f49 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5990,6 +5990,19 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>  	}
>  }
>  
> +static int kvm_translate_synthetic_msr(u32 *index)
> +{
> +	switch (*index) {
> +	case MSR_KVM_GUEST_SSP:
> +		*index = MSR_KVM_INTERNAL_GUEST_SSP;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>  long kvm_arch_vcpu_ioctl(struct file *filp,
>  			 unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index cc585051d24b..3b5a038f5260 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -49,6 +49,15 @@ void kvm_spurious_fault(void);
>  #define KVM_FIRST_EMULATED_VMX_MSR	MSR_IA32_VMX_BASIC
>  #define KVM_LAST_EMULATED_VMX_MSR	MSR_IA32_VMX_VMFUNC
>  
> +/*
> + * KVM's internal, non-ABI indices for synthetic MSRs.  The values themselves
> + * are arbitrary and have no meaning, the only requirement is that they don't
> + * conflict with "real" MSRs that KVM supports.  Use values at the upper end
> + * of KVM's reserved paravirtual MSR range to minimize churn, i.e. these values
> + * will be usable until KVM exhausts its supply of paravirtual MSR indices.
> + */
> +#define MSR_KVM_INTERNAL_GUEST_SSP	0x4b564dff
> +
>  #define KVM_DEFAULT_PLE_GAP		128
>  #define KVM_VMX_DEFAULT_PLE_WINDOW	4096
>  #define KVM_DEFAULT_PLE_WINDOW_GROW	2

OK, I'll post an RFC patch for this change, thanks a lot!
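
For reference, a rough sketch of how the translation could be wired into a read
path. The wrapper below is an illustrative assumption; only
kvm_translate_synthetic_msr() and MSR_KVM_GUEST_SSP come from Sean's diff above:

	/* Sketch: resolve a synthetic register before the MSR code sees it. */
	static int kvm_get_synthetic_reg(struct kvm_vcpu *vcpu, u32 index, u64 *data)
	{
		int r;

		/*
		 * Map the stable userspace index (e.g. MSR_KVM_GUEST_SSP == 0)
		 * to KVM's internal MSR index.
		 */
		r = kvm_translate_synthetic_msr(&index);
		if (r)
			return r;

		/*
		 * From here on, the internal index flows through the normal
		 * MSR emulation paths unchanged.
		 */
		return kvm_get_msr(vcpu, index, data);
	}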



2024-05-16 07:13:31

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/2/2024 7:15 AM, Sean Christopherson wrote:
> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>> @@ -696,6 +697,20 @@ void kvm_set_cpu_caps(void)
>> kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
>> if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
>> kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
>> + /*
>> + * Don't use boot_cpu_has() to check availability of IBT because the
>> + * feature bit is cleared in boot_cpu_data when ibt=off is applied
>> + * in host cmdline.
> I'm not convinced this is a good reason to diverge from the host kernel. E.g.
> PCID and many other features honor the host setup, I don't see what makes IBT
> special.
Hi, Sean,
We synced on the issue internally, and the conclusion is that KVM should honor the host IBT config.
In this case the IBT bit in boot_cpu_data should be honored.  With this policy, it avoids CPUID
confusion on the guest side due to a host ibt=off config. Host side xstate support shouldn't be an
issue because we already have the check below in this patch:

+	if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
+	     XFEATURE_MASK_CET_KERNEL)) !=
+	    (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
+		kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+		kvm_cpu_cap_clear(X86_FEATURE_IBT);
+		kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
+					    XFEATURE_MASK_CET_KERNEL);
+	}

What are your thoughts? Should I just remove the quirk here and keep everything normal and
peaceful?


2024-05-16 07:20:48

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/6/2024 5:41 PM, Yang, Weijiang wrote:
> On 5/2/2024 7:34 AM, Sean Christopherson wrote:
>> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>>> @@ -665,7 +665,7 @@ void kvm_set_cpu_caps(void)
>>>           F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
>>>           F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
>>>           F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
>>> -        F(SGX_LC) | F(BUS_LOCK_DETECT)
>>> +        F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
>>>       );
>>>       /* Set LA57 based on hardware capability. */
>>>       if (cpuid_ecx(7) & F(LA57))
>>> @@ -683,7 +683,8 @@ void kvm_set_cpu_caps(void)
>>>           F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
>>>           F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
>>>           F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
>>> -        F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
>>> +        F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
>>> +        F(IBT)
>>>       );
>> ...
>>
>>> @@ -7977,6 +7993,18 @@ static __init void vmx_set_cpu_caps(void)
>>>         if (cpu_has_vmx_waitpkg())
>>>           kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
>>> +
>>> +    /*
>>> +     * Disable CET if unrestricted_guest is unsupported as KVM doesn't
>>> +     * enforce CET HW behaviors in emulator. On platforms with
>>> +     * VMX_BASIC[bit56] == 0, inject #CP at VMX entry with error code
>>> +     * fails, so disable CET in this case too.
>>> +     */
>>> +    if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
>>> +        !cpu_has_vmx_basic_no_hw_errcode()) {
>>> +        kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
>>> +        kvm_cpu_cap_clear(X86_FEATURE_IBT);
>>> +    }
>> Oh!  Almost missed it.  This patch should explicitly kvm_cpu_cap_clear()
>> X86_FEATURE_SHSTK and X86_FEATURE_IBT.  We *know* there are upcoming AMD CPUs
>> that support at least SHSTK, so enumerating support for common code would yield
>> a version of KVM that incorrectly advertises support for SHSTK.
>>
>> I hope to land both Intel and AMD virtualization in the same kernel release, but
>> there are no guarantees that will happen.  And explicitly clearing both SHSTK and
>> IBT would guard against IBT showing up in some future AMD CPU in advance of KVM
>> gaining full support.
>
> Let me be clear on this, you want me to disable SHSTK/IBT with kvm_cpu_cap_clear() unconditionally
> for now in this patch, and wait until both AMD's SVM patches and this series are ready for guest CET,
> then remove the disabling code in this patch for final merge, am I right?
Hi, Sean,
I haven't got your reply on above question. Would like to get your confirmation.
Thanks!



2024-05-16 14:43:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Thu, May 16, 2024, Weijiang Yang wrote:
> On 5/6/2024 5:41 PM, Yang, Weijiang wrote:
> > On 5/2/2024 7:34 AM, Sean Christopherson wrote:
> > > On Sun, Feb 18, 2024, Yang Weijiang wrote:
> > > > @@ -665,7 +665,7 @@ void kvm_set_cpu_caps(void)
> > > >           F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
> > > >           F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
> > > >           F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
> > > > -        F(SGX_LC) | F(BUS_LOCK_DETECT)
> > > > +        F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
> > > >       );
> > > >       /* Set LA57 based on hardware capability. */
> > > >       if (cpuid_ecx(7) & F(LA57))
> > > > @@ -683,7 +683,8 @@ void kvm_set_cpu_caps(void)
> > > >           F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
> > > >           F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
> > > >           F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
> > > > -        F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
> > > > +        F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
> > > > +        F(IBT)
> > > >       );
> > > ...
> > >
> > > > @@ -7977,6 +7993,18 @@ static __init void vmx_set_cpu_caps(void)
> > > >         if (cpu_has_vmx_waitpkg())
> > > >           kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> > > > +
> > > > +    /*
> > > > +     * Disable CET if unrestricted_guest is unsupported as KVM doesn't
> > > > +     * enforce CET HW behaviors in emulator. On platforms with
> > > > +     * VMX_BASIC[bit56] == 0, inject #CP at VMX entry with error code
> > > > +     * fails, so disable CET in this case too.
> > > > +     */
> > > > +    if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
> > > > +        !cpu_has_vmx_basic_no_hw_errcode()) {
> > > > +        kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> > > > +        kvm_cpu_cap_clear(X86_FEATURE_IBT);
> > > > +    }
> > > Oh!  Almost missed it.  This patch should explicitly kvm_cpu_cap_clear()
> > > X86_FEATURE_SHSTK and X86_FEATURE_IBT.  We *know* there are upcoming AMD CPUs
> > > that support at least SHSTK, so enumerating support for common code would yield
> > > a version of KVM that incorrectly advertises support for SHSTK.
> > >
> > > I hope to land both Intel and AMD virtualization in the same kernel release, but
> > > there are no guarantees that will happen.  And explicitly clearing both SHSTK and
> > > IBT would guard against IBT showing up in some future AMD CPU in advance of KVM
> > > gaining full support.
> >
> > Let me be clear on this, you want me to disable SHSTK/IBT with
> > kvm_cpu_cap_clear() unconditionally for now in this patch, and wait until
> > both AMD's SVM patches and this series are ready for guest CET, then remove
> > the disabling code in this patch for final merge, am I right?

No, allow it to be enabled for VMX, but explicitly disable it for SVM, i.e.

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 4aaffbf22531..b3df12af4ee6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5125,6 +5125,10 @@ static __init void svm_set_cpu_caps(void)
 	kvm_caps.supported_perf_cap = 0;
 	kvm_caps.supported_xss = 0;
 
+	/* KVM doesn't yet support CET virtualization for SVM. */
+	kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+	kvm_cpu_cap_clear(X86_FEATURE_IBT);
+
 	/* CPUID 0x80000001 and 0x8000000A (SVM features) */
 	if (nested) {
 		kvm_cpu_cap_set(X86_FEATURE_SVM);

Then the SVM series can simply delete those lines when all is ready.

2024-05-16 15:37:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/16/24 07:39, Sean Christopherson wrote:
>> We synced the issue internally, and got conclusion that KVM should honor host
>> IBT config. In this case IBT bit in boot_cpu_data should be honored.  With
>> this policy, it can avoid CPUID confusion to guest side due to host ibt=off
>> config.
> What was the reasoning? CPUID confusion is a weak justification, e.g. it's not
> like the guest has visibility into the host kernel, and raw CPUID will still show
> IBT support in the host.

I'm basically arguing for the path of least resistance (at least to start).

We should just do what takes the least amount of code for now that
results in mostly sane behavior, then debate about making it perfect later.

In other words, let's say the place we'd *IDEALLY* end up is that guests
can have any random FPU state which is disconnected from the host. But
the reality, for now, is that the host needs to have XFEATURE_CET_USER
set in order to pass it into the guest and that means keeping
X86_FEATURE_SHSTK set.

If you want guest XFEATURE_CET_USER, you must have host
X86_FEATURE_SHSTK ... for now.

2024-05-16 16:58:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Thu, May 16, 2024, Dave Hansen wrote:
> On 5/16/24 07:39, Sean Christopherson wrote:
> >> We synced the issue internally, and got conclusion that KVM should honor host
> >> IBT config. In this case IBT bit in boot_cpu_data should be honored.  With
> >> this policy, it can avoid CPUID confusion to guest side due to host ibt=off
> >> config.
> > What was the reasoning? CPUID confusion is a weak justification, e.g. it's not
> > like the guest has visibility into the host kernel, and raw CPUID will still show
> > IBT support in the host.
>
> I'm basically arguing for the path of least resistance (at least to start).
>
> We should just do what takes the least amount of code for now that
> results in mostly sane behavior, then debate about making it perfect later.
>
> In other words, let's say the place we'd *IDEALLY* end up is that guests
> can have any random FPU state which is disconnected from the host. But
> the reality, for now, is that the host needs to have XFEATURE_CET_USER
> set in order to pass it into the guest and that means keeping
> X86_FEATURE_SHSTK set.
>
> If you want guest XFEATURE_CET_USER, you must have host
> X86_FEATURE_SHSTK ... for now.

Ah, because fpu__init_system_xstate() will clear XFEATURE_CET_USER via the
X86_FEATURE_SHSTK connection in xsave_cpuid_features.

Please put something to that effect in the changelog. "this literally won't work
(without more changes)" is very different than us making a largely arbitrary
decision.
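
For context, the mechanism being referenced is the xfeature-to-CPUID-feature
table consumed at boot. A simplified sketch, condensed from the kernel's
arch/x86/kernel/fpu/xstate.c (the filter loop is pulled into an illustrative
helper here; the real code runs it inside fpu__init_system_xstate() with a few
extra special cases):

	/*
	 * Each xfeature is keyed off the CPUID feature that "owns" it;
	 * XFEATURE_CET_USER is tied to X86_FEATURE_SHSTK, so clearing SHSTK
	 * (Kconfig, cmdline, etc.) drops CET_USER from the supported mask.
	 */
	static const unsigned short xsave_cpuid_features[] = {
		[XFEATURE_CET_USER]	= X86_FEATURE_SHSTK,
		/* ... other xfeatures ... */
	};

	static void __init filter_xfeatures(void)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
			if (!boot_cpu_has(xsave_cpuid_features[i]))
				fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
		}
	}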

2024-05-17 08:05:23

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/16/2024 10:43 PM, Sean Christopherson wrote:
> On Thu, May 16, 2024, Weijiang Yang wrote:
>> On 5/6/2024 5:41 PM, Yang, Weijiang wrote:
>>> On 5/2/2024 7:34 AM, Sean Christopherson wrote:
>>>> On Sun, Feb 18, 2024, Yang Weijiang wrote:
>>>>> @@ -665,7 +665,7 @@ void kvm_set_cpu_caps(void)
>>>>>           F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
>>>>>           F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
>>>>>           F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
>>>>> -        F(SGX_LC) | F(BUS_LOCK_DETECT)
>>>>> +        F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
>>>>>       );
>>>>>       /* Set LA57 based on hardware capability. */
>>>>>       if (cpuid_ecx(7) & F(LA57))
>>>>> @@ -683,7 +683,8 @@ void kvm_set_cpu_caps(void)
>>>>>           F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
>>>>>           F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
>>>>>           F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
>>>>> -        F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
>>>>> +        F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
>>>>> +        F(IBT)
>>>>>       );
>>>> ...
>>>>
>>>>> @@ -7977,6 +7993,18 @@ static __init void vmx_set_cpu_caps(void)
>>>>>         if (cpu_has_vmx_waitpkg())
>>>>>           kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
>>>>> +
>>>>> +    /*
>>>>> +     * Disable CET if unrestricted_guest is unsupported as KVM doesn't
>>>>> +     * enforce CET HW behaviors in emulator. On platforms with
>>>>> +     * VMX_BASIC[bit56] == 0, inject #CP at VMX entry with error code
>>>>> +     * fails, so disable CET in this case too.
>>>>> +     */
>>>>> +    if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
>>>>> +        !cpu_has_vmx_basic_no_hw_errcode()) {
>>>>> +        kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
>>>>> +        kvm_cpu_cap_clear(X86_FEATURE_IBT);
>>>>> +    }
>>>> Oh!  Almost missed it.  This patch should explicitly kvm_cpu_cap_clear()
>>>> X86_FEATURE_SHSTK and X86_FEATURE_IBT.  We *know* there are upcoming AMD CPUs
>>>> that support at least SHSTK, so enumerating support for common code would yield
>>>> a version of KVM that incorrectly advertises support for SHSTK.
>>>>
>>>> I hope to land both Intel and AMD virtualization in the same kernel release, but
>>>> there are no guarantees that will happen.  And explicitly clearing both SHSTK and
>>>> IBT would guard against IBT showing up in some future AMD CPU in advance of KVM
>>>> gaining full support.
>>> Let me be clear on this, you want me to disable SHSTK/IBT with
>>> kvm_cpu_cap_clear() unconditionally for now in this patch, and wait until
>>> both AMD's SVM patches and this series are ready for guest CET, then remove
>>> the disabling code in this patch for final merge, am I right?
> No, allow it to be enabled for VMX, but explicitly disable it for SVM, i.e.
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 4aaffbf22531..b3df12af4ee6 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -5125,6 +5125,10 @@ static __init void svm_set_cpu_caps(void)
>  	kvm_caps.supported_perf_cap = 0;
>  	kvm_caps.supported_xss = 0;
>  
> +	/* KVM doesn't yet support CET virtualization for SVM. */
> +	kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> +	kvm_cpu_cap_clear(X86_FEATURE_IBT);
> +
>  	/* CPUID 0x80000001 and 0x8000000A (SVM features) */
>  	if (nested) {
>  		kvm_cpu_cap_set(X86_FEATURE_SVM);
>
> Then the SVM series can simply delete those lines when all is ready.

Understood, thanks!



2024-05-17 08:28:22

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/17/2024 12:58 AM, Sean Christopherson wrote:
> On Thu, May 16, 2024, Dave Hansen wrote:
>> On 5/16/24 07:39, Sean Christopherson wrote:
>>>> We synced the issue internally, and got conclusion that KVM should honor host
>>>> IBT config. In this case IBT bit in boot_cpu_data should be honored.  With
>>>> this policy, it can avoid CPUID confusion to guest side due to host ibt=off
>>>> config.
>>> What was the reasoning? CPUID confusion is a weak justification, e.g. it's not
>>> like the guest has visibility into the host kernel, and raw CPUID will still show
>>> IBT support in the host.
>> I'm basically arguing for the path of least resistance (at least to start).
>>
>> We should just do what takes the least amount of code for now that
>> results in mostly sane behavior, then debate about making it perfect later.
>>
>> In other words, let's say the place we'd *IDEALLY* end up is that guests
>> can have any random FPU state which is disconnected from the host. But
>> the reality, for now, is that the host needs to have XFEATURE_CET_USER
>> set in order to pass it into the guest and that means keeping
>> X86_FEATURE_SHSTK set.
>>
>> If you want guest XFEATURE_CET_USER, you must have host
>> X86_FEATURE_SHSTK ... for now.
> Ah, because fpu__init_system_xstate() will clear XFEATURE_CET_USER via the
> X86_FEATURE_SHSTK connection in xsave_cpuid_features.
>
> Please put something to that effect in the changelog. "this literally won't work
> (without more changes)" is very different than us making a largely arbitrary
> decision.

So I need to remove the trick here for guest IBT, right?

Side topic:
When X86_FEATURE_SHSTK and X86_FEATURE_IBT (no ibt=off set) are available on the host side, host
XFEATURE_CET_USER is enabled. In this case, we still *need* the patch below:
https://lore.kernel.org/all/[email protected]/
to correctly enable XFEATURE_CET_USER in the *guest kernel*, because VMM userspace can enable IBT
alone for a guest by -cpu host -shstk, am I right?


2024-05-17 08:57:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Thu, May 16 2024 at 07:39, Sean Christopherson wrote:
> On Thu, May 16, 2024, Weijiang Yang wrote:
>> We synced the issue internally, and got conclusion that KVM should honor host
>> IBT config. In this case IBT bit in boot_cpu_data should be honored.  With
>> this policy, it can avoid CPUID confusion to guest side due to host ibt=off
>> config.
>
> What was the reasoning? CPUID confusion is a weak justification, e.g. it's not
> like the guest has visibility into the host kernel, and raw CPUID will still show
> IBT support in the host.
>
> On the other hand, I can definitely see folks wanting to expose IBT to guests
> when running non-compliant host kernels, especially when live migration is in
> play, i.e. when hiding IBT from the guest will actively cause problems.

I have to disagree here violently.

If the exposure of a CPUID bit to a guest requires host side support,
e.g. in xstate handling, then exposing it to a guest is simply not
possible.

Just because virtualization allows you to do that does not mean that it's
correct in any way.

Thanks,

tglx

2024-05-17 14:36:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Fri, May 17, 2024, Thomas Gleixner wrote:
> On Thu, May 16 2024 at 07:39, Sean Christopherson wrote:
> > On Thu, May 16, 2024, Weijiang Yang wrote:
> >> We synced the issue internally, and got conclusion that KVM should honor host
> >> IBT config. In this case IBT bit in boot_cpu_data should be honored.  With
> >> this policy, it can avoid CPUID confusion to guest side due to host ibt=off
> >> config.
> >
> > What was the reasoning? CPUID confusion is a weak justification, e.g. it's not
> > like the guest has visibility into the host kernel, and raw CPUID will still show
> > IBT support in the host.
> >
> > On the other hand, I can definitely see folks wanting to expose IBT to guests
> > when running non-compliant host kernels, especially when live migration is in
> > play, i.e. when hiding IBT from the guest will actively cause problems.
>
> I have to disagree here violently.
>
> If the exposure of a CPUID bit to a guest requires host side support,
> e.g. in xstate handling, then exposing it to a guest is simply not
> possible.

Ya, I don't disagree, I just didn't realize that CET_USER would be cleared in the
supported xfeatures mask.

2024-05-20 09:44:22

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/17/2024 10:26 PM, Sean Christopherson wrote:
> On Fri, May 17, 2024, Thomas Gleixner wrote:
>> On Thu, May 16 2024 at 07:39, Sean Christopherson wrote:
>>> On Thu, May 16, 2024, Weijiang Yang wrote:
>>>> We synced the issue internally, and got conclusion that KVM should honor host
>>>> IBT config. In this case IBT bit in boot_cpu_data should be honored.  With
>>>> this policy, it can avoid CPUID confusion to guest side due to host ibt=off
>>>> config.
>>> What was the reasoning? CPUID confusion is a weak justification, e.g. it's not
>>> like the guest has visibility into the host kernel, and raw CPUID will still show
>>> IBT support in the host.
>>>
>>> On the other hand, I can definitely see folks wanting to expose IBT to guests
>>> when running non-compliant host kernels, especially when live migration is in
>>> play, i.e. when hiding IBT from the guest will actively cause problems.
>> I have to disagree here violently.
>>
>> If the exposure of a CPUID bit to a guest requires host side support,
>> e.g. in xstate handling, then exposing it to a guest is simply not
>> possible.
> Ya, I don't disagree, I just didn't realize that CET_USER would be cleared in the
> supported xfeatures mask.

For host side support, fortunately, this patch already has some checks for that. But for the
userspace CPUID config, it allows IBT to be exposed alone.

IIUC, this series tries to tie IBT to the SHSTK feature, i.e., IBT cannot be exposed as an
independent feature to the guest without exposing SHSTK at the same time. If so, then the patch
below is not needed anymore:
https://lore.kernel.org/all/[email protected]/

I'd check for and clear the IBT bit from CPUID when userspace enables only IBT via KVM_SET_CPUID2,
and update the related code in the series given the implicit dependency.

Thanks!



2024-05-20 17:15:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/20/24 10:09, Sean Christopherson wrote:
>> IIUC, this series tries to tie IBT to SHSTK feature, i.e., IBT cannot be
>> exposed as an independent feature to guest without exposing SHSTK at the same
>> time. If it is, then below patch is not needed anymore:
>> https://lore.kernel.org/all/[email protected]/
> That's a question for the x86 maintainers. Specifically, do they want to allow
> enabling XFEATURE_CET_USER even if userspace shadow stack support is disabled.

I like the sound of "below patch is not needed anymore".

Unless removing the patch causes permanent issues or results in
something that's not functional, I say: jettison it with glee. If it's
that important, it can be considered on its own merits separately.

2024-05-22 08:42:00

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/21/2024 1:09 AM, Sean Christopherson wrote:
> On Mon, May 20, 2024, Weijiang Yang wrote:
>> On 5/17/2024 10:26 PM, Sean Christopherson wrote:
>>> On Fri, May 17, 2024, Thomas Gleixner wrote:
>>>> On Thu, May 16 2024 at 07:39, Sean Christopherson wrote:
>>>>> On Thu, May 16, 2024, Weijiang Yang wrote:
>>>>>> We synced the issue internally, and got conclusion that KVM should honor host
>>>>>> IBT config. In this case IBT bit in boot_cpu_data should be honored.  With
>>>>>> this policy, it can avoid CPUID confusion to guest side due to host ibt=off
>>>>>> config.
>>>>> What was the reasoning? CPUID confusion is a weak justification, e.g. it's not
>>>>> like the guest has visibility into the host kernel, and raw CPUID will still show
>>>>> IBT support in the host.
>>>>>
>>>>> On the other hand, I can definitely see folks wanting to expose IBT to guests
>>>>> when running non-compliant host kernels, especially when live migration is in
>>>>> play, i.e. when hiding IBT from the guest will actively cause problems.
>>>> I have to disagree here violently.
>>>>
>>>> If the exposure of a CPUID bit to a guest requires host side support,
>>>> e.g. in xstate handling, then exposing it to a guest is simply not
>>>> possible.
>>> Ya, I don't disagree, I just didn't realize that CET_USER would be cleared in the
>>> supported xfeatures mask.
>> For host side support, fortunately,  this patch already has some checks for
>> that. But for userspace CPUID config, it allows IBT to be exposed alone.
>>
>> IIUC, this series tries to tie IBT to SHSTK feature, i.e., IBT cannot be
>> exposed as an independent feature to guest without exposing SHSTK at the same
>> time. If it is, then below patch is not needed anymore:
>> https://lore.kernel.org/all/[email protected]/
> That's a question for the x86 maintainers. Specifically, do they want to allow
> enabling XFEATURE_CET_USER even if userspace shadow stack support is disabled.
>
> I don't think it impacts KVM, at least not directly. Regardless of what decision
> the kernel makes, KVM needs to disable IBT and SHSTK if CET_USER _or_ CET_KERNEL
> is missing, which KVM already does via:
>
> 	if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
> 	     XFEATURE_MASK_CET_KERNEL)) !=
> 	    (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
> 		kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> 		kvm_cpu_cap_clear(X86_FEATURE_IBT);
> 		kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
> 					    XFEATURE_MASK_CET_KERNEL);
> 	}
>
>> I'd check and clear IBT bit from CPUID when userspace enables only IBT via
>> KVM_SET_CPUID2.
> No. It is userspace's responsibility to provide a sane CPUID model for the guest.
> KVM needs to ensure that *KVM* doesn't treat IBT as supported if the kernel doesn't
> allow XFEATURE_CET_USER, but userspace can advertise whatever it wants to the guest
> (and gets to keep the pieces if it does something funky).

OK, I think we can go ahead and keep the KVM patches as-is, given that user IBT is not enabled in Linux.
I only hope other OSes enforce both the SHSTK and IBT dependency on XFEATURE_CET_USER so
that user IBT can work well there.

Then IBT can be exposed to the guest alone, because guest *kernel* IBT only relies on the S_CET MSR,
which is VMCS auto-saved/restored.

What are your thoughts?
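
For context, the "VMCS auto-saved/restored" part refers to the dedicated
guest-state field introduced by the series: with the VMX entry/exit load-CET
controls enabled, MSR_IA32_S_CET is swapped via the VMCS rather than by
software. A hedged sketch (GUEST_S_CET is the field name per the series; the
helper is illustrative):

	/*
	 * Sketch: guest S_CET lives in a VMCS field, so KVM never has to
	 * manually save/restore it around VM-Exit/VM-Entry.
	 */
	static void vmx_set_guest_s_cet(struct kvm_vcpu *vcpu, u64 val)
	{
		vmcs_write64(GUEST_S_CET, val);
	}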



2024-05-22 09:04:36

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/21/2024 1:15 AM, Dave Hansen wrote:
> On 5/20/24 10:09, Sean Christopherson wrote:
>>> IIUC, this series tries to tie IBT to SHSTK feature, i.e., IBT cannot be
>>> exposed as an independent feature to guest without exposing SHSTK at the same
>>> time. If it is, then below patch is not needed anymore:
>>> https://lore.kernel.org/all/[email protected]/
>> That's a question for the x86 maintainers. Specifically, do they want to allow
>> enabling XFEATURE_CET_USER even if userspace shadow stack support is disabled.
> I like the sound of "below patch is not needed anymore".
>
> Unless removing the patch causes permanent issues or results in
> something that's not functional, I say: jettison it with glee. If it's
> that important, it can be considered on its own merits separately.
I guess the existing dependency there is due to the fact that only user SHSTK has landed and there's
possibly no such odd bare metal platform.

Side topic: would it be reasonable to enforce the IBT dependency on XFEATURE_CET_USER when the
*user* IBT enabling patches land in the kernel? Then the guest kernel can use user IBT alone if VMM
userspace just wants to enable IBT for the guest, or when SHSTK is disabled for whatever reason.




2024-05-22 15:07:06

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On Wed, 2024-05-22 at 17:03 +0800, Yang, Weijiang wrote:
> Side topic:  would it be reasonable to enforce IBT dependency on
> XFEATURE_CET_USER when *user* IBT
> enabling patches are landing in kernel? Then guest kernel can play with user
> IBT alone if VMM
> userspace just wants to enable IBT for guest. Or when SHSTK is disabled for
> whatever reason.

I think earlier there was a comment that CET would be less likely to need to be
disabled for security reasons, so there would not be much utility in a system-wide
disable (that affects KVM). I recently remembered we actually already had a
reason come up.

The EDK2 SMI handler uses shadow stack and had a bug around saving and restoring
CET state. Using IBT in the kernel was causing systems to hang. The temporary
fix was to disable IBT.

So the point is, let's not try to find a narrow way to get away with enabling as
much as technically possible in KVM.

The simple obviously correct solution would be:
XFEATURE_CET_USER + XFEATURE_CET_KERNEL + X86_FEATURE_IBT = KVM IBT support
XFEATURE_CET_USER + XFEATURE_CET_KERNEL + X86_FEATURE_SHSTK = KVM SHSTK support

It should be correct both with and without that patch to enable
XFEATURE_CET_USER for X86_FEATURE_IBT.

Then the two missing changes to expand support would be:
1. Fixing that ibt=off disables X86_FEATURE_IBT. The fix is to move to bool as
peterz suggested.
2. Making XFEATURE_CET_USER also depend on X86_FEATURE_IBT (the patch in this
series)

We should do those, but in a later small series. Does it seem reasonable? Can we
just do the simple obvious solution above for now?
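
For reference, a sketch of Rick's gating expressed as code. The helper name is
illustrative; the kvm_caps/kvm_cpu_cap_*() usage mirrors the snippets earlier
in the thread:

	/*
	 * Sketch: advertise SHSTK/IBT only if the host supports both CET
	 * xfeatures AND the corresponding host feature flag.
	 */
	static void kvm_gate_cet_caps(void)
	{
		u64 cet = XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL;
		bool xstate_ok = (kvm_caps.supported_xss & cet) == cet;

		if (!xstate_ok || !boot_cpu_has(X86_FEATURE_SHSTK))
			kvm_cpu_cap_clear(X86_FEATURE_SHSTK);

		if (!xstate_ok || !boot_cpu_has(X86_FEATURE_IBT))
			kvm_cpu_cap_clear(X86_FEATURE_IBT);

		if (!xstate_ok)
			kvm_caps.supported_xss &= ~cet;
	}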

2024-05-23 10:08:07

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/22/2024 11:06 PM, Edgecombe, Rick P wrote:
> On Wed, 2024-05-22 at 17:03 +0800, Yang, Weijiang wrote:
>> Side topic:  would it be reasonable to enforce IBT dependency on
>> XFEATURE_CET_USER when *user* IBT
>> enabling patches are landing in kernel? Then guest kernel can play with user
>> IBT alone if VMM
>> userspace just wants to enable IBT for guest. Or when SHSTK is disabled for
>> whatever reason.
> I think earlier there was a comment that CET would be less likely to need to be
> disabled for security reasons, so there would not be utility for a system wide
> disable (that affects KVM). I recently remembered we actually already had a
> reason come up.
>
> The EDK2 SMI handler uses shadow stack and had a bug around saving and restoring
> CET state. Using IBT in the kernel was causing systems to hang. The temporary
> fix was to disable IBT.
>
> So the point is, let's not try to find a narrow way to get away with enabling as
> much as technically possible in KVM.
>
> The simple obviously correct solution would be:
> XFEATURE_CET_USER + XFEATURE_CET_KERNEL + X86_FEATURE_IBT = KVM IBT support
> XFEATURE_CET_USER + XFEATURE_CET_KERNEL + X86_FEATURE_SHSTK = KVM SHSTK support

Yes, I can easily achieve it by removing the raw CPUID check for KVM IBT. The host-side CET xstate
support check is already there in this patch.

>
> It should be correct both with and without that patch to enable
> XFEATURE_CET_USER for X86_FEATURE_IBT.

IMHO, given that user IBT hasn't been enabled in the kernel, it's not too bad to just discard the patch.
I can highlight the issue somewhere in this series.

>
> Then the two missing changes to expand support would be:
> 1. Fixing that ibt=off disables X86_FEATURE_IBT. The fix is to move to bool as
> peterz suggested.
> 2. Making XFEATURE_CET_USER also depend on X86_FEATURE_IBT (the patch in this
> series)
>
> We should do those, but in a later small series. Does it seem reasonable? Can we
> just do the simple obvious solution above for now?

It makes sense to me, but I want to hear the x86 and KVM maintainers' take on it.

Thanks!



2024-05-27 09:05:53

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [PATCH v10 24/27] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

On 5/22/2024 4:41 PM, Yang, Weijiang wrote:
> On 5/21/2024 1:09 AM, Sean Christopherson wrote:
>> On Mon, May 20, 2024, Weijiang Yang wrote:
>>> On 5/17/2024 10:26 PM, Sean Christopherson wrote:
>>>> On Fri, May 17, 2024, Thomas Gleixner wrote:
>>>>> On Thu, May 16 2024 at 07:39, Sean Christopherson wrote:
>>>>>> On Thu, May 16, 2024, Weijiang Yang wrote:
>>>>>>> We synced the issue internally, and got conclusion that KVM should honor host
>>>>>>> IBT config.  In this case IBT bit in boot_cpu_data should be honored.  With
>>>>>>> this policy, it can avoid CPUID confusion to guest side due to host ibt=off
>>>>>>> config.
>>>>>> What was the reasoning?  CPUID confusion is a weak justification, e.g. it's not
>>>>>> like the guest has visibility into the host kernel, and raw CPUID will still show
>>>>>> IBT support in the host.
>>>>>>
>>>>>> On the other hand, I can definitely see folks wanting to expose IBT to guests
>>>>>> when running non-compliant host kernels, especially when live migration is in
>>>>>> play, i.e. when hiding IBT from the guest will actively cause problems.
>>>>> I have to disagree here violently.
>>>>>
>>>>> If the exposure of a CPUID bit to a guest requires host side support,
>>>>> e.g. in xstate handling, then exposing it to a guest is simply not
>>>>> possible.
>>>> Ya, I don't disagree, I just didn't realize that CET_USER would be cleared in the
>>>> supported xfeatures mask.
>>> For host side support, fortunately,  this patch already has some checks for
>>> that. But for userspace CPUID config, it allows IBT to be exposed alone.
>>>
>>> IIUC, this series tries to tie IBT to SHSTK feature, i.e., IBT cannot be
>>> exposed as an independent feature to guest without exposing SHSTK at the same
>>> time. If it is, then below patch is not needed anymore:
>>> https://lore.kernel.org/all/[email protected]/
>> That's a question for the x86 maintainers.  Specifically, do they want to allow
>> enabling XFEATURE_CET_USER even if userspace shadow stack support is disabled.
>>
>> I don't think it impacts KVM, at least not directly.  Regardless of what decision
>> the kernel makes, KVM needs to disable IBT and SHSTK if CET_USER _or_ CET_KERNEL
>> is missing, which KVM already does via:
>>
>>     if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
>>          XFEATURE_MASK_CET_KERNEL)) !=
>>         (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
>>         kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
>>         kvm_cpu_cap_clear(X86_FEATURE_IBT);
>>         kvm_caps.supported_xss &= ~(XFEATURE_MASK_CET_USER |
>>                         XFEATURE_MASK_CET_KERNEL);
>>     }
>>
>>> I'd check and clear IBT bit from CPUID when userspace enables only IBT via
>>> KVM_SET_CPUID2.
>> No.  It is userspace's responsibility to provide a sane CPUID model for the guest.
>> KVM needs to ensure that *KVM* doesn't treat IBT as supported if the kernel doesn't
>> allow XFEATURE_CET_USER, but userspace can advertise whatever it wants to the guest
>> (and gets to keep the pieces if it does something funky).
>
> OK, I think we can go ahead to keep KVM patches as-is given the fact user IBT is not enabled in Linux.
> I only hope other OSes can enforce both SHSTK and IBT dependency on XFEATURE_CET_USER so
> that user IBT can work well there.
>
> Then IBT can be exposed to guest alone because guest *kernel* IBT only relies on S_CET MSR  which is
> VMCS auto-saved/restored.
>
> What's your thoughts?

If there's no objection, I'll make the below changes for this series:
1) Remove the patch:
https://lore.kernel.org/all/[email protected]/
2) Remove the reference to raw CPUID for the KVM IBT CPUID, handling it the same as SHSTK.

Meanwhile, still allow userspace to enable the IBT feature alone because user IBT is not enabled
in the kernel now, and leave enforcement of the user IBT dependency on XFEATURE_CET_USER to the
future.
