2023-09-14 17:44:55

by Yang Weijiang

Subject: [PATCH v6 00/25] Enable CET Virtualization

Control-flow Enforcement Technology (CET) is a CPU feature used to
prevent Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks.
It provides two sub-features (SHSTK, IBT) to defend against ROP/COP/JOP
style control-flow subversion attacks.

Shadow Stack (SHSTK):
A shadow stack is a second stack used exclusively for control transfer
operations. The shadow stack is separate from the data/normal stack and
can be enabled individually in user and kernel mode. When shadow stack
is enabled, CALL pushes the return address on both the data and shadow
stack. RET pops the return address from both stacks and compares them.
If the return addresses from the two stacks do not match, the processor
generates a #CP.
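
As a concrete illustration, below is a minimal user-space sketch of the
rule above (purely illustrative, not kernel code: the two arrays, the
emulate_call()/emulate_ret() helpers and the abort() standing in for
#CP are all assumptions of this model):

/*
 * Model of the SHSTK rule: CALL pushes the return address to both
 * stacks, RET pops both copies and compares them; a mismatch is the
 * modeled #CP.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t data_stack[64], shadow_stack[64];
static int data_top, shstk_top;

static void emulate_call(uint64_t ret_addr)
{
        data_stack[data_top++] = ret_addr;      /* normal CALL push */
        shadow_stack[shstk_top++] = ret_addr;   /* shadow-stack push */
}

static void emulate_ret(void)
{
        uint64_t a = data_stack[--data_top];
        uint64_t b = shadow_stack[--shstk_top];

        if (a != b) {                           /* mismatch -> #CP */
                fprintf(stderr, "#CP: control protection fault\n");
                abort();
        }
}

int main(void)
{
        emulate_call(0x401000);
        data_stack[0] = 0xdeadbeef;     /* ROP-style overwrite */
        emulate_ret();                  /* the modeled #CP fires */
        return 0;
}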

Indirect Branch Tracking (IBT):
IBT introduces a new instruction (ENDBRANCH) to mark valid target
addresses of indirect branches (CALL, JMP, etc.). If an indirect branch
is executed and the next instruction is _not_ an ENDBRANCH, the
processor generates a #CP. The instruction behaves as a NOP on platforms
that don't support CET.
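
As a hedged example of the IBT rule (the handler() function below is
hypothetical; -fcf-protection=branch is a standard gcc option),
compiling the following makes the compiler emit ENDBR64 as the first
instruction of every function that can be reached indirectly:

#include <stdio.h>

static void handler(void)               /* compiler emits endbr64 here */
{
        printf("valid indirect-branch target\n");
}

int main(void)
{
        void (*fp)(void) = handler;

        /*
         * Indirect CALL: on a CPU with IBT enabled, the instruction at
         * the branch target must be ENDBR64, otherwise #CP is raised.
         */
        fp();
        return 0;
}

Build with: gcc -fcf-protection=branch test.c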


Dependency:
--------------------------------------------------------------------------
At the moment, the CET native series for user mode shadow stack has been
merged upstream in v6.6-rc1, so no native patches are enclosed in this
series.

The first 8 kernel patches are prerequisites for this KVM patch series,
as guest CET user mode and supervisor mode xstates/MSRs rely on the host
FPU framework to properly save/reload guest MSRs when required, e.g.,
when the vCPU thread is scheduled in/out. The kernel patches are posted
in a separate review thread here [1].

To test a CET guest, apply this KVM series to the kernel tree to build a
qualified host kernel. Also apply the QEMU CET enabling patches [2] to
build a qualified QEMU.


Implementation:
--------------------------------------------------------------------------
This series enables full support for guest CET SHSTK/IBT register
states, i.e., guest CET register states in the usage models below are
supported.

                   |
    User SHSTK     |    User IBT        (user mode)
  --------------------------------------------------
    Kernel SHSTK   |    Kernel IBT      (kernel mode)
                   |

KVM cooperates with the host kernel FPU framework to back the guest CET
xstate switch when guest CET MSRs need to be saved/reloaded on the host
side, so KVM relies on the host FPU xstate settings. From the KVM
perspective, part of the user mode CET state support is in the native
series, but series [1] is required to fix some issues and to enable CET
supervisor xstate support for the guest.

Note, guest supervisor (kernel) SHSTK cannot be fully supported by this
series, therefore the guest SSS_CET bit of CPUID.(EAX=7,ECX=1):EDX[bit 18]
is cleared. See the SDM (Vol. 1, Section 17.2.3) for details.


CET states management:
--------------------------------------------------------------------------
CET user mode and supervisor mode xstates, i.e., MSR_IA32_{U_CET,PL3_SSP}
and MSR_IA32_PL{0,1,2}_SSP, depend on the host FPU framework to swap
guest and host xstates. On VM-Exit, guest CET xstates are saved to the
guest fpu area and host CET xstates are loaded from the task/thread
context before the vCPU returns to userspace, and vice versa on
VM-Entry. See kvm_{load,put}_guest_fpu() for details. Guest CET xstate
management therefore depends on the CET xstate bits (U_CET/S_CET) being
set in the host XSS MSR.
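
The swap points can be summarized with the simplified sketch below; it
mirrors (but is not literally) the kvm_{load,put}_guest_fpu() paths,
which delegate to the host's fpu_swap_kvm_fpstate():

/* Simplified sketch of the guest/host xstate swap points. */
static void load_guest_fpu_sketch(struct kvm_vcpu *vcpu)
{
        /*
         * XSAVES host (task) state and XRSTORS the guest fpu area;
         * the U_CET/S_CET components are included iff the corresponding
         * bits are set in the host XSS.
         */
        fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, true);
}

static void put_guest_fpu_sketch(struct kvm_vcpu *vcpu)
{
        /* XSAVES the guest fpu area and XRSTORS host (task) state. */
        fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);
}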

CET supervisor mode states are grouped into two categories:
XSAVE-managed and non-XSAVE-managed. The former includes
MSR_IA32_PL{0,1,2}_SSP and is controlled by the CET supervisor mode bit
(S_CET) in XSS; the latter consists of MSR_IA32_S_CET and
MSR_IA32_INT_SSP_TAB.

VMX introduces new VMCS fields, {GUEST,HOST}_{S_CET,SSP,INTR_SSP_TABLE},
to facilitate the guest/host non-XSAVE-managed states. When the VMX CET
entry/exit load bits are set, guest/host MSR_IA32_S_CET,
MSR_IA32_INT_SSP_TAB and SSP are loaded from the corresponding fields on
VM-Entry/VM-Exit respectively. With these new fields, such supervisor
states require no additional KVM save/reload actions.
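
For example, a hedged sketch of how VMX code would arm the hardware
loads (VM_ENTRY_LOAD_CET_STATE/VM_EXIT_LOAD_CET_STATE are the control
bits introduced by this series; the guest_s_cet/host_s_cet variables are
illustrative placeholders):

/* Illustrative only: let the CPU itself load CET state. */
if (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
    kvm_cpu_cap_has(X86_FEATURE_IBT)) {
        vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_CET_STATE);
        vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_CET_STATE);

        vmcs_writel(GUEST_S_CET, guest_s_cet); /* takes effect on VM-Entry */
        vmcs_writel(HOST_S_CET, host_s_cet);   /* restored on VM-Exit */
}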


Tests:
--------------------------------------------------------------------------
This series passed the basic CET user shadow stack test and kernel IBT
test in both L1 and L2 guests.

The patch series _does_ impact existing VMX test cases in
KVM-unit-tests; the resulting failures have been fixed in series [3].

All other KVM-unit-tests and selftests passed with this series. One new
selftest app for the CET MSRs is also included in this series.

Note, this series hasn't been tested on AMD platforms yet.

To run the user SHSTK test and kernel IBT test in a guest, a CET-capable
platform is required, e.g., a Sapphire Rapids server. Follow the steps
below to build the host/guest kernels properly:

1. Build host kernel: Apply this series to a kernel tree (>= v6.6-rc1) and build.

2. Build guest kernel: Pull a kernel (>= v6.6-rc1) and enable the
CONFIG_X86_KERNEL_IBT and CONFIG_X86_USER_SHADOW_STACK options. Build
with a CET-enabled gcc (>= 8.5.0).

3. Use patched QEMU to launch a guest.

Check kernel selftest test_shadow_stack_64 output:

[INFO] new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO] changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO] ssp is now 7f8c82101000
[OK] Shadow stack pivot
[OK] Shadow stack faults
[INFO] Corrupting shadow stack
[INFO] Generated shadow stack violation successfully
[OK] Shadow stack violation test
[INFO] Gup read -> shstk access success
[INFO] Gup write -> shstk access success
[INFO] Violation from normal write
[INFO] Gup read -> write access success
[INFO] Violation from normal write
[INFO] Gup write -> write access success
[INFO] Cow gup write -> write access success
[OK] Shadow gup test
[INFO] Violation from shstk access
[OK] mprotect() test
[SKIP] Userfaultfd unavailable.
[OK] 32 bit test


Check kernel IBT with dmesg | grep CET:

CET detected: Indirect Branch Tracking enabled

--------------------------------------------------------------------------
Changes in v6:
1. Added kernel patches to enable CET supervisor xstate support for guest. [Sean, Paolo]
2. Overhauled the CET MSR access interface to make reads/writes clearer. [Sean, Chao]
3. Removed the KVM-managed CET supervisor state patches.
4. Tweaked the code for accessing the XSS MSR, reporting CET MSRs, SSP access in SMM
mode, CET MSR interception, etc. per review feedback. [Sean, Paolo, Chao]
5. Rebased to: https://github.com/kvm-x86/linux tag: kvm-x86-next-2023.09.07


[1]: CET supervisor xstate support:
https://lore.kernel.org/all/[email protected]/
[2]: QEMU patch:
https://lore.kernel.org/all/[email protected]/
[3]: KVM-unit-tests fixup:
https://lore.kernel.org/all/[email protected]/
[4]: v5 patchset:
https://lore.kernel.org/kvm/[email protected]/


Patch 1-8: Kernel patches to enable CET supervisor state.
Patch 9-14: Enable XSS support in KVM.
Patch 15: Fault check for CR4.CET setting.
Patch 16: Report CET MSRs to userspace.
Patch 17: Introduce CET VMCS fields.
Patch 18: Add SHSTK/IBT to KVM-governed framework.
Patch 19: Emulate CET MSR access.
Patch 20: Handle SSP at entry/exit to SMM.
Patch 21: Set up CET MSR interception.
Patch 22: Initialize host constant supervisor state.
Patch 23: Add CET virtualization settings.
Patch 24-25: Add CET nested support.



Sean Christopherson (3):
KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
KVM: x86: Report XSS as to-be-saved if there are supported features
KVM: x86: Load guest FPU state when access XSAVE-managed MSRs

Yang Weijiang (22):
x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
x86/fpu/xstate: Fix guest fpstate allocation size calculation
x86/fpu/xstate: Add CET supervisor mode state support
x86/fpu/xstate: Introduce kernel dynamic xfeature set
x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel
default_features
x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate
size
x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic
xfeatures
x86/fpu/xstate: WARN if normal fpstate contains kernel dynamic
xfeatures
KVM: x86: Add kvm_msr_{read,write}() helpers
KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
KVM: x86: Initialize kvm_caps.supported_xss
KVM: x86: Add fault checks for guest CR4.CET setting
KVM: x86: Report KVM supported CET MSRs as to-be-saved
KVM: VMX: Introduce CET VMCS fields and control bits
KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT
enabled"
KVM: VMX: Emulate read and write to CET MSRs
KVM: x86: Save and reload SSP to/from SMRAM
KVM: VMX: Set up interception for CET MSRs
KVM: VMX: Set host constant supervisor states to VMCS fields
KVM: x86: Enable CET virtualization for VMX and advertise to userspace
KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery
to L1
KVM: nVMX: Enable CET support for nested guest

arch/x86/include/asm/fpu/types.h | 14 +-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/kvm_host.h | 8 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/vmx.h | 8 ++
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kernel/fpu/core.c | 56 ++++++--
arch/x86/kernel/fpu/xstate.c | 49 ++++++-
arch/x86/kernel/fpu/xstate.h | 5 +
arch/x86/kvm/cpuid.c | 62 ++++++---
arch/x86/kvm/governed_features.h | 2 +
arch/x86/kvm/smm.c | 8 ++
arch/x86/kvm/smm.h | 2 +-
arch/x86/kvm/vmx/capabilities.h | 10 ++
arch/x86/kvm/vmx/nested.c | 49 +++++--
arch/x86/kvm/vmx/nested.h | 5 +
arch/x86/kvm/vmx/vmcs12.c | 6 +
arch/x86/kvm/vmx/vmcs12.h | 14 +-
arch/x86/kvm/vmx/vmx.c | 104 ++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +-
arch/x86/kvm/x86.c | 192 +++++++++++++++++++++++++--
arch/x86/kvm/x86.h | 28 ++++
22 files changed, 569 insertions(+), 67 deletions(-)


base-commit: ff6e6ded54725cd01623b9a1a86b74a523198733
--
2.27.0


2023-09-14 18:03:33

by Yang Weijiang

Subject: [PATCH v6 20/25] KVM: x86: Save and reload SSP to/from SMRAM

Save the CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates
architectural behavior when the guest enters/leaves SMM mode, i.e., it
saves registers to SMRAM on SMM entry and reloads them on SMM exit. Per
the SDM, SSP is one such register on 64-bit architectures, so add the
support for SSP.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/smm.c | 8 ++++++++
arch/x86/kvm/smm.h | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index b42111a24cc2..235fca95f103 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);

smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
+
+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
+ vcpu->kvm);
}
#endif

@@ -565,6 +569,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
ctxt->interruptibility = (u8)smstate->int_shadow;

+ if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
+ vcpu->kvm);
+
return X86EMUL_CONTINUE;
}
#endif
diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..1e2a3e18207f 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
u32 smbase;
u32 reserved4[5];

- /* ssp and svm_* fields below are not implemented by KVM */
u64 ssp;
+ /* svm_* fields below are not implemented by KVM */
u64 svm_guest_pat;
u64 svm_host_efer;
u64 svm_host_cr4;
--
2.27.0

2023-09-14 20:47:01

by Yang Weijiang

Subject: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers

Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use
the helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers, i.e., used when KVM
needs to get/set an MSR value while emulating CPU behavior.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +++-
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/x86.c | 16 +++++++++++++---
3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..0fc5e6312e93 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1956,7 +1956,9 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);

void kvm_enable_efer_bits(u64);
bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
+
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);
int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7c3e4a550ca7..1f206caec559 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1531,7 +1531,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
*edx = entry->edx;
if (function == 7 && index == 0) {
u64 data;
- if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
+ if (!kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
(data & TSX_CTRL_CPUID_CLEAR))
*ebx &= ~(F(RTM) | F(HLE));
} else if (function == 0x80000007) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..e0b55c043dab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1917,8 +1917,8 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
- bool host_initiated)
+static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
+ bool host_initiated)
{
struct msr_data msr;
int ret;
@@ -1944,6 +1944,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
return ret;
}

+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+ return __kvm_set_msr(vcpu, index, data, true);
+}
+
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+{
+ return __kvm_get_msr(vcpu, index, data, true);
+}
+
static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
u32 index, u64 *data, bool host_initiated)
{
@@ -12082,7 +12092,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;

__kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
- __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
+ kvm_msr_write(vcpu, MSR_IA32_XSS, 0);
}

/* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
--
2.27.0

2023-09-14 20:59:42

by Yang Weijiang

Subject: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest

Set up the CET MSRs, the related VM_ENTRY/EXIT control bits and the
fixed CR4 setting to enable CET for nested VMs.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmcs12.c | 6 ++++++
arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
arch/x86/kvm/vmx/vmx.c | 2 ++
4 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 78a3be394d00..2c4ff13fddb0 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_FLUSH_CMD, MSR_TYPE_W);

+ /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_U_CET, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_S_CET, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL0_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL1_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL2_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PL3_SSP, MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
+
kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);

vmx->nested.force_msr_bitmap_recalc = false;
@@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
VM_EXIT_HOST_ADDR_SPACE_SIZE |
#endif
VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
- VM_EXIT_CLEAR_BNDCFGS;
+ VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
msrs->exit_ctls_high |=
VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
#ifdef CONFIG_X86_64
VM_ENTRY_IA32E_MODE |
#endif
- VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+ VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+ VM_ENTRY_LOAD_CET_STATE;
msrs->entry_ctls_high |=
(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..4233b5ca9461 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+ FIELD(GUEST_S_CET, guest_s_cet),
+ FIELD(GUEST_SSP, guest_ssp),
+ FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
FIELD(HOST_CR0, host_cr0),
FIELD(HOST_CR3, host_cr3),
FIELD(HOST_CR4, host_cr4),
@@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
FIELD(HOST_RSP, host_rsp),
FIELD(HOST_RIP, host_rip),
+ FIELD(HOST_S_CET, host_s_cet),
+ FIELD(HOST_SSP, host_ssp),
+ FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
};
const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 01936013428b..3884489e7f7e 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -117,7 +117,13 @@ struct __packed vmcs12 {
natural_width host_ia32_sysenter_eip;
natural_width host_rsp;
natural_width host_rip;
- natural_width paddingl[8]; /* room for future expansion */
+ natural_width host_s_cet;
+ natural_width host_ssp;
+ natural_width host_ssp_tbl;
+ natural_width guest_s_cet;
+ natural_width guest_ssp;
+ natural_width guest_ssp_tbl;
+ natural_width paddingl[2]; /* room for future expansion */
u32 pin_based_vm_exec_control;
u32 cpu_based_vm_exec_control;
u32 exception_bitmap;
@@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(host_ia32_sysenter_eip, 656);
CHECK_OFFSET(host_rsp, 664);
CHECK_OFFSET(host_rip, 672);
+ CHECK_OFFSET(host_s_cet, 680);
+ CHECK_OFFSET(host_ssp, 688);
+ CHECK_OFFSET(host_ssp_tbl, 696);
+ CHECK_OFFSET(guest_s_cet, 704);
+ CHECK_OFFSET(guest_ssp, 712);
+ CHECK_OFFSET(guest_ssp_tbl, 720);
CHECK_OFFSET(pin_based_vm_exec_control, 744);
CHECK_OFFSET(cpu_based_vm_exec_control, 748);
CHECK_OFFSET(exception_bitmap, 752);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f0dea8ecd0c6..2c43f1088d77 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7731,6 +7731,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
+ cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
+ cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));

#undef cr4_fixed1_update
}
--
2.27.0

2023-09-14 22:06:46

by Yang Weijiang

Subject: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs

Add an emulation interface for CET MSR access. The emulation code is
split into a common part and a vendor-specific part. The former does
common checks for the MSRs and reads/writes XSAVE-managed MSRs directly
via the helpers, while the latter accesses the MSRs linked to VMCS
fields.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 18 +++++++++++
arch/x86/kvm/x86.c | 71 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fd5893b3a2c8..9f4b56337251 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2111,6 +2111,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
else
msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
break;
+ case MSR_IA32_S_CET:
+ msr_info->data = vmcs_readl(GUEST_S_CET);
+ break;
+ case MSR_KVM_SSP:
+ msr_info->data = vmcs_readl(GUEST_SSP);
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
+ break;
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
break;
@@ -2420,6 +2429,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
else
vmx->pt_desc.guest.addr_a[index / 2] = data;
break;
+ case MSR_IA32_S_CET:
+ vmcs_writel(GUEST_S_CET, data);
+ break;
+ case MSR_KVM_SSP:
+ vmcs_writel(GUEST_SSP, data);
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ vmcs_writel(GUEST_INTR_SSP_TABLE, data);
+ break;
case MSR_IA32_PERF_CAPABILITIES:
if (data && !vcpu_to_pmu(vcpu)->version)
return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 73b45351c0fc..c85ee42ab4f1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1847,6 +1847,11 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
}
EXPORT_SYMBOL_GPL(kvm_msr_allowed);

+#define CET_US_RESERVED_BITS GENMASK(9, 6)
+#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
+#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
+#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
+
/*
* Write @data into the MSR specified by @index. Select MSR specific fault
* checks are bypassed if @host_initiated is %true.
@@ -1856,6 +1861,7 @@ EXPORT_SYMBOL_GPL(kvm_msr_allowed);
static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
bool host_initiated)
{
+ bool host_msr_reset = host_initiated && data == 0;
struct msr_data msr;

switch (index) {
@@ -1906,6 +1912,46 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,

data = (u32)data;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (host_msr_reset && (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
+ kvm_cpu_cap_has(X86_FEATURE_IBT)))
+ break;
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+ !guest_can_use(vcpu, X86_FEATURE_IBT))
+ return 1;
+ if (data & CET_US_RESERVED_BITS)
+ return 1;
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+ (data & CET_US_SHSTK_MASK_BITS))
+ return 1;
+ if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
+ (data & CET_US_IBT_MASK_BITS))
+ return 1;
+ if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
+ return 1;
+
+ /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
+ if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
+ return 1;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ return 1;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ case MSR_KVM_SSP:
+ if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+ break;
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return 1;
+ if (index == MSR_KVM_SSP && !host_initiated)
+ return 1;
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
+ return 1;
+ break;
}

msr.data = data;
@@ -1949,6 +1995,23 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
!guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
return 1;
break;
+ case MSR_IA32_U_CET:
+ case MSR_IA32_S_CET:
+ if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
+ !guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return 1;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ return 1;
+ fallthrough;
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ case MSR_KVM_SSP:
+ if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
+ return 1;
+ if (index == MSR_KVM_SSP && !host_initiated)
+ return 1;
+ break;
}

msr.index = index;
@@ -4009,6 +4072,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.guest_fpu.xfd_err = data;
break;
#endif
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ kvm_set_xstate_msr(vcpu, msr_info);
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr))
return kvm_pmu_set_msr(vcpu, msr_info);
@@ -4365,6 +4432,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vcpu->arch.guest_fpu.xfd_err;
break;
#endif
+ case MSR_IA32_U_CET:
+ case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+ kvm_get_xstate_msr(vcpu, msr_info);
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
return kvm_pmu_get_msr(vcpu, msr_info);
--
2.27.0

2023-09-14 22:59:30

by Yang Weijiang

Subject: [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set

Define a new kernel xfeature set including the features that can be
dynamically enabled, i.e., the relevant feature is enabled on demand.
The xfeature set is currently used by KVM to configure the __guest__
fpstate, e.g., to calculate the xfeature and fpstate storage size. The
xfeature set is initialized once and reused whenever it's referenced,
to avoid repeated calculation.

Currently it's used when: 1) the guest fpstate __state_size is
calculated while guest permissions are configured; 2) a guest vCPU is
created and its fpstate is initialized.

Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c3ed86732d33..eaec05bc1b3c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -84,6 +84,8 @@ static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
{ [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_flags[XFEATURE_MAX] __ro_after_init;

+u64 fpu_kernel_dynamic_xfeatures __ro_after_init;
+
#define XSTATE_FLAG_SUPERVISOR BIT(0)
#define XSTATE_FLAG_ALIGNED64 BIT(1)

@@ -740,6 +742,23 @@ static void __init fpu__init_disable_system_xstate(unsigned int legacy_size)
fpstate_reset(&current->thread.fpu);
}

+static unsigned short xsave_kernel_dynamic_xfeatures[] = {
+ [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
+};
+
+static void __init init_kernel_dynamic_xfeatures(void)
+{
+ unsigned short cid;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(xsave_kernel_dynamic_xfeatures); i++) {
+ cid = xsave_kernel_dynamic_xfeatures[i];
+
+ if (cid && boot_cpu_has(cid))
+ fpu_kernel_dynamic_xfeatures |= BIT_ULL(i);
+ }
+}
+
/*
* Enable and initialize the xsave feature.
* Called once per system bootup.
@@ -809,6 +828,8 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);

+ init_kernel_dynamic_xfeatures();
+
if (!cpu_feature_enabled(X86_FEATURE_XFD))
fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;

--
2.27.0

2023-09-14 23:29:46

by Yang Weijiang

Subject: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit

Remove the XFEATURE_CET_USER entry from the dependency array as the
entry doesn't reflect the true dependency between the CET features and
the xstate bit; instead, manually check and add the bit back if either
SHSTK or IBT is supported.

Both the user mode shadow stack and indirect branch tracking features
depend on the XFEATURE_CET_USER bit in XSS to automatically save/restore
the user mode xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP,
whenever necessary.

Although a platform with IBT but no SHSTK is rare in the real world, it
is common in the virtualization world, where guest SHSTK and IBT can be
controlled independently via the userspace app.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index cadf68737e6b..12c8cb278346 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_OSPKE,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
- [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
}

+ /*
+ * Manually add CET user mode xstate bit if either SHSTK or IBT is
+ * available. Both features depend on the xstate bit to save/restore
+ * CET user mode state.
+ */
+ if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
+ fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
+
if (!cpu_feature_enabled(X86_FEATURE_XFD))
fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;

--
2.27.0

2023-09-15 00:19:03

by Yang Weijiang

Subject: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the current required xstate
size when the XSS MSR is modified.
CPUID.(EAX=0DH,ECX=1).EBX reports the required storage size of all
enabled xstate features in (XCR0 | IA32_XSS). The CPUID value can be
used by the guest before allocating a sufficiently sized xsave buffer.

Note, KVM does not yet support any XSS-based features, i.e.,
supported_xss is guaranteed to be zero at this time.

Opportunistically modify the XSS write access logic: if
!guest_cpuid_has(), a write initiated from the host is allowed iff the
write is a reset operation, i.e., data == 0; reject host-initiated
non-reset writes and any guest write.

Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Zhang Yi Z <[email protected]>
Signed-off-by: Zhang Yi Z <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 15 ++++++++++++++-
arch/x86/kvm/x86.c | 13 +++++++++----
3 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0fc5e6312e93..d77b030e996c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -803,6 +803,7 @@ struct kvm_vcpu_arch {

u64 xcr0;
u64 guest_supported_xcr0;
+ u64 guest_supported_xss;

struct kvm_pio_request pio;
void *pio_data;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1f206caec559..4e7a820cba62 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
best = cpuid_entry2_find(entries, nent, 0xD, 1);
if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
- best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
+ best->ebx = xstate_required_size(vcpu->arch.xcr0 |
+ vcpu->arch.ia32_xss, true);

best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
if (kvm_hlt_in_guest(vcpu->kvm) && best &&
@@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
}

+static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
+ if (!best)
+ return 0;
+
+ return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
+}
+
static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
{
struct kvm_cpuid_entry2 *entry;
@@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
}

vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
+ vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);

/*
* FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1258d1d6dd52..9a616d84bd39 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.ia32_tsc_adjust_msr += adj;
}
break;
- case MSR_IA32_XSS:
- if (!msr_info->host_initiated &&
- !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
+ case MSR_IA32_XSS: {
+ bool host_msr_reset = msr_info->host_initiated && data == 0;
+
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
+ (!host_msr_reset || !msr_info->host_initiated))
return 1;
/*
* KVM supports exposing PT to the guest, but does not support
* IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
* XSAVES/XRSTORS to save/restore PT MSRs.
*/
- if (data & ~kvm_caps.supported_xss)
+ if (data & ~vcpu->arch.guest_supported_xss)
return 1;
+ if (vcpu->arch.ia32_xss == data)
+ break;
vcpu->arch.ia32_xss = data;
kvm_update_cpuid_runtime(vcpu);
break;
+ }
case MSR_SMI_COUNT:
if (!msr_info->host_initiated)
return 1;
--
2.27.0

2023-09-15 01:30:22

by Yang Weijiang

Subject: [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support

Add supervisor mode state support within the FPU xstate management
framework. Although supervisor shadow stack is not enabled/used in the
kernel today, KVM requires the support because when KVM advertises the
shadow stack feature to a guest, architecturally it claims the support
for both user and supervisor modes for Linux and non-Linux guest OSes.

With the xstate support, guest supervisor mode shadow stack state can be
properly saved/restored when 1) the guest/host FPU context is swapped,
and 2) the vCPU thread is scheduled out/in.

The alternative is to enable it in the KVM domain, but the KVM
maintainers NAKed that solution. The external discussion can be found at
[*]; it ended up with adding the support in the kernel instead of the
KVM domain.

Note, in the KVM case, the guest CET supervisor state, i.e., the
IA32_PL{0,1,2}_SSP MSRs, is preserved after VM-Exit until the host/guest
fpstates are swapped, but since host supervisor shadow stack is
disabled, the preserved MSRs won't hurt the host.

[*]: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 14 ++++++++++++--
arch/x86/include/asm/fpu/xstate.h | 6 +++---
arch/x86/kernel/fpu/xstate.c | 6 +++++-
3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb810074f1e7..c6fd13a17205 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -116,7 +116,7 @@ enum xfeature {
XFEATURE_PKRU,
XFEATURE_PASID,
XFEATURE_CET_USER,
- XFEATURE_CET_KERNEL_UNUSED,
+ XFEATURE_CET_KERNEL,
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
@@ -139,7 +139,7 @@ enum xfeature {
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
-#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
@@ -264,6 +264,16 @@ struct cet_user_state {
u64 user_ssp;
};

+/*
+ * State component 12 is Control-flow Enforcement supervisor states
+ */
+struct cet_supervisor_state {
+ /* supervisor ssp pointers */
+ u64 pl0_ssp;
+ u64 pl1_ssp;
+ u64 pl2_ssp;
+};
+
/*
* State component 15: Architectural LBR configuration state.
* The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index d4427b88ee12..3b4a038d3c57 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -51,7 +51,8 @@

/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
- XFEATURE_MASK_CET_USER)
+ XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)

/*
* A supervisor state component may not always contain valuable information,
@@ -78,8 +79,7 @@
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
- XFEATURE_MASK_CET_KERNEL)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)

/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 12c8cb278346..c3ed86732d33 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -51,7 +51,7 @@ static const char *xfeature_names[] =
"Protection Keys User registers",
"PASID state",
"Control-flow User registers",
- "Control-flow Kernel registers (unused)",
+ "Control-flow Kernel registers",
"unknown xstate feature",
"unknown xstate feature",
"unknown xstate feature",
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_OSPKE,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -277,6 +278,7 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
print_xstate_feature(XFEATURE_MASK_CET_USER);
+ print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}
@@ -346,6 +348,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID | \
XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL | \
XFEATURE_MASK_XTILE)

/*
@@ -546,6 +549,7 @@ static bool __init check_xstate_against_struct(int nr)
case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+ case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
default:
XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
--
2.27.0

2023-09-15 01:51:39

by Edgecombe, Rick P

Subject: Re: [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> +static void __init init_kernel_dynamic_xfeatures(void)
> +{
> +       unsigned short cid;
> +       int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(xsave_kernel_dynamic_xfeatures);
> i++) {
> +               cid = xsave_kernel_dynamic_xfeatures[i];
> +
> +               if (cid && boot_cpu_has(cid))
> +                       fpu_kernel_dynamic_xfeatures |= BIT_ULL(i);
> +       }
> +}
> +

I think this can be part of the max_features calculation that uses
xsave_cpuid_features when you use use a fixed mask like Dave suggested
in the other patch.

2023-09-15 04:25:37

by Yang Weijiang

Subject: [PATCH v6 07/25] x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic xfeatures

The guest fpstate is sized with fpu_kernel_cfg.default_size (by the
preceding fix) and the kernel dynamic xfeatures are not taken into
account, so add the support and tweak the fpstate xfeatures and size
accordingly.

The configuration steps below are currently enforced to get the guest
fpstate:
1) User space sets thread group xstate permits via arch_prctl().
2) User space creates vcpu thread.
3) User space enables guest dynamic xfeatures.

In #1, the guest fpstate size (i.e., __state_size [1]) is derived from
(fpu_kernel_cfg.default_features | user dynamic xfeatures) [2].
In #2, the guest fpstate size is calculated with
fpu_kernel_cfg.default_size and fpstate->size is set to the same.
fpstate->xfeatures is set to fpu_kernel_cfg.default_features.
In #3, the guest fpstate is re-allocated to size [1] and
fpstate->xfeatures is set to [2].

By adding the kernel dynamic xfeatures in #1 and #2 above, the guest
xstate area size is expanded to hold (fpu_kernel_cfg.default_features |
kernel dynamic xfeatures | user dynamic xfeatures) [3], and the guest
fpstate->xfeatures is set to [3]. Then host xsaves/xrstors can act on
all guest xfeatures.

The user_* fields remain unchanged for compatibility with the
non-compacted KVM uAPIs.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kernel/fpu/core.c | 56 +++++++++++++++++++++++++++++-------
arch/x86/kernel/fpu/xstate.c | 2 +-
arch/x86/kernel/fpu/xstate.h | 2 ++
3 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a42d8ad26ce6..e5819b38545a 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -33,6 +33,8 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
DEFINE_PER_CPU(u64, xfd_state);
#endif

+extern unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
+
/* The FPU state configuration data for kernel and user space */
struct fpu_state_config fpu_kernel_cfg __ro_after_init;
struct fpu_state_config fpu_user_cfg __ro_after_init;
@@ -193,8 +195,6 @@ void fpu_reset_from_exception_fixup(void)
}

#if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
-
static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
{
struct fpu_state_perm *fpuperm;
@@ -215,28 +215,64 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
}

-bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
{
+ bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+ unsigned int gfpstate_size, size;
struct fpstate *fpstate;
- unsigned int size;
+ u64 xfeatures;
+
+ /*
+ * fpu_kernel_cfg.default_features includes all enabled xfeatures
+ * except those dynamic xfeatures. Compared with user dynamic
+ * xfeatures, the kernel dynamic ones are enabled for guest by
+ * default, so add the kernel dynamic xfeatures back when calculate
+ * guest fpstate size.
+ *
+ * If the user dynamic xfeatures are enabled, the guest fpstate will
+ * be re-allocated to hold all guest enabled xfeatures, so omit user
+ * dynamic xfeatures here.
+ */
+ xfeatures = fpu_kernel_cfg.default_features |
+ fpu_kernel_dynamic_xfeatures;
+
+ gfpstate_size = xstate_calculate_size(xfeatures, compacted);

- size = fpu_kernel_cfg.default_size +
- ALIGN(offsetof(struct fpstate, regs), 64);
+ size = gfpstate_size + ALIGN(offsetof(struct fpstate, regs), 64);

fpstate = vzalloc(size);
if (!fpstate)
- return false;
+ return NULL;
+ /*
+ * Initialize sizes and feature masks, use fpu_user_cfg.*
+ * for user_* settings for compatibility with existing uAPIs.
+ */
+ fpstate->size = gfpstate_size;
+ fpstate->xfeatures = xfeatures;
+ fpstate->user_size = fpu_user_cfg.default_size;
+ fpstate->user_xfeatures = fpu_user_cfg.default_features;
+ fpstate->xfd = 0;

- /* Leave xfd to 0 (the reset value defined by spec) */
- __fpstate_reset(fpstate, 0);
fpstate_init_user(fpstate);
fpstate->is_valloc = true;
fpstate->is_guest = true;

gfpu->fpstate = fpstate;
- gfpu->xfeatures = fpu_user_cfg.default_features;
+ gfpu->xfeatures = xfeatures;
gfpu->perm = fpu_user_cfg.default_features;

+ return fpstate;
+}
+
+bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+{
+ struct fpstate *fpstate;
+
+ fpstate = __fpu_alloc_init_guest_fpstate(gfpu);
+
+ if (!fpstate)
+ return false;
+
/*
* KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
* to userspace, even when XSAVE is unsupported, so that restoring FPU
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c5d903b4df4d..87149aba6f11 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -561,7 +561,7 @@ static bool __init check_xstate_against_struct(int nr)
return true;
}

-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
{
unsigned int topmost = fls64(xfeatures) - 1;
unsigned int offset = xstate_offsets[topmost];
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..9c6e3ca05c5c 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -10,6 +10,8 @@
DECLARE_PER_CPU(u64, xfd_state);
#endif

+extern u64 fpu_kernel_dynamic_xfeatures;
+
static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
{
/*
--
2.27.0

2023-09-15 08:23:34

by Yang Weijiang

Subject: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

Per the SDM description (Vol. 3D, Appendix A.1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector"

Modify the has_error_code check before injecting events to the nested
guest. Only enforce the check when the guest is in real mode, the
exception is not a hardware exception, or the platform doesn't enumerate
bit 56 in VMX_BASIC; in all other cases, skip the check to make the
logic consistent with the SDM.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++--------
arch/x86/kvm/vmx/nested.h | 5 +++++
2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c5ec0ef51ff7..78a3be394d00 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
{
const u64 feature_and_reserved =
/* feature (except bit 48; see below) */
- BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
+ BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
/* reserved */
- BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
+ BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
u64 vmx_basic = vmcs_config.nested.basic;

if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
@@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
return -EINVAL;

- /* VM-entry interruption-info field: deliver error code */
- should_have_error_code =
- intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
- x86_exception_has_error_code(vector);
- if (CC(has_error_code != should_have_error_code))
- return -EINVAL;
+ if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
+ !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
+ /* VM-entry interruption-info field: deliver error code */
+ should_have_error_code =
+ intr_type == INTR_TYPE_HARD_EXCEPTION &&
+ prot_mode &&
+ x86_exception_has_error_code(vector);
+ if (CC(has_error_code != should_have_error_code))
+ return -EINVAL;
+ }

/* VM-entry exception error code */
if (CC(has_error_code &&
@@ -6968,6 +6972,8 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)

if (cpu_has_vmx_basic_inout())
msrs->basic |= VMX_BASIC_INOUT;
+ if (cpu_has_vmx_basic_no_hw_errcode())
+ msrs->basic |= VMX_BASIC_NO_HW_ERROR_CODE_CC;
}

static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index b4b9d51438c6..26842da6857d 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -284,6 +284,11 @@ static inline bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
__kvm_is_valid_cr4(vcpu, val);
}

+static inline bool nested_cpu_has_no_hw_errcode_cc(struct kvm_vcpu *vcpu)
+{
+ return to_vmx(vcpu)->nested.msrs.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
/* No difference in the restrictions on guest and host CR4 in VMX operation. */
#define nested_guest_cr4_valid nested_cr4_valid
#define nested_host_cr4_valid nested_cr4_valid
--
2.27.0

2023-09-15 09:42:59

by Edgecombe, Rick P

Subject: Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Remove XFEATURE_CET_USER entry from dependency array as the entry
> doesn't
> reflect true dependency between CET features and the xstate bit,
> instead
> manually check and add the bit back if either SHSTK or IBT is
> supported.
>
> Both user mode shadow stack and indirect branch tracking features
> depend
> on XFEATURE_CET_USER bit in XSS to automatically save/restore user
> mode
> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever
> necessary.
>
> Although in real world a platform with IBT but no SHSTK is rare, but
> in
> virtualization world it's common, guest SHSTK and IBT can be
> controlled
> independently via userspace app.

Nit, not sure we can assert it's common yet. It's true in general that
guests can have CPUID combinations that don't appear in real world of
course. Is that what you meant?

Also, this doesn't discuss the real main reason for this patch, and
that is that KVM will soon use the xfeature for user ibt, and so there
will now be a reason to have XFEATURE_CET_USER depend on IBT.

>
> Signed-off-by: Yang Weijiang <[email protected]>

Otherwise:

Reviewed-by: Rick Edgecombe <[email protected]>
Tested-by: Rick Edgecombe <[email protected]>

2023-09-15 13:32:58

by Yang Weijiang

Subject: Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit

On 9/15/2023 6:39 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>> Remove XFEATURE_CET_USER entry from dependency array as the entry
>> doesn't
>> reflect true dependency between CET features and the xstate bit,
>> instead
>> manually check and add the bit back if either SHSTK or IBT is
>> supported.
>>
>> Both user mode shadow stack and indirect branch tracking features
>> depend
>> on XFEATURE_CET_USER bit in XSS to automatically save/restore user
>> mode
>> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever
>> necessary.
>>
>> Although in real world a platform with IBT but no SHSTK is rare, but
>> in
>> virtualization world it's common, guest SHSTK and IBT can be
>> controlled
>> independently via userspace app.
> Nit, not sure we can assert it's common yet. It's true in general that
> guests can have CPUID combinations that don't appear in real world of
> course. Is that what you meant?

Yes, guest CPUID features can be configured by userspace flexibly.

>
> Also, this doesn't discuss the real main reason for this patch, and
> that is that KVM will soon use the xfeature for user ibt, and so there
> will now be a reason to have XFEATURE_CET_USER depend on IBT.

This is one justification for Linux; another reason is that there are
non-Linux OSes which use the user IBT feature. I should make the
reasons clearer in the changelog, thanks for pointing it out!

>> Signed-off-by: Yang Weijiang <[email protected]>
> Otherwise:
>
> Reviewed-by: Rick Edgecombe <[email protected]>
> Tested-by: Rick Edgecombe <[email protected]>

2023-09-15 14:00:22

by Yang Weijiang

Subject: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"

Use the governed feature framework to track whether the
X86_FEATURE_SHSTK and X86_FEATURE_IBT features can be used by userspace
and the guest, i.e., a feature can be used iff both KVM and the guest
CPUID support it.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kvm/governed_features.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
index 423a73395c10..db7e21c5ecc2 100644
--- a/arch/x86/kvm/governed_features.h
+++ b/arch/x86/kvm/governed_features.h
@@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
KVM_GOVERNED_X86_FEATURE(VGIF)
KVM_GOVERNED_X86_FEATURE(VNMI)
+KVM_GOVERNED_X86_FEATURE(SHSTK)
+KVM_GOVERNED_X86_FEATURE(IBT)

#undef KVM_GOVERNED_X86_FEATURE
#undef KVM_GOVERNED_FEATURE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9409753f45b0..fd5893b3a2c8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7765,6 +7765,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);

kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);

vmx_setup_uret_msrs(vmx);

--
2.27.0

2023-09-15 16:26:38

by Yang Weijiang

Subject: Re: [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set

On 9/15/2023 8:24 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>> +static void __init init_kernel_dynamic_xfeatures(void)
>> +{
>> +       unsigned short cid;
>> +       int i;
>> +
>> +       for (i = 0; i < ARRAY_SIZE(xsave_kernel_dynamic_xfeatures);
>> i++) {
>> +               cid = xsave_kernel_dynamic_xfeatures[i];
>> +
>> +               if (cid && boot_cpu_has(cid))
>> +                       fpu_kernel_dynamic_xfeatures |= BIT_ULL(i);
>> +       }
>> +}
>> +
> I think this can be part of the max_features calculation that uses
> xsave_cpuid_features when you use use a fixed mask like Dave suggested
> in the other patch.

Yes, max_features already includes the CET supervisor state bit. After
using the fixed mask, this function is not needed.


2023-09-15 21:14:12

by Edgecombe, Rick P

Subject: Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit

On Fri, 2023-09-15 at 10:32 +0800, Yang, Weijiang wrote:
> >
> > Also, this doesn't discuss the real main reason for this patch, and
> > that is that KVM will soon use the xfeature for user ibt, and so
> > there
> > will now be a reason to have XFEATURE_CET_USER depend on IBT.
>
> This is one justification for Linux OS, another reason is there's
> non-Linux
> OS which is using the user IBT feature.  I should make the reasons
> clearer
> in changelog, thanks for pointing it out!

The point I was trying to make was today (before this series) nothing
on the system can use user IBT. Not the host, and not in any guest
because KVM doesn't support it. So the added xfeature dependency on IBT
was not previously needed. It is being added only for KVM CET support
(which, yes, may run on guests with non-standard CPUID).


2023-09-18 13:30:45

by Yang Weijiang

Subject: Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit

On 9/16/2023 12:35 AM, Edgecombe, Rick P wrote:
> On Fri, 2023-09-15 at 10:32 +0800, Yang, Weijiang wrote:
>>> Also, this doesn't discuss the real main reason for this patch, and
>>> that is that KVM will soon use the xfeature for user ibt, and so
>>> there
>>> will now be a reason to have XFEATURE_CET_USER depend on IBT.
>> This is one justification for Linux OS, another reason is there's
>> non-Linux
>> OS which is using the user IBT feature.  I should make the reasons
>> clearer
>> in changelog, thanks for pointing it out!
> The point I was trying to make was today (before this series) nothing
> on the system can use user IBT. Not the host, and not in any guest
> because KVM doesn't support it. So the added xfeature dependency on IBT
> was not previously needed. It is being added only for KVM CET support
> (which, yes, may run on guests with non-standard CPUID).

Agree, I'll highlight this in changelog, thanks!


2023-09-25 13:56:45

by Yang Weijiang

Subject: Re: [PATCH v6 00/25] Enable CET Virtualization


Kindly pinging the maintainers for review of the KVM part, thanks!

On 9/14/2023 2:33 PM, Yang Weijiang wrote:
> Control-flow Enforcement Technology (CET) is a kind of CPU feature used
> to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP) attacks.
> It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
> style control-flow subversion attacks.
>
> Shadow Stack (SHSTK):
> A shadow stack is a second stack used exclusively for control transfer
> operations. The shadow stack is separate from the data/normal stack and
> can be enabled individually in user and kernel mode. When shadow stack
> is enabled, CALL pushes the return address on both the data and shadow
> stack. RET pops the return address from both stacks and compares them.
> If the return addresses from the two stacks do not match, the processor
> generates a #CP.
>
> Indirect Branch Tracking (IBT):
> IBT introduces new instruction(ENDBRANCH)to mark valid target addresses of
> indirect branches (CALL, JMP etc...). If an indirect branch is executed
> and the next instruction is _not_ an ENDBRANCH, the processor generates a
> #CP. These instruction behaves as a NOP on platforms that doesn't support
> CET.
>
>
[...]

2023-10-08 05:56:00

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>index 0fc5e6312e93..d77b030e996c 100644
>--- a/arch/x86/include/asm/kvm_host.h
>+++ b/arch/x86/include/asm/kvm_host.h
>@@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
>
> u64 xcr0;
> u64 guest_supported_xcr0;
>+ u64 guest_supported_xss;

This structure has the ia32_xss field. How about moving it here for symmetry?

>
> struct kvm_pio_request pio;
> void *pio_data;
>diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>index 1f206caec559..4e7a820cba62 100644
>--- a/arch/x86/kvm/cpuid.c
>+++ b/arch/x86/kvm/cpuid.c
>@@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
> best = cpuid_entry2_find(entries, nent, 0xD, 1);
> if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
> cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
>- best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
>+ best->ebx = xstate_required_size(vcpu->arch.xcr0 |
>+ vcpu->arch.ia32_xss, true);
>
> best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
> if (kvm_hlt_in_guest(vcpu->kvm) && best &&
>@@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> }
>
>+static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
>+{
>+ struct kvm_cpuid_entry2 *best;
>+
>+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
>+ if (!best)
>+ return 0;
>+
>+ return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
>+}
>+
> static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
> {
> struct kvm_cpuid_entry2 *entry;
>@@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> }
>
> vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
>+ vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
>
> /*
> * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 1258d1d6dd52..9a616d84bd39 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vcpu->arch.ia32_tsc_adjust_msr += adj;
> }
> break;
>- case MSR_IA32_XSS:
>- if (!msr_info->host_initiated &&
>- !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>+ case MSR_IA32_XSS: {
>+ bool host_msr_reset = msr_info->host_initiated && data == 0;
>+
>+ if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
>+ (!host_msr_reset || !msr_info->host_initiated))

!msr_info->host_initiated can be dropped here.
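
I.e. something like this should be equivalent, since host_msr_reset already
implies msr_info->host_initiated (sketch only):

	if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) && !host_msr_reset)
		return 1;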

2023-10-10 00:49:59

by Yang Weijiang

[permalink] [raw]
Subject: Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

On 10/8/2023 1:54 PM, Chao Gao wrote:
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 0fc5e6312e93..d77b030e996c 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
>>
>> u64 xcr0;
>> u64 guest_supported_xcr0;
>> + u64 guest_supported_xss;
> This structure has the ia32_xss field. How about moving it here for symmetry?

OK, will do it, thanks!

>> struct kvm_pio_request pio;
>> void *pio_data;
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 1f206caec559..4e7a820cba62 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
>> best = cpuid_entry2_find(entries, nent, 0xD, 1);
>> if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
>> cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
>> - best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
>> + best->ebx = xstate_required_size(vcpu->arch.xcr0 |
>> + vcpu->arch.ia32_xss, true);
>>
>> best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
>> if (kvm_hlt_in_guest(vcpu->kvm) && best &&
>> @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
>> return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
>> }
>>
>> +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm_cpuid_entry2 *best;
>> +
>> + best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
>> + if (!best)
>> + return 0;
>> +
>> + return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
>> +}
>> +
>> static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
>> {
>> struct kvm_cpuid_entry2 *entry;
>> @@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> }
>>
>> vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
>> + vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
>>
>> /*
>> * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1258d1d6dd52..9a616d84bd39 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> vcpu->arch.ia32_tsc_adjust_msr += adj;
>> }
>> break;
>> - case MSR_IA32_XSS:
>> - if (!msr_info->host_initiated &&
>> - !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>> + case MSR_IA32_XSS: {
>> + bool host_msr_reset = msr_info->host_initiated && data == 0;
>> +
>> + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
>> + (!host_msr_reset || !msr_info->host_initiated))
> !msr_info->host_initiated can be dropped here.

Yes, it's not necessary, will remove it, thanks!

2023-10-31 17:44:37

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Remove XFEATURE_CET_USER entry from dependency array as the entry doesn't
> reflect true dependency between CET features and the xstate bit, instead
> manually check and add the bit back if either SHSTK or IBT is supported.
>
> Both user mode shadow stack and indirect branch tracking features depend
> on XFEATURE_CET_USER bit in XSS to automatically save/restore user mode
> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever necessary.
>
> Although in real world a platform with IBT but no SHSTK is rare, but in
> virtualization world it's common, guest SHSTK and IBT can be controlled
> independently via userspace app.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index cadf68737e6b..12c8cb278346 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
> [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
> [XFEATURE_PKRU] = X86_FEATURE_OSPKE,
> [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
> - [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
> [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
> [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
> };
> @@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
> }
>
> + /*
> + * Manually add CET user mode xstate bit if either SHSTK or IBT is
> + * available. Both features depend on the xstate bit to save/restore
> + * CET user mode state.
> + */
> + if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
> + fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
> +
> if (!cpu_feature_enabled(X86_FEATURE_XFD))
> fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>


The goal of the xsave_cpuid_features array is to disable xfeature state bits that are
enumerated in CPUID while their parent feature bit (e.g. X86_FEATURE_AVX512) is
disabled in CPUID. That should not happen on a real CPU, but can happen if the user
explicitly disables the feature on the kernel command line and/or under virtualization.

However, the above code does the opposite: it will enable the XFEATURE_CET_USER
xsaves component when in fact it might be disabled in CPUID (and one could argue
that in theory such a configuration is even useful, since the kernel can still
context-switch the CET MSRs manually).


So I think that the code should do this instead:

if (!boot_cpu_has(X86_FEATURE_SHSTK) && !boot_cpu_has(X86_FEATURE_IBT))
	fpu_kernel_cfg.max_features &= ~BIT_ULL(XFEATURE_CET_USER);


Best regards,
Maxim Levitsky




2023-10-31 17:46:15

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set

On Fri, 2023-09-15 at 14:42 +0800, Yang, Weijiang wrote:
> On 9/15/2023 8:24 AM, Edgecombe, Rick P wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > +static void __init init_kernel_dynamic_xfeatures(void)
> > > +{
> > > + unsigned short cid;
> > > + int i;
> > > +
> > > + for (i = 0; i < ARRAY_SIZE(xsave_kernel_dynamic_xfeatures);
> > > i++) {
> > > + cid = xsave_kernel_dynamic_xfeatures[i];
> > > +
> > > + if (cid && boot_cpu_has(cid))
> > > + fpu_kernel_dynamic_xfeatures |= BIT_ULL(i);
> > > + }
> > > +}
> > > +
> > I think this can be part of the max_features calculation that uses
> > xsave_cpuid_features when you use use a fixed mask like Dave suggested
> > in the other patch.
>
> Yes, max_features already includes the CET supervisor state bit. After using the
> fixed mask, this function is not needed.
>
>
My 0.2 cents are also on having an XFEATURE_MASK_KERNEL_DYNAMIC macro instead.
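
Something like this, just as a sketch (whether XFEATURE_CET_KERNEL is the
right/only bit depends on the final patches):

	/* Kernel dynamic xfeatures, enabled for guest fpstates by default. */
	#define XFEATURE_MASK_KERNEL_DYNAMIC	BIT_ULL(XFEATURE_CET_KERNEL)

so the mask is a compile-time constant instead of being rebuilt at boot from a
per-feature loop.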

Best regards,
Maxim Levitsky




2023-10-31 17:46:32

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 07/25] x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic xfeatures

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> The guest fpstate is sized with fpu_kernel_cfg.default_size (by preceding
> fix) and the kernel dynamic xfeatures are not taken into account, so add
> the support and tweak fpstate xfeatures and size accordingly.
>
> Below configuration steps are currently enforced to get guest fpstate:
> 1) User space sets thread group xstate permits via arch_prctl().
> 2) User space creates vcpu thread.
> 3) User space enables guest dynamic xfeatures.
>
> In #1, guest fpstate size (i.e., __state_size [1]) is induced from
> (fpu_kernel_cfg.default_features | user dynamic xfeatures) [2].
> In #2, guest fpstate size is calculated with fpu_kernel_cfg.default_size
> and fpstate->size is set to the same. fpstate->xfeatures is set to
> fpu_kernel_cfg.default_features.
> In #3, guest fpstate is re-allocated as [1] and fpstate->xfeatures is
> set to [2].
>
> By adding kernel dynamic xfeatures in above #1 and #2, guest xstate area
> size is expanded to hold (fpu_kernel_cfg.default_features | kernel dynamic
> _xfeatures | user dynamic xfeatures)[3], and guest fpstate->xfeatures is
> set to [3]. Then host xsaves/xrstors can act on all guest xfeatures.
>
> The user_* fields remain unchanged for compatibility of non-compacted KVM
> uAPIs.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kernel/fpu/core.c | 56 +++++++++++++++++++++++++++++-------
> arch/x86/kernel/fpu/xstate.c | 2 +-
> arch/x86/kernel/fpu/xstate.h | 2 ++
> 3 files changed, 49 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index a42d8ad26ce6..e5819b38545a 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -33,6 +33,8 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
> DEFINE_PER_CPU(u64, xfd_state);
> #endif
>
> +extern unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
> +
> /* The FPU state configuration data for kernel and user space */
> struct fpu_state_config fpu_kernel_cfg __ro_after_init;
> struct fpu_state_config fpu_user_cfg __ro_after_init;
> @@ -193,8 +195,6 @@ void fpu_reset_from_exception_fixup(void)
> }
>
> #if IS_ENABLED(CONFIG_KVM)
> -static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
> -
> static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
> {
> struct fpu_state_perm *fpuperm;
> @@ -215,28 +215,64 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
> gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
> }
>
> -bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
> {
> + bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
> + unsigned int gfpstate_size, size;
> struct fpstate *fpstate;
> - unsigned int size;
> + u64 xfeatures;
> +
> + /*
> + * fpu_kernel_cfg.default_features includes all enabled xfeatures
> + * except those dynamic xfeatures. Compared with user dynamic
> + * xfeatures, the kernel dynamic ones are enabled for guest by
> + * default, so add the kernel dynamic xfeatures back when calculate
> + * guest fpstate size.
> + *
> + * If the user dynamic xfeatures are enabled, the guest fpstate will
> + * be re-allocated to hold all guest enabled xfeatures, so omit user
> + * dynamic xfeatures here.
> + */
> + xfeatures = fpu_kernel_cfg.default_features |
> + fpu_kernel_dynamic_xfeatures;


This is roughly what I had in mind when I was reviewing the previous patch.
However, for the sake of not hard-coding even more of the KVM policy here,
I would let KVM tell this function, via a parameter, whether it wants the
kernel dynamic features enabled, or even better *which* features it wants
to enable.
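
E.g. a rough sketch (the kernel_dyn_xfeatures parameter is an assumption of
this sketch, not something from the series):

	static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu,
							      u64 kernel_dyn_xfeatures)
	{
		bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
		/* Caller (KVM) decides which kernel dynamic features it wants. */
		u64 xfeatures = fpu_kernel_cfg.default_features |
				kernel_dyn_xfeatures;
		unsigned int gfpstate_size = xstate_calculate_size(xfeatures,
								   compacted);

		/* ... rest of the allocation/initialization as in the patch ... */
	}

That way the FPU core doesn't hard-code what KVM wants.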


> +
> + gfpstate_size = xstate_calculate_size(xfeatures, compacted);
>
> - size = fpu_kernel_cfg.default_size +
> - ALIGN(offsetof(struct fpstate, regs), 64);
> + size = gfpstate_size + ALIGN(offsetof(struct fpstate, regs), 64);
>
> fpstate = vzalloc(size);
> if (!fpstate)
> - return false;
> + return NULL;
> + /*
> + * Initialize sizes and feature masks, use fpu_user_cfg.*
> + * for user_* settings for compatibility of exiting uAPIs.
> + */
> + fpstate->size = gfpstate_size;
> + fpstate->xfeatures = xfeatures;
> + fpstate->user_size = fpu_user_cfg.default_size;
> + fpstate->user_xfeatures = fpu_user_cfg.default_features;
> + fpstate->xfd = 0;
>
> - /* Leave xfd to 0 (the reset value defined by spec) */
> - __fpstate_reset(fpstate, 0);
> fpstate_init_user(fpstate);
> fpstate->is_valloc = true;
> fpstate->is_guest = true;
>
> gfpu->fpstate = fpstate;
> - gfpu->xfeatures = fpu_user_cfg.default_features;
> + gfpu->xfeatures = xfeatures;
> gfpu->perm = fpu_user_cfg.default_features;
>
> + return fpstate;
> +}


I think that this code will break later, when the permission API is called by
KVM, because it will overwrite fpstate->user_size and fpstate->size
assuming that all kernel dynamic features are enabled/disabled (depending
on Sean's patch).


> +
> +bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +{
> + struct fpstate *fpstate;
> +
> + fpstate = __fpu_alloc_init_guest_fpstate(gfpu);
> +
> + if (!fpstate)
> + return false;
> +

What is the point of the __fpu_alloc_init_guest_fpstate / fpu_alloc_guest_fpstate split,
since there is only one caller?


Best regards,
Maxim Levitsky

> /*
> * KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
> * to userspace, even when XSAVE is unsupported, so that restoring FPU
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index c5d903b4df4d..87149aba6f11 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -561,7 +561,7 @@ static bool __init check_xstate_against_struct(int nr)
> return true;
> }
>
> -static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
> +unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
> {
> unsigned int topmost = fls64(xfeatures) - 1;
> unsigned int offset = xstate_offsets[topmost];
> diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
> index a4ecb04d8d64..9c6e3ca05c5c 100644
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -10,6 +10,8 @@
> DECLARE_PER_CPU(u64, xfd_state);
> #endif
>
> +extern u64 fpu_kernel_dynamic_xfeatures;
> +
> static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
> {
> /*







2023-10-31 17:48:12

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
> helpers to replace existing usage of the raw functions.
> kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
> to get/set a MSR value for emulating CPU behavior.

I am not sure whether I like this patch or not. On one hand the code is cleaner this way,
but on the other hand it is now easier to call kvm_msr_write() on behalf of the guest.

For example, we also have kvm_set_msr(), which actually does set the MSR on behalf of the guest.

How about we call the new function kvm_msr_set_host() and rename kvm_set_msr() to kvm_msr_set_guest(),
together with good comments explaining what they do?
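
I.e. something along these lines for the write side (sketch only, names as
suggested above):

	/* Set an MSR from KVM/host context, bypassing guest-visible checks. */
	int kvm_msr_set_host(struct kvm_vcpu *vcpu, u32 index, u64 data);

	/* Set an MSR on behalf of the guest, with all guest-visible checks. */
	int kvm_msr_set_guest(struct kvm_vcpu *vcpu, u32 index, u64 data);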


Also, functions like kvm_set_msr_ignored_check(), kvm_set_msr_with_filter() and such
IMHO have names that are not very user friendly.

A refactoring is very welcome in this area. At the very least they should gain
thoughtful comments about what they do.


For reading msrs API, I can suggest similar names and comments:

/*
 * Read the value of an MSR.
 * Some MSRs exist in the KVM model even when the guest can't read them.
 */
int kvm_get_msr_value(struct kvm_vcpu *vcpu, u32 index, u64 *data);

/* Read the value of an MSR on behalf of the guest. */
int kvm_get_guest_msr_value(struct kvm_vcpu *vcpu, u32 index, u64 *data);


Although I am not going to argue over this, there are multiple ways to improve this
area; keeping things as-is, or something similar to this patch, is also fine with me.


Best regards,
Maxim Levitsky

>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 +++-
> arch/x86/kvm/cpuid.c | 2 +-
> arch/x86/kvm/x86.c | 16 +++++++++++++---
> 3 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1a4def36d5bb..0fc5e6312e93 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1956,7 +1956,9 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
>
> void kvm_enable_efer_bits(u64);
> bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
> -int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
> +
> +int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
> +int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
> int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
> int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);
> int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 7c3e4a550ca7..1f206caec559 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1531,7 +1531,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
> *edx = entry->edx;
> if (function == 7 && index == 0) {
> u64 data;
> - if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
> + if (!kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
> (data & TSX_CTRL_CPUID_CLEAR))
> *ebx &= ~(F(RTM) | F(HLE));
> } else if (function == 0x80000007) {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6c9c81e82e65..e0b55c043dab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1917,8 +1917,8 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
> * Returns 0 on success, non-0 otherwise.
> * Assumes vcpu_load() was already called.
> */
> -int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> - bool host_initiated)
> +static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> + bool host_initiated)
> {
> struct msr_data msr;
> int ret;
> @@ -1944,6 +1944,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> return ret;
> }
>
> +int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
> +{
> + return __kvm_set_msr(vcpu, index, data, true);
> +}
> +
> +int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
> +{
> + return __kvm_get_msr(vcpu, index, data, true);
> +}
> +
> static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
> u32 index, u64 *data, bool host_initiated)
> {
> @@ -12082,7 +12092,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
>
> __kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
> - __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
> + kvm_msr_write(vcpu, MSR_IA32_XSS, 0);
> }
>
> /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */






2023-10-31 17:52:27

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Update CPUID.(EAX=0DH,ECX=1).EBX to reflect current required xstate size
> due to XSS MSR modification.
> CPUID(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
> xstate features in (XCR0 | IA32_XSS). The CPUID value can be used by guest
> before allocate sufficient xsave buffer.
>
> Note, KVM does not yet support any XSS based features, i.e. supported_xss
> is guaranteed to be zero at this time.
>
> Opportunistically modify XSS write access logic as: if !guest_cpuid_has(),
> write initiated from host is allowed iff the write is reset operaiton,
> i.e., data == 0, reject host_initiated non-reset write and any guest write.

The commit message is not clear and somewhat misleading, because it forces the reader
to parse the whole patch before they can understand what '!guest_cpuid_has()'
means.

Also I don't think that the term 'reset operation' is a good choice, because it is too closely
related to vCPU reset IMHO. Let's at least call it 'reset to a default value' or something like that.
Also note that 0 is not always the default/reset value of an MSR.

I suggest this instead:

"If XSAVES is not enabled in the guest CPUID, forbid setting IA32_XSS msr
to anything but 0, even if the write is host initiated."

Also, isn't this change, at least in theory, not backward compatible?
While KVM didn't report IA32_XSS as an MSR needing save/restore, before this
patch userspace could set IA32_XSS to any value; now it can't.

Maybe it's safer to allow setting any value, ignore the value that was set, and
issue a WARN_ON_ONCE or something?
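
One possible shape (sketch only):

	case MSR_IA32_XSS:
		if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES)) {
			if (!msr_info->host_initiated)
				return 1;
			/* Keep the old uAPI: accept the write, ignore the value. */
			WARN_ON_ONCE(data);
			break;
		}
		/* ... existing checks against supported XSS follow ... */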

Finally, I think this change is better done in a separate patch,
because it is unrelated and might not even be backward compatible.

Best regards,
Maxim Levitsky

>
> Suggested-by: Sean Christopherson <[email protected]>
> Co-developed-by: Zhang Yi Z <[email protected]>
> Signed-off-by: Zhang Yi Z <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/cpuid.c | 15 ++++++++++++++-
> arch/x86/kvm/x86.c | 13 +++++++++----
> 3 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0fc5e6312e93..d77b030e996c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
>
> u64 xcr0;
> u64 guest_supported_xcr0;
> + u64 guest_supported_xss;
>
> struct kvm_pio_request pio;
> void *pio_data;
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1f206caec559..4e7a820cba62 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
> best = cpuid_entry2_find(entries, nent, 0xD, 1);
> if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
> cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
> - best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
> + best->ebx = xstate_required_size(vcpu->arch.xcr0 |
> + vcpu->arch.ia32_xss, true);
>
> best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
> if (kvm_hlt_in_guest(vcpu->kvm) && best &&
> @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> }
>
> +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpuid_entry2 *best;
> +
> + best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
> + if (!best)
> + return 0;
> +
> + return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
> +}

Same question as for the patch that added vcpu_get_supported_xcr0():
why have a per-vCPU supported XSS if we assume that all vCPUs have the same
CPUID?

I mean, I am not against supporting hybrid CPU models, but KVM currently doesn't
support them, and this creates the illusion that it does.

> +
> static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
> {
> struct kvm_cpuid_entry2 *entry;
> @@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> }
>
> vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
> + vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
>
> /*
> * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1258d1d6dd52..9a616d84bd39 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vcpu->arch.ia32_tsc_adjust_msr += adj;
> }
> break;
> - case MSR_IA32_XSS:
> - if (!msr_info->host_initiated &&
> - !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
> + case MSR_IA32_XSS: {
> + bool host_msr_reset = msr_info->host_initiated && data == 0;
> +
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
> + (!host_msr_reset || !msr_info->host_initiated))
> return 1;
> /*
> * KVM supports exposing PT to the guest, but does not support
> * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
> * XSAVES/XRSTORS to save/restore PT MSRs.
> */
Just in case.... TODO

> - if (data & ~kvm_caps.supported_xss)
> + if (data & ~vcpu->arch.guest_supported_xss)
> return 1;
> + if (vcpu->arch.ia32_xss == data)
> + break;
> vcpu->arch.ia32_xss = data;
> kvm_update_cpuid_runtime(vcpu);
> break;
> + }
> case MSR_SMI_COUNT:
> if (!msr_info->host_initiated)
> return 1;






2023-10-31 17:56:26

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Use the governed feature framework to track whether X86_FEATURE_SHSTK
> and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> the features can be used iff both KVM and guest CPUID can support them.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/governed_features.h | 2 ++
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
> index 423a73395c10..db7e21c5ecc2 100644
> --- a/arch/x86/kvm/governed_features.h
> +++ b/arch/x86/kvm/governed_features.h
> @@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
> KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
> KVM_GOVERNED_X86_FEATURE(VGIF)
> KVM_GOVERNED_X86_FEATURE(VNMI)
> +KVM_GOVERNED_X86_FEATURE(SHSTK)
> +KVM_GOVERNED_X86_FEATURE(IBT)
>
> #undef KVM_GOVERNED_X86_FEATURE
> #undef KVM_GOVERNED_FEATURE
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9409753f45b0..fd5893b3a2c8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7765,6 +7765,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
>
> kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
> + kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
> + kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
>
> vmx_setup_uret_msrs(vmx);
>

Reviewed-by: Maxim Levitsky <[email protected]>


PS: IMHO the whole 'governed feature framework' is very confusing and somewhat poorly documented.

Currently the only partial explanation of it is in 'governed_features', which doesn't
explain how to use it.

For reference, this is how KVM expects governed features to be used in the common case
(there are some exceptions to this, but they are rare):

1. If a feature is not enabled in host CPUID or KVM doesn't support it,
KVM is expected to not enable it in KVM cpu caps.

2. Userspace uploads guest CPUID.

3. After the guest CPUID upload, the vendor code calls kvm_governed_feature_check_and_set(), which sets
the governed feature to true iff the feature is supported in both KVM cpu caps and guest CPUID.

4. kvm/vendor code uses 'guest_can_use()' to query the value of the governed feature instead of reading
guest CPUID.

It might make sense to document the above somewhere at least.
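
For concreteness, the common flow (steps 3 and 4) in vendor code looks roughly
like this (illustrative sketch, not a quote from the series):

	static void vendor_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
	{
		/* True iff SHSTK is in both kvm_cpu_caps and guest CPUID. */
		kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
	}

	static bool vendor_can_emulate_shstk(struct kvm_vcpu *vcpu)
	{
		/* Query the governed bit instead of re-parsing guest CPUID. */
		return guest_can_use(vcpu, X86_FEATURE_SHSTK);
	}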

Now about another thing I am thinking:

I do know that the mess of boolean flags that SVM had is worse than these governed features, and functionality-wise the two are equivalent.

However thinking again about the whole thing:

IMHO 'governed features' is another quite confusing term that a KVM developer will need to learn and keep in mind.

Because of that, can't we just use guest CPUID as the single source of truth and drop all the governed features code?

In most cases, the governed feature value differs from the guest CPUID when a feature is enabled in the guest CPUID,
but not enabled in the KVM caps.

I do see two exceptions to this: XSAVES on AMD and X86_FEATURE_GBPAGES, in which the opposite happens:
the governed feature is enabled even when the feature is hidden from the guest CPUID. It might be
better, readability-wise, to deal with these cases manually, and we are unlikely to have many new
such cases in the future.

So for the common case of CPUID mismatch, when the governed feature is disabled but the guest CPUID bit is enabled,
does it make sense to allow this?

Such a feature, which is advertised as supported but not really working, is a recipe for hard-to-find guest bugs IMHO.

IMHO it would be much better to just check for this condition and do kvm_vm_bugged() or something similar when a feature
is enabled in the guest CPUID but KVM can't support it, and then just use guest CPUID in 'guest_can_use()'.
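
I.e. roughly (sketch):

	/* At CPUID-set time: an advertised-but-unsupported feature is a bug. */
	if (guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) &&
	    !kvm_cpu_cap_has(X86_FEATURE_SHSTK))
		kvm_vm_bugged(vcpu->kvm);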

Best regards,
Maxim Levitsky






2023-10-31 17:56:51

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Add emulation interface for CET MSR access. The emulation code is split
> into common part and vendor specific part. The former does common check
> for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> helpers while the latter accesses the MSRs linked to VMCS fields.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 18 +++++++++++
> arch/x86/kvm/x86.c | 71 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 89 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index fd5893b3a2c8..9f4b56337251 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2111,6 +2111,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> else
> msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
> break;
> + case MSR_IA32_S_CET:
> + msr_info->data = vmcs_readl(GUEST_S_CET);
> + break;
> + case MSR_KVM_SSP:
> + msr_info->data = vmcs_readl(GUEST_SSP);
> + break;
> + case MSR_IA32_INT_SSP_TAB:
> + msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
> + break;
> case MSR_IA32_DEBUGCTLMSR:
> msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
> break;
> @@ -2420,6 +2429,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> else
> vmx->pt_desc.guest.addr_a[index / 2] = data;
> break;
> + case MSR_IA32_S_CET:
> + vmcs_writel(GUEST_S_CET, data);
> + break;
> + case MSR_KVM_SSP:
> + vmcs_writel(GUEST_SSP, data);
> + break;
> + case MSR_IA32_INT_SSP_TAB:
> + vmcs_writel(GUEST_INTR_SSP_TABLE, data);
> + break;
> case MSR_IA32_PERF_CAPABILITIES:
> if (data && !vcpu_to_pmu(vcpu)->version)
> return 1;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 73b45351c0fc..c85ee42ab4f1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1847,6 +1847,11 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
> }
> EXPORT_SYMBOL_GPL(kvm_msr_allowed);
>
> +#define CET_US_RESERVED_BITS GENMASK(9, 6)
> +#define CET_US_SHSTK_MASK_BITS GENMASK(1, 0)
> +#define CET_US_IBT_MASK_BITS (GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
> +#define CET_US_LEGACY_BITMAP_BASE(data) ((data) >> 12)
> +
> /*
> * Write @data into the MSR specified by @index. Select MSR specific fault
> * checks are bypassed if @host_initiated is %true.
> @@ -1856,6 +1861,7 @@ EXPORT_SYMBOL_GPL(kvm_msr_allowed);
> static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
> bool host_initiated)
> {
> + bool host_msr_reset = host_initiated && data == 0;

I really don't like this boolean. While 0 is usually the reset value, it doesn't have to
be (the SVM TSC ratio reset value is 1, for example).
Also, its name is confusing.

I suggest just open-coding this instead.
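
I.e. write the check out where it is used, e.g. for the U_CET/S_CET case
(sketch):

	if (host_initiated && data == 0 &&
	    (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
	     kvm_cpu_cap_has(X86_FEATURE_IBT)))
		break;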

> struct msr_data msr;
>
> switch (index) {
> @@ -1906,6 +1912,46 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
>
> data = (u32)data;
> break;
> + case MSR_IA32_U_CET:
> + case MSR_IA32_S_CET:
> + if (host_msr_reset && (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> + kvm_cpu_cap_has(X86_FEATURE_IBT)))
> + break;
> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> + !guest_can_use(vcpu, X86_FEATURE_IBT))
> + return 1;
> + if (data & CET_US_RESERVED_BITS)
> + return 1;
> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> + (data & CET_US_SHSTK_MASK_BITS))
> + return 1;
> + if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> + (data & CET_US_IBT_MASK_BITS))
> + return 1;
> + if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> + return 1;
> +
> + /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> + if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> + return 1;
> + break;
> + case MSR_IA32_INT_SSP_TAB:
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> + return 1;
> + fallthrough;
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + case MSR_KVM_SSP:
> + if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> + break;
> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + return 1;
> + if (index == MSR_KVM_SSP && !host_initiated)
> + return 1;
> + if (is_noncanonical_address(data, vcpu))
> + return 1;
> + if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> + return 1;
> + break;
Once again, I'd prefer to have an ioctl for setting/getting SSP; this would make the above
code simpler (e.g. there would be no need to check that the write comes from the host, etc.).
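
A sketch of what such a uAPI could look like (the names and ioctl numbers here
are invented purely for illustration):

	struct kvm_ssp {
		__u64 ssp;
	};

	#define KVM_GET_SSP	_IOR(KVMIO, 0xcc, struct kvm_ssp)
	#define KVM_SET_SSP	_IOW(KVMIO, 0xcd, struct kvm_ssp)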

> }
>
> msr.data = data;
> @@ -1949,6 +1995,23 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
> return 1;
> break;
> + case MSR_IA32_U_CET:
> + case MSR_IA32_S_CET:
> + if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> + !guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + return 1;
> + break;
> + case MSR_IA32_INT_SSP_TAB:
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> + return 1;
> + fallthrough;
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + case MSR_KVM_SSP:
> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + return 1;
> + if (index == MSR_KVM_SSP && !host_initiated)
> + return 1;
> + break;
> }
>
> msr.index = index;
> @@ -4009,6 +4072,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vcpu->arch.guest_fpu.xfd_err = data;
> break;
> #endif
> + case MSR_IA32_U_CET:
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + kvm_set_xstate_msr(vcpu, msr_info);

Ah, so this is where these functions (kvm_set_xstate_msr()/kvm_get_xstate_msr()) are used.
I think this patch should be the one to introduce them.

Also, I would appreciate a comment on kvm_set_xstate_msr()/kvm_get_xstate_msr() saying something like:

"This function updates a guest MSR whose value is saved in the guest FPU state.
Wrap the write with a load/save of the guest FPU state to keep that state consistent with the new MSR value"

Or something similar, although I will not argue over this.
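
For illustration, the wrapper shape I have in mind, assuming the existing
kvm_fpu_get()/kvm_fpu_put() helpers in x86.c (sketch only):

	/*
	 * Update a guest MSR whose value is saved in the guest FPU (xsave)
	 * state.  Load the guest FPU around the write so the in-memory
	 * xstate stays consistent with the new MSR value.
	 */
	static void kvm_set_xstate_msr(struct kvm_vcpu *vcpu,
				       struct msr_data *msr_info)
	{
		kvm_fpu_get();
		wrmsrl(msr_info->index, msr_info->data);
		kvm_fpu_put();
	}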

> + break;
> default:
> if (kvm_pmu_is_valid_msr(vcpu, msr))
> return kvm_pmu_set_msr(vcpu, msr_info);
> @@ -4365,6 +4432,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> msr_info->data = vcpu->arch.guest_fpu.xfd_err;
> break;
> #endif
> + case MSR_IA32_U_CET:
> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> + kvm_get_xstate_msr(vcpu, msr_info);
> + break;
> default:
> if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
> return kvm_pmu_get_msr(vcpu, msr_info);


Best regards,
Maxim Levitsky




2023-10-31 17:57:09

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 20/25] KVM: x86: Save and reload SSP to/from SMRAM

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
> behavior when guest enters/leaves SMM mode,i.e., save registers to SMRAM
> at the entry of SMM and reload them at the exit to SMM. Per SDM, SSP is
> one of such registers on 64bit Arch, so add the support for SSP.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/smm.c | 8 ++++++++
> arch/x86/kvm/smm.h | 2 +-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index b42111a24cc2..235fca95f103 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
> enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>
> smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> +
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
> + vcpu->kvm);
> }
> #endif
>
> @@ -565,6 +569,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
> static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
> ctxt->interruptibility = (u8)smstate->int_shadow;
>
> + if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> + KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
> + vcpu->kvm);
> +
> return X86EMUL_CONTINUE;
> }
> #endif
> diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
> index a1cf2ac5bd78..1e2a3e18207f 100644
> --- a/arch/x86/kvm/smm.h
> +++ b/arch/x86/kvm/smm.h
> @@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
> u32 smbase;
> u32 reserved4[5];
>
> - /* ssp and svm_* fields below are not implemented by KVM */
> u64 ssp;
> + /* svm_* fields below are not implemented by KVM */
> u64 svm_guest_pat;
> u64 svm_host_efer;
> u64 svm_host_cr4;


Just one note: due to historical reasons, KVM supports 2 formats of the SMM save area: 32 and 64 bit.
The 32 bit format more or less resembles the format that true 32 bit Intel and AMD CPUs used,
while the 64 bit format more or less resembles the format that 64 bit AMD CPUs use (Intel uses a very different SMRAM layout).

The 32 bit format is used when X86_FEATURE_LM is not exposed in the guest CPUID, which is very rare (only 32 bit qemu doesn't set it),
and it lacks several fields because it is no longer maintained.

Still, for the sake of completeness, it might make sense to fail enter_smm_save_state_32() if CET is enabled
(this needs a return value added, and a 'goto error' in the main 'enter_smm' in that case).

I did a similar thing in the SVM code ('svm_enter_smm') when it detects the lack of X86_FEATURE_LM.
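
E.g. (sketch; changing the return type is an assumption of this sketch):

	static int enter_smm_save_state_32(struct kvm_vcpu *vcpu,
					   struct kvm_smram_state_32 *smram)
	{
		/* The legacy 32 bit SMRAM layout has no SSP field. */
		if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
			return -EINVAL;

		/* ... existing 32 bit save logic unchanged ... */
		return 0;
	}

with enter_smm() doing 'goto error' on a non-zero return.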

Best regards,
Maxim Levitsky




2023-10-31 17:58:23

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Per SDM description(Vol.3D, Appendix A.1):
> "If bit 56 is read as 1, software can use VM entry to deliver a hardware
> exception with or without an error code, regardless of vector"
>
> Modify has_error_code check before inject events to nested guest. Only
> enforce the check when guest is in real mode, the exception is not hard
> exception and the platform doesn't enumerate bit56 in VMX_BASIC, in all
> other case ignore the check to make the logic consistent with SDM.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++--------
> arch/x86/kvm/vmx/nested.h | 5 +++++
> 2 files changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c5ec0ef51ff7..78a3be394d00 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
> {
> const u64 feature_and_reserved =
> /* feature (except bit 48; see below) */
> - BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
> + BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
> /* reserved */
> - BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
> + BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
> u64 vmx_basic = vmcs_config.nested.basic;
>
> if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
> @@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
> CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
> return -EINVAL;
>
> - /* VM-entry interruption-info field: deliver error code */
> - should_have_error_code =
> - intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
> - x86_exception_has_error_code(vector);
> - if (CC(has_error_code != should_have_error_code))
> - return -EINVAL;
> + if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
> + !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
> + /* VM-entry interruption-info field: deliver error code */
> + should_have_error_code =
> + intr_type == INTR_TYPE_HARD_EXCEPTION &&
> + prot_mode &&
> + x86_exception_has_error_code(vector);
> + if (CC(has_error_code != should_have_error_code))
> + return -EINVAL;
> + }
>
> /* VM-entry exception error code */
> if (CC(has_error_code &&
> @@ -6968,6 +6972,8 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
>
> if (cpu_has_vmx_basic_inout())
> msrs->basic |= VMX_BASIC_INOUT;
> + if (cpu_has_vmx_basic_no_hw_errcode())
> + msrs->basic |= VMX_BASIC_NO_HW_ERROR_CODE_CC;
> }
>
> static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
> diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
> index b4b9d51438c6..26842da6857d 100644
> --- a/arch/x86/kvm/vmx/nested.h
> +++ b/arch/x86/kvm/vmx/nested.h
> @@ -284,6 +284,11 @@ static inline bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
> __kvm_is_valid_cr4(vcpu, val);
> }
>
> +static inline bool nested_cpu_has_no_hw_errcode_cc(struct kvm_vcpu *vcpu)
> +{
> + return to_vmx(vcpu)->nested.msrs.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
> +}
> +
> /* No difference in the restrictions on guest and host CR4 in VMX operation. */
> #define nested_guest_cr4_valid nested_cr4_valid
> #define nested_host_cr4_valid nested_cr4_valid

Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky




2023-10-31 17:58:51

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> to enable CET for nested VM.
>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
> arch/x86/kvm/vmx/vmcs12.c | 6 ++++++
> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 4 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 78a3be394d00..2c4ff13fddb0 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>
> + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_U_CET, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_S_CET, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> +
> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> +
> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>
> vmx->nested.force_msr_bitmap_recalc = false;
> @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> VM_EXIT_HOST_ADDR_SPACE_SIZE |
> #endif
> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> - VM_EXIT_CLEAR_BNDCFGS;
> + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> msrs->exit_ctls_high |=
> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> #ifdef CONFIG_X86_64
> VM_ENTRY_IA32E_MODE |
> #endif
> - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> + VM_ENTRY_LOAD_CET_STATE;
> msrs->entry_ctls_high |=
> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> index 106a72c923ca..4233b5ca9461 100644
> --- a/arch/x86/kvm/vmx/vmcs12.c
> +++ b/arch/x86/kvm/vmx/vmcs12.c
> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> + FIELD(GUEST_S_CET, guest_s_cet),
> + FIELD(GUEST_SSP, guest_ssp),
> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> FIELD(HOST_CR0, host_cr0),
> FIELD(HOST_CR3, host_cr3),
> FIELD(HOST_CR4, host_cr4),
> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
> FIELD(HOST_RSP, host_rsp),
> FIELD(HOST_RIP, host_rip),
> + FIELD(HOST_S_CET, host_s_cet),
> + FIELD(HOST_SSP, host_ssp),
> + FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
> };
> const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> index 01936013428b..3884489e7f7e 100644
> --- a/arch/x86/kvm/vmx/vmcs12.h
> +++ b/arch/x86/kvm/vmx/vmcs12.h
> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
> natural_width host_ia32_sysenter_eip;
> natural_width host_rsp;
> natural_width host_rip;
> - natural_width paddingl[8]; /* room for future expansion */
> + natural_width host_s_cet;
> + natural_width host_ssp;
> + natural_width host_ssp_tbl;
> + natural_width guest_s_cet;
> + natural_width guest_ssp;
> + natural_width guest_ssp_tbl;
> + natural_width paddingl[2]; /* room for future expansion */
> u32 pin_based_vm_exec_control;
> u32 cpu_based_vm_exec_control;
> u32 exception_bitmap;
> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
> CHECK_OFFSET(host_ia32_sysenter_eip, 656);
> CHECK_OFFSET(host_rsp, 664);
> CHECK_OFFSET(host_rip, 672);
> + CHECK_OFFSET(host_s_cet, 680);
> + CHECK_OFFSET(host_ssp, 688);
> + CHECK_OFFSET(host_ssp_tbl, 696);
> + CHECK_OFFSET(guest_s_cet, 704);
> + CHECK_OFFSET(guest_ssp, 712);
> + CHECK_OFFSET(guest_ssp_tbl, 720);
> CHECK_OFFSET(pin_based_vm_exec_control, 744);
> CHECK_OFFSET(cpu_based_vm_exec_control, 748);
> CHECK_OFFSET(exception_bitmap, 752);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f0dea8ecd0c6..2c43f1088d77 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7731,6 +7731,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
> cr4_fixed1_update(X86_CR4_PKE, ecx, feature_bit(PKU));
> cr4_fixed1_update(X86_CR4_UMIP, ecx, feature_bit(UMIP));
> cr4_fixed1_update(X86_CR4_LA57, ecx, feature_bit(LA57));
> + cr4_fixed1_update(X86_CR4_CET, ecx, feature_bit(SHSTK));
> + cr4_fixed1_update(X86_CR4_CET, edx, feature_bit(IBT));
>
> #undef cr4_fixed1_update
> }


It is surprising how little needs to be done to support the nested mode, but it does look correct.
I might have missed something though; I can't be 100% sure in this case.


Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky






2023-11-01 02:10:12

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest

On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
>Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>to enable CET for nested VM.
>
>Signed-off-by: Yang Weijiang <[email protected]>
>---
> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
> arch/x86/kvm/vmx/vmcs12.c | 6 ++++++
> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 4 files changed, 46 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>index 78a3be394d00..2c4ff13fddb0 100644
>--- a/arch/x86/kvm/vmx/nested.c
>+++ b/arch/x86/kvm/vmx/nested.c
>@@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>
>+ /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+ MSR_IA32_U_CET, MSR_TYPE_RW);
>+
>+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+ MSR_IA32_S_CET, MSR_TYPE_RW);
>+
>+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+ MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>+
>+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+ MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>+
>+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+ MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>+
>+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+ MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>+
>+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+ MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>+
> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>
> vmx->nested.force_msr_bitmap_recalc = false;
>@@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> VM_EXIT_HOST_ADDR_SPACE_SIZE |
> #endif
> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>- VM_EXIT_CLEAR_BNDCFGS;
>+ VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> msrs->exit_ctls_high |=
> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>@@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> #ifdef CONFIG_X86_64
> VM_ENTRY_IA32E_MODE |
> #endif
>- VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>+ VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>+ VM_ENTRY_LOAD_CET_STATE;
> msrs->entry_ctls_high |=
> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>index 106a72c923ca..4233b5ca9461 100644
>--- a/arch/x86/kvm/vmx/vmcs12.c
>+++ b/arch/x86/kvm/vmx/vmcs12.c
>@@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>+ FIELD(GUEST_S_CET, guest_s_cet),
>+ FIELD(GUEST_SSP, guest_ssp),
>+ FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),

I think we need to sync the guest state, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
between vmcs02 and vmcs12 on nested VM entry/exit, probably in
sync_vmcs02_to_vmcs12() and prepare_vmcs12(), or the "_rare" variants of them.
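
E.g. something like this on the exit side, in sync_vmcs02_to_vmcs12() (sketch;
the exact placement and the guest_can_use() guard are assumptions):

	if (guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
	    guest_can_use(vcpu, X86_FEATURE_IBT)) {
		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
	}

plus the mirror-image load on the VM entry side.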

2023-11-01 04:22:50

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

On Thu, Sep 14, 2023 at 02:33:24AM -0400, Yang Weijiang wrote:
>Per SDM description(Vol.3D, Appendix A.1):
>"If bit 56 is read as 1, software can use VM entry to deliver a hardware
>exception with or without an error code, regardless of vector"
>
>Modify has_error_code check before inject events to nested guest. Only
>enforce the check when guest is in real mode, the exception is not hard
>exception and the platform doesn't enumerate bit56 in VMX_BASIC, in all
>other case ignore the check to make the logic consistent with SDM.
>
>Signed-off-by: Yang Weijiang <[email protected]>
>---
> arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++--------
> arch/x86/kvm/vmx/nested.h | 5 +++++
> 2 files changed, 19 insertions(+), 8 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>index c5ec0ef51ff7..78a3be394d00 100644
>--- a/arch/x86/kvm/vmx/nested.c
>+++ b/arch/x86/kvm/vmx/nested.c
>@@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
> {
> const u64 feature_and_reserved =
> /* feature (except bit 48; see below) */
>- BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
>+ BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
> /* reserved */
>- BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
>+ BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
> u64 vmx_basic = vmcs_config.nested.basic;
>
> if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
>@@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
> CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
> return -EINVAL;
>
>- /* VM-entry interruption-info field: deliver error code */
>- should_have_error_code =
>- intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
>- x86_exception_has_error_code(vector);
>- if (CC(has_error_code != should_have_error_code))
>- return -EINVAL;
>+ if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
>+ !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
>+ /* VM-entry interruption-info field: deliver error code */
>+ should_have_error_code =
>+ intr_type == INTR_TYPE_HARD_EXCEPTION &&
>+ prot_mode &&
>+ x86_exception_has_error_code(vector);
>+ if (CC(has_error_code != should_have_error_code))
>+ return -EINVAL;
>+ }

prot_mode and intr_type are used twice, making the code a little hard to read.

how about:
	/*
	 * Cannot deliver error code in real mode or if the
	 * interruption type is not hardware exception. For other
	 * cases, do the consistency check only if the vCPU doesn't
	 * enumerate VMX_BASIC_NO_HW_ERROR_CODE_CC.
	 */
	if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
		if (CC(has_error_code))
			return -EINVAL;
	} else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
		if (CC(has_error_code != x86_exception_has_error_code(vector)))
			return -EINVAL;
	}

and drop should_have_error_code.

2023-11-01 09:22:30

by Yang Weijiang

[permalink] [raw]
Subject: Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit

On 11/1/2023 1:43 AM, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>> Remove XFEATURE_CET_USER entry from dependency array as the entry doesn't
>> reflect true dependency between CET features and the xstate bit, instead
>> manually check and add the bit back if either SHSTK or IBT is supported.
>>
>> Both user mode shadow stack and indirect branch tracking features depend
>> on XFEATURE_CET_USER bit in XSS to automatically save/restore user mode
>> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever necessary.
>>
>> Although a platform with IBT but no SHSTK is rare in the real world, it's
>> common in virtualization, where guest SHSTK and IBT can be controlled
>> independently via the userspace app.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index cadf68737e6b..12c8cb278346 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
>> [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
>> [XFEATURE_PKRU] = X86_FEATURE_OSPKE,
>> [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
>> - [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
>> [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
>> [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
>> };
>> @@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>> fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
>> }
>>
>> + /*
>> + * Manually add CET user mode xstate bit if either SHSTK or IBT is
>> + * available. Both features depend on the xstate bit to save/restore
>> + * CET user mode state.
>> + */
>> + if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
>> + fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
>> +
>> if (!cpu_feature_enabled(X86_FEATURE_XFD))
>> fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>>
>
> The goal of the xsave_cpuid_features array is to disable xfeature state bits which are
> enabled in CPUID but whose parent feature bit (e.g. X86_FEATURE_AVX512) is disabled in
> CPUID, something that should not happen on a real CPU, but can happen if the user
> explicitly disables the feature on the kernel command line and/or due to virtualization.
>
> However, the above code does the opposite: it will enable the XFEATURE_CET_USER xsaves
> component when in fact it might be disabled in CPUID (and one can say that in theory such
> a configuration is even useful, since the kernel can still context switch the CET MSRs manually).
>
>
> So I think that the code should do this instead:
>
> if (!boot_cpu_has(X86_FEATURE_SHSTK) && !boot_cpu_has(X86_FEATURE_IBT))
> fpu_kernel_cfg.max_features &= ~BIT_ULL(XFEATURE_CET_USER);

Hi, Maxim,
Thanks a lot for the comments on the series!
I'll check and reply to them after finishing an urgent task at hand.

Yeah, it looks good to me and makes the handling logic more consistent!

> Best regards,
> Maxim Levitsky
>
>
>
>

2023-11-01 09:23:16

by Yang Weijiang

[permalink] [raw]
Subject: Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest

On 11/1/2023 10:09 AM, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>> to enable CET for nested VM.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
>> arch/x86/kvm/vmx/vmcs12.c | 6 ++++++
>> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
>> arch/x86/kvm/vmx/vmx.c | 2 ++
>> 4 files changed, 46 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 78a3be394d00..2c4ff13fddb0 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>
>> + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_U_CET, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_S_CET, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>> +
>> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>
>> vmx->nested.force_msr_bitmap_recalc = false;
>> @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>> VM_EXIT_HOST_ADDR_SPACE_SIZE |
>> #endif
>> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> - VM_EXIT_CLEAR_BNDCFGS;
>> + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>> msrs->exit_ctls_high |=
>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>> @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>> #ifdef CONFIG_X86_64
>> VM_ENTRY_IA32E_MODE |
>> #endif
>> - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>> + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>> + VM_ENTRY_LOAD_CET_STATE;
>> msrs->entry_ctls_high |=
>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>> index 106a72c923ca..4233b5ca9461 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.c
>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>> + FIELD(GUEST_S_CET, guest_s_cet),
>> + FIELD(GUEST_SSP, guest_ssp),
>> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
> between vmcs02 and vmcs12 on nested VM entry/exit, probably in
> sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.

Thanks Chao!
Let me double check the nested code part and reply.

2023-11-01 09:55:57

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest

On Wed, 2023-11-01 at 10:09 +0800, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
> > Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> > to enable CET for nested VM.
> >
> > Signed-off-by: Yang Weijiang <[email protected]>
> > ---
> > arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
> > arch/x86/kvm/vmx/vmcs12.c | 6 ++++++
> > arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
> > arch/x86/kvm/vmx/vmx.c | 2 ++
> > 4 files changed, 46 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 78a3be394d00..2c4ff13fddb0 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
> >
> > + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_U_CET, MSR_TYPE_RW);
> > +
> > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_S_CET, MSR_TYPE_RW);
> > +
> > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> > +
> > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> > +
> > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> > +
> > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> > +
> > + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> > +
> > kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
> >
> > vmx->nested.force_msr_bitmap_recalc = false;
> > @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> > VM_EXIT_HOST_ADDR_SPACE_SIZE |
> > #endif
> > VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> > - VM_EXIT_CLEAR_BNDCFGS;
> > + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> > msrs->exit_ctls_high |=
> > VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> > VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> > @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> > #ifdef CONFIG_X86_64
> > VM_ENTRY_IA32E_MODE |
> > #endif
> > - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> > + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> > + VM_ENTRY_LOAD_CET_STATE;
> > msrs->entry_ctls_high |=
> > (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> > VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> > diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> > index 106a72c923ca..4233b5ca9461 100644
> > --- a/arch/x86/kvm/vmx/vmcs12.c
> > +++ b/arch/x86/kvm/vmx/vmcs12.c
> > @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> > FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> > FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> > FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> > + FIELD(GUEST_S_CET, guest_s_cet),
> > + FIELD(GUEST_SSP, guest_ssp),
> > + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
>
> I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
> between vmcs02 and vmcs12 on nested VM entry/exit, probably in
> sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.
>

Aha, this is why I suspected that nested support is incomplete,
100% agree.

In particular, looking at Intel's SDM I see that:

HOST_S_CET, HOST_SSP, HOST_INTR_SSP_TABLE need to be copied from vmcs12 to vmcs02 but not vice versa
because the CPU doesn't touch them.

GUEST_S_CET, GUEST_SSP, GUEST_INTR_SSP_TABLE should be copied bi-directionally.

This of course depends on the corresponding VM-entry and VM-exit controls being set.
That means that it is legal in theory to do VM entry/exit with CET enabled but not use
VM_ENTRY_LOAD_CET_STATE and/or VM_EXIT_LOAD_CET_STATE, because, for example, a
nested hypervisor can in theory opt to save/load these itself.

I think that this is all, but I also can't be 100% sure. This thing has to be tested well before
we can be sure that it works.

Best regards,
Maxim Levitsky
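A sketch of the entry-side copy described above, assuming it lives in
prepare_vmcs02() and is gated on the entry control L1 requested (field and flag
names follow the quoted patch; the exact conditions are assumptions):

        if (vmx->nested.nested_run_pending &&
            (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
                vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
                vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
                vmcs_writel(GUEST_INTR_SSP_TABLE, vmcs12->guest_ssp_tbl);
        }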

2023-11-01 15:47:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > Use the governed feature framework to track whether X86_FEATURE_SHSTK
> > and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> > the features can be used iff both KVM and guest CPUID can support them.
> >
> > Signed-off-by: Yang Weijiang <[email protected]>
> > ---
> > arch/x86/kvm/governed_features.h | 2 ++
> > arch/x86/kvm/vmx/vmx.c | 2 ++
> > 2 files changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
> > index 423a73395c10..db7e21c5ecc2 100644
> > --- a/arch/x86/kvm/governed_features.h
> > +++ b/arch/x86/kvm/governed_features.h
> > @@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
> > KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
> > KVM_GOVERNED_X86_FEATURE(VGIF)
> > KVM_GOVERNED_X86_FEATURE(VNMI)
> > +KVM_GOVERNED_X86_FEATURE(SHSTK)
> > +KVM_GOVERNED_X86_FEATURE(IBT)
> >
> > #undef KVM_GOVERNED_X86_FEATURE
> > #undef KVM_GOVERNED_FEATURE
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 9409753f45b0..fd5893b3a2c8 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7765,6 +7765,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
> >
> > kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
> > + kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
> > + kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
> >
> > vmx_setup_uret_msrs(vmx);
> >
>
> Reviewed-by: Maxim Levitsky <[email protected]>
>
>
> PS: IMHO The whole 'governed feature framework' is very confusing and
> somewhat poorly documented.
>
> Currently the only partial explanation of it, is at 'governed_features',
> which doesn't explain how to use it.

To be honest, terrible name aside, I thought kvm_governed_feature_check_and_set()
would be fairly self-explanatory, at least relative to all the other CPUID handling
in KVM.

> For the reference this is how KVM expects governed features to be used in the
> common case (there are some exceptions to this but they are rare)
>
> 1. If a feature is not enabled in host CPUID or KVM doesn't support it,
> KVM is expected to not enable it in KVM cpu caps.
>
> 2. Userspace uploads guest CPUID.
>
> 3. After the guest CPUID upload, the vendor code calls
> kvm_governed_feature_check_and_set() which sets governed features = True iff
> feature is supported in both kvm cpu caps and in guest CPUID.
>
> 4. kvm/vendor code uses 'guest_can_use()' to query the value of the governed
> feature instead of reading guest CPUID.
>
> It might make sense to document the above somewhere at least.
>
> Now about another thing I am thinking:
>
> I do know that the mess of boolean flags that svm had is worse than these
> governed features and functionality wise these are equivalent.
>
> However thinking again about the whole thing:
>
> IMHO the 'governed features' is another quite confusing term that a KVM
> developer will need to learn and keep in memory.

I 100% agree, but I explicitly called out the terrible name in the v1 and v2
cover letters[1][2], and the patches were on the list for 6 months before I
applied them. I'm definitely still open to a better name, but I'm also not
exactly chomping at the bit to get behind the bikeshed.

v1:
: Note, I don't like the name "governed", but it was the least awful thing I
: could come up with. Suggestions most definitely welcome.

v2:
: Note, I still don't like the name "governed", but no one has suggested
: anything else, let alone anything better :-)


[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]

> Because of that, can't we just use guest CPUID as a single source of truth
> and drop all the governed features code?

No, not without a rather massive ABI break. To make guest CPUID the single source
of truth, KVM would need to modify guest CPUID to squash features that userspace
has set, but that are not supported by hardware. And that is most definitely a
can of worms I don't want to reopen, e.g. see the mess that got created when KVM
tried to "help" userspace by mucking with VMX capability MSRs in response to
CPUID changes.

There aren't many real use cases for advertising "unsupported" features via guest
CPUID, but there are some, and I have definitely abused KVM_SET_CPUID2 for testing
purposes.

And as above, that doesn't work for X86_FEATURE_XSAVES or X86_FEATURE_GBPAGES.

We'd also have to overhaul guest CPUID lookups to be significantly faster (which
is doable), as one of the motivations for the framework was to avoid the overhead
of looking through guest CPUID without needing one-off boolean fields.

> In most cases, the governed feature value will differ from the guest CPUID
> when a feature is enabled in the guest CPUID but not enabled in the KVM
> caps.
>
> I do see two exceptions to this: XSAVES on AMD and X86_FEATURE_GBPAGES, in
> which the opposite happens: the governed feature is enabled even when the
> feature is hidden from the guest CPUID. But it might be better, readability-wise,
> to deal with these cases manually, and we are unlikely to have many new such
> cases in the future.
>
> So for the common case of CPUID mismatch, when the governed feature is
> disabled but guest CPUID is enabled, does it make sense to allow this?

Yes and no. For "governed features", probably not. But for CPUID as a whole, there
are legitimate cases where userspace needs to enumerate things that aren't officially
"supported" by KVM. E.g. topology, core crystal frequency (CPUID 0x15), defeatures
that KVM hasn't yet learned about, features that don't have virtualization controls
and KVM hasn't yet learned about, etc. And for things like Xen and Hyper-V paravirt
features, it's very doable to implement features that are enumerated by CPUID fully
in userspace, e.g. using MSR filters.

But again, it's a moot point because KVM has (mostly) allowed userspace to fully
control guest CPUID for a very long time.

> Such a feature which is advertised as supported but not really working is a
> recipe for hard-to-find guest bugs, IMHO.
>
> IMHO it would be much better to just check this condition and do
> kvm_vm_bugged() or something in case when a feature is enabled in the guest
> CPUID but KVM can't support it, and then just use guest CPUID in
> 'guest_can_use()'.

Maybe, if we were creating KVM from scratch, e.g. didn't have to worry about
existing userspace behavior and could implement a more forward-looking API than
KVM_GET_SUPPORTED_CPUID. But even then the enforcement would need to be limited
to "pure" hardware-defined feature bits, and I suspect that there would still be
exceptions. And there would likely be complexity in dealing with CPUID leafs
that are completely unknown to KVM, e.g. unless KVM completely disallowed non-zero
values for unknown CPUID leafs, adding restrictions when a feature is defined by
Intel or AMD would be at constant risk of breaking userspace.
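For readers following the debate, this is roughly how the framework's pieces fit
together at the time of this thread (a paraphrased sketch; see
arch/x86/kvm/cpuid.h for the authoritative code):

        /* Called by vendor code after userspace uploads guest CPUID. */
        static __always_inline void kvm_governed_feature_check_and_set(struct kvm_vcpu *vcpu,
                                                                       unsigned int x86_feature)
        {
                if (kvm_cpu_cap_has(x86_feature) && guest_cpuid_has(vcpu, x86_feature))
                        __set_bit(kvm_governed_feature_index(x86_feature),
                                  vcpu->arch.governed_features.enabled);
        }

        /* Fast query used on hot paths instead of walking guest CPUID. */
        static __always_inline bool guest_can_use(struct kvm_vcpu *vcpu,
                                                  unsigned int x86_feature)
        {
                return test_bit(kvm_governed_feature_index(x86_feature),
                                vcpu->arch.governed_features.enabled);
        }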

2023-11-01 16:32:12

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > Add emulation interface for CET MSR access. The emulation code is split
> > into common part and vendor specific part. The former does common check
> > for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> > helpers while the latter accesses the MSRs linked to VMCS fields.
> >
> > Signed-off-by: Yang Weijiang <[email protected]>
> > ---

...

> > + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > + case MSR_KVM_SSP:
> > + if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > + break;
> > + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > + return 1;
> > + if (index == MSR_KVM_SSP && !host_initiated)
> > + return 1;
> > + if (is_noncanonical_address(data, vcpu))
> > + return 1;
> > + if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > + return 1;
> > + break;
> Once again I'll prefer to have an ioctl for setting/getting SSP, this will
> make the above code simpler (e.g there will be no need to check that write
> comes from the host/etc).

I don't think an ioctl() would be simpler overall, especially when factoring in
userspace. With a synthetic MSR, we get the following quite cheaply:

1. Enumerating support to userspace.
2. Save/restore of the value, e.g. for live migration.
3. Vendor hooks for propagating values to/from the VMCS/VMCB.

For an ioctl(), #1 would require a capability, #2 (and #1 to some extent) would
require new userspace flows, and #3 would require new kvm_x86_ops hooks.

The synthetic MSR adds a small amount of messiness, as does bundling
MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs. The bulk of the mess comes
from the need to allow userspace to write '0' when KVM enumerated support to
userspace.

If we isolate MSR_IA32_INT_SSP_TAB, that'll help with the synthetic MSR and with
MSR_IA32_INT_SSP_TAB. For the unfortunate "host reset" behavior, the best idea I
came up with is to add a helper. It's still a bit ugly, but the ugliness is
contained in a helper and IMO makes it much easier to follow the case statements.

get:

case MSR_IA32_INT_SSP_TAB:
        if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
            !guest_cpuid_has(vcpu, X86_FEATURE_LM))
                return 1;
        break;
case MSR_KVM_SSP:
        if (!host_initiated)
                return 1;
        fallthrough;
case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
        if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
                return 1;
        break;

static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
                                   bool host_initiated)
{
        bool any_cet = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;

        if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
                return true;

        if (any_cet && guest_can_use(vcpu, X86_FEATURE_IBT))
                return true;

        /*
         * If KVM supports the MSR, i.e. has enumerated the MSR existence to
         * userspace, then userspace is allowed to write '0' irrespective of
         * whether or not the MSR is exposed to the guest.
         */
        if (!host_initiated || data)
                return false;

        if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
                return true;

        return any_cet && kvm_cpu_cap_has(X86_FEATURE_IBT);
}

set:

case MSR_IA32_U_CET:
case MSR_IA32_S_CET:
        if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
                return 1;
        if (data & CET_US_RESERVED_BITS)
                return 1;
        if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
            (data & CET_US_SHSTK_MASK_BITS))
                return 1;
        if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
            (data & CET_US_IBT_MASK_BITS))
                return 1;
        if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
                return 1;

        /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
        if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
                return 1;
        break;
case MSR_IA32_INT_SSP_TAB:
        if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
                return 1;

        if (is_noncanonical_address(data, vcpu))
                return 1;
        break;
case MSR_KVM_SSP:
        if (!host_initiated)
                return 1;
        fallthrough;
case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
        if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
                return 1;
        if (is_noncanonical_address(data, vcpu))
                return 1;
        if (!IS_ALIGNED(data, 4))
                return 1;
        break;
}

2023-11-01 17:21:23

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> > return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> > }
> >
> > +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_cpuid_entry2 *best;
> > +
> > + best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
> > + if (!best)
> > + return 0;
> > +
> > + return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
> > +}
>
> Same question as one for patch that added vcpu_get_supported_xcr0()
> Why have a per-vCPU supported XSS if we assume that all CPUs have the same
> CPUID?
>
> I mean I am not against supporting hybrid CPU models, but KVM currently doesn't
> support this, and this creates the illusion that it does.

KVM does "support" hybrid vCPU models in the sense that KVM has allow hybrid models
since forever. There are definite things that won't work, e.g. not all relevant
CPUID bits are captured in kvm_mmu_page_role, and so KVM will incorrectly share
page tables across vCPUs that are technically incompatible.

But for many features, heterogeneous vCPU models do Just Work as far as KVM is
concerned. There likely isn't a real world kernel that supports heterogeneous
feature sets for things like XSS and XCR0, but that's a guest software limitation,
not a limitation of KVM's CPU virtualization.

As with many things, KVM's ABI is to let userspace shoot themselves in the foot.

2023-11-01 19:32:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
> > helpers to replace existing usage of the raw functions.
> > kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
> > to get/set a MSR value for emulating CPU behavior.
>
> I am not sure if I like this patch or not. On one hand the code is cleaner
> this way, but on the other hand now it is easier to call kvm_msr_write() on
> behalf of the guest.
>
> For example we also have the 'kvm_set_msr()' which does actually set the msr
> on behalf of the guest.
>
> How about we call the new function kvm_msr_set_host() and rename
> kvm_set_msr() to kvm_msr_set_guest(), together with good comments explaining
> what they do?

LOL, just call me Nostradamus[*] ;-)

: > SSP save/load should go to enter_smm_save_state_64() and rsm_load_state_64(),
: > where other fields of SMRAM are handled.
:
: +1. The right way to get/set MSRs like this is to use __kvm_get_msr() and pass
: %true for @host_initiated. Though I would add a prep patch to provide wrappers
: for __kvm_get_msr() and __kvm_set_msr(). Naming will be hard, but I think we
^^^^^^^^^^^^^^^^^^^
: can use kvm_{read,write}_msr() to go along with the KVM-initiated register
: accessors/mutators, e.g. kvm_register_read(), kvm_pdptr_write(), etc.

[*] https://lore.kernel.org/all/[email protected]

> Also functions like kvm_set_msr_ignored_check(), kvm_set_msr_with_filter() and such,
> IMHO have names that are not very user friendly.

I don't like the host/guest split because KVM always operates on guest values,
e.g. kvm_msr_set_host() in particular could get confusing.

IMO kvm_get_msr() and kvm_set_msr(), and to some extent the helpers you note below,
are the real problem.

What if we rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
more obvious that those are the "guest" helpers? And do that as a prep patch in
this series (there aren't _that_ many users).

I'm also in favor of renaming the "inner" helpers, but I think we should tackle
those separately.
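A sketch of the shape such a prep patch might take, assuming the internal
kvm_{get,set}_msr_ignored_check() plumbing stays as-is (read side shown; the
write side would mirror it):

        /* KVM-internal helper: host-initiated, for emulating CPU behavior. */
        int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
        {
                return kvm_get_msr_ignored_check(vcpu, index, data, true);
        }

        /* Guest-initiated accessor, i.e. today's kvm_get_msr() after the rename. */
        int kvm_emulate_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
        {
                return kvm_get_msr_ignored_check(vcpu, index, data, false);
        }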

2023-11-02 18:27:14

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers

On Wed, 2023-11-01 at 12:32 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
> > > helpers to replace existing usage of the raw functions.
> > > kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
> > > to get/set a MSR value for emulating CPU behavior.
> >
> > I am not sure if I like this patch or not. On one hand the code is cleaner
> > this way, but on the other hand now it is easier to call kvm_msr_write() on
> > behalf of the guest.
> >
> > For example we also have the 'kvm_set_msr()' which does actually set the msr
> > on behalf of the guest.
> >
> > How about we call the new function kvm_msr_set_host() and rename
> > kvm_set_msr() to kvm_msr_set_guest(), together with good comments explaining
> > what they do?
>
> LOL, just call me Nostradamus[*] ;-)
>
> : > SSP save/load should go to enter_smm_save_state_64() and rsm_load_state_64(),
> : > where other fields of SMRAM are handled.
> :
> : +1. The right way to get/set MSRs like this is to use __kvm_get_msr() and pass
> : %true for @host_initiated. Though I would add a prep patch to provide wrappers
> : for __kvm_get_msr() and __kvm_set_msr(). Naming will be hard, but I think we
> ^^^^^^^^^^^^^^^^^^^
> : can use kvm_{read,write}_msr() to go along with the KVM-initiated register
> : accessors/mutators, e.g. kvm_register_read(), kvm_pdptr_write(), etc.
>
> [*] https://lore.kernel.org/all/[email protected]
>
> > Also functions like kvm_set_msr_ignored_check(), kvm_set_msr_with_filter() and such,
> > IMHO have names that are not very user friendly.
>
> I don't like the host/guest split because KVM always operates on guest values,
> e.g. kvm_msr_set_host() in particular could get confusing.
That makes sense.

>
> IMO kvm_get_msr() and kvm_set_msr(), and to some extent the helpers you note below,
> are the real problem.
>
> What if we rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
> more obvious that those are the "guest" helpers? And do that as a prep patch in
> this series (there aren't _that_ many users).
Makes sense.

>
> I'm also in favor of renaming the "inner" helpers, but I think we should tackle
> those separately.

OK.

>

Best regards,
Maxim Levitsky

2023-11-02 18:36:40

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"

On Wed, 2023-11-01 at 08:46 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > Use the governed feature framework to track whether X86_FEATURE_SHSTK
> > > and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> > > the features can be used iff both KVM and guest CPUID can support them.
> > >
> > > Signed-off-by: Yang Weijiang <[email protected]>
> > > ---
> > > arch/x86/kvm/governed_features.h | 2 ++
> > > arch/x86/kvm/vmx/vmx.c | 2 ++
> > > 2 files changed, 4 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
> > > index 423a73395c10..db7e21c5ecc2 100644
> > > --- a/arch/x86/kvm/governed_features.h
> > > +++ b/arch/x86/kvm/governed_features.h
> > > @@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
> > > KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
> > > KVM_GOVERNED_X86_FEATURE(VGIF)
> > > KVM_GOVERNED_X86_FEATURE(VNMI)
> > > +KVM_GOVERNED_X86_FEATURE(SHSTK)
> > > +KVM_GOVERNED_X86_FEATURE(IBT)
> > >
> > > #undef KVM_GOVERNED_X86_FEATURE
> > > #undef KVM_GOVERNED_FEATURE
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 9409753f45b0..fd5893b3a2c8 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7765,6 +7765,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
> > >
> > > kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
> > > + kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
> > > + kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
> > >
> > > vmx_setup_uret_msrs(vmx);
> > >
> >
> > Reviewed-by: Maxim Levitsky <[email protected]>
> >
> >
> > PS: IMHO The whole 'governed feature framework' is very confusing and
> > somewhat poorly documented.
> >
> > Currently the only partial explanation of it, is at 'governed_features',
> > which doesn't explain how to use it.
>
> To be honest, terrible name aside, I thought kvm_governed_feature_check_and_set()
> would be fairly self-explanatory, at least relative to all the other CPUID handling
> in KVM.

What is not self-explanatory is what the governed features are and how to query them.

>
> > For the reference this is how KVM expects governed features to be used in the
> > common case (there are some exceptions to this but they are rare)
> >
> > 1. If a feature is not enabled in host CPUID or KVM doesn't support it,
> > KVM is expected to not enable it in KVM cpu caps.
> >
> > 2. Userspace uploads guest CPUID.
> >
> > 3. After the guest CPUID upload, the vendor code calls
> > kvm_governed_feature_check_and_set() which sets governed features = True iff
> > feature is supported in both kvm cpu caps and in guest CPUID.
> >
> > 4. kvm/vendor code uses 'guest_can_use()' to query the value of the governed
> > feature instead of reading guest CPUID.
> >
> > It might make sense to document the above somewhere at least.
> >
> > Now about another thing I am thinking:
> >
> > I do know that the mess of boolean flags that svm had is worse than these
> > governed features and functionality wise these are equivalent.
> >
> > However thinking again about the whole thing:
> >
> > IMHO the 'governed features' is another quite confusing term that a KVM
> > developer will need to learn and keep in memory.
>
> I 100% agree, but I explicitly called out the terrible name in the v1 and v2
> cover letters[1][2], and the patches were on the list for 6 months before I
> applied them. I'm definitely still open to a better name, but I'm also not
> exactly chomping at the bit to get behind the bikeshed.

Honestly I don't know if I can come up with a better name either.
Name is IMHO not the underlying problem; it's the feature itself that is confusing.

>
> v1:
> : Note, I don't like the name "governed", but it was the least awful thing I
> : could come up with. Suggestions most definitely welcome.
>
> v2:
> : Note, I still don't like the name "governed", but no one has suggested
> : anything else, let alone anything better :-)
>
>
> [1] https://lore.kernel.org/all/[email protected]
> [2] https://lore.kernel.org/all/[email protected]
>
> > Because of that, can't we just use guest CPUID as a single source of truth
> > and drop all the governed features code?
>
> No, not without a rather massive ABI break. To make guest CPUID the single source
> of truth, KVM would need to modify guest CPUID to squash features that userspace
> has set, but that are not supported by hardware. And that is most definitely a
> can of worms I don't want to reopen, e.g. see the mess that got created when KVM
> tried to "help" userspace by mucking with VMX capability MSRs in response to
> CPUID changes.


>
> There aren't many real use cases for advertising "unsupported" features via guest
> CPUID, but there are some, and I have definitely abused KVM_SET_CPUID2 for testing
> purposes.
>
> And as above, that doesn't work for X86_FEATURE_XSAVES or X86_FEATURE_GBPAGES.
>
> We'd also have to overhaul guest CPUID lookups to be significantly faster (which
> is doable), as one of the motivations for the framework was to avoid the overhead
> of looking through guest CPUID without needing one-off boolean fields.
>
> > In most cases, the governed feature value will differ from the guest CPUID
> > when a feature is enabled in the guest CPUID but not enabled in the KVM
> > caps.
> >
> > I do see two exceptions to this: XSAVES on AMD and X86_FEATURE_GBPAGES, in
> > which the opposite happens: the governed feature is enabled even when the
> > feature is hidden from the guest CPUID. But it might be better, readability-wise,
> > to deal with these cases manually, and we are unlikely to have many new such
> > cases in the future.
> >
> > So for the common case of CPUID mismatch, when the governed feature is
> > disabled but guest CPUID is enabled, does it make sense to allow this?
>
> Yes and no. For "governed features", probably not. But for CPUID as a whole, there
> are legitimate cases where userspace needs to enumerate things that aren't officially
> "supported" by KVM. E.g. topology, core crystal frequency (CPUID 0x15), defeatures
> that KVM hasn't yet learned about, features that don't have virtualization controls
> and KVM hasn't yet learned about, etc. And for things like Xen and Hyper-V paravirt
> features, it's very doable to implement features that are enumerated by CPUID fully
> in userspace, e.g. using MSR filters.
>
> But again, it's a moot point because KVM has (mostly) allowed userspace to fully
> control guest CPUID for a very long time.
>
> > Such a feature which is advertised as supported but not really working is a
> > recipe for hard-to-find guest bugs, IMHO.
> >
> > IMHO it would be much better to just check this condition and do
> > kvm_vm_bugged() or something in case when a feature is enabled in the guest
> > CPUID but KVM can't support it, and then just use guest CPUID in
> > 'guest_can_use()'.

OK, I won't argue that much over this; however, I still think that there are
better ways to deal with it.

If we put optimizations aside (all of this can surely be optimized to have very
little overhead):

How about we have two CPUIDs: a guest-visible CPUID, which KVM never uses
directly other than during initialization, and an effective CPUID, which is
roughly what governed features are, but which includes all features and is
initialized roughly the way governed features are initialized:

effective_cpuid = guest_cpuid & kvm_supported_cpuid

Except for some forced overrides like for XSAVES and such.

Then we won't need to maintain a list of governed features, and guest_can_use()
for all features will just consult the effective CPUID leafs.

In other words, I want KVM to turn all known CPUID features to governed features,
and then remove all the mentions of governed features except 'guest_can_use'
which is a good API.

Such a proposal will use a bit more memory but will make it easier for future
KVM developers to understand the code, with less chance of introducing bugs.

Best regards,
Maxim Levitsky



>
> Maybe, if we were creating KVM from scratch, e.g. didn't have to worry about
> existing userspace behavior and could implement a more forward-looking API than
> KVM_GET_SUPPORTED_CPUID. But even then the enforcement would need to be limited
> to "pure" hardware-defined feature bits, and I suspect that there would still be
> exceptions. And there would likely be complexity in dealing with CPUID leafs
> that are completely unknown to KVM, e.g. unless KVM completely disallowed non-zero
> values for unknown CPUID leafs, adding restrictions when a feature is defined by
> Intel or AMD would be at constant risk of breaking userspace.
>
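A rough sketch of the "effective CPUID" idea Maxim floats above (hypothetical;
the field and function names here are invented for illustration):

        /* Computed once, after userspace uploads guest CPUID. */
        static void kvm_vcpu_compute_effective_caps(struct kvm_vcpu *vcpu)
        {
                int i;

                for (i = 0; i < NR_KVM_CPU_CAPS; i++)
                        vcpu->arch.effective_cpu_caps[i] =
                                vcpu->arch.guest_cpu_caps[i] & kvm_cpu_caps[i];

                /* Forced overrides, e.g. XSAVES on AMD, would be applied here. */
        }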


2023-11-02 18:39:17

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs

On Wed, 2023-11-01 at 09:31 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > Add emulation interface for CET MSR access. The emulation code is split
> > > into common part and vendor specific part. The former does common check
> > > for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> > > helpers while the latter accesses the MSRs linked to VMCS fields.
> > >
> > > Signed-off-by: Yang Weijiang <[email protected]>
> > > ---
>
> ...
>
> > > + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > > + case MSR_KVM_SSP:
> > > + if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > + break;
> > > + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > + return 1;
> > > + if (index == MSR_KVM_SSP && !host_initiated)
> > > + return 1;
> > > + if (is_noncanonical_address(data, vcpu))
> > > + return 1;
> > > + if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > > + return 1;
> > > + break;
> > Once again I'll prefer to have an ioctl for setting/getting SSP, this will
> > make the above code simpler (e.g there will be no need to check that write
> > comes from the host/etc).
>
> I don't think an ioctl() would be simpler overall, especially when factoring in
> userspace. With a synthetic MSR, we get the following quite cheaply:
>
> 1. Enumerating support to userspace.
> 2. Save/restore of the value, e.g. for live migration.
> 3. Vendor hooks for propagating values to/from the VMCS/VMCB.
>
> For an ioctl(),
> #1 would require a capability, #2 (and #1 to some extent) would
> require new userspace flows, and #3 would require new kvm_x86_ops hooks.
>
> The synthetic MSR adds a small amount of messiness, as does bundling
> MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs. The bulk of the mess comes
> from the need to allow userspace to write '0' when KVM enumerated support to
> userspace.


Let me put it this way: all hacks start like that, and in this case this is an API/ABI hack,
so we will have to live with it forever.

Once there is a precedent, trust me there will be 10s of new 'fake' msrs added, and the
interface will become one big mess.

As I suggested, if you don't want to add new capability/ioctl and vendor callback per new
x86 arch register, then let's implement KVM_GET_ONE_REG/KVM_SET_ONE_REG and then it will
be really easy to add new regs without confusing users, and without polluting msr
namespace with msrs that don't exist.


Best regards,
Maxim Levitsky


>
> If we isolate MSR_IA32_INT_SSP_TAB, that'll help with the synthetic MSR and with
> MSR_IA32_INT_SSP_TAB. For the unfortunate "host reset" behavior, the best idea I
> came up with is to add a helper. It's still a bit ugly, but the ugliness is
> contained in a helper and IMO makes it much easier to follow the case statements.
>
> get:
>
> case MSR_IA32_INT_SSP_TAB:
> if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> !guest_cpuid_has(vcpu, X86_FEATURE_LM))
> return 1;
> break;
> case MSR_KVM_SSP:
> if (!host_initiated)
> return 1;
> fallthrough;
> case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> return 1;
> break;
>
> static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
> bool host_initiated)
> {
> bool any_cet = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
>
> if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> return true;
>
> if (any_cet && guest_can_use(vcpu, X86_FEATURE_IBT))
> return true;
>
> /*
> * If KVM supports the MSR, i.e. has enumerated the MSR existence to
> * userspace, then userspace is allowed to write '0' irrespective of
> * whether or not the MSR is exposed to the guest.
> */
> if (!host_initiated || data)
> return false;
>
> if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> return true;
>
> return any_cet && kvm_cpu_cap_has(X86_FEATURE_IBT);
> }
>
> set:
> case MSR_IA32_U_CET:
> case MSR_IA32_S_CET:
> if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> return 1;
> if (data & CET_US_RESERVED_BITS)
> return 1;
> if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> (data & CET_US_SHSTK_MASK_BITS))
> return 1;
> if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> (data & CET_US_IBT_MASK_BITS))
> return 1;
> if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> return 1;
>
> /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> return 1;
> break;
> case MSR_IA32_INT_SSP_TAB:
> if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> return 1;
>
> if (is_noncanonical_address(data, vcpu))
> return 1;
> break;
> case MSR_KVM_SSP:
> if (!host_initiated)
> return 1;
> fallthrough;
> case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> return 1;
> if (is_noncanonical_address(data, vcpu))
> return 1;
> if (!IS_ALIGNED(data, 4))
> return 1;
> break;
> }
>


2023-11-02 23:58:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs

On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 09:31 -0700, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > > Add emulation interface for CET MSR access. The emulation code is split
> > > > into common part and vendor specific part. The former does common check
> > > > for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> > > > helpers while the latter accesses the MSRs linked to VMCS fields.
> > > >
> > > > Signed-off-by: Yang Weijiang <[email protected]>
> > > > ---
> >
> > ...
> >
> > > > + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > > > + case MSR_KVM_SSP:
> > > > + if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > > + break;
> > > > + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > > + return 1;
> > > > + if (index == MSR_KVM_SSP && !host_initiated)
> > > > + return 1;
> > > > + if (is_noncanonical_address(data, vcpu))
> > > > + return 1;
> > > > + if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > > > + return 1;
> > > > + break;
> > > Once again I'll prefer to have an ioctl for setting/getting SSP, this will
> > > make the above code simpler (e.g there will be no need to check that write
> > > comes from the host/etc).
> >
> > I don't think an ioctl() would be simpler overall, especially when factoring in
> > userspace. With a synthetic MSR, we get the following quite cheaply:
> >
> > 1. Enumerating support to userspace.
> > 2. Save/restore of the value, e.g. for live migration.
> > 3. Vendor hooks for propagating values to/from the VMCS/VMCB.
> >
> > For an ioctl(),
> > #1 would require a capability, #2 (and #1 to some extent) would
> > require new userspace flows, and #3 would require new kvm_x86_ops hooks.
> >
> > The synthetic MSR adds a small amount of messiness, as does bundling
> > MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs. The bulk of the mess comes
> > from the need to allow userspace to write '0' when KVM enumerated support to
> > userspace.
>
> Let me put it this way: all hacks start like that, and in this case this is an API/ABI hack,
> so we will have to live with it forever.

Eh, I don't view it as a hack, at least the kind of hack that has a negative
connotation. KVM effectively has ~240 MSR indices reserved for whatever KVM
wants. The only weird thing about this one is that it's not accessible from the
guest. Which I agree is quite weird, but from a code perspective I think it
works quite well.

> Once there is a precedent, trust me there will be 10s of new 'fake' msrs added, and the
> interface will become one big mess.

That suggests MSRs aren't already one big mess. :-) I'm somewhat joking, but also
somewhat serious. I really don't think that adding one oddball synthetic MSR is
going to meaningfully move the needle on the messiness of MSRs.

Hmm, there probably is a valid slippery slope argument though. As you say, at
some point, enough state will get shoved into hardware that KVM will need an ever
growing number of synthetic MSRs to keep pace.

> As I suggested, if you don't want to add new capability/ioctl and vendor
> callback per new x86 arch register, then let's implement
> KVM_GET_ONE_REG/KVM_SET_ONE_REG and then it will be really easy to add new
> regs without confusing users, and without polluting msr namespace with msrs
> that don't exist.

I definitely don't hate the idea of KVM_{G,S}ET_ONE_REG, what I don't want is to
have an entirely separate path in KVM for handling the actual get/set.

What if we combine the approaches? Add KVM_{G,S}ET_ONE_REG support so that the
uAPI can use completely arbitrary register indices without having to worry about
polluting the MSR space and making MSR_KVM_SSP ABI.

Ooh, if we're clever, I bet we can extend KVM_{G,S}ET_ONE_REG to also work with
existing MSRs, GPRs, and other stuff, i.e. not force userspace through the funky
KVM_SET_MSRS just to set one reg, and not force a RMW of all GPRs just to set
RIP or something. E.g. use bits 39:32 of the id to encode the register class,
bits 31:0 to hold the index within a class, and reserve bits 63:40 for future
usage.

Then for KVM-defined registers, we can route them internally as needed, e.g. we
can still define MSR_KVM_SSP so that internally it's treated like an MSR, but its
index isn't ABI and so can be changed at will. And future KVM-defined registers
wouldn't _need_ to be treated like MSRs, i.e. we could route registers through
the MSR APIs if and only if it makes sense to do so.

Side topic, why on earth is the data value of kvm_one_reg "addr"?
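A sketch of the id encoding floated above (hypothetical, not kernel ABI):

        /* bits 63:40 reserved, bits 39:32 register class, bits 31:0 index */
        #define KVM_X86_REG_CLASS_SHIFT 32

        #define KVM_X86_REG_CLASS_MSR   0ULL
        #define KVM_X86_REG_CLASS_GPR   1ULL
        #define KVM_X86_REG_CLASS_KVM   2ULL    /* KVM-defined, e.g. SSP */

        static inline u64 kvm_x86_reg_id(u64 reg_class, u32 index)
        {
                return (reg_class << KVM_X86_REG_CLASS_SHIFT) | index;
        }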

2023-11-03 08:19:19

by Yang Weijiang

[permalink] [raw]
Subject: Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs

On 11/2/2023 12:31 AM, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
>> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>>> Add emulation interface for CET MSR access. The emulation code is split
>>> into common part and vendor specific part. The former does common check
>>> for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
>>> helpers while the latter accesses the MSRs linked to VMCS fields.
>>>
>>> Signed-off-by: Yang Weijiang <[email protected]>
>>> ---
> ...
>
>>> + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
>>> + case MSR_KVM_SSP:
>>> + if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
>>> + break;
>>> + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
>>> + return 1;
>>> + if (index == MSR_KVM_SSP && !host_initiated)
>>> + return 1;
>>> + if (is_noncanonical_address(data, vcpu))
>>> + return 1;
>>> + if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
>>> + return 1;
>>> + break;
>> Once again I'll prefer to have an ioctl for setting/getting SSP, this will
>> make the above code simpler (e.g there will be no need to check that write
>> comes from the host/etc).
> I don't think an ioctl() would be simpler overall, especially when factoring in
> userspace. With a synthetic MSR, we get the following quite cheaply:
>
> 1. Enumerating support to userspace.
> 2. Save/restore of the value, e.g. for live migration.
> 3. Vendor hooks for propagating values to/from the VMCS/VMCB.
>
> For an ioctl(), #1 would require a capability, #2 (and #1 to some extent) would
> require new userspace flows, and #3 would require new kvm_x86_ops hooks.
>
> The synthetic MSR adds a small amount of messiness, as does bundling
> MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs. The bulk of the mess comes
> from the need to allow userspace to write '0' when KVM enumerated support to
> userspace.
>
> If we isolate MSR_IA32_INT_SSP_TAB, that'll help with the synthetic MSR and with
> MSR_IA32_INT_SSP_TAB. For the unfortunate "host reset" behavior, the best idea I
> came up with is to add a helper. It's still a bit ugly, but the ugliness is
> contained in a helper and IMO makes it much easier to follow the case statements.

Frankly speaking, the existing code is not hard for me to understand :-), the handling for MSR_KVM_SSP
and MSR_IA32_INT_SSP_TAB is straightforward if readers consult the related spec.
But I'll take your advice and include the changes below. Thanks!
> get:
>
> case MSR_IA32_INT_SSP_TAB:
> if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> !guest_cpuid_has(vcpu, X86_FEATURE_LM))
> return 1;
> break;
> case MSR_KVM_SSP:
> if (!host_initiated)
> return 1;
> fallthrough;
> case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> return 1;
> break;
>
> static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
> bool host_initiated)
> {
> bool any_cet = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
>
> if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> return true;
>
> if (any_cet && guest_can_use(vcpu, X86_FEATURE_IBT))
> return true;
>
> /*
> * If KVM supports the MSR, i.e. has enumerated the MSR existence to
> * userspace, then userspace is allowed to write '0' irrespective of
> * whether or not the MSR is exposed to the guest.
> */
> if (!host_initiated || data)
> return false;
> if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> return true;
>
> return any_cet && kvm_cpu_cap_has(X86_FEATURE_IBT);
> }
>
> set:
> case MSR_IA32_U_CET:
> case MSR_IA32_S_CET:
> if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> return 1;
> if (data & CET_US_RESERVED_BITS)
> return 1;
> if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> (data & CET_US_SHSTK_MASK_BITS))
> return 1;
> if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> (data & CET_US_IBT_MASK_BITS))
> return 1;
> if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> return 1;
>
> /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> return 1;
> break;
> case MSR_IA32_INT_SSP_TAB:
> if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> return 1;
>
> if (is_noncanonical_address(data, vcpu))
> return 1;
> break;
> case MSR_KVM_SSP:
> if (!host_initiated)
> return 1;
> fallthrough;
> case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> return 1;
> if (is_noncanonical_address(data, vcpu))
> return 1;
> if (!IS_ALIGNED(data, 4))
> return 1;
> break;
> }

2023-11-03 22:28:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs

On Fri, Nov 03, 2023, Weijiang Yang wrote:
> On 11/2/2023 12:31 AM, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > > Add emulation interface for CET MSR access. The emulation code is split
> > > > into common part and vendor specific part. The former does common check
> > > > for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> > > > helpers while the latter accesses the MSRs linked to VMCS fields.
> > > >
> > > > Signed-off-by: Yang Weijiang <[email protected]>
> > > > ---
> > ...
> >
> > > > + case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > > > + case MSR_KVM_SSP:
> > > > + if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > > + break;
> > > > + if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > > + return 1;
> > > > + if (index == MSR_KVM_SSP && !host_initiated)
> > > > + return 1;
> > > > + if (is_noncanonical_address(data, vcpu))
> > > > + return 1;
> > > > + if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > > > + return 1;
> > > > + break;
> > > Once again I'll prefer to have an ioctl for setting/getting SSP, this will
> > > make the above code simpler (e.g there will be no need to check that write
> > > comes from the host/etc).
> > I don't think an ioctl() would be simpler overall, especially when factoring in
> > userspace. With a synthetic MSR, we get the following quite cheaply:
> >
> > 1. Enumerating support to userspace.
> > 2. Save/restore of the value, e.g. for live migration.
> > 3. Vendor hooks for propagating values to/from the VMCS/VMCB.
> >
> > For an ioctl(), #1 would require a capability, #2 (and #1 to some extent) would
> > require new userspace flows, and #3 would require new kvm_x86_ops hooks.
> >
> > The synthetic MSR adds a small amount of messiness, as does bundling
> > MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs. The bulk of the mess comes
> > from the need to allow userspace to write '0' when KVM enumerated support to
> > userspace.
> >
> > If we isolate MSR_IA32_INT_SSP_TAB, that'll help with the synthetic MSR and with
> > MSR_IA32_INT_SSP_TAB. For the unfortunate "host reset" behavior, the best idea I
> > came up with is to add a helper. It's still a bit ugly, but the ugliness is
> > contained in a helper and IMO makes it much easier to follow the case statements.
>
> Frankly speaking, the existing code is not hard for me to understand :-), the
> handling for MSR_KVM_SSP and MSR_IA32_INT_SSP_TAB is straightforward if
> readers consult the related spec.

I don't necessarily disagree, but I 100% agree with Maxim that host_msr_reset is
a confusing name. As Maxim pointed out, '0' isn't necessarily the RESET value.
And host_msr_reset implies that userspace is emulating a RESET, which may not
actually be true, e.g. a naive userspace could be restoring '0' as part of live
migration.

> But I'll take your advice and include the changes below. Thanks!

Definitely feel free to propose an alternative. My goal with the suggested change
is to eliminate host_msr_reset without creating unwieldy case statements.
Isolating MSR_IA32_INT_SSP_TAB was (obviously) the best solution I came up with.

> > get:
> >
> > case MSR_IA32_INT_SSP_TAB:
> > if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> > !guest_cpuid_has(vcpu, X86_FEATURE_LM))
> > return 1;
> > break;
> > case MSR_KVM_SSP:
> > if (!host_initiated)
> > return 1;
> > fallthrough;
> > case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > return 1;
> > break;
> >
> > static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
> > bool host_initiated)
> > {
> > bool any_cet = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
> >
> > if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > return true;
> >
> > if (any_cet && guest_can_use(vcpu, X86_FEATURE_IBT))
> > return true;
> >
> > /*
> > * If KVM supports the MSR, i.e. has enumerated the MSR existence to
> > * userspace, then userspace is allowed to write '0' irrespective of
> > * whether or not the MSR is exposed to the guest.
> > */
> > if (!host_initiated || data)
> > return false;
> > if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > return true;
> >
> > return any_cet && kvm_cpu_cap_has(X86_FEATURE_IBT);
> > }
> >
> > set:
> > case MSR_IA32_U_CET:
> > case MSR_IA32_S_CET:
> > if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> > return 1;
> > if (data & CET_US_RESERVED_BITS)
> > return 1;
> > if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> > (data & CET_US_SHSTK_MASK_BITS))
> > return 1;
> > if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> > (data & CET_US_IBT_MASK_BITS))
> > return 1;
> > if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> > return 1;
> >
> > /* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> > if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> > return 1;
> > break;
> > case MSR_IA32_INT_SSP_TAB:
> > if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> > return 1;

Doh, I think this should be:

if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated) ||
!guest_cpuid_has(vcpu, X86_FEATURE_LM))
return 1;
> >
> > if (is_noncanonical_address(data, vcpu))
> > return 1;
> > break;
> > case MSR_KVM_SSP:
> > if (!host_initiated)
> > return 1;
> > fallthrough;
> > case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> > return 1;
> > if (is_noncanonical_address(data, vcpu))
> > return 1;
> > if (!IS_ALIGNED(data, 4))
> > return 1;
> > break;
> > }
>

2023-11-04 00:08:03

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"

On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 08:46 -0700, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > > Use the governed feature framework to track whether the X86_FEATURE_SHSTK
> > > > and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> > > > the features can be used iff both KVM and guest CPUID support them.
> > > PS: IMHO the whole 'governed feature framework' is very confusing and
> > > somewhat poorly documented.
> > >
> > > Currently, the only partial explanation of it is at 'governed_features',
> > > which doesn't explain how to use it.
> >
> > To be honest, terrible name aside, I thought kvm_governed_feature_check_and_set()
> > would be fairly self-explanatory, at least relative to all the other CPUID handling
> > in KVM.
>
> What is not self-explanatory is what governed_features are and how to query them.

...

> > > However thinking again about the whole thing:
> > >
> > > IMHO the 'governed features' is another quite confusing term that a KVM
> > > developer will need to learn and keep in memory.
> >
> > I 100% agree, but I explicitly called out the terrible name in the v1 and v2
> > cover letters[1][2], and the patches were on the list for 6 months before I
> > applied them. I'm definitely still open to a better name, but I'm also not
> > exactly chomping at the bit to get behind the bikeshed.
>
> Honestly I don't know if I can come up with a better name either. The name is
> IMHO not the underlying problem, it's the feature itself that is confusing.

...

> > Yes and no. For "governed features", probably not. But for CPUID as a whole, there
> > are legitimate cases where userspace needs to enumerate things that aren't officially
> > "supported" by KVM. E.g. topology, core crystal frequency (CPUID 0x15), defeatures
> > that KVM hasn't yet learned about, features that don't have virtualization controls
> > and KVM hasn't yet learned about, etc. And for things like Xen and Hyper-V paravirt
> > features, it's very doable to implement features that are enumerated by CPUID fully
> > in userspace, e.g. using MSR filters.
> >
> > But again, it's a moot point because KVM has (mostly) allowed userspace to fully
> > control guest CPUID for a very long time.
> >
> > > Such a feature which is advertised as supported but not really working is a
> > > recipe for hard-to-find guest bugs IMHO.
> > >
> > > IMHO it would be much better to just check this condition and do
> > > kvm_vm_bugged() or something when a feature is enabled in the guest
> > > CPUID but KVM can't support it, and then just use guest CPUID in
> > > 'guest_can_use()'.
>
> OK, I won't argue much over this; however, I still think that there are
> better ways to deal with it.
>
> If we put optimizations aside (all of this can surely be optimized to have
> very little overhead):
>
> How about we have two CPUIDs: a guest-visible CPUID, which KVM will never use
> directly other than during initialization, and an effective CPUID, which is
> roughly what governed features are, but which will include all features and
> will be initialized roughly like governed features are initialized:
>
> effective_cpuid = guest_cpuid & kvm_supported_cpuid
>
> Except for some forced overrides like for XSAVES and such.
>
> Then we won't need to maintain a list of governed features, and guest_can_use()
> for all features will just consult the effective CPUID leafs.
>
> In other words, I want KVM to turn all known CPUID features into governed features,
> and then remove all mentions of governed features except 'guest_can_use()',
> which is a good API.
>
> Such a proposal will use a bit more memory but will make it easier for future
> KVM developers to understand the code, with less chance of introducing bugs.

Hmm, two _full_ CPUID arrays would be a mess and completely unnecessary. E.g.
we'd have to sort out Hyper-V and KVM PV, which both have their own caches. And
a duplicate entry for things like F/M/S would be ridiculous.

But maintaining a per-vCPU version of the CPU caps is definitely doable. I.e. a
vCPU equivalent to kvm_cpu_caps and the per-CPU capabilities. There are currently
25 leafs that are tracked by kvm_cpu_caps, so relative to "governed" features,
the cost will be 96 bytes per vCPU. I agree that 96 bytes is worth eating, we've
certainly taken on more for a lot, lot less.

It's a lot of churn, and there are some subtle nasties, e.g. MWAIT and other
CPUID bits that change based on MSRs or CR4, but most of the churn is superficial
and the result is waaaaay less ugly than governed features and for the majority of
features will Just Work.
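Roughly what I have in mind, as a very rough sketch (the field and helper
names below are placeholders, not final):

/*
 * Per-vCPU capability cache mirroring kvm_cpu_caps, recomputed at
 * KVM_SET_CPUID2 time as the intersection of guest CPUID and KVM's
 * capabilities (plus forced overrides for things like XSAVES).
 */
static void kvm_vcpu_refresh_cpu_caps(struct kvm_vcpu *vcpu)
{
	int i;

	for (i = 0; i < NR_KVM_CPU_CAPS; i++)
		vcpu->arch.cpu_caps[i] = vcpu_cpuid_to_caps(vcpu, i) &
					 kvm_cpu_caps[i];
}

static __always_inline bool guest_can_use(struct kvm_vcpu *vcpu,
					  unsigned int x86_feature)
{
	return vcpu->arch.cpu_caps[x86_feature / 32] & BIT(x86_feature % 32);
}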

I'll get a series posted next week (need to write changelogs and do a _lot_ more
testing). If you want to take a peek at where I'm headed before then:

https://github.com/sean-jc/linux x86/guest_cpufeatures

2023-11-15 07:18:18

by Binbin Wu

[permalink] [raw]
Subject: Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS



On 9/14/2023 2:33 PM, Yang Weijiang wrote:
> Update CPUID.(EAX=0DH,ECX=1).EBX to reflect current required xstate size
> due to XSS MSR modification.
> CPUID(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
> xstate features in (XCR0 | IA32_XSS). The CPUID value can be used by the guest
> before allocating a sufficiently large xsave buffer.
>
> Note, KVM does not yet support any XSS based features, i.e. supported_xss
> is guaranteed to be zero at this time.
>
> Opportunistically modify the XSS write access logic: if !guest_cpuid_has(),
> a write initiated from the host is allowed iff the write is a reset operation,
> i.e., data == 0; reject host-initiated non-reset writes and any guest write.
Hi Sean & Paolo,
During code review of the Enable CET Virtualization v5 patchset, there were
discussions about "do a wholesale cleanup of all the cases that essentially
allow userspace to do KVM_SET_MSR before KVM_SET_CPUID2", i.e., force the
order between KVM_SET_CPUID2 and KVM_SET_MSR, but allow the host_initiated
path with a default (generally 0) value.
https://lore.kernel.org/kvm/[email protected]/
https://lore.kernel.org/kvm/CABgObfbvr8F8g5hJN6jn95m7u7m2+8ACkqO25KAZwRmJ9AncZg@mail.gmail.com/

I can take the task to do the code cleanup.
Before going any further, I want to confirm that this is still the intended
direction, right?


>
> Suggested-by: Sean Christopherson <[email protected]>
> Co-developed-by: Zhang Yi Z <[email protected]>
> Signed-off-by: Zhang Yi Z <[email protected]>
> Signed-off-by: Yang Weijiang <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/cpuid.c | 15 ++++++++++++++-
> arch/x86/kvm/x86.c | 13 +++++++++----
> 3 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0fc5e6312e93..d77b030e996c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
>
> u64 xcr0;
> u64 guest_supported_xcr0;
> + u64 guest_supported_xss;
>
> struct kvm_pio_request pio;
> void *pio_data;
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1f206caec559..4e7a820cba62 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
> best = cpuid_entry2_find(entries, nent, 0xD, 1);
> if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
> cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
> - best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
> + best->ebx = xstate_required_size(vcpu->arch.xcr0 |
> + vcpu->arch.ia32_xss, true);
>
> best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
> if (kvm_hlt_in_guest(vcpu->kvm) && best &&
> @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> }
>
> +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpuid_entry2 *best;
> +
> + best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
> + if (!best)
> + return 0;
> +
> + return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
> +}
> +
> static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
> {
> struct kvm_cpuid_entry2 *entry;
> @@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> }
>
> vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
> + vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
>
> /*
> * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1258d1d6dd52..9a616d84bd39 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vcpu->arch.ia32_tsc_adjust_msr += adj;
> }
> break;
> - case MSR_IA32_XSS:
> - if (!msr_info->host_initiated &&
> - !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
> + case MSR_IA32_XSS: {
> + bool host_msr_reset = msr_info->host_initiated && data == 0;
> +
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
> + (!host_msr_reset || !msr_info->host_initiated))
> return 1;
> /*
> * KVM supports exposing PT to the guest, but does not support
> * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
> * XSAVES/XRSTORS to save/restore PT MSRs.
> */
> - if (data & ~kvm_caps.supported_xss)
> + if (data & ~vcpu->arch.guest_supported_xss)
> return 1;
> + if (vcpu->arch.ia32_xss == data)
> + break;
> vcpu->arch.ia32_xss = data;
> kvm_update_cpuid_runtime(vcpu);
> break;
> + }
> case MSR_SMI_COUNT:
> if (!msr_info->host_initiated)
> return 1;
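As an aside, the guest-side usage the changelog describes boils down to
something like this (illustrative sketch only, not part of the patch):

#include <stdint.h>
#include <cpuid.h>

/*
 * After updating IA32_XSS, the guest re-reads CPUID.(EAX=0xD,ECX=1).EBX to
 * learn how large its XSAVES buffer must now be.
 */
static uint32_t xsaves_required_size(void)
{
	uint32_t eax, ebx, ecx, edx;

	__cpuid_count(0xD, 1, eax, ebx, ecx, edx);
	return ebx;	/* bytes for all states enabled in XCR0 | IA32_XSS */
}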

2023-11-15 08:24:21

by Yang Weijiang

[permalink] [raw]
Subject: Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest

On 11/1/2023 10:09 AM, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>> to enable CET for nested VM.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
>> arch/x86/kvm/vmx/vmcs12.c | 6 ++++++
>> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
>> arch/x86/kvm/vmx/vmx.c | 2 ++
>> 4 files changed, 46 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 78a3be394d00..2c4ff13fddb0 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>> nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>
>> + /* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_U_CET, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_S_CET, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>> +
>> + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>> +
>> kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>
>> vmx->nested.force_msr_bitmap_recalc = false;
>> @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>> VM_EXIT_HOST_ADDR_SPACE_SIZE |
>> #endif
>> VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> - VM_EXIT_CLEAR_BNDCFGS;
>> + VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>> msrs->exit_ctls_high |=
>> VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>> @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>> #ifdef CONFIG_X86_64
>> VM_ENTRY_IA32E_MODE |
>> #endif
>> - VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>> + VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>> + VM_ENTRY_LOAD_CET_STATE;
>> msrs->entry_ctls_high |=
>> (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>> index 106a72c923ca..4233b5ca9461 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.c
>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>> + FIELD(GUEST_S_CET, guest_s_cet),
>> + FIELD(GUEST_SSP, guest_ssp),
>> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
> between vmcs02 and vmcs12 on nested VM entry/exit, probably in
> sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.

After checking around the code, it's necessary to sync the related fields from vmcs02 to vmcs12
at nested VM-exit so that L1 or userspace can access the correct values.
I'll add this part, thanks!
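Roughly along these lines (untested sketch, exact placement to be refined):

	/*
	 * On nested VM-exit, e.g. in sync_vmcs02_to_vmcs12(), copy the CET
	 * guest state the CPU may have modified from vmcs02 back to vmcs12:
	 */
	vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
	vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
	vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);

	/*
	 * And on nested VM-entry, e.g. in prepare_vmcs02(), load L1's values
	 * into vmcs02 when L1 sets VM_ENTRY_LOAD_CET_STATE:
	 */
	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
		vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
		vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
		vmcs_writel(GUEST_INTR_SSP_TABLE, vmcs12->guest_ssp_tbl);
	}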

2023-11-15 08:31:44

by Yang Weijiang

[permalink] [raw]
Subject: Re: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

On 11/1/2023 12:21 PM, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:24AM -0400, Yang Weijiang wrote:
>> Per SDM description(Vol.3D, Appendix A.1):
>> "If bit 56 is read as 1, software can use VM entry to deliver a hardware
>> exception with or without an error code, regardless of vector"
>>
>> Modify the has_error_code check done before injecting events to the nested
>> guest. Only enforce the check when the guest is in real mode, the exception
>> is not a hardware exception, or the platform doesn't enumerate bit 56 in
>> VMX_BASIC; in all other cases skip the check to keep the logic consistent
>> with the SDM.
>>
>> Signed-off-by: Yang Weijiang <[email protected]>
>> ---
>> arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++--------
>> arch/x86/kvm/vmx/nested.h | 5 +++++
>> 2 files changed, 19 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index c5ec0ef51ff7..78a3be394d00 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
>> {
>> const u64 feature_and_reserved =
>> /* feature (except bit 48; see below) */
>> - BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
>> + BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
>> /* reserved */
>> - BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
>> + BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
>> u64 vmx_basic = vmcs_config.nested.basic;
>>
>> if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
>> @@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
>> CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
>> return -EINVAL;
>>
>> - /* VM-entry interruption-info field: deliver error code */
>> - should_have_error_code =
>> - intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
>> - x86_exception_has_error_code(vector);
>> - if (CC(has_error_code != should_have_error_code))
>> - return -EINVAL;
>> + if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
>> + !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
>> + /* VM-entry interruption-info field: deliver error code */
>> + should_have_error_code =
>> + intr_type == INTR_TYPE_HARD_EXCEPTION &&
>> + prot_mode &&
>> + x86_exception_has_error_code(vector);
>> + if (CC(has_error_code != should_have_error_code))
>> + return -EINVAL;
>> + }
> prot_mode and intr_type are used twice, making the code a little hard to read.
>
> how about:
> /*
> * Cannot deliver error code in real mode or if the
> * interruption type is not hardware exception. For other
> * cases, do the consistency check only if the vCPU doesn't
> * enumerate VMX_BASIC_NO_HW_ERROR_CODE_CC.
> */
> if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
> if (CC(has_error_code))
> return -EINVAL;
> } else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
> if (CC(has_error_code != x86_exception_has_error_code(vector)))
> return -EINVAL;
> }
>
> and drop should_have_error_code.

The change looks clearer, I'll take it, thanks!
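For reference, the nested_cpu_has_no_hw_errcode_cc() helper used above is
presumably just the bit 56 test, along the lines of:

static inline bool nested_cpu_has_no_hw_errcode_cc(struct kvm_vcpu *vcpu)
{
	/* VMX_BASIC bit 56, per SDM Vol.3D, Appendix A.1 */
	return to_vmx(vcpu)->nested.msrs.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
}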


2023-11-15 08:56:58

by Yang Weijiang

[permalink] [raw]
Subject: Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest

On 11/1/2023 5:54 PM, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 10:09 +0800, Chao Gao wrote:
>> On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
>>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>>> to enable CET for nested VM.
>>>
>>> Signed-off-by: Yang Weijiang <[email protected]>
>>> ---
>>> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
>>> arch/x86/kvm/vmx/vmcs12.c | 6 ++++++
>>> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
>>> arch/x86/kvm/vmx/vmx.c | 2 ++
>>> 4 files changed, 46 insertions(+), 3 deletions(-)
>>>
...
>>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>>> index 106a72c923ca..4233b5ca9461 100644
>>> --- a/arch/x86/kvm/vmx/vmcs12.c
>>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>>> FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>>> FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>>> FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>>> + FIELD(GUEST_S_CET, guest_s_cet),
>>> + FIELD(GUEST_SSP, guest_ssp),
>>> + FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
>> I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
>> between vmcs02 and vmcs12 on nested VM entry/exit, probably in
>> sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.
>>
> Aha, this is why I suspected that nested support is incomplete,
> 100% agree.
>
> In particular, looking at Intel's SDM I see that:
>
> HOST_S_CET, HOST_SSP, and HOST_INTR_SSP_TABLE need to be copied from vmcs12 to vmcs02 but not vice versa,
> because the CPU doesn't touch them.
>
> GUEST_S_CET, GUEST_SSP, GUEST_INTR_SSP_TABLE should be copied bi-directionally.

Yes, I'll complete this part of the code in the next version, thanks!

> This of course depends on the corresponding vm entry and vm exit controls being set.
> That means that it is legal in theory to do VM entry/exit with CET enabled but not use
> VM_ENTRY_LOAD_CET_STATE and/or VM_EXIT_LOAD_CET_STATE,
> because, for example, a nested hypervisor can opt to save/load these itself.
>
> I think that this is all, but I also can't be 100% sure. This thing has to be tested well before
> we can be sure that it works.
>
> Best regards,
> Maxim Levitsky
>

2023-11-15 09:01:16

by Yang Weijiang

[permalink] [raw]
Subject: Re: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers

On 11/3/2023 2:26 AM, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 12:32 -0700, Sean Christopherson wrote:
>> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
>>> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>>>> Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
>>>> helpers to replace existing usage of the raw functions.
>>>> kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
>>>> to get/set an MSR value when emulating CPU behavior.
>>> I am not sure if I like this patch or not. On one hand the code is cleaner
>>> this way, but on the other hand now it is easier to call kvm_msr_write() on
>>> behalf of the guest.
>>>
>>> For example we also have the 'kvm_set_msr()' which does actually set the msr
>>> on behalf of the guest.
>>>
>>> How about we call the new function kvm_msr_set_host() and rename
>>> kvm_set_msr() to kvm_msr_set_guest(), together with good comments explaining
>>> what they do?
>> LOL, just call me Nostradamus[*] ;-)
>>
>> : > SSP save/load should go to enter_smm_save_state_64() and rsm_load_state_64(),
>> : > where other fields of SMRAM are handled.
>> :
>> : +1. The right way to get/set MSRs like this is to use __kvm_get_msr() and pass
>> : %true for @host_initiated. Though I would add a prep patch to provide wrappers
>> : for __kvm_get_msr() and __kvm_set_msr(). Naming will be hard, but I think we
>> ^^^^^^^^^^^^^^^^^^^
>> : can use kvm_{read,write}_msr() to go along with the KVM-initiated register
>> : accessors/mutators, e.g. kvm_register_read(), kvm_pdptr_write(), etc.
>>
>> [*] https://lore.kernel.org/all/[email protected]
>>
>>> Also functions like kvm_set_msr_ignored_check(), kvm_set_msr_with_filter() and such,
>>> IMHO have names that are not very user friendly.
>> I don't like the host/guest split because KVM always operates on guest values,
>> e.g. kvm_msr_set_host() in particular could get confusing.
> That makes sense.
>
>> IMO kvm_get_msr() and kvm_set_msr(), and to some extent the helpers you note below,
>> are the real problem.
>>
>> What if we rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
>> more obvious that those are the "guest" helpers? And do that as a prep patch in
>> this series (there aren't _that_ many users).
> Makes sense.

Then I'll modify the related code and add the prep patch in the next version, thanks!
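i.e., something along these lines for the prep patch (sketch; final names per
the discussion above):

/*
 * KVM-internal accessors: wrap the inner helpers with host_initiated == true,
 * while the guest-initiated kvm_{g,s}et_msr() get renamed to
 * kvm_emulate_msr_{read,write}().
 */
int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
{
	return __kvm_get_msr(vcpu, index, data, true);
}

int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
{
	return __kvm_set_msr(vcpu, index, data, true);
}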

>> I'm also in favor of renaming the "inner" helpers, but I think we should tackle
>> those separately.
> OK.
>
> Best regards,
> Maxim Levitsky
>
>

2023-11-15 13:27:26

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1

On Wed, Nov 15, 2023, Weijiang Yang wrote:
> On 11/1/2023 12:21 PM, Chao Gao wrote:
> > On Thu, Sep 14, 2023 at 02:33:24AM -0400, Yang Weijiang wrote:
> > > @@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
> > > CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
> > > return -EINVAL;
> > >
> > > - /* VM-entry interruption-info field: deliver error code */
> > > - should_have_error_code =
> > > - intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
> > > - x86_exception_has_error_code(vector);
> > > - if (CC(has_error_code != should_have_error_code))
> > > - return -EINVAL;
> > > + if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
> > > + !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
> > > + /* VM-entry interruption-info field: deliver error code */
> > > + should_have_error_code =
> > > + intr_type == INTR_TYPE_HARD_EXCEPTION &&
> > > + prot_mode &&
> > > + x86_exception_has_error_code(vector);
> > > + if (CC(has_error_code != should_have_error_code))
> > > + return -EINVAL;
> > > + }
> > prot_mode and intr_type are used twice, making the code a little hard to read.
> >
> > how about:
> > /*
> > * Cannot deliver error code in real mode or if the
> > * interruption type is not hardware exception. For other
> > * cases, do the consistency check only if the vCPU doesn't
> > * enumerate VMX_BASIC_NO_HW_ERROR_CODE_CC.
> > */
> > if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
> > if (CC(has_error_code))
> > return -EINVAL;
> > } else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
> > if (CC(has_error_code != x86_exception_has_error_code(vector)))
> > return -EINVAL;
> > }

Or maybe go one step further and put the nested_cpu_has...() check inside the CC()
macro so that it too will be captured on error. It's a little uglier though, and
I doubt providing that extra information will matter in practice, so definitely
feel free to stick with Chao's version.

if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
if (CC(has_error_code))
return -EINVAL;
} else if (CC(!nested_cpu_has_no_hw_errcode_cc(vcpu) &&
has_error_code != x86_exception_has_error_code(vector))) {
return -EINVAL;
}