2021-12-17 15:30:45

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 00/23] AMX Support in KVM

This version is not intended for acceptance. But giving many moving
pieces in v1 review it might be good to have an alignment on what
we have improved now.

It's highly appreciated if you can help take a quick look and let
us know any new issue before your holiday. We will continue working
on closing remaining TODOs and try to have a complete version ready
when you are back. :)

TODOs:

- selftest for XCR0, XFD and new AMX state is still in progress;
- Investigate whether trapping #NM can be delayed until the
1st WRMSR(IA32_XFD) with a non-zero value;
- We are confirming with hardware team whether state 18 can be
frozen to 8k;

----
v1->v2:
- Live migration supported and verified with a selftest
- Rebase to Thomas's new series for guest fpstate reallocation [1]
- Expand fpstate at KVM_SET_CPUID2 instead of when emulating XCR0
and IA32_XFD (Thomas/Paolo)
- Accordingly remove all exit-to-userspace stuff
- Intercept #NM to save guest XFD_ERR and restore host/guest value
at preemption on/off boundary (Thomas)
- Accordingly remove all xfd_err logic in preemption callback and
fpu_swap_kvm_fpstate()
- Reuse KVM_SET_XSAVE to handle both legacy and expanded buffer (Paolo)
- Don't return dynamic bits w/o prctl() in KVM_GET_SUPPORTED_CPUID (Paolo)
- Check guest permissions for dynamic features in CPUID[0xD] instead
of only for AMX at KVM_SET_CPUID (Paolo)
- Remove dynamic bit check for 32-bit guest in __kvm_set_xcr() (Paolo)
- Fix CPUID emulation for 0x1d and 0x1e (Paolo)
- Move "disable interception" to the end of the series (Paolo)

This series brings AMX (Advanced Matrix eXtensions) virtualization
support to KVM. The preparatory series from Thomas [1] is also included.

A large portion of the changes in this series is to deal with eXtended
Feature Disable (XFD) which allows resizing of the fpstate buffer to
support dynamically-enabled XSTATE features with large state component
(e.g. 8K for AMX).

There are a lot of simplications when comparing v2 to the original
proposal [2] and the first version [3]. Thanks to Thomas and Paolo for
many good suggestions.

The support is based on following key changes:

- Guest permissions for dynamically-enabled XSAVE features

Native tasks have to request permission via prctl() before touching
a dynamic-resized XSTATE compoenent. Introduce guest permissions
for the similar purpose. Userspace VMM is expected to request guest
permission only once when the first vCPU is created.

KVM checks guest permission in KVM_SET_CPUID2. Setting XFD in guest
cpuid w/o proper permissions fails this operation. In the meantime,
unpermitted features are also excluded in KVM_GET_SUPPORTED_CPUID.

- Extend fpstate reallocation mechanism to cover guest fpu

Unlike native tasks which have reallocation triggered from #NM
handler, guest fpstate reallocation is requested by KVM when it
identifies the intention on using dynamically-enabled XSAVE
features inside guest.

Extend fpu core to allow KVM request fpstate buffer expansion
for a guest fpu containter.

- Trigger fpstate reallocation in KVM

This could be done either statically (before guest runs) or
dynamically (in the emulation path). According to discussion [1]
we decide to statically enable all xfeatures allowed by guest perm
in KVM_SET_CPUID2, with fpstate buffer sized accordingly. This spares
a lot of code and also avoid imposing an ordered restore sequence
(XCR0, XFD and XSTATE) to userspace VMM.

- RDMSR/WRMSR emulation for IA32_XFD

Because fpstate expansion is completed in KVM_SET_CPUID2, emulating
r/w access to IA32_XFD simply involves the xfd field in the guest
fpu container. If write and guest fpu is currently active, the
software state (guest_fpstate::xfd and per-cpu xfd cache) is also
updated.

- RDMSR/WRMSR emulation for XFD_ERR

When XFD causes an instruction to generate #NM, XFD_ERR contains
information about which disabled state components are being accessed.
It'd be problematic if the XFD_ERR value generated in guest is
consumed/clobbered by the host before the guest itself doing so.

Intercept #NM exception to save the guest XFD_ERR value if
guest CPUID includes dynamic xfeature bits. There is at most one
interception per guest task given a dynamic feature.

RDMSR/WRMSR emulation uses the saved value. The host value (always
ZERO outside of the host #NM handler) is restored before enabling
preemption. The saved guest value is restored right before entering
the guest (with preemption disabled).

- Get/set dynamic xfeature state for migration

Introduce new capability (KVM_CAP_XSAVE2) to deal with >4KB fpstate
buffer. Reading this capability returns the size of the current
guest fpstate (e.g. after expansion). Userspace VMM uses a new ioctl
(KVM_GET_XSAVE2) to read guest fpstate from the kernel and reuses
the existing ioctl (KVM_SET_XSAVE) to update guest fpsate to the
kernel. KVM_SET_XSAVE is extended to do properly_sized memdup_user()
based on the guest fpstate.

- Expose related cpuid bits to guest

The last step is to allow exposing XFD, AMX_TILE, AMX_INT8 and
AMX_BF16 in guest cpuid. Adding those bits into kvm_cpu_caps finally
activates all previous logics in this series

- Optimization: disable interception for IA32_XFD

IA32_XFD can be frequently updated by the guest, as it is part of
the task state and swapped in context switch when prev and next have
different XFD setting. Always intercepting WRMSR can easily cause
non-negligible overhead.

Disable r/w emulation for IA32_XFD after intercepting the first
WRMSR(IA32_XFD) with a non-zero value. However MSR passthrough
implies the software state (guest_fpstate::xfd and per-cpu xfd
cache) might be out of sync with MSR. This suggests KVM needs to
re-sync them at VM-exit before preemption is enabled.

Also disable read emulation for IA32_XFD_ERR at the same point.

To verify AMX virtualization overhead on non-AMX usages, we run the
Phoronix kernel build test in the guest w/ and w/o AMX in cpuid. The
result shows no observable difference between two configurations.

Thanks Jun Nakajima and Kevin Tian for the design suggestions when
this version is being internally worked on.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://www.spinics.net/lists/kvm/msg259015.html
[3] https://lore.kernel.org/lkml/[email protected]/

Thanks,
Jing

---
Guang Zeng (1):
kvm: x86: AMX live migration support

Jing Liu (13):
kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule
kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID
kvm: x86: Check permitted dynamic xfeatures at KVM_SET_CPUID2
x86/fpu: Make XFD initialization in __fpstate_reset() a function
argument
kvm: x86: Enable dynamic XSAVE features at KVM_SET_CPUID2
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Add CPUID support for Intel AMX
kvm: x86: Disable interception for IA32_XFD on demand
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR

Kevin Tian (2):
x86/fpu: Provide fpu_update_guest_perm_features() for guest
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation

Thomas Gleixner (5):
x86/fpu: Extend fpu_xstate_prctl() with guest permissions
x86/fpu: Prepare guest FPU for dynamically enabled FPU features
x86/fpu: Add guest support to xfd_enable_feature()
x86/fpu: add uabi_size to guest_fpu
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()

Wei Wang (2):
kvm: selftests: Add support for KVM_CAP_XSAVE2
docs: kvm: Add KVM_GET_XSAVE2

Documentation/virt/kvm/api.rst | 33 +++-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/fpu/api.h | 11 ++
arch/x86/include/asm/fpu/types.h | 32 ++++
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/uapi/asm/kvm.h | 12 +-
arch/x86/include/uapi/asm/prctl.h | 26 ++--
arch/x86/kernel/fpu/core.c | 104 ++++++++++++-
arch/x86/kernel/fpu/xstate.c | 145 +++++++++++-------
arch/x86/kernel/fpu/xstate.h | 15 +-
arch/x86/kernel/process.c | 2 +
arch/x86/kvm/cpuid.c | 93 ++++++++---
arch/x86/kvm/vmx/vmcs.h | 5 +
arch/x86/kvm/vmx/vmx.c | 27 +++-
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 103 ++++++++++++-
include/uapi/linux/kvm.h | 4 +
tools/arch/x86/include/uapi/asm/kvm.h | 12 +-
tools/include/uapi/linux/kvm.h | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 2 +
.../selftests/kvm/include/x86_64/processor.h | 12 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 31 ++++
.../selftests/kvm/lib/x86_64/processor.c | 45 +++++-
.../testing/selftests/kvm/x86_64/evmcs_test.c | 2 +-
tools/testing/selftests/kvm/x86_64/smm_test.c | 2 +-
.../testing/selftests/kvm/x86_64/state_test.c | 2 +-
.../kvm/x86_64/vmx_preemption_timer_test.c | 2 +-
28 files changed, 624 insertions(+), 108 deletions(-)

--
2.27.0



2021-12-17 15:30:55

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 21/23] x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()

From: Thomas Gleixner <[email protected]>

KVM can disable the write emulation for the XFD MSR when the vCPU's fpstate
is already correctly sized to reduce the overhead.

When write emulation is disabled the XFD MSR state after a VMEXIT is
unknown and therefore not in sync with the software states in fpstate and
the per CPU XFD cache.

Provide fpu_sync_guest_vmexit_xfd_state() which has to be invoked after a
VMEXIT before enabling interrupts when write emulation is disabled for the
XFD MSR.

It could be invoked unconditionally even when write emulation is enabled
for the price of a pointless MSR read.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 2 ++
arch/x86/kernel/fpu/core.c | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 100b535cd3de..dc51809f0e4a 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -142,8 +142,10 @@ extern int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu);

#ifdef CONFIG_X86_64
extern void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd);
+extern void fpu_sync_guest_vmexit_xfd_state(void);
#else
static inline void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) { }
+static inline void fpu_sync_guest_vmexit_xfd_state(void) { }
#endif

extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 89d679cc819b..42a195a6b2ce 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -299,6 +299,30 @@ void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
fpregs_unlock();
}
EXPORT_SYMBOL_GPL(fpu_update_guest_xfd);
+
+/**
+ * fpu_sync_guest_vmexit_xfd_state - Synchronize XFD MSR and software state
+ *
+ * Must be invoked from KVM after a VMEXIT before enabling interrupts when
+ * XFD write emulation is disabled. This is required because the guest can
+ * freely modify XFD and the state at VMEXIT is not guaranteed to be the
+ * same as the state on VMENTER. So software state has to be udpated before
+ * any operation which depends on it can take place.
+ *
+ * Note: It can be invoked unconditionally even when write emulation is
+ * enabled for the price of a then pointless MSR read.
+ */
+void fpu_sync_guest_vmexit_xfd_state(void)
+{
+ struct fpstate *fps = current->thread.fpu.fpstate;
+
+ lockdep_assert_irqs_disabled();
+ if (fpu_state_size_dynamic()) {
+ rdmsrl(MSR_IA32_XFD, fps->xfd);
+ __this_cpu_write(xfd_state, fps->xfd);
+ }
+}
+EXPORT_SYMBOL_GPL(fpu_sync_guest_vmexit_xfd_state);
#endif /* CONFIG_X86_64 */

int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
--
2.27.0


2021-12-17 15:31:04

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 22/23] kvm: x86: Disable interception for IA32_XFD on demand

Always intercepting IA32_XFD causes non-negligible overhead when this
register is updated frequently in the guest.

Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
with a non-zero value.

Disable WRMSR emulation implies that IA32_XFD becomes out-of-sync
with the software states in fpstate and the per-cpu xfd cache. Call
fpu_sync_guest_vmexit_xfd_state() to bring them back in-sync, before
preemption is enabled.

p.s. We have confirmed that SDM is being revised to say that
when setting IA32_XFD[18] the AMX register state is not guaranteed
to be preserved. This clarification avoids adding mess for a creative
guest which sets IA32_XFD[18]=1 before saving active AMX state to
its own storage.

Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 10 ++++++++++
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 6 ++++++
5 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..60c27f9990e9 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -30,6 +30,7 @@ KVM_X86_OP(update_exception_bitmap)
KVM_X86_OP(get_msr)
KVM_X86_OP(set_msr)
KVM_X86_OP(get_segment_base)
+KVM_X86_OP_NULL(set_xfd_passthrough)
KVM_X86_OP(get_segment)
KVM_X86_OP(get_cpl)
KVM_X86_OP(set_segment)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 860ed500580c..b513c12fa39e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -640,6 +640,7 @@ struct kvm_vcpu_arch {
u64 smi_count;
bool tpr_access_reporting;
bool xsaves_enabled;
+ bool xfd_out_of_sync;
u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
@@ -1329,6 +1330,7 @@ struct kvm_x86_ops {
void (*update_exception_bitmap)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
+ void (*set_xfd_passthrough)(struct kvm_vcpu *vcpu);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
void (*get_segment)(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 483075045253..97a823a3f23f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -162,6 +162,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_FS_BASE,
MSR_GS_BASE,
MSR_KERNEL_GS_BASE,
+ MSR_IA32_XFD,
#endif
MSR_IA32_SYSENTER_CS,
MSR_IA32_SYSENTER_ESP,
@@ -1929,6 +1930,14 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
return debugctl;
}

+#ifdef CONFIG_X86_64
+static void vmx_set_xfd_passthrough(struct kvm_vcpu *vcpu)
+{
+ vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
+ vcpu->arch.xfd_out_of_sync = true;
+}
+#endif
+
/*
* Writes msr value into the appropriate "register".
* Returns 0 on success, non-0 otherwise.
@@ -7664,6 +7673,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
#ifdef CONFIG_X86_64
.set_hv_timer = vmx_set_hv_timer,
.cancel_hv_timer = vmx_cancel_hv_timer,
+ .set_xfd_passthrough = vmx_set_xfd_passthrough,
#endif

.setup_mce = vmx_setup_mce,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 4df2ac24ffc1..bf9d3051cd6c 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;

/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 13
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 14
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 796a9f2d1f23..13ba1e066b55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3686,6 +3686,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;

fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
+
+ if (data && kvm_x86_ops.set_xfd_passthrough)
+ static_call(kvm_x86_set_xfd_passthrough)(vcpu);
break;
case MSR_IA32_XFD_ERR:
if (!msr_info->host_initiated &&
@@ -10023,6 +10026,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_fpu.xfd_err)
wrmsrl(MSR_IA32_XFD_ERR, 0);

+ if (vcpu->arch.xfd_out_of_sync)
+ fpu_sync_guest_vmexit_xfd_state();
+
/*
* Consume any pending interrupts, including the possible source of
* VM-Exit on SVM and any ticks that occur between VM-Exit and now.
--
2.27.0


2021-12-17 15:31:08

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 16/23] kvm: x86: Add CPUID support for Intel AMX

Extend CPUID emulation to support XFD, AMX_TILE, AMX_INT8 and
AMX_BF16. Adding those bits into kvm_cpu_caps finally activates all
previous logics in this series.

Signed-off-by: Yang Zhong <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/kvm/cpuid.c | 25 +++++++++++++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d5b5f2ab87a0..da872b6f8d8b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -299,7 +299,9 @@
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
#define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */
+#define X86_FEATURE_AMX_BF16 (18*32+22) /* AMX bf16 Support */
#define X86_FEATURE_AMX_TILE (18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8 (18*32+25) /* AMX int8 Support */

/* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 13 */
#define X86_FEATURE_CLZERO (13*32+ 0) /* CLZERO instruction */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index eb5a5070accb..1d694ead374e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -515,7 +515,8 @@ void kvm_set_cpu_caps(void)
F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
- F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16)
+ F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
+ F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16)
);

/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -534,7 +535,7 @@ void kvm_set_cpu_caps(void)
);

kvm_cpu_cap_mask(CPUID_D_1_EAX,
- F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES)
+ F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) | F(XFD)
);

kvm_cpu_cap_init_scattered(CPUID_12_EAX,
@@ -660,6 +661,8 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
case 0x14:
case 0x17:
case 0x18:
+ case 0x1d:
+ case 0x1e:
case 0x1f:
case 0x8000001d:
entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
@@ -932,6 +935,24 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
goto out;
}
break;
+ /* Intel AMX TILE */
+ case 0x1d:
+ if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) {
+ if (!do_host_cpuid(array, function, i))
+ goto out;
+ }
+ break;
+ case 0x1e: /* TMUL information */
+ if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+ break;
case KVM_CPUID_SIGNATURE: {
const u32 *sigptr = (const u32 *)KVM_SIGNATURE;
entry->eax = KVM_CPUID_FEATURES;
--
2.27.0


2021-12-17 15:31:09

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 13/23] kvm: x86: Intercept #NM for saving IA32_XFD_ERR

Guest IA32_XFD_ERR is generally modified in two places:

- Set by CPU when #NM is triggered;
- Cleared by guest in its #NM handler;

Intercept #NM for the first case, if guest CPUID includes any dynamic
xfeature. #NM is rare if the guest doesn't use dynamic features.
Otherwise, there is at most one exception per guest task given a
dynamic feature.

Save the current XFD_ERR value to the guest_fpu container in the #NM
VM-exit handler. This must be done with interrupt/preemption disabled,
otherwise the unsaved MSR value may be clobbered by host operations.

Inject a virtual #NM to the guest after saving the MSR value.

Restore the host value (always ZERO outside of the host #NM
handler) before enabling preemption.

Restore the guest value from the guest_fpu container right before
entering the guest (with preemption disabled).

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
TODO: Investigate delaying #NM interception until guest sets a dynamic
feature in XCR0.

arch/x86/kvm/vmx/vmcs.h | 5 +++++
arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++++-
arch/x86/kvm/x86.c | 6 ++++++
3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 6e5de2e2b0da..c57798b56f95 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -129,6 +129,11 @@ static inline bool is_machine_check(u32 intr_info)
return is_exception_n(intr_info, MC_VECTOR);
}

+static inline bool is_nm(u32 intr_info)
+{
+ return is_exception_n(intr_info, NM_VECTOR);
+}
+
/* Undocumented: icebp/int1 */
static inline bool is_icebp(u32 intr_info)
{
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9453743ce0c4..483075045253 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -36,6 +36,7 @@
#include <asm/debugreg.h>
#include <asm/desc.h>
#include <asm/fpu/api.h>
+#include <asm/fpu/xstate.h>
#include <asm/idtentry.h>
#include <asm/io.h>
#include <asm/irq_remapping.h>
@@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match);
}

+ if (vcpu->arch.guest_supported_xcr0 & XFEATURE_MASK_USER_DYNAMIC)
+ eb |= (1u << NM_VECTOR);
+
vmcs_write32(EXCEPTION_BITMAP, eb);
}

@@ -4750,7 +4754,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
vect_info = vmx->idt_vectoring_info;
intr_info = vmx_get_intr_info(vcpu);

- if (is_machine_check(intr_info) || is_nmi(intr_info))
+ if (is_machine_check(intr_info) || is_nmi(intr_info) || is_nm(intr_info))
return 1; /* handled by handle_exception_nmi_irqoff() */

if (is_invalid_opcode(intr_info))
@@ -6338,6 +6342,12 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
kvm_after_interrupt(vcpu);
}

+static void handle_exception_nm(struct kvm_vcpu *vcpu)
+{
+ rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+ kvm_queue_exception(vcpu, NM_VECTOR);
+}
+
static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
{
const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
@@ -6346,6 +6356,9 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
/* if exit due to PF check for async PF */
if (is_page_fault(intr_info))
vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ /* if exit due to NM, handle before preemptions are enabled */
+ else if (is_nm(intr_info))
+ handle_exception_nm(&vmx->vcpu);
/* Handle machine checks before interrupts are enabled */
else if (is_machine_check(intr_info))
kvm_machine_check();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a274146ef439..e528085030b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9894,6 +9894,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return();

+ if (vcpu->arch.guest_fpu.xfd_err)
+ wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+
if (unlikely(vcpu->arch.switch_db_regs)) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);
@@ -9957,6 +9960,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

static_call(kvm_x86_handle_exit_irqoff)(vcpu);

+ if (vcpu->arch.guest_fpu.xfd_err)
+ wrmsrl(MSR_IA32_XFD_ERR, 0);
+
/*
* Consume any pending interrupts, including the possible source of
* VM-Exit on SVM and any ticks that occur between VM-Exit and now.
--
2.27.0


2021-12-17 15:31:11

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 14/23] kvm: x86: Emulate IA32_XFD_ERR for guest

Emulate read/write to IA32_XFD_ERR MSR.

Only the saved value in the guest_fpu container is touched in the
emulation handler. Actual MSR update is handled right before entering
the guest (with preemption disabled)

Signed-off-by: Zeng Guang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/x86.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e528085030b3..15b093c58b3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1358,7 +1358,7 @@ static const u32 msrs_to_save_all[] = {
MSR_F15H_PERF_CTL3, MSR_F15H_PERF_CTL4, MSR_F15H_PERF_CTL5,
MSR_F15H_PERF_CTR0, MSR_F15H_PERF_CTR1, MSR_F15H_PERF_CTR2,
MSR_F15H_PERF_CTR3, MSR_F15H_PERF_CTR4, MSR_F15H_PERF_CTR5,
- MSR_IA32_XFD,
+ MSR_IA32_XFD, MSR_IA32_XFD_ERR,
};

static u32 msrs_to_save[ARRAY_SIZE(msrs_to_save_all)];
@@ -3681,6 +3681,17 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
break;
+ case MSR_IA32_XFD_ERR:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ return 1;
+
+ if (data & ~(XFEATURE_MASK_USER_DYNAMIC &
+ vcpu->arch.guest_supported_xcr0))
+ return 1;
+
+ vcpu->arch.guest_fpu.xfd_err = data;
+ break;
#endif
default:
if (kvm_pmu_is_valid_msr(vcpu, msr))
@@ -4010,6 +4021,13 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

msr_info->data = vcpu->arch.guest_fpu.fpstate->xfd;
break;
+ case MSR_IA32_XFD_ERR:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ return 1;
+
+ msr_info->data = vcpu->arch.guest_fpu.xfd_err;
+ break;
#endif
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
@@ -6445,6 +6463,7 @@ static void kvm_init_msr_list(void)
continue;
break;
case MSR_IA32_XFD:
+ case MSR_IA32_XFD_ERR:
if (!kvm_cpu_cap_has(X86_FEATURE_XFD))
continue;
break;
--
2.27.0


2021-12-17 15:31:12

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 12/23] x86/fpu: Prepare xfd_err in struct fpu_guest

When XFD causes an instruction to generate #NM, IA32_XFD_ERR
contains information about which disabled state components are
being accessed. The #NM handler is expected to check this
information and then enable the state components by clearing
IA32_XFD for the faulting task (if having permission).

If the XFD_ERR value generated in guest is consumed/clobbered
by the host before the guest itself doing so. This may lead to
non-XFD-related #NM treated as XFD #NM in host (due to non-zero
value in XFD_ERR), or XFD-related #NM treated as non-XFD #NM in
guest (XFD_ERR cleared by the host #NM handler).

Introduce a new field in fpu_guest to save the guest xfd_err value.
KVM is expected to save guest xfd_err before preemption is enabled
and restore it right before entering the guest (with preemption
disabled).

Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c752d0aa23a4..3795d0573773 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -517,6 +517,11 @@ struct fpu_guest {
*/
u64 perm;

+ /*
+ * @xfd_err: Save the guest value.
+ */
+ u64 xfd_err;
+
/*
* @fpstate: Pointer to the allocated guest fpstate
*/
--
2.27.0


2021-12-17 15:31:15

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 05/23] kvm: x86: Check permitted dynamic xfeatures at KVM_SET_CPUID2

Guest xstate permissions should be set by userspace VMM before vcpu
creation. Extend KVM_SET_CPUID2 to verify that every feature reported
in CPUID[0xD] has proper permission set.

Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4855344091b8..a068373a7fbd 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -81,7 +81,9 @@ static inline struct kvm_cpuid_entry2 *cpuid_entry2_find(
return NULL;
}

-static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
+static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid_entry2 *entries,
+ int nent)
{
struct kvm_cpuid_entry2 *best;

@@ -97,6 +99,16 @@ static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
return -EINVAL;
}

+ /* Check guest permissions for dynamically-enabled xfeatures */
+ best = cpuid_entry2_find(entries, nent, 0xd, 0);
+ if (best) {
+ u64 xfeatures;
+
+ xfeatures = best->eax | ((u64)best->edx << 32);
+ if (xfeatures & ~vcpu->arch.guest_fpu.perm)
+ return -ENXIO;
+ }
+
return 0;
}

@@ -277,21 +289,21 @@ u64 kvm_vcpu_reserved_gpa_bits_raw(struct kvm_vcpu *vcpu)
static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
int nent)
{
- int r;
+ int r;

- r = kvm_check_cpuid(e2, nent);
- if (r)
- return r;
+ r = kvm_check_cpuid(vcpu, e2, nent);
+ if (r)
+ return r;

- kvfree(vcpu->arch.cpuid_entries);
- vcpu->arch.cpuid_entries = e2;
- vcpu->arch.cpuid_nent = nent;
+ kvfree(vcpu->arch.cpuid_entries);
+ vcpu->arch.cpuid_entries = e2;
+ vcpu->arch.cpuid_nent = nent;

- kvm_update_kvm_cpuid_base(vcpu);
- kvm_update_cpuid_runtime(vcpu);
- kvm_vcpu_after_set_cpuid(vcpu);
+ kvm_update_kvm_cpuid_base(vcpu);
+ kvm_update_cpuid_runtime(vcpu);
+ kvm_vcpu_after_set_cpuid(vcpu);

- return 0;
+ return 0;
}

/* when an old userspace process fills a new kernel module */
--
2.27.0


2021-12-17 15:31:18

by Liu, Jing2

[permalink] [raw]
Subject: [PATCH v2 08/23] x86/fpu: Provide fpu_update_guest_perm_features() for guest

From: Kevin Tian <[email protected]>

KVM can require fpstate expansion in two approaches:

1) Dynamic expansion when intercepting guest updates to XCR0 and
XFD MSR;

2) Static expansion e.g. at KVM_SET_CPUID2;

The first option doesn't waste memory for legacy guest if it doesn't
support XFD. However doing so introduces more complexity and also
imposes an order requirement in the restoring path, i.e. XCR0/XFD
must be restored before XSTATE.

Given that the agreement is to do the static approach. This is
considered a better tradeoff though it does waste 8K memory for
legacy guest if its CPUID includes XFD features.

Provide a wrapper to allow expanding the fpstate buffer to what
guest perm allows. Once completed, both emulation and restore path
don't need to worry about the buffer size.

Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 1 +
arch/x86/kernel/fpu/core.c | 27 +++++++++++++++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index d8c222290e68..8e934e571273 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -138,6 +138,7 @@ extern inline u64 xstate_get_guest_group_perm(void);
extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
+extern int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu);

extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a78bc547fc03..2560a95980aa 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -261,6 +261,33 @@ void fpu_free_guest_fpstate(struct fpu_guest *gfpu)
}
EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);

+/*
+ * fpu_update_guest_perm_features - Enable xfeatures according to guest perm
+ * @guest_fpu: Pointer to the guest FPU container
+ *
+ * Enable all dynamic xfeatures according to guest perm. Invoked if the
+ * caller wants to conservatively expand fpstate buffer instead of waiting
+ * until XCR0 or XFD MSR is written.
+ *
+ * Return: 0 on success, error code otherwise
+ */
+int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu)
+{
+ u64 expand;
+
+ lockdep_assert_preemption_enabled();
+
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return 0;
+
+ expand = guest_fpu->perm & ~guest_fpu->xfeatures;
+ if (!expand)
+ return 0;
+
+ return __xfd_enable_feature(expand, guest_fpu);
+}
+EXPORT_SYMBOL_GPL(fpu_update_guest_perm_features);
+
int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
{
struct fpstate *guest_fps = guest_fpu->fpstate;
--
2.27.0


2021-12-17 18:50:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 08/23] x86/fpu: Provide fpu_update_guest_perm_features() for guest

eviewed-by: Thomas Gleixner <[email protected]>
On Fri, Dec 17 2021 at 07:29, Jing Liu wrote:
> +/*
> + * fpu_update_guest_perm_features - Enable xfeatures according to guest perm
> + * @guest_fpu: Pointer to the guest FPU container
> + *
> + * Enable all dynamic xfeatures according to guest perm. Invoked if the
> + * caller wants to conservatively expand fpstate buffer instead of waiting
> + * until XCR0 or XFD MSR is written.
> + *
> + * Return: 0 on success, error code otherwise
> + */
> +int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu)
> +{
> + u64 expand;
> +
> + lockdep_assert_preemption_enabled();
> +
> + if (!IS_ENABLED(CONFIG_X86_64))
> + return 0;
> +
> + expand = guest_fpu->perm & ~guest_fpu->xfeatures;
> + if (!expand)
> + return 0;
> +
> + return __xfd_enable_feature(expand, guest_fpu);
> +}
> +EXPORT_SYMBOL_GPL(fpu_update_guest_perm_features);

Reviewed-by: Thomas Gleixner <[email protected]>

2021-12-17 18:52:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 00/23] AMX Support in KVM

On Fri, Dec 17 2021 at 07:29, Jing Liu wrote:
> This version is not intended for acceptance. But giving many moving
> pieces in v1 review it might be good to have an alignment on what
> we have improved now.
>
> It's highly appreciated if you can help take a quick look and let
> us know any new issue before your holiday. We will continue working
> on closing remaining TODOs and try to have a complete version ready
> when you are back. :)

From my side this looks good. I let Paolo comment on the KVM bits.

Thanks

tglx


2021-12-20 09:08:14

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 22/23] kvm: x86: Disable interception for IA32_XFD on demand

On 12/17/21 16:30, Jing Liu wrote:
> +++ b/arch/x86/kvm/x86.c
> @@ -3686,6 +3686,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> return 1;
>
> fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
> +
> + if (data && kvm_x86_ops.set_xfd_passthrough)
> + static_call(kvm_x86_set_xfd_passthrough)(vcpu);
> break;
> case MSR_IA32_XFD_ERR:


Please instead add a "case" to vmx_set_msr:

case MSR_IA32_XFD:
ret = kvm_set_msr_common(vcpu, msr_info);
if (!ret && data) {
vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
vcpu->arch.xfd_out_of_sync = true;
}
break;

Paolo

2021-12-21 06:42:21

by Liu, Jing2

[permalink] [raw]
Subject: RE: [PATCH v2 22/23] kvm: x86: Disable interception for IA32_XFD on demand

On 12/20/2021 5:07 PM, Paolo Bonzini wrote:
>
> On 12/17/21 16:30, Jing Liu wrote:
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3686,6 +3686,9 @@ int kvm_set_msr_common(struct kvm_vcpu
> *vcpu, struct msr_data *msr_info)
> > return 1;
> >
> > fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
> > +
> > + if (data && kvm_x86_ops.set_xfd_passthrough)
> > + static_call(kvm_x86_set_xfd_passthrough)(vcpu);
> > break;
> > case MSR_IA32_XFD_ERR:
>
>
> Please instead add a "case" to vmx_set_msr:

OK, it seems the passthrough setup is preferred in vmx.c.
Do we also want a case in vmx_get_msr (for patch 11), even though
no specific handling there?

Thanks,
Jing


>
> case MSR_IA32_XFD:
> ret = kvm_set_msr_common(vcpu, msr_info);
> if (!ret && data) {
> vmx_disable_intercept_for_msr(vcpu,
> MSR_IA32_XFD, MSR_TYPE_RW);
> vcpu->arch.xfd_out_of_sync = true;
> }
> break;
>
> Paolo

2021-12-21 08:46:55

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 22/23] kvm: x86: Disable interception for IA32_XFD on demand

On 12/21/21 07:42, Liu, Jing2 wrote:
>> Please instead add a "case" to vmx_set_msr:
> OK, it seems the passthrough setup is preferred in vmx.c.
> Do we also want a case in vmx_get_msr (for patch 11), even though
> no specific handling there?

No, thanks.

Paolo