Highly appreciate for your review. We will continue working on
remaining selftest and send out later.
TODO:
- kvm selftest for AMX is still in progress;
----
v2->v3:
- Trap #NM until write IA32_XFD with a non-zero value (Thomas)
- Revise return value in __xstate_request_perm() (Thomas)
- Revise doc for KVM_GET_SUPPORTED_CPUID (Paolo)
- Add Thomas's reviewed-by on one patch
- Reorder disabling read interception of XFD_ERR patch (Paolo)
- Move disabling r/w interception of XFD from x86.c to vmx.c (Paolo)
- Provide the API doc together with the new KVM_GET_XSAVE2 ioctl (Paolo)
- Make KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) return minimum size of struct
kvm_xsave (4K) (Paolo)
- Request permission at the start of vm_create_with_vcpus() in selftest
- Request permission conditionally when XFD is supported (Paolo)
v1->v2:
- Live migration supported and verified with a selftest
- Rebase to Thomas's new series for guest fpstate reallocation [1]
- Expand fpstate at KVM_SET_CPUID2 instead of when emulating XCR0
and IA32_XFD (Thomas/Paolo)
- Accordingly remove all exit-to-userspace stuff
- Intercept #NM to save guest XFD_ERR and restore host/guest value
at preemption on/off boundary (Thomas)
- Accordingly remove all xfd_err logic in preemption callback and
fpu_swap_kvm_fpstate()
- Reuse KVM_SET_XSAVE to handle both legacy and expanded buffer (Paolo)
- Don't return dynamic bits w/o prctl() in KVM_GET_SUPPORTED_CPUID (Paolo)
- Check guest permissions for dynamic features in CPUID[0xD] instead
of only for AMX at KVM_SET_CPUID (Paolo)
- Remove dynamic bit check for 32-bit guest in __kvm_set_xcr() (Paolo)
- Fix CPUID emulation for 0x1d and 0x1e (Paolo)
- Move "disable interception" to the end of the series (Paolo)
This series brings AMX (Advanced Matrix eXtensions) virtualization
support to KVM. The preparatory series from Thomas [1] is also included.
A large portion of the changes in this series is to deal with eXtended
Feature Disable (XFD) which allows resizing of the fpstate buffer to
support dynamically-enabled XSTATE features with large state component
(e.g. 8K for AMX).
There are a lot of simplications when comparing v2/v3 to the original
proposal [2] and the first version [3]. Thanks to Thomas and Paolo for
many good suggestions.
The support is based on following key changes:
- Guest permissions for dynamically-enabled XSAVE features
Native tasks have to request permission via prctl() before touching
a dynamic-resized XSTATE compoenent. Introduce guest permissions
for the similar purpose. Userspace VMM is expected to request guest
permission only once when the first vCPU is created.
KVM checks guest permission in KVM_SET_CPUID2. Setting XFD in guest
cpuid w/o proper permissions fails this operation. In the meantime,
unpermitted features are also excluded in KVM_GET_SUPPORTED_CPUID.
- Extend fpstate reallocation mechanism to cover guest fpu
Unlike native tasks which have reallocation triggered from #NM
handler, guest fpstate reallocation is requested by KVM when it
identifies the intention on using dynamically-enabled XSAVE
features inside guest.
Extend fpu core to allow KVM request fpstate buffer expansion
for a guest fpu containter.
- Trigger fpstate reallocation in KVM
This could be done either statically (before guest runs) or
dynamically (in the emulation path). According to discussion [1]
we decide to statically enable all xfeatures allowed by guest perm
in KVM_SET_CPUID2, with fpstate buffer sized accordingly. This spares
a lot of code and also avoid imposing an ordered restore sequence
(XCR0, XFD and XSTATE) to userspace VMM.
- RDMSR/WRMSR emulation for IA32_XFD
Because fpstate expansion is completed in KVM_SET_CPUID2, emulating
r/w access to IA32_XFD simply involves the xfd field in the guest
fpu container. If write and guest fpu is currently active, the
software state (guest_fpstate::xfd and per-cpu xfd cache) is also
updated.
- RDMSR/WRMSR emulation for XFD_ERR
When XFD causes an instruction to generate #NM, XFD_ERR contains
information about which disabled state components are being accessed.
It'd be problematic if the XFD_ERR value generated in guest is
consumed/clobbered by the host before the guest itself doing so.
Intercept #NM exception to save the guest XFD_ERR value when write
IA32_XFD with a non-zero value for 1st time. There is at most one
interception per guest task given a dynamic feature.
RDMSR/WRMSR emulation uses the saved value. The host value (always
ZERO outside of the host #NM handler) is restored before enabling
preemption. The saved guest value is restored right before entering
the guest (with preemption disabled).
- Get/set dynamic xfeature state for migration
Introduce new capability (KVM_CAP_XSAVE2) to deal with >4KB fpstate
buffer. Reading this capability returns the size of the current
guest fpstate (e.g. after expansion). Userspace VMM uses a new ioctl
(KVM_GET_XSAVE2) to read guest fpstate from the kernel and reuses
the existing ioctl (KVM_SET_XSAVE) to update guest fpsate to the
kernel. KVM_SET_XSAVE is extended to do properly_sized memdup_user()
based on the guest fpstate.
- Expose related cpuid bits to guest
The last step is to allow exposing XFD, AMX_TILE, AMX_INT8 and
AMX_BF16 in guest cpuid. Adding those bits into kvm_cpu_caps finally
activates all previous logics in this series
- Optimization: disable interception for IA32_XFD
IA32_XFD can be frequently updated by the guest, as it is part of
the task state and swapped in context switch when prev and next have
different XFD setting. Always intercepting WRMSR can easily cause
non-negligible overhead.
Disable r/w emulation for IA32_XFD after intercepting the first
WRMSR(IA32_XFD) with a non-zero value. However MSR passthrough
implies the software state (guest_fpstate::xfd and per-cpu xfd
cache) might be out of sync with MSR. This suggests KVM needs to
re-sync them at VM-exit before preemption is enabled.
To verify AMX virtualization overhead on non-AMX usages, we run the
Phoronix kernel build test in the guest w/ and w/o AMX in cpuid. The
result shows no observable difference between two configurations.
Thanks Jun Nakajima and Kevin Tian for the design suggestions when
this version is being internally worked on.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://www.spinics.net/lists/kvm/msg259015.html
[3] https://lore.kernel.org/lkml/[email protected]/
Thanks,
Jing
---
Guang Zeng (1):
kvm: x86: Get/set expanded xstate buffer
Jing Liu (13):
kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule
kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID
kvm: x86: Check permitted dynamic xfeatures at KVM_SET_CPUID2
x86/fpu: Make XFD initialization in __fpstate_reset() a function
argument
kvm: x86: Enable dynamic XSAVE features at KVM_SET_CPUID2
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Add CPUID support for Intel AMX
kvm: x86: Disable interception for IA32_XFD on demand
Kevin Tian (2):
x86/fpu: Provide fpu_update_guest_perm_features() for guest
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
Thomas Gleixner (5):
x86/fpu: Extend fpu_xstate_prctl() with guest permissions
x86/fpu: Prepare guest FPU for dynamically enabled FPU features
x86/fpu: Add guest support to xfd_enable_feature()
x86/fpu: Add uabi_size to guest_fpu
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
Wei Wang (1):
kvm: selftests: Add support for KVM_CAP_XSAVE2
Documentation/virt/kvm/api.rst | 46 +++++-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/fpu/api.h | 11 ++
arch/x86/include/asm/fpu/types.h | 32 ++++
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/uapi/asm/kvm.h | 16 +-
arch/x86/include/uapi/asm/prctl.h | 26 ++--
arch/x86/kernel/fpu/core.c | 104 ++++++++++++-
arch/x86/kernel/fpu/xstate.c | 147 +++++++++++-------
arch/x86/kernel/fpu/xstate.h | 15 +-
arch/x86/kernel/process.c | 2 +
arch/x86/kvm/cpuid.c | 93 ++++++++---
arch/x86/kvm/vmx/vmcs.h | 5 +
arch/x86/kvm/vmx/vmx.c | 32 +++-
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 102 +++++++++++-
include/uapi/linux/kvm.h | 4 +
tools/arch/x86/include/uapi/asm/kvm.h | 16 +-
tools/include/uapi/linux/kvm.h | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 2 +
.../selftests/kvm/include/x86_64/processor.h | 10 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 32 ++++
.../selftests/kvm/lib/x86_64/processor.c | 67 +++++++-
.../testing/selftests/kvm/x86_64/evmcs_test.c | 2 +-
tools/testing/selftests/kvm/x86_64/smm_test.c | 2 +-
.../testing/selftests/kvm/x86_64/state_test.c | 2 +-
.../kvm/x86_64/vmx_preemption_timer_test.c | 2 +-
27 files changed, 668 insertions(+), 111 deletions(-)
--
2.27.0
From: Thomas Gleixner <[email protected]>
KVM requires a clear separation of host user space and guest permissions
for dynamic XSTATE components.
Add a guest permissions member to struct fpu and a separate set of prctl()
arguments: ARCH_GET_XCOMP_GUEST_PERM and ARCH_REQ_XCOMP_GUEST_PERM.
The semantics are equivalent to the host user space permission control
except for the following constraints:
1) Permissions have to be requested before the first vCPU is created
2) Permissions are frozen when the first vCPU is created to ensure
consistency. Any attempt to expand permissions via the prctl() after
that point is rejected.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 2 ++
arch/x86/include/asm/fpu/types.h | 9 ++++++
arch/x86/include/uapi/asm/prctl.h | 26 ++++++++-------
arch/x86/kernel/fpu/core.c | 3 ++
arch/x86/kernel/fpu/xstate.c | 53 +++++++++++++++++++++++--------
arch/x86/kernel/fpu/xstate.h | 13 ++++++--
arch/x86/kernel/process.c | 2 ++
7 files changed, 80 insertions(+), 28 deletions(-)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index c2767a6a387e..d8c222290e68 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -132,6 +132,8 @@ static inline void fpstate_free(struct fpu *fpu) { }
/* fpstate-related functions which are exported to KVM */
extern void fpstate_clear_xstate_component(struct fpstate *fps, unsigned int xfeature);
+extern inline u64 xstate_get_guest_group_perm(void);
+
/* KVM specific functions */
extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 3c06c82ab355..6ddf80637697 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -387,6 +387,8 @@ struct fpstate {
/* @regs is dynamically sized! Don't add anything after @regs! */
} __aligned(64);
+#define FPU_GUEST_PERM_LOCKED BIT_ULL(63)
+
struct fpu_state_perm {
/*
* @__state_perm:
@@ -476,6 +478,13 @@ struct fpu {
*/
struct fpu_state_perm perm;
+ /*
+ * @guest_perm:
+ *
+ * Permission related information for guest pseudo FPUs
+ */
+ struct fpu_state_perm guest_perm;
+
/*
* @__fpstate:
*
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 754a07856817..500b96e71f18 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -2,20 +2,22 @@
#ifndef _ASM_X86_PRCTL_H
#define _ASM_X86_PRCTL_H
-#define ARCH_SET_GS 0x1001
-#define ARCH_SET_FS 0x1002
-#define ARCH_GET_FS 0x1003
-#define ARCH_GET_GS 0x1004
+#define ARCH_SET_GS 0x1001
+#define ARCH_SET_FS 0x1002
+#define ARCH_GET_FS 0x1003
+#define ARCH_GET_GS 0x1004
-#define ARCH_GET_CPUID 0x1011
-#define ARCH_SET_CPUID 0x1012
+#define ARCH_GET_CPUID 0x1011
+#define ARCH_SET_CPUID 0x1012
-#define ARCH_GET_XCOMP_SUPP 0x1021
-#define ARCH_GET_XCOMP_PERM 0x1022
-#define ARCH_REQ_XCOMP_PERM 0x1023
+#define ARCH_GET_XCOMP_SUPP 0x1021
+#define ARCH_GET_XCOMP_PERM 0x1022
+#define ARCH_REQ_XCOMP_PERM 0x1023
+#define ARCH_GET_XCOMP_GUEST_PERM 0x1024
+#define ARCH_REQ_XCOMP_GUEST_PERM 0x1025
-#define ARCH_MAP_VDSO_X32 0x2001
-#define ARCH_MAP_VDSO_32 0x2002
-#define ARCH_MAP_VDSO_64 0x2003
+#define ARCH_MAP_VDSO_X32 0x2001
+#define ARCH_MAP_VDSO_32 0x2002
+#define ARCH_MAP_VDSO_64 0x2003
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8ea306b1bf8e..ab19b3d8b2f7 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -450,6 +450,8 @@ void fpstate_reset(struct fpu *fpu)
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
fpu->perm.__state_size = fpu_kernel_cfg.default_size;
fpu->perm.__user_state_size = fpu_user_cfg.default_size;
+ /* Same defaults for guests */
+ fpu->guest_perm = fpu->perm;
}
static inline void fpu_inherit_perms(struct fpu *dst_fpu)
@@ -460,6 +462,7 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
spin_lock_irq(¤t->sighand->siglock);
/* Fork also inherits the permissions of the parent */
dst_fpu->perm = src_fpu->perm;
+ dst_fpu->guest_perm = src_fpu->guest_perm;
spin_unlock_irq(¤t->sighand->siglock);
}
}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed0..5f01d463859d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1595,7 +1595,7 @@ static int validate_sigaltstack(unsigned int usize)
return 0;
}
-static int __xstate_request_perm(u64 permitted, u64 requested)
+static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
{
/*
* This deliberately does not exclude !XSAVES as we still might
@@ -1605,9 +1605,10 @@ static int __xstate_request_perm(u64 permitted, u64 requested)
*/
bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
struct fpu *fpu = ¤t->group_leader->thread.fpu;
+ struct fpu_state_perm *perm;
unsigned int ksize, usize;
u64 mask;
- int ret;
+ int ret = 0;
/* Check whether fully enabled */
if ((permitted & requested) == requested)
@@ -1621,15 +1622,18 @@ static int __xstate_request_perm(u64 permitted, u64 requested)
mask &= XFEATURE_MASK_USER_SUPPORTED;
usize = xstate_calculate_size(mask, false);
- ret = validate_sigaltstack(usize);
- if (ret)
- return ret;
+ if (!guest) {
+ ret = validate_sigaltstack(usize);
+ if (ret)
+ return ret;
+ }
+ perm = guest ? &fpu->guest_perm : &fpu->perm;
/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
- WRITE_ONCE(fpu->perm.__state_perm, requested);
+ WRITE_ONCE(perm->__state_perm, requested);
/* Protected by sighand lock */
- fpu->perm.__state_size = ksize;
- fpu->perm.__user_state_size = usize;
+ perm->__state_size = ksize;
+ perm->__user_state_size = usize;
return ret;
}
@@ -1640,7 +1644,7 @@ static const u64 xstate_prctl_req[XFEATURE_MAX] = {
[XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA,
};
-static int xstate_request_perm(unsigned long idx)
+static int xstate_request_perm(unsigned long idx, bool guest)
{
u64 permitted, requested;
int ret;
@@ -1661,14 +1665,19 @@ static int xstate_request_perm(unsigned long idx)
return -EOPNOTSUPP;
/* Lockless quick check */
- permitted = xstate_get_host_group_perm();
+ permitted = xstate_get_group_perm(guest);
if ((permitted & requested) == requested)
return 0;
/* Protect against concurrent modifications */
spin_lock_irq(¤t->sighand->siglock);
- permitted = xstate_get_host_group_perm();
- ret = __xstate_request_perm(permitted, requested);
+ permitted = xstate_get_group_perm(guest);
+
+ /* First vCPU allocation locks the permissions. */
+ if (guest && (permitted & FPU_GUEST_PERM_LOCKED))
+ ret = -EBUSY;
+ else
+ ret = __xstate_request_perm(permitted, requested, guest);
spin_unlock_irq(¤t->sighand->siglock);
return ret;
}
@@ -1713,12 +1722,18 @@ int xfd_enable_feature(u64 xfd_err)
return 0;
}
#else /* CONFIG_X86_64 */
-static inline int xstate_request_perm(unsigned long idx)
+static inline int xstate_request_perm(unsigned long idx, bool guest)
{
return -EPERM;
}
#endif /* !CONFIG_X86_64 */
+inline u64 xstate_get_guest_group_perm(void)
+{
+ return xstate_get_group_perm(true);
+}
+EXPORT_SYMBOL_GPL(xstate_get_guest_group_perm);
+
/**
* fpu_xstate_prctl - xstate permission operations
* @tsk: Redundant pointer to current
@@ -1742,6 +1757,7 @@ long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
u64 __user *uptr = (u64 __user *)arg2;
u64 permitted, supported;
unsigned long idx = arg2;
+ bool guest = false;
if (tsk != current)
return -EPERM;
@@ -1760,11 +1776,20 @@ long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);
+ case ARCH_GET_XCOMP_GUEST_PERM:
+ permitted = xstate_get_guest_group_perm();
+ permitted &= XFEATURE_MASK_USER_SUPPORTED;
+ return put_user(permitted, uptr);
+
+ case ARCH_REQ_XCOMP_GUEST_PERM:
+ guest = true;
+ fallthrough;
+
case ARCH_REQ_XCOMP_PERM:
if (!IS_ENABLED(CONFIG_X86_64))
return -EOPNOTSUPP;
- return xstate_request_perm(idx);
+ return xstate_request_perm(idx, guest);
default:
return -EINVAL;
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 86ea7c0fa2f6..98a472775c97 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -20,10 +20,19 @@ static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
}
-static inline u64 xstate_get_host_group_perm(void)
+static inline u64 xstate_get_group_perm(bool guest)
{
+ struct fpu *fpu = ¤t->group_leader->thread.fpu;
+ struct fpu_state_perm *perm;
+
/* Pairs with WRITE_ONCE() in xstate_request_perm() */
- return READ_ONCE(current->group_leader->thread.fpu.perm.__state_perm);
+ perm = guest ? &fpu->guest_perm : &fpu->perm;
+ return READ_ONCE(perm->__state_perm);
+}
+
+static inline u64 xstate_get_host_group_perm(void)
+{
+ return xstate_get_group_perm(false);
}
enum xstate_copy_mode {
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 04143a653a8a..d7bc23589062 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -993,6 +993,8 @@ long do_arch_prctl_common(struct task_struct *task, int option,
case ARCH_GET_XCOMP_SUPP:
case ARCH_GET_XCOMP_PERM:
case ARCH_REQ_XCOMP_PERM:
+ case ARCH_GET_XCOMP_GUEST_PERM:
+ case ARCH_REQ_XCOMP_GUEST_PERM:
return fpu_xstate_prctl(task, option, arg2);
}
--
2.27.0
CPUID.0xD.1.EBX enumerates the size of the XSAVE area (in compacted
format) required by XSAVES. If CPUID.0xD.i.ECX[1] is set for a state
component (i), this state component should be located on the next
64-bytes boundary following the preceding state component in the
compacted layout.
Fix xstate_required_size() to follow the alignment rule. AMX is the
first state component with 64-bytes alignment to catch this bug.
Signed-off-by: Yang Zhong <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 07e9215e911d..148003e26cbb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -42,7 +42,8 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted)
if (xstate_bv & 0x1) {
u32 eax, ebx, ecx, edx, offset;
cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
- offset = compacted ? ret : ebx;
+ /* ECX[1]: 64B alignment in compacted form */
+ offset = compacted ? ((ecx & 0x2) ? ALIGN(ret, 64) : ret) : ebx;
ret = max(ret, offset + eax);
}
--
2.27.0
Guest xstate permissions should be set by userspace VMM before vcpu
creation. Extend KVM_SET_CPUID2 to verify that every feature reported
in CPUID[0xD] has proper permission set.
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4855344091b8..a068373a7fbd 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -81,7 +81,9 @@ static inline struct kvm_cpuid_entry2 *cpuid_entry2_find(
return NULL;
}
-static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
+static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid_entry2 *entries,
+ int nent)
{
struct kvm_cpuid_entry2 *best;
@@ -97,6 +99,16 @@ static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
return -EINVAL;
}
+ /* Check guest permissions for dynamically-enabled xfeatures */
+ best = cpuid_entry2_find(entries, nent, 0xd, 0);
+ if (best) {
+ u64 xfeatures;
+
+ xfeatures = best->eax | ((u64)best->edx << 32);
+ if (xfeatures & ~vcpu->arch.guest_fpu.perm)
+ return -ENXIO;
+ }
+
return 0;
}
@@ -277,21 +289,21 @@ u64 kvm_vcpu_reserved_gpa_bits_raw(struct kvm_vcpu *vcpu)
static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
int nent)
{
- int r;
+ int r;
- r = kvm_check_cpuid(e2, nent);
- if (r)
- return r;
+ r = kvm_check_cpuid(vcpu, e2, nent);
+ if (r)
+ return r;
- kvfree(vcpu->arch.cpuid_entries);
- vcpu->arch.cpuid_entries = e2;
- vcpu->arch.cpuid_nent = nent;
+ kvfree(vcpu->arch.cpuid_entries);
+ vcpu->arch.cpuid_entries = e2;
+ vcpu->arch.cpuid_nent = nent;
- kvm_update_kvm_cpuid_base(vcpu);
- kvm_update_cpuid_runtime(vcpu);
- kvm_vcpu_after_set_cpuid(vcpu);
+ kvm_update_kvm_cpuid_base(vcpu);
+ kvm_update_cpuid_runtime(vcpu);
+ kvm_vcpu_after_set_cpuid(vcpu);
- return 0;
+ return 0;
}
/* when an old userspace process fills a new kernel module */
--
2.27.0
KVM_GET_SUPPORTED_CPUID should not include any dynamic xstates in
CPUID[0xD] if they have not been requested with prctl. Otherwise
a process which directly passes KVM_GET_SUPPORTED_CPUID to
KVM_SET_CPUID2 would now fail even if it doesn't intend to use a
dynamically enabled feature. Userspace must know that prctl is
required and allocate >4K xstate buffer before setting any dynamic
bit.
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
Documentation/virt/kvm/api.rst | 4 ++++
arch/x86/kvm/cpuid.c | 9 ++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aeeb071c7688..1cf2483246cd 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1684,6 +1684,10 @@ userspace capabilities, and with user requirements (for example, the
user may wish to constrain cpuid to emulate older hardware, or for
feature consistency across a cluster).
+Dynamically-enabled feature bits need to be requested with
+``arch_prctl()`` before calling this ioctl. Feature bits that have not
+been requested are excluded from the result.
+
Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
expose cpuid features (e.g. MONITOR) which are not supported by kvm in
its default configuration. If userspace enables such capabilities, it
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 148003e26cbb..4855344091b8 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -812,11 +812,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
goto out;
}
break;
- case 0xd:
- entry->eax &= supported_xcr0;
+ case 0xd: {
+ u64 guest_perm = xstate_get_guest_group_perm();
+
+ entry->eax &= supported_xcr0 & guest_perm;
entry->ebx = xstate_required_size(supported_xcr0, false);
entry->ecx = entry->ebx;
- entry->edx &= supported_xcr0 >> 32;
+ entry->edx &= (supported_xcr0 & guest_perm) >> 32;
if (!supported_xcr0)
break;
@@ -863,6 +865,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->edx = 0;
}
break;
+ }
case 0x12:
/* Intel SGX */
if (!kvm_cpu_cap_has(X86_FEATURE_SGX)) {
--
2.27.0
vCPU threads are different from native tasks regarding to the initial XFD
value. While all native tasks follow a fixed value (init_fpstate::xfd)
established by the FPU core at boot, vCPU threads need to obey the reset
value (i.e. ZERO) defined by the specification, to meet the expectation of
the guest.
Let the caller supply an argument and adjust the host and guest related
invocations accordingly.
Signed-off-by: Yang Zhong <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kernel/fpu/core.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index eddeeb4ed2f5..a78bc547fc03 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -199,7 +199,7 @@ void fpu_reset_from_exception_fixup(void)
}
#if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate);
+static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
{
@@ -231,7 +231,8 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
if (!fpstate)
return false;
- __fpstate_reset(fpstate);
+ /* Leave xfd to 0 (the reset value defined by spec) */
+ __fpstate_reset(fpstate, 0);
fpstate_init_user(fpstate);
fpstate->is_valloc = true;
fpstate->is_guest = true;
@@ -454,21 +455,21 @@ void fpstate_init_user(struct fpstate *fpstate)
fpstate_init_fstate(fpstate);
}
-static void __fpstate_reset(struct fpstate *fpstate)
+static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
{
/* Initialize sizes and feature masks */
fpstate->size = fpu_kernel_cfg.default_size;
fpstate->user_size = fpu_user_cfg.default_size;
fpstate->xfeatures = fpu_kernel_cfg.default_features;
fpstate->user_xfeatures = fpu_user_cfg.default_features;
- fpstate->xfd = init_fpstate.xfd;
+ fpstate->xfd = xfd;
}
void fpstate_reset(struct fpu *fpu)
{
/* Set the fpstate pointer to the default fpstate */
fpu->fpstate = &fpu->__fpstate;
- __fpstate_reset(fpu->fpstate);
+ __fpstate_reset(fpu->fpstate, init_fpstate.xfd);
/* Initialize the permission related info in fpu */
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
--
2.27.0
From: Thomas Gleixner <[email protected]>
To support dynamically enabled FPU features for guests prepare the guest
pseudo FPU container to keep track of the currently enabled xfeatures and
the guest permissions.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 13 +++++++++++++
arch/x86/kernel/fpu/core.c | 26 +++++++++++++++++++++++++-
2 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 6ddf80637697..c752d0aa23a4 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -504,6 +504,19 @@ struct fpu {
* Guest pseudo FPU container
*/
struct fpu_guest {
+ /*
+ * @xfeatures: xfeature bitmap of features which are
+ * currently enabled for the guest vCPU.
+ */
+ u64 xfeatures;
+
+ /*
+ * @perm: xfeature bitmap of features which are
+ * permitted to be enabled for the guest
+ * vCPU.
+ */
+ u64 perm;
+
/*
* @fpstate: Pointer to the allocated guest fpstate
*/
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index ab19b3d8b2f7..eddeeb4ed2f5 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -201,6 +201,26 @@ void fpu_reset_from_exception_fixup(void)
#if IS_ENABLED(CONFIG_KVM)
static void __fpstate_reset(struct fpstate *fpstate);
+static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
+{
+ struct fpu_state_perm *fpuperm;
+ u64 perm;
+
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return;
+
+ spin_lock_irq(¤t->sighand->siglock);
+ fpuperm = ¤t->group_leader->thread.fpu.guest_perm;
+ perm = fpuperm->__state_perm;
+
+ /* First fpstate allocation locks down permissions. */
+ WRITE_ONCE(fpuperm->__state_perm, perm | FPU_GUEST_PERM_LOCKED);
+
+ spin_unlock_irq(¤t->sighand->siglock);
+
+ gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
+}
+
bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
{
struct fpstate *fpstate;
@@ -216,7 +236,11 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
fpstate->is_valloc = true;
fpstate->is_guest = true;
- gfpu->fpstate = fpstate;
+ gfpu->fpstate = fpstate;
+ gfpu->xfeatures = fpu_user_cfg.default_features;
+ gfpu->perm = fpu_user_cfg.default_features;
+ fpu_init_guest_permissions(gfpu);
+
return true;
}
EXPORT_SYMBOL_GPL(fpu_alloc_guest_fpstate);
--
2.27.0
From: Thomas Gleixner <[email protected]>
Guest support for dynamically enabled FPU features requires a few
modifications to the enablement function which is currently invoked from
the #NM handler:
1) Use guest permissions and sizes for the update
2) Update fpu_guest state accordingly
3) Take into account that the enabling can be triggered either from a
running guest via XSETBV and MSR_IA32_XFD write emulation or from
a guest restore. In the latter case the guests fpstate is not the
current tasks active fpstate.
Split the function and implement the guest mechanics throughout the
callchain.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 93 +++++++++++++++++++++---------------
arch/x86/kernel/fpu/xstate.h | 2 +
2 files changed, 56 insertions(+), 39 deletions(-)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5f01d463859d..0c0b2323cdec 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1499,29 +1499,6 @@ void fpstate_free(struct fpu *fpu)
vfree(fpu->fpstate);
}
-/**
- * fpu_install_fpstate - Update the active fpstate in the FPU
- *
- * @fpu: A struct fpu * pointer
- * @newfps: A struct fpstate * pointer
- *
- * Returns: A null pointer if the last active fpstate is the embedded
- * one or the new fpstate is already installed;
- * otherwise, a pointer to the old fpstate which has to
- * be freed by the caller.
- */
-static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
- struct fpstate *newfps)
-{
- struct fpstate *oldfps = fpu->fpstate;
-
- if (fpu->fpstate == newfps)
- return NULL;
-
- fpu->fpstate = newfps;
- return oldfps != &fpu->__fpstate ? oldfps : NULL;
-}
-
/**
* fpstate_realloc - Reallocate struct fpstate for the requested new features
*
@@ -1529,6 +1506,7 @@ static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
* of that task
* @ksize: The required size for the kernel buffer
* @usize: The required size for user space buffers
+ * @guest_fpu: Pointer to a guest FPU container. NULL for host allocations
*
* Note vs. vmalloc(): If the task with a vzalloc()-allocated buffer
* terminates quickly, vfree()-induced IPIs may be a concern, but tasks
@@ -1537,13 +1515,13 @@ static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
* Returns: 0 on success, -ENOMEM on allocation error.
*/
static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
- unsigned int usize)
+ unsigned int usize, struct fpu_guest *guest_fpu)
{
struct fpu *fpu = ¤t->thread.fpu;
struct fpstate *curfps, *newfps = NULL;
unsigned int fpsize;
+ bool in_use;
- curfps = fpu->fpstate;
fpsize = ksize + ALIGN(offsetof(struct fpstate, regs), 64);
newfps = vzalloc(fpsize);
@@ -1553,28 +1531,55 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
newfps->user_size = usize;
newfps->is_valloc = true;
+ /*
+ * When a guest FPU is supplied, use @guest_fpu->fpstate
+ * as reference independent whether it is in use or not.
+ */
+ curfps = guest_fpu ? guest_fpu->fpstate : fpu->fpstate;
+
+ /* Determine whether @curfps is the active fpstate */
+ in_use = fpu->fpstate == curfps;
+
+ if (guest_fpu) {
+ newfps->is_guest = true;
+ newfps->is_confidential = curfps->is_confidential;
+ newfps->in_use = curfps->in_use;
+ guest_fpu->xfeatures |= xfeatures;
+ }
+
fpregs_lock();
/*
- * Ensure that the current state is in the registers before
- * swapping fpstate as that might invalidate it due to layout
- * changes.
+ * If @curfps is in use, ensure that the current state is in the
+ * registers before swapping fpstate as that might invalidate it
+ * due to layout changes.
*/
- if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ if (in_use && test_thread_flag(TIF_NEED_FPU_LOAD))
fpregs_restore_userregs();
newfps->xfeatures = curfps->xfeatures | xfeatures;
newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
newfps->xfd = curfps->xfd & ~xfeatures;
- curfps = fpu_install_fpstate(fpu, newfps);
-
/* Do the final updates within the locked region */
xstate_init_xcomp_bv(&newfps->regs.xsave, newfps->xfeatures);
- xfd_update_state(newfps);
+ if (guest_fpu) {
+ guest_fpu->fpstate = newfps;
+ /* If curfps is active, update the FPU fpstate pointer */
+ if (in_use)
+ fpu->fpstate = newfps;
+ } else {
+ fpu->fpstate = newfps;
+ }
+
+ if (in_use)
+ xfd_update_state(fpu->fpstate);
fpregs_unlock();
- vfree(curfps);
+ /* Only free valloc'ed state */
+ if (curfps && curfps->is_valloc)
+ vfree(curfps);
+
return 0;
}
@@ -1682,14 +1687,16 @@ static int xstate_request_perm(unsigned long idx, bool guest)
return ret;
}
-int xfd_enable_feature(u64 xfd_err)
+int __xfd_enable_feature(u64 xfd_err, struct fpu_guest *guest_fpu)
{
u64 xfd_event = xfd_err & XFEATURE_MASK_USER_DYNAMIC;
+ struct fpu_state_perm *perm;
unsigned int ksize, usize;
struct fpu *fpu;
if (!xfd_event) {
- pr_err_once("XFD: Invalid xfd error: %016llx\n", xfd_err);
+ if (!guest_fpu)
+ pr_err_once("XFD: Invalid xfd error: %016llx\n", xfd_err);
return 0;
}
@@ -1697,14 +1704,16 @@ int xfd_enable_feature(u64 xfd_err)
spin_lock_irq(¤t->sighand->siglock);
/* If not permitted let it die */
- if ((xstate_get_host_group_perm() & xfd_event) != xfd_event) {
+ if ((xstate_get_group_perm(!!guest_fpu) & xfd_event) != xfd_event) {
spin_unlock_irq(¤t->sighand->siglock);
return -EPERM;
}
fpu = ¤t->group_leader->thread.fpu;
- ksize = fpu->perm.__state_size;
- usize = fpu->perm.__user_state_size;
+ perm = guest_fpu ? &fpu->guest_perm : &fpu->perm;
+ ksize = perm->__state_size;
+ usize = perm->__user_state_size;
+
/*
* The feature is permitted. State size is sufficient. Dropping
* the lock is safe here even if more features are added from
@@ -1717,10 +1726,16 @@ int xfd_enable_feature(u64 xfd_err)
* Try to allocate a new fpstate. If that fails there is no way
* out.
*/
- if (fpstate_realloc(xfd_event, ksize, usize))
+ if (fpstate_realloc(xfd_event, ksize, usize, guest_fpu))
return -EFAULT;
return 0;
}
+
+int xfd_enable_feature(u64 xfd_err)
+{
+ return __xfd_enable_feature(xfd_err, NULL);
+}
+
#else /* CONFIG_X86_64 */
static inline int xstate_request_perm(unsigned long idx, bool guest)
{
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 98a472775c97..3e63f723525b 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -55,6 +55,8 @@ extern void fpu__init_system_xstate(unsigned int legacy_size);
extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+extern int __xfd_enable_feature(u64 which, struct fpu_guest *guest_fpu);
+
static inline u64 xfeatures_mask_supervisor(void)
{
return fpu_kernel_cfg.max_features & XFEATURE_MASK_SUPERVISOR_SUPPORTED;
--
2.27.0
When XFD causes an instruction to generate #NM, IA32_XFD_ERR
contains information about which disabled state components are
being accessed. The #NM handler is expected to check this
information and then enable the state components by clearing
IA32_XFD for the faulting task (if having permission).
If the XFD_ERR value generated in guest is consumed/clobbered
by the host before the guest itself doing so. This may lead to
non-XFD-related #NM treated as XFD #NM in host (due to non-zero
value in XFD_ERR), or XFD-related #NM treated as non-XFD #NM in
guest (XFD_ERR cleared by the host #NM handler).
Introduce a new field in fpu_guest to save the guest xfd_err value.
KVM is expected to save guest xfd_err before preemption is enabled
and restore it right before entering the guest (with preemption
disabled).
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c752d0aa23a4..3795d0573773 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -517,6 +517,11 @@ struct fpu_guest {
*/
u64 perm;
+ /*
+ * @xfd_err: Save the guest value.
+ */
+ u64 xfd_err;
+
/*
* @fpstate: Pointer to the allocated guest fpstate
*/
--
2.27.0
Guest IA32_XFD_ERR is generally modified in two places:
- Set by CPU when #NM is triggered;
- Cleared by guest in its #NM handler;
Intercept #NM for the first case, if guest writes XFD as nonzero for
the first time which indicates guest is possible to use XFD generating
the exception. #NM is rare if the guest doesn't use dynamic features.
Otherwise, there is at most one exception per guest task given a
dynamic feature.
Save the current XFD_ERR value to the guest_fpu container in the #NM
VM-exit handler. This must be done with interrupt/preemption disabled,
otherwise the unsaved MSR value may be clobbered by host operations.
Inject a virtual #NM to the guest after saving the MSR value.
Restore the host value (always ZERO outside of the host #NM
handler) before enabling preemption.
Restore the guest value from the guest_fpu container right before
entering the guest (with preemption disabled).
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/vmcs.h | 5 +++++
arch/x86/kvm/vmx/vmx.c | 22 +++++++++++++++++++++-
arch/x86/kvm/x86.c | 6 ++++++
4 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 555f4de47ef2..f7a661f35d1a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -640,6 +640,7 @@ struct kvm_vcpu_arch {
u64 smi_count;
bool tpr_access_reporting;
bool xsaves_enabled;
+ bool trap_nm;
u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 6e5de2e2b0da..c57798b56f95 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -129,6 +129,11 @@ static inline bool is_machine_check(u32 intr_info)
return is_exception_n(intr_info, MC_VECTOR);
}
+static inline bool is_nm(u32 intr_info)
+{
+ return is_exception_n(intr_info, NM_VECTOR);
+}
+
/* Undocumented: icebp/int1 */
static inline bool is_icebp(u32 intr_info)
{
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5974a88c9d35..57d256fbf34b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -36,6 +36,7 @@
#include <asm/debugreg.h>
#include <asm/desc.h>
#include <asm/fpu/api.h>
+#include <asm/fpu/xstate.h>
#include <asm/idtentry.h>
#include <asm/io.h>
#include <asm/irq_remapping.h>
@@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match);
}
+ if (vcpu->arch.trap_nm)
+ eb |= (1u << NM_VECTOR);
+
vmcs_write32(EXCEPTION_BITMAP, eb);
}
@@ -1960,6 +1964,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_KERNEL_GS_BASE:
vmx_write_guest_kernel_gs_base(vmx, data);
break;
+ case MSR_IA32_XFD:
+ ret = kvm_set_msr_common(vcpu, msr_info);
+ if (!ret && data) {
+ vcpu->arch.trap_nm = true;
+ vmx_update_exception_bitmap(vcpu);
+ }
+ break;
#endif
case MSR_IA32_SYSENTER_CS:
if (is_guest_mode(vcpu))
@@ -4746,7 +4757,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
vect_info = vmx->idt_vectoring_info;
intr_info = vmx_get_intr_info(vcpu);
- if (is_machine_check(intr_info) || is_nmi(intr_info))
+ if (is_machine_check(intr_info) || is_nmi(intr_info) || is_nm(intr_info))
return 1; /* handled by handle_exception_nmi_irqoff() */
if (is_invalid_opcode(intr_info))
@@ -6350,6 +6361,12 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
kvm_after_interrupt(vcpu);
}
+static void handle_exception_nm(struct kvm_vcpu *vcpu)
+{
+ rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+ kvm_queue_exception(vcpu, NM_VECTOR);
+}
+
static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
{
const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
@@ -6358,6 +6375,9 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
/* if exit due to PF check for async PF */
if (is_page_fault(intr_info))
vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ /* if exit due to NM, handle before preemptions are enabled */
+ else if (is_nm(intr_info))
+ handle_exception_nm(&vmx->vcpu);
/* Handle machine checks before interrupts are enabled */
else if (is_machine_check(intr_info))
kvm_machine_check();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 36677b754ac9..b22defad5cab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9893,6 +9893,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return();
+ if (vcpu->arch.guest_fpu.xfd_err)
+ wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+
if (unlikely(vcpu->arch.switch_db_regs)) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);
@@ -9956,6 +9959,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
static_call(kvm_x86_handle_exit_irqoff)(vcpu);
+ if (vcpu->arch.guest_fpu.xfd_err)
+ wrmsrl(MSR_IA32_XFD_ERR, 0);
+
/*
* Consume any pending interrupts, including the possible source of
* VM-Exit on SVM and any ticks that occur between VM-Exit and now.
--
2.27.0
Intel's eXtended Feature Disable (XFD) feature allows the software
to dynamically adjust fpstate buffer size for XSAVE features which
have large state.
Because fpstate has been expanded for all possible dynamic xstates
at KVM_SET_CPUID2, emulation of the IA32_XFD MSR is straightforward.
For write just call fpu_update_guest_xfd() to update the guest fpu
container once all the sanity checks are passed. For read then
return the cached value in the container.
Signed-off-by: Zeng Guang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e50e97ac4408..36677b754ac9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1359,6 +1359,7 @@ static const u32 msrs_to_save_all[] = {
MSR_F15H_PERF_CTL3, MSR_F15H_PERF_CTL4, MSR_F15H_PERF_CTL5,
MSR_F15H_PERF_CTR0, MSR_F15H_PERF_CTR1, MSR_F15H_PERF_CTR2,
MSR_F15H_PERF_CTR3, MSR_F15H_PERF_CTR4, MSR_F15H_PERF_CTR5,
+ MSR_IA32_XFD,
};
static u32 msrs_to_save[ARRAY_SIZE(msrs_to_save_all)];
@@ -3669,6 +3670,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
vcpu->arch.msr_misc_features_enables = data;
break;
+#ifdef CONFIG_X86_64
+ case MSR_IA32_XFD:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ return 1;
+
+ if (data & ~(XFEATURE_MASK_USER_DYNAMIC &
+ vcpu->arch.guest_supported_xcr0))
+ return 1;
+
+ fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
+ break;
+#endif
default:
if (kvm_pmu_is_valid_msr(vcpu, msr))
return kvm_pmu_set_msr(vcpu, msr_info);
@@ -3989,6 +4003,15 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_K7_HWCR:
msr_info->data = vcpu->arch.msr_hwcr;
break;
+#ifdef CONFIG_X86_64
+ case MSR_IA32_XFD:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ return 1;
+
+ msr_info->data = vcpu->arch.guest_fpu.fpstate->xfd;
+ break;
+#endif
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
return kvm_pmu_get_msr(vcpu, msr_info);
@@ -6422,6 +6445,10 @@ static void kvm_init_msr_list(void)
min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
continue;
break;
+ case MSR_IA32_XFD:
+ if (!kvm_cpu_cap_has(X86_FEATURE_XFD))
+ continue;
+ break;
default:
break;
}
--
2.27.0
Two XCR0 bits are defined for AMX to support XSAVE mechanism. Bit 17
is for tilecfg and bit 18 is for tiledata.
The value of XCR0[17:18] is always either 00b or 11b. Also, SDM
recommends that only 64-bit operating systems enable Intel AMX by
setting XCR0[18:17]. If a 32-bit guest tries to set dynamic bits, it
fails to pass vcpu->arch.guest_supported_xcr0 check and gets a #GP.
Signed-off-by: Yang Zhong <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/x86.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a48a89f73027..c558c098979a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -210,7 +210,7 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
#define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
| XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
- | XFEATURE_MASK_PKRU)
+ | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);
@@ -990,6 +990,12 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
return 1;
}
+
+#ifdef CONFIG_X86_64
+ if ((xcr0 & XFEATURE_MASK_XTILE) &&
+ ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE))
+ return 1;
+#endif
vcpu->arch.xcr0 = xcr0;
if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
--
2.27.0
Disable read emulation of IA32_XFD_ERR MSR if guest cpuid includes XFD.
This saves one unnecessary VM-exit in guest #NM handler, given that the
MSR is already restored with the guest value before the guest is resumed.
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 6 ++++++
arch/x86/kvm/vmx/vmx.h | 2 +-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 57d256fbf34b..8595a3e8fbd8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -162,6 +162,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_FS_BASE,
MSR_GS_BASE,
MSR_KERNEL_GS_BASE,
+ MSR_IA32_XFD_ERR,
#endif
MSR_IA32_SYSENTER_CS,
MSR_IA32_SYSENTER_ESP,
@@ -7225,6 +7226,11 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
}
}
+ if (boot_cpu_has(X86_FEATURE_XFD))
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_R,
+ !guest_cpuid_has(vcpu, X86_FEATURE_XFD));
+
+
set_cr4_guest_host_mask(vmx);
vmx_write_encls_bitmap(vcpu, NULL);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 4df2ac24ffc1..bf9d3051cd6c 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;
/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 13
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 14
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
--
2.27.0
Emulate read/write to IA32_XFD_ERR MSR.
Only the saved value in the guest_fpu container is touched in the
emulation handler. Actual MSR update is handled right before entering
the guest (with preemption disabled)
Signed-off-by: Zeng Guang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/x86.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b22defad5cab..a48a89f73027 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1359,7 +1359,7 @@ static const u32 msrs_to_save_all[] = {
MSR_F15H_PERF_CTL3, MSR_F15H_PERF_CTL4, MSR_F15H_PERF_CTL5,
MSR_F15H_PERF_CTR0, MSR_F15H_PERF_CTR1, MSR_F15H_PERF_CTR2,
MSR_F15H_PERF_CTR3, MSR_F15H_PERF_CTR4, MSR_F15H_PERF_CTR5,
- MSR_IA32_XFD,
+ MSR_IA32_XFD, MSR_IA32_XFD_ERR,
};
static u32 msrs_to_save[ARRAY_SIZE(msrs_to_save_all)];
@@ -3682,6 +3682,17 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
break;
+ case MSR_IA32_XFD_ERR:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ return 1;
+
+ if (data & ~(XFEATURE_MASK_USER_DYNAMIC &
+ vcpu->arch.guest_supported_xcr0))
+ return 1;
+
+ vcpu->arch.guest_fpu.xfd_err = data;
+ break;
#endif
default:
if (kvm_pmu_is_valid_msr(vcpu, msr))
@@ -4011,6 +4022,13 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vcpu->arch.guest_fpu.fpstate->xfd;
break;
+ case MSR_IA32_XFD_ERR:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ return 1;
+
+ msr_info->data = vcpu->arch.guest_fpu.xfd_err;
+ break;
#endif
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
@@ -6446,6 +6464,7 @@ static void kvm_init_msr_list(void)
continue;
break;
case MSR_IA32_XFD:
+ case MSR_IA32_XFD_ERR:
if (!kvm_cpu_cap_has(X86_FEATURE_XFD))
continue;
break;
--
2.27.0
From: Thomas Gleixner <[email protected]>
Userspace needs to inquire KVM about the buffer size to work
with the new KVM_SET_XSAVE and KVM_GET_XSAVE2. Add the size info
to guest_fpu for KVM to access.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 5 +++++
arch/x86/kernel/fpu/core.c | 1 +
arch/x86/kernel/fpu/xstate.c | 1 +
3 files changed, 7 insertions(+)
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 3795d0573773..eb7cd1139d97 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -522,6 +522,11 @@ struct fpu_guest {
*/
u64 xfd_err;
+ /*
+ * @uabi_size: Size required for save/restore
+ */
+ unsigned int uabi_size;
+
/*
* @fpstate: Pointer to the allocated guest fpstate
*/
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 3daea097c618..89d679cc819b 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -240,6 +240,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
gfpu->fpstate = fpstate;
gfpu->xfeatures = fpu_user_cfg.default_features;
gfpu->perm = fpu_user_cfg.default_features;
+ gfpu->uabi_size = fpu_user_cfg.default_size;
fpu_init_guest_permissions(gfpu);
return true;
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 0c0b2323cdec..10fe072f1c92 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1545,6 +1545,7 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
newfps->is_confidential = curfps->is_confidential;
newfps->in_use = curfps->in_use;
guest_fpu->xfeatures |= xfeatures;
+ guest_fpu->uabi_size = usize;
}
fpregs_lock();
--
2.27.0
Extend CPUID emulation to support XFD, AMX_TILE, AMX_INT8 and
AMX_BF16. Adding those bits into kvm_cpu_caps finally activates all
previous logics in this series.
Signed-off-by: Yang Zhong <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/kvm/cpuid.c | 25 +++++++++++++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d5b5f2ab87a0..da872b6f8d8b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -299,7 +299,9 @@
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
#define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */
+#define X86_FEATURE_AMX_BF16 (18*32+22) /* AMX bf16 Support */
#define X86_FEATURE_AMX_TILE (18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8 (18*32+25) /* AMX int8 Support */
/* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 13 */
#define X86_FEATURE_CLZERO (13*32+ 0) /* CLZERO instruction */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index eb5a5070accb..1d694ead374e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -515,7 +515,8 @@ void kvm_set_cpu_caps(void)
F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
- F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16)
+ F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
+ F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16)
);
/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -534,7 +535,7 @@ void kvm_set_cpu_caps(void)
);
kvm_cpu_cap_mask(CPUID_D_1_EAX,
- F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES)
+ F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) | F(XFD)
);
kvm_cpu_cap_init_scattered(CPUID_12_EAX,
@@ -660,6 +661,8 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
case 0x14:
case 0x17:
case 0x18:
+ case 0x1d:
+ case 0x1e:
case 0x1f:
case 0x8000001d:
entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
@@ -932,6 +935,24 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
goto out;
}
break;
+ /* Intel AMX TILE */
+ case 0x1d:
+ if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) {
+ if (!do_host_cpuid(array, function, i))
+ goto out;
+ }
+ break;
+ case 0x1e: /* TMUL information */
+ if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+ break;
case KVM_CPUID_SIGNATURE: {
const u32 *sigptr = (const u32 *)KVM_SIGNATURE;
entry->eax = KVM_CPUID_FEATURES;
--
2.27.0
From: Guang Zeng <[email protected]>
When AMX is enabled it requires a larger xstate buffer than
the legacy hardcoded 4KB one. Exising kvm ioctls
(KVM_[G|S]ET_XSAVE under KVM_CAP_XSAVE) are not suitable for
this purpose.
A new capability (KVM_CAP_XSAVE2) is introduced to mark an
extended kvm_xsave format to support >4KB fpstate. The expanded
fpstate size is returned to userspace when it checks the
KVM_CAP_XSAVE2 capability.
Introduce a new KVM_GET_XSAVE2 under this capability to use the
new format for copying guest fpstate to userspace. Reuse
KVM_SET_XSAVE for both old/new formats by reimplementing it to
do properly-sized memdup_user() based on the guest fpu container.
Also, update the api doc with the new KVM_GET_XSAVE2 ioctl.
Signed-off-by: Guang Zeng <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
Documentation/virt/kvm/api.rst | 42 +++++++++++++++++++++++++++++++--
arch/x86/include/uapi/asm/kvm.h | 16 ++++++++++++-
arch/x86/kvm/x86.c | 39 +++++++++++++++++++++++++++++-
include/uapi/linux/kvm.h | 4 ++++
4 files changed, 97 insertions(+), 4 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1cf2483246cd..e48f7de5f23a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1566,6 +1566,7 @@ otherwise it will return EBUSY error.
struct kvm_xsave {
__u32 region[1024];
+ __u32 extra[0];
};
This ioctl would copy current vcpu's xsave struct to the userspace.
@@ -1574,7 +1575,7 @@ This ioctl would copy current vcpu's xsave struct to the userspace.
4.43 KVM_SET_XSAVE
------------------
-:Capability: KVM_CAP_XSAVE
+:Capability: KVM_CAP_XSAVE and KVM_CAP_XSAVE2
:Architectures: x86
:Type: vcpu ioctl
:Parameters: struct kvm_xsave (in)
@@ -1585,9 +1586,18 @@ This ioctl would copy current vcpu's xsave struct to the userspace.
struct kvm_xsave {
__u32 region[1024];
+ __u32 extra[0];
};
-This ioctl would copy userspace's xsave struct to the kernel.
+This ioctl would copy userspace's xsave struct to the kernel. It copies
+as many bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2),
+when invoked on the vm file descriptor. The size value returned by
+KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will always be at least 4096.
+Currently, it is only greater than 4096 if a dynamic feature has been
+enabled with ``arch_prctl()``, but this may change in the future.
+
+The offsets of the state save areas in struct kvm_xsave follow the
+contents of CPUID leaf 0xD on the host.
4.44 KVM_GET_XCRS
@@ -5507,6 +5517,34 @@ the trailing ``'\0'``, is indicated by ``name_size`` in the header.
The Stats Data block contains an array of 64-bit values in the same order
as the descriptors in Descriptors block.
+4.42 KVM_GET_XSAVE2
+------------------
+
+:Capability: KVM_CAP_XSAVE2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xsave (out)
+:Returns: 0 on success, -1 on error
+
+
+::
+
+ struct kvm_xsave {
+ __u32 region[1024];
+ __u32 extra[0];
+ };
+
+This ioctl would copy current vcpu's xsave struct to the userspace. It
+copies as many bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
+when invoked on the vm file descriptor. The size value returned by
+KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will always be at least 4096.
+Currently, it is only greater than 4096 if a dynamic feature has been
+enabled with ``arch_prctl()``, but this may change in the future.
+
+The offsets of the state save areas in struct kvm_xsave follow the contents
+of CPUID leaf 0xD on the host.
+
+
5. The kvm_run structure
========================
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a776a08f78c..2da3316bb559 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -373,9 +373,23 @@ struct kvm_debugregs {
__u64 reserved[9];
};
-/* for KVM_CAP_XSAVE */
+/* for KVM_CAP_XSAVE and KVM_CAP_XSAVE2 */
struct kvm_xsave {
+ /*
+ * KVM_GET_XSAVE2 and KVM_SET_XSAVE write and read as many bytes
+ * as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
+ * respectively, when invoked on the vm file descriptor.
+ *
+ * The size value returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
+ * will always be at least 4096. Currently, it is only greater
+ * than 4096 if a dynamic feature has been enabled with
+ * ``arch_prctl()``, but this may change in the future.
+ *
+ * The offsets of the state save areas in struct kvm_xsave follow
+ * the contents of CPUID leaf 0xD on the host.
+ */
__u32 region[1024];
+ __u32 extra[0];
};
#define KVM_MAX_XCRS 16
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c558c098979a..3b756ff13103 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4297,6 +4297,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
else
r = 0;
break;
+ case KVM_CAP_XSAVE2:
+ r = kvm->vcpus[0]->arch.guest_fpu.uabi_size;
+ if (r < sizeof(struct kvm_xsave))
+ r = sizeof(struct kvm_xsave);
+ break;
default:
break;
}
@@ -4900,6 +4905,16 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
vcpu->arch.pkru);
}
+static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
+ u8 *state, unsigned int size)
+{
+ if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+ return;
+
+ fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
+ state, size, vcpu->arch.pkru);
+}
+
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
struct kvm_xsave *guest_xsave)
{
@@ -5367,7 +5382,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
case KVM_SET_XSAVE: {
- u.xsave = memdup_user(argp, sizeof(*u.xsave));
+ int size = vcpu->arch.guest_fpu.uabi_size;
+
+ u.xsave = memdup_user(argp, size);
if (IS_ERR(u.xsave)) {
r = PTR_ERR(u.xsave);
goto out_nofree;
@@ -5376,6 +5393,26 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
break;
}
+
+ case KVM_GET_XSAVE2: {
+ int size = vcpu->arch.guest_fpu.uabi_size;
+
+ u.xsave = kzalloc(size, GFP_KERNEL_ACCOUNT);
+ if (!u.xsave) {
+ r = -ENOMEM;
+ break;
+ }
+
+ kvm_vcpu_ioctl_x86_get_xsave2(vcpu, u.buffer, size);
+
+ if (copy_to_user(argp, u.xsave, size)) {
+ r = -EFAULT;
+ break;
+ }
+ r = 0;
+ break;
+ }
+
case KVM_GET_XCRS: {
u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL_ACCOUNT);
r = -ENOMEM;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..9d1c01669560 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1131,6 +1131,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
#define KVM_CAP_ARM_MTE 205
#define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
+#define KVM_CAP_XSAVE2 207
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1610,6 +1611,9 @@ struct kvm_enc_region {
#define KVM_S390_NORMAL_RESET _IO(KVMIO, 0xc3)
#define KVM_S390_CLEAR_RESET _IO(KVMIO, 0xc4)
+/* Available with KVM_CAP_XSAVE2 */
+#define KVM_GET_XSAVE2 _IOR(KVMIO, 0xcf, struct kvm_xsave)
+
struct kvm_s390_pv_sec_parm {
__u64 origin;
__u64 length;
--
2.27.0
From: Kevin Tian <[email protected]>
Guest XFD can be updated either in the emulation path or in the
restore path.
Provide a wrapper to update guest_fpu::fpstate::xfd. If the guest
fpstate is currently in-use, also update the per-cpu xfd cache and
the actual MSR.
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 6 ++++++
arch/x86/kernel/fpu/core.c | 12 ++++++++++++
2 files changed, 18 insertions(+)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 8e934e571273..100b535cd3de 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -140,6 +140,12 @@ extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
extern int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu);
+#ifdef CONFIG_X86_64
+extern void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd);
+#else
+static inline void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) { }
+#endif
+
extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 2560a95980aa..3daea097c618 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -288,6 +288,18 @@ int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu)
}
EXPORT_SYMBOL_GPL(fpu_update_guest_perm_features);
+#ifdef CONFIG_X86_64
+void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
+{
+ fpregs_lock();
+ guest_fpu->fpstate->xfd = xfd;
+ if (guest_fpu->fpstate->in_use)
+ xfd_update_state(guest_fpu->fpstate);
+ fpregs_unlock();
+}
+EXPORT_SYMBOL_GPL(fpu_update_guest_xfd);
+#endif /* CONFIG_X86_64 */
+
int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
{
struct fpstate *guest_fps = guest_fpu->fpstate;
--
2.27.0
From: Wei Wang <[email protected]>
When KVM_CAP_XSAVE2 is supported, userspace is expected to allocate
buffer for KVM_GET_XSAVE2 and KVM_SET_XSAVE using the size returned
by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Guang Zeng <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
tools/arch/x86/include/uapi/asm/kvm.h | 16 ++++-
tools/include/uapi/linux/kvm.h | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 2 +
.../selftests/kvm/include/x86_64/processor.h | 10 +++
tools/testing/selftests/kvm/lib/kvm_util.c | 32 +++++++++
.../selftests/kvm/lib/x86_64/processor.c | 67 ++++++++++++++++++-
.../testing/selftests/kvm/x86_64/evmcs_test.c | 2 +-
tools/testing/selftests/kvm/x86_64/smm_test.c | 2 +-
.../testing/selftests/kvm/x86_64/state_test.c | 2 +-
.../kvm/x86_64/vmx_preemption_timer_test.c | 2 +-
10 files changed, 130 insertions(+), 8 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 5a776a08f78c..2da3316bb559 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -373,9 +373,23 @@ struct kvm_debugregs {
__u64 reserved[9];
};
-/* for KVM_CAP_XSAVE */
+/* for KVM_CAP_XSAVE and KVM_CAP_XSAVE2 */
struct kvm_xsave {
+ /*
+ * KVM_GET_XSAVE2 and KVM_SET_XSAVE write and read as many bytes
+ * as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
+ * respectively, when invoked on the vm file descriptor.
+ *
+ * The size value returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
+ * will always be at least 4096. Currently, it is only greater
+ * than 4096 if a dynamic feature has been enabled with
+ * ``arch_prctl()``, but this may change in the future.
+ *
+ * The offsets of the state save areas in struct kvm_xsave follow
+ * the contents of CPUID leaf 0xD on the host.
+ */
__u32 region[1024];
+ __u32 extra[0];
};
#define KVM_MAX_XCRS 16
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 1daa45268de2..f066637ee206 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1131,6 +1131,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
#define KVM_CAP_ARM_MTE 205
#define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
+#define KVM_CAP_XSAVE2 207
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1551,6 +1552,8 @@ struct kvm_s390_ucas_mapping {
/* Available with KVM_CAP_XSAVE */
#define KVM_GET_XSAVE _IOR(KVMIO, 0xa4, struct kvm_xsave)
#define KVM_SET_XSAVE _IOW(KVMIO, 0xa5, struct kvm_xsave)
+/* Available with KVM_CAP_XSAVE2 */
+#define KVM_GET_XSAVE2 _IOR(KVMIO, 0xcf, struct kvm_xsave)
/* Available with KVM_CAP_XCRS */
#define KVM_GET_XCRS _IOR(KVMIO, 0xa6, struct kvm_xcrs)
#define KVM_SET_XCRS _IOW(KVMIO, 0xa7, struct kvm_xcrs)
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 2d62edc49d67..65ace3f01fad 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -85,6 +85,7 @@ extern const struct vm_guest_mode_params vm_guest_mode_params[];
int open_path_or_exit(const char *path, int flags);
int open_kvm_dev_path_or_exit(void);
int kvm_check_cap(long cap);
+int vm_check_cap(struct kvm_vm *vm, long cap);
int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap);
int vcpu_enable_cap(struct kvm_vm *vm, uint32_t vcpu_id,
struct kvm_enable_cap *cap);
@@ -316,6 +317,7 @@ struct kvm_vm *vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
* guest_code - The vCPU's entry point
*/
void vm_vcpu_add_default(struct kvm_vm *vm, uint32_t vcpuid, void *guest_code);
+void vm_xsave_req_perm(void);
bool vm_is_unrestricted_guest(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 05e65ca1c30c..58633e51960f 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -10,8 +10,10 @@
#include <assert.h>
#include <stdint.h>
+#include <syscall.h>
#include <asm/msr-index.h>
+#include <asm/prctl.h>
#include "../kvm_util.h"
@@ -352,6 +354,7 @@ struct kvm_x86_state;
struct kvm_x86_state *vcpu_save_state(struct kvm_vm *vm, uint32_t vcpuid);
void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid,
struct kvm_x86_state *state);
+void kvm_x86_state_cleanup(struct kvm_x86_state *state);
struct kvm_msr_list *kvm_get_msr_index_list(void);
uint64_t kvm_get_feature_msr(uint64_t msr_index);
@@ -443,4 +446,11 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
/* VMX_EPT_VPID_CAP bits */
#define VMX_EPT_VPID_CAP_AD_BITS (1ULL << 21)
+#define XSTATE_XTILE_CFG_BIT 17
+#define XSTATE_XTILE_DATA_BIT 18
+
+#define XSTATE_XTILE_CFG_MASK (1ULL << XSTATE_XTILE_CFG_BIT)
+#define XSTATE_XTILE_DATA_MASK (1ULL << XSTATE_XTILE_DATA_BIT)
+#define XFEATURE_XTILE_MASK (XSTATE_XTILE_CFG_MASK | \
+ XSTATE_XTILE_DATA_MASK)
#endif /* SELFTEST_KVM_PROCESSOR_H */
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 53d2b5d04b82..318ac76fd2f1 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -85,6 +85,33 @@ int kvm_check_cap(long cap)
return ret;
}
+/* VM Check Capability
+ *
+ * Input Args:
+ * vm - Virtual Machine
+ * cap - Capability
+ *
+ * Output Args: None
+ *
+ * Return:
+ * On success, the Value corresponding to the capability (KVM_CAP_*)
+ * specified by the value of cap. On failure a TEST_ASSERT failure
+ * is produced.
+ *
+ * Looks up and returns the value corresponding to the capability
+ * (KVM_CAP_*) given by cap.
+ */
+int vm_check_cap(struct kvm_vm *vm, long cap)
+{
+ int ret;
+
+ ret = ioctl(vm->fd, KVM_CHECK_EXTENSION, cap);
+ TEST_ASSERT(ret >= 0, "KVM_CHECK_EXTENSION VM IOCTL failed,\n"
+ " rc: %i errno: %i", ret, errno);
+
+ return ret;
+}
+
/* VM Enable Capability
*
* Input Args:
@@ -344,6 +371,11 @@ struct kvm_vm *vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
struct kvm_vm *vm;
int i;
+ /*
+ * Permission needs to be requested before KVM_SET_CPUID2.
+ */
+ vm_xsave_req_perm();
+
/* Force slot0 memory size not small than DEFAULT_GUEST_PHY_PAGES */
if (slot0_mem_pages < DEFAULT_GUEST_PHY_PAGES)
slot0_mem_pages = DEFAULT_GUEST_PHY_PAGES;
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index eef7b34756d5..f19d6d201977 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -650,6 +650,45 @@ static void vcpu_setup(struct kvm_vm *vm, int vcpuid)
vcpu_sregs_set(vm, vcpuid, &sregs);
}
+#define CPUID_XFD_BIT (1 << 4)
+static bool is_xfd_supported(void)
+{
+ int eax, ebx, ecx, edx;
+ const int leaf = 0xd, subleaf = 0x1;
+
+ __asm__ __volatile__(
+ "cpuid"
+ : /* output */ "=a"(eax), "=b"(ebx),
+ "=c"(ecx), "=d"(edx)
+ : /* input */ "0"(leaf), "2"(subleaf));
+
+ return !!(eax & CPUID_XFD_BIT);
+}
+
+void vm_xsave_req_perm(void)
+{
+ unsigned long bitmask;
+ long rc;
+
+ if (!is_xfd_supported())
+ return;
+
+ rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM,
+ XSTATE_XTILE_DATA_BIT);
+ /*
+ * The older kernel version(<5.15) can't support
+ * ARCH_REQ_XCOMP_GUEST_PERM and directly return.
+ */
+ if (rc)
+ return;
+
+ rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_GUEST_PERM, &bitmask);
+ TEST_ASSERT(rc == 0, "prctl(ARCH_GET_XCOMP_GUEST_PERM) error: %ld", rc);
+ TEST_ASSERT(bitmask & XFEATURE_XTILE_MASK,
+ "prctl(ARCH_REQ_XCOMP_GUEST_PERM) failure bitmask=0x%lx",
+ bitmask);
+}
+
void vm_vcpu_add_default(struct kvm_vm *vm, uint32_t vcpuid, void *guest_code)
{
struct kvm_mp_state mp_state;
@@ -1018,10 +1057,10 @@ void vcpu_dump(FILE *stream, struct kvm_vm *vm, uint32_t vcpuid, uint8_t indent)
}
struct kvm_x86_state {
+ struct kvm_xsave *xsave;
struct kvm_vcpu_events events;
struct kvm_mp_state mp_state;
struct kvm_regs regs;
- struct kvm_xsave xsave;
struct kvm_xcrs xcrs;
struct kvm_sregs sregs;
struct kvm_debugregs debugregs;
@@ -1069,6 +1108,22 @@ struct kvm_msr_list *kvm_get_msr_index_list(void)
return list;
}
+static int vcpu_save_xsave_state(struct kvm_vm *vm, struct vcpu *vcpu,
+ struct kvm_x86_state *state)
+{
+ int size;
+
+ size = vm_check_cap(vm, KVM_CAP_XSAVE2);
+ if (!size)
+ size = sizeof(struct kvm_xsave);
+
+ state->xsave = malloc(size);
+ if (size == sizeof(struct kvm_xsave))
+ return ioctl(vcpu->fd, KVM_GET_XSAVE, state->xsave);
+ else
+ return ioctl(vcpu->fd, KVM_GET_XSAVE2, state->xsave);
+}
+
struct kvm_x86_state *vcpu_save_state(struct kvm_vm *vm, uint32_t vcpuid)
{
struct vcpu *vcpu = vcpu_find(vm, vcpuid);
@@ -1112,7 +1167,7 @@ struct kvm_x86_state *vcpu_save_state(struct kvm_vm *vm, uint32_t vcpuid)
TEST_ASSERT(r == 0, "Unexpected result from KVM_GET_REGS, r: %i",
r);
- r = ioctl(vcpu->fd, KVM_GET_XSAVE, &state->xsave);
+ r = vcpu_save_xsave_state(vm, vcpu, state);
TEST_ASSERT(r == 0, "Unexpected result from KVM_GET_XSAVE, r: %i",
r);
@@ -1157,7 +1212,7 @@ void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_x86_state *s
struct vcpu *vcpu = vcpu_find(vm, vcpuid);
int r;
- r = ioctl(vcpu->fd, KVM_SET_XSAVE, &state->xsave);
+ r = ioctl(vcpu->fd, KVM_SET_XSAVE, state->xsave);
TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XSAVE, r: %i",
r);
@@ -1198,6 +1253,12 @@ void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_x86_state *s
}
}
+void kvm_x86_state_cleanup(struct kvm_x86_state *state)
+{
+ free(state->xsave);
+ free(state);
+}
+
bool is_intel_cpu(void)
{
int eax, ebx, ecx, edx;
diff --git a/tools/testing/selftests/kvm/x86_64/evmcs_test.c b/tools/testing/selftests/kvm/x86_64/evmcs_test.c
index 2b46dcca86a8..4c7841dfd481 100644
--- a/tools/testing/selftests/kvm/x86_64/evmcs_test.c
+++ b/tools/testing/selftests/kvm/x86_64/evmcs_test.c
@@ -129,7 +129,7 @@ static void save_restore_vm(struct kvm_vm *vm)
vcpu_set_hv_cpuid(vm, VCPU_ID);
vcpu_enable_evmcs(vm, VCPU_ID);
vcpu_load_state(vm, VCPU_ID, state);
- free(state);
+ kvm_x86_state_cleanup(state);
memset(®s2, 0, sizeof(regs2));
vcpu_regs_get(vm, VCPU_ID, ®s2);
diff --git a/tools/testing/selftests/kvm/x86_64/smm_test.c b/tools/testing/selftests/kvm/x86_64/smm_test.c
index d0fe2fdce58c..2da8eb8e2d96 100644
--- a/tools/testing/selftests/kvm/x86_64/smm_test.c
+++ b/tools/testing/selftests/kvm/x86_64/smm_test.c
@@ -212,7 +212,7 @@ int main(int argc, char *argv[])
vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
vcpu_load_state(vm, VCPU_ID, state);
run = vcpu_state(vm, VCPU_ID);
- free(state);
+ kvm_x86_state_cleanup(state);
}
done:
diff --git a/tools/testing/selftests/kvm/x86_64/state_test.c b/tools/testing/selftests/kvm/x86_64/state_test.c
index 32854c1462ad..2e0a92da8ff5 100644
--- a/tools/testing/selftests/kvm/x86_64/state_test.c
+++ b/tools/testing/selftests/kvm/x86_64/state_test.c
@@ -218,7 +218,7 @@ int main(int argc, char *argv[])
vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
vcpu_load_state(vm, VCPU_ID, state);
run = vcpu_state(vm, VCPU_ID);
- free(state);
+ kvm_x86_state_cleanup(state);
memset(®s2, 0, sizeof(regs2));
vcpu_regs_get(vm, VCPU_ID, ®s2);
diff --git a/tools/testing/selftests/kvm/x86_64/vmx_preemption_timer_test.c b/tools/testing/selftests/kvm/x86_64/vmx_preemption_timer_test.c
index a07480aed397..ff92e25b6f1e 100644
--- a/tools/testing/selftests/kvm/x86_64/vmx_preemption_timer_test.c
+++ b/tools/testing/selftests/kvm/x86_64/vmx_preemption_timer_test.c
@@ -244,7 +244,7 @@ int main(int argc, char *argv[])
vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
vcpu_load_state(vm, VCPU_ID, state);
run = vcpu_state(vm, VCPU_ID);
- free(state);
+ kvm_x86_state_cleanup(state);
memset(®s2, 0, sizeof(regs2));
vcpu_regs_get(vm, VCPU_ID, ®s2);
--
2.27.0
From: Thomas Gleixner <[email protected]>
KVM can disable the write emulation for the XFD MSR when the vCPU's fpstate
is already correctly sized to reduce the overhead.
When write emulation is disabled the XFD MSR state after a VMEXIT is
unknown and therefore not in sync with the software states in fpstate and
the per CPU XFD cache.
Provide fpu_sync_guest_vmexit_xfd_state() which has to be invoked after a
VMEXIT before enabling interrupts when write emulation is disabled for the
XFD MSR.
It could be invoked unconditionally even when write emulation is enabled
for the price of a pointless MSR read.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 2 ++
arch/x86/kernel/fpu/core.c | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 100b535cd3de..dc51809f0e4a 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -142,8 +142,10 @@ extern int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu);
#ifdef CONFIG_X86_64
extern void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd);
+extern void fpu_sync_guest_vmexit_xfd_state(void);
#else
static inline void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) { }
+static inline void fpu_sync_guest_vmexit_xfd_state(void) { }
#endif
extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 89d679cc819b..42a195a6b2ce 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -299,6 +299,30 @@ void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
fpregs_unlock();
}
EXPORT_SYMBOL_GPL(fpu_update_guest_xfd);
+
+/**
+ * fpu_sync_guest_vmexit_xfd_state - Synchronize XFD MSR and software state
+ *
+ * Must be invoked from KVM after a VMEXIT before enabling interrupts when
+ * XFD write emulation is disabled. This is required because the guest can
+ * freely modify XFD and the state at VMEXIT is not guaranteed to be the
+ * same as the state on VMENTER. So software state has to be udpated before
+ * any operation which depends on it can take place.
+ *
+ * Note: It can be invoked unconditionally even when write emulation is
+ * enabled for the price of a then pointless MSR read.
+ */
+void fpu_sync_guest_vmexit_xfd_state(void)
+{
+ struct fpstate *fps = current->thread.fpu.fpstate;
+
+ lockdep_assert_irqs_disabled();
+ if (fpu_state_size_dynamic()) {
+ rdmsrl(MSR_IA32_XFD, fps->xfd);
+ __this_cpu_write(xfd_state, fps->xfd);
+ }
+}
+EXPORT_SYMBOL_GPL(fpu_sync_guest_vmexit_xfd_state);
#endif /* CONFIG_X86_64 */
int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
--
2.27.0
From: Kevin Tian <[email protected]>
KVM can require fpstate expansion in two approaches:
1) Dynamic expansion when intercepting guest updates to XCR0 and
XFD MSR;
2) Static expansion e.g. at KVM_SET_CPUID2;
The first option doesn't waste memory for legacy guest if it doesn't
support XFD. However doing so introduces more complexity and also
imposes an order requirement in the restoring path, i.e. XCR0/XFD
must be restored before XSTATE.
Given that the agreement is to do the static approach. This is
considered a better tradeoff though it does waste 8K memory for
legacy guest if its CPUID includes XFD features.
Provide a wrapper to allow expanding the fpstate buffer to what
guest perm allows. Once completed, both emulation and restore path
don't need to worry about the buffer size.
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 1 +
arch/x86/kernel/fpu/core.c | 27 +++++++++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index d8c222290e68..8e934e571273 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -138,6 +138,7 @@ extern inline u64 xstate_get_guest_group_perm(void);
extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
+extern int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu);
extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a78bc547fc03..2560a95980aa 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -261,6 +261,33 @@ void fpu_free_guest_fpstate(struct fpu_guest *gfpu)
}
EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);
+/*
+ * fpu_update_guest_perm_features - Enable xfeatures according to guest perm
+ * @guest_fpu: Pointer to the guest FPU container
+ *
+ * Enable all dynamic xfeatures according to guest perm. Invoked if the
+ * caller wants to conservatively expand fpstate buffer instead of waiting
+ * until XCR0 or XFD MSR is written.
+ *
+ * Return: 0 on success, error code otherwise
+ */
+int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu)
+{
+ u64 expand;
+
+ lockdep_assert_preemption_enabled();
+
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return 0;
+
+ expand = guest_fpu->perm & ~guest_fpu->xfeatures;
+ if (!expand)
+ return 0;
+
+ return __xfd_enable_feature(expand, guest_fpu);
+}
+EXPORT_SYMBOL_GPL(fpu_update_guest_perm_features);
+
int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
{
struct fpstate *guest_fps = guest_fpu->fpstate;
--
2.27.0
Statically enable all xfeatures allowed by guest perm in
KVM_SET_CPUID2, with fpstate buffer sized accordingly. This avoids
run-time expansion in the emulation and restore path of XCR0 and
XFD MSR [1].
Change kvm_vcpu_after_set_cpuid() to return error given fpstate
reallocation may fail.
[1] https://lore.kernel.org/all/[email protected]/
Suggested-by: Thomas Gleixner <[email protected]>
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index a068373a7fbd..eb5a5070accb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -204,10 +204,12 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
-static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+static int kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
struct kvm_cpuid_entry2 *best;
+ u64 xfeatures;
+ int r;
best = kvm_find_cpuid_entry(vcpu, 1, 0);
if (best && apic) {
@@ -222,9 +224,17 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
if (!best)
vcpu->arch.guest_supported_xcr0 = 0;
- else
- vcpu->arch.guest_supported_xcr0 =
- (best->eax | ((u64)best->edx << 32)) & supported_xcr0;
+ else {
+ xfeatures = best->eax | ((u64)best->edx << 32);
+
+ vcpu->arch.guest_supported_xcr0 = xfeatures & supported_xcr0;
+
+ if (xfeatures != vcpu->arch.guest_fpu.xfeatures) {
+ r = fpu_update_guest_perm_features(&vcpu->arch.guest_fpu);
+ if (r)
+ return r;
+ }
+ }
/*
* Bits 127:0 of the allowed SECS.ATTRIBUTES (CPUID.0x12.0x1) enumerate
@@ -260,6 +270,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
* adjustments to the reserved GPA bits.
*/
kvm_mmu_after_set_cpuid(vcpu);
+
+ return 0;
}
int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu)
@@ -301,9 +313,7 @@ static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
kvm_update_kvm_cpuid_base(vcpu);
kvm_update_cpuid_runtime(vcpu);
- kvm_vcpu_after_set_cpuid(vcpu);
-
- return 0;
+ return kvm_vcpu_after_set_cpuid(vcpu);
}
/* when an old userspace process fills a new kernel module */
--
2.27.0
Always intercepting IA32_XFD causes non-negligible overhead when this
register is updated frequently in the guest.
Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
with a non-zero value.
Disable WRMSR emulation implies that IA32_XFD becomes out-of-sync
with the software states in fpstate and the per-cpu xfd cache. Call
fpu_sync_guest_vmexit_xfd_state() to bring them back in-sync, before
preemption is enabled.
p.s. We have confirmed that SDM is being revised to say that
when setting IA32_XFD[18] the AMX register state is not guaranteed
to be preserved. This clarification avoids adding mess for a creative
guest which sets IA32_XFD[18]=1 before saving active AMX state to
its own storage.
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/vmx.c | 4 ++++
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 3 +++
4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f7a661f35d1a..7b81d324b32a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -641,6 +641,7 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
bool xsaves_enabled;
bool trap_nm;
+ bool xfd_out_of_sync;
u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8595a3e8fbd8..7aaf905e13b4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -162,6 +162,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_FS_BASE,
MSR_GS_BASE,
MSR_KERNEL_GS_BASE,
+ MSR_IA32_XFD,
MSR_IA32_XFD_ERR,
#endif
MSR_IA32_SYSENTER_CS,
@@ -1968,6 +1969,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_XFD:
ret = kvm_set_msr_common(vcpu, msr_info);
if (!ret && data) {
+ vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
+ vcpu->arch.xfd_out_of_sync = true;
+
vcpu->arch.trap_nm = true;
vmx_update_exception_bitmap(vcpu);
}
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bf9d3051cd6c..0a00242a91e7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;
/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 14
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 15
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b756ff13103..10a08aa2aa45 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10024,6 +10024,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_fpu.xfd_err)
wrmsrl(MSR_IA32_XFD_ERR, 0);
+ if (vcpu->arch.xfd_out_of_sync)
+ fpu_sync_guest_vmexit_xfd_state();
+
/*
* Consume any pending interrupts, including the possible source of
* VM-Exit on SVM and any ticks that occur between VM-Exit and now.
--
2.27.0
On Wed, Dec 22, 2021, Jing Liu wrote:
> Guest xstate permissions should be set by userspace VMM before vcpu
> creation. Extend KVM_SET_CPUID2 to verify that every feature reported
> in CPUID[0xD] has proper permission set.
Why? Nothing in the changelog, code, or comments explains why KVM _needs_ to prevent
userspace from advertising bogus features to the guest. E.g. the virtual address
width check exists because _KVM_ will do the wrong thing given a width other than 48
or 57, and explicity says as much in a comment.
On Wed, Dec 22, 2021, Jing Liu wrote:
> Statically enable all xfeatures allowed by guest perm in
Statically isn't the right word. It's not dymanic with respect to running the
vCPU, but it's certainly not static. I think you can just omit "Statically"
entirely.
> KVM_SET_CPUID2, with fpstate buffer sized accordingly. This avoids
> run-time expansion in the emulation and restore path of XCR0 and
> XFD MSR [1].
>
> Change kvm_vcpu_after_set_cpuid() to return error given fpstate
> reallocation may fail.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Suggested-by: Thomas Gleixner <[email protected]>
> Suggested-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Jing Liu <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 24 +++++++++++++++++-------
> 1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index a068373a7fbd..eb5a5070accb 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -204,10 +204,12 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
>
> -static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> +static int kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> struct kvm_lapic *apic = vcpu->arch.apic;
> struct kvm_cpuid_entry2 *best;
> + u64 xfeatures;
> + int r;
>
> best = kvm_find_cpuid_entry(vcpu, 1, 0);
> if (best && apic) {
> @@ -222,9 +224,17 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
> if (!best)
> vcpu->arch.guest_supported_xcr0 = 0;
> - else
> - vcpu->arch.guest_supported_xcr0 =
> - (best->eax | ((u64)best->edx << 32)) & supported_xcr0;
> + else {
> + xfeatures = best->eax | ((u64)best->edx << 32);
> +
> + vcpu->arch.guest_supported_xcr0 = xfeatures & supported_xcr0;
> +
> + if (xfeatures != vcpu->arch.guest_fpu.xfeatures) {
> + r = fpu_update_guest_perm_features(&vcpu->arch.guest_fpu);
> + if (r)
> + return r;
IMO, this should be done and check before "committing" state, otherwise KVM will
set the vCPU's CPUID info and update a variety of state, but then tell userspace
that it failed. The -EPERM case in particular falls squarely into the "check"
category.
On Wed, Dec 22, 2021, Jing Liu wrote:
> Guest IA32_XFD_ERR is generally modified in two places:
>
> - Set by CPU when #NM is triggered;
> - Cleared by guest in its #NM handler;
>
> Intercept #NM for the first case, if guest writes XFD as nonzero for
> the first time which indicates guest is possible to use XFD generating
> the exception. #NM is rare if the guest doesn't use dynamic features.
> Otherwise, there is at most one exception per guest task given a
> dynamic feature.
>
> Save the current XFD_ERR value to the guest_fpu container in the #NM
> VM-exit handler. This must be done with interrupt/preemption disabled,
Assuming my below understanding is correct, drop the "preemption" bit, it's
misleading.
> otherwise the unsaved MSR value may be clobbered by host operations.
>
> Inject a virtual #NM to the guest after saving the MSR value.
>
> Restore the host value (always ZERO outside of the host #NM
> handler) before enabling preemption.
AIUI, changelog is wrong, code is right. This must be done before _IRQs_ are
enabled, same as handling TIF_NEED_FPU_LOAD.
> Restore the guest value from the guest_fpu container right before
> entering the guest (with preemption disabled).
Same complaint about preemption.
> Suggested-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Jing Liu <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/vmx/vmcs.h | 5 +++++
> arch/x86/kvm/vmx/vmx.c | 22 +++++++++++++++++++++-
> arch/x86/kvm/x86.c | 6 ++++++
> 4 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 555f4de47ef2..f7a661f35d1a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -640,6 +640,7 @@ struct kvm_vcpu_arch {
> u64 smi_count;
> bool tpr_access_reporting;
> bool xsaves_enabled;
> + bool trap_nm;
> u64 ia32_xss;
> u64 microcode_version;
> u64 arch_capabilities;
...
> @@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
> vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match);
> }
>
> + if (vcpu->arch.trap_nm)
> + eb |= (1u << NM_VECTOR);
> +
> vmcs_write32(EXCEPTION_BITMAP, eb);
> }
>
> @@ -1960,6 +1964,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_KERNEL_GS_BASE:
> vmx_write_guest_kernel_gs_base(vmx, data);
> break;
> + case MSR_IA32_XFD:
> + ret = kvm_set_msr_common(vcpu, msr_info);
> + if (!ret && data) {
> + vcpu->arch.trap_nm = true;
> + vmx_update_exception_bitmap(vcpu);
This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if the
MSR is cleared.
But why even bother with an extra flag? Can't vmx_update_exception_bitmap() get
the guest's MSR_IA32_XFD value and intercept #NM accordingly? Then you could
even handle this fully in kvm_set_msr_common(), e.g.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c9606380bca..c6c936d2b298 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
+ /* Blah blah blah blah */
+ static_call(kvm_x86_update_exception_bitmap)(vcpu);
break;
case MSR_IA32_XFD_ERR:
if (!msr_info->host_initiated &&
> + }
> + break;
> #endif
> case MSR_IA32_SYSENTER_CS:
> if (is_guest_mode(vcpu))
> @@ -4746,7 +4757,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> vect_info = vmx->idt_vectoring_info;
> intr_info = vmx_get_intr_info(vcpu);
>
> - if (is_machine_check(intr_info) || is_nmi(intr_info))
> + if (is_machine_check(intr_info) || is_nmi(intr_info) || is_nm(intr_info))
> return 1; /* handled by handle_exception_nmi_irqoff() */
>
> if (is_invalid_opcode(intr_info))
> @@ -6350,6 +6361,12 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
> kvm_after_interrupt(vcpu);
> }
>
> +static void handle_exception_nm(struct kvm_vcpu *vcpu)
This needs a different name, it's waaaay too close to the base handle_exception_nmi(),
which runs with IRQs _on_. And please add "_irqoff" at the end. Maybe handle_nm_fault_irqoff()?
> +{
> + rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> + kvm_queue_exception(vcpu, NM_VECTOR);
> +}
> +
> static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
> {
> const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
> @@ -6358,6 +6375,9 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
> /* if exit due to PF check for async PF */
> if (is_page_fault(intr_info))
> vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> + /* if exit due to NM, handle before preemptions are enabled */
> + else if (is_nm(intr_info))
Same naming complaint about this helper, it looks like an is_nmi() typo. is_nm_fault()?
> + handle_exception_nm(&vmx->vcpu);
> /* Handle machine checks before interrupts are enabled */
> else if (is_machine_check(intr_info))
> kvm_machine_check();
On Wed, Dec 22, 2021, Jing Liu wrote:
> Two XCR0 bits are defined for AMX to support XSAVE mechanism. Bit 17
> is for tilecfg and bit 18 is for tiledata.
>
> The value of XCR0[17:18] is always either 00b or 11b.
Is that an SDM requirement, or an arbitrary Linux/KVM requirement?
> Also, SDM
> recommends that only 64-bit operating systems enable Intel AMX by
> setting XCR0[18:17]. If a 32-bit guest tries to set dynamic bits, it
This is wrong. It has nothing to do with 32-bit guests. The restriction is on
32-bit _host kernels_, which I'm guessing never set the tile bits in _host_ XCR0.
> fails to pass vcpu->arch.guest_supported_xcr0 check and gets a #GP.
>
> Signed-off-by: Yang Zhong <[email protected]>
> Signed-off-by: Jing Liu <[email protected]>
> ---
> arch/x86/kvm/x86.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a48a89f73027..c558c098979a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -210,7 +210,7 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
> #define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
> | XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
> | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
> - | XFEATURE_MASK_PKRU)
> + | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
>
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
> @@ -990,6 +990,12 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
> if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
> return 1;
> }
> +
> +#ifdef CONFIG_X86_64
Drop the #ifdef, it adds no meaningful value and requires the reader to think
far harder than they should have. Yes, it's technically dead code for 32-bit KVM,
but no one cares about performance of 32-bit KVM, and in any case it's extremely
unlikely this will be anything but noise.
> + if ((xcr0 & XFEATURE_MASK_XTILE) &&
> + ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE))
> + return 1;
> +#endif
> vcpu->arch.xcr0 = xcr0;
>
> if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
> --
> 2.27.0
>
Shortlog needs to have a verb somewhere.
On Wed, Dec 22, 2021, Jing Liu wrote:
> From: Guang Zeng <[email protected]>
>
> When AMX is enabled it requires a larger xstate buffer than
> the legacy hardcoded 4KB one. Exising kvm ioctls
Existing
> (KVM_[G|S]ET_XSAVE under KVM_CAP_XSAVE) are not suitable for
> this purpose.
...
> Reuse KVM_SET_XSAVE for both old/new formats by reimplementing it to
> do properly-sized memdup_user() based on the guest fpu container.
I'm confused, the first sentence says KVM_SET_XSAVE isn't suitable, the second
says it can be reused with minimal effort.
> Also, update the api doc with the new KVM_GET_XSAVE2 ioctl.
...
> @@ -5367,7 +5382,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> break;
> }
> case KVM_SET_XSAVE: {
> - u.xsave = memdup_user(argp, sizeof(*u.xsave));
> + int size = vcpu->arch.guest_fpu.uabi_size;
IIUC, reusing KVM_SET_XSAVE works by requiring that userspace use KVM_GET_XSAVE2
if userspace has expanded the guest FPU size by exposing relevant features to
the guest via guest CPUID. If so, then that needs to be enforced in KVM_GET_XSAVE,
otherwise userspace will get subtle corruption by invoking the wrong ioctl, e.g.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c9606380bca..5d2acbd52df5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5386,6 +5386,10 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
case KVM_GET_XSAVE: {
+ r -EINVAL;
+ if (vcpu->arch.guest_fpu.uabi_size > sizeof(struct kvm_xsave))
+ break;
+
u.xsave = kzalloc(sizeof(struct kvm_xsave), GFP_KERNEL_ACCOUNT);
r = -ENOMEM;
if (!u.xsave)
> +
> + u.xsave = memdup_user(argp, size);
> if (IS_ERR(u.xsave)) {
> r = PTR_ERR(u.xsave);
> goto out_nofree;
> @@ -5376,6 +5393,26 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
> break;
> }
> +
> + case KVM_GET_XSAVE2: {
> + int size = vcpu->arch.guest_fpu.uabi_size;
> +
> + u.xsave = kzalloc(size, GFP_KERNEL_ACCOUNT);
> + if (!u.xsave) {
> + r = -ENOMEM;
I hate the odd patterns in this code as much as anyone, but for better or worse
the style throughout is:
r = -ENOMEM;
u.xsave = kzalloc(size, GFP_KERNEL_ACCOUNT);
if (u.xsave)
break;
> + break;
> + }
> +
> + kvm_vcpu_ioctl_x86_get_xsave2(vcpu, u.buffer, size);
> +
> + if (copy_to_user(argp, u.xsave, size)) {
> + r = -EFAULT;
> + break;
Same style thing here.
> + }
> + r = 0;
> + break;
> + }
> +
> case KVM_GET_XCRS: {
> u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL_ACCOUNT);
> r = -ENOMEM;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 1daa45268de2..9d1c01669560 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1131,6 +1131,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
> #define KVM_CAP_ARM_MTE 205
> #define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
> +#define KVM_CAP_XSAVE2 207
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1610,6 +1611,9 @@ struct kvm_enc_region {
> #define KVM_S390_NORMAL_RESET _IO(KVMIO, 0xc3)
> #define KVM_S390_CLEAR_RESET _IO(KVMIO, 0xc4)
>
> +/* Available with KVM_CAP_XSAVE2 */
> +#define KVM_GET_XSAVE2 _IOR(KVMIO, 0xcf, struct kvm_xsave)
> +
> struct kvm_s390_pv_sec_parm {
> __u64 origin;
> __u64 length;
> --
> 2.27.0
>
On Wed, Dec 22, 2021, Jing Liu wrote:
> @@ -1968,6 +1969,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_XFD:
> ret = kvm_set_msr_common(vcpu, msr_info);
> if (!ret && data) {
> + vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
> + vcpu->arch.xfd_out_of_sync = true;
xfd_out_of_sync is a poor name, as XFD _may_ be out of sync, or it may not. It's
also confusing that it's kept set after XFD is explicitly synchronized in
vcpu_enter_guest().
> +
> vcpu->arch.trap_nm = true;
> vmx_update_exception_bitmap(vcpu);
Ah, this is why #NM interception was made sticky many patches ago. More at the end.
> }
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index bf9d3051cd6c..0a00242a91e7 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -340,7 +340,7 @@ struct vcpu_vmx {
> struct lbr_desc lbr_desc;
>
> /* Save desired MSR intercept (read: pass-through) state */
> -#define MAX_POSSIBLE_PASSTHROUGH_MSRS 14
> +#define MAX_POSSIBLE_PASSTHROUGH_MSRS 15
> struct {
> DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
> DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3b756ff13103..10a08aa2aa45 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10024,6 +10024,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> if (vcpu->arch.guest_fpu.xfd_err)
> wrmsrl(MSR_IA32_XFD_ERR, 0);
>
> + if (vcpu->arch.xfd_out_of_sync)
Rather than adding a flag that tracks whether or not the MSR can be written by
the guest, can't this be:
if (!vmx_test_msr_bitmap_write(vcpu->loaded_vmcs->msr_bitmap))
fpu_sync_guest_vmexit_xfd_state();
That might be marginally slower than checking a dedicated flag? But is has the
advantage of doing the correct thing for nested guests instead of penalizing them
with an unnecessary sync on every exit. If performance of the check is an issue,
we could add a static key to skip the code unless at least one vCPU has triggered
the XFD crud, a la kvm_has_noapic_vcpu (which may or may not provide any real
performance benefits).
Speaking of nested, interception of #NM in vmx_update_exception_bitmap() is wrong
with respect to nested guests. Until XFD is supported for L2, which I didn't see
in this series, #NM should not be intercepted while L2 is running.
For the earlier patch that introduced arch.trap_nm, if it's not too gross and not
racy, the code could be:
if (is_guest_mode(vcpu))
eb |= get_vmcs12(vcpu)->exception_bitmap;
else {
...
if (vcpu->arch.guest_fpu.fpstate.xfd)
eb |= (1u << NM_VECTOR);
}
Though I'm ok with a semi-temporary flag if that's gross/racy.
Then this patch can change it to:
if (is_guest_mode(vcpu))
eb |= get_vmcs12(vcpu)->exception_bitmap;
else {
...
if (!vmx_test_msr_bitmap_write(vcpu->vmcs01.msr_bitmap))
eb |= (1u << NM_VECTOR);
}
> + fpu_sync_guest_vmexit_xfd_state();
> +
> /*
> * Consume any pending interrupts, including the possible source of
> * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
> --
> 2.27.0
>
> From: Sean Christopherson <[email protected]>
> Sent: Wednesday, December 29, 2021 7:39 AM
>
> On Wed, Dec 22, 2021, Jing Liu wrote:
> > Guest xstate permissions should be set by userspace VMM before vcpu
> > creation. Extend KVM_SET_CPUID2 to verify that every feature reported
> > in CPUID[0xD] has proper permission set.
>
> Why? Nothing in the changelog, code, or comments explains why KVM
> _needs_ to prevent
> userspace from advertising bogus features to the guest. E.g. the virtual
> address
> width check exists because _KVM_ will do the wrong thing given a width
> other than 48
> or 57, and explicity says as much in a comment.
Advertising a known bogus feature due to lacking of permission does
no good compared to failing it early even before the guest is running. This
also avoids tons of complexity at run-time to deal with permission
violation (e.g. introducing new kvm exit reason if you tracked the v2
discussion).
But yes, we should add a clear comment here.
Thanks
Kevin
> From: Sean Christopherson <[email protected]>
> Sent: Wednesday, December 29, 2021 7:54 AM
>
> On Wed, Dec 22, 2021, Jing Liu wrote:
> > Statically enable all xfeatures allowed by guest perm in
>
> Statically isn't the right word. It's not dymanic with respect to running the
> vCPU, but it's certainly not static. I think you can just omit "Statically"
> entirely.
make sense.
>
> > KVM_SET_CPUID2, with fpstate buffer sized accordingly. This avoids
> > run-time expansion in the emulation and restore path of XCR0 and
> > XFD MSR [1].
> >
> > Change kvm_vcpu_after_set_cpuid() to return error given fpstate
> > reallocation may fail.
> >
> > [1] https://lore.kernel.org/all/[email protected]/
> >
> > Suggested-by: Thomas Gleixner <[email protected]>
> > Suggested-by: Paolo Bonzini <[email protected]>
> > Signed-off-by: Jing Liu <[email protected]>
> > ---
> > arch/x86/kvm/cpuid.c | 24 +++++++++++++++++-------
> > 1 file changed, 17 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index a068373a7fbd..eb5a5070accb 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -204,10 +204,12 @@ void kvm_update_cpuid_runtime(struct
> kvm_vcpu *vcpu)
> > }
> > EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
> >
> > -static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > +static int kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > {
> > struct kvm_lapic *apic = vcpu->arch.apic;
> > struct kvm_cpuid_entry2 *best;
> > + u64 xfeatures;
> > + int r;
> >
> > best = kvm_find_cpuid_entry(vcpu, 1, 0);
> > if (best && apic) {
> > @@ -222,9 +224,17 @@ static void kvm_vcpu_after_set_cpuid(struct
> kvm_vcpu *vcpu)
> > best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
> > if (!best)
> > vcpu->arch.guest_supported_xcr0 = 0;
> > - else
> > - vcpu->arch.guest_supported_xcr0 =
> > - (best->eax | ((u64)best->edx << 32)) &
> supported_xcr0;
> > + else {
> > + xfeatures = best->eax | ((u64)best->edx << 32);
> > +
> > + vcpu->arch.guest_supported_xcr0 = xfeatures &
> supported_xcr0;
> > +
> > + if (xfeatures != vcpu->arch.guest_fpu.xfeatures) {
> > + r = fpu_update_guest_perm_features(&vcpu-
> >arch.guest_fpu);
> > + if (r)
> > + return r;
>
> IMO, this should be done and check before "committing" state, otherwise
> KVM will
> set the vCPU's CPUID info and update a variety of state, but then tell
> userspace
> that it failed. The -EPERM case in particular falls squarely into the "check"
> category.
We did consider your suggestion in the first place, but then adopted
the current way with the impression that all vcpu related changes are
done in this function. We can surely change it back to the 'check' point.
Thanks
Kevin
> From: Sean Christopherson <[email protected]>
> Sent: Wednesday, December 29, 2021 8:10 AM
>
> On Wed, Dec 22, 2021, Jing Liu wrote:
> > Guest IA32_XFD_ERR is generally modified in two places:
> >
> > - Set by CPU when #NM is triggered;
> > - Cleared by guest in its #NM handler;
> >
> > Intercept #NM for the first case, if guest writes XFD as nonzero for
> > the first time which indicates guest is possible to use XFD generating
> > the exception. #NM is rare if the guest doesn't use dynamic features.
> > Otherwise, there is at most one exception per guest task given a
> > dynamic feature.
> >
> > Save the current XFD_ERR value to the guest_fpu container in the #NM
> > VM-exit handler. This must be done with interrupt/preemption disabled,
>
> Assuming my below understanding is correct, drop the "preemption" bit, it's
> misleading.
code-wise yes. In concept we just want to highlight that this operation
must be completed when both interrupt and preemption are disabled.
But we can also drop preemption if you prefer to, since preemption is
certainly disabled when interrupt is disabled.
>
> > otherwise the unsaved MSR value may be clobbered by host operations.
> >
> > Inject a virtual #NM to the guest after saving the MSR value.
> >
> > Restore the host value (always ZERO outside of the host #NM
> > handler) before enabling preemption.
>
> AIUI, changelog is wrong, code is right. This must be done before _IRQs_ are
> enabled, same as handling TIF_NEED_FPU_LOAD.
yes
>
> > Restore the guest value from the guest_fpu container right before
> > entering the guest (with preemption disabled).
>
> Same complaint about preemption.
>
> > Suggested-by: Thomas Gleixner <[email protected]>
> > Signed-off-by: Jing Liu <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 1 +
> > arch/x86/kvm/vmx/vmcs.h | 5 +++++
> > arch/x86/kvm/vmx/vmx.c | 22 +++++++++++++++++++++-
> > arch/x86/kvm/x86.c | 6 ++++++
> > 4 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h
> > index 555f4de47ef2..f7a661f35d1a 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -640,6 +640,7 @@ struct kvm_vcpu_arch {
> > u64 smi_count;
> > bool tpr_access_reporting;
> > bool xsaves_enabled;
> > + bool trap_nm;
> > u64 ia32_xss;
> > u64 microcode_version;
> > u64 arch_capabilities;
>
> ...
>
> > @@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct
> kvm_vcpu *vcpu)
> > vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match);
> > }
> >
> > + if (vcpu->arch.trap_nm)
> > + eb |= (1u << NM_VECTOR);
> > +
> > vmcs_write32(EXCEPTION_BITMAP, eb);
> > }
> >
> > @@ -1960,6 +1964,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
> > case MSR_KERNEL_GS_BASE:
> > vmx_write_guest_kernel_gs_base(vmx, data);
> > break;
> > + case MSR_IA32_XFD:
> > + ret = kvm_set_msr_common(vcpu, msr_info);
> > + if (!ret && data) {
> > + vcpu->arch.trap_nm = true;
> > + vmx_update_exception_bitmap(vcpu);
>
> This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if
> the
> MSR is cleared.
In concept you are right if just looking at this patch. It's pointless to
trap #NM if guest xfd is cleared.
But here we need think about patch22 which disables write interception
for xfd. With that in consideration we use the 1st non-zero write as the
hint indicating that guest might enable xfd-related usages thus always
trap #NM after this point.
It's not a good ordering, but Paolo wants to put the optimization in the
end of this series. But we do need to put a clear comment here explaining
the always-trap policy.
>
> But why even bother with an extra flag? Can't
> vmx_update_exception_bitmap() get
> the guest's MSR_IA32_XFD value and intercept #NM accordingly? Then you
Above is the reason for the extra flag
> could
> even handle this fully in kvm_set_msr_common(), e.g.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2c9606380bca..c6c936d2b298 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
> return 1;
>
> fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
> + /* Blah blah blah blah */
> + static_call(kvm_x86_update_exception_bitmap)(vcpu);
> break;
> case MSR_IA32_XFD_ERR:
> if (!msr_info->host_initiated &&
>
> > + }
> > + break;
> > #endif
> > case MSR_IA32_SYSENTER_CS:
> > if (is_guest_mode(vcpu))
> > @@ -4746,7 +4757,7 @@ static int handle_exception_nmi(struct kvm_vcpu
> *vcpu)
> > vect_info = vmx->idt_vectoring_info;
> > intr_info = vmx_get_intr_info(vcpu);
> >
> > - if (is_machine_check(intr_info) || is_nmi(intr_info))
> > + if (is_machine_check(intr_info) || is_nmi(intr_info) ||
> is_nm(intr_info))
> > return 1; /* handled by handle_exception_nmi_irqoff() */
> >
> > if (is_invalid_opcode(intr_info))
> > @@ -6350,6 +6361,12 @@ static void handle_interrupt_nmi_irqoff(struct
> kvm_vcpu *vcpu,
> > kvm_after_interrupt(vcpu);
> > }
> >
> > +static void handle_exception_nm(struct kvm_vcpu *vcpu)
>
> This needs a different name, it's waaaay too close to the base
> handle_exception_nmi(),
> which runs with IRQs _on_. And please add "_irqoff" at the end. Maybe
> handle_nm_fault_irqoff()?
sounds good.
>
> > +{
> > + rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> > + kvm_queue_exception(vcpu, NM_VECTOR);
> > +}
> > +
> > static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
> > {
> > const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
> > @@ -6358,6 +6375,9 @@ static void handle_exception_nmi_irqoff(struct
> vcpu_vmx *vmx)
> > /* if exit due to PF check for async PF */
> > if (is_page_fault(intr_info))
> > vmx->vcpu.arch.apf.host_apf_flags =
> kvm_read_and_reset_apf_flags();
> > + /* if exit due to NM, handle before preemptions are enabled */
> > + else if (is_nm(intr_info))
>
> Same naming complaint about this helper, it looks like an is_nmi() typo.
> is_nm_fault()?
will fix
>
> > + handle_exception_nm(&vmx->vcpu);
> > /* Handle machine checks before interrupts are enabled */
> > else if (is_machine_check(intr_info))
> > kvm_machine_check();
On Wednesday, December 29, 2021 8:39 AM, Sean Christopherson wrote:
> To: Liu, Jing2 <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Nakajima, Jun
> <[email protected]>; Tian, Kevin <[email protected]>;
> [email protected]; Zeng, Guang <[email protected]>; Wang, Wei
> W <[email protected]>; Zhong, Yang <[email protected]>
> Subject: Re: [PATCH v3 19/22] kvm: x86: Get/set expanded xstate buffer
>
> Shortlog needs to have a verb somewhere.
>
> On Wed, Dec 22, 2021, Jing Liu wrote:
> > From: Guang Zeng <[email protected]>
> >
> > When AMX is enabled it requires a larger xstate buffer than the legacy
> > hardcoded 4KB one. Exising kvm ioctls
>
> Existing
>
> > (KVM_[G|S]ET_XSAVE under KVM_CAP_XSAVE) are not suitable for this
> > purpose.
>
> ...
>
> > Reuse KVM_SET_XSAVE for both old/new formats by reimplementing it to
> > do properly-sized memdup_user() based on the guest fpu container.
>
> I'm confused, the first sentence says KVM_SET_XSAVE isn't suitable, the
> second says it can be reused with minimal effort.
Probably "doesn't support" sounds better than "isn't suitable" above. But plan to reword a bit:
With KVM_CAP_XSAVE, userspace uses a hardcoded 4KB buffer to get/set xstate data from/to
KVM. This doesn't work when dynamic features (e.g. AMX) are used by the guest, as KVM uses
a full expanded xstate buffer for the guest fpu emulation, which is larger than 4KB.
Add KVM_CAP_XSAVE2, and userspace gets the required xstate buffer size from KVM via
KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2). KVM_SET_XSAVE is extended with the support to
work with larger xstate data size passed from userspace. KVM_GET_XSAVE2 is preferred to
extending KVM_GET_XSAVE to work with large buffer size for backward-compatible considerations.
(Link: https://lkml.org/lkml/2021/12/15/510)
Also, update the api doc with the new KVM_GET_XSAVE2 ioctl.
>
> > Also, update the api doc with the new KVM_GET_XSAVE2 ioctl.
>
> ...
>
> > @@ -5367,7 +5382,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> > break;
> > }
> > case KVM_SET_XSAVE: {
> > - u.xsave = memdup_user(argp, sizeof(*u.xsave));
> > + int size = vcpu->arch.guest_fpu.uabi_size;
>
> IIUC, reusing KVM_SET_XSAVE works by requiring that userspace use
> KVM_GET_XSAVE2 if userspace has expanded the guest FPU size by exposing
> relevant features to the guest via guest CPUID. If so, then that needs to be
> enforced in KVM_GET_XSAVE, otherwise userspace will get subtle corruption
> by invoking the wrong ioctl, e.g.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index
> 2c9606380bca..5d2acbd52df5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5386,6 +5386,10 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> break;
> }
> case KVM_GET_XSAVE: {
> + r -EINVAL;
> + if (vcpu->arch.guest_fpu.uabi_size > sizeof(struct
> kvm_xsave))
> + break;
> +
Looks good to me.
Thanks,
Wei
> From: Sean Christopherson <[email protected]>
> Sent: Wednesday, December 29, 2021 8:21 AM
>
> On Wed, Dec 22, 2021, Jing Liu wrote:
> > Two XCR0 bits are defined for AMX to support XSAVE mechanism. Bit 17
> > is for tilecfg and bit 18 is for tiledata.
> >
> > The value of XCR0[17:18] is always either 00b or 11b.
>
> Is that an SDM requirement, or an arbitrary Linux/KVM requirement?
SDM requirement
>
> > Also, SDM
> > recommends that only 64-bit operating systems enable Intel AMX by
> > setting XCR0[18:17]. If a 32-bit guest tries to set dynamic bits, it
>
> This is wrong. It has nothing to do with 32-bit guests. The restriction is on
> 32-bit _host kernels_, which I'm guessing never set the tile bits in _host_
> XCR0.
make sense.
>
> > fails to pass vcpu->arch.guest_supported_xcr0 check and gets a #GP.
> >
> > Signed-off-by: Yang Zhong <[email protected]>
> > Signed-off-by: Jing Liu <[email protected]>
> > ---
> > arch/x86/kvm/x86.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a48a89f73027..c558c098979a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -210,7 +210,7 @@ static struct kvm_user_return_msrs __percpu
> *user_return_msrs;
> > #define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP |
> XFEATURE_MASK_SSE \
> > | XFEATURE_MASK_YMM |
> XFEATURE_MASK_BNDREGS \
> > | XFEATURE_MASK_BNDCSR |
> XFEATURE_MASK_AVX512 \
> > - | XFEATURE_MASK_PKRU)
> > + | XFEATURE_MASK_PKRU |
> XFEATURE_MASK_XTILE)
> >
> > u64 __read_mostly host_efer;
> > EXPORT_SYMBOL_GPL(host_efer);
> > @@ -990,6 +990,12 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu,
> u32 index, u64 xcr)
> > if ((xcr0 & XFEATURE_MASK_AVX512) !=
> XFEATURE_MASK_AVX512)
> > return 1;
> > }
> > +
> > +#ifdef CONFIG_X86_64
>
> Drop the #ifdef, it adds no meaningful value and requires the reader to think
> far harder than they should have. Yes, it's technically dead code for 32-bit
> KVM,
> but no one cares about performance of 32-bit KVM, and in any case it's
> extremely
> unlikely this will be anything but noise.
ok
>
> > + if ((xcr0 & XFEATURE_MASK_XTILE) &&
> > + ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE))
> > + return 1;
> > +#endif
> > vcpu->arch.xcr0 = xcr0;
> >
> > if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
> > --
> > 2.27.0
> >
> From: Sean Christopherson <[email protected]>
> Sent: Wednesday, December 29, 2021 9:05 AM
>
> On Wed, Dec 22, 2021, Jing Liu wrote:
> > @@ -1968,6 +1969,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
> > case MSR_IA32_XFD:
> > ret = kvm_set_msr_common(vcpu, msr_info);
> > if (!ret && data) {
> > + vmx_disable_intercept_for_msr(vcpu,
> MSR_IA32_XFD, MSR_TYPE_RW);
> > + vcpu->arch.xfd_out_of_sync = true;
>
> xfd_out_of_sync is a poor name, as XFD _may_ be out of sync, or it may not.
> It's
> also confusing that it's kept set after XFD is explicitly synchronized in
> vcpu_enter_guest().
yes, sync_xfd_after_exit might be more accurate.
>
> > +
> > vcpu->arch.trap_nm = true;
> > vmx_update_exception_bitmap(vcpu);
>
> Ah, this is why #NM interception was made sticky many patches ago. More
> at the end.
>
> > }
> > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > index bf9d3051cd6c..0a00242a91e7 100644
> > --- a/arch/x86/kvm/vmx/vmx.h
> > +++ b/arch/x86/kvm/vmx/vmx.h
> > @@ -340,7 +340,7 @@ struct vcpu_vmx {
> > struct lbr_desc lbr_desc;
> >
> > /* Save desired MSR intercept (read: pass-through) state */
> > -#define MAX_POSSIBLE_PASSTHROUGH_MSRS 14
> > +#define MAX_POSSIBLE_PASSTHROUGH_MSRS 15
> > struct {
> > DECLARE_BITMAP(read,
> MAX_POSSIBLE_PASSTHROUGH_MSRS);
> > DECLARE_BITMAP(write,
> MAX_POSSIBLE_PASSTHROUGH_MSRS);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 3b756ff13103..10a08aa2aa45 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10024,6 +10024,9 @@ static int vcpu_enter_guest(struct kvm_vcpu
> *vcpu)
> > if (vcpu->arch.guest_fpu.xfd_err)
> > wrmsrl(MSR_IA32_XFD_ERR, 0);
> >
> > + if (vcpu->arch.xfd_out_of_sync)
>
> Rather than adding a flag that tracks whether or not the MSR can be written
> by
> the guest, can't this be:
>
> if (!vmx_test_msr_bitmap_write(vcpu->loaded_vmcs->msr_bitmap))
> fpu_sync_guest_vmexit_xfd_state();
>
We can use this
> That might be marginally slower than checking a dedicated flag? But is has
> the
> advantage of doing the correct thing for nested guests instead of penalizing
> them
> with an unnecessary sync on every exit. If performance of the check is an
> issue,
> we could add a static key to skip the code unless at least one vCPU has
> triggered
> the XFD crud, a la kvm_has_noapic_vcpu (which may or may not provide any
> real
> performance benefits).
>
> Speaking of nested, interception of #NM in vmx_update_exception_bitmap()
> is wrong
> with respect to nested guests. Until XFD is supported for L2, which I didn't
> see
> in this series, #NM should not be intercepted while L2 is running.
Can you remind what additional thing is required to support XFD for L2?
If only about performance I prefer to the current conservative approach
as the first step. As explained earlier, #NM should be rare if the guest
doesn't run AMX applications at all. Adding nested into this picture doesn't
make things a lot worser.
>
> For the earlier patch that introduced arch.trap_nm, if it's not too gross and
> not
> racy, the code could be:
>
> if (is_guest_mode(vcpu))
> eb |= get_vmcs12(vcpu)->exception_bitmap;
> else {
> ...
>
> if (vcpu->arch.guest_fpu.fpstate.xfd)
> eb |= (1u << NM_VECTOR);
> }
>
> Though I'm ok with a semi-temporary flag if that's gross/racy.
>
> Then this patch can change it to:
>
> if (is_guest_mode(vcpu))
> eb |= get_vmcs12(vcpu)->exception_bitmap;
> else {
> ...
>
> if (!vmx_test_msr_bitmap_write(vcpu->vmcs01.msr_bitmap))
> eb |= (1u << NM_VECTOR);
> }
>
> > + fpu_sync_guest_vmexit_xfd_state();
> > +
> > /*
> > * Consume any pending interrupts, including the possible source of
> > * VM-Exit on SVM and any ticks that occur between VM-Exit and
> now.
> > --
> > 2.27.0
> >
Thanks
Kevin
> From: Wang, Wei W <[email protected]>
> Sent: Wednesday, December 29, 2021 10:58 AM
>
> > > Reuse KVM_SET_XSAVE for both old/new formats by reimplementing it to
> > > do properly-sized memdup_user() based on the guest fpu container.
> >
> > I'm confused, the first sentence says KVM_SET_XSAVE isn't suitable, the
> > second says it can be reused with minimal effort.
>
> Probably "doesn't support" sounds better than "isn't suitable" above. But
> plan to reword a bit:
>
> With KVM_CAP_XSAVE, userspace uses a hardcoded 4KB buffer to get/set
> xstate data from/to
> KVM. This doesn't work when dynamic features (e.g. AMX) are used by the
> guest, as KVM uses
> a full expanded xstate buffer for the guest fpu emulation, which is larger
> than 4KB.
>
> Add KVM_CAP_XSAVE2, and userspace gets the required xstate buffer size
> from KVM via
> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2). KVM_SET_XSAVE is extended
> with the support to
> work with larger xstate data size passed from userspace. KVM_GET_XSAVE2
> is preferred to
> extending KVM_GET_XSAVE to work with large buffer size for backward-
> compatible considerations.
> (Link: https://lkml.org/lkml/2021/12/15/510)
>
> Also, update the api doc with the new KVM_GET_XSAVE2 ioctl.
>
Revised to:
--
With KVM_CAP_XSAVE, userspace uses a hardcoded 4KB buffer to get/set
xstate data from/to KVM. This doesn't work when dynamic xfeatures
(e.g. AMX) are exposed to the guest as they require a larger buffer
size.
Introduce a new capability (KVM_CAP_XSAVE2). Userspace VMM gets the
required xstate buffer size via KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
KVM_SET_XSAVE is extended to work with both legacy and new capabilities
by doing properly-sized memdup_user() based on the guest fpu container.
KVM_GET_XSAVE is kept for backward-compatible reason. Instead,
KVM_GET_XSAVE2 is introduced under KVM_CAP_XSAVE2 as the preferred
interface for getting xstate buffer (4KB or larger size) from KVM.
(Link: https://lkml.org/lkml/2021/12/15/510)
Also, update the api doc with the new KVM_GET_XSAVE2 ioctl
--
Thanks
Kevin
> From: Tian, Kevin
> Sent: Wednesday, December 29, 2021 10:53 AM
> > > + case MSR_IA32_XFD:
> > > + ret = kvm_set_msr_common(vcpu, msr_info);
> > > + if (!ret && data) {
> > > + vcpu->arch.trap_nm = true;
> > > + vmx_update_exception_bitmap(vcpu);
> >
> > This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if
> > the
> > MSR is cleared.
>
> In concept you are right if just looking at this patch. It's pointless to
> trap #NM if guest xfd is cleared.
>
> But here we need think about patch22 which disables write interception
> for xfd. With that in consideration we use the 1st non-zero write as the
> hint indicating that guest might enable xfd-related usages thus always
> trap #NM after this point.
>
> It's not a good ordering, but Paolo wants to put the optimization in the
> end of this series. But we do need to put a clear comment here explaining
> the always-trap policy.
Given write emulation of XFD is not disabled in this patch, it reads cleaner
to always update exception bitmap according to guest xfd value at this
stage. So we will follow your suggestion here and then change to check
msr bitmap when write emulation is disabled in patch22.
Thanks
Kevin
> From: Tian, Kevin
> Sent: Wednesday, December 29, 2021 11:35 AM
> >
> > Speaking of nested, interception of #NM in
> vmx_update_exception_bitmap()
> > is wrong
> > with respect to nested guests. Until XFD is supported for L2, which I didn't
> > see
> > in this series, #NM should not be intercepted while L2 is running.
>
> Can you remind what additional thing is required to support XFD for L2?
ok, at least we require additional work on when to disable write interception.
It can be done only when both L0 and L1 make the same decision, just like
what we did for many other VMCS settings...
> If only about performance I prefer to the current conservative approach
> as the first step. As explained earlier, #NM should be rare if the guest
> doesn't run AMX applications at all. Adding nested into this picture doesn't
> make things a lot worser.
All your comments incorporated except this one. As said always trapping
#NM for L2 is not a big problem, as long as it's rare and don't break function.
Usually a relatively static scheme is safer than a dynamic one in case of
anything overlooked for the initial implementation. ????
Thanks
Kevin
> From: Sean Christopherson <[email protected]>
> Sent: Wednesday, December 29, 2021 9:05 AM
> > + if (vcpu->arch.xfd_out_of_sync)
>
> Rather than adding a flag that tracks whether or not the MSR can be written
> by
> the guest, can't this be:
>
> if (!vmx_test_msr_bitmap_write(vcpu->loaded_vmcs->msr_bitmap))
> fpu_sync_guest_vmexit_xfd_state();
>
and forgot to mention a change different from above. It's in x86
common exit path but above is vmx specific check. I'm not sure
whether it's worthy of introducing another kvm_x86_ops callback
just for this minor usage.
Introducing an extra flag (e.g. vcpu->arch.xfd_disable_interception)
sounds simpler here.
Thanks
Kevin
> From: Sean Christopherson <[email protected]>
> Sent: Wednesday, December 29, 2021 8:10 AM
>
> But why even bother with an extra flag? Can't
> vmx_update_exception_bitmap() get
> the guest's MSR_IA32_XFD value and intercept #NM accordingly? Then you
> could
> even handle this fully in kvm_set_msr_common(), e.g.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2c9606380bca..c6c936d2b298 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
> return 1;
>
> fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
> + /* Blah blah blah blah */
> + static_call(kvm_x86_update_exception_bitmap)(vcpu);
> break;
btw follow Paolo's suggestion here:
https://lore.kernel.org/all/[email protected]/
He prefers to disabling xfd interception in vmx specific code instead of
introducing a new callback to be called in common code. Since we want
to always trap #NM after xfd interception is disabled (which is done
after calling the msr common code), it also reads cleaner to call
update_exception_bitmap() from vmx.c instead of here.
Thanks
Kevin
On Wed, Dec 29, 2021, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Wednesday, December 29, 2021 11:35 AM
> > >
> > > Speaking of nested, interception of #NM in
> > vmx_update_exception_bitmap()
> > > is wrong
> > > with respect to nested guests. Until XFD is supported for L2, which I didn't
> > > see
> > > in this series, #NM should not be intercepted while L2 is running.
> >
> > Can you remind what additional thing is required to support XFD for L2?
>
> ok, at least we require additional work on when to disable write interception.
> It can be done only when both L0 and L1 make the same decision, just like
> what we did for many other VMCS settings...
I'm not terribly concerned with exposing XFD to L2 right now, I'm much more
concerned with not breaking nVMX when XFD is exposed to L1.
> > If only about performance I prefer to the current conservative approach
> > as the first step. As explained earlier, #NM should be rare if the guest
> > doesn't run AMX applications at all. Adding nested into this picture doesn't
> > make things a lot worser.
>
> All your comments incorporated except this one. As said always trapping
> #NM for L2 is not a big problem, as long as it's rare and don't break function.
> Usually a relatively static scheme is safer than a dynamic one in case of
> anything overlooked for the initial implementation. ????
I strongly disagree, it's not automagically safer. Either way, we need to validate
that KVM does the correct thing with respect to vmcs12. E.g. does KVM correctly
reflect the #NM back to L2 when it's not intercepted by L1? Does it synthesize a
nested VM-Exit when it is intercepted by L1? The testing required doesn't magically
go away.
As posted, there is zero chance that the patches correctly handling #NM in L2
because nested_vmx_l0_wants_exit() doesn't prevent an #NM from being forwarded
to L1. E.g. if L0 (KVM) and L1 both intercept #NM, handle_exception_nm() will
queue a #NM for injection and then syntehsizea nested VM-Exit, i.e. the #NM from
L2 will get injected into L1 after the nested exit.
That also means handle_exception_nmi_irqoff() => handle_exception_nm() is fatally
broken on non-XFD hardware, as it will attempt RDMSR(MSR_IA32_XFD_ERR) if L1
intercepts #NM since handle_exception_nmi_irqoff() will run before
__vmx_handle_exit() => nested_vmx_reflect_vmexit() checks whether L0 or L1 should
handle the exit.
On Wed, Dec 29, 2021, Tian, Kevin wrote:
> > From: Sean Christopherson <[email protected]>
> > Sent: Wednesday, December 29, 2021 8:10 AM
> >
> > On Wed, Dec 22, 2021, Jing Liu wrote:
> > > Guest IA32_XFD_ERR is generally modified in two places:
> > >
> > > - Set by CPU when #NM is triggered;
> > > - Cleared by guest in its #NM handler;
> > >
> > > Intercept #NM for the first case, if guest writes XFD as nonzero for
> > > the first time which indicates guest is possible to use XFD generating
> > > the exception. #NM is rare if the guest doesn't use dynamic features.
> > > Otherwise, there is at most one exception per guest task given a
> > > dynamic feature.
> > >
> > > Save the current XFD_ERR value to the guest_fpu container in the #NM
> > > VM-exit handler. This must be done with interrupt/preemption disabled,
> >
> > Assuming my below understanding is correct, drop the "preemption" bit, it's
> > misleading.
>
> code-wise yes. In concept we just want to highlight that this operation
> must be completed when both interrupt and preemption are disabled.
No no no no no. Yes, disabling IRQs also disables preemption, but that's not at
all relevant, e.g. KVM could handle preemption via kvm_sched_{in,out}(). Handling
this with IRQs disable is 100% mandatory because MSR_IA32_XFD_ERR can be indirectly
consumed in (soft) IRQ context, end of story.
> But we can also drop preemption if you prefer to, since preemption is
> certainly disabled when interrupt is disabled.
> From: Sean Christopherson <[email protected]>
> Sent: Thursday, December 30, 2021 1:26 AM
>
> On Wed, Dec 29, 2021, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Wednesday, December 29, 2021 11:35 AM
> > > >
> > > > Speaking of nested, interception of #NM in
> > > vmx_update_exception_bitmap()
> > > > is wrong
> > > > with respect to nested guests. Until XFD is supported for L2, which I
> didn't
> > > > see
> > > > in this series, #NM should not be intercepted while L2 is running.
> > >
> > > Can you remind what additional thing is required to support XFD for L2?
> >
> > ok, at least we require additional work on when to disable write
> interception.
> > It can be done only when both L0 and L1 make the same decision, just like
> > what we did for many other VMCS settings...
>
> I'm not terribly concerned with exposing XFD to L2 right now, I'm much more
> concerned with not breaking nVMX when XFD is exposed to L1.
>
> > > If only about performance I prefer to the current conservative approach
> > > as the first step. As explained earlier, #NM should be rare if the guest
> > > doesn't run AMX applications at all. Adding nested into this picture
> doesn't
> > > make things a lot worser.
> >
> > All your comments incorporated except this one. As said always trapping
> > #NM for L2 is not a big problem, as long as it's rare and don't break
> function.
> > Usually a relatively static scheme is safer than a dynamic one in case of
> > anything overlooked for the initial implementation. ????
>
> I strongly disagree, it's not automagically safer. Either way, we need to
> validate
> that KVM does the correct thing with respect to vmcs12. E.g. does KVM
> correctly
> reflect the #NM back to L2 when it's not intercepted by L1? Does it
> synthesize a
> nested VM-Exit when it is intercepted by L1? The testing required doesn't
> magically
> go away.
>
> As posted, there is zero chance that the patches correctly handling #NM in L2
> because nested_vmx_l0_wants_exit() doesn't prevent an #NM from being
> forwarded
> to L1. E.g. if L0 (KVM) and L1 both intercept #NM, handle_exception_nm()
> will
> queue a #NM for injection and then syntehsizea nested VM-Exit, i.e. the #NM
> from
> L2 will get injected into L1 after the nested exit.
Yes, it's incorrect behavior.
>
> That also means handle_exception_nmi_irqoff() => handle_exception_nm()
> is fatally
> broken on non-XFD hardware, as it will attempt RDMSR(MSR_IA32_XFD_ERR)
> if L1
> intercepts #NM since handle_exception_nmi_irqoff() will run before
> __vmx_handle_exit() => nested_vmx_reflect_vmexit() checks whether L0 or
> L1 should
> handle the exit.
Ditto. Thanks for pointing out those facts that we obviously overlooked.
So handle_exception_nm() should neither blindly read xfd_err (#NM might be
caused by L1 interception on a non-xfd platform) nor blindly queue a #NM
exception (triggered in L2) to L1 which intercepts #NM (then expects vm-exit)
The former suggests that reading xfd_err should be made conditionally
similar to what we did in vmx_update_exception_bitmap(). The latter
suggests the actual exception is better postponed until __vmx_handle_exit().
We are looking at this change now.
And once #NM handler works correctly to handle interception by either L0
or L1, I'm not sure whether we still want to special case L1 vs. L2 in
vmx_update_exception_bitmap(), since in the end L0 wants to intercept
#NM to save xfd_err for both L1 and L2 (if exposed with xfd capability by L1).
Thanks
Kevin
> From: Tian, Kevin
> Sent: Thursday, December 30, 2021 9:28 AM
>
> > As posted, there is zero chance that the patches correctly handling #NM in
> L2
> > because nested_vmx_l0_wants_exit() doesn't prevent an #NM from being
> > forwarded
> > to L1. E.g. if L0 (KVM) and L1 both intercept #NM, handle_exception_nm()
> > will
> > queue a #NM for injection and then syntehsizea nested VM-Exit, i.e. the
> #NM
> > from
> > L2 will get injected into L1 after the nested exit.
>
> Yes, it's incorrect behavior.
>
> >
> > That also means handle_exception_nmi_irqoff() => handle_exception_nm()
> > is fatally
> > broken on non-XFD hardware, as it will attempt
> RDMSR(MSR_IA32_XFD_ERR)
> > if L1
> > intercepts #NM since handle_exception_nmi_irqoff() will run before
> > __vmx_handle_exit() => nested_vmx_reflect_vmexit() checks whether L0 or
> > L1 should
> > handle the exit.
>
> Ditto. Thanks for pointing out those facts that we obviously overlooked.
>
> So handle_exception_nm() should neither blindly read xfd_err (#NM might
> be
> caused by L1 interception on a non-xfd platform) nor blindly queue a #NM
> exception (triggered in L2) to L1 which intercepts #NM (then expects vm-exit)
>
> The former suggests that reading xfd_err should be made conditionally
> similar to what we did in vmx_update_exception_bitmap(). The latter
> suggests the actual exception is better postponed until __vmx_handle_exit().
>
> We are looking at this change now.
>
> And once #NM handler works correctly to handle interception by either L0
> or L1, I'm not sure whether we still want to special case L1 vs. L2 in
> vmx_update_exception_bitmap(), since in the end L0 wants to intercept
> #NM to save xfd_err for both L1 and L2 (if exposed with xfd capability by L1).
>
the new change is like below.
static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
{
/*
* Save xfd_err to guest_fpu before interrupt is enabled, so the
* guest value is not clobbered by the host activity before the guest
* has chance to consume it.
*
* Since trapping #NM is started when xfd write interception is
* disabled, using this flag to guard the saving operation. This
* implies no-op for a non-xfd #NM due to L1 interception.
*
* Queuing exception is done in vmx_handle_exit.
*/
if (vcpu->arch.xfd_no_write_intercept)
rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}
in the final series it will first check vcpu->arch.guest_fpu.fpstate->xfd
before the disable interception patch is applied and then becomes
the above form, similar to your suggestion on vmx_update_exception_bitmap().
whether to check msr_bitmap vs. an extra flag is an orthogonal open.
Then:
handle_exception_nmi(struct kvm_vcpu *vcpu)
{
...
if (is_machine_check(intr_info) || is_nmi(intr_info))
return 1; /* handled by handle_exception_nmi_irqoff() */
/*
* Queue the exception here instead of in handle_nm_fault_irqoff().
* This ensures the nested_vmx check is not skipped so vmexit can
* be reflected to L1 (when it intercepts #NM) before reaching this
* point.
*/
if (is_nm_fault(intr_info)) {
kvm_queue_exception(vcpu, NM_VECTOR);
return 1;
}
...
}
Then regarding to test non-AMX nested #NM usage, it might be difficult
to trigger it from modern OS. As commented by Linux #NM handler, it's
expected only for XFD or math emulation when fpu is missing. So we plan
to run a selftest in L1 which sets CR0.TS and then touch fpu registers. and
for L1 kernel we will run two binaries with one trapping #NM and the other
not.
Please let us know if you have a better suggestion for the testing here.
Thanks
Kevin
> From: Tian, Kevin
> Sent: Thursday, December 30, 2021 3:05 PM
>
> the new change is like below.
>
> static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
> {
> /*
> * Save xfd_err to guest_fpu before interrupt is enabled, so the
> * guest value is not clobbered by the host activity before the guest
> * has chance to consume it.
> *
> * Since trapping #NM is started when xfd write interception is
> * disabled, using this flag to guard the saving operation. This
> * implies no-op for a non-xfd #NM due to L1 interception.
> *
> * Queuing exception is done in vmx_handle_exit.
> */
> if (vcpu->arch.xfd_no_write_intercept)
> rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> }
>
> in the final series it will first check vcpu->arch.guest_fpu.fpstate->xfd
> before the disable interception patch is applied and then becomes
> the above form, similar to your suggestion on
> vmx_update_exception_bitmap().
>
> whether to check msr_bitmap vs. an extra flag is an orthogonal open.
>
> Then:
>
> handle_exception_nmi(struct kvm_vcpu *vcpu)
> {
> ...
> if (is_machine_check(intr_info) || is_nmi(intr_info))
> return 1; /* handled by handle_exception_nmi_irqoff() */
>
> /*
> * Queue the exception here instead of in handle_nm_fault_irqoff().
> * This ensures the nested_vmx check is not skipped so vmexit can
> * be reflected to L1 (when it intercepts #NM) before reaching this
> * point.
> */
> if (is_nm_fault(intr_info)) {
> kvm_queue_exception(vcpu, NM_VECTOR);
> return 1;
> }
>
> ...
> }
>
> Then regarding to test non-AMX nested #NM usage, it might be difficult
> to trigger it from modern OS. As commented by Linux #NM handler, it's
> expected only for XFD or math emulation when fpu is missing. So we plan
> to run a selftest in L1 which sets CR0.TS and then touch fpu registers. and
> for L1 kernel we will run two binaries with one trapping #NM and the other
> not.
>
We have verified this scenario and didn't find problem.
Basically the selftest is like below:
guest_code()
{
cr0 = read_cr0();
cr0 |= X86_CR0_TS;
write_cr0(cr0);
asm volatile("fnop");
}
guest_nm_handler()
{
cr0 = read_cr0();
cr0 &= ~X86_CR0_TS;
write_cr0(cr0);
}
We run the selftest in L1 to create a nested scenario.
When L1 intercepts #NM:
(L2) fnop
(L0) #NM vmexit
(L0) reflect a virtual vmexit (reason #NM) to L1
(L1) #NM vmexit
(L1) queue #NM exception to L2
(L2) guest_nm_handler()
(L2) fnop (succeed)
When L1 doesn't intercept #NM:
(L2) fnop
(L0) #NM vmexit
(L0) queue #NM exception to L2
(L2) guest_nm_handler()
(L2) fnop (succeed)
Please suggest if any more test is necessary here.
Thanks
Kevin
On 12/29/21 02:04, Sean Christopherson wrote:
>
> Speaking of nested, interception of #NM in vmx_update_exception_bitmap() is wrong
> with respect to nested guests. Until XFD is supported for L2, which I didn't see
> in this series, #NM should not be intercepted while L2 is running.
Why wouldn't L2 support XFD, since there are no new VMCS bits? As long
as L0 knows what to do with XFD and XFD_ERR, it will do the right thing
no matter if L1 or L2 is running.
Paolo
On Tue, Jan 04, 2022, Paolo Bonzini wrote:
> On 12/29/21 02:04, Sean Christopherson wrote:
> >
> > Speaking of nested, interception of #NM in vmx_update_exception_bitmap() is wrong
> > with respect to nested guests. Until XFD is supported for L2, which I didn't see
> > in this series, #NM should not be intercepted while L2 is running.
>
> Why wouldn't L2 support XFD, since there are no new VMCS bits? As long as
> L0 knows what to do with XFD and XFD_ERR, it will do the right thing no
> matter if L1 or L2 is running.
I incorrectly assumed there was something L0 needed to do in order to support
nested XFD.