2021-12-07 15:09:08

by Yang Zhong

Subject: [PATCH 00/19] AMX Support in KVM

(sent on behalf of Jing, who is currently on leave)

This series brings AMX (Advanced Matrix eXtensions) virtualization
support to KVM. The preparation patches in the fpu core from
Thomas [1] are also included.

A large portion of the changes in this series deals with eXtended
Feature Disable (XFD), which allows resizing the fpstate buffer to
support dynamically-enabled XSTATE features with large state components
(e.g. 8K for AMX).

The support is based on several key changes (design discussions can be
found in [2]):

- Guest permissions for dynamically-enabled XSAVE features

Native tasks have to request permission via prctl() before touching
a dynamically-resized XSTATE component. Introduce guest permissions
for a similar purpose. The userspace VMM is expected to request guest
permission only once when the first vCPU is created.

KVM checks guest permission in KVM_SET_CPUID2. Setting XFD in guest
cpuid w/o proper permissions fails this operation.
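
For illustration, below is a minimal sketch of the expected VMM-side
usage, assuming the arch_prctl() syscall path and the constants added
in patch 1 (the function name is hypothetical):

  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define ARCH_GET_XCOMP_GUEST_PERM  0x1024
  #define ARCH_REQ_XCOMP_GUEST_PERM  0x1025
  #define XFEATURE_XTILE_DATA        18   /* AMX TILEDATA component */

  /* Request AMX for the guest once, before creating the first vCPU */
  static int vmm_request_amx_perm(void)
  {
          uint64_t perm = 0;

          if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM,
                      XFEATURE_XTILE_DATA))
                  return -1;

          /* Read back the guest permission bitmap and verify */
          if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_GUEST_PERM, &perm))
                  return -1;

          return (perm & (1ULL << XFEATURE_XTILE_DATA)) ? 0 : -1;
  }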

- Extend fpstate reallocation mechanism to cover guest fpu

Unlike native tasks which have reallocation triggered from the #NM
handler, guest fpstate reallocation is requested by KVM when it
detects the guest's intention to use dynamically-enabled XSAVE
features.

The reallocation request is handled when exiting to the userspace
VMM. This implies that KVM must break the vcpu_run() loop and exit
to the userspace VMM instead of immediately resuming the guest
when reallocation is required.

- Detect fpstate reallocation in the emulation code

Because guest #NM is not trapped by KVM (trapping it would be costly),
the guest's intention to use a dynamically-enabled XSAVE feature[i]
can be indirectly represented by guest XCR0[i]=1 and XFD[i]=0. This
requires the emulation logic of both WRMSR(IA32_XFD) and XSETBV
to check the reallocation requirement when either of the two
conditions changes.
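
In condensed form the check is (a sketch only; patch 9 adds the real
helper, kvm_check_guest_realloc_fpstate()):

  /* Dynamic feature i is being activated when the guest enables it
   * in XCR0 and no longer defers it via XFD. */
  static bool guest_activates_feature(u64 xcr0, u64 xfd, int i)
  {
          return (xcr0 & BIT_ULL(i)) && !(xfd & BIT_ULL(i));
  }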

- Disable WRMSR interception for IA32_XFD

IA32_XFD can be updated frequently by the guest, as it is part of
the task state and is swapped on context switch when prev and next
tasks have different XFD settings. Always intercepting WRMSR can
easily cause non-negligible overhead.

Disable WRMSR interception for IA32_XFD after fpstate reallocation
succeeds. After that point the guest directly writes IA32_XFD without
causing VM-exits.

However MSR passthrough implies that guest_fpstate::xfd and per-cpu
xfd cache might be out of sync with the current IA32_XFD value set by
the guest. This suggests KVM needs to re-sync the software state
with IA32_XFD before the vCPU thread might be preempted or interrupted.

- Save/restore guest XFD_ERR

When XFD causes an instruction to generate #NM, XFD_ERR contains
information about which disabled state components are being accessed.
The #NM handler is expected to check this information and then enable
the state components by clearing IA32_XFD for the faulting task (if
it has permission).

#NM can be triggered in both host and guest. It'd be problematic if
the XFD_ERR value generated in the guest is consumed/clobbered by the
host before the guest itself does so. This may lead to a non-XFD-
related #NM being treated as an XFD #NM in the host (due to the guest
XFD_ERR value), or an XFD-related #NM being treated as a non-XFD #NM
in the guest (XFD_ERR cleared by the host #NM handler).

KVM needs to save the guest XFD_ERR value before this register
might be accessed by the host and restore it before entering the
guest.

One open question remains in this area: when to start saving/restoring
guest XFD_ERR. Several options are discussed in patch 15.

- Expose related cpuid bits to guest

The last step is to allow exposing XFD, AMX_TILE, AMX_INT8 and
AMX_BF16 in guest cpuid. Adding those bits into kvm_cpu_caps finally
activates all the preceding logic in this series.

To verify the AMX virtualization overhead on non-AMX usages, we run the
Phoronix kernel build test in the guest w/ and w/o AMX in cpuid. The
result shows no observable difference between the two configurations.

Live migration support is still being worked on. The userspace VMM needs
to use the new KVM_{G|S}ET_XSAVE2 ioctls in this series to migrate state
for dynamically-enabled XSAVE features.

Thanks Thomas for the thoughts and patches on the KVM FPU and AMX
support. Thanks Jun Nakajima for the design suggestions.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/fpu-kvm
[2] https://www.spinics.net/lists/kvm/msg259015.html

Thanks,
Yang

---
Jing Liu (13):
kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule
kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2
x86/fpu: Move xfd initialization out of __fpstate_reset() to the
callers
kvm: x86: Propagate fpstate reallocation error to userspace
x86/fpu: Move xfd_update_state() to xstate.c and export symbol
kvm: x86: Prepare reallocation check
kvm: x86: Emulate WRMSR of guest IA32_XFD
kvm: x86: Disable WRMSR interception for IA32_XFD on demand
x86/fpu: Prepare for KVM XFD_ERR handling
kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
docs: virt: api.rst: Document the new KVM_{G, S}ET_XSAVE2 ioctls
kvm: x86: AMX XCR0 support for guest
kvm: x86: Add AMX CPUIDs support

Thomas Gleixner (4):
x86/fpu: Extend prctl() with guest permissions
x86/fpu: Prepare KVM for dynamically enabled states
x86/fpu: Add reallocation mechanism for KVM
x86/fpu: Prepare KVM for bringing XFD state back in-sync

Yang Zhong (2):
kvm: x86: Check fpstate reallocation in XSETBV emulation
kvm: x86: Save and restore guest XFD_ERR properly

Documentation/virt/kvm/api.rst | 47 +++++++
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/fpu/api.h | 12 ++
arch/x86/include/asm/fpu/types.h | 56 +++++++++
arch/x86/include/asm/fpu/xstate.h | 2 +
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/uapi/asm/kvm.h | 6 +
arch/x86/include/uapi/asm/prctl.h | 26 ++--
arch/x86/kernel/fpu/core.c | 109 ++++++++++++++++-
arch/x86/kernel/fpu/xstate.c | 119 +++++++++++++++---
arch/x86/kernel/fpu/xstate.h | 29 +++--
arch/x86/kernel/process.c | 2 +
arch/x86/kvm/cpuid.c | 36 +++++-
arch/x86/kvm/vmx/vmx.c | 20 +++
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 189 ++++++++++++++++++++++++++++-
arch/x86/kvm/x86.h | 2 +
include/uapi/linux/kvm.h | 8 +-
19 files changed, 607 insertions(+), 63 deletions(-)



2021-12-07 15:09:13

by Yang Zhong

Subject: [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions

From: Thomas Gleixner <[email protected]>

Add guest permission control for dynamic XSTATE components, including
extension to prctl() with two new options (ARCH_GET_XCOMP_GUEST_PERM
and ARCH_REQ_XCOMP_GUEST_PERM) and to struct fpu with a new member
(guest_perm).

Userspace VMM has to request guest permissions before it exposes any
XSAVE feature using dynamic XSTATE components. The permission can be
set only once when the first vCPU is created. A new flag
FPU_GUEST_PERM_LOCKED is introduced to lock the change for this purpose.

Similar to native permissions, this doesn't actually enable the
permitted feature. KVM is expected to install a larger kernel buffer
and enable the feature when it detects the intention from the guest.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
(To Thomas) We moved the declaration of xstate_get_guest_group_perm()
from xstate.h to api.h since it will be called by KVM.

arch/x86/include/asm/fpu/api.h | 2 ++
arch/x86/include/asm/fpu/types.h | 9 ++++++
arch/x86/include/uapi/asm/prctl.h | 26 ++++++++--------
arch/x86/kernel/fpu/core.c | 3 ++
arch/x86/kernel/fpu/xstate.c | 50 +++++++++++++++++++++++--------
arch/x86/kernel/fpu/xstate.h | 13 ++++++--
arch/x86/kernel/process.c | 2 ++
7 files changed, 78 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 6053674f9132..7532f73c82a6 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -138,6 +138,8 @@ static inline void fpstate_free(struct fpu *fpu) { }
/* fpstate-related functions which are exported to KVM */
extern void fpstate_clear_xstate_component(struct fpstate *fps, unsigned int xfeature);

+extern inline u64 xstate_get_guest_group_perm(void);
+
/* KVM specific functions */
extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 3c06c82ab355..6ddf80637697 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -387,6 +387,8 @@ struct fpstate {
/* @regs is dynamically sized! Don't add anything after @regs! */
} __aligned(64);

+#define FPU_GUEST_PERM_LOCKED BIT_ULL(63)
+
struct fpu_state_perm {
/*
* @__state_perm:
@@ -476,6 +478,13 @@ struct fpu {
*/
struct fpu_state_perm perm;

+ /*
+ * @guest_perm:
+ *
+ * Permission related information for guest pseudo FPUs
+ */
+ struct fpu_state_perm guest_perm;
+
/*
* @__fpstate:
*
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 754a07856817..500b96e71f18 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -2,20 +2,22 @@
#ifndef _ASM_X86_PRCTL_H
#define _ASM_X86_PRCTL_H

-#define ARCH_SET_GS 0x1001
-#define ARCH_SET_FS 0x1002
-#define ARCH_GET_FS 0x1003
-#define ARCH_GET_GS 0x1004
+#define ARCH_SET_GS 0x1001
+#define ARCH_SET_FS 0x1002
+#define ARCH_GET_FS 0x1003
+#define ARCH_GET_GS 0x1004

-#define ARCH_GET_CPUID 0x1011
-#define ARCH_SET_CPUID 0x1012
+#define ARCH_GET_CPUID 0x1011
+#define ARCH_SET_CPUID 0x1012

-#define ARCH_GET_XCOMP_SUPP 0x1021
-#define ARCH_GET_XCOMP_PERM 0x1022
-#define ARCH_REQ_XCOMP_PERM 0x1023
+#define ARCH_GET_XCOMP_SUPP 0x1021
+#define ARCH_GET_XCOMP_PERM 0x1022
+#define ARCH_REQ_XCOMP_PERM 0x1023
+#define ARCH_GET_XCOMP_GUEST_PERM 0x1024
+#define ARCH_REQ_XCOMP_GUEST_PERM 0x1025

-#define ARCH_MAP_VDSO_X32 0x2001
-#define ARCH_MAP_VDSO_32 0x2002
-#define ARCH_MAP_VDSO_64 0x2003
+#define ARCH_MAP_VDSO_X32 0x2001
+#define ARCH_MAP_VDSO_32 0x2002
+#define ARCH_MAP_VDSO_64 0x2003

#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8ea306b1bf8e..ab19b3d8b2f7 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -450,6 +450,8 @@ void fpstate_reset(struct fpu *fpu)
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
fpu->perm.__state_size = fpu_kernel_cfg.default_size;
fpu->perm.__user_state_size = fpu_user_cfg.default_size;
+ /* Same defaults for guests */
+ fpu->guest_perm = fpu->perm;
}

static inline void fpu_inherit_perms(struct fpu *dst_fpu)
@@ -460,6 +462,7 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
spin_lock_irq(&current->sighand->siglock);
/* Fork also inherits the permissions of the parent */
dst_fpu->perm = src_fpu->perm;
+ dst_fpu->guest_perm = src_fpu->guest_perm;
spin_unlock_irq(&current->sighand->siglock);
}
}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed0..9856d579aa6e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1595,7 +1595,7 @@ static int validate_sigaltstack(unsigned int usize)
return 0;
}

-static int __xstate_request_perm(u64 permitted, u64 requested)
+static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
{
/*
* This deliberately does not exclude !XSAVES as we still might
@@ -1605,6 +1605,7 @@ static int __xstate_request_perm(u64 permitted, u64 requested)
*/
bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
struct fpu *fpu = &current->group_leader->thread.fpu;
+ struct fpu_state_perm *perm;
unsigned int ksize, usize;
u64 mask;
int ret;
@@ -1621,15 +1622,18 @@ static int __xstate_request_perm(u64 permitted, u64 requested)
mask &= XFEATURE_MASK_USER_SUPPORTED;
usize = xstate_calculate_size(mask, false);

- ret = validate_sigaltstack(usize);
- if (ret)
- return ret;
+ if (!guest) {
+ ret = validate_sigaltstack(usize);
+ if (ret)
+ return ret;
+ }

+ perm = guest ? &fpu->guest_perm : &fpu->perm;
/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
- WRITE_ONCE(fpu->perm.__state_perm, requested);
+ WRITE_ONCE(perm->__state_perm, requested);
/* Protected by sighand lock */
- fpu->perm.__state_size = ksize;
- fpu->perm.__user_state_size = usize;
+ perm->__state_size = ksize;
+ perm->__user_state_size = usize;
return ret;
}

@@ -1640,7 +1644,7 @@ static const u64 xstate_prctl_req[XFEATURE_MAX] = {
[XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA,
};

-static int xstate_request_perm(unsigned long idx)
+static int xstate_request_perm(unsigned long idx, bool guest)
{
u64 permitted, requested;
int ret;
@@ -1661,14 +1665,19 @@ static int xstate_request_perm(unsigned long idx)
return -EOPNOTSUPP;

/* Lockless quick check */
- permitted = xstate_get_host_group_perm();
+ permitted = xstate_get_group_perm(guest);
if ((permitted & requested) == requested)
return 0;

/* Protect against concurrent modifications */
spin_lock_irq(&current->sighand->siglock);
- permitted = xstate_get_host_group_perm();
- ret = __xstate_request_perm(permitted, requested);
+ permitted = xstate_get_group_perm(guest);
+
+ /* First vCPU allocation locks the permissions. */
+ if (guest && (permitted & FPU_GUEST_PERM_LOCKED))
+ ret = -EBUSY;
+ else
+ ret = __xstate_request_perm(permitted, requested, guest);
spin_unlock_irq(&current->sighand->siglock);
return ret;
}
@@ -1713,12 +1722,17 @@ int xfd_enable_feature(u64 xfd_err)
return 0;
}
#else /* CONFIG_X86_64 */
-static inline int xstate_request_perm(unsigned long idx)
+static inline int xstate_request_perm(unsigned long idx, bool guest)
{
return -EPERM;
}
#endif /* !CONFIG_X86_64 */

+inline u64 xstate_get_guest_group_perm(void)
+{
+ return xstate_get_group_perm(true);
+}
+EXPORT_SYMBOL_GPL(xstate_get_guest_group_perm);
/**
* fpu_xstate_prctl - xstate permission operations
* @tsk: Redundant pointer to current
@@ -1742,6 +1756,7 @@ long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
u64 __user *uptr = (u64 __user *)arg2;
u64 permitted, supported;
unsigned long idx = arg2;
+ bool guest = false;

if (tsk != current)
return -EPERM;
@@ -1760,11 +1775,20 @@ long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

+ case ARCH_GET_XCOMP_GUEST_PERM:
+ permitted = xstate_get_guest_group_perm();
+ permitted &= XFEATURE_MASK_USER_SUPPORTED;
+ return put_user(permitted, uptr);
+
+ case ARCH_REQ_XCOMP_GUEST_PERM:
+ guest = true;
+ fallthrough;
+
case ARCH_REQ_XCOMP_PERM:
if (!IS_ENABLED(CONFIG_X86_64))
return -EOPNOTSUPP;

- return xstate_request_perm(idx);
+ return xstate_request_perm(idx, guest);

default:
return -EINVAL;
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 86ea7c0fa2f6..98a472775c97 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -20,10 +20,19 @@ static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
}

-static inline u64 xstate_get_host_group_perm(void)
+static inline u64 xstate_get_group_perm(bool guest)
{
+ struct fpu *fpu = &current->group_leader->thread.fpu;
+ struct fpu_state_perm *perm;
+
/* Pairs with WRITE_ONCE() in xstate_request_perm() */
- return READ_ONCE(current->group_leader->thread.fpu.perm.__state_perm);
+ perm = guest ? &fpu->guest_perm : &fpu->perm;
+ return READ_ONCE(perm->__state_perm);
+}
+
+static inline u64 xstate_get_host_group_perm(void)
+{
+ return xstate_get_group_perm(false);
}

enum xstate_copy_mode {
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 04143a653a8a..d7bc23589062 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -993,6 +993,8 @@ long do_arch_prctl_common(struct task_struct *task, int option,
case ARCH_GET_XCOMP_SUPP:
case ARCH_GET_XCOMP_PERM:
case ARCH_REQ_XCOMP_PERM:
+ case ARCH_GET_XCOMP_GUEST_PERM:
+ case ARCH_REQ_XCOMP_GUEST_PERM:
return fpu_xstate_prctl(task, option, arg2);
}


2021-12-07 15:09:17

by Yang Zhong

Subject: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states

From: Thomas Gleixner <[email protected]>

Add more fields for tracking per-vCPU permissions for dynamic XSTATE
components:

- user_xfeatures

Track which features are currently enabled for the vCPU

- user_perm

Copied from guest_perm of the group leader thread. The first
vCPU which does the copy locks the guest_perm.

- realloc_request

KVM sets this field to request dynamically-enabled features
which require reallocation of @fpstate.

Initialize those fields properly.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 23 +++++++++++++++++++++++
arch/x86/kernel/fpu/core.c | 26 +++++++++++++++++++++++++-
2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 6ddf80637697..861cffca3209 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -504,6 +504,29 @@ struct fpu {
* Guest pseudo FPU container
*/
struct fpu_guest {
+ /*
+ * @user_xfeatures: xfeature bitmap of features which are
+ * currently enabled for the guest vCPU.
+ */
+ u64 user_xfeatures;
+
+ /*
+ * @user_perm: xfeature bitmap of features which are
+ * permitted to be enabled for the guest
+ * vCPU.
+ */
+ u64 user_perm;
+
+ /*
+ * @realloc_request: xfeature bitmap of features which are
+ * requested to be enabled dynamically
+ * which requires reallocation of @fpstate
+ *
+ * Set by an intercept handler and
+ * evaluated in fpu_swap_kvm_fpstate()
+ */
+ u64 realloc_request;
+
/*
* @fpstate: Pointer to the allocated guest fpstate
*/
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index ab19b3d8b2f7..fe592799508c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -201,6 +201,26 @@ void fpu_reset_from_exception_fixup(void)
#if IS_ENABLED(CONFIG_KVM)
static void __fpstate_reset(struct fpstate *fpstate);

+static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
+{
+ struct fpu_state_perm *fpuperm;
+ u64 perm;
+
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return;
+
+ spin_lock_irq(&current->sighand->siglock);
+ fpuperm = &current->group_leader->thread.fpu.guest_perm;
+ perm = fpuperm->__state_perm;
+
+ /* First fpstate allocation locks down permissions. */
+ WRITE_ONCE(fpuperm->__state_perm, perm | FPU_GUEST_PERM_LOCKED);
+
+ spin_unlock_irq(&current->sighand->siglock);
+
+ gfpu->user_perm = perm & ~FPU_GUEST_PERM_LOCKED;
+}
+
bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
{
struct fpstate *fpstate;
@@ -216,7 +236,11 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
fpstate->is_valloc = true;
fpstate->is_guest = true;

- gfpu->fpstate = fpstate;
+ gfpu->fpstate = fpstate;
+ gfpu->user_xfeatures = fpu_user_cfg.default_features;
+ gfpu->user_perm = fpu_user_cfg.default_features;
+ fpu_init_guest_permissions(gfpu);
+
return true;
}
EXPORT_SYMBOL_GPL(fpu_alloc_guest_fpstate);

2021-12-07 15:09:23

by Yang Zhong

Subject: [PATCH 03/19] kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule

From: Jing Liu <[email protected]>

CPUID.0xD.1.EBX enumerates the size of the XSAVE area (in compacted
format) required by XSAVES. If CPUID.0xD.i.ECX[1] is set for a state
component (i), this state component should be located on the next
64-byte boundary following the preceding state component in the
compacted layout.

Fix xstate_required_size() to follow the alignment rule. AMX is the
first state component with 64-byte alignment to catch this bug.
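
A standalone illustration of the alignment rule with made-up sizes
(only the 64-byte alignment itself is architectural):

  #include <stdio.h>
  #include <stdint.h>

  #define ALIGN(x, a)  (((x) + (a) - 1) & ~(uint32_t)((a) - 1))

  int main(void)
  {
          uint32_t ret = 2440;              /* running compacted size */
          uint32_t eax = 8192;              /* component size from CPUID */
          uint32_t offset = ALIGN(ret, 64); /* CPUID.0xD.i.ECX[1] = 1 */

          /* Old code: 2440 + 8192 = 10632 (component misaligned) */
          /* Fixed:    2496 + 8192 = 10688 */
          printf("offset=%u, total=%u\n", offset, offset + eax);
          return 0;
  }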

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kvm/cpuid.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 07e9215e911d..148003e26cbb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -42,7 +42,8 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted)
if (xstate_bv & 0x1) {
u32 eax, ebx, ecx, edx, offset;
cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
- offset = compacted ? ret : ebx;
+ /* ECX[1]: 64B alignment in compacted form */
+ offset = compacted ? ((ecx & 0x2) ? ALIGN(ret, 64) : ret) : ebx;
ret = max(ret, offset + eax);
}


2021-12-07 15:09:25

by Yang Zhong

Subject: [PATCH 04/19] kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2

From: Jing Liu <[email protected]>

Guest xstate permissions should be set by userspace VMM before vcpu
creation. This patch extends KVM to check the guest permissions in
KVM_SET_CPUID2 ioctl to avoid permission failure at guest run-time
(e.g. when reallocation path is triggered).

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kvm/cpuid.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 148003e26cbb..f3c61205bbf4 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -18,6 +18,7 @@
#include <asm/processor.h>
#include <asm/user.h>
#include <asm/fpu/xstate.h>
+#include <asm/fpu/api.h>
#include <asm/sgx.h>
#include "cpuid.h"
#include "lapic.h"
@@ -97,6 +98,17 @@ static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
return -EINVAL;
}

+ /*
+ * Check guest permissions for XSTATE features which must
+ * be enabled dynamically.
+ */
+ best = cpuid_entry2_find(entries, nent, 7, 0);
+ if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE)) {
+ if (!(xstate_get_guest_group_perm() &
+ XFEATURE_MASK_XTILE_DATA))
+ return -EINVAL;
+ }
+
return 0;
}


2021-12-07 15:09:31

by Yang Zhong

[permalink] [raw]
Subject: [PATCH 05/19] x86/fpu: Move xfd initialization out of __fpstate_reset() to the callers

From: Jing Liu <[email protected]>

vCPU threads are different from native tasks regarding the initial
xfd value. While all native tasks follow a fixed value (init_fpstate::xfd)
defined by the fpu core, vCPU threads need to obey the reset value
(i.e. ZERO) defined by the spec, to meet the expectation of the guest.

Move xfd initialization out of __fpstate_reset() to the callers so
that each caller can choose a specific value.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kernel/fpu/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index fe592799508c..fae44fa27cdb 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -231,6 +231,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
if (!fpstate)
return false;

+ /* Leave xfd to 0 (the reset value defined by spec) */
__fpstate_reset(fpstate);
fpstate_init_user(fpstate);
fpstate->is_valloc = true;
@@ -461,7 +462,6 @@ static void __fpstate_reset(struct fpstate *fpstate)
fpstate->user_size = fpu_user_cfg.default_size;
fpstate->xfeatures = fpu_kernel_cfg.default_features;
fpstate->user_xfeatures = fpu_user_cfg.default_features;
- fpstate->xfd = init_fpstate.xfd;
}

void fpstate_reset(struct fpu *fpu)
@@ -469,6 +469,7 @@ void fpstate_reset(struct fpu *fpu)
/* Set the fpstate pointer to the default fpstate */
fpu->fpstate = &fpu->__fpstate;
__fpstate_reset(fpu->fpstate);
+ fpu->fpstate->xfd = init_fpstate.xfd;

/* Initialize the permission related info in fpu */
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

2021-12-07 15:09:34

by Yang Zhong

Subject: [PATCH 06/19] x86/fpu: Add reallocation mechanism for KVM

From: Thomas Gleixner <[email protected]>

Extend the fpstate reallocation mechanism to cover guest fpu. Unlike
native tasks which have reallocation triggered from the #NM handler,
guest fpstate reallocation is requested by KVM when it detects the
guest's intention to use a dynamically-enabled XSAVE feature.

Since KVM currently swaps host/guest fpstate when exiting to userspace
VMM (see fpu_swap_kvm_fpstate()), deal with fpstate reallocation also
at this point.

The implication: KVM must break the vcpu_run() loop and exit to the
userspace VMM instead of immediately returning to the guest when the
fpstate requires reallocation. In this case KVM should set
guest_fpu::realloc_request to mark those features in the related VM
exit handlers.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kernel/fpu/core.c | 26 +++++++++++++++++++---
arch/x86/kernel/fpu/xstate.c | 43 ++++++++++++++++++++++++++++++------
arch/x86/kernel/fpu/xstate.h | 2 ++
3 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index fae44fa27cdb..7a0436a0cb2c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -261,11 +261,31 @@ void fpu_free_guest_fpstate(struct fpu_guest *gfpu)
}
EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);

+static int fpu_guest_realloc_fpstate(struct fpu_guest *guest_fpu,
+ bool enter_guest)
+{
+ /*
+ * Reallocation requests can only be handled when
+ * switching from guest to host mode.
+ */
+ if (WARN_ON_ONCE(enter_guest || !IS_ENABLED(CONFIG_X86_64))) {
+ guest_fpu->realloc_request = 0;
+ return -EUNATCH;
+ }
+ return xfd_enable_guest_features(guest_fpu);
+}
+
int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
{
- struct fpstate *guest_fps = guest_fpu->fpstate;
+ struct fpstate *guest_fps, *cur_fps;
struct fpu *fpu = &current->thread.fpu;
- struct fpstate *cur_fps = fpu->fpstate;
+ int ret = 0;
+
+ if (unlikely(guest_fpu->realloc_request))
+ ret = fpu_guest_realloc_fpstate(guest_fpu, enter_guest);
+
+ guest_fps = guest_fpu->fpstate;
+ cur_fps = fpu->fpstate;

fpregs_lock();
if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
@@ -298,7 +318,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)

fpregs_mark_activate();
fpregs_unlock();
- return 0;
+ return ret;
}
EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9856d579aa6e..fe3d8ed3db0e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1529,6 +1529,7 @@ static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
* of that task
* @ksize: The required size for the kernel buffer
* @usize: The required size for user space buffers
+ * @guest_fpu: Pointer to a guest FPU container. NULL for host allocations
*
* Note vs. vmalloc(): If the task with a vzalloc()-allocated buffer
* terminates quickly, vfree()-induced IPIs may be a concern, but tasks
@@ -1537,7 +1538,7 @@ static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
* Returns: 0 on success, -ENOMEM on allocation error.
*/
static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
- unsigned int usize)
+ unsigned int usize, struct fpu_guest *guest_fpu)
{
struct fpu *fpu = &current->thread.fpu;
struct fpstate *curfps, *newfps = NULL;
@@ -1553,6 +1554,12 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
newfps->user_size = usize;
newfps->is_valloc = true;

+ if (guest_fpu) {
+ newfps->is_guest = true;
+ newfps->is_confidential = curfps->is_confidential;
+ guest_fpu->user_xfeatures |= xfeatures;
+ }
+
fpregs_lock();
/*
* Ensure that the current state is in the registers before
@@ -1566,12 +1573,14 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
newfps->xfd = curfps->xfd & ~xfeatures;

+ if (guest_fpu)
+ guest_fpu->fpstate = newfps;
+
curfps = fpu_install_fpstate(fpu, newfps);

/* Do the final updates within the locked region */
xstate_init_xcomp_bv(&newfps->regs.xsave, newfps->xfeatures);
xfd_update_state(newfps);
-
fpregs_unlock();

vfree(curfps);
@@ -1682,9 +1691,10 @@ static int xstate_request_perm(unsigned long idx, bool guest)
return ret;
}

-int xfd_enable_feature(u64 xfd_err)
+static int __xfd_enable_feature(u64 xfd_err, struct fpu_guest *guest_fpu)
{
u64 xfd_event = xfd_err & XFEATURE_MASK_USER_DYNAMIC;
+ struct fpu_state_perm *perm;
unsigned int ksize, usize;
struct fpu *fpu;

@@ -1697,14 +1707,16 @@ int xfd_enable_feature(u64 xfd_err)
spin_lock_irq(&current->sighand->siglock);

/* If not permitted let it die */
- if ((xstate_get_host_group_perm() & xfd_event) != xfd_event) {
+ if ((xstate_get_group_perm(!!guest_fpu) & xfd_event) != xfd_event) {
spin_unlock_irq(&current->sighand->siglock);
return -EPERM;
}

fpu = &current->group_leader->thread.fpu;
- ksize = fpu->perm.__state_size;
- usize = fpu->perm.__user_state_size;
+ perm = guest_fpu ? &fpu->guest_perm : &fpu->perm;
+ ksize = perm->__state_size;
+ usize = perm->__user_state_size;
+
/*
* The feature is permitted. State size is sufficient. Dropping
* the lock is safe here even if more features are added from
@@ -1717,10 +1729,27 @@ int xfd_enable_feature(u64 xfd_err)
* Try to allocate a new fpstate. If that fails there is no way
* out.
*/
- if (fpstate_realloc(xfd_event, ksize, usize))
+ if (fpstate_realloc(xfd_event, ksize, usize, guest_fpu))
return -EFAULT;
return 0;
}
+
+int xfd_enable_feature(u64 xfd_err)
+{
+ return __xfd_enable_feature(xfd_err, NULL);
+}
+
+int xfd_enable_guest_features(struct fpu_guest *guest_fpu)
+{
+ u64 xfd_err = guest_fpu->realloc_request & XFEATURE_MASK_USER_SUPPORTED;
+
+ guest_fpu->realloc_request = 0;
+
+ if (!xfd_err)
+ return 0;
+ return __xfd_enable_feature(xfd_err, guest_fpu);
+}
+
#else /* CONFIG_X86_64 */
static inline int xstate_request_perm(unsigned long idx, bool guest)
{
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 98a472775c97..3254e2b5f17f 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -55,6 +55,8 @@ extern void fpu__init_system_xstate(unsigned int legacy_size);

extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);

+extern int xfd_enable_guest_features(struct fpu_guest *guest_fpu);
+
static inline u64 xfeatures_mask_supervisor(void)
{
return fpu_kernel_cfg.max_features & XFEATURE_MASK_SUPERVISOR_SUPPORTED;

2021-12-07 15:09:39

by Yang Zhong

Subject: [PATCH 07/19] kvm: x86: Propagate fpstate reallocation error to userspace

From: Jing Liu <[email protected]>

fpstate reallocation is handled when the vCPU thread returns
to userspace. As reallocation could fail (e.g. due to lack of memory),
this patch extends kvm_put_guest_fpu() to return an integer value
to carry the error code to the userspace VMM. The userspace VMM is
expected to handle any error caused by fpstate reallocation.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kvm/x86.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ee1a039b490..05f2cda73d69 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10171,17 +10171,21 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
}

/* When vcpu_run ends, restore user space FPU context. */
-static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
+static int kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
- fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);
+ int ret;
+
+ ret = fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);
++vcpu->stat.fpu_reload;
trace_kvm_fpu(0);
+
+ return ret;
}

int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
struct kvm_run *kvm_run = vcpu->run;
- int r;
+ int r, ret;

vcpu_load(vcpu);
kvm_sigset_activate(vcpu);
@@ -10243,7 +10247,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
r = vcpu_run(vcpu);

out:
- kvm_put_guest_fpu(vcpu);
+ ret = kvm_put_guest_fpu(vcpu);
+ if ((r >= 0) && (ret < 0))
+ r = ret;
+
if (kvm_run->kvm_valid_regs)
store_regs(vcpu);
post_kvm_run_save(vcpu);

2021-12-07 15:09:44

by Yang Zhong

Subject: [PATCH 08/19] x86/fpu: Move xfd_update_state() to xstate.c and export symbol

From: Jing Liu <[email protected]>

xfd_update_state() is the interface to update IA32_XFD and its per-cpu
cache. All callers of this interface are currently in fpu core. KVM only
indirectly triggers IA32_XFD update via a helper function
(fpu_swap_kvm_fpstate()) when switching between user fpu and guest fpu.

Supporting AMX in guest now requires KVM to directly update IA32_XFD
with the guest value (when emulating WRMSR) so XSAVE/XRSTOR can manage
XSTATE components correctly inside guest.

This patch moves xfd_update_state() from fpu/xstate.h to fpu/xstate.c
and exports it for use outside of the fpu core.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 2 ++
arch/x86/kernel/fpu/xstate.c | 12 ++++++++++++
arch/x86/kernel/fpu/xstate.h | 14 +-------------
3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 7532f73c82a6..999d89026be9 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -131,8 +131,10 @@ DECLARE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
/* Process cleanup */
#ifdef CONFIG_X86_64
extern void fpstate_free(struct fpu *fpu);
+extern void xfd_update_state(struct fpstate *fpstate);
#else
static inline void fpstate_free(struct fpu *fpu) { }
+static inline void xfd_update_state(struct fpstate *fpstate) { }
#endif

/* fpstate-related functions which are exported to KVM */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index fe3d8ed3db0e..3c39789deeb9 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1750,6 +1750,18 @@ int xfd_enable_guest_features(struct fpu_guest *guest_fpu)
return __xfd_enable_feature(xfd_err, guest_fpu);
}

+void xfd_update_state(struct fpstate *fpstate)
+{
+ if (fpu_state_size_dynamic()) {
+ u64 xfd = fpstate->xfd;
+
+ if (__this_cpu_read(xfd_state) != xfd) {
+ wrmsrl(MSR_IA32_XFD, xfd);
+ __this_cpu_write(xfd_state, xfd);
+ }
+ }
+}
+EXPORT_SYMBOL_GPL(xfd_update_state);
#else /* CONFIG_X86_64 */
static inline int xstate_request_perm(unsigned long idx, bool guest)
{
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 3254e2b5f17f..651bd29977b9 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -149,19 +149,7 @@ static inline void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rs
#endif

#ifdef CONFIG_X86_64
-static inline void xfd_update_state(struct fpstate *fpstate)
-{
- if (fpu_state_size_dynamic()) {
- u64 xfd = fpstate->xfd;
-
- if (__this_cpu_read(xfd_state) != xfd) {
- wrmsrl(MSR_IA32_XFD, xfd);
- __this_cpu_write(xfd_state, xfd);
- }
- }
-}
-#else
-static inline void xfd_update_state(struct fpstate *fpstate) { }
+extern void xfd_update_state(struct fpstate *fpstate);
#endif

/*

2021-12-07 15:09:48

by Yang Zhong

Subject: [PATCH 09/19] kvm: x86: Prepare reallocation check

From: Jing Liu <[email protected]>

On bare metal, fpstate reallocation is triggered by #NM because
IA32_XFD is initialized to 1 for all native tasks.

However #NM in the guest is not trapped by KVM. Instead, guest enabling
of a dynamic extended feature can be captured via emulation of
IA32_XFD and XSETBV. Basically, having guest XCR0[i]=1 and XFD[i]=0
indicates that feature[i] is activated by the guest.

This patch provides a helper function for this check, invoked when
either XCR0 or XFD is changed in the emulation path.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
arch/x86/kvm/x86.h | 1 +
2 files changed, 25 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 05f2cda73d69..91cc6f69a7ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -956,6 +956,30 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_load_host_xsave_state);

+bool kvm_check_guest_realloc_fpstate(struct kvm_vcpu *vcpu, u64 xfd)
+{
+ u64 xcr0 = vcpu->arch.xcr0 & XFEATURE_MASK_USER_DYNAMIC;
+
+ /* For any state which is enabled dynamically */
+ if ((xfd & xcr0) != xcr0) {
+ u64 request = (xcr0 ^ xfd) & xcr0;
+ struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
+
+ /*
+ * If requested features haven't been enabled, update
+ * the request bitmap and tell the caller to request
+ * dynamic buffer reallocation.
+ */
+ if ((guest_fpu->user_xfeatures & request) != request) {
+ vcpu->arch.guest_fpu.realloc_request = request;
+ return true;
+ }
+ }
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(kvm_check_guest_realloc_fpstate);
+
static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
{
u64 xcr0 = xcr;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 4abcd8d9836d..24a323980146 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -445,6 +445,7 @@ static inline void kvm_machine_check(void)

void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
+bool kvm_check_guest_realloc_fpstate(struct kvm_vcpu *vcpu, u64 new_xfd);
int kvm_spec_ctrl_test_value(u64 value);
bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,

2021-12-07 15:09:58

by Yang Zhong

Subject: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

From: Jing Liu <[email protected]>

Intel's eXtended Feature Disable (XFD) feature allows software to
dynamically adjust the fpstate buffer size for XSAVE features which
have large state.

WRMSR to IA32_XFD is intercepted so if the written value enables
a dynamic XSAVE feature the emulation code can exit to userspace
to trigger fpstate reallocation for the state.

Introduce a new KVM exit reason (KVM_EXIT_FPU_REALLOC) for this
purpose. If reallocation succeeds in fpu_swap_kvm_fpstate(), this
exit just bounces to userspace and then back. Otherwise the
userspace VMM should handle the error properly.

Using a new exit reason (instead of KVM_EXIT_X86_WRMSR) is clearer
and can be shared between WRMSR(IA32_XFD) and XSETBV. This also
avoids mixing with the userspace MSR machinery which is tied to
KVM_EXIT_X86_WRMSR today.

Also introduce a new MSR return type (KVM_MSR_RET_USERSPACE).
Currently MSR emulation returns to userspace only upon error or
per certain filtering rules via the userspace MSR machinery.
This new return type indicates that emulation of a certain MSR has
its own specific reason to bounce to userspace.

IA32_XFD is updated in two ways:

- If reallocation is not required, the emulation code directly
updates guest_fpu::xfd and then calls xfd_update_state() to
update IA32_XFD and per-cpu cache;

- If reallocation is triggered, the above updates are completed as
part of the fpstate reallocation process if it succeeds;

RDMSR to IA32_XFD is not intercepted. fpu_swap_kvm_fpstate() ensures
the guest XFD value is loaded into the MSR before re-entering the
guest. This just saves an unnecessary VM-exit.
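
For reference, a sketch of how a userspace VMM loop might handle the
new exit reason (variable names are illustrative; per the previous
patch a failed reallocation surfaces as an error return from KVM_RUN):

  for (;;) {
          if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
                  /* e.g. -ENOMEM from a failed fpstate reallocation */
                  perror("KVM_RUN");
                  break;
          }

          switch (run->exit_reason) {
          case KVM_EXIT_FPU_REALLOC:
                  /*
                   * The fpstate was already resized in the kernel on
                   * the way out; nothing to do, just re-enter.
                   */
                  continue;
          /* ... other exit reasons ... */
          }
  }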

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 8 +++++++
arch/x86/kvm/x86.c | 48 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.h | 1 +
include/uapi/linux/kvm.h | 1 +
4 files changed, 58 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 70d86ffbccf7..971d60980d5b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7141,6 +7141,11 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}

+static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu)
+{
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_R, false);
+}
+
static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7181,6 +7186,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
}
}

+ if (cpu_feature_enabled(X86_FEATURE_XFD) && guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ vmx_update_intercept_xfd(vcpu);
+
set_cr4_guest_host_mask(vmx);

vmx_write_encls_bitmap(vcpu, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 91cc6f69a7ca..c83887cb55ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1873,6 +1873,16 @@ static int kvm_msr_user_space(struct kvm_vcpu *vcpu, u32 index,
{
u64 msr_reason = kvm_msr_reason(r);

+ /*
+ * MSR emulation may need certain effect triggered in the
+ * path transitioning to userspace (e.g. fpstate realloction).
+ * In this case the actual exit reason and completion
+ * func should have been set by the emulation code before
+ * this point.
+ */
+ if (r == KVM_MSR_RET_USERSPACE)
+ return 1;
+
/* Check if the user wanted to know about this MSR fault */
if (!(vcpu->kvm->arch.user_space_msr_mask & msr_reason))
return 0;
@@ -3692,6 +3702,44 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
vcpu->arch.msr_misc_features_enables = data;
break;
+#ifdef CONFIG_X86_64
+ case MSR_IA32_XFD:
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ return 1;
+
+ /* Setting unsupported bits causes #GP */
+ if (~XFEATURE_MASK_USER_DYNAMIC & data) {
+ kvm_inject_gp(vcpu, 0);
+ break;
+ }
+
+ WARN_ON_ONCE(current->thread.fpu.fpstate !=
+ vcpu->arch.guest_fpu.fpstate);
+
+ /*
+ * Check if fpstate reallocate is required. If yes, then
+ * let the fpu core do reallocation and update xfd;
+ * otherwise, update xfd here.
+ */
+ if (kvm_check_guest_realloc_fpstate(vcpu, data)) {
+ vcpu->run->exit_reason = KVM_EXIT_FPU_REALLOC;
+ vcpu->arch.complete_userspace_io =
+ kvm_skip_emulated_instruction;
+ return KVM_MSR_RET_USERSPACE;
+ }
+
+ /*
+ * Update IA32_XFD to the guest value so #NM can be
+ * raised properly in the guest. Instead of directly
+ * writing the MSR, call a helper to avoid breaking
+ * per-cpu cached value in fpu core.
+ */
+ fpregs_lock();
+ current->thread.fpu.fpstate->xfd = data;
+ xfd_update_state(current->thread.fpu.fpstate);
+ fpregs_unlock();
+ break;
+#endif
default:
if (kvm_pmu_is_valid_msr(vcpu, msr))
return kvm_pmu_set_msr(vcpu, msr_info);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 24a323980146..446ffa8c7804 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -460,6 +460,7 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
*/
#define KVM_MSR_RET_INVALID 2 /* in-kernel MSR emulation #GP condition */
#define KVM_MSR_RET_FILTERED 3 /* #GP due to userspace MSR filter */
+#define KVM_MSR_RET_USERSPACE 4 /* Userspace handling */

#define __cr4_reserved_bits(__cpu_has, __c) \
({ \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..0c7b301c7254 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -270,6 +270,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_X86_BUS_LOCK 33
#define KVM_EXIT_XEN 34
#define KVM_EXIT_RISCV_SBI 35
+#define KVM_EXIT_FPU_REALLOC 36

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */

2021-12-07 15:10:03

by Yang Zhong

Subject: [PATCH 11/19] kvm: x86: Check fpstate reallocation in XSETBV emulation

XSETBV allows the software to write the extended control register
XCR0, thus its emulation handler also needs to check fpstate
reallocation when the changed XCR0 value enables certain
dynamically-enabled features.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kvm/x86.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c83887cb55ee..b195f4fa888f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1028,6 +1028,15 @@ int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu)
return 1;
}

+ if (guest_cpuid_has(vcpu, X86_FEATURE_XFD)) {
+ if (kvm_check_guest_realloc_fpstate(vcpu, vcpu->arch.guest_fpu.fpstate->xfd)) {
+ vcpu->run->exit_reason = KVM_EXIT_FPU_REALLOC;
+ vcpu->arch.complete_userspace_io =
+ kvm_skip_emulated_instruction;
+ return 0;
+ }
+ }
+
return kvm_skip_emulated_instruction(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_emulate_xsetbv);

2021-12-07 15:10:06

by Yang Zhong

Subject: [PATCH 12/19] x86/fpu: Prepare KVM for bringing XFD state back in-sync

From: Thomas Gleixner <[email protected]>

The guest may toggle IA32_XFD at high frequency, as it is part of the
fpstate information (features, sizes, xfd) and is swapped on task
context switch.

To minimize the trap overhead of writes to this MSR, one optimization
is to allow guest direct writes, thus eliminating traps. However MSR
passthrough implies that guest_fpstate::xfd and the per-cpu xfd cache
might be out of sync with the current IA32_XFD value set by the guest.

This suggests KVM needs to re-sync guest_fpstate::xfd and per-cpu cache
with IA32_XFD before the vCPU thread might be preempted or interrupted.

This patch provides a helper function for the re-sync purpose.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
(To Thomas): the original name kvm_update_guest_xfd_state() in
your sample code is renamed to xfd_sync_state() in this patch. In
concept it is a general helper to bring software values in-sync with
the MSR value after they become out-of-sync. KVM is just the
first out-of-sync user of this helper, so a neutral name may make
more sense. But if you prefer the original name we can also
change it back.

arch/x86/include/asm/fpu/xstate.h | 2 ++
arch/x86/kernel/fpu/xstate.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..c8b51d34daab 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -129,4 +129,6 @@ static __always_inline __pure bool fpu_state_size_dynamic(void)
}
#endif

+extern void xfd_sync_state(void);
+
#endif
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 3c39789deeb9..a5656237a763 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1762,11 +1762,25 @@ void xfd_update_state(struct fpstate *fpstate)
}
}
EXPORT_SYMBOL_GPL(xfd_update_state);
+
+/* Bring software state in sync with the current MSR value */
+void xfd_sync_state(void)
+{
+ if (fpu_state_size_dynamic()) {
+ u64 xfd;
+
+ rdmsrl(MSR_IA32_XFD, xfd);
+ current->thread.fpu.fpstate->xfd = xfd;
+ __this_cpu_write(xfd_state, xfd);
+ }
+}
+EXPORT_SYMBOL_GPL(xfd_sync_state);
#else /* CONFIG_X86_64 */
static inline int xstate_request_perm(unsigned long idx, bool guest)
{
return -EPERM;
}
+void xfd_sync_state(void) {}
#endif /* !CONFIG_X86_64 */

inline u64 xstate_get_guest_group_perm(void)

2021-12-07 15:10:15

by Yang Zhong

Subject: [PATCH 13/19] kvm: x86: Disable WRMSR interception for IA32_XFD on demand

From: Jing Liu <[email protected]>

Always intercepting IA32_XFD causes non-negligible overhead when
this register is updated frequently in the guest.

Disable WRMSR interception to IA32_XFD after fpstate reallocation
is completed. There are three options for when to disable the
interception:

1) When emulating the 1st WRMSR which requires reallocation,
disable interception before exiting to userspace with the
assumption that the userspace VMM should not bounce back to
the kernel if reallocation fails. However it's not good to
design the kernel based on application behavior. If due to a bug
the vCPU thread comes back to the kernel after reallocation
fails, XFD passthrough may lead to host memory corruption
when doing XSAVES for guest fpstate which has a smaller size
than what guest XFD allows.

2) Disable interception when coming back from the userspace VMM
(for the 1st WRMSR which triggers reallocation). Re-check
whether fpstate size can serve the new guest XFD value. Disable
interception only when the check succeeds. This requires KVM
to store guest XFD value in some place and then compare it
to guest_fpu::user_xfeatures in the completion handler.

3) Disable interception at the 2nd WRMSR which enables dynamic
XSTATE features. If guest_fpu::user_xfeatures already includes
bits for dynamic features set in guest XFD value, disable
interception.

Currently 3) is implemented, with a flow like below:

(G) WRMSR(IA32_XFD) which enables AMX for the FIRST time
--trap to host--
(HK) Emulate WRMSR and find fpstate size too small
(HK) Reallocate fpstate
--exit to userspace--
(HU) do nothing
--back to kernel via kvm_run--
(HK) complete WRMSR emulation
--enter guest--
(G) do something
(G) WRMSR(IA32_XFD) which disables AMX
--trap to host--
(HK) Emulate WRMSR and disable AMX in IA32_XFD
--enter guest--
(G) do something
(G) WRMSR(IA32_XFD) which enables AMX for the SECOND time
--trap to host--
(HK) Emulate WRMSR and find fpstate size sufficient for AMX
(HK) Disable WRMSR interception for IA32_XFD
--enter guest--
(G) WRMSR(IA32_XFD)
(G) WRMSR(IA32_XFD)
(G) WRMSR(IA32_XFD)
...

After disabling WRMSR interception, the guest directly updates
IA32_XFD which becomes out-of-sync with the host-side software
state (guest_fpstate::xfd and per-cpu xfd cache). This requires
KVM to call xfd_sync_state() to bring the software state in
sync with the IA32_XFD register after VM-exit (before preemption
happens or before exiting to userspace).

p.s. We have confirmed that the SDM is being revised to say that
when setting IA32_XFD[18] the AMX register state is not
guaranteed to be preserved. This clarification avoids adding
mess for a creative guest which sets IA32_XFD[18]=1 before saving
active AMX state to its own storage.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 10 ++++++++++
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 7 +++++++
5 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..60c27f9990e9 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -30,6 +30,7 @@ KVM_X86_OP(update_exception_bitmap)
KVM_X86_OP(get_msr)
KVM_X86_OP(set_msr)
KVM_X86_OP(get_segment_base)
+KVM_X86_OP_NULL(set_xfd_passthrough)
KVM_X86_OP(get_segment)
KVM_X86_OP(get_cpl)
KVM_X86_OP(set_segment)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ac61f85e07b..7c97cc1fea89 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -640,6 +640,7 @@ struct kvm_vcpu_arch {
u64 smi_count;
bool tpr_access_reporting;
bool xsaves_enabled;
+ bool xfd_out_of_sync;
u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
@@ -1328,6 +1329,7 @@ struct kvm_x86_ops {
void (*update_exception_bitmap)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
+ void (*set_xfd_passthrough)(struct kvm_vcpu *vcpu);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
void (*get_segment)(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 971d60980d5b..6198b13c4846 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -160,6 +160,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_FS_BASE,
MSR_GS_BASE,
MSR_KERNEL_GS_BASE,
+ MSR_IA32_XFD,
#endif
MSR_IA32_SYSENTER_CS,
MSR_IA32_SYSENTER_ESP,
@@ -1924,6 +1925,14 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
return debugctl;
}

+#ifdef CONFIG_X86_64
+static void vmx_set_xfd_passthrough(struct kvm_vcpu *vcpu)
+{
+ vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
+ vcpu->arch.xfd_out_of_sync = true;
+}
+#endif
+
/*
* Writes msr value into the appropriate "register".
* Returns 0 on success, non-0 otherwise.
@@ -7657,6 +7666,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
#ifdef CONFIG_X86_64
.set_hv_timer = vmx_set_hv_timer,
.cancel_hv_timer = vmx_cancel_hv_timer,
+ .set_xfd_passthrough = vmx_set_xfd_passthrough,
#endif

.setup_mce = vmx_setup_mce,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 4df2ac24ffc1..bf9d3051cd6c 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;

/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 13
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 14
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b195f4fa888f..d127b229dd29 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -974,6 +974,10 @@ bool kvm_check_guest_realloc_fpstate(struct kvm_vcpu *vcpu, u64 xfd)
vcpu->arch.guest_fpu.realloc_request = request;
return true;
}
+
+ /* Disable WRMSR interception if possible */
+ if (kvm_x86_ops.set_xfd_passthrough)
+ static_call(kvm_x86_set_xfd_passthrough)(vcpu);
}

return false;
@@ -10002,6 +10006,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (hw_breakpoint_active())
hw_breakpoint_restore();

+ if (vcpu->arch.xfd_out_of_sync)
+ xfd_sync_state();
+
vcpu->arch.last_vmentry_cpu = vcpu->cpu;
vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());


2021-12-07 15:10:23

by Yang Zhong

Subject: [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling

From: Jing Liu <[email protected]>

When XFD causes an instruction to generate #NM, IA32_XFD_ERR
contains information about which disabled state components
are being accessed. The #NM handler is expected to check this
information and then enable the state components by clearing
IA32_XFD for the faulting task (if it has permission).
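
For context, a simplified sketch of the XFD-aware #NM flow on the
host side (error handling elided; the real handler lives in the x86
traps code):

  static void handle_nm_xfd(void)
  {
          u64 xfd_err;

          rdmsrl(MSR_IA32_XFD_ERR, xfd_err);
          if (xfd_err) {
                  wrmsrl(MSR_IA32_XFD_ERR, 0);  /* consume and clear */
                  xfd_enable_feature(xfd_err);  /* resize fpstate, clear XFD bit */
                  return;
          }
          /* otherwise a normal, non-XFD #NM */
  }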

It'd be problematic if the XFD_ERR value generated in the guest is
consumed/clobbered by the host before the guest itself does so. This
may lead to a non-XFD-related #NM being treated as an XFD #NM in the
host (due to a non-zero value in XFD_ERR), or an XFD-related #NM being
treated as a non-XFD #NM in the guest (XFD_ERR cleared by the host
#NM handler).

This patch provides two helpers to swap the guest XFD_ERR and host
XFD_ERR. Where to call them in KVM will be discussed thoroughly
in the next patch.

The guest XFD_ERR value is saved in fpu_guest::xfd_err. There is
no need to save host XFD_ERR because it's always cleared to ZERO
by the host #NM handler (which cannot be preempted by a vCPU
thread to observe a non-zero value).

The lower two bits in fpu_guest::xfd_err are borrowed for special
purposes. The state components (FP and SSE) covered by the two
bits are not XSAVE-enabled features, thus not XFD-enabled either.
It's impossible to see hardware setting them in XFD_ERR:

- XFD_ERR_GUEST_DISABLED (bit 0)

Indicate that the XFD extension is not exposed to the guest, thus
there is no need to save/restore it.

- XFD_ERR_GUEST_SAVED (bit 1)

Indicate fpu_guest::xfd_err already contains a saved value,
thus no need for duplicated saving (e.g. when the vCPU thread
is preempted multiple times before re-entering the guest).

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 8 ++++++
arch/x86/include/asm/fpu/types.h | 24 ++++++++++++++++
arch/x86/kernel/fpu/core.c | 49 ++++++++++++++++++++++++++++++++
3 files changed, 81 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 999d89026be9..c2e8f2172994 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -147,6 +147,14 @@ extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);

+#ifdef CONFIG_X86_64
+extern void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu);
+extern void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu);
+#else
+static inline void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu) { }
+static inline void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu) { }
+#endif
+
extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 861cffca3209..5ee98222c103 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -500,6 +500,22 @@ struct fpu {
*/
};

+/*
+ * Use @xfd_err:bit0 to indicate whether guest XFD_ERR should be
+ * saved/restored. The x87 state covered by bit 0 is not an
+ * XSAVE-enabled feature, thus is not XFD-enabled either (won't
+ * occur in XFD_ERR).
+ */
+#define XFD_ERR_GUEST_DISABLED (1 << XFEATURE_FP)
+
+/*
+ * Use @xfd_err:bit1 to indicate the validity of @xfd_err. Used to
+ * avoid duplicated saving in case the vCPU is preempted multiple
+ * times before it re-enters the guest. The SSE state covered by
+ * bit 1 is neither XSAVE-enabled nor XFD-enabled.
+ */
+#define XFD_ERR_GUEST_SAVED (1 << XFEATURE_SSE)
+
/*
* Guest pseudo FPU container
*/
@@ -527,6 +543,14 @@ struct fpu_guest {
*/
u64 realloc_request;

+ /*
+ * @xfd_err: saved guest value. Bits 0 and 1
+ * have special meaning to indicate the
+ * requirement of saving and the validity
+ * of the saved value.
+ */
+ u64 xfd_err;
+
/*
* @fpstate: Pointer to the allocated guest fpstate
*/
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 7a0436a0cb2c..5089f2e7dc22 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -322,6 +322,55 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
}
EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);

+#ifdef CONFIG_X86_64
+void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu)
+{
+ if (guest_fpu->xfd_err & XFD_ERR_GUEST_DISABLED)
+ return;
+
+ /* A non-zero value indicates guest XFD_ERR already saved */
+ if (guest_fpu->xfd_err)
+ return;
+
+ /* Guest XFD_ERR must be saved before switching to host fpstate */
+ WARN_ON_ONCE(!current->thread.fpu.fpstate->is_guest);
+
+ rdmsrl(MSR_IA32_XFD_ERR, guest_fpu->xfd_err);
+
+ /*
+ * Restore to the host value if guest xfd_err is non-zero.
+ * Except in #NM handler, all other places in the kernel
+ * should just see xfd_err=0. So just restore to 0.
+ */
+ if (guest_fpu->xfd_err)
+ wrmsrl(MSR_IA32_XFD_ERR, 0);
+
+ guest_fpu->xfd_err |= XFD_ERR_GUEST_SAVED;
+}
+EXPORT_SYMBOL_GPL(fpu_save_guest_xfd_err);
+
+void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu)
+{
+ u64 xfd_err = guest_fpu->xfd_err;
+
+ if (xfd_err & XFD_ERR_GUEST_DISABLED)
+ return;
+
+ xfd_err &= ~XFD_ERR_GUEST_SAVED;
+
+ /*
+ * No need to restore a zero value since XFD_ERR
+ * is always zero outside of #NM handler in the host.
+ */
+ if (!xfd_err)
+ return;
+
+ wrmsrl(MSR_IA32_XFD_ERR, xfd_err);
+ guest_fpu->xfd_err = 0;
+}
+EXPORT_SYMBOL_GPL(fpu_restore_guest_xfd_err);
+#endif
+
void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
unsigned int size, u32 pkru)
{

2021-12-07 15:10:26

by Yang Zhong

[permalink] [raw]
Subject: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

KVM needs to save the guest XFD_ERR value before this register
might be accessed by the host and restore it before entering the
guest.

This implementation saves guest XFD_ERR at two transition points:

- When the vCPU thread exits to the userspace VMM;
- When the vCPU thread is preempted;

XFD_ERR is cleared to ZERO right after saving the previous guest
value. Otherwise a stale guest value may confuse the host #NM
handler into misinterpreting a non-XFD-related #NM as XFD-related.

There is no need to save the host XFD_ERR value because the only
place where XFD_ERR is consumed outside of KVM is the #NM handler
(which cannot be preempted by a vCPU thread). XFD_ERR should
always be observed as ZERO outside of the #NM handler, thus
clearing XFD_ERR meets the host expectation here.

The saved guest value is restored to XFD_ERR right before entering
the guest (with preemption disabled).

The current implementation still has two open questions on which we
would like to hear suggestions:

1) Will #NM be triggered in the host kernel?

Now the code is written assuming the above is true, and it's the only
reason for saving guest XFD_ERR at preemption time. Otherwise the
save is only required when the CPU enters ring-3 (either from the
vCPU itself or other threads), by leveraging the "user-return
notifier" machinery as suggested by Paolo.

2) When to enable XFD_ERR save/restore?

There are four options on the table:

a) As long as guest cpuid has xfd enabled

XFD_ERR save/restore is enabled on every VM-exit (if preemption
or ret-to-userspace happens)

b) When the guest sets IA32_XFD to 1 for the first time

This indicates that the guest OS supports XFD features. Because the
guest OS usually initializes IA32_XFD at boot time, XFD_ERR save/restore
is enabled for almost every VM-exit (if preemption or ret-to-
userspace happens).

No save/restore is needed for a legacy guest OS which doesn't
support XFD features at all (thus won't touch IA32_XFD).

c) When the guest sets IA32_XFD to 0 for the first time

Lazily enable XFD_ERR save/restore once XFD features are used
inside the guest. However, this option doesn't work because
XFD_ERR is set when #NM is raised. A VM-exit could happen
between the CPU raising #NM and the guest #NM handler reading
XFD_ERR (before setting XFD to 0). The very first XFD_ERR might
already be clobbered by the host due to the lack of save/restore
in that small window.

d) When the 1st guest #NM with non-zero XFD_ERR occurs

Lazily enable XFD_ERR save/restore once XFD features are used
inside the guest. This requires intercepting guest #NM until a
non-zero XFD_ERR occurs. If a guest with XFD in cpuid never
launches an AMX application, it implies that #NM is always
trapped, thus adding a constant overhead which may be even
higher than doing RDMSR in the preemption path in a) and b):

#preempts < #VMEXITS (no #NM trap) < #VMEXITS (#NM trap)

The number of preemptions and ret-to-userspaces should be a
small portion of total #VMEXITs in a healthy virtualization
environment. Our gut feeling is that adding at most one MSR
read and one MSR write to the preempt/user-ret paths is possibly
more efficient than increasing #VMEXITs due to trapping #NM.

Based on the above analysis we plan to go with option b), although
this version currently implements a). We would like to hear other
suggestions before making this change; a sketch of option b) is
shown below.
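
For illustration, option b) could be wired into the WRMSR(IA32_XFD)
emulation path roughly as in the sketch below. This is not part of
this patch; it only reuses the XFD_ERR_GUEST_DISABLED flag from the
previous patch:

        /* Sketch only: called from the WRMSR(IA32_XFD) emulation path */
        static void kvm_enable_guest_xfd_err(struct kvm_vcpu *vcpu)
        {
                /*
                 * The first guest write to IA32_XFD indicates an
                 * XFD-aware guest OS, so start saving/restoring
                 * guest XFD_ERR from this point on.
                 */
                if (vcpu->arch.guest_fpu.xfd_err & XFD_ERR_GUEST_DISABLED)
                        vcpu->arch.guest_fpu.xfd_err = 0;
        }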

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kernel/fpu/core.c | 2 ++
arch/x86/kvm/cpuid.c | 5 +++++
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 5 +++++
5 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 5089f2e7dc22..9811dc98d550 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
fpstate->is_guest = true;

gfpu->fpstate = fpstate;
+ gfpu->xfd_err = XFD_ERR_GUEST_DISABLED;
gfpu->user_xfeatures = fpu_user_cfg.default_features;
gfpu->user_perm = fpu_user_cfg.default_features;
fpu_init_guest_permissions(gfpu);
@@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
fpu->fpstate = guest_fps;
guest_fps->in_use = true;
} else {
+ fpu_save_guest_xfd_err(guest_fpu);
guest_fps->in_use = false;
fpu->fpstate = fpu->__task_fpstate;
fpu->__task_fpstate = NULL;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f3c61205bbf4..ea51b986ee67 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_apic_set_version(vcpu);
}

+ /* Enable saving guest XFD_ERR */
+ best = kvm_find_cpuid_entry(vcpu, 7, 0);
+ if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
+ vcpu->arch.guest_fpu.xfd_err = 0;
+
best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
if (!best)
vcpu->arch.guest_supported_xcr0 = 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6198b13c4846..0db8bdf273e2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -161,6 +161,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_GS_BASE,
MSR_KERNEL_GS_BASE,
MSR_IA32_XFD,
+ MSR_IA32_XFD_ERR,
#endif
MSR_IA32_SYSENTER_CS,
MSR_IA32_SYSENTER_ESP,
@@ -7153,6 +7154,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu)
{
vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_R, false);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_RW, false);
}

static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bf9d3051cd6c..0a00242a91e7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;

/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 14
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 15
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d127b229dd29..8b033c9241d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_steal_time_set_preempted(vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, idx);

+ if (vcpu->preempted)
+ fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
+
static_call(kvm_x86_vcpu_put)(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
}
@@ -9951,6 +9954,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return();

+ fpu_restore_guest_xfd_err(&vcpu->arch.guest_fpu);
+
if (unlikely(vcpu->arch.switch_db_regs)) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);

2021-12-07 15:10:50

by Yang Zhong

[permalink] [raw]
Subject: [PATCH 17/19] docs: virt: api.rst: Document the new KVM_{G, S}ET_XSAVE2 ioctls

From: Jing Liu <[email protected]>

Document the details of the new KVM_{G, S}ET_XSAVE2 ioctls.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
Documentation/virt/kvm/api.rst | 47 ++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aeeb071c7688..39dfd867e429 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1569,6 +1569,8 @@ otherwise it will return EBUSY error.
};

This ioctl would copy current vcpu's xsave struct to the userspace.
+Applications should use KVM_GET_XSAVE2 if the xsave state is larger
+than 4KB.


4.43 KVM_SET_XSAVE
@@ -1588,6 +1590,8 @@ This ioctl would copy current vcpu's xsave struct to the userspace.
};

This ioctl would copy userspace's xsave struct to the kernel.
+Applications should use KVM_SET_XSAVE2 if the xsave state is larger
+than 4KB.


4.44 KVM_GET_XCRS
@@ -7484,3 +7488,46 @@ The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace
the hypercalls whose corresponding bit is in the argument, and return
ENOSYS for the others.
+
+8.35 KVM_GET_XSAVE2
+-------------------
+
+:Capability: KVM_CAP_XSAVE2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xsave2 (out)
+:Returns: 0 on success, -1 on error
+
+
+::
+
+ struct kvm_xsave2 {
+ __u32 size;
+ __u8 state[0];
+ };
+
+This ioctl copies the current vcpu's xsave struct to userspace when
+the xsave state size is larger than 4KB. Application code should set
+the 'size' member to the size of the xsave state, and KVM copies the
+xsave state into the 'state' region.
+
+8.36 KVM_SET_XSAVE2
+-------------------
+
+:Capability: KVM_CAP_XSAVE2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xsave2 (in)
+:Returns: 0 on success, -1 on error
+
+
+::
+
+ struct kvm_xsave2 {
+ __u32 size;
+ __u8 state[0];
+ };
+
+This ioctl copies userspace's xsave struct to the kernel when the
+xsave state size is larger than 4KB. Application code should set the
+'size' member to the size of the xsave state.
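
For illustration, a VMM could drive the two ioctls roughly as in the
sketch below. Deriving 'size' from CPUID leaf 0xD is an assumption
here, not something this document specifies:

        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Sketch: 'size' is assumed to be the enabled xsave area size
         * for the guest, e.g. as derived from CPUID leaf 0xD. */
        int save_restore_xstate(int vcpu_fd, __u32 size)
        {
                struct kvm_xsave2 *xs = malloc(sizeof(*xs) + size);
                int ret = -1;

                if (!xs)
                        return -1;
                xs->size = size;
                if (ioctl(vcpu_fd, KVM_GET_XSAVE2, xs) < 0)
                        goto out;
                /* ... transfer xs->state, e.g. for live migration ... */
                if (ioctl(vcpu_fd, KVM_SET_XSAVE2, xs) < 0)
                        goto out;
                ret = 0;
        out:
                free(xs);
                return ret;
        }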

2021-12-07 15:10:53

by Yang Zhong

[permalink] [raw]
Subject: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

From: Jing Liu <[email protected]>

When dynamic XSTATE features are supported, the xsave state can
exceed 4KB. The current kvm_xsave structure and the related
KVM_{G, S}ET_XSAVE ioctls only allow 4KB, which is not enough for
the full state.

Introduce a new kvm_xsave2 structure and the corresponding
KVM_GET_XSAVE2 and KVM_SET_XSAVE2 ioctls so that userspace VMM
can get and set the full xsave states.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 6 ++++
arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 7 +++-
3 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a776a08f78c..de42a51e20c3 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -47,6 +47,7 @@
#define __KVM_HAVE_VCPU_EVENTS
#define __KVM_HAVE_DEBUGREGS
#define __KVM_HAVE_XSAVE
+#define __KVM_HAVE_XSAVE2
#define __KVM_HAVE_XCRS
#define __KVM_HAVE_READONLY_MEM

@@ -378,6 +379,11 @@ struct kvm_xsave {
__u32 region[1024];
};

+struct kvm_xsave2 {
+ __u32 size;
+ __u8 state[0];
+};
+
#define KVM_MAX_XCRS 16

struct kvm_xcr {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b033c9241d6..d212f6d2d39a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4216,6 +4216,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_DEBUGREGS:
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
+ case KVM_CAP_XSAVE2:
case KVM_CAP_ASYNC_PF:
case KVM_CAP_ASYNC_PF_INT:
case KVM_CAP_GET_TSC_KHZ:
@@ -4940,6 +4941,17 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
vcpu->arch.pkru);
}

+static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
+ u8 *state, u32 size)
+{
+ if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+ return;
+
+ fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
+ state, size,
+ vcpu->arch.pkru);
+}
+
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
struct kvm_xsave *guest_xsave)
{
@@ -4951,6 +4963,15 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
supported_xcr0, &vcpu->arch.pkru);
}

+static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
+{
+ if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+ return 0;
+
+ return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
+ supported_xcr0, &vcpu->arch.pkru);
+}
+
static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
struct kvm_xcrs *guest_xcrs)
{
@@ -5416,6 +5437,47 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
break;
}
+ case KVM_GET_XSAVE2: {
+ struct kvm_xsave2 __user *xsave2_arg = argp;
+ struct kvm_xsave2 xsave2;
+
+ r = -EFAULT;
+ if (copy_from_user(&xsave2, xsave2_arg, sizeof(struct kvm_xsave2)))
+ break;
+
+ u.buffer = kzalloc(xsave2.size, GFP_KERNEL_ACCOUNT);
+
+ r = -ENOMEM;
+ if (!u.buffer)
+ break;
+
+ kvm_vcpu_ioctl_x86_get_xsave2(vcpu, u.buffer, xsave2.size);
+
+ r = -EFAULT;
+ if (copy_to_user(xsave2_arg->state, u.buffer, xsave2.size))
+ break;
+
+ r = 0;
+ break;
+ }
+ case KVM_SET_XSAVE2: {
+ struct kvm_xsave2 __user *xsave2_arg = argp;
+ struct kvm_xsave2 xsave2;
+
+ r = -EFAULT;
+ if (copy_from_user(&xsave2, xsave2_arg, sizeof(struct kvm_xsave2)))
+ break;
+
+ u.buffer = memdup_user(xsave2_arg->state, xsave2.size);
+
+ if (IS_ERR(u.buffer)) {
+ r = PTR_ERR(u.buffer);
+ goto out_nofree;
+ }
+
+ r = kvm_vcpu_ioctl_x86_set_xsave2(vcpu, u.buffer);
+ break;
+ }
case KVM_GET_XCRS: {
u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL_ACCOUNT);
r = -ENOMEM;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0c7b301c7254..603e1ca9ba09 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1132,7 +1132,9 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
#define KVM_CAP_ARM_MTE 205
#define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
-
+#ifdef __KVM_HAVE_XSAVE2
+#define KVM_CAP_XSAVE2 207
+#endif
#ifdef KVM_CAP_IRQ_ROUTING

struct kvm_irq_routing_irqchip {
@@ -1679,6 +1681,9 @@ struct kvm_xen_hvm_attr {
#define KVM_GET_SREGS2 _IOR(KVMIO, 0xcc, struct kvm_sregs2)
#define KVM_SET_SREGS2 _IOW(KVMIO, 0xcd, struct kvm_sregs2)

+#define KVM_GET_XSAVE2 _IOR(KVMIO, 0xcf, struct kvm_xsave2)
+#define KVM_SET_XSAVE2 _IOW(KVMIO, 0xd0, struct kvm_xsave2)
+
struct kvm_xen_vcpu_attr {
__u16 type;
__u16 pad[3];

2021-12-07 15:10:56

by Yang Zhong

[permalink] [raw]
Subject: [PATCH 18/19] kvm: x86: AMX XCR0 support for guest

From: Jing Liu <[email protected]>

Two XCR0 bits are defined for AMX to support the XSAVE mechanism:
bit 17 is for tilecfg and bit 18 is for tiledata.

The value of XCR0[18:17] is always either 00b or 11b. Also, the SDM
recommends that only 64-bit operating systems enable Intel AMX by
setting XCR0[18:17].

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/kvm/x86.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d212f6d2d39a..a9a608c8fa50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -210,7 +210,7 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
#define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
| XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
- | XFEATURE_MASK_PKRU)
+ | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)

u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);
@@ -1017,6 +1017,23 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
return 1;
}
+
+#ifdef CONFIG_X86_64
+ if ((xcr0 & XFEATURE_MASK_XTILE) &&
+ ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE))
+ return 1;
+#else
+ /*
+ * Intel AMX instructions can be executed only in 64-bit mode but
+ * XSAVE can operate on XTILECFG and XTILEDATA in any mode.
+ * Since the FPU core follows SDM recommendation to set
+ * XCR[18:17] only in 64-bit environment, here also prevent any
+ * guest OS from setting the two bits when host is 32-bit.
+ *
+ * XFEATURE_MASK_XTILE cannot be used since it is 0 in this case.
+ */
+ xcr0 &= ~(XFEATURE_MASK_XTILE_DATA | XFEATURE_MASK_XTILE_CFG);
+#endif
vcpu->arch.xcr0 = xcr0;

if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)

2021-12-07 15:11:09

by Yang Zhong

[permalink] [raw]
Subject: [PATCH 19/19] kvm: x86: Add AMX CPUIDs support

From: Jing Liu <[email protected]>

Extend CPUID emulation to support XFD, AMX_TILE, AMX_INT8 and
AMX_BF16. Adding those bits into kvm_cpu_caps finally activates all
the previous logic in this series.

Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Yang Zhong <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/kvm/cpuid.c | 16 +++++++++++++---
2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d5b5f2ab87a0..da872b6f8d8b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -299,7 +299,9 @@
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
#define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */
+#define X86_FEATURE_AMX_BF16 (18*32+22) /* AMX bf16 Support */
#define X86_FEATURE_AMX_TILE (18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8 (18*32+25) /* AMX int8 Support */

/* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 13 */
#define X86_FEATURE_CLZERO (13*32+ 0) /* CLZERO instruction */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ea51b986ee67..7bb56cc89aa7 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -510,7 +510,8 @@ void kvm_set_cpu_caps(void)
F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
- F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16)
+ F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
+ F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16)
);

/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -529,7 +530,7 @@ void kvm_set_cpu_caps(void)
);

kvm_cpu_cap_mask(CPUID_D_1_EAX,
- F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES)
+ F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) | F(XFD)
);

kvm_cpu_cap_init_scattered(CPUID_12_EAX,
@@ -655,6 +656,8 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
case 0x14:
case 0x17:
case 0x18:
+ case 0x1d:
+ case 0x1e:
case 0x1f:
case 0x8000001d:
entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
@@ -779,6 +782,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
}
break;
case 9:
+ case 0x1e: /* TMUL information */
break;
case 0xa: { /* Architectural Performance Monitoring */
struct x86_pmu_capability cap;
@@ -914,7 +918,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
break;
/* Intel PT */
case 0x14:
- if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) {
+ if ((function == 0x14 && !kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) ||
+ (function == 0x1d && !kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))) {
entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
break;
}
@@ -924,6 +929,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
goto out;
}
break;
+ /* Intel AMX TILE */
+ case 0x1d:
+ if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
case KVM_CPUID_SIGNATURE: {
const u32 *sigptr = (const u32 *)KVM_SIGNATURE;
entry->eax = KVM_CPUID_FEATURES;

2021-12-08 07:23:28

by Liu, Jing2

[permalink] [raw]
Subject: RE: [PATCH 13/19] kvm: x86: Disable WRMSR interception for IA32_XFD on demand


On 12/8/2021 8:03 AM, Yang Zhong wrote:
> From: Jing Liu <[email protected]>
>
> Always intercepting IA32_XFD causes non-negligible overhead when this
> register is updated frequently in the guest.
>
> Disable WRMSR interception to IA32_XFD after fpstate reallocation is
> completed. There are three options for when to disable the
> interception:
>
> 1) When emulating the 1st WRMSR which requires reallocation,
> disable interception before exiting to userapce with the
> assumption that the userspace VMM should not bounch back to
> the kernel if reallocation fails. However it's not good to
> design kernel based on application behavior. If due to bug
> the vCPU thread comes back to the kernel after reallocation
> fails, XFD passthrough may lead to host memory corruption
> when doing XSAVES for guest fpstate which has a smaller size
> than what guest XFD allows.
>
> 2) Disable interception when coming back from the userspace VMM
> (for the 1st WRMSR which triggers reallocation). Re-check
> whether fpstate size can serve the new guest XFD value. Disable
> interception only when the check succeeds. This requires KVM
> to store guest XFD value in some place and then compare it
> to guest_fpu::user_xfeatures in the completion handler.

For option 2), we are considering that fpstate->size can be used to
indicate whether reallocation succeeded. Once one of the XFD features
(today, it's AMX) is enabled, the kernel needs to reallocate the full
size; otherwise KVM has no chance to reallocate for other XFD features
later since the MSR is non-trapped (to avoid WRMSR VM-exits due to the
guest toggling XFD).

Then KVM doesn't need to store the guest XFD value anywhere, and the
kernel fpu core may need an API to tell KVM the guest's permitted size.
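
A rough sketch of that direction, where guest_fpu_permitted_size()
is a made-up name for the API mentioned above:

        /* Sketch only: in the completion handler after returning
         * from the userspace VMM */
        if (vcpu->arch.guest_fpu.fpstate->size >=
            guest_fpu_permitted_size(&vcpu->arch.guest_fpu))
                static_call(kvm_x86_set_xfd_passthrough)(vcpu);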

Thanks,
Jing

>
> 3) Disable interception at the 2nd WRMSR which enables dynamic
> XSTATE features. If guest_fpu::user_xfeatures already includes
> bits for dynamic features set in guest XFD value, disable
> interception.
>


2021-12-10 15:44:36

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 07/19] kvm: x86: Propagate fpstate reallocation error to userspace

On 12/8/21 01:03, Yang Zhong wrote:
> out:
> - kvm_put_guest_fpu(vcpu);
> + ret = kvm_put_guest_fpu(vcpu);
> + if ((r >= 0) && (ret < 0))
> + r = ret;
> +

No extra parentheses.

Paolo

2021-12-10 16:02:58

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

First, the MSR should be added to msrs_to_save_all and
kvm_cpu_cap_has(X86_FEATURE_XFD) should be checked in kvm_init_msr_list.

It seems that RDMSR support is missing, too.

More important, please include:

- documentation for the new KVM_EXIT_* value

- a selftest that explains how userspace should react to it.

This is a strong requirement for any new API (the first has been for
years; but the latter is also almost always respected these days). This
series should not have been submitted without documentation.

Also:

On 12/8/21 01:03, Yang Zhong wrote:
>
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> + return 1;

This should allow msr->host_initiated always (even if XFD is not part of
CPUID). However, if XFD is nonzero and kvm_check_guest_realloc_fpstate
returns true, then it should return 1.

The selftest should also cover using KVM_GET_MSR/KVM_SET_MSR.

> + /* Setting unsupported bits causes #GP */
> + if (~XFEATURE_MASK_USER_DYNAMIC & data) {
> + kvm_inject_gp(vcpu, 0);
> + break;
> + }

This should check

if (data & ~(XFEATURE_MASK_USER_DYNAMIC &
vcpu->arch.guest_supported_xcr0))

instead.

Paolo

2021-12-10 16:16:54

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling

On 12/8/21 01:03, Yang Zhong wrote:
>
> The guest XFD_ERR value is saved in fpu_guest::xfd_err. There is
> no need to save host XFD_ERR because it's always cleared to ZERO
> by the host #NM handler (which cannot be preempted by a vCPU
> thread to observe a non-zero value).
>
> The lower two bits in fpu_guest::xfd_err are borrowed for special
> purposes. The state components (FP and SSE) covered by the two
> bits are not XSAVE-enabled features, thus not XFD-enabled either.
> It is impossible for hardware to set them in XFD_ERR:
>
> - XFD_ERR_GUEST_DISABLED (bit 0)
>
> Indicates that the XFD extension is not exposed to the guest, thus
> there is no need to save/restore it.

Can this instead just check if xfeatures includes any dynamic-save features?

Paolo

> - XFD_ERR_GUEST_SAVED (bit 1)
>
> Indicates that fpu_guest::xfd_err already contains a saved value,
> thus there is no need for duplicated saving (e.g. when the vCPU
> thread is preempted multiple times before re-entering the guest).


2021-12-10 16:23:46

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

On 12/8/21 01:03, Yang Zhong wrote:
> kvm_steal_time_set_preempted(vcpu);
> srcu_read_unlock(&vcpu->kvm->srcu, idx);
>
> + if (vcpu->preempted)
> + fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
> +

Instead of checking vcpu->preempted, can you instead check if the active
FPU is the guest FPU? That is, save if
current->thread.fpu.fpstate->is_guest?

Paolo

2021-12-10 16:25:10

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On 12/8/21 01:03, Yang Zhong wrote:
> From: Jing Liu <[email protected]>
>
> When dynamic XSTATE features are supported, the xsave state can
> exceed 4KB. The current kvm_xsave structure and the related
> KVM_{G, S}ET_XSAVE ioctls only allow 4KB, which is not enough for
> the full state.
>
> [...]

Please also modify KVM_GET/SET_XSAVE to fail with ENOSPC if the
requested size is bigger than 4096.

Paolo

2021-12-10 16:30:19

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On 12/8/21 01:03, Yang Zhong wrote:
> +static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
> + u8 *state, u32 size)
> +{
> + if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> + return;
> +
> + fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
> + state, size,
> + vcpu->arch.pkru);
> +}
> +
> static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
> struct kvm_xsave *guest_xsave)
> {
> @@ -4951,6 +4963,15 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
> supported_xcr0, &vcpu->arch.pkru);
> }
>
> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
> +{
> + if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> + return 0;
> +
> + return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
> + supported_xcr0, &vcpu->arch.pkru);
> +}
> +

I think fpu_copy_uabi_to_guest_fpstate (and therefore
copy_uabi_from_kernel_to_xstate) needs to check that the size is
compatible with the components in the input.

Also, IIUC the size of the AMX state will vary in different processors.
Is this correct? If so, this should be handled already by
KVM_GET/SET_XSAVE2 and therefore should be part of the
arch/x86/kernel/fpu APIs. In the future we want to support migrating a
"small AMX" host to a "large AMX" host; and also migrating from a "large
AMX" host to a "small AMX" host if the guest CPUID is compatible with
the destination of the migration.

Paolo

2021-12-10 16:30:56

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 18/19] kvm: x86: AMX XCR0 support for guest

On 12/8/21 01:03, Yang Zhong wrote:
> +
> +#ifdef CONFIG_X86_64
> + if ((xcr0 & XFEATURE_MASK_XTILE) &&
> + ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE))
> + return 1;
> +#else
> + /*
> + * Intel AMX instructions can be executed only in 64-bit mode but
> + * XSAVE can operate on XTILECFG and XTILEDATA in any mode.
> + * Since the FPU core follows SDM recommendation to set
> + * XCR[18:17] only in 64-bit environment, here also prevent any
> + * guest OS from setting the two bits when host is 32-bit.
> + *
> + * XFEATURE_MASK_XTILE cannot be used since it is 0 in this case.
> + */
> + xcr0 &= ~(XFEATURE_MASK_XTILE_DATA | XFEATURE_MASK_XTILE_CFG);
> +#endif

This should not be necessary, because on a 32-bit system the bits won't
be part of supported_xcr0.
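
With that, the check would reduce to the 64-bit branch alone
(sketch):

        if ((xcr0 & XFEATURE_MASK_XTILE) &&
            ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE))
                return 1;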

Paolo

2021-12-10 21:52:51

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 19/19] kvm: x86: Add AMX CPUIDs support

On 12/8/21 01:03, Yang Zhong wrote:
> @@ -914,7 +918,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> break;
> /* Intel PT */
> case 0x14:
> - if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) {
> + if ((function == 0x14 && !kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) ||
> + (function == 0x1d && !kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))) {

This hunk is wrong.

> entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
> break;
> }
> @@ -924,6 +929,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> goto out;
> }
> break;
> + /* Intel AMX TILE */
> + case 0x1d:
> + if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))
> + entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
> + break;

This also needs a loop similar to the one in case 0x14; so the "break"
goes inside the "if" and then you have

for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) {
if (!do_host_cpuid(array, function, i))
goto out;
}


Same for 0x1e, which also needs to be marked conditional.

Paolo

2021-12-10 22:01:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

On 12/8/21 01:03, Yang Zhong wrote:
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> kvm_apic_set_version(vcpu);
> }
>
> + /* Enable saving guest XFD_ERR */
> + best = kvm_find_cpuid_entry(vcpu, 7, 0);
> + if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
> + vcpu->arch.guest_fpu.xfd_err = 0;
> +

This is incorrect. Instead it should check whether leaf 0xD includes
any dynamic features.
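
For instance (sketch only):

        best = kvm_find_cpuid_entry(vcpu, 0xd, 0);
        if (best && (best->eax & XFEATURE_MASK_USER_DYNAMIC))
                vcpu->arch.guest_fpu.xfd_err = 0;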

Paolo

2021-12-10 22:13:40

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On 12/10/21 17:30, Paolo Bonzini wrote:
>>
>> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8
>> *state)
>> +{
>> +    if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>> +        return 0;
>> +
>> +    return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
>> +                          supported_xcr0, &vcpu->arch.pkru);
>> +}
>> +
>
> I think fpu_copy_uabi_to_guest_fpstate (and therefore
> copy_uabi_from_kernel_to_xstate) needs to check that the size is
> compatible with the components in the input.
>
> Also, IIUC the size of the AMX state will vary in different processors.
>  Is this correct?  If so, this should be handled already by
> KVM_GET/SET_XSAVE2 and therefore should be part of the
> arch/x86/kernel/fpu APIs.  In the future we want to support migrating a
> "small AMX" host to a "large AMX" host; and also migrating from a "large
> AMX" host to a "small AMX" host if the guest CPUID is compatible with
> the destination of the migration.

So, the size of the AMX state will depend on the active "palette" in
TILECONFIG, and on the CPUID information. I have a few questions on how
Intel intends to handle future extensions to AMX:

- can we assume that, in the future, palette 1 will always have the same
value (bytes_per_row=64, max_names=8, max_rows=16), and basically that
the only variable value is really the number of palettes?

- how does Intel plan to handle bigger TILEDATA? Will it use more XCR0
bits or will it rather enlarge save state 18?

If it will use more XCR0 bits, I suppose that XCR0 bits will control
which palettes can be chosen by LDTILECFG.

If not, on the other hand, this will be a first case of one system's
XSAVE data not being XRSTOR-able on another system even if the
destination system can set XCR0 to the same value as the source system.

Likewise, if the size and offsets for save state 18 were to vary
depending on the selected palette, then this would be novel, in that the
save state size and offsets would not be in CPUID anymore. It would be
particularly interesting for non-compacted format, where all save states
after 18 would also move forward.

So, I hope that save state 18 will be frozen to 8k. In that case, and
if palette 1 is frozen to the same values as today, implementing
migration will not be a problem; it will be essentially the same as
SSE->AVX (horizontal extension of existing registers) and/or AVX->AVX512
(both horizontal and vertical extension).

By the way, I think KVM_SET_XSAVE2 is not needed. Instead:

- KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) should return the size of the
buffer that is passed to KVM_GET_XSAVE2

- KVM_GET_XSAVE2 should fill in the buffer expecting that its size is
whatever KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) passes

- KVM_SET_XSAVE can just expect a buffer that is bigger than 4k if the
save states recorded in the header point to offsets larger than 4k.
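
In userspace the proposed flow would look roughly like this (a
sketch, assuming KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) returns the
required buffer size as proposed above):

        int size = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XSAVE2);
        void *buf = malloc(size > 4096 ? size : 4096);

        ioctl(vcpu_fd, KVM_GET_XSAVE2, buf);
        /* ... */
        ioctl(vcpu_fd, KVM_SET_XSAVE, buf);  /* may be larger than 4k */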

Paolo

2021-12-10 22:33:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 05/19] x86/fpu: Move xfd initialization out of __fpstate_reset() to the callers

Yang, Jing,

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index fe592799508c..fae44fa27cdb 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -231,6 +231,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> if (!fpstate)
> return false;
>
> + /* Leave xfd to 0 (the reset value defined by spec) */
> __fpstate_reset(fpstate);

That change makes me a bit wary simply because the comment here is above
__fpstate_reset() which makes no sense. It does make sense to you at the
time, but does it make sense to you when you look at it 6 months down
the road?

So I'd rather make it very obvious what's going on. See below.

Thanks,

tglx
---

--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -199,7 +199,7 @@ void fpu_reset_from_exception_fixup(void
}

#if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate);
+static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);

static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
{
@@ -231,7 +231,8 @@ bool fpu_alloc_guest_fpstate(struct fpu_
if (!fpstate)
return false;

- __fpstate_reset(fpstate);
+ /* Leave xfd to 0 (the reset value defined by spec) */
+ __fpstate_reset(fpstate, 0);
fpstate_init_user(fpstate);
fpstate->is_valloc = true;
fpstate->is_guest = true;
@@ -454,21 +455,21 @@ void fpstate_init_user(struct fpstate *f
fpstate_init_fstate(fpstate);
}

-static void __fpstate_reset(struct fpstate *fpstate)
+static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
{
/* Initialize sizes and feature masks */
fpstate->size = fpu_kernel_cfg.default_size;
fpstate->user_size = fpu_user_cfg.default_size;
fpstate->xfeatures = fpu_kernel_cfg.default_features;
fpstate->user_xfeatures = fpu_user_cfg.default_features;
- fpstate->xfd = init_fpstate.xfd;
+ fpstate->xfd = xfd;
}

void fpstate_reset(struct fpu *fpu)
{
/* Set the fpstate pointer to the default fpstate */
fpu->fpstate = &fpu->__fpstate;
- __fpstate_reset(fpu->fpstate);
+ __fpstate_reset(fpu->fpstate, init_fpstate.xfd);

/* Initialize the permission related info in fpu */
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

2021-12-10 22:45:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 08/19] x86/fpu: Move xfd_update_state() to xstate.c and export symbol

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> From: Jing Liu <[email protected]>
>
> xfd_update_state() is the interface to update IA32_XFD and its per-cpu
> cache. All callers of this interface are currently in fpu core. KVM only
> indirectly triggers IA32_XFD update via a helper function
> (fpu_swap_kvm_fpstate()) when switching between user fpu and guest fpu.
>
> Supporting AMX in guest now requires KVM to directly update IA32_XFD
> with the guest value (when emulating WRMSR) so XSAVE/XRSTOR can manage
> XSTATE components correctly inside guest.
>
> This patch moves xfd_update_state() from fpu/xstate.h to fpu/xstate.c

s/This patch moves/Move/

please. See Documentation/process/submitting-patches.rst and search for
'This patch'

> and export it for reference outside of fpu core.
>
> Signed-off-by: Jing Liu <[email protected]>
> Signed-off-by: Yang Zhong <[email protected]>
> ---
> arch/x86/include/asm/fpu/api.h | 2 ++
> arch/x86/kernel/fpu/xstate.c | 12 ++++++++++++
> arch/x86/kernel/fpu/xstate.h | 14 +-------------
> 3 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> index 7532f73c82a6..999d89026be9 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -131,8 +131,10 @@ DECLARE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
> /* Process cleanup */
> #ifdef CONFIG_X86_64
> extern void fpstate_free(struct fpu *fpu);
> +extern void xfd_update_state(struct fpstate *fpstate);
> #else
> static inline void fpstate_free(struct fpu *fpu) { }
> +static void xfd_update_state(struct fpstate *fpstate) { }

Try a 32bit build to see the warnings this causes. That wants to be
'static inline void' obviously.

> #ifdef CONFIG_X86_64
> -static inline void xfd_update_state(struct fpstate *fpstate)
> -{
> - if (fpu_state_size_dynamic()) {
> - u64 xfd = fpstate->xfd;
> -
> - if (__this_cpu_read(xfd_state) != xfd) {
> - wrmsrl(MSR_IA32_XFD, xfd);
> - __this_cpu_write(xfd_state, xfd);
> - }
> - }
> -}
> -#else
> -static inline void xfd_update_state(struct fpstate *fpstate) { }
> +extern void xfd_update_state(struct fpstate *fpstate);

Why? It's already declared in the global header. So all of this has to
be simply removed, no?

Thanks,

tglx



2021-12-10 23:09:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> +
> + /*
> + * Update IA32_XFD to the guest value so #NM can be
> + * raised properly in the guest. Instead of directly
> + * writing the MSR, call a helper to avoid breaking
> + * per-cpu cached value in fpu core.
> + */
> + fpregs_lock();
> + current->thread.fpu.fpstate->xfd = data;
> + xfd_update_state(current->thread.fpu.fpstate);
> + fpregs_unlock();
> + break;

Now looking at the actual callsite the previous patch really should be
something like the below. Why?

It preserves the inline which allows the compiler to generate better
code in the other hotpaths and it keeps the FPU internals in the core
code. Hmm?

Thanks,

tglx

--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -125,8 +125,10 @@ DECLARE_PER_CPU(struct fpu *, fpu_fpregs
/* Process cleanup */
#ifdef CONFIG_X86_64
extern void fpstate_free(struct fpu *fpu);
+extern void fpu_update_xfd_state(u64 xfd);
#else
static inline void fpstate_free(struct fpu *fpu) { }
+static inline void fpu_update_xfd_state(u64 xfd) { }
#endif

/* fpstate-related functions which are exported to KVM */
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -322,6 +322,19 @@ int fpu_swap_kvm_fpstate(struct fpu_gues
}
EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);

+#ifdef CONFIG_X86_64
+void fpu_update_xfd_state(u64 xfd)
+{
+ struct fpstate *fps = current->thread.fpu.fpstate;
+
+ fpregs_lock();
+ fps->xfd = xfd;
+ xfd_update_state(fps);
+ fpregs_unlock();
+}
+EXPORT_SYMBOL_GPL(fpu_update_xfd_state);
+#endif
+
void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
unsigned int size, u32 pkru)
{



2021-12-10 23:12:01

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 12/19] x86/fpu: Prepare KVM for bringing XFD state back in-sync

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> From: Thomas Gleixner <[email protected]>
>
> The guest may toggle IA32_XFD at high frequency as it is part of the fpstate
> information (features, sizes, xfd) and swapped in task context switch.
>
> To minimize the trap overhead of writes to this MSR, one optimization
> is to allow guest direct write thus eliminate traps. However MSR
> passthrough implies that guest_fpstate::xfd and per-cpu xfd cache might
> be out of sync with the current IA32_XFD value by the guest.
>
> This suggests KVM needs to re-sync guest_fpstate::xfd and per-cpu cache
> with IA32_XFD before the vCPU thread might be preempted or interrupted.
>
> This patch provides a helper function for the re-sync purpose.

Provide a ....

> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Jing Liu <[email protected]>
> Signed-off-by: Yang Zhong <[email protected]>
> ---
> (To Thomas): the original name kvm_update_guest_xfd_state() in
> your sample code is renamed to xfd_sync_state() in this patch. In
> concept it is a general helper to bring software values in-sync with
> the MSR value after they become out-of-sync. KVM is just the
> first out-of-sync usage on this helper, so a neutral name may make
> more sense. But if you prefer to the original name we can also
> change back.

There is no need for a general helper, really.

It's KVM specific and should go into KVM section in core.c next to the
other thing vs. the XFD update.

Thanks,

tglx

2021-12-10 23:20:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -322,6 +322,55 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
> }
> EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
>
> +#ifdef CONFIG_X86_64
> +void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu)
> +{
> + if (guest_fpu->xfd_err & XFD_ERR_GUEST_DISABLED)
> + return;
> +
> + /* A non-zero value indicates guest XFD_ERR already saved */
> + if (guest_fpu->xfd_err)
> + return;
> +
> + /* Guest XFD_ERR must be saved before switching to host fpstate */
> + WARN_ON_ONCE(!current->thread.fpu.fpstate->is_guest);

Warn and proceed?

> + rdmsrl(MSR_IA32_XFD_ERR, guest_fpu->xfd_err);
> +
> + /*
> + * Restore to the host value if guest xfd_err is non-zero.
> + * Except in #NM handler, all other places in the kernel
> + * should just see xfd_err=0. So just restore to 0.
> + */
> + if (guest_fpu->xfd_err)
> + wrmsrl(MSR_IA32_XFD_ERR, 0);
> +
> + guest_fpu->xfd_err |= XFD_ERR_GUEST_SAVED;
> +}
> +EXPORT_SYMBOL_GPL(fpu_save_guest_xfd_err);
> +
> +void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu)
> +{
> + u64 xfd_err = guest_fpu->xfd_err;
> +
> + if (xfd_err & XFD_ERR_GUEST_DISABLED)
> + return;
> +
> + xfd_err &= ~XFD_ERR_GUEST_SAVED;
> +
> + /*
> + * No need to restore a zero value since XFD_ERR
> + * is always zero outside of #NM handler in the host.
> + */
> + if (!xfd_err)
> + return;
> +
> + wrmsrl(MSR_IA32_XFD_ERR, xfd_err);
> + guest_fpu->xfd_err = 0;
> +}

Why should any of this be in the FPU core?

It's a pure guest issue as all of this is related to struct fpu_guest
and not struct fpu or any other core FPU state.

Thanks,

tglx


2021-12-11 00:10:53

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 5089f2e7dc22..9811dc98d550 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> fpstate->is_guest = true;
>
> gfpu->fpstate = fpstate;
> + gfpu->xfd_err = XFD_ERR_GUEST_DISABLED;

This wants to be part of the previous patch, which introduces the field.

> gfpu->user_xfeatures = fpu_user_cfg.default_features;
> gfpu->user_perm = fpu_user_cfg.default_features;
> fpu_init_guest_permissions(gfpu);
> @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
> fpu->fpstate = guest_fps;
> guest_fps->in_use = true;
> } else {
> + fpu_save_guest_xfd_err(guest_fpu);

Hmm. See below.

> guest_fps->in_use = false;
> fpu->fpstate = fpu->__task_fpstate;
> fpu->__task_fpstate = NULL;
> @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> kvm_steal_time_set_preempted(vcpu);
> srcu_read_unlock(&vcpu->kvm->srcu, idx);
>
> + if (vcpu->preempted)
> + fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);

I'm not really excited about the thought of an exception cause register
in guest clobbered state.

Aside of that I really have to ask the question why all this is needed?

#NM in the guest is slow path, right? So why are you trying to optimize
for it?

The straight forward solution to this is:

1) Trap #NM and MSR_XFD_ERR write

2) When the guest triggers #NM it takes a VMEXIT and the host
does:

rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

injects the #NM and goes on.

3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
the host does:

vcpu->arch.guest_fpu.xfd_err = msrval;
wrmsrl(MSR_XFD_ERR, msrval);

and goes back.

4) Before entering the preemption disabled section of the VCPU loop
do:

if (vcpu->arch.guest_fpu.xfd_err)
wrmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

5) Before leaving the preemption disabled section of the VCPU loop
do:

if (vcpu->arch.guest_fpu.xfd_err)
wrmsrl(MSR_XFD_ERR, 0);

It's really that simple and pretty much 0 overhead for the regular case.

If the guest triggers #NM with a high frequency then taking the VMEXITs
is the least of the problems. That's not a realistic use case, really.

Hmm?

Thanks,

tglx

2021-12-11 01:31:54

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

On 12/11/21 01:10, Thomas Gleixner wrote:
> 2) When the guest triggers #NM it takes a VMEXIT and the host
> does:
>
> rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>
> injects the #NM and goes on.
>
> 3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
> the host does:
>
> vcpu->arch.guest_fpu.xfd_err = msrval;
> wrmsrl(MSR_XFD_ERR, msrval);

No wrmsrl here I think, the host value is 0 and should stay so. Instead
the wrmsrl will happen the next time the VCPU loop is entered.

Paolo

2021-12-11 03:07:50

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

> From: Thomas Gleixner <[email protected]>
> Sent: Saturday, December 11, 2021 8:11 AM
>
> On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index 5089f2e7dc22..9811dc98d550 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest
> *gfpu)
> > fpstate->is_guest = true;
> >
> > gfpu->fpstate = fpstate;
> > + gfpu->xfd_err = XFD_ERR_GUEST_DISABLED;
>
> This wants to be part of the previous patch, which introduces the field.
>
> > gfpu->user_xfeatures = fpu_user_cfg.default_features;
> > gfpu->user_perm = fpu_user_cfg.default_features;
> > fpu_init_guest_permissions(gfpu);
> > @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest
> *guest_fpu, bool enter_guest)
> > fpu->fpstate = guest_fps;
> > guest_fps->in_use = true;
> > } else {
> > + fpu_save_guest_xfd_err(guest_fpu);
>
> Hmm. See below.
>
> > guest_fps->in_use = false;
> > fpu->fpstate = fpu->__task_fpstate;
> > fpu->__task_fpstate = NULL;
> > @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> > kvm_steal_time_set_preempted(vcpu);
> > srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >
> > + if (vcpu->preempted)
> > + fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
>
> I'm not really exited about the thought of an exception cause register
> in guest clobbered state.
>
> Aside of that I really have to ask the question why all this is needed?
>
> #NM in the guest is slow path, right? So why are you trying to optimize
> for it?

This is really good information. The current logic is obviously
based on the assumption that #NM is frequently triggered.

>
> The straight forward solution to this is:
>
> 1) Trap #NM and MSR_XFD_ERR write

and the #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
before preemption is enabled; otherwise there is still a small window
where MSR_XFD_ERR might be clobbered after preemption is enabled and
before the #NM handler is actually called.

>
> > 2) When the guest triggers #NM it takes a VMEXIT and the host
> does:
>
> rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>
> injects the #NM and goes on.
>
> > 3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
> the host does:
>
> vcpu->arch.guest_fpu.xfd_err = msrval;
> wrmsrl(MSR_XFD_ERR, msrval);
>
> and goes back.
>
> 4) Before entering the preemption disabled section of the VCPU loop
> do:
>
> if (vcpu->arch.guest_fpu.xfd_err)
> wrmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>
> 5) Before leaving the preemption disabled section of the VCPU loop
> do:
>
> if (vcpu->arch.guest_fpu.xfd_err)
> wrmsrl(MSR_XFD_ERR, 0);
>
> It's really that simple and pretty much 0 overhead for the regular case.

Much cleaner.

>
> If the guest triggers #NM with a high frequency then taking the VMEXITs
> is the least of the problems. That's not a realistic use case, really.
>
> Hmm?
>
> Thanks,
>
> tglx

Thanks
Kevin

2021-12-11 03:23:30

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

> From: Paolo Bonzini
> Sent: Saturday, December 11, 2021 9:32 AM
>
> On 12/11/21 01:10, Thomas Gleixner wrote:
> > 2) When the guest triggers #NM it takes a VMEXIT and the host
> > does:
> >
> > rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> >
> > injects the #NM and goes on.
> >
> > 3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
> > the host does:
> >
> > vcpu->arch.guest_fpu.xfd_err = msrval;
> > wrmsrl(MSR_XFD_ERR, msrval);
>
> No wrmsrl here I think, the host value is 0 and should stay so. Instead
> the wrmsrl will happen the next time the vcpu loop is entered.
>

To elaborate, I guess the reason is that MSR_XFD_ERR should
always contain the host value 0 after preemption is enabled, while
WRMSR emulation is called with preemption enabled. Then we
just need to wait for the next time the vcpu loop is entered to
restore the guest value after preemption is disabled. :)

Thanks
Kevin

2021-12-11 13:10:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

On Sat, Dec 11 2021 at 02:31, Paolo Bonzini wrote:
> On 12/11/21 01:10, Thomas Gleixner wrote:
>> 2) When the guest triggers #NM it takes a VMEXIT and the host
>> does:
>>
>> rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>>
>> injects the #NM and goes on.
>>
>> 3) When the guest writes to MSR_XFD_ERR it takes an VMEXIT and
>> the host does:
>>
>> vcpu->arch.guest_fpu.xfd_err = msrval;
>> wrmsrl(MSR_XFD_ERR, msrval);
>
> No wrmsrl here I think, the host value is 0 and should stay so. Instead
> the wrmsrl will happen the next time the VCPU loop is entered.

I assumed this can be handled in the fast path, but either way.

2021-12-11 13:29:16

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

Kevin,

On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> #NM in the guest is slow path, right? So why are you trying to optimize
>> for it?
>
> This is really good information. The current logic is obviously
> based on the assumption that #NM is frequently triggered.

More context.

When an application wants to use AMX, it invokes the prctl() which
grants permission. Even when permission is granted, the kernel FPU
state buffers remain default sized and XFD stays armed.

When a thread of that process issues the first AMX (tile) instruction,
then #NM is raised.

The #NM handler does:

1) Read MSR_XFD_ERR. If 0, goto regular #NM

2) Write MSR_XFD_ERR to 0

3) Check whether the process has permission granted. If not,
raise SIGILL and return.

4) Allocate and install a larger FPU state buffer for the task.
If allocation fails, raise SIGSEGV and return.

5) Disarm XFD for that task

That means one thread takes at most one AMX/XFD related #NM during its
lifetime, which means two VMEXITs.

If there are other XFD controlled facilities in the future, then it will
be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
them. Not the end of the world either.

Looking at the targeted application space it's pretty unlikely that
tasks which utilize AMX are going to be so short lived that the overhead
of these VMEXITs really matters.

This of course can be revisited when there is a sane use case, but
optimizing for it prematurely does not buy us anything else than
pointless complexity.

>> The straight forward solution to this is:
>>
>> 1) Trap #NM and MSR_XFD_ERR write
>
> and #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
> before preemption is enabled, otherwise there is still a small window
> where MSR_XFD_ERR might be clobbered after preemption enable and
> before #NM handler is actually called.

Yes.

Thanks,

tglx
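
For reference, a condensed sketch of that #NM sequence (modeled on
handle_xfd_event() in arch/x86/kernel/traps.c; the IRQ and user-mode
details are trimmed, so treat the error handling as approximate):

	static bool handle_xfd_event(struct pt_regs *regs)
	{
		u64 xfd_err;

		/* 1) Read MSR_XFD_ERR; zero means a regular #NM */
		rdmsrl(MSR_IA32_XFD_ERR, xfd_err);
		if (!xfd_err)
			return false;

		/* 2) Clear it so the next #NM starts from a clean slate */
		wrmsrl(MSR_IA32_XFD_ERR, 0);

		/* 3)-5) Permission check, buffer reallocation, XFD disarm */
		switch (xfd_enable_feature(xfd_err)) {
		case -EPERM:
			force_sig(SIGILL);
			break;
		case -EFAULT:
			force_sig(SIGSEGV);
			break;
		}
		return true;
	}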


2021-12-11 21:20:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 00/19] AMX Support in KVM

Jing, Yang,

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
>
> Thanks Thomas for the thoughts and patches on the KVM FPU and AMX
> support.

welcome. The overall impression of that series is that it is heading
in the right direction. Keep up the good work!

Thanks,

tglx

2021-12-12 01:50:29

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

> From: Thomas Gleixner <[email protected]>
> Sent: Saturday, December 11, 2021 9:29 PM
>
> Kevin,
>
> On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> #NM in the guest is slow path, right? So why are you trying to optimize
> >> for it?
> >
> > This is really good information. The current logic is obviously
> > based on the assumption that #NM is frequently triggered.
>
> More context.
>
> When an application wants to use AMX, it invokes the prctl() which
> grants permission. Even when permission is granted, the kernel FPU
> state buffers remain default sized and XFD stays armed.
>
> When a thread of that process issues the first AMX (tile) instruction,
> then #NM is raised.
>
> The #NM handler does:
>
> 1) Read MSR_XFD_ERR. If 0, goto regular #NM
>
> 2) Write MSR_XFD_ERR to 0
>
> 3) Check whether the process has permission granted. If not,
> raise SIGILL and return.
>
> 4) Allocate and install a larger FPU state buffer for the task.
> If allocation fails, raise SIGSEGV and return.
>
> 5) Disarm XFD for that task
>
> That means one thread takes at most one AMX/XFD related #NM during its
> lifetime, which means two VMEXITs.
>
> If there are other XFD controlled facilities in the future, then it will
> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which
> uses
> them. Not the end of the world either.
>
> Looking at the targeted application space it's pretty unlikely that
> tasks which utilize AMX are going to be so short lived that the overhead
> of these VMEXITs really matters.
>
> This of course can be revisited when there is a sane use case, but
> optimizing for it prematurely does not buy us anything else than
> pointless complexity.

I get all of the above.

I guess the original open is also about the frequency of #NM exits not
due to XFD. For a Linux guest it looks like a non-issue, since CR0.TS is
no longer set when math emulation is not required:

DEFINE_IDTENTRY(exc_device_not_available)
{
	...
	/* This should not happen. */
	if (WARN(cr0 & X86_CR0_TS, "CR0.TS was set")) {
		/* Try to fix it up and carry on. */
		write_cr0(cr0 & ~X86_CR0_TS);
	} else {
		/*
		 * Something terrible happened, and we're better off trying
		 * to kill the task than getting stuck in a never-ending
		 * loop of #NM faults.
		 */
		die("unexpected #NM exception", regs, 0);
	}
}

It may affect guests which still use CR0.TS for lazy FPU save. But
modern OSes have likely all moved to the eager save approach, so always
trapping #NM should be fine.

Is this understanding correct?

Thanks
Kevin

2021-12-12 09:10:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

On 12/12/21 02:50, Tian, Kevin wrote:
>>
>> If there are other XFD controlled facilities in the future, then it will
>> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which
>> uses
>> them. Not the end of the world either.
>>
>> Looking at the targeted application space it's pretty unlikely that
>> tasks which utilize AMX are going to be so short lived that the overhead
>> of these VMEXITs really matters.
>>
>> This of course can be revisited when there is a sane use case, but
>> optimizing for it prematurely does not buy us anything else than
>> pointless complexity.
> It may affect guest which still uses CR0.TS to do lazy save. But likely
> modern OSes all move to eager save approach so always trapping #NM
> should be fine.

You also don't need to trap #NM if CPUID includes no dynamic bits,
because then XFD will never be nonzero.

Paolo
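
A sketch of that short-circuit on the KVM side, e.g. when building the
VMX exception bitmap (guest_supported_xcr0 as the CPUID-derived per-vCPU
mask; illustrative, not from the series):

	/* Only intercept #NM if the guest can ever arm XFD */
	if (vcpu->arch.guest_supported_xcr0 & XFEATURE_MASK_USER_DYNAMIC)
		eb |= (1u << NM_VECTOR);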

2021-12-12 13:26:01

by Yang Zhong

[permalink] [raw]
Subject: Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

On Fri, Dec 10, 2021 at 11:01:15PM +0100, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
> >--- a/arch/x86/kvm/cpuid.c
> >+++ b/arch/x86/kvm/cpuid.c
> >@@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > kvm_apic_set_version(vcpu);
> > }
> >+ /* Enable saving guest XFD_ERR */
> >+ best = kvm_find_cpuid_entry(vcpu, 7, 0);
> >+ if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
> >+ vcpu->arch.guest_fpu.xfd_err = 0;
> >+
>
> This is incorrect. Instead it should check whether leaf 0xD
> includes any dynamic features.
>

Thanks Paolo. Ditto then for "[PATCH 04/19] kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2".

Yang

> Paolo

2021-12-13 07:51:37

by Liu, Jing2

[permalink] [raw]
Subject: RE: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

On 12/11/2021 12:02 AM, Paolo Bonzini wrote:
>
> Also:
>
> On 12/8/21 01:03, Yang Zhong wrote:
> >
> > + if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> > + return 1;
>
> This should allow msr->host_initiated always (even if XFD is not part of
> CPUID).
Thanks Paolo.

msr->host_initiated handling will be added in the next version.

I'd like to ask why we always allow msr->host_initiated even if XFD is not
part of CPUID, although the guest doesn't care about that MSR. We found some
MSRs (e.g. MSR_AMD64_OSVW_STATUS and MSR_AMD64_OSVW_ID_LENGTH)
are specially handled, so we would like to know the rationale for allowing
msr->host_initiated.

	if (!msr_info->host_initiated && !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
		return 1;


> However, if XFD is nonzero and kvm_check_guest_realloc_fpstate
> returns true, then it should return 1.
>
If XFD is nonzero, kvm_check_guest_realloc_fpstate() won't return true, so
maybe this check is not needed here?

Thanks,
Jing

>
> Paolo

2021-12-13 08:23:24

by Wei Wang

[permalink] [raw]
Subject: RE: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On Saturday, December 11, 2021 6:13 AM, Paolo Bonzini wrote:
>
> By the way, I think KVM_SET_XSAVE2 is not needed. Instead:
>
> - KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) should return the size of the
> buffer that is passed to KVM_GET_XSAVE2
>
> - KVM_GET_XSAVE2 should fill in the buffer expecting that its size is
> whatever KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) passes
>
> - KVM_SET_XSAVE can just expect a buffer that is bigger than 4k if the
> save states recorded in the header point to offsets larger than 4k.

I think one issue is that KVM_SET_XSAVE works with "struct kvm_xsave" (hardcoded 4KB buffer),
including kvm_vcpu_ioctl_x86_set_xsave. The states obtained via KVM_GET_XSAVE2 will be
packaged using "struct kvm_xsave2".

Did you mean that we could add a new code path under KVM_SET_XSAVE to make it work with
the new "struct kvm_xsave2"? e.g.:

	/* xsave2_enabled below is set when userspace checks KVM_CAP_XSAVE2 */
	if (kvm->xsave2_enabled) {
		/* new implementation using "struct kvm_xsave2" */
	} else {
		/* current implementation using "struct kvm_xsave" */
	}

(this seems like a new implementation which might deserve a new ioctl)

Thanks,
Wei

2021-12-13 09:01:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

On 12/13/21 08:51, Liu, Jing2 wrote:
> On 12/11/2021 12:02 AM, Paolo Bonzini wrote:
>>
>> Also:
>>
>> On 12/8/21 01:03, Yang Zhong wrote:
>>>
>>> + if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
>>> + return 1;
>>
>> This should allow msr->host_initiated always (even if XFD is not part of
>> CPUID).
> Thanks Paolo.
>
> msr->host_initiated handling would be added in next version.
>
> I'd like to ask why always allow msr->host_initiated even if XFD is not part of
> CPUID, although guest doesn't care that MSR? We found some MSRs
> (e.g. MSR_AMD64_OSVW_STATUS and MSR_AMD64_OSVW_ID_LENGTH )
> are specially handled so would like to know the consideration of allowing
> msr->host_initiated.
>
> 	if (!msr_info->host_initiated && !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> 		return 1;

Because it's simpler if userspace can just take the entire list from
KVM_GET_MSR_INDEX_LIST and pass it to KVM_GET/SET_MSR. See for example
vcpu_save_state and vcpu_load_state in
tools/testing/selftests/kvm/lib/x86_64/processor.c.

>> However, if XFD is nonzero and kvm_check_guest_realloc_fpstate
>> returns true, then it should return 1.
>
> If XFD is nonzero, kvm_check_guest_realloc_fpstate() won't return true. So
> may not need this check here?

It can't for now, because there's a single dynamic feature, but here:

+	if ((xfd & xcr0) != xcr0) {
+		u64 request = (xcr0 ^ xfd) & xcr0;
+		struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
+
+		/*
+		 * If requested features haven't been enabled, update
+		 * the request bitmap and tell the caller to request
+		 * dynamic buffer reallocation.
+		 */
+		if ((guest_fpu->user_xfeatures & request) != request) {
+			vcpu->arch.guest_fpu.realloc_request = request;
+			return true;
+		}
+	}

it is certainly possible to return true with nonzero XFD.

Paolo

2021-12-13 09:12:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states

On 12/8/21 01:03, Yang Zhong wrote:
> - user_xfeatures
>
> Track which features are currently enabled for the vCPU

Please rename to alloc_xfeatures

> - user_perm
>
> Copied from guest_perm of the group leader thread. The first
> vCPU which does the copy locks the guest_perm

Please rename to perm_xfeatures.

> - realloc_request
>
> KVM sets this field to request dynamically-enabled features
> which require reallocation of @fpstate

This field should be in vcpu->arch, and there is no need for
fpu_guest_realloc_fpstate. Rename __xfd_enable_feature to
fpu_enable_xfd_feature and add it to the public API, then just do

	if (unlikely(vcpu->arch.xfd_realloc_request)) {
		u64 request = vcpu->arch.xfd_realloc_request;

		ret = fpu_enable_xfd(request, enter_guest);
	}

to kvm_put_guest_fpu.

Paolo

2021-12-13 09:16:33

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 09/19] kvm: x86: Prepare reallocation check

On 12/8/21 01:03, Yang Zhong wrote:
> +	u64 xcr0 = vcpu->arch.xcr0 & XFEATURE_MASK_USER_DYNAMIC;
> +
> +	/* For any state which is enabled dynamically */
> +	if ((xfd & xcr0) != xcr0) {
> +		u64 request = (xcr0 ^ xfd) & xcr0;
> +		struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
> +
> +		/*
> +		 * If requested features haven't been enabled, update
> +		 * the request bitmap and tell the caller to request
> +		 * dynamic buffer reallocation.
> +		 */
> +		if ((guest_fpu->user_xfeatures & request) != request) {
> +			vcpu->arch.guest_fpu.realloc_request = request;

This should be "|=". If you have

wrmsr(XFD, dynamic-feature-1);
...
wrmsr(XFD, dynamic-feature-2);

then the space for both features has to be allocated.

> +			return true;
> +		}
> +	}
> +


This is just:

	struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
	u64 xcr0 = vcpu->arch.xcr0 & XFEATURE_MASK_USER_DYNAMIC;
	u64 dynamic_enabled = xcr0 & ~xfd;

	if (!(dynamic_enabled & ~guest_fpu->user_xfeatures))
		return false;

	/*
	 * This should actually not be in guest_fpu, see review of
	 * patch 2. Also see above regarding "=" vs "|=".
	 */
	vcpu->arch.guest_fpu.realloc_request |= dynamic_enabled;
	return true;

But without documentation I'm not sure why this exit-to-userspace step
is needed:

- if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
userspace error and you can #GP the guest without any issue. Userspace
is buggy

- if (dynamic_enabled & ~guest_fpu->user_xfeatures) != 0, but the
feature *is* permitted, then it is okay to just go on and reallocate the
guest FPU.


Paolo

2021-12-13 09:24:20

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On 12/13/21 09:23, Wang, Wei W wrote:
> On Saturday, December 11, 2021 6:13 AM, Paolo Bonzini wrote:
>>
>> By the way, I think KVM_SET_XSAVE2 is not needed. Instead:
>>
>> - KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) should return the size of the
>> buffer that is passed to KVM_GET_XSAVE2
>>
>> - KVM_GET_XSAVE2 should fill in the buffer expecting that its size is
>> whatever KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) passes
>>
>> - KVM_SET_XSAVE can just expect a buffer that is bigger than 4k if the
>> save states recorded in the header point to offsets larger than 4k.
>
> I think one issue is that KVM_SET_XSAVE works with "struct kvm_xsave" (hardcoded 4KB buffer),
> including kvm_vcpu_ioctl_x86_set_xsave. The states obtained via KVM_GET_XSAVE2 will be
> packaged using "struct kvm_xsave2".
>
> Did you mean that we could add a new code path under KVM_SET_XSAVE to make it work with
> the new "struct kvm_xsave2"?

There is no need for struct kvm_xsave2, because there is no need for a
"size" argument.

- KVM_GET_XSAVE2 *is* needed, and it can expect a buffer as big as the
return value of KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)

- but KVM_SET_XSAVE2 is not needed, because KVM_SET_XSAVE can use
copy_from_user to read the XSTATE_BV, use it to deduce the size of the
buffer, and use copy_from_user to read the full size of the buffer.

For this to work you can redefine struct kvm_xsave to

struct kvm_xsave {
	__u32 region[1024];

	/*
	 * KVM_GET_XSAVE only uses 4096 bytes and only returns
	 * user save states up to save state 17 (TILECFG).
	 *
	 * For KVM_GET_XSAVE2, the total size of region + extra
	 * must be the size that is communicated by
	 * KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
	 *
	 * KVM_SET_XSAVE uses the extra field if the struct was
	 * returned by KVM_GET_XSAVE2.
	 */
	__u32 extra[];
};

Paolo
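
A sketch of the resulting userspace flow under this proposal (error
handling omitted; vm_fd/vcpu_fd are hypothetical descriptors):

	int size = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XSAVE2);
	struct kvm_xsave *xsave;

	if (size < (int)sizeof(struct kvm_xsave))
		size = sizeof(struct kvm_xsave);
	xsave = calloc(1, size);

	ioctl(vcpu_fd, KVM_GET_XSAVE2, xsave);	/* fills region[] + extra[] */
	/* ... transfer the buffer, e.g. for migration ... */
	ioctl(vcpu_fd, KVM_SET_XSAVE, xsave);	/* size deduced from XSTATE_BV */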

2021-12-13 10:36:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On Fri, Dec 10 2021 at 17:30, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
>> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
>> +{
>> + if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>> + return 0;
>> +
>> + return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
>> + supported_xcr0, &vcpu->arch.pkru);
>> +}
>> +
>
> I think fpu_copy_uabi_to_guest_fpstate (and therefore
> copy_uabi_from_kernel_to_xstate) needs to check that the size is
> compatible with the components in the input.

fpu_copy_uabi_to_guest_fpstate() expects that the input buffer is
correctly sized. We surely can add a size check there.

> Also, IIUC the size of the AMX state will vary in different processors.
> Is this correct? If so, this should be handled already by
> KVM_GET/SET_XSAVE2 and therefore should be part of the
> arch/x86/kernel/fpu APIs. In the future we want to support migrating a
> "small AMX" host to a "large AMX" host; and also migrating from a "large
> AMX" host to a "small AMX" host if the guest CPUID is compatible with
> the destination of the migration.

How is that supposed to work? If the AMX state size differs then the
hosts are not compatible.

Thanks,

tglx

2021-12-13 10:43:44

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On 12/13/21 11:10, Thomas Gleixner wrote:
> On Fri, Dec 10 2021 at 17:30, Paolo Bonzini wrote:
>> On 12/8/21 01:03, Yang Zhong wrote:
>>> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
>>> +{
>>> + if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>>> + return 0;
>>> +
>>> + return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
>>> + supported_xcr0, &vcpu->arch.pkru);
>>> +}
>>> +
>>
>> I think fpu_copy_uabi_to_guest_fpstate (and therefore
>> copy_uabi_from_kernel_to_xstate) needs to check that the size is
>> compatible with the components in the input.
>
> fpu_copy_uabi_to_guest_fpstate() expects that the input buffer is
> correctly sized. We surely can add a size check there.

fpu_copy_guest_fpstate_to_uabi is more problematic because that one
writes memory. For fpu_copy_uabi_to_guest_fpstate, we know the input
buffer size from the components and we can use it to do a properly-sized
memdup_user.

For fpu_copy_guest_fpstate_to_uabi we can just decide that KVM_GET_XSAVE
will only save up to the first 4K. Something like the following might
actually be good for 5.16-rc; right now, header.xfeatures might lead
userspace into reading uninitialized or unmapped memory:

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed0..69609b8c3887 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1138,8 +1138,10 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 	struct xstate_header header;
 	unsigned int zerofrom;
 	u64 mask;
+	u64 size;
 	int i;
 
+	size = to.left;
 	memset(&header, 0, sizeof(header));
 	header.xfeatures = xsave->header.xfeatures;
 
@@ -1186,7 +1188,20 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 	/* Copy xsave->i387.sw_reserved */
 	membuf_write(&to, xstate_fx_sw_bytes, sizeof(xsave->i387.sw_reserved));
 
-	/* Copy the user space relevant state of @xsave->header */
+	/*
+	 * Copy the user space relevant state of @xsave->header.
+	 * If not all features fit in the buffer, drop them from the
+	 * saved state so that userspace does not read uninitialized or
+	 * unmapped memory.
+	 */
+	mask = fpstate->user_xfeatures;
+	for_each_extended_xfeature(i, mask) {
+		if (xstate_offsets[i] + xstate_sizes[i] > size) {
+			header.xfeatures &= BIT(i) - 1;
+			mask &= BIT(i) - 1;
+			break;
+		}
+	}
 	membuf_write(&to, &header, sizeof(header));
 
 	zerofrom = offsetof(struct xregs_state, extended_state_area);
@@ -1197,7 +1212,6 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 	 * but there is no state to copy from in the compacted
 	 * init_fpstate. The gap tracking will zero these states.
 	 */
-	mask = fpstate->user_xfeatures;
 
 	for_each_extended_xfeature(i, mask) {
 		/*



>> Also, IIUC the size of the AMX state will vary in different processors.
>> Is this correct? If so, this should be handled already by
>> KVM_GET/SET_XSAVE2 and therefore should be part of the
>> arch/x86/kernel/fpu APIs. In the future we want to support migrating a
>> "small AMX" host to a "large AMX" host; and also migrating from a "large
>> AMX" host to a "small AMX" host if the guest CPUID is compatible with
>> the destination of the migration.
>
> How is that supposed to work? If the AMX state size differs then the
> hosts are not compatible.

I replied with some more questions later. Basically it depends on how
Intel will define palettes that aren't part of the first implementation
of AMX.

Paolo

2021-12-13 12:00:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states

On Mon, Dec 13 2021 at 10:12, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
>> - user_xfeatures
>>
>> Track which features are currently enabled for the vCPU
>
> Please rename to alloc_xfeatures

That name makes no sense at all. This has nothing to do with alloc.

>> - user_perm
>>
>> Copied from guest_perm of the group leader thread. The first
>> vCPU which does the copy locks the guest_perm
>
> Please rename to perm_xfeatures.

All of that is following the naming conventions in the FPU code related
to permissions etc.

>> - realloc_request
>>
>> KVM sets this field to request dynamically-enabled features
>> which require reallocation of @fpstate
>
> This field should be in vcpu->arch, and there is no need for
> fpu_guest_realloc_fpstate. Rename __xfd_enable_feature to
> fpu_enable_xfd_feature and add it to the public API, then just do
>
> if (unlikely(vcpu->arch.xfd_realloc_request)) {
> u64 request = vcpu->arch.xfd_realloc_request;
> ret = fpu_enable_xfd(request, enter_guest);
> }
>
> to kvm_put_guest_fpu.

Why? Yet another export of FPU internals just because?

Also what clears the reallocation request and what is the @enter_guest
argument supposed to help with?

I have no idea what you are trying to achieve.

Thanks,

tglx

2021-12-13 12:40:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On Mon, Dec 13 2021 at 11:43, Paolo Bonzini wrote:
> On 12/13/21 11:10, Thomas Gleixner wrote:
>> On Fri, Dec 10 2021 at 17:30, Paolo Bonzini wrote:
>>> I think fpu_copy_uabi_to_guest_fpstate (and therefore
>>> copy_uabi_from_kernel_to_xstate) needs to check that the size is
>>> compatible with the components in the input.
>>
>> fpu_copy_uabi_to_guest_fpstate() expects that the input buffer is
>> correctly sized. We surely can add a size check there.
>
> fpu_copy_guest_fpstate_to_uabi is more problematic because that one
> writes memory. For fpu_copy_uabi_to_guest_fpstate, we know the input
> buffer size from the components and we can use it to do a properly-sized
> memdup_user.
>
> For fpu_copy_guest_fpstate_to_uabi we can just decide that KVM_GET_XSAVE
> will only save up to the first 4K. Something like the following might
> actually be good for 5.16-rc; right now, header.xfeatures might lead
> userspace into reading uninitialized or unmapped memory:

If user space supplies a 4k buffer and reads beyond the end of the
buffer then it's hardly a kernel problem.

That function allows the caller to provide a short buffer and fills it,
up to the point where the buffer ends, with the real information.

Thanks,

tglx
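
That behavior falls out of the membuf helper itself (from
include/linux/regset.h at the time of this thread): writes are silently
truncated once the remaining space is exhausted.

	struct membuf {
		void *p;
		size_t left;
	};

	static inline void membuf_write(struct membuf *s, const void *v, size_t size)
	{
		size_t amount = min(s->left, size);

		memcpy(s->p, v, amount);
		s->p += amount;
		s->left -= amount;
	}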

2021-12-13 12:45:58

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states

On 12/13/21 13:00, Thomas Gleixner wrote:
> On Mon, Dec 13 2021 at 10:12, Paolo Bonzini wrote:
>> On 12/8/21 01:03, Yang Zhong wrote:
>>> - user_xfeatures
>>>
>>> Track which features are currently enabled for the vCPU
>>
>> Please rename to alloc_xfeatures
>
> That name makes no sense at all. This has nothing to do with alloc.

Isn't that the features for which space is currently allocated?

fpstate_realloc does

+	if (guest_fpu) {
+		newfps->is_guest = true;
+		newfps->is_confidential = curfps->is_confidential;
+		guest_fpu->user_xfeatures |= xfeatures;
+	}
+

and kvm_check_guest_realloc_fpstate does


+	if ((guest_fpu->user_xfeatures & request) != request) {
+		vcpu->arch.guest_fpu.realloc_request |= request;
+		return true;
+	}

Reading "user_xfeatures" in there is cryptic, it seems like it's
something related to the userspace thread or group that has invoked the
KVM ioctl. If it's renamed to alloc_xfeatures, then this:

+	missing = request & ~guest_fpu->alloc_xfeatures;
+	if (missing) {
+		vcpu->arch.guest_fpu.realloc_request |= missing;
+		return true;
+	}

makes it obvious that the allocation is for features that are requested
but haven't been allocated in the xstate yet.

>>> - user_perm
>>>
>>> Copied from guest_perm of the group leader thread. The first
>>> vCPU which does the copy locks the guest_perm
>>
>> Please rename to perm_xfeatures.
>
> All of that is following the naming conventions in the FPU code related
> to permissions etc.

perm or guest_perm is just fine as well; but user_perm is not (there's
no preexisting reference to user_perm in the code, in fact, as far as I
can see).

>>> - realloc_request
>>>
>>> KVM sets this field to request dynamically-enabled features
>>> which require reallocation of @fpstate
>>
>> This field should be in vcpu->arch, and there is no need for
>> fpu_guest_realloc_fpstate. Rename __xfd_enable_feature to
>> fpu_enable_xfd_feature and add it to the public API, then just do
>>
>> if (unlikely(vcpu->arch.xfd_realloc_request)) {
>> u64 request = vcpu->arch.xfd_realloc_request;
>> ret = fpu_enable_xfd(request, enter_guest);
>> }
>>
>> to kvm_put_guest_fpu.
>
> Why? Yet another export of FPU internals just because?

It's one function more and one field less. I prefer another export of
FPU internals to a write to a random field with undocumented invariants.

For example, why WARN_ON_ONCE if enter_guest == true? If you enter the
guest after the host has restored MSR_IA32_XFD with KVM_SET_MSR, the
WARN_ON_ONCE can actually fire. It would be just fine to skip the
reallocation until enter_guest == false, which is when you actually XSAVE
into the guest_fpu.

As an aside, realloc_request (if it survives, see below) shouldn't be
added until patch 6.

> Also what clears the reallocation request and what is the @enter_guest
> argument supposed to help with?

Nothing---make it

	if (unlikely(vcpu->arch.xfd_realloc_request)) {
		u64 request = vcpu->arch.xfd_realloc_request;

		ret = fpu_enable_xfd(request, &vcpu->arch.guest_fpu);
		if (!ret)
			vcpu->arch.xfd_realloc_request = 0;
	}

but in fact, I'm not sure why the request has to be delayed at all. The
obvious implementation of a write to XFD, after all the validity checks,
is just

	/* This function just calls xfd_enable_feature. */
	r = fpu_guest_alloc_xfeatures(&vcpu->arch.guest_fpu,
				      vcpu->arch.xcr0 & ~xfd);

	/*
	 * An error means that userspace has screwed up by not doing
	 * arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM). If we are here coming
	 * from an ioctl fail it; if the guest has done WRMSR or XSETBV
	 * it will inject a #GP.
	 */
	if (r < 0)
		return 1;

	vcpu->arch.xfd = xfd;

with none of the kvm_check_guest_realloc_fpstate or KVM_EXIT_FPU_REALLOC
business. No field and a simple, self-contained external API. The same
code works for __kvm_set_xcr as well, just with "xcr0 & ~vcpu->arch.xfd"
as the second argument instead.

(Maybe I'm missing why KVM_EXIT_FPU_REALLOC is needed, but I'm not
ashamed to say that, given that new userspace API was added without
documentation or tests. The only comment in the code is:

+	/*
+	 * Check if fpstate reallocate is required. If yes, then
+	 * let the fpu core do reallocation and update xfd;
+	 * otherwise, update xfd here.
+	 */
+	if (kvm_check_guest_realloc_fpstate(vcpu, data)) {
+		vcpu->run->exit_reason = KVM_EXIT_FPU_REALLOC;
+		vcpu->arch.complete_userspace_io =
+			kvm_skip_emulated_instruction;
+		return KVM_MSR_RET_USERSPACE;
+	}
+

which has nothing to do with the actual content of either
kvm_check_guest_realloc_fpstate or the "then" branch. Userspace can
just do ARCH_REQ_XCOMP_GUEST_PERM based on the guest CPUID, before
KVM_SET_CPUID2, KVM_SET_MSR or KVM_SET_XCR).

Paolo

2021-12-13 15:06:33

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

On 12/8/21 01:03, Yang Zhong wrote:
> +	/*
> +	 * Update IA32_XFD to the guest value so #NM can be
> +	 * raised properly in the guest. Instead of directly
> +	 * writing the MSR, call a helper to avoid breaking
> +	 * per-cpu cached value in fpu core.
> +	 */
> +	fpregs_lock();
> +	current->thread.fpu.fpstate->xfd = data;

This is wrong, it should be written in vcpu->arch.guest_fpu.

> + xfd_update_state(current->thread.fpu.fpstate);

This is okay though, so that KVM_SET_MSR will not write XFD and WRMSR will.

That said, I think xfd_update_state should not have an argument.
current->thread.fpu.fpstate->xfd is the only fpstate that should be
synced with the xfd_state per-CPU variable.

Paolo

> + fpregs_unlock();


2021-12-13 19:45:56

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
>> +	/*
>> +	 * Update IA32_XFD to the guest value so #NM can be
>> +	 * raised properly in the guest. Instead of directly
>> +	 * writing the MSR, call a helper to avoid breaking
>> +	 * per-cpu cached value in fpu core.
>> +	 */
>> +	fpregs_lock();
>> +	current->thread.fpu.fpstate->xfd = data;
>
> This is wrong, it should be written in vcpu->arch.guest_fpu.
>
>> + xfd_update_state(current->thread.fpu.fpstate);
>
> This is okay though, so that KVM_SET_MSR will not write XFD and WRMSR
> will.
>
> That said, I think xfd_update_state should not have an argument.
> current->thread.fpu.fpstate->xfd is the only fpstate that should be
> synced with the xfd_state per-CPU variable.

I'm looking into this right now. The whole restore versus runtime thing
needs to be handled differently.

Thanks,

tglx

2021-12-13 19:50:22

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states

Paolo,

On Mon, Dec 13 2021 at 13:45, Paolo Bonzini wrote:
> On 12/13/21 13:00, Thomas Gleixner wrote:
>> On Mon, Dec 13 2021 at 10:12, Paolo Bonzini wrote:
>>> Please rename to alloc_xfeatures
>>
>> That name makes no sense at all. This has nothing to do with alloc.
>
> Isn't that the features for which space is currently allocated?

It is, but from the kernel POV this is user. :)

> Reading "user_xfeatures" in there is cryptic, it seems like it's
> something related to the userspace thread or group that has invoked the
> KVM ioctl. If it's renamed to alloc_xfeatures, then this:
>
> +	missing = request & ~guest_fpu->alloc_xfeatures;
> +	if (missing) {
> +		vcpu->arch.guest_fpu.realloc_request |= missing;
> +		return true;
> +	}
>
> makes it obvious that the allocation is for features that are requested
> but haven't been allocated in the xstate yet.

Let's rename it to xfeatures and perm and be done with it.

>> Why? Yet another export of FPU internals just because?
>
> It's one function more and one field less. I prefer another export of
> FPU internals, to a write to a random field with undocumented
> invariants.

We want fewer exports, not more. :)

> For example, why WARN_ON_ONCE if enter_guest == true? If you enter the
> guest after the host has restored MSR_IA32_XFD with KVM_SET_MSR, the

Indeed restoring a guest might require buffer reallocation, I missed
that, duh!

On restore the following components are involved:

XCR0, XFD, XSTATE

XCR0 and XFD have to be restored _before_ XSTATE and that needs to
be enforced.

But independent of the ordering of XCR0 and XFD restore the following
check applies to both the restore and the runtime logic:

int kvm_fpu_realloc(struct kvm_vcpu *vcpu, u64 xcr0, u64 xfd)
{
	u64 expand, enabled = xcr0 & ~xfd;

	expand = enabled & ~vcpu->arch.guest_fpu.xfeatures;
	if (!expand)
		return 0;

	return fpu_enable_guest_features(&vcpu->arch.guest_fpu, expand);
}

int fpu_enable_guest_features(struct guest_fpu *gfpu, u64 which)
{
	permission_checks();
	...
	return fpstate_realloc(.....)
}

fpstate_realloc() needs to be careful about flipping the pointers
depending on whether guest_fpu->fpstate is actually active, i.e.:

	current->thread.fpu.fpstate == gfpu->fpstate

I'm halfway done with that. Will send something soonish.

Thanks,

tglx




2021-12-13 21:23:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

Paolo,

On Mon, Dec 13 2021 at 20:45, Thomas Gleixner wrote:
> On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
>> That said, I think xfd_update_state should not have an argument.
>> current->thread.fpu.fpstate->xfd is the only fpstate that should be
>> synced with the xfd_state per-CPU variable.
>
> I'm looking into this right now. The whole restore versus runtime thing
> needs to be handled differently.

We need to look at different things here:

1) XFD MSR write emulation

2) XFD MSR synchronization when write emulation is disabled

3) Guest restore

#1 and #2 are in the context of vcpu_run() and

vcpu->arch.guest_fpu.fpstate == current->thread.fpu.fpstate

while #3 has:

vcpu->arch.guest_fpu.fpstate != current->thread.fpu.fpstate


#2 is only updating fpstate->xfd and the per CPU shadow.

So the state synchronization wants to be something like this:

void fpu_sync_guest_xfd_state(void)
{
	struct fpstate *fps = current->thread.fpu.fpstate;

	lockdep_assert_irqs_disabled();
	if (fpu_state_size_dynamic()) {
		rdmsrl(MSR_IA32_XFD, fps->xfd);
		__this_cpu_write(xfd_state, fps->xfd);
	}
}
EXPORT_SYMBOL_GPL(fpu_sync_guest_xfd_state);

No wrmsrl() because the MSR is already up to date. The important part is
that fpstate->xfd and the shadow state are updated so that after
reenabling preemption the context switch FPU logic works correctly.


#1 and #3 can trigger a reallocation of guest_fpu.fpstate and
can fail. But this is also true for XSETBV emulation and XCR0 restore.

For #1 modifying fps->xfd in the KVM code before calling into the FPU
code is just _wrong_ because if the guest removes the XFD restriction
then it must be ensured that the buffer is sized correctly _before_ this
is updated.

For #3 it's not really important, but I still try to wrap my head around
the whole picture vs. XCR0.

There are two options:

1) Require strict ordering of XFD and XCR0 update to avoid pointless
buffer expansion, i.e. XFD before XCR0.

Because if XCR0 is updated while guest_fpu->fpstate.xfd is still in
init state (0) and XCR0 contains extended features, then the buffer
would be expanded because XFD does not mask the extended features
out. When XFD is restored with a non-zero value, it's too late
already.

2) Ignore buffer expansion up to the point where XSTATE restore happens
and evaluate guest XCR0 and guest_fpu->fpstate.xfd there.

I'm leaning towards #1 because that means we have exactly _ONE_ place
where we need to deal with buffer expansion. If Qemu gets the ordering
wrong it wastes memory per vCPU, *shrug*.

Thanks,

tglx
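
Put together, a sketch of the ordering for the write emulation case (#1),
reusing the kvm_fpu_realloc() helper sketched earlier in this thread
(illustrative, not from the series):

	/*
	 * Reallocate first: if the guest is relaxing XFD, the buffer must
	 * be correctly sized before the new XFD value becomes visible.
	 */
	if (kvm_fpu_realloc(vcpu, vcpu->arch.xcr0, data))
		return 1;

	fpregs_lock();
	vcpu->arch.guest_fpu.fpstate->xfd = data;
	xfd_update_state(vcpu->arch.guest_fpu.fpstate);
	fpregs_unlock();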





2021-12-14 00:16:25

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> Similar to native permissions this doesn't actually enable the
> permitted feature. KVM is expected to install a larger kernel buffer
> and enable the feature when detecting the intention from the guest.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Jing Liu <[email protected]>
> Signed-off-by: Yang Zhong <[email protected]>
> ---
> (To Thomas) We change the definition of xstate_get_guest_group_perm()
> from xstate.h to api.h since this will be called by KVM.

No.

There is absolutely no need for that. After creating a vCPU the
permissions are frozen and readily available via
vcpu->arch.guest_fpu.perm.

Thanks,

tglx


2021-12-14 06:06:59

by Wei Wang

[permalink] [raw]
Subject: RE: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On Monday, December 13, 2021 5:24 PM, Paolo Bonzini wrote:
> There is no need for struct kvm_xsave2, because there is no need for a "size"
> argument.
>
> - KVM_GET_XSAVE2 *is* needed, and it can expect a buffer as big as the return
> value of KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)

Why would KVM_GET_XSAVE2 still be needed in this case?

I'm thinking it would also be possible to reuse KVM_GET_XSAVE:

- If userspace calls KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), then KVM knows that the
userspace is a new version and works with a larger xsave buffer, using the "size" that KVM
returns via KVM_CAP_XSAVE2. So we can add a flag "kvm->xsave2_enabled", which gets set
when userspace checks KVM_CAP_XSAVE2.

- On KVM_GET_XSAVE, if "kvm->xsave2_enabled" is set, then KVM allocates a buffer to load
the xstates and copies the loaded xstate data to the userspace buffer, using the "size" that
was returned to userspace via KVM_CAP_XSAVE2. If "kvm->xsave2_enabled" isn't set, the
legacy 4KB size is used.

Thanks,
Wei

2021-12-14 06:19:06

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On 12/14/21 07:06, Wang, Wei W wrote:
> On Monday, December 13, 2021 5:24 PM, Paolo Bonzini wrote:
>> There is no need for struct kvm_xsave2, because there is no need for a "size"
>> argument.
>>
>> - KVM_GET_XSAVE2 *is* needed, and it can expect a buffer as big as the return
>> value of KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
>
> Why would KVM_GET_XSAVE2 still be needed in this case?
>
> I'm thinking it would also be possible to reuse KVM_GET_XSAVE:
>
> - If userspace calls KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), then KVM knows that the
> userspace is a new version and works with a larger xsave buffer, using the "size" that KVM
> returns via KVM_CAP_XSAVE2. So we can add a flag "kvm->xsave2_enabled", which gets set
> when userspace checks KVM_CAP_XSAVE2.

You can use KVM_ENABLE_CAP(KVM_CAP_XSAVE2) for that, yes. In that case
you don't need KVM_GET_XSAVE2.

Paolo

> - On KVM_GET_XSAVE, if "kvm->xsave2_enabled" is set, then KVM allocates a buffer to load
> the xstates and copies the loaded xstate data to the userspace buffer, using the "size" that
> was returned to userspace via KVM_CAP_XSAVE2. If "kvm->xsave2_enabled" isn't set, the
> legacy 4KB size is used.
>
> Thanks,
> Wei
>


2021-12-14 07:06:47

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH 09/19] kvm: x86: Prepare reallocation check

> From: Paolo Bonzini
> Sent: Monday, December 13, 2021 5:16 PM
>
>
> - if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
> userspace error and you can #GP the guest without any issue. Userspace
> is buggy
>

Is it a general guideline that an error caused by emulation itself (e.g.
due to no memory) can be reflected into the guest as #GP, even
when from guest p.o.v there is nothing wrong with its setting?

Thanks
Kevin

2021-12-14 07:16:17

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

Hi, Thomas,

> From: Thomas Gleixner <[email protected]>
> Sent: Tuesday, December 14, 2021 5:23 AM
>
> Paolo,
>
> On Mon, Dec 13 2021 at 20:45, Thomas Gleixner wrote:
> > On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
> >> That said, I think xfd_update_state should not have an argument.
> >> current->thread.fpu.fpstate->xfd is the only fpstate that should be
> >> synced with the xfd_state per-CPU variable.
> >
> > I'm looking into this right now. The whole restore versus runtime thing
> > needs to be handled differently.
>

After looking at your series, I think it missed Paolo's comment
about changing xfd_update_state() to accept no argument.

Thanks
Kevin

2021-12-14 10:16:24

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 09/19] kvm: x86: Prepare reallocation check

On 12/14/21 08:06, Tian, Kevin wrote:
>> - if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
>> userspace error and you can #GP the guest without any issue. Userspace
>> is buggy
>
> Is it a general guideline that an error caused by emulation itself (e.g.
> due to no memory) can be reflected into the guest as #GP, even
> when from guest p.o.v there is nothing wrong with its setting?

No memory is a tricky one; if possible it should propagate -ENOMEM up to
KVM_RUN or KVM_SET_MSR. But it's basically an impossible case anyway,
because even with 8K TILEDATA we're within the limit of
PAGE_ALLOC_COSTLY_ORDER.
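
Rough numbers, assuming a default fpstate of a few KB: adding the 8K
TILEDATA component keeps the enlarged buffer around 11-12KB, i.e. an
order-2 (16KB) allocation, while PAGE_ALLOC_COSTLY_ORDER is 3 (32KB).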

So, since it's not easy to do it right now, we can look at it later.

Paolo

2021-12-14 10:41:22

by Yang Zhong

[permalink] [raw]
Subject: Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

On Fri, Dec 10, 2021 at 05:02:49PM +0100, Paolo Bonzini wrote:
> First, the MSR should be added to msrs_to_save_all and
> kvm_cpu_cap_has(X86_FEATURE_XFD) should be checked in
> kvm_init_msr_list.
>
> It seems that RDMSR support is missing, too.
>
> More important, please include:
>
> - documentation for the new KVM_EXIT_* value
>
> - a selftest that explains how userspace should react to it.
>
> This is a strong requirement for any new API (the first has been for
> years; but the latter is also almost always respected these days).
> This series should not have been submitted without documentation.
>
> Also:
>
> On 12/8/21 01:03, Yang Zhong wrote:
> >
> >+ if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> >+ return 1;
>
> This should allow msr->host_initiated always (even if XFD is not
> part of CPUID). However, if XFD is nonzero and
> kvm_check_guest_realloc_fpstate returns true, then it should return
> 1.
>
> The selftest should also cover using KVM_GET_MSR/KVM_SET_MSR.
>

Paolo, it seems we do not need a new KVM_EXIT_* anymore with Thomas' new patchset below:
git://git.kernel.org/pub/scm/linux/kernel/git/people/tglx/devel.git x86/fpu-kvm

So does the selftest still need to support KVM_GET_MSR/KVM_SET_MSR for MSR_IA32_XFD
and MSR_IA32_XFD_ERR? If yes, should we only do some read/write tests with vcpu_set_msr()/
vcpu_get_msr() from the new selftest tool, or do wrmsr from the guest side and check the value
from the selftest side?

I checked some msr selftest reference code; tsc_msrs_test.c may be the better
reference. If you have a better suggestion, please share it with me. Thanks!

Yang



2021-12-14 11:24:55

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

On 12/14/21 11:26, Yang Zhong wrote:
> Paolo, it seems we do not need a new KVM_EXIT_* anymore with Thomas' new patchset below:
> git://git.kernel.org/pub/scm/linux/kernel/git/people/tglx/devel.git x86/fpu-kvm
>
> So does the selftest still need to support KVM_GET_MSR/KVM_SET_MSR for MSR_IA32_XFD
> and MSR_IA32_XFD_ERR? If yes, should we only do some read/write tests with vcpu_set_msr()/
> vcpu_get_msr() from the new selftest tool, or do wrmsr from the guest side and check the value
> from the selftest side?

You can write a test similar to state_test.c to cover XCR0, XFD and the
new XSAVE extensions. The test can:

- initialize AMX and write a nonzero value to XFD

- load a matrix into TMM0

- check that #NM is delivered (search for vm_install_exception_handler) and
that XFD_ERR is correct (a sketch of such a handler follows the patch below)

- write 0 to XFD

- load again the matrix, and check that #NM is not delivered

- store it back into memory

- compare it with the original data

All of this can be done with a full save&restore after every step
(though I suggest that you first get it working without save&restore,
the relevant code in state_test.c is easy to identify and comment out).

You will have to modify vcpu_load_state, so that it does
first KVM_SET_MSRS, then KVM_SET_XCRS, then KVM_SET_XSAVE.
See patch below.

Paolo

> I checked some msr selftest reference code; tsc_msrs_test.c may be the better
> reference. If you have a better suggestion, please share it with me. Thanks!


------------------ 8< -----------------
From: Paolo Bonzini <[email protected]>
Subject: [PATCH] selftest: kvm: Reorder vcpu_load_state steps for AMX

For AMX support it is recommended to load XCR0 after XFD, so that
KVM does not see XFD=0, XCR=1 for a save state that will eventually
be disabled (which would lead to premature allocation of the space
required for that save state).

It is also required to load XSAVE data after XCR0 and XFD, so that
KVM can trigger allocation of the extra space required to store AMX
state.

Adjust vcpu_load_state to obey these new requirements.

Signed-off-by: Paolo Bonzini <[email protected]>

diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 82c39db91369..d805f63f7203 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -1157,16 +1157,6 @@ void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_x86_state *s
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
 	int r;
 
-	r = ioctl(vcpu->fd, KVM_SET_XSAVE, &state->xsave);
-	TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XSAVE, r: %i",
-		    r);
-
-	if (kvm_check_cap(KVM_CAP_XCRS)) {
-		r = ioctl(vcpu->fd, KVM_SET_XCRS, &state->xcrs);
-		TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XCRS, r: %i",
-			    r);
-	}
-
 	r = ioctl(vcpu->fd, KVM_SET_SREGS, &state->sregs);
 	TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_SREGS, r: %i",
 		    r);
@@ -1175,6 +1165,16 @@ void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_x86_state *s
 	TEST_ASSERT(r == state->msrs.nmsrs, "Unexpected result from KVM_SET_MSRS, r: %i (failed at %x)",
 		    r, r == state->msrs.nmsrs ? -1 : state->msrs.entries[r].index);
 
+	if (kvm_check_cap(KVM_CAP_XCRS)) {
+		r = ioctl(vcpu->fd, KVM_SET_XCRS, &state->xcrs);
+		TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XCRS, r: %i",
+			    r);
+	}
+
+	r = ioctl(vcpu->fd, KVM_SET_XSAVE, &state->xsave);
+	TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XSAVE, r: %i",
+		    r);
+
 	r = ioctl(vcpu->fd, KVM_SET_VCPU_EVENTS, &state->events);
 	TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_VCPU_EVENTS, r: %i",
 		    r);
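
A minimal sketch of the guest-side #NM handler for such a test (assuming
the selftest framework's rdmsr/wrmsr and GUEST_ASSERT helpers; the tile
load itself and the XTILEDATA mask definition are elided/assumed):

	static void guest_nm_handler(struct ex_regs *regs)
	{
		/* The #NM from the first tile load must flag XTILEDATA */
		GUEST_ASSERT(rdmsr(MSR_IA32_XFD_ERR) == XFEATURE_MASK_XTILE_DATA);
		wrmsr(MSR_IA32_XFD_ERR, 0);
		wrmsr(MSR_IA32_XFD, 0);	/* disarm XFD; the retried load succeeds */
	}

	/* host side, before running the guest: */
	vm_install_exception_handler(vm, NM_VECTOR, guest_nm_handler);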

2021-12-14 14:41:59

by Liu, Jing2

[permalink] [raw]
Subject: RE: [PATCH 09/19] kvm: x86: Prepare reallocation check


On 12/14/2021 6:16 PM, Paolo Bonzini wrote:
>
> On 12/14/21 08:06, Tian, Kevin wrote:
> >> - if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
> >> userspace error and you can #GP the guest without any issue.
> >> Userspace is buggy
> >
> > Is it a general guideline that an error caused by emulation itself (e.g.
> > due to no memory) can be reflected into the guest as #GP, even when
> > from guest p.o.v there is nothing wrong with its setting?
>
> No memory is a tricky one, if possible it should propagate -ENOMEM up to
> KVM_RUN or KVM_SET_MSR. But it's basically an impossible case anyway,
> because even with 8K TILEDATA we're within the limit of
> PAGE_ALLOC_COSTLY_ORDER.
>
> So, since it's not easy to do it right now, we can look at it later.

Regarding the way xcr0 and xfd ioctl failures are handled, xcr0 and xfd
get different treatment today: KVM_SET_XCRS returns -EINVAL to
userspace, while KVM_SET_MSR is always allowed per the discussion in
another thread.

So if reallocation fails in KVM_SET_XCRS or KVM_SET_MSR (maybe
due to ENOMEM, EPERM or ENOTSUPP), which way would we
like to choose?

Thanks,
Jing

> Paolo

2021-12-15 02:40:07

by Wei Wang

[permalink] [raw]
Subject: RE: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On Tuesday, December 14, 2021 2:19 PM, Paolo Bonzini wrote:
> Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
>
> On 12/14/21 07:06, Wang, Wei W wrote:
> > On Monday, December 13, 2021 5:24 PM, Paolo Bonzini wrote:
> >> There is no need for struct kvm_xsave2, because there is no need for a
> "size"
> >> argument.
> >>
> >> - KVM_GET_XSAVE2 *is* needed, and it can expect a buffer as big as
> >> the return value of KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
> >
> > Why would KVM_GET_XSAVE2 still be needed in this case?
> >
> > I'm thinking it would also be possible to reuse KVM_GET_XSAVE:
> >
> > - If userspace calls KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), then KVM knows
> > that the userspace is a new version and works with a larger xsave buffer, using
> > the "size" that KVM returns via KVM_CAP_XSAVE2. So we can add a flag
> > "kvm->xsave2_enabled", which gets set when userspace checks KVM_CAP_XSAVE2.
>
> You can use KVM_ENABLE_CAP(KVM_CAP_XSAVE2) for that, yes. In that case
> you don't need KVM_GET_XSAVE2.

One more thing here: what size should KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) return?
If the size still comes from the guest CPUID(0xd, 0)::RCX, would it be better to just return 1?
This requires that the QEMU CPUID info has been set in KVM before checking the cap.
QEMU already has this CPUID info to get the size (so there seems to be no need to inquire KVM for it).

Thanks,
Wei

2021-12-15 07:09:48

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH 09/19] kvm: x86: Prepare reallocation check

> From: Liu, Jing2 <[email protected]>
> Sent: Tuesday, December 14, 2021 10:42 PM
>
> On 12/14/2021 6:16 PM, Paolo Bonzini wrote:
> >
> > On 12/14/21 08:06, Tian, Kevin wrote:
> > >> - if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
> > >> userspace error and you can #GP the guest without any issue.
> > >> Userspace is buggy
> > >
> > > Is it a general guideline that an error caused by emulation itself (e.g.
> > > due to no memory) can be reflected into the guest as #GP, even when
> > > from guest p.o.v there is nothing wrong with its setting?
> >
> > No memory is a tricky one, if possible it should propagate -ENOMEM up to
> > KVM_RUN or KVM_SET_MSR. But it's basically an impossible case anyway,
> > because even with 8K TILEDATA we're within the limit of
> > PAGE_ALLOC_COSTLY_ORDER.
> >
> > So, since it's not easy to do it right now, we can look at it later.
>
> Regarding the way xcr0 and xfd ioctl failures are handled, xcr0 and xfd
> get different treatment today: KVM_SET_XCRS returns -EINVAL to
> userspace, while KVM_SET_MSR is always allowed per the discussion in
> another thread.
>
> So if reallocation fails in KVM_SET_XCRS or KVM_SET_MSR (maybe
> due to ENOMEM, EPERM or ENOTSUPP), which way would we
> like to choose?
>

KVM_SET_MSRS can definitely accept failure according to msr_io().
I think Paolo's point is more that the restore path should not
inherit any check related to vCPU capability. It's a different matter
if the error is caused by other host kernel errors.

Given that, we don't need any special handling between the two
scenarios (set by guest vs. set by host) in those emulation paths:
just return '1' to indicate the error, and whatever error policy exists
in those scenarios is applied.

Thanks
Kevin

2021-12-15 13:42:59

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On 12/15/21 03:39, Wang, Wei W wrote:
>>> Why would KVM_GET_XSAVE2 still be needed in this case?
>>>
>>> I'm thinking it would also be possible to reuse KVM_GET_XSAVE:
>>>
>>> - If userspace calls to KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2),
>>> then KVM knows that the userspace is a new version and it works with
>> larger xsave buffer using the "size" that it returns via KVM_CAP_XSAVE2.
>>> So we can add a flag "kvm->xsave2_enabled", which gets set upon
>> userspace checks KVM_CAP_XSAVE2.
>>
>> You can use KVM_ENABLE_CAP(KVM_CAP_XSAVE2) for that, yes. In that case
>> you don't need KVM_GET_XSAVE2.
>
> One more thing here: what size should KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) return?
> If the size still comes from the guest CPUID(0xd, 0)::RCX, would it be better to just return 1?
> This requires that the QEMU CPUID info has been set in KVM before checking the cap.
> QEMU already has this CPUID info to get the size (so there seems to be no need to inquire KVM for it).

It's still easier to return the full size of the buffer from
KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2). It makes the userspace code a bit
easier.

I'm also thinking that I prefer KVM_GET_XSAVE2 to
KVM_ENABLE_CAP(KVM_CAP_XSAVE2), after all. Since it would be a
backwards-incompatible change to an _old_ ioctl (KVM_GET_XSAVE), I
prefer to limit the ways that userspace can shoot itself in the foot.

Paolo

2021-12-16 08:26:10

by Wei Wang

[permalink] [raw]
Subject: RE: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On Wednesday, December 15, 2021 9:43 PM, Paolo Bonzini wrote:
> It's still easier to return the full size of the buffer from
> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2). It makes the userspace code a
> bit easier.

OK. For the "full size" returned to userspace, would you prefer to directly use the value retrieved from guest CPUID(0xd),
or get it from guest_fpu (i.e. fpstate->user_size)?
(retrieved from CPUID will be the max size and should work fine as well)

>
> I'm also thinking that I prefer KVM_GET_XSAVE2 to
> KVM_ENABLE_CAP(KVM_CAP_XSAVE2), after all. Since it would be a
> backwards-incompatible change to an _old_ ioctl (KVM_GET_XSAVE), I prefer
> to limit the ways that userspace can shoot itself in the foot.

OK, we will use KVM_GET_XSAVE2.

Thanks,
Wei

2021-12-16 10:28:08

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl

On 12/16/21 09:25, Wang, Wei W wrote:
>> It's still easier to return the full size of the buffer from
>> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2). It makes the userspace code
>> a bit easier.
>
> OK. For the "full size" returned to userspace, would you prefer to
> directly use the value retrieved from guest CPUID(0xd), or get it
> from guest_fpu (i.e. fpstate->user_size)? (retrieved from CPUID will
> be the max size and should work fine as well)

It is okay to reflect only the bits that were enabled in prctl, but
please document it in api.rst as well.

Paolo
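
A sketch of how KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) could compute that
size (xstate_required_size() as used by KVM's CPUID code; taking the
permission mask from the prctl-granted guest permissions is the
assumption here):

	case KVM_CAP_XSAVE2: {
		u64 perm = supported_xcr0 & xstate_get_guest_group_perm();

		r = xstate_required_size(perm, false);
		if (r < sizeof(struct kvm_xsave))
			r = sizeof(struct kvm_xsave);
		break;
	}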

2021-12-20 17:56:47

by Nakajima, Jun

[permalink] [raw]
Subject: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)


> On Dec 10, 2021, at 2:13 PM, Paolo Bonzini <[email protected]> wrote:
>
> On 12/10/21 17:30, Paolo Bonzini wrote:
>>>
>>> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
>>> +{
>>> + if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>>> + return 0;
>>> +
>>> + return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
>>> + supported_xcr0, &vcpu->arch.pkru);
>>> +}
>>> +
>> I think fpu_copy_uabi_to_guest_fpstate (and therefore copy_uabi_from_kernel_to_xstate) needs to check that the size is compatible with the components in the input.
>> Also, IIUC the size of the AMX state will vary in different processors. Is this correct? If so, this should be handled already by KVM_GET/SET_XSAVE2 and therefore should be part of the arch/x86/kernel/fpu APIs. In the future we want to support migrating a "small AMX" host to a "large AMX" host; and also migrating from a "large AMX" host to a "small AMX" host if the guest CPUID is compatible with the destination of the migration.
>
> So, the size of the AMX state will depend on the active "palette" in TILECONFIG, and on the CPUID information. I have a few questions on how Intel intends to handle future extensions to AMX:
>
> - can we assume that, in the future, palette 1 will always have the same value (bytes_per_row=64, max_names=8, max_rows=16), and basically that the only variable value is really the number of palettes?
>
> - how does Intel plan to handle bigger TILEDATA? Will it use more XCR0 bits or will it rather enlarge save state 18?
>
> If it will use more XCR0 bits, I suppose that XCR0 bits will control which palettes can be chosen by LDTILECFG.
>
> If not, on the other hand, this will be a first case of one system's XSAVE data not being XRSTOR-able on another system even if the destination system can set XCR0 to the same value as the source system.
>
> Likewise, if the size and offsets for save state 18 were to vary depending on the selected palette, then this would be novel, in that the save state size and offsets would not be in CPUID anymore. It would be particularly interesting for non-compacted format, where all save states after 18 would also move forward.
>
> So, I hope that save state 18 will be frozen to 8k. In that case, and if palette 1 is frozen to the same values as today, implementing migration will not be a problem; it will be essentially the same as SSE->AVX (horizontal extension of existing registers) and/or AVX->AVX512 (both horizontal and vertical extension).

Hi Paolo,

I would like to confirm that the state component 18 will remain 8KB and palette 1 will remain the same.

Thanks,
---
Jun






2021-12-22 14:44:37

by Paolo Bonzini

[permalink] [raw]
Subject: Re: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)

On 12/20/21 18:54, Nakajima, Jun wrote:
> Hi Paolo,
>
> I would like to confirm that the state component 18 will remain 8KB and palette 1 will remain the same.

Great! Can you also confirm that XCR0 bits will control which palettes
can be chosen by LDTILECFG?

Thanks,

Paolo

2021-12-22 14:52:05

by Dave Hansen

[permalink] [raw]
Subject: Re: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)

On 12/20/21 9:54 AM, Nakajima, Jun wrote:
>> So, I hope that save state 18 will be frozen to 8k. In that case,
>> and if palette 1 is frozen to the same values as today,
>> implementing migration will not be a problem; it will be
>> essentially the same as SSE->AVX (horizontal extension of existing
>> registers) and/or AVX->AVX512 (both horizontal and vertical
>> extension).
>
> I would like to confirm that the state component 18 will remain 8KB
> and palette 1 will remain the same.

Is that an architectural statement that will soon be making its way into
the SDM?

2021-12-22 23:47:25

by Nakajima, Jun

[permalink] [raw]
Subject: Re: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)

> On Dec 22, 2021, at 6:44 AM, Paolo Bonzini <[email protected]> wrote:
>
> On 12/20/21 18:54, Nakajima, Jun wrote:
>> Hi Paolo,
>> I would like to confirm that the state component 18 will remain 8KB and palette 1 will remain the same.
>
> Great! Can you also confirm that XCR0 bits will control which palettes can be chosen by LDTILECFG?
>

I need to discuss with the H/W team more and come back. I think this request is plausible; I will suggest it.

Thanks,
---
Jun


2021-12-22 23:51:37

by Nakajima, Jun

[permalink] [raw]
Subject: Re: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)


> On Dec 22, 2021, at 6:52 AM, Hansen, Dave <[email protected]> wrote:
>
> On 12/20/21 9:54 AM, Nakajima, Jun wrote:
>>> So, I hope that save state 18 will be frozen to 8k. In that case,
>>> and if palette 1 is frozen to the same values as today,
>>> implementing migration will not be a problem; it will be
>>> essentially the same as SSE->AVX (horizontal extension of existing
>>> registers) and/or AVX->AVX512 (both horizontal and vertical
>>> extension).
>>
>> I would like to confirm that the state component 18 will remain 8KB
>> and palette 1 will remain the same.
>
> Is that an architectural statement that will soon be making its way into
> the SDM?

Yes, with the other clarifications (e.g. setting IA32_XFD[18]).

Thanks,
---
Jun