Intel has introduced the Advanced Matrix Extensions (AMX) [1] feature, which
will be shipping soon. AMX consists of configurable two-dimensional
"TILE" registers and new accelerator instructions that operate on them.
TMUL (Tile matrix MULtiply) is the first accelerator instruction set
to use the new registers.
Intel AMX state is XSAVE-supported and XSAVE-enabled. It is associated with
two state components, XTILECFG and XTILEDATA. Because the XTILEDATA state
component is very large, an XSAVE extension called extended feature
disable (XFD) is introduced to support dynamic usage. When XFD is
enabled for a state component, any instruction that would access
that state component does not execute and instead generates a
device-not-available exception (#NM). The Linux kernel therefore arms XFD
to monitor the first usage of AMX.
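For illustration, a minimal sketch of that first-use flow (not the actual
kernel implementation; helper names are made up, MSR indices per [1]):

#define MSR_IA32_XFD           0x000001c4
#define MSR_IA32_XFD_ERR       0x000001c5
#define XFEATURE_XTILEDATA_BIT 18

/* Arm XFD so a thread's first AMX instruction raises #NM. */
static void arm_xfd_for_amx(void)
{
        wrmsrl(MSR_IA32_XFD, 1ULL << XFEATURE_XTILEDATA_BIT);
}

/* Called from the #NM handler: allocate the large buffer, then disarm. */
static void handle_amx_first_use(void)
{
        u64 xfd_err;

        rdmsrl(MSR_IA32_XFD_ERR, xfd_err);
        if (!(xfd_err & (1ULL << XFEATURE_XTILEDATA_BIT)))
                return;

        wrmsrl(MSR_IA32_XFD_ERR, 0);    /* acknowledge the report */
        /* ...expand current->thread.fpu to the full xstate size... */
        wrmsrl(MSR_IA32_XFD, 0);        /* let the instruction retry */
}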
This patchset adds AMX and XFD support for guests: it exposes the related
CPUID bits and MSRs to the guest, adds extended XSAVE state context
switching, and switches the XFD MSRs on vmenter/vmexit.
This RFC series is based on kernel AMX series v3 [2] on LKML rather than the
latest upstream commit, and we look forward to your comments.
[1]: Intel Architecture Instruction Set Extension Programming Reference
https://software.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
[2]: AMX kernel series v3 https://lkml.org/lkml/2020/12/23/464
Jing Liu (7):
kvm: x86: Expose XFD CPUID to guest
kvm: x86: Introduce XFD MSRs as passthrough to guest
kvm: x86: Dynamic XSAVE and XFD MSRs context switch
kvm: x86: Add new ioctls for XSAVE extension
kvm: x86: Revise CPUID.D.1.EBX for alignment rule
kvm: x86: Add AMX_TILE, AMX_INT8 and AMX_BF16 support
kvm: x86: AMX XCR0 support for guest
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/include/uapi/asm/kvm.h | 5 ++
arch/x86/kernel/fpu/init.c | 1 +
arch/x86/kernel/fpu/xstate.c | 2 +
arch/x86/kvm/cpuid.c | 19 +++-
arch/x86/kvm/vmx/vmx.c | 114 ++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 7 +-
arch/x86/kvm/x86.c | 153 ++++++++++++++++++++++++++------
include/uapi/linux/kvm.h | 8 ++
9 files changed, 279 insertions(+), 33 deletions(-)
--
2.18.4
Intel's Extended Feature Disable (XFD) feature is an extension
to the XSAVE feature that allows an operating system to enable
a feature while preventing specific user threads from using
the feature. A processor that supports XFD enumerates
CPUID.(EAX=0DH,ECX=1):EAX[4] as 1.
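For reference, an illustrative detection snippet matching that enumeration
(userspace-style sketch, not part of this patch):

#include <cpuid.h>
#include <stdbool.h>

/* CPUID.(EAX=0xD,ECX=1):EAX[4] enumerates XFD support. */
static bool cpu_has_xfd(void)
{
        unsigned int eax, ebx, ecx, edx;

        __cpuid_count(0xd, 1, eax, ebx, ecx, edx);
        return eax & (1u << 4);
}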
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 83637a2ff605..04a73c395c71 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -437,7 +437,7 @@ void kvm_set_cpu_caps(void)
);
kvm_cpu_cap_mask(CPUID_D_1_EAX,
- F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES)
+ F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) | F(XFD)
);
kvm_cpu_cap_mask(CPUID_8000_0001_ECX,
--
2.18.4
The XFD feature introduces two new MSRs: IA32_XFD and IA32_XFD_ERR.
Each MSR contains a state-component bitmap. XFD is enabled
for state component i if XCR0[i] = IA32_XFD[i] = 1. When XFD is
enabled for a state component, any instruction that would access
that state component does not execute and instead generates a
device-not-available exception (#NM). IA32_XFD_ERR indicates
which state component caused the #NM event.
The MSRs are per task and need to be context switched between host
and guest, and also between tasks inside the guest, just as on bare metal.
Pass both MSRs through so the guest can read and write them without a
vmexit. Add two slots for the XFD MSRs to the desired passthrough MSR list.
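As a reminder of the architectural rule above, an illustrative helper (not
part of this patch; BIT_ULL() is the usual kernel macro):

/* XFD is armed for state component i only if both bits are set. */
static inline bool xfd_armed_for(u64 xcr0, u64 xfd, int i)
{
        return (xcr0 & BIT_ULL(i)) && (xfd & BIT_ULL(i));
}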
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 38 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 6 +++++-
arch/x86/kvm/x86.c | 6 ++++++
3 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 47b8357b9751..7fa54e78c45c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -162,6 +162,8 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_IA32_SYSENTER_CS,
MSR_IA32_SYSENTER_ESP,
MSR_IA32_SYSENTER_EIP,
+ MSR_IA32_XFD,
+ MSR_IA32_XFD_ERR,
MSR_CORE_C1_RES,
MSR_CORE_C3_RESIDENCY,
MSR_CORE_C6_RESIDENCY,
@@ -1824,6 +1826,18 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vmx->msr_ia32_umwait_control;
break;
+ case MSR_IA32_XFD:
+ if (!msr_info->host_initiated)
+ return 1;
+
+ msr_info->data = vmx->msr_ia32_xfd;
+ break;
+ case MSR_IA32_XFD_ERR:
+ if (!msr_info->host_initiated)
+ return 1;
+
+ msr_info->data = vmx->msr_ia32_xfd_err;
+ break;
case MSR_IA32_SPEC_CTRL:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
@@ -2026,6 +2040,20 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmx->msr_ia32_umwait_control = data;
break;
+ case MSR_IA32_XFD:
+ if (!msr_info->host_initiated)
+ return 1;
+
+ vmx->msr_ia32_xfd = data;
+ break;
+ case MSR_IA32_XFD_ERR:
+ if (!msr_info->host_initiated)
+ return 1;
+ if (data)
+ break;
+
+ vmx->msr_ia32_xfd_err = data;
+ break;
case MSR_IA32_SPEC_CTRL:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
@@ -7219,6 +7247,12 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}
+static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu)
+{
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW, false);
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_RW, false);
+}
+
static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7249,6 +7283,10 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
guest_cpuid_has(vcpu, X86_FEATURE_INTEL_PT))
update_intel_pt_cfg(vcpu);
+ if (boot_cpu_has(X86_FEATURE_XFD) &&
+ guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+ vmx_update_intercept_xfd(vcpu);
+
if (boot_cpu_has(X86_FEATURE_RTM)) {
struct vmx_uret_msr *msr;
msr = vmx_find_uret_msr(vmx, MSR_IA32_TSX_CTRL);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index f6f66e5c6510..d487f5a53a08 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -281,11 +281,15 @@ struct vcpu_vmx {
struct pt_desc pt_desc;
/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 13
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 15
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
} shadow_msr_intercept;
+
+ /* eXtended Feature Disabling (XFD) MSRs */
+ u64 msr_ia32_xfd;
+ u64 msr_ia32_xfd_err;
};
enum ept_pointers_status {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 93b5bacad67a..9ca8b1e58afa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1213,6 +1213,7 @@ static const u32 msrs_to_save_all[] = {
MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
MSR_IA32_UMWAIT_CONTROL,
+ MSR_IA32_XFD, MSR_IA32_XFD_ERR,
MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,
@@ -5744,6 +5745,11 @@ static void kvm_init_msr_list(void)
if (!kvm_cpu_cap_has(X86_FEATURE_WAITPKG))
continue;
break;
+ case MSR_IA32_XFD:
+ case MSR_IA32_XFD_ERR:
+ if (!kvm_cpu_cap_has(X86_FEATURE_XFD))
+ continue;
+ break;
case MSR_IA32_RTIT_CTL:
case MSR_IA32_RTIT_STATUS:
if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT))
--
2.18.4
XFD allows the kernel to enable a feature state in XCR0 and to
receive a #NM trap when a task uses instructions accessing that state.
The kernel defines "struct fpu.state_mask" to indicate the saved xstate and
interacts with the XFD hardware when needed via a simple conversion.
Once a dynamic feature is detected, "state_mask" is expanded and
"state_ptr" is dynamically allocated to hold the whole state. Meanwhile,
once the state is no longer in its INIT state, the corresponding XFD bit
must not be armed anymore.
In KVM, "guest_fpu" serves any guest task running on this vcpu
across vmexit and vmenter. We provide a pre-allocated guest_fpu buffer
and a full "guest_fpu.state_mask" to avoid re-detecting dynamic features
for each vcpu task. Meanwhile, to ensure guest state is correctly saved and
restored by xsaves/xrstors, IA32_XFD is set to zero across vmexit and
vmenter.
For "current->thread.fpu", since host and guest probably have different
state and mask, it also needs to be switched to the right context on fpu
load and put.
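Conceptually, the buffer selection this patch relies on looks like the
sketch below; the real helpers (__xstate()/__xsave()) come from the kernel
AMX series this work is based on, so this is only an equivalent outline:

/* Use the dynamically allocated buffer when present, else the embedded one. */
static inline union fpregs_state *fpu_state_of(struct fpu *fpu)
{
        return fpu->state_ptr ? fpu->state_ptr : &fpu->state;
}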
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kernel/fpu/init.c | 1 +
arch/x86/kernel/fpu/xstate.c | 2 +
arch/x86/kvm/vmx/vmx.c | 76 +++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 1 +
arch/x86/kvm/x86.c | 69 +++++++++++++++++++++++++-----
6 files changed, 141 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7e5f33a0d0e2..6dedf3d22659 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1203,6 +1203,9 @@ struct kvm_x86_ops {
struct x86_exception *exception);
void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu);
+ void (*xfd_load)(struct kvm_vcpu *vcpu);
+ void (*xfd_put)(struct kvm_vcpu *vcpu);
+
void (*request_immediate_exit)(struct kvm_vcpu *vcpu);
void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 7e0c68043ce3..fbb761fc13ec 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -145,6 +145,7 @@ EXPORT_SYMBOL_GPL(fpu_kernel_xstate_min_size);
* can be dynamically expanded to include some states up to this size.
*/
unsigned int fpu_kernel_xstate_max_size;
+EXPORT_SYMBOL_GPL(fpu_kernel_xstate_max_size);
/* Get alignment of the TYPE. */
#define TYPE_ALIGN(TYPE) offsetof(struct { char x; TYPE test; }, test)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 080f3be9a5e6..9c471a0364e2 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -77,12 +77,14 @@ static struct xfeature_capflag_info xfeature_capflags[] __initdata = {
* XSAVE buffer, both supervisor and user xstates.
*/
u64 xfeatures_mask_all __read_mostly;
+EXPORT_SYMBOL_GPL(xfeatures_mask_all);
/*
* This represents user xstates, a subset of xfeatures_mask_all, saved in a
* dynamic kernel XSAVE buffer.
*/
u64 xfeatures_mask_user_dynamic __read_mostly;
+EXPORT_SYMBOL_GPL(xfeatures_mask_user_dynamic);
static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7fa54e78c45c..be3cc0f3ec6d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1167,6 +1167,75 @@ static void pt_guest_exit(struct vcpu_vmx *vmx)
wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
}
+static void vmx_xfd_load(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+ if (guest_cpuid_has(vcpu, X86_FEATURE_XFD)) {
+ vmx->host_ia32_xfd = xfirstuse_not_detected(vcpu->arch.user_fpu);
+ /*
+ * Keep IA32_XFD as zero in hypervisor.
+ * Guest non-zero IA32_XFD is restored until kvm_x86_ops.run
+ */
+ if (vmx->host_ia32_xfd)
+ wrmsrl(MSR_IA32_XFD, 0);
+ }
+}
+
+static void vmx_xfd_put(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+ if (guest_cpuid_has(vcpu, X86_FEATURE_XFD)) {
+ /* IA32_XFD register is kept as zero in hypervisor. */
+ if (vmx->host_ia32_xfd)
+ wrmsrl(MSR_IA32_XFD, vmx->host_ia32_xfd);
+ /* User (qemu) IA32_XFD_ERR should be zero. */
+ if (vmx->msr_ia32_xfd_err)
+ wrmsrl(MSR_IA32_XFD_ERR, 0);
+ }
+}
+
+/* Load guest XFD MSRs before entering. */
+static void xfd_guest_enter(struct vcpu_vmx *vmx)
+{
+ if (guest_cpuid_has(&vmx->vcpu, X86_FEATURE_XFD)) {
+ if (vmx->msr_ia32_xfd)
+ wrmsrl(MSR_IA32_XFD, vmx->msr_ia32_xfd);
+ /*
+ * We do not rdmsr here since in most cases
+ * IA32_XFD_ERR is zero. One rare exception is that,
+ * this vmenter follows a vmexit with non-zero
+ * MSR_IA32_XFD_ERR and it doesn't change during
+ * this interval.
+ *
+ * So just simply load the non-zero guest value.
+ */
+ if (vmx->msr_ia32_xfd_err)
+ wrmsrl(MSR_IA32_XFD_ERR, vmx->msr_ia32_xfd_err);
+ }
+}
+
+/*
+ * Save guest XFD MSRs once vmexit since the registers may be changed
+ * when control is transferred out of KVM, e.g. preemption.
+ */
+static void xfd_guest_exit(struct vcpu_vmx *vmx)
+{
+ if (guest_cpuid_has(&vmx->vcpu, X86_FEATURE_XFD)) {
+ rdmsrl(MSR_IA32_XFD, vmx->msr_ia32_xfd);
+ rdmsrl(MSR_IA32_XFD_ERR, vmx->msr_ia32_xfd_err);
+ /*
+ * Clear the MSR_IA32_XFD to ensure correctly protect guest
+ * fpu context in hypervisor.
+ * No need to reset MSR_IA32_XFD_ERR in hypervisor since it
+ * has no impact on others.
+ */
+ if (vmx->msr_ia32_xfd)
+ wrmsrl(MSR_IA32_XFD, 0);
+ }
+}
+
void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
unsigned long fs_base, unsigned long gs_base)
{
@@ -6735,6 +6804,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
kvm_load_guest_xsave_state(vcpu);
+ xfd_guest_enter(vmx);
+
pt_guest_enter(vmx);
atomic_switch_perf_msrs(vmx);
@@ -6804,6 +6875,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
pt_guest_exit(vmx);
+ xfd_guest_exit(vmx);
+
kvm_load_host_xsave_state(vcpu);
vmx->nested.nested_run_pending = 0;
@@ -7644,6 +7717,9 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_load = vmx_vcpu_load,
.vcpu_put = vmx_vcpu_put,
+ .xfd_load = vmx_xfd_load,
+ .xfd_put = vmx_xfd_put,
+
.update_exception_bitmap = update_exception_bitmap,
.get_msr_feature = vmx_get_msr_feature,
.get_msr = vmx_get_msr,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d487f5a53a08..9a9ea37a29b1 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -288,6 +288,7 @@ struct vcpu_vmx {
} shadow_msr_intercept;
/* eXtended Feature Disabling (XFD) MSRs */
+ u64 host_ia32_xfd;
u64 msr_ia32_xfd;
u64 msr_ia32_xfd_err;
};
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9ca8b1e58afa..15908bc65d1c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9220,22 +9220,44 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
static void kvm_save_current_fpu(struct fpu *fpu)
{
- struct fpu *src_fpu = &current->thread.fpu;
+ struct fpu *cur_fpu = &current->thread.fpu;
+ fpu->state_ptr = cur_fpu->state_ptr;
+ fpu->state_mask = cur_fpu->state_mask;
/*
* If the target FPU state is not resident in the CPU registers, just
* memcpy() from current, else save CPU state directly to the target.
*/
if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
- memcpy(&fpu->state, &src_fpu->state,
- fpu_kernel_xstate_min_size);
+ /*
+ * No need to copy if dynamic feature is used, because
+ * they just simply point to the same recent state.
+ */
+ if (!cur_fpu->state_ptr)
+ memcpy(&fpu->state, &cur_fpu->state,
+ fpu_kernel_xstate_min_size);
} else {
- if (fpu->state_mask != src_fpu->state_mask)
- fpu->state_mask = src_fpu->state_mask;
copy_fpregs_to_fpstate(fpu);
}
}
+/*
+ * Swap fpu context to next fpu role.
+ *
+ * "current" fpu acts two roles: user contexts and guest contexts.
+ * Swap "current" fpu to next role to ensure correctly handle
+ * dynamic state buffers, e.g. in preemption case.
+ */
+static void kvm_load_next_fpu(struct fpu *next_fpu, u64 mask)
+{
+ struct fpu *cur_fpu = &current->thread.fpu;
+
+ cur_fpu->state_ptr = next_fpu->state_ptr;
+ cur_fpu->state_mask = next_fpu->state_mask;
+
+ __copy_kernel_to_fpregs(__xstate(next_fpu), mask);
+}
+
/* Swap (qemu) user FPU context for the guest FPU context. */
static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
@@ -9243,9 +9265,11 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
kvm_save_current_fpu(vcpu->arch.user_fpu);
+ if (static_cpu_has(X86_FEATURE_XFD) && kvm_x86_ops.xfd_load)
+ kvm_x86_ops.xfd_load(vcpu);
+
/* PKRU is separately restored in kvm_x86_ops.run. */
- __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
- ~XFEATURE_MASK_PKRU);
+ kvm_load_next_fpu(vcpu->arch.guest_fpu, ~XFEATURE_MASK_PKRU);
fpregs_mark_activate();
fpregs_unlock();
@@ -9260,7 +9284,10 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
kvm_save_current_fpu(vcpu->arch.guest_fpu);
- copy_kernel_to_fpregs(vcpu->arch.user_fpu);
+ if (static_cpu_has(X86_FEATURE_XFD) && kvm_x86_ops.xfd_put)
+ kvm_x86_ops.xfd_put(vcpu);
+
+ kvm_load_next_fpu(vcpu->arch.user_fpu, -1);
fpregs_mark_activate();
fpregs_unlock();
@@ -9840,11 +9867,13 @@ static int sync_regs(struct kvm_vcpu *vcpu)
static void fx_init(struct kvm_vcpu *vcpu)
{
+ struct xregs_state *xsave;
+
+ xsave = __xsave(vcpu->arch.guest_fpu);
fpstate_init(vcpu->arch.guest_fpu);
if (boot_cpu_has(X86_FEATURE_XSAVES))
- vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
+ xsave->header.xcomp_bv =
host_xcr0 | XSTATE_COMPACTION_ENABLED;
-
/*
* Ensure guest xcr0 is valid for loading
*/
@@ -9920,6 +9949,21 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
pr_err("kvm: failed to allocate vcpu's fpu\n");
goto free_user_fpu;
}
+
+ vcpu->arch.guest_fpu->state_mask = xfeatures_mask_all &
+ ~xfeatures_mask_user_dynamic;
+
+ /* If have dynamic features, initialize full context. */
+ if (xfeatures_mask_user_dynamic) {
+ vcpu->arch.guest_fpu->state_ptr =
+ kmalloc(fpu_kernel_xstate_max_size, GFP_KERNEL);
+ if (!vcpu->arch.guest_fpu->state_ptr)
+ goto free_guest_fpu;
+
+ vcpu->arch.guest_fpu->state_mask |=
+ xfeatures_mask_user_dynamic;
+ }
+
fx_init(vcpu);
vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
@@ -9936,7 +9980,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
r = kvm_x86_ops.vcpu_create(vcpu);
if (r)
- goto free_guest_fpu;
+ goto free_guest_fpu_exp;
vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
@@ -9947,6 +9991,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
vcpu_put(vcpu);
return 0;
+free_guest_fpu_exp:
+ kfree(vcpu->arch.guest_fpu->state_ptr);
free_guest_fpu:
kmem_cache_free(x86_fpu_cache, vcpu->arch.guest_fpu);
free_user_fpu:
@@ -10002,6 +10048,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
kmem_cache_free(x86_emulator_cache, vcpu->arch.emulate_ctxt);
free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
kmem_cache_free(x86_fpu_cache, vcpu->arch.user_fpu);
+ kfree(vcpu->arch.guest_fpu->state_ptr);
kmem_cache_free(x86_fpu_cache, vcpu->arch.guest_fpu);
kvm_hv_vcpu_uninit(vcpu);
--
2.18.4
The static xstate buffer in struct kvm_xsave holds the extended register
states, but it is not large enough for dynamic features with large state.
Introduce a new capability, KVM_CAP_X86_XSAVE_EXTENSION, to
detect whether the hardware has the XSAVE extension (XFD). Meanwhile, add
two new ioctl interfaces to get/set the whole xstate using a struct
kvm_xsave_extension buffer containing both static and dynamic
xfeatures. Reuse fill_xsave and load_xsave for both cases.
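For context, an illustrative userspace flow using the new ioctls (fd
handling and error strategy are hypothetical, not QEMU code):

#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <err.h>

static void migrate_xstate(int vm_fd, int vcpu_fd)
{
        static struct kvm_xsave_extension buf;

        if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_X86_XSAVE_EXTENSION) <= 0)
                return; /* fall back to plain KVM_GET/SET_XSAVE */

        if (ioctl(vcpu_fd, KVM_GET_XSAVE_EXTENSION, &buf))
                err(1, "KVM_GET_XSAVE_EXTENSION");
        /* ... transfer buf.region, including the dynamic AMX state ... */
        if (ioctl(vcpu_fd, KVM_SET_XSAVE_EXTENSION, &buf))
                err(1, "KVM_SET_XSAVE_EXTENSION");
}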
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 5 +++
arch/x86/kvm/x86.c | 70 +++++++++++++++++++++++++--------
include/uapi/linux/kvm.h | 8 ++++
3 files changed, 66 insertions(+), 17 deletions(-)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 89e5f3d1bba8..bf785e89a728 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -362,6 +362,11 @@ struct kvm_xsave {
__u32 region[1024];
};
+/* for KVM_CAP_XSAVE_EXTENSION */
+struct kvm_xsave_extension {
+ __u32 region[3072];
+};
+
#define KVM_MAX_XCRS 16
struct kvm_xcr {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 15908bc65d1c..bfbde877221e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3786,6 +3786,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_XCRS:
r = boot_cpu_has(X86_FEATURE_XSAVE);
break;
+ case KVM_CAP_X86_XSAVE_EXTENSION:
+ r = boot_cpu_has(X86_FEATURE_XSAVE) &&
+ boot_cpu_has(X86_FEATURE_XFD);
+ break;
case KVM_CAP_TSC_CONTROL:
r = kvm_has_tsc_control;
break;
@@ -4395,7 +4399,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
#define XSTATE_COMPACTION_ENABLED (1ULL << 63)
-static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
+static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu, bool has_extension)
{
struct xregs_state *xsave;
struct fpu *guest_fpu;
@@ -4403,9 +4407,14 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
u64 valid;
guest_fpu = vcpu->arch.guest_fpu;
- xsave = &guest_fpu->state.xsave;
+ xsave = __xsave(guest_fpu);
xstate_bv = xsave->header.xfeatures;
+ if (!has_extension) {
+ /* truncate with only non-dynamic features */
+ xstate_bv = xstate_bv & ~xfeatures_mask_user_dynamic;
+ }
+
/*
* Copy legacy XSAVE area, to avoid complications with CPUID
* leaves 0 and 1 in the loop below.
@@ -4450,7 +4459,7 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
u64 valid;
guest_fpu = vcpu->arch.guest_fpu;
- xsave = &guest_fpu->state.xsave;
+ xsave = __xsave(guest_fpu);
/*
* Copy legacy XSAVE area, to avoid complications with CPUID
@@ -4488,29 +4497,31 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
}
}
-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
- struct kvm_xsave *guest_xsave)
+static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, u32 *region, bool has_extension)
{
if (boot_cpu_has(X86_FEATURE_XSAVE)) {
- memset(guest_xsave, 0, sizeof(struct kvm_xsave));
- fill_xsave((u8 *) guest_xsave->region, vcpu);
+ if (has_extension)
+ memset(region, 0, sizeof(struct kvm_xsave_extension));
+ else
+ memset(region, 0, sizeof(struct kvm_xsave));
+
+ fill_xsave((u8 *)region, vcpu, has_extension);
} else {
- memcpy(guest_xsave->region,
+ memcpy(region,
&vcpu->arch.guest_fpu->state.fxsave,
sizeof(struct fxregs_state));
- *(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
+ *(u64 *)&region[XSAVE_HDR_OFFSET / sizeof(u32)] =
XFEATURE_MASK_FPSSE;
}
}
#define XSAVE_MXCSR_OFFSET 24
-static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
- struct kvm_xsave *guest_xsave)
+static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu, u32 *region)
{
u64 xstate_bv =
- *(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)];
- u32 mxcsr = *(u32 *)&guest_xsave->region[XSAVE_MXCSR_OFFSET / sizeof(u32)];
+ *(u64 *)&region[XSAVE_HDR_OFFSET / sizeof(u32)];
+ u32 mxcsr = *(u32 *)&region[XSAVE_MXCSR_OFFSET / sizeof(u32)];
if (boot_cpu_has(X86_FEATURE_XSAVE)) {
/*
@@ -4520,13 +4531,13 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
*/
if (xstate_bv & ~supported_xcr0 || mxcsr & ~mxcsr_feature_mask)
return -EINVAL;
- load_xsave(vcpu, (u8 *)guest_xsave->region);
+ load_xsave(vcpu, (u8 *)region);
} else {
if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
mxcsr & ~mxcsr_feature_mask)
return -EINVAL;
memcpy(&vcpu->arch.guest_fpu->state.fxsave,
- guest_xsave->region, sizeof(struct fxregs_state));
+ region, sizeof(struct fxregs_state));
}
return 0;
}
@@ -4642,6 +4653,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
union {
struct kvm_lapic_state *lapic;
struct kvm_xsave *xsave;
+ struct kvm_xsave_extension *xsave_ext;
struct kvm_xcrs *xcrs;
void *buffer;
} u;
@@ -4847,7 +4859,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (!u.xsave)
break;
- kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);
+ kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave->region, false);
r = -EFAULT;
if (copy_to_user(argp, u.xsave, sizeof(struct kvm_xsave)))
@@ -4855,6 +4867,20 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = 0;
break;
}
+ case KVM_GET_XSAVE_EXTENSION: {
+ u.xsave_ext = kzalloc(sizeof(struct kvm_xsave_extension), GFP_KERNEL_ACCOUNT);
+ r = -ENOMEM;
+ if (!u.xsave_ext)
+ break;
+
+ kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave_ext->region, true);
+
+ r = -EFAULT;
+ if (copy_to_user(argp, u.xsave_ext, sizeof(struct kvm_xsave_extension)))
+ break;
+ r = 0;
+ break;
+ }
case KVM_SET_XSAVE: {
u.xsave = memdup_user(argp, sizeof(*u.xsave));
if (IS_ERR(u.xsave)) {
@@ -4862,7 +4888,17 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
goto out_nofree;
}
- r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
+ r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave->region);
+ break;
+ }
+ case KVM_SET_XSAVE_EXTENSION: {
+ u.xsave_ext = memdup_user(argp, sizeof(*u.xsave_ext));
+ if (IS_ERR(u.xsave_ext)) {
+ r = PTR_ERR(u.xsave_ext);
+ goto out_nofree;
+ }
+
+ r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave_ext->region);
break;
}
case KVM_GET_XCRS: {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ca41220b40b8..42a167a29350 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1053,6 +1053,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_X86_USER_SPACE_MSR 188
#define KVM_CAP_X86_MSR_FILTER 189
#define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
+#define KVM_CAP_X86_XSAVE_EXTENSION 191
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1462,6 +1463,13 @@ struct kvm_s390_ucas_mapping {
/* Available with KVM_CAP_XSAVE */
#define KVM_GET_XSAVE _IOR(KVMIO, 0xa4, struct kvm_xsave)
#define KVM_SET_XSAVE _IOW(KVMIO, 0xa5, struct kvm_xsave)
+/*
+ * Available with KVM_CAP_XSAVE_EXTENSION
+ * 0xa4/0xa5 are ok for the new ioctls since structure size is different.
+ */
+#define KVM_GET_XSAVE_EXTENSION _IOW(KVMIO, 0xa4, struct kvm_xsave_extension)
+#define KVM_SET_XSAVE_EXTENSION _IOW(KVMIO, 0xa5, struct kvm_xsave_extension)
+
/* Available with KVM_CAP_XCRS */
#define KVM_GET_XCRS _IOR(KVMIO, 0xa6, struct kvm_xcrs)
#define KVM_SET_XCRS _IOW(KVMIO, 0xa7, struct kvm_xcrs)
--
2.18.4
CPUID.(EAX=0xD,ECX=i):ECX[1] is set if, when the compacted format of an
XSAVE area is used, extended state component i is located on the next
64-byte boundary following the preceding state component (otherwise,
it is located immediately following the preceding state component).
AMX XTILECFG and XTILEDATA are the first components to use 64B alignment.
Revise the runtime CPUID modification to honor this rule.
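A worked example with made-up numbers (illustrative only): suppose the
running compacted size is 0x988 and the next enabled component reports
EAX (size) = 0x40 with the alignment bit set.

u32 ret = 0x988;
ret = ALIGN(ret, 64);       /* -> 0x9c0: skip to the next 64B boundary */
ret = max(ret, ret + 0x40); /* -> 0xa00: place the component there     */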
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 04a73c395c71..ee1fac0a865e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -35,12 +35,17 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted)
{
int feature_bit = 0;
u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;
+ bool is_aligned = false;
xstate_bv &= XFEATURE_MASK_EXTEND;
while (xstate_bv) {
if (xstate_bv & 0x1) {
u32 eax, ebx, ecx, edx, offset;
cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
+ /* ECX[1]: 64B alignment in compacted form */
+ is_aligned = !!(ecx & 2);
+ if (is_aligned && compacted)
+ ret = ALIGN(ret, 64);
offset = compacted ? ret : ebx;
ret = max(ret, offset + eax);
}
--
2.18.4
Intel introduces the AMX architecture on the Sapphire Rapids (SPR)
platform, which includes AMX_TILE, AMX_INT8 and AMX_BF16 support.
Expose these features to the KVM guest.
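For reference, the new CPUID leaves report the tile palette and TMUL
parameters; an illustrative userspace query of palette 1 (field layout as
described in the ISA extensions reference [1], so treat this as a sketch):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.0x1D sub-leaf 1: palette 1 geometry */
        __cpuid_count(0x1d, 1, eax, ebx, ecx, edx);
        printf("total_tile_bytes=%u bytes_per_tile=%u\n",
               eax & 0xffff, eax >> 16);
        printf("bytes_per_row=%u max_names=%u max_rows=%u\n",
               ebx & 0xffff, ebx >> 16, ecx & 0xffff);
        return 0;
}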
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ee1fac0a865e..1b3ea9195a75 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -423,7 +423,8 @@ void kvm_set_cpu_caps(void)
F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
- F(SERIALIZE) | F(TSXLDTRK)
+ F(SERIALIZE) | F(TSXLDTRK) |
+ F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16)
);
/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -544,6 +545,8 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
case 0x14:
case 0x17:
case 0x18:
+ case 0x1d:
+ case 0x1e:
case 0x1f:
case 0x8000001d:
entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
@@ -667,6 +670,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
break;
case 9:
break;
+ case 0x1e:
+ break;
case 0xa: { /* Architectural Performance Monitoring */
struct x86_pmu_capability cap;
union cpuid10_eax eax;
@@ -766,9 +771,12 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->edx = 0;
}
break;
+ /* Intel AMX TILE */
+ case 0x1d:
/* Intel PT */
case 0x14:
- if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) {
+ if ((function == 0x14 && !kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) ||
+ (function == 0x1d && !kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))) {
entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
break;
}
--
2.18.4
Two XCR0 bits are defined for AMX to support the XSAVE mechanism:
bit 17 for XTILECFG and bit 18 for XTILEDATA.
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/kvm/x86.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bfbde877221e..f1c5893dee18 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -189,7 +189,7 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
#define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
| XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
- | XFEATURE_MASK_PKRU)
+ | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
u64 __read_mostly host_efer;
EXPORT_SYMBOL_GPL(host_efer);
@@ -946,6 +946,12 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
return 1;
}
+
+ if (xcr0 & XFEATURE_MASK_XTILE) {
+ if ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE)
+ return 1;
+ }
+
vcpu->arch.xcr0 = xcr0;
if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
--
2.18.4
On Sun, Feb 07, 2021 at 10:42:52AM -0500, Jing Liu wrote:
> diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
> index 7e0c68043ce3..fbb761fc13ec 100644
> --- a/arch/x86/kernel/fpu/init.c
> +++ b/arch/x86/kernel/fpu/init.c
> @@ -145,6 +145,7 @@ EXPORT_SYMBOL_GPL(fpu_kernel_xstate_min_size);
> * can be dynamically expanded to include some states up to this size.
> */
> unsigned int fpu_kernel_xstate_max_size;
> +EXPORT_SYMBOL_GPL(fpu_kernel_xstate_max_size);
>
> /* Get alignment of the TYPE. */
> #define TYPE_ALIGN(TYPE) offsetof(struct { char x; TYPE test; }, test)
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 080f3be9a5e6..9c471a0364e2 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -77,12 +77,14 @@ static struct xfeature_capflag_info xfeature_capflags[] __initdata = {
> * XSAVE buffer, both supervisor and user xstates.
> */
> u64 xfeatures_mask_all __read_mostly;
> +EXPORT_SYMBOL_GPL(xfeatures_mask_all);
>
> /*
> * This represents user xstates, a subset of xfeatures_mask_all, saved in a
> * dynamic kernel XSAVE buffer.
> */
> u64 xfeatures_mask_user_dynamic __read_mostly;
> +EXPORT_SYMBOL_GPL(xfeatures_mask_user_dynamic);
>
> static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
> static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
Make sure you Cc [email protected] when touching code outside of kvm.
There's this script called scripts/get_maintainer.pl which will tell you who to
Cc. Use it before you send next time please.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2/7/2021 7:49 PM, Borislav Petkov wrote:
> On Sun, Feb 07, 2021 at 10:42:52AM -0500, Jing Liu wrote:
>> diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
>> index 7e0c68043ce3..fbb761fc13ec 100644
>> --- a/arch/x86/kernel/fpu/init.c
>> +++ b/arch/x86/kernel/fpu/init.c
>> @@ -145,6 +145,7 @@ EXPORT_SYMBOL_GPL(fpu_kernel_xstate_min_size);
>> * can be dynamically expanded to include some states up to this size.
>> */
>> unsigned int fpu_kernel_xstate_max_size;
>> +EXPORT_SYMBOL_GPL(fpu_kernel_xstate_max_size);
>>
>> /* Get alignment of the TYPE. */
>> #define TYPE_ALIGN(TYPE) offsetof(struct { char x; TYPE test; }, test)
>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index 080f3be9a5e6..9c471a0364e2 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -77,12 +77,14 @@ static struct xfeature_capflag_info xfeature_capflags[] __initdata = {
>> * XSAVE buffer, both supervisor and user xstates.
>> */
>> u64 xfeatures_mask_all __read_mostly;
>> +EXPORT_SYMBOL_GPL(xfeatures_mask_all);
>>
>> /*
>> * This represents user xstates, a subset of xfeatures_mask_all, saved in a
>> * dynamic kernel XSAVE buffer.
>> */
>> u64 xfeatures_mask_user_dynamic __read_mostly;
>> +EXPORT_SYMBOL_GPL(xfeatures_mask_user_dynamic);
>>
>> static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
>> static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
> Make sure you Cc [email protected] when touching code outside of kvm.
>
> There's this script called scripts/get_maintainer.pl which will tell you who to
> Cc. Use it before you send next time please.
>
> Thx.
Thank you for the reminder. I'll cc that next time.
BRs,
Jing
On 07/02/21 16:42, Jing Liu wrote:
> In KVM, "guest_fpu" serves for any guest task working on this vcpu
> during vmexit and vmenter. We provide a pre-allocated guest_fpu space
> and entire "guest_fpu.state_mask" to avoid each dynamic features
> detection on each vcpu task. Meanwhile, to ensure correctly
> xsaves/xrstors guest state, set IA32_XFD as zero during vmexit and vmenter.
Most guests will not need the whole xstate feature set. So perhaps you
could set XFD to the host value | the guest value, trap #NM if the
host XFD is zero, and possibly reflect the exception to the guest's XFD
and XFD_ERR.
In addition, loading the guest XFD MSRs should use the MSR autoload
feature (add_atomic_switch_msr).
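A rough sketch of that suggestion (illustrative only; it assumes vmx.c's
existing static add_atomic_switch_msr() helper with its
(vmx, msr, guest_val, host_val, entry_only) signature):

static void vmx_setup_xfd_autoload(struct vcpu_vmx *vmx, u64 guest_xfd,
                                   u64 host_xfd)
{
        /* Run the guest with host|guest XFD; restore host XFD on exit. */
        add_atomic_switch_msr(vmx, MSR_IA32_XFD, guest_xfd | host_xfd,
                              host_xfd, false);
}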
Paolo
On Mon, Feb 08, 2021, Paolo Bonzini wrote:
> On 07/02/21 16:42, Jing Liu wrote:
> > |In KVM, "guest_fpu" serves for any guest task working on this vcpu
> > during vmexit and vmenter. We provide a pre-allocated guest_fpu space
> > and entire "guest_fpu.state_mask" to avoid each dynamic features
> > detection on each vcpu task. Meanwhile, to ensure correctly
> > xsaves/xrstors guest state, set IA32_XFD as zero during vmexit and
> > vmenter.|
>
> Most guests will not need the whole xstate feature set. So perhaps you
> could set XFD to the host value | the guest value, trap #NM if the host XFD
> is zero, and possibly reflect the exception to the guest's XFD and XFD_ERR.
>
> In addition, loading the guest XFD MSRs should use the MSR autoload feature
> (add_atomic_switch_msr).
Why do you say that? I would strongly prefer to use the load lists only if they
are absolutely necessary. I don't think that's the case here, as I can't
imagine accessing FPU state in NMI context is allowed, at least not without a
big pile of save/restore code.
On 08/02/21 18:31, Sean Christopherson wrote:
> On Mon, Feb 08, 2021, Paolo Bonzini wrote:
>> On 07/02/21 16:42, Jing Liu wrote:
>>> In KVM, "guest_fpu" serves for any guest task working on this vcpu
>>> during vmexit and vmenter. We provide a pre-allocated guest_fpu space
>>> and entire "guest_fpu.state_mask" to avoid each dynamic features
>>> detection on each vcpu task. Meanwhile, to ensure correctly
>>> xsaves/xrstors guest state, set IA32_XFD as zero during vmexit and
>>> vmenter.
>>
>> Most guests will not need the whole xstate feature set. So perhaps you
>> could set XFD to the host value | the guest value, trap #NM if the host XFD
>> is zero, and possibly reflect the exception to the guest's XFD and XFD_ERR.
>>
>> In addition, loading the guest XFD MSRs should use the MSR autoload feature
>> (add_atomic_switch_msr).
>
> Why do you say that? I would strongly prefer to use the load lists only if they
> are absolutely necessary. I don't think that's the case here, as I can't
> imagine accessing FPU state in NMI context is allowed, at least not without a
> big pile of save/restore code.
I was thinking more of the added vmentry/vmexit overhead due to
xfd_guest_enter xfd_guest_exit.
That said, the case where we saw MSR autoload as faster involved EFER,
and we decided that it was due to TLB flushes (commit f6577a5fa15d,
"x86, kvm, vmx: Always use LOAD_IA32_EFER if available", 2014-11-12).
Do you know if RDMSR/WRMSR is always slower than MSR autoload?
Paolo
On Mon, Feb 08, 2021, Paolo Bonzini wrote:
> On 08/02/21 18:31, Sean Christopherson wrote:
> > On Mon, Feb 08, 2021, Paolo Bonzini wrote:
> > > On 07/02/21 16:42, Jing Liu wrote:
> > > > In KVM, "guest_fpu" serves for any guest task working on this vcpu
> > > > during vmexit and vmenter. We provide a pre-allocated guest_fpu space
> > > > and entire "guest_fpu.state_mask" to avoid each dynamic features
> > > > detection on each vcpu task. Meanwhile, to ensure correctly
> > > > xsaves/xrstors guest state, set IA32_XFD as zero during vmexit and
> > > > vmenter.
> > >
> > > Most guests will not need the whole xstate feature set. So perhaps you
> > > could set XFD to the host value | the guest value, trap #NM if the host XFD
> > > is zero, and possibly reflect the exception to the guest's XFD and XFD_ERR.
> > >
> > > In addition, loading the guest XFD MSRs should use the MSR autoload feature
> > > (add_atomic_switch_msr).
> >
> > Why do you say that? I would strongly prefer to use the load lists only if they
> > are absolutely necessary. I don't think that's the case here, as I can't
> > imagine accessing FPU state in NMI context is allowed, at least not without a
> > big pile of save/restore code.
>
> I was thinking more of the added vmentry/vmexit overhead due to
> xfd_guest_enter xfd_guest_exit.
>
> That said, the case where we saw MSR autoload as faster involved EFER, and
> we decided that it was due to TLB flushes (commit f6577a5fa15d, "x86, kvm,
> vmx: Always use LOAD_IA32_EFER if available", 2014-11-12). Do you know if
> RDMSR/WRMSR is always slower than MSR autoload?
RDMSR/WRMSR may be marginally slower, but only because the autoload stuff avoids
serializing the pipeline after every MSR. The autoload paths are effectively
just wrappers around the WRMSR ucode, plus some extra VM-Enter specific checks,
as ucode needs to perform all the normal fault checks on the index and value.
On the flip side, if the load lists are dynamically constructed, I suspect the
code overhead of walking the lists negates any advantages of the load lists.
TL;DR: it likely depends on the exact use case. My primary objection to using
the load lists is that people tend to assume they are more performant than raw
RDMSR/WRMSR, and so aren't as careful/thoughtful as they should be about adding
MSRs to the save/restore paths.
Note, the dedicated VMCS fields, e.g. EFER and SYSENTER, are 1-2 orders of
magnitude faster than raw RDMSR/WRMSR or the load lists, as they obviously have
dedicated handling in VM-Enter ucode.
On 08/02/21 19:04, Sean Christopherson wrote:
>> That said, the case where we saw MSR autoload as faster involved EFER, and
>> we decided that it was due to TLB flushes (commit f6577a5fa15d, "x86, kvm,
>> vmx: Always use LOAD_IA32_EFER if available", 2014-11-12). Do you know if
>> RDMSR/WRMSR is always slower than MSR autoload?
> RDMSR/WRMSR may be marginally slower, but only because the autoload stuff avoids
> serializing the pipeline after every MSR.
That's probably adding up quickly...
> The autoload paths are effectively
> just wrappers around the WRMSR ucode, plus some extra VM-Enter specific checks,
> as ucode needs to perform all the normal fault checks on the index and value.
> On the flip side, if the load lists are dynamically constructed, I suspect the
> code overhead of walking the lists negates any advantages of the load lists.
... but yeah this is not very encouraging.
Context switch time is a problem for XFD. In a VM that uses AMX, most
threads in the guest will have nonzero XFD but the vCPU thread itself
will have zero XFD. So as soon as one thread in the VM forces the vCPU
thread to clear XFD, you pay a price on all vmexits and vmentries.
However, running the host with _more_ bits set than necessary in XFD
should not be a problem as long as the host doesn't use the AMX
instructions. So perhaps Jing can look into keeping XFD=0 for as little
time as possible, and XFD=host_XFD|guest_XFD as much as possible.
Paolo
On Mon, Feb 08, 2021 at 07:12:22PM +0100, Paolo Bonzini wrote:
> On 08/02/21 19:04, Sean Christopherson wrote:
> > > That said, the case where we saw MSR autoload as faster involved EFER, and
> > > we decided that it was due to TLB flushes (commit f6577a5fa15d, "x86, kvm,
> > > vmx: Always use LOAD_IA32_EFER if available", 2014-11-12). Do you know if
> > > RDMSR/WRMSR is always slower than MSR autoload?
> > RDMSR/WRMSR may be marginally slower, but only because the autoload stuff avoids
> > serializing the pipeline after every MSR.
>
> That's probably adding up quickly...
>
> > The autoload paths are effectively
> > just wrappers around the WRMSR ucode, plus some extra VM-Enter specific checks,
> > as ucode needs to perform all the normal fault checks on the index and value.
> > On the flip side, if the load lists are dynamically constructed, I suspect the
> > code overhead of walking the lists negates any advantages of the load lists.
>
> ... but yeah this is not very encouraging.
>
> Context switch time is a problem for XFD. In a VM that uses AMX, most
> threads in the guest will have nonzero XFD but the vCPU thread itself will
> have zero XFD. So as soon as one thread in the VM forces the vCPU thread to
> clear XFD, you pay a price on all vmexits and vmentries.
>
> However, running the host with _more_ bits set than necessary in XFD should
> not be a problem as long as the host doesn't use the AMX instructions. So
> perhaps Jing can look into keeping XFD=0 for as little time as possible, and
> XFD=host_XFD|guest_XFD as much as possible.
This sounds like the lazy-fpu (eagerfpu?) that used to be part of the
kernel? I recall that we had a CVE for that - so it may also be worth
double-checking that we don't reintroduce that one.
On 2/9/2021 2:12 AM, Paolo Bonzini wrote:
> On 08/02/21 19:04, Sean Christopherson wrote:
>>> That said, the case where we saw MSR autoload as faster involved
>>> EFER, and
>>> we decided that it was due to TLB flushes (commit f6577a5fa15d,
>>> "x86, kvm,
>>> vmx: Always use LOAD_IA32_EFER if available", 2014-11-12). Do you
>>> know if
>>> RDMSR/WRMSR is always slower than MSR autoload?
>> RDMSR/WRMSR may be marginally slower, but only because the autoload
>> stuff avoids
>> serializing the pipeline after every MSR.
>
> That's probably adding up quickly...
>
>> The autoload paths are effectively
>> just wrappers around the WRMSR ucode, plus some extra VM-Enter
>> specific checks,
>> as ucode needs to perform all the normal fault checks on the index
>> and value.
>> On the flip side, if the load lists are dynamically constructed, I
>> suspect the
>> code overhead of walking the lists negates any advantages of the load
>> lists.
>
> ... but yeah this is not very encouraging.
Thanks for reviewing the patches.
> Context switch time is a problem for XFD. In a VM that uses AMX, most
> threads in the guest will have nonzero XFD but the vCPU thread itself
> will have zero XFD. So as soon as one thread in the VM forces the
> vCPU thread to clear XFD, you pay a price on all vmexits and vmentries.
>
Spec says,
"If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
instead, it saves bit i of XSTATE_BV field of the XSAVE header as 0
(indicating that the state component is in its initialized state).
With the exception of XSAVE, no data is saved for the state
component (XSAVE saves the initial value of the state component..."
Thus, the key point is to not lose non-initial AMX state across vmexit
and vmenter. If the AMX state is in its initialized state, it doesn't
matter; otherwise, XFD[i] must not be armed with a nonzero value.
If we don't want to unconditionally set XFD=0 on every vmexit, it would
be useful to first detect whether the guest AMX state is initial or not.
How about using the XINUSE notation here?
(Details in SDM vol. 1, 13.6 "PROCESSOR TRACKING OF XSAVE-MANAGED STATE",
and the XRSTOR/XRSTORS instruction operation parts of vol. 2.)
The main idea is that the processor tracks the status of the various state
components via XINUSE, which shows whether a state component is in use.
When XINUSE[i]=0, state component i is in its initial configuration.
Otherwise, kvm should take care of XFD on vmexit.
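A minimal sketch of that check (illustrative; xgetbv() is the kernel helper
that executes XGETBV with the given ECX, and XFEATURE_MASK_XTILE_DATA comes
from the AMX base series):

/* XGETBV(1) returns XCR0 & XINUSE while guest state is still loaded. */
static inline bool guest_tiledata_in_use(void)
{
        return xgetbv(1) & XFEATURE_MASK_XTILE_DATA;
}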
> However, running the host with _more_ bits set than necessary in XFD
> should not be a problem as long as the host doesn't use the AMX
> instructions.
Does "running the host" mean running in kvm? why need more bits
(host_XFD|guest_XFD),
I'm trying to think about the case that guest_XFD is not enough? e.g.
In guest, it only need bit i when guest supports it and guest uses
the passthru XFD[i] for detecting dynamic usage;
In kvm, kvm doesn't use AMX instructions; and "system software should not
use XFD to implement a 'lazy restore' approach to management of the
XTILEDATA
state component."
Out of kvm, kernel ensures setting correct XFD for threads when scheduling;
Thanks,
Jing
> So perhaps Jing can look into keeping XFD=0 for as little time as
> possible, and XFD=host_XFD|guest_XFD as much as possible.
>
> Paolo
>
On 2/9/2021 2:55 AM, Konrad Rzeszutek Wilk wrote:
> On Mon, Feb 08, 2021 at 07:12:22PM +0100, Paolo Bonzini wrote:
[...]
>>
>> However, running the host with _more_ bits set than necessary in XFD should
>> not be a problem as long as the host doesn't use the AMX instructions. So
>> perhaps Jing can look into keeping XFD=0 for as little time as possible, and
>> XFD=host_XFD|guest_XFD as much as possible.
> This sounds like the lazy-fpu (eagerfpu?) that used to be part of the
> kernel? I recall that we had a CVE for that - so it may also be worth
> double-checking that we don't reintroduce that one.
Not sure if this refers to lazy restore, but the spec mentions that
"System software should not use XFD to implement a 'lazy restore' approach
to management of the XTILEDATA state component." One reason is that
XSAVE(S) loses the XTILEDATA when XFD[i] is nonzero.
Thanks,
Jing
I need a formletter for these...
GET_SUPPORTED_CPUID advertises support to userspace; it does not expose anything
to the guest.
On Sun, Feb 07, 2021, Jing Liu wrote:
> Intel's Extended Feature Disable (XFD) feature is an extension
> to the XSAVE feature that allows an operating system to enable
> a feature while preventing specific user threads from using
> the feature. A processor that supports XFD enumerates
> CPUID.(EAX=0DH,ECX=1):EAX[4] as 1.
>
> Signed-off-by: Jing Liu <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 83637a2ff605..04a73c395c71 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -437,7 +437,7 @@ void kvm_set_cpu_caps(void)
> );
>
> kvm_cpu_cap_mask(CPUID_D_1_EAX,
> - F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES)
> + F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) | F(XFD)
KVM must not advertise support until it actually has said support, i.e. this
patch needs to go at the end of the series.
Also, adding the kvm_cpu_cap flag in a separate patch isn't strictly required.
In most cases, I would go so far as to say that if there is additional enabling
to be done, advertising the feature should be done in the same patch that adds
the last bits of enabling. Putting the CPUID stuff in its own patch doesn't
usually add values, e.g. if there's a bug in the actual support code bisecting
will point at the wrong patch if userspace conditions its vCPU model on
GET_SUPPORTED_CPUID.
> );
>
> kvm_cpu_cap_mask(CPUID_8000_0001_ECX,
> --
> 2.18.4
>
On Sun, Feb 07, 2021, Jing Liu wrote:
> Passthrough both MSRs to let guest access and write without vmexit.
Why? Except for read-only MSRs, e.g. MSR_CORE_C1_RES, passthrough MSRs are
costly to support because KVM must context switch the MSR (which, by the by, is
completely missing from the patch).
In other words, if these MSRs are full RW passthrough, guests with XFD enabled
will need to load the guest value on entry, save the guest value on exit, and
load the host value on exit. That's in the neighborhood of a 40% increase in
latency for a single VM-Enter/VM-Exit roundtrip (~1500 cycles => >2000 cycles).
I'm not saying these can't be passhthrough, but there needs to be strong
justification for letting the guest read/write them directly.
On Sun, Feb 07, 2021, Jing Liu wrote:
> The static xstate buffer kvm_xsave contains the extended register
> states, but it is not enough for dynamic features with large state.
>
> Introduce a new capability called KVM_CAP_X86_XSAVE_EXTENSION to
> detect if hardware has XSAVE extension (XFD). Meanwhile, add two
> new ioctl interfaces to get/set the whole xstate using struct
> kvm_xsave_extension buffer containing both static and dynamic
> xfeatures. Reuse fill_xsave and load_xsave for both cases.
>
> Signed-off-by: Jing Liu <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm.h | 5 +++
> arch/x86/kvm/x86.c | 70 +++++++++++++++++++++++++--------
> include/uapi/linux/kvm.h | 8 ++++
> 3 files changed, 66 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 89e5f3d1bba8..bf785e89a728 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -362,6 +362,11 @@ struct kvm_xsave {
> __u32 region[1024];
> };
>
> +/* for KVM_CAP_XSAVE_EXTENSION */
> +struct kvm_xsave_extension {
> + __u32 region[3072];
Fool me once, shame on you (Intel). Fool me twice, shame on me (KVM).
As amusing as kvm_xsave_really_extended would be, the required size should be
discoverable, not hardcoded. Nothing prevents a hardware vendor from inventing
a newfangled feature that requires yet more space.
As an alternative to adding a dedicated capability, can we leverage
GET_SUPPORTED_CPUID, leaf CPUID.0xD, to enumerate the minimum required size and
state that the new ioctl() is available if the min size is greater than 1024?
Or is that unnecessarily convoluted...
On Sun, Feb 07, 2021, Jing Liu wrote:
> Two XCR0 bits are defined for AMX to support XSAVE mechanism.
> Bit 17 is for tilecfg and bit 18 is for tiledata.
This fails to explain why they must be set in tandem. Out of curiosity, assuming
they do indeed need to be set/cleared as a pair, what's the point of having two
separate bits?
> Signed-off-by: Jing Liu <[email protected]>
> ---
> arch/x86/kvm/x86.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bfbde877221e..f1c5893dee18 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -189,7 +189,7 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
> #define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
> | XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
> | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
> - | XFEATURE_MASK_PKRU)
> + | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
>
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
> @@ -946,6 +946,12 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
> if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
> return 1;
> }
> +
> + if (xcr0 & XFEATURE_MASK_XTILE) {
> + if ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE)
> + return 1;
> + }
> +
> vcpu->arch.xcr0 = xcr0;
>
> if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
> --
> 2.18.4
>
On Mon, May 24, 2021 at 2:44 PM Sean Christopherson <[email protected]> wrote:
>
> On Sun, Feb 07, 2021, Jing Liu wrote:
> > Passthrough both MSRs to let guest access and write without vmexit.
>
> Why? Except for read-only MSRs, e.g. MSR_CORE_C1_RES, passthrough MSRs are
> costly to support because KVM must context switch the MSR (which, by the by, is
> completely missing from the patch).
>
> In other words, if these MSRs are full RW passthrough, guests with XFD enabled
> will need to load the guest value on entry, save the guest value on exit, and
> load the host value on exit. That's in the neighborhood of a 40% increase in
> latency for a single VM-Enter/VM-Exit roundtrip (~1500 cycles => >2000 cycles).
>
> I'm not saying these can't be passhthrough, but there needs to be strong
> justification for letting the guest read/write them directly.
If we virtualize XFD, we have to context switch the guest/host values
on VM-entry/VM-exit, don't we? If we don't, we're forced to synthesize
the #NM on any instruction that would access a disabled state
component, and I don't think we have any way of doing that. We could
intercept a guest WRMSR to these MSRs, but it sounds like the guest
can still implicitly write to IA32_XFD_ERR, if we allow it to have a
non-zero IA32_XFD.
Perhaps the answer is "don't virtualize XFD."
On Sat, Feb 6, 2021 at 11:00 PM Jing Liu <[email protected]> wrote:
>
> The static xstate buffer kvm_xsave contains the extended register
> states, but it is not enough for dynamic features with large state.
>
> Introduce a new capability called KVM_CAP_X86_XSAVE_EXTENSION to
> detect if hardware has XSAVE extension (XFD). Meanwhile, add two
> new ioctl interfaces to get/set the whole xstate using struct
> kvm_xsave_extension buffer containing both static and dynamic
> xfeatures. Reuse fill_xsave and load_xsave for both cases.
>
> Signed-off-by: Jing Liu <[email protected]>
> ---
> +#define KVM_GET_XSAVE_EXTENSION _IOW(KVMIO, 0xa4, struct kvm_xsave_extension)
> +#define KVM_SET_XSAVE_EXTENSION _IOW(KVMIO, 0xa5, struct kvm_xsave_extension)
Isn't the convention to call these KVM_GET_XSAVE2 and KVM_SET_XSAVE2?
Do you have any documentation to add to Documentation/virt/kvm/api.rst?
On Sun, Feb 07, 2021, Jing Liu wrote:
> CPUID.0xD.1.EBX[1] is set if, when the compacted format of an XSAVE
> area is used, this extended state component located on the next
> 64-byte boundary following the preceding state component (otherwise,
> it is located immediately following the preceding state component).
>
> AMX tileconfig and tiledata are the first to use 64B alignment.
> Revise the runtime cpuid modification for this rule.
>
> Signed-off-by: Jing Liu <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 04a73c395c71..ee1fac0a865e 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -35,12 +35,17 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted)
> {
> int feature_bit = 0;
> u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;
> + bool is_aligned = false;
>
> xstate_bv &= XFEATURE_MASK_EXTEND;
> while (xstate_bv) {
> if (xstate_bv & 0x1) {
> u32 eax, ebx, ecx, edx, offset;
> cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
> + /* ECX[2]: 64B alignment in compacted form */
> + is_aligned = !!(ecx & 2);
> + if (is_aligned && compacted)
I'd forego the local is_aligned, and also check "compacted" first so that the
uncompacted variant isn't required to evaluate ecx.
And the real reason I am responding... can you post this as a standalone patch?
I stumbled across the "aligned" flag when reading through the SDM for a completely
unrelated reason, and also discovered that the flag has been documented since
2016. While AMX may be the first to "officially" utilize the alignment flag,
the flag itself is architectural and not strictly limited to AMX.
> + ret = ALIGN(ret, 64);
> offset = compacted ? ret : ebx;
> ret = max(ret, offset + eax);
> }
> --
> 2.18.4
>
On 5/25/2021 5:50 AM, Sean Christopherson wrote:
> On Sun, Feb 07, 2021, Jing Liu wrote:
>> The static xstate buffer kvm_xsave contains the extended register
>> states, but it is not enough for dynamic features with large state.
>>
>> Introduce a new capability called KVM_CAP_X86_XSAVE_EXTENSION to
>> detect if hardware has XSAVE extension (XFD). Meanwhile, add two
>> new ioctl interfaces to get/set the whole xstate using struct
>> kvm_xsave_extension buffer containing both static and dynamic
>> xfeatures. Reuse fill_xsave and load_xsave for both cases.
>>
>> Signed-off-by: Jing Liu <[email protected]>
>> ---
>> arch/x86/include/uapi/asm/kvm.h | 5 +++
>> arch/x86/kvm/x86.c | 70 +++++++++++++++++++++++++--------
>> include/uapi/linux/kvm.h | 8 ++++
>> 3 files changed, 66 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 89e5f3d1bba8..bf785e89a728 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -362,6 +362,11 @@ struct kvm_xsave {
>> __u32 region[1024];
>> };
>>
>> +/* for KVM_CAP_XSAVE_EXTENSION */
>> +struct kvm_xsave_extension {
>> + __u32 region[3072];
> Fool me once, shame on you (Intel). Fool me twice, shame on me (KVM).
>
> As amusing as kvm_xsave_really_extended would be, the required size should be
> discoverable, not hardcoded.
Thanks for reviewing the patch.
When looking at the current kvm_xsave structure, I was confused by the
static hardcoding of 1024 bytes, but failed to find any clue about how that
size was decided back in 2010 [1].
So we'd prefer to change the approach, right? Please correct me if I
misunderstood.
> Nothing prevents a hardware vendor from inventing
> a newfangled feature that requires yet more space.
> As an alternative to adding a dedicated capability, can we leverage
> GET_SUPPORTED_CPUID, leaf CPUID.0xD,
Yes, this is a good way to avoid a dedicated capability. Thanks for the
suggestion.
Use CPUID.0xD.1.EBX for the size of the enabled xcr0|xss components, assuming
kvm_xsave is meant to care about both.
> to enumerate the minimum required size and
> state
For the state, an extreme case is an old QEMU that only knows the layout
below, combined with a new KVM that supports some additional future_featureZ.
If the hardware vendor assigns the new components one after another, it's OK
to keep using a static layout like X86XSaveArea(2) for get/set between
userspace and KVM, because the format is non-compacted. If not, the state
will not be correct.
So far it is OK, so I'm wondering if this would be an issue for now?
X86XSaveArea2 {
	...
	XSaveAVX
	...
	AMX_XTILE;
	future_featureX;
	future_featureY;
}
> that the new ioctl() is available if the min size is greater than 1024?
> Or is that unnecessarily convoluted...
To enable a dynamically sized kvm_xsave2 (thanks to Jim for the name
suggestion), something like the following may be what we want.
/* for xstate larger than 1024 */
struct kvm_xsave2 {
	int size;	// size of the whole xstate
	void *ptr;	// xstate pointer
};
#define KVM_GET_XSAVE2 _IOW(KVMIO, 0xa4, struct kvm_xsave2)
Passing @size along means KVM need not fetch CPUID.0xD.1.EBX each time or
keep a dedicated variable.
For userspace (QEMU):
struct X86XSaveArea2 {...}	// new struct holding all features

if 0xd.1.ebx <= sizeof(kvm_xsave)
	env->xsave_buf = alloc(sizeof(kvm_xsave))
	...
	ioctl(KVM_GET/SET_XSAVE, X86XSaveArea *)
else
	env->xsave_buf = alloc(0xd.1.ebx + sizeof(int))
	...
	xsave2 = env->xsave_buf
	xsave2->size = ...
	X86XSaveArea2 *area2 = xsave2->ptr
	ioctl(KVM_GET/SET_XSAVE2, xsave2)
endif
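A rough C version of that flow, just to make it concrete (illustrative only:
the flexible-array layout is one possible reading of the kvm_xsave2 sketch
above, and xstate_size_from_cpuid() is an assumed helper, not existing QEMU
code):

/* Illustrative sketch, not real QEMU code. */
struct kvm_xsave2 {
	int size;		/* size of the whole xstate image */
	__u8 state[];		/* non-compacted XSAVE image follows */
};

static int kvm_get_xsave_any(int vcpu_fd, CPUX86State *env)
{
	uint32_t xstate_size = xstate_size_from_cpuid();	/* CPUID.0xD.1.EBX */
	struct kvm_xsave2 *xsave2;
	int ret;

	if (xstate_size <= sizeof(struct kvm_xsave))
		return ioctl(vcpu_fd, KVM_GET_XSAVE, env->xsave_buf);

	xsave2 = g_malloc0(sizeof(*xsave2) + xstate_size);
	xsave2->size = xstate_size;
	ret = ioctl(vcpu_fd, KVM_GET_XSAVE2, xsave2);	/* proposed ioctl */
	/* ... copy the pieces of xsave2->state into env ... */
	g_free(xsave2);
	return ret;
}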
[1] https://lore.kernel.org/kvm/[email protected]/
Thanks,
Jing
On 5/25/2021 6:06 AM, Jim Mattson wrote:
> On Sat, Feb 6, 2021 at 11:00 PM Jing Liu <[email protected]> wrote:
>> The static xstate buffer kvm_xsave contains the extended register
>> states, but it is not enough for dynamic features with large state.
>>
>> Introduce a new capability called KVM_CAP_X86_XSAVE_EXTENSION to
>> detect if hardware has XSAVE extension (XFD). Meanwhile, add two
>> new ioctl interfaces to get/set the whole xstate using struct
>> kvm_xsave_extension buffer containing both static and dynamic
>> xfeatures. Reuse fill_xsave and load_xsave for both cases.
>>
>> Signed-off-by: Jing Liu <[email protected]>
>> ---
>> +#define KVM_GET_XSAVE_EXTENSION _IOW(KVMIO, 0xa4, struct kvm_xsave_extension)
>> +#define KVM_SET_XSAVE_EXTENSION _IOW(KVMIO, 0xa5, struct kvm_xsave_extension)
> Isn't the convention to call these KVM_GET_XSAVE2 and KVM_SET_XSAVE2?
>
> Do you have any documentation to add to Documentation/virt/kvm/api.rst?
Thanks for reviewing the patch.
I'll change the names to follow the convention and add documentation if the
new APIs are needed.
BRs,
Jing
On 5/25/2021 5:53 AM, Sean Christopherson wrote:
> On Sun, Feb 07, 2021, Jing Liu wrote:
>> Two XCR0 bits are defined for AMX to support XSAVE mechanism.
>> Bit 17 is for tilecfg and bit 18 is for tiledata.
> This fails to explain why they must be set in tandem.
The spec says, "executing the XSETBV instruction causes a general-protection
fault (#GP) if ECX=0 and EAX[17] ≠ EAX[18] (XTILECFG and XTILEDATA must be
enabled together). This implies that the value of XCR0[17:18] is always
either 00b or 11b."
I can add more to the changelog if this is reasonable.
> Out of curiosity, assuming
> they do indeed need to be set/cleared as a pair, what's the point of having two
> separate bits?
What I can see is that the two bits separate the different state components
and mirror XFD, which can set its bits separately.
Thanks,
Jing
>
>> Signed-off-by: Jing Liu <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index bfbde877221e..f1c5893dee18 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -189,7 +189,7 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
>> #define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
>> | XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
>> | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
>> - | XFEATURE_MASK_PKRU)
>> + | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
>>
>> u64 __read_mostly host_efer;
>> EXPORT_SYMBOL_GPL(host_efer);
>> @@ -946,6 +946,12 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>> if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
>> return 1;
>> }
>> +
>> + if (xcr0 & XFEATURE_MASK_XTILE) {
>> + if ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE)
>> + return 1;
>> + }
>> +
>> vcpu->arch.xcr0 = xcr0;
>>
>> if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
>> --
>> 2.18.4
>>
On Wed, May 26, 2021, Liu, Jing2 wrote:
>
> On 5/25/2021 5:50 AM, Sean Christopherson wrote:
> > On Sun, Feb 07, 2021, Jing Liu wrote:
> > > The static xstate buffer kvm_xsave contains the extended register
> > > states, but it is not enough for dynamic features with large state.
> > >
> > > Introduce a new capability called KVM_CAP_X86_XSAVE_EXTENSION to
> > > detect if hardware has XSAVE extension (XFD). Meanwhile, add two
> > > new ioctl interfaces to get/set the whole xstate using struct
> > > kvm_xsave_extension buffer containing both static and dynamic
> > > xfeatures. Reuse fill_xsave and load_xsave for both cases.
> > >
> > > Signed-off-by: Jing Liu <[email protected]>
> > > ---
> > > arch/x86/include/uapi/asm/kvm.h | 5 +++
> > > arch/x86/kvm/x86.c | 70 +++++++++++++++++++++++++--------
> > > include/uapi/linux/kvm.h | 8 ++++
> > > 3 files changed, 66 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > > index 89e5f3d1bba8..bf785e89a728 100644
> > > --- a/arch/x86/include/uapi/asm/kvm.h
> > > +++ b/arch/x86/include/uapi/asm/kvm.h
> > > @@ -362,6 +362,11 @@ struct kvm_xsave {
> > > __u32 region[1024];
Hold up a sec. How big is the AMX data? The existing size is 4096 bytes, not
1024 bytes. IIRC, AMX is >4k, so we still need a new ioctl(), but we should be
careful to mentally adjust for the __u32 when mentioning the sizes.
> > > };
> > > +/* for KVM_CAP_XSAVE_EXTENSION */
> > > +struct kvm_xsave_extension {
> > > + __u32 region[3072];
> > Fool me once, shame on you (Intel). Fool me twice, shame on me (KVM).
> >
> > As amusing as kvm_xsave_really_extended would be, the required size should be
> > discoverable, not hardcoded.
> Thanks for reviewing the patch. When looking at current kvm_xsave structure,
> I felt confusing about the static hardcoding of 1024 bytes, but failed to
> find clue for its final decision in 2010[1].
Simplicity and lack of foresight :-)
> So we'd prefer to changing the way right? Please correct me if I misunderstood.
Sadly, we can't fix the existing ioctl() without breaking userspace. But for
the new ioctl(), yes, its size should not be hardcoded.
> > Nothing prevents a hardware vendor from inventing a newfangled feature that
> > requires yet more space. As an alternative to adding a dedicated
> > capability, can we leverage GET_SUPPORTED_CPUID, leaf CPUID.0xD,
> Yes, this is a good way to avoid a dedicated capability. Thanks for the
> suggestion. Use 0xD.1.EBX for size of enabled xcr0|xss if supposing
> kvm_xsave cares both.
> > to enumerate the minimum required size and state
> For the state, an extreme case is using an old qemu as follows, but a
> new kvm with more future_featureZ supported. If hardware vendor arranges
> one by one, it's OK to use static state like X86XSaveArea(2) and
> get/set between userspace and kvm because it's non-compacted. If not,
> the state will not correct.
> So far it is OK, so I'm wondering if this would be an issue for now?
Oh, you're saying that, because kvm_xsave is non-compacted, future features may
overflow kvm_xsave simply because the architectural offset overflows 4096 bytes.
That should be a non-issue for old KVM/kernels, since the new features shouldn't
be enabled. For new KVM, I think the right approach is to reject KVM_GET_XSAVE
and KVM_SET_XSAVE if the required size is greater than sizeof(struct kvm_xsave).
I.e. force userspace to either hide the features from the guest, or use
KVM_{G,S}ET_XSAVE2.
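E.g. something like this in the legacy KVM_GET_XSAVE/KVM_SET_XSAVE paths (a
sketch only; xstate_required_size() is the cpuid.c helper quoted elsewhere in
this thread, and the plumbing to call it from x86.c is hand-waved):

	/* Reject the legacy ioctls once the guest xstate no longer fits. */
	u32 size = xstate_required_size(vcpu->arch.xcr0, false);

	if (size > sizeof(struct kvm_xsave))
		return -E2BIG;	/* userspace must use KVM_{G,S}ET_XSAVE2 */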
> X86XSaveArea2 {
> 	...
> 	XSaveAVX
> 	...
> 	AMX_XTILE;
> 	future_featureX;
> 	future_featureY;
> }
>
> > that the new ioctl() is available if the min size is greater than 1024?
> > Or is that unnecessarily convoluted...
> To enable a dynamic size kvm_xsave2(Thanks Jim's name suggestion), if things
> as follows are what we might want.
> /* for xstate large than 1024 */
> struct kvm_xsave2 {
> 	int size; // size of the whole xstate
> 	void *ptr; // xstate pointer
> }
> #define KVM_GET_XSAVE2 _IOW(KVMIO, 0xa4, struct kvm_xsave2)
>
> Take @size together, so KVM need not fetch 0xd.1.ebx each time or a dedicated
> variable.
Yes, userspace needs to provide the size so that KVM doesn't unintentionally
overflow the buffer provided by userspace. We might also want to hedge by adding
a flags? Can't think of a use for it at the moment, though.
struct kvm_xsave2 {
__u32 flags;
__u32 size;
__u8 state[0];
};
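The consuming side might then look roughly like this (sketch only, nothing
here is committed code):

/* Validate the userspace-supplied header before filling in the state. */
static int kvm_vcpu_ioctl_get_xsave2(struct kvm_vcpu *vcpu,
				     struct kvm_xsave2 __user *uxsave2)
{
	u32 size = xstate_required_size(vcpu->arch.xcr0, false);
	struct kvm_xsave2 hdr;

	if (copy_from_user(&hdr, uxsave2, sizeof(hdr)))
		return -EFAULT;
	if (hdr.flags)
		return -EINVAL;		/* no flags defined yet */
	if (hdr.size < size)
		return -E2BIG;		/* buffer too small for the enabled features */

	/* ... fill a kernel buffer of @size bytes and copy it to uxsave2->state ... */
	return 0;
}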
On Wed, May 26, 2021, Liu, Jing2 wrote:
>
> On 5/25/2021 5:53 AM, Sean Christopherson wrote:
> > On Sun, Feb 07, 2021, Jing Liu wrote:
> > > Two XCR0 bits are defined for AMX to support XSAVE mechanism.
> > > Bit 17 is for tilecfg and bit 18 is for tiledata.
> > This fails to explain why they must be set in tandem.
> The spec says, "executing the XSETBV instruction causes a general-protection
> fault (#GP) if ECX=0 and EAX[17] ≠ EAX[18] (XTILECFG and XTILEDATA must be
> enabled together). This implies that the value of XCR0[17:18] is always
> either 00b or 11b."
>
> I can add more to changelog if this is reasonable.
Ya, please do. It doesn't have to be the full thing verbatim (but that's ok, too),
but the requirement does need to be called out.
> > Out of curiosity, assuming they do indeed need to be set/cleared as a
> > pair, what's the point of having two separate bits?
>
> What I can see is to separate different states and mirror by XFD which can
> set bits separately.
Ah, that would make sense. Thanks!
On 5/26/2021 10:43 PM, Sean Christopherson wrote:
> On Wed, May 26, 2021, Liu, Jing2 wrote:
>> On 5/25/2021 5:50 AM, Sean Christopherson wrote:
>>> On Sun, Feb 07, 2021, Jing Liu wrote:
>>>> The static xstate buffer kvm_xsave contains the extended register
>>>> states, but it is not enough for dynamic features with large state.
>>>>
>>>> Introduce a new capability called KVM_CAP_X86_XSAVE_EXTENSION to
>>>> detect if hardware has XSAVE extension (XFD). Meanwhile, add two
>>>> new ioctl interfaces to get/set the whole xstate using struct
>>>> kvm_xsave_extension buffer containing both static and dynamic
>>>> xfeatures. Reuse fill_xsave and load_xsave for both cases.
>>>>
>>>> Signed-off-by: Jing Liu <[email protected]>
>>>> ---
>>>> arch/x86/include/uapi/asm/kvm.h | 5 +++
>>>> arch/x86/kvm/x86.c | 70 +++++++++++++++++++++++++--------
>>>> include/uapi/linux/kvm.h | 8 ++++
>>>> 3 files changed, 66 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>>>> index 89e5f3d1bba8..bf785e89a728 100644
>>>> --- a/arch/x86/include/uapi/asm/kvm.h
>>>> +++ b/arch/x86/include/uapi/asm/kvm.h
>>>> @@ -362,6 +362,11 @@ struct kvm_xsave {
>>>> __u32 region[1024];
> Hold up a sec. How big is the AMX data?
AMX tileconfig size is 64B, but tiledata size is 8K.
> The existing size is 4096 bytes, not
> 1024 bytes. IIRC, AMX is >4k, so we still need a new ioctl(),
Yep, kvm_xsave can hold 4KB of state. We need a new ioctl holding all the
states, not only AMX. And once KVM supports AMX the size will be >4096, so
QEMU needs to use kvm_xsave2 instead; otherwise it cannot save/restore the
whole AMX state.
> but we should be
> careful to mentally adjust for the __u32 when mentioning the sizes.
>
>>>> };
>>>> +/* for KVM_CAP_XSAVE_EXTENSION */
>>>> +struct kvm_xsave_extension {
>>>> + __u32 region[3072];
>>> Fool me once, shame on you (Intel). Fool me twice, shame on me (KVM).
>>>
>>> As amusing as kvm_xsave_really_extended would be, the required size should be
>>> discoverable, not hardcoded.
>> Thanks for reviewing the patch. When looking at current kvm_xsave structure,
>> I felt confusing about the static hardcoding of 1024 bytes, but failed to
>> find clue for its final decision in 2010[1].
> Simplicity and lack of foresight :-)
>
>> So we'd prefer to changing the way right? Please correct me if I misunderstood.
> Sadly, we can't fix the existing ioctl() without breaking userspace. But for
> the new ioctl(), yes, its size should not be hardcoded.
>
>>> Nothing prevents a hardware vendor from inventing a newfangled feature that
>>> requires yet more space. As an alternative to adding a dedicated
>>> capability, can we leverage GET_SUPPORTED_CPUID, leaf CPUID.0xD,
>> Yes, this is a good way to avoid a dedicated capability. Thanks for the
>> suggestion. Use 0xD.1.EBX for size of enabled xcr0|xss if supposing
>> kvm_xsave cares both.
>>> to enumerate the minimum required size and state
>> For the state, an extreme case is using an old qemu as follows, but a
>> new kvm with more future_featureZ supported. If hardware vendor arranges
>> one by one, it's OK to use static state like X86XSaveArea(2) and
>> get/set between userspace and kvm because it's non-compacted. If not,
>> the state will not correct.
>> So far it is OK, so I'm wondering if this would be an issue for now?
> Oh, you're saying that, because kvm_xsave is non-compacted, future features may
> overflow kvm_xsave simply because the architectural offset overflows 4096 bytes.
>
> That should be a non-issue for old KVM/kernels, since the new features shouldn't
> be enabled. For new KVM, I think the right approach is to reject KVM_GET_XSAVE
> and KVM_SET_XSAVE if the required size is greater than sizeof(struct kvm_xsave).
> I.e. force userspace to either hide the features from the guest, or use
> KVM_{G,S}ET_XSAVE2.
I was considering whether the order/offset of future features would impact
compatibility if they are not laid out one after another. But I realized it's
not an issue, because the non-compacted format is used, so each offset
strictly follows the spec.
>> X86XSaveArea2 {
>> ...
>> XSaveAVX
>> ...
>> AMX_XTILE;
>> future_featureX;
>> future_featureY;
>> }
>>
>>> that the new ioctl() is available if the min size is greater than 1024?
>>> Or is that unnecessarily convoluted...
>> To enable a dynamic size kvm_xsave2(Thanks Jim's name suggestion), if things
>> as follows are what we might want.
>> /* for xstate large than 1024 */
>> struct kvm_xsave2 {
>> int size; // size of the whole xstate
>> void *ptr; // xstate pointer
>> }
>> #define KVM_GET_XSAVE2 _IOW(KVMIO, 0xa4, struct kvm_xsave2)
>>
>> Take @size together, so KVM need not fetch 0xd.1.ebx each time or a dedicated
>> variable.
> Yes, userspace needs to provide the size so that KVM doesn't unintentionally
> overflow the buffer provided by userspace. We might also want to hedge by adding
> a flags? Can't think of a use for it at the moment, though.
>
> struct kvm_xsave2 {
> __u32 flags;
> __u32 size;
> __u8 state[0];
> };
Using u8 makes things simple: KVM doesn't need to convert the size into __u32 units.
Thanks,
Jing
On 5/25/2021 5:43 AM, Sean Christopherson wrote:
> On Sun, Feb 07, 2021, Jing Liu wrote:
>> Passthrough both MSRs to let guest access and write without vmexit.
> Why? Except for read-only MSRs, e.g. MSR_CORE_C1_RES, passthrough MSRs are
> costly to support because KVM must context switch the MSR (which, by the by, is
> completely missing from the patch).
>
> In other words, if these MSRs are full RW passthrough, guests with XFD enabled
> will need to load the guest value on entry, save the guest value on exit, and
> load the host value on exit. That's in the neighborhood of a 40% increase in
> latency for a single VM-Enter/VM-Exit roundtrip (~1500 cycles => >2000 cycles).
>
I'm not saying these can't be passthrough, but there needs to be strong
> justification for letting the guest read/write them directly.
For IA32_XFD, it's per task and is switched on task switch (if the values
differ). Meanwhile, hardware uses IA32_XFD to trap the first instruction of a
task that touches AMX state and generate #NM. This means that if a vCPU is
running with AMX enabled, the hardware IA32_XFD should hold the guest value,
and once the guest #NM handler has finished, the bit should no longer be set
for that task.
No matter whether it is passthrough or not, IA32_XFD needs to be restored to
the guest value on VM-enter, and the host value loaded on exit (or zero
loaded, to avoid losing guest AMX in-use state). Passthrough additionally
lets the guest switch IA32_XFD across its own task switches without trapping.
For IA32_XFD_ERR, hardware automatically sets the bit on an #NM fault, and
the #NM handler detects and clears it.
No matter whether it is passthrough or not, KVM doesn't know whether the
guest took an #NM, so it needs to read and save IA32_XFD_ERR on VM-exit to
prevent a preempting task from clearing the bit.
Emulating (trapping) means KVM need not restore a guest non-zero
IA32_XFD_ERR on VM-enter, at the cost of extra traps in the guest #NM
handler. But restoring a non-zero IA32_XFD_ERR only matters for VM-exits
that happen between the task's first AMX instruction (which causes the #NM)
and the point where the guest #NM handler clears the bit.
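Roughly, the switching described above would look like this on the KVM side
(only a sketch; the guest_xfd/guest_xfd_err fields are placeholders, not the
actual patch code):

static void xfd_switch_to_guest(struct kvm_vcpu *vcpu)
{
	/* Load the guest value so in-guest task switches see the right XFD. */
	wrmsrl(MSR_IA32_XFD, vcpu->arch.guest_xfd);
}

static void xfd_switch_to_host(struct kvm_vcpu *vcpu)
{
	/*
	 * Save IA32_XFD_ERR before preemption: a host #NM path could
	 * otherwise clear a bit the guest #NM handler still needs.
	 */
	rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_xfd_err);

	/* Load zero so the guest's in-use AMX state is not lost. */
	wrmsrl(MSR_IA32_XFD, 0);
}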
Thanks,
Jing
On 5/25/2021 5:28 AM, Sean Christopherson wrote:
> On Sun, Feb 07, 2021, Jing Liu wrote:
>> CPUID.0xD.1.EBX[1] is set if, when the compacted format of an XSAVE
>> area is used, this extended state component is located on the next
>> 64-byte boundary following the preceding state component (otherwise,
>> it is located immediately following the preceding state component).
>>
>> AMX tileconfig and tiledata are the first to use 64B alignment.
>> Revise the runtime cpuid modification for this rule.
>>
>> Signed-off-by: Jing Liu<[email protected]>
>> ---
>> arch/x86/kvm/cpuid.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 04a73c395c71..ee1fac0a865e 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -35,12 +35,17 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted)
>> {
>> int feature_bit = 0;
>> u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;
>> + bool is_aligned = false;
>>
>> xstate_bv &= XFEATURE_MASK_EXTEND;
>> while (xstate_bv) {
>> if (xstate_bv & 0x1) {
>> u32 eax, ebx, ecx, edx, offset;
>> cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
>> + /* ECX[2]: 64B alignment in compacted form */
>> + is_aligned = !!(ecx & 2);
>> + if (is_aligned && compacted)
> I'd forego the local is_aligned, and also check "compacted" first so that the
> uncompacted variant isn't required to evaluate ecx.
The non-compacted format only applies to XCR0 (user states); do we need to
add a check, or simply do 'xstate_bv &= XFEATURE_MASK_USER_SUPPORTED'?
 	xstate_bv &= XFEATURE_MASK_EXTEND;
+	/* Only user states use non-compacted format. */
+	if (!compacted)
+		xstate_bv &= XFEATURE_MASK_USER_SUPPORTED;
+
 	while (xstate_bv) {
 		if (xstate_bv & 0x1) {
 			u32 eax, ebx, ecx, edx, offset;
 			cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
-			offset = compacted ? ret : ebx;
+			offset = compacted ? ((ecx & 0x2) ? ALIGN(ret, 64) : ret) : ebx;
 			ret = max(ret, offset + eax);
 		}
>
> And the real reason I am responding... can you post this as a standalone patch?
Sure. Let me separate it.
> I stumbled across the "aligned" flag when reading through the SDM for a completely
> unrelated reason, and also discovered that the flag has been documented since
> 2016. While AMX may be the first to "officially" utilize the alignment flag,
> the flag itself is architectural and not strictly limited to AMX.
Yes, this is not a new feature, but it seems no one has used it before.
Thanks,
Jing
>> + ret = ALIGN(ret, 64);
>> offset = compacted ? ret : ebx;
>> ret = max(ret, offset + eax);
>> }
>> --
>> 2.18.4
>>
On 5/25/2021 5:34 AM, Sean Christopherson wrote:
> I need a formletter for these...
>
> GET_SUPPORTED_CPUID advertises support to userspace, it does not expose anything
> to the guest.
Oh, yes. This is only part of the CPUID exposure process. Let me change the
commit log.
>
> On Sun, Feb 07, 2021, Jing Liu wrote:
>> Intel's Extended Feature Disable (XFD) feature is an extension
>> to the XSAVE feature that allows an operating system to enable
>> a feature while preventing specific user threads from using
>> the feature. A processor that supports XFD enumerates
>> CPUID.(EAX=0DH,ECX=1):EAX[4] as 1.
>>
>> Signed-off-by: Jing Liu <[email protected]>
>> ---
>> arch/x86/kvm/cpuid.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 83637a2ff605..04a73c395c71 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -437,7 +437,7 @@ void kvm_set_cpu_caps(void)
>> );
>>
>> kvm_cpu_cap_mask(CPUID_D_1_EAX,
>> - F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES)
>> + F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) | F(XFD)
> KVM must not advertise support until it actually has said support, i.e. this
> patch needs to go at the end of the series.
>
> Also, adding the kvm_cpu_cap flag in a separate patch isn't strictly required.
> In most cases, I would go so far as to say that if there is additional enabling
> to be done, advertising the feature should be done in the same patch that adds
> the last bits of enabling. Putting the CPUID stuff in its own patch doesn't
> usually add value, e.g. if there's a bug in the actual support code, bisecting
> will point at the wrong patch if userspace conditions its vCPU model on
> GET_SUPPORTED_CPUID.
Got it. Since XFD is a separate feature from AMX, when thinking about folding
the CPUID change into the MSR patches: the MSR work is mainly two parts, one
being support for the two MSRs, the other being the MSR switching. So it
doesn't seem suitable to put the CPUID change into the MSR-switching patch?
Thanks,
Jing
>> );
>>
>> kvm_cpu_cap_mask(CPUID_8000_0001_ECX,
>> --
>> 2.18.4
>>
On 6/1/2021 6:24 PM, Liu, Jing2 wrote:
>
>
> On 5/26/2021 10:43 PM, Sean Christopherson wrote:
>> On Wed, May 26, 2021, Liu, Jing2 wrote:
>>> On 5/25/2021 5:50 AM, Sean Christopherson wrote:
>>>> On Sun, Feb 07, 2021, Jing Liu wrote:
>>>>> The static xstate buffer kvm_xsave contains the extended register
>>>>> states, but it is not enough for dynamic features with large state.
>>>>>
>>>>> Introduce a new capability called KVM_CAP_X86_XSAVE_EXTENSION to
>>>>> detect if hardware has XSAVE extension (XFD). Meanwhile, add two
>>>>> new ioctl interfaces to get/set the whole xstate using struct
>>>>> kvm_xsave_extension buffer containing both static and dynamic
>>>>> xfeatures. Reuse fill_xsave and load_xsave for both cases.
>>>>>
>>>>> Signed-off-by: Jing Liu <[email protected]>
>>>>> ---
>>>>> arch/x86/include/uapi/asm/kvm.h | 5 +++
>>>>> arch/x86/kvm/x86.c | 70
>>>>> +++++++++++++++++++++++++--------
>>>>> include/uapi/linux/kvm.h | 8 ++++
>>>>> 3 files changed, 66 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/include/uapi/asm/kvm.h
>>>>> b/arch/x86/include/uapi/asm/kvm.h
>>>>> index 89e5f3d1bba8..bf785e89a728 100644
>>>>> --- a/arch/x86/include/uapi/asm/kvm.h
>>>>> +++ b/arch/x86/include/uapi/asm/kvm.h
>>>>> @@ -362,6 +362,11 @@ struct kvm_xsave {
>>>>> __u32 region[1024];
>> Hold up a sec. How big is the AMX data?
> AMX tileconfig size is 64B, but tiledata size is 8K.
>> The existing size is 4096 bytes, not
>> 1024 bytes. IIRC, AMX is >4k, so we still need a new ioctl(),
> Yep, kvm_xsave can hold 4KB state. We need a new ioctl, holding all
> the states,
> not only AMX. And once KVM supports AMX, the size will >4096 so qemu need
> use kvm_xsave2 instead, otherwise, cannot save/restore whole AMX state.
>> but we should be
>> careful to mentally adjust for the __u32 when mentioning the sizes.
>>
>>>>> };
>>>>> +/* for KVM_CAP_XSAVE_EXTENSION */
>>>>> +struct kvm_xsave_extension {
>>>>> + __u32 region[3072];
>>>> Fool me once, shame on you (Intel). Fool me twice, shame on me (KVM).
>>>>
>>>> As amusing as kvm_xsave_really_extended would be, the required size
>>>> should be
>>>> discoverable, not hardcoded.
>>> Thanks for reviewing the patch. When looking at current kvm_xsave
>>> structure,
>>> I felt confusing about the static hardcoding of 1024 bytes, but
>>> failed to
>>> find clue for its final decision in 2010[1].
>> Simplicity and lack of foresight :-)
>>
>>> So we'd prefer to changing the way right? Please correct me if I
>>> misunderstood.
>> Sadly, we can't fix the existing ioctl() without breaking userspace.
>> But for
>> the new ioctl(), yes, its size should not be hardcoded.
>>
>>>> Nothing prevents a hardware vendor from inventing a newfangled
>>>> feature that
>>>> requires yet more space. As an alternative to adding a dedicated
>>>> capability, can we leverage GET_SUPPORTED_CPUID, leaf CPUID.0xD,
>>> Yes, this is a good way to avoid a dedicated capability. Thanks for the
>>> suggestion. Use 0xD.1.EBX for size of enabled xcr0|xss if supposing
>>> kvm_xsave cares both.
I think the kvm_xsave ioctl only cares about user states, because supervisor
states should always use the compacted format. When thinking about how to
get/set supervisor states, I don't think we can reuse the current design
(QEMU talks with KVM via a buffer and gets/sets each of QEMU's
env->some_feature fields one by one according to the offset). So do we need
to handle supervisor states and change the approach?
For a kvm_xsave2 that expands to >4096 bytes, if we reuse and extend the
current way, it only derives the size of the XCR0 states from CPUID.0xD.0.EBX.
[...]
Thanks,
Jing
On 5/24/21 2:43 PM, Sean Christopherson wrote:
> On Sun, Feb 07, 2021, Jing Liu wrote:
>> Passthrough both MSRs to let guest access and write without vmexit.
> Why? Except for read-only MSRs, e.g. MSR_CORE_C1_RES, passthrough MSRs are
> costly to support because KVM must context switch the MSR (which, by the by, is
> completely missing from the patch).
>
> In other words, if these MSRs are full RW passthrough, guests with XFD enabled
> will need to load the guest value on entry, save the guest value on exit, and
> load the host value on exit. That's in the neighborhood of a 40% increase in
> latency for a single VM-Enter/VM-Exit roundtrip (~1500 cycles => >2000 cycles).
I'm not taking a position as to whether these _should_ be passthrough or
not. But, if they are, I don't think you strictly need to do the
RDMSR/WRMSR at VM-Exit time.
Just like the "FPU", XFD isn't be used in normal kernel code. This is
why we can be lazy about FPU state with TIF_NEED_FPU_LOAD. I _suspect_
that some XFD manipulation can be at least deferred to the same place
where the FPU state is manipulated: places like switch_fpu_return() or
kernel_fpu_begin().
Doing that would at least help the fast VM-Exit/VM-Enter paths that
really like TIF_NEED_FPU_LOAD today.
I guess the nasty part is that you actually need to stash the old XFD
MSR value in the vcpu structure and that's not available at
context-switch time. So, maybe this would only allow deferring the
WRMSR. That's better than nothing I guess.
On 6/24/2021 1:50 AM, Dave Hansen wrote:
> On 5/24/21 2:43 PM, Sean Christopherson wrote:
>> On Sun, Feb 07, 2021, Jing Liu wrote:
>>> Passthrough both MSRs to let guest access and write without vmexit.
>> Why? Except for read-only MSRs, e.g. MSR_CORE_C1_RES, passthrough MSRs are
>> costly to support because KVM must context switch the MSR (which, by the by, is
>> completely missing from the patch).
>>
>> In other words, if these MSRs are full RW passthrough, guests with XFD enabled
>> will need to load the guest value on entry, save the guest value on exit, and
>> load the host value on exit. That's in the neighborhood of a 40% increase in
>> latency for a single VM-Enter/VM-Exit roundtrip (~1500 cycles => >2000 cycles).
> I'm not taking a position as to whether these _should_ be passthrough or
> not. But, if they are, I don't think you strictly need to do the
> RDMSR/WRMSR at VM-Exit time.
Hi Dave,
Thanks for reviewing the patches.
On VM-exit, clearing XFD (because KVM thinks the guest has requested AMX) can
be deferred to the point where the host does XSAVES, but this means adding a
new flag in the common "fpu" structure, or a per-thread macro dedicated to
the KVM case, and checking that flag in 1) switch_fpu_prepare() and
2) kernel_fpu_begin(). That is my concern.
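For reference, the kind of hook I'm worried about would look roughly like
this (purely a sketch; TIF_XFD_ARMED_FOR_GUEST is a made-up flag name and
nothing here is existing code):

/*
 * KVM would set the flag after VM-exit while the guest's XFD value is
 * still live in hardware; switch_fpu_prepare() / kernel_fpu_begin()
 * would call this before touching the FPU state.
 */
static inline void xfd_sync_before_fpu_use(void)
{
	if (unlikely(test_thread_flag(TIF_XFD_ARMED_FOR_GUEST))) {
		wrmsrl(MSR_IA32_XFD, 0);
		clear_thread_flag(TIF_XFD_ARMED_FOR_GUEST);
	}
}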
Thanks,
Jing
> Just like the "FPU", XFD isn't be used in normal kernel code. This is
> why we can be lazy about FPU state with TIF_NEED_FPU_LOAD. I _suspect_
> that some XFD manipulation can be at least deferred to the same place
> where the FPU state is manipulated: places like switch_fpu_return() or
> kernel_fpu_begin().
>
> Doing that would at least help the fast VM-Exit/VM-Enter paths that
> really like TIF_NEED_FPU_LOAD today.
>
> I guess the nasty part is that you actually need to stash the old XFD
> MSR value in the vcpu structure and that's not available at
> context-switch time. So, maybe this would only allow deferring the
> WRMSR. That's better than nothing I guess.
On 6/27/21 7:00 PM, Liu, Jing2 wrote:
> On 6/24/2021 1:50 AM, Dave Hansen wrote:
>> On 5/24/21 2:43 PM, Sean Christopherson wrote:
>>> On Sun, Feb 07, 2021, Jing Liu wrote:
>>>> Passthrough both MSRs to let guest access and write without vmexit.
>>> Why? Except for read-only MSRs, e.g. MSR_CORE_C1_RES,
>>> passthrough MSRs are costly to support because KVM must context
>>> switch the MSR (which, by the by, is completely missing from the
>>> patch).
>>>
>>> In other words, if these MSRs are full RW passthrough, guests
>>> with XFD enabled will need to load the guest value on entry, save
>>> the guest value on exit, and load the host value on exit. That's
>>> in the neighborhood of a 40% increase in latency for a single
>>> VM-Enter/VM-Exit roundtrip (~1500 cycles =>
>>>> 2000 cycles).
>> I'm not taking a position as to whether these _should_ be passthrough or
>> not. But, if they are, I don't think you strictly need to do the
>> RDMSR/WRMSR at VM-Exit time.
> Hi Dave,
>
> Thanks for reviewing the patches.
>
> When vmexit, clearing XFD (because KVM thinks guest has requested AMX) can
> be deferred to the time when host does XSAVES, but this means need a new
> flag in common "fpu" structure or a common macro per thread which works
> only dedicated for KVM case, and check the flag in 1) switch_fpu_prepare()
> 2) kernel_fpu_begin() . This is the concern to me.
Why is this a concern? You're worried about finding a single bit worth
of space somewhere?
On 6/30/2021 1:58 AM, Dave Hansen wrote:
> On 6/27/21 7:00 PM, Liu, Jing2 wrote:
>> On 6/24/2021 1:50 AM, Dave Hansen wrote:
>>> On 5/24/21 2:43 PM, Sean Christopherson wrote:
>>>> On Sun, Feb 07, 2021, Jing Liu wrote:
>>>>> Passthrough both MSRs to let guest access and write without vmexit.
>>>> Why? Except for read-only MSRs, e.g. MSR_CORE_C1_RES,
>>>> passthrough MSRs are costly to support because KVM must context
>>>> switch the MSR (which, by the by, is completely missing from the
>>>> patch).
>>>>
>>>> In other words, if these MSRs are full RW passthrough, guests
>>>> with XFD enabled will need to load the guest value on entry, save
>>>> the guest value on exit, and load the host value on exit. That's
>>>> in the neighborhood of a 40% increase in latency for a single
>>>> VM-Enter/VM-Exit roundtrip (~1500 cycles =>
>>>>> 2000 cycles).
>>> I'm not taking a position as to whether these _should_ be passthrough or
>>> not. But, if they are, I don't think you strictly need to do the
>>> RDMSR/WRMSR at VM-Exit time.
>> Hi Dave,
>>
>> Thanks for reviewing the patches.
>>
>> When vmexit, clearing XFD (because KVM thinks guest has requested AMX) can
>> be deferred to the time when host does XSAVES, but this means need a new
>> flag in common "fpu" structure or a common macro per thread which works
>> only dedicated for KVM case, and check the flag in 1) switch_fpu_prepare()
>> 2) kernel_fpu_begin() . This is the concern to me.
> Why is this a concern? You're worried about finding a single bit worth
> of space somewhere?
A spare flag bit can be found, though the space is somewhat tight. What I am
worried about is that we would introduce a per-thread flag and add the check
in core places like the softirq path and the context-switch path, just to
handle the case of a KVM thread with XFD=1 and AMX in use in the guest. This
is not a very frequent case, yet the flag would be checked every time for
every thread.
I am considering using XGETBV(1) (~24 cycles) to detect whether KVM really
needs a wrmsr(0) to clear XFD for in-use guest AMX state on VM-exit. And that
is not a very frequent case, I think. My only concern is: does/will the
kernel check anywhere that a thread's FPU memory buffer is already large
while the thread's XFD=1? (I believe not.)
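To make the XGETBV(1) idea concrete, roughly (a sketch; xgetbv() is the
existing kernel helper, MSR_IA32_XFD and XFEATURE_MASK_XTILE come from the
AMX kernel series, and whether this condition is sufficient is exactly the
open question):

/* Right after VM-exit, while the guest FPU state is still live in hardware. */
static void kvm_maybe_clear_xfd(void)
{
	u64 xinuse = xgetbv(1);		/* XINUSE, ~24 cycles as estimated above */

	/* Only pay the WRMSR when the guest actually has tile state in use. */
	if (xinuse & XFEATURE_MASK_XTILE)
		wrmsrl(MSR_IA32_XFD, 0);
}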
Thanks,
Jing
>