From: Guang Zeng <[email protected]>
When AMX is enabled it requires a larger xstate buffer than the
legacy hardcoded 4KB one. Exising kvm ioctls (KVM_[G|S]ET_XSAVE
under KVM_CAP_XSAVE) are not suitable for this purpose.
A new capability (KVM_CAP_XSAVE2) is introduced to mark an extended
kvm_xsave format for supporting >4KB fpstate. The expanded fpstate
size is returned to userspace when KVM_CAP_XSAVE2 is queried.
Introduce KVM_GET_XSAVE2 under this capability with the new format
for copying kernel fpstate to userspace.
Reuse KVM_SET_XSAVE for both old/new formats by reimplementing it to
do properly-sized memdup_user() based on guest fpstate.
Signed-off-by: Guang Zeng <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 12 ++++++++++-
arch/x86/kvm/x86.c | 37 ++++++++++++++++++++++++++++++++-
include/uapi/linux/kvm.h | 4 ++++
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a776a08f78c..240e17829e89 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -373,9 +373,19 @@ struct kvm_debugregs {
__u64 reserved[9];
};
-/* for KVM_CAP_XSAVE */
+/* for KVM_CAP_XSAVE and KVM_CAP_XSAVE2 */
struct kvm_xsave {
+ /*
+ * KVM_GET_XSAVE only uses the first 4096 bytes.
+ *
+ * KVM_GET_XSAVE2 must have the size match what is returned by
+ * KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
+ *
+ * KVM_SET_XSAVE uses the extra field if guest_fpu::fpstate::size
+ * exceeds 4096 bytes.
+ */
__u32 region[1024];
+ __u32 extra[0];
};
#define KVM_MAX_XCRS 16
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f8bacf18e6ed..796a9f2d1f23 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4296,6 +4296,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
else
r = 0;
break;
+ case KVM_CAP_XSAVE2:
+ r = kvm->vcpus[0]->arch.guest_fpu.uabi_size;
+ break;
default:
break;
}
@@ -4899,6 +4902,16 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
vcpu->arch.pkru);
}
+static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
+ u8 *state, unsigned int size)
+{
+ if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+ return;
+
+ fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
+ state, size, vcpu->arch.pkru);
+}
+
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
struct kvm_xsave *guest_xsave)
{
@@ -5366,7 +5379,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
case KVM_SET_XSAVE: {
- u.xsave = memdup_user(argp, sizeof(*u.xsave));
+ int size = vcpu->arch.guest_fpu.uabi_size;
+
+ u.xsave = memdup_user(argp, size);
if (IS_ERR(u.xsave)) {
r = PTR_ERR(u.xsave);
goto out_nofree;
@@ -5375,6 +5390,26 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
break;
}
+
+ case KVM_GET_XSAVE2: {
+ int size = vcpu->arch.guest_fpu.uabi_size;
+
+ u.xsave = kzalloc(size, GFP_KERNEL_ACCOUNT);
+ if (!u.xsave) {
+ r = -ENOMEM;
+ break;
+ }
+
+ kvm_vcpu_ioctl_x86_get_xsave2(vcpu, u.buffer, size);
+
+ if (copy_to_user(argp, u.xsave, size)) {
+ r = -EFAULT;
+ break;
+ }
+ r = 0;
+ break;
+ }
+
case KVM_GET_XCRS: {
u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL_ACCOUNT);
r = -ENOMEM;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..9d1c01669560 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1131,6 +1131,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
#define KVM_CAP_ARM_MTE 205
#define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
+#define KVM_CAP_XSAVE2 207
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1610,6 +1611,9 @@ struct kvm_enc_region {
#define KVM_S390_NORMAL_RESET _IO(KVMIO, 0xc3)
#define KVM_S390_CLEAR_RESET _IO(KVMIO, 0xc4)
+/* Available with KVM_CAP_XSAVE2 */
+#define KVM_GET_XSAVE2 _IOR(KVMIO, 0xcf, struct kvm_xsave)
+
struct kvm_s390_pv_sec_parm {
__u64 origin;
__u64 length;
--
2.27.0
On 12/17/21 16:29, Jing Liu wrote:
> +/* for KVM_CAP_XSAVE and KVM_CAP_XSAVE2 */
> struct kvm_xsave {
> + /*
> + * KVM_GET_XSAVE only uses the first 4096 bytes.
> + *
> + * KVM_GET_XSAVE2 must have the size match what is returned by
> + * KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
> + *
> + * KVM_SET_XSAVE uses the extra field if guest_fpu::fpstate::size
> + * exceeds 4096 bytes.
KVM_GET_XSAVE2 and KVM_SET_XSAVE respectively write and read as many
bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), when
invoked on the vm file descriptor. Currently,
KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will only return a value that is
greater than 4096 bytes if any dynamic features have been enabled with
``arch_prctl()``; this however may change in the future.
The offsets of the state save areas in struct kvm_xsave follow the
contents of CPUID leaf 0xD on the host.
Paolo
On Monday, December 20, 2021 5:04 PM, Paolo Bonzini wrote:
> On 12/17/21 16:29, Jing Liu wrote:
> > +/* for KVM_CAP_XSAVE and KVM_CAP_XSAVE2 */
> > struct kvm_xsave {
> > + /*
> > + * KVM_GET_XSAVE only uses the first 4096 bytes.
> > + *
> > + * KVM_GET_XSAVE2 must have the size match what is returned by
> > + * KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
> > + *
> > + * KVM_SET_XSAVE uses the extra field if guest_fpu::fpstate::size
> > + * exceeds 4096 bytes.
>
> KVM_GET_XSAVE2 and KVM_SET_XSAVE respectively write and read as many
> bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), when
> invoked on the vm file descriptor. Currently,
> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will only return a value that is
> greater than 4096 bytes if any dynamic features have been enabled with
> ``arch_prctl()``; this however may change in the future.
Would this make people think that KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) doesn’t
return the value (i.e. return 0) if it is smaller than 4096?
(i.e. KVM_GET_XSAVE2 doesn't work with size < 4096, which isn’t true)
I plan to just reword a bit:
Currently, KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will only return a size value,
and the value is greater than 4096 bytes if any dynamic features have been enabled with
``arch_prctl()``. More types of values could be returned in the future.
Thanks,
Wei
On 12/21/21 03:45, Wang, Wei W wrote:
>> KVM_GET_XSAVE2 and KVM_SET_XSAVE respectively write and read as many
>> bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2), when
>> invoked on the vm file descriptor. Currently,
>> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will only return a value that is
>> greater than 4096 bytes if any dynamic features have been enabled with
>> ``arch_prctl()``; this however may change in the future.
> Would this make people think that KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) doesn’t
> return the value (i.e. return 0) if it is smaller than 4096?
> (i.e. KVM_GET_XSAVE2 doesn't work with size < 4096, which isn’t true)
>
> I plan to just reword a bit:
> Currently, KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will only return a size value,
> and the value is greater than 4096 bytes if any dynamic features have been enabled with
> ``arch_prctl()``. More types of values could be returned in the future.
Next refinement:
The size value returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will
always be at least 4096. Currently, it is only greater than 4096 if a
dynamic feature has been enabled with ``arch_prctl()``, but this may
change in the future.
(I'm not sure if the first sentence is true in the code, but if not it
is a bug that has to be fixed :)).
Paolo
On Tuesday, December 21, 2021 4:45 PM, Paolo Bonzini wrote:
> On 12/21/21 03:45, Wang, Wei W wrote:
> >> KVM_GET_XSAVE2 and KVM_SET_XSAVE respectively write and read as
> many
> >> bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2),
> when
> >> invoked on the vm file descriptor. Currently,
> >> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will only return a value that is
> >> greater than 4096 bytes if any dynamic features have been enabled
> >> with ``arch_prctl()``; this however may change in the future.
> > Would this make people think that
> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
> > doesn’t return the value (i.e. return 0) if it is smaller than 4096?
> > (i.e. KVM_GET_XSAVE2 doesn't work with size < 4096, which isn’t true)
> >
> > I plan to just reword a bit:
> > Currently, KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will only return a
> size
> > value, and the value is greater than 4096 bytes if any dynamic
> > features have been enabled with ``arch_prctl()``. More types of values could
> be returned in the future.
>
> Next refinement:
>
> The size value returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will
> always be at least 4096. Currently, it is only greater than 4096 if a dynamic
> feature has been enabled with ``arch_prctl()``, but this may change in the
> future.
> (I'm not sure if the first sentence is true in the code, but if not it is a bug that
> has to be fixed :)).
For the implementation, KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) always return kvm->vcpus[0]->arch.guest_fpu.uabi_size.
Do you want to change it to below?
If (kvm->vcpus[0]->arch.guest_fpu.uabi_size < 4096)
return 0;
else
return kvm->vcpus[0]->arch.guest_fpu.uabi_size;
If the size is less than 4096 (e.g. no dynamic xfeatures enabled),
userspace should use the old KVM_GET_XSAVE (instead of KVM_GET_XSAVE2)?
(KVM_GET_XSAVE2 supports to work with size less than 4096, so I think this isn't necessary)
Thanks,
Wei
On 12/21/21 10:06, Wang, Wei W wrote:
>> (I'm not sure if the first sentence is true in the code, but if not it is a bug that
>> has to be fixed :)).
> For the implementation, KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) always return kvm->vcpus[0]->arch.guest_fpu.uabi_size.
> Do you want to change it to below?
>
> If (kvm->vcpus[0]->arch.guest_fpu.uabi_size < 4096)
> return 0;
return 4096;
since the minimum size of struct kvm_xsave2 (with no extra) is 4096.
Paolo
> else
> return kvm->vcpus[0]->arch.guest_fpu.uabi_size;
>
> If the size is less than 4096 (e.g. no dynamic xfeatures enabled),
> userspace should use the old KVM_GET_XSAVE (instead of KVM_GET_XSAVE2)?
> (KVM_GET_XSAVE2 supports to work with size less than 4096, so I think this isn't necessary)