LinuxLists.cc - [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

2023-09-14 17:19:44

Subject: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

When user space requests guest xstate permits, the sufficient xstate size
is calculated from permitted mask. Currently the max guest permits are set
to fpu_kernel_cfg.default_features, and the latter doesn't include kernel
dynamic xfeatures, so add them back for correct guest fpstate size.

If guest dynamic xfeatures are enabled, KVM re-allocates guest fpstate area
with above resulting size before launches VM.

Signed-off-by: Yang Weijiang <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 4753c677e2e1..c5d903b4df4d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)

/* Calculate the resulting kernel state size */
mask = permitted | requested;
- /* Take supervisor states into account on the host */
+ /*
+ * Take supervisor states into account on the host. And add
+ * kernel dynamic xfeatures to guest since guest kernel may
+ * enable corresponding CPU feaures and the xstate registers
+ * need to be saved/restored properly.
+ */
if (!guest)
mask |= xfeatures_mask_supervisor();
+ else
+ mask |= fpu_kernel_dynamic_xfeatures;
+
ksize = xstate_calculate_size(mask, compacted);

/* Calculate the resulting user state size */
--
2.27.0

2023-09-14 20:45:43

by Dave Hansen

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On 9/13/23 23:33, Yang Weijiang wrote:
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
>
> /* Calculate the resulting kernel state size */
> mask = permitted | requested;
> - /* Take supervisor states into account on the host */
> + /*
> + * Take supervisor states into account on the host. And add
> + * kernel dynamic xfeatures to guest since guest kernel may
> + * enable corresponding CPU feaures and the xstate registers
> + * need to be saved/restored properly.
> + */
> if (!guest)
> mask |= xfeatures_mask_supervisor();
> + else
> + mask |= fpu_kernel_dynamic_xfeatures;
> +
> ksize = xstate_calculate_size(mask, compacted);

Heh, you changed the "guest" naming in "fpu_kernel_dynamic_xfeatures"
but didn't change the logic.

As it's coded at the moment *ALL* "fpu_kernel_dynamic_xfeatures" are
guest xfeatures. So, they're different in name only.

If you want to change the rules for guests, we have *ONE* place that's
done: fpstate_reset(). It establishes the permissions and the sizes for
the default guest FPU. Start there. If you want to make the guest
defaults include XFEATURE_CET_USER, then you need to put the bit in *there*.

The other option is to have the KVM code actually go and "request" that
the dynamic states get added to 'fpu->guest_perm'. Would there ever be
any reason for KVM to be on a system which supports a dynamic kernel
feature but where it doesn't get enabled for guest use, or at least
shouldn't have the FPU space allocated?

2023-09-15 08:20:13

by Yang Weijiang

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On 9/15/2023 1:40 AM, Dave Hansen wrote:
> On 9/13/23 23:33, Yang Weijiang wrote:
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
>>
>> /* Calculate the resulting kernel state size */
>> mask = permitted | requested;
>> - /* Take supervisor states into account on the host */
>> + /*
>> + * Take supervisor states into account on the host. And add
>> + * kernel dynamic xfeatures to guest since guest kernel may
>> + * enable corresponding CPU feaures and the xstate registers
>> + * need to be saved/restored properly.
>> + */
>> if (!guest)
>> mask |= xfeatures_mask_supervisor();
>> + else
>> + mask |= fpu_kernel_dynamic_xfeatures;
>> +
>> ksize = xstate_calculate_size(mask, compacted);
> Heh, you changed the "guest" naming in "fpu_kernel_dynamic_xfeatures"
> but didn't change the logic.
>
> As it's coded at the moment *ALL* "fpu_kernel_dynamic_xfeatures" are
> guest xfeatures. So, they're different in name only.
>
> If you want to change the rules for guests, we have *ONE* place that's
> done: fpstate_reset(). It establishes the permissions and the sizes for
> the default guest FPU. Start there. If you want to make the guest
> defaults include XFEATURE_CET_USER, then you need to put the bit in *there*.

Yeah, fpstate_reset() is the right place to hold the guest init permits and propagate
them here, thanks for the suggestion!

Nit, did you actually mean XFEATURE_CET_KERNEL instead of XFEATURE_CET_USER above?
because the latter is already supported by upstream kernel.

> The other option is to have the KVM code actually go and "request" that
> the dynamic states get added to 'fpu->guest_perm'.

Yes, compared with above option, it will change current userspace handling logic, i.e.,
only user xstates are dynamically requested. So I'd try above option first.

> Would there ever be
> any reason for KVM to be on a system which supports a dynamic kernel
> feature but where it doesn't get enabled for guest use, or at least
> shouldn't have the FPU space allocated?

I haven't heard of that kind of usage for other features so far, CET supervisor xstate is the
only dynamic kernel feature now, not sure whether other CPU features having supervisor
xstate would share the handling logic like CET does one day.

2023-10-24 17:07:27

by Sean Christopherson

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On Fri, Sep 15, 2023, Weijiang Yang wrote:
> On 9/15/2023 1:40 AM, Dave Hansen wrote:
> > On 9/13/23 23:33, Yang Weijiang wrote:
> > > --- a/arch/x86/kernel/fpu/xstate.c
> > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > @@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > > /* Calculate the resulting kernel state size */
> > > mask = permitted | requested;
> > > - /* Take supervisor states into account on the host */
> > > + /*
> > > + * Take supervisor states into account on the host. And add
> > > + * kernel dynamic xfeatures to guest since guest kernel may
> > > + * enable corresponding CPU feaures and the xstate registers
> > > + * need to be saved/restored properly.
> > > + */
> > > if (!guest)
> > > mask |= xfeatures_mask_supervisor();
> > > + else
> > > + mask |= fpu_kernel_dynamic_xfeatures;

This looks wrong. Per commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor
states in XSTATE permissions"), mask at this point only contains user features,
which somewhat unintuitively doesn't include CET_USER (I get that they're MSRs
and thus supervisor state, it's just the name that's odd).

IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
conatins PASID, CET_USER, and CET_KERNEL. PASID isn't virtualized by KVM, but
doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
enabling?

The existing code also seems odd, but I might be missing something. Won't the
kernel drop PASID if the guest request AMX/XTILE? I'm not at all familiar with
what PASID state is managed via XSAVE, so I've no idea if that's an actual problem
or just an oddity.

> > > ksize = xstate_calculate_size(mask, compacted);
> > Heh, you changed the "guest" naming in "fpu_kernel_dynamic_xfeatures"
> > but didn't change the logic.
> >
> > As it's coded at the moment *ALL* "fpu_kernel_dynamic_xfeatures" are
> > guest xfeatures. So, they're different in name only.

...

> > Would there ever be any reason for KVM to be on a system which supports a
> > dynamic kernel feature but where it doesn't get enabled for guest use, or
> > at least shouldn't have the FPU space allocated?
>
> I haven't heard of that kind of usage for other features so far, CET
> supervisor xstate is the only dynamic kernel feature now, not sure whether
> other CPU features having supervisor xstate would share the handling logic
> like CET does one day.

There are definitely scenarios where CET will not be exposed to KVM guests, but
I don't see any reason to make the guest FPU space dynamically sized for CET.
It's what, 40 bytes?

I would much prefer to avoid the whole "dynamic" thing and instead make CET
explicitly guest-only. E.g. fpu_kernel_guest_only_xfeatures? Or even better
if it doesn't cause weirdness elsewhere, a dedicated fpu_guest_cfg. For me at
least, a fpu_guest_cfg would make it easier to understand what all is going on.

2023-10-25 14:51:04

by Yang Weijiang

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On 10/25/2023 1:07 AM, Sean Christopherson wrote:
> On Fri, Sep 15, 2023, Weijiang Yang wrote:
>> On 9/15/2023 1:40 AM, Dave Hansen wrote:
>>> On 9/13/23 23:33, Yang Weijiang wrote:
>>>> --- a/arch/x86/kernel/fpu/xstate.c
>>>> +++ b/arch/x86/kernel/fpu/xstate.c
>>>> @@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
>>>> /* Calculate the resulting kernel state size */
>>>> mask = permitted | requested;
>>>> - /* Take supervisor states into account on the host */
>>>> + /*
>>>> + * Take supervisor states into account on the host. And add
>>>> + * kernel dynamic xfeatures to guest since guest kernel may
>>>> + * enable corresponding CPU feaures and the xstate registers
>>>> + * need to be saved/restored properly.
>>>> + */
>>>> if (!guest)
>>>> mask |= xfeatures_mask_supervisor();
>>>> + else
>>>> + mask |= fpu_kernel_dynamic_xfeatures;
> This looks wrong. Per commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor
> states in XSTATE permissions"), mask at this point only contains user features,
> which somewhat unintuitively doesn't include CET_USER (I get that they're MSRs
> and thus supervisor state, it's just the name that's odd).

I think the user-only boundary becomes unclear when fpstate_reset() introduce below line:
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

Then in xstate_request_perm(), it re-uses above reset value for __xstate_request_perm(),
so in the latter, the mask is already mixed with supervisor xfeatures.

> IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
> conatins PASID, CET_USER, and CET_KERNEL. PASID isn't virtualized by KVM, but
> doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
> enabling?

Yes, __state_size is correct for guest enabled xfeatures, including CET_USER, and it gets
removed from __state_perm.

IIUC, from current qemu/kernel interaction for guest permission settings, __xstate_request_perm()
is called only _ONCE_ to set AMX/XTILE for every vCPU thread, so the removal of guest supervisor
xfeatures won't hurt guest! ;-/

> The existing code also seems odd, but I might be missing something. Won't the
> kernel drop PASID if the guest request AMX/XTILE?

Yeah, dropped after the first invocation.

> I'm not at all familiar with
> what PASID state is managed via XSAVE, so I've no idea if that's an actual problem
> or just an oddity.
>
>>>> ksize = xstate_calculate_size(mask, compacted);
>>> Heh, you changed the "guest" naming in "fpu_kernel_dynamic_xfeatures"
>>> but didn't change the logic.
>>>
>>> As it's coded at the moment *ALL* "fpu_kernel_dynamic_xfeatures" are
>>> guest xfeatures. So, they're different in name only.
> ...
>
>>> Would there ever be any reason for KVM to be on a system which supports a
>>> dynamic kernel feature but where it doesn't get enabled for guest use, or
>>> at least shouldn't have the FPU space allocated?
>> I haven't heard of that kind of usage for other features so far, CET
>> supervisor xstate is the only dynamic kernel feature now, not sure whether
>> other CPU features having supervisor xstate would share the handling logic
>> like CET does one day.
> There are definitely scenarios where CET will not be exposed to KVM guests, but
> I don't see any reason to make the guest FPU space dynamically sized for CET.
> It's what, 40 bytes?

Could it also be xsave/xrstor operation efficiency for non-guest threads?

> I would much prefer to avoid the whole "dynamic" thing and instead make CET
> explicitly guest-only. E.g. fpu_kernel_guest_only_xfeatures? Or even better
> if it doesn't cause weirdness elsewhere, a dedicated fpu_guest_cfg. For me at
> least, a fpu_guest_cfg would make it easier to understand what all is going on.

Agree, guess non-kernel-generic designs are not very much welcome for kernel...

2023-10-26 17:24:41

by Sean Christopherson

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On Wed, Oct 25, 2023, Weijiang Yang wrote:
> On 10/25/2023 1:07 AM, Sean Christopherson wrote:
> > On Fri, Sep 15, 2023, Weijiang Yang wrote:
> > IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
> > conatins PASID, CET_USER, and CET_KERNEL. PASID isn't virtualized by KVM, but
> > doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
> > enabling?
>
> Yes, __state_size is correct for guest enabled xfeatures, including CET_USER,
> and it gets removed from __state_perm.
>
> IIUC, from current qemu/kernel interaction for guest permission settings,
> __xstate_request_perm() is called only _ONCE_ to set AMX/XTILE for every vCPU
> thread, so the removal of guest supervisor xfeatures won't hurt guest! ;-/

Huh? I don't follow. What does calling __xstate_request_perm() only once have
to do with anything?

/me stares more

OMG, hell no. First off, this code is a nightmare to follow. The existing comment
is useless. No shit the code is adding in supervisor states for the host. What's
not AT ALL clear is *why*.

The commit says it's necessary because the "permission bitmap is only relevant
for user states":

commit 781c64bfcb735960717d1cb45428047ff6a5030c
Author: Thomas Gleixner <[email protected]>
Date: Thu Mar 24 14:47:14 2022 +0100

x86/fpu/xstate: Handle supervisor states in XSTATE permissions

The size calculation in __xstate_request_perm() fails to take supervisor
states into account because the permission bitmap is only relevant for user
states.

But @permitted comes from:

permitted = xstate_get_group_perm(guest);

which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm. And
__state_perm is initialized to:

fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

where fpu_kernel_cfg.default_features contains everything except the dynamic
xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:

fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;

So why on earth does this code to force back xfeatures_mask_supervisor()? Because
the code just below drops the supervisor bits to compute the user xstate size and
then clobbers __state_perm.

/* Calculate the resulting user state size */
mask &= XFEATURE_MASK_USER_SUPPORTED;
usize = xstate_calculate_size(mask, false);

...

WRITE_ONCE(perm->__state_perm, mask);

That is beyond asinine. IIUC, the intent is to apply the permission bitmap only
for user states, because the only dynamic states are user states. Bbut the above
creates an inconsistent mess. If userspace doesn't request XTILE_DATA,
__state_perm will contain supervisor states, but once userspace does request
XTILE_DATA, __state_perm will be lost.

And because that's not confusing enough, clobbering __state_perm would also drop
FPU_GUEST_PERM_LOCKED, except that __xstate_request_perm() can' be reached with
said LOCKED flag set.

fpu_xstate_prctl() already strips out supervisor features:

case ARCH_GET_XCOMP_PERM:
/*
* Lockless snapshot as it can also change right after the
* dropping the lock.
*/
permitted = xstate_get_host_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

case ARCH_GET_XCOMP_GUEST_PERM:
permitted = xstate_get_guest_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

and while KVM doesn't apply the __state_perm to supervisor states, if it did
there would be zero harm in doing so.

case 0xd: {
u64 permitted_xcr0 = kvm_get_filtered_xcr0();
u64 permitted_xss = kvm_caps.supported_xss;

Second, the relying on QEMU to only trigger __xstate_request_perm() is not acceptable.
It "works" for the current code, but only because there's only a single dynamic
feature, i.e. this will short circuit and prevent computing a bad ksize.

/* Check whether fully enabled */
if ((permitted & requested) == requested)
return 0;

I don't know how I can possibly make it any clearer: KVM absolutely must not assume
userspace behavior.

So rather than continue with the current madness, which will break if/when the
next dynamic feature comes along, just preserve non-user xfeatures/flags in
__guest_perm.

If there are no objections, I'll test the below and write a proper changelog.

--
From: Sean Christopherson <[email protected]>
Date: Thu, 26 Oct 2023 10:17:33 -0700
Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
__state_perm

Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index ef6906107c54..73f6bc00d178 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
if ((permitted & requested) == requested)
return 0;

- /* Calculate the resulting kernel state size */
+ /*
+ * Calculate the resulting kernel state size. Note, @permitted also
+ * contains supervisor xfeatures even though supervisor are always
+ * permitted for kernel and guest FPUs, and never permitted for user
+ * FPUs.
+ */
mask = permitted | requested;
- /* Take supervisor states into account on the host */
- if (!guest)
- mask |= xfeatures_mask_supervisor();
ksize = xstate_calculate_size(mask, compacted);

- /* Calculate the resulting user state size */
- mask &= XFEATURE_MASK_USER_SUPPORTED;
- usize = xstate_calculate_size(mask, false);
+ /*
+ * Calculate the resulting user state size. Take care not to clobber
+ * the supervisor xfeatures in the new mask!
+ */
+ usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);

if (!guest) {
ret = validate_sigaltstack(usize);

base-commit: c076acf10c78c0d7e1aa50670e9cc4c91e8d59b4
--

2023-10-26 22:07:11

by Edgecombe, Rick P

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> +       /*
> +        * Calculate the resulting kernel state size. Note,
> @permitted also
> +        * contains supervisor xfeatures even though supervisor are
> always
> +        * permitted for kernel and guest FPUs, and never permitted
> for user
> +        * FPUs.

What is a user FPU vs kernel FPU in this context? By user FPU do you
mean, like user FPU state in a sigframe or something? Or a kernel
task's FPU? If the former I think this comment could be made more
clear. Maybe just drop the bit about user FPUs. At least the comment
makes more sense to me without it.

2023-10-27 15:46:46

by Yang Weijiang

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On 10/27/2023 1:24 AM, Sean Christopherson wrote:
> On Wed, Oct 25, 2023, Weijiang Yang wrote:
>> On 10/25/2023 1:07 AM, Sean Christopherson wrote:
>>> On Fri, Sep 15, 2023, Weijiang Yang wrote:
>>> IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
>>> conatins PASID, CET_USER, and CET_KERNEL. PASID isn't virtualized by KVM, but
>>> doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
>>> enabling?
>> Yes, __state_size is correct for guest enabled xfeatures, including CET_USER,
>> and it gets removed from __state_perm.
>>
>> IIUC, from current qemu/kernel interaction for guest permission settings,
>> __xstate_request_perm() is called only _ONCE_ to set AMX/XTILE for every vCPU
>> thread, so the removal of guest supervisor xfeatures won't hurt guest! ;-/
> Huh? I don't follow. What does calling __xstate_request_perm() only once have
> to do with anything?
>
> /me stares more
>
> OMG, hell no. First off, this code is a nightmare to follow. The existing comment
> is useless. No shit the code is adding in supervisor states for the host. What's
> not AT ALL clear is *why*.
>
> The commit says it's necessary because the "permission bitmap is only relevant
> for user states":
>
> commit 781c64bfcb735960717d1cb45428047ff6a5030c
> Author: Thomas Gleixner <[email protected]>
> Date: Thu Mar 24 14:47:14 2022 +0100
>
> x86/fpu/xstate: Handle supervisor states in XSTATE permissions
>
> The size calculation in __xstate_request_perm() fails to take supervisor
> states into account because the permission bitmap is only relevant for user
> states.
>
> But @permitted comes from:
>
> permitted = xstate_get_group_perm(guest);
>
> which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm. And
> __state_perm is initialized to:
>
> fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
>
> where fpu_kernel_cfg.default_features contains everything except the dynamic
> xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:
>
> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>
> So why on earth does this code to force back xfeatures_mask_supervisor()? Because
> the code just below drops the supervisor bits to compute the user xstate size and
> then clobbers __state_perm.
>
> /* Calculate the resulting user state size */
> mask &= XFEATURE_MASK_USER_SUPPORTED;
> usize = xstate_calculate_size(mask, false);
>
> ...
>
> WRITE_ONCE(perm->__state_perm, mask);
>
> That is beyond asinine. IIUC, the intent is to apply the permission bitmap only
> for user states, because the only dynamic states are user states. Bbut the above
> creates an inconsistent mess. If userspace doesn't request XTILE_DATA,
> __state_perm will contain supervisor states, but once userspace does request
> XTILE_DATA, __state_perm will be lost.

Exactly, thanks for calling it out!

> And because that's not confusing enough, clobbering __state_perm would also drop
> FPU_GUEST_PERM_LOCKED, except that __xstate_request_perm() can' be reached with
> said LOCKED flag set.
>
> fpu_xstate_prctl() already strips out supervisor features:
>
> case ARCH_GET_XCOMP_PERM:
> /*
> * Lockless snapshot as it can also change right after the
> * dropping the lock.
> */
> permitted = xstate_get_host_group_perm();
> permitted &= XFEATURE_MASK_USER_SUPPORTED;
> return put_user(permitted, uptr);
>
> case ARCH_GET_XCOMP_GUEST_PERM:
> permitted = xstate_get_guest_group_perm();
> permitted &= XFEATURE_MASK_USER_SUPPORTED;
> return put_user(permitted, uptr);
>
> and while KVM doesn't apply the __state_perm to supervisor states, if it did
> there would be zero harm in doing so.
>
> case 0xd: {
> u64 permitted_xcr0 = kvm_get_filtered_xcr0();
> u64 permitted_xss = kvm_caps.supported_xss;
>
> Second, the relying on QEMU to only trigger __xstate_request_perm() is not acceptable.
> It "works" for the current code, but only because there's only a single dynamic
> feature, i.e. this will short circuit and prevent computing a bad ksize.
>
> /* Check whether fully enabled */
> if ((permitted & requested) == requested)
> return 0;
>
> I don't know how I can possibly make it any clearer: KVM absolutely must not assume
> userspace behavior.
>
> So rather than continue with the current madness, which will break if/when the
> next dynamic feature comes along, just preserve non-user xfeatures/flags in
> __guest_perm.

Yes, it's time to rectify the confusion and make permission based settings clearer.
Below patch looks good to me, thanks!

> If there are no objections, I'll test the below and write a proper changelog.
>
> --
> From: Sean Christopherson <[email protected]>
> Date: Thu, 26 Oct 2023 10:17:33 -0700
> Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> __state_perm
>
> Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index ef6906107c54..73f6bc00d178 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> if ((permitted & requested) == requested)
> return 0;
>
> - /* Calculate the resulting kernel state size */
> + /*
> + * Calculate the resulting kernel state size. Note, @permitted also
> + * contains supervisor xfeatures even though supervisor are always
> + * permitted for kernel and guest FPUs, and never permitted for user
> + * FPUs.
> + */
> mask = permitted | requested;
> - /* Take supervisor states into account on the host */
> - if (!guest)
> - mask |= xfeatures_mask_supervisor();
> ksize = xstate_calculate_size(mask, compacted);
>
> - /* Calculate the resulting user state size */
> - mask &= XFEATURE_MASK_USER_SUPPORTED;
> - usize = xstate_calculate_size(mask, false);
> + /*
> + * Calculate the resulting user state size. Take care not to clobber
> + * the supervisor xfeatures in the new mask!
> + */
> + usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);
>
> if (!guest) {
> ret = validate_sigaltstack(usize);
>
> base-commit: c076acf10c78c0d7e1aa50670e9cc4c91e8d59b4

2023-10-31 17:46:27

by Maxim Levitsky

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> On Wed, Oct 25, 2023, Weijiang Yang wrote:
> > On 10/25/2023 1:07 AM, Sean Christopherson wrote:
> > > On Fri, Sep 15, 2023, Weijiang Yang wrote:
> > > IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
> > > conatins PASID, CET_USER, and CET_KERNEL. PASID isn't virtualized by KVM, but
> > > doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
> > > enabling?
> >
> > Yes, __state_size is correct for guest enabled xfeatures, including CET_USER,
> > and it gets removed from __state_perm.
> >
> > IIUC, from current qemu/kernel interaction for guest permission settings,
> > __xstate_request_perm() is called only _ONCE_ to set AMX/XTILE for every vCPU
> > thread, so the removal of guest supervisor xfeatures won't hurt guest! ;-/
>
> Huh? I don't follow. What does calling __xstate_request_perm() only once have
> to do with anything?
>
> /me stares more
>
> OMG, hell no. First off, this code is a nightmare to follow. The existing comment
> is useless. No shit the code is adding in supervisor states for the host. What's
> not AT ALL clear is *why*.
>
> The commit says it's necessary because the "permission bitmap is only relevant
> for user states":
>
> commit 781c64bfcb735960717d1cb45428047ff6a5030c
> Author: Thomas Gleixner <[email protected]>
> Date: Thu Mar 24 14:47:14 2022 +0100
>
> x86/fpu/xstate: Handle supervisor states in XSTATE permissions
>
> The size calculation in __xstate_request_perm() fails to take supervisor
> states into account because the permission bitmap is only relevant for user
> states.
>
> But @permitted comes from:
>
> permitted = xstate_get_group_perm(guest);
>
> which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm. And
> __state_perm is initialized to:
>
> fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

Not anymore after patch 5, and patch 5 does make sense in the regard to the fact
that we might not want to save/restore kernel CET state for nothing for regular kernel threads.

>
> where fpu_kernel_cfg.default_features contains everything except the dynamic
> xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:
>
> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>
> So why on earth does this code to force back xfeatures_mask_supervisor()? Because
> the code just below drops the supervisor bits to compute the user xstate size and
> then clobbers __state_perm.
>
> /* Calculate the resulting user state size */
> mask &= XFEATURE_MASK_USER_SUPPORTED;
> usize = xstate_calculate_size(mask, false);
>
> ...
>
> WRITE_ONCE(perm->__state_perm, mask);
>
> That is beyond asinine. IIUC, the intent is to apply the permission bitmap only
> for user states, because the only dynamic states are user states. Bbut the above
> creates an inconsistent mess. If userspace doesn't request XTILE_DATA,
> __state_perm will contain supervisor states, but once userspace does request
> XTILE_DATA, __state_perm will be lost.
>
> And because that's not confusing enough, clobbering __state_perm would also drop
> FPU_GUEST_PERM_LOCKED, except that __xstate_request_perm() can' be reached with
> said LOCKED flag set.
>
> fpu_xstate_prctl() already strips out supervisor features:
>
> case ARCH_GET_XCOMP_PERM:
> /*
> * Lockless snapshot as it can also change right after the
> * dropping the lock.
> */
> permitted = xstate_get_host_group_perm();
> permitted &= XFEATURE_MASK_USER_SUPPORTED;
> return put_user(permitted, uptr);
>
> case ARCH_GET_XCOMP_GUEST_PERM:
> permitted = xstate_get_guest_group_perm();
> permitted &= XFEATURE_MASK_USER_SUPPORTED;
> return put_user(permitted, uptr);
>
> and while KVM doesn't apply the __state_perm to supervisor states, if it did
> there would be zero harm in doing so.
>
> case 0xd: {
> u64 permitted_xcr0 = kvm_get_filtered_xcr0();
> u64 permitted_xss = kvm_caps.supported_xss;
>
> Second, the relying on QEMU to only trigger __xstate_request_perm() is not acceptable.
> It "works" for the current code, but only because there's only a single dynamic
> feature, i.e. this will short circuit and prevent computing a bad ksize.
>
> /* Check whether fully enabled */
> if ((permitted & requested) == requested)
> return 0;
>
> I don't know how I can possibly make it any clearer: KVM absolutely must not assume
> userspace behavior.
>
> So rather than continue with the current madness, which will break if/when the
> next dynamic feature comes along, just preserve non-user xfeatures/flags in
> __guest_perm.

I more or less agree with you, however I would like to discuss the FPU permissions
in more depth:

First of all we have two things at play here:

1. On demand resize of the thread's FPU state buffer to avoid penalty of context switching the AMX state.

2. The fact that allowing this on demand resize of this state buffer breaks the x86_64 ABI,
because FPU state has to be saved on the signal stack and ABI allows the stack size to be smaller than what is
needed to save the FPU state with AMX features enabled.

Thus a two tiered approach was done: first application asks for a permission to use the dynamic features,
and then when it actually uses it, the FPU state buffer is resized.

Otherwise if an AMX instruction is used by the app but the permission was not asked by the app for its xstate component,
the application is terminated.

(I might not 100% understand this correctly, please correct me if I am wrong).

However IMHO the 'fpu permission' name is a bit misleading,
This feature is not really about security/permissions but more like opt-in to use newer ABI,
for example KVM capabilities API, and the kernel will never refuse the permission request
(except if the
signal stack size is too small but the userspace can adjust it before asking for the permission)

On top of that I think that applying the same permission approach to guest's FPU state is not a good fit,
because of two reasons:

1. The guest FPU state will never be pushed on the signal stack - KVM swaps back the host FPU state
before it returns from the KVM_RUN ioctl.

Also I think (not sure) that ptrace can only access (FPU) state of a stopped process, and a stopped vCPU process
will also first return to userspace. Again I might be mistaken here, I never researched this in depth.

Assuming that I am correct on these assumptions, the guest FPU state can only be accessed via
KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2 ioctls,
which also returns the userspace portion of the state including optionally the AMX state,
but this ioctl doesn't really need FPU permission framework, because it is a KVM ABI, and in
fact KVM_GET_XSAVE2 was added exactly because of that: to make sure that userspace
is aware that larger than 4K buffer can be returned.

2. Guest FPU state is not even on demand resized (but I can imagine that in the future we will do this).

And of course, adding permissions for kernel features, that is even worse idea, which we really
shouldn't do.

>
> If there are no objections, I'll test the below and write a proper changelog.
>
> --
> From: Sean Christopherson <[email protected]>
> Date: Thu, 26 Oct 2023 10:17:33 -0700
> Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> __state_perm
>
> Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index ef6906107c54..73f6bc00d178 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> if ((permitted & requested) == requested)
> return 0;
>
> - /* Calculate the resulting kernel state size */
> + /*
> + * Calculate the resulting kernel state size. Note, @permitted also
> + * contains supervisor xfeatures even though supervisor are always
> + * permitted for kernel and guest FPUs, and never permitted for user
> + * FPUs.
> + */
> mask = permitted | requested;
> - /* Take supervisor states into account on the host */
> - if (!guest)
> - mask |= xfeatures_mask_supervisor();
> ksize = xstate_calculate_size(mask, compacted);

This might not work with kernel dynamic features, because xfeatures_mask_supervisor() will
return all supported supervisor features.

Therefore at least until we have an actual kernel dynamic feature
(a feature used by the host kernel and not KVM, and which has to be dynamic like AMX),
I suggest that KVM stops using the permission
API completely for the guest FPU state,
and just gives all the features it wants to enable right to __fpu_alloc_init_guest_fpstate()
(Guest FPU permission API IMHO should be deprecated and ignored)

>
> - /* Calculate the resulting user state size */
> - mask &= XFEATURE_MASK_USER_SUPPORTED;
> - usize = xstate_calculate_size(mask, false);
> + /*
> + * Calculate the resulting user state size. Take care not to clobber
> + * the supervisor xfeatures in the new mask!
> + */
> + usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);
>
> if (!guest) {
> ret = validate_sigaltstack(usize);

Best regards,
Maxim Levitsky

>
> base-commit: c076acf10c78c0d7e1aa50670e9cc4c91e8d59b4

2023-11-01 14:16:22

by Sean Christopherson

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> > On Wed, Oct 25, 2023, Weijiang Yang wrote:
> On top of that I think that applying the same permission approach to guest's
> FPU state is not a good fit, because of two reasons:
>
> 1. The guest FPU state will never be pushed on the signal stack - KVM swaps
> back the host FPU state before it returns from the KVM_RUN ioctl.
>
> Also I think (not sure) that ptrace can only access (FPU) state of a
> stopped process, and a stopped vCPU process will also first return to
> userspace. Again I might be mistaken here, I never researched this in
> depth.
>
> Assuming that I am correct on these assumptions, the guest FPU state can
> only be accessed via KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2 ioctls,
> which also returns the userspace portion of the state including optionally
> the AMX state, but this ioctl doesn't really need FPU permission
> framework, because it is a KVM ABI, and in fact KVM_GET_XSAVE2 was added
> exactly because of that: to make sure that userspace is aware that larger
> than 4K buffer can be returned.
>
> 2. Guest FPU state is not even on demand resized (but I can imagine that in
> the future we will do this).

Just because guest FPU state isn't resized doesn't mean there's no value in
requiring userspace to opt-in to allocating 8KiB of data per-vCPU.

> And of course, adding permissions for kernel features, that is even worse
> idea, which we really shouldn't do.
>
> >
> > If there are no objections, I'll test the below and write a proper changelog.
> >
> > --
> > From: Sean Christopherson <[email protected]>
> > Date: Thu, 26 Oct 2023 10:17:33 -0700
> > Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> > __state_perm
> >
> > Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> > 1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > index ef6906107c54..73f6bc00d178 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > if ((permitted & requested) == requested)
> > return 0;
> >
> > - /* Calculate the resulting kernel state size */
> > + /*
> > + * Calculate the resulting kernel state size. Note, @permitted also
> > + * contains supervisor xfeatures even though supervisor are always
> > + * permitted for kernel and guest FPUs, and never permitted for user
> > + * FPUs.
> > + */
> > mask = permitted | requested;
> > - /* Take supervisor states into account on the host */
> > - if (!guest)
> > - mask |= xfeatures_mask_supervisor();
> > ksize = xstate_calculate_size(mask, compacted);
>
> This might not work with kernel dynamic features, because
> xfeatures_mask_supervisor() will return all supported supervisor features.

I don't understand what you mean by "This".

Somewhat of a side topic, I feel very strongly that we should use "guest only"
terminology instead of "dynamic". There is nothing dynamic about whether or not
XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
wheter or not CET is supported.

> Therefore at least until we have an actual kernel dynamic feature (a feature
> used by the host kernel and not KVM, and which has to be dynamic like AMX),
> I suggest that KVM stops using the permission API completely for the guest
> FPU state, and just gives all the features it wants to enable right to

By "it", I assume you mean userspace?

> __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
> deprecated and ignored)

KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
or would require a VM-scoped KVM ioctl() to let userspace opt-in to

Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy, as KVM allows
multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN. E.g.
KVM would need to support actually resizing guest FPU state, which would be extra
complexity without any meaningful benefit.

The only benefit I can think of for a VM-scoped ioctl() is that it would allow a
single process to host multiple VMs with different dynamic xfeature requirements.
But such a setup is mostly theoretical. Maybe it'll affect the SEV migration
helper at some point? But even that isn't guaranteed.

So while I agree that ARCH_GET_XCOMP_GUEST_PERM isn't ideal, practically speaking
it's sufficient for all current use cases. Unless a concrete use case comes along,
deprecating ARCH_GET_XCOMP_GUEST_PERM in favor of a KVM ioctl() would be churn for
both the kernel and userspace without any meaningful benefit, or really even any
true change in behavior.

2023-11-02 18:21:29

by Maxim Levitsky

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On Wed, 2023-11-01 at 07:16 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> > > On Wed, Oct 25, 2023, Weijiang Yang wrote:
> > On top of that I think that applying the same permission approach to guest's
> > FPU state is not a good fit, because of two reasons:
> >
> > 1. The guest FPU state will never be pushed on the signal stack - KVM swaps
> > back the host FPU state before it returns from the KVM_RUN ioctl.
> >
> > Also I think (not sure) that ptrace can only access (FPU) state of a
> > stopped process, and a stopped vCPU process will also first return to
> > userspace. Again I might be mistaken here, I never researched this in
> > depth.
> >
> > Assuming that I am correct on these assumptions, the guest FPU state can
> > only be accessed via KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2 ioctls,
> > which also returns the userspace portion of the state including optionally
> > the AMX state, but this ioctl doesn't really need FPU permission
> > framework, because it is a KVM ABI, and in fact KVM_GET_XSAVE2 was added
> > exactly because of that: to make sure that userspace is aware that larger
> > than 4K buffer can be returned.
> >
> > 2. Guest FPU state is not even on demand resized (but I can imagine that in
> > the future we will do this).
>
> Just because guest FPU state isn't resized doesn't mean there's no value in
> requiring userspace to opt-in to allocating 8KiB of data per-vCPU.
See my response below:
>
> > And of course, adding permissions for kernel features, that is even worse
> > idea, which we really shouldn't do.
> >
> > >
> > > If there are no objections, I'll test the below and write a proper changelog.
> > >
> > > --
> > > From: Sean Christopherson <[email protected]>
> > > Date: Thu, 26 Oct 2023 10:17:33 -0700
> > > Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> > > __state_perm
> > >
> > > Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > ---
> > > arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> > > 1 file changed, 11 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > index ef6906107c54..73f6bc00d178 100644
> > > --- a/arch/x86/kernel/fpu/xstate.c
> > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > > if ((permitted & requested) == requested)
> > > return 0;
> > >
> > > - /* Calculate the resulting kernel state size */
> > > + /*
> > > + * Calculate the resulting kernel state size. Note, @permitted also
> > > + * contains supervisor xfeatures even though supervisor are always
> > > + * permitted for kernel and guest FPUs, and never permitted for user
> > > + * FPUs.
> > > + */
> > > mask = permitted | requested;
> > > - /* Take supervisor states into account on the host */
> > > - if (!guest)
> > > - mask |= xfeatures_mask_supervisor();
> > > ksize = xstate_calculate_size(mask, compacted);
> >
> > This might not work with kernel dynamic features, because
> > xfeatures_mask_supervisor() will return all supported supervisor features.
>
> I don't understand what you mean by "This".

>
> Somewhat of a side topic, I feel very strongly that we should use "guest only"
> terminology instead of "dynamic". There is nothing dynamic about whether or not
> XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
> wheter or not CET is supported.

> > Therefore at least until we have an actual kernel dynamic feature (a feature
> > used by the host kernel and not KVM, and which has to be dynamic like AMX),
> > I suggest that KVM stops using the permission API completely for the guest
> > FPU state, and just gives all the features it wants to enable right to
>
> By "it", I assume you mean userspace?
>
> > __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
> > deprecated and ignored)
>
> KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
> either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
> or would require a VM-scoped KVM ioctl() to let userspace opt-in to
>
> Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy,

> as KVM allows
> multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN. E.g.
> KVM would need to support actually resizing guest FPU state, which would be extra
> complexity without any meaningful benefit.

OK, I understand you now. What you claim is that it is legal to do this:

- KVM_SET_XSAVE
- KVM_SET_CPUID (with AMX enabled)

KVM_SET_CPUID will have to resize the xstate which is already valid.

Your patch to fix the __xstate_request_perm() does seem to be correct in a sense that it will
preserve the kernel fpu components in the fpu permissions.

However note that kernel fpu permissions come from 'fpu_kernel_cfg.default_features'
which don't include the dynamic kernel xfeatures (added a few patches before this one).

Therefore an attempt to resize the xstate to include a kernel dynamic feature by
__xfd_enable_feature will fail.

If kvm on the other hand includes all the kernel dynamic features in the initial
allocation of FPU state (not optimal but possible), then later call to __xstate_request_perm
for a userspace dynamic feature (which can still happen) will mess the the xstate,
because again the permission code assumes that only default kernel features were granted the permissions.

This has to be solved this way or another.

>
> The only benefit I can think of for a VM-scoped ioctl() is that it would allow a
> single process to host multiple VMs with different dynamic xfeature requirements.
> But such a setup is mostly theoretical. Maybe it'll affect the SEV migration
> helper at some point? But even that isn't guaranteed.
>
> So while I agree that ARCH_GET_XCOMP_GUEST_PERM isn't ideal, practically speaking
> it's sufficient for all current use cases. Unless a concrete use case comes along,
> deprecating ARCH_GET_XCOMP_GUEST_PERM in favor of a KVM ioctl() would be churn for
> both the kernel and userspace without any meaningful benefit, or really even any
> true change in behavior.

ARCH_GET_XCOMP_GUEST_PERM/ARCH_SET_XCOMP_GUEST_PERM is not a good API from usability POV, because it is redundant.

KVM already has API called KVM_SET_CPUID2, by which the qemu/userspace instructs the KVM, how much space to allocate,
to support a VM with *this* CPUID.

For example if qemu asks for nested SVM/VMX, then kvm will allocate on demand state for it (also at least 8K/vCPU btw).
The same should apply for AMX - Qemu sets AMX xsave bit in CPUID - that permits KVM to allocate the extra state when needed.

I don't see why we need an extra and non KVM API for that.

Best regards,
Maxim Levitsky

>

2023-11-03 14:33:28

by Sean Christopherson

[permalink] [raw]

Subject: Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 07:16 -0700, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> > > > --
> > > > From: Sean Christopherson <[email protected]>
> > > > Date: Thu, 26 Oct 2023 10:17:33 -0700
> > > > Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> > > > __state_perm
> > > >
> > > > Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> > > > Signed-off-by: Sean Christopherson <[email protected]>
> > > > ---
> > > > arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> > > > 1 file changed, 11 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > > index ef6906107c54..73f6bc00d178 100644
> > > > --- a/arch/x86/kernel/fpu/xstate.c
> > > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > > @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > > > if ((permitted & requested) == requested)
> > > > return 0;
> > > >
> > > > - /* Calculate the resulting kernel state size */
> > > > + /*
> > > > + * Calculate the resulting kernel state size. Note, @permitted also
> > > > + * contains supervisor xfeatures even though supervisor are always
> > > > + * permitted for kernel and guest FPUs, and never permitted for user
> > > > + * FPUs.
> > > > + */
> > > > mask = permitted | requested;
> > > > - /* Take supervisor states into account on the host */
> > > > - if (!guest)
> > > > - mask |= xfeatures_mask_supervisor();
> > > > ksize = xstate_calculate_size(mask, compacted);
> > >
> > > This might not work with kernel dynamic features, because
> > > xfeatures_mask_supervisor() will return all supported supervisor features.
> >
> > I don't understand what you mean by "This".
>
> >
> > Somewhat of a side topic, I feel very strongly that we should use "guest only"
> > terminology instead of "dynamic". There is nothing dynamic about whether or not
> > XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
> > wheter or not CET is supported.
>
> > > Therefore at least until we have an actual kernel dynamic feature (a feature
> > > used by the host kernel and not KVM, and which has to be dynamic like AMX),
> > > I suggest that KVM stops using the permission API completely for the guest
> > > FPU state, and just gives all the features it wants to enable right to
> >
> > By "it", I assume you mean userspace?
> >
> > > __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
> > > deprecated and ignored)
> >
> > KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
> > either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
> > or would require a VM-scoped KVM ioctl() to let userspace opt-in to
> >
> > Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy,
>
> > as KVM allows
> > multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN. E.g.
> > KVM would need to support actually resizing guest FPU state, which would be extra
> > complexity without any meaningful benefit.
>
>
> OK, I understand you now. What you claim is that it is legal to do this:
>
> - KVM_SET_XSAVE
> - KVM_SET_CPUID (with AMX enabled)
>
> KVM_SET_CPUID will have to resize the xstate which is already valid.

I was actually talking about

KVM_SET_CPUID2 (with dynamic user feature #1)
KVM_SET_CPUID2 (with dynamic user feature #2)

The second call through __xstate_request_perm() will be done with only user
xfeatures in @permitted and so the kernel will compute the wrong ksize.

> Your patch to fix the __xstate_request_perm() does seem to be correct in a
> sense that it will preserve the kernel fpu components in the fpu permissions.
>
> However note that kernel fpu permissions come from
> 'fpu_kernel_cfg.default_features' which don't include the dynamic kernel
> xfeatures (added a few patches before this one).

CET_KERNEL isn't dynamic! It's guest-only. There are no runtime decisions as to
whether or not CET_KERNEL is allowed. All guest FPU get CET_KERNEL, no kernel FPUs
get CET_KERNEL.

That matters because I am also proposing that we add a dedicated, defined-at-boot
fpu_guest_cfg instead of bolting on a "dynamic", which is what I meant by this:

: Or even better if it doesn't cause weirdness elsewhere, a dedicated
: fpu_guest_cfg. For me at least, a fpu_guest_cfg would make it easier to
: understand what all is going on.

That way, initialization of permissions is simply

fpu->guest_perm = fpu_guest_cfg.default_features;

and there's no need to differentiate between guest and kernel FPUs when reallocating
for dynamic user xfeatures because guest_perm.__state_perm already holds the correct
data.

> Therefore an attempt to resize the xstate to include a kernel dynamic feature by
> __xfd_enable_feature will fail.
>
> If kvm on the other hand includes all the kernel dynamic features in the
> initial allocation of FPU state (not optimal but possible),

This is what I am suggesting.

: There are definitely scenarios where CET will not be exposed to KVM guests, but
: I don't see any reason to make the guest FPU space dynamically sized for CET.
: It's what, 40 bytes?

> then later call to __xstate_request_perm for a userspace dynamic feature
> (which can still happen) will mess the the xstate, because again the
> permission code assumes that only default kernel features were granted the
> permissions.
>
>
> This has to be solved this way or another.
>
> >
> > The only benefit I can think of for a VM-scoped ioctl() is that it would allow a
> > single process to host multiple VMs with different dynamic xfeature requirements.
> > But such a setup is mostly theoretical. Maybe it'll affect the SEV migration
> > helper at some point? But even that isn't guaranteed.
> >
> > So while I agree that ARCH_GET_XCOMP_GUEST_PERM isn't ideal, practically speaking
> > it's sufficient for all current use cases. Unless a concrete use case comes along,
> > deprecating ARCH_GET_XCOMP_GUEST_PERM in favor of a KVM ioctl() would be churn for
> > both the kernel and userspace without any meaningful benefit, or really even any
> > true change in behavior.
>
>
> ARCH_GET_XCOMP_GUEST_PERM/ARCH_SET_XCOMP_GUEST_PERM is not a good API from
> usability POV, because it is redundant.
>
> KVM already has API called KVM_SET_CPUID2, by which the qemu/userspace
> instructs the KVM, how much space to allocate, to support a VM with *this*
> CPUID.
>
> For example if qemu asks for nested SVM/VMX, then kvm will allocate on demand
> state for it (also at least 8K/vCPU btw). The same should apply for AMX -
> Qemu sets AMX xsave bit in CPUID - that permits KVM to allocate the extra
> state when needed.
>
> I don't see why we need an extra and non KVM API for that.

I don't necessarily disagree, but what's done is done. We missed our chance to
propose a different mechanism, and at this point undoing all of that without good
cause is unlikely to benefit anyone. If a use comes along that needs something
"better" than the prctl() API, then I agree it'd be worth revisiting.