2022-10-11 22:46:06

by Dave Hansen

[permalink] [raw]
Subject: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

From: Yuan Yao <[email protected]>

This was found a couple of months ago in a big old AMX
backport. But, it looks to be a problem in mainline too.
Please let me know if this looks OK. I'd also especially
appreciate some testing from folks that have AMX hardware
handy.

Builds and survives a quick boot test on non-AMX hardware.

--

== Background ==

'init_fpstate' is a sort of template for all of the fpstates
that come after it. It is copied when new processes are
execve()'d or XRSTOR'd to get fpregs into a known state.

That means that it represents the *starting* state for a
process's fpstate which includes only the 'default' features.
Dynamic features can, of course, be acquired later, but
processes start with only default_features.

During boot the kernel decides whether all fpstates will be
kept in the compacted or uncompacted format. This choice is
communicated to the hardware via the XCOMP_BV field in all
XSAVE buffers, including 'init_fpstate'.

== Problem ==

But, the existing XCOMP_BV calculation is incorrect. It uses
the set of 'max_features', not the default features.

As a result, when XRSTOR operates on buffers derived from
'init_fpstate', it may attempt to "tickle" dynamic features which
are at offsets for which there is no space allocated in the
fpstate.

== Scope ==

This normally results in a relatively harmless out-of-bounds
memory read. It's harmless because it never gets consumed. But,
if the fpstate is next to some unmapped memory, this "tickle" can
cause a page fault and an oops.

This only causes issues on systems when dynamic features are
available and when an XSAVE buffer is next to uninitialized
memory. In other words, it only affects recent Intel server
CPUs, and in relatively few memory locations.

Those two things are why it took relatively long to catch this.

== Solution ==

Use 'default_features' to establish the init_fpstate
xcomp_bv value. Reset individual fpstate xcomp_bv values
when the rest of the fpstate is reset.

[ dhansen: add reset code from tglx, rewrites
commit message and comments ]

Fixes: 1c253ff2287f ("x86/fpu: Move xstate feature masks to fpu_*_cfg")
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Yuan Yao <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/fpu/core.c | 3 +++
arch/x86/kernel/fpu/xstate.c | 7 ++++++-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 3b28c5b25e12..4d64de34da12 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -526,6 +526,9 @@ static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
fpstate->xfeatures = fpu_kernel_cfg.default_features;
fpstate->user_xfeatures = fpu_user_cfg.default_features;
fpstate->xfd = xfd;
+
+ /* Ensure that xcomp_bv matches ->xfeatures */
+ xstate_init_xcomp_bv(&fpstate->regs.xsave, fpstate->xfeatures);
}

void fpstate_reset(struct fpu *fpu)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8340156bfd2..f9f45610c72f 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -360,7 +360,12 @@ static void __init setup_init_fpu_buf(void)

print_xstate_features();

- xstate_init_xcomp_bv(&init_fpstate.regs.xsave, fpu_kernel_cfg.max_features);
+ /*
+ * 'init_fpstate' is sized for the default feature
+ * set without any of the dynamic features.
+ */
+ xstate_init_xcomp_bv(&init_fpstate.regs.xsave,
+ fpu_kernel_cfg.default_features);

/*
* Init all the features state with header.xfeatures being 0x0
--
2.34.1


2022-10-11 22:51:21

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/11/2022 3:24 PM, Dave Hansen wrote:
> From: Yuan Yao <[email protected]>
>
> This was found a couple of months ago in a big old AMX
> backport. But, it looks to be a problem in mainline too.
> Please let me know if this looks OK. I'd also especially
> appreciate some testing from folks that have AMX hardware
> handy.
>
> Builds and survives a quick boot test on non-AMX hardware.
>
> --
>
> == Background ==
>
> 'init_fpstate' is a sort of template for all of the fpstates
> that come after it. It is copied when new processes are
> execve()'d or XRSTOR'd to get fpregs into a known state.
>
> That means that it represents the *starting* state for a
> process's fpstate which includes only the 'default' features.
> Dynamic features can, of course, be acquired later, but
> processes start with only default_features.
>
> During boot the kernel decides whether all fpstates will be
> kept in the compacted or uncompacted format. This choice is
> communicated to the hardware via the XCOMP_BV field in all
> XSAVE buffers, including 'init_fpstate'.
>
> == Problem ==
>
> But, the existing XCOMP_BV calculation is incorrect. It uses
> the set of 'max_features', not the default features.
>
> As a result, when XRSTOR operates on buffers derived from
> 'init_fpstate', it may attempt to "tickle" dynamic features which
> are at offsets for which there is no space allocated in the
> fpstate.
>
> == Scope ==
>
> This normally results in a relatively harmless out-of-bounds
> memory read. It's harmless because it never gets consumed. But,
> if the fpstate is next to some unmapped memory, this "tickle" can
> cause a page fault and an oops.
>
> This only causes issues on systems when dynamic features are
> available and when an XSAVE buffer is next to uninitialized
> memory. In other words, it only affects recent Intel server
> CPUs, and in relatively few memory locations.
>
> Those two things are why it took relatively long to catch this.
>
> == Solution ==
>
> Use 'default_features' to establish the init_fpstate
> xcomp_bv value. Reset individual fpstate xcomp_bv values
> when the rest of the fpstate is reset.
>
> [ dhansen: add reset code from tglx, rewrites
> commit message and comments ]
>
> Fixes: 1c253ff2287f ("x86/fpu: Move xstate feature masks to fpu_*_cfg")
> Suggested-by: Dave Hansen <[email protected]>
> Suggested-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Yuan Yao <[email protected]>
> Cc: [email protected]
> ---
> arch/x86/kernel/fpu/core.c | 3 +++
> arch/x86/kernel/fpu/xstate.c | 7 ++++++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 3b28c5b25e12..4d64de34da12 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -526,6 +526,9 @@ static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
> fpstate->xfeatures = fpu_kernel_cfg.default_features;
> fpstate->user_xfeatures = fpu_user_cfg.default_features;
> fpstate->xfd = xfd;
> +
> + /* Ensure that xcomp_bv matches ->xfeatures */
> + xstate_init_xcomp_bv(&fpstate->regs.xsave, fpstate->xfeatures);
> }
>
> void fpstate_reset(struct fpu *fpu)
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index c8340156bfd2..f9f45610c72f 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -360,7 +360,12 @@ static void __init setup_init_fpu_buf(void)
>
> print_xstate_features();
>
> - xstate_init_xcomp_bv(&init_fpstate.regs.xsave, fpu_kernel_cfg.max_features);
> + /*
> + * 'init_fpstate' is sized for the default feature
> + * set without any of the dynamic features.
> + */
> + xstate_init_xcomp_bv(&init_fpstate.regs.xsave,
> + fpu_kernel_cfg.default_features);

Along with this, perhaps this hunk needs to be updated as well [2]:

In arch/x86/kernel/fpu/init.c,

static void __init fpu__init_init_fpstate(void)
{
/* Bring init_fpstate size and features up to date */
init_fpstate.size = fpu_kernel_cfg.max_size;
init_fpstate.xfeatures = fpu_kernel_cfg.max_features;
}

I attempted to fix the issue in the earlier posting [1]. But, mine looks
to be clearly missing that __fpstate_reset() part.

Thanks,
Chang

[1]:
https://lore.kernel.org/lkml/[email protected]
[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/init.c#n213

2022-10-13 01:44:24

by Yao, Yuan

[permalink] [raw]
Subject: RE: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

>-----Original Message-----
>From: Dave Hansen <[email protected]>
>Sent: Wednesday, October 12, 2022 06:24
>To: [email protected]
>Cc: Bae, Chang Seok <[email protected]>; [email protected]; Yao, Yuan <[email protected]>; Hansen, Dave
><[email protected]>; Thomas Gleixner <[email protected]>
>Subject: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate
>
>From: Yuan Yao <[email protected]>
>
>This was found a couple of months ago in a big old AMX
>backport. But, it looks to be a problem in mainline too.
>Please let me know if this looks OK. I'd also especially
>appreciate some testing from folks that have AMX hardware
>handy.
>
>Builds and survives a quick boot test on non-AMX hardware.
>
>--
>
>== Background ==
>
>'init_fpstate' is a sort of template for all of the fpstates
>that come after it. It is copied when new processes are
>execve()'d or XRSTOR'd to get fpregs into a known state.
>
>That means that it represents the *starting* state for a
>process's fpstate which includes only the 'default' features.
>Dynamic features can, of course, be acquired later, but
>processes start with only default_features.
>
>During boot the kernel decides whether all fpstates will be
>kept in the compacted or uncompacted format. This choice is
>communicated to the hardware via the XCOMP_BV field in all
>XSAVE buffers, including 'init_fpstate'.
>
>== Problem ==
>
>But, the existing XCOMP_BV calculation is incorrect. It uses
>the set of 'max_features', not the default features.
>
>As a result, when XRSTOR operates on buffers derived from
>'init_fpstate', it may attempt to "tickle" dynamic features which
>are at offsets for which there is no space allocated in the
>fpstate.
>
>== Scope ==
>
>This normally results in a relatively harmless out-of-bounds
>memory read. It's harmless because it never gets consumed. But,
>if the fpstate is next to some unmapped memory, this "tickle" can
>cause a page fault and an oops.
>
>This only causes issues on systems when dynamic features are
>available and when an XSAVE buffer is next to uninitialized
>memory. In other words, it only affects recent Intel server
>CPUs, and in relatively few memory locations.
>
>Those two things are why it took relatively long to catch this.
>
>== Solution ==
>
>Use 'default_features' to establish the init_fpstate
>xcomp_bv value. Reset individual fpstate xcomp_bv values
>when the rest of the fpstate is reset.
>
>[ dhansen: add reset code from tglx, rewrites
> commit message and comments ]

Thanks to Dave's help on this! The new comment message is more
clear and easier to understand then before.

>
>Fixes: 1c253ff2287f ("x86/fpu: Move xstate feature masks to fpu_*_cfg")
>Suggested-by: Dave Hansen <[email protected]>
>Suggested-by: Thomas Gleixner <[email protected]>
>Signed-off-by: Yuan Yao <[email protected]>
>Cc: [email protected]
>---
> arch/x86/kernel/fpu/core.c | 3 +++
> arch/x86/kernel/fpu/xstate.c | 7 ++++++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>index 3b28c5b25e12..4d64de34da12 100644
>--- a/arch/x86/kernel/fpu/core.c
>+++ b/arch/x86/kernel/fpu/core.c
>@@ -526,6 +526,9 @@ static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
> fpstate->xfeatures = fpu_kernel_cfg.default_features;
> fpstate->user_xfeatures = fpu_user_cfg.default_features;
> fpstate->xfd = xfd;
>+
>+ /* Ensure that xcomp_bv matches ->xfeatures */
>+ xstate_init_xcomp_bv(&fpstate->regs.xsave, fpstate->xfeatures);
> }
>
> void fpstate_reset(struct fpu *fpu)
>diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>index c8340156bfd2..f9f45610c72f 100644
>--- a/arch/x86/kernel/fpu/xstate.c
>+++ b/arch/x86/kernel/fpu/xstate.c
>@@ -360,7 +360,12 @@ static void __init setup_init_fpu_buf(void)
>
> print_xstate_features();
>
>- xstate_init_xcomp_bv(&init_fpstate.regs.xsave, fpu_kernel_cfg.max_features);
>+ /*
>+ * 'init_fpstate' is sized for the default feature
>+ * set without any of the dynamic features.
>+ */
>+ xstate_init_xcomp_bv(&init_fpstate.regs.xsave,
>+ fpu_kernel_cfg.default_features);
>
> /*
> * Init all the features state with header.xfeatures being 0x0
>--
>2.34.1

2022-10-13 01:50:24

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/11/2022 3:24 PM, Dave Hansen wrote:
>
> Please let me know if this looks OK. I'd also especially
> appreciate some testing from folks that have AMX hardware
> handy.

The issue looks to be reproducible at boot time with
CONFIG_DEBUG_PAGEALLOC=y and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y.
Folks found this.

With the patch, I don't see the splat anymore.

Tested-by Chang S. Bae <[email protected]>

Thanks,
Chang

2022-10-13 03:52:45

by Yao, Yuan

[permalink] [raw]
Subject: RE: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

>-----Original Message-----
>From: Dave Hansen <[email protected]>
>Sent: Wednesday, October 12, 2022 06:24
>To: [email protected]
>Cc: Bae, Chang Seok <[email protected]>; [email protected]; Yao, Yuan <[email protected]>; Hansen, Dave
><[email protected]>; Thomas Gleixner <[email protected]>
>Subject: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate
>
>From: Yuan Yao <[email protected]>
>
>This was found a couple of months ago in a big old AMX
>backport. But, it looks to be a problem in mainline too.
>Please let me know if this looks OK. I'd also especially
>appreciate some testing from folks that have AMX hardware
>handy.
>
>Builds and survives a quick boot test on non-AMX hardware.
>
>--
>
>== Background ==
>
>'init_fpstate' is a sort of template for all of the fpstates
>that come after it. It is copied when new processes are
>execve()'d or XRSTOR'd to get fpregs into a known state.
>
>That means that it represents the *starting* state for a
>process's fpstate which includes only the 'default' features.
>Dynamic features can, of course, be acquired later, but
>processes start with only default_features.
>
>During boot the kernel decides whether all fpstates will be
>kept in the compacted or uncompacted format. This choice is
>communicated to the hardware via the XCOMP_BV field in all
>XSAVE buffers, including 'init_fpstate'.
>
>== Problem ==
>
>But, the existing XCOMP_BV calculation is incorrect. It uses
>the set of 'max_features', not the default features.
>
>As a result, when XRSTOR operates on buffers derived from
>'init_fpstate', it may attempt to "tickle" dynamic features which
>are at offsets for which there is no space allocated in the
>fpstate.
>
>== Scope ==
>
>This normally results in a relatively harmless out-of-bounds
>memory read. It's harmless because it never gets consumed. But,
>if the fpstate is next to some unmapped memory, this "tickle" can
>cause a page fault and an oops.
>
>This only causes issues on systems when dynamic features are
>available and when an XSAVE buffer is next to uninitialized
>memory. In other words, it only affects recent Intel server
>CPUs, and in relatively few memory locations.
>
>Those two things are why it took relatively long to catch this.
>
>== Solution ==
>
>Use 'default_features' to establish the init_fpstate
>xcomp_bv value. Reset individual fpstate xcomp_bv values
>when the rest of the fpstate is reset.
>
>[ dhansen: add reset code from tglx, rewrites
> commit message and comments ]
>
>Fixes: 1c253ff2287f ("x86/fpu: Move xstate feature masks to fpu_*_cfg")
>Suggested-by: Dave Hansen <[email protected]>
>Suggested-by: Thomas Gleixner <[email protected]>
>Signed-off-by: Yuan Yao <[email protected]>
>Cc: [email protected]
>---
> arch/x86/kernel/fpu/core.c | 3 +++
> arch/x86/kernel/fpu/xstate.c | 7 ++++++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>index 3b28c5b25e12..4d64de34da12 100644
>--- a/arch/x86/kernel/fpu/core.c
>+++ b/arch/x86/kernel/fpu/core.c
>@@ -526,6 +526,9 @@ static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
> fpstate->xfeatures = fpu_kernel_cfg.default_features;
> fpstate->user_xfeatures = fpu_user_cfg.default_features;
> fpstate->xfd = xfd;
>+
>+ /* Ensure that xcomp_bv matches ->xfeatures */
>+ xstate_init_xcomp_bv(&fpstate->regs.xsave, fpstate->xfeatures);
> }
>
> void fpstate_reset(struct fpu *fpu)
>diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>index c8340156bfd2..f9f45610c72f 100644
>--- a/arch/x86/kernel/fpu/xstate.c
>+++ b/arch/x86/kernel/fpu/xstate.c
>@@ -360,7 +360,12 @@ static void __init setup_init_fpu_buf(void)
>
> print_xstate_features();
>
>- xstate_init_xcomp_bv(&init_fpstate.regs.xsave, fpu_kernel_cfg.max_features);
>+ /*
>+ * 'init_fpstate' is sized for the default feature
>+ * set without any of the dynamic features.
>+ */
>+ xstate_init_xcomp_bv(&init_fpstate.regs.xsave,
>+ fpu_kernel_cfg.default_features);

Below trace is observed on host kernel with this patch applied when create VM.

The reason is __copy_xstate_to_uabi_buf() copies data from &init_fpstate when the component
is not existed in the source kernel fpstate (here is the AMX tile component), but the
AMX TILE bit is removed from init_fpstate due to this patch, so the WARN is triggered and return
NULL which causes kernel NULL pointer dereference later.

I considered 2 possible ways to fix this without big change:
1. For such non-exist component, we can fill the buffer in target fpstate to 0 and don't WARN if the bit is not set
in init_fpstate.
2. Enlarge the init_fpstate to content the dynamic components and still keel fpu_kernel_cfg.max_features
in init_fpstate (but still remove the dynamic bit from new cloned thread), so we can use it safely in above case,
but near 2 pages (8K) is wasted.

[ 100.989998] ------------[ cut here ]------------
[ 100.995332] WARNING: CPU: 109 PID: 2910 at arch/x86/kernel/fpu/xstate.c:944 __raw_xsave_addr+0x49/0x50
[ 101.005948] Modules linked in: kvm_intel kvm x86_pkg_temp_thermal snd_pcm input_leds snd_timer led_class joydev mac_hid irqbypass efi_pstore snd soundcore button sch_fq_codel ip_tables x_tables ixgbe mdio mdio_devres libphy igc [last unloaded: kvm]
[ 101.030924] CPU: 109 PID: 2910 Comm: CPU 0/KVM Tainted: G I 6.0.0-rc6-kmv-upstream-queue+ #23
[ 101.054199] RIP: 0010:__raw_xsave_addr+0x49/0x50
[ 101.059511] Code: 85 15 b3 44 39 01 74 1c 0f 1f 44 00 00 48 0f a3 f7 73 17 e8 d9 fe ff ff 89 c0 48 01 d8 5b 5d c3 cc cc cc cc 0f 0b 31 c0 eb f3 <0f> 0b 31 c0 eb ed 90 0f 1f 44 00 00 55 49 89 d0 48 89 f1 48 89 e5
[ 101.080865] RSP: 0018:ffa000000b8cbc30 EFLAGS: 00010206
[ 101.086870] RAX: 0000000000002000 RBX: ffffffff823be3c0 RCX: 0000000000000012
[ 101.095038] RDX: 0000000000040000 RSI: 0000000000000012 RDI: 80000000000206e7
[ 101.103177] RBP: ffa000000b8cbc38 R08: ffffffff823bf400 R09: 0000000000000001
[ 101.111350] R10: ffa000000b8cbc70 R11: ff110008ef3f4008 R12: ff110008ef3f4b00
[ 101.119500] R13: 0000000000002000 R14: 0000000000000012 R15: 0000000000000012
[ 101.127678] FS: 00007fca15639700(0000) GS:ff1100104fd40000(0000) knlGS:0000000000000000
[ 101.136922] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.143512] CR2: 00007fca202a9001 CR3: 00000008f49a8005 CR4: 0000000000773ee0
[ 101.151658] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 101.159817] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 101.167961] PKRU: 55555554
[ 101.171110] Call Trace:
[ 101.173972] <TASK>
[ 101.176435] __copy_xstate_to_uabi_buf+0x33b/0x870
[ 101.181936] fpu_copy_guest_fpstate_to_uabi+0x28/0x80
[ 101.187720] kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
[ 101.196409] ? __this_cpu_preempt_check+0x13/0x20
[ 101.204924] ? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
[ 101.213531] kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[ 101.221507] ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[ 101.229661] ? __fget_light+0xd4/0x130
[ 101.237011] __x64_sys_ioctl+0xe3/0x910
[ 101.244395] ? debug_smp_processor_id+0x17/0x20
[ 101.252463] ? fpregs_assert_state_consistent+0x27/0x50
[ 101.261228] do_syscall_64+0x3f/0x90
[ 101.268090] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 101.276574] RIP: 0033:0x7fca1b8f362b
[ 101.283312] Code: 0f 1e fa 48 8b 05 5d b8 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d b8 2c 00 f7 d8 64 89 01 48
[ 101.312528] RSP: 002b:00007fca15638488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 101.323793] RAX: ffffffffffffffda RBX: 000055b3c88ea9c0 RCX: 00007fca1b8f362b
[ 101.334583] RDX: 00007fca0c001000 RSI: ffffffff9000aecf RDI: 000000000000000e
[ 101.345320] RBP: 00007fca15638580 R08: 000055b3c80d95f0 R09: 000055b3c8820320
[ 101.355994] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffeed19156e
[ 101.366611] R13: 00007ffeed19156f R14: 00007ffeed191630 R15: 00007fca15638880
[ 101.377221] </TASK>
[ 101.382225] ---[ end trace 0000000000000000 ]---
[ 101.390004] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 101.400468] #PF: supervisor read access in kernel mode
[ 101.408908] #PF: error_code(0x0000) - not-present page
[ 101.417319] PGD 88bd3e067 P4D 0
[ 101.423541] Oops: 0000 [#1] PREEMPT SMP
[ 101.430410] CPU: 109 PID: 2910 Comm: CPU 0/KVM Tainted: G W I 6.0.0-rc6-kmv-upstream-queue+ #23
[ 101.463968] RIP: 0010:memcpy_erms+0x6/0x10
[ 101.471389] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 cc cc cc cc 66 90 48 89 f8 48 89 d1 <f3> a4 c3 cc cc cc cc 0f 1f 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 101.501238] RSP: 0018:ffa000000b8cbc40 EFLAGS: 00010246
[ 101.510134] RAX: ff110008ef3f4b00 RBX: 0000000000002000 RCX: 0000000000002000
[ 101.521233] RDX: 0000000000002000 RSI: 0000000000000000 RDI: ff110008ef3f4b00
[ 101.532327] RBP: ffa000000b8cbce0 R08: ffffffff823bf400 R09: 0000000000000001
[ 101.543425] R10: ffa000000b8cbc70 R11: ff110008ef3f4008 R12: ff110008ef3f6b00
[ 101.554530] R13: 0000000000000000 R14: 0000000000000012 R15: 0000000000000012
[ 101.565652] FS: 00007fca15639700(0000) GS:ff1100104fd40000(0000) knlGS:0000000000000000
[ 101.577915] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.587555] CR2: 0000000000000000 CR3: 00000008f49a8005 CR4: 0000000000773ee0
[ 101.598786] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 101.609993] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 101.621163] PKRU: 55555554
[ 101.627358] Call Trace:
[ 101.633163] <TASK>
[ 101.638474] ? __copy_xstate_to_uabi_buf+0x381/0x870
[ 101.646977] fpu_copy_guest_fpstate_to_uabi+0x28/0x80
[ 101.655526] kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
[ 101.663832] ? __this_cpu_preempt_check+0x13/0x20
[ 101.671938] ? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
[ 101.680146] kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[ 101.687746] ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[ 101.695513] ? __fget_light+0xd4/0x130
[ 101.702466] __x64_sys_ioctl+0xe3/0x910
[ 101.709507] ? debug_smp_processor_id+0x17/0x20
[ 101.717289] ? fpregs_assert_state_consistent+0x27/0x50
[ 101.725823] do_syscall_64+0x3f/0x90
[ 101.732497] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 101.740842] RIP: 0033:0x7fca1b8f362b
[ 101.747555] Code: 0f 1e fa 48 8b 05 5d b8 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d b8 2c 00 f7 d8 64 89 01 48
[ 101.776677] RSP: 002b:00007fca15638488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 101.787889] RAX: ffffffffffffffda RBX: 000055b3c88ea9c0 RCX: 00007fca1b8f362b
[ 101.798637] RDX: 00007fca0c001000 RSI: ffffffff9000aecf RDI: 000000000000000e
[ 101.809379] RBP: 00007fca15638580 R08: 000055b3c80d95f0 R09: 000055b3c8820320
[ 101.820038] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffeed19156e
[ 101.830642] R13: 00007ffeed19156f R14: 00007ffeed191630 R15: 00007fca15638880
[ 101.841228] </TASK>
[ 101.846230] Modules linked in: kvm_intel kvm x86_pkg_temp_thermal snd_pcm input_leds snd_timer led_class joydev mac_hid irqbypass efi_pstore snd soundcore button sch_fq_codel ip_tables x_tables ixgbe mdio mdio_devres libphy igc [last unloaded: kvm]
[ 101.878964] CR2: 0000000000000000
[ 101.885472] ---[ end trace 0000000000000000 ]---

>
> /*
> * Init all the features state with header.xfeatures being 0x0
>--
>2.34.1

2022-10-13 17:16:19

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/12/2022 8:35 PM, Yao, Yuan wrote:
>
> The reason is __copy_xstate_to_uabi_buf() copies data from &init_fpstate when the component
> is not existed in the source kernel fpstate (here is the AMX tile component), but the
> AMX TILE bit is removed from init_fpstate due to this patch, so the WARN is triggered and return
> NULL which causes kernel NULL pointer dereference later.

We have this in __copy_xstate_to_uabi_buf() [1]:

mask = fpstate->user_xfeatures;

for_each_extended_xfeature(i, mask) {
...
}

And the KVM code seems to set dynamic features regardless of the buffer
reallocation [2]:

vcpu->arch.guest_fpu.fpstate->user_xfeatures =
vcpu->arch.guest_supported_xcr0 | XFEATURE_MASK_FPSSE;

The kernel code seems to be aware of this as fpstate_realloc() does [3]:

if (!guest_fpu)
newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;

But it updates the 'xfeature' bitmask for all:

newfps->xfeatures = curfps->xfeatures | xfeatures;

So, I think we can do something like this here:

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8340156bfd2..8ea7d0e95f1a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1127,8 +1127,12 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
struct fpstate *fpstate,
* non-compacted format disabled features still occupy state space,
* but there is no state to copy from in the compacted
* init_fpstate. The gap tracking will zero these states.
+ *
+ * In the case of guest fpstate, this user_xfeatures does not
+ * dynamically reflect the capacity of the XSAVE buffer but
+ * xfeatures does. So AND them together.
*/
- mask = fpstate->user_xfeatures;
+ mask = fpstate->user_xfeatures & fpstate->xfeatures;

Let me also test this by running KVM.

Thanks,
Chang

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n1131
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/cpuid.c#n346
[3]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n1448

2022-10-13 17:34:09

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/13/22 09:23, Chang S. Bae wrote:
>
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1127,8 +1127,12 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
> struct fpstate *fpstate,
>          * non-compacted format disabled features still occupy state space,
>          * but there is no state to copy from in the compacted
>          * init_fpstate. The gap tracking will zero these states.
> +        *
> +        * In the case of guest fpstate, this user_xfeatures does not
> +        * dynamically reflect the capacity of the XSAVE buffer but
> +        * xfeatures does. So AND them together.
>          */
> -       mask = fpstate->user_xfeatures;
> +       mask = fpstate->user_xfeatures & fpstate->xfeatures;

I'm not sure this is quite right either.

Doesn't kvm expect that all of the ->user_xfeatures will end up being
copied out? We surely can't copy them from 'fpstate' if the feature
isn't there, but we can't skip them entirely, can we?

2022-10-13 17:53:29

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/13/22 10:33, Chang S. Bae wrote:
>>>
>>> -       mask = fpstate->user_xfeatures;
>>> +       mask = fpstate->user_xfeatures & fpstate->xfeatures;
>>
>> I'm not sure this is quite right either.
>>
>> Doesn't kvm expect that all of the ->user_xfeatures will end up being
>> copied out?  We surely can't copy them from 'fpstate' if the feature
>> isn't there, but we can't skip them entirely, can we?
>
> No, we can't skip them. IIUC, the code will zero out:
>
>     /*
>      * ... The gap tracking will zero these states.
>      */
>     mask = fpstate->user_xfeatures;
>
>     for_each_extended_xfeature(i, mask) {
>         /*
>          * If there was a feature or alignment gap, zero the space
>          * in the destination buffer.
>          */
>         if (zerofrom < xstate_offsets[i])
>             membuf_zero(&to, xstate_offsets[i] - zerofrom);
>
>         <snip>

Ahh, good point. A better comment for that would be:

* Some user_xfeatures may not be present in the fpstate.
* Remove those from 'mask' to zero those features in the
* user buffer instead of retrieving them from fpstate.

2022-10-13 18:43:18

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/13/2022 10:21 AM, Dave Hansen wrote:
> On 10/13/22 09:23, Chang S. Bae wrote:
>>
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -1127,8 +1127,12 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
>> struct fpstate *fpstate,
>>          * non-compacted format disabled features still occupy state space,
>>          * but there is no state to copy from in the compacted
>>          * init_fpstate. The gap tracking will zero these states.
>> +        *
>> +        * In the case of guest fpstate, this user_xfeatures does not
>> +        * dynamically reflect the capacity of the XSAVE buffer but
>> +        * xfeatures does. So AND them together.
>>          */
>> -       mask = fpstate->user_xfeatures;
>> +       mask = fpstate->user_xfeatures & fpstate->xfeatures;
>
> I'm not sure this is quite right either.
>
> Doesn't kvm expect that all of the ->user_xfeatures will end up being
> copied out? We surely can't copy them from 'fpstate' if the feature
> isn't there, but we can't skip them entirely, can we?

No, we can't skip them. IIUC, the code will zero out:

/*
* ... The gap tracking will zero these states.
*/
mask = fpstate->user_xfeatures;

for_each_extended_xfeature(i, mask) {
/*
* If there was a feature or alignment gap, zero the space
* in the destination buffer.
*/
if (zerofrom < xstate_offsets[i])
membuf_zero(&to, xstate_offsets[i] - zerofrom);

<snip>

/*
* Keep track of the last copied state in the non-compacted
* target buffer for gap zeroing.
*/
zerofrom = xstate_offsets[i] + xstate_sizes[i];
}

out:
if (to.left)
membuf_zero(&to, to.left);

Thanks,
Chang

2022-10-13 20:47:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/12/22 20:35, Yao, Yuan wrote:
> Below trace is observed on host kernel with this patch applied when create VM.
>
> The reason is __copy_xstate_to_uabi_buf() copies data from &init_fpstate when the component
> is not existed in the source kernel fpstate (here is the AMX tile component), but the
> AMX TILE bit is removed from init_fpstate due to this patch, so the WARN is triggered and return
> NULL which causes kernel NULL pointer dereference later.
>
> I considered 2 possible ways to fix this without big change:
> 1. For such non-exist component, we can fill the buffer in target fpstate to 0 and don't WARN if the bit is not set
> in init_fpstate.
> 2. Enlarge the init_fpstate to content the dynamic components and still keel fpu_kernel_cfg.max_features
> in init_fpstate (but still remove the dynamic bit from new cloned thread), so we can use it safely in above case,
> but near 2 pages (8K) is wasted.

There's a bigger problem here.

Removing the AMX bit from xstate_bv just *TELLS* the __raw_xsave_addr()
about the buffer being too small. Before this patch, __raw_xsave_addr()
was probably reading crap out of whatever is past 'init_fpstate' and
copying it out to userspace.

This mess makes me want a guard page after 'init_fpstate'.

Anyway, this isn't really a separate bug. It's just more nasty fallout
from xcomp_bv being wrong in the first place.

2022-10-14 04:18:39

by Yao, Yuan

[permalink] [raw]
Subject: RE: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

>-----Original Message-----
>From: Bae, Chang Seok <[email protected]>
>Sent: Friday, October 14, 2022 00:23
>To: Yao, Yuan <[email protected]>; Dave Hansen <[email protected]>; [email protected]
>Cc: [email protected]; Hansen, Dave <[email protected]>; Thomas Gleixner <[email protected]>
>Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate
>
>On 10/12/2022 8:35 PM, Yao, Yuan wrote:
>>
>> The reason is __copy_xstate_to_uabi_buf() copies data from &init_fpstate when the component
>> is not existed in the source kernel fpstate (here is the AMX tile component), but the
>> AMX TILE bit is removed from init_fpstate due to this patch, so the WARN is triggered and return
>> NULL which causes kernel NULL pointer dereference later.
>
>We have this in __copy_xstate_to_uabi_buf() [1]:
>
> mask = fpstate->user_xfeatures;
>
> for_each_extended_xfeature(i, mask) {
> ...
> }
>
>And the KVM code seems to set dynamic features regardless of the buffer
>reallocation [2]:
>
> vcpu->arch.guest_fpu.fpstate->user_xfeatures =
> vcpu->arch.guest_supported_xcr0 | XFEATURE_MASK_FPSSE;
>
>The kernel code seems to be aware of this as fpstate_realloc() does [3]:
>
> if (!guest_fpu)
> newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
>
>But it updates the 'xfeature' bitmask for all:
>
> newfps->xfeatures = curfps->xfeatures | xfeatures;
>
>So, I think we can do something like this here:
>
>diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>index c8340156bfd2..8ea7d0e95f1a 100644
>--- a/arch/x86/kernel/fpu/xstate.c
>+++ b/arch/x86/kernel/fpu/xstate.c
>@@ -1127,8 +1127,12 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
>struct fpstate *fpstate,
> * non-compacted format disabled features still occupy state space,
> * but there is no state to copy from in the compacted
> * init_fpstate. The gap tracking will zero these states.
>+ *
>+ * In the case of guest fpstate, this user_xfeatures does not
>+ * dynamically reflect the capacity of the XSAVE buffer but
>+ * xfeatures does. So AND them together.
> */
>- mask = fpstate->user_xfeatures;
>+ mask = fpstate->user_xfeatures & fpstate->xfeatures;

This doesn’t work. At this point KVM already called fpstate_realloc() for guest
fpstate so the dynamic bits already set for the fpstate->xfeature: fpstate->xfeatures is 0x606e7 here.

Also the guest fpstate's xstate_bv (header.xfeature here) is 0 here, so all data will be read from
init_fpstate instead of guest fpstate, which triggered this for reading AMX TILE component.

To keep using init_fpstate as "fallback" for reading component data in above case, changes like
below should work, but this removes the valuable WARN_ON_ONCE from __raw_xsae_addr():

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index f9f45610c72f..1471de470b58 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -941,7 +941,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
return NULL;

if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED)) {
- if (WARN_ON_ONCE(!(xcomp_bv & BIT_ULL(xfeature_nr))))
+ if (!(xcomp_bv & BIT_ULL(xfeature_nr)))
return NULL;
}

@@ -1049,7 +1049,10 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
void *init_xstate, unsigned int size)
{
- membuf_write(to, from_xstate ? xstate : init_xstate, size);
+ if ((from_xstate && xstate) || (!from_xstate && init_xstate))
+ membuf_write(to, from_xstate ? xstate : init_xstate, size);
+ else
+ membuf_zero(to, size);
}

>
>Let me also test this by running KVM.
>
>Thanks,
>Chang
>
>[1]
>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n1131
>[2]
>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/cpuid.c#n346
>[3]
>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n1448

2022-10-14 04:39:44

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/13/2022 10:44 AM, Dave Hansen wrote:
>
> A better comment for that would be:
>
> * Some user_xfeatures may not be present in the fpstate.
> * Remove those from 'mask' to zero those features in the
> * user buffer instead of retrieving them from fpstate.

Yes, indeed!

Also this xstate copy routine looks to need some updates.

If an xfeature is present in fpstate, and in init state, the value is
retrieved from init_fpstate via copy_feature(). But, it has no space for
dynamic states. Also, for extended states, the init state is known to be
zero.

Then, perhaps, init_fpstate is better not to be accessed in the
for_each_extended_xfeature loop; instead of using copy_feature(), the
feature can be zeroed like this:

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 13b83b11b3d8..0fdfd03938b6 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1150,11 +1150,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
struct fpstate *fpstate,
*/
pkru.pkru = pkru_val;
membuf_write(&to, &pkru, sizeof(pkru));
- } else {
- copy_feature(header.xfeatures & BIT_ULL(i), &to,
- __raw_xsave_addr(xsave, i),
- __raw_xsave_addr(xinit, i),
+ } else if (header.xfeatures & BIT_ULL(i)) {
+ membuf_write(&to, __raw_xsave_addr(xsave, i),
xstate_sizes[i]);
+ } else {
+ membuf_zero(&to, xstate_sizes[i]);
}

Thanks,
Chang

2022-10-14 04:41:54

by Yao, Yuan

[permalink] [raw]
Subject: RE: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

>-----Original Message-----
>From: Bae, Chang Seok <[email protected]>
>Sent: Friday, October 14, 2022 11:54
>To: Hansen, Dave <[email protected]>; Yao, Yuan <[email protected]>; Dave Hansen <[email protected]>;
>[email protected]
>Cc: [email protected]; Thomas Gleixner <[email protected]>
>Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate
>
>On 10/13/2022 10:44 AM, Dave Hansen wrote:
>>
>> A better comment for that would be:
>>
>> * Some user_xfeatures may not be present in the fpstate.
>> * Remove those from 'mask' to zero those features in the
>> * user buffer instead of retrieving them from fpstate.
>
>Yes, indeed!
>
>Also this xstate copy routine looks to need some updates.
>
>If an xfeature is present in fpstate, and in init state, the value is
>retrieved from init_fpstate via copy_feature(). But, it has no space for
>dynamic states. Also, for extended states, the init state is known to be
>zero.
>
>Then, perhaps, init_fpstate is better not to be accessed in the
>for_each_extended_xfeature loop; instead of using copy_feature(), the
>feature can be zeroed like this:
>
>diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>index 13b83b11b3d8..0fdfd03938b6 100644
>--- a/arch/x86/kernel/fpu/xstate.c
>+++ b/arch/x86/kernel/fpu/xstate.c
>@@ -1150,11 +1150,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
>struct fpstate *fpstate,
> */
> pkru.pkru = pkru_val;
> membuf_write(&to, &pkru, sizeof(pkru));
>- } else {
>- copy_feature(header.xfeatures & BIT_ULL(i), &to,
>- __raw_xsave_addr(xsave, i),
>- __raw_xsave_addr(xinit, i),
>+ } else if (header.xfeatures & BIT_ULL(i)) {
>+ membuf_write(&to, __raw_xsave_addr(xsave, i),
> xstate_sizes[i]);
>+ } else {
>+ membuf_zero(&to, xstate_sizes[i]);

Ah, I didn’t aware your reply before sent mine :- )

Does init_fpstate saves the "init" state for all non-dynamic components in its buffer ? if no than
this change is better, else read from init_fpstate for allocated buffer and only zero the not exist buffer is preferred to me.
The head.xfeatures is 0 before the guest fpu is used to xsaves at least once.

> }
>
>Thanks,
>Chang

2022-10-14 04:52:27

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/13/2022 9:10 PM, Yao, Yuan wrote:
>
> Does init_fpstate saves the "init" state for all non-dynamic components in its buffer ?

For legacy states :)

* ... But doing so is a pointless exercise because most
* components have an all zeros init state except for the legacy
* ones (FP and SSE).

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n379

Thanks,
Chang

2022-10-18 00:09:37

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

On 10/11/2022 3:24 PM, Dave Hansen wrote:
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 3b28c5b25e12..4d64de34da12 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -526,6 +526,9 @@ static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
> fpstate->xfeatures = fpu_kernel_cfg.default_features;
> fpstate->user_xfeatures = fpu_user_cfg.default_features;
> fpstate->xfd = xfd;
> +
> + /* Ensure that xcomp_bv matches ->xfeatures */
> + xstate_init_xcomp_bv(&fpstate->regs.xsave, fpstate->xfeatures);
> }

I have some difficulty understanding the problem without this. Maybe I'm
missing something here:

We have two call sites for this -- (a) one for the guest fpstate
allocation [1] and (b) the other for the reset [2].

(a) The former has a call chain to init xcomp_bv:
fpu_alloc_guest_fpstate()->fpstate_init_user()->xstate_init_xcomp_bv()

(b) And the latter picks up the default area:
void fpstate_reset(struct fpu *fpu)
{
/* Set the fpstate pointer to the default fpstate */
fpu->fpstate = &fpu->__fpstate;
__fpstate_reset(fpu->fpstate, init_fpstate.xfd);
...
}
Then, xcomp_bv looks to be subsequently written by XSAVE* or by copying
from init_fpstate.

There are three distinct call sites for fpstate_reset():
* fpu_clone() [3]: the child will either copy from init_fpstate or do
XSAVE* that will update xcomp_bv according to ->xfeatures.
* fpu__init_system_xstate() [4]: When the init task switches away,
xcomp_bv will be updated by XSAVE*.
* fpu_flush_thread() [5]: xcomp_bv will be copied from init_fpstate via
fpu_reset_fpregs().

Thanks,
Chang

[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/core.c#n229
[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/core.c#n535
[3]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/core.c#n567
[4]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n869
[5]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/core.c#n744