2020-10-01 20:47:15

by Chang S. Bae

Subject: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

Intel's Extended Feature Disable (XFD) is an extension of the XSAVE
architecture. XFD allows the kernel to keep a feature's state enabled in
XCR0 while still receiving a #NM trap when a task uses instructions that
access that state. In this way, Linux can allocate the large task->fpu
buffer only for the tasks that actually use it.

XFD introduces two MSRs: IA32_XFD, which arms or disarms detection for
each state component, and IA32_XFD_ERR, which assists the #NM trap
handler. Both use the same state-component bitmap format as XCR0.
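
For illustration (this is not part of the patch, and the component number
is only an example), arming detection for one component and decoding the
resulting trap looks roughly like:

  /* Sketch only: assume the dynamic feature is state component 18. */
  u64 err;

  wrmsrl_safe(MSR_IA32_XFD, BIT_ULL(18));  /* arm #NM for component 18 */

  /* ...the task executes an instruction touching that state -> #NM... */

  rdmsrl_safe(MSR_IA32_XFD_ERR, &err);     /* err now has BIT_ULL(18) set */
  wrmsrl_safe(MSR_IA32_XFD_ERR, 0);        /* clear the trap record */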

Use this hardware capability to find the right time to expand the xstate
area. Introduce two sets of helper functions for that:

1. The first set is primarily for interacting with the XFD hardware
feature. Helpers for configuring disablement, e.g. in context switching,
are:
xdisable_setbits()
xdisable_getbits()
xdisable_switch()

2. The second set is for managing the first-use status and handling the
#NM trap:
xfirstuse_enabled()
xfirstuse_not_detected()
xfirstuse_event_handler()

The #NM handler triggers expansion of the xstate area so that the
first-used states can be saved.

No functional change until the kernel enables dynamic user states and XFD.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/fpu/internal.h | 53 ++++++++++++++++++++++++++++-
arch/x86/include/asm/msr-index.h | 2 ++
arch/x86/kernel/fpu/core.c | 37 ++++++++++++++++++++
arch/x86/kernel/fpu/xstate.c | 34 ++++++++++++++++--
arch/x86/kernel/process.c | 5 +++
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kernel/traps.c | 3 ++
9 files changed, 133 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2901d5df4366..7d7fe1d82966 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -274,6 +274,7 @@
#define X86_FEATURE_XSAVEC (10*32+ 1) /* XSAVEC instruction */
#define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 instruction */
#define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD (10*32+ 4) /* eXtended Feature Disabling */

/*
* Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 3b03ead87a46..f5dbbaa060fb 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -572,11 +572,60 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
* Misc helper functions:
*/

+/* The first-use detection helpers: */
+
+static inline void xdisable_setbits(u64 value)
+{
+ wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xdisable_getbits(void)
+{
+ u64 value;
+
+ rdmsrl_safe(MSR_IA32_XFD, &value);
+ return value;
+}
+
+static inline u64 xfirstuse_enabled(void)
+{
+ /* All the dynamic user components are first-use enabled. */
+ return xfeatures_mask_user_dynamic;
+}
+
+/*
+ * Convert fpu->state_mask to the disable bits to be written to MSR
+ * IA32_XFD. Only xdisable_setbits() consumes this value.
+ */
+static inline u64 xfirstuse_not_detected(struct fpu *fpu)
+{
+ u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
+
+ /*
+ * Set the bit for each component whose first use has not been
+ * detected yet. If detection is not enabled for a component, its bit
+ * in firstuse_bv is always zero. Hence the conversion:
+ */
+ return (xfirstuse_enabled() ^ firstuse_bv);
+}
+
+/* Update MSR IA32_XFD based on the next task's fpu->state_mask */
+static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
+{
+ if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+ return;
+
+ if (unlikely(prev->state_mask != next->state_mask))
+ xdisable_setbits(xfirstuse_not_detected(next));
+}
+
+bool xfirstuse_event_handler(struct fpu *fpu);
+
/*
* Load PKRU from the FPU context if available. Delay loading of the
* complete FPU state until the return to userland.
*/
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
{
u32 pkru_val = init_pkru_value;
struct pkru_state *pk;
@@ -586,6 +635,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)

set_thread_flag(TIF_NEED_FPU_LOAD);

+ xdisable_switch(old_fpu, new_fpu);
+
if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
return;

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2859ee4f39a8..0ccbe8cc99ad 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -610,6 +610,8 @@
#define MSR_IA32_BNDCFGS_RSVD 0x00000ffc

#define MSR_IA32_XSS 0x00000da0
+#define MSR_IA32_XFD 0x000001c4
+#define MSR_IA32_XFD_ERR 0x000001c5

#define MSR_IA32_APICBASE 0x0000001b
#define MSR_IA32_APICBASE_BSP (1<<8)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index ece6428ba85b..2e07bfcd54b3 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
*/
return 0;
}
+
+bool xfirstuse_event_handler(struct fpu *fpu)
+{
+ bool handled = false;
+ u64 event_mask;
+
+ /* Check whether the first-use detection is running. */
+ if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+ return handled;
+
+ rdmsrl_safe(MSR_IA32_XFD_ERR, &event_mask);
+
+ /* The trap event should be for one of the first-use enabled features. */
+ WARN_ON(!(event_mask & xfirstuse_enabled()));
+
+ /* If IA32_XFD_ERR is empty, the current trap has nothing to do with first use. */
+ if (!event_mask)
+ return handled;
+
+ /*
+ * The first-use event is presumed to be from userspace, so it should have
+ * nothing to do with interrupt context.
+ */
+ if (WARN_ON(in_interrupt()))
+ return handled;
+
+ if (alloc_xstate_area(fpu, event_mask, NULL))
+ return handled;
+
+ xdisable_setbits(xfirstuse_not_detected(fpu));
+
+ /* Clear the trap record. */
+ wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
+ handled = true;
+
+ return handled;
+}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index b9261ab4e5e2..13e8eff7a23b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -130,6 +130,18 @@ static bool xfeature_is_supervisor(int xfeature_nr)
return ecx & 1;
}

+static bool xfeature_disable_supported(int xfeature_nr)
+{
+ u32 eax, ebx, ecx, edx;
+
+ /*
+ * If state component 'i' supports xfeature disable (i.e. first-use
+ * detection), ECX[2] returns 1; otherwise, 0.
+ */
+ cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
+ return ecx & 4;
+}
+
/*
* Available once those arrays for the offset, size, and alignment info are set up,
* by setup_xstate_features().
@@ -313,6 +325,9 @@ void fpu__init_cpu_xstate(void)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
xfeatures_mask_supervisor_dynamic());
}
+
+ if (boot_cpu_has(X86_FEATURE_XFD))
+ xdisable_setbits(xfirstuse_enabled());
}

static bool xfeature_enabled(enum xfeature xfeature)
@@ -511,8 +526,9 @@ static void __init print_xstate_offset_size(void)
for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
if (!xfeature_enabled(i))
continue;
- pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
- i, xstate_comp_offsets[i], i, xstate_sizes[i]);
+ pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d (%s)\n",
+ i, xstate_comp_offsets[i], i, xstate_sizes[i],
+ (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "on-demand" : "default");
}
}

@@ -942,9 +958,18 @@ void __init fpu__init_system_xstate(void)
}

xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
- /* Do not support the dynamically allocated area yet. */
xfeatures_mask_user_dynamic = 0;

+ for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ u64 feature_mask = BIT_ULL(i);
+
+ if (!(xfeatures_mask_user() & feature_mask))
+ continue;
+
+ if (xfeature_disable_supported(i))
+ xfeatures_mask_user_dynamic |= feature_mask;
+ }
+
/* Enable xstate instructions to be able to continue with initialization: */
fpu__init_cpu_xstate();
err = init_xstate_size();
@@ -999,6 +1024,9 @@ void fpu__resume_cpu(void)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
xfeatures_mask_supervisor_dynamic());
}
+
+ if (boot_cpu_has(X86_FEATURE_XFD))
+ xdisable_setbits(xfirstuse_not_detected(&current->thread.fpu));
}

/*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 43d38bd09fb1..66850b63fe66 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -102,6 +102,11 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
*size = fpu_kernel_xstate_default_size;
}

+void arch_release_task_struct(struct task_struct *tsk)
+{
+ free_xstate_area(&tsk->thread.fpu);
+}
+
/*
* Free thread data structures etc..
*/
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..7bd5d08eeb41 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

this_cpu_write(current_task, next_p);

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(prev_fpu, next_fpu);

/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9afefe325acb..3970367c02c0 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -595,7 +595,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
this_cpu_write(current_task, next_p);
this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(prev_fpu, next_fpu);

/* Reload sp0. */
update_task_stack(next_p);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 81a2fb711091..0eebcae99938 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
{
unsigned long cr0 = read_cr0();

+ if (xfirstuse_event_handler(&current->thread.fpu))
+ return;
+
#ifdef CONFIG_MATH_EMULATION
if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
struct math_emu_info info = { };
--
2.17.1


2020-10-01 23:41:57

by Andy Lutomirski

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On Thu, Oct 1, 2020 at 1:43 PM Chang S. Bae <[email protected]> wrote:
>
..
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index ece6428ba85b..2e07bfcd54b3 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
> */
> return 0;
> }
> +
> +bool xfirstuse_event_handler(struct fpu *fpu)
> +{
> + bool handled = false;
> + u64 event_mask;
> +
> + /* Check whether the first-use detection is running. */
> + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> + return handled;
> +
> + rdmsrl_safe(MSR_IA32_XFD_ERR, &event_mask);

NAK.

MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
some helper called farther down the stack.

But this raises an interesting point -- what happens if allocation
fails? I think that, from kernel code, we simply cannot support this
exception mechanism. If kernel code wants to use AMX (and that would
be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
handle errors, not rely on exceptions. From user code, I assume we
send a signal if allocation fails.
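
To sketch what I mean by an explicit kernel API (names invented here,
apart from alloc_xstate_area() from this series), kernel-side AMX use
would allocate up front and handle failure instead of relying on the trap:

  /* Sketch only; not a real interface. */
  static int x86_amx_begin(struct fpu *fpu, u64 amx_mask)
  {
      int err;

      /* Allocate the expanded buffer explicitly; no #NM involved. */
      err = alloc_xstate_area(fpu, amx_mask, NULL);
      if (err)
          return err;    /* caller falls back to a non-AMX path */

      kernel_fpu_begin();
      return 0;
  }

  static void x86_amx_end(void)
  {
      kernel_fpu_end();
  }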

2020-10-14 07:01:44

by Brown, Len

Subject: RE: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

> From: Andy Lutomirski <[email protected]>

> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
..
> > +bool xfirstuse_event_handler(struct fpu *fpu)
> > +{
> > + bool handled = false;
> > + u64 event_mask;
> > +
> > + /* Check whether the first-use detection is running. */
> > + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > + return handled;
> > +

> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
> some helper called farther down the stack

xfirstuse_event_handler() is called directly from the IDTENTRY exc_device_not_available():

> > @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
> > {
> > unsigned long cr0 = read_cr0();
> >
> > + if (xfirstuse_event_handler(&current->thread.fpu))
> > + return;

Are you suggesting we should instead open code it inside that routine?

> But this raises an interesting point -- what happens if allocation
> fails? I think that, from kernel code, we simply cannot support this
> exception mechanism. If kernel code wants to use AMX (and that would
> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
> handle errors, not rely on exceptions. From user code, I assume we
> send a signal if allocation fails.

The XFD feature allows us to transparently expand the kernel context switch buffer
for a user task when that task first touches the associated hardware.
It allows applications to operate as if the kernel had allocated the backing AMX
context switch buffer at initialization time. However, since we expect only
a subset of tasks to actually use AMX, we instead defer allocation until
we actually see first use for that task, rather than allocating for all tasks.

While we currently don't propose AMX use inside the kernel, it is conceivable
that it could be done in the future, just as AVX is used by the RAID code;
and it would be done the same way -- kernel_fpu_begin()/kernel_fpu_end().
Such future kernel AMX use would _not_ arm XFD, and would _not_ provoke this fault.
(Note that we context switch the XFD-armed state per task.)

vmalloc() does not fail, and does not return an error, and so there is no concept
of returning a signal. If we got to the point where vmalloc() sleeps, then the system
has bigger OOM issues, and the OOM killer would be on the prowl.

If we were concerned about using vmalloc for a couple of pages in the task structure,
then we could implement a routine to harvest unused buffers and free them --
but that didn't seem worth the complexity. Note that this feature is 64-bit only.

Thanks,
-Len


2020-10-14 07:14:05

by Dave Hansen

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On 10/13/20 3:31 PM, Brown, Len wrote:
> vmalloc() does not fail, and does not return an error, and so there is no concept
> of returning a signal.

Well, the order-0 allocations are no-fail, as are the vmalloc kernel
structures and the page tables that might have to be allocated. But,
that's not guaranteed to be in place *forever*. I think we still need
to check for and handle allocation failures, even if they're not known
to be possible today.

> If we got to the point where vmalloc() sleeps, then the system
> has bigger OOM issues, and the OOM killer would be on the prowl.

vmalloc() can *certainly* sleep. Allocation failures mean returning
NULL from the allocator, and the very way we avoid doing that is by
sleeping to go reclaim some memory from some other allocation.

Sleeping is a normal and healthy part of handling allocation requests,
including vmalloc().

> If we were concerned about using vmalloc for a couple of pages in the task structure,
> Then we could implement a routine to harvest unused buffers and free them --
> but that didn't seem worth the complexity. Note that this feature is 64-bit only.

IMNHO, vmalloc() is overkill for ~10k, which is roughly the size of the
XSAVE buffer for the first AMX implementation. But, it's not overkill
for the ~66k of space that will be needed if some CPU implementation
comes along and uses all of the architectural space AMX provides.

2020-10-14 09:25:57

by Andy Lutomirski

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use


> On Oct 13, 2020, at 3:31 PM, Brown, Len <[email protected]> wrote:
>
> 
>>
>> From: Andy Lutomirski <[email protected]>
>
>>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>>> @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
> ..
>>> +bool xfirstuse_event_handler(struct fpu *fpu)
>>> +{
>>> + bool handled = false;
>>> + u64 event_mask;
>>> +
>>> + /* Check whether the first-use detection is running. */
>>> + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
>>> + return handled;
>>> +
>
>> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
>> some helper called farther down the stack
>
> xfirstuse_event_handler() is called directly from the IDTENTRY exc_device_not_available():
>
>>> @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
>>> {
>>> unsigned long cr0 = read_cr0();
>>>
>>> + if (xfirstuse_event_handler(&current->thread.fpu))
>>> + return;
>
> Are you suggesting we should instead open code it inside that routine?

MSR_IA32_XFD_ERR is like CR2 and DR6 — it’s functionally a part of the exception. Let’s handle it as such. (And, a bit like DR6, it’s a bit broken.)
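
Concretely, I'd expect the read to live in the IDTENTRY itself, something
like the sketch below (the handle_xfd_event() helper is made up):

  DEFINE_IDTENTRY(exc_device_not_available)
  {
      unsigned long cr0 = read_cr0();
      u64 xfd_err = 0;

      /* Capture the #NM reason up front, like CR2 for #PF or DR6 for #DB. */
      if (cpu_feature_enabled(X86_FEATURE_XFD))
          rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);

      if (xfd_err && handle_xfd_event(&current->thread.fpu, xfd_err))
          return;

      /* ...the existing #NM handling follows... */
  }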

>
>> But this raises an interesting point -- what happens if allocation
>> fails? I think that, from kernel code, we simply cannot support this
>> exception mechanism. If kernel code wants to use AMX (and that would
>> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
>> handle errors, not rely on exceptions. From user code, I assume we
>> send a signal if allocation fails.
>
> The XFD feature allows us to transparently expand the kernel context switch buffer
> for a user task, when that task first touches this associated hardware.
> It allows applications to operate as if the kernel had allocated the backing AMX
> context switch buffer at initialization time. However, since we expect only
> a sub-set of tasks to actually use AMX, we instead defer allocation until
> we actually see first use for that task, rather than allocating for all tasks.

I sure hope that not all tasks use it. Context-switching it will be absurdly expensive.

>
> While we currently don't propose AMX use inside the kernel, it is conceivable
> that could be done in the future, just like AVX is used by the RAID code;
> and it would be done the same way -- kernel_fpu_begin()/kernel_fpu_end().
> Such future Kernel AMX use would _not_ arm XFD, and would _not_ provoke this fault.
> (note that we context switch the XFD-armed state per task)

How expensive is *that*? Can you give approximate cycle counts for saving, restoring, arming and disarming?

This reminds me of TS. Writing TS was more expensive than saving the whole FPU state. And, for kernel code, we can’t just “not arm” the XFD — we would have to disarm it.



>
> vmalloc() does not fail, and does not return an error, and so there is no concept
> of returning a signal. If we got to the point where vmalloc() sleeps, then the system
> has bigger OOM issues, and the OOM killer would be on the prowl.
>
> If we were concerned about using vmalloc for a couple of pages in the task structure,
> Then we could implement a routine to harvest unused buffers and free them

Kind of like we vmalloc a couple pages for kernel stacks, and we carefully cache them. And we handle failure gracefully.

2020-10-14 12:13:35

by Dave Hansen

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> I have no problem with vmalloc(), but I do have a problem with
> vfree() due to the IPIs that result. We need a cache or something.

This sounds like the kind of thing we should just build into vmalloc()
instead of having a bunch of callers implement their own caches. It
shouldn't be too much of a challenge to have vmalloc() keep a cacheline
or two of stats about common vmalloc() sizes and keep some pools around.
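
Whether it lives inside vmalloc() or in front of it, the shape of such a
pool is roughly the following (names and sizing invented; it also assumes
one common allocation size):

  #define XBUF_POOL_MAX 8

  static void *xbuf_pool[XBUF_POOL_MAX];
  static int xbuf_pool_cnt;
  static DEFINE_SPINLOCK(xbuf_pool_lock);

  static void *xbuf_alloc(size_t size)
  {
      void *buf = NULL;

      spin_lock(&xbuf_pool_lock);
      if (xbuf_pool_cnt)
          buf = xbuf_pool[--xbuf_pool_cnt];
      spin_unlock(&xbuf_pool_lock);

      if (buf) {
          memset(buf, 0, size);    /* reused buffers must not leak old state */
          return buf;
      }
      return vzalloc(size);
  }

  static void xbuf_free(void *buf)
  {
      spin_lock(&xbuf_pool_lock);
      if (xbuf_pool_cnt < XBUF_POOL_MAX) {
          xbuf_pool[xbuf_pool_cnt++] = buf;
          buf = NULL;
      }
      spin_unlock(&xbuf_pool_lock);

      if (buf)
          vfree(buf);    /* pool full: take the vfree()/IPI hit */
  }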

It's not going to be hard to implement caches to reduce vfree()-induced
churn, but I'm having a hard time imagining that it'll have anywhere near
the benefits that it did for stacks. Tasks fundamentally come and go a
*lot*, and those paths are hot.

Tasks that go to the trouble to populate 8k or 64k of register state
fundamentally *can't* come and go frequently. We also (probably) don't
have to worry about AMX tasks doing fork()/exec() pairs and putting
pressure on vmalloc(). Before an app can call out to library functions
to do the fork, it has to save the state off itself and likely
get it back to the init state. The fork() code can tell AMX is in the
init state and decline to allocate actual space for the child.
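
Roughly, since XSAVES clears a component's bit in the xstate header when
that component is in its init state, the fork() check could be as simple
as the sketch below (the bit number is an assumption):

  /* Sketch only: does the child need the dynamic AMX area? */
  static bool amx_is_init(struct xregs_state *xsave)
  {
      /* Bit 18 (tile data) assumed; a clear bit means init state. */
      return !(xsave->header.xfeatures & BIT_ULL(18));
  }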

> I have to say: this mechanism is awful. Can we get away with skipping
> the dynamic XSAVES mess entirely? What if we instead allocate
> however much space we need as an array of pages and have one percpu
> contiguous region. To save, we XSAVE(S or C) just the AMX state to
> the percpu area and then copy it. To restore, we do the inverse. Or
> would this kill the modified optimization and thus be horrible?

Actually, I think the modified optimization would survive such a scheme:

* copy page array into percpu area
* XRSTORS from percpu area, modified optimization tuple is saved
* run userspace
* XSAVES back to percpu area. tuple matches, modified optimization
is still in play
* copy percpu area back to page array

Since both the XRSTORS and the XSAVES are done to the percpu area, the
XSAVE tracking hardware never knows it isn't working on the "canonical"
buffer (the page array).

It seems a bit ugly, but it's not like an 8k memcpy() is *that* painful.
The double dcache footprint isn't super nice, though.
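
In code, that sequence would look roughly like this (the per-cpu staging
buffer and the page-array copy helpers are illustrative, and the staging
area is assumed to be sized for the full xstate):

  static DEFINE_PER_CPU_PAGE_ALIGNED(struct xregs_state, xsave_staging);

  static void restore_from_page_array(struct fpu *fpu, u64 mask)
  {
      struct xregs_state *st = this_cpu_ptr(&xsave_staging);

      copy_page_array_to_staging(st, fpu);  /* hypothetical memcpy helper */
      copy_kernel_to_xregs(st, mask);       /* XRSTORS; tuple now tracks 'st' */
  }

  static void save_to_page_array(struct fpu *fpu)
  {
      struct xregs_state *st = this_cpu_ptr(&xsave_staging);

      copy_xregs_to_kernel(st);             /* XSAVES; tuple still matches */
      copy_staging_to_page_array(fpu, st);  /* hypothetical memcpy helper */
  }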

Chang, I think that leaves us with three possibilities:

1. use plain old vmalloc()
2. use vmalloc(), but implement caches for large XSAVE state, either in
front of, or inside the vmalloc() API
3. Use a static per-cpu buffer for XSAVE*/XRSTOR* and memcpy() to/from
a scattered list of pages.

A #4 also comes to mind:

Do something somewhat like kmap_atomic(). Map the scattered pages
contiguously and use XSAVE*/XRSTOR* directly, but unmap immediately
after XSAVE*/XRSTOR*. For ~12k of state, I suspect the ~400-500 cycles
for 3 INVLPG's might beat out a memcpy() of 12k of state.

Global TLB invalidations would not be required.

I have to say, though, that I'd be OK with sticking with plain old
vmalloc() for now.

2020-10-14 15:09:20

by Andy Lutomirski

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use


> On Oct 13, 2020, at 3:44 PM, Dave Hansen <[email protected]> wrote:
>
> On 10/13/20 3:31 PM, Brown, Len wrote:
>> vmalloc() does not fail, and does not return an error, and so there is no concept
>> of returning a signal.
>
> Well, the order-0 allocations are no-fail, as are the vmalloc kernel
> structures and the page tables that might have to be allocated. But,
> that's not guaranteed to be in place *forever*. I think we still need
> to check for and handle allocation failures, even if they're not known
> to be possible today.
>
>> If we got to the point where vmalloc() sleeps, then the system
>> has bigger OOM issues, and the OOM killer would be on the prowl.
>
> vmalloc() can *certainly* sleep. Allocation failures mean returning
> NULL from the allocator, and the very way we avoid doing that is by
> sleeping to go reclaim some memory from some other allocation.
>
> Sleeping is a normal and healthy part of handling allocation requests,
> including vmalloc().
>
>> If we were concerned about using vmalloc for a couple of pages in the task structure,
>> Then we could implement a routine to harvest unused buffers and free them --
>> but that didn't seem worth the complexity. Note that this feature is 64-bit only.
>
> IMNHO, vmalloc() is overkill for ~10k, which is roughly the size of the
> XSAVE buffer for the first AMX implementation. But, it's not overkill
> for the ~66k of space that will be needed if some CPU implementation
> comes along and uses all of the architectural space AMX provides.

I have no problem with vmalloc(), but I do have a problem with vfree() due to the IPIs that result. We need a cache or something.

I have to say: this mechanism is awful. Can we get away with skipping the dynamic XSAVES mess entirely? What if we instead allocate however much space we need as an array of pages and have one percpu contiguous region. To save, we XSAVE(S or C) just the AMX state to the percpu area and then copy it. To restore, we do the inverse. Or would this kill the modified optimization and thus be horrible?

2020-10-14 15:55:55

by Peter Zijlstra

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On Tue, Oct 13, 2020 at 03:43:59PM -0700, Dave Hansen wrote:
> On 10/13/20 3:31 PM, Brown, Len wrote:
> > vmalloc() does not fail, and does not return an error, and so there is no concept
> > of returning a signal.
>
> Well, the order-0 allocations are no-fail, as are the vmalloc kernel
> structures and the page tables that might have to be allocated. But,
> that's not guaranteed to be in place *forever*. I think we still need
> to check for and handle allocation failures, even if they're not known
> to be possible today.

Quite, on top of that, we could run out of vmalloc space (unlikely, but
still).

You really have to deal with vmalloc() failing.

2020-10-14 17:41:47

by Dave Hansen

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>> Actually, I think the modified optimization would survive such a scheme:
>>
>> * copy page array into percpu area
>> * XRSTORS from percpu area, modified optimization tuple is saved
>> * run userspace
>> * XSAVES back to percpu area. tuple matches, modified optimization
>> is still in play
>> * copy percpu area back to page array
>>
>> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
>> XSAVE tracking hardware never knows it isn't working on the "canonical"
>> buffer (the page array).
> I was suggesting something a little bit different. We'd keep XMM,
> YMM, ZMM, etc state stored exactly the way we do now and, for
> AMX-using tasks, we would save the AMX state in an entirely separate
> buffer. This way the pain of having a variable xstate layout is
> confined just to AMX tasks.

OK, got it.

So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
of the registers out to memory. We can preserve the modified
optimization if we're careful about ordering, but only for *ONE* of the
XSAVE buffers (if we use two).

> I'm okay with vmalloc() too, but I do think we need to deal with the
> various corner cases like allocation failing.

Yeah, agreed about handling the corner cases. Also, if we preserve
plain old vmalloc() for now, we need good tracepoints or stats so we can
precisely figure out how many vmalloc()s (and IPIs) are due to AMX.

2020-10-14 17:42:51

by Andy Lutomirski

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On Tue, Oct 13, 2020 at 11:03 PM Dave Hansen <[email protected]> wrote:
>
> On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> > I have no problem with vmalloc(), but I do have a problem with
> > vfree() due to the IPIs that result. We need a cache or something.
>
> This sounds like the kind of thing we should just build into vmalloc()
> instead of having a bunch of callers implement their own caches. It
> shouldn't be too much of a challenge to have vmalloc() keep a cacheline
> or two of stats about common vmalloc() sizes and keep some pools around.
>
> It's not going to be hard to implement caches to reduce vfree()-induced
> churn, but I'm having a hard time imaging that it'll have anywhere near
> the benefits that it did for stacks. Tasks fundamentally come and go a
> *lot*, and those paths are hot.
>
> Tasks who go to the trouble to populate 8k or 64k of register state
> fundamentally *can't* come and go frequently. We also (probably) don't
> have to worry about AMX tasks doing fork()/exec() pairs and putting
> pressure on vmalloc(). Before an app can call out to library functions
> to do the fork, they've got to save the state off themselves and likely
> get it back to the init state. The fork() code can tell AMX is in the
> init state and decline to allocate actual space for the child.
>
> > I have to say: this mechanism is awful. Can we get away with skipping
> > the dynamic XSAVES mess entirely? What if we instead allocate
> > however much space we need as an array of pages and have one percpu
> > contiguous region. To save, we XSAVE(S or C) just the AMX state to
> > the percpu area and then copy it. To restore, we do the inverse. Or
> > would this kill the modified optimization and thus be horrible?
>
> Actually, I think the modified optimization would survive such a scheme:
>
> * copy page array into percpu area
> * XRSTORS from percpu area, modified optimization tuple is saved
> * run userspace
> * XSAVES back to percpu area. tuple matches, modified optimization
> is still in play
> * copy percpu area back to page array
>
> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
> XSAVE tracking hardware never knows it isn't working on the "canonical"
> buffer (the page array).

I was suggesting something a little bit different. We'd keep XMM,
YMM, ZMM, etc state stored exactly the way we do now and, for
AMX-using tasks, we would save the AMX state in an entirely separate
buffer. This way the pain of having a variable xstate layout is
confined just to AMX tasks.

I'm okay with vmalloc() too, but I do think we need to deal with the
various corner cases like allocation failing.

2020-11-03 21:36:56

by Chang S. Bae

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
> On 10/14/20 9:10 AM, Andy Lutomirski wrote:
> >
> > I was suggesting something a little bit different. We'd keep XMM,
> > YMM, ZMM, etc state stored exactly the way we do now and, for
> > AMX-using tasks, we would save the AMX state in an entirely separate
> > buffer. This way the pain of having a variable xstate layout is
> > confined just to AMX tasks.
>
> OK, got it.
>
> So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
> of the registers out to memory. We can preserve the modified
> optimization if we're careful about ordering, but only for *ONE* of the
> XSAVE buffers (if we use two).

For what it's worth,

If using two buffers, the buffer for saving the tile data also needs space
for the legacy states.

The AMX state is stored at the end of the XSAVE buffer (at least for now).
So, the layout (in terms of offsets of non-AMX states) won't be changed at
run-time.

Thanks,
Chang

2020-11-03 21:45:55

by Dave Hansen

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On 11/3/20 1:32 PM, Bae, Chang Seok wrote:
> On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
>> On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>>> I was suggesting something a little bit different. We'd keep XMM,
>>> YMM, ZMM, etc state stored exactly the way we do now and, for
>>> AMX-using tasks, we would save the AMX state in an entirely separate
>>> buffer. This way the pain of having a variable xstate layout is
>>> confined just to AMX tasks.
>> OK, got it.
>>
>> So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
>> of the registers out to memory. We can preserve the modified
>> optimization if we're careful about ordering, but only for *ONE* of the
>> XSAVE buffers (if we use two).
> For what is worth,
>
> If using two buffers, the buffer for saving the tile data also needs space
> for the legacy states.

Just to be clear, you're talking about the 512-byte 'struct
fxregs_state' which is the first member of 'struct xregs_state', right?

Basically, any two-buffer model wastes 576 bytes of space (legacy fxsave
plus xsave header) for each task that uses two buffers. This provides
an overall space savings unless there are something like 16x more
TMUL-using tasks than non-TMUL-using tasks.
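
Rough arithmetic behind that ratio, assuming ~8K of tile state per task:

  cost per two-buffer (AMX) task:   512 (fxsave) + 64 (xsave header) = 576 bytes
  saving per task that skips AMX:   ~8192 bytes of tile state
  break-even:                       8192 / 576 =~ 14 AMX tasks per non-AMX task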

We could eliminate the 576 bytes of waste, but at the *performance* cost
of having a permanently (non-task_struct-embedded) out-of-line XSAVE
buffer for all tasks.

> The AMX state is stored at the end of the XSAVE buffer (at least for now).
> So, the layout (in terms of offsets of non-AMX states) won't be changed at
> run-time.

I don't know what point you are trying to make here.

2020-11-03 21:57:57

by Chang S. Bae

Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

On Tue, 2020-11-03 at 13:41 -0800, Dave Hansen wrote:
> On 11/3/20 1:32 PM, Bae, Chang Seok wrote:
> > On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
> > > On 10/14/20 9:10 AM, Andy Lutomirski wrote:
> > > > I was suggesting something a little bit different. We'd keep XMM,
> > > > YMM, ZMM, etc state stored exactly the way we do now and, for
> > > > AMX-using tasks, we would save the AMX state in an entirely
> > > > separate
> > > > buffer. This way the pain of having a variable xstate layout is
> > > > confined just to AMX tasks.
> > >
> > > OK, got it.
> > >
> > > So, we'd either need a second set of XSAVE/XRSTORs, or "manual"
> > > copying
> > > of the registers out to memory. We can preserve the modified
> > > optimization if we're careful about ordering, but only for *ONE* of
> > > the
> > > XSAVE buffers (if we use two).
> >
> > For what is worth,
> >
> > If using two buffers, the buffer for saving the tile data also needs
> > space
> > for the legacy states.
>
> Just to be clear, you're talking about the 512-byte 'struct
> fxregs_state' which is the first member of 'struct xregs_state', right?

Yes.

>
> > The AMX state is stored at the end of the XSAVE buffer (at least for
> > now).
> > So, the layout (in terms of offsets of non-AMX states) won't be changed
> > at
> > run-time.
>
> I don't know what point you are trying to make here.

I was trying to clarify the concern that Andy had:

"the pain of having a variable xstate layout"

Thanks,
Chang