2018-11-07 01:41:20

by Aubrey Li

Subject: [RFC PATCH v1 1/2] x86/fpu: detect AVX task

XSAVES and its variants use the init optimization to reduce the amount of
data they save to memory during a context switch. The init optimization
uses the state component bitmap to denote whether a component is in its
init configuration. We use this information to detect whether a task is
using AVX instructions.
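
As an illustrative aside (not part of the patch): the XINUSE bitmap this
mechanism relies on can also be read from user space with the same XGETBV
byte encoding. A minimal sketch, assuming a CPU that advertises OSXSAVE and
the XGETBV-with-ECX=1 sub-feature; the file name and output format here are
made up:

/* xinuse.c - print which XSAVE components this task currently has in use.
 * Build: gcc -O2 xinuse.c -o xinuse
 */
#include <stdio.h>
#include <stdint.h>
#include <cpuid.h>

static uint64_t xgetbv(uint32_t index)
{
	uint32_t eax, edx;

	/* Same byte encoding as the kernel's xgetbv() helper */
	asm volatile(".byte 0x0f,0x01,0xd0"
		     : "=a" (eax), "=d" (edx)
		     : "c" (index));
	return eax | ((uint64_t)edx << 32);
}

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	uint64_t xinuse;

	/* CPUID.1:ECX[27] = OSXSAVE, CPUID.(EAX=0xd,ECX=1):EAX[2] = XGETBV1 */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
		return 1;
	if (!__get_cpuid_count(0xd, 1, &eax, &ebx, &ecx, &edx) || !(eax & (1u << 2)))
		return 1;

	/* XGETBV with ECX = 1 returns XCR0 & XINUSE: enabled components in use */
	xinuse = xgetbv(1);
	printf("XINUSE bitmap: %#llx (YMM/AVX in use: %s)\n",
	       (unsigned long long)xinuse,
	       (xinuse & (1ull << 2)) ? "yes" : "no");
	return 0;
}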

Signed-off-by: Aubrey Li <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Arjan van de Ven <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 97 +++++++++++++++++++++++++++----------
arch/x86/include/asm/fpu/types.h | 17 +++++++
2 files changed, 88 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a38bf5a..5054a7d 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -74,6 +74,12 @@ static __always_inline __pure bool use_fxsr(void)
return static_cpu_has(X86_FEATURE_FXSR);
}

+static __always_inline __pure bool use_xgetbv1(void)
+{
+ return static_cpu_has(X86_FEATURE_OSXSAVE) &&
+ static_cpu_has(X86_FEATURE_XGETBV1);
+}
+
/*
* fpstate handling functions:
*/
@@ -103,6 +109,34 @@ static inline void fpstate_init_fxstate(struct fxregs_state *fx)
}
extern void fpstate_sanitize_xstate(struct fpu *fpu);

+/*
+ * MXCSR and XCR definitions:
+ */
+
+extern unsigned int mxcsr_feature_mask;
+
+#define XCR_XFEATURE_ENABLED_MASK 0x00000000
+#define XINUSE_STATE_BITMAP_INDEX 0x00000001
+
+static inline u64 xgetbv(u32 index)
+{
+ u32 eax, edx;
+
+ asm volatile(".byte 0x0f,0x01,0xd0" /* xgetbv */
+ : "=a" (eax), "=d" (edx)
+ : "c" (index));
+ return eax + ((u64)edx << 32);
+}
+
+static inline void xsetbv(u32 index, u64 value)
+{
+ u32 eax = value;
+ u32 edx = value >> 32;
+
+ asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
+ : : "a" (eax), "d" (edx), "c" (index));
+}
+
#define user_insn(insn, output, input...) \
({ \
int err; \
@@ -275,6 +309,42 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
: "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
: "memory")

+#define AVX_STATE_DECAY_COUNT 3
+
+/*
+ * This function is called during a context switch to update the AVX component state.
+ */
+static inline void update_avx_state(struct avx_state *avx)
+{
+ /*
+ * Check if XGETBV with ECX = 1 is supported. XGETBV with ECX = 1
+ * returns the logical-AND of XCR0 and XINUSE. XINUSE is a bitmap
+ * by which the processor tracks the status of various components.
+ */
+ if (!use_xgetbv1()) {
+ avx->state = 0;
+ return;
+ }
+ /*
+ * XINUSE is dynamic because VZEROUPPER is executed at the end of
+ * every function that uses AVX and resets the bitmap to its
+ * initial configuration.
+ *
+ * State decay is introduced to handle the race between a context
+ * switch and such a function exit. The state is set aggressively
+ * once detected, but is only cleared after it has decayed over 3
+ * context switches.
+ */
+ if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) {
+ avx->state = 1;
+ avx->decay_count = AVX_STATE_DECAY_COUNT;
+ } else {
+ if (!avx->decay_count)
+ avx->decay_count--;
+ else
+ avx->state = 0;
+ }
+}
/*
* This function is called only during boot time when x86 caps are not set
* up and alternative can not be used yet.
@@ -411,6 +481,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
{
if (likely(use_xsave())) {
copy_xregs_to_kernel(&fpu->state.xsave);
+ update_avx_state(&fpu->avx);
return 1;
}

@@ -577,31 +648,5 @@ static inline void user_fpu_begin(void)
preempt_enable();
}

-/*
- * MXCSR and XCR definitions:
- */
-
-extern unsigned int mxcsr_feature_mask;
-
-#define XCR_XFEATURE_ENABLED_MASK 0x00000000
-
-static inline u64 xgetbv(u32 index)
-{
- u32 eax, edx;
-
- asm volatile(".byte 0x0f,0x01,0xd0" /* xgetbv */
- : "=a" (eax), "=d" (edx)
- : "c" (index));
- return eax + ((u64)edx << 32);
-}
-
-static inline void xsetbv(u32 index, u64 value)
-{
- u32 eax = value;
- u32 edx = value >> 32;
-
- asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
- : : "a" (eax), "d" (edx), "c" (index));
-}

#endif /* _ASM_X86_FPU_INTERNAL_H */
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c539..39d5bc2 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -274,6 +274,15 @@ union fpregs_state {
};

/*
+ * Per-task AVX state data structure that indicates whether
+ * the task uses AVX instructions.
+ */
+struct avx_state {
+ unsigned int state;
+ unsigned int decay_count;
+};
+
+/*
* Highest level per task FPU state data structure that
* contains the FPU register state plus various FPU
* state fields:
@@ -303,6 +312,14 @@ struct fpu {
unsigned char initialized;

/*
+ * @avx:
+ *
+ * This data structure indicates whether this context
+ * contains AVX state.
+ */
+ struct avx_state avx;
+
+ /*
* @state:
*
* In-memory copy of all FPU registers that we save/restore
--
2.7.4



2018-11-07 01:41:05

by Aubrey Li

Subject: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state

Expose the per-task CPU-specific thread state value; it is helpful
for userland to classify and schedule tasks by different policies.
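
For illustration only (not part of this patch set): a user-space consumer
such as a job scheduler could read the proposed file along these lines.
The path and the single 0/1 value follow the interface as posted; the
helper name and the rest are a sketch:

/* read_thread_state.c - query the proposed /proc/<pid>/thread_state file */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Return 1 if the task appears to use AVX, 0 if not, -1 on error. */
static int task_uses_avx(int pid)
{
	char path[64];
	FILE *f;
	int val;

	snprintf(path, sizeof(path), "/proc/%d/thread_state", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(int argc, char **argv)
{
	int pid = argc > 1 ? atoi(argv[1]) : getpid();

	printf("pid %d thread_state: %d\n", pid, task_uses_avx(pid));
	return 0;
}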

Signed-off-by: Aubrey Li <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Arjan van de Ven <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 13 +++++++++++++
fs/proc/base.c | 13 +++++++++++++
2 files changed, 26 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7..5a4a1d5 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,7 @@
#include <linux/cpu.h>
#include <linux/mman.h>
#include <linux/pkeys.h>
+#include <linux/seq_file.h>

#include <asm/fpu/api.h>
#include <asm/fpu/internal.h>
@@ -1245,3 +1246,15 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)

return 0;
}
+
+/*
+ * report AVX state
+ */
+void arch_thread_state(struct seq_file *m, struct task_struct *task)
+{
+ if (task->thread.fpu.avx.state)
+ seq_putc(m, '1');
+ else
+ seq_putc(m, '0');
+ seq_putc(m, '\n');
+}
diff --git a/fs/proc/base.c b/fs/proc/base.c
index aaffc0c..be24327 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2893,6 +2893,17 @@ static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
}
#endif /* CONFIG_LIVEPATCH */

+void __weak arch_thread_state(struct seq_file *m, struct task_struct *task)
+{
+}
+
+static int proc_pid_thread_state(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ arch_thread_state(m, task);
+ return 0;
+}
+
/*
* Thread groups
*/
@@ -2994,6 +3005,7 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_LIVEPATCH
ONE("patch_state", S_IRUSR, proc_pid_patch_state),
#endif
+ ONE("thread_state", S_IRUSR, proc_pid_thread_state),
};

static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3372,6 +3384,7 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_LIVEPATCH
ONE("patch_state", S_IRUSR, proc_pid_patch_state),
#endif
+ ONE("thread_state", S_IRUSR, proc_pid_thread_state),
};

static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
--
2.7.4


2018-11-07 17:42:57

by Tim Chen

Subject: Re: [RFC PATCH v1 1/2] x86/fpu: detect AVX task

On 11/06/2018 10:23 AM, Aubrey Li wrote:

> +static inline void update_avx_state(struct avx_state *avx)
> +{
> + /*
> + * Check if XGETBV with ECX = 1 is supported. XGETBV with ECX = 1
> + * returns the logical-AND of XCR0 and XINUSE. XINUSE is a bitmap
> + * by which the processor tracks the status of various components.
> + */
> + if (!use_xgetbv1()) {
> + avx->state = 0;
> + return;
> + }
> + /*
> + * XINUSE is dynamic because VZEROUPPER is executed at the end of
> + * every function that uses AVX and resets the bitmap to its
> + * initial configuration.
> + *
> + * State decay is introduced to handle the race between a context
> + * switch and such a function exit. The state is set aggressively
> + * once detected, but is only cleared after it has decayed over 3
> + * context switches.
> + */
> + if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) {
> + avx->state = 1;
> + avx->decay_count = AVX_STATE_DECAY_COUNT;
> + } else {
> + if (!avx->decay_count)

Seems like the check should be

if (avx->decay_count)

as we decrement the decay_count if it is non-zero.

> + avx->decay_count--;
> + else
> + avx->state = 0;
> + }
> +}

Tim

2018-11-08 00:32:38

by Li, Aubrey

Subject: Re: [RFC PATCH v1 1/2] x86/fpu: detect AVX task

On 2018/11/8 1:41, Tim Chen wrote:
> On 11/06/2018 10:23 AM, Aubrey Li wrote:
>
>> +static inline void update_avx_state(struct avx_state *avx)
>> +{
>> + /*
>> + * Check if XGETBV with ECX = 1 is supported. XGETBV with ECX = 1
>> + * returns the logical-AND of XCR0 and XINUSE. XINUSE is a bitmap
>> + * by which the processor tracks the status of various components.
>> + */
>> + if (!use_xgetbv1()) {
>> + avx->state = 0;
>> + return;
>> + }
>> + /*
>> + * XINUSE is dynamic because VZEROUPPER is executed at the end of
>> + * every function that uses AVX and resets the bitmap to its
>> + * initial configuration.
>> + *
>> + * State decay is introduced to handle the race between a context
>> + * switch and such a function exit. The state is set aggressively
>> + * once detected, but is only cleared after it has decayed over 3
>> + * context switches.
>> + */
>> + if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) {
>> + avx->state = 1;
>> + avx->decay_count = AVX_STATE_DECAY_COUNT;
>> + } else {
>> + if (!avx->decay_count)
>
> Seems like the check should be
>
> if (avx->decay_count)
>
> as we decrement the decay_count if it is non-zero.

Right, thanks for pointing this out, will fix in v2 soon.

Thanks,
-Aubrey
>
>> + avx->decay_count--;
>> + else
>> + avx->state = 0;
>> + }
>> +}
>
> Tim
>
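
With Tim's correction folded in, the decay logic would presumably look as
follows in v2. This is a stand-alone sketch based only on the exchange
above, not an actual posted patch; the kernel types and the XGETBV read are
stubbed out so the state machine can be exercised in isolation:

/* decay_sketch.c - exercise the corrected AVX state decay logic */
#include <stdio.h>

#define AVX_STATE_DECAY_COUNT 3

struct avx_state {
	unsigned int state;
	unsigned int decay_count;
};

/* in_use stands in for xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM */
static void update_avx_state(struct avx_state *avx, int in_use)
{
	if (in_use) {
		avx->state = 1;
		avx->decay_count = AVX_STATE_DECAY_COUNT;
	} else {
		if (avx->decay_count)
			avx->decay_count--;	/* still decaying, keep state set */
		else
			avx->state = 0;		/* fully decayed, clear state */
	}
}

int main(void)
{
	struct avx_state avx = { 0, 0 };
	int i, in_use[] = { 1, 0, 0, 0, 0, 0 };

	for (i = 0; i < 6; i++) {
		update_avx_state(&avx, in_use[i]);
		printf("switch %d: state=%u decay_count=%u\n",
		       i, avx.state, avx.decay_count);
	}
	return 0;
}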


2018-11-08 06:33:28

by Ingo Molnar

Subject: Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state


* Aubrey Li <[email protected]> wrote:

> Expose the per-task CPU-specific thread state value; it is helpful
> for userland to classify and schedule tasks by different policies.

That's pretty vague - what exactly would use this information? I'm sure
you have a usecase in mind - could you please describe it?

Thanks,

Ingo

2018-11-08 10:19:38

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state

On Thu, Nov 08, 2018 at 07:32:46AM +0100, Ingo Molnar wrote:
>
> * Aubrey Li <[email protected]> wrote:
>
> > Expose the per-task CPU-specific thread state value; it is helpful
> > for userland to classify and schedule tasks by different policies.
>
> That's pretty vague - what exactly would use this information? I'm sure
> you have a usecase in mind - could you please describe it?

Yeah, "thread_state" is a pretty terrible name for this. The use-case is
detecting which tasks use AVX3 such that a userspace component (think
job scheduler) can cluster them together.

The 'problem' is that running AVX2+ code drops the max clock, because
you light up the massive wide (and thus large area) paths.

So maybe something like "simd_active" is a better name, dunno.

Or maybe "simd_class" and we can write out 0,1,2,3 depending on the AVX
class being used, dunno. It might make sense to look at what other arch
SIMD stuff looks like to form this interface.

2018-11-08 14:04:04

by Li, Aubrey

Subject: Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state

On 2018/11/8 18:17, Peter Zijlstra wrote:
> On Thu, Nov 08, 2018 at 07:32:46AM +0100, Ingo Molnar wrote:
>>
>> * Aubrey Li <[email protected]> wrote:
>>
>>> Expose the per-task CPU-specific thread state value; it is helpful
>>> for userland to classify and schedule tasks by different policies.
>>
>> That's pretty vague - what exactly would use this information? I'm sure
>> you have a usecase in mind - could you please describe it?
>
> Yeah, "thread_state" is a pretty terrible name for this.

task_struct has a CPU-specific element "thread"; I borrowed that name here
to create a generic interface that exposes the CPU-specific state of a task.
Like /proc/<pid>/stat, I plan to use this "thread_state" to host any
CPU-specific state: for now only the AVX state, with other states possibly
coming soon. So this interface may be extended in the future like the
following:

#cat /proc/<pid>/thread_state
1 0 0
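
As a purely hypothetical sketch of how such a multi-value format could be
emitted: only the first (AVX) field exists in this RFC; the other fields
and their meanings are placeholders, not existing kernel state:

/* Hypothetical extension of arch_thread_state() for a multi-field format */
void arch_thread_state(struct seq_file *m, struct task_struct *task)
{
	seq_printf(m, "%d %d %d\n",
		   task->thread.fpu.avx.state ? 1 : 0,
		   0,	/* placeholder: some other arch-specific state */
		   0);	/* placeholder: yet another state bit */
}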


>
> The use-case is
> detecting which tasks use AVX3 such that a userspace component (think
> job scheduler) can cluster them together.
>
> The 'problem' is that running AVX2+ code drops the max clock, because
> you light up the massive wide (and thus large area) paths.

Thanks for explaining this.

>
> So maybe something like "simd_active" is a better name, dunno.

As above, maybe the specific name can be hidden behind the generic thread_state file?

>
> Or maybe "simd_class" and we can write out 0,1,2,3 depending on the AVX
> class being used, dunno. It might make sense to look at what other arch
> SIMD stuff looks like to form this interface.
>
A task may use classes 1, 2 and 3 simultaneously. As a scheduler hint it is
just "cluster or not", so 0/1 may be good enough.

Thanks,
-Aubrey

2018-11-12 05:32:49

by Ingo Molnar

Subject: Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state


* Peter Zijlstra <[email protected]> wrote:

> On Thu, Nov 08, 2018 at 07:32:46AM +0100, Ingo Molnar wrote:
> >
> > * Aubrey Li <[email protected]> wrote:
> >
> > > Expose the per-task CPU-specific thread state value; it is helpful
> > > for userland to classify and schedule tasks by different policies.
> >
> > That's pretty vague - what exactly would use this information? I'm sure
> > you have a usecase in mind - could you please describe it?
>
> Yeah, "thread_state" is a pretty terrible name for this. The use-case is
> detecting which tasks use AVX3 such that a userspace component (think
> job scheduler) can cluster them together.

I'd prefer the kernel to do such clustering...

Thanks,

Ingo

2018-11-12 07:55:46

by Li, Aubrey

Subject: Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state

On 2018/11/12 13:31, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Thu, Nov 08, 2018 at 07:32:46AM +0100, Ingo Molnar wrote:
>>>
>>> * Aubrey Li <[email protected]> wrote:
>>>
>>>> Expose the per-task CPU-specific thread state value; it is helpful
>>>> for userland to classify and schedule tasks by different policies.
>>>
>>> That's pretty vague - what exactly would use this information? I'm sure
>>> you have a usecase in mind - could you please describe it?
>>
>> Yeah, "thread_state" is a pretty terrible name for this. The use-case is
>> detecting which tasks use AVX3 such that a userspace component (think
>> job scheduler) can cluster them together.
>
> I'd prefer the kernel to do such clustering...
>
Some userland application projects like Kubernetes have requested an
interface for this kind of information. We could do the clustering either
in the kernel or from userland; does it make sense to expose the
information via the proposed interface first?

Thanks,
-Aubrey

2018-11-12 08:57:34

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state

On Mon, Nov 12, 2018 at 06:31:47AM +0100, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Thu, Nov 08, 2018 at 07:32:46AM +0100, Ingo Molnar wrote:
> > >
> > > * Aubrey Li <[email protected]> wrote:
> > >
> > > > Expose the per-task CPU-specific thread state value; it is helpful
> > > > for userland to classify and schedule tasks by different policies.
> > >
> > > That's pretty vague - what exactly would use this information? I'm sure
> > > you have a usecase in mind - could you please describe it?
> >
> > Yeah, "thread_state" is a pretty terrible name for this. The use-case is
> > detecting which tasks use AVX3 such that a userspace component (think
> > job scheduler) can cluster them together.
>
> I'd prefer the kernel to do such clustering...

I think that is a next step.

Also, while the kernel can do this on a best-effort basis, it cannot
take into account things the kernel doesn't know about, like high
priority job peak load etc., things a job scheduler would know.

Then again, a job scheduler would likely already know about the AVX
state anyway.

2018-11-12 14:22:31

by Arjan van de Ven

Subject: Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state


>> I'd prefer the kernel to do such clustering...
>
> I think that is a next step.
>
> Also, while the kernel can do this on a best-effort basis, it cannot
> take into account things the kernel doesn't know about, like high
> priority job peak load etc., things a job scheduler would know.
>
> Then again, a job scheduler would likely already know about the AVX
> state anyway.

The job scheduler can guess.
Unless it can also *measure*, it won't know for sure...

So even in that scenario, having a decent way to report actuals is useful.


>