2018-12-11 07:45:41

by Aubrey Li

[permalink] [raw]
Subject: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

User space tools which do automated task placement need information
about AVX-512 usage of tasks, because AVX-512 usage could cause core
turbo frequency drop and impact the running task on the sibling CPU.

The XSAVE hardware structure has bits that indicate when valid state
is present in registers unique to AVX-512 use. Use these bits to
indicate when AVX-512 has been in use and add per-task AVX-512 state
tracking to context switch.

The tracking turns on the usage flag at the next context switch of
the task, but requires 3 consecutive context switches with no usage
to clear it. This decay is required because well-written AVX-512
applications are expected to clear this state when not actively using
AVX-512 registers.

Although this mechanism is imprecise and can theoretically have both
false-positives and false-negatives, it has been measured to be precise
enough to be useful under real-world workloads like tensorflow and linpack.

If higher precision is required, suggest user space tools to use the
PMU-based mechanisms in combination.

Signed-off-by: Aubrey Li <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Arjan van de Ven <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 22 ++++++++++++++++++++++
arch/x86/include/asm/fpu/types.h | 8 ++++++++
2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a38bf5a1e37a..0da74d63ba14 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -275,6 +275,27 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
: "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
: "memory")

+#define AVX512_STATE_DECAY_COUNT 3
+/*
+ * This function is called during context switch to update AVX512 state
+ */
+static inline void update_avx512_state(struct fpu *fpu)
+{
+ /*
+ * AVX512 state is tracked here because its use is known to slow
+ * the max clock speed of the core.
+ *
+ * However, AVX512-using tasks are expected to clear this state when
+ * not actively using these registers. Thus, this tracking mechanism
+ * can miss. To ensure that false-negatives do not immediately show
+ * up, decay the usage count over time.
+ */
+ if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+ fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
+ else if (fpu->avx512_usage)
+ fpu->avx512_usage--;
+}
+
/*
* This function is called only during boot time when x86 caps are not set
* up and alternative can not be used yet.
@@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
{
if (likely(use_xsave())) {
copy_xregs_to_kernel(&fpu->state.xsave);
+ update_avx512_state(fpu);
return 1;
}

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c53918ecf..313b134d3ca3 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -302,6 +302,14 @@ struct fpu {
*/
unsigned char initialized;

+ /*
+ * @avx512_usage:
+ *
+ * Records the usage of AVX512 registers. A value of non-zero is used
+ * to indicate whether these AVX512 registers recently had valid state.
+ */
+ unsigned char avx512_usage;
+
/*
* @state:
*
--
2.17.1



2018-12-11 07:44:11

by Aubrey Li

[permalink] [raw]
Subject: [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status

AVX-512 components usage could cause core turbo frequency drop. So
it's useful to expose AVX-512 components usage as a heuristic hint
for the user space job scheduler to cluster the AVX-512 using tasks
together.

Example:
$ cat /proc/pid/status | grep AVX512_hint
AVX512_hint: 1

The hint number '0' indicates the task recently didn't use AVX-512
components thus unlikely has frequency drop issue. And the number '1'
indicates the task recently used AVX-512 components thus could cause
core frequency drop. User space tools may want to further check by:

$ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1

Performance counter stats for process id '3558':

3,251,565,961 core_power.lvl2_turbo_license

1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Arjan van de Ven <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 19 +++++++++++++++++++
fs/proc/array.c | 5 +++++
2 files changed, 24 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7642d3..98baa47c97b0 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,7 @@
#include <linux/cpu.h>
#include <linux/mman.h>
#include <linux/pkeys.h>
+#include <linux/seq_file.h>

#include <asm/fpu/api.h>
#include <asm/fpu/internal.h>
@@ -1245,3 +1246,21 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)

return 0;
}
+
+/*
+ * Report CPU specific thread state
+ */
+void arch_task_state(struct seq_file *m, struct task_struct *task)
+{
+ /*
+ * Check the processor and build option if AVX512 is supported.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
+ return;
+ /*
+ * Report AVX-512 components usage:
+ */
+ seq_put_decimal_ull(m, "AVX512_hint:\t",
+ task->thread.fpu.avx512_usage ? 1 : 0);
+ seq_putc(m, '\n');
+}
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 0ceb3b6b37e7..dd88c2219f08 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -392,6 +392,10 @@ static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm)
seq_putc(m, '\n');
}

+void __weak arch_task_state(struct seq_file *m, struct task_struct *task)
+{
+}
+
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -414,6 +418,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+ arch_task_state(m, task);
return 0;
}

--
2.17.1


2018-12-11 17:25:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

On 12/10/18 4:24 PM, Aubrey Li wrote:
> The tracking turns on the usage flag at the next context switch of
> the task, but requires 3 consecutive context switches with no usage
> to clear it. This decay is required because well-written AVX-512
> applications are expected to clear this state when not actively using
> AVX-512 registers.

One concern about this: Given a HZ=1000 system, this means that the
flag needs to get scanned every ~3ms. That's a pretty good amount of
scanning on a system with hundreds or thousands of tasks running around.

How many tasks does this scale to until you're eating up an entire CPU
or two just scanning /proc?

2018-12-11 17:40:52

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

to update AVX512 state
> + */
> +static inline void update_avx512_state(struct fpu *fpu)
> +{
> + /*
> + * AVX512 state is tracked here because its use is known to slow
> + * the max clock speed of the core.
> + *
> + * However, AVX512-using tasks are expected to clear this state when
> + * not actively using these registers. Thus, this tracking mechanism
> + * can miss. To ensure that false-negatives do not immediately show
> + * up, decay the usage count over time.
> + */
> + if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
> + fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
> + else if (fpu->avx512_usage)
> + fpu->avx512_usage--;
> +}
> +
> /*
> * This function is called only during boot time when x86 caps are not set
> * up and alternative can not be used yet.
> @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
> {
> if (likely(use_xsave())) {
> copy_xregs_to_kernel(&fpu->state.xsave);
> + update_avx512_state(fpu);
> return 1;
> }


Is there a reason we shouldn't do:

if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
update_avx512_state(fpu);

?

2018-12-11 17:55:45

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks



On 12/11/18 9:18 AM, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
>> The tracking turns on the usage flag at the next context switch of
>> the task, but requires 3 consecutive context switches with no usage
>> to clear it. This decay is required because well-written AVX-512
>> applications are expected to clear this state when not actively using
>> AVX-512 registers.
>
> One concern about this: Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms. That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
>
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?
>

The reason why I believe that an exponential decay average, or some
other kind of weighted decay of the AVX512 usage is better.

Tim


2018-12-11 17:57:07

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

On Tue, Dec 11, 2018 at 09:18:25AM -0800, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
> > The tracking turns on the usage flag at the next context switch of
> > the task, but requires 3 consecutive context switches with no usage
> > to clear it. This decay is required because well-written AVX-512
> > applications are expected to clear this state when not actively using
> > AVX-512 registers.
>
> One concern about this: Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms. That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
>
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?

Yes that's why we may need to propagate it to cgroups in the kernel,
because user daemons don't really want to track every TID.

But per pid is a start so that people can start experimenting with this.

Then with some experience fancier interfaces for it can be implemented.

-Andi

2018-12-11 17:59:40

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status



On 12/10/18 4:24 PM, Aubrey Li wrote:
> AVX-512 components usage could cause core turbo frequency drop. So
> it's useful to expose AVX-512 components usage as a heuristic hint
> for the user space job scheduler to cluster the AVX-512 using tasks
> together.
>
> Example:
> $ cat /proc/pid/status | grep AVX512_hint
> AVX512_hint: 1
>
> The hint number '0' indicates the task recently didn't use AVX-512
> components thus unlikely has frequency drop issue. And the number '1'
> indicates the task recently used AVX-512 components thus could cause
> core frequency drop. User space tools may want to further check by:
>
> $ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1
>
> Performance counter stats for process id '3558':
>
> 3,251,565,961 core_power.lvl2_turbo_license
>
> 1.004031387 seconds time elapsed
>
> Non-zero counter value confirms that the task causes frequency drop.
>
> Signed-off-by: Aubrey Li <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Arjan van de Ven <[email protected]>
> ---
> arch/x86/kernel/fpu/xstate.c | 19 +++++++++++++++++++
> fs/proc/array.c | 5 +++++
> 2 files changed, 24 insertions(+)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 87a57b7642d3..98baa47c97b0 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -7,6 +7,7 @@
> #include <linux/cpu.h>
> #include <linux/mman.h>
> #include <linux/pkeys.h>
> +#include <linux/seq_file.h>
>
> #include <asm/fpu/api.h>
> #include <asm/fpu/internal.h>
> @@ -1245,3 +1246,21 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
>
> return 0;
> }
> +
> +/*
> + * Report CPU specific thread state
> + */
> +void arch_task_state(struct seq_file *m, struct task_struct *task)
> +{
> + /*
> + * Check the processor and build option if AVX512 is supported.
> + */
> + if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
> + return;
> + /*
> + * Report AVX-512 components usage:
> + */
> + seq_put_decimal_ull(m, "AVX512_hint:\t",
> + task->thread.fpu.avx512_usage ? 1 : 0);

I believe you need some kind of documentation of this new interface
in kernel's Documentation. So an admin doesn't need to look into the
kernel code to figure out what this /proc/pid/status is supposed to do.

Tim



2018-12-11 23:47:10

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

On 2018/12/12 1:18, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
>> The tracking turns on the usage flag at the next context switch of
>> the task, but requires 3 consecutive context switches with no usage
>> to clear it. This decay is required because well-written AVX-512
>> applications are expected to clear this state when not actively using
>> AVX-512 registers.
>
> One concern about this: Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms. That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
>
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?
>

Do we have a real requirement to do this in practical environment?
AFAIK, 1s or even 5s is good enough in some customers environment.

Thanks,
-Aubrey

2018-12-12 00:16:59

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

On 12/11/2018 3:46 PM, Li, Aubrey wrote:
> On 2018/12/12 1:18, Dave Hansen wrote:
>> On 12/10/18 4:24 PM, Aubrey Li wrote:
>>> The tracking turns on the usage flag at the next context switch of
>>> the task, but requires 3 consecutive context switches with no usage
>>> to clear it. This decay is required because well-written AVX-512
>>> applications are expected to clear this state when not actively using
>>> AVX-512 registers.
>>
>> One concern about this: Given a HZ=1000 system, this means that the
>> flag needs to get scanned every ~3ms. That's a pretty good amount of
>> scanning on a system with hundreds or thousands of tasks running around.
>>
>> How many tasks does this scale to until you're eating up an entire CPU
>> or two just scanning /proc?
>>
>
> Do we have a real requirement to do this in practical environment?
> AFAIK, 1s or even 5s is good enough in some customers environment.

maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
time we found the task to use avx? (need to find a good time unit)


2018-12-12 00:36:55

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

On 2018/12/12 1:20, Dave Hansen wrote:
> to update AVX512 state
>> + */
>> +static inline void update_avx512_state(struct fpu *fpu)
>> +{
>> + /*
>> + * AVX512 state is tracked here because its use is known to slow
>> + * the max clock speed of the core.
>> + *
>> + * However, AVX512-using tasks are expected to clear this state when
>> + * not actively using these registers. Thus, this tracking mechanism
>> + * can miss. To ensure that false-negatives do not immediately show
>> + * up, decay the usage count over time.
>> + */
>> + if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
>> + fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
>> + else if (fpu->avx512_usage)
>> + fpu->avx512_usage--;
>> +}
>> +
>> /*
>> * This function is called only during boot time when x86 caps are not set
>> * up and alternative can not be used yet.
>> @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
>> {
>> if (likely(use_xsave())) {
>> copy_xregs_to_kernel(&fpu->state.xsave);
>> + update_avx512_state(fpu);
>> return 1;
>> }
>
>
> Is there a reason we shouldn't do:
>
> if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
> update_avx512_state(fpu);
>
> ?
>

Why _!_ ?



2018-12-12 00:41:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

On 12/11/18 4:34 PM, Li, Aubrey wrote:
>> Is there a reason we shouldn't do:
>>
>> if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
>> update_avx512_state(fpu);
>>
>> ?
>>
> Why _!_ ?

Sorry, got it backwards. I think I was considering having you do a

if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
return;

inside of update_avx512_state(), but I got the state mixed up in my head.

You don't need the '!'.

2018-12-12 01:01:51

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

On 2018/12/12 8:14, Arjan van de Ven wrote:
> On 12/11/2018 3:46 PM, Li, Aubrey wrote:
>> On 2018/12/12 1:18, Dave Hansen wrote:
>>> On 12/10/18 4:24 PM, Aubrey Li wrote:
>>>> The tracking turns on the usage flag at the next context switch of
>>>> the task, but requires 3 consecutive context switches with no usage
>>>> to clear it. This decay is required because well-written AVX-512
>>>> applications are expected to clear this state when not actively using
>>>> AVX-512 registers.
>>>
>>> One concern about this: Given a HZ=1000 system, this means that the
>>> flag needs to get scanned every ~3ms. That's a pretty good amount of
>>> scanning on a system with hundreds or thousands of tasks running around.
>>>
>>> How many tasks does this scale to until you're eating up an entire CPU
>>> or two just scanning /proc?
>>>
>>
>> Do we have a real requirement to do this in practical environment?
>> AFAIK, 1s or even 5s is good enough in some customers environment.
>
> maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
> time we found the task to use avx? (need to find a good time unit)
>
>
Are you suggesting kernel does not do any translation, just provide a fact
to the user space tool and let user space tool to decide how to use this info?

So how does user space tool use this timestamp in your mind?

Thanks,
-Aubrey

2018-12-12 01:08:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

On 12/11/18 4:59 PM, Li, Aubrey wrote:
>> maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
>> time we found the task to use avx? (need to find a good time unit)
>>
>>
> Are you suggesting kernel does not do any translation, just provide a fact
> to the user space tool and let user space tool to decide how to use this info?
>
> So how does user space tool use this timestamp in your mind?

Couple of things...

If we have a timestamp, we don't need a decay. That means less tuning
and also less work at runtime in the common case.

Let's say we report milliseconds. The app itself looks at the number
and could now say the following:

1. task A used AVX512 1ms ago, I might need to segregate it
2. task B used AVX512 10000ms ago, time to stop segregating it
3. task C used AVX512 1ms ago, and was using it 100ms ago (during the
last scan), it's a regular AVX512 user.

That way, you don't have to *catch* tasks in the window between use and
the end of the decay period.

2018-12-12 16:56:18

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

From: Aubrey Li
> Sent: 11 December 2018 00:25
>
> User space tools which do automated task placement need information
> about AVX-512 usage of tasks, because AVX-512 usage could cause core
> turbo frequency drop and impact the running task on the sibling CPU.
>
> The XSAVE hardware structure has bits that indicate when valid state
> is present in registers unique to AVX-512 use. Use these bits to
> indicate when AVX-512 has been in use and add per-task AVX-512 state
> tracking to context switch.

Isn't a thread likely to clear the AVX registers at the end of a function
that uses them.
In particular this removes the massive overhead on certain cpus of
switching between two AVX modes.
So it is actually unlikely that XSAVE will need to save them at all?

As I've also said before the registers are caller saved and since
systems calls are normal function calls the application code
would have to save them across a system call.
This allows the kernel to zero the registers on system call entry
again meaning that XSAVE won't normally have to save them.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2018-12-12 18:02:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

> Isn't a thread likely to clear the AVX registers at the end of a function
> that uses them.
> In particular this removes the massive overhead on certain cpus of
> switching between two AVX modes.
> So it is actually unlikely that XSAVE will need to save them at all?

Only if context switches only happened on function boundaries, which
is obviously not the case.

Yes the detection mechanism is not 100% accurate, but if AVX
is used significantly it should eventually detect it.
Think of it as a statistical sampling heuristic.

>
> As I've also said before the registers are caller saved and since
> systems calls are normal function calls the application code
> would have to save them across a system call.
> This allows the kernel to zero the registers on system call entry
> again meaning that XSAVE won't normally have to save them.

While I agree this would be nice, the Linux system call ABI
wasn't defined like this, so it cannot be done at this point.

-Andi