2020-11-20 07:58:16

by Peter Zijlstra

[permalink] [raw]
Subject: [RFC] Documentation/scheduler/schedutil.txt

Hi,

I was recently asked to explain how schedutil works, the below write-up
is the result of that and I figured we might as well stick it in the
tree.

Not as a patch for easy reading and commenting.

---

NOTE; all this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
-------------------------------

With PELT we track some metrics across the various entities, from individual
tasks to task-group slices to CPU runqueues. As the basis for this we use an
EWMA, each period (1024us) is decayed such that y^32 = 0.5. That is, the
most recent 32ms contribute half, while the rest of history contribute the
other half.

Specifically:

ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.

For more detail see: kernel/sched/pelt.c


Frequency- / Heterogeneous Invariance
-------------------------------------

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
DVFS ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:

f_cur
r_dvfs := -----
f_max

For more dynamic systems where the hardware is in control of DVFS (Intel,
ARMv8.4-AMU) we use hardware counters to provide us this ratio. In specific,
for Intel, we use:

APERF
f_cur := ----- * P0
MPERF

4C-turbo; if available and turbo enabled
f_max := { 1C-turbo; if turbo enabled
P0; otherwise

f_cur
r_dvfs := min( 1, ----- )
f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.

r_het is determined as the average performance difference between a big and
LITTLE core when running at max frequency over 'relevant' benchmarks.

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and Heterogenous state. IOW. we can transfer and compare them between
CPUs.

For more detail see:

- kernel/sched/pelt.h:update_rq_clock_pelt()
- arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."


UTIL_EST / UTIL_EST_FASTUP
--------------------------

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they become runnable again.

To alleviate this (a default enabled option) UTIL_EST drives an (IIR) EWMA
with the 'running' value on dequeue -- when it is highest. A further default
enabled option UTIL_EST_FASTUP modifies the IIR filter to instantly increase
and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

util_est := \Sum_t max( t_running, t_util_est_ewma )

For more detail see: kernel/sched/fair.h:util_est_dequeue()


UCLAMP
------

It is possible to set effective u_min and u_max clamps on each task; the
runqueue keeps an max aggregate of these clamps for all running tasks.

For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
----------------

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above it is
the frequency invariant utilization estimate of the CPU. From this we compute
a desired frequency like:

max( running, util_est ); if UTIL_EST
u_cfs := { running; otherwise

u_clamp := clamp( u_cfs, u_min, u_max )

u := u_cfs + u_rt + u_irq + u_dl; [approx. see source for more detail]

f_des := min( f_max, 1.25 u * f_max )

XXX IO-wait; when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests for when hardware interaction is slow and
expensive, this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
-----

- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
will closely reflect utilization.

- In saturated scenarios task movement will cause some transient dips,
suppose we have a CPU saturated with 4 tasks, then when we migrate a task
to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
new CPU will gain 0.25. This is inevitable and time progression will
correct this. XXX do we still guarantee f_max due to no idle-time?

- Much of the above is about avoiding DVFS dips, and independent DVFS domains
having to re-learn / ramp-up when load shifts.


2020-11-20 08:59:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> - In saturated scenarios task movement will cause some transient dips,
> suppose we have a CPU saturated with 4 tasks, then when we migrate a task
> to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
> new CPU will gain 0.25. This is inevitable and time progression will
> correct this. XXX do we still guarantee f_max due to no idle-time?

Do we want something like this? Is the 1.5 threshold sane? (it's been too
long since I looked at actual numbers here)

---

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..f0bed8902c40 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,4 @@ SCHED_FEAT(WA_BIAS, true)
*/
SCHED_FEAT(UTIL_EST, true)
SCHED_FEAT(UTIL_EST_FASTUP, true)
+SCHED_FEAT(UTIL_SAT, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 590e6f27068c..bf70e5ed8ba6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2593,10 +2593,17 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
return READ_ONCE(rq->avg_dl.util_avg);
}

+#define RUNNABLE_SAT (SCHED_CAPACITY_SCALE + SCHED_CAPACITY_SCALE/2)
+
static inline unsigned long cpu_util_cfs(struct rq *rq)
{
unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

+ if (sched_feat(UTIL_SAT)) {
+ if (READ_ONCE(rq->cfs.avg.runnable_avg) > RUNNABLE_SAT)
+ return SCHED_CAPACITY_SCALE;
+ }
+
if (sched_feat(UTIL_EST)) {
util = max_t(unsigned long, util,
READ_ONCE(rq->cfs.avg.util_est.enqueued));

2020-11-20 09:18:16

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Friday 20 Nov 2020 at 09:56:53 (+0100), Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> > - In saturated scenarios task movement will cause some transient dips,
> > suppose we have a CPU saturated with 4 tasks, then when we migrate a task
> > to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
> > new CPU will gain 0.25. This is inevitable and time progression will
> > correct this. XXX do we still guarantee f_max due to no idle-time?

The sugov_cpu_is_busy() logic should mitigate that, but looking at it
again I just realized we don't apply it to the 'shared' update path. I
can't recall why. Anybody?

> Do we want something like this? Is the 1.5 threshold sane? (it's been too
> long since I looked at actual numbers here)
>
> ---
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 68d369cba9e4..f0bed8902c40 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -90,3 +90,4 @@ SCHED_FEAT(WA_BIAS, true)
> */
> SCHED_FEAT(UTIL_EST, true)
> SCHED_FEAT(UTIL_EST_FASTUP, true)
> +SCHED_FEAT(UTIL_SAT, true)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 590e6f27068c..bf70e5ed8ba6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2593,10 +2593,17 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> return READ_ONCE(rq->avg_dl.util_avg);
> }
>
> +#define RUNNABLE_SAT (SCHED_CAPACITY_SCALE + SCHED_CAPACITY_SCALE/2)
> +
> static inline unsigned long cpu_util_cfs(struct rq *rq)
> {
> unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);
>
> + if (sched_feat(UTIL_SAT)) {
> + if (READ_ONCE(rq->cfs.avg.runnable_avg) > RUNNABLE_SAT)
> + return SCHED_CAPACITY_SCALE;
> + }
> +
> if (sched_feat(UTIL_EST)) {
> util = max_t(unsigned long, util,
> READ_ONCE(rq->cfs.avg.util_est.enqueued));

Need to do the math again, but it's an interesting idea and would solve
a few things (e.g. reset the overutilized flag because of the 'gap' left
by a migration and such) ...

Thanks,
Quentin

2020-11-20 09:21:10

by Viresh Kumar

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On 20-11-20, 09:13, Quentin Perret wrote:
> On Friday 20 Nov 2020 at 09:56:53 (+0100), Peter Zijlstra wrote:
> > On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> > > - In saturated scenarios task movement will cause some transient dips,
> > > suppose we have a CPU saturated with 4 tasks, then when we migrate a task
> > > to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
> > > new CPU will gain 0.25. This is inevitable and time progression will
> > > correct this. XXX do we still guarantee f_max due to no idle-time?
>
> The sugov_cpu_is_busy() logic should mitigate that, but looking at it
> again I just realized we don't apply it to the 'shared' update path. I
> can't recall why. Anybody?

t b7eaf1aab9f8bd2e49fceed77ebc66c1b5800718
Author: Rafael J. Wysocki <[email protected]>
Date: Wed Mar 22 00:08:50 2017 +0100

cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

[skip]

This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway. On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.

On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.

--
viresh

2020-11-20 09:31:00

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Friday 20 Nov 2020 at 14:49:04 (+0530), Viresh Kumar wrote:
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.

Hmm, I'm not sure I agree with this actually. We may be migrating the
task to a different policy altogether. And even if we migrate to another
CPU in the current policy, the task util_avg may be small just because
it was packed with other tasks on a rq, which means it may not increase
the util of the destination rq by much.

ISTR Douglas' EM-based schedutil boosting series was addressing that at
some point, I'll go have a look back at that discussion...

2020-11-20 11:47:37

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt


Hi,

On 20/11/20 07:55, Peter Zijlstra wrote:
> Frequency- / Heterogeneous Invariance
> -------------------------------------
>
> Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
> for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
> a big CPU, we allow architectures to scale the time delta with two ratios, one
> DVFS ratio and one microarch ratio.
>
> For simple DVFS architectures (where software is in full control) we trivially
> compute the ratio as:
>
> f_cur
> r_dvfs := -----
> f_max
>
> For more dynamic systems where the hardware is in control of DVFS (Intel,
> ARMv8.4-AMU) we use hardware counters to provide us this ratio. In specific,
> for Intel, we use:
>
> APERF
> f_cur := ----- * P0
> MPERF
>
> 4C-turbo; if available and turbo enabled
> f_max := { 1C-turbo; if turbo enabled
> P0; otherwise
>
> f_cur
> r_dvfs := min( 1, ----- )
> f_max
>
> We pick 4C turbo over 1C turbo to make it slightly more sustainable.
>
> r_het is determined as the average performance difference between a big and
> LITTLE core when running at max frequency over 'relevant' benchmarks.

Welcome to our wonderful world where there can be more than just two types
of CPUs! A perhaps safer statement would be:

r_het is determined as the ratio of highest performance level of the
current CPU vs the highest performance level of any other CPU in the
system.


Also; do we want to further state the obvious?

r_tot := r_het * r_dvfs

>
> The result is that the above 'running' and 'runnable' metrics become invariant
> of DVFS and Heterogenous state. IOW. we can transfer and compare them between
> CPUs.
>
> For more detail see:
>
> - kernel/sched/pelt.h:update_rq_clock_pelt()
> - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
>

Some of that is rephrased in

Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"

(with added diagrams crafted with love by yours truly); I suppose a cross
reference can't hurt.

>
> UTIL_EST / UTIL_EST_FASTUP
> --------------------------
>
> Because periodic tasks have their averages decayed while they sleep, even
> though when running their expected utilization will be the same, they suffer a
> (DVFS) ramp-up after they become runnable again.
>
> To alleviate this (a default enabled option) UTIL_EST drives an (IIR) EWMA
> with the 'running' value on dequeue -- when it is highest. A further default
> enabled option UTIL_EST_FASTUP modifies the IIR filter to instantly increase
> and only decay on decrease.
>
> A further runqueue wide sum (of runnable tasks) is maintained of:
>
> util_est := \Sum_t max( t_running, t_util_est_ewma )
>
> For more detail see: kernel/sched/fair.h:util_est_dequeue()
>
>
> UCLAMP
> ------
>
> It is possible to set effective u_min and u_max clamps on each task; the

Nit: effective clamps are the task clamps clamped by task group clamps
(yes, that is 4 times 'clamp' in a single line).

> runqueue keeps an max aggregate of these clamps for all running tasks.
>
> For more detail see: include/uapi/linux/sched/types.h
>
>
> Schedutil / DVFS
> ----------------
>
> Every time the scheduler load tracking is updated (task wakeup, task
> migration, time progression) we call out to schedutil to update the hardware
> DVFS state.
>
> The basis is the CPU runqueue's 'running' metric, which per the above it is
> the frequency invariant utilization estimate of the CPU. From this we compute
> a desired frequency like:
>
> max( running, util_est ); if UTIL_EST
> u_cfs := { running; otherwise
>
> u_clamp := clamp( u_cfs, u_min, u_max )
>
> u := u_cfs + u_rt + u_irq + u_dl; [approx. see source for more detail]
>
> f_des := min( f_max, 1.25 u * f_max )
>
> XXX IO-wait; when the update is due to a task wakeup from IO-completion we
> boost 'u' above.
>

IIRC the boost is fiddled with during the above, but can be applied at
different subsequent updates (even if the task that triggered the boost is
no longer here).

> This frequency is then used to select a P-state/OPP or directly munged into a
> CPPC style request to the hardware.
>
> XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
> required to satisfy the workload.
>
> Because these callbacks are directly from the scheduler, the DVFS hardware
> interaction should be 'fast' and non-blocking. Schedutil supports
> rate-limiting DVFS requests for when hardware interaction is slow and
> expensive, this reduces effectiveness.
>
> For more information see: kernel/sched/cpufreq_schedutil.c
>
>
> NOTES
> -----
>
> - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
> will closely reflect utilization.
>
> - In saturated scenarios task movement will cause some transient dips,
> suppose we have a CPU saturated with 4 tasks, then when we migrate a task
> to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
> new CPU will gain 0.25. This is inevitable and time progression will
> correct this. XXX do we still guarantee f_max due to no idle-time?
>
> - Much of the above is about avoiding DVFS dips, and independent DVFS domains
> having to re-learn / ramp-up when load shifts.

2020-11-20 14:41:54

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

Hi Peter,

Looks like a nice summary to me.

On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> Hi,
>
> I was recently asked to explain how schedutil works, the below write-up
> is the result of that and I figured we might as well stick it in the
> tree.
>
> Not as a patch for easy reading and commenting.
>
> ---
>
> NOTE; all this assumes a linear relation between frequency and work capacity,
> we know this is flawed, but it is the best workable approximation.

If you replace frequency with performance level everywhere (CPPC style),
most of it should still work without that assumption. The assumption
might have be made in hw or fw instead though.

Morten

2020-11-23 10:10:25

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Mon, 23 Nov 2020 at 10:30, Dietmar Eggemann <[email protected]> wrote:
>
> On 20/11/2020 09:56, Peter Zijlstra wrote:
> > On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> >> - In saturated scenarios task movement will cause some transient dips,
> >> suppose we have a CPU saturated with 4 tasks, then when we migrate a task
> >> to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
> >> new CPU will gain 0.25. This is inevitable and time progression will
> >> correct this. XXX do we still guarantee f_max due to no idle-time?
> >
> > Do we want something like this? Is the 1.5 threshold sane? (it's been too
> > long since I looked at actual numbers here)
>
> Did some tests on a big.little system:
>
> (1) rt-app workload on big CPU:
>
> - task0-3 (runtime/period=4000us/16000us, started with
> 4000us delay to each other) run on CPU1
> - then task3 migrates to CPU2 and runs there for 64ms
> - then task2 migrates to CPU2 too and both tasks run there
> for another 64ms
>
> ...
> task3-3-1684 [001] 3982.798729: sched_pelt_cfs: cpu=1 path=/ load=232890 runnable=3260 util=1011
> migration/1-14 [001] 3982.798756: sched_migrate_task: comm=task3-3 pid=1684 prio=101 orig_cpu=1 dest_cpu=2*
> migration/1-14 [001] 3982.798767: sched_pelt_cfs: cpu=1 path=/ load=161374 runnable=2263 util=*700* <-- util dip !!!
> task1-1-1682 [001] 3982.799802: sched_pelt_cfs: cpu=1 path=/ load=160988 runnable=2257 util=706
> ...
> task2-2-1683 [001] 3982.849123: sched_pelt_cfs: cpu=1 path=/ load=161124 runnable=2284 util=904
> task2-2-1683 [001] 3982.851960: sched_pelt_cfs: cpu=1 path=/ load=160130 runnable=2271 util=911
> migration/1-14 [001] 3982.851984: sched_migrate_task: comm=task2-2 pid=1683 prio=101 orig_cpu=1 dest_cpu=2**
> migration/1-14 [001] 3982.851995: sched_pelt_cfs: cpu=1 path=/ load=88672 runnable=*1257* util=512 <-- runnable below 1536
> task1-1-1682 [001] 3982.852983: sched_pelt_cfs: cpu=1 path=/ load=88321 runnable=1252 util=521
> ...
>
>
> * task1,2,3 remain on CPU1 and still have to catch up, no idle
> time on CPU1
>
> ** task 1,2 remain on CPU1, there is idle time on CPU1!
>
>
> (2) rt-app workload on LITTLE CPU (orig cpu_capacity: 446)
>
> - task0-3 (runtime/period=1742us/16000us, started with
> 4000us delay to each other) run on CPU4
> - then task3 migrates to CPU5 and runs there for 64ms
> - then task2 migrates to CPU5 too and both tasks run there
> for another 64ms
>
> ...
> task1-1-1777 [004] 789.443015: sched_pelt_cfs: cpu=4 path=/ load=234718 runnable=3018 util=976
> migration/4-29 [004] 789.444718: sched_migrate_task: comm=task3-3 pid=1779 prio=101 orig_cpu=4 dest_cpu=5*
> migration/4-29 [004] 789.444739: sched_pelt_cfs: cpu=4 path=/ load=163543 runnable=2114 util=*778* <--util dip !!!
> task2-2-1778 [004] 789.447013: sched_pelt_cfs: cpu=4 path=/ load=163392 runnable=2120 util=777
> ...
> task1-1-1777 [004] 789.507012: sched_pelt_cfs: cpu=4 path=/ load=164482 runnable=2223 util=879
> migration/4-29 [004] 789.508023: sched_migrate_task: comm=task2-2 pid=1778 prio=101 orig_cpu=4 dest_cpu=5**
> migration/4-29 [004] 789.508044: sched_pelt_cfs: cpu=4 path=/ load=94099 runnable=*1264* util=611 <-- runnable below 1536
> task0-0-1776 [004] 789.511011: sched_pelt_cfs: cpu=4 path=/ load=93898 runnable=1264 util=622
> ...
>
> * task1,2,3 remain on CPU1 and still have to catch up, no idle
> time on CPU1
>
> ** task 1,2 remain on CPU1, no idle time on CPU1 yet.
>
> So for the big CPU, there is idle time and for the LITTLE there
> isn't with runnable below the threshold.

I'm not sure to catch what you want to highlight with your tests ?

>
> As Quentin pointed out, sugov_cpu_is_busy() (only implemented on
> 'single') tries to do something similar.
>
> I assume that 'another utilization metric' mentioned in commit
> b7eaf1aab9f8 ("cpufreq: schedutil: Avoid reducing frequency of
> busy CPUs prematurely") is rq->cfs.avg.runnable_avg.

2020-11-23 11:33:15

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On 23/11/2020 11:05, Vincent Guittot wrote:
> On Mon, 23 Nov 2020 at 10:30, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 20/11/2020 09:56, Peter Zijlstra wrote:
>>> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
>>>> - In saturated scenarios task movement will cause some transient dips,
>>>> suppose we have a CPU saturated with 4 tasks, then when we migrate a task
>>>> to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
>>>> new CPU will gain 0.25. This is inevitable and time progression will
>>>> correct this. XXX do we still guarantee f_max due to no idle-time?
>>>
>>> Do we want something like this? Is the 1.5 threshold sane? (it's been too
>>> long since I looked at actual numbers here)
>>
>> Did some tests on a big.little system:
>>
>> (1) rt-app workload on big CPU:
>>
>> - task0-3 (runtime/period=4000us/16000us, started with
>> 4000us delay to each other) run on CPU1
>> - then task3 migrates to CPU2 and runs there for 64ms
>> - then task2 migrates to CPU2 too and both tasks run there
>> for another 64ms
>>
>> ...
>> task3-3-1684 [001] 3982.798729: sched_pelt_cfs: cpu=1 path=/ load=232890 runnable=3260 util=1011
>> migration/1-14 [001] 3982.798756: sched_migrate_task: comm=task3-3 pid=1684 prio=101 orig_cpu=1 dest_cpu=2*
>> migration/1-14 [001] 3982.798767: sched_pelt_cfs: cpu=1 path=/ load=161374 runnable=2263 util=*700* <-- util dip !!!
>> task1-1-1682 [001] 3982.799802: sched_pelt_cfs: cpu=1 path=/ load=160988 runnable=2257 util=706
>> ...
>> task2-2-1683 [001] 3982.849123: sched_pelt_cfs: cpu=1 path=/ load=161124 runnable=2284 util=904
>> task2-2-1683 [001] 3982.851960: sched_pelt_cfs: cpu=1 path=/ load=160130 runnable=2271 util=911
>> migration/1-14 [001] 3982.851984: sched_migrate_task: comm=task2-2 pid=1683 prio=101 orig_cpu=1 dest_cpu=2**
>> migration/1-14 [001] 3982.851995: sched_pelt_cfs: cpu=1 path=/ load=88672 runnable=*1257* util=512 <-- runnable below 1536
>> task1-1-1682 [001] 3982.852983: sched_pelt_cfs: cpu=1 path=/ load=88321 runnable=1252 util=521
>> ...
>>
>>
>> * task1,2,3 remain on CPU1 and still have to catch up, no idle
>> time on CPU1
>>
>> ** task 1,2 remain on CPU1, there is idle time on CPU1!
>>
>>
>> (2) rt-app workload on LITTLE CPU (orig cpu_capacity: 446)
>>
>> - task0-3 (runtime/period=1742us/16000us, started with
>> 4000us delay to each other) run on CPU4
>> - then task3 migrates to CPU5 and runs there for 64ms
>> - then task2 migrates to CPU5 too and both tasks run there
>> for another 64ms
>>
>> ...
>> task1-1-1777 [004] 789.443015: sched_pelt_cfs: cpu=4 path=/ load=234718 runnable=3018 util=976
>> migration/4-29 [004] 789.444718: sched_migrate_task: comm=task3-3 pid=1779 prio=101 orig_cpu=4 dest_cpu=5*
>> migration/4-29 [004] 789.444739: sched_pelt_cfs: cpu=4 path=/ load=163543 runnable=2114 util=*778* <--util dip !!!
>> task2-2-1778 [004] 789.447013: sched_pelt_cfs: cpu=4 path=/ load=163392 runnable=2120 util=777
>> ...
>> task1-1-1777 [004] 789.507012: sched_pelt_cfs: cpu=4 path=/ load=164482 runnable=2223 util=879
>> migration/4-29 [004] 789.508023: sched_migrate_task: comm=task2-2 pid=1778 prio=101 orig_cpu=4 dest_cpu=5**
>> migration/4-29 [004] 789.508044: sched_pelt_cfs: cpu=4 path=/ load=94099 runnable=*1264* util=611 <-- runnable below 1536
>> task0-0-1776 [004] 789.511011: sched_pelt_cfs: cpu=4 path=/ load=93898 runnable=1264 util=622
>> ...
>>
>> * task1,2,3 remain on CPU1 and still have to catch up, no idle
>> time on CPU1
>>
>> ** task 1,2 remain on CPU1, no idle time on CPU1 yet.
>>
>> So for the big CPU, there is idle time and for the LITTLE there
>> isn't with runnable below the threshold.
>
> I'm not sure to catch what you want to highlight with your tests ?

I thought the question was whether 'runnable_avg = 1.5 x
SCHED_CAPACITY_SCALE' is a good threshold to decide to drive frequency
by runnable_avg or util_avg.

[...]

2020-11-23 18:44:33

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On 23/11/2020 14:42, Vincent Guittot wrote:
> On Mon, 23 Nov 2020 at 12:27, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 23/11/2020 11:05, Vincent Guittot wrote:
>>> On Mon, 23 Nov 2020 at 10:30, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 20/11/2020 09:56, Peter Zijlstra wrote:
>>>>> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:

[...]

>> I thought the question was whether 'runnable_avg = 1.5 x
>> SCHED_CAPACITY_SCALE' is a good threshold to decide to drive frequency
>> by runnable_avg or util_avg.
>
> we can't use SCHED_CAPACITY_SCALE and must use cpu's capacity

To get some idle time on the LITTLE CPU I extended the time in which
task3 and then task2-3 run on CPU5 to 128ms.

The moment CPU4 has some idle time for the first time after the first
migration (~207ms), the runnable_avg drops from 1323 (task0: 316,
task1: 1020) to 1.
I can't see the dependency to the CPU capacity here.
Util_avg is also larger than the CPU capacity.

The steep fall from runnable=1323 to 1 is due to lost_idle_time update.

...
migration/4-29 [004] 60.000034: sched_migrate_task: comm=task3-3 pid=1690 prio=101 orig_cpu=4 dest_cpu=5
migration/4-29 [004] 60.000046: sched_pelt_cfs: cpu=4 path=/ load=163618 runnable=2296 util=748
...
migration/4-29 [004] 60.142088: sched_migrate_task: comm=task2-2 pid=1689 prio=101 orig_cpu=4 dest_cpu=5
migration/4-29 [004] 60.142100: sched_pelt_cfs: cpu=4 path=/ load=93358 runnable=1325 util=628
...
task0-0-1687 [004] 60.201385: sched_pelt_se: cpu=4 path=(null) comm=task0-0 pid=1687 load=22317 runnable=316 util=316
task0-0-1687 [004] 60.201387: sched_pelt_cfs: cpu=4 path=/ load=93978 runnable=1336 util=788
...
task1-1-1688 [004] 60.207225: sched_pelt_se: cpu=4 path=(null) comm=task1-1 pid=1688 load=71610 runnable=1020 util=497
task1-1-1688 [004] 60.207227: sched_pelt_cfs: cpu=4 path=/ load=93017 runnable=1323 util=800
<idle>-0 [004] 60.207254: cpu_idle: state=0 cpu_id=4
...
<idle>-0 [004] 60.209397: sched_pelt_cfs: cpu=4 path=/ load=80 runnable=1 util=0
...

2020-11-23 23:23:06

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On 20/11/2020 08:55, Peter Zijlstra wrote:

[...]

> PELT (Per Entity Load Tracking)
> -------------------------------

[...]

> Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
> reflects the time an entity spends on the CPU, while 'runnable' reflects the
> time an entity spends on the runqueue. When there is only a single task these
> two metrics are the same, but once there is contention for the CPU 'running'
> will decrease to reflect the fraction of time each task spends on the CPU
> while 'runnable' will increase to reflect the amount of contention.

People might find it confusing to map 'running and 'runnable' into the 3
PELT signals (load_avg, runnable_avg and util_avg) being used in the
scheduler ... with load_avg being 'runnable' and 'weight' based.

> For more detail see: kernel/sched/pelt.c
>
>
> Frequency- / Heterogeneous Invariance
> -------------------------------------

We call 'Heterogeneous Invariance' CPU invariance in chapter 2.3
Documentation/scheduler/sched-capacity.rst.

[...]

> For more detail see:
>
> - kernel/sched/pelt.h:update_rq_clock_pelt()
> - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."

drivers/base/arch_topology.c:"f_cur/f_max ratio computation".

> UTIL_EST / UTIL_EST_FASTUP
> --------------------------

[...]

> util_est := \Sum_t max( t_running, t_util_est_ewma )
>
> For more detail see: kernel/sched/fair.h:util_est_dequeue()

s/fair.h/fair.c

> UCLAMP
> ------
>
> It is possible to set effective u_min and u_max clamps on each task; the

s/on each task/on each CFS or RT task

[...]

2020-11-23 23:24:32

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On 20/11/2020 09:56, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
>> - In saturated scenarios task movement will cause some transient dips,
>> suppose we have a CPU saturated with 4 tasks, then when we migrate a task
>> to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
>> new CPU will gain 0.25. This is inevitable and time progression will
>> correct this. XXX do we still guarantee f_max due to no idle-time?
>
> Do we want something like this? Is the 1.5 threshold sane? (it's been too
> long since I looked at actual numbers here)

Did some tests on a big.little system:

(1) rt-app workload on big CPU:

- task0-3 (runtime/period=4000us/16000us, started with
4000us delay to each other) run on CPU1
- then task3 migrates to CPU2 and runs there for 64ms
- then task2 migrates to CPU2 too and both tasks run there
for another 64ms

...
task3-3-1684 [001] 3982.798729: sched_pelt_cfs: cpu=1 path=/ load=232890 runnable=3260 util=1011
migration/1-14 [001] 3982.798756: sched_migrate_task: comm=task3-3 pid=1684 prio=101 orig_cpu=1 dest_cpu=2*
migration/1-14 [001] 3982.798767: sched_pelt_cfs: cpu=1 path=/ load=161374 runnable=2263 util=*700* <-- util dip !!!
task1-1-1682 [001] 3982.799802: sched_pelt_cfs: cpu=1 path=/ load=160988 runnable=2257 util=706
...
task2-2-1683 [001] 3982.849123: sched_pelt_cfs: cpu=1 path=/ load=161124 runnable=2284 util=904
task2-2-1683 [001] 3982.851960: sched_pelt_cfs: cpu=1 path=/ load=160130 runnable=2271 util=911
migration/1-14 [001] 3982.851984: sched_migrate_task: comm=task2-2 pid=1683 prio=101 orig_cpu=1 dest_cpu=2**
migration/1-14 [001] 3982.851995: sched_pelt_cfs: cpu=1 path=/ load=88672 runnable=*1257* util=512 <-- runnable below 1536
task1-1-1682 [001] 3982.852983: sched_pelt_cfs: cpu=1 path=/ load=88321 runnable=1252 util=521
...


* task1,2,3 remain on CPU1 and still have to catch up, no idle
time on CPU1

** task 1,2 remain on CPU1, there is idle time on CPU1!


(2) rt-app workload on LITTLE CPU (orig cpu_capacity: 446)

- task0-3 (runtime/period=1742us/16000us, started with
4000us delay to each other) run on CPU4
- then task3 migrates to CPU5 and runs there for 64ms
- then task2 migrates to CPU5 too and both tasks run there
for another 64ms

...
task1-1-1777 [004] 789.443015: sched_pelt_cfs: cpu=4 path=/ load=234718 runnable=3018 util=976
migration/4-29 [004] 789.444718: sched_migrate_task: comm=task3-3 pid=1779 prio=101 orig_cpu=4 dest_cpu=5*
migration/4-29 [004] 789.444739: sched_pelt_cfs: cpu=4 path=/ load=163543 runnable=2114 util=*778* <--util dip !!!
task2-2-1778 [004] 789.447013: sched_pelt_cfs: cpu=4 path=/ load=163392 runnable=2120 util=777
...
task1-1-1777 [004] 789.507012: sched_pelt_cfs: cpu=4 path=/ load=164482 runnable=2223 util=879
migration/4-29 [004] 789.508023: sched_migrate_task: comm=task2-2 pid=1778 prio=101 orig_cpu=4 dest_cpu=5**
migration/4-29 [004] 789.508044: sched_pelt_cfs: cpu=4 path=/ load=94099 runnable=*1264* util=611 <-- runnable below 1536
task0-0-1776 [004] 789.511011: sched_pelt_cfs: cpu=4 path=/ load=93898 runnable=1264 util=622
...

* task1,2,3 remain on CPU1 and still have to catch up, no idle
time on CPU1

** task 1,2 remain on CPU1, no idle time on CPU1 yet.

So for the big CPU, there is idle time and for the LITTLE there
isn't with runnable below the threshold.

As Quentin pointed out, sugov_cpu_is_busy() (only implemented on
'single') tries to do something similar.

I assume that 'another utilization metric' mentioned in commit
b7eaf1aab9f8 ("cpufreq: schedutil: Avoid reducing frequency of
busy CPUs prematurely") is rq->cfs.avg.runnable_avg.

2020-11-24 00:20:35

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Mon, 23 Nov 2020 at 12:27, Dietmar Eggemann <[email protected]> wrote:
>
> On 23/11/2020 11:05, Vincent Guittot wrote:
> > On Mon, 23 Nov 2020 at 10:30, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 20/11/2020 09:56, Peter Zijlstra wrote:
> >>> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> >>>> - In saturated scenarios task movement will cause some transient dips,
> >>>> suppose we have a CPU saturated with 4 tasks, then when we migrate a task
> >>>> to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
> >>>> new CPU will gain 0.25. This is inevitable and time progression will
> >>>> correct this. XXX do we still guarantee f_max due to no idle-time?
> >>>
> >>> Do we want something like this? Is the 1.5 threshold sane? (it's been too
> >>> long since I looked at actual numbers here)
> >>
> >> Did some tests on a big.little system:
> >>
> >> (1) rt-app workload on big CPU:
> >>
> >> - task0-3 (runtime/period=4000us/16000us, started with
> >> 4000us delay to each other) run on CPU1
> >> - then task3 migrates to CPU2 and runs there for 64ms
> >> - then task2 migrates to CPU2 too and both tasks run there
> >> for another 64ms
> >>
> >> ...
> >> task3-3-1684 [001] 3982.798729: sched_pelt_cfs: cpu=1 path=/ load=232890 runnable=3260 util=1011
> >> migration/1-14 [001] 3982.798756: sched_migrate_task: comm=task3-3 pid=1684 prio=101 orig_cpu=1 dest_cpu=2*
> >> migration/1-14 [001] 3982.798767: sched_pelt_cfs: cpu=1 path=/ load=161374 runnable=2263 util=*700* <-- util dip !!!
> >> task1-1-1682 [001] 3982.799802: sched_pelt_cfs: cpu=1 path=/ load=160988 runnable=2257 util=706
> >> ...
> >> task2-2-1683 [001] 3982.849123: sched_pelt_cfs: cpu=1 path=/ load=161124 runnable=2284 util=904
> >> task2-2-1683 [001] 3982.851960: sched_pelt_cfs: cpu=1 path=/ load=160130 runnable=2271 util=911
> >> migration/1-14 [001] 3982.851984: sched_migrate_task: comm=task2-2 pid=1683 prio=101 orig_cpu=1 dest_cpu=2**
> >> migration/1-14 [001] 3982.851995: sched_pelt_cfs: cpu=1 path=/ load=88672 runnable=*1257* util=512 <-- runnable below 1536
> >> task1-1-1682 [001] 3982.852983: sched_pelt_cfs: cpu=1 path=/ load=88321 runnable=1252 util=521
> >> ...
> >>
> >>
> >> * task1,2,3 remain on CPU1 and still have to catch up, no idle
> >> time on CPU1
> >>
> >> ** task 1,2 remain on CPU1, there is idle time on CPU1!
> >>
> >>
> >> (2) rt-app workload on LITTLE CPU (orig cpu_capacity: 446)
> >>
> >> - task0-3 (runtime/period=1742us/16000us, started with
> >> 4000us delay to each other) run on CPU4
> >> - then task3 migrates to CPU5 and runs there for 64ms
> >> - then task2 migrates to CPU5 too and both tasks run there
> >> for another 64ms
> >>
> >> ...
> >> task1-1-1777 [004] 789.443015: sched_pelt_cfs: cpu=4 path=/ load=234718 runnable=3018 util=976
> >> migration/4-29 [004] 789.444718: sched_migrate_task: comm=task3-3 pid=1779 prio=101 orig_cpu=4 dest_cpu=5*
> >> migration/4-29 [004] 789.444739: sched_pelt_cfs: cpu=4 path=/ load=163543 runnable=2114 util=*778* <--util dip !!!
> >> task2-2-1778 [004] 789.447013: sched_pelt_cfs: cpu=4 path=/ load=163392 runnable=2120 util=777
> >> ...
> >> task1-1-1777 [004] 789.507012: sched_pelt_cfs: cpu=4 path=/ load=164482 runnable=2223 util=879
> >> migration/4-29 [004] 789.508023: sched_migrate_task: comm=task2-2 pid=1778 prio=101 orig_cpu=4 dest_cpu=5**
> >> migration/4-29 [004] 789.508044: sched_pelt_cfs: cpu=4 path=/ load=94099 runnable=*1264* util=611 <-- runnable below 1536
> >> task0-0-1776 [004] 789.511011: sched_pelt_cfs: cpu=4 path=/ load=93898 runnable=1264 util=622
> >> ...
> >>
> >> * task1,2,3 remain on CPU1 and still have to catch up, no idle
> >> time on CPU1
> >>
> >> ** task 1,2 remain on CPU1, no idle time on CPU1 yet.
> >>
> >> So for the big CPU, there is idle time and for the LITTLE there
> >> isn't with runnable below the threshold.
> >
> > I'm not sure to catch what you want to highlight with your tests ?
>
> I thought the question was whether 'runnable_avg = 1.5 x
> SCHED_CAPACITY_SCALE' is a good threshold to decide to drive frequency
> by runnable_avg or util_avg.

we can't use SCHED_CAPACITY_SCALE and must use cpu's capacity

>
> [...]

2020-11-24 00:31:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Mon, Nov 23, 2020 at 10:26:13AM +0100, Dietmar Eggemann wrote:
> On 20/11/2020 08:55, Peter Zijlstra wrote:
>
> [...]
>
> > PELT (Per Entity Load Tracking)
> > -------------------------------
>
> [...]
>
> > Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
> > reflects the time an entity spends on the CPU, while 'runnable' reflects the
> > time an entity spends on the runqueue. When there is only a single task these
> > two metrics are the same, but once there is contention for the CPU 'running'
> > will decrease to reflect the fraction of time each task spends on the CPU
> > while 'runnable' will increase to reflect the amount of contention.
>
> People might find it confusing to map 'running and 'runnable' into the 3
> PELT signals (load_avg, runnable_avg and util_avg) being used in the
> scheduler ... with load_avg being 'runnable' and 'weight' based.

Yeah, but that's for another document, I suppose. much of pelt.c uses
runnable. Also, the comment that goes with struct sched_avg should
explain.

> > For more detail see: kernel/sched/pelt.c
> >
> >
> > Frequency- / Heterogeneous Invariance
> > -------------------------------------
>
> We call 'Heterogeneous Invariance' CPU invariance in chapter 2.3
> Documentation/scheduler/sched-capacity.rst.
>
> [...]

Fair enough; I've renamed it to match.

> > For more detail see:
> >
> > - kernel/sched/pelt.h:update_rq_clock_pelt()
> > - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
>
> drivers/base/arch_topology.c:"f_cur/f_max ratio computation".

I can't seem to find that in any tree near me (I tried tip/master and
next/master)

> > UTIL_EST / UTIL_EST_FASTUP
> > --------------------------
>
> [...]
>
> > util_est := \Sum_t max( t_running, t_util_est_ewma )
> >
> > For more detail see: kernel/sched/fair.h:util_est_dequeue()
>
> s/fair.h/fair.c
>
> > UCLAMP
> > ------
> >
> > It is possible to set effective u_min and u_max clamps on each task; the
>
> s/on each task/on each CFS or RT task

Thanks!

2020-12-02 14:21:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

For whatever reason, this only arrived in my inbox today.

On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> PELT (Per Entity Load Tracking)
> -------------------------------
>
> With PELT we track some metrics across the various entities, from individual

s/entities/scheduler entities/ so that someone will recognise
sched_entity in the code.

> tasks to task-group slices to CPU runqueues. As the basis for this we use an
> EWMA, each period (1024us) is decayed such that y^32 = 0.5. That is, the

Expand -- Exponentially Weighted Moving Average (EWMA). The cc list
should recognise it automatically, but maybe not a first entrant to
kernel/sched. kernel/sched is littered with institutional knowledge.

> most recent 32ms contribute half, while the rest of history contribute the
> other half.
>

IIRC, this 32ms is tied to the value of LOAD_AVG_PERIOD and the length
of the ewma_sum series below. Might be worth expanding a little further.

> Specifically:
>
> ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
>
> ewma(u) = ewma_sum(u) / ewma_sum(1)
>
> Since this is essentially a progression of an infinite geometric series, the
> results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
> is key, since it gives the ability to recompose the averages when tasks move
> around.
>
> Note that blocked tasks still contribute to the aggregates (task-group slices
> and CPU runqueues), which reflects their expected contribution when they
> resume running.
>
> Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
> reflects the time an entity spends on the CPU, while 'runnable' reflects the
> time an entity spends on the runqueue. When there is only a single task these
> two metrics are the same, but once there is contention for the CPU 'running'
> will decrease to reflect the fraction of time each task spends on the CPU
> while 'runnable' will increase to reflect the amount of contention.
>
> For more detail see: kernel/sched/pelt.c
>
>
> Frequency- / Heterogeneous Invariance
> -------------------------------------
>
> Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
> for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
> a big CPU, we allow architectures to scale the time delta with two ratios, one
> DVFS ratio and one microarch ratio.
>

Expand -- Dynamic Voltage and Frequency Scaling (DVFS) and assume that
the reader will think of cpufreq.

> For simple DVFS architectures (where software is in full control) we trivially
> compute the ratio as:
>
> f_cur
> r_dvfs := -----
> f_max
>
> For more dynamic systems where the hardware is in control of DVFS (Intel,
> ARMv8.4-AMU) we use hardware counters to provide us this ratio. In specific,
> for Intel, we use:
>

s/In specific, for Intel/For Intel specifically,/

> APERF
> f_cur := ----- * P0
> MPERF
>
> 4C-turbo; if available and turbo enabled
> f_max := { 1C-turbo; if turbo enabled
> P0; otherwise
>
> f_cur
> r_dvfs := min( 1, ----- )
> f_max
>
> We pick 4C turbo over 1C turbo to make it slightly more sustainable.
>
> r_het is determined as the average performance difference between a big and
> LITTLE core when running at max frequency over 'relevant' benchmarks.
>

r_het is never mentioned again so it's not immediately obvious what it
ties into. I assume het is short for heterogeneous and is simply another
way of looking at current CPU compute power vs potential CPU compute power
(be that due to DVFS or big.LITTLE).

> The result is that the above 'running' and 'runnable' metrics become invariant
> of DVFS and Heterogenous state. IOW. we can transfer and compare them between
> CPUs.
>
> For more detail see:
>
> - kernel/sched/pelt.h:update_rq_clock_pelt()
> - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
>

The role and importance of frequency invariance is mentioned but it
could be more explicit. However, looking at update_rq_clock_pelt may be
enough of a clue.

Either way, decoding this document fully will require someone to spend a
lot of time on the source and then rereading this document. That's
probably a good thing.

>
> UTIL_EST / UTIL_EST_FASTUP
> --------------------------
>
> Because periodic tasks have their averages decayed while they sleep, even
> though when running their expected utilization will be the same, they suffer a
> (DVFS) ramp-up after they become runnable again.
>

s/they become runnable again/they are running again/ ? Maybe refer as
"running" because it's only once they are on the CPU and running that
DVFS comes into play?



> To alleviate this (a default enabled option) UTIL_EST drives an (IIR) EWMA

Expand IIR -- Immediate Impulse Reponse?

> with the 'running' value on dequeue -- when it is highest. A further default
> enabled option UTIL_EST_FASTUP modifies the IIR filter to instantly increase
> and only decay on decrease.
>
> A further runqueue wide sum (of runnable tasks) is maintained of:
>
> util_est := \Sum_t max( t_running, t_util_est_ewma )
>
> For more detail see: kernel/sched/fair.h:util_est_dequeue()
>

It's less obvious what the consequence is unless the reader manages to
tie the IO-wait comment in "Schedutil / DVFS" to this section.

> UCLAMP
> ------
>
> It is possible to set effective u_min and u_max clamps on each task; the
> runqueue keeps an max aggregate of these clamps for all running tasks.
>
> For more detail see: include/uapi/linux/sched/types.h
>
>
> Schedutil / DVFS
> ----------------
>
> Every time the scheduler load tracking is updated (task wakeup, task
> migration, time progression) we call out to schedutil to update the hardware
> DVFS state.
>
> The basis is the CPU runqueue's 'running' metric, which per the above it is
> the frequency invariant utilization estimate of the CPU. From this we compute
> a desired frequency like:
>
> max( running, util_est ); if UTIL_EST
> u_cfs := { running; otherwise
>
> u_clamp := clamp( u_cfs, u_min, u_max )
>
> u := u_cfs + u_rt + u_irq + u_dl; [approx. see source for more detail]
>
> f_des := min( f_max, 1.25 u * f_max )
>
> XXX IO-wait; when the update is due to a task wakeup from IO-completion we
> boost 'u' above.
>
> This frequency is then used to select a P-state/OPP or directly munged into a
> CPPC style request to the hardware.
>
> XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
> required to satisfy the workload.
>
> Because these callbacks are directly from the scheduler, the DVFS hardware
> interaction should be 'fast' and non-blocking. Schedutil supports
> rate-limiting DVFS requests for when hardware interaction is slow and
> expensive, this reduces effectiveness.
>

Is it worth explicitly mentioning that a key advantage over
hardware-based approaches is that schedutil carries utilisation state on
CPU migration? You say that it is tracked but it's less obvious why that
matters as a pure hardware based approach loses utilisation information
about a task once it migrates.

Even moving note 3 below into this section and expanding it with an
example based on HWP would be helpful.

> For more information see: kernel/sched/cpufreq_schedutil.c
>
>
> NOTES
> -----
>
> - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
> will closely reflect utilization.
>
> - In saturated scenarios task movement will cause some transient dips,
> suppose we have a CPU saturated with 4 tasks, then when we migrate a task
> to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
> new CPU will gain 0.25. This is inevitable and time progression will
> correct this. XXX do we still guarantee f_max due to no idle-time?
>
> - Much of the above is about avoiding DVFS dips, and independent DVFS domains
> having to re-learn / ramp-up when load shifts.
>

--
Mel Gorman
SUSE Labs

2020-12-02 15:57:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Wed, Dec 02, 2020 at 02:18:35PM +0000, Mel Gorman wrote:
> For whatever reason, this only arrived in my inbox today.

It's as if the gods of email knew you was busy before ;-)

> On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> > PELT (Per Entity Load Tracking)
> > -------------------------------
> >
> > With PELT we track some metrics across the various entities, from individual
>
> s/entities/scheduler entities/ so that someone will recognise
> sched_entity in the code.
>
> > tasks to task-group slices to CPU runqueues. As the basis for this we use an
> > EWMA, each period (1024us) is decayed such that y^32 = 0.5. That is, the
>
> Expand -- Exponentially Weighted Moving Average (EWMA). The cc list
> should recognise it automatically, but maybe not a first entrant to
> kernel/sched. kernel/sched is littered with institutional knowledge.

It is, we could do with moar comments.

> > most recent 32ms contribute half, while the rest of history contribute the
> > other half.
> >
>
> IIRC, this 32ms is tied to the value of LOAD_AVG_PERIOD and the length
> of the ewma_sum series below. Might be worth expanding a little further.

It is LOAD_AVG_PERIOD. Some people (re)generate the PELT tables with a
different period (16 and 64 are common).

Not sure what there is to expand; the whole of it is: y^32=0.5. We had
to pick some half-life period, 32 seemed like a good number.

> > r_het is determined as the average performance difference between a big and
> > LITTLE core when running at max frequency over 'relevant' benchmarks.
> >
>
> r_het is never mentioned again so it's not immediately obvious what it
> ties into. I assume het is short for heterogeneous and is simply another
> way of looking at current CPU compute power vs potential CPU compute power
> (be that due to DVFS or big.LITTLE).

It's now called r_cpu, but yes.

> > The result is that the above 'running' and 'runnable' metrics become invariant
> > of DVFS and Heterogenous state. IOW. we can transfer and compare them between
> > CPUs.

^^^ that is the consequence; being able to directly compare the numbers
across CPUs.

> >
> > For more detail see:
> >
> > - kernel/sched/pelt.h:update_rq_clock_pelt()
> > - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
> >
>
> The role and importance of frequency invariance is mentioned but it
> could be more explicit. However, looking at update_rq_clock_pelt may be
> enough of a clue.
>
> Either way, decoding this document fully will require someone to spend a
> lot of time on the source and then rereading this document. That's
> probably a good thing.

:-)

> > UTIL_EST / UTIL_EST_FASTUP
> > --------------------------
> >
> > Because periodic tasks have their averages decayed while they sleep, even
> > though when running their expected utilization will be the same, they suffer a
> > (DVFS) ramp-up after they become runnable again.
> >
>
> s/they become runnable again/they are running again/ ? Maybe refer as
> "running" because it's only once they are on the CPU and running that
> DVFS comes into play?

OK.

> > To alleviate this (a default enabled option) UTIL_EST drives an (IIR) EWMA
>
> Expand IIR -- Immediate Impulse Reponse?

Infinite Impuse Response

> > with the 'running' value on dequeue -- when it is highest. A further default
> > enabled option UTIL_EST_FASTUP modifies the IIR filter to instantly increase
> > and only decay on decrease.
> >
> > A further runqueue wide sum (of runnable tasks) is maintained of:
> >
> > util_est := \Sum_t max( t_running, t_util_est_ewma )
> >
> > For more detail see: kernel/sched/fair.h:util_est_dequeue()
> >
>
> It's less obvious what the consequence is unless the reader manages to
> tie the IO-wait comment in "Schedutil / DVFS" to this section.

I'm not entirely sure I follow. The purpose of UTIL_EST is to avoid
ramp-up issues and isn't related to IO-wait boosting.

> > Schedutil / DVFS
> > ----------------
> >
> > Every time the scheduler load tracking is updated (task wakeup, task
> > migration, time progression) we call out to schedutil to update the hardware
> > DVFS state.
> >
> > The basis is the CPU runqueue's 'running' metric, which per the above it is
> > the frequency invariant utilization estimate of the CPU. From this we compute
> > a desired frequency like:
> >
> > max( running, util_est ); if UTIL_EST
> > u_cfs := { running; otherwise
> >
> > u_clamp := clamp( u_cfs, u_min, u_max )
> >
> > u := u_cfs + u_rt + u_irq + u_dl; [approx. see source for more detail]
> >
> > f_des := min( f_max, 1.25 u * f_max )
> >
> > XXX IO-wait; when the update is due to a task wakeup from IO-completion we
> > boost 'u' above.
> >
> > This frequency is then used to select a P-state/OPP or directly munged into a
> > CPPC style request to the hardware.
> >
> > XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
> > required to satisfy the workload.
> >
> > Because these callbacks are directly from the scheduler, the DVFS hardware
> > interaction should be 'fast' and non-blocking. Schedutil supports
> > rate-limiting DVFS requests for when hardware interaction is slow and
> > expensive, this reduces effectiveness.
> >
>
> Is it worth explicitly mentioning that a key advantage over
> hardware-based approaches is that schedutil carries utilisation state on
> CPU migration? You say that it is tracked but it's less obvious why that
> matters as a pure hardware based approach loses utilisation information
> about a task once it migrates.

Not sure that was the exact goal of the document; I set out to describe
schedutil.

> Even moving note 3 below into this section and expanding it with an
> example based on HWP would be helpful.

I might not be the best person to talk about HWP; even though I work for
Intel I know remarkably little of it. I don't even think I've got a
machine that has it on.

Latest version below... I'll probably send it as a patch soon and get it
merged. We can always muck with it more later.

---

NOTE; all this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
-------------------------------

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contribute the other half.

Specifically:

ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.

For more detail see: kernel/sched/pelt.c


Frequency- / CPU Invariance
---------------------------

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:

f_cur
r_dvfs := -----
f_max

For more dynamic systems where the hardware is in control of DVFS (Intel,
ARMv8.4-AMU) we use hardware counters to provide us this ratio. For Intel
specifically, we use:

APERF
f_cur := ----- * P0
MPERF

4C-turbo; if available and turbo enabled
f_max := { 1C-turbo; if turbo enabled
P0; otherwise

f_cur
r_dvfs := min( 1, ----- )
f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.

r_cpu is determined as the ratio of highest performance level of the current
CPU vs the highest performance level of any other CPU in the system.

r_tot = r_dvfs * r_cpu

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.

For more detail see:

- kernel/sched/pelt.h:update_rq_clock_pelt()
- arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
--------------------------

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they are running again.

To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

util_est := \Sum_t max( t_running, t_util_est_ewma )

For more detail see: kernel/sched/fair.c:util_est_dequeue()


UCLAMP
------

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps an max aggregate of these clamps for all running tasks.

For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
----------------

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above it is
the frequency invariant utilization estimate of the CPU. From this we compute
a desired frequency like:

max( running, util_est ); if UTIL_EST
u_cfs := { running; otherwise

u_clamp := clamp( u_cfs, u_min, u_max )

u := u_cfs + u_rt + u_irq + u_dl; [approx. see source for more detail]

f_des := min( f_max, 1.25 u * f_max )

XXX IO-wait; when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests for when hardware interaction is slow and
expensive, this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
-----

- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
will closely reflect utilization.

- In saturated scenarios task movement will cause some transient dips,
suppose we have a CPU saturated with 4 tasks, then when we migrate a task
to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
new CPU will gain 0.25. This is inevitable and time progression will
correct this. XXX do we still guarantee f_max due to no idle-time?

- Much of the above is about avoiding DVFS dips, and independent DVFS domains
having to re-learn / ramp-up when load shifts.

2020-12-02 16:50:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Wed, Dec 02, 2020 at 04:54:52PM +0100, Peter Zijlstra wrote:
> > IIRC, this 32ms is tied to the value of LOAD_AVG_PERIOD and the length
> > of the ewma_sum series below. Might be worth expanding a little further.
>
> It is LOAD_AVG_PERIOD. Some people (re)generate the PELT tables with a
> different period (16 and 64 are common).
>
> Not sure what there is to expand; the whole of it is: y^32=0.5. We had
> to pick some half-life period, 32 seemed like a good number.
>

No issue with the number other than the y^32 is tied to LOAD_AVG_PERIOD.
Again, it's something that someone looking at the source would eventually
figure out so it's probably for the best.

> > > To alleviate this (a default enabled option) UTIL_EST drives an (IIR) EWMA
> >
> > Expand IIR -- Immediate Impulse Reponse?
>
> Infinite Impuse Response
>

Sorry, yes, still worth an expansion.

> > > with the 'running' value on dequeue -- when it is highest. A further default
> > > enabled option UTIL_EST_FASTUP modifies the IIR filter to instantly increase
> > > and only decay on decrease.
> > >
> > > A further runqueue wide sum (of runnable tasks) is maintained of:
> > >
> > > util_est := \Sum_t max( t_running, t_util_est_ewma )
> > >
> > > For more detail see: kernel/sched/fair.h:util_est_dequeue()
> > >
> >
> > It's less obvious what the consequence is unless the reader manages to
> > tie the IO-wait comment in "Schedutil / DVFS" to this section.
>
> I'm not entirely sure I follow. The purpose of UTIL_EST is to avoid
> ramp-up issues and isn't related to IO-wait boosting.
>

I mixed up the example. Historically io-wait boosting was one way of
avoiding DVFS ramp-up issues but now that I reread it, it's best to leave
it general like you already have in your current version.

> > Is it worth explicitly mentioning that a key advantage over
> > hardware-based approaches is that schedutil carries utilisation state on
> > CPU migration? You say that it is tracked but it's less obvious why that
> > matters as a pure hardware based approach loses utilisation information
> > about a task once it migrates.
>
> Not sure that was the exact goal of the document; I set out to describe
> schedutil.
>

Fair enough, it would simply lead to documentation creep.

> > Even moving note 3 below into this section and expanding it with an
> > example based on HWP would be helpful.
>
> I might not be the best person to talk about HWP; even though I work for
> Intel I know remarkably little of it. I don't even think I've got a
> machine that has it on.
>
> Latest version below... I'll probably send it as a patch soon and get it
> merged. We can always muck with it more later.
>

True. At least any confusion can then be driven by specific questions :)

FWIW, after reading the new version I'll ack the patch when it shows up.

Thanks!

--
Mel Gorman
SUSE Labs

2020-12-02 17:01:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt

On Wed, Dec 02, 2020 at 04:45:31PM +0000, Mel Gorman wrote:

> > > It's less obvious what the consequence is unless the reader manages to
> > > tie the IO-wait comment in "Schedutil / DVFS" to this section.
> >
> > I'm not entirely sure I follow. The purpose of UTIL_EST is to avoid
> > ramp-up issues and isn't related to IO-wait boosting.
> >
>
> I mixed up the example. Historically io-wait boosting was one way of
> avoiding DVFS ramp-up issues but now that I reread it, it's best to leave
> it general like you already have in your current version.

So IO-wait boosting is an interesting case; as it captures something not
present in the rest of the model, namely interaction.

There's also that series of patches that does the cpu/gpu interaction
thing.

It would be worth expanding on it, but I didn't have it in me to dig
through the archives to get a coherent description of the current state
of things. Something left todo later...