The current implementation of load tracking invariance scales the load
tracking value with current frequency and uarch performance (only for
utilization) of the CPU.
One main result of the current formula is that the figures are capped by
the current capacity of the CPU. This limitation is the main reason of not
including the uarch invariance (arch_scale_cpu_capacity) in the calculation
of load_avg because capping the load can generate erroneous system load
statistic as described with this example [1]
Instead of scaling the complete value of PELT algo, we should only scale
the running time by the current capacity of the CPU. It seems more correct
to only scale the running time because the non running time of a task
(sleeping or waiting for a runqueue) is the same whatever the current freq
and the compute capacity of the CPU.
Then, one main advantage of this change is that the load of a task can
reach max value whatever the current freq and the uarch of the CPU on which
it run. It will just take more time at a lower freq than a max freq or on a
"little" CPU compared to a "big" one. The load and the utilization stay
invariant across system so we can still compared them between CPU but with
a wider range of values.
With this change, we don't have to test if a CPU is overloaded or not in
order to use one metric (util) or another (load) as all metrics are always
valid.
I have put below some examples of duration to reach some typical load value
according to the capacity of the CPU with current implementation
and with this patch.
Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
972 (95%) 138ms not reachable 276ms
486 (47.5%) 30ms 138ms 60ms
256 (25%) 13ms 32ms 26ms
We can see that at half capacity, we need twice the duration of max
capacity with this patch whereas we have a non linear increase of the
duration with current implementation.
[1] https://lkml.org/lkml/2014/12/18/128
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 28 +++++++++++++---------------
1 file changed, 13 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f..f2a18e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2560,10 +2560,9 @@ static __always_inline int
__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
- u64 delta, scaled_delta, periods;
+ u64 delta, periods;
u32 contrib;
- unsigned int delta_w, scaled_delta_w, decayed = 0;
- unsigned long scale_freq, scale_cpu;
+ unsigned int delta_w, decayed = 0;
delta = now - sa->last_update_time;
/*
@@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
return 0;
sa->last_update_time = now;
- scale_freq = arch_scale_freq_capacity(NULL, cpu);
- scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+ if (running) {
+ delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
+ delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
+ }
/* delta_w is the amount already accumulated against our next period */
delta_w = sa->period_contrib;
@@ -2601,16 +2602,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
- scaled_delta_w = cap_scale(delta_w, scale_freq);
if (weight) {
- sa->load_sum += weight * scaled_delta_w;
+ sa->load_sum += weight * delta_w;
if (cfs_rq) {
cfs_rq->runnable_load_sum +=
- weight * scaled_delta_w;
+ weight * delta_w;
}
}
if (running)
- sa->util_sum += scaled_delta_w * scale_cpu;
+ sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
delta -= delta_w;
@@ -2627,25 +2627,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
/* Efficiently calculate \sum (1..n_period) 1024*y^i */
contrib = __compute_runnable_contrib(periods);
- contrib = cap_scale(contrib, scale_freq);
if (weight) {
sa->load_sum += weight * contrib;
if (cfs_rq)
cfs_rq->runnable_load_sum += weight * contrib;
}
if (running)
- sa->util_sum += contrib * scale_cpu;
+ sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
}
/* Remainder of delta accrued against u_0` */
- scaled_delta = cap_scale(delta, scale_freq);
if (weight) {
- sa->load_sum += weight * scaled_delta;
+ sa->load_sum += weight * delta;
if (cfs_rq)
- cfs_rq->runnable_load_sum += weight * scaled_delta;
+ cfs_rq->runnable_load_sum += weight * delta;
}
if (running)
- sa->util_sum += scaled_delta * scale_cpu;
+ sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
sa->period_contrib += delta;
--
1.9.1
On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
> Instead of scaling the complete value of PELT algo, we should only scale
> the running time by the current capacity of the CPU. It seems more correct
> to only scale the running time because the non running time of a task
> (sleeping or waiting for a runqueue) is the same whatever the current freq
> and the compute capacity of the CPU.
So I'm leaning towards liking this; however with your previous example
of 3 cpus and 7 tasks, where CPU0-1 are 'little' and of half the
capacity as the 'big' CPU2, with 2 tasks on CPU0-1 each and 3 tasks on
CPU2.
This would result, for CPU0, in a load of 100% wait time + 100% runtime,
scaling the runtime 50% will get you a total load of 150%.
For CPU2 we get 100% runtime and 200% wait time, no scaling, for a total
load of 300%.
So the CPU0-1 cluster has a 300% load and the CPU2 'cluster' has a 300%
load, even though the actual load is not actually equal, CPUs0-1
combined have the same capacity as CPU2, so it should be 4-4 tasks for
an equal balance.
So I'm not sure the claim of comparable between CPUs stands. Still it is
an interesting idea and I will consider it more.
On 25 November 2015 at 10:24, Peter Zijlstra <[email protected]> wrote:
> On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
>> Instead of scaling the complete value of PELT algo, we should only scale
>> the running time by the current capacity of the CPU. It seems more correct
>> to only scale the running time because the non running time of a task
>> (sleeping or waiting for a runqueue) is the same whatever the current freq
>> and the compute capacity of the CPU.
>
> So I'm leaning towards liking this; however with your previous example
> of 3 cpus and 7 tasks, where CPU0-1 are 'little' and of half the
> capacity as the 'big' CPU2, with 2 tasks on CPU0-1 each and 3 tasks on
> CPU2.
>
> This would result, for CPU0, in a load of 100% wait time + 100% runtime,
> scaling the runtime 50% will get you a total load of 150%.
>
> For CPU2 we get 100% runtime and 200% wait time, no scaling, for a total
> load of 300%.
>
> So the CPU0-1 cluster has a 300% load and the CPU2 'cluster' has a 300%
> load, even though the actual load is not actually equal, CPUs0-1
> combined have the same capacity as CPU2, so it should be 4-4 tasks for
> an equal balance.
With the example above, we have (after that everything has reached
their stable value)
With the mainline:
load_avg of CPU0 : 2048 and load_avg of each task should be 1024
load_avg of CPU1 : 2048 and load_avg of each task should be 1024
load_avg of CPU2 : 3072 and load_avg of each task should be 1024
With this patch which now includes the cpu invariance in the
calculation of load_avg
load_avg of CPU0 : 2048 and load_avg of each task should be 1024
load_avg of CPU1 : 2048 and load_avg of each task should be 1024
load_avg of CPU2 : 3072 and load_avg of each task should be 1024
The main difference will be in the time needed to reach these values.
CPU2 will reach 95% of the final value in 136ms whereas the load_avg
of CPU0 and CPU1 should be around 789 at that time and will reach the
same value than CPU2 after additional 136ms
Regards,
Vincent
>
>
> So I'm not sure the claim of comparable between CPUs stands. Still it is
> an interesting idea and I will consider it more.
On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
> The current implementation of load tracking invariance scales the load
> tracking value with current frequency and uarch performance (only for
> utilization) of the CPU.
>
> One main result of the current formula is that the figures are capped by
> the current capacity of the CPU. This limitation is the main reason of not
> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
> of load_avg because capping the load can generate erroneous system load
> statistic as described with this example [1]
The reason why we don't want to scale load_avg with regard to uarch
capacity (as we do with util_avg) is explained in
e3279a2e6d697e00e74f905851ee7cf532f72b2d as well.
> Instead of scaling the complete value of PELT algo, we should only scale
> the running time by the current capacity of the CPU. It seems more correct
> to only scale the running time because the non running time of a task
> (sleeping or waiting for a runqueue) is the same whatever the current freq
> and the compute capacity of the CPU.
You seem to imply that we currently scale running, waiting, and sleeping
time. That is not the case. We scale running and waiting time, but not
sleeping time. Whether we should scale waiting time or not is a good
question. The waiting time is affected by the running time of the other
tasks on the cfs_rq, so on one hand it seems a bit inconsistent to scale
one and not the other. On the other hand, not scaling waiting time would
make tasks that spend a lot of time waiting appear bigger, which could
an advantage as it would make load-balancing more prone to spread tasks.
A third alternative is to drop the scaling of load_avg completely, but
it is still needed for util_avg as we want util_avg to be invariant to
frequency and uarch scaling.
> Then, one main advantage of this change is that the load of a task can
> reach max value whatever the current freq and the uarch of the CPU on which
> it run. It will just take more time at a lower freq than a max freq or on a
> "little" CPU compared to a "big" one. The load and the utilization stay
> invariant across system so we can still compared them between CPU but with
> a wider range of values.
Just removing scaling of waiting time and applying scaling by current
capacity (including uarch) to the running time will not make load_avg
reach the max value for tasks running alone on a cpu. Since the task
isn't waiting at all (it is alone) all contributions are running time
which is scaled, IIUC, and hence the result is still capped by the
current capacity of the cpu. But that doesn't match your example results
further down if I read them correctly.
The changes made in the code of this patch are quite subtle, but very
important as they change the behaviour of the PELT geometric series
quite a lot. It is much more than just changing whether we scale waiting
time and apply uarch scaling to running time of load_avg or not. I
think we need to understand the math behind this patch to understand how
the PELT metrics are affected because I think this patch changes some of
the fundamentals originally described by Paul and Ben.
Instead of scaling the contribution of each 1024us segment like we
currently do, this patch is essentially warping time and lumps it
together and let it contribute fully but skips decays. It is rather hard
to explain, but the result is that the patch affects both load_avg and
util_avg, and it breaks scale-invariance.
Executive summary: Scaling time (delta) instead of the individual
segment contributions breaks scale-invariance. The net result on
load_avg seems to be nothing apart from slower reaction time.
That is how I see after having tested it a bit. But I could be getting
it all wrong. :-/
Much more detail:
Original geometric series:
\sum (0..n) u_n * y^n
Current geometric series with scale invariance:
\sum (0..n) u_n * c_n * y^n
In reality we only approximate having the capacity scaling for each
segment as don't enforce PELT updates for each capacity change due to
frequency scaling.
In this patch scaling is applied to the entire delta since last update
instead of each individual segment. That gives us a very interesting
time warping effect when updates happen less frequently than every 1ms.
On cpus with reduced capacity the delta is reduced and all the math is
done as if less time had passed since last update which introduces an
error with regard to the decay of the series as we segments of time with
zero contribution.
It is probably easier described with an example:
We have one periodic task with a period of 4ms. Busy time per activation
is 1ms at 100% capacity. The task has been running forever (>350ms) and
we consider the load_avg calculations at enqueue/dequeue, which is
should the most common update points for this scenario besides the tick
updates.
task states
s = sleeping
R = running (scheduled)
pelt
d = decay segment (load_avg * y, y^32 = 0.5)
[0..1024] = segment contribution (including any scaling)
U = __update_load_avg() is called
f = 100%
| 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
task | s | R | s | s | s | R |
pelt ml | d U 1024 U d | d | d U 1024 U
patch | d U 1024 U d | d | d U 1024 U
f = 33%
| 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
task | s | R | R | R | s | R |
pelt ml | d U 341y^2 | 341y | 341 U d U 341y^2 |
patch | d U 1024 | 0 | 0 U d U 1024 |
In the first case, f = 100%, the update after the busy period is
complete we decay load_avg by one period (segment) and add a
contribution of 1024. We are at 100% so it is a full contribution for
this segment both with and without this patch. The task enqueue update
accounts for the sleeping time by decaying load_avg three periods. The
same in both cases. We could say that the contributions of a full cycle
of the the task is:
f_100% cycle = 1024 + decay(4)
If we reduce the capacity to 33%, things look a bit different. In
mainline, the dequeque update after the busy period would decay three
periods and add \sum (i = 2..0) 0.33*1024*y^i to account for the three
busy segments. The enqueue update decays the load_avg by one segment.
The full cycle contribution becomes:
Mainline:
f_33% cycle = 341*y^2 + 341*y + 341 + decay(4)
With this patch it is different. At the dequeue update we scale the time
delta instead of the contribution, such that delta = 0.33*delta, so the
calculation is based on only one period (segment) has passed. Hence we
decay by one segment and add 1024, but still set the update point to the
true timestamp so the following update doesn't take the two remaining
segments into account. The enqueue update decays the load_avg by one
segment, just like it does in mainline. The full cycle contribution
becomes:
Patch:
f_33% cycle = 1024 + decay(2)
This is clearly different from mainline. Not only is the busy
contribution higher, 1024 > 341*y^2 + 341*y + 341, since y < 1, but we
also decay less. The result is an inflation of the load_avg and util_avg
metrics for tasks that run for more than 1ms at the time if
__update_load_avg() isn't called every 1ms.
I did a quick test to confirm this using a single periodic task and
changing the compute capacity.
util_avg
capacity mainline patch
1024 ~359 ~352
512 ~340 ~534
Execution time went from 1.4ms to 2.8ms per activation without
overloading the cpu.
The fundamental idea in scale invariance is that util_avg should be
comparable between cpu at any capacity as long none of them are
over-utilized. This isn't preserved by the patch in its current form.
> With this change, we don't have to test if a CPU is overloaded or not in
> order to use one metric (util) or another (load) as all metrics are always
> valid.
I'm not sure what you mean by always valid. util_avg is still not a
meaningful metric for tasks running on over-utilized cpus, so it can not
be used unconditionally. If util_avg > capacity we still have no clue if
the task can fit on a different cpu with higher capacity.
> I have put below some examples of duration to reach some typical load value
> according to the capacity of the CPU with current implementation
> and with this patch.
>
> Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
> 972 (95%) 138ms not reachable 276ms
> 486 (47.5%) 30ms 138ms 60ms
> 256 (25%) 13ms 32ms 26ms
I assume that these are numbers for util_avg and not load_avg as said in
the text above. It confuses me a little bit as you started out by
talking about the lack of uarch scaling of load_avg and propose to
change that, not util_avg.
The equivalent table for load_avg would something like this:
load_avg (%) max capacity half capacity(mainline) half capacity(w/ patch)
972 (95%) 138ms 138ms 276ms
486 (47.5%) 30ms 30ms 60ms
256 (25%) 13ms 13ms 26ms
load_avg does reach max capacity as it is. The patch just makes it
happen at a slower pace, which I'm not sure is a good or bad thing.
> We can see that at half capacity, we need twice the duration of max
> capacity with this patch whereas we have a non linear increase of the
> duration with current implementation.
Is it a problem that the time to reach a certain value is not linear?
It is still somewhat unclear to me why we would want this change. Adding
uarch scaling to load_avg but then modify the geometric series so the
end result is the same except that it now reacts slower at lower
capacities seems a bit strange.
>
> [1] https://lkml.org/lkml/2014/12/18/128
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 28 +++++++++++++---------------
> 1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 824aa9f..f2a18e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2560,10 +2560,9 @@ static __always_inline int
> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> - u64 delta, scaled_delta, periods;
> + u64 delta, periods;
> u32 contrib;
> - unsigned int delta_w, scaled_delta_w, decayed = 0;
> - unsigned long scale_freq, scale_cpu;
> + unsigned int delta_w, decayed = 0;
>
> delta = now - sa->last_update_time;
> /*
> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> return 0;
> sa->last_update_time = now;
>
> - scale_freq = arch_scale_freq_capacity(NULL, cpu);
> - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> + if (running) {
> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
This is where the time warping happens. delta is used to determine the
number of periods (segments) since last update. Scaling this, as opposed
to the contributions for each segment individually, can lead to
disappearing segments.
Hi Vincent,
I don't quite catch what this is doing, maybe I need more time
to ramp up to the gory detail difficult like this.
Do you scale or not scale? You seem removed the scaling, but added it
after "Remainder of delta accrued against u_0"..
Thanks,
Yuyang
On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
> The current implementation of load tracking invariance scales the load
> tracking value with current frequency and uarch performance (only for
> utilization) of the CPU.
>
> One main result of the current formula is that the figures are capped by
> the current capacity of the CPU. This limitation is the main reason of not
> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
> of load_avg because capping the load can generate erroneous system load
> statistic as described with this example [1]
>
> Instead of scaling the complete value of PELT algo, we should only scale
> the running time by the current capacity of the CPU. It seems more correct
> to only scale the running time because the non running time of a task
> (sleeping or waiting for a runqueue) is the same whatever the current freq
> and the compute capacity of the CPU.
>
> Then, one main advantage of this change is that the load of a task can
> reach max value whatever the current freq and the uarch of the CPU on which
> it run. It will just take more time at a lower freq than a max freq or on a
> "little" CPU compared to a "big" one. The load and the utilization stay
> invariant across system so we can still compared them between CPU but with
> a wider range of values.
>
> With this change, we don't have to test if a CPU is overloaded or not in
> order to use one metric (util) or another (load) as all metrics are always
> valid.
>
> I have put below some examples of duration to reach some typical load value
> according to the capacity of the CPU with current implementation
> and with this patch.
>
> Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
> 972 (95%) 138ms not reachable 276ms
> 486 (47.5%) 30ms 138ms 60ms
> 256 (25%) 13ms 32ms 26ms
>
> We can see that at half capacity, we need twice the duration of max
> capacity with this patch whereas we have a non linear increase of the
> duration with current implementation.
>
> [1] https://lkml.org/lkml/2014/12/18/128
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 28 +++++++++++++---------------
> 1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 824aa9f..f2a18e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2560,10 +2560,9 @@ static __always_inline int
> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> - u64 delta, scaled_delta, periods;
> + u64 delta, periods;
> u32 contrib;
> - unsigned int delta_w, scaled_delta_w, decayed = 0;
> - unsigned long scale_freq, scale_cpu;
> + unsigned int delta_w, decayed = 0;
>
> delta = now - sa->last_update_time;
> /*
> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> return 0;
> sa->last_update_time = now;
>
> - scale_freq = arch_scale_freq_capacity(NULL, cpu);
> - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> + if (running) {
> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> + }
>
> /* delta_w is the amount already accumulated against our next period */
> delta_w = sa->period_contrib;
> @@ -2601,16 +2602,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> * period and accrue it.
> */
> delta_w = 1024 - delta_w;
> - scaled_delta_w = cap_scale(delta_w, scale_freq);
> if (weight) {
> - sa->load_sum += weight * scaled_delta_w;
> + sa->load_sum += weight * delta_w;
> if (cfs_rq) {
> cfs_rq->runnable_load_sum +=
> - weight * scaled_delta_w;
> + weight * delta_w;
> }
> }
> if (running)
> - sa->util_sum += scaled_delta_w * scale_cpu;
> + sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
>
> delta -= delta_w;
>
> @@ -2627,25 +2627,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>
> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> contrib = __compute_runnable_contrib(periods);
> - contrib = cap_scale(contrib, scale_freq);
> if (weight) {
> sa->load_sum += weight * contrib;
> if (cfs_rq)
> cfs_rq->runnable_load_sum += weight * contrib;
> }
> if (running)
> - sa->util_sum += contrib * scale_cpu;
> + sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
> }
>
> /* Remainder of delta accrued against u_0` */
> - scaled_delta = cap_scale(delta, scale_freq);
> if (weight) {
> - sa->load_sum += weight * scaled_delta;
> + sa->load_sum += weight * delta;
> if (cfs_rq)
> - cfs_rq->runnable_load_sum += weight * scaled_delta;
> + cfs_rq->runnable_load_sum += weight * delta;
> }
> if (running)
> - sa->util_sum += scaled_delta * scale_cpu;
> + sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
>
> sa->period_contrib += delta;
>
> --
> 1.9.1
Hi Morten,
Thanks for the review and sorry for the late reply
On 8 December 2015 at 18:04, Morten Rasmussen <[email protected]> wrote:
> On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
>> The current implementation of load tracking invariance scales the load
>> tracking value with current frequency and uarch performance (only for
>> utilization) of the CPU.
>>
>> One main result of the current formula is that the figures are capped by
>> the current capacity of the CPU. This limitation is the main reason of not
>> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
>> of load_avg because capping the load can generate erroneous system load
>> statistic as described with this example [1]
>
> The reason why we don't want to scale load_avg with regard to uarch
> capacity (as we do with util_avg) is explained in
> e3279a2e6d697e00e74f905851ee7cf532f72b2d as well.
>
>> Instead of scaling the complete value of PELT algo, we should only scale
>> the running time by the current capacity of the CPU. It seems more correct
>> to only scale the running time because the non running time of a task
>> (sleeping or waiting for a runqueue) is the same whatever the current freq
>> and the compute capacity of the CPU.
>
> You seem to imply that we currently scale running, waiting, and sleeping
> time. That is not the case. We scale running and waiting time, but not
> sleeping time. Whether we should scale waiting time or not is a good
In fact, I was referring to the same equation than you use below
\sum (0..n) u_n * c_n * y^n
to say that the complete value of PELT is scaled because we scale all
u_n which include idle fractions.
Note that this doesn't change anything at the end as u_n is null during idle.
> question. The waiting time is affected by the running time of the other
> tasks on the cfs_rq, so on one hand it seems a bit inconsistent to scale
> one and not the other. On the other hand, not scaling waiting time would
> make tasks that spend a lot of time waiting appear bigger, which could
> an advantage as it would make load-balancing more prone to spread tasks.
> A third alternative is to drop the scaling of load_avg completely, but
I don't think it's a good idea to limit the usage of load_avg to
system that are overloaded and at the opposite to limit the util for
not overloaded system. The boundary between both states is rarely
clear and you often have part of the system that can be overloaded
while the other part is not.
> it is still needed for util_avg as we want util_avg to be invariant to
> frequency and uarch scaling.
>
>> Then, one main advantage of this change is that the load of a task can
>> reach max value whatever the current freq and the uarch of the CPU on which
>> it run. It will just take more time at a lower freq than a max freq or on a
>> "little" CPU compared to a "big" one. The load and the utilization stay
>> invariant across system so we can still compared them between CPU but with
>> a wider range of values.
>
> Just removing scaling of waiting time and applying scaling by current
> capacity (including uarch) to the running time will not make load_avg
> reach the max value for tasks running alone on a cpu. Since the task
> isn't waiting at all (it is alone) all contributions are running time
> which is scaled, IIUC, and hence the result is still capped by the
> current capacity of the cpu. But that doesn't match your example results
> further down if I read them correctly.
In the current implementation, we scale the full contribution of each
fraction of time in the PELT equation so if the capacity of a CPU
can't be larger than Clocal_max because of frequency scaling and/or
uarch, we have
\sum (0..n) u_n * c_n * y^n <= Clocal_max * \sum (0..n) u_n * y^n
With the proposed way to take into account the uarch and the current
frequency, we scale the time that elapses before accounting it into a
segment of the equation. As a summary, the delta time is scaled to
reflect the amount of time that would have been used at the max
capacity of the system. So if the frequency is half max freq, the time
that will be accounted, will be half the really elapsed time. In
parallel, the duration of the job will be twice longer so we will have
the same amount of time accounted at the end.
With this patch, the PELT equation stays \sum (0..n) u_n * y^n
whatever the uarch and the current frequency. The main benefits is
that we can reach the max value whatever uarch and current freq. The
impact of the uarch and the current frequency is taken into account
before the equation when we are accounting the time into a segment.
>
> The changes made in the code of this patch are quite subtle, but very
> important as they change the behaviour of the PELT geometric series
> quite a lot. It is much more than just changing whether we scale waiting
> time and apply uarch scaling to running time of load_avg or not. I
> think we need to understand the math behind this patch to understand how
> the PELT metrics are affected because I think this patch changes some of
> the fundamentals originally described by Paul and Ben.
As explained above, the PELT equation in itself will be no more
impacted by freq and uarch as their impacts are taken into account
outside.
>
> Instead of scaling the contribution of each 1024us segment like we
> currently do, this patch is essentially warping time and lumps it
> together and let it contribute fully but skips decays. It is rather hard
> to explain, but the result is that the patch affects both load_avg and
> util_avg, and it breaks scale-invariance.
>
> Executive summary: Scaling time (delta) instead of the individual
> segment contributions breaks scale-invariance. The net result on
> load_avg seems to be nothing apart from slower reaction time.
>
> That is how I see after having tested it a bit. But I could be getting
> it all wrong. :-/
For me it's not slowing the reaction time but reflecting more
accurately the real behavior.
Let take the example of a task with a computation that take 10ms at
max capacity.
At max capacity, the job will run 10ms and the util value will be
199 as well as the load_avg.
At half frequency, the job will run 20ms instead of 10ms . With the
current scale-invariance implementation, the util_avg value will be
180 as well as the load_avg. The load_avg value would have been 360 if
we remove all kind of scale-invariance as you proposed above.
With the proposed implementation, the util value will be 199 as well
as the load_avg because we will add the same amount segment .
The PELT implementation is about calculating the load/utilization of a
task/CPU. It uses the time to reflect the amount of work done by a
task. In a system has the same fix compute capacity per second for all
cpus, it's fine to only use the time. But when we have different
compute capacity across the system, we have to reflect this difference
in the time that is added to a segment.
>
>
> Much more detail:
>
> Original geometric series:
>
> \sum (0..n) u_n * y^n
>
> Current geometric series with scale invariance:
>
> \sum (0..n) u_n * c_n * y^n
>
> In reality we only approximate having the capacity scaling for each
> segment as don't enforce PELT updates for each capacity change due to
> frequency scaling.
>
> In this patch scaling is applied to the entire delta since last update
we probably don't have the same meaning of the last update but that's
exactly the same as the current implementation
> instead of each individual segment. That gives us a very interesting
> time warping effect when updates happen less frequently than every 1ms.
> On cpus with reduced capacity the delta is reduced and all the math is
> done as if less time had passed since last update which introduces an
> error with regard to the decay of the series as we segments of time with
> zero contribution.
This happen because the compute capacity is lower and the actual work
done during this segment of time is lower too. The duration of the
computation will be longer and at the end, we will have the same
amount of segments of time for the job
>
> It is probably easier described with an example:
>
> We have one periodic task with a period of 4ms. Busy time per activation
> is 1ms at 100% capacity. The task has been running forever (>350ms) and
It's not clear for me why you want a task that was running forever
before the use case ? Apart from starting at max value or more
precisely 33% of max value when f=33% ?
> we consider the load_avg calculations at enqueue/dequeue, which is
> should the most common update points for this scenario besides the tick
> updates.
>
> task states
> s = sleeping
> R = running (scheduled)
>
> pelt
> d = decay segment (load_avg * y, y^32 = 0.5)
> [0..1024] = segment contribution (including any scaling)
> U = __update_load_avg() is called
>
> f = 100%
> | 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
> task | s | R | s | s | s | R |
> pelt ml | d U 1024 U d | d | d U 1024 U
> patch | d U 1024 U d | d | d U 1024 U
>
> f = 33%
> | 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
> task | s | R | R | R | s | R |
> pelt ml | d U 341y^2 | 341y | 341 U d U 341y^2 |
> patch | d U 1024 | 0 | 0 U d U 1024 |
>
> In the first case, f = 100%, the update after the busy period is
> complete we decay load_avg by one period (segment) and add a
> contribution of 1024. We are at 100% so it is a full contribution for
> this segment both with and without this patch. The task enqueue update
> accounts for the sleeping time by decaying load_avg three periods. The
> same in both cases. We could say that the contributions of a full cycle
> of the the task is:
>
> f_100% cycle = 1024 + decay(4)
>
> If we reduce the capacity to 33%, things look a bit different. In
> mainline, the dequeque update after the busy period would decay three
> periods and add \sum (i = 2..0) 0.33*1024*y^i to account for the three
> busy segments. The enqueue update decays the load_avg by one segment.
> The full cycle contribution becomes:
>
> Mainline:
> f_33% cycle = 341*y^2 + 341*y + 341 + decay(4)
>
> With this patch it is different. At the dequeue update we scale the time
> delta instead of the contribution, such that delta = 0.33*delta, so the
> calculation is based on only one period (segment) has passed. Hence we
> decay by one segment and add 1024, but still set the update point to the
> true timestamp so the following update doesn't take the two remaining
> segments into account. The enqueue update decays the load_avg by one
> segment, just like it does in mainline. The full cycle contribution
> becomes:
>
> Patch:
> f_33% cycle = 1024 + decay(2)
>
> This is clearly different from mainline. Not only is the busy
> contribution higher, 1024 > 341*y^2 + 341*y + 341, since y < 1, but we
> also decay less. The result is an inflation of the load_avg and util_avg
So the busy contribution is exactly the same as fmax whereas it's not
the case with current implementation as you mentioned above. But the
number of "decay" is not the same whereas the current implementation
have the same number of decay.
I have to look at how i can improve the decay accuracy.
> metrics for tasks that run for more than 1ms at the time if
> __update_load_avg() isn't called every 1ms.
>
> I did a quick test to confirm this using a single periodic task and
> changing the compute capacity.
>
> util_avg
> capacity mainline patch
> 1024 ~359 ~352
> 512 ~340 ~534
>
> Execution time went from 1.4ms to 2.8ms per activation without
> overloading the cpu.
At the opposite, there are some use cases where the proposed util_avg
is more accurate. In fact, this mainly depends of which part of the
decay or the load is preponderant in the value of util_avg/load_avg
As soon as the running time is around 100ms, we "saturate" the
load_avg or the util_avg. So a task that runs 50ms each 150ms at max
capacity will be around 695 for both util_avg and load_avg, whereas it
will be around 470 for util_avg and 940 for load_avg at half capacity
(due uarch) as the duration becomes 100ms. For this example, we have
lost the scale invariance with current implementation. With the
proposed changes, the util_avg and the load_avg would be 750.
>
> The fundamental idea in scale invariance is that util_avg should be
> comparable between cpu at any capacity as long none of them are
> over-utilized. This isn't preserved by the patch in its current form.
>
>> With this change, we don't have to test if a CPU is overloaded or not in
>> order to use one metric (util) or another (load) as all metrics are always
>> valid.
>
> I'm not sure what you mean by always valid. util_avg is still not a
> meaningful metric for tasks running on over-utilized cpus, so it can not
> be used unconditionally. If util_avg > capacity we still have no clue if
> the task can fit on a different cpu with higher capacity.
That's one side goal of changing the way the scale invariance is taken
into account in util_avg. Being > current capacity can still be
meaningful
>
>> I have put below some examples of duration to reach some typical load value
>> according to the capacity of the CPU with current implementation
>> and with this patch.
>>
>> Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
>> 972 (95%) 138ms not reachable 276ms
>> 486 (47.5%) 30ms 138ms 60ms
>> 256 (25%) 13ms 32ms 26ms
>
> I assume that these are numbers for util_avg and not load_avg as said in
It can be both. half capacity can refer to frequency invariance or
uarch invariance
> the text above. It confuses me a little bit as you started out by
> talking about the lack of uarch scaling of load_avg and propose to
> change that, not util_avg.
The goal is to impact both util_avg and load_avg:
Being able to add the uarch in the calculation of the load_avg to
improve the fairness in presence of cpus with different capacity.
Being able to use the util_avg in a wider time scale.
>
> The equivalent table for load_avg would something like this:
>
> load_avg (%) max capacity half capacity(mainline) half capacity(w/ patch)
> 972 (95%) 138ms 138ms 276ms
> 486 (47.5%) 30ms 30ms 60ms
> 256 (25%) 13ms 13ms 26ms
>
> load_avg does reach max capacity as it is. The patch just makes it
> happen at a slower pace, which I'm not sure is a good or bad thing.
>
>> We can see that at half capacity, we need twice the duration of max
>> capacity with this patch whereas we have a non linear increase of the
>> duration with current implementation.
>
> Is it a problem that the time to reach a certain value is not linear?
This doesn't help in the scale-invariance
Thanks,
Vincent
>
> It is still somewhat unclear to me why we would want this change. Adding
> uarch scaling to load_avg but then modify the geometric series so the
> end result is the same except that it now reacts slower at lower
> capacities seems a bit strange.
>
>>
>> [1] https://lkml.org/lkml/2014/12/18/128
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 28 +++++++++++++---------------
>> 1 file changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 824aa9f..f2a18e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2560,10 +2560,9 @@ static __always_inline int
>> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> unsigned long weight, int running, struct cfs_rq *cfs_rq)
>> {
>> - u64 delta, scaled_delta, periods;
>> + u64 delta, periods;
>> u32 contrib;
>> - unsigned int delta_w, scaled_delta_w, decayed = 0;
>> - unsigned long scale_freq, scale_cpu;
>> + unsigned int delta_w, decayed = 0;
>>
>> delta = now - sa->last_update_time;
>> /*
>> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> return 0;
>> sa->last_update_time = now;
>>
>> - scale_freq = arch_scale_freq_capacity(NULL, cpu);
>> - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>> + if (running) {
>> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
>> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
>
> This is where the time warping happens. delta is used to determine the
> number of periods (segments) since last update. Scaling this, as opposed
> to the contributions for each segment individually, can lead to
> disappearing segments.
On 14 December 2015 at 01:26, Yuyang Du <[email protected]> wrote:
> Hi Vincent,
>
> I don't quite catch what this is doing, maybe I need more time
> to ramp up to the gory detail difficult like this.
>
> Do you scale or not scale? You seem removed the scaling, but added it
> after "Remainder of delta accrued against u_0"..
I'm scaling the time before taking it in the pelt algorithm. My reply
to Morten's comment tries to explain more deeply what i'm trying to
achieve
Thanks,
Vincent
>
> Thanks,
> Yuyang
>
> On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
>> The current implementation of load tracking invariance scales the load
>> tracking value with current frequency and uarch performance (only for
>> utilization) of the CPU.
>>
>> One main result of the current formula is that the figures are capped by
>> the current capacity of the CPU. This limitation is the main reason of not
>> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
>> of load_avg because capping the load can generate erroneous system load
>> statistic as described with this example [1]
>>
>> Instead of scaling the complete value of PELT algo, we should only scale
>> the running time by the current capacity of the CPU. It seems more correct
>> to only scale the running time because the non running time of a task
>> (sleeping or waiting for a runqueue) is the same whatever the current freq
>> and the compute capacity of the CPU.
>>
>> Then, one main advantage of this change is that the load of a task can
>> reach max value whatever the current freq and the uarch of the CPU on which
>> it run. It will just take more time at a lower freq than a max freq or on a
>> "little" CPU compared to a "big" one. The load and the utilization stay
>> invariant across system so we can still compared them between CPU but with
>> a wider range of values.
>>
>> With this change, we don't have to test if a CPU is overloaded or not in
>> order to use one metric (util) or another (load) as all metrics are always
>> valid.
>>
>> I have put below some examples of duration to reach some typical load value
>> according to the capacity of the CPU with current implementation
>> and with this patch.
>>
>> Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
>> 972 (95%) 138ms not reachable 276ms
>> 486 (47.5%) 30ms 138ms 60ms
>> 256 (25%) 13ms 32ms 26ms
>>
>> We can see that at half capacity, we need twice the duration of max
>> capacity with this patch whereas we have a non linear increase of the
>> duration with current implementation.
>>
>> [1] https://lkml.org/lkml/2014/12/18/128
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 28 +++++++++++++---------------
>> 1 file changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 824aa9f..f2a18e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2560,10 +2560,9 @@ static __always_inline int
>> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> unsigned long weight, int running, struct cfs_rq *cfs_rq)
>> {
>> - u64 delta, scaled_delta, periods;
>> + u64 delta, periods;
>> u32 contrib;
>> - unsigned int delta_w, scaled_delta_w, decayed = 0;
>> - unsigned long scale_freq, scale_cpu;
>> + unsigned int delta_w, decayed = 0;
>>
>> delta = now - sa->last_update_time;
>> /*
>> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> return 0;
>> sa->last_update_time = now;
>>
>> - scale_freq = arch_scale_freq_capacity(NULL, cpu);
>> - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>> + if (running) {
>> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
>> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
>> + }
>>
>> /* delta_w is the amount already accumulated against our next period */
>> delta_w = sa->period_contrib;
>> @@ -2601,16 +2602,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> * period and accrue it.
>> */
>> delta_w = 1024 - delta_w;
>> - scaled_delta_w = cap_scale(delta_w, scale_freq);
>> if (weight) {
>> - sa->load_sum += weight * scaled_delta_w;
>> + sa->load_sum += weight * delta_w;
>> if (cfs_rq) {
>> cfs_rq->runnable_load_sum +=
>> - weight * scaled_delta_w;
>> + weight * delta_w;
>> }
>> }
>> if (running)
>> - sa->util_sum += scaled_delta_w * scale_cpu;
>> + sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
>>
>> delta -= delta_w;
>>
>> @@ -2627,25 +2627,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
>> contrib = __compute_runnable_contrib(periods);
>> - contrib = cap_scale(contrib, scale_freq);
>> if (weight) {
>> sa->load_sum += weight * contrib;
>> if (cfs_rq)
>> cfs_rq->runnable_load_sum += weight * contrib;
>> }
>> if (running)
>> - sa->util_sum += contrib * scale_cpu;
>> + sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
>> }
>>
>> /* Remainder of delta accrued against u_0` */
>> - scaled_delta = cap_scale(delta, scale_freq);
>> if (weight) {
>> - sa->load_sum += weight * scaled_delta;
>> + sa->load_sum += weight * delta;
>> if (cfs_rq)
>> - cfs_rq->runnable_load_sum += weight * scaled_delta;
>> + cfs_rq->runnable_load_sum += weight * delta;
>> }
>> if (running)
>> - sa->util_sum += scaled_delta * scale_cpu;
>> + sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
>>
>> sa->period_contrib += delta;
>>
>> --
>> 1.9.1