Numerous proposals have been posted in the past for changes to per-entity
load-tracking that introduce scale invariance, utilization tracking, and
accounting for blocked load. This RFC patch set attempts to merge the
features of these proposals.
The patch set is based on top of Vincent's cpu capacity and usage patch
set [1], which introduces the scheduler side of frequency-invariant
utilization tracking. This patch set adds:
Patches:
1: Adds frequency scale invariance to runnable load tracking
(weighted_cpuload() and friends). Like utilization in [1], load is now
also scaled with frequency relative to the max frequency.
2: Adds cpu scale invariance to both load and utilization. This second
scaling compensates for differences in max performance between cpus,
e.g. cpus with different max OPPs or different cpu uarchs (ARM
big.LITTLE).
3-5: ARM arch implementation of arch_scale_freq_capacity() to enable
frequency invariance of both utilization and load.
6: Update ARM arch implementation of arch_scale_cpu_capacity().
7: Remove scaling of cpu usage by capacity_orig. With the introduction of
cpu scale-invariance, the per-entity tracking already does this scaling.
Experimental patches:
8: Add tracking of blocked utilization (usage) to the utilization
introduced in [1].
9: Change get_cpu_usage() to include blocked utilization (usage).
10: Change weighted_cpuload() to include blocked load. The implications of
this change need further testing and very likely more changes.
The last three patches are quite likely to cause some trouble and
require some modifications to the users of get_cpu_usage() and
weighted_cpuload(). An audit of the load-balance code is needed. The
blocked load/utilization patches should be considered experimental, but
they are part of what is needed to add invariance and blocked
load/utilization to per-entity load-tracking.
The purpose of this whole exercise is to get more accurate load and
utilization tracking for systems with frequency scaling and/or cpus with
different uarchs.
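As a rough sketch of how the two correction factors combine (names follow
the patches below; this is illustrative only, not the actual code):

  scale_freq = arch_scale_freq_capacity(NULL, cpu); /* curr_freq/max_freq, << 10 */
  scale_cpu  = arch_scale_cpu_capacity(NULL, cpu);  /* cpu_perf/max_cpu_perf, << 10 */
  scaled_delta = (delta * ((scale_freq * scale_cpu) >> SCHED_CAPACITY_SHIFT))
			>> SCHED_CAPACITY_SHIFT;

so each segment of the geometric series is scaled by both the current
frequency and the relative max performance of the cpu it was accrued on.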
[1] https://lkml.org/lkml/2014/11/3/535
Dietmar Eggemann (5):
sched: Make load tracking frequency scale-invariant
sched: Make usage and load tracking cpu scale-invariant
ARM: vexpress: Add CPU clock-frequencies to TC2 device-tree
arm: Cpu invariant scheduler load-tracking support
sched: Get rid of scaling usage by cpu_capacity_orig
Morten Rasmussen (5):
cpufreq: Architecture specific callback for frequency changes
arm: Frequency invariant scheduler load-tracking support
sched: Track blocked utilization contributions
sched: Include blocked utilization in usage tracking
sched: Include blocked load in weighted_cpuload
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 ++
arch/arm/kernel/topology.c | 97 +++++++++++++++++-------------
drivers/cpufreq/cpufreq.c | 10 ++-
kernel/sched/fair.c | 91 ++++++++++++++++++++++------
kernel/sched/sched.h | 8 ++-
5 files changed, 147 insertions(+), 64 deletions(-)
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Now that cfs_rq::utilization_load_avg is not only frequency invariant but
also cpu (uarch plus max system frequency) invariant, both frequency and
cpu scaling happen as part of the load tracking itself.
Therefore cfs_rq::utilization_load_avg no longer has to be scaled by the
original capacity of the cpu.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4c989..090223f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4806,12 +4806,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long capacity = capacity_orig_of(cpu);
if (usage >= SCHED_LOAD_SCALE)
- return capacity;
+ return capacity_orig_of(cpu);
- return (usage * capacity) >> SCHED_LOAD_SHIFT;
+ return usage;
}
/*
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Reuses the existing infrastructure for cpu_scale to provide the scheduler
with a cpu scaling correction factor for more accurate load-tracking.
This factor comprises a micro-architectural part, based on the cpu
efficiency value of a cpu, as well as a platform-wide max frequency part,
which relates to the dtb property clock-frequency of a cpu node.
The calculation of cpu_scale, the return value of
arch_scale_cpu_capacity(), changes from:
capacity / middle_capacity
with capacity = (clock_frequency >> 20) * cpu_efficiency
to:
SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf
The range of the cpu_scale value changes from
[0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE].
The functionality to calculate middle_capacity, which corresponds to an
'average' cpu, has been removed since the scaling is now done
differently.
If either the cpu efficiency or the clock-frequency value is missing for
a cpu, no cpu scaling is done for any cpu.
The platform-wide max frequency part of the factor should not be confused
with the frequency-invariant scheduler load-tracking support, which deals
with frequency-related scaling due to DVFS functionality on a cpu.
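As an illustration, using the TC2 clock-frequency values added elsewhere
in this series and assuming the existing table_efficiency entries of 3891
(Cortex-A15) and 2048 (Cortex-A7):

  A15: cpu_perf = (1000000000 >> 20) * 3891 = 953 * 3891 = 3708123
  A7:  cpu_perf = ( 800000000 >> 20) * 2048 = 762 * 2048 = 1560576

  A15: cpu_scale = 1024 * 3708123 / 3708123 = 1024
  A7:  cpu_scale = 1024 * 1560576 / 3708123 =  430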
Cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 64 +++++++++++++++++-----------------------------
1 file changed, 23 insertions(+), 41 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 73ef337..e4c59fa 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -62,9 +62,7 @@ struct cpu_efficiency {
* Table of relative efficiency of each processors
* The efficiency value must fit in 20bit and the final
* cpu_scale value must be in the range
- * 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
- * in order to return at most 1 when DIV_ROUND_CLOSEST
- * is used to compute the capacity of a CPU.
+ * 0 < cpu_scale < SCHED_CAPACITY_SCALE.
* Processors that are not defined in the table,
* use the default SCHED_CAPACITY_SCALE value for cpu_scale.
*/
@@ -77,24 +75,18 @@ static const struct cpu_efficiency table_efficiency[] = {
static unsigned long *__cpu_capacity;
#define cpu_capacity(cpu) __cpu_capacity[cpu]
-static unsigned long middle_capacity = 1;
+static unsigned long max_cpu_perf;
/*
* Iterate all CPUs' descriptor in DT and compute the efficiency
- * (as per table_efficiency). Also calculate a middle efficiency
- * as close as possible to (max{eff_i} - min{eff_i}) / 2
- * This is later used to scale the cpu_capacity field such that an
- * 'average' CPU is of middle capacity. Also see the comments near
- * table_efficiency[] and update_cpu_capacity().
+ * (as per table_efficiency). Calculate the max cpu performance too.
*/
+
static void __init parse_dt_topology(void)
{
const struct cpu_efficiency *cpu_eff;
struct device_node *cn = NULL;
- unsigned long min_capacity = ULONG_MAX;
- unsigned long max_capacity = 0;
- unsigned long capacity = 0;
- int cpu = 0;
+ int cpu = 0, i = 0;
__cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
GFP_NOWAIT);
@@ -102,6 +94,7 @@ static void __init parse_dt_topology(void)
for_each_possible_cpu(cpu) {
const u32 *rate;
int len;
+ unsigned long cpu_perf;
/* too early to use cpu->of_node */
cn = of_get_cpu_node(cpu, NULL);
@@ -124,46 +117,35 @@ static void __init parse_dt_topology(void)
continue;
}
- capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
-
- /* Save min capacity of the system */
- if (capacity < min_capacity)
- min_capacity = capacity;
-
- /* Save max capacity of the system */
- if (capacity > max_capacity)
- max_capacity = capacity;
-
- cpu_capacity(cpu) = capacity;
+ cpu_perf = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+ cpu_capacity(cpu) = cpu_perf;
+ max_cpu_perf = max(max_cpu_perf, cpu_perf);
+ i++;
}
- /* If min and max capacities are equals, we bypass the update of the
- * cpu_scale because all CPUs have the same capacity. Otherwise, we
- * compute a middle_capacity factor that will ensure that the capacity
- * of an 'average' CPU of the system will be as close as possible to
- * SCHED_CAPACITY_SCALE, which is the default value, but with the
- * constraint explained near table_efficiency[].
- */
- if (4*max_capacity < (3*(max_capacity + min_capacity)))
- middle_capacity = (min_capacity + max_capacity)
- >> (SCHED_CAPACITY_SHIFT+1);
- else
- middle_capacity = ((max_capacity / 3)
- >> (SCHED_CAPACITY_SHIFT-1)) + 1;
-
+ if (i < num_possible_cpus())
+ max_cpu_perf = 0;
}
/*
* Look for a customed capacity of a CPU in the cpu_capacity table during the
* boot. The update of all CPUs is in O(n^2) for heteregeneous system but the
- * function returns directly for SMP system.
+ * function returns directly for SMP systems or if there is no complete set
+ * of cpu efficiency, clock frequency data for each cpu.
*/
static void update_cpu_capacity(unsigned int cpu)
{
- if (!cpu_capacity(cpu))
+ unsigned long capacity = cpu_capacity(cpu);
+
+ if (!capacity || !max_cpu_perf) {
+ cpu_capacity(cpu) = 0;
return;
+ }
+
+ capacity *= SCHED_CAPACITY_SCALE;
+ capacity /= max_cpu_perf;
- set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
+ set_capacity_scale(cpu, capacity);
printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
cpu, arch_scale_cpu_capacity(NULL, cpu));
--
1.9.1
From: Morten Rasmussen <[email protected]>
Architectures that don't have any other means for tracking cpu frequency
changes need a callback from cpufreq to implement a scaling factor to
enable scale-invariant per-entity load-tracking in the scheduler.
To compute the scale invariance correction factor the architecture would
need to know both the max frequency and the current frequency. This
patch defines weak functions for setting both from cpufreq.
Related architecture specific functions use weak function definitions.
The same approach is followed here.
These callbacks can be used to implement frequency scaling of cpu
capacity later.
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
drivers/cpufreq/cpufreq.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 644b54e..1b17608 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
}
#endif
+void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
+
+void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
+
static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, unsigned int state)
{
@@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
pr_debug("FREQ: %lu - CPU: %lu\n",
(unsigned long)freqs->new, (unsigned long)freqs->cpu);
trace_cpu_frequency(freqs->new, freqs->cpu);
+ arch_scale_set_curr_freq(freqs->cpu, freqs->new);
srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
CPUFREQ_POSTCHANGE, freqs);
if (likely(policy) && likely(policy->cpu == freqs->cpu))
@@ -2164,7 +2169,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
struct cpufreq_policy *new_policy)
{
struct cpufreq_governor *old_gov;
- int ret;
+ int ret, cpu;
pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
new_policy->cpu, new_policy->min, new_policy->max);
@@ -2202,6 +2207,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
policy->min = new_policy->min;
policy->max = new_policy->max;
+ for_each_cpu(cpu, policy->cpus)
+ arch_scale_set_max_freq(cpu, policy->max);
+
pr_debug("new min and max freqs are %u - %u kHz\n",
policy->min, policy->max);
--
1.9.1
Introduces the blocked utilization, the utilization counterpart to
cfs_rq->utilization_load_avg. It is the sum of the utilization
contributions of sched_entities that were recently on the cfs_rq and are
currently blocked. Combined with the sum of the contributions of entities
currently on the cfs_rq or currently running
(cfs_rq->utilization_load_avg), this can provide a more stable average
view of the cpu usage.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
kernel/sched/sched.h | 8 ++++++--
2 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 090223f..adf64df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2778,6 +2778,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
cfs_rq->blocked_load_avg = 0;
}
+static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
+ long utilization_contrib)
+{
+ if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
+ cfs_rq->utilization_blocked_avg -= utilization_contrib;
+ else
+ cfs_rq->utilization_blocked_avg = 0;
+}
+
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
/* Update a sched_entity's runnable average */
@@ -2813,6 +2822,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
cfs_rq->utilization_load_avg += utilization_delta;
} else {
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ -utilization_delta);
}
}
@@ -2830,14 +2841,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
return;
if (atomic_long_read(&cfs_rq->removed_load)) {
- unsigned long removed_load;
+ unsigned long removed_load, removed_utilization;
removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
+ removed_utilization =
+ atomic_long_xchg(&cfs_rq->removed_utilization, 0);
subtract_blocked_load_contrib(cfs_rq, removed_load);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ removed_utilization);
}
if (decays) {
cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
decays);
+ cfs_rq->utilization_blocked_avg =
+ decay_load(cfs_rq->utilization_blocked_avg, decays);
atomic64_add(decays, &cfs_rq->decay_counter);
cfs_rq->last_decay = now;
}
@@ -2884,6 +2901,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
/* migrated tasks did not contribute to our blocked load */
if (wakeup) {
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
update_entity_load_avg(se, 0);
}
@@ -2910,6 +2929,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
} /* migrations, e.g. sleep=0 leave decay_count == 0 */
}
@@ -4929,6 +4950,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
se->avg.decay_count = -__synchronize_entity_decay(se);
atomic_long_add(se->avg.load_avg_contrib,
&cfs_rq->removed_load);
+ atomic_long_add(se->avg.utilization_avg_contrib,
+ &cfs_rq->removed_utilization);
}
/* We have migrated, no longer consider this task hot */
@@ -7944,6 +7967,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
if (se->avg.decay_count) {
__synchronize_entity_decay(se);
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
}
#endif
}
@@ -8003,6 +8028,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic_long_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_utilization, 0);
#endif
}
@@ -8055,6 +8081,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
*/
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
#endif
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e402133..208237f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -368,11 +368,15 @@ struct cfs_rq {
* the blocked sched_entities on the rq.
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq.
+ * utilization_blocked_avg is the utilization equivalent of
+ * blocked_load_avg, i.e. the sum of running contributions of blocked
+ * sched_entities associated with the rq.
*/
- unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long utilization_load_avg, utilization_blocked_avg;
atomic64_t decay_counter;
u64 last_decay;
- atomic_long_t removed_load;
+ atomic_long_t removed_load, removed_utilization;
#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
--
1.9.1
Add the blocked utilization contribution to group sched_entity
utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
With this change cpu usage now includes recent usage by currently
non-runnable tasks, hence it provides a more stable view of the cpu
usage. It does, however, also mean that the meaning of usage is changed:
A cpu may be momentarily idle while usage > 0. It can no longer be
assumed that cpu usage > 0 implies runnable tasks on the rq.
cfs_rq->utilization_load_avg or nr_running should be used instead to get
the current rq status.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index adf64df..bd950b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2764,7 +2764,8 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
__update_task_entity_utilization(se);
else
se->avg.utilization_avg_contrib =
- group_cfs_rq(se)->utilization_load_avg;
+ group_cfs_rq(se)->utilization_load_avg +
+ group_cfs_rq(se)->utilization_blocked_avg;
return se->avg.utilization_avg_contrib - old_contrib;
}
@@ -4827,11 +4828,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
- if (usage >= SCHED_LOAD_SCALE)
+ if (usage + blocked >= SCHED_LOAD_SCALE)
return capacity_orig_of(cpu);
- return usage;
+ return usage + blocked;
}
/*
--
1.9.1
Adds blocked_load_avg to weighted_cpuload() to take recently runnable
tasks into account in load-balancing decisions. This changes the nature
of weighted_cpuload() as it may be > 0 while there are currently no
runnable tasks on the cpu rq. Hence care must be taken in the
load-balance code to use cfs_rq->runnable_load_avg or nr_running when
the current rq status is needed.
This patch is highly experimental and will probably require additional
updates of the users of weighted_cpuload().
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd950b2..ad0ebb7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4349,7 +4349,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->cfs.runnable_load_avg;
+ return cpu_rq(cpu)->cfs.runnable_load_avg
+ + cpu_rq(cpu)->cfs.blocked_load_avg;
}
/*
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Besides the existing frequency scale-invariance correction factor, apply
a cpu scale-invariance correction factor to usage and load tracking.
Cpu scale-invariance takes into consideration cpu performance deviations
due to micro-architectural differences (i.e. instructions per second)
between cpus in HMP systems (e.g. big.LITTLE) and differences in the
frequency value of the highest OPP between cpus in SMP systems.
Each segment of the sched_avg::{running_avg_sum, runnable_avg_sum}
geometric series is now scaled by the cpu performance factor too, so the
sched_avg::{utilization_avg_contrib, load_avg_contrib} of each entity will
be invariant with respect to the particular cpu of the HMP/SMP system it
is gathered on. As a result, cfs_rq::runnable_load_avg, which is the sum
of sched_avg::load_avg_contrib, becomes cpu scale-invariant too.
So the {usage, load} level that is returned by {get_cpu_usage,
weighted_cpuload} stays relative to the max cpu performance of the system.
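As a rough example (numbers are purely illustrative): for a cpu with
arch_scale_cpu_capacity() = 430 running at half of its max frequency
(arch_scale_freq_capacity() = 512), the combined factor becomes

  scale_factor = (512 * 430) >> SCHED_CAPACITY_SHIFT = 215

so 1024 units of busy time accrue roughly 215 units into the geometric
series.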
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 27 ++++++++++++++++++++++-----
1 file changed, 22 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b41f03d..5c4c989 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2473,6 +2473,21 @@ static u32 __compute_runnable_contrib(u64 n)
}
unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
+
+static unsigned long contrib_scale_factor(int cpu)
+{
+ unsigned long scale_factor;
+
+ scale_factor = arch_scale_freq_capacity(NULL, cpu);
+ scale_factor *= arch_scale_cpu_capacity(NULL, cpu);
+ scale_factor >>= SCHED_CAPACITY_SHIFT;
+
+ return scale_factor;
+}
+
+#define scale_contrib(contrib, scale_factor) \
+ ((contrib * scale_factor) >> SCHED_CAPACITY_SHIFT)
/*
* We can represent the historical contribution to runnable average as the
@@ -2510,7 +2525,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
u64 delta, scaled_delta, periods;
u32 runnable_contrib, scaled_runnable_contrib;
int delta_w, scaled_delta_w, decayed = 0;
- unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ unsigned long scale_factor;
delta = now - sa->last_runnable_update;
/*
@@ -2531,6 +2546,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
return 0;
sa->last_runnable_update = now;
+ scale_factor = contrib_scale_factor(cpu);
+
/* delta_w is the amount already accumulated against our next period */
delta_w = sa->avg_period % 1024;
if (delta + delta_w >= 1024) {
@@ -2543,7 +2560,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
- scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
+ scaled_delta_w = scale_contrib(delta_w, scale_factor);
if (runnable)
sa->runnable_avg_sum += scaled_delta_w;
@@ -2566,8 +2583,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
- scaled_runnable_contrib = (runnable_contrib * scale_freq)
- >> SCHED_CAPACITY_SHIFT;
+ scaled_runnable_contrib =
+ scale_contrib(runnable_contrib, scale_factor);
if (runnable)
sa->runnable_avg_sum += scaled_runnable_contrib;
@@ -2577,7 +2594,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
}
/* Remainder of delta accrued against u_0` */
- scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
+ scaled_delta = scale_contrib(delta, scale_factor);
if (runnable)
sa->runnable_avg_sum += scaled_delta;
--
1.9.1
From: Dietmar Eggemann <[email protected]>
To enable parse_dt_topology() [arch/arm/kernel/topology.c] to parse the
clock frequency and cpu efficiency values used to scale the relative
capacity of the cpus, the clock-frequency property has to be provided
within the cpu nodes of the dts file.
The patch is a copy of 'ARM: vexpress: Add CPU clock-frequencies to TC2
device-tree' (commit 8f15973ef8c3) taken from Linaro Stable Kernel (LSK)
massaged into mainline.
Cc: Jon Medhurst <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Pawel Moll <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Ian Campbell <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 322fd15..62f89e2 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -39,6 +39,7 @@
reg = <0>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};
cpu1: cpu@1 {
@@ -47,6 +48,7 @@
reg = <1>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};
cpu2: cpu@2 {
@@ -55,6 +57,7 @@
reg = <0x100>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
cpu3: cpu@3 {
@@ -63,6 +66,7 @@
reg = <0x101>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
cpu4: cpu@4 {
@@ -71,6 +75,7 @@
reg = <0x102>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
idle-states {
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Apply a frequency scale-invariance correction factor to load tracking.
Each segment of the sched_avg::runnable_avg_sum geometric series is now
scaled by the current frequency, so the sched_avg::load_avg_contrib of
each entity will be invariant with frequency scaling. As a result,
cfs_rq::runnable_load_avg, which is the sum of sched_avg::load_avg_contrib,
becomes invariant too. So the load level that is returned by
weighted_cpuload() stays relative to the max frequency of the cpu.
Furthermore, we want to keep the load tracking values in a 32-bit type,
which implies that the max value of sched_avg::{runnable|running}_avg_sum
must be lower than 2^32/88761 = 48388 (88761 is the max weight of a task).
As LOAD_AVG_MAX = 47742, arch_scale_freq_capacity() must return a value
less than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037
(SCHED_CAPACITY_SCALE = 1024). So we define the range as
[0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
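For reference, the bound can be verified directly:

  2^32          = 4294967296
  88761 * 48388 = 4294967268  (fits)
  88761 * 48389 = 4295056029  (overflows)

and (48388 * 1024) / 47742 = 1037, so the [0..SCHED_CAPACITY_SCALE] range
leaves some headroom.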
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee76d52..b41f03d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2507,9 +2507,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
int runnable,
int running)
{
- u64 delta, periods;
- u32 runnable_contrib;
- int delta_w, decayed = 0;
+ u64 delta, scaled_delta, periods;
+ u32 runnable_contrib, scaled_runnable_contrib;
+ int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
delta = now - sa->last_runnable_update;
@@ -2543,11 +2543,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
+ scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta_w;
+ sa->runnable_avg_sum += scaled_delta_w;
if (running)
- sa->running_avg_sum += delta_w * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta_w;
sa->avg_period += delta_w;
delta -= delta_w;
@@ -2565,20 +2566,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
+ scaled_runnable_contrib = (runnable_contrib * scale_freq)
+ >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += runnable_contrib;
+ sa->runnable_avg_sum += scaled_runnable_contrib;
if (running)
- sa->running_avg_sum += runnable_contrib * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_runnable_contrib;
sa->avg_period += runnable_contrib;
}
/* Remainder of delta accrued against u_0` */
+ scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta;
+ sa->runnable_avg_sum += scaled_delta;
if (running)
- sa->running_avg_sum += delta * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta;
sa->avg_period += delta;
return decayed;
--
1.9.1
From: Morten Rasmussen <[email protected]>
Implements an arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking. The
factor is:
current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)
This implementation only provides frequency invariance. No
micro-architecture invariance yet.
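For example (values purely illustrative): with a current frequency of
600000 kHz and a max frequency of 1000000 kHz, the factor is
600000 * 1024 / 1000000 = 614.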
Cc: Russell King <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 89cfdd6..73ef337 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,39 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_cpu_capacity(NULL, cpu));
}
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling.
+ */
+
+static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+
+/* cpufreq callback function setting current cpu frequency */
+void arch_scale_set_curr_freq(int cpu, unsigned long freq)
+{
+ atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
+}
+
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+ atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
+}
+
+unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
+ unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+
+ if (!max)
+ return SCHED_CAPACITY_SCALE;
+
+ return (curr * SCHED_CAPACITY_SCALE) / max;
+}
+
#else
static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1
On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> From: Morten Rasmussen <[email protected]>
>
> Architectures that don't have any other means for tracking cpu frequency
> changes need a callback from cpufreq to implement a scaling factor to
> enable scale-invariant per-entity load-tracking in the scheduler.
>
> To compute the scale invariance correction factor the architecture would
> need to know both the max frequency and the current frequency. This
> patch defines weak functions for setting both from cpufreq.
>
> Related architecture specific functions use weak function definitions.
> The same approach is followed here.
>
> These callbacks can be used to implement frequency scaling of cpu
> capacity later.
>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 644b54e..1b17608 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
> }
> #endif
>
> +void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
> +
> +void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
> +
> static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> struct cpufreq_freqs *freqs, unsigned int state)
> {
> @@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> pr_debug("FREQ: %lu - CPU: %lu\n",
> (unsigned long)freqs->new, (unsigned long)freqs->cpu);
> trace_cpu_frequency(freqs->new, freqs->cpu);
> + arch_scale_set_curr_freq(freqs->cpu, freqs->new);
You must ensure that arch_scale_set_curr_freq() is called at least once
per CPU so that the current frequency is always defined.
If cpufreq doesn't have to change the frequency of a CPU, the current
frequency will never be set and the notification handler will never be
called. A typical example is a CPU booting at max frequency while the
performance governor is in use; the CPU frequency is already the right
one, so cpufreq will not change it and arch_scale_set_curr_freq() will
never be called. The current frequency will then be either undefined or
set to 0 on ARM platforms (according to patch 04), and
arch_scale_freq_capacity() will return 0.
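One possible way to guarantee an initial value (a sketch only, not part
of the posted patches; the exact hook point would need checking) would be
to also set the current frequency where the policy is known to be valid,
e.g. in cpufreq_set_policy():

	for_each_cpu(cpu, policy->cpus)
		arch_scale_set_curr_freq(cpu, policy->cur);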
Regards,
Vincent
> srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> CPUFREQ_POSTCHANGE, freqs);
> if (likely(policy) && likely(policy->cpu == freqs->cpu))
> @@ -2164,7 +2169,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> struct cpufreq_policy *new_policy)
> {
> struct cpufreq_governor *old_gov;
> - int ret;
> + int ret, cpu;
>
> pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
> new_policy->cpu, new_policy->min, new_policy->max);
> @@ -2202,6 +2207,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> policy->min = new_policy->min;
> policy->max = new_policy->max;
>
> + for_each_cpu(cpu, policy->cpus)
> + arch_scale_set_max_freq(cpu, policy->max);
> +
> pr_debug("new min and max freqs are %u - %u kHz\n",
> policy->min, policy->max);
>
> --
> 1.9.1
>
>
On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> Introduces the blocked utilization, the utilization counter-part to
> cfs_rq->utilization_load_avg. It is the sum of sched_entity utilization
> contributions of entities that were recently on the cfs_rq that are
> currently blocked. Combined with sum of contributions of entities
> currently on the cfs_rq or currently running
> (cfs_rq->utilization_load_avg) this can provide a more stable average
> view of the cpu usage.
I'm fully aligned with the interest of adding blocked tasks to the
CPU's utilization. Now, instead of adding one more atomic variable and
its manipulation, it might be worth moving to the [email protected]
patch set, which rewrites per-entity load tracking and includes blocked
load.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 8 ++++++--
> 2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 090223f..adf64df 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2778,6 +2778,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> cfs_rq->blocked_load_avg = 0;
> }
>
> +static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
> + long utilization_contrib)
> +{
> + if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
> + cfs_rq->utilization_blocked_avg -= utilization_contrib;
> + else
> + cfs_rq->utilization_blocked_avg = 0;
> +}
> +
> static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>
> /* Update a sched_entity's runnable average */
> @@ -2813,6 +2822,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
> cfs_rq->utilization_load_avg += utilization_delta;
> } else {
> subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> + subtract_utilization_blocked_contrib(cfs_rq,
> + -utilization_delta);
> }
> }
>
> @@ -2830,14 +2841,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
> return;
>
> if (atomic_long_read(&cfs_rq->removed_load)) {
> - unsigned long removed_load;
> + unsigned long removed_load, removed_utilization;
> removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
> + removed_utilization =
> + atomic_long_xchg(&cfs_rq->removed_utilization, 0);
> subtract_blocked_load_contrib(cfs_rq, removed_load);
> + subtract_utilization_blocked_contrib(cfs_rq,
> + removed_utilization);
> }
>
> if (decays) {
> cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
> decays);
> + cfs_rq->utilization_blocked_avg =
> + decay_load(cfs_rq->utilization_blocked_avg, decays);
> atomic64_add(decays, &cfs_rq->decay_counter);
> cfs_rq->last_decay = now;
> }
> @@ -2884,6 +2901,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> /* migrated tasks did not contribute to our blocked load */
> if (wakeup) {
> subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
> + subtract_utilization_blocked_contrib(cfs_rq,
> + se->avg.utilization_avg_contrib);
> update_entity_load_avg(se, 0);
> }
>
> @@ -2910,6 +2929,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
> cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
> if (sleep) {
> cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> + cfs_rq->utilization_blocked_avg +=
> + se->avg.utilization_avg_contrib;
> se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> } /* migrations, e.g. sleep=0 leave decay_count == 0 */
> }
> @@ -4929,6 +4950,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> se->avg.decay_count = -__synchronize_entity_decay(se);
> atomic_long_add(se->avg.load_avg_contrib,
> &cfs_rq->removed_load);
> + atomic_long_add(se->avg.utilization_avg_contrib,
> + &cfs_rq->removed_utilization);
> }
>
> /* We have migrated, no longer consider this task hot */
> @@ -7944,6 +7967,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
> if (se->avg.decay_count) {
> __synchronize_entity_decay(se);
> subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
> + subtract_utilization_blocked_contrib(cfs_rq,
> + se->avg.utilization_avg_contrib);
> }
> #endif
> }
> @@ -8003,6 +8028,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> #ifdef CONFIG_SMP
> atomic64_set(&cfs_rq->decay_counter, 1);
> atomic_long_set(&cfs_rq->removed_load, 0);
> + atomic_long_set(&cfs_rq->removed_utilization, 0);
> #endif
> }
>
> @@ -8055,6 +8081,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
> */
> se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> + cfs_rq->utilization_blocked_avg +=
> + se->avg.utilization_avg_contrib;
> #endif
> }
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e402133..208237f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -368,11 +368,15 @@ struct cfs_rq {
> * the blocked sched_entities on the rq.
> * utilization_load_avg is the sum of the average running time of the
> * sched_entities on the rq.
> + * utilization_blocked_avg is the utilization equivalent of
> + * blocked_load_avg, i.e. the sum of running contributions of blocked
> + * sched_entities associated with the rq.
> */
> - unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
> + unsigned long runnable_load_avg, blocked_load_avg;
> + unsigned long utilization_load_avg, utilization_blocked_avg;
> atomic64_t decay_counter;
> u64 last_decay;
> - atomic_long_t removed_load;
> + atomic_long_t removed_load, removed_utilization;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /* Required to track per-cpu representation of a task_group */
> --
> 1.9.1
>
>
On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> Add the blocked utilization contribution to group sched_entity
> utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
> With this change cpu usage now includes recent usage by currently
> non-runnable tasks, hence it provides a more stable view of the cpu
> usage. It does, however, also mean that the meaning of usage is changed:
> A cpu may be momentarily idle while usage >0. It can no longer be
> assumed that cpu usage >0 implies runnable tasks on the rq.
> cfs_rq->utilization_load_avg or nr_running should be used instead to get
> the current rq status.
If CONFIG_FAIR_GROUP_SCHED is not set, the blocked utilization of idle
CPUs will never be updated and their utilization will stay at the last
value from just before going idle. So you can have a CPU whose
utilization remains high even though it became idle a long time ago.
You have to periodically decay and update the blocked utilization of idle CPUs.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index adf64df..bd950b2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2764,7 +2764,8 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
> __update_task_entity_utilization(se);
> else
> se->avg.utilization_avg_contrib =
> - group_cfs_rq(se)->utilization_load_avg;
> + group_cfs_rq(se)->utilization_load_avg +
> + group_cfs_rq(se)->utilization_blocked_avg;
>
> return se->avg.utilization_avg_contrib - old_contrib;
> }
> @@ -4827,11 +4828,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
> static int get_cpu_usage(int cpu)
> {
> unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> + unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
>
> - if (usage >= SCHED_LOAD_SCALE)
> + if (usage + blocked >= SCHED_LOAD_SCALE)
> return capacity_orig_of(cpu);
>
> - return usage;
> + return usage + blocked;
> }
>
> /*
> --
> 1.9.1
>
>
On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> From: Dietmar Eggemann <[email protected]>
>
> Apply frequency scale-invariance correction factor to load tracking.
> Each segment of the sched_avg::runnable_avg_sum geometric series is now
> scaled by the current frequency so the sched_avg::load_avg_contrib of each
> entity will be invariant with frequency scaling. As a result,
> cfs_rq::runnable_load_avg which is the sum of sched_avg::load_avg_contrib,
> becomes invariant too. So the load level that is returned by
> weighted_cpuload, stays relative to the max frequency of the cpu.
>
> Then, we want the keep the load tracking values in a 32bits type, which
> implies that the max value of sched_avg::{runnable|running}_avg_sum must
> be lower than 2^32/88761=48388 (88761 is the max weight of a task). As
> LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less
> than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY =
> 1024). So we define the range to [0..SCHED_SCALE_CAPACITY] in order to
> avoid overflow.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/fair.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee76d52..b41f03d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2507,9 +2507,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> int runnable,
> int running)
> {
> - u64 delta, periods;
> - u32 runnable_contrib;
> - int delta_w, decayed = 0;
> + u64 delta, scaled_delta, periods;
> + u32 runnable_contrib, scaled_runnable_contrib;
> + int delta_w, scaled_delta_w, decayed = 0;
> unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
>
> delta = now - sa->last_runnable_update;
> @@ -2543,11 +2543,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> * period and accrue it.
> */
> delta_w = 1024 - delta_w;
> + scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
> +
> if (runnable)
> - sa->runnable_avg_sum += delta_w;
> + sa->runnable_avg_sum += scaled_delta_w;
> if (running)
> - sa->running_avg_sum += delta_w * scale_freq
> - >> SCHED_CAPACITY_SHIFT;
> + sa->running_avg_sum += scaled_delta_w;
> sa->avg_period += delta_w;
>
> delta -= delta_w;
> @@ -2565,20 +2566,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>
> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> runnable_contrib = __compute_runnable_contrib(periods);
> + scaled_runnable_contrib = (runnable_contrib * scale_freq)
> + >> SCHED_CAPACITY_SHIFT;
> +
> if (runnable)
> - sa->runnable_avg_sum += runnable_contrib;
> + sa->runnable_avg_sum += scaled_runnable_contrib;
> if (running)
> - sa->running_avg_sum += runnable_contrib * scale_freq
> - >> SCHED_CAPACITY_SHIFT;
> + sa->running_avg_sum += scaled_runnable_contrib;
> sa->avg_period += runnable_contrib;
> }
>
> /* Remainder of delta accrued against u_0` */
> + scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
> +
> if (runnable)
> - sa->runnable_avg_sum += delta;
> + sa->runnable_avg_sum += scaled_delta;
> if (running)
> - sa->running_avg_sum += delta * scale_freq
> - >> SCHED_CAPACITY_SHIFT;
> + sa->running_avg_sum += scaled_delta;
> sa->avg_period += delta;
>
> return decayed;
Acked-by: Vincent Guittot <[email protected]>
> --
> 1.9.1
>
>
On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> From: Dietmar Eggemann <[email protected]>
>
> Besides the existing frequency scale-invariance correction factor, apply
> cpu scale-invariance correction factor to usage and load tracking.
>
> Cpu scale-invariance takes cpu performance deviations due to
> micro-architectural differences (i.e. instructions per seconds) between
> cpus in HMP systems (e.g. big.LITTLE) and differences in the frequency
> value of the highest OPP between cpus in SMP systems into consideration.
>
> Each segment of the sched_avg::{running_avg_sum, runnable_avg_sum}
> geometric series is now scaled by the cpu performance factor too so the
> sched_avg::{utilization_avg_contrib, load_avg_contrib} of each entity will
> be invariant from the particular cpu of the HMP/SMP system it is gathered
> on. As a result, cfs_rq::runnable_load_avg which is the sum of
> sched_avg::load_avg_contrib, becomes cpu scale-invariant too.
>
> So the {usage, load} level that is returned by {get_cpu_usage,
> weighted_cpuload} stays relative to the max cpu performance of the system.
Having a load/utilization that is invariant across the system is a
good thing, but your patch only does part of the job. The load is
invariant so it can be directly compared across the system, but you
haven't updated the load-balance code, which also scales the load with
capacity.
Then, the task load is now capped by the max capacity of the CPU on which
it runs. Let's use an example made of 3 CPUs with the following
topology:
-CPU0 and CPU1 are in the same cluster 0 (share cache) and have a
capacity of 512 each
-CPU2 is in its own cluster (doesn't share cache with the others) and has
a capacity of 1024
Each cluster has the same compute capacity of 1024.
Then, let's consider that we have 7 always-running tasks with the
following placement:
-tasks A and B on CPU0
-tasks C, D on CPU1
-tasks F, G and H on CPU2
At cluster level we have the following statistics:
-On cluster 0, the compute capacity budget for each task is 256 (2 * 512 /
4) and the cluster load is 4096 with the current implementation and 2048
with cpu invariant load tracking
-On cluster 1, the compute capacity budget for each task is 341 (1024 / 3)
and the cluster load is 3072 with both implementations
Cluster 0 is more loaded than cluster 1 as the compute capacity
available for each task is lower than on cluster 1. The trend is
similar with the current implementation of load tracking, as we have a
load of 4096 for cluster 0 vs 3072 for cluster 1, but cpu invariant load
tracking shows a different trend, with a load of 2048 for cluster 0 vs
3072 for cluster 1.
Considering that adding cpu invariance to the load tracking implies
more modifications of the load-balance code, it might be worth reordering
your patchset and moving this patch to the end instead of the beginning,
so the other patches can be merged while the load balance is being fixed.
Regards,
Vincent
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/fair.c | 27 ++++++++++++++++++++++-----
> 1 file changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b41f03d..5c4c989 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2473,6 +2473,21 @@ static u32 __compute_runnable_contrib(u64 n)
> }
>
> unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
> +
> +static unsigned long contrib_scale_factor(int cpu)
> +{
> + unsigned long scale_factor;
> +
> + scale_factor = arch_scale_freq_capacity(NULL, cpu);
> + scale_factor *= arch_scale_cpu_capacity(NULL, cpu);
> + scale_factor >>= SCHED_CAPACITY_SHIFT;
> +
> + return scale_factor;
> +}
> +
> +#define scale_contrib(contrib, scale_factor) \
> + ((contrib * scale_factor) >> SCHED_CAPACITY_SHIFT)
>
> /*
> * We can represent the historical contribution to runnable average as the
> @@ -2510,7 +2525,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> u64 delta, scaled_delta, periods;
> u32 runnable_contrib, scaled_runnable_contrib;
> int delta_w, scaled_delta_w, decayed = 0;
> - unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> + unsigned long scale_factor;
>
> delta = now - sa->last_runnable_update;
> /*
> @@ -2531,6 +2546,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> return 0;
> sa->last_runnable_update = now;
>
> + scale_factor = contrib_scale_factor(cpu);
> +
> /* delta_w is the amount already accumulated against our next period */
> delta_w = sa->avg_period % 1024;
> if (delta + delta_w >= 1024) {
> @@ -2543,7 +2560,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> * period and accrue it.
> */
> delta_w = 1024 - delta_w;
> - scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
> + scaled_delta_w = scale_contrib(delta_w, scale_factor);
>
> if (runnable)
> sa->runnable_avg_sum += scaled_delta_w;
> @@ -2566,8 +2583,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>
> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> runnable_contrib = __compute_runnable_contrib(periods);
> - scaled_runnable_contrib = (runnable_contrib * scale_freq)
> - >> SCHED_CAPACITY_SHIFT;
> + scaled_runnable_contrib =
> + scale_contrib(runnable_contrib, scale_factor);
>
> if (runnable)
> sa->runnable_avg_sum += scaled_runnable_contrib;
> @@ -2577,7 +2594,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> }
>
> /* Remainder of delta accrued against u_0` */
> - scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
> + scaled_delta = scale_contrib(delta, scale_factor);
>
> if (runnable)
> sa->runnable_avg_sum += scaled_delta;
> --
> 1.9.1
>
>
Thanks, Vincent. I will then ping Peter, PJT, and Ben again about the
rewrite patch set v6, which has not been reviewed.
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Vincent Guittot
Sent: Wednesday, December 17, 2014 4:13 PM
To: Morten Rasmussen
Cc: Peter Zijlstra; [email protected]; Dietmar Eggemann; Paul Turner; Benjamin Segall; Michael Turquette; linux-kernel; [email protected]
Subject: Re: [RFC PATCH 08/10] sched: Track blocked utilization contributions
On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> Introduces the blocked utilization, the utilization counter-part to
> cfs_rq->utilization_load_avg. It is the sum of sched_entity
> utilization contributions of entities that were recently on the cfs_rq
> that are currently blocked. Combined with sum of contributions of
> entities currently on the cfs_rq or currently running
> (cfs_rq->utilization_load_avg) this can provide a more stable average
> view of the cpu usage.
I'm fully aligned with the interest of adding blocked tasks in the CPU's utilization. Now, instead of adding one more atomic data and it's manipulation, it might be worth to move on [email protected]
patchset: rewrites the per entity load tracking which includes blocked load.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 8 ++++++--
> 2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> 090223f..adf64df 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2778,6 +2778,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> cfs_rq->blocked_load_avg = 0; }
>
> +static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
> + long
> +utilization_contrib) {
> + if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
> + cfs_rq->utilization_blocked_avg -= utilization_contrib;
> + else
> + cfs_rq->utilization_blocked_avg = 0; }
> +
> static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>
> /* Update a sched_entity's runnable average */ @@ -2813,6 +2822,8 @@
> static inline void update_entity_load_avg(struct sched_entity *se,
> cfs_rq->utilization_load_avg += utilization_delta;
> } else {
> subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> + subtract_utilization_blocked_contrib(cfs_rq,
> +
> + -utilization_delta);
> }
> }
>
> @@ -2830,14 +2841,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
> return;
>
> if (atomic_long_read(&cfs_rq->removed_load)) {
> - unsigned long removed_load;
> + unsigned long removed_load, removed_utilization;
> removed_load = atomic_long_xchg(&cfs_rq->removed_load,
> 0);
> + removed_utilization =
> + atomic_long_xchg(&cfs_rq->removed_utilization,
> + 0);
> subtract_blocked_load_contrib(cfs_rq, removed_load);
> + subtract_utilization_blocked_contrib(cfs_rq,
> +
> + removed_utilization);
> }
>
> if (decays) {
> cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
> decays);
> + cfs_rq->utilization_blocked_avg =
> + decay_load(cfs_rq->utilization_blocked_avg,
> + decays);
> atomic64_add(decays, &cfs_rq->decay_counter);
> cfs_rq->last_decay = now;
> }
> @@ -2884,6 +2901,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> /* migrated tasks did not contribute to our blocked load */
> if (wakeup) {
> subtract_blocked_load_contrib(cfs_rq,
> se->avg.load_avg_contrib);
> + subtract_utilization_blocked_contrib(cfs_rq,
> +
> + se->avg.utilization_avg_contrib);
> update_entity_load_avg(se, 0);
> }
>
> @@ -2910,6 +2929,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
> cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
> if (sleep) {
> cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> + cfs_rq->utilization_blocked_avg +=
> +
> + se->avg.utilization_avg_contrib;
> se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> } /* migrations, e.g. sleep=0 leave decay_count == 0 */ } @@
> -4929,6 +4950,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> se->avg.decay_count = -__synchronize_entity_decay(se);
> atomic_long_add(se->avg.load_avg_contrib,
>
> &cfs_rq->removed_load);
> + atomic_long_add(se->avg.utilization_avg_contrib,
> + &cfs_rq->removed_utilization);
> }
>
> /* We have migrated, no longer consider this task hot */ @@
> -7944,6 +7967,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
> if (se->avg.decay_count) {
> __synchronize_entity_decay(se);
> subtract_blocked_load_contrib(cfs_rq,
> se->avg.load_avg_contrib);
> + subtract_utilization_blocked_contrib(cfs_rq,
> +
> + se->avg.utilization_avg_contrib);
> }
> #endif
> }
> @@ -8003,6 +8028,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq) #ifdef
> CONFIG_SMP
> atomic64_set(&cfs_rq->decay_counter, 1);
> atomic_long_set(&cfs_rq->removed_load, 0);
> + atomic_long_set(&cfs_rq->removed_utilization, 0);
> #endif
> }
>
> @@ -8055,6 +8081,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
> */
> se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> + cfs_rq->utilization_blocked_avg +=
> +
> + se->avg.utilization_avg_contrib;
> #endif
> }
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> e402133..208237f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -368,11 +368,15 @@ struct cfs_rq {
> * the blocked sched_entities on the rq.
> * utilization_load_avg is the sum of the average running time of the
> * sched_entities on the rq.
> + * utilization_blocked_avg is the utilization equivalent of
> + * blocked_load_avg, i.e. the sum of running contributions of blocked
> + * sched_entities associated with the rq.
> */
> - unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
> + unsigned long runnable_load_avg, blocked_load_avg;
> + unsigned long utilization_load_avg, utilization_blocked_avg;
> atomic64_t decay_counter;
> u64 last_decay;
> - atomic_long_t removed_load;
> + atomic_long_t removed_load, removed_utilization;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /* Required to track per-cpu representation of a task_group */
> --
> 1.9.1
>
>
On Thu, Dec 18, 2014 at 09:41:51AM +0000, Vincent Guittot wrote:
> On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> > From: Dietmar Eggemann <[email protected]>
> >
> > Besides the existing frequency scale-invariance correction factor, apply
> > cpu scale-invariance correction factor to usage and load tracking.
> >
> > Cpu scale-invariance takes into consideration cpu performance
> > deviations due to micro-architectural differences (i.e. instructions
> > per second) between cpus in HMP systems (e.g. big.LITTLE) and
> > differences in the frequency value of the highest OPP between cpus in
> > SMP systems.
> >
> > Each segment of the sched_avg::{running_avg_sum, runnable_avg_sum}
> > geometric series is now scaled by the cpu performance factor too, so
> > the sched_avg::{utilization_avg_contrib, load_avg_contrib} of each
> > entity will be invariant with respect to the particular cpu of the
> > HMP/SMP system it is gathered on. As a result, cfs_rq::runnable_load_avg,
> > which is the sum of sched_avg::load_avg_contrib, becomes cpu
> > scale-invariant too.
> >
> > So the {usage, load} level that is returned by {get_cpu_usage,
> > weighted_cpuload} stays relative to the max cpu performance of the system.
>
> Having a load/utilization that is invariant across the system is a
> good thing, but your patch only does part of the job. The load is now
> invariant, so it can be directly compared across the system, but you
> haven't updated the load-balance code, which also scales the load with
> capacity.
>
> Then, the task load is now capped by the max capacity of the CPU on
> which it runs. Let's use an example made of 3 CPUs with the following
> topology:
> -CPU0 and CPU1 are in the same cluster 0 (they share a cache) and have
> a capacity of 512 each
> -CPU2 is in its own cluster 1 (it doesn't share a cache with the
> others) and has a capacity of 1024
> Each cluster has the same compute capacity of 1024
>
> Then, let's consider that we have 7 always-running tasks with the
> following placement:
> -tasks A and B on CPU0
> -tasks C, D on CPU1
> -tasks F, G and H on CPU2
>
> At the cluster level we have the following statistics:
> -On cluster 0, the compute capacity budget for each task is 256
> (2 * 512 / 4) and the cluster load is 4096 with the current
> implementation and 2048 with cpu-invariant load tracking
> -On cluster 1, the compute capacity budget for each task is 341
> (1024 / 3) and the cluster load is 3072 with both implementations
>
> Cluster 0 is more loaded than cluster 1, as the compute capacity
> available to each task is lower than on cluster 1. The trend is the
> same with the current implementation of load tracking, where we have a
> load of 4096 for cluster 0 vs 3072 for cluster 1, but cpu-invariant
> load tracking shows the opposite trend with a load of 2048 for
> cluster 0 vs 3072 for cluster 1.
Very good point. This patch can't go in without the necessary
modifications to the load-balance code. I think we have even discussed
it in the past :-/
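To restate the arithmetic of your example for my own benefit: an
always-running task contributes 1024 to the current (purely time-based)
load, so cluster 0 carries 4 * 1024 = 4096 and cluster 1 carries
3 * 1024 = 3072 today. With cpu-invariant tracking a task confined to a
512-capacity cpu can never exceed 512, so cluster 0 drops to
4 * 512 = 2048 while cluster 1 stays at 3 * 1024 = 3072, and the
apparent imbalance flips.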
I have been pondering a solution for a while. The obvious way to leave
everything as it is would be to derive the 'old' load from the new
scale-invariant load, but that sort of defeats the whole purpose of
introducing it. We need to figure out when we need the 'old' load and
when we need the 'new' one.
Let's call the current load implementation 'norm_load' as it is
basically:

	norm_load = busy_time/wall_time * prio_scale

The new scale-invariant load, 'scale_load', is then:

	scale_load = busy_time/wall_time * prio_scale * capacity

Hence, we can compute norm_load as:

	norm_load = scale_load/capacity
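In code that back-conversion would look roughly like this (an
illustrative sketch only, not part of the patch set; norm_load_of() is
a made-up name and capacity is assumed to be expressed relative to
SCHED_CAPACITY_SCALE):

	static inline unsigned long norm_load_of(unsigned long scale_load,
						 unsigned long capacity)
	{
		/* undo the capacity scaling applied during load tracking */
		return scale_load * SCHED_CAPACITY_SCALE / capacity;
	}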
For overloaded scenarios, such as your example, norm_load can tell us
to what degree the group is overloaded, but not relative to its
capacity. Hence, it doesn't tell us anything about the capacity per
task. (If you change the CPU0 and CPU1 capacities to 1024 in your
example, the cluster load remains unchanged, but the capacity per task
increases.) I agree that capacity per task is important for fairness in
overloaded scenarios, but I think the right metric should be based on
load instead of the number of tasks, in order to factor in priority.
Instead of norm_load, we should compare the capacity per load (or its
inverse) of the clusters:

	capacity_per_load = capacity/norm_load

This is exactly the inverse of the avg_load already used in the current
load-balance code. For your example avg_load happens to be equal to the
cluster norm_load. If you change the CPU0 and CPU1 capacities to 1024,
the cluster 0 avg_load decreases to 2048, correctly indicating that the
load per capacity is now lower than cluster 1's, while norm_load is
unchanged.
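Plugging in the numbers (avg_load being roughly group norm_load *
SCHED_CAPACITY_SCALE / group capacity): cluster 0 has norm_load 4096
and capacity 1024, giving avg_load 4096; cluster 1 has norm_load 3072
and capacity 1024, giving avg_load 3072. With CPU0 and CPU1 at 1024
each, cluster 0's capacity becomes 2048 and its avg_load drops to
4096 * 1024 / 2048 = 2048, i.e. below cluster 1, while its norm_load is
still 4096.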
All that said, we can maintain the current behaviour if we just make
sure to compute norm_load from scale_load before we compute avg_load.
This is all good for overloaded scenarios, but for non-overloaded
scenarios avg_load gives us the wrong picture. When none of the cpus
are overloaded we don't care about load per capacity, as all tasks get
the cpu cycles they need.
If you replace the tasks in your example with tasks that each use 128
capacity you get:
            cluster 0   cluster 1
capacity         1024        1024
scale_load        512         384
avg_load         1024         384
Here avg_load exaggerates the load per capacity because avg_load is
based on norm_load, not scale_load. This is the current behaviour. It
seems that we should continue using avg_load (or some re-expression
thereof based on scale_load) for overloaded scenarios and use
scale_load for non-overloaded scenarios to improve things over what we
currently have.
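Expressed as code, the selection could look roughly like this (purely a
sketch of the idea, all names invented, not a proposed patch):

	static unsigned long group_compare_load(int overloaded,
						unsigned long scale_load,
						unsigned long avg_load)
	{
		/*
		 * Overloaded: fairness is about load per capacity, so compare
		 * avg_load. Not overloaded: every task gets the cycles it
		 * needs, so the scale-invariant demand (scale_load) is the
		 * more meaningful number.
		 */
		return overloaded ? avg_load : scale_load;
	}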
> Considering that adding cpu invariance to the load tracking implies
> more modifications to the load-balance code, it might be worth
> reordering your patchset and moving this patch to the end instead of
> the beginning, so the other patches can be merged while the load
> balance is being fixed.
Indeed. I will move this patch further back in the set, next to the
other more experimental patches, while working on patches to fix the
load-balance code.
Morten
>
> Regards,
> Vincent
>
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > ---
> > kernel/sched/fair.c | 27 ++++++++++++++++++++++-----
> > 1 file changed, 22 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b41f03d..5c4c989 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2473,6 +2473,21 @@ static u32 __compute_runnable_contrib(u64 n)
> > }
> >
> > unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> > +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
> > +
> > +static unsigned long contrib_scale_factor(int cpu)
> > +{
> > + unsigned long scale_factor;
> > +
> > + scale_factor = arch_scale_freq_capacity(NULL, cpu);
> > + scale_factor *= arch_scale_cpu_capacity(NULL, cpu);
> > + scale_factor >>= SCHED_CAPACITY_SHIFT;
> > +
> > + return scale_factor;
> > +}
> > +
> > +#define scale_contrib(contrib, scale_factor) \
> > + ((contrib * scale_factor) >> SCHED_CAPACITY_SHIFT)
> >
> > /*
> > * We can represent the historical contribution to runnable average as the
> > @@ -2510,7 +2525,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> > u64 delta, scaled_delta, periods;
> > u32 runnable_contrib, scaled_runnable_contrib;
> > int delta_w, scaled_delta_w, decayed = 0;
> > - unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > + unsigned long scale_factor;
> >
> > delta = now - sa->last_runnable_update;
> > /*
> > @@ -2531,6 +2546,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> > return 0;
> > sa->last_runnable_update = now;
> >
> > + scale_factor = contrib_scale_factor(cpu);
> > +
> > /* delta_w is the amount already accumulated against our next period */
> > delta_w = sa->avg_period % 1024;
> > if (delta + delta_w >= 1024) {
> > @@ -2543,7 +2560,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> > * period and accrue it.
> > */
> > delta_w = 1024 - delta_w;
> > - scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
> > + scaled_delta_w = scale_contrib(delta_w, scale_factor);
> >
> > if (runnable)
> > sa->runnable_avg_sum += scaled_delta_w;
> > @@ -2566,8 +2583,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> >
> > /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> > runnable_contrib = __compute_runnable_contrib(periods);
> > - scaled_runnable_contrib = (runnable_contrib * scale_freq)
> > - >> SCHED_CAPACITY_SHIFT;
> > + scaled_runnable_contrib =
> > + scale_contrib(runnable_contrib, scale_factor);
> >
> > if (runnable)
> > sa->runnable_avg_sum += scaled_runnable_contrib;
> > @@ -2577,7 +2594,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> > }
> >
> > /* Remainder of delta accrued against u_0` */
> > - scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
> > + scaled_delta = scale_contrib(delta, scale_factor);
> >
> > if (runnable)
> > sa->runnable_avg_sum += scaled_delta;
> > --
> > 1.9.1
> >
> >
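(For reference, with contrib_scale_factor() above, a cpu whose
arch_scale_cpu_capacity() is 512 running at half its max frequency,
i.e. arch_scale_freq_capacity() returning 512, gets
scale_factor = (512 * 512) >> SCHED_CAPACITY_SHIFT = 256, so a fully
busy 1024 us segment accrues scale_contrib(1024, 256) = 256 to
runnable_avg_sum, a quarter of what an unscaled segment would
contribute.)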