Message-ID: <55A3F097.8040101@arm.com>
Date: Mon, 13 Jul 2015 18:08:39 +0100
From: Dietmar Eggemann
To: Yuyang Du, "mingo@kernel.org", "peterz@infradead.org",
 "linux-kernel@vger.kernel.org"
CC: "pjt@google.com", "bsegall@google.com", Morten Rasmussen,
 "vincent.guittot@linaro.org", "len.brown@intel.com",
 "rafael.j.wysocki@intel.com", "fengguang.wu@intel.com",
 "boqun.feng@gmail.com", "srikar@linux.vnet.ibm.com"
Subject: Re: [PATCH v9 2/4] sched: Rewrite runnable load and utilization
 average tracking
References: <1435018085-7004-1-git-send-email-yuyang.du@intel.com>
 <1435018085-7004-3-git-send-email-yuyang.du@intel.com>
In-Reply-To: <1435018085-7004-3-git-send-email-yuyang.du@intel.com>

Hi Yuyang,

I did some testing of your new pelt implementation:

TC 1: one nice-0 60% task affine to cpu1 in the root tg, plus two nice-0
      20% periodic tasks affine to cpu1 in a task group with id=3 (one
      hierarchy).

TC 2: 10 nice-0 5% tasks affine to cpu1 in a task group with id=3 (one
      hierarchy).

I compared the results (the pelt signals of the se's (the tasks and the tg
representation for cpu1) as well as the related cfs_rq and tg signals) with
the current pelt implementation. The signals are very similar, taking into
account the differences due to the separated/missing blocked load/util in
the current pelt and the slightly different behaviour in transitional
phases (e.g. task enqueue/dequeue). (A minimal duty-cycle loop that can
generate such periodic test tasks is sketched at the end of this mail.)

I haven't done any performance related tests yet.

-- Dietmar

On 23/06/15 01:08, Yuyang Du wrote:
> The idea of runnable load average (let runnable time contribute to weight)
> was proposed by Paul Turner, and it is still followed by this rewrite. This
> rewrite aims to solve the following issues:

[...]

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index af0eeba..8b4bc4f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1183,29 +1183,23 @@ struct load_weight {
>  	u32 inv_weight;
>  };
>
> +/*
> + * The load_avg/util_avg represents an infinite geometric series:
> + * 1) load_avg describes the amount of time that a sched_entity
> + *    is runnable on a rq. It is based on both load_sum and the
> + *    weight of the task.
> + * 2) util_avg describes the amount of time that a sched_entity
> + *    is running on a CPU. It is based on util_sum and is scaled
> + *    in the range [0..SCHED_LOAD_SCALE].

sa->load_[avg/sum] and sa->util_[avg/sum] are also used for the aggregated
load/util values on the cfs_rq's.
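
To make the constants in the next quoted lines easier to follow, here is a
toy model of the decayed sums these fields hold. It is plain userspace C,
not the kernel's fixed-point code; the only kernel-side values referenced
are LOAD_AVG_MAX (47742) and the nice -20 weight (88761) quoted below.

/*
 * Toy model of the decayed sums behind struct sched_avg (illustration
 * only, not the kernel's fixed-point implementation): every ~1ms period
 * contributes up to 1024, and all older periods decay by y per period,
 * with y chosen so that y^32 = 1/2.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 1/2 */
	double sum = 0.0;
	int n;

	/* always-runnable entity: each period contributes the full 1024 */
	for (n = 0; n < 1000; n++)
		sum = sum * y + 1024.0;

	/*
	 * Converges to 1024/(1 - y) ~= 47.8k; the kernel's fixed-point
	 * version of this series comes out at LOAD_AVG_MAX = 47742.
	 */
	printf("max decayed sum ~= %.0f\n", sum);

	/*
	 * The 64 bit head-room figure from the comment quoted below:
	 * 2^64 / 47742 / 88761 ~= 4.35e9 always-runnable nice -20 entities.
	 */
	printf("entities ~= %.0f\n", pow(2, 64) / 47742.0 / 88761.0);

	return 0;
}

Nothing new in there, it just shows where the 47742 and 4353082796 figures
come from.
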
> + * The 64 bit load_sum can:
> + * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
> + *    the highest weight (=88761) always runnable, we should not overflow
> + * 2) for entity, support any load.weight always runnable
> + */
>  struct sched_avg {
> -	u64 last_runnable_update;
> -	s64 decay_count;
> -	/*
> -	 * utilization_avg_contrib describes the amount of time that a
> -	 * sched_entity is running on a CPU. It is based on running_avg_sum
> -	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
> -	 * load_avg_contrib described the amount of time that a sched_entity
> -	 * is runnable on a rq. It is based on both runnable_avg_sum and the
> -	 * weight of the task.
> -	 */
> -	unsigned long load_avg_contrib, utilization_avg_contrib;
> -	/*
> -	 * These sums represent an infinite geometric series and so are bound
> -	 * above by 1024/(1-y). Thus we only need a u32 to store them for all
> -	 * choices of y < 1-2^(-32)*1024.
> -	 * running_avg_sum reflects the time that the sched_entity is
> -	 * effectively running on the CPU.
> -	 * runnable_avg_sum represents the amount of time a sched_entity is on
> -	 * a runqueue which includes the running time that is monitored by
> -	 * running_avg_sum.
> -	 */
> -	u32 runnable_avg_sum, avg_period, running_avg_sum;
> +	u64 last_update_time, load_sum;
> +	u32 util_sum, period_contrib;
> +	unsigned long load_avg, util_avg;
>  };

[...]

>  /*
> - * Aggregate cfs_rq runnable averages into an equivalent task_group
> - * representation for computing load contributions.
> + * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * and effective_load (which is not done because it is too costly).
>   */
> -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> -					    struct cfs_rq *cfs_rq)
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {

This function is always called with force=0, right? I remember that there
was some discussion about this in your v5 (error bounds of the '/ 64'),
but since force is not used ...

> -	struct task_group *tg = cfs_rq->tg;
> -	long contrib;
> -
> -	/* The fraction of a cpu used by this cfs_rq */
> -	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> -			  sa->avg_period + 1);
> -	contrib -= cfs_rq->tg_runnable_contrib;
> +	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
> -	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> -		atomic_add(contrib, &tg->runnable_avg);
> -		cfs_rq->tg_runnable_contrib += contrib;
> -	}
> -}

[...]

> -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> -
> -/* Update a sched_entity's runnable average */
> -static inline void update_entity_load_avg(struct sched_entity *se,
> -					  int update_cfs_rq)
> +/* Update task and its cfs_rq load average */
> +static inline void update_load_avg(struct sched_entity *se, int update_tg)
>  {
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -	long contrib_delta, utilization_delta;
>  	int cpu = cpu_of(rq_of(cfs_rq));
> -	u64 now;
> +	u64 now = cfs_rq_clock_task(cfs_rq);
>
>  	/*
> -	 * For a group entity we need to use their owned cfs_rq_clock_task() in
> -	 * case they are the parent of a throttled hierarchy.
> +	 * Track task load average for carrying it to new CPU after migrated, and
> +	 * track group sched_entity load average for task_h_load calc in migration
>  	 */
> -	if (entity_is_task(se))
> -		now = cfs_rq_clock_task(cfs_rq);
> -	else
> -		now = cfs_rq_clock_task(group_cfs_rq(se));

Why don't you make this distinction between se's representing tasks and
se's representing task groups anymore when getting 'now'?
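
For readers following along, this is the distinction in question (sketch
only, not code from the patch; helper semantics as in mainline:
cfs_rq_of(se) is the cfs_rq the entity is queued on, group_cfs_rq(se) is
the cfs_rq a group entity owns, and cfs_rq_clock_task() discounts
throttled time):

/* Sketch (not from the patch) of the clock selection the old
 * update_entity_load_avg() performed: */
static u64 entity_clock_task_old(struct sched_entity *se)
{
	/* task se: use the clock of the cfs_rq it is queued on */
	if (entity_is_task(se))
		return cfs_rq_clock_task(cfs_rq_of(se));

	/*
	 * group se: use the clock of the cfs_rq it owns (se->my_q), so
	 * that time throttled inside its own hierarchy is not accrued.
	 */
	return cfs_rq_clock_task(group_cfs_rq(se));
}

With the rewrite both cases use the clock of cfs_rq_of(se), hence the
question above.
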
> +	__update_load_avg(now, cpu, &se->avg,
> +		se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
>
> -	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
> -					  cfs_rq->curr == se))
> -		return;
> -
> -	contrib_delta = __update_entity_load_avg_contrib(se);
> -	utilization_delta = __update_entity_utilization_avg_contrib(se);
> -
> -	if (!update_cfs_rq)
> -		return;
> -
> -	if (se->on_rq) {
> -		cfs_rq->runnable_load_avg += contrib_delta;
> -		cfs_rq->utilization_load_avg += utilization_delta;
> -	} else {
> -		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> -	}
> +	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
> +		update_tg_load_avg(cfs_rq, 0);
>  }

[...]

> -
>  static void update_blocked_averages(int cpu)

The name of this function now becomes misleading, since you don't update
blocked averages any more. The existing pelt code calls
__update_blocked_averages_cpu() -> update_cfs_rq_blocked_load() ->
subtract_blocked_load_contrib() for the whole tg tree, whereas you update
cfs_rq->avg.[load/util]_[avg/sum] and conditionally tg->load_avg and
cfs_rq->tg_load_avg_contrib.

[...]
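
P.S. In case somebody wants to recreate something like TC 1/TC 2: a
periodic task with a given duty cycle (x busy ms out of a 100ms period,
affine to cpu1) can be generated with a small loop like the one below.
The 100ms period and all names are just an illustration; moving the pids
into the id=3 task group is done separately via the cpu cgroup controller.

/* Illustrative periodic duty-cycle task: ./dutycycle <busy_ms 0..100> */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* spin for roughly 'ms' milliseconds */
static void busy_for(long ms)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	do {
		clock_gettime(CLOCK_MONOTONIC, &t1);
	} while ((t1.tv_sec - t0.tv_sec) * 1000 +
		 (t1.tv_nsec - t0.tv_nsec) / 1000000 < ms);
}

int main(int argc, char **argv)
{
	long busy_ms = argc > 1 ? atol(argv[1]) : 60;	/* e.g. 60 -> ~60% */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(1, &set);				/* affine to cpu1 */
	sched_setaffinity(0, sizeof(set), &set);

	for (;;) {
		busy_for(busy_ms);			/* busy part of period */
		usleep((100 - busy_ms) * 1000);		/* sleep the rest */
	}
}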