From: bsegall@google.com
To: Yuyang Du
Cc: mingo@kernel.org, peterz@infradead.org, linux-kernel@vger.kernel.org,
	pjt@google.com, morten.rasmussen@arm.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, arjan.van.de.ven@intel.com,
	len.brown@intel.com, fengguang.wu@intel.com
Subject: Re: [PATCH v7 2/4] sched: Rewrite runnable load and utilization average tracking
References: <1431570135-12838-1-git-send-email-yuyang.du@intel.com>
	<1431570135-12838-3-git-send-email-yuyang.du@intel.com>
Date: Mon, 18 May 2015 13:55:21 -0700
In-Reply-To: <1431570135-12838-3-git-send-email-yuyang.du@intel.com>
	(Yuyang Du's message of "Thu, 14 May 2015 10:22:13 +0800")

Yuyang Du writes:

> The idea of runnable load average (letting runnable time contribute to the
> load weight) was proposed by Paul Turner, and it is still followed by this
> rewrite. This rewrite aims to solve the following issues:
>
> 1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
> updated at the granularity of one entity at a time, which results in the
> cfs_rq's load average being stale or partially updated: at any time, only
> one entity is up to date; all other entities are effectively lagging
> behind. This is undesirable.
>
> To illustrate, if we have n runnable entities in the cfs_rq, as time
> elapses they certainly become outdated:
>
>   t0: cfs_rq { e1_old, e2_old, ..., en_old }
>
> and when we update:
>
>   t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
>
>   t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
>
>   ...
>
> We solve this by combining all runnable entities' load averages together
> into the cfs_rq's avg, and updating the cfs_rq's avg as a whole. This is
> based on the fact that, if we regard the update as a function, then:
>
>   w * update(e) = update(w * e)  and
>
>   update(e1) + update(e2) = update(e1 + e2),  therefore
>
>   w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
>
> So, with this rewrite, we have an entirely updated cfs_rq at the time we
> update it:
>
>   t1: update cfs_rq { e1_new, e2_new, ..., en_new }
>
>   t2: update cfs_rq { e1_new, e2_new, ..., en_new }
>
>   ...

This requires a u32*u32->u64 multiply that the old code did not, but only
on period rollover, which already has a u64 div, so that's probably fine.
The non-rollover path would only need one with SCHED_LOAD_RESOLUTION
active, and this code should probably scale_load_down anyway (the old code
did in the equivalent place). I need to read the code more thoroughly to be
sure everything else is ok.

This also causes an entity's load to not immediately change when its weight
does, which is in some ways more sensible, but not how it has ever worked.
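(For my own reference while reviewing: the linearity claim above is just the
distributive law applied to the geometric decay, so decaying one cfs_rq-wide
sum in one go matches decaying each entity's sum and adding them up. A minimal
userspace sketch, in doubles rather than the kernel's fixed point -- decay()
below is a stand-in for illustration, not the kernel's decay_load():)

	/*
	 * Toy model of the per-period decay used by sched_avg: every period
	 * (~1ms) the accumulated sum is multiplied by y, where y^32 = 1/2.
	 * Because multiplication distributes over addition, decaying each
	 * entity's sum and adding the results equals decaying the combined
	 * cfs_rq sum once.  (With the real 32-bit fixed-point tables the
	 * equality only holds up to rounding.)
	 */
	#include <math.h>
	#include <stdio.h>

	static double decay(double sum, int periods)
	{
		return sum * pow(0.5, periods / 32.0);	/* y^periods, y = 0.5^(1/32) */
	}

	int main(void)
	{
		double e1 = 3000.0, e2 = 5000.0;	/* two entities' load sums */
		int periods = 7;			/* periods since last update */

		printf("per-entity sums, decayed then added: %f\n",
		       decay(e1, periods) + decay(e2, periods));
		printf("combined cfs_rq sum, decayed once  : %f\n",
		       decay(e1 + e2, periods));
		return 0;
	}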
> 2. cfs_rq's load average is different between the top rq->cfs_rq and other
> task_groups' per-CPU cfs_rqs in whether or not blocked_load_average
> contributes to the load.
>
> The basic idea behind runnable load average (the same as utilization) is
> that the blocked state is taken into account, as opposed to only accounting
> for the currently runnable state. Therefore, the average should include
> both the runnable/running and blocked load averages. This rewrite does
> that.
>
> In addition, we also combine the runnable/running and blocked averages of
> all entities into the cfs_rq's average, and update it all at once. This is
> based on the fact that:
>
>   update(runnable) + update(blocked) = update(runnable + blocked)
>
> This also significantly reduces the code, as we no longer need to
> separately maintain runnable/running load and blocked load.

This means that races and such around migrate have the opportunity to end up
reducing the entire average, rather than just the blocked load (which was
going towards zero anyway). This may not be the end of the world, but iirc
was part of (all of?) the reason for the separation to begin with.

> 3. How task_group entities' shares are calculated is complex.
>
> In this rewrite, the task_group's load_avg is aggregated from its per-CPU
> cfs_rqs' load_avg. A group entity's weight is simply proportional to its
> own cfs_rq's load_avg / the task_group's load_avg. Therefore,
>
>   if task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn },
>
>   task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg
>
>   cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
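(Sanity-checking my reading of the share rule above with made-up numbers; a
userspace sketch only, not the kernel's calc_cfs_shares(), and the variable
names are invented for illustration:)

	#include <stdio.h>

	int main(void)
	{
		unsigned long tg_shares = 1024;			/* task_group's share */
		unsigned long cfs_rq_avg[] = { 300, 100, 600 };	/* per-CPU cfs_rq load_avg */
		unsigned long tg_avg = 0;
		int i;

		/* task_group_avg = sum of its per-CPU cfs_rq load_avg */
		for (i = 0; i < 3; i++)
			tg_avg += cfs_rq_avg[i];

		/* each group entity gets shares proportional to its cfs_rq's load_avg */
		for (i = 0; i < 3; i++)
			printf("cpu%d group entity weight = %lu\n", i,
			       tg_shares * cfs_rq_avg[i] / tg_avg);

		return 0;	/* prints 307, 102, 614 for this example */
	}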
> To sum up, this rewrite is in principle equivalent to the previous code,
> but it significantly reduces the code complexity and hence increases
> clarity and efficiency. In addition, the new load_avg is much smoother and
> more continuous (no spurious spikes and valleys) and is updated more
> consistently and quickly to reflect the load dynamics. As a result, we have
> less load tracking overhead and better performance, especially better power
> efficiency due to more balanced load.
>
> Signed-off-by: Yuyang Du
> ---
>  include/linux/sched.h |  41 ++--
>  kernel/sched/core.c   |   3 -
>  kernel/sched/debug.c  |  35 +--
>  kernel/sched/fair.c   | 621 ++++++++++++++++---------------------------------
>  kernel/sched/proc.c   |   2 +-
>  kernel/sched/sched.h  |  28 +--
>  6 files changed, 240 insertions(+), 490 deletions(-)
>

I looked at the diff, but it quickly became unreadable due to git finding
"common" blank lines, and it doesn't apply to tip/master or the other trees
I tried.

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8222ae4..85a619a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1123,29 +1123,24 @@ struct load_weight {
>  	u32 inv_weight;
>  };
>  
> +/*
> + * The load_avg/util_avg represents an infinite geometric series:
> + * 1) load_avg describes the amount of time that a sched_entity
> + * is runnable on a rq. It is based on both load_sum and the
> + * weight of the task.
> + * 2) util_avg describes the amount of time that a sched_entity
> + * is running on a CPU. It is based on util_sum and is scaled
> + * in the range [0..SCHED_LOAD_SCALE].
> + * The 64 bit load_sum can:
> + * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
> + * the highest weight (=88761) always runnable, we should not overflow
> + * 2) for entity, support any load.weight always runnable
> + */
>  struct sched_avg {
> -	u64 last_runnable_update;
> -	s64 decay_count;
> -	/*
> -	 * utilization_avg_contrib describes the amount of time that a
> -	 * sched_entity is running on a CPU. It is based on running_avg_sum
> -	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
> -	 * load_avg_contrib described the amount of time that a sched_entity
> -	 * is runnable on a rq. It is based on both runnable_avg_sum and the
> -	 * weight of the task.
> -	 */
> -	unsigned long load_avg_contrib, utilization_avg_contrib;
> -	/*
> -	 * These sums represent an infinite geometric series and so are bound
> -	 * above by 1024/(1-y). Thus we only need a u32 to store them for all
> -	 * choices of y < 1-2^(-32)*1024.
> -	 * running_avg_sum reflects the time that the sched_entity is
> -	 * effectively running on the CPU.
> -	 * runnable_avg_sum represents the amount of time a sched_entity is on
> -	 * a runqueue which includes the running time that is monitored by
> -	 * running_avg_sum.
> -	 */
> -	u32 runnable_avg_sum, avg_period, running_avg_sum;
> +	u64 last_update_time;
> +	u64 load_sum, util_sum;

util_sum can still just be u32, yes?

> +	unsigned long load_avg, util_avg;
> +	u32 period_contrib;
>  };
>  
>  #ifdef CONFIG_SCHEDSTATS
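(Back-of-the-envelope check of the overflow figure quoted above, and the
reason for the u32 question: as the old comment says, with a per-period
contribution capped at 1024 the decayed series is bounded by 1024/(1-y),
roughly 47742, which fits in a u32; it is the extra weight factor in
load_sum that needs the u64. A quick userspace check, with the maximum
decayed sum and the highest task weight hard-coded here for illustration:)

	#include <stdio.h>

	int main(void)
	{
		unsigned long long load_avg_max = 47742;  /* ~1024/(1-y), max decayed sum */
		unsigned long long max_weight   = 88761;  /* weight of a nice -20 task */
		unsigned long long entities = (~0ULL) / load_avg_max / max_weight;

		/* should reproduce the ~4353082796 figure quoted in the new comment */
		printf("entities before a u64 load_sum could overflow: %llu\n", entities);
		return 0;
	}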