Date: Fri, 01 Sep 2017 15:21:01 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org, linux-kernel@vger.kernel.org, tj@kernel.org, josef@toxicpanda.com
Cc: torvalds@linux-foundation.org, vincent.guittot@linaro.org, efault@gmx.de, pjt@google.com, clm@fb.com, dietmar.eggemann@arm.com, morten.rasmussen@arm.com, bsegall@google.com, yuyang.du@intel.com, peterz@infradead.org
Subject: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()
Message-Id: <20170901132748.083733695@infradead.org>
References: <20170901132059.342024223@infradead.org>

Explain the magic equation in calc_cfs_shares() a bit better.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2707,6 +2707,67 @@ account_entity_dequeue(struct cfs_rq *cf
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
+/*
+ * All this does is approximate the hierarchical proportion which includes that
+ * global sum we all love to hate.
+ *
+ * That is, the weight of a group entity is the proportional share of the
+ * group weight based on the group runqueue weights. That is:
+ *
+ *                     tg->weight * grq->load.weight
+ *   ge->load.weight = -----------------------------               (1)
+ *                        \Sum grq->load.weight
+ *
+ * Now, because computing that sum is prohibitively expensive (been there,
+ * done that) we approximate it with this average stuff. The average moves
+ * slower and therefore the approximation is cheaper and more stable.
+ *
+ * So instead of the above, we substitute:
+ *
+ *   grq->load.weight -> grq->avg.load_avg                         (2)
+ *
+ * which yields the following:
+ *
+ *                     tg->weight * grq->avg.load_avg
+ *   ge->load.weight = ------------------------------              (3)
+ *                             tg->load_avg
+ *
+ * Where: tg->load_avg ~= \Sum grq->avg.load_avg
+ *
+ * That is shares_avg, and it is right (given the approximation (2)).
+ *
+ * The problem with it is that, because the average is slow -- it was
+ * designed to be exactly that of course -- this leads to transients in
+ * boundary conditions. Specifically, the case where the group was idle
+ * and we start one task. It takes time for our CPU's grq->avg.load_avg
+ * to build up, yielding bad latency etc.
+ *
+ * Now, in that special case (1) reduces to:
+ *
+ *                     tg->weight * grq->load.weight
+ *   ge->load.weight = ----------------------------- = tg->weight  (4)
+ *                         grq->load.weight
+ *
+ * That is, the sum collapses because all other CPUs are idle; the UP scenario.
+ *
+ * So what we do is modify our approximation (3) to approach (4) in the (near)
+ * UP case, like:
+ *
+ *   ge->load.weight =
+ *
+ *              tg->weight * grq->load.weight
+ *     ---------------------------------------------------         (5)
+ *     tg->load_avg - grq->avg.load_avg + grq->load.weight
+ *
+ *
+ * And that is shares_weight and is icky. In the (near) UP case it approaches
+ * (4) while in the normal case it approaches (3). It consistently
+ * overestimates the ge->load.weight and therefore:
+ *
+ *   \Sum ge->load.weight >= tg->weight
+ *
+ * hence icky!
+ */
 static long calc_cfs_shares(struct cfs_rq *cfs_rq)
 {
 	long tg_weight, tg_shares, load, shares;
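
To see the transient that motivates (5), here is a standalone toy C program; it is not kernel code (calc_cfs_shares() itself does additional clamping), and all numbers are invented. The arrays grq_weight[] and grq_load_avg[] stand in for grq->load.weight and grq->avg.load_avg: a group of weight 1024 on four CPUs, where a task has just woken on CPU0 so its weight is instant but its average has not built up yet.

/*
 * Toy model of equations (1), (3) and (5) from the comment above.
 * Not kernel code; all values are made up for illustration.
 */
#include <stdio.h>

#define NR_CPUS	4

static const long tg_weight = 1024;		/* tg->weight */

static long sum(const long v[])
{
	long s = 0;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		s += v[i];
	return s;
}

/* Equation (1): the exact (expensive) hierarchical proportion. */
static long shares_exact(const long grq_weight[], int cpu)
{
	return tg_weight * grq_weight[cpu] / sum(grq_weight);
}

/* Equation (3), shares_avg: substitute the slow load averages. */
static long shares_avg(const long grq_load_avg[], int cpu)
{
	return tg_weight * grq_load_avg[cpu] / sum(grq_load_avg);
}

/* Equation (5), shares_weight: use the instantaneous weight locally. */
static long shares_weight(const long grq_load_avg[],
			  const long grq_weight[], int cpu)
{
	return tg_weight * grq_weight[cpu] /
	       (sum(grq_load_avg) - grq_load_avg[cpu] + grq_weight[cpu]);
}

int main(void)
{
	/*
	 * CPUs 1-3 each run one steady task; a task just woke on CPU0,
	 * so its weight is 1024 but its average is still tiny.
	 */
	const long grq_weight[NR_CPUS]   = { 1024, 1024, 1024, 1024 };
	const long grq_load_avg[NR_CPUS] = {   17, 1024, 1024, 1024 };

	printf("(1) exact:         %ld\n", shares_exact(grq_weight, 0));   /* 256 */
	printf("(3) shares_avg:    %ld\n", shares_avg(grq_load_avg, 0));   /*   5 */
	printf("(5) shares_weight: %ld\n",
	       shares_weight(grq_load_avg, grq_weight, 0));                /* 256 */
	return 0;
}

Note how (3) gives the waking entity almost nothing until the average catches up, while (5) matches the exact proportion right away. Zeroing the CPU1-3 entries gives the near-UP case: (5) evaluates to 1024*1024/(17 - 17 + 1024) = 1024 = tg_weight, i.e. it collapses to (4) immediately.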