Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752805AbdI1KDQ (ORCPT ); Thu, 28 Sep 2017 06:03:16 -0400 Received: from foss.arm.com ([217.140.101.70]:54526 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752656AbdI1KDM (ORCPT ); Thu, 28 Sep 2017 06:03:12 -0400 Date: Thu, 28 Sep 2017 11:03:03 +0100 From: Morten Rasmussen To: Peter Zijlstra Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, tj@kernel.org, josef@toxicpanda.com, torvalds@linux-foundation.org, vincent.guittot@linaro.org, efault@gmx.de, pjt@google.com, clm@fb.com, dietmar.eggemann@arm.com, bsegall@google.com, yuyang.du@intel.com Subject: Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares() Message-ID: <20170928100303.GA962@e105550-lin.cambridge.arm.com> References: <20170901132059.342024223@infradead.org> <20170901132748.083733695@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170901132748.083733695@infradead.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4869 Lines: 131 On Fri, Sep 01, 2017 at 03:21:01PM +0200, Peter Zijlstra wrote: > Explain the magic equation in calc_cfs_shares() a bit better. > > Signed-off-by: Peter Zijlstra (Intel) > --- > kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 61 insertions(+) > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2707,6 +2707,67 @@ account_entity_dequeue(struct cfs_rq *cf > > #ifdef CONFIG_FAIR_GROUP_SCHED > # ifdef CONFIG_SMP > +/* > + * All this does is approximate the hierarchical proportion which includes that > + * global sum we all love to hate. > + * > + * That is, the weight of a group entity, is the proportional share of the > + * group weight based on the group runqueue weights. That is: > + * > + * tg->weight * grq->load.weight > + * ge->load.weight = ----------------------------- (1) > + * \Sum grq->load.weight > + * > + * Now, because computing that sum is prohibitively expensive to compute (been > + * there, done that) we approximate it with this average stuff. The average > + * moves slower and therefore the approximation is cheaper and more stable. > + * > + * So instead of the above, we substitute: > + * > + * grq->load.weight -> grq->avg.load_avg (2) > + * > + * which yields the following: > + * > + * tg->weight * grq->avg.load_avg > + * ge->load.weight = ------------------------------ (3) > + * tg->load_avg > + * > + * Where: tg->load_avg ~= \Sum grq->avg.load_avg > + * > + * That is shares_avg, and it is right (given the approximation (2)). > + * > + * The problem with it is that because the average is slow -- it was designed > + * to be exactly that of course -- this leads to transients in boundary > + * conditions. In specific, the case where the group was idle and we start the > + * one task. It takes time for our CPU's grq->avg.load_avg to build up, > + * yielding bad latency etc.. > + * > + * Now, in that special case (1) reduces to: > + * > + * tg->weight * grq->load.weight > + * ge->load.weight = ----------------------------- = tg>weight (4) > + * grp->load.weight Should it be "grq->load.weight" in the denominator of (4)? And "tg->weight" at the end? > + * > + * That is, the sum collapses because all other CPUs are idle; the UP scenario. Shouldn't (3) collapse in the same way too in this special case? In theory it should reduce to: tg->weight * grq->avg.load_avg ge->load.weight = ------------------------------ grq->avg.load_avg But I can see many reasons why it won't happen in practice if things aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in (3) aren't in sync, or there are stale contributions to tg->load_avg from other cpus then (3) can return anything between 0 and tg->weight. > + * > + * So what we do is modify our approximation (3) to approach (4) in the (near) > + * UP case, like: > + * > + * ge->load.weight = > + * > + * tg->weight * grq->load.weight > + * --------------------------------------------------- (5) > + * tg->load_avg - grq->avg.load_avg + grq->load.weight > + * > + * > + * And that is shares_weight and is icky. In the (near) UP case it approaches > + * (4) while in the normal case it approaches (3). It consistently > + * overestimates the ge->load.weight and therefore: > + * > + * \Sum ge->load.weight >= tg->weight > + * > + * hence icky! IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked tasks, you can end up with underestimating the ge->load.weight for some of the group entities lead to \Sum ge->load.weight < tg->weight. Let's take a simple example: Two cpus, one task group with three tasks in it: An always-running task on both cpus, and an additional periodic task currently blocked on cpu 0 (contributing 512 to grq->avg.load_avg on cpu 0). tg->weight = 1024 tg->load_avg = 2560 \Sum grq->load.weight = 2048 cpu 0 1 \Sum grq->avg.load_avg 1536 1024 grq->load.weight 1024 1024 ge->load_weight (1) 512 512 1024 >= tg->weight ge->load_weight (3) 614 410 1024 >= tg->weight ge->load_weight (5) 512 410 922 < tg->weight So with (5) we are missing 102 worth of ge->load.weight. If (1), the instantaneous ge->load.weight, is what we want, then ge->load.weight of cpu 1 is underestimated, if (3), shares_avg, is the goal, then ge->load.weight of cpu 0 is underestimated. The "missing" ge->load.weight can get much larger if the blocked task had higher priority. Another thing is that we are loosing a bit of the nice stability that (3) provides if you have periodic tasks. I'm not sure if we can do better than (5), I'm just trying to understand how the approximation will behave and make sure we understand the implications. Morten