Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752067AbdI2LfZ (ORCPT ); Fri, 29 Sep 2017 07:35:25 -0400 Received: from merlin.infradead.org ([205.233.59.134]:41588 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751490AbdI2LfY (ORCPT ); Fri, 29 Sep 2017 07:35:24 -0400 Date: Fri, 29 Sep 2017 13:35:00 +0200 From: Peter Zijlstra To: Morten Rasmussen Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, tj@kernel.org, josef@toxicpanda.com, torvalds@linux-foundation.org, vincent.guittot@linaro.org, efault@gmx.de, pjt@google.com, clm@fb.com, dietmar.eggemann@arm.com, bsegall@google.com, yuyang.du@intel.com Subject: Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares() Message-ID: <20170929113500.kryvylelokyohssz@hirez.programming.kicks-ass.net> References: <20170901132059.342024223@infradead.org> <20170901132748.083733695@infradead.org> <20170928100303.GA962@e105550-lin.cambridge.arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170928100303.GA962@e105550-lin.cambridge.arm.com> User-Agent: NeoMutt/20170609 (1.8.3) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3811 Lines: 100 On Thu, Sep 28, 2017 at 11:03:03AM +0100, Morten Rasmussen wrote: > > +/* > > + * All this does is approximate the hierarchical proportion which includes that > > + * global sum we all love to hate. > > + * > > + * That is, the weight of a group entity, is the proportional share of the > > + * group weight based on the group runqueue weights. That is: > > + * > > + * tg->weight * grq->load.weight > > + * ge->load.weight = ----------------------------- (1) > > + * \Sum grq->load.weight > > + * > > + * Now, because computing that sum is prohibitively expensive to compute (been > > + * there, done that) we approximate it with this average stuff. The average > > + * moves slower and therefore the approximation is cheaper and more stable. > > + * > > + * So instead of the above, we substitute: > > + * > > + * grq->load.weight -> grq->avg.load_avg (2) > > + * > > + * which yields the following: > > + * > > + * tg->weight * grq->avg.load_avg > > + * ge->load.weight = ------------------------------ (3) > > + * tg->load_avg > > + * > > + * Where: tg->load_avg ~= \Sum grq->avg.load_avg > > + * > > + * That is shares_avg, and it is right (given the approximation (2)). > > + * > > + * The problem with it is that because the average is slow -- it was designed > > + * to be exactly that of course -- this leads to transients in boundary > > + * conditions. In specific, the case where the group was idle and we start the > > + * one task. It takes time for our CPU's grq->avg.load_avg to build up, > > + * yielding bad latency etc.. > > + * > > + * Now, in that special case (1) reduces to: > > + * > > + * tg->weight * grq->load.weight > > + * ge->load.weight = ----------------------------- = tg>weight (4) > > + * grp->load.weight > > Should it be "grq->load.weight" in the denominator of (4)? > And "tg->weight" at the end? Yes, otherwise its all doesn't really make sense :-) Typing is hard it seems. > > + * > > + * That is, the sum collapses because all other CPUs are idle; the UP scenario. > > Shouldn't (3) collapse in the same way too in this special case? That's more difficult to see and (1) is the canonical form. > In theory it should reduce to: > > tg->weight * grq->avg.load_avg > ge->load.weight = ------------------------------ > grq->avg.load_avg Yes, agreed. > But I can see many reasons why it won't happen in practice if things > aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in > (3) aren't in sync, or there are stale contributions to tg->load_avg > from other cpus then (3) can return anything between 0 and tg->weight. Just so. > > + * > > + * So what we do is modify our approximation (3) to approach (4) in the (near) > > + * UP case, like: > > + * > > + * ge->load.weight = > > + * > > + * tg->weight * grq->load.weight > > + * --------------------------------------------------- (5) > > + * tg->load_avg - grq->avg.load_avg + grq->load.weight > > + * > > + * > > + * And that is shares_weight and is icky. In the (near) UP case it approaches > > + * (4) while in the normal case it approaches (3). It consistently > > + * overestimates the ge->load.weight and therefore: > > + * > > + * \Sum ge->load.weight >= tg->weight > > + * > > + * hence icky! > > IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked > tasks, you can end up with underestimating the ge->load.weight for some > of the group entities lead to \Sum ge->load.weight < tg->weight. Ah yes, you're right. However, if you look at the end of the series we actually end up with using: max(grq->load.weight, grq->avg.load_avg) Which I suppose makes it true again.