Date: Fri, 29 Sep 2017 13:35:00 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, tj@kernel.org,
        josef@toxicpanda.com, torvalds@linux-foundation.org,
        vincent.guittot@linaro.org, efault@gmx.de, pjt@google.com, clm@fb.com,
        dietmar.eggemann@arm.com, bsegall@google.com, yuyang.du@intel.com
Subject: Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()
Message-ID: <20170929113500.kryvylelokyohssz@hirez.programming.kicks-ass.net>
References: <20170901132059.342024223@infradead.org>
 <20170901132748.083733695@infradead.org>
 <20170928100303.GA962@e105550-lin.cambridge.arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170928100303.GA962@e105550-lin.cambridge.arm.com>
User-Agent: NeoMutt/20170609 (1.8.3)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3811
Lines: 100

On Thu, Sep 28, 2017 at 11:03:03AM +0100, Morten Rasmussen wrote:

> > +/*
> > + * All this does is approximate the hierarchical proportion which includes that
> > + * global sum we all love to hate.
> > + *
> > + * That is, the weight of a group entity, is the proportional share of the
> > + * group weight based on the group runqueue weights. That is:
> > + *
> > + *                     tg->weight * grq->load.weight
> > + *   ge->load.weight = -----------------------------               (1)
> > + *			  \Sum grq->load.weight
> > + *
> > + * Now, because computing that sum is prohibitively expensive to compute (been
> > + * there, done that) we approximate it with this average stuff. The average
> > + * moves slower and therefore the approximation is cheaper and more stable.
> > + *
> > + * So instead of the above, we substitute:
> > + *
> > + *   grq->load.weight -> grq->avg.load_avg                         (2)
> > + *
> > + * which yields the following:
> > + *
> > + *                     tg->weight * grq->avg.load_avg
> > + *   ge->load.weight = ------------------------------              (3)
> > + *				tg->load_avg
> > + *
> > + * Where: tg->load_avg ~= \Sum grq->avg.load_avg
> > + *
> > + * That is shares_avg, and it is right (given the approximation (2)).
> > + *
> > + * The problem with it is that because the average is slow -- it was designed
> > + * to be exactly that of course -- this leads to transients in boundary
> > + * conditions. In specific, the case where the group was idle and we start the
> > + * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> > + * yielding bad latency etc..
> > + *
> > + * Now, in that special case (1) reduces to:
> > + *
> > + *                     tg->weight * grq->load.weight
> > + *   ge->load.weight = ----------------------------- = tg>weight   (4)
> > + *			    grp->load.weight
> 
> Should it be "grq->load.weight" in the denominator of (4)?
> And "tg->weight" at the end?

Yes, otherwise its all doesn't really make sense :-) Typing is hard it
seems.

> > + *
> > + * That is, the sum collapses because all other CPUs are idle; the UP scenario.
> 
> Shouldn't (3) collapse in the same way too in this special case?

That's more difficult to see and (1) is the canonical form.

> In theory it should reduce to:
> 
>                      tg->weight * grq->avg.load_avg
>    ge->load.weight = ------------------------------
> 			grq->avg.load_avg

Yes, agreed.

> But I can see many reasons why it won't happen in practice if things
> aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in
> (3) aren't in sync, or there are stale contributions to tg->load_avg
> from other cpus then (3) can return anything between 0 and tg->weight.

Just so.

> > + *
> > + * So what we do is modify our approximation (3) to approach (4) in the (near)
> > + * UP case, like:
> > + *
> > + *   ge->load.weight =
> > + *
> > + *              tg->weight * grq->load.weight
> > + *     ---------------------------------------------------         (5)
> > + *     tg->load_avg - grq->avg.load_avg + grq->load.weight
> > + *
> > + *
> > + * And that is shares_weight and is icky. In the (near) UP case it approaches
> > + * (4) while in the normal case it approaches (3). It consistently
> > + * overestimates the ge->load.weight and therefore:
> > + *
> > + *   \Sum ge->load.weight >= tg->weight
> > + *
> > + * hence icky!
> 
> IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
> tasks, you can end up with underestimating the ge->load.weight for some
> of the group entities lead to \Sum ge->load.weight < tg->weight.

Ah yes, you're right. However, if you look at the end of the series we
actually end up with using:

	max(grq->load.weight, grq->avg.load_avg)

Which I suppose makes it true again.