Explain the magic equation in calc_cfs_shares() a bit better.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2707,6 +2707,67 @@ account_entity_dequeue(struct cfs_rq *cf
#ifdef CONFIG_FAIR_GROUP_SCHED
# ifdef CONFIG_SMP
+/*
+ * All this does is approximate the hierarchical proportion which includes that
+ * global sum we all love to hate.
+ *
+ * That is, the weight of a group entity, is the proportional share of the
+ * group weight based on the group runqueue weights. That is:
+ *
+ * tg->weight * grq->load.weight
+ * ge->load.weight = ----------------------------- (1)
+ * \Sum grq->load.weight
+ *
+ * Now, because computing that sum is prohibitively expensive to compute (been
+ * there, done that) we approximate it with this average stuff. The average
+ * moves slower and therefore the approximation is cheaper and more stable.
+ *
+ * So instead of the above, we substitute:
+ *
+ * grq->load.weight -> grq->avg.load_avg (2)
+ *
+ * which yields the following:
+ *
+ * tg->weight * grq->avg.load_avg
+ * ge->load.weight = ------------------------------ (3)
+ * tg->load_avg
+ *
+ * Where: tg->load_avg ~= \Sum grq->avg.load_avg
+ *
+ * That is shares_avg, and it is right (given the approximation (2)).
+ *
+ * The problem with it is that because the average is slow -- it was designed
+ * to be exactly that of course -- this leads to transients in boundary
+ * conditions. In specific, the case where the group was idle and we start the
+ * one task. It takes time for our CPU's grq->avg.load_avg to build up,
+ * yielding bad latency etc..
+ *
+ * Now, in that special case (1) reduces to:
+ *
+ * tg->weight * grq->load.weight
+ * ge->load.weight = ----------------------------- = tg>weight (4)
+ * grp->load.weight
+ *
+ * That is, the sum collapses because all other CPUs are idle; the UP scenario.
+ *
+ * So what we do is modify our approximation (3) to approach (4) in the (near)
+ * UP case, like:
+ *
+ * ge->load.weight =
+ *
+ * tg->weight * grq->load.weight
+ * --------------------------------------------------- (5)
+ * tg->load_avg - grq->avg.load_avg + grq->load.weight
+ *
+ *
+ * And that is shares_weight and is icky. In the (near) UP case it approaches
+ * (4) while in the normal case it approaches (3). It consistently
+ * overestimates the ge->load.weight and therefore:
+ *
+ * \Sum ge->load.weight >= tg->weight
+ *
+ * hence icky!
+ */
static long calc_cfs_shares(struct cfs_rq *cfs_rq)
{
long tg_weight, tg_shares, load, shares;
On Fri, Sep 01, 2017 at 03:21:01PM +0200, Peter Zijlstra wrote:
> Explain the magic equation in calc_cfs_shares() a bit better.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 61 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2707,6 +2707,67 @@ account_entity_dequeue(struct cfs_rq *cf
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> # ifdef CONFIG_SMP
> +/*
> + * All this does is approximate the hierarchical proportion which includes that
> + * global sum we all love to hate.
> + *
> + * That is, the weight of a group entity, is the proportional share of the
> + * group weight based on the group runqueue weights. That is:
> + *
> + * tg->weight * grq->load.weight
> + * ge->load.weight = ----------------------------- (1)
> + * \Sum grq->load.weight
> + *
> + * Now, because computing that sum is prohibitively expensive to compute (been
> + * there, done that) we approximate it with this average stuff. The average
> + * moves slower and therefore the approximation is cheaper and more stable.
> + *
> + * So instead of the above, we substitute:
> + *
> + * grq->load.weight -> grq->avg.load_avg (2)
> + *
> + * which yields the following:
> + *
> + * tg->weight * grq->avg.load_avg
> + * ge->load.weight = ------------------------------ (3)
> + * tg->load_avg
> + *
> + * Where: tg->load_avg ~= \Sum grq->avg.load_avg
> + *
> + * That is shares_avg, and it is right (given the approximation (2)).
> + *
> + * The problem with it is that because the average is slow -- it was designed
> + * to be exactly that of course -- this leads to transients in boundary
> + * conditions. In specific, the case where the group was idle and we start the
> + * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> + * yielding bad latency etc..
> + *
> + * Now, in that special case (1) reduces to:
> + *
> + * tg->weight * grq->load.weight
> + * ge->load.weight = ----------------------------- = tg>weight (4)
> + * grp->load.weight
Should it be "grq->load.weight" in the denominator of (4)?
And "tg->weight" at the end?
> + *
> + * That is, the sum collapses because all other CPUs are idle; the UP scenario.
Shouldn't (3) collapse in the same way too in this special case? In
theory it should reduce to:
tg->weight * grq->avg.load_avg
ge->load.weight = ------------------------------
grq->avg.load_avg
But I can see many reasons why it won't happen in practice if things
aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in
(3) aren't in sync, or there are stale contributions to tg->load_avg
from other cpus then (3) can return anything between 0 and tg->weight.
> + *
> + * So what we do is modify our approximation (3) to approach (4) in the (near)
> + * UP case, like:
> + *
> + * ge->load.weight =
> + *
> + * tg->weight * grq->load.weight
> + * --------------------------------------------------- (5)
> + * tg->load_avg - grq->avg.load_avg + grq->load.weight
> + *
> + *
> + * And that is shares_weight and is icky. In the (near) UP case it approaches
> + * (4) while in the normal case it approaches (3). It consistently
> + * overestimates the ge->load.weight and therefore:
> + *
> + * \Sum ge->load.weight >= tg->weight
> + *
> + * hence icky!
IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
tasks, you can end up with underestimating the ge->load.weight for some
of the group entities lead to \Sum ge->load.weight < tg->weight.
Let's take a simple example:
Two cpus, one task group with three tasks in it: An always-running task
on both cpus, and an additional periodic task currently blocked on cpu 0
(contributing 512 to grq->avg.load_avg on cpu 0).
tg->weight = 1024
tg->load_avg = 2560
\Sum grq->load.weight = 2048
cpu 0 1 \Sum
grq->avg.load_avg 1536 1024
grq->load.weight 1024 1024
ge->load_weight (1) 512 512 1024 >= tg->weight
ge->load_weight (3) 614 410 1024 >= tg->weight
ge->load_weight (5) 512 410 922 < tg->weight
So with (5) we are missing 102 worth of ge->load.weight.
If (1), the instantaneous ge->load.weight, is what we want, then
ge->load.weight of cpu 1 is underestimated, if (3), shares_avg, is the
goal, then ge->load.weight of cpu 0 is underestimated.
The "missing" ge->load.weight can get much larger if the blocked task
had higher priority.
Another thing is that we are loosing a bit of the nice stability that
(3) provides if you have periodic tasks.
I'm not sure if we can do better than (5), I'm just trying to understand
how the approximation will behave and make sure we understand the
implications.
Morten
On Thu, Sep 28, 2017 at 11:03:03AM +0100, Morten Rasmussen wrote:
> > +/*
> > + * All this does is approximate the hierarchical proportion which includes that
> > + * global sum we all love to hate.
> > + *
> > + * That is, the weight of a group entity, is the proportional share of the
> > + * group weight based on the group runqueue weights. That is:
> > + *
> > + * tg->weight * grq->load.weight
> > + * ge->load.weight = ----------------------------- (1)
> > + * \Sum grq->load.weight
> > + *
> > + * Now, because computing that sum is prohibitively expensive to compute (been
> > + * there, done that) we approximate it with this average stuff. The average
> > + * moves slower and therefore the approximation is cheaper and more stable.
> > + *
> > + * So instead of the above, we substitute:
> > + *
> > + * grq->load.weight -> grq->avg.load_avg (2)
> > + *
> > + * which yields the following:
> > + *
> > + * tg->weight * grq->avg.load_avg
> > + * ge->load.weight = ------------------------------ (3)
> > + * tg->load_avg
> > + *
> > + * Where: tg->load_avg ~= \Sum grq->avg.load_avg
> > + *
> > + * That is shares_avg, and it is right (given the approximation (2)).
> > + *
> > + * The problem with it is that because the average is slow -- it was designed
> > + * to be exactly that of course -- this leads to transients in boundary
> > + * conditions. In specific, the case where the group was idle and we start the
> > + * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> > + * yielding bad latency etc..
> > + *
> > + * Now, in that special case (1) reduces to:
> > + *
> > + * tg->weight * grq->load.weight
> > + * ge->load.weight = ----------------------------- = tg>weight (4)
> > + * grp->load.weight
>
> Should it be "grq->load.weight" in the denominator of (4)?
> And "tg->weight" at the end?
Yes, otherwise its all doesn't really make sense :-) Typing is hard it
seems.
> > + *
> > + * That is, the sum collapses because all other CPUs are idle; the UP scenario.
>
> Shouldn't (3) collapse in the same way too in this special case?
That's more difficult to see and (1) is the canonical form.
> In theory it should reduce to:
>
> tg->weight * grq->avg.load_avg
> ge->load.weight = ------------------------------
> grq->avg.load_avg
Yes, agreed.
> But I can see many reasons why it won't happen in practice if things
> aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in
> (3) aren't in sync, or there are stale contributions to tg->load_avg
> from other cpus then (3) can return anything between 0 and tg->weight.
Just so.
> > + *
> > + * So what we do is modify our approximation (3) to approach (4) in the (near)
> > + * UP case, like:
> > + *
> > + * ge->load.weight =
> > + *
> > + * tg->weight * grq->load.weight
> > + * --------------------------------------------------- (5)
> > + * tg->load_avg - grq->avg.load_avg + grq->load.weight
> > + *
> > + *
> > + * And that is shares_weight and is icky. In the (near) UP case it approaches
> > + * (4) while in the normal case it approaches (3). It consistently
> > + * overestimates the ge->load.weight and therefore:
> > + *
> > + * \Sum ge->load.weight >= tg->weight
> > + *
> > + * hence icky!
>
> IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
> tasks, you can end up with underestimating the ge->load.weight for some
> of the group entities lead to \Sum ge->load.weight < tg->weight.
Ah yes, you're right. However, if you look at the end of the series we
actually end up with using:
max(grq->load.weight, grq->avg.load_avg)
Which I suppose makes it true again.
On Fri, Sep 29, 2017 at 01:35:00PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 28, 2017 at 11:03:03AM +0100, Morten Rasmussen wrote:
> > IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
> > tasks, you can end up with underestimating the ge->load.weight for some
> > of the group entities lead to \Sum ge->load.weight < tg->weight.
>
> Ah yes, you're right. However, if you look at the end of the series we
> actually end up with using:
>
> max(grq->load.weight, grq->avg.load_avg)
>
> Which I suppose makes it true again.
Yes, with the next patch in the series, underestimation is no longer
possible.