From: Josef Bacik <josef@toxicpanda.com>
To: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org,
        umgwanakikbuti@gmail.com, tj@kernel.org, kernel-team@fb.com
Cc: Josef Bacik <jbacik@fb.com>
Subject: [PATCH 6/7] sched/fair: rework effective_load
Date: Fri, 14 Jul 2017 13:21:03 +0000
Message-Id: <1500038464-8742-7-git-send-email-josef@toxicpanda.com>
In-Reply-To: <1500038464-8742-1-git-send-email-josef@toxicpanda.com>
References: <1500038464-8742-1-git-send-email-josef@toxicpanda.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 8891
Lines: 240

From: Josef Bacik <jbacik@fb.com>

effective_load()'s job is to calculate the delta in load from the root
task_group's perspective by adding a task into the heirarchy.  Before it did
this by adding the load average to load average of the cfs_rq, calculating the
new share amount for the cfs_rq, and then subtracting the old se->load.avg from
this value.  This should theoretically be correct, but in testing I discovered
we were getting negative deltas constantly when we were supposed to be adding
load.

The reason for this is the se->avg.load_avg is based on a historical weight,
which is what we want, except we're using current values to calculate our new
weight.  So tg->load_avg could have gone down because of load removed on other
CPU's since we calculated our weight for our cfs_rq, which means our newly
calculated weight could end up being lower than our historic weight.
Propagating this up the whole hierarchy means we'd see a lot of drift.

To solve this we need to re-calculate our historic weight using with the current
tg->load_avg.  This solves the negative deltas when we should be getting
positive deltas.

The other aspect of the problem is that we were using the load average instead
of the weight.  This isn't strictly accurate from what happens normally.
Normally we add weight to our cfs_rq->load.weight, and then the load_avg
reflects this as things run.  Using load_avg for everything will mean that CPU's
with interactive tasks will be biased against CPU's with non-interactive tasks,
essentially resulting in the effective_load() delta being smaller than it should
be, and thus bias us towards waking with affinity.  Instead use weights
everywhere for a more consistent behavior.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 kernel/sched/fair.c | 144 +++++++++++++++++++++-------------------------------
 1 file changed, 57 insertions(+), 87 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e4fc5d..034d5df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5484,110 +5484,82 @@ static unsigned long cpu_avg_load_per_task(int cpu)
  * on this @cpu and results in a total addition (subtraction) of @wg to the
  * total group weight.
  *
- * Given a runqueue weight distribution (rw_i) we can compute a shares
- * distribution (s_i) using:
- *
- *   s_i = rw_i / \Sum rw_j						(1)
- *
- * Suppose we have 4 CPUs and our @tg is a direct child of the root group and
- * has 7 equal weight tasks, distributed as below (rw_i), with the resulting
- * shares distribution (s_i):
- *
- *   rw_i = {   2,   4,   1,   0 }
- *   s_i  = { 2/7, 4/7, 1/7,   0 }
- *
- * As per wake_affine() we're interested in the load of two CPUs (the CPU the
- * task used to run on and the CPU the waker is running on), we need to
- * compute the effect of waking a task on either CPU and, in case of a sync
- * wakeup, compute the effect of the current task going to sleep.
- *
- * So for a change of @wl to the local @cpu with an overall group weight change
- * of @wl we can compute the new shares distribution (s'_i) using:
- *
- *   s'_i = (rw_i + @wl) / (@wg + \Sum rw_j)				(2)
- *
- * Suppose we're interested in CPUs 0 and 1, and want to compute the load
- * differences in waking a task to CPU 0. The additional task changes the
- * weight and shares distributions like:
- *
- *   rw'_i = {   3,   4,   1,   0 }
- *   s'_i  = { 3/8, 4/8, 1/8,   0 }
- *
- * We can then compute the difference in effective weight by using:
- *
- *   dw_i = S * (s'_i - s_i)						(3)
- *
- * Where 'S' is the group weight as seen by its parent.
- *
- * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
- * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
- * 4/7) times the weight of the group.
- */
-static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
-{
+ * This mirrors essentially what calc_cfs_shares does, but only with the
+ * weights.  Since we're computing theoretical change to the root_task_group we
+ * need to emulate what happens when a new task is scheduled, that is it's
+ * weight is added to the cfs_rq->load.weight.  This has no immediate effect in
+ * the normal case, it only adjusts the load_avg as things run in the hierarchy.
+ * So in order to get a better view of the change from the root task we use only
+ * weights.
+ *
+ * We also cannot rely on previously calculated weights for the groups, as those
+ * were generated in the past with a different tg->load_avg.  Load could have
+ * been removed from other CPU's, which would cause this to return a negative
+ * delta despite "adding" load to the heirarchy.  So we have to re-compute our
+ * current weight as well as the theoretical weight if we were to add this task
+ * in order to get an accurate delta.
+ *
+ * Which brings me to how this works.  In normal operations we add
+ * task->se.load.weight to cfs_rq->load.weight.  Then we calc calc_cfs_shares()
+ * on the cfs_rq, which generates the new weight for its se on its parent
+ * cfs_rq.  We then have to add that delta to the parent cfs_rq->load.weight,
+ * and regenerate its new weight for its se for its parent cfs_rq, and so on and
+ * so forth up the heirarchy.  Once we get to the top our load_delta will be the
+ * delta as seen by the root task group.
+ */
+static long effective_load(struct task_group *tg, int cpu, long wg)
+{
+	long load_delta = 0;
 	struct sched_entity *se = tg->se[cpu];
 
 	if (!tg->parent)	/* the trivial, non-cgroup case */
-		return wl;
+		return wg;
 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = se->my_q;
-		long W, w = cfs_rq_load_avg(cfs_rq);
+		long rq_weight = scale_load_down(cfs_rq->load.weight);
+		long tg_weight, tg_new_weight;
+		long new_weight = tg_weight + wg;
+		long old_weight;
+		long tg_shares;
 
 		tg = cfs_rq->tg;
+		tg_shares = READ_ONCE(tg->shares);
 
-		/*
-		 * W = @wg + \Sum rw_j
-		 */
-		W = wg + atomic_long_read(&tg->load_avg);
-
-		/* Ensure \Sum rw_j >= rw_i */
-		W -= cfs_rq->tg_load_avg_contrib;
-		W += w;
+		tg_weight = atomic_long_read(&tg->load_avg);
+		tg_weight -= cfs_rq->tg_load_avg_contrib;
 
 		/*
-		 * w = rw_i + @wl
+		 * We add in our 'load' to the denominator, this mirrors that
+		 * addition in calc_cfs_shares, but with a different load for
+		 * either of the scenarios.
 		 */
-		w += wl;
+		tg_new_weight = tg_weight + new_weight;
+		tg_weight += rq_weight;
 
-		/*
-		 * wl = S * s'_i; see (2)
-		 */
-		if (W > 0 && w < W)
-			wl = (w * (long)scale_load_down(tg->shares)) / W;
-		else
-			wl = scale_load_down(tg->shares);
+		wg = tg_shares * new_weight;
+		old_weight = tg_shares * rq_weight;
+		if (tg_weight)
+			old_weight /= tg_weight;
+		if (tg_new_weight)
+			wg /= tg_new_weight;
 
-		/*
-		 * Per the above, wl is the new se->load.weight value; since
-		 * those are clipped to [MIN_SHARES, ...) do so now. See
-		 * calc_cfs_shares().
-		 */
-		if (wl < MIN_SHARES)
-			wl = MIN_SHARES;
 
-		/*
-		 * wl = dw_i = S * (s'_i - s_i); see (3)
-		 */
-		wl -= se->avg.load_avg;
+		wg = clamp_t(long, wg, MIN_SHARES, tg_shares);
+		old_weight = clamp_t(long, old_weight, MIN_SHARES, tg_shares);
 
-		/*
-		 * Recursively apply this logic to all parent groups to compute
-		 * the final effective load change on the root group. Since
-		 * only the @tg group gets extra weight, all parent groups can
-		 * only redistribute existing shares. @wl is the shift in shares
-		 * resulting from this level per the above.
-		 */
-		wg = 0;
+		wg = scale_load_down(wg);
+		old_weight = scale_load_down(old_weight);
+		load_delta = wg - old_weight;
 	}
 
-	return wl;
+	return load_delta;
 }
 #else
 
-static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
+static long effective_load(struct task_group *tg, int cpu, long wg)
 {
-	return wl;
+	return wg;
 }
 
 #endif
@@ -5646,7 +5618,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	s64 this_eff_load, prev_eff_load;
 	int idx, this_cpu;
 	struct task_group *tg;
-	unsigned long weight, avg;
+	unsigned long weight;
 	int balanced;
 
 	idx	  = sd->wake_idx;
@@ -5662,14 +5634,12 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sync) {
 		tg = task_group(current);
 		weight = se_weight(&current->se);
-		avg = current->se.avg.load_avg;
 
-		this_load += effective_load(tg, this_cpu, -avg, -weight);
+		this_load += effective_load(tg, this_cpu, -weight);
 	}
 
 	tg = task_group(p);
 	weight = se_weight(&p->se);
-	avg = p->se.avg.load_avg;
 
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -5688,7 +5658,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 
 	if (this_load > 0) {
 		this_eff_load *= this_load +
-			effective_load(tg, this_cpu, avg, weight);
+			effective_load(tg, this_cpu, weight);
 
 		prev_eff_load *= load;
 	}
-- 
2.9.3