2017-07-14 13:21:13

by Josef Bacik

Subject: [PATCH 0/7][RESEND] Fix cpu imbalance with equal weighted groups

(sorry to anybody who got this already, fb email is acting weird and ate the
linux-kernel submission so I have no idea who got this and who didn't.)

Hello,

In testing stacked services we noticed that if you started a normal CPU heavy
application in one cgroup, and a cpu stress test in another cgroup of equal
weight, the cpu stress group would get significantly more cpu time, usually
around 50% more.

Peter fixed some load propagation issues for Tejun a few months ago. Those
fixes resolved the latency issues that Tejun was seeing, however they made this
imbalance worse, so the cpu stress group was now getting more like 70% more CPU
time.

The following patches fix the regression introduced by Peter's patches and then
fix the imbalance itself. Part of the imbalance fix comes from Peter's
propagation patches; we just needed the runnable weight to be calculated
differently to fix the regression.

Essentially what happens is that the "stress" group has tasks that never leave
the CPU, so its load average and runnable load average skew towards its
load.weight. The interactive tasks, however, go on and off the CPU, resulting
in a lower load average. Peter's changes to rely more on the runnable load
average exacerbated this problem.

To solve this problem I've done a few things. First, we use the max of the
weight or the average for our cfs_rq weight calculations. This allows tasks
that have a lower load average but a higher weight to have an appropriate
effect on the cfs_rq when enqueueing.

The second part of the fix is to change how we decide to do wake affinity. If
I simply disabled wake affinity while keeping the other patches, the imbalance
disappeared as well. Fixing the wake affinity involves a few things.

First we need to change effective_load() to re-calculate the historic weight in
addition to the new weight with the new process. This is because simply using
our old weight/load_avg would be inaccurate if the load_avg for the task_group
had changed at all since we calculated our load. In practice this meant that
effective_load would often (almost always for my testcase) return a negative
delta for adding the process to the given CPU. This meant we always did wake
affine, even though the load on the current CPU was too high.

Those patches get us 95% of the way there. The final patch is probably the
most controversial one, but it brings us to complete balance between the two
groups. One thing we observed was that we would wake affine, and then promptly
load balance things off of the CPU that we woke to; tasks would bounce around
CPUs constantly. To avoid this thrashing, record the last time a task was load
balanced, and wait HZ duration before allowing an affinity wakeup for it.
This reduced the thrashing quite a bit, and brought our CPU usage to equality.

I have a stripped down reproducer here

https://github.com/josefbacik/debug-scripts/tree/master/unbalanced-reproducer

unbalanced.sh uses the cgroup2 interface, which requires Tejun's cgroup2 cpu
controller patch; unbalanced-v1.sh uses the old cgroup v1 interface and assumes
you have cpuacct,cpu mounted at /sys/fs/cgroup/cpuacct. You also need rt-app
installed.

Thanks,

Josef


2017-07-14 13:21:18

by Josef Bacik

Subject: [PATCH 4/7] sched/fair: don't include effective load of process in the old cpu load

From: Josef Bacik <[email protected]>

We have no idea how long ago the process went to sleep. It could have just gone
to sleep, in which case we would essentially be counting the load of the process
twice in the previous effective load. Since the history presumably already
exists in the previous cpu load, don't bother adding its effective load again.

Signed-off-by: Josef Bacik <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d958634..ee8dced 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5689,7 +5689,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
this_eff_load *= this_load +
effective_load(tg, this_cpu, weight, weight);

- prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+ prev_eff_load *= load;
}

balanced = this_eff_load <= prev_eff_load;
--
2.9.3

2017-07-14 13:21:20

by Josef Bacik

Subject: [PATCH 7/7] sched/fair: don't wake affine recently load balanced tasks

From: Josef Bacik <[email protected]>

The wake affinity logic will move tasks between two CPUs that appear to be
loaded equally at the current time, with a slight bias towards cache locality.
However, on a heavily loaded system the load balancer has better insight into
what needs to be moved around, so instead keep track of the last time a task was
migrated by the load balancer. If that was recent, opt to let the process stay
on its current CPU (or an idle sibling).

Signed-off-by: Josef Bacik <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 11 +++++++++++
2 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1a0eadd..d872780 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -528,6 +528,7 @@ struct task_struct {
unsigned long wakee_flip_decay_ts;
struct task_struct *last_wakee;

+ unsigned long last_balance_ts;
int wake_cpu;
#endif
int on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 034d5df..6a98a38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5604,6 +5604,16 @@ static int wake_wide(struct task_struct *p)
unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

+ /*
+ * If we've balanced this task recently we don't want to undo all of
+ * that hard work by the load balancer and move it to the current cpu.
+ * Constantly overriding the load balancer's decisions is going to make
+ * it question its purpose in life and give it anxiety and self worth
+ * issues, and nobody wants that.
+ */
+ if (time_before(jiffies, p->last_balance_ts + HZ))
+ return 1;
+
if (master < slave)
swap(master, slave);
if (slave < factor || master < slave * factor)
@@ -7097,6 +7107,7 @@ static int detach_tasks(struct lb_env *env)
goto next;

detach_task(p, env);
+ p->last_balance_ts = jiffies;
list_add(&p->se.group_node, &env->tasks);

detached++;
--
2.9.3

2017-07-14 13:21:38

by Josef Bacik

Subject: [PATCH 6/7] sched/fair: rework effective_load

From: Josef Bacik <[email protected]>

effective_load()'s job is to calculate, from the root task_group's perspective,
the delta in load from adding a task into the hierarchy. Previously it did
this by adding the task's load average to the load average of the cfs_rq,
calculating the new share amount for the cfs_rq, and then subtracting the old
se->load.avg from this value. This should theoretically be correct, but in
testing I discovered we were constantly getting negative deltas when we were
supposed to be adding load.

The reason for this is that se->avg.load_avg is based on a historical weight,
which is what we want, except we're using current values to calculate our new
weight. So tg->load_avg could have gone down because of load removed on other
CPUs since we calculated our weight for our cfs_rq, which means our newly
calculated weight could end up being lower than our historic weight.
Propagating this up the whole hierarchy means we'd see a lot of drift.

To solve this we need to re-calculate our historic weight using the current
tg->load_avg. This resolves the negative deltas when we should be getting
positive deltas.

The other aspect of the problem is that we were using the load average instead
of the weight. This isn't strictly accurate from what happens normally.
Normally we add weight to our cfs_rq->load.weight, and then the load_avg
reflects this as things run. Using load_avg for everything means that CPUs
with interactive tasks will be biased against CPUs with non-interactive tasks,
essentially resulting in the effective_load() delta being smaller than it
should be, and thus biasing us towards waking with affinity. Instead, use
weights everywhere for more consistent behavior.

Signed-off-by: Josef Bacik <[email protected]>
---
kernel/sched/fair.c | 144 +++++++++++++++++++++-------------------------------
1 file changed, 57 insertions(+), 87 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e4fc5d..034d5df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5484,110 +5484,82 @@ static unsigned long cpu_avg_load_per_task(int cpu)
* on this @cpu and results in a total addition (subtraction) of @wg to the
* total group weight.
*
- * Given a runqueue weight distribution (rw_i) we can compute a shares
- * distribution (s_i) using:
- *
- * s_i = rw_i / \Sum rw_j (1)
- *
- * Suppose we have 4 CPUs and our @tg is a direct child of the root group and
- * has 7 equal weight tasks, distributed as below (rw_i), with the resulting
- * shares distribution (s_i):
- *
- * rw_i = { 2, 4, 1, 0 }
- * s_i = { 2/7, 4/7, 1/7, 0 }
- *
- * As per wake_affine() we're interested in the load of two CPUs (the CPU the
- * task used to run on and the CPU the waker is running on), we need to
- * compute the effect of waking a task on either CPU and, in case of a sync
- * wakeup, compute the effect of the current task going to sleep.
- *
- * So for a change of @wl to the local @cpu with an overall group weight change
- * of @wl we can compute the new shares distribution (s'_i) using:
- *
- * s'_i = (rw_i + @wl) / (@wg + \Sum rw_j) (2)
- *
- * Suppose we're interested in CPUs 0 and 1, and want to compute the load
- * differences in waking a task to CPU 0. The additional task changes the
- * weight and shares distributions like:
- *
- * rw'_i = { 3, 4, 1, 0 }
- * s'_i = { 3/8, 4/8, 1/8, 0 }
- *
- * We can then compute the difference in effective weight by using:
- *
- * dw_i = S * (s'_i - s_i) (3)
- *
- * Where 'S' is the group weight as seen by its parent.
- *
- * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
- * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
- * 4/7) times the weight of the group.
- */
-static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
-{
+ * This mirrors essentially what calc_cfs_shares does, but only with the
+ * weights. Since we're computing theoretical change to the root_task_group we
+ * need to emulate what happens when a new task is scheduled, that is, its
+ * weight is added to the cfs_rq->load.weight. This has no immediate effect in
+ * the normal case, it only adjusts the load_avg as things run in the hierarchy.
+ * So in order to get a better view of the change from the root task we use only
+ * weights.
+ *
+ * We also cannot rely on previously calculated weights for the groups, as those
+ * were generated in the past with a different tg->load_avg. Load could have
+ * been removed from other CPU's, which would cause this to return a negative
+ * delta despite "adding" load to the hierarchy. So we have to re-compute our
+ * current weight as well as the theoretical weight if we were to add this task
+ * in order to get an accurate delta.
+ *
+ * Which brings me to how this works. In normal operations we add
+ * task->se.load.weight to cfs_rq->load.weight. Then we call calc_cfs_shares()
+ * on the cfs_rq, which generates the new weight for its se on its parent
+ * cfs_rq. We then have to add that delta to the parent cfs_rq->load.weight,
+ * and regenerate its new weight for its se for its parent cfs_rq, and so on and
+ * so forth up the hierarchy. Once we get to the top our load_delta will be the
+ * delta as seen by the root task group.
+ */
+static long effective_load(struct task_group *tg, int cpu, long wg)
+{
+ long load_delta = 0;
struct sched_entity *se = tg->se[cpu];

if (!tg->parent) /* the trivial, non-cgroup case */
- return wl;
+ return wg;

for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = se->my_q;
- long W, w = cfs_rq_load_avg(cfs_rq);
+ long rq_weight = scale_load_down(cfs_rq->load.weight);
+ long tg_weight, tg_new_weight;
+ long new_weight = rq_weight + wg;
+ long old_weight;
+ long tg_shares;

tg = cfs_rq->tg;
+ tg_shares = READ_ONCE(tg->shares);

- /*
- * W = @wg + \Sum rw_j
- */
- W = wg + atomic_long_read(&tg->load_avg);
-
- /* Ensure \Sum rw_j >= rw_i */
- W -= cfs_rq->tg_load_avg_contrib;
- W += w;
+ tg_weight = atomic_long_read(&tg->load_avg);
+ tg_weight -= cfs_rq->tg_load_avg_contrib;

/*
- * w = rw_i + @wl
+ * We add in our 'load' to the denominator, this mirrors that
+ * addition in calc_cfs_shares, but with a different load for
+ * either of the scenarios.
*/
- w += wl;
+ tg_new_weight = tg_weight + new_weight;
+ tg_weight += rq_weight;

- /*
- * wl = S * s'_i; see (2)
- */
- if (W > 0 && w < W)
- wl = (w * (long)scale_load_down(tg->shares)) / W;
- else
- wl = scale_load_down(tg->shares);
+ wg = tg_shares * new_weight;
+ old_weight = tg_shares * rq_weight;
+ if (tg_weight)
+ old_weight /= tg_weight;
+ if (tg_new_weight)
+ wg /= tg_new_weight;

- /*
- * Per the above, wl is the new se->load.weight value; since
- * those are clipped to [MIN_SHARES, ...) do so now. See
- * calc_cfs_shares().
- */
- if (wl < MIN_SHARES)
- wl = MIN_SHARES;

- /*
- * wl = dw_i = S * (s'_i - s_i); see (3)
- */
- wl -= se->avg.load_avg;
+ wg = clamp_t(long, wg, MIN_SHARES, tg_shares);
+ old_weight = clamp_t(long, old_weight, MIN_SHARES, tg_shares);

- /*
- * Recursively apply this logic to all parent groups to compute
- * the final effective load change on the root group. Since
- * only the @tg group gets extra weight, all parent groups can
- * only redistribute existing shares. @wl is the shift in shares
- * resulting from this level per the above.
- */
- wg = 0;
+ wg = scale_load_down(wg);
+ old_weight = scale_load_down(old_weight);
+ load_delta = wg - old_weight;
}

- return wl;
+ return load_delta;
}
#else

-static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
+static long effective_load(struct task_group *tg, int cpu, long wg)
{
- return wl;
+ return wg;
}

#endif
@@ -5646,7 +5618,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
s64 this_eff_load, prev_eff_load;
int idx, this_cpu;
struct task_group *tg;
- unsigned long weight, avg;
+ unsigned long weight;
int balanced;

idx = sd->wake_idx;
@@ -5662,14 +5634,12 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (sync) {
tg = task_group(current);
weight = se_weight(&current->se);
- avg = current->se.avg.load_avg;

- this_load += effective_load(tg, this_cpu, -avg, -weight);
+ this_load += effective_load(tg, this_cpu, -weight);
}

tg = task_group(p);
weight = se_weight(&p->se);
- avg = p->se.avg.load_avg;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -5688,7 +5658,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,

if (this_load > 0) {
this_eff_load *= this_load +
- effective_load(tg, this_cpu, avg, weight);
+ effective_load(tg, this_cpu, weight);

prev_eff_load *= load;
}
--
2.9.3

2017-07-14 13:21:16

by Josef Bacik

Subject: [PATCH 2/7] sched/fair: calculate runnable_weight slightly differently

From: Josef Bacik <[email protected]>

Our runnable_weight currently looks like this

runnable_weight = shares * runnable_load_avg / load_avg

The goal is to scale the runnable weight for the group based on its runnable to
load_avg ratio. The problem with this is it biases us towards tasks that never
go to sleep. Tasks that go to sleep are going to have their runnable_load_avg
decayed pretty hard, which will drastically reduce the runnable weight of groups
with interactive tasks. To solve this imbalance we tweak this slightly, so in
the ideal case it is still the above, but in the interactive case it is

runnable_weight = shares * runnable_weight / load_weight

which will make the weight distribution fairer between interactive and
non-interactive groups.

Signed-off-by: Josef Bacik <[email protected]>
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 326bc55..5d4489e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2880,9 +2880,15 @@ static void update_cfs_group(struct sched_entity *se)
* Note: we need to deal with very sporadic 'runnable > load' cases
* due to numerical instability.
*/
- runnable = shares * gcfs_rq->avg.runnable_load_avg;
- if (runnable)
- runnable /= max(gcfs_rq->avg.load_avg, gcfs_rq->avg.runnable_load_avg);
+ runnable = shares * max(scale_load_down(gcfs_rq->runnable_weight),
+ gcfs_rq->avg.runnable_load_avg);
+ if (runnable) {
+ long divider = max(gcfs_rq->avg.load_avg,
+ scale_load_down(gcfs_rq->load.weight));
+ divider = max_t(long, 1, divider);
+ runnable /= divider;
+ }
+ runnable = clamp_t(long, runnable, MIN_SHARES, shares);

reweight_entity(cfs_rq_of(se), se, shares, runnable);
}
--
2.9.3

2017-07-14 13:21:58

by Josef Bacik

Subject: [PATCH 5/7] sched/fair: use the task weight instead of average in effective_load

From: Josef Bacik <[email protected]>

This is a preparation patch for the next patch. When adding a new task to a
cfs_rq we do not add its load_avg to the existing cfs_rq, we add its weight,
and that changes how the load average moves as the cfs_rq/task runs. Using the
load average in our effective_load() calculation is going to be slightly
inaccurate relative to what we want to compute (the actual effect of waking
this task on this cpu), and is going to bias us towards always wake-affining
tasks.

Signed-off-by: Josef Bacik <[email protected]>
---
kernel/sched/fair.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee8dced..4e4fc5d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5646,7 +5646,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
s64 this_eff_load, prev_eff_load;
int idx, this_cpu;
struct task_group *tg;
- unsigned long weight;
+ unsigned long weight, avg;
int balanced;

idx = sd->wake_idx;
@@ -5661,14 +5661,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
*/
if (sync) {
tg = task_group(current);
- weight = current->se.avg.load_avg;
+ weight = se_weight(&current->se);
+ avg = current->se.avg.load_avg;

- this_load += effective_load(tg, this_cpu, -weight, -weight);
- load += effective_load(tg, prev_cpu, 0, -weight);
+ this_load += effective_load(tg, this_cpu, -avg, -weight);
}

tg = task_group(p);
- weight = p->se.avg.load_avg;
+ weight = se_weight(&p->se);
+ avg = p->se.avg.load_avg;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -5687,7 +5688,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,

if (this_load > 0) {
this_eff_load *= this_load +
- effective_load(tg, this_cpu, weight, weight);
+ effective_load(tg, this_cpu, avg, weight);

prev_eff_load *= load;
}
--
2.9.3

2017-07-14 13:21:14

by Josef Bacik

Subject: [PATCH 1/7] sched/fair: use reweight_entity to reweight tasks

From: Josef Bacik <[email protected]>

reweight_task only accounts for the load average change in the cfs_rq, but
doesn't account for the runnable load average change in the cfs_rq. We need to
do everything reweight_entity does, and then just set our inv_weight
appropriately.

Signed-off-by: Josef Bacik <[email protected]>
---
kernel/sched/fair.c | 31 +++++++++++--------------------
1 file changed, 11 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b13b451..326bc55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2808,26 +2808,6 @@ __sub_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
}

-void reweight_task(struct task_struct *p, int prio)
-{
- struct sched_entity *se = &p->se;
- struct cfs_rq *cfs_rq = cfs_rq_of(se);
- struct load_weight *load = &p->se.load;
-
- u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
-
- __sub_load_avg(cfs_rq, se);
-
- load->weight = scale_load(sched_prio_to_weight[prio]);
- load->inv_weight = sched_prio_to_wmult[prio];
-
- se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
- se->avg.runnable_load_avg =
- div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
-
- __add_load_avg(cfs_rq, se);
-}
-
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight, unsigned long runnable)
{
@@ -2857,6 +2837,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
}
}

+void reweight_task(struct task_struct *p, int prio)
+{
+ struct sched_entity *se = &p->se;
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ struct load_weight *load = &se->load;
+ unsigned long weight = scale_load(sched_prio_to_weight[prio]);
+
+ reweight_entity(cfs_rq, se, weight, weight);
+ load->inv_weight = sched_prio_to_wmult[prio];
+}
+
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);

/*
--
2.9.3

2017-07-14 13:22:16

by Josef Bacik

Subject: [PATCH 3/7] sched/fair: fix definitions of effective load

From: Josef Bacik <[email protected]>

It appears as though we've reversed the definitions of 'this_effective_load' and
'prev_effective_load'. We seem to be using the wrong CPU capacity for both of
these parameters, and we want to add the imbalance percentage to the current CPU
to make sure we're meeting a high enough load imbalance threshold to justify
moving the task.

Signed-off-by: Josef Bacik <[email protected]>
---
kernel/sched/fair.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d4489e..d958634 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5679,11 +5679,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
* Otherwise check if either cpus are near enough in load to allow this
* task to be woken on this_cpu.
*/
- this_eff_load = 100;
- this_eff_load *= capacity_of(prev_cpu);
+ prev_eff_load = 100;
+ prev_eff_load *= capacity_of(prev_cpu);

- prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
- prev_eff_load *= capacity_of(this_cpu);
+ this_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+ this_eff_load *= capacity_of(this_cpu);

if (this_load > 0) {
this_eff_load *= this_load +
--
2.9.3

2017-07-31 10:45:54

by Peter Zijlstra

Subject: Re: [PATCH 1/7] sched/fair: use reweight_entity to reweight tasks

On Fri, Jul 14, 2017 at 01:20:58PM +0000, Josef Bacik wrote:
> From: Josef Bacik <[email protected]>
>
> reweight_task only accounts for the load average change in the cfs_rq, but
> doesn't account for the runnable_average change in the cfs_rq. We need to do
> everything reweight_entity does, and then we just set our inv_weight
> appropriately.

The difference is in the calling convention. If you look at the
callsite:

set_user_nice()
set_load_weight()
reweight_task()

You'll see that ->on_rq will always be false. That said, I think you're
right in that we're missing a se->runnable_weight update, because while
__update_load_avg*() doesn't use it (and will in fact (re)set it), there
are other users that could come before that.

2017-07-31 11:39:47

by Peter Zijlstra

Subject: Re: [PATCH 2/7] sched/fair: calculate runnable_weight slightly differently

On Fri, Jul 14, 2017 at 01:20:59PM +0000, Josef Bacik wrote:
> From: Josef Bacik <[email protected]>
>
> Our runnable_weight currently looks like this
>
> runnable_weight = shares * runnable_load_avg / load_avg
>
> The goal is to scale the runnable weight for the group based on its runnable to
> load_avg ratio. The problem with this is it biases us towards tasks that never
> go to sleep. Tasks that go to sleep are going to have their runnable_load_avg
> decayed pretty hard, which will drastically reduce the runnable weight of groups
> with interactive tasks. To solve this imbalance we tweak this slightly, so in
> the ideal case it is still the above, but in the interactive case it is
>
> runnable_weight = shares * runnable_weight / load_weight
>
> which will make the weight distribution fairer between interactive and
> non-interactive groups.
>
> Signed-off-by: Josef Bacik <[email protected]>
> ---
> kernel/sched/fair.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 326bc55..5d4489e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2880,9 +2880,15 @@ static void update_cfs_group(struct sched_entity *se)
> * Note: we need to deal with very sporadic 'runnable > load' cases
> * due to numerical instability.
> */
> - runnable = shares * gcfs_rq->avg.runnable_load_avg;
> - if (runnable)
> - runnable /= max(gcfs_rq->avg.load_avg, gcfs_rq->avg.runnable_load_avg);
> + runnable = shares * max(scale_load_down(gcfs_rq->runnable_weight),
> + gcfs_rq->avg.runnable_load_avg);
> + if (runnable) {
> + long divider = max(gcfs_rq->avg.load_avg,
> + scale_load_down(gcfs_rq->load.weight));
> + divider = max_t(long, 1, divider);
> + runnable /= divider;
> + }
> + runnable = clamp_t(long, runnable, MIN_SHARES, shares);


So what should be:


grq->runnable_load_avg
runnable = shares * ----------------------
grq->load_avg


Turns into:



max(gcfs_rq->avg.runnable_load_avg, gcfs_rq->runnable_weight)
shares * -------------------------------------------------------------
max(gcfs_rq->avg.load_avg , gcfs_rq->load.weight)


For which I think we can have an analogous argument to what we have for
calc_cfs_shares() no? That is, we use the immediate values to get better
representation for interactive, but use the avg values as lower bounds
in case the immediate value is 0.


I think we can write this better like:


/* rename calc_cfs_shares to calc_group_shares */

/*
* big comment a-la calc_group_shares goes here
*/
static long calc_group_runnable(...)
{
/*
* We need to deal with numerical instabilities that can result
* in sporadic cases where: runnable_avg > load_avg.
*/
load_avg = max(gcfs_rq->avg.runnable_load_avg, gcfs_rq->avg.load_avg);

/*
* Since the immediate values can be 0, use the averages as
* lower bounds.
*/
runnable = max(gcfs_rq->runnable_weight, gcfs_rq->avg.runnable_load_avg);
load = max(gcfs_rq->load.weight , load_avg);

runnable *= shares;
if (load)
runnable /= load;

return clamp_t(long, runnable, MIN_SHARES, shares);
}

But yes..

2017-08-01 10:52:55

by Brendan Jackman

Subject: Re: [PATCH 7/7] sched/fair: don't wake affine recently load balanced tasks

Hi Josef,

I happened to be thinking about something like this while investigating
a totally different issue with ARM big.LITTLE. Comment below...

On Fri, Jul 14 2017 at 13:21, Josef Bacik wrote:
> From: Josef Bacik <[email protected]>
>
> The wake affinity logic will move tasks between two cpu's that appear to be
> loaded equally at the current time, with a slight bias towards cache locality.
> However on a heavily loaded system the load balancer has a better insight into
> what needs to be moved around, so instead keep track of the last time a task was
> migrated by the load balancer. If it was recent, opt to let the process stay on
> it's current CPU (or an idle sibling).
>
> Signed-off-by: Josef Bacik <[email protected]>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 11 +++++++++++
> 2 files changed, 12 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1a0eadd..d872780 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -528,6 +528,7 @@ struct task_struct {
> unsigned long wakee_flip_decay_ts;
> struct task_struct *last_wakee;
>
> + unsigned long last_balance_ts;
> int wake_cpu;
> #endif
> int on_rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 034d5df..6a98a38 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5604,6 +5604,16 @@ static int wake_wide(struct task_struct *p)
> unsigned int slave = p->wakee_flips;
> int factor = this_cpu_read(sd_llc_size);
>
> + /*
> + * If we've balanced this task recently we don't want to undo all of
> + * that hard work by the load balancer and move it to the current cpu.
> + * Constantly overriding the load balancers decisions is going to make
> + * it question its purpose in life and give it anxiety and self worth
> + * issues, and nobody wants that.
> + */
> + if (time_before(jiffies, p->last_balance_ts + HZ))
> + return 1;
> +
> if (master < slave)
> swap(master, slave);
> if (slave < factor || master < slave * factor)
> @@ -7097,6 +7107,7 @@ static int detach_tasks(struct lb_env *env)
> goto next;
>
> detach_task(p, env);
> + p->last_balance_ts = jiffies;

I guess this timestamp should be set in the active balance path too?

> list_add(&p->se.group_node, &env->tasks);
>
> detached++;

Cheers,
Brendan