We noticed in testing that affined workloads do not provide consistent
performance under certain circumstances. To isolate the issue, we began
testing with a representative CPU workload. In some cases the CPU workload
results would vary by a large percentage from run to run. We used JTAG and
scheduler tracing to try to profile the issue, and noticed that in the
low-score runs at least one core was left idle for the duration of the test.
Our test methodology involved an executable compiled to fork a specified
number of workers. The executable was then affined to a number of cores
equal to the number of workers forked, using the userspace taskset utility.
The expectation was
that all cores in the mask of runnable CPUs would run a process until the
work was complete. In practice, for several affinity masks, one to two
cores would not receive work until late in the test or not at all.
Our first observation was that the initial CPU assignment of the
forked threads appeared suboptimal. We observed that it was highly
probable that many or all of the threads would be initially assigned to the
same CPU. Our limited investigation into this indicated that the forking of
the threads occurs fast enough that load estimates for CPUs have not
adjusted to the work they have been assigned. The function
select_task_rq_fair() does not take the number of running processes into
consideration and thus can easily assign several tasks spawned close
together to the same CPU within the allowed CPUs for a workload. While this
seems suboptimal, the load_balance() algorithm run on other idle cores in
the affinity mask should resolve imbalances that result from it.
As our test case was leaving cores idle for many seconds at a time, we dug
into the load_balance() code, specifically the group_imbalance path. We
added ftrace to the path in question and noticed that two branches in
load_balance() conflict in their assessments of the ability of the
algorithm to migrate load. The group_imbalance path correctly sets the flag
to indicate the group cannot be properly balanced due to affinity, but the
redo condition right after this branch incorrectly assumes that there may
be other cores with work to be pulled by considering cores outside of the
scheduling domain in question.
Consider the following circumstance where (T) represents tasks and (*)
represents the allowed CPUs:
    A           B           C
 { 0  1 }    { 2  3 }    { 4  5 }
      T        T           T  T
      T

      *        *  *        *  *
In this case, core 0 in scheduling group 'A' should run load balance on its
local scheduling domain and try to pull work from core 1. When it fails to
move any work due to affinity, it will mark scheduling group 'A' as
"group_imbalance". However, right after doing so, core 0's load_balance()
evaluates the redo path with core 1 removed from consideration. This should
leave no other cores to consider and result in a jump to "out_all_pinned",
leaving the "group_imbalance" flag to be seen by future runs of the
algorithm on core 3. As the code stands currently, core 0 evaluates all
cores instead of those in the scheduling domain under consideration.
It incorrectly sees other cores in the system and attempts a redo.
With core 1 removed from the mask, the redo identifies no load imbalance
and results in a jump to "out_balanced", clearing the "group_imbalance"
flag.
We propose correcting this logic with patch 1 of this series.
Making this correction revealed an issue in the calculate_imbalance() path.
Patch 2 removes a branch that no longer makes sense with the current
load_balance() algorithm: there is no scenario in which it benefits the
"group_imbalance" case in calculate_imbalance(), and it causes problems
in systems with affined workloads and many idle or lightly loaded CPUs.
Co-authored-by: Austin Christ <[email protected]>
Signed-off-by: Jeffrey Hugo <[email protected]>
[V5]
-updated comment to explain the "why" behind the redo check
-fixed panic triggered from active_load_balance_cpu_stop()
[V4]
-restricted active cpus mask to the domain span
-fixed dst_cpu masking logic to work for the entire local group
[V3]
-fixed botched subject lines
[V2]
-Corrected use of Signed-off-by tags
-Removed temp cpumask variable from stack
Jeffrey Hugo (2):
sched/fair: Fix load_balance() affinity redo path
sched/fair: Remove group imbalance from calculate_imbalance()
kernel/sched/fair.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source cpu from consideration and
attempts to redo and balance among the new subset of cpus.
There is a bug in this code path where the algorithm considers all active
cpus in the system (minus the source that was just masked out). This is
not valid for two reasons: some active cpus may not be in the current
scheduling domain and one of the active cpus is dst_cpu. These cpus should
not be considered, as we cannot pull load from them.
Instead of failing out of load_balance(), we may end up redoing the search
with no valid cpus and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other cpus in future
load_balance() runs as that algorithm intends.
Fix the check by removing cpus not in the current domain and the dst_cpu
from consideration, thus limiting the evaluation to valid remaining cpus
from which load might be migrated.
Co-authored-by: Austin Christ <[email protected]>
Co-authored-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Jeffrey Hugo <[email protected]>
Tested-by: Tyler Baicar <[email protected]>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++------------
1 file changed, 20 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d711093..2ba4407 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6737,10 +6737,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* our sched_group. We may want to revisit it if we couldn't
* meet load balance goals by pulling other tasks on src_cpu.
*
- * Also avoid computing new_dst_cpu if we have already computed
- * one in current iteration.
+ * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
+ * already computed one in current iteration.
*/
- if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
+ if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
return 0;
/* Prevent to re-select dst_cpu via env's cpus */
@@ -8091,14 +8091,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- /*
- * For NEWLY_IDLE load_balancing, we don't need to consider
- * other cpus in our group
- */
- if (idle == CPU_NEWLY_IDLE)
- env.dst_grpmask = NULL;
-
- cpumask_copy(cpus, cpu_active_mask);
+ cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
@@ -8220,7 +8213,15 @@ static int load_balance(int this_cpu, struct rq *this_rq,
/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(env.flags & LBF_ALL_PINNED)) {
cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
+ /*
+ * Attempting to continue load balancing at the current
+ * sched_domain level only makes sense if there are
+ * active cpus remaining as possible busiest cpus to
+ * pull load from which are not contained within the
+ * destination group that is receiving any migrated
+ * load.
+ */
+ if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
goto redo;
@@ -8516,6 +8517,13 @@ static int active_load_balance_cpu_stop(void *data)
.src_cpu = busiest_rq->cpu,
.src_rq = busiest_rq,
.idle = CPU_IDLE,
+ /*
+ * can_migrate_task() doesn't need to compute new_dst_cpu
+ * for active balancing. Since we have CPU_IDLE, but no
+ * @dst_grpmask we need to make that test go away with lying
+ * about DST_PINNED.
+ */
+ .flags = LBF_DST_PINNED,
};
schedstat_inc(sd->alb_count);
--
The group_imbalance path in calculate_imbalance() made sense when it was
added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
balance across sched domain") because busiest->load_per_task factored into
the amount of imbalance that was calculated. That is not the case today.
The group_imbalance path can only affect the outcome of
calculate_imbalance() when the average load of the domain is less than the
original busiest->load_per_task. In this case, busiest->load_per_task is
overwritten with the scheduling domain load average. Thus
busiest->load_per_task no longer represents actual load that can be moved.
At the final comparison between env->imbalance and busiest->load_per_task,
imbalance may be larger than the new busiest->load_per_task causing the
check to fail under the assumption that there is a task that could be
migrated to satisfy the imbalance. However env->imbalance may still be
smaller than the original busiest->load_per_task, thus it is unlikely that
there is a task that can be migrated to satisfy the imbalance.
As a result, calculate_imbalance() would not choose to run
fix_small_imbalance() when we expect it should. In the worst case, this can
result in idle cpus.
Since the group imbalance path in calculate_imbalance() is at best a NOP
but otherwise harmful, remove it.
Co-authored-by: Austin Christ <[email protected]>
Signed-off-by: Jeffrey Hugo <[email protected]>
Tested-by: Tyler Baicar <[email protected]>
---
kernel/sched/fair.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84255ab..3600713 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7760,15 +7760,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
local = &sds->local_stat;
busiest = &sds->busiest_stat;
- if (busiest->group_type == group_imbalanced) {
- /*
- * In the group_imb case we cannot rely on group-wide averages
- * to ensure cpu-load equilibrium, look at wider averages. XXX
- */
- busiest->load_per_task =
- min(busiest->load_per_task, sds->avg_load);
- }
-
/*
* Avg load of busiest sg can be less and avg load of local sg can
* be greater than avg load across all sgs of sd because avg load
--
On 6/7/2017 1:18 PM, Jeffrey Hugo wrote:
> Co-authored-by: Austin Christ <[email protected]>
> Signed-off-by: Jeffrey Hugo <[email protected]>
>
> [V5]
> -updated comment to explain the "why" behind the redo check
> -fixed panic triggered from active_load_balance_cpu_stop()
>
> [V4]
> -restricted active cpus mask to the domain span
> -fixed dst_cpu masking logic to work for the entire local group
>
> [V3]
> -fixed botched subject lines
>
> [V2]
> -Corrected use of Signed-off-by tags
> -Removed temp cpumask variable from stack
>
> Jeffrey Hugo (2):
> sched/fair: Fix load_balance() affinity redo path
> sched/fair: Remove group imbalance from calculate_imbalance()
>
> kernel/sched/fair.c | 22 ++++++++++++----------
> 1 file changed, 12 insertions(+), 10 deletions(-)
>
Peter (and/or Ingo?), is this series likely to make 4.13? V5 has been
quiet, so I'm not sure if folks are happy with things, or just haven't
had time to review.
Thanks
--
Jeffrey Hugo
On 6/13/2017 8:32 AM, Jeffrey Hugo wrote:
> On 6/7/2017 1:18 PM, Jeffrey Hugo wrote:
>> [...]
>
> Peter (and/or Ingo?), is this series likely to make 4.13? V5 has been
> quiet, so I'm not sure if folks are happy with things, or just haven't
> had time to review.
>
> Thanks
>
Ping?
--
Jeffrey Hugo
On 6/20/2017 8:28 AM, Jeffrey Hugo wrote:
> On 6/13/2017 8:32 AM, Jeffrey Hugo wrote:
>> On 6/7/2017 1:18 PM, Jeffrey Hugo wrote:
>>> [...]
>>
>> Peter (and/or Ingo?), is this series likely to make 4.13? V5 has been
>> quiet, so I'm not sure if folks are happy with things, or just haven't
>> had time to review.
>>
>> Thanks
>>
>
> Ping?
>
Ping again.
Since there have been no comments for the past 3 weeks, I'm assuming
everyone is happy with this series. With 4.12-rc7 out, and Linus saying
it looks good, it seems likely that the merge window is going to open
next week. We really want this in 4.13.
Would you please confirm that the series will be pulled into 4.13?
--
Jeffrey Hugo
On Wed, Jun 28, 2017 at 10:12:29AM -0600, Jeffrey Hugo wrote:
> >
> > Ping?
> >
>
> Ping again.
>
> Since there have been no comments for the past 3 weeks,
Sorry, I had to deal with becoming a parent for the 3rd time, let me go
stare at them ;-)
On Wed, Jun 07, 2017 at 01:18:58PM -0600, Jeffrey Hugo wrote:
> The group_imbalance path in calculate_imbalance() made sense when it was
> added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
> balance across sched domain") because busiest->load_per_task factored into
> the amount of imbalance that was calculated. That is not the case today.
It would be nice to have some more information on which patch(es)
changed that.
> The group_imbalance path can only affect the outcome of
> calculate_imbalance() when the average load of the domain is less than the
> original busiest->load_per_task. In this case, busiest->load_per_task is
> overwritten with the scheduling domain load average. Thus
> busiest->load_per_task no longer represents actual load that can be moved.
>
> At the final comparison between env->imbalance and busiest->load_per_task,
> imbalance may be larger than the new busiest->load_per_task causing the
> check to fail under the assumption that there is a task that could be
> migrated to satisfy the imbalance. However env->imbalance may still be
> smaller than the original busiest->load_per_task, thus it is unlikely that
> there is a task that can be migrated to satisfy the imbalance.
> Calculate_imbalance() would not choose to run fix_small_imbalance() when we
> expect it should. In the worst case, this can result in idle cpus.
>
> Since the group imbalance path in calculate_imbalance() is at best a NOP
> but otherwise harmful, remove it.
load_per_task is horrible and should die. Ever since we did cgroup
support the number is complete crap, but even before that the concept
was dubious.
Most of the logic that uses the number stems from the pre-smp-nice era.
This also of course means that fix_small_imbalance() is probably a load
of crap. Digging through all that has been on the todo list for a long
while but somehow not something I've ever gotten to :/
On Wed, Jun 07, 2017 at 01:18:57PM -0600, Jeffrey Hugo wrote:
> If load_balance() fails to migrate any tasks because all tasks were
> affined, load_balance() removes the source cpu from consideration and
> attempts to redo and balance among the new subset of cpus.
>
> There is a bug in this code path where the algorithm considers all active
> cpus in the system (minus the source that was just masked out). This is
> not valid for two reasons: some active cpus may not be in the current
> scheduling domain and one of the active cpus is dst_cpu. These cpus should
> not be considered, as we cannot pull load from them.
>
> Instead of failing out of load_balance(), we may end up redoing the search
> with no valid cpus and incorrectly concluding the domain is balanced.
> Additionally, if the group_imbalance flag was just set, it may also be
> incorrectly unset, thus the flag will not be seen by other cpus in future
> load_balance() runs as that algorithm intends.
>
> Fix the check by removing cpus not in the current domain and the dst_cpu
> from consideration, thus limiting the evaluation to valid remaining cpus
> from which load might be migrated.
>
> Co-authored-by: Austin Christ <[email protected]>
> Co-authored-by: Dietmar Eggemann <[email protected]>
> Signed-off-by: Jeffrey Hugo <[email protected]>
> Tested-by: Tyler Baicar <[email protected]>
Yes, this looks good. Thanks!
Commit-ID: 65a4433aebe36c8c6abeb69b99ef00274b971c6c
Gitweb: http://git.kernel.org/tip/65a4433aebe36c8c6abeb69b99ef00274b971c6c
Author: Jeffrey Hugo <[email protected]>
AuthorDate: Wed, 7 Jun 2017 13:18:57 -0600
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 5 Jul 2017 16:28:48 +0200
sched/fair: Fix load_balance() affinity redo path
If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source CPU from consideration and
attempts to redo and balance among the new subset of CPUs.
There is a bug in this code path where the algorithm considers all active
CPUs in the system (minus the source that was just masked out). This is
not valid for two reasons: some active CPUs may not be in the current
scheduling domain and one of the active CPUs is dst_cpu. These CPUs should
not be considered, as we cannot pull load from them.
Instead of failing out of load_balance(), we may end up redoing the search
with no valid CPUs and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other CPUs in future
load_balance() runs as that algorithm intends.
Fix the check by removing CPUs not in the current domain and the dst_cpu
from consideration, thus limiting the evaluation to valid remaining CPUs
from which load might be migrated.
Co-authored-by: Austin Christ <[email protected]>
Co-authored-by: Dietmar Eggemann <[email protected]>
Tested-by: Tyler Baicar <[email protected]>
Signed-off-by: Jeffrey Hugo <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Austin Christ <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Timur Tabi <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++------------
1 file changed, 20 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 008c514..c95880e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6646,10 +6646,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* our sched_group. We may want to revisit it if we couldn't
* meet load balance goals by pulling other tasks on src_cpu.
*
- * Also avoid computing new_dst_cpu if we have already computed
- * one in current iteration.
+ * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
+ * already computed one in current iteration.
*/
- if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
+ if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
return 0;
/* Prevent to re-select dst_cpu via env's cpus */
@@ -8022,14 +8022,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- /*
- * For NEWLY_IDLE load_balancing, we don't need to consider
- * other cpus in our group
- */
- if (idle == CPU_NEWLY_IDLE)
- env.dst_grpmask = NULL;
-
- cpumask_copy(cpus, cpu_active_mask);
+ cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
@@ -8151,7 +8144,15 @@ more_balance:
/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(env.flags & LBF_ALL_PINNED)) {
cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
+ /*
+ * Attempting to continue load balancing at the current
+ * sched_domain level only makes sense if there are
+ * active CPUs remaining as possible busiest CPUs to
+ * pull load from which are not contained within the
+ * destination group that is receiving any migrated
+ * load.
+ */
+ if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
goto redo;
@@ -8447,6 +8448,13 @@ static int active_load_balance_cpu_stop(void *data)
.src_cpu = busiest_rq->cpu,
.src_rq = busiest_rq,
.idle = CPU_IDLE,
+ /*
+ * can_migrate_task() doesn't need to compute new_dst_cpu
+ * for active balancing. Since we have CPU_IDLE, but no
+ * @dst_grpmask we need to make that test go away with lying
+ * about DST_PINNED.
+ */
+ .flags = LBF_DST_PINNED,
};
schedstat_inc(sd->alb_count);
On 7/5/2017 5:22 AM, Peter Zijlstra wrote:
> On Wed, Jun 07, 2017 at 01:18:58PM -0600, Jeffrey Hugo wrote:
>> The group_imbalance path in calculate_imbalance() made sense when it was
>> added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
>> balance across sched domain") because busiest->load_per_task factored into
>> the amount of imbalance that was calculated. That is not the case today.
>
> It would be nice to have some more information on which patch(es)
> changed that.
From the history it looks like commit dd5feea14a7d ("sched: Fix
SCHED_MC regression caused by change in sched cpu_power") removed
load_per_task from imbalance calculations (was factored into max_pull).
After this change, the group imbalance modifications to this variable
appear to no longer work as originally intended.
>
>> The group_imbalance path can only affect the outcome of
>> calculate_imbalance() when the average load of the domain is less than the
>> original busiest->load_per_task. In this case, busiest->load_per_task is
>> overwritten with the scheduling domain load average. Thus
>> busiest->load_per_task no longer represents actual load that can be moved.
>>
>> At the final comparison between env->imbalance and busiest->load_per_task,
>> imbalance may be larger than the new busiest->load_per_task causing the
>> check to fail under the assumption that there is a task that could be
>> migrated to satisfy the imbalance. However env->imbalance may still be
>> smaller than the original busiest->load_per_task, thus it is unlikely that
>> there is a task that can be migrated to satisfy the imbalance.
>> Calculate_imbalance() would not choose to run fix_small_imbalance() when we
>> expect it should. In the worst case, this can result in idle cpus.
>>
>> Since the group imbalance path in calculate_imbalance() is at best a NOP
>> but otherwise harmful, remove it.
>
> load_per_task is horrible and should die. Ever since we did cgroup
> support the number is complete crap, but even before that the concept
> was dubious.
>
> Most of the logic that uses the number stems from the pre-smp-nice era.
>
> This also of course means that fix_small_imbalance() is probably a load
> of crap. Digging through all that has been on the todo list for a long
> while but somehow not something I've ever gotten to :/
Based on the state of the code today, our change fixes a current issue
that we are encountering. If we add the above history to the commit text,
is our change sufficient as a short-term solution between now and
whenever the load_per_task path is rearchitected?
--
Jeffrey Hugo