2016-10-09 03:39:31

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

2016-09-29 3:37 GMT+08:00 Matt Fleming <[email protected]>:
> On Wed, 28 Sep, at 12:14:22PM, Peter Zijlstra wrote:
>>
>> Which suggests we do something like the below (not compile tested or
>> anything, also I ran out of tea again).
>
> I'm away on FTO right now. I can test this when I return on Friday.
>
> Funnily enough, I now remember that I already sent a fix for the
> missing update_rq_clock() in post_init_entity_util_avg(), but didn't
> apply it when chasing this hackbench regression (oops),
>
> https://lkml.kernel.org/r/[email protected]

The difference between this patch and Peterz's is that your patch still
has a delta, since activate_task()->enqueue_task() does call
update_rq_clock(). So why would not having that delta cause low-CPU
machines (4 or 8) to regress, as per your other reply in this thread?

Regards,
Wanpeng Li


2016-10-10 10:01:16

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:
>
> The difference between this patch and Peterz's is that your patch still
> has a delta, since activate_task()->enqueue_task() does call
> update_rq_clock(). So why would not having that delta cause low-CPU
> machines (4 or 8) to regress, as per your other reply in this thread?

Both my patch and Peter's patch cause issues with low cpu machines. In
<[email protected]> I said,

"This patch causes some low cpu machines (4 or 8) to regress. It turns
out they regress with my patch too."

Have I misunderstood your question?

I ran out of time to investigate this last week, though I did try all
proposed patches, including Vincent's, and none of them produced wins
across the board.

I should get a bit further this week.

Vincent, Dietmar, did you guys ever get around to submitting your PELT
tracepoint patches? Getting some introspection into the scheduler's
load balancing decisions would speed up this sort of research.

2016-10-10 10:15:25

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

2016-10-10 18:01 GMT+08:00 Matt Fleming <[email protected]>:
> On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:
>>
>> The difference between this patch and Peterz's is that your patch still
>> has a delta, since activate_task()->enqueue_task() does call
>> update_rq_clock(). So why would not having that delta cause low-CPU
>> machines (4 or 8) to regress, as per your other reply in this thread?
>
> Both my patch and Peter's patch cause issues with low cpu machines. In
> <[email protected]> I said,
>
> "This patch causes some low cpu machines (4 or 8) to regress. It turns
> out they regress with my patch too."
>
> Have I misunderstood your question?

I was comparing this one, https://marc.info/?l=linux-kernel&m=147446511924571,
with Peterz's patch.

>
> I ran out of time to investigate this last week, though I did try all
> proposed patches, including Vincent's, and none of them produced wins
> across the board.
>
> I should get a bit further this week.
>
> Vincent, Dietmar, did you guys ever get around to submitting your PELT
> tracepoint patches? Getting some introspection into the scheduler's
> load balancing decisions would speed up this sort of research.

2016-10-10 12:29:53

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 10 October 2016 at 12:01, Matt Fleming <[email protected]> wrote:
> On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:
>>
>> The difference between this patch and Peterz's is that your patch still
>> has a delta, since activate_task()->enqueue_task() does call
>> update_rq_clock(). So why would not having that delta cause low-CPU
>> machines (4 or 8) to regress, as per your other reply in this thread?
>
> Both my patch and Peter's patch cause issues with low cpu machines. In
> <[email protected]> I said,
>
> "This patch causes some low cpu machines (4 or 8) to regress. It turns
> out they regress with my patch too."
>
> Have I misunderstood your question?
>
> I ran out of time to investigate this last week, though I did try all
> proposed patches, including Vincent's, and none of them produced wins
> across the board.

I have tried to reproduce your issue on my target, a Hikey board
(ARM-based octo-core), but I failed to see a regression with commit
7dc603c9028e. Nevertheless, I can see tasks not being well spread
during fork, as you mentioned. So I studied the spreading issue during
fork a bit more last week, and I have a new version of my proposed
patch that I'm going to send soon. With this patch I can see a good
spread of tasks during the fork sequence and some kind of performance
improvement, even if that is hard to tell because the variance of the
hackbench results is quite high, so it's mainly an improvement in the
repeatability of the results.

>
> I should get a bit further this week.
>
> Vincent, Dietmar, did you guys ever get around to submitting your PELT
> tracepoint patches? Getting some introspection into the scheduler's

My tracepoints are not in a shape to be submitted and would need a
cleanup, as some are more hacks for debugging than real trace events.
Nevertheless, I can push them to a git branch if they would be useful
to someone.


> load balancing decisions would speed up this sort of research.

2016-10-10 13:54:46

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 10/10/16 13:29, Vincent Guittot wrote:
> On 10 October 2016 at 12:01, Matt Fleming <[email protected]> wrote:
>> On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:
>>>
>>> The difference between this patch and Peterz's is that your patch still
>>> has a delta, since activate_task()->enqueue_task() does call
>>> update_rq_clock(). So why would not having that delta cause low-CPU
>>> machines (4 or 8) to regress, as per your other reply in this thread?
>>
>> Both my patch and Peter's patch cause issues with low cpu machines. In
>> <[email protected]> I said,
>>
>> "This patch causes some low cpu machines (4 or 8) to regress. It turns
>> out they regress with my patch too."
>>
>> Have I misunderstood your question?
>>
>> I ran out of time to investigate this last week, though I did try all
>> proposed patches, including Vincent's, and none of them produced wins
>> across the board.
>
> I have tried to reproduce your issue on my target, a Hikey board
> (ARM-based octo-core), but I failed to see a regression with commit
> 7dc603c9028e. Nevertheless, I can see tasks not being well spread

Wasn't this about the two patches mentioned in this thread? The one from
Matt using 'se->sum_exec_runtime' in the if condition in
enqueue_entity_load_avg() and Peterz's conditional call to
update_rq_clock(rq) in enqueue_task()?

> during fork, as you mentioned. So I studied the spreading issue during
> fork a bit more last week, and I have a new version of my proposed
> patch that I'm going to send soon. With this patch I can see a good
> spread of tasks during the fork sequence and some kind of performance
> improvement, even if that is hard to tell because the variance of the
> hackbench results is quite high, so it's mainly an improvement in the
> repeatability of the results.

Hikey (ARM64 2x4 cpus) board: cpufreq: performance, cpuidle: disabled

Performance counter stats for 'perf bench sched messaging -g 20 -l 500'
(10 runs):

(1) tip/sched/core: commit 447976ef4fd0

5.902209533 seconds time elapsed ( +- 0.31% )

(2) tip/sched/core + original patch on the 'sched/fair: Do not decay
new task load on first enqueue' thread (23/09/16)

5.919933030 seconds time elapsed ( +- 0.44% )

(3) tip/sched/core + Peter's ENQUEUE_NEW patch on the 'sched/fair: Do
not decay new task load on first enqueue' thread (28/09/16)

5.970195534 seconds time elapsed ( +- 0.37% )

Not sure if we can call this a regression but it also shows no
performance gain.

>>
>> I should get a bit further this week.
>>
>> Vincent, Dietmar, did you guys ever get around to submitting your PELT
>> tracepoint patches? Getting some introspection into the scheduler's
>
> My tracepoints are not in a shape to be submitted and would need a
> cleanup, as some are more hacks for debugging than real trace events.
> Nevertheless, I can push them to a git branch if they would be useful
> to someone.

We carry two trace events locally, one for PELT on se and one for
cfs_rq's (I have to add the runnable bits here) which work for
CONFIG_FAIR_GROUP_SCHED and !CONFIG_FAIR_GROUP_SCHED. I put them into
__update_load_avg(), attach_entity_load_avg() and
detach_entity_load_avg(). I could post them but so far mainline has been
reluctant to see the need for PELT related trace events ...
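
For readers who have not seen such events, here is a minimal sketch of what an
se-side PELT trace event of this kind could look like. This is an illustration
only, not Dietmar's actual patch; the event name, prototype and field selection
are assumptions made for the example.

	/* Illustration only: a per-se PELT event as it might appear in
	 * include/trace/events/sched.h. */
	TRACE_EVENT(sched_pelt_se,

		TP_PROTO(int cpu, struct sched_avg *avg),

		TP_ARGS(cpu, avg),

		TP_STRUCT__entry(
			__field(int,		cpu)
			__field(unsigned long,	load_avg)
			__field(unsigned long,	util_avg)
		),

		TP_fast_assign(
			__entry->cpu		= cpu;
			__entry->load_avg	= avg->load_avg;
			__entry->util_avg	= avg->util_avg;
		),

		TP_printk("cpu=%d load_avg=%lu util_avg=%lu",
			  __entry->cpu, __entry->load_avg, __entry->util_avg)
	);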

[...]

2016-10-10 17:34:48

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Monday 10 Oct 2016 at 14:29:28 (+0200), Vincent Guittot wrote:
> On 10 October 2016 at 12:01, Matt Fleming <[email protected]> wrote:
> > On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:
> >>
> >> The difference between this patch and Peterz's is that your patch still
> >> has a delta, since activate_task()->enqueue_task() does call
> >> update_rq_clock(). So why would not having that delta cause low-CPU
> >> machines (4 or 8) to regress, as per your other reply in this thread?
> >
> > Both my patch and Peter's patch cause issues with low cpu machines. In
> > <[email protected]> I said,
> >
> > "This patch causes some low cpu machines (4 or 8) to regress. It turns
> > out they regress with my patch too."
> >
> > Have I misunderstood your question?
> >
> > I ran out of time to investigate this last week, though I did try all
> > proposed patches, including Vincent's, and none of them produced wins
> > across the board.
>
> I have tried to reproduce your issue on my target, a Hikey board
> (ARM-based octo-core), but I failed to see a regression with commit
> 7dc603c9028e. Nevertheless, I can see tasks not being well spread
> during fork, as you mentioned. So I studied the spreading issue during
> fork a bit more last week, and I have a new version of my proposed
> patch that I'm going to send soon. With this patch I can see a good
> spread of tasks during the fork sequence and some kind of performance
> improvement, even if that is hard to tell because the variance of the
> hackbench results is quite high, so it's mainly an improvement in the
> repeatability of the results.
>

Subject: [PATCH] sched: use load_avg for selecting idlest group

find_idlest_group() only compares the runnable_load_avg when looking for
the idlest group. But on fork-intensive use cases like hackbench, where
tasks block quickly after the fork, this can lead to selecting the same
CPU whereas other CPUs, which have a similar runnable load but a lower
load_avg, could be chosen instead.

When the runnable_load_avg of 2 CPUs is close, we now take into account
the amount of blocked load as a 2nd selection factor.

For use cases like hackbench, this enables the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM-based octo-core) with several
kernels. The results below give the min, max, avg and stdev values of 18
runs with each configuration.

The v4.8+patches configuration also includes the change below, which is
part of the proposal made by Peter to ensure that the clock is up to date
when the forked task is attached to the rq.

@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
rq = __task_rq_lock(p, &rf);
+ update_rq_clock(rq);
post_init_entity_util_avg(&p->se);

activate_task(rq, p, 0);

hackbench -P -g 1

       ea86cb4b7621  7dc603c9028e  v4.8       v4.8+patches
min    0.049         0.050         0.051      0.048
avg    0.057         0.057(0%)     0.057(0%)  0.055(+5%)
max    0.066         0.068         0.070      0.063
stdev  +/-9%         +/-9%         +/-8%      +/-9%

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34..628b00b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5166,15 +5166,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int sd_flag)
{
struct sched_group *idlest = NULL, *group = sd->groups;
- unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+ unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
int load_idx = sd->forkexec_idx;
- int imbalance = 100 + (sd->imbalance_pct-100)/2;
+ unsigned long imbalance = (scale_load_down(NICE_0_LOAD)*(sd->imbalance_pct-100))/100;

if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;

do {
- unsigned long load, avg_load;
+ unsigned long load, avg_load, runnable_load;
int local_group;
int i;

@@ -5188,6 +5189,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

/* Tally up the load of all CPUs in the group */
avg_load = 0;
+ runnable_load = 0;

for_each_cpu(i, sched_group_cpus(group)) {
/* Bias balancing toward cpus of our domain */
@@ -5196,21 +5198,43 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
else
load = target_load(i, load_idx);

- avg_load += load;
+ runnable_load += load;
+
+ avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
}

/* Adjust by relative CPU capacity of the group */
avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+ runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;

if (local_group) {
- this_load = avg_load;
- } else if (avg_load < min_load) {
- min_load = avg_load;
+ this_runnable_load = runnable_load;
+ this_avg_load = avg_load;
+ } else if (min_runnable_load > (runnable_load + imbalance)) {
+ /*
+ * The runnable load is significantly smaller so we
+ * can pick this new cpu
+ */
+ min_runnable_load = runnable_load;
+ min_avg_load = avg_load;
+ idlest = group;
+ } else if ((runnable_load < (min_runnable_load + imbalance)) &&
+ (100*min_avg_load > sd->imbalance_pct*avg_load)) {
+ /*
+ * The runnable loads are close so we take into account
+ * blocked load through avg_load which is blocked +
+ * runnable load
+ */
+ min_avg_load = avg_load;
idlest = group;
}
+
} while (group = group->next, group != sd->groups);

- if (!idlest || 100*this_load < imbalance*min_load)
+ if (!idlest ||
+ (min_runnable_load > (this_runnable_load + imbalance)) ||
+ ((this_runnable_load < (min_runnable_load + imbalance)) &&
+ (100*min_avg_load > sd->imbalance_pct*this_avg_load)))
return NULL;
return idlest;
}
--
2.7.4


2016-10-10 18:29:34

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 10 October 2016 at 15:54, Dietmar Eggemann <[email protected]> wrote:
> On 10/10/16 13:29, Vincent Guittot wrote:
>> On 10 October 2016 at 12:01, Matt Fleming <[email protected]> wrote:
>>> On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:
>>>>
>>>> The difference between this patch and Peterz's is that your patch still
>>>> has a delta, since activate_task()->enqueue_task() does call
>>>> update_rq_clock(). So why would not having that delta cause low-CPU
>>>> machines (4 or 8) to regress, as per your other reply in this thread?
>>>
>>> Both my patch and Peter's patch cause issues with low cpu machines. In
>>> <[email protected]> I said,
>>>
>>> "This patch causes some low cpu machines (4 or 8) to regress. It turns
>>> out they regress with my patch too."
>>>
>>> Have I misunderstood your question?
>>>
>>> I ran out of time to investigate this last week, though I did try all
>>> proposed patches, including Vincent's, and none of them produced wins
>>> across the board.
>>
>> I have tried to reproduce your issue on my target, a Hikey board
>> (ARM-based octo-core), but I failed to see a regression with commit
>> 7dc603c9028e. Nevertheless, I can see tasks not being well spread
>
> Wasn't this about the two patches mentioned in this thread? The one from
> Matt using 'se->sum_exec_runtime' in the if condition in
> enqueue_entity_load_avg() and Peterz's conditional call to
> update_rq_clock(rq) in enqueue_task()?

I was trying to reproduce the regression that Matt mentioned at the
beginning of the thread, not the ones linked to the proposed fixes.

>
>> during fork, as you mentioned. So I studied the spreading issue during
>> fork a bit more last week, and I have a new version of my proposed
>> patch that I'm going to send soon. With this patch I can see a good
>> spread of tasks during the fork sequence and some kind of performance
>> improvement, even if that is hard to tell because the variance of the
>> hackbench results is quite high, so it's mainly an improvement in the
>> repeatability of the results.
>
> Hikey (ARM64 2x4 cpus) board: cpufreq: performance, cpuidle: disabled
>
> Performance counter stats for 'perf bench sched messaging -g 20 -l 500'
> (10 runs):
>
> (1) tip/sched/core: commit 447976ef4fd0
>
> 5.902209533 seconds time elapsed ( +- 0.31% )

This seems to be too long to test the impact of the forking phase of hackbench

>
> (2) tip/sched/core + original patch on the 'sched/fair: Do not decay
> new task load on first enqueue' thread (23/09/16)
>
> 5.919933030 seconds time elapsed ( +- 0.44% )
>
> (3) tip/sched/core + Peter's ENQUEUE_NEW patch on the 'sched/fair: Do
> not decay new task load on first enqueue' thread (28/09/16)
>
> 5.970195534 seconds time elapsed ( +- 0.37% )
>
> Not sure if we can call this a regression but it also shows no
> performance gain.
>
>>>
>>> I should get a bit further this week.
>>>
>>> Vincent, Dietmar, did you guys ever get around to submitting your PELT
>>> tracepoint patches? Getting some introspection into the scheduler's
>>
>> My tracepoints are not in a shape to be submitted and would need a
>> cleanup, as some are more hacks for debugging than real trace events.
>> Nevertheless, I can push them to a git branch if they would be useful
>> to someone.
>
> We carry two trace events locally, one for PELT on se and one for
> cfs_rq's (I have to add the runnable bits here) which work for
> CONFIG_FAIR_GROUP_SCHED and !CONFIG_FAIR_GROUP_SCHED. I put them into
> __update_load_avg(), attach_entity_load_avg() and
> detach_entity_load_avg(). I could post them but so far mainline has been
> reluctant to see the need for PELT related trace events ...
>
> [...]

2016-10-11 09:44:32

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 10/10/16 19:29, Vincent Guittot wrote:
> On 10 October 2016 at 15:54, Dietmar Eggemann <[email protected]> wrote:
>> On 10/10/16 13:29, Vincent Guittot wrote:
>>> On 10 October 2016 at 12:01, Matt Fleming <[email protected]> wrote:
>>>> On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:

[...]

>>> I have tried to reproduce your issue on my target, a Hikey board
>>> (ARM-based octo-core), but I failed to see a regression with commit
>>> 7dc603c9028e. Nevertheless, I can see tasks not being well spread
>>
>> Wasn't this about the two patches mentioned in this thread? The one from
>> Matt using 'se->sum_exec_runtime' in the if condition in
>> enqueue_entity_load_avg() and Peterz's conditional call to
>> update_rq_clock(rq) in enqueue_task()?
>
> I was trying to reproduce the regression that Matt mentioned at the
> beginning of the thread, not the ones linked to the proposed fixes.

OK.

>
>>
>>> during fork, as you mentioned. So I studied the spreading issue during
>>> fork a bit more last week, and I have a new version of my proposed
>>> patch that I'm going to send soon. With this patch I can see a good
>>> spread of tasks during the fork sequence and some kind of performance
>>> improvement, even if that is hard to tell because the variance of the
>>> hackbench results is quite high, so it's mainly an improvement in the
>>> repeatability of the results.
>>
>> Hikey (ARM64 2x4 cpus) board: cpufreq: performance, cpuidle: disabled
>>
>> Performance counter stats for 'perf bench sched messaging -g 20 -l 500'
>> (10 runs):
>>
>> (1) tip/sched/core: commit 447976ef4fd0
>>
>> 5.902209533 seconds time elapsed ( +- 0.31% )
>
> This seems to be too long to test the impact of the forking phase of hackbench

[...]

Yeah, you're right. But I can't see any significant difference. IMHO,
it's all in the noise.

(A) Performance counter stats for 'perf bench sched messaging -g 100 -l
1 -t'
# 20 sender and receiver threads per group
# 100 groups == 4000 threads run

(1) tip/sched/core: commit 447976ef4fd0

Total time: 0.188 [sec]

(2) tip/sched/core + original patch on the 'sched/fair: Do not decay
new task load on first enqueue' thread (23/09/16)

Total time: 0.199 [sec]

(3) tip/sched/core + Peter's ENQUEUE_NEW patch on the 'sched/fair: Do
not decay new task load on first enqueue' thread (28/09/16)

Total time: 0.178 [sec]

(B) hackbench -P -g 1
Running in process mode with 1 groups using 40 file descriptors
each (== 40 tasks)
Each sender will pass 100 messages of 100 bytes

(1) 0.067

(2) 0.083

(3) 0.073

(C) hackbench -T -g 1
Running in threaded mode with 1 groups using 40 file descriptors
each (== 40 tasks)
Each sender will pass 100 messages of 100 bytes

(1) 0.077

(2) 0.079

(3) 0.072

Maybe, instead of the performance gov, I should pin the frequency to a
lower one to eliminate the thermal influence on this Hikey board.


2016-10-11 10:25:17

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Mon, 10 Oct, at 07:34:40PM, Vincent Guittot wrote:
>
> Subject: [PATCH] sched: use load_avg for selecting idlest group
>
> find_idlest_group() only compares the runnable_load_avg when looking for
> the idlest group. But on fork-intensive use cases like hackbench, where
> tasks block quickly after the fork, this can lead to selecting the same
> CPU whereas other CPUs, which have a similar runnable load but a lower
> load_avg, could be chosen instead.
>
> When the runnable_load_avg of 2 CPUs is close, we now take into account
> the amount of blocked load as a 2nd selection factor.
>
> For use cases like hackbench, this enables the scheduler to select
> different CPUs during the fork sequence and to spread tasks across the
> system.
>
> Tests have been done on a Hikey board (ARM-based octo-core) with several
> kernels. The results below give the min, max, avg and stdev values of 18
> runs with each configuration.
>
> The v4.8+patches configuration also includes the change below, which is
> part of the proposal made by Peter to ensure that the clock is up to date
> when the forked task is attached to the rq.
>
> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> #endif
> rq = __task_rq_lock(p, &rf);
> + update_rq_clock(rq);
> post_init_entity_util_avg(&p->se);
>
> activate_task(rq, p, 0);
>
> hackbench -P -g 1
>
>        ea86cb4b7621  7dc603c9028e  v4.8       v4.8+patches
> min    0.049         0.050         0.051      0.048
> avg    0.057         0.057(0%)     0.057(0%)  0.055(+5%)
> max    0.066         0.068         0.070      0.063
> stdev  +/-9%         +/-9%         +/-8%      +/-9%
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
> 1 file changed, 32 insertions(+), 8 deletions(-)

This patch looks pretty good to me and this 2-socket 48-cpu Xeon
(domain0 SMT, domain1 MC, domain2 NUMA) shows a few nice performance
improvements, and no regressions for various combinations of hackbench
sockets/pipes and group numbers.

But on a 2-socket 8-cpu Xeon (domain0 MC, domain1 DIE) running,

perf stat --null -r 25 -- hackbench -pipe 30 process 1000

I see a regression,

baseline: 2.41228
patched : 2.64528 (-9.7%)

Even though the spread of tasks during fork[0] is improved,

baseline CV: 0.478%
patched CV : 0.042%

Clearly the spread wasn't *that* bad to begin with on this machine for
this workload. I consider the baseline spread to be pretty well
distributed. Some other factor must be at play.

Patched runqueue latencies are higher (max9* are percentiles),

baseline: mean: 615932.69 max90: 75272.00 max95: 175985.00 max99: 5884778.00 max: 1694084747.00
patched: mean : 882026.28 max90: 92015.00 max95: 291760.00 max99: 7590167.00 max: 1841154776.00

And there are more migrations of hackbench tasks,

baseline: total: 5390 cross-MC: 3810 cross-DIE: 1580
patched : total: 7222 cross-MC: 4591 cross-DIE: 2631
(+34.0%) (+20.5%) (+66.5%)

That's a lot more costly cross-DIE migrations. I think this patch is
along the right lines, but there's something fishy happening on this
box.

[0] - Fork task placement spread measurement:

cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'

2016-10-11 10:28:06

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Mon, 10 Oct, at 06:09:14PM, Wanpeng Li wrote:
> 2016-10-10 18:01 GMT+08:00 Matt Fleming <[email protected]>:
> > On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:
> >>
> >> The difference between this patch and Peterz's is that your patch still
> >> has a delta, since activate_task()->enqueue_task() does call
> >> update_rq_clock(). So why would not having that delta cause low-CPU
> >> machines (4 or 8) to regress, as per your other reply in this thread?
> >
> > Both my patch and Peter's patch cause issues with low cpu machines. In
> > <[email protected]> I said,
> >
> > "This patch causes some low cpu machines (4 or 8) to regress. It turns
> > out they regress with my patch too."
> >
> > Have I misunderstood your question?
>
> I was comparing this one, https://marc.info/?l=linux-kernel&m=147446511924571,
> with Peterz's patch.

Oh. Low cpu machines probably do regress with my patch. At least, I
can't find the notes to prove that they don't.

2016-10-11 10:40:13

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Tue, 11 Oct, at 10:44:25AM, Dietmar Eggemann wrote:
>
> [...]
>
> Yeah, you're right. But I can't see any significant difference. IMHO,
> it's all in the noise.
>
> (A) Performance counter stats for 'perf bench sched messaging -g 100 -l
> 1 -t'
> # 20 sender and receiver threads per group
> # 100 groups == 4000 threads run
>

FWIW, our tests run with 1000 loops, not 1, and we don't use 100
groups for low-cpu machines. We tend to test up to $((nproc * 4)) groups.

2016-10-11 13:15:13

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 11 October 2016 at 12:24, Matt Fleming <[email protected]> wrote:
> On Mon, 10 Oct, at 07:34:40PM, Vincent Guittot wrote:
>>
>> Subject: [PATCH] sched: use load_avg for selecting idlest group
>>
>> find_idlest_group() only compares the runnable_load_avg when looking for
>> the idlest group. But on fork-intensive use cases like hackbench, where
>> tasks block quickly after the fork, this can lead to selecting the same
>> CPU whereas other CPUs, which have a similar runnable load but a lower
>> load_avg, could be chosen instead.
>>
>> When the runnable_load_avg of 2 CPUs is close, we now take into account
>> the amount of blocked load as a 2nd selection factor.
>>
>> For use cases like hackbench, this enables the scheduler to select
>> different CPUs during the fork sequence and to spread tasks across the
>> system.
>>
>> Tests have been done on a Hikey board (ARM-based octo-core) with several
>> kernels. The results below give the min, max, avg and stdev values of 18
>> runs with each configuration.
>>
>> The v4.8+patches configuration also includes the change below, which is
>> part of the proposal made by Peter to ensure that the clock is up to date
>> when the forked task is attached to the rq.
>>
>> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
>> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>> #endif
>> rq = __task_rq_lock(p, &rf);
>> + update_rq_clock(rq);
>> post_init_entity_util_avg(&p->se);
>>
>> activate_task(rq, p, 0);
>>
>> hackbench -P -g 1
>>
>>        ea86cb4b7621  7dc603c9028e  v4.8       v4.8+patches
>> min    0.049         0.050         0.051      0.048
>> avg    0.057         0.057(0%)     0.057(0%)  0.055(+5%)
>> max    0.066         0.068         0.070      0.063
>> stdev  +/-9%         +/-9%         +/-8%      +/-9%
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
>> 1 file changed, 32 insertions(+), 8 deletions(-)
>
> This patch looks pretty good to me and this 2-socket 48-cpu Xeon
> (domain0 SMT, domain1 MC, domain2 NUMA) shows a few nice performance
> improvements, and no regressions for various combinations of hackbench
> sockets/pipes and group numbers.

Good

>
> But on a 2-socket 8-cpu Xeon (domain0 MC, domain1 DIE) running,
>
> perf stat --null -r 25 -- hackbench -pipe 30 process 1000

I'm going to have a look at this use case on my Hikey board with 8
CPUs (2 clusters of quad cores) (domain0 MC, domain1 DIE), which seems
to have a similar topology.

>
> I see a regression,
>
> baseline: 2.41228
> patched : 2.64528 (-9.7%)

Just to be sure; By baseline you mean v4.8 ?

>
> Even though the spread of tasks during fork[0] is improved,
>
> baseline CV: 0.478%
> patched CV : 0.042%
>
> Clearly the spread wasn't *that* bad to begin with on this machine for
> this workload. I consider the baseline spread to be pretty well
> distributed. Some other factor must be at play.
>
> Patched runqueue latencies are higher (max9* are percentiles),
>
> baseline: mean: 615932.69 max90: 75272.00 max95: 175985.00 max99: 5884778.00 max: 1694084747.00
> patched: mean : 882026.28 max90: 92015.00 max95: 291760.00 max99: 7590167.00 max: 1841154776.00
>
> And there are more migrations of hackbench tasks,
>
> baseline: total: 5390 cross-MC: 3810 cross-DIE: 1580
> patched : total: 7222 cross-MC: 4591 cross-DIE: 2631
> (+34.0%) (+20.5%) (+66.5%)
>
> That's a lot more costly cross-DIE migrations. I think this patch is
> along the right lines, but there's something fishy happening on this
> box.
>
> [0] - Fork task placement spread measurement:
>
> cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
> sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'

nice command to evaluate spread

2016-10-11 19:01:15

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Tue, 11 Oct, at 03:14:47PM, Vincent Guittot wrote:
> >
> > I see a regression,
> >
> > baseline: 2.41228
> > patched : 2.64528 (-9.7%)
>
> Just to be sure; By baseline you mean v4.8 ?

Baseline is actually tip/sched/core commit 447976ef4fd0
("sched/irqtime: Consolidate irqtime flushing code") but I could try
out v4.8 instead if you'd prefer that.

> > cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
> > sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'
>
> nice command to evaluate spread

Thanks!

2016-10-12 07:42:57

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 11 October 2016 at 20:57, Matt Fleming <[email protected]> wrote:
> On Tue, 11 Oct, at 03:14:47PM, Vincent Guittot wrote:
>> >
>> > I see a regression,
>> >
>> > baseline: 2.41228
>> > patched : 2.64528 (-9.7%)
>>
>> Just to be sure; By baseline you mean v4.8 ?
>
> Baseline is actually tip/sched/core commit 447976ef4fd0
> ("sched/irqtime: Consolidate irqtime flushing code") but I could try
> out v4.8 instead if you'd prefer that.

OK. In fact, I have noticed another regression with tip/sched/core and
hackbench while looking at yours.
I have bisected it to:
10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings")

hackbench -P -g 1

       v4.8        tip/sched/core  tip/sched/core + revert of
                                   10e2f1acd010 and 1b568f0aabf2
min    0.051       0.052           0.049
avg    0.057(0%)   0.062(-7%)      0.056(+1%)
max    0.070       0.073           0.067
stdev  +/-8%       +/-10%          +/-9%

The issue seems to be that it prevents some migrations at wakeup at
the end of the hackbench test, so the last tasks compete for the same
CPU whereas other CPUs in the same MC domain are idle. I haven't yet
looked more deeply into which part of the patch causes the regression.

>
>> > cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
>> > sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'
>>
>> nice command to evaluate spread
>
> Thanks!

2016-10-18 10:11:34

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Tue, 11 Oct, at 11:39:57AM, Matt Fleming wrote:
> On Tue, 11 Oct, at 10:44:25AM, Dietmar Eggemann wrote:
> >
> > [...]
> >
> > Yeah, you're right. But I can't see any significant difference. IMHO,
> > it's all in the noise.
> >
> > (A) Performance counter stats for 'perf bench sched messaging -g 100 -l
> > 1 -t'
> > # 20 sender and receiver threads per group
> > # 100 groups == 4000 threads run
> >
>
> FWIW, our tests run with 1000 loops, not 1, and we don't use 100
> groups for low-cpu machines. We tend to test up to $((nproc * 4)) groups.

It's worth pointing out that using less than ~655 loops allows the
'sender' tasks to write down the pipes (when using -p) without
blocking due to the pipe being full, since the default number of pipe
buffers is 16 (PIPE_DEF_BUFFERS) and each buffer is one page.

This has implications for the "optimal" schedule which in that case
is: schedule all sender tasks first, and run to exit, then schedule
all readers to read from the pipes.

Just something I noticed to watch out for when picking a loops count.
There's probably some similar issue for socket buffers.
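
A quick back-of-the-envelope check of that ~655 figure (a sketch only; the 4K
page size and hackbench's default 100-byte message size are assumptions here,
not values quoted above):

	#include <stdio.h>

	int main(void)
	{
		const long pipe_def_buffers = 16;	/* PIPE_DEF_BUFFERS */
		const long page_size        = 4096;	/* one page per pipe buffer (assumed) */
		const long msg_size         = 100;	/* hackbench default datasize (assumed) */

		long pipe_capacity = pipe_def_buffers * page_size;	/* 65536 bytes */

		/* ~655 messages fit before a sender's write() would block. */
		printf("loops before a sender can block: ~%ld\n",
		       pipe_capacity / msg_size);
		return 0;
	}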

2016-10-18 10:29:47

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Tue, 11 Oct, at 11:24:53AM, Matt Fleming wrote:
>
> That's a lot more costly cross-DIE migrations. I think this patch is
> along the right lines, but there's something fishy happening on this
> box.

I wasn't really able to track down why this machine regressed, and the
patch shows enough improvement for other boxes that I think it's worth
sticking in tip/sched/core.

I saw a number of double digit gains for hackbench with this patch.
Here's a 4-cpu box that shows improvements in arithmetic mean and some
nice improvements to the stddev,

hackbench-process-pipes
4.8.0 4.8.0
vanilla find-idle-group-v1r1
Amean 1 3.2580 ( 0.00%) 3.0607 ( 6.06%)
Amean 3 8.0387 ( 0.00%) 8.2027 ( -2.04%)
Amean 5 12.0507 ( 0.00%) 12.1703 ( -0.99%)
Amean 7 16.0027 ( 0.00%) 16.4567 ( -2.84%)
Amean 12 26.9030 ( 0.00%) 21.7767 ( 19.05%)
Amean 16 33.4127 ( 0.00%) 28.9877 ( 13.24%)
Stddev 1 0.1383 ( 0.00%) 0.1787 (-29.22%)
Stddev 3 0.1297 ( 0.00%) 0.1677 (-29.29%)
Stddev 5 0.5495 ( 0.00%) 0.4549 ( 17.22%)
Stddev 7 0.7401 ( 0.00%) 0.6161 ( 16.75%)
Stddev 12 2.3187 ( 0.00%) 0.7102 ( 69.37%)
Stddev 16 4.0734 ( 0.00%) 2.9364 ( 27.91%)

And a 60-cpu box with similar improvements,

hackbench-process-sockets
4.8.0 4.8.0
vanilla find-idle-group-v1r1
Amean 1 32.8810 ( 0.00%) 2.0590 ( 93.74%)
Amean 4 34.8663 ( 0.00%) 3.3177 ( 90.48%)
Amean 7 34.1243 ( 0.00%) 4.3323 ( 87.30%)
Amean 12 36.8373 ( 0.00%) 5.3650 ( 85.44%)
Amean 21 29.0333 ( 0.00%) 6.0280 ( 79.24%)
Amean 30 27.2807 ( 0.00%) 7.2653 ( 73.37%)
Amean 48 23.2453 ( 0.00%) 9.9810 ( 57.06%)
Amean 79 29.6660 ( 0.00%) 16.6680 ( 43.81%)
Amean 110 41.3680 ( 0.00%) 23.1787 ( 43.97%)
Amean 141 45.3720 ( 0.00%) 29.4130 ( 35.17%)
Amean 172 45.6403 ( 0.00%) 34.3783 ( 24.68%)
Amean 203 47.6120 ( 0.00%) 40.0063 ( 15.97%)
Amean 234 50.1703 ( 0.00%) 43.4503 ( 13.39%)
Amean 240 50.1110 ( 0.00%) 44.4660 ( 11.26%)
Stddev 1 0.9714 ( 0.00%) 0.6617 ( 31.88%)
Stddev 4 0.4459 ( 0.00%) 0.2943 ( 34.01%)
Stddev 7 0.3091 ( 0.00%) 0.2722 ( 11.94%)
Stddev 12 0.4603 ( 0.00%) 0.1424 ( 69.05%)
Stddev 21 1.3647 ( 0.00%) 0.1566 ( 88.52%)
Stddev 30 4.1277 ( 0.00%) 0.0481 ( 98.84%)
Stddev 48 1.3689 ( 0.00%) 0.4968 ( 63.71%)
Stddev 79 2.0772 ( 0.00%) 0.5309 ( 74.44%)
Stddev 110 4.5655 ( 0.00%) 0.1846 ( 95.96%)
Stddev 141 4.2637 ( 0.00%) 0.6675 ( 84.34%)
Stddev 172 1.5741 ( 0.00%) 1.0242 ( 34.93%)
Stddev 203 1.0973 ( 0.00%) 1.4043 (-27.98%)
Stddev 234 2.2048 ( 0.00%) 0.8709 ( 60.50%)
Stddev 240 1.2849 ( 0.00%) 0.2688 ( 79.08%)

2016-10-18 11:10:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Wed, Oct 12, 2016 at 09:41:36AM +0200, Vincent Guittot wrote:

> OK. In fact, I have noticed another regression with tip/sched/core and
> hackbench while looking at yours.
> I have bisected it to:
> 10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings")
>
> hackbench -P -g 1
>
>        v4.8        tip/sched/core  tip/sched/core + revert of
>                                    10e2f1acd010 and 1b568f0aabf2
> min    0.051       0.052           0.049
> avg    0.057(0%)   0.062(-7%)      0.056(+1%)
> max    0.070       0.073           0.067
> stdev  +/-8%       +/-10%          +/-9%
>
> The issue seems to be that it prevents some migrations at wakeup at
> the end of the hackbench test, so the last tasks compete for the same
> CPU whereas other CPUs in the same MC domain are idle. I haven't yet
> looked more deeply into which part of the patch causes the regression.

So select_idle_cpu(), which does the LLC-wide CPU scan, is now throttled
by a comparison between avg_cost and avg_idle, where avg_cost is a
historical measure of how costly it was to scan the entire LLC domain
and avg_idle is our current idle time guesstimate (also a historical
average).

The problem was that a number of workloads were spending quite a lot of
time here scanning CPUs while they could be doing useful work (esp.
since newer parts have silly amounts of CPUs per LLC).

The toggle is a heuristic with a fairly arbitrary number in it; we could
see if there's anything better we can do. I know some people take the
toggle out entirely, but that will regress other workloads.
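
To make the shape of that throttle concrete, here is a minimal sketch (an
illustration distilled from the description above, not a verbatim copy of
10e2f1acd0; the 512 divisor merely stands in for the fuzz factor mentioned):

	/* Illustration only: the avg_idle vs. avg_cost comparison described above. */
	static bool llc_scan_is_worth_it(u64 avg_idle, u64 avg_cost)
	{
		/*
		 * avg_idle: historical average of how long this rq stays idle.
		 * avg_cost: historical average cost of scanning the whole LLC.
		 * The divisor is the somewhat arbitrary fuzz factor referred to above.
		 */
		return (avg_idle / 512) >= avg_cost;
	}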

2016-10-18 11:10:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Tue, Oct 18, 2016 at 11:29:37AM +0100, Matt Fleming wrote:
> On Tue, 11 Oct, at 11:24:53AM, Matt Fleming wrote:
> >
> > That's a lot more costly cross-DIE migrations. I think this patch is
> > along the right lines, but there's something fishy happening on this
> > box.
>
> I wasn't really able to track down why this machine regressed, and the
> patch shows enough improvement for other boxes that I think it's worth
> sticking in tip/sched/core.
>

I'm entirely lost as to which patch we're talking about by now ;-)

2016-10-18 11:30:19

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Tue, 18 Oct, at 01:10:17PM, Peter Zijlstra wrote:
>
> I'm entirely lost as to which patch we're talking about by now ;-)

Heh, this one from Vincent,

https://lkml.kernel.org/r/[email protected]

2016-10-18 12:16:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Tue, Oct 18, 2016 at 12:29:57PM +0100, Matt Fleming wrote:
> On Tue, 18 Oct, at 01:10:17PM, Peter Zijlstra wrote:
> >
> > I'm entirely lost as to which patch we're talking about by now ;-)
>
> Heh, this one from Vincent,
>
> https://lkml.kernel.org/r/[email protected]

Ah, right.

Seems like a sensible thing to do, and I suppose I should go finish my
(and your) update_rq_clock() patches, which supersede the patch referred
to in that thing and which it depends upon.


It might make sense to have helper functions to evaluate those
conditions, because currently there's two instances of each, once in the
branch selection and then again (but inverted, we miss the == case fwiw)
in the return NULL case.



2016-10-18 15:19:54

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 18 October 2016 at 13:09, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 12, 2016 at 09:41:36AM +0200, Vincent Guittot wrote:
>
>> OK. In fact, I have noticed another regression with tip/sched/core and
>> hackbench while looking at yours.
>> I have bisected it to:
>> 10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings")
>>
>> hackbench -P -g 1
>>
>>        v4.8        tip/sched/core  tip/sched/core + revert of
>>                                    10e2f1acd010 and 1b568f0aabf2
>> min    0.051       0.052           0.049
>> avg    0.057(0%)   0.062(-7%)      0.056(+1%)
>> max    0.070       0.073           0.067
>> stdev  +/-8%       +/-10%          +/-9%
>>
>> The issue seems to be that it prevents some migrations at wakeup at
>> the end of the hackbench test, so the last tasks compete for the same
>> CPU whereas other CPUs in the same MC domain are idle. I haven't yet
>> looked more deeply into which part of the patch causes the regression.
>
> So select_idle_cpu(), which does the LLC-wide CPU scan, is now throttled
> by a comparison between avg_cost and avg_idle, where avg_cost is a
> historical measure of how costly it was to scan the entire LLC domain
> and avg_idle is our current idle time guesstimate (also a historical
> average).
>
> The problem was that a number of workloads were spending quite a lot of
> time here scanning CPUs while they could be doing useful work (esp.
> since newer parts have silly amounts of CPUs per LLC).

Makes sense.

>
> The toggle is a heuristic with a fairly arbitrary number in it; we could
> see if there's anything better we can do. I know some people take the
> toggle out entirely, but that will regress other workloads.

OK, so removing the toggle fixes the problem in this test case too.

Maybe we can take the sd_llc_size into account in the toggle.
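
One possible reading of that idea, purely as an illustration (an assumption of
what "taking sd_llc_size into account" could mean, not a posted patch),
building on the throttle sketch shown earlier: scale the cost side of the
comparison by the LLC size so that larger LLCs get throttled sooner.

	/* Illustration only: one way sd_llc_size could enter the toggle. */
	static bool llc_scan_is_worth_it_scaled(u64 avg_idle, u64 avg_cost,
						unsigned int llc_size)
	{
		return (avg_idle / 512) >= avg_cost * llc_size;
	}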

2016-10-19 06:38:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 18 October 2016 at 14:15, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 18, 2016 at 12:29:57PM +0100, Matt Fleming wrote:
>> On Tue, 18 Oct, at 01:10:17PM, Peter Zijlstra wrote:
>> >
>> > I'm entirely lost as to which patch we're talking about by now ;-)
>>
>> Heh, this one from Vincent,
>>
>> https://lkml.kernel.org/r/[email protected]
>
> Ah, right.
>
> Seems like a sensible thing to do, and I suppose I should go finish my
> (and your) update_rq_clock() patches, which supersede the patch referred
> to in that thing and which it depends upon.
>
>
> It might make sense to have helper functions to evaluate those

The main issue is the number of parameters used in these conditions,
which makes a helper function not really more readable.

> conditions, because currently there's two instances of each, once in the
> branch selection and then again (but inverted, we miss the == case fwiw)

I'm not sure I catch the comment about inverted and missing the == case.
The test splits runnable_load_avg into 3 ranges:

[0 .. (min_runnable_load - imbalance)]:
    use runnable_load_avg/this_runnable_load, which is significantly smaller
](min_runnable_load - imbalance) .. (min_runnable_load + imbalance)[:
    min_runnable_load and runnable_load_avg/this_runnable_load are close,
    so we compare min_avg_load with avg_load/this_avg_load to choose
[(min_runnable_load + imbalance) .. ULONG_MAX]:
    use min_runnable_load

The condition is used both when we look for the best other group in the
sched_domain and when we compare the local group with this best other
group.
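
Restating those three ranges as code, purely as an illustration (the helper
and enum names are invented here, and the example margin of 256 assumes
scale_load_down(NICE_0_LOAD) == 1024 with imbalance_pct == 125):

	/* Illustration only: the three ranges described above for one candidate
	 * group's runnable load against the current minimum. */
	enum idlest_range {
		CLEARLY_LIGHTER,	/* pick it on runnable load alone */
		CLOSE_USE_BLOCKED,	/* tie-break on avg_load (runnable + blocked) */
		CLEARLY_HEAVIER,	/* keep the current minimum */
	};

	static enum idlest_range classify_group(unsigned long runnable_load,
						unsigned long min_runnable_load,
						unsigned long imbalance)
	{
		if (min_runnable_load > runnable_load + imbalance)
			return CLEARLY_LIGHTER;
		if (runnable_load < min_runnable_load + imbalance)
			return CLOSE_USE_BLOCKED;
		return CLEARLY_HEAVIER;
	}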

> in the return NULL case.
>
>
>

2016-10-19 15:21:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On Wed, Oct 19, 2016 at 08:38:12AM +0200, Vincent Guittot wrote:
> > It might make sense to have helper functions to evaluate those
>
> The main issue is the number of parameters used in these conditions,
> which makes a helper function not really more readable.

Fair enough I suppose..

> > conditions, because currently there's two instances of each, once in the
> > branch selection and then again (but inverted, we miss the == case fwiw)
>
>> I'm not sure I catch the comment about inverted and missing the == case.

Oh, you're right. Which proves my point on this not being entirely
readable :/

Initially I thought the things were of the form:

else if (ineq) {
}

...

if (... || !ineq || ...)


Which would get you things like: a<b, !a<b := a>b, and leave a==b
undefined. But if I put them along side one another like:


+ } else if (min_runnable_load > (runnable_load + imbalance)) {

+ (min_runnable_load > (this_runnable_load + imbalance)) ||


+ } else if ((runnable_load < (min_runnable_load + imbalance)) &&
+ (100*min_avg_load > sd->imbalance_pct*avg_load)) {

+ ((this_runnable_load < (min_runnable_load + imbalance)) &&
+ (100*min_avg_load > sd->imbalance_pct*this_avg_load)))

We can see this is not in fact the case.

Blergh, I also cannot see a pretty way to increase readability here,
because while they have the same general shape, there's this small
variation with this_*.

Ah well...

2016-11-09 16:54:19

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 19 October 2016 at 11:53, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 19, 2016 at 08:38:12AM +0200, Vincent Guittot wrote:
>> > It might make sense to have helper functions to evaluate those
>>
>> The main issue is the number of parameters used in these conditions,
>> which makes a helper function not really more readable.
>
> Fair enough I suppose..
>
>> > conditions, because currently there's two instances of each, once in the
>> > branch selection and then again (but inverted, we miss the == case fwiw)
>>
>> I'm not sure I catch the comment about inverted and missing the == case.
>
> Oh, you're right. Which proves my point on this not being entirely
> readable :/

Hi Peter,

I'm not sure of the status of this patch.
It seems difficult to get more readable code.
Do you want me to add more comments about when we enter each state, or
is the current version enough?

Vincent

>
> Initially I thought the things were of the form:
>
> else if (ineq) {
> }
>
> ...
>
> if (... || !ineq || ...)
>
>
> Which would get you things like: a<b, !a<b := a>b, and leave a==b
> undefined. But if I put them along side one another like:
>
>
> + } else if (min_runnable_load > (runnable_load + imbalance)) {
>
> + (min_runnable_load > (this_runnable_load + imbalance)) ||
>
>
> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
> + (100*min_avg_load > sd->imbalance_pct*avg_load)) {
>
> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
> + (100*min_avg_load > sd->imbalance_pct*this_avg_load)))
>
> We can see this is not in fact the case.
>
> Blergh, I also cannot see a pretty way to increase readability here,
> because while they have the same general shape, there's this small
> variation with this_*.
>
> Ah well...