2016-11-25 15:36:12

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 0/2 v2] sched: improve spread of tasks during fork

This patchset was originally 1 patch but a perf regression happened during the rebase.
The patch 01 fixes the perf regression discovered in tip/sched/core during the
rebase.
The patch 02 is the rebase of the patch that fixes the regression raised by
Matt Fleming [1].

[1] https://lkml.org/lkml/2016/10/18/206

Vincent Guittot (2):
sched: fix find_idlest_group for fork
sched: use load_avg for selecting idlest group

kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 44 insertions(+), 10 deletions(-)

--
2.7.4


2016-11-25 15:36:19

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

During fork, the utilization of a task is init once the rq has been
selected because the current utilization level of the rq is used to set
the utilization of the fork task. As the task's utilization is still
null at this step of the fork sequence, it doesn't make sense to look for
some spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
because the least loaded policy is always bypassed and tasks are not
spread during fork.

With this patch and the fix below, we are back to same performances as
for v4.8. The fix below is only a temporary one used for the test until a
smarter solution is found because we can't simply remove the test which is
useful for others benchmarks

@@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

avg_cost = this_sd->avg_scan_cost;

- /*
- * Due to large variance we need a large fuzz factor; hackbench in
- * particularly is sensitive here.
- */
- if ((avg_idle / 512) < avg_cost)
- return -1;
-
time = local_clock();

for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa47589..820a787 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
* utilized systems if we require spare_capacity > task_util(p),
* so we allow for some task stuffing by using
* spare_capacity > task_util(p)/2.
+ * spare capacity can't be used for fork because the utilization has
+ * not been set yet as it need to get a rq to init the utilization
*/
+ if (sd_flag & SD_BALANCE_FORK)
+ goto no_spare;
+
if (this_spare > task_util(p) / 2 &&
imbalance*this_spare > 100*most_spare)
return NULL;
else if (most_spare > task_util(p) / 2)
return most_spare_sg;

+no_spare:
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
--
2.7.4

2016-11-25 15:36:33

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

find_idlest_group() only compares the runnable_load_avg when looking for
the least loaded group. But on fork intensive use case like hackbench
where tasks blocked quickly after the fork, this can lead to selecting the
same CPU instead of other CPUs, which have similar runnable load but a
lower load_avg.

When the runnable_load_avg of 2 CPUs are close, we now take into account
the amount of blocked load as a 2nd selection factor. There is now 3 zones
for the runnable_load of the rq:
-[0 .. (runnable_load - imbalance)] : Select the new rq which has
significantly less runnable_load
-](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
runnable load are close so we use load_avg to chose between the 2 rq
-[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
has significantly less runnable_load

For use case like hackbench, this enable the scheduler to select different
CPUs during the fork sequence and to spread tasks across the system.

Tests have been done on a Hikey board (ARM based octo cores) for several
kernel. The result below gives min, max, avg and stdev values of 18 runs
with each configuration.

The v4.8+patches configuration also includes the changes below which is
part of the proposal made by Peter to ensure that the clock will be up to
date when the fork task will be attached to the rq.

@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
rq = __task_rq_lock(p, &rf);
+ update_rq_clock(rq);
post_init_entity_util_avg(&p->se);

activate_task(rq, p, 0);

hackbench -P -g 1

ea86cb4b7621 7dc603c9028e v4.8 v4.8+patches
min 0.049 0.050 0.051 0,048
avg 0.057 0.057(0%) 0.057(0%) 0,055(+5%)
max 0.066 0.068 0.070 0,063
stdev +/-9% +/-9% +/-8% +/-9%

Signed-off-by: Vincent Guittot <[email protected]>
---

Changes since v2:
- Rebase on latest sched/core
- Get same results with the rebase and the fix mentioned in patch 01

kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 820a787..ecb5ee8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5395,16 +5395,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
{
struct sched_group *idlest = NULL, *group = sd->groups;
struct sched_group *most_spare_sg = NULL;
- unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+ unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
unsigned long most_spare = 0, this_spare = 0;
int load_idx = sd->forkexec_idx;
- int imbalance = 100 + (sd->imbalance_pct-100)/2;
+ int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
+ unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
+ (sd->imbalance_pct-100) / 100;

if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;

do {
- unsigned long load, avg_load, spare_cap, max_spare_cap;
+ unsigned long load, avg_load, runnable_load;
+ unsigned long spare_cap, max_spare_cap;
int local_group;
int i;

@@ -5421,6 +5425,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
* the group containing the CPU with most spare capacity.
*/
avg_load = 0;
+ runnable_load = 0;
max_spare_cap = 0;

for_each_cpu(i, sched_group_cpus(group)) {
@@ -5430,7 +5435,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
else
load = target_load(i, load_idx);

- avg_load += load;
+ runnable_load += load;
+
+ avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);

spare_cap = capacity_spare_wake(i, p);

@@ -5439,14 +5446,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}

/* Adjust by relative CPU capacity of the group */
- avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+ avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
+ group->sgc->capacity;
+ runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+ group->sgc->capacity;

if (local_group) {
- this_load = avg_load;
+ this_runnable_load = runnable_load;
+ this_avg_load = avg_load;
this_spare = max_spare_cap;
} else {
- if (avg_load < min_load) {
- min_load = avg_load;
+ if (min_runnable_load > (runnable_load + imbalance)) {
+ /*
+ * The runnable load is significantly smaller
+ * so we can pick this new cpu
+ */
+ min_runnable_load = runnable_load;
+ min_avg_load = avg_load;
+ idlest = group;
+ } else if ((runnable_load < (min_runnable_load + imbalance)) &&
+ (100*min_avg_load > imbalance_scale*avg_load)) {
+ /*
+ * The runnable loads are close so we take
+ * into account blocked load through avg_load
+ * which is blocked + runnable load
+ */
+ min_avg_load = avg_load;
idlest = group;
}

@@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
goto no_spare;

if (this_spare > task_util(p) / 2 &&
- imbalance*this_spare > 100*most_spare)
+ imbalance_scale*this_spare > 100*most_spare)
return NULL;
else if (most_spare > task_util(p) / 2)
return most_spare_sg;

no_spare:
- if (!idlest || 100*this_load < imbalance*min_load)
+ if (!idlest ||
+ (min_runnable_load > (this_runnable_load + imbalance)) ||
+ ((this_runnable_load < (min_runnable_load + imbalance)) &&
+ (100*min_avg_load > imbalance_scale*this_avg_load)))
return NULL;
return idlest;
}
--
2.7.4

2016-11-28 17:01:34

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Fri, 25 Nov, at 04:34:32PM, Vincent Guittot wrote:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa47589..820a787 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> * utilized systems if we require spare_capacity > task_util(p),
> * so we allow for some task stuffing by using
> * spare_capacity > task_util(p)/2.
> + * spare capacity can't be used for fork because the utilization has
> + * not been set yet as it need to get a rq to init the utilization
> */
> + if (sd_flag & SD_BALANCE_FORK)
> + goto no_spare;
> +
> if (this_spare > task_util(p) / 2 &&
> imbalance*this_spare > 100*most_spare)
> return NULL;
> else if (most_spare > task_util(p) / 2)
> return most_spare_sg;
>
> +no_spare:
> if (!idlest || 100*this_load < imbalance*min_load)
> return NULL;
> return idlest;

It's only a minor comment, but would you be opposed to calling this
label 'skip_spare' to indicate that spare capacity may exist, but
we're not going to make use of it?

2016-11-28 17:03:50

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 0/2 v2] sched: improve spread of tasks during fork

On Fri, 25 Nov, at 04:34:31PM, Vincent Guittot wrote:
> This patchset was originally 1 patch but a perf regression happened during the rebase.
> The patch 01 fixes the perf regression discovered in tip/sched/core during the
> rebase.
> The patch 02 is the rebase of the patch that fixes the regression raised by
> Matt Fleming [1].
>
> [1] https://lkml.org/lkml/2016/10/18/206
>
> Vincent Guittot (2):
> sched: fix find_idlest_group for fork
> sched: use load_avg for selecting idlest group
>
> kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 44 insertions(+), 10 deletions(-)

FYI, I'm running both of these patches through the SUSE performance
testing grid right now. I'll let you know as soon as I have results
(should be in the next few days).

2016-11-28 17:21:24

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On 28 November 2016 at 18:01, Matt Fleming <[email protected]> wrote:
> On Fri, 25 Nov, at 04:34:32PM, Vincent Guittot wrote:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aa47589..820a787 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> * utilized systems if we require spare_capacity > task_util(p),
>> * so we allow for some task stuffing by using
>> * spare_capacity > task_util(p)/2.
>> + * spare capacity can't be used for fork because the utilization has
>> + * not been set yet as it need to get a rq to init the utilization
>> */
>> + if (sd_flag & SD_BALANCE_FORK)
>> + goto no_spare;
>> +
>> if (this_spare > task_util(p) / 2 &&
>> imbalance*this_spare > 100*most_spare)
>> return NULL;
>> else if (most_spare > task_util(p) / 2)
>> return most_spare_sg;
>>
>> +no_spare:
>> if (!idlest || 100*this_load < imbalance*min_load)
>> return NULL;
>> return idlest;
>
> It's only a minor comment, but would you be opposed to calling this
> label 'skip_spare' to indicate that spare capacity may exist, but
> we're not going to make use of it?

you're right, 'skip_spare' makes more sense

2016-11-28 17:21:58

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 0/2 v2] sched: improve spread of tasks during fork

On 28 November 2016 at 18:02, Matt Fleming <[email protected]> wrote:
> On Fri, 25 Nov, at 04:34:31PM, Vincent Guittot wrote:
>> This patchset was originally 1 patch but a perf regression happened during the rebase.
>> The patch 01 fixes the perf regression discovered in tip/sched/core during the
>> rebase.
>> The patch 02 is the rebase of the patch that fixes the regression raised by
>> Matt Fleming [1].
>>
>> [1] https://lkml.org/lkml/2016/10/18/206
>>
>> Vincent Guittot (2):
>> sched: fix find_idlest_group for fork
>> sched: use load_avg for selecting idlest group
>>
>> kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 44 insertions(+), 10 deletions(-)
>
> FYI, I'm running both of these patches through the SUSE performance
> testing grid right now. I'll let you know as soon as I have results
> (should be in the next few days).

Thanks

2016-11-29 10:58:17

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Fri, Nov 25, 2016 at 04:34:32PM +0100, Vincent Guittot wrote:
> During fork, the utilization of a task is init once the rq has been
> selected because the current utilization level of the rq is used to set
> the utilization of the fork task. As the task's utilization is still
> null at this step of the fork sequence, it doesn't make sense to look for
> some spare capacity that can fit the task's utilization.
> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
> because the least loaded policy is always bypassed and tasks are not
> spread during fork.

Agreed, the late initialization of util_avg doesn't work very well with
the spare capacity checking.

> With this patch and the fix below, we are back to same performances as
> for v4.8. The fix below is only a temporary one used for the test until a
> smarter solution is found because we can't simply remove the test which is
> useful for others benchmarks
>
> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> avg_cost = this_sd->avg_scan_cost;
>
> - /*
> - * Due to large variance we need a large fuzz factor; hackbench in
> - * particularly is sensitive here.
> - */
> - if ((avg_idle / 512) < avg_cost)
> - return -1;
> -
> time = local_clock();
>
> for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

I don't quite get this fix, but it is very likely because I haven't paid
enough attention.

Are you saying that removing the avg_cost check is improving hackbench
performance? I thought it was supposed to help hackbench? I'm confused
:-(

>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa47589..820a787 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> * utilized systems if we require spare_capacity > task_util(p),
> * so we allow for some task stuffing by using
> * spare_capacity > task_util(p)/2.
> + * spare capacity can't be used for fork because the utilization has
> + * not been set yet as it need to get a rq to init the utilization
> */
> + if (sd_flag & SD_BALANCE_FORK)
> + goto no_spare;
> +
> if (this_spare > task_util(p) / 2 &&
> imbalance*this_spare > 100*most_spare)
> return NULL;
> else if (most_spare > task_util(p) / 2)
> return most_spare_sg;
>
> +no_spare:
> if (!idlest || 100*this_load < imbalance*min_load)
> return NULL;
> return idlest;

Looks okay to me. We are returning to use load, which is initialized,
for fork decisions.

Should we do the same for SD_BALANCE_EXEC?

An alternative fix would be to move the utilization initialization
before we pick the cpu, but that opens the whole discussion about what
we should initialize it to again. So I'm fine with not going there now.

Morten

2016-11-29 11:42:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Tue, Nov 29, 2016 at 10:57:59AM +0000, Morten Rasmussen wrote:
> > @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >
> > avg_cost = this_sd->avg_scan_cost;
> >
> > - /*
> > - * Due to large variance we need a large fuzz factor; hackbench in
> > - * particularly is sensitive here.
> > - */
> > - if ((avg_idle / 512) < avg_cost)
> > - return -1;
> > -
> > time = local_clock();
> >
> > for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
>
> I don't quite get this fix, but it is very likely because I haven't paid
> enough attention.
>
> Are you saying that removing the avg_cost check is improving hackbench
> performance? I thought it was supposed to help hackbench? I'm confused
> :-(

IIRC, and my pounding head really doesn't remember much, the comment
reads like we need the large fudge factor because hackbench. That is,
hackbench would like this test to go away, but others benchmarks will
tank.

Now, if only I would've written down which benchmarks that were.. awell.

2016-11-29 11:44:29

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Tue, 29 Nov, at 12:42:43PM, Peter Zijlstra wrote:
>
> IIRC, and my pounding head really doesn't remember much, the comment
> reads like we need the large fudge factor because hackbench. That is,
> hackbench would like this test to go away, but others benchmarks will
> tank.
>
> Now, if only I would've written down which benchmarks that were.. awell.

Going out on a limb: Chris' schbench?

2016-11-29 12:30:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Tue, Nov 29, 2016 at 11:44:19AM +0000, Matt Fleming wrote:
> On Tue, 29 Nov, at 12:42:43PM, Peter Zijlstra wrote:
> >
> > IIRC, and my pounding head really doesn't remember much, the comment
> > reads like we need the large fudge factor because hackbench. That is,
> > hackbench would like this test to go away, but others benchmarks will
> > tank.
> >
> > Now, if only I would've written down which benchmarks that were.. awell.
>
> Going out on a limb: Chris' schbench?

No, that actually wants that test taken out as well. It might have been
things like sysbench or so.

2016-11-29 13:04:55

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On 29 November 2016 at 11:57, Morten Rasmussen <[email protected]> wrote:
> On Fri, Nov 25, 2016 at 04:34:32PM +0100, Vincent Guittot wrote:
>> During fork, the utilization of a task is init once the rq has been
>> selected because the current utilization level of the rq is used to set
>> the utilization of the fork task. As the task's utilization is still
>> null at this step of the fork sequence, it doesn't make sense to look for
>> some spare capacity that can fit the task's utilization.
>> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
>> because the least loaded policy is always bypassed and tasks are not
>> spread during fork.
>
> Agreed, the late initialization of util_avg doesn't work very well with
> the spare capacity checking.
>
>> With this patch and the fix below, we are back to same performances as
>> for v4.8. The fix below is only a temporary one used for the test until a
>> smarter solution is found because we can't simply remove the test which is
>> useful for others benchmarks
>>
>> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>> avg_cost = this_sd->avg_scan_cost;
>>
>> - /*
>> - * Due to large variance we need a large fuzz factor; hackbench in
>> - * particularly is sensitive here.
>> - */
>> - if ((avg_idle / 512) < avg_cost)
>> - return -1;
>> -
>> time = local_clock();
>>
>> for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
>
> I don't quite get this fix, but it is very likely because I haven't paid
> enough attention.
>
> Are you saying that removing the avg_cost check is improving hackbench
> performance? I thought it was supposed to help hackbench? I'm confused
> :-(

Yes, avg_cost check prevents some tasks migration at the end of the
tests when some threads have already finished their loop letting some
CPUs idle whereas others threads are still competing on the same CPUS

>
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aa47589..820a787 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5463,13 +5463,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> * utilized systems if we require spare_capacity > task_util(p),
>> * so we allow for some task stuffing by using
>> * spare_capacity > task_util(p)/2.
>> + * spare capacity can't be used for fork because the utilization has
>> + * not been set yet as it need to get a rq to init the utilization
>> */
>> + if (sd_flag & SD_BALANCE_FORK)
>> + goto no_spare;
>> +
>> if (this_spare > task_util(p) / 2 &&
>> imbalance*this_spare > 100*most_spare)
>> return NULL;
>> else if (most_spare > task_util(p) / 2)
>> return most_spare_sg;
>>
>> +no_spare:
>> if (!idlest || 100*this_load < imbalance*min_load)
>> return NULL;
>> return idlest;
>
> Looks okay to me. We are returning to use load, which is initialized,
> for fork decisions.
>
> Should we do the same for SD_BALANCE_EXEC?

I asked myself if i should add SD_BALANCE_EXEC but decided to only
keep SD_BALANCE_FORK for now as no regression has been raised for now.

>
> An alternative fix would be to move the utilization initialization
> before we pick the cpu, but that opens the whole discussion about what
> we should initialize it to again. So I'm fine with not going there now.
>
> Morten

2016-11-29 14:46:22

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Tue, Nov 29, 2016 at 12:42:43PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2016 at 10:57:59AM +0000, Morten Rasmussen wrote:
> > > @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > >
> > > avg_cost = this_sd->avg_scan_cost;
> > >
> > > - /*
> > > - * Due to large variance we need a large fuzz factor; hackbench in
> > > - * particularly is sensitive here.
> > > - */
> > > - if ((avg_idle / 512) < avg_cost)
> > > - return -1;
> > > -
> > > time = local_clock();
> > >
> > > for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> >
> > I don't quite get this fix, but it is very likely because I haven't paid
> > enough attention.
> >
> > Are you saying that removing the avg_cost check is improving hackbench
> > performance? I thought it was supposed to help hackbench? I'm confused
> > :-(
>
> IIRC, and my pounding head really doesn't remember much, the comment
> reads like we need the large fudge factor because hackbench. That is,
> hackbench would like this test to go away, but others benchmarks will
> tank.

Thanks, that seems in line with Vincent's reply.

The last bit that isn't clear to me is whether /512 is a 'large' fuzz
factor. I guess it is, as we can have many wake-ups, i.e. many times
avg_cost, over the period where avg_idle is calculated. No?

2016-11-29 14:50:57

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Tue, Nov 29, 2016 at 02:04:27PM +0100, Vincent Guittot wrote:
> On 29 November 2016 at 11:57, Morten Rasmussen <[email protected]> wrote:
> > On Fri, Nov 25, 2016 at 04:34:32PM +0100, Vincent Guittot wrote:
> >> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>
> >> avg_cost = this_sd->avg_scan_cost;
> >>
> >> - /*
> >> - * Due to large variance we need a large fuzz factor; hackbench in
> >> - * particularly is sensitive here.
> >> - */
> >> - if ((avg_idle / 512) < avg_cost)
> >> - return -1;
> >> -
> >> time = local_clock();
> >>
> >> for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> >
> > I don't quite get this fix, but it is very likely because I haven't paid
> > enough attention.
> >
> > Are you saying that removing the avg_cost check is improving hackbench
> > performance? I thought it was supposed to help hackbench? I'm confused
> > :-(
>
> Yes, avg_cost check prevents some tasks migration at the end of the
> tests when some threads have already finished their loop letting some
> CPUs idle whereas others threads are still competing on the same CPUS

Okay, thanks.

> > Should we do the same for SD_BALANCE_EXEC?
>
> I asked myself if i should add SD_BALANCE_EXEC but decided to only
> keep SD_BALANCE_FORK for now as no regression has been raised for now.

Fair enough.

FWIW, with the label renaming suggested by mfleming, you can add my
reviewed/acked-by if you like.

Morten

2016-11-29 14:57:57

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On 29 November 2016 at 15:50, Morten Rasmussen <[email protected]> wrote:
> On Tue, Nov 29, 2016 at 02:04:27PM +0100, Vincent Guittot wrote:
>> On 29 November 2016 at 11:57, Morten Rasmussen <[email protected]> wrote:
>> > On Fri, Nov 25, 2016 at 04:34:32PM +0100, Vincent Guittot wrote:
>> >> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> >>
>> >> avg_cost = this_sd->avg_scan_cost;
>> >>
>> >> - /*
>> >> - * Due to large variance we need a large fuzz factor; hackbench in
>> >> - * particularly is sensitive here.
>> >> - */
>> >> - if ((avg_idle / 512) < avg_cost)
>> >> - return -1;
>> >> -
>> >> time = local_clock();
>> >>
>> >> for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
>> >
>> > I don't quite get this fix, but it is very likely because I haven't paid
>> > enough attention.
>> >
>> > Are you saying that removing the avg_cost check is improving hackbench
>> > performance? I thought it was supposed to help hackbench? I'm confused
>> > :-(
>>
>> Yes, avg_cost check prevents some tasks migration at the end of the
>> tests when some threads have already finished their loop letting some
>> CPUs idle whereas others threads are still competing on the same CPUS
>
> Okay, thanks.
>
>> > Should we do the same for SD_BALANCE_EXEC?
>>
>> I asked myself if i should add SD_BALANCE_EXEC but decided to only
>> keep SD_BALANCE_FORK for now as no regression has been raised for now.
>
> Fair enough.
>
> FWIW, with the label renaming suggested by mfleming, you can add my
> reviewed/acked-by if you like.

Thanks

>
> Morten

2016-11-30 12:49:27

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote:
> find_idlest_group() only compares the runnable_load_avg when looking for
> the least loaded group. But on fork intensive use case like hackbench
> where tasks blocked quickly after the fork, this can lead to selecting the
> same CPU instead of other CPUs, which have similar runnable load but a
> lower load_avg.
>
> When the runnable_load_avg of 2 CPUs are close, we now take into account
> the amount of blocked load as a 2nd selection factor. There is now 3 zones
> for the runnable_load of the rq:
> -[0 .. (runnable_load - imbalance)] : Select the new rq which has
> significantly less runnable_load
> -](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
> runnable load are close so we use load_avg to chose between the 2 rq
> -[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
> has significantly less runnable_load
>
> For use case like hackbench, this enable the scheduler to select different
> CPUs during the fork sequence and to spread tasks across the system.
>
> Tests have been done on a Hikey board (ARM based octo cores) for several
> kernel. The result below gives min, max, avg and stdev values of 18 runs
> with each configuration.
>
> The v4.8+patches configuration also includes the changes below which is
> part of the proposal made by Peter to ensure that the clock will be up to
> date when the fork task will be attached to the rq.
>
> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> #endif
> rq = __task_rq_lock(p, &rf);
> + update_rq_clock(rq);
> post_init_entity_util_avg(&p->se);
>
> activate_task(rq, p, 0);
>
> hackbench -P -g 1
>
> ea86cb4b7621 7dc603c9028e v4.8 v4.8+patches
> min 0.049 0.050 0.051 0,048
> avg 0.057 0.057(0%) 0.057(0%) 0,055(+5%)
> max 0.066 0.068 0.070 0,063
> stdev +/-9% +/-9% +/-8% +/-9%
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
>
> Changes since v2:
> - Rebase on latest sched/core
> - Get same results with the rebase and the fix mentioned in patch 01
>
> kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 820a787..ecb5ee8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5395,16 +5395,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> {
> struct sched_group *idlest = NULL, *group = sd->groups;
> struct sched_group *most_spare_sg = NULL;
> - unsigned long min_load = ULONG_MAX, this_load = 0;
> + unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
> + unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
> unsigned long most_spare = 0, this_spare = 0;
> int load_idx = sd->forkexec_idx;
> - int imbalance = 100 + (sd->imbalance_pct-100)/2;
> + int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
> + unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
> + (sd->imbalance_pct-100) / 100;
>
> if (sd_flag & SD_BALANCE_WAKE)
> load_idx = sd->wake_idx;
>
> do {
> - unsigned long load, avg_load, spare_cap, max_spare_cap;
> + unsigned long load, avg_load, runnable_load;
> + unsigned long spare_cap, max_spare_cap;
> int local_group;
> int i;
>
> @@ -5421,6 +5425,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> * the group containing the CPU with most spare capacity.
> */
> avg_load = 0;
> + runnable_load = 0;
> max_spare_cap = 0;
>
> for_each_cpu(i, sched_group_cpus(group)) {
> @@ -5430,7 +5435,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> else
> load = target_load(i, load_idx);
>
> - avg_load += load;
> + runnable_load += load;
> +
> + avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
>
> spare_cap = capacity_spare_wake(i, p);
>
> @@ -5439,14 +5446,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> }
>
> /* Adjust by relative CPU capacity of the group */
> - avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
> + avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
> + group->sgc->capacity;
> + runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
> + group->sgc->capacity;
>
> if (local_group) {
> - this_load = avg_load;
> + this_runnable_load = runnable_load;
> + this_avg_load = avg_load;
> this_spare = max_spare_cap;
> } else {
> - if (avg_load < min_load) {
> - min_load = avg_load;
> + if (min_runnable_load > (runnable_load + imbalance)) {
> + /*
> + * The runnable load is significantly smaller
> + * so we can pick this new cpu
> + */
> + min_runnable_load = runnable_load;
> + min_avg_load = avg_load;
> + idlest = group;
> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
> + (100*min_avg_load > imbalance_scale*avg_load)) {
> + /*
> + * The runnable loads are close so we take
> + * into account blocked load through avg_load
> + * which is blocked + runnable load
> + */
> + min_avg_load = avg_load;
> idlest = group;
> }
>
> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> goto no_spare;
>
> if (this_spare > task_util(p) / 2 &&
> - imbalance*this_spare > 100*most_spare)
> + imbalance_scale*this_spare > 100*most_spare)
> return NULL;
> else if (most_spare > task_util(p) / 2)
> return most_spare_sg;
>
> no_spare:
> - if (!idlest || 100*this_load < imbalance*min_load)
> + if (!idlest ||
> + (min_runnable_load > (this_runnable_load + imbalance)) ||
> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
> + (100*min_avg_load > imbalance_scale*this_avg_load)))

I don't get why you have imbalance_scale applied to this_avg_load and
not min_avg_load. IIUC, you end up preferring non-local groups?

If we take the example where this_runnable_load == min_runnable_load and
this_avg_load == min_avg_load. In this case, and in cases where
min_avg_load is slightly bigger than this_avg_load, we end up picking
the 'idlest' group even if the local group is equally good or even
slightly better?

> return NULL;
> return idlest;
> }

Overall, I like that load_avg is being brought in to make better
decisions. The variable naming is a bit confusing. For example,
runnable_load is a capacity-average just like avg_load. 'imbalance' is
now an absolute capacity-average margin, but it is hard to come up with
better short alternatives.

Although 'imbalance' is based on the existing imbalance_pct, I find
somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute
margin to define the interval where we want to consider load_avg? I
guess it is case of 'we had to pick some value', which we have done in
many other places. Though, IMHO, it is a bit strange that imbalance_pct
is used in two different ways to bias comparison in the same function.
It used to be only used as a scaling factor (now imbalance_scale), while
this patch proposes to use it for computing an absolute margin
(imbalance) as well. It is not major issue, but it is not clear why it
is used differently to compare two metrics that are relatively closely
related.

Morten

2016-11-30 13:49:53

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On 30 November 2016 at 13:49, Morten Rasmussen <[email protected]> wrote:
> On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote:
>> find_idlest_group() only compares the runnable_load_avg when looking for
>> the least loaded group. But on fork intensive use case like hackbench

[snip]

>> + min_avg_load = avg_load;
>> + idlest = group;
>> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
>> + (100*min_avg_load > imbalance_scale*avg_load)) {
>> + /*
>> + * The runnable loads are close so we take
>> + * into account blocked load through avg_load
>> + * which is blocked + runnable load
>> + */
>> + min_avg_load = avg_load;
>> idlest = group;
>> }
>>
>> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> goto no_spare;
>>
>> if (this_spare > task_util(p) / 2 &&
>> - imbalance*this_spare > 100*most_spare)
>> + imbalance_scale*this_spare > 100*most_spare)
>> return NULL;
>> else if (most_spare > task_util(p) / 2)
>> return most_spare_sg;
>>
>> no_spare:
>> - if (!idlest || 100*this_load < imbalance*min_load)
>> + if (!idlest ||
>> + (min_runnable_load > (this_runnable_load + imbalance)) ||
>> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
>> + (100*min_avg_load > imbalance_scale*this_avg_load)))
>
> I don't get why you have imbalance_scale applied to this_avg_load and
> not min_avg_load. IIUC, you end up preferring non-local groups?

In fact, I have keep the same condition that is used when looping the group.
You're right that we should prefer local rq if avg_load are close and
test the condition
(100*this_avg_load > imbalance_scale*min_avg_load) instead

>
> If we take the example where this_runnable_load == min_runnable_load and
> this_avg_load == min_avg_load. In this case, and in cases where
> min_avg_load is slightly bigger than this_avg_load, we end up picking
> the 'idlest' group even if the local group is equally good or even
> slightly better?
>
>> return NULL;
>> return idlest;
>> }
>
> Overall, I like that load_avg is being brought in to make better
> decisions. The variable naming is a bit confusing. For example,
> runnable_load is a capacity-average just like avg_load. 'imbalance' is
> now an absolute capacity-average margin, but it is hard to come up with
> better short alternatives.
>
> Although 'imbalance' is based on the existing imbalance_pct, I find
> somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute
> margin to define the interval where we want to consider load_avg? I
> guess it is case of 'we had to pick some value', which we have done in
> many other places. Though, IMHO, it is a bit strange that imbalance_pct
> is used in two different ways to bias comparison in the same function.

I see imbalance_pct like the definition of the acceptable imbalance %
for a sched_domain. This % is then used against the current load or to
define an absolute value.

> It used to be only used as a scaling factor (now imbalance_scale), while
> this patch proposes to use it for computing an absolute margin
> (imbalance) as well. It is not major issue, but it is not clear why it
> is used differently to compare two metrics that are relatively closely
> related.

In fact, scaling factor (imbalance) doesn't work well with small
value. As an example, the use of a scaling factor fails as soon as
this_runnable_load = 0 because we always selected local rq even if
min_runnable_load is only 1 which doesn't really make sense because
they are just the same.

>
> Morten

2016-11-30 13:54:31

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On 30 November 2016 at 14:49, Vincent Guittot
<[email protected]> wrote:
> On 30 November 2016 at 13:49, Morten Rasmussen <[email protected]> wrote:
>> On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote:
>>> find_idlest_group() only compares the runnable_load_avg when looking for
>>> the least loaded group. But on fork intensive use case like hackbench
>
> [snip]
>
>>> + min_avg_load = avg_load;
>>> + idlest = group;
>>> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
>>> + (100*min_avg_load > imbalance_scale*avg_load)) {
>>> + /*
>>> + * The runnable loads are close so we take
>>> + * into account blocked load through avg_load
>>> + * which is blocked + runnable load
>>> + */
>>> + min_avg_load = avg_load;
>>> idlest = group;
>>> }
>>>
>>> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> goto no_spare;
>>>
>>> if (this_spare > task_util(p) / 2 &&
>>> - imbalance*this_spare > 100*most_spare)
>>> + imbalance_scale*this_spare > 100*most_spare)
>>> return NULL;
>>> else if (most_spare > task_util(p) / 2)
>>> return most_spare_sg;
>>>
>>> no_spare:
>>> - if (!idlest || 100*this_load < imbalance*min_load)
>>> + if (!idlest ||
>>> + (min_runnable_load > (this_runnable_load + imbalance)) ||
>>> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
>>> + (100*min_avg_load > imbalance_scale*this_avg_load)))
>>
>> I don't get why you have imbalance_scale applied to this_avg_load and
>> not min_avg_load. IIUC, you end up preferring non-local groups?
>
> In fact, I have keep the same condition that is used when looping the group.
> You're right that we should prefer local rq if avg_load are close and
> test the condition
> (100*this_avg_load > imbalance_scale*min_avg_load) instead

Of course the correct condition is
(100*this_avg_load < imbalance_scale*min_avg_load)

>
>>
>> If we take the example where this_runnable_load == min_runnable_load and
>> this_avg_load == min_avg_load. In this case, and in cases where
>> min_avg_load is slightly bigger than this_avg_load, we end up picking
>> the 'idlest' group even if the local group is equally good or even
>> slightly better?
>>
>>> return NULL;
>>> return idlest;
>>> }
>>
>> Overall, I like that load_avg is being brought in to make better
>> decisions. The variable naming is a bit confusing. For example,
>> runnable_load is a capacity-average just like avg_load. 'imbalance' is
>> now an absolute capacity-average margin, but it is hard to come up with
>> better short alternatives.
>>
>> Although 'imbalance' is based on the existing imbalance_pct, I find
>> somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute
>> margin to define the interval where we want to consider load_avg? I
>> guess it is case of 'we had to pick some value', which we have done in
>> many other places. Though, IMHO, it is a bit strange that imbalance_pct
>> is used in two different ways to bias comparison in the same function.
>
> I see imbalance_pct like the definition of the acceptable imbalance %
> for a sched_domain. This % is then used against the current load or to
> define an absolute value.
>
>> It used to be only used as a scaling factor (now imbalance_scale), while
>> this patch proposes to use it for computing an absolute margin
>> (imbalance) as well. It is not major issue, but it is not clear why it
>> is used differently to compare two metrics that are relatively closely
>> related.
>
> In fact, scaling factor (imbalance) doesn't work well with small
> value. As an example, the use of a scaling factor fails as soon as
> this_runnable_load = 0 because we always selected local rq even if
> min_runnable_load is only 1 which doesn't really make sense because
> they are just the same.
>
>>
>> Morten

2016-11-30 14:23:37

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On Wed, Nov 30, 2016 at 02:49:11PM +0100, Vincent Guittot wrote:
> On 30 November 2016 at 13:49, Morten Rasmussen <[email protected]> wrote:
> > On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote:
> >> find_idlest_group() only compares the runnable_load_avg when looking for
> >> the least loaded group. But on fork intensive use case like hackbench
>
> [snip]
>
> >> + min_avg_load = avg_load;
> >> + idlest = group;
> >> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
> >> + (100*min_avg_load > imbalance_scale*avg_load)) {
> >> + /*
> >> + * The runnable loads are close so we take
> >> + * into account blocked load through avg_load
> >> + * which is blocked + runnable load
> >> + */
> >> + min_avg_load = avg_load;
> >> idlest = group;
> >> }
> >>
> >> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >> goto no_spare;
> >>
> >> if (this_spare > task_util(p) / 2 &&
> >> - imbalance*this_spare > 100*most_spare)
> >> + imbalance_scale*this_spare > 100*most_spare)
> >> return NULL;
> >> else if (most_spare > task_util(p) / 2)
> >> return most_spare_sg;
> >>
> >> no_spare:
> >> - if (!idlest || 100*this_load < imbalance*min_load)
> >> + if (!idlest ||
> >> + (min_runnable_load > (this_runnable_load + imbalance)) ||
> >> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
> >> + (100*min_avg_load > imbalance_scale*this_avg_load)))
> >
> > I don't get why you have imbalance_scale applied to this_avg_load and
> > not min_avg_load. IIUC, you end up preferring non-local groups?
>
> In fact, I have keep the same condition that is used when looping the group.

The logic is inverted compared to the group loop as there you after
picking a better group. There it makes sense that you accept groups with
a slightly higher runnable_load if they have a much better avg_load.

Here you are after rejecting 'idlest' group if it isn't significantly
better than the local group, so the conditions have to be inverted.

> You're right that we should prefer local rq if avg_load are close and
> test the condition
> (100*this_avg_load > imbalance_scale*min_avg_load) instead

I would argue you should switch the inequality operator around:

(100*this_avg_load < imbalance_scale*min_avg_load)

So it becomes true unless min_avg_load is significantly less than
this_avg_load meaning that we will ignore 'idlest' and return NULL. This
is also in line with old condition (100*this_load < imbalance*min_load).

> >
> > If we take the example where this_runnable_load == min_runnable_load and
> > this_avg_load == min_avg_load. In this case, and in cases where
> > min_avg_load is slightly bigger than this_avg_load, we end up picking
> > the 'idlest' group even if the local group is equally good or even
> > slightly better?
> >
> >> return NULL;
> >> return idlest;
> >> }
> >
> > Overall, I like that load_avg is being brought in to make better
> > decisions. The variable naming is a bit confusing. For example,
> > runnable_load is a capacity-average just like avg_load. 'imbalance' is
> > now an absolute capacity-average margin, but it is hard to come up with
> > better short alternatives.
> >
> > Although 'imbalance' is based on the existing imbalance_pct, I find
> > somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute
> > margin to define the interval where we want to consider load_avg? I
> > guess it is case of 'we had to pick some value', which we have done in
> > many other places. Though, IMHO, it is a bit strange that imbalance_pct
> > is used in two different ways to bias comparison in the same function.
>
> I see imbalance_pct like the definition of the acceptable imbalance %
> for a sched_domain. This % is then used against the current load or to
> define an absolute value.
>
> > It used to be only used as a scaling factor (now imbalance_scale), while
> > this patch proposes to use it for computing an absolute margin
> > (imbalance) as well. It is not major issue, but it is not clear why it
> > is used differently to compare two metrics that are relatively closely
> > related.
>
> In fact, scaling factor (imbalance) doesn't work well with small
> value. As an example, the use of a scaling factor fails as soon as
> this_runnable_load = 0 because we always selected local rq even if
> min_runnable_load is only 1 which doesn't really make sense because
> they are just the same.

Agreed, using the operator for scaling is not ideal for low load
scenarios. The spare-capacity checking is fixing that issue for some
scenarios but not all.

Could we mention these points somewhere so people can be reminded later?
In the commit message, if not as a comment in the code?

2016-11-30 14:24:38

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On Wed, Nov 30, 2016 at 02:54:00PM +0100, Vincent Guittot wrote:
> On 30 November 2016 at 14:49, Vincent Guittot
> <[email protected]> wrote:
> > On 30 November 2016 at 13:49, Morten Rasmussen <[email protected]> wrote:
> >> On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote:
> >>> find_idlest_group() only compares the runnable_load_avg when looking for
> >>> the least loaded group. But on fork intensive use case like hackbench
> >
> > [snip]
> >
> >>> + min_avg_load = avg_load;
> >>> + idlest = group;
> >>> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
> >>> + (100*min_avg_load > imbalance_scale*avg_load)) {
> >>> + /*
> >>> + * The runnable loads are close so we take
> >>> + * into account blocked load through avg_load
> >>> + * which is blocked + runnable load
> >>> + */
> >>> + min_avg_load = avg_load;
> >>> idlest = group;
> >>> }
> >>>
> >>> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >>> goto no_spare;
> >>>
> >>> if (this_spare > task_util(p) / 2 &&
> >>> - imbalance*this_spare > 100*most_spare)
> >>> + imbalance_scale*this_spare > 100*most_spare)
> >>> return NULL;
> >>> else if (most_spare > task_util(p) / 2)
> >>> return most_spare_sg;
> >>>
> >>> no_spare:
> >>> - if (!idlest || 100*this_load < imbalance*min_load)
> >>> + if (!idlest ||
> >>> + (min_runnable_load > (this_runnable_load + imbalance)) ||
> >>> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
> >>> + (100*min_avg_load > imbalance_scale*this_avg_load)))
> >>
> >> I don't get why you have imbalance_scale applied to this_avg_load and
> >> not min_avg_load. IIUC, you end up preferring non-local groups?
> >
> > In fact, I have keep the same condition that is used when looping the group.
> > You're right that we should prefer local rq if avg_load are close and
> > test the condition
> > (100*this_avg_load > imbalance_scale*min_avg_load) instead
>
> Of course the correct condition is
> (100*this_avg_load < imbalance_scale*min_avg_load)

Agreed, I should have read the entire thread before replying :-)

2016-12-02 15:21:20

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On 30 November 2016 at 15:24, Morten Rasmussen <[email protected]> wrote:
> On Wed, Nov 30, 2016 at 02:54:00PM +0100, Vincent Guittot wrote:
>> On 30 November 2016 at 14:49, Vincent Guittot
>> <[email protected]> wrote:
>> > On 30 November 2016 at 13:49, Morten Rasmussen <[email protected]> wrote:
>> >> On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote:
>> >>> find_idlest_group() only compares the runnable_load_avg when looking for
>> >>> the least loaded group. But on fork intensive use case like hackbench
>> >
>> > [snip]
>> >
>> >>> + min_avg_load = avg_load;
>> >>> + idlest = group;
>> >>> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
>> >>> + (100*min_avg_load > imbalance_scale*avg_load)) {
>> >>> + /*
>> >>> + * The runnable loads are close so we take
>> >>> + * into account blocked load through avg_load
>> >>> + * which is blocked + runnable load
>> >>> + */
>> >>> + min_avg_load = avg_load;
>> >>> idlest = group;
>> >>> }
>> >>>
>> >>> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> >>> goto no_spare;
>> >>>
>> >>> if (this_spare > task_util(p) / 2 &&
>> >>> - imbalance*this_spare > 100*most_spare)
>> >>> + imbalance_scale*this_spare > 100*most_spare)
>> >>> return NULL;
>> >>> else if (most_spare > task_util(p) / 2)
>> >>> return most_spare_sg;
>> >>>
>> >>> no_spare:
>> >>> - if (!idlest || 100*this_load < imbalance*min_load)
>> >>> + if (!idlest ||
>> >>> + (min_runnable_load > (this_runnable_load + imbalance)) ||
>> >>> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
>> >>> + (100*min_avg_load > imbalance_scale*this_avg_load)))
>> >>
>> >> I don't get why you have imbalance_scale applied to this_avg_load and
>> >> not min_avg_load. IIUC, you end up preferring non-local groups?
>> >
>> > In fact, I have keep the same condition that is used when looping the group.
>> > You're right that we should prefer local rq if avg_load are close and
>> > test the condition
>> > (100*this_avg_load > imbalance_scale*min_avg_load) instead
>>
>> Of course the correct condition is
>> (100*this_avg_load < imbalance_scale*min_avg_load)
>
> Agreed, I should have read the entire thread before replying :-)

Interestingly, the original condition (100*min_avg_load >
imbalance_scale*this_avg_load) gives better performance result for the
hackbench test than the new one : (100*this_avg_load <
imbalance_scale*min_avg_load)

Matt,

Have you been able to get some results for the patchset ?

Vincent

2016-12-02 22:24:35

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On Fri, 02 Dec, at 04:20:54PM, Vincent Guittot wrote:
>
> Matt,
>
> Have you been able to get some results for the patchset ?

Sorry, I messed up the test config and the tests never ran :(

I've redone everything and fast-tracked the results on the SUSE grid
now. Should have results by morning.

2016-12-03 03:31:37

by Brendan Gregg

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On Fri, Nov 25, 2016 at 7:34 AM, Vincent Guittot
<[email protected]> wrote:
>
> find_idlest_group() only compares the runnable_load_avg when looking for
> the least loaded group. But on fork intensive use case like hackbench
> where tasks blocked quickly after the fork, this can lead to selecting the
> same CPU instead of other CPUs, which have similar runnable load but a
> lower load_avg.
>
> When the runnable_load_avg of 2 CPUs are close, we now take into account
> the amount of blocked load as a 2nd selection factor. There is now 3 zones
> for the runnable_load of the rq:
> -[0 .. (runnable_load - imbalance)] : Select the new rq which has
> significantly less runnable_load
> -](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
> runnable load are close so we use load_avg to chose between the 2 rq
> -[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
> has significantly less runnable_load
>

For background, is this from the "A decade of wasted cores" paper's
patches? What's the expected typical gain? Thanks,

Brendan

2016-12-03 21:54:47

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On Fri, 02 Dec, at 07:31:04PM, Brendan Gregg wrote:
>
> For background, is this from the "A decade of wasted cores" paper's
> patches?

No, this patch fixes an issue I originally reported here,

https://lkml.kernel.org/r/[email protected]

Essentially, if you have an idle or partially-idle system and a
workload that consists of fork()'ing a bunch of tasks, where each of
those tasks immediately sleeps waiting for some wakeup, then those
tasks aren't spread across all idle CPUs very well.

We saw this issue when running hackbench with a small loop count, such
that the actual benchmark setup (fork()'ing) is where the majority of
the runtime is spent.

In that scenario, there's a large potential/blocked load, but
essentially no runnable load, and the balance on fork scheduler code
only cares about runnable load without Vincent's patch applied.

The closest thing I can find in the "A decade of wasted cores" paper
is "The Overload-on-Wakeup bug", but I don't think that's the issue
here since,

a) We're balancing on fork, not wakeup
b) The fork on balance code balances across nodes OK

> What's the expected typical gain? Thanks,

The results are still coming back from the SUSE performance test grid
but they do show that this patch is mainly a win for multi-socket
machines with more than 8 cores or thereabouts.

[ Vincent, I'll follow up to your PATCH 1/2 with the results that are
specifically for that patch ]

Assuming a fork-intensive or fork-dominated workload, and a
multi-socket machine, such as this 2 socket, NUMA, with 12 cores and
HT enabled (48 cpus), we saw a very clear win between +10% and +15%
for processes communicating via pipes,

(1) tip-sched = tip/sched/core branch
(2) fix-fig-for-fork = (1) + PATCH 1/2
(3) fix-sig = (1) + (2) + PATCH 2/2

hackbench-process-pipes
4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
tip-sched fix-fig-for-fork fix-sig
Amean 1 0.0717 ( 0.00%) 0.0696 ( 2.99%) 0.0730 ( -1.79%)
Amean 4 0.1244 ( 0.00%) 0.1200 ( 3.56%) 0.1190 ( 4.36%)
Amean 7 0.1891 ( 0.00%) 0.1937 ( -2.42%) 0.1831 ( 3.17%)
Amean 12 0.2964 ( 0.00%) 0.3116 ( -5.11%) 0.2784 ( 6.07%)
Amean 21 0.4011 ( 0.00%) 0.4090 ( -1.96%) 0.3574 ( 10.90%)
Amean 30 0.4944 ( 0.00%) 0.4654 ( 5.87%) 0.4171 ( 15.63%)
Amean 48 0.6113 ( 0.00%) 0.6309 ( -3.20%) 0.5331 ( 12.78%)
Amean 79 0.8616 ( 0.00%) 0.8706 ( -1.04%) 0.7710 ( 10.51%)
Amean 110 1.1304 ( 0.00%) 1.2211 ( -8.02%) 1.0163 ( 10.10%)
Amean 141 1.3754 ( 0.00%) 1.4279 ( -3.81%) 1.2803 ( 6.92%)
Amean 172 1.6217 ( 0.00%) 1.7367 ( -7.09%) 1.5363 ( 5.27%)
Amean 192 1.7809 ( 0.00%) 2.0199 (-13.42%) 1.7129 ( 3.82%)

Things look even better when using threads and pipes, with wins
between 11% and 29% when looking at results outside of the noise,

hackbench-thread-pipes
4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
tip-sched fix-fig-for-fork fix-sig
Amean 1 0.0736 ( 0.00%) 0.0794 ( -7.96%) 0.0779 ( -5.83%)
Amean 4 0.1709 ( 0.00%) 0.1690 ( 1.09%) 0.1663 ( 2.68%)
Amean 7 0.2836 ( 0.00%) 0.3080 ( -8.61%) 0.2640 ( 6.90%)
Amean 12 0.4393 ( 0.00%) 0.4843 (-10.24%) 0.4090 ( 6.89%)
Amean 21 0.5821 ( 0.00%) 0.6369 ( -9.40%) 0.5126 ( 11.95%)
Amean 30 0.6557 ( 0.00%) 0.6459 ( 1.50%) 0.5711 ( 12.90%)
Amean 48 0.7924 ( 0.00%) 0.7760 ( 2.07%) 0.6286 ( 20.68%)
Amean 79 1.0534 ( 0.00%) 1.0551 ( -0.16%) 0.8481 ( 19.49%)
Amean 110 1.5286 ( 0.00%) 1.4504 ( 5.11%) 1.1121 ( 27.24%)
Amean 141 1.9507 ( 0.00%) 1.7790 ( 8.80%) 1.3804 ( 29.23%)
Amean 172 2.2261 ( 0.00%) 2.3330 ( -4.80%) 1.6336 ( 26.62%)
Amean 192 2.3753 ( 0.00%) 2.3307 ( 1.88%) 1.8246 ( 23.19%)

Somewhat surprisingly, I can see improvements for UMA machines with
fewer cores when the workload heavily saturates the machine and the
workload isn't dominated by fork. Such heavy saturation isn't super
realistic, but still interesting. I haven't dug into why these results
occurred, but I am happy things didn't instead fall off a cliff.

Here's a 4-cpu UMA box showing some improvement at the higher end,

hackbench-process-pipes
4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
tip-sched fix-fig-for-fork fix-sig
Amean 1 3.5060 ( 0.00%) 3.5747 ( -1.96%) 3.5117 ( -0.16%)
Amean 3 7.7113 ( 0.00%) 7.8160 ( -1.36%) 7.7747 ( -0.82%)
Amean 5 11.4453 ( 0.00%) 11.5710 ( -1.10%) 11.3870 ( 0.51%)
Amean 7 15.3147 ( 0.00%) 15.9420 ( -4.10%) 15.8450 ( -3.46%)
Amean 12 25.5110 ( 0.00%) 24.3410 ( 4.59%) 22.6717 ( 11.13%)
Amean 16 32.3010 ( 0.00%) 28.5897 ( 11.49%) 25.7473 ( 20.29%)

2016-12-03 23:25:54

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Fri, 25 Nov, at 04:34:32PM, Vincent Guittot wrote:
> During fork, the utilization of a task is init once the rq has been
> selected because the current utilization level of the rq is used to set
> the utilization of the fork task. As the task's utilization is still
> null at this step of the fork sequence, it doesn't make sense to look for
> some spare capacity that can fit the task's utilization.
> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
> because the least loaded policy is always bypassed and tasks are not
> spread during fork.
>
> With this patch and the fix below, we are back to same performances as
> for v4.8. The fix below is only a temporary one used for the test until a
> smarter solution is found because we can't simply remove the test which is
> useful for others benchmarks
>
> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> avg_cost = this_sd->avg_scan_cost;
>
> - /*
> - * Due to large variance we need a large fuzz factor; hackbench in
> - * particularly is sensitive here.
> - */
> - if ((avg_idle / 512) < avg_cost)
> - return -1;
> -
> time = local_clock();
>
> for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
>

OK, I need to point out that I didn't apply the above hunk when
testing this patch series. But I wouldn't have expected that to impact
our fork-intensive workloads so much. Let me know if you'd like me to
re-run with it applied.

I don't see much of a difference, positive or negative, for the
majority of the test machines, it's mainly a wash.

However, the following 4-cpu Xeon E5504 machine does show a nice win,
with thread counts in the mid-range (note, the second column is number
of hackbench groups, where each group has 40 tasks),

hackbench-process-pipes
4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
tip-sched fix-fig-for-fork fix-sig
Amean 1 0.2193 ( 0.00%) 0.2014 ( 8.14%) 0.1746 ( 20.39%)
Amean 3 0.4489 ( 0.00%) 0.3544 ( 21.04%) 0.3284 ( 26.83%)
Amean 5 0.6173 ( 0.00%) 0.4690 ( 24.02%) 0.4977 ( 19.37%)
Amean 7 0.7323 ( 0.00%) 0.6367 ( 13.05%) 0.6267 ( 14.42%)
Amean 12 0.9716 ( 0.00%) 1.0187 ( -4.85%) 0.9351 ( 3.75%)
Amean 16 1.2866 ( 0.00%) 1.2664 ( 1.57%) 1.2131 ( 5.71%)

2016-12-05 08:50:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On Tue, Nov 29, 2016 at 02:46:10PM +0000, Morten Rasmussen wrote:
> On Tue, Nov 29, 2016 at 12:42:43PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 29, 2016 at 10:57:59AM +0000, Morten Rasmussen wrote:
> > > > @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > >
> > > > avg_cost = this_sd->avg_scan_cost;
> > > >
> > > > - /*
> > > > - * Due to large variance we need a large fuzz factor; hackbench in
> > > > - * particularly is sensitive here.
> > > > - */
> > > > - if ((avg_idle / 512) < avg_cost)
> > > > - return -1;
> > > > -

> The last bit that isn't clear to me is whether /512 is a 'large' fuzz
> factor. I guess it is, as we can have many wake-ups, i.e. many times
> avg_cost, over the period where avg_idle is calculated. No?


So the idea was to not spend more time looking for work than we're
actually going to be idle for, since then we're wasting time we could've
done work.

So avg_idle and avg_cost are equal measure in that the immediate
inequality would be: if (avg_idle < avg_cost) stop().

Of course, both being averages with unknown distribution makes that a
tricky proposition, and that also makes the 512 hard to quantify. Still
a factor of 512 of the total average, where our variable cannot go
negative (negative time intervals are nonsensical), is fairly large.


2016-12-05 09:27:52

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

Le Saturday 03 Dec 2016 ? 21:47:07 (+0000), Matt Fleming a ?crit :
> On Fri, 02 Dec, at 07:31:04PM, Brendan Gregg wrote:
> >
> > For background, is this from the "A decade of wasted cores" paper's
> > patches?
>
> No, this patch fixes an issue I originally reported here,
>
> https://lkml.kernel.org/r/[email protected]
>
> Essentially, if you have an idle or partially-idle system and a
> workload that consists of fork()'ing a bunch of tasks, where each of
> those tasks immediately sleeps waiting for some wakeup, then those
> tasks aren't spread across all idle CPUs very well.
>
> We saw this issue when running hackbench with a small loop count, such
> that the actual benchmark setup (fork()'ing) is where the majority of
> the runtime is spent.
>
> In that scenario, there's a large potential/blocked load, but
> essentially no runnable load, and the balance on fork scheduler code
> only cares about runnable load without Vincent's patch applied.
>
> The closest thing I can find in the "A decade of wasted cores" paper
> is "The Overload-on-Wakeup bug", but I don't think that's the issue
> here since,
>
> a) We're balancing on fork, not wakeup
> b) The fork on balance code balances across nodes OK
>
> > What's the expected typical gain? Thanks,
>
> The results are still coming back from the SUSE performance test grid
> but they do show that this patch is mainly a win for multi-socket
> machines with more than 8 cores or thereabouts.
>
> [ Vincent, I'll follow up to your PATCH 1/2 with the results that are
> specifically for that patch ]
>
> Assuming a fork-intensive or fork-dominated workload, and a
> multi-socket machine, such as this 2 socket, NUMA, with 12 cores and
> HT enabled (48 cpus), we saw a very clear win between +10% and +15%
> for processes communicating via pipes,
>
> (1) tip-sched = tip/sched/core branch
> (2) fix-fig-for-fork = (1) + PATCH 1/2
> (3) fix-sig = (1) + (2) + PATCH 2/2
>
> hackbench-process-pipes
> 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
> tip-sched fix-fig-for-fork fix-sig
> Amean 1 0.0717 ( 0.00%) 0.0696 ( 2.99%) 0.0730 ( -1.79%)
> Amean 4 0.1244 ( 0.00%) 0.1200 ( 3.56%) 0.1190 ( 4.36%)
> Amean 7 0.1891 ( 0.00%) 0.1937 ( -2.42%) 0.1831 ( 3.17%)
> Amean 12 0.2964 ( 0.00%) 0.3116 ( -5.11%) 0.2784 ( 6.07%)
> Amean 21 0.4011 ( 0.00%) 0.4090 ( -1.96%) 0.3574 ( 10.90%)
> Amean 30 0.4944 ( 0.00%) 0.4654 ( 5.87%) 0.4171 ( 15.63%)
> Amean 48 0.6113 ( 0.00%) 0.6309 ( -3.20%) 0.5331 ( 12.78%)
> Amean 79 0.8616 ( 0.00%) 0.8706 ( -1.04%) 0.7710 ( 10.51%)
> Amean 110 1.1304 ( 0.00%) 1.2211 ( -8.02%) 1.0163 ( 10.10%)
> Amean 141 1.3754 ( 0.00%) 1.4279 ( -3.81%) 1.2803 ( 6.92%)
> Amean 172 1.6217 ( 0.00%) 1.7367 ( -7.09%) 1.5363 ( 5.27%)
> Amean 192 1.7809 ( 0.00%) 2.0199 (-13.42%) 1.7129 ( 3.82%)
>
> Things look even better when using threads and pipes, with wins
> between 11% and 29% when looking at results outside of the noise,
>
> hackbench-thread-pipes
> 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
> tip-sched fix-fig-for-fork fix-sig
> Amean 1 0.0736 ( 0.00%) 0.0794 ( -7.96%) 0.0779 ( -5.83%)
> Amean 4 0.1709 ( 0.00%) 0.1690 ( 1.09%) 0.1663 ( 2.68%)
> Amean 7 0.2836 ( 0.00%) 0.3080 ( -8.61%) 0.2640 ( 6.90%)
> Amean 12 0.4393 ( 0.00%) 0.4843 (-10.24%) 0.4090 ( 6.89%)
> Amean 21 0.5821 ( 0.00%) 0.6369 ( -9.40%) 0.5126 ( 11.95%)
> Amean 30 0.6557 ( 0.00%) 0.6459 ( 1.50%) 0.5711 ( 12.90%)
> Amean 48 0.7924 ( 0.00%) 0.7760 ( 2.07%) 0.6286 ( 20.68%)
> Amean 79 1.0534 ( 0.00%) 1.0551 ( -0.16%) 0.8481 ( 19.49%)
> Amean 110 1.5286 ( 0.00%) 1.4504 ( 5.11%) 1.1121 ( 27.24%)
> Amean 141 1.9507 ( 0.00%) 1.7790 ( 8.80%) 1.3804 ( 29.23%)
> Amean 172 2.2261 ( 0.00%) 2.3330 ( -4.80%) 1.6336 ( 26.62%)
> Amean 192 2.3753 ( 0.00%) 2.3307 ( 1.88%) 1.8246 ( 23.19%)
>
> Somewhat surprisingly, I can see improvements for UMA machines with
> fewer cores when the workload heavily saturates the machine and the
> workload isn't dominated by fork. Such heavy saturation isn't super
> realistic, but still interesting. I haven't dug into why these results
> occurred, but I am happy things didn't instead fall off a cliff.
>
> Here's a 4-cpu UMA box showing some improvement at the higher end,
>
> hackbench-process-pipes
> 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
> tip-sched fix-fig-for-fork fix-sig
> Amean 1 3.5060 ( 0.00%) 3.5747 ( -1.96%) 3.5117 ( -0.16%)
> Amean 3 7.7113 ( 0.00%) 7.8160 ( -1.36%) 7.7747 ( -0.82%)
> Amean 5 11.4453 ( 0.00%) 11.5710 ( -1.10%) 11.3870 ( 0.51%)
> Amean 7 15.3147 ( 0.00%) 15.9420 ( -4.10%) 15.8450 ( -3.46%)
> Amean 12 25.5110 ( 0.00%) 24.3410 ( 4.59%) 22.6717 ( 11.13%)
> Amean 16 32.3010 ( 0.00%) 28.5897 ( 11.49%) 25.7473 ( 20.29%)

Hi Matt,

Thanks for the results.

During the review, it has been pointed out by Morten that the test condition
(100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than
(100*min_avg_load > imbalance_scale*this_avg_load). But i see lower
performances with this change. Coud you run tests with the change below on
top of the patchset ?

---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8d1ae7..0129fbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5514,7 +5514,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
if (!idlest ||
(min_runnable_load > (this_runnable_load + imbalance)) ||
((this_runnable_load < (min_runnable_load + imbalance)) &&
- (100*min_avg_load > imbalance_scale*this_avg_load)))
+ (100*this_avg_load < imbalance_scale*min_avg_load)))
return NULL;
return idlest;
}
--
2.7.4



2016-12-05 09:18:57

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/2 v2] sched: fix find_idlest_group for fork

On 4 December 2016 at 00:25, Matt Fleming <[email protected]> wrote:
> On Fri, 25 Nov, at 04:34:32PM, Vincent Guittot wrote:
>> During fork, the utilization of a task is init once the rq has been
>> selected because the current utilization level of the rq is used to set
>> the utilization of the fork task. As the task's utilization is still
>> null at this step of the fork sequence, it doesn't make sense to look for
>> some spare capacity that can fit the task's utilization.
>> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
>> because the least loaded policy is always bypassed and tasks are not
>> spread during fork.
>>
>> With this patch and the fix below, we are back to same performances as
>> for v4.8. The fix below is only a temporary one used for the test until a
>> smarter solution is found because we can't simply remove the test which is
>> useful for others benchmarks
>>
>> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>> avg_cost = this_sd->avg_scan_cost;
>>
>> - /*
>> - * Due to large variance we need a large fuzz factor; hackbench in
>> - * particularly is sensitive here.
>> - */
>> - if ((avg_idle / 512) < avg_cost)
>> - return -1;
>> -
>> time = local_clock();
>>
>> for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
>>
>
> OK, I need to point out that I didn't apply the above hunk when
> testing this patch series. But I wouldn't have expected that to impact
> our fork-intensive workloads so much. Let me know if you'd like me to
> re-run with it applied.

At least on my target ( hikey board : dual quad cortex-A53 platform),
i can see additional perf improvements for the fork intensive test
"hackbench -P -g 1"
The patch above was there to explain any difference in perf results
with v4.8 but you don't need to re-run with it

>
> I don't see much of a difference, positive or negative, for the
> majority of the test machines, it's mainly a wash.
>
> However, the following 4-cpu Xeon E5504 machine does show a nice win,
> with thread counts in the mid-range (note, the second column is number
> of hackbench groups, where each group has 40 tasks),
>
> hackbench-process-pipes
> 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
> tip-sched fix-fig-for-fork fix-sig
> Amean 1 0.2193 ( 0.00%) 0.2014 ( 8.14%) 0.1746 ( 20.39%)
> Amean 3 0.4489 ( 0.00%) 0.3544 ( 21.04%) 0.3284 ( 26.83%)
> Amean 5 0.6173 ( 0.00%) 0.4690 ( 24.02%) 0.4977 ( 19.37%)
> Amean 7 0.7323 ( 0.00%) 0.6367 ( 13.05%) 0.6267 ( 14.42%)
> Amean 12 0.9716 ( 0.00%) 1.0187 ( -4.85%) 0.9351 ( 3.75%)
> Amean 16 1.2866 ( 0.00%) 1.2664 ( 1.57%) 1.2131 ( 5.71%)

2016-12-05 13:35:59

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On Mon, 05 Dec, at 10:27:36AM, Vincent Guittot wrote:
>
> Hi Matt,
>
> Thanks for the results.
>
> During the review, it has been pointed out by Morten that the test condition
> (100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than
> (100*min_avg_load > imbalance_scale*this_avg_load). But i see lower
> performances with this change. Coud you run tests with the change below on
> top of the patchset ?
>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e8d1ae7..0129fbb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5514,7 +5514,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> if (!idlest ||
> (min_runnable_load > (this_runnable_load + imbalance)) ||
> ((this_runnable_load < (min_runnable_load + imbalance)) &&
> - (100*min_avg_load > imbalance_scale*this_avg_load)))
> + (100*this_avg_load < imbalance_scale*min_avg_load)))
> return NULL;
> return idlest;
> }

Queued for testing.

2016-12-08 14:09:50

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On Mon, 05 Dec, at 01:35:46PM, Matt Fleming wrote:
> On Mon, 05 Dec, at 10:27:36AM, Vincent Guittot wrote:
> >
> > Hi Matt,
> >
> > Thanks for the results.
> >
> > During the review, it has been pointed out by Morten that the test condition
> > (100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than
> > (100*min_avg_load > imbalance_scale*this_avg_load). But i see lower
> > performances with this change. Coud you run tests with the change below on
> > top of the patchset ?
> >
> > ---
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e8d1ae7..0129fbb 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5514,7 +5514,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > if (!idlest ||
> > (min_runnable_load > (this_runnable_load + imbalance)) ||
> > ((this_runnable_load < (min_runnable_load + imbalance)) &&
> > - (100*min_avg_load > imbalance_scale*this_avg_load)))
> > + (100*this_avg_load < imbalance_scale*min_avg_load)))
> > return NULL;
> > return idlest;
> > }
>
> Queued for testing.

Most of the machines didn't notice the difference with this new patch.
However, I did come across one test that showed a negative change,


hackbench-thread-pipes
4.9.0-rc6 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
tip-sched fix-fig-for-fork fix-sig fix-fig-for-fork-v2
Amean 1 0.1266 ( 0.00%) 0.1269 ( -0.23%) 0.1287 ( -1.69%) 0.1357 ( -7.22%)
Amean 4 0.4989 ( 0.00%) 0.5174 ( -3.72%) 0.5251 ( -5.27%) 0.5117 ( -2.58%)
Amean 7 0.8510 ( 0.00%) 0.8517 ( -0.08%) 0.8964 ( -5.34%) 0.8801 ( -3.42%)
Amean 12 1.0699 ( 0.00%) 1.0484 ( 2.00%) 1.0147 ( 5.15%) 1.0759 ( -0.56%)
Amean 21 1.2816 ( 0.00%) 1.2140 ( 5.27%) 1.1879 ( 7.31%) 1.2414 ( 3.13%)
Amean 30 1.4440 ( 0.00%) 1.4003 ( 3.03%) 1.3969 ( 3.26%) 1.4057 ( 2.65%)
Amean 48 1.5773 ( 0.00%) 1.5983 ( -1.33%) 1.3984 ( 11.34%) 1.5624 ( 0.94%)
Amean 79 2.2343 ( 0.00%) 2.3066 ( -3.24%) 2.0053 ( 10.25%) 2.2630 ( -1.29%)
Amean 96 2.6736 ( 0.00%) 2.4313 ( 9.06%) 2.4181 ( 9.55%) 2.4717 ( 7.55%)

4.9.0-rc6 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
tip-schedfix-fig-for-fork fix-sigfix-fig-for-fork-v2
User 129.53 128.64 127.70 131.00
System 1784.54 1744.21 1654.08 1744.00
Elapsed 92.07 90.44 86.95 91.00

Looking at the 48 and 79 groups rows for mean there's a noticeable
drop off in performance of ~10%, which should be outside of the noise
for this test. This is a 2 socket, 4 NUMA node (yes, really), 24 cpus
AMD opteron circa 2010.

Given the age of this machine, I don't think it's worth holding up the
patch.

2016-12-08 14:33:52

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

On 8 December 2016 at 15:09, Matt Fleming <[email protected]> wrote:
> On Mon, 05 Dec, at 01:35:46PM, Matt Fleming wrote:
>> On Mon, 05 Dec, at 10:27:36AM, Vincent Guittot wrote:
>> >
>> > Hi Matt,
>> >
>> > Thanks for the results.
>> >
>> > During the review, it has been pointed out by Morten that the test condition
>> > (100*this_avg_load < imbalance_scale*min_avg_load) makes more sense than
>> > (100*min_avg_load > imbalance_scale*this_avg_load). But i see lower
>> > performances with this change. Coud you run tests with the change below on
>> > top of the patchset ?
>> >
>> > ---
>> > kernel/sched/fair.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index e8d1ae7..0129fbb 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -5514,7 +5514,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> > if (!idlest ||
>> > (min_runnable_load > (this_runnable_load + imbalance)) ||
>> > ((this_runnable_load < (min_runnable_load + imbalance)) &&
>> > - (100*min_avg_load > imbalance_scale*this_avg_load)))
>> > + (100*this_avg_load < imbalance_scale*min_avg_load)))
>> > return NULL;
>> > return idlest;
>> > }
>>
>> Queued for testing.
>
> Most of the machines didn't notice the difference with this new patch.
> However, I did come across one test that showed a negative change,

OK so you don't see perf impact between the 2 test condition except
for the machine below
So IMHO, we should use the new test condition which tries to keep task
local if there is no other CPU that is obviously less loaded.

I'm going to prepare a new version of the patchset and apply all comments

Thanks
Vincent

>
>
> hackbench-thread-pipes
> 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
> tip-sched fix-fig-for-fork fix-sig fix-fig-for-fork-v2
> Amean 1 0.1266 ( 0.00%) 0.1269 ( -0.23%) 0.1287 ( -1.69%) 0.1357 ( -7.22%)
> Amean 4 0.4989 ( 0.00%) 0.5174 ( -3.72%) 0.5251 ( -5.27%) 0.5117 ( -2.58%)
> Amean 7 0.8510 ( 0.00%) 0.8517 ( -0.08%) 0.8964 ( -5.34%) 0.8801 ( -3.42%)
> Amean 12 1.0699 ( 0.00%) 1.0484 ( 2.00%) 1.0147 ( 5.15%) 1.0759 ( -0.56%)
> Amean 21 1.2816 ( 0.00%) 1.2140 ( 5.27%) 1.1879 ( 7.31%) 1.2414 ( 3.13%)
> Amean 30 1.4440 ( 0.00%) 1.4003 ( 3.03%) 1.3969 ( 3.26%) 1.4057 ( 2.65%)
> Amean 48 1.5773 ( 0.00%) 1.5983 ( -1.33%) 1.3984 ( 11.34%) 1.5624 ( 0.94%)
> Amean 79 2.2343 ( 0.00%) 2.3066 ( -3.24%) 2.0053 ( 10.25%) 2.2630 ( -1.29%)
> Amean 96 2.6736 ( 0.00%) 2.4313 ( 9.06%) 2.4181 ( 9.55%) 2.4717 ( 7.55%)
>
> 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6 4.9.0-rc6
> tip-schedfix-fig-for-fork fix-sigfix-fig-for-fork-v2
> User 129.53 128.64 127.70 131.00
> System 1784.54 1744.21 1654.08 1744.00
> Elapsed 92.07 90.44 86.95 91.00
>
> Looking at the 48 and 79 groups rows for mean there's a noticeable
> drop off in performance of ~10%, which should be outside of the noise
> for this test. This is a 2 socket, 4 NUMA node (yes, really), 24 cpus
> AMD opteron circa 2010.
>
> Given the age of this machine, I don't think it's worth holding up the
> patch.