2017-08-21 15:23:35

by Brendan Jackman

Subject: [PATCH 0/2] sched/fair: Tweaks for select_task_rq_fair slowpath

This patchset optimises away an unnecessary comparison and fixes some corner
cases in the find_idlest_group path of select_task_rq_fair.

Brendan Jackman (2):
sched/fair: Remove unnecessary comparison with -1
sched/fair: Fix use of NULL with find_idlest_group

kernel/sched/fair.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)

--
2.14.1


2017-08-21 15:23:43

by Brendan Jackman

Subject: [PATCH 1/2] sched/fair: Remove unnecessary comparison with -1

Since 83a0a96a5f26 (sched/fair: Leverage the idle state info when
choosing the "idlest" cpu) find_idlest_cpu no longer returns -1.

Signed-off-by: Brendan Jackman <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e216f6..64618d768546 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5934,7 +5934,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}

new_cpu = find_idlest_cpu(group, p, cpu);
- if (new_cpu == -1 || new_cpu == cpu) {
+ if (new_cpu == cpu) {
/* Now try balancing at a lower domain level of cpu */
sd = sd->child;
continue;
--
2.14.1

2017-08-21 15:23:52

by Brendan Jackman

Subject: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

The current use of returning NULL from find_idlest_group is broken in
two cases:

a1) The local group is not allowed.

In this case, we currently do not change this_runnable_load or
this_avg_load from its initial value of 0, which means we return
NULL regardless of the load of the other, allowed groups. This
results in pointlessly continuing the find_idlest_group search
within the local group and then returning prev_cpu from
select_task_rq_fair.

a2) No CPUs in the sched_domain are allowed.

In this case we also return NULL and again pointlessly continue
the search.

b) smp_processor_id() is the "idlest" and != prev_cpu.

find_idlest_group also returns NULL when the local group is
allowed and is the idlest. The caller then continues the
find_idlest_group search at a lower level of the current CPU's
sched_domain hierarchy. However new_cpu is not updated. This means
the search is pointless and we return prev_cpu from
select_task_rq_fair.

This is fixed by:

1. Returning NULL from find_idlest_group only when _no_ groups were
allowed in the current sched_domain. In this case, we now break
from the while(sd) loop and immediately return prev_cpu. This
fixes case a2).

2. Initializing this_runnable_load and this_avg_load to ULONG_MAX
instead of 0. This means in case a1) we now return the idlest
non-local group.

3. Explicitly updating new_cpu when find_idlest_group returns the
local group, fixing case b).

This patch also re-words the check for whether the group in
consideration is local, under the assumption that the first group in
the sched domain is always the local one.

Signed-off-by: Brendan Jackman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 34 +++++++++++++++++++---------------
1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 64618d768546..7cb5ed719cf9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5382,26 +5382,29 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
* domain.
*/
static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
{
struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *local_group = sd->groups;
struct sched_group *most_spare_sg = NULL;
- unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
- unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
+ unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = ULONG_MAX;
+ unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
unsigned long most_spare = 0, this_spare = 0;
int load_idx = sd->forkexec_idx;
int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
(sd->imbalance_pct-100) / 100;

+ if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
+ return NULL;
+
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;

do {
unsigned long load, avg_load, runnable_load;
unsigned long spare_cap, max_spare_cap;
- int local_group;
+ bool group_is_local = group == local_group;
int i;

/* Skip over this group if it has no CPUs allowed */
@@ -5409,9 +5412,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
&p->cpus_allowed))
continue;

- local_group = cpumask_test_cpu(this_cpu,
- sched_group_span(group));
-
/*
* Tally up the load of all CPUs in the group and find
* the group containing the CPU with most spare capacity.
@@ -5422,7 +5422,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

for_each_cpu(i, sched_group_span(group)) {
/* Bias balancing toward cpus of our domain */
- if (local_group)
+ if (group_is_local)
load = source_load(i, load_idx);
else
load = target_load(i, load_idx);
@@ -5443,7 +5443,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
group->sgc->capacity;

- if (local_group) {
+ if (group_is_local) {
this_runnable_load = runnable_load;
this_avg_load = avg_load;
this_spare = max_spare_cap;
@@ -5489,21 +5489,21 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

if (this_spare > task_util(p) / 2 &&
imbalance_scale*this_spare > 100*most_spare)
- return NULL;
+ return local_group;

if (most_spare > task_util(p) / 2)
return most_spare_sg;

skip_spare:
if (!idlest)
- return NULL;
+ return local_group;

if (min_runnable_load > (this_runnable_load + imbalance))
- return NULL;
+ return local_group;

if ((this_runnable_load < (min_runnable_load + imbalance)) &&
(100*this_avg_load < imbalance_scale*min_avg_load))
- return NULL;
+ return local_group;

return idlest;
}
@@ -5927,8 +5927,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
continue;
}

- group = find_idlest_group(sd, p, cpu, sd_flag);
+ group = find_idlest_group(sd, p, sd_flag);
if (!group) {
+ break;
+ } else if (group == sd->groups) {
+ new_cpu = cpu;
+ /* Now try balancing at a lower domain level of cpu */
sd = sd->child;
continue;
}
--
2.14.1

2017-08-21 17:26:27

by Josef Bacik

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
> The current use of returning NULL from find_idlest_group is broken in
> two cases:
>
> a1) The local group is not allowed.
>
> In this case, we currently do not change this_runnable_load or
> this_avg_load from its initial value of 0, which means we return
> NULL regardless of the load of the other, allowed groups. This
> results in pointlessly continuing the find_idlest_group search
> within the local group and then returning prev_cpu from
> select_task_rq_fair.
>
> a2) No CPUs in the sched_domain are allowed.
>
> In this case we also return NULL and again pointlessly continue
> the search.
>
> b) smp_processor_id() is the "idlest" and != prev_cpu.
>
> find_idlest_group also returns NULL when the local group is
> allowed and is the idlest. The caller then continues the
> find_idlest_group search at a lower level of the current CPU's
> sched_domain hierarchy. However new_cpu is not updated. This means
> the search is pointless and we return prev_cpu from
> select_task_rq_fair.
>
> This is fixed by:
>
> 1. Returning NULL from find_idlest_group only when _no_ groups were
> allowed in the current sched_domain. In this case, we now break
> from the while(sd) loop and immediately return prev_cpu. This
> fixes case a2).
>
> 2. Initializing this_runnable_load and this_avg_load to ULONG_MAX
> instead of 0. This means in case a1) we now return the idlest
> non-local group.
>
> 3. Explicitly updating new_cpu when find_idlest_group returns the
> local group, fixing case b).
>
> This patch also re-words the check for whether the group in
> consideration is local, under the assumption that the first group in
> the sched domain is always the local one.
>
> Signed-off-by: Brendan Jackman <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 34 +++++++++++++++++++---------------
> 1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 64618d768546..7cb5ed719cf9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5382,26 +5382,29 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> * domain.
> */
> static struct sched_group *
> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> - int this_cpu, int sd_flag)
> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
> {
> struct sched_group *idlest = NULL, *group = sd->groups;
> + struct sched_group *local_group = sd->groups;
> struct sched_group *most_spare_sg = NULL;
> - unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
> - unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
> + unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = ULONG_MAX;
> + unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
> unsigned long most_spare = 0, this_spare = 0;
> int load_idx = sd->forkexec_idx;
> int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
> unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
> (sd->imbalance_pct-100) / 100;
>
> + if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
> + return NULL;
> +
> if (sd_flag & SD_BALANCE_WAKE)
> load_idx = sd->wake_idx;
>
> do {
> unsigned long load, avg_load, runnable_load;
> unsigned long spare_cap, max_spare_cap;
> - int local_group;
> + bool group_is_local = group == local_group;
> int i;
>
> /* Skip over this group if it has no CPUs allowed */
> @@ -5409,9 +5412,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> &p->cpus_allowed))
> continue;
>
> - local_group = cpumask_test_cpu(this_cpu,
> - sched_group_span(group));
> -

This isn't right, is it? cpu isn't necessarily in the very first group of a
sd, right? Thanks,

Josef

2017-08-21 17:30:17

by Josef Bacik

Subject: Re: [PATCH 1/2] sched/fair: Remove unnecessary comparison with -1

On Mon, Aug 21, 2017 at 04:21:27PM +0100, Brendan Jackman wrote:
> Since 83a0a96a5f26 (sched/fair: Leverage the idle state info when
> choosing the "idlest" cpu) find_idlest_cpu no longer returns -1.
>
> Signed-off-by: Brendan Jackman <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Josef Bacik <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: Peter Zijlstra <[email protected]>

Reviewed-by: Josef Bacik <[email protected]>

Thanks,

Josef

2017-08-21 17:59:35

by Brendan Jackman

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

Hi Josef,

Thanks for taking a look.

On Mon, Aug 21 2017 at 17:26, Josef Bacik wrote:
> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
[...]
>> - local_group = cpumask_test_cpu(this_cpu,
>> - sched_group_span(group));
>> -
>
> This isn't right is it? cpu isn't necessarily in the very first group of a sd
> right?

I think it is - I haven't grokked the sd/sg setup code in
kernel/sched/topology.c but there is a comment[1] that I interpret as
saying this. I'll take a more careful look tomorrow.

[1] http://elixir.free-electrons.com/linux/v4.13-rc6/source/kernel/sched/topology.c#L786

If I'm wrong, this can be rewritten not to use that assumption - I did
it this way in the caller ("else if (group == sd->groups)") because I
didn't want to use cpumask_test_cpu, and then changed it inside
find_idlest_group so there weren't two ways of doing the same thing in
the same neighbourhood of code.
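
For reference, an untested sketch of that alternative in the caller (keeping
cpumask_test_cpu and dropping the first-group assumption) would be roughly:

        group = find_idlest_group(sd, p, sd_flag);
        if (!group) {
                break;
        } else if (cpumask_test_cpu(cpu, sched_group_span(group))) {
                /* untested alternative to the "group == sd->groups" check */
                /* Local group is idlest: note cpu, then go down a level */
                new_cpu = cpu;
                sd = sd->child;
                continue;
        }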

Cheers,
Brendan

2017-08-21 20:22:14

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

On Mon, Aug 21, 2017 at 06:59:28PM +0100, Brendan Jackman wrote:
> Hi Josef,
>
> Thanks for taking a look.
>
> On Mon, Aug 21 2017 at 17:26, Josef Bacik wrote:
> > On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
> [...]
> >> - local_group = cpumask_test_cpu(this_cpu,
> >> - sched_group_span(group));
> >> -
> >
> > This isn't right is it? cpu isn't necessarily in the very first group of a sd
> > right?
>
> I think it is - I haven't grokked the sd/sg setup code in
> kernel/sched/topology.c but there is a comment[1] that I interpret as
> saying this. I'll take a more careful look tomorrow.
>
> [1] http://elixir.free-electrons.com/linux/v4.13-rc6/source/kernel/sched/topology.c#L786
>
> If I'm wrong, this can be rewritten not to use that assumption - I did
> it this way in the caller ("else if (group == sd->groups)") because I
> didn't want to use cpumask_test_cpu, and then changed it inside
> find_idlest_group so there weren't two ways of doing the same thing in
> the same neighbourhood of code.

No, you are quite correct. The sched_domain of a CPU always includes that
CPU, and the first group of a sched_domain is its child domain and
therefore also includes that CPU.
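
Something like the below (untested, helper name made up) would complain
loudly if that invariant were ever violated:

        /* Untested sketch: assert the first group of each domain spans @cpu */
        static void check_first_group_spans_cpu(int cpu)
        {
                struct sched_domain *sd;

                rcu_read_lock();
                for_each_domain(cpu, sd)
                        WARN_ON_ONCE(!cpumask_test_cpu(cpu,
                                                sched_group_span(sd->groups)));
                rcu_read_unlock();
        }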

2017-08-21 21:14:12

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
> The current use of returning NULL from find_idlest_group is broken in
> two cases:
>
> a1) The local group is not allowed.
>
> In this case, we currently do not change this_runnable_load or
> this_avg_load from its initial value of 0, which means we return
> NULL regardless of the load of the other, allowed groups. This
> results in pointlessly continuing the find_idlest_group search
> within the local group and then returning prev_cpu from
> select_task_rq_fair.

> b) smp_processor_id() is the "idlest" and != prev_cpu.
>
> find_idlest_group also returns NULL when the local group is
> allowed and is the idlest. The caller then continues the
> find_idlest_group search at a lower level of the current CPU's
> sched_domain hierarchy. However new_cpu is not updated. This means
> the search is pointless and we return prev_cpu from
> select_task_rq_fair.
>

I think it's much simpler than that.. but it's late, so who knows ;-)

Both cases seem predicated on the assumption that we'll return @cpu when
we don't find any idler CPU. Consider, if the local group is the idlest,
we should stick with @cpu and simply proceed with the child domain.

The confusion, and the bugs, seem to have snuck in when we started
considering @prev_cpu, whenever that was. The below is mostly code
movement to put that whole while(sd) loop into its own function.

The effective change is setting @new_cpu = @cpu when we start that loop:

@@ -6023,6 +6023,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
struct sched_group *group;
int weight;

+ new_cpu = cpu;
+
if (!(sd->flags & sd_flag)) {
sd = sd->child;
continue;




---
kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
1 file changed, 48 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c77e4b1d51c0..3e77265c480a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
}

/*
- * find_idlest_cpu - find the idlest cpu among the cpus in group.
+ * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
*/
static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
unsigned long load, min_load = ULONG_MAX;
unsigned int min_exit_latency = UINT_MAX;
@@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
}

+static int
+find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
+{
+ struct sched_domain *tmp;
+ int new_cpu = cpu;
+
+ while (sd) {
+ struct sched_group *group;
+ int weight;
+
+ if (!(sd->flags & sd_flag)) {
+ sd = sd->child;
+ continue;
+ }
+
+ group = find_idlest_group(sd, p, cpu, sd_flag);
+ if (!group) {
+ sd = sd->child;
+ continue;
+ }
+
+ new_cpu = find_idlest_group_cpu(group, p, cpu);
+ if (new_cpu == -1 || new_cpu == cpu) {
+ /* Now try balancing at a lower domain level of cpu */
+ sd = sd->child;
+ continue;
+ }
+
+ /* Now try balancing at a lower domain level of new_cpu */
+ cpu = new_cpu;
+ weight = sd->span_weight;
+ sd = NULL;
+ for_each_domain(cpu, tmp) {
+ if (weight <= tmp->span_weight)
+ break;
+ if (tmp->flags & sd_flag)
+ sd = tmp;
+ }
+ /* while loop will break here if sd == NULL */
+ }
+
+ return new_cpu;
+}
+
/*
* Implement a for_each_cpu() variant that starts the scan at a given cpu
* (@start), and wraps around.
@@ -6019,39 +6063,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

- } else while (sd) {
- struct sched_group *group;
- int weight;
-
- if (!(sd->flags & sd_flag)) {
- sd = sd->child;
- continue;
- }
-
- group = find_idlest_group(sd, p, cpu, sd_flag);
- if (!group) {
- sd = sd->child;
- continue;
- }
-
- new_cpu = find_idlest_cpu(group, p, cpu);
- if (new_cpu == -1 || new_cpu == cpu) {
- /* Now try balancing at a lower domain level of cpu */
- sd = sd->child;
- continue;
- }
-
- /* Now try balancing at a lower domain level of new_cpu */
- cpu = new_cpu;
- weight = sd->span_weight;
- sd = NULL;
- for_each_domain(cpu, tmp) {
- if (weight <= tmp->span_weight)
- break;
- if (tmp->flags & sd_flag)
- sd = tmp;
- }
- /* while loop will break here if sd == NULL */
+ } else {
+ new_cpu = find_idlest_cpu(sd, p, cpu, sd_flag);
}
rcu_read_unlock();




2017-08-21 21:26:17

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

On Mon, Aug 21, 2017 at 11:14:00PM +0200, Peter Zijlstra wrote:
> +static int
> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
> +{
> + struct sched_domain *tmp;
> + int new_cpu = cpu;
> +
> + while (sd) {
> + struct sched_group *group;
> + int weight;
> +
> + if (!(sd->flags & sd_flag)) {
> + sd = sd->child;
> + continue;
> + }
> +
> + group = find_idlest_group(sd, p, cpu, sd_flag);
> + if (!group) {
> + sd = sd->child;
> + continue;
> + }
> +
> + new_cpu = find_idlest_group_cpu(group, p, cpu);
> + if (new_cpu == -1 || new_cpu == cpu) {
> + /* Now try balancing at a lower domain level of cpu */
> + sd = sd->child;
> + continue;
> + }
> +
> + /* Now try balancing at a lower domain level of new_cpu */
> + cpu = new_cpu;
> + weight = sd->span_weight;
> + sd = NULL;
> + for_each_domain(cpu, tmp) {
> + if (weight <= tmp->span_weight)
> + break;
> + if (tmp->flags & sd_flag)
> + sd = tmp;
> + }

This find-the-sd-for-another-cpu thing is horrific. And it has always
bugged me that the whole thing is O(n^2) to find a CPU.

I understand why it has this form, but scanning each CPU more than once
is just offensive.

> + /* while loop will break here if sd == NULL */
> + }
> +
> + return new_cpu;
> +}

2017-08-22 04:36:17

by Joel Fernandes

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

Hi Peter,

On Mon, Aug 21, 2017 at 2:14 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
>> The current use of returning NULL from find_idlest_group is broken in
>> two cases:
>>
>> a1) The local group is not allowed.
>>
>> In this case, we currently do not change this_runnable_load or
>> this_avg_load from its initial value of 0, which means we return
>> NULL regardless of the load of the other, allowed groups. This
>> results in pointlessly continuing the find_idlest_group search
>> within the local group and then returning prev_cpu from
>> select_task_rq_fair.
>
>> b) smp_processor_id() is the "idlest" and != prev_cpu.
>>
>> find_idlest_group also returns NULL when the local group is
>> allowed and is the idlest. The caller then continues the
>> find_idlest_group search at a lower level of the current CPU's
>> sched_domain hierarchy. However new_cpu is not updated. This means
>> the search is pointless and we return prev_cpu from
>> select_task_rq_fair.
>>
>
> I think its much simpler than that.. but its late, so who knows ;-)
>
> Both cases seem predicated on the assumption that we'll return @cpu when
> we don't find any idler CPU. Consider, if the local group is the idlest,
> we should stick with @cpu and simply proceed with the child domain.
>
> The confusion, and the bugs, seem to have snuck in when we started
> considering @prev_cpu, whenever that was. The below is mostly code
> movement to put that whole while(sd) loop into its own function.
>
> The effective change is setting @new_cpu = @cpu when we start that loop:
>
<snip>
> ---
> kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
> 1 file changed, 48 insertions(+), 35 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c77e4b1d51c0..3e77265c480a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> }
>
> /*
> - * find_idlest_cpu - find the idlest cpu among the cpus in group.
> + * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
> */
> static int
> -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> {
> unsigned long load, min_load = ULONG_MAX;
> unsigned int min_exit_latency = UINT_MAX;
> @@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
> }
>
> +static int
> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
> +{
> + struct sched_domain *tmp;
> + int new_cpu = cpu;
> +
> + while (sd) {
> + struct sched_group *group;
> + int weight;
> +
> + if (!(sd->flags & sd_flag)) {
> + sd = sd->child;
> + continue;
> + }
> +
> + group = find_idlest_group(sd, p, cpu, sd_flag);
> + if (!group) {
> + sd = sd->child;
> + continue;

But this will still have the issue of pointlessly searching in the
local_group when the idlest CPU is in the non-local group? Stemming
from the fact that find_idlest_group is broken if the local group is
not allowed.

I believe this is fixed by Brendan's patch? :

"Initializing this_runnable_load and this_avg_load to ULONG_MAX
instead of 0. This means in case a1) we now return the idlest
non-local group."


Hopefully I didn't miss something. Sorry if I did, thanks,

-Joel

2017-08-22 07:49:00

by Vincent Guittot

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

On 21 August 2017 at 17:21, Brendan Jackman <[email protected]> wrote:
> The current use of returning NULL from find_idlest_group is broken in
[snip]
> ---
> kernel/sched/fair.c | 34 +++++++++++++++++++---------------
> 1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 64618d768546..7cb5ed719cf9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5382,26 +5382,29 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> * domain.
> */
> static struct sched_group *
> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> - int this_cpu, int sd_flag)
> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
> {
> struct sched_group *idlest = NULL, *group = sd->groups;
> + struct sched_group *local_group = sd->groups;
> struct sched_group *most_spare_sg = NULL;
> - unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
> - unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
> + unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = ULONG_MAX;
> + unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;

Good catch for this_runnable_load, indeed there is a problem if
local_group is not allowed.

> unsigned long most_spare = 0, this_spare = 0;
> int load_idx = sd->forkexec_idx;
> int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
> unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
> (sd->imbalance_pct-100) / 100;
>
> + if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
> + return NULL;

You should move that test up a level, into the while (sd) loop, or even
before the while(sd) loop, as there is then no need to even start the
loop in this case. There is no need to call find_idlest_group if there
is no cpu allowed and hence no group to find. Then you can keep the
current behavior of find_idlest_group(), which returns NULL if there is
no better group than the local one.
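
i.e. something like this before the while(sd) loop (untested sketch, using
the same helpers as your patch):

        /* No allowed cpu anywhere in this hierarchy: nothing to search for */
        if (sd && !cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
                sd = NULL;

        while (sd) {
                ...
        }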

> +
> if (sd_flag & SD_BALANCE_WAKE)
> load_idx = sd->wake_idx;
>
> do {
> unsigned long load, avg_load, runnable_load;
> unsigned long spare_cap, max_spare_cap;
> - int local_group;
> + bool group_is_local = group == local_group;
> int i;
>
> /* Skip over this group if it has no CPUs allowed */

[snip]

> @@ -5927,8 +5927,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> continue;
> }
>
> - group = find_idlest_group(sd, p, cpu, sd_flag);
> + group = find_idlest_group(sd, p, sd_flag);
> if (!group) {
> + break;
> + } else if (group == sd->groups) {
> + new_cpu = cpu;

The " new_cpu = cpu" above should be put before the while(sd) loop
like in Peter's proposal

> + /* Now try balancing at a lower domain level of cpu */
> sd = sd->child;
> continue;
> }
> --
> 2.14.1
>

2017-08-22 10:39:28

by Brendan Jackman

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group


On Tue, Aug 22 2017 at 04:34, Joel Fernandes wrote:
> Hi Peter,
>
> On Mon, Aug 21, 2017 at 2:14 PM, Peter Zijlstra <[email protected]> wrote:
>> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
>>> The current use of returning NULL from find_idlest_group is broken in
>>> two cases:
>>>
>>> a1) The local group is not allowed.
>>>
>>> In this case, we currently do not change this_runnable_load or
>>> this_avg_load from its initial value of 0, which means we return
>>> NULL regardless of the load of the other, allowed groups. This
>>> results in pointlessly continuing the find_idlest_group search
>>> within the local group and then returning prev_cpu from
>>> select_task_rq_fair.
>>
>>> b) smp_processor_id() is the "idlest" and != prev_cpu.
>>>
>>> find_idlest_group also returns NULL when the local group is
>>> allowed and is the idlest. The caller then continues the
>>> find_idlest_group search at a lower level of the current CPU's
>>> sched_domain hierarchy. However new_cpu is not updated. This means
>>> the search is pointless and we return prev_cpu from
>>> select_task_rq_fair.
>>>
>>
>> I think its much simpler than that.. but its late, so who knows ;-)
>>
>> Both cases seem predicated on the assumption that we'll return @cpu when
>> we don't find any idler CPU. Consider, if the local group is the idlest,
>> we should stick with @cpu and simply proceed with the child domain.
>>
>> The confusion, and the bugs, seem to have snuck in when we started
>> considering @prev_cpu, whenever that was. The below is mostly code
>> movement to put that whole while(sd) loop into its own function.
>>
>> The effective change is setting @new_cpu = @cpu when we start that loop:
>>
> <snip>
>> ---
>> kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
>> 1 file changed, 48 insertions(+), 35 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c77e4b1d51c0..3e77265c480a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>> }
>>
>> /*
>> - * find_idlest_cpu - find the idlest cpu among the cpus in group.
>> + * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
>> */
>> static int
>> -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>> +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>> {
>> unsigned long load, min_load = ULONG_MAX;
>> unsigned int min_exit_latency = UINT_MAX;
>> @@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>> return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
>> }
>>
>> +static int
>> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
>> +{
>> + struct sched_domain *tmp;
>> + int new_cpu = cpu;
>> +
>> + while (sd) {
>> + struct sched_group *group;
>> + int weight;
>> +
>> + if (!(sd->flags & sd_flag)) {
>> + sd = sd->child;
>> + continue;
>> + }
>> +
>> + group = find_idlest_group(sd, p, cpu, sd_flag);
>> + if (!group) {
>> + sd = sd->child;
>> + continue;
>
> But this will still have the issue of pointlessly searching in the
> local_group when the idlest CPU is in the non-local group? Stemming
> from the fact that find_idlest_group is broken if the local group is
> not allowed.
>
> I believe this is fixed by Brendan's patch? :
>
> "Initializing this_runnable_load and this_avg_load to ULONG_MAX
> instead of 0. This means in case a1) we now return the idlest
> non-local group."
>

Yeah, I don't think this is enough, firstly because of affinity and
secondly because I _guess_ we don't want to migrate from prev_cpu to
@cpu if none of the sched_domains have sd_flag set (...right? That would
feel like wake balancing even when there's no SD_BALANCE_WAKE anywhere).

However the code movement helps - I'll combine it with Vincent's
suggestions and post a v2.

2017-08-22 10:41:41

by Brendan Jackman

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group


On Tue, Aug 22 2017 at 07:48, Vincent Guittot wrote:
> On 21 August 2017 at 17:21, Brendan Jackman <[email protected]> wrote:
>> The current use of returning NULL from find_idlest_group is broken in
> [snip]
>> ---
>> kernel/sched/fair.c | 34 +++++++++++++++++++---------------
>> 1 file changed, 19 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 64618d768546..7cb5ed719cf9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5382,26 +5382,29 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>> * domain.
>> */
>> static struct sched_group *
>> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> - int this_cpu, int sd_flag)
>> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
>> {
>> struct sched_group *idlest = NULL, *group = sd->groups;
>> + struct sched_group *local_group = sd->groups;
>> struct sched_group *most_spare_sg = NULL;
>> - unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
>> - unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
>> + unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = ULONG_MAX;
>> + unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
>
> Good catch for this_runnable_load, indeed there is a problem if
> local_group is not allowed.
>
>> unsigned long most_spare = 0, this_spare = 0;
>> int load_idx = sd->forkexec_idx;
>> int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
>> unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
>> (sd->imbalance_pct-100) / 100;
>>
>> + if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
>> + return NULL;
>
> You should put that test above at upper level in the while (sd) loop
> or even before the while(sd) loop as there is even no need to start
> this while loop in this case. There is no need to call
> find_idlest_group if there is no cpu allowed and no group to find.
> Then, you can keep the current behavior of find_idlest_group() which
> return null if there no best group than the local one

True. I have to admit the reason I didn't do this is that, due to the
current "else while", doing this outside the loop would have meant
another level of indentation. But actually if I break out a new function
as in Peter's proposal I can avoid that (plus I probably shouldn't let
code beauty overrule wake-path performance :| ).

>> +
>> if (sd_flag & SD_BALANCE_WAKE)
>> load_idx = sd->wake_idx;
>>
>> do {
>> unsigned long load, avg_load, runnable_load;
>> unsigned long spare_cap, max_spare_cap;
>> - int local_group;
>> + bool group_is_local = group == local_group;
>> int i;
>>
>> /* Skip over this group if it has no CPUs allowed */
>
> [snip]
>
>> @@ -5927,8 +5927,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> continue;
>> }
>>
>> - group = find_idlest_group(sd, p, cpu, sd_flag);
>> + group = find_idlest_group(sd, p, sd_flag);
>> if (!group) {
>> + break;
>> + } else if (group == sd->groups) {
>> + new_cpu = cpu;
>
> The " new_cpu = cpu" above should be put before the while(sd) loop
> like in Peter's proposal

I don't think that would work - I've responded along with Joel in the
other subthread.

>> + /* Now try balancing at a lower domain level of cpu */
>> sd = sd->child;
>> continue;
>> }
>> --
>> 2.14.1
>>

2017-08-22 10:45:39

by Brendan Jackman

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group


On Tue, Aug 22 2017 at 10:39, Brendan Jackman wrote:
> On Tue, Aug 22 2017 at 04:34, Joel Fernandes wrote:
>> Hi Peter,
>>
>> On Mon, Aug 21, 2017 at 2:14 PM, Peter Zijlstra <[email protected]> wrote:
>>> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
>>>> The current use of returning NULL from find_idlest_group is broken in
>>>> two cases:
>>>>
>>>> a1) The local group is not allowed.
>>>>
>>>> In this case, we currently do not change this_runnable_load or
>>>> this_avg_load from its initial value of 0, which means we return
>>>> NULL regardless of the load of the other, allowed groups. This
>>>> results in pointlessly continuing the find_idlest_group search
>>>> within the local group and then returning prev_cpu from
>>>> select_task_rq_fair.
>>>
>>>> b) smp_processor_id() is the "idlest" and != prev_cpu.
>>>>
>>>> find_idlest_group also returns NULL when the local group is
>>>> allowed and is the idlest. The caller then continues the
>>>> find_idlest_group search at a lower level of the current CPU's
>>>> sched_domain hierarchy. However new_cpu is not updated. This means
>>>> the search is pointless and we return prev_cpu from
>>>> select_task_rq_fair.
>>>>
>>>
>>> I think its much simpler than that.. but its late, so who knows ;-)
>>>
>>> Both cases seem predicated on the assumption that we'll return @cpu when
>>> we don't find any idler CPU. Consider, if the local group is the idlest,
>>> we should stick with @cpu and simply proceed with the child domain.
>>>
>>> The confusion, and the bugs, seem to have snuck in when we started
>>> considering @prev_cpu, whenever that was. The below is mostly code
>>> movement to put that whole while(sd) loop into its own function.
>>>
>>> The effective change is setting @new_cpu = @cpu when we start that loop:
>>>
>> <snip>
>>> ---
>>> kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
>>> 1 file changed, 48 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index c77e4b1d51c0..3e77265c480a 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>> }
>>>
>>> /*
>>> - * find_idlest_cpu - find the idlest cpu among the cpus in group.
>>> + * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
>>> */
>>> static int
>>> -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>> +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>> {
>>> unsigned long load, min_load = ULONG_MAX;
>>> unsigned int min_exit_latency = UINT_MAX;
>>> @@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>> return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
>>> }
>>>
>>> +static int
>>> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
>>> +{
>>> + struct sched_domain *tmp;
>>> + int new_cpu = cpu;
>>> +
>>> + while (sd) {
>>> + struct sched_group *group;
>>> + int weight;
>>> +
>>> + if (!(sd->flags & sd_flag)) {
>>> + sd = sd->child;
>>> + continue;
>>> + }
>>> +
>>> + group = find_idlest_group(sd, p, cpu, sd_flag);
>>> + if (!group) {
>>> + sd = sd->child;
>>> + continue;
>>
>> But this will still have the issue of pointlessly searching in the
>> local_group when the idlest CPU is in the non-local group? Stemming
>> from the fact that find_idlest_group is broken if the local group is
>> not allowed.
>>
>> I believe this is fixed by Brendan's patch? :
>>
>> "Initializing this_runnable_load and this_avg_load to ULONG_MAX
>> instead of 0. This means in case a1) we now return the idlest
>> non-local group."
>>
>
> Yeah, I don't think this is enough, firstly because of affinity and
> secondly because I _guess_ we don't want to migrate from prev_cpu to
> @cpu if none of the sched_domains have sd_flag set (...right? That would
> feel like wake balancing even when there's no SD_BALANCE_WAKE anywhere).

Hmm, actually we always have sd_flag set, because otherwise we wouldn't
have done sd = tmp at the beginning of select_task_rq_fair.
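
i.e. (roughly, from the top of select_task_rq_fair):

        for_each_domain(cpu, tmp) {
                ...
                if (tmp->flags & sd_flag)
                        sd = tmp;
                ...
        }

so sd can only be non-NULL in that branch if some level had sd_flag set.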

>
> However the code movement helps - I'll combine it with Vincent's
> suggestions and post a v2.

2017-08-22 11:03:08

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group

On Tue, Aug 22, 2017 at 11:39:26AM +0100, Brendan Jackman wrote:

> However the code movement helps - I'll combine it with Vincent's
> suggestions and post a v2.

Please also split into multiple patches, as I feel this 2/2 does too
many things. Fixing that this_*_load initialization stuff for instance
could be a separate patch.

And unlike what I did, do the whole code movement as a pure code
movement, without any additional changes.

2017-08-22 12:46:16

by Brendan Jackman

Subject: Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group


On Tue, Aug 22 2017 at 11:03, Peter Zijlstra wrote:
> On Tue, Aug 22, 2017 at 11:39:26AM +0100, Brendan Jackman wrote:
>
>> However the code movement helps - I'll combine it with Vincent's
>> suggestions and post a v2.
>
> Please also split into multiple patches, as I feel this 2/2 does too
> many things. Fixing that this_*_load initialization stuff for instance
> could be a separate patch.

Sure. It was initially multiple patches but there were confusing
dependencies between them, so I squashed it. I don't think that will be
a problem in light of the new suggestions from you & Vincent.

> And unlike what I did, do the whole code movement as a pure code
> movement, without any additional changes.

Will do.