2013-08-19 16:13:29

by Peter Zijlstra

Subject: [PATCH 00/10] Various load-balance cleanups/optimizations -v2


After poking at them a little more I feel somewhat more confident.

I found one more bug, but this one was my own fault: we should also clear
sds->busiest_stat.avg_load because update_sd_pick_busiest() reads it before
we set it.

Moved the memset optimization and the fix for that into a separate patch.

Other than that, there's a sched_domain degenerate fix and a new way to detect
group_imb (which relies on the sched_domain fix).
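For reference, a minimal sketch of the read-before-write that motivates clearing
sds->busiest_stat.avg_load: the shape of update_sd_pick_busiest() as used in this
series, simplified and not verbatim.

	static bool update_sd_pick_busiest(struct lb_env *env,
					   struct sd_lb_stats *sds,
					   struct sched_group *sg,
					   struct sg_lb_stats *sgs)
	{
		/* reads sds->busiest_stat before update_sd_lb_stats() has assigned it */
		if (sgs->avg_load <= sds->busiest_stat.avg_load)
			return false;

		/* ... further tie-breaks elided ... */
		return true;
	}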


2013-08-21 02:09:24

by Joonsoo Kim

Subject: Re: [PATCH 00/10] Various load-balance cleanups/optimizations -v2

On Mon, Aug 19, 2013 at 06:00:58PM +0200, Peter Zijlstra wrote:
>
> After poking at them a little more I feel somewhat more confident.
>
> I found one more bug, but this one was my own fault: we should also clear
> sds->busiest_stat.avg_load because update_sd_pick_busiest() reads it before
> we set it.
>
> Moved the memset optimization and the fix for that into a separate patch.
>
> Other than that, there's a sched_domain degenerate fix and a new way to detect
> group_imb (which relies on the sched_domain fix).

I'm fine with your changes to my patches.

Thanks!

2013-08-28 08:56:04

by Peter Zijlstra

Subject: [RFC][PATCH 11/10] sched, fair: Reduce local_group logic


Subject: sched, fair: Reduce local_group logic
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 28 10:32:32 CEST 2013

Try and reduce the local_group logic by pulling most of it into
update_sd_lb_stats.

We could completely get rid of the local_group variable, but that
results in two calls to update_sg_lb_stats() and that blows up the
object size by about 2k :/

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4561,6 +4561,8 @@ static inline void update_sg_lb_stats(st
unsigned long load;
int i;

+ memset(sgs, 0, sizeof(*sgs));
+
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
struct rq *rq = cpu_rq(i);

@@ -4579,10 +4581,6 @@ static inline void update_sg_lb_stats(st
sgs->idle_cpus++;
}

- if (local_group && (env->idle != CPU_NEWLY_IDLE ||
- time_after_eq(jiffies, group->sgp->next_update)))
- update_group_power(env->sd, env->dst_cpu);
-
/* Adjust by relative CPU power of the group */
sgs->group_power = group->sgp->power;
sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;
@@ -4675,11 +4673,17 @@ static inline void update_sd_lb_stats(st
if (local_group) {
sds->local = sg;
sgs = &sds->local_stat;
+
+ if (env->idle != CPU_NEWLY_IDLE ||
+ time_after_eq(jiffies, sg->sgp->next_update))
+ update_group_power(env->sd, env->dst_cpu);
}

- memset(sgs, 0, sizeof(*sgs));
update_sg_lb_stats(env, sg, load_idx, local_group, sgs);

+ if (local_group)
+ goto next_group;
+
/*
* In case the child domain prefers tasks go to siblings
* first, lower the sg capacity to one so that we'll try
@@ -4690,19 +4694,20 @@ static inline void update_sd_lb_stats(st
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling && !local_group &&
- sds->local && sds->local_stat.group_has_capacity)
+ if (prefer_sibling && sds->local &&
+ sds->local_stat.group_has_capacity)
sgs->group_capacity = min(sgs->group_capacity, 1U);

- /* Now, start updating sd_lb_stats */
- sds->total_load += sgs->group_load;
- sds->total_pwr += sgs->group_power;
-
- if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
+ if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
}

+next_group:
+ /* Now, start updating sd_lb_stats */
+ sds->total_load += sgs->group_load;
+ sds->total_pwr += sgs->group_power;
+
sg = sg->next;
} while (sg != env->sd->groups);
}
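For readability, roughly what the update_sd_lb_stats() group loop looks like with
this patch applied, reconstructed from the hunks above (declarations of sg,
tmp_sgs, load_idx and prefer_sibling elided; not a verbatim copy):

	do {
		struct sg_lb_stats *sgs = &tmp_sgs;
		int local_group;

		local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
		if (local_group) {
			sds->local = sg;
			sgs = &sds->local_stat;

			if (env->idle != CPU_NEWLY_IDLE ||
			    time_after_eq(jiffies, sg->sgp->next_update))
				update_group_power(env->sd, env->dst_cpu);
		}

		/* update_sg_lb_stats() now clears *sgs itself (memset above) */
		update_sg_lb_stats(env, sg, load_idx, local_group, sgs);

		if (local_group)
			goto next_group;

		if (prefer_sibling && sds->local &&
		    sds->local_stat.group_has_capacity)
			sgs->group_capacity = min(sgs->group_capacity, 1U);

		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
			sds->busiest = sg;
			sds->busiest_stat = *sgs;
		}

next_group:
		/* totals are accumulated for local and remote groups alike */
		sds->total_load += sgs->group_load;
		sds->total_pwr += sgs->group_power;

		sg = sg->next;
	} while (sg != env->sd->groups);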

2013-08-28 08:57:22

by Peter Zijlstra

Subject: Re: [RFC][PATCH 11/10] sched, fair: Reduce local_group logic


# ls -la defconfig-build/kernel/sched/fair.o*
-rw-r--r-- 1 root root 760688 Aug 28 10:53 defconfig-build/kernel/sched/fair.o
-rw-r--r-- 1 root root 758160 Aug 28 10:36 defconfig-build/kernel/sched/fair.o.orig

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4553,9 +4553,9 @@ static inline int sg_imbalanced(struct s
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
*/
-static inline void update_sg_lb_stats(struct lb_env *env,
+static void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
- int local_group, struct sg_lb_stats *sgs)
+ int group_bias, struct sg_lb_stats *sgs)
{
unsigned long nr_running;
unsigned long load;
@@ -4568,8 +4568,7 @@ static inline void update_sg_lb_stats(st

nr_running = rq->nr_running;

- /* Bias balancing toward cpus of our domain */
- if (local_group)
+ if (group_bias)
load = target_load(i, load_idx);
else
load = source_load(i, load_idx);
@@ -4667,22 +4666,21 @@ static inline void update_sd_lb_stats(st

do {
struct sg_lb_stats *sgs = &tmp_sgs;
- int local_group;

- local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
- if (local_group) {
+ if (cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg))) {
sds->local = sg;
sgs = &sds->local_stat;

if (env->idle != CPU_NEWLY_IDLE ||
time_after_eq(jiffies, sg->sgp->next_update))
update_group_power(env->sd, env->dst_cpu);
- }
-
- update_sg_lb_stats(env, sg, load_idx, local_group, sgs);

- if (local_group)
+ /* Bias balancing toward cpus of our domain */
+ update_sg_lb_stats(env, sg, load_idx, 1, sgs);
goto next_group;
+ }
+
+ update_sg_lb_stats(env, sg, load_idx, 0, sgs);

/*
* In case the child domain prefers tasks go to siblings

2013-08-28 09:17:14

by Peter Zijlstra

Subject: Re: [RFC][PATCH 11/10] sched, fair: Reduce local_group logic

On Wed, Aug 28, 2013 at 10:55:42AM +0200, Peter Zijlstra wrote:
> @@ -4690,19 +4694,20 @@ static inline void update_sd_lb_stats(st
> * heaviest group when it is already under-utilized (possible
> * with a large weight task outweighs the tasks on the system).
> */
> + if (prefer_sibling && sds->local &&
> + sds->local_stat.group_has_capacity)
> sgs->group_capacity = min(sgs->group_capacity, 1U);

While we're here, I think it's always true that sds->local is set,
because env->dst_cpu should always be part of the local group and the
local group is always sd->groups; since that case now skips directly to
next_group, we'll not get here without it being set.

Hmm?

>
> + if (update_sd_pick_busiest(env, sds, sg, sgs)) {
> sds->busiest = sg;
> sds->busiest_stat = *sgs;
> }
>
> +next_group:
> + /* Now, start updating sd_lb_stats */
> + sds->total_load += sgs->group_load;
> + sds->total_pwr += sgs->group_power;
> +
> sg = sg->next;
> } while (sg != env->sd->groups);
> }
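If that observation holds, the NULL test could presumably be dropped; a
hypothetical simplification, not something posted in this thread:

	/* sds->local is set before any remote group reaches this point */
	if (prefer_sibling && sds->local_stat.group_has_capacity)
		sgs->group_capacity = min(sgs->group_capacity, 1U);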

2013-08-28 11:15:09

by Peter Zijlstra

Subject: [PATCH 12/10] sched, fair: Fix group power_orig computation


Subject: sched, fair: Fix group power_orig computation
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 28 11:44:39 CEST 2013

When looking at the code I noticed we don't actually compute
sgp->power_orig correctly for groups, fix that.

Currently the only consumer of that value is fix_small_capacity(), which is
only used on POWER7+. That code excludes this case by being limited to
SD_SHARE_CPUPOWER, which is only ever set on the SMT domain; that must be the
lowest domain, and it has singleton groups.

So nothing should be affected by this change.

Cc: Michael Neuling <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4448,7 +4448,7 @@ void update_group_power(struct sched_dom
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long power;
+ unsigned long power, power_orig;
unsigned long interval;

interval = msecs_to_jiffies(sd->balance_interval);
@@ -4460,7 +4460,7 @@ void update_group_power(struct sched_dom
return;
}

- power = 0;
+ power_orig = power = 0;

if (child->flags & SD_OVERLAP) {
/*
@@ -4468,8 +4468,12 @@ void update_group_power(struct sched_dom
* span the current group.
*/

- for_each_cpu(cpu, sched_group_cpus(sdg))
- power += power_of(cpu);
+ for_each_cpu(cpu, sched_group_cpus(sdg)) {
+ struct sched_group *sg = cpu_rq(cpu)->sd->groups;
+
+ power_orig += sg->sgp->power_orig;
+ power += sg->sgp->power;
+ }
} else {
/*
* !SD_OVERLAP domains can assume that child groups
@@ -4478,12 +4482,14 @@ void update_group_power(struct sched_dom

group = child->groups;
do {
+ power_orig += group->sgp->power_orig;
power += group->sgp->power;
group = group->next;
} while (group != child->groups);
}

- sdg->sgp->power_orig = sdg->sgp->power = power;
+ sdg->sgp->power_orig = power_orig;
+ sdg->sgp->power = power;
}

/*
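A hypothetical illustration of the bug (numbers made up): one child group with
power_orig = 1024 but power = 900 after RT/freq scaling.

	/*
	 * Old code: both the SD_OVERLAP and !SD_OVERLAP paths summed only the
	 * scaled power and then did
	 *
	 *	sdg->sgp->power_orig = sdg->sgp->power = power;    // both 900
	 *
	 * so the parent group's power_orig silently inherited the scaled value.
	 * With the patch, the two are accumulated separately:
	 *
	 *	sdg->sgp->power_orig = power_orig;                 // 1024
	 *	sdg->sgp->power = power;                           // 900
	 */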

2013-08-28 11:15:40

by Peter Zijlstra

Subject: [PATCH 13/10] sched, fair: Rework and comment the group_capacity code


Subject: sched, fair: Rework and comment the group_capacity code
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 28 11:50:34 CEST 2013

Pull out the group_capacity computation so that we can more clearly
comment its issues.

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4551,6 +4551,27 @@ static inline int sg_imbalanced(struct s
return group->sgp->imbalance;
}

+/*
+ * Compute the group capacity.
+ *
+ * For now the capacity is simply the number of power units in the group_power.
+ * A power unit represents a full core.
+ *
+ * This has an issue where N*frac(smt_power) >= 1, in that case we'll see extra
+ * 'cores' that aren't actually there.
+ */
+static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
+{
+
+ unsigned int power = group->sgp->power;
+ unsigned int capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
+
+ if (!capacity)
+ capacity = fix_small_capacity(env->sd, group);
+
+ return capacity;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -4594,16 +4615,11 @@ static inline void update_sg_lb_stats(st
if (sgs->sum_nr_running)
sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

- sgs->group_imb = sg_imbalanced(group);
-
- sgs->group_capacity =
- DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);
-
- if (!sgs->group_capacity)
- sgs->group_capacity = fix_small_capacity(env->sd, group);
-
sgs->group_weight = group->group_weight;

+ sgs->group_imb = sg_imbalanced(group);
+ sgs->group_capacity = sg_capacity(env, group);
+
if (sgs->group_capacity > sgs->sum_nr_running)
sgs->group_has_capacity = 1;
}
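To make the caveat in that comment concrete, a worked example on a hypothetical
8-core SMT-2 package with the default smt_gain of 1178:

	/*
	 * Each SMT-2 core contributes ~1178 power (two siblings at 1178/2),
	 * so group_power ~= 8 * 1178 = 9424 and
	 *
	 *	capacity = DIV_ROUND_CLOSEST(9424, 1024) = 9
	 *
	 * i.e. one phantom core: N * frac(smt_power) = 8 * 0.15 = 1.2 >= 1.
	 */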

2013-08-28 11:16:50

by Peter Zijlstra

Subject: [RFC][PATCH 14/10] sched, fair: Fix the group_capacity computation


Subject: sched, fair: Fix the group_capacity computation
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 28 12:40:38 CEST 2013

Do away with 'phantom' cores due to N*frac(smt_power) >= 1 by limiting
the capacity to the actual number of cores.

The assumption of 1 < smt_power < 2 is an actual requirement because
of what SMT is, so this should work regardless of the SMT
implementation.

It can still be defeated by creative use of cpu hotplug, but if you're
one of those freaks, you get to live with it.

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4554,18 +4554,24 @@ static inline int sg_imbalanced(struct s
/*
* Compute the group capacity.
*
- * For now the capacity is simply the number of power units in the group_power.
- * A power unit represents a full core.
- *
- * This has an issue where N*frac(smt_power) >= 1, in that case we'll see extra
- * 'cores' that aren't actually there.
+ * Avoid the issue where N*frac(smt_power) >= 1 creates 'phantom' cores by
+ * first dividing out the smt factor and computing the actual number of cores
+ * and limit power unit capacity with that.
*/
static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
{
+ unsigned int capacity, smt, cpus;
+ unsigned int power, power_orig;
+
+ power = group->sgp->power;
+ power_orig = group->sgp->power_orig;
+ cpus = group->group_weight;

- unsigned int power = group->sgp->power;
- unsigned int capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
+ /* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */
+ smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig);
+ capacity = cpus / smt; /* cores */

+ capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
if (!capacity)
capacity = fix_small_capacity(env->sd, group);

2013-09-04 07:44:19

by Vincent Guittot

Subject: Re: [RFC][PATCH 14/10] sched, fair: Fix the group_capacity computation

On 28 August 2013 13:16, Peter Zijlstra <[email protected]> wrote:
>
> Subject: sched, fair: Fix the group_capacity computation
> From: Peter Zijlstra <[email protected]>
> Date: Wed Aug 28 12:40:38 CEST 2013
>
> Do away with 'phantom' cores due to N*frac(smt_power) >= 1 by limiting
> the capacity to the actual number of cores.
>

Peter,

your patch also solves the 'phantom' big cores that can appear on HMP
systems, because big cores have a cpu_power >= SCHED_POWER_SCALE in
order to express a higher capacity than LITTLE cores.

Acked-by: Vincent Guittot <[email protected]>

Vincent

> The assumption of 1 < smt_power < 2 is an actual requirement because
> of what SMT is, so this should work regardless of the SMT
> implementation.
>
> It can still be defeated by creative use of cpu hotplug, but if you're
> one of those freaks, you get to live with it.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> kernel/sched/fair.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4554,18 +4554,24 @@ static inline int sg_imbalanced(struct s
> /*
> * Compute the group capacity.
> *
> - * For now the capacity is simply the number of power units in the group_power.
> - * A power unit represents a full core.
> - *
> - * This has an issue where N*frac(smt_power) >= 1, in that case we'll see extra
> - * 'cores' that aren't actually there.
> + * Avoid the issue where N*frac(smt_power) >= 1 creates 'phantom' cores by
> + * first dividing out the smt factor and computing the actual number of cores
> + * and limit power unit capacity with that.
> */
> static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
> {
> + unsigned int capacity, smt, cpus;
> + unsigned int power, power_orig;
> +
> + power = group->sgp->power;
> + power_orig = group->sgp->power_orig;
> + cpus = group->group_weight;
>
> - unsigned int power = group->sgp->power;
> - unsigned int capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
> + /* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */
> + smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig);
> + capacity = cpus / smt; /* cores */
>
> + capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
> if (!capacity)
> capacity = fix_small_capacity(env->sd, group);
>
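To illustrate Vincent's HMP point with made-up numbers: a 'big' group of two
cores, each with cpu_power = power_orig = 1441:

	/*
	 * Old: capacity = DIV_ROUND_CLOSEST(2 * 1441, 1024) = 3  (one phantom core)
	 * New: smt      = DIV_ROUND_UP(1024 * 2, 2882)      = 1
	 *	capacity = 2 / 1                             = 2
	 *	capacity = min(2, 3)                         = 2  (the real core count)
	 */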