2014-10-07 12:15:11

by Vincent Guittot

Subject: [PATCH v7 0/7] sched: consolidation of cpu_capacity

During load balance, the scheduler evaluates the number of tasks that a group
of CPUs can handle. The current method assumes that tasks have a fixed load of
SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_CAPACITY_SCALE.
This assumption leads to wrong decisions, creating ghost cores or removing real
ones, when the original capacity of the CPUs differs from the default
SCHED_CAPACITY_SCALE. We no longer try to evaluate the number of available
cores based on the group_capacity; instead, we evaluate the usage of a group
and compare it with its capacity.
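
Concretely, the check introduced in patch 6/7 boils down to comparing the
group usage with the group capacity (sketch below, slightly trimmed and
annotated from that patch; group_usage and group_capacity are the per-group
statistics gathered during load balance):

static inline bool
group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
{
	/* fewer running tasks than CPUs: the group cannot be overloaded */
	if (sgs->sum_nr_running <= sgs->group_weight)
		return false;

	/* usage exceeds capacity, with the usual imbalance_pct margin */
	if ((sgs->group_capacity * 100) <
	    (sgs->group_usage * env->sd->imbalance_pct))
		return true;

	return false;
}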

This patchset mainly replaces the old capacity method with a new one and keeps
the policy almost unchanged, although we could certainly take advantage of this
new statistic in several other places of the load balancer.

The utilization_avg_contrib is based on the current implementation of load
average tracking. I also have a version of utilization_avg_contrib that is
based on the new implementation proposal [1], but I haven't provided the
patches and results as [1] is still under review. I can provide changes on top
of [1] to adjust how utilization_avg_contrib is computed and adapt it to the
new mechanism.

Changes since V6:
- add group usage tracking
- fix some commit messages
- minor fixes such as comments and argument order

Changes since V5:
- remove patches that have been merged since v5: patches 01, 02, 03, 04, 05, 07
- update commit logs and add more details on the purpose of the patches
- fix/remove useless code with the rebase on patchset [2]
- remove capacity_orig in sched_group_capacity as it is not used
- move code into the right patch
- add some helper functions to factor out code

Changes since V4:
- rebase to manage conflicts with changes in selection of busiest group [4]

Changes since V3:
- add the usage_avg_contrib statistic, which sums the running time of tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix the replacement of power by capacity
- update some comments

Changes since V2:
- rebase on top of the capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from a CPU with reduced capacity
- rename group_activity to group_utilization and remove the unused total_utilization
- repair SD_PREFER_SIBLING and use it for the SMT level
- reorder the patchset to gather patches on the same topics

Changes since V1:
- add 3 fixes
- correct some commit messages
- replace the capacity computation with activity
- take into account current cpu capacity

[1] https://lkml.org/lkml/2014/7/18/110
[2] https://lkml.org/lkml/2014/7/25/589

Morten Rasmussen (1):
sched: Track group sched_entity usage contributions

Vincent Guittot (6):
sched: add per rq cpu_capacity_orig
sched: move cfs task on a CPU with higher capacity
sched: add utilization_avg_contrib
sched: get CPU's usage statistic
sched: replace capacity_factor by usage
sched: add SD_PREFER_SIBLING for SMT level

include/linux/sched.h | 21 +++-
kernel/sched/core.c | 15 +--
kernel/sched/debug.c | 12 ++-
kernel/sched/fair.c | 283 ++++++++++++++++++++++++++++++--------------------
kernel/sched/sched.h | 11 +-
5 files changed, 209 insertions(+), 133 deletions(-)

--
1.9.1


2014-10-07 12:15:18

by Vincent Guittot

Subject: [PATCH v7 2/7] sched: move cfs task on a CPU with higher capacity

If a CPU is used for handling a lot of IRQs, trigger a load balance to check
whether it's worth moving its tasks to another CPU that has more capacity.

As a side note, this will not generate more spurious ILBs because we already
trigger an ILB if there is more than one busy CPU. If this CPU is the only one
that has a task, we will trigger the ILB once to migrate the task.

The nohz_kick_needed() function has been cleaned up a bit while adding the new
test.

env.src_cpu and env.src_rq must be set unconditionally because they are used
in need_active_balance(), which is called even if busiest->nr_running equals 1.
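
As a worked example (illustrative numbers; the actual imbalance_pct depends
on the topology level): with sd->imbalance_pct = 125, a CPU whose
cpu_capacity_orig is 1024 passes the new check_cpu_capacity() test once its
cpu_capacity drops below 1024 * 100 / 125 = 819, i.e. once IRQs and other
sched classes consume more than roughly 20% of the CPU.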

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 70 ++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c3674da..9075dee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
}

/*
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
+ */
+static inline int
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
+{
+ return ((rq->cpu_capacity * sd->imbalance_pct) <
+ (rq->cpu_capacity_orig * 100));
+}
+
+/*
* Group imbalance indicates (and tries to solve) the problem where balancing
* groups is inadequate due to tsk_cpus_allowed() constraints.
*
@@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
*/
if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
return 1;
+
+ /*
+ * The src_cpu's capacity is reduced because of other
+ * sched_class or IRQs, we trig an active balance to move the
+ * task
+ */
+ if (check_cpu_capacity(env->src_rq, sd))
+ return 1;
}

return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
@@ -6668,6 +6688,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,

schedstat_add(sd, lb_imbalance[idle], env.imbalance);

+ env.src_cpu = busiest->cpu;
+ env.src_rq = busiest;
+
ld_moved = 0;
if (busiest->nr_running > 1) {
/*
@@ -6677,8 +6700,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* correctly treated as an imbalance.
*/
env.flags |= LBF_ALL_PINNED;
- env.src_cpu = busiest->cpu;
- env.src_rq = busiest;
env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);

more_balance:
@@ -7378,10 +7399,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)

/*
* Current heuristic for kicking the idle load balancer in the presence
- * of an idle cpu is the system.
+ * of an idle cpu in the system.
* - This rq has more than one task.
- * - At any scheduler domain level, this cpu's scheduler group has multiple
- * busy cpu's exceeding the group's capacity.
+ * - This rq has at least one CFS task and the capacity of the CPU is
+ * significantly reduced because of RT tasks or IRQs.
+ * - At parent of LLC scheduler domain level, this cpu's scheduler group has
+ * multiple busy cpu.
* - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
* domain span are idle.
*/
@@ -7391,9 +7414,10 @@ static inline int nohz_kick_needed(struct rq *rq)
struct sched_domain *sd;
struct sched_group_capacity *sgc;
int nr_busy, cpu = rq->cpu;
+ bool kick = false;

if (unlikely(rq->idle_balance))
- return 0;
+ return false;

/*
* We may be recently in ticked or tickless idle mode. At the first
@@ -7407,38 +7431,44 @@ static inline int nohz_kick_needed(struct rq *rq)
* balancing.
*/
if (likely(!atomic_read(&nohz.nr_cpus)))
- return 0;
+ return false;

if (time_before(now, nohz.next_balance))
- return 0;
+ return false;

if (rq->nr_running >= 2)
- goto need_kick;
+ return true;

rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
if (sd) {
sgc = sd->groups->sgc;
nr_busy = atomic_read(&sgc->nr_busy_cpus);

- if (nr_busy > 1)
- goto need_kick_unlock;
+ if (nr_busy > 1) {
+ kick = true;
+ goto unlock;
+ }
+
}

- sd = rcu_dereference(per_cpu(sd_asym, cpu));
+ sd = rcu_dereference(rq->sd);
+ if (sd) {
+ if ((rq->cfs.h_nr_running >= 1) &&
+ check_cpu_capacity(rq, sd)) {
+ kick = true;
+ goto unlock;
+ }
+ }

+ sd = rcu_dereference(per_cpu(sd_asym, cpu));
if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
sched_domain_span(sd)) < cpu))
- goto need_kick_unlock;
-
- rcu_read_unlock();
- return 0;
+ kick = true;

-need_kick_unlock:
+unlock:
rcu_read_unlock();
-need_kick:
- return 1;
+ return kick;
}
#else
static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
--
1.9.1

2014-10-07 12:15:25

by Vincent Guittot

Subject: [PATCH v7 5/7] sched: get CPU's usage statistic

Monitor the usage level of each group at each sched_domain level. The usage is
the amount of cpu_capacity that is currently used on a CPU or group of CPUs.
We use the utilization_load_avg to evaluate the usage level of each group.

The utilization_avg_contrib only takes into account the running time but not
the uArch, so the utilization_load_avg is in the range [0..SCHED_LOAD_SCALE]
and reflects the running load on the CPU. We have to scale the utilization with
the capacity of the CPU to get the usage of the latter. The usage can then be
compared with the available capacity.
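
As a worked example (illustrative figures, assuming SCHED_LOAD_SCALE = 1024):
a CPU with capacity_orig_of(cpu) = 800 whose cfs.utilization_load_avg is 512,
i.e. half of SCHED_LOAD_SCALE, gets a usage of (512 * 800) >> 10 = 400, which
is directly comparable with the capacity left for CFS tasks on that CPU.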

Frequency scaling invariance is not taken into account in this patchset; it
will be addressed in a separate patchset.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3e9067..7364ed4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4551,6 +4551,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
return target;
}

+static int get_cpu_usage(int cpu)
+{
+ unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long capacity = capacity_orig_of(cpu);
+
+ if (usage >= SCHED_LOAD_SCALE)
+ return capacity + 1;
+
+ return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5679,6 +5690,7 @@ struct sg_lb_stats {
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long load_per_task;
unsigned long group_capacity;
+ unsigned long group_usage; /* Total usage of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
unsigned int group_capacity_factor;
unsigned int idle_cpus;
@@ -6053,6 +6065,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);

sgs->group_load += load;
+ sgs->group_usage += get_cpu_usage(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

if (rq->nr_running > 1)
--
1.9.1

2014-10-07 12:15:43

by Vincent Guittot

Subject: [PATCH v7 7/7] sched: add SD_PREFER_SIBLING for SMT level

Add the SD_PREFER_SIBLING flag at the SMT level in order to ensure that
the scheduler will put at least one task per core.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Preeti U. Murthy <[email protected]>
---
kernel/sched/core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 37fb92c..731f2ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6165,6 +6165,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
*/

if (sd->flags & SD_SHARE_CPUCAPACITY) {
+ sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */

--
1.9.1

2014-10-07 12:15:47

by Vincent Guittot

Subject: [PATCH v7 6/7] sched: replace capacity_factor by usage

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE, but the resulting capacity_factor hardly works for SMT
systems: it sometimes works for big cores but fails to do the right thing for
little cores.

Below are two examples to illustrate the problem that this patch solves:

1 - capacity_factor assumes that the max capacity of a CPU is
SCHED_CAPACITY_SCALE and that the load of a thread is always
SCHED_LOAD_SCALE. It compares the resulting capacity_factor with the sum
of nr_running to decide whether a group is overloaded or not.

But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor
of 2 (DIV_ROUND_CLOSEST(3 * 640, 1024) = 2), which means that it will be
seen as overloaded even if we have only one task per CPU.

2 - Conversely, if the default capacity of a CPU is greater than
SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have
a capacity_factor of at most 4 (thanks to the fix [0] for SMT systems,
which prevents the appearance of ghost CPUs). But if one CPU is fully
used by rt tasks (and its capacity is reduced to nearly nothing), the
capacity_factor of the group will still be 4: the remaining capacity
(roughly 3 * 1512) still rounds to a factor of 4 (and [0] caps it there
anyway), so the group is still seen as able to handle 4 tasks even
though one CPU is effectively unavailable.

So, this patch solves the issue by removing capacity_factor and replacing
it with the two following metrics:
- the CPU capacity available to CFS tasks, which is already used by
load_balance;
- the usage of the CPU by CFS tasks. For the latter, I have re-introduced
utilization_avg_contrib, which is used to compute the usage of a CPU by
CFS tasks.
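
Taking example 1 above with the new metrics (illustrative figures): with
3 CPUs of capacity 640 and one always-running task per CPU, group_capacity
and group_usage both tend towards roughly 3 * 640 = 1920, and sum_nr_running
(3) does not exceed group_weight (3), so the group is no longer reported as
overloaded when it is in fact simply full.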

Finally, sched_group->sched_group_capacity->capacity_orig has been removed
because it is no longer used during load balance.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 12 -----
kernel/sched/fair.c | 131 ++++++++++++++++++++-------------------------------
kernel/sched/sched.h | 2 +-
3 files changed, 51 insertions(+), 94 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 45ae52d..37fb92c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5373,17 +5373,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
break;
}

- /*
- * Even though we initialize ->capacity to something semi-sane,
- * we leave capacity_orig unset. This allows us to detect if
- * domain iteration is still funny without causing /0 traps.
- */
- if (!group->sgc->capacity_orig) {
- printk(KERN_CONT "\n");
- printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
- break;
- }
-
if (!cpumask_weight(sched_group_cpus(group))) {
printk(KERN_CONT "\n");
printk(KERN_ERR "ERROR: empty group\n");
@@ -5868,7 +5857,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
* die on a /0 trap.
*/
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
- sg->sgc->capacity_orig = sg->sgc->capacity;

/*
* Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7364ed4..7ee71d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5692,11 +5692,10 @@ struct sg_lb_stats {
unsigned long group_capacity;
unsigned long group_usage; /* Total usage of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
- unsigned int group_capacity_factor;
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- int group_has_free_capacity;
+ int group_no_capacity;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -5837,7 +5836,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;

cpu_rq(cpu)->cpu_capacity_orig = capacity;
- sdg->sgc->capacity_orig = capacity;

if (sched_feat(ARCH_CAPACITY))
capacity *= arch_scale_freq_capacity(sd, cpu);
@@ -5860,7 +5858,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity, capacity_orig;
+ unsigned long capacity;
unsigned long interval;

interval = msecs_to_jiffies(sd->balance_interval);
@@ -5872,7 +5870,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
return;
}

- capacity_orig = capacity = 0;
+ capacity = 0;

if (child->flags & SD_OVERLAP) {
/*
@@ -5892,19 +5890,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* Use capacity_of(), which is set irrespective of domains
* in update_cpu_capacity().
*
- * This avoids capacity/capacity_orig from being 0 and
+ * This avoids capacity from being 0 and
* causing divide-by-zero issues on boot.
- *
- * Runtime updates will correct capacity_orig.
*/
if (unlikely(!rq->sd)) {
- capacity_orig += capacity_orig_of(cpu);
capacity += capacity_of(cpu);
continue;
}

sgc = rq->sd->groups->sgc;
- capacity_orig += sgc->capacity_orig;
capacity += sgc->capacity;
}
} else {
@@ -5915,42 +5909,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)

group = child->groups;
do {
- capacity_orig += group->sgc->capacity_orig;
capacity += group->sgc->capacity;
group = group->next;
} while (group != child->groups);
}

- sdg->sgc->capacity_orig = capacity_orig;
sdg->sgc->capacity = capacity;
}

/*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
- /*
- * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
- */
- if (!(sd->flags & SD_SHARE_CPUCAPACITY))
- return 0;
-
- /*
- * If ~90% of the cpu_capacity is still there, we're good.
- */
- if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
- return 1;
-
- return 0;
-}
-
-/*
* Check whether the capacity of the rq has been noticeably reduced by side
* activity. The imbalance_pct is used for the threshold.
* Return true is the capacity is reduced
@@ -5996,38 +5963,37 @@ static inline int sg_imbalanced(struct sched_group *group)
return group->sgc->imbalance;
}

-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
{
- unsigned int capacity_factor, smt, cpus;
- unsigned int capacity, capacity_orig;
+ if ((sgs->group_capacity * 100) >
+ (sgs->group_usage * env->sd->imbalance_pct))
+ return true;

- capacity = group->sgc->capacity;
- capacity_orig = group->sgc->capacity_orig;
- cpus = group->group_weight;
+ if (sgs->sum_nr_running < sgs->group_weight)
+ return true;

- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
- smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
- capacity_factor = cpus / smt; /* cores */
+ return false;
+}

- capacity_factor = min_t(unsigned,
- capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running <= sgs->group_weight)
+ return false;
+
+ if ((sgs->group_capacity * 100) <
+ (sgs->group_usage * env->sd->imbalance_pct))
+ return true;

- return capacity_factor;
+ return false;
}

-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+ struct sched_group *group,
+ struct sg_lb_stats *sgs)
{
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ if (sgs->group_no_capacity)
return group_overloaded;

if (sg_imbalanced(group))
@@ -6088,11 +6054,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

sgs->group_weight = group->group_weight;
- sgs->group_capacity_factor = sg_capacity_factor(env, group);
- sgs->group_type = group_classify(group, sgs);

- if (sgs->group_capacity_factor > sgs->sum_nr_running)
- sgs->group_has_free_capacity = 1;
+ sgs->group_no_capacity = group_is_overloaded(env, sgs);
+ sgs->group_type = group_classify(env, group, sgs);
}

/**
@@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

/*
* In case the child domain prefers tasks go to siblings
- * first, lower the sg capacity factor to one so that we'll try
+ * first, lower the sg capacity to one so that we'll try
* and move all the excess tasks away. We lower the capacity
* of a group only if the local group has the capacity to fit
- * these excess tasks, i.e. nr_running < group_capacity_factor. The
+ * these excess tasks, i.e. group_capacity > 0. The
* extra check prevents the case where you always pull from the
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
if (prefer_sibling && sds->local &&
- sds->local_stat.group_has_free_capacity)
- sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+ group_has_capacity(env, &sds->local_stat)) {
+ if (sgs->sum_nr_running > 1)
+ sgs->group_no_capacity = 1;
+ sgs->group_capacity = min(sgs->group_capacity,
+ SCHED_CAPACITY_SCALE);
+ }

if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
@@ -6403,11 +6371,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
- load_above_capacity =
- (busiest->sum_nr_running - busiest->group_capacity_factor);
-
- load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
- load_above_capacity /= busiest->group_capacity;
+ load_above_capacity = busiest->sum_nr_running *
+ SCHED_LOAD_SCALE;
+ if (load_above_capacity > busiest->group_capacity)
+ load_above_capacity -= busiest->group_capacity;
+ else
+ load_above_capacity = ~0UL;
}

/*
@@ -6470,6 +6439,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
local = &sds.local_stat;
busiest = &sds.busiest_stat;

+ /* ASYM feature bypasses nice load balance check */
if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
check_asym_packing(env, &sds))
return sds.busiest;
@@ -6490,8 +6460,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;

/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
- if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
- !busiest->group_has_free_capacity)
+ if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+ busiest->group_no_capacity)
goto force_balance;

/*
@@ -6550,7 +6520,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
int i;

for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
- unsigned long capacity, capacity_factor, wl;
+ unsigned long capacity, wl;
enum fbq_type rt;

rq = cpu_rq(i);
@@ -6579,9 +6549,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
continue;

capacity = capacity_of(i);
- capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);

wl = weighted_cpuload(i);

@@ -6589,7 +6556,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu capacity.
*/
- if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+ if (rq->nr_running == 1 && wl > env->imbalance &&
+ !check_cpu_capacity(rq, env->sd))
continue;

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35eca7d..7963981 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -759,7 +759,7 @@ struct sched_group_capacity {
* CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
* for a single CPU.
*/
- unsigned int capacity, capacity_orig;
+ unsigned int capacity;
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
/*
--
1.9.1

2014-10-07 12:16:43

by Vincent Guittot

Subject: [PATCH 4/7] sched: Track group sched_entity usage contributions

From: Morten Rasmussen <[email protected]>

Adds usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue. It is _not_ influenced in any way by the task group
h_load. Hence it is representing the actual cpu usage of the group, not
its intended load contribution which may differ significantly from the
utilization on lightly utilized systems.
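
As an illustration (made-up figures): a group runqueue with two tasks, each
running about 30% of the time, has a utilization_load_avg of roughly
0.6 * SCHED_LOAD_SCALE, and that is exactly what the group entity's
utilization_avg_contrib reports, whatever the group's shares; load_avg_contrib
for the same group entity, in contrast, does depend on the task group load.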

cc: Paul Turner <[email protected]>
cc: Ben Segall <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/debug.c | 3 +++
kernel/sched/fair.c | 5 +++++
2 files changed, 8 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e0fbc0f..efb47ed 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
#ifdef CONFIG_SMP
P(se->avg.runnable_avg_sum);
+ P(se->avg.running_avg_sum);
P(se->avg.avg_period);
P(se->avg.load_avg_contrib);
+ P(se->avg.utilization_avg_contrib);
P(se->avg.decay_count);
#endif
#undef PN
@@ -633,6 +635,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.avg.running_avg_sum);
P(se.avg.avg_period);
P(se.avg.load_avg_contrib);
+ P(se.avg.utilization_avg_contrib);
P(se.avg.decay_count);
#endif
P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d6de526..d3e9067 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2381,6 +2381,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
return 0;

se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+ se->avg.utilization_avg_contrib =
+ decay_load(se->avg.utilization_avg_contrib, decays);
se->avg.decay_count = 0;

return decays;
@@ -2525,6 +2527,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)

if (entity_is_task(se))
__update_task_entity_utilization(se);
+ else
+ se->avg.utilization_avg_contrib =
+ group_cfs_rq(se)->utilization_load_avg;

return se->avg.utilization_avg_contrib - old_contrib;
}
--
1.9.1

2014-10-07 12:16:46

by Vincent Guittot

Subject: [PATCH v7 3/7] sched: add utilization_avg_contrib

Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running times for the tasks on a runqueue. The latter is
named utilization_avg_contrib.
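
For instance, a task that spends about 40% of its time actually executing and
a further 20% waiting on a runqueue behind other tasks converges towards
running_avg_sum/avg_period of about 0.4 and runnable_avg_sum/avg_period of
about 0.6; only the former feeds utilization_avg_contrib.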

This patch is based on the usage metric that was proposed in the first
versions of the per-entity load tracking patchset by Paul Turner
<[email protected]> but that has been removed since then. This version differs
from the original one in that it is not linked to the task_group.

The rq's utilization_avg_contrib will be used to check whether a rq is
overloaded or not, instead of trying to compute how many tasks a group of
CPUs can handle.

Rename runnable_avg_period to avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences

Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 21 +++++++++++++---
kernel/sched/debug.c | 9 ++++---
kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++------------
kernel/sched/sched.h | 8 +++++-
4 files changed, 84 insertions(+), 24 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18f5262..f6662b6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1071,15 +1071,28 @@ struct load_weight {
};

struct sched_avg {
+ u64 last_runnable_update;
+ s64 decay_count;
+ /*
+ * utilization_avg_contrib describes the amount of time that a
+ * sched_entity is running on a CPU. It is based on running_avg_sum
+ * and is scaled in the range [0..SCHED_LOAD_SCALE].
+ * load_avg_contrib described the the amount of time that a
+ * sched_entity is runnable on a rq. It is based on both
+ * runnable_avg_sum and the weight of the task.
+ */
+ unsigned long load_avg_contrib, utilization_avg_contrib;
/*
* These sums represent an infinite geometric series and so are bound
* above by 1024/(1-y). Thus we only need a u32 to store them for all
* choices of y < 1-2^(-32)*1024.
+ * running_avg_sum reflects the time that the sched_entity is
+ * effectively running on the CPU.
+ * runnable_avg_sum represents the amount of time a sched_entity is on
+ * a runqueue which includes the running time that is monitored by
+ * running_avg_sum.
*/
- u32 runnable_avg_sum, runnable_avg_period;
- u64 last_runnable_update;
- s64 decay_count;
- unsigned long load_avg_contrib;
+ u32 runnable_avg_sum, avg_period, running_avg_sum;
};

#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ce33780..e0fbc0f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
if (!se) {
struct sched_avg *avg = &cpu_rq(cpu)->avg;
P(avg->runnable_avg_sum);
- P(avg->runnable_avg_period);
+ P(avg->avg_period);
return;
}

@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
#ifdef CONFIG_SMP
P(se->avg.runnable_avg_sum);
- P(se->avg.runnable_avg_period);
+ P(se->avg.avg_period);
P(se->avg.load_avg_contrib);
P(se->avg.decay_count);
#endif
@@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg",
+ cfs_rq->utilization_load_avg);
#ifdef CONFIG_FAIR_GROUP_SCHED
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
cfs_rq->tg_load_contrib);
@@ -628,7 +630,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.load.weight);
#ifdef CONFIG_SMP
P(se.avg.runnable_avg_sum);
- P(se.avg.runnable_avg_period);
+ P(se.avg.running_avg_sum);
+ P(se.avg.avg_period);
P(se.avg.load_avg_contrib);
P(se.avg.decay_count);
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9075dee..d6de526 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -678,8 +678,8 @@ void init_task_runnable_average(struct task_struct *p)

p->se.avg.decay_count = 0;
slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
- p->se.avg.runnable_avg_sum = slice;
- p->se.avg.runnable_avg_period = slice;
+ p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
+ p->se.avg.avg_period = slice;
__update_task_entity_contrib(&p->se);
}
#else
@@ -1548,7 +1548,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
*period = now - p->last_task_numa_placement;
} else {
delta = p->se.avg.runnable_avg_sum;
- *period = p->se.avg.runnable_avg_period;
+ *period = p->se.avg.avg_period;
}

p->last_sum_exec_runtime = runtime;
@@ -2294,7 +2294,8 @@ static u32 __compute_runnable_contrib(u64 n)
*/
static __always_inline int __update_entity_runnable_avg(u64 now,
struct sched_avg *sa,
- int runnable)
+ int runnable,
+ int running)
{
u64 delta, periods;
u32 runnable_contrib;
@@ -2320,7 +2321,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->last_runnable_update = now;

/* delta_w is the amount already accumulated against our next period */
- delta_w = sa->runnable_avg_period % 1024;
+ delta_w = sa->avg_period % 1024;
if (delta + delta_w >= 1024) {
/* period roll-over */
decayed = 1;
@@ -2333,7 +2334,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
delta_w = 1024 - delta_w;
if (runnable)
sa->runnable_avg_sum += delta_w;
- sa->runnable_avg_period += delta_w;
+ if (running)
+ sa->running_avg_sum += delta_w;
+ sa->avg_period += delta_w;

delta -= delta_w;

@@ -2343,20 +2346,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,

sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
periods + 1);
- sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+ sa->running_avg_sum = decay_load(sa->running_avg_sum,
+ periods + 1);
+ sa->avg_period = decay_load(sa->avg_period,
periods + 1);

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
if (runnable)
sa->runnable_avg_sum += runnable_contrib;
- sa->runnable_avg_period += runnable_contrib;
+ if (running)
+ sa->running_avg_sum += runnable_contrib;
+ sa->avg_period += runnable_contrib;
}

/* Remainder of delta accrued against u_0` */
if (runnable)
sa->runnable_avg_sum += delta;
- sa->runnable_avg_period += delta;
+ if (running)
+ sa->running_avg_sum += delta;
+ sa->avg_period += delta;

return decayed;
}
@@ -2408,7 +2417,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,

/* The fraction of a cpu used by this cfs_rq */
contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
- sa->runnable_avg_period + 1);
+ sa->avg_period + 1);
contrib -= cfs_rq->tg_runnable_contrib;

if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
@@ -2461,7 +2470,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)

static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
- __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+ __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
+ runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2479,7 +2489,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)

/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
- contrib /= (se->avg.runnable_avg_period + 1);
+ contrib /= (se->avg.avg_period + 1);
se->avg.load_avg_contrib = scale_load(contrib);
}

@@ -2498,6 +2508,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
return se->avg.load_avg_contrib - old_contrib;
}

+
+static inline void __update_task_entity_utilization(struct sched_entity *se)
+{
+ u32 contrib;
+
+ /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+ contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
+ contrib /= (se->avg.avg_period + 1);
+ se->avg.utilization_avg_contrib = scale_load(contrib);
+}
+
+static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
+{
+ long old_contrib = se->avg.utilization_avg_contrib;
+
+ if (entity_is_task(se))
+ __update_task_entity_utilization(se);
+
+ return se->avg.utilization_avg_contrib - old_contrib;
+}
+
static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
long load_contrib)
{
@@ -2514,7 +2545,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- long contrib_delta;
+ long contrib_delta, utilization_delta;
u64 now;

/*
@@ -2526,18 +2557,22 @@ static inline void update_entity_load_avg(struct sched_entity *se,
else
now = cfs_rq_clock_task(group_cfs_rq(se));

- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+ if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+ cfs_rq->curr == se))
return;

contrib_delta = __update_entity_load_avg_contrib(se);
+ utilization_delta = __update_entity_utilization_avg_contrib(se);

if (!update_cfs_rq)
return;

- if (se->on_rq)
+ if (se->on_rq) {
cfs_rq->runnable_load_avg += contrib_delta;
- else
+ cfs_rq->utilization_load_avg += utilization_delta;
+ } else {
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ }
}

/*
@@ -2612,6 +2647,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}

cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
/* we force update consideration on load-balancer moves */
update_cfs_rq_blocked_load(cfs_rq, !wakeup);
}
@@ -2630,6 +2666,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
update_cfs_rq_blocked_load(cfs_rq, !sleep);

cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+ cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
@@ -2967,6 +3004,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
+ update_entity_load_avg(se, 1);
}

update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12483d8..35eca7d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -343,8 +343,14 @@ struct cfs_rq {
* Under CFS, load is tracked on a per-entity basis and aggregated up.
* This allows for the description of both thread and group usage (in
* the FAIR_GROUP_SCHED case).
+ * runnable_load_avg is the sum of the load_avg_contrib of the
+ * sched_entities on the rq.
+ * blocked_load_avg is similar to runnable_load_avg except that its
+ * the blocked sched_entities on the rq.
+ * utilization_load_avg is the sum of the average running time of the
+ * sched_entities on the rq.
*/
- unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
atomic64_t decay_counter;
u64 last_decay;
atomic_long_t removed_load;
--
1.9.1

2014-10-07 12:17:31

by Vincent Guittot

Subject: [PATCH v7 1/7] sched: add per rq cpu_capacity_orig

This new field, cpu_capacity_orig, reflects the original capacity of a CPU
before it is altered by rt tasks and/or IRQs.

cpu_capacity_orig will be used in several places to detect when the capacity
of a CPU has been noticeably reduced so we can trigger a load balance to look
for a CPU with better capacity. As an example, we can detect when a CPU
handles a significant amount of IRQ work (with CONFIG_IRQ_TIME_ACCOUNTING)
but is seen as an idle CPU by the scheduler while CPUs that are really idle
are available.

In addition, this new cpu_capacity_orig will be used to evaluate the usage of
a CPU by CFS tasks.
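
As a rough illustration (made-up figures): with CONFIG_IRQ_TIME_ACCOUNTING, a
CPU spending about half of its time in interrupt handling can see its
cpu_capacity scaled down to around 512 while cpu_capacity_orig stays at 1024;
comparing the two values is what lets the load balancer notice that this
seemingly idle CPU is a poor place to keep a CFS task when truly idle CPUs
are available.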

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Kamalesh Babulal <[email protected]>

---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 8 +++++++-
kernel/sched/sched.h | 1 +
3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c84bdc0..45ae52d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7087,7 +7087,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
- rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+ rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd61cff..c3674da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4089,6 +4089,11 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}

+static unsigned long capacity_orig_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -5776,6 +5781,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)

capacity >>= SCHED_CAPACITY_SHIFT;

+ cpu_rq(cpu)->cpu_capacity_orig = capacity;
sdg->sgc->capacity_orig = capacity;

if (sched_feat(ARCH_CAPACITY))
@@ -5837,7 +5843,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* Runtime updates will correct capacity_orig.
*/
if (unlikely(!rq->sd)) {
- capacity_orig += capacity_of(cpu);
+ capacity_orig += capacity_orig_of(cpu);
capacity += capacity_of(cpu);
continue;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6130251..12483d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -579,6 +579,7 @@ struct rq {
struct sched_domain *sd;

unsigned long cpu_capacity;
+ unsigned long cpu_capacity_orig;

unsigned char idle_balance;
/* For active balancing */
--
1.9.1

2014-10-07 20:15:44

by Benjamin Segall

Subject: Re: [PATCH 4/7] sched: Track group sched_entity usage contributions

Vincent Guittot <[email protected]> writes:

> From: Morten Rasmussen <[email protected]>
>
> Adds usage contribution tracking for group entities. Unlike
> se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
> entities is the sum of se->avg.utilization_avg_contrib for all entities on the
> group runqueue. It is _not_ influenced in any way by the task group
> h_load. Hence it is representing the actual cpu usage of the group, not
> its intended load contribution which may differ significantly from the
> utilization on lightly utilized systems.


Just noting that this version also has usage disappear immediately when
a task blocks, although it does what you probably want on migration.

Also, group-ses don't ever use their running_avg_sum so it's kinda a
waste, but I'm not sure it's worth doing anything about.

>
> cc: Paul Turner <[email protected]>
> cc: Ben Segall <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/debug.c | 3 +++
> kernel/sched/fair.c | 5 +++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index e0fbc0f..efb47ed 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
> P(se->load.weight);
> #ifdef CONFIG_SMP
> P(se->avg.runnable_avg_sum);
> + P(se->avg.running_avg_sum);
> P(se->avg.avg_period);
> P(se->avg.load_avg_contrib);
> + P(se->avg.utilization_avg_contrib);
> P(se->avg.decay_count);
> #endif
> #undef PN
> @@ -633,6 +635,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
> P(se.avg.running_avg_sum);
> P(se.avg.avg_period);
> P(se.avg.load_avg_contrib);
> + P(se.avg.utilization_avg_contrib);
> P(se.avg.decay_count);
> #endif
> P(policy);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d6de526..d3e9067 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2381,6 +2381,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> return 0;
>
> se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
> + se->avg.utilization_avg_contrib =
> + decay_load(se->avg.utilization_avg_contrib, decays);
> se->avg.decay_count = 0;
>
> return decays;
> @@ -2525,6 +2527,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
>
> if (entity_is_task(se))
> __update_task_entity_utilization(se);
> + else
> + se->avg.utilization_avg_contrib =
> + group_cfs_rq(se)->utilization_load_avg;
>
> return se->avg.utilization_avg_contrib - old_contrib;
> }

2014-10-08 07:16:31

by Vincent Guittot

Subject: Re: [PATCH 4/7] sched: Track group sched_entity usage contributions

On 7 October 2014 22:15, <[email protected]> wrote:
> Vincent Guittot <[email protected]> writes:
>
>> From: Morten Rasmussen <[email protected]>
>>
>> Adds usage contribution tracking for group entities. Unlike
>> se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
>> entities is the sum of se->avg.utilization_avg_contrib for all entities on the
>> group runqueue. It is _not_ influenced in any way by the task group
>> h_load. Hence it is representing the actual cpu usage of the group, not
>> its intended load contribution which may differ significantly from the
>> utilization on lightly utilized systems.
>
>
> Just noting that this version also has usage disappear immediately when
> a task blocks, although it does what you probably want on migration.

Yes. For this patchset, we prefer to stay aligned with the current
implementation, which only takes into account runnable tasks, in order to
cap the change in behavior.

IMHO, the use of blocked tasks in utilization_avg_contrib should be
part of a dedicated patchset with dedicated non-regression tests.

>
> Also, group-ses don't ever use their running_avg_sum so it's kinda a
> waste, but I'm not sure it's worth doing anything about.
>
>>

2014-10-08 11:13:30

by Morten Rasmussen

Subject: Re: [PATCH 4/7] sched: Track group sched_entity usage contributions

On Tue, Oct 07, 2014 at 09:15:39PM +0100, [email protected] wrote:
> Vincent Guittot <[email protected]> writes:
>
> > From: Morten Rasmussen <[email protected]>
> >
> > Adds usage contribution tracking for group entities. Unlike
> > se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
> > entities is the sum of se->avg.utilization_avg_contrib for all entities on the
> > group runqueue. It is _not_ influenced in any way by the task group
> > h_load. Hence it is representing the actual cpu usage of the group, not
> > its intended load contribution which may differ significantly from the
> > utilization on lightly utilized systems.
>
>
> Just noting that this version also has usage disappear immediately when
> a task blocks, although it does what you probably want on migration.

Yes, as it was discussed at Ksummit, we should include blocked
usage. It gives a much more stable metric.

> Also, group-ses don't ever use their running_avg_sum so it's kinda a
> waste, but I'm not sure it's worth doing anything about.

Yeah, but since we still have to maintain runnable_avg_sum for group-ses
it isn't much that can be saved I think. Maybe avoid calling
update_entity_load_avg() from set_next_entity() for group-ses, but I'm
not sure how much it helps.

2014-10-08 17:04:12

by Dietmar Eggemann

Subject: Re: [PATCH v7 3/7] sched: add utilization_avg_contrib

On 07/10/14 13:13, Vincent Guittot wrote:
> Add new statistics which reflect the average time a task is running on the CPU
> and the sum of these running time of the tasks on a runqueue. The latter is
> named utilization_avg_contrib.
>
> This patch is based on the usage metric that was proposed in the 1st
> versions of the per-entity load tracking patchset by Paul Turner
> <[email protected]> but that has be removed afterward. This version differs
> from the original one in the sense that it's not linked to task_group.
>
> The rq's utilization_avg_contrib will be used to check if a rq is overloaded
> or not instead of trying to compute how many task a group of CPUs can handle
>
> Rename runnable_avg_period into avg_period as it is now used with both
> runnable_avg_sum and running_avg_sum
>
> Add some descriptions of the variables to explain their differences
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> include/linux/sched.h | 21 +++++++++++++---
> kernel/sched/debug.c | 9 ++++---
> kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++------------
> kernel/sched/sched.h | 8 +++++-
> 4 files changed, 84 insertions(+), 24 deletions(-)
>

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9075dee..d6de526 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -678,8 +678,8 @@ void init_task_runnable_average(struct task_struct *p)
>
> p->se.avg.decay_count = 0;
> slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
> - p->se.avg.runnable_avg_sum = slice;
> - p->se.avg.runnable_avg_period = slice;
> + p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
> + p->se.avg.avg_period = slice;
> __update_task_entity_contrib(&p->se);

Didn't you miss a call to __update_task_entity_utilization here?

Another thing from the naming perspective: you use the term
'utilization' in place of 'load' (e.g. existing __update_entity_load_avg_contrib
versus new __update_entity_utilization_avg_contrib), but for the
__update_task_entity_contrib function you use 'utilization' in place of
'contrib' (__update_task_entity_utilization). IMHO, kind of hard to grasp.

> }
> #else
> @@ -1548,7 +1548,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
> *period = now - p->last_task_numa_placement;
> } else {
> delta = p->se.avg.runnable_avg_sum;
> - *period = p->se.avg.runnable_avg_period;
> + *period = p->se.avg.avg_period;
> }
>
> p->last_sum_exec_runtime = runtime;

2014-10-09 11:24:27

by Peter Zijlstra

Subject: Re: [PATCH v7 2/7] sched: move cfs task on a CPU with higher capacity

On Tue, Oct 07, 2014 at 02:13:32PM +0200, Vincent Guittot wrote:
> +++ b/kernel/sched/fair.c
> @@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
> }
>
> /*
> + * Check whether the capacity of the rq has been noticeably reduced by side
> + * activity. The imbalance_pct is used for the threshold.
> + * Return true is the capacity is reduced
> + */
> +static inline int
> +check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> +{
> + return ((rq->cpu_capacity * sd->imbalance_pct) <
> + (rq->cpu_capacity_orig * 100));
> +}
> +
> +/*
> * Group imbalance indicates (and tries to solve) the problem where balancing
> * groups is inadequate due to tsk_cpus_allowed() constraints.
> *
> @@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
> */
> if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
> return 1;
> +
> + /*
> + * The src_cpu's capacity is reduced because of other
> + * sched_class or IRQs, we trig an active balance to move the
> + * task
> + */
> + if (check_cpu_capacity(env->src_rq, sd))
> + return 1;
> }

So does it make sense to first check if there's a better candidate at
all? By this time we've already iterated the current SD while trying
regular load balancing, so we could know this.

2014-10-09 11:36:28

by Peter Zijlstra

Subject: Re: [PATCH v7 5/7] sched: get CPU's usage statistic

On Tue, Oct 07, 2014 at 02:13:35PM +0200, Vincent Guittot wrote:
> Monitor the usage level of each group of each sched_domain level. The usage is
> the amount of cpu_capacity that is currently used on a CPU or group of CPUs.
> We use the utilization_load_avg to evaluate the usage level of each group.
>
> The utilization_avg_contrib only takes into account the running time but not
> the uArch so the utilization_load_avg is in the range [0..SCHED_LOAD_SCALE]
> to reflect the running load on the CPU. We have to scale the utilization with
> the capacity of the CPU to get the usage of the latter. The usage can then be
> compared with the available capacity.

You say cpu_capacity, but in actual fact you use capacity_orig and fail
to justify/clarify this.

> The frequency scaling invariance is not taken into account in this patchset,
> it will be solved in another patchset

Maybe explain what the specific invariance issue is that is skipped over
for now.

> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3e9067..7364ed4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4551,6 +4551,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
> return target;
> }
>
> +static int get_cpu_usage(int cpu)
> +{
> + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> + unsigned long capacity = capacity_orig_of(cpu);
> +
> + if (usage >= SCHED_LOAD_SCALE)
> + return capacity + 1;

Like Morten I'm confused by that +1 thing.

> +
> + return (usage * capacity) >> SCHED_LOAD_SHIFT;
> +}

A comment on that function saying that it returns capacity units might
clarify the shift confusion Morten raised the other day.

2014-10-09 12:17:06

by Peter Zijlstra

Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
> +static inline bool
> +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
> {
> + if ((sgs->group_capacity * 100) >
> + (sgs->group_usage * env->sd->imbalance_pct))
> + return true;

Why the imb_pct there? We're looking for 100% utilization, not 130 or
whatnot, right?

> + if (sgs->sum_nr_running < sgs->group_weight)
> + return true;

With the code as it stands, this is the cheaper test (no mults) so why
is it second?

> + return false;
> +}
>
> +static inline bool
> +group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
> +{
> + if (sgs->sum_nr_running <= sgs->group_weight)
> + return false;
> +
> + if ((sgs->group_capacity * 100) <
> + (sgs->group_usage * env->sd->imbalance_pct))
> + return true;
>
> + return false;
> }

Same thing here wrt imb_pct

2014-10-09 13:57:58

by Vincent Guittot

Subject: Re: [PATCH v7 5/7] sched: get CPU's usage statistic

On 9 October 2014 13:36, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 07, 2014 at 02:13:35PM +0200, Vincent Guittot wrote:
>> Monitor the usage level of each group of each sched_domain level. The usage is
>> the amount of cpu_capacity that is currently used on a CPU or group of CPUs.
>> We use the utilization_load_avg to evaluate the usage level of each group.
>>
>> The utilization_avg_contrib only takes into account the running time but not
>> the uArch so the utilization_load_avg is in the range [0..SCHED_LOAD_SCALE]
>> to reflect the running load on the CPU. We have to scale the utilization with
>> the capacity of the CPU to get the usage of the latter. The usage can then be
>> compared with the available capacity.
>
> You say cpu_capacity, but in actual fact you use capacity_orig and fail
> to justify/clarify this.

You're right, it's cpu_capacity_orig, not cpu_capacity.

cpu_capacity is the compute capacity available for CFS tasks once we
have removed the capacity that is used by RT tasks.

We want to compare the utilization of the CPU (utilization_avg_contrib,
which is in the range [0..SCHED_LOAD_SCALE]) with the available capacity
(cpu_capacity, which is in the range [0..cpu_capacity_orig]).
A utilization_avg_contrib equal to SCHED_LOAD_SCALE means that the CPU is
fully utilized, i.e. all of cpu_capacity_orig is used, so we scale
utilization_avg_contrib from [0..SCHED_LOAD_SCALE] into cpu_usage in the
range [0..cpu_capacity_orig].
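
To put numbers on it (purely illustrative): with cpu_capacity_orig = 1536 and
a utilization_avg_contrib of 256 (a quarter of SCHED_LOAD_SCALE = 1024), the
scaled usage is 256 * 1536 / 1024 = 384, which can then be compared against
the cpu_capacity that remains once the RT part has been removed.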


>
>> The frequency scaling invariance is not taken into account in this patchset,
>> it will be solved in another patchset
>
> Maybe explain what the specific invariance issue is that is skipped over
> for now.

OK. I can add a description of the fact that if the core runs slower, the
tasks will use more running time of the CPU for the same job, so the usage
of the CPU, which is based on the amount of time, will increase.

>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d3e9067..7364ed4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4551,6 +4551,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
>> return target;
>> }
>>
>> +static int get_cpu_usage(int cpu)
>> +{
>> + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>> + unsigned long capacity = capacity_orig_of(cpu);
>> +
>> + if (usage >= SCHED_LOAD_SCALE)
>> + return capacity + 1;
>
> Like Morten I'm confused by that +1 thing.

OK. The goal was to point out the erroneous case where the usage is out of
range, but if it generates confusion, I can remove it.

>
>> +
>> + return (usage * capacity) >> SCHED_LOAD_SHIFT;
>> +}
>
> A comment with that function that it returns capacity units might
> clarify shift confusion Morten raised the other day.

OK. I will add a comment.

2014-10-09 14:16:34

by Peter Zijlstra

Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
> @@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> /*
> * In case the child domain prefers tasks go to siblings
> + * first, lower the sg capacity to one so that we'll try
> * and move all the excess tasks away. We lower the capacity
> * of a group only if the local group has the capacity to fit
> + * these excess tasks, i.e. group_capacity > 0. The
> * extra check prevents the case where you always pull from the
> * heaviest group when it is already under-utilized (possible
> * with a large weight task outweighs the tasks on the system).
> */
> if (prefer_sibling && sds->local &&
> + group_has_capacity(env, &sds->local_stat)) {
> + if (sgs->sum_nr_running > 1)
> + sgs->group_no_capacity = 1;
> + sgs->group_capacity = min(sgs->group_capacity,
> + SCHED_CAPACITY_SCALE);
> + }
>
> if (update_sd_pick_busiest(env, sds, sg, sgs)) {
> sds->busiest = sg;

> @@ -6490,8 +6460,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> goto force_balance;
>
> /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
> - if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
> - !busiest->group_has_free_capacity)
> + if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
> + busiest->group_no_capacity)
> goto force_balance;
>
> /*

This is two calls to group_has_capacity() on the local group. Why not
compute once in update_sd_lb_stats()?

2014-10-09 14:18:31

by Vincent Guittot

Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On 9 October 2014 14:16, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
>> +static inline bool
>> +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
>> {
>> + if ((sgs->group_capacity * 100) >
>> + (sgs->group_usage * env->sd->imbalance_pct))
>> + return true;
>
> Why the imb_pct there? We're looking for 100% utilization, not 130 or
> whatnot, right?

Having exactly 100% is quite difficult because of various roundings.
So I have added a margin/threshold to prevent any excessive change of the state.
I have just used the same margin/threshold as in other places in the
load balance.

So the current threshold depends on the sched_domain level; it's around 14% at MC level.
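
As a concrete illustration, assuming imbalance_pct = 117 (roughly the MC
value, hence the ~14% above), the group_has_capacity() test works out to:

/*
 * group_capacity * 100 > group_usage * 117
 *   <=>  group_usage < group_capacity * 100 / 117
 *   <=>  group_usage below ~85% of group_capacity
 *
 * i.e. the group stops being reported as having spare capacity once it
 * is within ~15% of fully used, instead of flapping around exactly 100%.
 */
static int has_capacity_example(unsigned long capacity, unsigned long usage)
{
	const unsigned int imbalance_pct = 117;	/* assumed value, see above */

	return (capacity * 100) > (usage * imbalance_pct);
}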

>
>> + if (sgs->sum_nr_running < sgs->group_weight)
>> + return true;
>
> With the code as it stands, this is the cheaper test (no mults) so why
> is it second?
>
>> + return false;
>> +}
>>
>> +static inline bool
>> +group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
>> +{
>> + if (sgs->sum_nr_running <= sgs->group_weight)
>> + return false;
>> +
>> + if ((sgs->group_capacity * 100) <
>> + (sgs->group_usage * env->sd->imbalance_pct))
>> + return true;
>>
>> + return false;
>> }
>
> Same thing here wrt imb_pct

2014-10-09 14:29:30

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On 9 October 2014 16:16, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
>> @@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>
>> /*
>> * In case the child domain prefers tasks go to siblings
>> + * first, lower the sg capacity to one so that we'll try
>> * and move all the excess tasks away. We lower the capacity
>> * of a group only if the local group has the capacity to fit
>> + * these excess tasks, i.e. group_capacity > 0. The
>> * extra check prevents the case where you always pull from the
>> * heaviest group when it is already under-utilized (possible
>> * with a large weight task outweighs the tasks on the system).
>> */
>> if (prefer_sibling && sds->local &&
>> + group_has_capacity(env, &sds->local_stat)) {
>> + if (sgs->sum_nr_running > 1)
>> + sgs->group_no_capacity = 1;
>> + sgs->group_capacity = min(sgs->group_capacity,
>> + SCHED_CAPACITY_SCALE);
>> + }
>>
>> if (update_sd_pick_busiest(env, sds, sg, sgs)) {
>> sds->busiest = sg;
>
>> @@ -6490,8 +6460,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> goto force_balance;
>>
>> /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
>> - if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
>> - !busiest->group_has_free_capacity)
>> + if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
>> + busiest->group_no_capacity)
>> goto force_balance;
>>
>> /*
>
> This is two calls to group_has_capacity() on the local group. Why not
> compute once in update_sd_lb_stats()?

mainly because of its place in the code, so it is not always used
during load balance, unlike group_no_capacity.

Now that I have said that, I just noticed that it's better to move the
call to the last tested condition:

+ if (env->idle == CPU_NEWLY_IDLE && busiest->group_no_capacity &&
+ group_has_capacity(env, local))

2014-10-09 14:58:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
> @@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> /*
> * In case the child domain prefers tasks go to siblings
> - * first, lower the sg capacity factor to one so that we'll try
> + * first, lower the sg capacity to one so that we'll try
> * and move all the excess tasks away. We lower the capacity
> * of a group only if the local group has the capacity to fit
> - * these excess tasks, i.e. nr_running < group_capacity_factor. The
> + * these excess tasks, i.e. group_capacity > 0. The
> * extra check prevents the case where you always pull from the
> * heaviest group when it is already under-utilized (possible
> * with a large weight task outweighs the tasks on the system).
> */
> if (prefer_sibling && sds->local &&
> - sds->local_stat.group_has_free_capacity)
> - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
> + group_has_capacity(env, &sds->local_stat)) {
> + if (sgs->sum_nr_running > 1)
> + sgs->group_no_capacity = 1;
> + sgs->group_capacity = min(sgs->group_capacity,
> + SCHED_CAPACITY_SCALE);
> + }
>
> if (update_sd_pick_busiest(env, sds, sg, sgs)) {
> sds->busiest = sg;

So this is your PREFER_SIBLING implementation, why is this a good one?

That is, the current PREFER_SIBLING works because we account against
nr_running, and setting it to 1 makes 2 tasks too much and we end up
moving stuff away.

But if I understand things right, we're now measuring tasks in
'utilization' against group_capacity, so setting group_capacity to
CAPACITY_SCALE, means we can end up with many tasks on the one cpu
before we move over to another group, right?

So I think that for 'idle' systems we want to do the
nr_running/work-conserving thing -- get as many cpus running
'something' and avoid queueing like the plague.

Then when there's some queueing, we want to go do the utilization thing,
basically minimize queueing by leveling utilization.

Once all cpus are fully utilized, we switch to fair/load based balancing
and try and get equal load on cpus.

Does that make sense?


If so, how about adding a group_type and splitting group_other into say
group_idle and group_util:

enum group_type {
	group_idle = 0,
	group_util,
	group_imbalanced,
	group_overloaded,
};

we change group_classify() into something like:

	if (sgs->group_usage > sgs->group_capacity)
		return group_overloaded;

	if (sg_imbalanced(group))
		return group_imbalanced;

	if (sgs->nr_running < sgs->weight)
		return group_idle;

	return group_util;


And then have update_sd_pick_busiest() something like:

	if (sgs->group_type > busiest->group_type)
		return true;

	if (sgs->group_type < busiest->group_type)
		return false;

	switch (sgs->group_type) {
	case group_idle:
		if (sgs->nr_running < busiest->nr_running)
			return false;
		break;

	case group_util:
		if (sgs->group_usage < busiest->group_usage)
			return false;
		break;

	default:
		if (sgs->avg_load < busiest->avg_load)
			return false;
		break;
	}

....


And then some calculate_imbalance() magic to complete it..


If we have that, we can play tricks with the exact busiest condition in
update_sd_pick_busiest() to implement PREFER_SIBLING or so.

Makes sense?

2014-10-09 15:00:06

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 2/7] sched: move cfs task on a CPU with higher capacity

On 9 October 2014 13:23, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 07, 2014 at 02:13:32PM +0200, Vincent Guittot wrote:
>> +++ b/kernel/sched/fair.c
>> @@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
>> }
>>
>> /*
>> + * Check whether the capacity of the rq has been noticeably reduced by side
>> + * activity. The imbalance_pct is used for the threshold.
>> + * Return true is the capacity is reduced
>> + */
>> +static inline int
>> +check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>> +{
>> + return ((rq->cpu_capacity * sd->imbalance_pct) <
>> + (rq->cpu_capacity_orig * 100));
>> +}
>> +
>> +/*
>> * Group imbalance indicates (and tries to solve) the problem where balancing
>> * groups is inadequate due to tsk_cpus_allowed() constraints.
>> *
>> @@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
>> */
>> if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>> return 1;
>> +
>> + /*
>> + * The src_cpu's capacity is reduced because of other
>> + * sched_class or IRQs, we trig an active balance to move the
>> + * task
>> + */
>> + if (check_cpu_capacity(env->src_rq, sd))
>> + return 1;
>> }
>
> So does it make sense to first check if there's a better candidate at
> all? By this time we've already iterated the current SD while trying
> regular load balancing, so we could know this.

I'm not sure I completely catch your point.
Normally, f_b_g and f_b_q have already looked at the best candidate by
the time we call need_active_balance, and src_cpu has been elected.
Or have I missed your point?


>
>

2014-10-09 15:12:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] sched: get CPU's usage statistic

On Thu, Oct 09, 2014 at 03:57:28PM +0200, Vincent Guittot wrote:
> On 9 October 2014 13:36, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Oct 07, 2014 at 02:13:35PM +0200, Vincent Guittot wrote:
> >> Monitor the usage level of each group of each sched_domain level. The usage is
> >> the amount of cpu_capacity that is currently used on a CPU or group of CPUs.
> >> We use the utilization_load_avg to evaluate the usage level of each group.
> >>
> >> The utilization_avg_contrib only takes into account the running time but not
> >> the uArch so the utilization_load_avg is in the range [0..SCHED_LOAD_SCALE]
> >> to reflect the running load on the CPU. We have to scale the utilization with
> >> the capacity of the CPU to get the usage of the latter. The usage can then be
> >> compared with the available capacity.
> >
> > You say cpu_capacity, but in actual fact you use capacity_orig and fail
> > to justify/clarify this.
>
> you're right it's cpu_capacity_orig no cpu_capacity
>
> cpu_capacity is the compute capacity available for CFS task once we
> have removed the capacity that is used by RT tasks.

But why? When you compare the sum of usage against the capacity, you want
matching units. Otherwise your usage will far exceed capacity in the
presence of RT tasks, which doesn't seem to make sense to me.

> We want to compare the utilization of the CPU (utilization_avg_contrib
> which is in the range [0..SCHED_LOAD_SCALE]) with available capacity
> (cpu_capacity which is in the range [0..cpu_capacity_orig])
> An utilization_avg_contrib equals to SCHED_LOAD_SCALE means that the
> CPU is fully utilized so all cpu_capacity_orig are used. so we scale
> the utilization_avg_contrib from [0..SCHED_LOAD_SCALE] into cpu_usage
> in the range [0..cpu_capacity_orig]

Right, I got that: the usage thing converts from 'utilization' to a
fraction of capacity, so that we can then compare it against the total
capacity.

But if, as per the above, we were to use the same capacity for both
sides, it's a pointless factor and we could have immediately compared the
'utilization' number against its unit.
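
Spelling the arithmetic out:

/*
 * usage = util * capacity_orig / SCHED_LOAD_SCALE
 *
 * so, up to rounding, (usage >= capacity_orig) is the same test as
 * (util >= SCHED_LOAD_SCALE); the scaling only buys something when the
 * other side of the comparison is a different value, e.g. the
 * RT-reduced cpu_capacity.
 */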

> >> +static int get_cpu_usage(int cpu)
> >> +{
> >> + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> >> + unsigned long capacity = capacity_orig_of(cpu);
> >> +
> >> + if (usage >= SCHED_LOAD_SCALE)
> >> + return capacity + 1;
> >
> > Like Morten I'm confused by that +1 thing.
>
> ok. the goal was to point out the erroneous case where usage is out of
> the range but if it generates confusion, it can remove it

Well, the fact that you clip makes that point, returning a value outside
of the specified range doesn't.

2014-10-09 15:18:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On Thu, Oct 09, 2014 at 04:18:02PM +0200, Vincent Guittot wrote:
> On 9 October 2014 14:16, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
> >> +static inline bool
> >> +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
> >> {
> >> + if ((sgs->group_capacity * 100) >
> >> + (sgs->group_usage * env->sd->imbalance_pct))
> >> + return true;
> >
> > Why the imb_pct there? We're looking for 100% utilization, not 130 or
> > whatnot, right?
>
> Having exactly 100% is quite difficult because of various rounding.
> So i have added a margin/threshold to prevent any excessive change of the state.
> I have just to use the same margin/threshold than in other place in
> load balance.
>

Yet you failed to mention this anywhere. Also does it really matter?

2014-10-09 15:30:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 2/7] sched: move cfs task on a CPU with higher capacity

On Thu, Oct 09, 2014 at 04:59:36PM +0200, Vincent Guittot wrote:
> On 9 October 2014 13:23, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Oct 07, 2014 at 02:13:32PM +0200, Vincent Guittot wrote:
> >> +++ b/kernel/sched/fair.c
> >> @@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
> >> }
> >>
> >> /*
> >> + * Check whether the capacity of the rq has been noticeably reduced by side
> >> + * activity. The imbalance_pct is used for the threshold.
> >> + * Return true is the capacity is reduced
> >> + */
> >> +static inline int
> >> +check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >> +{
> >> + return ((rq->cpu_capacity * sd->imbalance_pct) <
> >> + (rq->cpu_capacity_orig * 100));
> >> +}
> >> +
> >> +/*
> >> * Group imbalance indicates (and tries to solve) the problem where balancing
> >> * groups is inadequate due to tsk_cpus_allowed() constraints.
> >> *
> >> @@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
> >> */
> >> if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
> >> return 1;
> >> +
> >> + /*
> >> + * The src_cpu's capacity is reduced because of other
> >> + * sched_class or IRQs, we trig an active balance to move the
> >> + * task
> >> + */
> >> + if (check_cpu_capacity(env->src_rq, sd))
> >> + return 1;
> >> }
> >
> > So does it make sense to first check if there's a better candidate at
> > all? By this time we've already iterated the current SD while trying
> > regular load balancing, so we could know this.
>
> i'm not sure to completely catch your point.
> Normally, f_b_g and f_b_q have already looked at the best candidate
> when we call need_active_balance. And src_cpu has been elected.
> Or i have missed your point ?

Yep you did indeed miss my point.

So I've always disliked this patch for its arbitrary nature: why
unconditionally try an active balance every time there is 'some' RT/IRQ
usage? It could be that all CPUs are over that arbitrary threshold and
we'll end up active balancing for nothing.

So, since we've already iterated all CPUs in our domain back in
update_sd_lb_stats() we could have computed the CFS fraction:

1024 * capacity / capacity_orig

for every cpu and collected the min/max of this. Then we can compute if
src is significantly (and there I suppose we can indeed use imb)
affected compared to others.
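
Something like the following sketch (names are made up, and the min/max
would have to be collected during the update_sd_lb_stats() walk):

/*
 * Track the per-CPU CFS fraction while iterating the domain, then treat
 * src_cpu as "significantly reduced" only relative to the best CPU seen.
 */
static unsigned long cfs_fraction(unsigned long capacity,
				  unsigned long capacity_orig)
{
	return (1024UL * capacity) / capacity_orig;
}

static int src_significantly_reduced(unsigned long src_frac,
				     unsigned long max_frac,
				     unsigned int imbalance_pct)
{
	/* e.g. with imbalance_pct = 117: src below ~85% of the best CPU */
	return src_frac * imbalance_pct < max_frac * 100;
}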

2014-10-10 07:17:49

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On 9 October 2014 17:18, Peter Zijlstra <[email protected]> wrote:
> On Thu, Oct 09, 2014 at 04:18:02PM +0200, Vincent Guittot wrote:
>> On 9 October 2014 14:16, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
>> >> +static inline bool
>> >> +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
>> >> {
>> >> + if ((sgs->group_capacity * 100) >
>> >> + (sgs->group_usage * env->sd->imbalance_pct))
>> >> + return true;
>> >
>> > Why the imb_pct there? We're looking for 100% utilization, not 130 or
>> > whatnot, right?
>>
>> Having exactly 100% is quite difficult because of various rounding.
>> So i have added a margin/threshold to prevent any excessive change of the state.
>> I have just to use the same margin/threshold than in other place in
>> load balance.
>>
>
> Yet you failed to mention this anywhere. Also does it really matter?

yes i think it latter because it give a more stable view of the
"overload state" and "have free capacity state" of the CPU.
One additional point is that the imbalance_pct will ensure that a
cpu/group will not been seen as having capacity if its available
capacity is only 1-5% which will generate spurious task migration

I will add these details in the commit log and in a comment in the code

2014-10-10 07:18:54

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On 10 October 2014 09:17, Vincent Guittot <[email protected]> wrote:
>
> yes i think it latter because it give a more stable view of the

s/latter/matter/

> "overload state" and "have free capacity state" of the CPU.
> One additional point is that the imbalance_pct will ensure that a
> cpu/group will not been seen as having capacity if its available
> capacity is only 1-5% which will generate spurious task migration
>
> I will add these details in the commit log and in a comment in the code

2014-10-10 07:47:08

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 2/7] sched: move cfs task on a CPU with higher capacity

On 9 October 2014 17:30, Peter Zijlstra <[email protected]> wrote:
> On Thu, Oct 09, 2014 at 04:59:36PM +0200, Vincent Guittot wrote:
>> On 9 October 2014 13:23, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Oct 07, 2014 at 02:13:32PM +0200, Vincent Guittot wrote:
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
>> >> }
>> >>
>> >> /*
>> >> + * Check whether the capacity of the rq has been noticeably reduced by side
>> >> + * activity. The imbalance_pct is used for the threshold.
>> >> + * Return true is the capacity is reduced
>> >> + */
>> >> +static inline int
>> >> +check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>> >> +{
>> >> + return ((rq->cpu_capacity * sd->imbalance_pct) <
>> >> + (rq->cpu_capacity_orig * 100));
>> >> +}
>> >> +
>> >> +/*
>> >> * Group imbalance indicates (and tries to solve) the problem where balancing
>> >> * groups is inadequate due to tsk_cpus_allowed() constraints.
>> >> *
>> >> @@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
>> >> */
>> >> if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>> >> return 1;
>> >> +
>> >> + /*
>> >> + * The src_cpu's capacity is reduced because of other
>> >> + * sched_class or IRQs, we trig an active balance to move the
>> >> + * task
>> >> + */
>> >> + if (check_cpu_capacity(env->src_rq, sd))
>> >> + return 1;
>> >> }
>> >
>> > So does it make sense to first check if there's a better candidate at
>> > all? By this time we've already iterated the current SD while trying
>> > regular load balancing, so we could know this.
>>
>> i'm not sure to completely catch your point.
>> Normally, f_b_g and f_b_q have already looked at the best candidate
>> when we call need_active_balance. And src_cpu has been elected.
>> Or i have missed your point ?
>
> Yep you did indeed miss my point.
>
> So I've always disliked this patch for its arbitrary nature, why
> unconditionally try and active balance every time there is 'some' RT/IRQ
> usage, it could be all CPUs are over that arbitrary threshold and we'll
> end up active balancing for no point.
>
> So, since we've already iterated all CPUs in our domain back in
> update_sd_lb_stats() we could have computed the CFS fraction:
>
> 1024 * capacity / capacity_orig
>
> for every cpu and collected the min/max of this. Then we can compute if
> src is significantly (and there I suppose we can indeed use imb)
> affected compared to others.

ok, so we should put an additional check in f_b_g to be sure that we will
jump to force_balance only if there is a real gain in moving the
task to the local group (from the point of view of the capacity
available to the task), and probably in f_b_q too.

>
>

2014-10-10 14:38:32

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] sched: get CPU's usage statistic

On 9 October 2014 17:12, Peter Zijlstra <[email protected]> wrote:

>> >> +static int get_cpu_usage(int cpu)
>> >> +{
>> >> + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>> >> + unsigned long capacity = capacity_orig_of(cpu);
>> >> +
>> >> + if (usage >= SCHED_LOAD_SCALE)
>> >> + return capacity + 1;
>> >
>> > Like Morten I'm confused by that +1 thing.
>>
>> ok. the goal was to point out the erroneous case where usage is out of
>> the range but if it generates confusion, it can remove it
>
> Well, the fact that you clip makes that point, returning a value outside
> of the specified range doesn't.

I meant removing the +1.

2014-10-21 07:39:15

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On 9 October 2014 16:58, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
>> @@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>
>> /*
>> * In case the child domain prefers tasks go to siblings
>> - * first, lower the sg capacity factor to one so that we'll try
>> + * first, lower the sg capacity to one so that we'll try
>> * and move all the excess tasks away. We lower the capacity
>> * of a group only if the local group has the capacity to fit
>> - * these excess tasks, i.e. nr_running < group_capacity_factor. The
>> + * these excess tasks, i.e. group_capacity > 0. The
>> * extra check prevents the case where you always pull from the
>> * heaviest group when it is already under-utilized (possible
>> * with a large weight task outweighs the tasks on the system).
>> */
>> if (prefer_sibling && sds->local &&
>> - sds->local_stat.group_has_free_capacity)
>> - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
>> + group_has_capacity(env, &sds->local_stat)) {
>> + if (sgs->sum_nr_running > 1)
>> + sgs->group_no_capacity = 1;
>> + sgs->group_capacity = min(sgs->group_capacity,
>> + SCHED_CAPACITY_SCALE);
>> + }
>>
>> if (update_sd_pick_busiest(env, sds, sg, sgs)) {
>> sds->busiest = sg;
>
> So this is your PREFER_SIBLING implementation, why is this a good one?
>
> That is, the current PREFER_SIBLING works because we account against
> nr_running, and setting it to 1 makes 2 tasks too much and we end up
> moving stuff away.
>
> But if I understand things right, we're now measuring tasks in
> 'utilization' against group_capacity, so setting group_capacity to
> CAPACITY_SCALE, means we can end up with many tasks on the one cpu
> before we move over to another group, right?

I would say no, because we don't have to be overloaded to migrate a
task, so the load balance can occur before we become overloaded. The
main difference is that a group is more accurately tagged as overloaded
than previously.

The update of the capacity field is only used when calculating the
imbalance and the average load on the sched_domain, but the group has
already been classified (group_overloaded or other). If we have more
than 1 task, we overwrite the status of group_no_capacity to true. The
updated capacity will then be used to cap the max amount of load that
will be pulled, in order to ensure that the busiest group will not
become idle.
So for the prefer-sibling case, we do not take the utilization and
capacity metrics into account but instead check whether more than 1
task is running in this group (which is what you proposed below).

Then, we keep the same mechanism as with capacity_factor for
calculating the imbalance.

This update of PREFER_SIBLING is quite similar to the previous one,
except that we directly use nr_running instead of
group_capacity_factor and we force the group state to "no more capacity".
Regards,
Vincent

>
> So I think that for 'idle' systems we want to do the
> nr_running/work-conserving thing -- get as many cpus running
> 'something' and avoid queueing like the plague.
>
> Then when there's some queueing, we want to go do the utilization thing,
> basically minimize queueing by leveling utilization.
>
> Once all cpus are fully utilized, we switch to fair/load based balancing
> and try and get equal load on cpus.
>
> Does that make sense?
>
>
> If so, how about adding a group_type and splitting group_other into say
> group_idle and group_util:
>
> enum group_type {
> group_idle = 0,
> group_util,
> group_imbalanced,
> group_overloaded,
> }
>
> we change group_classify() into something like:
>
> if (sgs->group_usage > sgs->group_capacity)
> return group_overloaded;
>
> if (sg_imbalanced(group))
> return group_imbalanced;
>
> if (sgs->nr_running < sgs->weight)
> return group_idle;
>
> return group_util;
>
>
> And then have update_sd_pick_busiest() something like:
>
> if (sgs->group_type > busiest->group_type)
> return true;
>
> if (sgs->group_type < busiest->group_type)
> return false;
>
> switch (sgs->group_type) {
> case group_idle:
> if (sgs->nr_running < busiest->nr_running)
> return false;
> break;
>
> case group_util:
> if (sgs->group_usage < busiest->group_usage)
> return false;
> break;
>
> default:
> if (sgs->avg_load < busiest->avg_load)
> return false;
> break;
> }
>
> ....
>
>
> And then some calculate_imbalance() magic to complete it..
>
>
> If we have that, we can play tricks with the exact busiest condition in
> update_sd_pick_busiest() to implement PREFER_SIBLING or so.
>
> Makes sense?

2014-11-23 01:03:26

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

Hi Vincent,
On 10/9/14, 10:18 PM, Vincent Guittot wrote:
> On 9 October 2014 14:16, Peter Zijlstra <[email protected]> wrote:
>> On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
>>> +static inline bool
>>> +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
>>> {
>>> + if ((sgs->group_capacity * 100) >
>>> + (sgs->group_usage * env->sd->imbalance_pct))
>>> + return true;
>> Why the imb_pct there? We're looking for 100% utilization, not 130 or
>> whatnot, right?
> Having exactly 100% is quite difficult because of various rounding.

Could you give some examples of the various rounding?

Regards,
Wanpeng Li

> So i have added a margin/threshold to prevent any excessive change of the state.
> I have just to use the same margin/threshold than in other place in
> load balance.
>
> so the current threshold depends of the sched_level. it's around 14% at MC level
>
>>> + if (sgs->sum_nr_running < sgs->group_weight)
>>> + return true;
>> With the code as it stands, this is the cheaper test (no mults) so why
>> is it second?
>>
>>> + return false;
>>> +}
>>>
>>> +static inline bool
>>> +group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
>>> +{
>>> + if (sgs->sum_nr_running <= sgs->group_weight)
>>> + return false;
>>> +
>>> + if ((sgs->group_capacity * 100) <
>>> + (sgs->group_usage * env->sd->imbalance_pct))
>>> + return true;
>>>
>>> + return false;
>>> }
>> Same thing here wrt imb_pct

2014-11-24 10:16:31

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 6/7] sched: replace capacity_factor by usage

On 23 November 2014 at 02:03, Wanpeng Li <[email protected]> wrote:
> Hi Vincent,
> On 10/9/14, 10:18 PM, Vincent Guittot wrote:
>>
>> On 9 October 2014 14:16, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
>>>>
>>>> +static inline bool
>>>> +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
>>>> {
>>>> + if ((sgs->group_capacity * 100) >
>>>> + (sgs->group_usage * env->sd->imbalance_pct))
>>>> + return true;
>>>
>>> Why the imb_pct there? We're looking for 100% utilization, not 130 or
>>> whatnot, right?
>>
>> Having exactly 100% is quite difficult because of various rounding.
>
>
> Could you give some examples about the various rounding?

As an example, while I was monitoring the runnable_avg_sum and the
runnable_avg_period, they were varying a bit around the max value but
were not always the same, which can change the load_avg_contrib by a few
units.
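
As a rough illustration (LOAD_AVG_MAX is 47742 and the contrib
computation divides by the period + 1):

#include <stdio.h>

int main(void)
{
	unsigned long sum = 47742, period = 47742;	/* both pegged at LOAD_AVG_MAX */
	unsigned long contrib = 1024UL * sum / (period + 1);

	/* prints 1023: even a permanently busy CPU never reports exactly 1024 */
	printf("%lu\n", contrib);
	return 0;
}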

Regards,
Vincent

>
> Regards,
> Wanpeng Li
>
>> So i have added a margin/threshold to prevent any excessive change of the
>> state.
>> I have just to use the same margin/threshold than in other place in
>> load balance.
>>
>> so the current threshold depends of the sched_level. it's around 14% at MC
>> level
>>
>>>> + if (sgs->sum_nr_running < sgs->group_weight)
>>>> + return true;
>>>
>>> With the code as it stands, this is the cheaper test (no mults) so why
>>> is it second?
>>>
>>>> + return false;
>>>> +}
>>>>
>>>> +static inline bool
>>>> +group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
>>>> +{
>>>> + if (sgs->sum_nr_running <= sgs->group_weight)
>>>> + return false;
>>>> +
>>>> + if ((sgs->group_capacity * 100) <
>>>> + (sgs->group_usage * env->sd->imbalance_pct))
>>>> + return true;
>>>>
>>>> + return false;
>>>> }
>>>
>>> Same thing here wrt imb_pct
>>
>
>