Part of this patchset was previously part of the larger tasks packing patchset
[1]. I have split the latter into (at least) 3 different patchsets to make
things easier:
- configuration of sched_domain topology [2]
- update and consolidation of cpu_power (this patchset)
- tasks packing algorithm
SMT systems are no longer the only systems that can have CPUs with an original
capacity different from the default value. We need to extend the use of
cpu_power_orig to all kinds of platforms so that the scheduler has both the
maximum capacity (cpu_power_orig/power_orig) and the current capacity
(cpu_power/power) of CPUs and sched_groups. A new function,
arch_scale_cpu_power, has been created and replaces arch_scale_smt_power,
which is SMT-specific, in the computation of the capacity of a CPU.
During load balance, the scheduler evaluates the number of tasks that a group
of CPUs can handle. The current method assumes that tasks have a fixed load of
SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_POWER_SCALE.
This assumption generates wrong decisions by creating ghost cores and by
removing real ones when the original capacity of the CPUs differs from the
default SCHED_POWER_SCALE. We no longer try to evaluate the number of
available cores based on the group_capacity; instead, we detect when the group
is fully utilized.
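To illustrate the problem with the old capacity_factor (the figures here are
only indicative): for a cluster of 4 CPUs whose original capacity is about 600
each instead of the default 1024, the current code computes
smt = DIV_ROUND_UP(4 * 1024, 2400) = 2 and thus capacity_factor = 4 / 2 = 2,
so the group is credited with only 2 available cores although it contains 4
real CPUs.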
Now that we have the original capacity of the CPUs and their
activity/utilization, we can evaluate more accurately the capacity and the
level of utilization of a group of CPUs.
This patchset mainly replaces the old capacity method with a new one and keeps
the policy almost unchanged, although we could certainly take advantage of this
new statistic in several other places in the load balancer.
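For reference, the overload detection that replaces capacity_factor later in
the series ("sched: replace capacity_factor by utilization") boils down to the
following check (a condensed sketch; the field and function names are those of
that patch):

static inline int group_is_overloaded(struct sg_lb_stats *sgs,
				      struct lb_env *env)
{
	/*
	 * The group is overloaded when its utilization exceeds its original
	 * capacity (with the imbalance_pct margin) and it runs more tasks
	 * than it has CPUs; group_has_free_capacity() is the symmetric test.
	 */
	if ((sgs->group_capacity_orig * 100) <
	    (sgs->group_utilization * env->sd->imbalance_pct)
	    && (sgs->sum_nr_running > sgs->group_weight))
		return 1;

	return 0;
}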
Test results:
I have put below the results of 3 kinds of tests:
- hackbench -l 500 -s 4096
- scp of a 100MB file on the platform
- ebizzy with various numbers of threads
on 3 kernels:
tip = tip/sched/core
patch = tip + this patchset
patch+irq = tip + this patchset + irq accounting
Each test has been run 6 times, and the figures below show the stdev and the
diff compared to the tip kernel.
Dual Cortex-A7       tip          |       patch          |      patch+irq
                    stdev         |  diff      stdev     |  diff      stdev
hackbench        (+/-)1.02%       | +0.42%  (+/-)1.29%   | -0.65%  (+/-)0.44%
scp              (+/-)0.41%       | +0.18%  (+/-)0.10%   | +78.05% (+/-)0.70%
ebizzy -t 1      (+/-)1.72%       | +1.43%  (+/-)1.62%   | +2.58%  (+/-)2.11%
ebizzy -t 2      (+/-)0.42%       | +0.06%  (+/-)0.45%   | +1.45%  (+/-)4.05%
ebizzy -t 4      (+/-)0.73%       | +8.39%  (+/-)13.25%  | +4.25%  (+/-)10.08%
ebizzy -t 6      (+/-)10.30%      | +2.19%  (+/-)3.59%   | +0.58%  (+/-)1.80%
ebizzy -t 8      (+/-)1.45%       | -0.05%  (+/-)2.18%   | +2.53%  (+/-)3.40%
ebizzy -t 10     (+/-)3.78%       | -2.71%  (+/-)2.79%   | -3.16%  (+/-)3.06%
ebizzy -t 12     (+/-)3.21%       | +1.13%  (+/-)2.02%   | -1.13%  (+/-)4.43%
ebizzy -t 14     (+/-)2.05%       | +0.15%  (+/-)3.47%   | -2.08%  (+/-)1.40%
Quad Cortex-A15      tip          |       patch          |      patch+irq
                    stdev         |  diff      stdev     |  diff      stdev
hackbench        (+/-)0.55%       | -0.58%  (+/-)0.90%   | +0.62%  (+/-)0.43%
scp              (+/-)0.21%       | -0.10%  (+/-)0.10%   | +5.70%  (+/-)0.53%
ebizzy -t 1      (+/-)0.42%       | -0.58%  (+/-)0.48%   | -0.29%  (+/-)0.18%
ebizzy -t 2      (+/-)0.52%       | -0.83%  (+/-)0.20%   | -2.07%  (+/-)0.35%
ebizzy -t 4      (+/-)0.22%       | -1.39%  (+/-)0.49%   | -1.78%  (+/-)0.67%
ebizzy -t 6      (+/-)0.44%       | -0.78%  (+/-)0.15%   | -1.79%  (+/-)1.10%
ebizzy -t 8      (+/-)0.43%       | +0.13%  (+/-)0.92%   | -0.17%  (+/-)0.67%
ebizzy -t 10     (+/-)0.71%       | +0.10%  (+/-)0.93%   | -0.36%  (+/-)0.77%
ebizzy -t 12     (+/-)0.65%       | -1.07%  (+/-)1.13%   | -1.13%  (+/-)0.70%
ebizzy -t 14     (+/-)0.92%       | -0.28%  (+/-)1.25%   | +2.84%  (+/-)9.33%
I haven't been able to fully test the patchset on an SMT system to check that
the regression reported by Preeti has been solved, but the various tests that
I have done don't show any regression so far.
The correction of the SD_PREFER_SIBLING mode and its use at the SMT level
should have fixed the regression.
Changes since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from a CPU with reduced capacity
- rename group_activity to group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it at the SMT level
- reorder the patchset to gather patches with the same topics
Changes since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account the current cpu capacity
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
Vincent Guittot (12):
sched: fix imbalance flag reset
sched: remove a wake_affine condition
sched: fix avg_load computation
sched: Allow all archs to set the power_orig
ARM: topology: use new cpu_power interface
sched: add per rq cpu_power_orig
sched: test the cpu's capacity in wake affine
sched: move cfs task on a CPU with higher capacity
Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"
sched: get CPU's utilization statistic
sched: replace capacity_factor by utilization
sched: add SD_PREFER_SIBLING for SMT level
arch/arm/kernel/topology.c | 4 +-
kernel/sched/core.c | 3 +-
kernel/sched/fair.c | 290 +++++++++++++++++++++++----------------------
kernel/sched/sched.h | 5 +-
4 files changed, 158 insertions(+), 144 deletions(-)
--
1.9.1
The imbalance flag can stay set even though there is no imbalance.
Let's assume that we have 3 tasks running on a dual-core / dual-cluster system.
Some idle load balances will be triggered during the tick.
Unfortunately, the tick is also used to queue background work, so we can reach
the situation where a short work item has been queued on a CPU which already
runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and
an idle CPU) and will try to pull the waiting task onto the idle CPU. The
waiting task is a worker thread that is pinned to a CPU, so an imbalance due to
a pinned task is detected and the imbalance flag is set.
Then, we will not be able to clear the flag because we have at most 1 task on
each CPU, but the imbalance flag will trigger useless active load balancing
between the idle CPU and the busy CPU.
We need to reset the imbalance flag as soon as we have reached a balanced
state.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3c73122..0c48dff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6615,10 +6615,8 @@ more_balance:
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
- if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+ if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
*group_imbalance = 1;
- } else if (*group_imbalance)
- *group_imbalance = 0;
}
/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6703,6 +6701,16 @@ more_balance:
goto out;
out_balanced:
+ /*
+ * We reach balance although we may have faced some affinity
+ * constraints. Clear the imbalance flag if it was set.
+ */
+ if (sd_parent) {
+ int *group_imbalance = &sd_parent->groups->sgc->imbalance;
+ if (*group_imbalance)
+ *group_imbalance = 0;
+ }
+
schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
--
1.9.1
This new field, cpu_power_orig, reflects the available capacity of a CPU,
unlike cpu_power, which reflects the current capacity that can be altered by
frequency scaling and rt tasks.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54f5722..92dd6d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6964,7 +6964,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
- rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+ rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f0bba5f..7dff578 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5671,6 +5671,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = capacity;
sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f86361..7b311de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -568,6 +568,7 @@ struct rq {
struct sched_domain *sd;
unsigned long cpu_capacity;
+ unsigned long cpu_capacity_orig;
unsigned char idle_balance;
/* For active balancing */
--
1.9.1
The computation of avg_load and avg_load_per_task should only take into
account the number of CFS tasks. Non-CFS tasks are already taken into account
by decreasing the CPU's capacity (cpu_power), and they will be tracked in the
CPU's utilization (group_utilization) in the next patches.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6dba48..148b277 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4051,7 +4051,7 @@ static unsigned long capacity_of(int cpu)
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+ unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
unsigned long load_avg = rq->cfs.runnable_load_avg;
if (nr_running)
@@ -5865,7 +5865,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);
sgs->group_load += load;
- sgs->sum_nr_running += rq->nr_running;
+ sgs->sum_nr_running += rq->cfs.h_nr_running;
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
--
1.9.1
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
We are going to use runnable_avg_sum and runnable_avg_period in order to get
the utilization of the CPU. This statistic includes all tasks that run on the
CPU and not only CFS tasks.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 13 ++++++-------
kernel/sched/sched.h | 4 ++--
2 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 742ad88..bdc5cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2428,19 +2428,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
se->avg.load_avg_contrib >>= NICE_0_SHIFT;
}
}
-
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
- __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
- __update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
#else /* CONFIG_FAIR_GROUP_SCHED */
static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
int force_update) {}
static inline void __update_tg_runnable_avg(struct sched_avg *sa,
struct cfs_rq *cfs_rq) {}
static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
#endif /* CONFIG_FAIR_GROUP_SCHED */
static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2539,6 +2532,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
}
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+ __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+ __update_tg_runnable_avg(&rq->avg, &rq->cfs);
+}
+
/* Add the load generated by se into cfs_rq's child load-average */
static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b311de..fb53166 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,8 +542,6 @@ struct rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this cpu: */
struct list_head leaf_cfs_rq_list;
-
- struct sched_avg avg;
#endif /* CONFIG_FAIR_GROUP_SCHED */
/*
@@ -634,6 +632,8 @@ struct rq {
#ifdef CONFIG_SMP
struct llist_head wake_list;
#endif
+
+ struct sched_avg avg;
};
static inline int cpu_of(struct rq *rq)
--
1.9.1
If the CPU is used for handling a lot of IRQs, trigger a load balance to check
whether it's worth moving its tasks to another CPU that has more capacity.
As a side note, this will not generate more spurious ilbs because we already
trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one
that has a task, we will trigger the ilb once to migrate the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new
test.
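To put a number on the new test (assuming an imbalance_pct of 125, the default
for most levels; the SMT level uses 110): the condition
(cpu_capacity * sd->imbalance_pct) < (cpu_capacity_orig * 100) becomes true
once the current capacity drops below 100/125 = 80% of the original capacity,
i.e. the nohz kick and the new active balance condition only fire once RT
tasks and/or IRQs have consumed more than roughly 20% of the CPU's original
capacity.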
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 58 +++++++++++++++++++++++++++++++++++++----------------
1 file changed, 41 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a23c938..742ad88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5944,6 +5944,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
return true;
}
+ /*
+ * The group capacity is reduced probably because of activity from other
+ * sched class or interrupts which use part of the available capacity
+ */
+ if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
+ env->sd->imbalance_pct))
+ return true;
+
return false;
}
@@ -6421,13 +6429,24 @@ static int need_active_balance(struct lb_env *env)
struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) {
+ int src_cpu = env->src_cpu;
/*
* ASYM_PACKING needs to force migrate tasks from busy but
* higher numbered CPUs in order to pack all tasks in the
* lowest numbered CPUs.
*/
- if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
+ if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
+ return 1;
+
+ /*
+ * If the CPUs share their cache and the src_cpu's capacity is
+ * reduced because of other sched_class or IRQs, we trig an
+ * active balance to move the task
+ */
+ if ((sd->flags & SD_SHARE_PKG_RESOURCES)
+ && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
+ sd->imbalance_pct)))
return 1;
}
@@ -6529,6 +6548,8 @@ redo:
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
+ env.src_cpu = busiest->cpu;
+
ld_moved = 0;
if (busiest->nr_running > 1) {
/*
@@ -6538,7 +6559,6 @@ redo:
* correctly treated as an imbalance.
*/
env.flags |= LBF_ALL_PINNED;
- env.src_cpu = busiest->cpu;
env.src_rq = busiest;
env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
@@ -7233,9 +7253,10 @@ static inline int nohz_kick_needed(struct rq *rq)
struct sched_domain *sd;
struct sched_group_capacity *sgc;
int nr_busy, cpu = rq->cpu;
+ bool kick = false;
if (unlikely(rq->idle_balance))
- return 0;
+ return false;
/*
* We may be recently in ticked or tickless idle mode. At the first
@@ -7249,38 +7270,41 @@ static inline int nohz_kick_needed(struct rq *rq)
* balancing.
*/
if (likely(!atomic_read(&nohz.nr_cpus)))
- return 0;
+ return false;
if (time_before(now, nohz.next_balance))
- return 0;
+ return false;
if (rq->nr_running >= 2)
- goto need_kick;
+ return true;
rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
if (sd) {
sgc = sd->groups->sgc;
nr_busy = atomic_read(&sgc->nr_busy_cpus);
- if (nr_busy > 1)
- goto need_kick_unlock;
+ if (nr_busy > 1) {
+ kick = true;
+ goto unlock;
+ }
+
+ if ((rq->cfs.h_nr_running >= 1)
+ && ((rq->cpu_capacity * sd->imbalance_pct) <
+ (rq->cpu_capacity_orig * 100))) {
+ kick = true;
+ goto unlock;
+ }
}
sd = rcu_dereference(per_cpu(sd_asym, cpu));
-
if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
sched_domain_span(sd)) < cpu))
- goto need_kick_unlock;
+ kick = true;
+unlock:
rcu_read_unlock();
- return 0;
-
-need_kick_unlock:
- rcu_read_unlock();
-need_kick:
- return 1;
+ return kick;
}
#else
static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
--
1.9.1
Add the SD_PREFER_SIBLING flag at the SMT level in order to ensure that
the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 92dd6d1..9103669 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6068,6 +6068,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
*/
if (sd->flags & SD_SHARE_CPUCAPACITY) {
+ sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
--
1.9.1
The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and that a CPU's capacity is
SCHED_POWER_SCALE.
Thanks to the rework of group_power_orig and the new group_utilization, we now
have a better idea of the capacity of a group of CPUs and of its utilization,
and we can deduce how much capacity is still available.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 107 +++++++++++++++++++---------------------------------
1 file changed, 38 insertions(+), 69 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6b4b25..cf65284 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5534,13 +5534,13 @@ struct sg_lb_stats {
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long load_per_task;
unsigned long group_capacity;
+ unsigned long group_capacity_orig;
unsigned long group_utilization; /* Total utilization of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
- unsigned int group_capacity_factor;
unsigned int idle_cpus;
unsigned int group_weight;
int group_imb; /* Is there an imbalance in the group ? */
- int group_has_free_capacity;
+ int group_out_of_capacity;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -5781,31 +5781,6 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
}
/*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
- /*
- * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
- */
- if (!(sd->flags & SD_SHARE_CPUCAPACITY))
- return 0;
-
- /*
- * If ~90% of the cpu_capacity is still there, we're good.
- */
- if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
- return 1;
-
- return 0;
-}
-
-/*
* Group imbalance indicates (and tries to solve) the problem where balancing
* groups is inadequate due to tsk_cpus_allowed() constraints.
*
@@ -5839,32 +5814,24 @@ static inline int sg_imbalanced(struct sched_group *group)
return group->sgc->imbalance;
}
-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
+ struct lb_env *env)
{
- unsigned int capacity_factor, smt, cpus;
- unsigned int capacity, capacity_orig;
-
- capacity = group->sgc->capacity;
- capacity_orig = group->sgc->capacity_orig;
- cpus = group->group_weight;
+ if ((sgs->group_capacity_orig * 100) > (sgs->group_utilization * env->sd->imbalance_pct)
+ || (sgs->sum_nr_running < sgs->group_weight))
+ return 1;
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
- smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
- capacity_factor = cpus / smt; /* cores */
+ return 0;
+}
- capacity_factor = min_t(unsigned,
- capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
+ struct lb_env *env)
+{
+ if ((sgs->group_capacity_orig * 100) < (sgs->group_utilization * env->sd->imbalance_pct)
+ && (sgs->sum_nr_running > sgs->group_weight))
+ return 1;
- return capacity_factor;
+ return 0;
}
/**
@@ -5905,6 +5872,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;
}
+ sgs->group_capacity_orig = group->sgc->capacity_orig;
/* Adjust by relative CPU capacity of the group */
sgs->group_capacity = group->sgc->capacity;
sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
@@ -5915,10 +5883,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_weight = group->group_weight;
sgs->group_imb = sg_imbalanced(group);
- sgs->group_capacity_factor = sg_capacity_factor(env, group);
- if (sgs->group_capacity_factor > sgs->sum_nr_running)
- sgs->group_has_free_capacity = 1;
+ sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
}
/**
@@ -5942,7 +5908,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (sgs->avg_load <= sds->busiest_stat.avg_load)
return false;
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ if (sgs->group_out_of_capacity)
return true;
if (sgs->group_imb)
@@ -6041,17 +6007,20 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/*
* In case the child domain prefers tasks go to siblings
- * first, lower the sg capacity factor to one so that we'll try
+ * first, lower the sg capacity to one so that we'll try
* and move all the excess tasks away. We lower the capacity
* of a group only if the local group has the capacity to fit
- * these excess tasks, i.e. nr_running < group_capacity_factor. The
+ * these excess tasks, i.e. group_capacity > 0. The
* extra check prevents the case where you always pull from the
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
if (prefer_sibling && sds->local &&
- sds->local_stat.group_has_free_capacity)
- sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+ group_has_free_capacity(&sds->local_stat, env)) {
+ if (sgs->sum_nr_running > 1)
+ sgs->group_out_of_capacity = 1;
+ sgs->group_capacity = min(sgs->group_capacity, SCHED_CAPACITY_SCALE);
+ }
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
@@ -6223,11 +6192,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* Except of course for the group_imb case, since then we might
* have to drop below capacity to reach cpu-load equilibrium.
*/
- load_above_capacity =
- (busiest->sum_nr_running - busiest->group_capacity_factor);
-
- load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
- load_above_capacity /= busiest->group_capacity;
+ load_above_capacity = busiest->sum_nr_running * SCHED_LOAD_SCALE;
+ if (load_above_capacity > busiest->group_capacity)
+ load_above_capacity -= busiest->group_capacity;
+ else
+ load_above_capacity = ~0UL;
}
/*
@@ -6290,6 +6259,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
local = &sds.local_stat;
busiest = &sds.busiest_stat;
+ /* ASYM feature bypasses nice load balance check */
if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
check_asym_packing(env, &sds))
return sds.busiest;
@@ -6310,8 +6280,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
- if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
- !busiest->group_has_free_capacity)
+ if (env->idle == CPU_NEWLY_IDLE && group_has_free_capacity(local, env) &&
+ busiest->group_out_of_capacity)
goto force_balance;
/*
@@ -6369,7 +6339,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
- unsigned long capacity, capacity_factor, wl;
+ unsigned long capacity, wl;
enum fbq_type rt;
rq = cpu_rq(i);
@@ -6398,9 +6368,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
continue;
capacity = capacity_of(i);
- capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);
wl = weighted_cpuload(i);
@@ -6408,7 +6375,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu capacity.
*/
- if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+ if (rq->nr_running == 1 && wl > env->imbalance &&
+ ((capacity * env->sd->imbalance_pct) >= (rq->cpu_capacity_orig * 100)))
continue;
/*
--
1.9.1
Monitor the utilization level of each group at each sched_domain level. The
utilization is the amount of cpu_power that is currently used on a CPU or a
group of CPUs. We use runnable_avg_sum and runnable_avg_period to evaluate this
utilization level. In the special case where the CPU is fully loaded by more
than 1 task, the utilization level is set above the cpu_power in order to
reflect the overload of the CPU.
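As a quick sanity check of the formula (illustrative numbers only): with
runnable_avg_sum = 256, runnable_avg_period = 1024 and
capacity_orig_of(cpu) = 1024, get_cpu_utilization() returns
(256 * 1024) / 1024 = 256, i.e. the CPU is about 25% utilized; as soon as
runnable_avg_sum reaches runnable_avg_period, the function returns
capacity_orig_of(cpu) + 1 so that a fully loaded CPU is reported as slightly
more than fully utilized.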
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bdc5cb9..a6b4b25 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4047,6 +4047,11 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}
+static unsigned long capacity_orig_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -4437,6 +4442,18 @@ done:
return target;
}
+static int get_cpu_utilization(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ u32 sum = rq->avg.runnable_avg_sum;
+ u32 period = rq->avg.runnable_avg_period;
+
+ if (sum >= period)
+ return capacity_orig_of(cpu) + 1;
+
+ return (sum * capacity_orig_of(cpu)) / period;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5517,6 +5534,7 @@ struct sg_lb_stats {
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long load_per_task;
unsigned long group_capacity;
+ unsigned long group_utilization; /* Total utilization of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
unsigned int group_capacity_factor;
unsigned int idle_cpus;
@@ -5876,6 +5894,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);
sgs->group_load += load;
+ sgs->group_utilization += get_cpu_utilization(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
@@ -6043,7 +6062,6 @@ next_group:
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
-
sg = sg->next;
} while (sg != env->sd->groups);
--
1.9.1
Currently, the task always wakes affine on this_cpu if the latter is idle.
Before waking up the task on this_cpu, we now check that this_cpu's capacity is
not significantly reduced because of RT tasks or IRQ activity.
Use cases where the number of IRQs and/or the time spent under IRQ is
significant will benefit from this, because the task that is woken up by an IRQ
or softirq will not use the same CPU as the IRQ (and softirq) but an idle one
which shares its cache.
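To illustrate the effect of the capacity terms (illustrative numbers, assuming
an imbalance_pct of 125, so prev_eff_load starts at 100 + (125 - 100) / 2 =
112): when this_cpu is idle (this_load == 0) and shares its cache with
prev_cpu, the wake-up is considered balanced only if
100 * capacity_of(prev_cpu) <= 112 * capacity_of(this_cpu), i.e. only if
this_cpu retains at least about 89% of prev_cpu's capacity; if RT tasks or
IRQs have eaten more than that, the affine wake-up on this_cpu is rejected.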
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7dff578..a23c938 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4240,6 +4240,7 @@ static int wake_wide(struct task_struct *p)
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
s64 this_load, load;
+ s64 this_eff_load, prev_eff_load;
int idx, this_cpu, prev_cpu;
struct task_group *tg;
unsigned long weight;
@@ -4283,21 +4284,23 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
* Otherwise check if either cpus are near enough in load to allow this
* task to be woken on this_cpu.
*/
- if (this_load > 0) {
- s64 this_eff_load, prev_eff_load;
+ this_eff_load = 100;
+ this_eff_load *= capacity_of(prev_cpu);
+
+ prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+ prev_eff_load *= capacity_of(this_cpu);
- this_eff_load = 100;
- this_eff_load *= capacity_of(prev_cpu);
+ if (this_load > 0) {
this_eff_load *= this_load +
effective_load(tg, this_cpu, weight, weight);
- prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
- prev_eff_load *= capacity_of(this_cpu);
prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+ } else if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+ this_eff_load = 0;
+ }
+
+ balanced = this_eff_load <= prev_eff_load;
- balanced = this_eff_load <= prev_eff_load;
- } else
- balanced = true;
schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
if (!balanced)
--
1.9.1
Use the new arch_scale_cpu_power in order to reflect the original capacity of
a CPU instead of arch_scale_freq_power, which is more about scaling the
capacity with the frequency.
Signed-off-by: Vincent Guittot <[email protected]>
---
arch/arm/kernel/topology.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index d42a7db..2310bfb 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
*/
static DEFINE_PER_CPU(unsigned long, cpu_scale);
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
return per_cpu(cpu_scale, cpu);
}
@@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
- cpu, arch_scale_freq_capacity(NULL, cpu));
+ cpu, arch_scale_cpu_capacity(NULL, cpu));
}
#else
--
1.9.1
power_orig is only changed for systems with an SMT sched_domain level in order
to reflect the lower capacity of the CPUs. Heterogeneous systems also have to
reflect an original capacity that is different from the default value.
Create a more generic function, arch_scale_cpu_power, that can also be used by
non-SMT platforms to set power_orig.
The weak implementation of arch_scale_cpu_power keeps the previous SMT
behaviour in order to preserve backward compatibility in the use of
power_orig.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 148b277..f0bba5f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5614,6 +5614,20 @@ unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
return default_scale_smt_capacity(sd, cpu);
}
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+ unsigned long weight = sd->span_weight;
+
+ if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
+ if (sched_feat(ARCH_CAPACITY))
+ return arch_scale_smt_capacity(sd, cpu);
+ else
+ return default_scale_smt_capacity(sd, cpu);
+ }
+
+ return SCHED_CAPACITY_SCALE;
+}
+
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -5650,18 +5664,12 @@ static unsigned long scale_rt_capacity(int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long weight = sd->span_weight;
unsigned long capacity = SCHED_CAPACITY_SCALE;
struct sched_group *sdg = sd->groups;
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
- if (sched_feat(ARCH_CAPACITY))
- capacity *= arch_scale_smt_capacity(sd, cpu);
- else
- capacity *= default_scale_smt_capacity(sd, cpu);
+ capacity *= arch_scale_cpu_capacity(sd, cpu);
- capacity >>= SCHED_CAPACITY_SHIFT;
- }
+ capacity >>= SCHED_CAPACITY_SHIFT;
sdg->sgc->capacity_orig = capacity;
--
1.9.1
I have tried to understand the meaning of the condition:
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it, and I haven't
found a clear description in the previous commits' logs.
Furthermore, the comment on the condition refers to the task_hot function that
was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */
If we look more deeply at the condition below:
 this_load + target_load(prev_cpu, idx) <= tl_per_task
When sync is clear, we have:
 tl_per_task = runnable_load_avg / nr_running
 this_load = max(runnable_load_avg, cpuload[idx])
 target_load = max(runnable_load_avg', cpuload'[idx])
It implies that runnable_load_avg' == 0 and nr_running <= 1 in order to match
the condition. This implies that runnable_load_avg == 0 too, because of the
condition this_load <= load; but if this_load is null, balanced is already set
and the test is redundant.
If sync is set, it's not as straightforward as above (especially if cgroups
are involved), but the policy should be similar, as we have removed a task
that's going to sleep in order to get more accurate load and this_load values.
The current conclusion is that these additional conditions don't give any
benefit, so we can remove them.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 30 ++++++------------------------
1 file changed, 6 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c48dff..c6dba48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4241,7 +4241,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
s64 this_load, load;
int idx, this_cpu, prev_cpu;
- unsigned long tl_per_task;
struct task_group *tg;
unsigned long weight;
int balanced;
@@ -4299,32 +4298,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
balanced = this_eff_load <= prev_eff_load;
} else
balanced = true;
-
- /*
- * If the currently running task will sleep within
- * a reasonable amount of time then attract this newly
- * woken task:
- */
- if (sync && balanced)
- return 1;
-
schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
- tl_per_task = cpu_avg_load_per_task(this_cpu);
- if (balanced ||
- (this_load <= load &&
- this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
- /*
- * This domain has SD_WAKE_AFFINE and
- * p is cache cold in this domain, and
- * there is no bad imbalance.
- */
- schedstat_inc(sd, ttwu_move_affine);
- schedstat_inc(p, se.statistics.nr_wakeups_affine);
+ if (!balanced)
+ return 0;
- return 1;
- }
- return 0;
+ schedstat_inc(sd, ttwu_move_affine);
+ schedstat_inc(p, se.statistics.nr_wakeups_affine);
+
+ return 1;
}
/*
--
1.9.1
On 06/30/2014 09:35 PM, Vincent Guittot wrote:
> The imbalance flag can stay set whereas there is no imbalance.
>
> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
> We will have some idle load balance which are triggered during tick.
> Unfortunately, the tick is also used to queue background work so we can reach
> the situation where short work has been queued on a CPU which already runs a
> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
> detected and the imbalance flag is set.
> Then, we will not be able to clear the flag because we have at most 1 task on
> each CPU but the imbalance flag will trig to useless active load balance
> between the idle CPU and the busy CPU.
>
> We need to reset of the imbalance flag as soon as we have reached a balanced
> state.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3c73122..0c48dff 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6615,10 +6615,8 @@ more_balance:
> if (sd_parent) {
> int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>
> - if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
> + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
> *group_imbalance = 1;
> - } else if (*group_imbalance)
> - *group_imbalance = 0;
> }
>
> /* All tasks on this runqueue were pinned by CPU affinity */
> @@ -6703,6 +6701,16 @@ more_balance:
> goto out;
>
> out_balanced:
> + /*
> + * We reach balance although we may have faced some affinity
> + * constraints. Clear the imbalance flag if it was set.
> + */
> + if (sd_parent) {
> + int *group_imbalance = &sd_parent->groups->sgc->imbalance;
> + if (*group_imbalance)
> + *group_imbalance = 0;
> + }
> +
> schedstat_inc(sd, lb_balanced[idle]);
>
> sd->nr_balance_failed = 0;
>
I am not convinced that we can clear the imbalance flag here. Let's take
a simple example. Assume that at a particular level of the sched_domain there
are two sched_groups with one cpu each. There are 2 tasks on the source
cpu, one of which is running (t1), and the other thread (t2) does not have
the dst_cpu in its tsk_allowed_mask. Now no task can be migrated to the
dst_cpu due to affinity constraints. Note that t2 is *not pinned, it
just cannot run on the dst_cpu*. In this scenario we also reach the
out_balanced tag, right? If we set the group_imbalance flag to 0, we are
ruling out the possibility of migrating t2 to any other cpu in a higher
level of sched_domain by saying that all is well and there is no imbalance.
This is wrong, isn't it?
My point is that by clearing the imbalance flag in the out_balanced
case, you might be overlooking the fact that, given their tsk_cpus_allowed
masks, the tasks on the src_cpu may not be able to run on the dst_cpu at
*this* level of the sched_domain, but can potentially run on a cpu at a
higher level of the sched_domain. By clearing the flag, we are not
encouraging load balancing at that level for t2.
Am I missing something?
Regards
Preeti U Murthy
On 8 July 2014 05:13, Preeti U Murthy <[email protected]> wrote:
> On 06/30/2014 09:35 PM, Vincent Guittot wrote:
>> The imbalance flag can stay set whereas there is no imbalance.
>>
>> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
>> We will have some idle load balance which are triggered during tick.
>> Unfortunately, the tick is also used to queue background work so we can reach
>> the situation where short work has been queued on a CPU which already runs a
>> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
>> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
>> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
>> detected and the imbalance flag is set.
>> Then, we will not be able to clear the flag because we have at most 1 task on
>> each CPU but the imbalance flag will trig to useless active load balance
>> between the idle CPU and the busy CPU.
>>
>> We need to reset of the imbalance flag as soon as we have reached a balanced
>> state.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 14 +++++++++++---
>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d3c73122..0c48dff 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6615,10 +6615,8 @@ more_balance:
>> if (sd_parent) {
>> int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>
>> - if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
>> + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
>> *group_imbalance = 1;
>> - } else if (*group_imbalance)
>> - *group_imbalance = 0;
>> }
>>
>> /* All tasks on this runqueue were pinned by CPU affinity */
>> @@ -6703,6 +6701,16 @@ more_balance:
>> goto out;
>>
>> out_balanced:
>> + /*
>> + * We reach balance although we may have faced some affinity
>> + * constraints. Clear the imbalance flag if it was set.
>> + */
>> + if (sd_parent) {
>> + int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>> + if (*group_imbalance)
>> + *group_imbalance = 0;
>> + }
>> +
>> schedstat_inc(sd, lb_balanced[idle]);
>>
>> sd->nr_balance_failed = 0;
>>
> I am not convinced that we can clear the imbalance flag here. Lets take
> a simple example. Assume at a particular level of sched_domain, there
> are two sched_groups with one cpu each. There are 2 tasks on the source
> cpu, one of which is running(t1) and the other thread(t2) does not have
> the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the
> dst_cpu due to affinity constraints. Note that t2 is *not pinned, it
> just cannot run on the dst_cpu*. In this scenario also we reach the
> out_balanced tag right? If we set the group_imbalance flag to 0, we are
No, we will not. If we have 2 tasks on 1 CPU in one sched_group and the
other group has an idle CPU, we are not balanced, so we will not go to
out_balanced, and the group_imbalance flag will stay set until we reach
a balanced state (by migrating t1).
> ruling out the possibility of migrating t2 to any other cpu in a higher
> level sched_domain by saying that all is well, there is no imbalance.
> This is wrong, isn't it?
>
> My point is that by clearing the imbalance flag in the out_balanced
> case, you might be overlooking the fact that the tsk_cpus_allowed mask
> of the tasks on the src_cpu may not be able to run on the dst_cpu in
> *this* level of sched_domain, but can potentially run on a cpu at any
> higher level of sched_domain. By clearing the flag, we are not
The imbalance flag is per sched_domain level, so we will not clear the
group_imbalance flag of other levels. If the imbalance is also detected
at a higher level, it will migrate t2.
Regards,
Vincent
> encouraging load balance at that level for t2.
>
> Am I missing something?
>
> Regards
> Preeti U Murthy
>
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> The imbalance flag can stay set whereas there is no imbalance.
>
> Let assume that we have 3 tasks that run on a dual cores /dual
> cluster system. We will have some idle load balance which are
> triggered during tick. Unfortunately, the tick is also used to
> queue background work so we can reach the situation where short
> work has been queued on a CPU which already runs a task. The load
> balance will detect this imbalance (2 tasks on 1 CPU and an idle
> CPU) and will try to pull the waiting task on the idle CPU. The
> waiting task is a worker thread that is pinned on a CPU so an
> imbalance due to pinned task is detected and the imbalance flag is
> set. Then, we will not be able to clear the flag because we have at
> most 1 task on each CPU but the imbalance flag will trig to useless
> active load balance between the idle CPU and the busy CPU.
>
> We need to reset of the imbalance flag as soon as we have reached a
> balanced state.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> I have tried to understand the meaning of the condition :
> (this_load <= load && this_load + target_load(prev_cpu, idx) <=
> tl_per_task) but i failed to find a use case that can take
> advantage of it and i haven't found clear description in the
> previous commits' log. Futhermore, the comment of the condition
> refers to the task_hot function that was used before being replaced
> by the current condition: /* * This domain has SD_WAKE_AFFINE and *
> p is cache cold in this domain, and * there is no bad imbalance.
> */
>
> If we look more deeply the below condition this_load +
> target_load(prev_cpu, idx) <= tl_per_task
>
> When sync is clear, we have : tl_per_task = runnable_load_avg /
> nr_running this_load = max(runnable_load_avg, cpuload[idx])
> target_load = max(runnable_load_avg', cpuload'[idx])
>
> It implies that runnable_load_avg' == 0 and nr_running <= 1 in
> order to match the condition. This implies that runnable_load_avg
> == 0 too because of the condition: this_load <= load but if this
> _load is null, balanced is already set and the test is redundant.
>
> If sync is set, it's not as straight forward as above (especially
> if cgroup are involved) but the policy should be similar as we have
> removed a task that's going to sleep in order to get a more
> accurate load and this_load values.
>
> The current conclusion is that these additional condition don't
> give any benefit so we can remove them.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> The computation of avg_load and avg_load_per_task should only takes
> into account the number of cfs tasks. The non cfs task are already
> taken into account by decreasing the cpu's capacity (cpu_power) and
> they will be tracked in the CPU's utilization (group_utilization)
> of the next patches
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> power_orig is only changed for system with a SMT sched_domain level
> in order to reflect the lower capacity of CPUs. Heterogenous system
> also have to reflect an original capacity that is different from
> the default value.
>
> Create a more generic function arch_scale_cpu_power that can be
> also used by non SMT platform to set power_orig.
>
> The weak behavior of arch_scale_cpu_power is the previous SMT one
> in order to keep backward compatibility in the use of power_orig.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> Use the new arch_scale_cpu_power in order to reflect the original
> capacity of a CPU instead of arch_scale_freq_power which is more
> linked to a scaling of the capacity linked to the frequency.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> This new field cpu_power_orig reflects the available capacity of a
> CPUs unlike the cpu_power which reflects the current capacity that
> can be altered by frequency and rt tasks.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
> Currently the task always wakes affine on this_cpu if the latter is
> idle. Before waking up the task on this_cpu, we check that this_cpu
> capacity is not significantly reduced because of RT tasks or irq
> activity.
>
> Use case where the number of irq and/or the time spent under irq is
> important will take benefit of this because the task that is woken
> up by irq or softirq will not use the same CPU than irq (and
> softirq) but a idle one which share its cache.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 07/08/2014 11:05 PM, Rik van Riel wrote:
> On 06/30/2014 12:05 PM, Vincent Guittot wrote:
>> The imbalance flag can stay set whereas there is no imbalance.
>
>> Let assume that we have 3 tasks that run on a dual cores /dual
>> cluster system. We will have some idle load balance which are
>> triggered during tick. Unfortunately, the tick is also used to
>> queue background work so we can reach the situation where short
>> work has been queued on a CPU which already runs a task. The
>> load balance will detect this imbalance (2 tasks on 1 CPU and an
>> idle CPU) and will try to pull the waiting task on the idle CPU.
>> The waiting task is a worker thread that is pinned on a CPU so
>> an imbalance due to pinned task is detected and the imbalance
>> flag is set. Then, we will not be able to clear the flag because
>> we have at most 1 task on each CPU but the imbalance flag will
>> trig to useless active load balance between the idle CPU and the
>> busy CPU.
>
>> We need to reset of the imbalance flag as soon as we have reached
>> a balanced state.
>
>> Signed-off-by: Vincent Guittot <[email protected]>
>
> Acked-by: Rik van Riel <[email protected]>
Never mind that, Preeti explained the failure mode in more detail
on irc, and I am no longer convinced this change is a good idea.
--
All rights reversed
Hi Vincent,
On 07/08/2014 03:42 PM, Vincent Guittot wrote:
> On 8 July 2014 05:13, Preeti U Murthy <[email protected]> wrote:
>> On 06/30/2014 09:35 PM, Vincent Guittot wrote:
>>> The imbalance flag can stay set whereas there is no imbalance.
>>>
>>> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
>>> We will have some idle load balance which are triggered during tick.
>>> Unfortunately, the tick is also used to queue background work so we can reach
>>> the situation where short work has been queued on a CPU which already runs a
>>> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
>>> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
>>> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
>>> detected and the imbalance flag is set.
>>> Then, we will not be able to clear the flag because we have at most 1 task on
>>> each CPU but the imbalance flag will trig to useless active load balance
>>> between the idle CPU and the busy CPU.
>>>
>>> We need to reset of the imbalance flag as soon as we have reached a balanced
>>> state.
>>>
>>> Signed-off-by: Vincent Guittot <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 14 +++++++++++---
>>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d3c73122..0c48dff 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6615,10 +6615,8 @@ more_balance:
>>> if (sd_parent) {
>>> int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>>
>>> - if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
>>> + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
>>> *group_imbalance = 1;
>>> - } else if (*group_imbalance)
>>> - *group_imbalance = 0;
>>> }
>>>
>>> /* All tasks on this runqueue were pinned by CPU affinity */
>>> @@ -6703,6 +6701,16 @@ more_balance:
>>> goto out;
>>>
>>> out_balanced:
>>> + /*
>>> + * We reach balance although we may have faced some affinity
>>> + * constraints. Clear the imbalance flag if it was set.
>>> + */
>>> + if (sd_parent) {
>>> + int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>> + if (*group_imbalance)
>>> + *group_imbalance = 0;
>>> + }
>>> +
>>> schedstat_inc(sd, lb_balanced[idle]);
>>>
>>> sd->nr_balance_failed = 0;
>>>
>> I am not convinced that we can clear the imbalance flag here. Lets take
>> a simple example. Assume at a particular level of sched_domain, there
>> are two sched_groups with one cpu each. There are 2 tasks on the source
>> cpu, one of which is running(t1) and the other thread(t2) does not have
>> the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the
>> dst_cpu due to affinity constraints. Note that t2 is *not pinned, it
>> just cannot run on the dst_cpu*. In this scenario also we reach the
>> out_balanced tag right? If we set the group_imbalance flag to 0, we are
>
> No we will not. If we have 2 tasks on 1 CPU in one sched_group and the
> other group with an idle CPU, we are not balanced so we will not go
> to out_balanced and the group_imbalance will staty set until we reach
> a balanced state (by migrating t1).
In the example that I mentioned above, t1 and t2 are on the rq of cpu0;
while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
its cpus_allowed mask. So during load balance, cpu1 tries to pull t2,
cannot do so, and hence the LBF_ALL_PINNED flag is set and it jumps to
out_balanced. Note that there are only two sched_groups at this level of
the sched_domain: one with cpu0 and the other with cpu1. In this scenario
we do not try to do active load balancing, at least that's what the code
does now if the LBF_ALL_PINNED flag is set.
>
>> ruling out the possibility of migrating t2 to any other cpu in a higher
>> level sched_domain by saying that all is well, there is no imbalance.
>> This is wrong, isn't it?
>>
>> My point is that by clearing the imbalance flag in the out_balanced
>> case, you might be overlooking the fact that the tsk_cpus_allowed mask
>> of the tasks on the src_cpu may not be able to run on the dst_cpu in
>> *this* level of sched_domain, but can potentially run on a cpu at any
>> higher level of sched_domain. By clearing the flag, we are not
>
> The imbalance flag is per sched_domain level so we will not clear
> group_imbalance flag of other levels if the imbalance is also detected
> at a higher level it will migrate t2
Continuing with the above explanation; when the LBF_ALL_PINNED flag is
set and we jump to out_balanced, we clear the imbalance flag for the
sched_group comprising cpu0 and cpu1, although there is actually an
imbalance. t2 could still be migrated to, say, cpu2/cpu3 (t2 has them in
its cpus allowed mask) in another sched group when load balancing is
done at the next sched domain level.
Elaborating on this, when cpu2 in another socket, let's say, begins load
balancing and update_sd_pick_busiest() is called, the group with cpu0
and cpu1 may not be picked as a potential imbalanced group. Had we not
cleared the imbalance flag for this group, we could have balanced out t2
to cpu2/3.
Is the scenario I am describing clear?
Regards
Preeti U Murthy
>
> Regards,
> Vincent
>
>> encouraging load balance at that level for t2.
>>
>> Am I missing something?
>>
>> Regards
>> Preeti U Murthy
>>
>
On Mon, Jun 30, 2014 at 6:05 PM, Vincent Guittot
<[email protected]> wrote:
> Use the new arch_scale_cpu_power in order to reflect the original capacity of
s/arch_scale_cpu_power/arch_scale_cpu_capacity and similar renames in
the commit logs across the entire patchset to take into account all
the renames in the code.
> a CPU instead of arch_scale_freq_power which is more linked to a scaling of
> the capacity linked to the frequency.
here too...
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> arch/arm/kernel/topology.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index d42a7db..2310bfb 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -42,7 +42,7 @@
> */
> static DEFINE_PER_CPU(unsigned long, cpu_scale);
>
> -unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> {
> return per_cpu(cpu_scale, cpu);
> }
> @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
> set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>
> printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
> - cpu, arch_scale_freq_capacity(NULL, cpu));
> + cpu, arch_scale_cpu_capacity(NULL, cpu));
> }
>
> #else
> --
> 1.9.1
>
>
On Mon, Jun 30, 2014 at 6:05 PM, Vincent Guittot
<[email protected]> wrote:
> This new field cpu_power_orig reflects the available capacity of a CPUs unlike
> the cpu_power which reflects the current capacity that can be altered by
> frequency and rt tasks.
>
s/cpu_power/cpu_capacity
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/core.c | 2 +-
> kernel/sched/fair.c | 1 +
> kernel/sched/sched.h | 1 +
> 3 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 54f5722..92dd6d1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6964,7 +6964,7 @@ void __init sched_init(void)
> #ifdef CONFIG_SMP
> rq->sd = NULL;
> rq->rd = NULL;
> - rq->cpu_capacity = SCHED_CAPACITY_SCALE;
> + rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
> rq->post_schedule = 0;
> rq->active_balance = 0;
> rq->next_balance = jiffies;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f0bba5f..7dff578 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5671,6 +5671,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>
> capacity >>= SCHED_CAPACITY_SHIFT;
>
> + cpu_rq(cpu)->cpu_capacity_orig = capacity;
> sdg->sgc->capacity_orig = capacity;
>
> if (sched_feat(ARCH_CAPACITY))
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2f86361..7b311de 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -568,6 +568,7 @@ struct rq {
> struct sched_domain *sd;
>
> unsigned long cpu_capacity;
> + unsigned long cpu_capacity_orig;
>
> unsigned char idle_balance;
> /* For active balancing */
> --
> 1.9.1
>
>
On 9 July 2014 05:54, Preeti U Murthy <[email protected]> wrote:
> Hi Vincent,
>
> On 07/08/2014 03:42 PM, Vincent Guittot wrote:
[ snip]
>>>> out_balanced:
>>>> + /*
>>>> + * We reach balance although we may have faced some affinity
>>>> + * constraints. Clear the imbalance flag if it was set.
>>>> + */
>>>> + if (sd_parent) {
>>>> + int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>>> + if (*group_imbalance)
>>>> + *group_imbalance = 0;
>>>> + }
>>>> +
>>>> schedstat_inc(sd, lb_balanced[idle]);
>>>>
>>>> sd->nr_balance_failed = 0;
>>>>
>>> I am not convinced that we can clear the imbalance flag here. Lets take
>>> a simple example. Assume at a particular level of sched_domain, there
>>> are two sched_groups with one cpu each. There are 2 tasks on the source
>>> cpu, one of which is running(t1) and the other thread(t2) does not have
>>> the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the
>>> dst_cpu due to affinity constraints. Note that t2 is *not pinned, it
>>> just cannot run on the dst_cpu*. In this scenario also we reach the
>>> out_balanced tag right? If we set the group_imbalance flag to 0, we are
>>
>> No we will not. If we have 2 tasks on 1 CPU in one sched_group and the
>> other group with an idle CPU, we are not balanced so we will not go
>> to out_balanced and the group_imbalance will staty set until we reach
>> a balanced state (by migrating t1).
>
> In the example that I mention above, t1 and t2 are on the rq of cpu0;
> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
That's where I disagree: my understanding of can_migrate_task() is that
LBF_ALL_PINNED will be cleared before it returns false when checking t1,
because we test all tasks, even the running one.
> out_balanced. Note that there are only two sched groups at this level of
> sched domain.one with cpu0 and the other with cpu1. In this scenario we
> do not try to do active load balancing, atleast thats what the code does
> now if LBF_ALL_PINNED flag is set.
>
>>
>>> ruling out the possibility of migrating t2 to any other cpu in a higher
>>> level sched_domain by saying that all is well, there is no imbalance.
>>> This is wrong, isn't it?
>>>
>>> My point is that by clearing the imbalance flag in the out_balanced
>>> case, you might be overlooking the fact that the tsk_cpus_allowed mask
>>> of the tasks on the src_cpu may not be able to run on the dst_cpu in
>>> *this* level of sched_domain, but can potentially run on a cpu at any
>>> higher level of sched_domain. By clearing the flag, we are not
>>
>> The imbalance flag is per sched_domain level so we will not clear
>> group_imbalance flag of other levels if the imbalance is also detected
>> at a higher level it will migrate t2
>
> Continuing with the above explanation; when LBF_ALL_PINNED flag is
> set,and we jump to out_balanced, we clear the imbalance flag for the
> sched_group comprising of cpu0 and cpu1,although there is actually an
> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
> its cpus allowed mask) in another sched group when load balancing is
> done at the next sched domain level.
The imbalance flag is per sched_domain level, so it will not have any
side effect on the next level.
Regards,
Vincent
>
> Elaborating on this, when cpu2 in another socket,lets say, begins load
> balancing and update_sd_pick_busiest() is called, the group with cpu0
> and cpu1 may not be picked as a potential imbalanced group. Had we not
> cleared the imbalance flag for this group, we could have balanced out t2
> to cpu2/3.
>
> Is the scenario I am describing clear?
>
> Regards
> Preeti U Murthy
>>
>> Regards,
>> Vincent
>>
>>> encouraging load balance at that level for t2.
>>>
>>> Am I missing something?
>>>
>>> Regards
>>> Preeti U Murthy
>>>
>>
>
On 9 July 2014 09:49, Amit Kucheria <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 6:05 PM, Vincent Guittot
> <[email protected]> wrote:
>> Use the new arch_scale_cpu_power in order to reflect the original capacity of
>
> s/arch_scale_cpu_power/arch_scale_cpu_capacity and similar renames in
> the commit logs across the entire patchset to take into account all
> the renames in the code.
Good catch, the commit log slipped through the find and replace. I
will change the other commit messages as well.
>
>> a CPU instead of arch_scale_freq_power which is more linked to a scaling of
>> the capacity linked to the frequency.
>
> here too...
>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> arch/arm/kernel/topology.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index d42a7db..2310bfb 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -42,7 +42,7 @@
>> */
>> static DEFINE_PER_CPU(unsigned long, cpu_scale);
>>
>> -unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>> +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> {
>> return per_cpu(cpu_scale, cpu);
>> }
>> @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
>> set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>>
>> printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
>> - cpu, arch_scale_freq_capacity(NULL, cpu));
>> + cpu, arch_scale_cpu_capacity(NULL, cpu));
>> }
>>
>> #else
>> --
>> 1.9.1
>>
>>
On Mon, Jun 30, 2014 at 06:05:32PM +0200, Vincent Guittot wrote:
> +++ b/kernel/sched/fair.c
> @@ -6615,10 +6615,8 @@ more_balance:
In order to avoid labels being used as function names in the hunk headers, you can use:
.quiltrc:
QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
.gitconfig:
[diff "default"]
xfuncname = "^[[:alpha:]$_].*[^:]$"
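With that, the hunk header shows the enclosing function instead of the
nearest label; compare the v1 and v2 hunks in this thread:

	@@ -6615,10 +6615,8 @@ more_balance:
	@@ -6615,10 +6615,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,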
On 9 July 2014 12:14, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 06:05:32PM +0200, Vincent Guittot wrote:
>> +++ b/kernel/sched/fair.c
>> @@ -6615,10 +6615,8 @@ more_balance:
>
> In order to avoid labels being used as functions; you can use:
>
> .quiltrc:
> QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
>
> .gitconfig:
> [diff "default"]
> xfuncname = "^[[:alpha:]$_].*[^:]$"
>
Thanks, I'm going to add it to my .gitconfig.
>
On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
> In the example that I mention above, t1 and t2 are on the rq of cpu0;
> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
> out_balanced. Note that there are only two sched groups at this level of
> sched domain.one with cpu0 and the other with cpu1. In this scenario we
> do not try to do active load balancing, atleast thats what the code does
> now if LBF_ALL_PINNED flag is set.
I think Vince is right in saying that in this scenario ALL_PINNED won't
be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also
include the current running task.
And can_migrate_task() only checks for current after the pinning bits.
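That is, roughly (a simplified sketch of the ordering being described,
with the other checks in can_migrate_task() omitted):

	static int can_migrate_task(struct task_struct *p, struct lb_env *env)
	{
		/* 1) the affinity check comes first ... */
		if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
			env->flags |= LBF_SOME_PINNED;
			return 0;
		}
		/* ... and ALL_PINNED is cleared for any task allowed on dst_cpu */
		env->flags &= ~LBF_ALL_PINNED;

		/* 2) only then is the currently running task rejected */
		if (task_running(env->src_rq, p))
			return 0;

		return 1;
	}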
> Continuing with the above explanation; when LBF_ALL_PINNED flag is
> set,and we jump to out_balanced, we clear the imbalance flag for the
> sched_group comprising of cpu0 and cpu1,although there is actually an
> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
> its cpus allowed mask) in another sched group when load balancing is
> done at the next sched domain level.
And this is where Vince is wrong; note how
update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
one level up.
So what we can do I suppose is clear 'group->sgc->imbalance' at
out_balanced.
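(For reference, the reader side is roughly the trivial helper:

	static inline int sg_imbalanced(struct sched_group *group)
	{
		return group->sgc->imbalance;
	}

so it looks at the group's own flag, one level below where load_balance()
sets it.)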
In any case, the entirety of this group imbalance crap is just that,
crap. It's a terribly difficult situation and the current bits more or
less fudge around some of the common cases. Also see the comment near
sg_imbalanced(). It's not a solid and 'correct' anything. It's a bunch of
hacks trying to deal with hard cases.
A 'good' solution would be prohibitively expensive I fear.
On Mon, Jun 30, 2014 at 06:05:35PM +0200, Vincent Guittot wrote:
> power_orig is only changed for system with a SMT sched_domain level in order to
> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> original capacity that is different from the default value.
>
> Create a more generic function arch_scale_cpu_power that can be also used by
> non SMT platform to set power_orig.
>
> The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
> keep backward compatibility in the use of power_orig.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 24 ++++++++++++++++--------
> 1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 148b277..f0bba5f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5614,6 +5614,20 @@ unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
> return default_scale_smt_capacity(sd, cpu);
> }
>
> +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> +{
> + unsigned long weight = sd->span_weight;
> +
> + if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
> + if (sched_feat(ARCH_CAPACITY))
> + return arch_scale_smt_capacity(sd, cpu);
> + else
> + return default_scale_smt_capacity(sd, cpu);
> + }
> +
> + return SCHED_CAPACITY_SCALE;
> +}
The only caller of arch_scale_smt_capacity is now an arch_ function
itself, which makes it entirely redundant, no?
Also, sched_feat() and default_scale_smt_capacity() aren't available
to arch implementations of this function, which can thus not retain the
same semantics.
Seeing how we currently don't have any arch_scale_smt_capacity
implementations other than the default, we can easily replace it entirely
with something like:
unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
		return sd->smt_gain / sd->span_weight;

	return SCHED_CAPACITY_SCALE;
}
Hmm?
On 07/09/2014 04:13 PM, Peter Zijlstra wrote:
> On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
>> In the example that I mention above, t1 and t2 are on the rq of cpu0;
>> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
>> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
>> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
>> out_balanced. Note that there are only two sched groups at this level of
>> sched domain.one with cpu0 and the other with cpu1. In this scenario we
>> do not try to do active load balancing, atleast thats what the code does
>> now if LBF_ALL_PINNED flag is set.
>
> I think Vince is right in saying that in this scenario ALL_PINNED won't
> be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also
> include the current running task.
Hmm.. really? Because while dequeueing a task from the rq so as to
schedule it on a cpu, we delete its entry from the list of cfs_tasks on
the rq.
list_del_init(&se->group_node) in account_entity_dequeue() does that.
>
> And can_migrate_task() only checks for current after the pinning bits.
>
>> Continuing with the above explanation; when LBF_ALL_PINNED flag is
>> set,and we jump to out_balanced, we clear the imbalance flag for the
>> sched_group comprising of cpu0 and cpu1,although there is actually an
>> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
>> its cpus allowed mask) in another sched group when load balancing is
>> done at the next sched domain level.
>
> And this is where Vince is wrong; note how
> update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
> load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
> one level up.
One level up? The group->sgc->imbalance flag is checked during
update_sg_lb_stats(). This flag is *set during the load balancing at a
lower level sched domain*. IOW, when the 'group' itself formed the sched
domain being balanced.
>
> So what we can do I suppose is clear 'group->sgc->imbalance' at
> out_balanced.
You mean 'set'? If we clear it, we will have no clue about imbalances at
lower level sched domains due to pinning, specifically in the
LBF_ALL_PINNED case. This might prevent us from balancing out these
tasks to other groups at a higher level domain. update_sd_pick_busiest()
specifically relies on this flag to choose the busiest group.
>
> In any case, the entirely of this group imbalance crap is just that,
> crap. Its a terribly difficult situation and the current bits more or
> less fudge around some of the common cases. Also see the comment near
> sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of
> hacks trying to deal with hard cases.
>
> A 'good' solution would be prohibitively expensive I fear.
So the problem that Vincent is trying to bring to the fore with this
patchset is that, when the busy group has cpus with only 1 running task
but the imbalance flag set due to affinity constraints, we unnecessarily
try to do active balancing on this group. Active load balancing will not
succeed when there is only one task. To solve this issue, would a simple
check on (busiest->nr_running > 1), in addition to !(ld_moved), before
doing active load balancing not work?
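Something along these lines, as a rough and untested sketch (reusing the
existing ld_moved/busiest variables in load_balance()):

	/*
	 * Only consider active load balancing when the busiest rq
	 * actually has more than one runnable task to choose from.
	 */
	if (!ld_moved && busiest->nr_running > 1) {
		/* existing active load balance path */
	}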
Thanks
Regards
Preeti U Murthy
>
On Wed, Jul 09, 2014 at 05:11:20PM +0530, Preeti U Murthy wrote:
> On 07/09/2014 04:13 PM, Peter Zijlstra wrote:
> > On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
> >> In the example that I mention above, t1 and t2 are on the rq of cpu0;
> >> while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in
> >> its cpus allowed mask. So during load balance, cpu1 tries to pull t2,
> >> cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
> >> out_balanced. Note that there are only two sched groups at this level of
> >> sched domain.one with cpu0 and the other with cpu1. In this scenario we
> >> do not try to do active load balancing, atleast thats what the code does
> >> now if LBF_ALL_PINNED flag is set.
> >
> > I think Vince is right in saying that in this scenario ALL_PINNED won't
> > be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also
> > include the current running task.
>
> Hmm.. really? Because while dequeueing a task from the rq so as to
> schedule it on a cpu, we delete its entry from the list of cfs_tasks on
> the rq.
>
> list_del_init(&se->group_node) in account_entity_dequeue() does that.
But set_next_entity() doesn't call account_entity_dequeue(), only
__dequeue_entity() to take it out of the rb-tree.
> > And can_migrate_task() only checks for current after the pinning bits.
> >
> >> Continuing with the above explanation; when LBF_ALL_PINNED flag is
> >> set,and we jump to out_balanced, we clear the imbalance flag for the
> >> sched_group comprising of cpu0 and cpu1,although there is actually an
> >> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
> >> its cpus allowed mask) in another sched group when load balancing is
> >> done at the next sched domain level.
> >
> > And this is where Vince is wrong; note how
> > update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
> > load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
> > one level up.
>
> One level up? The group->sgc->imbalance flag is checked during
> update_sg_lb_stats(). This flag is *set during the load balancing at a
> lower level sched domain*.IOW, when the 'group' formed the sched domain.
sd_parent is one level up.
> >
> > So what we can do I suppose is clear 'group->sgc->imbalance' at
> > out_balanced.
>
> You mean 'set'? If we clear it we will have no clue about imbalances at
> lower level sched domains due to pinning. Specifically in LBF_ALL_PINNED
> case. This might prevent us from balancing out these tasks to other
> groups at a higher level domain. update_sd_pick_busiest() specifically
> relies on this flag to choose the busiest group.
No, clear, in load_balance. So set one level up, clear the current
level.
On 9 July 2014 12:43, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
[snip]
>
>> Continuing with the above explanation; when LBF_ALL_PINNED flag is
>> set,and we jump to out_balanced, we clear the imbalance flag for the
>> sched_group comprising of cpu0 and cpu1,although there is actually an
>> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
>> its cpus allowed mask) in another sched group when load balancing is
>> done at the next sched domain level.
>
> And this is where Vince is wrong; note how
> update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
> load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
> one level up.
>
I forgot this behavior when studying Preeti's use case.
> So what we can do I suppose is clear 'group->sgc->imbalance' at
> out_balanced.
>
> In any case, the entirely of this group imbalance crap is just that,
> crap. Its a terribly difficult situation and the current bits more or
> less fudge around some of the common cases. Also see the comment near
> sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of
> hacks trying to deal with hard cases.
>
> A 'good' solution would be prohibitively expensive I fear.
I have tried to summarize several use cases that have been discussed
for this patch.
The 1st use case is the one that I described in the commit message of
this patch: if we have a sporadic imbalance that sets the imbalance
flag, we don't clear it afterwards and it generates spurious and useless
active load balancing.
Then Preeti came with the following use case:
we have a sched_domain made of CPU0 and CPU1 in 2 different sched_groups.
2 tasks A and B are on CPU0, B can't run on CPU1, and A is the running task.
When CPU1's sched_group is doing load balance, the imbalance should be
set. That still happens with this patchset because the LBF_ALL_PINNED
flag will be cleared thanks to task A.
Preeti also explained to me the following use case on IRC:
if both tasks A and B can't run on CPU1, LBF_ALL_PINNED will stay set.
As we can't do anything, we conclude that we are balanced, we go to
out_balanced and we clear the imbalance flag. But we should not consider
that a balanced state but an all-tasks-pinned state instead, and we
should leave the imbalance flag set.
If we now have 2 additional CPUs which are in the cpumask of task A
and/or B at the parent sched_domain level, we should migrate one task
out of this group, but this will not happen (with this patch) because
the sched_group made of CPU0 and CPU1 is not overloaded (2 tasks for 2
CPUs) and the imbalance flag has been cleared as described previously.
I'm going to send a new revision of the patchset with the correction
Vincent
The imbalance flag can stay set even though there is no imbalance.
Let's assume that we have 3 tasks that run on a dual core / dual cluster system.
We will have some idle load balances which are triggered during the tick.
Unfortunately, the tick is also used to queue background work, so we can reach
the situation where short work has been queued on a CPU which already runs a
task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is
a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is
detected and the imbalance flag is set.
Then, we will not be able to clear the flag because we have at most 1 task on
each CPU, but the imbalance flag will trigger useless active load balancing
between the idle CPU and the busy CPU.
We need to reset the imbalance flag as soon as we have reached a balanced
state. If all tasks are pinned, we don't consider that a balanced state and
we leave the imbalance flag set.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3c73122..a836198 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6615,10 +6615,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
- if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+ if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
*group_imbalance = 1;
- } else if (*group_imbalance)
- *group_imbalance = 0;
}
/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6629,7 +6627,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
env.loop_break = sched_nr_migrate_break;
goto redo;
}
- goto out_balanced;
+ goto out_all_pinned;
}
}
@@ -6703,6 +6701,22 @@ static int load_balance(int this_cpu, struct rq *this_rq,
goto out;
out_balanced:
+ /*
+ * We reach balance although we may have faced some affinity
+ * constraints. Clear the imbalance flag if it was set.
+ */
+ if (sd_parent) {
+ int *group_imbalance = &sd_parent->groups->sgc->imbalance;
+ if (*group_imbalance)
+ *group_imbalance = 0;
+ }
+
+out_all_pinned:
+ /*
+ * We reach balance because all tasks are pinned at this level so
+ * we can't migrate them. Let the imbalance flag set so parent level
+ * can try to migrate them.
+ */
schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
--
1.9.1
Hi Vincent,
On 07/10/2014 03:00 PM, Vincent Guittot wrote:
> The imbalance flag can stay set whereas there is no imbalance.
>
> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
> We will have some idle load balance which are triggered during tick.
> Unfortunately, the tick is also used to queue background work so we can reach
> the situation where short work has been queued on a CPU which already runs a
> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
> detected and the imbalance flag is set.
> Then, we will not be able to clear the flag because we have at most 1 task on
> each CPU but the imbalance flag will trig to useless active load balance
> between the idle CPU and the busy CPU.
>
> We need to reset of the imbalance flag as soon as we have reached a balanced
> state. If all tasks are pinned, we don't consider that as a balanced state and
> let the imbalance flag set.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3c73122..a836198 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6615,10 +6615,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> if (sd_parent) {
> int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>
> - if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
> + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
> *group_imbalance = 1;
> - } else if (*group_imbalance)
> - *group_imbalance = 0;
> }
>
> /* All tasks on this runqueue were pinned by CPU affinity */
> @@ -6629,7 +6627,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> env.loop_break = sched_nr_migrate_break;
> goto redo;
> }
> - goto out_balanced;
> + goto out_all_pinned;
> }
> }
>
> @@ -6703,6 +6701,22 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> goto out;
>
> out_balanced:
> + /*
> + * We reach balance although we may have faced some affinity
> + * constraints. Clear the imbalance flag if it was set.
> + */
> + if (sd_parent) {
> + int *group_imbalance = &sd_parent->groups->sgc->imbalance;
> + if (*group_imbalance)
> + *group_imbalance = 0;
> + }
> +
> +out_all_pinned:
> + /*
> + * We reach balance because all tasks are pinned at this level so
> + * we can't migrate them. Let the imbalance flag set so parent level
> + * can try to migrate them.
> + */
> schedstat_inc(sd, lb_balanced[idle]);
>
> sd->nr_balance_failed = 0;
>
This patch looks good to me.
Reviewed-by: Preeti U Murthy <[email protected]>
Hi Peter, Vincent,
On 07/10/2014 02:44 PM, Vincent Guittot wrote:
> On 9 July 2014 12:43, Peter Zijlstra <[email protected]> wrote:
>> On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
>
> [snip]
>
>>
>>> Continuing with the above explanation; when LBF_ALL_PINNED flag is
>>> set,and we jump to out_balanced, we clear the imbalance flag for the
>>> sched_group comprising of cpu0 and cpu1,although there is actually an
>>> imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in
>>> its cpus allowed mask) in another sched group when load balancing is
>>> done at the next sched domain level.
>>
>> And this is where Vince is wrong; note how
>> update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but
>> load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly
>> one level up.
>>
>
> I forgot this behavior when studying preeti use case
>
>> So what we can do I suppose is clear 'group->sgc->imbalance' at
>> out_balanced.
>>
>> In any case, the entirely of this group imbalance crap is just that,
>> crap. Its a terribly difficult situation and the current bits more or
>> less fudge around some of the common cases. Also see the comment near
>> sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of
>> hacks trying to deal with hard cases.
>>
>> A 'good' solution would be prohibitively expensive I fear.
>
> I have tried to summarized several use cases that have been discussed
> for this patch
>
> The 1st use case is the one that i described in the commit message of
> this patch: If we have a sporadic imbalance that set the imbalance
> flag, we don't clear it after and it generates spurious and useless
> active load balance
>
> Then preeti came with the following use case :
> we have a sched_domain made of CPU0 and CPU1 in 2 different sched_groups
> 2 tasks A and B are on CPU0, B can't run on CPU1, A is the running task.
> When CPU1's sched_group is doing load balance, the imbalance should be
> set. That's still happen with this patchset because the LBF_ALL_PINNED
> flag will be cleared thanks to task A.
>
> Preeti also explained me the following use cases on irc:
>
> If we have both tasks A and B that can't run on CPU1, the
> LBF_ALL_PINNED will stay set. As we can't do anything, we conclude
> that we are balanced, we go to out_balanced and we clear the imbalance
> flag. But we should not consider that as a balanced state but as a all
> tasks pinned state instead and we should let the imbalance flag set.
> If we now have 2 additional CPUs which are in the cpumask of task A
> and/or B at the parent sched_domain level , we should migrate one task
> in this group but this will not happen (with this patch) because the
> sched_group made of CPU0 and CPU1 is not overloaded (2 tasks for 2
> CPUs) and the imbalance flag has been cleared as described previously.
Peter,
The above paragraph describes my concern, with regard to clearing the
imbalance flag at a given level of sched_domain in the case of pinned
tasks, that I raised in the conversation below:
https://lkml.org/lkml/2014/7/9/454
You are right about iterating through all tasks including the current
task during load balancing.
Thanks
Regards
Preeti U Murthy
>
> I'm going to send a new revision of the patchset with the correction
>
> Vincent
>
On Mon, Jun 30, 2014 at 06:05:38PM +0200, Vincent Guittot wrote:
> Currently the task always wakes affine on this_cpu if the latter is idle.
> Before waking up the task on this_cpu, we check that this_cpu capacity is not
> significantly reduced because of RT tasks or irq activity.
>
> Use case where the number of irq and/or the time spent under irq is important
> will take benefit of this because the task that is woken up by irq or softirq
> will not use the same CPU than irq (and softirq) but a idle one which share
> its cache.
The above doesn't seem to explain:
> + } else if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> + this_eff_load = 0;
> + }
> +
> + balanced = this_eff_load <= prev_eff_load;
this. Why would you unconditionally allow wake_affine across cache
domains?
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
> + /*
> + * If the CPUs share their cache and the src_cpu's capacity is
> + * reduced because of other sched_class or IRQs, we trig an
> + * active balance to move the task
> + */
> + if ((sd->flags & SD_SHARE_PKG_RESOURCES)
> + && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> + sd->imbalance_pct)))
> return 1;
Why is this tied to shared caches?
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
> + if ((sd->flags & SD_SHARE_PKG_RESOURCES)
> + && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> + sd->imbalance_pct)))
> + if ((rq->cfs.h_nr_running >= 1)
> + && ((rq->cpu_capacity * sd->imbalance_pct) <
> + (rq->cpu_capacity_orig * 100))) {
That's just weird indentation, and operators should be at the end, not
at the beginning, of a line break.
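I.e. something like this (illustration only, same condition reflowed):

	if ((rq->cfs.h_nr_running >= 1) &&
	    ((rq->cpu_capacity * sd->imbalance_pct) <
	     (rq->cpu_capacity_orig * 100))) {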
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
You 'forgot' to update the comment that goes with nohz_kick_needed().
> @@ -7233,9 +7253,10 @@ static inline int nohz_kick_needed(struct rq *rq)
> struct sched_domain *sd;
> struct sched_group_capacity *sgc;
> int nr_busy, cpu = rq->cpu;
> + bool kick = false;
>
> if (unlikely(rq->idle_balance))
> + return false;
>
> /*
> * We may be recently in ticked or tickless idle mode. At the first
> @@ -7249,38 +7270,41 @@ static inline int nohz_kick_needed(struct rq *rq)
> * balancing.
> */
> if (likely(!atomic_read(&nohz.nr_cpus)))
> + return false;
>
> if (time_before(now, nohz.next_balance))
> + return false;
>
> if (rq->nr_running >= 2)
> + return true;
>
> rcu_read_lock();
> sd = rcu_dereference(per_cpu(sd_busy, cpu));
> if (sd) {
> sgc = sd->groups->sgc;
> nr_busy = atomic_read(&sgc->nr_busy_cpus);
>
> + if (nr_busy > 1) {
> + kick = true;
> + goto unlock;
> + }
> +
> + if ((rq->cfs.h_nr_running >= 1)
> + && ((rq->cpu_capacity * sd->imbalance_pct) <
> + (rq->cpu_capacity_orig * 100))) {
> + kick = true;
> + goto unlock;
> + }
Again, why only for shared caches?
> }
>
> sd = rcu_dereference(per_cpu(sd_asym, cpu));
> if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
> sched_domain_span(sd)) < cpu))
> + kick = true;
>
> +unlock:
> rcu_read_unlock();
> + return kick;
> }
On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>
> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> the utilization of the CPU. This statistic includes all tasks that run the CPU
> and not only CFS tasks.
But this rq->avg is not the one that is migration aware, right? So why
use this?
We already compensate cpu_capacity for !fair tasks, so I don't see why
we can't use the migration aware one (and kill this one as Yuyang keeps
proposing) and compensate with the capacity factor.
On 9 July 2014 12:57, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 06:05:35PM +0200, Vincent Guittot wrote:
>> power_orig is only changed for system with a SMT sched_domain level in order to
>> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
>> original capacity that is different from the default value.
>>
>> Create a more generic function arch_scale_cpu_power that can be also used by
>> non SMT platform to set power_orig.
>>
>> The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
>> keep backward compatibility in the use of power_orig.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 24 ++++++++++++++++--------
>> 1 file changed, 16 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 148b277..f0bba5f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5614,6 +5614,20 @@ unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
>> return default_scale_smt_capacity(sd, cpu);
>> }
>>
>> +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> +{
>> + unsigned long weight = sd->span_weight;
>> +
>> + if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
>> + if (sched_feat(ARCH_CAPACITY))
>> + return arch_scale_smt_capacity(sd, cpu);
>> + else
>> + return default_scale_smt_capacity(sd, cpu);
>> + }
>> +
>> + return SCHED_CAPACITY_SCALE;
>> +}
>
> The only caller of arch_scale_smt_capacity is now an arch_ function
> itself; which makes it entirely redundant, no?
>
> Also, sched_feat() and default_scale_smt_capability() aren't available
> to arch implementations of this function and can thus not retain
> semantics.
>
> Seeing how we currently don't have any arch_scale_smt_capacity
> implementations other than the default we can easily replace it entirely
> with something like:
>
> unsigned long __weak arch_scale_cpu_capacity()
> {
> if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> return sd->smt_gain / sd->span_weight;
>
> return SCHED_CAPACITY_SCALE;
> }
>
> Hmm?
That's fair. I'm going to clean it up and remove the unused function.
On 10 July 2014 13:06, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 06:05:38PM +0200, Vincent Guittot wrote:
>> Currently the task always wakes affine on this_cpu if the latter is idle.
>> Before waking up the task on this_cpu, we check that this_cpu capacity is not
>> significantly reduced because of RT tasks or irq activity.
>>
>> Use case where the number of irq and/or the time spent under irq is important
>> will take benefit of this because the task that is woken up by irq or softirq
>> will not use the same CPU than irq (and softirq) but a idle one which share
>> its cache.
>
> The above, doesn't seem to explain:
>
>> + } else if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
>> + this_eff_load = 0;
>> + }
>> +
>> + balanced = this_eff_load <= prev_eff_load;
>
> this. Why would you unconditionally allow wake_affine across cache
> domains?
The current policy is to use this_cpu if this_load <= 0. I want to
keep the current wake affine policy for all sched_domains that don't
share cache, so if the involved CPUs don't share cache, I clear
this_eff_load to force balanced to be true. But when CPUs share their
cache, we not only look at idleness but also at the available capacity
of the prev and local CPUs.
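To illustrate, a simplified sketch of the logic around the hunk quoted
above (details of the effective load computation omitted):

	if (this_load > 0) {
		/* CPUs share cache: compare the effective loads, which
		 * take the capacity of prev and local CPUs into account */
	} else if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
		/* no shared cache: keep the old "this_cpu idle => wake
		 * affine" policy by forcing the test below to be true */
		this_eff_load = 0;
	}

	balanced = this_eff_load <= prev_eff_load;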
On 10 July 2014 13:24, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
>> + if ((sd->flags & SD_SHARE_PKG_RESOURCES)
>> + && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> + sd->imbalance_pct)))
>
>
>> + if ((rq->cfs.h_nr_running >= 1)
>> + && ((rq->cpu_capacity * sd->imbalance_pct) <
>> + (rq->cpu_capacity_orig * 100))) {
>
> That's just weird indentation and operators should be at the end not at
> the beginning of a line-break.
OK, I'm going to fix it.
On 10 July 2014 13:18, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
>> + /*
>> + * If the CPUs share their cache and the src_cpu's capacity is
>> + * reduced because of other sched_class or IRQs, we trig an
>> + * active balance to move the task
>> + */
>> + if ((sd->flags & SD_SHARE_PKG_RESOURCES)
>> + && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> + sd->imbalance_pct)))
>> return 1;
>
> Why is this tied to shared caches?
It's just to limit the change of policy to a level where it can bring a
benefit without performance regression. I'm not sure that we can do
that at any level without risk.
On 10 July 2014 15:16, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
>> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>>
>> We are going to use runnable_avg_sum and runnable_avg_period in order to get
>> the utilization of the CPU. This statistic includes all tasks that run the CPU
>> and not only CFS tasks.
>
> But this rq->avg is not the one that is migration aware, right? So why
> use this?
Yes, it's not the one that is migration aware
>
> We already compensate cpu_capacity for !fair tasks, so I don't see why
> we can't use the migration aware one (and kill this one as Yuyang keeps
> proposing) and compensate with the capacity factor.
The 1st point is that cpu_capacity is compensated by both !fair_tasks
and frequency scaling, and we should not take frequency scaling into
account for detecting overload.
What we have now is the weighted load avg, that is the sum of the
weighted load of the entities on the run queue. This is not usable to
detect overload because of the weight. An unweighted version of this
figure would be more useful, but it's not as accurate as the one I use
IMHO. The example that has been discussed during the review of the last
version has shown some limitations.
With the following schedule pattern from Morten's example
| 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A: run rq run ----------- sleeping ------------- run
B: rq run rq run ---- sleeping ------------- rq
The scheduler will see the following values:
Task A unweighted load value is 47%
Task B unweighted load is 60%
The maximum sum of unweighted load is 104%
rq->avg load is 60%
And the real CPU load is 50%
So we will take opposite decisions depending on which value is used:
rq->avg or the sum of unweighted load.
The sum of unweighted load has the main advantage of showing
immediately what the relative impact of adding/removing a task will be.
In the example, we can see that removing task A or B will remove
around half the CPU load, but it's not so good at giving the current
utilization of the CPU.
Vincent
On Thu, Jul 10, 2014 at 04:03:51PM +0200, Vincent Guittot wrote:
> On 10 July 2014 13:18, Peter Zijlstra <[email protected]> wrote:
> > On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
> >> + /*
> >> + * If the CPUs share their cache and the src_cpu's capacity is
> >> + * reduced because of other sched_class or IRQs, we trig an
> >> + * active balance to move the task
> >> + */
> >> + if ((sd->flags & SD_SHARE_PKG_RESOURCES)
> >> + && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> >> + sd->imbalance_pct)))
> >> return 1;
> >
> > Why is this tied to shared caches?
>
> It's just to limit the change of the policy to a level that can have
> benefit without performance regression. I'm not sure that we can do
> that at any level without risk
Similar to the other change; both details _should_ have been in the
changelogs etc.
In any case, it feels rather arbitrary to me. What about machines where
there's no cache sharing at all (the traditional SMP systems)? This
thing you're trying to do still seems to make sense there.
On Fri, Jul 11, 2014 at 09:51:06AM +0200, Vincent Guittot wrote:
> On 10 July 2014 15:16, Peter Zijlstra <[email protected]> wrote:
> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
> >>
> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> >> the utilization of the CPU. This statistic includes all tasks that run the CPU
> >> and not only CFS tasks.
> >
> > But this rq->avg is not the one that is migration aware, right? So why
> > use this?
>
> Yes, it's not the one that is migration aware
>
> >
> > We already compensate cpu_capacity for !fair tasks, so I don't see why
> > we can't use the migration aware one (and kill this one as Yuyang keeps
> > proposing) and compensate with the capacity factor.
>
> The 1st point is that cpu_capacity is compensated by both !fair_tasks
> and frequency scaling and we should not take into account frequency
> scaling for detecting overload
dvfs could help? Also, we should not use arch_scale_freq_capacity() for
things like cpufreq-ondemand etc., because for those the compute capacity
is still the max. We should only use it when we hard-limit things.
> What we have now is the the weighted load avg that is the sum of the
> weight load of entities on the run queue. This is not usable to detect
> overload because of the weight. An unweighted version of this figure
> would be more usefull but it's not as accurate as the one I use IMHO.
> The example that has been discussed during the review of the last
> version has shown some limitations
>
> With the following schedule pattern from Morten's example
>
> | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> A: run rq run ----------- sleeping ------------- run
> B: rq run rq run ---- sleeping ------------- rq
>
> The scheduler will see the following values:
> Task A unweighted load value is 47%
> Task B unweight load is 60%
> The maximum Sum of unweighted load is 104%
> rq->avg load is 60%
>
> And the real CPU load is 50%
>
> So we will have opposite decision depending of the used values: the
> rq->avg or the Sum of unweighted load
>
> The sum of unweighted load has the main advantage of showing
> immediately what will be the relative impact of adding/removing a
> task. In the example, we can see that removing task A or B will remove
> around half the CPU load but it's not so good for giving the current
> utilization of the CPU
In that same discussion ISTR a suggestion about adding avg_running time,
as opposed to the current avg_runnable. The sum of avg_running should be
much more accurate, and still react correctly to migrations.
On 11 July 2014 16:51, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jul 10, 2014 at 04:03:51PM +0200, Vincent Guittot wrote:
>> On 10 July 2014 13:18, Peter Zijlstra <[email protected]> wrote:
>> > On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
>> >> + /*
>> >> + * If the CPUs share their cache and the src_cpu's capacity is
>> >> + * reduced because of other sched_class or IRQs, we trig an
>> >> + * active balance to move the task
>> >> + */
>> >> + if ((sd->flags & SD_SHARE_PKG_RESOURCES)
>> >> + && ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> >> + sd->imbalance_pct)))
>> >> return 1;
>> >
>> > Why is this tied to shared caches?
>>
>> It's just to limit the change of the policy to a level that can have
>> benefit without performance regression. I'm not sure that we can do
>> that at any level without risk
>
> Similar to the other change; so both details _should_ have been in the
> changelogs etc..
I'm going to add the details in v4.
>
> In any case, its feels rather arbitrary to me. What about machines where
> there's no cache sharing at all (the traditional SMP systems). This
> thing you're trying to do still seems to make sense there.
OK, I thought that traditional SMP systems have this flag set at the
core level. I mean ARM platforms have the flag for CPUs in the same
cluster (which includes current ARM SMP systems), and the Core i7 of my
laptop has the flag at the core level.
>
On Fri, Jul 11, 2014 at 08:51:06AM +0100, Vincent Guittot wrote:
> On 10 July 2014 15:16, Peter Zijlstra <[email protected]> wrote:
> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
> >>
> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> >> the utilization of the CPU. This statistic includes all tasks that run the CPU
> >> and not only CFS tasks.
> >
> > But this rq->avg is not the one that is migration aware, right? So why
> > use this?
>
> Yes, it's not the one that is migration aware
>
> >
> > We already compensate cpu_capacity for !fair tasks, so I don't see why
> > we can't use the migration aware one (and kill this one as Yuyang keeps
> > proposing) and compensate with the capacity factor.
>
> The 1st point is that cpu_capacity is compensated by both !fair_tasks
> and frequency scaling and we should not take into account frequency
> scaling for detecting overload
>
> What we have now is the the weighted load avg that is the sum of the
> weight load of entities on the run queue. This is not usable to detect
> overload because of the weight. An unweighted version of this figure
> would be more usefull but it's not as accurate as the one I use IMHO.
IMHO there is no perfect utilization metric, but I think it is
fundamentally wrong to use a metric that is migration unaware to make
migration decisions. I mentioned that during the last review as well. It
is like having a very fast controller with a really slow (large delay)
feedback loop. There is a high risk of getting an unstable balance when
your load-balance rate is faster than the feedback delay.
> The example that has been discussed during the review of the last
> version has shown some limitations
>
> With the following schedule pattern from Morten's example
>
> | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> A: run rq run ----------- sleeping ------------- run
> B: rq run rq run ---- sleeping ------------- rq
>
> The scheduler will see the following values:
> Task A unweighted load value is 47%
> Task B unweight load is 60%
> The maximum Sum of unweighted load is 104%
> rq->avg load is 60%
>
> And the real CPU load is 50%
>
> So we will have opposite decision depending of the used values: the
> rq->avg or the Sum of unweighted load
>
> The sum of unweighted load has the main advantage of showing
> immediately what will be the relative impact of adding/removing a
> task. In the example, we can see that removing task A or B will remove
> around half the CPU load but it's not so good for giving the current
> utilization of the CPU
You forgot to mention the issues with rq->avg that were brought up last
time :-)
Here is an load-balancing example:
Task A, B, C, and D are all running/runnable constantly. To avoid
decimals we assume the sched tick to have a 9 ms period. We have four
cpus in a single sched_domain.
rq == rq->avg
uw == unweighted tracked load
cpu0:
| 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A: run rq rq
B: rq run rq
C: rq rq run
D: rq rq rq run run run run run run
rq: 100% 100% 100% 100% 100% 100% 100% 100% 100%
uw: 400% 400% 400% 100% 100% 100% 100% 100% 100%
cpu1:
| 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A: run rq run rq run rq
B: rq run rq run rq run
C:
D:
rq: 0% 0% 0% 0% 6% 12% 18% 23% 28%
uw: 0% 0% 0% 200% 200% 200% 200% 200% 200%
cpu2:
| 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:
B:
C: run run run run run run
D:
rq: 0% 0% 0% 0% 6% 12% 18% 23% 28%
uw: 0% 0% 0% 100% 100% 100% 100% 100% 100%
cpu3:
| 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:
B:
C:
D:
rq: 0% 0% 0% 0% 0% 0% 0% 0% 0%
uw: 0% 0% 0% 0% 0% 0% 0% 0% 0%
A periodic load-balance occurs on cpu1 after 9 ms. cpu0 rq->avg
indicates overload. Consequently cpu1 pulls task A and B.
Shortly after (<1 ms) cpu2 does a periodic load-balance. cpu0 rq->avg
hasn't changed so cpu0 still appears overloaded. cpu2 pulls task C.
Shortly after (<1 ms) cpu3 does a periodic load-balance. cpu0 rq->avg
still indicates overload so cpu3 tries to pull tasks but fails since
there is only task D left.
9 ms later the sched tick causes periodic load-balances on all the cpus.
cpu0 rq->avg still indicates that it has the highest load since cpu1
rq->avg has not had time to indicate overload. Consequently cpu1, 2,
and 3 will try to pull from that and fail. The balance will only change
once cpu1 rq->avg has increased enough to indicate overload.
Unweighted load will on the other hand indicate the load changes
instantaneously, so cpu3 would observe the overload of cpu1 immediately
and pull task A or B.
In this example using rq->avg leads to imbalance whereas unweighted load
would not. Correct me if I missed anything.
Coming back to the previous example. I'm not convinced that inflation of
the unweighted load sum when tasks overlap in time is a bad thing. I
have mentioned this before. The average cpu utilization over the 40ms
period is 50%. However the true compute capacity demand is 200% for the
first 15ms of the period, 100% for the next 5ms and 0% for the remaining
25ms. The cpu is actually overloaded for 15ms every 40ms. This fact is
factored into the unweighted load whereas rq->avg would give you the
same utilization no matter if the tasks are overlapped or not. Hence
unweighted load would give us an indication that the mix of tasks isn't
optimal even if the cpu has spare cycles.
If you don't care about overlap and latency, the unweighted sum of task
running time (that Peter has proposed a number of times) is a better
metric, IMHO, as long as the cpu isn't fully utilized.
Morten
On 11 July 2014 17:13, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jul 11, 2014 at 09:51:06AM +0200, Vincent Guittot wrote:
>> On 10 July 2014 15:16, Peter Zijlstra <[email protected]> wrote:
>> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
>> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>> >>
>> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
>> >> the utilization of the CPU. This statistic includes all tasks that run the CPU
>> >> and not only CFS tasks.
>> >
>> > But this rq->avg is not the one that is migration aware, right? So why
>> > use this?
>>
>> Yes, it's not the one that is migration aware
>>
>> >
>> > We already compensate cpu_capacity for !fair tasks, so I don't see why
>> > we can't use the migration aware one (and kill this one as Yuyang keeps
>> > proposing) and compensate with the capacity factor.
>>
>> The 1st point is that cpu_capacity is compensated by both !fair_tasks
>> and frequency scaling and we should not take into account frequency
>> scaling for detecting overload
>
> dvfs could help? Also we should not use arch_scale_freq_capacity() for
> things like cpufreq-ondemand etc. Because for those the compute capacity
> is still the max. We should only use it when we hard limit things.
In my mind, arch_scale_cpu_freq was intended to scale the capacity of
the CPU according to the current dvfs operating point.
As it's no longer used anywhere now that we have arch_scale_cpu, we
could probably remove it... and see when it becomes needed.
>
>> What we have now is the the weighted load avg that is the sum of the
>> weight load of entities on the run queue. This is not usable to detect
>> overload because of the weight. An unweighted version of this figure
>> would be more usefull but it's not as accurate as the one I use IMHO.
>> The example that has been discussed during the review of the last
>> version has shown some limitations
>>
>> With the following schedule pattern from Morten's example
>>
>> | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
>> A: run rq run ----------- sleeping ------------- run
>> B: rq run rq run ---- sleeping ------------- rq
>>
>> The scheduler will see the following values:
>> Task A unweighted load value is 47%
>> Task B unweight load is 60%
>> The maximum Sum of unweighted load is 104%
>> rq->avg load is 60%
>>
>> And the real CPU load is 50%
>>
>> So we will have opposite decision depending of the used values: the
>> rq->avg or the Sum of unweighted load
>>
>> The sum of unweighted load has the main advantage of showing
>> immediately what will be the relative impact of adding/removing a
>> task. In the example, we can see that removing task A or B will remove
>> around half the CPU load, but it's not as good at giving the current
>> utilization of the CPU.
>
> In that same discussion ISTR a suggestion about adding avg_running time,
> as opposed to the current avg_runnable. The sum of avg_running should be
> much more accurate, and still react correctly to migrations.
I haven't looked at it in detail, but I agree that avg_running would be much
more accurate than avg_runnable and should probably fit the
requirement. Does it mean that we could re-add the avg_running (or
something similar) that disappeared during the review of the load avg
tracking patchset?
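For reference, a toy model of the difference between the two signals (this is
not PELT and not code from any of the patches discussed here; the names and the
decay factor are invented): a runnable sum accumulates time spent on the rq
whether running or waiting, while a running sum only accumulates time actually
executing.

struct toy_avg {
        unsigned long runnable_sum;     /* time on the rq: running or waiting */
        unsigned long running_sum;      /* time actually on the cpu */
        unsigned long period;
};

static void toy_update(struct toy_avg *a, unsigned long delta_us,
                       int on_rq, int running)
{
        /* crude geometric decay standing in for PELT's y^n weighting */
        a->runnable_sum -= a->runnable_sum / 32;
        a->running_sum  -= a->running_sum / 32;
        a->period       -= a->period / 32;

        if (on_rq)
                a->runnable_sum += delta_us;
        if (running)
                a->running_sum += delta_us;
        a->period += delta_us;
}

Two tasks time-sharing a cpu keep the sum of their running sums bounded by the
wall-clock time, while their runnable sums keep growing with the waiting time,
which is exactly the distinction discussed above.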
>
>
On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> In my mind, arch_scale_cpu_freq was intended to scale the capacity of
> the CPU according to the current dvfs operating point.
> As it's no longer used anywhere now that we have arch_scale_cpu, we could
> probably remove it ... and see when it becomes used.
I probably should have written comments when I wrote that code, but it
was meant to be used only where, as described above, we limit things.
Ondemand and such, which will temporarily decrease freq, will ramp it up
again on demand, and therefore lowering the capacity will skew things.
You'll put less load on because it's run slower, and then you'll run it
slower because there's less load on -> cyclic FAIL.
> > In that same discussion ISTR a suggestion about adding avg_running time,
> > as opposed to the current avg_runnable. The sum of avg_running should be
> > much more accurate, and still react correctly to migrations.
>
> I haven't looked at it in detail, but I agree that avg_running would be much
> more accurate than avg_runnable and should probably fit the
> requirement. Does it mean that we could re-add the avg_running (or
> something similar) that disappeared during the review of the load avg
> tracking patchset?
Sure, I think we killed it there because there wasn't an actual use for
it and I'm always in favour of stripping everything to their bare bones,
esp big and complex things.
And then later, add things back once we have need for it.
On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > In my mind, arch_scale_cpu_freq was intended to scale the capacity of
> > the CPU according to the current dvfs operating point.
> > As it's no longer used anywhere now that we have arch_scale_cpu, we could
> > probably remove it ... and see when it becomes used.
>
> I probably should have written comments when I wrote that code, but it
> was meant to be used only where, as described above, we limit things.
> Ondemand and such, which will temporarily decrease freq, will ramp it up
> again on demand, and therefore lowering the capacity will skew things.
>
> You'll put less load on because it's run slower, and then you'll run it
> slower because there's less load on -> cyclic FAIL.
Agreed. We can't use a frequency scaled compute capacity for all
load-balancing decisions. However, IMHO, it would be useful to know
the current compute capacity in addition to the max compute capacity
when considering energy costs. So we would have something like:
* capacity_max: cpu capacity at highest frequency.
* capacity_cur: cpu capacity at current frequency.
* capacity_avail: cpu capacity currently available. Basically
capacity_cur taking rt, deadline, and irq accounting into account.
capacity_max should probably include rt, deadline, and irq accounting as
well. Or do we need both?
Based on your description arch_scale_freq_capacity() can't be abused to
implement capacity_cur (and capacity_avail) unless it is repurposed.
Nobody seems to implement it. Otherwise we would need something similar
to update capacity_cur (and capacity_avail).
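To make the relationship between the three concrete, here is a minimal sketch
(the names, the 1024 fixed-point scale and the frequency-ratio scaling are
assumptions for illustration only, not an existing kernel interface):

#define CAPACITY_SCALE  1024

struct cpu_capacity_view {
        unsigned long max;      /* capacity at the highest frequency */
        unsigned long cur;      /* capacity at the current frequency */
        unsigned long avail;    /* cur minus the rt/dl/irq share */
};

/* non_cfs_share: rt/dl/irq fraction in 1/CAPACITY_SCALE units, <= CAPACITY_SCALE */
static struct cpu_capacity_view
capacity_view(unsigned long cap_max, unsigned long freq_cur,
              unsigned long freq_max, unsigned long non_cfs_share)
{
        struct cpu_capacity_view v;

        v.max = cap_max;
        v.cur = freq_max ? cap_max * freq_cur / freq_max : cap_max;
        v.avail = v.cur * (CAPACITY_SCALE - non_cfs_share) / CAPACITY_SCALE;

        return v;
}

Whether the rt/dl/irq share should also be folded into capacity_max is the open
question above.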
As a side note, we can potentially get into a similar fail cycle already
due to the lack of scale invariance in the entity load tracking.
>
> > > In that same discussion ISTR a suggestion about adding avg_running time,
> > > as opposed to the current avg_runnable. The sum of avg_running should be
> > > much more accurate, and still react correctly to migrations.
> >
> > I haven't looked at it in detail, but I agree that avg_running would be much
> > more accurate than avg_runnable and should probably fit the
> > requirement. Does it mean that we could re-add the avg_running (or
> > something similar) that disappeared during the review of the load avg
> > tracking patchset?
>
> Sure, I think we killed it there because there wasn't an actual use for
> it and I'm always in favour of stripping everything to their bare bones,
> esp big and complex things.
>
> And then later, add things back once we have need for it.
I think it is a useful addition to the set of utilization metrics. I
don't think it is universally more accurate than runnable_avg. Actually
quite the opposite when the cpu is overloaded. But for partially loaded
cpus it is very useful if you don't want to factor in waiting time on
the rq.
Morten
On Mon, Jul 14, 2014 at 01:55:29PM +0100, Morten Rasmussen wrote:
> On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > > In my mind, arch_scale_cpu_freq was intended to scale the capacity of
> > > the CPU according to the current dvfs operating point.
> > > As it's no longer used anywhere now that we have arch_scale_cpu, we could
> > > probably remove it ... and see when it becomes used.
> >
> > I probably should have written comments when I wrote that code, but it
> > was meant to be used only where, as described above, we limit things.
> > Ondemand and such, which will temporarily decrease freq, will ramp it up
> > again on demand, and therefore lowering the capacity will skew things.
> >
> > You'll put less load on because it's run slower, and then you'll run it
> > slower because there's less load on -> cyclic FAIL.
>
> Agreed. We can't use a frequency scaled compute capacity for all
> load-balancing decisions. However, IMHO, it would be useful to know
> the current compute capacity in addition to the max compute capacity
> when considering energy costs. So we would have something like:
>
> * capacity_max: cpu capacity at highest frequency.
>
> * capacity_cur: cpu capacity at current frequency.
>
> * capacity_avail: cpu capacity currently available. Basically
> capacity_cur taking rt, deadline, and irq accounting into account.
>
> capacity_max should probably include rt, deadline, and irq accounting as
> well. Or do we need both?
I'm struggling to fully grasp your intent. We need DVFS like accounting
for sure, and that means a current freq hook, but I'm not entirely sure
how that relates to capacity.
> Based on your description arch_scale_freq_capacity() can't be abused to
> implement capacity_cur (and capacity_avail) unless it is repurposed.
> Nobody seems to implement it. Otherwise we would need something similar
> to update capacity_cur (and capacity_avail).
Yeah, I never got around to doing so. I started doing an APERF/MPERF SMT
capacity thing for x86 but never finished that. The naive implementation
suffered the same FAIL loop as above because APERF stops on idle. So
when idle your capacity drops to nothing, leading to no new work,
leading to more idle etc.
I never got around to fixing that -- adding an idle filter, and ever
since things have somewhat bitrotted.
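For what it's worth, the frequency-ratio part of that idea is simple to sketch
(illustrative only, not the unfinished x86 code mentioned above); the hard part
is exactly the idle filtering, since IA32_APERF/IA32_MPERF only advance in C0:

/*
 * d_aperf/d_mperf are deltas of IA32_APERF/IA32_MPERF over a sampling
 * window; how those deltas are sampled and idle-filtered is left out.
 */
static unsigned long aperf_mperf_capacity(unsigned long cap_max,
                                          unsigned long long d_aperf,
                                          unsigned long long d_mperf)
{
        if (!d_mperf)
                return cap_max; /* no non-idle cycles sampled */

        /* scale the max capacity by the effective frequency ratio */
        return (unsigned long)(cap_max * d_aperf / d_mperf);
}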
> As a side note, we can potentially get into a similar fail cycle already
> due to the lack of scale invariance in the entity load tracking.
Yah, I think that got mentioned a long while ago.
> > > > In that same discussion ISTR a suggestion about adding avg_running time,
> > > > as opposed to the current avg_runnable. The sum of avg_running should be
> > > > much more accurate, and still react correctly to migrations.
> > >
> > > I haven't looked at it in detail, but I agree that avg_running would be much
> > > more accurate than avg_runnable and should probably fit the
> > > requirement. Does it mean that we could re-add the avg_running (or
> > > something similar) that disappeared during the review of the load avg
> > > tracking patchset?
> >
> > Sure, I think we killed it there because there wasn't an actual use for
> > it and I'm always in favour of stripping everything to their bare bones,
> > esp big and complex things.
> >
> > And then later, add things back once we have need for it.
>
> I think it is a useful addition to the set of utilization metrics. I
> don't think it is universally more accurate than runnable_avg. Actually
> quite the opposite when the cpu is overloaded. But for partially loaded
> cpus it is very useful if you don't want to factor in waiting time on
> the rq.
Well, different things different names. Utilization as per literature is
simply the fraction of CPU time actually used. In that sense running_avg
is about right for that. Our current runnable_avg is entirely different
(as I think we all agree by now).
But yes, for application the tipping point is u == 1, up until that
point pure utilization makes sense, after that our runnable_avg makes
more sense.
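A crude way to state that tipping point in code (purely illustrative, all names
invented here):

/*
 * running_util: running-time based utilization of the cpu
 * runnable_sum: unweighted sum of runnable time
 * capacity:     compute capacity of the cpu (the u == 1 point)
 */
static unsigned long cpu_load_signal(unsigned long running_util,
                                     unsigned long runnable_sum,
                                     unsigned long capacity)
{
        if (running_util < capacity)    /* u < 1: utilization is meaningful */
                return running_util;

        return runnable_sum;            /* u >= 1: it saturates, use runnable */
}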
On Fri, Jul 11, 2014 at 05:17:44PM +0200, Vincent Guittot wrote:
> > In any case, its feels rather arbitrary to me. What about machines where
> > there's no cache sharing at all (the traditional SMP systems). This
> > thing you're trying to do still seems to make sense there.
>
> ok, I thought that traditional SMP systems have this flag set at core
> level.
Yeah, with 1 core, so it's effectively disabled.
> I mean ARM platforms have the flag for CPUs in the same cluster
> (which includes current ARM SMP systems) and the corei7 of my laptop has
> the flag at the core level.
So I can see 'small' parts reducing shared caches in order to improve
idle performance.
The point being that LLC seems a somewhat arbitrary measure for this.
Can we try and see what happens if you remove the limit? It's always best
to try the simpler things first and only make it more complex if we have
to.
On Mon, Jul 14, 2014 at 02:20:52PM +0100, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 01:55:29PM +0100, Morten Rasmussen wrote:
> > On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > > > In my mind, arch_scale_cpu_freq was intended to scale the capacity of
> > > > the CPU according to the current dvfs operating point.
> > > > As it's no longer used anywhere now that we have arch_scale_cpu, we could
> > > > probably remove it ... and see when it becomes used.
> > >
> > > I probably should have written comments when I wrote that code, but it
> > > was meant to be used only where, as described above, we limit things.
> > > Ondemand and such, which will temporarily decrease freq, will ramp it up
> > > again on demand, and therefore lowering the capacity will skew things.
> > >
> > > You'll put less load on because it's run slower, and then you'll run it
> > > slower because there's less load on -> cyclic FAIL.
> >
> > Agreed. We can't use a frequency scaled compute capacity for all
> > load-balancing decisions. However, IMHO, it would be useful to know
> > the current compute capacity in addition to the max compute capacity
> > when considering energy costs. So we would have something like:
> >
> > * capacity_max: cpu capacity at highest frequency.
> >
> > * capacity_cur: cpu capacity at current frequency.
> >
> > * capacity_avail: cpu capacity currently available. Basically
> > capacity_cur taking rt, deadline, and irq accounting into account.
> >
> > capacity_max should probably include rt, deadline, and irq accounting as
> > well. Or do we need both?
>
> I'm struggling to fully grasp your intent. We need DVFS like accounting
> for sure, and that means a current freq hook, but I'm not entirely sure
> how that relates to capacity.
We can abstract all the factors that affect current compute capacity
(frequency, P-states, big.LITTLE,...) in the scheduler by having
something like capacity_{cur,avail} to tell us how much capacity a
particular cpu has in its current state. Assuming that we implement scale
invariance for entity load tracking (we are working on that), we can
directly compare task utilization with compute capacity for balancing
decisions. For example, we can figure out how much spare capacity a cpu
has in its current state by simply:
spare_capacity(cpu) = capacity_avail(cpu) - \sum_{t \in tasks(cpu)} util(t)
If you put more than spare_capacity(cpu) worth of task utilization on
the cpu, you will cause the cpu (and any affected cpus) to change
P-state and potentially be less energy-efficient.
Does that make any sense?
Instead of dealing with frequencies directly in the scheduler code, we
can abstract it by just having scalable compute capacity.
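A minimal sketch of that computation, assuming per-task utilization is already
scale-invariant and expressed in the same units as capacity_avail (names are
illustrative, not kernel API):

static unsigned long spare_capacity(unsigned long capacity_avail,
                                    const unsigned long *util, int nr_tasks)
{
        unsigned long sum = 0;
        int i;

        for (i = 0; i < nr_tasks; i++)
                sum += util[i];

        /* no headroom left at the current P-state */
        if (sum >= capacity_avail)
                return 0;

        return capacity_avail - sum;
}

Placing more than that amount of additional utilization on the cpu is what
forces a P-state change in the description above.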
>
> > Based on your description arch_scale_freq_capacity() can't be abused to
> > implement capacity_cur (and capacity_avail) unless it is repurposed.
> > Nobody seems to implement it. Otherwise we would need something similar
> > to update capacity_cur (and capacity_avail).
>
> Yeah, I never got around to doing so. I started doing an APERF/MPERF SMT
> capacity thing for x86 but never finished that. The naive implementation
> suffered the same FAIL loop as above because APERF stops on idle. So
> when idle your capacity drops to nothing, leading to no new work,
> leading to more idle etc.
>
> I never got around to fixing that -- adding an idle filter, and ever
> since things have somewhat bitrotted.
I see.
>
> > As a side note, we can potentially get into a similar fail cycle already
> > due to the lack of scale invariance in the entity load tracking.
>
> Yah, I think that got mentioned a long while ago.
It did :-)
>
> > > > > In that same discussion ISTR a suggestion about adding avg_running time,
> > > > > as opposed to the current avg_runnable. The sum of avg_running should be
> > > > > much more accurate, and still react correctly to migrations.
> > > >
> > > > I haven't looked at it in detail, but I agree that avg_running would be much
> > > > more accurate than avg_runnable and should probably fit the
> > > > requirement. Does it mean that we could re-add the avg_running (or
> > > > something similar) that disappeared during the review of the load avg
> > > > tracking patchset?
> > >
> > > Sure, I think we killed it there because there wasn't an actual use for
> > > it and I'm always in favour of stripping everything to their bare bones,
> > > esp big and complex things.
> > >
> > > And then later, add things back once we have need for it.
> >
> > I think it is a useful addition to the set of utilization metrics. I
> > don't think it is universally more accurate than runnable_avg. Actually
> > quite the opposite when the cpu is overloaded. But for partially loaded
> > cpus it is very useful if you don't want to factor in waiting time on
> > the rq.
>
> Well, different things different names. Utilization as per literature is
> simply the fraction of CPU time actually used. In that sense running_avg
> is about right for that. Our current runnable_avg is entirely different
> (as I think we all agree by now).
>
> But yes, for application the tipping point is u == 1, up until that
> point pure utilization makes sense, after that our runnable_avg makes
> more sense.
Agreed.
If you really care about latency/performance you might be interested in
comparing running_avg and runnable_avg even for u < 1. If the
running_avg/runnable_avg ratio is significantly less than one, tasks are
waiting on the rq to be scheduled.
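As a rough sketch of that check (the 0.9 threshold and the names are invented
for illustration, not part of the patchset):

/* running/runnable noticeably below 1 => tasks are queueing behind each other */
static int rq_is_contended(unsigned long running_avg, unsigned long runnable_avg)
{
        if (!runnable_avg)
                return 0;

        return running_avg * 100 < runnable_avg * 90;
}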
On Mon, Jul 14, 2014 at 03:04:35PM +0100, Morten Rasmussen wrote:
> > I'm struggling to fully grasp your intent. We need DVFS like accounting
> > for sure, and that means a current freq hook, but I'm not entirely sure
> > how that relates to capacity.
>
> We can abstract all the factors that affect current compute capacity
> (frequency, P-states, big.LITTLE,...) in the scheduler by having
> something like capacity_{cur,avail} to tell us how much capacity a
> particular cpu has in its current state. Assuming that we implement scale
> invariance for entity load tracking (we are working on that), we can
> directly compare task utilization with compute capacity for balancing
> decisions. For example, we can figure out how much spare capacity a cpu
> has in its current state by simply:
>
> spare_capacity(cpu) = capacity_avail(cpu) - \sum_{t \in tasks(cpu)} util(t)
>
> If you put more than spare_capacity(cpu) worth of task utilization on
> the cpu, you will cause the cpu (and any affected cpus) to change
> P-state and potentially be less energy-efficient.
>
> Does that make any sense?
>
> Instead of dealing with frequencies directly in the scheduler code, we
> can abstract it by just having scalable compute capacity.
Ah, ok. Same thing then.
> > But yes, for application the tipping point is u == 1, up until that
> > point pure utilization makes sense, after that our runnable_avg makes
> > more sense.
>
> Agreed.
>
> If you really care about latency/performance you might be interested in
> comparing running_avg and runnable_avg even for u < 1. If the
> running_avg/runnable_avg ratio is significantly less than one, tasks are
> waiting on the rq to be scheduled.
Indeed, that gives a measure of queueing.
[...]
>> In that same discussion ISTR a suggestion about adding avg_running time,
>> as opposed to the current avg_runnable. The sum of avg_running should be
>> much more accurate, and still react correctly to migrations.
>
> I haven't looked at it in detail, but I agree that avg_running would be much
> more accurate than avg_runnable and should probably fit the
> requirement. Does it mean that we could re-add the avg_running (or
> something similar) that disappeared during the review of the load avg
> tracking patchset?
Are you referring to '[RFC PATCH 14/14] sched: implement usage tracking'
https://lkml.org/lkml/2012/2/1/769 from Paul Turner?
__update_entity_runnable_avg() has an additional parameter 'running' so
that it can be called for
a) sched_entities in update_entity_load_avg():
__update_entity_runnable_avg(..., se->on_rq, cfs_rq->curr == se))
b) rq's in update_rq_runnable_avg():
__update_entity_runnable_avg(..., runnable, runnable);
I can see how it gives us two different signals for a sched_entity but
for a rq?
Am I missing something here?
-- Dietmar
[...]
On 11 July 2014 22:12, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
>> In my mind, arch_scale_cpu_freq was intended to scale the capacity of
>> the CPU according to the current dvfs operating point.
>> As it's no longer used anywhere now that we have arch_scale_cpu, we could
>> probably remove it ... and see when it becomes used.
>
> I probably should have written comments when I wrote that code, but it
> was meant to be used only where, as described above, we limit things.
> Ondemand and such, which will temporarily decrease freq, will ramp it up
> again on demand, and therefore lowering the capacity will skew things.
>
> You'll put less load on because it's run slower, and then you'll run it
> slower because there's less load on -> cyclic FAIL.
>
>> > In that same discussion ISTR a suggestion about adding avg_running time,
>> > as opposed to the current avg_runnable. The sum of avg_running should be
>> > much more accurate, and still react correctly to migrations.
>>
>> I haven't looked at it in detail, but I agree that avg_running would be much
>> more accurate than avg_runnable and should probably fit the
>> requirement. Does it mean that we could re-add the avg_running (or
>> something similar) that disappeared during the review of the load avg
>> tracking patchset?
>
> Sure, I think we killed it there because there wasn't an actual use for
> it and I'm always in favour of stripping everything to their bare bones,
> esp big and complex things.
>
> And then later, add things back once we have need for it.
Ok, I'm going to look at how to add it back, taking into account
Yuyang's current rework of load_avg.
On 14 July 2014 15:51, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jul 11, 2014 at 05:17:44PM +0200, Vincent Guittot wrote:
>> > In any case, its feels rather arbitrary to me. What about machines where
>> > there's no cache sharing at all (the traditional SMP systems). This
>> > thing you're trying to do still seems to make sense there.
>>
>> ok, I thought that traditional SMP systems have this flag set at core
>> level.
>
> Yeah, with 1 core, so it's effectively disabled.
>
>> I mean ARM platforms have the flag for CPUs in the same cluster
>> (which includes current ARM SMP systems) and the corei7 of my laptop has
>> the flag at the core level.
>
> So I can see 'small' parts reducing shared caches in order to improve
> idle performance.
>
> The point being that LLC seems a somewhat arbitrary measure for this.
>
> Can we try and see what happens if you remove the limit? It's always best
> to try the simpler things first and only make it more complex if we have
> to.
Ok, I will remove the condition in the next version.
On 11 July 2014 18:13, Morten Rasmussen <[email protected]> wrote:
[snip]
>
> In this example using rq->avg leads to imbalance whereas unweighted load
> would not. Correct me if I missed anything.
You just missed taking into account how the imbalance is computed.
>
> Coming back to the previous example. I'm not convinced that inflation of
> the unweighted load sum when tasks overlap in time is a bad thing. I
> have mentioned this before. The average cpu utilization over the 40ms
> period is 50%. However the true compute capacity demand is 200% for the
> first 15ms of the period, 100% for the next 5ms and 0% for the remaining
> 25ms. The cpu is actually overloaded for 15ms every 40ms. This fact is
> factored into the unweighted load whereas rq->avg would give you the
> same utilization no matter if the tasks are overlapped or not. Hence
> unweighted load would give us an indication that the mix of tasks isn't
> optimal even if the cpu has spare cycles.
>
> If you don't care about overlap and latency, the unweighted sum of task
> running time (that Peter has proposed a number of times) is a better
> metric, IMHO, as long as the cpu isn't fully utilized.
>
> Morten
On Tue, Jul 15, 2014 at 10:27:19AM +0100, Vincent Guittot wrote:
> On 11 July 2014 18:13, Morten Rasmussen <[email protected]> wrote:
>
> [snip]
>
> >
> > In this example using rq->avg leads to imbalance whereas unweighted load
> > would not. Correct me if I missed anything.
>
> You just missed taking into account how the imbalance is computed.
I don't think so. I'm aware that the imbalance is calculated based on
the runnable_load_avg of the cpus. But if you pick the wrong cpus to
compare to begin with, it doesn't matter.
Morten
On 15 July 2014 11:32, Morten Rasmussen <[email protected]> wrote:
> On Tue, Jul 15, 2014 at 10:27:19AM +0100, Vincent Guittot wrote:
>> On 11 July 2014 18:13, Morten Rasmussen <[email protected]> wrote:
>>
>> [snip]
>>
>> >
>> > In this example using rq->avg leads to imbalance whereas unweighted load
>> > would not. Correct me if I missed anything.
>>
>> You just missed taking into account how the imbalance is computed.
>
> I don't think so. I'm aware that the imbalance is calculated based on
> the runnable_load_avg of the cpus. But if you pick the wrong cpus to
> compare to begin with, it doesn't matter.
The average load of your sched_domain is 1 task per CPU, so CPU1 will
pull only 1 task and not 2.
>
> Morten
On Mon, Jul 14, 2014 at 06:54:11PM +0100, Dietmar Eggemann wrote:
> __update_entity_runnable_avg() has an additional parameter 'running' so
> that it can be called for
>
> a) sched_entities in update_entity_load_avg():
>
> __update_entity_runnable_avg(..., se->on_rq, cfs_rq->curr == se))
>
>
> b) rq's in update_rq_runnable_avg():
>
> __update_entity_runnable_avg(..., runnable, runnable);
>
> I can see how it gives us two different signals for a sched_entity but
> for a rq?
>
For rq,
__update_entity_runnable_avg(..., runnable, runnable > 0)
Then the first one would effectively be CPU ConCurrency (fair and !fair), and
the second one would be CPU (has task) running, i.e. roughly CPU utilization
for fair and !fair. :)
Thanks,
Yuyang