Several wrong task placement issues have been raised with the current load
balance algorithm, but their fixes are not always straightforward and
end up using biased values to force migrations. A cleanup and rework
of the load balance will help to handle such use cases and enable fine-grained
tuning of the scheduler's behavior for other cases.
Patch 1 has already been sent separately; it only consolidates the asym policy
in one place and helps the review of the changes in load_balance.
Patch 2 renames the sum of h_nr_running in stats.
Patch 3 removes meaningless imbalance computation to make review of
patch 4 easier.
Patch 4 reworks the load_balance algorithm and fixes some wrong task placements
while trying to stay conservative.
Patch 5 adds the sum of nr_running to monitor non-CFS tasks and takes that
into account when pulling tasks.
Patch 6 replaces runnable_load with load now that the signal is only used
when groups are overloaded.
Patch 7 improves the spread of tasks at the 1st scheduling level.
Patch 8 uses utilization instead of load in all steps of misfit task
path.
Patch 9 replaces runnable_load_avg by load_avg in the wake up path.
Patch 10 optimizes find_idlest_group(), which was using both runnable_load
and load. It has not been squashed with the previous patch to ease the
review.
Some benchmark results based on 8 iterations of each test:
- small arm64 dual quad cores system
tip/sched/core w/ this patchset improvement
schedpipe 54981 +/-0.36% 55459 +/-0.31% (+0.97%)
hackbench
1 groups 0.906 +/-2.34% 0.906 +/-2.88% (+0.06%)
- large arm64 2 nodes / 224 cores system
tip/sched/core w/ this patchset improvement
schedpipe 125323 +/-0.98% 125624 +/-0.71% (+0.24%)
hackbench -l (256000/#grp) -g #grp
1 groups 15.360 +/-1.76% 14.206 +/-1.40% (+8.69%)
4 groups 5.822 +/-1.02% 5.508 +/-6.45% (+5.38%)
16 groups 3.103 +/-0.80% 3.244 +/-0.77% (-4.52%)
32 groups 2.892 +/-1.23% 2.850 +/-1.81% (+1.47%)
64 groups 2.825 +/-1.51% 2.725 +/-1.51% (+3.54%)
128 groups 3.149 +/-8.46% 3.053 +/-13.15% (+3.06%)
256 groups 3.511 +/-8.49% 3.019 +/-1.71% (+14.03%)
dbench
1 groups 329.677 +/-0.46% 329.771 +/-0.11% (+0.03%)
4 groups 931.499 +/-0.79% 947.118 +/-0.94% (+1.68%)
16 groups 1924.210 +/-0.89% 1947.849 +/-0.76% (+1.23%)
32 groups 2350.646 +/-5.75% 2351.549 +/-6.33% (+0.04%)
64 groups 2201.524 +/-3.35% 2192.749 +/-5.84% (-0.40%)
128 groups 2206.858 +/-2.50% 2376.265 +/-7.44% (+7.68%)
256 groups 1263.520 +/-3.34% 1633.143 +/-13.02% (+29.25%)
tip/sched/core sha1:
0413d7f33e60 ('sched/uclamp: Always use 'enum uclamp_id' for clamp_id values')
Changes since v2:
- fix typo and reorder code
- some minor code fixes
- optimize find_idlest_group()
Not covered in this patchset:
- update find_idlest_group() to be more aligned with load_balance(). I didn't
want to delay this version because of this update, which is not ready yet
- better detection of the overloaded and fully busy states, especially for
cases where nr_running > nr CPUs.
Vincent Guittot (10):
sched/fair: clean up asym packing
sched/fair: rename sum_nr_running to sum_h_nr_running
sched/fair: remove meaningless imbalance calculation
sched/fair: rework load_balance
sched/fair: use rq->nr_running when balancing load
sched/fair: use load instead of runnable load in load_balance
sched/fair: evenly spread tasks when not overloaded
sched/fair: use utilization to select misfit task
sched/fair: use load instead of runnable load in wakeup path
sched/fair: optimize find_idlest_group
kernel/sched/fair.c | 805 +++++++++++++++++++++++++++-------------------------
1 file changed, 417 insertions(+), 388 deletions(-)
--
2.7.4
Runnable load has been introduced to take into account the case
where blocked load biases the load balance decision, which was selecting
an underutilized group with a huge blocked load whereas other groups were
overloaded.
The load is now only used when groups are overloaded. In this case,
it's worth being conservative and taking into account the sleeping
tasks that might wake up on the CPU.
Signed-off-by: Vincent Guittot <[email protected]>
---
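As an aside, here is a minimal userspace sketch of why load is the more
conservative signal once a group is overloaded: runnable load only sums the
tasks currently enqueued, while load also keeps the (decayed) contribution of
blocked tasks that may wake up on the CPU. The toy_* helpers and the numbers
below are made up for illustration and are not the kernel implementation.

#include <stdio.h>

struct toy_task {
	unsigned long load_contrib;	/* PELT-like load contribution */
	int queued;			/* 1 if currently on the runqueue */
};

/* sums only enqueued tasks, in the spirit of runnable_load_avg */
static unsigned long toy_runnable_load(const struct toy_task *t, int n)
{
	unsigned long sum = 0;

	for (int i = 0; i < n; i++)
		if (t[i].queued)
			sum += t[i].load_contrib;
	return sum;
}

/* sums enqueued and blocked tasks, in the spirit of load_avg */
static unsigned long toy_load(const struct toy_task *t, int n)
{
	unsigned long sum = 0;

	for (int i = 0; i < n; i++)
		sum += t[i].load_contrib;
	return sum;
}

int main(void)
{
	/* two running tasks plus one sleeper that may wake up here */
	struct toy_task rq[] = {
		{ .load_contrib = 512, .queued = 1 },
		{ .load_contrib = 512, .queued = 1 },
		{ .load_contrib = 400, .queued = 0 },
	};

	printf("runnable load: %lu\n", toy_runnable_load(rq, 3));	/* 1024 */
	printf("load         : %lu\n", toy_load(rq, 3));		/* 1424 */
	return 0;
}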
kernel/sched/fair.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e74836..15ec38c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5385,6 +5385,11 @@ static unsigned long cpu_runnable_load(struct rq *rq)
return cfs_rq_runnable_load_avg(&rq->cfs);
}
+static unsigned long cpu_load(struct rq *rq)
+{
+ return cfs_rq_load_avg(&rq->cfs);
+}
+
static unsigned long capacity_of(int cpu)
{
return cpu_rq(cpu)->cpu_capacity;
@@ -8070,7 +8075,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
env->flags |= LBF_NOHZ_AGAIN;
- sgs->group_load += cpu_runnable_load(rq);
+ sgs->group_load += cpu_load(rq);
sgs->group_util += cpu_util(i);
sgs->sum_h_nr_running += rq->cfs.h_nr_running;
@@ -8512,7 +8517,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
init_sd_lb_stats(&sds);
/*
- * Compute the various statistics relavent for load balancing at
+ * Compute the various statistics relevant for load balancing at
* this level.
*/
update_sd_lb_stats(env, &sds);
@@ -8672,10 +8677,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
switch (env->balance_type) {
case migrate_load:
/*
- * When comparing with load imbalance, use cpu_runnable_load()
+ * When comparing with load imbalance, use cpu_load()
* which is not scaled with the CPU capacity.
*/
- load = cpu_runnable_load(rq);
+ load = cpu_load(rq);
if (nr_running == 1 && load > env->imbalance &&
!check_cpu_capacity(rq, env->sd))
@@ -8683,7 +8688,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
/*
* For the load comparisons with the other CPU's, consider
- * the cpu_runnable_load() scaled with the CPU capacity, so
+ * the cpu_load() scaled with the CPU capacity, so
* that the load can be moved away from the CPU that is
* potentially running at a lower capacity.
*
--
2.7.4
Clean up asym packing to follow the default load balance behavior:
- classify the group by creating a group_asym_packing field.
- calculate the imbalance in calculate_imbalance() instead of bypassing it.
We don't need to test twice same conditions anymore to detect asym packing
and we consolidate the calculation of imbalance in calculate_imbalance().
There is no functional changes.
Signed-off-by: Vincent Guittot <[email protected]>
---
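As a side note, a simplified sketch (with made-up toy_* types, not the kernel
code) of the consolidated flow this patch moves to: classification tags the
group, and the single calculate_imbalance() path then derives the imbalance
from that tag, instead of a separate check_asym_packing() bypass.

struct toy_sg_stats {
	unsigned int group_asym_packing;	/* set while classifying the group */
	unsigned long group_load;
};

/* classification: remember that dst_cpu is preferred over this group */
static void toy_classify(struct toy_sg_stats *sgs, int dst_preferred,
			 unsigned int nr_running)
{
	if (nr_running && dst_preferred)
		sgs->group_asym_packing = 1;
}

/* single imbalance path: asym packing is handled here, no bypass needed */
static unsigned long toy_calculate_imbalance(const struct toy_sg_stats *busiest)
{
	if (busiest->group_asym_packing)
		return busiest->group_load;	/* move all the work to the preferred CPU */

	return 0;				/* other cases computed as before */
}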
kernel/sched/fair.c | 63 ++++++++++++++---------------------------------------
1 file changed, 16 insertions(+), 47 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2c..3175fea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7685,6 +7685,7 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;
+ unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -8139,9 +8140,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* ASYM_PACKING needs to move all the work to the highest
* prority CPUs in the group, therefore mark all groups
* of lower priority than ourself as busy.
+ *
+ * This is primarily intended to used at the sibling level. Some
+ * cores like POWER7 prefer to use lower numbered SMT threads. In the
+ * case of POWER7, it can move to lower SMT modes only when higher
+ * threads are idle. When in lower SMT modes, the threads will
+ * perform better since they share less core resources. Hence when we
+ * have idle threads, we want them to be the higher ones.
*/
if (sgs->sum_nr_running &&
sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
+ sgs->group_asym_packing = 1;
if (!sds->busiest)
return true;
@@ -8283,51 +8292,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
/**
- * check_asym_packing - Check to see if the group is packed into the
- * sched domain.
- *
- * This is primarily intended to used at the sibling level. Some
- * cores like POWER7 prefer to use lower numbered SMT threads. In the
- * case of POWER7, it can move to lower SMT modes only when higher
- * threads are idle. When in lower SMT modes, the threads will
- * perform better since they share less core resources. Hence when we
- * have idle threads, we want them to be the higher ones.
- *
- * This packing function is run on idle threads. It checks to see if
- * the busiest CPU in this domain (core in the P7 case) has a higher
- * CPU number than the packing function is being run on. Here we are
- * assuming lower CPU number will be equivalent to lower a SMT thread
- * number.
- *
- * Return: 1 when packing is required and a task should be moved to
- * this CPU. The amount of the imbalance is returned in env->imbalance.
- *
- * @env: The load balancing environment.
- * @sds: Statistics of the sched_domain which is to be packed
- */
-static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
-{
- int busiest_cpu;
-
- if (!(env->sd->flags & SD_ASYM_PACKING))
- return 0;
-
- if (env->idle == CPU_NOT_IDLE)
- return 0;
-
- if (!sds->busiest)
- return 0;
-
- busiest_cpu = sds->busiest->asym_prefer_cpu;
- if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
- return 0;
-
- env->imbalance = sds->busiest_stat.group_load;
-
- return 1;
-}
-
-/**
* fix_small_imbalance - Calculate the minor imbalance that exists
* amongst the groups of a sched_domain, during
* load balancing.
@@ -8411,6 +8375,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
local = &sds->local_stat;
busiest = &sds->busiest_stat;
+ if (busiest->group_asym_packing) {
+ env->imbalance = busiest->group_load;
+ return;
+ }
+
if (busiest->group_type == group_imbalanced) {
/*
* In the group_imb case we cannot rely on group-wide averages
@@ -8515,8 +8484,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
busiest = &sds.busiest_stat;
/* ASYM feature bypasses nice load balance check */
- if (check_asym_packing(env, &sds))
- return sds.busiest;
+ if (busiest->group_asym_packing)
+ goto force_balance;
/* There is no busy sibling group to pull tasks from */
if (!sds.busiest || busiest->sum_nr_running == 0)
--
2.7.4
When there is only 1 CPU per group, using the idle CPUs to evenly spread
tasks doesn't make sense and nr_running is a better metric.
Signed-off-by: Vincent Guittot <[email protected]>
---
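For illustration, a condensed sketch of the decision described above, using
hypothetical toy_* parameters rather than the real sg_lb_stats fields: when the
busiest group has a single CPU, comparing idle CPUs is meaningless, so the
check falls back to whether any task is actually waiting to run there.

/* Returns 1 when the groups should be considered balanced (no pull). */
static int toy_not_overloaded_is_balanced(unsigned int dst_cpu_is_idle,
					  unsigned int busiest_weight,
					  unsigned int local_idle_cpus,
					  unsigned int busiest_idle_cpus,
					  unsigned int busiest_h_nr_running)
{
	if (!dst_cpu_is_idle)
		return 1;	/* let an idle CPU do the pull instead */

	if (busiest_weight > 1 &&
	    local_idle_cpus <= busiest_idle_cpus + 1)
		return 1;	/* idle-CPU diff is not significant */

	if (busiest_h_nr_running == 1)
		return 1;	/* nothing is waiting to run on busiest */

	return 0;		/* looks imbalanced, compute how much to move */
}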
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++------------
1 file changed, 28 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15ec38c..a7c8ee6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,18 +8596,34 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
busiest->sum_nr_running > local->sum_nr_running + 1)
goto force_balance;
- if (busiest->group_type != group_overloaded &&
- (env->idle == CPU_NOT_IDLE ||
- local->idle_cpus <= (busiest->idle_cpus + 1)))
- /*
- * If the busiest group is not overloaded
- * and there is no imbalance between this and busiest group
- * wrt idle CPUs, it is balanced. The imbalance
- * becomes significant if the diff is greater than 1 otherwise
- * we might end up to just move the imbalance on another
- * group.
- */
- goto out_balanced;
+ if (busiest->group_type != group_overloaded) {
+ if (env->idle == CPU_NOT_IDLE)
+ /*
+ * If the busiest group is not overloaded (and as a
+ * result the local one too) but this cpu is already
+ * busy, let another idle cpu try to pull task.
+ */
+ goto out_balanced;
+
+ if (busiest->group_weight > 1 &&
+ local->idle_cpus <= (busiest->idle_cpus + 1))
+ /*
+ * If the busiest group is not overloaded
+ * and there is no imbalance between this and busiest
+ * group wrt idle CPUs, it is balanced. The imbalance
+ * becomes significant if the diff is greater than 1
+ * otherwise we might end up to just move the imbalance
+ * on another group. Of course this applies only if
+ * there is more than 1 CPU per group.
+ */
+ goto out_balanced;
+
+ if (busiest->sum_h_nr_running == 1)
+ /*
+ * busiest doesn't have any tasks waiting to run
+ */
+ goto out_balanced;
+ }
force_balance:
/* Looks like there is an imbalance. Compute it */
--
2.7.4
The load_balance algorithm contains some heuristics which have become
meaningless since the rework of the scheduler's metrics like the
introduction of PELT.
Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios. For instance, in the presence of idle CPUs,
we should simply try to get at least one task per CPU, whereas the current
load-based algorithm can actually leave idle CPUs alone simply because the
load is somewhat balanced. The current algorithm ends up creating virtual
and meaningless values like avg_load_per_task, or tweaking the state of a
group to make it look overloaded when it is not, in order to try to migrate
tasks.
load_balance should better qualify the imbalance of the group and clearly
define what has to be moved to fix this imbalance.
The type of sched_group has been extended to better reflect the type of
imbalance. We now have :
group_has_spare
group_fully_busy
group_misfit_task
group_asym_capacity
group_imbalanced
group_overloaded
Based on the type fo sched_group, load_balance now sets what it wants to
move in order to fix the imbalance. It can be some load as before but also
some utilization, a number of task or a type of task:
migrate_task
migrate_util
migrate_load
migrate_misfit
This new load_balance algorithm fixes several pending wrong tasks
placement:
- the 1 task per CPU case with asymmetrics system
- the case of cfs task preempted by other class
- the case of tasks not evenly spread on groups with spare capacity
Also the load balance decisions have been consolidated in the 3 functions
below after removing the few bypasses and hacks of the current code:
- update_sd_pick_busiest() select the busiest sched_group.
- find_busiest_group() checks if there is an imbalance between local and
busiest group.
- calculate_imbalance() decides what have to be moved.
Finally, the now unused field total_running of struct sd_lb_stats has been
removed.
Signed-off-by: Vincent Guittot <[email protected]>
---
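To make the mapping easier to follow, here is a compressed, illustrative
sketch (toy_* names, not the actual kernel code, and with the avg_load
fallback elided) of how the busiest group's type now selects what
calculate_imbalance() asks load_balance to move:

enum toy_group_type { toy_has_spare, toy_fully_busy, toy_misfit_task,
		      toy_asym_packing, toy_imbalanced, toy_overloaded };

enum toy_migration_type { toy_migrate_load, toy_migrate_util,
			  toy_migrate_task, toy_migrate_misfit };

struct toy_stats {
	enum toy_group_type group_type;
	unsigned long group_load, group_util, group_capacity;
	unsigned long group_misfit_task_load;
	unsigned int sum_nr_running, idle_cpus, group_weight;
};

static enum toy_migration_type
toy_calculate_imbalance(const struct toy_stats *local,
			const struct toy_stats *busiest,
			int prefer_sibling, unsigned long *imbalance)
{
	if (busiest->group_type == toy_misfit_task) {
		*imbalance = busiest->group_misfit_task_load;
		return toy_migrate_misfit;
	}

	if (busiest->group_type == toy_asym_packing) {
		*imbalance = busiest->group_load;	/* move everything to the preferred CPU */
		return toy_migrate_load;
	}

	if (busiest->group_type == toy_imbalanced) {
		*imbalance = 1;				/* move any one task */
		return toy_migrate_task;
	}

	if (local->group_type == toy_has_spare) {
		if (busiest->group_type > toy_fully_busy) {
			/* fill local spare capacity from the overloaded group */
			unsigned long target = local->group_capacity > local->group_util ?
					       local->group_capacity : local->group_util;
			*imbalance = target - local->group_util;
			return toy_migrate_util;
		}

		if (busiest->group_weight == 1 || prefer_sibling) {
			/* spread the number of running tasks evenly */
			*imbalance = (busiest->sum_nr_running - local->sum_nr_running) >> 1;
			return toy_migrate_task;
		}

		/* no overload anywhere: just even out the number of idle CPUs */
		*imbalance = (local->idle_cpus - busiest->idle_cpus) >> 1;
		return toy_migrate_task;
	}

	/* both groups are (or will become) overloaded: fall back to avg_load balancing */
	*imbalance = 0;					/* avg_load computation elided here */
	return toy_migrate_load;
}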
kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 380 insertions(+), 205 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 017aad0..d33379c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
enum fbq_type { regular, remote, all };
+/*
+ * group_type describes the group of CPUs at the moment of the load balance.
+ * The enum is ordered by pulling priority, with the group with lowest priority
+ * first so the groupe_type can be simply compared when selecting the busiest
+ * group. see update_sd_pick_busiest().
+ */
enum group_type {
- group_other = 0,
+ group_has_spare = 0,
+ group_fully_busy,
group_misfit_task,
+ group_asym_packing,
group_imbalanced,
- group_overloaded,
+ group_overloaded
+};
+
+enum migration_type {
+ migrate_load = 0,
+ migrate_util,
+ migrate_task,
+ migrate_misfit
};
#define LBF_ALL_PINNED 0x01
@@ -7115,7 +7130,7 @@ struct lb_env {
unsigned int loop_max;
enum fbq_type fbq_type;
- enum group_type src_grp_type;
+ enum migration_type balance_type;
struct list_head tasks;
};
@@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
{
struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
- unsigned long load;
+ unsigned long util, load;
int detached = 0;
lockdep_assert_held(&env->src_rq->lock);
@@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
if (!can_migrate_task(p, env))
goto next;
- load = task_h_load(p);
+ switch (env->balance_type) {
+ case migrate_load:
+ load = task_h_load(p);
- if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
- goto next;
+ if (sched_feat(LB_MIN) &&
+ load < 16 && !env->sd->nr_balance_failed)
+ goto next;
- if ((load / 2) > env->imbalance)
- goto next;
+ if ((load / 2) > env->imbalance)
+ goto next;
+
+ env->imbalance -= load;
+ break;
+
+ case migrate_util:
+ util = task_util_est(p);
+
+ if (util > env->imbalance)
+ goto next;
+
+ env->imbalance -= util;
+ break;
+
+ case migrate_task:
+ /* Migrate task */
+ env->imbalance--;
+ break;
+
+ case migrate_misfit:
+ load = task_h_load(p);
+
+ /*
+ * utilization of misfit task might decrease a bit
+ * since it has been recorded. Be conservative in the
+ * condition.
+ */
+ if (load < env->imbalance)
+ goto next;
+
+ env->imbalance = 0;
+ break;
+ }
detach_task(p, env);
list_add(&p->se.group_node, &env->tasks);
detached++;
- env->imbalance -= load;
#ifdef CONFIG_PREEMPT
/*
@@ -7406,7 +7455,7 @@ static int detach_tasks(struct lb_env *env)
/*
* We only want to steal up to the prescribed amount of
- * runnable load.
+ * load/util/tasks.
*/
if (env->imbalance <= 0)
break;
@@ -7671,7 +7720,6 @@ struct sg_lb_stats {
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- int group_no_capacity;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
@@ -7687,10 +7735,10 @@ struct sg_lb_stats {
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
struct sched_group *local; /* Local group in this sd */
- unsigned long total_running;
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
+ unsigned int prefer_sibling; /* tasks should go to sibling first */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
struct sg_lb_stats local_stat; /* Statistics of the local group */
@@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
*sds = (struct sd_lb_stats){
.busiest = NULL,
.local = NULL,
- .total_running = 0UL,
.total_load = 0UL,
.total_capacity = 0UL,
.busiest_stat = {
- .avg_load = 0UL,
- .sum_h_nr_running = 0,
- .group_type = group_other,
+ .idle_cpus = UINT_MAX,
+ .group_type = group_has_spare,
},
};
}
@@ -7955,19 +8001,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
}
static inline enum
-group_type group_classify(struct sched_group *group,
+group_type group_classify(struct lb_env *env,
+ struct sched_group *group,
struct sg_lb_stats *sgs)
{
- if (sgs->group_no_capacity)
+ if (group_is_overloaded(env, sgs))
return group_overloaded;
if (sg_imbalanced(group))
return group_imbalanced;
+ if (sgs->group_asym_packing)
+ return group_asym_packing;
+
if (sgs->group_misfit_task_load)
return group_misfit_task;
- return group_other;
+ if (!group_has_capacity(env, sgs))
+ return group_fully_busy;
+
+ return group_has_spare;
}
static bool update_nohz_stats(struct rq *rq, bool force)
@@ -8004,10 +8057,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
struct sg_lb_stats *sgs,
int *sg_status)
{
- int i, nr_running;
+ int i, nr_running, local_group;
memset(sgs, 0, sizeof(*sgs));
+ local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
@@ -8032,9 +8087,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/*
* No need to call idle_cpu() if nr_running is not 0
*/
- if (!nr_running && idle_cpu(i))
+ if (!nr_running && idle_cpu(i)) {
sgs->idle_cpus++;
+ /* Idle cpu can't have misfit task */
+ continue;
+ }
+
+ if (local_group)
+ continue;
+ /* Check for a misfit task on the cpu */
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
@@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
}
- /* Adjust by relative CPU capacity of the group */
+ /* Check if dst cpu is idle and preferred to this group */
+ if (env->sd->flags & SD_ASYM_PACKING &&
+ env->idle != CPU_NOT_IDLE &&
+ sgs->sum_h_nr_running &&
+ sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+ sgs->group_asym_packing = 1;
+ }
+
sgs->group_capacity = group->sgc->capacity;
- sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
sgs->group_weight = group->group_weight;
- sgs->group_no_capacity = group_is_overloaded(env, sgs);
- sgs->group_type = group_classify(group, sgs);
+ sgs->group_type = group_classify(env, group, sgs);
+
+ /* Computing avg_load makes sense only when group is overloaded */
+ if (sgs->group_type == group_overloaded)
+ sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
+ sgs->group_capacity;
}
/**
@@ -8072,6 +8144,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
{
struct sg_lb_stats *busiest = &sds->busiest_stat;
+ /* Make sure that there is at least one task to pull */
+ if (!sgs->sum_h_nr_running)
+ return false;
+
/*
* Don't try to pull misfit tasks we can't help.
* We can use max_capacity here as reduction in capacity on some
@@ -8080,7 +8156,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
*/
if (sgs->group_type == group_misfit_task &&
(!group_smaller_max_cpu_capacity(sg, sds->local) ||
- !group_has_capacity(env, &sds->local_stat)))
+ sds->local_stat.group_type != group_has_spare))
return false;
if (sgs->group_type > busiest->group_type)
@@ -8089,62 +8165,80 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (sgs->group_type < busiest->group_type)
return false;
- if (sgs->avg_load <= busiest->avg_load)
- return false;
-
- if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
- goto asym_packing;
-
/*
- * Candidate sg has no more than one task per CPU and
- * has higher per-CPU capacity. Migrating tasks to less
- * capable CPUs may harm throughput. Maximize throughput,
- * power/energy consequences are not considered.
+ * The candidate and the current busiest group are the same type of
+ * group. Let check which one is the busiest according to the type.
*/
- if (sgs->sum_h_nr_running <= sgs->group_weight &&
- group_smaller_min_cpu_capacity(sds->local, sg))
- return false;
- /*
- * If we have more than one misfit sg go with the biggest misfit.
- */
- if (sgs->group_type == group_misfit_task &&
- sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+ switch (sgs->group_type) {
+ case group_overloaded:
+ /* Select the overloaded group with highest avg_load. */
+ if (sgs->avg_load <= busiest->avg_load)
+ return false;
+ break;
+
+ case group_imbalanced:
+ /*
+ * Select the 1st imbalanced group as we don't have any way to
+ * choose one more than another.
+ */
return false;
-asym_packing:
- /* This is the busiest node in its class. */
- if (!(env->sd->flags & SD_ASYM_PACKING))
- return true;
+ case group_asym_packing:
+ /* Prefer to move from lowest priority CPU's work */
+ if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
+ return false;
+ break;
- /* No ASYM_PACKING if target CPU is already busy */
- if (env->idle == CPU_NOT_IDLE)
- return true;
- /*
- * ASYM_PACKING needs to move all the work to the highest
- * prority CPUs in the group, therefore mark all groups
- * of lower priority than ourself as busy.
- *
- * This is primarily intended to used at the sibling level. Some
- * cores like POWER7 prefer to use lower numbered SMT threads. In the
- * case of POWER7, it can move to lower SMT modes only when higher
- * threads are idle. When in lower SMT modes, the threads will
- * perform better since they share less core resources. Hence when we
- * have idle threads, we want them to be the higher ones.
- */
- if (sgs->sum_h_nr_running &&
- sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
- sgs->group_asym_packing = 1;
- if (!sds->busiest)
- return true;
+ case group_misfit_task:
+ /*
+ * If we have more than one misfit sg go with the biggest
+ * misfit.
+ */
+ if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+ return false;
+ break;
- /* Prefer to move from lowest priority CPU's work */
- if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
- sg->asym_prefer_cpu))
- return true;
+ case group_fully_busy:
+ /*
+ * Select the fully busy group with highest avg_load. In
+ * theory, there is no need to pull task from such kind of
+ * group because tasks have all compute capacity that they need
+ * but we can still improve the overall throughput by reducing
+ * contention when accessing shared HW resources.
+ *
+ * XXX for now avg_load is not computed and always 0 so we
+ * select the 1st one.
+ */
+ if (sgs->avg_load <= busiest->avg_load)
+ return false;
+ break;
+
+ case group_has_spare:
+ /*
+ * Select not overloaded group with lowest number of
+ * idle cpus. We could also compare the spare capacity
+ * which is more stable but it can end up that the
+ * group has less spare capacity but finally more idle
+ * cpus which means less opportunity to pull tasks.
+ */
+ if (sgs->idle_cpus >= busiest->idle_cpus)
+ return false;
+ break;
}
- return false;
+ /*
+ * Candidate sg has no more than one task per CPU and has higher
+ * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+ * throughput. Maximize throughput, power/energy consequences are not
+ * considered.
+ */
+ if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+ (sgs->group_type <= group_fully_busy) &&
+ (group_smaller_min_cpu_capacity(sds->local, sg)))
+ return false;
+
+ return true;
}
#ifdef CONFIG_NUMA_BALANCING
@@ -8182,13 +8276,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
* @env: The load balancing environment.
* @sds: variable to hold the statistics for this sched_domain.
*/
+
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
- bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
int sg_status = 0;
#ifdef CONFIG_NO_HZ_COMMON
@@ -8215,22 +8309,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (local_group)
goto next_group;
- /*
- * In case the child domain prefers tasks go to siblings
- * first, lower the sg capacity so that we'll try
- * and move all the excess tasks away. We lower the capacity
- * of a group only if the local group has the capacity to fit
- * these excess tasks. The extra check prevents the case where
- * you always pull from the heaviest group when it is already
- * under-utilized (possible with a large weight task outweighs
- * the tasks on the system).
- */
- if (prefer_sibling && sds->local &&
- group_has_capacity(env, local) &&
- (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {
- sgs->group_no_capacity = 1;
- sgs->group_type = group_classify(sg, sgs);
- }
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
@@ -8239,13 +8317,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
next_group:
/* Now, start updating sd_lb_stats */
- sds->total_running += sgs->sum_h_nr_running;
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
sg = sg->next;
} while (sg != env->sd->groups);
+ /* Tag domain that child domain prefers tasks go to siblings first */
+ sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+
#ifdef CONFIG_NO_HZ_COMMON
if ((env->flags & LBF_NOHZ_AGAIN) &&
cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
@@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
*/
static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
- unsigned long max_pull, load_above_capacity = ~0UL;
struct sg_lb_stats *local, *busiest;
local = &sds->local_stat;
busiest = &sds->busiest_stat;
- if (busiest->group_asym_packing) {
+ if (busiest->group_type == group_misfit_task) {
+ /* Set imbalance to allow misfit task to be balanced. */
+ env->balance_type = migrate_misfit;
+ env->imbalance = busiest->group_misfit_task_load;
+ return;
+ }
+
+ if (busiest->group_type == group_asym_packing) {
+ /*
+ * In case of asym capacity, we will try to migrate all load to
+ * the preferred CPU.
+ */
+ env->balance_type = migrate_load;
env->imbalance = busiest->group_load;
return;
}
+ if (busiest->group_type == group_imbalanced) {
+ /*
+ * In the group_imb case we cannot rely on group-wide averages
+ * to ensure CPU-load equilibrium, try to move any task to fix
+ * the imbalance. The next load balance will take care of
+ * balancing back the system.
+ */
+ env->balance_type = migrate_task;
+ env->imbalance = 1;
+ return;
+ }
+
/*
- * Avg load of busiest sg can be less and avg load of local sg can
- * be greater than avg load across all sgs of sd because avg load
- * factors in sg capacity and sgs with smaller group_type are
- * skipped when updating the busiest sg:
+ * Try to use spare capacity of local group without overloading it or
+ * emptying busiest
*/
- if (busiest->group_type != group_misfit_task &&
- (busiest->avg_load <= sds->avg_load ||
- local->avg_load >= sds->avg_load)) {
- env->imbalance = 0;
+ if (local->group_type == group_has_spare) {
+ if (busiest->group_type > group_fully_busy) {
+ /*
+ * If busiest is overloaded, try to fill spare
+ * capacity. This might end up creating spare capacity
+ * in busiest or busiest still being overloaded but
+ * there is no simple way to directly compute the
+ * amount of load to migrate in order to balance the
+ * system.
+ */
+ env->balance_type = migrate_util;
+ env->imbalance = max(local->group_capacity, local->group_util) -
+ local->group_util;
+ return;
+ }
+
+ if (busiest->group_weight == 1 || sds->prefer_sibling) {
+ /*
+ * When prefer sibling, evenly spread running tasks on
+ * groups.
+ */
+ env->balance_type = migrate_task;
+ env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
+ return;
+ }
+
+ /*
+ * If there is no overload, we just want to even the number of
+ * idle cpus.
+ */
+ env->balance_type = migrate_task;
+ env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
return;
}
/*
- * If there aren't any idle CPUs, avoid creating some.
+ * Local is fully busy but have to take more load to relieve the
+ * busiest group
*/
- if (busiest->group_type == group_overloaded &&
- local->group_type == group_overloaded) {
- load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
- if (load_above_capacity > busiest->group_capacity) {
- load_above_capacity -= busiest->group_capacity;
- load_above_capacity *= scale_load_down(NICE_0_LOAD);
- load_above_capacity /= busiest->group_capacity;
- } else
- load_above_capacity = ~0UL;
+ if (local->group_type < group_overloaded) {
+ /*
+ * Local will become overloaded so the avg_load metrics are
+ * finally needed.
+ */
+
+ local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) /
+ local->group_capacity;
+
+ sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
+ sds->total_capacity;
}
/*
- * We're trying to get all the CPUs to the average_load, so we don't
- * want to push ourselves above the average load, nor do we wish to
- * reduce the max loaded CPU below the average load. At the same time,
- * we also don't want to reduce the group load below the group
- * capacity. Thus we look for the minimum possible imbalance.
+ * Both group are or will become overloaded and we're trying to get all
+ * the CPUs to the average_load, so we don't want to push ourselves
+ * above the average load, nor do we wish to reduce the max loaded CPU
+ * below the average load. At the same time, we also don't want to
+ * reduce the group load below the group capacity. Thus we look for
+ * the minimum possible imbalance.
*/
- max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
-
- /* How much load to actually move to equalise the imbalance */
+ env->balance_type = migrate_load;
env->imbalance = min(
- max_pull * busiest->group_capacity,
+ (busiest->avg_load - sds->avg_load) * busiest->group_capacity,
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
-
- /* Boost imbalance to allow misfit task to be balanced. */
- if (busiest->group_type == group_misfit_task) {
- env->imbalance = max_t(long, env->imbalance,
- busiest->group_misfit_task_load);
- }
-
}
/******* find_busiest_group() helpers end here *********************/
+/*
+ * Decision matrix according to the local and busiest group state
+ *
+ * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
+ * has_spare nr_idle balanced N/A N/A balanced balanced
+ * fully_busy nr_idle nr_idle N/A N/A balanced balanced
+ * misfit_task force N/A N/A N/A force force
+ * asym_capacity force force N/A N/A force force
+ * imbalanced force force N/A N/A force force
+ * overloaded force force N/A N/A force avg_load
+ *
+ * N/A : Not Applicable because already filtered while updating
+ * statistics.
+ * balanced : The system is balanced for these 2 groups.
+ * force : Calculate the imbalance as load migration is probably needed.
+ * avg_load : Only if imbalance is significant enough.
+ * nr_idle : dst_cpu is not busy and the number of idle cpus is quite
+ * different in groups.
+ */
+
/**
* find_busiest_group - Returns the busiest group within the sched_domain
* if there is an imbalance.
@@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
local = &sds.local_stat;
busiest = &sds.busiest_stat;
- /* ASYM feature bypasses nice load balance check */
- if (busiest->group_asym_packing)
- goto force_balance;
-
/* There is no busy sibling group to pull tasks from */
- if (!sds.busiest || busiest->sum_h_nr_running == 0)
+ if (!sds.busiest)
goto out_balanced;
- /* XXX broken for overlapping NUMA groups */
- sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
- / sds.total_capacity;
+ /* Misfit tasks should be dealt with regardless of the avg load */
+ if (busiest->group_type == group_misfit_task)
+ goto force_balance;
+
+ /* ASYM feature bypasses nice load balance check */
+ if (busiest->group_type == group_asym_packing)
+ goto force_balance;
/*
* If the busiest group is imbalanced the below checks don't
@@ -8401,55 +8545,64 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;
/*
- * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
- * capacities from resulting in underutilization due to avg_load.
- */
- if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
- busiest->group_no_capacity)
- goto force_balance;
-
- /* Misfit tasks should be dealt with regardless of the avg load */
- if (busiest->group_type == group_misfit_task)
- goto force_balance;
-
- /*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
*/
- if (local->avg_load >= busiest->avg_load)
+ if (local->group_type > busiest->group_type)
goto out_balanced;
/*
- * Don't pull any tasks if this group is already above the domain
- * average load.
+ * When groups are overloaded, use the avg_load to ensure fairness
+ * between tasks.
*/
- if (local->avg_load >= sds.avg_load)
- goto out_balanced;
+ if (local->group_type == group_overloaded) {
+ /*
+ * If the local group is more loaded than the selected
+ * busiest group don't try and pull any tasks.
+ */
+ if (local->avg_load >= busiest->avg_load)
+ goto out_balanced;
+
+ /* XXX broken for overlapping NUMA groups */
+ sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
+ sds.total_capacity;
- if (env->idle == CPU_IDLE) {
/*
- * This CPU is idle. If the busiest group is not overloaded
- * and there is no imbalance between this and busiest group
- * wrt idle CPUs, it is balanced. The imbalance becomes
- * significant if the diff is greater than 1 otherwise we
- * might end up to just move the imbalance on another group
+ * Don't pull any tasks if this group is already above the
+ * domain average load.
*/
- if ((busiest->group_type != group_overloaded) &&
- (local->idle_cpus <= (busiest->idle_cpus + 1)))
+ if (local->avg_load >= sds.avg_load)
goto out_balanced;
- } else {
+
/*
- * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
- * imbalance_pct to be conservative.
+ * If the busiest group is more loaded, use imbalance_pct to be
+ * conservative.
*/
if (100 * busiest->avg_load <=
env->sd->imbalance_pct * local->avg_load)
goto out_balanced;
}
+ /* Try to move all excess tasks to child's sibling domain */
+ if (sds.prefer_sibling && local->group_type == group_has_spare &&
+ busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
+ goto force_balance;
+
+ if (busiest->group_type != group_overloaded &&
+ (env->idle == CPU_NOT_IDLE ||
+ local->idle_cpus <= (busiest->idle_cpus + 1)))
+ /*
+ * If the busiest group is not overloaded
+ * and there is no imbalance between this and busiest group
+ * wrt idle CPUs, it is balanced. The imbalance
+ * becomes significant if the diff is greater than 1 otherwise
+ * we might end up to just move the imbalance on another
+ * group.
+ */
+ goto out_balanced;
+
force_balance:
/* Looks like there is an imbalance. Compute it */
- env->src_grp_type = busiest->group_type;
calculate_imbalance(env, &sds);
return env->imbalance ? sds.busiest : NULL;
@@ -8465,11 +8618,13 @@ static struct rq *find_busiest_queue(struct lb_env *env,
struct sched_group *group)
{
struct rq *busiest = NULL, *rq;
- unsigned long busiest_load = 0, busiest_capacity = 1;
+ unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
+ unsigned int busiest_nr = 0;
int i;
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
- unsigned long capacity, load;
+ unsigned long capacity, load, util;
+ unsigned int nr_running;
enum fbq_type rt;
rq = cpu_rq(i);
@@ -8497,20 +8652,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (rt > env->fbq_type)
continue;
- /*
- * For ASYM_CPUCAPACITY domains with misfit tasks we simply
- * seek the "biggest" misfit task.
- */
- if (env->src_grp_type == group_misfit_task) {
- if (rq->misfit_task_load > busiest_load) {
- busiest_load = rq->misfit_task_load;
- busiest = rq;
- }
-
- continue;
- }
-
capacity = capacity_of(i);
+ nr_running = rq->cfs.h_nr_running;
/*
* For ASYM_CPUCAPACITY domains, don't pick a CPU that could
@@ -8520,35 +8663,67 @@ static struct rq *find_busiest_queue(struct lb_env *env,
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
capacity_of(env->dst_cpu) < capacity &&
- rq->nr_running == 1)
+ nr_running == 1)
continue;
- load = cpu_runnable_load(rq);
+ switch (env->balance_type) {
+ case migrate_load:
+ /*
+ * When comparing with load imbalance, use cpu_runnable_load()
+ * which is not scaled with the CPU capacity.
+ */
+ load = cpu_runnable_load(rq);
- /*
- * When comparing with imbalance, use cpu_runnable_load()
- * which is not scaled with the CPU capacity.
- */
+ if (nr_running == 1 && load > env->imbalance &&
+ !check_cpu_capacity(rq, env->sd))
+ break;
- if (rq->nr_running == 1 && load > env->imbalance &&
- !check_cpu_capacity(rq, env->sd))
- continue;
+ /*
+ * For the load comparisons with the other CPU's, consider
+ * the cpu_runnable_load() scaled with the CPU capacity, so
+ * that the load can be moved away from the CPU that is
+ * potentially running at a lower capacity.
+ *
+ * Thus we're looking for max(load_i / capacity_i), crosswise
+ * multiplication to rid ourselves of the division works out
+ * to: load_i * capacity_j > load_j * capacity_i; where j is
+ * our previous maximum.
+ */
+ if (load * busiest_capacity > busiest_load * capacity) {
+ busiest_load = load;
+ busiest_capacity = capacity;
+ busiest = rq;
+ }
+ break;
+
+ case migrate_util:
+ util = cpu_util(cpu_of(rq));
+
+ if (busiest_util < util) {
+ busiest_util = util;
+ busiest = rq;
+ }
+ break;
+
+ case migrate_task:
+ if (busiest_nr < nr_running) {
+ busiest_nr = nr_running;
+ busiest = rq;
+ }
+ break;
+
+ case migrate_misfit:
+ /*
+ * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+ * seek the "biggest" misfit task.
+ */
+ if (rq->misfit_task_load > busiest_load) {
+ busiest_load = rq->misfit_task_load;
+ busiest = rq;
+ }
+
+ break;
- /*
- * For the load comparisons with the other CPU's, consider
- * the cpu_runnable_load() scaled with the CPU capacity, so
- * that the load can be moved away from the CPU that is
- * potentially running at a lower capacity.
- *
- * Thus we're looking for max(load_i / capacity_i), crosswise
- * multiplication to rid ourselves of the division works out
- * to: load_i * capacity_j > load_j * capacity_i; where j is
- * our previous maximum.
- */
- if (load * busiest_capacity > busiest_load * capacity) {
- busiest_load = load;
- busiest_capacity = capacity;
- busiest = rq;
}
}
@@ -8594,7 +8769,7 @@ voluntary_active_balance(struct lb_env *env)
return 1;
}
- if (env->src_grp_type == group_misfit_task)
+ if (env->balance_type == migrate_misfit)
return 1;
return 0;
--
2.7.4
The CFS load_balance only takes care of CFS tasks, whereas CPUs can be used by
other scheduling classes. Typically, a CFS task preempted by an RT or deadline
task will not get a chance to be pulled to another CPU because
load_balance doesn't take into account tasks from other classes.
Add the sum of nr_running to the statistics and use it to detect such
situations.
Signed-off-by: Vincent Guittot <[email protected]>
---
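As a rough illustration (hypothetical toy_* helper, not the kernel function),
the effect of counting all classes is that a CPU currently running an RT or
deadline task no longer looks like spare capacity:

/*
 * sum_nr_running counts tasks of all classes, sum_h_nr_running only CFS
 * tasks. Using the former means a group whose CPUs are busy with RT/DL
 * work is not reported as having spare capacity, so its CFS tasks can
 * be pulled elsewhere.
 */
static int toy_group_has_capacity(unsigned int sum_nr_running,
				  unsigned int group_weight,
				  unsigned long group_capacity,
				  unsigned long group_util,
				  unsigned int imbalance_pct)
{
	if (sum_nr_running < group_weight)
		return 1;	/* fewer tasks (of any class) than CPUs */

	if (group_capacity * 100 > group_util * imbalance_pct)
		return 1;	/* still some unused capacity */

	return 0;
}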
kernel/sched/fair.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d33379c..7e74836 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7716,6 +7716,7 @@ struct sg_lb_stats {
unsigned long group_load; /* Total load over the CPUs of the group */
unsigned long group_capacity;
unsigned long group_util; /* Total utilization of the group */
+ unsigned int sum_nr_running; /* Nr of tasks running in the group */
unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
@@ -7949,7 +7950,7 @@ static inline int sg_imbalanced(struct sched_group *group)
static inline bool
group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
{
- if (sgs->sum_h_nr_running < sgs->group_weight)
+ if (sgs->sum_nr_running < sgs->group_weight)
return true;
if ((sgs->group_capacity * 100) >
@@ -7970,7 +7971,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
static inline bool
group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
{
- if (sgs->sum_h_nr_running <= sgs->group_weight)
+ if (sgs->sum_nr_running <= sgs->group_weight)
return false;
if ((sgs->group_capacity * 100) <
@@ -8074,6 +8075,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_h_nr_running += rq->cfs.h_nr_running;
nr_running = rq->nr_running;
+ sgs->sum_nr_running += nr_running;
+
if (nr_running > 1)
*sg_status |= SG_OVERLOAD;
@@ -8423,7 +8426,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* groups.
*/
env->balance_type = migrate_task;
- env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
+ env->imbalance = (busiest->sum_nr_running - local->sum_nr_running) >> 1;
return;
}
@@ -8585,7 +8588,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
/* Try to move all excess tasks to child's sibling domain */
if (sds.prefer_sibling && local->group_type == group_has_spare &&
- busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
+ busiest->sum_nr_running > local->sum_nr_running + 1)
goto force_balance;
if (busiest->group_type != group_overloaded &&
--
2.7.4
clean up load_balance and remove meaningless calculation and fields before
adding new algorithm.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 105 +---------------------------------------------------
1 file changed, 1 insertion(+), 104 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02ab6b5..017aad0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5390,18 +5390,6 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}
-static unsigned long cpu_avg_load_per_task(int cpu)
-{
- struct rq *rq = cpu_rq(cpu);
- unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
- unsigned long load_avg = cpu_runnable_load(rq);
-
- if (nr_running)
- return load_avg / nr_running;
-
- return 0;
-}
-
static void record_wakee(struct task_struct *p)
{
/*
@@ -7677,7 +7665,6 @@ static unsigned long task_h_load(struct task_struct *p)
struct sg_lb_stats {
unsigned long avg_load; /*Avg load across the CPUs of the group */
unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long load_per_task;
unsigned long group_capacity;
unsigned long group_util; /* Total utilization of the group */
unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
@@ -8059,9 +8046,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_capacity = group->sgc->capacity;
sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
- if (sgs->sum_h_nr_running)
- sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running;
-
sgs->group_weight = group->group_weight;
sgs->group_no_capacity = group_is_overloaded(env, sgs);
@@ -8292,76 +8276,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
/**
- * fix_small_imbalance - Calculate the minor imbalance that exists
- * amongst the groups of a sched_domain, during
- * load balancing.
- * @env: The load balancing environment.
- * @sds: Statistics of the sched_domain whose imbalance is to be calculated.
- */
-static inline
-void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
-{
- unsigned long tmp, capa_now = 0, capa_move = 0;
- unsigned int imbn = 2;
- unsigned long scaled_busy_load_per_task;
- struct sg_lb_stats *local, *busiest;
-
- local = &sds->local_stat;
- busiest = &sds->busiest_stat;
-
- if (!local->sum_h_nr_running)
- local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
- else if (busiest->load_per_task > local->load_per_task)
- imbn = 1;
-
- scaled_busy_load_per_task =
- (busiest->load_per_task * SCHED_CAPACITY_SCALE) /
- busiest->group_capacity;
-
- if (busiest->avg_load + scaled_busy_load_per_task >=
- local->avg_load + (scaled_busy_load_per_task * imbn)) {
- env->imbalance = busiest->load_per_task;
- return;
- }
-
- /*
- * OK, we don't have enough imbalance to justify moving tasks,
- * however we may be able to increase total CPU capacity used by
- * moving them.
- */
-
- capa_now += busiest->group_capacity *
- min(busiest->load_per_task, busiest->avg_load);
- capa_now += local->group_capacity *
- min(local->load_per_task, local->avg_load);
- capa_now /= SCHED_CAPACITY_SCALE;
-
- /* Amount of load we'd subtract */
- if (busiest->avg_load > scaled_busy_load_per_task) {
- capa_move += busiest->group_capacity *
- min(busiest->load_per_task,
- busiest->avg_load - scaled_busy_load_per_task);
- }
-
- /* Amount of load we'd add */
- if (busiest->avg_load * busiest->group_capacity <
- busiest->load_per_task * SCHED_CAPACITY_SCALE) {
- tmp = (busiest->avg_load * busiest->group_capacity) /
- local->group_capacity;
- } else {
- tmp = (busiest->load_per_task * SCHED_CAPACITY_SCALE) /
- local->group_capacity;
- }
- capa_move += local->group_capacity *
- min(local->load_per_task, local->avg_load + tmp);
- capa_move /= SCHED_CAPACITY_SCALE;
-
- /* Move if we gain throughput */
- if (capa_move > capa_now)
- env->imbalance = busiest->load_per_task;
-}
-
-/**
* calculate_imbalance - Calculate the amount of imbalance present within the
* groups of a given sched_domain during load balance.
* @env: load balance environment
@@ -8380,15 +8294,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return;
}
- if (busiest->group_type == group_imbalanced) {
- /*
- * In the group_imb case we cannot rely on group-wide averages
- * to ensure CPU-load equilibrium, look at wider averages. XXX
- */
- busiest->load_per_task =
- min(busiest->load_per_task, sds->avg_load);
- }
-
/*
* Avg load of busiest sg can be less and avg load of local sg can
* be greater than avg load across all sgs of sd because avg load
@@ -8399,7 +8304,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
(busiest->avg_load <= sds->avg_load ||
local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
- return fix_small_imbalance(env, sds);
+ return;
}
/*
@@ -8437,14 +8342,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
busiest->group_misfit_task_load);
}
- /*
- * if *imbalance is less than the average load per runnable task
- * there is no guarantee that any tasks will be moved so we'll have
- * a think about bumping its value to force at least one task to be
- * moved
- */
- if (env->imbalance < busiest->load_per_task)
- return fix_small_imbalance(env, sds);
}
/******* find_busiest_group() helpers end here *********************/
--
2.7.4
Rename sum_nr_running to sum_h_nr_running because it effectively tracks
cfs->h_nr_running so we can use sum_nr_running to track rq->nr_running
when needed.
There is no functional changes.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3175fea..02ab6b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7680,7 +7680,7 @@ struct sg_lb_stats {
unsigned long load_per_task;
unsigned long group_capacity;
unsigned long group_util; /* Total utilization of the group */
- unsigned int sum_nr_running; /* Nr tasks running in the group */
+ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
@@ -7725,7 +7725,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.total_capacity = 0UL,
.busiest_stat = {
.avg_load = 0UL,
- .sum_nr_running = 0,
+ .sum_h_nr_running = 0,
.group_type = group_other,
},
};
@@ -7916,7 +7916,7 @@ static inline int sg_imbalanced(struct sched_group *group)
static inline bool
group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
{
- if (sgs->sum_nr_running < sgs->group_weight)
+ if (sgs->sum_h_nr_running < sgs->group_weight)
return true;
if ((sgs->group_capacity * 100) >
@@ -7937,7 +7937,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
static inline bool
group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
{
- if (sgs->sum_nr_running <= sgs->group_weight)
+ if (sgs->sum_h_nr_running <= sgs->group_weight)
return false;
if ((sgs->group_capacity * 100) <
@@ -8029,7 +8029,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_load += cpu_runnable_load(rq);
sgs->group_util += cpu_util(i);
- sgs->sum_nr_running += rq->cfs.h_nr_running;
+ sgs->sum_h_nr_running += rq->cfs.h_nr_running;
nr_running = rq->nr_running;
if (nr_running > 1)
@@ -8059,8 +8059,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_capacity = group->sgc->capacity;
sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
- if (sgs->sum_nr_running)
- sgs->load_per_task = sgs->group_load / sgs->sum_nr_running;
+ if (sgs->sum_h_nr_running)
+ sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running;
sgs->group_weight = group->group_weight;
@@ -8117,7 +8117,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* capable CPUs may harm throughput. Maximize throughput,
* power/energy consequences are not considered.
*/
- if (sgs->sum_nr_running <= sgs->group_weight &&
+ if (sgs->sum_h_nr_running <= sgs->group_weight &&
group_smaller_min_cpu_capacity(sds->local, sg))
return false;
@@ -8148,7 +8148,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* perform better since they share less core resources. Hence when we
* have idle threads, we want them to be the higher ones.
*/
- if (sgs->sum_nr_running &&
+ if (sgs->sum_h_nr_running &&
sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
sgs->group_asym_packing = 1;
if (!sds->busiest)
@@ -8166,9 +8166,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
#ifdef CONFIG_NUMA_BALANCING
static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
{
- if (sgs->sum_nr_running > sgs->nr_numa_running)
+ if (sgs->sum_h_nr_running > sgs->nr_numa_running)
return regular;
- if (sgs->sum_nr_running > sgs->nr_preferred_running)
+ if (sgs->sum_h_nr_running > sgs->nr_preferred_running)
return remote;
return all;
}
@@ -8243,7 +8243,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
*/
if (prefer_sibling && sds->local &&
group_has_capacity(env, local) &&
- (sgs->sum_nr_running > local->sum_nr_running + 1)) {
+ (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) {
sgs->group_no_capacity = 1;
sgs->group_type = group_classify(sg, sgs);
}
@@ -8255,7 +8255,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
next_group:
/* Now, start updating sd_lb_stats */
- sds->total_running += sgs->sum_nr_running;
+ sds->total_running += sgs->sum_h_nr_running;
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
@@ -8309,7 +8309,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
local = &sds->local_stat;
busiest = &sds->busiest_stat;
- if (!local->sum_nr_running)
+ if (!local->sum_h_nr_running)
local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
else if (busiest->load_per_task > local->load_per_task)
imbn = 1;
@@ -8407,7 +8407,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
- load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
+ load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
if (load_above_capacity > busiest->group_capacity) {
load_above_capacity -= busiest->group_capacity;
load_above_capacity *= scale_load_down(NICE_0_LOAD);
@@ -8488,7 +8488,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;
/* There is no busy sibling group to pull tasks from */
- if (!sds.busiest || busiest->sum_nr_running == 0)
+ if (!sds.busiest || busiest->sum_h_nr_running == 0)
goto out_balanced;
/* XXX broken for overlapping NUMA groups */
--
2.7.4
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> Clean up asym packing to follow the default load balance behavior:
> - classify the group by creating a group_asym_packing field.
> - calculate the imbalance in calculate_imbalance() instead of
> bypassing it.
>
> We don't need to test twice same conditions anymore to detect asym
> packing
> and we consolidate the calculation of imbalance in
> calculate_imbalance().
>
> There is no functional changes.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> Rename sum_nr_running to sum_h_nr_running because it effectively
> tracks
> cfs->h_nr_running so we can use sum_nr_running to track rq-
> >nr_running
> when needed.
>
> There is no functional changes.
>
> Signed-off-by: Vincent Guittot <[email protected]>
>
Acked-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> clean up load_balance and remove meaningless calculation and fields
> before
> adding new algorithm.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Yay.
Acked-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
>
> Also the load balance decisions have been consolidated in the 3
> functions
> below after removing the few bypasses and hacks of the current code:
> - update_sd_pick_busiest() select the busiest sched_group.
> - find_busiest_group() checks if there is an imbalance between local
> and
> busiest group.
> - calculate_imbalance() decides what have to be moved.
I really like the direction this series is going.
However, I suppose I should run these patches for
a few days with some of our test workloads before
I send out an ack for this patch :)
--
All Rights Reversed.
On Mon, 30 Sep 2019 at 03:13, Rik van Riel <[email protected]> wrote:
>
> On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> >
> > Also the load balance decisions have been consolidated in the 3
> > functions
> > below after removing the few bypasses and hacks of the current code:
> > - update_sd_pick_busiest() select the busiest sched_group.
> > - find_busiest_group() checks if there is an imbalance between local
> > and
> > busiest group.
> > - calculate_imbalance() decides what have to be moved.
>
> I really like the direction this series is going.
Thanks
>
> However, I suppose I should run these patches for
> a few days with some of our test workloads before
> I send out an ack for this patch :)
Yes, more tests on different platforms are welcome
>
> --
> All Rights Reversed.
Hi Vincent,
On 19/09/2019 09:33, Vincent Guittot wrote:
these are just some comments & questions based on a code study. Haven't
run any tests with it yet.
[...]
> The type of sched_group has been extended to better reflect the type of
> imbalance. We now have :
> group_has_spare
> group_fully_busy
> group_misfit_task
> group_asym_capacity
s/group_asym_capacity/group_asym_packing
> group_imbalanced
> group_overloaded
>
> Based on the type fo sched_group, load_balance now sets what it wants to
s/fo/for
> move in order to fix the imbalance. It can be some load as before but also
> some utilization, a number of task or a type of task:
> migrate_task
> migrate_util
> migrate_load
> migrate_misfit
>
> This new load_balance algorithm fixes several pending wrong tasks
> placement:
> - the 1 task per CPU case with asymmetrics system
s/asymmetrics/asymmetric
This stands for ASYM_CPUCAPACITY and ASYM_PACKING, right?
[...]
> #define LBF_ALL_PINNED 0x01
> @@ -7115,7 +7130,7 @@ struct lb_env {
> unsigned int loop_max;
>
> enum fbq_type fbq_type;
> - enum group_type src_grp_type;
> + enum migration_type balance_type;
Minor thing:
Why not
enum migration_type migration_type;
or
enum balance_type balance_type;
We do the same for other enums like fbq_type or group_type.
> struct list_head tasks;
> };
>
The detach_tasks() comment still mentions only runnable load.
> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> {
> struct list_head *tasks = &env->src_rq->cfs_tasks;
> struct task_struct *p;
> - unsigned long load;
> + unsigned long util, load;
Minor: Order by length or reduce scope to while loop ?
> int detached = 0;
>
> lockdep_assert_held(&env->src_rq->lock);
> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
> if (!can_migrate_task(p, env))
> goto next;
>
> - load = task_h_load(p);
> + switch (env->balance_type) {
> + case migrate_load:
> + load = task_h_load(p);
>
> - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> - goto next;
> + if (sched_feat(LB_MIN) &&
> + load < 16 && !env->sd->nr_balance_failed)
> + goto next;
>
> - if ((load / 2) > env->imbalance)
> - goto next;
> + if ((load / 2) > env->imbalance)
> + goto next;
> +
> + env->imbalance -= load;
> + break;
> +
> + case migrate_util:
> + util = task_util_est(p);
> +
> + if (util > env->imbalance)
> + goto next;
> +
> + env->imbalance -= util;
> + break;
> +
> + case migrate_task:
> + /* Migrate task */
Minor: IMHO, this comment is not necessary.
> + env->imbalance--;
> + break;
> +
> + case migrate_misfit:
> + load = task_h_load(p);
> +
> + /*
> + * utilization of misfit task might decrease a bit
This patch still uses load. IMHO this comment becomes true only with 08/10.
[...]
> @@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> *sds = (struct sd_lb_stats){
> .busiest = NULL,
> .local = NULL,
> - .total_running = 0UL,
> .total_load = 0UL,
> .total_capacity = 0UL,
> .busiest_stat = {
> - .avg_load = 0UL,
There is a sentence in the comment above explaining why avg_load has to
be cleared. And IMHO local group isn't cleared anymore but set/initialized.
> - .sum_h_nr_running = 0,
> - .group_type = group_other,
> + .idle_cpus = UINT_MAX,
> + .group_type = group_has_spare,
> },
> };
> }
[...]
> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> }
> }
>
> - /* Adjust by relative CPU capacity of the group */
> + /* Check if dst cpu is idle and preferred to this group */
s/preferred to/preferred by ? or the preferred CPU of this group ?
[...]
> @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> */
> static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> {
> - unsigned long max_pull, load_above_capacity = ~0UL;
> struct sg_lb_stats *local, *busiest;
>
> local = &sds->local_stat;
> busiest = &sds->busiest_stat;
>
> - if (busiest->group_asym_packing) {
> + if (busiest->group_type == group_misfit_task) {
> + /* Set imbalance to allow misfit task to be balanced. */
> + env->balance_type = migrate_misfit;
> + env->imbalance = busiest->group_misfit_task_load;
> + return;
> + }
> +
> + if (busiest->group_type == group_asym_packing) {
> + /*
> + * In case of asym capacity, we will try to migrate all load to
Does asym capacity stand for asym packing or asym cpu capacity?
> + * the preferred CPU.
> + */
> + env->balance_type = migrate_load;
> env->imbalance = busiest->group_load;
> return;
> }
>
> + if (busiest->group_type == group_imbalanced) {
> + /*
> + * In the group_imb case we cannot rely on group-wide averages
> + * to ensure CPU-load equilibrium, try to move any task to fix
> + * the imbalance. The next load balance will take care of
> + * balancing back the system.
balancing back ?
> + */
> + env->balance_type = migrate_task;
> + env->imbalance = 1;
> + return;
> + }
> +
> /*
> - * Avg load of busiest sg can be less and avg load of local sg can
> - * be greater than avg load across all sgs of sd because avg load
> - * factors in sg capacity and sgs with smaller group_type are
> - * skipped when updating the busiest sg:
> + * Try to use spare capacity of local group without overloading it or
> + * emptying busiest
> */
> - if (busiest->group_type != group_misfit_task &&
> - (busiest->avg_load <= sds->avg_load ||
> - local->avg_load >= sds->avg_load)) {
> - env->imbalance = 0;
> + if (local->group_type == group_has_spare) {
> + if (busiest->group_type > group_fully_busy) {
So this could be 'busiest->group_type == group_overloaded' here to match
the comment below? Since you handle group_misfit_task,
group_asym_packing, group_imbalanced above and return.
> + /*
> + * If busiest is overloaded, try to fill spare
> + * capacity. This might end up creating spare capacity
> + * in busiest or busiest still being overloaded but
> + * there is no simple way to directly compute the
> + * amount of load to migrate in order to balance the
> + * system.
> + */
> + env->balance_type = migrate_util;
> + env->imbalance = max(local->group_capacity, local->group_util) -
> + local->group_util;
> + return;
> + }
> +
> + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> + /*
> + * When prefer sibling, evenly spread running tasks on
> + * groups.
> + */
> + env->balance_type = migrate_task;
> + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> + return;
> + }
> +
> + /*
> + * If there is no overload, we just want to even the number of
> + * idle cpus.
> + */
> + env->balance_type = migrate_task;
> + env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
Why do we need a max_t(long, 0, ...) here and not for the 'if
(busiest->group_weight == 1 || sds->prefer_sibling)' case?
> return;
> }
>
> /*
> - * If there aren't any idle CPUs, avoid creating some.
> + * Local is fully busy but have to take more load to relieve the
s/have/has
> + * busiest group
> */
I thought that 'local->group_type == group_imbalanced' is allowed as
well? So 'if (local->group_type < group_overloaded)' (further down)
could include that?
> - if (busiest->group_type == group_overloaded &&
> - local->group_type == group_overloaded) {
> - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> - if (load_above_capacity > busiest->group_capacity) {
> - load_above_capacity -= busiest->group_capacity;
> - load_above_capacity *= scale_load_down(NICE_0_LOAD);
> - load_above_capacity /= busiest->group_capacity;
> - } else
> - load_above_capacity = ~0UL;
> + if (local->group_type < group_overloaded) {
> + /*
> + * Local will become overloaded so the avg_load metrics are
> + * finally needed.
> + */
How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
It would be nice to be able to match all allowed fields of dm to code sections.
[...]
> /******* find_busiest_group() helpers end here *********************/
>
> +/*
> + * Decision matrix according to the local and busiest group state
Minor s/state/type ?
> + *
> + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> + * has_spare nr_idle balanced N/A N/A balanced balanced
> + * fully_busy nr_idle nr_idle N/A N/A balanced balanced
> + * misfit_task force N/A N/A N/A force force
> + * asym_capacity force force N/A N/A force force
s/asym_capacity/asym_packing
> + * imbalanced force force N/A N/A force force
> + * overloaded force force N/A N/A force avg_load
> + *
> + * N/A : Not Applicable because already filtered while updating
> + * statistics.
> + * balanced : The system is balanced for these 2 groups.
> + * force : Calculate the imbalance as load migration is probably needed.
> + * avg_load : Only if imbalance is significant enough.
> + * nr_idle : dst_cpu is not busy and the number of idle cpus is quite
> + * different in groups.
> + */
> +
> /**
> * find_busiest_group - Returns the busiest group within the sched_domain
> * if there is an imbalance.
> @@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> local = &sds.local_stat;
> busiest = &sds.busiest_stat;
>
> - /* ASYM feature bypasses nice load balance check */
> - if (busiest->group_asym_packing)
> - goto force_balance;
> -
> /* There is no busy sibling group to pull tasks from */
> - if (!sds.busiest || busiest->sum_h_nr_running == 0)
> + if (!sds.busiest)
> goto out_balanced;
>
> - /* XXX broken for overlapping NUMA groups */
> - sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
> - / sds.total_capacity;
> + /* Misfit tasks should be dealt with regardless of the avg load */
> + if (busiest->group_type == group_misfit_task)
> + goto force_balance;
> +
> + /* ASYM feature bypasses nice load balance check */
Minor: s/ASYM feature/ASYM_PACKING ... to distinguish clearly from
ASYM_CPUCAPACITY.
> + if (busiest->group_type == group_asym_packing)
> + goto force_balance;
[...]
On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <[email protected]> wrote:
>
> Hi Vincent,
>
> On 19/09/2019 09:33, Vincent Guittot wrote:
>
> these are just some comments & questions based on a code study. Haven't
> run any tests with it yet.
>
> [...]
>
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> > group_has_spare
> > group_fully_busy
> > group_misfit_task
> > group_asym_capacity
>
> s/group_asym_capacity/group_asym_packing
Yes, I forgot to change the commit message and the comments
>
> > group_imbalanced
> > group_overloaded
> >
> > Based on the type fo sched_group, load_balance now sets what it wants to
>
> s/fo/for
s/fo/of/
>
> > move in order to fix the imbalance. It can be some load as before but also
> > some utilization, a number of task or a type of task:
> > migrate_task
> > migrate_util
> > migrate_load
> > migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymmetrics system
>
> s/asymmetrics/asymmetric
yes
>
> This stands for ASYM_CPUCAPACITY and ASYM_PACKING, right?
yes
>
> [...]
>
> > #define LBF_ALL_PINNED 0x01
> > @@ -7115,7 +7130,7 @@ struct lb_env {
> > unsigned int loop_max;
> >
> > enum fbq_type fbq_type;
> > - enum group_type src_grp_type;
> > + enum migration_type balance_type;
>
> Minor thing:
>
> Why not
> enum migration_type migration_type;
> or
> enum balance_type balance_type;
>
> We do the same for other enums like fbq_type or group_type.
yes, I can align
>
> > struct list_head tasks;
> > };
> >
>
> The detach_tasks() comment still mentions only runnable load.
ok
>
> > @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> > {
> > struct list_head *tasks = &env->src_rq->cfs_tasks;
> > struct task_struct *p;
> > - unsigned long load;
> > + unsigned long util, load;
>
> Minor: Order by length or reduce scope to while loop ?
I don't get your point here
>
> > int detached = 0;
> >
> > lockdep_assert_held(&env->src_rq->lock);
> > @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
> > if (!can_migrate_task(p, env))
> > goto next;
> >
> > - load = task_h_load(p);
> > + switch (env->balance_type) {
> > + case migrate_load:
> > + load = task_h_load(p);
> >
> > - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > - goto next;
> > + if (sched_feat(LB_MIN) &&
> > + load < 16 && !env->sd->nr_balance_failed)
> > + goto next;
> >
> > - if ((load / 2) > env->imbalance)
> > - goto next;
> > + if ((load / 2) > env->imbalance)
> > + goto next;
> > +
> > + env->imbalance -= load;
> > + break;
> > +
> > + case migrate_util:
> > + util = task_util_est(p);
> > +
> > + if (util > env->imbalance)
> > + goto next;
> > +
> > + env->imbalance -= util;
> > + break;
> > +
> > + case migrate_task:
> > + /* Migrate task */
>
> Minor: IMHO, this comment is not necessary.
yes
>
> > + env->imbalance--;
> > + break;
> > +
> > + case migrate_misfit:
> > + load = task_h_load(p);
> > +
> > + /*
> > + * utilization of misfit task might decrease a bit
>
> This patch still uses load. IMHO this comment becomes true only with 08/10.
even with 8/10 it's not correct, and it has been removed there.
I'm going to remove it here as well.
>
> [...]
>
> > @@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> > *sds = (struct sd_lb_stats){
> > .busiest = NULL,
> > .local = NULL,
> > - .total_running = 0UL,
> > .total_load = 0UL,
> > .total_capacity = 0UL,
> > .busiest_stat = {
> > - .avg_load = 0UL,
>
> There is a sentence in the comment above explaining why avg_load has to
> be cleared. And IMHO local group isn't cleared anymore but set/initialized.
Yes, I have to update it
>
> > - .sum_h_nr_running = 0,
> > - .group_type = group_other,
> > + .idle_cpus = UINT_MAX,
> > + .group_type = group_has_spare,
> > },
> > };
> > }
>
> [...]
>
> > @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > }
> > }
> >
> > - /* Adjust by relative CPU capacity of the group */
> > + /* Check if dst cpu is idle and preferred to this group */
>
> s/preferred to/preferred by ? or the preferred CPU of this group ?
dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
this group vs dst_cpu which belongs to another group
>
> [...]
>
> > @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > */
> > static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> > {
> > - unsigned long max_pull, load_above_capacity = ~0UL;
> > struct sg_lb_stats *local, *busiest;
> >
> > local = &sds->local_stat;
> > busiest = &sds->busiest_stat;
> >
> > - if (busiest->group_asym_packing) {
> > + if (busiest->group_type == group_misfit_task) {
> > + /* Set imbalance to allow misfit task to be balanced. */
> > + env->balance_type = migrate_misfit;
> > + env->imbalance = busiest->group_misfit_task_load;
> > + return;
> > + }
> > +
> > + if (busiest->group_type == group_asym_packing) {
> > + /*
> > + * In case of asym capacity, we will try to migrate all load to
>
> Does asym capacity stand for asym packing or asym cpu capacity?
busiest->group_type == group_asym_packing
will fix it
>
> > + * the preferred CPU.
> > + */
> > + env->balance_type = migrate_load;
> > env->imbalance = busiest->group_load;
> > return;
> > }
> >
> > + if (busiest->group_type == group_imbalanced) {
> > + /*
> > + * In the group_imb case we cannot rely on group-wide averages
> > + * to ensure CPU-load equilibrium, try to move any task to fix
> > + * the imbalance. The next load balance will take care of
> > + * balancing back the system.
>
> balancing back ?
In case of imbalance, we don't try to balance the system but only try
to get rid of the pinned tasks problem. The system will still be
unbalanced after the migration and the next load balance will take
care of balancing the system
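As a minimal sketch (not from the posted series, reconstructed from the
mainline helper referenced further down in this thread): the group_imbalanced
classification comes from a per-group flag that is raised when pinned tasks
defeated a previous balance attempt, which is why moving a single task is
enough here.

static inline int sg_imbalanced(struct sched_group *group)
{
	/* Set when CPU affinity made a previous load balance attempt fail */
	return group->sgc->imbalance;
}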
>
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = 1;
> > + return;
> > + }
> > +
> > /*
> > - * Avg load of busiest sg can be less and avg load of local sg can
> > - * be greater than avg load across all sgs of sd because avg load
> > - * factors in sg capacity and sgs with smaller group_type are
> > - * skipped when updating the busiest sg:
> > + * Try to use spare capacity of local group without overloading it or
> > + * emptying busiest
> > */
> > - if (busiest->group_type != group_misfit_task &&
> > - (busiest->avg_load <= sds->avg_load ||
> > - local->avg_load >= sds->avg_load)) {
> > - env->imbalance = 0;
> > + if (local->group_type == group_has_spare) {
> > + if (busiest->group_type > group_fully_busy) {
>
> So this could be 'busiest->group_type == group_overloaded' here to match
> the comment below? Since you handle group_misfit_task,
> group_asym_packing, group_imbalanced above and return.
This is just to be more robust in case some new states are added later
>
> > + /*
> > + * If busiest is overloaded, try to fill spare
> > + * capacity. This might end up creating spare capacity
> > + * in busiest or busiest still being overloaded but
> > + * there is no simple way to directly compute the
> > + * amount of load to migrate in order to balance the
> > + * system.
> > + */
> > + env->balance_type = migrate_util;
> > + env->imbalance = max(local->group_capacity, local->group_util) -
> > + local->group_util;
> > + return;
> > + }
> > +
> > + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > + /*
> > + * When prefer sibling, evenly spread running tasks on
> > + * groups.
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> > + return;
> > + }
> > +
> > + /*
> > + * If there is no overload, we just want to even the number of
> > + * idle cpus.
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>
> Why do we need a max_t(long, 0, ...) here and not for the 'if
> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
either we have sds->prefer_sibling && busiest->sum_nr_running >
local->sum_nr_running + 1
or busiest->group_weight == 1 and env->idle != CPU_NOT_IDLE so local
cpu is idle or newly idle
>
> > return;
> > }
> >
> > /*
> > - * If there aren't any idle CPUs, avoid creating some.
> > + * Local is fully busy but have to take more load to relieve the
>
> s/have/has
>
> > + * busiest group
> > */
>
> I thought that 'local->group_type == group_imbalanced' is allowed as
> well? So 'if (local->group_type < group_overloaded)' (further down)
> could include that?
yes.
The imbalanced state is not very useful for the local group; it only reflects
that the group is not overloaded, so it is either fully busy or has spare
capacity.
In this case we assume the worst: fully_busy
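For reference, this relies on the group_type ordering listed earlier in the
thread (sketch of the assumed enum; group_imbalanced sorts below
group_overloaded, so an imbalanced local group also takes the
'local->group_type < group_overloaded' branch):

enum group_type {
	group_has_spare = 0,	/* idle CPUs or unused capacity left */
	group_fully_busy,	/* no spare capacity, but not overloaded */
	group_misfit_task,	/* a task too big for its CPU's capacity */
	group_asym_packing,	/* a higher-priority dst CPU is available */
	group_imbalanced,	/* affinity broke a previous balance */
	group_overloaded	/* more load than capacity */
};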
>
> > - if (busiest->group_type == group_overloaded &&
> > - local->group_type == group_overloaded) {
> > - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> > - if (load_above_capacity > busiest->group_capacity) {
> > - load_above_capacity -= busiest->group_capacity;
> > - load_above_capacity *= scale_load_down(NICE_0_LOAD);
> > - load_above_capacity /= busiest->group_capacity;
> > - } else
> > - load_above_capacity = ~0UL;
> > + if (local->group_type < group_overloaded) {
> > + /*
> > + * Local will become overloaded so the avg_load metrics are
> > + * finally needed.
> > + */
>
> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> It would be nice to be able to match all allowed fields of dm to code sections.
decision_matrix describes how it decides between balanced or unbalanced.
In case of dm[overload, overload], we use the avg_load to decide if it
is balanced or not
In case of dm[fully_busy, overload], the groups are unbalanced because
fully_busy < overload and we force the balance. Then
calculate_imbalance() uses the avg_load to decide how much will be
moved
dm[overload, overload]=force means that we force the balance and we
will compute later the imbalance. avg_load may be used to calculate
the imbalance
dm[overload, overload]=avg_load means that we compare the avg_load to
decide whether we need to balance load between groups
dm[overload, overload]=nr_idle means that we compare the number of
idle cpus to decide whether we need to balance. In fact this is no
more true with patch 7 because we also take into account the number of
nr_h_running when weight =1
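To make that mapping more concrete, a condensed reconstruction (not the patch
itself) of how those matrix values show up in find_busiest_group():

	if (!sds.busiest)
		goto out_balanced;		/* nothing to pull */

	/* "force" rows: imbalance is computed later in calculate_imbalance() */
	if (busiest->group_type == group_misfit_task)
		goto force_balance;
	if (busiest->group_type == group_asym_packing)
		goto force_balance;
	if (busiest->group_type == group_imbalanced)
		goto force_balance;

	/*
	 * "nr_idle" / "balanced" entries: when busiest is not overloaded,
	 * compare the number of idle CPUs; otherwise fall through so that
	 * calculate_imbalance() can use avg_load ("avg_load" entry).
	 */
	if (busiest->group_type != group_overloaded &&
	    (env->idle == CPU_NOT_IDLE ||
	     local->idle_cpus <= (busiest->idle_cpus + 1)))
		goto out_balanced;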
>
> [...]
>
> > /******* find_busiest_group() helpers end here *********************/
> >
> > +/*
> > + * Decision matrix according to the local and busiest group state
>
> Minor s/state/type ?
ok
>
> > + *
> > + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> > + * has_spare nr_idle balanced N/A N/A balanced balanced
> > + * fully_busy nr_idle nr_idle N/A N/A balanced balanced
> > + * misfit_task force N/A N/A N/A force force
> > + * asym_capacity force force N/A N/A force force
>
> s/asym_capacity/asym_packing
yes
>
> > + * imbalanced force force N/A N/A force force
> > + * overloaded force force N/A N/A force avg_load
> > + *
> > + * N/A : Not Applicable because already filtered while updating
> > + * statistics.
> > + * balanced : The system is balanced for these 2 groups.
> > + * force : Calculate the imbalance as load migration is probably needed.
> > + * avg_load : Only if imbalance is significant enough.
> > + * nr_idle : dst_cpu is not busy and the number of idle cpus is quite
> > + * different in groups.
> > + */
> > +
> > /**
> > * find_busiest_group - Returns the busiest group within the sched_domain
> > * if there is an imbalance.
> > @@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > local = &sds.local_stat;
> > busiest = &sds.busiest_stat;
> >
> > - /* ASYM feature bypasses nice load balance check */
> > - if (busiest->group_asym_packing)
> > - goto force_balance;
> > -
> > /* There is no busy sibling group to pull tasks from */
> > - if (!sds.busiest || busiest->sum_h_nr_running == 0)
> > + if (!sds.busiest)
> > goto out_balanced;
> >
> > - /* XXX broken for overlapping NUMA groups */
> > - sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
> > - / sds.total_capacity;
> > + /* Misfit tasks should be dealt with regardless of the avg load */
> > + if (busiest->group_type == group_misfit_task)
> > + goto force_balance;
> > +
> > + /* ASYM feature bypasses nice load balance check */
>
> Minor: s/ASYM feature/ASYM_PACKING ... to distinguish clearly from
> ASYM_CPUCAPACITY.
yes
>
> > + if (busiest->group_type == group_asym_packing)
> > + goto force_balance;
>
> [...]
>
On 19/09/2019 09:33, Vincent Guittot wrote:
[...]
> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> }
> }
>
> - /* Adjust by relative CPU capacity of the group */
> + /* Check if dst cpu is idle and preferred to this group */
> + if (env->sd->flags & SD_ASYM_PACKING &&
> + env->idle != CPU_NOT_IDLE &&
> + sgs->sum_h_nr_running &&
> + sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> + sgs->group_asym_packing = 1;
> + }
> +
Since the asym_packing check is done per-sg rather than per-CPU (like misfit_task), can you
not check for asym_packing in group_classify() directly (like for overloaded) and
so get rid of struct sg_lb_stats::group_asym_packing?
Something like (only compile tested).
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..b2848d6f8a2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7692,7 +7692,6 @@ struct sg_lb_stats {
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -7952,6 +7951,20 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
return false;
}
+static inline bool
+group_has_asym_packing(struct lb_env *env, struct sched_group *sg,
+ struct sg_lb_stats *sgs)
+{
+ if (env->sd->flags & SD_ASYM_PACKING &&
+ env->idle != CPU_NOT_IDLE &&
+ sgs->sum_h_nr_running &&
+ sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
+ return true;
+ }
+
+ return false;
+}
+
/*
* group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
* per-CPU capacity than sched_group ref.
@@ -7983,7 +7996,7 @@ group_type group_classify(struct lb_env *env,
if (sg_imbalanced(group))
return group_imbalanced;
- if (sgs->group_asym_packing)
+ if (group_has_asym_packing(env, group, sgs))
return group_asym_packing;
if (sgs->group_misfit_task_load)
@@ -8076,14 +8089,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
}
- /* Check if dst cpu is idle and preferred to this group */
- if (env->sd->flags & SD_ASYM_PACKING &&
- env->idle != CPU_NOT_IDLE &&
- sgs->sum_h_nr_running &&
- sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
- sgs->group_asym_packing = 1;
- }
-
sgs->group_capacity = group->sgc->capacity;
[...]
group_asym_packing
On Tue, 1 Oct 2019 at 10:15, Dietmar Eggemann <[email protected]> wrote:
>
> On 19/09/2019 09:33, Vincent Guittot wrote:
>
>
> [...]
>
> > @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > }
> > }
> >
> > - /* Adjust by relative CPU capacity of the group */
> > + /* Check if dst cpu is idle and preferred to this group */
> > + if (env->sd->flags & SD_ASYM_PACKING &&
> > + env->idle != CPU_NOT_IDLE &&
> > + sgs->sum_h_nr_running &&
> > + sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> > + sgs->group_asym_packing = 1;
> > + }
> > +
>
> Since the asym_packing check is done per-sg rather than per-CPU (like misfit_task), can you
> not check for asym_packing in group_classify() directly (like for overloaded) and
> so get rid of struct sg_lb_stats::group_asym_packing?
asym_packing uses a lot of fields of env to set group_asym_packing but
env is not statistics and I'd like to remove env from the
group_classify() arguments so it will use only statistics.
For now, env is an arg of group_classify() only for the imbalance_pct field
and I have replaced env by imbalance_pct in a patch that is under
preparation.
With this change, I can use group_classify() outside load_balance()
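A rough sketch of that direction (hypothetical names and signature, not code
from this series): group_classify() only needs the group statistics plus the
domain's imbalance_pct, so the same helper could also be reused on the
wakeup / find_idlest_group() path.

static inline enum group_type
group_classify(unsigned int imbalance_pct, struct sched_group *group,
	       struct sg_lb_stats *sgs)
{
	if (group_is_overloaded(imbalance_pct, sgs))
		return group_overloaded;

	if (sg_imbalanced(group))
		return group_imbalanced;

	if (sgs->group_asym_packing)
		return group_asym_packing;

	if (sgs->group_misfit_task_load)
		return group_misfit_task;

	if (!group_has_capacity(imbalance_pct, sgs))
		return group_fully_busy;

	return group_has_spare;
}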
>
> Something like (only compile tested).
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0c3aa1dc290..b2848d6f8a2a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7692,7 +7692,6 @@ struct sg_lb_stats {
> unsigned int idle_cpus;
> unsigned int group_weight;
> enum group_type group_type;
> - unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> @@ -7952,6 +7951,20 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
> return false;
> }
>
> +static inline bool
> +group_has_asym_packing(struct lb_env *env, struct sched_group *sg,
> + struct sg_lb_stats *sgs)
> +{
> + if (env->sd->flags & SD_ASYM_PACKING &&
> + env->idle != CPU_NOT_IDLE &&
> + sgs->sum_h_nr_running &&
> + sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
> + return true;
> + }
> +
> + return false;
> +}
> +
> /*
> * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
> * per-CPU capacity than sched_group ref.
> @@ -7983,7 +7996,7 @@ group_type group_classify(struct lb_env *env,
> if (sg_imbalanced(group))
> return group_imbalanced;
>
> - if (sgs->group_asym_packing)
> + if (group_has_asym_packing(env, group, sgs))
> return group_asym_packing;
>
> if (sgs->group_misfit_task_load)
> @@ -8076,14 +8089,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> }
> }
>
> - /* Check if dst cpu is idle and preferred to this group */
> - if (env->sd->flags & SD_ASYM_PACKING &&
> - env->idle != CPU_NOT_IDLE &&
> - sgs->sum_h_nr_running &&
> - sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> - sgs->group_asym_packing = 1;
> - }
> -
> sgs->group_capacity = group->sgc->capacity;
>
> [...]
On 01/10/2019 10:14, Vincent Guittot wrote:
> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <[email protected]> wrote:
>>
>> Hi Vincent,
>>
>> On 19/09/2019 09:33, Vincent Guittot wrote:
[...]
>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>>> {
>>> struct list_head *tasks = &env->src_rq->cfs_tasks;
>>> struct task_struct *p;
>>> - unsigned long load;
>>> + unsigned long util, load;
>>
>> Minor: Order by length or reduce scope to while loop ?
>
> I don't get your point here
Nothing dramatic here! Just
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..a08f342ead89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
static int detach_tasks(struct lb_env *env)
{
struct list_head *tasks = &env->src_rq->cfs_tasks;
- struct task_struct *p;
unsigned long load, util;
+ struct task_struct *p;
int detached = 0;
lockdep_assert_held(&env->src_rq->lock);
or
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..4d1864d43ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7334,7 +7334,6 @@ static int detach_tasks(struct lb_env *env)
{
struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
- unsigned long load, util;
int detached = 0;
lockdep_assert_held(&env->src_rq->lock);
@@ -7343,6 +7342,8 @@ static int detach_tasks(struct lb_env *env)
return 0;
while (!list_empty(tasks)) {
+ unsigned long load, util;
+
/*
[...]
>>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>> }
>>> }
>>>
>>> - /* Adjust by relative CPU capacity of the group */
>>> + /* Check if dst cpu is idle and preferred to this group */
>>
>> s/preferred to/preferred by ? or the preferred CPU of this group ?
>
> dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
> this group vs dst_cpu which belongs to another group
Ah, in the sense of 'preferred over'. Got it now!
[...]
>>> + if (busiest->group_type == group_imbalanced) {
>>> + /*
>>> + * In the group_imb case we cannot rely on group-wide averages
>>> + * to ensure CPU-load equilibrium, try to move any task to fix
>>> + * the imbalance. The next load balance will take care of
>>> + * balancing back the system.
>>
>> balancing back ?
>
> In case of imbalance, we don't try to balance the system but only try
> to get rid of the pinned tasks problem. The system will still be
> unbalanced after the migration and the next load balance will take
> care of balancing the system
OK.
[...]
>>> /*
>>> - * Avg load of busiest sg can be less and avg load of local sg can
>>> - * be greater than avg load across all sgs of sd because avg load
>>> - * factors in sg capacity and sgs with smaller group_type are
>>> - * skipped when updating the busiest sg:
>>> + * Try to use spare capacity of local group without overloading it or
>>> + * emptying busiest
>>> */
>>> - if (busiest->group_type != group_misfit_task &&
>>> - (busiest->avg_load <= sds->avg_load ||
>>> - local->avg_load >= sds->avg_load)) {
>>> - env->imbalance = 0;
>>> + if (local->group_type == group_has_spare) {
>>> + if (busiest->group_type > group_fully_busy) {
>>
>> So this could be 'busiest->group_type == group_overloaded' here to match
>> the comment below? Since you handle group_misfit_task,
>> group_asym_packing, group_imbalanced above and return.
>
> This is just to be more robust in case some new states are added later
OK, although I doubt that additional states can be added easily w/o
carefully auditing the entire lb code ;-)
[...]
>>> + if (busiest->group_weight == 1 || sds->prefer_sibling) {
>>> + /*
>>> + * When prefer sibling, evenly spread running tasks on
>>> + * groups.
>>> + */
>>> + env->balance_type = migrate_task;
>>> + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * If there is no overload, we just want to even the number of
>>> + * idle cpus.
>>> + */
>>> + env->balance_type = migrate_task;
>>> + env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>>
>> Why do we need a max_t(long, 0, ...) here and not for the 'if
>> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
>
> For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>
> either we have sds->prefer_sibling && busiest->sum_nr_running >
> local->sum_nr_running + 1
I see, this corresponds to
/* Try to move all excess tasks to child's sibling domain */
if (sds.prefer_sibling && local->group_type == group_has_spare &&
busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
goto force_balance;
in find_busiest_group, I assume.
Haven't been able to recreate this yet on my arm64 platform since there
is no prefer_sibling and in case local and busiest have
group_type=group_has_spare they bailout in
if (busiest->group_type != group_overloaded &&
(env->idle == CPU_NOT_IDLE ||
local->idle_cpus <= (busiest->idle_cpus + 1)))
goto out_balanced;
[...]
>>> - if (busiest->group_type == group_overloaded &&
>>> - local->group_type == group_overloaded) {
>>> - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
>>> - if (load_above_capacity > busiest->group_capacity) {
>>> - load_above_capacity -= busiest->group_capacity;
>>> - load_above_capacity *= scale_load_down(NICE_0_LOAD);
>>> - load_above_capacity /= busiest->group_capacity;
>>> - } else
>>> - load_above_capacity = ~0UL;
>>> + if (local->group_type < group_overloaded) {
>>> + /*
>>> + * Local will become overloaded so the avg_load metrics are
>>> + * finally needed.
>>> + */
>>
>> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
>> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
>> It would be nice to be able to match all allowed fields of dm to code sections.
>
> decision_matrix describes how it decides between balanced or unbalanced.
> In case of dm[overload, overload], we use the avg_load to decide if it
> is balanced or not
OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
for 'sgs->group_type == group_overloaded'.
> In case of dm[fully_busy, overload], the groups are unbalanced because
> fully_busy < overload and we force the balance. Then
> calculate_imbalance() uses the avg_load to decide how much will be
> moved
And in this case 'local->group_type < group_overloaded' in
calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
calculated before using them in env->imbalance = min(...).
OK, got it now.
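For completeness, my reconstruction of the min(...) mentioned above (treat it
as a sketch, not a quote of the patch): pull no more load than busiest exceeds
the domain average by, and no more than local can absorb before it too goes
above the average.

	env->imbalance = min(
		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
		(sds->avg_load - local->avg_load) * local->group_capacity
	) / SCHED_CAPACITY_SCALE;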
> dm[overload, overload]=force means that we force the balance and we
> will compute later the imbalance. avg_load may be used to calculate
> the imbalance
> dm[overload, overload]=avg_load means that we compare the avg_load to
> decide whether we need to balance load between groups
> dm[overload, overload]=nr_idle means that we compare the number of
> idle cpus to decide whether we need to balance. In fact this is no
> more true with patch 7 because we also take into account the number of
> nr_h_running when weight =1
This becomes clearer now ... slowly.
[...]
On 01/10/2019 11:14, Vincent Guittot wrote:
> group_asym_packing
>
> On Tue, 1 Oct 2019 at 10:15, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 19/09/2019 09:33, Vincent Guittot wrote:
>>
>>
>> [...]
>>
>>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>> }
>>> }
>>>
>>> - /* Adjust by relative CPU capacity of the group */
>>> + /* Check if dst cpu is idle and preferred to this group */
>>> + if (env->sd->flags & SD_ASYM_PACKING &&
>>> + env->idle != CPU_NOT_IDLE &&
>>> + sgs->sum_h_nr_running &&
>>> + sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
>>> + sgs->group_asym_packing = 1;
>>> + }
>>> +
>>
>> Since the asym_packing check is done per-sg rather than per-CPU (like misfit_task), can you
>> not check for asym_packing in group_classify() directly (like for overloaded) and
>> so get rid of struct sg_lb_stats::group_asym_packing?
>
> asym_packing uses a lot of fields of env to set group_asym_packing but
> env is not statistics and I'd like to remove env from the
> group_classify() arguments so it will use only statistics.
> For now, env is an arg of group_classify() only for the imbalance_pct field
> and I have replaced env by imbalance_pct in a patch that is under
> preparation.
> With this change, I can use group_classify() outside load_balance()
OK, I see. This relates to the 'update find_idlest_group() to be more
aligned with load_balance()' point mentioned in the cover letter I
assume. To make sure we can use load instead of runnable_load there as well.
[...]
On 19/09/2019 08:33, Vincent Guittot wrote:
> Rename sum_nr_running to sum_h_nr_running because it effectively tracks
> cfs->h_nr_running so we can use sum_nr_running to track rq->nr_running
> when needed.
>
> There is no functional changes.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
On 19/09/2019 08:33, Vincent Guittot wrote:
> clean up load_balance and remove meaningless calculation and fields before
> adding new algorithm.
>
> Signed-off-by: Vincent Guittot <[email protected]>
We'll probably want to squash the removal of fix_small_imbalance() in the
actual rework (patch 04) to not make this a regression bisect honeypot.
Other than that:
Reviewed-by: Valentin Schneider <[email protected]>
On 19/09/2019 08:33, Vincent Guittot wrote:
[...]
> @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> */
> static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> {
> - unsigned long max_pull, load_above_capacity = ~0UL;
> struct sg_lb_stats *local, *busiest;
>
> local = &sds->local_stat;
> busiest = &sds->busiest_stat;
>
> - if (busiest->group_asym_packing) {
> + if (busiest->group_type == group_misfit_task) {
> + /* Set imbalance to allow misfit task to be balanced. */
> + env->balance_type = migrate_misfit;
> + env->imbalance = busiest->group_misfit_task_load;
> + return;
> + }
> +
> + if (busiest->group_type == group_asym_packing) {
> + /*
> + * In case of asym capacity, we will try to migrate all load to
> + * the preferred CPU.
> + */
> + env->balance_type = migrate_load;
> env->imbalance = busiest->group_load;
> return;
> }
>
> + if (busiest->group_type == group_imbalanced) {
> + /*
> + * In the group_imb case we cannot rely on group-wide averages
> + * to ensure CPU-load equilibrium, try to move any task to fix
> + * the imbalance. The next load balance will take care of
> + * balancing back the system.
> + */
> + env->balance_type = migrate_task;
> + env->imbalance = 1;
> + return;
> + }
> +
> /*
> - * Avg load of busiest sg can be less and avg load of local sg can
> - * be greater than avg load across all sgs of sd because avg load
> - * factors in sg capacity and sgs with smaller group_type are
> - * skipped when updating the busiest sg:
> + * Try to use spare capacity of local group without overloading it or
> + * emptying busiest
> */
> - if (busiest->group_type != group_misfit_task &&
> - (busiest->avg_load <= sds->avg_load ||
> - local->avg_load >= sds->avg_load)) {
> - env->imbalance = 0;
> + if (local->group_type == group_has_spare) {
> + if (busiest->group_type > group_fully_busy) {
> + /*
> + * If busiest is overloaded, try to fill spare
> + * capacity. This might end up creating spare capacity
> + * in busiest or busiest still being overloaded but
> + * there is no simple way to directly compute the
> + * amount of load to migrate in order to balance the
> + * system.
> + */
> + env->balance_type = migrate_util;
> + env->imbalance = max(local->group_capacity, local->group_util) -
> + local->group_util;
> + return;
> + }
> +
> + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> + /*
> + * When prefer sibling, evenly spread running tasks on
> + * groups.
> + */
> + env->balance_type = migrate_task;
> + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
Isn't that one somewhat risky?
Say both groups are classified group_has_spare and we do prefer_sibling.
We'd select busiest as the one with the maximum number of busy CPUs, but it
could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because
pinned tasks or wakeup failed to properly spread stuff).
The thing should be unsigned so at least we save ourselves from right
shifting a negative value, but we still end up with a ginormous imbalance
(which we then store into env.imbalance which *is* signed... Urgh).
[...]
On Tue, 1 Oct 2019 at 19:12, Valentin Schneider
<[email protected]> wrote:
>
> On 19/09/2019 08:33, Vincent Guittot wrote:
> > clean up load_balance and remove meaningless calculation and fields before
> > adding new algorithm.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
>
> We'll probably want to squash the removal of fix_small_imbalance() in the
> actual rework (patch 04) to not make this a regression bisect honeypot.
Yes that's the plan. Peter asked to split the patch to ease the review
> Other than that:
>
> Reviewed-by: Valentin Schneider <[email protected]>
On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <[email protected]> wrote:
>
> On 01/10/2019 10:14, Vincent Guittot wrote:
> > On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 19/09/2019 09:33, Vincent Guittot wrote:
>
> [...]
>
> >>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> >>> {
> >>> struct list_head *tasks = &env->src_rq->cfs_tasks;
> >>> struct task_struct *p;
> >>> - unsigned long load;
> >>> + unsigned long util, load;
> >>
> >> Minor: Order by length or reduce scope to while loop ?
> >
> > I don't get your point here
>
> Nothing dramatic here! Just
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0c3aa1dc290..a08f342ead89 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
> static int detach_tasks(struct lb_env *env)
> {
> struct list_head *tasks = &env->src_rq->cfs_tasks;
> - struct task_struct *p;
> unsigned long load, util;
> + struct task_struct *p;
hmm... I still don't get this.
We usually gather pointers instead of interleaving them with other variables
> int detached = 0;
>
> lockdep_assert_held(&env->src_rq->lock);
>
> or
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0c3aa1dc290..4d1864d43ed7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7334,7 +7334,6 @@ static int detach_tasks(struct lb_env *env)
> {
> struct list_head *tasks = &env->src_rq->cfs_tasks;
> struct task_struct *p;
> - unsigned long load, util;
> int detached = 0;
>
> lockdep_assert_held(&env->src_rq->lock);
> @@ -7343,6 +7342,8 @@ static int detach_tasks(struct lb_env *env)
> return 0;
>
> while (!list_empty(tasks)) {
> + unsigned long load, util;
> +
> /*
>
> [...]
>
> >>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>> }
> >>> }
> >>>
> >>> - /* Adjust by relative CPU capacity of the group */
> >>> + /* Check if dst cpu is idle and preferred to this group */
> >>
> >> s/preferred to/preferred by ? or the preferred CPU of this group ?
> >
> > dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
> > this group vs dst_cpu which belongs to another group
>
> Ah, in the sense of 'preferred over'. Got it now!
>
> [...]
>
> >>> + if (busiest->group_type == group_imbalanced) {
> >>> + /*
> >>> + * In the group_imb case we cannot rely on group-wide averages
> >>> + * to ensure CPU-load equilibrium, try to move any task to fix
> >>> + * the imbalance. The next load balance will take care of
> >>> + * balancing back the system.
> >>
> >> balancing back ?
> >
> > In case of imbalance, we don't try to balance the system but only try
> > to get rid of the pinned tasks problem. The system will still be
> > unbalanced after the migration and the next load balance will take
> > care of balancing the system
>
> OK.
>
> [...]
>
> >>> /*
> >>> - * Avg load of busiest sg can be less and avg load of local sg can
> >>> - * be greater than avg load across all sgs of sd because avg load
> >>> - * factors in sg capacity and sgs with smaller group_type are
> >>> - * skipped when updating the busiest sg:
> >>> + * Try to use spare capacity of local group without overloading it or
> >>> + * emptying busiest
> >>> */
> >>> - if (busiest->group_type != group_misfit_task &&
> >>> - (busiest->avg_load <= sds->avg_load ||
> >>> - local->avg_load >= sds->avg_load)) {
> >>> - env->imbalance = 0;
> >>> + if (local->group_type == group_has_spare) {
> >>> + if (busiest->group_type > group_fully_busy) {
> >>
> >> So this could be 'busiest->group_type == group_overloaded' here to match
> >> the comment below? Since you handle group_misfit_task,
> >> group_asym_packing, group_imbalanced above and return.
> >
> > This is just to be more robust in case some new states are added later
>
> OK, although I doubt that additional states can be added easily w/o
> carefully auditing the entire lb code ;-)
>
> [...]
>
> >>> + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >>> + /*
> >>> + * When prefer sibling, evenly spread running tasks on
> >>> + * groups.
> >>> + */
> >>> + env->balance_type = migrate_task;
> >>> + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >>> + return;
> >>> + }
> >>> +
> >>> + /*
> >>> + * If there is no overload, we just want to even the number of
> >>> + * idle cpus.
> >>> + */
> >>> + env->balance_type = migrate_task;
> >>> + env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
> >>
> >> Why do we need a max_t(long, 0, ...) here and not for the 'if
> >> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> >
> > For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >
> > either we have sds->prefer_sibling && busiest->sum_nr_running >
> > local->sum_nr_running + 1
>
> I see, this corresponds to
>
> /* Try to move all excess tasks to child's sibling domain */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> goto force_balance;
>
> in find_busiest_group, I assume.
yes, it is
>
> Haven't been able to recreate this yet on my arm64 platform since there
> is no prefer_sibling and in case local and busiest have
You probably have a b.L platform for which the flag is cleared because
the hikey (dual quad cores arm64) takes advantage of prefer sibling
at DIE level to spread tasks
> group_type=group_has_spare they bailout in
>
> if (busiest->group_type != group_overloaded &&
> (env->idle == CPU_NOT_IDLE ||
> local->idle_cpus <= (busiest->idle_cpus + 1)))
> goto out_balanced;
>
>
> [...]
>
> >>> - if (busiest->group_type == group_overloaded &&
> >>> - local->group_type == group_overloaded) {
> >>> - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> >>> - if (load_above_capacity > busiest->group_capacity) {
> >>> - load_above_capacity -= busiest->group_capacity;
> >>> - load_above_capacity *= scale_load_down(NICE_0_LOAD);
> >>> - load_above_capacity /= busiest->group_capacity;
> >>> - } else
> >>> - load_above_capacity = ~0UL;
> >>> + if (local->group_type < group_overloaded) {
> >>> + /*
> >>> + * Local will become overloaded so the avg_load metrics are
> >>> + * finally needed.
> >>> + */
> >>
> >> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> >> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> >> It would be nice to be able to match all allowed fields of dm to code sections.
> >
> > decision_matrix describes how it decides between balanced or unbalanced.
> > In case of dm[overload, overload], we use the avg_load to decide if it
> > is balanced or not
>
> OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
> for 'sgs->group_type == group_overloaded'.
>
> > In case of dm[fully_busy, overload], the groups are unbalanced because
> > fully_busy < overload and we force the balance. Then
> > calculate_imbalance() uses the avg_load to decide how much will be
> > moved
>
> And in this case 'local->group_type < group_overloaded' in
> calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
> calculated before using them in env->imbalance = min(...).
>
> OK, got it now.
>
> > dm[overload, overload]=force means that we force the balance and we
> > will compute later the imbalance. avg_load may be used to calculate
> > the imbalance
> > dm[overload, overload]=avg_load means that we compare the avg_load to
> > decide whether we need to balance load between groups
> > dm[overload, overload]=nr_idle means that we compare the number of
> > idle cpus to decide whether we need to balance. In fact this is no
> > more true with patch 7 because we also take into account the number of
> > nr_h_running when weight =1
>
> This becomes clearer now ... slowly.
>
> [...]
On Tue, 1 Oct 2019 at 19:47, Valentin Schneider
<[email protected]> wrote:
>
> On 19/09/2019 08:33, Vincent Guittot wrote:
>
> [...]
>
> > @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > */
> > static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> > {
> > - unsigned long max_pull, load_above_capacity = ~0UL;
> > struct sg_lb_stats *local, *busiest;
> >
> > local = &sds->local_stat;
> > busiest = &sds->busiest_stat;
> >
> > - if (busiest->group_asym_packing) {
> > + if (busiest->group_type == group_misfit_task) {
> > + /* Set imbalance to allow misfit task to be balanced. */
> > + env->balance_type = migrate_misfit;
> > + env->imbalance = busiest->group_misfit_task_load;
> > + return;
> > + }
> > +
> > + if (busiest->group_type == group_asym_packing) {
> > + /*
> > + * In case of asym capacity, we will try to migrate all load to
> > + * the preferred CPU.
> > + */
> > + env->balance_type = migrate_load;
> > env->imbalance = busiest->group_load;
> > return;
> > }
> >
> > + if (busiest->group_type == group_imbalanced) {
> > + /*
> > + * In the group_imb case we cannot rely on group-wide averages
> > + * to ensure CPU-load equilibrium, try to move any task to fix
> > + * the imbalance. The next load balance will take care of
> > + * balancing back the system.
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = 1;
> > + return;
> > + }
> > +
> > /*
> > - * Avg load of busiest sg can be less and avg load of local sg can
> > - * be greater than avg load across all sgs of sd because avg load
> > - * factors in sg capacity and sgs with smaller group_type are
> > - * skipped when updating the busiest sg:
> > + * Try to use spare capacity of local group without overloading it or
> > + * emptying busiest
> > */
> > - if (busiest->group_type != group_misfit_task &&
> > - (busiest->avg_load <= sds->avg_load ||
> > - local->avg_load >= sds->avg_load)) {
> > - env->imbalance = 0;
> > + if (local->group_type == group_has_spare) {
> > + if (busiest->group_type > group_fully_busy) {
> > + /*
> > + * If busiest is overloaded, try to fill spare
> > + * capacity. This might end up creating spare capacity
> > + * in busiest or busiest still being overloaded but
> > + * there is no simple way to directly compute the
> > + * amount of load to migrate in order to balance the
> > + * system.
> > + */
> > + env->balance_type = migrate_util;
> > + env->imbalance = max(local->group_capacity, local->group_util) -
> > + local->group_util;
> > + return;
> > + }
> > +
> > + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > + /*
> > + * When prefer sibling, evenly spread running tasks on
> > + * groups.
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>
> Isn't that one somewhat risky?
>
> Say both groups are classified group_has_spare and we do prefer_sibling.
> We'd select busiest as the one with the maximum number of busy CPUs, but it
> could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because
> pinned tasks or wakeup failed to properly spread stuff).
>
> The thing should be unsigned so at least we save ourselves from right
> shifting a negative value, but we still end up with a ginormous imbalance
> (which we then store into env.imbalance which *is* signed... Urgh).
so it's not clear what happens with a right shift on a negative signed
value and this seems to be compiler dependent, so even
max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1) might be wrong
I'm going to update it
>
> [...]
On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <[email protected]> wrote:
>
> On 01/10/2019 10:14, Vincent Guittot wrote:
> > On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 19/09/2019 09:33, Vincent Guittot wrote:
>
[...]
>
> >>> + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >>> + /*
> >>> + * When prefer sibling, evenly spread running tasks on
> >>> + * groups.
> >>> + */
> >>> + env->balance_type = migrate_task;
> >>> + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >>> + return;
> >>> + }
> >>> +
> >>> + /*
> >>> + * If there is no overload, we just want to even the number of
> >>> + * idle cpus.
> >>> + */
> >>> + env->balance_type = migrate_task;
> >>> + env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
> >>
> >> Why do we need a max_t(long, 0, ...) here and not for the 'if
> >> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> >
> > For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >
> > either we have sds->prefer_sibling && busiest->sum_nr_running >
> > local->sum_nr_running + 1
>
> I see, this corresponds to
>
> /* Try to move all excess tasks to child's sibling domain */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> goto force_balance;
>
> in find_busiest_group, I assume.
yes. But it seems that I missed a case:
prefer_sibling is set
busiest->sum_h_nr_running <= local->sum_h_nr_running + 1 so we skip
goto force_balance above
But env->idle != CPU_NOT_IDLE and local->idle_cpus >
(busiest->idle_cpus + 1) so we also skip goto out_balanced and finally
call calculate_imbalance()
in calculate_imbalance with prefer_sibling set, imbalance =
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
so we probably want something similar to max_t(long, 0,
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1)
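One shift-safe way to express that (illustration only, not the final patch)
is to clamp the signed difference before shifting, so a negative value is
never right-shifted:

	long nr_diff = (long)busiest->sum_h_nr_running -
		       (long)local->sum_h_nr_running;

	env->balance_type = migrate_task;
	/* clamp first: right-shifting a negative value is
	 * implementation-defined in C */
	env->imbalance = nr_diff > 0 ? nr_diff >> 1 : 0;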
>
> Haven't been able to recreate this yet on my arm64 platform since there
> is no prefer_sibling and in case local and busiest have
> group_type=group_has_spare they bailout in
>
> if (busiest->group_type != group_overloaded &&
> (env->idle == CPU_NOT_IDLE ||
> local->idle_cpus <= (busiest->idle_cpus + 1)))
> goto out_balanced;
>
>
> [...]
>
> >>> - if (busiest->group_type == group_overloaded &&
> >>> - local->group_type == group_overloaded) {
> >>> - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> >>> - if (load_above_capacity > busiest->group_capacity) {
> >>> - load_above_capacity -= busiest->group_capacity;
> >>> - load_above_capacity *= scale_load_down(NICE_0_LOAD);
> >>> - load_above_capacity /= busiest->group_capacity;
> >>> - } else
> >>> - load_above_capacity = ~0UL;
> >>> + if (local->group_type < group_overloaded) {
> >>> + /*
> >>> + * Local will become overloaded so the avg_load metrics are
> >>> + * finally needed.
> >>> + */
> >>
> >> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> >> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> >> It would be nice to be able to match all allowed fields of dm to code sections.
> >
> > decision_matrix describes how it decides between balanced or unbalanced.
> > In case of dm[overload, overload], we use the avg_load to decide if it
> > is balanced or not
>
> OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
> for 'sgs->group_type == group_overloaded'.
>
> > In case of dm[fully_busy, overload], the groups are unbalanced because
> > fully_busy < overload and we force the balance. Then
> > calculate_imbalance() uses the avg_load to decide how much will be
> > moved
>
> And in this case 'local->group_type < group_overloaded' in
> calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
> calculated before using them in env->imbalance = min(...).
>
> OK, got it now.
>
> > dm[overload, overload]=force means that we force the balance and we
> > will compute later the imbalance. avg_load may be used to calculate
> > the imbalance
> > dm[overload, overload]=avg_load means that we compare the avg_load to
> > decide whether we need to balance load between groups
> > dm[overload, overload]=nr_idle means that we compare the number of
> > idle cpus to decide whether we need to balance. In fact this is no
> > more true with patch 7 because we also take into account the number of
> > nr_h_running when weight =1
>
> This becomes clearer now ... slowly.
>
> [...]
On 02/10/2019 08:44, Vincent Guittot wrote:
> On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 01/10/2019 10:14, Vincent Guittot wrote:
>>> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> On 19/09/2019 09:33, Vincent Guittot wrote:
>>
>> [...]
>>
>>>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>>>>> {
>>>>> struct list_head *tasks = &env->src_rq->cfs_tasks;
>>>>> struct task_struct *p;
>>>>> - unsigned long load;
>>>>> + unsigned long util, load;
>>>>
>>>> Minor: Order by length or reduce scope to while loop ?
>>>
>>> I don't get your point here
>>
>> Nothing dramatic here! Just
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d0c3aa1dc290..a08f342ead89 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
>> static int detach_tasks(struct lb_env *env)
>> {
>> struct list_head *tasks = &env->src_rq->cfs_tasks;
>> - struct task_struct *p;
>> unsigned long load, util;
>> + struct task_struct *p;
>
> hmm... I still don't get this.
> We usually gather pointers instead of interleaving them with other variables
I thought we should always order local variable declarations from
longest to shortest line but can't find this rule in coding-style.rst
either.
[...]
On 02/10/2019 10:23, Vincent Guittot wrote:
> On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 01/10/2019 10:14, Vincent Guittot wrote:
>>> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> On 19/09/2019 09:33, Vincent Guittot wrote:
>>
> [...]
>
>>
>>>>> + if (busiest->group_weight == 1 || sds->prefer_sibling) {
>>>>> + /*
>>>>> + * When prefer sibling, evenly spread running tasks on
>>>>> + * groups.
>>>>> + */
>>>>> + env->balance_type = migrate_task;
>>>>> + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + /*
>>>>> + * If there is no overload, we just want to even the number of
>>>>> + * idle cpus.
>>>>> + */
>>>>> + env->balance_type = migrate_task;
>>>>> + env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>>>>
>>>> Why do we need a max_t(long, 0, ...) here and not for the 'if
>>>> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
>>>
>>> For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>>>
>>> either we have sds->prefer_sibling && busiest->sum_nr_running >
>>> local->sum_nr_running + 1
>>
>> I see, this corresponds to
>>
>> /* Try to move all excess tasks to child's sibling domain */
>> if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
>> goto force_balance;
>>
>> in find_busiest_group, I assume.
>
> yes. But it seems that I missed a case:
>
> prefer_sibling is set
> busiest->sum_h_nr_running <= local->sum_h_nr_running + 1 so we skip
> goto force_balance above
> But env->idle != CPU_NOT_IDLE and local->idle_cpus >
> (busiest->idle_cpus + 1) so we also skip goto out_balance and finally
> call calculate_imbalance()
>
> in calculate_imbalance with prefer_sibling set, imbalance =
> (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>
> so we probably want something similar to max_t(long, 0,
> (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1)
Makes sense.
Caught a couple of
[ 369.310464] 0-3->4-7 2->5 env->imbalance = 2147483646
[ 369.310796] 0-3->4-7 2->4 env->imbalance = 2147483647
in this if condition on h620 running hackbench.
On 02/10/2019 09:30, Vincent Guittot wrote:
>> Isn't that one somewhat risky?
>>
>> Say both groups are classified group_has_spare and we do prefer_sibling.
>> We'd select busiest as the one with the maximum number of busy CPUs, but it
>> could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because
>> pinned tasks or wakeup failed to properly spread stuff).
>>
>> The thing should be unsigned so at least we save ourselves from right
>> shifting a negative value, but we still end up with a gygornous imbalance
>> (which we then store into env.imbalance which *is* signed... Urgh).
>
> so it's not clear what happens with a right shift on a negative signed
> value and this seems to be compiler dependent, so even
> max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1) might be wrong
>
Yeah, right shift on signed negative values is implementation defined. This
is what I was worried about initially, but I think the expression resulting
from the subtraction is unsigned (both terms are unsigned), so this would
just wrap when busiest < local - but that is still a problem.
((local->idle_cpus - busiest->idle_cpus) >> 1) should be fine because we do
have this check in find_busiest_group() before heading off to
calculate_imbalance():
if (busiest->group_type != group_overloaded &&
(env->idle == CPU_NOT_IDLE ||
local->idle_cpus <= (busiest->idle_cpus + 1)))
/* ... */
goto out_balanced;
which ensures the subtraction will be at least 2. We're missing something
equivalent for the sum_h_nr_running case.
> I'm going to update it
>
>
>>
>> [...]
On Wed, Oct 02, 2019 at 11:21:20AM +0200, Dietmar Eggemann wrote:
> I thought we should always order local variable declarations from
> longest to shortest line but can't find this rule in coding-style.rst
> either.
You're right though, that is generally encouraged. From last year's
(2018) KS there was the notion of a subsystem handbook and the tip
tree's submissions thereto can be found there:
https://lore.kernel.org/lkml/[email protected]/
But for some raisin that never actually went anywhere...
On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
> Yeah, right shift on signed negative values are implementation defined.
Seriously? Even under -fno-strict-overflow? There is a perfectly
sensible operation for signed shift right, this stuff should not be
undefined.
Hi Vincent,
On Thu, Sep 19, 2019 at 09:33:31AM +0200 Vincent Guittot wrote:
> [...]
>
We've been testing v3 and for the most part everything looks good. The
group imbalance issues are fixed on all of our test systems except one.
That one is an 8-node Intel system with 160 CPUs. I'll put the system
details at the end.
The table below shows the average number of benchmark threads running on
each node through the run, not including the 2 stress jobs. The end
result is a 4x slowdown in the cgroup case versus the non-cgroup one. The
152 and 156 are the number of LU threads in the run. In all cases there
are 2 stress CPU threads running either in their own cgroups (GROUP) or
with everything in one cgroup (NORMAL). The NORMAL case is pretty well
balanced, with only a few nodes >= 20 and those only a little over. In
the GROUP cases things are not so good: some nodes are > 30, for example,
while others are < 10.
lu.C.x_152_GROUP_1 17.52 16.86 17.90 18.52 20.00 19.00 22.00 20.19
lu.C.x_152_GROUP_2 15.70 15.04 15.65 15.72 23.30 28.98 20.09 17.52
lu.C.x_152_GROUP_3 27.72 32.79 22.89 22.62 11.01 12.90 12.14 9.93
lu.C.x_152_GROUP_4 18.13 18.87 18.40 17.87 18.80 19.93 20.40 19.60
lu.C.x_152_GROUP_5 24.14 26.46 20.92 21.43 14.70 16.05 15.14 13.16
lu.C.x_152_NORMAL_1 21.03 22.43 20.27 19.97 18.37 18.80 16.27 14.87
lu.C.x_152_NORMAL_2 19.24 18.29 18.41 17.41 19.71 19.00 20.29 19.65
lu.C.x_152_NORMAL_3 19.43 20.00 19.05 20.24 18.76 17.38 18.52 18.62
lu.C.x_152_NORMAL_4 17.19 18.25 17.81 18.69 20.44 19.75 20.12 19.75
lu.C.x_152_NORMAL_5 19.25 19.56 19.12 19.56 19.38 19.38 18.12 17.62
lu.C.x_156_GROUP_1 18.62 19.31 18.38 18.77 19.88 21.35 19.35 20.35
lu.C.x_156_GROUP_2 15.58 12.72 14.96 14.83 20.59 19.35 29.75 28.22
lu.C.x_156_GROUP_3 20.05 18.74 19.63 18.32 20.26 20.89 19.53 18.58
lu.C.x_156_GROUP_4 14.77 11.42 13.01 10.09 27.05 33.52 23.16 22.98
lu.C.x_156_GROUP_5 14.94 11.45 12.77 10.52 28.01 33.88 22.37 22.05
lu.C.x_156_NORMAL_1 20.00 20.58 18.47 18.68 19.47 19.74 19.42 19.63
lu.C.x_156_NORMAL_2 18.52 18.48 18.83 18.43 20.57 20.48 20.61 20.09
lu.C.x_156_NORMAL_3 20.27 20.00 20.05 21.18 19.55 19.00 18.59 17.36
lu.C.x_156_NORMAL_4 19.65 19.60 20.25 20.75 19.35 20.10 19.00 17.30
lu.C.x_156_NORMAL_5 19.79 19.67 20.62 22.42 18.42 18.00 17.67 19.42
From what I can see this was better, but not perfect, in v1. It was closer and
so the end results (LU-reported times and op/s) were close enough. But looking
more carefully there are still some issues. (NORMAL is comparable to the above.)
lu.C.x_152_GROUP_1 18.08 18.17 19.58 19.29 19.25 17.50 21.46 18.67
lu.C.x_152_GROUP_2 17.12 17.48 17.88 17.62 19.57 17.31 23.00 22.02
lu.C.x_152_GROUP_3 17.82 17.97 18.12 18.18 24.55 22.18 16.97 16.21
lu.C.x_152_GROUP_4 18.47 19.08 18.50 18.66 21.45 25.00 15.47 15.37
lu.C.x_152_GROUP_5 20.46 20.71 27.38 24.75 17.06 16.65 12.81 12.19
lu.C.x_156_GROUP_1 18.70 18.80 20.25 19.50 20.45 20.30 19.55 18.45
lu.C.x_156_GROUP_2 19.29 19.90 17.71 18.10 20.76 21.57 19.81 18.86
lu.C.x_156_GROUP_3 25.09 29.19 21.83 21.33 18.67 18.57 11.03 10.29
lu.C.x_156_GROUP_4 18.60 19.10 19.20 18.70 20.30 20.00 19.70 20.40
lu.C.x_156_GROUP_5 18.58 18.9 18.63 18.16 17.32 19.37 23.92 21.08
There is high variance so it may not be anything specific between v1 and v3 here.
The initial fixes I made for this issue did not exhibit this behavior. They
would have had other issues dealing with overload cases though. In this case
however there are only 154 or 158 threads on 160 CPUs so not overloaded.
I'll try to get my hands on this system and poke into it. I just wanted to get
your thoughts and let you know where we are.
Thanks,
Phil
System details:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
Stepping: 2
CPU MHz: 1063.934
BogoMIPS: 4787.73
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-9,80-89
NUMA node1 CPU(s): 10-19,90-99
NUMA node2 CPU(s): 20-29,100-109
NUMA node3 CPU(s): 30-39,110-119
NUMA node4 CPU(s): 40-49,120-129
NUMA node5 CPU(s): 50-59,130-139
NUMA node6 CPU(s): 60-69,140-149
NUMA node7 CPU(s): 70-79,150-159
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 80 81 82 83 84 85 86 87 88 89
node 0 size: 64177 MB
node 0 free: 60866 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 90 91 92 93 94 95 96 97 98 99
node 1 size: 64507 MB
node 1 free: 61167 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 100 101 102 103 104 105 106 107 108
109
node 2 size: 64507 MB
node 2 free: 61250 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39 110 111 112 113 114 115 116 117 118
119
node 3 size: 64507 MB
node 3 free: 61327 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49 120 121 122 123 124 125 126 127 128
129
node 4 size: 64507 MB
node 4 free: 60993 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59 130 131 132 133 134 135 136 137 138
139
node 5 size: 64507 MB
node 5 free: 60892 MB
node 6 cpus: 60 61 62 63 64 65 66 67 68 69 140 141 142 143 144 145 146 147 148
149
node 6 size: 64507 MB
node 6 free: 61139 MB
node 7 cpus: 70 71 72 73 74 75 76 77 78 79 150 151 152 153 154 155 156 157 158
159
node 7 size: 64480 MB
node 7 free: 61188 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 17 17 19 19 19 19
1: 12 10 17 17 19 19 19 19
2: 17 17 10 12 19 19 19 19
3: 17 17 12 10 19 19 19 19
4: 19 19 19 19 10 12 17 17
5: 19 19 19 19 12 10 17 17
6: 19 19 19 19 17 17 10 12
7: 19 19 19 19 17 17 12 10
--
On 08/10/2019 15:16, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
>
>> Yeah, right shift on signed negative values are implementation defined.
>
> Seriously? Even under -fno-strict-overflow? There is a perfectly
> sensible operation for signed shift right, this stuff should not be
> undefined.
>
Mmm good point. I didn't see anything relevant in the description of that
flag. All that my copy of the C99 standard (draft) says at 6.5.7.5 is:
"""
The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
the resulting value is implementation-defined.
"""
Arithmetic shift would make sense, but I think this stems from twos'
complement not being imposed: 6.2.6.2.2 says sign can be done with
sign + magnitude, twos complement or ones' complement...
I suppose when you really just want a division you should ask for division
semantics - i.e. use '/'. I'd expect compilers to be smart enough to turn
that into a shift if a power of 2 is involved, and to do something else
if negative values can be involved.
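To make the difference concrete, here is a small standalone demo (not part of
the patch, just an illustration of the semantics discussed above). It assumes
a typical gcc/clang target where right shift of a negative value is an
arithmetic shift:

#include <stdio.h>

int main(void)
{
	long x = -3;

	/* Arithmetic right shift rounds toward negative infinity... */
	printf("-3 >> 1 = %ld\n", x >> 1);	/* -2 on gcc/clang */

	/* ...while C99 signed division truncates toward zero. */
	printf("-3 / 2  = %ld\n", x / 2);	/* -1, mandated by the standard */

	return 0;
}

Which is also why the compiler cannot blindly turn a signed division by two
into a plain shift when the value may be negative.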
On Tuesday 08 Oct 2019 at 15:34:04 (+0100), Valentin Schneider wrote:
> On 08/10/2019 15:16, Peter Zijlstra wrote:
> > On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
> >
> >> Yeah, right shift on signed negative values are implementation defined.
> >
> > Seriously? Even under -fno-strict-overflow? There is a perfectly
> > sensible operation for signed shift right, this stuff should not be
> > undefined.
> >
>
> Mmm good point. I didn't see anything relevant in the description of that
> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:
>
> """
> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
> the resulting value is implementation-defined.
> """
>
> Arithmetic shift would make sense, but I think this stems from twos'
> complement not being imposed: 6.2.6.2.2 says sign can be done with
> sign + magnitude, twos complement or ones' complement...
>
> I suppose when you really just want a division you should ask for division
> semantics - i.e. use '/'. I'd expect compilers to be smart enough to turn
> that into a shift if a power of 2 is involved, and to do something else
> if negative values can be involved.
This is how I plan to get rid of the problem:
+ if (busiest->group_weight == 1 || sds->prefer_sibling) {
+ unsigned int nr_diff = busiest->sum_h_nr_running;
+ /*
+ * When prefer sibling, evenly spread running tasks on
+ * groups.
+ */
+ env->migration_type = migrate_task;
+ lsub_positive(&nr_diff, local->sum_h_nr_running);
+ env->imbalance = nr_diff >> 1;
+ return;
+ }
On 08/10/2019 16:30, Vincent Guittot wrote:
[...]
>
> This is how I plan to get ride of the problem:
> + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> + unsigned int nr_diff = busiest->sum_h_nr_running;
> + /*
> + * When prefer sibling, evenly spread running tasks on
> + * groups.
> + */
> + env->migration_type = migrate_task;
> + lsub_positive(&nr_diff, local->sum_h_nr_running);
> + env->imbalance = nr_diff >> 1;
> + return;
> + }
>
I think this wants a
/* Local could have more tasks than busiest */
atop the lsub, otherwise yeah that ought to work.
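For readers who don't have fair.c open: lsub_positive() is an existing helper
macro in kernel/sched/fair.c that subtracts a value from a local variable and
clamps the result at zero. A rough, non-generic equivalent of what it does
(sketch only; see the real macro for the type-generic definition):

static inline void lsub_positive_sketch(unsigned int *ptr, unsigned int val)
{
	/* Subtract val, but never let the (unsigned) result wrap below 0. */
	*ptr -= (*ptr < val) ? *ptr : val;
}

So with nr_diff initialized to busiest->sum_h_nr_running, the snippet above
can never produce the huge wrapped imbalance values seen earlier.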
Hi Phil,
On Tue, 8 Oct 2019 at 16:33, Phil Auld <[email protected]> wrote:
>
> Hi Vincent,
>
> On Thu, Sep 19, 2019 at 09:33:31AM +0200 Vincent Guittot wrote:
> > Several wrong task placement have been raised with the current load
> > balance algorithm but their fixes are not always straight forward and
> > end up with using biased values to force migrations. A cleanup and rework
> > of the load balance will help to handle such UCs and enable to fine grain
> > the behavior of the scheduler for other cases.
> >
[...]
> >
>
> We've been testing v3 and for the most part everything looks good. The
> group imbalance issues are fixed on all of our test systems except one.
>
> The one is an 8-node intel system with 160 cpus. I'll put the system
> details at the end.
>
> This shows the average number of benchmark threads running on each node
> through the run. That is, not including the 2 stress jobs. The end
> results are a 4x slow down in the cgroup case versus not. The 152 and
> 156 are the number of LU threads in the run. In all cases there are 2
> stress CPU threads running either in their own cgroups (GROUP) or
> everything is in one cgroup (NORMAL). The normal case is pretty well
> balanced with only a few >= 20 and those that are are only a little
> over. In the GROUP cases things are not so good. There are some > 30
> for example, and others < 10.
>
>
> lu.C.x_152_GROUP_1 17.52 16.86 17.90 18.52 20.00 19.00 22.00 20.19
> lu.C.x_152_GROUP_2 15.70 15.04 15.65 15.72 23.30 28.98 20.09 17.52
> lu.C.x_152_GROUP_3 27.72 32.79 22.89 22.62 11.01 12.90 12.14 9.93
> lu.C.x_152_GROUP_4 18.13 18.87 18.40 17.87 18.80 19.93 20.40 19.60
> lu.C.x_152_GROUP_5 24.14 26.46 20.92 21.43 14.70 16.05 15.14 13.16
> lu.C.x_152_NORMAL_1 21.03 22.43 20.27 19.97 18.37 18.80 16.27 14.87
> lu.C.x_152_NORMAL_2 19.24 18.29 18.41 17.41 19.71 19.00 20.29 19.65
> lu.C.x_152_NORMAL_3 19.43 20.00 19.05 20.24 18.76 17.38 18.52 18.62
> lu.C.x_152_NORMAL_4 17.19 18.25 17.81 18.69 20.44 19.75 20.12 19.75
> lu.C.x_152_NORMAL_5 19.25 19.56 19.12 19.56 19.38 19.38 18.12 17.62
>
> lu.C.x_156_GROUP_1 18.62 19.31 18.38 18.77 19.88 21.35 19.35 20.35
> lu.C.x_156_GROUP_2 15.58 12.72 14.96 14.83 20.59 19.35 29.75 28.22
> lu.C.x_156_GROUP_3 20.05 18.74 19.63 18.32 20.26 20.89 19.53 18.58
> lu.C.x_156_GROUP_4 14.77 11.42 13.01 10.09 27.05 33.52 23.16 22.98
> lu.C.x_156_GROUP_5 14.94 11.45 12.77 10.52 28.01 33.88 22.37 22.05
> lu.C.x_156_NORMAL_1 20.00 20.58 18.47 18.68 19.47 19.74 19.42 19.63
> lu.C.x_156_NORMAL_2 18.52 18.48 18.83 18.43 20.57 20.48 20.61 20.09
> lu.C.x_156_NORMAL_3 20.27 20.00 20.05 21.18 19.55 19.00 18.59 17.36
> lu.C.x_156_NORMAL_4 19.65 19.60 20.25 20.75 19.35 20.10 19.00 17.30
> lu.C.x_156_NORMAL_5 19.79 19.67 20.62 22.42 18.42 18.00 17.67 19.42
>
>
> From what I can see this was better but not perfect in v1. It was closer and
> so the end results (LU reported times and op/s) were close enough. But looking
> closer at it there are still some issues. (NORMAL is comparable to above)
>
>
> lu.C.x_152_GROUP_1 18.08 18.17 19.58 19.29 19.25 17.50 21.46 18.67
> lu.C.x_152_GROUP_2 17.12 17.48 17.88 17.62 19.57 17.31 23.00 22.02
> lu.C.x_152_GROUP_3 17.82 17.97 18.12 18.18 24.55 22.18 16.97 16.21
> lu.C.x_152_GROUP_4 18.47 19.08 18.50 18.66 21.45 25.00 15.47 15.37
> lu.C.x_152_GROUP_5 20.46 20.71 27.38 24.75 17.06 16.65 12.81 12.19
>
> lu.C.x_156_GROUP_1 18.70 18.80 20.25 19.50 20.45 20.30 19.55 18.45
> lu.C.x_156_GROUP_2 19.29 19.90 17.71 18.10 20.76 21.57 19.81 18.86
> lu.C.x_156_GROUP_3 25.09 29.19 21.83 21.33 18.67 18.57 11.03 10.29
> lu.C.x_156_GROUP_4 18.60 19.10 19.20 18.70 20.30 20.00 19.70 20.40
> lu.C.x_156_GROUP_5 18.58 18.9 18.63 18.16 17.32 19.37 23.92 21.08
>
> There is high variance so it may not be anything specific between v1 and v3 here.
While preparing v4, I have noticed that I probably oversimplified
the end of find_idlest_group() in patch "sched/fair: optimize
find_idlest_group" when it compares local vs the idlest other group.
In particular, there was a NUMA-specific test that I removed in v3 and
re-added in v4.
Then, I'm also preparing a full rework of find_idlest_group() which
will behave more closely to load_balance(); I mean: collect statistics,
classify the groups, then select the idlest one.
What is the behavior of the lu.C threads? Are they waking up a lot, and
could they trigger the slow wakeup path?
>
> The initial fixes I made for this issue did not exhibit this behavior. They
> would have had other issues dealing with overload cases though. In this case
> however there are only 154 or 158 threads on 160 CPUs so not overloaded.
>
> I'll try to get my hands on this system and poke into it. I just wanted to get
> your thoughts and let you know where we are.
Thanks for testing
Vincent
On Tue, Oct 08, 2019 at 03:34:04PM +0100, Valentin Schneider wrote:
> On 08/10/2019 15:16, Peter Zijlstra wrote:
> > On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
> >
> >> Yeah, right shift on signed negative values are implementation defined.
> >
> > Seriously? Even under -fno-strict-overflow? There is a perfectly
> > sensible operation for signed shift right, this stuff should not be
> > undefined.
> >
>
> Mmm good point. I didn't see anything relevant in the description of that
> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:
>
> """
> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
> the resulting value is implementation-defined.
> """
>
> Arithmetic shift would make sense, but I think this stems from twos'
> complement not being imposed: 6.2.6.2.2 says sign can be done with
> sign + magnitude, twos complement or ones' complement...
But -fno-strict-overflow mandates 2s complement for all such signed
issues.
On 08/10/2019 17:33, Peter Zijlstra wrote:
> On Tue, Oct 08, 2019 at 03:34:04PM +0100, Valentin Schneider wrote:
>> On 08/10/2019 15:16, Peter Zijlstra wrote:
>>> On Wed, Oct 02, 2019 at 11:47:59AM +0100, Valentin Schneider wrote:
>>>
>>>> Yeah, right shift on signed negative values are implementation defined.
>>>
>>> Seriously? Even under -fno-strict-overflow? There is a perfectly
>>> sensible operation for signed shift right, this stuff should not be
>>> undefined.
>>>
>>
>> Mmm good point. I didn't see anything relevant in the description of that
>> flag. All my copy of the C99 standard (draft) says at 6.5.7.5 is:
>>
>> """
>> The result of E1 >> E2 [...] If E1 has a signed type and a negative value,
>> the resulting value is implementation-defined.
>> """
>>
>> Arithmetic shift would make sense, but I think this stems from twos'
>> complement not being imposed: 6.2.6.2.2 says sign can be done with
>> sign + magnitude, twos complement or ones' complement...
>
> But -fno-strict-overflow mandates 2s complement for all such signed
> issues.
>
So then there really shouldn't be any ambiguity. I have no idea if
-fno-strict-overflow then also lifts the undefinedness of the right shifts,
gotta get my spade and dig some more.
On Tue, Oct 08, 2019 at 05:30:02PM +0200, Vincent Guittot wrote:
> This is how I plan to get ride of the problem:
> + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> + unsigned int nr_diff = busiest->sum_h_nr_running;
> + /*
> + * When prefer sibling, evenly spread running tasks on
> + * groups.
> + */
> + env->migration_type = migrate_task;
> + lsub_positive(&nr_diff, local->sum_h_nr_running);
> + env->imbalance = nr_diff >> 1;
> + return;
> + }
I'm thinking the max_t(long, 0, ...); variant reads a lot simpler and
really _should_ work given that -fno-strict-overflow / -fwrapv mandates
2s complement.
On 08/10/2019 17:39, Valentin Schneider wrote:
>>
>> But -fno-strict-overflow mandates 2s complement for all such signed
>> issues.
>>
>
> So then there really shouldn't be any ambiguity. I have no idea if
> -fno-strict-overflow then also lifts the undefinedness of the right shifts,
> gotta get my spade and dig some more.
>
Bleh, no luck really.
Thinking about it some more you can't overflow by right shifting
(logical/arithmetic), so I dunno if signed right shift counts as "such
signed issues".
On Thu, Sep 19, 2019 at 09:33:35AM +0200, Vincent Guittot wrote:
> + if (busiest->group_type == group_asym_packing) {
> + /*
> + * In case of asym capacity, we will try to migrate all load to
> + * the preferred CPU.
> + */
> + env->balance_type = migrate_load;
> env->imbalance = busiest->group_load;
> return;
> }
I was a bit surprised with this; I sorta expected a migrate_task,1 here.
The asym_packing thing has always been a nr_running issue to me. If
there is something to run, we should run on as many siblings/cores as
possible, but preferably the 'highest' ranked siblings/cores.
On Tue, 8 Oct 2019 at 19:39, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Oct 08, 2019 at 05:30:02PM +0200, Vincent Guittot wrote:
>
> > This is how I plan to get ride of the problem:
> > + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > + unsigned int nr_diff = busiest->sum_h_nr_running;
> > + /*
> > + * When prefer sibling, evenly spread running tasks on
> > + * groups.
> > + */
> > + env->migration_type = migrate_task;
> > + lsub_positive(&nr_diff, local->sum_h_nr_running);
> > + env->imbalance = nr_diff >> 1;
> > + return;
> > + }
>
> I'm thinking the max_t(long, 0, ...); variant reads a lot simpler and
> really _should_ work given that -fno-strict-overflow / -fwrapv mandates
> 2s complement.
Another point that I have overlooked is that sum_h_nr_running is
unsigned int whereas imbalance is long.
In fact, (long)(unsigned long A - unsigned long B) >> 1 works correctly,
but
(long)(unsigned int A - unsigned int B) >> 1 doesn't.
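A small userspace sketch of that point (my illustration, not from the patch;
it assumes an LP64 target with gcc's usual two's-complement conversion and
arithmetic right shift). The same mechanism explains the huge imbalance
values reported earlier in this thread:

#include <stdio.h>

int main(void)
{
	unsigned int  busy32 = 2, local32 = 3;	/* e.g. sum_h_nr_running */
	unsigned long busy64 = 2, local64 = 3;

	/*
	 * unsigned int: 2 - 3 wraps to 0xffffffff, which stays a large
	 * *positive* value once converted to 64-bit long, so the shift
	 * yields 0x7fffffff.
	 */
	long imb32 = (long)(busy32 - local32) >> 1;

	/*
	 * unsigned long: 2 - 3 wraps to 0xffffffffffffffff, which the cast
	 * reinterprets as -1, and the arithmetic shift keeps it at -1.
	 */
	long imb64 = (long)(busy64 - local64) >> 1;

	printf("unsigned int:  %ld\n", imb32);	/* 2147483647 */
	printf("unsigned long: %ld\n", imb64);	/* -1 */

	return 0;
}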
On Tue, 8 Oct 2019 at 19:55, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Sep 19, 2019 at 09:33:35AM +0200, Vincent Guittot wrote:
> > + if (busiest->group_type == group_asym_packing) {
> > + /*
> > + * In case of asym capacity, we will try to migrate all load to
> > + * the preferred CPU.
> > + */
> > + env->balance_type = migrate_load;
> > env->imbalance = busiest->group_load;
> > return;
> > }
>
> I was a bit surprised with this; I sorta expected a migrate_task,1 here.
I have just kept the current mechanism
>
> The asym_packing thing has always been a nr_running issue to me. If
> there is something to run, we should run on as many siblings/cores as
> possible, but preferably the 'highest' ranked siblings/cores.
I can probably set the number of running tasks to migrate instead of the load
>
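For reference, a sketch of what that alternative could look like (hypothetical,
not part of this series): keep the same branch as the quoted hunk but size the
imbalance in tasks rather than load, as Peter suggests.

	if (busiest->group_type == group_asym_packing) {
		/*
		 * In case of asym packing, migrate a number of running
		 * tasks rather than an amount of load towards the
		 * preferred CPU.
		 */
		env->balance_type = migrate_task;
		env->imbalance = busiest->sum_h_nr_running;
		return;
	}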
On Tue, Oct 08, 2019 at 05:53:11PM +0200 Vincent Guittot wrote:
> Hi Phil,
>
...
> While preparing v4, I have noticed that I have probably oversimplified
> the end of find_idlest_group() in patch "sched/fair: optimize
> find_idlest_group" when it compares local vs the idlest other group.
> Especially, there were a NUMA specific test that I removed in v3 and
> re added in v4.
>
> Then, I'm also preparing a full rework that find_idlest_group() which
> will behave more closely to load_balance; I mean : collect statistics,
> classify group then selects the idlest
>
Okay, I'll watch for V4 and retest. It really seems to be limited to
the 8-node system. None of the other systems are showing this.
> What is the behavior of lu.C Thread ? are they waking up a lot ?and
> could trigger the slow wake path ?
Yes, probably a fair bit of waking. It's an iterative equation-solving
code. It's fairly CPU intensive but requires communication for dependent
calculations. That's part of why having them mis-balanced causes such a
large slowdown. I think at times everyone else waits for the slow guys.
>
> >
> > The initial fixes I made for this issue did not exhibit this behavior. They
> > would have had other issues dealing with overload cases though. In this case
> > however there are only 154 or 158 threads on 160 CPUs so not overloaded.
> >
> > I'll try to get my hands on this system and poke into it. I just wanted to get
> > your thoughts and let you know where we are.
>
> Thanks for testing
>
Sure!
Thanks,
Phil
--
On Wed, 9 Oct 2019 at 21:33, Phil Auld <[email protected]> wrote:
>
> On Tue, Oct 08, 2019 at 05:53:11PM +0200 Vincent Guittot wrote:
> > Hi Phil,
> >
>
> ...
>
> > While preparing v4, I have noticed that I have probably oversimplified
> > the end of find_idlest_group() in patch "sched/fair: optimize
> > find_idlest_group" when it compares local vs the idlest other group.
> > Especially, there were a NUMA specific test that I removed in v3 and
> > re added in v4.
> >
> > Then, I'm also preparing a full rework that find_idlest_group() which
> > will behave more closely to load_balance; I mean : collect statistics,
> > classify group then selects the idlest
> >
>
> Okay, I'll watch for V4 and restest. It really seems to be limited to
> the 8-node system. None of the other systems are showing this.
Thanks for the information. This system might generate statistics right at
the boundary between two behaviors, making task placement misbehave.
On 9/19/19 1:03 PM, Vincent Guittot wrote:
> [...]
I am quietly impressed with this patch series as it makes it easy to
understand the behavior of the load balancer just by looking at the code.
I have tested v3 on an IBM POWER9 system with the following configuration:
- CPU(s): 176
- Thread(s) per core: 4
- Core(s) per socket: 22
- Socket(s): 2
- Model name: POWER9, altivec supported
- NUMA node0 CPU(s): 0-87
- NUMA node8 CPU(s): 88-175
I see results on par with the baseline (tip/sched/core) in most of my
tests.
hackbench
=========
hackbench -l (256000/#grp) -g #grp (lower is better):
+--------+--------------------+-------------------+------------------+
| groups | w/ patches | Baseline | Performance gain |
+--------+--------------------+-------------------+------------------+
| 1 | 14.948 (+/- 0.10) | 15.13 (+/- 0.47 ) | +1.20 |
| 4 | 5.938 (+/- 0.034) | 6.085 (+/- 0.07) | +2.4 |
| 8 | 6.594 (+/- 0.072) | 6.223 (+/- 0.03) | -5.9 |
| 16 | 5.916 (+/- 0.05) | 5.559 (+/- 0.00) | -6.4 |
| 32 | 5.288 (+/- 0.034) | 5.23 (+/- 0.01) | -1.1 |
| 64 | 5.147 (+/- 0.036) | 5.193 (+/- 0.09) | +0.8 |
| 128 | 5.368 (+/- 0.0245) | 5.446 (+/- 0.04) | +1.4 |
| 256 | 5.637 (+/- 0.088) | 5.596 (+/- 0.07) | -0.7 |
| 512 | 5.78 (+/- 0.0637) | 5.934 (+/- 0.06) | +2.5 |
+--------+--------------------+-------------------+------------------+
dbench
========
dbench <grp> (Throughput: Higher is better):
+---------+---------------------+-----------------------+----------+
| groups | w/ patches | baseline | gain |
+---------+---------------------+-----------------------+----------+
| 1 | 12.6419(+/-0.58) | 12.6511 (+/-0.277) | -0.00 |
| 4 | 23.7712(+/-2.22) | 21.8526 (+/-0.844) | +8.7 |
| 8 | 40.1333(+/-0.85) | 37.0623 (+/-3.283) | +8.2 |
| 16 | 60.5529(+/-2.35) | 60.0972 (+/-9.655) | +0.7 |
| 32 | 98.2194(+/-1.69) | 87.6701 (+/-10.72) | +12.0 |
| 64 | 150.733(+/-9.91) | 109.782 (+/-0.503) | +37.3 |
| 128 | 173.443(+/-22.4) | 130.006 (+/-21.84) | +33.4 |
| 256 | 121.011(+/-15.2) | 120.603 (+/-11.82) | +0.3 |
| 512 | 10.9889(+/-0.39) | 12.5518 (+/-1.030) | -12 |
+---------+---------------------+-----------------------+----------+
I am happy with the results as it turns out to be beneficial in most cases.
Still, I will keep testing different scenarios and workloads.
BTW, do you have any specific test case which might show different behavior
on SMT-4/8 systems with this patch set?
Thanks,
Parth
On 9/19/19 1:03 PM, Vincent Guittot wrote:
[...]
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 380 insertions(+), 205 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 017aad0..d33379c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>
> enum fbq_type { regular, remote, all };
>
> +/*
> + * group_type describes the group of CPUs at the moment of the load balance.
> + * The enum is ordered by pulling priority, with the group with lowest priority
> + * first so the groupe_type can be simply compared when selecting the busiest
> + * group. see update_sd_pick_busiest().
> + */
> enum group_type {
> - group_other = 0,
> + group_has_spare = 0,
> + group_fully_busy,
> group_misfit_task,
> + group_asym_packing,
> group_imbalanced,
> - group_overloaded,
> + group_overloaded
> +};
> +
> +enum migration_type {
> + migrate_load = 0,
> + migrate_util,
> + migrate_task,
> + migrate_misfit
> };
>
> #define LBF_ALL_PINNED 0x01
> @@ -7115,7 +7130,7 @@ struct lb_env {
> unsigned int loop_max;
>
> enum fbq_type fbq_type;
> - enum group_type src_grp_type;
> + enum migration_type balance_type;
> struct list_head tasks;
> };
>
> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> {
> struct list_head *tasks = &env->src_rq->cfs_tasks;
> struct task_struct *p;
> - unsigned long load;
> + unsigned long util, load;
> int detached = 0;
>
> lockdep_assert_held(&env->src_rq->lock);
> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
> if (!can_migrate_task(p, env))
> goto next;
>
> - load = task_h_load(p);
> + switch (env->balance_type) {
> + case migrate_load:
> + load = task_h_load(p);
>
> - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> - goto next;
> + if (sched_feat(LB_MIN) &&
> + load < 16 && !env->sd->nr_balance_failed)
> + goto next;
>
> - if ((load / 2) > env->imbalance)
> - goto next;
> + if ((load / 2) > env->imbalance)
> + goto next;
> +
> + env->imbalance -= load;
> + break;
> +
> + case migrate_util:
> + util = task_util_est(p);
> +
> + if (util > env->imbalance)
Can you please explain what would happen with
`if (util/2 > env->imbalance)` ?
Just like when migrating load, even util shouldn't be migrated if
env->imbalance is just near the utilization of the task being moved, should it?
> + goto next;
> +
> + env->imbalance -= util;
> + break;
> +[ ... ]
Thanks,
Parth
Hi Parth,
On Wed, 16 Oct 2019 at 09:21, Parth Shah <[email protected]> wrote:
>
>
>
> On 9/19/19 1:03 PM, Vincent Guittot wrote:
> > [...]
>
> I am quietly impressed with this patch series as it makes easy to
> understand the behavior of the load balancer just by looking at the code.
Thanks
>
> I have tested v3 on IBM POWER9 system with following configuration:
> - CPU(s): 176
> - Thread(s) per core: 4
> - Core(s) per socket: 22
> - Socket(s): 2
> - Model name: POWER9, altivec supported
> - NUMA node0 CPU(s): 0-87
> - NUMA node8 CPU(s): 88-175
>
> I see results in par with the baseline (tip/sched/core) with most of my
> testings.
>
> hackbench
> =========
> hackbench -l (256000/#grp) -g #grp (lower is better):
> +--------+--------------------+-------------------+------------------+
> | groups | w/ patches | Baseline | Performance gain |
> +--------+--------------------+-------------------+------------------+
> | 1 | 14.948 (+/- 0.10) | 15.13 (+/- 0.47 ) | +1.20 |
> | 4 | 5.938 (+/- 0.034) | 6.085 (+/- 0.07) | +2.4 |
> | 8 | 6.594 (+/- 0.072) | 6.223 (+/- 0.03) | -5.9 |
> | 16 | 5.916 (+/- 0.05) | 5.559 (+/- 0.00) | -6.4 |
> | 32 | 5.288 (+/- 0.034) | 5.23 (+/- 0.01) | -1.1 |
> | 64 | 5.147 (+/- 0.036) | 5.193 (+/- 0.09) | +0.8 |
> | 128 | 5.368 (+/- 0.0245) | 5.446 (+/- 0.04) | +1.4 |
> | 256 | 5.637 (+/- 0.088) | 5.596 (+/- 0.07) | -0.7 |
> | 512 | 5.78 (+/- 0.0637) | 5.934 (+/- 0.06) | +2.5 |
> +--------+--------------------+-------------------+------------------+
>
>
> dbench
> ========
> dbench <grp> (Throughput: Higher is better):
> +---------+---------------------+-----------------------+----------+
> | groups | w/ patches | baseline | gain |
> +---------+---------------------+-----------------------+----------+
> | 1 | 12.6419(+/-0.58) | 12.6511 (+/-0.277) | -0.00 |
> | 4 | 23.7712(+/-2.22) | 21.8526 (+/-0.844) | +8.7 |
> | 8 | 40.1333(+/-0.85) | 37.0623 (+/-3.283) | +8.2 |
> | 16 | 60.5529(+/-2.35) | 60.0972 (+/-9.655) | +0.7 |
> | 32 | 98.2194(+/-1.69) | 87.6701 (+/-10.72) | +12.0 |
> | 64 | 150.733(+/-9.91) | 109.782 (+/-0.503) | +37.3 |
> | 128 | 173.443(+/-22.4) | 130.006 (+/-21.84) | +33.4 |
> | 256 | 121.011(+/-15.2) | 120.603 (+/-11.82) | +0.3 |
> | 512 | 10.9889(+/-0.39) | 12.5518 (+/-1.030) | -12 |
> +---------+---------------------+-----------------------+----------+
>
>
>
> I am happy with the results as it turns to be beneficial in most cases.
> Still I will be doing testing for different scenarios and workloads.
Thanks for the test results
> BTW do you have any specific test case which might show different behavior
> for SMT-4/8 systems with these patch set?
No, I don't have any specific test case in mind, but it should better
spread tasks across cores for mid-loaded use cases with tasks of different
weight, either because of different nice priorities or because of cgroups.
load_balance doesn't use load in this case, unlike the current
implementation.
Thanks
Vincent
>
>
> Thanks,
> Parth
>
On Wed, 16 Oct 2019 at 09:21, Parth Shah <[email protected]> wrote:
>
>
>
> On 9/19/19 1:03 PM, Vincent Guittot wrote:
>
> [...]
>
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
> > 1 file changed, 380 insertions(+), 205 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 017aad0..d33379c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
> >
> > enum fbq_type { regular, remote, all };
> >
> > +/*
> > + * group_type describes the group of CPUs at the moment of the load balance.
> > + * The enum is ordered by pulling priority, with the group with lowest priority
> > + * first so the groupe_type can be simply compared when selecting the busiest
> > + * group. see update_sd_pick_busiest().
> > + */
> > enum group_type {
> > - group_other = 0,
> > + group_has_spare = 0,
> > + group_fully_busy,
> > group_misfit_task,
> > + group_asym_packing,
> > group_imbalanced,
> > - group_overloaded,
> > + group_overloaded
> > +};
> > +
> > +enum migration_type {
> > + migrate_load = 0,
> > + migrate_util,
> > + migrate_task,
> > + migrate_misfit
> > };
> >
> > #define LBF_ALL_PINNED 0x01
> > @@ -7115,7 +7130,7 @@ struct lb_env {
> > unsigned int loop_max;
> >
> > enum fbq_type fbq_type;
> > - enum group_type src_grp_type;
> > + enum migration_type balance_type;
> > struct list_head tasks;
> > };
> >
> > @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> > {
> > struct list_head *tasks = &env->src_rq->cfs_tasks;
> > struct task_struct *p;
> > - unsigned long load;
> > + unsigned long util, load;
> > int detached = 0;
> >
> > lockdep_assert_held(&env->src_rq->lock);
> > @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
> > if (!can_migrate_task(p, env))
> > goto next;
> >
> > - load = task_h_load(p);
> > + switch (env->balance_type) {
> > + case migrate_load:
> > + load = task_h_load(p);
> >
> > - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > - goto next;
> > + if (sched_feat(LB_MIN) &&
> > + load < 16 && !env->sd->nr_balance_failed)
> > + goto next;
> >
> > - if ((load / 2) > env->imbalance)
> > - goto next;
> > + if ((load / 2) > env->imbalance)
> > + goto next;
> > +
> > + env->imbalance -= load;
> > + break;
> > +
> > + case migrate_util:
> > + util = task_util_est(p);
> > +
> > + if (util > env->imbalance)
>
> Can you please explain what would happen with
> `if (util/2 > env->imbalance)` ?
> Just like when migrating load, the task shouldn't be migrated if
> env->imbalance is merely close to its utilization, should it?
I have chosen util and not util/2 to be conservative, because
migrate_util is used to fill spare capacity.
With `if (util/2 > env->imbalance)`, we could more easily overload the
local group or pull too much utilization from the overloaded group.
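To make that concrete with made-up numbers (not related to any benchmark
above): if the local group has 100 units of spare capacity, env->imbalance
is 100; a candidate task with util == 180 is skipped by the plain check
(180 > 100), whereas the halved check would pull it (90 <= 100) and leave
the local group 80 over its spare capacity. A toy sketch of the two
conditions:

/*
 * Toy sketch, not kernel code: compare the two candidate checks with
 * made-up numbers. Here env->imbalance stands for the spare capacity
 * of the local group.
 */
#include <stdio.h>

int main(void)
{
	unsigned long imbalance = 100;	/* spare capacity of the local group */
	unsigned long util = 180;	/* utilization of the candidate task */

	/* plain check (what the patch uses): 180 > 100, the task is skipped */
	printf("util check skips the task: %d\n", util > imbalance);

	/* halved check: 90 <= 100, the task would be pulled and the local
	 * group would end up 80 above its spare capacity */
	printf("util/2 check skips the task: %d\n", util / 2 > imbalance);

	return 0;
}
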
>
> > + goto next;
> > +
> > + env->imbalance -= util;
> > + break;
> > +[ ... ]
>
> Thanks,
> Parth
>
On 10/16/19 5:26 PM, Vincent Guittot wrote:
> On Wed, 16 Oct 2019 at 09:21, Parth Shah <[email protected]> wrote:
>>
>> On 9/19/19 1:03 PM, Vincent Guittot wrote:
>>
>> [...]
>>
>>> + case migrate_util:
>>> + util = task_util_est(p);
>>> +
>>> + if (util > env->imbalance)
>>
>> Can you please explain what would happen with
>> `if (util/2 > env->imbalance)` ?
>> Just like when migrating load, the task shouldn't be migrated if
>> env->imbalance is merely close to its utilization, should it?
>
> I have chosen util and not util/2 to be conservative, because
> migrate_util is used to fill spare capacity.
> With `if (util/2 > env->imbalance)`, we could more easily overload the
> local group or pull too much utilization from the overloaded group.
>
Fair enough. I missed the point that, unlike migrate_load, with
migrate_util env->imbalance is just the spare capacity of the local group.
Thanks,
Parth