This patchset consolidates several changes to the capacity and usage tracking
of CPUs. It provides a frequency-invariant metric of CPU usage and generally
improves the accuracy of load/usage tracking in the scheduler. The
frequency-invariant metric is the foundation required for the consolidation of
cpufreq and the implementation of fully invariant load tracking.
These are currently WIP and require several changes to the load balancer
(including how it uses and interprets load and capacity metrics) and
extensive validation. The frequency invariance is done with
arch_scale_freq_capacity, and this patchset doesn't provide the backends of
the function, which are architecture dependent.
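For illustration only, here is a minimal userspace sketch of what a
frequency-invariant accumulation could look like. This is not kernel code: the
backend shown is a hypothetical example that returns the capacity at the
current frequency as curr_freq * SCHED_CAPACITY_SCALE / max_freq.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Hypothetical arch backend: capacity at the current frequency,
 * i.e. curr_freq * SCHED_CAPACITY_SCALE / max_freq. */
static unsigned long arch_scale_freq_capacity_sketch(unsigned long curr_freq,
						     unsigned long max_freq)
{
	return (curr_freq << SCHED_CAPACITY_SHIFT) / max_freq;
}

/* Frequency-invariant running time: a delta spent running at half the
 * max frequency only counts half as much as the same delta at max
 * frequency. */
static unsigned long scale_running_delta(unsigned long delta,
					 unsigned long curr_freq,
					 unsigned long max_freq)
{
	return (delta * arch_scale_freq_capacity_sketch(curr_freq, max_freq))
		>> SCHED_CAPACITY_SHIFT;
}
```

With this, 1024us of running time at 500MHz on a 1GHz-max CPU accumulates as
512, which is what makes the usage metric comparable across frequencies.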
As discussed at LPC14, Morten and I have consolidated our changes into a single
patchset to make it easier to review and merge.
During load balance, the scheduler evaluates the number of tasks that a group
of CPUs can handle. The current method assumes that tasks have a fixed load of
SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE.
This assumption generates wrong decisions by creating ghost cores or by
removing real ones when the original capacity of CPUs differs from the
default SCHED_CAPACITY_SCALE. With this patchset, we no longer try to
evaluate the number of available cores based on group_capacity; instead
we evaluate the usage of a group and compare it with its capacity.
This patchset mainly replaces the old capacity_factor method with a new one and
keeps the general policy almost unchanged. These new metrics will also be used
in later patches.
The CPU usage is based on a running-time-tracking version of the current
implementation of load average tracking. I also have a version based on the
new implementation proposal [1], but I haven't provided the patches and
results as [1] is still under review. I can provide changes on top of [1] to
adapt how CPU usage is computed to the new mechanism.
Change since V8
- reorder patches
Change since V7
- add freq invariance for usage tracking
- add freq invariance for scale_rt
- update comments and commits' message
- fix init of utilization_avg_contrib
- fix prefer_sibling
Change since V6
- add group usage tracking
- fix some commits' messages
- minor fix like comments and argument order
Change since V5
- remove patches that have been merged since v5: patches 01, 02, 03, 04, 05, 07
- update commit log and add more details on the purpose of the patches
- fix/remove useless code with the rebase on patchset [2]
- remove capacity_orig in sched_group_capacity as it is not used
- move code in the right patch
- add some helper function to factorize code
Change since V4
- rebase to manage conflicts with changes in selection of busiest group
Change since V3:
- add usage_avg_contrib statistic which sums the running time of tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix replacement power by capacity
- update some comments
Change since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from CPU with reduced capacity
- rename group_activity by group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it for SMT level
- reorder patchset to gather patches with same topics
Change since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity
[1] https://lkml.org/lkml/2014/10/10/131
[2] https://lkml.org/lkml/2014/7/25/589
Morten Rasmussen (2):
sched: Track group sched_entity usage contributions
sched: Make sched entity usage tracking scale-invariant
Vincent Guittot (8):
sched: add utilization_avg_contrib
sched: remove frequency scaling from cpu_capacity
sched: make scale_rt invariant with frequency
sched: add per rq cpu_capacity_orig
sched: get CPU's usage statistic
sched: replace capacity_factor by usage
sched: add SD_PREFER_SIBLING for SMT level
sched: move cfs task on a CPU with higher capacity
include/linux/sched.h | 21 ++-
kernel/sched/core.c | 15 +-
kernel/sched/debug.c | 12 +-
kernel/sched/fair.c | 369 ++++++++++++++++++++++++++++++++------------------
kernel/sched/sched.h | 15 +-
5 files changed, 276 insertions(+), 156 deletions(-)
--
1.9.1
Add new statistics which reflect the average time a task is running on the CPU
and the sum of the running times of the tasks on a runqueue. The latter is
named utilization_load_avg.
This patch is based on the usage metric that was proposed in the 1st
versions of the per-entity load tracking patchset by Paul Turner
<[email protected]> but that was removed afterwards. This version differs from
the original one in that it's not linked to task_group.
The rq's utilization_load_avg will be used to check whether a rq is
overloaded, instead of trying to compute how many tasks a group of CPUs can
handle.
Rename runnable_avg_period into avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.
Add some descriptions of the variables to explain their differences.
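As a toy illustration of how the three sums relate (this is a userspace
sketch, not kernel code; the 1002/1024 factor approximates the kernel's
y^32 = 1/2 decay convention), each ~1024us period decays the sums and adds up
to 1024 depending on whether the entity was runnable and/or running. Since
running implies runnable, running_avg_sum <= runnable_avg_sum <= avg_period
always holds:

```c
#include <assert.h>

/* Toy per-1024us-period update of the three sums tracked in
 * struct sched_avg. */
struct toy_sched_avg {
	unsigned int runnable_avg_sum;	/* time spent runnable on a rq */
	unsigned int running_avg_sum;	/* time spent actually running */
	unsigned int avg_period;	/* total elapsed time */
};

static void toy_update(struct toy_sched_avg *sa, int runnable, int running)
{
	/* y ~= 1002/1024 so that y^32 = 1/2, as in decay_load() */
	sa->runnable_avg_sum = sa->runnable_avg_sum * 1002 / 1024
				+ (runnable ? 1024 : 0);
	sa->running_avg_sum = sa->running_avg_sum * 1002 / 1024
				+ (running ? 1024 : 0);
	sa->avg_period = sa->avg_period * 1002 / 1024 + 1024;
}
```

A task waiting on a runqueue grows runnable_avg_sum but not running_avg_sum,
which is exactly the distinction utilization_avg_contrib relies on.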
cc: Paul Turner <[email protected]>
cc: Ben Segall <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 21 ++++++++++++---
kernel/sched/debug.c | 10 ++++---
kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++-----------
kernel/sched/sched.h | 8 +++++-
4 files changed, 89 insertions(+), 24 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18f5262..b576b29 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1071,15 +1071,28 @@ struct load_weight {
};
struct sched_avg {
+ u64 last_runnable_update;
+ s64 decay_count;
+ /*
+ * utilization_avg_contrib describes the amount of time that a
+ * sched_entity is running on a CPU. It is based on running_avg_sum
+ * and is scaled in the range [0..SCHED_LOAD_SCALE].
+ * load_avg_contrib describes the amount of time that a sched_entity
+ * is runnable on a rq. It is based on both runnable_avg_sum and the
+ * weight of the task.
+ */
+ unsigned long load_avg_contrib, utilization_avg_contrib;
/*
* These sums represent an infinite geometric series and so are bound
* above by 1024/(1-y). Thus we only need a u32 to store them for all
* choices of y < 1-2^(-32)*1024.
+ * running_avg_sum reflects the time that the sched_entity is
+ * effectively running on the CPU.
+ * runnable_avg_sum represents the amount of time a sched_entity is on
+ * a runqueue which includes the running time that is monitored by
+ * running_avg_sum.
*/
- u32 runnable_avg_sum, runnable_avg_period;
- u64 last_runnable_update;
- s64 decay_count;
- unsigned long load_avg_contrib;
+ u32 runnable_avg_sum, avg_period, running_avg_sum;
};
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ce33780..f384452 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
if (!se) {
struct sched_avg *avg = &cpu_rq(cpu)->avg;
P(avg->runnable_avg_sum);
- P(avg->runnable_avg_period);
+ P(avg->avg_period);
return;
}
@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
#ifdef CONFIG_SMP
P(se->avg.runnable_avg_sum);
- P(se->avg.runnable_avg_period);
+ P(se->avg.avg_period);
P(se->avg.load_avg_contrib);
P(se->avg.decay_count);
#endif
@@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg",
+ cfs_rq->utilization_load_avg);
#ifdef CONFIG_FAIR_GROUP_SCHED
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
cfs_rq->tg_load_contrib);
@@ -628,8 +630,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.load.weight);
#ifdef CONFIG_SMP
P(se.avg.runnable_avg_sum);
- P(se.avg.runnable_avg_period);
+ P(se.avg.running_avg_sum);
+ P(se.avg.avg_period);
P(se.avg.load_avg_contrib);
+ P(se.avg.utilization_avg_contrib);
P(se.avg.decay_count);
#endif
P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd61cff..3a91ae6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -670,6 +670,7 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
static unsigned long task_h_load(struct task_struct *p);
static inline void __update_task_entity_contrib(struct sched_entity *se);
+static inline void __update_task_entity_utilization(struct sched_entity *se);
/* Give new task start runnable values to heavy its load in infant time */
void init_task_runnable_average(struct task_struct *p)
@@ -678,9 +679,10 @@ void init_task_runnable_average(struct task_struct *p)
p->se.avg.decay_count = 0;
slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
- p->se.avg.runnable_avg_sum = slice;
- p->se.avg.runnable_avg_period = slice;
+ p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
+ p->se.avg.avg_period = slice;
__update_task_entity_contrib(&p->se);
+ __update_task_entity_utilization(&p->se);
}
#else
void init_task_runnable_average(struct task_struct *p)
@@ -1548,7 +1550,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
*period = now - p->last_task_numa_placement;
} else {
delta = p->se.avg.runnable_avg_sum;
- *period = p->se.avg.runnable_avg_period;
+ *period = p->se.avg.avg_period;
}
p->last_sum_exec_runtime = runtime;
@@ -2294,7 +2296,8 @@ static u32 __compute_runnable_contrib(u64 n)
*/
static __always_inline int __update_entity_runnable_avg(u64 now,
struct sched_avg *sa,
- int runnable)
+ int runnable,
+ int running)
{
u64 delta, periods;
u32 runnable_contrib;
@@ -2320,7 +2323,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->last_runnable_update = now;
/* delta_w is the amount already accumulated against our next period */
- delta_w = sa->runnable_avg_period % 1024;
+ delta_w = sa->avg_period % 1024;
if (delta + delta_w >= 1024) {
/* period roll-over */
decayed = 1;
@@ -2333,7 +2336,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
delta_w = 1024 - delta_w;
if (runnable)
sa->runnable_avg_sum += delta_w;
- sa->runnable_avg_period += delta_w;
+ if (running)
+ sa->running_avg_sum += delta_w;
+ sa->avg_period += delta_w;
delta -= delta_w;
@@ -2343,20 +2348,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
periods + 1);
- sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+ sa->running_avg_sum = decay_load(sa->running_avg_sum,
+ periods + 1);
+ sa->avg_period = decay_load(sa->avg_period,
periods + 1);
/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
if (runnable)
sa->runnable_avg_sum += runnable_contrib;
- sa->runnable_avg_period += runnable_contrib;
+ if (running)
+ sa->running_avg_sum += runnable_contrib;
+ sa->avg_period += runnable_contrib;
}
/* Remainder of delta accrued against u_0` */
if (runnable)
sa->runnable_avg_sum += delta;
- sa->runnable_avg_period += delta;
+ if (running)
+ sa->running_avg_sum += delta;
+ sa->avg_period += delta;
return decayed;
}
@@ -2372,6 +2383,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
return 0;
se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+ se->avg.utilization_avg_contrib =
+ decay_load(se->avg.utilization_avg_contrib, decays);
se->avg.decay_count = 0;
return decays;
@@ -2408,7 +2421,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
/* The fraction of a cpu used by this cfs_rq */
contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
- sa->runnable_avg_period + 1);
+ sa->avg_period + 1);
contrib -= cfs_rq->tg_runnable_contrib;
if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
@@ -2461,7 +2474,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
- __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+ __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
+ runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2479,7 +2493,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
- contrib /= (se->avg.runnable_avg_period + 1);
+ contrib /= (se->avg.avg_period + 1);
se->avg.load_avg_contrib = scale_load(contrib);
}
@@ -2498,6 +2512,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
return se->avg.load_avg_contrib - old_contrib;
}
+
+static inline void __update_task_entity_utilization(struct sched_entity *se)
+{
+ u32 contrib;
+
+ /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+ contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
+ contrib /= (se->avg.avg_period + 1);
+ se->avg.utilization_avg_contrib = scale_load(contrib);
+}
+
+static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
+{
+ long old_contrib = se->avg.utilization_avg_contrib;
+
+ if (entity_is_task(se))
+ __update_task_entity_utilization(se);
+
+ return se->avg.utilization_avg_contrib - old_contrib;
+}
+
static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
long load_contrib)
{
@@ -2514,7 +2549,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- long contrib_delta;
+ long contrib_delta, utilization_delta;
u64 now;
/*
@@ -2526,18 +2561,22 @@ static inline void update_entity_load_avg(struct sched_entity *se,
else
now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+ if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+ cfs_rq->curr == se))
return;
contrib_delta = __update_entity_load_avg_contrib(se);
+ utilization_delta = __update_entity_utilization_avg_contrib(se);
if (!update_cfs_rq)
return;
- if (se->on_rq)
+ if (se->on_rq) {
cfs_rq->runnable_load_avg += contrib_delta;
- else
+ cfs_rq->utilization_load_avg += utilization_delta;
+ } else {
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ }
}
/*
@@ -2612,6 +2651,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
/* we force update consideration on load-balancer moves */
update_cfs_rq_blocked_load(cfs_rq, !wakeup);
}
@@ -2630,6 +2670,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
update_cfs_rq_blocked_load(cfs_rq, !sleep);
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+ cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
@@ -2967,6 +3008,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
+ update_entity_load_avg(se, 1);
}
update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6130251..c34bd11 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -343,8 +343,14 @@ struct cfs_rq {
* Under CFS, load is tracked on a per-entity basis and aggregated up.
* This allows for the description of both thread and group usage (in
* the FAIR_GROUP_SCHED case).
+ * runnable_load_avg is the sum of the load_avg_contrib of the
+ * sched_entities on the rq.
+ * blocked_load_avg is similar to runnable_load_avg except that it
+ * sums the load of the blocked sched_entities on the rq.
+ * utilization_load_avg is the sum of the average running time of the
+ * sched_entities on the rq.
*/
- unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
atomic64_t decay_counter;
u64 last_decay;
atomic_long_t removed_load;
--
1.9.1
The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.
A frequency scaling factor is applied to the running time of the RT tasks when
computing scale_rt_capacity.
In sched_rt_avg_update, we scale the RT execution time like below:
rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
Then, scale_rt_capacity can be summarized by:
scale_rt_capacity = SCHED_CAPACITY_SCALE -
((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
We can optimize by removing the right and left shifts in the computation of
rq->rt_avg and scale_rt_capacity.
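The shift removal can be sketched as follows (a userspace illustration, not
the kernel code; the frequency capacity value is a stand-in for whatever the
arch backend would return). Because rq->rt_avg now accumulates rt_delta
pre-scaled by the frequency capacity, it already carries the
SCHED_CAPACITY_SHIFT factor and a plain division by the period yields "used"
directly in capacity units:

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* rt_avg accumulates rt_delta pre-scaled by the frequency capacity,
 * without the >> SCHED_CAPACITY_SHIFT of the old scheme. */
static unsigned long long rt_avg_add(unsigned long long rt_avg,
				     unsigned long long rt_delta,
				     unsigned long freq_capacity)
{
	return rt_avg + rt_delta * freq_capacity;
}

/* scale_rt_capacity with the shifts folded away. */
static unsigned long scale_rt_capacity_sketch(unsigned long long rt_avg,
					      unsigned long long period)
{
	unsigned long long used = rt_avg / period;

	if (used < SCHED_CAPACITY_SCALE)
		return SCHED_CAPACITY_SCALE - (unsigned long)used;
	return 1;
}
```

For example, 512us of RT time at full frequency capacity over a 1024us period
leaves half the capacity for CFS tasks.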
The call to arch_scale_freq_capacity in the rt scheduling path might be
a concern for RT folks because I'm not sure whether we can rely on
arch_scale_freq_capacity to be short and efficient.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 17 +++++------------
kernel/sched/sched.h | 4 +++-
2 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5039da..b37c27b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5785,7 +5785,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- u64 total, available, age_stamp, avg;
+ u64 total, used, age_stamp, avg;
s64 delta;
/*
@@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
total = sched_avg_period() + delta;
- if (unlikely(total < avg)) {
- /* Ensures that capacity won't end up being negative */
- available = 0;
- } else {
- available = total - avg;
- }
+ used = div_u64(avg, total);
- if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
- total = SCHED_CAPACITY_SCALE;
+ if (likely(used < SCHED_CAPACITY_SCALE))
+ return SCHED_CAPACITY_SCALE - used;
- total >>= SCHED_CAPACITY_SHIFT;
-
- return div_u64(available, total);
+ return 1;
}
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c34bd11..fc5b152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1312,9 +1312,11 @@ static inline int hrtick_enabled(struct rq *rq)
#ifdef CONFIG_SMP
extern void sched_avg_update(struct rq *rq);
+extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
- rq->rt_avg += rt_delta;
+ rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
sched_avg_update(rq);
}
#else
--
1.9.1
This new field cpu_capacity_orig reflects the original capacity of a CPU
before being altered by RT tasks and/or IRQs.
The cpu_capacity_orig will be used:
- to detect when the capacity of a CPU has been noticeably reduced so we can
  trigger load balance to look for a CPU with better capacity. As an example,
  we can detect when a CPU handles a significant amount of irq
  (with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as an idle CPU by the
  scheduler whereas CPUs which are really idle are available.
- to evaluate the available capacity for CFS tasks.
Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Kamalesh Babulal <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 8 +++++++-
kernel/sched/sched.h | 1 +
3 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c84bdc0..45ae52d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7087,7 +7087,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
- rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+ rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b37c27b..4782733 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4141,6 +4141,11 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}
+static unsigned long capacity_orig_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -5821,6 +5826,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = capacity;
sdg->sgc->capacity_orig = capacity;
capacity *= scale_rt_capacity(cpu);
@@ -5875,7 +5881,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* Runtime updates will correct capacity_orig.
*/
if (unlikely(!rq->sd)) {
- capacity_orig += capacity_of(cpu);
+ capacity_orig += capacity_orig_of(cpu);
capacity += capacity_of(cpu);
continue;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc5b152..aaaa3e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -585,6 +585,7 @@ struct rq {
struct sched_domain *sd;
unsigned long cpu_capacity;
+ unsigned long cpu_capacity_orig;
unsigned char idle_balance;
/* For active balancing */
--
1.9.1
Now that arch_scale_cpu_capacity has been introduced to scale the original
capacity, arch_scale_freq_capacity is no longer used (it was
previously used by the ARM architecture). Remove arch_scale_freq_capacity from the
computation of cpu_capacity. The frequency invariance will be handled in the
load tracking and not in the CPU capacity. arch_scale_freq_capacity will be
revisited for scaling load with the current frequency of the CPUs in a later
patch.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a171e1b..a96affd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5823,13 +5823,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
sdg->sgc->capacity_orig = capacity;
- if (sched_feat(ARCH_CAPACITY))
- capacity *= arch_scale_freq_capacity(sd, cpu);
- else
- capacity *= default_scale_capacity(sd, cpu);
-
- capacity >>= SCHED_CAPACITY_SHIFT;
-
capacity *= scale_rt_capacity(cpu);
capacity >>= SCHED_CAPACITY_SHIFT;
--
1.9.1
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such a
situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
load balance to check if it's worth moving its tasks to an idle CPU.
Once the idle load balance has selected the busiest CPU, it will look for an
active load balance in only two cases:
- there is only 1 task on the busiest CPU.
- we haven't been able to move a task from the busiest rq.
A CPU with reduced capacity is covered by the 1st case, and it's worth
actively migrating its task if the idle CPU has full capacity. This test has
been added in need_active_balance.
As a side note, this will not generate more spurious ilbs because we already
trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one
that has a task, we will trigger the ilb once to migrate the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new
test.
env.src_cpu and env.src_rq must be set unconditionally because they are used
in need_active_balance, which is called even if busiest->nr_running equals 1.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 53 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db392a6..02e8f7f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6634,6 +6634,28 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ /*
+ * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
+ * It's worth migrating the task if the src_cpu's capacity is reduced
+ * because of other sched_class or IRQs whereas capacity stays
+ * available on dst_cpu.
+ */
+ if ((env->idle != CPU_NOT_IDLE) &&
+ (env->src_rq->cfs.h_nr_running == 1)) {
+ unsigned long src_eff_capacity, dst_eff_capacity;
+
+ dst_eff_capacity = 100;
+ dst_eff_capacity *= capacity_of(env->dst_cpu);
+ dst_eff_capacity *= capacity_orig_of(env->src_cpu);
+
+ src_eff_capacity = sd->imbalance_pct;
+ src_eff_capacity *= capacity_of(env->src_cpu);
+ src_eff_capacity *= capacity_orig_of(env->dst_cpu);
+
+ if (src_eff_capacity < dst_eff_capacity)
+ return 1;
+ }
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
@@ -6733,6 +6755,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
+ env.src_cpu = busiest->cpu;
+ env.src_rq = busiest;
+
ld_moved = 0;
if (busiest->nr_running > 1) {
/*
@@ -6742,8 +6767,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* correctly treated as an imbalance.
*/
env.flags |= LBF_ALL_PINNED;
- env.src_cpu = busiest->cpu;
- env.src_rq = busiest;
env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
more_balance:
@@ -7443,22 +7466,25 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/*
* Current heuristic for kicking the idle load balancer in the presence
- * of an idle cpu is the system.
+ * of an idle cpu in the system.
* - This rq has more than one task.
- * - At any scheduler domain level, this cpu's scheduler group has multiple
- * busy cpu's exceeding the group's capacity.
+ * - This rq has at least one CFS task and the capacity of the CPU is
+ * significantly reduced because of RT tasks or IRQs.
+ * - At parent of LLC scheduler domain level, this cpu's scheduler group has
+ * multiple busy cpu.
* - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
* domain span are idle.
*/
-static inline int nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq)
{
unsigned long now = jiffies;
struct sched_domain *sd;
struct sched_group_capacity *sgc;
int nr_busy, cpu = rq->cpu;
+ bool kick = false;
if (unlikely(rq->idle_balance))
- return 0;
+ return false;
/*
* We may be recently in ticked or tickless idle mode. At the first
@@ -7472,38 +7498,44 @@ static inline int nohz_kick_needed(struct rq *rq)
* balancing.
*/
if (likely(!atomic_read(&nohz.nr_cpus)))
- return 0;
+ return false;
if (time_before(now, nohz.next_balance))
- return 0;
+ return false;
if (rq->nr_running >= 2)
- goto need_kick;
+ return true;
rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
if (sd) {
sgc = sd->groups->sgc;
nr_busy = atomic_read(&sgc->nr_busy_cpus);
- if (nr_busy > 1)
- goto need_kick_unlock;
+ if (nr_busy > 1) {
+ kick = true;
+ goto unlock;
+ }
+
}
- sd = rcu_dereference(per_cpu(sd_asym, cpu));
+ sd = rcu_dereference(rq->sd);
+ if (sd) {
+ if ((rq->cfs.h_nr_running >= 1) &&
+ check_cpu_capacity(rq, sd)) {
+ kick = true;
+ goto unlock;
+ }
+ }
+ sd = rcu_dereference(per_cpu(sd_asym, cpu));
if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
sched_domain_span(sd)) < cpu))
- goto need_kick_unlock;
+ kick = true;
+unlock:
rcu_read_unlock();
- return 0;
-
-need_kick_unlock:
- rcu_read_unlock();
-need_kick:
- return 1;
+ return kick;
}
#else
static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
--
1.9.1
Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.
The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg, which can
temporarily be greater than SCHED_LOAD_SCALE after the migration of a task
onto this CPU and until the metrics have stabilized.
The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU whereas the available capacity for the CFS task is in
the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
of the CPU to get the usage of the latter. The usage can then be compared with
the available capacity (ie cpu_capacity) to deduct the usage level of a CPU.
The frequency scaling invariance of the usage is not taken into account in this
patch, it will be solved in another patch which will deal with frequency
scaling invariance on the running_load_avg.
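The scaling and capping described above can be sketched as follows (a
userspace illustration with made-up values, not the kernel function itself):

```c
#include <assert.h>

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1UL << SCHED_LOAD_SHIFT)

/* Scale a [0..SCHED_LOAD_SCALE] utilization into the
 * [0..capacity_orig] range of the CPU, capping the transient values
 * that can exceed SCHED_LOAD_SCALE after a task migration. */
static unsigned long get_cpu_usage_sketch(unsigned long utilization,
					  unsigned long capacity_orig)
{
	if (utilization >= SCHED_LOAD_SCALE)
		return capacity_orig;

	return (utilization * capacity_orig) >> SCHED_LOAD_SHIFT;
}
```

So a CPU at 50% utilization with an original capacity of 800 reports a usage
of 400, directly comparable with its cpu_capacity.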
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4782733..884578e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4559,6 +4559,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
done:
return target;
}
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be capacity so we can compare the
+ * usage with the capacity of the CPU that is available for CFS tasks (ie
+ * cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_load_avg or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays in the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity.
+ */
+static int get_cpu_usage(int cpu)
+{
+ unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long capacity = capacity_orig_of(cpu);
+
+ if (usage >= SCHED_LOAD_SCALE)
+ return capacity;
+
+ return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
@@ -5688,6 +5715,7 @@ struct sg_lb_stats {
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long load_per_task;
unsigned long group_capacity;
+ unsigned long group_usage; /* Total usage of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
unsigned int group_capacity_factor;
unsigned int idle_cpus;
@@ -6036,6 +6064,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);
sgs->group_load += load;
+ sgs->group_usage += get_cpu_usage(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;
if (rq->nr_running > 1)
--
1.9.1
Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Preeti U. Murthy <[email protected]>
---
kernel/sched/core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 37fb92c..731f2ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6165,6 +6165,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
*/
if (sd->flags & SD_SHARE_CPUCAPACITY) {
+ sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
--
1.9.1
From: Morten Rasmussen <[email protected]>
Adds usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue. It is _not_ influenced in any way by the task group
h_load. Hence it represents the actual CPU usage of the group, not
its intended load contribution, which may differ significantly from the
utilization on lightly utilized systems.
cc: Paul Turner <[email protected]>
cc: Ben Segall <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/debug.c | 2 ++
kernel/sched/fair.c | 3 +++
2 files changed, 5 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f384452..efb47ed 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
#ifdef CONFIG_SMP
P(se->avg.runnable_avg_sum);
+ P(se->avg.running_avg_sum);
P(se->avg.avg_period);
P(se->avg.load_avg_contrib);
+ P(se->avg.utilization_avg_contrib);
P(se->avg.decay_count);
#endif
#undef PN
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a91ae6..a171e1b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2529,6 +2529,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
if (entity_is_task(se))
__update_task_entity_utilization(se);
+ else
+ se->avg.utilization_avg_contrib =
+ group_cfs_rq(se)->utilization_load_avg;
return se->avg.utilization_avg_contrib - old_contrib;
}
--
1.9.1
The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
compares this value with the sum of nr_running to decide if the group is
overloaded or not. But group_capacity_factor hardly works for SMT
systems: it sometimes works for big cores but fails to do the right thing for
little cores.
Below are two examples to illustrate the problem that this patch solves:
1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
(div_round_closest(3x640/1024) = 2), which means that it will be seen as
overloaded even if we have only one task per CPU.
2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
(1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
(at max, thanks to the fix [0] for SMT systems that prevents the appearance
of ghost CPUs), but if one CPU is fully used by rt tasks (and its capacity is
reduced to nearly nothing), the capacity factor of the group will still be 4
(div_round_closest(3*1512/1024) = 5, which is capped to 4 with [0]).
So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the two following metrics:
- The available CPU capacity for CFS tasks, which is already used by
load_balance.
- The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
has been re-introduced to compute the usage of a CPU by CFS tasks.
group_capacity_factor and group_has_free_capacity have been removed and replaced
by group_no_capacity. We compare the number of tasks with the number of CPUs and
we evaluate the level of utilization of the CPUs to determine whether a group is
overloaded or has capacity to handle more tasks.
For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so it will be selected in priority (among the overloaded groups). Since [1],
SD_PREFER_SIBLING is no longer involved in the computation of load_above_capacity
because local is not overloaded.
Finally, sched_group->sched_group_capacity->capacity_orig has been removed
because it is no longer used during load balance.
[1] https://lkml.org/lkml/2014/8/12/295
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 12 -----
kernel/sched/fair.c | 150 +++++++++++++++++++++++++--------------------------
kernel/sched/sched.h | 2 +-
3 files changed, 75 insertions(+), 89 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 45ae52d..37fb92c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5373,17 +5373,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
break;
}
- /*
- * Even though we initialize ->capacity to something semi-sane,
- * we leave capacity_orig unset. This allows us to detect if
- * domain iteration is still funny without causing /0 traps.
- */
- if (!group->sgc->capacity_orig) {
- printk(KERN_CONT "\n");
- printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
- break;
- }
-
if (!cpumask_weight(sched_group_cpus(group))) {
printk(KERN_CONT "\n");
printk(KERN_ERR "ERROR: empty group\n");
@@ -5868,7 +5857,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
* die on a /0 trap.
*/
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
- sg->sgc->capacity_orig = sg->sgc->capacity;
/*
* Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 884578e..db392a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5717,11 +5717,10 @@ struct sg_lb_stats {
unsigned long group_capacity;
unsigned long group_usage; /* Total usage of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
- unsigned int group_capacity_factor;
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- int group_has_free_capacity;
+ int group_no_capacity;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -5855,7 +5854,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
cpu_rq(cpu)->cpu_capacity_orig = capacity;
- sdg->sgc->capacity_orig = capacity;
capacity *= scale_rt_capacity(cpu);
capacity >>= SCHED_CAPACITY_SHIFT;
@@ -5871,7 +5869,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity, capacity_orig;
+ unsigned long capacity;
unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval);
@@ -5883,7 +5881,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
return;
}
- capacity_orig = capacity = 0;
+ capacity = 0;
if (child->flags & SD_OVERLAP) {
/*
@@ -5903,19 +5901,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* Use capacity_of(), which is set irrespective of domains
* in update_cpu_capacity().
*
- * This avoids capacity/capacity_orig from being 0 and
+ * This avoids capacity from being 0 and
* causing divide-by-zero issues on boot.
- *
- * Runtime updates will correct capacity_orig.
*/
if (unlikely(!rq->sd)) {
- capacity_orig += capacity_orig_of(cpu);
capacity += capacity_of(cpu);
continue;
}
sgc = rq->sd->groups->sgc;
- capacity_orig += sgc->capacity_orig;
capacity += sgc->capacity;
}
} else {
@@ -5926,39 +5920,24 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
group = child->groups;
do {
- capacity_orig += group->sgc->capacity_orig;
capacity += group->sgc->capacity;
group = group->next;
} while (group != child->groups);
}
- sdg->sgc->capacity_orig = capacity_orig;
sdg->sgc->capacity = capacity;
}
/*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used as the threshold.
+ * Return true if the capacity is reduced.
*/
static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
- /*
- * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
- */
- if (!(sd->flags & SD_SHARE_CPUCAPACITY))
- return 0;
-
- /*
- * If ~90% of the cpu_capacity is still there, we're good.
- */
- if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
- return 1;
-
- return 0;
+ return ((rq->cpu_capacity * sd->imbalance_pct) <
+ (rq->cpu_capacity_orig * 100));
}
/*
@@ -5996,37 +5975,54 @@ static inline int sg_imbalanced(struct sched_group *group)
}
/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
+ * group_has_capacity returns true if the group has spare capacity that could
+ * be used by some tasks. We consider that a group has spare capacity if the
+ * number of tasks is smaller than the number of CPUs or if the usage is lower
+ * than the available capacity for CFS tasks. For the latter, we use a
+ * threshold to stabilize the state, to take into account the variance of the
+ * tasks' load and to return true only if the available capacity is meaningful
+ * for the load balancer. As an example, an available capacity of 1% can appear
+ * but it doesn't bring any benefit to the load balancer.
*/
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
{
- unsigned int capacity_factor, smt, cpus;
- unsigned int capacity, capacity_orig;
+ if ((sgs->group_capacity * 100) >
+ (sgs->group_usage * env->sd->imbalance_pct))
+ return true;
- capacity = group->sgc->capacity;
- capacity_orig = group->sgc->capacity_orig;
- cpus = group->group_weight;
+ if (sgs->sum_nr_running < sgs->group_weight)
+ return true;
+
+ return false;
+}
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
- smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
- capacity_factor = cpus / smt; /* cores */
+/*
+ * group_is_overloaded returns true if the group has more tasks than it can
+ * handle. We consider that a group is overloaded if the number of tasks is
+ * greater than the number of CPUs and the tasks already use all available
+ * capacity for CFS tasks. For the latter, we use a threshold to stabilize
+ * the state, to take into account the variance of tasks' load and to return
+ * true if the available capacity is no longer meaningful for the load balancer.
+ */
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running <= sgs->group_weight)
+ return false;
- capacity_factor = min_t(unsigned,
- capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);
+ if ((sgs->group_capacity * 100) <
+ (sgs->group_usage * env->sd->imbalance_pct))
+ return true;
- return capacity_factor;
+ return false;
}
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+ struct sched_group *group,
+ struct sg_lb_stats *sgs)
{
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ if (sgs->group_no_capacity)
return group_overloaded;
if (sg_imbalanced(group))
@@ -6087,11 +6083,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
sgs->group_weight = group->group_weight;
- sgs->group_capacity_factor = sg_capacity_factor(env, group);
- sgs->group_type = group_classify(group, sgs);
- if (sgs->group_capacity_factor > sgs->sum_nr_running)
- sgs->group_has_free_capacity = 1;
+ sgs->group_no_capacity = group_is_overloaded(env, sgs);
+ sgs->group_type = group_classify(env, group, sgs);
}
/**
@@ -6213,17 +6207,20 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/*
* In case the child domain prefers tasks go to siblings
- * first, lower the sg capacity factor to one so that we'll try
+ * first, lower the sg capacity so that we'll try
* and move all the excess tasks away. We lower the capacity
* of a group only if the local group has the capacity to fit
- * these excess tasks, i.e. nr_running < group_capacity_factor. The
- * extra check prevents the case where you always pull from the
- * heaviest group when it is already under-utilized (possible
- * with a large weight task outweighs the tasks on the system).
+ * these excess tasks. The extra check prevents the case where
+ * you always pull from the heaviest group when it is already
+ * under-utilized (possible with a large weight task outweighs
+ * the tasks on the system).
*/
if (prefer_sibling && sds->local &&
- sds->local_stat.group_has_free_capacity)
- sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+ group_has_capacity(env, &sds->local_stat) &&
+ (sgs->sum_nr_running > 1)) {
+ sgs->group_no_capacity = 1;
+ sgs->group_type = group_overloaded;
+ }
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
@@ -6402,11 +6399,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
- load_above_capacity =
- (busiest->sum_nr_running - busiest->group_capacity_factor);
-
- load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
- load_above_capacity /= busiest->group_capacity;
+ load_above_capacity = busiest->sum_nr_running *
+ SCHED_LOAD_SCALE;
+ if (load_above_capacity > busiest->group_capacity)
+ load_above_capacity -= busiest->group_capacity;
+ else
+ load_above_capacity = ~0UL;
}
/*
@@ -6469,6 +6467,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
local = &sds.local_stat;
busiest = &sds.busiest_stat;
+ /* ASYM feature bypasses nice load balance check */
if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
check_asym_packing(env, &sds))
return sds.busiest;
@@ -6489,8 +6488,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
- if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
- !busiest->group_has_free_capacity)
+ if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+ busiest->group_no_capacity)
goto force_balance;
/*
@@ -6549,7 +6548,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
- unsigned long capacity, capacity_factor, wl;
+ unsigned long capacity, wl;
enum fbq_type rt;
rq = cpu_rq(i);
@@ -6578,9 +6577,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
continue;
capacity = capacity_of(i);
- capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);
wl = weighted_cpuload(i);
@@ -6588,7 +6584,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu capacity.
*/
- if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+ if (rq->nr_running == 1 && wl > env->imbalance &&
+ !check_cpu_capacity(rq, env->sd))
continue;
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aaaa3e4..8fd30c1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -759,7 +759,7 @@ struct sched_group_capacity {
* CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
* for a single CPU.
*/
- unsigned int capacity, capacity_orig;
+ unsigned int capacity;
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
/*
--
1.9.1
From: Morten Rasmussen <[email protected]>
Apply frequency scale-invariance correction factor to usage tracking.
Each segment of the running_load_avg geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling. As a result, utilization_load_avg, which is
the sum of utilization_avg_contrib, becomes invariant too. So the usage level
returned by get_cpu_usage stays relative to the max frequency, like the
cpu_capacity it is compared against.
Then, we want to keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761 = 48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
cc: Paul Turner <[email protected]>
cc: Ben Segall <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a96affd..a5039da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2266,6 +2266,8 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
}
+unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
/*
* We can represent the historical contribution to runnable average as the
* coefficients of a geometric series. To do this we sub-divide our runnable
@@ -2294,7 +2296,7 @@ static u32 __compute_runnable_contrib(u64 n)
* load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
* = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
-static __always_inline int __update_entity_runnable_avg(u64 now,
+static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
struct sched_avg *sa,
int runnable,
int running)
@@ -2302,6 +2304,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
u64 delta, periods;
u32 runnable_contrib;
int delta_w, decayed = 0;
+ unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
delta = now - sa->last_runnable_update;
/*
@@ -2337,7 +2340,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
if (runnable)
sa->runnable_avg_sum += delta_w;
if (running)
- sa->running_avg_sum += delta_w;
+ sa->running_avg_sum += delta_w * scale_freq
+ >> SCHED_CAPACITY_SHIFT;
sa->avg_period += delta_w;
delta -= delta_w;
@@ -2358,7 +2362,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
if (runnable)
sa->runnable_avg_sum += runnable_contrib;
if (running)
- sa->running_avg_sum += runnable_contrib;
+ sa->running_avg_sum += runnable_contrib * scale_freq
+ >> SCHED_CAPACITY_SHIFT;
sa->avg_period += runnable_contrib;
}
@@ -2366,7 +2371,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
if (runnable)
sa->runnable_avg_sum += delta;
if (running)
- sa->running_avg_sum += delta;
+ sa->running_avg_sum += delta * scale_freq
+ >> SCHED_CAPACITY_SHIFT;
sa->avg_period += delta;
return decayed;
@@ -2474,8 +2480,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
- __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
- runnable);
+ __update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
+ runnable, runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2553,6 +2559,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
long contrib_delta, utilization_delta;
+ int cpu = cpu_of(rq_of(cfs_rq));
u64 now;
/*
@@ -2564,7 +2571,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
else
now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+ if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
cfs_rq->curr == se))
return;
--
1.9.1
On 4 November 2014 00:54, Vincent Guittot <[email protected]> wrote:
> The scheduler tries to compute how many tasks a group of CPUs can handle by
> assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
> SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the group
> by SCHED_LOAD_SCALE to estimate how many task can run in the group. Then, it
> compares this value with the sum of nr_running to decide if the group is
> overloaded or not. But the group_capacity_factor is hardly working for SMT
> system, it sometimes works for big cores but fails to do the right thing for
> little cores.
>
> Below are two examples to illustrate the problem that this patch solves:
>
> 1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
> (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2
> (div_round_closest(3x640/1024) = 2) which means that it will be seen as
> overloaded even if we have only one task per CPU.
>
> 2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
> (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
> (at max and thanks to the fix [0] for SMT system that prevent the apparition
> of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
> reduced to nearly nothing), the capacity factor of the group will still be 4
> (div_round_closest(3*1512/1024) = 5 which is cap to 4 with [0]).
>
> So, this patch tries to solve this issue by removing capacity_factor and
> replacing it with the 2 following metrics :
> -The available CPU's capacity for CFS tasks which is already used by
> load_balance.
> -The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
> has been re-introduced to compute the usage of a CPU by CFS tasks.
>
> group_capacity_factor and group_has_free_capacity has been removed and replaced
> by group_no_capacity. We compare the number of task with the number of CPUs and
> we evaluate the level of utilization of the CPUs to define if a group is
> overloaded or if a group has capacity to handle more tasks.
>
> For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
> so it will be selected in priority (among the overloaded groups). Since [1],
> SD_PREFER_SIBLING is no more concerned by the computation of load_above_capacity
> because local is not overloaded.
>
> Finally, the sched_group->sched_group_capacity->capacity_orig has been removed
> because it's no more used during load balance.
>
> [1] https://lkml.org/lkml/2014/8/12/295
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/core.c | 12 -----
> kernel/sched/fair.c | 150 +++++++++++++++++++++++++--------------------------
> kernel/sched/sched.h | 2 +-
> 3 files changed, 75 insertions(+), 89 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 45ae52d..37fb92c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5373,17 +5373,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
> break;
> }
>
> - /*
> - * Even though we initialize ->capacity to something semi-sane,
> - * we leave capacity_orig unset. This allows us to detect if
> - * domain iteration is still funny without causing /0 traps.
> - */
> - if (!group->sgc->capacity_orig) {
> - printk(KERN_CONT "\n");
> - printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
> - break;
> - }
> -
> if (!cpumask_weight(sched_group_cpus(group))) {
> printk(KERN_CONT "\n");
> printk(KERN_ERR "ERROR: empty group\n");
> @@ -5868,7 +5857,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> * die on a /0 trap.
> */
> sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
> - sg->sgc->capacity_orig = sg->sgc->capacity;
>
> /*
> * Make sure the first group of this domain contains the
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 884578e..db392a6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5717,11 +5717,10 @@ struct sg_lb_stats {
> unsigned long group_capacity;
> unsigned long group_usage; /* Total usage of the group */
> unsigned int sum_nr_running; /* Nr tasks running in the group */
> - unsigned int group_capacity_factor;
> unsigned int idle_cpus;
> unsigned int group_weight;
> enum group_type group_type;
> - int group_has_free_capacity;
> + int group_no_capacity;
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> @@ -5855,7 +5854,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> capacity >>= SCHED_CAPACITY_SHIFT;
>
> cpu_rq(cpu)->cpu_capacity_orig = capacity;
> - sdg->sgc->capacity_orig = capacity;
>
> capacity *= scale_rt_capacity(cpu);
> capacity >>= SCHED_CAPACITY_SHIFT;
> @@ -5871,7 +5869,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> {
> struct sched_domain *child = sd->child;
> struct sched_group *group, *sdg = sd->groups;
> - unsigned long capacity, capacity_orig;
> + unsigned long capacity;
> unsigned long interval;
>
> interval = msecs_to_jiffies(sd->balance_interval);
> @@ -5883,7 +5881,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> return;
> }
>
> - capacity_orig = capacity = 0;
> + capacity = 0;
>
> if (child->flags & SD_OVERLAP) {
> /*
> @@ -5903,19 +5901,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> * Use capacity_of(), which is set irrespective of domains
> * in update_cpu_capacity().
> *
> - * This avoids capacity/capacity_orig from being 0 and
> + * This avoids capacity from being 0 and
> * causing divide-by-zero issues on boot.
> - *
> - * Runtime updates will correct capacity_orig.
> */
> if (unlikely(!rq->sd)) {
> - capacity_orig += capacity_orig_of(cpu);
> capacity += capacity_of(cpu);
> continue;
> }
>
> sgc = rq->sd->groups->sgc;
> - capacity_orig += sgc->capacity_orig;
> capacity += sgc->capacity;
> }
> } else {
> @@ -5926,39 +5920,24 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
>
> group = child->groups;
> do {
> - capacity_orig += group->sgc->capacity_orig;
> capacity += group->sgc->capacity;
> group = group->next;
> } while (group != child->groups);
> }
>
> - sdg->sgc->capacity_orig = capacity_orig;
> sdg->sgc->capacity = capacity;
> }
>
> /*
> - * Try and fix up capacity for tiny siblings, this is needed when
> - * things like SD_ASYM_PACKING need f_b_g to select another sibling
> - * which on its own isn't powerful enough.
> - *
> - * See update_sd_pick_busiest() and check_asym_packing().
> + * Check whether the capacity of the rq has been noticeably reduced by side
> + * activity. The imbalance_pct is used for the threshold.
> + * Return true is the capacity is reduced
> */
> static inline int
> -fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
> +check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> {
> - /*
> - * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
> - */
> - if (!(sd->flags & SD_SHARE_CPUCAPACITY))
> - return 0;
> -
> - /*
> - * If ~90% of the cpu_capacity is still there, we're good.
> - */
> - if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
> - return 1;
> -
> - return 0;
> + return ((rq->cpu_capacity * sd->imbalance_pct) <
> + (rq->cpu_capacity_orig * 100));
> }
>
> /*
> @@ -5996,37 +5975,54 @@ static inline int sg_imbalanced(struct sched_group *group)
> }
>
> /*
> - * Compute the group capacity factor.
> - *
> - * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
> - * first dividing out the smt factor and computing the actual number of cores
> - * and limit unit capacity with that.
> + * group_has_capacity returns true if the group has spare capacity that could
> + * be used by some tasks. We consider that a group has spare capacity if the
> + * number of task is smaller than the number of CPUs or if the usage is lower
> + * than the available capacity for CFS tasks. For the latter, we use a
> + * threshold to stabilize the state, to take into account the variance of the
> + * tasks' load and to return true if the available capacity in meaningful for
> + * the load balancer. As an example, an available capacity of 1% can appear
> + * but it doesn't make any benefit for the load balance.
> */
> -static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
> +static inline bool
> +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
> {
> - unsigned int capacity_factor, smt, cpus;
> - unsigned int capacity, capacity_orig;
> + if ((sgs->group_capacity * 100) >
> + (sgs->group_usage * env->sd->imbalance_pct))
Hi Vincent,
When some CPU(s) are used to handle heavy IRQs or RT tasks,
get_cpu_usage() will return a low usage, and cpu_capacity gets
low as well, and so do the corresponding values of the whole group.
So in this case, is there any guarantee that this check will return false?
Thanks,
Xunlei
> int i;
>
> for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
> - unsigned long capacity, capacity_factor, wl;
> + unsigned long capacity, wl;
> enum fbq_type rt;
>
> rq = cpu_rq(i);
> @@ -6578,9 +6577,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> continue;
>
> capacity = capacity_of(i);
> - capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
> - if (!capacity_factor)
> - capacity_factor = fix_small_capacity(env->sd, group);
>
> wl = weighted_cpuload(i);
>
> @@ -6588,7 +6584,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> * When comparing with imbalance, use weighted_cpuload()
> * which is not scaled with the cpu capacity.
> */
> - if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
> +
> + if (rq->nr_running == 1 && wl > env->imbalance &&
> + !check_cpu_capacity(rq, env->sd))
> continue;
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aaaa3e4..8fd30c1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -759,7 +759,7 @@ struct sched_group_capacity {
> * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
> * for a single CPU.
> */
> - unsigned int capacity, capacity_orig;
> + unsigned int capacity;
> unsigned long next_update;
> int imbalance; /* XXX unrelated to capacity but shared group state */
> /*
> --
> 1.9.1
>
>
> _______________________________________________
> linaro-kernel mailing list
> [email protected]
> http://lists.linaro.org/mailman/listinfo/linaro-kernel
On 19 November 2014 16:15, pang.xunlei <[email protected]> wrote:
> On 4 November 2014 00:54, Vincent Guittot <[email protected]> wrote:
[snip]
>> +static inline bool
>> +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
>> {
>> - unsigned int capacity_factor, smt, cpus;
>> - unsigned int capacity, capacity_orig;
>> + if ((sgs->group_capacity * 100) >
>> + (sgs->group_usage * env->sd->imbalance_pct))
> Hi Vincent,
>
> In case of when some CPU(s) is used to handle heavy IRQs or RT tasks,
> get_cpu_usage() will get low usage(capacity), and cpu_capacity gets
> low as well, so do those of the whole group correspondingly.
> So in this case, is there any guarantee that this math will return false?
As an example, we will return false whatever the non-zero value of
group_capacity if there is no CFS tasks in the group
Regards,
Vincent
>
> Thanks,
> Xunlei
>> + return true;
>>
>> - capacity = group->sgc->capacity;
>> - capacity_orig = group->sgc->capacity_orig;
>> - cpus = group->group_weight;
>> + if (sgs->sum_nr_running < sgs->group_weight)
>> + return true;
>> +
>> + return false;
>> +}
>>
>> - /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
>> - smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
>> - capacity_factor = cpus / smt; /* cores */
>> +/*
>> + * group_is_overloaded returns true if the group has more tasks than it can
>> + * handle. We consider that a group is overloaded if the number of tasks is
>> + * greater than the number of CPUs and the tasks already use all available
>> + * capacity for CFS tasks. For the latter, we use a threshold to stabilize
>> + * the state, to take into account the variance of tasks' load and to return
>> + * true if available capacity is no more meaningful for load balancer
>> + */
>> +static inline bool
>> +group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
>> +{
>> + if (sgs->sum_nr_running <= sgs->group_weight)
>> + return false;
>>
>> - capacity_factor = min_t(unsigned,
>> - capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
>> - if (!capacity_factor)
>> - capacity_factor = fix_small_capacity(env->sd, group);
>> + if ((sgs->group_capacity * 100) <
>> + (sgs->group_usage * env->sd->imbalance_pct))
>> + return true;
>>
>> - return capacity_factor;
>> + return false;
>> }
>>
>> -static enum group_type
>> -group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
>> +static enum group_type group_classify(struct lb_env *env,
>> + struct sched_group *group,
>> + struct sg_lb_stats *sgs)
>> {
>> - if (sgs->sum_nr_running > sgs->group_capacity_factor)
>> + if (sgs->group_no_capacity)
>> return group_overloaded;
>>
>> if (sg_imbalanced(group))
>> @@ -6087,11 +6083,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
>>
>> sgs->group_weight = group->group_weight;
>> - sgs->group_capacity_factor = sg_capacity_factor(env, group);
>> - sgs->group_type = group_classify(group, sgs);
>>
>> - if (sgs->group_capacity_factor > sgs->sum_nr_running)
>> - sgs->group_has_free_capacity = 1;
>> + sgs->group_no_capacity = group_is_overloaded(env, sgs);
>> + sgs->group_type = group_classify(env, group, sgs);
>> }
>>
>> /**
>> @@ -6213,17 +6207,20 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>
>> /*
>> * In case the child domain prefers tasks go to siblings
>> - * first, lower the sg capacity factor to one so that we'll try
>> + * first, lower the sg capacity so that we'll try
>> * and move all the excess tasks away. We lower the capacity
>> * of a group only if the local group has the capacity to fit
>> - * these excess tasks, i.e. nr_running < group_capacity_factor. The
>> - * extra check prevents the case where you always pull from the
>> - * heaviest group when it is already under-utilized (possible
>> - * with a large weight task outweighs the tasks on the system).
>> + * these excess tasks. The extra check prevents the case where
>> + * you always pull from the heaviest group when it is already
>> + * under-utilized (possible with a large weight task outweighs
>> + * the tasks on the system).
>> */
>> if (prefer_sibling && sds->local &&
>> - sds->local_stat.group_has_free_capacity)
>> - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
>> + group_has_capacity(env, &sds->local_stat) &&
>> + (sgs->sum_nr_running > 1)) {
>> + sgs->group_no_capacity = 1;
>> + sgs->group_type = group_overloaded;
>> + }
>>
>> if (update_sd_pick_busiest(env, sds, sg, sgs)) {
>> sds->busiest = sg;
>> @@ -6402,11 +6399,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> */
>> if (busiest->group_type == group_overloaded &&
>> local->group_type == group_overloaded) {
>> - load_above_capacity =
>> - (busiest->sum_nr_running - busiest->group_capacity_factor);
>> -
>> - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
>> - load_above_capacity /= busiest->group_capacity;
>> + load_above_capacity = busiest->sum_nr_running *
>> + SCHED_LOAD_SCALE;
>> + if (load_above_capacity > busiest->group_capacity)
>> + load_above_capacity -= busiest->group_capacity;
>> + else
>> + load_above_capacity = ~0UL;
>> }
>>
>> /*
>> @@ -6469,6 +6467,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> local = &sds.local_stat;
>> busiest = &sds.busiest_stat;
>>
>> + /* ASYM feature bypasses nice load balance check */
>> if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
>> check_asym_packing(env, &sds))
>> return sds.busiest;
>> @@ -6489,8 +6488,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> goto force_balance;
>>
>> /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
>> - if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
>> - !busiest->group_has_free_capacity)
>> + if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
>> + busiest->group_no_capacity)
>> goto force_balance;
>>
>> /*
>> @@ -6549,7 +6548,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>> int i;
>>
>> for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
>> - unsigned long capacity, capacity_factor, wl;
>> + unsigned long capacity, wl;
>> enum fbq_type rt;
>>
>> rq = cpu_rq(i);
>> @@ -6578,9 +6577,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>> continue;
>>
>> capacity = capacity_of(i);
>> - capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
>> - if (!capacity_factor)
>> - capacity_factor = fix_small_capacity(env->sd, group);
>>
>> wl = weighted_cpuload(i);
>>
>> @@ -6588,7 +6584,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>> * When comparing with imbalance, use weighted_cpuload()
>> * which is not scaled with the cpu capacity.
>> */
>> - if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
>> +
>> + if (rq->nr_running == 1 && wl > env->imbalance &&
>> + !check_cpu_capacity(rq, env->sd))
>> continue;
>>
>> /*
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index aaaa3e4..8fd30c1 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -759,7 +759,7 @@ struct sched_group_capacity {
>> * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
>> * for a single CPU.
>> */
>> - unsigned int capacity, capacity_orig;
>> + unsigned int capacity;
>> unsigned long next_update;
>> int imbalance; /* XXX unrelated to capacity but shared group state */
>> /*
>> --
>> 1.9.1
>>
>>
>> _______________________________________________
>> linaro-kernel mailing list
>> [email protected]
>> http://lists.linaro.org/mailman/listinfo/linaro-kernel
On Mon, Nov 03, 2014 at 04:54:37PM +0000, Vincent Guittot wrote:
> This patchset consolidates several changes in the capacity and the usage
> tracking of the CPU. It provides a frequency invariant metric of the usage of
> CPUs and generally improves the accuracy of load/usage tracking in the
> scheduler. The frequency invariant metric is the foundation required for the
> consolidation of cpufreq and implementation of a fully invariant load tracking.
> These are currently WIP and require several changes to the load balancer
> (including how it will use and interprets load and capacity metrics) and
> extensive validation. The frequency invariance is done with
> arch_scale_freq_capacity and this patchset doesn't provide the backends of
> the function which are architecture dependent.
>
> As discussed at LPC14, Morten and I have consolidated our changes into a single
> patchset to make it easier to review and merge.
>
> During load balance, the scheduler evaluates the number of tasks that a group
> of CPUs can handle. The current method assumes that tasks have a fix load of
> SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE.
> This assumption generates wrong decision by creating ghost cores or by
> removing real ones when the original capacity of CPUs is different from the
> default SCHED_CAPACITY_SCALE. With this patch set, we don't try anymore to
> evaluate the number of available cores based on the group_capacity but instead
> we evaluate the usage of a group and compare it with its capacity.
>
> This patchset mainly replaces the old capacity_factor method by a new one and
> keeps the general policy almost unchanged. These new metrics will be also used
> in later patches.
>
> The CPU usage is based on a running time tracking version of the current
> implementation of the load average tracking. I also have a version that is
> based on the new implementation proposal [1] but I haven't provide the patches
> and results as [1] is still under review. I can provide change above [1] to
> change how CPU usage is computed and to adapt to new mecanism.
>
> Change since V8
> - reorder patches
I think the patch set is in a good shape. I have got few comments and
suggestions mostly minor things and things related to naming to the
first 7 patches. I haven't found any problems with the code itself. I
haven't tested it yet though. The last three patches I have a few
questions for.
Morten
Should the subject mention that the patch adds utilization tracking?
Maybe: 'sched: Add utilization tracking' ?
On Mon, Nov 03, 2014 at 04:54:38PM +0000, Vincent Guittot wrote:
> Add new statistics which reflect the average time a task is running on the CPU
> and the sum of these running time of the tasks on a runqueue. The latter is
> named utilization_load_avg.
>
> This patch is based on the usage metric that was proposed in the 1st
> versions of the per-entity load tracking patchset by Paul Turner
Should we do ourselves and anybody else who feels like going through the
pain of understanding the load-tracking code a favor and drop the use of
the term 'usage' and use 'utilization' everywhere instead? 'usage' isn't
clearly defined anywhere.
Referring to 'usage' here in the reference to original patch is fine,
but I suggest that we remove it from the code and comment on subsequent
patches unless there is a very good reason to keep it.
> <[email protected]> but that has be removed afterwards. This version differs from
> the original one in the sense that it's not linked to task_group.
>
> The rq's utilization_load_avg will be used to check if a rq is overloaded or
> not instead of trying to compute how many tasks a group of CPUs can handle.
>
> Rename runnable_avg_period into avg_period as it is now used with both
> runnable_avg_sum and running_avg_sum
>
> Add some descriptions of the variables to explain their differences
>
> cc: Paul Turner <[email protected]>
> cc: Ben Segall <[email protected]>
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> include/linux/sched.h | 21 ++++++++++++---
> kernel/sched/debug.c | 10 ++++---
> kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++-----------
> kernel/sched/sched.h | 8 +++++-
> 4 files changed, 89 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 18f5262..b576b29 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1071,15 +1071,28 @@ struct load_weight {
> };
>
> struct sched_avg {
> + u64 last_runnable_update;
> + s64 decay_count;
> + /*
> + * utilization_avg_contrib describes the amount of time that a
> + * sched_entity is running on a CPU. It is based on running_avg_sum
> + * and is scaled in the range [0..SCHED_LOAD_SCALE].
> + * load_avg_contrib described the amount of time that a sched_entity
> + * is runnable on a rq. It is based on both runnable_avg_sum and the
> + * weight of the task.
> + */
> + unsigned long load_avg_contrib, utilization_avg_contrib;
> /*
> * These sums represent an infinite geometric series and so are bound
> * above by 1024/(1-y). Thus we only need a u32 to store them for all
> * choices of y < 1-2^(-32)*1024.
> + * running_avg_sum reflects the time that the sched_entity is
> + * effectively running on the CPU.
> + * runnable_avg_sum represents the amount of time a sched_entity is on
> + * a runqueue which includes the running time that is monitored by
> + * running_avg_sum.
i.e. runnable_avg_sum = running_avg_sum + 'waiting time'
> */
> - u32 runnable_avg_sum, runnable_avg_period;
> - u64 last_runnable_update;
> - s64 decay_count;
> - unsigned long load_avg_contrib;
> + u32 runnable_avg_sum, avg_period, running_avg_sum;
> };
>
> #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index ce33780..f384452 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
> if (!se) {
> struct sched_avg *avg = &cpu_rq(cpu)->avg;
> P(avg->runnable_avg_sum);
> - P(avg->runnable_avg_period);
> + P(avg->avg_period);
> return;
> }
>
> @@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
> P(se->load.weight);
> #ifdef CONFIG_SMP
> P(se->avg.runnable_avg_sum);
> - P(se->avg.runnable_avg_period);
> + P(se->avg.avg_period);
> P(se->avg.load_avg_contrib);
> P(se->avg.decay_count);
> #endif
> @@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> cfs_rq->runnable_load_avg);
> SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
> cfs_rq->blocked_load_avg);
> + SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg",
> + cfs_rq->utilization_load_avg);
> #ifdef CONFIG_FAIR_GROUP_SCHED
> SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
> cfs_rq->tg_load_contrib);
> @@ -628,8 +630,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
> P(se.load.weight);
> #ifdef CONFIG_SMP
> P(se.avg.runnable_avg_sum);
> - P(se.avg.runnable_avg_period);
> + P(se.avg.running_avg_sum);
> + P(se.avg.avg_period);
> P(se.avg.load_avg_contrib);
> + P(se.avg.utilization_avg_contrib);
> P(se.avg.decay_count);
> #endif
> P(policy);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bd61cff..3a91ae6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -670,6 +670,7 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
> static unsigned long task_h_load(struct task_struct *p);
>
> static inline void __update_task_entity_contrib(struct sched_entity *se);
> +static inline void __update_task_entity_utilization(struct sched_entity *se);
>
> /* Give new task start runnable values to heavy its load in infant time */
> void init_task_runnable_average(struct task_struct *p)
> @@ -678,9 +679,10 @@ void init_task_runnable_average(struct task_struct *p)
>
> p->se.avg.decay_count = 0;
> slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
> - p->se.avg.runnable_avg_sum = slice;
> - p->se.avg.runnable_avg_period = slice;
> + p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
> + p->se.avg.avg_period = slice;
> __update_task_entity_contrib(&p->se);
> + __update_task_entity_utilization(&p->se);
> }
> #else
> void init_task_runnable_average(struct task_struct *p)
> @@ -1548,7 +1550,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
> *period = now - p->last_task_numa_placement;
> } else {
> delta = p->se.avg.runnable_avg_sum;
> - *period = p->se.avg.runnable_avg_period;
> + *period = p->se.avg.avg_period;
> }
>
> p->last_sum_exec_runtime = runtime;
> @@ -2294,7 +2296,8 @@ static u32 __compute_runnable_contrib(u64 n)
> */
> static __always_inline int __update_entity_runnable_avg(u64 now,
> struct sched_avg *sa,
> - int runnable)
> + int runnable,
> + int running)
> {
> u64 delta, periods;
> u32 runnable_contrib;
> @@ -2320,7 +2323,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
> sa->last_runnable_update = now;
>
> /* delta_w is the amount already accumulated against our next period */
> - delta_w = sa->runnable_avg_period % 1024;
> + delta_w = sa->avg_period % 1024;
> if (delta + delta_w >= 1024) {
> /* period roll-over */
> decayed = 1;
> @@ -2333,7 +2336,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
> delta_w = 1024 - delta_w;
> if (runnable)
> sa->runnable_avg_sum += delta_w;
> - sa->runnable_avg_period += delta_w;
> + if (running)
> + sa->running_avg_sum += delta_w;
> + sa->avg_period += delta_w;
>
> delta -= delta_w;
>
> @@ -2343,20 +2348,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>
> sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
> periods + 1);
> - sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
> + sa->running_avg_sum = decay_load(sa->running_avg_sum,
> + periods + 1);
> + sa->avg_period = decay_load(sa->avg_period,
> periods + 1);
>
> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> runnable_contrib = __compute_runnable_contrib(periods);
> if (runnable)
> sa->runnable_avg_sum += runnable_contrib;
> - sa->runnable_avg_period += runnable_contrib;
> + if (running)
> + sa->running_avg_sum += runnable_contrib;
> + sa->avg_period += runnable_contrib;
> }
>
> /* Remainder of delta accrued against u_0` */
> if (runnable)
> sa->runnable_avg_sum += delta;
> - sa->runnable_avg_period += delta;
> + if (running)
> + sa->running_avg_sum += delta;
> + sa->avg_period += delta;
>
> return decayed;
> }
> @@ -2372,6 +2383,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> return 0;
>
> se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
> + se->avg.utilization_avg_contrib =
> + decay_load(se->avg.utilization_avg_contrib, decays);
> se->avg.decay_count = 0;
>
> return decays;
> @@ -2408,7 +2421,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>
> /* The fraction of a cpu used by this cfs_rq */
> contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> - sa->runnable_avg_period + 1);
> + sa->avg_period + 1);
> contrib -= cfs_rq->tg_runnable_contrib;
>
> if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> @@ -2461,7 +2474,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>
> static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
> {
> - __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
> + __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
> + runnable);
> __update_tg_runnable_avg(&rq->avg, &rq->cfs);
> }
> #else /* CONFIG_FAIR_GROUP_SCHED */
> @@ -2479,7 +2493,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
>
> /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
> contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
> - contrib /= (se->avg.runnable_avg_period + 1);
> + contrib /= (se->avg.avg_period + 1);
> se->avg.load_avg_contrib = scale_load(contrib);
> }
>
> @@ -2498,6 +2512,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
> return se->avg.load_avg_contrib - old_contrib;
> }
>
> +
> +static inline void __update_task_entity_utilization(struct sched_entity *se)
> +{
> + u32 contrib;
> +
> + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
> + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
> + contrib /= (se->avg.avg_period + 1);
> + se->avg.utilization_avg_contrib = scale_load(contrib);
> +}
> +
> +static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
> +{
> + long old_contrib = se->avg.utilization_avg_contrib;
> +
> + if (entity_is_task(se))
> + __update_task_entity_utilization(se);
> +
> + return se->avg.utilization_avg_contrib - old_contrib;
> +}
> +
> static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> long load_contrib)
> {
> @@ -2514,7 +2549,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
> int update_cfs_rq)
> {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> - long contrib_delta;
> + long contrib_delta, utilization_delta;
> u64 now;
>
> /*
> @@ -2526,18 +2561,22 @@ static inline void update_entity_load_avg(struct sched_entity *se,
> else
> now = cfs_rq_clock_task(group_cfs_rq(se));
>
> - if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
> + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
> + cfs_rq->curr == se))
> return;
>
> contrib_delta = __update_entity_load_avg_contrib(se);
> + utilization_delta = __update_entity_utilization_avg_contrib(se);
>
> if (!update_cfs_rq)
> return;
>
> - if (se->on_rq)
> + if (se->on_rq) {
> cfs_rq->runnable_load_avg += contrib_delta;
> - else
> + cfs_rq->utilization_load_avg += utilization_delta;
> + } else {
> subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> + }
> }
>
> /*
> @@ -2612,6 +2651,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> }
>
> cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
> + cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
> /* we force update consideration on load-balancer moves */
> update_cfs_rq_blocked_load(cfs_rq, !wakeup);
> }
> @@ -2630,6 +2670,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
> update_cfs_rq_blocked_load(cfs_rq, !sleep);
>
> cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
> + cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
> if (sleep) {
> cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> @@ -2967,6 +3008,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> */
> update_stats_wait_end(cfs_rq, se);
> __dequeue_entity(cfs_rq, se);
> + update_entity_load_avg(se, 1);
> }
>
> update_stats_curr_start(cfs_rq, se);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6130251..c34bd11 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -343,8 +343,14 @@ struct cfs_rq {
> * Under CFS, load is tracked on a per-entity basis and aggregated up.
> * This allows for the description of both thread and group usage (in
> * the FAIR_GROUP_SCHED case).
> + * runnable_load_avg is the sum of the load_avg_contrib of the
> + * sched_entities on the rq.
> + * blocked_load_avg is similar to runnable_load_avg except that its
> + * the blocked sched_entities on the rq.
This is slightly misleading. Blocked entities are not on the rq. Maybe
say: "... except it is the blocked sched_entities associated with the
rq." ?
s/usage/utilization/ in subject.
On Mon, Nov 03, 2014 at 04:54:39PM +0000, Vincent Guittot wrote:
> From: Morten Rasmussen <[email protected]>
>
> Adds usage contribution tracking for group entities. Unlike
s/usage contribution/utilization/
> se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
> entities is the sum of se->avg.utilization_avg_contrib for all entities on the
> group runqueue. It is _not_ influenced in any way by the task group
> h_load. Hence it is representing the actual cpu usage of the group, not
s/usage/utilization/
> its intended load contribution which may differ significantly from the
> utilization on lightly utilized systems.
>
> cc: Paul Turner <[email protected]>
> cc: Ben Segall <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/debug.c | 2 ++
> kernel/sched/fair.c | 3 +++
> 2 files changed, 5 insertions(+)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index f384452..efb47ed 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
> P(se->load.weight);
> #ifdef CONFIG_SMP
> P(se->avg.runnable_avg_sum);
> + P(se->avg.running_avg_sum);
> P(se->avg.avg_period);
> P(se->avg.load_avg_contrib);
> + P(se->avg.utilization_avg_contrib);
> P(se->avg.decay_count);
> #endif
> #undef PN
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3a91ae6..a171e1b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2529,6 +2529,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
>
> if (entity_is_task(se))
> __update_task_entity_utilization(se);
> + else
> + se->avg.utilization_avg_contrib =
> + group_cfs_rq(se)->utilization_load_avg;
>
> return se->avg.utilization_avg_contrib - old_contrib;
> }
What happened to the update of se->avg.utilization_avg_contrib in
__synchronize_entity_decay()? It seems to have disapperead after v7.
On Mon, Nov 03, 2014 at 04:54:40PM +0000, Vincent Guittot wrote:
> Now that arch_scale_cpu_capacity has been introduced to scale the original
> capacity, the arch_scale_freq_capacity is no longer used (it was
> previously used by ARM arch). Remove arch_scale_freq_capacity from the
> computation of cpu_capacity. The frequency invariance will be handled in the
> load tracking and not in the CPU capacity.
Just a note:
Yes, we are scaling the load tracking of each task to compensate for
frequency scaling. I think that is all fine. Later we will have to feed
the current frequency into the energy model as well to figure out if
putting more tasks on a cpu will/should cause the frequency to increase
or not. We don't need it for now and don't need to add it in the commit
message. It is just a reminder that we may actually want to use
arch_scale_freq_capacity() combined with cpu_capacity to provide this
input to the energy model later.
On Mon, Nov 03, 2014 at 04:54:41PM +0000, Vincent Guittot wrote:
> From: Morten Rasmussen <[email protected]>
>
> Apply frequency scale-invariance correction factor to usage tracking.
s/usage/utilization/
> Each segment of the running_load_avg geometric series is now scaled by the
> current frequency so the utilization_avg_contrib of each entity will be
s/entity/sched_entity/
> invariant with frequency scaling. As a result, utilization_load_avg which is
> the sum of utilization_avg_contrib, becomes invariant too. So the usage level
s/sum of utilization_avg_contrib/sum of sched_entity
utilization_avg_contribs/
s/usage/utilization/
> that is returned by get_cpu_usage, stays relative to the max frequency as the
> cpu_capacity which is is compared against.
The last bit doesn't parse right. '... Maybe it is better to drop
the reference to get_cpu_usage which hasn't been defined yet and rewrite
the thing to:
Apply frequency scale-invariance correction factor to utilization
tracking. Each segment of the running_load_avg geometric series is now
scaled by the current frequency so the utilization_avg_contrib of each
entity will be invariant with frequency scaling. As a result,
utilization_load_avg which is the sum of sched_entity
utilization_avg_contribs becomes invariant too and is now relative to
the max utilization at the max frequency (=cpu_capacity).
I think we should add:
arch_scale_freq_capacity() is reintroduced to provide the frequency
compensation scaling factor.
> Then, we want the keep the load tracking values in a 32bits type, which implies
s/Then, we/We/
On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
> The average running time of RT tasks is used to estimate the remaining compute
> capacity for CFS tasks. This remaining capacity is the original capacity scaled
> down by a factor (aka scale_rt_capacity). This estimation of available capacity
> must also be invariant with frequency scaling.
>
> A frequency scaling factor is applied on the running time of the RT tasks for
> computing scale_rt_capacity.
>
> In sched_rt_avg_update, we scale the RT execution time like below:
> rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
>
> Then, scale_rt_capacity can be summarized by:
> scale_rt_capacity = SCHED_CAPACITY_SCALE -
> ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
>
> We can optimize by removing right and left shift in the computation of rq->rt_avg
> and scale_rt_capacity
>
> The call to arch_scale_frequency_capacity in the rt scheduling path might be
> a concern for RT folks because I'm not sure whether we can rely on
> arch_scale_freq_capacity to be short and efficient ?
It better be fast :) It is used in critical paths. However, if you
really care about latency you probably don't want frequency scaling to
mess around. If the architecture provides a fast-path for
arch_scale_freq_capacity() returning SCHED_CAPACITY_SCALE when frequency
scaling is disabled, the overhead should be minimal. If the architecture
doesn't provide arch_scale_freq_capacity() it becomes a constant
multiplication and should hopefully go away completely.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 17 +++++------------
> kernel/sched/sched.h | 4 +++-
> 2 files changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a5039da..b37c27b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5785,7 +5785,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> static unsigned long scale_rt_capacity(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> - u64 total, available, age_stamp, avg;
> + u64 total, used, age_stamp, avg;
> s64 delta;
>
> /*
> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>
> total = sched_avg_period() + delta;
>
> - if (unlikely(total < avg)) {
> - /* Ensures that capacity won't end up being negative */
> - available = 0;
> - } else {
> - available = total - avg;
> - }
> + used = div_u64(avg, total);
I haven't looked through all the details of the rt avg tracking, but if
'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
work. Is it guaranteed that total > 0 so we don't get division by zero?
It does get slightly more complicated if we want to figure out the
available capacity at the current frequency (current < max) later. Say,
rt eats 25% of the compute capacity, but the current frequency is only
50%. In that case we get:
curr_avail_capacity = (arch_scale_cpu_capacity() *
    (arch_scale_freq_capacity() - (SCHED_CAPACITY_SCALE - scale_rt_capacity())))
    >> SCHED_CAPACITY_SHIFT
With numbers assuming arch_scale_cpu_capacity() = 800:
curr_avail_capacity = 800 * (512 - (1024 - 758)) >> 10 = 192
Which isn't actually that bad. Anyway, it isn't needed until we start
involving energy models.
>
> - if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
> - total = SCHED_CAPACITY_SCALE;
> + if (likely(used < SCHED_CAPACITY_SCALE))
> + return SCHED_CAPACITY_SCALE - used;
>
> - total >>= SCHED_CAPACITY_SHIFT;
> -
> - return div_u64(available, total);
> + return 1;
> }
>
> static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c34bd11..fc5b152 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1312,9 +1312,11 @@ static inline int hrtick_enabled(struct rq *rq)
>
> #ifdef CONFIG_SMP
> extern void sched_avg_update(struct rq *rq);
> +extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
I'm not sure if it makes any difference, but shouldn't it be __weak
instead of extern?
unsigned long __weak arch_scale_freq_capacity(...)
Also, now that the function prototype definition is in the header file
we can kill the local prototype in fair.c introduced in patch 4:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fd5ac6..921b174 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2277,8 +2277,6 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
}
-unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd,
-						int cpu);
-
/*
 * We can represent the historical contribution to runnable average as the
 * coefficients of a geometric series. To do this we sub-divide our runnable
On Mon, Nov 03, 2014 at 04:54:44PM +0000, Vincent Guittot wrote:
> Monitor the usage level of each group of each sched_domain level. The usage is
> the portion of cpu_capacity_orig that is currently used on a CPU or group of
> CPUs. We use the utilization_load_avg to evaluate the usage level of each
> group.
Here 'usage' is defined for the first time.
>
> The utilization_load_avg only takes into account the running time of the CFS
> tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
> utilized. Nevertheless, we must cap utilization_load_avg which can be temporaly
> greater than SCHED_LOAD_SCALE after the migration of a task on this CPU and
> until the metrics are stabilized.
>
> The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
> running load on the CPU whereas the available capacity for the CFS task is in
> the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
> by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
> of the CPU to get the usage of the latter. The usage can then be compared with
> the available capacity (ie cpu_capacity) to deduct the usage level of a CPU.
So 'usage' is more precisely scaled utilization (by
cpu_capacity_orig/SCHED_LOAD_SCALE). Do we need to use 'usage' to
describe this?
So far we only have introduced frequency invariant load tracking, once
we add uarch invariance utilization_load_avg will be in the range
[0..cpu_capacity_orig] as the scaling will happen as part of the load
tracking (just like the frequency invariance). Then 'usage' becomes
equal to utilization_load_avg which means that there is very little
reason to keep the term. No?
I haven't pointed out all uses of 'usage' in this and following patches.
If 'usage' is kept the previous patches should be revisited to define
it.
>
> The frequency scaling invariance of the usage is not taken into account in this
> patch, it will be solved in another patch which will deal with frequency
> scaling invariance on the running_load_avg.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4782733..884578e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4559,6 +4559,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
> done:
> return target;
> }
> +/*
> + * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
> + * tasks. The unit of the return value must capacity so we can compare the
s/must/must be/
> + * usage with the capacity of the CPU that is available for CFS task (ie
> + * cpu_capacity).
> + * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
> + * CPU. It represents the amount of utilization of a CPU in the range
> + * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
s/  / /
> + * capacity of the CPU because it's about the running time on this CPU.
Maybe add (cpu_capacity_orig) to make it clear what full capacity means.
> + * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
> + * because of unfortunate rounding in avg_period and running_load_avg or just
> + * after migrating tasks until the average stabilizes with the new running
> + * time. So we need to check that the usage stays into the range
> + * [0..cpu_capacity_orig] and cap if necessary.
> + * Without capping the usage, a group could be seen as overloaded (CPU0 usage
> + * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
> + */
> +static int get_cpu_usage(int cpu)
> +{
> + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> + unsigned long capacity = capacity_orig_of(cpu);
> +
> + if (usage >= SCHED_LOAD_SCALE)
> + return capacity;
> +
> + return (usage * capacity) >> SCHED_LOAD_SHIFT;
> +}
>
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> @@ -5688,6 +5715,7 @@ struct sg_lb_stats {
> unsigned long sum_weighted_load; /* Weighted load of group's tasks */
> unsigned long load_per_task;
> unsigned long group_capacity;
> + unsigned long group_usage; /* Total usage of the group */
> unsigned int sum_nr_running; /* Nr tasks running in the group */
> unsigned int group_capacity_factor;
> unsigned int idle_cpus;
> @@ -6036,6 +6064,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> load = source_load(i, load_idx);
>
> sgs->group_load += load;
> + sgs->group_usage += get_cpu_usage(i);
> sgs->sum_nr_running += rq->cfs.h_nr_running;
>
> if (rq->nr_running > 1)
The last two hunks do not appear to be used in this patch. Would it be
better to have them with the code that uses the statistics? The patch
however does what the subject says. Just a thought.
On Mon, Nov 03, 2014 at 04:54:45PM +0000, Vincent Guittot wrote:
> The scheduler tries to compute how many tasks a group of CPUs can handle by
> assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
> SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the group
> by SCHED_LOAD_SCALE to estimate how many task can run in the group. Then, it
> compares this value with the sum of nr_running to decide if the group is
> overloaded or not. But the group_capacity_factor is hardly working for SMT
> system, it sometimes works for big cores but fails to do the right thing for
> little cores.
>
> Below are two examples to illustrate the problem that this patch solves:
>
> 1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
> (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2
> (div_round_closest(3x640/1024) = 2) which means that it will be seen as
> overloaded even if we have only one task per CPU.
>
> 2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
> (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
> (at max and thanks to the fix [0] for SMT system that prevent the apparition
> of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
> reduced to nearly nothing), the capacity factor of the group will still be 4
> (div_round_closest(3*1512/1024) = 5 which is cap to 4 with [0]).
>
> So, this patch tries to solve this issue by removing capacity_factor and
> replacing it with the 2 following metrics :
> -The available CPU's capacity for CFS tasks which is already used by
> load_balance.
> -The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
> has been re-introduced to compute the usage of a CPU by CFS tasks.
>
> group_capacity_factor and group_has_free_capacity has been removed and replaced
> by group_no_capacity. We compare the number of task with the number of CPUs and
> we evaluate the level of utilization of the CPUs to define if a group is
> overloaded or if a group has capacity to handle more tasks.
>
> For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
> so it will be selected in priority (among the overloaded groups). Since [1],
> SD_PREFER_SIBLING is no more concerned by the computation of load_above_capacity
> because local is not overloaded.
[...]
> @@ -6213,17 +6207,20 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> /*
> * In case the child domain prefers tasks go to siblings
> - * first, lower the sg capacity factor to one so that we'll try
> + * first, lower the sg capacity so that we'll try
> * and move all the excess tasks away. We lower the capacity
> * of a group only if the local group has the capacity to fit
> - * these excess tasks, i.e. nr_running < group_capacity_factor. The
> - * extra check prevents the case where you always pull from the
> - * heaviest group when it is already under-utilized (possible
> - * with a large weight task outweighs the tasks on the system).
> + * these excess tasks. The extra check prevents the case where
> + * you always pull from the heaviest group when it is already
> + * under-utilized (possible with a large weight task outweighs
> + * the tasks on the system).
> */
> if (prefer_sibling && sds->local &&
> - sds->local_stat.group_has_free_capacity)
> - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
> + group_has_capacity(env, &sds->local_stat) &&
> + (sgs->sum_nr_running > 1)) {
> + sgs->group_no_capacity = 1;
> + sgs->group_type = group_overloaded;
> + }
I'm still a bit confused about SD_PREFER_SIBLING. What is the flag
supposed to do and why?
It looks like a weak load balancing bias attempting to consolidate tasks
on domains with spare capacity. It does so by marking non-local groups
as overloaded regardless of their actual load if the local group has
spare capacity. Correct?
In patch 9 this behaviour is enabled for SMT level domains, which
implies that tasks will be consolidated in MC groups, that is we prefer
multiple tasks on sibling cpus (hw threads). I must be missing something
essential. I was convinced that we wanted to avoid using sibling cpus on
SMT systems as much as possible?
On Mon, Nov 03, 2014 at 04:54:47PM +0000, Vincent Guittot wrote:
> When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
> capacity for CFS tasks can be significantly reduced. Once we detect such
> situation by comparing cpu_capacity_orig and cpu_capacity, we trig an idle
> load balance to check if it's worth moving its tasks on an idle CPU.
>
> Once the idle load_balance has selected the busiest CPU, it will look for an
> active load balance for only two cases :
> - there is only 1 task on the busiest CPU.
> - we haven't been able to move a task of the busiest rq.
>
> A CPU with a reduced capacity is included in the 1st case, and it's worth to
> actively migrate its task if the idle CPU has got full capacity. This test has
> been added in need_active_balance.
>
> As a sidenote, this will note generate more spurious ilb because we already
> trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
> has a task, we will trig the ilb once for migrating the task.
>
> The nohz_kick_needed function has been cleaned up a bit while adding the new
> test
>
> env.src_cpu and env.src_rq must be set unconditionnally because they are used
> in need_active_balance which is called even if busiest->nr_running equals 1
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 53 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index db392a6..02e8f7f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6634,6 +6634,28 @@ static int need_active_balance(struct lb_env *env)
> return 1;
> }
>
> + /*
> + * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
> + * It's worth migrating the task if the src_cpu's capacity is reduced
> + * because of other sched_class or IRQs whereas capacity stays
> + * available on dst_cpu.
> + */
> + if ((env->idle != CPU_NOT_IDLE) &&
> + (env->src_rq->cfs.h_nr_running == 1)) {
> + unsigned long src_eff_capacity, dst_eff_capacity;
> +
> + dst_eff_capacity = 100;
> + dst_eff_capacity *= capacity_of(env->dst_cpu);
> + dst_eff_capacity *= capacity_orig_of(env->src_cpu);
> +
> + src_eff_capacity = sd->imbalance_pct;
> + src_eff_capacity *= capacity_of(env->src_cpu);
> + src_eff_capacity *= capacity_orig_of(env->dst_cpu);
Do we need to scale by capacity_orig? Shouldn't the absolute capacity be
better?
if (capacity_of(env->src_cpu) * sd->imbalance_pct <
    capacity_of(env->dst_cpu) * 100) ?
Isn't it the absolute available capacity that matters? For SMP
capacity_orig is the same and cancels out and doesn't change anything.
For big.LITTLE we would rather have the task run on a big where rt/irq
eats 30% than a little cpu where rt/irq eats 5%, assuming big capacity
is much bigger than little capacity so the absolute available capacity
(~cycles/time) is larger on the big cpu.
On 21 November 2014 at 13:34, Morten Rasmussen <[email protected]> wrote:
> Should the subject mention that the patch adds utilization tracking?
> Maybe: 'sched: Add utilization tracking' ?
>
>
> On Mon, Nov 03, 2014 at 04:54:38PM +0000, Vincent Guittot wrote:
>> Add new statistics which reflect the average time a task is running on the CPU
>> and the sum of these running time of the tasks on a runqueue. The latter is
>> named utilization_load_avg.
>>
>> This patch is based on the usage metric that was proposed in the 1st
>> versions of the per-entity load tracking patchset by Paul Turner
>
> Should we do ourselves and anybody else who feels like going through the
> pain of understanding the load-tracking code a favor and drop the use of
> the term 'usage' and use 'utilization' everywhere instead? 'usage' isn't
> clearly defined anywhere.
>
> Referring to 'usage' here in the reference to original patch is fine,
> but I suggest that we remove it from the code and comment on subsequent
> patches unless there is a very good reason to keep it.
As discussed with Peter, we use 'usage' when the task's utilization has
been scaled by the capacity.
IIRC from one of our discussions, Dietmar should prepare a patchset to
rename and align the variables and fields.
Regards,
Vincent
>
>> <[email protected]> but that has be removed afterwards. This version differs from
>> the original one in the sense that it's not linked to task_group.
>>
>> The rq's utilization_load_avg will be used to check if a rq is overloaded or
>> not instead of trying to compute how many tasks a group of CPUs can handle.
>>
>> Rename runnable_avg_period into avg_period as it is now used with both
>> runnable_avg_sum and running_avg_sum
>>
>> Add some descriptions of the variables to explain their differences
>>
>> cc: Paul Turner <[email protected]>
>> cc: Ben Segall <[email protected]>
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> include/linux/sched.h | 21 ++++++++++++---
>> kernel/sched/debug.c | 10 ++++---
>> kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++-----------
>> kernel/sched/sched.h | 8 +++++-
>> 4 files changed, 89 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 18f5262..b576b29 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1071,15 +1071,28 @@ struct load_weight {
>> };
>>
>> struct sched_avg {
>> + u64 last_runnable_update;
>> + s64 decay_count;
>> + /*
>> + * utilization_avg_contrib describes the amount of time that a
>> + * sched_entity is running on a CPU. It is based on running_avg_sum
>> + * and is scaled in the range [0..SCHED_LOAD_SCALE].
>> + * load_avg_contrib described the amount of time that a sched_entity
>> + * is runnable on a rq. It is based on both runnable_avg_sum and the
>> + * weight of the task.
>> + */
>> + unsigned long load_avg_contrib, utilization_avg_contrib;
>> /*
>> * These sums represent an infinite geometric series and so are bound
>> * above by 1024/(1-y). Thus we only need a u32 to store them for all
>> * choices of y < 1-2^(-32)*1024.
>> + * running_avg_sum reflects the time that the sched_entity is
>> + * effectively running on the CPU.
>> + * runnable_avg_sum represents the amount of time a sched_entity is on
>> + * a runqueue which includes the running time that is monitored by
>> + * running_avg_sum.
>
> i.e. runnable_avg_sum = running_avg_sum + 'waiting time'
>
>> */
>> - u32 runnable_avg_sum, runnable_avg_period;
>> - u64 last_runnable_update;
>> - s64 decay_count;
>> - unsigned long load_avg_contrib;
>> + u32 runnable_avg_sum, avg_period, running_avg_sum;
>> };
>>
>> #ifdef CONFIG_SCHEDSTATS
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index ce33780..f384452 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>> if (!se) {
>> struct sched_avg *avg = &cpu_rq(cpu)->avg;
>> P(avg->runnable_avg_sum);
>> - P(avg->runnable_avg_period);
>> + P(avg->avg_period);
>> return;
>> }
>>
>> @@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>> P(se->load.weight);
>> #ifdef CONFIG_SMP
>> P(se->avg.runnable_avg_sum);
>> - P(se->avg.runnable_avg_period);
>> + P(se->avg.avg_period);
>> P(se->avg.load_avg_contrib);
>> P(se->avg.decay_count);
>> #endif
>> @@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>> cfs_rq->runnable_load_avg);
>> SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
>> cfs_rq->blocked_load_avg);
>> + SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg",
>> + cfs_rq->utilization_load_avg);
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
>> cfs_rq->tg_load_contrib);
>> @@ -628,8 +630,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
>> P(se.load.weight);
>> #ifdef CONFIG_SMP
>> P(se.avg.runnable_avg_sum);
>> - P(se.avg.runnable_avg_period);
>> + P(se.avg.running_avg_sum);
>> + P(se.avg.avg_period);
>> P(se.avg.load_avg_contrib);
>> + P(se.avg.utilization_avg_contrib);
>> P(se.avg.decay_count);
>> #endif
>> P(policy);
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bd61cff..3a91ae6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -670,6 +670,7 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
>> static unsigned long task_h_load(struct task_struct *p);
>>
>> static inline void __update_task_entity_contrib(struct sched_entity *se);
>> +static inline void __update_task_entity_utilization(struct sched_entity *se);
>>
>> /* Give new task start runnable values to heavy its load in infant time */
>> void init_task_runnable_average(struct task_struct *p)
>> @@ -678,9 +679,10 @@ void init_task_runnable_average(struct task_struct *p)
>>
>> p->se.avg.decay_count = 0;
>> slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
>> - p->se.avg.runnable_avg_sum = slice;
>> - p->se.avg.runnable_avg_period = slice;
>> + p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
>> + p->se.avg.avg_period = slice;
>> __update_task_entity_contrib(&p->se);
>> + __update_task_entity_utilization(&p->se);
>> }
>> #else
>> void init_task_runnable_average(struct task_struct *p)
>> @@ -1548,7 +1550,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
>> *period = now - p->last_task_numa_placement;
>> } else {
>> delta = p->se.avg.runnable_avg_sum;
>> - *period = p->se.avg.runnable_avg_period;
>> + *period = p->se.avg.avg_period;
>> }
>>
>> p->last_sum_exec_runtime = runtime;
>> @@ -2294,7 +2296,8 @@ static u32 __compute_runnable_contrib(u64 n)
>> */
>> static __always_inline int __update_entity_runnable_avg(u64 now,
>> struct sched_avg *sa,
>> - int runnable)
>> + int runnable,
>> + int running)
>> {
>> u64 delta, periods;
>> u32 runnable_contrib;
>> @@ -2320,7 +2323,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>> sa->last_runnable_update = now;
>>
>> /* delta_w is the amount already accumulated against our next period */
>> - delta_w = sa->runnable_avg_period % 1024;
>> + delta_w = sa->avg_period % 1024;
>> if (delta + delta_w >= 1024) {
>> /* period roll-over */
>> decayed = 1;
>> @@ -2333,7 +2336,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>> delta_w = 1024 - delta_w;
>> if (runnable)
>> sa->runnable_avg_sum += delta_w;
>> - sa->runnable_avg_period += delta_w;
>> + if (running)
>> + sa->running_avg_sum += delta_w;
>> + sa->avg_period += delta_w;
>>
>> delta -= delta_w;
>>
>> @@ -2343,20 +2348,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>>
>> sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
>> periods + 1);
>> - sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
>> + sa->running_avg_sum = decay_load(sa->running_avg_sum,
>> + periods + 1);
>> + sa->avg_period = decay_load(sa->avg_period,
>> periods + 1);
>>
>> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
>> runnable_contrib = __compute_runnable_contrib(periods);
>> if (runnable)
>> sa->runnable_avg_sum += runnable_contrib;
>> - sa->runnable_avg_period += runnable_contrib;
>> + if (running)
>> + sa->running_avg_sum += runnable_contrib;
>> + sa->avg_period += runnable_contrib;
>> }
>>
>> /* Remainder of delta accrued against u_0` */
>> if (runnable)
>> sa->runnable_avg_sum += delta;
>> - sa->runnable_avg_period += delta;
>> + if (running)
>> + sa->running_avg_sum += delta;
>> + sa->avg_period += delta;
>>
>> return decayed;
>> }
>> @@ -2372,6 +2383,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
>> return 0;
>>
>> se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
>> + se->avg.utilization_avg_contrib =
>> + decay_load(se->avg.utilization_avg_contrib, decays);
>> se->avg.decay_count = 0;
>>
>> return decays;
>> @@ -2408,7 +2421,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>>
>> /* The fraction of a cpu used by this cfs_rq */
>> contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
>> - sa->runnable_avg_period + 1);
>> + sa->avg_period + 1);
>> contrib -= cfs_rq->tg_runnable_contrib;
>>
>> if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
>> @@ -2461,7 +2474,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>>
>> static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>> {
>> - __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
>> + __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
>> + runnable);
>> __update_tg_runnable_avg(&rq->avg, &rq->cfs);
>> }
>> #else /* CONFIG_FAIR_GROUP_SCHED */
>> @@ -2479,7 +2493,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
>>
>> /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>> contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
>> - contrib /= (se->avg.runnable_avg_period + 1);
>> + contrib /= (se->avg.avg_period + 1);
>> se->avg.load_avg_contrib = scale_load(contrib);
>> }
>>
>> @@ -2498,6 +2512,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
>> return se->avg.load_avg_contrib - old_contrib;
>> }
>>
>> +
>> +static inline void __update_task_entity_utilization(struct sched_entity *se)
>> +{
>> + u32 contrib;
>> +
>> + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>> + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
>> + contrib /= (se->avg.avg_period + 1);
>> + se->avg.utilization_avg_contrib = scale_load(contrib);
>> +}
>> +
>> +static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
>> +{
>> + long old_contrib = se->avg.utilization_avg_contrib;
>> +
>> + if (entity_is_task(se))
>> + __update_task_entity_utilization(se);
>> +
>> + return se->avg.utilization_avg_contrib - old_contrib;
>> +}
>> +
>> static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
>> long load_contrib)
>> {
>> @@ -2514,7 +2549,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
>> int update_cfs_rq)
>> {
>> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> - long contrib_delta;
>> + long contrib_delta, utilization_delta;
>> u64 now;
>>
>> /*
>> @@ -2526,18 +2561,22 @@ static inline void update_entity_load_avg(struct sched_entity *se,
>> else
>> now = cfs_rq_clock_task(group_cfs_rq(se));
>>
>> - if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
>> + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
>> + cfs_rq->curr == se))
>> return;
>>
>> contrib_delta = __update_entity_load_avg_contrib(se);
>> + utilization_delta = __update_entity_utilization_avg_contrib(se);
>>
>> if (!update_cfs_rq)
>> return;
>>
>> - if (se->on_rq)
>> + if (se->on_rq) {
>> cfs_rq->runnable_load_avg += contrib_delta;
>> - else
>> + cfs_rq->utilization_load_avg += utilization_delta;
>> + } else {
>> subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
>> + }
>> }
>>
>> /*
>> @@ -2612,6 +2651,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>> }
>>
>> cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
>> + cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
>> /* we force update consideration on load-balancer moves */
>> update_cfs_rq_blocked_load(cfs_rq, !wakeup);
>> }
>> @@ -2630,6 +2670,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
>> update_cfs_rq_blocked_load(cfs_rq, !sleep);
>>
>> cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
>> + cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
>> if (sleep) {
>> cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
>> se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
>> @@ -2967,6 +3008,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> */
>> update_stats_wait_end(cfs_rq, se);
>> __dequeue_entity(cfs_rq, se);
>> + update_entity_load_avg(se, 1);
>> }
>>
>> update_stats_curr_start(cfs_rq, se);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 6130251..c34bd11 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -343,8 +343,14 @@ struct cfs_rq {
>> * Under CFS, load is tracked on a per-entity basis and aggregated up.
>> * This allows for the description of both thread and group usage (in
>> * the FAIR_GROUP_SCHED case).
>> + * runnable_load_avg is the sum of the load_avg_contrib of the
>> + * sched_entities on the rq.
>> + * blocked_load_avg is similar to runnable_load_avg except that its
>> + * the blocked sched_entities on the rq.
>
> This is slightly misleading. Blocked entities are not on the rq. Maybe
> say: "... except it is the blocked sched_entities associated with the
> rq." ?
On 21 November 2014 at 13:35, Morten Rasmussen <[email protected]> wrote:
> s/usage/utilization/ in subject.
>
> On Mon, Nov 03, 2014 at 04:54:39PM +0000, Vincent Guittot wrote:
>> From: Morten Rasmussen <[email protected]>
>>
>> Adds usage contribution tracking for group entities. Unlike
>
> s/usage contribution/utilization/
>
>> se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
>> entities is the sum of se->avg.utilization_avg_contrib for all entities on the
>> group runqueue. It is _not_ influenced in any way by the task group
>> h_load. Hence it is representing the actual cpu usage of the group, not
>
> s/usage/utilization/
>
>> its intended load contribution which may differ significantly from the
>> utilization on lightly utilized systems.
>>
>> cc: Paul Turner <[email protected]>
>> cc: Ben Segall <[email protected]>
>>
>> Signed-off-by: Morten Rasmussen <[email protected]>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/debug.c | 2 ++
>> kernel/sched/fair.c | 3 +++
>> 2 files changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index f384452..efb47ed 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>> P(se->load.weight);
>> #ifdef CONFIG_SMP
>> P(se->avg.runnable_avg_sum);
>> + P(se->avg.running_avg_sum);
>> P(se->avg.avg_period);
>> P(se->avg.load_avg_contrib);
>> + P(se->avg.utilization_avg_contrib);
>> P(se->avg.decay_count);
>> #endif
>> #undef PN
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3a91ae6..a171e1b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2529,6 +2529,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
>>
>> if (entity_is_task(se))
>> __update_task_entity_utilization(se);
>> + else
>> + se->avg.utilization_avg_contrib =
>> + group_cfs_rq(se)->utilization_load_avg;
>>
>> return se->avg.utilization_avg_contrib - old_contrib;
>> }
>
> What happened to the update of se->avg.utilization_avg_contrib in
> __synchronize_entity_decay()? It seems to have disappeared after v7.
It was moved to patch 1, which is its right place
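The group-entity rule quoted above can be illustrated with a toy aggregation (hypothetical structs and helper, not the kernel's): a group entity's utilization contribution is the plain, un-weighted sum of the contributions of the entities on its runqueue, with no h_load involved.

```c
#include <assert.h>

/* Toy model (not the kernel's sched_entity): only the field we need. */
struct toy_entity {
	unsigned long utilization_avg_contrib;
};

/* A group's contribution is the plain sum over its runqueue entities. */
static unsigned long group_utilization(const struct toy_entity *child, int nr)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < nr; i++)
		sum += child[i].utilization_avg_contrib;
	return sum;
}
```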
On 21 November 2014 at 13:35, Morten Rasmussen <[email protected]> wrote:
> On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
[snip]
>> The average running time of RT tasks is used to estimate the remaining compute
>> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>>
>> total = sched_avg_period() + delta;
>>
>> - if (unlikely(total < avg)) {
>> - /* Ensures that capacity won't end up being negative */
>> - available = 0;
>> - } else {
>> - available = total - avg;
>> - }
>> + used = div_u64(avg, total);
>
> I haven't looked through all the details of the rt avg tracking, but if
> 'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
> work. Is it guaranteed that total > 0 so we don't get division by zero?
static inline u64 sched_avg_period(void)
{
return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
}
>
> It does get slightly more complicated if we want to figure out the
> available capacity at the current frequency (current < max) later. Say,
> rt eats 25% of the compute capacity, but the current frequency is only
> 50%. In that case we get:
>
> curr_avail_capacity = (arch_scale_cpu_capacity() *
> (arch_scale_freq_capacity() - (SCHED_SCALE_CAPACITY - scale_rt_capacity())))
> >> SCHED_CAPACITY_SHIFT
It doesn't have to be that complicated; you simply need to do:
curr_avail_capacity for CFS = (capacity_of(CPU) *
arch_scale_freq_capacity()) >> SCHED_CAPACITY_SHIFT
capacity_of(CPU) = 600 is the max available capacity for CFS tasks
once we have removed the 25% of capacity that is used by RT tasks
arch_scale_freq_capacity = 512 because we currently run at 50% of max freq
so curr_avail_capacity for CFS = 300
Vincent
>
> With numbers assuming arch_scale_cpu_capacity() = 800:
>
> curr_avail_capacity = 800 * (512 - (1024 - 758)) >> 10 = 200
>
> Which isn't actually that bad. Anyway, it isn't needed until we start
> invovling energy models.
>
>>
>> - if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
>> - total = SCHED_CAPACITY_SCALE;
>> + if (likely(used < SCHED_CAPACITY_SCALE))
>> + return SCHED_CAPACITY_SCALE - used;
>>
>> - total >>= SCHED_CAPACITY_SHIFT;
>> -
>> - return div_u64(available, total);
>> + return 1;
>> }
>>
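The simplified arithmetic in the diff above can be sketched outside the kernel (hypothetical helper; made-up numbers; rt_avg is assumed to already carry the arch_scale_freq_capacity factor from the invariance patch, so avg/total lands in [0..SCHED_CAPACITY_SCALE]):

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Sketch of the new scale_rt_capacity() arithmetic: rt_avg already
 * includes the x1024 frequency factor, so the plain division gives a
 * used ratio in [0..1024]; the remainder is left for CFS, never 0. */
static unsigned long toy_scale_rt_capacity(uint64_t rt_avg, uint64_t total)
{
	uint64_t used = rt_avg / total;		/* div_u64(avg, total) */

	if (used < SCHED_CAPACITY_SCALE)
		return SCHED_CAPACITY_SCALE - used;
	return 1;	/* capacity never drops to zero */
}
```

With RT running 250 ms out of a 1000 ms window at full frequency, used = 256 and 768 capacity units remain for CFS.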
On 21 November 2014 at 13:37, Morten Rasmussen <[email protected]> wrote:
> On Mon, Nov 03, 2014 at 04:54:45PM +0000, Vincent Guittot wrote:
[snip]
>> */
>> if (prefer_sibling && sds->local &&
>> - sds->local_stat.group_has_free_capacity)
>> - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
>> + group_has_capacity(env, &sds->local_stat) &&
>> + (sgs->sum_nr_running > 1)) {
>> + sgs->group_no_capacity = 1;
>> + sgs->group_type = group_overloaded;
>> + }
>
> I'm still a bit confused about SD_PREFER_SIBLING. What is the flag
> supposed to do and why?
The goal is to spread tasks across the group even if the latter is
not overloaded. At SMT level, the goal is to have 1 task per core
before 1 task per HW thread
>
> It looks like a weak load balancing bias attempting to consolidate tasks
> on domains with spare capacity. It does so by marking non-local groups
> as overloaded regardless of their actual load if the local group has
> spare capacity. Correct?
>
> In patch 9 this behaviour is enabled for SMT level domains, which
> implies that tasks will be consolidated in MC groups, that is we prefer
> multiple tasks on sibling cpus (hw threads). I must be missing something
> essential. I was convinced that we wanted to avoid using sibling cpus on
> SMT systems as much as possible?
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On 21 November 2014 at 13:37, Morten Rasmussen <[email protected]> wrote:
> On Mon, Nov 03, 2014 at 04:54:47PM +0000, Vincent Guittot wrote:
>>
>> + /*
>> + * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
>> + * It's worth migrating the task if the src_cpu's capacity is reduced
>> + * because of other sched_class or IRQs whereas capacity stays
>> + * available on dst_cpu.
>> + */
>> + if ((env->idle != CPU_NOT_IDLE) &&
>> + (env->src_rq->cfs.h_nr_running == 1)) {
>> + unsigned long src_eff_capacity, dst_eff_capacity;
>> +
>> + dst_eff_capacity = 100;
>> + dst_eff_capacity *= capacity_of(env->dst_cpu);
>> + dst_eff_capacity *= capacity_orig_of(env->src_cpu);
>> +
>> + src_eff_capacity = sd->imbalance_pct;
>> + src_eff_capacity *= capacity_of(env->src_cpu);
>> + src_eff_capacity *= capacity_orig_of(env->dst_cpu);
>
> Do we need to scale by capacity_orig? Shouldn't the absolute capacity be
> better?
>
> if (capacity_of(env->src) * sd->imbalance_pct < capacity_of(env->dst) *
> 100) ?
We don't want to compare the absolute capacity of the CPUs but the
reduction of their capacity, because we want to choose the CPU
that is less used by RT tasks or IRQs.
Regards,
Vincent
>
> Isn't it the absolute available capacity that matters? For SMP
> capacity_orig is the same and cancels out and doesn't change anything.
> For big.LITTLE we would rather have the task run on a big where rt/irq
> eats 30% than a little cpu where rt/irq eats 5%, assuming big capacity
> is much bigger than little capacity so the absolute available capacity
> (~cycles/time) is larger on the big cpu.
On Mon, Nov 24, 2014 at 02:04:21PM +0000, Vincent Guittot wrote:
> On 21 November 2014 at 13:35, Morten Rasmussen <[email protected]> wrote:
> > What happened to the update of se->avg.utilization_avg_contrib in
> > __synchronize_entity_decay()? It seems to have disappeared after v7.
>
> It was moved to patch 1, which is its right place
Ahh... I missed that :)
On Mon, Nov 24, 2014 at 02:24:00PM +0000, Vincent Guittot wrote:
> On 21 November 2014 at 13:35, Morten Rasmussen <[email protected]> wrote:
> > On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
>
> [snip]
>
> >> The average running time of RT tasks is used to estimate the remaining compute
> >> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
> >>
> >> total = sched_avg_period() + delta;
> >>
> >> - if (unlikely(total < avg)) {
> >> - /* Ensures that capacity won't end up being negative */
> >> - available = 0;
> >> - } else {
> >> - available = total - avg;
> >> - }
> >> + used = div_u64(avg, total);
> >
> > I haven't looked through all the details of the rt avg tracking, but if
> > 'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
> > work. Is it guaranteed that total > 0 so we don't get division by zero?
>
> static inline u64 sched_avg_period(void)
> {
> return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
> }
>
I see.
> >
> > It does get a slightly more complicated if we want to figure out the
> > available capacity at the current frequency (current < max) later. Say,
> > rt eats 25% of the compute capacity, but the current frequency is only
> > 50%. In that case get:
> >
> > curr_avail_capacity = (arch_scale_cpu_capacity() *
> > (arch_scale_freq_capacity() - (SCHED_SCALE_CAPACITY - scale_rt_capacity())))
> > >> SCHED_CAPACITY_SHIFT
>
> It doesn't have to be that complicated; you simply need to do:
> curr_avail_capacity for CFS = (capacity_of(CPU) *
> arch_scale_freq_capacity()) >> SCHED_CAPACITY_SHIFT
>
> capacity_of(CPU) = 600 is the max available capacity for CFS tasks
> once we have removed the 25% of capacity that is used by RT tasks
> arch_scale_freq_capacity = 512 because we currently run at 50% of max freq
>
> so curr_avail_capacity for CFS = 300
I don't think that is correct. It is at least not what I had in mind.
capacity_orig_of(cpu) = 800, we run at 50% frequency which means:
curr_capacity = capacity_orig_of(cpu) * arch_scale_freq_capacity()
>> SCHED_CAPACITY_SHIFT
= 400
So the total capacity at the current frequency (50%) is 400, without
considering RT. scale_rt_capacity() is frequency invariant, so it takes
away capacity_orig_of(cpu) - capacity_of(cpu) = 200 worth of capacity
for RT. We need to subtract that from the current capacity to get the
available capacity at the current frequency.
curr_available_capacity = curr_capacity - (capacity_orig_of(cpu) -
capacity_of(cpu)) = 200
In other words, 800 is the max capacity, we are currently running at 50%
frequency, which gives us 400. RT takes away 25% of 800
(frequency-invariant) from the 400, which leaves us with 200 left for
CFS tasks at the current frequency.
In your calculations you subtract the RT load before computing the
current capacity using arch_scale_freq_capacity(), where I think it
should be done after. You find the amount of spare capacity you would have
at the maximum frequency when RT has been subtracted and then scale the
result by frequency which means indirectly scaling the RT load
contribution again (the rt avg has already been scaled). So instead of
taking away 200 of the 400 (current capacity @ 50% frequency), it only
takes away 100 which isn't right.
scale_rt_capacity() is frequency-invariant, so if the RT load is 50% and
the frequency is 50%, there are no spare cycles left.
curr_avail_capacity should be 0. But using your expression above you
would get capacity_of(cpu) = 400 after removing RT,
arch_scale_freq_capacity = 512 and you get 200. I don't think that is
right.
Morten
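Morten's corrected calculation can be sketched as follows (hypothetical helper with plain integers instead of the kernel types): scale capacity_orig by the current frequency first, then subtract the frequency-invariant RT loss (capacity_orig - capacity_of).

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10

/* Available capacity for CFS at the current frequency: frequency-scale
 * the original capacity, then subtract the (frequency-invariant) part
 * consumed by RT, clamping at zero. */
static long curr_avail_capacity(long cap_orig, long cap_of, long freq_cap)
{
	long curr = (cap_orig * freq_cap) >> SCHED_CAPACITY_SHIFT;
	long rt_loss = cap_orig - cap_of;	/* frequency-invariant */
	long avail = curr - rt_loss;

	return avail > 0 ? avail : 0;
}
```

With capacity_orig = 800, capacity_of = 600 (25% RT) and a 50% frequency (512/1024) this yields 200, and with 50% RT at 50% frequency it yields 0, matching the reasoning above.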
On Mon, Nov 24, 2014 at 02:41:28PM +0000, Vincent Guittot wrote:
> On 21 November 2014 at 13:37, Morten Rasmussen <[email protected]> wrote:
> > On Mon, Nov 03, 2014 at 04:54:45PM +0000, Vincent Guittot wrote:
>
> [snip]
>
> >> */
> >> if (prefer_sibling && sds->local &&
> >> - sds->local_stat.group_has_free_capacity)
> >> - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
> >> + group_has_capacity(env, &sds->local_stat) &&
> >> + (sgs->sum_nr_running > 1)) {
> >> + sgs->group_no_capacity = 1;
> >> + sgs->group_type = group_overloaded;
> >> + }
> >
> > I'm still a bit confused about SD_PREFER_SIBLING. What is the flag
> > supposed to do and why?
>
> The goal is to spread tasks across the group even if the latter is
> not overloaded. At SMT level, the goal is to have 1 task per core
> before 1 task per HW thread
That makes more sense and is in line with how I understand SMT
scheduling. So we try to have at least one task per group, where each
group is a domain with SD_PREFER_SIBLING.
Anyway, you don't change the prefer-sibling behaviour in this patch set.
I was just wondering how it would work for SMT balancing.
Thanks,
Morten
On Mon, Nov 24, 2014 at 02:45:45PM +0000, Vincent Guittot wrote:
> On 21 November 2014 at 13:37, Morten Rasmussen <[email protected]> wrote:
> > On Mon, Nov 03, 2014 at 04:54:47PM +0000, Vincent Guittot wrote:
>
> >>
> >> + /*
> >> + * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
> >> + * It's worth migrating the task if the src_cpu's capacity is reduced
> >> + * because of other sched_class or IRQs whereas capacity stays
> >> + * available on dst_cpu.
> >> + */
> >> + if ((env->idle != CPU_NOT_IDLE) &&
> >> + (env->src_rq->cfs.h_nr_running == 1)) {
> >> + unsigned long src_eff_capacity, dst_eff_capacity;
> >> +
> >> + dst_eff_capacity = 100;
> >> + dst_eff_capacity *= capacity_of(env->dst_cpu);
> >> + dst_eff_capacity *= capacity_orig_of(env->src_cpu);
> >> +
> >> + src_eff_capacity = sd->imbalance_pct;
> >> + src_eff_capacity *= capacity_of(env->src_cpu);
> >> + src_eff_capacity *= capacity_orig_of(env->dst_cpu);
> >
> > Do we need to scale by capacity_orig? Shouldn't the absolute capacity be
> > better?
> >
> > if (capacity_of(env->src) * sd->imbalance_pct < capacity_of(env->dst) *
> > 100) ?
>
> We don't want to compare the absolute capacity of the CPUs but the
> reduction of their capacity, because we want to choose the CPU
> that is less used by RT tasks or IRQs.
But least relative RT load doesn't necessarily mean most available
compute capacity. 50% RT use of a capacity_orig = 1000 (capacity_of(cpu) =
500, eff_capacity = 50%) gives better CFS throughput than 20% RT use of
a capacity_orig = 500 (capacity_of(cpu) = 400, eff_capacity = 80%). Why pick
the cpu with less throughput?
Morten
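The two policies under debate can be compared with a toy sketch (hypothetical helpers; imbalance_pct taken as 100 for simplicity): the patch's relative test compares effective-capacity ratios, while Morten's alternative compares absolute remaining capacity.

```c
#include <assert.h>
#include <stdbool.h>

/* Patch's relative test with imbalance_pct = 100:
 * migrate if cap(src)/orig(src) < cap(dst)/orig(dst),
 * cross-multiplied to avoid division. */
static bool relative_prefers_dst(long src_cap, long src_orig,
				 long dst_cap, long dst_orig)
{
	return src_cap * dst_orig < dst_cap * src_orig;
}

/* Morten's alternative: migrate only if dst has more absolute capacity. */
static bool absolute_prefers_dst(long src_cap, long dst_cap)
{
	return src_cap < dst_cap;
}
```

With a big src cpu (capacity_orig = 1000, capacity_of = 500, 50% effective) and a little dst cpu (capacity_orig = 500, capacity_of = 400, 80% effective), the relative test migrates to the little cpu even though the big one has more absolute capacity left; the absolute test does not.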
>
> Regards,
> Vincent
> >
> > Isn't it the absolute available capacity that matters? For SMP
> > capacity_orig is the same and cancels out and doesn't change anything.
> > For big.LITTLE we would rather have the task run on a big where rt/irq
> > eats 30% than a little cpu where rq/irq eats 5%, assuming big capacity
> > is much bigger than little capacity so the absolute available capacity
> > (~cycles/time) is larger on the big cpu.
On Mon, Nov 24, 2014 at 02:04:15PM +0000, Vincent Guittot wrote:
> On 21 November 2014 at 13:34, Morten Rasmussen <[email protected]> wrote:
> > Should the subject mention that the patch adds utilization tracking?
> > Maybe: 'sched: Add utilization tracking' ?
> >
> >
> > On Mon, Nov 03, 2014 at 04:54:38PM +0000, Vincent Guittot wrote:
> >> Add new statistics which reflect the average time a task is running on the CPU
> >> and the sum of these running time of the tasks on a runqueue. The latter is
> >> named utilization_load_avg.
> >>
> >> This patch is based on the usage metric that was proposed in the 1st
> >> versions of the per-entity load tracking patchset by Paul Turner
> >
> > Should we do ourselves and anybody else who feels like going through the
> > pain of understanding the load-tracking code a favor and drop the use of
> > the term 'usage' and use 'utilization' everywhere instead? 'usage' isn't
> > clearly defined anywhere.
> >
> > Referring to 'usage' here in the reference to original patch is fine,
> > but I suggest that we remove it from the code and comment on subsequent
> > patches unless there is a very good reason to keep it.
>
> As discussed with Peter, we use usage when the task's utilization has
> been scaled by the capacity.
>
> IIRC from one of our discussion, dietmar should prepare a patchset to
> rename and aligned variables and field.
I read this as: we stick with 'usage' while utilization is scaled by
capacity, and potentially drop it again when adding uarch invariance so
the scaling goes away, or rename it before then if we find a better
name for it.
Hi Vincent,
On 11/4/14, 12:54 AM, Vincent Guittot wrote:
> The average running time of RT tasks is used to estimate the remaining compute
> capacity for CFS tasks. This remaining capacity is the original capacity scaled
> down by a factor (aka scale_rt_capacity). This estimation of available capacity
> must also be invariant with frequency scaling.
>
> A frequency scaling factor is applied on the running time of the RT tasks for
> computing scale_rt_capacity.
>
> In sched_rt_avg_update, we scale the RT execution time like below:
> rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
>
> Then, scale_rt_capacity can be summarized by:
> scale_rt_capacity = SCHED_CAPACITY_SCALE -
> ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
Regarding 'period', aka 'total', in scale_rt_capacity(): why is it
sched_avg_period() + delta instead of just sched_avg_period()?
Regards,
Wanpeng Li
>
> We can optimize by removing right and left shift in the computation of rq->rt_avg
> and scale_rt_capacity
>
> The call to arch_scale_frequency_capacity in the rt scheduling path might be
> a concern for RT folks because I'm not sure whether we can rely on
> arch_scale_freq_capacity to be short and efficient ?
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 17 +++++------------
> kernel/sched/sched.h | 4 +++-
> 2 files changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a5039da..b37c27b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5785,7 +5785,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> static unsigned long scale_rt_capacity(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> - u64 total, available, age_stamp, avg;
> + u64 total, used, age_stamp, avg;
> s64 delta;
>
> /*
> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>
> total = sched_avg_period() + delta;
>
> - if (unlikely(total < avg)) {
> - /* Ensures that capacity won't end up being negative */
> - available = 0;
> - } else {
> - available = total - avg;
> - }
> + used = div_u64(avg, total);
>
> - if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
> - total = SCHED_CAPACITY_SCALE;
> + if (likely(used < SCHED_CAPACITY_SCALE))
> + return SCHED_CAPACITY_SCALE - used;
>
> - total >>= SCHED_CAPACITY_SHIFT;
> -
> - return div_u64(available, total);
> + return 1;
> }
>
> static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c34bd11..fc5b152 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1312,9 +1312,11 @@ static inline int hrtick_enabled(struct rq *rq)
>
> #ifdef CONFIG_SMP
> extern void sched_avg_update(struct rq *rq);
> +extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> +
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> {
> - rq->rt_avg += rt_delta;
> + rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
> sched_avg_update(rq);
> }
> #else
On 24 November 2014 at 18:05, Morten Rasmussen <[email protected]> wrote:
> On Mon, Nov 24, 2014 at 02:24:00PM +0000, Vincent Guittot wrote:
>> On 21 November 2014 at 13:35, Morten Rasmussen <[email protected]> wrote:
>> > On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
>>
>> [snip]
>>
>> >> The average running time of RT tasks is used to estimate the remaining compute
>> >> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>> >>
>> >> total = sched_avg_period() + delta;
>> >>
>> >> - if (unlikely(total < avg)) {
>> >> - /* Ensures that capacity won't end up being negative */
>> >> - available = 0;
>> >> - } else {
>> >> - available = total - avg;
>> >> - }
>> >> + used = div_u64(avg, total);
>> >
>> > I haven't looked through all the details of the rt avg tracking, but if
>> > 'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
>> > work. Is it guaranteed that total > 0 so we don't get division by zero?
>>
>> static inline u64 sched_avg_period(void)
>> {
>> return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
>> }
>>
>
> I see.
>
>> >
>> > It does get a slightly more complicated if we want to figure out the
>> > available capacity at the current frequency (current < max) later. Say,
>> > rt eats 25% of the compute capacity, but the current frequency is only
>> > 50%. In that case get:
>> >
>> > curr_avail_capacity = (arch_scale_cpu_capacity() *
>> > (arch_scale_freq_capacity() - (SCHED_SCALE_CAPACITY - scale_rt_capacity())))
>> > >> SCHED_CAPACITY_SHIFT
>>
>> It doesn't have to be that complicated; you simply need to do:
>> curr_avail_capacity for CFS = (capacity_of(CPU) *
>> arch_scale_freq_capacity()) >> SCHED_CAPACITY_SHIFT
>>
>> capacity_of(CPU) = 600 is the max available capacity for CFS tasks
>> once we have removed the 25% of capacity that is used by RT tasks
>> arch_scale_freq_capacity = 512 because we currently run at 50% of max freq
>>
>> so curr_avail_capacity for CFS = 300
>
> I don't think that is correct. It is at least not what I had in mind.
>
> capacity_orig_of(cpu) = 800, we run at 50% frequency which means:
>
> curr_capacity = capacity_orig_of(cpu) * arch_scale_freq_capacity()
> >> SCHED_CAPACITY_SHIFT
> = 400
>
> So the total capacity at the current frequency (50%) is 400, without
> considering RT. scale_rt_capacity() is frequency invariant, so it takes
> away capacity_orig_of(cpu) - capacity_of(cpu) = 200 worth of capacity
> for RT. We need to subtract that from the current capacity to get the
> available capacity at the current frequency.
>
> curr_available_capacity = curr_capacity - (capacity_orig_of(cpu) -
> capacity_of(cpu)) = 200
You're right, this one looks good to me too.
>
> In other words, 800 is the max capacity, we are currently running at 50%
> frequency, which gives us 400. RT takes away 25% of 800
> (frequency-invariant) from the 400, which leaves us with 200 left for
> CFS tasks at the current frequency.
>
> In your calculations you subtract the RT load before computing the
> current capacity using arch_scale_freq_capacity(), where I think it
> should be done after. You find the amount spare capacity you would have
> at the maximum frequency when RT has been subtracted and then scale the
> result by frequency which means indirectly scaling the RT load
> contribution again (the rt avg has already been scaled). So instead of
> taking away 200 of the 400 (current capacity @ 50% frequency), it only
> takes away 100 which isn't right.
>
> scale_rt_capacity() is frequency-invariant, so if the RT load is 50% and
> the frequency is 50%, there are no spare cycles left.
> curr_avail_capacity should be 0. But using your expression above you
> would get capacity_of(cpu) = 400 after removing RT,
> arch_scale_freq_capacity = 512 and you get 200. I don't think that is
> right.
>
> Morten
On 25 November 2014 at 03:24, Wanpeng Li <[email protected]> wrote:
> Hi Vincent,
> On 11/4/14, 12:54 AM, Vincent Guittot wrote:
>>
>> The average running time of RT tasks is used to estimate the remaining
>> compute
>> capacity for CFS tasks. This remaining capacity is the original capacity
>> scaled
>> down by a factor (aka scale_rt_capacity). This estimation of available
>> capacity
>> must also be invariant with frequency scaling.
>>
>> A frequency scaling factor is applied on the running time of the RT tasks
>> for
>> computing scale_rt_capacity.
>>
>> In sched_rt_avg_update, we scale the RT execution time like below:
>> rq->rt_avg += rt_delta * arch_scale_freq_capacity() >>
>> SCHED_CAPACITY_SHIFT
>>
>> Then, scale_rt_capacity can be summarized by:
>> scale_rt_capacity = SCHED_CAPACITY_SCALE -
>> ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
>
>
> The 'period' aka 'total' in the scale_rt_capacity(), why it is
> sched_avg_period() + delta instead of sched_avg_period()?
The default value of sched_avg_period is 1 sec, which is "long", so we
take into account the time consumed by RT tasks in the ongoing period.
>
> Regards,
> Wanpeng Li
>
>
>>
>> We can optimize by removing right and left shift in the computation of
>> rq->rt_avg
>> and scale_rt_capacity
>>
>> The call to arch_scale_frequency_capacity in the rt scheduling path might
>> be
>> a concern for RT folks because I'm not sure whether we can rely on
>> arch_scale_freq_capacity to be short and efficient ?
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 17 +++++------------
>> kernel/sched/sched.h | 4 +++-
>> 2 files changed, 8 insertions(+), 13 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a5039da..b37c27b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5785,7 +5785,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct
>> sched_domain *sd, int cpu)
>> static unsigned long scale_rt_capacity(int cpu)
>> {
>> struct rq *rq = cpu_rq(cpu);
>> - u64 total, available, age_stamp, avg;
>> + u64 total, used, age_stamp, avg;
>> s64 delta;
>> /*
>> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>> total = sched_avg_period() + delta;
>> - if (unlikely(total < avg)) {
>> - /* Ensures that capacity won't end up being negative */
>> - available = 0;
>> - } else {
>> - available = total - avg;
>> - }
>> + used = div_u64(avg, total);
>> - if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
>> - total = SCHED_CAPACITY_SCALE;
>> + if (likely(used < SCHED_CAPACITY_SCALE))
>> + return SCHED_CAPACITY_SCALE - used;
>> - total >>= SCHED_CAPACITY_SHIFT;
>> -
>> - return div_u64(available, total);
>> + return 1;
>> }
>> static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index c34bd11..fc5b152 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1312,9 +1312,11 @@ static inline int hrtick_enabled(struct rq *rq)
>> #ifdef CONFIG_SMP
>> extern void sched_avg_update(struct rq *rq);
>> +extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd,
>> int cpu);
>> +
>> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>> {
>> - rq->rt_avg += rt_delta;
>> + rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL,
>> cpu_of(rq));
>> sched_avg_update(rq);
>> }
>> #else
>
>
Hi Vincent,
On 11/25/14, 9:52 PM, Vincent Guittot wrote:
> On 25 November 2014 at 03:24, Wanpeng Li <[email protected]> wrote:
>> Hi Vincent,
>> On 11/4/14, 12:54 AM, Vincent Guittot wrote:
>>> The average running time of RT tasks is used to estimate the remaining
>>> compute
>>> capacity for CFS tasks. This remaining capacity is the original capacity
>>> scaled
>>> down by a factor (aka scale_rt_capacity). This estimation of available
>>> capacity
>>> must also be invariant with frequency scaling.
>>>
>>> A frequency scaling factor is applied on the running time of the RT tasks
>>> for
>>> computing scale_rt_capacity.
>>>
>>> In sched_rt_avg_update, we scale the RT execution time like below:
>>> rq->rt_avg += rt_delta * arch_scale_freq_capacity() >>
>>> SCHED_CAPACITY_SHIFT
>>>
>>> Then, scale_rt_capacity can be summarized by:
>>> scale_rt_capacity = SCHED_CAPACITY_SCALE -
>>> ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
>>
>> The 'period' aka 'total' in the scale_rt_capacity(), why it is
>> sched_avg_period() + delta instead of sched_avg_period()?
> The default value of sched_avg_period is 1sec which is "long" so we
> take into account the time consumed by RT tasks in the ongoing period
> .
Do you mean 'sched_avg_period() + delta' should be replaced by 'delta'
since sched_avg_period() is "long"?
Regards,
Wanpeng Li
>
>> Regards,
>> Wanpeng Li
>>
>>
>>> We can optimize by removing right and left shift in the computation of
>>> rq->rt_avg
>>> and scale_rt_capacity
>>>
>>> The call to arch_scale_frequency_capacity in the rt scheduling path might
>>> be
>>> a concern for RT folks because I'm not sure whether we can rely on
>>> arch_scale_freq_capacity to be short and efficient ?
>>>
>>> Signed-off-by: Vincent Guittot <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 17 +++++------------
>>> kernel/sched/sched.h | 4 +++-
>>> 2 files changed, 8 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index a5039da..b37c27b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5785,7 +5785,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct
>>> sched_domain *sd, int cpu)
>>> static unsigned long scale_rt_capacity(int cpu)
>>> {
>>> struct rq *rq = cpu_rq(cpu);
>>> - u64 total, available, age_stamp, avg;
>>> + u64 total, used, age_stamp, avg;
>>> s64 delta;
>>> /*
>>> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>>> total = sched_avg_period() + delta;
>>> - if (unlikely(total < avg)) {
>>> - /* Ensures that capacity won't end up being negative */
>>> - available = 0;
>>> - } else {
>>> - available = total - avg;
>>> - }
>>> + used = div_u64(avg, total);
>>> - if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
>>> - total = SCHED_CAPACITY_SCALE;
>>> + if (likely(used < SCHED_CAPACITY_SCALE))
>>> + return SCHED_CAPACITY_SCALE - used;
>>> - total >>= SCHED_CAPACITY_SHIFT;
>>> -
>>> - return div_u64(available, total);
>>> + return 1;
>>> }
>>> static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index c34bd11..fc5b152 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1312,9 +1312,11 @@ static inline int hrtick_enabled(struct rq *rq)
>>> #ifdef CONFIG_SMP
>>> extern void sched_avg_update(struct rq *rq);
>>> +extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd,
>>> int cpu);
>>> +
>>> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>>> {
>>> - rq->rt_avg += rt_delta;
>>> + rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL,
>>> cpu_of(rq));
>>> sched_avg_update(rq);
>>> }
>>> #else
>>
On 26 November 2014 at 06:18, Wanpeng Li <[email protected]> wrote:
> Hi Vincent,
>
> On 11/25/14, 9:52 PM, Vincent Guittot wrote:
>>
>> On 25 November 2014 at 03:24, Wanpeng Li <[email protected]> wrote:
>>>
>>> Hi Vincent,
>>> On 11/4/14, 12:54 AM, Vincent Guittot wrote:
>>>>
>>>> The average running time of RT tasks is used to estimate the remaining
>>>> compute
>>>> capacity for CFS tasks. This remaining capacity is the original capacity
>>>> scaled
>>>> down by a factor (aka scale_rt_capacity). This estimation of available
>>>> capacity
>>>> must also be invariant with frequency scaling.
>>>>
>>>> A frequency scaling factor is applied on the running time of the RT
>>>> tasks
>>>> for
>>>> computing scale_rt_capacity.
>>>>
>>>> In sched_rt_avg_update, we scale the RT execution time like below:
>>>> rq->rt_avg += rt_delta * arch_scale_freq_capacity() >>
>>>> SCHED_CAPACITY_SHIFT
>>>>
>>>> Then, scale_rt_capacity can be summarized by:
>>>> scale_rt_capacity = SCHED_CAPACITY_SCALE -
>>>> ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
>>>
>>>
>>> The 'period' aka 'total' in the scale_rt_capacity(), why it is
>>> sched_avg_period() + delta instead of sched_avg_period()?
>>
>> The default value of sched_avg_period is 1sec which is "long" so we
>> take into account the time consumed by RT tasks in the ongoing period
>> .
>
>
> Do you mean 'sched_avg_period() + delta' should be replaced by 'delta' since
> sched_avg_period() is "long"?
No, both are useful. sched_avg_period keeps the value more stable over
time. Without it, the estimate would be too reactive (especially at the
beginning of a period).
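The damping effect can be sketched with made-up numbers (hypothetical helper; ms units; the x1024 factor stands in for the frequency scaling already applied to rt_avg):

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE	1024ULL

/* 'used' ratio as in scale_rt_capacity(): the half-period baseline
 * (sched_avg_period()) is added to delta, so a burst of RT time early
 * in a window (small delta) cannot make the CPU look fully consumed. */
static uint64_t used_ratio(uint64_t rt_ms, uint64_t delta_ms,
			   uint64_t baseline_ms)
{
	return (rt_ms * SCHED_CAPACITY_SCALE) / (baseline_ms + delta_ms);
}
```

For example, 10 ms of RT time measured only 10 ms into a window gives used = 20 with the 500 ms baseline, but a saturated 1024 without it.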
>
> Regards,
> Wanpeng Li
>
>
>>
>>> Regards,
>>> Wanpeng Li
>>>
>>>
>>>> We can optimize by removing right and left shift in the computation of
>>>> rq->rt_avg
>>>> and scale_rt_capacity
>>>>
>>>> The call to arch_scale_frequency_capacity in the rt scheduling path
>>>> might
>>>> be
>>>> a concern for RT folks because I'm not sure whether we can rely on
>>>> arch_scale_freq_capacity to be short and efficient ?
>>>>
>>>> Signed-off-by: Vincent Guittot <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 17 +++++------------
>>>> kernel/sched/sched.h | 4 +++-
>>>> 2 files changed, 8 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index a5039da..b37c27b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5785,7 +5785,7 @@ unsigned long __weak
>>>> arch_scale_cpu_capacity(struct
>>>> sched_domain *sd, int cpu)
>>>> static unsigned long scale_rt_capacity(int cpu)
>>>> {
>>>> struct rq *rq = cpu_rq(cpu);
>>>> - u64 total, available, age_stamp, avg;
>>>> + u64 total, used, age_stamp, avg;
>>>> s64 delta;
>>>> /*
>>>> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>>>> total = sched_avg_period() + delta;
>>>> - if (unlikely(total < avg)) {
>>>> - /* Ensures that capacity won't end up being negative */
>>>> - available = 0;
>>>> - } else {
>>>> - available = total - avg;
>>>> - }
>>>> + used = div_u64(avg, total);
>>>> - if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
>>>> - total = SCHED_CAPACITY_SCALE;
>>>> + if (likely(used < SCHED_CAPACITY_SCALE))
>>>> + return SCHED_CAPACITY_SCALE - used;
>>>> - total >>= SCHED_CAPACITY_SHIFT;
>>>> -
>>>> - return div_u64(available, total);
>>>> + return 1;
>>>> }
>>>> static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>> index c34bd11..fc5b152 100644
>>>> --- a/kernel/sched/sched.h
>>>> +++ b/kernel/sched/sched.h
>>>> @@ -1312,9 +1312,11 @@ static inline int hrtick_enabled(struct rq *rq)
>>>> #ifdef CONFIG_SMP
>>>> extern void sched_avg_update(struct rq *rq);
>>>> +extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
>>>> +
>>>> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>>>> {
>>>> - rq->rt_avg += rt_delta;
>>>> + rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
>>>> sched_avg_update(rq);
>>>> }
>>>> #else
>>>
>>>
>
On Tue, Nov 25, 2014 at 01:48:02PM +0000, Vincent Guittot wrote:
> On 24 November 2014 at 18:05, Morten Rasmussen <[email protected]> wrote:
> > On Mon, Nov 24, 2014 at 02:24:00PM +0000, Vincent Guittot wrote:
> >> On 21 November 2014 at 13:35, Morten Rasmussen <[email protected]> wrote:
> >> > On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
> >>
> >> [snip]
> >>
> >> >> The average running time of RT tasks is used to estimate the remaining compute
> >> >> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
> >> >>
> >> >> total = sched_avg_period() + delta;
> >> >>
> >> >> - if (unlikely(total < avg)) {
> >> >> - /* Ensures that capacity won't end up being negative */
> >> >> - available = 0;
> >> >> - } else {
> >> >> - available = total - avg;
> >> >> - }
> >> >> + used = div_u64(avg, total);
> >> >
> >> > I haven't looked through all the details of the rt avg tracking, but if
> >> > 'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
> >> > work. Is it guaranteed that total > 0 so we don't get division by zero?
> >>
> >> static inline u64 sched_avg_period(void)
> >> {
> >> return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
> >> }
> >>
> >
> > I see.
> >
> >> >
> >> > It does get slightly more complicated if we want to figure out the
> >> > available capacity at the current frequency (current < max) later. Say,
> >> > rt eats 25% of the compute capacity, but the current frequency is only
> >> > 50%. In that case we get:
> >> >
> >> > curr_avail_capacity = (arch_scale_cpu_capacity() *
> >> > (arch_scale_freq_capacity() - (SCHED_CAPACITY_SCALE - scale_rt_capacity())))
> >> > >> SCHED_CAPACITY_SHIFT
> >>
> >> You don't have to make it so complicated; you simply need to do:
> >> curr_avail_capacity for CFS = (capacity_of(CPU) *
> >> arch_scale_freq_capacity()) >> SCHED_CAPACITY_SHIFT
> >>
> >> capacity_of(CPU) = 600 is the max available capacity for CFS tasks
> >> once we have removed the 25% of capacity that is used by RT tasks
> >> arch_scale_freq_capacity = 512 because we currently run at 50% of max freq
> >>
> >> so curr_avail_capacity for CFS = 300
> >
> > I don't think that is correct. It is at least not what I had in mind.
> >
> > capacity_orig_of(cpu) = 800, we run at 50% frequency which means:
> >
> > curr_capacity = capacity_orig_of(cpu) * arch_scale_freq_capacity()
> > >> SCHED_CAPACITY_SHIFT
> > = 400
> >
> > So the total capacity at the current frequency (50%) is 400, without
> > considering RT. scale_rt_capacity() is frequency invariant, so it takes
> > away capacity_orig_of(cpu) - capacity_of(cpu) = 200 worth of capacity
> > for RT. We need to subtract that from the current capacity to get the
> > available capacity at the current frequency.
> >
> > curr_available_capacity = curr_capacity - (capacity_orig_of(cpu) -
> > capacity_of(cpu)) = 200
>
> you're right, this one looks good to me too
Okay, thanks for confirming. It doesn't affect this patch set anyway, I just
wanted to be sure that I got all the scaling factors right :)
On 21/11/14 12:35, Morten Rasmussen wrote:
> On Mon, Nov 03, 2014 at 04:54:41PM +0000, Vincent Guittot wrote:
>> From: Morten Rasmussen <[email protected]>
>>
Could we rename this patch to 'sched: Make usage tracking frequency
scale-invariant'?
The reason is, since we scale sched_avg::running_avg_sum according to
the cpu frequency, not only sched_entity but also rq is affected (both
contain a sched_avg), and cfs_rq too, since cfs_rq::utilization_load_avg is
\sum (sched_entity::sched_avg::utilization_avg_contrib).
And so far we're only considering frequency scale-invariance and not cpu
(or uArch) scale-invariance for the utilization/usage related per-entity
load tracking bits.
>> Apply frequency scale-invariance correction factor to usage tracking.
>
> s/usage/utilization/
>
>> Each segment of the running_load_avg geometric series is now scaled by the
>> current frequency so the utilization_avg_contrib of each entity will be
>
> s/entity/sched_entity/
>
>> invariant with frequency scaling. As a result, utilization_load_avg which is
>> the sum of utilization_avg_contrib, becomes invariant too. So the usage level
>
> s/sum of utilization_avg_contrib/sum of sched_entity
> utilization_avg_contribs/
>
> s/usage/utilization/
>
>> that is returned by get_cpu_usage, stays relative to the max frequency as the
>> cpu_capacity which is is compared against.
>
> The last bit doesn't parse right. Maybe it is better to drop
> the reference to get_cpu_usage, which hasn't been defined yet, and rewrite
> the thing to:
>
> Apply frequency scale-invariance correction factor to utilization
> tracking. Each segment of the running_load_avg geometric series is now
s/running_load_avg/running_avg_sum
Couldn't find running_load_avg in the current code.
What about using sched_avg::running_avg_sum instead of running_avg_sum?
At least for me it is very complicated to distinguish the different
per-entity load tracking data elements on the various scheduler data
structures (like rq, se, cfs_rq) when I read per-entity load tracking
related patch header.
> scaled by the current frequency so the utilization_avg_contrib of each
> entity will be invariant with frequency scaling. As a result,
> utilization_load_avg which is the sum of sched_entity
> utilization_avg_contribs becomes invariant too and is now relative to
> the max utilization at the max frequency (=cpu_capacity).
max frequency of _the_ cpu. It does not have to be the max frequency of
the system though.
The difference between the max frequencies of the cpus in the system is
part of the cpu-invariant scaling ('cpu' equals 'uarch' plus 'max
frequency in the system'). See the 'clock-frequency' dtb property of the
cpu node in arch/arm/kernel/topology.c. Frequency scale invariance is
about DVFS, and this is per cpu.
>
> I think we should add:
>
> arch_scale_freq_capacity() is reintroduced to provide the frequency
> compensation scaling factor.
>
>> Then, we want the keep the load tracking values in a 32bits type, which implies
>
> s/Then, we/We/
>